Archive
Extract the significant parts of a web page
Problem
From a web page you want to extract the significant parts: title, author, date of publication, body, etc.
Solution
Mercury Web Parser does exactly this. It’s free. After registration you get an API key. Their web service returns a structured JSON response. I tried it with my previous post:
curl -H "x-api-key: <my_api_key>" "https://mercury.postlight.com/parser?url=https://ubuntuincident.wordpress.com/2017/10/01/re-run-a-command-in-the-terminal-every-x-seconds/" | python3 -m json.tool
Output:
{
"title": "Re-run a command in the terminal every X\u00a0seconds",
"author": "Jabba Laci",
"date_published": "2017-09-30T22:02:19.000Z",
"dek": null,
"lead_image_url": "https://secure.gravatar.com/blavatar/db6c398dc21dc8e8f82d7bc83130c0ab?s=200&ts=1506809436",
"content": "<div class=\"content\"> <p><strong>Problem</strong><br>\nYou want re-execute a command in the terminal every X seconds. For instance, you copy a lot of big files to a partition and you want to monitor the size of the free space on that partition.</p>\n<p><strong>Solution</strong><br>\nA naive and manual approach to the problem mentioned above is to execute the commands “<code>clear; df -h</code>” regularly, say every 2 seconds.</p>\n<p>A better way is to use the command “<code>watch</code>“. Usage:</p>\n<pre> watch -n 2 df -h </pre>\n<p>That is: execute “<code>df -h</code>” every two seconds. <code>watch</code> will also clear the screen and print the result to the top. You can quit with <code>Ctrl + c</code>.</p>\n<p>Tip from <a href=\"https://askubuntu.com/questions/430382/repeat-a-command-every-x-interval-of-time-in-terminal\">here</a>.</p> </div>",
"next_page_url": null,
"url": "https://ubuntuincident.wordpress.com/2017/10/01/re-run-a-command-in-the-terminal-every-x-seconds/",
"domain": "ubuntuincident.wordpress.com",
"excerpt": "Problem You want re-execute a command in the terminal every X seconds. For instance, you copy a lot of big files to a partition and you want to monitor the size of the free space on that partition.\u2026",
"word_count": 108,
"direction": "ltr",
"total_pages": 1,
"rendered_pages": 1
}
Pretty impressive.
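Once you have the JSON response, picking out the interesting fields in a script is straightforward. Here is a minimal Python sketch, assuming the response has the same keys as the output above. The function name `summarize` is made up for illustration, and the sample string is a shortened copy of the output shown earlier.

```python
import json
from datetime import datetime, timezone


def summarize(response_text):
    """Pick a few fields out of a parser response (keys as in the sample output)."""
    data = json.loads(response_text)
    published = data.get("date_published")
    if published is not None:
        # The sample uses the form 2017-09-30T22:02:19.000Z (UTC).
        published = datetime.strptime(
            published, "%Y-%m-%dT%H:%M:%S.%fZ"
        ).replace(tzinfo=timezone.utc)
    return {
        "title": data.get("title"),
        "author": data.get("author"),
        "published": published,
        "word_count": data.get("word_count"),
    }


# Shortened copy of the response shown above (raw string keeps \u00a0 for json).
sample = r'''{
    "title": "Re-run a command in the terminal every X\u00a0seconds",
    "author": "Jabba Laci",
    "date_published": "2017-09-30T22:02:19.000Z",
    "word_count": 108
}'''

summary = summarize(sample)
print(summary["author"], summary["published"].date(), summary["word_count"])
```

Using `.get()` means a missing key comes back as `None` instead of raising, which seems reasonable since the service returns `null` for fields it cannot find (see "dek" above).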
JSON Path
I wrote a command-line program that outputs the full path of every key / value in a JSON file.
Example
$ ./json_path.py sample.json
root.a => 1
root.b.c => 2
root.b.friends[0].best => Alice
root.b.friends[1].second => Bob
root.b.friends[2][0] => 5
root.b.friends[2][1] => 6
root.b.friends[2][2] => 7
root.b.friends[3][0]. 1
root.b.friends[3][1].two => 2
More information at the project’s github page.
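The idea behind such a tool can be sketched in a few lines of Python. This is not the code of json_path.py, just an illustration of a recursive walk that builds a path for every leaf value. The function name `iter_paths` and the small sample document are made up for this example.

```python
import json


def iter_paths(node, prefix="root"):
    """Yield (path, value) pairs for every leaf value in a parsed JSON document."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_paths(value, f"{prefix}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_paths(value, f"{prefix}[{index}]")
    else:
        # Strings, numbers, booleans and null are leaves.
        yield prefix, node


# A small made-up document, not the sample.json used above.
document = json.loads('{"a": 1, "b": {"c": 2, "friends": [{"best": "Alice"}, [5, 6]]}}')

for path, value in iter_paths(document):
    print(f"{path} => {value}")
```

Note that empty dictionaries and empty lists produce no output in this sketch, since they contain no leaf values.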
Detailed Twitter info in JSON: an undocumented feature
Problem
Using a script, I wanted to figure out the number of my followers on Twitter. Here is my (mostly abandoned) Twitter page: https://twitter.com/szathmar . I didn’t want to use any API since I didn’t want to register for an API key, so I took the easy way: let’s scrape the necessary data out :) Digging in the HTML code, I found the number of followers, but I also found a hidden treasure!
Solution
The hidden treasure is a long JSON string that contains all kinds of information about a Twitter user.
The screenshot shows just an extract; the JSON string is much longer. Fine, let’s get it!
#!/usr/bin/env python3
# coding: utf-8

import json
import readline
import sys
from pprint import pprint

import requests
from bs4 import BeautifulSoup


def main():
    url = input("Full twitter URL: ")
    html = requests.get(url).text
    soup = BeautifulSoup(html, "lxml")
    tag = soup.find('input', {'class': 'json-data'})
    j = tag['value']
    d = json.loads(j)
    json_out = json.dumps(d, indent=4)
    print(json_out)
    # followers = d['profile_user']['followers_count']
    # print(followers)

##############################################################################

if __name__ == "__main__":
    main()
If you want the number of followers for instance, then uncomment the last two lines.
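If you don’t have BeautifulSoup and lxml installed, the same extraction can be sketched with the standard library only. This is a variation on the script above, not a replacement for it. The class name `JsonDataFinder`, the function `extract_json_data` and the tiny HTML snippet are made up for illustration; the snippet only mimics the `input` tag with the `json-data` class described above.

```python
import json
from html.parser import HTMLParser


class JsonDataFinder(HTMLParser):
    """Remember the value of the first <input> tag whose class includes 'json-data'."""

    def __init__(self):
        super().__init__()
        self.payload = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "input" and "json-data" in classes and self.payload is None:
            # HTMLParser has already unescaped entities such as &quot;
            self.payload = attributes.get("value")


def extract_json_data(html):
    finder = JsonDataFinder()
    finder.feed(html)
    if finder.payload is None:
        raise ValueError("no input tag with class 'json-data' found")
    return json.loads(finder.payload)


# Made-up HTML that imitates the hidden input tag.
sample_html = (
    '<html><body><input type="hidden" class="json-data" '
    'value="{&quot;profile_user&quot;: {&quot;followers_count&quot;: 70}}">'
    "</body></html>"
)

data = extract_json_data(sample_html)
print(data["profile_user"]["followers_count"])
```

To use it on a real page, pass in the HTML text you downloaded instead of `sample_html`.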
Thank you Twitter! It’s really nice of you to provide all these data in JSON!
Sample
The JSON that I could extract from my page is 743 lines long! Here is an extract of it:
...
"profile_image_url": "http://pbs.twimg.com/profile_images/459783802395430912/vcMT0CGX_normal.png",
"business_profile_state": "none",
"url": null,
"profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme6/bg.gif",
"screen_name": "szathmar",
"is_translator": false,
"friends_count": 123,
"followers_count": 70,
"profile_text_color": "333333",
"profile_link_color": "FF3300",
"translator_type": "none",
"profile_background_color": "709397",
...
Show my position on the map
The site http://ipinfo.io/ gives you back not only your IP address, but your geolocation too. Example (with a fake IP):
$ curl http://ipinfo.io/
{
"ip": "734.675.653.542",
"hostname": "No Hostname",
"city": "Debrecen",
"region": "Debrecen",
"country": "HU",
"loc": "47.5333,21.6333",
"org": "...",
"postal": "..."
}
Let’s visualize my location:
<img src="https://maps.googleapis.com/maps/api/staticmap?center=47.5333,21.6333&zoom=9&size=480x240&sensor=false">
Debrecen, Hungary, center of the world :)
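The map URL above can also be built in a script from the "loc" field. A minimal Python sketch, assuming the JSON has the shape shown above; the helper name `build_map_url` and the shortened sample string are made up for illustration. The URL parameters are the ones used in the img tag above.

```python
import json
from urllib.parse import urlencode

STATIC_MAP_ENDPOINT = "https://maps.googleapis.com/maps/api/staticmap"


def build_map_url(ipinfo_json, zoom=9, size="480x240"):
    """Build a static map URL centered on the 'loc' field of the JSON text."""
    info = json.loads(ipinfo_json)
    params = {
        "center": info["loc"],  # "latitude,longitude"
        "zoom": zoom,
        "size": size,
        "sensor": "false",
    }
    # safe="," keeps the comma in the coordinates readable instead of %2C
    return STATIC_MAP_ENDPOINT + "?" + urlencode(params, safe=",")


# Shortened copy of the output shown above.
sample = '{"city": "Debrecen", "country": "HU", "loc": "47.5333,21.6333"}'
print(build_map_url(sample))
```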
Firefox: restore your lost tabs
Problem
Over the last 1.5-2 years, I collected 700+ tabs in my Firefox :) Maybe this summer I will have some time to sort them out. However, today when I switched my computer on, all my tabs were gone and I got a clean Firefox instance with one tab only. Hmm… I had a similar problem once, and back then I installed an add-on called “Session Manager”. I had set it to offer the list of previous sessions upon restart, but it didn’t do anything! Damn, how do I get my tab collection back?
Solution
In the .mozilla directory there is a file called sessionstore.js that stores — among others — the opened tabs. However, this file was very small, my previous tabs were clearly not in it. Thank God there was a backup copy of this file next to it called sessionstore.bak. It was a big file and the timestamp of the file indicated that it was created 2 days ago when everything was OK with my tabs.
So, how to extract the old tabs from sessionstore.bak?
This is a JSON file, but it’s not pretty printed. I suggest copying this file to somewhere else where you can experiment with it. First, let’s make it readable:
$ python -m json.tool sessionstore.bak > session.json
Now you can open session.json with a text editor. You will find lines with a “url” key, but the number of these rows is huge. I had 731 tabs (that I lost) but this file contained 6500+ URLs. As I noticed, it also contains the URLs of closed tabs. How to extract the URLs of the opened tabs only?
Again, Python came to my rescue. After analyzing the structure of this JSON file, I could extract the tab URLs the following way:
$ python # version 2.7
>>> import json
>>> f = open('session.json') # input file
>>> g = open('tabs.txt', 'w') # output file
>>> d = json.load(f)
>>> tabs = d["windows"][0]["tabs"]
>>> cnt = 0
>>> for t in tabs:
...     print >>g, t["entries"][0]["url"]
...     cnt += 1
...
>>> cnt
731 # Yeah! All of them are here!
>>> g.close()
>>> f.close()
The URLs of the lost tabs are now in the tabs.txt file.
I didn’t make a script of it but feel free to do it. From now on I will make regular backups of my opened tabs with the URL Lister add-on.
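For those who want a script, here is a minimal Python 3 sketch of the interactive session above. It assumes the same file structure ("windows", then "tabs", then "entries"). Unlike the session above, it walks through every window, not just the first one, and it skips tabs that have no entries. The function names are made up for illustration.

```python
import json


def open_tab_urls(session):
    """Return the URL of the first entry of every open tab, window by window."""
    urls = []
    for window in session.get("windows", []):
        for tab in window.get("tabs", []):
            entries = tab.get("entries", [])
            if entries:  # skip tabs without any entry
                urls.append(entries[0]["url"])
    return urls


def save_tab_urls(session_path, output_path):
    """Read a session file, write one URL per line, return the number of URLs."""
    with open(session_path, encoding="utf-8") as f:
        session = json.load(f)
    urls = open_tab_urls(session)
    with open(output_path, "w", encoding="utf-8") as g:
        for url in urls:
            g.write(url + "\n")
    return len(urls)


# A tiny made-up session with the same shape as the real file.
demo_session = {
    "windows": [
        {"tabs": [{"entries": [{"url": "https://example.com/a"}]}, {"entries": []}]},
        {"tabs": [{"entries": [{"url": "https://example.com/b"}]}]},
    ]
}
print(open_tab_urls(demo_session))
```

With the files from above, the call would be `save_tab_urls("session.json", "tabs.txt")`.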
Google’s URL shortener
Problem
You want to shorten a long URL from the command line / from a script.
Solution
There are lots of URL shorteners. With the Google URL shortener you can do it like this:
curl https://www.googleapis.com/urlshortener/v1/url -H 'Content-Type: application/json' -d '{"longUrl": "https://ubuntuincident.wordpress.com"}'
Sample output:
{
"kind": "urlshortener#url",
"id": "http://goo.gl/Zeigx",
"longUrl": "https://ubuntuincident.wordpress.com/"
}
Exercise
Let’s do it in Python using the requests module:
import requests
import json
url = "https://www.googleapis.com/urlshortener/v1/url"
data = {"longUrl": "https://ubuntuincident.wordpress.com"}
headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}
r = requests.post(url, data=json.dumps(data), headers=headers)
print(r.text)
print('Short URL:', r.json()["id"])
Links
- Shorten a long URL @developers.google.com
jq — a lightweight and flexible command-line JSON processor
“jq is like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.
jq is written in portable C, and it has zero runtime dependencies.
jq can mangle the data format that you have into the one that you want with very little effort…” (link)
Check out the tutorial here.
You can also use jq to pretty print an ugly JSON file:
cat ugly_one_liner.json | jq '.'
GitHub: contact watchers
Problem
You want to contact all the watchers of a project. For instance, you want to notify them about some radical changes.
Solution
Simply click on the “Eye” icon that shows the number of watchers. It will list the watchers of the project.
Or, you can get the list of watchers through an API:
curl "http://github.com/api/v2/json/repos/show/USERNAME/REPONAME/watchers?full=1" | python -mjson.tool
More info about the Repositories API: here. General information about the APIs: here.
Pretty print a JSON file
This post is based on the following SO threads: one; two.
Problem
You have an unreadable JSON file from which you want to extract some data… How to prettify it, i.e. how to make it human readable?
Solution
There are web-based and command-line solutions. As an extra, we show you how to do it in Vim too.
Web-based prettifiers
- http://chris.photobooks.com/json/default.htm (it can show you the path of a tag too)
- http://pretty-print.org/
- http://jsonviewer.stack.hu/
- http://www.shell-tools.net/index.php?op=json_format
- http://jsonlint.com/
- http://jsonformatter.curiousconcept.com/ (formatter and validator)
Command-line beautifiers
curl -s http://www.reddit.com/r/nsfw/.json | python -mjson.tool
sudo apt-get install edit-json; prettify_json myfile.json
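If you need the same thing inside a Python script, the json module can do it directly. A minimal sketch; the function name `prettify` is made up for illustration, and the 4-space indent is simply a choice.

```python
import json


def prettify(text, indent=4):
    """Parse a JSON string and return it re-serialized with indentation."""
    return json.dumps(json.loads(text), indent=indent)


ugly = '{"a":1,"b":[1,2]}'
print(prettify(ugly))
```

Invalid input raises `json.JSONDecodeError`, so this doubles as a quick validity check.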
Vim :)
This tip is based on this post: Editing json files in vim.
In my .vimrc file I had to add the following lines:
" pretty-print JSON files
autocmd BufRead,BufNewFile *.json set filetype=json
" json.vim is here: http://www.vim.org/scripts/script.php?script_id=1945
autocmd Syntax json sou ~/.vim/syntax/json.vim
" json_reformat is part of yajl: http://lloyd.github.com/yajl/
autocmd FileType json set equalprg=json_reformat
When you open a .json file, it will be colored using the json.vim syntax file. Selecting some text and pressing the “=” key will reformat the selection using json_reformat.
Firefox add-on
There are several JSON visualizer add-ons for Firefox, e.g. JSONView.
