Extract data from an unfriendly website with jQuery and jq
You have probably been there already: you land on a website that contains a lot of information you need, but the presentation is so poor that you can’t export it easily. In these scenarios, I usually use my browser's developer tools along with jQuery (which is loaded on almost any website in 2018) to extract the information from the DOM.
Note: if the website does not have jQuery, you can add it to the page <a href="https://stackoverflow.com/questions/7474354/include-jquery-in-the-javascript-console">via the console</a>.
For example, suppose you want the titles of all the blog posts on the front page of this site. You could run this kind of code in the console to extract them:
jQuery('.post-header h1').map((i, v) => v.innerText);
The result will look like this (abridged):
{
"0": "Dealing with multipart forms with akka-http",
"1": "Kafka on Docker for Mac",
"2": "Follow-up: Automatic Releases to Maven Central with Travis and SBT",
"3": "Variable Python Decorators",
"4": "Play 2.5 - Streaming requests",
"5": "Let's donate",
"6": "Building an online store",
"7": "Cassandra flush to disk delay and docker images",
"8": "Welcome",
"length": 9,
"prevObject": {
"0": {},
"1": {},
"2": {},
// load of useless information ...
},
"context": {
// load of useless information ...
}
}
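The object above is array-like: indexed keys plus a length property, with the extra jQuery bookkeeping alongside. If the console is still open, a minimal sketch of one way to keep only the strings is shown below. The `result` object is a hand-written, shortened stand-in for the jQuery result (an assumption for illustration, not real jQuery output), so the snippet also runs outside the browser.

```javascript
// Hand-written stand-in for the jQuery result shown above (illustrative only).
const result = {
  0: "Dealing with multipart forms with akka-http",
  1: "Kafka on Docker for Mac",
  length: 2,
  prevObject: {},
  context: {},
};

// Array.from reads indices 0..length-1 and ignores every other property,
// so prevObject and context are dropped.
const titles = Array.from(result);

console.log(JSON.stringify(titles, null, 2));
```

In the browser, calling .get() on the value returned by jQuery's .map() should give the same kind of plain array. The jq approach that follows is still the one to reach for when all you have is the exported JSON.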
With the help of jq, you can parse and transform JSON quite easily. In this particular case, suppose we want a JSON array with the titles in it, as strings, like so:
[
"Dealing with multipart forms with akka-http",
"Kafka on Docker for Mac",
"Follow-up: Automatic Releases to Maven Central with Travis and SBT",
"Variable Python Decorators",
"Play 2.5 - Streaming requests",
"Let's donate",
"Building an online store",
"Cassandra flush to disk delay and docker images",
"Welcome"
]
To obtain this result, pipe the JSON object into jq like so:
cat jq-blob.json | jq '. | to_entries | map(select(.key|test("\\d")) | .value)'
I invite you to read more on jq, but here is what’s happening:
jq takes an input and passes it through a filter that produces an output.
. is the identity filter; here it selects the top-level object.
| is a filter that pipes the output of one filter into a second one.
to_entries is a filter that takes a JSON object and turns it into an array of objects that each contain a key and a value, e.g. { "a": 1, "b": 2 } will produce [ { "key": "a", "value": 1 }, { "key": "b", "value": 2 } ].
map applies a filter to all the elements of the array it receives as input.
select takes a predicate and yields its input unchanged if the predicate is true; otherwise it discards it (like filter on arrays in many programming languages).
test takes a regex, applies it to the input, and produces true if the input matches the regex.
So, in summary, we turn the root object into an array of key/value pairs; for each of these, if the key contains a digit ("\\d"), we extract the value, otherwise we discard it.
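For readers more at home in JavaScript than in jq, the same pipeline can be sketched in plain JavaScript. This is only an illustration of the logic, not how jq works internally; the helper names `toEntries` and `valuesWithMatchingKeys` and the shortened `blob` object are hypothetical and written by hand for this example.

```javascript
// Mirrors jq's to_entries: { a: 1 } becomes [ { key: "a", value: 1 } ].
function toEntries(obj) {
  return Object.entries(obj).map(([key, value]) => ({ key, value }));
}

// Mirrors map(select(.key | test(pattern)) | .value):
// keep the entries whose key matches, then keep only their values.
function valuesWithMatchingKeys(obj, pattern) {
  return toEntries(obj)
    .filter((entry) => pattern.test(entry.key))
    .map((entry) => entry.value);
}

// Hand-written, shortened stand-in for jq-blob.json (illustrative only).
const blob = {
  0: "Dealing with multipart forms with akka-http",
  1: "Kafka on Docker for Mac",
  length: 2,
  prevObject: { 0: {}, 1: {} },
  context: {},
};

const titles = valuesWithMatchingKeys(blob, /\d/);
console.log(titles);

// A key such as "h1" contains a digit, so /\d/ would keep it too.
// An anchored pattern only accepts keys made entirely of digits.
const withExtraKey = { ...blob, h1: "not a title" };
const looseTitles = valuesWithMatchingKeys(withExtraKey, /\d/);
const strictTitles = valuesWithMatchingKeys(withExtraKey, /^\d+$/);
console.log(looseTitles.length, strictTitles.length);
```

The last lines show a design choice worth keeping in mind: an unanchored "\\d" matches any key that contains a digit anywhere. For the object in this post that makes no difference, but on other data an anchored pattern such as "^\\d+$" may be the safer choice.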