Extract data from an unfriendly website with jQuery and jq

2018-06-14

It has probably happened to you already: you arrive on a website that contains a lot of relevant information, but the presentation is so disgusting that you can’t easily export it. In these scenarios, I usually use my browser's developer tools along with jQuery (which is loaded on almost every website in 2018) to extract the information from the DOM.

Note: if the website does not load jQuery, you can add it to the page <a href="https://stackoverflow.com/questions/7474354/include-jquery-in-the-javascript-console">via the console</a>.

For example, suppose you want the titles of all the blog posts on the front page of this site. You could run this kind of code in the console to extract them:

jQuery('.post-header h1').map((i, v) => v.innerText);
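
If the page does not load jQuery and you would rather not inject it, the same extraction can be sketched with the standard DOM API. This is only an alternative sketch, not the approach used in this post: `extractTitles` is a hypothetical helper name, and `root` is assumed to be `document` (or any element) when run in the browser console.

```javascript
// Hypothetical helper: collects the text of every element matching `selector`
// under `root`, using only the standard DOM API (no jQuery required).
function extractTitles(root, selector) {
  // querySelectorAll returns a NodeList; Array.from converts it to a real
  // array and applies the mapping function to each element along the way.
  return Array.from(root.querySelectorAll(selector), (el) => el.innerText);
}

// In the browser console (assumption: same selector as the jQuery example):
// extractTitles(document, '.post-header h1');
```

Because the helper returns a plain array of strings, its result does not carry the extra jQuery properties.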

The result will look like this (full output: here):

{
  "0": "Dealing with multipart forms with akka-http",
  "1": "Kafka on Docker for Mac",
  "2": "Follow-up: Automatic Releases to Maven Central with Travis and SBT",
  "3": "Variable Python Decorators",
  "4": "Play 2.5 - Streaming requests",
  "5": "Let's donate",
  "6": "Building an online store",
  "7": "Cassandra flush to disk delay and docker images",
  "8": "Welcome",
  "length": 9,
  "prevObject": {
    "0": {},
    "1": {},
    "2": {},
    // load of useless information ...
  },
  "context": {
    // load of useless information ...
  }
}
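
The object above is "array-like": numeric keys plus a `length`, with some jQuery bookkeeping properties mixed in. As a side note, here is a minimal sketch of how such an object could be turned into a plain array directly in JavaScript. The `jqueryLikeResult` object is a made-up stand-in, trimmed to two titles. With a real jQuery object, calling `.get()` on the result of `.map()` should give the same kind of plain array.

```javascript
// Stand-in for the jQuery result shown above (assumption: trimmed to two titles).
const jqueryLikeResult = {
  0: "Kafka on Docker for Mac",
  1: "Welcome",
  length: 2,
  prevObject: {}, // bookkeeping properties that we do not want in the output
  context: {},
};

// Array.from reads indices 0..length-1 and ignores every other property.
const titles = Array.from(jqueryLikeResult);

console.log(JSON.stringify(titles, null, 2));
```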

With the help of jq, you can parse and transform JSON quite easily. In this particular case, suppose we want a JSON array containing only the titles, as strings, like so:

[
    "Dealing with multipart forms with akka-http",
    "Kafka on Docker for Mac",
    "Follow-up: Automatic Releases to Maven Central with Travis and SBT",
    "Variable Python Decorators",
    "Play 2.5 - Streaming requests",
    "Let's donate",
    "Building an online store",
    "Cassandra flush to disk delay and docker images",
    "Welcome"
]

To obtain this result, pipe the JSON object into jq like so:

cat jq-blob.json | jq '. | to_entries | map(select(.key|test("\\d")) | .value)'

I invite you to read more on jq, but here is what’s happening: we turn the root object into an array of key/value entries (to_entries), then, for each entry, we keep the value if its key contains a digit ("\\d") and discard the entry otherwise.
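
To make the stages of that jq filter more concrete, here is a rough JavaScript equivalent. It is a sketch under stated assumptions, not what jq does internally: `blob` is a made-up stand-in for the contents of jq-blob.json (trimmed to two titles) and `numericValues` is a hypothetical helper name.

```javascript
// Stand-in for the contents of jq-blob.json (assumption: trimmed to two titles).
const blob = {
  "0": "Kafka on Docker for Mac",
  "1": "Welcome",
  "length": 2,
  "prevObject": { "0": {}, "1": {} },
  "context": {},
};

// Hypothetical helper mirroring:
//   to_entries | map(select(.key|test("\\d")) | .value)
function numericValues(obj) {
  return Object.entries(obj)             // like to_entries, but yields [key, value] pairs
    .filter(([key]) => /\d/.test(key))   // like select(.key | test("\\d")): key contains a digit
    .map(([, value]) => value);          // like .value: keep only the value
}

console.log(JSON.stringify(numericValues(blob), null, 2));
```

Keys such as "length", "prevObject" and "context" contain no digit, so their entries are dropped, and only the titles remain.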