August 1, 2018

Using Scrapy to parse FARA

Quickstart

Run the spider as:

$ :> fprincipals.jl :> s.log; scrapy crawl active_foreign_principals -o fprincipals.jl --logfile=s.log

This will truncate the output and the log file. JSON-lines output format is always appended to.

Setup

Install scrapy for Python3.

$ virtualenv -p python3 py3
$ source py3/bin/activate
$ pip install Scrapy ipython

Now, create a crawler project

$ mkdir ~/src/fara
$ cd ~/src/fara
$ scrapy startproject fara
$ cd fara
$ scrapy genspider active_foreign_principals efile.fara.gov

Scrapy will generate skeleton code:

items.py is where you can define the models for the data that will be extracted by the spider
spiders/active_foreign_principals.py is the spider

Terminology

A spider is the bit of code that deals with the raw HTML. Its function is to parse and extract relevant data.

Parsing FARA

The data is within a table with class apexir_WORKSHEET_DATA. ActiveForeignPrincipalSpider.parse() is the entry point of parsing, the code is reasonably self-explanatory.

To get the exhibit_url, another page has to be scraped. The item is merged by passing it though the meta keyword argument of response.follow() which is just a shortcut to scrapy.Request().

FormRequest.from_response() cannot be used for pagination because:

the returned HTML is not complete, it is just a div.
the form action is munged somewhere

Using Firebug’s network tab the form data can be inspected and the required parameters can be found on the page. The code in ActiveForeignPrincipalSpider._next_page_data() extracts the form data required for the first and subsequent pages.

~~HTTP caching has been enabled in fara/settings.py.~~

Miscellaneous

To convert a json object/dict into a form suitable for posting, use the following incantation:

$ jq 'to_entries|map(.key + "=" + .value)|join("&")' np.json

If it is a dict, remember to change the 's to "s before using jq.

Powered by Hugo & Kiss.