Quickstart
Run the spider as:
$ :> fprincipals.jl :> s.log; scrapy crawl active_foreign_principals -o fprincipals.jl --logfile=s.log
This truncates both the output file and the log file first; Scrapy always appends to JSON-lines output rather than overwriting it.
Setup
Install Scrapy for Python 3.
$ virtualenv -p python3 py3
$ source py3/bin/activate
$ pip install Scrapy ipython
Now, create a crawler project:
$ mkdir ~/src/fara
$ cd ~/src/fara
$ scrapy startproject fara
$ cd fara
$ scrapy genspider active_foreign_principals efile.fara.gov
Scrapy will generate skeleton code:
- items.py is where you define the models for the data that will be extracted by the spider
- spiders/active_foreign_principals.py is the spider itself
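For orientation, an item definition in items.py might look like the sketch below; the field names are assumptions based on the columns mentioned in this README, not necessarily the ones the project actually uses.

```python
import scrapy


class ActiveForeignPrincipalItem(scrapy.Item):
    # Hypothetical fields; adjust to whatever the spider actually extracts.
    foreign_principal = scrapy.Field()
    country = scrapy.Field()
    address = scrapy.Field()
    registrant = scrapy.Field()
    date = scrapy.Field()
    url = scrapy.Field()
    exhibit_url = scrapy.Field()
```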
Terminology
A spider is the bit of code that deals with the raw HTML. Its function is to parse and extract relevant data.
Parsing FARA
The data is within a table with class apexir_WORKSHEET_DATA. ActiveForeignPrincipalSpider.parse() is the entry point for parsing; the code is reasonably self-explanatory.
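As a rough sketch of what that looks like (the start URL, selectors, and column mapping below are illustrative, not the project's exact code):

```python
import scrapy


class WorksheetSketchSpider(scrapy.Spider):
    """Illustrative only: pulling rows out of the APEX worksheet table."""
    name = 'worksheet_sketch'
    # Placeholder start URL; the real spider targets the active foreign
    # principals report on efile.fara.gov.
    start_urls = ['https://efile.fara.gov/']

    def parse(self, response):
        # The report data sits in a table with class apexir_WORKSHEET_DATA.
        for row in response.css('table.apexir_WORKSHEET_DATA tr'):
            cells = row.css('td::text').getall()
            if not cells:
                continue  # header and decoration rows have no td text
            # Hypothetical column mapping; the real column order differs.
            yield {'foreign_principal': cells[0].strip()}
```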
To get the exhibit_url, another page has to be scraped. The item is merged by passing it through the meta keyword argument of response.follow(), which is just a shortcut for scrapy.Request().
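A minimal sketch of that hand-off, assuming a callback named parse_exhibit and a meta key named 'item' (both hypothetical, as are the selectors):

```python
import scrapy


class ExhibitSketchSpider(scrapy.Spider):
    """Illustrative only: carrying a partially built item through meta."""
    name = 'exhibit_sketch'
    start_urls = ['https://efile.fara.gov/']  # placeholder

    def parse(self, response):
        item = {'foreign_principal': 'Example'}  # partially built item
        detail_url = response.css('td a::attr(href)').get()  # hypothetical selector
        if detail_url:
            # response.follow() resolves relative URLs and builds the Request;
            # the item rides along in meta so the next callback can merge into it.
            yield response.follow(detail_url, callback=self.parse_exhibit,
                                  meta={'item': item})

    def parse_exhibit(self, response):
        item = response.meta['item']
        # Hypothetical extraction; the real code pulls the exhibit document link.
        item['exhibit_url'] = response.css('a::attr(href)').get()
        yield item
```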
FormRequest.from_response() cannot be used for pagination because:
- the returned HTML is not complete; it is just a div.
- the form action is munged somewhere
Using Firebug’s network tab, the form data can be inspected and the required parameters can be found on the page. The code in ActiveForeignPrincipalSpider._next_page_data() extracts the form data required for the first and subsequent pages, as sketched below.
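A hedged sketch of how that form data might be turned into a pagination request; the action URL and the hidden-input harvesting below are assumptions, and the real _next_page_data() is not reproduced here:

```python
import scrapy


class PaginationSketchSpider(scrapy.Spider):
    """Illustrative only: posting extracted form data to fetch the next page."""
    name = 'pagination_sketch'
    start_urls = ['https://efile.fara.gov/']  # placeholder

    def parse(self, response):
        # ... yield the items found on the current page here ...

        formdata = self._next_page_data(response)
        if formdata:
            # FormRequest.from_response() is unusable here (see above), so the
            # request is built by hand against the form's real action URL.
            yield scrapy.FormRequest(
                response.urljoin('wwv_flow.show'),  # hypothetical APEX action
                formdata=formdata,
                callback=self.parse,
            )

    def _next_page_data(self, response):
        # Stand-in for the project's helper: collect the hidden inputs the
        # APEX widget expects, plus whatever paging parameters are needed.
        data = {}
        for hidden in response.css('input[type="hidden"]'):
            name = hidden.css('::attr(name)').get()
            if name:
                data[name] = hidden.css('::attr(value)').get() or ''
        return data
```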
HTTP caching has been enabled in fara/settings.py.
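Enabling the cache amounts to setting Scrapy's built-in HTTPCACHE_* options; the extra knobs shown here are optional and their values are just the usual defaults:

```python
# fara/settings.py
HTTPCACHE_ENABLED = True
# Optional knobs (values shown are Scrapy's usual defaults):
HTTPCACHE_EXPIRATION_SECS = 0   # 0 = cached responses never expire
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```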
Miscellaneous
To convert a JSON object (or Python dict) into a form-encoded string suitable for POSTing, use the following incantation:
$ jq 'to_entries|map(.key + "=" + .value)|join("&")' np.json
If it is a Python dict literal, remember to change the 's to "s (so that it is valid JSON) before feeding it to jq.
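Equivalently, in Python (a sketch, not part of the project): urlencode() produces the same key=value&key=value shape, with percent-encoding thrown in, and ast.literal_eval() handles a single-quoted dict directly, so no quote fixing is needed.

```python
import ast
import json
from urllib.parse import urlencode

with open('np.json') as f:
    text = f.read()

try:
    data = json.loads(text)          # proper JSON, double quotes
except json.JSONDecodeError:
    data = ast.literal_eval(text)    # Python dict literal, single quotes

# key1=value1&key2=value2, with percent-encoding applied to the values
print(urlencode(data))
```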