Quickstart
Run the spider as:
$ :> fprincipals.jl :> s.log; scrapy crawl active_foreign_principals -o fprincipals.jl --logfile=s.log
This will truncate the output and the log file. JSON-lines output format is always appended to.
Setup
Install scrapy for Python3.
$ virtualenv -p python3 py3
$ source py3/bin/activate
$ pip install Scrapy ipython
Now, create a crawler project
$ mkdir ~/src/fara
$ cd ~/src/fara
$ scrapy startproject fara
$ cd fara
$ scrapy genspider active_foreign_principals efile.fara.gov
Scrapy will generate skeleton code:
items.py
is where you can define the models for the data that will be extracted by the spiderspiders/active_foreign_principals.py
is the spider
Terminology
A spider is the bit of code that deals with the raw HTML. Its function is to parse and extract relevant data.
Parsing FARA
The data is within a table with class apexir_WORKSHEET_DATA
. ActiveForeignPrincipalSpider.parse()
is the entry point of parsing, the code is reasonably self-explanatory.
To get the exhibit_url
, another page has to be scraped. The item
is merged by passing it though the meta
keyword argument of response.follow()
which is just a shortcut to scrapy.Request()
.
FormRequest.from_response()
cannot be used for pagination because:
- the returned HTML is not complete, it is just a div.
- the form action is munged somewhere
Using Firebug’s network tab the form data can be inspected and the required parameters can be found on the page. The code in ActiveForeignPrincipalSpider._next_page_data()
extracts the form data required for the first and subsequent pages.
HTTP caching has been enabled in fara/settings.py
.
Miscellaneous
To convert a json object/dict into a form suitable for posting, use the following incantation:
$ jq 'to_entries|map(.key + "=" + .value)|join("&")' np.json
If it is a dict
, remember to change the '
s to "
s before using jq
.