Run the spider as:
$ :> fprincipals.jl :> s.log; scrapy crawl active_foreign_principals -o fprincipals.jl --logfile=s.log
The leading :> redirections truncate the output file and the log file before the crawl starts. This is needed because Scrapy always appends to JSON-lines output.
Install Scrapy for Python 3:
$ virtualenv -p python3 py3
$ source py3/bin/activate
$ pip install Scrapy ipython
Now, create a crawler project:
$ mkdir ~/src/fara
$ cd ~/src/fara
$ scrapy startproject fara
$ cd fara
$ scrapy genspider active_foreign_principals efile.fara.gov
Scrapy will generate skeleton code:
- items.py is where you define the models for the data extracted by the spider
- spiders/active_foreign_principals.py is the spider
A spider is the bit of code that deals with the raw HTML. Its function is to parse and extract relevant data.
The data is within a table with class
ActiveForeignPrincipalSpider.parse() is the entry point for parsing; the code is reasonably self-explanatory.
To get the exhibit_url, another page has to be scraped. The item is merged by passing it through the meta keyword argument of response.follow() which is just a shortcut to
FormRequest.from_response() cannot be used for pagination because:
- the returned HTML is not a complete page, only a div.
- the form action is munged somewhere
Using the network tab in Firebug, the form data can be inspected; the required parameters can be found on the page. The code in ActiveForeignPrincipalSpider._next_page_data() extracts the form data required for the first and subsequent pages.
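To illustrate the idea of reading form parameters off the page, here is a standard-library sketch that collects hidden input fields into a dict. It is a stand-in, not the actual _next_page_data() implementation, which presumably uses Scrapy selectors; the class and function names are hypothetical, and it assumes the parameters live in hidden inputs.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attributes without a value are reported as None.
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_form_fields(html):
    """Return the hidden form fields found in an HTML string."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields
```

The resulting dict has the shape that Scrapy's FormRequest accepts through its formdata argument, so the same values can be posted back to request the next page.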
HTTP caching has been enabled in
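For reference, Scrapy's HTTP cache is switched on through the HTTPCACHE_ENABLED setting. The fragment below is a sketch of such settings, assuming they are placed in the project's settings module; every value other than HTTPCACHE_ENABLED is illustrative and not taken from this project.

```python
# Sketch of Scrapy HTTP cache settings (values are assumptions).
HTTPCACHE_ENABLED = True          # turn the HTTP cache middleware on
HTTPCACHE_EXPIRATION_SECS = 0     # 0 means cached responses never expire
HTTPCACHE_DIR = "httpcache"       # stored under the project's .scrapy directory
HTTPCACHE_IGNORE_HTTP_CODES = []  # status codes that should not be cached
```

With the cache enabled, repeated runs during development replay stored responses instead of hitting the site again.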
To convert a json object/dict into a form suitable for posting, use the following incantation:
$ jq 'to_entries|map(.key + "=" + .value)|join("&")' np.json
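The same conversion can be sketched in Python with the standard library. The function name is hypothetical. Unlike the jq one-liner, urlencode() also percent-escapes the values, which is usually what a POST body needs.

```python
import json
from urllib.parse import urlencode


def dict_to_form_body(data):
    """Turn a flat dict into an application/x-www-form-urlencoded string."""
    return urlencode(data)


def json_file_to_form_body(path):
    """Load a JSON object from a file and encode it as a form body."""
    with open(path, encoding="utf-8") as handle:
        return dict_to_form_body(json.load(handle))
```

For example, json_file_to_form_body("np.json") reads the same file the jq command operates on.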
If it is a dict, remember to change the "s before using