Scraped data can often be the backbone of an investigation, but some websites are more difficult to scrape than others. This session covers how to approach tricky sites, including coping with CAPTCHAs, IP blocking, and browser fingerprinting. We'll look at how to work out what is preventing you from scraping a site and what options you have to proceed, along with their pros, cons, and costs.
This is an advanced session for people who already have experience writing code to scrape websites and want to take on harder sites. Participants will leave with an understanding of how to deal with hard-to-scrape websites and the tradeoffs of different approaches. No tools are required to follow along, just a web browser.
Max Harlow is a data reporter at Bloomberg News. He also runs Journocoders, a community group for journalists to develop technical skills for use in their reporting.