Journalists often run into data that is visible on a website but impossible to download directly: a table buried in a government page, a list of public records, or search results that change with every query. Writing a full scraper can be time-consuming and technically demanding for what is often a one-time task.
This session introduces three lightweight approaches that cover most of these cases: reading a table directly from a page using pandas, downloading raw HTML and parsing it into a DataFrame, and pulling data through the network requests a page makes in the background. These techniques are practical tools for everyday newsroom situations. Participants will take home a GitHub repository with a working notebook to try on their own data, though some adaptation will be needed to apply it to different websites.
The three approaches vary in complexity. Basic Python knowledge is enough to follow along, but participants with more experience will be able to go further, and the code can be adapted with the help of an LLM.
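To give a sense of scale, here is a minimal sketch of what the three approaches can look like in Python. It is not taken from the session notebook. The function names, the CSS selectors (`li.record`, `.name`, `.date`) and the `results` key are hypothetical placeholders that would need to match the real page or endpoint. It assumes pandas, requests, BeautifulSoup (`bs4`) and an HTML parser such as lxml are installed.

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup


def fetch(url: str) -> requests.Response:
    """Download a URL and raise an error on a failed HTTP status."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response


def table_from_page(html: str, index: int = 0) -> pd.DataFrame:
    """Approach 1: let pandas find the <table> elements in the HTML."""
    # read_html returns a list with one DataFrame per table it finds.
    tables = pd.read_html(StringIO(html))
    return tables[index]


def table_from_raw_html(html: str, row_selector: str = "li.record") -> pd.DataFrame:
    """Approach 2: parse the HTML by hand when the data is not in a <table>.

    The selectors are hypothetical; inspect the page to find the real ones.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(row_selector):
        rows.append(
            {
                "name": item.select_one(".name").get_text(strip=True),
                "date": item.select_one(".date").get_text(strip=True),
            }
        )
    return pd.DataFrame(rows)


def table_from_json(payload: dict, key: str = "results") -> pd.DataFrame:
    """Approach 3: flatten JSON returned by a request the page makes itself.

    The "results" key is an assumption; real endpoints name it differently.
    """
    # json_normalize turns nested objects into dotted column names.
    return pd.json_normalize(payload[key])
```

With a real site, the first two functions would be fed `fetch(url).text` and the third `fetch(url).json()`, where `url` is either the page address or a request address copied from the browser's network tab. Keeping the download separate from the parsing makes it easier to save a copy of the page and rerun the parsing step without hitting the site again.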
Materials:
https://github.com/teodoracurcic/dh2026-getting-data