Venue: 3.02
Friday, May 29
 

11:30am CEST

Build web scrapers with AI for non-coding journalists
Friday May 29, 2026 11:30am - 12:45pm CEST
Scraping data from the Internet has become a key skill for many investigations and reporting projects that rely on data. Building custom web scrapers used to require solid coding skills, but in two recent environmental investigations supported by the Pulitzer Center, we used Large Language Models (LLMs) such as ChatGPT, Google Gemini, and Claude to help us build scrapers for online content without much coding experience. This hands-on workshop will teach you how to inspect a website and choose a scraping strategy. It will then demonstrate, step by step, how to build the web scrapers used in those investigations. The LLM prompts will be shared, and participants can follow along to create their first custom web scraper.

After attending, you will understand how websites are structured for scraping purposes and be able to use LLMs to build basic web scrapers.

Participants should bring their own laptops, register for a free account with any of the main LLM services (e.g. ChatGPT, Google Gemini, Claude), and have a free Google Colab account at colab.research.google.com.

No coding skills are required, but basic familiarity with LLMs is recommended.
Speakers
Kuang Keng Kuek Ser

Senior Editor for Rainforest Investigations, Pulitzer Center
Kuang Keng Kuek Ser is the Senior Editor for Rainforest Investigations at the Pulitzer Center, a non-profit organization based in Washington, DC that supports independent journalists globally. He supports and mentors three fellowships investigating issues related to tropical rainforest...
Anastasiia Morozova

Data and investigative journalist, Onet.pl/Ringier Axel Springer
I’m a data and investigative journalist with a background in tracking Russian influence, disinformation operations and sanctions evasion in Europe. I’m especially interested in projects where I can combine data analysis and visual storytelling to expose hidden networks or financial...
3.02

2:00pm CEST

Scraping the unscrapable: advanced approaches to deal with complex sites and evade anti-scraping systems
Friday May 29, 2026 2:00pm - 3:15pm CEST
Scraped data can often be the backbone of an investigation, but some websites are more difficult to scrape than others. This session will cover how to approach tricky sites, including coping with captchas, IP blocking, and browser fingerprinting. We'll cover how to figure out what might be preventing you from scraping a site, and what options you have to proceed, with their pros, cons, and costs.

This is an advanced session aimed at people who already have experience of writing code to scrape websites and want to move up to the next level: participants will leave with an understanding of how to deal with hard-to-scrape websites, plus the tradeoffs of different approaches. No tools are required to follow along, just a web browser.
Speakers
Max Harlow

Bloomberg News
Max Harlow is a data reporter at Bloomberg News. He also runs Journocoders, a community group for journalists to develop technical skills for use in their reporting.
3.02

3:45pm CEST

Investigating inequality in Copenhagen’s nurseries
Friday May 29, 2026 3:45pm - 5:00pm CEST
Learn how the investigative team at the Danish news outlet Altinget used data scraped from 350 inspection reports to map structural inequality in Copenhagen’s nurseries and kindergartens, and how the method can be applied to other local areas and welfare institutions.
Speakers
Freja Wedenborg

Data Journalist, Altinget
Freja Wedenborg (Denmark) is a data journalist at the Danish news outlet Altinget. She also teaches data journalism, OSINT, and other digital investigative methods at the Center for Journalism at the University of Southern Denmark, and is the author of Cryptoguide for Journalists...
3.02
 