Type: Data skills
Friday, May 29
 

11:30am CEST

Build web scrapers with AI for non-coding journalists
Friday May 29, 2026 11:30am - 12:45pm CEST
Scraping data from the Internet has become a key skill for many investigations and reporting projects that rely on data. Building custom web scrapers used to require solid coding skills, but in two recent environmental investigations supported by the Pulitzer Center, we used Large Language Models (LLMs) like ChatGPT, Google Gemini, or Claude to help us build scrapers for online content without much coding skill. This hands-on workshop will teach you how to inspect a website and choose a scraping strategy. Then it will demonstrate, step by step, how to build the web scrapers that were used in the investigations. LLM prompts will be shared, and participants can follow along to create their first custom web scraper.

After attending you will understand website structure for scraping and be able to use LLMs to build basic web scrapers.

Participants should bring their own laptops, register a free account with any of the main LLM services (e.g. ChatGPT, Google Gemini, Claude), and have a free Google Colab account at colab.research.google.com.

No coding skill is required but basic familiarity with LLMs is recommended.
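As a rough illustration of what a basic scraper does once a page's structure has been inspected, here is a minimal sketch using only Python's standard library. It is not the workshop's own code: the sample HTML, the `TableParser` class and the `scrape_table` helper are all hypothetical, and a real scraper would first download the page.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download the HTML first.
SAMPLE_HTML = """
<table id="permits">
  <tr><th>Company</th><th>Permit</th></tr>
  <tr><td>Acme Timber</td><td>A-101</td></tr>
  <tr><td>Green Logs Ltd</td><td>B-202</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []  # start a new row
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def scrape_table(html):
    """Turn the first row into column names and the rest into records."""
    parser = TableParser()
    parser.feed(html)
    header, *records = parser.rows
    return [dict(zip(header, record)) for record in records]


rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

The same pattern (find the repeating element, extract its fields, store them as records) applies whichever library an LLM suggests for the job.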
Speakers

Kuang Keng Kuek Ser

Senior Editor for Rainforest Investigations, Pulitzer Center
Kuang Keng Kuek Ser is the Senior Editor for Rainforest Investigations at the Pulitzer Center, a non-profit organization based in Washington, DC that supports independent journalists globally. He supports and mentors three fellowships investigating issues related to tropical rainforest...

Anastasiia Morozova

Data and investigative journalist, Onet.pl/Ringier Axel Springer
I’m a data and investigative journalist with a background in tracking Russian influence, disinformation operations and sanctions evasion in Europe. I’m especially interested in projects where I can combine data analysis and visual storytelling to expose hidden networks or financial...
Friday May 29, 2026 11:30am - 12:45pm CEST
3.02

11:30am CEST

How to extract Persons, Names and Locations from research material – and where AI fails to do it
Friday May 29, 2026 11:30am - 12:45pm CEST
Processing natural language is seen as the task that artificial intelligence is most adept at. However, as journalists and researchers, we need our technologies to be explainable, understandable, and deterministic. Because of this, not all artificial intelligence algorithms are well-suited for our work. And, when every company promises that their AI software is extraordinary, it's difficult to distinguish the empty promises from what the technology can actually do. Working on OpenAleph, an open-source tool for investigative journalism, has taught us a lot about processing natural language. We extract names of people and companies from raw text. We try to infer the language a text is written in. The names of places, cities, and countries are crucial to us, in order to situate data geographically. All of this is heavily reliant on algorithms. But not all algorithms are equally good at getting us what we want!

In this session, we'll show you what works and what doesn't. Everything we demonstrate can be used independently of OpenAleph, and integrated into your own workflows. Some machine learning algorithms are excellent at getting us more insights from our data. In addition to this, data that we already have, or public data, can be harnessed to help us identify names of people and places, just based on similarity - no AI required!

Finally, we'll discuss how these approaches compare to using large language models and generative AI. This session is half teaching and discussing common solutions, half workshop. For the workshop part, bring a laptop running Python if possible.
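To illustrate the idea of identifying names "just based on similarity", here is a minimal sketch that compares a candidate string against a reference list using Python's standard `difflib` module. This is not OpenAleph's method: the reference names, the `best_match` helper and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference list; in practice this could come from public data.
KNOWN_NAMES = ["Maria Gonzalez", "Jan Kowalski", "Amina Yusuf"]


def normalize(name):
    """Lower-case the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def best_match(candidate, known=KNOWN_NAMES, threshold=0.8):
    """Return (name, score) for the most similar known name, or None.

    The threshold is an arbitrary illustrative value and needs tuning
    against real data.
    """
    scored = [
        (SequenceMatcher(None, normalize(candidate), normalize(k)).ratio(), k)
        for k in known
    ]
    score, name = max(scored)
    return (name, round(score, 2)) if score >= threshold else None


print(best_match("Jan Kowalsky"))   # a misspelling of a known name
print(best_match("Peter Müller"))   # not in the reference list
```

Because the comparison is deterministic, the same input always yields the same match, which makes the result easy to explain and audit.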
Speakers

Simon Wörpel

Director of Technology, Data and Research Center – DARC


Natalie Widmann

Data Journalist, SWR Data Lab
I'm a Data Journalist supporting journalists and human rights activists with data, tools and automation.
I'm happy to talk about scraping data, extracting the most relevant information from it, understanding algorithms and using them for investigations.
Friday May 29, 2026 11:30am - 12:45pm CEST
3.13

11:30am CEST

Your first investigative data pipeline with agentic AI
Friday May 29, 2026 11:30am - 12:45pm CEST
Every investigative journalist has faced the same bottleneck. What would I find if I could check all of them: all the company registrations, all the addresses, all the permits? Until recently, answering that question required weeks of scripting. In this session, we introduce a faster way: directing an AI coding agent to build investigative data pipelines on demand. Participants will direct an agent to pull data from a public source, clean it, and turn it into an interactive visualization, all without writing code manually. The approach is applicable to a range of investigative beats, from financial crime and corruption to environmental accountability and lobbying networks.

To follow along, participants should have a basic understanding of web technologies, but no programming experience is needed. After attending this session, participants will be able to direct an AI coding agent to build a data pipeline, from raw data to interactive visualization, and apply this methodology to their own investigative questions. Participants should have a laptop with a modern web browser. We will provide API keys and access credentials during the session. Detailed setup instructions will be shared via a GitHub repository before the workshop.
Speakers

Jeremy Crowlesmith

Data journalist / AI specialist, KRO-NCRV
hi, i'm jeremy. i build tools and tell stories with data. from scraping to analysis to visualization, the whole stack. i have twenty years of building for the web. now i'm focused on investigative data journalism: using code to find stories hidden in documents and datasets. - based...

Jan van der Burgt

Investigative coder / AI specialist, Freelance / Open State Foundation
I leverage AI technologies to collect and analyse data at scale, uncovering the hidden patterns that build stories.

Investigative focus: lobbying, government overreach, migration, global food supply chains.
Friday May 29, 2026 11:30am - 12:45pm CEST
3.05

2:00pm CEST

How to code anything
Friday May 29, 2026 2:00pm - 3:15pm CEST
Coding has long been a skill journalists wanted to learn to make their investigations more efficient and rigorous. The main barrier was the significant time investment required to develop that skill. But since large language models emerged, we no longer need to write code ourselves. We do, however, still need to make informed choices when instructing an LLM to write code for us. Otherwise, those choices get made for us by the model.

How do we best instruct the LLM? How can we understand the code it writes? And how do we catch potential mistakes? No prior coding knowledge is required to attend this session. You'll learn a simple, systematic approach to conversations, context management, and effective prompting that will help you to code anything. Participants should have an account with a large language model provider (ChatGPT, Claude, Gemini or similar).
Speakers

Ada Homolova

ARENA, Austria/ Slovakia
Adriana is a freelance data journalist, trainer and public spending nerd. She coordinates the data skills training track at the Dataharvest conference, and herds frogs at The Pond.

https://homolova.sk/newsletter

Johan Schujit

Data Engineer, Resolve.
I'm a data engineer responsible for EveryPolitician and PoliLoom at OpenSanctions. I'm a self-taught hacker with a stubborn belief that good data should be open and technology should serve the public interest. Previously at Follow the Money.

Friday May 29, 2026 2:00pm - 3:15pm CEST
3.04

2:00pm CEST

Scraping the unscrapable: advanced approaches to deal with complex sites and evade anti-scraping systems
Friday May 29, 2026 2:00pm - 3:15pm CEST
Scraped data can often be the backbone of an investigation, but some websites are more difficult to scrape than others. This session will cover how to approach dealing with tricky sites, including coping with captchas, IP blocking, and browser fingerprinting. We'll cover how to figure out what might be preventing you from scraping a site, and what options you have to proceed, with their pros, cons, and costs.

This is an advanced session aimed at people who already have experience of writing code to scrape websites and want to move up to the next level: participants will leave with an understanding of how to deal with hard-to-scrape websites, plus the tradeoffs of different approaches. No tools are required to follow along, just a web browser.
Speakers

Max Harlow

Bloomberg News
Max Harlow is a data reporter at Bloomberg News. He also runs Journocoders, a community group for journalists to develop technical skills for use in their reporting.
Friday May 29, 2026 2:00pm - 3:15pm CEST
3.02

2:00pm CEST

Using the cloud and local LLMs to rapidly analyse thousands of audio/text documents
Friday May 29, 2026 2:00pm - 3:15pm CEST
In this session, participants will take an archive of podcast episodes and other documents, and set up some cloud infrastructure to analyse the files using open source transcription, text extraction and generative AI tooling. The aim is to equip attendees with the skills to rapidly perform bulk operations on large troves of data by leveraging cloud platforms. By the end of the workshop, participants will have a pipeline that can answer questions like 'which podcast episodes have instances of greenwashing in them'. At The Guardian, we have used these techniques in two recent investigations. When investigating the Free Birth Society, we needed to perform analysis on hundreds of hours of audio files. When the Epstein files were released, we had to try to extract meaning from millions of unstructured text documents. By making use of simple cloud tools (queues and instances), we were able to process hundreds of files in parallel whilst retaining control of the data.

Participants should have some experience of using the command line. All cloud accounts will be provided. After attending this session, participants will be able to use the cloud to quickly analyse large numbers of documents and media files. Participants using Windows could save some time by setting up WSL: https://learn.microsoft.com/en-us/windows/wsl/install
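The idea of processing many files in parallel can be sketched on a single machine with Python's standard `concurrent.futures` module. This is only an illustration, not The Guardian's pipeline: the transcripts are invented, and the `analyse` function is a keyword check standing in for the real transcription or generative AI step, which would run on cloud queues and instances.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical transcripts standing in for transcribed podcast episodes.
TRANSCRIPTS = {
    "episode_1": "Our packaging is now 100% eco-friendly and carbon neutral.",
    "episode_2": "Today we talk about football results.",
    "episode_3": "A carbon neutral airline? Let's look at the claim.",
}

# Illustrative search terms only; a real pipeline would use a model here.
TERMS = ("carbon neutral", "eco-friendly")


def analyse(item):
    """Placeholder analysis step: flag a transcript that mentions any term."""
    name, text = item
    flagged = any(term in text.lower() for term in TERMS)
    return name, flagged


def run_pipeline(transcripts, workers=4):
    """Analyse all transcripts in parallel and return {name: flagged}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(analyse, transcripts.items()))


results = run_pipeline(TRANSCRIPTS)
print(results)
```

In a cloud setup, the worker pool is replaced by a queue feeding several instances, but the shape stays the same: one independent job per file, with results collected at the end.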
Speakers

Philip McMahon

Software Developer, The Guardian


Teodora Curcic

BBC
Teodora Ćurčić is an investigative and data journalist from Serbia with over seven years of experience reporting on corruption, political finance, gender-based violence, and social justice. She spent most of her career at the award-winning Center for Investigative Journalism of...
Friday May 29, 2026 2:00pm - 3:15pm CEST
3.05

3:45pm CEST

How local LLMs can help you with sensitive information: a beginner's guide
Friday May 29, 2026 3:45pm - 5:00pm CEST
Journalists often work with sensitive information. This information should not end up in web-based tools like ChatGPT and similar services. However, there are alternatives: local LLMs that run on your own computer. This not only ensures data protection when processing large volumes of documents, but it can also save costs on expensive APIs.

This introductory workshop aims to answer the most important questions: What hardware do I need? What frameworks are available (LM Studio, Ollama, etc.)? Which models can I use for which tasks? And what does such a workflow look like (e.g., with Python)? This session is a mix of presentation and hands-on elements.

No prior knowledge is required to attend this session. If you want to participate in the hands-on parts, make sure to download and install Ollama and/or LM Studio and download a local model such as Qwen3.5-4B.

After attending this session, the participants will understand the pros and cons of using local AI models and get ideas from real-life examples on how to use this knowledge.
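As a rough idea of what a Python workflow with a local model can look like, the sketch below sends a prompt to Ollama's local HTTP endpoint using only the standard library. It assumes Ollama is running on its default port and that a model has already been downloaded; the model name `your-local-model` and both helper functions are placeholders, not part of the workshop material.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing leaves your own computer.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="your-local-model"):
    """Build the POST request; the model name is a placeholder."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def ask_local_model(prompt, model="your-local-model"):
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask_local_model("Summarise this document: ...")` only works while the Ollama server is running and the named model is installed; otherwise the request fails with a connection or model error.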
Speakers

Claus Hesseling

Freelance journalist and trainer
Does data things for NDR and HR, invents tools for newsrooms for the Interlink Academy in the EU project INJECT, and is a trainer at the ARD.ZDF-Medienakademie and elsewhere. Twitter: @the_claus...

Johan Schujit

Data Engineer, Resolve.
I'm a data engineer responsible for EveryPolitician and PoliLoom at OpenSanctions. I'm a self-taught hacker with a stubborn belief that good data should be open and technology should serve the public interest. Previously at Follow the Money.

Friday May 29, 2026 3:45pm - 5:00pm CEST
3.04

3:45pm CEST

Mapping and spatial analysis in code
Friday May 29, 2026 3:45pm - 5:00pm CEST
Data journalists have traditionally thought of maps and spatial calculations as a job for special mapping software, like QGIS. But it's often more efficient to do GIS work within the same script in which you perform the rest of your analysis.

In this session, you will see how easy it is to work with GIS within your code and share interactive maps with colleagues. To follow along, participants should have some experience in data journalism and a curiosity about the relationship between data and maps.

This session will introduce participants to a new world of possibilities for doing spatial analysis in code. While participants will benefit from simply observing, those who want to run the code should have RStudio installed: https://posit.co/download/rstudio-desktop/
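To give a flavour of a spatial calculation done directly in a script, here is a small sketch of the haversine formula for the great-circle distance between two points. The session itself works in R; this example uses Python instead, and the function name and the Earth radius of 6371 km are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is roughly 111 km.
print(round(haversine_km(0, 0, 0, 1), 1))
```

A function like this can sit next to the rest of an analysis, for example to keep only the records within a given distance of a place of interest.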
Speakers

Robert Gebeloff

Reporter, New York Times
Robert Gebeloff has worked as a data projects reporter for The New York Times since 2008 and has taught data journalism for many years in newsrooms and at conferences. He was co-winner of the George Polk Award in 2015 and was a Pulitzer Prize finalist in both 2015 and 2016 for projects...

Jonathan Stoneman

Arena for Journalism in Europe
Former BBC journalist, turned data journalist, trainer, consultant. Works with Arena as Lead Trainer, Arena Academy.
Friday May 29, 2026 3:45pm - 5:00pm CEST
3.09

3:45pm CEST

Newsroom infrastructure for AI experimentation
Friday May 29, 2026 3:45pm - 5:00pm CEST
Learn approaches to tooling and infrastructure that allow every member of your newsroom to participate in your AI experiments, along with how to test and track both improvements and disappointments along the way!

In this workshop, we'll look at: Python libraries that can turn tiny snippets of code or prompts into shareable web apps (Gradio, Streamlit), platforms that allow non-technical users to build evaluations and experiment on their own (Braintrust, n8n), and approaches to models and tooling that provide long-term value and flexibility when selecting services and providers (Pydantic, OpenRouter).

Whether you're looking to use AI for investigative work or to ease the copy-editing burden, increasing participation across the newsroom can help discover limitations and inspiration, along with easing anxieties over automation. To get the most out of this session, participants should have a working knowledge of Python.

After attending this session, participants will have a suite of approaches to bring non-technical members of their newsroom into their AI processes. Participants should have Jupyter installed or a Google account to work in the cloud.
Speakers

Jonathan Soma

Knight Chair in Data Journalism, Columbia University
Jonathan Soma is the Knight Chair in Data Journalism at Columbia University, where he serves as Director of the Data Journalism MS program and the Lede Program, an intensive data journalism summer course. His lectures cover everything from basic Python and data analysis to interactive...

Philip McMahon

Software Developer, The Guardian

Friday May 29, 2026 3:45pm - 5:00pm CEST
3.05
 