Dataharvest 2026 - the European Investigative Journalism Conference: Full Schedule

11:30am CEST

Build web scrapers with AI for non-coding journalists

Friday May 29, 2026 11:30am - 12:45pm CEST

Scraping data from the Internet has become a key skill for many investigations and reporting projects that rely on data. Building custom web scrapers used to require solid coding skills, but in two recent environmental investigations supported by the Pulitzer Center, we used Large Language Models (LLMs) like ChatGPT, Google Gemini, or Claude to help us build scrapers for online content without much coding skill. This hands-on workshop will teach you how to inspect a website and choose a scraping strategy. Then it will demonstrate, step by step, how to build the web scrapers that were used in the investigations. LLM prompts will be shared and participants can follow along to create their first custom web scraper.

After attending you will understand website structure for scraping and be able to use LLMs to build basic web scrapers.
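As a rough illustration of what inspecting a page's structure leads to, here is a minimal sketch of a scraper that reads rows out of an HTML table. It is not taken from the workshop materials: the sample HTML, the permit data, and the names `TableRowParser` and `scrape_rows` are all made up for this example, and it uses only Python's standard library rather than whatever tools the speakers demonstrate.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # row currently being read, or None
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<table>
  <tr><th>Permit</th><th>Holder</th></tr>
  <tr><td>A-101</td><td>Example Timber Ltd</td></tr>
  <tr><td>A-102</td><td>Sample Mining Co</td></tr>
</table>
"""


def scrape_rows(html):
    """Return the table rows found in an HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    return parser.rows


rows = scrape_rows(SAMPLE_HTML)
```

The point of the sketch is the strategy, not the parser: once you know the data lives in `<tr>` and `<td>` elements, the scraper only has to walk those tags.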

Participants should bring their own laptops, register a free account with any of the main LLM services (e.g. ChatGPT, Google Gemini, Claude) and have a free Google Colab account at colab.research.google.com.

No coding skill is required but basic familiarity with LLMs is recommended.

Materials: https://github.com/kuangkeng/dataharvest2026-ai-scraper

Speakers

Kuang Keng Kuek Ser

Senior Editor for Rainforest Investigations, Pulitzer Center

Kuang Keng Kuek Ser is the Senior Editor for Rainforest Investigations at the Pulitzer Center, a non-profit organization based in Washington, DC that supports independent journalists globally. He supports and mentors three fellowships investigating issues related to tropical rainforest...

Anastasiia Morozova

Data and investigative journalist, Onet.pl/Ringier Axel Springer

I’m a data and investigative journalist with a background in tracking Russian influence, disinformation operations and sanctions evasion in Europe. I’m especially interested in projects where I can combine data analysis and visual storytelling to expose hidden networks or financial...

Friday May 29, 2026 11:30am - 12:45pm CEST
3.02

Data skills, Workshop

11:30am CEST

How to extract Persons, Names and Locations from research material – and where AI fails to do it

Friday May 29, 2026 11:30am - 12:45pm CEST

Z2.01 - Mediadrôme

Processing natural language is seen as the task that artificial intelligence is most adept at. However, as journalists and researchers, we need our technologies to be explainable, understandable, and deterministic. Because of this, not all artificial intelligence algorithms are well-suited for our work. And, when every company promises that their AI software is extraordinary, it's difficult to distinguish the empty promises from what the technology can actually do. Working on OpenAleph, an open-source tool for investigative journalism, has taught us a lot about processing natural language. We extract names of people and companies from raw text. We try to infer the language a text is written in. The names of places, cities, and countries are crucial to us, in order to situate data geographically. All of this is heavily reliant on algorithms. But not all algorithms are equally good at getting us what we want!

In this session, we'll show you what works and what doesn't. Everything we demonstrate can be used independently of OpenAleph, and integrated into your own workflows. Some machine learning algorithms are excellent at getting us more insights from our data. In addition to this, data that we already have, or public data, can be harnessed to help us identify names of people and places, just based on similarity - no AI required!
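To make the idea of matching names "just based on similarity" concrete, here is a minimal sketch using Python's standard `difflib` module. It is an assumption about one possible approach, not the method used in OpenAleph or in the session: the reference list `KNOWN_NAMES`, the helper names, and the `0.8` threshold are all made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference list; in practice this could come from data
# you already hold or from a public dataset.
KNOWN_NAMES = ["Maria Jansen", "Jan de Vries", "Antwerp", "Mechelen"]


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def best_match(candidate, known=KNOWN_NAMES, threshold=0.8):
    """Return (name, score) for the most similar known name, or None.

    The score is difflib's similarity ratio between 0 and 1. The
    threshold is an arbitrary illustrative value that needs tuning
    on real data.
    """
    scored = [
        (SequenceMatcher(None, normalize(candidate), normalize(name)).ratio(), name)
        for name in known
    ]
    score, name = max(scored)
    if score >= threshold:
        return name, round(score, 2)
    return None
```

Calling `best_match("Maria Janssen")` should resolve the misspelling to the known entry, while an unrelated string falls below the threshold and returns `None`. The behaviour is deterministic and easy to explain, which is the property the abstract asks for.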

Finally, we'll discuss how these approaches compare to using large language models and generative AI. This session is half teaching and discussing common solutions, half workshop. For the workshop part, bring a laptop running Python if possible.

Speakers

Simon Wörpel

Director of Technology, Data and Research Center – DARC

Natalie Widmann

Data Journalist, SWR Data Lab / Freelance

I'm a data journalist supporting journalists with data, tools and automation. I'm happy to talk about scraping data, extracting the most relevant information from it, understanding algorithms and using them for investigations.

Data skills, Workshop

11:30am CEST

Your first investigative data pipeline with agentic AI

Friday May 29, 2026 11:30am - 12:45pm CEST

Z1.15 - Aula Donche

Every investigative journalist has faced the same bottleneck. What would I find if I could check all of them: all the company registrations, all the addresses, all the permits? Until recently, answering that question required weeks of scripting. In this session, we introduce a faster way: directing an AI coding agent to build investigative data pipelines on demand. Participants will direct an agent to pull data from a public source, clean it, and turn it into an interactive visualization, all without writing code manually. The approach is applicable to a range of investigative beats, from financial crime and corruption to environmental accountability and lobbying networks.

To follow along, participants should have a basic understanding of web technologies, but no programming experience is needed. After attending this session, participants will be able to direct an AI coding agent to build a data pipeline, from raw data to interactive visualization, and apply this methodology to their own investigative questions. Participants should have a laptop with a modern web browser. We will provide API keys and access credentials during the session. Detailed setup instructions will be shared via a GitHub repository before the workshop.

Speakers

Jeremy Crowlesmith

Data journalist / AI specialist, KRO-NCRV

hi, i'm jeremy. i build tools and tell stories with data. from scraping to analysis to visualization — the whole stack. i have twenty years of building for the web. now i'm focused on investigative data journalism: using code to find stories hidden in documents and datasets. - based...

Jan van der Burgt

Investigative coder / AI specialist, Freelance / Open State Foundation

I leverage AI technologies to collect and analyse data at scale, uncovering the hidden patterns that build stories.

Investigative focus: lobbying, government overreach, migration, global food supply chains.

Data skills, Workshop

2:00pm CEST

How to code anything

Friday May 29, 2026 2:00pm - 3:15pm CEST

Z1.15 - Aula Donche

Coding has long been a skill journalists wanted to learn to make their investigations more efficient and rigorous. The main barrier was the significant time investment required to develop that skill. But since large language models emerged, we no longer need to write code ourselves. We do, however, still need to make informed choices when instructing an LLM to write code for us. Otherwise, those choices get made for us by the model.

How do we best instruct the LLM? How can we understand the code it writes? And how do we catch potential mistakes? No prior coding knowledge is required to attend this session. You'll learn a simple, systematic approach to conversations, context management, and effective prompting that will help you to code anything.

Participants should either have a subscription with a large language model provider such as ChatGPT or Claude and be able to use it locally with Claude Code or Codex, or install Open Code (https://opencode.ai), in which case we will provide them with API keys.

Slides: https://datafrosch.fun/slides/code-anything/

Speakers

Ada Homolova

Coordinator of the data skills track, Arena for Journalism in Europe

A freelance data journalist with over 10 years of experience in data and investigative journalism, cross-border reporting, and teaching. She has worked with both small and large newsrooms across Europe, including Correctiv, Follow The Money, OCCRP, and Lost in Europe. Her heart beats...

Johan Schujit

Data Engineer, Resolve.

I'm a data engineer responsible for EveryPolitician and PoliLoom at OpenSanctions. I'm a self-taught hacker with a stubborn belief that good data should be open and technology should serve the public interest. Previously at Follow the Money.

Data skills, Workshop

2:00pm CEST

Scraping the unscrapable: advanced approaches to deal with complex sites and evade anti-scraping systems

Friday May 29, 2026 2:00pm - 3:15pm CEST

3.02

Scraped data can often be the backbone of an investigation, but some websites are more difficult to scrape than others. This session will cover how to approach dealing with tricky sites, including coping with captchas, IP blocking, and browser fingerprinting. We'll cover how to figure out what might be preventing you from scraping a site, and what options you have to proceed, with their pros, cons, and costs.

This is an advanced session aimed at people who already have experience of writing code to scrape websites and want to move up to the next level: participants will leave with an understanding of how to deal with hard-to-scrape websites, plus the tradeoffs of different approaches. No tools are required to follow along, just a web browser.

Slides: docs.google

Speakers

Max Harlow

Bloomberg News

Max Harlow is a data reporter at Bloomberg News. He also runs Journocoders, a community group for journalists to develop technical skills for use in their reporting.

Data skills, Workshop

2:00pm CEST

Using the cloud and local LLMs to rapidly analyse thousands of audio/text documents

Friday May 29, 2026 2:00pm - 3:15pm CEST

Z0.10

In this session, participants will take an archive of podcast episodes and other documents, and set up some cloud infrastructure to analyse the files using open source transcription, text extraction and generative AI tooling. The aim is to equip attendees with the skills to rapidly perform bulk operations on large troves of data by leveraging cloud platforms. By the end of the workshop participants will have a pipeline that can answer questions like 'which podcast episodes have instances of greenwashing in them'. At The Guardian, we have used these techniques in two recent investigations. When investigating the Free Birth Society we needed to perform analysis on hundreds of hours of audio files. When the Epstein files were released we had to try to extract meaning from millions of unstructured text documents. By making use of simple cloud tools (queues and instances) we were able to process hundreds of files in parallel whilst retaining control of the data.

Participants should have some experience of using the command line. All cloud accounts will be provided. After attending this session, participants will be able to use the cloud to quickly analyse large numbers of documents and media files.

You can see the repository for the workshop here https://github.com/philmcmahon/data-pipeline

We'll be using the following tools during the workshop. They can be installed quickly but if they are set up in advance that would save some time:
- AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
- OpenTofu: https://opentofu.org/docs/intro/install/
- UV: https://docs.astral.sh/uv/getting-started/installation/
- (If using Windows) You might need to set up WSL: https://learn.microsoft.com/en-us/windows/wsl/install - but so long as you can run aws, opentofu and uv, that's all you need

Speakers

Philip McMahon

Software Developer, The Guardian

Teodora Curcic

BBC

Teodora Ćurčić is an investigative and data journalist from Serbia with over seven years of experience reporting on corruption, political finance, gender-based violence, and social justice. She spent most of her career at the award-winning Center for Investigative Journalism of...

Data skills, Workshop

3:45pm CEST

How local LLMs can help you with sensitive information: a beginner's guide

Friday May 29, 2026 3:45pm - 5:00pm CEST

Z0.10

Journalists often work with sensitive information. This information should not end up in web-based tools like ChatGPT and similar services. However, there are alternatives: local LLMs that run on your own computer. This not only ensures data protection when processing large volumes of documents, but it can also save costs on expensive APIs.

This introductory workshop aims to answer the most important questions: What hardware do I need? What frameworks are available (LM Studio, Ollama, etc.)? Which models can I use for which tasks? And what does such a workflow look like (e.g., with Python)? This session is a mix of presentation and hands-on elements.

No prior knowledge is required to attend this session. If you want to participate in the hands-on parts, make sure to download and install Ollama and/or LM Studio and download a local model like Qwen3.5-4B.

After attending this session, the participants will understand the pros and cons of using local AI models and get ideas from real-life examples on how to use this knowledge.
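As a rough idea of what a Python workflow with a local model can look like, here is a minimal sketch that sends a prompt to Ollama's local HTTP API. It is not taken from the workshop materials. It assumes Ollama is running on its default address (`http://localhost:11434`) and that you have already downloaded a model; `MODEL` is a placeholder to replace with the tag of a model you have pulled, and the function names are made up for this example.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing leaves your machine.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Placeholder: replace with the tag of a model you have pulled locally.
MODEL = "your-local-model"


def build_request(prompt, model=MODEL):
    """Build the HTTP request for a single, non-streaming completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(prompt, model=MODEL):
    """Send the prompt to the local server and return the generated text.

    Requires a running Ollama server; raises an error if it is not
    reachable or the model has not been downloaded.
    """
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `ask_local_model("Summarise this note in one sentence: ...")` returns the model's answer as a string. Looping over a folder of documents and calling the function once per file is the simplest form of the bulk workflow the abstract describes.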

Materials: https://github.com/chesselingfm/local-llm-investigative-journalism

Speakers

Claus Hesseling

Freelance AI, Data Journalist and Workshop Trainer, NDR

Data Journalist and AI expert at public broadcaster NDR in Hamburg, Germany. Workshop trainer and lecturer since 2004.

Johan Schujit

Data Engineer, Resolve.

Data skills, Workshop

3:45pm CEST

Mapping and spatial analysis in code

Friday May 29, 2026 3:45pm - 5:00pm CEST

3.02

Data journalists have traditionally thought of maps and spatial calculations as a job for special mapping software, like QGIS. But it's often more efficient to do GIS work within the same script in which you perform the rest of your analysis.

In this session, you will see how easy it is to work with GIS within your code and share interactive maps with colleagues. To follow along, participants should have some experience in data journalism and a curiosity about the relationship between data and maps.

This session will introduce participants to a new world of possibilities for doing spatial analysis in code. While participants will benefit from simply observing, those who want to run the code should have RStudio installed: https://posit.co/download/rstudio-desktop/

Materials: https://github.com/gebelo/Dataharvest2026

Speakers

Robert Gebeloff

Reporter, New York Times

Robert Gebeloff has worked as a data projects reporter for The New York Times since 2008 and has taught data journalism for many years in newsrooms and at conferences. He was co-winner of the George Polk Award in 2015 and was a Pulitzer Prize finalist in both 2015 and 2016 for projects...

Jonathan Stoneman

Arena for Journalism in Europe

Former BBC journalist turned data journalist, trainer and consultant. Works with Arena as Lead Trainer, Arena Academy.

Data skills, Workshop

3:45pm CEST

Newsroom infrastructure for AI experimentation

Friday May 29, 2026 3:45pm - 5:00pm CEST

3.05

Learn approaches to tooling and infrastructure that allow every member of your newsroom to participate in your AI experiments, along with how to test and track both improvements and disappointments along the way!

In this workshop, we'll look at: Python libraries that can turn tiny snippets of code or prompts into shareable web apps (Gradio, Streamlit), platforms that allow non-technical users to build evaluations and experiment on their own (Braintrust, n8n), and approaches to models and tooling that provide long-term value and flexibility when selecting services and providers (Pydantic, OpenRouter).

Whether you're looking to use AI for investigative work or to ease the copy-editing burden, increasing participation across the newsroom can help discover limitations and inspiration, along with easing anxieties over automation. To get the most out of this session, participants should have a working knowledge of Python.

After attending this session, participants will have a suite of approaches to bring non-technical members of their newsroom into their AI processes.

You'll get the most out of this session if you have the following accounts:
- A GitHub account to run Codespaces (free cloud computer) https://github.com
- A Google account to work in Google Colab (a different free cloud computer)
- A Braintrust account to test prompts and run evaluations https://www.braintrust.dev
- An OpenRouter account if you'd like to make your own API keys instead of using mine

Materials: https://jsoma.github.io/workshop-newsroom-ai-infra/

Speakers

Jonathan Soma

Knight Chair in Data Journalism, Columbia University

Jonathan Soma is the Knight Chair in Data Journalism at Columbia University, where he serves as Director of the Data Journalism MS program and the Lede Program, an intensive data journalism summer course. His lectures cover everything from basic Python and data analysis to interactive...

Philip McMahon

Software Developer, The Guardian

Data skills, Workshop
