Dataharvest 2026 - the European Investigative Journalism Conference: Full Schedule

arrow_back View All Dates

11:30am CEST

Your first investigative data pipeline with agentic AI

Friday May 29, 2026 11:30am - 12:45pm CEST

Every investigative journalist has faced the same bottleneck. What would I find if I could check all of them: all the company registrations, all the addresses, all the permits? Until recently, answering that question required weeks of scripting. In this session, we introduce a faster way: directing an AI coding agent to build investigative data pipelines on demand. Participants will direct an agent to pull data from a public source, clean it, and turn it into an interactive visualization, all without writing code manually. The approach is applicable to a range of investigative beats, from financial crime and corruption to environmental accountability and lobbying networks.

To follow along, participants should have a basic understanding of web technologies, but no programming experience is needed. After attending this session, participants will be able to direct an AI coding agent to build a data pipeline, from raw data to interactive visualization, and apply this methodology to their own investigative questions. Participants should have a laptop with a modern web browser. We will provide API keys and access credentials during the session. Detailed setup instructions will be shared via a GitHub repository before the workshop.

Speakers

Jeremy Crowlesmith

Data journalist / AI specialist, KRO-NCRV

hi, i'm jeremy. i build tools and tell stories with data. from scraping to analysis to visualization — the whole stack. i have twenty years of building for the web. now i'm focused on investigative data journalism: using code to find stories hidden in documents and datasets. - based... Read More →

Jan van der Burgt

Investigative coder / AI specialist, Freelance / Open State Foundation

I leverage AI technologies to collect and analyse data at scale, uncovering the hidden patterns that build stories.

Investigative focus: lobbying, government overreach, migration, global food supply chains.

Friday May 29, 2026 11:30am - 12:45pm CEST
3.05

Data skills, Workshop

2:00pm CEST

Using the cloud and local LLMs to rapidly analyse thousands of audio/text documents

Friday May 29, 2026 2:00pm - 3:15pm CEST

3.05

In this session, participants will take an archive of podcast episodes and other documents, and set up some cloud infrastructure to analyse the files using open source transcription, text extraction and generative AI tooling. The aim is to equip attendees with the skills to rapidly perform bulk operations on large troves of data by leveraging cloud platforms. By the end of the workshop participants will be have a pipeline that can answer questions like 'which podcast episodes have instances of greenwashing in them'. At The Guardian, we have used these techniques in two recent investigations. When investigating the Free Birth Society we needed to perform analysis on hundreds of hours of audio files. When the Epstein files were released we had to try and extract meaning out of millions of unstructured text documents. By making use of simple cloud tools (queues and instances) we were able to process hundreds of files in parallel whilst retaining control of the data.

Participants should have some experience of using the command line. All cloud accounts will be provided. After attending this session, participants will be able to use the cloud to quickly analyse large numbers of documents and media files. Participants using Windows could save some time by setting up WSL https://learn.microsoft.com/en-us/windows/wsl/install

Speakers

Philip McMahon

Software Developer, The Guardian

Teodora Curcic

BBC

Teodora Ćurčić is an investigative and data journalist from Serbia with over seven years of experience reporting on corruption, political finance, gender-based violence, and social justice. She spent most of her career at the award-winning Center for Investigative Journalism of... Read More →

Friday May 29, 2026 2:00pm - 3:15pm CEST
3.05

Data skills, Workshop

3:45pm CEST

Newsroom infrastructure for AI experimentation

Friday May 29, 2026 3:45pm - 5:00pm CEST

3.05

Learn approaches to tooling and infrastructure that allow every member of your newsroom to participate in your AI experiments, along with how to test and track both improvements and disappointments along the way!

In this workshop, we'll look at: Python libraries that can turn tiny snippets of code or prompts into shareable web apps (Gradio, Streamlit), platforms that allow non-technical users to build evaluations and experiment on their own (Braintrust, n8n), and approaches to models and tooling that provide long-term value and flexibility when selecting services and providers (Pydantic, OpenRouter).

Whether you're looking to use AI for investigative work or to ease the copy-editing burden, increasing participation across the newsroom can help discover limitations and inspiration, along with easing anxieties over automation. To get the most out of this session, participants should have a working knowledge of Python.

After attending this session, participants will have a suite of approaches to bring non-technical members of their newsroom into their AI processes. Participants should have Jupyter installed or a Google account to work in the cloud.

Speakers

Jonathan Soma

Knight Chair in Data Journalism, Columbia University

Jonathan Soma is the Knight Chair in Data Journalism at Columbia University, where he serves as Director of the Data Journalism MS program and the Lede Program, an intensive data journalism summer course. His lectures cover everything from basic Python and data analysis to interactive... Read More →

Philip McMahon

Software Developer, The Guardian

Friday May 29, 2026 3:45pm - 5:00pm CEST
3.05

Data skills, Workshop

Dataharvest 2026 - the European Investigative Journalism Conference

11:30am CEST

Jeremy Crowlesmith

Jan van der Burgt

2:00pm CEST

Philip McMahon

Teodora Curcic

3:45pm CEST

Jonathan Soma

Philip McMahon

Get help with the event