Type: Data skills
Thursday, May 28
 

10:00am CEST

Masterclass: Get satellite imagery to tell you what on earth is going on! Using code and other tools (Masterclass ticket needed)
Thursday May 28, 2026 10:00am - 12:00pm CEST
A separate ticket is required to attend this masterclass. If you already have a conference ticket and would like to attend but haven't yet purchased a masterclass ticket, please contact us at [email protected]

Heat waves in Europe are increasing in frequency and intensity. People and economies are under pressure: extreme heat is costly for agriculture and deadly for people. At the same time, floods are among the most frequent and damaging natural disasters in Europe – yet understanding their true impact remains difficult.

In this session, participants will learn the skills necessary to make use of satellite images to analyse extreme heat or to systematically track flood damage. After a morning introduction to the topic, tools, and data/satellite imagery sources, participants will spend the afternoon working on one of two hands-on tracks:

Track 1: Flooding

Participants will learn how to use Copernicus Emergency Management Service (EMS) to retrieve flood data manually and via the Copernicus EMS API, clean and structure the data, and calculate flood extent and impact across agriculture, infrastructure, ecosystems, and population areas. They will learn how to link impacted areas to the EU’s statistical regions (using NUTS classifications).

Track 2: Extreme heat

Participants will learn to navigate USGS Earth Explorer to find and download imagery for land surface temperature (LST) analysis. Using R for spatial analysis, they’ll identify which neighborhoods in their region are most affected by heat. They will also use auxiliary data to examine the impact of different land types on heat. Participants are welcome to bring their own socioeconomic or location data (e.g., nursing homes, kindergartens) for investigation.

We will assume you have some experience with data in spreadsheets, but you do not need any prior knowledge of coding in R or Python, or of satellite imagery. You will leave with the skills (and the data!) needed to work on hyper-local or national stories about the effects of extreme heat and flooding. These methodologies will also give you a blueprint for other investigations that make good use of satellite imagery.

Key skills learned:

The basics of R (navigating RStudio, importing data, tidyverse, ggplot) and Python (using the pandas library to load, filter, and analyze data, combine datasets, and export your results);

Basics of geodata (file types, projections, NUTS system);

Where to access free, high-quality satellite imagery, and common limitations of using it in investigations;

Navigating satellite imagery portals and databases for natural disasters (depending on whether you choose Copernicus EMS or Landsat).
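The Python portion of that list can be sketched in a few lines of pandas. The tables below are invented for illustration; the workshop will use real flood and population data.

```python
import io
import pandas as pd

# Invented sample data standing in for the workshop's real tables.
floods = pd.read_csv(io.StringIO(
"""nuts_id,region,flooded_ha
DE300,Berlin,120
EL301,Athens,85
FR101,Paris,40
"""))
population = pd.read_csv(io.StringIO(
"""nuts_id,population
DE300,3700000
EL301,3150000
FR101,2100000
"""))

# Filter: keep regions with more than 50 hectares flooded.
large = floods[floods["flooded_ha"] > 50]

# Combine: join flood extent to population via the NUTS code.
merged = large.merge(population, on="nuts_id")

# Export: to_csv returns a string here; pass a filename to write a file.
csv_text = merged.to_csv(index=False)
print(merged)
```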
Speakers
Max Donheiser

Data journalist, Tagesspiegel
Max Donheiser is a data journalist at the Tagesspiegel Innovation Lab. Using data analysis and visualizations, he makes complex social issues accessible – and enjoys wrestling with stubborn spreadsheets and uncovering hidden data treasures. Originally from the USA, he came to Berlin...
Konstantina Maltepioti

Data Journalist, Reporters United
Konstantina Maltepioti is a data journalist at Reporters United, an independent network of investigative journalists based in Greece. Her work focuses on political corruption, environmental issues, and human rights. She specialises in open-source investigations, ship-tracking, scraping...
Thursday May 28, 2026 10:00am - 12:00pm CEST
3.05

1:00pm CEST

Masterclass: Get satellite imagery to tell you what on earth is going on! Using code and other tools (Masterclass ticket needed)
Thursday May 28, 2026 1:00pm - 3:00pm CEST
A separate ticket is required to attend this masterclass. If you already have a conference ticket and would like to attend but haven't yet purchased a masterclass ticket, please contact us at [email protected]

Session description and speakers as in the 10:00am listing above.
Thursday May 28, 2026 1:00pm - 3:00pm CEST
3.05

3:30pm CEST

Masterclass: Get satellite imagery to tell you what on earth is going on! Using code and other tools (Masterclass ticket needed)
Thursday May 28, 2026 3:30pm - 5:00pm CEST
A separate ticket is required to attend this masterclass. If you already have a conference ticket and would like to attend but haven't yet purchased a masterclass ticket, please contact us at [email protected]

Session description and speakers as in the 10:00am listing above.
Thursday May 28, 2026 3:30pm - 5:00pm CEST
3.05
 
Friday, May 29
 

11:30am CEST

Build web scrapers with AI for non-coding journalists
Friday May 29, 2026 11:30am - 12:45pm CEST
Scraping data from the Internet has become a key skill for many investigations and reporting projects that rely on data. Building custom web scrapers used to require solid coding skills, but in two recent environmental investigations supported by the Pulitzer Center, we used Large Language Models (LLMs) like ChatGPT, Google Gemini, or Claude to help us build scrapers for online content with little coding skill. This hands-on workshop will teach you how to inspect a website and choose a scraping strategy. Then it will demonstrate, step by step, how to build the web scrapers that were used in the investigations. LLM prompts will be shared and participants can follow along to create their first custom web scraper.
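As a taste of what such an LLM-drafted scraper looks like, here is a minimal sketch using only Python's standard-library html.parser. The page content is invented; a real scraper would first fetch the live site (e.g. with requests).

```python
from html.parser import HTMLParser

# A toy page standing in for a real listing; in practice an LLM would
# also generate the code that fetches the actual site.
PAGE = """
<table>
  <tr><td>Permit A</td><td>2024-01-02</td></tr>
  <tr><td>Permit B</td><td>2024-03-15</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            self.rows.append(self._row)
    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)
```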

After attending you will understand website structure for scraping and be able to use LLMs to build basic web scrapers.

Participants should come with their own laptops, register a free account on any of the main LLMs (e.g. ChatGPT, Google Gemini, Claude) and have a free Google Colab account at colab.research.google.com.

No coding skill is required but basic familiarity with LLMs is recommended.
Speakers
Kuang Keng Kuek Ser

Senior Editor for Rainforest Investigations, Pulitzer Center
Kuang Keng Kuek Ser is the Senior Editor for Rainforest Investigations at the Pulitzer Center, a non-profit organization based in Washington, DC that supports independent journalists globally. He supports and mentors three fellowships investigating issues related to tropical rainforest...
Anastasiia Morozova

Data and investigative journalist, Onet.pl/Ringier Axel Springer
I’m a data and investigative journalist with a background in tracking Russian influence, disinformation operations and sanctions evasion in Europe. I’m especially interested in projects where I can combine data analysis and visual storytelling to expose hidden networks or financial...
Friday May 29, 2026 11:30am - 12:45pm CEST
3.02

11:30am CEST

How to extract Persons, Names and Locations from research material – and where AI fails to do it
Friday May 29, 2026 11:30am - 12:45pm CEST
Processing natural language is seen as the task that artificial intelligence is most adept at. However, as journalists and researchers, we need our technologies to be explainable, understandable, and deterministic. Because of this, not all artificial intelligence algorithms are well-suited for our work. And, when every company promises that their AI software is extraordinary, it's difficult to distinguish the empty promises from what the technology can actually do. Working on OpenAleph, an open-source tool for investigative journalism, has taught us a lot about processing natural language. We extract names of people and companies from raw text. We try to infer the language a text is written in. The names of places, cities, and countries are crucial to us, in order to situate data geographically. All of this is heavily reliant on algorithms. But not all algorithms are as good at getting us what we want!

In this session, we'll show you what works and what doesn't. Everything we demonstrate can be used independently of OpenAleph, and integrated into your own workflows. Some machine learning algorithms are excellent at getting us more insights from our data. In addition to this, data that we already have, or public data, can be harnessed to help us identify names of people and places, just based on similarity - no AI required!
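One way to do the similarity matching described above without any AI is Python's built-in difflib. The gazetteer here is an invented stand-in for real public place-name data:

```python
import difflib

# A tiny gazetteer standing in for public place-name data.
GAZETTEER = ["Thessaloniki", "Athens", "Heraklion", "Patras"]

def match_place(raw: str, cutoff: float = 0.6):
    """Return the best gazetteer match for a possibly misspelled name."""
    hits = difflib.get_close_matches(raw, GAZETTEER, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# OCR errors and transliteration variants are common sources of near-misses.
print(match_place("Thesaloniki"))
print(match_place("Athina"))
print(match_place("zzzz"))  # no match above the cutoff
```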

Finally, we'll discuss how these approaches compare to using large language models and generative AI. This session is half teaching and discussing common solutions, half workshop. For the workshop part, bring a laptop running Python if possible.
Speakers
Simon Wörpel

Director of Technology, Data and Research Center – DARC

Natalie Widmann

Data Journalist, SWR Data Lab
I'm a Data Journalist supporting journalists and human rights activists with data, tools and automation.
I'm happy to talk about scraping data, extracting the most relevant information from it, understanding algorithms and using them for investigations.
Friday May 29, 2026 11:30am - 12:45pm CEST
3.13

11:30am CEST

Your first investigative data pipeline with agentic AI
Friday May 29, 2026 11:30am - 12:45pm CEST
Every investigative journalist has faced the same bottleneck. What would I find if I could check all of them: all the company registrations, all the addresses, all the permits? Until recently, answering that question required weeks of scripting. In this session, we introduce a faster way: directing an AI coding agent to build investigative data pipelines on demand. Participants will direct an agent to pull data from a public source, clean it, and turn it into an interactive visualization, all without writing code manually. The approach is applicable to a range of investigative beats, from financial crime and corruption to environmental accountability and lobbying networks.

To follow along, participants should have a basic understanding of web technologies, but no programming experience is needed. After attending this session, participants will be able to direct an AI coding agent to build a data pipeline, from raw data to interactive visualization, and apply this methodology to their own investigative questions. Participants should have a laptop with a modern web browser. We will provide API keys and access credentials during the session. Detailed setup instructions will be shared via a GitHub repository before the workshop.
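The pipeline an agent produces typically follows the shape below: pull, clean, aggregate, visualize. This sketch uses invented register data and stops at the aggregate step; in the session the agent would wire the result to an interactive chart.

```python
import csv, io
from collections import Counter

# Invented raw export standing in for data pulled from a public register.
RAW = """company;country;status
Alpha BV ; NL;active
beta gmbh;DE; active
Gamma SA;FR;dissolved
Delta BV;NL;active
"""

# Clean: trim stray whitespace from every field.
rows = []
for rec in csv.DictReader(io.StringIO(RAW), delimiter=";"):
    rows.append({k: v.strip() for k, v in rec.items()})

# Analyse: count active companies per country.
active = [r for r in rows if r["status"].lower() == "active"]
by_country = Counter(r["country"] for r in active)

# In the workshop the agent would hand this aggregate to a charting
# library for an interactive visualization; here we just print it.
print(dict(by_country))
```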
Speakers
Jeremy Crowlesmith

Data journalist / AI specialist, KRO-NCRV
hi, i'm jeremy. i build tools and tell stories with data. from scraping to analysis to visualization — the whole stack. i have twenty years of building for the web. now i'm focused on investigative data journalism: using code to find stories hidden in documents and datasets. - based...
Jan van der Burgt

Investigative coder / AI specialist, Freelance / Open State Foundation
I leverage AI technologies to collect and analyse data at scale, uncovering the hidden patterns that build stories.

Investigative focus: lobbying, government overreach, migration, global food supply chains.
Friday May 29, 2026 11:30am - 12:45pm CEST
3.05

2:00pm CEST

How to code anything
Friday May 29, 2026 2:00pm - 3:15pm CEST
Coding has long been a skill journalists wanted to learn to make their investigations more efficient and rigorous. The main barrier was the significant time investment required to develop that skill. But since large language models emerged, we no longer need to write code ourselves. We do, however, still need to make informed choices when instructing an LLM to write code for us. Otherwise, those choices get made for us by the model.

How do we best instruct the LLM? How can we understand the code it produces? And how do we catch potential mistakes? No prior coding knowledge is required to attend this session. You'll learn a simple, systematic approach to conversations, context management, and effective prompting that will help you to code anything. Participants should have an account with a large language model provider (ChatGPT, Claude, Gemini or similar).
Speakers
Ada Homolova

ARENA, Austria/ Slovakia
Adriana is a freelance data journalist, trainer and public spending nerd. She coordinates the data skills training track at the Dataharvest conference, and herds frogs at The Pond.

https://homolova.sk/newsletter
Johan Schujit

Data Engineer, Resolve.
I'm a data engineer responsible for EveryPolitician and PoliLoom at OpenSanctions. I'm a self-taught hacker with a stubborn belief that good data should be open and technology should serve the public interest. Previously at Follow the Money.

Friday May 29, 2026 2:00pm - 3:15pm CEST
3.04

2:00pm CEST

Scraping the unscrapable: advanced approaches to deal with complex sites and evade anti-scraping systems
Friday May 29, 2026 2:00pm - 3:15pm CEST
Scraped data can often be the backbone of an investigation, but some websites are more difficult to scrape than others. This session will cover how to approach dealing with tricky sites, including coping with captchas, IP blocking, and browser fingerprinting. We'll cover how to figure out what might be preventing you from scraping a site, and what options you have to proceed, with their pros, cons, and costs.

This is an advanced session aimed at people who already have experience of writing code to scrape websites and want to move up to the next level: participants will leave with an understanding of how to deal with hard-to-scrape websites, plus the tradeoffs of different approaches. No tools are required to follow along, just a web browser.
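A first step with any hard-to-scrape site is diagnosing what is blocking you. A rough, illustrative heuristic follows; the signals and recommendations are simplified examples, not an exhaustive list.

```python
# Classify a failed scrape attempt from the HTTP response. The status
# codes and body markers below are common anti-scraping signals.

def diagnose(status: int, body: str) -> str:
    body = body.lower()
    if status in (403, 429):
        return "blocked: slow down, rotate IPs, or use a proxy service"
    if "captcha" in body or "cf-challenge" in body:
        return "captcha/anti-bot challenge: try a real browser or a solving service"
    if status == 200 and len(body) < 500 and "enable javascript" in body:
        return "javascript-rendered: use browser automation"
    return "no obvious anti-scraping signal"

print(diagnose(429, ""))
print(diagnose(200, "<html>Please enable JavaScript to continue</html>"))
```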
Speakers
Max Harlow

Bloomberg News
Max Harlow is a data reporter at Bloomberg News. He also runs Journocoders, a community group for journalists to develop technical skills for use in their reporting.
Friday May 29, 2026 2:00pm - 3:15pm CEST
3.02

2:00pm CEST

Using the cloud and local LLMs to rapidly analyse thousands of audio/text documents
Friday May 29, 2026 2:00pm - 3:15pm CEST
In this session, participants will take an archive of podcast episodes and other documents, and set up some cloud infrastructure to analyse the files using open source transcription, text extraction and generative AI tooling. The aim is to equip attendees with the skills to rapidly perform bulk operations on large troves of data by leveraging cloud platforms. By the end of the workshop participants will have a pipeline that can answer questions like 'which podcast episodes have instances of greenwashing in them'. At The Guardian, we have used these techniques in two recent investigations. When investigating the Free Birth Society we needed to perform analysis on hundreds of hours of audio files. When the Epstein files were released we had to try to extract meaning out of millions of unstructured text documents. By making use of simple cloud tools (queues and instances) we were able to process hundreds of files in parallel whilst retaining control of the data.

Participants should have some experience of using the command line. All cloud accounts will be provided. After attending this session, participants will be able to use the cloud to quickly analyse large numbers of documents and media files. Participants using Windows could save some time by setting up WSL https://learn.microsoft.com/en-us/windows/wsl/install
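The fan-out pattern behind such a pipeline can be sketched in a few lines. The transcription step is stubbed out here; in the real workflow each worker would call a transcription model or text extractor, and the queue would be a cloud service rather than a local thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub: in the real pipeline this would transcribe an audio file with an
# open-source model or extract text from a document.
def process_file(name: str) -> tuple[str, int]:
    text = f"transcript of {name}"   # placeholder result
    return name, len(text)

files = [f"episode_{i}.mp3" for i in range(10)]

# Fan the files out across workers, the same idea a cloud queue plus
# instances applies at much larger scale.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(process_file, files))

print(len(results), "files processed")
```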
Speakers
Philip McMahon

Software Developer, The Guardian

Teodora Curcic

BBC
Teodora Ćurčić is an investigative and data journalist from Serbia with over seven years of experience reporting on corruption, political finance, gender-based violence, and social justice. She spent most of her career at the award-winning Center for Investigative Journalism of...
Friday May 29, 2026 2:00pm - 3:15pm CEST
3.05

3:45pm CEST

How local LLMs can help you with sensitive information: a beginner's guide
Friday May 29, 2026 3:45pm - 5:00pm CEST
Journalists often work with sensitive information. This information should not end up in web-based tools like ChatGPT and similar services. However, there are alternatives: local LLMs that run on your own computer. This not only ensures data protection when processing large volumes of documents, but it can also save costs on expensive APIs.

This introductory workshop aims to answer the most important questions: What hardware do I need? What frameworks are available (LM Studio, Ollama, etc.)? Which models can I use for which tasks? And what does such a workflow look like (e.g., with Python)? This session is a mix of presentation and hands-on elements.
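As a sketch of such a workflow: Ollama exposes a local HTTP API (by default at localhost:11434) that Python can call with the standard library alone. The model name below is a placeholder; use whatever model you have pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's local endpoint

def build_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    """Send a prompt to the local model and return its response text."""
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama with a model pulled):
#   print(ask("llama3", "Summarise: the council met on Tuesday."))
```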

To attend this session, no prior knowledge is required. If you want to participate in the hands-on parts, make sure to download and install Ollama and/or LM Studio and download a local model like Qwen3.5-4B.

After attending this session, the participants will understand the pros and cons of using local AI models and get ideas from real-life examples on how to use this knowledge.
Speakers
Claus Hesseling

Freelance journalist and trainer
Makes data things for NDR and HR, invents tools for newsrooms at the Interlink-Academy in the EU project INJECT, and is a trainer at the ARD.ZDF-Medienakademie and elsewhere. Twitter: @the_claus...
Johan Schujit

Data Engineer, Resolve.
I'm a data engineer responsible for EveryPolitician and PoliLoom at OpenSanctions. I'm a self-taught hacker with a stubborn belief that good data should be open and technology should serve the public interest. Previously at Follow the Money.

Friday May 29, 2026 3:45pm - 5:00pm CEST
3.04

3:45pm CEST

Mapping and spatial analysis in code
Friday May 29, 2026 3:45pm - 5:00pm CEST
Data journalists have traditionally thought of maps and spatial calculations as a job for special mapping software, like QGIS. But it's often more efficient to do GIS work within the same script in which you perform the rest of your analysis.

In this session, you will see how easy it is to work with GIS within your code and share interactive maps with colleagues. To follow along, participants should have some experience in data journalism and a curiosity about the relationship between data and maps.

This session will introduce participants to a new world of possibilities for doing spatial analysis in code. While participants will benefit from simply observing, those who want to run the code should have RStudio installed https://posit.co/download/rstudio-desktop/
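To see what spatial tools do under the hood, here is the classic point-in-polygon test in plain Python. This is purely illustrative; in practice the equivalent is a one-line call in your GIS library of choice.

```python
# Ray casting: a point is inside a polygon if a ray extending to the
# right from it crosses the polygon's edges an odd number of times.

def point_in_polygon(x, y, polygon):
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # this edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

district = [(0, 0), (4, 0), (4, 3), (0, 3)]  # a toy rectangular polygon
print(point_in_polygon(2, 1, district))   # inside
print(point_in_polygon(5, 1, district))   # outside
```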
Speakers
Robert Gebeloff

Reporter, New York Times
Robert Gebeloff has worked as a data projects reporter for The New York Times since 2008 and has taught data journalism for many years in newsrooms and at conferences. He was co-winner of the George Polk Award in 2015 and was a Pulitzer Prize finalist in both 2015 and 2016 for projects...
Jonathan Stoneman

Arena for Journalism in Europe
Former BBC journalist turned data journalist, trainer, and consultant. Works with Arena as Lead Trainer, Arena Academy.
Friday May 29, 2026 3:45pm - 5:00pm CEST
3.09

3:45pm CEST

Newsroom infrastructure for AI experimentation
Friday May 29, 2026 3:45pm - 5:00pm CEST
Learn approaches to tooling and infrastructure that allow every member of your newsroom to participate in your AI experiments, along with how to test and track both improvements and disappointments along the way!

In this workshop, we'll look at: Python libraries that can turn tiny snippets of code or prompts into shareable web apps (Gradio, Streamlit), platforms that allow non-technical users to build evaluations and experiment on their own (Braintrust, n8n), and approaches to models and tooling that provide long-term value and flexibility when selecting services and providers (Pydantic, OpenRouter).

Whether you're looking to use AI for investigative work or to ease the copy-editing burden, increasing participation across the newsroom can help discover limitations and inspiration, along with easing anxieties over automation. To get the most out of this session, participants should have a working knowledge of Python.

After attending this session, participants will have a suite of approaches to bring non-technical members of their newsroom into their AI processes. Participants should have Jupyter installed or a Google account to work in the cloud.
Speakers
Jonathan Soma

Knight Chair in Data Journalism, Columbia University
Jonathan Soma is the Knight Chair in Data Journalism at Columbia University, where he serves as Director of the Data Journalism MS program and the Lede Program, an intensive data journalism summer course. His lectures cover everything from basic Python and data analysis to interactive...
Philip McMahon

Software Developer, The Guardian

Friday May 29, 2026 3:45pm - 5:00pm CEST
3.05
 
Saturday, May 30
 

9:30am CEST

AI-Assisted OSINT: Automating the investigative workflow
Saturday May 30, 2026 9:30am - 10:45am CEST
Most investigative workflows still rely on manually juggling dozens of tools. In this session, we'll walk through a live demo of a semi-automated pipeline built for real casework: web search and archiving with Playwright, face extraction, reverse image search, database cross-referencing with Telegram bots, social media analysis, and structured reporting via Obsidian MCP. All of this is orchestrated by Claude, an AI layer to which you can teach your own investigative methodology. At the end, participants will work through a simplified case using a workflow of their own.

Before the session, please install Python and Claude Code. This session will teach participants to combine several smaller OSINT tools so they work together efficiently without requiring much manual effort. No other special tools are needed.
Speakers
Anastasiia Morozova

Data and investigative journalist, Onet.pl/Ringier Axel Springer
I’m a data and investigative journalist with a background in tracking Russian influence, disinformation operations and sanctions evasion in Europe. I’m especially interested in projects where I can combine data analysis and visual storytelling to expose hidden networks or financial...
Leopold Salzenstein

Data coordinator, Arena for Journalism in Europe
Leopold Salzenstein is a freelance investigative data journalist and trainer based in the south of France. At Arena, he coordinates the handling of data for publications and trainings. He is also a member of the collective of journalists Environmental Investigative Forum (EIF).

Saturday May 30, 2026 9:30am - 10:45am CEST
3.05

9:30am CEST

Scraping with a browser emulator
Saturday May 30, 2026 9:30am - 10:45am CEST
You need to harvest data from a website, but there's no download button. It's time to scrape! There are many options, but one of the most consistently effective is launching an automated browser. You tell the browser where to go, what to click, and when to ingest the content. To follow along, participants should have some knowledge of coding in any language.

Participants will come away from this class knowing the basics of web scraping with a browser emulator. To follow along, participants should have RStudio installed https://posit.co/download/rstudio-desktop/, create a new project, download this file selenium-server-standalone-3.5.3.jar into the project directory, and have the appropriate Chrome binary downloaded into the directory https://googlechromelabs.github.io/chrome-for-testing/last-known-good-versions-with-downloads.json
Speakers
Robert Gebeloff

Reporter, New York Times
Robert Gebeloff has worked as a data projects reporter for The New York Times since 2008 and has taught data journalism for many years in newsrooms and at conferences. He was co-winner of the George Polk Award in 2015 and was a Pulitzer Prize finalist in both 2015 and 2016 for projects...
Simon Wörpel

Director of Technology, Data and Research Center – DARC

Saturday May 30, 2026 9:30am - 10:45am CEST
3.04

9:30am CEST

Showcase your work online: Build a portfolio with GitHub pages
Saturday May 30, 2026 9:30am - 10:45am CEST
Many journalists have published investigations, data stories, and visualizations across different outlets, but no single place to display all their work. In this hands-on session, participants will build a simple, professional portfolio website using GitHub Pages, creating a central hub where their work can live together, even without previous web development experience. By writing and modifying small pieces of HTML and CSS together, participants will see how a simple page can gradually become a polished portfolio.

To attend this session, no prior coding knowledge is required. Familiarity with GitHub or basic HTML is helpful but not necessary. After attending this session, participants will have a live portfolio website that they can continue improving and use immediately for job applications, pitching stories, or showcasing investigative work.

Participants should create a free GitHub account before the session: https://github.com/join. It is also useful to install Visual Studio Code: https://code.visualstudio.com/
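The end result is a plain index.html that GitHub Pages serves from your repository. As an illustration, the snippet below generates a minimal page; the title, projects, and URLs are invented placeholders.

```python
import pathlib

# Invented placeholder projects; replace with your own published work.
PROJECTS = [
    ("Floods in my region", "https://example.com/floods"),
    ("Heat map of the city", "https://example.com/heat"),
]

links = "\n".join(
    f'    <li><a href="{url}">{title}</a></li>' for title, url in PROJECTS
)
html = f"""<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>My portfolio</title></head>
<body>
  <h1>My portfolio</h1>
  <ul>
{links}
  </ul>
</body>
</html>
"""

# GitHub Pages serves index.html from the repository root (or /docs).
pathlib.Path("index.html").write_text(html, encoding="utf-8")
print("wrote index.html with", len(PROJECTS), "projects")
```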
Speakers
Ioanna Petsiou

Data Journalist, Freelancer
Ioanna Petsiou is an investigative data journalist working across data analysis, satellite imagery, and mapping to uncover and explain complex stories. She is particularly drawn to environmental reporting and to building clear, reproducible ways of working with data that others can...
avatar for Alina Yanchur

Alina Yanchur

Data and Investigative Journalist
Investigative and data journalist with a focus on transnational corruption, sanctions evasion, and OSINT methods. Trainer and mentor for journalists working under repressive regimes. Strong background in collaborative and data-driven journalism across Belarus, Europe, and exile c... Read More →
Saturday May 30, 2026 9:30am - 10:45am CEST
Z0.15

11:15am CEST

Choosing the right web scraping strategy
Saturday May 30, 2026 11:15am - 12:30pm CEST
Web scraping is a powerful way to access otherwise unavailable data, but it’s becoming more complex as websites deploy defenses like Captchas and anti-bot systems. At SWR Data Lab, we’ve tackled this across investigations ranging from Google price comparisons to healthcare platforms and social media scraping, each requiring a different approach. In this session, we share a practical decision framework for choosing the right scraping strategy based on robustness, cost, and maintainability.

Rather than promoting a single tool, we focus on choosing the right approach for your use case, considering robustness, cost, and maintainability in a newsroom context. Using real examples, we walk through our workflow: from analyzing sites with dev tools to selecting between HTTP scraping, browser automation, and advanced tools, along with best practices and when paid services are worth it.

To follow along, you should have some experience with scraping and, ideally, Python. Participants will be able to extend their toolkit, make smarter choices in their scraping workflow, and handle real-world obstacles efficiently. No special tools are required to follow along.
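The simplest tier in any such decision framework is plain HTTP plus an HTML parser, before reaching for browser automation. As a minimal sketch (the markup and the `price` class are invented for illustration; real sites first need inspection with your browser's dev tools):

```python
# Minimal sketch of the simplest scraping tier: fetch HTML over HTTP,
# then parse it. Here we parse a literal snippet with the stdlib parser;
# the markup and the "price" class name are invented for illustration.
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

html = '<div><span class="price">19.99</span><span class="price">4.50</span></div>'
parser = PriceParser()
parser.feed(html)
print(parser.prices)  # ['19.99', '4.50']
```

If this tier fails (JavaScript-rendered content, anti-bot defenses), that is the signal to move up to browser automation or paid services, which is exactly the trade-off the session's framework is about.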
Speakers
avatar for Stephanie Jauss

Stephanie Jauss

SWR Data Lab
Stephanie Jauss is a data reporter at the German public broadcaster SWR. She studied Computer Science and Media in Stuttgart as well as Investigative Journalism in Gothenburg.
VS

Verena Steinacher

Data Engineer, pub.tech
Saturday May 30, 2026 11:15am - 12:30pm CEST
3.04

11:15am CEST

Embracing agents with Pydantic AI
Saturday May 30, 2026 11:15am - 12:30pm CEST
"Agentic AI" is all the rage, but what does it offer beyond traditional LLM workflows? In this hands-on session we'll answer this question (and more) while leveraging Python's Pydantic AI library to build a start-to-finish agentic AI workflow.

Participants will learn how agents work, when they're useful, how to build custom tools, and options for tracing and evaluation. You'll leave able to write agentic workflows to extract information from texts, do semi-autonomous research, and deliver clean, structured results.

Basic experience with Python/LLMs is helpful but not required. After attending this session, participants will be able to understand when and how to apply agentic approaches to problems. Participants should have Python/Jupyter installed or a Google account for working in the cloud.
Speakers
avatar for Jonathan Soma

Jonathan Soma

Knight Chair in Data Journalism, Columbia University
Jonathan Soma is the Knight Chair in Data Journalism at Columbia University, where he serves as Director of the Data Journalism MS program and the Lede Program, an intensive data journalism summer course. His lectures cover everything from basic Python and data analysis to interactive... Read More →
avatar for Jan van der Burgt

Jan van der Burgt

Investigative coder / AI specialist, Freelance / Open State Foundation
I leverage AI technologies to collect and analyse data at scale, uncovering the hidden patterns that build stories.

Investigative focus: lobbying, government overreach, migration, global food supply chains.
Saturday May 30, 2026 11:15am - 12:30pm CEST
3.05

11:15am CEST

From data projects to pipelines
Saturday May 30, 2026 11:15am - 12:30pm CEST
Data journalism projects often rely on manually executed scripts, spreadsheet updates, or code running on private computers. As investigations become more complex, span longer timeframes, or require regular updates, these methods become inefficient and unsustainable. Automated data pipelines offer a solution to these challenges.

This workshop provides an introduction to Apache Airflow, an open-source platform for automating and managing workflows. The session demonstrates how Airflow can be utilized to efficiently automate data journalism processes—from scraping to creating and updating visualizations. Participants should have basic programming skills.

After attending this session, participants will know why and when to use automated pipelines and understand the basics of Airflow.
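The core idea Airflow formalizes is that tasks form a directed acyclic graph and run in dependency order. This stdlib sketch (Python 3.9+, using `graphlib`; the task names are invented) mimics that ordering without Airflow itself:

```python
# The idea behind Airflow in miniature: tasks form a DAG and run in
# dependency order. This uses only the stdlib; in Airflow you would
# declare the same chain as scrape >> clean >> visualize inside a DAG.
from graphlib import TopologicalSorter

def scrape():    return "raw rows"
def clean():     return "tidy rows"
def visualize(): return "updated chart"

tasks = {"scrape": scrape, "clean": clean, "visualize": visualize}
# "clean" depends on "scrape"; "visualize" depends on "clean"
deps = {"clean": {"scrape"}, "visualize": {"clean"}}

order = list(TopologicalSorter(deps).static_order())
results = {name: tasks[name]() for name in order}
print(order)  # ['scrape', 'clean', 'visualize']
```

What Airflow adds on top of this ordering is scheduling, retries, logging, and a web UI, which is why it pays off for long-running or regularly updated investigations.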
Speakers
avatar for Natalie Widmann

Natalie Widmann

Data Journalist, SWR Data Lab
I'm a Data Journalist supporting journalists and human rights activists with data, tools and automation.
I'm happy to talk about scraping data, extracting the most relevant information from it, understanding algorithms and using them for investigations.
avatar for Max Harlow

Max Harlow

Bloomberg News
Max Harlow is a data reporter at Bloomberg News. He also runs Journocoders, a community group for journalists to develop technical skills for use in their reporting.
Saturday May 30, 2026 11:15am - 12:30pm CEST
Z0.15

1:45pm CEST

Data Magic made simple: three ways to crunch numbers in spreadsheets
Saturday May 30, 2026 1:45pm - 3:00pm CEST
We know that thousands of lines in a dataset can be intimidating, especially if you’re not a programmer. Spreadsheets can do the heavy lifting — and mastering them is easier than you expect!

In this session, we will walk you through three different ways to dive into data using nothing but spreadsheet tools. Along the way, we’ll show you how to cross-check your calculations, ensuring your findings are accurate and reliable. Whether you’re a complete beginner or have already used spreadsheets in your work, you’ll leave with practical skills to handle data confidently without ever touching a line of code.

Bring your laptop and join us to discover how easy and powerful data analysis can be!
Speakers
avatar for Alina Yanchur

Alina Yanchur

Data and Investigative Journalist
Investigative and data journalist with a focus on transnational corruption, sanctions evasion, and OSINT methods. Trainer and mentor for journalists working under repressive regimes. Strong background in collaborative and data-driven journalism across Belarus, Europe, and exile c... Read More →
avatar for Kuang Keng Kuek Ser

Kuang Keng Kuek Ser

Senior Editor for Rainforest Investigations, Pulitzer Center
Kuang Keng Kuek Ser is the Senior Editor for Rainforest Investigations at the Pulitzer Center, a non-profit organization based in Washington, DC that supports independent journalists globally. He supports and mentors three fellowships investigating issues related to tropical rainforest... Read More →
Saturday May 30, 2026 1:45pm - 3:00pm CEST

1:45pm CEST

How to manage mass FOI projects using AI, vibe coding and verification
Saturday May 30, 2026 1:45pm - 3:00pm CEST
Projects involving FOI requests to multiple bodies often create significant challenges, from different file formats and data trapped in PDFs, to organisations providing data in different structures and different levels of detail. To get the big picture often requires data extraction, cleaning, reshaping, and checking.

In this session, we will share a series of tips and tools used to manage one project — including vibe coding with AI — which can be used to make any multi-response FOI project more efficient and accurate. No prior knowledge is required. By the end of this session, attendees should be able to design a data structure for an FOI project, use a range of tools, including AI, to extract, reshape, clean, and combine data from FOI responses, and design a data validation process to check AI outputs.

You will need a laptop with Google Drive and an account with an AI tool such as ChatGPT, Gemini, Claude, or Copilot. Installing Tabula and OpenRefine will help you get more out of the session.
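One recurring step in multi-body FOI projects is normalizing the same figure stated in different ways. A small sketch of that idea (the council names and answers are invented), with unparseable answers flagged for manual checking rather than silently dropped:

```python
# FOI responses often state the same figure differently. This invented
# example normalizes free-text answers into comparable numbers; answers
# that cannot be parsed come back as None, flagging them for review --
# the kind of validation step needed when checking AI-assisted outputs.
import re

def parse_count(answer: str):
    """Extract an integer count from a free-text FOI answer, or None."""
    cleaned = answer.replace(",", "")
    match = re.search(r"\b(\d+)\b", cleaned)
    return int(match.group(1)) if match else None

responses = {
    "Council A": "There were 1,204 incidents recorded.",
    "Council B": "12 cases",
    "Council C": "Information not held",
}
parsed = {body: parse_count(text) for body, text in responses.items()}
print(parsed)  # {'Council A': 1204, 'Council B': 12, 'Council C': None}
```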
Speakers
avatar for Paul Bradshaw

Paul Bradshaw

Journalist and Academic, BBC/Birmingham City University
Paul Bradshaw runs the MA in Data Journalism at Birmingham City University and also works as a consulting data journalist with the BBC Shared Data Unit. A journalist, writer and trainer, he has worked with news organisations including The Guardian, Telegraph, Mirror, Der Tagesspi... Read More →
avatar for Ioanna Petsiou

Ioanna Petsiou

Data Journalist, Freelancer
Ioanna Petsiou is an investigative data journalist working across data analysis, satellite imagery, and mapping to uncover and explain complex stories. She is particularly drawn to environmental reporting and to building clear, reproducible ways of working with data that others can... Read More →
Saturday May 30, 2026 1:45pm - 3:00pm CEST
3.05

1:45pm CEST

Turning raw data into reliable sources: Python for journalists
Saturday May 30, 2026 1:45pm - 3:00pm CEST
Have you ever tried to investigate how much groceries or rent in your city really impact people’s budgets? Journalists don’t always get all the data in one place. Often, we find it in ads, public announcements, or different sources, then clean, structure, and track it over time, compare it with other datasets, or monitor changes to uncover trends.

This hands-on workshop teaches journalists how to clean, transform, and structure real newsroom data using Python. Participants will learn practical techniques to handle messy data, including changing data types, filtering by values or dates, splitting columns, labeling and recoding, calculating averages and percentages, and extracting quantities from text fields. The session also covers tasks specific to regional datasets, such as converting scripts from Cyrillic to Latin.

With these skills, journalists can analyze grocery prices and compare them with income data or calculate meal costs to report on rising food prices, examine traffic accident data near schools, or track public officials’ gifts and benefits. By the end of the workshop, participants will have concrete tools and workflows to turn raw data into reliable sources ready for investigation and reporting.

To follow along, participants should have some experience with Python basics and working with datasets. After attending this session, participants will be able to turn messy data into clean, reliable sources, compare thousands of entries, and extract insights for investigative stories using Python. Participants should have Python installed on their own computers to follow along. This tutorial can also be accessed via Google Colab, where most of the steps are similar, though Python installed locally is the recommended option for a smoother experience.
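Two of the cleaning steps mentioned above, sketched with the stdlib: transliterating Cyrillic to Latin and extracting a quantity from a free-text field. The transliteration mapping is deliberately partial and the item string invented, for illustration only:

```python
# Illustrative sketch of two cleaning steps: Cyrillic-to-Latin
# transliteration via str.translate (mapping is partial, illustration
# only) and pulling an amount/unit pair out of a free-text field.
import re

CYR_TO_LAT = str.maketrans({"м": "m", "л": "l", "е": "e", "к": "k", "о": "o"})

def transliterate(text: str) -> str:
    return text.translate(CYR_TO_LAT)

def extract_quantity(text: str):
    """Return (amount, unit) from strings like 'mleko 1.5 l', else None."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*(kg|g|l|ml)\b", text)
    return (float(m.group(1)), m.group(2)) if m else None

item = transliterate("млеко 1.5 l")
print(item)                    # 'mleko 1.5 l'
print(extract_quantity(item))  # (1.5, 'l')
```

A production version would use a full transliteration table (or a library) and validate units, but the pattern of normalize-then-extract is the same at any scale.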
Speakers
avatar for Teodora Curcic

Teodora Curcic

BBC
Teodora Ćurčić is an investigative and data journalist from Serbia with over seven years of experience reporting on corruption, political finance, gender-based violence, and social justice. She spent most of her career at the award-winning Center for Investigative Journalism of... Read More →
VS

Verena Steinacher

Data Engineer, pub.tech
Saturday May 30, 2026 1:45pm - 3:00pm CEST
3.04

3:30pm CEST

Hack your CMS (and the rest of the web!): Tampermonkey 101
Saturday May 30, 2026 3:30pm - 4:00pm CEST
Tampermonkey is an age-old browser extension that allows you to inject scripts and stylesheets into any web page, turning the web into your personal playground. We'll look at how to customize your CMS with DIY features, add "Download all" buttons to paginated websites, automate tedious processes like filling out forms, and redesign websites however you'd like. Best of all, Tampermonkey scripts are saveable and shareable, allowing you to give other members of your newsroom superpowers without fiddling with distributing extensions or asking them to run Python scripts. To follow along, participants should be able to install extensions in their web browser of choice.
Speakers
avatar for Jonathan Soma

Jonathan Soma

Knight Chair in Data Journalism, Columbia University
Jonathan Soma is the Knight Chair in Data Journalism at Columbia University, where he serves as Director of the Data Journalism MS program and the Lede Program, an intensive data journalism summer course. His lectures cover everything from basic Python and data analysis to interactive... Read More →
Saturday May 30, 2026 3:30pm - 4:00pm CEST
3.05

3:30pm CEST

Make a publication-ready static map with QGIS
Saturday May 30, 2026 3:30pm - 4:00pm CEST
In this demo, participants will learn how to create a static map in QGIS that is ready for publication. The session will cover setting map dimensions, selecting a basemap, adding geospatial data, and incorporating key design elements such as text annotations, a north arrow, a scale bar, an inset map, and images. Participants will also learn how to export the finished map as a JPG.

Download and install QGIS on your laptop before the session and confirm that it opens properly. MacBook users who run into security warnings when opening QGIS can follow the workaround here.
Speakers
avatar for Kuang Keng Kuek Ser

Kuang Keng Kuek Ser

Senior Editor for Rainforest Investigations, Pulitzer Center
Kuang Keng Kuek Ser is the Senior Editor for Rainforest Investigations at the Pulitzer Center, a non-profit organization based in Washington, DC that supports independent journalists globally. He supports and mentors three fellowships investigating issues related to tropical rainforest... Read More →
Saturday May 30, 2026 3:30pm - 4:00pm CEST
2.03

3:30pm CEST

One template, many stories: Parameterized reports with Quarto
Saturday May 30, 2026 3:30pm - 4:00pm CEST
Learn how to build reusable report templates in Quarto that generate multiple outputs (PDF, HTML, Word documents) from a single source document. By defining parameters — such as a region, time period, or data source — you can produce dozens or even hundreds of tailored reports without duplicating code or copy-pasting results.

This is especially useful for cross-border investigations, where partners share a common dataset, but each team needs a report focused on its own country. Build the analysis once, then render a customized version for each partner with only their slice of the data.

To follow along, participants should have basic familiarity with Quarto, R Markdown, or Jupyter notebooks, and some experience writing code in R or Python.
Speakers
avatar for Leopold Salzenstein

Leopold Salzenstein

Data coordinator, Arena for Journalism in Europe
Leopold Salzenstein is a freelance investigative data journalist and trainer based in the south of France. At Arena, he coordinates the handling of data for publications and trainings. He is also a member of the collective of journalists Environmental Investigative Forum (EIF).

... Read More →
Saturday May 30, 2026 3:30pm - 4:00pm CEST
1.04

4:15pm CEST

Beyond data cleaning: Enhancing OpenRefine with LLM
Saturday May 30, 2026 4:15pm - 4:45pm CEST
Data journalism has always relied on clean, structured data, but cleaning messy datasets remains one of the most time-consuming parts of the workflow. Enter OpenRefine, our old buddy for data wrangling, now enhanced by Large Language Models (LLMs).

In this 20-minute session, we explore how combining OpenRefine’s powerful transformation capabilities with modern AI unlocks new possibilities for journalists. Using the open-source LLM extension for OpenRefine, we’ll demonstrate practical workflows for:
- Automated Enrichment: Extracting entities, categorizing content, and enriching records using natural language prompts.
- Smart Disambiguation: Resolving inconsistencies and matching fuzzy data with AI-assisted reconciliation.
- Rapid Prototyping: Turning raw, unstructured text into structured datasets ready for investigation.

Why This Matters Now: Journalists are increasingly working with large, messy datasets, from leaked documents to public records.

While LLMs offer powerful analysis, they often lack precision on structured data. OpenRefine provides that precision. Together, they create a workflow that is both scalable and auditable, which is critical for investigative reporting where accuracy is non-negotiable.

What Attendees Will Take Away:
- A clear understanding of how to integrate LLMs into existing OpenRefine workflows.
- Practical examples relevant to journalistic investigations (entity extraction, classification, enrichment).

To attend this session, participants should have experience with data cleaning.
Speakers
avatar for Herve Letoqueux

Herve Letoqueux

OpenFacto
Co-Founder of OpenFacto with Lou (@CapteursOuverts) and Aliaume (@yaolri), a french NGO dedicated to online investigation for journalists and activists, I love OpenSource researches, Python, Gephi, R and OpenRefine. I used to deal with money laundering, financial frauds and terrorism... Read More →
Saturday May 30, 2026 4:15pm - 4:45pm CEST
1.04

4:15pm CEST

From 007 to n8n - build your own no-code AI Agents
Saturday May 30, 2026 4:15pm - 4:45pm CEST
With so-called low-code platforms like n8n, you can quickly click together programs that would otherwise require tedious Python coding. And you can integrate LLMs at various points to, for example, extract information from texts or summarize content. This allows you to build complex workflows. Receive a Teams message from an agent when a nearby river level approaches extreme values? No problem! Automatically monitor the police website for accident reports and generate suggestions for brief news items? With n8n, this can be automated quickly. This workshop provides an introduction to the free platform n8n. No prior knowledge is expected.
Speakers
avatar for Claus Hesseling

Claus Hesseling

Freelance Journalist and Trainer
Does data things for NDR and HR, builds tools for newsrooms at the Interlink Academy as part of the EU project INJECT, and is a trainer at the ARD.ZDF Media Academy and elsewhere. Twitter: @the_claus... Read More →
Saturday May 30, 2026 4:15pm - 4:45pm CEST
2.03

4:15pm CEST

No download button? Getting web data without writing a scraper
Saturday May 30, 2026 4:15pm - 4:45pm CEST
Journalists often run into data that is visible on a website but impossible to download directly: a table buried in a government page, a list of public records, or search results that change with every query. Writing a full scraper can be time-consuming and technically demanding for what is often a one-time task.

This session introduces three lightweight approaches that cover most of these cases: reading a table directly from a page using pandas, downloading raw HTML and parsing it into a dataframe, and pulling data through network requests. These techniques are practical tools for everyday newsroom situations. Participants will take home a GitHub repository with a working notebook to try on their own data, though some adaptation will be needed to apply it to different websites.

The three approaches vary in complexity. Basic Python knowledge is enough to follow along, but participants with more experience will be able to go further, and the code can be adapted with the help of an LLM.
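The third approach is often the most robust: many pages load their data from a JSON endpoint you can spot in the browser's network tab, and once found you can fetch it directly with `urllib` or `requests` and flatten it into rows. The payload below is an invented example of such a response:

```python
# Sketch of the network-request approach: instead of scraping rendered
# HTML, fetch the JSON endpoint the page itself calls and flatten it.
# The payload here is invented; in practice you would find the URL in
# your browser's dev tools and download it with urllib or requests.
import json

payload = json.loads("""
{"results": [
  {"name": "Clinic A", "city": "Mechelen", "beds": 120},
  {"name": "Clinic B", "city": "Ghent",    "beds": 85}
]}
""")

rows = [(r["name"], r["city"], r["beds"]) for r in payload["results"]]
print(rows)  # [('Clinic A', 'Mechelen', 120), ('Clinic B', 'Ghent', 85)]
```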
Speakers
avatar for Teodora Curcic

Teodora Curcic

BBC
Teodora Ćurčić is an investigative and data journalist from Serbia with over seven years of experience reporting on corruption, political finance, gender-based violence, and social justice. She spent most of her career at the award-winning Center for Investigative Journalism of... Read More →
Saturday May 30, 2026 4:15pm - 4:45pm CEST
3.05

5:15pm CEST

"The Mechelen Connection" (Escape Room)
Saturday May 30, 2026 5:15pm - 5:45pm CEST
There's a (genuine) story hiding in plain sight. Using a mixture of OSINT skills and clues hidden in the room, you will have 30 minutes to get to the story. First come, first served, max 3-4 teams per session competing to get the exit code and get out!
Speakers
avatar for Jonathan Stoneman

Jonathan Stoneman

Arena for Journalism in Europe
Former BBC journalist turned data journalist, trainer, and consultant. Works with Arena as Lead Trainer, Arena Academy.
Saturday May 30, 2026 5:15pm - 5:45pm CEST
3.04

5:15pm CEST

How to look up named entities in text – fast
Saturday May 30, 2026 5:15pm - 5:45pm CEST
Have you ever stumbled over the problem "I have a bunch of documents, give me all the politicians named in them"? If so, you know the hassle: NER is noisy, and qualifying names (is this a politician or not?) requires external services, APIs, or a large language model.

Or use "Juditha": an open-source poor man's entity extraction and resolution tool. No external service required; just put in your list of names and then extract them from arbitrary unstructured content. It works on any laptop, super fast. Of course it works with names of criminals, too. Or company names. Whatever you need.

In this session I'll walk through how to use the "juditha" command line and how to populate it with names of interest. By the end, anyone can take it home to detect the names that matter in their material.

Knowing how to use a command line and install Python packages helps. If you have ever suffered the problems of named entity recognition, you'll have even more fun.
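This is not juditha's actual API, just a minimal illustration of the idea it implements: match a known list of names against arbitrary text, with light normalization so case and extra whitespace don't cause misses (the names and the sentence are invented):

```python
# Toy illustration of dictionary-based name lookup (NOT juditha's API):
# normalize the text, then check which known names appear in it.
import re

KNOWN = {"angela merkel", "viktor orban"}

def normalize(s: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", s.lower()).strip()

def find_known_names(text: str, names=KNOWN):
    norm = normalize(text)
    return sorted(n for n in names if n in norm)

text = "The contract was signed while Angela  Merkel was still in office."
print(find_known_names(text))  # ['angela merkel']
```

Real tools add fuzzy matching and fast indexing on top, which is what makes them practical on large document sets.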

Speakers
avatar for Simon Wörpel

Simon Wörpel

Director of Technology, Data and Research Center – DARC

Saturday May 30, 2026 5:15pm - 5:45pm CEST

5:15pm CEST

Mining data from unstructured documents
Saturday May 30, 2026 5:15pm - 5:45pm CEST
You have a folder of documents and you want to extract data points from each one. And the data isn't in a structured table with neat rows and columns either. Here's where string functions and regular expressions can help. The demonstration will be in R but the skills are generic to all languages.
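The session demos this in R, but the same idea translates directly to Python: pull named data points out of semi-structured document text with a regular expression. The document text below is invented:

```python
# Same technique as the R demo, in Python: named capture groups pull
# structured fields out of free-form document text (text is invented).
import re

doc = """Contract No: 2024-117
Awarded to: Acme Construction Ltd
Value: EUR 1,250,000"""

pattern = re.compile(
    r"Contract No:\s*(?P<contract>\S+).*?"
    r"Awarded to:\s*(?P<winner>.+?)\n.*?"
    r"Value:\s*EUR\s*(?P<value>[\d,]+)",
    re.DOTALL,
)
m = pattern.search(doc)
record = {k: v.strip() for k, v in m.groupdict().items()}
record["value"] = int(record["value"].replace(",", ""))
print(record)
# {'contract': '2024-117', 'winner': 'Acme Construction Ltd', 'value': 1250000}
```

Run the same pattern over every file in the folder and you have a dataset instead of a pile of documents.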
Speakers
avatar for Robert Gebeloff

Robert Gebeloff

Reporter, New York Times
Robert Gebeloff has worked as a data projects reporter for The New York Times since 2008 and has taught data journalism for many years in newsrooms and at conferences. He was co-winner of the George Polk Award in 2015 and was a Pulitzer Prize finalist in both 2015 and 2016 for projects... Read More →
Saturday May 30, 2026 5:15pm - 5:45pm CEST
3.05

6:00pm CEST

"The Mechelen Connection" (Escape Room - 2nd session)
Saturday May 30, 2026 6:00pm - 6:30pm CEST
There's a (genuine) story hiding in plain sight. Using a mixture of OSINT skills and clues hidden in the room, you will have 30 minutes to get to the story. First come, first served, max 3-4 teams per session competing to get the exit code and get out!
Speakers
avatar for Jonathan Stoneman

Jonathan Stoneman

Arena for Journalism in Europe
Former BBC journalist turned data journalist, trainer, and consultant. Works with Arena as Lead Trainer, Arena Academy.
Saturday May 30, 2026 6:00pm - 6:30pm CEST
3.04

6:00pm CEST

Bluetooth Trackers for Investigations
Saturday May 30, 2026 6:00pm - 6:30pm CEST
Bluetooth trackers can help you develop interesting investigations. This team started using trackers while following two cars from Germany to Siberia, then a parcel from Prague to Moscow. In late 2024, they tracked more than 230 letters sent within Germany, using up to 80 trackers simultaneously. For almost 18 months they tracked 24 items of electronic waste from Germany to places as far afield as Pakistan.

In this session, the team will share the lessons and the technology behind all these projects, including the scraping tools and software that powered them. They will also bring some trackers and covers to inspire colleagues to use these devices, and share lessons learnt from ongoing collaborations in various countries where other journalists and newsrooms have licensed them to move their projects forward.
Speakers
avatar for Marcus Lindemann

Marcus Lindemann

Managing Author, autoren(werk) GmbH & Co. KG
Marcus Lindemann teaches research, TV journalism, and press law, and is managing author of the TV production company autoren(werk). For 25 years he has produced magazine segments and documentaries for public broadcasters, particularly on business and... Read More →
Saturday May 30, 2026 6:00pm - 6:30pm CEST

6:00pm CEST

Modern document processing with Natural PDF
Saturday May 30, 2026 6:00pm - 6:30pm CEST
Say hello to Natural PDF, a new Python library for wrangling PDFs that's focused on usability and feature-completeness. Process PDFs with scraping-like selectors and spatially-aware queries, asking for "the red alphanumeric string" or "the content below the big Summary header." Beyond the basics, Natural PDF is also full of modern conveniences like table detection, multiple OCR engines, and citation-aware LLM data extraction.

To get the most out of this session, participants should have experience with Python and with struggling through terrible PDFs.
Speakers
avatar for Jonathan Soma

Jonathan Soma

Knight Chair in Data Journalism, Columbia University
Jonathan Soma is the Knight Chair in Data Journalism at Columbia University, where he serves as Director of the Data Journalism MS program and the Lede Program, an intensive data journalism summer course. His lectures cover everything from basic Python and data analysis to interactive... Read More →
Saturday May 30, 2026 6:00pm - 6:30pm CEST
3.05
 
Sunday, May 31
 

9:30am CEST

Update your Google skills
Sunday May 31, 2026 9:30am - 10:45am CEST
Has Google search stayed the same since 1996? No: Google does change over time, but so slowly that most people don't notice. This session will give you an update on recent changes (i.e., in the last 4-5 years), point out workarounds where necessary, and show you what is really new and useful. Toward the end, it will give you some advanced Google dorks for immediate journalistic use, and also inspire you to build your own dorks and to combine LLMs with Google searches.

To follow along, participants should have used Google operators before. After attending the session, you will have up-to-date knowledge of Google's web search and other tools for journalistic use.

A Google account can be useful, but is not a must-have.
Speakers
avatar for Marcus Lindemann

Marcus Lindemann

Managing Author, autoren(werk) GmbH & Co. KG
Marcus Lindemann teaches research, TV journalism, and press law, and is managing author of the TV production company autoren(werk). For 25 years he has produced magazine segments and documentaries for public broadcasters, particularly on business and... Read More →
Sunday May 31, 2026 9:30am - 10:45am CEST
3.05

11:15am CEST

A map for every reader: how to generate hundreds of images for multiple audiences or partners using QGIS and Python
Sunday May 31, 2026 11:15am - 12:30pm CEST
The BBC Shared Data Unit wanted to generate a map image for each authority in the UK showing the state of flood defences in that area — so they turned to the mapping tool QGIS’s built-in Python functionality.

In this session, you will learn how to generate and export dozens of maps in QGIS centred at different points, and how AI can help speed up the process.

To follow along, participants should have some basic knowledge of QGIS and be comfortable using Python or vibe coding.

After attending this session, participants should be able to understand how Python works in QGIS and use AI to help generate, understand, and adapt code. Participants should have QGIS and Python installed on the computer (qgis.org/download + python.org/downloads) and a free account with an AI tool such as ChatGPT, Gemini, or Claude
Speakers
avatar for Paul Bradshaw

Paul Bradshaw

Journalist and Academic, BBC/Birmingham City University
Paul Bradshaw runs the MA in Data Journalism at Birmingham City University and also works as a consulting data journalist with the BBC Shared Data Unit. A journalist, writer and trainer, he has worked with news organisations including The Guardian, Telegraph, Mirror, Der Tagesspi... Read More →
avatar for Ioanna Petsiou

Ioanna Petsiou

Data Journalist, Freelancer
Ioanna Petsiou is an investigative data journalist working across data analysis, satellite imagery, and mapping to uncover and explain complex stories. She is particularly drawn to environmental reporting and to building clear, reproducible ways of working with data that others can... Read More →
Sunday May 31, 2026 11:15am - 12:30pm CEST
3.02

11:15am CEST

Text embeddings: navigating text in high dimensions
Sunday May 31, 2026 11:15am - 12:30pm CEST
Most "big data" problems in journalism aren't really data problems; they're reading problems: a big leak, a ministry dump of 12,000 pages, or an FOI request coming back as a zip of PDFs. The instinct is to search, but keyword search assumes you already know what you're looking for, which is sometimes exactly what you don't know yet.

This session introduces embeddings: a technique that turns any text into a point in space, positioned by meaning, so texts with similar meaning end up close together. You stop searching a pile and start looking at it.

To make the idea tangible, we'll walk through a live semantic map we built of Google's "trending now" feeds from 125 countries, projected into 3D.

The method applies beyond trending searches and is applicable to TikTok captions, YouTube transcripts, court filings, a scraped forum, or years of parliamentary speeches.

We'll cover the full workflow end to end: how to embed your corpus, how to project it without losing what matters, how to build a map you can actually navigate, and where this approach breaks.

To follow along, participants should be comfortable running basic Python scripts on their laptop or in Google Colab.

After attending this session, participants will be able to take a large, unstructured text corpus and turn it into a navigable semantic map.

Participants should have Python installed on their computer, or a Google account where they can run Colab. A Hugging Face account is recommended for generating embeddings. We will provide examples of text to work with, but if you have your own collection, feel free to bring it; just make sure it's in a text format, as we won't cover how to convert PDFs into text.
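Real embeddings come from a trained model (for example via sentence-transformers or the Hugging Face API); the toy bag-of-words version below only illustrates the geometry the session builds on: texts become vectors, and texts with shared meaning score closer together. The sample sentences are invented:

```python
# Toy stand-in for embeddings (real ones come from a trained model):
# each text becomes a word-count vector, and cosine similarity measures
# closeness. Note the limitation: "flood" and "flooding" don't match
# here, which is exactly what learned embeddings fix.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = ["flood damage in the region",
        "flooding damaged the region",
        "cat videos trending"]
base = embed(docs[0])
scores = [round(cosine(base, embed(d)), 2) for d in docs]
print(scores)  # the related sentence scores higher than the unrelated one
```

Swap `embed` for a model call and the rest of the pipeline (similarity, projection, mapping) stays conceptually the same.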
Speakers
avatar for Johan Schujit

Johan Schujit

Data Engineer, Resolve.
I'm a data engineer responsible for EveryPolitician and PoliLoom at OpenSanctions. I'm a self-taught hacker with a stubborn belief that good data should be open and technology should serve the public interest. Previously at Follow the Money.

avatar for Ada Homolova

Ada Homolova

ARENA, Austria/ Slovakia
Adriana is a freelance data journalist, trainer and public spending nerd. She coordinates the data skills training track on the Dataharvest conference, and herds frogs at The Pond.

https://homolova.sk/newsletter
Sunday May 31, 2026 11:15am - 12:30pm CEST
1.14
 