Dataharvest 2026 - the European Investigative Journalism Conference: Full Schedule

3:30pm CEST

Hack your CMS (and the rest of the web!): Tampermonkey 101

Saturday May 30, 2026 3:30pm - 4:00pm CEST

Tampermonkey is a long-standing browser extension that allows you to inject scripts and stylesheets into any web page, turning the web into your personal playground. We'll look at how to customize your CMS with DIY features, add "Download all" buttons to paginated websites, automate tedious processes like filling out forms, and redesign websites however you'd like. Best of all, Tampermonkey scripts are saveable and shareable, allowing you to give other members of your newsroom superpowers without fiddling with distributing extensions or asking them to run Python scripts. To follow along, participants should be able to install extensions in their web browser of choice.

Materials: https://jsoma.github.io/workshop-tampermonkey/

Speakers

Jonathan Soma

Knight Chair in Data Journalism, Columbia University

Jonathan Soma is the Knight Chair in Data Journalism at Columbia University, where he serves as Director of the Data Journalism MS program and the Lede Program, an intensive data journalism summer course. His lectures cover everything from basic Python and data analysis to interactive...

Z1.15 - Aula Donche

Data skills, Mini

3:30pm CEST

Make a publication-ready static map with QGIS

Saturday May 30, 2026 3:30pm - 4:00pm CEST

3.05

In this demo, participants will learn how to create a static map in QGIS that is ready for publication. The session will cover setting map dimensions, selecting a basemap, adding geospatial data, and incorporating key design elements such as text annotations, a north arrow, a scale bar, an inset map, and images. Participants will also learn how to export the finished map as a JPG.

Download and install QGIS on your laptop before the session and confirm that it opens properly. MacBook users who run into security warnings when opening QGIS can follow the workaround here

Materials: https://github.com/kuangkeng/dataharvest2026-qgis-mapmaking

Speakers

Kuang Keng Kuek Ser

Senior Editor for Rainforest Investigations, Pulitzer Center

Kuang Keng Kuek Ser is the Senior Editor for Rainforest Investigations at the Pulitzer Center, a non-profit organization based in Washington, DC that supports independent journalists globally. He supports and mentors three fellowships investigating issues related to tropical rainforest...


Data skills, Mini

3:30pm CEST

One template, many stories: Parameterized reports with Quarto

Saturday May 30, 2026 3:30pm - 4:00pm CEST

3.09

Learn how to build reusable report templates in Quarto that generate multiple outputs (PDF, HTML, Word documents) from a single source document. By defining parameters — such as a region, time period, or data source — you can produce dozens or even hundreds of tailored reports without duplicating code or copy-pasting results.

This is especially useful for cross-border investigations, where partners share a common dataset, but each team needs a report focused on its own country. Build the analysis once, then render a customized version for each partner with only their slice of the data.
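As a rough sketch of the cross-border case, one way to produce a report per partner is to loop over country codes and call the Quarto command line with a `-P name:value` parameter for each. The template name `report.qmd`, the parameter name `country`, and the country codes are hypothetical and not taken from the session materials. The code only builds the commands by default; actually rendering assumes Quarto is installed and the template declares a matching parameter.

```python
import subprocess


def build_render_commands(template, countries, fmt="html"):
    """Build one `quarto render` command per country.

    `country` is a hypothetical parameter name; the template must declare it.
    """
    commands = []
    for code in countries:
        commands.append([
            "quarto", "render", template,
            "-P", f"country:{code}",            # pass the parameter to the template
            "--to", fmt,                        # output format, e.g. html, pdf, docx
            "--output", f"report-{code}.{fmt}",  # one output file per country
        ])
    return commands


def render_all(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Hypothetical template and country codes, printed rather than executed.
render_all(build_render_commands("report.qmd", ["BE", "FR", "DE"]))
```

Keeping the command construction separate from execution makes it easy to check the full list of outputs before spending time on rendering.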

To follow along, participants should have basic familiarity with Quarto, R Markdown, or Jupyter notebooks, and some experience writing code in R or Python.

Materials: https://leopold-salzenstein.github.io/eijc26_quarto/#/title-slide

Speakers

Leopold Salzenstein

Data coordinator, Arena for Journalism in Europe

Leopold Salzenstein is a freelance investigative data journalist and trainer based in the south of France. At Arena, he coordinates the handling of data for publications and trainings. He is also a member of the collective of journalists Environmental Investigative Forum (EIF).



Data skills, Mini

4:15pm CEST

Beyond data cleaning: Enhancing OpenRefine with LLM

Saturday May 30, 2026 4:15pm - 4:45pm CEST

3.05

Data journalism has always relied on clean, structured data, but cleaning messy datasets remains one of the most time-consuming parts of the workflow. Enter OpenRefine, our old buddy for data wrangling, now enhanced by Large Language Models (LLMs).

In this 30-minute session, we explore how combining OpenRefine’s powerful transformation capabilities with modern AI unlocks new possibilities for journalists. Using the open-source LLM extension for OpenRefine, we’ll demonstrate practical workflows for:
- Automated Enrichment: Extracting entities, categorizing content, and enriching records using natural language prompts.
- Smart Disambiguation: Resolving inconsistencies and matching fuzzy data with AI-assisted reconciliation.
- Rapid Prototyping: Turning raw, unstructured text into structured datasets ready for investigation.

Why This Matters Now: Journalists are increasingly working with large, messy datasets, from leaked documents to public records. While LLMs offer powerful analysis, they often lack precision on structured data. OpenRefine provides that precision. Together, they create a workflow that is both scalable and auditable, which is critical for investigative reporting where accuracy is non-negotiable.

What Attendees Will Take Away:
- A clear understanding of how to integrate local LLMs into existing OpenRefine workflows in a secure, even disconnected environment.
- Practical examples relevant to journalistic investigations (entity extraction/transforms, classification, enrichment).

To attend this session, participants should have some experience cleaning data with OpenRefine.

Repo on GitHub: https://github.com/herve-checkfirst/DataHarvest2026-Refine_with_llm

Please note that the Ministral 3b model is 2GB in size. Feel free to read the tutorial and install everything prior to the session.

Speakers

Hervé Letoqueux

CEO, Checkfirst

CEO of Check First, a Finnish company working on regulation (DSA...), FIMI, OSINT investigations and technology for CSOs and journalists. Former head of operations at VIGINUM, France. Also co-founder of OpenFacto, a French NGO dedicated to online investigation for journalists and activists...



Data skills, Mini

4:15pm CEST

From 007 to n8n - build your own no-code AI Agents

Saturday May 30, 2026 4:15pm - 4:45pm CEST

Z1.15 - Aula Donche

With so-called low-code platforms like n8n, you can quickly click together programs that would otherwise require tedious Python coding. And you can integrate LLMs at various points to, for example, extract information from texts or summarize content. This allows you to build complex workflows. Receive a Teams message from an agent when a nearby river level approaches extreme values? No problem! Automatically monitor the police website for accident reports and generate suggestions for brief news items? With n8n, this can be automated quickly. This workshop provides an introduction to the free platform n8n. No prior knowledge is expected.

Materials: https://github.com/chesselingfm/dataharvest26-n8n

Speakers

Claus Hesseling

Freelance AI, Data Journalist and Workshop Trainer, NDR

Data journalist and AI expert at public broadcaster NDR in Hamburg, Germany. Workshop trainer and lecturer since 2004.


Data skills, Mini

4:15pm CEST

No download button? Getting web data without writing a scraper

Saturday May 30, 2026 4:15pm - 4:45pm CEST

3.09

Journalists often run into data that is visible on a website but impossible to download directly: a table buried in a government page, a list of public records, or search results that change with every query. Writing a full scraper can be time-consuming and technically demanding for what is often a one-time task.

This session introduces three lightweight approaches that cover most of these cases: reading a table directly from a page using pandas, downloading raw HTML and parsing it into a dataframe, and pulling data through network requests. These techniques are practical tools for everyday newsroom situations. Participants will take home a GitHub repository with a working notebook to try on their own data, though some adaptation will be needed to apply it to different websites.

The three approaches vary in complexity. Basic Python knowledge is enough to follow along, but participants with more experience will be able to go further, and the code can be adapted with the help of an LLM.
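To give a flavor of the second approach (parsing raw HTML into rows), here is a minimal sketch that uses only Python's standard library instead of pandas, so it runs without extra installs. It is not the session's notebook: the class and function names are hypothetical, and the sample table is invented for illustration. On a real page you would first download the HTML and would likely need to target one specific table.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Turn the table rows into a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Invented sample data standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Municipality</th><th>Contracts</th></tr>
  <tr><td>Mechelen</td><td>12</td></tr>
  <tr><td>Antwerp</td><td>30</td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)
print(records)
```

The resulting list of dictionaries can be handed to `csv.DictWriter` or loaded into a dataframe for further analysis.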

Materials: https://github.com/teodoracurcic/dh2026-getting-data

Speakers

Teodora Curcic

BBC

Teodora Ćurčić is an investigative and data journalist from Serbia with over seven years of experience reporting on corruption, political finance, gender-based violence, and social justice. She spent most of her career at the award-winning Center for Investigative Journalism of...


Data skills, Mini

5:15pm CEST

How to look up named entities in text – fast

Saturday May 30, 2026 5:15pm - 5:45pm CEST

3.09

Have you ever stumbled over the problem "I have a bunch of documents, give me all the politicians named in them"? If yes, you know the hassle: NER is noisy, and qualifying names (is this a politician or not?) requires external services, APIs or a large language model.

Or use "Juditha", an open-source poor man's entity extraction and resolution tool. No external service is required: just put in your list of names and then extract them from arbitrary unstructured content. It works on any laptop and is super fast. Of course it works with names of criminals, too. Or company names. Whatever you need.

In this session I'll walk through how to use the "juditha" command line and how to populate it with names of interest. At the end, you can take it home to detect the names that matter in your material.
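The general idea of looking up a known list of names in free text can be sketched in a few lines of Python. This is not Juditha's implementation or its API; it is a simplified illustration of list-based lookup with basic normalization, and the function names and example names are invented.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and collapse whitespace so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", no_accents).strip().lower()


def find_names(text, names):
    """Return the names of interest that occur in the text, in the order given."""
    haystack = normalize(text)
    found = []
    for name in names:
        # Word boundaries prevent matches inside longer words.
        pattern = r"\b" + re.escape(normalize(name)) + r"\b"
        if re.search(pattern, haystack):
            found.append(name)
    return found


# Invented names and text, for illustration only.
NAMES = ["Jane Doe", "José Pérez", "John Smith"]
TEXT = "The contract was signed by Jose  Perez and later approved by jane doe."

matches = find_names(TEXT, NAMES)
print(matches)
```

A plain loop like this gets slow with very long name lists and handles no fuzzy matching, which is where a dedicated tool earns its keep.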

Knowing how to use a command line and install Python packages helps. If you have ever suffered through the problems of named entity recognition, you'll have even more fun.

Juditha: https://github.com/dataresearchcenter/juditha

Speakers

Simon Wörpel

Director of Technology, Data and Research Center – DARC


Data skills, Mini

5:15pm CEST

Mining data from unstructured documents

Saturday May 30, 2026 5:15pm - 5:45pm CEST

Z1.13 - Aula Hanswijk

You have a folder of documents and you want to extract data points from each one. And the data isn't in a structured table with neat rows and columns either. Here's where string functions and regular expressions can help. The demonstration will be in R but the skills are generic to all languages.
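As an illustration of the technique, here is a minimal sketch of pulling data points out of one document's text with regular expressions and string functions. The session's demonstration is in R; this sketch uses Python, and the document text, field names, and patterns are invented rather than taken from the session materials.

```python
import re

# Invented document text standing in for one file from a folder.
DOCUMENT = """\
Inspection report no. 2024-0173
Date of inspection: 12/03/2024
Facility: Riverside Care Home
Fine imposed: EUR 4,500.00
"""

# One pattern per data point; the capture group is the value to keep.
PATTERNS = {
    "report_no": r"report no\.\s*(\d{4}-\d{4})",
    "date": r"Date of inspection:\s*(\d{2}/\d{2}/\d{4})",
    "facility": r"Facility:\s*(.+)",
    "fine_eur": r"Fine imposed:\s*EUR\s*([\d,]+\.\d{2})",
}


def extract_fields(text, patterns=PATTERNS):
    """Return one record per document; fields that do not match are None."""
    record = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        record[field] = match.group(1).strip() if match else None
    return record


record = extract_fields(DOCUMENT)

# String functions finish the job: drop the thousands separator, convert to a number.
if record["fine_eur"] is not None:
    record["fine_eur"] = float(record["fine_eur"].replace(",", ""))

print(record)
```

Running the same function over every file in a folder yields one row per document, with `None` flagging the documents whose layout the patterns did not anticipate.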

Materials: https://github.com/gebelo/Dataharvest2026

Speakers

Robert Gebeloff

Reporter, New York Times

Robert Gebeloff has worked as a data projects reporter for The New York Times since 2008 and has taught data journalism for many years in newsrooms and at conferences. He was co-winner of the George Polk Award in 2015 and was a Pulitzer Prize finalist in both 2015 and 2016 for projects...


Data skills, Mini

6:00pm CEST

Bluetooth Trackers for Investigations

Saturday May 30, 2026 6:00pm - 6:30pm CEST

3.09

Bluetooth trackers can help you develop interesting investigations. This team started using trackers while following two cars from Germany to Siberia, then a parcel from Prague to Moscow. In late 2024, they tracked more than 230 letters sent within Germany, using up to 80 trackers simultaneously. For almost 18 months they tracked 24 items of electronic waste from Germany to places as far afield as Pakistan.

In this session, the team will share what they learned from these projects and the technology, scraping tools and software behind them. They will also bring some trackers and covers to inspire colleagues to use these devices, and share lessons learnt from ongoing collaborations in various countries where other journalists and newsrooms licensed them to help move their projects forward.

Speakers

Marcus Lindemann

managing editor / geschäftsführender Autor, autoren(werk) GmbH & Co.KG

Marcus Lindemann is a lecturer in research, television journalism and media law, as well as managing director of the TV production company autoren(werk). For 25 years, he has been producing magazine features and documentaries for public service broadcasters, particularly on economic...


Data skills, Mini

6:00pm CEST

Modern document processing with Natural PDF

Saturday May 30, 2026 6:00pm - 6:30pm CEST

Z1.13 - Aula Hanswijk

Say hello to Natural PDF, a new Python library for wrangling PDFs that's focused on usability and feature-completeness. Process PDFs with scraping-like selectors and spatially-aware queries, asking for "the red alphanumeric string" or "the content below the big Summary header." Beyond the basics, Natural PDF is also full of modern conveniences like table detection, multiple OCR engines, and citation-aware LLM data extraction.

To get the most out of this session, participants should have experience with Python and with struggling through terrible PDFs.

Materials: https://jsoma.github.io/natural-pdf-workshop/

Speakers

Jonathan Soma

Knight Chair in Data Journalism, Columbia University


Data skills, Mini
