Sunday May 31, 2026 11:15am - 12:30pm CEST
Keyword search is great when you know what you're looking for. But what about the structure you don't know is there? The topics, the patterns, the outliers?

This workshop introduces a workflow for making large text collections navigable. The core idea is simple: a machine learning model reads your text and converts each piece into a list of numbers — an embedding — where texts with similar meaning get similar numbers. Once everything is numbers, you can do maths on meaning.
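As a minimal sketch of what "maths on meaning" can look like, the snippet below compares three hand-made vectors with cosine similarity, a common way to measure how close two embeddings are. The vectors and the `cosine_similarity` helper are assumptions made up for illustration; real embeddings come from a model and usually have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration (not produced by a real model).
embeddings = {
    "The minister resigned after the scandal": [0.9, 0.1, 0.2],
    "A cabinet member stepped down amid controversy": [0.8, 0.2, 0.3],
    "Recipe for a quick tomato soup": [0.1, 0.9, 0.1],
}

texts = list(embeddings)
# Two texts about the same event should score higher than an unrelated pair.
sim_related = cosine_similarity(embeddings[texts[0]], embeddings[texts[1]])
sim_unrelated = cosine_similarity(embeddings[texts[0]], embeddings[texts[2]])
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

The two sentences about a resignation share no keywords, yet their vectors point in nearly the same direction, which is the property the rest of the workflow relies on.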

That lets you do several things, for example:
  • Semantically search your corpus for documents with similar meaning to your query
  • Project it into two dimensions with UMAP, so you can plot your documents on a scatter chart and see where they cluster and where they don't
  • Find natural groupings with HDBSCAN, a clustering algorithm that discovers groups and flags documents that don't fit anywhere
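The first item in the list, semantic search, can be sketched in a few lines: embed the query, score every document against it, and return the closest matches. The example below is a rough illustration under stated assumptions, not the workshop's own code. The vectors are invented by hand, and `semantic_search` and `cosine_similarity` are hypothetical helper names; in practice an embedding model would produce the vectors for both the documents and the query.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, corpus, top_k=2):
    """Rank (text, vector) pairs by similarity to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in corpus]
    scored.sort(reverse=True)
    return scored[:top_k]


# Toy corpus with hand-made vectors (assumed values, not real model output).
corpus = [
    ("Council approves new housing budget", [0.8, 0.1, 0.1]),
    ("City hall votes on spending plan", [0.7, 0.2, 0.1]),
    ("Local team wins championship final", [0.1, 0.9, 0.2]),
    ("Storm warning issued for the coast", [0.1, 0.2, 0.9]),
]

# Stands in for the embedding of a query such as "municipal finances".
query_vec = [0.9, 0.1, 0.0]

results = semantic_search(query_vec, corpus)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

The same collection of vectors is what a dimensionality-reduction step such as UMAP and a clustering step such as HDBSCAN would take as input; those steps need dedicated libraries and are not reproduced here.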

The session is hands-on and code-based. You don't need prior experience with machine learning; we'll explain what's happening at each step. By the end, you'll understand how to take a pile of text and turn it into a visual map you can explore.

What to bring: a Google account for Colab. We'll provide example data to work with. You're welcome to bring your own, as long as it's in a text format.

Materials: https://resolveworks.github.io/dataharvest2026/#/
Speakers

Johan Schujit

Data Engineer, Resolve.
I'm a data engineer responsible for EveryPolitician and PoliLoom at OpenSanctions. I'm a self-taught hacker with a stubborn belief that good data should be open and technology should serve the public interest. Previously at Follow the Money.

Ada Homolova

Coordinator of the data skills track, Arena for Journalism in Europe
A freelance data journalist with over 10 years of experience in data and investigative journalism, cross-border reporting, and teaching. She has worked with both small and large newsrooms across Europe, including Correctiv, Follow The Money, OCCRP, and Lost in Europe. Her heart beats...
1.16
