Sunday May 31, 2026 11:15am - 12:30pm CEST
Most "big data" problems in journalism aren't really data problems; they're reading problems: a big leak, a ministry dump of 12,000 pages, or an FOI request that comes back as a zip of PDFs. The instinct is to search, but keyword search assumes you already know what you're looking for, and sometimes that is exactly what you don't know yet.

This session introduces embeddings: a technique that turns any text into a point in space, positioned by meaning, so texts with similar meaning end up close together. You stop searching a pile and start looking at it.
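As a rough illustration of "close together", here is a minimal sketch that compares texts by the angle between their vectors (cosine similarity). The three-dimensional vectors are invented by hand for this example and do not come from any model; real embeddings typically have hundreds of dimensions. The names `toy_vectors` and `nearest` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors, invented for illustration only (not real embeddings).
toy_vectors = {
    "flood warning issued": [0.9, 0.1, 0.0],
    "river bursts its banks": [0.8, 0.2, 0.1],
    "football transfer rumours": [0.0, 0.1, 0.9],
}

def nearest(query, vectors):
    """Return the other text whose vector is most similar to the query's."""
    others = [text for text in vectors if text != query]
    return max(others, key=lambda t: cosine_similarity(vectors[query], vectors[t]))

print(nearest("flood warning issued", toy_vectors))
```

The two flood-related texts share no keywords, yet their vectors point in nearly the same direction, which is the property keyword search lacks.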

To make the idea tangible, we'll walk through a live semantic map we built of Google's "trending now" feeds from 125 countries, projected into 3D.

The method applies well beyond trending searches: it works just as well on TikTok captions, YouTube transcripts, court filings, a scraped forum, or years of parliamentary speeches.

We'll cover the full workflow end to end: how to embed your corpus, how to project it without losing what matters, how to build a map you can actually navigate, and where this approach breaks.
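The embed-then-project part of that workflow can be sketched with the standard library alone. This is a toy stand-in under stated assumptions, not the session's actual pipeline: `embed` uses plain word counts instead of a model's embeddings, and `project_2d` uses a basic PCA (power iteration with deflation) to get two coordinates rather than the 3D projection mentioned above. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def embed(texts):
    """Stand-in embedding: one word-count vector per text (not a neural model)."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    rows = []
    for t in texts:
        counts = Counter(t.lower().split())
        rows.append([float(counts[w]) for w in vocab])
    return rows

def center(rows):
    """Subtract the column means so the data is centred on the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def top_direction(rows, iterations=200, seed=0):
    """Power iteration on X^T X: the direction of greatest variance."""
    rng = random.Random(seed)
    dim = len(rows[0])
    v = [rng.uniform(-1, 1) for _ in range(dim)]
    for _ in range(iterations):
        scores = [sum(x * y for x, y in zip(row, v)) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance left to find
        v = [x / norm for x in w]
    return v

def project_2d(rows):
    """Project vectors onto their first two principal directions."""
    rows = center(rows)
    coords = [[] for _ in rows]
    for _ in range(2):
        v = top_direction(rows)
        scores = [sum(x * y for x, y in zip(row, v)) for row in rows]
        for c, s in zip(coords, scores):
            c.append(s)
        # Deflate: remove the component just found before looking for the next.
        rows = [[x - s * vj for x, vj in zip(row, v)] for row, s in zip(rows, scores)]
    return coords

texts = [
    "flood warning river",
    "river flood damage",
    "football transfer deal",
    "football transfer rumours",
]
coords = project_2d(embed(texts))
for text, (x, y) in zip(texts, coords):
    print(f"{x:+.2f} {y:+.2f}  {text}")
```

Each text ends up as a pair of coordinates that could be handed to any plotting tool. Word counts only capture shared vocabulary, so swapping `embed` for a real embedding model is what makes the map reflect meaning rather than wording.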

To follow along, participants should be comfortable running basic Python scripts on their laptop or in Google Colab.

After attending this session, participants will be able to take a large, unstructured text corpus and turn it into a navigable semantic map.

Participants should have Python installed on their computer, or a Google account they can use to run Colab. A Hugging Face account is recommended for generating embeddings. We will provide example texts to work with, but if you have your own collection, feel free to bring it. Make sure it's in a text format, as we won't cover how to convert PDFs into text.
Speakers
Johan Schujit

Data Engineer, Resolve.
I'm a data engineer responsible for EveryPolitician and PoliLoom at OpenSanctions. I'm a self-taught hacker with a stubborn belief that good data should be open and technology should serve the public interest. Previously at Follow the Money.

Ada Homolova

ARENA, Austria/Slovakia
Adriana is a freelance data journalist, trainer and public spending nerd. She coordinates the data skills training track at the Dataharvest conference, and herds frogs at The Pond.

https://homolova.sk/newsletter
1.14
