Most "big data" problems in journalism aren't really data problems; they're reading problems: a big leak, a ministry dump of 12,000 pages, or a FOI request that comes back as a zip of PDFs. The instinct is to search, but keyword search assumes you already know what you're looking for, and sometimes that's exactly the thing you don't know yet.
This session introduces embeddings: a technique that turns any text into a point in space, positioned by meaning, so texts with similar meaning end up close together. You stop searching a pile and start looking at it.
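The core idea fits in a few lines of code. Here is a toy illustration with made-up 3-dimensional vectors (real embedding models output hundreds of dimensions, and the texts and numbers below are invented for the example): "close in space" is measured with cosine similarity, and semantically related texts score higher with each other than with unrelated ones.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, hand-crafted for illustration only.
vectors = {
    "flood warnings issued":    np.array([0.9, 0.1, 0.0]),
    "river bursts its banks":   np.array([0.8, 0.2, 0.1]),
    "election results delayed": np.array([0.1, 0.9, 0.2]),
}

same_topic = cosine_similarity(vectors["flood warnings issued"],
                               vectors["river bursts its banks"])
diff_topic = cosine_similarity(vectors["flood warnings issued"],
                               vectors["election results delayed"])
print(same_topic > diff_topic)  # prints True: the flood texts sit closer
```

In the session, the vectors come from a real embedding model rather than being written by hand, but the geometry works exactly the same way.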
To make the idea tangible, we'll walk through a live semantic map we built of Google's "trending now" feeds from 125 countries, projected into 3D.
The method isn't limited to trending searches: it works just as well on TikTok captions, YouTube transcripts, court filings, a scraped forum, or years of parliamentary speeches.
We'll cover the full workflow end to end: how to embed your corpus, how to project it without losing what matters, how to build a map you can actually navigate, and where this approach breaks.
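The shape of that pipeline can be sketched in miniature. In the session we'll use a real embedding model (for example via sentence-transformers or the Hugging Face API) and a projection method like UMAP; the sketch below substitutes a deterministic hashed bag-of-words "embedding" and a plain PCA so it runs with numpy alone. Those stand-ins are assumptions for illustration, not the workshop's actual tools; what carries over is the structure: text in, vectors out, 3D coordinates ready to plot.

```python
import zlib
import numpy as np

def toy_embed(text, dim=64):
    # Stand-in for a real embedding model: hash each word into a slot of a
    # fixed-size vector. A real model captures meaning; this only captures
    # shared vocabulary, but it has the same shape: text -> vector.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[zlib.crc32(word.encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def project_3d(X):
    # PCA via SVD: keep the 3 directions of greatest variance.
    # In the workshop we'd reach for UMAP, which preserves local clusters
    # better than PCA on real embedding data.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:3].T

corpus = [
    "flood warning issued for the coast",
    "heavy rain floods the city centre",
    "parliament debates the new budget",
    "budget vote delayed in parliament",
]
X = np.vstack([toy_embed(t) for t in corpus])  # one vector per document
coords = project_3d(X)                          # one 3D point per document
print(coords.shape)  # prints (4, 3)
```

Swapping `toy_embed` for a real model and `project_3d` for UMAP is the whole of the upgrade path; the surrounding code does not change.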
To follow along, participants should be comfortable running basic Python scripts on their laptop or in Google Colab.
After attending this session, participants will be able to take a large, unstructured text corpus and turn it into a navigable semantic map.
Participants should have Python installed on their computer, or a Google account so they can run Colab. A Hugging Face account is recommended for generating embeddings. We will provide example texts to work with; if you have your own collection, feel free to bring it, but make sure it's in a plain-text format, as we won't cover how to convert PDFs into text.