Keyword search is great when you know what you're looking for. But what about the
structure you don't know is there? The topics, the patterns, the outliers?
This workshop introduces a workflow for making large text collections navigable. The core idea is simple: a machine learning model reads your text and converts each piece into a list of numbers, called an embedding, where texts with similar meaning get similar numbers. Once everything is numbers, you can do maths on meaning.
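To make "maths on meaning" concrete, here is a minimal sketch of comparing embeddings with cosine similarity, one common way to measure how close two vectors point. The three-number vectors and the example sentences are hand-made assumptions for illustration only; a real model produces hundreds of dimensions, and the workshop materials may use a different measure.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for this example (not model output).
embeddings = {
    "The council approved the budget": [0.9, 0.1, 0.0],
    "City officials signed off on spending": [0.8, 0.2, 0.1],
    "The striker scored twice": [0.0, 0.1, 0.9],
}

budget = embeddings["The council approved the budget"]
spending = embeddings["City officials signed off on spending"]
football = embeddings["The striker scored twice"]

# Texts with similar meaning should score higher than unrelated ones.
similar_score = cosine_similarity(budget, spending)
unrelated_score = cosine_similarity(budget, football)
```

With these toy vectors, the two sentences about public spending score close to 1 and the football sentence scores close to 0, even though the first two share almost no words.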
That lets you do several things, for example:
- Semantically search your corpus for documents with similar meaning to your query
- Project it into two dimensions with UMAP, so you can plot your documents on a scatter chart and see where they cluster and where they don't
- Find natural groupings with HDBSCAN, a clustering algorithm that discovers groups and flags documents that don't fit anywhere
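The three steps above could be sketched roughly as follows. This is an assumption about how such a pipeline might look, not the workshop's actual notebook: it assumes the third-party packages sentence-transformers, umap-learn, hdbscan and numpy are installed, and the model name, `min_cluster_size` value and the function names `map_corpus` and `group_by_cluster` are illustrative choices.

```python
from collections import defaultdict


def group_by_cluster(labels):
    """Map each cluster label to the indices of the documents in it.

    HDBSCAN uses the label -1 for documents it treats as noise,
    i.e. the ones that don't fit anywhere.
    """
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[int(label)].append(index)
    return dict(groups)


def map_corpus(texts, query, top_k=3):
    """Sketch: search, project and cluster a list of texts."""
    # Third-party imports live here so the helper above runs without them.
    import hdbscan
    import numpy as np
    import umap
    from sentence_transformers import SentenceTransformer

    # Assumed model choice; any sentence-embedding model would do.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = model.encode(texts, normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)[0]

    # Semantic search: on unit-length vectors the dot product
    # equals cosine similarity, so higher means closer in meaning.
    scores = doc_vecs @ query_vec
    top_hits = np.argsort(-scores)[:top_k].tolist()

    # Projection: reduce each embedding to an (x, y) point for plotting.
    coords = umap.UMAP(n_components=2, random_state=42).fit_transform(doc_vecs)

    # Clustering: run on the 2-D points here for simplicity.
    labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(coords)

    return top_hits, coords, group_by_cluster(labels)
```

Calling `map_corpus(texts, "council budget cuts")` on a list of a few hundred strings would return the indices of the closest documents, one 2-D point per document for a scatter chart, and a dictionary of clusters in which key `-1` holds the outliers.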
The session is hands-on and code-based. You don't need prior experience with machine learning; we'll explain what's happening at each step. By the end, you'll understand how to take a pile of text and turn it into a visual map you can explore.
What to bring: a Google account for Colab. We'll provide example data to work with. You're welcome to bring your own, as long as it's in a text format.
Materials:
https://resolveworks.github.io/dataharvest2026/#/