Most "big data" problems in journalism aren't really data problems; they're reading problems: a big leak, a ministry dump of 12,000 pages, or a FOI request that comes back as a zip of PDFs. The instinct is to search, but keyword search assumes you already know what you're looking for, and sometimes that's exactly the thing you don't know yet.
This session introduces embeddings: a technique that turns any text into a point in space, positioned by meaning, so texts with similar meaning end up close together. You stop searching a pile and start looking at it.
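The core idea fits in a few lines of code. Here is a toy illustration with made-up 3-dimensional vectors (real embedding models output hundreds of dimensions, and the texts and numbers below are invented for the example): "close in space" is measured with cosine similarity, and semantically related texts score higher with each other than with unrelated ones.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, hand-crafted for illustration only.
vectors = {
    "flood warnings issued":    np.array([0.9, 0.1, 0.0]),
    "river bursts its banks":   np.array([0.8, 0.2, 0.1]),
    "election results delayed": np.array([0.1, 0.9, 0.2]),
}

same_topic = cosine_similarity(vectors["flood warnings issued"],
                               vectors["river bursts its banks"])
diff_topic = cosine_similarity(vectors["flood warnings issued"],
                               vectors["election results delayed"])
print(same_topic > diff_topic)  # prints True: the flood texts sit closer
```

In the session, the vectors come from a real embedding model rather than being written by hand, but the geometry works exactly the same way.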
To make the idea tangible, we'll walk through a live semantic map we built of Google's "trending now" feeds from 125 countries, projected into 3D.
The method isn't limited to trending searches: it works just as well on TikTok captions, YouTube transcripts, court filings, a scraped forum, or years of parliamentary speeches.
We'll cover the full workflow end to end: how to embed your corpus, how to project it without losing what matters, how to build a map you can actually navigate, and where this approach breaks.
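The shape of that pipeline can be sketched in miniature. In the session we'll use a real embedding model (for example via sentence-transformers or the Hugging Face API) and a projection method like UMAP; the sketch below substitutes a deterministic hashed bag-of-words "embedding" and a plain PCA so it runs with numpy alone. Those stand-ins are assumptions for illustration, not the workshop's actual tools; what carries over is the structure: text in, vectors out, 3D coordinates ready to plot.

```python
import zlib
import numpy as np

def toy_embed(text, dim=64):
    # Stand-in for a real embedding model: hash each word into a slot of a
    # fixed-size vector. A real model captures meaning; this only captures
    # shared vocabulary, but it has the same shape: text -> vector.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[zlib.crc32(word.encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def project_3d(X):
    # PCA via SVD: keep the 3 directions of greatest variance.
    # In the workshop we'd reach for UMAP, which preserves local clusters
    # better than PCA on real embedding data.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:3].T

corpus = [
    "flood warning issued for the coast",
    "heavy rain floods the city centre",
    "parliament debates the new budget",
    "budget vote delayed in parliament",
]
X = np.vstack([toy_embed(t) for t in corpus])  # one vector per document
coords = project_3d(X)                          # one 3D point per document
print(coords.shape)  # prints (4, 3)
```

Swapping `toy_embed` for a real model and `project_3d` for UMAP is the whole of the upgrade path; the surrounding code does not change.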
To follow along, participants should be comfortable running basic Python scripts on their laptop or in Google Colab.
After attending this session, participants will be able to take a large, unstructured text corpus and turn it into a navigable semantic map.
Participants should have Python installed on their computer, or a Google account so they can run Colab. A Hugging Face account is recommended for generating embeddings. We will provide example texts to work with; if you have your own collection, feel free to bring it, but make sure it's in a plain-text format, as we won't cover how to convert PDFs into text.