Name: Using the cloud and local LLMs to rapidly analyse thousands of audio/text documents
Start: 2026-05-29T14:00:00+0200
End: 2026-05-29T15:15:00+0200

Using the cloud and local LLMs to rapidly analyse thousands of audio/text documents

Friday May 29, 2026 2:00pm - 3:15pm CEST

Z0.10

In this session, participants will take an archive of podcast episodes and other documents and set up cloud infrastructure to analyse the files using open source transcription, text extraction and generative AI tooling. The aim is to equip attendees with the skills to rapidly perform bulk operations on large troves of data by leveraging cloud platforms. By the end of the workshop, participants will have a pipeline that can answer questions like 'which podcast episodes have instances of greenwashing in them?'

At The Guardian, we have used these techniques in two recent investigations. When investigating the Free Birth Society, we needed to analyse hundreds of hours of audio files. When the Epstein files were released, we had to try to extract meaning from millions of unstructured text documents. By making use of simple cloud tools (queues and instances), we were able to process hundreds of files in parallel whilst retaining control of the data.
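
To give a feel for the queue-and-workers idea before the session, here is a minimal local sketch in Python. It is not the workshop's code: a standard-library `queue.Queue` stands in for a cloud queue, threads stand in for instances, and the `analyse` function is a hypothetical keyword check standing in for the real transcription and LLM steps.

```python
import queue
import threading


def analyse(text: str) -> bool:
    """Hypothetical stand-in for the real analysis (transcription + LLM)."""
    return "greenwashing" in text.lower()


def worker(jobs: queue.Queue, results: dict, lock: threading.Lock) -> None:
    """Pull documents off the queue until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:  # sentinel: no more work for this worker
            break
        name, text = item
        flagged = analyse(text)
        with lock:  # results is shared between workers
            results[name] = flagged


def run_pipeline(documents: dict, num_workers: int = 4) -> dict:
    """Analyse every document in parallel and return {name: flagged}."""
    jobs: queue.Queue = queue.Queue()
    results: dict = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for name, text in documents.items():
        jobs.put((name, text))
    for _ in threads:  # one sentinel per worker so each one stops
        jobs.put(None)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Made-up example documents, for illustration only.
    docs = {
        "episode-1": "Our sponsor's net zero pledge looks like greenwashing.",
        "episode-2": "This week we talk about gardening.",
    }
    flagged = sorted(name for name, hit in run_pipeline(docs).items() if hit)
    print(flagged)
```

In a cloud setup the same shape applies, with a managed queue holding one message per file and several instances each running the worker loop.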

Participants should have some experience of using the command line. All cloud accounts will be provided. After attending this session, participants will be able to use the cloud to quickly analyse large numbers of documents and media files.

The repository for the workshop is at https://github.com/philmcmahon/data-pipeline

We'll be using the following tools during the workshop. They can be installed quickly, but setting them up in advance will save some time:
- AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
- OpenTofu: https://opentofu.org/docs/intro/install/
- uv: https://docs.astral.sh/uv/getting-started/installation/
- (If using Windows) You might need to set up WSL (https://learn.microsoft.com/en-us/windows/wsl/install), but as long as you can run the AWS CLI, OpenTofu and uv, that's all you need
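
If you want to confirm the setup before the session, a small check like the following may help. It is a sketch rather than part of the workshop materials, and it assumes the tools are installed under their usual command names (`aws`, `tofu` and `uv`).

```python
import shutil

# Command name -> human-readable label. The command names are the usual
# defaults and are an assumption about how the tools were installed.
REQUIRED_TOOLS = {"aws": "AWS CLI", "tofu": "OpenTofu", "uv": "uv"}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the labels of tools whose command is not found on PATH."""
    return [label for command, label in tools.items() if which(command) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All workshop tools found on PATH.")
```

On Windows, run it from the same shell (for example WSL) that you plan to use during the workshop.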

Speakers

Philip McMahon

Software Developer, The Guardian

Teodora Curcic

BBC

Teodora Ćurčić is an investigative and data journalist from Serbia with over seven years of experience reporting on corruption, political finance, gender-based violence, and social justice. She spent most of her career at the award-winning Center for Investigative Journalism of...

rapidly analyse documents using cloud data harvest 2026 slides pdf


Data skills, Workshop

Dataharvest 2026 - the European Investigative Journalism Conference
