Friday May 29, 2026 2:00pm - 3:15pm CEST
In this session, participants will take an archive of podcast episodes and other documents and set up cloud infrastructure to analyse the files using open source transcription, text extraction and generative AI tooling. The aim is to equip attendees with the skills to rapidly perform bulk operations on large troves of data by leveraging cloud platforms. By the end of the workshop, participants will have a pipeline that can answer questions like 'which podcast episodes have instances of greenwashing in them?'.

At The Guardian, we have used these techniques in two recent investigations. When investigating the Free Birth Society, we needed to analyse hundreds of hours of audio files. When the Epstein files were released, we had to extract meaning from millions of unstructured text documents. By making use of simple cloud tools (queues and instances), we were able to process hundreds of files in parallel whilst retaining control of the data.
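
The "queues and instances" pattern mentioned above can be illustrated locally. The following is a minimal sketch, not the workshop's actual code: it uses Python's standard `queue` and `threading` modules as stand-ins for a cloud queue and a fleet of instances, and `analyse` is a hypothetical placeholder for the real transcription or text extraction step.

```python
import queue
import threading


def analyse(path: str) -> str:
    """Hypothetical placeholder for transcription or text extraction."""
    return f"processed:{path}"


def worker(jobs: queue.Queue, results: dict, lock: threading.Lock) -> None:
    """Pull file paths off the queue until a None sentinel arrives."""
    while True:
        path = jobs.get()
        if path is None:  # sentinel: no more work for this worker
            break
        outcome = analyse(path)
        with lock:  # guard the shared results dictionary
            results[path] = outcome


def run_pipeline(paths, worker_count: int = 4) -> dict:
    """Process every path using worker_count parallel workers."""
    jobs: queue.Queue = queue.Queue()
    results: dict = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for path in paths:
        jobs.put(path)
    for _ in threads:  # one sentinel per worker so each one stops
        jobs.put(None)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Illustrative file names only; they do not refer to real workshop data.
    print(run_pipeline([f"episode-{n}.mp3" for n in range(1, 6)]))
```

In a cloud setting, the in-memory queue would be replaced by a managed queue service and each worker thread by a separate instance, but the shape stays the same: producers enqueue file references, and independent workers pull and process them.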

Participants should have some experience of using the command line. All cloud accounts will be provided. After attending this session, participants will be able to use the cloud to quickly analyse large numbers of documents and media files.

The repository for the workshop is available here: https://github.com/philmcmahon/data-pipeline

We'll be using the following tools during the workshop. They can be installed quickly, but setting them up in advance will save some time:
- AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
- OpenTofu: https://opentofu.org/docs/intro/install/
- uv: https://docs.astral.sh/uv/getting-started/installation/
- (if using Windows): You might need to set up WSL https://learn.microsoft.com/en-us/windows/wsl/install - but so long as you can run the aws, tofu (OpenTofu) and uv commands, that's all you need
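
One way to confirm the tools listed above are ready before the session is a short check like the one below. This is a sketch rather than part of the workshop materials: it only tests whether each command is on the `PATH`, it assumes OpenTofu is installed under its usual command name `tofu`, and the `missing_tools` helper is a hypothetical name introduced for illustration.

```python
import shutil

# Command names assumed for the three tools: AWS CLI, OpenTofu and uv.
REQUIRED_TOOLS = ["aws", "tofu", "uv"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the commands from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

On Windows, running this inside WSL checks the WSL environment rather than the host.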
Speakers
Philip McMahon

Software Developer, The Guardian

Teodora Curcic

BBC
Teodora Ćurčić is an investigative and data journalist from Serbia with over seven years of experience reporting on corruption, political finance, gender-based violence, and social justice. She spent most of her career at the award-winning Center for Investigative Journalism of...
Z0.10
