Friday May 29, 2026 2:00pm - 3:15pm CEST
In this session, participants will take an archive of podcast episodes and other documents and set up cloud infrastructure to analyse the files using open source transcription, text extraction and generative AI tooling. The aim is to equip attendees with the skills to rapidly perform bulk operations on large troves of data by leveraging cloud platforms. By the end of the workshop, participants will have a pipeline that can answer questions like 'which podcast episodes contain instances of greenwashing?'.

At The Guardian, we have used these techniques in two recent investigations. When investigating the Free Birth Society, we needed to analyse hundreds of hours of audio files. When the Epstein files were released, we had to extract meaning from millions of unstructured text documents. By making use of simple cloud tools (queues and instances), we were able to process hundreds of files in parallel whilst retaining control of the data.
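The queue-and-instances pattern above can be sketched locally: a job queue is filled with file names, and a pool of workers drains it in parallel. This is a minimal illustrative sketch, not the workshop's actual code; in the cloud, a managed queue (such as AWS SQS) and a fleet of instances would replace the in-process queue and threads, and the `transcribe` stub stands in for a real open source transcription model.

```python
import queue
import threading

def transcribe(filename: str) -> str:
    # Placeholder: a real worker would run an open source
    # transcription model against the audio file here.
    return f"transcript of {filename}"

def worker(jobs: queue.Queue, results: list, lock: threading.Lock) -> None:
    # Each worker pulls jobs until the queue is empty, mirroring how
    # cloud instances consume messages until the queue runs dry.
    while True:
        try:
            name = jobs.get_nowait()
        except queue.Empty:
            return
        text = transcribe(name)
        with lock:
            results.append(text)

def process_in_parallel(filenames: list, n_workers: int = 4) -> list:
    jobs: queue.Queue = queue.Queue()
    for name in filenames:
        jobs.put(name)
    results: list = []
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(jobs, results, lock))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

episodes = [f"episode_{i}.mp3" for i in range(10)]
transcripts = process_in_parallel(episodes)
print(len(transcripts))  # 10
```

Because each file is independent, adding workers (or instances) scales throughput almost linearly, which is what makes bulk analysis of hundreds of hours of audio tractable.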

Participants should have some experience of using the command line. All cloud accounts will be provided. After attending this session, participants will be able to use the cloud to quickly analyse large numbers of documents and media files. Participants using Windows could save some time by setting up WSL in advance: https://learn.microsoft.com/en-us/windows/wsl/install
Speakers

Philip McMahon

Software Developer, The Guardian


Teodora Ćurčić

BBC
Teodora Ćurčić is an investigative and data journalist from Serbia with over seven years of experience reporting on corruption, political finance, gender-based violence, and social justice. She spent most of her career at the award-winning Center for Investigative Journalism of...
Room 3.05

