In this session, participants will take an archive of podcast episodes and other documents, and set up some cloud infrastructure to analyse the files using open source transcription, text extraction and generative AI tooling. The aim is to equip attendees with the skills to rapidly perform bulk operations on large troves of data by leveraging cloud platforms. By the end of the workshop, participants will have a pipeline that can answer questions like 'which podcast episodes have instances of greenwashing in them'.
At The Guardian, we have used these techniques in two recent investigations. When investigating the Free Birth Society, we needed to analyse hundreds of hours of audio files. When the Epstein files were released, we had to try to extract meaning from millions of unstructured text documents. By making use of simple cloud tools (queues and instances), we were able to process hundreds of files in parallel whilst retaining control of the data.
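The queue-and-instances idea can be sketched locally before any cloud account is involved. The example below is only an illustration, not the workshop's actual pipeline: an in-memory queue stands in for a cloud queue, threads stand in for worker instances, and `transcribe`, `worker` and `process_all` are hypothetical names invented for this sketch.

```python
import queue
import threading


def transcribe(path):
    # Hypothetical placeholder for real work such as transcription or text extraction.
    return f"processed {path}"


def worker(jobs, results):
    # Each worker stands in for one cloud instance pulling messages from a shared queue.
    while True:
        path = jobs.get()
        if path is None:  # sentinel value: no more work for this worker
            jobs.task_done()
            break
        results.put((path, transcribe(path)))
        jobs.task_done()


def process_all(paths, num_workers=4):
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for path in paths:
        jobs.put(path)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker so every worker stops
    for t in threads:
        t.join()
    # All workers have finished, so the results queue can be drained safely.
    output = {}
    while not results.empty():
        path, result = results.get()
        output[path] = result
    return output


if __name__ == "__main__":
    done = process_all([f"episode_{i}.mp3" for i in range(10)], num_workers=3)
    print(len(done), "files processed")
```

Because every worker pulls from the same queue, adding more workers increases throughput without changing how the work is submitted, which is the property that makes the pattern useful for large archives.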
Participants should have some experience of using the command line. All cloud accounts will be provided. After attending this session, participants will be able to use the cloud to quickly analyse large numbers of documents and media files.
You can see the repository for the workshop here: https://github.com/philmcmahon/data-pipeline
We'll be using the following tools during the workshop. They can be installed quickly, but setting them up in advance will save some time:
- AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
- OpenTofu: https://opentofu.org/docs/intro/install/
- uv: https://docs.astral.sh/uv/getting-started/installation/
- (If using Windows) You might need to set up WSL (https://learn.microsoft.com/en-us/windows/wsl/install), but so long as you can run aws, tofu (the OpenTofu command) and uv, that's all you need
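
A quick way to check your setup in advance is to confirm that each tool is on your PATH. The snippet below is a minimal sketch using only the Python standard library; it assumes the commands are named `aws`, `tofu` (the command OpenTofu installs) and `uv`, and the `missing_tools` helper is a hypothetical name introduced here.

```python
import shutil

# Assumed command names: the AWS CLI installs `aws`, OpenTofu installs `tofu`, uv installs `uv`.
REQUIRED_TOOLS = ["aws", "tofu", "uv"]


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All workshop tools found.")
```

This only checks that the commands can be found; it does not verify versions or cloud credentials.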