Large language models can do more than generate text: they can also help clean and structure messy data files and enrich datasets. As LLMs become an increasingly useful tool for data journalists, the ellmer package gives R users a straightforward way to work with them. The Guardian data team has used ellmer to clean and organise thousands of emails from the Epstein files, to investigate private equity firms in the United Kingdom, and to classify recipients of climate finance.
Drawing on some of these examples, attendees will learn when this package is the right tool for an investigation, what good practice looks like when using LLMs, how to connect to an LLM's API, how to write an efficient prompt, how to submit prompts in bulk using the batch function for structured data, and how to evaluate results and iterate to improve them.
This is an advanced R session and we will assume that attendees have some prior knowledge of R.