Name: Modern document processing with Natural PDF
Start: 2026-05-30T18:00:00+0200
End: 2026-05-30T18:30:00+0200

Modern document processing with Natural PDF

Saturday May 30, 2026 6:00pm - 6:30pm CEST

Z1.13 - Aula Hanswijk

Say hello to Natural PDF, a new Python library for wrangling PDFs that's focused on usability and feature-completeness. Process PDFs with scraping-like selectors and spatially-aware queries, asking for "the red alphanumeric string" or "the content below the big Summary header." Beyond the basics, Natural PDF is also full of modern conveniences like table detection, multiple OCR engines, and citation-aware LLM data extraction.

To get the most out of this session, participants should have experience with Python and have struggled with terrible PDFs.

Materials: https://jsoma.github.io/natural-pdf-workshop/

Speakers

Jonathan Soma

Knight Chair in Data Journalism, Columbia University

Jonathan Soma is the Knight Chair in Data Journalism at Columbia University, where he serves as Director of the Data Journalism MS program and the Lede Program, an intensive data journalism summer course. His lectures cover everything from basic Python and data analysis to interactive...

Data skills, Mini

Dataharvest 2026 - the European Investigative Journalism Conference
