Say hello to Natural PDF, a new Python library for wrangling PDFs that's focused on usability and feature-completeness. Process PDFs with scraping-like selectors and spatially-aware queries, asking for "the red alphanumeric string" or "the content below the big Summary header." Beyond the basics, Natural PDF is also full of modern conveniences like table detection, multiple OCR engines, and citation-aware LLM data extraction.
To get the most out of this session, participants should have experience with Python and struggling with terrible PDFs.
Materials:
https://jsoma.github.io/natural-pdf-workshop/