
Evaluation-driven Approach to Document Parsing – Case Study with Legislative PDFs

  • Thilo
  • Dec 9, 2024
  • 4 min read

Updated: Dec 13, 2024

Parsing structured data from PDFs is a widespread challenge due to their unstructured nature. Generative AI offers promising solutions, with a variety of tools available.


Choosing the right approach for a specific task means balancing cost, performance, and data privacy concerns, which are often tied to the size of the language model. Smaller models are less precise but more cost-effective and easier to deploy locally. Systematic evaluation is itself a difficult task, but it is essential for navigating the diverse methods effectively.


In this article, we focus on parsing the full content of legislative consultation PDFs into a structured JSON format comprising headers, contents, formatted lists and tables, and footnotes. We present three different approaches based on LlamaParse and ChatGPT, and introduce methods for systematically evaluating and comparing these approaches using MLflow.
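
To make the target format concrete, the snippet below gives a hypothetical illustration of such an output; the field names are placeholders, not the schema actually used by the project.

```python
# Hypothetical illustration of the target structure; field names are
# placeholders, not the project's actual schema.
example_output = {
    "title": "Example consultation document",
    "elements": [
        {"type": "header", "level": 1, "text": "1. Ausgangslage"},
        {"type": "paragraph", "text": "Der Bundesrat hat am ..."},
        {"type": "list", "ordered": True, "items": ["Erstens ...", "Zweitens ..."]},
        {"type": "table", "rows": [["Artikel", "Änderung"], ["Art. 1", "..."]]},
        {"type": "footnote", "marker": "3", "text": "SR 784.40"},
    ],
}
```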


Introduction to Demokratis

The Demokratis project empowers Swiss citizens to engage with legislative consultations through a transparent and collaborative digital platform. To date, many legislative documents are published only as PDFs. To facilitate meaningful participation, converting these documents into structured, machine-readable formats is vital.


Evaluation-Driven Generative AI Development

Building robust parsing pipelines requires an evaluation-driven approach that emphasizes iterative testing and performance tracking. In analogy to test-driven development in software engineering, we should discipline ourselves to think about how to evaluate different approaches before even starting with the fun part of experimenting with AI models and prompts. As with many things, initial discipline ensures lasting fun in the long run, while quick fun can soon end in headaches.


MLflow for Generative AI

Although originally designed for organizing the development lifecycle of classical machine learning models, MLflow is now also well suited to managing generative AI workflows. In particular, the Models from Code feature allows tracking a variety of model approaches, prompts, and parameters.
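
As a rough illustration of what this looks like in practice, a pipeline run could be tracked roughly as follows; run, parameter, and file names are placeholders, not the project's actual configuration.

```python
# Minimal sketch of tracking a parsing pipeline run with MLflow;
# run, parameter and file names are placeholders.
import mlflow

avg_missing, avg_added = 0.0, 0.0  # placeholders; computed by the difflib evaluation described below

with mlflow.start_run(run_name="llamaparse_python_pipeline"):
    mlflow.log_param("pipeline", "llamaparse+python")
    mlflow.log_param("llm_model", "gpt-4o")

    # With Models from Code, the pipeline itself is logged from a Python file
    # that defines it and calls mlflow.models.set_model(...).
    mlflow.pyfunc.log_model(
        artifact_path="parser",
        python_model="parsing_pipeline.py",  # hypothetical model-from-code script
    )

    mlflow.log_metric("avg_percnt_missing_chars", avg_missing)
    mlflow.log_metric("avg_percnt_added_chars", avg_added)
```

Each approach, prompt, and parameter combination then appears as a separate run that can be compared side by side in the MLflow UI.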


Evaluation Metrics with difflib

For evaluation, we convert the parsed JSON output into a raw text string. Using Python’s difflib library, we then compare this converted structured output to text extracted directly from the PDF (with a PDF-processing library such as PyPDF2). difflib provides tools for comparing sequences, generating similarity ratios, and creating human-readable diffs between strings or other sequence types using a modified version of the Ratcliff/Obershelp algorithm.
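
A minimal sketch of this comparison basis, assuming a simplified JSON structure with a flat list of elements and a hypothetical input file:

```python
# Sketch of the comparison basis; assumes a simplified JSON structure with a
# flat "elements" list (the real schema is richer).
import difflib
import json

from PyPDF2 import PdfReader

def pdf_to_text(path: str) -> str:
    """Directly extract plain text from the PDF as an approximate ground truth."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def parsed_json_to_text(parsed: dict) -> str:
    """Flatten the structured JSON output back into a raw text string."""
    return "\n".join(element.get("text", "") for element in parsed.get("elements", []))

reference = pdf_to_text("consultation.pdf")          # hypothetical input file
candidate = parsed_json_to_text(json.load(open("parsed.json")))

# Overall similarity according to difflib's Ratcliff/Obershelp-style matcher
similarity = difflib.SequenceMatcher(None, reference, candidate).ratio()
```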


To evaluate the parsing pipelines, the project relies on the following:

  1. avg_percnt_missing_chars and avg_percnt_added_chars: Calculate the percentage of missing and added characters in the converted JSON output text. These are somewhat fuzzy versions of recall and precision, since the directly extracted text is only an approximate ground truth. But they are easy to implement and have turned out to be quite powerful for comparing the different approaches during model selection (see the sketch after this list).

  2. HTML Diff Visualization: Creates visual comparisons of the parsed and directly extracted texts, highlighting discrepancies for manual review using difflib’s HTML diff. These HTML text comparisons are very helpful for understanding how each approach behaves on individual documents, and also for the manual inspection and correction required to ensure quality in the final production workflow.

  3. valid_schemas: Verifies that the parsed JSON adheres to the expected JSON schema. Used for model selection as well as in the production workflow.

  4. Cost: Tracks the costs of calling the LlamaParse and OpenAI APIs.
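
A sketch of how the first three of these could be implemented with difflib and the jsonschema package, reusing the helpers from the previous sketch; the exact normalisation and function names are assumptions:

```python
# Sketch of the evaluation metrics, reusing pdf_to_text / parsed_json_to_text
# from the previous sketch; normalisation choices are assumptions.
import difflib

from jsonschema import ValidationError, validate

def char_diff_percentages(reference: str, candidate: str) -> tuple[float, float]:
    """Percentage of reference characters missing from, and extra characters in, the candidate."""
    matcher = difflib.SequenceMatcher(None, reference, candidate)
    missing = added = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing += i2 - i1  # reference characters not reproduced in the candidate
        if tag in ("insert", "replace"):
            added += j2 - j1    # characters introduced by the parsing pipeline
    return (100 * missing / max(len(reference), 1),
            100 * added / max(len(candidate), 1))

def html_diff(reference: str, candidate: str) -> str:
    """Side-by-side HTML diff for manual inspection of a single document."""
    return difflib.HtmlDiff(wrapcolumn=80).make_file(
        reference.splitlines(), candidate.splitlines(),
        fromdesc="PyPDF2 extraction", todesc="parsed JSON as text",
    )

def is_valid_schema(parsed: dict, schema: dict) -> bool:
    """Check that the parsed output adheres to the expected JSON schema."""
    try:
        validate(instance=parsed, schema=schema)
        return True
    except ValidationError:
        return False
```

Computing char_diff_percentages for every document and averaging across the evaluation set then yields the avg_percnt_* metrics tracked in MLflow.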


Parsing Pipelines with LlamaParse and ChatGPT


1. ChatGPT File Upload Model

Pipeline (sketched in code below):

  • PDF uploaded via the ChatGPT assistant API.

  • ChatGPT generates JSON output from the document.

Strengths: Leverages GPT’s powerful natural language capabilities.

Weaknesses:

  • Results are inconsistent across runs.

  • Incomplete adherence to the required JSON schema.

  • Sometimes it outputs only parts of the document.
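
For illustration, a hedged sketch of such a file-upload pipeline with the OpenAI Assistants API; the model name, instructions, and message handling below are assumptions rather than the exact setup used here:

```python
# Hedged sketch of the file-upload approach with the OpenAI Assistants API;
# model name, instructions and message handling are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the consultation PDF so the assistant can search it
pdf_file = client.files.create(file=open("consultation.pdf", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    model="gpt-4o",
    instructions="Parse the attached legislative PDF into the expected JSON schema.",
    tools=[{"type": "file_search"}],
)

thread = client.beta.threads.create(messages=[{
    "role": "user",
    "content": "Return the full document as JSON (headers, contents, lists, tables, footnotes).",
    "attachments": [{"file_id": pdf_file.id, "tools": [{"type": "file_search"}]}],
}])

run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)
messages = client.beta.threads.messages.list(thread_id=thread.id)
raw_json = messages.data[0].content[0].text.value  # latest message: the assistant's reply
```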


2. LlamaParse + Python Parsing (Most Stable)

Pipeline:

  • Extracts text from PDFs as structured Markdown using LlamaParse.

  • Splits Markdown into nodes with MarkdownNodeParser.

  • Inserts footnotes directly into the text where they are referenced.

  • Processes each node into structured JSON via a Python function (see the sketch below).

Strengths:

  • Most stable approach.

  • Maintains high accuracy for complex document structures.

  • Intermediate Markdown representation ensures structural preservation.

Weaknesses: Manual effort is required to refine the Python parsing functions, especially if the JSON output structure changes.
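
The sketch below illustrates this pipeline in simplified form; it assumes the llama-parse and llama-index packages (with LLAMA_CLOUD_API_KEY set in the environment), omits footnote insertion, and replaces the real node-to-JSON function with a toy version:

```python
# Simplified sketch of the LlamaParse + Python pipeline; footnote insertion is
# omitted and node_to_elements is a toy stand-in for the real parsing logic.
from llama_index.core.node_parser import MarkdownNodeParser
from llama_parse import LlamaParse

# 1. Extract the PDF as structured Markdown
documents = LlamaParse(result_type="markdown").load_data("consultation.pdf")

# 2. Split the Markdown into nodes
nodes = MarkdownNodeParser().get_nodes_from_documents(documents)

# 3. Convert each node into the structured JSON format with plain Python
def node_to_elements(markdown_text: str) -> list[dict]:
    elements = []
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            elements.append({"type": "header", "level": level, "text": line.lstrip("# ")})
        elif line.strip():
            elements.append({"type": "paragraph", "text": line.strip()})
    return elements

parsed = {"elements": [e for node in nodes for e in node_to_elements(node.get_content())]}
```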


3. LlamaParse + ChatGPT Parsing

Pipeline: Same as the LlamaParse + Python pipeline, but uses ChatGPT instead of Python functions to process nodes into structured JSON (see the sketch below).

Strengths: Combines LlamaParse’s structural accuracy with GPT’s natural language capabilities.

Weaknesses:

  • Higher computational cost and time compared to other methods.

  • Output still lacks consistency for complex nested structures.
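
A sketch of the node-processing step with ChatGPT in place of the Python functions; the model name, prompt, and expected response shape are assumptions:

```python
# Sketch of processing one Markdown node with ChatGPT instead of a Python
# function; model name, prompt and response shape are assumptions.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Convert the given Markdown fragment of a legislative document into JSON "
    "with an 'elements' list of headers, paragraphs, lists, tables and footnotes."
)

def node_to_elements_llm(markdown_text: str) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # request syntactically valid JSON
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": markdown_text},
        ],
    )
    return json.loads(response.choices[0].message.content).get("elements", [])
```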


Results: Comparing Parsing Models


Evaluation Metrics for Different Approaches

The table below compares the different pipelines, ordered by avg_percnt_missing_chars. The LlamaParse + Python Parsing pipeline has the fewest missing characters, indicating the best parsing performance. However, for a detailed assessment, manual inspection of the HTML diffs is required (see next section).


Comparing different pipeline runs in MLflow (screenshot).

Example Results for LlamaParse + Python Pipeline

  1. Excerpt of the input PDF: A legislative document from the Swiss parliament containing dense text, footnotes, and nested lists.

  2. Parsed and processed Markdown: Headers, paragraphs, and lists converted into Markdown format. Footnotes inserted into text at their point of reference.

  3. Excerpt of parsed JSON output: A machine-readable representation of the document’s structure.

  4. HTML Diff Visualization: Inspecting parsing details and explaining why percnt_missing_chars is still greater than zero. Left text: text extracted directly from the PDF using PyPDF2; parts missing in the right text are highlighted in red. Right text: parsed JSON output converted back into plain text; parts added by the parsing pipeline are marked in green. Yellow segments show changes between the two texts, here due to the replacement of the footnote reference (3 in the left text) with the actual footnote content (SR 784.40).


Outlook: Expanding Possibilities

Future work could explore more of the many parsing approaches and tools out there, for example Unstructured, Nuextract, MinerU, docling, etc. The proposed evaluation-driven setup is modular, enabling easy integration and experimentation with new methods.
