Deterministic extraction from financial PDFs without the hallucinations

Last quarter, a customer ran a single 10-K filing through three commercial LLM-based extractors and got three different numbers for the same line item. Total long-term debt, page 47. They sent us screenshots — one model returned $8.4 billion, another $8.4 million, and the third confidently invented a value that did not appear anywhere in the document.

This is the worst kind of failure mode in financial document processing: silent, plausible, and impossible to spot without going back to the source document. A hallucinated value in a press-release summary is awkward. A hallucinated value in a credit underwriting pipeline is litigation.

Over the past six months we've been building Extract, the HyperAPI endpoint that turns financial documents into typed, validated data — and we've been doing it without trusting an LLM to be the source of truth. This post is about how that pipeline actually works, why we made the design choices we did, and what we measured along the way.

The premise: language models are not databases

The dominant approach to PDF extraction in 2025 was to point a large language model at a document and ask it to return structured JSON. It mostly worked. It produced demos that got people excited. It also produced the kind of accuracy curve where the average F1 looked great and the tail would lose you a customer.

The failure modes were not random. They clustered around exactly the kinds of fields that make finance interesting:

Numerical magnitudes — billions vs. millions, basis points vs. percent, $ vs. €, parens-for-negative vs. minus-sign.
Table cells that span multiple rows or columns — the merged header in a balance sheet that the model can't see.
Footnoted values — the headline number is on page 47 but the actual reportable number lives in note 14 on page 112.
Restatements: last year's number on this year's filing is often subtly different from what last year's filing reported.

Every one of these is a layout problem dressed up as a language problem. The information is in the document; the model just couldn't see where to look. Asking an LLM to also do reliable layout reasoning is asking the wrong organ to do the wrong job.

The shape of the new pipeline

We collapsed five LLM calls into a four-stage pipeline where each stage is responsible for exactly one thing and each stage's output is independently verifiable:

PDF
 │
 ▼
[1] LAYOUT          — typed bounding-box tree of the document
 │   (pages, sections, tables, footnotes, page references)
 ▼
[2] LOCATE          — schema-driven candidate selection
 │   "which spans of the document are candidates for each field?"
 ▼
[3] READ            — narrow, targeted LLM calls per candidate
 │   ("what is the total long-term debt at the end of fiscal 2025?",
 │    context = a 400-token window, schema = the field type)
 ▼
[4] VALIDATE        — schema + cross-field consistency + provenance
     ("does this number reconcile with the balance sheet total?
       does the cited page actually contain this value?")

The LLM appears in exactly one step. It never sees the whole document. It is never asked to do arithmetic. It is never the source of structure — only the source of words.
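As a rough sketch of that shape, the four stages can be written as typed functions composed in order. Every name here (`Stages`, `runPipeline`, `toyStages`) is hypothetical and the stage bodies are toy stand-ins, not Extract internals. The point is only that `read` receives one candidate at a time and that a null reading falls through to the next candidate.

```typescript
// Hypothetical, simplified shapes; the real Extract types are not shown in this post.
type Layout = { pages: string[] };
type Candidate = { field: string; page: number; text: string };
type Reading = { field: string; page: number; value: number | null };
type Validated = Reading & { status: "ok" | "requires_review" };

interface Stages {
  layout(pdfText: string[]): Layout;
  locate(layout: Layout, field: string): Candidate[];
  read(candidate: Candidate): Reading; // the only model-backed stage
  validate(reading: Reading, layout: Layout): Validated;
}

function runPipeline(stages: Stages, pdfText: string[], field: string): Validated {
  const layout = stages.layout(pdfText);
  for (const candidate of stages.locate(layout, field)) {
    const reading = stages.read(candidate); // sees one window, never the document
    if (reading.value === null) continue;   // "not here": try the next candidate
    return stages.validate(reading, layout);
  }
  return { field, page: -1, value: null, status: "requires_review" };
}

// Toy stages for illustration only.
const toyStages: Stages = {
  layout: (pdfText) => ({ pages: pdfText }),
  locate: (layout, field) =>
    layout.pages
      .map((text, i) => ({ field, page: i + 1, text }))
      .filter((c) => c.text.toLowerCase().includes(field)),
  // Stand-in for the LLM call: pull the first printed number, or return null.
  read: (c) => {
    const m = c.text.match(/\d[\d,]*/);
    return { field: c.field, page: c.page, value: m ? Number(m[0].replace(/,/g, "")) : null };
  },
  // Accept the value only if the cited page really contains it.
  validate: (r, layout) => ({
    ...r,
    status: (layout.pages[r.page - 1] ?? "").replace(/,/g, "").includes(String(r.value))
      ? "ok"
      : "requires_review",
  }),
};

const toyPages = [
  "Item 1. Business overview",
  "Long-term debt is discussed in the notes",
  "Total long-term debt 8,400",
];
const toyResult = runPipeline(toyStages, toyPages, "long-term debt");
```

In this toy run the second page matches the locator but contains no number, so the read returns null and the loop moves on to the third page.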

Stage 1: layout, not OCR

Most extraction services start by treating a PDF as an image and running OCR. For born-digital PDFs (which most 10-Ks are) this is a category error — the text is already there, with its exact coordinates, fonts, and z-order. The job isn't recognition; it's structural understanding.

Our layout stage parses the underlying PDF object stream and builds a typed tree:

// Span, BBox, TableRow, and HeaderTree are supporting types not shown here.
type LayoutNode =
  | { kind: "page";      n: number; size: [number, number]; children: LayoutNode[] }
  | { kind: "section";   title: string; level: 1|2|3|4; children: LayoutNode[] }
  | { kind: "paragraph"; spans: Span[];   bbox: BBox }
  // Tables keep their header tree instead of being flattened to text.
  | { kind: "table";     rows: TableRow[]; headers: HeaderTree; bbox: BBox }
  | { kind: "footnote";  marker: string; spans: Span[] }
  | { kind: "page_ref";  target: number;  spans: Span[] };

Tables are the load-bearing case. We do not flatten them into markdown. We preserve the row/column header relationships so that “long-term debt” in the row header and “2025” in the column header together address a single cell — and that cell carries its bounding box and the page number it lives on.
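To make the addressing idea concrete, here is a small sketch in which each cell stores the full header path on both axes. The `Cell` shape and `findCell` helper are assumptions made for illustration, not the actual layout types, and the figures are invented.

```typescript
// Hypothetical cell shape: header paths plus the provenance the post describes.
type Cell = {
  rowPath: string[]; // e.g. ["Liabilities", "Long-term debt"]
  colPath: string[]; // e.g. ["2025"]
  text: string;
  page: number;
  bbox: [number, number, number, number];
};

// Address a cell by the full header path on each axis, so a merged parent
// header still disambiguates rows that share a leaf label such as "Total".
function findCell(cells: Cell[], rowPath: string[], colPath: string[]): Cell | undefined {
  const same = (a: string[], b: string[]) =>
    a.length === b.length && a.every((s, i) => s.toLowerCase() === b[i]?.toLowerCase());
  return cells.find((c) => same(c.rowPath, rowPath) && same(c.colPath, colPath));
}

// Invented figures, for illustration only.
const sheet: Cell[] = [
  { rowPath: ["Assets", "Total"], colPath: ["2025"], text: "21,500", page: 46, bbox: [400, 120, 460, 132] },
  { rowPath: ["Liabilities", "Long-term debt"], colPath: ["2025"], text: "8,400", page: 47, bbox: [400, 210, 460, 222] },
  { rowPath: ["Liabilities", "Long-term debt"], colPath: ["2024"], text: "7,950", page: 47, bbox: [480, 210, 540, 222] },
  { rowPath: ["Liabilities", "Total"], colPath: ["2025"], text: "14,000", page: 47, bbox: [400, 300, 460, 312] },
];

const debt2025 = findCell(sheet, ["Liabilities", "Long-term debt"], ["2025"]);
const totalLiabilities = findCell(sheet, ["liabilities", "total"], ["2025"]);
```

A markdown-flattened table would leave two rows labelled "Total" with nothing to tell them apart; the header path keeps them distinct.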

This stage uses zero machine learning for born-digital PDFs. For scanned filings we drop in a small layout model (we use a fine-tuned DocLayout-YOLO variant) but the output type is identical, so downstream stages don't care which path produced it.

Stage 2: schema-driven candidate selection

Every Extract request includes the customer's target schema. For 10-K parsing it looks like this:

{
  "long_term_debt": {
    "type": "currency",
    "period": "fiscal_year_end",
    "unit_hints": ["USD", "millions", "thousands"],
    "locator_hints": ["consolidated balance sheet", "long-term debt"],
    "schema_constraints": {
      "must_reconcile_with": "balance_sheet_total_liabilities",
      "must_appear_in_page_range": [40, 60]
    }
  }
}

The schema isn't decoration — it's how we narrow the search space before the LLM ever runs. The locator hints turn into a weighted query over the layout tree: section titles, table headers, and adjacent labels all contribute. For long-term debt on a 250-page filing, we typically reduce the candidate set from ~12,000 cells to 3-7 cells before any LLM call.

Counter-intuitive lesson: the more constraints the schema carries, the more accurate the extraction — and the cheaper. The opposite intuition (more schema = more model confusion) is wrong because the schema gates the candidate selector, not the prompt.
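One way to picture the candidate selector is as a weighted keyword score over each cell's surrounding labels, with the page range applied as a hard filter. This is a sketch under assumptions: the weights, the `CellCandidate` shape, and the treatment of `must_appear_in_page_range` as a filter are all illustrative choices, not the production scorer.

```typescript
type CellCandidate = { sectionTitle: string; rowHeader: string; colHeader: string; page: number };
type Locator = { hints: string[]; pageRange?: [number, number] };

// Assumed weights: a row-header hit counts most, a column-header hit least.
const WEIGHTS = { sectionTitle: 2, rowHeader: 3, colHeader: 1 };

function scoreCandidate(c: CellCandidate, hints: string[]): number {
  let score = 0;
  for (const hint of hints.map((h) => h.toLowerCase())) {
    if (c.sectionTitle.toLowerCase().includes(hint)) score += WEIGHTS.sectionTitle;
    if (c.rowHeader.toLowerCase().includes(hint)) score += WEIGHTS.rowHeader;
    if (c.colHeader.toLowerCase().includes(hint)) score += WEIGHTS.colHeader;
  }
  return score;
}

// Filter by page range first, then keep the top-scoring cells with any hit at all.
function selectCandidates(cells: CellCandidate[], locator: Locator, topK = 5): CellCandidate[] {
  const lo = locator.pageRange?.[0] ?? 1;
  const hi = locator.pageRange?.[1] ?? Infinity;
  return cells
    .filter((c) => c.page >= lo && c.page <= hi)
    .map((c) => ({ cell: c, score: scoreCandidate(c, locator.hints) }))
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((s) => s.cell);
}

const cells: CellCandidate[] = [
  { sectionTitle: "Risk Factors", rowHeader: "", colHeader: "", page: 45 },
  { sectionTitle: "Consolidated Balance Sheet", rowHeader: "Accounts payable", colHeader: "2025", page: 47 },
  { sectionTitle: "Consolidated Balance Sheet", rowHeader: "Long-term debt", colHeader: "2025", page: 47 },
  { sectionTitle: "Notes to Financial Statements", rowHeader: "Long-term debt", colHeader: "2025", page: 112 },
];

const selected = selectCandidates(cells, {
  hints: ["consolidated balance sheet", "long-term debt"],
  pageRange: [40, 60],
});
```

Here the page-112 cell is dropped by the page range before scoring, and the unrelated "Risk Factors" cell is dropped because it scores zero, which is the sense in which more schema means fewer model calls.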

Stage 3: narrow, targeted reads

For each candidate, we run a single LLM call with a tightly-scoped prompt:

Prompt template (sketch):
---
You are extracting a single typed value from a financial filing.

Context (verbatim from the document):
{candidate.text}  // ~400 tokens, including the surrounding row/column
                  // headers and any inline footnote markers

Field: long_term_debt
Type: currency (USD)
Period: fiscal_year_end_2025

Return JSON: { "value": number | null, "unit": string,
               "as_of": string, "footnote_refs": string[],
               "confidence": "high" | "medium" | "low" }

If the value is not present in the context, return null. Do not infer.
Do not compute. Do not extrapolate.

Two things matter here. First, the context window is small enough that the model cannot drift into a different page. Second, the prompt explicitly forbids inference, and we measure adherence to that instruction (see the hallucination rate under “What we measured”). When the model returns null, that's often the correct answer for that candidate, and we fall through to the next one.
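A sketch of how the response handling might look, assuming the model's reply arrives as a raw string. `parseReadResult` and `firstPresent` are hypothetical helpers; the result shape mirrors the JSON in the prompt template. Malformed replies and null values are both treated as "move on to the next candidate".

```typescript
type Confidence = "high" | "medium" | "low";
type ReadResult = {
  value: number | null;
  unit: string;
  as_of: string;
  footnote_refs: string[];
  confidence: Confidence;
};

// Parse the raw reply into the declared shape; anything malformed is rejected.
function parseReadResult(raw: string): ReadResult | undefined {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return undefined;
  }
  if (typeof data !== "object" || data === null) return undefined;
  const d = data as Record<string, unknown>;

  const v = d.value;
  const value = v === null ? null : typeof v === "number" && Number.isFinite(v) ? v : undefined;
  if (value === undefined) return undefined;

  const { unit, as_of, footnote_refs, confidence } = d;
  if (typeof unit !== "string" || typeof as_of !== "string") return undefined;
  if (!Array.isArray(footnote_refs) || !footnote_refs.every((r) => typeof r === "string")) return undefined;
  if (confidence !== "high" && confidence !== "medium" && confidence !== "low") return undefined;

  return { value, unit, as_of, footnote_refs, confidence: confidence as Confidence };
}

// Walk replies in candidate order and return the first one that carries a value.
// Simplification: a real loop would issue the calls lazily, one candidate at a time.
function firstPresent(rawReplies: string[]): ReadResult | undefined {
  for (const raw of rawReplies) {
    const parsed = parseReadResult(raw);
    if (parsed && parsed.value !== null) return parsed;
  }
  return undefined;
}

const replies = [
  '{"value": null, "unit": "", "as_of": "", "footnote_refs": [], "confidence": "low"}',
  "Sorry, I could not find that value.",
  '{"value": 8400, "unit": "USD millions", "as_of": "2025-12-31", "footnote_refs": ["14"], "confidence": "high"}',
];
const firstHit = firstPresent(replies);
```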

Stage 4: validate, then trust

The validator does five things, in order:

1. Type check. Was a number returned where a number was asked for? Does the unit string parse?
2. Provenance check. Does the citation (page + bbox) actually contain the returned value? We verify by string-matching against the layout tree. A hallucinated citation fails here, by construction.
3. Magnitude sanity. Does the value fall within the historical range for this field for this company? A deviation of 100× or more triggers a re-read with a tighter context window.
4. Cross-field reconciliation. If the schema declares that long_term_debt must reconcile with the balance sheet total, we sum the related fields and check. If the two sides differ by more than a rounding error, we re-read both.
5. Confidence calibration. The model's self-reported confidence is mapped to a calibrated probability using a per-customer reliability diagram we maintain. Low calibrated confidence after re-reads is surfaced to the API consumer as “requires_review” rather than swallowed.
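The magnitude and reconciliation checks are plain arithmetic, which is part of why they never go through the model. A minimal sketch, assuming values are first normalized to base units; the function names, the scale table, and the figures are illustrative.

```typescript
// Assumed scale vocabulary; real filings use more variants than this.
const SCALE: Record<string, number> = { units: 1, thousands: 1e3, millions: 1e6, billions: 1e9 };

function toBaseUnits(value: number, scale: string): number | undefined {
  const factor = SCALE[scale];
  return factor === undefined ? undefined : value * factor;
}

// Flag a value whose ratio to the prior-period value is at least `factor`
// in either direction (the "billions vs. millions" class of error).
function magnitudeSuspicious(value: number, prior: number, factor = 100): boolean {
  if (value === 0 || prior === 0) return value !== prior;
  const ratio = Math.abs(value / prior);
  return ratio >= factor || ratio <= 1 / factor;
}

// Components should sum to the reported total within a rounding tolerance.
function reconciles(components: number[], total: number, tolerance: number): boolean {
  const sum = components.reduce((a, b) => a + b, 0);
  return Math.abs(sum - total) <= tolerance;
}

// Invented figures: last year's debt was 8.1 billion.
const priorDebt = 8.1e9;
const readAsMillions = magnitudeSuspicious(8.4e6, priorDebt);  // true: about 1000x too small
const readAsBillions = magnitudeSuspicious(8.4e9, priorDebt);  // false: same order of magnitude
const liabilitiesTie = reconciles([8400, 3100, 2500], 14000, 1); // true: components sum to the total
```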

The provenance check (step 2) is the one that most of our competitors don't do, and it's the one that closes nearly all of the hallucination gap. If the model invents a number, it has nowhere to put a citation, and we won't accept the value without one.
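A simplified sketch of what string-matching a value against its cited cell can look like. It assumes the comparison happens in printed units (before any millions/billions scaling) and handles only `$`, thousands separators, and parentheses-for-negative. `parsePrinted` and `provenanceHolds` are hypothetical names.

```typescript
// Parse one printed figure: strips $ and thousands separators and treats
// surrounding parentheses or a leading ASCII minus as a negative sign.
function parsePrinted(text: string): number | undefined {
  const t = text.trim();
  const negative = /^\(.*\)$/.test(t) || t.startsWith("-");
  const digits = t.replace(/[()$,\s-]/g, "");
  if (!/^\d+(\.\d+)?$/.test(digits)) return undefined;
  return (negative ? -1 : 1) * Number(digits);
}

type Citation = { page: number; cellText: string };

// The value passes only if some printed figure in the cited text equals it.
function provenanceHolds(value: number, citation: Citation): boolean {
  const tokens = citation.cellText.match(/\(?-?\$?\d[\d,]*(?:\.\d+)?\)?/g) ?? [];
  return tokens.some((tok) => parsePrinted(tok) === value);
}

const cited: Citation = { page: 47, cellText: "Long-term debt $ 8,400" };
const genuine = provenanceHolds(8400, cited);  // true: "8,400" is printed in the cell
const invented = provenanceHolds(9100, cited); // false: nothing in the cell reads 9,100
```

The check is deliberately dumb: it does not ask whether the number is plausible, only whether the cited location prints it.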

What we measured

We benchmarked against an annotated set of 412 SEC 10-K filings across the S&P 500, with 38 fields per filing graded by a panel of two CFA-holders (inter-annotator agreement κ = 0.94).

                                 LLM-only*    Old pipeline    New pipeline
Field-level accuracy               89.1%         96.3%           99.4%
Hallucination rate                  4.7%          0.8%           0.04%
Median latency / filing            18.2s         24.1s           11.6s
Cost / filing (model spend)        $0.47         $0.31           $0.09
Fields requiring human review       2.1%          1.4%            1.1%

* GPT-4-class model, single-shot, schema in prompt.

The numbers worth dwelling on:

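Hallucination rate dropped from 4.7% (LLM-only) to 0.04%. That's a roughly 117× reduction, or 20× against our old pipeline's 0.8%, and the residual 0.04% is dominated by adversarial typesetting in scanned filings (we have examples).
Cost dropped about 5× against LLM-only ($0.47 → $0.09) and about 3.4× against the old pipeline ($0.31 → $0.09). Counter-intuitive, but model spend is dominated by the tokens in each call's context, and we shrink that context before every LLM call. We spend more on layout and less on language, and the ratio favors us.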
Latency dropped despite adding two stages. Stages 1 and 2 are parallelizable across all fields in a single filing. The LLM calls in stage 3 are now small enough to batch aggressively.

What we'd do differently

Three things, looking back:

1. We treated the schema as an extraction contract too late. The first version of Extract let customers pass freeform field descriptions. Most of our accuracy gains came once we forced fields into a typed schema with locator hints and reconciliation rules. We should have started there.

2. We underestimated how much the layout stage would matter. About 60% of our engineering effort over the rewrite went into layout. We thought it would be 20%. The lesson is that for financial documents, structure is meaning, and the model can't recover what you don't hand it.

3. Calibration is its own product. Customers act differently on a 0.85 calibrated probability than a 0.85 model-reported confidence, because they've learned to distrust the latter. Publishing per-customer reliability diagrams alongside the API responses turned out to be the feature customers asked about most once we shipped it.
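For readers who haven't built one, a reliability table can be as simple as the empirical accuracy behind each self-reported label. The sketch below assumes a graded history of (label, correct) pairs and uses add-one smoothing so that sparse buckets don't report 0 or 1; the names, the smoothing choice, and the 0.9 review threshold are all illustrative, not a description of our production calibrator.

```typescript
type Label = "high" | "medium" | "low";
type Outcome = { label: Label; correct: boolean };

// Empirical accuracy per self-reported label, with add-one (Laplace) smoothing.
function fitCalibration(history: Outcome[]): Record<Label, number> {
  const counts: Record<Label, { n: number; ok: number }> = {
    high: { n: 0, ok: 0 },
    medium: { n: 0, ok: 0 },
    low: { n: 0, ok: 0 },
  };
  for (const o of history) {
    counts[o.label].n += 1;
    if (o.correct) counts[o.label].ok += 1;
  }
  const rate = (c: { n: number; ok: number }) => (c.ok + 1) / (c.n + 2);
  return { high: rate(counts.high), medium: rate(counts.medium), low: rate(counts.low) };
}

// Assumed policy: anything below the threshold goes to a human.
function needsReview(probability: number, threshold = 0.9): boolean {
  return probability < threshold;
}

// Invented grading history: 98 "high" readings all correct, 6 of 8 "medium" correct.
const outcomes = (label: Label, ok: number, bad: number): Outcome[] => [
  ...Array(ok).fill({ label, correct: true }),
  ...Array(bad).fill({ label, correct: false }),
];
const table = fitCalibration([...outcomes("high", 98, 0), ...outcomes("medium", 6, 2)]);
```

With this history a "medium" from the model maps to about 0.7, which is the kind of number a customer can actually set a threshold against.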

What this means for the API

For anyone building on top of Extract, the contract is simple:

Every returned field carries a citation. If there's no citation, the value is null. No exceptions.
Every returned value is reconcilable against the schema you provided. If reconciliation fails, the field comes back as requires_review with the conflicting evidence attached.
Confidence is calibrated probability, not model self-report.
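On the consumer side, that contract maps naturally onto a discriminated union. The response shape below is a hypothetical illustration, not the published Extract schema; field names such as `probability` and `conflicts` are assumptions.

```typescript
// Hypothetical response shape, for illustration only.
type Citation = { page: number; bbox: [number, number, number, number] };

type FieldResult =
  | { status: "ok"; value: number; unit: string; citation: Citation; probability: number }
  | {
      status: "requires_review";
      value: number | null;
      citation: Citation | null;
      probability: number;
      conflicts: string[]; // the conflicting evidence attached to the field
    }
  | { status: "not_found"; value: null; citation: null };

// Only hand a number downstream when it is cited and clears the caller's threshold.
function usableValue(r: FieldResult, minProbability = 0.95): number | undefined {
  return r.status === "ok" && r.probability >= minProbability ? r.value : undefined;
}

const okField: FieldResult = {
  status: "ok",
  value: 8400,
  unit: "USD millions",
  citation: { page: 47, bbox: [400, 210, 460, 222] },
  probability: 0.99,
};
const flaggedField: FieldResult = {
  status: "requires_review",
  value: 8400,
  citation: { page: 47, bbox: [400, 210, 460, 222] },
  probability: 0.62,
  conflicts: ["does not reconcile with balance_sheet_total_liabilities"],
};
```

Because `value` is only typed as a plain number in the `ok` variant, the compiler forces callers to look at `status` before using it.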

We think this is the floor for financial document APIs going forward. If your provider can't tell you where a number came from, you can't safely use it.

Daman is a founding engineer at HyperAPI. Comments, criticism, and weird filings welcome at engineering@hyperapi.ai.