Teams building automated pipelines for invoices, contracts, and forms need a metric that captures more than surface-level correctness. In practice, F1 Score has become a standard part of extraction accuracy benchmarking because it measures both how accurately and how completely a system captures structured data from documents. Understanding F1 Score, along with the Precision and Recall components behind it, gives practitioners a clearer way to benchmark systems, diagnose failures, and improve extraction quality.
OCR and document extraction systems face a fundamental challenge: raw text recognition is only one part of the problem. Even when characters are read correctly, a system still has to identify the right fields, assign values to the correct labels, and handle variable layouts without over-extracting or under-extracting data. Standard accuracy metrics can hide these failures, especially when documents are sparse or field distributions are uneven. F1 Score addresses this directly by penalizing both incorrect extractions and missed values within a single, interpretable number.
F1 Score and Why It Matters for Document Extraction
F1 Score is the harmonic mean of Precision and Recall, combining both metrics into a single value that reflects the overall quality of an extraction system's output. It is calculated using the formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The result is a score between 0 and 1, where 1 represents perfect extraction and 0 represents complete failure. Because F1 Score uses a harmonic mean rather than an arithmetic mean, it penalizes extreme imbalances between Precision and Recall — a system cannot achieve a high F1 Score by improving one metric at the expense of the other.
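To make the effect of the harmonic mean concrete, the sketch below compares it with a plain arithmetic mean for one balanced and one lopsided system. The function names and the Precision/Recall values are illustrative assumptions, not figures from a real benchmark.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; treated as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision: float, recall: float) -> float:
    """Plain average, shown only for comparison."""
    return (precision + recall) / 2


# Hypothetical systems: one balanced, one that trades recall for precision.
systems = {"balanced": (0.80, 0.80), "lopsided": (0.99, 0.10)}

for name, (p, r) in systems.items():
    print(f"{name}: arithmetic={arithmetic_mean(p, r):.3f}  f1={f1_score(p, r):.3f}")
```

For the lopsided system, the arithmetic mean stays above 0.5 while F1 falls below 0.2, which is the imbalance penalty described above.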
Why F1 Score Is More Reliable Than Accuracy Alone
Accuracy measures the proportion of all predictions that are correct, but in document extraction, this can be misleading. When most fields in a document are absent or null, a model that extracts nothing can still appear highly accurate simply because it correctly identified all the empty fields. F1 Score avoids this distortion by focusing only on the fields that matter — those that were extracted or should have been extracted.
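The distortion is easy to reproduce with a few counts. The sketch below assumes a hypothetical schema of 20 fields, of which only 2 are present in the document, and a model that extracts nothing. It computes F1 from the counts directly, which is algebraically equivalent to the harmonic-mean formula.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all outcomes, including true negatives, that were correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Count form of F1: 2*TP / (2*TP + FP + FN). True negatives are ignored."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical sparse document: 20 schema fields, 2 present, nothing extracted.
tp, fp, fn, tn = 0, 0, 2, 18

print(f"accuracy = {accuracy(tp, fp, fn, tn):.2f}")
print(f"f1       = {f1_from_counts(tp, fp, fn):.2f}")
```

Accuracy comes out at 0.90 because the 18 empty fields count as correct, while F1 is 0.00 because neither of the two real values was captured.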
F1 Score is especially useful in Intelligent Document Processing (IDP), OCR pipelines, and NLP-based extraction systems where tasks like named entity recognition and token classification are used to identify field boundaries, labels, and values across inconsistent document formats.
The following table compares F1 Score against related evaluation metrics to clarify when each is most appropriate in a document extraction context.
| Metric | Formula | What It Measures | Limitation in Document Extraction | Best Used When |
|---|---|---|---|---|
| Accuracy | Correct predictions ÷ Total predictions | Overall correctness across all outcomes | Inflated by true negatives in sparse documents | Fields are evenly distributed and all outcomes carry equal weight |
| Precision | TP ÷ (TP + FP) | Trustworthiness of extracted output | Does not penalize the model for missing fields entirely | Minimizing incorrect extractions is the primary concern |
| Recall | TP ÷ (TP + FN) | Completeness of extracted output | Does not penalize the model for over-extracting | Minimizing missed fields is the primary concern |
| F1 Score | 2 × (P × R) / (P + R) | Balanced extraction quality | Treats Precision and Recall as equally important | Both incorrect extractions and missed fields carry meaningful cost |
Precision and Recall as the Building Blocks of F1 Score
Precision and Recall are the two foundational components of F1 Score, each capturing a distinct dimension of extraction quality. Precision measures how trustworthy the extracted output is, while Recall measures how complete it is. Neither metric alone is sufficient — a system can achieve perfect Precision by extracting very few fields, or perfect Recall by extracting everything indiscriminately.
The Four Extraction Outcomes Explained
To calculate Precision and Recall, extraction results must first be classified into four outcome types. The table below defines each outcome in concrete document extraction terms and identifies its effect on the relevant metric.
| Outcome Type | What Happened During Extraction | Value Present in Document? | Document Extraction Example | Impact on Precision / Recall |
|---|---|---|---|---|
| True Positive (TP) | Model extracted a value that matches the ground truth | Yes | Model extracted "Invoice Total: $500" and the document contained "Invoice Total: $500" | Increases both Precision and Recall |
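| False Positive (FP) | Model extracted a value that does not match or was not expected | No, or a different value is present | Model extracted "Invoice Total: $750" but the correct value was "$500", or extracted a field that does not exist | Lowers Precision; a spurious field has no effect on Recall, but under the common convention a wrong value also leaves the true value uncaptured (a False Negative), which lowers Recall |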
| False Negative (FN) | Model failed to extract a value that was present in the document | Yes | Model did not extract the "Due Date" field, which was present in the document | Lowers Recall; no effect on Precision |
| True Negative (TN) | Model correctly did not extract a value that was not present | No | Model did not extract a "Discount" field, and no discount existed in the document | Not included in F1 Score calculation |
The True Negative row deserves particular attention: TN outcomes are excluded from the F1 Score formula entirely. This is intentional — in document extraction, correctly identifying absent fields does not demonstrate extraction capability and should not inflate performance scores.
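One way to turn these definitions into counts is to compare the extracted fields against the ground truth field by field. The sketch below is a minimal illustration, not a reference implementation: the function name, field names, and values are hypothetical, and it assumes exact string matching with no normalization. It also assumes that a wrong value counts as both a False Positive and a False Negative; some teams count it as a False Positive only, which changes Recall.

```python
def classify_outcomes(extracted: dict, ground_truth: dict, schema: list) -> dict:
    """Count TP/FP/FN/TN for one document.

    A missing key or a None value means the field is absent.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for field in schema:
        predicted = extracted.get(field)
        expected = ground_truth.get(field)
        if predicted is None and expected is None:
            counts["tn"] += 1          # correctly left empty
        elif predicted is None:
            counts["fn"] += 1          # present in the document, not extracted
        elif expected is None:
            counts["fp"] += 1          # extracted, but not in the document
        elif predicted == expected:
            counts["tp"] += 1          # exact match
        else:
            # Assumed convention: a wrong value is a spurious extraction (FP)
            # and also leaves the true value uncaptured (FN).
            counts["fp"] += 1
            counts["fn"] += 1
    return counts


# Hypothetical invoice schema and values, for illustration only.
schema = ["invoice_total", "due_date", "discount", "vendor_code"]
ground_truth = {"invoice_total": "$500", "due_date": "2024-03-01", "vendor_code": "V-17"}
extracted = {"invoice_total": "$500", "vendor_code": "V-71"}

counts = classify_outcomes(extracted, ground_truth, schema)
print(counts)
```

In this example the total matches (TP), the due date is missed (FN), the discount is correctly absent (TN), and the vendor code is wrong (FP and FN under the assumed convention).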
How Precision and Recall Are Calculated
With the four outcome types defined, Precision and Recall are calculated as follows:
- Precision = TP ÷ (TP + FP) — of all fields the model extracted, how many were correct
- Recall = TP ÷ (TP + FN) — of all fields that should have been extracted, how many were captured
Consider an invoice. If a model misses a "Payment Terms" field, that is a false negative that lowers Recall. If the model extracts a "Vendor Code" with an incorrect value, that is a false positive that lowers Precision. Both types of error reduce the final F1 Score, which is why the metric gives a more complete picture of extraction performance than either component alone.
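The two formulas translate directly into code. The one detail they leave open is what to do when a denominator is zero, for example when the model extracts nothing at all. The sketch below returns `None` in that case; this is an assumed convention (others report 0.0), and the counts are hypothetical.

```python
from typing import Optional, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[Optional[float], Optional[float]]:
    """Return (precision, recall), using None where a metric is undefined."""
    # Precision is undefined when the model extracted nothing (TP + FP == 0).
    precision = tp / (tp + fp) if (tp + fp) else None
    # Recall is undefined when nothing was expected (TP + FN == 0).
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Hypothetical counts: 5 correct fields, 1 incorrect extraction, 1 missed field.
print(precision_recall(tp=5, fp=1, fn=1))

# A model that extracts nothing from a document with 4 expected fields.
print(precision_recall(tp=0, fp=0, fn=4))
```

Returning `None` rather than a number keeps "the model was wrong" distinct from "the metric does not apply", which can matter when averaging scores across many documents.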
Calculating F1 Score for a Document Extraction System
Calculating F1 Score follows a consistent three-step process: classify extraction outcomes, compute Precision and Recall, then apply the F1 formula. In production settings, teams often formalize this process through structured evaluation workflows that compare extracted fields against labeled ground truth and track changes over time.
Step-by-Step F1 Calculation with an Invoice Example
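Scenario: An extraction model is evaluated against an invoice with 10 expected fields. The model extracts 9 fields in total: 8 match the ground truth and 1 carries an incorrect value. That leaves 2 of the 10 expected fields without a correct extraction: one was not extracted at all, and one is the field whose extracted value was wrong, which counts as both a False Positive and a False Negative.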
Step 1 — Classify outcomes:
- True Positives (TP): 8 — fields correctly extracted
- False Positives (FP): 1 — field extracted with an incorrect value
- False Negatives (FN): 2 — expected fields the model missed
Step 2 — Calculate Precision and Recall:
- Precision = 8 ÷ (8 + 1) = 8 ÷ 9 ≈ 0.889
- Recall = 8 ÷ (8 + 2) = 8 ÷ 10 = 0.800
Step 3 — Apply the F1 formula:
- F1 = 2 × (0.889 × 0.800) / (0.889 + 0.800) = 2 × 0.711 / 1.689 ≈ 0.842
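As a cross-check on the rounded intermediate values, the same score can be written directly in terms of the counts. Substituting the Precision and Recall definitions into the harmonic-mean formula and simplifying gives an equivalent count form:

```latex
\[
F_1 = \frac{2 \cdot P \cdot R}{P + R}
    = \frac{2\,TP}{2\,TP + FP + FN}
    = \frac{2 \cdot 8}{2 \cdot 8 + 1 + 2}
    = \frac{16}{19} \approx 0.842
\]
```

Both routes agree to three decimal places, so rounding Precision to 0.889 does not change the reported score.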
The table below consolidates all components of this worked example into a single reference.
| Metric Component | Definition in This Context | Value from This Example | Formula Applied |
|---|---|---|---|
| Total Expected Fields | All fields that should be present according to ground truth | 10 | — |
| True Positives (TP) | Fields the model extracted that matched the ground truth | 8 | Counted directly |
| False Positives (FP) | Fields the model extracted that did not match the ground truth | 1 | Counted directly |
| False Negatives (FN) | Expected fields the model failed to extract | 2 | Counted directly |
| Precision | Share of extracted fields that were correct | 0.889 | TP ÷ (TP + FP) = 8 ÷ 9 |
| Recall | Share of expected fields that were successfully extracted | 0.800 | TP ÷ (TP + FN) = 8 ÷ 10 |
| F1 Score | Harmonic mean of Precision and Recall | 0.842 | 2 × (0.889 × 0.800) / (0.889 + 0.800) |
An F1 Score of 0.842 indicates strong but imperfect extraction performance. Scores approaching 1.0 reflect high accuracy and completeness; scores below 0.5 typically indicate significant extraction failures requiring model or pipeline review.
Choosing Between Field-Level and Document-Level Scoring
F1 Score can be calculated at two different levels of granularity, and the choice between them affects both what the metric reveals and how it should be interpreted. The table below compares both approaches across key decision dimensions.
| Scoring Approach | Unit of Measurement | How F1 Is Calculated | Strengths | Limitations | Recommended Use Case |
|---|---|---|---|---|---|
| Field-Level Scoring | Individual extracted field | TP/FP/FN counted per field type across all documents | Isolates which specific fields underperform; enables targeted model improvement | More granular reporting overhead; requires field-by-field ground truth annotation | Model debugging, field-specific benchmarking, identifying systematic extraction failures |
| Document-Level Scoring | Entire document | Either strict (a document counts as correct only if every field is extracted correctly) or aggregated (TP/FP/FN are summed per document and an F1 is computed for each) | Reflects end-to-end pipeline performance; aligns with business SLA reporting | Can mask field-specific errors; under strict scoring, a single missed field fails the entire document | Business reporting, SLA compliance, overall pipeline health monitoring |
For most diagnostic and model improvement workflows, field-level scoring provides more useful insight. Document-level scoring is better suited to reporting contexts where the business outcome depends on complete, accurate document processing rather than individual field performance.
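The difference between the two views shows up even on a tiny evaluation set. The sketch below is illustrative only: the function names, field names, and values are hypothetical. It assumes exact-match comparison, absent fields represented as missing keys, the strict variant of document-level scoring, and a wrong value counted as both a False Positive and a False Negative.

```python
from collections import defaultdict


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Count form of F1: 2*TP / (2*TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def field_level_f1(documents: list) -> dict:
    """Per-field F1 over a list of (extracted, ground_truth) dict pairs."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for extracted, truth in documents:
        for field in set(extracted) | set(truth):
            predicted, expected = extracted.get(field), truth.get(field)
            if predicted is None and expected is None:
                continue                      # true negative, ignored by F1
            if predicted == expected:
                counts[field]["tp"] += 1
                continue
            if predicted is not None:
                counts[field]["fp"] += 1      # spurious or wrong value
            if expected is not None:
                counts[field]["fn"] += 1      # true value not captured
    return {field: f1_from_counts(**c) for field, c in counts.items()}


def document_exact_match_rate(documents: list) -> float:
    """Strict document-level score: share of documents with every field correct."""
    if not documents:
        return 0.0
    return sum(extracted == truth for extracted, truth in documents) / len(documents)


# Hypothetical evaluation set of three invoices: (extracted, ground truth).
documents = [
    ({"total": "500.00", "due_date": "2024-03-01"}, {"total": "500.00", "due_date": "2024-03-01"}),
    ({"total": "120.00"}, {"total": "120.00", "due_date": "2024-04-15"}),
    ({"total": "75.00", "due_date": "2024-05-01"}, {"total": "75.00", "due_date": "2024-05-10"}),
]

per_field = field_level_f1(documents)
exact_rate = document_exact_match_rate(documents)
print(per_field)
print(f"document exact-match rate = {exact_rate:.2f}")
```

Here the field-level view shows that `total` is extracted reliably and `due_date` is the weak spot, while the document-level view only reports that one document in three was fully correct.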
Final Thoughts
F1 Score provides a balanced, reliable measure of document extraction quality by combining Precision and Recall into a single metric that penalizes both incorrect extractions and missed fields. Understanding the four extraction outcomes — true positives, false positives, false negatives, and true negatives — is the foundation for calculating and interpreting F1 Score accurately. This becomes even more important when extracted data feeds downstream tasks such as multi-document summarization, where missing or incorrect fields can distort the final output across many files at once.