What is F1 Score for Document Extraction?

Teams building automated pipelines for invoices, contracts, and forms need a metric that captures more than surface-level correctness. In practice, F1 Score for document extraction has become a standard part of extraction accuracy benchmarking because it measures both how accurately and how completely a system captures structured data from documents. Understanding F1 Score — and the Precision and Recall components behind it — gives practitioners a clearer way to benchmark systems, diagnose failures, and improve extraction quality.

OCR and document extraction systems face a fundamental challenge: raw text recognition is only one part of the problem. Even when characters are read correctly, a system still has to identify the right fields, assign values to the correct labels, and handle variable layouts without over-extracting or under-extracting data. Standard accuracy metrics can hide these failures, especially when documents are sparse or field distributions are uneven. F1 Score addresses this directly by penalizing both incorrect extractions and missed values within a single, interpretable number.

F1 Score and Why It Matters for Document Extraction

F1 Score is the harmonic mean of Precision and Recall, combining both metrics into a single value that reflects the overall quality of an extraction system's output. It is calculated using the formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The result is a score between 0 and 1, where 1 represents perfect extraction and 0 represents complete failure. Because F1 Score uses a harmonic mean rather than an arithmetic mean, it penalizes extreme imbalances between Precision and Recall — a system cannot achieve a high F1 Score by improving one metric at the expense of the other.
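
To make the harmonic-mean behavior concrete, the short Python sketch below compares F1 with a simple arithmetic average. The function names and the Precision/Recall values are illustrative assumptions, not measurements from any real system.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision: float, recall: float) -> float:
    """Plain average, shown only for comparison."""
    return (precision + recall) / 2


# Illustrative values: a balanced system vs. a lopsided one.
balanced = f1_score(0.80, 0.80)              # about 0.80
lopsided = f1_score(1.00, 0.10)              # about 0.18
lopsided_avg = arithmetic_mean(1.00, 0.10)   # about 0.55

print(f"balanced F1={balanced:.2f}, lopsided F1={lopsided:.2f}, "
      f"lopsided arithmetic mean={lopsided_avg:.2f}")
```

The lopsided system looks acceptable under an arithmetic average but scores poorly on F1, which is the penalty for imbalance described above.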

Why F1 Score Is More Reliable Than Accuracy Alone

Accuracy measures the proportion of all predictions that are correct, but in document extraction, this can be misleading. When most fields in a document are absent or null, a model that extracts nothing can still appear highly accurate simply because it correctly identified all the empty fields. F1 Score avoids this distortion by focusing only on the fields that matter — those that were extracted or should have been extracted.
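
The distortion can be sketched with a hypothetical sparse document: a schema of 20 possible fields, of which only 2 are present, and a model that extracts nothing. The counts are assumptions chosen for illustration, and F1 is computed in the algebraically equivalent form 2·TP / (2·TP + FP + FN).

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all outcomes (including true negatives) that are correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; true negatives do not appear in the formula."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical sparse document: 20 schema fields, only 2 present,
# and the model extracts nothing at all.
tp, fp, fn, tn = 0, 0, 2, 18

sparse_accuracy = accuracy(tp, fp, fn, tn)   # 18 of 20 outcomes are "correct"
sparse_f1 = f1_from_counts(tp, fp, fn)       # no field was actually extracted

print(f"accuracy={sparse_accuracy:.2f}, F1={sparse_f1:.2f}")
# prints: accuracy=0.90, F1=0.00
```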

F1 Score is especially useful in Intelligent Document Processing (IDP), OCR pipelines, and NLP-based extraction systems where tasks like named entity recognition and token classification are used to identify field boundaries, labels, and values across inconsistent document formats.

The following table compares F1 Score against related evaluation metrics to clarify when each is most appropriate in a document extraction context.

Metric	Formula	What It Measures	Limitation in Document Extraction	Best Used When
Accuracy	Correct predictions ÷ Total predictions	Overall correctness across all outcomes	Inflated by true negatives in sparse documents	Fields are evenly distributed and all outcomes carry equal weight
Precision	TP ÷ (TP + FP)	Trustworthiness of extracted output	Does not penalize the model for missing fields entirely	Minimizing incorrect extractions is the primary concern
Recall	TP ÷ (TP + FN)	Completeness of extracted output	Does not penalize the model for over-extracting	Minimizing missed fields is the primary concern
F1 Score	2 × (P × R) / (P + R)	Balanced extraction quality	Treats Precision and Recall as equally important	Both incorrect extractions and missed fields carry meaningful cost

Precision and Recall as the Building Blocks of F1 Score

Precision and Recall are the two foundational components of F1 Score, each capturing a distinct dimension of extraction quality. Precision measures how trustworthy the extracted output is, while Recall measures how complete it is. Neither metric alone is sufficient: a system can reach near-perfect Precision by extracting only the few fields it is most confident about, or near-perfect Recall by extracting everything indiscriminately.

The Four Extraction Outcomes Explained

To calculate Precision and Recall, extraction results must first be classified into four outcome types. The table below defines each outcome in concrete document extraction terms and identifies its effect on the relevant metric.

Outcome Type	What Happened During Extraction	Value Present in Document?	Document Extraction Example	Impact on Precision / Recall
True Positive (TP)	Model extracted a value that matches the ground truth	Yes	Model extracted "Invoice Total: $500" and the document contained "Invoice Total: $500"	Increases both Precision and Recall
False Positive (FP)	Model extracted a value that does not match or was not expected	No, or present with a different value	Model extracted "Invoice Total: $750" but the correct value was "$500", or extracted a field that does not exist	Lowers Precision; no effect on Recall
False Negative (FN)	Model failed to extract a value that was present in the document	Yes	Model did not extract the "Due Date" field, which was present in the document	Lowers Recall; no effect on Precision
True Negative (TN)	Model correctly did not extract a value that was not present	No	Model did not extract a "Discount" field, and no discount existed in the document	Not included in F1 Score calculation

The True Negative row deserves particular attention: TN outcomes are excluded from the F1 Score formula entirely. This is intentional — in document extraction, correctly identifying absent fields does not demonstrate extraction capability and should not inflate performance scores.
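
The four outcomes can be sketched as a comparison between a predicted field dictionary and a ground-truth dictionary. This is a minimal sketch under stated assumptions: the function name, field names, and values are hypothetical, values are compared by exact string match, and a wrong value is counted as a false positive only, following the table above (some evaluation schemes score that case differently).

```python
from collections import Counter


def classify_fields(predicted: dict, expected: dict, schema: list) -> Counter:
    """Count TP/FP/FN/TN for one document over a fixed list of schema fields."""
    counts = Counter(tp=0, fp=0, fn=0, tn=0)
    for field in schema:
        pred = predicted.get(field)   # None means the model extracted nothing
        gold = expected.get(field)    # None means the field is absent in the document
        if pred is None and gold is None:
            counts["tn"] += 1         # correctly left empty
        elif pred is None:
            counts["fn"] += 1         # present in the document but missed
        elif gold is None or pred != gold:
            counts["fp"] += 1         # unexpected field or wrong value
        else:
            counts["tp"] += 1         # exact match with ground truth
    return counts


# Hypothetical invoice schema and values, for illustration only.
schema = ["invoice_total", "due_date", "vendor_name", "discount"]
expected = {"invoice_total": "$500", "due_date": "2024-03-01", "vendor_name": "Acme"}
predicted = {"invoice_total": "$750", "vendor_name": "Acme"}

outcome_counts = classify_fields(predicted, expected, schema)
print(dict(outcome_counts))
# prints: {'tp': 1, 'fp': 1, 'fn': 1, 'tn': 1}
```

Exact string matching is the strictest choice; real pipelines often normalize values (dates, currency formatting, whitespace) before comparing.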

How Precision and Recall Are Calculated

With the four outcome types defined, Precision and Recall are calculated as follows:

Precision = TP ÷ (TP + FP) — of all fields the model extracted, how many were correct
Recall = TP ÷ (TP + FN) — of all fields that should have been extracted, how many were captured

For example, on an invoice: if a model misses a "Payment Terms" field that is present in the document, that is a false negative, which lowers Recall. If the model extracts a "Vendor Code" with an incorrect value, that is a false positive, which lowers Precision. Both types of errors reduce the final F1 Score, which is why the metric provides a more complete picture of extraction performance than either component alone.
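
The two formulas translate directly into code. The sketch below uses hypothetical counts and one assumption worth flagging: when a denominator is zero (nothing extracted, or nothing expected), the functions return 0.0, which is a common convention rather than a rule stated in this article.

```python
def precision(tp: int, fp: int) -> float:
    """Of all fields the model extracted, the share that were correct."""
    extracted = tp + fp
    return tp / extracted if extracted else 0.0   # assumed convention for no extractions


def recall(tp: int, fn: int) -> float:
    """Of all fields that should have been extracted, the share that were captured."""
    expected = tp + fn
    return tp / expected if expected else 0.0     # assumed convention for no expected fields


# Hypothetical invoice: 5 correct fields, 1 wrong "Vendor Code" (FP),
# 1 missed "Payment Terms" (FN).
p = precision(tp=5, fp=1)
r = recall(tp=5, fn=1)

print(f"precision={p:.3f}, recall={r:.3f}")
# prints: precision=0.833, recall=0.833
```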

Calculating F1 Score for a Document Extraction System

Calculating F1 Score follows a consistent three-step process: classify extraction outcomes, compute Precision and Recall, then apply the F1 formula. In production settings, teams often formalize this process through structured evaluation workflows that compare extracted fields against labeled ground truth and track changes over time.

Step-by-Step F1 Calculation with an Invoice Example

Scenario: An extraction model is evaluated against an invoice with 10 expected fields. The model extracts 9 fields in total: 8 match the ground truth, and 1 is incorrect because it does not correspond to any field in the ground truth. The remaining 2 expected fields were not extracted at all.

Step 1 — Classify outcomes:

True Positives (TP): 8 — fields correctly extracted
False Positives (FP): 1 — field extracted with an incorrect value
False Negatives (FN): 2 — expected fields the model missed

Step 2 — Calculate Precision and Recall:

Precision = 8 ÷ (8 + 1) = 8 ÷ 9 ≈ 0.889
Recall = 8 ÷ (8 + 2) = 8 ÷ 10 = 0.800

Step 3 — Apply the F1 formula:

F1 = 2 × (0.889 × 0.800) / (0.889 + 0.800) = 2 × 0.711 / 1.689 ≈ 0.842

The table below consolidates all components of this worked example into a single reference.

Metric Component	Definition in This Context	Value from This Example	Formula Applied
Total Expected Fields	All fields that should be present according to ground truth	10	—
True Positives (TP)	Fields the model extracted that matched the ground truth	8	Counted directly
False Positives (FP)	Fields the model extracted that did not match the ground truth	1	Counted directly
False Negatives (FN)	Expected fields the model failed to extract	2	Counted directly
Precision	Share of extracted fields that were correct	0.889	TP ÷ (TP + FP) = 8 ÷ 9
Recall	Share of expected fields that were successfully extracted	0.800	TP ÷ (TP + FN) = 8 ÷ 10
F1 Score	Harmonic mean of Precision and Recall	0.842	2 × (0.889 × 0.800) / (0.889 + 0.800)

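An F1 Score of 0.842 indicates strong but imperfect extraction performance: about one extracted field in nine is wrong, and two expected fields in ten are missed. As a rough rule of thumb, scores approaching 1.0 reflect high accuracy and completeness, while scores below 0.5 usually point to significant extraction failures that warrant model or pipeline review. Acceptable thresholds depend on the document type and the cost of errors.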
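
The worked example can be reproduced in a few lines of Python. The counts come from the scenario above; the variable names are arbitrary.

```python
# Counts from the invoice scenario: 8 correct, 1 incorrect, 2 missed.
tp, fp, fn = 8, 1, 2

precision_value = tp / (tp + fp)   # 8 / 9
recall_value = tp / (tp + fn)      # 8 / 10
f1_value = 2 * precision_value * recall_value / (precision_value + recall_value)

print(f"Precision={precision_value:.3f} Recall={recall_value:.3f} F1={f1_value:.3f}")
# prints: Precision=0.889 Recall=0.800 F1=0.842
```

Working from unrounded values gives 16/19, which rounds to the same 0.842 obtained from the rounded intermediate figures.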

Choosing Between Field-Level and Document-Level Scoring

F1 Score can be calculated at two different levels of granularity, and the choice between them affects both what the metric reveals and how it should be interpreted. The table below compares both approaches across key decision dimensions.

Scoring Approach	Unit of Measurement	How F1 Is Calculated	Strengths	Limitations	Recommended Use Case
Field-Level Scoring	Individual extracted field	TP/FP/FN counted per field type across all documents	Isolates which specific fields underperform; enables targeted model improvement	More granular reporting overhead; requires field-by-field ground truth annotation	Model debugging, field-specific benchmarking, identifying systematic extraction failures
Document-Level Scoring	Entire document	A document is scored as correct only if all fields are extracted correctly, or TP/FP/FN are aggregated per document	Reflects end-to-end pipeline performance; aligns with business SLA reporting	Can mask field-specific errors; a single missed field fails the entire document	Business reporting, SLA compliance, overall pipeline health monitoring

For most diagnostic and model improvement workflows, field-level scoring provides more useful insight. Document-level scoring is better suited to reporting contexts where the business outcome depends on complete, accurate document processing rather than individual field performance.
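
The difference between the two approaches can be sketched as follows. This is a simplified illustration, not a reference implementation: the function names and sample documents are hypothetical, field values are assumed to be non-empty strings compared by exact match, and document-level scoring uses the strict "all fields correct" definition from the table.

```python
from collections import defaultdict


def field_level_f1(documents: list) -> dict:
    """documents: list of (predicted, expected) dict pairs. Returns {field: F1}."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for predicted, expected in documents:
        for field in set(predicted) | set(expected):
            pred, gold = predicted.get(field), expected.get(field)
            if pred is None:
                counts[field]["fn"] += 1      # expected but not extracted
            elif pred == gold:
                counts[field]["tp"] += 1      # exact match
            else:
                counts[field]["fp"] += 1      # wrong value or unexpected field
    scores = {}
    for field, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        scores[field] = 2 * c["tp"] / denom if denom else 0.0
    return scores


def document_exact_match_rate(documents: list) -> float:
    """Share of documents where every field matches the ground truth."""
    if not documents:
        return 0.0
    return sum(pred == gold for pred, gold in documents) / len(documents)


# Hypothetical evaluation set of three invoices.
docs = [
    ({"total": "$500", "due_date": "2024-03-01"}, {"total": "$500", "due_date": "2024-03-01"}),
    ({"total": "$500"}, {"total": "$500", "due_date": "2024-04-15"}),
    ({"total": "$90", "due_date": "2024-05-01"}, {"total": "$90", "due_date": "2024-05-01"}),
]

per_field = field_level_f1(docs)
doc_rate = document_exact_match_rate(docs)
print(sorted(per_field.items()), round(doc_rate, 3))
```

In this sample, the field-level view shows that "total" is extracted reliably while "due_date" is the weak spot; the document-level view only shows that one of three documents failed.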

Final Thoughts

F1 Score provides a balanced, reliable measure of document extraction quality by combining Precision and Recall into a single metric that penalizes both incorrect extractions and missed fields. Understanding the four extraction outcomes — true positives, false positives, false negatives, and true negatives — is the foundation for calculating and interpreting F1 Score accurately. This becomes even more important when extracted data feeds downstream tasks such as multi-document summarization, where missing or incorrect fields can distort the final output across many files at once.

