
Precision And Recall In OCR

Precision and recall are two foundational metrics for measuring how well an Optical Character Recognition (OCR) system performs against a known reference. Without them, there is no objective way to tell whether an OCR pipeline produces output that is both accurate and complete. Understanding how to calculate, interpret, and balance precision and recall is essential for any team that builds or maintains document processing workflows and wants consistent benchmarks for extraction accuracy.

What Precision and Recall Measure in OCR

Precision and recall are complementary metrics that describe the quality of OCR output from two distinct angles. Precision answers the question: "Of everything the OCR system returned, how much of it was correct?" Recall answers a different question: "Of everything that should have been returned, how much did the OCR system actually capture?"

Both metrics compare OCR output against ground truth data — a verified, human-confirmed reference version of the text in a document. This comparison can be performed at different levels of granularity depending on the requirements of the use case.

The table below presents precision and recall as parallel concepts, making their structural similarities and key differences immediately visible.

| Attribute | Precision | Recall |
| --- | --- | --- |
| **Definition** | Percentage of returned text that is correct | Percentage of correct text that was returned |
| **What It Measures** | Accuracy of OCR output | Completeness of OCR output |
| **Error Type It Penalizes** | False positives (incorrectly recognized text) | False negatives (missed or omitted text) |
| **Formula Input** | True Positives + False Positives | True Positives + False Negatives |
| **Example Failure Mode** | OCR returns "lnvoice" instead of "Invoice" | OCR skips a word or line entirely |
| **Evaluation Level** | Character, word, or field | Character, word, or field |

Precision focuses on avoiding false positives — characters or words the OCR system returned that do not match the ground truth. A system with low precision produces noisy, unreliable output. Recall focuses on avoiding false negatives — characters or words that exist in the ground truth but were missed entirely by the OCR system. A system with low recall produces incomplete output.

Both metrics can be applied at the character level (useful for evaluating raw recognition accuracy), the word level (useful for search, indexing, and downstream document retrieval systems), or the field level (useful for structured form extraction where specific data fields must be captured correctly). In more structured pipelines, field-level evaluation often overlaps with token classification tasks, where the goal is not just to read text correctly but to assign it to the right semantic label. Choosing the right evaluation level depends on the downstream use of the extracted text. A legal contract review system may require field-level precision, while a full-text search index may be better evaluated at the word level.
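
As a rough illustration of field-level evaluation, the sketch below counts outcomes for a set of extracted fields. The field names, the sample values, and the helper `field_level_counts` are hypothetical. Two assumptions are built in: values match only on exact string equality after trimming whitespace, and a field returned with the wrong value counts as both a false positive and a false negative.

```python
def field_level_counts(predicted: dict[str, str], truth: dict[str, str]) -> tuple[int, int, int]:
    """Return (tp, fp, fn) for field-level evaluation."""
    # A field is a true positive only if it exists in the ground truth
    # and its value matches exactly (assumed matching rule).
    tp = sum(
        1
        for name, value in predicted.items()
        if name in truth and value.strip() == truth[name].strip()
    )
    fp = len(predicted) - tp  # returned fields that are wrong or unexpected
    fn = len(truth) - tp      # ground-truth fields not captured correctly
    return tp, fp, fn


# Hypothetical invoice fields: one correct, one wrong value, one missing.
truth = {"invoice_number": "10482", "date": "2024-01-15", "total": "4500.00"}
predicted = {"invoice_number": "10482", "date": "2024-01-16"}

tp, fp, fn = field_level_counts(predicted, truth)
print(f"TP={tp} FP={fp} FN={fn}")
```

The same counting logic applies at the word or character level; only the unit being compared changes.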

How to Calculate Precision and Recall for OCR

Calculating precision and recall requires comparing the OCR output with the ground truth unit by unit (character, word, or field) and counting three kinds of outcome: true positives, false positives, and false negatives.

Defining the Input Variables

Before applying the formulas, it is important to understand what each input variable means in an OCR context. The table below defines each term, provides a concrete example, and identifies which formula it appears in.

| Term | What It Means in OCR | Example | Used In |
| --- | --- | --- | --- |
| **True Positive (TP)** | A word the OCR system returned that matches the ground truth | OCR outputs "invoice"; ground truth is "invoice" → match | Precision, Recall |
| **False Positive (FP)** | A word the OCR system returned that does not match the ground truth | OCR outputs "lnvoice"; ground truth is "invoice" → mismatch | Precision only |
| **False Negative (FN)** | A word present in the ground truth that the OCR system did not return | Ground truth contains "total"; OCR output omits it entirely | Recall only |
| **True Negative (TN)** | A non-text region the OCR system correctly ignored | Background whitespace not extracted | Not used in OCR precision/recall: the universe of non-text is unbounded and not meaningful to count |

The Formulas

With these terms defined, the formulas follow directly:

  • Precision = True Positives / (True Positives + False Positives)
  • Recall = True Positives / (True Positives + False Negatives)
  • F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

The F1 Score is a single balanced metric that combines both precision and recall into one value. It is particularly useful when neither metric should be prioritized over the other, and it penalizes systems that perform well on one metric while performing poorly on the other.
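
The three formulas translate directly into a few lines of code. The following is a minimal sketch; the function name `precision_recall_f1` and the sample counts are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention rather than a standard.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts."""
    # Guard against division by zero (assumption: report 0.0 in that case).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts: 90 correct words, 10 wrong words, 30 missed words.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.900 recall=0.750 f1=0.818
```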

Worked Example: Word-by-Word Comparison

The following example shows how these classifications work in practice. Assume a scanned invoice is processed by an OCR system, and the output is compared word-by-word against the ground truth. The table below shows the comparison for a ten-word segment.

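| Word # | Ground Truth Word | OCR Output Word | Classification | Notes |
| --- | --- | --- | --- | --- |
| 1 | Invoice | Invoice | True Positive | Exact match |
| 2 | Number | Number | True Positive | Exact match |
| 3 | 10482 | 10482 | True Positive | Exact match |
| 4 | Date | 0ate | False Positive + False Negative | Character substitution: "D" misread as "0" |
| 5 | 2024-01-15 | 2024-01-15 | True Positive | Exact match |
| 6 | Total | Total | True Positive | Exact match |
| 7 | Due | Due | True Positive | Exact match |
| 8 | Amount | *(omitted)* | False Negative | Word skipped entirely by OCR engine |
| 9 | 4500.00 | 4500.00 | True Positive | Exact match |
| 10 | USD | U5D | False Positive + False Negative | Character substitution: "S" misread as "5" |
| **Totals** | **10 words** | **9 words returned** | **TP: 7, FP: 2, FN: 3** | |

A substituted word counts twice. The misread word is a false positive because it was returned but is wrong, and the ground-truth word it replaced is a false negative because it was never captured. As a check, TP + FP equals the 9 words returned, and TP + FN equals the 10 ground-truth words.

Using the totals from the summary row:

  • Precision = 7 / (7 + 2) = 7/9 ≈ 0.778 (77.8%)
  • Recall = 7 / (7 + 3) = 7/10 = 0.700 (70.0%)
  • F1 Score = 2 × (0.778 × 0.700) / (0.778 + 0.700) ≈ 0.737 (73.7%)

This example shows that most of what the system returned was correct (77.8% precision), but it captured only 7 of the 10 ground-truth words (70.0% recall). The F1 Score of 73.7% reflects the combined performance, and the worked table makes it straightforward to replicate this process with any OCR output and ground truth pair.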
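
A word-by-word comparison like this can be automated by aligning the two word sequences first, so that an omitted word does not shift every later comparison. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as one possible aligner. The helper name `classify_words` and the sample strings are made up, and the sketch assumes the convention that a substituted word adds one false positive (the wrong word returned) and one false negative (the ground-truth word not captured).

```python
from difflib import SequenceMatcher


def classify_words(truth: list[str], ocr: list[str]) -> tuple[int, int, int]:
    """Align two word lists and return (tp, fp, fn)."""
    tp = fp = fn = 0
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, t_start, t_end, o_start, o_end in matcher.get_opcodes():
        if tag == "equal":
            tp += t_end - t_start
        else:
            # "replace" spans both sides, "delete" only the ground truth,
            # "insert" only the OCR output; the empty side contributes 0.
            fn += t_end - t_start
            fp += o_end - o_start
    return tp, fp, fn


# Hypothetical sample: one substitution ("Maln") and one omission ("Springfield").
truth = "Ship to 42 Main Street Springfield".split()
ocr = "Ship to 42 Maln Street".split()

tp, fp, fn = classify_words(truth, ocr)
print(f"TP={tp} FP={fp} FN={fn}")
```

Because the function accepts any sequence of strings, passing lists of characters instead of words gives a character-level tally.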

Balancing the Precision-Recall Tradeoff in OCR Systems

Precision and recall exist in tension with each other in most OCR systems. Adjusting the system to return more text generally increases recall but introduces more errors, lowering precision. Conversely, restricting output to only high-confidence results improves precision but risks missing valid content, reducing recall.

How Confidence Thresholds Shift the Balance

Most OCR engines assign a confidence score to each recognized character or word. Raising the confidence threshold means the system only returns results it is highly certain about:

  • Higher threshold → Fewer results returned → Higher precision, lower recall
  • Lower threshold → More results returned → Higher recall, lower precision

The right threshold depends entirely on the use case. There is no universally correct setting.
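
To make the threshold effect concrete, the sketch below filters a handful of scored words at several thresholds and recomputes both metrics. The `ScoredWord` type, the confidence values, and the correctness labels are invented for illustration; real engines expose confidence in their own formats. Reporting precision as 0.0 when nothing is returned is an assumed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredWord:
    text: str
    confidence: float  # engine-reported confidence in [0, 1]
    correct: bool      # whether the word matches the ground truth


def precision_recall_at(
    words: list[ScoredWord], ground_truth_count: int, threshold: float
) -> tuple[float, float]:
    """Keep words at or above the threshold, then compute precision and recall."""
    kept = [w for w in words if w.confidence >= threshold]
    tp = sum(1 for w in kept if w.correct)
    # Precision is undefined when nothing is returned; report 0.0 by assumption.
    precision = tp / len(kept) if kept else 0.0
    # Correct words dropped by the threshold become misses, lowering recall.
    recall = tp / ground_truth_count if ground_truth_count else 0.0
    return precision, recall


# Hypothetical OCR output for a six-word ground truth.
words = [
    ScoredWord("Invoice", 0.99, True),
    ScoredWord("Number", 0.97, True),
    ScoredWord("0ate", 0.55, False),
    ScoredWord("Total", 0.91, True),
    ScoredWord("Due", 0.62, True),
    ScoredWord("U5D", 0.71, False),
]

for threshold in (0.5, 0.6, 0.8):
    p, r = precision_recall_at(words, ground_truth_count=6, threshold=threshold)
    print(f"threshold={threshold:.1f} precision={p:.3f} recall={r:.3f}")
```

In this made-up sample, raising the threshold removes both misread words but also drops a correct low-confidence word, so precision rises while recall falls.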

Matching Metric Priority to Use Case

The table below maps common OCR deployment scenarios to their recommended metric priority, the business rationale behind that priority, and the corresponding threshold approach.

| Use Case / Industry | Priority Metric | Reason for Priority | Recommended Threshold Approach |
| --- | --- | --- | --- |
| Legal document processing | Precision | Errors in legal text carry liability and compliance risk | Raise threshold: accept fewer but more accurate results |
| Regulatory / compliance filing | Precision | Incorrect data in filings can trigger penalties | Raise threshold: prioritize correctness over completeness |
| Large-scale data extraction | Recall | Missing records reduce dataset completeness and downstream model quality | Lower threshold: capture more content, filter errors post-processing |
| Medical records digitization | Precision | Misread clinical terms or dosages pose patient safety risks | Raise threshold: human review of low-confidence output recommended |
| E-commerce catalog digitization | Recall | Missing product attributes reduce searchability and conversion | Lower threshold: completeness drives discoverability |
| General-purpose document indexing | Balanced (F1) | Neither accuracy nor completeness strongly dominates | Tune threshold to maximize F1 Score across the document corpus |

In high-risk workflows, threshold tuning is often paired with human validation pipelines so low-confidence extractions can be reviewed before they affect downstream systems. That combination helps teams protect precision without giving up entirely on recall.
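
One simple way to pair a threshold with human validation is to route each extraction by its confidence instead of discarding it. The sketch below is illustrative only: `route_by_confidence`, the sample words, and the 0.9 threshold are assumptions, not part of any particular OCR product.

```python
def route_by_confidence(
    words: list[tuple[str, float]], threshold: float
) -> tuple[list[str], list[str]]:
    """Split (text, confidence) pairs into auto-accepted and review queues."""
    accepted: list[str] = []
    needs_review: list[str] = []
    for text, confidence in words:
        if confidence >= threshold:
            accepted.append(text)
        else:
            # Low-confidence text is kept for a human reviewer rather than dropped,
            # so it can still count toward recall once confirmed or corrected.
            needs_review.append(text)
    return accepted, needs_review


# Hypothetical extraction from a medical record.
words = [("Dosage", 0.98), ("5mg", 0.64), ("daily", 0.93)]
accepted, needs_review = route_by_confidence(words, threshold=0.9)
print("accepted:", accepted)
print("needs review:", needs_review)
```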

Techniques for Improving Precision and Recall

Beyond threshold adjustment, several techniques can raise the baseline for both precision and recall. The table below organizes these methods by which metric they improve, how they work, when to apply them, and the relative effort required.

| Technique | Metric(s) Improved | How It Helps | Best Applied When | Complexity / Effort |
| --- | --- | --- | --- | --- |
| **Denoising** | Both | Removes background noise that causes the OCR engine to misread or skip characters | Source documents are low-resolution scans or photocopies | Low |
| **Deskewing** | Both | Corrects rotated or tilted text so the OCR engine can segment lines accurately | Documents were scanned at an angle or contain skewed columns | Low |
| **Contrast Enhancement** | Both | Increases the distinction between text and background, reducing misreads | Documents have faded ink, poor lighting, or low contrast | Low |
| **Binarization** | Both | Converts grayscale images to black-and-white to simplify character boundaries | Mixed-background or colored documents where text blends into the page | Low–Medium |
| **Confidence Threshold Adjustment** | Precision or Recall (not both simultaneously) | Shifts the balance between returning more results (recall) or more accurate results (precision) | Baseline metrics are established and a specific metric needs targeted improvement | Low |
| **Domain-Specific Model Training** | Both | Fine-tunes the OCR model on vocabulary, fonts, and layouts specific to the target document type | Standard models underperform on specialized documents such as medical forms or legal contracts | High |
| **Model Selection** | Both | Choosing a model architecture suited to the document type raises the performance ceiling before any tuning | Evaluating a new OCR pipeline or replacing an underperforming engine | Medium |
Image preprocessing techniques (denoising, deskewing, contrast enhancement, and binarization) are a good starting point because they can improve both metrics at once with relatively low implementation effort. Threshold adjustment is a fast, low-cost lever, but it only shifts the balance between the two metrics rather than raising both. Domain-specific training can deliver the largest gains, but it requires labeled training data and a significant time investment.
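
As an illustration of what binarization does, the sketch below picks a global threshold with Otsu's method and maps each pixel to black or white. Otsu's method is one common choice, named here as an assumption; the article does not prescribe a specific algorithm. The pixel values are made up, and a real pipeline would typically rely on an image-processing library rather than plain Python lists.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the 0-255 level that maximizes between-class variance (Otsu's method)."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_level, best_variance = 0, -1.0
    dark_count = 0  # pixels at or below the candidate level
    dark_sum = 0    # sum of their intensities
    for level in range(256):
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        light_count = total - dark_count
        if dark_count == 0:
            continue  # no pixels in the dark class yet
        if light_count == 0:
            break     # every pixel is already in the dark class
        dark_mean = dark_sum / dark_count
        light_mean = (total_sum - dark_sum) / light_count
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(pixels: list[int], threshold: int) -> list[int]:
    """Map pixels at or below the threshold to 0 (black) and the rest to 255 (white)."""
    return [0 if value <= threshold else 255 for value in pixels]


# Hypothetical grayscale row: dark ink pixels followed by light paper pixels.
pixels = [12, 20, 25, 30, 200, 210, 220, 230]
threshold = otsu_threshold(pixels)
print(threshold, binarize(pixels, threshold))
```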

Final Thoughts

Precision and recall provide a structured, objective way to evaluate OCR system performance, measuring accuracy and completeness respectively against a verified ground truth. Calculating these metrics using true positives, false positives, and false negatives — and combining them into an F1 Score — gives teams a reproducible method for benchmarking and comparing OCR pipelines. The precision-recall tradeoff is not a flaw to be eliminated but a configuration decision to be made deliberately, guided by the specific requirements of each use case and addressed through a combination of preprocessing, threshold tuning, and model selection.

