Precision and recall are two foundational evaluation metrics for measuring how well an Optical Character Recognition (OCR) system performs against a known reference. Without them, there is no objective way to tell whether an OCR pipeline produces output that is both accurate and complete. Understanding how to calculate, interpret, and balance precision and recall is essential for any team that builds or maintains document processing workflows and wants consistent benchmarks for extraction accuracy.
What Precision and Recall Measure in OCR
Precision and recall are complementary metrics that describe the quality of OCR output from two distinct angles. Precision answers the question: "Of everything the OCR system returned, how much of it was correct?" Recall answers a different question: "Of everything that should have been returned, how much did the OCR system actually capture?"
Both metrics compare OCR output against ground truth data — a verified, human-confirmed reference version of the text in a document. This comparison can be performed at different levels of granularity depending on the requirements of the use case.
The table below compares precision and recall side by side.
| Attribute | Precision | Recall |
|---|---|---|
| **Definition** | Percentage of returned text that is correct | Percentage of ground truth text that was correctly returned |
| **What It Measures** | Accuracy of OCR output | Completeness of OCR output |
| **Error Type It Penalizes** | False positives (incorrectly recognized text) | False negatives (missed or omitted text) |
| **Formula Denominator** | True Positives + False Positives | True Positives + False Negatives |
| **Example Failure Mode** | OCR returns "lnvoice" instead of "Invoice" | OCR skips a word or line entirely |
| **Evaluation Level** | Character, word, or field | Character, word, or field |
Precision focuses on avoiding false positives — characters or words the OCR system returned that do not match the ground truth. A system with low precision produces noisy, unreliable output. Recall focuses on avoiding false negatives — characters or words that exist in the ground truth but were missed entirely by the OCR system. A system with low recall produces incomplete output.
Both metrics can be applied at the character level (useful for evaluating raw recognition accuracy), the word level (useful for search, indexing, and downstream document retrieval systems), or the field level (useful for structured form extraction where specific data fields must be captured correctly). In more structured pipelines, field-level evaluation often overlaps with token classification tasks, where the goal is not just to read text correctly but to assign it to the right semantic label. Choosing the right evaluation level depends on the downstream use of the extracted text. A legal contract review system may require field-level precision, while a full-text search index may be better evaluated at the word level.
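To make the field level concrete, here is a minimal sketch that counts outcomes for a structured extraction. The field names, values, and the `field_level_counts` helper are hypothetical and chosen only for illustration. The sketch assumes a field is a true positive only when both its name and its exact value match the ground truth.

```python
def field_level_counts(ground_truth: dict[str, str],
                       extracted: dict[str, str]) -> tuple[int, int, int]:
    """Return (tp, fp, fn) for an exact-match, field-level comparison."""
    # A field counts as correct only if the name exists in the ground truth
    # and the extracted value is identical to the reference value.
    tp = sum(1 for name, value in extracted.items()
             if ground_truth.get(name) == value)
    fp = len(extracted) - tp      # returned fields that are wrong or unexpected
    fn = len(ground_truth) - tp   # reference fields not recovered correctly
    return tp, fp, fn


# Hypothetical invoice fields (assumed names and values).
ground_truth = {
    "invoice_number": "10482",
    "invoice_date": "2024-01-15",
    "total_due": "4500.00",
    "currency": "USD",
}
extracted = {
    "invoice_number": "10482",
    "invoice_date": "2024-01-1S",  # misread final character
    "total_due": "4500.00",
    # "currency" was not extracted at all
}

tp, fp, fn = field_level_counts(ground_truth, extracted)
print(tp, fp, fn)  # 2 1 2
```

Exact matching is the strictest choice. A real pipeline might normalize whitespace, case, or date formats before comparing, depending on what the downstream system tolerates.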
How to Calculate Precision and Recall for OCR
Calculating precision and recall requires comparing the OCR output with the ground truth unit by unit (character, word, or field) and counting three kinds of outcome: true positives, false positives, and false negatives.
Defining the Input Variables
Before applying the formulas, it is important to understand what each input variable means in an OCR context. The table below defines each term, provides a concrete example, and identifies which formula it appears in.
| Term | What It Means in OCR | Example | Used In |
|---|---|---|---|
| **True Positive (TP)** | A word the OCR system returned that matches the ground truth | OCR outputs "invoice"; ground truth is "invoice" → match | Precision, Recall |
| **False Positive (FP)** | A word the OCR system returned that does not match the ground truth | OCR outputs "lnvoice"; ground truth is "invoice" → mismatch | Precision only |
| **False Negative (FN)** | A word present in the ground truth that the OCR system did not return | Ground truth contains "total"; OCR output omits it entirely | Recall only |
| **True Negative (TN)** | A non-text region the OCR system correctly ignored | Background whitespace not extracted | Not used in OCR precision/recall — the universe of non-text is unbounded and not meaningful to count |
The Formulas
With these terms defined, the formulas follow directly:
- Precision = True Positives / (True Positives + False Positives)
- Recall = True Positives / (True Positives + False Negatives)
- F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
The F1 Score is the harmonic mean of precision and recall, combining both into a single balanced value. It is particularly useful when neither metric should be prioritized over the other. Because the harmonic mean is pulled toward the lower of the two values, it penalizes systems that perform well on one metric while performing poorly on the other.
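The three formulas translate directly into a few lines of code. The sketch below uses a hypothetical helper name, `precision_recall_f1`, and made-up counts. Returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw outcome counts."""
    # Guard each division so empty output or empty ground truth yields 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.900 recall=0.750 f1=0.818

# A lopsided system: perfect precision, but only half the text recovered.
p2, r2, f1_lopsided = precision_recall_f1(tp=50, fp=0, fn=50)
print(f"f1={f1_lopsided:.3f}")  # f1=0.667
```

In the second case a simple average of precision and recall would be 0.75, while F1 comes out lower at about 0.667, which shows how the score penalizes imbalance.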
Worked Example: Word-by-Word Comparison
The following example shows how these classifications work in practice. Assume a scanned invoice is processed by an OCR system, and the output is compared word-by-word against the ground truth. The table below shows the comparison for a ten-word segment.
| Word # | Ground Truth Word | OCR Output Word | Classification | Notes |
|---|---|---|---|---|
| 1 | Invoice | Invoice | True Positive | Exact match |
| 2 | Number | Number | True Positive | Exact match |
| 3 | 10482 | 10482 | True Positive | Exact match |
| 4 | Date | 0ate | False Positive + False Negative | Character substitution: "D" misread as "0". The returned "0ate" is wrong (FP) and "Date" was never recovered (FN) |
| 5 | 2024-01-15 | 2024-01-15 | True Positive | Exact match |
| 6 | Total | Total | True Positive | Exact match |
| 7 | Due | Due | True Positive | Exact match |
| 8 | Amount | *(omitted)* | False Negative | Word skipped entirely by OCR engine |
| 9 | 4500.00 | 4500.00 | True Positive | Exact match |
| 10 | USD | U5D | False Positive + False Negative | Character substitution: "S" misread as "5". The returned "U5D" is wrong (FP) and "USD" was never recovered (FN) |
| **Totals** | **10 words** | **9 words returned** | **TP: 7, FP: 2, FN: 3** | TP + FP = 9 returned words; TP + FN = 10 ground truth words |
Using the totals from the summary row:
- Precision = 7 / (7 + 2) = 7/9 ≈ 0.778 (77.8%)
- Recall = 7 / (7 + 3) = 7/10 = 0.700 (70.0%)
- F1 Score = 2 × (0.778 × 0.700) / (0.778 + 0.700) ≈ 0.737 (73.7%)
This example shows that a substitution error costs both metrics: the misread word lowers precision, and the ground truth word it replaced lowers recall. Here 7 of the 9 returned words were correct (77.8% precision), while only 7 of the 10 ground truth words were recovered (70.0% recall). The F1 Score of 73.7% reflects the combined performance, and the worked table makes it straightforward to replicate this process with any OCR output and ground truth pair.
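A word-by-word comparison like this can be automated. The sketch below aligns two word lists with Python's standard `difflib.SequenceMatcher` and counts outcomes on a different, hypothetical sentence. It assumes the convention that a substituted word counts once as a false positive (the wrong word returned) and once as a false negative (the reference word lost), so that TP + FP equals the number of returned words and TP + FN equals the number of ground truth words. The `count_word_outcomes` helper is an illustrative name, and `difflib` alignment is a heuristic rather than a guaranteed optimal alignment.

```python
from difflib import SequenceMatcher


def count_word_outcomes(reference_words: list[str],
                        ocr_words: list[str]) -> tuple[int, int, int]:
    """Align two word lists and return (tp, fp, fn)."""
    tp = fp = fn = 0
    matcher = SequenceMatcher(None, reference_words, ocr_words, autojunk=False)
    for tag, ref_start, ref_end, ocr_start, ocr_end in matcher.get_opcodes():
        ref_len = ref_end - ref_start
        ocr_len = ocr_end - ocr_start
        if tag == "equal":
            tp += ref_len
        else:
            # replace: wrong words returned (FP) and reference words lost (FN)
            # delete:  reference words missing from the OCR output (FN only)
            # insert:  extra words in the OCR output (FP only)
            fp += ocr_len
            fn += ref_len
    return tp, fp, fn


# Hypothetical sample: one substitution ("C0rp") and one omission ("Main").
reference = "Ship to Acme Corp 42 Main Street Springfield".split()
ocr = "Ship to Acme C0rp 42 Street Springfield".split()

tp, fp, fn = count_word_outcomes(reference, ocr)
print(tp, fp, fn)  # 6 1 2
print(f"precision={tp / (tp + fp):.3f} recall={tp / (tp + fn):.3f}")
# precision=0.857 recall=0.750
```

For large corpora, a dedicated edit-distance alignment is likely a better fit than `difflib`, but the counting logic stays the same.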
Balancing the Precision-Recall Tradeoff in OCR Systems
Precision and recall exist in tension with each other in most OCR systems. Adjusting the system to return more text generally increases recall but introduces more errors, lowering precision. Conversely, restricting output to only high-confidence results improves precision but risks missing valid content, reducing recall.
How Confidence Thresholds Shift the Balance
Most OCR engines assign a confidence score to each recognized character or word. Raising the confidence threshold means the system only returns results it is highly certain about:
- Higher threshold → Fewer results returned → Higher precision, lower recall
- Lower threshold → More results returned → Higher recall, lower precision
The right threshold depends entirely on the use case. There is no universally correct setting.
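The effect of the threshold can be seen by sweeping it over a small labeled sample. The sketch below uses made-up words, confidence scores, and correctness labels. The `RecognizedWord` class and `precision_recall_at` helper are hypothetical names, and it assumes confidence scores lie in the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    text: str
    confidence: float  # engine-reported score, assumed to be in [0, 1]
    is_correct: bool   # whether the word matches the ground truth


def precision_recall_at(words: list[RecognizedWord], threshold: float,
                        ground_truth_count: int) -> tuple[float, float]:
    """Keep only words at or above the threshold, then score what remains."""
    kept = [w for w in words if w.confidence >= threshold]
    tp = sum(w.is_correct for w in kept)
    precision = tp / len(kept) if kept else 0.0  # convention for empty output
    recall = tp / ground_truth_count if ground_truth_count else 0.0
    return precision, recall


# Hypothetical sample: 8 ground truth words, 2 of them misread by the engine.
words = [
    RecognizedWord("Invoice", 0.99, True),
    RecognizedWord("Number", 0.97, True),
    RecognizedWord("1O482", 0.82, False),
    RecognizedWord("Date", 0.91, True),
    RecognizedWord("Tota1", 0.55, False),
    RecognizedWord("Due", 0.78, True),
    RecognizedWord("4500.00", 0.66, True),
    RecognizedWord("USD", 0.88, True),
]

for threshold in (0.5, 0.7, 0.9):
    p, r = precision_recall_at(words, threshold, ground_truth_count=8)
    print(f"threshold={threshold:.1f} precision={p:.3f} recall={r:.3f}")
# threshold=0.5 precision=0.750 recall=0.750
# threshold=0.7 precision=0.833 recall=0.625
# threshold=0.9 precision=1.000 recall=0.375
```

In this sample, each step up in threshold removes at least one error but also discards correct words, which is the tradeoff described above.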
Matching Metric Priority to Use Case
The table below maps common OCR deployment scenarios to their recommended metric priority, the business rationale behind that priority, and the corresponding threshold approach.
| Use Case / Industry | Priority Metric | Reason for Priority | Recommended Threshold Approach |
|---|---|---|---|
| Legal document processing | Precision | Errors in legal text carry liability and compliance risk | Raise threshold — accept fewer but more accurate results |
| Regulatory / compliance filing | Precision | Incorrect data in filings can trigger penalties | Raise threshold — prioritize correctness over completeness |
| Large-scale data extraction | Recall | Missing records reduce dataset completeness and downstream model quality | Lower threshold — capture more content, filter errors in post-processing |
| Medical records digitization | Precision | Misread clinical terms or dosages pose patient safety risks | Raise threshold — human review of low-confidence output recommended |
| E-commerce catalog digitization | Recall | Missing product attributes reduce searchability and conversion | Lower threshold — completeness drives discoverability |
| General-purpose document indexing | Balanced (F1) | Neither accuracy nor completeness strongly dominates | Tune threshold to maximize F1 Score across the document corpus |
In high-risk workflows, threshold tuning is often paired with human validation pipelines so low-confidence extractions can be reviewed before they affect downstream systems. That combination helps teams protect precision without giving up entirely on recall.
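One simple way to pair a threshold with human validation is to split extractions into an auto-accepted set and a review queue. The sketch below is a minimal illustration: the field names, values, confidence scores, and the 0.90 cutoff are all assumptions, and `route_by_confidence` is a hypothetical helper.

```python
def route_by_confidence(extractions: list[tuple[str, str, float]],
                        auto_accept_threshold: float):
    """Split (field, value, confidence) triples into accepted and review lists."""
    accepted, needs_review = [], []
    for field_name, value, confidence in extractions:
        # High-confidence results flow straight through; the rest wait for a person.
        target = accepted if confidence >= auto_accept_threshold else needs_review
        target.append((field_name, value))
    return accepted, needs_review


# Hypothetical clinical form fields with engine-reported confidence scores.
extractions = [
    ("patient_name", "Jane Doe", 0.98),
    ("medication", "Metoprolol", 0.95),
    ("dosage", "5O mg", 0.61),  # letter "O" in place of zero, low confidence
]

accepted, needs_review = route_by_confidence(extractions, auto_accept_threshold=0.90)
print(accepted)      # [('patient_name', 'Jane Doe'), ('medication', 'Metoprolol')]
print(needs_review)  # [('dosage', '5O mg')]
```

Nothing is discarded in this design. Low-confidence values are held back rather than dropped, so a reviewer can still recover them, which is how the combination protects precision without giving up on recall.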
Techniques for Improving Precision and Recall
Beyond threshold adjustment, several techniques can raise the baseline for both precision and recall. The table below organizes these methods by which metric they improve, how they work, when to apply them, and the relative effort required.
| Technique | Metric(s) Improved | How It Helps | Best Applied When | Complexity / Effort |
|---|---|---|---|---|
| **Denoising** | Both | Removes background noise that causes the OCR engine to misread or skip characters | Source documents are low-resolution scans or photocopies | Low |
| **Deskewing** | Both | Corrects rotated or tilted text so the OCR engine can segment lines accurately | Documents were scanned at an angle or contain skewed columns | Low |
| **Contrast Enhancement** | Both | Increases the distinction between text and background, reducing misreads | Documents have faded ink, poor lighting, or low contrast | Low |
| **Binarization** | Both | Converts grayscale images to black-and-white to simplify character boundaries | Mixed-background or colored documents where text blends into the page | Low–Medium |
| **Confidence Threshold Adjustment** | Precision or Recall (not both simultaneously) | Shifts the balance between returning more results (recall) or more accurate results (precision) | Baseline metrics are established and a specific metric needs targeted improvement | Low |
| **Domain-Specific Model Training** | Both | Fine-tunes the OCR model on vocabulary, fonts, and layouts specific to the target document type | Standard models underperform on specialized documents such as medical forms or legal contracts | High |
| **Model Selection** | Both | Choosing a model architecture suited to the document type raises the performance ceiling before any tuning | Evaluating a new OCR pipeline or replacing an underperforming engine | Medium |
Image preprocessing techniques (denoising, deskewing, contrast enhancement, and binarization) are usually the best starting point because they can improve both metrics at once with relatively low implementation effort. Threshold adjustment is a fast, low-cost lever, but it only shifts the balance between the metrics rather than raising both. Domain-specific training can deliver the largest gains on specialized documents, but it requires labeled training data and a significant time investment.
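To illustrate why preprocessing order matters, here is a toy, pure-Python sketch of contrast enhancement followed by binarization on a hypothetical 4×4 grayscale patch. It assumes faded ink at around 150 on a background of around 200. Production pipelines would normally use an image library rather than nested lists, and the fixed threshold of 128 is an assumption, not a recommended setting.

```python
def stretch_contrast(pixels: list[list[int]]) -> list[list[int]]:
    """Linearly rescale values so the darkest pixel becomes 0 and the brightest 255."""
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in pixels]
    return [[round((value - lo) * 255 / (hi - lo)) for value in row]
            for row in pixels]


def binarize(pixels: list[list[int]], threshold: int = 128) -> list[list[int]]:
    """Map each pixel to white (255) or black (0) using a fixed global threshold."""
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


# Hypothetical faded scan: a 2x2 block of "ink" surrounded by background.
faded = [
    [200, 198, 201, 199],
    [200, 150, 152, 199],
    [201, 151, 149, 200],
    [199, 200, 198, 201],
]

# Without contrast enhancement every pixel is above 128, so the ink disappears.
print(binarize(faded)[1])                    # [255, 255, 255, 255]

# After stretching, the ink falls below the threshold and survives binarization.
print(binarize(stretch_contrast(faded))[1])  # [255, 0, 0, 255]
```

In the first case the text is lost before recognition even starts, which would surface as false negatives and lower recall no matter how the confidence threshold is set.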
Final Thoughts
Precision and recall provide a structured, objective way to evaluate OCR system performance, measuring accuracy and completeness respectively against a verified ground truth. Calculating these metrics using true positives, false positives, and false negatives — and combining them into an F1 Score — gives teams a reproducible method for benchmarking and comparing OCR pipelines. The precision-recall tradeoff is not a flaw to be eliminated but a configuration decision to be made deliberately, guided by the specific requirements of each use case and addressed through a combination of preprocessing, threshold tuning, and model selection.