Low-Quality Scan Processing

Low-quality scan processing is a persistent challenge in document-heavy workflows. When scanned images are degraded — by poor resolution, misalignment, or low contrast — the consequences range from failed OCR to unreliable data extraction. Knowing how to identify, correct, and prevent these issues is essential for anyone responsible for document digitization, archiving, automated processing, or agentic document extraction.

Because degraded inputs have a direct effect on OCR accuracy, low-quality scans should be treated as an upstream quality problem rather than a downstream exception. The better teams get at detecting and correcting scan defects early, the more reliable their search, archiving, and extraction workflows become.

What Counts as a Low-Quality Scan

Low-quality scan processing refers to identifying and correcting degraded scanned document images so they can be used for tasks like optical character recognition, PDF character recognition, data extraction, or long-term archiving. A scan is considered low quality when one or more of its core image attributes fall below the threshold needed for reliable machine or human interpretation.

How Scan Defects Appear and What They Affect

Scan quality issues fall into distinct categories, each with a different cause and a different impact on how the document can be used. The table below maps each defect type to its visual appearance, common cause, and downstream consequences.

| Quality Issue Type | How It Manifests | Common Cause | Impact on Downstream Tasks |
| --- | --- | --- | --- |
| Low Resolution / Low DPI | Text appears pixelated or illegible at normal zoom | Scanner set below 300 DPI; default low-quality settings | OCR misreads characters; data extraction fails on fine print |
| Poor Contrast | Text blends into background; faint or washed-out appearance | Aging documents, incorrect brightness settings, faded ink | OCR engines miss characters; binarization produces incomplete output |
| Skew / Misalignment | Text lines appear tilted or rotated on the page | Document placed unevenly on scanner glass; manual feeding errors | OCR line segmentation fails; column-based layouts misread |
| Image Noise / Artifacts | Speckles, streaks, or random dots appear across the image | Dust on scanner glass; worn scanner hardware; compression artifacts | False character recognition; corrupted output in automated pipelines |
| Blurriness / Motion Blur | Text edges are soft or smeared; characters bleed together | Scanner lid movement; document not flat; worn scanning element | OCR confidence drops significantly; handwriting becomes unreadable |
| Uneven Lighting / Shadowing | Dark edges or gradient shadows across the page | Bound documents, curved pages, or inconsistent lamp output | Binarization produces uneven results; edge content lost |
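The low-resolution row can be checked numerically when a file's pixel dimensions and the physical page size are known. The sketch below is a minimal illustration: the function name `effective_dpi` and the constant `MIN_OCR_DPI` are made up for this example, and the 300 DPI floor is the figure quoted in the table, not a universal rule.

```python
def effective_dpi(pixel_width: int, pixel_height: int,
                  page_width_in: float, page_height_in: float) -> float:
    """Return the lower of the horizontal and vertical pixels-per-inch."""
    return min(pixel_width / page_width_in, pixel_height / page_height_in)

MIN_OCR_DPI = 300  # floor taken from the table above; adjust per workflow

# A US Letter page (8.5 x 11 inches) captured at 1275 x 1650 pixels.
dpi = effective_dpi(1275, 1650, 8.5, 11.0)
print(dpi)                 # 150.0
print(dpi < MIN_OCR_DPI)   # True, so this scan would be flagged as low resolution
```

Embedded DPI metadata in a file can be wrong or missing, which is why this sketch derives the value from pixel counts and an assumed page size instead.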

Why Low-Quality Scans Happen

In practice, scan quality problems rarely stem from a single failure. They typically result from a combination of aging hardware, operator error, poor source document condition, and misconfigured scanner settings.

Common contributing factors include:

  • High-volume batch scanning where individual document quality is not reviewed
  • Legacy or poorly maintained scanners with degraded optical components
  • Fragile or aged source documents that cannot be flattened without risk of damage
  • Inconsistent operator training leading to variable scanner configurations

How Scan Quality Affects OCR and Data Extraction

OCR engines are highly sensitive to image quality. Even moderate degradation can cause character substitution errors, missed words, or complete line failures. In automated data extraction pipelines, these errors compound — a single misread field in a structured form can invalidate an entire record. This is especially costly in structured workflows such as ACORD form processing, where small field-level mistakes can break downstream validation and routing. Archiving workflows are similarly affected, as low-quality scans reduce searchability and long-term readability of stored documents.

Techniques for Correcting Low-Quality Scans

Processing a low-quality scan means applying targeted corrections to the image before or during OCR and data extraction. The right technique depends on the specific defect present. Applying the wrong correction — or applying corrections in the wrong order — can make image quality worse, not better.

Image Preprocessing Methods

Preprocessing is the foundational layer of scan correction. These operations are applied directly to the image file before it is passed to an OCR engine or extraction pipeline.

Deskewing detects and corrects rotational misalignment by calculating the angle of text lines and rotating the image to horizontal. It works best when skew is between 1 and 15 degrees.
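One common way to calculate the text-line angle is a projection-profile search: try a range of candidate angles and keep the one that packs the ink into the fewest, tallest rows. The sketch below is a pure-Python illustration under simplifying assumptions. The function name `estimate_skew`, the 0.5 degree step, and the synthetic input (idealized float coordinates of ink pixels) are all choices made for this example; a production pipeline would typically work on real pixel arrays with an image library.

```python
import math
from collections import Counter

def estimate_skew(points, max_angle=15.0, step=0.5):
    """Return the candidate angle (degrees) that best packs ink into rows."""
    best_angle, best_score = 0.0, -1
    for i in range(int(round(2 * max_angle / step)) + 1):
        angle = -max_angle + i * step
        t = math.radians(angle)
        # Row index of every ink point after rotating the page back by `angle`.
        rows = Counter(round(y * math.cos(t) - x * math.sin(t)) for x, y in points)
        # Level text lines produce a few very tall histogram bins.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Synthetic page: three text baselines tilted by 5 degrees.
tilt = math.tan(math.radians(5.0))
points = [(x, y0 + x * tilt) for y0 in (0, 40, 80) for x in range(200)]
print(estimate_skew(points))  # 5.0
```

The search range defaults to 15 degrees in either direction to mirror the range mentioned above. Rotating the image by the negative of the returned angle levels the text lines.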

Denoising removes random pixel noise and artifacts using filters such as Gaussian blur, median filtering, or morphological operations. It should be applied before binarization to avoid amplifying noise during thresholding.
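As an illustration of median filtering, the sketch below applies a 3x3 median to a grayscale image stored as a list of rows. It is a minimal pure-Python version written for this example (the name `median_filter_3x3` is hypothetical), not a substitute for an optimized library filter.

```python
from statistics import median

def median_filter_3x3(img):
    """Replace each interior pixel with the median of its 3x3 neighborhood."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # border pixels are left unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out

# White page with one dark speckle in the middle.
speckled = [[255] * 5 for _ in range(5)]
speckled[2][2] = 0
print(median_filter_3x3(speckled)[2][2])  # 255, the speckle is removed

# White page with a one-pixel-wide vertical stroke.
stroke = [[0 if x == 2 else 255 for x in range(5)] for _ in range(5)]
print(median_filter_3x3(stroke)[2][2])    # 255, the thin stroke is erased too
```

The second case shows the trade-off: a filter strong enough to remove isolated speckles can also erase strokes that are only one pixel wide, so the window size needs to be chosen with the scan resolution in mind.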

Contrast enhancement increases the difference between foreground text and background using histogram equalization or adaptive contrast methods. It is particularly effective for faded or low-contrast documents.
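Histogram equalization can be sketched in a few lines. The version below assumes an 8-bit grayscale image stored as a list of rows and uses the standard cumulative-histogram remapping; the function name `equalize` and the tiny test image are illustrative only.

```python
def equalize(img):
    """Histogram-equalize an 8-bit grayscale image given as a list of rows."""
    flat = [p for row in img for p in row]
    hist = [0] * 256
    for p in flat:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    span = len(flat) - cdf_min
    if span == 0:  # every pixel has the same value; nothing to stretch
        return [row[:] for row in img]
    # Map each gray level to its rescaled cumulative frequency.
    lut = [max(0, round((c - cdf_min) / span * 255)) for c in cdf]
    return [[lut[p] for p in row] for row in img]

# Faded page: "ink" at 180 is barely darker than "paper" at 200.
faded = [[200, 200, 180], [200, 180, 200]]
print(equalize(faded))  # [[255, 255, 0], [255, 0, 255]]
```

Global equalization like this treats the whole page as one histogram. Adaptive variants apply the same idea per region, which tends to behave better when fading is uneven across the page.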

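Binarization converts a grayscale image to pure black and white by comparing each pixel to a threshold. Global methods, such as Otsu's, choose a single threshold for the whole page. Adaptive (local) methods, such as Sauvola's, compute a separate threshold for each neighborhood and therefore handle uneven lighting better than global thresholding.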
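The difference between a single page-wide threshold and a locally computed one is easiest to see on a page with a lighting gradient. The sketch below compares Otsu's method, which selects one global threshold, with Sauvola's local threshold `T = m * (1 + k * (s / R - 1))`, where `m` and `s` are the mean and standard deviation in a window around each pixel. The synthetic page, the 5x5 window, and the values `k=0.2` and `R=128` are assumptions chosen for this example.

```python
from statistics import mean, pstdev

def otsu_threshold(img):
    """Global threshold that maximizes between-class variance."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w0, sum0 = 0, -1.0, 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        between = w0 * w1 * (sum0 / w0 - (sum_all - sum0) / w1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

def sauvola_mask(img, radius=2, k=0.2, r=128.0):
    """True where a pixel is darker than its local Sauvola threshold."""
    h, w = len(img), len(img[0])
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            win = [img[j][i]
                   for j in range(max(0, y - radius), min(h, y + radius + 1))
                   for i in range(max(0, x - radius), min(w, x + radius + 1))]
            m, s = mean(win), pstdev(win)
            row.append(img[y][x] < m * (1 + k * (s / r - 1)))
        mask.append(row)
    return mask

# Synthetic 8x16 page: paper darkens left to right (220 down to 100),
# with three vertical ink strokes that are 70 levels darker than the paper.
TEXT_COLS = (2, 7, 12)
truth = [[x in TEXT_COLS and 1 <= y <= 6 for x in range(16)] for y in range(8)]
img = [[220 - 8 * x - (70 if truth[y][x] else 0) for x in range(16)] for y in range(8)]

t = otsu_threshold(img)
otsu_mask = [[p <= t for p in row] for row in img]
local_mask = sauvola_mask(img)

def mismatches(mask):
    return sum(a != b for ra, rb in zip(mask, truth) for a, b in zip(ra, rb))

print("global:", mismatches(otsu_mask), "local:", mismatches(local_mask))
```

On this page the ink on the bright side (value 134) is lighter than the bare paper on the dark side (values 132 down to 100), so no single threshold can separate ink from paper everywhere, while the local threshold follows the gradient.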

The table below matches each quality issue to the appropriate technique, its type, and the conditions under which it should be applied.

| Quality Issue | Recommended Technique(s) | Technique Type | When to Apply | Key Limitation or Consideration |
| --- | --- | --- | --- | --- |
| Skew / Misalignment | Deskewing | Automated | When text lines deviate more than 1–2 degrees from horizontal | May fail on documents with mixed orientations or no clear text baseline |
| Low Contrast / Faded Text | Contrast Enhancement, Binarization | Automated | When text is visually faint or OCR confidence is low | Aggressive enhancement can introduce noise; test on sample pages first |
| Image Noise / Artifacts | Denoising (median filter, Gaussian blur) | Automated | When speckles or streaks are visible at normal zoom | Over-smoothing can blur fine text; apply before binarization |
| Low Resolution / Low DPI | AI-based super-resolution upscaling | AI / ML-Based | When source DPI is below 200 and rescanning is not possible | Upscaling cannot recover detail that was never captured |
| Blurriness / Motion Blur | Sharpening filters, AI deblurring | Automated / AI | When character edges are soft or text bleeds into background | Sharpening amplifies noise; denoise first if both issues are present |
| Uneven Lighting / Shadowing | Adaptive binarization, background normalization | Automated | When shadow gradients affect edge content or column areas | Global thresholding will fail; adaptive methods required |
| Compound / Mixed Issues | Multi-step preprocessing pipeline | Automated / AI | When multiple defects are present simultaneously | Apply in sequence: denoise → deskew → normalize → binarize |
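The sequencing rule in the last row can be enforced in code so that steps always run in the same order regardless of how defects were detected. The sketch below is one possible arrangement: the defect labels and the `DEFECT_TO_STEPS` mapping are hypothetical, and treating "normalize" as covering both contrast enhancement and background normalization is an assumption made for this example.

```python
# Order taken from the table: denoise -> deskew -> normalize -> binarize.
STEP_ORDER = ["denoise", "deskew", "normalize", "binarize"]

# Hypothetical mapping from detected defect labels to the steps that address them.
DEFECT_TO_STEPS = {
    "noise": ["denoise"],
    "skew": ["deskew"],
    "low_contrast": ["normalize", "binarize"],
    "uneven_lighting": ["normalize", "binarize"],
}

def plan_steps(defects):
    """Return the required steps, always in the fixed pipeline order."""
    unknown = [d for d in defects if d not in DEFECT_TO_STEPS]
    if unknown:
        raise ValueError(f"unknown defect labels: {unknown}")
    needed = {step for d in defects for step in DEFECT_TO_STEPS[d]}
    return [step for step in STEP_ORDER if step in needed]

print(plan_steps(["low_contrast", "noise"]))
# ['denoise', 'normalize', 'binarize']
```

Keeping the order in one list means a newly detected defect can never cause, for example, binarization to run before denoising.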

Configuring OCR for Degraded Documents

Even after preprocessing, OCR engines may need additional configuration to handle degraded documents reliably.

  • Set OCR language models to match the document's language and character set precisely
  • Use confidence scoring to flag low-certainty output for human review rather than passing it downstream unchecked
  • Enable page segmentation modes that match the document layout (e.g., single-column, multi-column, or form-based)
  • For handwritten documents, use OCR engines specifically trained on handwriting rather than print-optimized models
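Confidence-based flagging from the list above can be sketched independently of any particular OCR engine. The example below assumes the engine returns a per-word confidence on a 0 to 100 scale; the `Word` type, the function name, and the threshold of 80 are all hypothetical and should be tuned against reviewed samples.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    confidence: float  # assumed 0-100 scale; engines differ

def split_by_confidence(words, threshold=80.0):
    """Separate words that can pass downstream from those needing human review."""
    accepted = [w for w in words if w.confidence >= threshold]
    review = [w for w in words if w.confidence < threshold]
    return accepted, review

words = [Word("Policy", 96.0), Word("N0.", 41.5), Word("48213", 88.2)]
accepted, review = split_by_confidence(words)
print([w.text for w in accepted])  # ['Policy', '48213']
print([w.text for w in review])    # ['N0.']
```

Routing the low-confidence words to a review queue, instead of dropping them, keeps the record complete while making the uncertainty visible.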

These configuration choices also matter when teams use generative AI for document extraction, since even advanced models still depend on legible, well-prepared inputs to produce consistent structured output.

AI and Machine Learning Approaches to Scan Correction

AI-based tools have expanded the range of recoverable scan quality issues beyond what rule-based preprocessing can address. The table below compares the three primary processing approaches across key evaluation dimensions.

| Approach Type | How It Works | Best For | Skill / Resource Requirement | Typical Outcome |
| --- | --- | --- | --- | --- |
| Manual | User applies corrections individually using image editing software | Low-volume, high-value documents requiring human judgment | Basic | High accuracy for simple corrections; time-intensive at scale |
| Automated (Rule-Based) | Software applies predefined filters based on set parameters | High-volume batches with consistent, predictable quality issues | Intermediate | Consistent results for known issue types; limited adaptability |
| AI / ML-Based | Model predicts and reconstructs image quality from learned patterns | Highly degraded or structurally variable documents where rule-based methods fail | Advanced / Specialized | Highest accuracy for complex degradation; requires trained models or licensed tools |

AI-based approaches are particularly effective for:

  • Super-resolution upscaling of scans captured below 200 DPI
  • Intelligent deblurring that distinguishes text edges from background noise
  • Automated quality triage that routes documents to the appropriate correction pipeline without manual inspection
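The triage idea can be illustrated with a simple rule-based stand-in. This is not a learned model: it only shows the routing shape, using the 200 DPI and 2 degree figures quoted earlier in this article. The `ScanMetrics` fields, the 0 to 1 contrast score, its 0.3 cutoff, and the pipeline names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ScanMetrics:
    dpi: float
    skew_degrees: float
    contrast: float  # hypothetical score in [0, 1]; higher means clearer text

def triage(m: ScanMetrics) -> str:
    """Pick a correction pipeline from measured scan metrics."""
    if m.dpi < 200:
        return "ai_super_resolution"       # too little detail for filters alone
    if abs(m.skew_degrees) > 2 or m.contrast < 0.3:
        return "rule_based_preprocessing"  # deskew and/or contrast correction
    return "direct_to_ocr"

print(triage(ScanMetrics(dpi=150, skew_degrees=0.5, contrast=0.8)))  # ai_super_resolution
print(triage(ScanMetrics(dpi=300, skew_degrees=4.0, contrast=0.8)))  # rule_based_preprocessing
print(triage(ScanMetrics(dpi=300, skew_degrees=0.5, contrast=0.8)))  # direct_to_ocr
```

A model-based triage step would replace the hand-written conditions with a classifier, but the surrounding contract (metrics in, pipeline name out) can stay the same.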

Model performance can improve further when teams use synthetic data for document training to simulate blur, skew, faded text, and scanning artifacts that may be underrepresented in production datasets. At the same time, more model reasoning does not automatically produce better parsing results; as discussed in why reasoning models fail at document parsing, poorly designed inference chains can add latency without fixing image-level quality problems.
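Synthetic degradation of clean pages can be sketched with two simple transforms: fading, which pulls every pixel toward white, and salt-and-pepper speckle. The function names, default strengths, and seed handling below are assumptions for illustration. Blur and skew are omitted for brevity.

```python
import random

def fade(img, strength=0.6):
    """Pull every pixel toward white (255) to mimic faded ink."""
    return [[round(p + (255 - p) * strength) for p in row] for row in img]

def add_speckle(img, fraction=0.05, seed=0):
    """Replace a random fraction of pixels with pure black or white."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    return [[rng.choice((0, 255)) if rng.random() < fraction else p for p in row]
            for row in img]

clean = [[0, 255, 0, 255] for _ in range(4)]
print(fade(clean)[0])  # [153, 255, 153, 255]
noisy = add_speckle(fade(clean), fraction=0.25, seed=42)
print(len(noisy), len(noisy[0]))  # 4 4
```

Because the clean source is known, each degraded image comes with exact ground-truth text, which is what makes this kind of data useful for training and evaluation.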

Preventing Scan Quality Problems Before They Start

Reducing the frequency of low-quality scans is more efficient than correcting them after the fact. Consistent scanner settings, regular maintenance, and structured quality control checkpoints significantly reduce the volume of documents that require post-processing correction.

In regulated environments, the stakes are even higher. Healthcare organizations, for example, often pair stricter capture standards with HIPAA-compliant OCR because low-quality scans can undermine both compliance workflows and extraction quality in sensitive records.

Scanner misconfiguration is one of the most common and preventable causes of low-quality output. The table below provides recommended settings for the most common document types encountered in document management workflows.

| Document Type | Recommended DPI | Recommended File Format | Color Mode | Special Considerations |
| --- | --- | --- | --- | --- |
| Standard Text / Office Documents | 300 DPI (600 DPI for archival) | Searchable PDF, PDF/A | Black and White / Grayscale | Enable auto-deskew; disable auto-brightness for consistent output |
| Photographs and Images | 600–1200 DPI | TIFF, JPEG (high quality) | Full Color | Higher DPI increases file size significantly; calibrate color profile |
| Mixed Text and Image Documents | 300–600 DPI | PDF, TIFF | Grayscale or Full Color | Use grayscale to balance file size and image fidelity |
| Handwritten Notes or Forms | 300–400 DPI | PDF, TIFF | Grayscale | Disable automatic brightness adjustment to preserve ink variation |
| Bound Books or Magazines | 400–600 DPI | PDF/A, TIFF | Grayscale or Full Color | Enable book-edge correction or use a flatbed scanner with a book cradle |
| Aged or Fragile Historical Documents | 400–600 DPI (600+ for archival) | TIFF, PDF/A | Grayscale or Full Color | Avoid automatic document feeders; scan flat with minimal pressure |
| Legal or Compliance Documents | 300 DPI minimum | PDF/A (archival standard) | Black and White or Grayscale | Ensure metadata and file naming comply with retention policy requirements |
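The DPI floors in the table above can be encoded as a lookup so that a scan job is rejected before it starts when the configured resolution is too low. The dictionary keys and the function name below are hypothetical labels; the numbers are the lower bounds from the table.

```python
# Lower bound of each recommended DPI range, copied from the table above.
MIN_DPI = {
    "standard_text": 300,
    "photograph": 600,
    "mixed_text_image": 300,
    "handwritten": 300,
    "bound_book": 400,
    "fragile_historical": 400,
    "legal": 300,
}

def check_scan_dpi(doc_type: str, configured_dpi: int) -> str:
    """Return 'ok' or a short message explaining why the setting is too low."""
    if doc_type not in MIN_DPI:
        raise ValueError(f"unknown document type: {doc_type}")
    minimum = MIN_DPI[doc_type]
    if configured_dpi < minimum:
        return f"too low: {configured_dpi} DPI is below the {minimum} DPI minimum"
    return "ok"

print(check_scan_dpi("standard_text", 300))  # ok
print(check_scan_dpi("bound_book", 300))     # too low: 300 DPI is below the 400 DPI minimum
```

The same pattern extends to color mode and file format if those settings are available from the scanner profile.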

Scanner Maintenance Habits

Consistent hardware maintenance directly affects scan output quality. The following practices should be part of regular operational routines:

  1. Clean the scanner glass before each batch run using a lint-free cloth and appropriate glass cleaner — dust and smudges are a primary cause of streak artifacts
  2. Inspect and clean the automatic document feeder (ADF) rollers weekly to prevent skew and double-feed errors
  3. Run scanner calibration according to the manufacturer's schedule, or whenever output quality visibly changes
  4. Check lamp output on flatbed scanners periodically — aging lamps produce uneven illumination that causes shadow gradients
  5. Update scanner firmware and drivers to maintain compatibility with current operating systems and apply manufacturer quality improvements

Quality Control Checkpoints by Workflow Stage

Catching quality issues early prevents low-quality scans from reaching downstream systems. The table below outlines QC actions organized by workflow stage.

| Workflow Stage | QC Action | What It Prevents |
| --- | --- | --- |
| Pre-Scan Preparation | Inspect scanner glass for dust, smudges, or debris | Streak artifacts and noise in output images |
| Pre-Scan Preparation | Flatten and clean source documents; remove staples and folds | Skew, shadow gradients, and physical damage to scanner |
| Scanner Configuration | Verify DPI, color mode, and file format match document type requirements | Resolution mismatch causing OCR failure or oversized files |
| During Scanning | Review the first 2–3 pages of each batch before continuing | Catches systematic errors (skew, cutoff edges) before they affect the full batch |
| Post-Scan Review | Visually inspect a random sample of output files for alignment and clarity | Identifies hardware or configuration issues not caught during scanning |
| Pre-Processing / Ingestion | Run OCR confidence scoring on a sample batch before full pipeline ingestion | Prevents low-confidence output from corrupting downstream data records |
| Post-Ingestion | Verify file naming, metadata integrity, and folder structure | Ensures documents are retrievable and compliant with retention standards |

These checkpoints matter most in high-stakes use cases such as clinical data extraction, where faint text, clipped margins, or poor contrast can degrade downstream record quality and lengthen review time.

Preparing Source Documents Before Scanning

The condition of the source document before it reaches the scanner directly affects output quality. Before scanning:

  • Flatten folded or curled documents by placing them under a flat weight for several minutes
  • Remove all staples, paper clips, and binding materials to prevent ADF jams and physical damage
  • Gently clean dusty or dirty originals with a soft brush before placing them on the scanner glass
  • Separate stuck or adhered pages carefully to avoid tearing, which permanently damages the document

Final Thoughts

Low-quality scan processing covers both the correction of degraded images and the prevention of quality issues before they occur. The most effective approach combines targeted preprocessing — deskewing, denoising, contrast enhancement, and binarization — with consistent scanner configuration, hardware maintenance, and structured quality control checkpoints. Knowing which technique addresses which defect, and applying corrections in the right sequence, is what separates reliable document workflows from those that produce inconsistent or unusable output.

Once scan quality has been addressed at the image level, the next challenge is accurate data extraction — particularly for documents containing tables, charts, or irregular layouts that standard OCR engines still misread even after preprocessing. LlamaParse delivers VLM-powered agentic OCR that goes beyond simple text extraction, boasting industry-leading accuracy on complex documents without custom training. By leveraging advanced reasoning from large language and vision models, its agentic OCR engine intelligently understands layouts, interprets embedded charts, images, and tables, and enables self-correction loops for higher straight-through processing rates over legacy solutions. LlamaParse employs a team of specialized document understanding agents working together for unrivaled accuracy in real-world document intelligence, outputting structured Markdown, JSON, or HTML. It's free to try today and gives you 10,000 free credits upon signup.
