LOGBOOK LOG-304 · WORKING · SOFTWARE
OCR · TESSERACT · PDF2IMAGE · PYTHON · LLM · DATA-EXTRACTION · PIPELINE · AI

OCR Pipeline — Scanned PDF to Structured Data

Converting scanned exam PDFs to machine-readable structured data: pdf2image for rasterization, Tesseract for text extraction, page markers for context preservation, and GPT-4-Turbo to recover structure from noisy OCR output.

Goal

Scanned exam papers are images wrapped in a PDF. The text isn’t selectable — it’s pixels. Getting from that to a structured list of questions with metadata (section, marks, question number) requires two stages: extracting raw text from the images, then recovering meaningful structure from that text.

Stage 1 — PDF to Text via OCR

import pytesseract
from pdf2image import convert_from_path

def extract_text_from_scanned_pdf(pdf_path):
    # Rasterize every PDF page to a PIL Image (pdf2image defaults to 200 DPI)
    images = convert_from_path(pdf_path)

    page_texts = []
    for i, img in enumerate(images):
        # Run Tesseract on the page image; returns the page's text as one string
        text = pytesseract.image_to_string(img)
        page_texts.append({"page": i + 1, "text": text})

    return page_texts

pdf2image converts each page to a PIL Image at 200 DPI (default). pytesseract runs Tesseract on each image and returns a string. Each page is processed independently — no context carries across pages.
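Resolution is the main lever at this stage. convert_from_path takes a dpi keyword, so raising it above the 200 default is cheap to try, at the cost of memory and conversion time:

images = convert_from_path(pdf_path, dpi=300)  # more pixels for Tesseract to work with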

The output is wrapped with page markers before being passed downstream:

text = "\n\n".join([f"--- Page {p['page']} ---\n{p['text']}" for p in page_texts])

The --- Page N --- markers let the LLM extraction stage know where each page starts. Without them, a question split across two pages loses its boundary information entirely.
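Downstream, the joined text has roughly this shape (illustrative, not real OCR output):

--- Page 1 ---
SECTION A
1. What is motivation? (5 marks)

--- Page 2 ---
2. Explain Maslow's hierarchy of needs. (5 marks)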

Stage 2 — LLM Extraction of Structure

OCR output from a scanned exam paper is noisy. Section headers, question numbers, mark values, and instructional text all appear in the raw output but without any structural hierarchy — it’s just a blob of text.

GPT-4-Turbo at temperature 0 handles the structural recovery:

# LCEL chain: prompt | model (temperature 0) | structured output parser
chain = extract_exam_data_prompt | gpt4_turbo_llm | extraction_output_parser
result: ExtractionOutput = chain.invoke({"paper_text": text})

The prompt asks for a specific JSON schema: semester, course, subject, sections (A/B/C), and for each section, the questions with their text and mark value. The model’s job is to parse meaning from structure-free text.
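The chain's parser implies a typed schema. A minimal sketch of what ExtractionOutput could look like, assuming LangChain's PydanticOutputParser and field names inferred from the description above (the real model may differ):

from pydantic import BaseModel
from langchain_core.output_parsers import PydanticOutputParser

class Question(BaseModel):
    number: str            # kept as a string so "Q1", "1(a)" etc. survive OCR noise
    text: str
    marks: int

class Section(BaseModel):
    name: str              # "A", "B", or "C"
    questions: list[Question]

class ExtractionOutput(BaseModel):
    semester: str
    course: str
    subject: str
    sections: list[Section]

extraction_output_parser = PydanticOutputParser(pydantic_object=ExtractionOutput)

PydanticOutputParser also provides get_format_instructions(), which is what typically gets interpolated into the prompt so the model emits JSON matching the schema.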

This works surprisingly well for MBA exam papers specifically — they follow a consistent format (Section A: short answer, Section B: essays, Section C: case study) that the model has seen many times in its training data.

What OCR Gets Wrong

Tesseract on scanned documents has predictable failure modes:

OCR issue                   Example                   Impact
Character substitution      l ↔ 1, O ↔ 0              Corrupts question numbers
Line break misplacement     Mid-sentence breaks       Fragments question text
Header/footer bleed         Page numbers, exam code   Appears in question text
Low-contrast text           Faint print               Missing words or lines

The LLM extraction stage recovers from most of these gracefully because it’s doing semantic parsing, not literal text matching. “Wlhat is motivation?” → “What is motivation?” is an obvious correction for the model even without explicit instructions to fix OCR errors.

Dependency Setup

# System-level
brew install tesseract
brew install poppler  # pdf2image depends on poppler for PDF rendering

# Python
pip install pytesseract pdf2image Pillow

If convert_from_path raises a PDFInfoNotInstalledError, poppler isn’t on the PATH. On macOS: export PATH="/opt/homebrew/bin:$PATH".
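If changing PATH isn't convenient, convert_from_path also accepts an explicit poppler_path:

images = convert_from_path(pdf_path, poppler_path="/opt/homebrew/bin")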

What to Watch

Scan quality is the ceiling. Tesseract accuracy degrades significantly on low-DPI scans, skewed pages, or handwritten annotations. For exam papers that were printed, photocopied, and scanned, quality varies paper to paper. Consider preprocessing with Pillow: convert to grayscale, apply adaptive thresholding, deskew before passing to Tesseract.
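A minimal preprocessing sketch using only Pillow. Pillow has no built-in adaptive thresholding or deskew (those usually come from OpenCV), so this substitutes autocontrast plus a fixed binarization threshold; the 180 cutoff is an assumption to tune per scan batch:

from PIL import Image, ImageOps

def preprocess_for_ocr(img: Image.Image, threshold: int = 180) -> Image.Image:
    gray = ImageOps.grayscale(img)       # drop color noise
    gray = ImageOps.autocontrast(gray)   # stretch faint print toward full black
    # Fixed-threshold binarization; adaptive methods would adjust per region
    return gray.point(lambda p: 255 if p > threshold else 0)

Pass the result to pytesseract.image_to_string in place of the raw page image.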

Total token count. A 10-page exam paper produces substantial OCR text. Combined with the detailed extraction prompt and format instructions, a single extraction call can hit 8–12k tokens. At GPT-4-Turbo pricing, extraction is the most expensive single call in the pipeline.
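Rough arithmetic, assuming ~400 words per scanned page: 10 pages is ~4,000 words, on the order of 5k tokens before the extraction prompt and format instructions are layered on top, consistent with the 8–12k figure.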

pytesseract is a thin wrapper. It calls the Tesseract binary via subprocess. The binary needs to be installed separately at the OS level — pip install pytesseract alone isn’t enough. First-time setup catches people on this.
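A quick preflight that fails loudly when only the pip package is installed:

import pytesseract

try:
    print(pytesseract.get_tesseract_version())
except pytesseract.TesseractNotFoundError:
    raise SystemExit("Tesseract binary not found; install it at the OS level first")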

What’s Next

  • Evaluate pymupdf (fitz) for PDFs that are born-digital (not scanned) — it extracts text directly without OCR and is dramatically faster and more accurate; a quick routing check is sketched after this list
  • Add image preprocessing: grayscale + adaptive threshold + deskew before Tesseract to improve accuracy on low-quality scans
  • Try easyocr as an alternative to Tesseract — better accuracy on degraded scans, runs on GPU if available
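A possible routing check for the pymupdf item, a minimal sketch assuming that a first page yielding real selectable text means the PDF is born-digital (the 50-character threshold is a guess to tune):

import fitz  # pymupdf

def is_born_digital(pdf_path: str, min_chars: int = 50) -> bool:
    # Scanned pages yield little or no text; born-digital pages yield plenty
    with fitz.open(pdf_path) as doc:
        return len(doc[0].get_text().strip()) >= min_chars

Born-digital papers can then skip the OCR path entirely and go straight to fitz text extraction.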