OCR Pipeline — Scanned PDF to Structured Data
Converting scanned exam PDFs to machine-readable structured data: pdf2image for rasterization, Tesseract for text extraction, page markers for context preservation, and GPT-4-Turbo to recover structure from noisy OCR output.
Goal
Scanned exam papers are images wrapped in a PDF. The text isn’t selectable — it’s pixels. Getting from that to a structured list of questions with metadata (section, marks, question number) requires two stages: extracting raw text from the images, then recovering meaningful structure from that text.
Stage 1 — PDF to Text via OCR
```python
import pytesseract
from pdf2image import convert_from_path

def extract_text_from_scanned_pdf(pdf_path):
    # Rasterize each PDF page to a PIL Image (200 DPI by default)
    images = convert_from_path(pdf_path)
    page_texts = []
    for i, img in enumerate(images):
        # Run Tesseract on the page image; returns plain text
        text = pytesseract.image_to_string(img)
        page_texts.append({"page": i + 1, "text": text})
    return page_texts
```
pdf2image converts each page to a PIL Image at 200 DPI (default). pytesseract runs Tesseract on each image and returns a string. Each page is processed independently — no context carries across pages.
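Resolution is the first knob to turn: higher DPI gives Tesseract more pixels per glyph at the cost of memory and conversion time. A minimal sketch; `dpi` is a standard `convert_from_path` parameter, and 300 is an assumed starting value, not a tuned one:

```python
# Assumption: 300 DPI as a starting point for cleaner glyphs on small print
images = convert_from_path(pdf_path, dpi=300)
```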
The output is wrapped with page markers before being passed downstream:
```python
text = "\n\n".join([f"--- Page {p['page']} ---\n{p['text']}" for p in page_texts])
```
The --- Page N --- markers let the LLM extraction stage know where each page starts. Without them, a question split across two pages loses its boundary information entirely.
Stage 2 — LLM Extraction of Structure
OCR output from a scanned exam paper is noisy. Section headers, question numbers, mark values, and instructional text all appear in the raw output but without any structural hierarchy — it’s just a blob of text.
GPT-4-Turbo at temperature 0 handles the structural recovery:
```python
chain = extract_exam_data_prompt | gpt4_turbo_llm | extraction_output_parser
result: ExtractionOutput = chain.invoke({"paper_text": text})
```
The prompt asks for a specific JSON schema: semester, course, subject, sections (A/B/C), and for each section, the questions with their text and mark value. The model’s job is to parse meaning from structure-free text.
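A sketch of what that schema can look like as Pydantic models behind a LangChain `PydanticOutputParser`. Field names follow the description above; the exact class and field definitions here are illustrative assumptions, not the project's actual code:

```python
from typing import List

from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

class Question(BaseModel):
    number: str = Field(description="Question number as printed on the paper")
    text: str = Field(description="Full question text")
    marks: int = Field(description="Mark value for the question")

class Section(BaseModel):
    name: str = Field(description="Section label: A, B, or C")
    questions: List[Question]

class ExtractionOutput(BaseModel):
    semester: str
    course: str
    subject: str
    sections: List[Section]

extraction_output_parser = PydanticOutputParser(pydantic_object=ExtractionOutput)
# get_format_instructions() is what gets templated into the extraction prompt
# so the model emits JSON the parser can validate
format_instructions = extraction_output_parser.get_format_instructions()
```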
This works surprisingly well for MBA exam papers specifically — they follow a consistent format (Section A: short answer, Section B: essays, Section C: case study) that the model has seen many times in its training data.
What OCR Gets Wrong
Tesseract on scanned documents has predictable failure modes:
| OCR issue | Example | Impact |
|---|---|---|
| Character substitution | l → 1, O → 0 | Corrupts question numbers |
| Line break misplacement | Mid-sentence breaks | Fragments question text |
| Header/footer bleed | Page numbers, exam code | Appears in question text |
| Low-contrast text | Faint print | Missing words or lines |
The LLM extraction stage recovers from most of these gracefully because it’s doing semantic parsing, not literal text matching. “Wlhat is motivation?” → “What is motivation?” is an obvious correction for the model even without explicit instructions to fix OCR errors.
Dependency Setup
```bash
# System-level
brew install tesseract
brew install poppler   # pdf2image depends on poppler for PDF rendering

# Python
pip install pytesseract pdf2image Pillow
```
If convert_from_path raises a PDFInfoNotInstalledError, poppler isn’t on the PATH. On macOS: export PATH="/opt/homebrew/bin:$PATH".
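Alternatively, skip PATH surgery and hand pdf2image the poppler location directly; `poppler_path` is a real `convert_from_path` parameter, and the path below assumes a Homebrew install on Apple Silicon:

```python
# Assumption: Homebrew's default prefix on Apple Silicon macOS
images = convert_from_path(pdf_path, poppler_path="/opt/homebrew/bin")
```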
What to Watch
Scan quality is the ceiling. Tesseract accuracy degrades significantly on low-DPI scans, skewed pages, or handwritten annotations. For exam papers that were printed, photocopied, and scanned, quality varies paper to paper. Consider preprocessing with Pillow: convert to grayscale, apply adaptive thresholding, deskew before passing to Tesseract.
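A Pillow-only sketch of that preprocessing, with a global threshold standing in for the adaptive one (true adaptive thresholding and deskewing usually mean reaching for OpenCV or scikit-image):

```python
from PIL import Image, ImageOps

def preprocess_for_ocr(img: Image.Image, threshold: int = 180) -> Image.Image:
    # Grayscale removes color noise from photocopied pages
    gray = ImageOps.grayscale(img)
    # Stretch contrast so faint print separates from the background
    stretched = ImageOps.autocontrast(gray)
    # Global binarization; threshold=180 is an assumed starting value,
    # not a tuned constant -- adaptive thresholding needs OpenCV
    return stretched.point(lambda px: 255 if px > threshold else 0)
```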
Total token count. A 10-page exam paper produces substantial OCR text. Combined with the detailed extraction prompt and format instructions, a single extraction call can hit 8–12k tokens. At GPT-4-Turbo pricing, extraction is the most expensive single call in the pipeline.
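Counting tokens locally before invoking the chain makes the cost visible up front. A sketch with tiktoken; cl100k_base is the encoding GPT-4-Turbo uses:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-Turbo's encoding
n_tokens = len(enc.encode(text))  # the page-marked OCR text from Stage 1
print(f"OCR text alone: {n_tokens} tokens (prompt + format instructions add more)")
```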
pytesseract is a thin wrapper. It calls the Tesseract binary via subprocess. The binary needs to be installed separately at the OS level — pip install pytesseract alone isn’t enough. First-time setup catches people on this.
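If the binary is installed somewhere pytesseract can't find, `pytesseract.pytesseract.tesseract_cmd` is the documented override; the path below is an assumption for a Homebrew install:

```python
import pytesseract

# Assumption: Homebrew install location on Apple Silicon macOS
pytesseract.pytesseract.tesseract_cmd = "/opt/homebrew/bin/tesseract"
```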
What’s Next
- Evaluate `pymupdf` (fitz) for PDFs that are born-digital (not scanned). It extracts text directly without OCR and is dramatically faster and more accurate; see the sketch after this list for a quick way to detect a text layer.
- Add image preprocessing: grayscale + adaptive threshold + deskew before Tesseract to improve accuracy on low-quality scans.
- Try `easyocr` as an alternative to Tesseract: better accuracy on degraded scans, runs on GPU if available.
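A sketch of the born-digital check mentioned above, assuming pymupdf is installed; `min_chars` is a hypothetical heuristic threshold, not a pymupdf API:

```python
import fitz  # pymupdf

def has_text_layer(pdf_path: str, min_chars: int = 50) -> bool:
    # Born-digital PDFs carry an extractable text layer; pure scans
    # return (almost) no characters, so a low threshold separates them.
    with fitz.open(pdf_path) as doc:
        total = sum(len(page.get_text()) for page in doc)
    return total >= min_chars
```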