lakehouse

root 2592f8fcb3 PDF OCR via Tesseract — scanned documents now ingestible

Two-tier PDF extraction: lopdf text layer first (fast, digital PDFs),
Tesseract OCR fallback when text extraction yields zero pages (scanned
documents, image-only PDFs). If Tesseract isn't installed, the OCR path
fails gracefully: it returns an actionable error directing the operator
to `apt install tesseract-ocr tesseract-ocr-eng`.
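The two-tier fallback described above can be sketched roughly as follows. This is an illustration, not the crate's actual code: `extract_two_tier`, the `Pages` alias, and the closure-based extractors are hypothetical names, and errors from the text-layer pass are assumed to propagate rather than trigger OCR.

```rust
/// One extracted page as (page_number, text_content). Hypothetical alias.
type Pages = Vec<(u32, String)>;

/// Try the fast text-layer extractor first and run OCR only when it
/// yields zero pages. Both extractors are passed as closures so the
/// control flow can be shown without depending on lopdf or Tesseract.
fn extract_two_tier<T, O>(text_layer: T, ocr: O) -> Result<Pages, String>
where
    T: FnOnce() -> Result<Pages, String>,
    O: FnOnce() -> Result<Pages, String>,
{
    // Assumption: a hard error from the text-layer pass is returned as-is.
    let pages = text_layer()?;
    if !pages.is_empty() {
        // Digital PDF: the text layer was enough, OCR never runs.
        return Ok(pages);
    }
    // Scanned or image-only PDF: fall back to OCR. If Tesseract is
    // missing, this closure is where the install hint would surface.
    ocr()
}
```

Because the OCR closure is only called on the empty-result branch, digital PDFs never pay the cost of shelling out.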

OCR path: extract embedded XObject /Image streams from each page via
lopdf, detect format from magic bytes (JPEG/PNG/TIFF), write to temp
file, shell out to tesseract with --oem 3 --psm 6 (default engine mode,
LSTM where the installed models support it, + single uniform text
block), read output, clean up. Temp files are removed even on error.
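A minimal sketch of the per-image OCR steps, using only the standard library. All names here (`ImageFormat`, `detect_image_format`, `tesseract_command`, `TempFile`, `ocr_image`, `INSTALL_HINT`) are hypothetical and are not taken from the commit. The image bytes are assumed to have already been pulled out of the PDF, and the Drop guard is just one way to get cleanup on error.

```rust
use std::ffi::OsString;
use std::fs;
use std::io::ErrorKind;
use std::path::{Path, PathBuf};
use std::process::Command;

const INSTALL_HINT: &str =
    "tesseract not found; run `apt install tesseract-ocr tesseract-ocr-eng`";

#[derive(Debug, Clone, Copy, PartialEq)]
enum ImageFormat { Jpeg, Png, Tiff }

/// Identify the image format from its leading magic bytes.
fn detect_image_format(bytes: &[u8]) -> Option<ImageFormat> {
    if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some(ImageFormat::Jpeg)
    } else if bytes.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some(ImageFormat::Png)
    } else if bytes.starts_with(b"II*\0") || bytes.starts_with(b"MM\0*") {
        Some(ImageFormat::Tiff) // little-endian or big-endian TIFF header
    } else {
        None
    }
}

/// Build `tesseract <input> <outputbase> --oem 3 --psm 6`.
/// Tesseract writes its text to `<outputbase>.txt`.
fn tesseract_command(input: &Path, output_base: &Path) -> Command {
    let mut cmd = Command::new("tesseract");
    cmd.arg(input).arg(output_base).args(["--oem", "3", "--psm", "6"]);
    cmd
}

/// Deletes the wrapped path when dropped, so early returns still clean up.
struct TempFile(PathBuf);
impl Drop for TempFile {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.0);
    }
}

/// Append ".ext" to a path without replacing any existing extension.
fn with_suffix(base: &Path, ext: &str) -> PathBuf {
    let mut s: OsString = base.as_os_str().to_owned();
    s.push(".");
    s.push(ext);
    PathBuf::from(s)
}

/// OCR one embedded image; `base` is a caller-chosen temp path prefix.
fn ocr_image(bytes: &[u8], base: &Path) -> Result<String, String> {
    let format = detect_image_format(bytes)
        .ok_or_else(|| "unrecognized image format".to_string())?;
    let ext = match format {
        ImageFormat::Jpeg => "jpg",
        ImageFormat::Png => "png",
        ImageFormat::Tiff => "tif",
    };
    let input = TempFile(with_suffix(base, ext));
    let output = TempFile(with_suffix(base, "txt"));
    fs::write(&input.0, bytes).map_err(|e| e.to_string())?;
    let status = tesseract_command(&input.0, base)
        .status()
        .map_err(|e| match e.kind() {
            // Binary not on PATH: return the actionable install hint.
            ErrorKind::NotFound => INSTALL_HINT.to_string(),
            _ => e.to_string(),
        })?;
    if !status.success() {
        return Err(format!("tesseract exited with {status}"));
    }
    fs::read_to_string(&output.0).map_err(|e| e.to_string())
}
```

Both guards are created before Tesseract runs, so the input image and any partial output are removed whether `ocr_image` returns `Ok` or `Err`.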

Schema unchanged — both paths produce (source_file, page_number,
text_content) so downstream consumers (chunker, vectord, queryd) work
identically regardless of how text was produced.
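As a rough illustration of the shared schema, the row shape might look like the struct below. The field names come from the commit message. The Rust types, the `PageRecord` and `to_records` names, and the 1-based page numbering are assumptions.

```rust
/// One row of extracted text. Both the text-layer path and the OCR path
/// would emit this same shape, so consumers need no branching.
#[derive(Debug, Clone, PartialEq)]
struct PageRecord {
    source_file: String,
    page_number: u32, // assumed 1-based
    text_content: String,
}

/// Turn per-page text (from either extraction path) into schema rows.
fn to_records(source_file: &str, page_texts: &[String]) -> Vec<PageRecord> {
    page_texts
        .iter()
        .enumerate()
        .map(|(i, text)| PageRecord {
            source_file: source_file.to_string(),
            page_number: (i + 1) as u32,
            text_content: text.clone(),
        })
        .collect()
}
```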

Verified: created a synthetic scanned PDF (PIL → image → PDF with no
text layer), ingested via POST /ingest/file. Tesseract recovered the
text with expected OCR artifacts. Queryable via DataFusion SQL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-16 20:45:00 -05:00

crates

PDF OCR via Tesseract — scanned documents now ingestible

2026-04-16 20:45:00 -05:00

data

Stress test suite: 9/9 passed — architecture validated

2026-03-27 22:13:27 -05:00

docs

Phase E.2: Compaction integrates tombstones — physical deletion closes GDPR loop

2026-04-16 10:38:30 -05:00

inbox/processed

Scheduled ingest: file watcher auto-ingests from ./inbox