2 Commits

Author SHA1 Message Date
root
2592f8fcb3 PDF OCR via Tesseract — scanned documents now ingestible
Two-tier PDF extraction: lopdf text layer first (fast, digital PDFs),
Tesseract OCR fallback when text extraction yields zero pages (scanned
documents, image-only PDFs). Falls back gracefully if Tesseract isn't
installed — returns an actionable error directing the operator to
`apt install tesseract-ocr tesseract-ocr-eng`.

OCR path: extract embedded XObject /Image streams from each page via
lopdf, detect format from magic bytes (JPEG/PNG/TIFF), write to temp
file, shell out to tesseract with --oem 3 --psm 6 (LSTM + uniform
text block), read output, clean up. Temp files cleaned even on error.

Schema unchanged — both paths produce (source_file, page_number,
text_content) so downstream consumers (chunker, vectord, queryd) work
identically regardless of how text was produced.

Verified: created a synthetic scanned PDF (PIL → image → PDF with no
text layer), ingested via POST /ingest/file. Tesseract recovered the
text with expected OCR artifacts. Queryable via DataFusion SQL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 20:45:00 -05:00
root
bb05c4412e Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support
- ingestd crate: detect file type → parse → schema detection → Parquet → catalog
- CSV: auto-detect column types (int, float, bool, string), handles $, %, commas
  Strips dollar signs from amounts, flexible row parsing, sanitized column names
- JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c)
- PDF: text extraction via lopdf, one row per page (source_file, page_number, text)
- Text/SMS: line-based ingestion with line numbers
- Dedup: SHA-256 content hash, re-ingest same file = no-op
- Gateway: POST /ingest/file multipart upload, 256MB body limit
- Schema detection per ADR-010: ambiguous types default to String
- 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup)
- Tested: messy CSV with missing data, dollar amounts, N/A values → queryable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:07:31 -05:00