2 Commits

Author SHA1 Message Date
root
35f0559d78 Phase 14: Schema evolution with AI migration rules
- Schema diff detection: compare old vs new schema, identify changes
  (added, removed, type changed, renamed columns)
- Fuzzy rename detection: "first_name" → "full_name" detected by shared word parts
- Auto-generated migration rules: direct map, cast, concat, split, drop
  Each rule has confidence score (0.0-1.0)
- AI migration prompt builder: generates LLM prompt for complex schema changes
  LLM suggests JSON migration rules when heuristics aren't enough
- 5 new unit tests (detect added, removed, type change, rename, rule generation)
- 30 total unit tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 19:31:19 -05:00
root
bb05c4412e Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support
- ingestd crate: detect file type → parse → schema detection → Parquet → catalog
- CSV: auto-detect column types (int, float, bool, string), handles $, %, commas
  Strips dollar signs from amounts, flexible row parsing, sanitized column names
- JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c)
- PDF: text extraction via lopdf, one row per page (source_file, page_number, text)
- Text/SMS: line-based ingestion with line numbers
- Dedup: SHA-256 content hash, re-ingest same file = no-op
- Gateway: POST /ingest/file multipart upload, 256MB body limit
- Schema detection per ADR-010: ambiguous types default to String
- 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup)
- Tested: messy CSV with missing data, dollar amounts, N/A values → queryable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:07:31 -05:00