lakehouse

profit/lakehouse

Commit Graph

Author

SHA1

Message

Date

root

35f0559d78

Phase 14: Schema evolution with AI migration rules

- Schema diff detection: compare old vs new schema, identify changes
  (added, removed, type changed, renamed columns)
- Fuzzy rename detection: "first_name" → "full_name" detected by shared word parts
- Auto-generated migration rules: direct map, cast, concat, split, drop
  Each rule has a confidence score (0.0-1.0)
- AI migration prompt builder: generates LLM prompt for complex schema changes
  LLM suggests JSON migration rules when heuristics aren't enough
- 5 new unit tests (detect added, removed, type change, rename, rule generation)
- 30 total unit tests passing
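
The schema diff and fuzzy rename detection described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `SchemaChange` enum, the `diff_schemas` function, the Jaccard-style similarity, and the `MIN_SIMILARITY` threshold are all assumptions made for the example. Schemas are modelled as a plain map from column name to type name.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One detected difference between two schemas (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum SchemaChange {
    Added { column: String },
    Removed { column: String },
    TypeChanged { column: String, from: String, to: String },
    Renamed { from: String, to: String, confidence: f64 },
}

/// Assumed cut-off: a removed/added pair below this similarity is not a rename.
const MIN_SIMILARITY: f64 = 0.3;

/// Split a column name into lowercase word parts on '_'.
fn word_parts(name: &str) -> BTreeSet<String> {
    name.to_lowercase()
        .split('_')
        .filter(|p| !p.is_empty())
        .map(str::to_string)
        .collect()
}

/// Shared word parts divided by all distinct word parts (Jaccard similarity).
fn name_similarity(a: &str, b: &str) -> f64 {
    let (pa, pb) = (word_parts(a), word_parts(b));
    let union = pa.union(&pb).count();
    if union == 0 {
        return 0.0;
    }
    pa.intersection(&pb).count() as f64 / union as f64
}

/// Compare two schemas (column name -> type name) and list the changes.
pub fn diff_schemas(
    old: &BTreeMap<String, String>,
    new: &BTreeMap<String, String>,
) -> Vec<SchemaChange> {
    let mut changes = Vec::new();

    // Columns present on both sides: only the type can differ.
    for (col, old_ty) in old {
        if let Some(new_ty) = new.get(col) {
            if old_ty != new_ty {
                changes.push(SchemaChange::TypeChanged {
                    column: col.clone(),
                    from: old_ty.clone(),
                    to: new_ty.clone(),
                });
            }
        }
    }

    let removed: Vec<&String> = old.keys().filter(|c| !new.contains_key(*c)).collect();
    let mut added: Vec<&String> = new.keys().filter(|c| !old.contains_key(*c)).collect();
    let mut unmatched_removed = Vec::new();

    // Pair each removed column with the most similar still-unpaired added column.
    for old_col in removed {
        let mut best: Option<(usize, f64)> = None;
        for (i, new_col) in added.iter().enumerate() {
            let sim = name_similarity(old_col, new_col);
            if sim >= MIN_SIMILARITY && best.map_or(true, |(_, s)| sim > s) {
                best = Some((i, sim));
            }
        }
        match best {
            Some((i, sim)) => {
                let new_col = added.remove(i);
                changes.push(SchemaChange::Renamed {
                    from: old_col.clone(),
                    to: new_col.clone(),
                    confidence: sim,
                });
            }
            None => unmatched_removed.push(old_col),
        }
    }

    // Whatever could not be paired is a plain removal or addition.
    for col in unmatched_removed {
        changes.push(SchemaChange::Removed { column: col.clone() });
    }
    for col in added {
        changes.push(SchemaChange::Added { column: col.clone() });
    }
    changes
}
```

With this sketch, "first_name" and "full_name" share one of three distinct word parts, so the pair is reported as a rename with a confidence of about 0.33. A low score like that is the kind of case where a review step or the LLM prompt path mentioned above would plausibly take over.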

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 19:31:19 -05:00

root

9e53caaec3

Phase 10: Rich catalog v2 — metadata as product

- DatasetManifest expanded: description, owner, sensitivity, columns,
  lineage, freshness contract, tags, row_count
- All new fields use #[serde(default)] for backward compatibility
- PII auto-detection: scans column names for email, phone, SSN, salary,
  address, DOB, medical terms — flags as PII/PHI/Financial
- Column-level metadata: name, type, sensitivity, is_pii flag
- Lineage tracking: source_system, source_file, ingest_job, timestamp
- Ingest pipeline auto-populates: PII scan, column meta, lineage, row count
- PATCH /catalog/datasets/by-name/{name}/metadata — update metadata
- Catalog responses now include all rich fields
- 25 unit tests passing (5 new PII detection tests)

Per ADR-013: datasets without metadata become mystery files.
This makes every ingested file self-describing from day one.
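
As an illustration of the PII auto-detection and column-level metadata listed above, here is a minimal sketch that classifies columns by name alone. The `Sensitivity` enum, the `ColumnMeta` struct, the term lists, and the mapping of terms to PII/PHI/Financial are assumptions for the example; the commit message names the terms but not how they are grouped or matched. Matching is done on whole word tokens rather than substrings, which is also an assumption.

```rust
/// Sensitivity classes (hypothetical names modelled on the commit message).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sensitivity {
    Public,
    Pii,
    Phi,
    Financial,
}

/// Column-level metadata: name, type, sensitivity, is_pii flag.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ColumnMeta {
    pub name: String,
    pub data_type: String,
    pub sensitivity: Sensitivity,
    pub is_pii: bool,
}

// Assumed term lists; "diagnosis" is an invented stand-in for "medical terms".
const PII_TERMS: &[&str] = &["email", "phone", "ssn", "address", "dob"];
const PHI_TERMS: &[&str] = &["medical", "diagnosis"];
const FINANCIAL_TERMS: &[&str] = &["salary"];

/// True if any token of the column name appears in the term list.
fn has_any(tokens: &[&str], terms: &[&str]) -> bool {
    tokens.iter().any(|t| terms.contains(t))
}

/// Classify a column from its name only; values are never inspected.
pub fn classify_column(name: &str) -> Sensitivity {
    let lower = name.to_lowercase();
    // Tokenise on anything that is not a letter or digit ("_", "-", " ", ...).
    let tokens: Vec<&str> = lower
        .split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|t| !t.is_empty())
        .collect();

    if has_any(&tokens, PHI_TERMS) {
        Sensitivity::Phi
    } else if has_any(&tokens, FINANCIAL_TERMS) {
        Sensitivity::Financial
    } else if has_any(&tokens, PII_TERMS) {
        Sensitivity::Pii
    } else {
        Sensitivity::Public
    }
}

/// Build column metadata for a list of (name, type) pairs.
pub fn scan_columns(columns: &[(&str, &str)]) -> Vec<ColumnMeta> {
    columns
        .iter()
        .map(|(name, data_type)| {
            let sensitivity = classify_column(name);
            ColumnMeta {
                name: name.to_string(),
                data_type: data_type.to_string(),
                sensitivity,
                // Assumption: every non-public class sets the flag.
                is_pii: sensitivity != Sensitivity::Public,
            }
        })
        .collect()
}
```

Token matching keeps a column such as "adobe_id" from being flagged because it happens to contain "dob", at the cost of missing names like "emailaddr" that do not separate their words.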

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 09:15:09 -05:00

root

bb05c4412e

Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support

- ingestd crate: detect file type → parse → schema detection → Parquet → catalog
- CSV: auto-detect column types (int, float, bool, string), handles $, %, commas
  Strips dollar signs from amounts, flexible row parsing, sanitized column names
- JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c)
- PDF: text extraction via lopdf, one row per page (source_file, page_number, text)
- Text/SMS: line-based ingestion with line numbers
- Dedup: SHA-256 content hash, re-ingest same file = no-op
- Gateway: POST /ingest/file multipart upload, 256MB body limit
- Schema detection per ADR-010: ambiguous types default to String
- 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup)
- Tested: messy CSV with missing data, dollar amounts, N/A values → queryable
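
The CSV type detection described above (int, float, bool, string, with $, %, and thousands separators tolerated, and ambiguous columns defaulting to String) could look something like the sketch below. The function names, the treatment of empty cells and "N/A" as missing, and the rule that a mix of ints and floats widens to float are assumptions for illustration, not details confirmed by the commit.

```rust
/// Column types named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnType {
    Int,
    Float,
    Bool,
    String,
}

/// Assumed notion of a missing value: empty cell or "N/A".
fn is_missing(cell: &str) -> bool {
    let t = cell.trim();
    t.is_empty() || t.eq_ignore_ascii_case("n/a")
}

/// Drop currency, percent and thousands-separator characters.
fn normalize_number(cell: &str) -> String {
    cell.trim()
        .chars()
        .filter(|c| !matches!(c, '$' | '%' | ','))
        .collect()
}

/// Type of a single non-missing cell.
fn cell_type(cell: &str) -> ColumnType {
    let t = cell.trim();
    if t.eq_ignore_ascii_case("true") || t.eq_ignore_ascii_case("false") {
        return ColumnType::Bool;
    }
    let n = normalize_number(t);
    // Require a digit so that words such as "nan" or "inf" stay strings.
    let has_digit = n.chars().any(|c| c.is_ascii_digit());
    if has_digit && n.parse::<i64>().is_ok() {
        ColumnType::Int
    } else if has_digit && n.parse::<f64>().is_ok() {
        ColumnType::Float
    } else {
        ColumnType::String
    }
}

/// Infer one type for a whole column; anything ambiguous becomes String.
pub fn infer_column_type(cells: &[&str]) -> ColumnType {
    let mut inferred: Option<ColumnType> = None;
    for cell in cells.iter().filter(|c| !is_missing(c)) {
        let ty = cell_type(cell);
        inferred = Some(match (inferred, ty) {
            (None, t) => t,
            (Some(a), b) if a == b => a,
            // Ints and floats in the same column widen to float.
            (Some(ColumnType::Int), ColumnType::Float)
            | (Some(ColumnType::Float), ColumnType::Int) => ColumnType::Float,
            // Any other mix is ambiguous.
            _ => ColumnType::String,
        });
    }
    // A column with no usable values is also treated as String.
    inferred.unwrap_or(ColumnType::String)
}
```

Under these assumptions a column of "$1,200", "350", "N/A" is inferred as Int, while a column mixing "1" and "abc" falls back to String, which matches the default-to-String behaviour the bullet list attributes to ADR-010.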

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-27 08:07:31 -05:00

3 Commits