- ingestd crate: detect file type → parse → schema detection → Parquet → catalog
- CSV: auto-detect column types (int, float, bool, string), handles $, %, commas
  Strips dollar signs from amounts, flexible row parsing, sanitized column names
- JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c)
- PDF: text extraction via lopdf, one row per page (source_file, page_number, text)
- Text/SMS: line-based ingestion with line numbers
- Dedup: SHA-256 content hash, re-ingest same file = no-op
- Gateway: POST /ingest/file multipart upload, 256MB body limit
- Schema detection per ADR-010: ambiguous types default to String
- 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup)
- Tested: messy CSV with missing data, dollar amounts, N/A values → queryable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
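The CSV type-inference rule described above (int, float, bool, string; tolerate $, %, and commas; ambiguous columns fall back to String) could look roughly like the sketch below. This is not the ingestd code, which is not shown here. The names `ColType`, `is_missing`, `normalize_number`, `infer_cell`, and `infer_column` are hypothetical, and both the list of missing-value markers and the Int-to-Float widening are assumptions made for illustration.

```rust
/// Column types named in the commit message for CSV auto-detection.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColType {
    Int,
    Float,
    Bool,
    Str,
}

/// Hypothetical helper: empty cells and a few common markers count as missing.
/// The exact marker list is an assumption, not taken from ingestd.
fn is_missing(cell: &str) -> bool {
    let t = cell.trim();
    t.is_empty()
        || t.eq_ignore_ascii_case("n/a")
        || t.eq_ignore_ascii_case("na")
        || t.eq_ignore_ascii_case("null")
}

/// Hypothetical helper: drop a leading '$', a trailing '%', and thousands commas.
fn normalize_number(cell: &str) -> String {
    let t = cell.trim();
    let t = t.strip_prefix('$').unwrap_or(t);
    let t = t.strip_suffix('%').unwrap_or(t);
    t.replace(',', "")
}

/// Classify a single non-missing cell.
fn infer_cell(cell: &str) -> ColType {
    let t = cell.trim();
    if t.eq_ignore_ascii_case("true") || t.eq_ignore_ascii_case("false") {
        return ColType::Bool;
    }
    let n = normalize_number(t);
    // Require a digit so that words such as "NaN" or "inf", which f64 parsing
    // accepts, are still treated as strings.
    let has_digit = n.chars().any(|c| c.is_ascii_digit());
    if has_digit && n.parse::<i64>().is_ok() {
        ColType::Int
    } else if has_digit && n.parse::<f64>().is_ok() {
        ColType::Float
    } else {
        ColType::Str
    }
}

/// Infer one type for a whole column. Missing cells are skipped, Int and Float
/// widen to Float (an assumption), and any other mix defaults to Str, in the
/// spirit of "ambiguous types default to String".
pub fn infer_column(cells: &[&str]) -> ColType {
    let mut result: Option<ColType> = None;
    for &cell in cells {
        if is_missing(cell) {
            continue;
        }
        let ty = infer_cell(cell);
        result = Some(match (result, ty) {
            (None, t) => t,
            (Some(a), b) if a == b => a,
            (Some(ColType::Int), ColType::Float)
            | (Some(ColType::Float), ColType::Int) => ColType::Float,
            _ => ColType::Str,
        });
    }
    // A column with no usable values has no evidence for a narrower type.
    result.unwrap_or(ColType::Str)
}
```

Under these assumptions, a column such as `["$1,200.50", "$3", "N/A"]` would be inferred as Float, while `["1", "abc"]` would fall back to Str. Defaulting to Str whenever cells disagree keeps the Parquet write from failing on a single stray value, at the cost of leaving some numeric columns untyped.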
15 lines
447 B
JSON
{
  "id": "478072c3-0c95-46a2-9193-f4b3ac4085ab",
  "name": "test_ingest",
  "schema_fingerprint": "4bdc4e5baeddc1187aecd4bfb788654f26145c2ba346b4bec6ca8ab950e1c133",
  "objects": [
    {
      "bucket": "data",
      "key": "datasets/test_ingest.parquet",
      "size_bytes": 3129,
      "created_at": "2026-03-27T13:06:57.437484309Z"
    }
  ],
  "created_at": "2026-03-27T13:06:57.437488259Z",
  "updated_at": "2026-03-27T13:06:57.437488259Z"
}