- ingestd crate: detect file type → parse → schema detection → Parquet → catalog
- CSV: auto-detect column types (int, float, bool, string), handles $, %, commas
  Strips dollar signs from amounts, flexible row parsing, sanitized column names
- JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c)
- PDF: text extraction via lopdf, one row per page (source_file, page_number, text)
- Text/SMS: line-based ingestion with line numbers
- Dedup: SHA-256 content hash, re-ingest same file = no-op
- Gateway: POST /ingest/file multipart upload, 256MB body limit
- Schema detection per ADR-010: ambiguous types default to String
- 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup)
- Tested: messy CSV with missing data, dollar amounts, N/A values → queryable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
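The CSV type auto-detection and the "ambiguous types default to String" rule from the commit message could look roughly like the sketch below. It is not the ingestd code. The names `ColumnType`, `is_missing`, `clean_numeric`, `infer_cell` and `infer_column` are hypothetical, and the list of missing-value markers and the Int-to-Float widening are assumptions made for illustration.

```rust
/// The four column types the commit message lists for CSV auto-detection.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnType {
    Int,
    Float,
    Bool,
    String,
}

/// Empty cells and common "missing" markers are skipped during inference.
/// Assumption: the exact marker list used by ingestd is not known.
fn is_missing(cell: &str) -> bool {
    let t = cell.trim();
    t.is_empty() || t.eq_ignore_ascii_case("n/a") || t.eq_ignore_ascii_case("null")
}

/// Drop a leading `$`, a trailing `%`, and thousands separators.
/// Assumption: these are the only decorations that need stripping.
fn clean_numeric(cell: &str) -> String {
    let t = cell.trim();
    let t = t.strip_prefix('$').unwrap_or(t);
    let t = t.strip_suffix('%').unwrap_or(t);
    t.replace(',', "")
}

/// Classify one non-missing cell.
fn infer_cell(cell: &str) -> ColumnType {
    let t = cell.trim();
    if t.eq_ignore_ascii_case("true") || t.eq_ignore_ascii_case("false") {
        return ColumnType::Bool;
    }
    let n = clean_numeric(t);
    // Require a digit so that words f64 accepts ("nan", "inf") stay strings.
    let has_digit = n.chars().any(|c| c.is_ascii_digit());
    if n.parse::<i64>().is_ok() {
        ColumnType::Int
    } else if has_digit && n.parse::<f64>().is_ok() {
        ColumnType::Float
    } else {
        ColumnType::String
    }
}

/// Infer one type for a whole column from its sampled cells.
pub fn infer_column(cells: &[&str]) -> ColumnType {
    let mut seen: Option<ColumnType> = None;
    for &cell in cells {
        if is_missing(cell) {
            continue;
        }
        let ty = infer_cell(cell);
        seen = Some(match (seen, ty) {
            (None, t) => t,
            (Some(a), b) if a == b => a,
            // Assumption: a mix of Int and Float widens to Float.
            (Some(ColumnType::Int), ColumnType::Float)
            | (Some(ColumnType::Float), ColumnType::Int) => ColumnType::Float,
            // Any other mix is ambiguous, so fall back to String.
            _ => ColumnType::String,
        });
    }
    // A column with no usable values is also treated as String.
    seen.unwrap_or(ColumnType::String)
}
```

Under these assumptions, `infer_column(&["$1,200.50", "N/A", "45"])` yields `ColumnType::Float`, while a column mixing `"12"` and `"abc"` falls back to `ColumnType::String`.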
15 lines
449 B
JSON
{
  "id": "1ca61945-d151-490b-81fd-2ca0397b68fa",
  "name": "sms_messages",
  "schema_fingerprint": "e1d079cbb2b7eedae5019767a886bd9a3396e291aa03630b9db69e9864948c09",
  "objects": [
    {
      "bucket": "data",
      "key": "datasets/sms_messages.parquet",
      "size_bytes": 2018,
      "created_at": "2026-03-27T13:07:14.253881797Z"
    }
  ],
  "created_at": "2026-03-27T13:07:14.253886027Z",
  "updated_at": "2026-03-27T13:07:14.253886027Z"
}