root
84407eeb51
Stress test suite: 9/9 passed — architecture validated
Tests:
1. Concurrent (10 queries): avg 48ms, max 50ms, no contention
2. Cross-reference (1.3M rows): 130ms, 3 JOINs + anti-join
3. Restart recovery: 12 datasets, 100K rows identical after restart
4. Pagination: 100K rows in 1000 pages, random page fetch works
5. Sustained: 70 QPS over 100 queries, 0 errors
6. Journal: write, flush, read-back correct
7. Tool registry: 6 tools execute correctly with audit
8. Cache: hot/cold verified
9. MySQL comparison: schema-on-read, vector+SQL, portable backup, PII auto-detect
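For reference, the pagination figures in test 4 imply 100 rows per page (100K rows / 1000 pages). A minimal sketch of the offset arithmetic behind a random page fetch follows; the function name `page_bounds` and its defaults are illustrative assumptions, not taken from the test suite.

```python
def page_bounds(page: int, total_rows: int = 100_000, page_count: int = 1_000) -> tuple[int, int]:
    """Return (offset, limit) for a zero-based page index.

    Hypothetical helper: the real suite may page differently.
    """
    if not 0 <= page < page_count:
        raise IndexError(f"page {page} out of range 0..{page_count - 1}")
    # Ceiling division so a non-divisible row count still fits in page_count pages.
    page_size = -(-total_rows // page_count)
    offset = page * page_size
    # The last pages may be short (or empty) when rows do not divide evenly.
    limit = max(0, min(page_size, total_rows - offset))
    return offset, limit
```

A random fetch then reduces to picking any index in `range(page_count)` and reading `limit` rows starting at `offset`.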
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 22:13:27 -05:00
root
037555802e
Systemd services: gateway, sidecar, UI survive reboots
- lakehouse.service: release gateway on :3100, auto-restart
- lakehouse-sidecar.service: Python FastAPI on :3200, auto-restart
- lakehouse-ui.service: WASM file server on :3300, auto-restart
- All enabled at boot (multi-user.target)
- scripts/serve_ui.py for systemd-compatible file serving
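A file server for WASM assets generally needs to send `.wasm` files as `application/wasm`, since browsers require that type for streaming compilation. The sketch below shows one way a script like scripts/serve_ui.py could do this with the standard library alone. The class name `WasmHandler`, the function `make_server`, and the bind address are assumptions for illustration; the actual script may differ.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class WasmHandler(SimpleHTTPRequestHandler):
    """Static file handler that serves .wasm with the correct MIME type."""

    # Extend the stock extension map rather than replacing it.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".wasm": "application/wasm",
    }


def make_server(directory: str = ".", port: int = 3300) -> ThreadingHTTPServer:
    """Build (but do not start) a server rooted at `directory`.

    The 0.0.0.0 bind address is an assumption; port 3300 matches the commit.
    """
    handler = partial(WasmHandler, directory=directory)
    return ThreadingHTTPServer(("0.0.0.0", port), handler)
```

Running `make_server("dist").serve_forever()` in the foreground (with `dist` as a placeholder directory) suits a systemd unit of `Type=simple`, because the process does not fork and systemd can restart it if it exits.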
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 22:06:28 -05:00
root
bb05c4412e
Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support
- ingestd crate: detect file type → parse → schema detection → Parquet → catalog
- CSV: auto-detect column types (int, float, bool, string), handles $, %, commas
  Strips dollar signs from amounts, flexible row parsing, sanitized column names
- JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c)
- PDF: text extraction via lopdf, one row per page (source_file, page_number, text)
- Text/SMS: line-based ingestion with line numbers
- Dedup: SHA-256 content hash, re-ingest same file = no-op
- Gateway: POST /ingest/file multipart upload, 256MB body limit
- Schema detection per ADR-010: ambiguous types default to String
- 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup)
- Tested: messy CSV with missing data, dollar amounts, N/A values → queryable
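The type inference, JSON flattening, and dedup steps above can be sketched roughly as follows. This is a Python illustration of the described behaviour, not the ingestd crate's code: every function name, the set of null markers, and the exact cleaning rules are assumptions.

```python
import hashlib
import re

# Assumed null markers; the commit only mentions N/A and missing data.
_NULLS = {"", "n/a", "na", "null", "none"}


def clean_numeric(value: str) -> str:
    """Drop a leading $, a trailing %, and thousands commas."""
    return value.strip().lstrip("$").rstrip("%").replace(",", "")


def infer_type(values: list[str]) -> str:
    """Infer a column type; anything ambiguous falls back to string."""
    present = [v.strip() for v in values if v.strip().lower() not in _NULLS]
    if not present:
        return "string"
    if all(v.lower() in {"true", "false"} for v in present):
        return "bool"
    cleaned = [clean_numeric(v) for v in present]
    if all(re.fullmatch(r"[+-]?\d+", c) for c in cleaned):
        return "int"
    if all(re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+|\d+)", c) for c in cleaned):
        return "float"
    return "string"


def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested objects, joining keys with underscores (a.b.c -> a_b_c)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def ingest_once(data: bytes, seen: set[str]) -> bool:
    """Return True if the content is new, False if its SHA-256 was already seen."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

Under these assumptions a column such as `["$1,200", "N/A", "35"]` infers as `int`, while a column mixing numbers and free text stays `string`, in line with the ADR-010 default.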
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:07:31 -05:00
root
6740a017c7
PRD v2: production roadmap with ingest, vector search, hot cache phases
- Phase 6: Ingest pipeline (CSV/JSON → schema detect → Parquet → catalog)
- Phase 7: Vector index + RAG (embed → HNSW → semantic search → LLM answer)
- Phase 8: Hot cache + incremental updates (MemTable, delta files, merge-on-read)
- ADR-008 through ADR-011: embeddings as Parquet, delta files not Delta Lake,
  schema defaults to string, not a CRM replacement
- Staffing company reference dataset (286K rows, 7 tables)
- Honest risk assessment: vector search at scale and incremental updates are hard
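As a rough illustration of the merge-on-read idea named in Phase 8, the sketch below overlays delta batches on a base snapshot at query time. It operates on in-memory lists rather than Parquet files, and the `id` key and `_deleted` tombstone flag are hypothetical conventions chosen for the example, not part of the PRD.

```python
def merge_on_read(base: list[dict], deltas: list[list[dict]], key: str = "id") -> list[dict]:
    """Apply delta batches (oldest first) over base rows without rewriting the base."""
    merged = {row[key]: row for row in base}
    for delta in deltas:
        for row in delta:
            if row.get("_deleted"):
                # Tombstone: drop the row if it exists.
                merged.pop(row[key], None)
            else:
                # Upsert: later deltas win over earlier data.
                merged[row[key]] = row
    return list(merged.values())
```

The base data is never modified, so reads pay the merge cost until a later compaction folds the deltas back in. That cost is part of why incremental updates are flagged as a risk.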
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:54:24 -05:00