lakehouse

Author	SHA1	Message	Date
root	84407eeb51	Stress test suite: 9/9 passed — architecture validated Tests: 1. Concurrent (10 queries): avg 48ms, max 50ms, no contention 2. Cross-reference (1.3M rows): 130ms, 3 JOINs + anti-join 3. Restart recovery: 12 datasets, 100K rows identical after restart 4. Pagination: 100K rows in 1000 pages, random page fetch works 5. Sustained: 70 QPS over 100 queries, 0 errors 6. Journal: write, flush, read-back correct 7. Tool registry: 6 tools execute correctly with audit 8. Cache: hot/cold verified 9. MySQL comparison: schema-on-read, vector+SQL, portable backup, PII auto-detect Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 22:13:27 -05:00
root	037555802e	Systemd services: gateway, sidecar, UI survive reboots - lakehouse.service: release gateway on :3100, auto-restart - lakehouse-sidecar.service: Python FastAPI on :3200, auto-restart - lakehouse-ui.service: WASM file server on :3300, auto-restart - All enabled at boot (multi-user.target) - scripts/serve_ui.py for systemd-compatible file serving Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 22:06:28 -05:00
root	eae51977ab	Scale test: 2.47M rows + 10K vector index benchmarked Benchmarks on 128GB RAM server: - 100K candidate filter (skills+city+status): 257ms - 1M timesheet aggregation (revenue by client): 942ms - 800K call log cross-reference (cold leads): 642ms - Triple JOIN recruiter performance: 487ms - 500K email open rate aggregation: 259ms - COUNT all 2.47M rows: 84ms - 10K vector search (cosine similarity): ~450ms - Embedding throughput: 49 chunks/sec via Ollama - RAG correctly refuses to hallucinate when no match exists Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:31:37 -05:00
root	bb05c4412e	Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support - ingestd crate: detect file type → parse → schema detection → Parquet → catalog - CSV: auto-detect column types (int, float, bool, string), handles $, %, commas Strips dollar signs from amounts, flexible row parsing, sanitized column names - JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c) - PDF: text extraction via lopdf, one row per page (source_file, page_number, text) - Text/SMS: line-based ingestion with line numbers - Dedup: SHA-256 content hash, re-ingest same file = no-op - Gateway: POST /ingest/file multipart upload, 256MB body limit - Schema detection per ADR-010: ambiguous types default to String - 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup) - Tested: messy CSV with missing data, dollar amounts, N/A values → queryable Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:07:31 -05:00
root	6740a017c7	PRD v2: production roadmap with ingest, vector search, hot cache phases - Phase 6: Ingest pipeline (CSV/JSON → schema detect → Parquet → catalog) - Phase 7: Vector index + RAG (embed → HNSW → semantic search → LLM answer) - Phase 8: Hot cache + incremental updates (MemTable, delta files, merge-on-read) - ADR-008 through ADR-011: embeddings as Parquet, delta files not Delta Lake, schema defaults to string, not a CRM replacement - Staffing company reference dataset (286K rows, 7 tables) - Honest risk assessment: vector search at scale and incremental updates are hard Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 07:54:24 -05:00
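
The Phase 6 commit above describes three ingest behaviours: CSV column type inference that falls back to string when ambiguous, nested JSON flattening (a.b.c → a_b_c), and SHA-256 content-hash dedup. A minimal sketch of those ideas follows. It is written in Python for brevity and is not the ingestd crate's code; every name in it (flatten, infer_type, DedupStore) is hypothetical, and treating "N/A" as a missing value is an assumption.

```python
import hashlib


def flatten(obj, prefix=""):
    """Flatten nested dicts, joining key paths with underscores (a.b.c -> a_b_c)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def infer_type(values):
    """Return the narrowest type fitting every non-missing value, else 'string'."""
    # Assumption: empty cells and "N/A" count as missing and are skipped.
    cleaned = [v.strip() for v in values if v.strip() not in ("", "N/A")]
    if not cleaned:
        return "string"
    if all(v.lower() in ("true", "false") for v in cleaned):
        return "bool"
    # Strip currency and percent decoration before trying numeric casts.
    numeric = [v.replace("$", "").replace(",", "").rstrip("%") for v in cleaned]
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in numeric:
                cast(v)
            return name
        except ValueError:
            continue
    # Ambiguous or mixed columns default to string.
    return "string"


class DedupStore:
    """Tracks SHA-256 digests of ingested content so a repeat ingest is a no-op."""

    def __init__(self):
        self.seen = set()

    def ingest(self, data: bytes) -> bool:
        """Return True if the content is new, False if it was already ingested."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.seen:
            return False
        self.seen.add(digest)
        return True
```

Hashing the file bytes rather than the filename means a renamed copy of the same file is still recognised as a duplicate, which matches the "re-ingest same file = no-op" behaviour the commit message reports.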

55 Commits