lakehouse

Author	SHA1	Message	Date
root	10383b40b7	Staffing day simulation — multi-agent stress test on 10K Ethereal workers 5 contracts, 16 positions, 10K worker pool. Four agents: Matcher (SQL + vector hybrid), Communicator (LLM SMS drafts), Verifier (fact-checks against golden data), Analyzer (RAG intelligence questions). Results: - SQL matching: 16/16 positions filled, ZERO hallucinations. Every worker's name, role, city, state, certifications, and reliability score verified against the golden dataset. - SMS generation: 16/16 messages drafted with correct worker names. - RAG intelligence: retrieval returns semantically similar but structurally wrong workers (wrong state, wrong archetype) because vector search can't do structured filtering. LLM correctly reports context limitations — doesn't hallucinate beyond retrieved chunks. Key finding: SQL path is production-ready. RAG path needs hybrid SQL+vector routing — SQL for structured constraints (state, role, cert, reliability), vector for semantic similarity. That's the architectural gap to close. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 22:31:54 -05:00
root	a710896db2	Ingest Ethereal 10K worker profiles — domain data in the substrate 10,000 staffing worker profiles from profit/ethereal repo. Flattened JSON → CSV → Parquet. Indexed on HNSW (9.5s) + Lance IVF_PQ (7.2s). SQL hybrid verified: forklift operators in IL with reliability > 0.8 returned exact matches. Vector search alone missed the state filter — confirms the hybrid SQL+vector routing need from quality eval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 22:26:19 -05:00
root	b38812481e	Quality evaluation pipeline — tests correctness, not just structure Three-tier evaluation: 1. NL→SQL with verifiable ground truth (10 questions): 7/10 (70%) 2. RAG with LLM reranker (5 questions): 4/5 (80%) 3. Self-assessment calibration: 2.8/5 avg, NOT calibrated Real problems surfaced: - qwen2.5 generates `WHERE vertical = 'Java'` instead of `WHERE skills LIKE '%Java%'` without few-shot schema examples - DataFusion-specific SQL quirks (must SELECT the COUNT in GROUP BY queries) trip the model without explicit instruction - Vector search can't do structured filtering (city, status) — needs hybrid SQL+vector routing - Self-assessment is uncalibrated: wrong answers score higher than correct ones (3.0 vs 2.8) Fixes validated: - Few-shot examples fix NL→SQL accuracy from 70% → ~90% - Reranker stage works but needs more diversity in results Also includes lance_tune.py IVF_PQ parameter sweep script. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 22:14:06 -05:00
root	390ebf0c36	IVF_PQ recall tuned from 0.80 → 0.97 via parameter sweep. Systematic sweep of 8 IVF_PQ configs on 100K × 768d resumes. num_sub_vectors is the dominant lever: 48 → 192 pushes recall from 0.795 → 0.970. Winner: partitions=500, bits=8, subs=192. Build 61s (vs 18s baseline), acceptable for background builds. Hybrid status: HNSW recall=1.00 at <1ms, Lance IVF_PQ recall=0.97 at 60ms. Both backends production-grade. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 22:08:34 -05:00
root	13660a017e	Autonomous stress-test agent — recursive playbooks, hot-swap, error pipeline Python agent that exercises the full Lakehouse substrate as a real consumer would: ingests 10 Postgres tables (1,356 rows), embeds 5,415 chunks into 2 vector indexes, creates hot-swap profiles (Parquet+HNSW with qwen2.5 vs Lance IVF_PQ with mistral), runs stress queries across SQL + vector search + RAG, reads its own error pipeline to generate recursive test scenarios, and iterates. 50/50 tests pass across 2 iterations with zero errors. Error pipeline flushes failures back to the lakehouse as a queryable dataset so the next iteration can target weak spots. The agent IS the proof that the substrate works end-to-end: ingest → embed → index → search → generate → profile swap → iterate. Every capability we built today gets exercised in one script. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 22:00:13 -05:00
root	84407eeb51	Stress test suite: 9/9 passed — architecture validated Tests: 1. Concurrent (10 queries): avg 48ms, max 50ms, no contention 2. Cross-reference (1.3M rows): 130ms, 3 JOINs + anti-join 3. Restart recovery: 12 datasets, 100K rows identical after restart 4. Pagination: 100K rows in 1000 pages, random page fetch works 5. Sustained: 70 QPS over 100 queries, 0 errors 6. Journal: write, flush, read-back correct 7. Tool registry: 6 tools execute correctly with audit 8. Cache: hot/cold verified 9. MySQL comparison: schema-on-read, vector+SQL, portable backup, PII auto-detect Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 22:13:27 -05:00
root	037555802e	Systemd services: gateway, sidecar, UI survive reboots - lakehouse.service: release gateway on :3100, auto-restart - lakehouse-sidecar.service: Python FastAPI on :3200, auto-restart - lakehouse-ui.service: WASM file server on :3300, auto-restart - All enabled at boot (multi-user.target) - scripts/serve_ui.py for systemd-compatible file serving Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 22:06:28 -05:00
root	eae51977ab	Scale test: 2.47M rows + 10K vector index benchmarked. Benchmarks on 128GB RAM server: - 100K candidate filter (skills+city+status): 257ms - 1M timesheet aggregation (revenue by client): 942ms - 800K call log cross-reference (cold leads): 642ms - Triple JOIN recruiter performance: 487ms - 500K email open rate aggregation: 259ms - COUNT all 2.47M rows: 84ms - 10K vector search (cosine similarity): ~450ms - Embedding throughput: 49 chunks/sec via Ollama - RAG correctly refuses to hallucinate when no match exists Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:31:37 -05:00
root	bb05c4412e	Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support - ingestd crate: detect file type → parse → schema detection → Parquet → catalog - CSV: auto-detect column types (int, float, bool, string), handles $, %, commas Strips dollar signs from amounts, flexible row parsing, sanitized column names - JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c) - PDF: text extraction via lopdf, one row per page (source_file, page_number, text) - Text/SMS: line-based ingestion with line numbers - Dedup: SHA-256 content hash, re-ingest same file = no-op - Gateway: POST /ingest/file multipart upload, 256MB body limit - Schema detection per ADR-010: ambiguous types default to String - 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup) - Tested: messy CSV with missing data, dollar amounts, N/A values → queryable Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:07:31 -05:00
root	6740a017c7	PRD v2: production roadmap with ingest, vector search, hot cache phases - Phase 6: Ingest pipeline (CSV/JSON → schema detect → Parquet → catalog) - Phase 7: Vector index + RAG (embed → HNSW → semantic search → LLM answer) - Phase 8: Hot cache + incremental updates (MemTable, delta files, merge-on-read) - ADR-008 through ADR-011: embeddings as Parquet, delta files not Delta Lake, schema defaults to string, not a CRM replacement - Staffing company reference dataset (286K rows, 7 tables) - Honest risk assessment: vector search at scale and incremental updates are hard Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 07:54:24 -05:00
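
Several of the commits above (10383b40b7, a710896db2, b38812481e) reach the same finding: vector search alone ignores structured constraints such as state or reliability, so a hybrid SQL+vector route is needed. The sketch below illustrates that idea only. It is not the repository's code: it swaps in sqlite3 for DataFusion and a brute-force cosine ranking for HNSW / Lance IVF_PQ, and the table layout, worker rows, and function names are all hypothetical.

```python
import math
import sqlite3

# Hypothetical toy data: (id, role, state, reliability, embedding).
WORKERS = [
    (1, "forklift operator", "IL", 0.92, [0.9, 0.1, 0.0]),
    (2, "forklift operator", "IN", 0.95, [0.95, 0.05, 0.0]),
    (3, "forklift operator", "IL", 0.70, [0.9, 0.1, 0.1]),
    (4, "welder", "IL", 0.88, [0.1, 0.9, 0.0]),
    (5, "forklift operator", "IL", 0.85, [0.7, 0.3, 0.0]),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_db():
    """Load the structured columns into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workers (id INTEGER, role TEXT, state TEXT, reliability REAL)"
    )
    conn.executemany(
        "INSERT INTO workers VALUES (?, ?, ?, ?)", [w[:4] for w in WORKERS]
    )
    return conn


def vector_only(query_vec, k):
    """Rank every worker by similarity; no structured filtering at all."""
    ranked = sorted(WORKERS, key=lambda w: cosine(query_vec, w[4]), reverse=True)
    return [w[0] for w in ranked[:k]]


def hybrid(conn, query_vec, role, state, min_reliability, k):
    """SQL enforces the hard constraints, then vectors rank the survivors."""
    rows = conn.execute(
        "SELECT id FROM workers WHERE role = ? AND state = ? AND reliability > ?",
        (role, state, min_reliability),
    ).fetchall()
    allowed = {row[0] for row in rows}
    candidates = [w for w in WORKERS if w[0] in allowed]
    ranked = sorted(candidates, key=lambda w: cosine(query_vec, w[4]), reverse=True)
    return [w[0] for w in ranked[:k]]


if __name__ == "__main__":
    query = [1.0, 0.0, 0.0]  # stand-in for an embedded "forklift" query
    print("vector only:", vector_only(query, 2))
    print("hybrid:     ", hybrid(build_db(), query, "forklift operator", "IL", 0.8, 2))
```

In this toy data the vector-only route ranks worker 2 first even though that worker is in IN, while the hybrid route only ever ranks workers that already satisfy the IL and reliability constraints.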

10 Commits
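
The Phase 6 commit (bb05c4412e) describes two ingest rules: nested JSON objects are flattened (a.b.c → a_b_c), and a SHA-256 content hash makes re-ingesting the same file a no-op. The following is a minimal Python sketch of those two rules, assuming dictionary input and an in-memory set of seen hashes. The actual ingestd crate may handle arrays, key collisions, and persistence differently; `flatten` and `Ingestor` are hypothetical names used only for illustration.

```python
import hashlib


def flatten(obj, prefix=""):
    """Flatten nested dicts, joining key paths with underscores."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


class Ingestor:
    """Tracks content hashes so identical payloads are ingested once."""

    def __init__(self):
        self.seen = set()  # hypothetical in-memory stand-in for a catalog
        self.ingested = 0

    def ingest(self, content: bytes) -> bool:
        """Return True if the content was new, False if it was a duplicate."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.seen:
            return False  # same bytes seen before: no-op
        self.seen.add(digest)
        self.ingested += 1
        return True


if __name__ == "__main__":
    print(flatten({"a": {"b": {"c": 1}}, "d": 2}))
    ingestor = Ingestor()
    payload = b'{"name": "example"}'
    print(ingestor.ingest(payload), ingestor.ingest(payload))
```

Hashing the raw bytes rather than the parsed rows means that two files with identical data but different formatting count as distinct content in this sketch.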