lakehouse

Go to file

root b38812481e Quality evaluation pipeline — tests correctness, not just structure

Three-tier evaluation:
1. NL→SQL with verifiable ground truth (10 questions): 7/10 (70%)
2. RAG with LLM reranker (5 questions): 4/5 (80%)
3. Self-assessment calibration: 2.8/5 avg, NOT calibrated

Real problems surfaced:
- qwen2.5 generates `WHERE vertical = 'Java'` instead of
  `WHERE skills LIKE '%Java%'` without few-shot schema examples
- DataFusion-specific SQL quirks (must SELECT the COUNT in
  GROUP BY queries) trip the model without explicit instruction
- Vector search can't do structured filtering (city, status) —
  needs hybrid SQL+vector routing
- Self-assessment is uncalibrated: wrong answers score higher
  than correct ones (3.0 vs 2.8)

Fixes validated:
- Few-shot examples fix NL→SQL accuracy from 70% → ~90%
- Reranker stage works but needs more diversity in results

Also includes lance_tune.py IVF_PQ parameter sweep script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-16 22:14:06 -05:00

crates

S3 backend for Lance — hybrid operates on real MinIO object storage

2026-04-16 21:09:42 -05:00

data

Stress test suite: 9/9 passed — architecture validated

2026-03-27 22:13:27 -05:00

docs

Update PRD + PHASES.md — reflect 8-commit 2026-04-17 push

2026-04-16 20:54:05 -05:00

inbox/processed

Scheduled ingest: file watcher auto-ingests from ./inbox