lakehouse

History

root b38812481e Quality evaluation pipeline — tests correctness, not just structure

Three-tier evaluation:
1. NL→SQL with verifiable ground truth (10 questions): 7/10 (70%)
2. RAG with LLM reranker (5 questions): 4/5 (80%)
3. Self-assessment calibration: 2.8/5 avg, NOT calibrated
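
A minimal sketch of what "verifiable ground truth" scoring can look like for tier 1: run the generated SQL and a hand-written reference query against the same data and compare the rows. This is not the code in quality_eval.py; sqlite3 stands in for DataFusion so the example is self-contained, and the `candidates` table, its rows, and the helper names are hypothetical.

```python
import sqlite3

def rows_match(conn, generated_sql, expected_sql):
    """True when the generated query runs and returns the same rows as the reference."""
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to execute counts as wrong
    want = conn.execute(expected_sql).fetchall()
    return sorted(got) == sorted(want)  # ignore row order

def accuracy(conn, cases):
    """cases: list of (generated_sql, expected_sql). Returns (passed, total)."""
    passed = sum(rows_match(conn, gen, exp) for gen, exp in cases)
    return passed, len(cases)

# Hypothetical data, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates (name TEXT, vertical TEXT, skills TEXT)")
conn.executemany("INSERT INTO candidates VALUES (?, ?, ?)", [
    ("Ana", "Engineering", "Java, SQL"),
    ("Ben", "Engineering", "Python"),
    ("Cy", "Finance", "Excel, Java"),
])

cases = [
    # Structurally valid but wrong: filters on vertical instead of skills
    ("SELECT COUNT(*) FROM candidates WHERE vertical = 'Java'",
     "SELECT COUNT(*) FROM candidates WHERE skills LIKE '%Java%'"),
    # Correct
    ("SELECT COUNT(*) FROM candidates WHERE skills LIKE '%Python%'",
     "SELECT COUNT(*) FROM candidates WHERE skills LIKE '%Python%'"),
]
passed, total = accuracy(conn, cases)
print(f"{passed}/{total}")
```

The first case parses and executes cleanly, so a structure-only check would pass it; comparing result rows is what flags it as wrong.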

Real problems surfaced:
- qwen2.5 generates `WHERE vertical = 'Java'` instead of
  `WHERE skills LIKE '%Java%'` without few-shot schema examples
- DataFusion-specific SQL quirks (the COUNT must appear in the
  SELECT list of GROUP BY queries) trip up the model without
  explicit instruction
- Vector search can't do structured filtering (city, status) —
  needs hybrid SQL+vector routing
- Self-assessment is uncalibrated: wrong answers score higher
  than correct ones (3.0 vs 2.8)
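
One way to check the calibration point above, sketched under assumptions: treat self-assessment as calibrated only if correct answers receive a higher mean self-score than wrong ones. The `calibration_gap` helper and the scores below are made up for illustration and are not the evaluation data.

```python
from statistics import mean

def calibration_gap(records):
    """records: iterable of (self_score, is_correct).

    Returns mean self-score of correct answers minus that of wrong answers.
    A positive gap is the minimum requirement for a useful self-assessment.
    """
    records = list(records)
    correct = [score for score, ok in records if ok]
    wrong = [score for score, ok in records if not ok]
    if not correct or not wrong:
        raise ValueError("need at least one correct and one wrong answer")
    return mean(correct) - mean(wrong)

# Illustrative scores on a 1-5 scale (hypothetical)
records = [(3, True), (2, True), (3, True), (3, False), (4, False), (2, False)]
gap = calibration_gap(records)
calibrated = gap > 0
print(f"gap={gap:+.2f} calibrated={calibrated}")
```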

Fixes validated:
- Few-shot examples raise NL→SQL accuracy from 70% to ~90%
- Reranker stage works but needs more diversity in results
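
A sketch of the kind of few-shot prompt the first fix refers to, assuming a prompt that pairs the schema with worked question/SQL examples and explicit dialect rules. The `candidates` table name, the example questions, and `build_prompt` are hypothetical; only the `vertical`/`skills` mix-up and the COUNT rule come from the problems listed above.

```python
# Worked examples that demonstrate the two failure modes noted above
FEW_SHOT = [
    ("How many candidates know Java?",
     "SELECT COUNT(*) AS n FROM candidates WHERE skills LIKE '%Java%'"),
    ("How many candidates are there per city?",
     "SELECT city, COUNT(*) AS n FROM candidates GROUP BY city"),
]

RULES = [
    "Match skills with LIKE on the skills column, not equality on vertical.",
    "In GROUP BY queries, include the COUNT in the SELECT list.",
]

def build_prompt(schema, question, examples=FEW_SHOT, rules=RULES):
    """Assemble schema, rules, and few-shot pairs into one prompt string."""
    lines = ["Translate the question into one DataFusion SQL query.", "",
             "Schema:", schema, "", "Rules:"]
    lines += [f"- {rule}" for rule in rules]
    lines.append("")
    for example_question, example_sql in examples:
        lines += [f"Question: {example_question}", f"SQL: {example_sql}", ""]
    # End on an open "SQL:" so the model completes the query
    lines += [f"Question: {question}", "SQL:"]
    return "\n".join(lines)

prompt = build_prompt(
    "candidates(name, city, status, vertical, skills)",
    "How many active candidates know Rust?",
)
print(prompt)
```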

Also includes lance_tune.py IVF_PQ parameter sweep script.
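
For context, a parameter sweep of this kind is usually scored by recall@k against exact (brute-force) neighbours. The sketch below shows only that scoring step in plain Python; it does not call Lance and is not the contents of lance_tune.py. The neighbour ID lists and the (partitions, sub-vectors) tuples used as labels are hypothetical.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours that the approximate index returned.

    approx_ids, exact_ids: one list of neighbour IDs per query, best first.
    """
    if not exact_ids or len(approx_ids) != len(exact_ids):
        raise ValueError("need the same non-zero number of queries in both lists")
    hits = 0
    for approx, exact in zip(approx_ids, exact_ids):
        hits += len(set(approx[:k]) & set(exact[:k]))
    return hits / (k * len(exact_ids))

# Exact neighbours for two queries (hypothetical IDs)
exact = [[1, 2, 3], [4, 5, 6]]

# Approximate results per index configuration (hypothetical labels and IDs)
sweep = {
    (64, 8): [[1, 2, 9], [4, 8, 6]],
    (256, 16): [[1, 2, 3], [4, 5, 7]],
}

scores = {params: recall_at_k(found, exact, k=3) for params, found in sweep.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```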

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-16 22:14:06 -05:00

autonomous_agent.py

Autonomous stress-test agent — recursive playbooks, hot-swap, error pipeline

2026-04-16 22:00:13 -05:00

generate_demo.py

Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support

2026-03-27 08:07:31 -05:00

lance_tune.py

IVF_PQ recall tuned from 0.80 → 0.97 via parameter sweep

2026-04-16 22:08:34 -05:00

quality_eval.py

Quality evaluation pipeline — tests correctness, not just structure