profit/lakehouse

Commit Graph

Author: root
SHA1: b38812481e
Message:

Quality evaluation pipeline — tests correctness, not just structure

Three-tier evaluation:
1. NL→SQL with verifiable ground truth (10 questions): 7/10 (70%)
2. RAG with LLM reranker (5 questions): 4/5 (80%)
3. Self-assessment calibration: 2.8/5 avg, NOT calibrated
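
One way to score tier 1 against verifiable ground truth is to execute both the generated query and a hand-written reference query, then compare the rows they return. The sketch below is illustrative only: it uses Python's built-in `sqlite3` as a stand-in for DataFusion, and the `candidates` table, its rows, and the helper names `rows_match` and `accuracy` are assumptions, not taken from this commit.

```python
import sqlite3
from collections import Counter


def rows_match(conn, generated_sql, reference_sql):
    """Return True if both queries yield the same multiset of rows."""
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to run counts as a wrong answer
    expected = conn.execute(reference_sql).fetchall()
    return Counter(got) == Counter(expected)


def accuracy(conn, cases):
    """cases: list of (generated_sql, reference_sql) pairs."""
    passed = sum(rows_match(conn, gen, ref) for gen, ref in cases)
    return passed / len(cases)


# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates (name TEXT, vertical TEXT, skills TEXT)")
conn.executemany(
    "INSERT INTO candidates VALUES (?, ?, ?)",
    [
        ("Ana", "Engineering", "Java, SQL"),
        ("Ben", "Engineering", "Python"),
        ("Cy", "Finance", "Excel, Java"),
    ],
)

reference = "SELECT COUNT(*) FROM candidates WHERE skills LIKE '%Java%'"
cases = [
    # The failure mode described in this commit: filtering on the wrong column.
    ("SELECT COUNT(*) FROM candidates WHERE vertical = 'Java'", reference),
    ("SELECT COUNT(*) FROM candidates WHERE skills LIKE '%Java%'", reference),
]
score = accuracy(conn, cases)
print(f"{score:.0%}")
```

Comparing result rows rather than SQL text lets two differently written but equivalent queries both count as correct.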

Real problems surfaced:
- qwen2.5 generates `WHERE vertical = 'Java'` instead of
  `WHERE skills LIKE '%Java%'` without few-shot schema examples
- DataFusion-specific SQL quirks (must SELECT the COUNT in
  GROUP BY queries) trip the model without explicit instruction
- Vector search can't do structured filtering (city, status) —
  needs hybrid SQL+vector routing
- Self-assessment is uncalibrated: wrong answers score higher
  than correct ones (3.0 vs 2.8)
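
The calibration finding above can be checked with a simple comparison: average the model's self-assigned scores separately for correct and wrong answers and look at the difference. This is a minimal sketch under stated assumptions; the function name `calibration_gap` and the sample scores are made up for illustration and are not the data from this evaluation.

```python
from statistics import mean


def calibration_gap(results):
    """results: list of (is_correct, self_score) pairs.

    Returns mean self-score of correct answers minus mean self-score of
    wrong answers. A value at or below zero suggests the self-assessment
    does not separate right from wrong.
    """
    correct = [score for ok, score in results if ok]
    wrong = [score for ok, score in results if not ok]
    if not correct or not wrong:
        raise ValueError("need at least one correct and one wrong answer")
    return mean(correct) - mean(wrong)


# Made-up example scores on a 1-5 scale.
results = [(True, 3), (True, 2), (True, 4), (False, 4), (False, 3)]
gap = calibration_gap(results)
print(gap)
```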

Fixes validated:
- Few-shot examples raise NL→SQL accuracy from 70% to ~90%
- Reranker stage works but needs more diversity in results
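
A few-shot prompt of the kind referred to above might be assembled as follows. This is a hedged sketch, not the prompt used in this commit: the schema string, the example questions, and the names `FEW_SHOT` and `build_prompt` are all hypothetical. The two examples are chosen to cover the two failure modes listed earlier (matching skills with `LIKE`, and selecting the `COUNT` in a `GROUP BY` query).

```python
# Hypothetical few-shot examples; the table and column names are assumptions.
FEW_SHOT = [
    (
        "How many candidates know Java?",
        "SELECT COUNT(*) AS n FROM candidates WHERE skills LIKE '%Java%'",
    ),
    (
        "How many candidates are in each vertical?",
        "SELECT vertical, COUNT(*) AS n FROM candidates GROUP BY vertical",
    ),
]


def build_prompt(schema, question, examples=FEW_SHOT):
    """Build a plain-text NL-to-SQL prompt with worked examples before the question."""
    parts = ["Translate each question into SQL for this schema:", schema, ""]
    for example_question, example_sql in examples:
        parts += [f"Question: {example_question}", f"SQL: {example_sql}", ""]
    parts += [f"Question: {question}", "SQL:"]
    return "\n".join(parts)


prompt = build_prompt(
    "candidates(name, vertical, skills, city, status)",
    "How many candidates know Python?",
)
print(prompt)
```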

Also includes lance_tune.py IVF_PQ parameter sweep script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Date: 2026-04-16 22:14:06 -05:00

1 Commit