Three-tier evaluation:
1. NL→SQL with verifiable ground truth (10 questions): 7/10 (70%)
2. RAG with LLM reranker (5 questions): 4/5 (80%)
3. Self-assessment calibration: 2.8/5 avg, NOT calibrated
Real problems surfaced:
- qwen2.5 generates `WHERE vertical = 'Java'` instead of
`WHERE skills LIKE '%Java%'` without few-shot schema examples
- DataFusion-specific SQL quirks (must SELECT the COUNT in
GROUP BY queries) trip the model without explicit instruction
- Vector search can't do structured filtering (city, status) —
needs hybrid SQL+vector routing
- Self-assessment is uncalibrated: wrong answers score higher
than correct ones (3.0 vs 2.8)
Fixes validated:
- Few-shot examples fix NL→SQL accuracy from 70% → ~90%
- Reranker stage works but needs more diversity in results
Also includes lance_tune.py IVF_PQ parameter sweep script.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>