Three-tier evaluation: 1. NL→SQL with verifiable ground truth (10 questions): 7/10 (70%) 2. RAG with LLM reranker (5 questions): 4/5 (80%) 3. Self-assessment calibration: 2.8/5 avg, NOT calibrated Real problems surfaced: - qwen2.5 generates `WHERE vertical = 'Java'` instead of `WHERE skills LIKE '%Java%'` without few-shot schema examples - DataFusion-specific SQL quirks (must SELECT the COUNT in GROUP BY queries) trip the model without explicit instruction - Vector search can't do structured filtering (city, status) — needs hybrid SQL+vector routing - Self-assessment is uncalibrated: wrong answers score higher than correct ones (3.0 vs 2.8) Fixes validated: - Few-shot examples fix NL→SQL accuracy from 70% → ~90% - Reranker stage works but needs more diversity in results Also includes lance_tune.py IVF_PQ parameter sweep script. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
Rust-first object storage system
Languages
TypeScript
38.4%
Rust
35.8%
HTML
13.9%
Python
7.8%
Shell
2.1%
Other
2%