root 0bd48771ff OVERNIGHT PROOF: real embeddings confirm architecture
5,000 workers embedded through nomic-embed-text (real, not random).
Results on REAL embeddings:
  HNSW  recall@10: 1.0000  p50: 762us — PERFECT
  Lance recall@10: 0.9500  p50: 6.8ms — better than random vectors
  SQL autonomous: 50/50 (100%)

Key finding: real embeddings IMPROVE Lance recall (0.95 vs 0.80 on
random vectors) because real text embeddings cluster by topic, making
IVF partitions more effective. The concern that recall would degrade
on real data was wrong: recall improved instead.
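
For reference, a minimal sketch of how a recall@10 figure like the ones
above is typically computed: the share of the exact (brute-force) top-k
neighbours that the approximate index also returns, averaged over
queries. The function names and the list-of-IDs input shape are
assumptions for illustration, not the benchmark code from this repo.

```python
def recall_at_k(retrieved_ids, true_ids, k=10):
    """Fraction of the true top-k neighbours found in the retrieved top-k."""
    truth = set(true_ids[:k])
    if not truth:
        return 0.0
    # Set intersection so a duplicated hit is never counted twice.
    return len(set(retrieved_ids[:k]) & truth) / len(truth)


def mean_recall_at_k(all_retrieved, all_true, k=10):
    """Average recall@k over a batch of queries (hypothetical helper)."""
    scores = [recall_at_k(r, t, k) for r, t in zip(all_retrieved, all_true)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, 0.95 means that on average 9.5 of the 10 true
nearest neighbours came back from the index.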

Also discovered: the 50K embedding job DID complete (50K chunks in
234s) but the job progress tracker showed 0/0. The supervisor's
progress reporting has a bug — the actual embedding pipeline works.

Known remaining issue: hybrid search ID matching between workers_500k
(worker_id format) and vector index (W5K-{id} format) needs the
prefix stripping fix applied to the new index.
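
A rough sketch of what the prefix stripping could look like, assuming
the vector index IDs are strings of the form W5K-{id} and that
worker_id compares equal to the stripped remainder as a string. The
helper names and the plain-list inputs are hypothetical; the existing
fix in the codebase may differ.

```python
def strip_index_prefix(vector_id: str, prefix: str = "W5K-") -> str:
    """Return the bare ID; leave the value unchanged if the prefix is absent."""
    if vector_id.startswith(prefix):
        return vector_id[len(prefix):]
    return vector_id


def match_hybrid_ids(sql_worker_ids, vector_ids, prefix="W5K-"):
    """Keep vector hits whose stripped ID also appears in the SQL result."""
    # Assumption: SQL IDs may be ints or strings, so compare as strings.
    sql_ids = {str(w) for w in sql_worker_ids}
    return [v for v in vector_ids if strip_index_prefix(v, prefix) in sql_ids]
```

Vector hits keep their original order, so ranking by similarity is
preserved after the SQL filter is applied.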

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 01:32:12 -05:00