From 5d30b3da8941038554b6658316f144f628a8dc33 Mon Sep 17 00:00:00 2001
From: root <root@island37.com>
Date: Sat, 2 May 2026 21:38:00 -0500
Subject: [PATCH] lance: auto-build doc_id btree in migrate handler (root-cause
 for 10M doc-fetch slowness)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

scale_test_10m doc-fetch p50 was ~100ms — full table scan over 35GB. Root
cause: the auto-build at service.rs:1492-1503 only fires for IndexMeta-
registered indexes during set_active_profile warming. lance-bench writes
datasets through /vectors/lance/migrate/* directly, bypassing IndexMeta,
so its datasets never get the doc_id btree that ADR-019 depends on.

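Fix: build the btree inline at the end of lance_migrate. Costs ~1.2s on
10M rows (+269MB on disk), drops doc-fetch from ~100ms to ~5ms (20x).
Failure is non-fatal: it logs a warning and the dataset stays queryable
via the scan fallback. The migrate response gains a scalar_index field
(null if the btree already existed or the build failed).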

Verified live (post-restart): scale_test_10m doc-fetch 4-15ms across
5 calls, smoke 9/9 PASS, vectord-lance 7/7 unit tests PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 crates/vectord/src/service.rs           | 29 +++++++++++++++++
 reports/lance_10m_rebench_2026-05-02.md | 43 ++++++++++++++++++++-----
 2 files changed, 64 insertions(+), 8 deletions(-)

diff --git a/crates/vectord/src/service.rs b/crates/vectord/src/service.rs
index b9bcbcd..cbb4275 100644
--- a/crates/vectord/src/service.rs
+++ b/crates/vectord/src/service.rs
@@ -1866,11 +1866,40 @@ async fn lance_migrate(
         stats.disk_bytes, stats.duration_secs,
     );
 
+    // Auto-build the doc_id btree. The scalar index is what makes
+    // get_doc_by_id O(log n) instead of a full table scan; ADR-019
+    // calls this out as the load-bearing feature for hybrid lookup.
+    // Verified 2026-05-02: skipping this on a 10M-row dataset turns
+    // ~5ms doc-fetch into ~100ms (full scan over 35GB). Cheap to
+    // build (~1.2s on 10M, +269MB on disk) and only runs once per
+    // dataset since `has_scalar_index` short-circuits subsequent calls.
+    let scalar_stats = if !lance_store.has_scalar_index("doc_id").await.unwrap_or(false) {
+        match lance_store.build_scalar_index("doc_id").await {
+            Ok(s) => {
+                tracing::info!(
+                    "lance migrate '{}': doc_id btree built in {:.2}s (+{} bytes)",
+                    index_name, s.build_time_secs, s.disk_bytes_added,
+                );
+                Some(s)
+            }
+            Err(e) => {
+                // Don't fail the whole migrate over a missing btree —
+                // the dataset is still queryable, just slowly. Log it
+                // so it's debuggable.
+                tracing::warn!("lance migrate '{}': doc_id btree build failed (will fall back to scan): {e}", index_name);
+                None
+            }
+        }
+    } else {
+        None
+    };
+
     Ok::<_, (StatusCode, String)>(Json(serde_json::json!({
         "index_name": index_name,
         "bucket": bucket,
         "lance_path": lance_store.path(),
         "stats": stats,
+        "scalar_index": scalar_stats,
     })))
 }
 
diff --git a/reports/lance_10m_rebench_2026-05-02.md b/reports/lance_10m_rebench_2026-05-02.md
index 54abee2..d2a4efd 100644
--- a/reports/lance_10m_rebench_2026-05-02.md
+++ b/reports/lance_10m_rebench_2026-05-02.md
@@ -38,7 +38,7 @@ Same query (`forklift operator`) hit 5 times in a row:
 
 **Warm-cache p50 ~20ms.** Stable across the 5 trials.
 
-## Doc-fetch by id, 5 calls (post-warmup)
+## Doc-fetch by id, 5 calls (post-warmup) — BEFORE scalar-index fix
 
 Fetched the same doc_id (`VEC-2196862`) repeatedly:
 
@@ -50,12 +50,37 @@ Fetched the same doc_id (`VEC-2196862`) repeatedly:
 | 4 | 126.5ms |
 | 5 | 140.7ms |
 
-**~100ms p50, climbing under repeat.** This is **substantially slower than the 100K-corpus number** from ADR-019 (311μs claimed; ~6ms measured today on 500k). The 100ms-class result on 10M suggests one of:
-1. The scalar btree index on `doc_id` isn't built on this dataset (possible — no `build_scalar_index` call recorded for it)
-2. 33GB doesn't fit warm; disk I/O dominates
-3. Handler-level HTTP/JSON serialization overhead is amortized at small dataset sizes but visible at 10M
+**~100ms p50, climbing under repeat.** Substantially slower than the 100K-corpus number from ADR-019 (311μs claimed; ~6ms measured today on workers_500k_v1).
 
-This is the headline finding of the bench — search is fine at 10M, but **point lookups (the load-bearing Lance feature per ADR-019) need investigation**. The fix is likely "ensure scalar index is built on doc_id at activation time," but I haven't run that experiment.
+### Root cause (investigated post-bench)
+
+`/vectors/lance/stats/scale_test_10m` returned `has_doc_id_index: false`. The scalar btree on `doc_id` was **never built** for this dataset. Doc-fetch was running a full table scan over 35GB.
+
+Cause: the auto-build code in `crates/vectord/src/service.rs:1492-1503` only fires for `IndexMeta`-registered indexes during `set_active_profile` warming. `scale_test_10m` was created by the `lance-bench` binary directly via the migrate HTTP route, which bypasses the IndexMeta registry. Warming therefore never sees it, and neither the vector index nor the scalar index gets auto-built. (The vector index was built manually via `/vectors/lance/index/scale_test_10m`; the scalar index never was.)
+
+### Doc-fetch by id, 5 calls — AFTER `POST /vectors/lance/scalar-index/scale_test_10m/doc_id`
+
+Build took **1.22s** for 10M rows, added 269MB of btree on disk.
+
+| Call | Latency |
+|---|---:|
+| 1 | 5.6ms |
+| 2 | 5.0ms |
+| 3 | 5.0ms |
+| 4 | 4.9ms |
+| 5 | 4.7ms |
+
+**~5ms p50, stable.** ~20x improvement. Matches workers_500k_v1's ~6ms baseline.
+
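+ADR-019's "O(1) random access via btree" claim is structurally vindicated (strictly, a btree lookup is O(log n), but the latency is flat from 500K to 10M rows). The 311μs projection from the 100K bench was an in-process Rust call; the live HTTP/JSON round-trip floor appears to be ~5ms on both workers_500k_v1 and scale_test_10m.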
+
+### Followup: close the IndexMeta-bypass gap
+
+The `lance-bench` binary writes datasets that never get registered in IndexMeta, so warming skips them. Two reasonable fixes:
+1. **Auto-build scalar index inside `lance_migrate` HTTP handler**: every dataset created via the migrate route gets the btree before returning. Costs 1-2 seconds at ingest time, saves ~95ms on every doc-fetch afterwards.
+2. **Have `lance-bench` register an IndexMeta entry** at the end of its run, so the existing warming code picks it up on next gateway start.
+
+Recommendation: do (1). It's a small, self-contained addition at the end of the handler, and it makes the migrate route self-sufficient: no caller needs to remember a follow-up build call.
 
 ## Compared to ADR-019 100K projections
 
@@ -63,7 +88,8 @@ This is the headline finding of the bench — search is fine at 10M, but **point
 |---|---:|---:|---|
 | Search (cold) | 2229μs | ~46ms | 21x slower at 100x scale → reasonable for IVF_PQ |
 | Search (warm) | (not measured) | ~20ms | Warm cache converges nicely |
-| Doc fetch | 311μs | ~100ms | **300x slower** — likely scalar-index gap |
+| Doc fetch (no btree) | — | ~100ms | full scan, 35GB |
+| Doc fetch (post btree build) | 311μs | ~5ms | structural win confirmed; HTTP/JSON floor explains delta |
 | Index method | lance_ivf_pq | lance_ivf_pq | confirmed via response tag |
 
 ## What this means
@@ -72,7 +98,8 @@ ADR-019's claim that "at 10M, Lance pulls ahead because HNSW doesn't fit in RAM"
 
 What the bench DID surface:
 - **Search at 10M works at production-shape latency** (~20ms warm). Acceptable for batch / async / non-conversational workloads. Too slow for sub-10ms voice or recommendation paths.
-- **Doc-fetch at 10M is slow** (~100ms). The structural Lance win cited in ADR-019 (random-access in O(1)) is a scalar-index dependency. Worth a follow-up: either confirm the index is built on this dataset and live with 100ms, or rebuild the scalar index and re-bench.
+- **Doc-fetch at 10M is fast (~5ms) once the scalar btree is built.** Pre-build was ~100ms (full scan). Built in 1.2s, +269MB on disk. ADR-019's structural claim holds.
+- **The warming-time auto-build only fires for IndexMeta-registered datasets.** `lance-bench` bypasses IndexMeta, so its datasets need either a manual `POST /vectors/lance/scalar-index/<name>/doc_id` after migration, or a small fix to the `lance_migrate` handler that builds the btree inline. Recommend the inline fix.
 - **Sanitizer fix held under load** — no 500-with-leak surfaced even on rare query pattern (TIG aluminum). The fix is robust to long-tail queries.
 
 ## Repro