lakehouse/reports/lance_10m_rebench_2026-05-02.md
root 5d30b3da89 lance: auto-build doc_id btree in migrate handler (root-cause for 10M doc-fetch slowness)
scale_test_10m doc-fetch p50 was ~100ms — full table scan over 35GB. Root
cause: the auto-build at service.rs:1492-1503 only fires for IndexMeta-
registered indexes during set_active_profile warming. lance-bench writes
datasets through /vectors/lance/migrate/* directly, bypassing IndexMeta,
so its datasets never get the doc_id btree that ADR-019 depends on.

Fix: build the btree inline at the end of lance_migrate. Costs ~1.2s on
10M rows (+269MB on disk), drops doc-fetch from ~100ms to ~5ms (20x).
Failure is non-fatal — logs a warning and the dataset stays queryable.

Verified live (post-restart): scale_test_10m doc-fetch 4-15ms across
5 calls, smoke 9/9 PASS, vectord-lance 7/7 unit tests PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:38:00 -05:00

# Lance backend re-benchmark — 10M vectors (scale_test_10m)

- Date: 2026-05-02
- Dataset: data/lance/scale_test_10m (33 GB, ~10M vectors, 768d)
- Driver: live HTTP gateway :3100/vectors/lance/* (post sanitizer-fix binary)
- Method tag on every search response: lance_ivf_pq (confirms IVF_PQ, not brute-force)

ADR-019 deferred a 10M re-bench: "at 10M we expect Lance to pull ahead because HNSW doesn't fit in RAM. Re-benchmark when we have a 10M-vector corpus to test against." The corpus exists; this is that benchmark.

## Search latency, 10 diverse queries, top_k=10 (cold)

| Query | Latency |
|---|---|
| warehouse forklift operator second shift | 50.5ms |
| senior software engineer kubernetes | 52.9ms |
| registered nurse pediatric | 37.6ms |
| welder TIG aluminum | 127.7ms |
| data scientist python | 41.6ms |
| electrician journeyman commercial | 31.4ms |
| accountant CPA tax | 28.6ms |
| machine learning research | 32.1ms |
| construction site supervisor | 31.8ms |
| biomedical engineer | 25.0ms |

Median ~35ms, mean ~46ms, with one ~128ms outlier (the TIG aluminum query — not investigated; could be a query-specific IVF traversal pattern or transient I/O).

## Search latency, repeated query (warm cache)

Same query (forklift operator) hit 5 times in a row:

| Call | Latency |
|---|---|
| 1 | 21.9ms |
| 2 | 20.2ms |
| 3 | 19.2ms |
| 4 | 22.4ms |
| 5 | 18.6ms |

Warm-cache p50 ~20ms. Stable across the 5 trials.

## Doc-fetch by id, 5 calls (post-warmup) — BEFORE scalar-index fix

Fetched the same doc_id (VEC-2196862) repeatedly:

| Call | Latency |
|---|---|
| 1 | 68.2ms |
| 2 | 89.3ms |
| 3 | 153.9ms |
| 4 | 126.5ms |
| 5 | 140.7ms |

~100ms p50, climbing under repeat. Substantially slower than the 100K-corpus number from ADR-019 (311μs claimed; ~6ms measured today on workers_500k_v1).

## Root cause (investigated post-bench)

/vectors/lance/stats/scale_test_10m returned has_doc_id_index: false. The scalar btree on doc_id was never built for this dataset. Doc-fetch was running a full table scan over 35GB.

Cause: the auto-build code in crates/vectord/src/service.rs:1492-1503 only fires for IndexMeta-registered indexes during set_active_profile warming. scale_test_10m was created by the lance-bench binary directly via the migrate HTTP route — it bypasses the IndexMeta registry, so warming never sees it, so neither the vector index nor the scalar index gets auto-built. (The vector index was built manually via /vectors/lance/index/scale_test_10m; the scalar index never was.)
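
For reference, the stats check that surfaced the gap can be run against any Lance dataset the gateway serves. A minimal sketch, assuming the same host/port as the repro section; only the has_doc_id_index field seen above is assumed of the response:

```bash
# Expect false for a dataset created via the migrate route with no follow-up build
curl -sS http://127.0.0.1:3100/vectors/lance/stats/scale_test_10m \
  | jq '.has_doc_id_index'
```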

## Doc-fetch by id, 5 calls — AFTER POST /vectors/lance/scalar-index/scale_test_10m/doc_id

Build took 1.22s for 10M rows, added 269MB of btree on disk.

| Call | Latency |
|---|---|
| 1 | 5.6ms |
| 2 | 5.0ms |
| 3 | 5.0ms |
| 4 | 4.9ms |
| 5 | 4.7ms |

~5ms p50, stable. ~20x improvement. Matches workers_500k_v1's ~6ms baseline.

ADR-019's "O(1) random access via btree" claim is structurally vindicated. The 311μs projection from the 100K bench was an in-process Rust call; the live HTTP/JSON round-trip floor is ~5ms regardless of dataset size.

## Followup: close the IndexMeta-bypass gap

The lance-bench binary writes datasets that the gateway's IndexMeta-driven warming never sees, so they miss every auto-build. Two reasonable fixes:

  1. Auto-build scalar index inside lance_migrate HTTP handler — every dataset created via the migrate route gets the btree before returning. Costs 1-2 seconds at ingest time, saves 100ms per doc-fetch forever after.
  2. Have lance-bench register an IndexMeta entry at the end of its run, so the existing warming code picks it up on next gateway start.

Recommendation: do (1). It's a one-line addition next to the existing build_index call inside the handler, and it makes the migrate route self-sufficient — no caller needs to remember a follow-up build call.
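
Until (1) lands, the interim workaround is the manual build that produced the post-fix numbers above. A minimal sketch of that sequence for a migrate-created dataset (dataset name and doc id are illustrative; the endpoints are the ones used elsewhere in this report):

```bash
# One-time scalar btree build on doc_id (took ~1.2s and +269MB on disk at 10M rows here)
curl -sS -X POST http://127.0.0.1:3100/vectors/lance/scalar-index/scale_test_10m/doc_id

# Doc-fetch by id should now land in the ~5ms range instead of ~100ms
curl -sS http://127.0.0.1:3100/vectors/lance/doc/scale_test_10m/VEC-2196862 \
  | jq '.latency_us'
```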

## Compared to ADR-019 100K projections

| Op | 100K (ADR-019) | 10M (today) | Notes |
|---|---|---|---|
| Search (cold) | 2229μs | ~46ms | 21x slower at 100x scale → reasonable for IVF_PQ |
| Search (warm) | (not measured) | ~20ms | Warm cache converges nicely |
| Doc fetch (no btree) | | ~100ms | full scan, 35GB |
| Doc fetch (post btree build) | 311μs | ~5ms | structural win confirmed; HTTP/JSON floor explains delta |
| Index method | lance_ivf_pq | lance_ivf_pq | confirmed via response tag |

## What this means

ADR-019's claim that "at 10M, Lance pulls ahead because HNSW doesn't fit in RAM" remains unverified-but-not-refuted. We can't directly compare against HNSW at 10M: the raw vectors alone are 10M × 768d × 4 bytes ≈ 30 GB, and the graph roughly doubles that — way past any single-node deployment. So Lance "wins" at 10M by being the only contender that operationally exists.

What the bench DID surface:

  • Search at 10M works at production-shape latency (~20ms warm). Acceptable for batch / async / non-conversational workloads. Too slow for sub-10ms voice or recommendation paths.
  • Doc-fetch at 10M is fast (~5ms) once the scalar btree is built. Pre-build was ~100ms (full scan). Built in 1.2s, +269MB on disk. ADR-019's structural claim holds.
  • The auto-build only fires for IndexMeta-registered datasets. lance-bench bypasses IndexMeta, so its datasets need either a manual POST /vectors/lance/scalar-index/<name>/doc_id after migration, or a one-line fix to the lance_migrate handler that builds the btree inline. Recommend the inline fix.
  • Sanitizer fix held under load — no 500-with-leak surfaced, even on a rare query pattern (TIG aluminum). The fix is robust to long-tail queries.

## Repro

```bash
# Search latency, single query
curl -sS -X POST http://127.0.0.1:3100/vectors/lance/search/scale_test_10m \
  -H 'Content-Type: application/json' \
  -d '{"query":"forklift operator","top_k":10}' | jq '.latency_us'

# Doc fetch by id
curl -sS http://127.0.0.1:3100/vectors/lance/doc/scale_test_10m/VEC-2196862 \
  | jq '.latency_us'
```
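
The repeated-call numbers in the tables above came from hitting the same endpoint several times in a row. A minimal loop for the doc-fetch case, using the same dataset and doc_id as above (the warm-search loop is analogous):

```bash
# Doc-fetch by id, 5 repeated calls against the same doc_id
for i in $(seq 1 5); do
  curl -sS http://127.0.0.1:3100/vectors/lance/doc/scale_test_10m/VEC-2196862 \
    | jq '.latency_us'
done
```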