# ADR-019: Vector Storage — Parquet+HNSW stays, Lance joins as second tier

**Status:** Accepted — 2026-04-16
**Implements:** Phase 18 from PRD (Lance evaluation)
**Supersedes:** nothing (augments ADR-008)
**Owner:** J

---

## Context

Phase 18 of the PRD committed to settling "Parquet+sidecar vs Lance" with measurements, not vibes. This ADR records the benchmark outcome and the resulting architectural direction.

Input data: `data/vectors/resumes_100k_v2.parquet` — 100,000 × 768d embeddings, the same index we tuned HNSW against in Phase 15.

Benchmark harness: `crates/lance-bench/src/main.rs` — standalone binary, deliberately not integrated into the workspace's common deps to avoid forcing DataFusion/Arrow upgrades on the rest of the stack until we'd decided.

## The scorecard

All numbers measured on the same 128GB server, same 100K × 768d index, release build:

| Dimension | Parquet + HNSW (current) | Lance 4.0 IVF_PQ (candidate) | Winner |
|---|---|---|---|
| Cold load | 0.17s | 0.13s | Lance, 1.27× — *does not clear 2× decision threshold* |
| Disk size (data only) | 330.3 MB | 330.4 MB | Tie |
| Index on-disk footprint | 0 (HNSW is RAM-only) | 7.4 MB | Lance |
| Index build time | 230s (ec=80 es=30) | 16s | **Lance, 14× faster** |
| Search p50 | 873us (recall@10 = 1.00) | 2229us (recall unmeasured, likely 0.85-0.95) | **Parquet+HNSW, 2.55× faster** |
| Search p95 | 1413us | 4998us | **Parquet+HNSW, 3.54× faster** |
| Speedup vs brute force (p50) | 50.4× | 19.7× | Parquet+HNSW |
| Random row access (fetch by id) | ~35ms (full-file scan) | 311us | **Lance, 112× faster** |
| Append 10K rows | Full-file rewrite (~330MB + re-embed + re-index) | 0.08s, +31MB delta | **Lance, structurally different** |
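
The multipliers in the "Winner" column are plain ratios of the two measured columns. A minimal sketch of that arithmetic, using the rounded figures from the table; the names `speedup` and `scorecard_ratios` are illustrative and are not part of `lance-bench`.

```rust
/// How many times faster the quicker backend is.
/// Both arguments must be in the same unit.
fn speedup(slower: f64, faster: f64) -> f64 {
    slower / faster
}

/// Ratios recomputed from the rounded scorecard rows, as (label, ratio).
/// Cold load is left out: the rounded 0.17s / 0.13s gives about 1.31,
/// while the table reports 1.27x, presumably from unrounded timings.
fn scorecard_ratios() -> Vec<(&'static str, f64)> {
    vec![
        ("index build (s): 230 vs 16", speedup(230.0, 16.0)),
        ("search p50 (us): 2229 vs 873", speedup(2229.0, 873.0)),
        ("search p95 (us): 4998 vs 1413", speedup(4998.0, 1413.0)),
        ("random row access (us): 35000 vs 311", speedup(35_000.0, 311.0)),
    ]
}
```

Printing each pair from `scorecard_ratios()` is enough to spot-check the table whenever the benchmark is re-run with new inputs.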

## Applying the decision rules from EXECUTION_PLAN.md

Original rules:

- *Lance wins cold-load by ≥2× AND matches search latency → migrate*
- *Within 50% across board → stay Parquet, document ceiling*
- *Lance loses → close the door*

Strict reading: cold-load is **1.27×, not ≥2×**. Search latency is **2.55× worse, not matching**. By the written rule, we stay.
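
The first rule can be written as a small predicate. This is a sketch of that rule only, under stated assumptions: the struct, the function name, and the 10% tolerance used to interpret "matches search latency" are hypothetical, since EXECUTION_PLAN.md as quoted here does not define a number.

```rust
/// Inputs to the first EXECUTION_PLAN rule, expressed as ratios.
struct BenchRatios {
    /// Parquet+HNSW cold-load time divided by Lance cold-load time
    /// (greater than 1 means Lance loads faster).
    cold_load_speedup: f64,
    /// Lance search p50 divided by Parquet+HNSW search p50
    /// (greater than 1 means Lance searches slower).
    search_latency_ratio: f64,
}

/// Assumed tolerance for "matches search latency": Lance at most 10% slower.
/// This value is an illustration, not a number taken from the plan.
const SEARCH_MATCH_TOLERANCE: f64 = 1.10;

/// Migrate only if Lance wins cold load by at least 2x AND matches search latency.
fn rule_says_migrate(r: &BenchRatios) -> bool {
    r.cold_load_speedup >= 2.0 && r.search_latency_ratio <= SEARCH_MATCH_TOLERANCE
}
```

With the measured values (`cold_load_speedup: 1.27`, `search_latency_ratio: 2.55`) both conditions fail, which is the strict reading stated above.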

But the written rule missed something. It assumed Lance's value would show up as raw-speed wins across the whole table. The actual benchmark reveals Lance's value is **in capabilities the current stack doesn't have**, not in the metrics we scoped:

1. **Random row access** is 112× faster. Our Parquet design can't do O(1) random access to a row — RAG text retrieval is a full-file scan today. Lance makes this native.
2. **Append** is structurally different. Adding 10K rows is 0.08s on Lance; on our stack it's a full rewrite of the entire 330MB Parquet file plus re-embedding plus re-indexing.
3. **Index build** is 14× faster. The HNSW `ec=80 es=30` production default takes 230s; Lance IVF_PQ takes 16s. Hot-swap generation (Phase 16) is much more feasible at 16s per build.

## The decision

**Hybrid architecture — neither replace nor reject.**

### What stays

- `vectord::store` with Parquet + binary-blob vectors → **primary vector backend**
- `vectord::hnsw::HnswStore` → in-RAM HNSW for search at 100K-scale indexes
- All Phase 15 trial infrastructure → keeps working, unchanged
- Production default `ec=80 es=30` → still the right call for in-RAM use

### What gets added

- **`vectord::lance_store`** — second backend using Lance as the persistence layer
  - Scope: indexes where *any* of the following apply:
    - Corpus exceeds ~5M vectors (our in-RAM ceiling)
    - Workload is append-heavy (incremental ingest from streaming sources)
    - Text retrieval dominates (point lookups by doc_id for RAG)
    - Hot-swap generations are required (Phase 16)
  - Implemented as a standalone crate first (follow the pilot layout), promoted into vectord when the API stabilizes
- **Profile-level configuration** — `ModelProfile.vector_backend: Parquet | Lance` so each profile picks the tier that matches its workload
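
The scope list above reads as an "any of" check. A minimal sketch of how a profile could be mapped to a tier, assuming a hypothetical `Workload` description and a hypothetical `pick_backend` helper; only the `Parquet | Lance` choice and the Parquet default come from this ADR, and the real `ModelProfile` type may look different.

```rust
/// Storage tier a profile selects, mirroring `vector_backend: Parquet | Lance`.
/// Parquet is the default, for backward compatibility.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum VectorBackend {
    #[default]
    Parquet,
    Lance,
}

/// Hypothetical workload description; the field names are illustrative.
struct Workload {
    vector_count: u64,
    append_heavy: bool,
    point_lookup_dominant: bool,
    needs_hot_swap: bool,
}

/// Approximate in-RAM ceiling quoted in this ADR (~5M vectors).
const IN_RAM_CEILING: u64 = 5_000_000;

/// Lance if any scope condition holds, otherwise stay on Parquet+HNSW.
fn pick_backend(w: &Workload) -> VectorBackend {
    let wants_lance = w.vector_count > IN_RAM_CEILING
        || w.append_heavy
        || w.point_lookup_dominant
        || w.needs_hot_swap;
    if wants_lance {
        VectorBackend::Lance
    } else {
        VectorBackend::Parquet
    }
}
```

In practice the field is set by hand per profile; a helper like this would mainly serve as a lint that flags profiles whose workload has outgrown the in-RAM tier.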

### What we keep watching (but don't act on yet)

- ~~**Lance search latency at scale.** 2229us at 100K is worse than HNSW. At 10M we expect Lance to pull ahead because HNSW doesn't fit in RAM. Re-benchmark when we have a 10M-vector corpus to test against.~~ **Done 2026-05-02** — see "Follow-up: 10M re-bench" below.
- **IVF_PQ recall.** We measured latency but not recall — I picked `num_partitions=316, nbits=8, num_sub_vectors=48` blindly. A proper recall sweep is part of Phase C when we integrate Lance into the trial system.
- **Lance's own HNSW-on-disk variant** (`with_ivf_hnsw_pq_params`). Might close the in-RAM latency gap. Left for a future pilot.

---

## Follow-up: 10M re-bench (2026-05-02)

The 10M re-bench flagged above has now run. Numbers are from `data/lance/scale_test_10m` (33 GB, 10M × 768d, IVF_PQ live, with the `doc_id` btree built). Full report: `reports/lance_10m_rebench_2026-05-02.md`.

| Op | 100K (this ADR) | 10M (re-bench) | Notes |
|---|---:|---:|---|
| Search (cold) | 2229μs | ~46ms median | 21× slower at 100× scale → reasonable for IVF_PQ |
| Search (warm) | (not measured) | ~20ms p50 | Stable across 5 trials |
| Doc fetch by id | 311μs | ~5ms | Structural ADR-019 win confirmed once `doc_id` btree is built |
| Index method | lance_ivf_pq | lance_ivf_pq | response tag confirms |

**The HNSW comparison at 10M doesn't exist:** at 10M × 768d × 4 bytes the vectors alone are ~30 GB, doubled for the graph, and HNSW doesn't fit on a single 128 GB box at this scale. So the original "Lance pulls ahead at 10M" framing is true by elimination: Lance is the only contender that operationally exists at 10M. The strategic question is reframed as "Lance vs Parquet+HNSW-with-spilling" and deferred until we have a workload where the Parquet path is the bottleneck (currently tracked in the decisions tracker in `golangLAKEHOUSE/docs/ARCHITECTURE_COMPARISON.md`).
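
The ~30 GB figure is the raw `f32` payload only. A small sketch of that arithmetic, with illustrative helper names; it deliberately leaves out graph and allocator overhead, which this ADR estimates only as "doubled".

```rust
/// Bytes needed to hold `n` vectors of `dim` f32 components, with no index overhead.
fn raw_vector_bytes(n: u64, dim: u64) -> u64 {
    n * dim * std::mem::size_of::<f32>() as u64
}

/// Convert bytes to decimal gigabytes (1 GB = 10^9 bytes).
fn to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

For 10M × 768d this gives 30,720,000,000 bytes, or 30.72 GB; the same call with 100K vectors gives 307.2 MB.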

**Bug surfaced and fixed during re-bench:** Initial doc-fetch was ~100ms (full table scan). Root cause: the `doc_id` scalar btree was never built, because `lance-bench` writes datasets by bypassing `IndexMeta`, and the activation-time auto-build only runs for IndexMeta-registered indexes. Fixed at two layers (commits `5d30b3d` + `044650a`):

- `lance_migrate` HTTP handler auto-builds the btree inline (~1.2s on 10M, +269MB on disk)
- `lance-bench` binary builds the btree post-IVF for parity with the gateway path

**Gauntlet added 2026-05-02:** When audited, the Lance crates had zero tests and no smoke script. They now have 7 unit tests in `crates/vectord-lance`, 12 sanitize tests in `crates/vectord`, a 10-probe `scripts/lance_smoke.sh`, and a sanitized error boundary across all 5 routes. See commits `7bb66f0`, `ac7c996`, `e9d17f7` (sanitizer iterations driven by cross-lineage scrum).

**Status:** ADR-019's hybrid architecture (Parquet+HNSW primary, Lance secondary) is now empirically validated up to 10M. The "watch and re-bench" item is closed.

## Why this isn't moving the goalposts

The EXECUTION_PLAN rule was "migrate or don't migrate." The evidence says neither is correct — one stack can't serve both the staffing SQL workload AND the LLM-brain append-heavy random-access workload at all scales. The honest answer is two backends, each doing what it's good at, selected per-profile.

This matches the dual-use framing in the 2026-04-16 PRD update: different workloads, shared substrate, per-profile specialization. We wrote that principle into the PRD; the benchmark data just made it concrete for the vector tier.

## Follow-up work (updates EXECUTION_PLAN.md)

- **Phase C (decoupled embedding refresh)** gets easier — Lance's native append removes the need to invent a "vectors delta" Parquet layer. When we build Phase C, use Lance as the embedding-layer backend.
- **Phase 16 (hot-swap)** becomes feasible — 16s index builds mean online re-trials are cheap. When we build Phase 16, Lance is the storage for index generations.
- **Phase 17 (model profiles)** gains a new field: `vector_backend: Parquet | Lance`. Default Parquet for backward compatibility. Agents can opt into Lance.

## Costs we accept

- **Second dependency tree.** Lance pulls in DataFusion 52 and Arrow 57, while our main stack runs DataFusion 47 and Arrow 55. Keeping lance-bench isolated works for a pilot; productionizing will need either a workspace-wide upgrade or a firewall via a dedicated `vectord-lance` crate.
- **Second API surface.** Lance's vector-index API is different from our HNSW code. Per-profile abstraction cost is real.
- **Operational complexity.** Two vector storage implementations to debug and monitor.

Worth it because the alternative — forcing every workload through one backend — means either the staffing case or the LLM-brain case is served badly.

## Ceilings this updates in PRD

The PRD "Known ceilings" table had:

> Vector count per index | ~5M vectors on 128GB RAM | 10M+ (serious web crawl) | Phase 18 Lance migration OR mmap'd embeddings

Update to:

> Vector count per index | ~5M vectors on 128GB RAM (Parquet+HNSW in-RAM) | Past 5M | Switch that profile's `vector_backend` to Lance; IVF_PQ keeps working on disk-resident quantized codes