root 76f6fba5de Phase B: Lance pilot — hybrid decision with measured benchmark

Standalone benchmark crate `crates/lance-bench` runs Lance 4.0 against
our Parquet+HNSW at 100K × 768d (resumes_100k_v2) and measures 8 dimensions.

Results (see docs/ADR-019-vector-storage.md for full scorecard):

  Cold load:        Parquet 0.17s   vs Lance 0.13s   (tie — not ≥2× threshold)
  Disk size:        330.3 MB        vs 330.4 MB      (tie)
  Search p50:       873us           vs 2229us        (Parquet 2.55× faster)
  Search p95:       1413us          vs 4998us        (Parquet 3.54× faster)
  Index build:      230s (ec=80)    vs 16s (IVF_PQ)  (Lance 14× faster)
  Random access:    35ms (scan)     vs 311us         (Lance 112× faster)
  Append 10K rows:  full rewrite    vs 0.08s/+31MB   (Lance structural win)

Decision (ADR-019): hybrid, not migrate-or-reject.

- Parquet+HNSW stays primary — our HNSW at ec=80 es=30 recall=1.00 is
  2.55× faster than Lance IVF_PQ at 100K in-RAM scale
- Lance joins as second backend per-profile for workloads where it wins
  architecturally: random row access (RAG text fetch), append-heavy
  pipelines (Phase C), hot-swap generations (Phase 16, 14× faster
  builds), and indexes past the ~5M RAM ceiling
- Phase 17 ModelProfile gets vector_backend: Parquet | Lance field
- Ceiling table in PRD updated — 5M ceiling now says "switch to Lance"
  instead of "migrate" since Lance runs alongside, not instead of

Isolation: lance-bench is a standalone workspace crate with its own dep
tree (Lance pulls DataFusion 52 + Arrow 57, incompatible with the main
stack's DataFusion 47 + Arrow 55). Kept off the critical path until the
API is stable enough to promote into vectord::lance_store.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-16 02:37:11 -05:00


ADR-019: Vector Storage — Parquet+HNSW stays, Lance joins as second tier

Status: Accepted, 2026-04-16
Implements: Phase 18 from PRD (Lance evaluation)
Supersedes: nothing (augments ADR-008)
Owner: J

Context

Phase 18 of the PRD committed to settling "Parquet+sidecar vs Lance" with measurements, not vibes. This ADR records the benchmark outcome and the resulting architectural direction.

Input data: data/vectors/resumes_100k_v2.parquet — 100,000 × 768d embeddings, the same index we tuned HNSW against in Phase 15.

Benchmark harness: crates/lance-bench/src/main.rs — standalone binary, deliberately not integrated into the workspace's common deps to avoid forcing DataFusion/Arrow upgrades on the rest of the stack until we'd decided.

The scorecard

All numbers measured on the same 128GB server, same 100K × 768d index, release build:

Dimension | Parquet + HNSW (current) | Lance 4.0 IVF_PQ (candidate) | Winner
Cold load | 0.17s | 0.13s | Lance, 1.27× (does not clear the 2× decision threshold)
Disk size (data only) | 330.3 MB | 330.4 MB | Tie
Index on-disk footprint | 0 (HNSW is RAM-only) | 7.4 MB | Lance
Index build time | 230s (ec=80 es=30) | 16s | Lance, 14× faster
Search p50 | 873us (recall@10 = 1.00) | 2229us (recall unmeasured, likely 0.85-0.95) | Parquet+HNSW, 2.55× faster
Search p95 | 1413us | 4998us | Parquet+HNSW, 3.54× faster
Speedup vs brute force (p50) | 50.4× | 19.7× | Parquet+HNSW
Random row access (fetch by id) | ~35ms (full-file scan) | 311us | Lance, 112× faster
Append 10K rows | Full-file rewrite (~330MB + re-embed + re-index) | 0.08s, +31MB delta | Lance, structurally different
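
The "Winner" multipliers can be re-derived from the raw measurements. The sketch below is illustrative arithmetic only, not part of lance-bench; the helper name `ratio` is made up here.

```rust
/// How many times larger `slower` is than `faster` (same unit for both).
fn ratio(slower: f64, faster: f64) -> f64 {
    slower / faster
}

fn main() {
    // Latencies in microseconds, build times in seconds, copied from the scorecard.
    let rows = [
        ("Search p50 (HNSW ahead)", ratio(2229.0, 873.0)),
        ("Search p95 (HNSW ahead)", ratio(4998.0, 1413.0)),
        ("Index build (Lance ahead)", ratio(230.0, 16.0)),
        // The 35 ms full-file scan is written as 35_000 us to match the 311 us lookup.
        ("Random row access (Lance ahead)", ratio(35_000.0, 311.0)),
    ];
    for (label, r) in rows {
        println!("{label}: {r:.2}x");
    }
}
```

These come out at roughly 2.55×, 3.54×, 14.4× and 112.5×, in line with the rounded figures in the table.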

Applying the decision rules from EXECUTION_PLAN.md

Original rules:

Lance wins cold-load by ≥2× AND matches search latency → migrate
Within 50% across board → stay Parquet, document ceiling
Lance loses → close the door

Strict reading: cold-load is 1.27×, not ≥2×. Search latency is 2.55× worse, not matching. By the written rule, we stay.
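
The strict reading of the first rule can be written as a predicate. This is a hypothetical encoding: the function name and the `match_tolerance` parameter are assumptions, since EXECUTION_PLAN.md does not define how close "matches search latency" has to be.

```rust
/// Hypothetical encoding of the first written rule: migrate only if Lance's
/// cold load is at least 2x faster AND its search latency matches ours.
/// `match_tolerance` is an assumption; the plan never quantifies "matches".
fn rule_says_migrate(cold_load_speedup: f64, search_latency_ratio: f64, match_tolerance: f64) -> bool {
    let wins_cold_load = cold_load_speedup >= 2.0;
    let matches_search = search_latency_ratio <= 1.0 + match_tolerance;
    wins_cold_load && matches_search
}

fn main() {
    // Scorecard values: Lance cold load is 1.27x faster, Lance p50 search is 2.55x slower.
    let migrate = rule_says_migrate(1.27, 2.55, 0.10);
    println!("migrate: {migrate}");
}
```

With the measured values both conditions fail, so any reasonable tolerance gives the same answer: the written rule does not say migrate.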

But the written rule missed something. It assumed Lance's value would show up as raw-speed wins across the whole table. The actual benchmark reveals Lance's value is in capabilities the current stack doesn't have, not in the metrics we scoped:

Random row access is 112× faster. Our Parquet design can't do O(1) random access to a row — RAG text retrieval is a full-file scan today. Lance makes this native.
Append is structurally different. Adding 10K rows is 0.08s on Lance; on our stack it's a full rewrite of the entire 330MB Parquet file plus re-embedding plus re-indexing.
Index build is 14× faster. The HNSW ec=80 es=30 production default takes 230s; Lance IVF_PQ takes 16s. Hot-swap generation (Phase 16) is much more feasible at 16s per build.
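
A back-of-envelope check on the append numbers, assuming the embeddings are stored as 32-bit floats (the ADR does not state the element type) and that MB means 10^6 bytes:

```rust
/// Raw size in bytes of `rows` vectors with `dims` f32 components each.
fn raw_vector_bytes(rows: u64, dims: u64) -> u64 {
    rows * dims * std::mem::size_of::<f32>() as u64
}

fn main() {
    let appended = raw_vector_bytes(10_000, 768); // the 10K-row append
    let existing = raw_vector_bytes(100_000, 768); // the full 100K index
    println!("appended: {:.2} MB", appended as f64 / 1e6);
    println!("existing: {:.1} MB", existing as f64 / 1e6);
}
```

Under those assumptions the appended vectors are about 30.7 MB raw, close to the measured +31MB delta, while the existing vectors are about 307 MB of the ~330MB file that the Parquet path would rewrite in full.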

The decision

Hybrid architecture — neither replace nor reject.

What stays

vectord::store with Parquet + binary-blob vectors → primary vector backend
vectord::hnsw::HnswStore → in-RAM HNSW for search at 100K-scale indexes
All Phase 15 trial infrastructure → keeps working, unchanged
Production default ec=80 es=30 → still the right call for in-RAM use

What gets added

vectord::lance_store — second backend using Lance as the persistence layer
- Scope: indexes where any of the following apply:
  - Corpus exceeds ~5M vectors (our in-RAM ceiling)
  - Workload is append-heavy (incremental ingest from streaming sources)
  - Text retrieval dominates (point lookups by doc_id for RAG)
  - Hot-swap generations are required (Phase 16)
- Implemented as a standalone crate first (follow the pilot layout), promoted into vectord when the API stabilizes
Profile-level configuration — ModelProfile.vector_backend: Parquet | Lance so each profile picks the tier that matches its workload
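
A minimal sketch of how the profile field and the scope criteria above could fit together. Only the `Parquet | Lance` choice and the Parquet default come from this ADR; `Workload`, its field names, `IN_RAM_CEILING` and `suggest_backend` are hypothetical.

```rust
/// Value a profile's `vector_backend` field would hold; Parquet is the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum VectorBackend {
    #[default]
    Parquet,
    Lance,
}

/// Hypothetical workload description; none of these names come from the codebase.
#[derive(Debug, Default)]
struct Workload {
    vector_count: u64,
    append_heavy: bool,
    point_lookup_dominant: bool,
    needs_hot_swap: bool,
}

/// Approximate in-RAM ceiling quoted in this ADR.
const IN_RAM_CEILING: u64 = 5_000_000;

/// Suggest Lance when any scope criterion applies, otherwise keep the default.
fn suggest_backend(w: &Workload) -> VectorBackend {
    if w.vector_count > IN_RAM_CEILING
        || w.append_heavy
        || w.point_lookup_dominant
        || w.needs_hot_swap
    {
        VectorBackend::Lance
    } else {
        VectorBackend::default()
    }
}

fn main() {
    let staffing = Workload { vector_count: 100_000, ..Default::default() };
    let rag = Workload { vector_count: 100_000, point_lookup_dominant: true, ..Default::default() };
    println!("{:?} {:?}", suggest_backend(&staffing), suggest_backend(&rag));
}
```

In practice the choice stays a per-profile setting made by whoever owns the profile; the helper only spells out the "any of the following" rule.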

What we keep watching (but don't act on yet)

Lance search latency at scale. 2229us at 100K is 2.55× slower than HNSW. At 10M we expect Lance to pull ahead because an HNSW index of that size no longer fits in RAM (our ceiling is ~5M vectors). Re-benchmark when we have a 10M-vector corpus to test against.
IVF_PQ recall. We measured latency but not recall — I picked num_partitions=316, nbits=8, num_sub_vectors=48 blindly. A proper recall sweep is part of Phase C when we integrate Lance into the trial system.
Lance's own HNSW-on-disk variant (with_ivf_hnsw_pq_params). Might close the in-RAM latency gap. Left for a future pilot.
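
The missing IVF_PQ recall figure is the usual recall@k: the share of exact (brute-force) top-k neighbours that the approximate index also returns. A minimal sketch of that metric follows, with a hypothetical function name and made-up ids; it does not call Lance or our HNSW code.

```rust
use std::collections::HashSet;

/// recall@k: fraction of the exact top-k ids that the approximate search also returned.
fn recall_at_k(exact: &[u64], approx: &[u64], k: usize) -> f64 {
    let truth: HashSet<u64> = exact.iter().take(k).copied().collect();
    if truth.is_empty() {
        return 0.0; // nothing to recall; avoids dividing by zero
    }
    let found: HashSet<u64> = approx.iter().take(k).copied().collect();
    truth.intersection(&found).count() as f64 / truth.len() as f64
}

fn main() {
    // Made-up ids for illustration only, not benchmark output.
    let exact: [u64; 5] = [11, 42, 7, 93, 5];
    let approx: [u64; 5] = [42, 11, 8, 93, 60];
    println!("recall@5 = {:.2}", recall_at_k(&exact, &approx, 5));
}
```

A recall sweep would average this over a query set for each parameter combination, so that latency and recall can be compared at the same operating point.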

Why this isn't moving the goalposts

The EXECUTION_PLAN rule was "migrate or don't migrate." The evidence says neither is correct — one stack can't serve both the staffing SQL workload AND the LLM-brain append-heavy random-access workload at all scales. The honest answer is two backends, each doing what it's good at, selected per-profile.

This matches the dual-use framing in the 2026-04-16 PRD update: different workloads, shared substrate, per-profile specialization. We wrote that principle into the PRD; the benchmark data just made it concrete for the vector tier.

Follow-up work (updates EXECUTION_PLAN.md)

Phase C (decoupled embedding refresh) gets easier — Lance's native append removes the need to invent a "vectors delta" Parquet layer. When we build Phase C, use Lance as the embedding-layer backend.
Phase 16 (hot-swap) becomes feasible — 16s index builds mean online re-trials are cheap. When we build Phase 16, Lance is the storage for index generations.
Phase 17 (model profiles) gains a new field: vector_backend: Parquet | Lance. Default Parquet for backward compatibility. Agents can opt into Lance.

Costs we accept

Second dependency tree. Lance pulls in DataFusion 52 and Arrow 57, while our main stack runs DataFusion 47 and Arrow 55. Keeping lance-bench isolated works for a pilot; productionizing will need either a workspace-wide upgrade or a firewall via a dedicated vectord-lance crate.
Second API surface. Lance's vector-index API is different from our HNSW code. Per-profile abstraction cost is real.
Operational complexity. Two vector storage implementations to debug and monitor.

Worth it because the alternative — forcing every workload through one backend — means either the staffing case or the LLM-brain case is served badly.
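
One way to picture the second API surface is a small trait that both backends implement, so callers do not care which tier a profile selected. Everything below is a hypothetical stand-in: `TextLookup`, `ScanStore` and `KeyedStore` are invented for illustration and use in-memory collections, not Parquet or Lance.

```rust
use std::collections::HashMap;

/// Hypothetical shared surface; the real vectord traits may look nothing like this.
trait TextLookup {
    fn backend_name(&self) -> &'static str;
    fn fetch_text(&self, doc_id: u64) -> Option<&str>;
}

/// Stand-in for the Parquet path: a row is found by scanning every entry.
struct ScanStore {
    rows: Vec<(u64, String)>,
}

/// Stand-in for the Lance path: a row is addressed directly by id.
struct KeyedStore {
    rows: HashMap<u64, String>,
}

impl TextLookup for ScanStore {
    fn backend_name(&self) -> &'static str {
        "parquet (scan stand-in)"
    }
    fn fetch_text(&self, doc_id: u64) -> Option<&str> {
        self.rows.iter().find(|(id, _)| *id == doc_id).map(|(_, text)| text.as_str())
    }
}

impl TextLookup for KeyedStore {
    fn backend_name(&self) -> &'static str {
        "lance (keyed stand-in)"
    }
    fn fetch_text(&self, doc_id: u64) -> Option<&str> {
        self.rows.get(&doc_id).map(String::as_str)
    }
}

fn main() {
    let docs = [(1u64, "alpha"), (2, "beta")];
    let scan = ScanStore { rows: docs.iter().map(|(id, t)| (*id, t.to_string())).collect() };
    let keyed = KeyedStore { rows: docs.iter().map(|(id, t)| (*id, t.to_string())).collect() };
    // Callers hold trait objects and never branch on the concrete backend.
    let stores: Vec<Box<dyn TextLookup>> = vec![Box::new(scan), Box::new(keyed)];
    for s in &stores {
        println!("{}: {:?}", s.backend_name(), s.fetch_text(2));
    }
}
```

The cost named above shows up in keeping two implementations behaviourally equivalent behind such a trait, which is where most of the debugging and monitoring effort would go.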

Ceilings this updates in PRD

The PRD "Known ceilings" table had:

Vector count per index | ~5M vectors on 128GB RAM | 10M+ (serious web crawl) | Phase 18 Lance migration OR mmap'd embeddings

Update to:

Vector count per index | ~5M vectors on 128GB RAM (Parquet+HNSW in-RAM) | Past 5M | Switch that profile's vector_backend to Lance; IVF_PQ keeps working on disk-resident quantized codes
