lakehouse/docs/ADR-019-vector-storage.md
root 76f6fba5de Phase B: Lance pilot — hybrid decision with measured benchmark
Standalone benchmark crate `crates/lance-bench` running Lance 4.0 against
our Parquet+HNSW at 100K × 768d (resumes_100k_v2) measured 8 dimensions.

Results (see docs/ADR-019-vector-storage.md for full scorecard):

  Cold load:        Parquet 0.17s   vs Lance 0.13s   (tie — not ≥2× threshold)
  Disk size:        330.3 MB        vs 330.4 MB      (tie)
  Search p50:       873us           vs 2229us        (Parquet 2.55× faster)
  Search p95:       1413us          vs 4998us        (Parquet 3.54× faster)
  Index build:      230s (ec=80)    vs 16s (IVF_PQ)  (Lance 14× faster)
  Random access:    35ms (scan)     vs 311us         (Lance 112× faster)
  Append 10K rows:  full rewrite    vs 0.08s/+31MB   (Lance structural win)

Decision (ADR-019): hybrid, not migrate-or-reject.

- Parquet+HNSW stays primary — our HNSW at ec=80 es=30 recall=1.00 is
  2.55× faster than Lance IVF_PQ at 100K in-RAM scale
- Lance joins as second backend per-profile for workloads where it wins
  architecturally: random row access (RAG text fetch), append-heavy
  pipelines (Phase C), hot-swap generations (Phase 16, 14× faster
  builds), and indexes past the ~5M RAM ceiling
- Phase 17 ModelProfile gets vector_backend: Parquet | Lance field
- Ceiling table in PRD updated — 5M ceiling now says "switch to Lance"
  instead of "migrate" since Lance runs alongside, not instead of

Isolation: lance-bench is a standalone workspace crate with its own dep
tree (Lance pulls DataFusion 52 + Arrow 57 incompatible with main stack
DataFusion 47 + Arrow 55). Kept off the critical path until API is
stable enough to promote into vectord::lance_store.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:37:11 -05:00


ADR-019: Vector Storage — Parquet+HNSW stays, Lance joins as second tier

Status: Accepted (2026-04-16)
Implements: Phase 18 from PRD (Lance evaluation)
Supersedes: nothing (augments ADR-008)
Owner: J


Context

Phase 18 of the PRD committed to settling "Parquet+sidecar vs Lance" with measurements, not vibes. This ADR records the benchmark outcome and the resulting architectural direction.

Input data: data/vectors/resumes_100k_v2.parquet — 100,000 × 768d embeddings, the same index we tuned HNSW against in Phase 15.

Benchmark harness: crates/lance-bench/src/main.rs — standalone binary, deliberately not integrated into the workspace's common deps to avoid forcing DataFusion/Arrow upgrades on the rest of the stack until we'd decided.

The scorecard

All numbers measured on the same 128GB server, same 100K × 768d index, release build:

| Dimension | Parquet + HNSW (current) | Lance 4.0 IVF_PQ (candidate) | Winner |
|---|---|---|---|
| Cold load | 0.17s | 0.13s | Lance, 1.27× (does not clear the 2× decision threshold) |
| Disk size (data only) | 330.3 MB | 330.4 MB | Tie |
| Index on-disk footprint | 0 (HNSW is RAM-only) | 7.4 MB | Lance |
| Index build time | 230s (ec=80 es=30) | 16s | Lance, 14× faster |
| Search p50 | 873us (recall@10 = 1.00) | 2229us (recall unmeasured, likely 0.85-0.95) | Parquet+HNSW, 2.55× faster |
| Search p95 | 1413us | 4998us | Parquet+HNSW, 3.54× faster |
| Speedup vs brute force (p50) | 50.4× | 19.7× | Parquet+HNSW |
| Random row access (fetch by id) | ~35ms (full-file scan) | 311us | Lance, 112× faster |
| Append 10K rows | Full-file rewrite (~330MB + re-embed + re-index) | 0.08s, +31MB delta | Lance, structurally different |

Applying the decision rules from EXECUTION_PLAN.md

Original rules:

  • Lance wins cold-load by ≥2× AND matches search latency → migrate
  • Within 50% across board → stay Parquet, document ceiling
  • Lance loses → close the door

Strict reading: cold-load is 1.27×, not ≥2×. Search latency is 2.55× worse, not matching. By the written rule, we stay.
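The strict reading above can be sketched as a small predicate. This is only an illustration: the names `MigrateRule`, `Timings`, and `should_migrate` are hypothetical, and the 10% tolerance used for "matches search latency" in the usage note is an assumption, since the written rule gives no number for it.

```rust
/// Thresholds for the written "migrate" rule.
/// The 2x cold-load factor comes from the rule itself; `search_tolerance`
/// is an ASSUMED interpretation of "matches search latency".
pub struct MigrateRule {
    pub min_cold_load_speedup: f64,
    pub search_tolerance: f64,
}

/// Measured timings for one backend.
#[derive(Clone, Copy)]
pub struct Timings {
    pub cold_load_s: f64,
    pub search_p50_us: f64,
}

/// Returns true only when the candidate clears BOTH halves of the rule:
/// cold load is faster by the required factor, and search p50 is no worse
/// than the current backend by more than the tolerance.
pub fn should_migrate(rule: &MigrateRule, current: Timings, candidate: Timings) -> bool {
    let cold_speedup = current.cold_load_s / candidate.cold_load_s;
    let search_ratio = candidate.search_p50_us / current.search_p50_us;
    cold_speedup >= rule.min_cold_load_speedup
        && search_ratio <= 1.0 + rule.search_tolerance
}
```

With the rounded scorecard timings (Parquet 0.17s / 873us, Lance 0.13s / 2229us) and a rule of `MigrateRule { min_cold_load_speedup: 2.0, search_tolerance: 0.10 }`, `should_migrate` returns `false`: the cold-load speedup is roughly 1.3× and the search ratio is about 2.55, so both halves fail.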

But the written rule missed something. It assumed Lance's value would show up as raw-speed wins across the whole table. The actual benchmark reveals Lance's value is in capabilities the current stack doesn't have, not in the metrics we scoped:

  1. Random row access is 112× faster. Our Parquet design can't do O(1) random access to a row — RAG text retrieval is a full-file scan today. Lance makes this native.
  2. Append is structurally different. Adding 10K rows is 0.08s on Lance; on our stack it's a full rewrite of the entire 330MB Parquet file plus re-embedding plus re-indexing.
  3. Index build is 14× faster. The HNSW ec=80 es=30 production default takes 230s; Lance IVF_PQ takes 16s. Hot-swap generation (Phase 16) is much more feasible at 16s per build.

The decision

Hybrid architecture — neither replace nor reject.

What stays

  • vectord::store with Parquet + binary-blob vectors → primary vector backend
  • vectord::hnsw::HnswStore → in-RAM HNSW for search at 100K-scale indexes
  • All Phase 15 trial infrastructure → keeps working, unchanged
  • Production default ec=80 es=30 → still the right call for in-RAM use

What gets added

  • vectord::lance_store — second backend using Lance as the persistence layer
    • Scope: indexes where any of the following apply:
      • Corpus exceeds ~5M vectors (our in-RAM ceiling)
      • Workload is append-heavy (incremental ingest from streaming sources)
      • Text retrieval dominates (point lookups by doc_id for RAG)
      • Hot-swap generations are required (Phase 16)
    • Implemented as a standalone crate first (follow the pilot layout), promoted into vectord when the API stabilizes
  • Profile-level configuration: ModelProfile.vector_backend: Parquet | Lance, so each profile picks the tier that matches its workload
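The scope criteria listed above could be expressed as a simple recommendation helper. This is a minimal sketch, not the vectord implementation: `Workload`, `recommend_backend`, and `IN_RAM_CEILING` are hypothetical names, and treating the "~5M" ceiling as exactly 5,000,000 is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VectorBackend {
    Parquet,
    Lance,
}

/// Hypothetical description of a profile's workload, one field per
/// scope criterion in the list above.
#[derive(Debug, Clone, Copy)]
pub struct Workload {
    pub vector_count: u64,
    pub append_heavy: bool,
    pub text_retrieval_dominant: bool,
    pub needs_hot_swap: bool,
}

/// ASSUMPTION: the "~5M" in-RAM ceiling treated as a hard threshold.
pub const IN_RAM_CEILING: u64 = 5_000_000;

/// Recommends Lance when ANY criterion applies; otherwise Parquet+HNSW.
pub fn recommend_backend(w: &Workload) -> VectorBackend {
    let past_ceiling = w.vector_count > IN_RAM_CEILING;
    if past_ceiling || w.append_heavy || w.text_retrieval_dominant || w.needs_hot_swap {
        VectorBackend::Lance
    } else {
        VectorBackend::Parquet
    }
}
```

A 100K-vector, read-mostly index with no hot-swap requirement stays on `VectorBackend::Parquet`; setting any single flag, or growing past the ceiling, flips the recommendation to `VectorBackend::Lance`.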

What we keep watching (but don't act on yet)

  • Lance search latency at scale. 2229us p50 at 100K is 2.55× slower than HNSW. At 10M, past the ~5M in-RAM ceiling, we expect Lance to pull ahead because the HNSW index no longer fits in RAM. Re-benchmark when we have a 10M-vector corpus to test against.
  • IVF_PQ recall. We measured latency but not recall — I picked num_partitions=316, nbits=8, num_sub_vectors=48 blindly. A proper recall sweep is part of Phase C when we integrate Lance into the trial system.
  • Lance's own HNSW-on-disk variant (with_ivf_hnsw_pq_params). Might close the in-RAM latency gap. Left for a future pilot.
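For a sense of what the IVF_PQ parameters above imply, here is a back-of-envelope size estimate using generic product-quantization arithmetic. It is a sketch under stated assumptions: f32 centroids, one full-dimension codebook of 2^nbits entries, and no metadata or row ids. It does not describe Lance's actual on-disk layout, and the type and function names are hypothetical.

```rust
/// The three IVF_PQ knobs named in the text.
pub struct IvfPqParams {
    pub num_partitions: u64,
    pub nbits: u32,
    pub num_sub_vectors: u64,
}

/// Bytes of PQ code per vector: one `nbits`-wide code per sub-vector,
/// rounded up to whole bytes.
pub fn pq_code_bytes_per_vector(p: &IvfPqParams) -> u64 {
    (p.num_sub_vectors * p.nbits as u64 + 7) / 8
}

/// Rough index size in bytes: PQ codes + IVF centroids + PQ codebook.
/// ASSUMES f32 storage and ignores row ids, headers, and other metadata.
pub fn estimated_index_bytes(p: &IvfPqParams, rows: u64, dim: u64) -> u64 {
    let codes = rows * pq_code_bytes_per_vector(p);
    // One full-dimension f32 centroid per IVF partition.
    let coarse_centroids = p.num_partitions * dim * 4;
    // 2^nbits codewords per sub-space; sub-spaces together span `dim` floats.
    let codebook = (1u64 << p.nbits) * dim * 4;
    codes + coarse_centroids + codebook
}
```

For num_partitions=316, nbits=8, num_sub_vectors=48 at 100,000 × 768d, this gives 48 bytes of code per vector (versus 3,072 bytes of raw f32, a 64× reduction) and an estimate of about 6.6 MB, the same order as the measured 7.4 MB index footprint. The heavy compression is also why recall needs to be measured rather than assumed.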

Why this isn't moving the goalposts

The EXECUTION_PLAN rule was "migrate or don't migrate." The evidence says neither is correct — one stack can't serve both the staffing SQL workload AND the LLM-brain append-heavy random-access workload at all scales. The honest answer is two backends, each doing what it's good at, selected per-profile.

This matches the dual-use framing in the 2026-04-16 PRD update: different workloads, shared substrate, per-profile specialization. We wrote that principle into the PRD; the benchmark data just made it concrete for the vector tier.

Follow-up work (updates EXECUTION_PLAN.md)

  • Phase C (decoupled embedding refresh) gets easier — Lance's native append removes the need to invent a "vectors delta" Parquet layer. When we build Phase C, use Lance as the embedding-layer backend.
  • Phase 16 (hot-swap) becomes feasible — 16s index builds mean online re-trials are cheap. When we build Phase 16, Lance is the storage for index generations.
  • Phase 17 (model profiles) gains a new field: vector_backend: Parquet | Lance. Default Parquet for backward compatibility. Agents can opt into Lance.
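The new profile field could look roughly like the following. This is a sketch only: the real `ModelProfile` has fields this ADR does not list, so the `name` field here is a placeholder, and the string parsing is an assumed convenience rather than a documented config format.

```rust
use std::str::FromStr;

/// Which vector tier a profile uses. Parquet is the default, as stated
/// above, so existing profiles keep their current behavior.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VectorBackend {
    #[default]
    Parquet,
    Lance,
}

impl FromStr for VectorBackend {
    type Err = String;

    /// Case-insensitive parse of "parquet" or "lance" (assumed spelling).
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "parquet" => Ok(VectorBackend::Parquet),
            "lance" => Ok(VectorBackend::Lance),
            other => Err(format!("unknown vector_backend: {other}")),
        }
    }
}

/// Placeholder profile: only `vector_backend` comes from this ADR.
#[derive(Debug, Clone, Default)]
pub struct ModelProfile {
    pub name: String,
    pub vector_backend: VectorBackend,
}
```

A profile built with `ModelProfile::default()` reports `VectorBackend::Parquet`; an agent opting in would set `vector_backend: "lance".parse()?` or assign `VectorBackend::Lance` directly.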

Costs we accept

  • Second dependency tree. Lance pulls in DataFusion 52 and Arrow 57, while our main stack runs DataFusion 47 and Arrow 55. Keeping lance-bench isolated works for a pilot; productionizing will need either workspace-wide upgrade or a firewall via a dedicated vectord-lance crate.
  • Second API surface. Lance's vector-index API is different from our HNSW code. Per-profile abstraction cost is real.
  • Operational complexity. Two vector storage implementations to debug and monitor.

Worth it because the alternative — forcing every workload through one backend — means either the staffing case or the LLM-brain case is served badly.

Ceilings this updates in PRD

The PRD "Known ceilings" table had:

Vector count per index | ~5M vectors on 128GB RAM | 10M+ (serious web crawl) | Phase 18 Lance migration OR mmap'd embeddings

Update to:

Vector count per index | ~5M vectors on 128GB RAM (Parquet+HNSW in-RAM) | Past 5M | Switch that profile's vector_backend to Lance; IVF_PQ keeps working on disk-resident quantized codes