# ADR-019: Vector Storage — Parquet+HNSW stays, Lance joins as second tier
**Status:** Accepted — 2026-04-16
**Implements:** Phase 18 from PRD (Lance evaluation)
**Supersedes:** nothing (augments ADR-008)
**Owner:** J
---
## Context
Phase 18 of the PRD committed to settling "Parquet+sidecar vs Lance" with measurements, not vibes. This ADR records the benchmark outcome and the resulting architectural direction.
Input data: `data/vectors/resumes_100k_v2.parquet` — 100,000 × 768d embeddings, the same index we tuned HNSW against in Phase 15.
Benchmark harness: `crates/lance-bench/src/main.rs` — standalone binary, deliberately not integrated into the workspace's common deps to avoid forcing DataFusion/Arrow upgrades on the rest of the stack until we'd decided.
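A minimal sketch of how that isolation can be expressed in Cargo, assuming the standard workspace `exclude` mechanism; the member names and layout below are illustrative, not copied from the repo's actual manifest.

```toml
# Sketch only: keep the bench crate out of the shared workspace dependency graph
# so its newer DataFusion/Arrow pins don't force upgrades on the main stack.
# Member names are illustrative; the real workspace manifest is not shown here.
[workspace]
members = ["crates/vectord", "crates/aibridge"]
# Excluded packages build standalone, with their own Cargo.lock and dep versions.
exclude = ["crates/lance-bench"]
```

Under a layout like this, `cargo build` at the workspace root never touches the bench crate; it gets built explicitly from its own directory when the benchmark runs.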
## The scorecard
All numbers measured on the same 128GB server, same 100K × 768d index, release build:
| Dimension | Parquet + HNSW (current) | Lance 4.0 IVF_PQ (candidate) | Winner |
|---|---|---|---|
| Cold load | 0.17s | 0.13s | Lance, 1.27× (*does not clear 2× decision threshold*) |
| Disk size (data only) | 330.3 MB | 330.4 MB | Tie |
| Index on-disk footprint | 0 (HNSW is RAM-only) | 7.4 MB | Lance |
| Index build time | 230s (ec=80 es=30) | 16s | **Lance, 14× faster** |
| Search p50 | 873μs (recall@10 = 1.00) | 2229μs (recall unmeasured, likely 0.85-0.95) | **Parquet+HNSW, 2.55× faster** |
| Search p95 | 1413μs | 4998μs | **Parquet+HNSW, 3.54× faster** |
| Speedup vs brute force (p50) | 50.4× | 19.7× | Parquet+HNSW |
| Random row access (fetch by id) | ~35ms (full-file scan) | 311μs | **Lance, 112× faster** |
| Append 10K rows | Full-file rewrite (~330MB + re-embed + re-index) | 0.08s, +31MB delta | **Lance, structurally different** |
## Applying the decision rules from EXECUTION_PLAN.md
Original rules:
- *Lance wins cold-load by ≥2× AND matches search latency → migrate*
- *Within 50% across board → stay Parquet, document ceiling*
- *Lance loses → close the door*
Strict reading: cold-load is **1.27×, not ≥2×**. Search latency is **2.55× worse, not matching**. By the written rule, we stay.
But the written rule missed something. It assumed Lance's value would show up as raw-speed wins across the whole table. The actual benchmark reveals Lance's value is **in capabilities the current stack doesn't have**, not in the metrics we scoped:
1. **Random row access** is 112× faster. Our Parquet design can't do O(1) random access to a row — RAG text retrieval is a full-file scan today. Lance makes this native.
2. **Append** is structurally different. Adding 10K rows is 0.08s on Lance; on our stack it's a full rewrite of the entire 330MB Parquet file plus re-embedding plus re-indexing.
3. **Index build** is 14× faster. The HNSW `ec=80 es=30` production default takes 230s; Lance IVF_PQ takes 16s. Hot-swap generation (Phase 16) is much more feasible at 16s per build.
## The decision
**Hybrid architecture — neither replace nor reject.**
### What stays
- `vectord::store` with Parquet + binary-blob vectors → **primary vector backend**
- `vectord::hnsw::HnswStore` → in-RAM HNSW for search at 100K-scale indexes
- All Phase 15 trial infrastructure → keeps working, unchanged
- Production default `ec=80 es=30` → still the right call for in-RAM use
### What gets added
- **`vectord::lance_store`** — second backend using Lance as the persistence layer
- Scope: indexes where *any* of the following apply:
- Corpus exceeds ~5M vectors (our in-RAM ceiling)
- Workload is append-heavy (incremental ingest from streaming sources)
- Text retrieval dominates (point lookups by doc_id for RAG)
- Hot-swap generations are required (Phase 16)
- Implemented as a standalone crate first (follow the pilot layout), promoted into vectord when the API stabilizes
- **Profile-level configuration** — `ModelProfile.vector_backend: Parquet | Lance` so each profile picks the tier that matches its workload
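A minimal sketch of how the per-profile selection might be wired, assuming a `ModelProfile` struct and a shared store trait; the names, fields, and cost comments are illustrative (taken from the scorecard above), not the shipped vectord API.

```rust
// Sketch only: per-profile backend selection. Names are stand-ins, not the real API.

/// Which persistence tier a profile's indexes live on.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum VectorBackend {
    /// Parquet file + in-RAM HNSW: best search latency up to the ~5M in-RAM ceiling.
    #[default]
    Parquet,
    /// Lance dataset + IVF_PQ: point lookups, cheap append, disk-resident index.
    Lance,
}

#[derive(Debug, Default)]
pub struct ModelProfile {
    pub name: String,
    pub vector_backend: VectorBackend,
}

/// Operations both tiers must support; the comments mirror the measured asymmetry.
pub trait VectorStore {
    /// Parquet: full-file scan today (~35ms at 100K). Lance: point lookup (~311µs).
    fn fetch_by_doc_id(&self, doc_id: u64) -> Option<Vec<f32>>;
    /// Parquet: full rewrite + re-embed + re-index. Lance: appends a delta fragment.
    fn append(&mut self, rows: Vec<(u64, Vec<f32>)>);
}

fn main() {
    let profile = ModelProfile {
        name: "llm-brain-crawl".into(),
        vector_backend: VectorBackend::Lance, // append-heavy, past the RAM ceiling
    };
    println!("{} -> {:?}", profile.name, profile.vector_backend);
}
```

Defaulting the enum to `Parquet` matches the backward-compatibility note under Phase 17 below: existing profiles keep their current behavior and only opt into Lance explicitly.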
### What we keep watching (but don't act on yet)
- ~~**Lance search latency at scale.** 2229us at 100K is worse than HNSW. At 10M we expect Lance to pull ahead because HNSW doesn't fit in RAM. Re-benchmark when we have a 10M-vector corpus to test against.~~ **Done 2026-05-02** — see "Follow-up: 10M re-bench" below.
- **IVF_PQ recall.** We measured latency but not recall — I picked `num_partitions=316, nbits=8, num_sub_vectors=48` blindly. A proper recall sweep is part of Phase C when we integrate Lance into the trial system; the recall@k computation such a sweep would report is sketched after this list.
- **Lance's own HNSW-on-disk variant** (`with_ivf_hnsw_pq_params`). Might close the in-RAM latency gap. Left for a future pilot.
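For the recall item above, this is the quantity a sweep would report: the fraction of the exact (brute-force) top-k neighbor ids that the approximate index returns. A self-contained sketch, with toy ids standing in for whatever the ground-truth pass yields:

```rust
// Sketch of the recall@k measurement a Phase C sweep would report for IVF_PQ.
use std::collections::HashSet;

fn recall_at_k(exact_top_k: &[u64], approx_top_k: &[u64], k: usize) -> f64 {
    // Ground-truth set: the exact nearest-neighbor ids from a brute-force scan.
    let truth: HashSet<u64> = exact_top_k.iter().take(k).copied().collect();
    // Count how many of the approximate results land in that set.
    let hits = approx_top_k
        .iter()
        .take(k)
        .filter(|id| truth.contains(id))
        .count();
    hits as f64 / k as f64
}

fn main() {
    // Toy query: the approximate index recovers 9 of the 10 true neighbors.
    let exact: Vec<u64> = (0..10).collect();
    let approx: Vec<u64> = vec![0, 1, 2, 3, 4, 5, 6, 7, 8, 42];
    assert_eq!(recall_at_k(&exact, &approx, 10), 0.9);
    println!("recall@10 = {}", recall_at_k(&exact, &approx, 10));
}
```

A sweep would run this over a held-out query set for each `num_partitions` / `nbits` / `num_sub_vectors` combination and report the mean alongside the latency numbers already in the scorecard.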
---
## Follow-up: 10M re-bench (2026-05-02)
The re-bench flagged in the watch list above has now run. Numbers are from `data/lance/scale_test_10m` (33 GB, 10M × 768d, IVF_PQ live, post-doc_id-btree-build). Full report: `reports/lance_10m_rebench_2026-05-02.md`.
| Op | 100K (this ADR) | 10M (re-bench) | Notes |
|---|---:|---:|---|
| Search (cold) | 2229μs | ~46ms median | 21× slower at 100× scale → reasonable for IVF_PQ |
| Search (warm) | (not measured) | ~20ms p50 | Stable across 5 trials |
| Doc fetch by id | 311μs | ~5ms | Structural ADR-019 win confirmed once `doc_id` btree is built |
| Index method | lance_ivf_pq | lance_ivf_pq | response tag confirms |
**The HNSW comparison at 10M doesn't exist** — 10M × 768d × 4 bytes is ~30 GB for the vectors alone, doubled for the graph, so HNSW doesn't fit on a single 128 GB box at this scale. The original "Lance pulls ahead at 10M" framing is therefore true by elimination: Lance is the only contender that operationally exists at 10M. The strategic question is reframed as "Lance vs Parquet+HNSW-with-spilling," deferred until we have a workload where the Parquet path is the bottleneck (currently tracked in the `golangLAKEHOUSE/docs/ARCHITECTURE_COMPARISON.md` decisions tracker).
**Bug surfaced and fixed during re-bench:** Initial doc-fetch was ~100ms (full table scan). Root cause: the `doc_id` scalar btree was never built — `lance-bench` writes datasets by bypassing `IndexMeta`, and the activation-time auto-build only runs for IndexMeta-registered indexes. Fixed at two layers (commits `5d30b3d` + `044650a`):
- `lance_migrate` HTTP handler auto-builds the btree inline (~1.2s on 10M, +269MB on disk)
- `lance-bench` binary builds the btree post-IVF for parity with the gateway path
**Gauntlet added 2026-05-02:** the lance crates had zero tests and no smoke coverage when audited. They now have 7 unit tests in `crates/vectord-lance`, 12 sanitize tests in `crates/vectord`, the 10-probe `scripts/lance_smoke.sh`, and a sanitized error boundary across all 5 routes. See commits `7bb66f0`, `ac7c996`, `e9d17f7` (sanitizer iterations driven by cross-lineage scrum).
**Status:** ADR-019's hybrid architecture (Parquet+HNSW primary, Lance secondary) is now empirically validated up to 10M. The "watch and re-bench" item is closed.
## Why this isn't moving the goalposts
The EXECUTION_PLAN rule was "migrate or don't migrate." The evidence says neither is correct — one stack can't serve both the staffing SQL workload AND the LLM-brain append-heavy random-access workload at all scales. The honest answer is two backends, each doing what it's good at, selected per-profile.
This matches the dual-use framing in the 2026-04-16 PRD update: different workloads, shared substrate, per-profile specialization. We wrote that principle into the PRD; the benchmark data just made it concrete for the vector tier.
## Follow-up work (updates EXECUTION_PLAN.md)
- **Phase C (decoupled embedding refresh)** gets easier — Lance's native append removes the need to invent a "vectors delta" Parquet layer. When we build Phase C, use Lance as the embedding-layer backend.
- **Phase 16 (hot-swap)** becomes feasible — 16s index builds mean online re-trials are cheap. When we build Phase 16, Lance is the storage for index generations.
- **Phase 17 (model profiles)** gains a new field: `vector_backend: Parquet | Lance`. Default Parquet for backward compatibility. Agents can opt into Lance.
## Costs we accept
- **Second dependency tree.** Lance pulls in DataFusion 52 and Arrow 57, while our main stack runs DataFusion 47 and Arrow 55. Keeping lance-bench isolated works for a pilot; productionizing will need either workspace-wide upgrade or a firewall via a dedicated `vectord-lance` crate.
- **Second API surface.** Lance's vector-index API is different from our HNSW code. Per-profile abstraction cost is real.
- **Operational complexity.** Two vector storage implementations to debug and monitor.
Worth it because the alternative — forcing every workload through one backend — means either the staffing case or the LLM-brain case is served badly.
## Ceilings this updates in PRD
The PRD "Known ceilings" table had:
> Vector count per index | ~5M vectors on 128GB RAM | 10M+ (serious web crawl) | Phase 18 Lance migration OR mmap'd embeddings
Update to:
> Vector count per index | ~5M vectors on 128GB RAM (Parquet+HNSW in-RAM) | Past 5M | Switch that profile's `vector_backend` to Lance; IVF_PQ keeps working on disk-resident quantized codes