Architectural snapshot of the lakehouse codebase at the point where the
full matrix-driven agent loop with Mem0 versioning + deletion was
validated end-to-end.
WHAT THIS REPO IS
A clean single-commit snapshot of the lakehouse code. Heavy test data
(.parquet datasets, vector indexes) excluded — see REPLICATION.md for
regen path. Full lakehouse history at git.agentview.dev/profit/lakehouse.
WHAT WAS PROVEN
- Vector retrieval across the multi-corpus matrix (chicago_permits + entity
briefs + sec_tickers + distilled procedural + llm_team runs)
- Observer hand-review (cloud + heuristic fallback) gating each candidate
- Local-model agent loop (qwen3.5:latest) with tool use + scratchpad
- Playbook seal on success → next-iter retrieval surfaces it as preamble
- Mem0 versioning + deletion in pathway_memory:
* UPSERT: ADD on new workflow, UPDATE bumps replay_count on identical
* REVISE: chains versions, parent.superseded_at + superseded_by stamped
* RETIRE: marks specific trace retired with reason, excluded from retrieval
* HISTORY: walks chain root→tip, cycle-safe
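
A minimal sketch of how those four operations can hang together, assuming a
hypothetical trace shape; the real structs and functions live in
crates/vectord/src/pathway_memory.rs and will differ in detail:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical shape of a versioned pathway trace; field names are
// illustrative, not the actual pathway_memory.rs definitions.
struct PathwayTrace {
    id: String,
    workflow_key: String,           // dedup key used by UPSERT
    version: u32,                   // bumped by REVISE
    replay_count: u32,              // bumped by UPSERT on an identical workflow
    parent_id: Option<String>,      // REVISE chains child -> parent
    superseded_by: Option<String>,  // stamped on the parent by REVISE
    superseded_at: Option<u64>,     // unix seconds, stamped with superseded_by
    retired: Option<(u64, String)>, // RETIRE: (timestamp, reason)
}

impl PathwayTrace {
    // Retrieval skips anything retired or superseded.
    fn retrievable(&self) -> bool {
        self.retired.is_none() && self.superseded_by.is_none()
    }
}

// HISTORY: walk the chain from a root trace toward its tip, refusing to loop.
fn history<'a>(
    by_id: &'a HashMap<String, PathwayTrace>,
    root: &str,
) -> Vec<&'a PathwayTrace> {
    let mut seen = HashSet::new();
    let mut chain = Vec::new();
    let mut cur = Some(root.to_string());
    while let Some(id) = cur {
        if !seen.insert(id.clone()) {
            break; // cycle guard
        }
        match by_id.get(&id) {
            Some(t) => {
                cur = t.superseded_by.clone();
                chain.push(t);
            }
            None => break,
        }
    }
    chain
}
```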
KEY DIRECTORIES
- crates/vectord/src/pathway_memory.rs — Mem0 ops live here
- crates/vectord/src/playbook_memory.rs — original Mem0 reference
- tests/agent_test/ — local-model agent harness + PRD + session archives
- scripts/dump_raw_corpus.sh — MinIO bucket dump (raw test corpus)
- scripts/vectorize_raw_corpus.ts — corpus → vector indexes
- scripts/analyze_chicago_contracts.ts — real inference pipeline
- scripts/seal_agent_playbook.ts — Mem0 upsert from agent traces
Replication: see REPLICATION.md for Debian 13 clean install + cloud-only
adaptation (no local Ollama).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-019: Vector Storage — Parquet+HNSW stays, Lance joins as second tier
Status: Accepted — 2026-04-16
Implements: Phase 18 from PRD (Lance evaluation)
Supersedes: nothing (augments ADR-008)
Owner: J
Context
Phase 18 of the PRD committed to settling "Parquet+sidecar vs Lance" with measurements, not vibes. This ADR records the benchmark outcome and the resulting architectural direction.
Input data: data/vectors/resumes_100k_v2.parquet — 100,000 × 768d embeddings, the same index we tuned HNSW against in Phase 15.
Benchmark harness: crates/lance-bench/src/main.rs — standalone binary, deliberately not integrated into the workspace's common deps to avoid forcing DataFusion/Arrow upgrades on the rest of the stack until we'd decided.
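
For orientation, a sketch of the timing shape such a harness typically takes, with the backend behind a trait so the Parquet+HNSW and Lance paths share the same measurement loop. This is an illustration of the approach only, not the contents of crates/lance-bench/src/main.rs:

```rust
use std::time::{Duration, Instant};

// Minimal backend abstraction so one harness can time either tier.
trait VectorSearch {
    fn search(&self, query: &[f32], k: usize) -> Vec<(u64, f32)>;
}

// Time each query individually and report p50/p95 latencies.
fn latency_percentiles<B: VectorSearch>(
    backend: &B,
    queries: &[Vec<f32>],
    k: usize,
) -> (Duration, Duration) {
    let mut samples: Vec<Duration> = queries
        .iter()
        .map(|q| {
            let t0 = Instant::now();
            let _ = backend.search(q, k);
            t0.elapsed()
        })
        .collect();
    samples.sort();
    let p50 = samples[samples.len() / 2];
    let p95 = samples[(samples.len() * 95) / 100];
    (p50, p95)
}
```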
The scorecard
All numbers measured on the same 128GB server, same 100K × 768d index, release build:
| Dimension | Parquet + HNSW (current) | Lance 4.0 IVF_PQ (candidate) | Winner |
|---|---|---|---|
| Cold load | 0.17s | 0.13s | Lance, 1.27× — does not clear 2× decision threshold |
| Disk size (data only) | 330.3 MB | 330.4 MB | Tie |
| Index on-disk footprint | 0 (HNSW is RAM-only) | 7.4 MB | Lance |
| Index build time | 230s (ec=80 es=30) | 16s | Lance, 14× faster |
| Search p50 | 873us (recall@10 = 1.00) | 2229us (recall unmeasured, likely 0.85-0.95) | Parquet+HNSW, 2.55× faster |
| Search p95 | 1413us | 4998us | Parquet+HNSW, 3.54× faster |
| Speedup vs brute force (p50) | 50.4× | 19.7× | Parquet+HNSW |
| Random row access (fetch by id) | ~35ms (full-file scan) | 311us | Lance, 112× faster |
| Append 10K rows | Full-file rewrite (~330MB + re-embed + re-index) | 0.08s, +31MB delta | Lance, structurally different |
Applying the decision rules from EXECUTION_PLAN.md
Original rules:
- Lance wins cold-load by ≥2× AND matches search latency → migrate
- Within 50% across the board → stay Parquet, document ceiling
- Lance loses → close the door
Strict reading: cold-load is 1.27×, not ≥2×. Search latency is 2.55× worse, not matching. By the written rule, we stay.
But the written rule missed something. It assumed Lance's value would show up as raw-speed wins across the whole table. The actual benchmark reveals Lance's value is in capabilities the current stack doesn't have, not in the metrics we scoped:
- Random row access is 112× faster. Our Parquet design can't do O(1) random access to a row — RAG text retrieval is a full-file scan today. Lance makes this native.
- Append is structurally different. Adding 10K rows is 0.08s on Lance; on our stack it's a full rewrite of the entire 330MB Parquet file plus re-embedding plus re-indexing.
- Index build is 14× faster. The HNSW `ec=80 es=30` production default takes 230s; Lance IVF_PQ takes 16s. Hot-swap generation (Phase 16) is much more feasible at 16s per build.
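
To make the capability gap concrete, here is a sketch of the two operations where the tiers diverge, expressed as a hypothetical trait rather than any existing vectord API:

```rust
// Hypothetical backend trait capturing the operations that motivated the hybrid decision.
trait VectorStore {
    // Point lookup by document id.
    // Parquet tier today: scans the whole ~330MB file to find one row (~35ms).
    // Lance tier: keyed fetch against the on-disk dataset (311us measured).
    fn get_by_id(&self, doc_id: u64) -> Option<Vec<f32>>;

    // Incremental ingest.
    // Parquet tier today: rewrite the full file, then re-embed and rebuild HNSW.
    // Lance tier: append a delta fragment (0.08s, +31MB for 10K rows in the benchmark).
    fn append(&mut self, rows: Vec<(u64, Vec<f32>)>) -> std::io::Result<()>;
}
```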
The decision
Hybrid architecture — neither replace nor reject.
What stays
- `vectord::store` with Parquet + binary-blob vectors → primary vector backend
- `vectord::hnsw::HnswStore` → in-RAM HNSW for search at 100K-scale indexes
- All Phase 15 trial infrastructure → keeps working, unchanged
- Production default `ec=80 es=30` → still the right call for in-RAM use
What gets added
- `vectord::lance_store` — second backend using Lance as the persistence layer
  - Scope: indexes where any of the following apply:
    - Corpus exceeds ~5M vectors (our in-RAM ceiling)
    - Workload is append-heavy (incremental ingest from streaming sources)
    - Text retrieval dominates (point lookups by doc_id for RAG)
    - Hot-swap generations are required (Phase 16)
  - Implemented as a standalone crate first (follow the pilot layout), promoted into vectord when the API stabilizes
- Profile-level configuration — `ModelProfile.vector_backend: Parquet | Lance` so each profile picks the tier that matches its workload
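
A sketch of what the per-profile switch could look like; the field name follows this ADR, everything else (struct layout, derive set, constructor) is assumed:

```rust
// Hypothetical per-profile backend selection mirroring
// `ModelProfile.vector_backend: Parquet | Lance` from this ADR.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
enum VectorBackend {
    #[default]
    Parquet, // current tier: Parquet + in-RAM HNSW
    Lance,   // second tier: Lance dataset, IVF_PQ on disk
}

#[derive(Debug)]
struct ModelProfile {
    name: String,
    vector_backend: VectorBackend,
}

impl ModelProfile {
    // Default stays Parquet for backward compatibility; append-heavy or
    // point-lookup-heavy profiles opt into Lance explicitly.
    fn new(name: impl Into<String>) -> Self {
        Self { name: name.into(), vector_backend: VectorBackend::Parquet }
    }
}
```

An append-heavy streaming profile would set `vector_backend: VectorBackend::Lance` explicitly; the staffing SQL profiles keep the Parquet default.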
What we keep watching (but don't act on yet)
- Lance search latency at scale. 2229us at 100K is worse than HNSW. At 10M we expect Lance to pull ahead because HNSW doesn't fit in RAM. Re-benchmark when we have a 10M-vector corpus to test against.
- IVF_PQ recall. We measured latency but not recall — I picked `num_partitions=316, nbits=8, num_sub_vectors=48` blindly. A proper recall sweep is part of Phase C when we integrate Lance into the trial system.
- Lance's own HNSW-on-disk variant (`with_ivf_hnsw_pq_params`). Might close the in-RAM latency gap. Left for a future pilot.
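
When that recall sweep happens, the metric itself is straightforward: compare the IVF_PQ result set against brute-force ground truth. A self-contained sketch of recall@k, independent of any Lance API:

```rust
use std::collections::HashSet;

// recall@k: fraction of the true top-k neighbours that the ANN result recovered.
fn recall_at_k(approx: &[u64], exact: &[u64], k: usize) -> f64 {
    let truth: HashSet<u64> = exact.iter().take(k).copied().collect();
    let hits = approx
        .iter()
        .take(k)
        .filter(|&&id| truth.contains(&id))
        .count();
    hits as f64 / k as f64
}

// Averaged over a query set, this is the number a sweep over
// num_partitions / nbits / num_sub_vectors would optimise against.
fn mean_recall(results: &[(Vec<u64>, Vec<u64>)], k: usize) -> f64 {
    results.iter().map(|(a, e)| recall_at_k(a, e, k)).sum::<f64>() / results.len() as f64
}
```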
Why this isn't moving the goalposts
The EXECUTION_PLAN rule was "migrate or don't migrate." The evidence says neither is correct — one stack can't serve both the staffing SQL workload AND the LLM-brain append-heavy random-access workload at all scales. The honest answer is two backends, each doing what it's good at, selected per-profile.
This matches the dual-use framing in the 2026-04-16 PRD update: different workloads, shared substrate, per-profile specialization. We wrote that principle into the PRD; the benchmark data just made it concrete for the vector tier.
Follow-up work (updates EXECUTION_PLAN.md)
- Phase C (decoupled embedding refresh) gets easier — Lance's native append removes the need to invent a "vectors delta" Parquet layer. When we build Phase C, use Lance as the embedding-layer backend.
- Phase 16 (hot-swap) becomes feasible — 16s index builds mean online re-trials are cheap. When we build Phase 16, Lance is the storage for index generations.
- Phase 17 (model profiles) gains a new field: `vector_backend: Parquet | Lance`. Default Parquet for backward compatibility. Agents can opt into Lance.
Costs we accept
- Second dependency tree. Lance pulls in DataFusion 52 and Arrow 57, while our main stack runs DataFusion 47 and Arrow 55. Keeping lance-bench isolated works for a pilot; productionizing will need either a workspace-wide upgrade or a firewall via a dedicated `vectord-lance` crate.
- Second API surface. Lance's vector-index API is different from our HNSW code. The per-profile abstraction cost is real.
- Operational complexity. Two vector storage implementations to debug and monitor.
Worth it because the alternative — forcing every workload through one backend — means either the staffing case or the LLM-brain case is served badly.
Ceilings this updates in PRD
The PRD "Known ceilings" table had:
Vector count per index | ~5M vectors on 128GB RAM | 10M+ (serious web crawl) | Phase 18 Lance migration OR mmap'd embeddings
Update to:
Vector count per index | ~5M vectors on 128GB RAM (Parquet+HNSW in-RAM) | Past 5M | Switch that profile's `vector_backend` to Lance; IVF_PQ keeps working on disk-resident quantized codes