docs: sync ADR-019 + PRD + DECISIONS with 2026-05-02 substrate changes
ADR-019: closed the "re-bench when 10M corpus exists" follow-up. Added a "Follow-up: 10M re-bench (2026-05-02)" section with the post-fix numbers (search ~20ms warm / ~46ms cold, doc-fetch ~5ms post-btree). Documented the lance-bench-bypassing-IndexMeta bug, the 2-layer fix, and the gauntlet (7 unit + 12 sanitize + 10 smoke probes). Reframes the strategic question as "Lance vs Parquet+HNSW-with-spilling," since HNSW doesn't fit in RAM at 10M.

DECISIONS: added ADR-022 — drop the Python sidecar from the Rust hot path. Captures the rationale (the 236× embed perf gap was pure overhead), the co-shipped LRU cache, the dev-only Python that survives, cross-runtime parity verification, and the operator runbook signal (`ps -ef` ABSENT post-deploy).

PRD: updated the AI Boundary table line + aibridge crate description to reflect the direct Ollama path (was: Python FastAPI sidecar → Ollama). Both lines reference ADR-022 for the full rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in: parent `e9d17f7d5a` · commit `5368aca4d4`
@@ -70,10 +70,33 @@ But the written rule missed something. It assumed Lance's value would show up as
### What we keep watching (but don't act on yet)
- **Lance search latency at scale.** 2229us at 100K is worse than HNSW. At 10M we expect Lance to pull ahead because HNSW doesn't fit in RAM. Re-benchmark when we have a 10M-vector corpus to test against.
- ~~**Lance search latency at scale.** 2229us at 100K is worse than HNSW. At 10M we expect Lance to pull ahead because HNSW doesn't fit in RAM. Re-benchmark when we have a 10M-vector corpus to test against.~~ **Done 2026-05-02** — see "Follow-up: 10M re-bench" below.
- **IVF_PQ recall.** We measured latency but not recall — I picked `num_partitions=316, nbits=8, num_sub_vectors=48` blindly. A proper recall sweep is part of Phase C when we integrate Lance into the trial system.
- **Lance's own HNSW-on-disk variant** (`with_ivf_hnsw_pq_params`). Might close the in-RAM latency gap. Left for a future pilot.
---
## Follow-up: 10M re-bench (2026-05-02)
The 10M re-bench above ran. Numbers from `data/lance/scale_test_10m` (33 GB, 10M × 768d, IVF_PQ live, post-doc_id-btree-build). Full report: `reports/lance_10m_rebench_2026-05-02.md`.
| Op | 100K (this ADR) | 10M (re-bench) | Notes |
|---|---:|---:|---|
| Search (cold) | 2229μs | ~46ms median | ~21× slower at 100× scale → reasonable for IVF_PQ |
| Search (warm) | (not measured) | ~20ms p50 | Stable across 5 trials |
| Doc fetch by id | 311μs | ~5ms | Structural ADR-019 win confirmed once `doc_id` btree is built |
| Index method | lance_ivf_pq | lance_ivf_pq | response tag confirms |
**The HNSW comparison at 10M doesn't exist** — at 10M × 768d × 4 bytes = ~30 GB just for vectors, doubled for the graph. HNSW doesn't fit on a single 128 GB box at this scale. So the original "Lance pulls ahead at 10M" framing is true by elimination: Lance is the only contender that operationally exists at 10M. The strategic question is reframed as "Lance vs Parquet+HNSW-with-spilling," deferred until we have a workload where the Parquet path is the bottleneck (currently tracked in `golangLAKEHOUSE/docs/ARCHITECTURE_COMPARISON.md` decisions tracker).
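The sizing claim above can be reproduced with simple arithmetic; a minimal sketch, assuming (as the text does) that the HNSW graph roughly doubles the raw-vector footprint:

```rust
/// Rough HNSW in-RAM footprint estimate: raw f32 vectors, with graph/link
/// overhead approximated as "roughly doubles the vectors" (the assumption
/// stated in the ADR text, not a measured constant).
fn hnsw_ram_estimate_gb(num_vectors: u64, dims: u64) -> f64 {
    let vector_bytes = num_vectors * dims * 4; // f32 = 4 bytes
    let total_bytes = vector_bytes * 2;        // + graph, assumed ~1x vectors
    total_bytes as f64 / 1e9
}

fn main() {
    // 10M x 768d x 4B = ~30.7 GB raw vectors, ~61.4 GB with graph overhead
    let gb = hnsw_ram_estimate_gb(10_000_000, 768);
    println!("{:.1} GB", gb); // prints: 61.4 GB
}
```

Even under this rough model, the estimate sits uncomfortably close to the box's 128 GB once the OS, Parquet page cache, and other services are accounted for.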
**Bug surfaced and fixed during re-bench:** Initial doc-fetch was ~100ms (full table scan). Root cause: the `doc_id` scalar btree was never built — `lance-bench` writes datasets by bypassing `IndexMeta`, and the activation-time auto-build only runs for IndexMeta-registered indexes. Fixed at two layers (commits `5d30b3d` + `044650a`):
- `lance_migrate` HTTP handler auto-builds the btree inline (~1.2s on 10M, +269MB on disk)
- `lance-bench` binary builds the btree post-IVF for parity with the gateway path
**Gauntlet added 2026-05-02:** the lance crates had zero tests + no smoke when audited. Now have 7 unit tests in `crates/vectord-lance` + 12 sanitize tests in `crates/vectord` + 10-probe `scripts/lance_smoke.sh` + sanitized error boundary across all 5 routes. See commits `7bb66f0`, `ac7c996`, `e9d17f7` (sanitizer iterations driven by cross-lineage scrum).
**Status:** ADR-019's hybrid architecture (Parquet+HNSW primary, Lance secondary) is now empirically validated up to 10M. The "watch and re-bench" item is closed.
## Why this isn't moving the goalposts
The EXECUTION_PLAN rule was "migrate or don't migrate." The evidence says neither is correct — one stack can't serve both the staffing SQL workload AND the LLM-brain append-heavy random-access workload at all scales. The honest answer is two backends, each doing what it's good at, selected per-profile.
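The per-profile selection could look like the following sketch; `WorkloadProfile`, `Backend`, and the 10M threshold are illustrative stand-ins, not the project's actual types or config:

```rust
/// Illustrative only: the "two backends, selected per-profile" idea.
/// These names are hypothetical, not types from the project.
#[derive(Debug, PartialEq)]
enum Backend {
    ParquetHnsw, // primary: staffing SQL workload, in-RAM ANN at modest scale
    Lance,       // secondary: append-heavy random access at 10M+ vectors
}

struct WorkloadProfile {
    vector_count: u64,
    append_heavy: bool,
}

fn select_backend(p: &WorkloadProfile) -> Backend {
    // HNSW stops being viable once the index no longer fits in RAM,
    // and Lance wins on append-heavy random access regardless of scale.
    if p.append_heavy || p.vector_count >= 10_000_000 {
        Backend::Lance
    } else {
        Backend::ParquetHnsw
    }
}

fn main() {
    let staffing = WorkloadProfile { vector_count: 100_000, append_heavy: false };
    let llm_brain = WorkloadProfile { vector_count: 10_000_000, append_heavy: true };
    println!("{:?} / {:?}", select_backend(&staffing), select_backend(&llm_brain));
}
```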
@@ -104,3 +104,8 @@
**Date:** 2026-04-24
**Decision:** Extend `pathway_memory::PathwayTrace` (ADR added 2026-04-24 in same commit as this one) with a semantic-correctness layer so the matrix index compounds recognition of unit/type/shape bugs across iterations. Three new fields:

- `semantic_flags: Vec<SemanticFlag>` — enum: `UnitMismatch`, `TypeConfusion`, `NullableConfusion`, `OffByOne`, `StaleReference`, `PseudoImpl`, `DeadCode`, `WarningNoise`, `BoundaryViolation`.
- `type_hints_used: Vec<TypeHint>` — schema/type context the reviewer was given: catalogd column types for SQL-touching code, Arrow `RecordBatch.schema()` accessors for Rust, Rust struct field types for everything else.
- `bug_fingerprints: Vec<BugFingerprint>` — structural pattern hash, e.g. `{lhs_unit: "rows", rhs_unit: "files", op: "-"}` → stable SHA for similarity retrieval.

Scrum pipeline pre-review: query the matrix index for bug fingerprints flagged on this file's narrow fingerprint (same `task_class + file_prefix + signal_class` as hot-swap) and prepend them to the reviewer prompt as "watch for these patterns historically found here." The reviewer prompt explicitly tags each finding with a `semantic_flag`. `truth::evaluate()` gets a review-time task_class (`code_review.unit_check`) that consumes parsed-fact rules like `FieldContainsAny { field: "code_expression", needles: ["row_count - file_count", "bytes_read - row_count"] }` — the same primitive we use for SQL guard in P42-002.
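A stable structural fingerprint of that shape can be sketched as follows. The ADR calls for a stable SHA; this sketch substitutes dependency-free FNV-1a to illustrate the determinism requirement, and `bug_fingerprint` is a hypothetical helper, not the project's API:

```rust
/// FNV-1a 64-bit: deterministic across runs and releases, unlike
/// std's DefaultHasher (which is randomly seeded per process).
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hypothetical fingerprint over the {lhs_unit, rhs_unit, op} pattern.
/// A field-delimited canonical form keeps ("rows","files") distinct
/// from ("rowsf","iles") before hashing.
fn bug_fingerprint(lhs_unit: &str, rhs_unit: &str, op: &str) -> u64 {
    let canonical = format!("lhs_unit={lhs_unit}\x1frhs_unit={rhs_unit}\x1fop={op}");
    fnv1a_64(canonical.as_bytes())
}

fn main() {
    let fp = bug_fingerprint("rows", "files", "-");
    assert_eq!(fp, bug_fingerprint("rows", "files", "-")); // deterministic
    assert_ne!(fp, bug_fingerprint("files", "rows", "-")); // order-sensitive
    println!("{fp:016x}");
}
```

The only property the retrieval path needs is that identical patterns hash identically across processes, which is why a seeded hasher would not work here.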
**Rationale:** The 2026-04-24 `queryd/src/delta.rs` `base_rows = pre_filter_rows - delta_count` bug (86901f8) was found by a human reading the code and noticing the units didn't match. The hardened mechanical applier *cannot* catch this — its gates are syntactic (warning count, patch size, rationale-token alignment), not semantic. At 100 bugs this deep, no human catches them all; the signal→commit loop is capped by what humans can notice per iteration. We already ship the primitives: `catalogd` knows column types per dataset, Arrow `RecordBatch.schema()` is on every hot-path call, `truth::evaluate()` runs arbitrary field conditions at runtime, `shared/arrow_helpers` has typed row/byte/file accessors. All of this is used at RUNTIME; none is fed into the REVIEW pipeline. Semantic flags + bug fingerprints turn the matrix index from "what review happened" (current) into "what category of bug appeared where" (compounding) — so iter-20 scrum on `crates/queryd/src/` preempts review prompts with "this crate had a row/file unit mismatch in iter 7 (delta.rs:189); check every arithmetic on `*_count` variables."

Non-goals: we are NOT building a full type-inference engine (reuse Rust's `rustdoc`-level type info for structs, Arrow's schema for RecordBatch, catalogd's column types for SQL — everything beyond is Phase 3), and this is not a linter — clippy/rustc already catch syntactic issues; this catches SEMANTIC ones (same type, wrong units/role).

Bootstrap path: start with the 9 `SemanticFlag` variants above; add new variants only when a bug is found that doesn't fit an existing one.

Gate alignment with hot-swap: a pathway that repeatedly produces bugs of the same `SemanticFlag` variant on the same narrow fingerprint is more valuable as a "watch this file for X" signal than as a hot-swap candidate — retirement logic needs to consider both replay success_rate AND whether the pathway is serving as a bug-pattern beacon.
## ADR-022: Drop Python sidecar from Rust hot path; AiClient talks Ollama directly
**Date:** 2026-05-02
**Decision:** `crates/aibridge` AiClient was rewritten (commit `ba928b1`) to call Ollama HTTP directly: per-text `/api/embed`, `/api/generate` for chat + rerank-loop + admin (unload/preload), `/api/ps` + `nvidia-smi` for vram_snapshot. Public AiClient API is unchanged — 0 callers updated. The `sidecar/sidecar/{embed,generate,rerank,admin}.py` FastAPI hot-path routers are retired (kept on disk as historical reference only). The `sidecar/sidecar/{lab_ui,pipeline_lab}.py` ad-hoc Streamlit-shape dev UIs (~888 LOC) keep running as ad-hoc tooling; they are NOT on the runtime hot path. The systemd unit `lakehouse-sidecar.service` stays running for those dev UIs but the gateway no longer depends on it for any production call.
**Rationale:** The Python sidecar embed path was 236× slower than direct Ollama on warm workloads — the `/api/embed` HTTP hop + Python serialization tax was pure overhead with no logic added (the sidecar was a pass-through translator). Co-shipped commit `150cc3b` added an in-process LRU embed cache to the gateway, dropping warm p50 from 78ms → 129μs (cache hit) without changing the AiClient public API. Together: the gateway is one mega-process again instead of mega-process + Python sidecar, the Python runtime cost is paid only by dev tooling, and embed RPS is 236× higher on warm workloads. The dev UIs stayed because they're genuinely useful for ad-hoc embedding-pipeline experiments and don't run on the request path.

PRD's "AI Boundary | Python FastAPI sidecar → Ollama HTTP API" line is updated to reflect the direct path; ADR-019's hybrid storage architecture is unaffected (this ADR changes the AI substrate, not the vector substrate). See the cross-runtime mirror in Go-side `internal/aibridge` (Go was already direct-to-Ollama from D1 onward — this brings Rust to parity).

Verified: `cargo test -p aibridge` 32/32 PASS; live `/ai/embed` returns a 768d vector and `/v1/chat` returns "OK"; cross-runtime parity probe `embed_parity.sh` 8/8 PASS post-deploy.

Cost: the Python sidecar was the canonical "model-version is loaded" probe — that capability moved into AiClient via `/api/ps` polling. Operator runbook: `ps -ef | grep "uvicorn sidecar"` should be ABSENT post-deploy on the production gateway box; presence indicates the old binary is still running.
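The co-shipped LRU embed cache might look roughly like this sketch (illustrative names and capacity; the real cache fronts AiClient's per-text `/api/embed` calls inside the gateway):

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal in-process LRU embed cache sketch. Keyed by input text,
/// values are embedding vectors. Not the project's actual implementation.
struct EmbedCache {
    capacity: usize,
    map: HashMap<String, Vec<f32>>,
    order: VecDeque<String>, // front = least recently used
}

impl EmbedCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Cache hit returns a clone and promotes the key to most-recently-used.
    fn get(&mut self, text: &str) -> Option<Vec<f32>> {
        let v = self.map.get(text)?.clone();
        self.order.retain(|k| k != text);
        self.order.push_back(text.to_string());
        Some(v)
    }

    /// Insert, evicting the least-recently-used entry when at capacity.
    fn put(&mut self, text: String, embedding: Vec<f32>) {
        if self.map.len() >= self.capacity && !self.map.contains_key(&text) {
            if let Some(evicted) = self.order.pop_front() {
                self.map.remove(&evicted);
            }
        }
        self.order.retain(|k| k != &text);
        self.order.push_back(text.clone());
        self.map.insert(text, embedding);
    }
}

fn main() {
    let mut cache = EmbedCache::new(2);
    cache.put("alpha".into(), vec![0.1; 768]);
    cache.put("beta".into(), vec![0.2; 768]);
    cache.get("alpha");                        // "alpha" now most-recent
    cache.put("gamma".into(), vec![0.3; 768]); // evicts "beta" (LRU)
    assert!(cache.get("beta").is_none());
    assert!(cache.get("alpha").is_some());
}
```

The `retain`-based promotion is O(n) per hit, which is fine as a sketch; a production cache would pair the map with an intrusive list or use an existing LRU crate.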
@@ -77,7 +77,7 @@ A modular Rust service mesh over S3-compatible object storage, with a local AI l
| Data Format | Parquet + Arrow | Yes |
| RPC (internal) | tonic (gRPC) | Yes |
| AI Runtime | Ollama (local models) | Yes |
| AI Boundary | Python FastAPI sidecar → Ollama HTTP API | Yes |
| AI Boundary | Direct Ollama HTTP API (gateway → Ollama, no sidecar). Updated 2026-05-02 per ADR-022 — was "Python FastAPI sidecar → Ollama HTTP API". Sidecar's lab_ui/pipeline_lab Python remain as dev-only tools (not on hot path). | Yes |
| Vector Index | TBD — evaluate `hora`, `qdrant` crate, or HNSW from scratch | **Open** |
No new frameworks without documented ADR.
@@ -97,7 +97,7 @@ No new frameworks without documented ADR.
| **ingestd** | Ingest pipeline: CSV / JSON / PDF / Postgres-stream → normalize → Parquet → catalog |
| **vectord** | Embedding store + vector indexes + HNSW trial system (EmbeddingCache, trial journal, eval harness) |
| **journald** | Append-only mutation event log (ADR-012) — distinct from storaged error journal |
| **aibridge** | Rust↔Python boundary — HTTP client to FastAPI sidecar |
| **aibridge** | AI client — direct Ollama HTTP (per ADR-022, 2026-05-02; was Rust↔Python sidecar boundary). Owns LRU embed cache. |
| **ui** | Dioxus frontend — Ask, Explore, SQL, System tabs |
| **shared** | Types, errors, Arrow helpers, config, protobuf definitions, **secrets provider trait**, **PII detection** |
| **mcp-server** | Agent gateway (Bun) — MCP tools, intelligence endpoints, scenario observer (:3700) |