diff --git a/docs/ADR-019-vector-storage.md b/docs/ADR-019-vector-storage.md index 0ed2fe2..95151c1 100644 --- a/docs/ADR-019-vector-storage.md +++ b/docs/ADR-019-vector-storage.md @@ -70,10 +70,33 @@ But the written rule missed something. It assumed Lance's value would show up as ### What we keep watching (but don't act on yet) -- **Lance search latency at scale.** 2229us at 100K is worse than HNSW. At 10M we expect Lance to pull ahead because HNSW doesn't fit in RAM. Re-benchmark when we have a 10M-vector corpus to test against. +- ~~**Lance search latency at scale.** 2229us at 100K is worse than HNSW. At 10M we expect Lance to pull ahead because HNSW doesn't fit in RAM. Re-benchmark when we have a 10M-vector corpus to test against.~~ **Done 2026-05-02** — see "Follow-up: 10M re-bench" below. - **IVF_PQ recall.** We measured latency but not recall — I picked `num_partitions=316, nbits=8, num_sub_vectors=48` blindly. A proper recall sweep is part of Phase C when we integrate Lance into the trial system. - **Lance's own HNSW-on-disk variant** (`with_ivf_hnsw_pq_params`). Might close the in-RAM latency gap. Left for a future pilot. +--- + +## Follow-up: 10M re-bench (2026-05-02) + +The 10M re-bench flagged above has now run. Numbers are from `data/lance/scale_test_10m` (33 GB, 10M × 768d, IVF_PQ live, post-doc_id-btree-build). Full report: `reports/lance_10m_rebench_2026-05-02.md`. + +| Op | 100K (this ADR) | 10M (re-bench) | Notes | +|---|---:|---:|---| +| Search (cold) | 2229μs | ~46ms median | 21× slower at 100× scale → reasonable for IVF_PQ | +| Search (warm) | (not measured) | ~20ms p50 | Stable across 5 trials | +| Doc fetch by id | 311μs | ~5ms | Structural ADR-019 win confirmed once `doc_id` btree is built | +| Index method | lance_ivf_pq | lance_ivf_pq | response tag confirms | + +**The HNSW comparison at 10M doesn't exist** — at 10M × 768d × 4 bytes = ~30 GB for the raw vectors alone, roughly doubled again by the HNSW graph. HNSW doesn't fit on a single 128 GB box at this scale.
So the original "Lance pulls ahead at 10M" framing is true by elimination: Lance is the only contender that operationally exists at 10M. The strategic question is reframed as "Lance vs Parquet+HNSW-with-spilling," deferred until we have a workload where the Parquet path is the bottleneck (currently tracked in the `golangLAKEHOUSE/docs/ARCHITECTURE_COMPARISON.md` decisions tracker). + +**Bug surfaced and fixed during re-bench:** Initial doc-fetch was ~100ms (full table scan). Root cause: the `doc_id` scalar btree was never built — `lance-bench` writes datasets by bypassing `IndexMeta`, and the activation-time auto-build only runs for IndexMeta-registered indexes. Fixed at two layers (commits `5d30b3d` + `044650a`): +- `lance_migrate` HTTP handler auto-builds the btree inline (~1.2s on 10M, +269MB on disk) +- `lance-bench` binary builds the btree post-IVF for parity with the gateway path + +**Gauntlet added 2026-05-02:** the lance crates had zero tests and no smoke coverage when audited. They now have 7 unit tests in `crates/vectord-lance` + 12 sanitize tests in `crates/vectord` + 10-probe `scripts/lance_smoke.sh` + a sanitized error boundary across all 5 routes. See commits `7bb66f0`, `ac7c996`, `e9d17f7` (sanitizer iterations driven by cross-lineage scrum). + +**Status:** ADR-019's hybrid architecture (Parquet+HNSW primary, Lance secondary) is now empirically validated up to 10M. The "watch and re-bench" item is closed. + ## Why this isn't moving the goalposts The EXECUTION_PLAN rule was "migrate or don't migrate." The evidence says neither is correct — one stack can't serve both the staffing SQL workload AND the LLM-brain append-heavy random-access workload at all scales. The honest answer is two backends, each doing what it's good at, selected per-profile.
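As a sanity check on the elimination argument above, the RAM footprint arithmetic can be reproduced in a few lines (illustrative only — `hnsw_ram_gb` is a hypothetical helper, and the 2× graph multiplier is this ADR's rough "doubled for the graph" assumption, not a measured overhead):

```python
def hnsw_ram_gb(num_vectors: int, dims: int, bytes_per_float: int = 4,
                graph_overhead: float = 2.0) -> float:
    """Back-of-envelope RAM for an in-memory HNSW index.

    graph_overhead=2.0 encodes the rough "doubled for the graph"
    assumption; real HNSW link overhead depends on M and ef_construction.
    """
    raw_bytes = num_vectors * dims * bytes_per_float
    return raw_bytes * graph_overhead / 1e9

print(hnsw_ram_gb(10_000_000, 768, graph_overhead=1.0))  # 30.72 — raw vectors only
print(hnsw_ram_gb(10_000_000, 768))                      # 61.44 — with the 2x graph assumption
```

Under these assumptions the index alone claims roughly half of a 128 GB box before the gateway, page cache, and the rest of the stack get a byte — which is the operational sense in which HNSW "doesn't fit" at 10M.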
diff --git a/docs/DECISIONS.md b/docs/DECISIONS.md index 3cafa35..c382719 100644 --- a/docs/DECISIONS.md +++ b/docs/DECISIONS.md @@ -104,3 +104,8 @@ **Date:** 2026-04-24 **Decision:** Extend `pathway_memory::PathwayTrace` (ADR added 2026-04-24 in same commit as this one) with a semantic-correctness layer so the matrix index compounds recognition of unit/type/shape bugs across iterations. Three new fields: `semantic_flags: Vec<SemanticFlag>` (enum: `UnitMismatch`, `TypeConfusion`, `NullableConfusion`, `OffByOne`, `StaleReference`, `PseudoImpl`, `DeadCode`, `WarningNoise`, `BoundaryViolation`), `type_hints_used: Vec` (schema/type context the reviewer was given — catalogd column types for SQL-touching code, Arrow `RecordBatch.schema()` accessors for Rust, Rust struct field types for everything else), and `bug_fingerprints: Vec` (structural pattern hash, e.g. `{lhs_unit: "rows", rhs_unit: "files", op: "-"}` → stable SHA for similarity retrieval). Scrum pipeline pre-review: query matrix index for bug fingerprints flagged on this file's narrow fingerprint (same `task_class + file_prefix + signal_class` as hot-swap) and prepend them to the reviewer prompt as "watch for these patterns historically found here." Reviewer prompt explicitly tags each finding with a `semantic_flag`. `truth::evaluate()` gets a review-time task_class (`code_review.unit_check`) that consumes parsed-fact rules like `FieldContainsAny { field: "code_expression", needles: ["row_count - file_count", "bytes_read - row_count"] }` — the same primitive we use for SQL guard in P42-002. **Rationale:** The 2026-04-24 `queryd/src/delta.rs` `base_rows = pre_filter_rows - delta_count` bug (86901f8) was found by a human reading the code and noticing units didn't match. The hardened mechanical applier *cannot* catch this — its gates are syntactic (warning count, patch size, rationale-token alignment) not semantic.
At 100 bugs this deep, no human catches them all; the signal→commit loop is capped by what humans can notice per iteration. We already ship the primitives: `catalogd` knows column types per dataset, Arrow `RecordBatch.schema()` is on every hot-path call, `truth::evaluate()` runs arbitrary field conditions at runtime, `shared/arrow_helpers` has typed row/byte/file accessors. All of this is used at RUNTIME; none is fed into the REVIEW pipeline. Semantic flags + bug fingerprints turn the matrix index from "what review happened" (current) into "what category of bug appeared where" (compounding) — so iter-20 scrum on `crates/queryd/src/` pre-seeds review prompts with "this crate had a row/file unit mismatch in iter 7 (delta.rs:189); check every arithmetic on `*_count` variables." Non-goals: we are NOT building a full type-inference engine (reuse Rust's `rustdoc`-level type info for structs, Arrow's schema for RecordBatch, catalogd's column types for SQL — everything beyond is Phase 3), and this is not a linter — clippy/rustc already catch syntactic issues; this catches SEMANTIC ones (same type, wrong units/role). Bootstrap path: start with the 9 `SemanticFlag` variants above; add new variants only when a bug is found that doesn't fit an existing one. Gate alignment with hot-swap: a pathway that repeatedly produces bugs of the same `SemanticFlag` variant on the same narrow fingerprint is more valuable as a "watch this file for X" signal than as a hot-swap candidate — retirement logic needs to consider both replay success_rate AND whether the pathway is serving as a bug-pattern beacon. + +## ADR-022: Drop Python sidecar from Rust hot path; AiClient talks Ollama directly +**Date:** 2026-05-02 +**Decision:** `crates/aibridge` AiClient was rewritten (commit `ba928b1`) to call Ollama HTTP directly: per-text `/api/embed`, `/api/generate` for chat + rerank-loop + admin (unload/preload), `/api/ps` + `nvidia-smi` for vram_snapshot.
Public AiClient API is unchanged — 0 callers updated. The `sidecar/sidecar/{embed,generate,rerank,admin}.py` FastAPI hot-path routers are retired (kept on disk as historical reference only). The `sidecar/sidecar/{lab_ui,pipeline_lab}.py` Streamlit-shape dev UIs (~888 LOC) keep running as ad-hoc tooling; they are NOT on the runtime hot path. The systemd unit `lakehouse-sidecar.service` stays running for those dev UIs but the gateway no longer depends on it for any production call. +**Rationale:** The Python sidecar embed path was 236× slower than direct Ollama on warm workloads — the `/api/embed` HTTP hop + Python serialization tax was pure overhead with no logic added (the sidecar was a pass-through translator). Co-shipped commit `150cc3b` added an in-process LRU embed cache to the gateway, dropping warm p50 from 78ms → 129μs (cache hit) without changing the AiClient public API. Together: the gateway is one mega-process again instead of mega-process + Python sidecar, the Python runtime cost is paid only by dev tooling, and embed RPS is 236× higher on warm workloads. The dev UIs stayed because they're genuinely useful for ad-hoc embedding-pipeline experiments and don't run on the request path. PRD's "AI Boundary | Python FastAPI sidecar → Ollama HTTP API" line is updated to reflect the direct path; ADR-019's hybrid storage architecture is unaffected (this ADR changes the AI substrate, not the vector substrate). See the cross-runtime mirror in Go-side `internal/aibridge` (Go was already direct-to-Ollama from D1 onward — this brings Rust to parity). Verified: `cargo test -p aibridge` 32/32 PASS, live `/ai/embed` returns a 768d vector + `/v1/chat` returns "OK"; cross-runtime parity probe `embed_parity.sh` 8/8 PASS post-deploy. Cost: the Python sidecar was the canonical "model-version is loaded" probe — that capability moved into AiClient via `/api/ps` polling.
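The in-process embed cache from `150cc3b` is not shown here; a minimal sketch of the shape it implies — an LRU keyed by (model, text) so a warm hit never leaves the process — might look like this (all names hypothetical, Python for brevity; the real cache lives in the Rust gateway):

```python
from collections import OrderedDict

class EmbedCache:
    """Minimal in-process LRU embed cache keyed by (model, text) — illustrative only."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._store: OrderedDict = OrderedDict()

    def get(self, model: str, text: str):
        key = (model, text)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]       # cache hit: no Ollama round trip
        return None                       # miss: caller embeds via /api/embed, then put()

    def put(self, model: str, text: str, vector: list):
        key = (model, text)
        self._store[key] = vector
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
```

On a hit the gateway skips the Ollama `/api/embed` round trip entirely, which is where the 78ms → 129μs warm-path improvement comes from; capacity and eviction policy are deployment tuning knobs, not values taken from the commit.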
Operator runbook: `ps -ef | grep "uvicorn sidecar"` should be ABSENT post-deploy on the production gateway box; presence means the old sidecar process is still running. diff --git a/docs/PRD.md b/docs/PRD.md index 0a10db4..28d0bea 100644 --- a/docs/PRD.md +++ b/docs/PRD.md @@ -77,7 +77,7 @@ A modular Rust service mesh over S3-compatible object storage, with a local AI l | Data Format | Parquet + Arrow | Yes | | RPC (internal) | tonic (gRPC) | Yes | | AI Runtime | Ollama (local models) | Yes | -| AI Boundary | Python FastAPI sidecar → Ollama HTTP API | Yes | +| AI Boundary | Direct Ollama HTTP API (gateway → Ollama, no sidecar). Updated 2026-05-02 per ADR-022 — was "Python FastAPI sidecar → Ollama HTTP API". Sidecar's lab_ui/pipeline_lab Python remain as dev-only tools (not on hot path). | Yes | | Vector Index | TBD — evaluate `hora`, `qdrant` crate, or HNSW from scratch | **Open** | No new frameworks without documented ADR. @@ -97,7 +97,7 @@ | **ingestd** | Ingest pipeline: CSV / JSON / PDF / Postgres-stream → normalize → Parquet → catalog | | **vectord** | Embedding store + vector indexes + HNSW trial system (EmbeddingCache, trial journal, eval harness) | | **journald** | Append-only mutation event log (ADR-012) — distinct from storaged error journal | -| **aibridge** | Rust↔Python boundary — HTTP client to FastAPI sidecar | +| **aibridge** | AI client — direct Ollama HTTP (per ADR-022, 2026-05-02; was Rust↔Python sidecar boundary). Owns LRU embed cache. | | **ui** | Dioxus frontend — Ask, Explore, SQL, System tabs | | **shared** | Types, errors, Arrow helpers, config, protobuf definitions, **secrets provider trait**, **PII detection** | | **mcp-server** | Agent gateway (Bun) — MCP tools, intelligence endpoints, scenario observer (:3700) |
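Circling back to the `bug_fingerprints` field in the 2026-04-24 DECISIONS.md entry above: the "structural pattern hash → stable SHA" step can be sketched as canonical-JSON hashing (a minimal illustration; `bug_fingerprint` is a hypothetical helper and the decision entry does not pin down the exact canonicalization):

```python
import hashlib
import json

def bug_fingerprint(pattern: dict) -> str:
    """Stable SHA-256 over a structural bug pattern.

    Keys are sorted and whitespace stripped so that logically equal
    patterns always produce the same fingerprint for retrieval.
    """
    canonical = json.dumps(pattern, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# The unit-mismatch example from the decision entry:
fp1 = bug_fingerprint({"lhs_unit": "rows", "rhs_unit": "files", "op": "-"})
fp2 = bug_fingerprint({"op": "-", "rhs_unit": "files", "lhs_unit": "rows"})
assert fp1 == fp2  # key order does not change the fingerprint
```

Equal fingerprints are what lets the scrum pre-review look up "bugs historically found here" by exact hash; fuzzy similarity between near-miss patterns would need a locality-sensitive scheme rather than a plain SHA, which the entry leaves open.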