From 71b35fb85e902b78f131029686e6ed752f3d2d3b Mon Sep 17 00:00:00 2001 From: root Date: Wed, 29 Apr 2026 18:12:10 -0500 Subject: [PATCH] =?UTF-8?q?SPEC=20=C2=A71=20+=20=C2=A73.4:=20name=20matrix?= =?UTF-8?q?=20indexer=20as=20a=20port=20target?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds matrix indexer as its own row in the §1 component table and a new §3.4 with port plan. Distinct from vectord (substrate); lives at internal/matrix/ + gateway /v1/matrix/*. Five components in dependency order: corpus builders → multi-corpus retrieve+merge → relevance filter → strong-model downgrade gate → learning-loop integration. Locks in the framing J flagged 2026-04-29: in Rust the matrix indexer was emergent across mode.rs + build_*_corpus.ts + observer /relevance, and earlier port-planning reduced it to "we have vectord." The SPEC now names it explicitly so the port preserves the multi-corpus retrieval shape AND the learning loop, not just the HNSW substrate. Sharding-by-id was investigated as a throughput fix and rejected — corpus-as-shard at the matrix layer is the existing retrieval shape and parallelizes Adds for free. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/SPEC.md | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 76 insertions(+) diff --git a/docs/SPEC.md b/docs/SPEC.md index 1d56020..6f3d958 100644 --- a/docs/SPEC.md +++ b/docs/SPEC.md @@ -28,6 +28,7 @@ Effort scale (one engineer-week = ~40h focused work): | `queryd` | datafusion, arrow | `cmd/queryd` | **`duckdb/duckdb-go/v2`** (cgo, official) | **HARD** | high — see §3 | | `ingestd` | csv, json, lopdf, postgres | `cmd/ingestd` | stdlib `encoding/csv`, `encoding/json`, `pdfcpu/pdfcpu`, `jackc/pgx/v5` | **L** | low | | `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `apache/arrow-go/v18` | **L** | medium — re-validate HNSW recall | +| **matrix indexer** (emergent in Rust — `mode.rs` + `build_*_corpus.ts` + observer `/relevance`) | scripts/build_*_corpus.ts, crates/gateway/src/v1/mode.rs, mcp-server/observer.ts | `internal/matrix/` + gateway routes (`/v1/matrix/*`) | stdlib + vectord client | **L** | medium — see §3.4. Corpus-as-shard composer; relevance filter; strong-model downgrade gate; multi-corpus retrieve+merge. The learning-loop layer that lifts vectord from "static index" to "meta-index that learns from playbooks." | | `vectord-lance` | lance | **DROPPED** | n/a | n/a | n/a — Parquet+HNSW only | | `journald` | parquet, arrow | `cmd/journald` | `apache/arrow-go/v18` | **M** | low | | `aibridge` | reqwest | library | `net/http` + connection pool · `anthropics/anthropic-sdk-go` available for direct Claude calls (currently routed via opencode) | **S** | low | @@ -116,6 +117,81 @@ needs revisiting in Go to confirm the sidecar format we ship. - G3.2.C — Recall@10 within 2% of Rust baseline on `lakehouse_arch_v1` +### §3.4 — Matrix indexer (corpus-as-shard composer) + +**What it is.** The matrix indexer is the layer above `vectord` that +turns a fleet of single-corpus HNSW indexes into a learning meta-index. 
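+As a sketch only, the retrieve+merge shape this layer adds above single-corpus
+indexes might look like the following in Go. `Hit`, `Searcher`, `RetrieveMerge`,
+and the stub scores are hypothetical names for illustration, not the port's API;
+the stub searcher stands in for a real vectord client.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is one scored result from a single-corpus search. Corpus
// attribution is kept through the merge so callers can see which
// index each result came from.
type Hit struct {
	Corpus string
	ID     string
	Score  float64 // higher is better
}

// Searcher abstracts a per-corpus index; in the port this would wrap
// a vectord client. Hypothetical interface, for illustration.
type Searcher func(corpus, query string, topK int) []Hit

// RetrieveMerge searches each corpus at topK, merges by score, and
// returns the top n hits globally: the merge the matrix layer
// performs above single-corpus indexes.
func RetrieveMerge(search Searcher, corpora []string, query string, topK, n int) []Hit {
	var merged []Hit
	for _, c := range corpora {
		merged = append(merged, search(c, query, topK)...)
	}
	sort.Slice(merged, func(i, j int) bool { return merged[i].Score > merged[j].Score })
	if len(merged) > n {
		merged = merged[:n]
	}
	return merged
}

func main() {
	// Stub searcher with fixed per-corpus scores, illustration only.
	base := map[string]float64{"arch": 0.90, "symbols": 0.50, "answers": 0.80}
	stub := func(corpus, query string, topK int) []Hit {
		hits := make([]Hit, topK)
		for i := range hits {
			hits[i] = Hit{Corpus: corpus, ID: fmt.Sprintf("%s-%d", corpus, i), Score: base[corpus] - 0.05*float64(i)}
		}
		return hits
	}
	for _, h := range RetrieveMerge(stub, []string{"arch", "symbols", "answers"}, "q", 2, 3) {
		fmt.Printf("%s %.2f\n", h.ID, h.Score)
	}
	// Output:
	// arch-0 0.90
	// arch-1 0.85
	// answers-0 0.80
}
```

+The stateless shape matters: because the merge holds no per-corpus write
+state, concurrent Adds to different corpora never contend, which is the
+parallelism claim made for corpus-as-shard below.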
+In the Rust system this is emergent — split between corpus builders +(`scripts/build_*_corpus.ts`), the mode runner (`crates/gateway/src/v1/mode.rs`), +the observer relevance endpoint (`mcp-server/observer.ts`), and the +strong-model downgrade gate (`mode.rs::execute`). In Go we name it +explicitly so future sessions don't reduce it to "vectord." + +**Why corpus-as-shard, not shard-by-id.** Sharding a single index by +hash(id) is a pure throughput hack with a recall tax. Sharding by +corpus is the existing retrieval shape — `lakehouse_arch_v1`, +`lakehouse_symbols_v1`, `scrum_findings_v1`, `lakehouse_answers_v1`, +`kb_team_runs_v1`, `successful_playbooks_live`, etc. — each with +distinct topology and a distinct retrieval intent. Concurrent Adds +parallelize naturally because they go to different corpora; the +matrix layer's job is to retrieve+merge across them, filter for +relevance, and downgrade composition when strong models prove the +matrix is anti-additive. + +**Components to port (in dependency order):** + +1. **Corpus builders** — Go equivalents of `scripts/build_*_corpus.ts`. + For each named corpus, a builder that reads source, splits into + chunks per the corpus's schema, embeds via `/v1/embed`, and adds + to a vectord index of the same name. Effort: **M** for the first + builder, **S** for each subsequent. + +2. **Multi-corpus retrieve+merge** (`internal/matrix/retrieve.go`) — + given a query and a list of corpus names, search each at top_k=K, + merge by score, return top N globally. Match Rust's pattern: + top_k=6 per corpus, top 8 globally before relevance filter. + +3. **Relevance filter** (`internal/matrix/relevance.go`) — port the + threshold-based filter from `mcp-server/observer.ts:/relevance`. + Drops adjacency-pollution chunks that share a corpus with the hit + but aren't actually about the query. `LH_RELEVANCE_FILTER` / + `LH_RELEVANCE_THRESHOLD` env knobs preserved. + +4. 
**Strong-model downgrade gate** (`internal/matrix/downgrade.go`) — + port `is_weak_model` + the `codereview_lakehouse → codereview_isolation` + flip from `mode.rs::execute`. Pass5 proved composed corpora lose + 5/5 vs isolation on grok-4.1-fast (p=0.031); the gate is + load-bearing for paid-model retrieval quality. + +5. **Learning-loop integration** — write outcomes back to a + playbook-memory corpus (probably a `lakehouse_answers_v1` analogue). + This is what makes the matrix INDEX a learning system rather than + static retrieval. Per `feedback_meta_index_vision.md`: this is the + north star, not the data structure. + +**Gateway routes:** `/v1/matrix/search` (multi-corpus retrieve+merge), +`/v1/matrix/corpora` (list + metadata), `/v1/matrix/relevance` (filter +endpoint, used by both internal callers and external tooling). + +**Acceptance gates:** +- G3.4.A — `/v1/matrix/search` against ≥3 corpora returns merged top-N + with corpus attribution per result. +- G3.4.B — Relevance filter drops every chunk scoring below + `LH_RELEVANCE_THRESHOLD` on a known adjacency-pollution test case. +- G3.4.C — Strong-model downgrade gate flips composed→isolation when + the model is non-weak; bypassed when caller sets `force_mode`. +- G3.4.D — Concurrent Adds across N=4 corpora parallelize (no shared + write-lock); Add throughput scales near-linearly with corpus count. + +**Persistence:** each corpus's vectord index persists via the existing +G1P LHV1 format. The matrix layer is stateless above that — corpus +list lives in catalog, retrieval params in config. + +**Why this is its own §3.x:** in Rust the matrix indexer was emergent +and got reduced to "we have vectord" in earlier port-planning. The +SPEC names it explicitly so the port preserves the multi-corpus +retrieval shape AND the learning loop, not just the HNSW substrate. + ### §3.3 — UI (HTMX) **Approach:** server-rendered Go templates using `html/template`,