SPEC §1 + §3.4: name matrix indexer as a port target

Adds matrix indexer as its own row in the §1 component table and a
new §3.4 with port plan. Distinct from vectord (substrate); lives at
internal/matrix/ + gateway /v1/matrix/*.

Five components in dependency order: corpus builders → multi-corpus
retrieve+merge → relevance filter → strong-model downgrade gate →
learning-loop integration.

Locks in the framing J flagged 2026-04-29: in Rust the matrix indexer
was emergent across mode.rs + build_*_corpus.ts + observer /relevance,
and earlier port-planning reduced it to "we have vectord." The SPEC
now names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.

Sharding-by-id was investigated as a throughput fix and rejected —
corpus-as-shard at the matrix layer is the existing retrieval shape
and parallelizes Adds for free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
root 2026-04-29 18:12:10 -05:00
parent f1c188323c
commit 71b35fb85e


@@ -28,6 +28,7 @@ Effort scale (one engineer-week = ~40h focused work):
| `queryd` | datafusion, arrow | `cmd/queryd` | **`duckdb/duckdb-go/v2`** (cgo, official) | **HARD** | high — see §3 |
| `ingestd` | csv, json, lopdf, postgres | `cmd/ingestd` | stdlib `encoding/csv`, `encoding/json`, `pdfcpu/pdfcpu`, `jackc/pgx/v5` | **L** | low |
| `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `apache/arrow-go/v18` | **L** | medium — re-validate HNSW recall |
| **matrix indexer** (emergent in Rust — `mode.rs` + `build_*_corpus.ts` + observer `/relevance`) | scripts/build_*_corpus.ts, crates/gateway/src/v1/mode.rs, mcp-server/observer.ts | `internal/matrix/` + gateway routes (`/v1/matrix/*`) | stdlib + vectord client | **L** | medium — see §3.4. Corpus-as-shard composer; relevance filter; strong-model downgrade gate; multi-corpus retrieve+merge. The learning-loop layer that lifts vectord from "static index" to "meta-index that learns from playbooks." |
| `vectord-lance` | lance | **DROPPED** | n/a | n/a | n/a — Parquet+HNSW only |
| `journald` | parquet, arrow | `cmd/journald` | `apache/arrow-go/v18` | **M** | low |
| `aibridge` | reqwest | library | `net/http` + connection pool · `anthropics/anthropic-sdk-go` available for direct Claude calls (currently routed via opencode) | **S** | low |
@@ -116,6 +117,81 @@ needs revisiting in Go to confirm the sidecar format we ship.
- G3.2.C — Recall@10 within 2% of Rust baseline on
`lakehouse_arch_v1`

### §3.4 — Matrix indexer (corpus-as-shard composer)

**What it is.** The matrix indexer is the layer above `vectord` that
turns a fleet of single-corpus HNSW indexes into a learning meta-index.
In the Rust system this is emergent — split between corpus builders
(`scripts/build_*_corpus.ts`), the mode runner (`crates/gateway/src/v1/mode.rs`),
the observer relevance endpoint (`mcp-server/observer.ts`), and the
strong-model downgrade gate (`mode.rs::execute`). In Go we name it
explicitly so future sessions don't reduce it to "vectord."

**Why corpus-as-shard, not shard-by-id.** Sharding a single index by
hash(id) is a pure throughput hack with a recall tax. Sharding by
corpus is the existing retrieval shape — `lakehouse_arch_v1`,
`lakehouse_symbols_v1`, `scrum_findings_v1`, `lakehouse_answers_v1`,
`kb_team_runs_v1`, `successful_playbooks_live`, etc. — each with
distinct topology and a distinct retrieval intent. Concurrent Adds
parallelize naturally because they go to different corpora; the
matrix layer's job is to retrieve+merge across them, filter for
relevance, and downgrade composition when strong models prove the
matrix is anti-additive.
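The "concurrent Adds parallelize naturally" claim can be made concrete. A minimal sketch, assuming hypothetical names (`corpusShard`, `Matrix`, `NewMatrix` are illustrative, not the real `internal/matrix` API): each corpus owns its own lock, so writers to different corpora never contend.

```go
package main

import (
	"fmt"
	"sync"
)

// corpusShard owns one corpus's index and its own lock, so Adds to
// different corpora never contend on a shared write-lock.
type corpusShard struct {
	mu    sync.Mutex
	items []string // stands in for the corpus's HNSW index
}

func (c *corpusShard) Add(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items = append(c.items, id)
}

// Matrix maps corpus name -> shard; the map is read-only after setup,
// so lookups need no lock of their own.
type Matrix struct {
	corpora map[string]*corpusShard
}

func NewMatrix(names ...string) *Matrix {
	m := &Matrix{corpora: make(map[string]*corpusShard)}
	for _, n := range names {
		m.corpora[n] = &corpusShard{}
	}
	return m
}

func main() {
	m := NewMatrix("lakehouse_arch_v1", "scrum_findings_v1")
	var wg sync.WaitGroup
	for name := range m.corpora {
		name := name
		wg.Add(1)
		go func() { // concurrent Adds land on different shards
			defer wg.Done()
			for i := 0; i < 100; i++ {
				m.corpora[name].Add(fmt.Sprintf("%s-%d", name, i))
			}
		}()
	}
	wg.Wait()
	fmt.Println(len(m.corpora["lakehouse_arch_v1"].items))
}
```

This is the structural reason gate G3.4.D's near-linear Add scaling is plausible: the only shared state is a read-only map.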

**Components to port (in dependency order):**
1. **Corpus builders** — Go equivalents of `scripts/build_*_corpus.ts`.
For each named corpus, a builder that reads source, splits into
chunks per the corpus's schema, embeds via `/v1/embed`, and adds
   to a vectord index of the same name. Effort: **M** for the first
   builder, **S** for each subsequent one.
2. **Multi-corpus retrieve+merge** (`internal/matrix/retrieve.go`) —
given a query and a list of corpus names, search each at top_k=K,
merge by score, return top N globally. Match Rust's pattern:
top_k=6 per corpus, top 8 globally before relevance filter.
3. **Relevance filter** (`internal/matrix/relevance.go`) — port the
threshold-based filter from `mcp-server/observer.ts:/relevance`.
Drops adjacency-pollution chunks that share a corpus with the hit
but aren't actually about the query. `LH_RELEVANCE_FILTER` /
`LH_RELEVANCE_THRESHOLD` env knobs preserved.
4. **Strong-model downgrade gate** (`internal/matrix/downgrade.go`) —
port `is_weak_model` + the `codereview_lakehouse → codereview_isolation`
flip from `mode.rs::execute`. Pass5 proved composed corpora lose
5/5 vs isolation on grok-4.1-fast (p=0.031); the gate is
load-bearing for paid-model retrieval quality.
5. **Learning-loop integration** — write outcomes back to a
playbook-memory corpus (probably `lakehouse_answers_v1` analogue).
This is what makes the matrix INDEX a learning system rather than
static retrieval. Per `feedback_meta_index_vision.md`: this is the
north star, not the data structure.
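Components 2 and 3 can be sketched together. The top_k=6 / top-8 numbers and the `LH_RELEVANCE_FILTER` / `LH_RELEVANCE_THRESHOLD` knobs are from this spec; the `Hit` shape, the `searcher` indirection, and the 0.5 default threshold are assumptions for illustration, not the real vectord client API.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
)

type Hit struct {
	Corpus string
	ID     string
	Score  float64 // higher is more similar
}

// searcher stands in for a vectord client call against one corpus's index.
type searcher func(corpus, query string, topK int) []Hit

// RetrieveMerge searches each corpus at perCorpus (Rust pattern: 6),
// merges by score, and returns the top globalN (Rust pattern: 8).
func RetrieveMerge(search searcher, query string, corpora []string, perCorpus, globalN int) []Hit {
	var all []Hit
	for _, c := range corpora {
		all = append(all, search(c, query, perCorpus)...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
	if len(all) > globalN {
		all = all[:globalN]
	}
	return all
}

// FilterRelevance drops hits below threshold, preserving the
// LH_RELEVANCE_FILTER / LH_RELEVANCE_THRESHOLD env knobs.
func FilterRelevance(hits []Hit) []Hit {
	if os.Getenv("LH_RELEVANCE_FILTER") == "off" {
		return hits
	}
	threshold := 0.5 // assumed default; the real value comes from config
	if v := os.Getenv("LH_RELEVANCE_THRESHOLD"); v != "" {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			threshold = f
		}
	}
	out := make([]Hit, 0, len(hits))
	for _, h := range hits {
		if h.Score >= threshold {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	fake := func(corpus, query string, topK int) []Hit {
		return []Hit{{corpus, corpus + "-0", 0.9}, {corpus, corpus + "-1", 0.3}}
	}
	hits := RetrieveMerge(fake, "q",
		[]string{"lakehouse_arch_v1", "scrum_findings_v1"}, 6, 8)
	fmt.Println(len(FilterRelevance(hits)))
}
```

Keeping the filter as a separate pass over the merged list is what lets `/v1/matrix/relevance` expose it standalone to external tooling.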

**Gateway routes:** `/v1/matrix/search` (multi-corpus retrieve+merge),
`/v1/matrix/corpora` (list + metadata), `/v1/matrix/relevance` (filter
endpoint, used by both internal callers and external tooling).

**Acceptance gates:**
- G3.4.A — `/v1/matrix/search` against ≥3 corpora returns merged top-N
with corpus attribution per result.
- G3.4.B — Relevance filter drops at least the threshold-margin chunks
on a known adjacency-pollution test case.
- G3.4.C — Strong-model downgrade gate flips composed→isolation when
the model is non-weak; bypassed when caller sets `force_mode`.
- G3.4.D — Concurrent Adds across N=4 corpora parallelize (no shared
write-lock); Add throughput scales near-linearly with corpus count.
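The G3.4.C behavior (component 4) fits in a few lines. The mode names and the force-bypass come from this spec; the weak-model substring markers and the matching rule are assumptions standing in for whatever `mode.rs::is_weak_model` actually checks:

```go
package main

import (
	"fmt"
	"strings"
)

// isWeakModel approximates mode.rs::is_weak_model. The marker list here
// is illustrative only; the port must copy the real Rust predicate.
func isWeakModel(model string) bool {
	for _, marker := range []string{"-fast", "mini", "haiku"} {
		if strings.Contains(model, marker) {
			return true
		}
	}
	return false
}

// ResolveMode flips composed->isolation for non-weak models (Pass5:
// composed corpora can be anti-additive on strong models); forceMode
// bypasses the gate per G3.4.C.
func ResolveMode(requested, model string, forceMode bool) string {
	if forceMode {
		return requested
	}
	if requested == "codereview_lakehouse" && !isWeakModel(model) {
		return "codereview_isolation"
	}
	return requested
}

func main() {
	fmt.Println(ResolveMode("codereview_lakehouse", "grok-4.1-fast", false))
	fmt.Println(ResolveMode("codereview_lakehouse", "claude-opus-4", false))
}
```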

**Persistence:** each corpus's vectord index persists via the existing
G1P LHV1 format. The matrix layer is stateless above that — corpus
list lives in catalog, retrieval params in config.

**Why this is its own §3.x:** in Rust the matrix indexer was emergent
and got reduced to "we have vectord" in earlier port-planning. The
SPEC names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.

### §3.3 — UI (HTMX)

**Approach:** server-rendered Go templates using `html/template`,