SPEC §1 + §3.4: name matrix indexer as a port target

Adds matrix indexer as its own row in the §1 component table and a
new §3.4 with port plan. Distinct from vectord (substrate); lives at
internal/matrix/ + gateway /v1/matrix/*.

Five components in dependency order: corpus builders → multi-corpus
retrieve+merge → relevance filter → strong-model downgrade gate →
learning-loop integration.

Locks in the framing J flagged 2026-04-29: in Rust the matrix indexer
was emergent across mode.rs + build_*_corpus.ts + observer /relevance,
and earlier port-planning reduced it to "we have vectord." The SPEC
now names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.

Sharding-by-id was investigated as a throughput fix and rejected —
corpus-as-shard at the matrix layer is the existing retrieval shape
and parallelizes Adds for free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
root 2026-04-29 18:12:10 -05:00
parent f1c188323c
commit 71b35fb85e


@@ -28,6 +28,7 @@ Effort scale (one engineer-week = ~40h focused work):
| `queryd` | datafusion, arrow | `cmd/queryd` | **`duckdb/duckdb-go/v2`** (cgo, official) | **HARD** | high — see §3 |
| `ingestd` | csv, json, lopdf, postgres | `cmd/ingestd` | stdlib `encoding/csv`, `encoding/json`, `pdfcpu/pdfcpu`, `jackc/pgx/v5` | **L** | low |
| `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `apache/arrow-go/v18` | **L** | medium — re-validate HNSW recall |
| **matrix indexer** (emergent in Rust — `mode.rs` + `build_*_corpus.ts` + observer `/relevance`) | scripts/build_*_corpus.ts, crates/gateway/src/v1/mode.rs, mcp-server/observer.ts | `internal/matrix/` + gateway routes (`/v1/matrix/*`) | stdlib + vectord client | **L** | medium — see §3.4. Corpus-as-shard composer; relevance filter; strong-model downgrade gate; multi-corpus retrieve+merge. The learning-loop layer that lifts vectord from "static index" to "meta-index that learns from playbooks." |
| `vectord-lance` | lance | **DROPPED** | n/a | n/a | n/a — Parquet+HNSW only |
| `journald` | parquet, arrow | `cmd/journald` | `apache/arrow-go/v18` | **M** | low |
| `aibridge` | reqwest | library | `net/http` + connection pool · `anthropics/anthropic-sdk-go` available for direct Claude calls (currently routed via opencode) | **S** | low |
@@ -116,6 +117,81 @@ needs revisiting in Go to confirm the sidecar format we ship.
- G3.2.C — Recall@10 within 2% of Rust baseline on
`lakehouse_arch_v1`

### §3.4 — Matrix indexer (corpus-as-shard composer)

**What it is.** The matrix indexer is the layer above `vectord` that
turns a fleet of single-corpus HNSW indexes into a learning meta-index.
In the Rust system this is emergent — split between corpus builders
(`scripts/build_*_corpus.ts`), the mode runner (`crates/gateway/src/v1/mode.rs`),
the observer relevance endpoint (`mcp-server/observer.ts`), and the
strong-model downgrade gate (`mode.rs::execute`). In Go we name it
explicitly so future sessions don't reduce it to "vectord."

**Why corpus-as-shard, not shard-by-id.** Sharding a single index by
hash(id) is a pure throughput hack with a recall tax. Sharding by
corpus is the existing retrieval shape — `lakehouse_arch_v1`,
`lakehouse_symbols_v1`, `scrum_findings_v1`, `lakehouse_answers_v1`,
`kb_team_runs_v1`, `successful_playbooks_live`, etc. — each with
distinct topology and a distinct retrieval intent. Concurrent Adds
parallelize naturally because they go to different corpora; the
matrix layer's job is to retrieve+merge across them, filter for
relevance, and downgrade composition when strong models prove the
matrix is anti-additive.
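The "concurrent Adds parallelize naturally" claim can be made concrete. A minimal sketch, assuming hypothetical names (`corpusShard`, `Matrix`, `NewMatrix` are illustrative, not the real `internal/matrix` API): each corpus owns its own lock, so writers to different corpora never contend.

```go
package main

import (
	"fmt"
	"sync"
)

// corpusShard owns one corpus's index and its own lock, so Adds to
// different corpora never contend on a shared write-lock.
type corpusShard struct {
	mu    sync.Mutex
	items []string // stands in for the corpus's HNSW index
}

func (c *corpusShard) Add(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items = append(c.items, id)
}

// Matrix maps corpus name -> shard; the map is read-only after setup,
// so lookups need no lock of their own.
type Matrix struct {
	corpora map[string]*corpusShard
}

func NewMatrix(names ...string) *Matrix {
	m := &Matrix{corpora: make(map[string]*corpusShard)}
	for _, n := range names {
		m.corpora[n] = &corpusShard{}
	}
	return m
}

func main() {
	m := NewMatrix("lakehouse_arch_v1", "scrum_findings_v1")
	var wg sync.WaitGroup
	for name := range m.corpora {
		name := name
		wg.Add(1)
		go func() { // concurrent Adds land on different shards
			defer wg.Done()
			for i := 0; i < 100; i++ {
				m.corpora[name].Add(fmt.Sprintf("%s-%d", name, i))
			}
		}()
	}
	wg.Wait()
	fmt.Println(len(m.corpora["lakehouse_arch_v1"].items))
}
```

This is the structural reason gate G3.4.D's near-linear Add scaling is plausible: the only shared state is a read-only map.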

**Components to port (in dependency order):**
1. **Corpus builders** — Go equivalents of `scripts/build_*_corpus.ts`.
For each named corpus, a builder that reads source, splits into
chunks per the corpus's schema, embeds via `/v1/embed`, and adds
   to a vectord index of the same name. Effort: **M** for the first
   builder, **S** for each subsequent one.
2. **Multi-corpus retrieve+merge** (`internal/matrix/retrieve.go`) —
given a query and a list of corpus names, search each at top_k=K,
merge by score, return top N globally. Match Rust's pattern:
top_k=6 per corpus, top 8 globally before relevance filter.
3. **Relevance filter** (`internal/matrix/relevance.go`) — port the
threshold-based filter from `mcp-server/observer.ts:/relevance`.
Drops adjacency-pollution chunks that share a corpus with the hit
but aren't actually about the query. `LH_RELEVANCE_FILTER` /
`LH_RELEVANCE_THRESHOLD` env knobs preserved.
4. **Strong-model downgrade gate** (`internal/matrix/downgrade.go`) —
port `is_weak_model` + the `codereview_lakehouse → codereview_isolation`
flip from `mode.rs::execute`. Pass5 proved composed corpora lose
5/5 vs isolation on grok-4.1-fast (p=0.031); the gate is
load-bearing for paid-model retrieval quality.
5. **Learning-loop integration** — write outcomes back to a
playbook-memory corpus (probably `lakehouse_answers_v1` analogue).
This is what makes the matrix INDEX a learning system rather than
static retrieval. Per `feedback_meta_index_vision.md`: this is the
north star, not the data structure.
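Components 2 and 3 can be sketched together. The top_k=6 / top-8 numbers and the `LH_RELEVANCE_FILTER` / `LH_RELEVANCE_THRESHOLD` knobs are from this spec; the `Hit` shape, the `searcher` indirection, and the 0.5 default threshold are assumptions for illustration, not the real vectord client API.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
)

type Hit struct {
	Corpus string
	ID     string
	Score  float64 // higher is more similar
}

// searcher stands in for a vectord client call against one corpus's index.
type searcher func(corpus, query string, topK int) []Hit

// RetrieveMerge searches each corpus at perCorpus (Rust pattern: 6),
// merges by score, and returns the top globalN (Rust pattern: 8).
func RetrieveMerge(search searcher, query string, corpora []string, perCorpus, globalN int) []Hit {
	var all []Hit
	for _, c := range corpora {
		all = append(all, search(c, query, perCorpus)...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
	if len(all) > globalN {
		all = all[:globalN]
	}
	return all
}

// FilterRelevance drops hits below threshold, preserving the
// LH_RELEVANCE_FILTER / LH_RELEVANCE_THRESHOLD env knobs.
func FilterRelevance(hits []Hit) []Hit {
	if os.Getenv("LH_RELEVANCE_FILTER") == "off" {
		return hits
	}
	threshold := 0.5 // assumed default; the real value comes from config
	if v := os.Getenv("LH_RELEVANCE_THRESHOLD"); v != "" {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			threshold = f
		}
	}
	out := make([]Hit, 0, len(hits))
	for _, h := range hits {
		if h.Score >= threshold {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	fake := func(corpus, query string, topK int) []Hit {
		return []Hit{{corpus, corpus + "-0", 0.9}, {corpus, corpus + "-1", 0.3}}
	}
	hits := RetrieveMerge(fake, "q",
		[]string{"lakehouse_arch_v1", "scrum_findings_v1"}, 6, 8)
	fmt.Println(len(FilterRelevance(hits)))
}
```

Keeping the filter as a separate pass over the merged list is what lets `/v1/matrix/relevance` expose it standalone to external tooling.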

**Gateway routes:** `/v1/matrix/search` (multi-corpus retrieve+merge),
`/v1/matrix/corpora` (list + metadata), `/v1/matrix/relevance` (filter
endpoint, used by both internal callers and external tooling).

**Acceptance gates:**
- G3.4.A — `/v1/matrix/search` against ≥3 corpora returns merged top-N
with corpus attribution per result.
- G3.4.B — Relevance filter drops at least the threshold-margin chunks
on a known adjacency-pollution test case.
- G3.4.C — Strong-model downgrade gate flips composed→isolation when
the model is non-weak; bypassed when caller sets `force_mode`.
- G3.4.D — Concurrent Adds across N=4 corpora parallelize (no shared
write-lock); Add throughput scales near-linearly with corpus count.
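The G3.4.C behavior (component 4) fits in a few lines. The mode names and the force-bypass come from this spec; the weak-model substring markers and the matching rule are assumptions standing in for whatever `mode.rs::is_weak_model` actually checks:

```go
package main

import (
	"fmt"
	"strings"
)

// isWeakModel approximates mode.rs::is_weak_model. The marker list here
// is illustrative only; the port must copy the real Rust predicate.
func isWeakModel(model string) bool {
	for _, marker := range []string{"-fast", "mini", "haiku"} {
		if strings.Contains(model, marker) {
			return true
		}
	}
	return false
}

// ResolveMode flips composed->isolation for non-weak models (Pass5:
// composed corpora can be anti-additive on strong models); forceMode
// bypasses the gate per G3.4.C.
func ResolveMode(requested, model string, forceMode bool) string {
	if forceMode {
		return requested
	}
	if requested == "codereview_lakehouse" && !isWeakModel(model) {
		return "codereview_isolation"
	}
	return requested
}

func main() {
	fmt.Println(ResolveMode("codereview_lakehouse", "grok-4.1-fast", false))
	fmt.Println(ResolveMode("codereview_lakehouse", "claude-opus-4", false))
}
```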

**Persistence:** each corpus's vectord index persists via the existing
G1P LHV1 format. The matrix layer is stateless above that — corpus
list lives in catalog, retrieval params in config.

**Why this is its own §3.x:** in Rust the matrix indexer was emergent
and got reduced to "we have vectord" in earlier port-planning. The
SPEC names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.

### §3.3 — UI (HTMX)

**Approach:** server-rendered Go templates using `html/template`,