SPEC §1 + §3.4: name matrix indexer as a port target
Adds matrix indexer as its own row in the §1 component table and a new §3.4 with port plan. Distinct from vectord (substrate); lives at internal/matrix/ + gateway /v1/matrix/*. Five components in dependency order: corpus builders → multi-corpus retrieve+merge → relevance filter → strong-model downgrade gate → learning-loop integration.

Locks in the framing J flagged 2026-04-29: in Rust the matrix indexer was emergent across mode.rs + build_*_corpus.ts + observer /relevance, and earlier port-planning reduced it to "we have vectord." The SPEC now names it explicitly so the port preserves the multi-corpus retrieval shape AND the learning loop, not just the HNSW substrate.

Sharding-by-id was investigated as a throughput fix and rejected: corpus-as-shard at the matrix layer is the existing retrieval shape and parallelizes Adds for free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent f1c188323c
commit 71b35fb85e
docs/SPEC.md | 76
@@ -28,6 +28,7 @@ Effort scale (one engineer-week = ~40h focused work):
| `queryd` | datafusion, arrow | `cmd/queryd` | **`duckdb/duckdb-go/v2`** (cgo, official) | **HARD** | high — see §3 |
| `ingestd` | csv, json, lopdf, postgres | `cmd/ingestd` | stdlib `encoding/csv`, `encoding/json`, `pdfcpu/pdfcpu`, `jackc/pgx/v5` | **L** | low |
| `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `apache/arrow-go/v18` | **L** | medium — re-validate HNSW recall |
| **matrix indexer** (emergent in Rust — `mode.rs` + `build_*_corpus.ts` + observer `/relevance`) | scripts/build_*_corpus.ts, crates/gateway/src/v1/mode.rs, mcp-server/observer.ts | `internal/matrix/` + gateway routes (`/v1/matrix/*`) | stdlib + vectord client | **L** | medium — see §3.4. Corpus-as-shard composer; relevance filter; strong-model downgrade gate; multi-corpus retrieve+merge. The learning-loop layer that lifts vectord from "static index" to "meta-index that learns from playbooks." |
| `vectord-lance` | lance | **DROPPED** | n/a | n/a | n/a — Parquet+HNSW only |
| `journald` | parquet, arrow | `cmd/journald` | `apache/arrow-go/v18` | **M** | low |
| `aibridge` | reqwest | library | `net/http` + connection pool · `anthropics/anthropic-sdk-go` available for direct Claude calls (currently routed via opencode) | **S** | low |
@@ -116,6 +117,81 @@ needs revisiting in Go to confirm the sidecar format we ship.
- G3.2.C — Recall@10 within 2% of Rust baseline on
  `lakehouse_arch_v1`
### §3.4 — Matrix indexer (corpus-as-shard composer)

**What it is.** The matrix indexer is the layer above `vectord` that
turns a fleet of single-corpus HNSW indexes into a learning meta-index.
In the Rust system this is emergent — split between corpus builders
(`scripts/build_*_corpus.ts`), the mode runner (`crates/gateway/src/v1/mode.rs`),
the observer relevance endpoint (`mcp-server/observer.ts`), and the
strong-model downgrade gate (`mode.rs::execute`). In Go we name it
explicitly so future sessions don't reduce it to "vectord."

**Why corpus-as-shard, not shard-by-id.** Sharding a single index by
hash(id) is a pure throughput hack with a recall tax. Sharding by
corpus is the existing retrieval shape — `lakehouse_arch_v1`,
`lakehouse_symbols_v1`, `scrum_findings_v1`, `lakehouse_answers_v1`,
`kb_team_runs_v1`, `successful_playbooks_live`, etc. — each with
distinct topology and a distinct retrieval intent. Concurrent Adds
parallelize naturally because they go to different corpora; the
matrix layer's job is to retrieve+merge across them, filter for
relevance, and downgrade composition when strong models prove the
matrix is anti-additive.

**Components to port (in dependency order):**

1. **Corpus builders** — Go equivalents of `scripts/build_*_corpus.ts`.
   For each named corpus, a builder that reads the source, splits it into
   chunks per the corpus's schema, embeds via `/v1/embed`, and adds
   them to a vectord index of the same name. Effort: **M** for the first
   builder, **S** for each subsequent one.

2. **Multi-corpus retrieve+merge** (`internal/matrix/retrieve.go`) —
   given a query and a list of corpus names, search each at top_k=K,
   merge by score, return top N globally. Match Rust's pattern:
   top_k=6 per corpus, top 8 globally before the relevance filter.

3. **Relevance filter** (`internal/matrix/relevance.go`) — port the
   threshold-based filter from `mcp-server/observer.ts:/relevance`.
   Drops adjacency-pollution chunks that share a corpus with the hit
   but aren't actually about the query. The `LH_RELEVANCE_FILTER` /
   `LH_RELEVANCE_THRESHOLD` env knobs are preserved.

4. **Strong-model downgrade gate** (`internal/matrix/downgrade.go`) —
   port `is_weak_model` + the `codereview_lakehouse → codereview_isolation`
   flip from `mode.rs::execute`. Pass5 proved composed corpora lose
   5/5 vs isolation on grok-4.1-fast (p=0.031); the gate is
   load-bearing for paid-model retrieval quality.

5. **Learning-loop integration** — write outcomes back to a
   playbook-memory corpus (probably a `lakehouse_answers_v1` analogue).
   This is what makes the matrix INDEX a learning system rather than
   static retrieval. Per `feedback_meta_index_vision.md`: this is the
   north star, not the data structure.

**Gateway routes:** `/v1/matrix/search` (multi-corpus retrieve+merge),
`/v1/matrix/corpora` (list + metadata), `/v1/matrix/relevance` (filter
endpoint, used by both internal callers and external tooling).
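For orientation, an illustrative exchange against `/v1/matrix/search`. The SPEC fixes the routes but not the payloads, so every field name below is an assumption:

```text
POST /v1/matrix/search
{
  "query": "how does the catalog resolve corpus names?",
  "corpora": ["lakehouse_arch_v1", "lakehouse_symbols_v1", "scrum_findings_v1"],
  "top_k_per_corpus": 6,
  "top_n": 8
}

200 OK
{
  "results": [
    {"corpus": "lakehouse_arch_v1", "id": "chunk-17", "score": 0.91},
    {"corpus": "scrum_findings_v1", "id": "chunk-03", "score": 0.84}
  ]
}
```

The per-result `corpus` field is the one part that is not negotiable: G3.4.A requires corpus attribution on every merged result.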

**Acceptance gates:**

- G3.4.A — `/v1/matrix/search` against ≥3 corpora returns merged top-N
  with corpus attribution per result.
- G3.4.B — Relevance filter drops at least the threshold-margin chunks
  on a known adjacency-pollution test case.
- G3.4.C — Strong-model downgrade gate flips composed→isolation when
  the model is non-weak; bypassed when the caller sets `force_mode`.
- G3.4.D — Concurrent Adds across N=4 corpora parallelize (no shared
  write-lock); Add throughput scales near-linearly with corpus count.

**Persistence:** each corpus's vectord index persists via the existing
G1P LHV1 format. The matrix layer is stateless above that — corpus
list lives in catalog, retrieval params in config.

**Why this is its own §3.x:** in Rust the matrix indexer was emergent
and got reduced to "we have vectord" in earlier port-planning. The
SPEC names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.

### §3.3 — UI (HTMX)

**Approach:** server-rendered Go templates using `html/template`,