root 71b35fb85e SPEC §1 + §3.4: name matrix indexer as a port target
Adds matrix indexer as its own row in the §1 component table and a
new §3.4 with port plan. Distinct from vectord (substrate); lives at
internal/matrix/ + gateway /v1/matrix/*.

Five components in dependency order: corpus builders → multi-corpus
retrieve+merge → relevance filter → strong-model downgrade gate →
learning-loop integration.

Locks in the framing J flagged 2026-04-29: in Rust the matrix indexer
was emergent across mode.rs + build_*_corpus.ts + observer /relevance,
and earlier port-planning reduced it to "we have vectord." The SPEC
now names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.

Sharding-by-id was investigated as a throughput fix and rejected —
corpus-as-shard at the matrix layer is the existing retrieval shape
and parallelizes Adds for free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:12:10 -05:00


# SPEC: Lakehouse-Go Component Port Plan
**Status:** DRAFT — companion to `PRD.md`. Component-by-component port
plan with library choices, effort estimates, and acceptance gates.
**Created:** 2026-04-28
**Owner:** J
This spec answers: for each piece of the Rust Lakehouse, what Go
library carries it, what the effort looks like, and what gate proves
the port is real.
Effort scale (one engineer-week = ~40h focused work):
- **S** — 1–3 days
- **M** — 1 engineer-week
- **L** — 2–3 engineer-weeks
- **XL** — 1+ months
- **HARD** — open research, see PRD §Hard problems
---
## §1. Component port table — Rust crates
| Crate | Rust deps that mattered | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| `gateway` | axum, tokio, tonic, tower | `cmd/gateway` | `chi` + stdlib `net/http` + `google.golang.org/grpc` | **L** | low — Go's strongest domain |
| `catalogd` | parquet-rs, arrow, sqlite | `cmd/catalogd` | `apache/arrow-go/v18`, `mattn/go-sqlite3` | **L** | low |
| `storaged` | object_store, aws-sdk | `cmd/storaged` | `aws-sdk-go-v2`, `minio-go` for MinIO-specific paths | **M** | low |
| `queryd` | datafusion, arrow | `cmd/queryd` | **`duckdb/duckdb-go/v2`** (cgo, official) | **HARD** | high — see §3 |
| `ingestd` | csv, json, lopdf, postgres | `cmd/ingestd` | stdlib `encoding/csv`, `encoding/json`, `pdfcpu/pdfcpu`, `jackc/pgx/v5` | **L** | low |
| `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `apache/arrow-go/v18` | **L** | medium — re-validate HNSW recall |
| **matrix indexer** (emergent in Rust — `mode.rs` + `build_*_corpus.ts` + observer `/relevance`) | scripts/build_*_corpus.ts, crates/gateway/src/v1/mode.rs, mcp-server/observer.ts | `internal/matrix/` + gateway routes (`/v1/matrix/*`) | stdlib + vectord client | **L** | medium — see §3.4. Corpus-as-shard composer; relevance filter; strong-model downgrade gate; multi-corpus retrieve+merge. The learning-loop layer that lifts vectord from "static index" to "meta-index that learns from playbooks." |
| `vectord-lance` | lance | **DROPPED** | n/a | n/a | n/a — Parquet+HNSW only |
| `journald` | parquet, arrow | `cmd/journald` | `apache/arrow-go/v18` | **M** | low |
| `aibridge` | reqwest | library | `net/http` + connection pool · `anthropics/anthropic-sdk-go` available for direct Claude calls (currently routed via opencode) | **S** | low |
| `validator` | parquet, custom | library | `apache/arrow-go/v18` parquet reader | **M** | low — port the 24 unit tests as gates |
| `truth` | tomli, custom DSL | library | `pelletier/go-toml/v2` | **M** | low |
| `proto` | tonic-build | `proto/` + `protoc-gen-go` | `buf` + `protoc-gen-go-grpc` | **S** | low |
| `shared` | serde, anyhow | library | stdlib `encoding/json`, `errors` | **S** | low |
| `ui` | dioxus, wasm | **REPLACED** | `html/template` + HTMX | **L** | medium — see §3 |
| `lance-bench` | criterion | n/a — dropped with Lance | n/a | n/a | n/a |
**Total Rust crate port effort:** ~12–18 engineer-weeks (3–4 months for
one engineer; 6–8 weeks for two).
---
## §2. Component port table — TypeScript surfaces
| TS surface | Current location | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| `mcp-server/index.ts` | Bun, :3700 | `cmd/mcp` | **`modelcontextprotocol/go-sdk`** (official Go SDK, v1.5.0, Google collaboration) | **L** | medium — MCP semantics |
| `mcp-server/observer.ts` | Bun, :3800 | `cmd/observer` | stdlib `net/http`, `slog` | **M** | low |
| `mcp-server/tracing.ts` | Bun, Langfuse client | library | `go.opentelemetry.io/otel` + Langfuse Go client (or hand-roll) | **M** | low — Langfuse Go OSS support varies |
| `auditor/*.ts` | TS, runs as systemd | `cmd/auditor` | stdlib + `gitea API client` | **L** | medium — auditor cross-lineage logic is intricate |
| `tests/real-world/scrum_master_pipeline.ts` | TS, ad-hoc | `cmd/scrum` | stdlib | **L** | medium — chunking + embed + ladder logic |
| `tests/real-world/scrum_applier.ts` | TS, ad-hoc | `cmd/scrum-apply` | stdlib + git CLI shell-out | **M** | medium |
| `bot/propose.ts` | TS | `cmd/bot` | stdlib | **S** | low |
| Search demo HTML/JS | static | static (no port) | n/a | n/a | n/a — copied as-is |
**Total TS port effort:** ~6–10 engineer-weeks.
---
## §3. Hard problem details
### §3.1 — Query engine (DuckDB via cgo)
**Library:** `github.com/duckdb/duckdb-go/v2` — official Go bindings via
cgo. (Replaces the legacy `marcboeker/go-duckdb`, which was deprecated
when the DuckDB team and Marc Boeker jointly relocated maintenance to
the DuckDB org at v2.5.0. Migration is a one-line `gofmt -r` rewrite of
import paths.) Current version v2.10502.0 (April 2026), DuckDB v1.5.2
compat. Statically links default extensions: ICU, JSON, Parquet,
Autocomplete.
**API shape** (replaces the DataFusion `SessionContext` pattern):
```go
// The blank import registers the "duckdb" driver with database/sql:
//   import _ "github.com/duckdb/duckdb-go/v2"
db, err := sql.Open("duckdb", "") // empty DSN = in-memory database
if err != nil { /* handle */ }
defer db.Close()
db.Exec(`CREATE VIEW workers AS SELECT * FROM read_parquet('s3://bucket/workers/*.parquet')`)
rows, err := db.Query(`SELECT role, count(*) FROM workers WHERE state='IL' GROUP BY role`)
```
**Acceptance gates:**
- G3.1.A — `SELECT * FROM read_parquet('workers_500k.parquet') LIMIT 1`
returns a row with the expected schema. Establishes that Parquet reads
work.
- G3.1.B — Hybrid SQL+vector query (the `POST /vectors/hybrid`
surface) returns the same workers as the Rust path on the same input,
ranked the same way modulo embedding precision.
- G3.1.C — Hot-cache merge-on-read: register a base table + a delta
Parquet, query, observe both rows merged with the delta winning on
conflict.
**Fallback if cgo is rejected:** run DuckDB as an external process
(`duckdb -json -c '...'` shelled out, or HTTP via a thin Go wrapper).
Adds operational surface; preserves the SQL model.
### §3.2 — HNSW index
**Library:** `coder/hnsw` — pure-Go HNSW, in-process. Supports add /
delete / search / persist.
**Open question:** does `coder/hnsw` match the recall@10 we measured
on the Rust `hora` path? Need a calibration test:
- Rebuild `lakehouse_arch_v1` (the 1086-chunk arch corpus) in Go.
- Compare recall@10 on a fixed query set to the Rust baseline.
- Acceptance: ≤2% drop or we switch library / parameters.
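The calibration reduces to a small harness. A pure-Go sketch of the recall@K computation — the chunk IDs and query set here are toy placeholders; the real baseline comes from the Rust `hora` path on `lakehouse_arch_v1`:

```go
package main

import "fmt"

// recallAtK computes mean recall@K: for each query, the fraction of
// ground-truth neighbor IDs that appear in the candidate result list,
// averaged over the query set.
func recallAtK(truth, got [][]string) float64 {
	var sum float64
	for i := range truth {
		want := make(map[string]bool, len(truth[i]))
		for _, id := range truth[i] {
			want[id] = true
		}
		hits := 0
		for _, id := range got[i] {
			if want[id] {
				hits++
			}
		}
		sum += float64(hits) / float64(len(truth[i]))
	}
	return sum / float64(len(truth))
}

func main() {
	// Toy data: Rust baseline neighbors vs. Go coder/hnsw results for two queries.
	baseline := [][]string{{"a", "b"}, {"c", "d"}}
	goResults := [][]string{{"a", "b"}, {"c", "x"}}
	r := recallAtK(baseline, goResults)
	fmt.Printf("recall@2 = %.2f\n", r) // (2/2 + 1/2) / 2 = 0.75
	// Acceptance: fail the gate if recall drops more than 2% vs. the baseline.
	fmt.Println("within gate:", 1.0-r <= 0.02)
}
```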
**Persistence format:** TBD — `coder/hnsw` has its own snapshot format;
ADR equivalent of ADR-008 (Parquet for embeddings + sidecar HNSW file)
needs revisiting in Go to confirm the sidecar format we ship.
**Acceptance gates:**
- G3.2.A — Build HNSW from a Parquet of 100K vectors in <60s
- G3.2.B — Search 100K vectors at k=10 in <50ms p50
- G3.2.C — Recall@10 within 2% of the Rust baseline on
  `lakehouse_arch_v1`
### §3.4 — Matrix indexer (corpus-as-shard composer)
**What it is.** The matrix indexer is the layer above `vectord` that
turns a fleet of single-corpus HNSW indexes into a learning meta-index.
In the Rust system this is an emergent layer split across corpus builders
(`scripts/build_*_corpus.ts`), the mode runner (`crates/gateway/src/v1/mode.rs`),
the observer relevance endpoint (`mcp-server/observer.ts`), and the
strong-model downgrade gate (`mode.rs::execute`). In Go we name it
explicitly so future sessions don't reduce it to "vectord."
**Why corpus-as-shard, not shard-by-id.** Sharding a single index by
hash(id) is a pure throughput hack with a recall tax. Sharding by
corpus is the existing retrieval shape: `lakehouse_arch_v1`,
`lakehouse_symbols_v1`, `scrum_findings_v1`, `lakehouse_answers_v1`,
`kb_team_runs_v1`, `successful_playbooks_live`, etc., each with a
distinct topology and a distinct retrieval intent. Concurrent Adds
parallelize naturally because they go to different corpora; the
matrix layer's job is to retrieve+merge across them, filter for
relevance, and downgrade composition when strong models prove the
matrix is anti-additive.
**Components to port (in dependency order):**
1. **Corpus builders** — Go equivalents of `scripts/build_*_corpus.ts`.
For each named corpus, a builder that reads source, splits into
chunks per the corpus's schema, embeds via `/v1/embed`, and adds
to a vectord index of the same name. Effort: **M** for the first
builder, **S** for each subsequent.
2. **Multi-corpus retrieve+merge** (`internal/matrix/retrieve.go`) —
given a query and a list of corpus names, search each at top_k=K,
merge by score, return top N globally. Match Rust's pattern:
top_k=6 per corpus, top 8 globally before relevance filter.
3. **Relevance filter** (`internal/matrix/relevance.go`) — port the
threshold-based filter from `mcp-server/observer.ts:/relevance`.
Drops adjacency-pollution chunks that share a corpus with the hit
but aren't actually about the query. `LH_RELEVANCE_FILTER` /
`LH_RELEVANCE_THRESHOLD` env knobs preserved.
4. **Strong-model downgrade gate** (`internal/matrix/downgrade.go`) —
port `is_weak_model` + the `codereview_lakehouse → codereview_isolation`
flip from `mode.rs::execute`. Pass5 proved composed corpora lose
5/5 vs isolation on grok-4.1-fast (p=0.031); the gate is
load-bearing for paid-model retrieval quality.
5. **Learning-loop integration** — write outcomes back to a
playbook-memory corpus (probably `lakehouse_answers_v1` analogue).
This is what makes the matrix INDEX a learning system rather than
static retrieval. Per `feedback_meta_index_vision.md`: this is the
north star, not the data structure.
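Steps 2 and 3 are small enough to sketch in pure Go. The vectord client is stubbed as a function value, the corpus names and scores are toy data, and the threshold is passed explicitly (the real knob is `LH_RELEVANCE_THRESHOLD`):

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is one scored chunk from a single-corpus search.
type Hit struct {
	Corpus string
	ID     string
	Score  float64 // higher = more similar
}

// searchFn abstracts the per-corpus vectord search call.
type searchFn func(corpus, query string, topK int) []Hit

// retrieveMerge searches each corpus at perCorpus, merges by score, and
// returns the top n hits globally (Rust pattern: 6 per corpus, 8 global).
func retrieveMerge(search searchFn, corpora []string, query string, perCorpus, n int) []Hit {
	var all []Hit
	for _, c := range corpora {
		all = append(all, search(c, query, perCorpus)...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
	if len(all) > n {
		all = all[:n]
	}
	return all
}

// relevanceFilter drops hits below threshold — the adjacency-pollution
// guard ported from the observer /relevance endpoint.
func relevanceFilter(hits []Hit, threshold float64) []Hit {
	var kept []Hit
	for _, h := range hits {
		if h.Score >= threshold {
			kept = append(kept, h)
		}
	}
	return kept
}

func main() {
	stub := func(corpus, query string, topK int) []Hit {
		if corpus == "lakehouse_arch_v1" {
			return []Hit{{corpus, "arch-1", 0.91}, {corpus, "arch-2", 0.40}}
		}
		return []Hit{{corpus, "sym-1", 0.77}}
	}
	merged := retrieveMerge(stub, []string{"lakehouse_arch_v1", "lakehouse_symbols_v1"}, "q", 6, 8)
	final := relevanceFilter(merged, 0.5)
	fmt.Println(len(merged), len(final)) // 3 2
}
```

Corpus attribution falls out for free: each `Hit` carries the corpus it came from, which is what `/v1/matrix/search` returns per result.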
**Gateway routes:** `/v1/matrix/search` (multi-corpus retrieve+merge),
`/v1/matrix/corpora` (list + metadata), `/v1/matrix/relevance` (filter
endpoint, used by both internal callers and external tooling).
**Acceptance gates:**
- G3.4.A — `/v1/matrix/search` against 3 corpora returns merged top-N
  with corpus attribution per result.
- G3.4.B — Relevance filter drops at least the threshold-margin chunks
  on a known adjacency-pollution test case.
- G3.4.C — Strong-model downgrade gate flips composed→isolation when
  the model is non-weak; bypassed when the caller sets `force_mode`.
- G3.4.D — Concurrent Adds across N=4 corpora parallelize (no shared
  write-lock); Add throughput scales near-linearly with corpus count.
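G3.4.C's flip is a few lines. A sketch — the weak-model check here is an illustrative substring match with hypothetical model names, not the real `is_weak_model` table, which ports from `mode.rs`:

```go
package main

import (
	"fmt"
	"strings"
)

// isWeakModel reports whether a model benefits from composed corpora.
// ILLUSTRATIVE ONLY: substring checks on hypothetical model names stand
// in for the real table ported from mode.rs.
func isWeakModel(model string) bool {
	for _, w := range []string{"mini", "tiny", "lite"} {
		if strings.Contains(model, w) {
			return true
		}
	}
	return false
}

// selectMode applies the strong-model downgrade: composed corpora lose
// to isolation on non-weak models, unless the caller forces the mode.
func selectMode(requested, model string, forceMode bool) string {
	if requested == "codereview_lakehouse" && !isWeakModel(model) && !forceMode {
		return "codereview_isolation"
	}
	return requested
}

func main() {
	fmt.Println(selectMode("codereview_lakehouse", "acme-mini", false))  // weak model: composed kept
	fmt.Println(selectMode("codereview_lakehouse", "acme-large", false)) // non-weak: downgraded
	fmt.Println(selectMode("codereview_lakehouse", "acme-large", true))  // force_mode bypasses the gate
}
```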
**Persistence:** each corpus's vectord index persists via the existing
G1P LHV1 format. The matrix layer is stateless above that: the corpus
list lives in the catalog, retrieval params in config.
**Why this is its own §3.x:** in Rust the matrix indexer was emergent
and got reduced to "we have vectord" in earlier port-planning. The
SPEC names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.
### §3.3 — UI (HTMX)
**Approach:** server-rendered Go templates using `html/template`,
HTMX for partial-page swaps, Alpine.js for client-side interactivity
where needed. Single binary serves API + UI.
**Acceptance gates:**
- G3.3.A — `Ask` tab: type a natural-language question, get an answer
  from the RAG endpoint, rendered in-page without a full reload
- G3.3.B — `Explore` tab: paginated dataset list with hot-swap badge
  rendering
- G3.3.C — `SQL` tab: textarea submit → tabular result rendered
  in-page
- G3.3.D — `System` tab: live tail of `/storage/errors` and
  `/hnsw/trials` via HTMX polling
**Fallback if HTMX feels limiting:** a split repo, `golangLAKEHOUSE-ui`,
with Vite + React, served as static files by the Go gateway. Costs an
extra repo + build chain.
### §3.5 — Pathway memory port
**Constraint:** the Rust `pathway_memory` and TS implementations were
byte-matching per ADR-021. The byte contract was verified by running
both implementations on the same input tokens and asserting matching
bucket indices.
**Go port plan:**
- Port the 32-bucket SHA256-keyed token hash exactly. Verify on a
golden input that Go produces the same bucket vector as Rust.
- Port the JSON state file format verbatim — the existing 88 traces in
  `data/_pathway_memory/state.json` reload as-is into the Go
  implementation.
- Port the matrix-correctness layer (ADR-021's `SemanticFlag`,
  `BugFingerprint`, `TypeHint`) — these are pure value types,
  trivially portable.
**Acceptance gates:**
- G3.5.A — Load existing `state.json`, run `replay` on the same 11
  prior successful pathways; all 11 succeed (matching the Rust 11/11
  baseline).
- G3.5.B — The bucket vector for a fixed test input byte-matches the
  Rust output.
---
## §4. Phase plan
### Phase G0 — Skeleton (Weeks 1–3)
**Scope:** smallest end-to-end ingest + query path working in Go.
| Component | Deliverable |
|---|---|
| `cmd/gateway` | HTTP on :3100, `/health`, `/v1/chat` proxy stub |
| `cmd/catalogd` | In-memory registry + Parquet manifest persistence |
| `cmd/storaged` | Single-bucket S3 / local FS, no error journal yet |
| `cmd/ingestd` | CSV → Parquet, schema inference, register-on-ingest |
| `cmd/queryd` | DuckDB-backed `POST /sql` endpoint |
**Acceptance:** upload a CSV via `POST /ingest`, query it via
`POST /sql` with a SELECT, get rows back. Single-bucket. No vector,
no profile, no UI.
### Phase G1 — Vector + RAG (Weeks 4–6)
| Component | Deliverable |
|---|---|
| `cmd/vectord` | Embed-on-ingest (calls Python sidecar), HNSW build, `POST /search` |
| `cmd/gateway` | Add `POST /rag` (embed → search → retrieve → generate via aibridge) |
| `cmd/aibridge` | HTTP client to existing Python sidecar |
**Acceptance:** ingest 15K resumes (the original Phase 7 fixture),
ask "find me a forklift operator with OSHA-10 in IL", get ranked
results with LLM-generated explanation grounded in the retrieved
chunks.
### Phase G2 — Federation + profiles (Weeks 7–8)
| Component | Deliverable |
|---|---|
| `cmd/storaged` | Multi-bucket registry, rescue bucket, error journal at `primary://_errors/` |
| Profile system | Per-reader profile bound to bucket + vector index |
| Hot-swap | Atomic pointer swap for index generations |
**Acceptance:** two profiles bound to two buckets, queries scoped
correctly, hot-swap a vector index without query interruption,
rollback works.
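The hot-swap deliverable maps directly onto `sync/atomic.Pointer` (Go 1.19+): readers load the current generation lock-free, the swap is a single pointer store, and rollback is the same operation in reverse. The `indexGen` type below is illustrative:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// indexGen is one immutable generation of a vector index.
type indexGen struct {
	Generation int
	// ... HNSW graph, Parquet manifest, etc.
}

// current holds the live generation; queries Load it, swaps Store it.
var current atomic.Pointer[indexGen]

func main() {
	current.Store(&indexGen{Generation: 1})
	fmt.Println("serving gen", current.Load().Generation)

	// Build gen 2 off to the side, then swap atomically — in-flight
	// queries keep their gen-1 pointer; new queries see gen 2.
	next := &indexGen{Generation: 2}
	old := current.Swap(next)
	fmt.Println("swapped", old.Generation, "->", current.Load().Generation)

	// Rollback is the same operation in reverse.
	current.Store(old)
	fmt.Println("rolled back to gen", current.Load().Generation)
}
```

Because generations are immutable once published, no read lock is needed and query interruption never occurs — the acceptance criterion above.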
### Phase G3 — Pathway memory + distillation (Weeks 9–11)
| Component | Deliverable |
|---|---|
| `cmd/vectord` | Pathway memory module ported, 88 traces reloaded |
| Distillation pipeline | SFT export, contamination firewall, scorer |
| Audit baselines | `audit_baselines.jsonl` longitudinal signal port |
**Acceptance:** replay 11 prior successful pathways, all 11 succeed.
Re-run distillation acceptance on the frozen fixture set, 22/22 pass.
### Phase G4 — TS surfaces → Go (Weeks 12–14)
| Component | Deliverable |
|---|---|
| `cmd/mcp` | MCP server (replaces Bun): `/v1/chat`, intelligence endpoints |
| `cmd/observer` | Autonomous iteration loop, op recording |
| `cmd/auditor` | PR audit pipeline (kimi/haiku/opus rotation) |
| `cmd/scrum` | Scrum master pipeline (replaces TS) |
**Acceptance:** open a test PR, auditor cycles within 90s, emits
verdict to `data/_auditor/kimi_verdicts/`, behavior matches Rust+TS
era within tolerance.
### Phase G5 — UI + demo parity (Weeks 15–16)
| Component | Deliverable |
|---|---|
| `cmd/gateway` | Serves HTMX templates + static demo HTML |
| Demo at `devop.live/lakehouse/` | Parity with current Bun demo |
| Staffer console at `/console` | Parity |
**Acceptance:** `devop.live/lakehouse/` cuts over from Bun to Go
gateway. Section / / all render. Compact contract cards still
expand with Project Index. Fill-probability bars still paint.
---
## §5. Repo layout
```
golangLAKEHOUSE/
├── docs/
│ ├── PRD.md ← this PRD
│ ├── SPEC.md ← this spec
│ ├── DECISIONS.md ← Go-era ADRs (start fresh, reference Rust ADRs by number)
│ └── ADR-XXX-*.md ← per-ADR detail
├── cmd/
│ ├── gateway/ ← main HTTP/gRPC ingress
│ ├── catalogd/
│ ├── storaged/
│ ├── queryd/
│ ├── ingestd/
│ ├── vectord/
│ ├── journald/
│ ├── mcp/
│ ├── observer/
│ ├── auditor/
│ └── scrum/
├── internal/ ← shared packages, not exported
│ ├── aibridge/
│ ├── validator/
│ ├── truth/
│ ├── shared/
│ ├── proto/ ← generated protobuf
│ └── pathway/
├── pkg/ ← public Go packages (none initially)
├── web/ ← UI (HTMX templates + static)
│ ├── templates/
│ └── static/
├── scripts/ ← cold-start, smoke, distill scripts
├── tests/ ← golden files, integration tests
├── go.mod
├── go.sum
└── README.md
```
**Single Go module.** All commands and internal packages live under
`golangLAKEHOUSE/`. No nested modules unless a package needs an
independent release cadence (none expected).
**Build:** `go build ./cmd/...` produces all binaries.
---
## §6. Migration data plan
### What ports verbatim
- Parquet datasets at `data/datasets/*.parquet` — read by Go directly.
- Catalog manifests — Parquet; port as data, not code.
- Pathway memory state — JSON; ports if the pathway-memory
  byte-matching gate passes.
### What rebuilds
- HNSW indexes rebuild from Parquet embeddings on first Go startup.
- Auditor verdicts on PRs — old PRs won't be re-audited; lineage starts
  fresh on the new repo's PRs.
### What's archived
- The Rust `crates/` tree — preserved in the original repo at the
  cutover commit, tagged `pre-go-rewrite-2026-04-28` for reference.
- TS surfaces (`mcp-server/`, `auditor/`, etc.) — preserved in the
  original repo at the same tag.
- Distillation v1.0.0 substrate (tag `distillation-v1.0.0`,
  `e7636f2`) — kept as the historical reference; the Go
  re-implementation ports the LOGIC but not the
  bit-identical-reproducibility property unless an ADR re-establishes it.
### What's discarded
- `crates/vectord-lance/` (Lance backend, see PRD §Hard problems §2)
- `crates/lance-bench/` (criterion benchmarks specific to Lance)
---
## §7. Acceptance: when is the rewrite done?
The Go Lakehouse reaches **feature parity** when:
1. **All 12 Rust PRD invariants hold** (object-storage source of truth,
catalog metadata authority, idempotent ingest, hot-swap atomicity,
profiles, etc.).
2. **The 16 distillation acceptance gates pass** (re-run
`./scripts/distill audit-full` against the Go pipeline).
3. **The 22/22 acceptance fixtures from `tests/fixtures/distillation/
acceptance/` pass** under the Go implementation.
4. **The 145 unit tests of distillation v1.0.0 are ported and pass.**
5. **`devop.live/lakehouse/` demo cuts over to Go gateway** with no
visible UI regressions.
6. **Auditor emits Kimi/Haiku/Opus verdicts** on a test PR, matching
the cross-lineage rotation behavior.
7. **The 88 pathway traces replay** with 11/11 prior successes
reproduced.
At that point the Rust repo enters maintenance-only mode (security
fixes), and the Go repo becomes the live system.
---
## §8. Ratified — Phase G0 unblocked (2026-04-28, J)
| # | Decision | Spec impact |
|---|---|---|
| 1 | DuckDB via cgo (`marcboeker/go-duckdb`) | §3.1 option A — proceed |
| 2 | HTMX + `html/template` + Alpine.js | §3.3 option A — proceed |
| 3 | `git.agentview.dev/profit/golangLAKEHOUSE` | repo location locked |
| 4 | Distillation rebuilt in Go (no bit-identical port) | §6 — port logic, not fixtures |
| 5 | Pathway memory starts empty; old traces noted | §3.5 — G3.5.A is now "build initial state from scratch in Phase G3"; G3.5.B (byte-match) preserved as the porting correctness gate when the algorithm is reimplemented |
| 6 | Auditor longitudinal signal restarts | new `audit_baselines.jsonl` lineage starts on first Go-era PR |
See `docs/DECISIONS.md` ADR-001 for full rationale and
`docs/RUST_PATHWAY_MEMORY_NOTE.md` for where the legacy 88 traces live.
**Phase G0 is now unblocked.** Next step: bootstrap the Go module
skeleton + push to Gitea, then begin §4 Phase G0 implementation.