root 71b35fb85e SPEC §1 + §3.4: name matrix indexer as a port target
Adds matrix indexer as its own row in the §1 component table and a
new §3.4 with port plan. Distinct from vectord (substrate); lives at
internal/matrix/ + gateway /v1/matrix/*.

Five components in dependency order: corpus builders → multi-corpus
retrieve+merge → relevance filter → strong-model downgrade gate →
learning-loop integration.

Locks in the framing J flagged 2026-04-29: in Rust the matrix indexer
was emergent across mode.rs + build_*_corpus.ts + observer /relevance,
and earlier port-planning reduced it to "we have vectord." The SPEC
now names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.

Sharding-by-id was investigated as a throughput fix and rejected —
corpus-as-shard at the matrix layer is the existing retrieval shape
and parallelizes Adds for free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:12:10 -05:00


SPEC: Lakehouse-Go Component Port Plan

Status: DRAFT — companion to PRD.md. Component-by-component port plan with library choices, effort estimates, and acceptance gates.
Created: 2026-04-28 · Owner: J

This spec answers: for each piece of the Rust Lakehouse, what Go library carries it, what the effort looks like, and what gate proves the port is real.

Effort scale (one engineer-week = ~40h focused work):

  • S — 1–3 days
  • M — 1 engineer-week
  • L — 2–3 engineer-weeks
  • XL — 1+ months
  • HARD — open research, see PRD §Hard problems

§1. Component port table — Rust crates

Crate | Rust deps that mattered | Go target | Library | Effort | Risk
gateway | axum, tokio, tonic, tower | cmd/gateway | chi + stdlib net/http + google.golang.org/grpc | L | low — Go's strongest domain
catalogd | parquet-rs, arrow, sqlite | cmd/catalogd | apache/arrow-go/v18, mattn/go-sqlite3 | L | low
storaged | object_store, aws-sdk | cmd/storaged | aws-sdk-go-v2, minio-go for MinIO-specific paths | M | low
queryd | datafusion, arrow | cmd/queryd | duckdb/duckdb-go/v2 (cgo, official) | HARD | high — see §3
ingestd | csv, json, lopdf, postgres | cmd/ingestd | stdlib encoding/csv + encoding/json, pdfcpu/pdfcpu, jackc/pgx/v5 | L | low
vectord | hora, arrow, hnsw | cmd/vectord | coder/hnsw, apache/arrow-go/v18 | L | medium — re-validate HNSW recall
matrix indexer (emergent in Rust) | scripts/build_*_corpus.ts, crates/gateway/src/v1/mode.rs, mcp-server/observer.ts /relevance | internal/matrix/ + gateway routes (/v1/matrix/*) | stdlib + vectord client | L | medium — see §3.4. Corpus-as-shard composer; relevance filter; strong-model downgrade gate; multi-corpus retrieve+merge. The learning-loop layer that lifts vectord from "static index" to "meta-index that learns from playbooks."
vectord-lance | lance | DROPPED | n/a | n/a | n/a — Parquet+HNSW only
journald | parquet, arrow | cmd/journald | apache/arrow-go/v18 | M | low
aibridge | reqwest | library | net/http + connection pool; anthropics/anthropic-sdk-go available for direct Claude calls (currently routed via opencode) | S | low
validator | parquet, custom | library | apache/arrow-go/v18 parquet reader | M | low — port the 24 unit tests as gates
truth | tomli, custom DSL | library | pelletier/go-toml/v2 | M | low
proto | tonic-build | proto/ | buf + protoc-gen-go + protoc-gen-go-grpc | S | low
shared | serde, anyhow | library | stdlib encoding/json, errors | S | low
ui | dioxus, wasm | REPLACED | html/template + HTMX | L | medium — see §3
lance-bench | criterion | n/a — dropped with Lance | n/a | n/a | n/a

Total Rust crate port effort: ~12–18 engineer-weeks (3–4 months for one engineer; 6–8 weeks for two).


§2. Component port table — TypeScript surfaces

TS surface | Current location | Go target | Library | Effort | Risk
mcp-server/index.ts | Bun, :3700 | cmd/mcp | modelcontextprotocol/go-sdk (official Go SDK, v1.5.0, Google-collab) | L | medium — MCP semantics
mcp-server/observer.ts | Bun, :3800 | cmd/observer | stdlib net/http, slog | M | low
mcp-server/tracing.ts | Bun, Langfuse client | library | go.opentelemetry.io/otel + Langfuse Go client (or hand-roll) | M | low — Langfuse Go OSS support varies
auditor/*.ts | TS, runs as systemd | cmd/auditor | stdlib + gitea API client | L | medium — auditor cross-lineage logic is intricate
tests/real-world/scrum_master_pipeline.ts | TS, ad-hoc | cmd/scrum | stdlib | L | medium — chunking + embed + ladder logic
tests/real-world/scrum_applier.ts | TS, ad-hoc | cmd/scrum-apply | stdlib + git CLI shell-out | M | medium
bot/propose.ts | TS | cmd/bot | stdlib | S | low
Search demo | HTML/JS static | static (no port) | n/a | n/a | n/a — copied as-is

Total TS port effort: ~6–10 engineer-weeks.


§3. Hard problem details

§3.1 — Query engine (DuckDB via cgo)

Library: github.com/duckdb/duckdb-go/v2 — official Go bindings via cgo. (Replaces the legacy marcboeker/go-duckdb, which was deprecated when the DuckDB team and Marc Boeker jointly relocated maintenance to the DuckDB org at v2.5.0. Migration is a one-line gofmt -r rewrite of import paths.) Current version v2.10502.0 (April 2026), DuckDB v1.5.2 compat. Statically links default extensions: ICU, JSON, Parquet, Autocomplete.

API shape (replaces the DataFusion SessionContext pattern):

db, err := sql.Open("duckdb", "") // empty DSN = in-memory database
if err != nil { log.Fatal(err) }
defer db.Close()
db.Exec("CREATE VIEW workers AS SELECT * FROM read_parquet('s3://bucket/workers/*.parquet')")
rows, err := db.Query("SELECT role, count(*) FROM workers WHERE state='IL' GROUP BY role")
if err != nil { log.Fatal(err) }
defer rows.Close()

Acceptance gates:

  • G3.1.A — SELECT * FROM read_parquet('workers_500k.parquet') LIMIT 1 returns a row with the expected schema. Establishes Parquet read works.
  • G3.1.B — Hybrid SQL+vector query (the POST /vectors/hybrid surface) returns same workers as the Rust path on the same input, ranked the same way modulo embedding precision.
  • G3.1.C — Hot-cache merge-on-read: register a base table + a delta Parquet, query, observe both rows merged with the delta winning on conflict.
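
The delta-wins semantics G3.1.C checks can be sketched in plain Go, independent of DuckDB. This is only an illustration of the expected merge behavior; the `Row` type and field names are stand-ins, not the real record schema.

```go
package main

import "fmt"

// Row is a minimal stand-in for a Parquet record keyed by id.
type Row struct {
	ID    int
	Value string
}

// mergeOnRead overlays delta rows on base rows: every id present in
// delta wins; a base row survives only when delta has no row for its id.
func mergeOnRead(base, delta []Row) []Row {
	seen := make(map[int]bool, len(delta))
	out := make([]Row, 0, len(base)+len(delta))
	for _, d := range delta {
		seen[d.ID] = true
		out = append(out, d)
	}
	for _, b := range base {
		if !seen[b.ID] {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	base := []Row{{1, "old"}, {2, "keep"}}
	delta := []Row{{1, "new"}}
	fmt.Println(mergeOnRead(base, delta)) // delta row for id=1 wins, id=2 survives
}
```

In SQL the same shape is a union of the delta with an anti-join against it on the base table; the gate asserts the Go query path produces exactly this overlay.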

Fallback if cgo is rejected: run DuckDB as an external process (duckdb -json -c '...' shelled or HTTP via a thin Go wrapper). Adds operational surface; preserves SQL model.
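
A thin Go wrapper for that fallback is mostly argument plumbing. The sketch below only builds the invocation named above (`duckdb -json -c '...'`); the function name and the positional database-path argument are assumptions, and the caller still has to run the command and decode the JSON rows.

```go
package main

import (
	"fmt"
	"os/exec"
)

// duckdbCmd builds the external-process fallback invocation: the duckdb
// CLI with -json output, running one SQL statement against dbPath.
// Running and decoding the output is left to the caller.
func duckdbCmd(dbPath, query string) *exec.Cmd {
	return exec.Command("duckdb", "-json", "-c", query, dbPath)
}

func main() {
	cmd := duckdbCmd(":memory:", "SELECT 42 AS answer")
	fmt.Println(cmd.Args)
}
```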

§3.2 — HNSW index

Library: coder/hnsw — pure-Go HNSW, in-process. Supports add / delete / search / persist.

Open question: does coder/hnsw match the recall@10 we measured on the Rust hora path? Need a calibration test:

  • Rebuild lakehouse_arch_v1 (the 1086-chunk arch corpus) in Go.
  • Compare recall@10 on a fixed query set to the Rust baseline.
  • Acceptance: ≤2% drop or we switch library / parameters.
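
The calibration metric itself is ordinary recall@k. A minimal sketch of how the Go results would be scored against the Rust baseline (the ID values are stand-ins for chunk ids from the fixed query set):

```go
package main

import "fmt"

// recallAtK returns the fraction of ground-truth neighbor ids recovered,
// averaged over all queries. truth[i] and got[i] are the id lists for
// query i, each already truncated to k.
func recallAtK(truth, got [][]int) float64 {
	var hit, total int
	for i := range truth {
		in := make(map[int]bool, len(got[i]))
		for _, id := range got[i] {
			in[id] = true
		}
		for _, id := range truth[i] {
			total++
			if in[id] {
				hit++
			}
		}
	}
	return float64(hit) / float64(total)
}

func main() {
	truth := [][]int{{1, 2, 3, 4}} // Rust-baseline top-k for one query
	got := [][]int{{1, 2, 9, 4}}   // Go index returned 9 instead of 3
	fmt.Println(recallAtK(truth, got)) // 3 of 4 ids recovered
}
```

The acceptance check is then `recallGo >= recallRust - 0.02` over the fixed query set.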

Persistence format: TBD — coder/hnsw has its own snapshot format; ADR equivalent of ADR-008 (Parquet for embeddings + sidecar HNSW file) needs revisiting in Go to confirm the sidecar format we ship.

Acceptance gates:

  • G3.2.A — Build HNSW from a Parquet of 100K vectors in <60s
  • G3.2.B — Search 100K vectors at k=10 in <50ms p50
  • G3.2.C — Recall@10 within 2% of Rust baseline on lakehouse_arch_v1

§3.4 — Matrix indexer (corpus-as-shard composer)

What it is. The matrix indexer is the layer above vectord that turns a fleet of single-corpus HNSW indexes into a learning meta-index. In the Rust system this is emergent — split between corpus builders (scripts/build_*_corpus.ts), the mode runner (crates/gateway/src/v1/mode.rs), the observer relevance endpoint (mcp-server/observer.ts), and the strong-model downgrade gate (mode.rs::execute). In Go we name it explicitly so future sessions don't reduce it to "vectord."

Why corpus-as-shard, not shard-by-id. Sharding a single index by hash(id) is a pure throughput hack with a recall tax. Sharding by corpus is the existing retrieval shape — lakehouse_arch_v1, lakehouse_symbols_v1, scrum_findings_v1, lakehouse_answers_v1, kb_team_runs_v1, successful_playbooks_live, etc. — each with distinct topology and a distinct retrieval intent. Concurrent Adds parallelize naturally because they go to different corpora; the matrix layer's job is to retrieve+merge across them, filter for relevance, and downgrade composition when strong models prove the matrix is anti-additive.

Components to port (in dependency order):

  1. Corpus builders — Go equivalents of scripts/build_*_corpus.ts. For each named corpus, a builder that reads source, splits into chunks per the corpus's schema, embeds via /v1/embed, and adds to a vectord index of the same name. Effort: M for the first builder, S for each subsequent.

  2. Multi-corpus retrieve+merge (internal/matrix/retrieve.go) — given a query and a list of corpus names, search each at top_k=K, merge by score, return top N globally. Match Rust's pattern: top_k=6 per corpus, top 8 globally before relevance filter.

  3. Relevance filter (internal/matrix/relevance.go) — port the threshold-based filter from mcp-server/observer.ts:/relevance. Drops adjacency-pollution chunks that share a corpus with the hit but aren't actually about the query. LH_RELEVANCE_FILTER / LH_RELEVANCE_THRESHOLD env knobs preserved.

  4. Strong-model downgrade gate (internal/matrix/downgrade.go) — port is_weak_model + the codereview_lakehouse → codereview_isolation flip from mode.rs::execute. Pass5 proved composed corpora lose 5/5 vs isolation on grok-4.1-fast (p=0.031); the gate is load-bearing for paid-model retrieval quality.

  5. Learning-loop integration — write outcomes back to a playbook-memory corpus (probably lakehouse_answers_v1 analogue). This is what makes the matrix INDEX a learning system rather than static retrieval. Per feedback_meta_index_vision.md: this is the north star, not the data structure.
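
Components 2 and 3 reduce to a small amount of code. A sketch of the merge+filter shape, under stated assumptions: the `Hit` type is illustrative, and the real relevance filter ports the observer's scoring rather than the raw score cutoff shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is one scored chunk from a single-corpus search.
type Hit struct {
	Corpus string
	ID     string
	Score  float64 // higher = more similar
}

// mergeTopN merges per-corpus result lists by score and keeps the global
// top n — the shape of internal/matrix/retrieve.go (top_k=6 per corpus,
// top 8 globally in the Rust pattern).
func mergeTopN(perCorpus [][]Hit, n int) []Hit {
	var all []Hit
	for _, hits := range perCorpus {
		all = append(all, hits...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
	if len(all) > n {
		all = all[:n]
	}
	return all
}

// relevanceFilter drops hits below threshold — the shape of the observer
// /relevance filter (LH_RELEVANCE_THRESHOLD knob); the threshold form is
// an assumption here.
func relevanceFilter(hits []Hit, threshold float64) []Hit {
	out := hits[:0:0]
	for _, h := range hits {
		if h.Score >= threshold {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	merged := mergeTopN([][]Hit{
		{{"arch", "a1", 0.91}, {"arch", "a2", 0.40}}, // a2 = adjacency pollution
		{{"symbols", "s1", 0.88}},
	}, 8)
	fmt.Println(len(relevanceFilter(merged, 0.5))) // a2 dropped
}
```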

Gateway routes: /v1/matrix/search (multi-corpus retrieve+merge), /v1/matrix/corpora (list + metadata), /v1/matrix/relevance (filter endpoint, used by both internal callers and external tooling).
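
The downgrade gate (component 4) is a small decision function. A sketch of its shape; the weak-model markers below are ASSUMPTIONS for illustration — the real port copies the mode.rs::is_weak_model table verbatim.

```go
package main

import (
	"fmt"
	"strings"
)

// isWeakModel stands in for the ported is_weak_model table; the substring
// markers are placeholders, not the real list.
func isWeakModel(model string) bool {
	for _, marker := range []string{"haiku", "mini"} {
		if strings.Contains(model, marker) {
			return true
		}
	}
	return false
}

// effectiveMode applies the downgrade gate: composed retrieval
// (codereview_lakehouse) flips to codereview_isolation for non-weak
// models, unless the caller sets force_mode.
func effectiveMode(model, requested string, forceMode bool) string {
	if requested == "codereview_lakehouse" && !isWeakModel(model) && !forceMode {
		return "codereview_isolation"
	}
	return requested
}

func main() {
	fmt.Println(effectiveMode("grok-4.1-fast", "codereview_lakehouse", false)) // downgraded
	fmt.Println(effectiveMode("grok-4.1-fast", "codereview_lakehouse", true))  // force_mode bypasses
}
```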

Acceptance gates:

  • G3.4.A — /v1/matrix/search against ≥3 corpora returns merged top-N with corpus attribution per result.
  • G3.4.B — Relevance filter drops at least the threshold-margin chunks on a known adjacency-pollution test case.
  • G3.4.C — Strong-model downgrade gate flips composed→isolation when the model is non-weak; bypassed when caller sets force_mode.
  • G3.4.D — Concurrent Adds across N=4 corpora parallelize (no shared write-lock); Add throughput scales near-linearly with corpus count.

Persistence: each corpus's vectord index persists via the existing G1P LHV1 format. The matrix layer is stateless above that — corpus list lives in catalog, retrieval params in config.

Why this is its own §3.x: in Rust the matrix indexer was emergent and got reduced to "we have vectord" in earlier port-planning. The SPEC names it explicitly so the port preserves the multi-corpus retrieval shape AND the learning loop, not just the HNSW substrate.

§3.3 — UI (HTMX)

Approach: server-rendered Go templates using html/template, HTMX for partial-page swaps, Alpine.js for client-side interactivity where needed. Single binary serves API + UI.

Acceptance gates:

  • G3.3.A — Ask tab: type natural-language question, get answer from RAG endpoint, render in-page without full reload
  • G3.3.B — Explore tab: paginated dataset list with hot-swap badge rendering
  • G3.3.C — SQL tab: textarea → submit → tabular result rendered in-page
  • G3.3.D — System tab: live tail of /storage/errors and /hnsw/trials via HTMX polling

Fallback if HTMX feels limiting: split repo golangLAKEHOUSE-ui with Vite + React, served as static files by Go gateway. Costs an extra repo + build chain.

§3.5 — Pathway memory port

Constraint: the Rust pathway_memory and TS implementations were kept byte-identical per ADR-021. The byte contract was verified by running both implementations on the same input tokens and asserting matching bucket indices.

Go port plan:

  • Port the 32-bucket SHA256-keyed token hash exactly. Verify on a golden input that Go produces the same bucket vector as Rust.
  • Port the JSON state file format verbatim — the existing 88 traces in data/_pathway_memory/state.json reload as-is into the Go implementation.
  • Port the matrix-correctness layer (ADR-021's SemanticFlag, BugFingerprint, TypeHint) — these are pure value types, trivially portable.

Acceptance gates:

  • G3.5.A — Load existing state.json, run replay on the same 11 prior successful pathways, all 11 succeed (matching the Rust 11/11 baseline).
  • G3.5.B — Bucket vector for a fixed test input byte-matches the Rust output.
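
The 32-bucket token hash in the port plan above can be sketched as follows. The digest-to-bucket reduction here (first digest byte mod 32) is an ASSUMPTION for illustration; the port must copy the exact Rust reduction or the byte-match gate will fail.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// bucketOf maps a token to one of 32 buckets via SHA-256. The reduction
// below is a placeholder — the real algorithm is whatever Rust does.
func bucketOf(token string) int {
	sum := sha256.Sum256([]byte(token))
	return int(sum[0]) % 32
}

// bucketVector counts tokens per bucket — the 32-wide vector the
// byte-match gate compares against the Rust output.
func bucketVector(tokens []string) [32]int {
	var v [32]int
	for _, t := range tokens {
		v[bucketOf(t)]++
	}
	return v
}

func main() {
	v := bucketVector([]string{"fn", "main", "fn"})
	fmt.Println(v[bucketOf("fn")] >= 2) // same token always lands in the same bucket
}
```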

§4. Phase plan

Phase G0 — Skeleton (Weeks 1–3)

Scope: smallest end-to-end ingest + query path working in Go.

Component | Deliverable
cmd/gateway | HTTP on :3100, /health, /v1/chat proxy stub
cmd/catalogd | In-memory registry + Parquet manifest persistence
cmd/storaged | Single-bucket S3 / local FS, no error journal yet
cmd/ingestd | CSV → Parquet, schema inference, register-on-ingest
cmd/queryd | DuckDB-backed POST /sql endpoint

Acceptance: upload a CSV via POST /ingest, query it via POST /sql with a SELECT, get rows back. Single-bucket. No vector, no profile, no UI.

Phase G1 — Vector + RAG (Weeks 4–6)

Component | Deliverable
cmd/vectord | Embed-on-ingest (calls Python sidecar), HNSW build, POST /search
cmd/gateway | Add POST /rag (embed → search → retrieve → generate via aibridge)
cmd/aibridge | HTTP client to existing Python sidecar

Acceptance: ingest 15K resumes (the original Phase 7 fixture), ask "find me a forklift operator with OSHA-10 in IL", get ranked results with LLM-generated explanation grounded in the retrieved chunks.

Phase G2 — Federation + profiles (Weeks 7–8)

Component | Deliverable
cmd/storaged | Multi-bucket registry, rescue bucket, error journal at primary://_errors/
Profile system | Per-reader profile bound to bucket + vector index
Hot-swap | Atomic pointer swap for index generations

Acceptance: two profiles bound to two buckets, queries scoped correctly, hot-swap a vector index without query interruption, rollback works.

Phase G3 — Pathway memory + distillation (Weeks 9–11)

Component | Deliverable
cmd/vectord | Pathway memory module ported, 88 traces reloaded
Distillation pipeline | SFT export, contamination firewall, scorer
Audit baselines | audit_baselines.jsonl longitudinal signal port

Acceptance: replay 11 prior successful pathways, all 11 succeed. Re-run distillation acceptance on the frozen fixture set, 22/22 pass.

Phase G4 — TS surfaces → Go (Weeks 12–14)

Component | Deliverable
cmd/mcp | MCP server (replaces Bun) — /v1/chat, intelligence endpoints
cmd/observer | Autonomous iteration loop, op recording
cmd/auditor | PR audit pipeline (kimi/haiku/opus rotation)
cmd/scrum | Scrum master pipeline (replaces TS)

Acceptance: open a test PR, auditor cycles within 90s, emits verdict to data/_auditor/kimi_verdicts/, behavior matches Rust+TS era within tolerance.

Phase G5 — UI + demo parity (Weeks 15–16)

Component | Deliverable
cmd/gateway | Serves HTMX templates + static demo HTML
Demo at devop.live/lakehouse/ | Parity with current Bun demo
Staffer console at /console | Parity

Acceptance: devop.live/lakehouse/ cuts over from Bun to Go gateway. Section ① / ② / ③ all render. Compact contract cards still expand with Project Index. Fill-probability bars still paint.


§5. Repo layout

golangLAKEHOUSE/
├── docs/
│   ├── PRD.md                    ← this PRD
│   ├── SPEC.md                   ← this spec
│   ├── DECISIONS.md              ← Go-era ADRs (start fresh, reference Rust ADRs by number)
│   └── ADR-XXX-*.md              ← per-ADR detail
├── cmd/
│   ├── gateway/                  ← main HTTP/gRPC ingress
│   ├── catalogd/
│   ├── storaged/
│   ├── queryd/
│   ├── ingestd/
│   ├── vectord/
│   ├── journald/
│   ├── mcp/
│   ├── observer/
│   ├── auditor/
│   └── scrum/
├── internal/                     ← shared packages, not exported
│   ├── aibridge/
│   ├── validator/
│   ├── truth/
│   ├── shared/
│   ├── proto/                    ← generated protobuf
│   └── pathway/
├── pkg/                          ← public Go packages (none initially)
├── web/                          ← UI (HTMX templates + static)
│   ├── templates/
│   └── static/
├── scripts/                      ← cold-start, smoke, distill scripts
├── tests/                        ← golden files, integration tests
├── go.mod
├── go.sum
└── README.md

Single Go module. All commands and internal packages live under golangLAKEHOUSE/. No nested modules unless a package needs an independent release cadence (none expected).

Build: go build ./cmd/... produces all binaries.


§6. Migration data plan

What ports verbatim

  • Parquet datasets at data/datasets/*.parquet — read by Go directly.
  • Catalog manifests — Parquet, ports as data not code.
  • Pathway memory state — JSON, ports if the pathway-memory byte-matching gate passes.

What rebuilds

  • HNSW indexes — rebuild from Parquet embeddings on first Go startup.
  • Auditor verdicts on PRs — old PRs won't be re-audited; lineage starts fresh on the new repo's PRs.

What's archived

  • The Rust crates/ tree — preserved in the original repo at the cutover commit, tagged pre-go-rewrite-2026-04-28 for reference.
  • TS surfaces (mcp-server/, auditor/, etc.) — preserved in the original repo at the same tag.
  • Distillation v1.0.0 substrate (tag distillation-v1.0.0, e7636f2) — kept as the historical reference; Go re-implementation ports the LOGIC but not the bit-identical-reproducibility property unless an ADR re-establishes it.

What's discarded

  • crates/vectord-lance/ (Lance backend, see PRD §Hard problems §2)
  • crates/lance-bench/ (criterion benchmarks specific to Lance)

§7. Acceptance: when is the rewrite done?

The Go Lakehouse reaches feature parity when:

  1. All 12 Rust PRD invariants hold (object-storage source of truth, catalog metadata authority, idempotent ingest, hot-swap atomicity, profiles, etc.).
  2. The 16 distillation acceptance gates pass (re-run ./scripts/distill audit-full against the Go pipeline).
  3. The 22/22 acceptance fixtures from tests/fixtures/distillation/acceptance/ pass under the Go implementation.
  4. The 145 unit tests of distillation v1.0.0 are ported and pass.
  5. devop.live/lakehouse/ demo cuts over to Go gateway with no visible UI regressions.
  6. Auditor emits Kimi/Haiku/Opus verdicts on a test PR, matching the cross-lineage rotation behavior.
  7. The 88 pathway traces replay with 11/11 prior successes reproduced.

At that point the Rust repo enters maintenance-only mode (security fixes), and the Go repo becomes the live system.


§8. Ratified — Phase G0 unblocked (2026-04-28, J)

# | Decision | Spec impact
1 | DuckDB via cgo (marcboeker/go-duckdb) | §3.1 option A — proceed
2 | HTMX + html/template + Alpine.js | §3.3 option A — proceed
3 | git.agentview.dev/profit/golangLAKEHOUSE | repo location locked
4 | Distillation rebuilt in Go (no bit-identical port) | §6 — port logic, not fixtures
5 | Pathway memory starts empty; old traces noted | the pathway-memory replay gate is now "build initial state from scratch in Phase G3"; the byte-match gate is preserved as the porting correctness gate when the algorithm is reimplemented
6 | Auditor longitudinal signal restarts | new audit_baselines.jsonl lineage starts on first Go-era PR

See docs/DECISIONS.md ADR-001 for full rationale and docs/RUST_PATHWAY_MEMORY_NOTE.md for where the legacy 88 traces live.

Phase G0 is now unblocked. Next step: bootstrap the Go module skeleton + push to Gitea, then begin §4 Phase G0 implementation.