root 71b35fb85e SPEC §1 + §3.4: name matrix indexer as a port target
Adds matrix indexer as its own row in the §1 component table and a
new §3.4 with port plan. Distinct from vectord (substrate); lives at
internal/matrix/ + gateway /v1/matrix/*.

Five components in dependency order: corpus builders → multi-corpus
retrieve+merge → relevance filter → strong-model downgrade gate →
learning-loop integration.

Locks in the framing J flagged 2026-04-29: in Rust the matrix indexer
was emergent across mode.rs + build_*_corpus.ts + observer /relevance,
and earlier port-planning reduced it to "we have vectord." The SPEC
now names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.

Sharding-by-id was investigated as a throughput fix and rejected —
corpus-as-shard at the matrix layer is the existing retrieval shape
and parallelizes Adds for free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:12:10 -05:00


SPEC: Lakehouse-Go Component Port Plan

Status: DRAFT — companion to PRD.md. Component-by-component port plan with library choices, effort estimates, and acceptance gates.
Created: 2026-04-28 · Owner: J

This spec answers: for each piece of the Rust Lakehouse, what Go library carries it, what the effort looks like, and what gate proves the port is real.

Effort scale (one engineer-week = ~40h focused work):

  • S — 1–3 days
  • M — 1 engineer-week
  • L — 2–3 engineer-weeks
  • XL — 1+ months
  • HARD — open research, see PRD §Hard problems

§1. Component port table — Rust crates

Crate | Rust deps that mattered | Go target | Library | Effort | Risk
gateway | axum, tokio, tonic, tower | cmd/gateway | chi + stdlib net/http + google.golang.org/grpc | L | low — Go's strongest domain
catalogd | parquet-rs, arrow, sqlite | cmd/catalogd | apache/arrow-go/v18, mattn/go-sqlite3 | L | low
storaged | object_store, aws-sdk | cmd/storaged | aws-sdk-go-v2, minio-go for MinIO-specific paths | M | low
queryd | datafusion, arrow | cmd/queryd | duckdb/duckdb-go/v2 (cgo, official) | HARD | high — see §3
ingestd | csv, json, lopdf, postgres | cmd/ingestd | stdlib encoding/csv + encoding/json, pdfcpu/pdfcpu, jackc/pgx/v5 | L | low
vectord | hora, arrow, hnsw | cmd/vectord | coder/hnsw, apache/arrow-go/v18 | L | medium — re-validate HNSW recall
matrix indexer (emergent in Rust) | scripts/build_*_corpus.ts, crates/gateway/src/v1/mode.rs, mcp-server/observer.ts /relevance | internal/matrix/ + gateway routes (/v1/matrix/*) | stdlib + vectord client | L | medium — see §3.4. Corpus-as-shard composer; relevance filter; strong-model downgrade gate; multi-corpus retrieve+merge. The learning-loop layer that lifts vectord from "static index" to "meta-index that learns from playbooks."
vectord-lance | lance | DROPPED | n/a | n/a | n/a — Parquet+HNSW only
journald | parquet, arrow | cmd/journald | apache/arrow-go/v18 | M | low
aibridge | reqwest | library | net/http + connection pool; anthropics/anthropic-sdk-go available for direct Claude calls (currently routed via opencode) | S | low
validator | parquet, custom | library | apache/arrow-go/v18 parquet reader | M | low — port the 24 unit tests as gates
truth | tomli, custom DSL | library | pelletier/go-toml/v2 | M | low
proto | tonic-build | proto/ | buf + protoc-gen-go + protoc-gen-go-grpc | S | low
shared | serde, anyhow | library | stdlib encoding/json, errors | S | low
ui | dioxus, wasm | REPLACED | html/template + HTMX | L | medium — see §3
lance-bench | criterion | n/a — dropped with Lance | n/a | n/a | n/a

Total Rust crate port effort: ~12–18 engineer-weeks (3–4 months for one engineer; 6–8 weeks for two).


§2. Component port table — TypeScript surfaces

TS surface | Current location | Go target | Library | Effort | Risk
mcp-server/index.ts | Bun, :3700 | cmd/mcp | modelcontextprotocol/go-sdk (official Go SDK, v1.5.0, Google-collab) | L | medium — MCP semantics
mcp-server/observer.ts | Bun, :3800 | cmd/observer | stdlib net/http, slog | M | low
mcp-server/tracing.ts | Bun, Langfuse client | library | go.opentelemetry.io/otel + Langfuse Go client (or hand-roll) | M | low — Langfuse Go OSS support varies
auditor/*.ts | TS, runs as systemd | cmd/auditor | stdlib + gitea API client | L | medium — auditor cross-lineage logic is intricate
tests/real-world/scrum_master_pipeline.ts | TS, ad-hoc | cmd/scrum | stdlib | L | medium — chunking + embed + ladder logic
tests/real-world/scrum_applier.ts | TS, ad-hoc | cmd/scrum-apply | stdlib + git CLI shell-out | M | medium
bot/propose.ts | TS | cmd/bot | stdlib | S | low
Search demo | HTML/JS static | static (no port) | n/a | n/a | n/a — copied as-is

Total TS port effort: ~6–10 engineer-weeks.


§3. Hard problem details

§3.1 — Query engine (DuckDB via cgo)

Library: github.com/duckdb/duckdb-go/v2 — official Go bindings via cgo. (Replaces the legacy marcboeker/go-duckdb, which was deprecated when the DuckDB team and Marc Boeker jointly relocated maintenance to the DuckDB org at v2.5.0. Migration is a one-line gofmt -r rewrite of import paths.) Current version v2.10502.0 (April 2026), DuckDB v1.5.2 compat. Statically links default extensions: ICU, JSON, Parquet, Autocomplete.

API shape (replaces the DataFusion SessionContext pattern):

db, err := sql.Open("duckdb", "") // empty DSN = in-memory database
if err != nil { log.Fatal(err) }
defer db.Close()
db.Exec("CREATE VIEW workers AS SELECT * FROM read_parquet('s3://bucket/workers/*.parquet')")
rows, err := db.Query("SELECT role, count(*) FROM workers WHERE state='IL' GROUP BY role")
if err != nil { log.Fatal(err) }
defer rows.Close()

Acceptance gates:

  • G3.1.A — SELECT * FROM read_parquet('workers_500k.parquet') LIMIT 1 returns a row with the expected schema. Establishes Parquet read works.
  • G3.1.B — Hybrid SQL+vector query (the POST /vectors/hybrid surface) returns same workers as the Rust path on the same input, ranked the same way modulo embedding precision.
  • G3.1.C — Hot-cache merge-on-read: register a base table + a delta Parquet, query, observe both rows merged with the delta winning on conflict.
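
The delta-wins semantics G3.1.C checks can be sketched in plain Go, independent of DuckDB. This is only an illustration of the expected merge behavior; the `Row` type and field names are stand-ins, not the real record schema.

```go
package main

import "fmt"

// Row is a minimal stand-in for a Parquet record keyed by id.
type Row struct {
	ID    int
	Value string
}

// mergeOnRead overlays delta rows on base rows: every id present in
// delta wins; a base row survives only when delta has no row for its id.
func mergeOnRead(base, delta []Row) []Row {
	seen := make(map[int]bool, len(delta))
	out := make([]Row, 0, len(base)+len(delta))
	for _, d := range delta {
		seen[d.ID] = true
		out = append(out, d)
	}
	for _, b := range base {
		if !seen[b.ID] {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	base := []Row{{1, "old"}, {2, "keep"}}
	delta := []Row{{1, "new"}}
	fmt.Println(mergeOnRead(base, delta)) // delta row for id=1 wins, id=2 survives
}
```

In SQL the same shape is a union of the delta with an anti-join against it on the base table; the gate asserts the Go query path produces exactly this overlay.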

Fallback if cgo is rejected: run DuckDB as an external process (duckdb -json -c '...' shelled or HTTP via a thin Go wrapper). Adds operational surface; preserves SQL model.
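
A thin Go wrapper for that fallback is mostly argument plumbing. The sketch below only builds the invocation named above (`duckdb -json -c '...'`); the function name and the positional database-path argument are assumptions, and the caller still has to run the command and decode the JSON rows.

```go
package main

import (
	"fmt"
	"os/exec"
)

// duckdbCmd builds the external-process fallback invocation: the duckdb
// CLI with -json output, running one SQL statement against dbPath.
// Running and decoding the output is left to the caller.
func duckdbCmd(dbPath, query string) *exec.Cmd {
	return exec.Command("duckdb", "-json", "-c", query, dbPath)
}

func main() {
	cmd := duckdbCmd(":memory:", "SELECT 42 AS answer")
	fmt.Println(cmd.Args)
}
```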

§3.2 — HNSW index

Library: coder/hnsw — pure-Go HNSW, in-process. Supports add / delete / search / persist.

Open question: does coder/hnsw match the recall@10 we measured on the Rust hora path? Need a calibration test:

  • Rebuild lakehouse_arch_v1 (the 1086-chunk arch corpus) in Go.
  • Compare recall@10 on a fixed query set to the Rust baseline.
  • Acceptance: ≤2% drop or we switch library / parameters.
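
The calibration metric itself is ordinary recall@k. A minimal sketch of how the Go results would be scored against the Rust baseline (the ID values are stand-ins for chunk ids from the fixed query set):

```go
package main

import "fmt"

// recallAtK returns the fraction of ground-truth neighbor ids recovered,
// averaged over all queries. truth[i] and got[i] are the id lists for
// query i, each already truncated to k.
func recallAtK(truth, got [][]int) float64 {
	var hit, total int
	for i := range truth {
		in := make(map[int]bool, len(got[i]))
		for _, id := range got[i] {
			in[id] = true
		}
		for _, id := range truth[i] {
			total++
			if in[id] {
				hit++
			}
		}
	}
	return float64(hit) / float64(total)
}

func main() {
	truth := [][]int{{1, 2, 3, 4}} // Rust-baseline top-k for one query
	got := [][]int{{1, 2, 9, 4}}   // Go index returned 9 instead of 3
	fmt.Println(recallAtK(truth, got)) // 3 of 4 ids recovered
}
```

The acceptance check is then `recallGo >= recallRust - 0.02` over the fixed query set.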

Persistence format: TBD — coder/hnsw has its own snapshot format; ADR equivalent of ADR-008 (Parquet for embeddings + sidecar HNSW file) needs revisiting in Go to confirm the sidecar format we ship.

Acceptance gates:

  • G3.2.A — Build HNSW from a Parquet of 100K vectors in <60s
  • G3.2.B — Search 100K vectors at k=10 in <50ms p50
  • G3.2.C — Recall@10 within 2% of Rust baseline on lakehouse_arch_v1

§3.4 — Matrix indexer (corpus-as-shard composer)

What it is. The matrix indexer is the layer above vectord that turns a fleet of single-corpus HNSW indexes into a learning meta-index. In the Rust system this is emergent — split between corpus builders (scripts/build_*_corpus.ts), the mode runner (crates/gateway/src/v1/mode.rs), the observer relevance endpoint (mcp-server/observer.ts), and the strong-model downgrade gate (mode.rs::execute). In Go we name it explicitly so future sessions don't reduce it to "vectord."

Why corpus-as-shard, not shard-by-id. Sharding a single index by hash(id) is a pure throughput hack with a recall tax. Sharding by corpus is the existing retrieval shape — lakehouse_arch_v1, lakehouse_symbols_v1, scrum_findings_v1, lakehouse_answers_v1, kb_team_runs_v1, successful_playbooks_live, etc. — each with distinct topology and a distinct retrieval intent. Concurrent Adds parallelize naturally because they go to different corpora; the matrix layer's job is to retrieve+merge across them, filter for relevance, and downgrade composition when strong models prove the matrix is anti-additive.

Components to port (in dependency order):

  1. Corpus builders — Go equivalents of scripts/build_*_corpus.ts. For each named corpus, a builder that reads source, splits into chunks per the corpus's schema, embeds via /v1/embed, and adds to a vectord index of the same name. Effort: M for the first builder, S for each subsequent.

  2. Multi-corpus retrieve+merge (internal/matrix/retrieve.go) — given a query and a list of corpus names, search each at top_k=K, merge by score, return top N globally. Match Rust's pattern: top_k=6 per corpus, top 8 globally before relevance filter.

  3. Relevance filter (internal/matrix/relevance.go) — port the threshold-based filter from mcp-server/observer.ts:/relevance. Drops adjacency-pollution chunks that share a corpus with the hit but aren't actually about the query. LH_RELEVANCE_FILTER / LH_RELEVANCE_THRESHOLD env knobs preserved.

  4. Strong-model downgrade gate (internal/matrix/downgrade.go) — port is_weak_model + the codereview_lakehouse → codereview_isolation flip from mode.rs::execute. Pass5 proved composed corpora lose 5/5 vs isolation on grok-4.1-fast (p=0.031); the gate is load-bearing for paid-model retrieval quality.

  5. Learning-loop integration — write outcomes back to a playbook-memory corpus (probably lakehouse_answers_v1 analogue). This is what makes the matrix INDEX a learning system rather than static retrieval. Per feedback_meta_index_vision.md: this is the north star, not the data structure.
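
Components 2 and 3 reduce to a small amount of code. A sketch of the merge+filter shape, under stated assumptions: the `Hit` type is illustrative, and the real relevance filter ports the observer's scoring rather than the raw score cutoff shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is one scored chunk from a single-corpus search.
type Hit struct {
	Corpus string
	ID     string
	Score  float64 // higher = more similar
}

// mergeTopN merges per-corpus result lists by score and keeps the global
// top n — the shape of internal/matrix/retrieve.go (top_k=6 per corpus,
// top 8 globally in the Rust pattern).
func mergeTopN(perCorpus [][]Hit, n int) []Hit {
	var all []Hit
	for _, hits := range perCorpus {
		all = append(all, hits...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
	if len(all) > n {
		all = all[:n]
	}
	return all
}

// relevanceFilter drops hits below threshold — the shape of the observer
// /relevance filter (LH_RELEVANCE_THRESHOLD knob); the threshold form is
// an assumption here.
func relevanceFilter(hits []Hit, threshold float64) []Hit {
	out := hits[:0:0]
	for _, h := range hits {
		if h.Score >= threshold {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	merged := mergeTopN([][]Hit{
		{{"arch", "a1", 0.91}, {"arch", "a2", 0.40}}, // a2 = adjacency pollution
		{{"symbols", "s1", 0.88}},
	}, 8)
	fmt.Println(len(relevanceFilter(merged, 0.5))) // a2 dropped
}
```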

Gateway routes: /v1/matrix/search (multi-corpus retrieve+merge), /v1/matrix/corpora (list + metadata), /v1/matrix/relevance (filter endpoint, used by both internal callers and external tooling).
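
The downgrade gate (component 4) is a small decision function. A sketch of its shape; the weak-model markers below are ASSUMPTIONS for illustration — the real port copies the mode.rs::is_weak_model table verbatim.

```go
package main

import (
	"fmt"
	"strings"
)

// isWeakModel stands in for the ported is_weak_model table; the substring
// markers are placeholders, not the real list.
func isWeakModel(model string) bool {
	for _, marker := range []string{"haiku", "mini"} {
		if strings.Contains(model, marker) {
			return true
		}
	}
	return false
}

// effectiveMode applies the downgrade gate: composed retrieval
// (codereview_lakehouse) flips to codereview_isolation for non-weak
// models, unless the caller sets force_mode.
func effectiveMode(model, requested string, forceMode bool) string {
	if requested == "codereview_lakehouse" && !isWeakModel(model) && !forceMode {
		return "codereview_isolation"
	}
	return requested
}

func main() {
	fmt.Println(effectiveMode("grok-4.1-fast", "codereview_lakehouse", false)) // downgraded
	fmt.Println(effectiveMode("grok-4.1-fast", "codereview_lakehouse", true))  // force_mode bypasses
}
```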

Acceptance gates:

  • G3.4.A — /v1/matrix/search against ≥3 corpora returns merged top-N with corpus attribution per result.
  • G3.4.B — Relevance filter drops at least the threshold-margin chunks on a known adjacency-pollution test case.
  • G3.4.C — Strong-model downgrade gate flips composed→isolation when the model is non-weak; bypassed when caller sets force_mode.
  • G3.4.D — Concurrent Adds across N=4 corpora parallelize (no shared write-lock); Add throughput scales near-linearly with corpus count.

Persistence: each corpus's vectord index persists via the existing G1P LHV1 format. The matrix layer is stateless above that — corpus list lives in catalog, retrieval params in config.

Why this is its own §3.x: in Rust the matrix indexer was emergent and got reduced to "we have vectord" in earlier port-planning. The SPEC names it explicitly so the port preserves the multi-corpus retrieval shape AND the learning loop, not just the HNSW substrate.

§3.3 — UI (HTMX)

Approach: server-rendered Go templates using html/template, HTMX for partial-page swaps, Alpine.js for client-side interactivity where needed. Single binary serves API + UI.

Acceptance gates:

  • G3.3.A — Ask tab: type natural-language question, get answer from RAG endpoint, render in-page without full reload
  • G3.3.B — Explore tab: paginated dataset list with hot-swap badge rendering
  • G3.3.C — SQL tab: textarea → submit → tabular result rendered in-page
  • G3.3.D — System tab: live tail of /storage/errors and /hnsw/trials via HTMX polling

Fallback if HTMX feels limiting: split repo golangLAKEHOUSE-ui with Vite + React, served as static files by Go gateway. Costs an extra repo + build chain.

§3.5 — Pathway memory port

Constraint: the Rust pathway_memory and TS implementations were kept byte-identical per ADR-021. The byte contract was verified by running both implementations on the same input tokens and asserting matching bucket indices.

Go port plan:

  • Port the 32-bucket SHA256-keyed token hash exactly. Verify on a golden input that Go produces the same bucket vector as Rust.
  • Port the JSON state file format verbatim — the existing 88 traces in data/_pathway_memory/state.json reload as-is into the Go implementation.
  • Port the matrix-correctness layer (ADR-021's SemanticFlag, BugFingerprint, TypeHint) — these are pure value types, trivially portable.

Acceptance gates:

  • G3.5.A — Load existing state.json, run replay on the same 11 prior successful pathways, all 11 succeed (matching the Rust 11/11 baseline).
  • G3.5.B — Bucket vector for a fixed test input byte-matches the Rust output.
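
The 32-bucket token hash in the port plan above can be sketched as follows. The digest-to-bucket reduction here (first digest byte mod 32) is an ASSUMPTION for illustration; the port must copy the exact Rust reduction or the byte-match gate will fail.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// bucketOf maps a token to one of 32 buckets via SHA-256. The reduction
// below is a placeholder — the real algorithm is whatever Rust does.
func bucketOf(token string) int {
	sum := sha256.Sum256([]byte(token))
	return int(sum[0]) % 32
}

// bucketVector counts tokens per bucket — the 32-wide vector the
// byte-match gate compares against the Rust output.
func bucketVector(tokens []string) [32]int {
	var v [32]int
	for _, t := range tokens {
		v[bucketOf(t)]++
	}
	return v
}

func main() {
	v := bucketVector([]string{"fn", "main", "fn"})
	fmt.Println(v[bucketOf("fn")] >= 2) // same token always lands in the same bucket
}
```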

§4. Phase plan

Phase G0 — Skeleton (Weeks 1–3)

Scope: smallest end-to-end ingest + query path working in Go.

Component | Deliverable
cmd/gateway | HTTP on :3100, /health, /v1/chat proxy stub
cmd/catalogd | In-memory registry + Parquet manifest persistence
cmd/storaged | Single-bucket S3 / local FS, no error journal yet
cmd/ingestd | CSV → Parquet, schema inference, register-on-ingest
cmd/queryd | DuckDB-backed POST /sql endpoint

Acceptance: upload a CSV via POST /ingest, query it via POST /sql with a SELECT, get rows back. Single-bucket. No vector, no profile, no UI.

Phase G1 — Vector + RAG (Weeks 4–6)

Component | Deliverable
cmd/vectord | Embed-on-ingest (calls Python sidecar), HNSW build, POST /search
cmd/gateway | Add POST /rag (embed → search → retrieve → generate via aibridge)
cmd/aibridge | HTTP client to existing Python sidecar

Acceptance: ingest 15K resumes (the original Phase 7 fixture), ask "find me a forklift operator with OSHA-10 in IL", get ranked results with LLM-generated explanation grounded in the retrieved chunks.

Phase G2 — Federation + profiles (Weeks 7–8)

Component | Deliverable
cmd/storaged | Multi-bucket registry, rescue bucket, error journal at primary://_errors/
Profile system | Per-reader profile bound to bucket + vector index
Hot-swap | Atomic pointer swap for index generations

Acceptance: two profiles bound to two buckets, queries scoped correctly, hot-swap a vector index without query interruption, rollback works.

Phase G3 — Pathway memory + distillation (Weeks 9–11)

Component | Deliverable
cmd/vectord | Pathway memory module ported, 88 traces reloaded
Distillation pipeline | SFT export, contamination firewall, scorer
Audit baselines | audit_baselines.jsonl longitudinal signal port

Acceptance: replay 11 prior successful pathways, all 11 succeed. Re-run distillation acceptance on the frozen fixture set, 22/22 pass.

Phase G4 — TS surfaces → Go (Weeks 12–14)

Component | Deliverable
cmd/mcp | MCP server (replaces Bun) — /v1/chat, intelligence endpoints
cmd/observer | Autonomous iteration loop, op recording
cmd/auditor | PR audit pipeline (kimi/haiku/opus rotation)
cmd/scrum | Scrum master pipeline (replaces TS)

Acceptance: open a test PR, auditor cycles within 90s, emits verdict to data/_auditor/kimi_verdicts/, behavior matches Rust+TS era within tolerance.

Phase G5 — UI + demo parity (Weeks 15–16)

Component | Deliverable
cmd/gateway | Serves HTMX templates + static demo HTML
Demo at devop.live/lakehouse/ | Parity with current Bun demo
Staffer console at /console | Parity

Acceptance: devop.live/lakehouse/ cuts over from Bun to Go gateway. Section ① / ② / ③ all render. Compact contract cards still expand with Project Index. Fill-probability bars still paint.


§5. Repo layout

golangLAKEHOUSE/
├── docs/
│   ├── PRD.md                    ← this PRD
│   ├── SPEC.md                   ← this spec
│   ├── DECISIONS.md              ← Go-era ADRs (start fresh, reference Rust ADRs by number)
│   └── ADR-XXX-*.md              ← per-ADR detail
├── cmd/
│   ├── gateway/                  ← main HTTP/gRPC ingress
│   ├── catalogd/
│   ├── storaged/
│   ├── queryd/
│   ├── ingestd/
│   ├── vectord/
│   ├── journald/
│   ├── mcp/
│   ├── observer/
│   ├── auditor/
│   └── scrum/
├── internal/                     ← shared packages, not exported
│   ├── aibridge/
│   ├── validator/
│   ├── truth/
│   ├── shared/
│   ├── proto/                    ← generated protobuf
│   └── pathway/
├── pkg/                          ← public Go packages (none initially)
├── web/                          ← UI (HTMX templates + static)
│   ├── templates/
│   └── static/
├── scripts/                      ← cold-start, smoke, distill scripts
├── tests/                        ← golden files, integration tests
├── go.mod
├── go.sum
└── README.md

Single Go module. All commands and internal packages live under golangLAKEHOUSE/. No nested modules unless a package needs an independent release cadence (none expected).

Build: go build ./cmd/... produces all binaries.


§6. Migration data plan

What ports verbatim

  • Parquet datasets at data/datasets/*.parquet — read by Go directly.
  • Catalog manifests — Parquet, ports as data not code.
  • Pathway memory state — JSON, ports if the pathway-memory byte-matching gate passes.

What rebuilds

  • HNSW indexes — rebuild from Parquet embeddings on first Go startup.
  • Auditor verdicts on PRs — old PRs won't be re-audited; lineage starts fresh on the new repo's PRs.

What's archived

  • The Rust crates/ tree — preserved in the original repo at the cutover commit, tagged pre-go-rewrite-2026-04-28 for reference.
  • TS surfaces (mcp-server/, auditor/, etc.) — preserved in the original repo at the same tag.
  • Distillation v1.0.0 substrate (tag distillation-v1.0.0, e7636f2) — kept as the historical reference; Go re-implementation ports the LOGIC but not the bit-identical-reproducibility property unless an ADR re-establishes it.

What's discarded

  • crates/vectord-lance/ (Lance backend, see PRD §Hard problems §2)
  • crates/lance-bench/ (criterion benchmarks specific to Lance)

§7. Acceptance: when is the rewrite done?

The Go Lakehouse reaches feature parity when:

  1. All 12 Rust PRD invariants hold (object-storage source of truth, catalog metadata authority, idempotent ingest, hot-swap atomicity, profiles, etc.).
  2. The 16 distillation acceptance gates pass (re-run ./scripts/distill audit-full against the Go pipeline).
  3. The 22/22 acceptance fixtures from tests/fixtures/distillation/acceptance/ pass under the Go implementation.
  4. The 145 unit tests of distillation v1.0.0 are ported and pass.
  5. devop.live/lakehouse/ demo cuts over to Go gateway with no visible UI regressions.
  6. Auditor emits Kimi/Haiku/Opus verdicts on a test PR, matching the cross-lineage rotation behavior.
  7. The 88 pathway traces replay with 11/11 prior successes reproduced.

At that point the Rust repo enters maintenance-only mode (security fixes), and the Go repo becomes the live system.


§8. Ratified — Phase G0 unblocked (2026-04-28, J)

# | Decision | Spec impact
1 | DuckDB via cgo (marcboeker/go-duckdb) | §3.1 option A — proceed
2 | HTMX + html/template + Alpine.js | §3.3 option A — proceed
3 | git.agentview.dev/profit/golangLAKEHOUSE | repo location locked
4 | Distillation rebuilt in Go (no bit-identical port) | §6 — port logic, not fixtures
5 | Pathway memory starts empty; old traces noted | the pathway-memory replay gate is now "build initial state from scratch in Phase G3"; the byte-match gate is preserved as the porting correctness gate when the algorithm is reimplemented
6 | Auditor longitudinal signal restarts | new audit_baselines.jsonl lineage starts on first Go-era PR

See docs/DECISIONS.md ADR-001 for full rationale and docs/RUST_PATHWAY_MEMORY_NOTE.md for where the legacy 88 traces live.

Phase G0 is now unblocked. Next step: bootstrap the Go module skeleton + push to Gitea, then begin §4 Phase G0 implementation.