root 71b35fb85e SPEC §1 + §3.4: name matrix indexer as a port target
Adds matrix indexer as its own row in the §1 component table and a
new §3.4 with port plan. Distinct from vectord (substrate); lives at
internal/matrix/ + gateway /v1/matrix/*.

Five components in dependency order: corpus builders → multi-corpus
retrieve+merge → relevance filter → strong-model downgrade gate →
learning-loop integration.

Locks in the framing J flagged 2026-04-29: in Rust the matrix indexer
was emergent across mode.rs + build_*_corpus.ts + observer /relevance,
and earlier port-planning reduced it to "we have vectord." The SPEC
now names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.

Sharding-by-id was investigated as a throughput fix and rejected —
corpus-as-shard at the matrix layer is the existing retrieval shape
and parallelizes Adds for free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:12:10 -05:00


# SPEC: Lakehouse-Go Component Port Plan
**Status:** DRAFT — companion to `PRD.md`. Component-by-component port
plan with library choices, effort estimates, and acceptance gates.
**Created:** 2026-04-28
**Owner:** J
This spec answers: for each piece of the Rust Lakehouse, what Go
library carries it, what the effort looks like, and what gate proves
the port is real.
Effort scale (one engineer-week = ~40h focused work):
- **S** — 1–3 days
- **M** — 1 engineer-week
- **L** — 2–3 engineer-weeks
- **XL** — 1+ months
- **HARD** — open research, see PRD §Hard problems
---
## §1. Component port table — Rust crates
| Crate | Rust deps that mattered | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| `gateway` | axum, tokio, tonic, tower | `cmd/gateway` | `chi` + stdlib `net/http` + `google.golang.org/grpc` | **L** | low — Go's strongest domain |
| `catalogd` | parquet-rs, arrow, sqlite | `cmd/catalogd` | `apache/arrow-go/v18`, `mattn/go-sqlite3` | **L** | low |
| `storaged` | object_store, aws-sdk | `cmd/storaged` | `aws-sdk-go-v2`, `minio-go` for MinIO-specific paths | **M** | low |
| `queryd` | datafusion, arrow | `cmd/queryd` | **`duckdb/duckdb-go/v2`** (cgo, official) | **HARD** | high — see §3 |
| `ingestd` | csv, json, lopdf, postgres | `cmd/ingestd` | stdlib `encoding/csv`, `encoding/json`, `pdfcpu/pdfcpu`, `jackc/pgx/v5` | **L** | low |
| `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `apache/arrow-go/v18` | **L** | medium — re-validate HNSW recall |
| **matrix indexer** (emergent in Rust — `mode.rs` + `build_*_corpus.ts` + observer `/relevance`) | scripts/build_*_corpus.ts, crates/gateway/src/v1/mode.rs, mcp-server/observer.ts | `internal/matrix/` + gateway routes (`/v1/matrix/*`) | stdlib + vectord client | **L** | medium — see §3.4. Corpus-as-shard composer; relevance filter; strong-model downgrade gate; multi-corpus retrieve+merge. The learning-loop layer that lifts vectord from "static index" to "meta-index that learns from playbooks." |
| `vectord-lance` | lance | **DROPPED** | n/a | n/a | n/a — Parquet+HNSW only |
| `journald` | parquet, arrow | `cmd/journald` | `apache/arrow-go/v18` | **M** | low |
| `aibridge` | reqwest | library | `net/http` + connection pool · `anthropics/anthropic-sdk-go` available for direct Claude calls (currently routed via opencode) | **S** | low |
| `validator` | parquet, custom | library | `apache/arrow-go/v18` parquet reader | **M** | low — port the 24 unit tests as gates |
| `truth` | tomli, custom DSL | library | `pelletier/go-toml/v2` | **M** | low |
| `proto` | tonic-build | `proto/` + `protoc-gen-go` | `buf` + `protoc-gen-go-grpc` | **S** | low |
| `shared` | serde, anyhow | library | stdlib `encoding/json`, `errors` | **S** | low |
| `ui` | dioxus, wasm | **REPLACED** | `html/template` + HTMX | **L** | medium — see §3 |
| `lance-bench` | criterion | n/a — dropped with Lance | n/a | n/a | n/a |
**Total Rust crate port effort:** ~12–18 engineer-weeks (3–4 months for
one engineer; 6–8 weeks for two).
---
## §2. Component port table — TypeScript surfaces
| TS surface | Current location | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| `mcp-server/index.ts` | Bun, :3700 | `cmd/mcp` | **`modelcontextprotocol/go-sdk`** (official Go SDK, v1.5.0, Google collaboration) | **L** | medium — MCP semantics |
| `mcp-server/observer.ts` | Bun, :3800 | `cmd/observer` | stdlib `net/http`, `slog` | **M** | low |
| `mcp-server/tracing.ts` | Bun, Langfuse client | library | `go.opentelemetry.io/otel` + Langfuse Go client (or hand-roll) | **M** | low — Langfuse Go OSS support varies |
| `auditor/*.ts` | TS, runs as systemd | `cmd/auditor` | stdlib + `gitea API client` | **L** | medium — auditor cross-lineage logic is intricate |
| `tests/real-world/scrum_master_pipeline.ts` | TS, ad-hoc | `cmd/scrum` | stdlib | **L** | medium — chunking + embed + ladder logic |
| `tests/real-world/scrum_applier.ts` | TS, ad-hoc | `cmd/scrum-apply` | stdlib + git CLI shell-out | **M** | medium |
| `bot/propose.ts` | TS | `cmd/bot` | stdlib | **S** | low |
| Search demo HTML/JS | static | static (no port) | n/a | n/a | n/a — copied as-is |
**Total TS port effort:** ~6–10 engineer-weeks.
---
## §3. Hard problem details
### §3.1 — Query engine (DuckDB via cgo)
**Library:** `github.com/duckdb/duckdb-go/v2` — official Go bindings via
cgo. (Replaces the legacy `marcboeker/go-duckdb`, which was deprecated
when the DuckDB team and Marc Boeker jointly relocated maintenance to
the DuckDB org at v2.5.0. Migration is a one-line `gofmt -r` rewrite of
import paths.) Current version v2.10502.0 (April 2026), DuckDB v1.5.2
compat. Statically links default extensions: ICU, JSON, Parquet,
Autocomplete.
**API shape** (replaces the DataFusion `SessionContext` pattern):
```go
// The blank import registers the "duckdb" driver with database/sql:
//   import _ "github.com/duckdb/duckdb-go/v2"
db, err := sql.Open("duckdb", "") // empty DSN = in-memory database
if err != nil { /* handle */ }
defer db.Close()
db.Exec(`CREATE VIEW workers AS SELECT * FROM read_parquet('s3://bucket/workers/*.parquet')`)
rows, err := db.Query(`SELECT role, count(*) FROM workers WHERE state='IL' GROUP BY role`)
```
**Acceptance gates:**
- G3.1.A — `SELECT * FROM read_parquet('workers_500k.parquet') LIMIT 1`
returns a row with the expected schema. Establishes that Parquet reads
work.
- G3.1.B — Hybrid SQL+vector query (the `POST /vectors/hybrid`
surface) returns the same workers as the Rust path on the same input,
ranked the same way modulo embedding precision.
- G3.1.C — Hot-cache merge-on-read: register a base table + a delta
Parquet, query, observe both rows merged with the delta winning on
conflict.
**Fallback if cgo is rejected:** run DuckDB as an external process
(`duckdb -json -c '...'` shelled out, or HTTP via a thin Go wrapper).
Adds operational surface; preserves the SQL model.
### §3.2 — HNSW index
**Library:** `coder/hnsw` — pure-Go HNSW, in-process. Supports add /
delete / search / persist.
**Open question:** does `coder/hnsw` match the recall@10 we measured
on the Rust `hora` path? Need a calibration test:
- Rebuild `lakehouse_arch_v1` (the 1086-chunk arch corpus) in Go.
- Compare recall@10 on a fixed query set to the Rust baseline.
- Acceptance: ≤2% drop or we switch library / parameters.
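The calibration reduces to a small harness. A pure-Go sketch of the recall@K computation — the chunk IDs and query set here are toy placeholders; the real baseline comes from the Rust `hora` path on `lakehouse_arch_v1`:

```go
package main

import "fmt"

// recallAtK computes mean recall@K: for each query, the fraction of
// ground-truth neighbor IDs that appear in the candidate result list,
// averaged over the query set.
func recallAtK(truth, got [][]string) float64 {
	var sum float64
	for i := range truth {
		want := make(map[string]bool, len(truth[i]))
		for _, id := range truth[i] {
			want[id] = true
		}
		hits := 0
		for _, id := range got[i] {
			if want[id] {
				hits++
			}
		}
		sum += float64(hits) / float64(len(truth[i]))
	}
	return sum / float64(len(truth))
}

func main() {
	// Toy data: Rust baseline neighbors vs. Go coder/hnsw results for two queries.
	baseline := [][]string{{"a", "b"}, {"c", "d"}}
	goResults := [][]string{{"a", "b"}, {"c", "x"}}
	r := recallAtK(baseline, goResults)
	fmt.Printf("recall@2 = %.2f\n", r) // (2/2 + 1/2) / 2 = 0.75
	// Acceptance: fail the gate if recall drops more than 2% vs. the baseline.
	fmt.Println("within gate:", 1.0-r <= 0.02)
}
```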
**Persistence format:** TBD — `coder/hnsw` has its own snapshot format;
ADR equivalent of ADR-008 (Parquet for embeddings + sidecar HNSW file)
needs revisiting in Go to confirm the sidecar format we ship.
**Acceptance gates:**
- G3.2.A — Build HNSW from a Parquet of 100K vectors in <60s
- G3.2.B — Search 100K vectors at k=10 in <50ms p50
- G3.2.C — Recall@10 within 2% of the Rust baseline on
  `lakehouse_arch_v1`
### §3.4 — Matrix indexer (corpus-as-shard composer)
**What it is.** The matrix indexer is the layer above `vectord` that
turns a fleet of single-corpus HNSW indexes into a learning meta-index.
In the Rust system this is an emergent layer split across corpus builders
(`scripts/build_*_corpus.ts`), the mode runner (`crates/gateway/src/v1/mode.rs`),
the observer relevance endpoint (`mcp-server/observer.ts`), and the
strong-model downgrade gate (`mode.rs::execute`). In Go we name it
explicitly so future sessions don't reduce it to "vectord."
**Why corpus-as-shard, not shard-by-id.** Sharding a single index by
hash(id) is a pure throughput hack with a recall tax. Sharding by
corpus is the existing retrieval shape: `lakehouse_arch_v1`,
`lakehouse_symbols_v1`, `scrum_findings_v1`, `lakehouse_answers_v1`,
`kb_team_runs_v1`, `successful_playbooks_live`, etc., each with a
distinct topology and a distinct retrieval intent. Concurrent Adds
parallelize naturally because they go to different corpora; the
matrix layer's job is to retrieve+merge across them, filter for
relevance, and downgrade composition when strong models prove the
matrix is anti-additive.
**Components to port (in dependency order):**
1. **Corpus builders** — Go equivalents of `scripts/build_*_corpus.ts`.
For each named corpus, a builder that reads source, splits into
chunks per the corpus's schema, embeds via `/v1/embed`, and adds
to a vectord index of the same name. Effort: **M** for the first
builder, **S** for each subsequent.
2. **Multi-corpus retrieve+merge** (`internal/matrix/retrieve.go`) —
given a query and a list of corpus names, search each at top_k=K,
merge by score, return top N globally. Match Rust's pattern:
top_k=6 per corpus, top 8 globally before relevance filter.
3. **Relevance filter** (`internal/matrix/relevance.go`) — port the
threshold-based filter from `mcp-server/observer.ts:/relevance`.
Drops adjacency-pollution chunks that share a corpus with the hit
but aren't actually about the query. `LH_RELEVANCE_FILTER` /
`LH_RELEVANCE_THRESHOLD` env knobs preserved.
4. **Strong-model downgrade gate** (`internal/matrix/downgrade.go`) —
port `is_weak_model` + the `codereview_lakehouse → codereview_isolation`
flip from `mode.rs::execute`. Pass5 proved composed corpora lose
5/5 vs isolation on grok-4.1-fast (p=0.031); the gate is
load-bearing for paid-model retrieval quality.
5. **Learning-loop integration** — write outcomes back to a
playbook-memory corpus (probably `lakehouse_answers_v1` analogue).
This is what makes the matrix INDEX a learning system rather than
static retrieval. Per `feedback_meta_index_vision.md`: this is the
north star, not the data structure.
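Steps 2 and 3 are small enough to sketch in pure Go. The vectord client is stubbed as a function value, the corpus names and scores are toy data, and the threshold is passed explicitly (the real knob is `LH_RELEVANCE_THRESHOLD`):

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is one scored chunk from a single-corpus search.
type Hit struct {
	Corpus string
	ID     string
	Score  float64 // higher = more similar
}

// searchFn abstracts the per-corpus vectord search call.
type searchFn func(corpus, query string, topK int) []Hit

// retrieveMerge searches each corpus at perCorpus, merges by score, and
// returns the top n hits globally (Rust pattern: 6 per corpus, 8 global).
func retrieveMerge(search searchFn, corpora []string, query string, perCorpus, n int) []Hit {
	var all []Hit
	for _, c := range corpora {
		all = append(all, search(c, query, perCorpus)...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
	if len(all) > n {
		all = all[:n]
	}
	return all
}

// relevanceFilter drops hits below threshold — the adjacency-pollution
// guard ported from the observer /relevance endpoint.
func relevanceFilter(hits []Hit, threshold float64) []Hit {
	var kept []Hit
	for _, h := range hits {
		if h.Score >= threshold {
			kept = append(kept, h)
		}
	}
	return kept
}

func main() {
	stub := func(corpus, query string, topK int) []Hit {
		if corpus == "lakehouse_arch_v1" {
			return []Hit{{corpus, "arch-1", 0.91}, {corpus, "arch-2", 0.40}}
		}
		return []Hit{{corpus, "sym-1", 0.77}}
	}
	merged := retrieveMerge(stub, []string{"lakehouse_arch_v1", "lakehouse_symbols_v1"}, "q", 6, 8)
	final := relevanceFilter(merged, 0.5)
	fmt.Println(len(merged), len(final)) // 3 2
}
```

Corpus attribution falls out for free: each `Hit` carries the corpus it came from, which is what `/v1/matrix/search` returns per result.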
**Gateway routes:** `/v1/matrix/search` (multi-corpus retrieve+merge),
`/v1/matrix/corpora` (list + metadata), `/v1/matrix/relevance` (filter
endpoint, used by both internal callers and external tooling).
**Acceptance gates:**
- G3.4.A — `/v1/matrix/search` against 3 corpora returns merged top-N
  with corpus attribution per result.
- G3.4.B — Relevance filter drops at least the threshold-margin chunks
  on a known adjacency-pollution test case.
- G3.4.C — Strong-model downgrade gate flips composed→isolation when
  the model is non-weak; bypassed when the caller sets `force_mode`.
- G3.4.D — Concurrent Adds across N=4 corpora parallelize (no shared
  write-lock); Add throughput scales near-linearly with corpus count.
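G3.4.C's flip is a few lines. A sketch — the weak-model check here is an illustrative substring match with hypothetical model names, not the real `is_weak_model` table, which ports from `mode.rs`:

```go
package main

import (
	"fmt"
	"strings"
)

// isWeakModel reports whether a model benefits from composed corpora.
// ILLUSTRATIVE ONLY: substring checks on hypothetical model names stand
// in for the real table ported from mode.rs.
func isWeakModel(model string) bool {
	for _, w := range []string{"mini", "tiny", "lite"} {
		if strings.Contains(model, w) {
			return true
		}
	}
	return false
}

// selectMode applies the strong-model downgrade: composed corpora lose
// to isolation on non-weak models, unless the caller forces the mode.
func selectMode(requested, model string, forceMode bool) string {
	if requested == "codereview_lakehouse" && !isWeakModel(model) && !forceMode {
		return "codereview_isolation"
	}
	return requested
}

func main() {
	fmt.Println(selectMode("codereview_lakehouse", "acme-mini", false))  // weak model: composed kept
	fmt.Println(selectMode("codereview_lakehouse", "acme-large", false)) // non-weak: downgraded
	fmt.Println(selectMode("codereview_lakehouse", "acme-large", true))  // force_mode bypasses the gate
}
```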
**Persistence:** each corpus's vectord index persists via the existing
G1P LHV1 format. The matrix layer is stateless above that: the corpus
list lives in the catalog, retrieval params in config.
**Why this is its own §3.x:** in Rust the matrix indexer was emergent
and got reduced to "we have vectord" in earlier port-planning. The
SPEC names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.
### §3.3 — UI (HTMX)
**Approach:** server-rendered Go templates using `html/template`,
HTMX for partial-page swaps, Alpine.js for client-side interactivity
where needed. Single binary serves API + UI.
**Acceptance gates:**
- G3.3.A — `Ask` tab: type a natural-language question, get an answer
  from the RAG endpoint, rendered in-page without a full reload
- G3.3.B — `Explore` tab: paginated dataset list with hot-swap badge
  rendering
- G3.3.C — `SQL` tab: textarea submit → tabular result rendered
  in-page
- G3.3.D — `System` tab: live tail of `/storage/errors` and
  `/hnsw/trials` via HTMX polling
**Fallback if HTMX feels limiting:** a split repo, `golangLAKEHOUSE-ui`,
with Vite + React, served as static files by the Go gateway. Costs an
extra repo + build chain.
### §3.5 — Pathway memory port
**Constraint:** the Rust `pathway_memory` and TS implementations were
byte-matching per ADR-021. The byte contract was verified by running
both implementations on the same input tokens and asserting matching
bucket indices.
**Go port plan:**
- Port the 32-bucket SHA256-keyed token hash exactly. Verify on a
golden input that Go produces the same bucket vector as Rust.
- Port the JSON state file format verbatim — the existing 88 traces in
  `data/_pathway_memory/state.json` reload as-is into the Go
  implementation.
- Port the matrix-correctness layer (ADR-021's `SemanticFlag`,
  `BugFingerprint`, `TypeHint`) — these are pure value types,
  trivially portable.
**Acceptance gates:**
- G3.5.A — Load existing `state.json`, run `replay` on the same 11
  prior successful pathways; all 11 succeed (matching the Rust 11/11
  baseline).
- G3.5.B — The bucket vector for a fixed test input byte-matches the
  Rust output.
---
## §4. Phase plan
### Phase G0 — Skeleton (Weeks 1–3)
**Scope:** smallest end-to-end ingest + query path working in Go.
| Component | Deliverable |
|---|---|
| `cmd/gateway` | HTTP on :3100, `/health`, `/v1/chat` proxy stub |
| `cmd/catalogd` | In-memory registry + Parquet manifest persistence |
| `cmd/storaged` | Single-bucket S3 / local FS, no error journal yet |
| `cmd/ingestd` | CSV → Parquet, schema inference, register-on-ingest |
| `cmd/queryd` | DuckDB-backed `POST /sql` endpoint |
**Acceptance:** upload a CSV via `POST /ingest`, query it via
`POST /sql` with a SELECT, get rows back. Single-bucket. No vector,
no profile, no UI.
### Phase G1 — Vector + RAG (Weeks 4–6)
| Component | Deliverable |
|---|---|
| `cmd/vectord` | Embed-on-ingest (calls Python sidecar), HNSW build, `POST /search` |
| `cmd/gateway` | Add `POST /rag` (embed → search → retrieve → generate via aibridge) |
| `cmd/aibridge` | HTTP client to existing Python sidecar |
**Acceptance:** ingest 15K resumes (the original Phase 7 fixture),
ask "find me a forklift operator with OSHA-10 in IL", get ranked
results with LLM-generated explanation grounded in the retrieved
chunks.
### Phase G2 — Federation + profiles (Weeks 7–8)
| Component | Deliverable |
|---|---|
| `cmd/storaged` | Multi-bucket registry, rescue bucket, error journal at `primary://_errors/` |
| Profile system | Per-reader profile bound to bucket + vector index |
| Hot-swap | Atomic pointer swap for index generations |
**Acceptance:** two profiles bound to two buckets, queries scoped
correctly, hot-swap a vector index without query interruption,
rollback works.
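The hot-swap deliverable maps directly onto `sync/atomic.Pointer` (Go 1.19+): readers load the current generation lock-free, the swap is a single pointer store, and rollback is the same operation in reverse. The `indexGen` type below is illustrative:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// indexGen is one immutable generation of a vector index.
type indexGen struct {
	Generation int
	// ... HNSW graph, Parquet manifest, etc.
}

// current holds the live generation; queries Load it, swaps Store it.
var current atomic.Pointer[indexGen]

func main() {
	current.Store(&indexGen{Generation: 1})
	fmt.Println("serving gen", current.Load().Generation)

	// Build gen 2 off to the side, then swap atomically — in-flight
	// queries keep their gen-1 pointer; new queries see gen 2.
	next := &indexGen{Generation: 2}
	old := current.Swap(next)
	fmt.Println("swapped", old.Generation, "->", current.Load().Generation)

	// Rollback is the same operation in reverse.
	current.Store(old)
	fmt.Println("rolled back to gen", current.Load().Generation)
}
```

Because generations are immutable once published, no read lock is needed and query interruption never occurs — the acceptance criterion above.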
### Phase G3 — Pathway memory + distillation (Weeks 9–11)
| Component | Deliverable |
|---|---|
| `cmd/vectord` | Pathway memory module ported, 88 traces reloaded |
| Distillation pipeline | SFT export, contamination firewall, scorer |
| Audit baselines | `audit_baselines.jsonl` longitudinal signal port |
**Acceptance:** replay 11 prior successful pathways, all 11 succeed.
Re-run distillation acceptance on the frozen fixture set, 22/22 pass.
### Phase G4 — TS surfaces → Go (Weeks 12–14)
| Component | Deliverable |
|---|---|
| `cmd/mcp` | MCP server (replaces Bun): `/v1/chat`, intelligence endpoints |
| `cmd/observer` | Autonomous iteration loop, op recording |
| `cmd/auditor` | PR audit pipeline (kimi/haiku/opus rotation) |
| `cmd/scrum` | Scrum master pipeline (replaces TS) |
**Acceptance:** open a test PR, auditor cycles within 90s, emits
verdict to `data/_auditor/kimi_verdicts/`, behavior matches Rust+TS
era within tolerance.
### Phase G5 — UI + demo parity (Weeks 15–16)
| Component | Deliverable |
|---|---|
| `cmd/gateway` | Serves HTMX templates + static demo HTML |
| Demo at `devop.live/lakehouse/` | Parity with current Bun demo |
| Staffer console at `/console` | Parity |
**Acceptance:** `devop.live/lakehouse/` cuts over from Bun to Go
gateway. Section / / all render. Compact contract cards still
expand with Project Index. Fill-probability bars still paint.
---
## §5. Repo layout
```
golangLAKEHOUSE/
├── docs/
│ ├── PRD.md ← this PRD
│ ├── SPEC.md ← this spec
│ ├── DECISIONS.md ← Go-era ADRs (start fresh, reference Rust ADRs by number)
│ └── ADR-XXX-*.md ← per-ADR detail
├── cmd/
│ ├── gateway/ ← main HTTP/gRPC ingress
│ ├── catalogd/
│ ├── storaged/
│ ├── queryd/
│ ├── ingestd/
│ ├── vectord/
│ ├── journald/
│ ├── mcp/
│ ├── observer/
│ ├── auditor/
│ └── scrum/
├── internal/ ← shared packages, not exported
│ ├── aibridge/
│ ├── validator/
│ ├── truth/
│ ├── shared/
│ ├── proto/ ← generated protobuf
│ └── pathway/
├── pkg/ ← public Go packages (none initially)
├── web/ ← UI (HTMX templates + static)
│ ├── templates/
│ └── static/
├── scripts/ ← cold-start, smoke, distill scripts
├── tests/ ← golden files, integration tests
├── go.mod
├── go.sum
└── README.md
```
**Single Go module.** All commands and internal packages live under
`golangLAKEHOUSE/`. No nested modules unless a package needs an
independent release cadence (none expected).
**Build:** `go build ./cmd/...` produces all binaries.
---
## §6. Migration data plan
### What ports verbatim
- Parquet datasets at `data/datasets/*.parquet` — read by Go directly.
- Catalog manifests — Parquet; port as data, not code.
- Pathway memory state — JSON; ports if the pathway-memory
  byte-matching gate passes.
### What rebuilds
- HNSW indexes rebuild from Parquet embeddings on first Go startup.
- Auditor verdicts on PRs — old PRs won't be re-audited; lineage starts
  fresh on the new repo's PRs.
### What's archived
- The Rust `crates/` tree — preserved in the original repo at the
  cutover commit, tagged `pre-go-rewrite-2026-04-28` for reference.
- TS surfaces (`mcp-server/`, `auditor/`, etc.) — preserved in the
  original repo at the same tag.
- Distillation v1.0.0 substrate (tag `distillation-v1.0.0`,
  `e7636f2`) — kept as the historical reference; the Go
  re-implementation ports the LOGIC but not the
  bit-identical-reproducibility property unless an ADR re-establishes it.
### What's discarded
- `crates/vectord-lance/` (Lance backend, see PRD §Hard problems §2)
- `crates/lance-bench/` (criterion benchmarks specific to Lance)
---
## §7. Acceptance: when is the rewrite done?
The Go Lakehouse reaches **feature parity** when:
1. **All 12 Rust PRD invariants hold** (object-storage source of truth,
catalog metadata authority, idempotent ingest, hot-swap atomicity,
profiles, etc.).
2. **The 16 distillation acceptance gates pass** (re-run
`./scripts/distill audit-full` against the Go pipeline).
3. **The 22/22 acceptance fixtures from `tests/fixtures/distillation/
acceptance/` pass** under the Go implementation.
4. **The 145 unit tests of distillation v1.0.0 are ported and pass.**
5. **`devop.live/lakehouse/` demo cuts over to Go gateway** with no
visible UI regressions.
6. **Auditor emits Kimi/Haiku/Opus verdicts** on a test PR, matching
the cross-lineage rotation behavior.
7. **The 88 pathway traces replay** with 11/11 prior successes
reproduced.
At that point the Rust repo enters maintenance-only mode (security
fixes), and the Go repo becomes the live system.
---
## §8. Ratified — Phase G0 unblocked (2026-04-28, J)
| # | Decision | Spec impact |
|---|---|---|
| 1 | DuckDB via cgo (`marcboeker/go-duckdb`) | §3.1 option A — proceed |
| 2 | HTMX + `html/template` + Alpine.js | §3.3 option A — proceed |
| 3 | `git.agentview.dev/profit/golangLAKEHOUSE` | repo location locked |
| 4 | Distillation rebuilt in Go (no bit-identical port) | §6 — port logic, not fixtures |
| 5 | Pathway memory starts empty; old traces noted | §3.5 — G3.5.A is now "build initial state from scratch in Phase G3"; G3.5.B (byte-match) preserved as the porting correctness gate when the algorithm is reimplemented |
| 6 | Auditor longitudinal signal restarts | new `audit_baselines.jsonl` lineage starts on first Go-era PR |
See `docs/DECISIONS.md` ADR-001 for full rationale and
`docs/RUST_PATHWAY_MEMORY_NOTE.md` for where the legacy 88 traces live.
**Phase G0 is now unblocked.** Next step: bootstrap the Go module
skeleton + push to Gitea, then begin §4 Phase G0 implementation.