SPEC: Lakehouse-Go Component Port Plan
Status: DRAFT — companion to PRD.md. Component-by-component port
plan with library choices, effort estimates, and acceptance gates.
Created: 2026-04-28
Owner: J
This spec answers: for each piece of the Rust Lakehouse, what Go library carries it, what the effort looks like, and what gate proves the port is real.
Effort scale (one engineer-week = ~40h focused work):
- S — 1–3 days
- M — 1 engineer-week
- L — 2–3 engineer-weeks
- XL — 1+ months
- HARD — open research, see PRD §Hard problems
§1. Component port table — Rust crates
| Crate | Rust deps that mattered | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| gateway | axum, tokio, tonic, tower | cmd/gateway | chi + stdlib net/http + google.golang.org/grpc | L | low (Go's strongest domain) |
| catalogd | parquet-rs, arrow, sqlite | cmd/catalogd | arrow-go/v15, mattn/go-sqlite3 | L | low |
| storaged | object_store, aws-sdk | cmd/storaged | aws-sdk-go-v2, minio-go for MinIO-specific paths | M | low |
| queryd | datafusion, arrow | cmd/queryd | marcboeker/go-duckdb (cgo) | HARD | high (see §3) |
| ingestd | csv, json, lopdf, postgres | cmd/ingestd | stdlib encoding/csv, encoding/json, pdfcpu/pdfcpu, jackc/pgx/v5 | L | low |
| vectord | hora, arrow, hnsw | cmd/vectord | coder/hnsw, arrow-go/v15 | L | medium (re-validate HNSW recall) |
| vectord-lance | lance | DROPPED | n/a | n/a | n/a (Parquet+HNSW only) |
| journald | parquet, arrow | cmd/journald | arrow-go/v15 | M | low |
| aibridge | reqwest | library | net/http + connection pool | S | low |
| validator | parquet, custom | library | arrow-go/v15 parquet reader | M | low (port the 24 unit tests as gates) |
| truth | tomli, custom DSL | library | pelletier/go-toml/v2 | M | low |
| proto | tonic-build | proto/ + protoc-gen-go | buf + protoc-gen-go-grpc | S | low |
| shared | serde, anyhow | library | stdlib encoding/json, errors | S | low |
| ui | dioxus, wasm | REPLACED | html/template + HTMX | L | medium (see §3) |
| lance-bench | criterion | n/a (dropped with Lance) | n/a | n/a | n/a |
Total Rust crate port effort: ~12–18 engineer-weeks (3–4 months for one engineer; 6–8 weeks for two).
§2. Component port table — TypeScript surfaces
| TS surface | Current location | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| mcp-server/index.ts | Bun, :3700 | cmd/mcp | mark3labs/mcp-go (Go MCP SDK) | L | medium (MCP semantics) |
| mcp-server/observer.ts | Bun, :3800 | cmd/observer | stdlib net/http, slog | M | low |
| mcp-server/tracing.ts | Bun, Langfuse client | library | go.opentelemetry.io/otel + Langfuse Go client (or hand-roll) | M | low (Langfuse Go OSS support varies) |
| auditor/*.ts | TS, runs as systemd | cmd/auditor | stdlib + gitea API client | L | medium (auditor cross-lineage logic is intricate) |
| tests/real-world/scrum_master_pipeline.ts | TS, ad-hoc | cmd/scrum | stdlib | L | medium (chunking + embed + ladder logic) |
| tests/real-world/scrum_applier.ts | TS, ad-hoc | cmd/scrum-apply | stdlib + git CLI shell-out | M | medium |
| bot/propose.ts | TS | cmd/bot | stdlib | S | low |
| Search demo HTML/JS | static | static (no port) | n/a | n/a | n/a (copied as-is) |
Total TS port effort: ~6–10 engineer-weeks.
§3. Hard problem details
§3.1 — Query engine (DuckDB via cgo)
Library: marcboeker/go-duckdb — Go bindings via cgo.
API shape (replaces the DataFusion SessionContext pattern):
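// Assumes: import "database/sql" and _ "github.com/marcboeker/go-duckdb" (registers the "duckdb" driver).
// Error handling is elided for brevity; check every returned error in real code.
db, _ := sql.Open("duckdb", "") // empty DSN opens an in-memory database
defer db.Close()
db.Exec("CREATE VIEW workers AS SELECT * FROM read_parquet('s3://bucket/workers/*.parquet')")
rows, _ := db.Query("SELECT role, count(*) FROM workers WHERE state='IL' GROUP BY role")
defer rows.Close()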
Acceptance gates:
- G3.1.A: SELECT * FROM read_parquet('workers_500k.parquet') LIMIT 1 returns a row with the expected schema. Establishes that Parquet reads work.
- G3.1.B: Hybrid SQL+vector query (the POST /vectors/hybrid surface) returns the same workers as the Rust path on the same input, ranked the same way modulo embedding precision.
- G3.1.C (hot-cache merge-on-read): register a base table + a delta Parquet, query, observe both rows merged with the delta winning on conflict.
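The merge semantics that gate G3.1.C checks can be sketched in plain Go. This is only an illustration of "delta wins on conflict"; it is not how DuckDB or queryd would implement it. The Row type and its ID key are hypothetical stand-ins for the real schema.

```go
package main

import (
	"fmt"
	"sort"
)

// Row is a hypothetical record shape; the real schema comes from the Parquet files.
type Row struct {
	ID    int
	State string
}

// mergeOnRead overlays delta rows on base rows by ID.
// On a key conflict the delta row replaces the base row.
func mergeOnRead(base, delta []Row) []Row {
	byID := make(map[int]Row, len(base)+len(delta))
	for _, r := range base {
		byID[r.ID] = r
	}
	for _, r := range delta {
		byID[r.ID] = r // delta overwrites base
	}
	out := make([]Row, 0, len(byID))
	for _, r := range byID {
		out = append(out, r)
	}
	// Sort so the result is deterministic regardless of map iteration order.
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	base := []Row{{1, "IL"}, {2, "WI"}}
	delta := []Row{{2, "IL"}, {3, "IN"}}
	fmt.Println(mergeOnRead(base, delta)) // [{1 IL} {2 IL} {3 IN}]
}
```

A gate test would assert the same thing against the SQL path: row 2 carries the delta value, and rows 1 and 3 are both present.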
Fallback if cgo is rejected: run DuckDB as an external process,
either by shelling out to duckdb -json -c '...' or behind a thin Go
HTTP wrapper. This adds operational surface but preserves the SQL model.
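A minimal sketch of the shell-out variant of this fallback, using only the standard library. It assumes a duckdb binary on PATH; the function names runQuery and parseRows are hypothetical, and treating empty output as zero rows is an assumption about the CLI's behavior on empty result sets.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"os/exec"
)

// parseRows decodes the JSON array that `duckdb -json` writes to stdout.
// ASSUMPTION: empty output means an empty result set.
func parseRows(out []byte) ([]map[string]any, error) {
	out = bytes.TrimSpace(out)
	if len(out) == 0 {
		return nil, nil
	}
	var rows []map[string]any
	if err := json.Unmarshal(out, &rows); err != nil {
		return nil, fmt.Errorf("decode duckdb output: %w", err)
	}
	return rows, nil
}

// runQuery shells out to a duckdb binary assumed to be on PATH.
// The context lets the caller bound query time.
func runQuery(ctx context.Context, sql string) ([]map[string]any, error) {
	var stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, "duckdb", "-json", "-c", sql)
	cmd.Stderr = &stderr
	out, err := cmd.Output()
	if err != nil {
		return nil, fmt.Errorf("duckdb: %w: %s", err, stderr.String())
	}
	return parseRows(out)
}

func main() {
	rows, err := runQuery(context.Background(), "SELECT 42 AS answer")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Println(rows)
}
```

One process per query keeps the wrapper stateless, at the cost of process startup on every call; a long-lived HTTP wrapper would trade that for more operational surface.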
§3.2 — HNSW index
Library: coder/hnsw — pure-Go HNSW, in-process. Supports add /
delete / search / persist.
Open question: does coder/hnsw match the recall@10 we measured
on the Rust hora path? Need a calibration test:
- Rebuild lakehouse_arch_v1 (the 1086-chunk arch corpus) in Go.
- Compare recall@10 on a fixed query set to the Rust baseline.
- Acceptance: ≤2% drop or we switch library / parameters.
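The calibration comparison above can be sketched as follows. This assumes each query has a ground-truth ranked ID list, and it reads "≤2% drop" as an absolute difference of 0.02 in recall; both are assumptions, and the helper names are hypothetical.

```go
package main

import "fmt"

// topK returns at most the first k IDs of a ranked result list.
func topK(ids []string, k int) []string {
	if k < 0 {
		k = 0
	}
	if k < len(ids) {
		return ids[:k]
	}
	return ids
}

// recallAtK is the fraction of the ground truth's top-k IDs
// that also appear in the candidate's top-k.
func recallAtK(truth, got []string, k int) float64 {
	t, g := topK(truth, k), topK(got, k)
	if len(t) == 0 {
		return 0
	}
	seen := make(map[string]struct{}, len(g))
	for _, id := range g {
		seen[id] = struct{}{}
	}
	hits := 0
	for _, id := range t {
		if _, ok := seen[id]; ok {
			hits++
		}
	}
	return float64(hits) / float64(len(t))
}

// withinGate reports whether the Go recall dropped by at most maxDrop
// relative to the Rust baseline (absolute difference, an assumption).
func withinGate(rustRecall, goRecall, maxDrop float64) bool {
	return rustRecall-goRecall <= maxDrop
}

func main() {
	truth := []string{"c1", "c2", "c3", "c4"}
	goHits := []string{"c1", "c3", "c9", "c2"}
	fmt.Printf("recall@4 = %.2f\n", recallAtK(truth, goHits, 4))
	fmt.Println("gate passes:", withinGate(0.90, 0.89, 0.02))
}
```

In the real calibration, recall would be averaged over the whole fixed query set for both implementations before the gate is applied.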
Persistence format: TBD. coder/hnsw has its own snapshot format, so
the Go equivalent of ADR-008 (Parquet for embeddings + sidecar HNSW file)
needs revisiting to confirm the sidecar format we ship.
Acceptance gates:
- G3.2.A: Build HNSW from a Parquet of 100K vectors in <60s
- G3.2.B: Search 100K vectors at k=10 in <50ms p50
- G3.2.C: Recall@10 within 2% of Rust baseline on lakehouse_arch_v1
§3.3 — UI (HTMX)
Approach: server-rendered Go templates using html/template,
HTMX for partial-page swaps, Alpine.js for client-side interactivity
where needed. Single binary serves API + UI.
Acceptance gates:
- G3.3.A (Ask tab): type natural-language question, get answer from RAG endpoint, render in-page without full reload
- G3.3.B (Explore tab): paginated dataset list with hot-swap badge rendering
- G3.3.C (SQL tab): textarea → submit → tabular result rendered in-page
- G3.3.D (System tab): live tail of /storage/errors and /hnsw/trials via HTMX polling
Fallback if HTMX feels limiting: split repo golangLAKEHOUSE-ui
with Vite + React, served as static files by Go gateway. Costs an
extra repo + build chain.
§3.4 — Pathway memory port
Constraint: the Rust pathway_memory and TS implementations were
byte-matching by ADR-021. The byte contract was verified by running
both implementations on the same input tokens and asserting matching
bucket indices.
Go port plan:
- Port the 32-bucket SHA256-keyed token hash exactly. Verify on a golden input that Go produces the same bucket vector as Rust.
- Port the JSON state file format verbatim: the existing 88 traces in data/_pathway_memory/state.json reload as-is into the Go implementation.
- Port the matrix-correctness layer (ADR-021's SemanticFlag, BugFingerprint, TypeHint). These are pure value types, trivially portable.
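The shape of the bucket-hash port can be sketched as below. The digest-to-bucket mapping shown here (first four bytes of SHA-256, big-endian, modulo 32) is an assumption made for illustration only; the real mapping, token normalization, and counter width must be copied from the Rust pathway_memory source, and only the golden-input comparison proves the port.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const numBuckets = 32

// bucketVector counts tokens per bucket.
// ASSUMPTION: bucket index = first 4 bytes of SHA-256(token), big-endian, mod 32.
// Replace with the exact rule from the Rust implementation before relying on it.
func bucketVector(tokens []string) [numBuckets]uint32 {
	var v [numBuckets]uint32
	for _, tok := range tokens {
		sum := sha256.Sum256([]byte(tok))
		v[binary.BigEndian.Uint32(sum[:4])%numBuckets]++
	}
	return v
}

func main() {
	// Hypothetical golden input; the real one should be shared with the Rust test.
	golden := []string{"forklift", "operator", "OSHA-10", "IL"}
	fmt.Println(bucketVector(golden))
}
```

Returning a fixed-size array makes the byte-match check a plain == comparison against a vector recorded from the Rust run.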
Acceptance gates:
- G3.4.A: Load existing state.json, run replay on the same 11 prior successful pathways, all 11 succeed (matching the Rust 11/11 baseline).
- G3.4.B: Bucket vector for a fixed test input byte-matches the Rust output.
§4. Phase plan
Phase G0 — Skeleton (Week 1–3)
Scope: smallest end-to-end ingest + query path working in Go.
| Component | Deliverable |
|---|---|
| cmd/gateway | HTTP on :3100, /health, /v1/chat proxy stub |
| cmd/catalogd | In-memory registry + Parquet manifest persistence |
| cmd/storaged | Single-bucket S3 / local FS, no error journal yet |
| cmd/ingestd | CSV → Parquet, schema inference, register-on-ingest |
| cmd/queryd | DuckDB-backed POST /sql endpoint |
Acceptance: upload a CSV via POST /ingest, query it via
POST /sql with a SELECT, get rows back. Single-bucket. No vector,
no profile, no UI.
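The schema-inference step of the ingest path can be sketched with the standard library alone. This is a simplified illustration: it assumes a header row, reads the whole file into memory, and picks the narrowest of four types per column. The type names and the Column struct are hypothetical, and the Parquet write is out of scope here.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// Column is a hypothetical inferred-schema entry.
type Column struct{ Name, Type string }

// inferType returns the narrowest of int64, float64, bool, or string
// that parses every value in the column.
func inferType(values []string) string {
	isInt, isFloat, isBool := true, true, true
	for _, v := range values {
		if _, err := strconv.ParseInt(v, 10, 64); err != nil {
			isInt = false
		}
		if _, err := strconv.ParseFloat(v, 64); err != nil {
			isFloat = false
		}
		if _, err := strconv.ParseBool(v); err != nil {
			isBool = false
		}
	}
	switch {
	case len(values) == 0:
		return "string"
	case isInt:
		return "int64"
	case isFloat:
		return "float64"
	case isBool:
		return "bool"
	default:
		return "string"
	}
}

// inferSchema treats the first record as the header and infers one type per column.
// encoding/csv rejects ragged rows by default, so indexing row[i] is safe.
func inferSchema(data string) ([]Column, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, fmt.Errorf("empty CSV")
	}
	header, rows := records[0], records[1:]
	cols := make([]Column, len(header))
	for i, name := range header {
		values := make([]string, 0, len(rows))
		for _, row := range rows {
			values = append(values, row[i])
		}
		cols[i] = Column{Name: name, Type: inferType(values)}
	}
	return cols, nil
}

func main() {
	cols, err := inferSchema("name,age,rate\nAda,36,21.5\nLin,41,19\n")
	if err != nil {
		fmt.Println("inference failed:", err)
		return
	}
	fmt.Println(cols) // [{name string} {age int64} {rate float64}]
}
```

A production version would sample rather than read everything, and would need a policy for empty cells and nulls, which this sketch does not address.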
Phase G1 — Vector + RAG (Week 4–6)
| Component | Deliverable |
|---|---|
| cmd/vectord | Embed-on-ingest (calls Python sidecar), HNSW build, POST /search |
| cmd/gateway | Add POST /rag (embed → search → retrieve → generate via aibridge) |
| cmd/aibridge | HTTP client to existing Python sidecar |
Acceptance: ingest 15K resumes (the original Phase 7 fixture), ask "find me a forklift operator with OSHA-10 in IL", get ranked results with LLM-generated explanation grounded in the retrieved chunks.
Phase G2 — Federation + profiles (Week 7–8)
| Component | Deliverable |
|---|---|
| cmd/storaged | Multi-bucket registry, rescue bucket, error journal at primary://_errors/ |
| Profile system | Per-reader profile bound to bucket + vector index |
| Hot-swap | Atomic pointer swap for index generations |
Acceptance: two profiles bound to two buckets, queries scoped correctly, hot-swap a vector index without query interruption, rollback works.
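The "atomic pointer swap" deliverable can be sketched with sync/atomic (Go 1.19+). The Index and Swappable types are hypothetical, and the sketch assumes a single writer performs swaps and rollbacks while any number of readers call Load concurrently.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Index is a stand-in for one immutable generation of a vector index.
type Index struct {
	Generation int
}

// Swappable holds the live index plus the previous one for rollback.
// ASSUMPTION: only one goroutine calls Swap or Rollback at a time.
type Swappable struct {
	current  atomic.Pointer[Index]
	previous atomic.Pointer[Index]
}

// Load returns the live generation without taking a lock.
func (s *Swappable) Load() *Index { return s.current.Load() }

// Swap publishes next and remembers the generation it replaced.
func (s *Swappable) Swap(next *Index) {
	s.previous.Store(s.current.Swap(next))
}

// Rollback restores the previous generation, if there is one.
func (s *Swappable) Rollback() bool {
	prev := s.previous.Load()
	if prev == nil {
		return false
	}
	s.current.Store(prev)
	return true
}

func main() {
	var s Swappable
	s.Swap(&Index{Generation: 1})
	s.Swap(&Index{Generation: 2})
	fmt.Println("live:", s.Load().Generation) // live: 2
	s.Rollback()
	fmt.Println("after rollback:", s.Load().Generation) // after rollback: 1
}
```

Because each generation is immutable once published, a query that loaded the old pointer finishes against the old index, which is what makes the swap invisible to in-flight queries.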
Phase G3 — Pathway memory + distillation (Week 9–11)
| Component | Deliverable |
|---|---|
| cmd/vectord | Pathway memory module ported, 88 traces reloaded |
| Distillation pipeline | SFT export, contamination firewall, scorer |
| Audit baselines | audit_baselines.jsonl longitudinal signal port |
Acceptance: replay 11 prior successful pathways, all 11 succeed. Re-run distillation acceptance on the frozen fixture set, 22/22 pass.
Phase G4 — TS surfaces → Go (Week 12–14)
| Component | Deliverable |
|---|---|
| cmd/mcp | MCP server (replaces Bun): /v1/chat, intelligence endpoints |
| cmd/observer | Autonomous iteration loop, op recording |
| cmd/auditor | PR audit pipeline (kimi/haiku/opus rotation) |
| cmd/scrum | Scrum master pipeline (replaces TS) |
Acceptance: open a test PR, auditor cycles within 90s, emits
verdict to data/_auditor/kimi_verdicts/, behavior matches Rust+TS
era within tolerance.
Phase G5 — UI + demo parity (Week 15–16)
| Component | Deliverable |
|---|---|
| cmd/gateway | Serves HTMX templates + static demo HTML |
| Demo at devop.live/lakehouse/ | Parity with current Bun demo |
| Staffer console at /console | Parity |
Acceptance: devop.live/lakehouse/ cuts over from Bun to Go
gateway. Section ① / ② / ③ all render. Compact contract cards still
expand with Project Index. Fill-probability bars still paint.
§5. Repo layout
golangLAKEHOUSE/
├── docs/
│ ├── PRD.md ← this PRD
│ ├── SPEC.md ← this spec
│ ├── DECISIONS.md ← Go-era ADRs (start fresh, reference Rust ADRs by number)
│ └── ADR-XXX-*.md ← per-ADR detail
├── cmd/
│ ├── gateway/ ← main HTTP/gRPC ingress
│ ├── catalogd/
│ ├── storaged/
│ ├── queryd/
│ ├── ingestd/
│ ├── vectord/
│ ├── journald/
│ ├── mcp/
│ ├── observer/
│ ├── auditor/
│ └── scrum/
├── internal/ ← shared packages, not exported
│ ├── aibridge/
│ ├── validator/
│ ├── truth/
│ ├── shared/
│ ├── proto/ ← generated protobuf
│ └── pathway/
├── pkg/ ← public Go packages (none initially)
├── web/ ← UI (HTMX templates + static)
│ ├── templates/
│ └── static/
├── scripts/ ← cold-start, smoke, distill scripts
├── tests/ ← golden files, integration tests
├── go.mod
├── go.sum
└── README.md
Single Go module. All commands and internal packages live under
golangLAKEHOUSE/. No nested modules unless a package needs an
independent release cadence (none expected).
Build: go build ./cmd/... produces all binaries.
§6. Migration data plan
What ports verbatim
- Parquet datasets at data/datasets/*.parquet: read by Go directly.
- Catalog manifests: Parquet, ports as data not code.
- Pathway memory state: JSON, ports if §3.4 byte-matching gate passes.
What rebuilds
- HNSW indexes — rebuild from Parquet embeddings on first Go startup.
- Auditor verdicts on PRs — old PRs won't be re-audited; lineage starts fresh on the new repo's PRs.
What's archived
- The Rust crates/ tree: preserved in the original repo at the cutover commit, tagged pre-go-rewrite-2026-04-28 for reference.
- TS surfaces (mcp-server/, auditor/, etc.): preserved in the original repo at the same tag.
- Distillation v1.0.0 substrate (tag distillation-v1.0.0, e7636f2): kept as the historical reference; the Go re-implementation ports the LOGIC but not the bit-identical-reproducibility property unless an ADR re-establishes it.
What's discarded
- crates/vectord-lance/ (Lance backend, see PRD §Hard problems §2)
- crates/lance-bench/ (criterion benchmarks specific to Lance)
§7. Acceptance: when is the rewrite done?
The Go Lakehouse reaches feature parity when:
- All 12 Rust PRD invariants hold (object-storage source of truth, catalog metadata authority, idempotent ingest, hot-swap atomicity, profiles, etc.).
- The 16 distillation acceptance gates pass (re-run ./scripts/distill audit-full against the Go pipeline).
- The 22/22 acceptance fixtures from tests/fixtures/distillation/acceptance/ pass under the Go implementation.
- The 145 unit tests of distillation v1.0.0 are ported and pass.
- devop.live/lakehouse/ demo cuts over to Go gateway with no visible UI regressions.
- Auditor emits Kimi/Haiku/Opus verdicts on a test PR, matching the cross-lineage rotation behavior.
- The 88 pathway traces replay with 11/11 prior successes reproduced.
At that point the Rust repo enters maintenance-only mode (security fixes), and the Go repo becomes the live system.
§8. Ratified — Phase G0 unblocked (2026-04-28, J)
| # | Decision | Spec impact |
|---|---|---|
| 1 | DuckDB via cgo (marcboeker/go-duckdb) | §3.1 option A: proceed |
| 2 | HTMX + html/template + Alpine.js | §3.3 option A: proceed |
| 3 | git.agentview.dev/profit/golangLAKEHOUSE | repo location locked |
| 4 | Distillation rebuilt in Go (no bit-identical port) | §6: port logic, not fixtures |
| 5 | Pathway memory starts empty; old traces noted | §3.4 G3.4.A is now "build initial state from scratch in Phase G3"; G3.4.B (byte-match) preserved as the porting correctness gate when the algorithm is reimplemented |
| 6 | Auditor longitudinal signal restarts | new audit_baselines.jsonl lineage starts on first Go-era PR |
See docs/DECISIONS.md ADR-001 for full rationale and
docs/RUST_PATHWAY_MEMORY_NOTE.md for where the legacy 88 traces live.
Phase G0 is now unblocked. Next step: bootstrap the Go module skeleton + push to Gitea, then begin §4 Phase G0 implementation.