Claw f07668064e docs: seed PRD + SPEC for the Go-direction rewrite
Two documents only — no Go code yet. PRD restates the problem and
preserves the Rust PRD's invariants verbatim, then maps the locked
stack to Go libraries and surfaces four hard problems (DuckDB-via-cgo
for the query engine, Lance dropped, Dioxus → HTMX, arrow-go maturity).
SPEC walks each Rust crate + TS surface and tags the port with library
choice / effort estimate / risk + a 5-phase migration plan from
skeleton (Phase G0) to demo parity (Phase G5).

Six open questions remain that gate Phase G0:
- DuckDB cgo OK?
- HTMX vs React for the UI?
- Repo location?
- Distillation v1.0.0 port verbatim or rebuild?
- Pathway memory data — port 88 traces or start clean?
- Auditor lineage — port audit_baselines.jsonl or restart?

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 06:35:23 -05:00

# SPEC: Lakehouse-Go Component Port Plan
**Status:** DRAFT — companion to `PRD.md`. Component-by-component port
plan with library choices, effort estimates, and acceptance gates.

**Created:** 2026-04-28

**Owner:** J

This spec answers: for each piece of the Rust Lakehouse, what Go
library carries it, what the effort looks like, and what gate proves
the port is real.

Effort scale (one engineer-week = ~40h focused work):
- **S** — 1–3 days
- **M** — 1 engineer-week
- **L** — 2–3 engineer-weeks
- **XL** — 1+ months
- **HARD** — open research, see PRD §Hard problems
---
## §1. Component port table — Rust crates
| Crate | Rust deps that mattered | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| `gateway` | axum, tokio, tonic, tower | `cmd/gateway` | `chi` + stdlib `net/http` + `google.golang.org/grpc` | **L** | low — Go's strongest domain |
| `catalogd` | parquet-rs, arrow, sqlite | `cmd/catalogd` | `arrow-go/v15`, `mattn/go-sqlite3` | **L** | low |
| `storaged` | object_store, aws-sdk | `cmd/storaged` | `aws-sdk-go-v2`, `minio-go` for MinIO-specific paths | **M** | low |
| `queryd` | datafusion, arrow | `cmd/queryd` | `marcboeker/go-duckdb` (cgo) | **HARD** | high — see §3 |
| `ingestd` | csv, json, lopdf, postgres | `cmd/ingestd` | stdlib `encoding/csv`, `encoding/json`, `pdfcpu/pdfcpu`, `jackc/pgx/v5` | **L** | low |
| `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `arrow-go/v15` | **L** | medium — re-validate HNSW recall |
| `vectord-lance` | lance | **DROPPED** | n/a | n/a | n/a — Parquet+HNSW only |
| `journald` | parquet, arrow | `cmd/journald` | `arrow-go/v15` | **M** | low |
| `aibridge` | reqwest | library | `net/http` + connection pool | **S** | low |
| `validator` | parquet, custom | library | `arrow-go/v15` parquet reader | **M** | low — port the 24 unit tests as gates |
| `truth` | tomli, custom DSL | library | `pelletier/go-toml/v2` | **M** | low |
| `proto` | tonic-build | `proto/` + `protoc-gen-go` | `buf` + `protoc-gen-go-grpc` | **S** | low |
| `shared` | serde, anyhow | library | stdlib `encoding/json`, `errors` | **S** | low |
| `ui` | dioxus, wasm | **REPLACED** | `html/template` + HTMX | **L** | medium — see §3 |
| `lance-bench` | criterion | n/a — dropped with Lance | n/a | n/a | n/a |

**Total Rust crate port effort:** ~12–18 engineer-weeks (3–4 months for
one engineer; 6–8 weeks for two).
---
## §2. Component port table — TypeScript surfaces
| TS surface | Current location | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| `mcp-server/index.ts` | Bun, :3700 | `cmd/mcp` | `mark3labs/mcp-go` (Go MCP SDK) | **L** | medium — MCP semantics |
| `mcp-server/observer.ts` | Bun, :3800 | `cmd/observer` | stdlib `net/http`, `slog` | **M** | low |
| `mcp-server/tracing.ts` | Bun, Langfuse client | library | `go.opentelemetry.io/otel` + Langfuse Go client (or hand-roll) | **M** | low — Langfuse Go OSS support varies |
| `auditor/*.ts` | TS, runs as systemd | `cmd/auditor` | stdlib + `gitea API client` | **L** | medium — auditor cross-lineage logic is intricate |
| `tests/real-world/scrum_master_pipeline.ts` | TS, ad-hoc | `cmd/scrum` | stdlib | **L** | medium — chunking + embed + ladder logic |
| `tests/real-world/scrum_applier.ts` | TS, ad-hoc | `cmd/scrum-apply` | stdlib + git CLI shell-out | **M** | medium |
| `bot/propose.ts` | TS | `cmd/bot` | stdlib | **S** | low |
| Search demo HTML/JS | static | static (no port) | n/a | n/a | n/a — copied as-is |

**Total TS port effort:** ~6–10 engineer-weeks.
---
## §3. Hard problem details
### §3.1 — Query engine (DuckDB via cgo)
**Library:** `marcboeker/go-duckdb` — Go bindings via cgo.
**API shape** (replaces the DataFusion `SessionContext` pattern):
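```go
// Needs imports "database/sql" and _ "github.com/marcboeker/go-duckdb".
// Error handling is elided here; production code must check every error.
db, _ := sql.Open("duckdb", "") // empty DSN opens an in-memory database
defer db.Close()
// Reading s3:// paths relies on DuckDB's httpfs extension plus credentials.
db.Exec("CREATE VIEW workers AS SELECT * FROM read_parquet('s3://bucket/workers/*.parquet')")
rows, _ := db.Query("SELECT role, count(*) FROM workers WHERE state='IL' GROUP BY role")
defer rows.Close()
```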
**Acceptance gates:**
- G3.1.A — `SELECT * FROM read_parquet('workers_500k.parquet') LIMIT 1`
returns a row with the expected schema. Establishes Parquet read
works.
- G3.1.B — Hybrid SQL+vector query (the `POST /vectors/hybrid`
surface) returns same workers as the Rust path on the same input,
ranked the same way modulo embedding precision.
- G3.1.C — Hot-cache merge-on-read: register a base table + a delta
Parquet, query, observe both rows merged with the delta winning on
conflict.
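
The merge semantics that G3.1.C checks can be sketched in plain Go, independent of DuckDB. This is a minimal model, not the query-engine implementation: the `Row` type, its fields, and the choice of `ID` as the conflict key are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Row is a hypothetical worker record; ID is assumed to be the merge key.
type Row struct {
	ID    int
	Role  string
	State string
}

// mergeOnRead overlays delta rows on base rows. When both sides contain
// the same ID, the delta row wins. Output is sorted by ID so that results
// are deterministic.
func mergeOnRead(base, delta []Row) []Row {
	byID := make(map[int]Row, len(base)+len(delta))
	for _, r := range base {
		byID[r.ID] = r
	}
	for _, r := range delta {
		byID[r.ID] = r // delta overwrites base on conflict
	}
	out := make([]Row, 0, len(byID))
	for _, r := range byID {
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	base := []Row{{1, "picker", "IL"}, {2, "forklift", "IL"}}
	delta := []Row{{2, "forklift-lead", "IL"}, {3, "packer", "WI"}}
	for _, r := range mergeOnRead(base, delta) {
		fmt.Println(r.ID, r.Role, r.State)
	}
}
```

A gate test could compare the rows returned by the DuckDB-backed path against this reference merge on a small fixture.
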
**Fallback if cgo is rejected:** run DuckDB as an external process
(`duckdb -json -c '...'` shelled or HTTP via a thin Go wrapper). Adds
operational surface; preserves SQL model.
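
A rough sketch of the shelled-out fallback follows. It assumes a `duckdb` binary on `PATH` and that `-json` output is a single JSON array with one object per row; `queryCLI` and `parseRows` are hypothetical helper names, not part of any existing package.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"os/exec"
	"time"
)

// parseRows decodes CLI output. Assumption: `duckdb -json` prints one JSON
// array of objects. Blank output is treated as an empty result set.
func parseRows(out []byte) ([]map[string]any, error) {
	if len(bytes.TrimSpace(out)) == 0 {
		return nil, nil
	}
	var rows []map[string]any
	if err := json.Unmarshal(out, &rows); err != nil {
		return nil, fmt.Errorf("decode duckdb output: %w", err)
	}
	return rows, nil
}

// queryCLI runs one SQL statement through the duckdb binary. The context
// bounds how long the child process may run.
func queryCLI(ctx context.Context, sql string) ([]map[string]any, error) {
	out, err := exec.CommandContext(ctx, "duckdb", "-json", "-c", sql).Output()
	if err != nil {
		return nil, fmt.Errorf("run duckdb: %w", err)
	}
	return parseRows(out)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	rows, err := queryCLI(ctx, "SELECT 42 AS answer")
	if err != nil {
		fmt.Println("query failed:", err) // for example, duckdb is not installed
		return
	}
	fmt.Println(rows)
}
```

Note that `encoding/json` decodes every number as `float64` when the target is `map[string]any`, so large integer columns would need `json.Decoder.UseNumber` or typed structs.
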
### §3.2 — HNSW index
**Library:** `coder/hnsw` — pure-Go HNSW, in-process. Supports add /
delete / search / persist.
**Open question:** does `coder/hnsw` match the recall@10 we measured
on the Rust `hora` path? Need a calibration test:
- Rebuild `lakehouse_arch_v1` (the 1086-chunk arch corpus) in Go.
- Compare recall@10 on a fixed query set to the Rust baseline.
- Acceptance: ≤2% drop or we switch library / parameters.
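
The calibration arithmetic might look like the sketch below. It assumes ground-truth neighbours come from an exact (brute-force) search, that result IDs are unique strings, and that "≤2% drop" means two absolute percentage points; none of these are stated in the spec, so adjust as needed.

```go
package main

import "fmt"

// recallAtK returns the fraction of the true top-k IDs that appear in the
// first k results of got. Both slices are ranked, best first.
func recallAtK(truth, got []string, k int) float64 {
	if k > len(truth) {
		k = len(truth)
	}
	if k <= 0 {
		return 0
	}
	want := make(map[string]bool, k)
	for _, id := range truth[:k] {
		want[id] = true
	}
	if len(got) > k {
		got = got[:k]
	}
	hits := 0
	for _, id := range got {
		if want[id] {
			hits++
			delete(want, id) // count each true neighbour at most once
		}
	}
	return float64(hits) / float64(k)
}

// withinDrop reports whether the Go recall is no more than maxDrop below
// the Rust baseline (assumption: absolute difference, e.g. 0.02).
func withinDrop(goRecall, rustRecall, maxDrop float64) bool {
	return rustRecall-goRecall <= maxDrop
}

func main() {
	truth := []string{"c1", "c2", "c3", "c4"}
	got := []string{"c1", "c9", "c3", "c4"}
	r := recallAtK(truth, got, 4)
	fmt.Printf("recall@4=%.2f\n", r)
	fmt.Println("gate passes:", withinDrop(r, 0.80, 0.02))
}
```

In practice the per-query recall would be averaged over the fixed query set before being compared with the Rust figure.
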
**Persistence format:** TBD — `coder/hnsw` has its own snapshot format;
ADR equivalent of ADR-008 (Parquet for embeddings + sidecar HNSW file)
needs revisiting in Go to confirm the sidecar format we ship.
**Acceptance gates:**
- G3.2.A — Build HNSW from a Parquet of 100K vectors in <60s
- G3.2.B – Search 100K vectors at k=10 in <50ms p50
- G3.2.C – Recall@10 within 2% of Rust baseline on
`lakehouse_arch_v1`
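
For the latency gate G3.2.B, a p50 measurement could be taken roughly as below. The nearest-rank percentile definition and the `timeSearches` helper are assumptions; the `search` callback stands in for whatever the real index exposes.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of samples using
// the nearest-rank method. Samples are latencies in milliseconds.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...) // leave the caller's slice untouched
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// timeSearches calls search once per query and returns each call's
// wall-clock latency in milliseconds.
func timeSearches(queries [][]float32, search func(q []float32)) []float64 {
	out := make([]float64, 0, len(queries))
	for _, q := range queries {
		start := time.Now()
		search(q)
		out = append(out, float64(time.Since(start).Microseconds())/1000)
	}
	return out
}

func main() {
	// Stand-in for the real index lookup; replace with the HNSW search call.
	fakeSearch := func(q []float32) { time.Sleep(time.Millisecond) }
	lat := timeSearches(make([][]float32, 20), fakeSearch)
	p50 := percentile(lat, 50)
	fmt.Printf("p50=%.2fms pass=%v\n", p50, p50 < 50)
}
```
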
### §3.3 — UI (HTMX)
**Approach:** server-rendered Go templates using `html/template`,
HTMX for partial-page swaps, Alpine.js for client-side interactivity
where needed. Single binary serves API + UI.
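
As an illustration of the partial-swap pattern, the handler below renders only an HTML fragment. A form carrying the HTMX attributes `hx-post="/ui/ask"`, `hx-target="#answer"` and `hx-swap="outerHTML"` would replace the matching element with the response. The route `/ui/ask`, the `Answer` type, and the placeholder answer text are assumptions; a real handler would call the RAG endpoint.

```go
package main

import (
	"fmt"
	"html/template"
	"net/http"
	"net/http/httptest"
	"strings"
)

// answerTmpl renders only the fragment that HTMX swaps into the page.
// html/template escapes user-supplied values automatically.
var answerTmpl = template.Must(template.New("answer").Parse(
	`<div id="answer"><h3>{{.Question}}</h3><p>{{.Text}}</p></div>`))

// Answer is a hypothetical view model for the Ask tab.
type Answer struct{ Question, Text string }

// renderAnswer executes the fragment template into a string.
func renderAnswer(a Answer) (string, error) {
	var sb strings.Builder
	if err := answerTmpl.Execute(&sb, a); err != nil {
		return "", err
	}
	return sb.String(), nil
}

// handleAsk is a hypothetical handler for POST /ui/ask.
func handleAsk(w http.ResponseWriter, r *http.Request) {
	q := r.FormValue("q")
	// Placeholder text: the real answer would come from the RAG endpoint.
	frag, err := renderAnswer(Answer{Question: q, Text: "(answer goes here)"})
	if err != nil {
		http.Error(w, "render failed", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	fmt.Fprint(w, frag)
}

func main() {
	// Exercise the handler in-process instead of starting a server.
	req := httptest.NewRequest("POST", "/ui/ask", strings.NewReader("q=who+has+OSHA-10"))
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	rec := httptest.NewRecorder()
	handleAsk(rec, req)
	fmt.Println(rec.Code, rec.Body.String())
}
```
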
**Acceptance gates:**
- G3.3.A – `Ask` tab: type natural-language question, get answer
from RAG endpoint, render in-page without full reload
- G3.3.B – `Explore` tab: paginated dataset list with hot-swap badge
rendering
- G3.3.C – `SQL` tab: textarea submit → tabular result rendered
in-page
- G3.3.D – `System` tab: live tail of `/storage/errors` and
`/hnsw/trials` via HTMX polling
**Fallback if HTMX feels limiting:** split repo `golangLAKEHOUSE-ui`
with Vite + React, served as static files by Go gateway. Costs an
extra repo + build chain.
### §3.4 — Pathway memory port
**Constraint:** the Rust `pathway_memory` and TS implementations were
byte-matching by ADR-021. The byte contract was verified by running
both implementations on the same input tokens and asserting matching
bucket indices.
**Go port plan:**
- Port the 32-bucket SHA256-keyed token hash exactly. Verify on a
golden input that Go produces the same bucket vector as Rust.
- Port the JSON state file format verbatim: the existing 88 traces in
`data/_pathway_memory/state.json` reload as-is into the Go
implementation.
- Port the matrix-correctness layer (ADR-021's `SemanticFlag`,
`BugFingerprint`, `TypeHint`); these are pure value types,
trivially portable.
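
The shape of a bucket-vector port and its golden check might look like this. The mapping from digest to bucket used here (first four bytes of the SHA-256 digest, big-endian, modulo 32) is purely an assumption for illustration; the actual byte contract is defined by ADR-021 and the Rust source, and the Go port must copy that instead.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const numBuckets = 32

// bucketIndex maps a token to one of 32 buckets.
// ASSUMPTION: first 4 digest bytes, big-endian, mod 32. Replace with the
// exact rule from the Rust implementation before relying on it.
func bucketIndex(token string) int {
	sum := sha256.Sum256([]byte(token))
	return int(binary.BigEndian.Uint32(sum[:4]) % numBuckets)
}

// bucketVector counts how many tokens fall into each bucket.
func bucketVector(tokens []string) [numBuckets]uint32 {
	var v [numBuckets]uint32
	for _, t := range tokens {
		v[bucketIndex(t)]++
	}
	return v
}

func main() {
	tokens := []string{"forklift", "osha-10", "il"}
	v := bucketVector(tokens)
	fmt.Println(v)
	// A golden test would compare v against a vector exported from the
	// Rust implementation for the same tokens; arrays compare with ==.
}
```

Because Go arrays are comparable, the byte-match gate reduces to a single `==` against the golden vector once the real hashing rule is in place.
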
**Acceptance gates:**
- G3.4.A – Load existing `state.json`, run `replay` on the same 11
prior successful pathways, all 11 succeed (matching the Rust 11/11
baseline).
- G3.4.B – Bucket vector for a fixed test input byte-matches the
Rust output.
---
## §4. Phase plan
### Phase G0 — Skeleton (Weeks 1–3)
**Scope:** smallest end-to-end ingest + query path working in Go.
| Component | Deliverable |
|---|---|
| `cmd/gateway` | HTTP on :3100, `/health`, `/v1/chat` proxy stub |
| `cmd/catalogd` | In-memory registry + Parquet manifest persistence |
| `cmd/storaged` | Single-bucket S3 / local FS, no error journal yet |
| `cmd/ingestd` | CSV → Parquet, schema inference, register-on-ingest |
| `cmd/queryd` | DuckDB-backed `POST /sql` endpoint |
**Acceptance:** upload a CSV via `POST /ingest`, query it via
`POST /sql` with a SELECT, get rows back. Single-bucket. No vector,
no profile, no UI.
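
Schema inference for the ingest step could start from something like the sketch below, which uses only the standard library and stops short of writing Parquet. The type names (`int64`, `float64`, `bool`, `string`), the treatment of empty cells as nulls, and the narrowest-type-wins rule are assumptions, not the Rust `ingestd` behaviour.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// Column is a hypothetical inferred column description.
type Column struct{ Name, Type string }

// inferType returns the narrowest type that parses every non-empty value.
// Empty cells are treated as nulls and ignored.
func inferType(values []string) string {
	isInt, isFloat, isBool, seen := true, true, true, false
	for _, v := range values {
		if v == "" {
			continue
		}
		seen = true
		if _, err := strconv.ParseInt(v, 10, 64); err != nil {
			isInt = false
		}
		if _, err := strconv.ParseFloat(v, 64); err != nil {
			isFloat = false
		}
		if l := strings.ToLower(v); l != "true" && l != "false" {
			isBool = false
		}
	}
	switch {
	case !seen:
		return "string"
	case isInt:
		return "int64"
	case isFloat:
		return "float64"
	case isBool:
		return "bool"
	}
	return "string"
}

// inferSchema reads CSV text with a header row and infers one type per column.
func inferSchema(data string) ([]Column, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, fmt.Errorf("empty CSV")
	}
	header, rows := records[0], records[1:]
	cols := make([]Column, len(header))
	for i, name := range header {
		vals := make([]string, len(rows))
		for j, row := range rows {
			vals[j] = row[i] // safe: csv.Reader rejects ragged rows by default
		}
		cols[i] = Column{Name: name, Type: inferType(vals)}
	}
	return cols, nil
}

func main() {
	cols, err := inferSchema("id,rate,active,name\n1,12.5,true,Ann\n2,,false,Bo\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cols)
}
```

The inferred columns would then drive construction of the Arrow schema handed to the Parquet writer.
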
### Phase G1 — Vector + RAG (Weeks 4–6)
| Component | Deliverable |
|---|---|
| `cmd/vectord` | Embed-on-ingest (calls Python sidecar), HNSW build, `POST /search` |
| `cmd/gateway` | Add `POST /rag` (embed → search → retrieve → generate via aibridge) |
| `cmd/aibridge` | HTTP client to existing Python sidecar |
**Acceptance:** ingest 15K resumes (the original Phase 7 fixture),
ask "find me a forklift operator with OSHA-10 in IL", get ranked
results with LLM-generated explanation grounded in the retrieved
chunks.
### Phase G2 — Federation + profiles (Weeks 7–8)
| Component | Deliverable |
|---|---|
| `cmd/storaged` | Multi-bucket registry, rescue bucket, error journal at `primary://_errors/` |
| Profile system | Per-reader profile bound to bucket + vector index |
| Hot-swap | Atomic pointer swap for index generations |
**Acceptance:** two profiles bound to two buckets, queries scoped
correctly, hot-swap a vector index without query interruption,
rollback works.
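
One way to get hot-swap without query interruption is an atomic pointer that readers load lock-free while writers are serialized. The sketch below assumes Go 1.19+ (for `atomic.Pointer`), a hypothetical `Index` type, and a single level of rollback; it is not taken from the Rust implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Index is a hypothetical immutable index generation.
type Index struct {
	Generation int
}

// Swapper publishes index generations. Readers call Load and never block;
// writers (Swap, Rollback) are serialized by mu.
type Swapper struct {
	mu       sync.Mutex
	current  atomic.Pointer[Index]
	previous *Index // guarded by mu
}

// Load returns the live index, or nil if none has been published.
func (s *Swapper) Load() *Index { return s.current.Load() }

// Swap makes next the live index and keeps the old one for rollback.
func (s *Swapper) Swap(next *Index) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.previous = s.current.Swap(next)
}

// Rollback restores the previous generation. It returns false when there
// is nothing to roll back to.
func (s *Swapper) Rollback() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.previous == nil {
		return false
	}
	s.current.Store(s.previous)
	s.previous = nil
	return true
}

func main() {
	var s Swapper
	s.Swap(&Index{Generation: 1})
	s.Swap(&Index{Generation: 2})
	fmt.Println("live:", s.Load().Generation)
	fmt.Println("rolled back:", s.Rollback(), "live:", s.Load().Generation)
}
```

In-flight queries keep using the pointer they already loaded, so an index generation must stay valid until its last reader finishes; the garbage collector handles that as long as generations are treated as immutable.
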
### Phase G3 — Pathway memory + distillation (Weeks 9–11)
| Component | Deliverable |
|---|---|
| `cmd/vectord` | Pathway memory module ported, 88 traces reloaded |
| Distillation pipeline | SFT export, contamination firewall, scorer |
| Audit baselines | `audit_baselines.jsonl` longitudinal signal port |
**Acceptance:** replay 11 prior successful pathways, all 11 succeed.
Re-run distillation acceptance on the frozen fixture set, 22/22 pass.
### Phase G4 — TS surfaces → Go (Weeks 12–14)
| Component | Deliverable |
|---|---|
| `cmd/mcp` | MCP server (replaces Bun): `/v1/chat`, intelligence endpoints |
| `cmd/observer` | Autonomous iteration loop, op recording |
| `cmd/auditor` | PR audit pipeline (kimi/haiku/opus rotation) |
| `cmd/scrum` | Scrum master pipeline (replaces TS) |
**Acceptance:** open a test PR, auditor cycles within 90s, emits
verdict to `data/_auditor/kimi_verdicts/`, behavior matches Rust+TS
era within tolerance.
### Phase G5 — UI + demo parity (Weeks 15–16)
| Component | Deliverable |
|---|---|
| `cmd/gateway` | Serves HTMX templates + static demo HTML |
| Demo at `devop.live/lakehouse/` | Parity with current Bun demo |
| Staffer console at `/console` | Parity |
**Acceptance:** `devop.live/lakehouse/` cuts over from Bun to Go
gateway. Section / / all render. Compact contract cards still
expand with Project Index. Fill-probability bars still paint.
---
## §5. Repo layout
```
golangLAKEHOUSE/
├── docs/
│ ├── PRD.md ← this PRD
│ ├── SPEC.md ← this spec
│ ├── DECISIONS.md ← Go-era ADRs (start fresh, reference Rust ADRs by number)
│ └── ADR-XXX-*.md ← per-ADR detail
├── cmd/
│ ├── gateway/ ← main HTTP/gRPC ingress
│ ├── catalogd/
│ ├── storaged/
│ ├── queryd/
│ ├── ingestd/
│ ├── vectord/
│ ├── journald/
│ ├── mcp/
│ ├── observer/
│ ├── auditor/
│ └── scrum/
├── internal/ ← shared packages, not exported
│ ├── aibridge/
│ ├── validator/
│ ├── truth/
│ ├── shared/
│ ├── proto/ ← generated protobuf
│ └── pathway/
├── pkg/ ← public Go packages (none initially)
├── web/ ← UI (HTMX templates + static)
│ ├── templates/
│ └── static/
├── scripts/ ← cold-start, smoke, distill scripts
├── tests/ ← golden files, integration tests
├── go.mod
├── go.sum
└── README.md
```
**Single Go module.** All commands and internal packages live under
`golangLAKEHOUSE/`. No nested modules unless a package needs an
independent release cadence (none expected).
**Build:** `go build ./cmd/...` produces all binaries.
---
## §6. Migration data plan
### What ports verbatim
- Parquet datasets at `data/datasets/*.parquet`: read by Go directly.
- Catalog manifests: Parquet, ported as data, not code.
- Pathway memory state: JSON, ported if the §3.4 byte-matching gate passes.
### What rebuilds
- HNSW indexes: rebuilt from Parquet embeddings on first Go startup.
- Auditor verdicts on PRs: old PRs won't be re-audited; lineage starts
fresh on the new repo's PRs.
### What's archived
- The Rust `crates/` tree: preserved in the original repo at the
cutover commit, tagged `pre-go-rewrite-2026-04-28` for reference.
- TS surfaces (`mcp-server/`, `auditor/`, etc.): preserved in the
original repo at the same tag.
- Distillation v1.0.0 substrate (`tag distillation-v1.0.0`,
`e7636f2`): kept as the historical reference; the Go re-implementation
ports the LOGIC but not the bit-identical-reproducibility property
unless an ADR re-establishes it.
### What's discarded
- `crates/vectord-lance/` (Lance backend, see PRD §Hard problems §2)
- `crates/lance-bench/` (criterion benchmarks specific to Lance)
---
## §7. Acceptance: when is the rewrite done?
The Go Lakehouse reaches **feature parity** when:
1. **All 12 Rust PRD invariants hold** (object-storage source of truth,
catalog metadata authority, idempotent ingest, hot-swap atomicity,
profiles, etc.).
2. **The 16 distillation acceptance gates pass** (re-run
`./scripts/distill audit-full` against the Go pipeline).
3. **The 22/22 acceptance fixtures from
`tests/fixtures/distillation/acceptance/` pass** under the Go implementation.
4. **The 145 unit tests of distillation v1.0.0 are ported and pass.**
5. **`devop.live/lakehouse/` demo cuts over to Go gateway** with no
visible UI regressions.
6. **Auditor emits Kimi/Haiku/Opus verdicts** on a test PR, matching
the cross-lineage rotation behavior.
7. **The 88 pathway traces replay** with 11/11 prior successes
reproduced.
At that point the Rust repo enters maintenance-only mode (security
fixes), and the Go repo becomes the live system.
---
## §8. Ratified — Phase G0 unblocked (2026-04-28, J)
| # | Decision | Spec impact |
|---|---|---|
| 1 | DuckDB via cgo (`marcboeker/go-duckdb`) | §3.1 option A — proceed |
| 2 | HTMX + `html/template` + Alpine.js | §3.3 option A — proceed |
| 3 | `git.agentview.dev/profit/golangLAKEHOUSE` | repo location locked |
| 4 | Distillation rebuilt in Go (no bit-identical port) | §6 — port logic, not fixtures |
| 5 | Pathway memory starts empty; old traces noted | §3.4 G3.4.A is now "build initial state from scratch in Phase G3"; G3.4.B (byte-match) preserved as the porting correctness gate when the algorithm is reimplemented |
| 6 | Auditor longitudinal signal restarts | new `audit_baselines.jsonl` lineage starts on first Go-era PR |
See `docs/DECISIONS.md` ADR-001 for full rationale and
`docs/RUST_PATHWAY_MEMORY_NOTE.md` for where the legacy 88 traces live.
**Phase G0 is now unblocked.** Next step: bootstrap the Go module
skeleton + push to Gitea, then begin §4 Phase G0 implementation.