# PRD: Lakehouse-Go — Versioned Knowledge Substrate, Reimplemented in Go

**Status:** DRAFT — seed document for the Go-direction rewrite. Supersedes `/home/profit/lakehouse/docs/PRD.md` (Rust) once ratified.
**Created:** 2026-04-28
**Owner:** J
**Sibling:** `SPEC.md` — component-by-component port plan with effort estimates, library choices, and acceptance gates.

---

## Product vision — what we're actually building

**The Go refactor isn't the goal. The goal is a small-model-driven autonomous pipeline that gets better with each run, with frontier models in audit/oversight and humans triaged in only for the genuinely abstract cases.**

The Rust Lakehouse already has most of the pieces:

- **Pathway memory** (`internal/pathway` in Go, 88 Rust traces preserved) — what we tried, what worked
- **Matrix indexer** (SPEC §3.4) — multi-corpus retrieve+merge that gives the small model the right knowledge slice for *this* task
- **Observer** — watches runs, refines configs, escalates
- **Distillation v1.0.0** (`e7636f2`) — turns successful runs into denser playbooks
- **Auditor cross-lineage fabric** — Kimi/Haiku/Opus oversight on small-model outputs

What the Go refactor is FOR: a second-language pass surfaces architectural weaknesses that Rust hid. The pipeline has to pull together cleanly *as a pipeline* — not as 15 crates that happen to interact.

### The five-loop substrate

1. **Knowledge pathway loop** — pathway memory + matrix indexer give the small model context for the task. Pathway answers "what worked last time?"; matrix answers "what's relevant now?"
2. **Execution loop** — small model runs on focused context. Frontier API calls are reserved for audit/escalation, not the inner loop. Cost + rate limits stay sane.
3. **Observer loop** — watches each run, refines the configs (matrix corpus picks, downgrade gate, prompt mold) that got the model to a good pathway. Outputs new config, not new prompt.
4. **Rating + distillation loop** — successful outcomes get scored and folded back into the playbook substrate. The playbook gets denser; the next run starts smarter.
5. **Drift loop** — quantify when the distilled playbook stops matching reality (codebase changed, contracts shifted, profiles updated). Drift is a *measured* signal, not "hope nothing broke."

### The gate

**The playbook + matrix indexer must produce the results we're looking for.** That's the single load-bearing acceptance criterion. Throughput, scaling, code elegance — all secondary. If a deep-field reality test on the 500K corpus surfaces wrong answers, the loop isn't working and we fix that before adding anything else.

### Triage / human-in-loop

Most cases are concrete enough that small-model + pathway + matrix can complete them. Some aren't — they need a human. The system's job is to **identify which is which** and only escalate the second class. Frontier models partially solve this internally with their thinking loops; we're externalizing it so:

- Small models are swappable (vendor independence)
- Drift is measurable (quantitative signal, not vibes)
- Each loop iteration is auditable (the pathway memory IS the audit trail)

This is what the auditor cross-lineage fabric proves out in Rust — Opus auto-promote on diffs >100k chars is the same pattern: triage by signal, not by guesswork.

## Direction pivot — why this PRD exists

The Rust-first Lakehouse (15 crates, ~24 unmerged commits past PR #11, distillation v1.0.0 substrate frozen at `e7636f2`) is being reimplemented in Go on the principle that **anything Go can carry, Go carries**. This is an explicit re-platforming, not a refactor.

### What the rewrite preserves (verbatim from the Rust PRD)

- The **problem statement** — legacy data systems silo information; AI needs both fast analytical queries AND semantic retrieval over unstructured text in one substrate.
- The **two use cases** — staffing analytics (reference implementation) and local AI knowledge substrate (per-profile vector indexes for running models).
- The **shared requirements** — schema-less ingest, SQL at scale, AI-embedding search, hot-swappable indexes, trials-as-data, local-first / no-cloud, repo-rebuildable.
- The **architectural invariants** — object storage as source of truth, catalog as sole metadata authority, hot-swap atomicity, profiles as first-class, playbooks-feed-the-index, errors findable in one HTTP call.

### What the rewrite changes

| Layer | Was (Rust) | Becomes (Go) | Confidence |
|---|---|---|---|
| HTTP gateway | Axum + Tokio | `net/http` + `chi` (or `gin`) | High — Go's bread and butter |
| gRPC | tonic | `google.golang.org/grpc` | High — Go is the reference impl |
| Object store | Apache Arrow `object_store` | `aws-sdk-go-v2/service/s3` + thin wrapper | High |
| Parquet I/O | parquet-rs (arrow-rs) | `apache/arrow-go/v18/parquet` | Medium — arrow-go lags arrow-rs but v18 covers our needs |
| Query engine | DataFusion | **Hard problem** (see §Hard problems) | Low — no like-for-like Go equivalent |
| Vector index (HNSW) | `hora` / hand-rolled | `coder/hnsw` or `Bithack/go-hnsw` (in-process) | High — HNSW is a self-contained algorithm |
| Vector backend (Lance) | `lance` (Rust) | **Hard problem** — likely dropped, Parquet-only | Medium |
| Frontend | Dioxus + WASM | Go `html/template` + HTMX + Alpine, or React/Vite split repo | Medium |
| Concurrency | Tokio async | Goroutines + `context.Context` | High |
| Config | TOML | TOML (`pelletier/go-toml/v2`) | High |
| Secrets | `SecretsProvider` trait | Go interface, same shape | High |
| AI bridge | HTTP client to Python sidecar | Same — Python sidecar stays | High |
| Embedded ML | Python sidecar (kept) | Python sidecar (kept) | n/a |

### What stays Python (and why)

- **Embedding generation, image gen, deepface analysis** — Python's ML ecosystem is genuinely stronger than Go's. The sidecar stays as an HTTP service; the Go gateway calls it the same way the Rust gateway did. No porting work required.
- **Distillation pipeline scoring** — currently TS scripts; can move to Go but not a first-tier priority. Keep TS until the Go gateway is live.

---

## Solution — Go service mesh over S3-compatible object storage

A modular Go service mesh, same architectural shape as the Rust system, with the Python AI sidecar retained as the embedding/generation boundary. Single repo (`golangLAKEHOUSE`), single Go module, multiple binaries built from one workspace.

### Locked stack (Go)

| Layer | Choice | Rationale |
|---|---|---|
| HTTP | `chi` | Idiomatic, middleware-friendly, used by major Go services |
| gRPC | `google.golang.org/grpc` | Reference implementation |
| Protobuf | `protoc-gen-go` + `buf` | Standard tooling |
| Object store | `aws-sdk-go-v2` | Mature, covers S3 + MinIO + RustFS |
| Parquet | `apache/arrow-go/v18` | Columnar I/O + Arrow interop (v18.5.2 — March 2026) |
| SQL engine | **Open** — see §Hard problems §1 | Biggest open decision |
| Vector index | `coder/hnsw` | Pure-Go HNSW, in-process, no external service |
| TOML config | `pelletier/go-toml/v2` | Standard |
| Logging | `log/slog` | Standard library since Go 1.21 |
| Tracing | `go.opentelemetry.io/otel` | Standard |
| Testing | `testing` + `testify` + `golden` files | Standard |
| Frontend | **Open** — `html/template` + HTMX vs separate Vite/React | Hard problem §3 |

No new dependencies without an ADR.

---

## Architecture

Same service decomposition as Rust, same data flow.
Names preserved so the spec, ADRs, and runbooks port semantically:

```
┌─ ingestd ─→ storaged ─→ catalogd ─→ queryd ─┐
│                │            │               │
│                └→ vectord   │               │
│                     │       │               │
└──────── aibridge ──HTTP──→ Python sidecar ──┘
                      │
gateway ─ HTTP/gRPC ──┘
   │
   └→ ui (HTMX or Vite)
```

| Service | Responsibility | Go binary |
|---|---|---|
| **gateway** | HTTP/gRPC ingress, routing, auth | `cmd/gateway` |
| **catalogd** | Metadata control plane, dataset registry | `cmd/catalogd` |
| **storaged** | Object I/O, multi-bucket, error journal | `cmd/storaged` |
| **queryd** | SQL execution over Parquet (engine TBD) | `cmd/queryd` |
| **ingestd** | CSV/JSON/PDF ingest → Parquet | `cmd/ingestd` |
| **vectord** | Embeddings + HNSW index + trial system | `cmd/vectord` |
| **journald** | Append-only mutation event log | `cmd/journald` |
| **aibridge** | HTTP client to Python sidecar | library, linked into gateway |
| **validator** | Production worker/permit validators | library, linked into gateway |
| **mcp** | MCP server (replaces Bun `mcp-server`) | `cmd/mcp` |
| **observer** | Autonomous iteration loop | `cmd/observer` |
| **auditor** | PR audit pipeline (replaces TS auditor) | `cmd/auditor` |

### Invariants (preserved verbatim from Rust PRD)

1. Object storage = source of truth
2. catalogd = sole metadata authority
3. No raw data in catalog — only pointers
4. vectord stores embeddings AS Parquet (portable)
5. ingestd is idempotent
6. Hot cache is performance, not source of truth
7. All services modular and independently replaceable
8. Indexes are hot-swappable (atomic pointer swap, rollback always possible)
9. Every reader gets its own profile
10. Trials are data, not logs
11. Operational failures findable in one HTTP call
12. Playbooks feed the index, not just the log

---

## Hard problems (the ones that don't trivially port)

These four define whether the rewrite is feasible. The spec answers each with a concrete library/approach choice and a fallback.

### 1. Query engine — replacing DataFusion

**Constraint:** DataFusion is the most consequential Rust dependency in the Lakehouse. It powers `queryd`, hybrid SQL+vector search, and hot-cache merge-on-read. Go has no like-for-like equivalent.

**Options:**

- **A. Embed DuckDB via cgo (`marcboeker/go-duckdb`)** — DuckDB reads Parquet natively, supports SQL similar to DataFusion, and has cgo Go bindings. Loses pure-Go portability (cgo required) but preserves the query model.
- **B. Run DuckDB as an external service** — one DuckDB process; Go talks to it via HTTP. Pure-Go gateway, separate-process query layer. Adds an operational surface (one more service to manage).
- **C. Hand-roll a query planner over Arrow** — parse SQL with `xwb1989/sqlparser`, plan over arrow-go RecordBatches, execute. High effort, high risk. Best avoided.
- **D. Postgres + foreign data wrappers** — point Postgres at Parquet via `parquet_fdw`. Mature but introduces a database we said we'd avoid (ADR-001).

**Recommendation:** **Option A (DuckDB via cgo).** Preserves the SQL + columnar + Parquet model, single-binary deploy with cgo, mature. Cgo adds build complexity but is acceptable.

### 2. Lance backend — vectord-lance

**Constraint:** Lance is a Rust-native columnar format with built-in vector indexing. There is no Go port and no FFI binding. ADR-019 designates Lance as a per-profile *secondary* backend; Parquet+HNSW is *primary*.

**Options:**

- **A. Drop Lance entirely.** Parquet+HNSW handles the primary path; Lance was secondary. ADR-019 stays valid for the Rust era; the Go Lakehouse is Parquet-only.
- **B. Keep Lance via FFI/cgo.** Build Lance as a Rust dylib, call from Go via cgo. Reintroduces Rust into the build chain; defeats the point.
- **C. Wait for a Lance Go port.** Doesn't exist; not on the Lance roadmap.

**Recommendation:** **Option A (drop Lance).** The hybrid backend was optional per-profile; Parquet+HNSW carries the primary path. If a specific workload later proves Lance-only, it can be exposed as a Python-sidecar service.

### 3. UI — replacing Dioxus

**Constraint:** Dioxus is a Rust+WASM frontend framework. No Go equivalent exists at the same level of polish. The current `crates/ui` covers the Ask, Explore, SQL, and System tabs.

**Options:**

- **A. `html/template` + HTMX + Alpine.js** — server-rendered Go, partial-page swaps via HTMX. Single repo, minimal JS, fits Go's "boring is good" ethos.
- **B. Separate Vite/React frontend** — `golangLAKEHOUSE-ui` repo, Go gateway serves static files. Modern UX patterns, more dev tooling needed.
- **C. Keep Dioxus + WASM as a build step** — defeats the rewrite.

**Recommendation:** **Option A** for v1; revisit if UI requirements demand React-tier interactivity. The current Lakehouse UIs (`/lakehouse/` demo + staffer console) are mostly server-rendered HTML with vanilla JS — `html/template` + HTMX is a strong fit.

### 4. Arrow ecosystem maturity

**Constraint:** `arrow-go/v18` lags `arrow-rs` in compute kernels, expression APIs, and some compression codecs. Specific known gaps: limited `cast` kernel coverage, no equivalent of `arrow-rs`'s `compute::sort_to_indices` for all dtypes, no Acero-style streaming execution.

**Mitigation:** the Go Lakehouse relies on Arrow primarily for **Parquet I/O + RecordBatch transport**, not for in-process compute (that's DuckDB's job). The narrower scope makes arrow-go's gaps less load-bearing.

**Acceptance gate:** any Arrow API the Go Lakehouse uses must be covered by `arrow-go/v18`. Anything missing → file an upstream issue, implement locally if blocking, contribute back.
---

## Migration strategy

### What ports verbatim

- Problem statement, use cases, requirements
- Architectural invariants (1–12)
- ADRs 001–021 (preserved as design intent; some change implementation)
- Federation building blocks (multi-bucket, error-journal, append-log)

### What rebuilds from data

- HNSW indexes (rebuild from Parquet embeddings — ADR-008 is preserved verbatim)
- Pathway memory state (88 traces in `data/_pathway_memory/state.json` on the Rust side — port the JSON format and reload; the byte-matching contract becomes Go-Go instead of Rust-TS)
- Catalog manifests (Parquet, portable)
- Distillation v1.0.0 substrate (port the SFT/contamination-firewall logic; the fixture-as-gate pattern stays)

### What ships first (port-order outline — see SPEC.md for detail)

1. **Phase G0** — Skeleton: `cmd/gateway`, `cmd/catalogd`, `cmd/storaged`, `cmd/ingestd`. Single-bucket, no auth, CSV→Parquet, query via DuckDB.
2. **Phase G1** — Vector path: `cmd/vectord` with HNSW + RAG endpoint.
3. **Phase G2** — Multi-profile + federation (ADRs 016–017).
4. **Phase G3** — Pathway memory + distillation port.
5. **Phase G4** — MCP server, observer, auditor (TS surfaces → Go).
6. **Phase G5** — UI (HTMX) and demo parity with `devop.live/lakehouse/`.

Detailed acceptance gates are in `SPEC.md`.

### What does NOT migrate

- The Rust crates themselves (archived in the original `lakehouse` repo)
- The TS scrum/auditor pipelines (rewritten in Go in Phase G4)
- The Bun mcp-server (rewritten in Go in Phase G4)
- The Python sidecar (kept as-is, behind aibridge)

---

## Non-goals

- **No port of `vectord-lance`.** The Lance backend is dropped; Parquet+HNSW is the only vector backend.
- **No retention of Rust in the build chain.** No cgo-to-Rust bridges, no FFI to keep specific crates alive. Cgo to **C/C++** (DuckDB) is acceptable.
- **No new feature work during the port.** Feature parity with the Rust Lakehouse at the cutoff commit is the bar; new capabilities defer to post-port phases.
- **No live-migration of running services.** The Rust Lakehouse stops serving when Go reaches feature parity; data moves once via a Parquet re-pointer.

---

## Ratified decisions (2026-04-28, J)

The six gating questions are answered. Phase G0 is unblocked. Full context for each lives in `docs/DECISIONS.md` ADR-001.

| # | Decision |
|---|---|
| 1 | **DuckDB via cgo** — `marcboeker/go-duckdb` is the query engine. Cgo accepted as the cost of a mature SQL+Parquet path. |
| 2 | **HTMX** — server-rendered `html/template` + HTMX + Alpine.js. Single-binary deploy. React is post-G5 if needed. |
| 3 | **Gitea** — repo lives at `git.agentview.dev/profit/golangLAKEHOUSE` (same server as the Rust lakehouse). |
| 4 | **Distillation rebuild in Go** — port the SFT export + contamination firewall logic, but bit-identical reproducibility is a Rust-era property. New Go fixtures, new acceptance gates. |
| 5 | **Pathway memory starts clean** — Go pathway memory begins with zero traces. The existing 88 Rust traces are preserved at `data/_pathway_memory/state.json` in the lakehouse repo as a historical record (see `docs/RUST_PATHWAY_MEMORY_NOTE.md`). |
| 6 | **Auditor longitudinal signal restarts** — `audit_baselines.jsonl` is a Rust-era artifact. The Go auditor begins a fresh drift signal. |