Pre-Phase-G0 research sweep against current Go ecosystem state. Three upstream changes that the day-of SPEC missed:
1. DuckDB Go binding ownership transferred. marcboeker/go-duckdb is deprecated as of v2.5.0; the official maintainer is now github.com/duckdb/duckdb-go/v2 (DuckDB team + Marc Boeker joint hand-off). Current v2.10502.0 / DuckDB v1.5.2. SPEC §3.1 + component table updated.
2. Official Go MCP SDK exists. Switching from mark3labs/mcp-go (community) to github.com/modelcontextprotocol/go-sdk (official, Google collaboration, v1.5.0 stable, 4.4k stars, targets MCP spec 2025-11-25). Component table updated.
3. arrow-go is on v18, not v15. v18.5.2 (March 2026) has parquet encryption fixes relevant for PII-masked safe views. PRD locked stack + SPEC component table updated.
Validated unchanged: coder/hnsw (220 stars, active), chi (still the clean-architecture pick over fiber/gin/echo).
Surfaced for future use: anthropics/anthropic-sdk-go (official, available for direct Claude calls bypassing opencode if ever needed), duckdb-wasm (browser-side analytics future option), IVF as HNSW fallback if recall gate fails.
See docs/RESEARCH_LOG_2026-04-28.md for full survey + sources.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PRD: Lakehouse-Go — Versioned Knowledge Substrate, Reimplemented in Go
Status: DRAFT — seed document for the Go-direction rewrite. Supersedes
/home/profit/lakehouse/docs/PRD.md (Rust) once ratified.
Created: 2026-04-28
Owner: J
Sibling: SPEC.md — component-by-component port plan with effort
estimates, library choices, and acceptance gates.
Direction pivot — why this PRD exists
The Rust-first Lakehouse (15 crates, ~24 unmerged commits past PR #11,
distillation v1.0.0 substrate frozen at e7636f2) is being reimplemented
in Go on the principle that anything Go can carry, Go carries. This
is an explicit re-platforming, not a refactor.
What the rewrite preserves (verbatim from the Rust PRD)
- The problem statement — legacy data systems silo information; AI needs both fast analytical queries AND semantic retrieval over unstructured text in one substrate.
- The two use cases — staffing analytics (reference implementation) and local AI knowledge substrate (per-profile vector indexes for running models).
- The shared requirements — schema-less ingest, SQL at scale, AI-embedding search, hot-swappable indexes, trials-as-data, local-first / no-cloud, repo-rebuildable.
- The architectural invariants — object storage as source of truth, catalog as sole metadata authority, hot-swap atomicity, profiles as first-class, playbooks-feed-the-index, errors findable in one HTTP call.
What the rewrite changes
| Layer | Was (Rust) | Becomes (Go) | Confidence |
|---|---|---|---|
| HTTP gateway | Axum + Tokio | net/http + chi (or gin) | High: Go's bread and butter |
| gRPC | tonic | google.golang.org/grpc | High: Go is the reference impl |
| Object store | Apache Arrow object_store | aws-sdk-go-v2/service/s3 + thin wrapper | High |
| Parquet I/O | parquet-rs (arrow-rs) | apache/arrow-go/v18/parquet | Medium: arrow-go lags arrow-rs but v18 covers our needs |
| Query engine | DataFusion | Hard problem (see §Hard problems) | Low: no like-for-like Go equivalent |
| Vector index (HNSW) | hora / hand-rolled | coder/hnsw or Bithack/go-hnsw (in-process) | High: HNSW is a self-contained algorithm |
| Vector backend (Lance) | lance (Rust) | Hard problem: likely dropped, Parquet-only | Medium |
| Frontend | Dioxus + WASM | Go html/template + HTMX + Alpine, or React/Vite split repo | Medium |
| Concurrency | Tokio async | Goroutines + context.Context | High |
| Config | TOML | TOML (pelletier/go-toml/v2) | High |
| Secrets | SecretsProvider trait | Go interface, same shape | High |
| AI bridge | HTTP client to Python sidecar | Same — Python sidecar stays | High |
| Embedded ML | Python sidecar (kept) | Python sidecar (kept) | n/a |
What stays Python (and why)
- Embedding generation, image gen, deepface analysis — Python's ML ecosystem is genuinely stronger than Go's. The sidecar stays as an HTTP service; the Go gateway calls it the same way the Rust gateway did. No port required.
- Distillation pipeline scoring — current TS scripts; can move to Go but not first-tier priority. Keep TS until Go gateway is live.
Solution — Go service mesh over S3-compatible object storage
A modular Go service mesh, same architectural shape as the Rust system,
with the Python AI sidecar retained as the embedding/generation
boundary. Single repo (golangLAKEHOUSE), single Go module, multiple
binaries built from one workspace.
Locked stack (Go)
| Layer | Choice | Rationale |
|---|---|---|
| HTTP | chi | Idiomatic, middleware-friendly, used by major Go services |
| gRPC | google.golang.org/grpc | Reference implementation |
| Protobuf | protoc-gen-go + buf | Standard tooling |
| Object store | aws-sdk-go-v2 | Mature, covers S3 + MinIO + RustFS |
| Parquet | apache/arrow-go/v18 | Columnar I/O + Arrow interop (v18.5.2, March 2026) |
| SQL engine | Open: see §Hard problems §1 | Biggest open decision |
| Vector index | coder/hnsw | Pure-Go HNSW, in-process, no external service |
| TOML config | pelletier/go-toml/v2 | Standard |
| Logging | log/slog | Standard library since Go 1.21 |
| Tracing | go.opentelemetry.io/otel | Standard |
| Testing | testing + testify + golden files | Standard |
| Frontend | Open: html/template + HTMX vs separate Vite/React | Hard problem §3 |
No new dependencies without an ADR.
Architecture
Same service decomposition as Rust, same data flow. Names preserved so the spec, ADRs, and runbooks port semantically:
    ┌─ ingestd ─→ storaged ─→ catalogd ─→ queryd ─┐
    │                │                            │
    │                └→ vectord                   │
    │                                             │
    └──────── aibridge ──HTTP──→ Python sidecar ──┘
                            │
    gateway ─ HTTP/gRPC ────┘
       │
       └→ ui (HTMX or Vite)
| Service | Responsibility | Go binary |
|---|---|---|
| gateway | HTTP/gRPC ingress, routing, auth | cmd/gateway |
| catalogd | Metadata control plane, dataset registry | cmd/catalogd |
| storaged | Object I/O, multi-bucket, error journal | cmd/storaged |
| queryd | SQL execution over Parquet (engine TBD) | cmd/queryd |
| ingestd | CSV/JSON/PDF ingest → Parquet | cmd/ingestd |
| vectord | Embeddings + HNSW index + trial system | cmd/vectord |
| journald | Append-only mutation event log | cmd/journald |
| aibridge | HTTP client to Python sidecar | library, linked into gateway |
| validator | Production worker/permit validators | library, linked into gateway |
| mcp | MCP server (replaces Bun mcp-server) | cmd/mcp |
| observer | Autonomous iteration loop | cmd/observer |
| auditor | PR audit pipeline (replaces TS auditor) | cmd/auditor |
Invariants (preserved verbatim from Rust PRD)
- Object storage = source of truth
- catalogd = sole metadata authority
- No raw data in catalog — only pointers
- vectord stores embeddings AS Parquet (portable)
- ingestd is idempotent
- Hot cache is performance, not source of truth
- All services modular and independently replaceable
- Indexes are hot-swappable (atomic pointer swap, rollback always possible)
- Every reader gets its own profile
- Trials are data, not logs
- Operational failures findable in one HTTP call
- Playbooks feed the index, not just the log
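The hot-swap invariant (atomic pointer swap, rollback always possible) maps naturally onto `sync/atomic.Pointer` from the standard library (Go 1.19+). The sketch below is one possible shape, not the project's implementation: the `Index` and `Swappable` types are hypothetical, and it assumes a single promoter at a time, since the current/previous pair is updated in two steps.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Index is a stand-in for a built HNSW index; only the version label matters here.
type Index struct {
	Version string
}

// Swappable holds the live index and the one it most recently replaced.
// Readers only ever see a fully built index. Assumes one writer at a time.
type Swappable struct {
	current  atomic.Pointer[Index]
	previous atomic.Pointer[Index]
}

// Load returns the index readers should use right now.
func (s *Swappable) Load() *Index {
	return s.current.Load()
}

// Promote atomically installs next and remembers the old index for rollback.
func (s *Swappable) Promote(next *Index) {
	old := s.current.Swap(next)
	s.previous.Store(old)
}

// Rollback reinstalls the previously live index, if there is one.
func (s *Swappable) Rollback() error {
	prev := s.previous.Swap(nil)
	if prev == nil {
		return errors.New("no previous index to roll back to")
	}
	s.current.Store(prev)
	return nil
}

func main() {
	var s Swappable
	s.Promote(&Index{Version: "v1"})
	s.Promote(&Index{Version: "v2"})
	fmt.Println("live:", s.Load().Version)

	if err := s.Rollback(); err != nil {
		panic(err)
	}
	fmt.Println("after rollback:", s.Load().Version)
}
```

Readers call `Load` without taking a lock, so a swap never blocks in-flight queries; they simply finish against whichever index they loaded.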
Hard problems (the ones that don't trivially port)
These four define whether the rewrite is feasible. Spec answers each with a concrete library/approach choice and a fallback.
1. Query engine — replacing DataFusion
Constraint: DataFusion is the most consequential Rust dependency in
the Lakehouse. It powers queryd, hybrid SQL+vector search, and
hot-cache merge-on-read. Go has no like-for-like equivalent.
Options:
- A. Embed DuckDB via cgo (github.com/duckdb/duckdb-go/v2, the official successor to the deprecated marcboeker/go-duckdb): DuckDB reads Parquet natively, supports SQL similar to DataFusion, has cgo Go bindings. Loses pure-Go portability (cgo required) but preserves the query model.
- B. Run DuckDB as an external service: one DuckDB process, Go talks to it via HTTP. Pure-Go gateway, separate-process query layer. Adds an operational surface (one more service to manage).
- C. Hand-roll a query planner over Arrow: parse SQL with xwb1989/sqlparser, plan over arrow-go RecordBatches, execute. High effort, high risk. Best avoided.
- D. Postgres + foreign data wrappers: point Postgres at Parquet via parquet_fdw. Mature but introduces a database we said we'd avoid (ADR-001).
Recommendation: Option A (DuckDB via cgo). Preserves the SQL + columnar + Parquet model, single-binary deploy with cgo, mature. Cgo adds build complexity but is acceptable.
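As a rough illustration of Option A, queryd could talk to DuckDB through the standard `database/sql` interface and let DuckDB's `read_parquet` table function scan the files. The sketch below assumes the caller has already opened `db` with a registered DuckDB driver (the driver import is left out so the example builds without cgo); `parquetScan`, `CountRows`, and the file path are hypothetical names for illustration.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// parquetScan renders a DuckDB read_parquet([...]) table expression over the
// given paths, doubling any single quote inside a path so it stays a valid
// SQL string literal.
func parquetScan(paths []string) string {
	quoted := make([]string, len(paths))
	for i, p := range paths {
		quoted[i] = "'" + strings.ReplaceAll(p, "'", "''") + "'"
	}
	return "read_parquet([" + strings.Join(quoted, ", ") + "])"
}

// CountRows counts rows across the Parquet files. db is assumed to have been
// opened elsewhere with a DuckDB driver; nothing here is DuckDB-specific
// apart from the SQL text.
func CountRows(ctx context.Context, db *sql.DB, paths []string) (int64, error) {
	var n int64
	err := db.QueryRowContext(ctx, "SELECT count(*) FROM "+parquetScan(paths)).Scan(&n)
	return n, err
}

func main() {
	// Without a driver we only show the SQL that would be sent.
	fmt.Println("SELECT count(*) FROM " + parquetScan([]string{"data/workers/part-0.parquet"}))
}
```

Keeping the engine behind `database/sql` means the cgo dependency stays confined to the binary that imports the driver.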
2. Lance backend — vectord-lance
Constraint: Lance is a Rust-native columnar format with built-in vector indexing. There is no Go port and no FFI binding. ADR-019 designates Lance as a per-profile secondary backend; Parquet+HNSW is primary.
Options:
- A. Drop Lance entirely. Parquet+HNSW handles primary path; Lance was secondary. ADR-019 stays valid for the Rust era; Go Lakehouse is Parquet-only.
- B. Keep Lance via FFI/cgo. Build Lance as a Rust dylib, call from Go via cgo. Reintroduces Rust into the build chain; defeats the point.
- C. Wait for Lance Go port. Doesn't exist; not on Lance roadmap.
Recommendation: Option A (drop Lance). The hybrid backend was optional per-profile; Parquet+HNSW carries the primary path. If a specific workload later proves Lance-only, it can be exposed as a Python-sidecar service.
3. UI — replacing Dioxus
Constraint: Dioxus is a Rust+WASM frontend framework. No Go
equivalent at the same level of polish. The current crates/ui covers
Ask, Explore, SQL, System tabs.
Options:
- A. html/template + HTMX + Alpine.js: server-rendered Go, partial-page swaps via HTMX. Single repo, minimal JS, fits Go's "boring is good" ethos.
- B. Separate Vite/React frontend: golangLAKEHOUSE-ui repo, Go gateway serves static files. Modern UX patterns, more dev tooling needed.
- C. Keep Dioxus + WASM as a build step: defeats the rewrite.
Recommendation: Option A for v1; revisit if UI requirements
demand React-tier interactivity. The current Lakehouse UIs (/lakehouse/
demo + staffer console) are mostly server-rendered HTML with vanilla
JS — html/template + HTMX is a strong fit.
4. Arrow ecosystem maturity
Constraint: arrow-go/v18 lags arrow-rs in compute kernels,
expression APIs, and some compression codecs. Specific gaps known:
limited cast kernel coverage, no equivalent of arrow-rs's
compute::sort_to_indices for all dtypes, no Acero-style streaming
execution.
Mitigation: the Go Lakehouse relies on Arrow primarily for Parquet I/O + RecordBatch transport, not for in-process compute (that's DuckDB's job). The narrower scope makes arrow-go's gaps less load-bearing.
Acceptance gate: any Arrow API the Go Lakehouse uses must be
covered by arrow-go/v18. Anything missing → file an upstream issue,
implement locally if blocking, contribute back.
Migration strategy
What ports verbatim
- Problem statement, use cases, requirements
- Architectural invariants (1–12)
- ADRs 001–021 (preserved as design intent; some change implementation)
- Federation building blocks (multi-bucket, error-journal, append-log)
What rebuilds from data
- HNSW indexes (rebuild from Parquet embeddings — ADR-008 is verbatim preserved)
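- Pathway memory state (88 traces in data/_pathway_memory/state.json on Rust side): port the JSON format; the byte-matching contract becomes Go-Go instead of Rust-TS. Per ratified decision 5, the Rust traces are kept as a historical record and are not reloaded; Go pathway memory starts with zero traces.
- Catalog manifests (Parquet, portable)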
- Distillation v1.0.0 substrate (port the SFT/contamination-firewall logic; the fixture-as-gate pattern stays)
What ships first (port-order outline — see SPEC.md for detail)
- Phase G0 (Skeleton): cmd/gateway, cmd/catalogd, cmd/storaged, cmd/ingestd. Single-bucket, no auth, CSV→Parquet, query via DuckDB.
- Phase G1 (Vector path): cmd/vectord with HNSW + RAG endpoint.
- Phase G2: Multi-profile + federation (ADRs 016–017).
- Phase G3: Pathway memory + distillation port.
- Phase G4: MCP server, observer, auditor (TS surfaces → Go).
- Phase G5: UI (HTMX) and demo parity with devop.live/lakehouse/.
Detailed acceptance gates in SPEC.md.
What does NOT migrate
- The Rust crates themselves (archived in the original lakehouse repo)
- The TS scrum/auditor pipelines (rewritten in Go in Phase G4)
- The Bun mcp-server (rewritten in Go in Phase G4)
- The Python sidecar (kept as-is, behind aibridge)
Non-goals
- No port of vectord-lance. Lance backend is dropped; Parquet+HNSW is the only vector backend.
- No retention of Rust in the build chain. No cgo-to-Rust bridges, no FFI to keep specific crates alive. Cgo to C/C++ (DuckDB) is acceptable.
- No new feature work during the port. Feature parity with the Rust Lakehouse at the cutoff commit is the bar; new capabilities defer to post-port phases.
- No live-migration of running services. The Rust Lakehouse stops serving when Go reaches feature parity; data moves once via Parquet re-pointer.
Ratified decisions (2026-04-28, J)
The six gating questions are answered. Phase G0 is unblocked. Full
context for each lives in docs/DECISIONS.md ADR-001.
| # | Decision |
|---|---|
| 1 | DuckDB via cgo: github.com/duckdb/duckdb-go/v2 (official successor to the deprecated marcboeker/go-duckdb) is the query engine. Cgo accepted as the cost of a mature SQL+Parquet path. |
| 2 | HTMX — server-rendered html/template + HTMX + Alpine.js. Single-binary deploy. React is post-G5 if needed. |
| 3 | Gitea — repo lives at git.agentview.dev/profit/golangLAKEHOUSE (same server as the Rust lakehouse). |
| 4 | Distillation rebuild in Go — port the SFT export + contamination firewall logic, but bit-identical reproducibility is a Rust-era property. New Go fixtures, new acceptance gates. |
| 5 | Pathway memory starts clean — Go pathway memory begins with zero traces. The existing 88 Rust traces are preserved at data/_pathway_memory/state.json in the lakehouse repo as a historical record (see docs/RUST_PATHWAY_MEMORY_NOTE.md). |
| 6 | Auditor longitudinal signal restarts — audit_baselines.jsonl is a Rust-era artifact. Go auditor begins a fresh drift signal. |