Claw 29468b1413 docs: 2026-04-28 upstream survey — three SPEC-changing pivots
Pre-Phase-G0 research sweep against current Go ecosystem state. Three
upstream changes that the day-of SPEC missed:

1. DuckDB Go binding ownership transferred. marcboeker/go-duckdb is
   deprecated as of v2.5.0 — official maintainer is now
   github.com/duckdb/duckdb-go/v2 (DuckDB team + Marc Boeker joint
   hand-off). Current v2.10502.0 / DuckDB v1.5.2. SPEC §3.1 +
   component table updated.

2. Official Go MCP SDK exists. Switching from mark3labs/mcp-go
   (community) to github.com/modelcontextprotocol/go-sdk (official,
   Google collaboration, v1.5.0 stable, 4.4k stars, targets MCP spec
   2025-11-25). Component table updated.

3. arrow-go is on v18, not v15. v18.5.2 (March 2026) has parquet
   encryption fixes relevant for PII-masked safe views. PRD locked
   stack + SPEC component table updated.

Validated unchanged: coder/hnsw (220 stars, active), chi (still the
clean-architecture pick over fiber/gin/echo).

Surfaced for future use: anthropics/anthropic-sdk-go (official,
available for direct Claude calls bypassing opencode if ever needed),
duckdb-wasm (browser-side analytics future option), IVF as HNSW
fallback if recall gate fails.

See docs/RESEARCH_LOG_2026-04-28.md for full survey + sources.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 06:40:26 -05:00


# PRD: Lakehouse-Go — Versioned Knowledge Substrate, Reimplemented in Go
**Status:** DRAFT. Seed document for the Go-direction rewrite. Supersedes
`/home/profit/lakehouse/docs/PRD.md` (Rust) once ratified.

**Created:** 2026-04-28

**Owner:** J

**Sibling:** `SPEC.md`, the component-by-component port plan with effort
estimates, library choices, and acceptance gates.

---
## Direction pivot — why this PRD exists
The Rust-first Lakehouse (15 crates, ~24 unmerged commits past PR #11,
distillation v1.0.0 substrate frozen at `e7636f2`) is being reimplemented
in Go on the principle that **anything Go can carry, Go carries**. This
is an explicit re-platforming, not a refactor.
### What the rewrite preserves (verbatim from the Rust PRD)
- The **problem statement** — legacy data systems silo information; AI
needs both fast analytical queries AND semantic retrieval over
unstructured text in one substrate.
- The **two use cases** — staffing analytics (reference implementation)
and local AI knowledge substrate (per-profile vector indexes for
running models).
- The **shared requirements** — schema-less ingest, SQL at scale,
AI-embedding search, hot-swappable indexes, trials-as-data,
local-first / no-cloud, repo-rebuildable.
- The **architectural invariants** — object storage as source of truth,
catalog as sole metadata authority, hot-swap atomicity, profiles as
first-class, playbooks-feed-the-index, errors findable in one HTTP
call.
### What the rewrite changes
| Layer | Was (Rust) | Becomes (Go) | Confidence |
|---|---|---|---|
| HTTP gateway | Axum + Tokio | `net/http` + `chi` (or `gin`) | High — Go's bread and butter |
| gRPC | tonic | `google.golang.org/grpc` | High — Go is the reference impl |
| Object store | Apache Arrow `object_store` | `aws-sdk-go-v2/service/s3` + thin wrapper | High |
| Parquet I/O | parquet-rs (arrow-rs) | `apache/arrow-go/v18/parquet` | Medium — arrow-go lags arrow-rs but v18 covers our needs |
| Query engine | DataFusion | **Hard problem** (see §Hard problems) | Low — no like-for-like Go equivalent |
| Vector index (HNSW) | `hora` / hand-rolled | `coder/hnsw` or `Bithack/go-hnsw` (in-process) | High — HNSW is a self-contained algorithm |
| Vector backend (Lance) | `lance` (Rust) | **Hard problem** — likely dropped, Parquet-only | Medium |
| Frontend | Dioxus + WASM | Go `html/template` + HTMX + Alpine, or React/Vite split repo | Medium |
| Concurrency | Tokio async | Goroutines + `context.Context` | High |
| Config | TOML | TOML (`pelletier/go-toml/v2`) | High |
| Secrets | `SecretsProvider` trait | Go interface, same shape | High |
| AI bridge | HTTP client to Python sidecar | Same — Python sidecar stays | High |
| Embedded ML | Python sidecar (kept) | Python sidecar (kept) | n/a |
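
As an illustration of the Secrets row above, here is a minimal sketch of what a Go interface with the "same shape" might look like. The Rust trait's actual methods are not given in this document, so `SecretsProvider.Get`, `EnvProvider`, `MapProvider`, `Chain`, and `ErrSecretNotFound` are all hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// ErrSecretNotFound is returned when no provider holds the requested key.
var ErrSecretNotFound = errors.New("secret not found")

// SecretsProvider is a hypothetical Go counterpart to the Rust trait.
// The method name and signature are assumptions, not the real contract.
type SecretsProvider interface {
	Get(key string) (string, error)
}

// EnvProvider reads secrets from environment variables.
type EnvProvider struct{}

func (EnvProvider) Get(key string) (string, error) {
	if v, ok := os.LookupEnv(key); ok {
		return v, nil
	}
	return "", fmt.Errorf("%w: %s", ErrSecretNotFound, key)
}

// MapProvider serves secrets from memory, which is handy in tests.
type MapProvider map[string]string

func (m MapProvider) Get(key string) (string, error) {
	if v, ok := m[key]; ok {
		return v, nil
	}
	return "", fmt.Errorf("%w: %s", ErrSecretNotFound, key)
}

// Chain tries each provider in order and returns the first hit.
type Chain []SecretsProvider

func (c Chain) Get(key string) (string, error) {
	for _, p := range c {
		v, err := p.Get(key)
		if err == nil {
			return v, nil
		}
		if !errors.Is(err, ErrSecretNotFound) {
			return "", err // a real failure, not just a miss
		}
	}
	return "", fmt.Errorf("%w: %s", ErrSecretNotFound, key)
}

func main() {
	secrets := Chain{MapProvider{"S3_ACCESS_KEY": "dev-key"}, EnvProvider{}}
	v, err := secrets.Get("S3_ACCESS_KEY")
	fmt.Println(v, err)
}
```

Because Go interfaces are satisfied implicitly, a service can accept `SecretsProvider` and be handed any backend without changes, which is roughly the role a trait object plays on the Rust side.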
### What stays Python (and why)
- **Embedding generation, image gen, deepface analysis** — Python's ML
ecosystem is genuinely stronger than Go's. The sidecar stays as an
HTTP service; the Go gateway calls it the same way the Rust gateway
did. No port required.
- **Distillation pipeline scoring** — current TS scripts; can move to Go
but not first-tier priority. Keep TS until Go gateway is live.
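
The sidecar boundary described above can be sketched with the standard library alone. This is an assumption-heavy illustration: the `/embed` path, the `embedRequest` and `embedResponse` payload shapes, and the `Client` type are hypothetical, and the Python service is replaced by an in-process fake so the example runs on its own. The call carries a `context.Context` with a timeout so a stalled sidecar cannot hang the gateway.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// embedRequest and embedResponse are hypothetical payload shapes; the real
// sidecar contract is not specified in this document.
type embedRequest struct {
	Texts []string `json:"texts"`
}

type embedResponse struct {
	Vectors [][]float32 `json:"vectors"`
}

// Client is a minimal sidecar client.
type Client struct {
	BaseURL string
	HTTP    *http.Client
}

// Embed posts texts to the assumed /embed endpoint and decodes the vectors.
func (c *Client) Embed(ctx context.Context, texts []string) ([][]float32, error) {
	body, err := json.Marshal(embedRequest{Texts: texts})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, c.BaseURL+"/embed", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.HTTP.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("sidecar returned %s", resp.Status)
	}
	var out embedResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Vectors, nil
}

// fakeSidecar stands in for the Python service: one 2-dim vector per text.
func fakeSidecar() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var in embedRequest
		if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		out := embedResponse{Vectors: make([][]float32, len(in.Texts))}
		for i, t := range in.Texts {
			out.Vectors[i] = []float32{float32(len(t)), 1}
		}
		json.NewEncoder(w).Encode(out)
	}))
}

// embedViaFake runs one round trip against the fake sidecar.
func embedViaFake(texts []string) ([][]float32, error) {
	srv := fakeSidecar()
	defer srv.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	c := &Client{BaseURL: srv.URL, HTTP: srv.Client()}
	return c.Embed(ctx, texts)
}

func main() {
	vecs, err := embedViaFake([]string{"hello", "go"})
	fmt.Println(vecs, err)
}
```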
---
## Solution — Go service mesh over S3-compatible object storage
A modular Go service mesh, same architectural shape as the Rust system,
with the Python AI sidecar retained as the embedding/generation
boundary. Single repo (`golangLAKEHOUSE`), single Go module, multiple
binaries built from one workspace.
### Locked stack (Go)
| Layer | Choice | Rationale |
|---|---|---|
| HTTP | `chi` | Idiomatic, middleware-friendly, used by major Go services |
| gRPC | `google.golang.org/grpc` | Reference implementation |
| Protobuf | `protoc-gen-go` + `buf` | Standard tooling |
| Object store | `aws-sdk-go-v2` | Mature, covers S3 + MinIO + RustFS |
| Parquet | `apache/arrow-go/v18` | Columnar I/O + Arrow interop (v18.5.2 — March 2026) |
| SQL engine | **Open** — see §Hard problems §1 | Biggest open decision |
| Vector index | `coder/hnsw` | Pure-Go HNSW, in-process, no external service |
| TOML config | `pelletier/go-toml/v2` | Standard |
| Logging | `log/slog` | Standard library since Go 1.21 |
| Tracing | `go.opentelemetry.io/otel` | Standard |
| Testing | `testing` + `testify` + `golden` files | Standard |
| Frontend | **Open**: `html/template` + HTMX vs separate Vite/React | Hard problem §3 |

No new dependencies without an ADR.

---
## Architecture
Same service decomposition as Rust, same data flow. Names preserved so
the spec, ADRs, and runbooks port semantically:
```
┌─ ingestd ─→ storaged ─→ catalogd ─→ queryd ─┐
│ │ │
│ └→ vectord │
│ │
└──────── aibridge ──HTTP──→ Python sidecar ───┘
gateway ─ HTTP/gRPC ────┘
└→ ui (HTMX or Vite)
```
| Service | Responsibility | Go binary |
|---|---|---|
| **gateway** | HTTP/gRPC ingress, routing, auth | `cmd/gateway` |
| **catalogd** | Metadata control plane, dataset registry | `cmd/catalogd` |
| **storaged** | Object I/O, multi-bucket, error journal | `cmd/storaged` |
| **queryd** | SQL execution over Parquet (engine TBD) | `cmd/queryd` |
| **ingestd** | CSV/JSON/PDF ingest → Parquet | `cmd/ingestd` |
| **vectord** | Embeddings + HNSW index + trial system | `cmd/vectord` |
| **journald** | Append-only mutation event log | `cmd/journald` |
| **aibridge** | HTTP client to Python sidecar | library, linked into gateway |
| **validator** | Production worker/permit validators | library, linked into gateway |
| **mcp** | MCP server (replaces Bun `mcp-server`) | `cmd/mcp` |
| **observer** | Autonomous iteration loop | `cmd/observer` |
| **auditor** | PR audit pipeline (replaces TS auditor) | `cmd/auditor` |
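
To make the `journald` row concrete, the following is a minimal sketch of an append-only event log written as JSON lines. It is an illustration under assumptions, not the service's design: the `Event` fields, the `Journal` type, and the `mutations.jsonl` file name are hypothetical. The file is opened with `O_APPEND`, so each write lands at the end and earlier bytes are never rewritten.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// Event is a hypothetical mutation record; field names are assumptions.
type Event struct {
	Dataset string `json:"dataset"`
	Op      string `json:"op"`
}

// Journal appends events as JSON lines and never rewrites earlier bytes.
type Journal struct {
	mu sync.Mutex
	f  *os.File
}

// OpenJournal opens (or creates) the log in append-only mode.
func OpenJournal(path string) (*Journal, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &Journal{f: f}, nil
}

// Append writes one event as a single JSON line and syncs it to disk.
func (j *Journal) Append(e Event) error {
	line, err := json.Marshal(e)
	if err != nil {
		return err
	}
	j.mu.Lock()
	defer j.mu.Unlock()
	if _, err := j.f.Write(append(line, '\n')); err != nil {
		return err
	}
	return j.f.Sync()
}

// Close releases the underlying file.
func (j *Journal) Close() error { return j.f.Close() }

// ReadAll replays the log from the start, in write order.
func ReadAll(path string) ([]Event, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var events []Event
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var e Event
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			return nil, err
		}
		events = append(events, e)
	}
	return events, sc.Err()
}

// demo appends two events across two separate opens, then replays them.
func demo() ([]Event, error) {
	dir, err := os.MkdirTemp("", "journal")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "mutations.jsonl")
	for _, e := range []Event{
		{Dataset: "workers", Op: "create"},
		{Dataset: "workers", Op: "append"},
	} {
		j, err := OpenJournal(path)
		if err != nil {
			return nil, err
		}
		err = j.Append(e)
		if cerr := j.Close(); err == nil {
			err = cerr
		}
		if err != nil {
			return nil, err
		}
	}
	return ReadAll(path)
}

func main() {
	events, err := demo()
	fmt.Println(len(events), err)
}
```

Reopening the file between writes in `demo` is deliberate: it shows that a restart does not truncate or reorder what was already journaled.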
### Invariants (preserved verbatim from Rust PRD)
1. Object storage = source of truth
2. catalogd = sole metadata authority
3. No raw data in catalog — only pointers
4. vectord stores embeddings AS Parquet (portable)
5. ingestd is idempotent
6. Hot cache is performance, not source of truth
7. All services modular and independently replaceable
8. Indexes are hot-swappable (atomic pointer swap, rollback always possible)
9. Every reader gets its own profile
10. Trials are data, not logs
11. Operational failures findable in one HTTP call
12. Playbooks feed the index, not just the log
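
Invariant 8 (atomic pointer swap with rollback) maps naturally onto `sync/atomic`. The sketch below is one possible shape, not the project's implementation: `Index` and `Slot` are hypothetical types, and `atomic.Pointer[T]` assumes Go 1.19 or later. Readers call `Load` without taking a lock, while a mutex serializes the rare promote and rollback operations so the current and previous pointers stay consistent with each other.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Index is a stand-in for a built HNSW index; only the version matters here.
type Index struct {
	Version int
}

// Slot holds the live index plus the one it most recently replaced.
type Slot struct {
	mu       sync.Mutex // serializes writers only; readers never take it
	current  atomic.Pointer[Index]
	previous atomic.Pointer[Index]
}

// Load returns the live index, or nil if none has been promoted yet.
func (s *Slot) Load() *Index { return s.current.Load() }

// Swap promotes next and remembers the old index for rollback.
func (s *Slot) Swap(next *Index) {
	s.mu.Lock()
	defer s.mu.Unlock()
	old := s.current.Swap(next)
	s.previous.Store(old)
}

// Rollback restores the previously live index. It reports false when there
// is nothing to roll back to.
func (s *Slot) Rollback() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	prev := s.previous.Load()
	if prev == nil {
		return false
	}
	s.current.Store(prev)
	s.previous.Store(nil)
	return true
}

func main() {
	var slot Slot
	slot.Swap(&Index{Version: 1})
	slot.Swap(&Index{Version: 2})
	fmt.Println("live:", slot.Load().Version)
	fmt.Println("rolled back:", slot.Rollback(), "live:", slot.Load().Version)
}
```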
---
## Hard problems (the ones that don't trivially port)
These four define whether the rewrite is feasible. `SPEC.md` answers each
with a concrete library or approach choice and a fallback.
### 1. Query engine — replacing DataFusion
**Constraint:** DataFusion is the most consequential Rust dependency in
the Lakehouse. It powers `queryd`, hybrid SQL+vector search, and
hot-cache merge-on-read. Go has no like-for-like equivalent.
**Options:**
- **A. Embed DuckDB via cgo (`github.com/duckdb/duckdb-go/v2`, successor to
the deprecated `marcboeker/go-duckdb`)**: DuckDB reads Parquet natively,
supports SQL similar to DataFusion, and has official cgo Go bindings.
Loses pure-Go portability (cgo required) but preserves the query model.
- **B. Run DuckDB as an external service** — one DuckDB process, Go
talks to it via HTTP. Pure-Go gateway, separate-process query layer.
Adds an operational surface (one more service to manage).
- **C. Hand-roll a query planner over Arrow** — parse SQL with
`xwb1989/sqlparser`, plan over arrow-go RecordBatches, execute. High
effort, high risk. Best avoided.
- **D. Postgres + foreign data wrappers** — point Postgres at Parquet
via `parquet_fdw`. Mature but introduces a database we said we'd
avoid (ADR-001).
**Recommendation:** **Option A (DuckDB via cgo)**. Preserves the SQL +
columnar + Parquet model, single-binary deploy with cgo, mature. Cgo
adds build complexity but is acceptable.
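
Under Option A, `queryd` would hand DuckDB the Parquet locations that the catalog points to. A minimal sketch of that step follows, using only the standard library so it runs without cgo. `read_parquet` is DuckDB's table function for Parquet files; the helper names (`quoteLiteral`, `parquetSource`, `countQuery`) and the `s3://lake/...` paths are hypothetical, and the step that executes the resulting SQL through a `database/sql` driver is intentionally left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// quoteLiteral renders s as a SQL string literal, doubling single quotes.
func quoteLiteral(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// parquetSource builds a read_parquet([...]) table expression over the
// object paths that the catalog resolved for a dataset.
func parquetSource(paths []string) (string, error) {
	if len(paths) == 0 {
		return "", errors.New("dataset has no parquet objects")
	}
	quoted := make([]string, len(paths))
	for i, p := range paths {
		quoted[i] = quoteLiteral(p)
	}
	return "read_parquet([" + strings.Join(quoted, ", ") + "])", nil
}

// countQuery wraps the source in a simple aggregate, as an example of the
// SQL text that would be sent to the embedded engine.
func countQuery(paths []string) (string, error) {
	src, err := parquetSource(paths)
	if err != nil {
		return "", err
	}
	return "SELECT count(*) FROM " + src, nil
}

func main() {
	q, err := countQuery([]string{
		"s3://lake/workers/part-0.parquet",
		"s3://lake/workers/part-1.parquet",
	})
	fmt.Println(q, err)
}
```

Keeping path resolution in Go and passing only file lists to the engine is consistent with the invariant that the catalog holds pointers rather than data: the engine never needs its own view of which objects make up a dataset.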
### 2. Lance backend — vectord-lance
**Constraint:** Lance is a Rust-native columnar format with built-in
vector indexing. There is no Go port and no FFI binding. ADR-019
designates Lance as a per-profile *secondary* backend; Parquet+HNSW is
*primary*.
**Options:**
- **A. Drop Lance entirely.** Parquet+HNSW handles primary path; Lance
was secondary. ADR-019 stays valid for the Rust era; Go Lakehouse is
Parquet-only.
- **B. Keep Lance via FFI/cgo.** Build Lance as a Rust dylib, call from
Go via cgo. Reintroduces Rust into the build chain; defeats the
point.
- **C. Wait for Lance Go port.** Doesn't exist; not on Lance roadmap.
**Recommendation:** **Option A (drop Lance).** The hybrid backend was
optional per-profile; Parquet+HNSW carries the primary path. If a
specific workload later proves Lance-only, it can be exposed as a
Python-sidecar service.
### 3. UI — replacing Dioxus
**Constraint:** Dioxus is a Rust+WASM frontend framework. No Go
equivalent at the same level of polish. The current `crates/ui` covers
Ask, Explore, SQL, System tabs.
**Options:**
- **A. `html/template` + HTMX + Alpine.js** — server-rendered Go,
partial-page swaps via HTMX. Single repo, minimal JS, fits Go's
"boring is good" ethos.
- **B. Separate Vite/React frontend** — `golangLAKEHOUSE-ui` repo,
Go gateway serves static files. Modern UX patterns, more dev tooling
needed.
- **C. Keep Dioxus + WASM as a build step** — defeats the rewrite.
**Recommendation:** **Option A** for v1; revisit if UI requirements
demand React-tier interactivity. The current Lakehouse UIs (`/lakehouse/`
demo + staffer console) are mostly server-rendered HTML with vanilla
JS — `html/template` + HTMX is a strong fit.
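
As a rough illustration of Option A, the sketch below serves a full page on a normal request and only the list fragment when HTMX asks for it. HTMX marks its requests with the `HX-Request: true` header, and `hx-get`, `hx-target`, and `hx-swap` are standard HTMX attributes. The `/datasets` route, the template names, and the sample data are hypothetical, and the HTMX script tag is omitted to keep the example short.

```go
package main

import (
	"fmt"
	"html/template"
	"net/http"
	"net/http/httptest"
)

// page defines the full document plus a "rows" fragment that HTMX can
// request on its own. The htmx <script> include is omitted for brevity.
var page = template.Must(template.New("page").Parse(`<!doctype html>
<button hx-get="/datasets" hx-target="#list" hx-swap="innerHTML">Refresh</button>
<ul id="list">{{template "rows" .}}</ul>
{{define "rows"}}{{range .}}<li>{{.}}</li>{{end}}{{end}}`))

// handler renders the fragment for HTMX requests and the full page otherwise.
func handler(datasets []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		name := "page"
		if r.Header.Get("HX-Request") == "true" {
			name = "rows"
		}
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		if err := page.ExecuteTemplate(w, name, datasets); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
		}
	}
}

// render issues one in-memory request and returns the response body.
func render(htmx bool) string {
	req := httptest.NewRequest(http.MethodGet, "/datasets", nil)
	if htmx {
		req.Header.Set("HX-Request", "true")
	}
	rec := httptest.NewRecorder()
	handler([]string{"workers", "<permits>"})(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(render(true))
}
```

The same template set backs both responses, so the fragment and the full page cannot drift apart, and `html/template` escapes the dataset names in either case.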
### 4. Arrow ecosystem maturity
**Constraint:** `arrow-go/v18` lags `arrow-rs` in compute kernels,
expression APIs, and some compression codecs. Specific gaps known:
limited `cast` kernel coverage, no equivalent of `arrow-rs`'s
`compute::sort_to_indices` for all dtypes, no Acero-style streaming
execution.
**Mitigation:** the Go Lakehouse relies on Arrow primarily for
**Parquet I/O + RecordBatch transport**, not for in-process compute
(that's DuckDB's job). The narrower scope makes arrow-go's gaps less
load-bearing.
**Acceptance gate:** any Arrow API the Go Lakehouse uses must be
covered by `arrow-go/v18`. For anything missing: file an upstream issue,
implement locally if blocking, and contribute the fix back.

---
## Migration strategy
### What ports verbatim
- Problem statement, use cases, requirements
- Architectural invariants (1–12)
- ADRs 001–021 (preserved as design intent; some change implementation)
- Federation building blocks (multi-bucket, error-journal, append-log)
### What rebuilds from data
- HNSW indexes (rebuild from Parquet embeddings — ADR-008 is verbatim
preserved)
- Pathway memory state (88 traces in `data/_pathway_memory/state.json`
on Rust side — port the JSON format and reload; the byte-matching
contract becomes Go-Go instead of Rust-TS)
- Catalog manifests (Parquet, portable)
- Distillation v1.0.0 substrate (port the SFT/contamination-firewall
logic; the fixture-as-gate pattern stays)
### What ships first (port-order outline — see SPEC.md for detail)
1. **Phase G0** — Skeleton: `cmd/gateway`, `cmd/catalogd`, `cmd/storaged`,
`cmd/ingestd`. Single-bucket, no auth, CSV→Parquet, query via DuckDB.
2. **Phase G1** — Vector path: `cmd/vectord` with HNSW + RAG endpoint.
3. **Phase G2**: Multi-profile + federation (ADRs 016–017).
4. **Phase G3** — Pathway memory + distillation port.
5. **Phase G4** — MCP server, observer, auditor (TS surfaces → Go).
6. **Phase G5** — UI (HTMX) and demo parity with `devop.live/lakehouse/`.
Detailed acceptance gates in `SPEC.md`.
### What does NOT migrate
- The Rust crates themselves (archived in the original `lakehouse` repo)
- The TS scrum/auditor pipelines (rewritten in Go in Phase G4)
- The Bun mcp-server (rewritten in Go in Phase G4)
- The Python sidecar (kept as-is, behind aibridge)
---
## Non-goals
- **No port of `vectord-lance`.** Lance backend is dropped; Parquet+HNSW
is the only vector backend.
- **No retention of Rust in the build chain.** No cgo-to-Rust bridges,
no FFI to keep specific crates alive. Cgo to **C/C++** (DuckDB) is
acceptable.
- **No new feature work during the port.** Feature parity with the Rust
Lakehouse at the cutoff commit is the bar; new capabilities defer to
post-port phases.
- **No live-migration of running services.** The Rust Lakehouse stops
serving when Go reaches feature parity; data moves once via Parquet
re-pointer.
---
## Ratified decisions (2026-04-28, J)
The six gating questions are answered. Phase G0 is unblocked. Full
context for each lives in `docs/DECISIONS.md` ADR-001.
| # | Decision |
|---|---|
| 1 | **DuckDB via cgo**: `github.com/duckdb/duckdb-go/v2` (successor to the deprecated `marcboeker/go-duckdb`) is the query engine. Cgo accepted as the cost of a mature SQL+Parquet path. |
| 2 | **HTMX** — server-rendered `html/template` + HTMX + Alpine.js. Single-binary deploy. React is post-G5 if needed. |
| 3 | **Gitea** — repo lives at `git.agentview.dev/profit/golangLAKEHOUSE` (same server as the Rust lakehouse). |
| 4 | **Distillation rebuild in Go** — port the SFT export + contamination firewall logic, but bit-identical reproducibility is a Rust-era property. New Go fixtures, new acceptance gates. |
| 5 | **Pathway memory starts clean** — Go pathway memory begins with zero traces. The existing 88 Rust traces are preserved at `data/_pathway_memory/state.json` in the lakehouse repo as a historical record (see `docs/RUST_PATHWAY_MEMORY_NOTE.md`). |
| 6 | **Auditor longitudinal signal restarts**: `audit_baselines.jsonl` is a Rust-era artifact. Go auditor begins a fresh drift signal. |