Claw 29468b1413 docs: 2026-04-28 upstream survey — three SPEC-changing pivots
Pre-Phase-G0 research sweep against current Go ecosystem state. Three
upstream changes that the day-of SPEC missed:

1. DuckDB Go binding ownership transferred. marcboeker/go-duckdb is
   deprecated as of v2.5.0 — official maintainer is now
   github.com/duckdb/duckdb-go/v2 (DuckDB team + Marc Boeker joint
   hand-off). Current v2.10502.0 / DuckDB v1.5.2. SPEC §3.1 +
   component table updated.

2. Official Go MCP SDK exists. Switching from mark3labs/mcp-go
   (community) to github.com/modelcontextprotocol/go-sdk (official,
   Google collaboration, v1.5.0 stable, 4.4k stars, targets MCP spec
   2025-11-25). Component table updated.

3. arrow-go is on v18, not v15. v18.5.2 (March 2026) has parquet
   encryption fixes relevant for PII-masked safe views. PRD locked
   stack + SPEC component table updated.

Validated unchanged: coder/hnsw (220 stars, active), chi (still the
clean-architecture pick over fiber/gin/echo).

Surfaced for future use: anthropics/anthropic-sdk-go (official,
available for direct Claude calls bypassing opencode if ever needed),
duckdb-wasm (browser-side analytics future option), IVF as HNSW
fallback if recall gate fails.

See docs/RESEARCH_LOG_2026-04-28.md for full survey + sources.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 06:40:26 -05:00


PRD: Lakehouse-Go — Versioned Knowledge Substrate, Reimplemented in Go

Status: DRAFT. Seed document for the Go-direction rewrite. Supersedes /home/profit/lakehouse/docs/PRD.md (Rust) once ratified.
Created: 2026-04-28
Owner: J
Sibling: SPEC.md (component-by-component port plan with effort estimates, library choices, and acceptance gates)


Direction pivot — why this PRD exists

The Rust-first Lakehouse (15 crates, ~24 unmerged commits past PR #11, distillation v1.0.0 substrate frozen at e7636f2) is being reimplemented in Go on the principle that anything Go can carry, Go carries. This is an explicit re-platforming, not a refactor.

What the rewrite preserves (verbatim from the Rust PRD)

  • The problem statement — legacy data systems silo information; AI needs both fast analytical queries AND semantic retrieval over unstructured text in one substrate.
  • The two use cases — staffing analytics (reference implementation) and local AI knowledge substrate (per-profile vector indexes for running models).
  • The shared requirements — schema-less ingest, SQL at scale, AI-embedding search, hot-swappable indexes, trials-as-data, local-first / no-cloud, repo-rebuildable.
  • The architectural invariants — object storage as source of truth, catalog as sole metadata authority, hot-swap atomicity, profiles as first-class, playbooks-feed-the-index, errors findable in one HTTP call.

What the rewrite changes

| Layer | Was (Rust) | Becomes (Go) | Confidence |
|---|---|---|---|
| HTTP gateway | Axum + Tokio | net/http + chi (or gin) | High: Go's bread and butter |
| gRPC | tonic | google.golang.org/grpc | High: Go is the reference impl |
| Object store | Apache Arrow object_store | aws-sdk-go-v2/service/s3 + thin wrapper | High |
| Parquet I/O | parquet-rs (arrow-rs) | apache/arrow-go/v18/parquet | Medium: arrow-go lags arrow-rs but v18 covers our needs |
| Query engine | DataFusion | Hard problem (see §Hard problems) | Low: no like-for-like Go equivalent |
| Vector index (HNSW) | hora / hand-rolled | coder/hnsw or Bithack/go-hnsw (in-process) | High: HNSW is a self-contained algorithm |
| Vector backend (Lance) | lance (Rust) | Hard problem: likely dropped, Parquet-only | Medium |
| Frontend | Dioxus + WASM | Go html/template + HTMX + Alpine, or React/Vite split repo | Medium |
| Concurrency | Tokio async | Goroutines + context.Context | High |
| Config | TOML | TOML (pelletier/go-toml/v2) | High |
| Secrets | SecretsProvider trait | Go interface, same shape | High |
| AI bridge | HTTP client to Python sidecar | Same: Python sidecar stays | High |
| Embedded ML | Python sidecar (kept) | Python sidecar (kept) | n/a |

What stays Python (and why)

  • Embedding generation, image gen, deepface analysis — Python's ML ecosystem is genuinely stronger than Go's. The sidecar stays as an HTTP service; the Go gateway calls it the same way the Rust gateway did. No port required.
  • Distillation pipeline scoring — current TS scripts; can move to Go but not first-tier priority. Keep TS until Go gateway is live.

Solution — Go service mesh over S3-compatible object storage

A modular Go service mesh, same architectural shape as the Rust system, with the Python AI sidecar retained as the embedding/generation boundary. Single repo (golangLAKEHOUSE), single Go module, multiple binaries built from one workspace.

Locked stack (Go)

| Layer | Choice | Rationale |
|---|---|---|
| HTTP | chi | Idiomatic, middleware-friendly, used by major Go services |
| gRPC | google.golang.org/grpc | Reference implementation |
| Protobuf | protoc-gen-go + buf | Standard tooling |
| Object store | aws-sdk-go-v2 | Mature, covers S3 + MinIO + RustFS |
| Parquet | apache/arrow-go/v18 | Columnar I/O + Arrow interop (v18.5.2, March 2026) |
| SQL engine | Open: see §Hard problems §1 | Biggest open decision |
| Vector index | coder/hnsw | Pure-Go HNSW, in-process, no external service |
| TOML config | pelletier/go-toml/v2 | Standard |
| Logging | log/slog | Standard library since Go 1.21 |
| Tracing | go.opentelemetry.io/otel | Standard |
| Testing | testing + testify + golden files | Standard |
| Frontend | Open: html/template + HTMX vs separate Vite/React | Hard problem §3 |

No new dependencies without an ADR.


Architecture

Same service decomposition as Rust, same data flow. Names preserved so the spec, ADRs, and runbooks port semantically:

┌─ ingestd ─→ storaged ─→ catalogd ─→ queryd ─┐
│                          │                   │
│                          └→ vectord          │
│                                              │
└──────── aibridge ──HTTP──→ Python sidecar ───┘
                                               │
                       gateway ─ HTTP/gRPC ────┘
                          │
                          └→ ui (HTMX or Vite)

| Service | Responsibility | Go binary |
|---|---|---|
| gateway | HTTP/gRPC ingress, routing, auth | cmd/gateway |
| catalogd | Metadata control plane, dataset registry | cmd/catalogd |
| storaged | Object I/O, multi-bucket, error journal | cmd/storaged |
| queryd | SQL execution over Parquet (engine TBD) | cmd/queryd |
| ingestd | CSV/JSON/PDF ingest → Parquet | cmd/ingestd |
| vectord | Embeddings + HNSW index + trial system | cmd/vectord |
| journald | Append-only mutation event log | cmd/journald |
| aibridge | HTTP client to Python sidecar | library, linked into gateway |
| validator | Production worker/permit validators | library, linked into gateway |
| mcp | MCP server (replaces Bun mcp-server) | cmd/mcp |
| observer | Autonomous iteration loop | cmd/observer |
| auditor | PR audit pipeline (replaces TS auditor) | cmd/auditor |

Invariants (preserved verbatim from Rust PRD)

  1. Object storage = source of truth
  2. catalogd = sole metadata authority
  3. No raw data in catalog — only pointers
  4. vectord stores embeddings AS Parquet (portable)
  5. ingestd is idempotent
  6. Hot cache is performance, not source of truth
  7. All services modular and independently replaceable
  8. Indexes are hot-swappable (atomic pointer swap, rollback always possible)
  9. Every reader gets its own profile
  10. Trials are data, not logs
  11. Operational failures findable in one HTTP call
  12. Playbooks feed the index, not just the log
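
Invariant 8 (atomic pointer swap, rollback always possible) maps naturally onto `sync/atomic`. The sketch below is one possible shape, not the actual vectord design: the `Index` and `Slot` types and the `Prev` back-pointer used for rollback are hypothetical, and a real index would carry an HNSW graph rather than just a version number.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Index stands in for a built HNSW index. Only the version matters here.
type Index struct {
	Version int
	Prev    *Index // previous generation, retained so rollback is possible
}

// Slot holds the index that readers currently see.
type Slot struct {
	cur atomic.Pointer[Index]
}

// Load returns the live index; readers never block on a swap.
func (s *Slot) Load() *Index { return s.cur.Load() }

// Swap publishes next atomically and remembers the old index for rollback.
// next must not be shared with readers before this call.
func (s *Slot) Swap(next *Index) {
	for {
		old := s.cur.Load()
		next.Prev = old
		if s.cur.CompareAndSwap(old, next) {
			return
		}
		// Another writer won the race; retry against the new current index.
	}
}

// Rollback atomically restores the previous generation, if one exists.
func (s *Slot) Rollback() error {
	for {
		cur := s.cur.Load()
		if cur == nil || cur.Prev == nil {
			return errors.New("no previous index to roll back to")
		}
		if s.cur.CompareAndSwap(cur, cur.Prev) {
			return nil
		}
	}
}

func main() {
	var slot Slot
	slot.Swap(&Index{Version: 1})
	slot.Swap(&Index{Version: 2})
	fmt.Println("serving version", slot.Load().Version)

	if err := slot.Rollback(); err != nil {
		fmt.Println("rollback failed:", err)
	}
	fmt.Println("after rollback, serving version", slot.Load().Version)
}
```

Readers that already loaded the old pointer keep using a complete, consistent index until they finish, which is what makes the swap safe without a read lock.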

Hard problems (the ones that don't trivially port)

These four define whether the rewrite is feasible. SPEC.md answers each with a concrete library or approach choice and a fallback.

1. Query engine — replacing DataFusion

Constraint: DataFusion is the most consequential Rust dependency in the Lakehouse. It powers queryd, hybrid SQL+vector search, and hot-cache merge-on-read. Go has no like-for-like equivalent.

Options:

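  • A. Embed DuckDB via cgo (marcboeker/go-duckdb). DuckDB reads Parquet natively, supports SQL similar to DataFusion, and has cgo Go bindings. This loses pure-Go portability (cgo required) but preserves the query model. Note that the 2026-04-28 upstream survey found marcboeker/go-duckdb deprecated as of v2.5.0 in favor of github.com/duckdb/duckdb-go/v2.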
  • B. Run DuckDB as an external service — one DuckDB process, Go talks to it via HTTP. Pure-Go gateway, separate-process query layer. Adds an operational surface (one more service to manage).
  • C. Hand-roll a query planner over Arrow — parse SQL with xwb1989/sqlparser, plan over arrow-go RecordBatches, execute. High effort, high risk. Best avoided.
  • D. Postgres + foreign data wrappers — point Postgres at Parquet via parquet_fdw. Mature but introduces a database we said we'd avoid (ADR-001).

Recommendation: Option A (DuckDB via cgo). Preserves the SQL + columnar + Parquet model, single-binary deploy with cgo, mature. Cgo adds build complexity but is acceptable.

2. Lance backend — vectord-lance

Constraint: Lance is a Rust-native columnar format with built-in vector indexing. There is no Go port and no FFI binding. ADR-019 designates Lance as a per-profile secondary backend; Parquet+HNSW is primary.

Options:

  • A. Drop Lance entirely. Parquet+HNSW handles primary path; Lance was secondary. ADR-019 stays valid for the Rust era; Go Lakehouse is Parquet-only.
  • B. Keep Lance via FFI/cgo. Build Lance as a Rust dylib, call from Go via cgo. Reintroduces Rust into the build chain; defeats the point.
  • C. Wait for Lance Go port. Doesn't exist; not on Lance roadmap.

Recommendation: Option A (drop Lance). The hybrid backend was optional per-profile; Parquet+HNSW carries the primary path. If a specific workload later proves Lance-only, it can be exposed as a Python-sidecar service.

3. UI — replacing Dioxus

Constraint: Dioxus is a Rust+WASM frontend framework. No Go equivalent at the same level of polish. The current crates/ui covers Ask, Explore, SQL, System tabs.

Options:

  • A. html/template + HTMX + Alpine.js — server-rendered Go, partial-page swaps via HTMX. Single repo, minimal JS, fits Go's "boring is good" ethos.
  • B. Separate Vite/React frontend: golangLAKEHOUSE-ui repo, Go gateway serves static files. Modern UX patterns, more dev tooling needed.
  • C. Keep Dioxus + WASM as a build step — defeats the rewrite.

Recommendation: Option A for v1; revisit if UI requirements demand React-tier interactivity. The current Lakehouse UIs (/lakehouse/ demo + staffer console) are mostly server-rendered HTML with vanilla JS — html/template + HTMX is a strong fit.

4. Arrow ecosystem maturity

Constraint: arrow-go/v18 lags arrow-rs in compute kernels, expression APIs, and some compression codecs. Specific gaps known: limited cast kernel coverage, no equivalent of arrow-rs's compute::sort_to_indices for all dtypes, no Acero-style streaming execution.

Mitigation: the Go Lakehouse relies on Arrow primarily for Parquet I/O + RecordBatch transport, not for in-process compute (that's DuckDB's job). The narrower scope makes arrow-go's gaps less load-bearing.

Acceptance gate: any Arrow API the Go Lakehouse uses must be covered by arrow-go/v18, the version in the locked stack. Anything missing → file an upstream issue, implement locally if blocking, contribute back.


Migration strategy

What ports verbatim

  • Problem statement, use cases, requirements
  • Architectural invariants (1-12)
  • ADRs 001-021 (preserved as design intent; some change implementation)
  • Federation building blocks (multi-bucket, error-journal, append-log)

What rebuilds from data

  • HNSW indexes (rebuild from Parquet embeddings — ADR-008 is verbatim preserved)
  • Pathway memory state (88 traces in data/_pathway_memory/state.json on Rust side — port the JSON format and reload; the byte-matching contract becomes Go-Go instead of Rust-TS)
  • Catalog manifests (Parquet, portable)
  • Distillation v1.0.0 substrate (port the SFT/contamination-firewall logic; the fixture-as-gate pattern stays)

What ships first (port-order outline — see SPEC.md for detail)

  1. Phase G0 — Skeleton: cmd/gateway, cmd/catalogd, cmd/storaged, cmd/ingestd. Single-bucket, no auth, CSV→Parquet, query via DuckDB.
  2. Phase G1 — Vector path: cmd/vectord with HNSW + RAG endpoint.
  3. Phase G2: Multi-profile + federation (ADRs 016-017).
  4. Phase G3 — Pathway memory + distillation port.
  5. Phase G4 — MCP server, observer, auditor (TS surfaces → Go).
  6. Phase G5 — UI (HTMX) and demo parity with devop.live/lakehouse/.

Detailed acceptance gates in SPEC.md.

What does NOT migrate

  • The Rust crates themselves (archived in the original lakehouse repo)
  • The TS scrum/auditor pipelines (rewritten in Go in Phase G4)
  • The Bun mcp-server (rewritten in Go in Phase G4)
  • The Python sidecar (kept as-is, behind aibridge)

Non-goals

  • No port of vectord-lance. Lance backend is dropped; Parquet+HNSW is the only vector backend.
  • No retention of Rust in the build chain. No cgo-to-Rust bridges, no FFI to keep specific crates alive. Cgo to C/C++ (DuckDB) is acceptable.
  • No new feature work during the port. Feature parity with the Rust Lakehouse at the cutoff commit is the bar; new capabilities defer to post-port phases.
  • No live-migration of running services. The Rust Lakehouse stops serving when Go reaches feature parity; data moves once via Parquet re-pointer.

Ratified decisions (2026-04-28, J)

The six gating questions are answered. Phase G0 is unblocked. Full context for each lives in docs/DECISIONS.md ADR-001.

| # | Decision |
|---|---|
| 1 | DuckDB via cgo: marcboeker/go-duckdb is the query engine. Cgo accepted as the cost of a mature SQL+Parquet path. |
| 2 | HTMX: server-rendered html/template + HTMX + Alpine.js. Single-binary deploy. React is post-G5 if needed. |
| 3 | Gitea: repo lives at git.agentview.dev/profit/golangLAKEHOUSE (same server as the Rust lakehouse). |
| 4 | Distillation rebuild in Go: port the SFT export + contamination firewall logic, but bit-identical reproducibility is a Rust-era property. New Go fixtures, new acceptance gates. |
| 5 | Pathway memory starts clean: Go pathway memory begins with zero traces. The existing 88 Rust traces are preserved at data/_pathway_memory/state.json in the lakehouse repo as a historical record (see docs/RUST_PATHWAY_MEMORY_NOTE.md). |
| 6 | Auditor longitudinal signal restarts: audit_baselines.jsonl is a Rust-era artifact. Go auditor begins a fresh drift signal. |