Per the 2026-04-29 scope-discipline pause: the wave shipped four
pieces beyond SPEC §3.4 component scope, and one architectural
pattern surfaced (Archon-style multi-pass workflow runner) that's
the observer's natural growth path. Document them as port targets
so the next scrum review has authoritative SPEC components.
§3.5 — Drift quantification (loop 5 of the PRD)
Names the SCORER drift work shipped in be65f85 + the deferred
shapes (PLAYBOOK drift, EMBEDDING drift, AUDIT BASELINE drift).
Acceptance gates G3.5.A–B.
§3.6 — Staffing-side structured filter
Names the metadata-filter MVP shipped in b199093 + the deferred
pre-retrieval SQL gate via queryd. Acceptance gates G3.6.A–C.
§3.7 — Operational rating wiring
Names the bulk playbook-record endpoint shipped in 6392772 + the
deferred UI shim, negative-feedback path, and time-decay.
Acceptance gates G3.7.A–B.
§3.8 — Observer-KB workflow runner (Archon-style multi-pass) —
PORT TARGET, not yet started
Documents the architecture J was working on across the Rust
observer-kb branch (10 commits ahead of main, never merged) and
the local Archon mod (committed 2026-04-29 as 3f2afc8 in
/home/profit/external/Archon, not pushed to coleam00/Archon).
The pattern: multi-pass mode chain (extract → validator →
hallucination → consensus → redteam → pipeline → render) where
each pass is a deterministic measurement. The observer is the
natural home — workflows ARE observation patterns whose every
step is recorded. Five components in dependency order: workflow
definition (YAML), node executor (DAG runner), provenance
recording (ObservedOps), mode catalog (matrix.search,
distillation.score, drift.scorer, llm.chat), HTTP surface
(/v1/observer/workflow/run).
Reference materials on the system (preserved, not lost):
- /home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml
(Rust main, 69919d9) — 3-node Archon-via-Lakehouse proof
- /home/profit/external/Archon dev branch — upstream engine
with local pi/provider.ts mod (3f2afc8) for Lakehouse routing
- Rust observer-kb branch — apps/observer-kb/docs/PRD.md +
Python prototypes proven on real ChatGPT/Claude PDF data
Acceptance gates G3.8.A–D. Estimated effort: L.
PRD updated with "Observer as system resource (clarified
2026-04-29)" section pointing at §3.8 as the architectural growth
path. The bare-bones observerd in bc9ab93 is the substrate; the
workflow runner is what makes it the "objective measurement engine"
the small-model pipeline needs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# PRD: Lakehouse-Go — Versioned Knowledge Substrate, Reimplemented in Go

**Status:** DRAFT — seed document for the Go-direction rewrite. Supersedes `/home/profit/lakehouse/docs/PRD.md` (Rust) once ratified.
**Created:** 2026-04-28
**Owner:** J
**Sibling:** `SPEC.md` — component-by-component port plan with effort estimates, library choices, and acceptance gates.

---
## Product vision — what we're actually building

**The Go refactor isn't the goal. The goal is a small-model-driven autonomous pipeline that gets better with each run, with frontier models in audit/oversight and humans triaged in only for the genuinely abstract cases.**

The Rust Lakehouse already has most of the pieces:

- **Pathway memory** (`internal/pathway` in Go, 88 Rust traces preserved) — what we tried, what worked
- **Matrix indexer** (SPEC §3.4) — multi-corpus retrieve+merge that gives the small model the right knowledge slice for *this* task
- **Observer** — watches runs, refines configs, escalates
- **Distillation v1.0.0** (`e7636f2`) — turns successful runs into denser playbooks
- **Auditor cross-lineage fabric** — Kimi/Haiku/Opus oversight on small-model outputs

What the Go refactor is FOR: a second-language pass surfaces architectural weaknesses that Rust hid. The pipeline has to pull together cleanly *as a pipeline* — not as 15 crates that happen to interact.
### The five-loop substrate

1. **Knowledge pathway loop** — pathway memory + matrix indexer give the small model context for the task. Pathway answers "what worked last time?"; matrix answers "what's relevant now?"
2. **Execution loop** — small model runs on focused context. Frontier API calls are reserved for audit/escalation, not the inner loop. Cost + rate limits stay sane.
3. **Observer loop** — watches each run, refines the configs (matrix corpus picks, downgrade gate, prompt mold) that got the model to a good pathway. Outputs new config, not new prompt.
4. **Rating + distillation loop** — successful outcomes get scored and folded back into the playbook substrate. The playbook gets denser; the next run starts smarter.
5. **Drift loop** — quantify when the distilled playbook stops matching reality (codebase changed, contracts shifted, profiles updated). Drift is a *measured* signal, not "hope nothing broke."
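
The five loops above can be sketched as stages over a shared run record. This is an illustrative Go sketch only — the `RunRecord` fields and the stage signature are assumptions, not the real contract:

```go
package main

import "fmt"

// RunRecord is a hypothetical shape for one pipeline run; stage names
// mirror the five-loop substrate described above.
type RunRecord struct {
	Task    string
	Context []string // knowledge slice assembled by pathway + matrix
	Output  string
	Score   float64
}

// A stage is one loop's pass over the run record.
type stage struct {
	name  string
	apply func(*RunRecord)
}

func runSubstrate(r *RunRecord, stages []stage) {
	for _, s := range stages {
		s.apply(r)
	}
}

func main() {
	r := &RunRecord{Task: "classify permit"}
	runSubstrate(r, []stage{
		{"knowledge-pathway", func(r *RunRecord) { r.Context = append(r.Context, "playbook:permits-v3") }},
		{"execution", func(r *RunRecord) { r.Output = "small-model answer" }},
		{"observer", func(r *RunRecord) { /* refine configs, not prompts */ }},
		{"rating+distillation", func(r *RunRecord) { r.Score = 0.9 }},
		{"drift", func(r *RunRecord) { /* compare playbook vs reality, emit signal */ }},
	})
	fmt.Println(r.Score, len(r.Context))
}
```

The point of the sketch: each loop reads and writes the same record, so every iteration is auditable end to end.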
### The gate

**The playbook + matrix indexer must produce the results we're looking for.** That's the single load-bearing acceptance criterion. Throughput, scaling, code elegance — all secondary. If a deep-field reality test on the 500K corpus surfaces wrong answers, the loop isn't working and we fix that before adding anything else.
### Observer as system resource (clarified 2026-04-29)

The observer is not a service among services — it's a *system resource*. Its job is to be objective about the process: watch everything, record measurements, surface what worked vs what didn't, feed the KB so the playbook substrate can decide the right pathway to the correct outcome.

The bare-bones observerd shipped in `bc9ab93` (event ingest + stats) is the substrate for this. The architectural pattern that grows it into the full "objective measurement engine" is the **multi-pass workflow runner** documented in SPEC §3.8 — inspired by Archon (`/home/profit/external/Archon`) and proven in the Rust `observer-kb` branch's Python prototypes (`deep_analysis.py`, `extract_knowledge.py`, `process_knowledge.py`).

The pipeline mode-chain (extract → validator → hallucination → consensus → redteam → pipeline → render) IS how the observer makes actionable decisions: each mode pass is a deterministic measurement; what survives the gauntlet is what feeds the KB.
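
A minimal Go sketch of the mode-chain idea, under the assumption that each pass is a pure predicate over extracted claims (the `Claim` shape and pass bodies are invented for illustration):

```go
package main

import "fmt"

// Claim is a hypothetical unit flowing through the gauntlet.
type Claim struct {
	Text  string
	Alive bool
}

// A pass is a deterministic measurement: it may kill a claim,
// never rewrite it.
type pass struct {
	name string
	keep func(Claim) bool
}

func runChain(claims []Claim, chain []pass) []Claim {
	for _, p := range chain {
		for i := range claims {
			if claims[i].Alive && !p.keep(claims[i]) {
				claims[i].Alive = false
				fmt.Printf("pass=%s killed=%q\n", p.name, claims[i].Text)
			}
		}
	}
	survivors := claims[:0] // filter in place
	for _, c := range claims {
		if c.Alive {
			survivors = append(survivors, c)
		}
	}
	return survivors
}

func main() {
	chain := []pass{
		{"extract", func(c Claim) bool { return len(c.Text) > 0 }},
		{"validator", func(c Claim) bool { return c.Text != "unparseable" }},
		{"hallucination", func(c Claim) bool { return c.Text != "invented fact" }},
		// consensus → redteam → pipeline → render would follow the same shape.
	}
	out := runChain([]Claim{{"grounded fact", true}, {"invented fact", true}}, chain)
	fmt.Println(len(out))
}
```

What survives every pass is what feeds the KB; every kill is a recorded measurement, which is exactly the observer's job.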
### Triage / human-in-loop

Most cases are tractable enough that small-model + pathway + matrix can complete them. Some aren't — they need a human. The system's job is to **identify which is which** and only escalate the second class. Frontier models partially solve this internally with their thinking loops; we're externalizing it so:

- Small models are swappable (vendor independence)
- Drift is measurable (quantitative signal, not vibes)
- Each loop iteration is auditable (the pathway memory IS the audit trail)

This is what the auditor cross-lineage fabric proves out in Rust — Opus auto-promote on diffs >100k chars is the same pattern: triage by signal, not by guesswork.
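
Triage-by-signal reduces to a small decision function. A hedged Go sketch — the >100k-char auto-promote threshold comes from the Rust-era pattern above, while the drift and pathway-hit thresholds are invented for illustration:

```go
package main

import "fmt"

// Tier is where a case gets routed.
type Tier int

const (
	SmallModel    Tier = iota // inner loop: small model + pathway + matrix
	FrontierAudit             // frontier model audits the output
	Human                     // genuinely abstract: a human decides
)

// triage routes by measured signals, not guesswork.
func triage(diffChars, pathwayHits int, driftScore float64) Tier {
	switch {
	case driftScore > 0.5: // playbook no longer matches reality (illustrative cutoff)
		return Human
	case diffChars > 100_000: // large-diff auto-promote signal
		return FrontierAudit
	case pathwayHits == 0: // no precedent in pathway memory (illustrative)
		return FrontierAudit
	default:
		return SmallModel
	}
}

func main() {
	fmt.Println(triage(120_000, 3, 0.1)) // large diff → frontier audit
}
```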
## Direction pivot — why this PRD exists

The Rust-first Lakehouse (15 crates, ~24 unmerged commits past PR #11, distillation v1.0.0 substrate frozen at `e7636f2`) is being reimplemented in Go on the principle that **anything Go can carry, Go carries**. This is an explicit re-platforming, not a refactor.

### What the rewrite preserves (verbatim from the Rust PRD)

- The **problem statement** — legacy data systems silo information; AI needs both fast analytical queries AND semantic retrieval over unstructured text in one substrate.
- The **two use cases** — staffing analytics (reference implementation) and local AI knowledge substrate (per-profile vector indexes for running models).
- The **shared requirements** — schema-less ingest, SQL at scale, AI-embedding search, hot-swappable indexes, trials-as-data, local-first / no-cloud, repo-rebuildable.
- The **architectural invariants** — object storage as source of truth, catalog as sole metadata authority, hot-swap atomicity, profiles as first-class, playbooks-feed-the-index, errors findable in one HTTP call.
### What the rewrite changes

| Layer | Was (Rust) | Becomes (Go) | Confidence |
|---|---|---|---|
| HTTP gateway | Axum + Tokio | `net/http` + `chi` (or `gin`) | High — Go's bread and butter |
| gRPC | tonic | `google.golang.org/grpc` | High — Go is the reference impl |
| Object store | Apache Arrow `object_store` | `aws-sdk-go-v2/service/s3` + thin wrapper | High |
| Parquet I/O | parquet-rs (arrow-rs) | `apache/arrow-go/v18/parquet` | Medium — arrow-go lags arrow-rs but v18 covers our needs |
| Query engine | DataFusion | **Hard problem** (see §Hard problems) | Low — no like-for-like Go equivalent |
| Vector index (HNSW) | `hora` / hand-rolled | `coder/hnsw` or `Bithack/go-hnsw` (in-process) | High — HNSW is a self-contained algorithm |
| Vector backend (Lance) | `lance` (Rust) | **Hard problem** — likely dropped, Parquet-only | Medium |
| Frontend | Dioxus + WASM | Go `html/template` + HTMX + Alpine, or React/Vite split repo | Medium |
| Concurrency | Tokio async | Goroutines + `context.Context` | High |
| Config | TOML | TOML (`pelletier/go-toml/v2`) | High |
| Secrets | `SecretsProvider` trait | Go interface, same shape | High |
| AI bridge | HTTP client to Python sidecar | Same — Python sidecar stays | High |
| Embedded ML | Python sidecar (kept) | Python sidecar (kept) | n/a |
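
The `SecretsProvider` row ("Go interface, same shape") might land as something like this sketch — the method name and backing store are assumptions, not copied from the Rust trait:

```go
package main

import (
	"errors"
	"fmt"
)

// SecretsProvider is the hypothesized Go shape of the Rust trait:
// one lookup method, error on miss.
type SecretsProvider interface {
	Get(name string) (string, error)
}

// mapProvider is a toy implementation; a real one might read the
// environment or an encrypted file.
type mapProvider map[string]string

func (m mapProvider) Get(name string) (string, error) {
	v, ok := m[name]
	if !ok {
		return "", errors.New("secret not found: " + name)
	}
	return v, nil
}

func main() {
	var p SecretsProvider = mapProvider{"S3_KEY": "xxx"}
	v, err := p.Get("S3_KEY")
	fmt.Println(v, err == nil)
}
```

Swapping implementations behind the interface preserves the trait's role: callers never know where secrets live.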
### What stays Python (and why)

- **Embedding generation, image gen, deepface analysis** — Python's ML ecosystem is genuinely stronger than Go's. The sidecar stays as an HTTP service; the Go gateway calls it the same way the Rust gateway did. No port required.
- **Distillation pipeline scoring** — currently TS scripts; these can move to Go but are not first-tier priority. Keep TS until the Go gateway is live.

---
## Solution — Go service mesh over S3-compatible object storage

A modular Go service mesh, same architectural shape as the Rust system, with the Python AI sidecar retained as the embedding/generation boundary. Single repo (`golangLAKEHOUSE`), single Go module, multiple binaries built from one workspace.
### Locked stack (Go)

| Layer | Choice | Rationale |
|---|---|---|
| HTTP | `chi` | Idiomatic, middleware-friendly, used by major Go services |
| gRPC | `google.golang.org/grpc` | Reference implementation |
| Protobuf | `protoc-gen-go` + `buf` | Standard tooling |
| Object store | `aws-sdk-go-v2` | Mature, covers S3 + MinIO + RustFS |
| Parquet | `apache/arrow-go/v18` | Columnar I/O + Arrow interop (v18.5.2 — March 2026) |
| SQL engine | **Open** — see §Hard problems §1 | Biggest open decision |
| Vector index | `coder/hnsw` | Pure-Go HNSW, in-process, no external service |
| TOML config | `pelletier/go-toml/v2` | Standard |
| Logging | `log/slog` | Standard library since Go 1.21 |
| Tracing | `go.opentelemetry.io/otel` | Standard |
| Testing | `testing` + `testify` + `golden` files | Standard |
| Frontend | **Open** — `html/template` + HTMX vs separate Vite/React | Hard problem §3 |

No new dependencies without an ADR.

---
## Architecture

Same service decomposition as Rust, same data flow. Names preserved so the spec, ADRs, and runbooks port semantically:

```
┌─ ingestd ─→ storaged ─→ catalogd ─→ queryd ─┐
│                            │                │
│                            └→ vectord       │
│                                             │
└──────── aibridge ──HTTP──→ Python sidecar ──┘
                 │
  gateway ─ HTTP/gRPC ───────┘
     │
     └→ ui (HTMX or Vite)
```

| Service | Responsibility | Go binary |
|---|---|---|
| **gateway** | HTTP/gRPC ingress, routing, auth | `cmd/gateway` |
| **catalogd** | Metadata control plane, dataset registry | `cmd/catalogd` |
| **storaged** | Object I/O, multi-bucket, error journal | `cmd/storaged` |
| **queryd** | SQL execution over Parquet (engine TBD) | `cmd/queryd` |
| **ingestd** | CSV/JSON/PDF ingest → Parquet | `cmd/ingestd` |
| **vectord** | Embeddings + HNSW index + trial system | `cmd/vectord` |
| **journald** | Append-only mutation event log | `cmd/journald` |
| **aibridge** | HTTP client to Python sidecar | library, linked into gateway |
| **validator** | Production worker/permit validators | library, linked into gateway |
| **mcp** | MCP server (replaces Bun `mcp-server`) | `cmd/mcp` |
| **observer** | Autonomous iteration loop | `cmd/observer` |
| **auditor** | PR audit pipeline (replaces TS auditor) | `cmd/auditor` |
### Invariants (preserved verbatim from Rust PRD)

1. Object storage = source of truth
2. catalogd = sole metadata authority
3. No raw data in catalog — only pointers
4. vectord stores embeddings AS Parquet (portable)
5. ingestd is idempotent
6. Hot cache is performance, not source of truth
7. All services modular and independently replaceable
8. Indexes are hot-swappable (atomic pointer swap, rollback always possible)
9. Every reader gets its own profile
10. Trials are data, not logs
11. Operational failures findable in one HTTP call
12. Playbooks feed the index, not just the log
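
Invariant 8 (hot-swap via atomic pointer swap, rollback always possible) maps directly onto `sync/atomic` in Go. A minimal sketch, with an illustrative `Index` type:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Index stands in for a loaded HNSW index; only the handle matters here.
type Index struct{ Version int }

// IndexHandle gives readers a complete index at all times: the swap is
// a single pointer store, and the previous index is retained for rollback.
type IndexHandle struct {
	cur  atomic.Pointer[Index]
	prev atomic.Pointer[Index] // kept so rollback is always possible
}

func (h *IndexHandle) Swap(next *Index) {
	old := h.cur.Swap(next)
	h.prev.Store(old)
}

func (h *IndexHandle) Rollback() { h.cur.Store(h.prev.Load()) }

func main() {
	var h IndexHandle
	h.Swap(&Index{Version: 1})
	h.Swap(&Index{Version: 2})
	fmt.Println(h.cur.Load().Version) // 2
	h.Rollback()
	fmt.Println(h.cur.Load().Version) // 1
}
```

Readers call `h.cur.Load()` and never observe a half-built index; the swap is the only mutation.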

---
## Hard problems (the ones that don't trivially port)

These four define whether the rewrite is feasible. The SPEC answers each with a concrete library/approach choice and a fallback.
### 1. Query engine — replacing DataFusion

**Constraint:** DataFusion is the most consequential Rust dependency in the Lakehouse. It powers `queryd`, hybrid SQL+vector search, and hot-cache merge-on-read. Go has no like-for-like equivalent.

**Options:**

- **A. Embed DuckDB via cgo (`marcboeker/go-duckdb`)** — DuckDB reads Parquet natively, supports a SQL dialect close to DataFusion's, and has cgo Go bindings. Loses pure-Go portability (cgo required) but preserves the query model.
- **B. Run DuckDB as an external service** — one DuckDB process; Go talks to it via HTTP. Pure-Go gateway, separate-process query layer. Adds an operational surface (one more service to manage).
- **C. Hand-roll a query planner over Arrow** — parse SQL with `xwb1989/sqlparser`, plan over arrow-go RecordBatches, execute. High effort, high risk. Best avoided.
- **D. Postgres + foreign data wrappers** — point Postgres at Parquet via `parquet_fdw`. Mature, but introduces a database we said we'd avoid (ADR-001).

**Recommendation:** **Option A (DuckDB via cgo).** Preserves the SQL + columnar + Parquet model, single-binary deploy with cgo, mature. Cgo adds build complexity but is acceptable.
### 2. Lance backend — vectord-lance

**Constraint:** Lance is a Rust-native columnar format with built-in vector indexing. There is no Go port and no FFI binding. ADR-019 designates Lance as a per-profile *secondary* backend; Parquet+HNSW is *primary*.

**Options:**

- **A. Drop Lance entirely.** Parquet+HNSW handles the primary path; Lance was secondary. ADR-019 stays valid for the Rust era; the Go Lakehouse is Parquet-only.
- **B. Keep Lance via FFI/cgo.** Build Lance as a Rust dylib, call from Go via cgo. Reintroduces Rust into the build chain; defeats the point.
- **C. Wait for a Lance Go port.** Doesn't exist; not on the Lance roadmap.

**Recommendation:** **Option A (drop Lance).** The hybrid backend was optional per-profile; Parquet+HNSW carries the primary path. If a specific workload later proves Lance-only, it can be exposed as a Python-sidecar service.
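
For intuition (and as a recall baseline in vectord tests), the exact search that `coder/hnsw` approximates is a brute-force cosine scan — a stdlib-only sketch, not the production path:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}

type hit struct {
	ID    int
	Score float64
}

// topK is exact nearest-neighbor search: O(n·d) per query, which is
// what HNSW trades for approximate O(log n) traversal.
func topK(query []float32, corpus [][]float32, k int) []hit {
	hits := make([]hit, 0, len(corpus))
	for id, v := range corpus {
		hits = append(hits, hit{id, cosine(query, v)})
	}
	sort.Slice(hits, func(i, j int) bool { return hits[i].Score > hits[j].Score })
	if k < len(hits) {
		hits = hits[:k]
	}
	return hits
}

func main() {
	corpus := [][]float32{{1, 0}, {0, 1}, {0.9, 0.1}}
	fmt.Println(topK([]float32{1, 0}, corpus, 2)[0].ID)
}
```

Running the same queries through this scan and through the HNSW index gives a measured recall number, which fits the trials-as-data requirement.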
### 3. UI — replacing Dioxus

**Constraint:** Dioxus is a Rust+WASM frontend framework. No Go equivalent at the same level of polish. The current `crates/ui` covers Ask, Explore, SQL, System tabs.

**Options:**

- **A. `html/template` + HTMX + Alpine.js** — server-rendered Go, partial-page swaps via HTMX. Single repo, minimal JS, fits Go's "boring is good" ethos.
- **B. Separate Vite/React frontend** — `golangLAKEHOUSE-ui` repo, Go gateway serves static files. Modern UX patterns, more dev tooling needed.
- **C. Keep Dioxus + WASM as a build step** — defeats the rewrite.

**Recommendation:** **Option A** for v1; revisit if UI requirements demand React-tier interactivity. The current Lakehouse UIs (`/lakehouse/` demo + staffer console) are mostly server-rendered HTML with vanilla JS — `html/template` + HTMX is a strong fit.
### 4. Arrow ecosystem maturity

**Constraint:** `arrow-go/v18` lags `arrow-rs` in compute kernels, expression APIs, and some compression codecs. Specific known gaps: limited `cast` kernel coverage, no equivalent of `arrow-rs`'s `compute::sort_to_indices` for all dtypes, no Acero-style streaming execution.

**Mitigation:** the Go Lakehouse relies on Arrow primarily for **Parquet I/O + RecordBatch transport**, not for in-process compute (that's DuckDB's job). The narrower scope makes arrow-go's gaps less load-bearing.

**Acceptance gate:** any Arrow API the Go Lakehouse uses must be covered by `arrow-go/v18`. Anything missing → file an upstream issue, implement locally if blocking, contribute back.

---
## Migration strategy

### What ports verbatim

- Problem statement, use cases, requirements
- Architectural invariants (1–12)
- ADRs 001–021 (preserved as design intent; some change implementation)
- Federation building blocks (multi-bucket, error-journal, append-log)
### What rebuilds from data

- HNSW indexes (rebuild from Parquet embeddings — ADR-008 is verbatim preserved)
- Pathway memory state (88 traces in `data/_pathway_memory/state.json` on the Rust side — port the JSON format and reload; the byte-matching contract becomes Go-Go instead of Rust-TS)
- Catalog manifests (Parquet, portable)
- Distillation v1.0.0 substrate (port the SFT/contamination-firewall logic; the fixture-as-gate pattern stays)
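
Reloading pathway memory in Go is a straight `encoding/json` decode of the preserved state file. A hedged sketch — the field names below are hypothetical, not the real `state.json` schema; the point is that the JSON format itself is the portable contract:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trace and PathwayState are illustrative shapes; the real contract is
// whatever state.json holds on the Rust side.
type Trace struct {
	Task    string  `json:"task"`
	Outcome string  `json:"outcome"`
	Score   float64 `json:"score"`
}

type PathwayState struct {
	Traces []Trace `json:"traces"`
}

// load decodes a preserved state blob into Go structs.
func load(raw []byte) (PathwayState, error) {
	var s PathwayState
	err := json.Unmarshal(raw, &s)
	return s, err
}

func main() {
	raw := []byte(`{"traces":[{"task":"ingest csv","outcome":"ok","score":0.8}]}`)
	s, err := load(raw)
	fmt.Println(err == nil, len(s.Traces), s.Traces[0].Outcome)
}
```

Because both ends of the byte-matching contract are now Go, a golden-file test on the marshalled output pins the format.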
### What ships first (port-order outline — see SPEC.md for detail)

1. **Phase G0** — Skeleton: `cmd/gateway`, `cmd/catalogd`, `cmd/storaged`, `cmd/ingestd`. Single-bucket, no auth, CSV→Parquet, query via DuckDB.
2. **Phase G1** — Vector path: `cmd/vectord` with HNSW + RAG endpoint.
3. **Phase G2** — Multi-profile + federation (ADRs 016–017).
4. **Phase G3** — Pathway memory + distillation port.
5. **Phase G4** — MCP server, observer, auditor (TS surfaces → Go).
6. **Phase G5** — UI (HTMX) and demo parity with `devop.live/lakehouse/`.

Detailed acceptance gates in `SPEC.md`.
### What does NOT migrate

- The Rust crates themselves (archived in the original `lakehouse` repo)
- The TS scrum/auditor pipelines (rewritten in Go in Phase G4)
- The Bun mcp-server (rewritten in Go in Phase G4)
- The Python sidecar (kept as-is, behind aibridge)

---
## Non-goals

- **No port of `vectord-lance`.** The Lance backend is dropped; Parquet+HNSW is the only vector backend.
- **No retention of Rust in the build chain.** No cgo-to-Rust bridges, no FFI to keep specific crates alive. Cgo to **C/C++** (DuckDB) is acceptable.
- **No new feature work during the port.** Feature parity with the Rust Lakehouse at the cutoff commit is the bar; new capabilities defer to post-port phases.
- **No live-migration of running services.** The Rust Lakehouse stops serving when Go reaches feature parity; data moves once via Parquet re-pointer.

---
## Ratified decisions (2026-04-28, J)

The six gating questions are answered. Phase G0 is unblocked. Full context for each lives in `docs/DECISIONS.md` ADR-001.

| # | Decision |
|---|---|
| 1 | **DuckDB via cgo** — `marcboeker/go-duckdb` is the query engine. Cgo accepted as the cost of a mature SQL+Parquet path. |
| 2 | **HTMX** — server-rendered `html/template` + HTMX + Alpine.js. Single-binary deploy. React is post-G5 if needed. |
| 3 | **Gitea** — repo lives at `git.agentview.dev/profit/golangLAKEHOUSE` (same server as the Rust lakehouse). |
| 4 | **Distillation rebuild in Go** — port the SFT export + contamination firewall logic, but bit-identical reproducibility is a Rust-era property. New Go fixtures, new acceptance gates. |
| 5 | **Pathway memory starts clean** — Go pathway memory begins with zero traces. The existing 88 Rust traces are preserved at `data/_pathway_memory/state.json` in the lakehouse repo as a historical record (see `docs/RUST_PATHWAY_MEMORY_NOTE.md`). |
| 6 | **Auditor longitudinal signal restarts** — `audit_baselines.jsonl` is a Rust-era artifact. Go auditor begins a fresh drift signal. |