root a7620c8b6f PRD: name the product vision — small-model pipeline + 5-loop substrate
Adds a "Product vision" section before the Direction-pivot section.
Captures the framing J flagged 2026-04-29: the Go refactor is not the
goal. The goal is a small-model-driven autonomous pipeline that gets
better with each run, with frontier models in audit/oversight, not
the hot path.

Five loops named explicitly:
  1. Knowledge pathway (pathway memory + matrix indexer)
  2. Execution (small models on focused context)
  3. Observer (refines configs that got the model to a good pathway)
  4. Rating + distillation (outcomes fold back into the playbook)
  5. Drift (measure when the playbook stops matching reality)

Triage / human-in-loop named as the system's job, not an escape
hatch. The gate: "playbook + matrix indexer must give the results
we're looking for" — single load-bearing acceptance criterion.

Why Go after Rust: second-language pass surfaces architectural
weaknesses Rust hid; the pipeline must work AS A PIPELINE, not as
crates that interact. Maps existing Rust components (✓ pathway, ✓
matrix, ✓ observer, ✓ distillation, ✓ auditor; partial: drift,
rating gate, triage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:17:01 -05:00

# PRD: Lakehouse-Go — Versioned Knowledge Substrate, Reimplemented in Go
**Status:** DRAFT — seed document for the Go-direction rewrite. Supersedes
`/home/profit/lakehouse/docs/PRD.md` (Rust) once ratified.
**Created:** 2026-04-28
**Owner:** J
**Sibling:** `SPEC.md` — component-by-component port plan with effort
estimates, library choices, and acceptance gates.
---
## Product vision — what we're actually building
**The Go refactor isn't the goal. The goal is a small-model-driven autonomous pipeline that gets better with each run, with frontier models in audit/oversight and humans triaged in only for the genuinely abstract cases.**
The Rust Lakehouse already has most of the pieces:
- **Pathway memory** (`internal/pathway` in Go, 88 Rust traces preserved) — what we tried, what worked
- **Matrix indexer** (SPEC §3.4) — multi-corpus retrieve+merge that gives the small model the right knowledge slice for *this* task
- **Observer** — watches runs, refines configs, escalates
- **Distillation v1.0.0** (`e7636f2`) — turns successful runs into denser playbooks
- **Auditor cross-lineage fabric** — Kimi/Haiku/Opus oversight on small-model outputs
What the Go refactor is FOR: a second-language pass surfaces architectural weaknesses that Rust hid. The pipeline has to pull together cleanly *as a pipeline* — not as 15 crates that happen to interact.
### The five-loop substrate
1. **Knowledge pathway loop** — pathway memory + matrix indexer give the small model context for the task. Pathway answers "what worked last time?"; matrix answers "what's relevant now?"
2. **Execution loop** — small model runs on focused context. Frontier API calls are reserved for audit/escalation, not the inner loop. Cost + rate limits stay sane.
3. **Observer loop** — watches each run, refines the configs (matrix corpus picks, downgrade gate, prompt mold) that got the model to a good pathway. Outputs new config, not new prompt.
4. **Rating + distillation loop** — successful outcomes get scored and folded back into the playbook substrate. The playbook gets denser; the next run starts smarter.
5. **Drift loop** — quantify when the distilled playbook stops matching reality (codebase changed, contracts shifted, profiles updated). Drift is a *measured* signal, not "hope nothing broke."
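
One way to make the drift loop's "measured signal" concrete is to replay playbook entries and report the fraction whose outcome no longer matches. The sketch below assumes that reading; `PlaybookEntry`, its fields, and `DriftScore` are hypothetical names for illustration, not part of the existing substrate.

```go
package main

import "fmt"

// PlaybookEntry pairs what the distilled playbook expects with what a
// fresh replay observed. Hypothetical shape, for illustration only.
type PlaybookEntry struct {
	ID       string
	Expected string
	Observed string
}

// DriftScore returns the fraction of entries whose observed outcome
// differs from the expected one. An empty playbook scores 0.
func DriftScore(entries []PlaybookEntry) float64 {
	if len(entries) == 0 {
		return 0
	}
	mismatched := 0
	for _, e := range entries {
		if e.Expected != e.Observed {
			mismatched++
		}
	}
	return float64(mismatched) / float64(len(entries))
}

func main() {
	entries := []PlaybookEntry{
		{ID: "a", Expected: "ok", Observed: "ok"},
		{ID: "b", Expected: "ok", Observed: "schema_error"},
		{ID: "c", Expected: "ok", Observed: "ok"},
		{ID: "d", Expected: "ok", Observed: "ok"},
	}
	fmt.Printf("drift=%.2f\n", DriftScore(entries))
}
```

A score like this can be tracked per run and compared against a threshold, so a drifting playbook shows up as a number rather than as a surprise.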
### The gate
**The playbook + matrix indexer must produce the results we're looking for.** That's the single load-bearing acceptance criterion. Throughput, scaling, code elegance — all secondary. If a deep-field reality test on the 500K corpus surfaces wrong answers, the loop isn't working and we fix that before adding anything else.
### Triage / human-in-loop
Most cases are concrete enough that small-model + pathway + matrix can complete them. Some are not, and those need a human. The system's job is to **identify which is which** and only escalate the second class. Frontier models partially solve this internally with their thinking loops; we're externalizing it so:
- Small models are swappable (vendor independence)
- Drift is measurable (quantitative signal, not vibes)
- Each loop iteration is auditable (the pathway memory IS the audit trail)
This is what the auditor cross-lineage fabric proves out in Rust — Opus auto-promote on diffs >100k chars is the same pattern: triage by signal, not by guesswork.
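
"Triage by signal" can be sketched as a pure function from measured facts to a reviewer tier. Only the 100k-character diff threshold comes from the text above; the tier names, the `Signal` fields, and the retry budget are assumptions made for the example.

```go
package main

import "fmt"

// Tier names who handles an item. Names are illustrative.
type Tier string

const (
	TierSmall    Tier = "small-model"
	TierFrontier Tier = "frontier-audit"
	TierHuman    Tier = "human"
)

// Signal holds the measurable facts triage decides on.
// Field names are hypothetical.
type Signal struct {
	DiffChars      int  // size of the diff under review
	PathwayHit     bool // pathway memory holds a prior successful trace
	FailedAttempts int  // small-model attempts that failed validation
}

const (
	frontierDiffThreshold = 100_000 // from the PRD: promote on diffs >100k chars
	maxSmallModelAttempts = 3       // assumed retry budget
)

// Triage picks a tier from the signal alone, with no guesswork.
func Triage(s Signal) Tier {
	switch {
	case s.FailedAttempts >= maxSmallModelAttempts && !s.PathwayHit:
		// Assumed rule: repeated failure with no prior pathway goes to a human.
		return TierHuman
	case s.DiffChars > frontierDiffThreshold:
		return TierFrontier
	default:
		return TierSmall
	}
}

func main() {
	fmt.Println(Triage(Signal{DiffChars: 150_000, PathwayHit: true}))
}
```

Because the decision depends only on recorded fields, each escalation can be replayed from the pathway memory and audited later.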
## Direction pivot — why this PRD exists
The Rust-first Lakehouse (15 crates, ~24 unmerged commits past PR #11,
distillation v1.0.0 substrate frozen at `e7636f2`) is being reimplemented
in Go on the principle that **anything Go can carry, Go carries**. This
is an explicit re-platforming, not a refactor.
### What the rewrite preserves (verbatim from the Rust PRD)
- The **problem statement** — legacy data systems silo information; AI
needs both fast analytical queries AND semantic retrieval over
unstructured text in one substrate.
- The **two use cases** — staffing analytics (reference implementation)
and local AI knowledge substrate (per-profile vector indexes for
running models).
- The **shared requirements** — schema-less ingest, SQL at scale,
AI-embedding search, hot-swappable indexes, trials-as-data,
local-first / no-cloud, repo-rebuildable.
- The **architectural invariants** — object storage as source of truth,
catalog as sole metadata authority, hot-swap atomicity, profiles as
first-class, playbooks-feed-the-index, errors findable in one HTTP
call.
### What the rewrite changes
| Layer | Was (Rust) | Becomes (Go) | Confidence |
|---|---|---|---|
| HTTP gateway | Axum + Tokio | `net/http` + `chi` (or `gin`) | High — Go's bread and butter |
| gRPC | tonic | `google.golang.org/grpc` | High — Go is the reference impl |
| Object store | Apache Arrow `object_store` | `aws-sdk-go-v2/service/s3` + thin wrapper | High |
| Parquet I/O | parquet-rs (arrow-rs) | `apache/arrow-go/v18/parquet` | Medium — arrow-go lags arrow-rs but v18 covers our needs |
| Query engine | DataFusion | **Hard problem** (see §Hard problems) | Low — no like-for-like Go equivalent |
| Vector index (HNSW) | `hora` / hand-rolled | `coder/hnsw` or `Bithack/go-hnsw` (in-process) | High — HNSW is a self-contained algorithm |
| Vector backend (Lance) | `lance` (Rust) | **Hard problem** — likely dropped, Parquet-only | Medium |
| Frontend | Dioxus + WASM | Go `html/template` + HTMX + Alpine, or React/Vite split repo | Medium |
| Concurrency | Tokio async | Goroutines + `context.Context` | High |
| Config | TOML | TOML (`pelletier/go-toml/v2`) | High |
| Secrets | `SecretsProvider` trait | Go interface, same shape | High |
| AI bridge | HTTP client to Python sidecar | Same — Python sidecar stays | High |
| Embedded ML | Python sidecar (kept) | Python sidecar (kept) | n/a |
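
As an illustration of the "Go interface, same shape" row, a secrets provider might look like the sketch below. The method name `Get`, the error value, and both implementations are assumptions; the Rust trait's actual signature is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// ErrSecretNotFound is returned when a provider has no value for a name.
var ErrSecretNotFound = errors.New("secret not found")

// SecretsProvider is a hypothetical Go counterpart of the Rust trait.
type SecretsProvider interface {
	Get(name string) (string, error)
}

// EnvSecrets reads secrets from environment variables with a prefix.
type EnvSecrets struct{ Prefix string }

func (e EnvSecrets) Get(name string) (string, error) {
	v, ok := os.LookupEnv(e.Prefix + name)
	if !ok {
		return "", fmt.Errorf("%w: %s", ErrSecretNotFound, name)
	}
	return v, nil
}

// MapSecrets is an in-memory provider, handy for tests.
type MapSecrets map[string]string

func (m MapSecrets) Get(name string) (string, error) {
	v, ok := m[name]
	if !ok {
		return "", fmt.Errorf("%w: %s", ErrSecretNotFound, name)
	}
	return v, nil
}

func main() {
	var p SecretsProvider = MapSecrets{"S3_ACCESS_KEY": "example"}
	v, err := p.Get("S3_ACCESS_KEY")
	fmt.Println(v, err)
	_, err = p.Get("MISSING")
	fmt.Println(errors.Is(err, ErrSecretNotFound))
}
```

Callers depend only on the interface, so a vault-backed provider could replace the environment one without touching the services that use it.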
### What stays Python (and why)
- **Embedding generation, image gen, deepface analysis** — Python's ML
ecosystem is genuinely stronger than Go's. The sidecar stays as an
HTTP service; the Go gateway calls it the same way the Rust gateway
did. No port required.
- **Distillation pipeline scoring** — current TS scripts; can move to Go
but not first-tier priority. Keep TS until Go gateway is live.
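
A minimal sketch of the gateway-to-sidecar call, assuming a JSON `POST /embed` route. The route, the request and response shapes, and the `Bridge` type are hypothetical; the sidecar's real contract should be taken from the existing Rust client. An in-process fake stands in for the Python service so the example runs on its own.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Request/response shapes are assumptions, not the sidecar's real contract.
type embedRequest struct {
	Texts []string `json:"texts"`
}

type embedResponse struct {
	Vectors [][]float32 `json:"vectors"`
}

// Bridge is a thin HTTP client for the Python sidecar.
type Bridge struct {
	BaseURL string
	Client  *http.Client
}

// Embed posts texts to the hypothetical /embed route and returns one vector per text.
func (b *Bridge) Embed(ctx context.Context, texts []string) ([][]float32, error) {
	body, err := json.Marshal(embedRequest{Texts: texts})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, b.BaseURL+"/embed", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := b.Client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("sidecar returned %s", resp.Status)
	}
	var out embedResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Vectors, nil
}

// demo calls Embed against a fake sidecar that returns each text's
// length as a one-dimensional vector.
func demo() ([][]float32, error) {
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var in embedRequest
		if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		out := embedResponse{Vectors: make([][]float32, len(in.Texts))}
		for i, t := range in.Texts {
			out.Vectors[i] = []float32{float32(len(t))}
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(out)
	}))
	defer fake.Close()

	// The context deadline bounds the whole round trip.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	b := &Bridge{BaseURL: fake.URL, Client: fake.Client()}
	return b.Embed(ctx, []string{"hello", "lakehouse"})
}

func main() {
	vecs, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(vecs)
}
```

Passing a `context.Context` through every call keeps cancellation and timeouts in the caller's hands, which matches the "Goroutines + `context.Context`" row in the table above.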
---
## Solution — Go service mesh over S3-compatible object storage
A modular Go service mesh, same architectural shape as the Rust system,
with the Python AI sidecar retained as the embedding/generation
boundary. Single repo (`golangLAKEHOUSE`), single Go module, multiple
binaries built from one workspace.
### Locked stack (Go)
| Layer | Choice | Rationale |
|---|---|---|
| HTTP | `chi` | Idiomatic, middleware-friendly, used by major Go services |
| gRPC | `google.golang.org/grpc` | Reference implementation |
| Protobuf | `protoc-gen-go` + `buf` | Standard tooling |
| Object store | `aws-sdk-go-v2` | Mature, covers S3 + MinIO + RustFS |
| Parquet | `apache/arrow-go/v18` | Columnar I/O + Arrow interop (v18.5.2 — March 2026) |
| SQL engine | **Open** — see §Hard problems §1 | Biggest open decision |
| Vector index | `coder/hnsw` | Pure-Go HNSW, in-process, no external service |
| TOML config | `pelletier/go-toml/v2` | Standard |
| Logging | `log/slog` | Standard library since Go 1.21 |
| Tracing | `go.opentelemetry.io/otel` | Standard |
| Testing | `testing` + `testify` + `golden` files | Standard |
| Frontend | **Open**: `html/template` + HTMX vs separate Vite/React | Hard problem §3 |
No new dependencies without an ADR.
---
## Architecture
Same service decomposition as Rust, same data flow. Names preserved so
the spec, ADRs, and runbooks port semantically:
```
┌─ ingestd ─→ storaged ─→ catalogd ─→ queryd ──┐
│                │                             │
│                └→ vectord                    │
│                                              │
└──────── aibridge ──HTTP──→ Python sidecar ───┘
gateway ─ HTTP/gRPC ────┘
                        └→ ui (HTMX or Vite)
```
| Service | Responsibility | Go binary |
|---|---|---|
| **gateway** | HTTP/gRPC ingress, routing, auth | `cmd/gateway` |
| **catalogd** | Metadata control plane, dataset registry | `cmd/catalogd` |
| **storaged** | Object I/O, multi-bucket, error journal | `cmd/storaged` |
| **queryd** | SQL execution over Parquet (engine TBD) | `cmd/queryd` |
| **ingestd** | CSV/JSON/PDF ingest → Parquet | `cmd/ingestd` |
| **vectord** | Embeddings + HNSW index + trial system | `cmd/vectord` |
| **journald** | Append-only mutation event log | `cmd/journald` |
| **aibridge** | HTTP client to Python sidecar | library, linked into gateway |
| **validator** | Production worker/permit validators | library, linked into gateway |
| **mcp** | MCP server (replaces Bun `mcp-server`) | `cmd/mcp` |
| **observer** | Autonomous iteration loop | `cmd/observer` |
| **auditor** | PR audit pipeline (replaces TS auditor) | `cmd/auditor` |
### Invariants (preserved verbatim from Rust PRD)
1. Object storage = source of truth
2. catalogd = sole metadata authority
3. No raw data in catalog — only pointers
4. vectord stores embeddings AS Parquet (portable)
5. ingestd is idempotent
6. Hot cache is performance, not source of truth
7. All services modular and independently replaceable
8. Indexes are hot-swappable (atomic pointer swap, rollback always possible)
9. Every reader gets its own profile
10. Trials are data, not logs
11. Operational failures findable in one HTTP call
12. Playbooks feed the index, not just the log
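
Invariant 8 (atomic pointer swap with rollback) maps naturally onto `sync/atomic.Pointer` from Go 1.19 onward. The sketch below assumes a single writer and keeps only one previous index; `Index` and `SwappableIndex` are placeholder names, and a real vectord would hold the HNSW graph where `Version` sits.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Index stands in for a built HNSW index. Placeholder type.
type Index struct {
	Version int
}

// SwappableIndex lets readers load the live index without locks while a
// single writer swaps in a new one or rolls back to the previous one.
type SwappableIndex struct {
	current  atomic.Pointer[Index]
	previous atomic.Pointer[Index]
}

// Load returns the live index, or nil if none has been installed.
func (s *SwappableIndex) Load() *Index { return s.current.Load() }

// Swap installs next as the live index and remembers the old one.
func (s *SwappableIndex) Swap(next *Index) {
	s.previous.Store(s.current.Swap(next))
}

// Rollback restores the remembered index. It reports false when there
// is nothing to roll back to.
func (s *SwappableIndex) Rollback() bool {
	prev := s.previous.Swap(nil)
	if prev == nil {
		return false
	}
	s.current.Store(prev)
	return true
}

func main() {
	var idx SwappableIndex
	idx.Swap(&Index{Version: 1})
	idx.Swap(&Index{Version: 2})
	fmt.Println(idx.Load().Version) // 2
	idx.Rollback()
	fmt.Println(idx.Load().Version) // 1
}
```

Readers only ever see a fully built index because the pointer changes in one step. Keeping "rollback always possible" across more than one generation would need a history of prior indexes rather than the single slot shown here.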
---
## Hard problems (the ones that don't trivially port)
These four define whether the rewrite is feasible. Spec answers each
with a concrete library/approach choice and a fallback.
### 1. Query engine — replacing DataFusion
**Constraint:** DataFusion is the most consequential Rust dependency in
the Lakehouse. It powers `queryd`, hybrid SQL+vector search, and
hot-cache merge-on-read. Go has no like-for-like equivalent.
**Options:**
- **A. Embed DuckDB via cgo (`marcboeker/go-duckdb`)** — DuckDB reads
Parquet natively, supports SQL similar to DataFusion, has cgo Go
bindings. Loses pure-Go portability (cgo required) but preserves the
query model.
- **B. Run DuckDB as an external service** — one DuckDB process, Go
talks to it via HTTP. Pure-Go gateway, separate-process query layer.
Adds an operational surface (one more service to manage).
- **C. Hand-roll a query planner over Arrow** — parse SQL with
`xwb1989/sqlparser`, plan over arrow-go RecordBatches, execute. High
effort, high risk. Best avoided.
- **D. Postgres + foreign data wrappers** — point Postgres at Parquet
via `parquet_fdw`. Mature but introduces a database we said we'd
avoid (ADR-001).
**Recommendation:** **Option A (DuckDB via cgo)**. Preserves the SQL +
columnar + Parquet model, single-binary deploy with cgo, mature. Cgo
adds build complexity but is acceptable.
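
Under Option A, queryd can scan Parquet directly through DuckDB's `read_parquet` table function. The sketch below only builds the SQL text, so it runs without cgo; `ParquetSelectSQL` and its quoting helpers are hypothetical names, and the example path is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteLiteral renders s as a SQL string literal, doubling single quotes.
func quoteLiteral(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// quoteIdent renders s as a quoted identifier, doubling double quotes.
func quoteIdent(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// ParquetSelectSQL builds a SELECT over the Parquet files matching glob.
// No columns means "*"; a limit of 0 or less means no LIMIT clause.
func ParquetSelectSQL(columns []string, glob string, limit int) string {
	cols := "*"
	if len(columns) > 0 {
		quoted := make([]string, len(columns))
		for i, c := range columns {
			quoted[i] = quoteIdent(c)
		}
		cols = strings.Join(quoted, ", ")
	}
	q := fmt.Sprintf("SELECT %s FROM read_parquet(%s)", cols, quoteLiteral(glob))
	if limit > 0 {
		q += fmt.Sprintf(" LIMIT %d", limit)
	}
	return q
}

func main() {
	fmt.Println(ParquetSelectSQL([]string{"name", "city"}, "data/staff/*.parquet", 10))
}
```

In a real queryd the resulting string would be handed to `database/sql` with the `marcboeker/go-duckdb` driver registered; that wiring is left out here because it requires cgo and the DuckDB library at build time.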
### 2. Lance backend — vectord-lance
**Constraint:** Lance is a Rust-native columnar format with built-in
vector indexing. There is no Go port and no FFI binding. ADR-019
designates Lance as a per-profile *secondary* backend; Parquet+HNSW is
*primary*.
**Options:**
- **A. Drop Lance entirely.** Parquet+HNSW handles primary path; Lance
was secondary. ADR-019 stays valid for the Rust era; Go Lakehouse is
Parquet-only.
- **B. Keep Lance via FFI/cgo.** Build Lance as a Rust dylib, call from
Go via cgo. Reintroduces Rust into the build chain; defeats the
point.
- **C. Wait for Lance Go port.** Doesn't exist; not on Lance roadmap.
**Recommendation:** **Option A (drop Lance).** The hybrid backend was
optional per-profile; Parquet+HNSW carries the primary path. If a
specific workload later proves Lance-only, it can be exposed as a
Python-sidecar service.
### 3. UI — replacing Dioxus
**Constraint:** Dioxus is a Rust+WASM frontend framework. No Go
equivalent at the same level of polish. The current `crates/ui` covers
Ask, Explore, SQL, System tabs.
**Options:**
- **A. `html/template` + HTMX + Alpine.js** — server-rendered Go,
partial-page swaps via HTMX. Single repo, minimal JS, fits Go's
"boring is good" ethos.
- **B. Separate Vite/React frontend** — `golangLAKEHOUSE-ui` repo,
Go gateway serves static files. Modern UX patterns, more dev tooling
needed.
- **C. Keep Dioxus + WASM as a build step** — defeats the rewrite.
**Recommendation:** **Option A** for v1; revisit if UI requirements
demand React-tier interactivity. The current Lakehouse UIs (`/lakehouse/`
demo + staffer console) are mostly server-rendered HTML with vanilla
JS — `html/template` + HTMX is a strong fit.
### 4. Arrow ecosystem maturity
**Constraint:** `arrow-go/v18` lags `arrow-rs` in compute kernels,
expression APIs, and some compression codecs. Specific gaps known:
limited `cast` kernel coverage, no equivalent of `arrow-rs`'s
`compute::sort_to_indices` for all dtypes, no Acero-style streaming
execution.
**Mitigation:** the Go Lakehouse relies on Arrow primarily for
**Parquet I/O + RecordBatch transport**, not for in-process compute
(that's DuckDB's job). The narrower scope makes arrow-go's gaps less
load-bearing.
**Acceptance gate:** any Arrow API the Go Lakehouse uses must be
covered by `arrow-go/v18`. Anything missing → file an upstream issue,
implement locally if blocking, contribute back.
---
## Migration strategy
### What ports verbatim
- Problem statement, use cases, requirements
- Architectural invariants (1–12)
- ADRs 001–021 (preserved as design intent; some change implementation)
- Federation building blocks (multi-bucket, error-journal, append-log)
### What rebuilds from data
- HNSW indexes (rebuild from Parquet embeddings — ADR-008 is verbatim
preserved)
- Pathway memory state (88 traces in `data/_pathway_memory/state.json`
on Rust side — port the JSON format and reload; the byte-matching
contract becomes Go-Go instead of Rust-TS)
- Catalog manifests (Parquet, portable)
- Distillation v1.0.0 substrate (port the SFT/contamination-firewall
logic; the fixture-as-gate pattern stays)
### What ships first (port-order outline — see SPEC.md for detail)
1. **Phase G0** — Skeleton: `cmd/gateway`, `cmd/catalogd`, `cmd/storaged`,
`cmd/ingestd`. Single-bucket, no auth, CSV→Parquet, query via DuckDB.
2. **Phase G1** — Vector path: `cmd/vectord` with HNSW + RAG endpoint.
3. **Phase G2** — Multi-profile + federation (ADRs 016–017).
4. **Phase G3** — Pathway memory + distillation port.
5. **Phase G4** — MCP server, observer, auditor (TS surfaces → Go).
6. **Phase G5** — UI (HTMX) and demo parity with `devop.live/lakehouse/`.
Detailed acceptance gates in `SPEC.md`.
### What does NOT migrate
- The Rust crates themselves (archived in the original `lakehouse` repo)
- The TS scrum/auditor pipelines (rewritten in Go in Phase G4)
- The Bun mcp-server (rewritten in Go in Phase G4)
- The Python sidecar (kept as-is, behind aibridge)
---
## Non-goals
- **No port of `vectord-lance`.** Lance backend is dropped; Parquet+HNSW
is the only vector backend.
- **No retention of Rust in the build chain.** No cgo-to-Rust bridges,
no FFI to keep specific crates alive. Cgo to **C/C++** (DuckDB) is
acceptable.
- **No new feature work during the port.** Feature parity with the Rust
Lakehouse at the cutoff commit is the bar; new capabilities defer to
post-port phases.
- **No live-migration of running services.** The Rust Lakehouse stops
serving when Go reaches feature parity; data moves once via Parquet
re-pointer.
---
## Ratified decisions (2026-04-28, J)
The six gating questions are answered. Phase G0 is unblocked. Full
context for each lives in `docs/DECISIONS.md` ADR-001.
| # | Decision |
|---|---|
| 1 | **DuckDB via cgo**: `marcboeker/go-duckdb` is the query engine. Cgo accepted as the cost of a mature SQL+Parquet path. |
| 2 | **HTMX** — server-rendered `html/template` + HTMX + Alpine.js. Single-binary deploy. React is post-G5 if needed. |
| 3 | **Gitea** — repo lives at `git.agentview.dev/profit/golangLAKEHOUSE` (same server as the Rust lakehouse). |
| 4 | **Distillation rebuild in Go** — port the SFT export + contamination firewall logic, but bit-identical reproducibility is a Rust-era property. New Go fixtures, new acceptance gates. |
| 5 | **Pathway memory starts clean** — Go pathway memory begins with zero traces. The existing 88 Rust traces are preserved at `data/_pathway_memory/state.json` in the lakehouse repo as a historical record (see `docs/RUST_PATHWAY_MEMORY_NOTE.md`). |
| 6 | **Auditor longitudinal signal restarts**: `audit_baselines.jsonl` is a Rust-era artifact. Go auditor begins a fresh drift signal. |