Per the 2026-04-29 scope-discipline pause: the wave shipped four
pieces beyond SPEC §3.4 component scope, and one architectural
pattern surfaced (Archon-style multi-pass workflow runner) that's
the observer's natural growth path. Document them as port targets
so the next scrum review has authoritative SPEC components.
§3.5 — Drift quantification (loop 5 of the PRD)
Names the SCORER drift work shipped in be65f85 + the deferred
shapes (PLAYBOOK drift, EMBEDDING drift, AUDIT BASELINE drift).
Acceptance gates G3.5.A–B.
§3.6 — Staffing-side structured filter
Names the metadata-filter MVP shipped in b199093 + the deferred
pre-retrieval SQL gate via queryd. Acceptance gates G3.6.A–C.
§3.7 — Operational rating wiring
Names the bulk playbook-record endpoint shipped in 6392772 + the
deferred UI shim, negative-feedback path, and time-decay.
Acceptance gates G3.7.A–B.
§3.8 — Observer-KB workflow runner (Archon-style multi-pass) —
PORT TARGET, not yet started
Documents the architecture J was working on across the Rust
observer-kb branch (10 commits ahead of main, never merged) and
the local Archon mod (committed 2026-04-29 as 3f2afc8 in
/home/profit/external/Archon, not pushed to coleam00/Archon).
The pattern: multi-pass mode chain (extract → validator →
hallucination → consensus → redteam → pipeline → render) where
each pass is a deterministic measurement. The observer is the
natural home — workflows ARE observation patterns whose every
step is recorded. Five components in dependency order: workflow
definition (YAML), node executor (DAG runner), provenance
recording (ObservedOps), mode catalog (matrix.search,
distillation.score, drift.scorer, llm.chat), HTTP surface
(/v1/observer/workflow/run).
Reference materials on the system (preserved, not lost):
- /home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml
(Rust main, 69919d9) — 3-node Archon-via-Lakehouse proof
- /home/profit/external/Archon dev branch — upstream engine
with local pi/provider.ts mod (3f2afc8) for Lakehouse routing
- Rust observer-kb branch — apps/observer-kb/docs/PRD.md +
Python prototypes proven on real ChatGPT/Claude PDF data
Acceptance gates G3.8.A–D. Estimated effort: L.
PRD updated with "Observer as system resource (clarified
2026-04-29)" section pointing at §3.8 as the architectural growth
path. The bare-bones observerd in bc9ab93 is the substrate; the
workflow runner is what makes it the "objective measurement engine"
the small-model pipeline needs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# PRD: Lakehouse-Go — Versioned Knowledge Substrate, Reimplemented in Go

**Status:** DRAFT — seed document for the Go-direction rewrite. Supersedes `/home/profit/lakehouse/docs/PRD.md` (Rust) once ratified.
**Created:** 2026-04-28
**Owner:** J
**Sibling:** `SPEC.md` — component-by-component port plan with effort estimates, library choices, and acceptance gates.

---
## Product vision — what we're actually building

**The Go refactor isn't the goal. The goal is a small-model-driven autonomous pipeline that gets better with each run, with frontier models in audit/oversight and humans triaged in only for the genuinely abstract cases.**

The Rust Lakehouse already has most of the pieces:

- **Pathway memory** (`internal/pathway` in Go, 88 Rust traces preserved) — what we tried, what worked
- **Matrix indexer** (SPEC §3.4) — multi-corpus retrieve+merge that gives the small model the right knowledge slice for *this* task
- **Observer** — watches runs, refines configs, escalates
- **Distillation v1.0.0** (`e7636f2`) — turns successful runs into denser playbooks
- **Auditor cross-lineage fabric** — Kimi/Haiku/Opus oversight on small-model outputs

What the Go refactor is FOR: a second-language pass surfaces architectural weaknesses that Rust hid. The pipeline has to pull together cleanly *as a pipeline* — not as 15 crates that happen to interact.
### The five-loop substrate

1. **Knowledge pathway loop** — pathway memory + matrix indexer give the small model context for the task. Pathway answers "what worked last time?"; matrix answers "what's relevant now?"
2. **Execution loop** — small model runs on focused context. Frontier API calls are reserved for audit/escalation, not the inner loop. Cost + rate limits stay sane.
3. **Observer loop** — watches each run, refines the configs (matrix corpus picks, downgrade gate, prompt mold) that got the model to a good pathway. Outputs new config, not new prompt.
4. **Rating + distillation loop** — successful outcomes get scored and folded back into the playbook substrate. The playbook gets denser; the next run starts smarter.
5. **Drift loop** — quantify when the distilled playbook stops matching reality (codebase changed, contracts shifted, profiles updated). Drift is a *measured* signal, not "hope nothing broke."
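
The five loops above can be sketched as stages over a shared run record. This is an illustrative Go sketch only — the `RunRecord` fields and the stage signature are assumptions, not the real contract:

```go
package main

import "fmt"

// RunRecord is a hypothetical shape for one pipeline run; stage names
// mirror the five-loop substrate described above.
type RunRecord struct {
	Task    string
	Context []string // knowledge slice assembled by pathway + matrix
	Output  string
	Score   float64
}

// A stage is one loop's pass over the run record.
type stage struct {
	name  string
	apply func(*RunRecord)
}

func runSubstrate(r *RunRecord, stages []stage) {
	for _, s := range stages {
		s.apply(r)
	}
}

func main() {
	r := &RunRecord{Task: "classify permit"}
	runSubstrate(r, []stage{
		{"knowledge-pathway", func(r *RunRecord) { r.Context = append(r.Context, "playbook:permits-v3") }},
		{"execution", func(r *RunRecord) { r.Output = "small-model answer" }},
		{"observer", func(r *RunRecord) { /* refine configs, not prompts */ }},
		{"rating+distillation", func(r *RunRecord) { r.Score = 0.9 }},
		{"drift", func(r *RunRecord) { /* compare playbook vs reality, emit signal */ }},
	})
	fmt.Println(r.Score, len(r.Context))
}
```

The point of the sketch: each loop reads and writes the same record, so every iteration is auditable end to end.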
### The gate

**The playbook + matrix indexer must produce the results we're looking for.** That's the single load-bearing acceptance criterion. Throughput, scaling, code elegance — all secondary. If a deep-field reality test on the 500K corpus surfaces wrong answers, the loop isn't working and we fix that before adding anything else.
### Observer as system resource (clarified 2026-04-29)

The observer is not a service among services — it's a *system resource*. Its job is to be objective about the process: watch everything, record measurements, surface what worked vs what didn't, feed the KB so the playbook substrate can decide the right pathway to the correct outcome.

The bare-bones observerd shipped in `bc9ab93` (event ingest + stats) is the substrate for this. The architectural pattern that grows it into the full "objective measurement engine" is the **multi-pass workflow runner** documented in SPEC §3.8 — inspired by Archon (`/home/profit/external/Archon`) and proven in the Rust `observer-kb` branch's Python prototypes (`deep_analysis.py`, `extract_knowledge.py`, `process_knowledge.py`).

The pipeline mode-chain (extract → validator → hallucination → consensus → redteam → pipeline → render) IS how the observer makes actionable decisions: each mode pass is a deterministic measurement; what survives the gauntlet is what feeds the KB.
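
A minimal Go sketch of the mode-chain idea, under the assumption that each pass is a pure predicate over extracted claims (the `Claim` shape and pass bodies are invented for illustration):

```go
package main

import "fmt"

// Claim is a hypothetical unit flowing through the gauntlet.
type Claim struct {
	Text  string
	Alive bool
}

// A pass is a deterministic measurement: it may kill a claim,
// never rewrite it.
type pass struct {
	name string
	keep func(Claim) bool
}

func runChain(claims []Claim, chain []pass) []Claim {
	for _, p := range chain {
		for i := range claims {
			if claims[i].Alive && !p.keep(claims[i]) {
				claims[i].Alive = false
				fmt.Printf("pass=%s killed=%q\n", p.name, claims[i].Text)
			}
		}
	}
	survivors := claims[:0] // filter in place
	for _, c := range claims {
		if c.Alive {
			survivors = append(survivors, c)
		}
	}
	return survivors
}

func main() {
	chain := []pass{
		{"extract", func(c Claim) bool { return len(c.Text) > 0 }},
		{"validator", func(c Claim) bool { return c.Text != "unparseable" }},
		{"hallucination", func(c Claim) bool { return c.Text != "invented fact" }},
		// consensus → redteam → pipeline → render would follow the same shape.
	}
	out := runChain([]Claim{{"grounded fact", true}, {"invented fact", true}}, chain)
	fmt.Println(len(out))
}
```

What survives every pass is what feeds the KB; every kill is a recorded measurement, which is exactly the observer's job.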
### Triage / human-in-loop

Most cases are tractable enough that small-model + pathway + matrix can complete them. Some aren't — they need a human. The system's job is to **identify which is which** and only escalate the second class. Frontier models partially solve this internally with their thinking loops; we're externalizing it so:

- Small models are swappable (vendor independence)
- Drift is measurable (quantitative signal, not vibes)
- Each loop iteration is auditable (the pathway memory IS the audit trail)

This is what the auditor cross-lineage fabric proves out in Rust — Opus auto-promote on diffs >100k chars is the same pattern: triage by signal, not by guesswork.
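
Triage-by-signal reduces to a small decision function. A hedged Go sketch — the >100k-char auto-promote threshold comes from the Rust-era pattern above, while the drift and pathway-hit thresholds are invented for illustration:

```go
package main

import "fmt"

// Tier is where a case gets routed.
type Tier int

const (
	SmallModel    Tier = iota // inner loop: small model + pathway + matrix
	FrontierAudit             // frontier model audits the output
	Human                     // genuinely abstract: a human decides
)

// triage routes by measured signals, not guesswork.
func triage(diffChars, pathwayHits int, driftScore float64) Tier {
	switch {
	case driftScore > 0.5: // playbook no longer matches reality (illustrative cutoff)
		return Human
	case diffChars > 100_000: // large-diff auto-promote signal
		return FrontierAudit
	case pathwayHits == 0: // no precedent in pathway memory (illustrative)
		return FrontierAudit
	default:
		return SmallModel
	}
}

func main() {
	fmt.Println(triage(120_000, 3, 0.1)) // large diff → frontier audit
}
```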
## Direction pivot — why this PRD exists

The Rust-first Lakehouse (15 crates, ~24 unmerged commits past PR #11, distillation v1.0.0 substrate frozen at `e7636f2`) is being reimplemented in Go on the principle that **anything Go can carry, Go carries**. This is an explicit re-platforming, not a refactor.

### What the rewrite preserves (verbatim from the Rust PRD)

- The **problem statement** — legacy data systems silo information; AI needs both fast analytical queries AND semantic retrieval over unstructured text in one substrate.
- The **two use cases** — staffing analytics (reference implementation) and local AI knowledge substrate (per-profile vector indexes for running models).
- The **shared requirements** — schema-less ingest, SQL at scale, AI-embedding search, hot-swappable indexes, trials-as-data, local-first / no-cloud, repo-rebuildable.
- The **architectural invariants** — object storage as source of truth, catalog as sole metadata authority, hot-swap atomicity, profiles as first-class, playbooks-feed-the-index, errors findable in one HTTP call.
### What the rewrite changes

| Layer | Was (Rust) | Becomes (Go) | Confidence |
|---|---|---|---|
| HTTP gateway | Axum + Tokio | `net/http` + `chi` (or `gin`) | High — Go's bread and butter |
| gRPC | tonic | `google.golang.org/grpc` | High — Go is the reference impl |
| Object store | Apache Arrow `object_store` | `aws-sdk-go-v2/service/s3` + thin wrapper | High |
| Parquet I/O | parquet-rs (arrow-rs) | `apache/arrow-go/v18/parquet` | Medium — arrow-go lags arrow-rs but v18 covers our needs |
| Query engine | DataFusion | **Hard problem** (see §Hard problems) | Low — no like-for-like Go equivalent |
| Vector index (HNSW) | `hora` / hand-rolled | `coder/hnsw` or `Bithack/go-hnsw` (in-process) | High — HNSW is a self-contained algorithm |
| Vector backend (Lance) | `lance` (Rust) | **Hard problem** — likely dropped, Parquet-only | Medium |
| Frontend | Dioxus + WASM | Go `html/template` + HTMX + Alpine, or React/Vite split repo | Medium |
| Concurrency | Tokio async | Goroutines + `context.Context` | High |
| Config | TOML | TOML (`pelletier/go-toml/v2`) | High |
| Secrets | `SecretsProvider` trait | Go interface, same shape | High |
| AI bridge | HTTP client to Python sidecar | Same — Python sidecar stays | High |
| Embedded ML | Python sidecar (kept) | Python sidecar (kept) | n/a |
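
The `SecretsProvider` row ("Go interface, same shape") might land as something like this sketch — the method name and backing store are assumptions, not copied from the Rust trait:

```go
package main

import (
	"errors"
	"fmt"
)

// SecretsProvider is the hypothesized Go shape of the Rust trait:
// one lookup method, error on miss.
type SecretsProvider interface {
	Get(name string) (string, error)
}

// mapProvider is a toy implementation; a real one might read the
// environment or an encrypted file.
type mapProvider map[string]string

func (m mapProvider) Get(name string) (string, error) {
	v, ok := m[name]
	if !ok {
		return "", errors.New("secret not found: " + name)
	}
	return v, nil
}

func main() {
	var p SecretsProvider = mapProvider{"S3_KEY": "xxx"}
	v, err := p.Get("S3_KEY")
	fmt.Println(v, err == nil)
}
```

Swapping implementations behind the interface preserves the trait's role: callers never know where secrets live.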
### What stays Python (and why)

- **Embedding generation, image gen, deepface analysis** — Python's ML ecosystem is genuinely stronger than Go's. The sidecar stays as an HTTP service; the Go gateway calls it the same way the Rust gateway did. No port required.
- **Distillation pipeline scoring** — currently TS scripts; these can move to Go but are not first-tier priority. Keep TS until the Go gateway is live.

---
## Solution — Go service mesh over S3-compatible object storage

A modular Go service mesh, same architectural shape as the Rust system, with the Python AI sidecar retained as the embedding/generation boundary. Single repo (`golangLAKEHOUSE`), single Go module, multiple binaries built from one workspace.
### Locked stack (Go)

| Layer | Choice | Rationale |
|---|---|---|
| HTTP | `chi` | Idiomatic, middleware-friendly, used by major Go services |
| gRPC | `google.golang.org/grpc` | Reference implementation |
| Protobuf | `protoc-gen-go` + `buf` | Standard tooling |
| Object store | `aws-sdk-go-v2` | Mature, covers S3 + MinIO + RustFS |
| Parquet | `apache/arrow-go/v18` | Columnar I/O + Arrow interop (v18.5.2 — March 2026) |
| SQL engine | **Open** — see §Hard problems §1 | Biggest open decision |
| Vector index | `coder/hnsw` | Pure-Go HNSW, in-process, no external service |
| TOML config | `pelletier/go-toml/v2` | Standard |
| Logging | `log/slog` | Standard library since Go 1.21 |
| Tracing | `go.opentelemetry.io/otel` | Standard |
| Testing | `testing` + `testify` + `golden` files | Standard |
| Frontend | **Open** — `html/template` + HTMX vs separate Vite/React | Hard problem §3 |

No new dependencies without an ADR.

---
## Architecture

Same service decomposition as Rust, same data flow. Names preserved so the spec, ADRs, and runbooks port semantically:

```
┌─ ingestd ─→ storaged ─→ catalogd ─→ queryd ─┐
│                            │                │
│                            └→ vectord       │
│                                             │
└──────── aibridge ──HTTP──→ Python sidecar ──┘
                 │
  gateway ─ HTTP/gRPC ───────┘
     │
     └→ ui (HTMX or Vite)
```

| Service | Responsibility | Go binary |
|---|---|---|
| **gateway** | HTTP/gRPC ingress, routing, auth | `cmd/gateway` |
| **catalogd** | Metadata control plane, dataset registry | `cmd/catalogd` |
| **storaged** | Object I/O, multi-bucket, error journal | `cmd/storaged` |
| **queryd** | SQL execution over Parquet (engine TBD) | `cmd/queryd` |
| **ingestd** | CSV/JSON/PDF ingest → Parquet | `cmd/ingestd` |
| **vectord** | Embeddings + HNSW index + trial system | `cmd/vectord` |
| **journald** | Append-only mutation event log | `cmd/journald` |
| **aibridge** | HTTP client to Python sidecar | library, linked into gateway |
| **validator** | Production worker/permit validators | library, linked into gateway |
| **mcp** | MCP server (replaces Bun `mcp-server`) | `cmd/mcp` |
| **observer** | Autonomous iteration loop | `cmd/observer` |
| **auditor** | PR audit pipeline (replaces TS auditor) | `cmd/auditor` |
### Invariants (preserved verbatim from Rust PRD)

1. Object storage = source of truth
2. catalogd = sole metadata authority
3. No raw data in catalog — only pointers
4. vectord stores embeddings AS Parquet (portable)
5. ingestd is idempotent
6. Hot cache is performance, not source of truth
7. All services modular and independently replaceable
8. Indexes are hot-swappable (atomic pointer swap, rollback always possible)
9. Every reader gets its own profile
10. Trials are data, not logs
11. Operational failures findable in one HTTP call
12. Playbooks feed the index, not just the log
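
Invariant 8 (hot-swap via atomic pointer swap, rollback always possible) maps directly onto `sync/atomic` in Go. A minimal sketch, with an illustrative `Index` type:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Index stands in for a loaded HNSW index; only the handle matters here.
type Index struct{ Version int }

// IndexHandle gives readers a complete index at all times: the swap is
// a single pointer store, and the previous index is retained for rollback.
type IndexHandle struct {
	cur  atomic.Pointer[Index]
	prev atomic.Pointer[Index] // kept so rollback is always possible
}

func (h *IndexHandle) Swap(next *Index) {
	old := h.cur.Swap(next)
	h.prev.Store(old)
}

func (h *IndexHandle) Rollback() { h.cur.Store(h.prev.Load()) }

func main() {
	var h IndexHandle
	h.Swap(&Index{Version: 1})
	h.Swap(&Index{Version: 2})
	fmt.Println(h.cur.Load().Version) // 2
	h.Rollback()
	fmt.Println(h.cur.Load().Version) // 1
}
```

Readers call `h.cur.Load()` and never observe a half-built index; the swap is the only mutation.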

---
## Hard problems (the ones that don't trivially port)

These four define whether the rewrite is feasible. The SPEC answers each with a concrete library/approach choice and a fallback.
### 1. Query engine — replacing DataFusion

**Constraint:** DataFusion is the most consequential Rust dependency in the Lakehouse. It powers `queryd`, hybrid SQL+vector search, and hot-cache merge-on-read. Go has no like-for-like equivalent.

**Options:**

- **A. Embed DuckDB via cgo (`marcboeker/go-duckdb`)** — DuckDB reads Parquet natively, supports a SQL dialect close to DataFusion's, and has cgo Go bindings. Loses pure-Go portability (cgo required) but preserves the query model.
- **B. Run DuckDB as an external service** — one DuckDB process; Go talks to it via HTTP. Pure-Go gateway, separate-process query layer. Adds an operational surface (one more service to manage).
- **C. Hand-roll a query planner over Arrow** — parse SQL with `xwb1989/sqlparser`, plan over arrow-go RecordBatches, execute. High effort, high risk. Best avoided.
- **D. Postgres + foreign data wrappers** — point Postgres at Parquet via `parquet_fdw`. Mature, but introduces a database we said we'd avoid (ADR-001).

**Recommendation:** **Option A (DuckDB via cgo).** Preserves the SQL + columnar + Parquet model, single-binary deploy with cgo, mature. Cgo adds build complexity but is acceptable.
### 2. Lance backend — vectord-lance

**Constraint:** Lance is a Rust-native columnar format with built-in vector indexing. There is no Go port and no FFI binding. ADR-019 designates Lance as a per-profile *secondary* backend; Parquet+HNSW is *primary*.

**Options:**

- **A. Drop Lance entirely.** Parquet+HNSW handles the primary path; Lance was secondary. ADR-019 stays valid for the Rust era; the Go Lakehouse is Parquet-only.
- **B. Keep Lance via FFI/cgo.** Build Lance as a Rust dylib, call from Go via cgo. Reintroduces Rust into the build chain; defeats the point.
- **C. Wait for a Lance Go port.** Doesn't exist; not on the Lance roadmap.

**Recommendation:** **Option A (drop Lance).** The hybrid backend was optional per-profile; Parquet+HNSW carries the primary path. If a specific workload later proves Lance-only, it can be exposed as a Python-sidecar service.
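
For intuition (and as a recall baseline in vectord tests), the exact search that `coder/hnsw` approximates is a brute-force cosine scan — a stdlib-only sketch, not the production path:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}

type hit struct {
	ID    int
	Score float64
}

// topK is exact nearest-neighbor search: O(n·d) per query, which is
// what HNSW trades for approximate O(log n) traversal.
func topK(query []float32, corpus [][]float32, k int) []hit {
	hits := make([]hit, 0, len(corpus))
	for id, v := range corpus {
		hits = append(hits, hit{id, cosine(query, v)})
	}
	sort.Slice(hits, func(i, j int) bool { return hits[i].Score > hits[j].Score })
	if k < len(hits) {
		hits = hits[:k]
	}
	return hits
}

func main() {
	corpus := [][]float32{{1, 0}, {0, 1}, {0.9, 0.1}}
	fmt.Println(topK([]float32{1, 0}, corpus, 2)[0].ID)
}
```

Running the same queries through this scan and through the HNSW index gives a measured recall number, which fits the trials-as-data requirement.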
### 3. UI — replacing Dioxus

**Constraint:** Dioxus is a Rust+WASM frontend framework. No Go equivalent at the same level of polish. The current `crates/ui` covers Ask, Explore, SQL, System tabs.

**Options:**

- **A. `html/template` + HTMX + Alpine.js** — server-rendered Go, partial-page swaps via HTMX. Single repo, minimal JS, fits Go's "boring is good" ethos.
- **B. Separate Vite/React frontend** — `golangLAKEHOUSE-ui` repo, Go gateway serves static files. Modern UX patterns, more dev tooling needed.
- **C. Keep Dioxus + WASM as a build step** — defeats the rewrite.

**Recommendation:** **Option A** for v1; revisit if UI requirements demand React-tier interactivity. The current Lakehouse UIs (`/lakehouse/` demo + staffer console) are mostly server-rendered HTML with vanilla JS — `html/template` + HTMX is a strong fit.
### 4. Arrow ecosystem maturity

**Constraint:** `arrow-go/v18` lags `arrow-rs` in compute kernels, expression APIs, and some compression codecs. Specific known gaps: limited `cast` kernel coverage, no equivalent of `arrow-rs`'s `compute::sort_to_indices` for all dtypes, no Acero-style streaming execution.

**Mitigation:** the Go Lakehouse relies on Arrow primarily for **Parquet I/O + RecordBatch transport**, not for in-process compute (that's DuckDB's job). The narrower scope makes arrow-go's gaps less load-bearing.

**Acceptance gate:** any Arrow API the Go Lakehouse uses must be covered by `arrow-go/v18`. Anything missing → file an upstream issue, implement locally if blocking, contribute back.

---
## Migration strategy

### What ports verbatim

- Problem statement, use cases, requirements
- Architectural invariants (1–12)
- ADRs 001–021 (preserved as design intent; some change implementation)
- Federation building blocks (multi-bucket, error-journal, append-log)
### What rebuilds from data

- HNSW indexes (rebuild from Parquet embeddings — ADR-008 is verbatim preserved)
- Pathway memory state (88 traces in `data/_pathway_memory/state.json` on the Rust side — port the JSON format and reload; the byte-matching contract becomes Go-Go instead of Rust-TS)
- Catalog manifests (Parquet, portable)
- Distillation v1.0.0 substrate (port the SFT/contamination-firewall logic; the fixture-as-gate pattern stays)
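
Reloading pathway memory in Go is a straight `encoding/json` decode of the preserved state file. A hedged sketch — the field names below are hypothetical, not the real `state.json` schema; the point is that the JSON format itself is the portable contract:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trace and PathwayState are illustrative shapes; the real contract is
// whatever state.json holds on the Rust side.
type Trace struct {
	Task    string  `json:"task"`
	Outcome string  `json:"outcome"`
	Score   float64 `json:"score"`
}

type PathwayState struct {
	Traces []Trace `json:"traces"`
}

// load decodes a preserved state blob into Go structs.
func load(raw []byte) (PathwayState, error) {
	var s PathwayState
	err := json.Unmarshal(raw, &s)
	return s, err
}

func main() {
	raw := []byte(`{"traces":[{"task":"ingest csv","outcome":"ok","score":0.8}]}`)
	s, err := load(raw)
	fmt.Println(err == nil, len(s.Traces), s.Traces[0].Outcome)
}
```

Because both ends of the byte-matching contract are now Go, a golden-file test on the marshalled output pins the format.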
### What ships first (port-order outline — see SPEC.md for detail)

1. **Phase G0** — Skeleton: `cmd/gateway`, `cmd/catalogd`, `cmd/storaged`, `cmd/ingestd`. Single-bucket, no auth, CSV→Parquet, query via DuckDB.
2. **Phase G1** — Vector path: `cmd/vectord` with HNSW + RAG endpoint.
3. **Phase G2** — Multi-profile + federation (ADRs 016–017).
4. **Phase G3** — Pathway memory + distillation port.
5. **Phase G4** — MCP server, observer, auditor (TS surfaces → Go).
6. **Phase G5** — UI (HTMX) and demo parity with `devop.live/lakehouse/`.

Detailed acceptance gates in `SPEC.md`.
### What does NOT migrate

- The Rust crates themselves (archived in the original `lakehouse` repo)
- The TS scrum/auditor pipelines (rewritten in Go in Phase G4)
- The Bun mcp-server (rewritten in Go in Phase G4)
- The Python sidecar (kept as-is, behind aibridge)

---
## Non-goals

- **No port of `vectord-lance`.** The Lance backend is dropped; Parquet+HNSW is the only vector backend.
- **No retention of Rust in the build chain.** No cgo-to-Rust bridges, no FFI to keep specific crates alive. Cgo to **C/C++** (DuckDB) is acceptable.
- **No new feature work during the port.** Feature parity with the Rust Lakehouse at the cutoff commit is the bar; new capabilities defer to post-port phases.
- **No live-migration of running services.** The Rust Lakehouse stops serving when Go reaches feature parity; data moves once via Parquet re-pointer.

---
## Ratified decisions (2026-04-28, J)

The six gating questions are answered. Phase G0 is unblocked. Full context for each lives in `docs/DECISIONS.md` ADR-001.

| # | Decision |
|---|---|
| 1 | **DuckDB via cgo** — `marcboeker/go-duckdb` is the query engine. Cgo accepted as the cost of a mature SQL+Parquet path. |
| 2 | **HTMX** — server-rendered `html/template` + HTMX + Alpine.js. Single-binary deploy. React is post-G5 if needed. |
| 3 | **Gitea** — repo lives at `git.agentview.dev/profit/golangLAKEHOUSE` (same server as the Rust lakehouse). |
| 4 | **Distillation rebuild in Go** — port the SFT export + contamination firewall logic, but bit-identical reproducibility is a Rust-era property. New Go fixtures, new acceptance gates. |
| 5 | **Pathway memory starts clean** — Go pathway memory begins with zero traces. The existing 88 Rust traces are preserved at `data/_pathway_memory/state.json` in the lakehouse repo as a historical record (see `docs/RUST_PATHWAY_MEMORY_NOTE.md`). |
| 6 | **Auditor longitudinal signal restarts** — `audit_baselines.jsonl` is a Rust-era artifact. Go auditor begins a fresh drift signal. |