docs: seed PRD + SPEC for the Go-direction rewrite

Two documents only — no Go code yet. PRD restates the problem and
preserves the Rust PRD's invariants verbatim, then maps the locked
stack to Go libraries and surfaces four hard problems (DuckDB-via-cgo
for the query engine, Lance dropped, Dioxus → HTMX, arrow-go maturity).
SPEC walks each Rust crate + TS surface and tags the port with library
choice / effort estimate / risk + a 5-phase migration plan from
skeleton (Phase G0) to demo parity (Phase G5).

Six open questions remain that gate Phase G0:
- DuckDB cgo OK?
- HTMX vs React for the UI?
- Repo location?
- Distillation v1.0.0 port verbatim or rebuild?
- Pathway memory data — port 88 traces or start clean?
- Auditor lineage — port audit_baselines.jsonl or restart?

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claw 2026-04-28 06:29:55 -05:00
commit f07668064e
7 changed files with 949 additions and 0 deletions

.gitignore

@@ -0,0 +1,41 @@
# Go
*.exe
*.exe~
*.dll
*.so
*.dylib
*.test
*.out
go.work
go.work.sum
vendor/
# Build artifacts
/bin/
/dist/
# Editor / OS
.DS_Store
.idea/
.vscode/
*.swp
*~
# Local data — these directories follow the Rust lakehouse pattern;
# regenerated by services on demand. Do not commit runtime artifacts.
/data/_auditor/
/data/_kb/
/data/_pathway_memory/
/data/_errors/
/data/_imagecache/
/data/datasets/
/data/vectors/
/data/headshots/
/data/lance/
/exports/
/logs/
/reports/
# Secrets — never commit. Resolved via SecretsProvider per ADR-001 §1.x.
*.env
secrets.toml

README.md

@@ -0,0 +1,49 @@
# golangLAKEHOUSE
Go reimplementation of the Lakehouse — a versioned knowledge substrate
for staffing analytics + local AI workloads.
## Status
**Pre-Phase G0.** Documents seeded; Go module declared; implementation
has not started. See `docs/PRD.md` for direction and `docs/SPEC.md`
for the component-by-component port plan.
### Phase G0 prerequisites (must be done before any code lands)
1. **Install Go 1.23+ on the dev box.** Not currently present at
`/usr/local/go` or elsewhere on the build machine. Standard install:
```
curl -L https://go.dev/dl/go1.23.0.linux-amd64.tar.gz | sudo tar -C /usr/local -xz
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
```
2. **Ensure cgo toolchain is present** (gcc + libc-dev) — required by
the DuckDB binding per ADR-001 §1.1. `apt install build-essential`
on Debian-based systems.
3. **Initialize the dependency tree** with `go mod tidy` once
`cmd/gateway/main.go` declares its first imports.
## Layout
```
docs/ Direction + spec + ADRs
cmd/ (forthcoming) main packages — one per service
internal/ (forthcoming) shared packages
web/ (forthcoming) HTMX templates + static
scripts/ (forthcoming) cold-start, smoke, distill
tests/ (forthcoming) golden files, integration tests
```
## Reading order
1. `docs/PRD.md` — what we're building and why
2. `docs/SPEC.md` — how, per-component
3. `docs/DECISIONS.md` — ADRs, starting with ADR-001 (foundational)
4. `docs/RUST_PATHWAY_MEMORY_NOTE.md` — historical reference for the
Rust era's pathway memory state (not migrated)
## Predecessor
The Rust Lakehouse this rewrite supersedes lives at
`git.agentview.dev/profit/lakehouse`. It remains the live system until
this Go implementation reaches feature parity (per `docs/SPEC.md` §7).

docs/DECISIONS.md

@@ -0,0 +1,126 @@
# Architecture Decision Records — Lakehouse-Go
ADRs from the Go era. Numbered fresh from 001 to start clean lineage.
Where a Rust ADR (numbered 001–021 in the Rust repo's `DECISIONS.md`)
remains in force, this file references it explicitly. Where a Rust
ADR is superseded, the new ADR records why.
---
## ADR-001: Foundational decisions for the Go rewrite
**Date:** 2026-04-28
**Decided by:** J
**Status:** Ratified — Phase G0 unblocked
The six questions that gated Phase G0 (per PRD.md / SPEC.md §8) are
all answered.
### Decision 1.1 — DuckDB via cgo for the query engine
**Decision:** `queryd` uses `marcboeker/go-duckdb` (cgo bindings to
DuckDB). Pure-Go alternative was rejected.
**Rationale:** DuckDB reads Parquet natively, supports the SQL surface
DataFusion exposed in the Rust era (CTEs, window functions, hybrid
joins), and runs in-process with cgo. The alternatives were:
- Hand-rolling a query planner over arrow-go RecordBatches —
multi-engineer-month research project; high risk of correctness
bugs.
- Running DuckDB as an external process — adds an operational surface
and a network hop to every query.
Cgo build complexity is the accepted cost. Single-binary deploy
preserved (the cgo dependency embeds at link time).
**Does this supersede Rust ADR-001** (object storage as source of truth)? No.
That ADR remains in force; what changes is the *engine* over the
storage, not the storage model.
### Decision 1.2 — HTMX for the UI
**Decision:** Frontend is `html/template` + HTMX + Alpine.js,
server-rendered by `cmd/gateway`. React/Vite in a separate repo is the
fallback if UX requirements demand SPA-tier interactivity post-G5.
**Rationale:** The existing Lakehouse UIs (`/lakehouse/` demo + staffer
console) are mostly server-rendered HTML with vanilla JS that already
fits the HTMX style. Single-binary deploy is preserved (gateway serves
templates + static assets). No build chain beyond `go build`.
The React fallback is named explicitly so it's not relitigated unless
an actual UX requirement triggers it.
### Decision 1.3 — Gitea hosts the new repo
**Decision:** Repo lives at `git.agentview.dev/profit/golangLAKEHOUSE`
(same Gitea server that hosts the Rust lakehouse).
**Rationale:** Single source of truth for repo hosting; existing
auditor tooling (`lakehouse-auditor` systemd service) already speaks
Gitea API; existing credentials work; no new ops surface.
### Decision 1.4 — Distillation rebuilt in Go, not ported verbatim
**Decision:** The distillation v1.0.0 substrate (`tag
distillation-v1.0.0` at `e7636f2` in the Rust repo) is **not**
bit-identical-ported. The Go reimplementation:
- Ports the LOGIC: SFT export pipeline, contamination firewall (the
`quality_score` enum + `SFT_NEVER` constant), category mapping
rules, audit-baselines append-only pattern.
- Does NOT port the FIXTURES: `tests/fixtures/distillation/acceptance/`
is rebuilt from scratch in Go with new ground-truth golden files.
- Does NOT port the bit-identical reproducibility PROPERTY: that was
measured against the Rust implementation. The Go implementation
establishes its own reproducibility baseline.
**Rationale:** Bit-identical reproducibility was a measured property
of a specific implementation, not a portable invariant. Re-establishing
it in Go means new fixtures, new gates, new audit-baselines. This is
honest about what's transferring (logic) versus what's a Rust-era
artifact (the specific bit-identical hashes).
**Risk:** the contamination firewall is the most consequential
distillation safety net. The port must be reviewed line-by-line, and
the new Go fixtures must include adversarial cases that prove the
firewall works in the new implementation. See SPEC §7 acceptance gates.
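The firewall described in Decision 1.4 can be pictured as a filter that refuses to export anything not explicitly marked eligible. The following is a minimal sketch under stated assumptions: the names `QualityScore`, `SFTReview`, `SFTEligible`, `Record`, and `FilterForSFT` are hypothetical, and the real Rust enum variants are not listed in this ADR. Only `SFT_NEVER` (rendered here as `SFTNever`) comes from the text above.

```go
package main

import "fmt"

// QualityScore is a hypothetical stand-in for the Rust-era quality_score enum.
type QualityScore int

const (
	// SFTNever is the zero value on purpose: a record with no score set
	// is treated as never-exportable (fail closed).
	SFTNever    QualityScore = iota
	SFTReview                // illustrative variant: needs review before export
	SFTEligible              // illustrative variant: allowed into an SFT export
)

// Record is an illustrative training candidate.
type Record struct {
	ID    string
	Score QualityScore
}

// FilterForSFT returns only explicitly eligible records and lists the IDs
// it blocked, so an audit log can record what the firewall stopped.
func FilterForSFT(in []Record) (out []Record, blocked []string) {
	for _, r := range in {
		if r.Score != SFTEligible {
			blocked = append(blocked, r.ID)
			continue
		}
		out = append(out, r)
	}
	return out, blocked
}

func main() {
	recs := []Record{{"a", SFTEligible}, {"b", SFTNever}, {"c", SFTReview}, {ID: "d"}}
	out, blocked := FilterForSFT(recs)
	fmt.Println("exported:", len(out), "blocked:", blocked)
}
```

Making the never-export value the zero value is one possible design choice; it means an adversarial fixture with a missing score is blocked rather than exported.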
### Decision 1.5 — Pathway memory starts clean; old traces preserved as reference
**Decision:** Go pathway memory begins with zero traces. The existing
88 Rust traces at
`/home/profit/lakehouse/data/_pathway_memory/state.json` are NOT loaded
into the Go implementation. They are preserved as a historical record
in the Rust repo and documented at `docs/RUST_PATHWAY_MEMORY_NOTE.md`.
**Rationale:** The Rust pathway memory's value compounded over months
of scrum cycles. Loading those traces into a Go implementation that
hasn't proven its byte-matching contract risks corrupting the new
substrate's signal with semantically-mismatched data. Starting clean
keeps the Go pathway memory's lineage clean and lets the byte-match
correctness be proven on a known input (per SPEC §3.4 G3.4.B).
The historical note records the 88 traces' value (11/11 successful
replays at the time of freeze) so the Go implementation has a
reference baseline to outperform.
### Decision 1.6 — Auditor longitudinal signal restarts
**Decision:** The Rust auditor's `audit_baselines.jsonl`
(longitudinal drift signal accumulated across PRs #6–#13) is **not**
ported to Go. The Go auditor begins a fresh `audit_baselines.jsonl`
lineage on its first PR.
**Rationale:** The drift signal is anchored to specific Rust commits,
verdict shapes, and Kimi/Haiku/Opus rotation traces. Carrying it into
the Go era would be like grafting Rust-PR audit history onto the first
Go PR's prologue — confusing more than informative. Restarting gives
the Go auditor a clean baseline to measure drift against.
The existing Rust `audit_baselines.jsonl` stays in the Rust repo as a
historical record.
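The append-only pattern behind `audit_baselines.jsonl` is small enough to sketch. This is an illustration only: the `Baseline` fields, `AppendBaseline`, `CountLines`, and `run` are assumed names, and the real record shape is not specified in this ADR.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Baseline is a hypothetical audit record; the real field set is an assumption.
type Baseline struct {
	PR      int    `json:"pr"`
	Verdict string `json:"verdict"`
}

// AppendBaseline writes one JSON object per line. O_APPEND means every write
// lands at the end of the file, so earlier lines are never rewritten.
func AppendBaseline(path string, b Baseline) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	line, err := json.Marshal(b)
	if err != nil {
		return err
	}
	_, err = f.Write(append(line, '\n'))
	return err
}

// CountLines reports how many records the file currently holds.
func CountLines(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	n := 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		n++
	}
	return n, sc.Err()
}

// run starts a fresh lineage in a temp directory and appends two records.
func run() (int, error) {
	dir, err := os.MkdirTemp("", "audit")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "audit_baselines.jsonl")
	for pr := 1; pr <= 2; pr++ {
		if err := AppendBaseline(path, Baseline{PR: pr, Verdict: "pass"}); err != nil {
			return 0, err
		}
	}
	return CountLines(path)
}

func main() {
	n, err := run()
	fmt.Println("records:", n, "err:", err)
}
```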
---
(Future ADRs from ADR-002 onward will be added as the Go
implementation accrues design decisions — e.g. HNSW parameter
choices, pathway-memory hash function, auditor model rotation, etc.)

docs/PRD.md

@@ -0,0 +1,297 @@
# PRD: Lakehouse-Go — Versioned Knowledge Substrate, Reimplemented in Go
**Status:** DRAFT — seed document for the Go-direction rewrite. Supersedes
`/home/profit/lakehouse/docs/PRD.md` (Rust) once ratified.
**Created:** 2026-04-28
**Owner:** J
**Sibling:** `SPEC.md` — component-by-component port plan with effort
estimates, library choices, and acceptance gates.
---
## Direction pivot — why this PRD exists
The Rust-first Lakehouse (15 crates, ~24 unmerged commits past PR #11,
distillation v1.0.0 substrate frozen at `e7636f2`) is being reimplemented
in Go on the principle that **anything Go can carry, Go carries**. This
is an explicit re-platforming, not a refactor.
### What the rewrite preserves (verbatim from the Rust PRD)
- The **problem statement** — legacy data systems silo information; AI
needs both fast analytical queries AND semantic retrieval over
unstructured text in one substrate.
- The **two use cases** — staffing analytics (reference implementation)
and local AI knowledge substrate (per-profile vector indexes for
running models).
- The **shared requirements** — schema-less ingest, SQL at scale,
AI-embedding search, hot-swappable indexes, trials-as-data,
local-first / no-cloud, repo-rebuildable.
- The **architectural invariants** — object storage as source of truth,
catalog as sole metadata authority, hot-swap atomicity, profiles as
first-class, playbooks-feed-the-index, errors findable in one HTTP
call.
### What the rewrite changes
| Layer | Was (Rust) | Becomes (Go) | Confidence |
|---|---|---|---|
| HTTP gateway | Axum + Tokio | `net/http` + `chi` (or `gin`) | High — Go's bread and butter |
| gRPC | tonic | `google.golang.org/grpc` | High — Go is the reference impl |
| Object store | Apache Arrow `object_store` | `aws-sdk-go-v2/service/s3` + thin wrapper | High |
| Parquet I/O | parquet-rs (arrow-rs) | `apache/arrow-go/v15/parquet` | Medium — arrow-go lags arrow-rs but covers our needs |
| Query engine | DataFusion | **Hard problem** (see §Hard problems) | Low — no like-for-like Go equivalent |
| Vector index (HNSW) | `hora` / hand-rolled | `coder/hnsw` or `Bithack/go-hnsw` (in-process) | High — HNSW is a self-contained algorithm |
| Vector backend (Lance) | `lance` (Rust) | **Hard problem** — likely dropped, Parquet-only | Medium |
| Frontend | Dioxus + WASM | Go `html/template` + HTMX + Alpine, or React/Vite split repo | Medium |
| Concurrency | Tokio async | Goroutines + `context.Context` | High |
| Config | TOML | TOML (`pelletier/go-toml/v2`) | High |
| Secrets | `SecretsProvider` trait | Go interface, same shape | High |
| AI bridge | HTTP client to Python sidecar | Same — Python sidecar stays | High |
| Embedded ML | Python sidecar (kept) | Python sidecar (kept) | n/a |
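For the Secrets row, "Go interface, same shape" could look like the sketch below. The Rust trait's method set is not given in this document, so the single `Get` method, `ErrSecretNotFound`, `EnvProvider`, and `MapProvider` are all assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// SecretsProvider is a hypothetical Go rendering of the Rust trait.
// The method set is an assumption, not the documented contract.
type SecretsProvider interface {
	Get(key string) (string, error)
}

// ErrSecretNotFound lets callers distinguish "missing" from other failures.
var ErrSecretNotFound = errors.New("secret not found")

// EnvProvider resolves secrets from environment variables with a prefix.
type EnvProvider struct{ Prefix string }

func (p EnvProvider) Get(key string) (string, error) {
	v, ok := os.LookupEnv(p.Prefix + key)
	if !ok {
		return "", fmt.Errorf("%s: %w", key, ErrSecretNotFound)
	}
	return v, nil
}

// MapProvider is an in-memory provider, convenient for tests.
type MapProvider map[string]string

func (m MapProvider) Get(key string) (string, error) {
	v, ok := m[key]
	if !ok {
		return "", fmt.Errorf("%s: %w", key, ErrSecretNotFound)
	}
	return v, nil
}

func main() {
	// Any implementation satisfies the same interface, so services do not
	// care where a secret comes from.
	providers := []SecretsProvider{
		MapProvider{"S3_SECRET_KEY": "example"},
		EnvProvider{Prefix: "LAKEHOUSE_"},
	}
	for _, sp := range providers {
		_, err := sp.Get("S3_SECRET_KEY")
		fmt.Println(errors.Is(err, ErrSecretNotFound))
	}
}
```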
### What stays Python (and why)
- **Embedding generation, image gen, deepface analysis** — Python's ML
ecosystem is genuinely stronger than Go's. The sidecar stays as an
HTTP service; the Go gateway calls it the same way the Rust gateway
did. No port required.
- **Distillation pipeline scoring** — current TS scripts; can move to Go
but not first-tier priority. Keep TS until Go gateway is live.
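Because the sidecar stays an HTTP service, the Go side reduces to a small client. The sketch below assumes a hypothetical `/embed` endpoint and JSON shapes (`texts` in, `vectors` out); the sidecar's real contract is not described here, so treat every name as a placeholder. A stub server stands in for the Python process.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Assumed wire shapes; not the sidecar's documented contract.
type embedRequest struct {
	Texts []string `json:"texts"`
}
type embedResponse struct {
	Vectors [][]float32 `json:"vectors"`
}

// Embed posts texts to the (hypothetical) /embed endpoint and decodes vectors.
// The context carries the deadline, so a slow sidecar cannot hang the gateway.
func Embed(ctx context.Context, c *http.Client, baseURL string, texts []string) ([][]float32, error) {
	body, err := json.Marshal(embedRequest{Texts: texts})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/embed", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("sidecar returned %s", resp.Status)
	}
	var out embedResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Vectors, nil
}

// demo runs Embed against a stub that returns one 2-d vector per input text.
func demo() ([][]float32, error) {
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var in embedRequest
		if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		var out embedResponse
		for range in.Texts {
			out.Vectors = append(out.Vectors, []float32{0.1, 0.2})
		}
		json.NewEncoder(w).Encode(out)
	}))
	defer stub.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return Embed(ctx, stub.Client(), stub.URL, []string{"forklift operator", "OSHA-10"})
}

func main() {
	vecs, err := demo()
	fmt.Println(len(vecs), err)
}
```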
---
## Solution — Go service mesh over S3-compatible object storage
A modular Go service mesh, same architectural shape as the Rust system,
with the Python AI sidecar retained as the embedding/generation
boundary. Single repo (`golangLAKEHOUSE`), single Go module, multiple
binaries built from one workspace.
### Locked stack (Go)
| Layer | Choice | Rationale |
|---|---|---|
| HTTP | `chi` | Idiomatic, middleware-friendly, used by major Go services |
| gRPC | `google.golang.org/grpc` | Reference implementation |
| Protobuf | `protoc-gen-go` + `buf` | Standard tooling |
| Object store | `aws-sdk-go-v2` | Mature, covers S3 + MinIO + RustFS |
| Parquet | `apache/arrow-go/v15` | Columnar I/O + Arrow interop |
| SQL engine | **Open** — see §Hard problems §1 | Biggest open decision |
| Vector index | `coder/hnsw` | Pure-Go HNSW, in-process, no external service |
| TOML config | `pelletier/go-toml/v2` | Standard |
| Logging | `log/slog` | Standard library since Go 1.21 |
| Tracing | `go.opentelemetry.io/otel` | Standard |
| Testing | `testing` + `testify` + `golden` files | Standard |
| Frontend | **Open** — `html/template` + HTMX vs separate Vite/React | Hard problem §3 |
No new dependencies without an ADR.
---
## Architecture
Same service decomposition as Rust, same data flow. Names preserved so
the spec, ADRs, and runbooks port semantically:
```
┌─ ingestd ─→ storaged ─→ catalogd ─→ queryd ─┐
│ │ │
│ └→ vectord │
│ │
└──────── aibridge ──HTTP──→ Python sidecar ───┘
gateway ─ HTTP/gRPC ────┘
└→ ui (HTMX or Vite)
```
| Service | Responsibility | Go binary |
|---|---|---|
| **gateway** | HTTP/gRPC ingress, routing, auth | `cmd/gateway` |
| **catalogd** | Metadata control plane, dataset registry | `cmd/catalogd` |
| **storaged** | Object I/O, multi-bucket, error journal | `cmd/storaged` |
| **queryd** | SQL execution over Parquet (engine TBD) | `cmd/queryd` |
| **ingestd** | CSV/JSON/PDF ingest → Parquet | `cmd/ingestd` |
| **vectord** | Embeddings + HNSW index + trial system | `cmd/vectord` |
| **journald** | Append-only mutation event log | `cmd/journald` |
| **aibridge** | HTTP client to Python sidecar | library, linked into gateway |
| **validator** | Production worker/permit validators | library, linked into gateway |
| **mcp** | MCP server (replaces Bun `mcp-server`) | `cmd/mcp` |
| **observer** | Autonomous iteration loop | `cmd/observer` |
| **auditor** | PR audit pipeline (replaces TS auditor) | `cmd/auditor` |
### Invariants (preserved verbatim from Rust PRD)
1. Object storage = source of truth
2. catalogd = sole metadata authority
3. No raw data in catalog — only pointers
4. vectord stores embeddings AS Parquet (portable)
5. ingestd is idempotent
6. Hot cache is performance, not source of truth
7. All services modular and independently replaceable
8. Indexes are hot-swappable (atomic pointer swap, rollback always possible)
9. Every reader gets its own profile
10. Trials are data, not logs
11. Operational failures findable in one HTTP call
12. Playbooks feed the index, not just the log
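Invariant 8 (atomic pointer swap with rollback) maps naturally onto `sync/atomic`. This is a sketch, not the vectord design: `Index`, `Swappable`, `Promote`, and `Rollback` are hypothetical names, and it assumes a single writer performs promotions while many readers call `Current`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Index is a placeholder for a fully built HNSW index.
type Index struct{ Version string }

// Swappable holds the live index plus the previous one for rollback.
// Assumption: only one goroutine calls Promote/Rollback at a time.
type Swappable struct {
	live atomic.Pointer[Index]
	prev atomic.Pointer[Index]
}

// Promote publishes next in one atomic step; readers see either the old
// index or the new one, never a half-built state.
func (s *Swappable) Promote(next *Index) {
	old := s.live.Swap(next)
	s.prev.Store(old)
}

// Rollback makes the previously live index live again. It reports false
// when there is nothing to roll back to.
func (s *Swappable) Rollback() bool {
	old := s.prev.Load()
	if old == nil {
		return false
	}
	s.prev.Store(s.live.Swap(old))
	return true
}

// Current returns the index readers should query right now.
func (s *Swappable) Current() *Index { return s.live.Load() }

func main() {
	var s Swappable
	s.Promote(&Index{Version: "v1"})
	s.Promote(&Index{Version: "v2"})
	fmt.Println("live:", s.Current().Version)
	s.Rollback()
	fmt.Println("after rollback:", s.Current().Version)
}
```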
---
## Hard problems (the ones that don't trivially port)
These four define whether the rewrite is feasible. Spec answers each
with a concrete library/approach choice and a fallback.
### 1. Query engine — replacing DataFusion
**Constraint:** DataFusion is the most consequential Rust dependency in
the Lakehouse. It powers `queryd`, hybrid SQL+vector search, and
hot-cache merge-on-read. Go has no like-for-like equivalent.
**Options:**
- **A. Embed DuckDB via cgo (`marcboeker/go-duckdb`)** — DuckDB reads
Parquet natively, supports SQL similar to DataFusion, has cgo Go
bindings. Loses pure-Go portability (cgo required) but preserves the
query model.
- **B. Run DuckDB as an external service** — one DuckDB process, Go
talks to it via HTTP. Pure-Go gateway, separate-process query layer.
Adds an operational surface (one more service to manage).
- **C. Hand-roll a query planner over Arrow** — parse SQL with
`xwb1989/sqlparser`, plan over arrow-go RecordBatches, execute. High
effort, high risk. Best avoided.
- **D. Postgres + foreign data wrappers** — point Postgres at Parquet
via `parquet_fdw`. Mature but introduces a database we said we'd
avoid (ADR-001).
**Recommendation:** **Option A (DuckDB via cgo)**. Preserves the SQL +
columnar + Parquet model, single-binary deploy with cgo, mature. Cgo
adds build complexity but is acceptable.
### 2. Lance backend — vectord-lance
**Constraint:** Lance is a Rust-native columnar format with built-in
vector indexing. There is no Go port and no FFI binding. ADR-019
designates Lance as a per-profile *secondary* backend; Parquet+HNSW is
*primary*.
**Options:**
- **A. Drop Lance entirely.** Parquet+HNSW handles primary path; Lance
was secondary. ADR-019 stays valid for the Rust era; Go Lakehouse is
Parquet-only.
- **B. Keep Lance via FFI/cgo.** Build Lance as a Rust dylib, call from
Go via cgo. Reintroduces Rust into the build chain; defeats the
point.
- **C. Wait for Lance Go port.** Doesn't exist; not on Lance roadmap.
**Recommendation:** **Option A (drop Lance).** The hybrid backend was
optional per-profile; Parquet+HNSW carries the primary path. If a
specific workload later proves Lance-only, it can be exposed as a
Python-sidecar service.
### 3. UI — replacing Dioxus
**Constraint:** Dioxus is a Rust+WASM frontend framework. No Go
equivalent at the same level of polish. The current `crates/ui` covers
Ask, Explore, SQL, System tabs.
**Options:**
- **A. `html/template` + HTMX + Alpine.js** — server-rendered Go,
partial-page swaps via HTMX. Single repo, minimal JS, fits Go's
"boring is good" ethos.
- **B. Separate Vite/React frontend** — `golangLAKEHOUSE-ui` repo,
Go gateway serves static files. Modern UX patterns, more dev tooling
needed.
- **C. Keep Dioxus + WASM as a build step** — defeats the rewrite.
**Recommendation:** **Option A** for v1; revisit if UI requirements
demand React-tier interactivity. The current Lakehouse UIs (`/lakehouse/`
demo + staffer console) are mostly server-rendered HTML with vanilla
JS — `html/template` + HTMX is a strong fit.
### 4. Arrow ecosystem maturity
**Constraint:** `arrow-go/v15` lags `arrow-rs` in compute kernels,
expression APIs, and some compression codecs. Specific gaps known:
limited `cast` kernel coverage, no equivalent of `arrow-rs`'s
`compute::sort_to_indices` for all dtypes, no Acero-style streaming
execution.
**Mitigation:** the Go Lakehouse relies on Arrow primarily for
**Parquet I/O + RecordBatch transport**, not for in-process compute
(that's DuckDB's job). The narrower scope makes arrow-go's gaps less
load-bearing.
**Acceptance gate:** any Arrow API the Go Lakehouse uses must be
covered by `arrow-go/v15`. Anything missing → file an upstream issue,
implement locally if blocking, contribute back.
---
## Migration strategy
### What ports verbatim
- Problem statement, use cases, requirements
- Architectural invariants (1–12)
- ADRs 001–021 (preserved as design intent; some change implementation)
- Federation building blocks (multi-bucket, error-journal, append-log)
### What rebuilds from data
- HNSW indexes (rebuild from Parquet embeddings — ADR-008 is verbatim
preserved)
- Pathway memory state (88 traces in `data/_pathway_memory/state.json`
on Rust side — port the JSON format and reload; the byte-matching
contract becomes Go-Go instead of Rust-TS)
- Catalog manifests (Parquet, portable)
- Distillation v1.0.0 substrate (port the SFT/contamination-firewall
logic; the fixture-as-gate pattern stays)
### What ships first (port-order outline — see SPEC.md for detail)
1. **Phase G0** — Skeleton: `cmd/gateway`, `cmd/catalogd`, `cmd/storaged`,
`cmd/ingestd`. Single-bucket, no auth, CSV→Parquet, query via DuckDB.
2. **Phase G1** — Vector path: `cmd/vectord` with HNSW + RAG endpoint.
3. **Phase G2** — Multi-profile + federation (ADRs 016–017).
4. **Phase G3** — Pathway memory + distillation port.
5. **Phase G4** — MCP server, observer, auditor (TS surfaces → Go).
6. **Phase G5** — UI (HTMX) and demo parity with `devop.live/lakehouse/`.
Detailed acceptance gates in `SPEC.md`.
### What does NOT migrate
- The Rust crates themselves (archived in the original `lakehouse` repo)
- The TS scrum/auditor pipelines (rewritten in Go in Phase G4)
- The Bun mcp-server (rewritten in Go in Phase G4)
- The Python sidecar (kept as-is, behind aibridge)
---
## Non-goals
- **No port of `vectord-lance`.** Lance backend is dropped; Parquet+HNSW
is the only vector backend.
- **No retention of Rust in the build chain.** No cgo-to-Rust bridges,
no FFI to keep specific crates alive. Cgo to **C/C++** (DuckDB) is
acceptable.
- **No new feature work during the port.** Feature parity with the Rust
Lakehouse at the cutoff commit is the bar; new capabilities defer to
post-port phases.
- **No live-migration of running services.** The Rust Lakehouse stops
serving when Go reaches feature parity; data moves once via Parquet
re-pointer.
---
## Ratified decisions (2026-04-28, J)
The six gating questions are answered. Phase G0 is unblocked. Full
context for each lives in `docs/DECISIONS.md` ADR-001.
| # | Decision |
|---|---|
| 1 | **DuckDB via cgo** — `marcboeker/go-duckdb` is the query engine. Cgo accepted as the cost of a mature SQL+Parquet path. |
| 2 | **HTMX** — server-rendered `html/template` + HTMX + Alpine.js. Single-binary deploy. React is post-G5 if needed. |
| 3 | **Gitea** — repo lives at `git.agentview.dev/profit/golangLAKEHOUSE` (same server as the Rust lakehouse). |
| 4 | **Distillation rebuild in Go** — port the SFT export + contamination firewall logic, but bit-identical reproducibility is a Rust-era property. New Go fixtures, new acceptance gates. |
| 5 | **Pathway memory starts clean** — Go pathway memory begins with zero traces. The existing 88 Rust traces are preserved at `data/_pathway_memory/state.json` in the lakehouse repo as a historical record (see `docs/RUST_PATHWAY_MEMORY_NOTE.md`). |
| 6 | **Auditor longitudinal signal restarts** — `audit_baselines.jsonl` is a Rust-era artifact. Go auditor begins a fresh drift signal. |

docs/RUST_PATHWAY_MEMORY_NOTE.md

@@ -0,0 +1,79 @@
# Rust Pathway Memory — Historical Reference
**Status:** Reference-only. The Go Lakehouse does NOT load these
traces (per ADR-001 §1.5). This note exists so that a future Go engineer
knows what the Rust era accumulated, where it lives, and why it was
left in place.
---
## What was there
By the time of the rewrite cutoff (commit `dcf4c9a`,
2026-04-28), the Rust pathway memory held:
- **88 traces** at `/home/profit/lakehouse/data/_pathway_memory/state.json`
- **11/11 successful replays** as of the most recent verification (the
"probation gate crossed" signal from the lakehouse `STATE_OF_PLAY.md`)
- Active scrum-cycle compounding: each scrum loop iteration appended
new traces and re-ran replays against existing pathway fingerprints
to preempt review prompts with "this file pattern has produced bug X
before"
## Where it lives (Rust repo)
```
lakehouse/
├── crates/vectord/src/pathway_memory.rs ← implementation
├── data/_pathway_memory/state.json ← 88 traces, JSON
└── docs/DECISIONS.md ADR-021 ← matrix-correctness layer design
```
The TS-side mirror lived in
`tests/real-world/scrum_master_pipeline.ts` (functions
`computePathwayId`, `buildPathwayVec`). Both implementations
byte-matched on bucket vectors.
## Why this matters for the Go port
The pathway memory's *algorithm* is portable — 32-bucket SHA256-keyed
token hash, JSON state file, replay logic. The pathway memory's
*signal value* is not — those 88 traces represent months of scrum
loops on Rust code, with bug fingerprints anchored to Rust file
prefixes (`crates/queryd/`, `crates/vectord/`, etc.) that don't exist
in the Go repo.
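To make "32-bucket SHA256-keyed token hash" concrete, here is one plausible reading in Go. The details are assumptions: whitespace tokenisation, taking the first four digest bytes as a big-endian integer, and reducing modulo 32 are illustrative choices, and `BucketVector` is a hypothetical name. The actual Rust scheme in `pathway_memory.rs` is what the byte-match gate must be checked against.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

const numBuckets = 32

// BucketVector counts tokens per bucket. Assumed scheme: split on
// whitespace, SHA-256 each token, read the first 4 digest bytes as a
// big-endian uint32, and take it modulo 32 to pick the bucket.
func BucketVector(text string) [numBuckets]int {
	var vec [numBuckets]int
	for _, tok := range strings.Fields(text) {
		sum := sha256.Sum256([]byte(tok))
		vec[binary.BigEndian.Uint32(sum[:4])%numBuckets]++
	}
	return vec
}

func main() {
	// The same input always yields the same vector, which is the property
	// a golden-file comparison against another implementation relies on.
	fmt.Println(BucketVector("crates queryd src lib rs"))
}
```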
Per ADR-001 §1.5, the Go pathway memory:
1. Reimplements the algorithm (SPEC §3.4 G3.4.B is the byte-match
correctness gate).
2. Starts with zero traces. The 88 Rust traces are NOT migrated.
3. Builds its own signal over Go-era scrum cycles.
## What to do if the Go pathway memory underperforms
If after Phase G3 the Go pathway memory performs noticeably worse than
the Rust era's "11/11 successful replays" baseline:
1. **First** — verify the Go algorithm byte-matches the Rust one on
the SPEC G3.4.B golden input. If yes, the algorithm is correct and
the gap is data-volume, not implementation.
2. **Second** — the Rust traces exist; if needed, re-prefix file paths
from `crates/queryd/` style to `cmd/queryd/` style, run a
compatibility check, and seed the Go pathway memory selectively. But
only after the algorithm is proven byte-match correct.
3. **Third** — accept that the first ~3 months of Go scrum cycles need
to rebuild the signal naturally. This is the cost of the clean
restart per ADR-001 §1.5.
## Historical baseline (frozen reference)
| Metric | Rust value at cutoff | Source |
|---|---|---|
| Total traces | 88 | `data/_pathway_memory/state.json` |
| Successful replays | 11/11 | scrum loop log circa 2026-04-26 |
| Distinct file prefixes | TBD — query the state file | n/a |
| Distinct semantic_flag variants used | 9 (per ADR-021) | `pathway_memory.rs` |
| Distinct bug_fingerprint hashes | TBD | `pathway_memory.rs` |
When the Go pathway memory reaches comparable numbers, it has caught
up to the Rust era and can be considered fully replacement-grade.

docs/SPEC.md

@@ -0,0 +1,354 @@
# SPEC: Lakehouse-Go Component Port Plan
**Status:** DRAFT — companion to `PRD.md`. Component-by-component port
plan with library choices, effort estimates, and acceptance gates.
**Created:** 2026-04-28
**Owner:** J
This spec answers: for each piece of the Rust Lakehouse, what Go
library carries it, what the effort looks like, and what gate proves
the port is real.
Effort scale (one engineer-week = ~40h focused work):
- **S** — 1–3 days
- **M** — 1 engineer-week
- **L** — 2–3 engineer-weeks
- **XL** — 1+ months
- **HARD** — open research, see PRD §Hard problems
---
## §1. Component port table — Rust crates
| Crate | Rust deps that mattered | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| `gateway` | axum, tokio, tonic, tower | `cmd/gateway` | `chi` + stdlib `net/http` + `google.golang.org/grpc` | **L** | low — Go's strongest domain |
| `catalogd` | parquet-rs, arrow, sqlite | `cmd/catalogd` | `arrow-go/v15`, `mattn/go-sqlite3` | **L** | low |
| `storaged` | object_store, aws-sdk | `cmd/storaged` | `aws-sdk-go-v2`, `minio-go` for MinIO-specific paths | **M** | low |
| `queryd` | datafusion, arrow | `cmd/queryd` | `marcboeker/go-duckdb` (cgo) | **HARD** | high — see §3 |
| `ingestd` | csv, json, lopdf, postgres | `cmd/ingestd` | stdlib `encoding/csv`, `encoding/json`, `pdfcpu/pdfcpu`, `jackc/pgx/v5` | **L** | low |
| `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `arrow-go/v15` | **L** | medium — re-validate HNSW recall |
| `vectord-lance` | lance | **DROPPED** | n/a | n/a | n/a — Parquet+HNSW only |
| `journald` | parquet, arrow | `cmd/journald` | `arrow-go/v15` | **M** | low |
| `aibridge` | reqwest | library | `net/http` + connection pool | **S** | low |
| `validator` | parquet, custom | library | `arrow-go/v15` parquet reader | **M** | low — port the 24 unit tests as gates |
| `truth` | tomli, custom DSL | library | `pelletier/go-toml/v2` | **M** | low |
| `proto` | tonic-build | `proto/` + `protoc-gen-go` | `buf` + `protoc-gen-go-grpc` | **S** | low |
| `shared` | serde, anyhow | library | stdlib `encoding/json`, `errors` | **S** | low |
| `ui` | dioxus, wasm | **REPLACED** | `html/template` + HTMX | **L** | medium — see §3 |
| `lance-bench` | criterion | n/a — dropped with Lance | n/a | n/a | n/a |
**Total Rust crate port effort:** ~12–18 engineer-weeks (3–4 months for
one engineer; 6–8 weeks for two).
---
## §2. Component port table — TypeScript surfaces
| TS surface | Current location | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| `mcp-server/index.ts` | Bun, :3700 | `cmd/mcp` | `mark3labs/mcp-go` (Go MCP SDK) | **L** | medium — MCP semantics |
| `mcp-server/observer.ts` | Bun, :3800 | `cmd/observer` | stdlib `net/http`, `slog` | **M** | low |
| `mcp-server/tracing.ts` | Bun, Langfuse client | library | `go.opentelemetry.io/otel` + Langfuse Go client (or hand-roll) | **M** | low — Langfuse Go OSS support varies |
| `auditor/*.ts` | TS, runs as systemd | `cmd/auditor` | stdlib + `gitea API client` | **L** | medium — auditor cross-lineage logic is intricate |
| `tests/real-world/scrum_master_pipeline.ts` | TS, ad-hoc | `cmd/scrum` | stdlib | **L** | medium — chunking + embed + ladder logic |
| `tests/real-world/scrum_applier.ts` | TS, ad-hoc | `cmd/scrum-apply` | stdlib + git CLI shell-out | **M** | medium |
| `bot/propose.ts` | TS | `cmd/bot` | stdlib | **S** | low |
| Search demo HTML/JS | static | static (no port) | n/a | n/a | n/a — copied as-is |
**Total TS port effort:** ~6–10 engineer-weeks.
---
## §3. Hard problem details
### §3.1 — Query engine (DuckDB via cgo)
**Library:** `marcboeker/go-duckdb` — Go bindings via cgo.
**API shape** (replaces the DataFusion `SessionContext` pattern):
```go
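// Needs: import "database/sql" and _ "github.com/marcboeker/go-duckdb" (registers the "duckdb" driver).
db, _ := sql.Open("duckdb", "") // empty DSN opens an in-memory database; error handling elided for brevity
defer db.Close()
// Reading s3:// paths also requires DuckDB's httpfs extension and S3 credentials.
db.Exec("CREATE VIEW workers AS SELECT * FROM read_parquet('s3://bucket/workers/*.parquet')")
rows, _ := db.Query("SELECT role, count(*) FROM workers WHERE state='IL' GROUP BY role")
defer rows.Close()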
```
**Acceptance gates:**
- G3.1.A — `SELECT * FROM read_parquet('workers_500k.parquet') LIMIT 1`
returns a row with the expected schema. Establishes Parquet read
works.
- G3.1.B — Hybrid SQL+vector query (the `POST /vectors/hybrid`
surface) returns same workers as the Rust path on the same input,
ranked the same way modulo embedding precision.
- G3.1.C — Hot-cache merge-on-read: register a base table + a delta
Parquet, query, observe both rows merged with the delta winning on
conflict.
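The "delta wins on conflict" rule in G3.1.C can be stated as a small pure-Go reference function, useful as a test oracle for whatever SQL `queryd` ends up running. This is an illustration of the semantics only, not how DuckDB performs the merge; `Row`, its fields, and `MergeOnRead` are hypothetical, and it assumes rows are keyed by a single integer ID.

```go
package main

import (
	"fmt"
	"sort"
)

// Row is an illustrative record keyed by ID.
type Row struct {
	ID   int
	Role string
}

// MergeOnRead overlays delta on base. Rows present in both take the delta
// version; rows present in only one side pass through unchanged.
func MergeOnRead(base, delta []Row) []Row {
	byID := make(map[int]Row, len(base)+len(delta))
	for _, r := range base {
		byID[r.ID] = r
	}
	for _, r := range delta {
		byID[r.ID] = r // delta overwrites base on conflict
	}
	out := make([]Row, 0, len(byID))
	for _, r := range byID {
		out = append(out, r)
	}
	// Sort for a stable result; map iteration order is random.
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	base := []Row{{1, "welder"}, {2, "picker"}}
	delta := []Row{{2, "forklift operator"}, {3, "packer"}}
	fmt.Println(MergeOnRead(base, delta))
}
```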
**Fallback if cgo is rejected:** run DuckDB as an external process
(`duckdb -json -c '...'` shelled or HTTP via a thin Go wrapper). Adds
operational surface; preserves SQL model.
### §3.2 — HNSW index
**Library:** `coder/hnsw` — pure-Go HNSW, in-process. Supports add /
delete / search / persist.
**Open question:** does `coder/hnsw` match the recall@10 we measured
on the Rust `hora` path? Need a calibration test:
- Rebuild `lakehouse_arch_v1` (the 1086-chunk arch corpus) in Go.
- Compare recall@10 on a fixed query set to the Rust baseline.
- Acceptance: ≤2% drop or we switch library / parameters.
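The calibration steps above come down to one metric and one threshold. A minimal sketch, with assumptions called out: `RecallAtK` and `PassesGate` are hypothetical helpers, result IDs are assumed unique within a list, and "≤2% drop" is read as two absolute percentage points (0.02), which the spec does not state explicitly.

```go
package main

import "fmt"

// RecallAtK returns the fraction of the first k reference IDs that also
// appear in the first k candidate IDs. IDs are assumed unique per list.
func RecallAtK(truth, got []int, k int) float64 {
	if k > len(truth) {
		k = len(truth)
	}
	if k <= 0 {
		return 0
	}
	want := make(map[int]bool, k)
	for _, id := range truth[:k] {
		want[id] = true
	}
	if len(got) > k {
		got = got[:k]
	}
	hits := 0
	for _, id := range got {
		if want[id] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

// PassesGate applies the assumed reading of the gate: the Go recall may be
// at most 0.02 (two percentage points) below the Rust baseline.
func PassesGate(rustRecall, goRecall float64) bool {
	return rustRecall-goRecall <= 0.02
}

func main() {
	truth := []int{1, 2, 3, 4}
	got := []int{1, 2, 9, 4}
	fmt.Println(RecallAtK(truth, got, 4))
	fmt.Println(PassesGate(0.95, 0.94), PassesGate(0.95, 0.90))
}
```

In practice the per-query values would be averaged over the fixed query set before applying the gate.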
**Persistence format:** TBD — `coder/hnsw` has its own snapshot format;
ADR equivalent of ADR-008 (Parquet for embeddings + sidecar HNSW file)
needs revisiting in Go to confirm the sidecar format we ship.
**Acceptance gates:**
- G3.2.A — Build HNSW from a Parquet of 100K vectors in <60s
- G3.2.B — Search 100K vectors at k=10 in <50ms p50
- G3.2.C — Recall@10 within 2% of Rust baseline on
`lakehouse_arch_v1`
### §3.3 — UI (HTMX)
**Approach:** server-rendered Go templates using `html/template`,
HTMX for partial-page swaps, Alpine.js for client-side interactivity
where needed. Single binary serves API + UI.
**Acceptance gates:**
- G3.3.A — `Ask` tab: type natural-language question, get answer
from RAG endpoint, render in-page without full reload
- G3.3.B — `Explore` tab: paginated dataset list with hot-swap badge
rendering
- G3.3.C — `SQL` tab: textarea → submit → tabular result rendered
in-page
- G3.3.D — `System` tab: live tail of `/storage/errors` and
`/hnsw/trials` via HTMX polling
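The partial-swap approach behind these gates can be sketched with the standard library alone. HTMX marks its requests with an `HX-Request: true` header, so one handler can serve both the full page and the fragment. The route, template names, and dataset list below are hypothetical.

```go
package main

import (
	"fmt"
	"html/template"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Two named templates: "rows" is the fragment HTMX swaps in, "full" wraps it.
var page = template.Must(template.New("page").Parse(`
{{define "rows"}}{{range .}}<tr><td>{{.}}</td></tr>{{end}}{{end}}
{{define "full"}}<html><body><table id="datasets">{{template "rows" .}}</table></body></html>{{end}}
`))

// datasetsHandler renders only the rows for HTMX requests and the whole
// page for ordinary browser navigation.
func datasetsHandler(w http.ResponseWriter, r *http.Request) {
	names := []string{"workers", "resumes"} // illustrative data
	name := "full"
	if r.Header.Get("HX-Request") == "true" {
		name = "rows"
	}
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	if err := page.ExecuteTemplate(w, name, names); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

// render calls the handler in-process and returns the response body.
func render(htmx bool) string {
	req := httptest.NewRequest(http.MethodGet, "/datasets", nil)
	if htmx {
		req.Header.Set("HX-Request", "true")
	}
	rec := httptest.NewRecorder()
	datasetsHandler(rec, req)
	return rec.Body.String()
}

// hasHTMLShell reports whether a body includes the full-page wrapper.
func hasHTMLShell(body string) bool { return strings.Contains(body, "<html>") }

func main() {
	fmt.Println("full page has shell:", hasHTMLShell(render(false)))
	fmt.Println("fragment has shell:", hasHTMLShell(render(true)))
}
```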
**Fallback if HTMX feels limiting:** split repo `golangLAKEHOUSE-ui`
with Vite + React, served as static files by Go gateway. Costs an
extra repo + build chain.
### §3.4 — Pathway memory port
**Constraint:** the Rust `pathway_memory` and TS implementations were
byte-matching by ADR-021. The byte contract was verified by running
both implementations on the same input tokens and asserting matching
bucket indices.
**Go port plan:**
- Port the 32-bucket SHA256-keyed token hash exactly. Verify on a
golden input that Go produces the same bucket vector as Rust.
- Port the JSON state file format verbatim — the existing 88 traces in
`data/_pathway_memory/state.json` reload as-is into the Go
implementation.
- Port the matrix-correctness layer (ADR-021's `SemanticFlag`,
`BugFingerprint`, `TypeHint`) — these are pure value types,
trivially portable.
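A golden-input check for the token hash might be shaped like the sketch below. The byte layout here is an assumption made for illustration only: it takes the first four bytes of the SHA-256 digest as a big-endian integer modulo 32, and treats the bucket vector as per-bucket counts. The actual layout is whatever ADR-021 and the Rust implementation define, and the golden vector must come from Rust output.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const numBuckets = 32

// bucketIndex maps a token to one of 32 buckets.
// ASSUMPTION: first 4 digest bytes, big-endian, modulo 32. The Rust code may
// slice the digest differently; this only shows the shape of the check.
func bucketIndex(token string) int {
	sum := sha256.Sum256([]byte(token))
	return int(binary.BigEndian.Uint32(sum[:4]) % numBuckets)
}

// bucketVector counts how many tokens fall into each bucket.
func bucketVector(tokens []string) [numBuckets]uint32 {
	var v [numBuckets]uint32
	for _, t := range tokens {
		v[bucketIndex(t)]++
	}
	return v
}

func main() {
	tokens := []string{"forklift", "operator", "OSHA-10", "IL"}
	got := bucketVector(tokens)

	// In the real gate, golden would be loaded from a file produced by Rust.
	golden := bucketVector(tokens)
	fmt.Println("bucket vector:", got)
	fmt.Println("matches golden:", got == golden)
}
```

Because Go arrays are comparable, the byte-match gate reduces to a single `==` once both vectors are in memory.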
**Acceptance gates:**
- G3.4.A — Load existing `state.json`, run `replay` on the same 11
prior successful pathways, all 11 succeed (matching the Rust 11/11
baseline).
- G3.4.B — Bucket vector for a fixed test input byte-matches the
Rust output.
---
## §4. Phase plan
### Phase G0 — Skeleton (Week 1–3)
**Scope:** smallest end-to-end ingest + query path working in Go.
| Component | Deliverable |
|---|---|
| `cmd/gateway` | HTTP on :3100, `/health`, `/v1/chat` proxy stub |
| `cmd/catalogd` | In-memory registry + Parquet manifest persistence |
| `cmd/storaged` | Single-bucket S3 / local FS, no error journal yet |
| `cmd/ingestd` | CSV → Parquet, schema inference, register-on-ingest |
| `cmd/queryd` | DuckDB-backed `POST /sql` endpoint |
**Acceptance:** upload a CSV via `POST /ingest`, query it via
`POST /sql` with a SELECT, get rows back. Single-bucket. No vector,
no profile, no UI.
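The schema-inference step in `ingestd` could start from something like the sketch below. It is a minimal sketch under stated assumptions: only three column types, empty cells treated as nulls, and every name in it is hypothetical. Mapping the inferred types onto Parquet types and writing the file are left out.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// colType is ordered from narrowest to widest so a column can only widen.
type colType int

const (
	typeInt colType = iota
	typeFloat
	typeString
)

func (t colType) String() string {
	return [...]string{"int64", "float64", "string"}[t]
}

type column struct {
	Name string
	Type colType
}

// classify returns the narrowest type that can hold the value.
func classify(v string) colType {
	if _, err := strconv.ParseInt(v, 10, 64); err == nil {
		return typeInt
	}
	if _, err := strconv.ParseFloat(v, 64); err == nil {
		return typeFloat
	}
	return typeString
}

// inferSchema reads a CSV with a header row and widens each column's type as
// values are seen. Empty cells are skipped, so a column that is entirely
// empty stays at the narrowest type (int64).
func inferSchema(r io.Reader) ([]column, error) {
	cr := csv.NewReader(r)
	header, err := cr.Read()
	if err != nil {
		return nil, fmt.Errorf("read header: %w", err)
	}
	cols := make([]column, len(header))
	for i, name := range header {
		cols[i] = column{Name: name, Type: typeInt}
	}
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			return cols, nil
		}
		if err != nil {
			return nil, fmt.Errorf("read row: %w", err)
		}
		for i, v := range rec {
			if v == "" {
				continue
			}
			if t := classify(v); t > cols[i].Type {
				cols[i].Type = t
			}
		}
	}
}

// inferSchemaString is a convenience wrapper for in-memory CSV text.
func inferSchemaString(csvText string) ([]column, error) {
	return inferSchema(strings.NewReader(csvText))
}

func main() {
	cols, err := inferSchemaString("id,rate,name\n1,12.5,Ann\n2,13,Bob\n")
	if err != nil {
		panic(err)
	}
	for _, c := range cols {
		fmt.Printf("%s: %s\n", c.Name, c.Type)
	}
}
```

A full-file scan is the simplest correct choice; sampling the first N rows would be faster but can mistype a column whose later rows widen it.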
### Phase G1 — Vector + RAG (Week 4–6)
| Component | Deliverable |
|---|---|
| `cmd/vectord` | Embed-on-ingest (calls Python sidecar), HNSW build, `POST /search` |
| `cmd/gateway` | Add `POST /rag` (embed → search → retrieve → generate via aibridge) |
| `cmd/aibridge` | HTTP client to existing Python sidecar |
**Acceptance:** ingest 15K resumes (the original Phase 7 fixture),
ask "find me a forklift operator with OSHA-10 in IL", get ranked
results with LLM-generated explanation grounded in the retrieved
chunks.
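The embed → search → retrieve → generate flow behind `POST /rag` can be sketched with three small interfaces. Everything here is hypothetical scaffolding: the interface names, the prompt wording, and the in-memory fake are assumptions for illustration, and in the real system the embedder and generator would be HTTP calls through `aibridge` to the Python sidecar.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

type Chunk struct {
	ID   string
	Text string
}

type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}
type Searcher interface {
	Search(ctx context.Context, vec []float32, k int) ([]Chunk, error)
}
type Generator interface {
	Generate(ctx context.Context, prompt string) (string, error)
}

var errNoContext = errors.New("rag: search returned no chunks")

// Pipeline wires the three stages together; K is the number of chunks to retrieve.
type Pipeline struct {
	Embedder  Embedder
	Searcher  Searcher
	Generator Generator
	K         int
}

// Answer embeds the question, retrieves the top-K chunks, and asks the
// generator for an answer grounded in them. It returns the chunks used so
// the caller can show its sources.
func (p Pipeline) Answer(ctx context.Context, question string) (string, []Chunk, error) {
	vec, err := p.Embedder.Embed(ctx, question)
	if err != nil {
		return "", nil, fmt.Errorf("embed: %w", err)
	}
	chunks, err := p.Searcher.Search(ctx, vec, p.K)
	if err != nil {
		return "", nil, fmt.Errorf("search: %w", err)
	}
	if len(chunks) == 0 {
		return "", nil, errNoContext
	}
	var b strings.Builder
	b.WriteString("Answer using only the context below.\n")
	for _, c := range chunks {
		fmt.Fprintf(&b, "[%s] %s\n", c.ID, c.Text)
	}
	b.WriteString("Question: " + question)
	out, err := p.Generator.Generate(ctx, b.String())
	if err != nil {
		return "", nil, fmt.Errorf("generate: %w", err)
	}
	return out, chunks, nil
}

// fakeBackend implements all three stages in memory for the demo.
type fakeBackend struct{ corpus []Chunk }

func (fakeBackend) Embed(_ context.Context, text string) ([]float32, error) {
	return []float32{float32(len(text))}, nil
}

func (f fakeBackend) Search(_ context.Context, _ []float32, k int) ([]Chunk, error) {
	if k < 0 {
		k = 0
	}
	if k > len(f.corpus) {
		k = len(f.corpus)
	}
	return f.corpus[:k], nil
}

func (fakeBackend) Generate(_ context.Context, prompt string) (string, error) {
	return fmt.Sprintf("stub answer from a %d-line prompt", strings.Count(prompt, "\n")+1), nil
}

// demo runs the pipeline against the fake backend.
func demo(corpus []Chunk, k int) (string, []Chunk, error) {
	fb := fakeBackend{corpus: corpus}
	p := Pipeline{Embedder: fb, Searcher: fb, Generator: fb, K: k}
	return p.Answer(context.Background(), "forklift operator with OSHA-10 in IL")
}

func main() {
	ans, used, err := demo([]Chunk{{ID: "c1", Text: "OSHA-10, forklift, Joliet IL"}}, 3)
	fmt.Println(ans, len(used), err)
}
```

Refusing to generate when retrieval comes back empty is one way to keep the "grounded in the retrieved chunks" requirement enforceable in code rather than in the prompt alone.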
### Phase G2 — Federation + profiles (Week 7–8)
| Component | Deliverable |
|---|---|
| `cmd/storaged` | Multi-bucket registry, rescue bucket, error journal at `primary://_errors/` |
| Profile system | Per-reader profile bound to bucket + vector index |
| Hot-swap | Atomic pointer swap for index generations |
**Acceptance:** two profiles bound to two buckets, queries scoped
correctly, hot-swap a vector index without query interruption,
rollback works.
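The atomic pointer swap for index generations might look like the sketch below, using `sync/atomic`'s generic `atomic.Pointer` (available since Go 1.19). The `Index` and `HotIndex` types are hypothetical; the sketch keeps exactly one previous generation for rollback, which is an assumption rather than something the spec fixes.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Index stands in for a built HNSW index generation.
type Index struct {
	Generation int
}

// HotIndex lets queries read the current index without locking while a
// writer publishes a new generation.
type HotIndex struct {
	mu   sync.Mutex            // serialises writers only
	cur  atomic.Pointer[Index] // read lock-free by queries
	prev *Index                // previous generation, guarded by mu
}

// Load returns the index queries should use right now (nil before the first Swap).
func (h *HotIndex) Load() *Index { return h.cur.Load() }

// Swap publishes next and remembers the generation it replaced.
func (h *HotIndex) Swap(next *Index) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.prev = h.cur.Swap(next)
}

// Rollback restores the previous generation once; it reports false when
// there is nothing to roll back to.
func (h *HotIndex) Rollback() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.prev == nil {
		return false
	}
	h.cur.Store(h.prev)
	h.prev = nil
	return true
}

func main() {
	var h HotIndex
	h.Swap(&Index{Generation: 1})

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				_ = h.Load().Generation // readers never block on the writer
			}
		}()
	}
	h.Swap(&Index{Generation: 2}) // published while readers are running
	wg.Wait()

	fmt.Println("serving generation", h.Load().Generation)
	h.Rollback()
	fmt.Println("after rollback", h.Load().Generation)
}
```

A query that loaded the old pointer before the swap keeps using that generation until it returns; the garbage collector frees it once no reader holds it.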
### Phase G3 — Pathway memory + distillation (Week 9–11)
| Component | Deliverable |
|---|---|
| `cmd/vectord` | Pathway memory module ported, 88 traces reloaded |
| Distillation pipeline | SFT export, contamination firewall, scorer |
| Audit baselines | `audit_baselines.jsonl` longitudinal signal port |
**Acceptance:** replay 11 prior successful pathways, all 11 succeed.
Re-run distillation acceptance on the frozen fixture set, 22/22 pass.
### Phase G4 — TS surfaces → Go (Week 12–14)
| Component | Deliverable |
|---|---|
| `cmd/mcp` | MCP server (replaces Bun) — `/v1/chat`, intelligence endpoints |
| `cmd/observer` | Autonomous iteration loop, op recording |
| `cmd/auditor` | PR audit pipeline (kimi/haiku/opus rotation) |
| `cmd/scrum` | Scrum master pipeline (replaces TS) |
**Acceptance:** open a test PR, auditor cycles within 90s, emits
verdict to `data/_auditor/kimi_verdicts/`, behavior matches Rust+TS
era within tolerance.
### Phase G5 — UI + demo parity (Week 15–16)
| Component | Deliverable |
|---|---|
| `cmd/gateway` | Serves HTMX templates + static demo HTML |
| Demo at `devop.live/lakehouse/` | Parity with current Bun demo |
| Staffer console at `/console` | Parity |
**Acceptance:** `devop.live/lakehouse/` cuts over from Bun to Go
gateway. Section ① / ② / ③ all render. Compact contract cards still
expand with Project Index. Fill-probability bars still paint.
---
## §5. Repo layout
```
golangLAKEHOUSE/
├── docs/
│ ├── PRD.md ← this PRD
│ ├── SPEC.md ← this spec
│ ├── DECISIONS.md ← Go-era ADRs (start fresh, reference Rust ADRs by number)
│ └── ADR-XXX-*.md ← per-ADR detail
├── cmd/
│ ├── gateway/ ← main HTTP/gRPC ingress
│ ├── catalogd/
│ ├── storaged/
│ ├── queryd/
│ ├── ingestd/
│ ├── vectord/
│ ├── journald/
│ ├── mcp/
│ ├── observer/
│ ├── auditor/
│ └── scrum/
├── internal/ ← shared packages, not exported
│ ├── aibridge/
│ ├── validator/
│ ├── truth/
│ ├── shared/
│ ├── proto/ ← generated protobuf
│ └── pathway/
├── pkg/ ← public Go packages (none initially)
├── web/ ← UI (HTMX templates + static)
│ ├── templates/
│ └── static/
├── scripts/ ← cold-start, smoke, distill scripts
├── tests/ ← golden files, integration tests
├── go.mod
├── go.sum
└── README.md
```
**Single Go module.** All commands and internal packages live under
`golangLAKEHOUSE/`. No nested modules unless a package needs an
independent release cadence (none expected).
**Build:** `go build ./cmd/...` produces all binaries.
---
## §6. Migration data plan
### What ports verbatim
- Parquet datasets at `data/datasets/*.parquet` — read by Go directly.
- Catalog manifests — Parquet, ports as data not code.
- Pathway memory state — JSON, ports if §3.4 byte-matching gate passes.
### What rebuilds
- HNSW indexes — rebuild from Parquet embeddings on first Go startup.
- Auditor verdicts on PRs — old PRs won't be re-audited; lineage starts
fresh on the new repo's PRs.
### What's archived
- The Rust `crates/` tree — preserved in the original repo at the
cutover commit, tagged `pre-go-rewrite-2026-04-28` for reference.
- TS surfaces (`mcp-server/`, `auditor/`, etc.) — preserved in the
original repo at the same tag.
- Distillation v1.0.0 substrate (`tag distillation-v1.0.0`,
`e7636f2`) — kept as the historical reference; Go re-implementation
ports the LOGIC but not the bit-identical-reproducibility property
unless an ADR re-establishes it.
### What's discarded
- `crates/vectord-lance/` (Lance backend, see PRD §Hard problems §2)
- `crates/lance-bench/` (criterion benchmarks specific to Lance)
---
## §7. Acceptance: when is the rewrite done?
The Go Lakehouse reaches **feature parity** when:
1. **All 12 Rust PRD invariants hold** (object-storage source of truth,
catalog metadata authority, idempotent ingest, hot-swap atomicity,
profiles, etc.).
2. **The 16 distillation acceptance gates pass** (re-run
`./scripts/distill audit-full` against the Go pipeline).
3. **The 22/22 acceptance fixtures from
`tests/fixtures/distillation/acceptance/` pass** under the Go
implementation.
4. **The 145 unit tests of distillation v1.0.0 are ported and pass.**
5. **`devop.live/lakehouse/` demo cuts over to Go gateway** with no
visible UI regressions.
6. **Auditor emits Kimi/Haiku/Opus verdicts** on a test PR, matching
the cross-lineage rotation behavior.
7. **The 88 pathway traces replay** with 11/11 prior successes
reproduced.
At that point the Rust repo enters maintenance-only mode (security
fixes), and the Go repo becomes the live system.
---
## §8. Ratified — Phase G0 unblocked (2026-04-28, J)
| # | Decision | Spec impact |
|---|---|---|
| 1 | DuckDB via cgo (`marcboeker/go-duckdb`) | §3.1 option A — proceed |
| 2 | HTMX + `html/template` + Alpine.js | §3.3 option A — proceed |
| 3 | `git.agentview.dev/profit/golangLAKEHOUSE` | repo location locked |
| 4 | Distillation rebuilt in Go (no bit-identical port) | §6 — port logic, not fixtures |
| 5 | Pathway memory starts empty; old traces noted | §3.4 G3.4.A is now "build initial state from scratch in Phase G3"; G3.4.B (byte-match) preserved as the porting correctness gate when the algorithm is reimplemented |
| 6 | Auditor longitudinal signal restarts | new `audit_baselines.jsonl` lineage starts on first Go-era PR |
See `docs/DECISIONS.md` ADR-001 for full rationale and
`docs/RUST_PATHWAY_MEMORY_NOTE.md` for where the legacy 88 traces live.
**Phase G0 is now unblocked.** Next step: bootstrap the Go module
skeleton + push to Gitea, then begin §4 Phase G0 implementation.

3
go.mod Normal file

@ -0,0 +1,3 @@
module git.agentview.dev/profit/golangLAKEHOUSE
go 1.23