From f07668064e854485904de064926f6da63ae29446 Mon Sep 17 00:00:00 2001 From: Claw Date: Tue, 28 Apr 2026 06:29:55 -0500 Subject: [PATCH] docs: seed PRD + SPEC for the Go-direction rewrite MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two documents only — no Go code yet. PRD restates the problem and preserves the Rust PRD's invariants verbatim, then maps the locked stack to Go libraries and surfaces four hard problems (DuckDB-via-cgo for the query engine, Lance dropped, Dioxus → HTMX, arrow-go maturity). SPEC walks each Rust crate + TS surface and tags the port with library choice / effort estimate / risk + a 6-phase migration plan from skeleton (Phase G0) to demo parity (Phase G5). Six open questions remain that gate Phase G0: - DuckDB cgo OK? - HTMX vs React for the UI? - Repo location? - Distillation v1.0.0 port verbatim or rebuild? - Pathway memory data — port 88 traces or start clean? - Auditor lineage — port audit_baselines.jsonl or restart? Co-Authored-By: Claude Opus 4.7 (1M context) --- .gitignore | 41 ++++ README.md | 49 +++++ docs/DECISIONS.md | 126 +++++++++++ docs/PRD.md | 297 ++++++++++++++++++++++++++ docs/RUST_PATHWAY_MEMORY_NOTE.md | 79 +++++++ docs/SPEC.md | 354 +++++++++++++++++++++++++++++++ go.mod | 3 + 7 files changed, 949 insertions(+) create mode 100644 .gitignore create mode 100644 README.md create mode 100644 docs/DECISIONS.md create mode 100644 docs/PRD.md create mode 100644 docs/RUST_PATHWAY_MEMORY_NOTE.md create mode 100644 docs/SPEC.md create mode 100644 go.mod diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..9ac7e07 --- /dev/null +++ b/.gitignore @@ -0,0 +1,41 @@ +# Go +*.exe +*.exe~ +*.dll +*.so +*.dylib +*.test +*.out +go.work +go.work.sum +vendor/ + +# Build artifacts +/bin/ +/dist/ + +# Editor / OS +.DS_Store +.idea/ +.vscode/ +*.swp +*~ + +# Local data — these directories follow the Rust lakehouse pattern; +# regenerated by services on demand. 
Do not commit runtime artifacts. +/data/_auditor/ +/data/_kb/ +/data/_pathway_memory/ +/data/_errors/ +/data/_imagecache/ +/data/datasets/ +/data/vectors/ +/data/headshots/ +/data/lance/ +/exports/ +/logs/ +/reports/ + +# Secrets — never commit. Resolved via SecretsProvider per ADR-001 §1.x. +*.env +secrets.toml diff --git a/README.md b/README.md new file mode 100644 index 0000000..1584aa4 --- /dev/null +++ b/README.md @@ -0,0 +1,49 @@ +# golangLAKEHOUSE + +Go reimplementation of the Lakehouse — a versioned knowledge substrate +for staffing analytics + local AI workloads. + +## Status + +**Pre-Phase G0.** Documents seeded; Go module declared; implementation +has not started. See `docs/PRD.md` for direction and `docs/SPEC.md` +for the component-by-component port plan. + +### Phase G0 prerequisites (must be done before any code lands) + +1. **Install Go 1.23+ on the dev box.** Not currently present at + `/usr/local/go` or elsewhere on the build machine. Standard install: + ``` + curl -L https://go.dev/dl/go1.23.linux-amd64.tar.gz | sudo tar -C /usr/local -xz + echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc + ``` +2. **Ensure cgo toolchain is present** (gcc + libc-dev) — required by + the DuckDB binding per ADR-001 §1.1. `apt install build-essential` + on Debian-based systems. +3. **Initialize the dependency tree** with `go mod tidy` once + `cmd/gateway/main.go` declares its first imports. + +## Layout + +``` +docs/ Direction + spec + ADRs +cmd/ (forthcoming) main packages — one per service +internal/ (forthcoming) shared packages +web/ (forthcoming) HTMX templates + static +scripts/ (forthcoming) cold-start, smoke, distill +tests/ (forthcoming) golden files, integration tests +``` + +## Reading order + +1. `docs/PRD.md` — what we're building and why +2. `docs/SPEC.md` — how, per-component +3. `docs/DECISIONS.md` — ADRs, starting with ADR-001 (foundational) +4. 
`docs/RUST_PATHWAY_MEMORY_NOTE.md` — historical reference for the + Rust era's pathway memory state (not migrated) + +## Predecessor + +The Rust Lakehouse this rewrite supersedes lives at +`git.agentview.dev/profit/lakehouse`. It remains the live system until +this Go implementation reaches feature parity (per `docs/SPEC.md` §7). diff --git a/docs/DECISIONS.md b/docs/DECISIONS.md new file mode 100644 index 0000000..27fa542 --- /dev/null +++ b/docs/DECISIONS.md @@ -0,0 +1,126 @@ +# Architecture Decision Records — Lakehouse-Go + +ADRs from the Go era. Numbered fresh from 001 to start clean lineage. +Where a Rust ADR (numbered 001–021 in the Rust repo's `DECISIONS.md`) +remains in force, this file references it explicitly. Where a Rust +ADR is superseded, the new ADR records why. + +--- + +## ADR-001: Foundational decisions for the Go rewrite +**Date:** 2026-04-28 +**Decided by:** J +**Status:** Ratified — Phase G0 unblocked + +The six questions that gated Phase G0 (per PRD.md / SPEC.md §8) are +all answered. + +### Decision 1.1 — DuckDB via cgo for the query engine + +**Decision:** `queryd` uses `marcboeker/go-duckdb` (cgo bindings to +DuckDB). Pure-Go alternative was rejected. + +**Rationale:** DuckDB reads Parquet natively, supports the SQL surface +DataFusion exposed in the Rust era (CTEs, window functions, hybrid +joins), and runs in-process with cgo. The alternatives were: +- Hand-rolling a query planner over arrow-go RecordBatches — + multi-engineer-month research project; high risk of correctness + bugs. +- Running DuckDB as an external process — adds an operational surface + and a network hop to every query. + +Cgo build complexity is the accepted cost. Single-binary deploy +preserved (the cgo dependency embeds at link time). + +**Supersedes Rust ADR-001** (object storage as source of truth) — no. +That ADR remains in force; the change is the *engine* over the +storage, not the storage model. 
+ +### Decision 1.2 — HTMX for the UI + +**Decision:** Frontend is `html/template` + HTMX + Alpine.js, +server-rendered by `cmd/gateway`. React/Vite in a separate repo is the +fallback if UX requirements demand SPA-tier interactivity post-G5. + +**Rationale:** The existing Lakehouse UIs (`/lakehouse/` demo + staffer +console) are mostly server-rendered HTML with vanilla JS that already +fits the HTMX style. Single-binary deploy is preserved (gateway serves +templates + static assets). No build chain beyond `go build`. + +The React fallback is named explicitly so it's not relitigated unless +an actual UX requirement triggers it. + +### Decision 1.3 — Gitea hosts the new repo + +**Decision:** Repo lives at `git.agentview.dev/profit/golangLAKEHOUSE` +(same Gitea server that hosts the Rust lakehouse). + +**Rationale:** Single source of truth for repo hosting; existing +auditor tooling (`lakehouse-auditor` systemd service) already speaks +Gitea API; existing credentials work; no new ops surface. + +### Decision 1.4 — Distillation rebuilt in Go, not ported verbatim + +**Decision:** The distillation v1.0.0 substrate (`tag +distillation-v1.0.0` at `e7636f2` in the Rust repo) is **not** +bit-identical-ported. The Go reimplementation: +- Ports the LOGIC: SFT export pipeline, contamination firewall (the + `quality_score` enum + `SFT_NEVER` constant), category mapping + rules, audit-baselines append-only pattern. +- Does NOT port the FIXTURES: `tests/fixtures/distillation/acceptance/` + is rebuilt from scratch in Go with new ground-truth golden files. +- Does NOT port the bit-identical reproducibility PROPERTY: that was + measured against the Rust implementation. The Go implementation + establishes its own reproducibility baseline. + +**Rationale:** Bit-identical reproducibility was a measured property +of a specific implementation, not a portable invariant. Re-establishing +it in Go means new fixtures, new gates, new audit-baselines. 
This is +honest about what's transferring (logic) versus what's a Rust-era +artifact (the specific bit-identical hashes). + +**Risk:** the contamination firewall is the most consequential +distillation safety net. The port must be reviewed line-by-line, and +the new Go fixtures must include adversarial cases that prove the +firewall works in the new implementation. See SPEC §7 acceptance gates. + +### Decision 1.5 — Pathway memory starts clean; old traces preserved as reference + +**Decision:** Go pathway memory begins with zero traces. The existing +88 Rust traces at +`/home/profit/lakehouse/data/_pathway_memory/state.json` are NOT loaded +into the Go implementation. They are preserved as a historical record +in the Rust repo and documented at `docs/RUST_PATHWAY_MEMORY_NOTE.md`. + +**Rationale:** The Rust pathway memory's value compounded over months +of scrum cycles. Loading those traces into a Go implementation that +hasn't proven its byte-matching contract risks corrupting the new +substrate's signal with semantically-mismatched data. Starting clean +keeps the Go pathway memory's lineage clean and lets the byte-match +correctness be proven on a known input (per SPEC §3.4 G3.4.B). + +The historical note records the 88 traces' value (11/11 successful +replays at the time of freeze) so the Go implementation has a +reference baseline to outperform. + +### Decision 1.6 — Auditor longitudinal signal restarts + +**Decision:** The Rust auditor's `audit_baselines.jsonl` +(longitudinal drift signal accumulated across PRs #6–#13) is **not** +ported to Go. The Go auditor begins a fresh `audit_baselines.jsonl` +lineage on its first PR. + +**Rationale:** The drift signal is anchored to specific Rust commits, +verdict shapes, and Kimi/Haiku/Opus rotation traces. Carrying it into +the Go era would be like grafting Rust-PR audit history onto the first +Go PR's prologue — confusing more than informative. Restarting gives +the Go auditor a clean baseline to measure drift against. 
+ +The existing Rust `audit_baselines.jsonl` stays in the Rust repo as a +historical record. + +--- + +(Future ADRs from ADR-002 onward will be added as the Go +implementation accrues design decisions — e.g. HNSW parameter +choices, pathway-memory hash function, auditor model rotation, etc.) diff --git a/docs/PRD.md b/docs/PRD.md new file mode 100644 index 0000000..571c329 --- /dev/null +++ b/docs/PRD.md @@ -0,0 +1,297 @@ +# PRD: Lakehouse-Go — Versioned Knowledge Substrate, Reimplemented in Go + +**Status:** DRAFT — seed document for the Go-direction rewrite. Supersedes +`/home/profit/lakehouse/docs/PRD.md` (Rust) once ratified. +**Created:** 2026-04-28 +**Owner:** J +**Sibling:** `SPEC.md` — component-by-component port plan with effort +estimates, library choices, and acceptance gates. + +--- + +## Direction pivot — why this PRD exists + +The Rust-first Lakehouse (15 crates, ~24 unmerged commits past PR #11, +distillation v1.0.0 substrate frozen at `e7636f2`) is being reimplemented +in Go on the principle that **anything Go can carry, Go carries**. This +is an explicit re-platforming, not a refactor. + +### What the rewrite preserves (verbatim from the Rust PRD) + +- The **problem statement** — legacy data systems silo information; AI + needs both fast analytical queries AND semantic retrieval over + unstructured text in one substrate. +- The **two use cases** — staffing analytics (reference implementation) + and local AI knowledge substrate (per-profile vector indexes for + running models). +- The **shared requirements** — schema-less ingest, SQL at scale, + AI-embedding search, hot-swappable indexes, trials-as-data, + local-first / no-cloud, repo-rebuildable. +- The **architectural invariants** — object storage as source of truth, + catalog as sole metadata authority, hot-swap atomicity, profiles as + first-class, playbooks-feed-the-index, errors findable in one HTTP + call. 
+ +### What the rewrite changes + +| Layer | Was (Rust) | Becomes (Go) | Confidence | +|---|---|---|---| +| HTTP gateway | Axum + Tokio | `net/http` + `chi` (or `gin`) | High — Go's bread and butter | +| gRPC | tonic | `google.golang.org/grpc` | High — Go is the reference impl | +| Object store | Apache Arrow `object_store` | `aws-sdk-go-v2/service/s3` + thin wrapper | High | +| Parquet I/O | parquet-rs (arrow-rs) | `apache/arrow-go/v15/parquet` | Medium — arrow-go lags arrow-rs but covers our needs | +| Query engine | DataFusion | **Hard problem** (see §Hard problems) | Low — no like-for-like Go equivalent | +| Vector index (HNSW) | `hora` / hand-rolled | `coder/hnsw` or `Bithack/go-hnsw` (in-process) | High — HNSW is a self-contained algorithm | +| Vector backend (Lance) | `lance` (Rust) | **Hard problem** — likely dropped, Parquet-only | Medium | +| Frontend | Dioxus + WASM | Go `html/template` + HTMX + Alpine, or React/Vite split repo | Medium | +| Concurrency | Tokio async | Goroutines + `context.Context` | High | +| Config | TOML | TOML (`pelletier/go-toml/v2`) | High | +| Secrets | `SecretsProvider` trait | Go interface, same shape | High | +| AI bridge | HTTP client to Python sidecar | Same — Python sidecar stays | High | +| Embedded ML | Python sidecar (kept) | Python sidecar (kept) | n/a | + +### What stays Python (and why) + +- **Embedding generation, image gen, deepface analysis** — Python's ML + ecosystem is genuinely stronger than Go's. The sidecar stays as an + HTTP service; the Go gateway calls it the same way the Rust gateway + did. No port required. +- **Distillation pipeline scoring** — current TS scripts; can move to Go + but not first-tier priority. Keep TS until Go gateway is live. + +--- + +## Solution — Go service mesh over S3-compatible object storage + +A modular Go service mesh, same architectural shape as the Rust system, +with the Python AI sidecar retained as the embedding/generation +boundary. 
Single repo (`golangLAKEHOUSE`), single Go module, multiple +binaries built from one workspace. + +### Locked stack (Go) + +| Layer | Choice | Rationale | +|---|---|---| +| HTTP | `chi` | Idiomatic, middleware-friendly, used by major Go services | +| gRPC | `google.golang.org/grpc` | Reference implementation | +| Protobuf | `protoc-gen-go` + `buf` | Standard tooling | +| Object store | `aws-sdk-go-v2` | Mature, covers S3 + MinIO + RustFS | +| Parquet | `apache/arrow-go/v15` | Columnar I/O + Arrow interop | +| SQL engine | **Open** — see §Hard problems §1 | Biggest open decision | +| Vector index | `coder/hnsw` | Pure-Go HNSW, in-process, no external service | +| TOML config | `pelletier/go-toml/v2` | Standard | +| Logging | `log/slog` | Standard library since Go 1.21 | +| Tracing | `go.opentelemetry.io/otel` | Standard | +| Testing | `testing` + `testify` + `golden` files | Standard | +| Frontend | **Open** — `html/template` + HTMX vs separate Vite/React | Hard problem §3 | + +No new dependencies without an ADR. + +--- + +## Architecture + +Same service decomposition as Rust, same data flow. 
Names preserved so +the spec, ADRs, and runbooks port semantically: + +``` +┌─ ingestd ─→ storaged ─→ catalogd ─→ queryd ─┐ +│ │ │ +│ └→ vectord │ +│ │ +└──────── aibridge ──HTTP──→ Python sidecar ───┘ + │ + gateway ─ HTTP/gRPC ────┘ + │ + └→ ui (HTMX or Vite) +``` + +| Service | Responsibility | Go binary | +|---|---|---| +| **gateway** | HTTP/gRPC ingress, routing, auth | `cmd/gateway` | +| **catalogd** | Metadata control plane, dataset registry | `cmd/catalogd` | +| **storaged** | Object I/O, multi-bucket, error journal | `cmd/storaged` | +| **queryd** | SQL execution over Parquet (engine TBD) | `cmd/queryd` | +| **ingestd** | CSV/JSON/PDF ingest → Parquet | `cmd/ingestd` | +| **vectord** | Embeddings + HNSW index + trial system | `cmd/vectord` | +| **journald** | Append-only mutation event log | `cmd/journald` | +| **aibridge** | HTTP client to Python sidecar | library, linked into gateway | +| **validator** | Production worker/permit validators | library, linked into gateway | +| **mcp** | MCP server (replaces Bun `mcp-server`) | `cmd/mcp` | +| **observer** | Autonomous iteration loop | `cmd/observer` | +| **auditor** | PR audit pipeline (replaces TS auditor) | `cmd/auditor` | + +### Invariants (preserved verbatim from Rust PRD) + +1. Object storage = source of truth +2. catalogd = sole metadata authority +3. No raw data in catalog — only pointers +4. vectord stores embeddings AS Parquet (portable) +5. ingestd is idempotent +6. Hot cache is performance, not source of truth +7. All services modular and independently replaceable +8. Indexes are hot-swappable (atomic pointer swap, rollback always possible) +9. Every reader gets its own profile +10. Trials are data, not logs +11. Operational failures findable in one HTTP call +12. Playbooks feed the index, not just the log + +--- + +## Hard problems (the ones that don't trivially port) + +These four define whether the rewrite is feasible. Spec answers each +with a concrete library/approach choice and a fallback. 
+ +### 1. Query engine — replacing DataFusion + +**Constraint:** DataFusion is the most consequential Rust dependency in +the Lakehouse. It powers `queryd`, hybrid SQL+vector search, and +hot-cache merge-on-read. Go has no like-for-like equivalent. + +**Options:** +- **A. Embed DuckDB via cgo (`marcboeker/go-duckdb`)** — DuckDB reads + Parquet natively, supports SQL similar to DataFusion, has cgo Go + bindings. Loses pure-Go portability (cgo required) but preserves the + query model. +- **B. Run DuckDB as an external service** — one DuckDB process, Go + talks to it via HTTP. Pure-Go gateway, separate-process query layer. + Adds an operational surface (one more service to manage). +- **C. Hand-roll a query planner over Arrow** — parse SQL with + `xwb1989/sqlparser`, plan over arrow-go RecordBatches, execute. High + effort, high risk. Best avoided. +- **D. Postgres + foreign data wrappers** — point Postgres at Parquet + via `parquet_fdw`. Mature but introduces a database we said we'd + avoid (ADR-001). + +**Recommendation:** **Option A (DuckDB via cgo)**. Preserves the SQL + +columnar + Parquet model, single-binary deploy with cgo, mature. Cgo +adds build complexity but is acceptable. + +### 2. Lance backend — vectord-lance + +**Constraint:** Lance is a Rust-native columnar format with built-in +vector indexing. There is no Go port and no FFI binding. ADR-019 +designates Lance as a per-profile *secondary* backend; Parquet+HNSW is +*primary*. + +**Options:** +- **A. Drop Lance entirely.** Parquet+HNSW handles primary path; Lance + was secondary. ADR-019 stays valid for the Rust era; Go Lakehouse is + Parquet-only. +- **B. Keep Lance via FFI/cgo.** Build Lance as a Rust dylib, call from + Go via cgo. Reintroduces Rust into the build chain; defeats the + point. +- **C. Wait for Lance Go port.** Doesn't exist; not on Lance roadmap. + +**Recommendation:** **Option A (drop Lance).** The hybrid backend was +optional per-profile; Parquet+HNSW carries the primary path. 
If a +specific workload later proves Lance-only, it can be exposed as a +Python-sidecar service. + +### 3. UI — replacing Dioxus + +**Constraint:** Dioxus is a Rust+WASM frontend framework. No Go +equivalent at the same level of polish. The current `crates/ui` covers +Ask, Explore, SQL, System tabs. + +**Options:** +- **A. `html/template` + HTMX + Alpine.js** — server-rendered Go, + partial-page swaps via HTMX. Single repo, minimal JS, fits Go's + "boring is good" ethos. +- **B. Separate Vite/React frontend** — `golangLAKEHOUSE-ui` repo, + Go gateway serves static files. Modern UX patterns, more dev tooling + needed. +- **C. Keep Dioxus + WASM as a build step** — defeats the rewrite. + +**Recommendation:** **Option A** for v1; revisit if UI requirements +demand React-tier interactivity. The current Lakehouse UIs (`/lakehouse/` +demo + staffer console) are mostly server-rendered HTML with vanilla +JS — `html/template` + HTMX is a strong fit. + +### 4. Arrow ecosystem maturity + +**Constraint:** `arrow-go/v15` lags `arrow-rs` in compute kernels, +expression APIs, and some compression codecs. Specific gaps known: +limited `cast` kernel coverage, no equivalent of `arrow-rs`'s +`compute::sort_to_indices` for all dtypes, no Acero-style streaming +execution. + +**Mitigation:** the Go Lakehouse relies on Arrow primarily for +**Parquet I/O + RecordBatch transport**, not for in-process compute +(that's DuckDB's job). The narrower scope makes arrow-go's gaps less +load-bearing. + +**Acceptance gate:** any Arrow API the Go Lakehouse uses must be +covered by `arrow-go/v15`. Anything missing → file an upstream issue, +implement locally if blocking, contribute back. 
+ +--- + +## Migration strategy + +### What ports verbatim +- Problem statement, use cases, requirements +- Architectural invariants (1–12) +- ADRs 001–021 (preserved as design intent; some change implementation) +- Federation building blocks (multi-bucket, error-journal, append-log) + +### What rebuilds from data +- HNSW indexes (rebuild from Parquet embeddings — ADR-008 is + preserved verbatim) +- Pathway memory state (port the JSON format and the algorithm; per + ADR-001 §1.5 the 88 Rust traces in `data/_pathway_memory/state.json` + are not reloaded, and the Go store starts with zero traces) +- Catalog manifests (Parquet, portable) +- Distillation v1.0.0 substrate (port the SFT/contamination-firewall + logic; the fixture-as-gate pattern stays) + +### What ships first (port-order outline — see SPEC.md for detail) + +1. **Phase G0** — Skeleton: `cmd/gateway`, `cmd/catalogd`, `cmd/storaged`, + `cmd/ingestd`, `cmd/queryd`. Single-bucket, no auth, CSV→Parquet, query via DuckDB. +2. **Phase G1** — Vector path: `cmd/vectord` with HNSW + RAG endpoint. +3. **Phase G2** — Multi-profile + federation (ADRs 016–017). +4. **Phase G3** — Pathway memory + distillation port. +5. **Phase G4** — MCP server, observer, auditor (TS surfaces → Go). +6. **Phase G5** — UI (HTMX) and demo parity with `devop.live/lakehouse/`. + +Detailed acceptance gates in `SPEC.md`. + +### What does NOT migrate +- The Rust crates themselves (archived in the original `lakehouse` repo) +- The TS scrum/auditor pipelines (rewritten in Go in Phase G4) +- The Bun mcp-server (rewritten in Go in Phase G4) +- The Python sidecar (kept as-is, behind aibridge) + +--- + +## Non-goals + +- **No port of `vectord-lance`.** Lance backend is dropped; Parquet+HNSW + is the only vector backend. +- **No retention of Rust in the build chain.** No cgo-to-Rust bridges, + no FFI to keep specific crates alive. Cgo to **C/C++** (DuckDB) is + acceptable. 
+- **No new feature work during the port.** Feature parity with the Rust + Lakehouse at the cutoff commit is the bar; new capabilities defer to + post-port phases. +- **No live-migration of running services.** The Rust Lakehouse stops + serving when Go reaches feature parity; data moves once via Parquet + re-pointer. + +--- + +## Ratified decisions (2026-04-28, J) + +The six gating questions are answered. Phase G0 is unblocked. Full +context for each lives in `docs/DECISIONS.md` ADR-001. + +| # | Decision | +|---|---| +| 1 | **DuckDB via cgo** — `marcboeker/go-duckdb` is the query engine. Cgo accepted as the cost of a mature SQL+Parquet path. | +| 2 | **HTMX** — server-rendered `html/template` + HTMX + Alpine.js. Single-binary deploy. React is post-G5 if needed. | +| 3 | **Gitea** — repo lives at `git.agentview.dev/profit/golangLAKEHOUSE` (same server as the Rust lakehouse). | +| 4 | **Distillation rebuild in Go** — port the SFT export + contamination firewall logic, but bit-identical reproducibility is a Rust-era property. New Go fixtures, new acceptance gates. | +| 5 | **Pathway memory starts clean** — Go pathway memory begins with zero traces. The existing 88 Rust traces are preserved at `data/_pathway_memory/state.json` in the lakehouse repo as a historical record (see `docs/RUST_PATHWAY_MEMORY_NOTE.md`). | +| 6 | **Auditor longitudinal signal restarts** — `audit_baselines.jsonl` is a Rust-era artifact. Go auditor begins a fresh drift signal. | diff --git a/docs/RUST_PATHWAY_MEMORY_NOTE.md b/docs/RUST_PATHWAY_MEMORY_NOTE.md new file mode 100644 index 0000000..11603ea --- /dev/null +++ b/docs/RUST_PATHWAY_MEMORY_NOTE.md @@ -0,0 +1,79 @@ +# Rust Pathway Memory — Historical Reference + +**Status:** Reference-only. The Go Lakehouse does NOT load these +traces (per ADR-001 §1.5). This note exists so future-Go-engineer +knows what the Rust era accumulated, where it lives, and why it was +left in place. 
+ +--- + +## What was there + +By the time of the rewrite cutoff (commit `dcf4c9a`, +2026-04-28), the Rust pathway memory held: + +- **88 traces** at `/home/profit/lakehouse/data/_pathway_memory/state.json` +- **11/11 successful replays** as of the most recent verification (the + "probation gate crossed" signal from the lakehouse `STATE_OF_PLAY.md`) +- Active scrum-cycle compounding: each scrum loop iteration appended + new traces and re-ran replays against existing pathway fingerprints + to preempt review prompts with "this file pattern has produced bug X + before" + +## Where it lives (Rust repo) + +``` +lakehouse/ +├── crates/vectord/src/pathway_memory.rs ← implementation +├── data/_pathway_memory/state.json ← 88 traces, JSON +└── docs/DECISIONS.md ADR-021 ← matrix-correctness layer design +``` + +The TS-side mirror lived in +`tests/real-world/scrum_master_pipeline.ts` (functions +`computePathwayId`, `buildPathwayVec`). Both implementations +byte-matched on bucket vectors. + +## Why this matters for the Go port + +The pathway memory's *algorithm* is portable — 32-bucket SHA256-keyed +token hash, JSON state file, replay logic. The pathway memory's +*signal value* is not — those 88 traces represent months of scrum +loops on Rust code, with bug fingerprints anchored to Rust file +prefixes (`crates/queryd/`, `crates/vectord/`, etc.) that don't exist +in the Go repo. + +Per ADR-001 §1.5, the Go pathway memory: +1. Reimplements the algorithm (SPEC §3.4 G3.4.B is the byte-match + correctness gate). +2. Starts with zero traces. The 88 Rust traces are NOT migrated. +3. Builds its own signal over Go-era scrum cycles. + +## What to do if the Go pathway memory underperforms + +If after Phase G3 the Go pathway memory shows a noticeable lift +deficit vs. the Rust era's "11/11 successful replays" baseline: + +1. **First** — verify the Go algorithm byte-matches the Rust one on + the SPEC G3.4.B golden input. 
If yes, the algorithm is correct and + the gap is data-volume, not implementation. +2. **Second** — the Rust traces exist; if needed, re-prefix file paths + from `crates/queryd/` style to `cmd/queryd/` style, run a + compatibility check, and seed the Go pathway memory selectively. But + only after the algorithm is proven byte-match correct. +3. **Third** — accept that the first ~3 months of Go scrum cycles need + to rebuild the signal naturally. This is the cost of the clean + restart per ADR-001 §1.5. + +## Historical baseline (frozen reference) + +| Metric | Rust value at cutoff | Source | +|---|---|---| +| Total traces | 88 | `data/_pathway_memory/state.json` | +| Successful replays | 11/11 | scrum loop log circa 2026-04-26 | +| Distinct file prefixes | TBD — query the state file | n/a | +| Distinct semantic_flag variants used | 9 (per ADR-021) | `pathway_memory.rs` | +| Distinct bug_fingerprint hashes | TBD | `pathway_memory.rs` | + +When the Go pathway memory reaches comparable numbers, it has caught +up to the Rust era and can be considered fully replacement-grade. diff --git a/docs/SPEC.md b/docs/SPEC.md new file mode 100644 index 0000000..ac69c3d --- /dev/null +++ b/docs/SPEC.md @@ -0,0 +1,354 @@ +# SPEC: Lakehouse-Go Component Port Plan + +**Status:** DRAFT — companion to `PRD.md`. Component-by-component port +plan with library choices, effort estimates, and acceptance gates. +**Created:** 2026-04-28 +**Owner:** J + +This spec answers: for each piece of the Rust Lakehouse, what Go +library carries it, what the effort looks like, and what gate proves +the port is real. + +Effort scale (one engineer-week = ~40h focused work): +- **S** — 1–3 days +- **M** — 1 engineer-week +- **L** — 2–3 engineer-weeks +- **XL** — 1+ months +- **HARD** — open research, see PRD §Hard problems + +--- + +## §1. 
Component port table — Rust crates + +| Crate | Rust deps that mattered | Go target | Library | Effort | Risk | +|---|---|---|---|---|---| +| `gateway` | axum, tokio, tonic, tower | `cmd/gateway` | `chi` + stdlib `net/http` + `google.golang.org/grpc` | **L** | low — Go's strongest domain | +| `catalogd` | parquet-rs, arrow, sqlite | `cmd/catalogd` | `arrow-go/v15`, `mattn/go-sqlite3` | **L** | low | +| `storaged` | object_store, aws-sdk | `cmd/storaged` | `aws-sdk-go-v2`, `minio-go` for MinIO-specific paths | **M** | low | +| `queryd` | datafusion, arrow | `cmd/queryd` | `marcboeker/go-duckdb` (cgo) | **HARD** | high — see §3 | +| `ingestd` | csv, json, lopdf, postgres | `cmd/ingestd` | stdlib `encoding/csv`, `encoding/json`, `pdfcpu/pdfcpu`, `jackc/pgx/v5` | **L** | low | +| `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `arrow-go/v15` | **L** | medium — re-validate HNSW recall | +| `vectord-lance` | lance | **DROPPED** | n/a | n/a | n/a — Parquet+HNSW only | +| `journald` | parquet, arrow | `cmd/journald` | `arrow-go/v15` | **M** | low | +| `aibridge` | reqwest | library | `net/http` + connection pool | **S** | low | +| `validator` | parquet, custom | library | `arrow-go/v15` parquet reader | **M** | low — port the 24 unit tests as gates | +| `truth` | tomli, custom DSL | library | `pelletier/go-toml/v2` | **M** | low | +| `proto` | tonic-build | `proto/` + `protoc-gen-go` | `buf` + `protoc-gen-go-grpc` | **S** | low | +| `shared` | serde, anyhow | library | stdlib `encoding/json`, `errors` | **S** | low | +| `ui` | dioxus, wasm | **REPLACED** | `html/template` + HTMX | **L** | medium — see §3 | +| `lance-bench` | criterion | n/a — dropped with Lance | n/a | n/a | n/a | + +**Total Rust crate port effort:** ~12–18 engineer-weeks (3–4 months for +one engineer; 6–8 weeks for two). + +--- + +## §2. 
Component port table — TypeScript surfaces + +| TS surface | Current location | Go target | Library | Effort | Risk | +|---|---|---|---|---|---| +| `mcp-server/index.ts` | Bun, :3700 | `cmd/mcp` | `mark3labs/mcp-go` (Go MCP SDK) | **L** | medium — MCP semantics | +| `mcp-server/observer.ts` | Bun, :3800 | `cmd/observer` | stdlib `net/http`, `slog` | **M** | low | +| `mcp-server/tracing.ts` | Bun, Langfuse client | library | `go.opentelemetry.io/otel` + Langfuse Go client (or hand-roll) | **M** | low — Langfuse Go OSS support varies | +| `auditor/*.ts` | TS, runs as systemd | `cmd/auditor` | stdlib + `gitea API client` | **L** | medium — auditor cross-lineage logic is intricate | +| `tests/real-world/scrum_master_pipeline.ts` | TS, ad-hoc | `cmd/scrum` | stdlib | **L** | medium — chunking + embed + ladder logic | +| `tests/real-world/scrum_applier.ts` | TS, ad-hoc | `cmd/scrum-apply` | stdlib + git CLI shell-out | **M** | medium | +| `bot/propose.ts` | TS | `cmd/bot` | stdlib | **S** | low | +| Search demo HTML/JS | static | static (no port) | n/a | n/a | n/a — copied as-is | + +**Total TS port effort:** ~6–10 engineer-weeks. + +--- + +## §3. Hard problem details + +### §3.1 — Query engine (DuckDB via cgo) + +**Library:** `marcboeker/go-duckdb` — Go bindings via cgo. + +**API shape** (replaces the DataFusion `SessionContext` pattern): +```go +db, _ := sql.Open("duckdb", "") +defer db.Close() +db.Exec("CREATE VIEW workers AS SELECT * FROM read_parquet('s3://bucket/workers/*.parquet')") +rows, _ := db.Query("SELECT role, count(*) FROM workers WHERE state='IL' GROUP BY role") +``` + +**Acceptance gates:** +- G3.1.A — `SELECT * FROM read_parquet('workers_500k.parquet') LIMIT 1` + returns a row with the expected schema. Establishes Parquet read + works. +- G3.1.B — Hybrid SQL+vector query (the `POST /vectors/hybrid` + surface) returns same workers as the Rust path on the same input, + ranked the same way modulo embedding precision. 
+- G3.1.C — Hot-cache merge-on-read: register a base table + a delta + Parquet, query, observe both rows merged with the delta winning on + conflict. + +**Fallback if cgo is rejected:** run DuckDB as an external process +(`duckdb -json -c '...'` shelled or HTTP via a thin Go wrapper). Adds +operational surface; preserves SQL model. + +### §3.2 — HNSW index + +**Library:** `coder/hnsw` — pure-Go HNSW, in-process. Supports add / +delete / search / persist. + +**Open question:** does `coder/hnsw` match the recall@10 we measured +on the Rust `hora` path? Need a calibration test: +- Rebuild `lakehouse_arch_v1` (the 1086-chunk arch corpus) in Go. +- Compare recall@10 on a fixed query set to the Rust baseline. +- Acceptance: ≤2% drop or we switch library / parameters. + +**Persistence format:** TBD — `coder/hnsw` has its own snapshot format; +ADR equivalent of ADR-008 (Parquet for embeddings + sidecar HNSW file) +needs revisiting in Go to confirm the sidecar format we ship. + +**Acceptance gates:** +- G3.2.A — Build HNSW from a Parquet of 100K vectors in <60s +- G3.2.B — Search 100K vectors at k=10 in <50ms p50 +- G3.2.C — Recall@10 within 2% of Rust baseline on + `lakehouse_arch_v1` + +### §3.3 — UI (HTMX) + +**Approach:** server-rendered Go templates using `html/template`, +HTMX for partial-page swaps, Alpine.js for client-side interactivity +where needed. Single binary serves API + UI. + +**Acceptance gates:** +- G3.3.A — `Ask` tab: type natural-language question, get answer + from RAG endpoint, render in-page without full reload +- G3.3.B — `Explore` tab: paginated dataset list with hot-swap badge + rendering +- G3.3.C — `SQL` tab: textarea → submit → tabular result rendered + in-page +- G3.3.D — `System` tab: live tail of `/storage/errors` and + `/hnsw/trials` via HTMX polling + +**Fallback if HTMX feels limiting:** split repo `golangLAKEHOUSE-ui` +with Vite + React, served as static files by Go gateway. Costs an +extra repo + build chain. 
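The HTMX approach in §3.3 amounts to handlers that return HTML fragments rather than whole pages. Below is a minimal stdlib-only sketch of that idea for the `Explore` tab. The `Dataset` type, the `/ui/explore/rows` route, and the badge text are hypothetical names chosen for illustration; none of them come from the spec.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
	"net/http"
)

// Dataset is a hypothetical row shape for the Explore tab; the real
// catalog schema is not defined in this document.
type Dataset struct {
	Name    string
	HotSwap bool
}

// rowsTmpl renders table rows only, so HTMX can swap them into an
// existing <tbody> without a full page reload.
var rowsTmpl = template.Must(template.New("rows").Parse(
	`{{range .}}<tr><td>{{.Name}}</td><td>{{if .HotSwap}}hot-swap{{end}}</td></tr>{{end}}`))

// renderRows returns the HTML fragment for a slice of datasets.
// html/template escapes field values, so a dataset name cannot inject markup.
func renderRows(ds []Dataset) (string, error) {
	var buf bytes.Buffer
	if err := rowsTmpl.Execute(&buf, ds); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// exploreRows is the handler that an element such as
// <tbody hx-get="/ui/explore/rows" hx-trigger="load"> would call.
func exploreRows(w http.ResponseWriter, r *http.Request) {
	// Placeholder data; a real handler would query catalogd.
	frag, err := renderRows([]Dataset{{Name: "workers", HotSwap: true}})
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	fmt.Fprint(w, frag)
}

func main() {
	// Route registration only; the sketch does not start a listener.
	mux := http.NewServeMux()
	mux.HandleFunc("/ui/explore/rows", exploreRows)

	frag, err := renderRows([]Dataset{{Name: "a<b", HotSwap: true}})
	if err != nil {
		panic(err)
	}
	fmt.Println(frag)
}
```

Keeping the rendering in a pure function (`renderRows`) makes the fragment testable without an HTTP round trip, which is one way gates such as G3.3.B could be checked in unit tests.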
+
+### §3.4 — Pathway memory port
+
+**Constraint:** the Rust `pathway_memory` and TS implementations were
+byte-matching by ADR-021. The byte contract was verified by running
+both implementations on the same input tokens and asserting matching
+bucket indices.
+
+**Go port plan:**
+- Port the 32-bucket SHA256-keyed token hash exactly. Verify on a
+  golden input that Go produces the same bucket vector as Rust.
+- Port the JSON state file format verbatim — the existing 88 traces in
+  `data/_pathway_memory/state.json` reload as-is into the Go
+  implementation.
+- Port the matrix-correctness layer (ADR-021's `SemanticFlag`,
+  `BugFingerprint`, `TypeHint`) — these are pure value types,
+  trivially portable.
+
+**Acceptance gates:**
+- G3.4.A — Load existing `state.json`, run `replay` on the same 11
+  prior successful pathways, all 11 succeed (matching the Rust 11/11
+  baseline).
+- G3.4.B — Bucket vector for a fixed test input byte-matches the
+  Rust output.
+
+---
+
+## §4. Phase plan
+
+### Phase G0 — Skeleton (Week 1–3)
+
+**Scope:** smallest end-to-end ingest + query path working in Go.
+
+| Component | Deliverable |
+|---|---|
+| `cmd/gateway` | HTTP on :3100, `/health`, `/v1/chat` proxy stub |
+| `cmd/catalogd` | In-memory registry + Parquet manifest persistence |
+| `cmd/storaged` | Single-bucket S3 / local FS, no error journal yet |
+| `cmd/ingestd` | CSV → Parquet, schema inference, register-on-ingest |
+| `cmd/queryd` | DuckDB-backed `POST /sql` endpoint |
+
+**Acceptance:** upload a CSV via `POST /ingest`, query it via
+`POST /sql` with a SELECT, get rows back. Single-bucket. No vector,
+no profile, no UI.
+
+### Phase G1 — Vector + RAG (Week 4–6)
+
+| Component | Deliverable |
+|---|---|
+| `cmd/vectord` | Embed-on-ingest (calls Python sidecar), HNSW build, `POST /search` |
+| `cmd/gateway` | Add `POST /rag` (embed → search → retrieve → generate via aibridge) |
+| `cmd/aibridge` | HTTP client to existing Python sidecar |
+
+**Acceptance:** ingest 15K resumes (the original Phase 7 fixture),
+ask "find me a forklift operator with OSHA-10 in IL", get ranked
+results with LLM-generated explanation grounded in the retrieved
+chunks.
+
+### Phase G2 — Federation + profiles (Week 7–8)
+
+| Component | Deliverable |
+|---|---|
+| `cmd/storaged` | Multi-bucket registry, rescue bucket, error journal at `primary://_errors/` |
+| Profile system | Per-reader profile bound to bucket + vector index |
+| Hot-swap | Atomic pointer swap for index generations |
+
+**Acceptance:** two profiles bound to two buckets, queries scoped
+correctly, hot-swap a vector index without query interruption,
+rollback works.
+
+### Phase G3 — Pathway memory + distillation (Week 9–11)
+
+| Component | Deliverable |
+|---|---|
+| `cmd/vectord` | Pathway memory module ported, 88 traces reloaded |
+| Distillation pipeline | SFT export, contamination firewall, scorer |
+| Audit baselines | `audit_baselines.jsonl` longitudinal signal port |
+
+**Acceptance:** replay 11 prior successful pathways, all 11 succeed.
+Re-run distillation acceptance on the frozen fixture set, 22/22 pass.
+
+### Phase G4 — TS surfaces → Go (Week 12–14)
+
+| Component | Deliverable |
+|---|---|
+| `cmd/mcp` | MCP server (replaces Bun) — `/v1/chat`, intelligence endpoints |
+| `cmd/observer` | Autonomous iteration loop, op recording |
+| `cmd/auditor` | PR audit pipeline (kimi/haiku/opus rotation) |
+| `cmd/scrum` | Scrum master pipeline (replaces TS) |
+
+**Acceptance:** open a test PR, auditor cycles within 90s, emits
+verdict to `data/_auditor/kimi_verdicts/`, behavior matches Rust+TS
+era within tolerance.
+
+### Phase G5 — UI + demo parity (Week 15–16)
+
+| Component | Deliverable |
+|---|---|
+| `cmd/gateway` | Serves HTMX templates + static demo HTML |
+| Demo at `devop.live/lakehouse/` | Parity with current Bun demo |
+| Staffer console at `/console` | Parity |
+
+**Acceptance:** `devop.live/lakehouse/` cuts over from Bun to Go
+gateway. Section ① / ② / ③ all render. Compact contract cards still
+expand with Project Index. Fill-probability bars still paint.
+
+---
+
+## §5. Repo layout
+
+```
+golangLAKEHOUSE/
+├── docs/
+│   ├── PRD.md            ← this PRD
+│   ├── SPEC.md           ← this spec
+│   ├── DECISIONS.md      ← Go-era ADRs (start fresh, reference Rust ADRs by number)
+│   └── ADR-XXX-*.md      ← per-ADR detail
+├── cmd/
+│   ├── gateway/          ← main HTTP/gRPC ingress
+│   ├── catalogd/
+│   ├── storaged/
+│   ├── queryd/
+│   ├── ingestd/
+│   ├── vectord/
+│   ├── journald/
+│   ├── mcp/
+│   ├── observer/
+│   ├── auditor/
+│   └── scrum/
+├── internal/             ← shared packages, not exported
+│   ├── aibridge/
+│   ├── validator/
+│   ├── truth/
+│   ├── shared/
+│   ├── proto/            ← generated protobuf
+│   └── pathway/
+├── pkg/                  ← public Go packages (none initially)
+├── web/                  ← UI (HTMX templates + static)
+│   ├── templates/
+│   └── static/
+├── scripts/              ← cold-start, smoke, distill scripts
+├── tests/                ← golden files, integration tests
+├── go.mod
+├── go.sum
+└── README.md
+```
+
+**Single Go module.** All commands and internal packages live under
+`golangLAKEHOUSE/`. No nested modules unless a package needs an
+independent release cadence (none expected).
+
+**Build:** `go build ./cmd/...` produces all binaries.
+
+---
+
+## §6. Migration data plan
+
+### What ports verbatim
+- Parquet datasets at `data/datasets/*.parquet` — read by Go directly.
+- Catalog manifests — Parquet, ports as data not code.
+- Pathway memory state — JSON, ports if §3.4 byte-matching gate passes.
+
+### What rebuilds
+- HNSW indexes — rebuild from Parquet embeddings on first Go startup.
+- Auditor verdicts on PRs — old PRs won't be re-audited; lineage starts
+  fresh on the new repo's PRs.
+
+### What's archived
+- The Rust `crates/` tree — preserved in the original repo at the
+  cutover commit, tagged `pre-go-rewrite-2026-04-28` for reference.
+- TS surfaces (`mcp-server/`, `auditor/`, etc.) — preserved in the
+  original repo at the same tag.
+- Distillation v1.0.0 substrate (`tag distillation-v1.0.0`,
+  `e7636f2`) — kept as the historical reference; Go re-implementation
+  ports the LOGIC but not the bit-identical-reproducibility property
+  unless an ADR re-establishes it.
+
+### What's discarded
+- `crates/vectord-lance/` (Lance backend, see PRD §Hard problems §2)
+- `crates/lance-bench/` (criterion benchmarks specific to Lance)
+
+---
+
+## §7. Acceptance: when is the rewrite done?
+
+The Go Lakehouse reaches **feature parity** when:
+
+1. **All 12 Rust PRD invariants hold** (object-storage source of truth,
+   catalog metadata authority, idempotent ingest, hot-swap atomicity,
+   profiles, etc.).
+2. **The 16 distillation acceptance gates pass** (re-run
+   `./scripts/distill audit-full` against the Go pipeline).
+3. **The 22/22 acceptance fixtures from `tests/fixtures/distillation/
+   acceptance/` pass** under the Go implementation.
+4. **The 145 unit tests of distillation v1.0.0 are ported and pass.**
+5. **`devop.live/lakehouse/` demo cuts over to Go gateway** with no
+   visible UI regressions.
+6. **Auditor emits Kimi/Haiku/Opus verdicts** on a test PR, matching
+   the cross-lineage rotation behavior.
+7. **The 88 pathway traces replay** with 11/11 prior successes
+   reproduced.
+
+At that point the Rust repo enters maintenance-only mode (security
+fixes), and the Go repo becomes the live system.
+
+---
+
+## §8. Ratified — Phase G0 unblocked (2026-04-28, J)
+
+| # | Decision | Spec impact |
+|---|---|---|
+| 1 | DuckDB via cgo (`marcboeker/go-duckdb`) | §3.1 option A — proceed |
+| 2 | HTMX + `html/template` + Alpine.js | §3.3 option A — proceed |
+| 3 | `git.agentview.dev/profit/golangLAKEHOUSE` | repo location locked |
+| 4 | Distillation rebuilt in Go (no bit-identical port) | §6 — port logic, not fixtures |
+| 5 | Pathway memory starts empty; old traces noted | §3.4 G3.4.A is now "build initial state from scratch in Phase G3"; G3.4.B (byte-match) preserved as the porting correctness gate when the algorithm is reimplemented |
+| 6 | Auditor longitudinal signal restarts | new `audit_baselines.jsonl` lineage starts on first Go-era PR |
+
+See `docs/DECISIONS.md` ADR-001 for full rationale and
+`docs/RUST_PATHWAY_MEMORY_NOTE.md` for where the legacy 88 traces live.
+
+**Phase G0 is now unblocked.** Next step: bootstrap the Go module
+skeleton + push to Gitea, then begin §4 Phase G0 implementation.
diff --git a/go.mod b/go.mod
new file mode 100644
index 0000000..37f2878
--- /dev/null
+++ b/go.mod
@@ -0,0 +1,3 @@
+module git.agentview.dev/profit/golangLAKEHOUSE
+
+go 1.23