# SPEC: Lakehouse-Go Component Port Plan **Status:** DRAFT — companion to `PRD.md`. Component-by-component port plan with library choices, effort estimates, and acceptance gates. **Created:** 2026-04-28 **Owner:** J This spec answers: for each piece of the Rust Lakehouse, what Go library carries it, what the effort looks like, and what gate proves the port is real. Effort scale (one engineer-week = ~40h focused work): - **S** — 1–3 days - **M** — 1 engineer-week - **L** — 2–3 engineer-weeks - **XL** — 1+ months - **HARD** — open research, see PRD §Hard problems --- ## §1. Component port table — Rust crates | Crate | Rust deps that mattered | Go target | Library | Effort | Risk | |---|---|---|---|---|---| | `gateway` | axum, tokio, tonic, tower | `cmd/gateway` | `chi` + stdlib `net/http` + `google.golang.org/grpc` | **L** | low — Go's strongest domain | | `catalogd` | parquet-rs, arrow, sqlite | `cmd/catalogd` | `apache/arrow-go/v18`, `mattn/go-sqlite3` | **L** | low | | `storaged` | object_store, aws-sdk | `cmd/storaged` | `aws-sdk-go-v2`, `minio-go` for MinIO-specific paths | **M** | low | | `queryd` | datafusion, arrow | `cmd/queryd` | **`duckdb/duckdb-go/v2`** (cgo, official) | **HARD** | high — see §3 | | `ingestd` | csv, json, lopdf, postgres | `cmd/ingestd` | stdlib `encoding/csv`, `encoding/json`, `pdfcpu/pdfcpu`, `jackc/pgx/v5` | **L** | low | | `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `apache/arrow-go/v18` | **L** | medium — re-validate HNSW recall | | **matrix indexer** (emergent in Rust — `mode.rs` + `build_*_corpus.ts` + observer `/relevance`) | scripts/build_*_corpus.ts, crates/gateway/src/v1/mode.rs, mcp-server/observer.ts | `internal/matrix/` + gateway routes (`/v1/matrix/*`) | stdlib + vectord client | **L** | medium — see §3.4. Corpus-as-shard composer; relevance filter; strong-model downgrade gate; multi-corpus retrieve+merge. 
The learning-loop layer that lifts vectord from "static index" to "meta-index that learns from playbooks." | | `vectord-lance` | lance | **DROPPED** | n/a | n/a | n/a — Parquet+HNSW only | | `journald` | parquet, arrow | `cmd/journald` | `apache/arrow-go/v18` | **M** | low | | `aibridge` | reqwest | library | `net/http` + connection pool · `anthropics/anthropic-sdk-go` available for direct Claude calls (currently routed via opencode) | **S** | low | | **chatd** (Phase 4 — multi-provider LLM dispatcher) | crates/gateway/src/v1/{ollama_cloud,openrouter,opencode}.rs | `cmd/chatd` + `internal/chat/` | stdlib `net/http` only | **DONE** | medium — see §3.9. 5-provider routing (ollama / ollama_cloud / openrouter / opencode / kimi) by model-name prefix or `:cloud` suffix. Replaces the Rust gateway's `v1::` adapters. | | `validator` | parquet, custom | library | `apache/arrow-go/v18` parquet reader | **M** | low — port the 24 unit tests as gates | | `truth` | tomli, custom DSL | library | `pelletier/go-toml/v2` | **M** | low | | `proto` | tonic-build | `proto/` + `protoc-gen-go` | `buf` + `protoc-gen-go-grpc` | **S** | low | | `shared` | serde, anyhow | library | stdlib `encoding/json`, `errors` | **S** | low | | `ui` | dioxus, wasm | **REPLACED** | `html/template` + HTMX | **L** | medium — see §3 | | `lance-bench` | criterion | n/a — dropped with Lance | n/a | n/a | n/a | **Total Rust crate port effort:** ~12–18 engineer-weeks (3–4 months for one engineer; 6–8 weeks for two). --- ## §2. 
Component port table — TypeScript surfaces | TS surface | Current location | Go target | Library | Effort | Risk | |---|---|---|---|---|---| | `mcp-server/index.ts` | Bun, :3700 | `cmd/mcp` | **`modelcontextprotocol/go-sdk`** (official Go SDK, v1.5.0, Google-collab) | **L** | medium — MCP semantics | | `mcp-server/observer.ts` | Bun, :3800 | `cmd/observer` | stdlib `net/http`, `slog` | **M** | low | | `mcp-server/tracing.ts` | Bun, Langfuse client | library | `go.opentelemetry.io/otel` + Langfuse Go client (or hand-roll) | **M** | low — Langfuse Go OSS support varies | | `auditor/*.ts` | TS, runs as systemd | `cmd/auditor` | stdlib + `gitea API client` | **L** | medium — auditor cross-lineage logic is intricate | | `tests/real-world/scrum_master_pipeline.ts` | TS, ad-hoc | `cmd/scrum` | stdlib | **L** | medium — chunking + embed + ladder logic | | `tests/real-world/scrum_applier.ts` | TS, ad-hoc | `cmd/scrum-apply` | stdlib + git CLI shell-out | **M** | medium | | `bot/propose.ts` | TS | `cmd/bot` | stdlib | **S** | low | | Search demo HTML/JS | static | static (no port) | n/a | n/a | n/a — copied as-is | **Total TS port effort:** ~6–10 engineer-weeks. --- ## §3. Hard problem details ### §3.1 — Query engine (DuckDB via cgo) **Library:** `github.com/duckdb/duckdb-go/v2` — official Go bindings via cgo. (Replaces the legacy `marcboeker/go-duckdb`, which was deprecated when the DuckDB team and Marc Boeker jointly relocated maintenance to the DuckDB org at v2.5.0. Migration is a one-line `gofmt -r` rewrite of import paths.) Current version v2.10502.0 (April 2026), DuckDB v1.5.2 compat. Statically links default extensions: ICU, JSON, Parquet, Autocomplete. 
**API shape** (replaces the DataFusion `SessionContext` pattern):

```go
import (
	"database/sql"

	_ "github.com/duckdb/duckdb-go/v2" // blank import registers the "duckdb" driver
)

db, _ := sql.Open("duckdb", "") // empty DSN = in-memory database
defer db.Close()
db.Exec("CREATE VIEW workers AS SELECT * FROM read_parquet('s3://bucket/workers/*.parquet')")
rows, _ := db.Query("SELECT role, count(*) FROM workers WHERE state='IL' GROUP BY role")
```

**Acceptance gates:**

- G3.1.A — `SELECT * FROM read_parquet('workers_500k.parquet') LIMIT 1` returns a row with the expected schema. Establishes that Parquet read works.
- G3.1.B — Hybrid SQL+vector query (the `POST /vectors/hybrid` surface) returns the same workers as the Rust path on the same input, ranked the same way modulo embedding precision.
- G3.1.C — Hot-cache merge-on-read: register a base table + a delta Parquet, query, observe both rows merged with the delta winning on conflict.

**Fallback if cgo is rejected:** run DuckDB as an external process (`duckdb -json -c '...'` shelled out, or HTTP via a thin Go wrapper). Adds operational surface; preserves the SQL model.

### §3.2 — HNSW index

**Library:** `coder/hnsw` — pure-Go HNSW, in-process. Supports add / delete / search / persist.

**Open question:** does `coder/hnsw` match the recall@10 we measured on the Rust `hora` path? Need a calibration test:

- Rebuild `lakehouse_arch_v1` (the 1086-chunk arch corpus) in Go.
- Compare recall@10 on a fixed query set to the Rust baseline.
- Acceptance: ≤2% drop, or we switch library / parameters.

**Persistence format:** TBD — `coder/hnsw` has its own snapshot format; the ADR equivalent of ADR-008 (Parquet for embeddings + sidecar HNSW file) needs revisiting in Go to confirm the sidecar format we ship.
**Acceptance gates:** - G3.2.A — Build HNSW from a Parquet of 100K vectors in <60s - G3.2.B — Search 100K vectors at k=10 in <50ms p50 - G3.2.C — Recall@10 within 2% of Rust baseline on `lakehouse_arch_v1` ### §3.4 — Matrix indexer (corpus-as-shard composer) **What it is.** The matrix indexer is the layer above `vectord` that turns a fleet of single-corpus HNSW indexes into a learning meta-index. In the Rust system this is emergent — split between corpus builders (`scripts/build_*_corpus.ts`), the mode runner (`crates/gateway/src/v1/mode.rs`), the observer relevance endpoint (`mcp-server/observer.ts`), and the strong-model downgrade gate (`mode.rs::execute`). In Go we name it explicitly so future sessions don't reduce it to "vectord." **Why corpus-as-shard, not shard-by-id.** Sharding a single index by hash(id) is a pure throughput hack with a recall tax. Sharding by corpus is the existing retrieval shape — `lakehouse_arch_v1`, `lakehouse_symbols_v1`, `scrum_findings_v1`, `lakehouse_answers_v1`, `kb_team_runs_v1`, `successful_playbooks_live`, etc. — each with distinct topology and a distinct retrieval intent. Concurrent Adds parallelize naturally because they go to different corpora; the matrix layer's job is to retrieve+merge across them, filter for relevance, and downgrade composition when strong models prove the matrix is anti-additive. **Components to port (in dependency order):** 1. **Corpus builders** — Go equivalents of `scripts/build_*_corpus.ts`. For each named corpus, a builder that reads source, splits into chunks per the corpus's schema, embeds via `/v1/embed`, and adds to a vectord index of the same name. Effort: **M** for the first builder, **S** for each subsequent. 2. **Multi-corpus retrieve+merge** (`internal/matrix/retrieve.go`) — given a query and a list of corpus names, search each at top_k=K, merge by score, return top N globally. Match Rust's pattern: top_k=6 per corpus, top 8 globally before relevance filter. 3. 
**Relevance filter** (`internal/matrix/relevance.go`) — port the threshold-based filter from `mcp-server/observer.ts:/relevance`. Drops adjacency-pollution chunks that share a corpus with the hit but aren't actually about the query. `LH_RELEVANCE_FILTER` / `LH_RELEVANCE_THRESHOLD` env knobs preserved. 4. **Strong-model downgrade gate** (`internal/matrix/downgrade.go`) — port `is_weak_model` + the `codereview_lakehouse → codereview_isolation` flip from `mode.rs::execute`. Pass5 proved composed corpora lose 5/5 vs isolation on grok-4.1-fast (p=0.031); the gate is load-bearing for paid-model retrieval quality. 5. **Learning-loop integration** — write outcomes back to a playbook-memory corpus (probably `lakehouse_answers_v1` analogue). This is what makes the matrix INDEX a learning system rather than static retrieval. Per `feedback_meta_index_vision.md`: this is the north star, not the data structure. **Gateway routes:** `/v1/matrix/search` (multi-corpus retrieve+merge), `/v1/matrix/corpora` (list + metadata), `/v1/matrix/relevance` (filter endpoint, used by both internal callers and external tooling). **Acceptance gates:** - G3.4.A — `/v1/matrix/search` against ≥3 corpora returns merged top-N with corpus attribution per result. - G3.4.B — Relevance filter drops at least the threshold-margin chunks on a known adjacency-pollution test case. - G3.4.C — Strong-model downgrade gate flips composed→isolation when the model is non-weak; bypassed when caller sets `force_mode`. - G3.4.D — Concurrent Adds across N=4 corpora parallelize (no shared write-lock); Add throughput scales near-linearly with corpus count. **Persistence:** each corpus's vectord index persists via the existing G1P LHV1 format. The matrix layer is stateless above that — corpus list lives in catalog, retrieval params in config. **Why this is its own §3.x:** in Rust the matrix indexer was emergent and got reduced to "we have vectord" in earlier port-planning. 
The SPEC names it explicitly so the port preserves the multi-corpus retrieval shape AND the learning loop, not just the HNSW substrate. ### §3.5 — Drift quantification (loop 5 of the PRD) **What it is.** PRD names "drift" as the 5th loop: quantify when historical decisions stop matching current reality. Distinct from the rating+distillation loop because drift is MEASUREMENT, not LEARNING. The learning loop says "this match worked, remember it"; the drift loop says "this 4-month-old playbook entry — does it still match what the substrate would surface today?" **What's shipped (commit `be65f85`):** - SCORER drift: re-runs current `distillation.ScoreRecord` over historical (EvidenceRecord, persisted_category) pairs and reports mismatches + a sorted shift matrix - `internal/drift/drift.go` — pure-function `ComputeScorerDrift` - 6 unit tests covering no-drift, shift detection, multi-shift sorted-by-count, includeEntries flag, empty input, scorer-version stamping **Future drift shapes (not shipped):** - PLAYBOOK drift: re-run playbook queries through current matrix-search; recorded answer not in top-K = drift - EMBEDDING drift: KS-test on vector distribution at T1 vs T2 - AUDIT BASELINE drift: matches Rust `audit_baselines.jsonl` longitudinal signal **Acceptance gates:** - G3.5.A — A scorer-version bump triggers a non-zero `Drifted` count on a corpus of historical ScoredRuns where the new logic produces different categories than the persisted ones. - G3.5.B — `ScorerDriftReport.ShiftMatrix` is deterministic-ordered (count desc, ties broken alphabetically) so JSON output is stable across runs. ### §3.6 — Staffing-side structured filter **What it is.** Reality tests on the candidates + workers corpora (commits `0d1553c`, `a97881d`) surfaced that pure semantic retrieval can't gate by location/status/availability — the matrix indexer returns Production Workers for a Forklift+OSHA-30 query because nomic-embed-text's geometry doesn't separate the role labels well. 
Structured filtering is the addressable piece: pre-filter the candidate set on metadata fields BEFORE semantic ranking. **What's shipped (commit `b199093`):** - `SearchRequest.MetadataFilter` — `map[string]any` of metadata field → expected value (single value or list-of-values for OR semantics within a key, AND across keys) - Post-retrieval filter applied before top-K truncation in `internal/matrix/retrieve.go` - `SearchResponse.MetadataFilterDropped` for telemetry on filter aggressiveness - 7 unit tests covering nil filter, missing metadata, exact match, AND across keys, OR within list, bool match, malformed JSON **Deferred:** - Pre-retrieval SQL gate via `queryd` (the actual hybrid). The post-retrieval filter is an MVP that helps when the candidate set is mostly relevant; for aggressive filters that drop most results, a SQL pre-filter into matrix retrieval would surface the right candidates with less wasted embedding work. - Filter language richer than equality (e.g. range, prefix, regex). **Acceptance gates:** - G3.6.A — `MetadataFilter: {"state": "IL"}` against a mixed-state corpus drops every non-IL result; `MetadataFilterDropped` reports the count. - G3.6.B — List filter `{"state": ["IL", "WI"]}` keeps both states, drops the rest (OR within key). - G3.6.C — Multi-key filter is AND: a result missing any key is dropped, no exception. ### §3.7 — Operational rating wiring **What it is.** PRD loop 4 (rating + distillation) needs real inflows to be a learning system rather than a substrate. The playbook-record endpoint (`06e7152`) takes one (query, answer, score) per call; productizing it into actual signal sources is what makes the system get smarter with use. **What's shipped (commit `6392772`):** - `POST /v1/matrix/playbooks/bulk` — bulk-record N successes; per-entry success/failure response so callers can see which of a 4,701-row historical placement import succeeded vs which failed validation. - Single-record path from `06e7152` unchanged. 
**Deferred:** - UI shim for click-tracking (no Go demo UI yet — the Bun demo at `devop.live/lakehouse/` is still serving the public surface). When the Go UI lands or a feedback API is added to the Bun UI, every coordinator click → bulk-batched POST → playbook entry. - Negative feedback (this match didn't work). Currently only positive scores are recorded; a rejection signal would help the learning loop avoid pushing bad matches. - Time-decay on playbook scores so stale recommendations attenuate. **Acceptance gates:** - G3.7.A — Bulk POST of N entries returns `{recorded, failed, results[]}` with per-entry IDs/errors, no single-entry failure aborting the batch. - G3.7.B — Each recorded entry surfaces in `/v1/matrix/search` with `use_playbook=true` after a re-query. ### §3.8 — Observer-KB workflow runner (Archon-style multi-pass) **What it is.** The architectural pattern documented in the Rust `observer-kb` branch (10 commits ahead of main, never merged) and proven by `/home/profit/external/Archon`'s workflow engine. Multiple mode passes processing data, with each pass an objective measurement that contributes to the KB: ``` Raw data ↓ Mode: EXTRACT structured facts/entities/relationships ↓ Mode: VALIDATOR fact-check, confidence 1-10 ↓ Mode: HALLUCINATION verify each claim, flag likely fabrications ↓ Mode: CONSENSUS multiple passes until extraction converges ↓ Mode: REDTEAM attack what survived, patch what fails ↓ Mode: PIPELINE clean → Q&A structure → topic group → rank ↓ RENDER curated doc anchored on questions ``` This is the *orchestrator* missing from §3.4 components 1-5: each SPEC §3.4 piece (relevance, downgrade, scorer, drift) is a "mode"; what's missing is the workflow engine that chains them. **Why it matters.** Per the PRD's product vision: the observer should make actionable decisions based on watching what's successful. 
The workflow runner is how observers compose modes into multi-pass pipelines that score outcomes rigorously enough to feed the KB and inform the playbook substrate. **Reference materials on the system:** - `/home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml` (committed `69919d9` in main) — proves Archon-via-Lakehouse works with a 3-node `shape → weakness → improvement` workflow - `/home/profit/external/Archon` — the upstream workflow engine (cloned 2026-04-26); `packages/providers/src/community/pi/provider.ts` has the local Lakehouse-routing mod committed locally as `3f2afc8` (not pushed to upstream `coleam00/Archon`) - Rust `observer-kb` branch (10 commits, +4338/-55506 LoC) — `apps/observer-kb/docs/PRD.md` documents the multi-pass architecture; `scripts/{deep_analysis,extract_knowledge,process_knowledge}.py` are the Python prototypes that proved it on real ChatGPT/Claude PDF data (496 topics, 300 decisions, 100 insights extracted) **Components to port (in dependency order):** 1. **Workflow definition** (`internal/workflow/types.go`) — YAML schema matching Archon's shape: `name`, `description`, `provider`, `model`, list of `nodes` each with `id`, `prompt`, `allowed_tools`, `effort`, `idle_timeout`, `depends_on`. The depends_on edges form a DAG; the runner resolves topologically. 2. **Node executor** (`internal/workflow/runner.go`) — given a workflow and a starting context, walks the DAG, executes each node by dispatching to the configured backend (matrix.Search, distillation.ScoreRecord, drift.ComputeScorerDrift, or a generic prompt-against-LLM via gateway `/v1/chat`), captures per-node output, makes it available as `$.output` in subsequent nodes. 3. **Provenance recording** — every node execution lands an ObservedOp (via the observerd substrate from `bc9ab93`) with `source: "workflow"`, the workflow name + node ID, input/output summaries, and timing. The ring buffer + JSONL log become the substrate for the rating+distillation loop's KB feed. 
4. **Mode catalog** (`internal/workflow/modes.go`) — registry of the modes the runner can dispatch to. Each mode is a Go function matching a uniform `func(ctx, input map[string]any) (map[string]any, error)` signature so workflows can compose them. Initial modes from §3.4: `matrix.search`, `matrix.relevance`, `matrix.downgrade`, `playbook.record`, `playbook.lookup`, `distillation.score`, `drift.scorer`. Plus `llm.chat` for free-form mode prompts. 5. **HTTP surface** — `POST /v1/observer/workflow/run` accepts a workflow YAML body + a starting context; returns the per-node results + the chain of ObservedOps generated. `GET /v1/observer/workflow/list` lists workflows in a known directory for operator discoverability. **Why integrate into observerd, not a new service.** The observer is the system resource that watches and records. Workflows ARE observation patterns — multi-step processes whose every step is recorded. Putting the runner inside observerd keeps the "measurement → KB feed" wiring tight; a separate service would re-implement the recording layer. **Acceptance gates:** - G3.8.A — Load a workflow YAML matching the Archon `lakehouse-architect-review.yaml` shape; runner executes the 3-node DAG topologically. - G3.8.B — Each node execution lands an ObservedOp with `source: "workflow"` and the node's input/output. Stats endpoint shows the workflow ops. - G3.8.C — A node referencing `$.output` in its prompt resolves correctly; missing reference is a clear error not a silent empty string. - G3.8.D — Mode catalog dispatches `matrix.search` invocation to the matrixd backend without going through HTTP (in-process function call when matrixd is co-resident). **Status:** PORT TARGET, not yet started. SPEC commits the design; implementation is its own wave (estimated **L** effort given the DAG runner + mode dispatch + provenance recording). 
### §3.9 — chatd (multi-provider LLM dispatcher) — SHIPPED 2026-04-30

**Status:** done at commit `05273ac` (Phase 4 wave) + scrum-hardened at `0efc736`. Composite port: 1,624 LoC, 19+ tests, 6/6 chatd_smoke.

**What:** `cmd/chatd` on `:3220` routes `POST /chat` to a provider selected by model-name prefix or `:cloud` suffix:

```
ollama/        → local Ollama at :11434 (no auth)
ollama_cloud/  → ollama.com /api/generate (Bearer)
:cloud         → ollama_cloud (suffix variant)
openrouter/    → openrouter.ai (OpenAI-compat, Bearer)
opencode/      → opencode.ai/zen/v1 (OpenAI-compat, Bearer)
kimi/          → api.kimi.com/coding/v1 (OpenAI-compat, Bearer)
bare names     → ollama (default)
```

**Provider key resolution:** env var first (`OPENROUTER_API_KEY`, `OPENCODE_API_KEY`, `KIMI_API_KEY`, `OLLAMA_CLOUD_KEY`); then `/etc/lakehouse/.env` fallback (mode 0600). Empty key → provider stays unregistered (404 at first call instead of 503).

**Companion to the model tier registry** in `lakehouse.toml [models]` (local_fast / local_judge / cloud_judge / frontier_review / etc.), which maps tier names to model IDs. Callers reference `cfg.Models.LocalJudge` instead of literal strings. Bumping a tier is a one-line config edit.

**Quirks captured by today's scrum:**

- `Request.Temperature` is `*float64` (pointer) — Anthropic 4.7 (via OpenCode) rejects the field entirely with "temperature is deprecated." The pointer lets us omit the field when the caller didn't set it explicitly.
- Local Ollama defaults `think=false` — qwen3.5:latest is reasoning-capable, but the inner-loop hot path wants direct answers, not reasoning traces consuming the token budget.
**Replaces the Rust adapters:**

- `crates/gateway/src/v1/ollama_cloud.rs` → `internal/chat/ollama_cloud.go`
- `crates/gateway/src/v1/openrouter.rs` → `internal/chat/openai_compat.go` (shared with opencode + kimi)
- `crates/gateway/src/v1/opencode.rs` → same shared helper
- `crates/aibridge/*` → `internal/chat/` (cleaner abstraction)

**Reusable downstream:** `scripts/scrum_review.sh` runs a 3-lineage cross-review (Opus + Kimi + Qwen3-coder) by POST-ing each diff to chatd's `/v1/chat`. Same vehicle the harness's own scrum-hardening used to find its own bugs.

### §3.10 — local-review-harness (sibling tool, separate repo)

**Where:** `git.agentview.dev/profit/local-review-harness` (also SMB-mounted at `/home/profit/share/local-review-harness-full-md/`).

**What:** a local-first code review harness — walks a target repo, runs evidence-bearing static checks (12 analyzers covering hardcoded paths, raw SQL, wildcard CORS, secrets, exec/spawn, large files, TODO/FIXME, missing tests, .env files committed, exposed mutation endpoints, hardcoded private IPs), produces Scrum-style markdown reports + JSON receipts. **No cloud deps.** Single static Go binary.

**Status:** Phase A (skeleton) + Phase B (MVP — static-only path) shipped at first commit `f3ee472`. 5 acceptance gates green plus self-review (the harness reviews its own repo). **Phases C–E pending:** local-Ollama LLM review, validation cross-check, append-only `.memory/`, diff/rules subcommands.

**Why a sibling tool, not a Lakehouse module:** PROMPT.md "Strategic Goal" — the harness eventually plugs into OpenClaw / MCP tools / Lakehouse memory / playbook sealing / observer review loop. But first it has to be reliable, inspectable, and evidence-driven on its own. Lakehouse-Go integration is post-Phase-E (probably E+1: write the harness's findings into Lakehouse's playbook substrate via `/v1/matrix/playbooks/record`).
**Cross-pollination opportunities (post-MVP):**

- Replace `internal/llm/ollama.go` in the harness with a thin client pointed at chatd's `/v1/chat` — frontier judges become a config toggle.
- Feed harness findings into Lakehouse-Go's pathway memory as a drift signal (which static checks fired this run vs last).
- Use the harness's `.memory/known-risks.json` as a corpus the matrix indexer can retrieve from when the same risk pattern appears.

### §3.3 — UI (HTMX)

**Approach:** server-rendered Go templates using `html/template`, HTMX for partial-page swaps, Alpine.js for client-side interactivity where needed. Single binary serves API + UI.

**Acceptance gates:**

- G3.3.A — `Ask` tab: type a natural-language question, get an answer from the RAG endpoint, rendered in-page without a full reload
- G3.3.B — `Explore` tab: paginated dataset list with hot-swap badge rendering
- G3.3.C — `SQL` tab: textarea → submit → tabular result rendered in-page
- G3.3.D — `System` tab: live tail of `/storage/errors` and `/hnsw/trials` via HTMX polling

**Fallback if HTMX feels limiting:** split repo `golangLAKEHOUSE-ui` with Vite + React, served as static files by the Go gateway. Costs an extra repo + build chain.

### §3.11 — Pathway memory port

**Constraint:** the Rust `pathway_memory` and TS implementations were byte-matching by ADR-021. The byte contract was verified by running both implementations on the same input tokens and asserting matching bucket indices.

**Go port plan:**

- Port the 32-bucket SHA256-keyed token hash exactly. Verify on a golden input that Go produces the same bucket vector as Rust.
- Port the JSON state file format verbatim — the existing 88 traces in `data/_pathway_memory/state.json` reload as-is into the Go implementation.
- Port the matrix-correctness layer (ADR-021's `SemanticFlag`, `BugFingerprint`, `TypeHint`) — these are pure value types, trivially portable.
**Acceptance gates:**

- G3.11.A — Load existing `state.json`, run `replay` on the same 11 prior successful pathways, all 11 succeed (matching the Rust 11/11 baseline).
- G3.11.B — Bucket vector for a fixed test input byte-matches the Rust output.

---

## §4. Phase plan

### Phase G0 — Skeleton (Week 1–3)

**Scope:** smallest end-to-end ingest + query path working in Go.

| Component | Deliverable |
|---|---|
| `cmd/gateway` | HTTP on :3100, `/health`, `/v1/chat` proxy stub |
| `cmd/catalogd` | In-memory registry + Parquet manifest persistence |
| `cmd/storaged` | Single-bucket S3 / local FS, no error journal yet |
| `cmd/ingestd` | CSV → Parquet, schema inference, register-on-ingest |
| `cmd/queryd` | DuckDB-backed `POST /sql` endpoint |

**Acceptance:** upload a CSV via `POST /ingest`, query it via `POST /sql` with a SELECT, get rows back. Single-bucket. No vector, no profile, no UI.

### Phase G1 — Vector + RAG (Week 4–6)

| Component | Deliverable |
|---|---|
| `cmd/vectord` | Embed-on-ingest (calls Python sidecar), HNSW build, `POST /search` |
| `cmd/gateway` | Add `POST /rag` (embed → search → retrieve → generate via aibridge) |
| `cmd/aibridge` | HTTP client to existing Python sidecar |

**Acceptance:** ingest 15K resumes (the original Phase 7 fixture), ask "find me a forklift operator with OSHA-10 in IL", get ranked results with an LLM-generated explanation grounded in the retrieved chunks.

### Phase G2 — Federation + profiles (Week 7–8)

| Component | Deliverable |
|---|---|
| `cmd/storaged` | Multi-bucket registry, rescue bucket, error journal at `primary://_errors/` |
| Profile system | Per-reader profile bound to bucket + vector index |
| Hot-swap | Atomic pointer swap for index generations |

**Acceptance:** two profiles bound to two buckets, queries scoped correctly, hot-swap a vector index without query interruption, rollback works.
### Phase G3 — Pathway memory + distillation (Week 9–11) | Component | Deliverable | |---|---| | `cmd/vectord` | Pathway memory module ported, 88 traces reloaded | | Distillation pipeline | SFT export, contamination firewall, scorer | | Audit baselines | `audit_baselines.jsonl` longitudinal signal port | **Acceptance:** replay 11 prior successful pathways, all 11 succeed. Re-run distillation acceptance on the frozen fixture set, 22/22 pass. ### Phase G4 — TS surfaces → Go (Week 12–14) | Component | Deliverable | |---|---| | `cmd/mcp` | MCP server (replaces Bun) — `/v1/chat`, intelligence endpoints | | `cmd/observer` | Autonomous iteration loop, op recording | | `cmd/auditor` | PR audit pipeline (kimi/haiku/opus rotation) | | `cmd/scrum` | Scrum master pipeline (replaces TS) | **Acceptance:** open a test PR, auditor cycles within 90s, emits verdict to `data/_auditor/kimi_verdicts/`, behavior matches Rust+TS era within tolerance. ### Phase G5 — UI + demo parity (Week 15–16) | Component | Deliverable | |---|---| | `cmd/gateway` | Serves HTMX templates + static demo HTML | | Demo at `devop.live/lakehouse/` | Parity with current Bun demo | | Staffer console at `/console` | Parity | **Acceptance:** `devop.live/lakehouse/` cuts over from Bun to Go gateway. Section ① / ② / ③ all render. Compact contract cards still expand with Project Index. Fill-probability bars still paint. --- ## §5. 
Repo layout ``` golangLAKEHOUSE/ ├── docs/ │ ├── PRD.md ← this PRD │ ├── SPEC.md ← this spec │ ├── DECISIONS.md ← Go-era ADRs (start fresh, reference Rust ADRs by number) │ └── ADR-XXX-*.md ← per-ADR detail ├── cmd/ │ ├── gateway/ ← main HTTP/gRPC ingress │ ├── catalogd/ │ ├── storaged/ │ ├── queryd/ │ ├── ingestd/ │ ├── vectord/ │ ├── journald/ │ ├── mcp/ │ ├── observer/ │ ├── auditor/ │ └── scrum/ ├── internal/ ← shared packages, not exported │ ├── aibridge/ │ ├── validator/ │ ├── truth/ │ ├── shared/ │ ├── proto/ ← generated protobuf │ └── pathway/ ├── pkg/ ← public Go packages (none initially) ├── web/ ← UI (HTMX templates + static) │ ├── templates/ │ └── static/ ├── scripts/ ← cold-start, smoke, distill scripts ├── tests/ ← golden files, integration tests ├── go.mod ├── go.sum └── README.md ``` **Single Go module.** All commands and internal packages live under `golangLAKEHOUSE/`. No nested modules unless a package needs an independent release cadence (none expected). **Build:** `go build ./cmd/...` produces all binaries. --- ## §6. Migration data plan ### What ports verbatim - Parquet datasets at `data/datasets/*.parquet` — read by Go directly. - Catalog manifests — Parquet, ports as data not code. - Pathway memory state — JSON, ports if §3.4 byte-matching gate passes. ### What rebuilds - HNSW indexes — rebuild from Parquet embeddings on first Go startup. - Auditor verdicts on PRs — old PRs won't be re-audited; lineage starts fresh on the new repo's PRs. ### What's archived - The Rust `crates/` tree — preserved in the original repo at the cutover commit, tagged `pre-go-rewrite-2026-04-28` for reference. - TS surfaces (`mcp-server/`, `auditor/`, etc.) — preserved in the original repo at the same tag. - Distillation v1.0.0 substrate (`tag distillation-v1.0.0`, `e7636f2`) — kept as the historical reference; Go re-implementation ports the LOGIC but not the bit-identical-reproducibility property unless an ADR re-establishes it. 
### What's discarded

- `crates/vectord-lance/` (Lance backend, see PRD §Hard problems §2)
- `crates/lance-bench/` (criterion benchmarks specific to Lance)

---

## §7. Acceptance: when is the rewrite done?

The Go Lakehouse reaches **feature parity** when:

1. **All 12 Rust PRD invariants hold** (object-storage source of truth, catalog metadata authority, idempotent ingest, hot-swap atomicity, profiles, etc.).
2. **The 16 distillation acceptance gates pass** (re-run `./scripts/distill audit-full` against the Go pipeline).
3. **The 22/22 acceptance fixtures from `tests/fixtures/distillation/acceptance/` pass** under the Go implementation.
4. **The 145 unit tests of distillation v1.0.0 are ported and pass.**
5. **The `devop.live/lakehouse/` demo cuts over to the Go gateway** with no visible UI regressions.
6. **Auditor emits Kimi/Haiku/Opus verdicts** on a test PR, matching the cross-lineage rotation behavior.
7. **The 88 pathway traces replay** with 11/11 prior successes reproduced.

At that point the Rust repo enters maintenance-only mode (security fixes), and the Go repo becomes the live system.

---

## §8. Ratified — Phase G0 unblocked (2026-04-28, J)

| # | Decision | Spec impact |
|---|---|---|
| 1 | DuckDB via cgo (`marcboeker/go-duckdb`; since migrated to `duckdb/duckdb-go/v2`, see §3.1) | §3.1 option A — proceed |
| 2 | HTMX + `html/template` + Alpine.js | §3.3 option A — proceed |
| 3 | `git.agentview.dev/profit/golangLAKEHOUSE` | repo location locked |
| 4 | Distillation rebuilt in Go (no bit-identical port) | §6 — port logic, not fixtures |
| 5 | Pathway memory starts empty; old traces noted | the pathway-memory replay gate becomes "build initial state from scratch in Phase G3"; the byte-match gate is preserved as the porting correctness gate when the algorithm is reimplemented |
| 6 | Auditor longitudinal signal restarts | new `audit_baselines.jsonl` lineage starts on first Go-era PR |

See `docs/DECISIONS.md` ADR-001 for full rationale and `docs/RUST_PATHWAY_MEMORY_NOTE.md` for where the legacy 88 traces live.
**Phase G0 is now unblocked.** Next step: bootstrap the Go module skeleton + push to Gitea, then begin §4 Phase G0 implementation.