docs: seed PRD + SPEC for the Go-direction rewrite
Two documents only; no Go code yet.

PRD restates the problem and preserves the Rust PRD's invariants verbatim, then maps the locked stack to Go libraries and surfaces four hard problems (DuckDB-via-cgo for the query engine, Lance dropped, Dioxus → HTMX, arrow-go maturity).

SPEC walks each Rust crate + TS surface and tags the port with library choice / effort estimate / risk + a 5-phase migration plan from skeleton (Phase G0) to demo parity (Phase G5).

Six open questions remain that gate Phase G0:
- DuckDB cgo OK?
- HTMX vs React for the UI?
- Repo location?
- Distillation v1.0.0 port verbatim or rebuild?
- Pathway memory data: port 88 traces or start clean?
- Auditor lineage: port audit_baselines.jsonl or restart?

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit
f07668064e
41
.gitignore
vendored
Normal file
@ -0,0 +1,41 @@
|
||||
# Go
|
||||
*.exe
|
||||
*.exe~
|
||||
*.dll
|
||||
*.so
|
||||
*.dylib
|
||||
*.test
|
||||
*.out
|
||||
go.work
|
||||
go.work.sum
|
||||
vendor/
|
||||
|
||||
# Build artifacts
|
||||
/bin/
|
||||
/dist/
|
||||
|
||||
# Editor / OS
|
||||
.DS_Store
|
||||
.idea/
|
||||
.vscode/
|
||||
*.swp
|
||||
*~
|
||||
|
||||
# Local data — these directories follow the Rust lakehouse pattern;
|
||||
# regenerated by services on demand. Do not commit runtime artifacts.
|
||||
/data/_auditor/
|
||||
/data/_kb/
|
||||
/data/_pathway_memory/
|
||||
/data/_errors/
|
||||
/data/_imagecache/
|
||||
/data/datasets/
|
||||
/data/vectors/
|
||||
/data/headshots/
|
||||
/data/lance/
|
||||
/exports/
|
||||
/logs/
|
||||
/reports/
|
||||
|
||||
# Secrets — never commit. Resolved via SecretsProvider per ADR-001 §1.x.
|
||||
*.env
|
||||
secrets.toml
|
||||
49
README.md
Normal file
@ -0,0 +1,49 @@
|
||||
# golangLAKEHOUSE
|
||||
|
||||
Go reimplementation of the Lakehouse — a versioned knowledge substrate
|
||||
for staffing analytics + local AI workloads.
|
||||
|
||||
## Status
|
||||
|
||||
**Pre-Phase G0.** Documents seeded; Go module declared; implementation
|
||||
has not started. See `docs/PRD.md` for direction and `docs/SPEC.md`
|
||||
for the component-by-component port plan.
|
||||
|
||||
### Phase G0 prerequisites (must be done before any code lands)
|
||||
|
||||
1. **Install Go 1.23+ on the dev box.** Not currently present at
|
||||
`/usr/local/go` or elsewhere on the build machine. Standard install:
|
||||
```
|
||||
curl -L https://go.dev/dl/go1.23.0.linux-amd64.tar.gz | sudo tar -C /usr/local -xz
|
||||
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
|
||||
```
|
||||
2. **Ensure cgo toolchain is present** (gcc + libc-dev) — required by
|
||||
the DuckDB binding per ADR-001 §1.1. `apt install build-essential`
|
||||
on Debian-based systems.
|
||||
3. **Initialize the dependency tree** with `go mod tidy` once
|
||||
`cmd/gateway/main.go` declares its first imports.
|
||||
|
||||
## Layout
|
||||
|
||||
```
|
||||
docs/ Direction + spec + ADRs
|
||||
cmd/ (forthcoming) main packages — one per service
|
||||
internal/ (forthcoming) shared packages
|
||||
web/ (forthcoming) HTMX templates + static
|
||||
scripts/ (forthcoming) cold-start, smoke, distill
|
||||
tests/ (forthcoming) golden files, integration tests
|
||||
```
|
||||
|
||||
## Reading order
|
||||
|
||||
1. `docs/PRD.md` — what we're building and why
|
||||
2. `docs/SPEC.md` — how, per-component
|
||||
3. `docs/DECISIONS.md` — ADRs, starting with ADR-001 (foundational)
|
||||
4. `docs/RUST_PATHWAY_MEMORY_NOTE.md` — historical reference for the
|
||||
Rust era's pathway memory state (not migrated)
|
||||
|
||||
## Predecessor
|
||||
|
||||
The Rust Lakehouse this rewrite supersedes lives at
|
||||
`git.agentview.dev/profit/lakehouse`. It remains the live system until
|
||||
this Go implementation reaches feature parity (per `docs/SPEC.md` §7).
|
||||
126
docs/DECISIONS.md
Normal file
@ -0,0 +1,126 @@
|
||||
# Architecture Decision Records — Lakehouse-Go
|
||||
|
||||
ADRs from the Go era. Numbered fresh from 001 to start clean lineage.
|
||||
Where a Rust ADR (numbered 001–021 in the Rust repo's `DECISIONS.md`)
|
||||
remains in force, this file references it explicitly. Where a Rust
|
||||
ADR is superseded, the new ADR records why.
|
||||
|
||||
---
|
||||
|
||||
## ADR-001: Foundational decisions for the Go rewrite
|
||||
**Date:** 2026-04-28
|
||||
**Decided by:** J
|
||||
**Status:** Ratified — Phase G0 unblocked
|
||||
|
||||
The six questions that gated Phase G0 (per PRD.md / SPEC.md §8) are
|
||||
all answered.
|
||||
|
||||
### Decision 1.1 — DuckDB via cgo for the query engine
|
||||
|
||||
**Decision:** `queryd` uses `marcboeker/go-duckdb` (cgo bindings to
|
||||
DuckDB). Pure-Go alternative was rejected.
|
||||
|
||||
**Rationale:** DuckDB reads Parquet natively, supports the SQL surface
|
||||
DataFusion exposed in the Rust era (CTEs, window functions, hybrid
|
||||
joins), and runs in-process with cgo. The alternatives were:
|
||||
- Hand-rolling a query planner over arrow-go RecordBatches —
|
||||
multi-engineer-month research project; high risk of correctness
|
||||
bugs.
|
||||
- Running DuckDB as an external process — adds an operational surface
|
||||
and a network hop to every query.
|
||||
|
||||
Cgo build complexity is the accepted cost. Single-binary deploy
|
||||
preserved (the cgo dependency embeds at link time).
|
||||
|
||||
**Does not supersede Rust ADR-001** (object storage as source of truth).
That ADR remains in force; the change is the *engine* over the
storage, not the storage model.
|
||||
|
||||
### Decision 1.2 — HTMX for the UI
|
||||
|
||||
**Decision:** Frontend is `html/template` + HTMX + Alpine.js,
|
||||
server-rendered by `cmd/gateway`. React/Vite in a separate repo is the
|
||||
fallback if UX requirements demand SPA-tier interactivity post-G5.
|
||||
|
||||
**Rationale:** The existing Lakehouse UIs (`/lakehouse/` demo + staffer
|
||||
console) are mostly server-rendered HTML with vanilla JS that already
|
||||
fits the HTMX style. Single-binary deploy is preserved (gateway serves
|
||||
templates + static assets). No build chain beyond `go build`.
|
||||
|
||||
The React fallback is named explicitly so it's not relitigated unless
|
||||
an actual UX requirement triggers it.
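
The server-rendered approach above can be sketched with the standard library alone. This is a minimal illustration, not the gateway's actual handler: the `/datasets` route, the template names, and the sample data are all assumptions. The only HTMX-specific detail relied on is the `HX-Request` header, which HTMX sets on the requests it issues.

```go
package main

import (
	"fmt"
	"html/template"
	"net/http"
	"net/http/httptest"
)

// Hypothetical templates: "rows" is the fragment HTMX swaps in,
// "page" wraps the same fragment in a full document.
var tmpl = template.Must(template.New("ui").Parse(`
{{define "rows"}}{{range .}}<li>{{.}}</li>{{end}}{{end}}
{{define "page"}}<html><body><ul id="datasets" hx-get="/datasets" hx-trigger="every 5s">{{template "rows" .}}</ul></body></html>{{end}}
`))

// datasetsHandler renders only the fragment when the request carries the
// HX-Request header, and the full page otherwise.
func datasetsHandler(w http.ResponseWriter, r *http.Request) {
	names := []string{"candidates", "placements"} // illustrative data only
	name := "page"
	if r.Header.Get("HX-Request") == "true" {
		name = "rows"
	}
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	if err := tmpl.ExecuteTemplate(w, name, names); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

// render calls the handler in-process and returns the response body.
func render(htmx bool) string {
	req := httptest.NewRequest(http.MethodGet, "/datasets", nil)
	if htmx {
		req.Header.Set("HX-Request", "true")
	}
	rec := httptest.NewRecorder()
	datasetsHandler(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(render(true))  // fragment only
	fmt.Println(render(false)) // full page wrapping the same fragment
}
```

Serving the fragment and the full page from one template set is what keeps the "no build chain beyond `go build`" property: there is no client-side bundle to compile.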
|
||||
|
||||
### Decision 1.3 — Gitea hosts the new repo
|
||||
|
||||
**Decision:** Repo lives at `git.agentview.dev/profit/golangLAKEHOUSE`
|
||||
(same Gitea server that hosts the Rust lakehouse).
|
||||
|
||||
**Rationale:** Single source of truth for repo hosting; existing
|
||||
auditor tooling (`lakehouse-auditor` systemd service) already speaks
|
||||
Gitea API; existing credentials work; no new ops surface.
|
||||
|
||||
### Decision 1.4 — Distillation rebuilt in Go, not ported verbatim
|
||||
|
||||
**Decision:** The distillation v1.0.0 substrate (tag
`distillation-v1.0.0` at `e7636f2` in the Rust repo) is **not**
ported bit-for-bit. The Go reimplementation:
|
||||
- Ports the LOGIC: SFT export pipeline, contamination firewall (the
|
||||
`quality_score` enum + `SFT_NEVER` constant), category mapping
|
||||
rules, audit-baselines append-only pattern.
|
||||
- Does NOT port the FIXTURES: `tests/fixtures/distillation/acceptance/`
|
||||
is rebuilt from scratch in Go with new ground-truth golden files.
|
||||
- Does NOT port the bit-identical reproducibility PROPERTY: that was
|
||||
measured against the Rust implementation. The Go implementation
|
||||
establishes its own reproducibility baseline.
|
||||
|
||||
**Rationale:** Bit-identical reproducibility was a measured property
|
||||
of a specific implementation, not a portable invariant. Re-establishing
|
||||
it in Go means new fixtures, new gates, new audit-baselines. This is
|
||||
honest about what's transferring (logic) versus what's a Rust-era
|
||||
artifact (the specific bit-identical hashes).
|
||||
|
||||
**Risk:** the contamination firewall is the most consequential
|
||||
distillation safety net. The port must be reviewed line-by-line, and
|
||||
the new Go fixtures must include adversarial cases that prove the
|
||||
firewall works in the new implementation. See SPEC §7 acceptance gates.
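
One way to make the firewall easy to review line-by-line is to write it as an allow-list, so that anything not explicitly accepted is dropped. The sketch below is an assumption about shape only: `SFTNever` mirrors the `SFT_NEVER` constant named above, while the other enum variants, the `Record` type, and `ExportSFT` are hypothetical names invented for illustration.

```go
package main

import "fmt"

// QualityScore is a hypothetical Go rendering of the quality_score enum.
// Only SFTNever is named in the source; the other variants are placeholders.
type QualityScore int

const (
	QualityUnknown  QualityScore = iota // zero value: unscored records
	QualityAccepted                     // placeholder "safe to export" variant
	SFTNever                            // mirrors the Rust-era SFT_NEVER constant
)

// Record is an illustrative training example with its score attached.
type Record struct {
	ID      string
	Quality QualityScore
}

// ExportSFT returns only records that are explicitly accepted. Because it is
// an allow-list, SFTNever, unscored, and out-of-range values are all dropped.
func ExportSFT(in []Record) []Record {
	out := make([]Record, 0, len(in))
	for _, r := range in {
		if r.Quality == QualityAccepted {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	records := []Record{
		{ID: "ok", Quality: QualityAccepted},
		{ID: "blocked", Quality: SFTNever},
		{ID: "unscored"}, // zero value
		{ID: "corrupt", Quality: QualityScore(99)}, // adversarial: out of range
	}
	for _, r := range ExportSFT(records) {
		fmt.Println(r.ID)
	}
}
```

The unscored and out-of-range records are the kind of adversarial fixture the risk note asks for: a deny-list that only checked for `SFTNever` would let both through.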
|
||||
|
||||
### Decision 1.5 — Pathway memory starts clean; old traces preserved as reference
|
||||
|
||||
**Decision:** Go pathway memory begins with zero traces. The existing
|
||||
88 Rust traces at
|
||||
`/home/profit/lakehouse/data/_pathway_memory/state.json` are NOT loaded
|
||||
into the Go implementation. They are preserved as a historical record
|
||||
in the Rust repo and documented at `docs/RUST_PATHWAY_MEMORY_NOTE.md`.
|
||||
|
||||
**Rationale:** The Rust pathway memory's value compounded over months
|
||||
of scrum cycles. Loading those traces into a Go implementation that
|
||||
hasn't proven its byte-matching contract risks corrupting the new
|
||||
substrate's signal with semantically-mismatched data. Starting clean
|
||||
keeps the Go pathway memory's lineage clean and lets the byte-match
|
||||
correctness be proven on a known input (per SPEC §3.4 G3.4.B).
|
||||
|
||||
The historical note records the 88 traces' value (11/11 successful
|
||||
replays at the time of freeze) so the Go implementation has a
|
||||
reference baseline to outperform.
|
||||
|
||||
### Decision 1.6 — Auditor longitudinal signal restarts
|
||||
|
||||
**Decision:** The Rust auditor's `audit_baselines.jsonl`
|
||||
(longitudinal drift signal accumulated across PRs #6–#13) is **not**
|
||||
ported to Go. The Go auditor begins a fresh `audit_baselines.jsonl`
|
||||
lineage on its first PR.
|
||||
|
||||
**Rationale:** The drift signal is anchored to specific Rust commits,
|
||||
verdict shapes, and Kimi/Haiku/Opus rotation traces. Carrying it into
|
||||
the Go era would be like grafting Rust-PR audit history onto the first
|
||||
Go PR's prologue — confusing more than informative. Restarting gives
|
||||
the Go auditor a clean baseline to measure drift against.
|
||||
|
||||
The existing Rust `audit_baselines.jsonl` stays in the Rust repo as a
|
||||
historical record.
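
A fresh lineage is simple to start because the file is append-only JSON Lines. The following is a minimal sketch of that pattern, assuming a hypothetical `Baseline` record; the real schema of `audit_baselines.jsonl` is not given in this document.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Baseline is a hypothetical record shape used only for illustration.
type Baseline struct {
	PR      int    `json:"pr"`
	Verdict string `json:"verdict"`
}

// AppendBaseline adds one JSON line to the file. O_APPEND means earlier
// lines are never rewritten, which is the append-only property.
func AppendBaseline(path string, b Baseline) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	line, err := json.Marshal(b)
	if err != nil {
		f.Close()
		return err
	}
	if _, err := f.Write(append(line, '\n')); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// ReadBaselines decodes the file one line at a time, in append order.
func ReadBaselines(path string) ([]Baseline, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var out []Baseline
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var b Baseline
		if err := json.Unmarshal(sc.Bytes(), &b); err != nil {
			return nil, err
		}
		out = append(out, b)
	}
	return out, sc.Err()
}

// demo writes two entries to a fresh temp file and reads them back.
func demo() ([]Baseline, error) {
	dir, err := os.MkdirTemp("", "baselines")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "audit_baselines.jsonl")
	for _, b := range []Baseline{{PR: 1, Verdict: "pass"}, {PR: 2, Verdict: "warn"}} {
		if err := AppendBaseline(path, b); err != nil {
			return nil, err
		}
	}
	return ReadBaselines(path)
}

func main() {
	got, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(got), "baselines read back")
}
```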
|
||||
|
||||
---
|
||||
|
||||
(Future ADRs from ADR-002 onward will be added as the Go
|
||||
implementation accrues design decisions — e.g. HNSW parameter
|
||||
choices, pathway-memory hash function, auditor model rotation, etc.)
|
||||
297
docs/PRD.md
Normal file
@ -0,0 +1,297 @@
|
||||
# PRD: Lakehouse-Go — Versioned Knowledge Substrate, Reimplemented in Go
|
||||
|
||||
**Status:** DRAFT — seed document for the Go-direction rewrite. Supersedes
|
||||
`/home/profit/lakehouse/docs/PRD.md` (Rust) once ratified.
|
||||
**Created:** 2026-04-28
|
||||
**Owner:** J
|
||||
**Sibling:** `SPEC.md` — component-by-component port plan with effort
|
||||
estimates, library choices, and acceptance gates.
|
||||
|
||||
---
|
||||
|
||||
## Direction pivot — why this PRD exists
|
||||
|
||||
The Rust-first Lakehouse (15 crates, ~24 unmerged commits past PR #11,
|
||||
distillation v1.0.0 substrate frozen at `e7636f2`) is being reimplemented
|
||||
in Go on the principle that **anything Go can carry, Go carries**. This
|
||||
is an explicit re-platforming, not a refactor.
|
||||
|
||||
### What the rewrite preserves (verbatim from the Rust PRD)
|
||||
|
||||
- The **problem statement** — legacy data systems silo information; AI
|
||||
needs both fast analytical queries AND semantic retrieval over
|
||||
unstructured text in one substrate.
|
||||
- The **two use cases** — staffing analytics (reference implementation)
|
||||
and local AI knowledge substrate (per-profile vector indexes for
|
||||
running models).
|
||||
- The **shared requirements** — schema-less ingest, SQL at scale,
|
||||
AI-embedding search, hot-swappable indexes, trials-as-data,
|
||||
local-first / no-cloud, repo-rebuildable.
|
||||
- The **architectural invariants** — object storage as source of truth,
|
||||
catalog as sole metadata authority, hot-swap atomicity, profiles as
|
||||
first-class, playbooks-feed-the-index, errors findable in one HTTP
|
||||
call.
|
||||
|
||||
### What the rewrite changes
|
||||
|
||||
| Layer | Was (Rust) | Becomes (Go) | Confidence |
|
||||
|---|---|---|---|
|
||||
| HTTP gateway | Axum + Tokio | `net/http` + `chi` (or `gin`) | High — Go's bread and butter |
|
||||
| gRPC | tonic | `google.golang.org/grpc` | High — Go is the reference impl |
|
||||
| Object store | Apache Arrow `object_store` | `aws-sdk-go-v2/service/s3` + thin wrapper | High |
|
||||
| Parquet I/O | parquet-rs (arrow-rs) | `apache/arrow-go/v15/parquet` | Medium — arrow-go lags arrow-rs but covers our needs |
|
||||
| Query engine | DataFusion | **Hard problem** (see §Hard problems) | Low — no like-for-like Go equivalent |
|
||||
| Vector index (HNSW) | `hora` / hand-rolled | `coder/hnsw` or `Bithack/go-hnsw` (in-process) | High — HNSW is a self-contained algorithm |
|
||||
| Vector backend (Lance) | `lance` (Rust) | **Hard problem** — likely dropped, Parquet-only | Medium |
|
||||
| Frontend | Dioxus + WASM | Go `html/template` + HTMX + Alpine, or React/Vite split repo | Medium |
|
||||
| Concurrency | Tokio async | Goroutines + `context.Context` | High |
|
||||
| Config | TOML | TOML (`pelletier/go-toml/v2`) | High |
|
||||
| Secrets | `SecretsProvider` trait | Go interface, same shape | High |
|
||||
| AI bridge | HTTP client to Python sidecar | Same — Python sidecar stays | High |
|
||||
| Embedded ML | Python sidecar (kept) | Python sidecar (kept) | n/a |
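
As an illustration of the "Go interface, same shape" row, a secrets provider might look like the sketch below. The method name `Get`, the error value, and both implementations are assumptions; the Rust trait's actual signature is not shown in this document.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// ErrSecretNotFound is a hypothetical sentinel for missing keys.
var ErrSecretNotFound = errors.New("secret not found")

// SecretsProvider is an assumed shape: one lookup method, so callers never
// learn where the secret is stored.
type SecretsProvider interface {
	Get(key string) (string, error)
}

// EnvProvider resolves secrets from environment variables with a prefix.
type EnvProvider struct{ Prefix string }

func (p EnvProvider) Get(key string) (string, error) {
	if v, ok := os.LookupEnv(p.Prefix + key); ok {
		return v, nil
	}
	return "", ErrSecretNotFound
}

// StaticProvider is an in-memory implementation, useful in tests.
type StaticProvider map[string]string

func (p StaticProvider) Get(key string) (string, error) {
	if v, ok := p[key]; ok {
		return v, nil
	}
	return "", ErrSecretNotFound
}

func main() {
	var sp SecretsProvider = StaticProvider{"S3_SECRET_KEY": "example"}
	v, err := sp.Get("S3_SECRET_KEY")
	fmt.Println(v, err)
}
```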
|
||||
|
||||
### What stays Python (and why)
|
||||
|
||||
- **Embedding generation, image gen, deepface analysis** — Python's ML
|
||||
ecosystem is genuinely stronger than Go's. The sidecar stays as an
|
||||
HTTP service; the Go gateway calls it the same way the Rust gateway
|
||||
did. No port required.
|
||||
- **Distillation pipeline scoring** — current TS scripts; can move to Go
|
||||
but not first-tier priority. Keep TS until Go gateway is live.
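
A minimal sketch of the gateway-to-sidecar call, using only the standard library. The `/embed` path and the request and response fields are assumptions made for illustration; the sidecar's real API is not described here. The demo runs against an in-process fake server so the example is self-contained.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Hypothetical wire types for the sidecar's embedding endpoint.
type embedRequest struct {
	Text string `json:"text"`
}
type embedResponse struct {
	Vector []float32 `json:"vector"`
}

// Embed posts text to baseURL+"/embed" and decodes the returned vector.
// The context carries the deadline, so a slow sidecar cannot hang the caller.
func Embed(ctx context.Context, c *http.Client, baseURL, text string) ([]float32, error) {
	body, err := json.Marshal(embedRequest{Text: text})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/embed", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("sidecar returned %s", resp.Status)
	}
	var out embedResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Vector, nil
}

// demo stands up a fake sidecar in-process and calls Embed against it.
func demo() ([]float32, error) {
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_ = json.NewEncoder(w).Encode(embedResponse{Vector: []float32{0.1, 0.2, 0.3}})
	}))
	defer fake.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return Embed(ctx, fake.Client(), fake.URL, "hello")
}

func main() {
	vec, err := demo()
	fmt.Println(len(vec), err)
}
```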
|
||||
|
||||
---
|
||||
|
||||
## Solution — Go service mesh over S3-compatible object storage
|
||||
|
||||
A modular Go service mesh, same architectural shape as the Rust system,
|
||||
with the Python AI sidecar retained as the embedding/generation
|
||||
boundary. Single repo (`golangLAKEHOUSE`), single Go module, multiple
|
||||
binaries built from one workspace.
|
||||
|
||||
### Locked stack (Go)
|
||||
|
||||
| Layer | Choice | Rationale |
|
||||
|---|---|---|
|
||||
| HTTP | `chi` | Idiomatic, middleware-friendly, used by major Go services |
|
||||
| gRPC | `google.golang.org/grpc` | Reference implementation |
|
||||
| Protobuf | `protoc-gen-go` + `buf` | Standard tooling |
|
||||
| Object store | `aws-sdk-go-v2` | Mature, covers S3 + MinIO + RustFS |
|
||||
| Parquet | `apache/arrow-go/v15` | Columnar I/O + Arrow interop |
|
||||
| SQL engine | `marcboeker/go-duckdb` (cgo) | Ratified in ADR-001 §1.1; see §Hard problems §1 |
|
||||
| Vector index | `coder/hnsw` | Pure-Go HNSW, in-process, no external service |
|
||||
| TOML config | `pelletier/go-toml/v2` | Standard |
|
||||
| Logging | `log/slog` | Standard library since Go 1.21 |
|
||||
| Tracing | `go.opentelemetry.io/otel` | Standard |
|
||||
| Testing | `testing` + `testify` + `golden` files | Standard |
|
||||
| Frontend | `html/template` + HTMX + Alpine.js | Ratified in ADR-001 §1.2; see §Hard problems §3 |
|
||||
|
||||
No new dependencies without an ADR.
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
|
||||
Same service decomposition as Rust, same data flow. Names preserved so
|
||||
the spec, ADRs, and runbooks port semantically:
|
||||
|
||||
```
|
||||
┌─ ingestd ─→ storaged ─→ catalogd ─→ queryd ─┐
|
||||
│ │ │
|
||||
│ └→ vectord │
|
||||
│ │
|
||||
└──────── aibridge ──HTTP──→ Python sidecar ───┘
|
||||
│
|
||||
gateway ─ HTTP/gRPC ────┘
|
||||
│
|
||||
└→ ui (HTMX or Vite)
|
||||
```
|
||||
|
||||
| Service | Responsibility | Go binary |
|
||||
|---|---|---|
|
||||
| **gateway** | HTTP/gRPC ingress, routing, auth | `cmd/gateway` |
|
||||
| **catalogd** | Metadata control plane, dataset registry | `cmd/catalogd` |
|
||||
| **storaged** | Object I/O, multi-bucket, error journal | `cmd/storaged` |
|
||||
| **queryd** | SQL execution over Parquet (DuckDB via cgo) | `cmd/queryd` |
|
||||
| **ingestd** | CSV/JSON/PDF ingest → Parquet | `cmd/ingestd` |
|
||||
| **vectord** | Embeddings + HNSW index + trial system | `cmd/vectord` |
|
||||
| **journald** | Append-only mutation event log | `cmd/journald` |
|
||||
| **aibridge** | HTTP client to Python sidecar | library, linked into gateway |
|
||||
| **validator** | Production worker/permit validators | library, linked into gateway |
|
||||
| **mcp** | MCP server (replaces Bun `mcp-server`) | `cmd/mcp` |
|
||||
| **observer** | Autonomous iteration loop | `cmd/observer` |
|
||||
| **auditor** | PR audit pipeline (replaces TS auditor) | `cmd/auditor` |
|
||||
|
||||
### Invariants (preserved verbatim from Rust PRD)
|
||||
|
||||
1. Object storage = source of truth
|
||||
2. catalogd = sole metadata authority
|
||||
3. No raw data in catalog — only pointers
|
||||
4. vectord stores embeddings AS Parquet (portable)
|
||||
5. ingestd is idempotent
|
||||
6. Hot cache is performance, not source of truth
|
||||
7. All services modular and independently replaceable
|
||||
8. Indexes are hot-swappable (atomic pointer swap, rollback always possible)
|
||||
9. Every reader gets its own profile
|
||||
10. Trials are data, not logs
|
||||
11. Operational failures findable in one HTTP call
|
||||
12. Playbooks feed the index, not just the log
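
Invariant 8 (atomic pointer swap with rollback) maps naturally onto `sync/atomic`. The sketch below is one possible shape, not the vectord design: `Index` and `Slot` are hypothetical names, and it assumes a single writer performing swaps while any number of readers call `Live` without locking.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Index is a stand-in for a built HNSW index; only the version matters here.
type Index struct{ Version string }

// Slot holds the live index plus the one it replaced.
// Assumption: swaps and rollbacks come from a single writer goroutine.
type Slot struct {
	live atomic.Pointer[Index]
	prev atomic.Pointer[Index]
}

// Swap publishes next and remembers the index it displaced.
func (s *Slot) Swap(next *Index) {
	s.prev.Store(s.live.Swap(next))
}

// Rollback restores the previous index, if there is one. The displaced
// index becomes the new "previous", so a rollback can itself be undone.
func (s *Slot) Rollback() bool {
	old := s.prev.Load()
	if old == nil {
		return false
	}
	s.prev.Store(s.live.Swap(old))
	return true
}

// Live returns the index readers should use right now.
func (s *Slot) Live() *Index { return s.live.Load() }

func main() {
	var s Slot
	s.Swap(&Index{Version: "v1"})
	s.Swap(&Index{Version: "v2"})
	fmt.Println(s.Live().Version)
	s.Rollback()
	fmt.Println(s.Live().Version)
}
```

Readers see either the old index or the new one, never a partially built state, because the only shared mutation is the pointer store.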
|
||||
|
||||
---
|
||||
|
||||
## Hard problems (the ones that don't trivially port)
|
||||
|
||||
These four define whether the rewrite is feasible. Spec answers each
|
||||
with a concrete library/approach choice and a fallback.
|
||||
|
||||
### 1. Query engine — replacing DataFusion
|
||||
|
||||
**Constraint:** DataFusion is the most consequential Rust dependency in
|
||||
the Lakehouse. It powers `queryd`, hybrid SQL+vector search, and
|
||||
hot-cache merge-on-read. Go has no like-for-like equivalent.
|
||||
|
||||
**Options:**
|
||||
- **A. Embed DuckDB via cgo (`marcboeker/go-duckdb`)** — DuckDB reads
|
||||
Parquet natively, supports SQL similar to DataFusion, has cgo Go
|
||||
bindings. Loses pure-Go portability (cgo required) but preserves the
|
||||
query model.
|
||||
- **B. Run DuckDB as an external service** — one DuckDB process, Go
|
||||
talks to it via HTTP. Pure-Go gateway, separate-process query layer.
|
||||
Adds an operational surface (one more service to manage).
|
||||
- **C. Hand-roll a query planner over Arrow** — parse SQL with
|
||||
`xwb1989/sqlparser`, plan over arrow-go RecordBatches, execute. High
|
||||
effort, high risk. Best avoided.
|
||||
- **D. Postgres + foreign data wrappers** — point Postgres at Parquet
|
||||
via `parquet_fdw`. Mature but introduces a database we said we'd
|
||||
avoid (Rust ADR-001).
|
||||
|
||||
**Recommendation:** **Option A (DuckDB via cgo)**. Preserves the SQL +
|
||||
columnar + Parquet model, single-binary deploy with cgo, mature. Cgo
|
||||
adds build complexity but is acceptable.
|
||||
|
||||
### 2. Lance backend — vectord-lance
|
||||
|
||||
**Constraint:** Lance is a Rust-native columnar format with built-in
|
||||
vector indexing. There is no Go port and no FFI binding. ADR-019
|
||||
designates Lance as a per-profile *secondary* backend; Parquet+HNSW is
|
||||
*primary*.
|
||||
|
||||
**Options:**
|
||||
- **A. Drop Lance entirely.** Parquet+HNSW handles primary path; Lance
|
||||
was secondary. ADR-019 stays valid for the Rust era; Go Lakehouse is
|
||||
Parquet-only.
|
||||
- **B. Keep Lance via FFI/cgo.** Build Lance as a Rust dylib, call from
|
||||
Go via cgo. Reintroduces Rust into the build chain; defeats the
|
||||
point.
|
||||
- **C. Wait for Lance Go port.** Doesn't exist; not on Lance roadmap.
|
||||
|
||||
**Recommendation:** **Option A (drop Lance).** The hybrid backend was
|
||||
optional per-profile; Parquet+HNSW carries the primary path. If a
|
||||
specific workload later proves Lance-only, it can be exposed as a
|
||||
Python-sidecar service.
|
||||
|
||||
### 3. UI — replacing Dioxus
|
||||
|
||||
**Constraint:** Dioxus is a Rust+WASM frontend framework. No Go
|
||||
equivalent at the same level of polish. The current `crates/ui` covers
|
||||
Ask, Explore, SQL, System tabs.
|
||||
|
||||
**Options:**
|
||||
- **A. `html/template` + HTMX + Alpine.js** — server-rendered Go,
|
||||
partial-page swaps via HTMX. Single repo, minimal JS, fits Go's
|
||||
"boring is good" ethos.
|
||||
- **B. Separate Vite/React frontend** — `golangLAKEHOUSE-ui` repo,
|
||||
Go gateway serves static files. Modern UX patterns, more dev tooling
|
||||
needed.
|
||||
- **C. Keep Dioxus + WASM as a build step** — defeats the rewrite.
|
||||
|
||||
**Recommendation:** **Option A** for v1; revisit if UI requirements
|
||||
demand React-tier interactivity. The current Lakehouse UIs (`/lakehouse/`
|
||||
demo + staffer console) are mostly server-rendered HTML with vanilla
|
||||
JS — `html/template` + HTMX is a strong fit.
|
||||
|
||||
### 4. Arrow ecosystem maturity
|
||||
|
||||
**Constraint:** `arrow-go/v15` lags `arrow-rs` in compute kernels,
|
||||
expression APIs, and some compression codecs. Specific gaps known:
|
||||
limited `cast` kernel coverage, no equivalent of `arrow-rs`'s
|
||||
`compute::sort_to_indices` for all dtypes, no Acero-style streaming
|
||||
execution.
|
||||
|
||||
**Mitigation:** the Go Lakehouse relies on Arrow primarily for
|
||||
**Parquet I/O + RecordBatch transport**, not for in-process compute
|
||||
(that's DuckDB's job). The narrower scope makes arrow-go's gaps less
|
||||
load-bearing.
|
||||
|
||||
**Acceptance gate:** any Arrow API the Go Lakehouse uses must be
|
||||
covered by `arrow-go/v15`. Anything missing → file an upstream issue,
|
||||
implement locally if blocking, contribute back.
|
||||
|
||||
---
|
||||
|
||||
## Migration strategy
|
||||
|
||||
### What ports verbatim
|
||||
- Problem statement, use cases, requirements
|
||||
- Architectural invariants (1–12)
|
||||
- ADRs 001–021 (preserved as design intent; some change implementation)
|
||||
- Federation building blocks (multi-bucket, error-journal, append-log)
|
||||
|
||||
### What rebuilds from data
|
||||
- HNSW indexes (rebuild from Parquet embeddings — ADR-008 is verbatim
|
||||
preserved)
|
||||
- Pathway memory state (the JSON format is ported, but the 88 traces in
`data/_pathway_memory/state.json` on the Rust side are not reloaded; per
ADR-001 §1.5 the Go side starts with zero traces, and the byte-matching
contract becomes Go-Go instead of Rust-TS)
|
||||
- Catalog manifests (Parquet, portable)
|
||||
- Distillation v1.0.0 substrate (port the SFT/contamination-firewall
|
||||
logic; the fixture-as-gate pattern stays)
|
||||
|
||||
### What ships first (port-order outline — see SPEC.md for detail)
|
||||
|
||||
1. **Phase G0** — Skeleton: `cmd/gateway`, `cmd/catalogd`, `cmd/storaged`,
|
||||
`cmd/ingestd`. Single-bucket, no auth, CSV→Parquet, query via DuckDB.
|
||||
2. **Phase G1** — Vector path: `cmd/vectord` with HNSW + RAG endpoint.
|
||||
3. **Phase G2** — Multi-profile + federation (ADRs 016–017).
|
||||
4. **Phase G3** — Pathway memory + distillation port.
|
||||
5. **Phase G4** — MCP server, observer, auditor (TS surfaces → Go).
|
||||
6. **Phase G5** — UI (HTMX) and demo parity with `devop.live/lakehouse/`.
|
||||
|
||||
Detailed acceptance gates in `SPEC.md`.
|
||||
|
||||
### What does NOT migrate
|
||||
- The Rust crates themselves (archived in the original `lakehouse` repo)
|
||||
- The TS scrum/auditor pipelines (rewritten in Go in Phase G4)
|
||||
- The Bun mcp-server (rewritten in Go in Phase G4)
|
||||
- The Python sidecar (kept as-is, behind aibridge)
|
||||
|
||||
---
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **No port of `vectord-lance`.** Lance backend is dropped; Parquet+HNSW
|
||||
is the only vector backend.
|
||||
- **No retention of Rust in the build chain.** No cgo-to-Rust bridges,
|
||||
no FFI to keep specific crates alive. Cgo to **C/C++** (DuckDB) is
|
||||
acceptable.
|
||||
- **No new feature work during the port.** Feature parity with the Rust
|
||||
Lakehouse at the cutoff commit is the bar; new capabilities defer to
|
||||
post-port phases.
|
||||
- **No live-migration of running services.** The Rust Lakehouse stops
|
||||
serving when Go reaches feature parity; data moves once via Parquet
|
||||
re-pointer.
|
||||
|
||||
---
|
||||
|
||||
## Ratified decisions (2026-04-28, J)
|
||||
|
||||
The six gating questions are answered. Phase G0 is unblocked. Full
|
||||
context for each lives in `docs/DECISIONS.md` ADR-001.
|
||||
|
||||
| # | Decision |
|
||||
|---|---|
|
||||
| 1 | **DuckDB via cgo** — `marcboeker/go-duckdb` is the query engine. Cgo accepted as the cost of a mature SQL+Parquet path. |
|
||||
| 2 | **HTMX** — server-rendered `html/template` + HTMX + Alpine.js. Single-binary deploy. React is post-G5 if needed. |
|
||||
| 3 | **Gitea** — repo lives at `git.agentview.dev/profit/golangLAKEHOUSE` (same server as the Rust lakehouse). |
|
||||
| 4 | **Distillation rebuild in Go** — port the SFT export + contamination firewall logic, but bit-identical reproducibility is a Rust-era property. New Go fixtures, new acceptance gates. |
|
||||
| 5 | **Pathway memory starts clean** — Go pathway memory begins with zero traces. The existing 88 Rust traces are preserved at `data/_pathway_memory/state.json` in the lakehouse repo as a historical record (see `docs/RUST_PATHWAY_MEMORY_NOTE.md`). |
|
||||
| 6 | **Auditor longitudinal signal restarts** — `audit_baselines.jsonl` is a Rust-era artifact. Go auditor begins a fresh drift signal. |
|
||||
79
docs/RUST_PATHWAY_MEMORY_NOTE.md
Normal file
@ -0,0 +1,79 @@
|
||||
# Rust Pathway Memory — Historical Reference
|
||||
|
||||
**Status:** Reference-only. The Go Lakehouse does NOT load these
|
||||
traces (per ADR-001 §1.5). This note exists so that a future Go engineer
|
||||
knows what the Rust era accumulated, where it lives, and why it was
|
||||
left in place.
|
||||
|
||||
---
|
||||
|
||||
## What was there
|
||||
|
||||
By the time of the rewrite cutoff (commit `dcf4c9a`,
|
||||
2026-04-28), the Rust pathway memory held:
|
||||
|
||||
- **88 traces** at `/home/profit/lakehouse/data/_pathway_memory/state.json`
|
||||
- **11/11 successful replays** as of the most recent verification (the
|
||||
"probation gate crossed" signal from the lakehouse `STATE_OF_PLAY.md`)
|
||||
- Active scrum-cycle compounding: each scrum loop iteration appended
|
||||
new traces and re-ran replays against existing pathway fingerprints
|
||||
to preempt review prompts with "this file pattern has produced bug X
|
||||
before"
|
||||
|
||||
## Where it lives (Rust repo)
|
||||
|
||||
```
|
||||
lakehouse/
|
||||
├── crates/vectord/src/pathway_memory.rs ← implementation
|
||||
├── data/_pathway_memory/state.json ← 88 traces, JSON
|
||||
└── docs/DECISIONS.md ADR-021 ← matrix-correctness layer design
|
||||
```
|
||||
|
||||
The TS-side mirror lived in
|
||||
`tests/real-world/scrum_master_pipeline.ts` (functions
|
||||
`computePathwayId`, `buildPathwayVec`). Both implementations
|
||||
byte-matched on bucket vectors.
|
||||
|
||||
## Why this matters for the Go port
|
||||
|
||||
The pathway memory's *algorithm* is portable — 32-bucket SHA256-keyed
|
||||
token hash, JSON state file, replay logic. The pathway memory's
|
||||
*signal value* is not — those 88 traces represent months of scrum
|
||||
loops on Rust code, with bug fingerprints anchored to Rust file
|
||||
prefixes (`crates/queryd/`, `crates/vectord/`, etc.) that don't exist
|
||||
in the Go repo.
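
To make "32-bucket SHA256-keyed token hash" concrete, here is a hedged sketch of one way such a bucket vector could be computed. The tokenization, the choice of the first four digest bytes, and the big-endian modulo are all assumptions; the actual scheme lives in `pathway_memory.rs` and must be read from there before any byte-match claim is made.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const buckets = 32

// BucketVec counts tokens per bucket. Each token is hashed with SHA-256 and
// the first four digest bytes (read big-endian) select one of 32 buckets.
// This selection rule is an illustrative assumption, not the Rust algorithm.
func BucketVec(tokens []string) [buckets]uint32 {
	var vec [buckets]uint32
	for _, tok := range tokens {
		sum := sha256.Sum256([]byte(tok))
		idx := binary.BigEndian.Uint32(sum[:4]) % buckets
		vec[idx]++
	}
	return vec
}

func main() {
	vec := BucketVec([]string{"crates", "queryd", "src", "lib.rs"})
	fmt.Println(vec)
}
```

Pinning the byte order and the digest slice explicitly is what makes a byte-match gate meaningful: two implementations can only be compared if every step from token bytes to bucket index is fully specified.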
|
||||
|
||||
Per ADR-001 §1.5, the Go pathway memory:
|
||||
1. Reimplements the algorithm (SPEC §3.4 G3.4.B is the byte-match
|
||||
correctness gate).
|
||||
2. Starts with zero traces. The 88 Rust traces are NOT migrated.
|
||||
3. Builds its own signal over Go-era scrum cycles.
|
||||
|
||||
## What to do if the Go pathway memory underperforms
|
||||
|
||||
If, after Phase G3, the Go pathway memory noticeably underperforms
the Rust era's "11/11 successful replays" baseline:
|
||||
|
||||
1. **First** — verify the Go algorithm byte-matches the Rust one on
|
||||
the SPEC G3.4.B golden input. If yes, the algorithm is correct and
|
||||
the gap is data-volume, not implementation.
|
||||
2. **Second** — the Rust traces exist; if needed, re-prefix file paths
|
||||
from `crates/queryd/` style to `cmd/queryd/` style, run a
|
||||
compatibility check, and seed the Go pathway memory selectively. But
|
||||
only after the algorithm is proven byte-match correct.
|
||||
3. **Third** — accept that the first ~3 months of Go scrum cycles need
|
||||
to rebuild the signal naturally. This is the cost of the clean
|
||||
restart per ADR-001 §1.5.
|
||||
|
||||
## Historical baseline (frozen reference)
|
||||
|
||||
| Metric | Rust value at cutoff | Source |
|
||||
|---|---|---|
|
||||
| Total traces | 88 | `data/_pathway_memory/state.json` |
|
||||
| Successful replays | 11/11 | scrum loop log circa 2026-04-26 |
|
||||
| Distinct file prefixes | TBD — query the state file | n/a |
|
||||
| Distinct semantic_flag variants used | 9 (per ADR-021) | `pathway_memory.rs` |
|
||||
| Distinct bug_fingerprint hashes | TBD | `pathway_memory.rs` |
|
||||
|
||||
When the Go pathway memory reaches comparable numbers, it has caught
up to the Rust era and can be considered a full replacement.
|
||||
354
docs/SPEC.md
Normal file
@ -0,0 +1,354 @@
|
||||
# SPEC: Lakehouse-Go Component Port Plan
|
||||
|
||||
**Status:** DRAFT — companion to `PRD.md`. Component-by-component port
|
||||
plan with library choices, effort estimates, and acceptance gates.
|
||||
**Created:** 2026-04-28
|
||||
**Owner:** J
|
||||
|
||||
This spec answers: for each piece of the Rust Lakehouse, what Go
|
||||
library carries it, what the effort looks like, and what gate proves
|
||||
the port is real.
|
||||
|
||||
Effort scale (one engineer-week = ~40h focused work):
|
||||
- **S** — 1–3 days
|
||||
- **M** — 1 engineer-week
|
||||
- **L** — 2–3 engineer-weeks
|
||||
- **XL** — 1+ months
|
||||
- **HARD** — open research, see PRD §Hard problems
|
||||
|
||||
---
|
||||
|
||||
## §1. Component port table — Rust crates

| Crate | Rust deps that mattered | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| `gateway` | axum, tokio, tonic, tower | `cmd/gateway` | `chi` + stdlib `net/http` + `google.golang.org/grpc` | **L** | low — Go's strongest domain |
| `catalogd` | parquet-rs, arrow, sqlite | `cmd/catalogd` | `arrow-go/v15`, `mattn/go-sqlite3` | **L** | low |
| `storaged` | object_store, aws-sdk | `cmd/storaged` | `aws-sdk-go-v2`, `minio-go` for MinIO-specific paths | **M** | low |
| `queryd` | datafusion, arrow | `cmd/queryd` | `marcboeker/go-duckdb` (cgo) | **HARD** | high — see §3 |
| `ingestd` | csv, json, lopdf, postgres | `cmd/ingestd` | stdlib `encoding/csv`, `encoding/json`, `pdfcpu/pdfcpu`, `jackc/pgx/v5` | **L** | low |
| `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `arrow-go/v15` | **L** | medium — re-validate HNSW recall |
| `vectord-lance` | lance | **DROPPED** | n/a | n/a | n/a — Parquet+HNSW only |
| `journald` | parquet, arrow | `cmd/journald` | `arrow-go/v15` | **M** | low |
| `aibridge` | reqwest | library | `net/http` + connection pool | **S** | low |
| `validator` | parquet, custom | library | `arrow-go/v15` parquet reader | **M** | low — port the 24 unit tests as gates |
| `truth` | tomli, custom DSL | library | `pelletier/go-toml/v2` | **M** | low |
| `proto` | tonic-build | `proto/` + `protoc-gen-go` | `buf` + `protoc-gen-go-grpc` | **S** | low |
| `shared` | serde, anyhow | library | stdlib `encoding/json`, `errors` | **S** | low |
| `ui` | dioxus, wasm | **REPLACED** | `html/template` + HTMX | **L** | medium — see §3 |
| `lance-bench` | criterion | n/a — dropped with Lance | n/a | n/a | n/a |

**Total Rust crate port effort:** ~12–18 engineer-weeks (3–4 months for
one engineer; 6–8 weeks for two).

---

## §2. Component port table — TypeScript surfaces

| TS surface | Current location | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| `mcp-server/index.ts` | Bun, :3700 | `cmd/mcp` | `mark3labs/mcp-go` (Go MCP SDK) | **L** | medium — MCP semantics |
| `mcp-server/observer.ts` | Bun, :3800 | `cmd/observer` | stdlib `net/http`, `slog` | **M** | low |
| `mcp-server/tracing.ts` | Bun, Langfuse client | library | `go.opentelemetry.io/otel` + Langfuse Go client (or hand-roll) | **M** | low — Langfuse Go OSS support varies |
| `auditor/*.ts` | TS, runs as systemd | `cmd/auditor` | stdlib + `gitea API client` | **L** | medium — auditor cross-lineage logic is intricate |
| `tests/real-world/scrum_master_pipeline.ts` | TS, ad-hoc | `cmd/scrum` | stdlib | **L** | medium — chunking + embed + ladder logic |
| `tests/real-world/scrum_applier.ts` | TS, ad-hoc | `cmd/scrum-apply` | stdlib + git CLI shell-out | **M** | medium |
| `bot/propose.ts` | TS | `cmd/bot` | stdlib | **S** | low |
| Search demo HTML/JS | static | static (no port) | n/a | n/a | n/a — copied as-is |

**Total TS port effort:** ~6–10 engineer-weeks.

---

## §3. Hard problem details

### §3.1 — Query engine (DuckDB via cgo)

**Library:** `marcboeker/go-duckdb` — Go bindings via cgo.

**API shape** (replaces the DataFusion `SessionContext` pattern):
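```go
// Imports assumed: "database/sql" and _ "github.com/marcboeker/go-duckdb".
// Error handling is elided for brevity; check every returned error in real code.
db, _ := sql.Open("duckdb", "") // empty DSN opens an in-memory database
defer db.Close()
// Reading s3:// paths needs DuckDB's httpfs extension plus credentials.
db.Exec("CREATE VIEW workers AS SELECT * FROM read_parquet('s3://bucket/workers/*.parquet')")
rows, _ := db.Query("SELECT role, count(*) FROM workers WHERE state='IL' GROUP BY role")
defer rows.Close()
```
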
**Acceptance gates:**
- G3.1.A — `SELECT * FROM read_parquet('workers_500k.parquet') LIMIT 1`
  returns a row with the expected schema. Establishes that Parquet reads
  work.
- G3.1.B — Hybrid SQL+vector query (the `POST /vectors/hybrid`
  surface) returns the same workers as the Rust path on the same input,
  ranked the same way modulo embedding precision.
- G3.1.C — Hot-cache merge-on-read: register a base table + a delta
  Parquet, query, and observe the rows from both merged, with the delta
  winning on conflict.

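The delta-wins rule in gate G3.1.C can be sketched in plain Go, independent of DuckDB. This is only an illustration of the expected semantics: the `Row` type, its `ID` key, and the `MergeOnRead` function are hypothetical names, and the real merge would presumably be expressed as SQL over the registered Parquet files.

```go
package main

import (
	"fmt"
	"sort"
)

// Row is a hypothetical record shape; the real schema comes from the Parquet files.
type Row struct {
	ID    int64
	Role  string
	State string
}

// MergeOnRead overlays delta rows on base rows, keyed by ID.
// When both sides contain the same ID, the delta row wins.
func MergeOnRead(base, delta []Row) []Row {
	merged := make(map[int64]Row, len(base)+len(delta))
	for _, r := range base {
		merged[r.ID] = r
	}
	for _, r := range delta {
		merged[r.ID] = r // delta overwrites base on conflict
	}
	out := make([]Row, 0, len(merged))
	for _, r := range merged {
		out = append(out, r)
	}
	// Sort by ID so the result is deterministic and easy to assert on.
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	base := []Row{{1, "forklift", "IL"}, {2, "welder", "IL"}}
	delta := []Row{{2, "welder", "WI"}, {3, "picker", "IL"}}
	for _, r := range MergeOnRead(base, delta) {
		fmt.Printf("%d %s %s\n", r.ID, r.Role, r.State)
	}
}
```

A gate test could build the same base and delta as two small Parquet files, run the query through `queryd`, and compare the rows against this reference merge.
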
**Fallback if cgo is rejected:** run DuckDB as an external process
(shell out to `duckdb -json -c '...'`, or put it behind HTTP via a thin
Go wrapper). This adds operational surface but preserves the SQL model.

### §3.2 — HNSW index

**Library:** `coder/hnsw` — pure-Go HNSW, in-process. Supports add /
delete / search / persist.

**Open question:** does `coder/hnsw` match the recall@10 we measured
on the Rust `hora` path? This needs a calibration test:
- Rebuild `lakehouse_arch_v1` (the 1086-chunk arch corpus) in Go.
- Compare recall@10 on a fixed query set to the Rust baseline.
- Acceptance: ≤2% drop, or we switch library / parameters.

**Persistence format:** TBD — `coder/hnsw` has its own snapshot format.
The Go-era equivalent of ADR-008 (Parquet for embeddings + sidecar HNSW
file) must confirm which sidecar format we ship.

**Acceptance gates:**
- G3.2.A — Build HNSW from a Parquet of 100K vectors in <60s
- G3.2.B — Search 100K vectors at k=10 in <50ms p50
- G3.2.C — Recall@10 within 2% of Rust baseline on
  `lakehouse_arch_v1`

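The recall comparison behind gate G3.2.C can be sketched as follows. This is a minimal sketch under stated assumptions: neighbours are identified by integer IDs, the ground truth comes from an exact (brute-force) search, and the "2%" tolerance is read as an absolute drop of 0.02 in mean recall. The function names `RecallAtK`, `MeanRecall`, and `PassesGate` are hypothetical.

```go
package main

import "fmt"

// RecallAtK returns the fraction of the true top-k neighbours that also
// appear in the first k approximate results.
func RecallAtK(truth, approx []int, k int) float64 {
	if k > len(truth) {
		k = len(truth)
	}
	if k <= 0 {
		return 0
	}
	want := make(map[int]bool, k)
	for _, id := range truth[:k] {
		want[id] = true
	}
	if len(approx) > k {
		approx = approx[:k]
	}
	hits := 0
	for _, id := range approx {
		if want[id] {
			hits++
			want[id] = false // count each true neighbour at most once
		}
	}
	return float64(hits) / float64(k)
}

// MeanRecall averages RecallAtK over a fixed query set.
func MeanRecall(truth, approx [][]int, k int) float64 {
	if len(truth) == 0 || len(truth) != len(approx) {
		return 0
	}
	sum := 0.0
	for i := range truth {
		sum += RecallAtK(truth[i], approx[i], k)
	}
	return sum / float64(len(truth))
}

// PassesGate reports whether the candidate recall is within maxDrop of the
// baseline. ASSUMPTION: the tolerance is an absolute difference.
func PassesGate(baseline, candidate, maxDrop float64) bool {
	return baseline-candidate <= maxDrop
}

func main() {
	truth := [][]int{{1, 2, 3, 4}, {5, 6, 7, 8}}
	approx := [][]int{{1, 2, 9, 4}, {5, 6, 7, 8}}
	r := MeanRecall(truth, approx, 4)
	fmt.Printf("mean recall@4 = %.3f, passes = %v\n", r, PassesGate(0.95, r, 0.02))
}
```

If the tolerance is meant to be relative to the Rust baseline rather than absolute, only `PassesGate` needs to change.
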
### §3.3 — UI (HTMX)

**Approach:** server-rendered Go templates using `html/template`,
HTMX for partial-page swaps, Alpine.js for client-side interactivity
where needed. Single binary serves API + UI.

**Acceptance gates:**
- G3.3.A — `Ask` tab: type a natural-language question, get an answer
  from the RAG endpoint, and render it in-page without a full reload
- G3.3.B — `Explore` tab: paginated dataset list with hot-swap badge
  rendering
- G3.3.C — `SQL` tab: textarea → submit → tabular result rendered
  in-page
- G3.3.D — `System` tab: live tail of `/storage/errors` and
  `/hnsw/trials` via HTMX polling

**Fallback if HTMX feels limiting:** split out a separate repo,
`golangLAKEHOUSE-ui`, with Vite + React, served as static files by the
Go gateway. Costs an extra repo + build chain.

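The server-rendered approach in this section can be sketched with `html/template` alone. The example below is an assumption-laden illustration: the template names `rows` and `full`, the handler `datasetsHandler`, the path `/explore/datasets`, and the placeholder dataset names are all hypothetical. It relies on the fact that HTMX marks its requests with an `HX-Request: true` header, so one handler can serve both the full page and the swappable fragment.

```go
package main

import (
	"fmt"
	"html/template"
	"net/http"
	"net/http/httptest"
)

// tmpl defines a row fragment and a full page that embeds it, so the same
// markup serves both the first page load and later HTMX swaps.
var tmpl = template.Must(template.New("explore").Parse(`
{{define "rows"}}{{range .}}<tr><td>{{.}}</td></tr>{{end}}{{end}}
{{define "full"}}<html><body><table id="datasets">{{template "rows" .}}</table></body></html>{{end}}
`))

// datasetsHandler returns only the fragment when HTMX made the request
// and the full page otherwise.
func datasetsHandler(w http.ResponseWriter, r *http.Request) {
	names := []string{"workers", "resumes"} // placeholder data, not a real catalog call
	name := "full"
	if r.Header.Get("HX-Request") == "true" {
		name = "rows"
	}
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	if err := tmpl.ExecuteTemplate(w, name, names); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

// render calls the handler in-process and returns the response body.
func render(htmx bool) string {
	req := httptest.NewRequest(http.MethodGet, "/explore/datasets", nil)
	if htmx {
		req.Header.Set("HX-Request", "true")
	}
	rec := httptest.NewRecorder()
	datasetsHandler(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(render(true))  // fragment only
	fmt.Println(render(false)) // full page
}
```

Keeping the fragment as a named template inside the page template means the full render and the partial swap cannot drift apart.
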
### §3.4 — Pathway memory port

**Constraint:** ADR-021 requires the Rust `pathway_memory` and TS
implementations to be byte-matching. The byte contract was verified by
running both implementations on the same input tokens and asserting
matching bucket indices.

**Go port plan:**
- Port the 32-bucket SHA256-keyed token hash exactly. Verify on a
  golden input that Go produces the same bucket vector as Rust.
- Port the JSON state file format verbatim — the existing 88 traces in
  `data/_pathway_memory/state.json` reload as-is into the Go
  implementation.
- Port the matrix-correctness layer (ADR-021's `SemanticFlag`,
  `BugFingerprint`, `TypeHint`) — these are pure value types,
  trivially portable.

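The golden-input check in the first bullet could take roughly the following shape. This sketch is NOT the ADR-021 algorithm: the spec only says "32-bucket SHA256-keyed token hash", so the bucket rule used here (first 4 bytes of SHA-256 of the token, read big-endian, modulo 32) and the count-per-bucket vector are assumptions made purely for illustration. The authoritative rule must be taken from the Rust source, and `BucketVector` and `MatchesGolden` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const numBuckets = 32

// BucketVector counts tokens per bucket.
// ASSUMPTION: bucket index = first 4 bytes of SHA-256(token), big-endian,
// modulo 32. Replace with the rule from the Rust implementation.
func BucketVector(tokens []string) [numBuckets]uint32 {
	var v [numBuckets]uint32
	for _, tok := range tokens {
		sum := sha256.Sum256([]byte(tok))
		v[binary.BigEndian.Uint32(sum[:4])%numBuckets]++
	}
	return v
}

// MatchesGolden reports whether got equals a golden vector exported from
// the Rust implementation for the same tokens.
func MatchesGolden(got, golden [numBuckets]uint32) bool {
	return got == golden // Go arrays compare element by element
}

func main() {
	tokens := []string{"forklift", "operator", "forklift"}
	v := BucketVector(tokens)
	// In a real gate the golden vector would be loaded from a file under tests/.
	fmt.Println(v, MatchesGolden(v, BucketVector(tokens)))
}
```

Whatever the real bucket rule is, the gate stays the same: a fixed token list, a golden vector produced by Rust, and an exact equality check in Go.
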
**Acceptance gates:**
- G3.4.A — Load existing `state.json`, run `replay` on the same 11
  prior successful pathways, all 11 succeed (matching the Rust 11/11
  baseline).
- G3.4.B — Bucket vector for a fixed test input byte-matches the
  Rust output.

---

## §4. Phase plan

### Phase G0 — Skeleton (Weeks 1–3)

**Scope:** smallest end-to-end ingest + query path working in Go.

| Component | Deliverable |
|---|---|
| `cmd/gateway` | HTTP on :3100, `/health`, `/v1/chat` proxy stub |
| `cmd/catalogd` | In-memory registry + Parquet manifest persistence |
| `cmd/storaged` | Single-bucket S3 / local FS, no error journal yet |
| `cmd/ingestd` | CSV → Parquet, schema inference, register-on-ingest |
| `cmd/queryd` | DuckDB-backed `POST /sql` endpoint |

**Acceptance:** upload a CSV via `POST /ingest`, query it via
`POST /sql` with a SELECT, get rows back. Single-bucket. No vector,
no profile, no UI.

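The schema-inference step listed for `cmd/ingestd` can be sketched with the standard library. This is a minimal sketch under assumptions: the CSV has a header row, the only candidate types are `int64`, `float64`, and `string`, and the whole file fits in memory. `InferSchema` and `columnType` are hypothetical names, and mapping the result onto an Arrow/Parquet schema is left out.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// columnType picks the narrowest type that every value in the column parses as.
func columnType(values []string) string {
	if len(values) == 0 {
		return "string" // no data rows: nothing to infer from
	}
	isInt, isFloat := true, true
	for _, v := range values {
		if _, err := strconv.ParseInt(v, 10, 64); err != nil {
			isInt = false
		}
		if _, err := strconv.ParseFloat(v, 64); err != nil {
			isFloat = false
		}
	}
	switch {
	case isInt:
		return "int64"
	case isFloat:
		return "float64"
	default:
		return "string"
	}
}

// InferSchema reads a CSV with a header row and returns column names and
// one inferred type per column.
func InferSchema(data string) ([]string, []string, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, nil, err
	}
	if len(records) == 0 {
		return nil, nil, errors.New("empty CSV")
	}
	names := records[0]
	types := make([]string, len(names))
	for col := range names {
		values := make([]string, 0, len(records)-1)
		for _, rec := range records[1:] {
			values = append(values, rec[col]) // csv.Reader enforces equal field counts
		}
		types[col] = columnType(values)
	}
	return names, types, nil
}

func main() {
	names, types, err := InferSchema("id,rate,role\n1,18.5,forklift\n2,20,welder\n")
	if err != nil {
		panic(err)
	}
	for i := range names {
		fmt.Printf("%s: %s\n", names[i], types[i])
	}
}
```

A production version would likely sample rows instead of reading everything, and would need rules for empty cells, dates, and booleans.
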
### Phase G1 — Vector + RAG (Weeks 4–6)

| Component | Deliverable |
|---|---|
| `cmd/vectord` | Embed-on-ingest (calls Python sidecar), HNSW build, `POST /search` |
| `cmd/gateway` | Add `POST /rag` (embed → search → retrieve → generate via aibridge) |
| `internal/aibridge` | HTTP client to existing Python sidecar |

**Acceptance:** ingest 15K resumes (the original Phase 7 fixture),
ask "find me a forklift operator with OSHA-10 in IL", get ranked
results with an LLM-generated explanation grounded in the retrieved
chunks.

### Phase G2 — Federation + profiles (Weeks 7–8)

| Component | Deliverable |
|---|---|
| `cmd/storaged` | Multi-bucket registry, rescue bucket, error journal at `primary://_errors/` |
| Profile system | Per-reader profile bound to bucket + vector index |
| Hot-swap | Atomic pointer swap for index generations |

**Acceptance:** two profiles bound to two buckets, queries scoped
correctly, hot-swap a vector index without query interruption,
rollback works.

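The "atomic pointer swap for index generations" deliverable could look roughly like this, using `sync/atomic.Pointer` (Go 1.19+). This is a sketch, not the planned implementation: `Index`, `Swapper`, `Publish`, and `Rollback` are hypothetical names, a single writer is assumed (concurrent `Publish` and `Rollback` calls would need a mutex), and only one previous generation is retained.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Index is a hypothetical stand-in for one immutable HNSW index generation.
type Index struct {
	Generation int
}

// Swapper publishes the current index to readers and keeps the previous
// generation around for rollback. ASSUMPTION: a single writer goroutine.
type Swapper struct {
	current  atomic.Pointer[Index]
	previous atomic.Pointer[Index]
}

// Current returns the live index; readers never block and never see a
// half-built index, because generations are swapped in whole.
func (s *Swapper) Current() *Index {
	return s.current.Load()
}

// Publish makes next the live index and remembers the one it replaced.
func (s *Swapper) Publish(next *Index) {
	old := s.current.Swap(next)
	s.previous.Store(old)
}

// Rollback restores the previous generation, if there is one.
func (s *Swapper) Rollback() bool {
	prev := s.previous.Load()
	if prev == nil {
		return false
	}
	s.current.Store(prev)
	s.previous.Store(nil)
	return true
}

func main() {
	var s Swapper
	s.Publish(&Index{Generation: 1})
	s.Publish(&Index{Generation: 2})
	fmt.Println("live:", s.Current().Generation)
	fmt.Println("rolled back:", s.Rollback(), "live:", s.Current().Generation)
}
```

Queries that loaded generation 1 before a swap keep using it until they finish, which is what lets the swap happen without interrupting them.
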
### Phase G3 — Pathway memory + distillation (Weeks 9–11)

| Component | Deliverable |
|---|---|
| `cmd/vectord` | Pathway memory module ported, 88 traces reloaded |
| Distillation pipeline | SFT export, contamination firewall, scorer |
| Audit baselines | `audit_baselines.jsonl` longitudinal signal port |

**Acceptance:** replay 11 prior successful pathways, all 11 succeed.
Re-run distillation acceptance on the frozen fixture set, 22/22 pass.

### Phase G4 — TS surfaces → Go (Weeks 12–14)

| Component | Deliverable |
|---|---|
| `cmd/mcp` | MCP server (replaces Bun) — `/v1/chat`, intelligence endpoints |
| `cmd/observer` | Autonomous iteration loop, op recording |
| `cmd/auditor` | PR audit pipeline (kimi/haiku/opus rotation) |
| `cmd/scrum` | Scrum master pipeline (replaces TS) |

**Acceptance:** open a test PR, auditor cycles within 90s, emits
verdict to `data/_auditor/kimi_verdicts/`, behavior matches Rust+TS
era within tolerance.

### Phase G5 — UI + demo parity (Weeks 15–16)

| Component | Deliverable |
|---|---|
| `cmd/gateway` | Serves HTMX templates + static demo HTML |
| Demo at `devop.live/lakehouse/` | Parity with current Bun demo |
| Staffer console at `/console` | Parity |

**Acceptance:** `devop.live/lakehouse/` cuts over from Bun to the Go
gateway. Sections ①, ②, and ③ all render. Compact contract cards still
expand with Project Index. Fill-probability bars still paint.

---

## §5. Repo layout

```
golangLAKEHOUSE/
├── docs/
│   ├── PRD.md          ← this PRD
│   ├── SPEC.md         ← this spec
│   ├── DECISIONS.md    ← Go-era ADRs (start fresh, reference Rust ADRs by number)
│   └── ADR-XXX-*.md    ← per-ADR detail
├── cmd/
│   ├── gateway/        ← main HTTP/gRPC ingress
│   ├── catalogd/
│   ├── storaged/
│   ├── queryd/
│   ├── ingestd/
│   ├── vectord/
│   ├── journald/
│   ├── mcp/
│   ├── observer/
│   ├── auditor/
│   └── scrum/
├── internal/           ← shared packages, not exported
│   ├── aibridge/
│   ├── validator/
│   ├── truth/
│   ├── shared/
│   ├── proto/          ← generated protobuf
│   └── pathway/
├── pkg/                ← public Go packages (none initially)
├── web/                ← UI (HTMX templates + static)
│   ├── templates/
│   └── static/
├── scripts/            ← cold-start, smoke, distill scripts
├── tests/              ← golden files, integration tests
├── go.mod
├── go.sum
└── README.md
```

**Single Go module.** All commands and internal packages live under
`golangLAKEHOUSE/`. No nested modules unless a package needs an
independent release cadence (none expected).

**Build:** `go build ./cmd/...` produces all binaries.

---

## §6. Migration data plan

### What ports verbatim
- Parquet datasets at `data/datasets/*.parquet` — read by Go directly.
- Catalog manifests — Parquet, ports as data not code.
- Pathway memory state — JSON, ports if §3.4 byte-matching gate passes.

### What rebuilds
- HNSW indexes — rebuild from Parquet embeddings on first Go startup.
- Auditor verdicts on PRs — old PRs won't be re-audited; lineage starts
  fresh on the new repo's PRs.

### What's archived
- The Rust `crates/` tree — preserved in the original repo at the
  cutover commit, tagged `pre-go-rewrite-2026-04-28` for reference.
- TS surfaces (`mcp-server/`, `auditor/`, etc.) — preserved in the
  original repo at the same tag.
- Distillation v1.0.0 substrate (tag `distillation-v1.0.0`, commit
  `e7636f2`) — kept as the historical reference; the Go
  re-implementation ports the LOGIC but not the
  bit-identical-reproducibility property unless an ADR re-establishes it.

### What's discarded
- `crates/vectord-lance/` (Lance backend, see PRD §Hard problems §2)
- `crates/lance-bench/` (criterion benchmarks specific to Lance)

---

## §7. Acceptance: when is the rewrite done?

The Go Lakehouse reaches **feature parity** when:

1. **All 12 Rust PRD invariants hold** (object-storage source of truth,
   catalog metadata authority, idempotent ingest, hot-swap atomicity,
   profiles, etc.).
2. **The 16 distillation acceptance gates pass** (re-run
   `./scripts/distill audit-full` against the Go pipeline).
3. **The 22/22 acceptance fixtures from
   `tests/fixtures/distillation/acceptance/` pass** under the Go
   implementation.
4. **The 145 unit tests of distillation v1.0.0 are ported and pass.**
5. **`devop.live/lakehouse/` demo cuts over to the Go gateway** with no
   visible UI regressions.
6. **Auditor emits Kimi/Haiku/Opus verdicts** on a test PR, matching
   the cross-lineage rotation behavior.
7. **The 88 pathway traces replay** with 11/11 prior successes
   reproduced.

At that point the Rust repo enters maintenance-only mode (security
fixes), and the Go repo becomes the live system.

---

## §8. Ratified — Phase G0 unblocked (2026-04-28, J)

| # | Decision | Spec impact |
|---|---|---|
| 1 | DuckDB via cgo (`marcboeker/go-duckdb`) | §3.1 option A — proceed |
| 2 | HTMX + `html/template` + Alpine.js | §3.3 option A — proceed |
| 3 | `git.agentview.dev/profit/golangLAKEHOUSE` | repo location locked |
| 4 | Distillation rebuilt in Go (no bit-identical port) | §6 — port logic, not fixtures |
| 5 | Pathway memory starts empty; old traces noted | §3.4 G3.4.A is now "build initial state from scratch in Phase G3"; G3.4.B (byte-match) preserved as the porting correctness gate when the algorithm is reimplemented |
| 6 | Auditor longitudinal signal restarts | new `audit_baselines.jsonl` lineage starts on first Go-era PR |

See `docs/DECISIONS.md` ADR-001 for full rationale and
`docs/RUST_PATHWAY_MEMORY_NOTE.md` for where the legacy 88 traces live.

**Phase G0 is now unblocked.** Next step: bootstrap the Go module
skeleton + push to Gitea, then begin §4 Phase G0 implementation.