Claw 29468b1413 docs: 2026-04-28 upstream survey — three SPEC-changing pivots
Pre-Phase-G0 research sweep against current Go ecosystem state. Three
upstream changes that the day-of SPEC missed:

1. DuckDB Go binding ownership transferred. marcboeker/go-duckdb is
   deprecated as of v2.5.0 — official maintainer is now
   github.com/duckdb/duckdb-go/v2 (DuckDB team + Marc Boeker joint
   hand-off). Current v2.10502.0 / DuckDB v1.5.2. SPEC §3.1 +
   component table updated.

2. Official Go MCP SDK exists. Switching from mark3labs/mcp-go
   (community) to github.com/modelcontextprotocol/go-sdk (official,
   Google collaboration, v1.5.0 stable, 4.4k stars, targets MCP spec
   2025-11-25). Component table updated.

3. arrow-go is on v18, not v15. v18.5.2 (March 2026) has parquet
   encryption fixes relevant for PII-masked safe views. PRD locked
   stack + SPEC component table updated.

Validated unchanged: coder/hnsw (220 stars, active), chi (still the
clean-architecture pick over fiber/gin/echo).

Surfaced for future use: anthropics/anthropic-sdk-go (official,
available for direct Claude calls bypassing opencode if ever needed),
duckdb-wasm (browser-side analytics future option), IVF as HNSW
fallback if recall gate fails.

See docs/RESEARCH_LOG_2026-04-28.md for full survey + sources.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 06:40:26 -05:00


SPEC: Lakehouse-Go Component Port Plan

Status: DRAFT — companion to PRD.md. Component-by-component port plan with library choices, effort estimates, and acceptance gates.

Created: 2026-04-28

Owner: J

This spec answers: for each piece of the Rust Lakehouse, what Go library carries it, what the effort looks like, and what gate proves the port is real.

Effort scale (one engineer-week = ~40h focused work):

  • S — 1–3 days
  • M — 1 engineer-week
  • L — 2–3 engineer-weeks
  • XL — 1+ months
  • HARD — open research, see PRD §Hard problems

§1. Component port table — Rust crates

| Crate | Rust deps that mattered | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| gateway | axum, tokio, tonic, tower | cmd/gateway | chi + stdlib net/http + google.golang.org/grpc | L | low — Go's strongest domain |
| catalogd | parquet-rs, arrow, sqlite | cmd/catalogd | apache/arrow-go/v18, mattn/go-sqlite3 | L | low |
| storaged | object_store, aws-sdk | cmd/storaged | aws-sdk-go-v2, minio-go for MinIO-specific paths | M | low |
| queryd | datafusion, arrow | cmd/queryd | duckdb/duckdb-go/v2 (cgo, official) | HARD | high — see §3 |
| ingestd | csv, json, lopdf, postgres | cmd/ingestd | stdlib encoding/csv, encoding/json, pdfcpu/pdfcpu, jackc/pgx/v5 | L | low |
| vectord | hora, arrow, hnsw | cmd/vectord | coder/hnsw, apache/arrow-go/v18 | L | medium — re-validate HNSW recall |
| vectord-lance | lance | DROPPED | n/a | n/a | n/a — Parquet+HNSW only |
| journald | parquet, arrow | cmd/journald | apache/arrow-go/v18 | M | low |
| aibridge | reqwest | library | net/http + connection pool · anthropics/anthropic-sdk-go available for direct Claude calls (currently routed via opencode) | S | low |
| validator | parquet, custom | library | apache/arrow-go/v18 parquet reader | M | low — port the 24 unit tests as gates |
| truth | tomli, custom DSL | library | pelletier/go-toml/v2 | M | low |
| proto | tonic-build | proto/ + protoc-gen-go | buf + protoc-gen-go-grpc | S | low |
| shared | serde, anyhow | library | stdlib encoding/json, errors | S | low |
| ui | dioxus, wasm | REPLACED | html/template + HTMX | L | medium — see §3 |
| lance-bench | criterion | n/a — dropped with Lance | n/a | n/a | n/a |

Total Rust crate port effort: ~12–18 engineer-weeks (3–4 months for one engineer; 6–8 weeks for two).


§2. Component port table — TypeScript surfaces

| TS surface | Current location | Go target | Library | Effort | Risk |
|---|---|---|---|---|---|
| mcp-server/index.ts | Bun, :3700 | cmd/mcp | modelcontextprotocol/go-sdk (official Go SDK, v1.5.0, Google-collab) | L | medium — MCP semantics |
| mcp-server/observer.ts | Bun, :3800 | cmd/observer | stdlib net/http, slog | M | low |
| mcp-server/tracing.ts | Bun, Langfuse client | library | go.opentelemetry.io/otel + Langfuse Go client (or hand-roll) | M | low — Langfuse Go OSS support varies |
| auditor/*.ts | TS, runs as systemd | cmd/auditor | stdlib + gitea API client | L | medium — auditor cross-lineage logic is intricate |
| tests/real-world/scrum_master_pipeline.ts | TS, ad-hoc | cmd/scrum | stdlib | L | medium — chunking + embed + ladder logic |
| tests/real-world/scrum_applier.ts | TS, ad-hoc | cmd/scrum-apply | stdlib + git CLI shell-out | M | medium |
| bot/propose.ts | TS | cmd/bot | stdlib | S | low |
| Search demo HTML/JS | static | static (no port) | n/a | n/a | n/a — copied as-is |

Total TS port effort: ~6–10 engineer-weeks.


§3. Hard problem details

§3.1 — Query engine (DuckDB via cgo)

Library: github.com/duckdb/duckdb-go/v2 — official Go bindings via cgo. (Replaces the legacy marcboeker/go-duckdb, which was deprecated when the DuckDB team and Marc Boeker jointly relocated maintenance to the DuckDB org at v2.5.0. Migration is a one-line gofmt -r rewrite of import paths.) Current version v2.10502.0 (April 2026), DuckDB v1.5.2 compat. Statically links default extensions: ICU, JSON, Parquet, Autocomplete.

API shape (replaces the DataFusion SessionContext pattern):

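// Assumes imports: "database/sql", "log", and _ "github.com/duckdb/duckdb-go/v2" (registers the "duckdb" driver).
db, err := sql.Open("duckdb", "") // empty DSN opens an in-memory database
if err != nil { log.Fatal(err) }
defer db.Close()
if _, err := db.Exec("CREATE VIEW workers AS SELECT * FROM read_parquet('s3://bucket/workers/*.parquet')"); err != nil { log.Fatal(err) }
rows, err := db.Query("SELECT role, count(*) FROM workers WHERE state='IL' GROUP BY role")
if err != nil { log.Fatal(err) }
defer rows.Close() // iterate with rows.Next() / rows.Scan(&role, &n)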

Acceptance gates:

  • G3.1.A — SELECT * FROM read_parquet('workers_500k.parquet') LIMIT 1 returns a row with the expected schema. Establishes Parquet read works.
  • G3.1.B — Hybrid SQL+vector query (the POST /vectors/hybrid surface) returns the same workers as the Rust path on the same input, in the same rank order modulo embedding precision.
  • G3.1.C — Hot-cache merge-on-read: register a base table + a delta Parquet, query, observe both rows merged with the delta winning on conflict.

Fallback if cgo is rejected: run DuckDB as an external process (duckdb -json -c '...' shelled or HTTP via a thin Go wrapper). Adds operational surface; preserves SQL model.
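
The external-process fallback could look roughly like the sketch below. It assumes a `duckdb` CLI binary on PATH and relies on its `-json` and `-c` options as quoted above; `Row`, `parseRows`, and `queryCLI` are hypothetical names, and nothing here is a committed design.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"os/exec"
)

// Row is one result row decoded from the CLI's JSON output (hypothetical type).
type Row map[string]any

// parseRows decodes the JSON array printed by `duckdb -json`.
// Blank output is treated as an empty result set.
func parseRows(out []byte) ([]Row, error) {
	if len(bytes.TrimSpace(out)) == 0 {
		return nil, nil
	}
	var rows []Row
	if err := json.Unmarshal(out, &rows); err != nil {
		return nil, fmt.Errorf("decode duckdb output: %w", err)
	}
	return rows, nil
}

// queryCLI runs one SQL statement against an in-memory database by shelling
// out to the duckdb binary (assumed to be on PATH).
func queryCLI(ctx context.Context, sql string) ([]Row, error) {
	out, err := exec.CommandContext(ctx, "duckdb", "-json", "-c", sql).Output()
	if err != nil {
		return nil, fmt.Errorf("run duckdb: %w", err)
	}
	return parseRows(out)
}

func main() {
	// Decoding is shown on a canned payload so the sketch runs without the CLI.
	rows, err := parseRows([]byte(`[{"role":"welder","n":3}]`))
	fmt.Println(rows, err)
}
```

Numeric columns arrive as float64 when decoded into `map[string]any`; a real wrapper would probably decode into typed structs per endpoint instead.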

§3.2 — HNSW index

Library: coder/hnsw — pure-Go HNSW, in-process. Supports add / delete / search / persist.

Open question: does coder/hnsw match the recall@10 we measured on the Rust hora path? Need a calibration test:

  • Rebuild lakehouse_arch_v1 (the 1086-chunk arch corpus) in Go.
  • Compare recall@10 on a fixed query set to the Rust baseline.
  • Acceptance: ≤2% drop or we switch library / parameters.
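
The calibration check above can be sketched as follows. This is an illustration only: `recallAtK`, `meanRecall`, and `withinGate` are hypothetical helpers, IDs are assumed to be `uint64`, and the 2% threshold is interpreted here as an absolute drop in recall (0.02), which the spec does not state explicitly.

```go
package main

import "fmt"

// recallAtK returns the fraction of the exact top-k IDs that also appear in
// the approximate top-k IDs for a single query.
func recallAtK(exact, approx []uint64, k int) float64 {
	if k > len(exact) {
		k = len(exact)
	}
	if k <= 0 {
		return 0
	}
	want := make(map[uint64]struct{}, k)
	for _, id := range exact[:k] {
		want[id] = struct{}{}
	}
	if len(approx) > k {
		approx = approx[:k]
	}
	hits := 0
	for _, id := range approx {
		if _, ok := want[id]; ok {
			hits++
			delete(want, id) // count each ground-truth ID at most once
		}
	}
	return float64(hits) / float64(k)
}

// meanRecall averages recallAtK over a fixed query set. Queries with no
// approximate result count as zero recall.
func meanRecall(exact, approx [][]uint64, k int) float64 {
	if len(exact) == 0 {
		return 0
	}
	sum := 0.0
	for i := range exact {
		var got []uint64
		if i < len(approx) {
			got = approx[i]
		}
		sum += recallAtK(exact[i], got, k)
	}
	return sum / float64(len(exact))
}

// withinGate reports whether the candidate lost at most maxDrop recall
// (absolute) relative to the baseline.
func withinGate(baseline, candidate, maxDrop float64) bool {
	return baseline-candidate <= maxDrop
}

func main() {
	exact := [][]uint64{{1, 2, 3, 4}, {5, 6}}
	approx := [][]uint64{{4, 9, 1, 8}, {6, 5}}
	r := meanRecall(exact, approx, 4)
	fmt.Println(r, withinGate(0.95, r, 0.02))
}
```

In the real gate, `exact` would come from a brute-force scan of the same embeddings and the baseline from the recorded Rust measurement.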

Persistence format: TBD — coder/hnsw has its own snapshot format; ADR equivalent of ADR-008 (Parquet for embeddings + sidecar HNSW file) needs revisiting in Go to confirm the sidecar format we ship.

Acceptance gates:

  • G3.2.A — Build HNSW from a Parquet of 100K vectors in <60s
  • G3.2.B — Search 100K vectors at k=10 in <50ms p50
  • G3.2.C — Recall@10 within 2% of Rust baseline on lakehouse_arch_v1

§3.3 — UI (HTMX)

Approach: server-rendered Go templates using html/template, HTMX for partial-page swaps, Alpine.js for client-side interactivity where needed. Single binary serves API + UI.
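
A minimal sketch of that approach, using only the standard library: one template set defines both the full page and the fragment, and the handler picks between them based on the `HX-Request` header that HTMX sends. The template names, the `/ask` route, the static script path, and the echo "answer" are all placeholders, not the planned UI.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
	"net/http"
)

// "answer" is the fragment HTMX swaps in; "page" is the full document served
// on a normal navigation. Both names and the markup are hypothetical.
var tmpl = template.Must(template.New("ui").Parse(`
{{define "answer"}}<div id="answer">{{.Text}}</div>{{end}}
{{define "page"}}<!doctype html><html><body>
<script src="/static/htmx.min.js"></script>
<form hx-post="/ask" hx-target="#answer" hx-swap="outerHTML">
<input name="q"><button>Ask</button></form>
{{template "answer" .}}
</body></html>{{end}}`))

type answerView struct{ Text string }

// render executes the fragment for HTMX requests and the full page otherwise.
func render(partial bool, v answerView) (string, error) {
	name := "page"
	if partial {
		name = "answer"
	}
	var buf bytes.Buffer
	if err := tmpl.ExecuteTemplate(&buf, name, v); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// askHandler is a placeholder; a real handler would call the RAG endpoint.
func askHandler(w http.ResponseWriter, r *http.Request) {
	v := answerView{Text: "echo: " + r.FormValue("q")}
	// HTMX sets "HX-Request: true" on the requests it issues.
	out, err := render(r.Header.Get("HX-Request") == "true", v)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	fmt.Fprint(w, out)
}

func main() {
	// The handler would be mounted on the gateway router; here only the
	// fragment is rendered so the sketch terminates.
	out, err := render(true, answerView{Text: "a<b"})
	fmt.Println(out, err)
}
```

Because html/template escapes `.Text` contextually, model output rendered this way cannot inject markup into the page.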

Acceptance gates:

  • G3.3.A — Ask tab: type natural-language question, get answer from RAG endpoint, render in-page without full reload
  • G3.3.B — Explore tab: paginated dataset list with hot-swap badge rendering
  • G3.3.C — SQL tab: textarea → submit → tabular result rendered in-page
  • G3.3.D — System tab: live tail of /storage/errors and /hnsw/trials via HTMX polling

Fallback if HTMX feels limiting: split repo golangLAKEHOUSE-ui with Vite + React, served as static files by Go gateway. Costs an extra repo + build chain.

§3.4 — Pathway memory port

Constraint: the Rust pathway_memory and TS implementations were byte-matching by ADR-021. The byte contract was verified by running both implementations on the same input tokens and asserting matching bucket indices.

Go port plan:

  • Port the 32-bucket SHA256-keyed token hash exactly. Verify on a golden input that Go produces the same bucket vector as Rust.
  • Port the JSON state file format verbatim — the existing 88 traces in data/_pathway_memory/state.json reload as-is into the Go implementation.
  • Port the matrix-correctness layer (ADR-021's SemanticFlag, BugFingerprint, TypeHint) — these are pure value types, trivially portable.
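
To make the golden-input check concrete, here is a sketch of a SHA-256-keyed 32-bucket hash in Go. The derivation used here (first four digest bytes, big-endian, modulo 32) is an assumption for illustration; the authoritative derivation is whatever the Rust implementation and ADR-021 specify, and the Go port has to copy that exactly.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const numBuckets = 32

// bucketIndex maps a token to one of 32 buckets.
// ASSUMPTION: big-endian uint32 of the first 4 digest bytes, modulo 32.
func bucketIndex(token string) int {
	sum := sha256.Sum256([]byte(token))
	return int(binary.BigEndian.Uint32(sum[:4]) % numBuckets)
}

// bucketVector counts how many tokens land in each bucket.
func bucketVector(tokens []string) [numBuckets]uint32 {
	var v [numBuckets]uint32
	for _, t := range tokens {
		v[bucketIndex(t)]++
	}
	return v
}

func main() {
	got := bucketVector([]string{"abc", "", "abc"})
	// In the real gate, the expected vector would be loaded from a golden
	// file produced by the Rust implementation and compared with ==.
	fmt.Println(got)
}
```

Arrays are comparable in Go, so a byte-match gate along the lines of G3.4.B reduces to `got == want` once the golden vector is loaded.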

Acceptance gates:

  • G3.4.A — Load existing state.json, run replay on the same 11 prior successful pathways, all 11 succeed (matching the Rust 11/11 baseline).
  • G3.4.B — Bucket vector for a fixed test input byte-matches the Rust output.

§4. Phase plan

Phase G0 — Skeleton (Weeks 1–3)

Scope: smallest end-to-end ingest + query path working in Go.

| Component | Deliverable |
|---|---|
| cmd/gateway | HTTP on :3100, /health, /v1/chat proxy stub |
| cmd/catalogd | In-memory registry + Parquet manifest persistence |
| cmd/storaged | Single-bucket S3 / local FS, no error journal yet |
| cmd/ingestd | CSV → Parquet, schema inference, register-on-ingest |
| cmd/queryd | DuckDB-backed POST /sql endpoint |

Acceptance: upload a CSV via POST /ingest, query it via POST /sql with a SELECT, get rows back. Single-bucket. No vector, no profile, no UI.
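
The schema-inference step in cmd/ingestd might start from something like the sketch below. It is an assumption-laden illustration: only three column types are considered (int64, float64, string), empty cells are treated as nulls that do not widen a column, and every name (`colType`, `classify`, `inferSchema`, `inferSchemaString`) is hypothetical.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// colType is ordered from narrowest to widest so widening is a simple max.
type colType int

const (
	typeInt colType = iota
	typeFloat
	typeString
)

func (t colType) String() string {
	return [...]string{"int64", "float64", "string"}[t]
}

// classify returns the narrowest type that can represent one cell.
func classify(s string) colType {
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return typeInt
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return typeFloat
	}
	return typeString
}

// inferSchema reads a header row and all data rows, widening each column's
// type as needed (int64 -> float64 -> string).
func inferSchema(r io.Reader) ([]string, []colType, error) {
	cr := csv.NewReader(r)
	header, err := cr.Read()
	if err != nil {
		return nil, nil, fmt.Errorf("read header: %w", err)
	}
	types := make([]colType, len(header)) // zero value is the narrowest type
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, nil, fmt.Errorf("read row: %w", err)
		}
		for i, cell := range rec {
			if cell == "" {
				continue // treated as null; does not widen the column
			}
			if t := classify(cell); t > types[i] {
				types[i] = t
			}
		}
	}
	return header, types, nil
}

// inferSchemaString is a convenience wrapper for in-memory input.
func inferSchemaString(data string) ([]string, []colType, error) {
	return inferSchema(strings.NewReader(data))
}

func main() {
	header, types, err := inferSchemaString("id,score,name\n1,2.5,ann\n2,3,bob\n")
	if err != nil {
		panic(err)
	}
	for i := range header {
		fmt.Printf("%s: %s\n", header[i], types[i])
	}
}
```

A production version would likely sample rather than scan every row, and would need date, boolean, and null-marker handling before mapping onto Arrow types.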

Phase G1 — Vector + RAG (Weeks 4–6)

| Component | Deliverable |
|---|---|
| cmd/vectord | Embed-on-ingest (calls Python sidecar), HNSW build, POST /search |
| cmd/gateway | Add POST /rag (embed → search → retrieve → generate via aibridge) |
| cmd/aibridge | HTTP client to existing Python sidecar |

Acceptance: ingest 15K resumes (the original Phase 7 fixture), ask "find me a forklift operator with OSHA-10 in IL", get ranked results with LLM-generated explanation grounded in the retrieved chunks.

Phase G2 — Federation + profiles (Weeks 7–8)

| Component | Deliverable |
|---|---|
| cmd/storaged | Multi-bucket registry, rescue bucket, error journal at primary://_errors/ |
| Profile system | Per-reader profile bound to bucket + vector index |
| Hot-swap | Atomic pointer swap for index generations |

Acceptance: two profiles bound to two buckets, queries scoped correctly, hot-swap a vector index without query interruption, rollback works.
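
The hot-swap and rollback behaviour in this acceptance criterion could be built on `sync/atomic`'s generic `atomic.Pointer` (Go 1.19+). The sketch below is one possible shape, not the planned implementation: `Index` and `Swapper` are hypothetical types, and only a single previous generation is retained for rollback.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Index stands in for an immutable, fully built index generation.
type Index struct {
	Generation int
}

// Swapper publishes the live index. Readers call Load without locking;
// writers serialize through mu so the rollback slot stays consistent.
type Swapper struct {
	mu       sync.Mutex
	current  atomic.Pointer[Index]
	previous *Index
}

// Load returns the live index (nil until the first Swap).
func (s *Swapper) Load() *Index { return s.current.Load() }

// Swap atomically publishes next and keeps the old generation for rollback.
func (s *Swapper) Swap(next *Index) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.previous = s.current.Swap(next)
}

// Rollback republishes the previous generation. It returns false when there
// is nothing to roll back to.
func (s *Swapper) Rollback() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.previous == nil {
		return false
	}
	s.current.Store(s.previous)
	s.previous = nil
	return true
}

func main() {
	var s Swapper
	s.Swap(&Index{Generation: 1})
	s.Swap(&Index{Generation: 2})
	fmt.Println(s.Load().Generation)
	fmt.Println(s.Rollback(), s.Load().Generation)
}
```

An in-flight query keeps using the pointer it loaded, so a swap never interrupts it; the old generation is garbage-collected once the last reader drops it and the rollback slot is overwritten.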

Phase G3 — Pathway memory + distillation (Weeks 9–11)

| Component | Deliverable |
|---|---|
| cmd/vectord | Pathway memory module ported, 88 traces reloaded |
| Distillation pipeline | SFT export, contamination firewall, scorer |
| Audit baselines | audit_baselines.jsonl longitudinal signal port |

Acceptance: replay 11 prior successful pathways, all 11 succeed. Re-run distillation acceptance on the frozen fixture set, 22/22 pass.

Phase G4 — TS surfaces → Go (Weeks 12–14)

| Component | Deliverable |
|---|---|
| cmd/mcp | MCP server (replaces Bun) — /v1/chat, intelligence endpoints |
| cmd/observer | Autonomous iteration loop, op recording |
| cmd/auditor | PR audit pipeline (kimi/haiku/opus rotation) |
| cmd/scrum | Scrum master pipeline (replaces TS) |

Acceptance: open a test PR, auditor cycles within 90s, emits verdict to data/_auditor/kimi_verdicts/, behavior matches Rust+TS era within tolerance.

Phase G5 — UI + demo parity (Weeks 15–16)

| Component | Deliverable |
|---|---|
| cmd/gateway | Serves HTMX templates + static demo HTML |
| Demo at devop.live/lakehouse/ | Parity with current Bun demo |
| Staffer console at /console | Parity |

Acceptance: devop.live/lakehouse/ cuts over from Bun to Go gateway. Section ① / ② / ③ all render. Compact contract cards still expand with Project Index. Fill-probability bars still paint.


§5. Repo layout

golangLAKEHOUSE/
├── docs/
│   ├── PRD.md                    ← this PRD
│   ├── SPEC.md                   ← this spec
│   ├── DECISIONS.md              ← Go-era ADRs (start fresh, reference Rust ADRs by number)
│   └── ADR-XXX-*.md              ← per-ADR detail
├── cmd/
│   ├── gateway/                  ← main HTTP/gRPC ingress
│   ├── catalogd/
│   ├── storaged/
│   ├── queryd/
│   ├── ingestd/
│   ├── vectord/
│   ├── journald/
│   ├── mcp/
│   ├── observer/
│   ├── auditor/
│   └── scrum/
├── internal/                     ← shared packages, not exported
│   ├── aibridge/
│   ├── validator/
│   ├── truth/
│   ├── shared/
│   ├── proto/                    ← generated protobuf
│   └── pathway/
├── pkg/                          ← public Go packages (none initially)
├── web/                          ← UI (HTMX templates + static)
│   ├── templates/
│   └── static/
├── scripts/                      ← cold-start, smoke, distill scripts
├── tests/                        ← golden files, integration tests
├── go.mod
├── go.sum
└── README.md

Single Go module. All commands and internal packages live under golangLAKEHOUSE/. No nested modules unless a package needs an independent release cadence (none expected).

Build: go build ./cmd/... produces all binaries.


§6. Migration data plan

What ports verbatim

  • Parquet datasets at data/datasets/*.parquet — read by Go directly.
  • Catalog manifests — Parquet, ports as data not code.
  • Pathway memory state — JSON, ports if §3.4 byte-matching gate passes.

What rebuilds

  • HNSW indexes — rebuild from Parquet embeddings on first Go startup.
  • Auditor verdicts on PRs — old PRs won't be re-audited; lineage starts fresh on the new repo's PRs.

What's archived

  • The Rust crates/ tree — preserved in the original repo at the cutover commit, tagged pre-go-rewrite-2026-04-28 for reference.
  • TS surfaces (mcp-server/, auditor/, etc.) — preserved in the original repo at the same tag.
  • Distillation v1.0.0 substrate (tag distillation-v1.0.0, e7636f2) — kept as the historical reference; Go re-implementation ports the LOGIC but not the bit-identical-reproducibility property unless an ADR re-establishes it.

What's discarded

  • crates/vectord-lance/ (Lance backend, see PRD §Hard problems §2)
  • crates/lance-bench/ (criterion benchmarks specific to Lance)

§7. Acceptance: when is the rewrite done?

The Go Lakehouse reaches feature parity when:

  1. All 12 Rust PRD invariants hold (object-storage source of truth, catalog metadata authority, idempotent ingest, hot-swap atomicity, profiles, etc.).
  2. The 16 distillation acceptance gates pass (re-run ./scripts/distill audit-full against the Go pipeline).
  3. The 22/22 acceptance fixtures from tests/fixtures/distillation/acceptance/ pass under the Go implementation.
  4. The 145 unit tests of distillation v1.0.0 are ported and pass.
  5. devop.live/lakehouse/ demo cuts over to Go gateway with no visible UI regressions.
  6. Auditor emits Kimi/Haiku/Opus verdicts on a test PR, matching the cross-lineage rotation behavior.
  7. The 88 pathway traces replay with 11/11 prior successes reproduced.

At that point the Rust repo enters maintenance-only mode (security fixes), and the Go repo becomes the live system.


§8. Ratified — Phase G0 unblocked (2026-04-28, J)

| # | Decision | Spec impact |
|---|---|---|
| 1 | DuckDB via cgo (duckdb/duckdb-go/v2, formerly marcboeker/go-duckdb) | §3.1 option A — proceed |
| 2 | HTMX + html/template + Alpine.js | §3.3 option A — proceed |
| 3 | git.agentview.dev/profit/golangLAKEHOUSE | repo location locked |
| 4 | Distillation rebuilt in Go (no bit-identical port) | §6 — port logic, not fixtures |
| 5 | Pathway memory starts empty; old traces noted | §3.4 G3.4.A is now "build initial state from scratch in Phase G3"; G3.4.B (byte-match) preserved as the porting correctness gate when the algorithm is reimplemented |
| 6 | Auditor longitudinal signal restarts | new audit_baselines.jsonl lineage starts on first Go-era PR |

See docs/DECISIONS.md ADR-001 for full rationale and docs/RUST_PATHWAY_MEMORY_NOTE.md for where the legacy 88 traces live.

Phase G0 is now unblocked. Next step: bootstrap the Go module skeleton + push to Gitea, then begin §4 Phase G0 implementation.