From 2a974d6dea1ce919244fc78cfafe0c3c14dfd667 Mon Sep 17 00:00:00 2001
From: root
Date: Fri, 1 May 2026 04:56:20 -0500
Subject: [PATCH] docs: ARCHITECTURE_COMPARISON.md as living source file
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Per J's request: move the parallel-runtime comparison from
reports/cutover/ (where it lived as cutover-prep evidence) into
docs/ as the source-of-truth file. J will keep updating it as
fixes ship on either side.

Restructured for living-document use:
- Status header (last refresh date, owner, update triggers)
- 'How to update this doc' section with explicit dos and don'ts
- Decisions tracker at top — actioned items with commit refs +
  open backlog with LOC estimates
- Each comparison section now has 'Last verified' columns where
  numbers are time-sensitive
- Change log section at bottom for one-line entries on every
  meaningful refresh

The original at reports/cutover/architecture_comparison.md gains
a 'THIS IS A SNAPSHOT' header pointing at the docs/ source. Kept
as historical record but no longer the place to update.

Sister pointer file in
/home/profit/lakehouse/docs/ARCHITECTURE_COMPARISON.md so the doc
is reachable from either repo side. That file explicitly says the
source lives in golangLAKEHOUSE and warns against authoritative
content in the pointer.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/ARCHITECTURE_COMPARISON.md            | 321 +++++++++++++++++++++
 reports/cutover/architecture_comparison.md |   7 +-
 2 files changed, 327 insertions(+), 1 deletion(-)
 create mode 100644 docs/ARCHITECTURE_COMPARISON.md

diff --git a/docs/ARCHITECTURE_COMPARISON.md b/docs/ARCHITECTURE_COMPARISON.md
new file mode 100644
index 0000000..1fc7b4a
--- /dev/null
+++ b/docs/ARCHITECTURE_COMPARISON.md
@@ -0,0 +1,321 @@
+# Lakehouse: Rust vs Go architecture comparison
+
+> **Status**: Living document · primary source for the parallel-runtime
+> comparison.
+> **Owner**: J. Update this when either side ships a fix that changes
+> the table values, or when a new architectural axis surfaces.
+> **Last meaningful refresh**: 2026-05-01 (post-Rust-cache + Go-validator-port)
+
+This document compares the two parallel implementations of the lakehouse
+substrate — Rust at `/home/profit/lakehouse/` (production today), Go at
+`/home/profit/golangLAKEHOUSE/` (cutover-prep, Bun `/_go/*` slice live).
+The goal of running both lines is to find where each architecture is
+weak vs strong, address those gaps, and make the keep/maintain
+decision based on real evidence rather than preference.
+
+A snapshot of this document at any point in time is also captured at
+`reports/cutover/architecture_comparison.md`. The version in `docs/`
+is the source of truth; `reports/cutover/` is the historical record.
+
+---
+
+## How to update this doc
+
+Three triggers:
+
+1. **A fix lands on either side that moves a table value.** Update the
+   number, append a one-line entry to the change log at the bottom,
+   commit alongside the fix.
+2. **A new architectural axis surfaces.** Add a section. Match the
+   shape of existing sections (table + Reading paragraph).
+3. **A keep/maintain decision is made.** Update the Recommendation
+   section + change log.
+
+Don't:
+- Delete sections without recording the reason in the change log.
+- Embed unverified claims — every "Rust is X" or "Go is X" should
+  point to either a load-test number, a code reference (`crate/file:line`),
+  or an explicit "asserted, not measured" caveat.
+
+---
+
+## Decisions tracker
+
+| Date | Decision | Effect |
+|---|---|---|
+| 2026-05-01 | Add LRU embed cache to Rust aibridge | Raises Rust `/ai/embed` from 128 to 30,279 RPS (236×), closing the perf gap with Go. **DONE** (commit `150cc3b` in lakehouse). |
+| 2026-05-01 | Port FillValidator + EmailValidator to Go | Production safety net Go was missing. **DONE** (commit `b03521a` in golangLAKEHOUSE). |
+| _open_ | Drop Python sidecar from Rust aibridge | Universal-win architectural cleanup. ~200 LOC, removes 1 runtime + 1 process. |
+| _open_ | Port Rust materializer to Go (transforms.ts) | Unblocks Go-only end-to-end pipeline. ~500-800 LOC. |
+| _open_ | Port Rust replay tool to Go | Closes audit-FULL phase 7 live invocation. ~400-600 LOC. |
+| _open_ | Decide on Lance vector backend | Defer until corpus exceeds ~5M rows. |
+| _open_ | Pick Go primary vs Rust primary | Both viable. Rust has the warm-cache perf edge after today, plus production deploy + producer-side completeness; Go has process isolation and a smaller codebase. |
+
+---
+
+## Code volume
+
+| | Lines | Last verified |
+|---|---:|---|
+| Rust `crates/` (15 crates) | 35,447 | 2026-05-01 |
+| Rust `sidecar/` (Python) | 1,237 | 2026-05-01 |
+| Go `internal/` (20 packages) | 11,896 (+ validator 1,190) | 2026-05-01 |
+| Go `cmd/` (14 binaries) | 3,232 | 2026-05-01 |
+| **Go total** | **~16,300** | 2026-05-01 |
+
+Go is ~46% the size of Rust on like-for-like surface (post-validator-port).
+The gap is largely `vectord` (Rust 11,005 lines vs Go 804) — Rust's
+vectord implements HNSW + Lance-format storage + benchmarking; Go's
+wraps `coder/hnsw` and stops there.
+
+---
+
+## Process model
+
+| | Rust | Go |
+|---|---|---|
+| Binaries running | **1** mega-process (gateway PID 1241, 14.9G RSS, 374% CPU under load) | **11** dedicated daemons (~100-300MB RSS each) |
+| Inter-component comms | In-process axum.nest (no network) | HTTP between daemons |
+| Crash blast radius | Whole system if any subsystem panics | One daemon dies, rest survive |
+| Horizontal scale | One unit only — can't scale individual components | Each daemon scales independently |
+| Deploy unit | Single binary | 11 systemd units |
+
+**Reading**: Rust's mega-binary is simpler ops at small scale (one
+thing to start, one log to tail). Go's daemons are simpler ops at
+production scale (kill the misbehaving one, restart it, others stay
+up). Go also lets you tune per-daemon resource limits via systemd.
+
+---
+
+## Python dependency (the load-bearing axis)
+
+This is the architectural difference that drove the original perf gap.
+Both call Ollama at `:11434`, but the path differs:
+
+```
+Rust embed: gateway → HTTP → Python sidecar :3200 → HTTP → Ollama :11434
+Go embed:   gateway → HTTP → Go embedd :4216 → HTTP → Ollama :11434
+```
+
+The Python sidecar (`sidecar/sidecar/main.py`, 1,237 lines) is a
+FastAPI wrapper around Ollama. It does pydantic validation + request
+shaping; **no fundamental compute** that Ollama can't do directly.
+
+### Performance impact (load-tested 2026-05-01, 6 rotating bodies, concurrency 10, 30s)
+
+| Path | Pre-cache | Post-cache (`150cc3b`) | Δ |
+|---|---:|---:|---:|
+| **Rust /ai/embed** (via gateway) | 128 RPS · p50 78ms · p99 124ms | **30,279 RPS · p50 129µs · p99 5ms** | +236× RPS |
+| **Go /v1/embed** (via gateway → embedd) | 8,119 RPS · p50 0.79ms · p99 3ms | _unchanged_ | (already cached) |
+
+Rust now beats Go ~3.7× on cache-warm workloads. The cache being
+in-process inside Rust's gateway (no HTTP hop to a separate daemon)
+gives it the edge once both sides have caching.
+
+### What the cache fix did NOT do
+
+The Python sidecar is still in the Rust path on cache misses. Cold
+queries pay the full Python+Ollama tax. Dropping the sidecar
+(rewriting aibridge to call Ollama directly) is the next universal-win
+item — open in the Decisions tracker.
+
+---
+
+## Vector storage
+
+| | Rust | Go |
+|---|---|---|
+| HNSW lib | `hnsw_rs` (mature) | `coder/hnsw` (newer, smaller) |
+| Code size | 11,005 lines (`vectord` + `vectord-lance`) | 804 lines |
+| Lance-format storage | Yes (`vectord-lance` crate) | No |
+| Persistence | LanceDB or in-memory | MinIO + JSON envelope (v2 envelope as of `eb0dfdf`) |
+| Distance functions | cosine, euclidean, dot product | cosine, euclidean |
+
+**Reading**: Rust has the deeper substrate. Lance-format gives columnar
+persistence + zero-copy reads + Apache Arrow integration. For
+staffing-domain corpus sizes (5K-500K vectors) both work fine; for
+multi-million-row indexes Rust would have a real edge. **Defer the Go
+Lance port until corpus growth demands it.**
+
+---
+
+## Distillation pipeline (porting status)
+
+| Phase | Rust source | Go port |
+|---|---|---|
+| Materializer (transforms.ts) | TS, full | ❌ NOT YET PORTED |
+| Scorer | TS + Go | ✅ Ported |
+| Score categories + firewall | Pinned | ✅ Ported (`SftNever`) |
+| SFT export (synthesis) | TS, full (8 source classes) | ✅ Fully ported, 4-decimal byte-equal |
+| RAG export | TS | ❌ NOT YET PORTED |
+| Preference export | TS | ❌ NOT YET PORTED |
+| Audit-baselines | TS | ✅ Fully ported, byte-equal verified |
+| Audit-FULL phase 0/3/4 | TS | ✅ Ported |
+| Audit-FULL phase 1 (schema) | bun test | ✅ Via `go test` exec |
+| Audit-FULL phase 2 (materializer) | TS | ✅ Observer mode (read-only) |
+| Audit-FULL phase 5 (run summaries) | TS | ✅ Observer mode (read-only) |
+| Audit-FULL phase 6 (acceptance) | TS fixture harness | ❌ Skipped (TS-only deps) |
+| Audit-FULL phase 7 (replay) | TS | ✅ Observer mode (read-only) |
+| Replay tool | TS | ❌ NOT YET PORTED |
+| Quarantine writer | TS | ❌ NOT YET PORTED |
+
+**Reading**: Go has the substrate for everything observable (read
+paths) and SFT export end-to-end. The producer side (materializer,
+replay tool) is still Rust-only, as are the RAG and preference exports and the quarantine writer.
+To run the full pipeline from Go alone, the materializer + replay need porting.
+
+---
+
+## Production validators
+
+| | Rust | Go |
+|---|---|---|
+| FillValidator | `crates/validator/src/staffing/fill.rs` (12 unit tests) | ✅ **Ported 2026-05-01** (`internal/validator/fill.go` + 13 tests) |
+| EmailValidator | `crates/validator/src/staffing/email.rs` (12 tests) | ✅ **Ported 2026-05-01** (`internal/validator/email.go` + 11 tests) |
+| `/v1/validate` endpoint | Yes | ❌ NOT YET PORTED (validator network surface) |
+| `/v1/iterate` endpoint | Yes (gen→validate→correct→retry loop) | ❌ NOT YET PORTED |
+| Production validators load `workers_500k.parquet` at startup | Yes (75MB resident) | N/A — Go uses WorkerLookup interface; in-memory or adapter |
+
+**Reading**: With today's port, Go has the load-bearing validators.
+The network surface (`/v1/validate`, `/v1/iterate`) is the next
+piece: the validators already work in-process, and exposing them as
+HTTP endpoints would let operators call them over the wire.
+
+---
+
+## Substrate features unique to each side
+
+### Go has, Rust doesn't
+
+- **chatd 5-provider dispatcher** (kimi / opencode / openrouter / ollama_cloud / ollama).
+- **Cross-role gate** in matrix retrieve (real_001 fix). Verified by reality tests real_001..005.
+- **Multi-corpus matrix indexer** (Spec §3.4 component 2).
+- **Pathway memory** (Mem0-style versioned traces).
+- **Observer fail-safe semantics** (ADR-005 Decision 5.1).
+- **In-process embed cache** (CachedProvider + LRU). _Note: Rust got this 2026-05-01 too._
+- **LLM-based role extractor** (regex + qwen2.5 fallback).
+- **Persistent stack 3-layer isolation** (`scripts/cutover/start_go_stack.sh`).
+- **Cutover slice** (Bun `/_go/*` route, opt-in via systemd drop-in).
+- **Production load test** (`scripts/cutover/loadgen/`) with Bun-frontend + direct comparison.
+
+### Rust has, Go doesn't
+
+- **Lance-format vector storage** (vectord-lance crate, 605 lines).
+- **`truth` crate** (970 lines). Cross-source claim reconciliation.
+- **`journald` crate** (455 lines). Structured event journal.
+- **`/v1/validate` + `/v1/iterate` endpoints** (network surface).
+- **`ui` crate (Dioxus, 1,509 lines)**. Native desktop/web UI.
+- **Materializer + replay tools** (the "produce evidence" side).
+- **Acceptance harness** (22 invariants over fixtures, TS).
+- **Production deployment** (devop.live/lakehouse/* serves through Rust today).
+
+---
+
+## Strengths and weaknesses
+
+### Rust strengths
+
+- Mature, in production, serving real demo traffic.
+- Single deploy unit; one binary, one systemd service, one log.
+- Type system + memory safety; fewer runtime bugs in hot paths.
+- Mature library ecosystem (axum, tokio, polars, arrow, hnsw_rs, lance).
+- Native distillation pipeline; Go is the porter.
+- Production validators (now also in Go but Rust authored them).
+- Lance vector storage scales beyond 5M rows.
+- **In-process embed cache (post-`150cc3b`) makes Rust the fastest path on warm workloads.**
+
+### Rust weaknesses
+
+- **Python sidecar dependency** — every cache-miss AI call goes through Python. Adds 1 runtime + 1 process to ops. ~200 LOC to fix.
+- **Mega-binary blast radius** — gateway at 14.9G RSS means any panic kills the whole production system.
+- **Tail latency cliff under uncached load** — single async runtime serializes I/O completions.
+- **Compile times** — slow iteration vs Go's per-package builds.
+- **Coupling** — adding a feature touches gateway/v1/ and ripples across crates.
+
+### Go strengths
+
+- **Process isolation** — daemons crash independently; ops can `systemctl restart vectord` without touching gateway.
+- **Per-daemon scale** — embed cache lives in embedd; vectord shards independently. Hot daemons scale horizontally.
+- **No Python dependency** — every daemon talks to peers over HTTP/JSON. Native Go down to Ollama.
+- **In-process embed cache** at the daemon level (was the perf lever pre-Rust-cache).
+- **Smaller, denser code** — ~16,300 lines vs Rust's 35,447 in `crates/` (~46% the size), before counting the 1,237-line Python sidecar.
+- **Faster iteration** — `go build` of all 14 binaries is ~3-5s; Rust full rebuild is minutes.
+- **Cross-runtime artifact compatibility verified** — audit_baselines.jsonl, scored-runs JSONL, sft_export.jsonl all round-trip byte-equal.
+
+### Go weaknesses
+
+- **Distillation pipeline incomplete** — materializer + replay + RAG export + preference export still Rust-only.
+- **Validator network surface missing** — in-memory validators work, but the `/v1/validate` and `/v1/iterate` HTTP endpoints are not yet ported. Operators can't call validators over the wire from Go.
+- **Vector storage HNSW-only** — no Lance equivalent. Fine for current scale.
+- **Less production-tested** — cutover slice live but no real coordinator traffic yet.
+- **HTTP between daemons** — every cross-daemon call is a network round-trip. Latency is fine on localhost (microseconds per hop), but each hop adds more tail latency than Rust's in-process composition.
+- **`coder/hnsw` is newer** than Rust's `hnsw_rs`. Less battle-tested.
+
+---
+
+## Cross-cutting abstractions to address
+
+The list below is a working backlog. Move items to "Decisions tracker"
+(at top) when actioned with a commit reference.
+
+### Universal wins (apply regardless of primary line)
+
+1. ✅ **Embed cache in Rust aibridge** — DONE 2026-05-01 (`150cc3b`).
+2. ✅ **FillValidator + EmailValidator in Go** — DONE 2026-05-01 (`b03521a`).
+3. **Drop Python sidecar from Rust** — Rewrite aibridge to call Ollama at `:11434/api/embed` and `/api/generate` directly. Removes 1 runtime + 1 process from ops. ~200 LOC.
+4. **Cross-runtime contract tests** — Pin shared JSONL schemas (audit_baselines, scored_run, sft_sample) as canonical specs in `auditor/schemas/` with Go-side validators consuming the same definitions.
+
+### If keeping Go primary
+
+5. **Port materializer** (highest leverage — unblocks full Go pipeline). ~500-800 LOC.
+6. **Port replay tool** (closes audit-FULL phase 7 live invocation). ~400-600 LOC.
+7. **Port `/v1/validate` + `/v1/iterate` HTTP surface** for the now-Go-side validators. ~200 LOC.
+8. **Skip Lance** until corpus growth demands it (>5M rows).
+9. **Keep chatd, observer fail-safe, role gate, multi-corpus matrix** — real Go wins worth preserving.
+
+### If keeping Rust primary
+
+10. **Port chatd's 5-provider dispatcher to Rust** — unified cloud LLM access.
+11. **Port the cross-role gate to Rust matrix retrieve** — production safety on the matrix layer (verified by Go reality tests real_001..005).
+12. **Consider process splitting** — even partial decomposition (split out vectord into its own process) would help with the mega-binary blast radius.
+
+---
+
+## Recommendation (working hypothesis)
+
+**Go for the primary line, Rust for production-bridge maintenance.**
+
+Reasons:
+1. **Operations** — process isolation is genuinely simpler at production scale than a 14.9G mega-binary.
+2. **Code volume** — Go covers the like-for-like surface in ~46% of the lines.
+3. **Cross-runtime parity verified** — every ported artifact (audit_baselines, scored-runs, sft_export) round-trips byte-equal between runtimes.
+4. **The 4 missing pieces are bounded** — materializer + replay + validators-network + RAG/preference exports are concrete porting targets, not research questions.
+5. **Performance is no longer a deciding factor** post-`150cc3b` — Rust is faster on warm cache, but both are well above staffing-domain demand levels (<1 RPS typical).
+
+But **don't abandon Rust**:
+1. devop.live/lakehouse/ runs through Rust today; cutover is multi-week.
+2. Several Go improvements would be downstream of Rust patterns. Keeping Rust live means anything new there is a porting opportunity for Go.
+3. The Python sidecar drop + cross-role gate port are valuable Rust improvements regardless of which line is primary.
+
+---
+
+## Change log
+
+Append entries here when this doc gets updated. One-line entries; link to commits.
+
+- 2026-05-01 — Initial draft (`b3ad148` golangLAKEHOUSE).
+- 2026-05-01 — Recorded Rust embed cache shipping (`150cc3b` lakehouse), updated Python-dependency section + table.
+- 2026-05-01 — Recorded Go validator port shipping (`b03521a` golangLAKEHOUSE), updated production-validators section.
+- 2026-05-01 — Reframed as living document in `docs/`, added Decisions tracker + Update guidance + Change log sections.
+
+---
+
+## See also
+
+- **`reports/cutover/architecture_comparison.md`** — historical snapshot (matched this doc as of the date stamp at top).
+- **`docs/SPEC.md`** — Go-side architectural spec.
+- **`docs/DECISIONS.md`** — Go-side ADRs.
+- **`/home/profit/lakehouse/docs/DECISIONS.md`** — Rust-side ADRs.
+- **`/home/profit/lakehouse/docs/go-rewrite/`** — Rust-side notes on the rewrite.
+- **`reports/cutover/SUMMARY.md`** — running log of cross-runtime parity probes.
+- **`reports/cutover/g5_load_test.md`** — load-test methodology + numbers.
diff --git a/reports/cutover/architecture_comparison.md b/reports/cutover/architecture_comparison.md
index 499ec4c..8ce908a 100644
--- a/reports/cutover/architecture_comparison.md
+++ b/reports/cutover/architecture_comparison.md
@@ -1,4 +1,9 @@
-# Lakehouse: Rust vs Go architecture comparison
+# Lakehouse: Rust vs Go architecture comparison (snapshot)
+
+> **THIS IS A SNAPSHOT — NOT THE SOURCE OF TRUTH.**
+> The living document is at **`docs/ARCHITECTURE_COMPARISON.md`**.
+> Update there; this file is a frozen historical record.
+> Snapshot date: 2026-05-01.
 
 Produced 2026-05-01 to inform the keep/maintain decision and surface
 abstractions that should be addressed regardless of which side is the