golangLAKEHOUSE/STATE_OF_PLAY.md
root e8cf113af8 gauntlet 2026-05-02: smoke chain + per-component scrum + parity probe
Production-readiness gauntlet exploiting the dual Rust/Go
implementation as a measurement instrument.

## Phase 1 — Full smoke chain
21/21 PASS in ~60s. Substrate intact across the full service surface.

## Phase 2 — Per-component scrum (token-volume fix)
Prior wave (165KB diff): Kimi 62 tokens out, Qwen 297 → no useful
analysis. This wave splits today's commits into 4 focused bundles
(36-71KB each):
  c1 validatord (46KB) → 0 convergent / 11 distinct
  c2 vectord substrate (36KB) → 0 convergent / 10 distinct
  c3 materializer (71KB) → 0 convergent / 6 distinct (Opus emitted
                           a BLOCK then self-retracted in same response)
  c4 replay (45KB) → 0 convergent / 10 distinct

Reviewer engagement vs prior wave: Kimi went 62 → ~250 tokens out
once bundles dropped below 60KB.

scripts/scrum_review.sh hardening:
  * Diff-size guard (warn >60KB, hard-fail >100KB,
    SCRUM_FORCE_OVERSIZE=1 override)
  * Tightened prompt — file path must appear EXACTLY as in diff
    so post-processor can grep WHERE: lines reliably
  * Auto-tally step dedupes by (reviewer, location); convergence
    counts distinct lineages (closes the prior `opus+opus+opus`
    false-convergence bug)
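
The auto-tally rule above can be sketched in Go as follows. This is a minimal illustration, not the logic inside scripts/scrum_review.sh: the `Finding` type, the `Convergent` function, and the sample locations are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Finding is a hypothetical shape for one parsed reviewer finding.
type Finding struct {
	Reviewer string // lineage name, e.g. "opus", "kimi", "qwen"
	Location string // the WHERE: value, e.g. "a.go:10"
}

// Convergent returns the locations flagged by at least minLineages distinct
// reviewers. Repeats from one reviewer at one location count only once.
func Convergent(findings []Finding, minLineages int) []string {
	byLoc := map[string]map[string]struct{}{}
	for _, f := range findings {
		if byLoc[f.Location] == nil {
			byLoc[f.Location] = map[string]struct{}{}
		}
		byLoc[f.Location][f.Reviewer] = struct{}{}
	}
	var out []string
	for loc, reviewers := range byLoc {
		if len(reviewers) >= minLineages {
			out = append(out, loc)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	findings := []Finding{
		{"opus", "a.go:10"}, {"opus", "a.go:10"}, {"opus", "a.go:10"},
		{"opus", "b.go:7"}, {"kimi", "b.go:7"},
	}
	// a.go:10 has three votes but one lineage, so it is not convergent.
	fmt.Println(Convergent(findings, 2)) // [b.go:7]
}
```

Keying the inner set by reviewer is what keeps an `opus+opus+opus` triple from being counted as three lineages.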

## Phase 3 — Cross-runtime validator parity probe (the headline finding)
scripts/cutover/parity/validator_parity.sh sends 6 identical
/v1/validate cases to Rust :3100 AND Go :4110, compares status+body.

Result: **6/6 status codes match · 5/6 body shapes diverge.**

Rust returns serde-tagged enum:   {"Schema":{"field":"x","reason":"y"}}
Go returns flat exported-fields:  {"Kind":"schema","Field":"x","Reason":"y"}

Both round-trip inside their own runtime; a caller swapping one for
the other would break parsing silently. Captured as new _open_ row
in docs/ARCHITECTURE_COMPARISON.md decisions tracker.

This is the payoff of using the dual implementation as a measurement
instrument: single-repo scrums can't catch this class of cross-runtime
drift.

## Phase 4 — Production assessment
ship-with-known-gap. Validator wire-format gap is documented, not
regressed. ~50 LOC future fix on Go side (custom MarshalJSON on
ValidationError to match Rust's serde shape).

Persistent stack config (/tmp/lakehouse-persistent.toml) gains
validatord on :3221 + persistent-validatord binary so operators
bringing up the persistent stack get the new daemon automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:05:18 -05:00


STATE OF PLAY — Lakehouse-Go

Last verified: 2026-05-02 ~05:00 CDT

Verified by: production-readiness gauntlet (21/21 smoke chain green in ~60s; per-component scrum across 4 bundles with no convergent findings and no real bugs; cross-runtime validator parity probe with 6/6 status match and 5/6 body shape divergence captured as known gap).

Disposition: reports/cutover/gauntlet_2026-05-02/disposition.md.

Read this FIRST. When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at /home/profit/lakehouse/ is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.


VERIFIED WORKING RIGHT NOW

Substrate (G0 + G1 family)

14 service binaries under cmd/ plus 2 driver scripts (scripts/staffing_*) and 3 distillation tools (cmd/audit_full, cmd/materializer, cmd/replay) build into bin/. 21 smoke scripts all PASS (added validatord_smoke.sh 2026-05-02). just verify (vet + 33 packages × short tests + 9 core smokes) green in ~32s wall.

| Binary | Port | What |
|---|---|---|
| gateway | 3110 | reverse proxy, single OpenAI-compat-style edge |
| storaged | 3211 | S3 GET/PUT/LIST/DELETE w/ per-prefix PUT cap (ADR-002) |
| catalogd | 3212 | Parquet manifests, ADR-020 idempotent register |
| ingestd | 3213 | CSV → Parquet → catalogd, content-addressed keys |
| queryd | 3214 | DuckDB SELECT over Parquet via httpfs |
| vectord | 3215 | HNSW indexes (coder/hnsw), persistence to storaged |
| embedd | 3216 | Ollama-backed embedder w/ LRU cache |
| pathwayd | 3217 | Mem0 ops (Add/Update/Revise/Retire/History/Search) |
| matrixd | 3218 | Multi-corpus retrieve+merge + relevance + downgrade + playbook |
| observerd | 3219 | Witness loop, workflow runner with DAG executor |
| chatd | 3220 | LLM dispatcher: ollama / ollama_cloud / openrouter / opencode / kimi |
| validatord | 3221 | /validate (FillValidator + EmailValidator + PlaybookValidator) + /iterate (gen→validate→correct loop). Roster from JSONL. |
| mcpd | | MCP SDK port (Bun mcp-server replacement) |
| fake_ollama | | Test fixture (used by g2_smoke_fixtures.sh) |

Matrix indexer — all 5 SPEC §3.4 components shipped

  1. Corpus builders (internal/corpusingest)
  2. Multi-corpus retrieve+merge (matrixd /matrix/search)
  3. Relevance filter (internal/matrix/relevance.go 376 LoC + 289 LoC test)
  4. Strong-model downgrade gate (internal/matrix/downgrade.go, reads cfg.Models.WeakModels after Phase 2)
  5. Playbook memory: boost + Shape B inject (internal/matrix/playbook.go, learning loop). Shape B (InjectPlaybookMisses, 154a72e) appends recorded answers to results when regular retrieval misses them — closed the paraphrase recovery gap exposed by run #002 (0/2) and validated by run #003 (6/6).

Pathway memory (Mem0 substrate)

Full ADR-004 surface shipped. Cycle-detection + retired-trace exclusion proven by tests: TestHistory_CycleDetected, TestRetire_ExcludedFromSearch, TestRevise_ChainOfThree_BackwardWalk. JSONL append-only persistence with corruption tolerance.
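
A minimal sketch of the backward walk with cycle detection that those tests exercise. The storage model here is an assumption made for the example (each trace holding a `PrevID` link to the version it revised); the `Trace`, `History`, and `ErrCycle` names are hypothetical and not taken from internal/pathway.

```go
package main

import (
	"errors"
	"fmt"
)

// Trace is a hypothetical versioned trace. PrevID points at the version it
// revised and is empty for the first version in a chain.
type Trace struct {
	ID     string
	PrevID string
}

// ErrCycle is returned when the PrevID links loop back on themselves.
var ErrCycle = errors.New("history cycle detected")

// History walks backward from id through PrevID links, newest first.
func History(store map[string]Trace, id string) ([]string, error) {
	seen := map[string]bool{}
	var chain []string
	for id != "" {
		if seen[id] {
			return nil, ErrCycle
		}
		tr, ok := store[id]
		if !ok {
			return nil, fmt.Errorf("trace %q not found", id)
		}
		seen[id] = true
		chain = append(chain, id)
		id = tr.PrevID
	}
	return chain, nil
}

func main() {
	store := map[string]Trace{
		"v1": {ID: "v1"},
		"v2": {ID: "v2", PrevID: "v1"},
		"v3": {ID: "v3", PrevID: "v2"},
	}
	fmt.Println(History(store, "v3")) // [v3 v2 v1] <nil>
}
```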

Observer + workflow runner

  • observerd ring buffer + JSONL persistence
  • Workflow DAG executor (Archon-style) with 5 native modes wired: matrix.relevance, matrix.downgrade, matrix.search, distillation.score, drift.scorer. Plus fixture.echo / fixture.upper for runner mechanics smokes.

Distillation + drift

  • E (partial) at 57d0df1 — scorer + contamination firewall ported from Rust v1.0.0 (logic only per ADR-001 §1.4; not bit-identical).
  • F (first slice) at be65f85 — drift quantification, scorer drift first.
  • Materializer port (2026-05-02) — internal/materializer + cmd/materializer. Ports scripts/distillation/transforms.ts (12 transforms) + build_evidence_index.ts (idempotency, day-partition, receipt). On-wire JSON shape matches TS so Bun and Go runs are interchangeable. 14 tests + materializer_smoke.sh.
  • Replay port (2026-05-02) — internal/replay + cmd/replay. Ports scripts/distillation/replay.ts (retrieve → bundle → /v1/chat → validate → log). Closes audit-FULL phase 7 live invocation on the Go side. Both runtimes append to the same data/_kb/replay_runs.jsonl (schema=replay_run.v1). 14 tests + replay_smoke.sh.

chatd — Phase 4 (shipped 2026-04-30, scrum-hardened same day)

Multi-provider LLM dispatcher routing /v1/chat by model-name prefix or :cloud suffix:

| Prefix / suffix | Provider | Auth |
|---|---|---|
| `ollama/<m>` or bare | ollama (local) | none |
| `ollama_cloud/<m>` or `<m>:cloud` | ollama_cloud | Bearer (OLLAMA_CLOUD_KEY) |
| `openrouter/<v>/<m>` | openrouter | Bearer (OPENROUTER_API_KEY) |
| `opencode/<m>` | opencode | Bearer (OPENCODE_API_KEY) |
| `kimi/<m>` | kimi | Bearer (KIMI_API_KEY) |

The 4 provider keys live in /etc/lakehouse/{ollama_cloud,openrouter,opencode,kimi}.env files (mode 0600); local ollama needs none. An empty or missing file leaves that provider unregistered (404 at first call instead of 503). Test request: POST /v1/chat {"model":"opencode/claude-opus-4-7","messages":[{"role":"user","content":"hi"}],"max_tokens":8}.
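
The prefix and suffix routing described in this section can be sketched as below. This illustrates the dispatch rule only, not chatd's code. The `route` function is a hypothetical name, and two behaviors are assumptions: that the prefix is stripped at routing time, and that a `:cloud` suffix is passed upstream unchanged.

```go
package main

import (
	"fmt"
	"strings"
)

// route returns the provider name and the model string to send upstream.
func route(model string) (provider, upstream string) {
	for _, p := range []string{"ollama_cloud", "openrouter", "opencode", "kimi", "ollama"} {
		if strings.HasPrefix(model, p+"/") {
			// Assumption: the provider prefix is removed before the call.
			return p, strings.TrimPrefix(model, p+"/")
		}
	}
	if strings.HasSuffix(model, ":cloud") {
		// Assumption: the :cloud suffix is kept in the upstream model name.
		return "ollama_cloud", model
	}
	// Bare model names fall through to local ollama.
	return "ollama", model
}

func main() {
	for _, m := range []string{
		"opencode/claude-opus-4-7",
		"openrouter/anthropic/claude-opus-4-7",
		"kimi-k2.6:cloud",
		"qwen2.5:latest",
	} {
		p, up := route(m)
		fmt.Printf("%-40s -> %-12s %s\n", m, p, up)
	}
}
```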

Request.Temperature is *float64 (pointer) — Anthropic 4.7 deprecates temperature entirely, so we omit the field when caller doesn't set it.

Model tier registry

lakehouse.toml [models] names model IDs by tier so swaps are 1-line:

local_fast       = "qwen3.5:latest"
local_judge      = "qwen2.5:latest"   # NOT qwen3.5 — vision-SSM 256K-ctx ran 30s/judge in lift loop
cloud_judge      = "kimi-k2.6:cloud"
cloud_review     = "qwen3-coder:480b"
frontier_review  = "openrouter/anthropic/claude-opus-4-7"
frontier_arch    = "openrouter/moonshotai/kimi-k2-0905"
frontier_free    = "opencode/claude-opus-4-7"
weak_models      = ["qwen3.5:latest", "qwen3:latest"]   # matrix.downgrade bypass

Callers read cfg.Models.LocalJudge etc. instead of literal strings. playbook_lift harness, matrix.downgrade, and observerd's MatrixDowngradeWithWeakList factory all migrated.

Code health

  • go vet ./... → 0 warnings, 0 errors
  • go test -short ./... → all green, 349 test functions
  • just verify → PASS (vet + tests + 9 smokes) in ~31s
  • 18 smoke scripts (9 core gating verify + 9 domain smokes for new daemons)

Latest scrum: 2026-04-30 cross-lineage wave

Composite 50/60 at scrum2 head c7e3124 (was 35 baseline → 43 R1 → 50 R2). Today's chatd wave reviewed by Opus + Kimi + Qwen3-coder via the chatd's own /v1/chat; 2 BLOCKs + 2 WARNs landed as fixes (0efc736); reusable driver at scripts/scrum_review.sh.

Reality tests #001-#003: load-bearing gate verified (2026-04-30 ~05:50-07:05 CDT)

The 5-loop substrate's load-bearing gate (per project_small_model_pipeline_vision.md: "the playbook + matrix indexer must give the results we're looking for") is verified for both verbatim replay and paraphrase queries.

| Run | Stance | Verbatim lift | Paraphrase recovery | What it proved |
|---|---|---|---|---|
| playbook_lift_001 | boost-only | 7/8 (87.5%) | not tested | Cosine + boost re-rank works for verbatim replay. Substrate live. |
| playbook_lift_002 | boost-only | 2/2 | 0/2 | Boost can't promote answers OUT of regular top-K; paraphrase gap exposed. |
| playbook_lift_003 | Shape B (loose 0.5) | 2/6 | 6/6 → top-1 | Shape B injects, but cross-pollinates: w-4435 surfaces as warm top-1 for unrelated OOD queries (dental/RN/SWE). |
| playbook_lift_004 | Shape B + split threshold (0.5 boost / 0.20 inject) | 6/8 (75%) | 6/8 (75%) | OOD cross-pollination GONE; system refuses to inject when it's not confident. The honest configuration. |

Shape B (InjectPlaybookMisses in internal/matrix/playbook.go): when warm-pass retrieval doesn't already include a playbook hit's answer, append a synthetic Result with distance = playbook_hit_distance × BoostFactor. Caller re-sorts + truncates. Documented at playbook.go:22-27 since v0; v3 shipped the implementation. v4 added the split-threshold defense (DefaultPlaybookMaxInjectDistance = 0.20 while boost stays at 0.50) — boost is safe at loose thresholds because it only re-ranks results already in retrieval; inject is structurally riskier so its threshold is tighter.

OOD honesty (dental hygienist / RN / software engineer queries) holds across all three runs — judge rates them 1, system doesn't fabricate matches. Cross-corpus boosts (e- ↔ w- swaps) confirmed in v1 + v3.

Evidence: reports/reality-tests/playbook_lift_{001,002,003}.{json,md}. Per the report's rubric (lift ≥ 50% = matrix doing real work), 6/6 paraphrase recovery is the validation that matters — verbatim replay is structurally the easy case.

v3 → v4 is the configuration evolution. v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.
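
The split-threshold behavior can be sketched as below, under stated assumptions. The 0.5 and 0.20 thresholds and the inject rule (synthetic distance = hit distance × boost factor) come from the text above. The `boostFactor` value of 0.5, the way boost scales an existing result's distance, and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

type Result struct {
	ID       string
	Distance float64
}

// PlaybookHit is a recorded answer plus the distance between the current
// query and the query it was recorded for.
type PlaybookHit struct {
	AnswerID string
	Distance float64
}

const (
	maxBoostDistance  = 0.5  // loose: boost only re-ranks what retrieval already found
	maxInjectDistance = 0.20 // tight: inject forces a result in
	boostFactor       = 0.5  // hypothetical value for this sketch
)

// applyPlaybook boosts hits already present in results, injects close-enough
// misses, then re-sorts by distance and truncates to k.
func applyPlaybook(results []Result, hits []PlaybookHit, k int) []Result {
	out := append([]Result(nil), results...)
	index := map[string]int{}
	for i, r := range out {
		index[r.ID] = i
	}
	for _, h := range hits {
		if i, ok := index[h.AnswerID]; ok {
			if h.Distance <= maxBoostDistance {
				out[i].Distance *= boostFactor // assumed boost mechanics
			}
			continue
		}
		if h.Distance <= maxInjectDistance {
			index[h.AnswerID] = len(out)
			out = append(out, Result{h.AnswerID, h.Distance * boostFactor})
		}
	}
	sort.SliceStable(out, func(a, b int) bool { return out[a].Distance < out[b].Distance })
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	results := []Result{{"w-1", 0.30}, {"w-2", 0.35}}
	hits := []PlaybookHit{
		{"w-9", 0.15}, // inside 0.20: injected
		{"w-7", 0.40}, // inside 0.5 but not in results: too far to inject
	}
	for _, r := range applyPlaybook(results, hits, 3) {
		fmt.Println(r.ID)
	}
}
```

In this run `w-9` is injected and ranks first, while `w-7` never appears: a hit at distance 0.40 would have qualified under the single 0.5 threshold used in v3, which is the cross-pollination path the split closes.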

Multi-coordinator stress test (Phase 1 → 3) — VERIFIED end-to-end

Reality test #2 catalog. New harness scripts/multi_coord_stress.{sh,go} simulates 3 coordinators (alice/bob/carol) handling 3 distinct contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction), each with their own playbook namespace. 9-phase operational narrative across simulated Hours 0-48: baseline → fresh-resume injection → inbox burst → mid-day surge → 200-worker swap → contract merge → handover (verbatim + paraphrase) → split → reissue.

| Capability | Verified | Where |
|---|---|---|
| Per-coordinator playbook isolation | playbook_alice / playbook_bob / playbook_carol corpora | |
| Same-role-across-contracts diversity | Jaccard 0.026 (n=9); 97% workers differ per region | Phase 1 baseline |
| Different-roles-same-contract diversity | Jaccard 0.070 (n=18); 93% differ per role | Phase 1 baseline |
| HNSW retrieval determinism | Jaccard 1.000 (n=12) | Phase 6 reissue |
| Verbatim handover (Bob runs Alice's queries with Alice's playbook) | 4/4 | Phase 4 |
| Paraphrase handover (Bob runs qwen2.5-paraphrased queries) | 4/4 | Phase 4b |
| 200-worker swap with ExcludeIDs | Jaccard 0.000; 8/8 placed workers fully replaced | Phase 2b |
| Fresh-resume injection (two-tier fresh_workers index) | 3/3 fresh workers at top-1 | Phase 1b |
| Inbox endpoint /v1/observer/inbox (email + SMS, priority weighting) | 6/6 events recorded | Phase 1c |
| LLM demand parsing (qwen2.5 format=json on inbox bodies) | 6/6 parsed cleanly into structured {role, count, location, certs, skills, shift} | Phase 1c |
| Judge re-rates inbox top-1 against ORIGINAL body | catches tight-distance-but-wrong (Q3 crane case: dist 0.23 → rating 1) | Phase 1c |
| Langfuse Go-side tracing | 111 observations on a single run trace, browseable at http://localhost:3001 | Run #011 |
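
For reference, the Jaccard figures in this table are set-overlap scores over returned worker IDs: 0.000 means two result sets share nothing, 1.000 means they are identical. A minimal sketch follows; the `jaccard` helper and the sample IDs are illustrative, and treating two empty sets as identical is an assumption.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| over two ID lists, ignoring duplicates.
func jaccard(a, b []string) float64 {
	setA := map[string]struct{}{}
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := map[string]struct{}{}
	for _, id := range b {
		setB[id] = struct{}{}
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 1 // assumption: two empty result sets count as identical
	}
	return float64(inter) / float64(union)
}

func main() {
	fmt.Println(jaccard([]string{"w-1", "w-2"}, []string{"w-1", "w-2"}))               // 1
	fmt.Println(jaccard([]string{"w-1", "w-2"}, []string{"w-3", "w-4"}))               // 0
	fmt.Println(jaccard([]string{"w-1", "w-2", "w-3"}, []string{"w-2", "w-3", "w-4"})) // 0.5
}
```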

Substrate gains added by this wave:

  • internal/matrix/playbook.go Shape B + split inject threshold (commit 67d1957 from earlier wave; verified in multi-coord too)
  • internal/matrix/retrieve.go ExcludeIDs field on SearchRequest — filters worker IDs at retrieve, boost, AND inject (so excluded answers can't sneak back via playbook). Real-world driver: coordinator placed N workers, client asks for replacements.
  • internal/observer/types.go SourceInbox taxonomy alongside SourceMCP / SourceScenario / SourceWorkflow
  • cmd/observerd POST /observer/inbox route — accepts {type, sender, subject, body, priority, tag} and records as ObservedOp. Type must be email or sms; body required; priority defaults to medium.
  • internal/langfuse/client.go — minimal Go-side Trace+Span client, best-effort posture (logs on error, never blocks calling path; same fail-open semantics as ADR-005 Decision 5.1).
  • embedd default_model bumped from nomic-embed-text (137M) → nomic-embed-text-v2-moe (475M, MoE, drop-in 768d). Same-role-across-contracts diversity went from 0.080 → 0.000 with the upgrade.
  • Two-tier index pattern: fresh content goes to fresh_workers (a small "hot" corpus); main queries include it in the corpora list. Solves the HNSW post-build add recall issue (incremental adds to a populated graph land in poorly-connected regions and disappear from search). Canonical NRT pattern; Lucene works the same way.
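
The inbox route's stated validation rules (type must be email or sms, body required, priority defaults to medium) can be sketched as below. The `InboxEvent` struct and `parseInbox` function are hypothetical names, and the real handler in cmd/observerd may differ in error wording and structure.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// InboxEvent mirrors the documented request fields.
type InboxEvent struct {
	Type     string `json:"type"`
	Sender   string `json:"sender"`
	Subject  string `json:"subject"`
	Body     string `json:"body"`
	Priority string `json:"priority"`
	Tag      string `json:"tag"`
}

// parseInbox decodes a request body and applies the three documented rules.
func parseInbox(raw []byte) (InboxEvent, error) {
	var ev InboxEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return ev, fmt.Errorf("bad JSON: %w", err)
	}
	if ev.Type != "email" && ev.Type != "sms" {
		return ev, errors.New("type must be email or sms")
	}
	if ev.Body == "" {
		return ev, errors.New("body is required")
	}
	if ev.Priority == "" {
		ev.Priority = "medium"
	}
	return ev, nil
}

func main() {
	ev, err := parseInbox([]byte(`{"type":"sms","sender":"client","body":"need 3 forklift operators"}`))
	fmt.Println(ev.Priority, err) // medium <nil>

	_, err = parseInbox([]byte(`{"type":"fax","body":"hello"}`))
	fmt.Println(err) // type must be email or sms
}
```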

Harness expansion (2026-04-30 ~05:30 CDT)

scripts/playbook_lift.sh rewritten from a 5-daemon stripped harness to the full 10-daemon prod-realistic stack (chatd stays up independently). The 5-daemon version was structurally hiding bugs; expanding the daemon set surfaced 7 distinct fixes:

| # | Fix | Lock |
|---|---|---|
| 1 | driver→matrixd: query → query_text field name | cmd/matrixd/main_test.go TestPlaybookRecord_OldFieldNameRejected |
| 2 | harness toml missing [s3] block | inline comment in scripts/playbook_lift.sh |
| 3 | harness→queryd: q → sql field name | cmd/queryd/main_test.go TestHandleSQL_WrongFieldName_400 |
| 4 | 5→10 daemon boot order | inline comment + dep-ordered launch |
| 5 | SQL surface probe (3-row CSV → COUNT=3) | [lift] ✓ SQL surface probe passed assertion |
| 6 | candidates corpus was SWE-tech, not staffing | swapped to ethereal_workers.parquet (10K rows, real staffing schema, "e-" id prefix) |
| 7 | qwen3.5:latest is vision-SSM 256K-ctx → 30s/judge | reverted local_judge to qwen2.5:latest (1s/judge, 30× faster) |

R-005 closed (2026-04-30 ~05:35 CDT)

Four new cmd/<bin>/main_test.go files — chi router-level contract tests:

  • cmd/matrixd/main_test.go (123 lines) — playbook record drift detector + score bounds + 6 routes mounted
  • cmd/queryd/main_test.go (extended) — wrong-field-name drift detector
  • cmd/pathwayd/main_test.go (102 lines) — 9 routes + add round-trip + retire-nonexistent
  • cmd/observerd/main_test.go (98 lines) — 4 routes + invalid-op 400 + unknown-mode 400

go test ./cmd/{matrixd,queryd,pathwayd,observerd} all green. R-005 from prior STATE OPEN list is closed.


DO NOT RELITIGATE

Ratified ADRs (docs/DECISIONS.md)

  • ADR-001: DuckDB via cgo, HTMX UI, Gitea hosting, distillation rebuilt-not-ported, pathway memory clean start, auditor longitudinal signal restarts. 6 sub-decisions, all final.
  • ADR-002: storaged per-prefix PUT cap (4 GiB for _vectors/, 256 MiB elsewhere) — implemented at 423a381. Operator-config bump rather than constant change is the documented path if 4 GiB ever insufficient.
  • ADR-003: Inter-service auth = Bearer + IP allowlist, opt-in via cfg.Auth.Token. Wiring deferred to Sprint 1 but the design is locked — alternatives (mTLS, JWT, OAuth2, IP-only) all considered + rejected.
  • ADR-004: Pathway memory = Mem0 versioned traces, JSONL append-only persistence, opaque json.RawMessage content. Implemented in internal/pathway/.

Today's scrum dispositions (2026-04-30)

Verbatim verdicts at reports/scrum/_evidence/2026-04-30/verdicts/. Disposition table: reports/scrum/_evidence/2026-04-30/disposition.md.

Real findings, all fixed in 0efc736:

  • B-1 (Opus+Kimi convergent): ResolveKey 3-arg API → 2-arg
  • B-2 (Opus+Kimi convergent): handleProviders direct map lookup, drop synthesis-via-Resolve
  • B-3 (Opus single, trace-verified): OllamaCloud.Chat strips ollama_cloud/ prefix correctly
  • B-4 (Opus single): Ollama done_reason surfaced to FinishReason

False positives dismissed (3, documented):

  • FP-A1: Kimi misread TestMaybeDowngrade_WithConfigList assertion
  • FP-A2: Qwen claimed nil-deref in MaybeDowngrade that doesn't exist
  • FP-C1: Opus claimed qwen3.5:latest doesn't exist on Ollama hub (it does on this box's local install)

Session frame (don't redo)

  • The Rust legacy is maintenance-only until Go reaches feature parity. Don't propose ports of components already shipped here.
  • The matrix indexer 5/5 components are shipped. Don't propose to "build the matrix indexer" — it's done.
  • The 5-loop substrate's load-bearing gate is PASSED. v3 (154a72e) showed 6/6 paraphrase recovery via Shape B. Don't propose to "test if the playbook learns" — it does.
  • Shape B is the playbook stance now. When use_playbook=true, both ApplyPlaybookBoost (re-rank in place) AND InjectPlaybookMisses (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries.
  • Boost / inject use SEPARATE thresholds. Boost stays at DefaultPlaybookMaxDistance = 0.5 (safe — only re-ranks results already in retrieval). Inject uses tighter DefaultPlaybookMaxInjectDistance = 0.20 (Shape B forces results in, so loose match cross-pollinates wrong-domain). Don't merge them.
  • Multi-coord product theory is empirically VALIDATED by run #011 (Phase 3). Per-coordinator playbook namespaces (playbook_alice etc.) with cross-coordinator handover (Bob takes Alice's contract using playbook_corpus=playbook_alice) work end-to-end including paraphrase handover. Don't propose to "test if multi-coord works" — it does.
  • Auth posture is locked per ADR-006. Non-loopback bind requires auth.token (mechanical gate at shared.Run). Operators set the token via token_env (defaults to AUTH_TOKEN) loaded by systemd EnvironmentFile=/etc/lakehouse/auth.env, NOT in the committed TOML. Internal services use AllowedIPs; external boundary uses Bearer. Token rotation is dual-token via secondary_tokens. TLS terminates at the edge (nginx/Caddy), not in-process. Don't re-litigate.
  • Shape B inject has a judge-gate substrate. InjectPlaybookMisses takes an optional InjectGate (interface) that approves each candidate before the rank insert. LLMJudgeGate (Ollama-shape /api/chat client) is the default impl; nil gate = pre-judge-gating distance-only behavior preserved for backward compat. Caller wires via SearchRequest.{JudgeURL, JudgeModel, JudgeMinRating}. Closes the lift-suite tail issues (Q6↔Q7 adjacent-query swap + Q9/Q15 paraphrase drift) at substrate level.
  • Fresh content uses two-tier indexing. Fresh resumes go to fresh_workers corpus, not the main workers index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern.
  • embedd.default_model = "nomic-embed-text-v2-moe" (475M MoE, 768d). Don't revert to nomic-embed-text (137M) "for speed": same-role-across-contracts Jaccard overlap worsened from 0.000 → 0.080 on the smaller model. Cost (~5× slower ingest) is acceptable for once-per-deploy work.
  • Inbox flow: parse + search + judge + trace. /v1/observer/inbox records the body; coordinator/driver parses via LLM (qwen2.5 format=json), runs matrix.search on the parsed query, judges top-1 against the ORIGINAL body, emits Langfuse spans through it all. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal.
  • local_judge = "qwen2.5:latest" for the lift loop — qwen3.5:latest is a vision-SSM 256K-ctx build that ran 30s/judge. Don't bump to qwen3.5 again "for cleanliness." Different model in the same tier has different cost geometry.
  • qwen3.5:latest IS available locally on this box. Opus's hub-only knowledge is a known-stale signal; the chatd_smoke uses it daily.
  • temperature is omitted for Anthropic 4.7 (handled by Request.Temperature *float64); don't re-add it.
  • chatd-smoke runs with all cloud providers disabled intentionally so the suite doesn't depend on API keys; that's why it can't catch B-3-class bugs (those need a fake-server fixture, see Sprint 0 follow-up).
  • Langfuse Go-side client lives at internal/langfuse/ with best-effort fail-open posture. URL+creds from /etc/lakehouse/langfuse.env. Don't propose to "wire Langfuse on Go side" — it's wired; multi_coord_stress is the proof.
  • vectord's source-of-truth is i.vectors, NOT the coder/hnsw graph. The Index struct holds a parallel vectors map[string][]float32 updated on every successful Add/Delete; the graph is a derived, replaceable view. safeGraphAdd/safeGraphDelete wrap the library's panic-prone ops; rebuildGraphLocked reads from i.vectors (graph-state-independent). Don't propose to "drop the side map for memory" — it's the load-bearing piece that makes Add panic-recoverable past the small-index threshold (closes the multitier_100k 277884b 96-98% fail). The prior i.ids set was folded into i.vectors keys.
  • vectord saves are coalesced async, not synchronous. cmd/vectord/main.go runs a per-index saveTask that single-flights through Persistor.Save — at most one in-flight + one pending. Add returns OK before the save completes; an Add-then-crash can lose ~1 save's worth of data, matching ADR-005's fail-open posture. Don't propose to "make saves synchronous for durability" — that re-introduces the lock-contention bottleneck (1-2.5s tail at conc=50, observed 2026-05-01) without fixing a real durability hole (in-memory state is the source of truth in flight).
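
The "at most one in-flight + one pending" save behavior from the last item can be sketched as below. This is a hypothetical illustration, not the code in cmd/vectord/main.go: the `saveTask` name is borrowed from the text, but its fields and the `Trigger` and `Wait` methods are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// saveTask coalesces save requests: one save runs at a time, and any number
// of requests arriving meanwhile collapse into a single follow-up save.
type saveTask struct {
	mu      sync.Mutex
	running bool
	pending bool
	wg      sync.WaitGroup
	save    func()
}

// Trigger requests a save and returns immediately without waiting for it.
func (t *saveTask) Trigger() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.running {
		t.pending = true // fold into the single pending slot
		return
	}
	t.running = true
	t.wg.Add(1)
	go t.loop()
}

// loop runs saves back to back until no request is pending.
func (t *saveTask) loop() {
	defer t.wg.Done()
	for {
		t.save()
		t.mu.Lock()
		if !t.pending {
			t.running = false
			t.mu.Unlock()
			return
		}
		t.pending = false
		t.mu.Unlock()
	}
}

// Wait blocks until the current run of saves has drained (demo helper).
func (t *saveTask) Wait() { t.wg.Wait() }

func main() {
	release := make(chan struct{})
	started := make(chan struct{}, 1)
	saves := 0
	task := &saveTask{save: func() {
		select {
		case started <- struct{}{}:
		default:
		}
		<-release // hold the first save open while more requests arrive
		saves++
	}}
	task.Trigger()
	<-started
	for i := 0; i < 5; i++ {
		task.Trigger() // all five collapse into one pending save
	}
	close(release)
	task.Wait()
	fmt.Println("saves:", saves) // saves: 2
}
```

Six requests produce two saves here. The second save starts after the last request, so it covers all five coalesced writes; what is not covered is a crash before that second save completes, which is the loss window described above.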

OPEN — what's not done yet

The list is intentionally short. Items move to closed when the work demands them, not on a calendar. Ordered by leverage on the active product theory (multi-coord staffing co-pilot via the 5-loop substrate), not by effort.

All 4 prior OPEN items closed (substrate or fully) in the 2026-04-30 "fix the other 4" wave. No new items pending; the substrate is in a steady state. Future items will land here as production triggers fire.


RECENT VERIFIED WAVE (2026-04-30)

05273ac..e4ee002 — 4 phases + scrum + tooling, all gate-tested.

SHA What
ec1d031 Phase 1: [models] tier config (additive, no callers migrate)
622e124 Phase 2: matrix.downgrade reads cfg.Models.WeakModels
848cbf5 Phase 3: playbook_lift harness defaults from config
05273ac Phase 4: chatd + 5 providers (1,624 LoC)
0efc736 Scrum: 4 fixes (B-1..B-4) + 2 INFOs from cross-lineage review
e4ee002 scripts/scrum_review.sh — reusable 3-lineage driver
b2e45f7 playbook_lift harness expansion + reality test #001 (7/8 lift, 87.5%)
6c02c90 scrum lift_001: 4 fixes (sleep→polling SQL probe, JUDGE_SOURCE template, -id-prefix validation, chi.Router cast)
2c71d1c ADR-005: observer fail-safe semantics
9ce067b observerd: test that locks ADR-005 5.3 (provenance recorded post-run)
e9822f0 playbook_lift v2: paraphrase pass — exposed boost-only limit (0/2 paraphrase recovery)
154a72e matrix: Shape B (InjectPlaybookMisses) — 6/6 paraphrase recovery in run #003
94fc3b6 STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination
67d1957 matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8
b13b5cd playbook_lift v4 metric: warm-top-1 re-judge → quality lift +24%/-14% (5 lifted / 13 neutral / 3 regressed)
61c7b55 multi-coord stress harness (Phase 1) — 3 coords / 3 contracts / 7-phase scenario
0fa42a0 multi-coord stress Phase 1.5 — shared-role contracts + paraphrase handover
84a32f0 multi-coord stress Phase 2 — ExcludeIDs, 200-worker swap, fresh-resume
4da32ad embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in)
e7fc63b observerd /observer/inbox + multi-coord stress phase 1c (priority-ordered events)
186d209 multi_coord_stress: LLM-parsed inbox demands (qwen2.5 format=json)
ce940f4 multi_coord_stress: judge re-rates inbox top-1 against original body — recovers OOD honesty signal
7e6431e langfuse: Go-side client + Phase 1c instrumentation
08a0867 multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 (3/3)
5d49967 multi_coord_stress: full Langfuse coverage — every phase + every call (111 observations)
68d9e55 shared: auto-emit Langfuse trace+span per HTTP request — closes OPEN #2
a2fa9a2 scrum_review: pipe diff via temp files — fixes argv overflow on large bundles
(prep) G5 cutover prep: embed_parity probe — Rust /ai/embed ↔ Go /v1/embed 5/5 cos=1.000 (both v1 and v2-moe). Verdict + drift catalog in reports/cutover/SUMMARY.md. Wire-format remap (embeddings/vectors, dimensions/dimension) is the only real cutover work; math is provably equivalent.
(probe) Reality test real_001: 10 real-shape queries from fill_events.parquet through lift harness. 8/10 cold-pass top-1 = judge-best (substrate works on real distribution). Surfaced same-client+city cross-role bleed — Shape A boost from Forklift-Operator playbook landed on CNC-Operator query, demoting the cold-pass-correct worker. Findings: reports/reality-tests/real_001_findings.md.
(fix) Cross-role gate: Role on PlaybookEntry, QueryRole on SearchRequest, gate fires in both ApplyPlaybookBoost + InjectPlaybookMisses. roleEqual handles case + plural. Backward-compat: empty role on either side = gate disabled (preserves lift suite + free-form callers). 5 new unit tests use exact real_001 distance + role values. Re-run real_002: bleed closed (Q#5 Pickers, Q#10 CNC Operator stay at cold-pass top-1; same-role lifts still fire). Closes OPEN #1. Findings: reports/reality-tests/real_002_findings.md.
(probe) Reality test real_003: 40 queries (10 fill_events rows × 4 styles — need / client_first / looking / shorthand). Confirmed shorthand-vs-shorthand bleed is real: CNC Operator shorthand recording leaked w-2404 onto Forklift Operator shorthand query (both empty role, gate disabled). Extended extractRoleFromNeed to handle client_first + looking patterns; shorthand stays empty (regex can't separate role from city without anchor). Re-run real_003b: bleed closed across all 4 styles in this dataset. 10 new sub-tests in scripts/playbook_lift/main_test.go lock the patterns + the documented shorthand limitation. Findings: reports/reality-tests/real_003_findings.md.
(fix) LLM-based role extractor (real_004): roleExtractor struct with regex-first → qwen2.5 format=json fallback → per-process cache. Opt-in via -llm-role-extract flag + LLM_ROLE_EXTRACT=1 env. Off-by-default preserves real_003b shipping config. 8 new tests including TestRoleExtractor_ClosesCrossRoleShorthandBleed — the load-bearing witness pairing with the matrix-side TestInjectPlaybookMisses_RoleGateRejectsCrossRole to prove the extraction-layer + gate-layer compose correctly on the exact real_003 failure mode. Findings: reports/reality-tests/real_004_findings.md.
(scrum) 3-lineage scrum review on 7f2f112..0331288 (Opus + Kimi + Qwen3-coder via scripts/scrum_review.sh). Convergent finding (3/3): roleNormalize plural-stripper mangled non-plural-s tokens (Sales → Sale, Logistics → Logistic). Fixed: nonPluralSWords allowlist + -ss ending check + strings.ToLower/TrimSpace cleanup. New tests TestRoleNormalize_NonPluralS + TestRoleEqual_NonPluralS lock the edge cases. Kimi 2 BLOCKs were false positives (model-truncation artifacts per feedback_cross_lineage_review.md). Disposition: reports/scrum/_evidence/2026-04-30/verdicts/role_gate_v1_disposition.md (local).
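
A minimal sketch of the fixed normalizer, assuming it lowercases, trims, and strips one trailing `s` from the last word unless that word ends in `ss` or is on the allowlist. The allowlist contents beyond Sales and Logistics, and the choice to touch only the last word, are assumptions for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// nonPluralSWords lists words that end in "s" but are not plurals.
// Only "sales" and "logistics" come from the scrum finding.
var nonPluralSWords = map[string]bool{"sales": true, "logistics": true}

// roleNormalize lowercases and trims a role, then singularizes its last word.
func roleNormalize(role string) string {
	words := strings.Fields(strings.ToLower(strings.TrimSpace(role)))
	if len(words) == 0 {
		return ""
	}
	last := words[len(words)-1]
	if strings.HasSuffix(last, "s") && !strings.HasSuffix(last, "ss") && !nonPluralSWords[last] {
		words[len(words)-1] = strings.TrimSuffix(last, "s")
	}
	return strings.Join(words, " ")
}

// roleEqual compares two roles ignoring case and plural form.
func roleEqual(a, b string) bool { return roleNormalize(a) == roleNormalize(b) }

func main() {
	fmt.Println(roleNormalize("CNC Operators")) // cnc operator
	fmt.Println(roleNormalize("Sales"))         // sales
	fmt.Println(roleNormalize("Logistics"))     // logistics
	fmt.Println(roleEqual("Pickers", " picker ")) // true
}
```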
(probe) Negation reality test real_005: 5 explicit-negation queries ("NOT in Detroit", "excluding Cornerstone roster", etc.). Confirmed substrate has zero negation handling — cosine treats "NOT X" as "X" + noise. Judge IS the safety net (Q1/Q3/Q4 rated all top-10 results 1-2/5 — operator-visible honesty signal). No code change needed: production UI should handle exclusion via ExcludeIDs (already supported, added in multi-coord stress 200-worker swap), not via NL-negation. Findings: reports/reality-tests/real_005_findings.md.
(wire-up) Multi-coord stress role wire-through: Demand.Role was already extracted at every call site (44 occurrences) but never threaded into matrix retrieve or playbook record. Cross-role gate was bypassed for the entire multi-coord harness. Fixed by extending tracedSearch, matrixSearch, and playbookRecord signatures with role string and updating all 14 call sites — passing d.Role (demand loops), parsed.Role (LLM-parsed inbox path), warehouseDemand.Role (swap path), ev.Role (reissue path), "" (fresh-verify resume snippet — no clean role). Build + vet + tests green; multi-coord stress now honors role gate end-to-end.
(close-1) OPEN #1: vectord merge endpoint. POST /v1/vectors/index/{src}/merge with body {dest, clear_source}. Idempotent on re-runs (existing-in-dest items skipped). New Index.IDs() snapshot method backs it; new i.ids tracker field is the canonical ID set (independent of meta map's nil-vs-{} sparseness). 4 cmd-level tests + 1 unit test.
(close-2) OPEN #2: distillation SFT export substrate. internal/distillation/sft_export.go: IsSftNever predicate + ListScoredRunFiles (data/scored-runs/YYYY/MM/DD walk) + LoadScoredRunsFromFile + partial ExportSft that wires the firewall but leaves synthesis (instruction/input/response generation) as the next wave. Firewall pinning test fails if SftNever set changes without review. 5 new tests. The synthesis port remains on Rust at scripts/distillation/export_sft.ts.
(close-2 full) OPEN #2 fully ported (2026-05-01): SynthesizeSft + LoadEvidenceByRunID + buildInstruction ported byte-for-byte from scripts/distillation/export_sft.ts. All 8 source-class instruction templates (scrum_reviews / mode_experiments / auto_apply / audits / observer_reviews / contract_analyses / outcomes / default) match Rust output exactly so a/b validation between runtimes can diff JSONL byte-for-byte. ExportSft writes to data/distilled/sft/sft_export.jsonl. 5 additional tests including per-source-class template verification, extraction-rejection, empty-text-rejection, context-assembly, end-to-end fixture write.
(close-2 lineage) Audit-baselines lineage ported (2026-05-01): internal/distillation/audit_baseline.go mirrors Rust audit_full.ts's LoadBaseline/AppendBaseline/buildDriftTable. LoadLastBaseline reads the most recent JSON line from data/_kb/audit_baselines.jsonl; AppendBaseline appends append-only with bufio. BuildAuditDriftTable flags drift >20% (configurable); zero-baseline and new-metric edge cases handled (no division-by-zero, no false-stable on zero→nonzero). FormatAuditDriftTable for stdout dumps. Generic on metric names so callers running both runtimes can pin Rust-compat names (AuditBaselineRustCompat constant lists them). 13 tests including last-line-wins, trailing-blank-tolerance, malformed-line-errors, threshold-boundary, zero-baseline-handling, sort-stability.
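
The drift rule described in that entry (flag changes above a threshold, never divide by a zero baseline, never call zero→nonzero stable, mark unseen metrics as new) can be sketched as follows. The `DriftRow` type, the `buildDriftTable` signature, the status strings, and the sample metrics are hypothetical and are not the shapes in internal/distillation/audit_baseline.go.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

type DriftRow struct {
	Metric   string
	Baseline float64
	Current  float64
	Status   string // "stable", "drift", or "new"
}

// buildDriftTable compares current metrics to a baseline. A metric drifts
// when its relative change is strictly greater than threshold (0.20 = 20%).
func buildDriftTable(baseline, current map[string]float64, threshold float64) []DriftRow {
	names := make([]string, 0, len(current))
	for name := range current {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic row order
	rows := make([]DriftRow, 0, len(names))
	for _, name := range names {
		cur := current[name]
		base, ok := baseline[name]
		row := DriftRow{Metric: name, Baseline: base, Current: cur, Status: "stable"}
		switch {
		case !ok:
			row.Status = "new"
		case base == 0:
			if cur != 0 {
				row.Status = "drift" // zero→nonzero is never reported as stable
			}
		case math.Abs(cur-base)/math.Abs(base) > threshold:
			row.Status = "drift"
		}
		rows = append(rows, row)
	}
	return rows
}

func main() {
	baseline := map[string]float64{"a": 100, "b": 100, "c": 0}
	current := map[string]float64{"a": 120, "b": 121, "c": 5, "d": 7}
	for _, r := range buildDriftTable(baseline, current, 0.20) {
		fmt.Printf("%s %.0f -> %.0f %s\n", r.Metric, r.Baseline, r.Current, r.Status)
	}
}
```

With these sample values, `a` sits exactly on the 20% boundary and stays stable, `b` and `c` drift, and `d` is new. Metrics present only in the baseline are not reported in this sketch.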
(scrum) 3-lineage scrum on 434f466..0d4f033 (post_role_gate_v1). Convergent finding (Opus + Kimi): DecodeIndex lost nil-meta items across persistence. Fixed by bumping envelope version 1→2 with explicit IDs []string field; v1 envelopes still load via meta-key fallback. Opus-only real bugs also actioned: handleMerge non-ErrIndexNotFound nil-deref, mathLog dead wrapper removed, bubble sort → sort.Slice. False positives rejected after verification (Kimi rollback misreading + Opus stale-comment claim). 2 new regression tests lock the v2 round-trip + v1 backward-compat. Disposition: reports/scrum/_evidence/2026-05-01/verdicts/post_role_gate_v1_disposition.md.
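A rough sketch of the v1/v2 envelope handling described in the scrum entry. The JSON field names (`version`, `ids`, `meta`) and the `decodeIDs` helper are assumptions for illustration; only the explicit ID list in v2 and the meta-key fallback for v1 come from the entry itself.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// envelope is a hypothetical persisted shape. Version 2 carries an
// explicit ID list; version 1 only implies IDs through meta keys.
type envelope struct {
	Version int                          `json:"version"`
	IDs     []string                     `json:"ids,omitempty"`
	Meta    map[string]map[string]string `json:"meta"`
}

// decodeIDs recovers the ID set from either envelope version.
func decodeIDs(raw []byte) ([]string, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, err
	}
	switch env.Version {
	case 2:
		return env.IDs, nil
	case 1:
		// Fallback: v1 can only see items that have a meta entry.
		ids := make([]string, 0, len(env.Meta))
		for id := range env.Meta {
			ids = append(ids, id)
		}
		sort.Strings(ids)
		return ids, nil
	default:
		return nil, fmt.Errorf("unsupported envelope version %d", env.Version)
	}
}

func main() {
	// "a" has no meta entry; v2 still lists it, v1 cannot.
	v2 := []byte(`{"version":2,"ids":["a","b"],"meta":{"b":{"k":"v"}}}`)
	v1 := []byte(`{"version":1,"meta":{"b":{"k":"v"}}}`)

	fmt.Println(decodeIDs(v2)) // [a b] <nil>
	fmt.Println(decodeIDs(v1)) // [b] <nil>
}
```

The v1 case shows why nil-meta items could not survive a round-trip before the version bump: nothing in the envelope recorded them.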
(audit-full port) Audit-FULL pipeline (phases 0/3/4) ported from scripts/distillation/audit_full.ts. internal/distillation/audit_full.go + cmd/audit_full CLI. 6 ported required-check classes; 5 phases (1, 2, 5, 6, 7) deferred because they depend on broader Rust pieces (materializer / replay / run-summaries) not yet ported. Cross-runtime byte-equal verdict on live data: Go-side audit-full against /home/profit/lakehouse produced p3_/p4_ metrics IDENTICAL to the last Rust-emitted audit_baselines.jsonl entry (all 8 metrics match: p3_accepted=386, p3_partial=132, p3_rejected=57, p3_human=480, p4_sft_rows=353, p4_rag_rows=448, p4_pref_pairs=83, p4_total_quarantined=1325). 6 new tests + the live-data probe captured in reports/cutover/audit_full_go_vs_rust.md.
(audit-full skips fixed) Phases 1/2/5/7 unskipped (2026-05-01): port reduced from 5 deferred phases to 1. Phase 1: invokes go test ./internal/distillation/... via exec.Command (Go equivalent of Rust's bun test). Phase 2: reads data/evidence/ and tallies rows + tier-1 source hits as an observer (doesn't re-run the materializer; emits p2_evidence_rows/p2_evidence_skips metrics). Phase 5: reads reports/distillation/{run_id}/summary.json + 5 stage receipts; validates schema_version + run_hash sha256 + git_commit hex. Phase 7: reads data/_kb/replay_runs.jsonl; tail-row JSON parse check. Only Phase 6 remains skipped (Rust acceptance.ts is a TS-only fixture harness; porting fixtures + invariant runner is its own ADR). Live-data probe: 12/12 required checks PASS, p2_evidence_rows=1055 byte-equal to Rust summary.json collect.records_out. 6 new tests.
(lets-go) Persistent Go stack live (2026-05-01). All 11 daemons (storaged/catalogd/ingestd/queryd/embedd/vectord/pathwayd/observerd/matrixd/gateway/chatd) up as long-running processes on :3110+:3211-:3220 → later moved to :4110+:4211-:4219+:3220 for smoke isolation. First time the Go side runs as production-shape daemons rather than per-harness transient processes. Brought up via scripts/cutover/start_go_stack.sh. Gateway proxies /v1/embed correctly to embedd; all 5 chatd providers loaded. First Go-side entry written to data/_kb/audit_baselines.jsonl (entry #8, git_commit=ee2a40c, golangLAKEHOUSE SHA distinguishable from Rust's ca7375ea); the longitudinal log now mixes runtimes.
(g5-slice) G5 cutover slice LIVE (2026-05-01). First real Bun-frontend traffic reaching the Go substrate end-to-end. Bun mcp-server (/home/profit/lakehouse/mcp-server/index.ts) gains opt-in /_go/* pass-through to $GO_LAKEHOUSE_URL (set to http://127.0.0.1:4110 via systemd drop-in). /_go/v1/embed returns nomic-embed-text-v2-moe vectors via Go embedd; /_go/v1/matrix/search returns 3/3 Forklift Operators against the persistent 200-worker corpus. Fully additive (no existing Bun tool modified) + fully reversible (unset env). /api/* (Rust gateway) path unchanged. See reports/cutover/g5_first_slice_live.md.
(close-3) OPEN #3: distribution drift via PSI. internal/drift/drift.go: ComputeDistributionDrift returns Population Stability Index + verdict tier (stable < 0.10, minor 0.10–0.25, major ≥ 0.25). Equal-width bucketing over combined min/max range, epsilon-clamping for empty buckets, per-bucket breakdown for drilldown. 7 new tests including identical-is-stable, hard-shift-is-major, moderate-detected-not-stable, empty-inputs-safe, all-identical-safe, bucket-counts-conserved, num-buckets-clamping.
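For reference, a small sketch of a PSI computation along the lines described in close-3. It uses the standard definition PSI = Σ (a_i − e_i) · ln(a_i / e_i), where e_i and a_i are the baseline and current bucket fractions. The function names, the epsilon value, the minimum bucket count, and the zero return for empty input are assumptions, not details confirmed by drift.go.

```go
package main

import (
	"fmt"
	"math"
)

// epsilon replaces empty-bucket fractions so the log term stays
// finite. The value is an assumption for this sketch.
const epsilon = 1e-6

// fractions buckets xs into n equal-width bins over [lo, hi] and
// returns per-bucket fractions, clamped to at least epsilon.
func fractions(xs []float64, lo, hi float64, n int) []float64 {
	out := make([]float64, n)
	width := (hi - lo) / float64(n)
	for _, x := range xs {
		b := 0
		if width > 0 {
			b = int((x - lo) / width)
			if b >= n {
				b = n - 1 // the maximum value lands in the last bucket
			}
		}
		out[b]++
	}
	for i := range out {
		out[i] = math.Max(out[i]/float64(len(xs)), epsilon)
	}
	return out
}

// Verdict maps a PSI value onto the three tiers.
func Verdict(psi float64) string {
	switch {
	case psi < 0.10:
		return "stable"
	case psi < 0.25:
		return "minor"
	default:
		return "major"
	}
}

// PSI computes the Population Stability Index of current against
// baseline. Empty inputs return 0 here (an assumed convention).
func PSI(baseline, current []float64, numBuckets int) float64 {
	if len(baseline) == 0 || len(current) == 0 {
		return 0
	}
	if numBuckets < 2 {
		numBuckets = 2
	}
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, xs := range [][]float64{baseline, current} {
		for _, x := range xs {
			lo, hi = math.Min(lo, x), math.Max(hi, x)
		}
	}
	e := fractions(baseline, lo, hi, numBuckets)
	a := fractions(current, lo, hi, numBuckets)
	psi := 0.0
	for i := range e {
		psi += (a[i] - e[i]) * math.Log(a[i]/e[i])
	}
	return psi
}

func main() {
	same := []float64{1, 2, 3, 4, 5, 6, 7, 8}
	shifted := []float64{101, 102, 103, 104, 105, 106, 107, 108}
	fmt.Println(Verdict(PSI(same, same, 4)))    // stable
	fmt.Println(Verdict(PSI(same, shifted, 4))) // major
}
```

Bucketing both samples over the combined range is what lets a hard shift show up as mass moving between buckets instead of falling outside the baseline's bins.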
(close-4) OPEN #4: ops nice-to-haves — (a) Real-time wall-clock for stress harness: per-phase elapsed time logged to stdout as it runs ([stress] phase NAME starting (T+12.3s) + [stress] phase NAME done — 8.5s (T+20.8s)); Output.PhaseTimings + Output.TotalElapsedMs written to JSON; (b) chatd fixture-mode S3 mock + (c) liberal-paraphrase calibration: not actioned — no fired trigger yet, would be speculative. Documented as deferred-until-need rather than ignored.
(close-bug) coder/hnsw v0.6.1 panic — REAL FIX landed (2026-05-01 ~22:25). The 277884b multitier_100k run hit 96-98% fail on 2/6 scenarios from a v0.6.1 nil-deref (layerNode.search) that fires when the graph transitions through degenerate states post-Delete. Initial recover() guard caught panics but returned errors at the same rate. Real fix: lift the source-of-truth out of coder/hnsw — i.vectors map[string][]float32 side store maintained alongside the graph, panic-safe safeGraphAdd/safeGraphDelete wrappers, rebuildGraphLocked reads from i.vectors (independent of graph state), warm-path Add falls back to rebuild on panic. Side effect: i.ids collapsed into i.vectors keys; Len() reads from len(i.vectors). Memory cost: ~2x for vectors. Verification: 7 new regression tests in index_test.go (TestAdd_PastThreshold_SustainedReAdd reproduces the multitier shape — 64-entry index, 800 upserts, 0 errors), just verify PASS, multitier_100k re-run on persistent stack 19,622 scenarios / 0 failures across all 6 classes. p50 on previously-failing scenarios went 5ms (instant fail) → 551ms (real Add work — honest cost of correctness).
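The shape of that fix can be sketched roughly as below. This does not use the coder/hnsw API: the `graph` interface, `flakyGraph`, and the `Index` fields are hypothetical stand-ins, and locking is omitted. The point is only that the side store, not the graph, is the source of truth, so a panicking graph can be rebuilt from it.

```go
package main

import "fmt"

// graph is a hypothetical stand-in for the ANN graph.
type graph interface {
	Add(id string, vec []float32)
}

// flakyGraph panics on its first Add to simulate a degenerate state.
type flakyGraph struct {
	n         int
	panicOnce bool
}

func (g *flakyGraph) Add(id string, vec []float32) {
	if g.panicOnce {
		g.panicOnce = false
		panic("simulated nil deref in search")
	}
	g.n++
}

// safeGraphAdd converts a panic inside the graph into an error.
func safeGraphAdd(g graph, id string, vec []float32) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("graph add panicked: %v", r)
		}
	}()
	g.Add(id, vec)
	return nil
}

// Index keeps vectors in a side store that is independent of graph
// state, so the graph can always be rebuilt from scratch.
type Index struct {
	vectors  map[string][]float32
	g        graph
	newGraph func() graph
}

// Add records the vector first, then tries the warm path; if the
// graph panics, it falls back to a full rebuild.
func (i *Index) Add(id string, vec []float32) error {
	i.vectors[id] = vec
	if err := safeGraphAdd(i.g, id, vec); err == nil {
		return nil
	}
	return i.rebuild()
}

// rebuild constructs a fresh graph from the side store only.
func (i *Index) rebuild() error {
	fresh := i.newGraph()
	for id, vec := range i.vectors {
		if err := safeGraphAdd(fresh, id, vec); err != nil {
			return err
		}
	}
	i.g = fresh
	return nil
}

// Len reads from the side store, not the graph.
func (i *Index) Len() int { return len(i.vectors) }

func main() {
	idx := &Index{
		vectors:  make(map[string][]float32),
		g:        &flakyGraph{panicOnce: true},
		newGraph: func() graph { return &flakyGraph{} },
	}
	err := idx.Add("w1", []float32{0.1, 0.2})
	fmt.Println(err, idx.Len()) // <nil> 1
}
```

Holding every vector twice is where the roughly 2x memory cost comes from.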
(perf-fix) Save coalescing: write-path lock contention closed (2026-05-01 ~22:50). The panic fix exposed a second bottleneck: every successful Add called Persistor.Save synchronously, which takes the index RLock for Encode (~6MB JSON for 1942-entry × 768d), blocking concurrent Add Lock acquisitions. 5min sustained run showed playbook scenario p50 climbing 551ms→1398ms as the index grew. Fix: saveTask per-index single-flight coalescer in cmd/vectord/main.go. saveAfter now triggers an async save; concurrent triggers during an in-flight save mark "pending" so N triggers collapse into ≤2 actual saves. RPO trade: Add returns OK before save completes (~1 save's worth of crash-loss exposure; same fail-open posture as ADR-005). Verification: 3 new tests in cmd/vectord/main_test.go (50-trigger pile-up → 2 saves; single → 1; error doesn't stall). Re-run: surge_fill_validate p50 1296ms→47ms (~28× faster), playbook_record_replay 1398ms→385ms (~3.6× faster), throughput 144→668 scen/sec at 0% fail. Restart-rehydrate verified: playbook_memory 4041 entries persisted to MinIO and round-tripped cleanly.
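A minimal sketch of a single-flight save coalescer of the kind described above. The `saveTask` name comes from the entry, but its fields and the `Trigger`/`Wait` methods are assumptions for illustration and may differ from cmd/vectord/main.go.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// saveTask coalesces save triggers: at most one save runs at a time,
// and any triggers that arrive during it collapse into one follow-up.
type saveTask struct {
	mu      sync.Mutex
	running bool
	pending bool
	save    func() error
	wg      sync.WaitGroup
}

// Trigger requests a save and returns immediately.
func (t *saveTask) Trigger() {
	t.mu.Lock()
	if t.running {
		t.pending = true // an in-flight save exists; ask for one more pass
		t.mu.Unlock()
		return
	}
	t.running = true
	t.wg.Add(1)
	t.mu.Unlock()
	go t.loop()
}

// loop runs saves until no trigger arrived during the last one.
func (t *saveTask) loop() {
	defer t.wg.Done()
	for {
		_ = t.save() // an error does not stall the loop; a later trigger retries
		t.mu.Lock()
		if !t.pending {
			t.running = false
			t.mu.Unlock()
			return
		}
		t.pending = false
		t.mu.Unlock()
	}
}

// Wait blocks until the current save loop finishes. Call it only
// after all triggers have been issued.
func (t *saveTask) Wait() { t.wg.Wait() }

func main() {
	var saves int32
	gate := make(chan struct{}) // holds the first save open during the pile-up
	task := &saveTask{save: func() error {
		atomic.AddInt32(&saves, 1)
		<-gate
		return nil
	}}

	for i := 0; i < 50; i++ {
		task.Trigger()
	}
	close(gate)
	task.Wait()
	fmt.Println(atomic.LoadInt32(&saves)) // 2
}
```

The second save is what bounds the exposure: it picks up every change made while the first save was encoding, so 50 triggers cost two saves rather than 50.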

Plus on Rust side (8de94eb, 3d06868): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).


RUNTIME CHEATSHEET

# Verify everything green
cd /home/profit/golangLAKEHOUSE
just verify                                            # vet + tests + 9 core smokes (~31s)
just doctor                                            # dep probe (go/gcc/minio/ollama/secrets)

# Boot the chat dispatcher (Phase 4)
nohup ./bin/chatd   -config lakehouse.toml > /tmp/chatd.log   2>&1 & disown
nohup ./bin/gateway -config lakehouse.toml > /tmp/gateway.log 2>&1 & disown
curl -sf http://127.0.0.1:3110/v1/chat/providers | jq  # all 5 providers should report true

# Test a chat call to each lineage
for m in "qwen3.5:latest" "opencode/claude-opus-4-7" "openrouter/moonshotai/kimi-k2-0905"; do
  curl -sS -X POST http://127.0.0.1:3110/v1/chat \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$m\",\"messages\":[{\"role\":\"user\",\"content\":\"reply: OK\"}],\"max_tokens\":8}" \
    | jq -c '{model,provider,content}'
done

# Run the scrum on a diff
./scripts/scrum_review.sh path/to/bundle.diff bundle_label
ls reports/scrum/_evidence/$(date +%Y-%m-%d)/verdicts/

# Domain smokes (not in `just verify`)
for s in chatd matrix observer pathway playbook relevance downgrade workflow; do
  bash scripts/${s}_smoke.sh > /tmp/${s}.log 2>&1 && echo "$s ✓" || echo "$s ✗"
done

VISION — what we're actually building

J's framing (canonical at /root/.claude/projects/-home-profit/memory/project_small_model_pipeline_vision.md): a small-model-driven autonomous pipeline that gets better with each run. Frontier APIs (Opus, Kimi, GPT-5) are too expensive + rate-limited for the inner loop — they live in audit/oversight via frontier_* tier. The hot path runs on local qwen3.5:latest given:

  1. Pathway memory — what we tried before, how it went (Mem0 substrate ✓)
  2. Matrix indexer — multi-corpus retrieve+merge giving the small model the right slice for this task (5/5 components ✓)
  3. Observer — watches each run, refines configs (not prompts) toward good pathways

Successful runs get rated and distilled back into the playbook. Each iteration the playbook gets denser, runs get cheaper, results get better. Drift in the distilled playbook is a measured signal, not vibes.

The single load-bearing gate: "the playbook + matrix indexer must give the results we're looking for." Throughput, scaling, code elegance are all secondary. The playbook_lift reality test is the regression gate before Enterprise cutover (where real contracts + live profile updates land).

When evaluating any Go workstream, ask: which of the 5 loops does this advance? Strong workstreams advance ≥1; weak workstreams sit in infra-for-its-own-sake.


SIBLING TOOLS (separate repos, intentional integration target later)

local-review-harness at git.agentview.dev/profit/local-review-harness (also SMB-mounted at /home/profit/share/local-review-harness-full-md/). Local-first code review harness: 12 evidence-bearing static analyzers, Scrum-style reports, no cloud deps. Phase A + B (MVP) shipped 2026-04-30. Phases C–E (Ollama LLM review, validation, memory) pending.

Cross-pollination plan when both stabilize:

  • Replace harness's internal/llm/ollama.go with a chatd /v1/chat client → frontier judges via config toggle
  • Feed harness findings into Lakehouse pathway memory as a drift signal
  • Treat harness's .memory/known-risks.json as a matrix-indexer corpus

Detail at docs/SPEC.md §3.10. Don't re-port harness functionality into Lakehouse-Go — the standalone tool is the design.