golangLAKEHOUSE/STATE_OF_PLAY.md

# STATE OF PLAY — Lakehouse-Go

**Last verified:** 2026-04-30 ~16:42 CDT
**Verified by:** live probes + `just verify` PASS + multi-coord stress run #011 (full 9-phase scenario, 67 captured events, 1 Langfuse trace + 111 child observations covering every phase + every external call), not memory.

> **Read this FIRST.** When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at `/home/profit/lakehouse/` is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.

---

## VERIFIED WORKING RIGHT NOW

### Substrate (G0 + G1 family)

13 service binaries under `cmd/` plus 2 driver scripts under `scripts/staffing_*` build into `bin/`. **18 smoke scripts all PASS.** `just verify` (vet + 30 packages × short tests + 9 core smokes) green in ~31s wall.

| Binary | Port | What |
|---|---|---|
| `gateway` | 3110 | reverse proxy, single OpenAI-compat-style edge |
| `storaged` | 3211 | S3 GET/PUT/LIST/DELETE w/ per-prefix PUT cap (ADR-002) |
| `catalogd` | 3212 | Parquet manifests, ADR-020 idempotent register |
| `ingestd` | 3213 | CSV → Parquet → catalogd, content-addressed keys |
| `queryd` | 3214 | DuckDB SELECT over Parquet via httpfs |
| `vectord` | 3215 | HNSW indexes (coder/hnsw), persistence to storaged |
| `embedd` | 3216 | Ollama-backed embedder w/ LRU cache |
| `pathwayd` | 3217 | Mem0 ops (Add/Update/Revise/Retire/History/Search) |
| `matrixd` | 3218 | Multi-corpus retrieve+merge + relevance + downgrade + playbook |
| `observerd` | 3219 | Witness loop, workflow runner with DAG executor |
| `chatd` | 3220 | LLM dispatcher: ollama / ollama_cloud / openrouter / opencode / kimi |
| `mcpd` | — | MCP SDK port (Bun mcp-server replacement) |
| `fake_ollama` | — | Test fixture (used by `g2_smoke_fixtures.sh`) |

### Matrix indexer — all 5 SPEC §3.4 components shipped

1. **Corpus builders** (`internal/corpusingest`)
2. **Multi-corpus retrieve+merge** (`matrixd /matrix/search`)
3. **Relevance filter** (`internal/matrix/relevance.go` 376 LoC + 289 LoC test)
4. **Strong-model downgrade gate** (`internal/matrix/downgrade.go`, reads `cfg.Models.WeakModels` after Phase 2)
5. **Playbook memory: boost + Shape B inject** (`internal/matrix/playbook.go`, learning loop). Shape B (`InjectPlaybookMisses`, `154a72e`) appends recorded answers to results when regular retrieval misses them — closed the paraphrase recovery gap exposed by run #002 (0/2) and validated by run #003 (6/6).

### Pathway memory (Mem0 substrate)

Full ADR-004 surface shipped. **Cycle-detection + retired-trace exclusion proven by tests:** `TestHistory_CycleDetected`, `TestRetire_ExcludedFromSearch`, `TestRevise_ChainOfThree_BackwardWalk`. JSONL append-only persistence with corruption tolerance.

### Observer + workflow runner

- `observerd` ring buffer + JSONL persistence
- Workflow DAG executor (Archon-style) with 5 native modes wired: `matrix.relevance`, `matrix.downgrade`, `matrix.search`, `distillation.score`, `drift.scorer`. Plus `fixture.echo` / `fixture.upper` for runner mechanics smokes.

### Distillation + drift

- **E (partial)** at `57d0df1` — scorer + contamination firewall ported from Rust v1.0.0 (logic only per ADR-001 §1.4; not bit-identical).
- **F (first slice)** at `be65f85` — drift quantification, scorer drift first.

### chatd — Phase 4 (shipped 2026-04-30, scrum-hardened same day)

Multi-provider LLM dispatcher routing `/v1/chat` by model-name prefix or `:cloud` suffix:

| Prefix / suffix | Provider | Auth |
|---|---|---|
| `ollama/<m>` or bare | `ollama` (local) | none |
| `ollama_cloud/<m>` or `<m>:cloud` | `ollama_cloud` | Bearer (OLLAMA_CLOUD_KEY) |
| `openrouter/<v>/<m>` | `openrouter` | Bearer (OPENROUTER_API_KEY) |
| `opencode/<m>` | `opencode` | Bearer (OPENCODE_API_KEY) |
| `kimi/<m>` | `kimi` | Bearer (KIMI_API_KEY) |

All 5 keys live in `/etc/lakehouse/{ollama_cloud,openrouter,opencode,kimi}.env` files (mode 0600). Empty/missing files leave that provider unregistered (404 at first call instead of 503). Test request: `POST /v1/chat {"model":"opencode/claude-opus-4-7","messages":[{"role":"user","content":"hi"}],"max_tokens":8}`.

`Request.Temperature` is `*float64` (pointer) — Anthropic 4.7 deprecates `temperature` entirely, so we omit the field when caller doesn't set it.

### Model tier registry

`lakehouse.toml [models]` names model IDs by tier so swaps are 1-line:

```toml
local_fast       = "qwen3.5:latest"
local_judge      = "qwen2.5:latest"   # NOT qwen3.5 — vision-SSM 256K-ctx ran 30s/judge in lift loop
cloud_judge      = "kimi-k2.6:cloud"
cloud_review     = "qwen3-coder:480b"
frontier_review  = "openrouter/anthropic/claude-opus-4-7"
frontier_arch    = "openrouter/moonshotai/kimi-k2-0905"
frontier_free    = "opencode/claude-opus-4-7"
weak_models      = ["qwen3.5:latest", "qwen3:latest"]   # matrix.downgrade bypass
```

Callers read `cfg.Models.LocalJudge` etc. instead of literal strings. `playbook_lift` harness, `matrix.downgrade`, and observerd's `MatrixDowngradeWithWeakList` factory all migrated.

### Code health

- `go vet ./...` → **0 warnings, 0 errors**
- `go test -short ./...` → **all green**, 349 test functions
- `just verify` → PASS (vet + tests + 9 smokes) in ~31s
- 18 smoke scripts (9 core gating verify + 9 domain smokes for new daemons)

### Latest scrum: 2026-04-30 cross-lineage wave

Composite **50/60** at scrum2 head `c7e3124` (was 35 baseline → 43 R1 → 50 R2). Today's chatd wave reviewed by Opus + Kimi + Qwen3-coder via the chatd's own `/v1/chat`; **2 BLOCKs + 2 WARNs landed as fixes** (`0efc736`); reusable driver at `scripts/scrum_review.sh`.

### Reality tests #001–#003 — load-bearing gate verified (2026-04-30 ~05:50–07:05 CDT)

The 5-loop substrate's load-bearing gate (per `project_small_model_pipeline_vision.md`: *"the playbook + matrix indexer must give the results we're looking for"*) is verified for both **verbatim replay** and **paraphrase queries**.

| Run | Stance | Verbatim lift | Paraphrase recovery | What it proved |
|---|---|---|---|---|
| `playbook_lift_001` | boost-only | **7/8 (87.5%)** | not tested | Cosine + boost re-rank works for verbatim replay. Substrate live. |
| `playbook_lift_002` | boost-only | 2/2 | **0/2** | Boost can't promote answers OUT of regular top-K — paraphrase gap exposed. |
| `playbook_lift_003` | Shape B (loose 0.5) | 2/6 | 6/6 → top-1 | Shape B injects, but cross-pollinates: w-4435 surfaces as warm top-1 for unrelated OOD queries (dental/RN/SWE). |
| `playbook_lift_004` | **Shape B + split threshold (0.5 boost / 0.20 inject)** | **6/8 (75%)** | **6/8 (75%)** | OOD cross-pollination GONE; system refuses to inject when it's not confident. The honest configuration. |

**Shape B** (`InjectPlaybookMisses` in `internal/matrix/playbook.go`): when warm-pass retrieval doesn't already include a playbook hit's answer, append a synthetic Result with distance = `playbook_hit_distance × BoostFactor`. Caller re-sorts + truncates. Documented at `playbook.go:22-27` since v0; v3 shipped the implementation. v4 added the split-threshold defense (`DefaultPlaybookMaxInjectDistance = 0.20` while boost stays at 0.50) — boost is safe at loose thresholds because it only re-ranks results already in retrieval; inject is structurally riskier so its threshold is tighter.

OOD honesty (dental hygienist / RN / software engineer queries) holds across all three runs — judge rates them 1, system doesn't fabricate matches. Cross-corpus boosts (e- ↔ w- swaps) confirmed in v1 + v3.

Evidence: `reports/reality-tests/playbook_lift_{001,002,003}.{json,md}`. Per the report's rubric (lift ≥ 50% = matrix doing real work), 6/6 paraphrase recovery is the validation that matters — verbatim replay is structurally the easy case.

**v3 → v4 is the configuration evolution.** v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.

### Multi-coordinator stress test (Phase 1 → 3) — VERIFIED end-to-end

Reality test #2 catalog. New harness `scripts/multi_coord_stress.{sh,go}` simulates 3 coordinators (alice/bob/carol) handling 3 distinct contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction), each with their own playbook namespace. 9-phase operational narrative across simulated Hours 0–48: baseline → fresh-resume injection → inbox burst → mid-day surge → 200-worker swap → contract merge → handover (verbatim + paraphrase) → split → reissue.

| Capability | Verified | Where |
|---|---|---|
| Per-coordinator playbook isolation | ✓ | `playbook_alice` / `playbook_bob` / `playbook_carol` corpora |
| Same-role-across-contracts diversity | Jaccard 0.026 (n=9) — 97% workers differ per region | Phase 1 baseline |
| Different-roles-same-contract diversity | Jaccard 0.070 (n=18) — 93% differ per role | Phase 1 baseline |
| HNSW retrieval determinism | Jaccard 1.000 (n=12) | Phase 6 reissue |
| Verbatim handover (Bob runs Alice's queries with Alice's playbook) | 4/4 | Phase 4 |
| Paraphrase handover (Bob runs qwen2.5-paraphrased queries) | 4/4 | Phase 4b |
| 200-worker swap with `ExcludeIDs` | Jaccard 0.000 — 8/8 placed workers fully replaced | Phase 2b |
| Fresh-resume injection (two-tier `fresh_workers` index) | 3/3 fresh workers at top-1 | Phase 1b |
| Inbox endpoint `/v1/observer/inbox` (email + SMS, priority weighting) | 6/6 events recorded | Phase 1c |
| LLM demand parsing (qwen2.5 format=json on inbox bodies) | 6/6 parsed cleanly into structured `{role, count, location, certs, skills, shift}` | Phase 1c |
| Judge re-rates inbox top-1 against ORIGINAL body | catches tight-distance-but-wrong (Q3 crane case: dist 0.23 → rating 1) | Phase 1c |
| Langfuse Go-side tracing | 111 observations on a single run trace, browseable at http://localhost:3001 | Run #011 |

**Substrate gains added by this wave:**
- `internal/matrix/playbook.go` Shape B + split inject threshold (commit `67d1957` from earlier wave; verified in multi-coord too)
- `internal/matrix/retrieve.go` `ExcludeIDs` field on `SearchRequest` — filters worker IDs at retrieve, boost, AND inject (so excluded answers can't sneak back via playbook). Real-world driver: coordinator placed N workers, client asks for replacements.
- `internal/observer/types.go` `SourceInbox` taxonomy alongside `SourceMCP / SourceScenario / SourceWorkflow`
- `cmd/observerd` `POST /observer/inbox` route — accepts `{type, sender, subject, body, priority, tag}` and records as `ObservedOp`. Type must be `email` or `sms`; body required; priority defaults to medium.
- `internal/langfuse/client.go` — minimal Go-side Trace+Span client, best-effort posture (logs on error, never blocks calling path; same fail-open semantics as ADR-005 Decision 5.1).
- `embedd default_model` bumped from `nomic-embed-text` (137M) → `nomic-embed-text-v2-moe` (475M, MoE, drop-in 768d). Same-role-across-contracts diversity went from 0.080 → 0.000 with the upgrade.
- Two-tier index pattern: fresh content goes to `fresh_workers` (a small "hot" corpus); main queries include it in the corpora list. Solves the HNSW post-build add recall issue (incremental adds to a populated graph land in poorly-connected regions and disappear from search). Canonical NRT pattern; Lucene works the same way.

### Harness expansion (2026-04-30 ~05:30 CDT)

`scripts/playbook_lift.sh` rewritten from a 5-daemon stripped harness to the **full 10-daemon prod-realistic stack** (chatd stays up independently). The 5-daemon version was structurally hiding bugs; expanding the daemon set surfaced 7 distinct fixes:

| # | Fix | Lock |
|---|---|---|
| 1 | driver→matrixd: `query` → `query_text` field name | `cmd/matrixd/main_test.go` TestPlaybookRecord_OldFieldNameRejected |
| 2 | harness toml missing `[s3]` block | inline comment in `scripts/playbook_lift.sh` |
| 3 | harness→queryd: `q` → `sql` field name | `cmd/queryd/main_test.go` TestHandleSQL_WrongFieldName_400 |
| 4 | 5→10 daemon boot order | inline comment + dep-ordered launch |
| 5 | SQL surface probe (3-row CSV → COUNT=3) | `[lift] ✓ SQL surface probe passed` assertion |
| 6 | `candidates` corpus was SWE-tech, not staffing | swapped to `ethereal_workers.parquet` (10K rows, real staffing schema, "e-" id prefix) |
| 7 | `qwen3.5:latest` is vision-SSM 256K-ctx → 30s/judge | reverted `local_judge` to `qwen2.5:latest` (1s/judge, 30× faster) |

### R-005 closed (2026-04-30 ~05:35 CDT)

Four new `cmd/<bin>/main_test.go` files — chi router-level contract tests:

- `cmd/matrixd/main_test.go` (123 lines) — playbook record drift detector + score bounds + 6 routes mounted
- `cmd/queryd/main_test.go` (extended) — wrong-field-name drift detector
- `cmd/pathwayd/main_test.go` (102 lines) — 9 routes + add round-trip + retire-nonexistent
- `cmd/observerd/main_test.go` (98 lines) — 4 routes + invalid-op 400 + unknown-mode 400

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green. R-005 from prior STATE OPEN list is closed.

---

## DO NOT RELITIGATE

### Ratified ADRs (`docs/DECISIONS.md`)

- **ADR-001**: DuckDB via cgo, HTMX UI, Gitea hosting, distillation rebuilt-not-ported, pathway memory clean start, auditor longitudinal signal restarts. **6 sub-decisions, all final.**
- **ADR-002**: storaged per-prefix PUT cap (4 GiB for `_vectors/`, 256 MiB elsewhere) — implemented at `423a381`. Operator-config bump rather than constant change is the documented path if 4 GiB ever insufficient.
- **ADR-003**: Inter-service auth = Bearer + IP allowlist, opt-in via `cfg.Auth.Token`. Wiring deferred to Sprint 1 but **the design is locked** — alternatives (mTLS, JWT, OAuth2, IP-only) all considered + rejected.
- **ADR-004**: Pathway memory = Mem0 versioned traces, JSONL append-only persistence, opaque `json.RawMessage` content. Implemented in `internal/pathway/`.

### Today's scrum dispositions (2026-04-30)

Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition table: `reports/scrum/_evidence/2026-04-30/disposition.md`.

**Real findings, all fixed in `0efc736`:**
- B-1 (Opus+Kimi convergent): `ResolveKey` 3-arg API → 2-arg
- B-2 (Opus+Kimi convergent): `handleProviders` direct map lookup, drop synthesis-via-Resolve
- B-3 (Opus single, trace-verified): `OllamaCloud.Chat` strips `ollama_cloud/` prefix correctly
- B-4 (Opus single): Ollama `done_reason` surfaced to FinishReason

**False positives dismissed (3, documented):**
- FP-A1: Kimi misread `TestMaybeDowngrade_WithConfigList` assertion
- FP-A2: Qwen claimed nil-deref in `MaybeDowngrade` that doesn't exist
- FP-C1: Opus claimed `qwen3.5:latest` doesn't exist on Ollama hub (it does on this box's local install)

### Session frame (don't redo)

- The Rust legacy is **maintenance-only** until Go reaches feature parity. Don't propose ports of components already shipped here.
- The matrix indexer **5/5 components** are shipped. Don't propose to "build the matrix indexer" — it's done.
- The 5-loop substrate's load-bearing gate is **PASSED**. v3 (`154a72e`) showed 6/6 paraphrase recovery via Shape B. Don't propose to "test if the playbook learns" — it does.
- **Shape B is the playbook stance now.** When `use_playbook=true`, both `ApplyPlaybookBoost` (re-rank in place) AND `InjectPlaybookMisses` (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries.
- **Boost / inject use SEPARATE thresholds.** Boost stays at `DefaultPlaybookMaxDistance = 0.5` (safe — only re-ranks results already in retrieval). Inject uses tighter `DefaultPlaybookMaxInjectDistance = 0.20` (Shape B forces results in, so loose match cross-pollinates wrong-domain). Don't merge them.
- **Multi-coord product theory is empirically VALIDATED** by run #011 (Phase 3). Per-coordinator playbook namespaces (`playbook_alice` etc.) with cross-coordinator handover (Bob takes Alice's contract using `playbook_corpus=playbook_alice`) work end-to-end including paraphrase handover. Don't propose to "test if multi-coord works" — it does.
- **Auth posture is locked per ADR-006.** Non-loopback bind requires `auth.token` (mechanical gate at `shared.Run`). Operators set the token via `token_env` (defaults to `AUTH_TOKEN`) loaded by systemd `EnvironmentFile=/etc/lakehouse/auth.env`, NOT in the committed TOML. Internal services use `AllowedIPs`; external boundary uses Bearer. Token rotation is dual-token via `secondary_tokens`. TLS terminates at the edge (nginx/Caddy), not in-process. Don't re-litigate.
- **Shape B inject has a judge-gate substrate.** `InjectPlaybookMisses` takes an optional `InjectGate` (interface) that approves each candidate before the rank insert. `LLMJudgeGate` (Ollama-shape /api/chat client) is the default impl; nil gate = pre-judge-gating distance-only behavior preserved for backward compat. Caller wires via `SearchRequest.{JudgeURL, JudgeModel, JudgeMinRating}`. Closes the lift-suite tail issues (Q6↔Q7 adjacent-query swap + Q9/Q15 paraphrase drift) at substrate level.
- **Fresh content uses two-tier indexing.** Fresh resumes go to `fresh_workers` corpus, not the main `workers` index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern.
- **`embedd.default_model = "nomic-embed-text-v2-moe"`** (475M MoE, 768d). Don't bump to `nomic-embed-text` (137M) "for speed" — diversity scores fell from 0.000 → 0.080 same-role-across-contracts on the smaller model. Cost (~5× slower ingest) is acceptable for once-per-deploy work.
- **Inbox flow: parse + search + judge + trace.** `/v1/observer/inbox` records the body; coordinator/driver parses via LLM (qwen2.5 format=json), runs `matrix.search` on the parsed query, judges top-1 against the ORIGINAL body, emits Langfuse spans through it all. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal.
- `local_judge = "qwen2.5:latest"` for the lift loop — qwen3.5:latest is a vision-SSM 256K-ctx build that ran 30s/judge. Don't bump to qwen3.5 again "for cleanliness." Different model in the same tier has different cost geometry.
- `qwen3.5:latest` IS available locally on this box. Opus's hub-only knowledge is a known-stale signal; the chatd_smoke uses it daily.
- `temperature` is **omitted** for Anthropic 4.7 (handled by `Request.Temperature *float64`); don't re-add it.
- chatd-smoke runs with **all cloud providers disabled** intentionally so the suite doesn't depend on API keys; that's why it can't catch B-3-class bugs (those need a fake-server fixture, see Sprint 0 follow-up).
- **Langfuse Go-side client lives at `internal/langfuse/`** with best-effort fail-open posture. URL+creds from `/etc/lakehouse/langfuse.env`. Don't propose to "wire Langfuse on Go side" — it's wired; multi_coord_stress is the proof.

---

## OPEN — what's not done yet

The list is intentionally short. Items move to closed when the work demands them, not on a calendar. Ordered by leverage on the active product theory (multi-coord staffing co-pilot via the 5-loop substrate), not by effort.

**All 4 prior OPEN items closed (substrate or fully) in the 2026-04-30
"fix the other 4" wave.** No new items pending; the substrate is in
a steady state. Future items will land here as production triggers fire.

---

## RECENT VERIFIED WAVE (2026-04-30)

`05273ac..e4ee002` — 4 phases + scrum + tooling, all gate-tested.

| SHA | What |
|---|---|
| `ec1d031` | Phase 1: `[models]` tier config (additive, no callers migrate) |
| `622e124` | Phase 2: `matrix.downgrade` reads `cfg.Models.WeakModels` |
| `848cbf5` | Phase 3: `playbook_lift` harness defaults from config |
| `05273ac` | Phase 4: chatd + 5 providers (1,624 LoC) |
| `0efc736` | Scrum: 4 fixes (B-1..B-4) + 2 INFOs from cross-lineage review |
| `e4ee002` | `scripts/scrum_review.sh` — reusable 3-lineage driver |
| `b2e45f7` | playbook_lift harness expansion + reality test #001 (7/8 lift, 87.5%) |
| `6c02c90` | scrum lift_001: 4 fixes (sleep→polling SQL probe, JUDGE_SOURCE template, -id-prefix validation, chi.Router cast) |
| `2c71d1c` | ADR-005: observer fail-safe semantics |
| `9ce067b` | observerd: test that locks ADR-005 5.3 (provenance recorded post-run) |
| `e9822f0` | playbook_lift v2: paraphrase pass — exposed boost-only limit (0/2 paraphrase recovery) |
| `154a72e` | matrix: Shape B (`InjectPlaybookMisses`) — 6/6 paraphrase recovery in run #003 |
| `94fc3b6` | STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination |
| `67d1957` | matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8 |
| `b13b5cd` | playbook_lift v4 metric: warm-top-1 re-judge → quality lift +24%/-14% (5 lifted / 13 neutral / 3 regressed) |
| `61c7b55` | multi-coord stress harness (Phase 1) — 3 coords / 3 contracts / 7-phase scenario |
| `0fa42a0` | multi-coord stress Phase 1.5 — shared-role contracts + paraphrase handover |
| `84a32f0` | multi-coord stress Phase 2 — `ExcludeIDs`, 200-worker swap, fresh-resume |
| `4da32ad` | embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in) |
| `e7fc63b` | observerd `/observer/inbox` + multi-coord stress phase 1c (priority-ordered events) |
| `186d209` | multi_coord_stress: LLM-parsed inbox demands (qwen2.5 format=json) |
| `ce940f4` | multi_coord_stress: judge re-rates inbox top-1 against original body — recovers OOD honesty signal |
| `7e6431e` | langfuse: Go-side client + Phase 1c instrumentation |
| `08a0867` | multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 (3/3) |
| `5d49967` | multi_coord_stress: full Langfuse coverage — every phase + every call (111 observations) |
| `68d9e55` | shared: auto-emit Langfuse trace+span per HTTP request — closes OPEN #2 |
| `a2fa9a2` | scrum_review: pipe diff via temp files — fixes argv overflow on large bundles |
| (prep)    | G5 cutover prep: `embed_parity` probe — Rust `/ai/embed` ↔ Go `/v1/embed` 5/5 cos=1.000 (both v1 and v2-moe). Verdict + drift catalog in `reports/cutover/SUMMARY.md`. Wire-format remap (`embeddings`/`vectors`, `dimensions`/`dimension`) is the only real cutover work; math is provably equivalent. |
| (probe)   | Reality test real_001: 10 real-shape queries from `fill_events.parquet` through lift harness. 8/10 cold-pass top-1 = judge-best (substrate works on real distribution). Surfaced **same-client+city cross-role bleed** — Shape A boost from Forklift-Operator playbook landed on CNC-Operator query, demoting the cold-pass-correct worker. Findings: `reports/reality-tests/real_001_findings.md`. |
| (fix)     | Cross-role gate: `Role` on `PlaybookEntry`, `QueryRole` on `SearchRequest`, gate fires in both `ApplyPlaybookBoost` + `InjectPlaybookMisses`. `roleEqual` handles case + plural. Backward-compat: empty role on either side = gate disabled (preserves lift suite + free-form callers). 5 new unit tests use exact real_001 distance + role values. Re-run real_002: bleed closed (Q#5 Pickers, Q#10 CNC Operator stay at cold-pass top-1; same-role lifts still fire). Closes OPEN #1. Findings: `reports/reality-tests/real_002_findings.md`. |
| (probe)   | Reality test real_003: 40 queries (10 fill_events rows × 4 styles — `need` / `client_first` / `looking` / `shorthand`). Confirmed shorthand-vs-shorthand bleed is real: CNC Operator shorthand recording leaked `w-2404` onto Forklift Operator shorthand query (both empty role, gate disabled). Extended `extractRoleFromNeed` to handle `client_first` + `looking` patterns; shorthand stays empty (regex can't separate role from city without anchor). Re-run real_003b: bleed closed across all 4 styles in this dataset. 10 new sub-tests in `scripts/playbook_lift/main_test.go` lock the patterns + the documented shorthand limitation. Findings: `reports/reality-tests/real_003_findings.md`. |
| (fix)     | LLM-based role extractor (real_004): `roleExtractor` struct with regex-first → qwen2.5 format=json fallback → per-process cache. Opt-in via `-llm-role-extract` flag + `LLM_ROLE_EXTRACT=1` env. Off-by-default preserves real_003b shipping config. 8 new tests including `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the load-bearing witness pairing with the matrix-side `TestInjectPlaybookMisses_RoleGateRejectsCrossRole` to prove the extraction-layer + gate-layer compose correctly on the exact real_003 failure mode. Findings: `reports/reality-tests/real_004_findings.md`. |
| (scrum)   | 3-lineage scrum review on `7f2f112..0331288` (Opus + Kimi + Qwen3-coder via `scripts/scrum_review.sh`). Convergent finding (3/3): `roleNormalize` plural-stripper mangled non-plural-s tokens (Sales → Sale, Logistics → Logistic). **Fixed**: `nonPluralSWords` allowlist + `-ss` ending check + `strings.ToLower`/`TrimSpace` cleanup. New tests `TestRoleNormalize_NonPluralS` + `TestRoleEqual_NonPluralS` lock the edge cases. Kimi 2 BLOCKs were false positives (model-truncation artifacts per `feedback_cross_lineage_review.md`). Disposition: `reports/scrum/_evidence/2026-04-30/verdicts/role_gate_v1_disposition.md` (local). |
| (probe)   | Negation reality test real_005: 5 explicit-negation queries ("NOT in Detroit", "excluding Cornerstone roster", etc.). Confirmed substrate has **zero negation handling** — cosine treats "NOT X" as "X" + noise. Judge IS the safety net (Q1/Q3/Q4 rated all top-10 results 1-2/5 — operator-visible honesty signal). **No code change needed**: production UI should handle exclusion via `ExcludeIDs` (already supported, added in multi-coord stress 200-worker swap), not via NL-negation. Findings: `reports/reality-tests/real_005_findings.md`. |
| (wire-up) | Multi-coord stress role wire-through: `Demand.Role` was already extracted at every call site (44 occurrences) but never threaded into matrix retrieve or playbook record. Cross-role gate was bypassed for the entire multi-coord harness. **Fixed** by extending `tracedSearch`, `matrixSearch`, and `playbookRecord` signatures with `role string` and updating all 14 call sites — passing `d.Role` (demand loops), `parsed.Role` (LLM-parsed inbox path), `warehouseDemand.Role` (swap path), `ev.Role` (reissue path), `""` (fresh-verify resume snippet — no clean role). Build + vet + tests green; multi-coord stress now honors role gate end-to-end. |
| (close-1) | **OPEN #1: vectord merge endpoint** — `POST /v1/vectors/index/{src}/merge` with body `{dest, clear_source}`. Idempotent on re-runs (existing-in-dest items skipped). New `Index.IDs()` snapshot method backs it; new `i.ids` tracker field is the canonical ID set (independent of meta map's nil-vs-{} sparseness). 4 cmd-level tests + 1 unit test. |
| (close-2) | **OPEN #2: distillation SFT export substrate** — `internal/distillation/sft_export.go`: `IsSftNever` predicate + `ListScoredRunFiles` (data/scored-runs/YYYY/MM/DD walk) + `LoadScoredRunsFromFile` + partial `ExportSft` that wires the firewall but leaves synthesis (instruction/input/response generation) as the next wave. Firewall pinning test fails if `SftNever` set changes without review. 5 new tests. The synthesis port remains on Rust at `scripts/distillation/export_sft.ts`. |
| (close-2 full) | **OPEN #2 fully ported** (2026-05-01): `SynthesizeSft` + `LoadEvidenceByRunID` + `buildInstruction` ported byte-for-byte from `scripts/distillation/export_sft.ts`. All 8 source-class instruction templates (scrum_reviews / mode_experiments / auto_apply / audits / observer_reviews / contract_analyses / outcomes / default) match Rust output exactly so a/b validation between runtimes can diff JSONL byte-for-byte. `ExportSft` writes to `data/distilled/sft/sft_export.jsonl`. 5 additional tests including per-source-class template verification, extraction-rejection, empty-text-rejection, context-assembly, end-to-end fixture write. |
| (close-2 lineage) | **Audit-baselines lineage ported** (2026-05-01): `internal/distillation/audit_baseline.go` mirrors Rust `audit_full.ts`'s LoadBaseline/AppendBaseline/buildDriftTable. `LoadLastBaseline` reads the most recent JSON line from `data/_kb/audit_baselines.jsonl`; `AppendBaseline` appends append-only with bufio. `BuildAuditDriftTable` flags drift `>20%` (configurable); zero-baseline and new-metric edge cases handled (no division-by-zero, no false-stable on zero→nonzero). `FormatAuditDriftTable` for stdout dumps. Generic on metric names so callers running both runtimes can pin Rust-compat names (`AuditBaselineRustCompat` constant lists them). 13 tests including last-line-wins, trailing-blank-tolerance, malformed-line-errors, threshold-boundary, zero-baseline-handling, sort-stability. |
| (close-3) | **OPEN #3: distribution drift via PSI** — `internal/drift/drift.go`: `ComputeDistributionDrift` returns Population Stability Index + verdict tier (stable < 0.10, minor 0.10–0.25, major ≥ 0.25). Equal-width bucketing over combined min/max range, epsilon-clamping for empty buckets, per-bucket breakdown for drilldown. 7 new tests including identical-is-stable, hard-shift-is-major, moderate-detected-not-stable, empty-inputs-safe, all-identical-safe, bucket-counts-conserved, num-buckets-clamping. |
| (close-4) | **OPEN #4: ops nice-to-haves** — (a) Real-time wall-clock for stress harness: per-phase elapsed time logged to stdout as it runs (`[stress] phase NAME starting (T+12.3s)` + `[stress] phase NAME done — 8.5s (T+20.8s)`); `Output.PhaseTimings` + `Output.TotalElapsedMs` written to JSON; (b) chatd fixture-mode S3 mock + (c) liberal-paraphrase calibration: not actioned — no fired trigger yet, would be speculative. Documented as deferred-until-need rather than ignored. |

Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).

---

## RUNTIME CHEATSHEET

```bash
# Verify everything green
cd /home/profit/golangLAKEHOUSE
just verify                                            # vet + tests + 9 core smokes (~31s)
just doctor                                            # dep probe (go/gcc/minio/ollama/secrets)

# Boot the chat dispatcher (Phase 4)
nohup ./bin/chatd   -config lakehouse.toml > /tmp/chatd.log   2>&1 & disown
nohup ./bin/gateway -config lakehouse.toml > /tmp/gateway.log 2>&1 & disown
curl -sf http://127.0.0.1:3110/v1/chat/providers | jq  # all 5 providers should report true

# Test a chat call to each lineage
for m in "qwen3.5:latest" "opencode/claude-opus-4-7" "openrouter/moonshotai/kimi-k2-0905"; do
  curl -sS -X POST http://127.0.0.1:3110/v1/chat \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$m\",\"messages\":[{\"role\":\"user\",\"content\":\"reply: OK\"}],\"max_tokens\":8}" \
    | jq -c '{model,provider,content}'
done

# Run the scrum on a diff
./scripts/scrum_review.sh path/to/bundle.diff bundle_label
ls reports/scrum/_evidence/$(date +%Y-%m-%d)/verdicts/

# Domain smokes (not in `just verify`)
for s in chatd matrix observer pathway playbook relevance downgrade workflow; do
  bash scripts/${s}_smoke.sh > /tmp/${s}.log 2>&1 && echo "$s ✓" || echo "$s ✗"
done
```

---

## VISION — what we're actually building

J's framing (canonical at `/root/.claude/projects/-home-profit/memory/project_small_model_pipeline_vision.md`): a small-model-driven autonomous pipeline that gets better with each run. Frontier APIs (Opus, Kimi, GPT-5) are too expensive + rate-limited for the inner loop — they live in audit/oversight via `frontier_*` tier. The hot path runs on local `qwen3.5:latest` given:

1. **Pathway memory** — what we tried before, how it went (Mem0 substrate ✓)
2. **Matrix indexer** — multi-corpus retrieve+merge giving the small model the right slice for this task (5/5 components ✓)
3. **Observer** — watches each run, refines configs (not prompts) toward good pathways

Successful runs get **rated and distilled back into the playbook**. Each iteration the playbook gets denser, runs get cheaper, results get better. **Drift** in the distilled playbook is a measured signal, not vibes.

**The single load-bearing gate:** *"the playbook + matrix indexer must give the results we're looking for."* Throughput, scaling, code elegance are all secondary. The `playbook_lift` reality test is the regression gate before Enterprise cutover (where real contracts + live profile updates land).

When evaluating any Go workstream, ask: which of the 5 loops does this advance? Strong workstreams advance ≥1; weak workstreams sit in infra-for-its-own-sake.

---

## SIBLING TOOLS (separate repos, intentional integration target later)

**`local-review-harness`** at `git.agentview.dev/profit/local-review-harness` (also SMB-mounted at `/home/profit/share/local-review-harness-full-md/`). Local-first code review harness — 12 evidence-bearing static analyzers, Scrum-style reports, no cloud deps. Phase A + B (MVP) shipped 2026-04-30. Phases C–E (Ollama LLM review, validation, memory) pending.

**Cross-pollination plan when both stabilize:**
- Replace harness's `internal/llm/ollama.go` with a chatd `/v1/chat` client → frontier judges via config toggle
- Feed harness findings into Lakehouse pathway memory as a drift signal
- Treat harness's `.memory/known-risks.json` as a matrix-indexer corpus

Detail at `docs/SPEC.md` §3.10. Don't re-port harness functionality into Lakehouse-Go — the standalone tool is the design.