From db2e57402e19c2026243bd4dec29830ef065b4f2 Mon Sep 17 00:00:00 2001
From: root <root@island37.com>
Date: Thu, 30 Apr 2026 17:30:04 -0500
Subject: [PATCH] STATE_OF_PLAY: capture multi-coord stress wave (Phase 1-3
 verified)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Anchor was last touched at v4 split-threshold; since then the
multi-coord stress harness landed end-to-end across 11 commits.
Future sessions reading this file need to see the verified state,
not derive from git log.

Major additions:
- New "Multi-coordinator stress test (Phase 1 → 3)" section in
  VERIFIED WORKING. 11-row capability table covering per-coord
  playbook isolation, diversity metrics, paraphrase handover,
  ExcludeIDs swap, fresh-resume two-tier, inbox endpoints, LLM
  demand parsing, judge re-rating, Langfuse tracing.
- Substrate-gains list under that section: ExcludeIDs on
  SearchRequest, observer.SourceInbox + /observer/inbox,
  internal/langfuse client, embedd default bumped to v2-moe,
  two-tier fresh_workers index pattern.
- Last-verified bumped to 16:42 CDT on the run #011 anchor.

DO NOT RELITIGATE expanded with five new locks:
1. Boost / inject use SEPARATE thresholds (0.5 / 0.20)
2. Multi-coord product theory is empirically VALIDATED
3. Fresh content uses two-tier indexing (fresh_workers)
4. embedd.default_model = nomic-embed-text-v2-moe (don't downgrade)
5. Inbox flow: parse + search + judge + trace
6. Langfuse Go-side client lives at internal/langfuse/

OPEN list refresh:
- Removed: re-judge metric (shipped as b13b5cd), adjacent-query as
  separate item (folded into a single "judge-approves-before-inject"
  follow-up), liberal-paraphrase (kept).
- Added: real-time 48-hour clock, wider Langfuse instrumentation,
  periodic fresh→main merge job.

RECENT VERIFIED WAVE table extended with 11 commits (b13b5cd..5d49967).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 STATE_OF_PLAY.md | 57 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 52 insertions(+), 5 deletions(-)

diff --git a/STATE_OF_PLAY.md b/STATE_OF_PLAY.md
index dd11a5c..62efffa 100644
--- a/STATE_OF_PLAY.md
+++ b/STATE_OF_PLAY.md
@@ -1,7 +1,7 @@
 # STATE OF PLAY — Lakehouse-Go
 
-**Last verified:** 2026-04-30 ~07:25 CDT
-**Verified by:** live probes + `just verify` PASS + reality tests #001/#002/#003/#004 (v4 with split inject threshold: 6/8 verbatim lift + 6/8 paraphrase recovery + zero OOD cross-pollination), not memory.
+**Last verified:** 2026-04-30 ~16:42 CDT
+**Verified by:** live probes + `just verify` PASS + multi-coord stress run #011 (full 9-phase scenario, 67 captured events, 1 Langfuse trace + 111 child observations covering every phase + every external call), not memory.
 
 > **Read this FIRST.** When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at `/home/profit/lakehouse/` is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.
 
@@ -114,6 +114,34 @@ Evidence: `reports/reality-tests/playbook_lift_{001,002,003}.{json,md}`. Per the
 
 **v3 → v4 is the configuration evolution.** v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.
 
+### Multi-coordinator stress test (Phase 1 → 3) — VERIFIED end-to-end
+
+Reality test #2 catalog. New harness `scripts/multi_coord_stress.{sh,go}` simulates 3 coordinators (alice/bob/carol) handling 3 distinct contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction), each with their own playbook namespace. 9-phase operational narrative across simulated Hours 0–48: baseline → fresh-resume injection → inbox burst → mid-day surge → 200-worker swap → contract merge → handover (verbatim + paraphrase) → split → reissue.
+
+| Capability | Verified | Where |
+|---|---|---|
+| Per-coordinator playbook isolation | ✓ | `playbook_alice` / `playbook_bob` / `playbook_carol` corpora |
+| Same-role-across-contracts diversity | Jaccard 0.026 (n=9) — 97% workers differ per region | Phase 1 baseline |
+| Different-roles-same-contract diversity | Jaccard 0.070 (n=18) — 93% differ per role | Phase 1 baseline |
+| HNSW retrieval determinism | Jaccard 1.000 (n=12) | Phase 6 reissue |
+| Verbatim handover (Bob runs Alice's queries with Alice's playbook) | 4/4 | Phase 4 |
+| Paraphrase handover (Bob runs qwen2.5-paraphrased queries) | 4/4 | Phase 4b |
+| 200-worker swap with `ExcludeIDs` | Jaccard 0.000 — 8/8 placed workers fully replaced | Phase 2b |
+| Fresh-resume injection (two-tier `fresh_workers` index) | 3/3 fresh workers at top-1 | Phase 1b |
+| Inbox endpoint `/v1/observer/inbox` (email + SMS, priority weighting) | 6/6 events recorded | Phase 1c |
+| LLM demand parsing (qwen2.5 format=json on inbox bodies) | 6/6 parsed cleanly into structured `{role, count, location, certs, skills, shift}` | Phase 1c |
+| Judge re-rates inbox top-1 against ORIGINAL body | catches tight-distance-but-wrong (Q3 crane case: dist 0.23 → rating 1) | Phase 1c |
+| Langfuse Go-side tracing | 111 observations on a single run trace, browseable at http://localhost:3001 | Run #011 |
+
+**Substrate gains added by this wave:**
+- `internal/matrix/playbook.go` Shape B + split inject threshold (commit `67d1957` from earlier wave; verified in multi-coord too)
+- `internal/matrix/retrieve.go` `ExcludeIDs` field on `SearchRequest` — filters worker IDs at retrieve, boost, AND inject (so excluded answers can't sneak back via playbook). Real-world driver: coordinator placed N workers, client asks for replacements.
+- `internal/observer/types.go` `SourceInbox` taxonomy alongside `SourceMCP / SourceScenario / SourceWorkflow`
+- `cmd/observerd` `POST /observer/inbox` route — accepts `{type, sender, subject, body, priority, tag}` and records as `ObservedOp`. Type must be `email` or `sms`; body required; priority defaults to medium.
+- `internal/langfuse/client.go` — minimal Go-side Trace+Span client, best-effort posture (logs on error, never blocks calling path; same fail-open semantics as ADR-005 Decision 5.1).
+- `embedd default_model` bumped from `nomic-embed-text` (137M) → `nomic-embed-text-v2-moe` (475M, MoE, drop-in 768d). Same-role-across-contracts diversity went from 0.080 → 0.000 with the upgrade.
+- Two-tier index pattern: fresh content goes to `fresh_workers` (a small "hot" corpus); main queries include it in the corpora list. Solves the HNSW post-build add recall issue (incremental adds to a populated graph land in poorly-connected regions and disappear from search). Canonical NRT pattern; Lucene works the same way.
+
 ### Harness expansion (2026-04-30 ~05:30 CDT)
 
 `scripts/playbook_lift.sh` rewritten from a 5-daemon stripped harness to the **full 10-daemon prod-realistic stack** (chatd stays up independently). The 5-daemon version was structurally hiding bugs; expanding the daemon set surfaced 7 distinct fixes:
@@ -171,10 +199,16 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 - The matrix indexer **5/5 components** are shipped. Don't propose to "build the matrix indexer" — it's done.
 - The 5-loop substrate's load-bearing gate is **PASSED**. v3 (`154a72e`) showed 6/6 paraphrase recovery via Shape B. Don't propose to "test if the playbook learns" — it does.
 - **Shape B is the playbook stance now.** When `use_playbook=true`, both `ApplyPlaybookBoost` (re-rank in place) AND `InjectPlaybookMisses` (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries.
+- **Boost / inject use SEPARATE thresholds.** Boost stays at `DefaultPlaybookMaxDistance = 0.5` (safe — only re-ranks results already in retrieval). Inject uses tighter `DefaultPlaybookMaxInjectDistance = 0.20` (Shape B forces results in, so loose match cross-pollinates wrong-domain). Don't merge them.
+- **Multi-coord product theory is empirically VALIDATED** by run #011 (Phase 3). Per-coordinator playbook namespaces (`playbook_alice` etc.) with cross-coordinator handover (Bob takes Alice's contract using `playbook_corpus=playbook_alice`) work end-to-end including paraphrase handover. Don't propose to "test if multi-coord works" — it does.
+- **Fresh content uses two-tier indexing.** Fresh resumes go to `fresh_workers` corpus, not the main `workers` index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern.
+- **`embedd.default_model = "nomic-embed-text-v2-moe"`** (475M MoE, 768d). Don't bump to `nomic-embed-text` (137M) "for speed" — diversity scores fell from 0.000 → 0.080 same-role-across-contracts on the smaller model. Cost (~5× slower ingest) is acceptable for once-per-deploy work.
+- **Inbox flow: parse + search + judge + trace.** `/v1/observer/inbox` records the body; coordinator/driver parses via LLM (qwen2.5 format=json), runs `matrix.search` on the parsed query, judges top-1 against the ORIGINAL body, emits Langfuse spans through it all. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal.
 - `local_judge = "qwen2.5:latest"` for the lift loop — qwen3.5:latest is a vision-SSM 256K-ctx build that ran 30s/judge. Don't bump to qwen3.5 again "for cleanliness." Different model in the same tier has different cost geometry.
 - `qwen3.5:latest` IS available locally on this box. Opus's hub-only knowledge is a known-stale signal; the chatd_smoke uses it daily.
 - `temperature` is **omitted** for Anthropic 4.7 (handled by `Request.Temperature *float64`); don't re-add it.
 - chatd-smoke runs with **all cloud providers disabled** intentionally so the suite doesn't depend on API keys; that's why it can't catch B-3-class bugs (those need a fake-server fixture, see Sprint 0 follow-up).
+- **Langfuse Go-side client lives at `internal/langfuse/`** with best-effort fail-open posture. URL+creds from `/etc/lakehouse/langfuse.env`. Don't propose to "wire Langfuse on Go side" — it's wired; multi_coord_stress is the proof.
 
 ---
 
@@ -182,9 +216,11 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 
 | Item | What | When to act |
 |---|---|---|
-| **Reality test v4: re-judge warm results** | Current lift metric counts "warm top-1 == cold judge best." Shape B cross-pollinates recorded answers across queries (run #003 dropped verbatim-lift 7→2 because w-4435 surfaced as warm top-1 for several queries whose cold-judge-best was a different worker). True quality lift would re-judge warm results and measure judge agreement. ~30 min — adds a 4th pass to the driver. | Next time we want to measure Shape B's actual quality, not rank-of-original-judge-best. |
-| **Adjacent-query cross-pollination** | After v4's split threshold, OOD cross-pollination is gone but Q6 ("Forklift-certified loader") ↔ Q7 ("Hazmat-certified warehouse worker, cold storage") still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Likely correct (genuinely similar staffing queries), but worth verifying with the v4 re-judge metric — if the judge agrees both directions are good matches, accept; if not, tighten further (e.g. 0.15) or add a same-query-only mode. | Co-decision with v4 re-judge. |
-| **Liberal-paraphrase recovery loss** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt (more conservative wording variation) or a per-pair `paraphrase_max_drift` measurement. Cosmetic vs. real depends on whether realistic coordinator queries drift like qwen2.5's rephrases do. | When real coordinator queries are available for a calibration run. |
+| **Real-time 48-hour clock** | Multi-coord stress fires phases as discrete steps with simulated-hour labels (0/6/12/18/24/30/36/42/48). A real-time variant would space events on actual wall-clock with `time.Sleep`, simulating the rhythm of a coordinator workday. Cosmetic; doesn't change product behavior — but lets the harness mimic the cadence at which inbox events would arrive in production. ~30 min. | When stress test starts being used to capture realistic per-hour throughput numbers. |
+| **Wider Langfuse instrumentation across daemons** | multi_coord_stress traces every external call, but the daemons themselves (matrixd, observerd, chatd) don't yet emit traces from their own request handlers. Adding `internal/langfuse/middleware.go` would auto-emit a span per HTTP request, giving production-traffic visibility for free. | When production traffic starts hitting the Go gateway. |
+| **Periodic fresh→main index merge** | Two-tier `fresh_workers` pattern is verified working but no scheduled job consolidates fresh→main. Fresh corpus grows monotonically; eventually has its own recall issues. A daily cron that ingests the fresh corpus' contents into the main `workers` index + drops fresh would solve it. | When fresh_workers grows past ~500 items. |
+| **Adjacent-query cross-pollination (lift suite Q6↔Q7)** | After lift v4's split threshold, OOD cross-pollination is gone but Q6 / Q7 still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Multi-coord run #008 inbox-judge re-rating proved the judge can distinguish — gating injection on "judge approves before injecting" closes this. ~1 hr. | When playbook injection quality matters more than retrieval throughput. |
+| **Liberal-paraphrase recovery loss (lift suite Q9, Q15)** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt or a per-pair `paraphrase_max_drift` measurement. | When real coordinator queries are available for a calibration run. |
 | **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
 | **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
 | **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
@@ -213,6 +249,17 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 | `154a72e` | matrix: Shape B (`InjectPlaybookMisses`) — 6/6 paraphrase recovery in run #003 |
 | `94fc3b6` | STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination |
 | `67d1957` | matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8 |
+| `b13b5cd` | playbook_lift v4 metric: warm-top-1 re-judge → quality lift +24%/-14% (5 lifted / 13 neutral / 3 regressed) |
+| `61c7b55` | multi-coord stress harness (Phase 1) — 3 coords / 3 contracts / 7-phase scenario |
+| `0fa42a0` | multi-coord stress Phase 1.5 — shared-role contracts + paraphrase handover |
+| `84a32f0` | multi-coord stress Phase 2 — `ExcludeIDs`, 200-worker swap, fresh-resume |
+| `4da32ad` | embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in) |
+| `e7fc63b` | observerd `/observer/inbox` + multi-coord stress phase 1c (priority-ordered events) |
+| `186d209` | multi_coord_stress: LLM-parsed inbox demands (qwen2.5 format=json) |
+| `ce940f4` | multi_coord_stress: judge re-rates inbox top-1 against original body — recovers OOD honesty signal |
+| `7e6431e` | langfuse: Go-side client + Phase 1c instrumentation |
+| `08a0867` | multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 (3/3) |
+| `5d49967` | multi_coord_stress: full Langfuse coverage — every phase + every call (111 observations) |
 
 Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).