From 247e36e687cae7e0d3b0365db09749fe615f9357 Mon Sep 17 00:00:00 2001
From: root
Date: Thu, 30 Apr 2026 19:32:31 -0500
Subject: [PATCH] =?UTF-8?q?STATE=5FOF=5FPLAY:=20trim=20OPEN=20list=20?=
 =?UTF-8?q?=E2=80=94=209=20rows=20=E2=86=92=206,=20ordered=20by=20product?=
 =?UTF-8?q?=20leverage?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Sprint 4 row removed (shipped: a59ef5b systemd + 54a05d9 docker).
ADR-006 row already dropped on the previous STATE update.

Two lift-suite tail items (Q6↔Q7 adjacent-query, Q9/Q15 liberal-
paraphrase) consolidated into one "judge-gated playbook injection"
row — both are downstream of the same fix (let the judge approve
before Shape B inserts). Captures the design lineage from multi-coord
run #008's judge-rating pattern.

Three items folded into a single "operational nice-to-haves" row:
real-time clock, chatd fixture storage half, liberal-paraphrase
calibration. None are product-blocking; each lights up when someone
hits its specific trigger.

Reorder reflects leverage on the active product theory (multi-
coord staffing co-pilot via the 5-loop substrate), not effort:

1. Judge-gated injection (lift quality + lift-tail closure)
2. Wider Langfuse instrumentation (production observability)
3. Fresh→main merge (operational hygiene as the corpus grows)
4. Distillation full port (production dependency, not yet)
5. Drift quantification (research)
6. Operational nice-to-haves

Lead-in note added: "Items move to closed when the work demands
them, not on a calendar." Locks intent against future drift toward a
sprawling todo list.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 STATE_OF_PLAY.md | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/STATE_OF_PLAY.md b/STATE_OF_PLAY.md
index d024efd..47267ae 100644
--- a/STATE_OF_PLAY.md
+++ b/STATE_OF_PLAY.md
@@ -215,17 +215,16 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 
 ## OPEN — what's not done yet
 
-| Item | What | When to act |
+The list is intentionally short. Items move to closed when the work demands them, not on a calendar. Ordered by leverage on the active product theory (multi-coord staffing co-pilot via the 5-loop substrate), not by effort.
+
+| # | Item | When to act |
 |---|---|---|
-| **Real-time 48-hour clock** | Multi-coord stress fires phases as discrete steps with simulated-hour labels (0/6/12/18/24/30/36/42/48). A real-time variant would space events on actual wall-clock with `time.Sleep`, simulating the rhythm of a coordinator workday. Cosmetic; doesn't change product behavior — but lets the harness mimic the cadence at which inbox events would arrive in production. ~30 min. | When stress test starts being used to capture realistic per-hour throughput numbers. |
-| **Wider Langfuse instrumentation across daemons** | multi_coord_stress traces every external call, but the daemons themselves (matrixd, observerd, chatd) don't yet emit traces from their own request handlers. Adding `internal/langfuse/middleware.go` would auto-emit a span per HTTP request, giving production-traffic visibility for free. | When production traffic starts hitting the Go gateway. |
-| **Periodic fresh→main index merge** | Two-tier `fresh_workers` pattern is verified working but no scheduled job consolidates fresh→main. Fresh corpus grows monotonically; eventually has its own recall issues. A daily cron that ingests the fresh corpus' contents into the main `workers` index + drops fresh would solve it. | When fresh_workers grows past ~500 items. |
-| **Adjacent-query cross-pollination (lift suite Q6↔Q7)** | After lift v4's split threshold, OOD cross-pollination is gone but Q6 / Q7 still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Multi-coord run #008 inbox-judge re-rating proved the judge can distinguish — gating injection on "judge approves before injecting" closes this. ~1 hr. | When playbook injection quality matters more than retrieval throughput. |
-| **Liberal-paraphrase recovery loss (lift suite Q9, Q15)** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt or a per-pair `paraphrase_max_drift` measurement. | When real coordinator queries are available for a calibration run. |
-| **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
-| **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
-| **Distillation full port** | `57d0df1` shipped scorer + contamination firewall (E partial); SFT export pipeline + audit_baselines lineage not yet ported. | When distillation is needed for production. |
-| **Drift full quantification** | `be65f85` is "scorer drift first." Full distribution-drift signal underspecified everywhere — research gap, not a port. | Open research item. |
+| 1 | **Judge-gated playbook injection** — close lift-suite tail issues (Q6↔Q7 swap, Q9/Q15 paraphrase drift) by routing every Shape B injection through the judge before the rank insert lands. Multi-coord run #008 already proved the judge can distinguish tight-but-wrong from tight-and-right; this lifts that pattern into the matrix substrate. ~1.5 hr. | When playbook quality starts mattering more than retrieval throughput. |
+| 2 | **Wider Langfuse instrumentation across daemons** — `internal/langfuse/middleware.go` that auto-emits one span per HTTP request from every daemon's `shared.Run`. Production traffic gets free trace visibility without per-handler wiring. | When production traffic actually starts hitting the gateway. |
+| 3 | **Periodic fresh→main index merge** — two-tier pattern works but `fresh_workers` grows monotonically. A scheduled job that re-ingests the fresh corpus into `workers` (with the v2-moe embedder) + clears fresh closes the loop. | When `fresh_workers` crosses ~500 items in production. |
+| 4 | **Distillation full port** — `57d0df1` shipped scorer + contamination firewall (E partial). SFT export pipeline + audit_baselines lineage still on the Rust side. | When distillation becomes a production dependency. |
+| 5 | **Drift quantification** — `be65f85` is "scorer drift first." Full distribution-drift signal is underspecified everywhere; this is research, not a port. | Open research item; no calendar. |
+| 6 | **Operational nice-to-haves** — real-time wall-clock for the stress harness; chatd fixture-mode storage half (mock S3 for CI without MinIO); liberal-paraphrase calibration once real coordinator queries land. | When any of these block someone. |
 
 ---
 