# Multi-Coordinator Stress Test: Run 004

**Generated:** 2026-04-30T13:17:03.577877974Z

**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)

**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction

**Corpora:** `workers,ethereal_workers`

**K per query:** 8

**Total events captured:** 61

**Evidence:** `reports/reality-tests/multi_coord_stress_004.json`

---

## Diversity: is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.08013468013468013 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.012820512820512822 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**

- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.

---

## Determinism: same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**

- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good: the system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable. Reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

---

## Learning: handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**

- Verbatim hit rate ≈ 1.0: trivial case. Bob runs identical queries, so this should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives a change of wording, which is the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.

---

## Per-event capture

All `matrix.search` responses live in the JSON: top-K with worker IDs, distances, and per-corpus counts. Filter by phase, coordinator, or role:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_004.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_004.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_004.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The time-based runner will be mounted on top of the Phase 1 substrate.
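
For readers checking the Jaccard figures in the Diversity and Determinism tables against the JSON, the following is a minimal Python sketch of a mean pairwise Jaccard over top-K worker ID lists. It is an illustration only, not the generator that produced this report. The function names (`jaccard`, `mean_pairwise_jaccard`, `determinism_band`) and the sample worker IDs are hypothetical, and treating two empty result sets as identical is an assumption. The band cut-offs are the ones listed under Determinism.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two ID collections: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty result sets count as identical
    return len(set_a & set_b) / len(union)


def mean_pairwise_jaccard(id_lists):
    """Mean Jaccard over every unordered pair of top-K ID lists."""
    pairs = list(combinations(id_lists, 2))
    if not pairs:
        raise ValueError("need at least two result lists")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def determinism_band(score):
    """Map a reissue score onto the bands from the Determinism section."""
    if score >= 0.95:
        return "highly deterministic"
    if score >= 0.80:
        return "acceptable variance"
    return "unstable"


# Hypothetical worker IDs, for illustration only (not taken from the run).
first = ["w1", "w2", "w3", "w4"]
reissue = ["w4", "w3", "w2", "w1"]  # same set, different order
other = ["w3", "w4", "w5", "w6"]    # shares two IDs with `first`

print(jaccard(first, reissue))  # order does not matter, only membership
print(jaccard(first, other))    # 2 shared IDs out of 6 distinct IDs
print(determinism_band(jaccard(first, reissue)))
```

Because the comparison is set-based, a reissue that returns the same workers in a different rank order still scores 1. A rank-sensitive measure would be needed to detect reordering within the top-K.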