Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much tighter cosine distances (0.05-0.10 in three cases) but lose the "system has no good match" signal that high-distance results give. A coordinator UI showing only distance can't tell wrong-domain matches apart from real ones.

Fix: the judge re-rates top-1 against the ORIGINAL inbox body (not the LLM-parsed query). Coordinators see both:

- distance: how close retrieval was in vector space
- rating: does this person actually fit the original ask

The pair tells the honest story. Run #008 result on the 6 inbox events:

| Demand | Top-1 | Distance | Rating | Reading |
|---|---|---|---|---|
| Forklift Cleveland | w-3573 | 0.29 | 4 | Strong |
| Production Indy | e-1764 | 0.41 | 3 | Adjacent |
| Crane Chicago | e-7798 | 0.23 | 1 | TIGHT BUT WRONG |
| Bilingual safety Indy | w-3918 | 0.05 | 5 | Perfect |
| Drone Chicago | e-1058 | 0.06 | 5 | Perfect (verify e-1058) |
| Warehouse Milwaukee | w-460 | 0.32 | 4 | Strong |

The crane-Chicago case is the architectural-honesty signal at work: distance 0.23 says "tight match," but the judge, reading the original body, says rating 1. A coordinator seeing only distance would ship the wrong worker; a coordinator seeing distance+rating sees the disagreement and escalates.

Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1 (irrelevant despite tight cosine). The substrate-honesty signal is recovered without losing the LLM-parse quality wins.

Cost: 6 extra judge calls (~9s on qwen2.5). Production amortizes when the judge runs only on top-1 of high-priority inbox events; the search-cost-vs-quality tradeoff lives in the priority gate.

Implementation (sketched below):

- New JudgeRating int field on Event (omitempty, so non-judged events stay clean in JSON)
- New judgeInboxResult helper, reusing the same prompt structure as playbook_lift's judgeRate. The two could share an internal package if a third judge consumer appears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
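A minimal Go sketch of the two implementation pieces above, under stated assumptions: the JSON tag name, the prompt wording, and llmRate (a stand-in for the shared judge-prompt call) are illustrative, not the actual code.

```go
package main

import (
	"context"
	"fmt"
	"strconv"
	"strings"
)

// llmRate is a hypothetical stand-in for the shared judge-prompt call
// (playbook_lift's judgeRate uses the same structure). Wire it to the
// judge model (qwen2.5 in this run).
var llmRate func(ctx context.Context, prompt string) (string, error)

// Event carries one inbox/search event through the run; only the new
// field is shown, the existing fields are elided.
type Event struct {
	// ... existing fields ...

	// JudgeRating is the judge's 1-5 fit score for the top-1 result,
	// rated against the ORIGINAL inbox body rather than the LLM-parsed
	// query. omitempty drops the zero value, so non-judged events stay
	// clean in the JSON evidence.
	JudgeRating int `json:"judge_rating,omitempty"`
}

// judgeInboxResult re-rates the top-1 hit against the original inbox
// body. Call it only for events that pass the priority gate, so the
// extra judge cost stays bounded.
func judgeInboxResult(ctx context.Context, originalBody, topWorkerDoc string) (int, error) {
	prompt := fmt.Sprintf(
		"Original request:\n%s\n\nCandidate worker:\n%s\n\n"+
			"Rate 1-5 how well this worker fits the ORIGINAL request. Reply with the digit only.",
		originalBody, topWorkerDoc)
	reply, err := llmRate(ctx, prompt)
	if err != nil {
		return 0, err
	}
	rating, err := strconv.Atoi(strings.TrimSpace(reply))
	if err != nil || rating < 1 || rating > 5 {
		return 0, fmt.Errorf("judge returned non-rating %q", reply)
	}
	return rating, nil
}
```

The 1-5 bounds check doubles as the "judge returned garbage" guard: a failed parse errors out instead of silently writing rating 0 (which omitempty would then hide) into the evidence.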
Multi-Coordinator Stress Test — Run 008
Generated: 2026-04-30T21:15:37.045817146Z
Coordinators: alice / bob / carol (each with its own playbook namespace: playbook_alice / playbook_bob / playbook_carol)
Contracts: alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
Corpora: workers,ethereal_workers
K per query: 8
Total events captured: 67
Evidence: reports/reality-tests/multi_coord_stress_008.json
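For the analysis sketches later in this report, assume the evidence events deserialize into something like the following Go shape. It is inferred from the jq queries in the per-event-capture section, so the field names are assumptions, not a spec.

```go
package main

// runEvent mirrors one entry in the evidence JSON
// (reports/reality-tests/multi_coord_stress_008.json). The shape is
// inferred from the jq queries at the end of this report.
type runEvent struct {
	Phase       string `json:"phase"`       // "baseline", "merge", ...
	Coordinator string `json:"coordinator"` // alice / bob / carol
	Contract    string `json:"contract"`
	Role        string `json:"role"`
	TopK        []struct {
		ID       string  `json:"id"`
		Distance float64 `json:"distance"`
	} `json:"top_k"`
}
```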
Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---|---|---|
| Same role across different contracts | 0.000 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.041 | 18 | Should be near-zero (different roles = different worker pools) |
Healthy ranges:
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
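Both the diversity numbers above and the determinism numbers below reduce to Jaccard similarity over top-K worker-ID sets. A minimal sketch of the metric (helper names are illustrative):

```go
// jaccard returns |A ∩ B| / |A ∪ B| for two top-K ID slices.
// 0 means disjoint result sets, 1 means identical sets.
func jaccard(a, b []string) float64 {
	set := make(map[string]int, len(a)+len(b))
	for _, id := range a {
		set[id] |= 1 // seen in A
	}
	for _, id := range b {
		set[id] |= 2 // seen in B
	}
	inter, union := 0, 0
	for _, bits := range set {
		union++
		if bits == 3 { // seen in both
			inter++
		}
	}
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// meanJaccard averages jaccard over a list of top-K pairs; this is the
// "Mean Jaccard" column in the tables above and below.
func meanJaccard(pairs [][2][]string) float64 {
	if len(pairs) == 0 {
		return 0
	}
	sum := 0.0
	for _, p := range pairs {
		sum += jaccard(p[0], p[1])
	}
	return sum / float64(len(pairs))
}
```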
Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
Interpretation:
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
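The reissue check itself is a small loop over that metric. A sketch, reusing meanJaccard from the block above; searchTopK is a hypothetical stand-in for a retrieval-only matrix.search call:

```go
import "context"

// searchTopK is a hypothetical stand-in for a retrieval-only
// matrix.search call returning the top-K worker IDs for a query.
var searchTopK func(ctx context.Context, query string, k int) ([]string, error)

// reissueStability reissues the same query `reissues` times after the
// initial search and averages Jaccard over consecutive top-K sets.
// Values ≥ 0.95 land in the healthy range from the table above.
func reissueStability(ctx context.Context, query string, k, reissues int) (float64, error) {
	var prev []string
	var pairs [][2][]string
	for i := 0; i <= reissues; i++ {
		ids, err := searchTopK(ctx, query, k)
		if err != nil {
			return 0, err
		}
		if i > 0 {
			pairs = append(pairs, [2][]string{prev, ids})
		}
		prev = ids
	}
	return meanJaccard(pairs), nil
}
```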
Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| Verbatim handover hit rate (top-1) | 1 |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| Paraphrase handover hit rate (top-1) | 1 |
Interpretation:
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
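The hit-rate rows reduce to a rank lookup over Bob's events. A sketch, using the assumed runEvent shape from the top of the report; recordedAnswerID would come from Alice's playbook entry:

```go
// handoverHits counts how often Alice's recorded answer surfaces in
// Bob's post-handover results, at top-1 and anywhere in top-K.
func handoverHits(events []runEvent, recordedAnswerID string) (top1, topK int) {
	for _, ev := range events {
		for rank, hit := range ev.TopK {
			if hit.ID != recordedAnswerID {
				continue
			}
			topK++
			if rank == 0 {
				top1++
			}
			break // count each event once
		}
	}
	return top1, topK
}
```

Dividing top1 by the number of handover queries run gives the hit-rate rows above (4/4 = 1 for both the verbatim and paraphrase passes in this run).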
Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Filter by phase, coordinator, or role:

```sh
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_008.json
```
What's NOT in this run (Phase 1 deliberately defers)
- 48-hour clock. Events fire as discrete steps, not on a timeline.
- Email / SMS ingest. No endpoints exist on the Go side yet.
- New-resume injection mid-run. The corpus is fixed at the start.
- Langfuse traces. Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.