golangLAKEHOUSE/reports/reality-tests/multi_coord_stress_008.md
root ce940f4a14 multi_coord_stress: judge re-rates inbox top-1 — recovers honesty signal
Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much
tighter cosine distances (0.05-0.10 in three cases) but lose the
"system has no good match" signal that high-distance results give.
A coordinator UI showing only distance can't distinguish wrong-domain
matches from genuine ones.

Fix: judge re-rates top-1 against the ORIGINAL inbox body (not the
LLM-parsed query). Coordinators see both:
  - distance: how close was retrieval in vector space
  - rating:   does this person actually fit the original ask
The pair tells the honest story.
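
Concretely, the re-rate step looks roughly like this; a minimal sketch
assuming hypothetical names (judgeClient, InboxEvent, SearchHit, and
rateTop1 are illustrative, not the actual API):

```go
package stress // hypothetical package name

import (
	"context"
	"fmt"
)

// judgeClient is a stand-in for the judge model client.
type judgeClient interface {
	// Rate sends a prompt and parses a 1-5 digit from the reply.
	Rate(ctx context.Context, prompt string) (int, error)
}

type InboxEvent struct {
	OriginalBody string // raw inbox text, before LLM parsing
	Priority     int    // used by the priority gate discussed below
}

type SearchHit struct {
	ID       string
	Distance float64 // cosine distance from vector search
	Document string  // worker profile text
}

// RatedHit is the pair a coordinator sees.
type RatedHit struct {
	WorkerID string
	Distance float64 // how close retrieval was in vector space
	Rating   int     // 1-5: does this person fit the original ask
}

// rateTop1 re-rates the top-1 hit against the ORIGINAL inbox body,
// never the LLM-parsed query. The prompt shape is illustrative; the
// real helper reuses playbook_lift's judgeRate prompt structure.
func rateTop1(ctx context.Context, judge judgeClient, ev InboxEvent, top1 SearchHit) (RatedHit, error) {
	prompt := fmt.Sprintf(
		"Original request:\n%s\n\nCandidate worker profile:\n%s\n\n"+
			"Rate 1-5 how well this person fits the original request. Reply with one digit.",
		ev.OriginalBody, top1.Document)
	rating, err := judge.Rate(ctx, prompt)
	if err != nil {
		return RatedHit{}, err
	}
	return RatedHit{WorkerID: top1.ID, Distance: top1.Distance, Rating: rating}, nil
}
```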

Run #008 result on the 6 inbox events:

  Demand                Top-1     Distance  Rating  Reading
  ─────────────────────────────────────────────────────────────
  Forklift Cleveland    w-3573    0.29      4       Strong
  Production Indy       e-1764    0.41      3       Adjacent
  Crane Chicago         e-7798    0.23      1       TIGHT BUT WRONG
  Bilingual safety Indy w-3918    0.05      5       Perfect
  Drone Chicago         e-1058    0.06      5       Perfect (verify e-1058)
  Warehouse Milwaukee   w-460     0.32      4       Strong

The crane-Chicago case is the architectural-honesty signal at work:
distance 0.23 says "tight match," but the judge, reading the original
body, rates it 1. A coordinator seeing only distance would ship the
wrong worker; a coordinator seeing distance+rating sees the
disagreement and escalates.

Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1
(irrelevant despite tight cosine). The substrate-honesty signal is
recovered without losing the LLM-parse quality wins.

Cost: 6 extra judge calls (~9s on qwen2.5). Production amortizes this
by running the judge only on the top-1 of high-priority inbox events;
the search-cost-vs-quality tradeoff lives in the priority gate,
sketched below.
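
A sketch of that gate, reusing the hypothetical types from the sketch
above (priorityHigh and the Priority field are assumptions, not the
real config):

```go
// Only high-priority inbox events pay the extra judge call; everything
// else keeps distance-only results. priorityHigh is an assumed constant.
const priorityHigh = 2

// maybeJudge returns nil when the event doesn't clear the gate, so the
// UI shows distance only. Each judge call ran ~1.5s on qwen2.5 here.
func maybeJudge(ctx context.Context, judge judgeClient, ev InboxEvent, top1 SearchHit) (*RatedHit, error) {
	if ev.Priority < priorityHigh {
		return nil, nil
	}
	hit, err := rateTop1(ctx, judge, ev, top1)
	if err != nil {
		return nil, err
	}
	return &hit, nil
}
```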

Implementation:
- New JudgeRating int field on Event (omitempty so non-judged events
  stay clean in JSON); see the struct sketch after this list
- New judgeInboxResult helper, reusing the same prompt structure as
  playbook_lift's judgeRate. The two could share an internal package
  if a third judge consumer appears.
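
A sketch of the Event change; only JudgeRating is from this run, the
other fields and the JSON key are placeholders:

```go
type Event struct {
	Phase       string  `json:"phase"`
	Coordinator string  `json:"coordinator"`
	Distance    float64 `json:"distance,omitempty"`
	// omitempty: the zero value is never serialized, so non-judged
	// events stay clean in JSON.
	JudgeRating int `json:"judge_rating,omitempty"`
}
```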

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


# Multi-Coordinator Stress Test — Run 008

- Generated: 2026-04-30T21:15:37.045817146Z
- Coordinators: alice / bob / carol (each with its own playbook namespace: playbook_alice / playbook_bob / playbook_carol)
- Contracts: alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
- Corpora: workers, ethereal_workers
- K per query: 8
- Total events captured: 67
- Evidence: reports/reality-tests/multi_coord_stress_008.json


## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
| --- | --- | --- | --- |
| Same role across different contracts | 0.000 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.0413 | 18 | Should be near-zero (different roles = different worker pools) |

Healthy ranges:

  • Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
  • Different roles same contract: < 0.10 means role-specific retrieval is working.
  • If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
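
For reference, the Jaccard values in these tables are plain set overlap
on top-K worker IDs. A minimal sketch in Go; the same helper would
drive the determinism metric in the next section:

```go
// jaccard returns |A ∩ B| / |A ∪ B| over two top-K worker-ID lists.
// 0 = fully disjoint worker sets, 1 = identical sets; duplicate IDs
// within a list are ignored.
func jaccard(a, b []string) float64 {
	inA := make(map[string]bool, len(a))
	for _, id := range a {
		inA[id] = true
	}
	inB := make(map[string]bool, len(b))
	for _, id := range b {
		inB[id] = true
	}
	inter := 0
	for id := range inA {
		if inB[id] {
			inter++
		}
	}
	union := len(inA) + len(inB) - inter
	if union == 0 {
		return 0 // both lists empty
	}
	return float64(inter) / float64(union)
}
```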

## Determinism — same query reissued, top-K stability

| Metric | Value |
| --- | --- |
| Mean Jaccard on retrieval-only reissue | 1.0 |
| Number of reissue pairs | 12 |

Interpretation:

  • ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
  • 0.80–0.95: Some HNSW or embed variance; acceptable.
  • < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
| --- | --- |
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| Verbatim handover hit rate (top-1) | 1.0 |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| Paraphrase handover hit rate (top-1) | 1.0 |

Interpretation:

  • Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
  • Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
  • Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
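
A sketch of how the hit rates above are computed; function and
parameter names are illustrative:

```go
// handoverHit reports whether Alice's recorded answer surfaces in
// Bob's results: at rank 0 (top-1) and anywhere in the top-K.
func handoverHit(recordedID string, bobTopK []string) (top1, inTopK bool) {
	for rank, id := range bobTopK {
		if id == recordedID {
			return rank == 0, true
		}
	}
	return false, false
}

// hitRate is top-1 hits over queries run; 4/4 = 1.0 for both the
// verbatim and paraphrase passes in this run.
func hitRate(top1Hits, queriesRun int) float64 {
	if queriesRun == 0 {
		return 0
	}
	return float64(top1Hits) / float64(queriesRun)
}
```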

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Filter by phase, coordinator, or role:

```sh
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_008.json
```

## What's NOT in this run (Phase 1 deliberately defers)

  • 48-hour clock. Events fire as discrete steps, not on a timeline.
  • Email / SMS ingest. No endpoints exist on the Go side yet.
  • New-resume injection mid-run. The corpus is fixed at the start.
  • Langfuse traces. Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.