golangLAKEHOUSE/reports/reality-tests/multi_coord_stress_008.md
root ce940f4a14 multi_coord_stress: judge re-rates inbox top-1 — recovers honesty signal
Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much
tighter cosine distances (0.05-0.10 in three cases) but lose the
"system has no good match" signal that high-distance results give.
A coordinator UI showing only distance can't distinguish wrong-domain
matches from genuine ones.

Fix: judge re-rates top-1 against the ORIGINAL inbox body (not the
LLM-parsed query). Coordinators see both:
  - distance: how close was retrieval in vector space
  - rating:   does this person actually fit the original ask
The pair tells the honest story.
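
Concretely, the re-rate step looks roughly like this; a minimal sketch
assuming hypothetical names (judgeClient, InboxEvent, SearchHit, and
rateTop1 are illustrative, not the actual API):

```go
package stress // hypothetical package name

import (
	"context"
	"fmt"
)

// judgeClient is a stand-in for the judge model client.
type judgeClient interface {
	// Rate sends a prompt and parses a 1-5 digit from the reply.
	Rate(ctx context.Context, prompt string) (int, error)
}

type InboxEvent struct {
	OriginalBody string // raw inbox text, before LLM parsing
	Priority     int    // used by the priority gate discussed below
}

type SearchHit struct {
	ID       string
	Distance float64 // cosine distance from vector search
	Document string  // worker profile text
}

// RatedHit is the pair a coordinator sees.
type RatedHit struct {
	WorkerID string
	Distance float64 // how close retrieval was in vector space
	Rating   int     // 1-5: does this person fit the original ask
}

// rateTop1 re-rates the top-1 hit against the ORIGINAL inbox body,
// never the LLM-parsed query. The prompt shape is illustrative; the
// real helper reuses playbook_lift's judgeRate prompt structure.
func rateTop1(ctx context.Context, judge judgeClient, ev InboxEvent, top1 SearchHit) (RatedHit, error) {
	prompt := fmt.Sprintf(
		"Original request:\n%s\n\nCandidate worker profile:\n%s\n\n"+
			"Rate 1-5 how well this person fits the original request. Reply with one digit.",
		ev.OriginalBody, top1.Document)
	rating, err := judge.Rate(ctx, prompt)
	if err != nil {
		return RatedHit{}, err
	}
	return RatedHit{WorkerID: top1.ID, Distance: top1.Distance, Rating: rating}, nil
}
```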

Run #008 result on the 6 inbox events:

  Demand                Top-1     Distance  Rating  Reading
  ─────────────────────────────────────────────────────────────
  Forklift Cleveland    w-3573    0.29      4       Strong
  Production Indy       e-1764    0.41      3       Adjacent
  Crane Chicago         e-7798    0.23      1       TIGHT BUT WRONG
  Bilingual safety Indy w-3918    0.05      5       Perfect
  Drone Chicago         e-1058    0.06      5       Perfect (verify e-1058)
  Warehouse Milwaukee   w-460     0.32      4       Strong

The crane-Chicago case is the architectural-honesty signal at work:
distance 0.23 says "tight match," but the judge, reading the original
body, rates it 1. A coordinator seeing only distance would ship the
wrong worker; a coordinator seeing distance+rating sees the
disagreement and escalates.

Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1
(irrelevant despite tight cosine). The substrate-honesty signal is
recovered without losing the LLM-parse quality wins.

Cost: 6 extra judge calls (~9s on qwen2.5). Production amortizes this
by running the judge only on the top-1 of high-priority inbox events;
the search-cost-vs-quality tradeoff lives in the priority gate,
sketched below.
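
A sketch of that gate, reusing the hypothetical types from the sketch
above (priorityHigh and the Priority field are assumptions, not the
real config):

```go
// Only high-priority inbox events pay the extra judge call; everything
// else keeps distance-only results. priorityHigh is an assumed constant.
const priorityHigh = 2

// maybeJudge returns nil when the event doesn't clear the gate, so the
// UI shows distance only. Each judge call ran ~1.5s on qwen2.5 here.
func maybeJudge(ctx context.Context, judge judgeClient, ev InboxEvent, top1 SearchHit) (*RatedHit, error) {
	if ev.Priority < priorityHigh {
		return nil, nil
	}
	hit, err := rateTop1(ctx, judge, ev, top1)
	if err != nil {
		return nil, err
	}
	return &hit, nil
}
```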

Implementation:
- New JudgeRating int field on Event (omitempty so non-judged events
  stay clean in JSON); see the struct sketch after this list
- New judgeInboxResult helper, reusing the same prompt structure as
  playbook_lift's judgeRate. The two could share an internal package
  if a third judge consumer appears.
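
A sketch of the Event change; only JudgeRating is from this run, the
other fields and the JSON key are placeholders:

```go
type Event struct {
	Phase       string  `json:"phase"`
	Coordinator string  `json:"coordinator"`
	Distance    float64 `json:"distance,omitempty"`
	// omitempty: the zero value is never serialized, so non-judged
	// events stay clean in JSON.
	JudgeRating int `json:"judge_rating,omitempty"`
}
```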

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


# Multi-Coordinator Stress Test — Run 008

- Generated: 2026-04-30T21:15:37.045817146Z
- Coordinators: alice / bob / carol (each with its own playbook namespace: playbook_alice / playbook_bob / playbook_carol)
- Contracts: alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
- Corpora: workers, ethereal_workers
- K per query: 8
- Total events captured: 67
- Evidence: reports/reality-tests/multi_coord_stress_008.json


## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
| --- | --- | --- | --- |
| Same role across different contracts | 0.000 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.0413 | 18 | Should be near-zero (different roles = different worker pools) |

Healthy ranges:

  • Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
  • Different roles same contract: < 0.10 means role-specific retrieval is working.
  • If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
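
For reference, the Jaccard values in these tables are plain set overlap
on top-K worker IDs. A minimal sketch in Go; the same helper would
drive the determinism metric in the next section:

```go
// jaccard returns |A ∩ B| / |A ∪ B| over two top-K worker-ID lists.
// 0 = fully disjoint worker sets, 1 = identical sets; duplicate IDs
// within a list are ignored.
func jaccard(a, b []string) float64 {
	inA := make(map[string]bool, len(a))
	for _, id := range a {
		inA[id] = true
	}
	inB := make(map[string]bool, len(b))
	for _, id := range b {
		inB[id] = true
	}
	inter := 0
	for id := range inA {
		if inB[id] {
			inter++
		}
	}
	union := len(inA) + len(inB) - inter
	if union == 0 {
		return 0 // both lists empty
	}
	return float64(inter) / float64(union)
}
```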

## Determinism — same query reissued, top-K stability

| Metric | Value |
| --- | --- |
| Mean Jaccard on retrieval-only reissue | 1.0 |
| Number of reissue pairs | 12 |

Interpretation:

  • ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
  • 0.80–0.95: Some HNSW or embed variance; acceptable.
  • < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
| --- | --- |
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| Verbatim handover hit rate (top-1) | 1.0 |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| Paraphrase handover hit rate (top-1) | 1.0 |

Interpretation:

  • Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
  • Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
  • Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
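
A sketch of how the hit rates above are computed; function and
parameter names are illustrative:

```go
// handoverHit reports whether Alice's recorded answer surfaces in
// Bob's results: at rank 0 (top-1) and anywhere in the top-K.
func handoverHit(recordedID string, bobTopK []string) (top1, inTopK bool) {
	for rank, id := range bobTopK {
		if id == recordedID {
			return rank == 0, true
		}
	}
	return false, false
}

// hitRate is top-1 hits over queries run; 4/4 = 1.0 for both the
// verbatim and paraphrase passes in this run.
func hitRate(top1Hits, queriesRun int) float64 {
	if queriesRun == 0 {
		return 0
	}
	return float64(top1Hits) / float64(queriesRun)
}
```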

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Filter by phase, coordinator, or role:

```sh
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_008.json
```

## What's NOT in this run (Phase 1 deliberately defers)

  • 48-hour clock. Events fire as discrete steps, not on a timeline.
  • Email / SMS ingest. No endpoints exist on the Go side yet.
  • New-resume injection mid-run. The corpus is fixed at the start.
  • Langfuse traces. Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.