golangLAKEHOUSE/reports/reality-tests/multi_coord_stress_007.md
root 186d209aae multi_coord_stress: LLM-parsed inbox demands (qwen2.5)
Replaced the hard-coded DemandQuery on inbox events with an actual
LLM call: each email/SMS body is parsed by qwen2.5 (format=json,
schema-anchored) into structured {role, count, location, certs,
skills, shift}. The driver then composes a query string from those
fields and runs matrix.search.
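
The shape of that flow, as a minimal sketch: the DemandQuery fields
come from the schema above, but the prompt text, Ollama endpoint, and
error handling are assumptions, not the actual driver code.

  import (
      "bytes"
      "encoding/json"
      "net/http"
      "strings"
  )

  // DemandQuery mirrors the schema the parser is anchored on.
  type DemandQuery struct {
      Role     string   `json:"role"`
      Count    int      `json:"count"`
      Location string   `json:"location"`
      Certs    []string `json:"certs"`
      Skills   []string `json:"skills"`
      Shift    string   `json:"shift"`
  }

  // parseDemand sends one inbox body to qwen2.5 in JSON mode and
  // decodes the structured demand.
  func parseDemand(body string) (DemandQuery, error) {
      prompt := "Parse this staffing demand into JSON with exactly these " +
          "keys: role, count, location, certs, skills, shift. Default " +
          "count to 1 when unspecified. Demand:\n" + body
      reqBody, _ := json.Marshal(map[string]any{
          "model":  "qwen2.5",
          "prompt": prompt,
          "format": "json", // Ollama's JSON mode
          "stream": false,
      })
      resp, err := http.Post("http://localhost:11434/api/generate",
          "application/json", bytes.NewReader(reqBody))
      if err != nil {
          return DemandQuery{}, err
      }
      defer resp.Body.Close()
      var out struct {
          Response string `json:"response"`
      }
      if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
          return DemandQuery{}, err
      }
      var q DemandQuery
      err = json.Unmarshal([]byte(out.Response), &q)
      return q, err
  }

  // composeQuery flattens the parsed fields into the string handed
  // to matrix.search.
  func composeQuery(q DemandQuery) string {
      parts := []string{q.Role, q.Location, q.Shift + " shift"}
      parts = append(parts, q.Certs...)
      parts = append(parts, q.Skills...)
      return strings.Join(parts, ", ")
  }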

This is the real-product flow that the Phase 3 stress test was
asking for: real bodies → real LLM parsing → real search. Before
this commit, the DemandQuery was my hand-crafted string, which
made the inbox phase trivial.

Run #007 result vs #006 (same bodies, parser swapped):

  All 6 inbox events parsed cleanly — qwen2.5 nailed:
    "Need 50 forklift operators in Cleveland OH for Monday day
     shift. OSHA-30 + active forklift cert required."
    → {role:"forklift operator", count:50, location:"Cleveland, OH",
       certs:["OSHA-30","active forklift cert"], skills:[], shift:"day"}
    The other 5 were similarly faithful (indy stayed as "indy", count
    defaulted to 1 when unspecified, no hallucinated fields).

  LLM-parsed queries produced TIGHTER matches than hard-coded:
    Demand              #006 dist  #007 dist  Δ
    Crane Chicago       0.499      0.093      -82%
    Drone Chicago       0.707      0.073      -90%
    Bilingual safety    0.240      0.048      -80%
    Forklift Cleveland  0.330      0.273      -17%
    Production Indy     0.260      0.399      +53%
    Warehouse Milwaukee 0.458      0.420       -8%

  Three matches landed at distance < 0.10 — verbatim-replay-tight
  territory. Structured queries embed sharper than conversational
  hand-crafted strings.

  Other metrics unchanged: diversity 0.000, determinism 1.000,
  verbatim handover 4/4, paraphrase handover 4/4.

Tradeoff worth flagging: the drone-Chicago case dropped from
distance 0.71 (clear "we don't have one") to 0.07 (confident match
returned). The OOD honesty signal weakens when LLM-parsed structure
makes any closest-neighbor look tight. Future Phase 4 work: judge
re-rates the top match before surfacing, so coordinators see "your
demand was for X but the closest match scored 2/5" rather than just
the worker ID + distance.
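
A sketch of that guard, where scoreMatch stands in for the judge call
and the 2/5 threshold is purely illustrative:

  import "fmt"

  // Phase 4 idea only: re-rate the top match before surfacing it.
  func surface(demand, workerID string, dist float64) string {
      score := scoreMatch(demand, workerID)
      if score <= 2 {
          return fmt.Sprintf("%s (dist %.3f): judge scored this match "+
              "%d/5 against your demand; likely no real match",
              workerID, dist, score)
      }
      return fmt.Sprintf("%s (dist %.3f, judge %d/5)", workerID, dist, score)
  }

  // scoreMatch is a stand-in: the real call would ask a judge model
  // to rate the worker's fit for the demand on a 1-5 scale.
  func scoreMatch(demand, workerID string) int { return 2 }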

Substrate cost: +6 LLM calls per inbox burst (~9s on qwen2.5).
Production would amortize via a small dedicated parser model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:51:19 -05:00


Multi-Coordinator Stress Test — Run 007

Generated: 2026-04-30T19:50:04.791000091Z
Coordinators: alice / bob / carol (each with own playbook namespace: playbook_alice / playbook_bob / playbook_carol)
Contracts: alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
Corpora: workers,ethereal_workers
K per query: 8
Total events captured: 67
Evidence: reports/reality-tests/multi_coord_stress_007.json


Diversity — is the system locking into scenarios or cycling?

  Metric                                 Mean Jaccard   n pairs
  Same role across different contracts   0.000          9
  Different roles within same contract   0.026          18

For the cross-contract row, lower = more diverse (different region/cert mix → different workers); the within-contract row should be near-zero (different roles = different worker pools).

Healthy ranges:

  • Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
  • Different roles same contract: < 0.10 means role-specific retrieval is working.
  • If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
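
For reference, the Jaccard numbers above are plain set overlap between
two top-K ID lists; a minimal sketch, assuming IDs are unique within
each list:

// jaccard computes |A ∩ B| / |A ∪ B| over two top-K ID slices.
func jaccard(a, b []string) float64 {
    seen := make(map[string]bool, len(a))
    for _, id := range a {
        seen[id] = true
    }
    inter := 0
    for _, id := range b {
        if seen[id] {
            inter++
        }
    }
    union := len(a) + len(b) - inter
    if union == 0 {
        return 0
    }
    return float64(inter) / float64(union)
}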

Determinism — same query reissued, top-K stability

  Metric                                   Value
  Mean Jaccard on retrieval-only reissue   1.000
  Number of reissue pairs                  12

Interpretation:

  • ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
  • 0.80–0.95: Some HNSW or embed variance, acceptable.
  • < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
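
The reissue check is that same statistic applied to one query run
twice; a sketch, where matrix.Search returning a []string of worker
IDs is an assumed shape:

// reissueStability reissues a retrieval-only query and reports
// top-K overlap; 1.0 means the two top-K sets are identical.
func reissueStability(query string, k int) float64 {
    return jaccard(matrix.Search(query, k), matrix.Search(query, k))
}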

Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

  Metric                                                Value
  Verbatim handover queries run                         4
  Alice's recorded answer at Bob's top-1 (verbatim)     4
  Alice's recorded answer in Bob's top-K (verbatim)     4
  Verbatim handover hit rate (top-1)                    1.000
  Paraphrase handover queries run                       4
  Alice's recorded answer at Bob's top-1 (paraphrase)   4
  Alice's recorded answer in Bob's top-K (paraphrase)   4
  Paraphrase handover hit rate (top-1)                  1.000

Interpretation:

  • Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
  • Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
  • Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
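
The hit-rate rows reduce to a top-1 membership check; another sketch,
with matrix.Search and the recorded map (demand text → answer worker
ID) as assumed shapes:

// handoverHitRate: fraction of Bob's queries whose top-1 result is
// Alice's recorded answer for that demand.
func handoverHitRate(queries []string, recorded map[string]string, k int) float64 {
    hits := 0
    for _, q := range queries {
        if top := matrix.Search(q, k); len(top) > 0 && top[0] == recorded[q] {
            hits++
        }
    }
    return float64(hits) / float64(len(queries))
}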

Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Example jq filters by phase, coordinator, and role:

jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_007.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_007.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_007.json

What's NOT in this run (Phase 1 deliberately defers)

  • 48-hour clock. Events fire as discrete steps, not on a timeline.
  • Email / SMS ingest. No endpoints exist on the Go side yet.
  • New-resume injection mid-run. The corpus is fixed at the start.
  • Langfuse traces. Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.