root 0fa42a0cc3 multi-coord stress Phase 1.5: shared-role contracts + paraphrase handover

Phase 1 had two known gaps: (1) the 3 contracts had zero shared role
names, so same-role-across-contracts Jaccard was vacuous (n=0); (2)
the verbatim handover at 100% was the trivial case, not the hard
learning test (paraphrased queries against another coord's playbook).

Both fixed in this commit.

Contract redesign — all 3 contracts now share warehouse worker /
admin assistant / heavy equipment operator roles, plus a unique
specialist per contract (industrial electrician / bilingual safety
coord / drone surveyor — the "specialist not on the standard roster"
case from J's spec). Counts and skill mixes vary per region.

New driver phase 4b — paraphrase handover. Bob runs qwen2.5-paraphrased
versions of Alice's contract queries against Alice's playbook
namespace. Tests whether institutional memory propagates across
coordinators AND across natural wording variation that Bob would
introduce when running Alice's contract.

Run #002 result (5K workers + 10K ethereal_workers, 4 demands × 3
coords + paraphrase handover):

  Diversity (the question J asked: locking or cycling?):
    Same-role-across-contracts Jaccard = 0.119 (n=9)
      → on average ~88% of the workers in two regions' combined
        top-K for the same role name appear in only one region's
        list. Milwaukee warehouse vs Indianapolis warehouse vs
        Chicago warehouse pull mostly distinct top-K from the same
        population. The system locks into geo+cert+skill context
        rather than cycling the same workers.
    Different-roles-same-contract Jaccard = 0.004 (n=18)
      → role-specific retrieval works (unchanged from Phase 1).

  Determinism: Jaccard = 1.000 (n=12) — unchanged.

  Learning:
    Verbatim handover  4/4 = 100%  (trivial case, expected)
    Paraphrase handover 4/4 = 100% (HARD case — passes!)
      Of those 4 paraphrase recoveries:
        - 2 used boost (Alice's recording was already in Bob's
          paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1)
        - 2 used Shape B inject (recording wasn't in Bob's
          paraphrase top-K; InjectPlaybookMisses brought it in)

The boost/inject mix is healthy: both paths are used and both
produce correct top-1s. On this 4-query sample, multi-coord
institutional memory propagation works under wording variation.

Sample warehouse worker top-1s across contracts (illustrates diversity):
  alice / Milwaukee     → w-713
  bob   / Indianapolis  → e-8447
  carol / Chicago       → e-7145
Three different workers from the same 15K-person population,
selected on geo+cert+skill context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 08:03:16 -05:00

3.7 KiB

Multi-Coordinator Stress Test — Run 002

Generated: 2026-04-30T13:02:13.570393819Z
Coordinators: alice / bob / carol (each with own playbook namespace: playbook_alice / playbook_bob / playbook_carol)
Contracts: alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
Corpora: workers,ethereal_workers
K per query: 8
Total events captured: 56
Evidence: reports/reality-tests/multi_coord_stress_002.json

Diversity — is the system locking into scenarios or cycling?

Metric	Mean Jaccard	n pairs	Interpretation
Same role across different contracts	0.11900691900691901	9	Lower = more diverse (different region/cert mix → different workers)
Different roles within same contract	0.003703703703703704	18	Should be near-zero (different roles = different worker pools)

Healthy ranges:

Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
Different roles same contract: < 0.10 means role-specific retrieval is working.
If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
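
The two diversity metrics can be reproduced from top-K ID lists with a few lines of code. The following is a minimal sketch, not the driver's actual implementation: the function names and the "borderline" label for values between the healthy and cycling thresholds are assumptions made for illustration. With three contracts, each shared role yields three unordered contract pairs, so three shared roles give nine pairs, which is consistent with n = 9 in the table above.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of worker IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty result sets count as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(top_k_lists):
    """Mean Jaccard over every unordered pair of top-K lists."""
    pairs = list(combinations(top_k_lists, 2))
    if not pairs:
        return None  # n = 0: the metric is undefined rather than zero
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def diversity_verdict(same_role_mean, diff_role_mean):
    """Apply the healthy-range thresholds listed above.

    "borderline" is a hypothetical label for values that are neither
    inside the healthy ranges nor above the 0.50 cycling threshold.
    """
    if same_role_mean > 0.50 or diff_role_mean > 0.50:
        return "cycling"
    if same_role_mean < 0.30 and diff_role_mean < 0.10:
        return "healthy"
    return "borderline"
```

Returning `None` for an empty pair list keeps the vacuous n = 0 case distinguishable from a genuine mean of zero.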

Determinism — same query reissued, top-K stability

Metric	Value
Mean Jaccard on retrieval-only reissue	1
Number of reissue pairs	12

Interpretation:

≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
0.80 – 0.95: Some HNSW or embed variance, acceptable.
< 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
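
As a rough illustration of how the reissue metric maps onto the bands above, the sketch below computes the mean Jaccard over (first run, reissue) pairs and classifies it. The function names and band labels are hypothetical, and it assumes each reissue is compared only against its own original query.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of worker IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty result sets
    return len(a & b) / len(a | b)


def reissue_determinism(pairs):
    """pairs: list of (first_top_k_ids, reissue_top_k_ids).

    Returns (mean_jaccard, band), or (None, None) when there are no pairs.
    """
    if not pairs:
        return None, None
    mean = sum(jaccard(first, again) for first, again in pairs) / len(pairs)
    if mean >= 0.95:
        band = "deterministic"
    elif mean >= 0.80:
        band = "acceptable"
    else:
        band = "unstable"
    return mean, band
```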

Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

Metric	Value
Verbatim handover queries run	4
Alice's recorded answer at Bob's top-1 (verbatim)	4
Alice's recorded answer in Bob's top-K (verbatim)	4
Verbatim handover hit rate (top-1)	1
Paraphrase handover queries run	4
Alice's recorded answer at Bob's top-1 (paraphrase)	4
Alice's recorded answer in Bob's top-K (paraphrase)	4
Paraphrase handover hit rate (top-1)	1

Interpretation:

Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
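
The handover counts in the table and the two recovery paths can be sketched as follows. This is an illustration under stated assumptions, not the Go-side code: `recover` is a hypothetical stand-in for the boost and inject steps, the distance being compared and the `inject_threshold` parameter are assumed, and dropping the last result to keep K constant after an inject is also an assumption.

```python
def handover_stats(results):
    """results: list of (recorded_id, bob_top_k_ids), one per handover query."""
    n = len(results)
    top1 = sum(1 for rec, ids in results if ids and ids[0] == rec)
    topk = sum(1 for rec, ids in results if rec in ids)
    return {
        "queries": n,
        "top1": top1,
        "topk": topk,
        "top1_hit_rate": top1 / n if n else None,
    }


def recover(recorded_id, raw_top_k, distance, inject_threshold):
    """Return (final_top_k, path) for one handover query.

    distance: assumed distance between Bob's query and Alice's recorded
    query; the real system may compare something else.
    """
    if recorded_id in raw_top_k:
        # Boost path: the recording is already retrieved, so re-rank it to top-1.
        rest = [w for w in raw_top_k if w != recorded_id]
        return [recorded_id] + rest, "boost"
    if distance <= inject_threshold:
        # Inject path: the recording was missed, so bring it in at top-1
        # and drop the last result to keep K unchanged (assumption).
        return [recorded_id] + list(raw_top_k[:-1]), "inject"
    # The paraphrase drifted past the threshold: the recording does not activate.
    return list(raw_top_k), "miss"
```

Tracking the returned path per query gives the boost/inject split alongside the hit rate, which shows whether recoveries depend on one path only.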

Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_002.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_002.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_002.json
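
For analysis beyond one-liners, the same selection can be done in Python. The sketch below mirrors the third jq query and relies only on the field names that appear in the jq filters above (`events`, `phase`, `role`, `contract`, `top_k`, `id`); the helper names are hypothetical, and the rest of the JSON schema is not assumed.

```python
import json
from collections import defaultdict


def load_report(path):
    """Load a stress-test report from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def top_k_by_contract(report, role, phase=None):
    """Map contract -> list of top-K worker ID lists for one role.

    A contract can have several events for the same role (one per phase),
    so each contract maps to a list of ID lists unless phase is given.
    """
    out = defaultdict(list)
    for event in report.get("events", []):
        if event.get("role") != role:
            continue
        if phase is not None and event.get("phase") != phase:
            continue
        ids = [hit["id"] for hit in event.get("top_k", [])]
        out[event.get("contract")].append(ids)
    return dict(out)


# Example usage (requires the report file to exist):
# report = load_report("reports/reality-tests/multi_coord_stress_002.json")
# print(top_k_by_contract(report, "warehouse worker", phase="baseline"))
```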

What's NOT in this run (Phase 1 deliberately defers)

48-hour clock. Events fire as discrete steps, not on a timeline.
Email / SMS ingest. No endpoints exist on the Go side yet.
New-resume injection mid-run. The corpus is fixed at the start.
Langfuse traces. Need Go-side wiring.

These are deferred to Phase 2/3. The time-based runner will be built on top of the Phase 1 substrate.
