golangLAKEHOUSE/reports/reality-tests/multi_coord_stress_001.md
root 61c7b55e48 multi-coord stress harness — Phase 1 of 48-hour mock
Three coordinators (alice / bob / carol) with three contracts
(Milwaukee distribution / Indianapolis manufacturing / Chicago
construction). 7-phase scenario runner: baseline → surge → merge →
handover → split → reissue → analysis. Each coord has a separate
playbook namespace (playbook_{name}) so institutional memory stays
isolated by default but transferable on demand.

Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
and Langfuse tracing — those are Phase 2/3.

Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors):

  Diversity:
    Different-roles-same-contract Jaccard = 0.004 (n=18)
      → role-specific retrieval is working: different roles within
        one contract pull almost entirely disjoint worker pools. The
        system is NOT cycling; it locks into per-role retrieval.
    Same-role-across-contracts Jaccard = N/A (n=0)
      → TEST-DESIGN ISSUE: the 3 contracts use distinct role names
        per industry (warehouse worker / production worker / general
        laborer), so no exact-name overlaps exist. Phase 2 should
        either share at least one role across contracts OR add a
        skill-based diversity metric.

  Determinism: Jaccard = 1.000 (n=12)
    → HNSW + Ollama retrieval returned identical top-K sets on all 12
      reissues of identical query text. coder/hnsw + nomic-embed-text
      are stable in this run.

  Learning: handover hit rate = 4/4 = 100%
    → Bob inherits Alice's recordings on every query when Bob runs
      identical queries with Alice's playbook namespace. CAVEAT:
      this tests the trivial verbatim case, not paraphrase handover.
      The harder test (Bob runs paraphrased queries with Alice's
      playbook) is Phase 2 work.

Per-event capture in JSON: every matrix.search response is logged
with phase / coordinator / contract / role / query / top-K IDs +
distances + per-corpus counts + boosted/injected counts. Reviewable
via:
  jq '.events[] | select(.phase == "merge")'
  jq '.events[] | select(.coordinator == "alice")'
  jq '.events[] | select(.role == "warehouse worker")'

Notable finding from per-event: carol's "general laborer" and "crane
operator" queries both surface w-1009 as top-1, with crane operator
at distance 0.098 (very tight) and general laborer at 0.297. The
system found a worker who legitimately covers both roles — realistic
for small construction crews.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:55:29 -05:00


Multi-Coordinator Stress Test — Run 001

Generated: 2026-04-30T12:54:09.621556469Z
Coordinators: alice / bob / carol (each with own playbook namespace: playbook_alice / playbook_bob / playbook_carol)
Contracts: alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
Corpora: workers,ethereal_workers
K per query: 8
Total events captured: 52
Evidence: reports/reality-tests/multi_coord_stress_001.json


Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---|---|---|
| Same role across different contracts | N/A (no pairs) | 0 | Lower = more diverse (different region/cert mix → different workers). Not measurable in this run: no role name is shared across contracts. |
| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |

Healthy ranges:

  • Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
  • Different roles same contract: < 0.10 means role-specific retrieval is working.
  • If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
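The thresholds above are applied to a mean of pairwise Jaccard scores over top-K worker ID sets. The following is a minimal sketch of that calculation, not the harness's actual code: the function names `jaccard` and `meanJaccard` and the worker IDs are hypothetical. It also returns an explicit "no pairs" flag so that an empty comparison can be reported as N/A rather than 0.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two top-K ID lists, treating each
// list as a set. Two empty lists score 0 by convention in this sketch.
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// meanJaccard averages the score over all pairs. ok is false when there are
// no pairs, so the caller can print N/A instead of a misleading 0.
func meanJaccard(pairs [][2][]string) (mean float64, ok bool) {
	if len(pairs) == 0 {
		return 0, false
	}
	sum := 0.0
	for _, p := range pairs {
		sum += jaccard(p[0], p[1])
	}
	return sum / float64(len(pairs)), true
}

func main() {
	// Hypothetical IDs: a and b share exactly one worker; a and c share none.
	a := []string{"w-1", "w-2", "w-3", "w-4", "w-5", "w-6", "w-7", "w-8"}
	b := []string{"w-8", "w-9", "w-10", "w-11", "w-12", "w-13", "w-14", "w-15"}
	c := []string{"w-21", "w-22", "w-23", "w-24", "w-25", "w-26", "w-27", "w-28"}

	// One overlapping pair plus 17 disjoint pairs, 18 pairs in total.
	pairs := [][2][]string{{a, b}}
	for i := 0; i < 17; i++ {
		pairs = append(pairs, [2][]string{a, c})
	}
	mean, ok := meanJaccard(pairs)
	fmt.Printf("single pair: %.4f, mean over %d pairs: %.4f (ok=%v)\n",
		jaccard(a, b), len(pairs), mean, ok)
}
```

With K = 8, a single shared worker in one pair gives 1/15 for that pair, and averaging over 18 pairs gives 1/270 ≈ 0.0037. The reported mean is consistent with exactly one overlapping worker across the 18 pairs, assuming every top-K list holds 8 unique IDs.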

Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

Interpretation:

  • ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
  • 0.80 to 0.95: Some HNSW or embed variance, acceptable.
  • < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---|
| Handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 | 4 |
| Alice's recorded answer in Bob's top-K | 4 |
| Handover hit rate (top-1) | 1 |

Interpretation:

  • Hit rate ≥ 0.5: handover is meaningful — the second coordinator inherits the first's institutional memory.
  • Hit rate ≈ 0.0: playbook namespace isolation is working but the playbook itself isn't transferable, OR Bob's queries don't match Alice's recordings closely enough.
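The top-1 and top-K counts in the table can be derived from each handover query's ranked result list. The sketch below shows one way to compute both rates. The `handoverResult` type, the `hitRates` function, and the worker IDs are hypothetical and are not taken from the harness.

```go
package main

import "fmt"

// handoverResult is a hypothetical shape for one handover query: the worker
// ID the first coordinator recorded, and the ranked top-K the second
// coordinator received for the same query.
type handoverResult struct {
	Recorded string
	TopK     []string
}

// hitRates returns the fraction of queries whose recorded ID is at rank 1,
// and the fraction whose recorded ID appears anywhere in the top-K.
func hitRates(results []handoverResult) (top1, topK float64) {
	if len(results) == 0 {
		return 0, 0
	}
	var hits1, hitsK int
	for _, r := range results {
		for rank, id := range r.TopK {
			if id == r.Recorded {
				hitsK++
				if rank == 0 {
					hits1++
				}
				break // count each query at most once
			}
		}
	}
	n := float64(len(results))
	return float64(hits1) / n, float64(hitsK) / n
}

func main() {
	results := []handoverResult{
		{Recorded: "w-1", TopK: []string{"w-1", "w-7"}}, // hit at rank 1
		{Recorded: "w-2", TopK: []string{"w-9", "w-2"}}, // hit in top-K only
	}
	top1, topK := hitRates(results)
	fmt.Printf("top-1 %.2f, top-K %.2f\n", top1, topK)
}
```

Reporting both rates separates "the recording surfaced at all" from "the recording won the ranking", which matters more once paraphrased queries are tested.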

Per-event capture

All matrix.search responses live in the JSON: top-K with worker IDs, distances, and per-corpus counts. Example queries by phase, coordinator, and role:

jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_001.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_001.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_001.json

What's NOT in this run (Phase 1 deliberately defers)

  • 48-hour clock. Events fire as discrete steps, not on a timeline.
  • Email / SMS ingest. No endpoints exist on the Go side yet.
  • New-resume injection mid-run. The corpus is fixed at the start.
  • Langfuse traces. Need Go-side wiring.

These are Phase 2/3 work. The time-based runner will be built on top of the Phase 1 substrate.