multi-coord stress Phase 1.5: shared-role contracts + paraphrase handover

Phase 1 had two known gaps:
(1) the 3 contracts shared no role names, so the same-role-across-contracts Jaccard was vacuous (n=0);
(2) the verbatim handover at 100% was the trivial case, not the hard learning test (paraphrased queries against another coordinator's playbook).
Both are fixed in this commit.

Contract redesign: all 3 contracts now share the warehouse worker / admin assistant / heavy equipment operator roles, plus one unique specialist per contract (industrial electrician / bilingual safety coordinator / drone surveyor; the "specialist not on the standard roster" case from J's spec). Counts and skill mixes vary per region.

New driver phase 4b: paraphrase handover. Bob runs qwen2.5-paraphrased versions of Alice's contract queries against Alice's playbook namespace. This tests whether institutional memory propagates across coordinators AND across the natural wording variation that Bob would introduce when running Alice's contract.

Run #002 result (5K workers + 10K ethereal_workers, 4 demand entries × 3 coords + paraphrase handover):

Diversity (the question J asked: locking or cycling?):
- Same-role-across-contracts Jaccard = 0.119 (n=9) → on average about 88% of the union of two regions' top-K sets for the same role name is not shared. Milwaukee warehouse vs Indianapolis warehouse vs Chicago warehouse pull mostly distinct top-K from the same population. The system locks into geo+cert+skill context rather than cycling.
- Different-roles-same-contract Jaccard = 0.004 (n=18) → role-specific retrieval works (unchanged from Phase 1).

Determinism: Jaccard = 1.000 (n=12), unchanged.

Learning:
- Verbatim handover 4/4 = 100% (trivial case, expected)
- Paraphrase handover 4/4 = 100% (the HARD case; it passes)

Of those 4 paraphrase recoveries:
- 2 used boost (Alice's recording was already in Bob's paraphrase top-K; ApplyPlaybookBoost re-ranked it to top-1)
- 2 used Shape B inject (the recording wasn't in Bob's paraphrase top-K; InjectPlaybookMisses brought it in)

The boost/inject mix is healthy: both paths are exercised and both produce correct top-1s. In this run (4/4 paraphrased queries), multi-coord institutional memory propagation holds under wording variation.

Sample warehouse worker top-1s across contracts (illustrates the diversity):
  alice / Milwaukee → w-713
  bob / Indianapolis → e-8447
  carol / Chicago → e-7145

Three different workers from the same 15K-person population, selected on geo+cert+skill context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:03:16 -05:00
commit 0fa42a0cc3
parent 61c7b55e48
5 changed files with 240 additions and 14 deletions
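The commit message distinguishes two ways a recorded answer can reach Bob's top-1: boost (the recording was already in the paraphrase top-K) and inject (it was not). A minimal sketch of that distinction follows. The function name `recoveryPath`, its inputs, and the `id-N` values are hypothetical and only illustrate the rule stated in the message; this is not the driver's or the gateway's actual code.

```go
package main

import "fmt"

// recoveryPath classifies how a recorded answer could have surfaced,
// following the wording of the commit message:
//   "boost":  the recorded ID was already in the paraphrase top-K, so
//             re-ranking alone can lift it to top-1.
//   "inject": the recorded ID was absent and had to be brought in.
// retrievalTopK is the top-K before any playbook step (assumed input).
func recoveryPath(retrievalTopK []string, recordedID string) string {
	for _, id := range retrievalTopK {
		if id == recordedID {
			return "boost"
		}
	}
	return "inject"
}

func main() {
	topK := []string{"id-3", "id-7", "id-1"} // illustrative IDs only
	fmt.Println(recoveryPath(topK, "id-7"))  // prints "boost"
	fmt.Println(recoveryPath(topK, "id-9"))  // prints "inject"
}
```

Counting both labels per run would make the 2/2 split reported above reproducible from the captured events, assuming the pre-playbook top-K is available for each query.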
--- a/reports/reality-tests/multi_coord_stress_002.md
+++ b/reports/reality-tests/multi_coord_stress_002.md
@@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 002
+
+**Generated:** 2026-04-30T13:02:13.570393819Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 56
+**Evidence:** `reports/reality-tests/multi_coord_stress_002.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0.11900691900691901 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_002.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_002.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_002.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
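The diversity and determinism figures in the report are mean Jaccard similarities over pairs of top-K ID sets. The sketch below shows one way such a pairwise score can be computed. The function name `jaccard`, the treatment of two empty sets as 1.0, and the `id-N` values are assumptions for illustration, not taken from the driver.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two lists of worker IDs,
// treating each list as a set. Two empty lists score 1.0 here, which
// is a convention chosen for this sketch.
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 1.0
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

func main() {
	// Two K=8 result lists that share exactly one ID (illustrative IDs).
	a := []string{"id-1", "id-2", "id-3", "id-4", "id-5", "id-6", "id-7", "id-8"}
	b := []string{"id-8", "id-9", "id-10", "id-11", "id-12", "id-13", "id-14", "id-15"}
	fmt.Printf("%.3f\n", jaccard(a, b)) // prints 0.067 (1 shared / 15 in the union)
}
```

If every query returns a full top-K of 8, one shared worker gives 1/15 ≈ 0.067 and two give 2/14 ≈ 0.143, so the reported same-role mean of 0.119 would correspond to roughly one or two shared workers per pair of regions.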
--- a/scripts/multi_coord_stress.sh
+++ b/scripts/multi_coord_stress.sh
@@ -155,12 +155,19 @@ echo "[stress] ingest ethereal_workers (limit=$ETHEREAL_LIMIT, 0=all) into 'ethe

 echo
 echo "[stress] running multi-coord stress driver..."
+EXTRA_FLAGS=""
+if [ "${WITH_PARAPHRASE_HANDOVER:-1}" = "1" ]; then
+  EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase-handover"
+fi
 ./bin/multi_coord_stress \
  -gateway "http://127.0.0.1:3110" \
  -contracts tests/reality/contracts \
  -corpora "$CORPORA" \
  -k "$K" \
-  -out "$OUT_JSON"
+  -out "$OUT_JSON" \
+  -ollama  "http://localhost:11434" \
+  -judge   "${JUDGE_MODEL:-qwen2.5:latest}" \
+  $EXTRA_FLAGS

 echo
 echo "[stress] generating markdown report → $OUT_MD"
@@ -179,6 +186,10 @@ hand_run=$(jq -r '.learning.handover_queries_run' "$OUT_JSON")
 hand_top1=$(jq -r '.learning.recorded_answers_top1_count' "$OUT_JSON")
 hand_topk=$(jq -r '.learning.recorded_answers_topk_count' "$OUT_JSON")
 hand_rate=$(jq -r '.learning.handover_hit_rate' "$OUT_JSON")
+ph_run=$(jq -r '.learning.paraphrase_handover_run // 0' "$OUT_JSON")
+ph_top1=$(jq -r '.learning.paraphrase_top1_count // 0' "$OUT_JSON")
+ph_topk=$(jq -r '.learning.paraphrase_topk_count // 0' "$OUT_JSON")
+ph_rate=$(jq -r '.learning.paraphrase_handover_hit_rate // 0' "$OUT_JSON")

 cat > "$OUT_MD" <<MDEOF
 # Multi-Coordinator Stress Test — Run ${RUN_ID}
@@ -227,14 +238,19 @@ Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorde

 | Metric | Value |
 |---|---:|
-| Handover queries run | ${hand_run} |
-| Alice's recorded answer at Bob's top-1 | ${hand_top1} |
-| Alice's recorded answer in Bob's top-K | ${hand_topk} |
-| **Handover hit rate (top-1)** | **${hand_rate}** |
+| Verbatim handover queries run | ${hand_run} |
+| Alice's recorded answer at Bob's top-1 (verbatim) | ${hand_top1} |
+| Alice's recorded answer in Bob's top-K (verbatim) | ${hand_topk} |
+| **Verbatim handover hit rate (top-1)** | **${hand_rate}** |
+| Paraphrase handover queries run | ${ph_run} |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | ${ph_top1} |
+| Alice's recorded answer in Bob's top-K (paraphrase) | ${ph_topk} |
+| **Paraphrase handover hit rate (top-1)** | **${ph_rate}** |

 **Interpretation:**
-- Hit rate ≥ 0.5: handover is meaningful — the second coordinator inherits the first's institutional memory.
-- Hit rate ≈ 0.0: playbook namespace isolation is working but the playbook itself isn't transferable, OR Bob's queries don't match Alice's recordings closely enough.
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.

 ---

--- a/scripts/multi_coord_stress/main.go
+++ b/scripts/multi_coord_stress/main.go
@@ -142,11 +142,24 @@ type Determ struct {
 // Learning = handover signal. After Alice records playbooks for her
 // contract, Bob runs the same queries with Alice's playbook namespace.
 // We measure: do Alice's recorded answer IDs surface in Bob's top-K?
+//
+// Two modes:
+//   - Verbatim handover: Bob runs Alice's exact queries (trivial case).
+//   - Paraphrase handover: Bob runs paraphrased queries against Alice's
+//     playbook (the hard case — does cosine on paraphrase find the
+//     recorded query's vector?). This is the multi-coord analog of the
+//     paraphrase reality test in playbook_lift.
 type Learning struct {
 	HandoverQueriesRun       int     `json:"handover_queries_run"`
 	RecordedAnswersTop1Count int     `json:"recorded_answers_top1_count"`
 	RecordedAnswersTopKCount int     `json:"recorded_answers_topk_count"`
 	HandoverHitRate          float64 `json:"handover_hit_rate"`
+
+	// Paraphrase handover — only populated when --with-paraphrase-handover.
+	ParaphraseHandoverRun       int     `json:"paraphrase_handover_run,omitempty"`
+	ParaphraseTop1Count         int     `json:"paraphrase_top1_count,omitempty"`
+	ParaphraseTopKCount         int     `json:"paraphrase_topk_count,omitempty"`
+	ParaphraseHandoverHitRate   float64 `json:"paraphrase_handover_hit_rate,omitempty"`
 }

 // ── main ─────────────────────────────────────────────────────────
@@ -158,6 +171,9 @@ func main() {
 		corporaCSV    = flag.String("corpora", "workers,ethereal_workers", "comma-separated matrix corpora")
 		k             = flag.Int("k", 8, "top-k from matrix.search per query")
 		out           = flag.String("out", "reports/reality-tests/multi_coord_stress_001.json", "output JSON path")
+		ollama        = flag.String("ollama", "http://127.0.0.1:11434", "Ollama base URL (only used if --with-paraphrase-handover)")
+		judgeModel    = flag.String("judge", "qwen2.5:latest", "Ollama model for paraphrase generation (only used if --with-paraphrase-handover)")
+		withParaphraseHandover = flag.Bool("with-paraphrase-handover", false, "after the verbatim handover phase, run a paraphrase handover phase: Bob runs paraphrased versions of Alice's queries against Alice's playbook")
 	)
 	flag.Parse()

@@ -285,6 +301,58 @@ func main() {
 		output.Learning.HandoverHitRate = float64(handoverHitsTop1) / float64(handoverRun)
 	}

+	// ── Phase 4b: paraphrase handover ───────────────────────────
+	// Bob runs PARAPHRASED versions of Alice's queries against
+	// Alice's playbook. The verbatim handover above is the trivial
+	// case (identical queries → identical retrieval → playbook
+	// boost). The paraphrase handover is the real test: did Alice's
+	// institutional memory survive the wording change Bob would
+	// naturally introduce?
+	if *withParaphraseHandover {
+		log.Printf("[stress] phase 4b: paraphrase handover (bob runs paraphrased versions of alice's queries)")
+		pHandoverRun := 0
+		pTop1 := 0
+		pTopK := 0
+		for _, d := range contracts[0].Demand {
+			origQuery := buildQuery(&contracts[0], d, 1)
+			paraphrase, err := generateParaphrase(hc, *ollama, *judgeModel, origQuery)
+			if err != nil {
+				log.Printf("  paraphrase gen failed for %s: %v", d.Role, err)
+				continue
+			}
+			resp, err := matrixSearch(hc, *gateway, paraphrase, corpora, *k, true, coords[0].PlaybookCorpus)
+			if err != nil {
+				log.Printf("  paraphrase search failed for %s: %v", d.Role, err)
+				continue
+			}
+			ev := captureEvent("handover-paraphrase", "bob", contracts[0].Name, d.Role, paraphrase, 1, true, coords[0].PlaybookCorpus, resp)
+			ev.Note = "paraphrase of: " + origQuery
+			output.Events = append(output.Events, ev)
+			pHandoverRun++
+			recordedID, ok := aliceRecordedAnswers[d.Role]
+			if !ok {
+				continue
+			}
+			if len(ev.TopK) > 0 && ev.TopK[0].ID == recordedID {
+				pTop1++
+				pTopK++
+			} else {
+				for _, r := range ev.TopK {
+					if r.ID == recordedID {
+						pTopK++
+						break
+					}
+				}
+			}
+		}
+		output.Learning.ParaphraseHandoverRun = pHandoverRun
+		output.Learning.ParaphraseTop1Count = pTop1
+		output.Learning.ParaphraseTopKCount = pTopK
+		if pHandoverRun > 0 {
+			output.Learning.ParaphraseHandoverHitRate = float64(pTop1) / float64(pHandoverRun)
+		}
+	}
+
 	// ── Phase 5: split — surge re-distributed across 3 coords ──
 	log.Printf("[stress] phase 5: split (alpha surge spread across all 3 coords)")
 	for i, d := range contracts[0].Demand {
@@ -339,12 +407,72 @@ func main() {
 		output.Diversity.DifferentRolesSameContractMeanJaccard, output.Diversity.NumPairsDifferentRolesSameContract)
 	log.Printf("[stress]   determinism: mean Jaccard on reissue = %.3f (n=%d)",
 		output.Determinism.MeanJaccard, output.Determinism.NumReissuedPairs)
-	log.Printf("[stress]   learning: handover hit rate (top-1) = %d/%d = %.0f%%",
+	log.Printf("[stress]   learning verbatim: handover hit rate (top-1) = %d/%d = %.0f%%",
 		output.Learning.RecordedAnswersTop1Count, output.Learning.HandoverQueriesRun,
 		output.Learning.HandoverHitRate*100)
+	if output.Learning.ParaphraseHandoverRun > 0 {
+		log.Printf("[stress]   learning paraphrase: handover hit rate (top-1) = %d/%d = %.0f%% (top-K = %d/%d)",
+			output.Learning.ParaphraseTop1Count, output.Learning.ParaphraseHandoverRun,
+			output.Learning.ParaphraseHandoverHitRate*100,
+			output.Learning.ParaphraseTopKCount, output.Learning.ParaphraseHandoverRun)
+	}
 	log.Printf("[stress]   results → %s", *out)
 }

+// generateParaphrase asks the judge model to rephrase a staffing query
+// while preserving intent — same prompt template as
+// scripts/playbook_lift/main.go, kept here as a copy to avoid a shared
+// internal package for two scripts. If callers ever need a third
+// paraphraser, lift this into internal/paraphrase/.
+func generateParaphrase(hc *http.Client, ollamaURL, model, query string) (string, error) {
+	system := `You rephrase staffing queries while preserving intent.
+Output JSON only: {"paraphrase": "<rephrased query>"}.
+Rules:
+- Keep the same role, certifications, geography, and constraints.
+- Vary the wording (synonyms, reordered clauses, different sentence shape).
+- Do NOT add or remove requirements.
+- Do NOT explain — just emit the JSON.`
+	body, _ := json.Marshal(map[string]any{
+		"model":  model,
+		"stream": false,
+		"format": "json",
+		"messages": []map[string]string{
+			{"role": "system", "content": system},
+			{"role": "user", "content": query},
+		},
+		"options": map[string]any{"temperature": 0.5},
+	})
+	req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(body))
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := hc.Do(req)
+	if err != nil {
+		return "", err
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode/100 != 2 {
+		return "", fmt.Errorf("ollama chat: HTTP %d", resp.StatusCode)
+	}
+	rb, _ := io.ReadAll(resp.Body)
+	var ollamaResp struct {
+		Message struct {
+			Content string `json:"content"`
+		} `json:"message"`
+	}
+	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
+		return "", err
+	}
+	var out struct {
+		Paraphrase string `json:"paraphrase"`
+	}
+	if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &out); err != nil {
+		return "", fmt.Errorf("decode paraphrase: %w (content=%q)", err, ollamaResp.Message.Content)
+	}
+	if strings.TrimSpace(out.Paraphrase) == "" {
+		return "", fmt.Errorf("empty paraphrase (content=%q)", ollamaResp.Message.Content)
+	}
+	return out.Paraphrase, nil
+}
+
 // ── helpers ──────────────────────────────────────────────────────

 func loadContracts(dir string) ([]Contract, error) {
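The phase 4b loop above tallies three numbers inline: queries run, recorded answer at top-1, and recorded answer anywhere in top-K. As a sketch of how that counting could be isolated for unit testing, the type `tally` and its methods below are hypothetical names that mirror the loop's rules, including the rule that a query with no recorded answer still counts as run.

```go
package main

import "fmt"

// tally mirrors the counters kept by the paraphrase handover loop.
type tally struct {
	Run, Top1, TopK int
}

// observe records one handover query. hasRecording is false when no
// answer was recorded for the role; the query still counts as run,
// matching the order of operations in the loop above.
func (t *tally) observe(topK []string, recordedID string, hasRecording bool) {
	t.Run++
	if !hasRecording {
		return
	}
	for i, id := range topK {
		if id == recordedID {
			t.TopK++ // a top-1 hit is also a top-K hit
			if i == 0 {
				t.Top1++
			}
			return
		}
	}
}

// hitRate is top-1 hits over queries run, or 0 when nothing ran.
func (t tally) hitRate() float64 {
	if t.Run == 0 {
		return 0
	}
	return float64(t.Top1) / float64(t.Run)
}

func main() {
	var t tally
	t.observe([]string{"id-1", "id-2", "id-3"}, "id-1", true) // top-1 hit
	t.observe([]string{"id-1", "id-2", "id-3"}, "id-3", true) // top-K hit only
	t.observe([]string{"id-1", "id-2", "id-3"}, "id-9", true) // miss
	fmt.Printf("run=%d top1=%d topK=%d rate=%.2f\n", t.Run, t.Top1, t.TopK, t.hitRate())
	// prints: run=3 top1=1 topK=2 rate=0.33
}
```

One consequence worth noting: because unrecorded roles still increment the run counter, a missing recording lowers the hit rate rather than being excluded from it.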
--- a/tests/reality/contracts/contract_beta.json
+++ b/tests/reality/contracts/contract_beta.json
@@ -4,9 +4,9 @@
  "location": "Indianapolis, IN metro",
  "shift": "swing",
  "demand": [
-    {"role": "production worker", "count": 150, "skills": ["assembly", "machine operation"], "certs": ["OSHA-10"]},
-    {"role": "quality inspector", "count": 4, "skills": ["measurement", "documentation"], "certs": ["six-sigma yellow belt"]},
-    {"role": "forklift operator", "count": 3, "skills": ["pallet jack", "inventory", "cold storage"], "certs": ["OSHA-30", "forklift cert"]},
+    {"role": "warehouse worker", "count": 150, "skills": ["assembly", "machine operation"], "certs": ["OSHA-10"]},
+    {"role": "admin assistant", "count": 4, "skills": ["scheduling", "documentation", "spanish"], "certs": []},
+    {"role": "heavy equipment operator", "count": 3, "skills": ["forklift", "pallet jack", "cold storage"], "certs": ["OSHA-30", "forklift cert"]},
    {"role": "bilingual safety coordinator", "count": 1, "skills": ["spanish", "english", "training"], "certs": ["OSHA trainer"], "in_roster": false}
  ]
 }
--- a/tests/reality/contracts/contract_gamma.json
+++ b/tests/reality/contracts/contract_gamma.json
@@ -4,9 +4,9 @@
  "location": "Chicago, IL metro",
  "shift": "early-day",
  "demand": [
-    {"role": "general laborer", "count": 80, "skills": ["framing", "concrete", "rigging"], "certs": ["OSHA-10"]},
-    {"role": "site superintendent", "count": 1, "skills": ["scheduling", "leadership", "blueprint reading"], "certs": ["OSHA-30", "first-aid"]},
-    {"role": "crane operator", "count": 2, "skills": ["mobile crane", "rigging signals"], "certs": ["NCCCO crane cert"]},
+    {"role": "warehouse worker", "count": 80, "skills": ["framing", "rigging", "concrete"], "certs": ["OSHA-10"]},
+    {"role": "admin assistant", "count": 1, "skills": ["scheduling", "blueprint reading"], "certs": []},
+    {"role": "heavy equipment operator", "count": 2, "skills": ["mobile crane", "rigging signals", "bobcat"], "certs": ["NCCCO crane cert"]},
    {"role": "drone surveyor", "count": 1, "skills": ["UAV piloting", "GIS", "site mapping"], "certs": ["FAA Part 107"], "in_roster": false}
  ]
 }
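With the shared role names in place, the pair counts reported for run #002 (n=9 and n=18) can be checked by simple counting. The sketch below assumes one query per role per contract and takes the alpha role list from the commit message, since contract_alpha.json is not part of this diff. The function names are hypothetical, and the driver may enumerate pairs differently.

```go
package main

import "fmt"

// choose2 returns the number of unordered pairs among n items.
func choose2(n int) int { return n * (n - 1) / 2 }

// pairCounts returns (same-role-across-contracts pairs,
// different-roles-within-contract pairs) for a contract → roles map.
// Role names are assumed unique within a contract.
func pairCounts(contracts map[string][]string) (sameRole, diffRole int) {
	perRole := map[string]int{} // how many contracts list each role
	for _, roles := range contracts {
		diffRole += choose2(len(roles))
		for _, r := range roles {
			perRole[r]++
		}
	}
	for _, n := range perRole {
		sameRole += choose2(n)
	}
	return sameRole, diffRole
}

func main() {
	shared := []string{"warehouse worker", "admin assistant", "heavy equipment operator"}
	contracts := map[string][]string{
		// alpha's specialist is taken from the commit message (assumption).
		"alpha": append(append([]string{}, shared...), "industrial electrician"),
		"beta":  append(append([]string{}, shared...), "bilingual safety coordinator"),
		"gamma": append(append([]string{}, shared...), "drone surveyor"),
	}
	same, diff := pairCounts(contracts)
	fmt.Printf("same-role pairs: %d, different-role pairs: %d\n", same, diff)
	// prints: same-role pairs: 9, different-role pairs: 18
}
```

The three shared roles each appear in three contracts (3 × 3 pairs = 9), and each contract's four roles give 6 pairs (3 × 6 = 18). The specialists appear once each and contribute no cross-contract pairs, which is why Phase 1, with no shared names, reported n=0.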