diff --git a/reports/reality-tests/multi_coord_stress_002.md b/reports/reality-tests/multi_coord_stress_002.md
new file mode 100644
index 0000000..5d67be7
--- /dev/null
+++ b/reports/reality-tests/multi_coord_stress_002.md
@@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 002
+
+**Generated:** 2026-04-30T13:02:13.570393819Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 56
+**Evidence:** `reports/reality-tests/multi_coord_stress_002.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0.11900691900691901 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_002.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_002.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_002.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
diff --git a/scripts/multi_coord_stress.sh b/scripts/multi_coord_stress.sh
index a7c48b0..68f1bf9 100755
--- a/scripts/multi_coord_stress.sh
+++ b/scripts/multi_coord_stress.sh
@@ -155,12 +155,19 @@ echo "[stress] ingest ethereal_workers (limit=$ETHEREAL_LIMIT, 0=all) into 'ethe
 
 echo
 echo "[stress] running multi-coord stress driver..."
+EXTRA_FLAGS=""
+if [ "${WITH_PARAPHRASE_HANDOVER:-1}" = "1" ]; then
+    EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase-handover"
+fi
 ./bin/multi_coord_stress \
     -gateway "http://127.0.0.1:3110" \
     -contracts tests/reality/contracts \
     -corpora "$CORPORA" \
     -k "$K" \
-    -out "$OUT_JSON"
+    -out "$OUT_JSON" \
+    -ollama "http://localhost:11434" \
+    -judge "${JUDGE_MODEL:-qwen2.5:latest}" \
+    $EXTRA_FLAGS
 
 echo
 echo "[stress] generating markdown report → $OUT_MD"
@@ -179,6 +186,10 @@ hand_run=$(jq -r '.learning.handover_queries_run' "$OUT_JSON")
 hand_top1=$(jq -r '.learning.recorded_answers_top1_count' "$OUT_JSON")
 hand_topk=$(jq -r '.learning.recorded_answers_topk_count' "$OUT_JSON")
 hand_rate=$(jq -r '.learning.handover_hit_rate' "$OUT_JSON")
+ph_run=$(jq -r '.learning.paraphrase_handover_run // 0' "$OUT_JSON")
+ph_top1=$(jq -r '.learning.paraphrase_top1_count // 0' "$OUT_JSON")
+ph_topk=$(jq -r '.learning.paraphrase_topk_count // 0' "$OUT_JSON")
+ph_rate=$(jq -r '.learning.paraphrase_handover_hit_rate // 0' "$OUT_JSON")
 
 cat > "$OUT_MD" < 0 && ev.TopK[0].ID == recordedID {
+				pTop1++
+				pTopK++
+			} else {
+				for _, r := range ev.TopK {
+					if r.ID == recordedID {
+						pTopK++
+						break
+					}
+				}
+			}
+		}
+		output.Learning.ParaphraseHandoverRun = pHandoverRun
+		output.Learning.ParaphraseTop1Count = pTop1
+		output.Learning.ParaphraseTopKCount = pTopK
+		if pHandoverRun > 0 {
+			output.Learning.ParaphraseHandoverHitRate = float64(pTop1) /
 float64(pHandoverRun)
+		}
+	}
+
 	// ── Phase 5: split — surge re-distributed across 3 coords ──
 	log.Printf("[stress] phase 5: split (alpha surge spread across all 3 coords)")
 	for i, d := range contracts[0].Demand {
@@ -339,12 +407,72 @@ func main() {
 		output.Diversity.DifferentRolesSameContractMeanJaccard, output.Diversity.NumPairsDifferentRolesSameContract)
 	log.Printf("[stress] determinism: mean Jaccard on reissue = %.3f (n=%d)",
 		output.Determinism.MeanJaccard, output.Determinism.NumReissuedPairs)
-	log.Printf("[stress] learning: handover hit rate (top-1) = %d/%d = %.0f%%",
+	log.Printf("[stress] learning verbatim: handover hit rate (top-1) = %d/%d = %.0f%%",
 		output.Learning.RecordedAnswersTop1Count, output.Learning.HandoverQueriesRun,
 		output.Learning.HandoverHitRate*100)
+	if output.Learning.ParaphraseHandoverRun > 0 {
+		log.Printf("[stress] learning paraphrase: handover hit rate (top-1) = %d/%d = %.0f%% (top-K = %d/%d)",
+			output.Learning.ParaphraseTop1Count, output.Learning.ParaphraseHandoverRun,
+			output.Learning.ParaphraseHandoverHitRate*100,
+			output.Learning.ParaphraseTopKCount, output.Learning.ParaphraseHandoverRun)
+	}
 	log.Printf("[stress] results → %s", *out)
 }
 
+// generateParaphrase asks the judge model to rephrase a staffing query
+// while preserving intent — same prompt template as
+// scripts/playbook_lift/main.go, kept here as a copy to avoid a shared
+// internal package for two scripts. If callers ever need a third
+// paraphraser, lift this into internal/paraphrase/.
+func generateParaphrase(hc *http.Client, ollamaURL, model, query string) (string, error) {
+	system := `You rephrase staffing queries while preserving intent.
+Output JSON only: {"paraphrase": ""}.
+Rules:
+- Keep the same role, certifications, geography, and constraints.
+- Vary the wording (synonyms, reordered clauses, different sentence shape).
+- Do NOT add or remove requirements.
+- Do NOT explain — just emit the JSON.`
+	body, _ := json.Marshal(map[string]any{
+		"model": model,
+		"stream": false,
+		"format": "json",
+		"messages": []map[string]string{
+			{"role": "system", "content": system},
+			{"role": "user", "content": query},
+		},
+		"options": map[string]any{"temperature": 0.5},
+	})
+	req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(body))
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := hc.Do(req)
+	if err != nil {
+		return "", err
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode/100 != 2 {
+		return "", fmt.Errorf("ollama chat: HTTP %d", resp.StatusCode)
+	}
+	rb, _ := io.ReadAll(resp.Body)
+	var ollamaResp struct {
+		Message struct {
+			Content string `json:"content"`
+		} `json:"message"`
+	}
+	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
+		return "", err
+	}
+	var out struct {
+		Paraphrase string `json:"paraphrase"`
+	}
+	if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &out); err != nil {
+		return "", fmt.Errorf("decode paraphrase: %w (content=%q)", err, ollamaResp.Message.Content)
+	}
+	if strings.TrimSpace(out.Paraphrase) == "" {
+		return "", fmt.Errorf("empty paraphrase (content=%q)", ollamaResp.Message.Content)
+	}
+	return out.Paraphrase, nil
+}
+
 // ── helpers ──────────────────────────────────────────────────────
 
 func loadContracts(dir string) ([]Contract, error) {
diff --git a/tests/reality/contracts/contract_beta.json b/tests/reality/contracts/contract_beta.json
index 9f32ef2..2bad8a6 100644
--- a/tests/reality/contracts/contract_beta.json
+++ b/tests/reality/contracts/contract_beta.json
@@ -4,9 +4,9 @@
   "location": "Indianapolis, IN metro",
   "shift": "swing",
   "demand": [
-    {"role": "production worker", "count": 150, "skills": ["assembly", "machine operation"], "certs": ["OSHA-10"]},
-    {"role": "quality inspector", "count": 4, "skills": ["measurement", "documentation"], "certs": ["six-sigma yellow belt"]},
-    {"role": "forklift operator", "count": 3, "skills":
["pallet jack", "inventory", "cold storage"], "certs": ["OSHA-30", "forklift cert"]},
+    {"role": "warehouse worker", "count": 150, "skills": ["assembly", "machine operation"], "certs": ["OSHA-10"]},
+    {"role": "admin assistant", "count": 4, "skills": ["scheduling", "documentation", "spanish"], "certs": []},
+    {"role": "heavy equipment operator", "count": 3, "skills": ["forklift", "pallet jack", "cold storage"], "certs": ["OSHA-30", "forklift cert"]},
     {"role": "bilingual safety coordinator", "count": 1, "skills": ["spanish", "english", "training"], "certs": ["OSHA trainer"], "in_roster": false}
   ]
 }
diff --git a/tests/reality/contracts/contract_gamma.json b/tests/reality/contracts/contract_gamma.json
index 3663bbc..1ac98a7 100644
--- a/tests/reality/contracts/contract_gamma.json
+++ b/tests/reality/contracts/contract_gamma.json
@@ -4,9 +4,9 @@
   "location": "Chicago, IL metro",
   "shift": "early-day",
   "demand": [
-    {"role": "general laborer", "count": 80, "skills": ["framing", "concrete", "rigging"], "certs": ["OSHA-10"]},
-    {"role": "site superintendent", "count": 1, "skills": ["scheduling", "leadership", "blueprint reading"], "certs": ["OSHA-30", "first-aid"]},
-    {"role": "crane operator", "count": 2, "skills": ["mobile crane", "rigging signals"], "certs": ["NCCCO crane cert"]},
+    {"role": "warehouse worker", "count": 80, "skills": ["framing", "rigging", "concrete"], "certs": ["OSHA-10"]},
+    {"role": "admin assistant", "count": 1, "skills": ["scheduling", "blueprint reading"], "certs": []},
+    {"role": "heavy equipment operator", "count": 2, "skills": ["mobile crane", "rigging signals", "bobcat"], "certs": ["NCCCO crane cert"]},
     {"role": "drone surveyor", "count": 1, "skills": ["UAV piloting", "GIS", "site mapping"], "certs": ["FAA Part 107"], "in_roster": false}
   ]
 }
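
Reviewer note on the report added in this diff: the diversity and determinism figures are mean Jaccard similarities over top-K worker ID sets. The sketch below shows that set computation in isolation, assuming IDs are plain strings. The function name `jaccard`, the sample IDs, and the choice to return 0 for two empty lists are illustrative assumptions and are not taken from the `multi_coord_stress` driver.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two lists of worker IDs,
// treating each list as a set (duplicates are ignored).
// Assumption: two empty lists yield 0; the real driver may differ.
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	// Hypothetical top-K ID lists, for illustration only.
	first := []string{"w-1", "w-2", "w-3", "w-4"}
	reissue := []string{"w-1", "w-2", "w-3", "w-4"}
	other := []string{"w-3", "w-4", "w-5", "w-6"}

	fmt.Printf("same query reissued: %.3f\n", jaccard(first, reissue))
	fmt.Printf("same role, other contract: %.3f\n", jaccard(first, other))
}
```

Under this definition an identical reissue scores 1, which matches the determinism row in the report, and lists with little overlap score near 0, which is what the diversity thresholds are checking for.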