diff --git a/reports/reality-tests/multi_coord_stress_007.md b/reports/reality-tests/multi_coord_stress_007.md
new file mode 100644
index 0000000..4004a3b
--- /dev/null
+++ b/reports/reality-tests/multi_coord_stress_007.md
@@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 007
+
+**Generated:** 2026-04-30T19:50:04.791000091Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 67
+**Evidence:** `reports/reality-tests/multi_coord_stress_007.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_007.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_007.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_007.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
diff --git a/scripts/multi_coord_stress/main.go b/scripts/multi_coord_stress/main.go
index 6816b2d..e02cd8f 100644
--- a/scripts/multi_coord_stress/main.go
+++ b/scripts/multi_coord_stress/main.go
@@ -325,52 +325,42 @@ func main() {
 		Sender   string
 		Subject  string
 		Body     string
-		// DemandQuery is the parsed demand we'd extract from the body.
-		// In production a small LLM would parse this; here we fix it
-		// up-front to keep the test deterministic.
-		DemandQuery string
-		Coord       string
+		Coord    string
 	}
 	inboxEvents := []inboxEvent{
 		{
 			Priority: "urgent", Type: "email", Sender: "ops@northstar.com",
 			Subject: "URGENT: 50 forklift operators Cleveland Monday",
 			Body: "Need 50 forklift operators in Cleveland OH for Monday day shift. OSHA-30 + active forklift cert required. Current Milwaukee batch cannot relocate.",
-			DemandQuery: "Forklift operator Cleveland OH OSHA-30 forklift certification day shift",
-			Coord:       "alice",
+			Coord: "alice",
 		},
 		{
 			Priority: "urgent", Type: "email", Sender: "client@crossroads-mfg.com",
 			Subject: "URGENT: Production line down — need 30 production workers tonight",
 			Body: "Production line failure at Indianapolis IN site. Need 30 production workers swing shift starting tonight. Assembly + machine operation experience required.",
-			DemandQuery: "Production worker Indianapolis IN swing shift assembly machine operation",
-			Coord:       "bob",
+			Coord: "bob",
 		},
 		{
 			Priority: "high", Type: "email", Sender: "supervisor@loop-construction.com",
 			Subject: "Need crane operator Chicago for 2-week project",
 			Body: "Crane operator with NCCCO certification needed for 2-week Chicago IL site project. Day shift. Mobile crane experience preferred.",
-			DemandQuery: "Crane operator NCCCO certification Chicago IL mobile crane day shift",
-			Coord:       "carol",
+			Coord: "carol",
 		},
 		{
 			Priority: "high", Type: "sms", Sender: "+1-555-0142",
-			Body:        "Bilingual safety coord needed Indy plant ASAP. Spanish + English. OSHA trainer credential.",
-			DemandQuery: "Bilingual Spanish English safety coordinator Indianapolis OSHA trainer",
-			Coord:       "bob",
+			Body:  "Bilingual safety coord needed Indy plant ASAP. Spanish + English. OSHA trainer credential.",
+			Coord: "bob",
 		},
 		{
 			Priority: "medium", Type: "sms", Sender: "+1-555-0188",
-			Body:        "Drone surveyor for Chicago site progress mapping. FAA Part 107.",
-			DemandQuery: "FAA Part 107 drone surveyor Chicago site mapping",
-			Coord:       "carol",
+			Body:  "Drone surveyor for Chicago site progress mapping. FAA Part 107.",
+			Coord: "carol",
 		},
 		{
 			Priority: "medium", Type: "email", Sender: "scheduling@northstar.com",
 			Subject: "FYI: warehouse worker capacity check Milwaukee",
 			Body: "Routine capacity check on Milwaukee warehouse worker pool — anyone with cold storage experience for next week?",
-			DemandQuery: "Warehouse worker Milwaukee cold storage",
-			Coord:       "alice",
+			Coord: "alice",
 		},
 	}
 	// Sort by priority (urgent < high < medium < low for ordering).
@@ -384,15 +374,27 @@ func main() {
 			log.Printf(" inbox record failed (%s): %v", ie.Priority, err)
 			continue
 		}
-		// 2. Triggered matrix.search using the parsed demand.
+		// 2. LLM parses the body into a structured demand. Real
+		// production: a small local model extracts {role, count,
+		// location, certs, skills, shift} from email/SMS bodies
+		// instead of a coordinator typing it into a form. Test
+		// captures both raw body and parsed structure for review.
+		parsed, perr := parseInboxDemand(hc, *ollama, *judgeModel, ie.Body)
+		if perr != nil {
+			log.Printf(" inbox demand parse failed (%s): %v", ie.Priority, perr)
+			continue
+		}
+		// 3. Build a query string from the parsed demand and search.
+		query := parsed.AsQuery()
 		coord := coordByName(coords, ie.Coord)
-		resp, err := matrixSearch(hc, *gateway, ie.DemandQuery, corpora, *k, true, coord.PlaybookCorpus)
+		resp, err := matrixSearch(hc, *gateway, query, corpora, *k, true, coord.PlaybookCorpus)
 		if err != nil {
 			log.Printf(" inbox-triggered search failed (%s): %v", ie.Priority, err)
 			continue
 		}
-		ev := captureEvent("inbox-triggered-search", 9, ie.Coord, "inbox-burst", ie.Subject, ie.DemandQuery, 1, true, coord.PlaybookCorpus, resp)
-		ev.Note = fmt.Sprintf("inbox %s/%s from %s", ie.Type, ie.Priority, ie.Sender)
+		ev := captureEvent("inbox-triggered-search", 9, ie.Coord, "inbox-burst", ie.Subject, query, 1, true, coord.PlaybookCorpus, resp)
+		parsedJSON, _ := json.Marshal(parsed)
+		ev.Note = fmt.Sprintf("inbox %s/%s from %s · LLM-parsed demand: %s", ie.Type, ie.Priority, ie.Sender, string(parsedJSON))
 		output.Events = append(output.Events, ev)
 	}
 
@@ -930,6 +932,108 @@ func ingestFreshWorker(hc *http.Client, gw, id, text string, metadata map[string
 	return nil
 }
 
+// parsedDemand is the LLM-extracted structure from an inbox message
+// body — what a real coordinator would type into a search form.
+// Empty fields are honest: the body didn't say.
+type parsedDemand struct {
+	Role     string   `json:"role"`
+	Count    int      `json:"count"`
+	Location string   `json:"location"`
+	Certs    []string `json:"certs"`
+	Skills   []string `json:"skills"`
+	Shift    string   `json:"shift"`
+}
+
+// AsQuery composes a matrix.search query string from the parsed
+// fields. Mirrors buildQuery's shape so search-time semantics match
+// what the contract-driven phases produce. Empty fields are skipped
+// rather than emitted as "" markers.
+func (p parsedDemand) AsQuery() string {
+	var b strings.Builder
+	if p.Count > 0 {
+		fmt.Fprintf(&b, "Need %d ", p.Count)
+	} else {
+		b.WriteString("Need ")
+	}
+	b.WriteString(p.Role)
+	if p.Location != "" {
+		b.WriteString(" for ")
+		b.WriteString(p.Location)
+	}
+	if p.Shift != "" {
+		b.WriteString(", ")
+		b.WriteString(p.Shift)
+		b.WriteString(" shift")
+	}
+	if len(p.Certs) > 0 {
+		b.WriteString(", certifications: ")
+		b.WriteString(strings.Join(p.Certs, ", "))
+	}
+	if len(p.Skills) > 0 {
+		b.WriteString(", skills: ")
+		b.WriteString(strings.Join(p.Skills, ", "))
+	}
+	return b.String()
+}
+
+// parseInboxDemand asks the judge model to extract structured fields
+// from an inbox body. Same Ollama+JSON-format pattern as the
+// generateParaphrase function. Real production would have a dedicated
+// small model fine-tuned on staffing-language inbox parsing; here we
+// use the same model that judges relevance. temperature=0 for
+// deterministic extraction.
+func parseInboxDemand(hc *http.Client, ollamaURL, model, inboxBody string) (*parsedDemand, error) {
+	system := `You parse staffing requests from email/SMS bodies. Extract structured fields.
+Output JSON ONLY, this exact shape: {"role": "...", "count": N, "location": "...", "certs": [...], "skills": [...], "shift": "..."}.
+
+Rules:
+- role: the job role being requested (lowercase, e.g. "warehouse worker", "forklift operator")
+- count: number of workers needed (integer; if "a few" or unspecified, use 1)
+- location: city + state if both mentioned (e.g. "Cleveland, OH"); city only if state missing
+- certs: certification list as named in the body (e.g. ["OSHA-30", "forklift cert"])
+- skills: skill list as named in the body (e.g. ["pallet jack", "spanish"])
+- shift: "day"|"swing"|"night" if mentioned, else ""
+- If a field isn't in the body, use empty string or empty array (never null)
+- Do NOT explain — emit the JSON only.`
+	body, _ := json.Marshal(map[string]any{
+		"model":  model,
+		"stream": false,
+		"format": "json",
+		"messages": []map[string]string{
+			{"role": "system", "content": system},
+			{"role": "user", "content": inboxBody},
+		},
+		"options": map[string]any{"temperature": 0},
+	})
+	req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(body))
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := hc.Do(req)
+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode/100 != 2 {
+		return nil, fmt.Errorf("ollama chat: HTTP %d", resp.StatusCode)
+	}
+	rb, _ := io.ReadAll(resp.Body)
+	var ollamaResp struct {
+		Message struct {
+			Content string `json:"content"`
+		} `json:"message"`
+	}
+	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
+		return nil, err
+	}
+	var out parsedDemand
+	if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &out); err != nil {
+		return nil, fmt.Errorf("decode parsed demand: %w (content=%q)", err, ollamaResp.Message.Content)
+	}
+	if strings.TrimSpace(out.Role) == "" {
+		return nil, fmt.Errorf("parsed demand has empty role (content=%q)", ollamaResp.Message.Content)
+	}
+	return &out, nil
+}
+
 // postInbox sends an inbox message (email or SMS) to observerd via
 // the gateway. observerd records it as an ObservedOp with
 // Source=SourceInbox; downstream actions (search, ingest, etc.) are