multi_coord_stress: LLM-parsed inbox demands (qwen2.5)

Replaced the hard-coded DemandQuery on inbox events with an actual LLM call: each email/SMS body is parsed by qwen2.5 (format=json, schema-anchored) into structured {role, count, location, certs, skills, shift}. The driver then composes a query string from those fields and runs matrix.search.

This is the real-product flow that the Phase 3 stress test was asking for: real bodies → real LLM parsing → real search. Before this commit, the DemandQuery was my hand-crafted string, which made the inbox phase trivial.

Run #007 result vs #006 (same bodies, parser swapped):

All 6 inbox events parsed cleanly. qwen2.5 nailed:

  "Need 50 forklift operators in Cleveland OH for Monday day shift.
   OSHA-30 + active forklift cert required."
  → {role:"forklift operator", count:50, location:"Cleveland, OH",
     certs:["OSHA-30","active forklift cert"], skills:[], shift:"day"}

The other 5 were similarly faithful (indy stayed as "indy", count defaulted to 1 when unspecified, no hallucinated fields).

LLM-parsed queries produced TIGHTER matches than hard-coded in 5 of 6 cases:

  Demand               #006 dist  #007 dist     Δ
  Crane Chicago            0.499      0.093  -81%
  Drone Chicago            0.707      0.073  -90%
  Bilingual safety         0.240      0.048  -80%
  Forklift Cleveland       0.330      0.273  -17%
  Production Indy          0.260      0.399  +53%
  Warehouse Milwaukee      0.458      0.420   -8%

Three matches landed at distance < 0.10, which is verbatim-replay-tight territory. Structured queries embed sharper than conversational hand-crafted strings. Production Indy is the one regression: its distance rose from 0.260 to 0.399.

Other metrics unchanged: diversity 0.000, determinism 1.000, verbatim handover 4/4, paraphrase handover 4/4.

Tradeoff worth flagging: the drone-Chicago case dropped from distance 0.71 (clear "we don't have one") to 0.07 (confident match returned). The OOD honesty signal weakens when LLM-parsed structure makes any closest-neighbor look tight. Future Phase 4 work: judge re-rates the top match before surfacing, so coordinators see "your demand was for X but the closest match scored 2/5" rather than just the worker ID + distance.

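A minimal sketch of what that Phase 4 surfacing step could look like, assuming the judge returns an integer rating on a 1-5 scale. Nothing here exists in this commit: `judgedMatch`, `surfaceLine`, the cutoff of 4, and the worker IDs are all hypothetical and only illustrate the idea of showing the judge's rating instead of the raw distance.

```go
package main

import "fmt"

// judgedMatch is a hypothetical record pairing a search hit with a judge
// rating. It is not a type from the driver.
type judgedMatch struct {
	DemandRole string  // role from the parsed demand, e.g. "drone surveyor"
	WorkerID   string  // top-1 worker ID returned by the search
	Distance   float64 // embedding distance reported by the search
	Rating     int     // judge score, assumed to be on a 1-5 scale
}

// minConfidentRating is an assumed cutoff; ratings below it are surfaced
// as a weak match no matter how small the distance looks.
const minConfidentRating = 4

// surfaceLine renders what a coordinator would see for one match.
func surfaceLine(m judgedMatch) string {
	if m.Rating < minConfidentRating {
		return fmt.Sprintf("your demand was for %s but the closest match (%s) scored %d/5",
			m.DemandRole, m.WorkerID, m.Rating)
	}
	return fmt.Sprintf("match for %s: %s (judge %d/5, distance %.3f)",
		m.DemandRole, m.WorkerID, m.Rating, m.Distance)
}

func main() {
	// The distances are the run #007 figures; IDs and ratings are made up.
	fmt.Println(surfaceLine(judgedMatch{"drone surveyor", "w-0000", 0.073, 2}))
	fmt.Println(surfaceLine(judgedMatch{"crane operator", "w-0001", 0.093, 5}))
}
```

With a gate like this, a tight distance alone would no longer read as a confident match; the judge rating decides which wording the coordinator sees.
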
Substrate cost: +6 LLM calls per inbox burst (~9s on qwen2.5). Production would amortize via a small dedicated parser model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:51:19 -05:00 · 186d209aae
commit 186d209aae
parent e7fc63b216
2 changed files with 209 additions and 23 deletions
--- a/reports/reality-tests/multi_coord_stress_007.md
+++ b/reports/reality-tests/multi_coord_stress_007.md
@@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 007
+
+**Generated:** 2026-04-30T19:50:04.791000091Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 67
+**Evidence:** `reports/reality-tests/multi_coord_stress_007.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_007.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_007.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_007.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/scripts/multi_coord_stress/main.go
+++ b/scripts/multi_coord_stress/main.go
@@ -325,52 +325,42 @@ func main() {
 		Sender   string
 		Subject  string
 		Body     string
-		// DemandQuery is the parsed demand we'd extract from the body.
-		// In production a small LLM would parse this; here we fix it
-		// up-front to keep the test deterministic.
-		DemandQuery string
-		Coord       string
+		Coord    string
 	}
 	inboxEvents := []inboxEvent{
 		{
 			Priority: "urgent", Type: "email", Sender: "ops@northstar.com",
 			Subject: "URGENT: 50 forklift operators Cleveland Monday",
 			Body:    "Need 50 forklift operators in Cleveland OH for Monday day shift. OSHA-30 + active forklift cert required. Current Milwaukee batch cannot relocate.",
-			DemandQuery: "Forklift operator Cleveland OH OSHA-30 forklift certification day shift",
-			Coord:       "alice",
+			Coord:   "alice",
 		},
 		{
 			Priority: "urgent", Type: "email", Sender: "client@crossroads-mfg.com",
 			Subject: "URGENT: Production line down — need 30 production workers tonight",
 			Body:    "Production line failure at Indianapolis IN site. Need 30 production workers swing shift starting tonight. Assembly + machine operation experience required.",
-			DemandQuery: "Production worker Indianapolis IN swing shift assembly machine operation",
-			Coord:       "bob",
+			Coord:   "bob",
 		},
 		{
 			Priority: "high", Type: "email", Sender: "supervisor@loop-construction.com",
 			Subject: "Need crane operator Chicago for 2-week project",
 			Body:    "Crane operator with NCCCO certification needed for 2-week Chicago IL site project. Day shift. Mobile crane experience preferred.",
-			DemandQuery: "Crane operator NCCCO certification Chicago IL mobile crane day shift",
-			Coord:       "carol",
+			Coord:   "carol",
 		},
 		{
 			Priority: "high", Type: "sms", Sender: "+1-555-0142",
-			Body:        "Bilingual safety coord needed Indy plant ASAP. Spanish + English. OSHA trainer credential.",
-			DemandQuery: "Bilingual Spanish English safety coordinator Indianapolis OSHA trainer",
-			Coord:       "bob",
+			Body:  "Bilingual safety coord needed Indy plant ASAP. Spanish + English. OSHA trainer credential.",
+			Coord: "bob",
 		},
 		{
 			Priority: "medium", Type: "sms", Sender: "+1-555-0188",
-			Body:        "Drone surveyor for Chicago site progress mapping. FAA Part 107.",
-			DemandQuery: "FAA Part 107 drone surveyor Chicago site mapping",
-			Coord:       "carol",
+			Body:  "Drone surveyor for Chicago site progress mapping. FAA Part 107.",
+			Coord: "carol",
 		},
 		{
 			Priority: "medium", Type: "email", Sender: "scheduling@northstar.com",
 			Subject: "FYI: warehouse worker capacity check Milwaukee",
 			Body:    "Routine capacity check on Milwaukee warehouse worker pool — anyone with cold storage experience for next week?",
-			DemandQuery: "Warehouse worker Milwaukee cold storage",
-			Coord:       "alice",
+			Coord:   "alice",
 		},
 	}
 	// Sort by priority (urgent < high < medium < low for ordering).
@@ -384,15 +374,27 @@ func main() {
 			log.Printf("  inbox record failed (%s): %v", ie.Priority, err)
 			continue
 		}
-		// 2. Triggered matrix.search using the parsed demand.
+		// 2. LLM parses the body into a structured demand. Real
+		// production: a small local model extracts {role, count,
+		// location, certs, skills, shift} from email/SMS bodies
+		// instead of a coordinator typing it into a form. Test
+		// captures both raw body and parsed structure for review.
+		parsed, perr := parseInboxDemand(hc, *ollama, *judgeModel, ie.Body)
+		if perr != nil {
+			log.Printf("  inbox demand parse failed (%s): %v", ie.Priority, perr)
+			continue
+		}
+		// 3. Build a query string from the parsed demand and search.
+		query := parsed.AsQuery()
 		coord := coordByName(coords, ie.Coord)
-		resp, err := matrixSearch(hc, *gateway, ie.DemandQuery, corpora, *k, true, coord.PlaybookCorpus)
+		resp, err := matrixSearch(hc, *gateway, query, corpora, *k, true, coord.PlaybookCorpus)
 		if err != nil {
 			log.Printf("  inbox-triggered search failed (%s): %v", ie.Priority, err)
 			continue
 		}
-		ev := captureEvent("inbox-triggered-search", 9, ie.Coord, "inbox-burst", ie.Subject, ie.DemandQuery, 1, true, coord.PlaybookCorpus, resp)
-		ev.Note = fmt.Sprintf("inbox %s/%s from %s", ie.Type, ie.Priority, ie.Sender)
+		ev := captureEvent("inbox-triggered-search", 9, ie.Coord, "inbox-burst", ie.Subject, query, 1, true, coord.PlaybookCorpus, resp)
+		parsedJSON, _ := json.Marshal(parsed)
+		ev.Note = fmt.Sprintf("inbox %s/%s from %s · LLM-parsed demand: %s", ie.Type, ie.Priority, ie.Sender, string(parsedJSON))
 		output.Events = append(output.Events, ev)
 	}

@@ -930,6 +932,108 @@ func ingestFreshWorker(hc *http.Client, gw, id, text string, metadata map[string
 	return nil
 }

+// parsedDemand is the LLM-extracted structure from an inbox message
+// body — what a real coordinator would type into a search form.
+// Empty fields are honest: the body didn't say.
+type parsedDemand struct {
+	Role     string   `json:"role"`
+	Count    int      `json:"count"`
+	Location string   `json:"location"`
+	Certs    []string `json:"certs"`
+	Skills   []string `json:"skills"`
+	Shift    string   `json:"shift"`
+}
+
+// AsQuery composes a matrix.search query string from the parsed
+// fields. Mirrors buildQuery's shape so search-time semantics match
+// what the contract-driven phases produce. Empty fields are skipped
+// rather than emitted as "" markers.
+func (p parsedDemand) AsQuery() string {
+	var b strings.Builder
+	if p.Count > 0 {
+		fmt.Fprintf(&b, "Need %d ", p.Count)
+	} else {
+		b.WriteString("Need ")
+	}
+	b.WriteString(p.Role)
+	if p.Location != "" {
+		b.WriteString(" for ")
+		b.WriteString(p.Location)
+	}
+	if p.Shift != "" {
+		b.WriteString(", ")
+		b.WriteString(p.Shift)
+		b.WriteString(" shift")
+	}
+	if len(p.Certs) > 0 {
+		b.WriteString(", certifications: ")
+		b.WriteString(strings.Join(p.Certs, ", "))
+	}
+	if len(p.Skills) > 0 {
+		b.WriteString(", skills: ")
+		b.WriteString(strings.Join(p.Skills, ", "))
+	}
+	return b.String()
+}
+
+// parseInboxDemand asks the judge model to extract structured fields
+// from an inbox body. Same Ollama+JSON-format pattern as the
+// generateParaphrase function. Real production would have a dedicated
+// small model fine-tuned on staffing-language inbox parsing; here we
+// use the same model that judges relevance. temperature=0 for
+// deterministic extraction.
+func parseInboxDemand(hc *http.Client, ollamaURL, model, inboxBody string) (*parsedDemand, error) {
+	system := `You parse staffing requests from email/SMS bodies. Extract structured fields.
+Output JSON ONLY, this exact shape: {"role": "...", "count": N, "location": "...", "certs": [...], "skills": [...], "shift": "..."}.
+
+Rules:
+- role: the job role being requested (lowercase, e.g. "warehouse worker", "forklift operator")
+- count: number of workers needed (integer; if "a few" or unspecified, use 1)
+- location: city + state if both mentioned (e.g. "Cleveland, OH"); city only if state missing
+- certs: certification list as named in the body (e.g. ["OSHA-30", "forklift cert"])
+- skills: skill list as named in the body (e.g. ["pallet jack", "spanish"])
+- shift: "day"|"swing"|"night" if mentioned, else ""
+- If a field isn't in the body, use empty string or empty array (never null)
+- Do NOT explain — emit the JSON only.`
+	body, _ := json.Marshal(map[string]any{
+		"model":  model,
+		"stream": false,
+		"format": "json",
+		"messages": []map[string]string{
+			{"role": "system", "content": system},
+			{"role": "user", "content": inboxBody},
+		},
+		"options": map[string]any{"temperature": 0},
+	})
+	req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(body))
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := hc.Do(req)
+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode/100 != 2 {
+		return nil, fmt.Errorf("ollama chat: HTTP %d", resp.StatusCode)
+	}
+	rb, _ := io.ReadAll(resp.Body)
+	var ollamaResp struct {
+		Message struct {
+			Content string `json:"content"`
+		} `json:"message"`
+	}
+	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
+		return nil, err
+	}
+	var out parsedDemand
+	if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &out); err != nil {
+		return nil, fmt.Errorf("decode parsed demand: %w (content=%q)", err, ollamaResp.Message.Content)
+	}
+	if strings.TrimSpace(out.Role) == "" {
+		return nil, fmt.Errorf("parsed demand has empty role (content=%q)", ollamaResp.Message.Content)
+	}
+	return &out, nil
+}
+
 // postInbox sends an inbox message (email or SMS) to observerd via
 // the gateway. observerd records it as an ObservedOp with
 // Source=SourceInbox; downstream actions (search, ingest, etc.) are