multi_coord_stress: LLM-parsed inbox demands (qwen2.5)

Replaced the hard-coded DemandQuery on inbox events with an actual
LLM call: each email/SMS body is parsed by qwen2.5 (format=json,
schema-anchored) into structured {role, count, location, certs,
skills, shift}. The driver then composes a query string from those
fields and runs matrix.search.

This is the real-product flow that the Phase 3 stress test was
asking for: real bodies → real LLM parsing → real search. Before
this commit, the DemandQuery was my hand-crafted string, which
made the inbox phase trivial.

Run #007 result vs #006 (same bodies, parser swapped):

  All 6 inbox events parsed cleanly — qwen2.5 nailed:
    "Need 50 forklift operators in Cleveland OH for Monday day
     shift. OSHA-30 + active forklift cert required."
    → {role:"forklift operator", count:50, location:"Cleveland, OH",
       certs:["OSHA-30","active forklift cert"], skills:[], shift:"day"}
    Other 5 similarly faithful (indy stayed as "indy", count
    defaulted to 1 when unspecified, no hallucinated fields).
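
  For concreteness, the composition step is plain deterministic
  string-building over the parsed fields. This standalone sketch
  mirrors parsedDemand.AsQuery from the diff below (field names match
  the schema above):

```go
package main

import (
	"fmt"
	"strings"
)

// parsedDemand mirrors the LLM-extracted schema; empty fields mean
// the body didn't say.
type parsedDemand struct {
	Role     string
	Count    int
	Location string
	Certs    []string
	Skills   []string
	Shift    string
}

// asQuery composes the matrix.search string, skipping empty fields.
func (p parsedDemand) asQuery() string {
	var b strings.Builder
	if p.Count > 0 {
		fmt.Fprintf(&b, "Need %d ", p.Count)
	} else {
		b.WriteString("Need ")
	}
	b.WriteString(p.Role)
	if p.Location != "" {
		b.WriteString(" for " + p.Location)
	}
	if p.Shift != "" {
		b.WriteString(", " + p.Shift + " shift")
	}
	if len(p.Certs) > 0 {
		b.WriteString(", certifications: " + strings.Join(p.Certs, ", "))
	}
	if len(p.Skills) > 0 {
		b.WriteString(", skills: " + strings.Join(p.Skills, ", "))
	}
	return b.String()
}

func main() {
	forklift := parsedDemand{
		Role: "forklift operator", Count: 50, Location: "Cleveland, OH",
		Certs: []string{"OSHA-30", "active forklift cert"}, Shift: "day",
	}
	fmt.Println(forklift.asQuery())
	// Need 50 forklift operator for Cleveland, OH, day shift, certifications: OSHA-30, active forklift cert
}
```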

  LLM-parsed queries produced TIGHTER matches than hard-coded:
    Demand              #006 dist  #007 dist  Δ
    Crane Chicago       0.499      0.093      -82%
    Drone Chicago       0.707      0.073      -90%
    Bilingual safety    0.240      0.048      -80%
    Forklift Cleveland  0.330      0.273      -17%
    Production Indy     0.260      0.399      +53%
    Warehouse Milwaukee 0.458      0.420       -8%

  Three matches landed at distance < 0.10 — verbatim-replay-tight
  territory. Structured queries embed sharper than conversational
  hand-crafted strings.

  Other metrics unchanged: diversity 0.000, determinism 1.000,
  verbatim handover 4/4, paraphrase handover 4/4.

Tradeoff worth flagging: the drone-Chicago case dropped from
distance 0.71 (a clear "we don't have one") to 0.07 (a confident
match returned). The OOD honesty signal weakens when LLM-parsed
structure makes any closest neighbor look tight. Future Phase 4
work: have the judge re-rate the top match before surfacing it, so
coordinators see "your demand was for X but the closest match scored
2/5" rather than just the worker ID + distance.
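
A minimal sketch of that Phase 4 gate (hypothetical: surfaceMatch,
worker_17, and the 1-5 judgeScore are invented here; the score would
come from a judge-model re-rate of the top match, which is not
implemented yet):

```go
package main

import "fmt"

// surfaceMatch is a hypothetical Phase 4 gate: a tight vector
// distance alone is not surfaced as a confident match unless the
// judge re-rate agrees. judgeScore is the judge model's 1-5 rating
// of the top match against the parsed demand.
func surfaceMatch(workerID string, dist float64, judgeScore int) string {
	if judgeScore <= 2 {
		return fmt.Sprintf("closest match %s scored %d/5 against your demand (dist %.3f); treat as no-fit", workerID, judgeScore, dist)
	}
	return fmt.Sprintf("match %s (dist %.3f, judge %d/5)", workerID, dist, judgeScore)
}

func main() {
	// The drone-surveyor shape: distance looks tight (0.073) but the
	// pool has no FAA Part 107 holder, so a judge re-rate would be low.
	fmt.Println(surfaceMatch("worker_17", 0.073, 2))
}
```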

Substrate cost: +6 LLM calls per inbox burst (~9s on qwen2.5).
Production would amortize via a small dedicated parser model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 186d209aae
parent e7fc63b216
root 2026-04-30 14:51:19 -05:00
2 changed files with 209 additions and 23 deletions

View File

@@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 007
**Generated:** 2026-04-30T19:50:04.791000091Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_007.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.026 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
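Both tables in this report reduce to the same primitive: Jaccard similarity over top-K worker-ID sets. A self-contained sketch of the metric (the harness's own implementation may differ in detail):

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two top-K worker-ID lists.
// The diversity rows above are means of this value across event
// pairs; the determinism section applies it to reissue pairs.
func jaccard(a, b []string) float64 {
	inA := make(map[string]bool, len(a))
	for _, id := range a {
		inA[id] = true
	}
	inter, union := 0, len(inA)
	seen := make(map[string]bool, len(b))
	for _, id := range b {
		if seen[id] {
			continue
		}
		seen[id] = true
		if inA[id] {
			inter++
		} else {
			union++
		}
	}
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	// Identical reissue top-K: fully deterministic retrieval.
	fmt.Println(jaccard([]string{"w1", "w2", "w3"}, []string{"w1", "w2", "w3"})) // 1
	// Disjoint pools for different roles in the same contract.
	fmt.Println(jaccard([]string{"w1", "w2"}, []string{"w7", "w8"})) // 0
}
```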
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: some HNSW or embed variance; acceptable.
- < 0.80: retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Example filters by phase, coordinator, and role:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_007.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_007.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_007.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

View File

@@ -325,52 +325,42 @@ func main() {
 Sender string
 Subject string
 Body string
-// DemandQuery is the parsed demand we'd extract from the body.
-// In production a small LLM would parse this; here we fix it
-// up-front to keep the test deterministic.
-DemandQuery string
-Coord string
+Coord string
 }
 inboxEvents := []inboxEvent{
 {
 Priority: "urgent", Type: "email", Sender: "ops@northstar.com",
 Subject: "URGENT: 50 forklift operators Cleveland Monday",
 Body: "Need 50 forklift operators in Cleveland OH for Monday day shift. OSHA-30 + active forklift cert required. Current Milwaukee batch cannot relocate.",
-DemandQuery: "Forklift operator Cleveland OH OSHA-30 forklift certification day shift",
-Coord: "alice",
+Coord: "alice",
 },
 {
 Priority: "urgent", Type: "email", Sender: "client@crossroads-mfg.com",
 Subject: "URGENT: Production line down — need 30 production workers tonight",
 Body: "Production line failure at Indianapolis IN site. Need 30 production workers swing shift starting tonight. Assembly + machine operation experience required.",
-DemandQuery: "Production worker Indianapolis IN swing shift assembly machine operation",
-Coord: "bob",
+Coord: "bob",
 },
 {
 Priority: "high", Type: "email", Sender: "supervisor@loop-construction.com",
 Subject: "Need crane operator Chicago for 2-week project",
 Body: "Crane operator with NCCCO certification needed for 2-week Chicago IL site project. Day shift. Mobile crane experience preferred.",
-DemandQuery: "Crane operator NCCCO certification Chicago IL mobile crane day shift",
-Coord: "carol",
+Coord: "carol",
 },
 {
 Priority: "high", Type: "sms", Sender: "+1-555-0142",
-Body: "Bilingual safety coord needed Indy plant ASAP. Spanish + English. OSHA trainer credential.",
-DemandQuery: "Bilingual Spanish English safety coordinator Indianapolis OSHA trainer",
-Coord: "bob",
+Body: "Bilingual safety coord needed Indy plant ASAP. Spanish + English. OSHA trainer credential.",
+Coord: "bob",
 },
 {
 Priority: "medium", Type: "sms", Sender: "+1-555-0188",
-Body: "Drone surveyor for Chicago site progress mapping. FAA Part 107.",
-DemandQuery: "FAA Part 107 drone surveyor Chicago site mapping",
-Coord: "carol",
+Body: "Drone surveyor for Chicago site progress mapping. FAA Part 107.",
+Coord: "carol",
 },
 {
 Priority: "medium", Type: "email", Sender: "scheduling@northstar.com",
 Subject: "FYI: warehouse worker capacity check Milwaukee",
 Body: "Routine capacity check on Milwaukee warehouse worker pool — anyone with cold storage experience for next week?",
-DemandQuery: "Warehouse worker Milwaukee cold storage",
-Coord: "alice",
+Coord: "alice",
 },
 }
 // Sort by priority (urgent < high < medium < low for ordering).
// Sort by priority (urgent < high < medium < low for ordering).
@@ -384,15 +374,27 @@ func main() {
 log.Printf(" inbox record failed (%s): %v", ie.Priority, err)
 continue
 }
-// 2. Triggered matrix.search using the parsed demand.
+// 2. LLM parses the body into a structured demand. Real
+// production: a small local model extracts {role, count,
+// location, certs, skills, shift} from email/SMS bodies
+// instead of a coordinator typing it into a form. Test
+// captures both raw body and parsed structure for review.
+parsed, perr := parseInboxDemand(hc, *ollama, *judgeModel, ie.Body)
+if perr != nil {
+log.Printf(" inbox demand parse failed (%s): %v", ie.Priority, perr)
+continue
+}
+// 3. Build a query string from the parsed demand and search.
+query := parsed.AsQuery()
 coord := coordByName(coords, ie.Coord)
-resp, err := matrixSearch(hc, *gateway, ie.DemandQuery, corpora, *k, true, coord.PlaybookCorpus)
+resp, err := matrixSearch(hc, *gateway, query, corpora, *k, true, coord.PlaybookCorpus)
 if err != nil {
 log.Printf(" inbox-triggered search failed (%s): %v", ie.Priority, err)
 continue
 }
-ev := captureEvent("inbox-triggered-search", 9, ie.Coord, "inbox-burst", ie.Subject, ie.DemandQuery, 1, true, coord.PlaybookCorpus, resp)
-ev.Note = fmt.Sprintf("inbox %s/%s from %s", ie.Type, ie.Priority, ie.Sender)
+ev := captureEvent("inbox-triggered-search", 9, ie.Coord, "inbox-burst", ie.Subject, query, 1, true, coord.PlaybookCorpus, resp)
+parsedJSON, _ := json.Marshal(parsed)
+ev.Note = fmt.Sprintf("inbox %s/%s from %s · LLM-parsed demand: %s", ie.Type, ie.Priority, ie.Sender, string(parsedJSON))
 output.Events = append(output.Events, ev)
 }
@@ -930,6 +932,108 @@ func ingestFreshWorker(hc *http.Client, gw, id, text string, metadata map[string
return nil
}
// parsedDemand is the LLM-extracted structure from an inbox message
// body — what a real coordinator would type into a search form.
// Empty fields are honest: the body didn't say.
type parsedDemand struct {
Role string `json:"role"`
Count int `json:"count"`
Location string `json:"location"`
Certs []string `json:"certs"`
Skills []string `json:"skills"`
Shift string `json:"shift"`
}
// AsQuery composes a matrix.search query string from the parsed
// fields. Mirrors buildQuery's shape so search-time semantics match
// what the contract-driven phases produce. Empty fields are skipped
// rather than emitted as "" markers.
func (p parsedDemand) AsQuery() string {
var b strings.Builder
if p.Count > 0 {
fmt.Fprintf(&b, "Need %d ", p.Count)
} else {
b.WriteString("Need ")
}
b.WriteString(p.Role)
if p.Location != "" {
b.WriteString(" for ")
b.WriteString(p.Location)
}
if p.Shift != "" {
b.WriteString(", ")
b.WriteString(p.Shift)
b.WriteString(" shift")
}
if len(p.Certs) > 0 {
b.WriteString(", certifications: ")
b.WriteString(strings.Join(p.Certs, ", "))
}
if len(p.Skills) > 0 {
b.WriteString(", skills: ")
b.WriteString(strings.Join(p.Skills, ", "))
}
return b.String()
}
// parseInboxDemand asks the judge model to extract structured fields
// from an inbox body. Same Ollama+JSON-format pattern as the
// generateParaphrase function. Real production would have a dedicated
// small model fine-tuned on staffing-language inbox parsing; here we
// use the same model that judges relevance. temperature=0 for
// deterministic extraction.
func parseInboxDemand(hc *http.Client, ollamaURL, model, inboxBody string) (*parsedDemand, error) {
system := `You parse staffing requests from email/SMS bodies. Extract structured fields.
Output JSON ONLY, this exact shape: {"role": "...", "count": N, "location": "...", "certs": [...], "skills": [...], "shift": "..."}.
Rules:
- role: the job role being requested (lowercase, e.g. "warehouse worker", "forklift operator")
- count: number of workers needed (integer; if "a few" or unspecified, use 1)
- location: city + state if both mentioned (e.g. "Cleveland, OH"); city only if state missing
- certs: certification list as named in the body (e.g. ["OSHA-30", "forklift cert"])
- skills: skill list as named in the body (e.g. ["pallet jack", "spanish"])
- shift: "day"|"swing"|"night" if mentioned, else ""
- If a field isn't in the body, use empty string or empty array (never null)
- Do NOT explain; emit the JSON only.`
body, _ := json.Marshal(map[string]any{
"model": model,
"stream": false,
"format": "json",
"messages": []map[string]string{
{"role": "system", "content": system},
{"role": "user", "content": inboxBody},
},
"options": map[string]any{"temperature": 0},
})
req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(body))
req.Header.Set("Content-Type", "application/json")
resp, err := hc.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
if resp.StatusCode/100 != 2 {
return nil, fmt.Errorf("ollama chat: HTTP %d", resp.StatusCode)
}
rb, _ := io.ReadAll(resp.Body)
var ollamaResp struct {
Message struct {
Content string `json:"content"`
} `json:"message"`
}
if err := json.Unmarshal(rb, &ollamaResp); err != nil {
return nil, err
}
var out parsedDemand
if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &out); err != nil {
return nil, fmt.Errorf("decode parsed demand: %w (content=%q)", err, ollamaResp.Message.Content)
}
if strings.TrimSpace(out.Role) == "" {
return nil, fmt.Errorf("parsed demand has empty role (content=%q)", ollamaResp.Message.Content)
}
return &out, nil
}
// postInbox sends an inbox message (email or SMS) to observerd via
// the gateway. observerd records it as an ObservedOp with
// Source=SourceInbox; downstream actions (search, ingest, etc.) are