multi_coord_stress: LLM-parsed inbox demands (qwen2.5)

Replaced the hard-coded DemandQuery on inbox events with an actual
LLM call: each email/SMS body is parsed by qwen2.5 (format=json,
schema-anchored) into structured {role, count, location, certs,
skills, shift}. The driver then composes a query string from those
fields and runs matrix.search.

This is the real-product flow that the Phase 3 stress test was
asking for: real bodies → real LLM parsing → real search. Before
this commit, the DemandQuery was my hand-crafted string, which
made the inbox phase trivial.

Run #007 result vs #006 (same bodies, parser swapped):

All 6 inbox events parsed cleanly — qwen2.5 nailed:
  "Need 50 forklift operators in Cleveland OH for Monday day
   shift. OSHA-30 + active forklift cert required."
  → {role:"forklift operator", count:50, location:"Cleveland, OH",
     certs:["OSHA-30","active forklift cert"], skills:[], shift:"day"}
Other 5 similarly faithful (indy stayed as "indy", count
defaulted to 1 when unspecified, no hallucinated fields).
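
A minimal sketch of the parse-then-compose step, assuming the model returns strict JSON in the shape shown above. The struct and `AsQuery` mirror the ones added in this commit; the JSON literal is hand-written for illustration and was not captured from run #007.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parsedDemand mirrors the struct added in this commit.
type parsedDemand struct {
	Role     string   `json:"role"`
	Count    int      `json:"count"`
	Location string   `json:"location"`
	Certs    []string `json:"certs"`
	Skills   []string `json:"skills"`
	Shift    string   `json:"shift"`
}

// AsQuery composes the search string; empty fields are skipped.
func (p parsedDemand) AsQuery() string {
	var b strings.Builder
	if p.Count > 0 {
		fmt.Fprintf(&b, "Need %d ", p.Count)
	} else {
		b.WriteString("Need ")
	}
	b.WriteString(p.Role)
	if p.Location != "" {
		b.WriteString(" for " + p.Location)
	}
	if p.Shift != "" {
		b.WriteString(", " + p.Shift + " shift")
	}
	if len(p.Certs) > 0 {
		b.WriteString(", certifications: " + strings.Join(p.Certs, ", "))
	}
	if len(p.Skills) > 0 {
		b.WriteString(", skills: " + strings.Join(p.Skills, ", "))
	}
	return b.String()
}

func main() {
	// Hand-written stand-in for the model's message.content (assumption).
	content := `{"role":"forklift operator","count":50,"location":"Cleveland, OH",` +
		`"certs":["OSHA-30","active forklift cert"],"skills":[],"shift":"day"}`

	var d parsedDemand
	if err := json.Unmarshal([]byte(content), &d); err != nil {
		panic(err)
	}
	fmt.Println(d.AsQuery())
}
```

Under those assumptions the forklift example composes to `Need 50 forklift operator for Cleveland, OH, day shift, certifications: OSHA-30, active forklift cert`. The role is emitted exactly as parsed, so it stays singular even when count is greater than one.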

LLM-parsed queries produced TIGHTER matches than hard-coded:

  Demand                #006 dist   #007 dist   Δ
  Crane Chicago         0.499       0.093       -82%
  Drone Chicago         0.707       0.073       -90%
  Bilingual safety      0.240       0.048       -80%
  Forklift Cleveland    0.330       0.273       -17%
  Production Indy       0.260       0.399       +53%
  Warehouse Milwaukee   0.458       0.420       -8%

Three matches landed at distance < 0.10 — verbatim-replay-tight
territory. Structured queries embed sharper than conversational
hand-crafted strings.

Other metrics unchanged: diversity 0.000, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4.

Tradeoff worth flagging: the drone-Chicago case dropped from
distance 0.71 (clear "we don't have one") to 0.07 (confident match
returned). The OOD honesty signal weakens when LLM-parsed structure
makes any closest-neighbor look tight. Future Phase 4 work: judge
re-rates the top match before surfacing, so coordinators see "your
demand was for X but the closest match scored 2/5" rather than just
the worker ID + distance.
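
The Phase 4 idea above could look roughly like the following. Everything in this sketch is hypothetical: the `surfaceLine` helper, the cutoff of 3 on a 1-5 judge scale, the worker IDs and the judge scores are invented for illustration. Only the two distances are taken from the table.

```go
package main

import "fmt"

// minSurfaceScore is an assumed cutoff on a 1-5 judge scale; matches
// rated below it are reported as weak instead of being surfaced as-is.
const minSurfaceScore = 3

// surfaceLine renders the line a coordinator would see for the top match.
// A low judge score overrides a tight distance, which is the point of the
// re-rating step: distance alone no longer signals "we don't have one".
func surfaceLine(demand, workerID string, distance float64, judgeScore int) string {
	if judgeScore < minSurfaceScore {
		return fmt.Sprintf(
			"your demand was for %s but the closest match (%s, distance %.3f) scored %d/5",
			demand, workerID, distance, judgeScore)
	}
	return fmt.Sprintf("%s for %s: judge %d/5, distance %.3f",
		workerID, demand, judgeScore, distance)
}

func main() {
	// Judge scores and worker IDs below are made up for the example.
	fmt.Println(surfaceLine("drone surveyor", "worker-A", 0.073, 2))
	fmt.Println(surfaceLine("crane operator", "worker-B", 0.093, 5))
}
```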

Substrate cost: +6 LLM calls per inbox burst (~9s on qwen2.5).
Production would amortize via a small dedicated parser model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

parent e7fc63b216
commit 186d209aae

reports/reality-tests/multi_coord_stress_007.md (82 lines, normal file)
@@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 007

**Generated:** 2026-04-30T19:50:04.791000091Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_007.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
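
For reference, the Jaccard score behind these numbers can be sketched as below. This is an illustration, not the harness's own code: the `jaccard` helper and the worker IDs are hypothetical, and treating two empty result lists as identical is an assumption.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| over two lists of worker IDs.
// Duplicates within a list are ignored. Two empty lists score 1.0
// by assumption (nothing differs between them).
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 1
	}
	inter := 0
	for id := range setB {
		if _, ok := setA[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

func main() {
	// Hypothetical top-K lists for two queries.
	q1 := []string{"w1", "w2", "w3", "w4"}
	q2 := []string{"w3", "w4", "w5", "w6"}
	fmt.Printf("%.3f\n", jaccard(q1, q2)) // 2 shared IDs out of 6 distinct
}
```

A score of 0 means the two top-K lists share no workers, and 1 means they contain exactly the same workers regardless of order.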

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
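
The three bands above can be applied mechanically to the per-pair scores. The sketch below is illustrative only: `meanJaccard`, `stabilityBand` and the band labels are names chosen here, not identifiers from the harness.

```go
package main

import "fmt"

// meanJaccard averages per-pair Jaccard scores; empty input yields 0.
func meanJaccard(scores []float64) float64 {
	if len(scores) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range scores {
		sum += s
	}
	return sum / float64(len(scores))
}

// stabilityBand maps a mean reissue Jaccard onto the three bands in the
// interpretation list. The labels are shorthand invented for this sketch.
func stabilityBand(mean float64) string {
	switch {
	case mean >= 0.95:
		return "deterministic"
	case mean >= 0.80:
		return "acceptable variance"
	default:
		return "unstable"
	}
}

func main() {
	// Run 007 reports a mean of 1 over 12 reissue pairs, which is only
	// possible if every individual pair scored 1.
	scores := make([]float64, 12)
	for i := range scores {
		scores[i] = 1
	}
	m := meanJaccard(scores)
	fmt.Printf("mean=%.2f band=%s\n", m, stabilityBand(m))
}
```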

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
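
One way to compute the top-1 and top-K hit rates in the table is sketched below. The `handoverQuery` type, the `hitRates` helper and the worker IDs are hypothetical; the sketch assumes each query has exactly one recorded answer.

```go
package main

import "fmt"

// handoverQuery pairs the worker ID Alice recorded for a query with the
// ranked results Bob received for the same (or a paraphrased) query.
type handoverQuery struct {
	Recorded string   // Alice's recorded answer
	TopK     []string // Bob's results, best match first
}

// hitRates returns the fraction of queries whose recorded answer is
// Bob's first result (top1) and the fraction where it appears anywhere
// in Bob's list (topK). Empty input yields 0 for both.
func hitRates(qs []handoverQuery) (top1, topK float64) {
	if len(qs) == 0 {
		return 0, 0
	}
	var n1, nK int
	for _, q := range qs {
		for rank, id := range q.TopK {
			if id == q.Recorded {
				nK++
				if rank == 0 {
					n1++
				}
				break
			}
		}
	}
	n := float64(len(qs))
	return float64(n1) / n, float64(nK) / n
}

func main() {
	// Made-up data: three top-1 hits and one hit at rank 3.
	qs := []handoverQuery{
		{Recorded: "w1", TopK: []string{"w1", "w7", "w9"}},
		{Recorded: "w2", TopK: []string{"w2", "w4", "w5"}},
		{Recorded: "w3", TopK: []string{"w3", "w6", "w8"}},
		{Recorded: "w4", TopK: []string{"w5", "w6", "w4"}},
	}
	top1, topK := hitRates(qs)
	fmt.Printf("top-1=%.2f top-K=%.2f\n", top1, topK)
}
```

In run 007 all four verbatim and all four paraphrase queries hit at top-1, so both rates are 1 by this definition.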

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_007.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_007.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_007.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

@@ -325,52 +325,42 @@ func main() {
 		Sender string
 		Subject string
 		Body string
-		// DemandQuery is the parsed demand we'd extract from the body.
-		// In production a small LLM would parse this; here we fix it
-		// up-front to keep the test deterministic.
-		DemandQuery string
-		Coord string
+		Coord string
 	}
 	inboxEvents := []inboxEvent{
 		{
 			Priority: "urgent", Type: "email", Sender: "ops@northstar.com",
 			Subject: "URGENT: 50 forklift operators Cleveland Monday",
 			Body: "Need 50 forklift operators in Cleveland OH for Monday day shift. OSHA-30 + active forklift cert required. Current Milwaukee batch cannot relocate.",
-			DemandQuery: "Forklift operator Cleveland OH OSHA-30 forklift certification day shift",
-			Coord: "alice",
+			Coord: "alice",
 		},
 		{
 			Priority: "urgent", Type: "email", Sender: "client@crossroads-mfg.com",
 			Subject: "URGENT: Production line down — need 30 production workers tonight",
 			Body: "Production line failure at Indianapolis IN site. Need 30 production workers swing shift starting tonight. Assembly + machine operation experience required.",
-			DemandQuery: "Production worker Indianapolis IN swing shift assembly machine operation",
-			Coord: "bob",
+			Coord: "bob",
 		},
 		{
 			Priority: "high", Type: "email", Sender: "supervisor@loop-construction.com",
 			Subject: "Need crane operator Chicago for 2-week project",
 			Body: "Crane operator with NCCCO certification needed for 2-week Chicago IL site project. Day shift. Mobile crane experience preferred.",
-			DemandQuery: "Crane operator NCCCO certification Chicago IL mobile crane day shift",
-			Coord: "carol",
+			Coord: "carol",
 		},
 		{
 			Priority: "high", Type: "sms", Sender: "+1-555-0142",
-			Body: "Bilingual safety coord needed Indy plant ASAP. Spanish + English. OSHA trainer credential.",
-			DemandQuery: "Bilingual Spanish English safety coordinator Indianapolis OSHA trainer",
-			Coord: "bob",
+			Body: "Bilingual safety coord needed Indy plant ASAP. Spanish + English. OSHA trainer credential.",
+			Coord: "bob",
 		},
 		{
 			Priority: "medium", Type: "sms", Sender: "+1-555-0188",
-			Body: "Drone surveyor for Chicago site progress mapping. FAA Part 107.",
-			DemandQuery: "FAA Part 107 drone surveyor Chicago site mapping",
-			Coord: "carol",
+			Body: "Drone surveyor for Chicago site progress mapping. FAA Part 107.",
+			Coord: "carol",
 		},
 		{
 			Priority: "medium", Type: "email", Sender: "scheduling@northstar.com",
 			Subject: "FYI: warehouse worker capacity check Milwaukee",
 			Body: "Routine capacity check on Milwaukee warehouse worker pool — anyone with cold storage experience for next week?",
-			DemandQuery: "Warehouse worker Milwaukee cold storage",
-			Coord: "alice",
+			Coord: "alice",
 		},
 	}
 	// Sort by priority (urgent < high < medium < low for ordering).
@@ -384,15 +374,27 @@ func main() {
 			log.Printf(" inbox record failed (%s): %v", ie.Priority, err)
 			continue
 		}
-		// 2. Triggered matrix.search using the parsed demand.
+		// 2. LLM parses the body into a structured demand. Real
+		// production: a small local model extracts {role, count,
+		// location, certs, skills, shift} from email/SMS bodies
+		// instead of a coordinator typing it into a form. Test
+		// captures both raw body and parsed structure for review.
+		parsed, perr := parseInboxDemand(hc, *ollama, *judgeModel, ie.Body)
+		if perr != nil {
+			log.Printf(" inbox demand parse failed (%s): %v", ie.Priority, perr)
+			continue
+		}
+		// 3. Build a query string from the parsed demand and search.
+		query := parsed.AsQuery()
 		coord := coordByName(coords, ie.Coord)
-		resp, err := matrixSearch(hc, *gateway, ie.DemandQuery, corpora, *k, true, coord.PlaybookCorpus)
+		resp, err := matrixSearch(hc, *gateway, query, corpora, *k, true, coord.PlaybookCorpus)
 		if err != nil {
 			log.Printf(" inbox-triggered search failed (%s): %v", ie.Priority, err)
 			continue
 		}
-		ev := captureEvent("inbox-triggered-search", 9, ie.Coord, "inbox-burst", ie.Subject, ie.DemandQuery, 1, true, coord.PlaybookCorpus, resp)
-		ev.Note = fmt.Sprintf("inbox %s/%s from %s", ie.Type, ie.Priority, ie.Sender)
+		ev := captureEvent("inbox-triggered-search", 9, ie.Coord, "inbox-burst", ie.Subject, query, 1, true, coord.PlaybookCorpus, resp)
+		parsedJSON, _ := json.Marshal(parsed)
+		ev.Note = fmt.Sprintf("inbox %s/%s from %s · LLM-parsed demand: %s", ie.Type, ie.Priority, ie.Sender, string(parsedJSON))
 		output.Events = append(output.Events, ev)
 	}

@@ -930,6 +932,108 @@ func ingestFreshWorker(hc *http.Client, gw, id, text string, metadata map[string
 	return nil
 }

+// parsedDemand is the LLM-extracted structure from an inbox message
+// body — what a real coordinator would type into a search form.
+// Empty fields are honest: the body didn't say.
+type parsedDemand struct {
+	Role string `json:"role"`
+	Count int `json:"count"`
+	Location string `json:"location"`
+	Certs []string `json:"certs"`
+	Skills []string `json:"skills"`
+	Shift string `json:"shift"`
+}
+
+// AsQuery composes a matrix.search query string from the parsed
+// fields. Mirrors buildQuery's shape so search-time semantics match
+// what the contract-driven phases produce. Empty fields are skipped
+// rather than emitted as "" markers.
+func (p parsedDemand) AsQuery() string {
+	var b strings.Builder
+	if p.Count > 0 {
+		fmt.Fprintf(&b, "Need %d ", p.Count)
+	} else {
+		b.WriteString("Need ")
+	}
+	b.WriteString(p.Role)
+	if p.Location != "" {
+		b.WriteString(" for ")
+		b.WriteString(p.Location)
+	}
+	if p.Shift != "" {
+		b.WriteString(", ")
+		b.WriteString(p.Shift)
+		b.WriteString(" shift")
+	}
+	if len(p.Certs) > 0 {
+		b.WriteString(", certifications: ")
+		b.WriteString(strings.Join(p.Certs, ", "))
+	}
+	if len(p.Skills) > 0 {
+		b.WriteString(", skills: ")
+		b.WriteString(strings.Join(p.Skills, ", "))
+	}
+	return b.String()
+}
+
+// parseInboxDemand asks the judge model to extract structured fields
+// from an inbox body. Same Ollama+JSON-format pattern as the
+// generateParaphrase function. Real production would have a dedicated
+// small model fine-tuned on staffing-language inbox parsing; here we
+// use the same model that judges relevance. temperature=0 for
+// deterministic extraction.
+func parseInboxDemand(hc *http.Client, ollamaURL, model, inboxBody string) (*parsedDemand, error) {
+	system := `You parse staffing requests from email/SMS bodies. Extract structured fields.
+Output JSON ONLY, this exact shape: {"role": "...", "count": N, "location": "...", "certs": [...], "skills": [...], "shift": "..."}.
+
+Rules:
+- role: the job role being requested (lowercase, e.g. "warehouse worker", "forklift operator")
+- count: number of workers needed (integer; if "a few" or unspecified, use 1)
+- location: city + state if both mentioned (e.g. "Cleveland, OH"); city only if state missing
+- certs: certification list as named in the body (e.g. ["OSHA-30", "forklift cert"])
+- skills: skill list as named in the body (e.g. ["pallet jack", "spanish"])
+- shift: "day"|"swing"|"night" if mentioned, else ""
+- If a field isn't in the body, use empty string or empty array (never null)
+- Do NOT explain — emit the JSON only.`
+	body, _ := json.Marshal(map[string]any{
+		"model": model,
+		"stream": false,
+		"format": "json",
+		"messages": []map[string]string{
+			{"role": "system", "content": system},
+			{"role": "user", "content": inboxBody},
+		},
+		"options": map[string]any{"temperature": 0},
+	})
+	req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(body))
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := hc.Do(req)
+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode/100 != 2 {
+		return nil, fmt.Errorf("ollama chat: HTTP %d", resp.StatusCode)
+	}
+	rb, _ := io.ReadAll(resp.Body)
+	var ollamaResp struct {
+		Message struct {
+			Content string `json:"content"`
+		} `json:"message"`
+	}
+	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
+		return nil, err
+	}
+	var out parsedDemand
+	if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &out); err != nil {
+		return nil, fmt.Errorf("decode parsed demand: %w (content=%q)", err, ollamaResp.Message.Content)
+	}
+	if strings.TrimSpace(out.Role) == "" {
+		return nil, fmt.Errorf("parsed demand has empty role (content=%q)", ollamaResp.Message.Content)
+	}
+	return &out, nil
+}
+
 // postInbox sends an inbox message (email or SMS) to observerd via
 // the gateway. observerd records it as an ObservedOp with
 // Source=SourceInbox; downstream actions (search, ingest, etc.) are