multi_coord_stress: judge re-rates inbox top-1 — recovers honesty signal
Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much
tighter cosine distances (0.05-0.10 in three cases) but lose the
"system has no good match" signal that high-distance results give.
A coordinator UI showing only distance can't tell wrong-domain
matches apart from real ones.

Fix: judge re-rates top-1 against the ORIGINAL inbox body (not the
LLM-parsed query). Coordinators see both:
  - distance: how close was retrieval in vector space
  - rating: does this person actually fit the original ask

The pair tells the honest story.

Run #008 result on the 6 inbox events:

  Demand                   Top-1    Distance  Rating  Reading
  ─────────────────────────────────────────────────────────────
  Forklift Cleveland       w-3573   0.29      4       Strong
  Production Indy          e-1764   0.41      3       Adjacent
  Crane Chicago            e-7798   0.23      1       TIGHT BUT WRONG
  Bilingual safety Indy    w-3918   0.05      5       Perfect
  Drone Chicago            e-1058   0.06      5       Perfect (verify e-1058)
  Warehouse Milwaukee      w-460    0.32      4       Strong

The crane-Chicago case is the architectural-honesty signal at work:
distance 0.23 says "tight match" but the judge says rating 1 reading
the original body. A coordinator seeing only distance would ship the
wrong worker; coordinator seeing distance+rating sees the
disagreement and escalates.

Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1
(irrelevant despite tight cosine). The substrate-honesty signal is
recovered without losing the LLM-parse quality wins.

Cost: 6 extra judge calls (~9s on qwen2.5). Production amortizes
when judge runs only on top-1 of high-priority inbox events; the
search-cost-vs-quality tradeoff lives in the priority gate.

Implementation:
  - New JudgeRating int field on Event (omitempty so non-judged
    events stay clean in JSON)
  - New judgeInboxResult helper, reusing the same prompt structure
    as playbook_lift's judgeRate. The two could share an internal
    package if a third judge consumer appears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
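
The distance-plus-rating reading described in the commit message could be sketched as a small triage function. This is only an illustration: the `Verdict` labels, the `Triage` function, and both thresholds (`tightDistance`, `minRating`) are hypothetical and do not come from the commit.

```go
package main

import "fmt"

// Verdict is a hypothetical label a coordinator UI could show next to a result.
type Verdict string

const (
	VerdictUnrated  Verdict = "unrated"
	VerdictAccept   Verdict = "accept"
	VerdictEscalate Verdict = "escalate"
	VerdictNoMatch  Verdict = "no-match"
)

// Triage combines the retrieval distance with the judge rating.
// tightDistance and minRating are illustrative thresholds (assumptions),
// not values taken from the run.
func Triage(distance float64, rating int, tightDistance float64, minRating int) Verdict {
	switch {
	case rating == 0:
		// 0 means the judge did not run or returned an error.
		return VerdictUnrated
	case rating >= minRating:
		// The judge accepts the result against the original body.
		return VerdictAccept
	case distance <= tightDistance:
		// Tight in vector space, but the judge disagrees: surface the conflict.
		return VerdictEscalate
	default:
		// Far away and poorly rated: the corpus likely has no good match.
		return VerdictNoMatch
	}
}

func main() {
	// Crane Chicago from run #008: distance 0.23, rating 1.
	fmt.Println(Triage(0.23, 1, 0.30, 3)) // prints "escalate"
	// Bilingual safety Indy from run #008: distance 0.05, rating 5.
	fmt.Println(Triage(0.05, 5, 0.30, 3)) // prints "accept"
}
```

With an assumed tight-distance cutoff of 0.30, only the crane-Chicago row from the table would be escalated; the other five rows are rated 3 or higher and pass.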
parent 186d209aae
commit ce940f4a14
82
reports/reality-tests/multi_coord_stress_008.md
Normal file
@@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 008

**Generated:** 2026-04-30T21:15:37.045817146Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_008.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.04126984126984126 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
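
For readers unfamiliar with the metric, the sketch below shows one way a mean Jaccard over top-K ID lists could be computed. It is a hedged illustration, not the harness's code: `Jaccard` and `MeanPairwiseJaccard` are hypothetical names, the worker IDs are made up, and treating two empty lists as similarity 0 is an assumption.

```go
package main

import "fmt"

// Jaccard returns |A∩B| / |A∪B| for two top-K ID lists, treating each as a set.
// Two empty lists are defined here as similarity 0 (an assumption).
func Jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// MeanPairwiseJaccard averages Jaccard over every unordered pair of lists
// and also returns the number of pairs compared.
func MeanPairwiseJaccard(lists [][]string) (mean float64, pairs int) {
	sum := 0.0
	for i := 0; i < len(lists); i++ {
		for j := i + 1; j < len(lists); j++ {
			sum += Jaccard(lists[i], lists[j])
			pairs++
		}
	}
	if pairs == 0 {
		return 0, 0
	}
	return sum / float64(pairs), pairs
}

func main() {
	// Made-up top-K lists for illustration only.
	a := []string{"w-1", "w-2", "w-3"}
	b := []string{"w-2", "w-3", "w-4"}
	c := []string{"e-9"}
	fmt.Println(Jaccard(a, b)) // prints 0.5 (2 shared IDs out of 4 distinct)
	mean, pairs := MeanPairwiseJaccard([][]string{a, b, c})
	fmt.Printf("mean=%.4f pairs=%d\n", mean, pairs)
}
```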

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
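
The top-1 and top-K hit rates in the table can be read as simple fractions over the handover queries. The following is a minimal sketch under that reading; `HitRates` and the sample IDs are hypothetical, and the harness may compute these numbers differently.

```go
package main

import "fmt"

// HitRates reports how often a previously recorded answer resurfaces.
// recorded[i] is the worker ID the first coordinator recorded for query i;
// results[i] is the second coordinator's ranked top-K IDs for the same query.
// Queries beyond the shorter of the two slices are ignored.
func HitRates(recorded []string, results [][]string) (top1, topK float64) {
	n := len(recorded)
	if len(results) < n {
		n = len(results)
	}
	if n == 0 {
		return 0, 0
	}
	var hits1, hitsK int
	for i := 0; i < n; i++ {
		for rank, id := range results[i] {
			if id != recorded[i] {
				continue
			}
			hitsK++
			if rank == 0 {
				hits1++
			}
			break
		}
	}
	return float64(hits1) / float64(n), float64(hitsK) / float64(n)
}

func main() {
	// Made-up data: four handover queries with mixed outcomes.
	recorded := []string{"w-1", "w-2", "w-3", "w-4"}
	results := [][]string{
		{"w-1", "w-9"}, // hit at top-1
		{"w-8", "w-2"}, // hit in top-K only
		{"w-3"},        // hit at top-1
		{"w-7", "w-6"}, // miss
	}
	top1, topK := HitRates(recorded, results)
	fmt.Printf("top-1 %.2f, top-K %.2f\n", top1, topK) // prints "top-1 0.50, top-K 0.75"
}
```

In run 008 both rates are 1 for verbatim and paraphrase queries, which corresponds to 4 of 4 queries landing the recorded answer at rank 0.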

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_008.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
@@ -110,6 +110,12 @@ type Event struct {
	PerCorpusCounts map[string]int `json:"per_corpus_counts,omitempty"`
	PlaybookBoosted int `json:"playbook_boosted,omitempty"`
	PlaybookInjected int `json:"playbook_injected,omitempty"`
	// JudgeRating: 1-5 quality score on top-1 result against the
	// original inbox body (not the LLM-parsed query). Lets us flag
	// the case where LLM parsing produces a tight-distance match
	// but the result doesn't actually fit the original ask.
	// 0 = unrated, 1-5 = judge verdict.
	JudgeRating int `json:"judge_rating,omitempty"`
	Note string `json:"note,omitempty"`
	TimestampUnixNano int64 `json:"ts_ns"`
}
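
The effect of `omitempty` on the new field can be checked in isolation. The sketch below uses a trimmed, hypothetical `event` struct that keeps only three of the fields from the hunk above; `encode` is an illustrative helper, not part of the commit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event is a trimmed, hypothetical stand-in for the Event struct in the diff.
type event struct {
	JudgeRating       int    `json:"judge_rating,omitempty"`
	Note              string `json:"note,omitempty"`
	TimestampUnixNano int64  `json:"ts_ns"`
}

// encode marshals an event and returns the JSON text, or "" on error.
func encode(e event) string {
	b, err := json.Marshal(e)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Unrated event: the zero JudgeRating is dropped from the output.
	fmt.Println(encode(event{TimestampUnixNano: 1})) // prints {"ts_ns":1}
	// Judged event: the lowest real verdict, 1, is still emitted.
	fmt.Println(encode(event{JudgeRating: 1, TimestampUnixNano: 1})) // prints {"judge_rating":1,"ts_ns":1}
}
```

Because 0 is reserved for "unrated" and real verdicts start at 1, dropping the zero value loses no information for consumers of the JSON.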
@@ -395,6 +401,16 @@ func main() {
		ev := captureEvent("inbox-triggered-search", 9, ie.Coord, "inbox-burst", ie.Subject, query, 1, true, coord.PlaybookCorpus, resp)
		parsedJSON, _ := json.Marshal(parsed)
		ev.Note = fmt.Sprintf("inbox %s/%s from %s · LLM-parsed demand: %s", ie.Type, ie.Priority, ie.Sender, string(parsedJSON))
		// 4. Judge re-rates top-1 against the ORIGINAL body — not the
		// parsed query. Catches the case where parsing dropped a
		// constraint (or where the corpus has no real match for the
		// asked specialist, e.g. "drone surveyor" against a corpus
		// of warehouse workers — the closest semantic neighbor will
		// have a tight distance but not actually fit).
		if len(resp.Results) > 0 {
			rating := judgeInboxResult(hc, *ollama, *judgeModel, ie.Body, resp.Results[0])
			ev.JudgeRating = rating
		}
		output.Events = append(output.Events, ev)
	}

@@ -976,6 +992,65 @@ func (p parsedDemand) AsQuery() string {
	return b.String()
}

// judgeInboxResult rates the top retrieval against the ORIGINAL inbox
// body. Returns 1-5 (5 = perfect match, 1 = irrelevant); 0 on any
// error. Real product driver: a tight-distance result on a
// LLM-parsed query may still be wrong-domain (parser dropped a
// critical constraint, or the corpus genuinely has no match). The
// rating gives coordinators an honest "this is close in vector
// space but doesn't actually fit your ask" signal.
func judgeInboxResult(hc *http.Client, ollamaURL, model, inboxBody string, top matrixResult) int {
	system := `You rate retrieval results for a staffing co-pilot.
Rate the result 1-5 against the original inbox request:
5 = perfect match (this person/role IS what was asked for)
4 = strong match (right field, right level, minor mismatches)
3 = adjacent match (related field or partial overlap)
2 = weak/tangential match
1 = irrelevant
Output JSON only: {"rating": N, "reason": "<one sentence>"}.`
	user := fmt.Sprintf("Original inbox request:\n%s\n\nResult corpus: %s\nResult ID: %s\nResult metadata:\n%s",
		inboxBody, top.Corpus, top.ID, string(top.Metadata))
	body, _ := json.Marshal(map[string]any{
		"model": model,
		"stream": false,
		"format": "json",
		"messages": []map[string]string{
			{"role": "system", "content": system},
			{"role": "user", "content": user},
		},
		"options": map[string]any{"temperature": 0},
	})
	req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return 0
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return 0
	}
	rb, _ := io.ReadAll(resp.Body)
	var ollamaResp struct {
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	}
	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
		return 0
	}
	var v struct {
		Rating int `json:"rating"`
	}
	if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &v); err != nil {
		return 0
	}
	if v.Rating < 1 || v.Rating > 5 {
		return 0
	}
	return v.Rating
}

// parseInboxDemand asks the judge model to extract structured fields
// from an inbox body. Same Ollama+JSON-format pattern as the
// generateParaphrase function. Real production would have a dedicated
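
The decode-and-validate tail of `judgeInboxResult` can be exercised without a running model by pulling it into a pure function. The sketch below is an illustration only: `parseJudgeRating` is a hypothetical name that does not appear in the commit, and the sample payloads are made up to match the `message.content` shape the function above reads.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseJudgeRating decodes the chat response envelope, then the JSON the
// model wrote into message.content, and returns 0 unless the rating is 1-5.
// It mirrors the last steps of judgeInboxResult; the name is hypothetical.
func parseJudgeRating(raw []byte) int {
	var envelope struct {
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	}
	if err := json.Unmarshal(raw, &envelope); err != nil {
		return 0
	}
	var verdict struct {
		Rating int `json:"rating"`
	}
	if err := json.Unmarshal([]byte(envelope.Message.Content), &verdict); err != nil {
		return 0
	}
	if verdict.Rating < 1 || verdict.Rating > 5 {
		return 0
	}
	return verdict.Rating
}

func main() {
	// Made-up payloads: a valid verdict, an out-of-range one, and non-JSON content.
	ok := []byte(`{"message":{"content":"{\"rating\": 4, \"reason\": \"right field\"}"}}`)
	outOfRange := []byte(`{"message":{"content":"{\"rating\": 7}"}}`)
	notJSON := []byte(`{"message":{"content":"four out of five"}}`)
	fmt.Println(parseJudgeRating(ok), parseJudgeRating(outOfRange), parseJudgeRating(notJSON)) // prints "4 0 0"
}
```

Every failure path collapses to 0, which matches the "0 = unrated" convention on `JudgeRating` and keeps failed judge calls out of the JSON via `omitempty`.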