golangLAKEHOUSE


root ce940f4a14 multi_coord_stress: judge re-rates inbox top-1 — recovers honesty signal

Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much
tighter cosine distances (0.05-0.10 in three cases) but lose the
"system has no good match" signal that high-distance results give.
A coordinator UI showing only distance can't tell wrong-domain
matches apart from real ones.

Fix: judge re-rates top-1 against the ORIGINAL inbox body (not the
LLM-parsed query). Coordinators see both:
- distance: how close was retrieval in vector space
- rating: does this person actually fit the original ask
The pair tells the honest story.

Run #008 result on the 6 inbox events:

Demand                  Top-1    Distance  Rating  Reading
──────────────────────────────────────────────────────────────────────
Forklift Cleveland      w-3573   0.29      4       Strong
Production Indy         e-1764   0.41      3       Adjacent
Crane Chicago           e-7798   0.23      1       TIGHT BUT WRONG
Bilingual safety Indy   w-3918   0.05      5       Perfect
Drone Chicago           e-1058   0.06      5       Perfect (verify e-1058)
Warehouse Milwaukee     w-460    0.32      4       Strong

The crane-Chicago case is the architectural-honesty signal at work:
distance 0.23 says "tight match", but the judge, reading the original
body, gives it rating 1. A coordinator seeing only distance would ship
the wrong worker; a coordinator seeing distance+rating sees the
disagreement and escalates.
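
The disagreement check described above can be sketched as follows. This is
an illustration only: needsEscalation and both thresholds are hypothetical
and do not come from the multi_coord_stress code. The rating floor of 3
follows the "3+ is acceptable" reading used in this message; the distance
cutoff is an assumption.

```go
package main

import "fmt"

// Hypothetical thresholds; the commit does not state exact cutoffs.
const (
	tightDistance = 0.30 // assumed: at or below this, retrieval looks "tight"
	minGoodRating = 3    // ratings of 3+ are treated as acceptable
)

// needsEscalation reports whether a tight vector match is contradicted
// by a low judge rating, i.e. the "tight but wrong" case.
func needsEscalation(distance float64, rating int) bool {
	return distance <= tightDistance && rating < minGoodRating
}

func main() {
	fmt.Println(needsEscalation(0.23, 1)) // Crane Chicago: true
	fmt.Println(needsEscalation(0.05, 5)) // Bilingual safety Indy: false
}
```

A UI that shows both numbers could use a check like this to highlight
rows where the two signals disagree.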

Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1
(irrelevant despite tight cosine). The substrate-honesty signal is
recovered without losing the LLM-parse quality wins.

Cost: 6 extra judge calls (~9s on qwen2.5). In production the cost
amortizes if the judge runs only on the top-1 of high-priority inbox
events; the search-cost-vs-quality tradeoff lives in the priority gate.
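
A minimal sketch of such a priority gate, assuming a hypothetical
inboxEvent type with an integer Priority and an assumed floor of 2. None
of these names or values come from the repository; the point is only that
judge calls scale with the number of gated events, not with inbox size.

```go
package main

import "fmt"

// inboxEvent is a hypothetical stand-in for the real event type.
type inboxEvent struct {
	ID       string
	Priority int // assumed: higher means more urgent
}

// judgePriorityFloor is an assumed cutoff for spending a judge call.
const judgePriorityFloor = 2

// shouldJudge gates the extra judge call on event priority.
func shouldJudge(ev inboxEvent) bool {
	return ev.Priority >= judgePriorityFloor
}

// countJudgeCalls returns how many events would pass the gate.
func countJudgeCalls(events []inboxEvent) int {
	n := 0
	for _, ev := range events {
		if shouldJudge(ev) {
			n++
		}
	}
	return n
}

func main() {
	events := []inboxEvent{
		{ID: "a", Priority: 1},
		{ID: "b", Priority: 2},
		{ID: "c", Priority: 3},
	}
	fmt.Println(countJudgeCalls(events)) // 2
}
```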

Implementation:
- New JudgeRating int field on Event (omitempty so non-judged
events stay clean in JSON)
- New judgeInboxResult helper, reusing the same prompt structure as
playbook_lift's judgeRate. The two could share an internal package
if a third judge consumer appears.
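
The omitempty behaviour on the new field can be sketched as below. Only
the JudgeRating name and its int type come from this message; the other
fields and all JSON key names are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a cut-down, hypothetical version of the real struct.
// Only JudgeRating (int, omitempty) is taken from the commit message.
type Event struct {
	Demand      string  `json:"demand"`
	Top1        string  `json:"top1"`
	Distance    float64 `json:"distance"`
	JudgeRating int     `json:"judge_rating,omitempty"`
}

// mustJSON marshals an Event and panics on error (sketch only).
func mustJSON(ev Event) string {
	b, err := json.Marshal(ev)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	judged := Event{Demand: "Crane Chicago", Top1: "e-7798", Distance: 0.23, JudgeRating: 1}
	plain := Event{Demand: "Crane Chicago", Top1: "e-7798", Distance: 0.23}

	fmt.Println(mustJSON(judged)) // includes "judge_rating":1
	fmt.Println(mustJSON(plain))  // judge_rating key is omitted
}
```

With omitempty on an int, a rating of 0 is indistinguishable from "not
judged". That is harmless if ratings start at 1, as the results above
suggest, but it is an assumption worth checking against the judge prompt.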

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 16:16:49 -05:00

multi_coord_stress

multi_coord_stress: judge re-rates inbox top-1 — recovers honesty signal

2026-04-30 16:16:49 -05:00

playbook_lift

playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14%

2026-04-30 07:42:04 -05:00

staffing_500k

scrum fixes: 4 real findings landed, 4 false positives dismissed

2026-04-29 19:42:39 -05:00

staffing_candidates