real_003 left a known-weak hole: shorthand-style queries
("{count} {role} {city} {state} ...") have no separator between
role and city, so a regex can't reliably extract the role — leaving
the cross-role gate disabled when both record AND query are shorthand.
This commit adds a roleExtractor with regex-first + LLM fallback:
- Regex first (fast, deterministic) — handles need + client_first +
looking from real_003b. ~75% of styles, no LLM cost paid.
- LLM fallback when regex returns empty AND model is configured —
Ollama-shape /api/chat with format=json, schema-tight prompt,
temperature 0. ~1-3s on local qwen2.5.
- Per-process cache — paraphrase + rejudge passes reuse the same
query 4× per run; cache prevents 4× LLM cost.
- Off-by-default — opt-in via -llm-role-extract flag (CLI) and
LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping
config unchanged unless explicitly enabled.
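A minimal sketch of this shape (roleExtractor here matches the commit's naming, but the field layout, llm hook, and regex patterns are illustrative assumptions, not the repo's exact code):

```go
package main

import (
	"regexp"
	"strings"
	"sync"
)

// roleExtractor: regex first, cached LLM fallback. Hypothetical shape.
type roleExtractor struct {
	mu    sync.Mutex
	cache map[string]string
	llm   func(query string) string // nil when -llm-role-extract is off
}

// rolePatterns are illustrative stand-ins for the need / client_first /
// looking styles the real regexes cover.
var rolePatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)need (?:an?|\d+)\s+([a-z -]+?)\s+in\b`),
	regexp.MustCompile(`(?i)looking for (?:an?\s+)?([a-z -]+?)\s+in\b`),
}

func (e *roleExtractor) Extract(query string) string {
	// Regex path: fast, deterministic, no model cost.
	for _, p := range rolePatterns {
		if m := p.FindStringSubmatch(query); m != nil {
			return strings.TrimSpace(m[1])
		}
	}
	// Nil-safe + opt-in default: no LLM configured leaves role empty.
	if e == nil || e.llm == nil {
		return ""
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.cache == nil {
		e.cache = map[string]string{}
	}
	if role, ok := e.cache[query]; ok {
		return role // paraphrase + rejudge passes reuse the same query
	}
	role := e.llm(query)
	e.cache[query] = role
	return role
}
```

The nil check before any field access is what makes the nil-receiver case regex-only rather than a panic.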
8 new tests in scripts/playbook_lift/main_test.go:
- TestRoleExtractor_RegexFirst: LLM not called when regex matches
- TestRoleExtractor_LLMFallback: shorthand goes to LLM
- TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved
- TestRoleExtractor_Cache: 3 calls = 1 LLM hit
- TestRoleExtractor_NilSafe: nil receiver runs regex only
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic
witness for the real_003 scenario — both record + query are
shorthand, regex returns "" for both, LLM produces DIFFERENT
role tokens for CNC vs Forklift, so matrix gate's cross-role
rejection (locked separately in
TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires
correctly. This is the load-bearing verification.
Reality test real_004 ran the same 40-query stress as real_003 with
LLM extraction on. Cross-style same-role boosts fired correctly
across all 4 styles for Loaders + Packers + Shipping Clerk clusters
(including shorthand → other-style transfer). No cross-role bleed
observed. The reality test alone can't be a clean "with vs without"
comparison (HNSW build is non-deterministic across runs, and
real_004 stochastics didn't trigger a shorthand recording at all),
which is why the unit-test witness exists.
Production note (in real_004_findings.md): LLM extraction is for
reality-test coverage of arbitrary query shapes. Production should
extract role at INGEST time (when the inbox parser already runs an
LLM) and pass already-resolved role through requests — same shape
as multi_coord_stress's existing Demand{Role: ...} model. The hot
path should never need the harness extractor's per-query LLM cost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Local Ollama has three embedding models loaded:
nomic-embed-text:latest 137M 768d (previous default)
nomic-embed-text-v2-moe:latest 475M 768d (this commit's default)
qwen3-embedding:latest 7.6B 4096d (would require dim change)
v2-moe is a drop-in upgrade — same 768 dim, 3.5× more params, MoE
architecture. The workers index doesn't need rebuilding; future
ingests simply embed with the stronger model.
Run #005 result on the multi-coord stress suite:
Diversity (same-role-across-contracts): 0.080 → 0.000 (n=9)
→ MoE is more discriminating: zero worker overlap across
Milwaukee / Indianapolis / Chicago for shared role names.
The geo + cert + skill context fully separates worker pools.
Different-roles-same-contract: 0.013 → 0.036 (still ~96% diff)
Determinism: 1.000 (unchanged)
Verbatim handover: 4/4 (100%)
Paraphrase handover: 4/4 (100%)
200-worker swap: Jaccard 0.000 (unchanged — still perfect)
Fresh-resume verify: STILL doesn't surface fresh workers in top-8.
With v2-moe, distances increased (top-1 = 0.43–0.65 vs v1's 0.25–0.39)
— the embedder is MORE discriminating, but the fresh worker's
vector still doesn't outrank the 8th-best existing worker. Now
suspected to be an HNSW post-build add issue (coder/hnsw
incremental adds can land in hard-to-reach graph regions), not an
embedder problem. A better embedder didn't fix it; needs a
different strategy: full index rebuild after fresh adds, or
explicit playbook-layer score boost for fresh workers, or
hybrid (keyword + semantic) retrieval. Phase 3 investigation.
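Of the three candidate strategies, the playbook-layer fresh boost is the easiest to sketch; the hit shape, the Fresh flag, and boostFresh below are hypothetical, not existing code:

```go
package main

import "sort"

// hit is a hypothetical retrieval result; Score is cosine distance
// (lower = closer), Fresh marks workers added after the last full
// HNSW build.
type hit struct {
	ID    string
	Score float64
	Fresh bool
}

// boostFresh subtracts a fixed delta from fresh workers' distances so a
// post-build add that landed in a hard-to-reach graph region can still
// crack top-K. Delta would need tuning against the 0.43-0.65 range above.
func boostFresh(hits []hit, delta float64) []hit {
	out := append([]hit(nil), hits...)
	for i := range out {
		if out[i].Fresh {
			out[i].Score -= delta
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Score < out[j].Score })
	return out
}
```

A full rebuild after fresh adds stays the more principled fix; the boost only papers over graph reachability.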
Cost: ingest is ~3-5× slower (workers 20s→100s; ethereal 35s→112s).
Acceptable for the quality jump on diversity. Real production with
incremental ingest won't pay this once-per-deploy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't
distinguish "Shape B surfaced a strictly-better answer" from "Shape B
shuffled ranks but quality is unchanged" from "Shape B replaced a good
answer with a wrong one." This commit adds Pass 4: judge warm top-1
with the same prompt as cold ratings, then bucket the comparison.
Implementation:
- New --with-rejudge driver flag (default off).
- New WITH_REJUDGE harness env (default 1, on for prod runs).
- queryRun gains WarmTop1Metadata (cached during Pass 2 for the
rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no
rejudge, 0..5 = rating).
- summary gains RejudgeAttempted, QualityLifted, QualityNeutral,
QualityRegressed (counts of warm-rating > / == / < cold-rating).
- Markdown headline gains a Quality block when rejudge ran.
- ~21 extra judge calls (~30s on qwen2.5).
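The bucketing itself reduces to a few lines; this sketch simplifies the queryRun/summary fields above into parallel slices:

```go
package main

// rejudgeSummary mirrors the summary counters described above.
type rejudgeSummary struct {
	Attempted, Lifted, Neutral, Regressed int
}

// bucketRejudge compares each warm top-1 rating to the cold rating.
// warm entries are *int: nil means no rejudge ran for that query, so
// it can never be confused with a genuine 0 rating.
func bucketRejudge(cold []int, warm []*int) rejudgeSummary {
	var s rejudgeSummary
	for i, w := range warm {
		if w == nil {
			continue
		}
		s.Attempted++
		switch {
		case *w > cold[i]:
			s.Lifted++
		case *w < cold[i]:
			s.Regressed++
		default:
			s.Neutral++
		}
	}
	return s
}
```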
Run #005 result (split inject threshold 0.20 + paraphrase + rejudge):
Quality lifted 5 / 21 (24%) — 3× +2 rating, 2× +1 rating
Quality neutral 13 / 21 (62%) — includes OOD queries holding 1
Quality regressed 3 / 21 (14%)
Net rating delta +3 across 21 queries (+0.14 average)
The 5 lifts were all rating-2 cold replaced with rating-3 or rating-4
warm — Shape B took mediocre matches and substituted substantively
better ones. The 3 regressions were -1, -1, and -3.
Q11 is the cautionary tale: cold top-1 "production line worker"
(rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator"
e-5729 (rating 1). Adjacent-domain cross-pollination — production
worker and forklift operator embed within 0.20 cosine because both
are warehouse-adjacent staffing queries, even though the judge
correctly distinguishes them. The split-threshold defense (0.5 boost
/ 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed
neutral at rating 1) but not adjacent-domain cross-pollination.
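A sketch of the split-threshold gate as described (the constants match this run's config; the function shape is illustrative):

```go
package main

// Cosine-distance ceilings for the two playbook actions. The gap
// exists because injection is riskier than re-ranking a result that
// regular retrieval already vetted.
const (
	boostThreshold  = 0.5
	injectThreshold = 0.20
)

// allowPlaybookHit is a hypothetical gate. Adjacent-domain pairs like
// "production line worker" vs "forklift operator" can embed inside
// 0.20, which is exactly how Q11 slipped through.
func allowPlaybookHit(dist float64, alreadyInTopK bool) bool {
	if alreadyInTopK {
		return dist <= boostThreshold
	}
	return dist <= injectThreshold
}
```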
Net product verdict: working, net-positive on quality, but the worst
case (Q11 4→1) is customer-visible and warrants a tighter inject
threshold OR an additional gate beyond cosine distance. Filed in
STATE_OF_PLAY OPEN as a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in Pass 3 to the lift driver: for each query whose Pass 1
recorded a playbook, ask the judge to rephrase the query, then re-query
with playbook=true and check whether the recorded answer surfaces in
top-K. This is the test the v1 report's caveat #3 explicitly flagged
as the actual learning-property gate (not the cheap verbatim case).
Implementation:
- New flag --with-paraphrase on the driver (default off).
- New WITH_PARAPHRASE env in the harness (default 1, on for prod runs).
- New paraphrase_* fields on queryRun + summary, with a `// 0` fallback
in jq so re-rendering verbatim-only evidence stays clean.
- generateParaphrase() calls the same judge model with format=json and
a tight schema; temperature=0.5 for variance without domain drift.
- Markdown report adds a paraphrase per-query table (only when the
pass ran) and an honesty caveat about judge-also-rephrases coupling.
Run #002 result (reports/reality-tests/playbook_lift_002.{json,md}):
Verbatim lift 2/2 (100% — Q7 + Q13, both stable from v1)
Paraphrase top-1 0/2
Paraphrase any-rank in K 0/2
Both paraphrases dropped the recorded answer OUT of top-K entirely
(rank=-1). This isn't a paraphrase-quality problem — qwen2.5's outputs
preserved intent ("Hazmat-certified warehouse worker comfortable with
cold storage" → "Warehouse worker with Hazmat certification and
experience in cold storage"). It's the v0 boost-only stance documented
in internal/matrix/playbook.go:22-27: the boost only re-ranks results
that ALREADY surfaced from regular retrieval. If paraphrase's cosine
retrieval doesn't include the recorded answer in top-K, no boost can
promote it.
The "Shape B" upgrade mentioned in the playbook.go comment — inject
playbook hits directly even when they weren't in the top-K — is what
would close this gap. The reality test surfaced exactly the gap the
docs warned about. Worth filing as the next product gate.
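Under that reading, the Shape B change is small; a hypothetical sketch (injection position and truncation policy are design choices the real gate would own):

```go
package main

// injectPlaybook is a hypothetical Shape-B upgrade: if the recorded
// answer already surfaced, leave re-ranking to the existing boost;
// otherwise force it into the candidate set so a paraphrase that
// shifts the cosine neighborhood can't drop it to rank -1.
func injectPlaybook(topK []string, recorded string, k int) []string {
	for _, id := range topK {
		if id == recorded {
			return topK
		}
	}
	out := append([]string{recorded}, topK...)
	if len(out) > k {
		out = out[:k]
	}
	return out
}
```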
Run-to-run variance also visible: v1 had 8 discoveries, v2 had 2.
HNSW insertion order + judge variance both contribute. Stability of
Q7 and Q13 across both runs (lifted in v1 AND v2) is the most reliable
signal in the dataset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-lineage scrum on b2e45f7 produced 1 convergent + 3 single-reviewer
findings worth fixing; all four were applied.
1. (Opus WARN + Qwen INFO convergent) scripts/playbook_lift.sh: replace
sleep 2.5 in SQL probe with active polling up to 5s. refresh_every=1s
is a lower bound; under load the manifest may not be visible in a
fixed sleep, which would 4xx the probe and abort the reality run.
2. (Opus INFO) scripts/playbook_lift.sh: report template glued
"env JUDGE_MODEL" + value as "env JUDGE_MODELqwen2.5:latest" with no
separator. Replaced two :+/:- substitution chains with a single
JUDGE_SOURCE variable computed once at the top of the harness.
3. (Opus INFO) scripts/staffing_workers/main.go: -id-prefix "" silently
allowed, defeating the flag's purpose (cross-corpus collision
prevention). Now log.Fatals at startup with an explicit hint.
4. (Opus WARN) cmd/{pathwayd,observerd}/main_test.go: newTestRouter
returned http.Handler then re-cast to chi.Router for chi.Walk.
Returning chi.Router directly satisfies http.Handler AND avoids an
assertion that would panic if future middleware wraps the router.
Dismissed (with rationale):
- Kimi INFO hardcoded MinIO endpoint: harness is local-by-design.
- Kimi WARN matrixd accepts 502/500: documented; real retriever needs
real upstreams the test doesn't spin up.
- Qwen INFO queryd strings.Contains: brittle but very low risk; rerouting
it through a typed-error path would add coupling without adding signal.
`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.
Verdicts at reports/scrum/_evidence/2026-04-30/verdicts/lift_001_*.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.
Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:
1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
wrong domain for staffing queries; replaced with ethereal_workers
(10K rows, real staffing schema, "e-" id prefix to avoid collision
with workers' "w-"). staffing_workers driver gains -index-name +
-id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running
~30s per judge call against the lift loop; reverted to
qwen2.5:latest (~1s/call, ~30× faster; lift results held)
Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go
so future drift fires in `go test`, not in a reality run. R-005 closed:
- cmd/matrixd/main_test.go (new) — playbook record drift detector +
score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode
`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.
Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
Queries 21 (staffing-domain, 7 categories)
Discoveries 8 (judge ≠ cosine top-1)
Lifts 7/8 (87.5%)
Boosts triggered 9
Mean Δ distance -0.053 (warm closer than cold)
OOD honesty dental/RN/SWE rated 1, no fake matches
Cross-corpus boosts confirmed (e- ↔ w- swaps in lifts)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3-lineage scrum (Opus 4.7 / Kimi K2.6 / Qwen3-coder) on today's wave
landed 4 real findings (2 BLOCK + 2 WARN) and 2 INFO touch-ups.
Verbatim verdicts + disposition table at:
reports/scrum/_evidence/2026-04-30/
B-1 (BLOCK Opus + INFO Kimi convergent) — ResolveKey API:
collapse from 3-arg (envVar, envFileName, envFilePath) to 2-arg
(envVar, envFilePath). Pre-fix every chatd caller passed the env
var name twice; if operator renamed *_key_env in lakehouse.toml
while keeping the canonical KEY= line in the .env file, fallback
silently missed.
B-2 (WARN Opus + WARN Kimi convergent) — handleProviders probe:
drop the synthesize-then-Resolve probe; look up by name directly
via Registry.Available(name). Prior probe synthesized "<name>/probe"
model strings and routed through Resolve, fragile to any future
routing rule (e.g. cloud-suffix special case).
B-3 (BLOCK Opus single — verified by trace + end-to-end probe) —
OllamaCloud.Chat StripPrefix used "cloud" but registry routes
"ollama_cloud/<m>". Result: upstream got the prefixed model name
and 400'd. Smoke missed it because chatd_smoke runs without
ollama_cloud registered. Now strips the right prefix; new
TestOllamaCloud_StripsCorrectPrefix locks both prefix + suffix
cases. Verified live: ollama_cloud/deepseek-v3.2 round-trips
cleanly through the real ollama.com endpoint.
B-4 (WARN Opus single) — Ollama finishReason: read done_reason
field instead of inferring from done bool alone. Newer Ollama
reports done=true with done_reason="length" on truncation; the
prior code mapped that to "stop" and lost the truncation signal
the playbook_lift judge needs to retry. New
TestFinishReasonFromOllama_PrefersDoneReason covers the fallback
ladder.
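The B-4 ladder reduces to a few lines (the JSON field names follow Ollama's chat response; the helper name here is illustrative):

```go
package main

// ollamaChunk holds the two completion-signal fields from an Ollama
// chat response.
type ollamaChunk struct {
	Done       bool   `json:"done"`
	DoneReason string `json:"done_reason"`
}

// finishReason prefers the explicit done_reason, so done=true with
// done_reason="length" keeps the truncation signal instead of being
// flattened to "stop".
func finishReason(c ollamaChunk) string {
	switch {
	case c.DoneReason != "":
		return c.DoneReason
	case c.Done:
		return "stop" // older Ollama: done with no reason reported
	default:
		return "" // still streaming
	}
}
```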
INFOs:
- B-5: replace hand-rolled insertion sort in Registry.Names with
sort.Strings (Opus called the "avoid sort import" comment a
false economy — correct).
- A-1: clarify the playbook_lift.sh comment around -judge "" arg
passing (Opus noted the comment said "env priority" but didn't
reflect that the empty arg also passes through the Go driver's
resolution chain).
False positives dismissed (3, documented in disposition.md):
- Kimi: TestMaybeDowngrade_WithConfigList wrong assertion (test IS
correct per design — model excluded from weak list = strong = downgrade)
- Qwen: nil-deref claim (defensive code already handles nil)
- Opus: qwen3.5:latest doesn't exist on Ollama hub (true on the
public hub but local install has it)
just verify: PASS. chatd_smoke 6/6 PASS. New regression tests:
3 (B-2, B-3, B-4 each get a focused test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
migrate the reality-test harness's judge-model default from a
hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge.
resolution priority: explicit -judge flag > $JUDGE_MODEL env >
cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback.
bumping the judge for run #N+1 now means editing one line in
lakehouse.toml [models].local_judge — no Go file or shell script
edits required.
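The chain itself is trivially expressible (function name and argument order are illustrative; the driver resolves each level from its flag, env, and config loader):

```go
package main

// resolveJudge walks the 4-step priority chain: explicit -judge flag,
// $JUDGE_MODEL env, cfg.Models.LocalJudge, then the hardcoded
// fallback. Empty string means "not set" at every level.
func resolveJudge(flagVal, envVal, cfgVal, fallback string) string {
	for _, v := range []string{flagVal, envVal, cfgVal} {
		if v != "" {
			return v
		}
	}
	return fallback
}
```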
changes:
- scripts/playbook_lift/main.go: -config flag added, judge default
flips to "" so resolution chain runs. Imports internal/shared for
config loader.
- scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash;
EFFECTIVE_JUDGE resolved by mirroring the Go chain (env > config
grep > qwen3.5:latest fallback). Used for the Ollama presence
check + report header. Pre-flight grep avoids requiring jq just
to read the toml.
- reports/reality-tests/README.md: documents the 4-step priority
chain.
verified all 4 paths produce the expected judge:
- config (no env): qwen3.5:latest (from lakehouse.toml)
- env override: env wins
- flag override: flag wins over env
- missing config: DefaultConfig fallback still gives qwen3.5:latest
just verify PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
rates top-K → record playbook entry pointing at the highest-rated
result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
ranking shift. Lift = real if recorded answer becomes top-1.
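The core metric falls out of three fields per query; a simplified sketch (the driver's real queryRun carries more state than this):

```go
package main

// queryOutcome is a pared-down view of one query across both passes.
type queryOutcome struct {
	ColdTop1 string // Pass 1 closest-by-distance result
	Recorded string // Pass 1 highest-judge-rated result (playbook entry)
	WarmTop1 string // Pass 2 top-1 with use_playbook=true
}

// countLift: a "discovery" is a query where the judge disagreed with
// cosine top-1; the lift is real only when the playbook promotes the
// judge's pick to warm top-1.
func countLift(runs []queryOutcome) (discoveries, lifts int) {
	for _, r := range runs {
		if r.Recorded == r.ColdTop1 {
			continue
		}
		discoveries++
		if r.WarmTop1 == r.Recorded {
			lifts++
		}
	}
	return
}
```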
Files:
- scripts/playbook_lift/main.go driver (391 LoC)
- scripts/playbook_lift.sh stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt query corpus (5 placeholders;
J writes real 20+)
- reports/reality-tests/README.md framework + interpretation
- .gitignore: track reports/reality-tests/ but ignore per-run JSON
evidence
This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.
Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>