reports/reality-tests — does the 5-loop substrate actually work?

Reality tests measure product outcomes, not substrate health. The 21 smokes prove the system runs; the proof harness verifies the claims the system makes; reality tests answer the remaining question: does the small-model pipeline + matrix indexer + playbook give measurably better results than raw cosine?

This is the gate from project_small_model_pipeline_vision.md: "the playbook + matrix indexer must give the results we're looking for." That is the single load-bearing criterion. Throughput, scaling, and code elegance are secondary.

Latest findings: Run #003 → Run #004 (split inject threshold)

Run #003 surfaced rampant cross-pollination: Q2's "OSHA-30 forklift Wisconsin" recording (w-4435) became warm top-1 for Q19 (dental hygienist), Q20 (RN), Q21 (software engineer), and six other unrelated staffing queries.

Cause: InjectPlaybookMisses inherited the same DefaultPlaybookMaxDistance (0.5) as the boost path, but inject is structurally riskier than boost. Boost only re-ranks results that already retrieved on their own merits; inject forces a result into top-K, so a loose match cross-pollinates wrong-domain answers.

Empirical motivation from v3:

- Implied playbook hit distances for the cross-pollinated cases: 0.20-0.46.
- Implied distances for the 6/6 paraphrase recoveries: 0.23-0.30.
- A threshold of 0.20 should keep most paraphrases and kill the OOD bleed.

Implementation (a minimal sketch of the guard appears at the end of these notes):

- New DefaultPlaybookMaxInjectDistance = 0.20 in playbook.go.
- New PlaybookMaxInjectDistance field on SearchRequest (per-request override).
- InjectPlaybookMisses gains a maxInjectDist parameter; hits whose Distance exceeds it are skipped (the boost path may still re-rank them).
- TestInjectPlaybookMisses_RespectsInjectThreshold locks the contract with one tight + one loose hit, asserting that only the tight one injects.
- Existing tests pass an explicit threshold (0 = default for the tight tests; 0.5 for the dedupe test, which uses 0.30 hits).

Run #004, identical queries, split threshold:

| Metric | Run #004 |
|---|---|
| Verbatim discovery | 8 (vs v3's 6 — judge variance, a separate issue) |
| Verbatim lift | 6 / 8 (75%) |
| Paraphrase top-1 | 6 / 8 (75%) |
| Paraphrase any-rank in K | 6 / 8 |

OOD queries Q19/Q20/Q21 all show warm top-1 = cold top-1 (no injection) — cross-pollination is eliminated where it pointed the wrong direction. Mean Δ top-1 distance dropped from -0.164 (v3, distorted) to -0.071 (v4, comparable to v1's -0.053).

The two paraphrases missed in v4 (Q9, Q15) were ones where qwen2.5 rephrased liberally enough to drift past 0.20 — Q9: "Inventory specialist..." → "Individual needed for inventory management..."; Q15: "Engaged warehouse associate..." → "Warehouse associate currently engaged with a robust history...". Refusing to inject when the match is not confident is the right product behavior; the boost path still re-ranks recorded answers when they appear in regular retrieval.

The Q6 ↔ Q7 cross-pollination ("Forklift-certified loader" ↔ "Hazmat warehouse worker") is legitimate: these are genuinely similar staffing queries, and the judge ranks both directions as plausible.
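To make the guard concrete, here is a minimal Go sketch of the split threshold. Only InjectPlaybookMisses, the two distance constants, the Distance field, and the "0 means default" convention come from the notes above; the PlaybookHit type, the exact signature, and the helper are assumptions for illustration, not the real code.

```go
package playbook

// Constants named in the run notes: the boost path keeps the loose 0.5
// ceiling because it only re-ranks; inject gets the tight 0.20 ceiling
// because it forces results into top-K.
const (
	DefaultPlaybookMaxDistance       = 0.5
	DefaultPlaybookMaxInjectDistance = 0.20
)

// PlaybookHit is an illustrative stand-in for the real hit type.
type PlaybookHit struct {
	ID       string
	Distance float64 // embedding distance between query and recorded query
}

// InjectPlaybookMisses forces playbook hits into topK, but only hits
// tight enough to trust. Loose hits are skipped here; the boost path may
// still re-rank them when they retrieve on their own merits.
func InjectPlaybookMisses(topK, hits []PlaybookHit, maxInjectDist float64) []PlaybookHit {
	if maxInjectDist == 0 {
		maxInjectDist = DefaultPlaybookMaxInjectDistance // 0 = use the default
	}
	for _, h := range hits {
		if h.Distance > maxInjectDist {
			continue // too loose to inject: this is what stops cross-pollination
		}
		if !containsID(topK, h.ID) {
			topK = append(topK, h)
		}
	}
	return topK
}

func containsID(rs []PlaybookHit, id string) bool {
	for _, r := range rs {
		if r.ID == id {
			return true
		}
	}
	return false
}
```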
What lives here
Each reality test is a numbered run that produces:
- `<test>_<NNN>.json` — raw structured evidence (per-query data, summary metrics); a hedged sketch of the record shape follows below
- `<test>_<NNN>.md` — human-readable report with headline metrics, per-query table, honesty caveats, next moves
Runs are append-only. Earlier runs stay in tree as historical baseline.
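For orientation, a sketch of what one per-query record in the `.json` evidence might look like, expressed as a Go struct. Every field name here is hypothetical, inferred from the metrics the reports discuss; check a real run file for the actual schema.

```go
package reality

// QueryResult is a hypothetical shape for one per-query record in
// <test>_<NNN>.json. Field names are illustrative, inferred from the
// metrics the run reports cite, not the actual schema.
type QueryResult struct {
	Query     string  `json:"query"`
	ColdTop1  string  `json:"cold_top1"`  // raw-cosine winner, playbook off
	JudgeBest string  `json:"judge_best"` // the LLM judge's preferred candidate
	WarmTop1  string  `json:"warm_top1"`  // winner after playbook record + re-run
	Discovery bool    `json:"discovery"`  // judge_best != cold_top1
	Lifted    bool    `json:"lifted"`     // warm_top1 == judge_best
	DeltaTop1 float64 `json:"delta_top1_distance"` // warm top-1 distance minus cold
}
```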
Test catalog
playbook_lift_<NNN> — does the playbook actually lift the right answer?
Driver: scripts/playbook_lift.sh → bin/playbook_lift
Queries: tests/reality/playbook_lift_queries.txt
Pipeline: cold pass → LLM judge → playbook record → warm pass → measure ranking shift.
The headline question: when the LLM judge finds a better answer than cosine top-1, can the playbook boost it to top-1 on the next run? If yes, the learning loop closes; if no, the matrix layer + playbook is infrastructure for a thesis that doesn't pay rent.
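In outline, the driver's measurement loop looks something like the sketch below. All helper names are hypothetical stand-ins for whatever bin/playbook_lift actually calls; only the pipeline order (cold → judge → record → warm) comes from this README.

```go
package reality

// Hypothetical stand-ins for the real driver's dependencies.
type result struct{ ID string }

func search(q string, k int) []result       { return nil } // cosine; playbook active on warm pass
func judgeBest(q string, c []result) result { return result{} }
func recordPlaybook(q string, best result)  {}
func report(q string, lifted bool)          {}

// runLift measures, per query, whether recording the judge's pick
// lifts it to warm top-1 on the next pass.
func runLift(queries []string, k int) {
	for _, q := range queries {
		cold := search(q, k) // cold pass: raw cosine, no playbook
		if len(cold) == 0 {
			continue
		}
		best := judgeBest(q, cold) // judge picks the best of the cold top-K
		if best.ID == cold[0].ID {
			continue // no discovery: cosine top-1 already matches the judge
		}
		recordPlaybook(q, best) // record the judged answer in the playbook
		warm := search(q, k)    // warm pass: playbook boost/inject active
		report(q, len(warm) > 0 && warm[0].ID == best.ID)
	}
}
```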
See the run reports for honesty caveats — chiefly that the LLM judge IS the ground-truth proxy.
Running a reality test
```sh
# Defaults: judge resolved from lakehouse.toml [models].local_judge,
# workers limit 5000, run id 001
./scripts/playbook_lift.sh

# Re-run with a different judge to check inter-judge agreement
# (env JUDGE_MODEL overrides the config tier)
JUDGE_MODEL=qwen3:latest RUN_ID=002 ./scripts/playbook_lift.sh

# Smaller scale for fast iteration
WORKERS_LIMIT=1000 K=5 RUN_ID=dev ./scripts/playbook_lift.sh
```
Judge resolution priority (Phase 3, 2026-04-29):
1. `-judge` flag on the Go driver (explicit override)
2. `JUDGE_MODEL` env var (operator override)
3. `lakehouse.toml [models].local_judge` (default)
4. Hardcoded `qwen3.5:latest` (last-resort fallback if config missing)
This means model bumps land in `lakehouse.toml`, not in this script or the Go driver. Bumping `local_judge` to a stronger local model (e.g. when qwen4 ships) takes one line.
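The resolution chain is small enough to sketch directly. The signature is an assumption (flag parsing and config loading are elided); the priority order and the fallback string come from the list above.

```go
package reality

import "os"

// resolveJudge applies the four-step priority chain from this README.
// flagJudge is the parsed -judge flag value; configJudge is whatever the
// config loader read from lakehouse.toml [models].local_judge.
func resolveJudge(flagJudge, configJudge string) string {
	if flagJudge != "" {
		return flagJudge // 1. -judge flag: explicit override
	}
	if env := os.Getenv("JUDGE_MODEL"); env != "" {
		return env // 2. JUDGE_MODEL env var: operator override
	}
	if configJudge != "" {
		return configJudge // 3. lakehouse.toml default
	}
	return "qwen3.5:latest" // 4. last-resort fallback if config missing
}
```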
Requires: Ollama on :11434 with `nomic-embed-text` + the resolved judge model loaded. Skips cleanly (exit 0) if Ollama is absent.
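A sketch of that skip-if-absent behavior, assuming a plain liveness probe of GET / on :11434 (Ollama answers with a short status string when running); the function name and timeout are illustrative.

```go
package reality

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// skipIfNoOllama probes the local Ollama daemon and exits 0 when it is
// unreachable, so environments without Ollama pass cleanly rather than fail.
func skipIfNoOllama() {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://127.0.0.1:11434/")
	if err != nil {
		fmt.Println("SKIP: Ollama not reachable on :11434")
		os.Exit(0) // skip cleanly, as the README promises
	}
	resp.Body.Close()
}
```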
Interpreting results
Three thresholds matter on the playbook_lift tests:
| Lift rate (lifts / discoveries) | Verdict |
|---|---|
| ≥ 50% | Loop closes — playbook is doing real work, move to paraphrase queries |
| 20-50% | Lift exists but inconsistent — investigate boost math (score × 0.5) or judge variance |
| < 20% | Loop is not pulling its weight — diagnose before adding more components |
A separate concern: discovery rate (cold judge-best ≠ cold top-1). If discovery is itself rare (< 30% of queries), cosine is already close to optimal on this query distribution and the matrix+playbook layer has little headroom. That's not necessarily a bug — but it means the value gate has to come from somewhere else (multi-corpus retrieval, domain-specific tags, drift signal).
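Transcribed into code, the two checks look like this; the function names and verdict strings are illustrative paraphrases of the table and the paragraph above, not canonical tool output.

```go
package reality

// liftVerdict buckets a run by the thresholds in the table above.
func liftVerdict(lifts, discoveries int) string {
	if discoveries == 0 {
		return "no discoveries: nothing for the playbook to lift"
	}
	rate := float64(lifts) / float64(discoveries)
	switch {
	case rate >= 0.50:
		return "loop closes: move to paraphrase queries"
	case rate >= 0.20:
		return "inconsistent: investigate boost math (score × 0.5) or judge variance"
	default:
		return "not pulling its weight: diagnose before adding components"
	}
}

// lowHeadroom flags the separate concern: when cold judge-best equals
// cold top-1 on most queries, cosine alone is near-optimal here and the
// matrix+playbook layer has little room to help.
func lowHeadroom(discoveries, queries int) bool {
	return queries > 0 && float64(discoveries)/float64(queries) < 0.30
}
```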
What this is not
- Not a benchmark. No comparison against external systems; only internal cold-vs-warm.
- Not a regression gate. Each run is a snapshot. Scores will drift with corpus changes, judge updates, and playbook math tuning. Don't wire `just verify` to demand a minimum lift.
- Not human-validated. The LLM judge is the ground-truth proxy. Sample 5-10 verdicts manually per run to sanity-check that the judge isn't pathological.