# reports/reality-tests — does the 5-loop substrate actually work?

Reality tests measure **product outcomes**, not substrate health. The 21 smokes prove the system *runs*; the proof harness proves the system *makes good on the claims it makes*; reality tests answer: **does the small-model pipeline + matrix indexer + playbook give measurably better results than raw cosine?**

This is the gate from `project_small_model_pipeline_vision.md`: *"the playbook + matrix indexer must give the results we're looking for."* Single load-bearing criterion. Throughput, scaling, code elegance are secondary.

---
## What lives here

Each reality test is a numbered run that produces:

- `<test>_<NNN>.json` — raw structured evidence (per-query data, summary metrics)
- `<test>_<NNN>.md` — human-readable report with headline metrics, per-query table, honesty caveats, next moves

Runs are append-only. Earlier runs stay in tree as historical baseline.

---
## Test catalog

### `playbook_lift_<NNN>` — does the playbook actually lift the right answer?

**Driver:** `scripts/playbook_lift.sh` → `bin/playbook_lift`
**Queries:** `tests/reality/playbook_lift_queries.txt`
**Pipeline:** cold pass → LLM judge → playbook record → warm pass → measure ranking shift.

The headline question: **when the LLM judge finds a better answer than cosine top-1, can the playbook boost it to top-1 on the next run?** If yes, the learning loop closes; if no, the matrix layer + playbook is infrastructure for a thesis that doesn't pay rent.

See the run reports for honesty caveats — chiefly that the LLM judge IS the ground-truth proxy.

---
## Running a reality test

```bash
# Defaults: judge resolved from lakehouse.toml [models].local_judge,
# workers limit 5000, run id 001
./scripts/playbook_lift.sh

# Re-run with a different judge to check inter-judge agreement
# (env JUDGE_MODEL overrides the config tier)
JUDGE_MODEL=qwen3:latest RUN_ID=002 ./scripts/playbook_lift.sh

# Smaller scale for fast iteration
WORKERS_LIMIT=1000 K=5 RUN_ID=dev ./scripts/playbook_lift.sh
```
**Judge resolution priority** (Phase 3, 2026-04-29):

1. `-judge` flag on the Go driver (explicit override)
2. `JUDGE_MODEL` env var (operator override)
3. `lakehouse.toml [models].local_judge` (default)
4. Hardcoded `qwen3.5:latest` (last-resort fallback if config missing)

This means model bumps land in `lakehouse.toml`, not in this script or the Go driver. Bumping `local_judge` to a stronger local model (e.g. when qwen4 ships) takes one line.

Requires: Ollama on `:11434` with `nomic-embed-text` + the resolved judge model loaded. Skips cleanly (exit 0) if Ollama is absent.
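The script-side half of that chain can be sketched in a few lines of bash. This is a minimal illustration, not the actual `playbook_lift.sh` internals: the `resolve_judge` helper and the temp-file config path are assumptions, and tier 1 (the `-judge` flag) is omitted because it lives in the Go driver. It does show the load-bearing trick — a plain grep of the toml, so the pre-flight check needs no jq.

```shell
#!/usr/bin/env bash
# Illustrative sketch of tiers 2-4 of the judge resolution chain:
# env > config grep > hardcoded fallback. Names/paths are hypothetical.
set -euo pipefail

# Stand-in for lakehouse.toml, written to a temp file for the demo.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
[models]
local_judge = "qwen3.5:latest"
EOF

resolve_judge() {
  # Tier 2: operator override via env.
  if [ -n "${JUDGE_MODEL:-}" ]; then
    echo "$JUDGE_MODEL"
    return
  fi
  # Tier 3: plain grep of the toml -- no jq needed just to read one key.
  local from_cfg
  from_cfg=$(grep '^[[:space:]]*local_judge[[:space:]]*=' "$cfg" \
    | sed -E 's/.*"([^"]+)".*/\1/' || true)
  if [ -n "$from_cfg" ]; then
    echo "$from_cfg"
    return
  fi
  # Tier 4: last-resort hardcoded fallback.
  echo "qwen3.5:latest"
}

echo "config tier: $(resolve_judge)"   # qwen3.5:latest (from the toml)
JUDGE_MODEL=qwen3:latest
echo "env tier:    $(resolve_judge)"   # qwen3:latest (env wins)
```

If the config file is missing, the grep tier yields nothing and tier 4 still returns `qwen3.5:latest`, matching the documented fallback behaviour.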
---
## Interpreting results

Three thresholds matter on the `playbook_lift` tests:

| Lift rate (lifts / discoveries) | Verdict |
|---|---|
| ≥ 50% | Loop closes — playbook is doing real work, move to paraphrase queries |
| 20-50% | Lift exists but inconsistent — investigate boost math (`score × 0.5`) or judge variance |
| < 20% | Loop is not pulling its weight — diagnose before adding more components |

A separate concern: **discovery rate** (cold judge-best ≠ cold top-1). If discovery is itself rare (< 30% of queries), cosine is already close to optimal on this query distribution and the matrix+playbook layer has little headroom. That's not necessarily a bug — but it means the value gate has to come from somewhere else (multi-corpus retrieval, domain-specific tags, drift signal).
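As a worked example of the thresholds, here is the arithmetic with made-up counts; real values come from the run's `.json` summary metrics, and the variable names are illustrative:

```shell
# Hypothetical counts for one run (not real data).
discoveries=12   # queries where cold judge-best != cold top-1
lifts=7          # of those, boosted to top-1 on the warm pass

rate=$(( 100 * lifts / discoveries ))   # integer percent: 58
echo "lift rate: ${rate}%"

# Map the rate onto the verdict tiers from the table above.
if   [ "$rate" -ge 50 ]; then echo "verdict: loop closes"
elif [ "$rate" -ge 20 ]; then echo "verdict: lift exists but inconsistent"
else                          echo "verdict: loop not pulling its weight"
fi
```

Note the denominator is discoveries, not total queries: a run with few discoveries can post a high lift rate while the playbook still has little headroom overall.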
---
## What this is not

- **Not a benchmark.** No comparison against external systems; only internal cold-vs-warm.
- **Not a regression gate.** Each run is a snapshot. Scores will drift with corpus changes, judge updates, and playbook math tuning. Don't wire `just verify` to demand a minimum lift.
- **Not human-validated.** The LLM judge is the ground-truth proxy. Sample 5-10 verdicts manually per run to sanity-check the judge isn't pathological.