root 848cbf5fef phase 3: playbook_lift harness reads judge from config
migrate the reality-test harness's judge-model default from a
hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge.

resolution priority: explicit -judge flag > $JUDGE_MODEL env >
cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback.

bumping the judge for run #N+1 now means editing one line in
lakehouse.toml [models].local_judge — no Go file or shell script
edits required.

changes:
- scripts/playbook_lift/main.go: -config flag added; judge default
  flips to "" so the resolution chain runs. Imports internal/shared
  for the config loader.
- scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash;
  EFFECTIVE_JUDGE is resolved by a mirror of the Go chain (env >
  config grep > qwen3.5:latest fallback) and used for the Ollama
  presence check + report header. The pre-flight grep avoids
  requiring jq just to read the TOML.
- reports/reality-tests/README.md: documents the 4-step priority
  chain.

verified all 4 paths produce the expected judge:
- config (no env): qwen3.5:latest (from lakehouse.toml)
- env override:    env wins
- flag override:   flag wins over env
- missing config:  DefaultConfig fallback still gives qwen3.5:latest

just verify PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:57:28 -05:00

# reports/reality-tests — does the 5-loop substrate actually work?
Reality tests measure **product outcomes**, not substrate health. The 21 smokes prove the system *runs*; the proof harness proves the system *does what it claims*; reality tests answer: **does the small-model pipeline + matrix indexer + playbook give measurably better results than raw cosine?**
This is the gate from `project_small_model_pipeline_vision.md`: *"the playbook + matrix indexer must give the results we're looking for."* Single load-bearing criterion. Throughput, scaling, code elegance are secondary.
---
## What lives here
Each reality test is a numbered run that produces:
- `<test>_<NNN>.json` — raw structured evidence (per-query data, summary metrics)
- `<test>_<NNN>.md` — human-readable report with headline metrics, per-query table, honesty caveats, next moves
Runs are append-only. Earlier runs stay in tree as historical baseline.
---
## Test catalog
### `playbook_lift_<NNN>` — does the playbook actually lift the right answer?
**Driver:** `scripts/playbook_lift.sh` → `bin/playbook_lift`
**Queries:** `tests/reality/playbook_lift_queries.txt`
**Pipeline:** cold pass → LLM judge → playbook record → warm pass → measure ranking shift.
The headline question: **when the LLM judge finds a better answer than cosine top-1, can the playbook boost it to top-1 on the next run?** If yes, the learning loop closes; if no, the matrix layer + playbook is infrastructure for a thesis that doesn't pay rent.
See the run reports for honesty caveats — chiefly that the LLM judge IS the ground-truth proxy.
---
## Running a reality test
```bash
# Defaults: judge resolved from lakehouse.toml [models].local_judge,
# workers limit 5000, run id 001
./scripts/playbook_lift.sh
# Re-run with a different judge to check inter-judge agreement
# (env JUDGE_MODEL overrides the config tier)
JUDGE_MODEL=qwen3:latest RUN_ID=002 ./scripts/playbook_lift.sh
# Smaller scale for fast iteration
WORKERS_LIMIT=1000 K=5 RUN_ID=dev ./scripts/playbook_lift.sh
```
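To confirm which judge a run actually resolved, check the run's report header, which records the effective judge. A hedged one-liner (the report path follows the `<test>_<NNN>.md` naming convention above; the header's exact wording is an assumption):
```bash
# The effective judge is recorded in the report header; grep it back out.
grep -i "judge" reports/reality-tests/playbook_lift_002.md | head -n 1
```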
**Judge resolution priority** (Phase 3, 2026-04-29):
1. `-judge` flag on the Go driver (explicit override)
2. `JUDGE_MODEL` env var (operator override)
3. `lakehouse.toml [models].local_judge` (default)
4. Hardcoded `qwen3.5:latest` (last-resort fallback if config missing)
This means model bumps land in `lakehouse.toml`, not in this script or
the Go driver. Bumping `local_judge` to a stronger local model (e.g.
when qwen4 ships) takes one line.
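The shell side of this chain stays jq-free. A minimal sketch of how `scripts/playbook_lift.sh` might resolve `EFFECTIVE_JUDGE` (the `sed` pattern and file path are assumptions; only the env > config grep > hardcoded-fallback order comes from the Phase 3 commit):
```bash
# Tier 2: operator override via env var.
EFFECTIVE_JUDGE="${JUDGE_MODEL:-}"

# Tier 3: pull [models].local_judge out of lakehouse.toml with sed,
# so reading the config needs no jq or TOML tooling pre-flight.
if [ -z "$EFFECTIVE_JUDGE" ] && [ -f lakehouse.toml ]; then
  EFFECTIVE_JUDGE="$(sed -n 's/^[[:space:]]*local_judge[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' lakehouse.toml | head -n1)"
fi

# Tier 4: hardcoded last resort, matching the Go driver's fallback.
EFFECTIVE_JUDGE="${EFFECTIVE_JUDGE:-qwen3.5:latest}"
```
Tier 1 (the `-judge` flag) lives in the Go driver, so the script only mirrors tiers 2-4.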
Requires: Ollama on `:11434` with `nomic-embed-text` + the resolved judge
model loaded. Skips cleanly (exit 0) if Ollama is absent.
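The skip behaviour can be a single pre-flight probe. A hedged sketch (Ollama's `/api/tags` endpoint lists locally available models; whether the script uses exactly this call is an assumption):
```bash
# Pre-flight: if Ollama isn't listening on :11434, skip rather than fail,
# so the harness exits 0 on machines without a local model server.
if ! curl -fsS http://localhost:11434/api/tags >/dev/null 2>&1; then
  echo "ollama not reachable on :11434; skipping reality test" >&2
  exit 0
fi
```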
---
## Interpreting results
Three thresholds matter on the `playbook_lift` tests:
| Lift rate (lifts / discoveries) | Verdict |
|---|---|
| ≥ 50% | Loop closes — playbook is doing real work, move to paraphrase queries |
| 20-50% | Lift exists but inconsistent — investigate boost math (`score × 0.5`) or judge variance |
| < 20% | Loop is not pulling its weight; diagnose before adding more components |
A separate concern: **discovery rate** (cold judge-best ≠ cold top-1). If discovery is itself rare (< 30% of queries), cosine is already close to optimal on this query distribution and the matrix+playbook layer has little headroom. That's not necessarily a bug, but it means the value gate has to come from somewhere else (multi-corpus retrieval, domain-specific tags, drift signal).
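A toy worked example of how the two rates interact (all numbers invented for illustration):
```bash
# 50 queries; the judge disagrees with cold top-1 on 15 (discovery rate 30%),
# and the warm pass promotes the judge's pick to top-1 on 9 of those 15.
queries=50; discoveries=15; lifts=9
echo "discovery rate: $(( discoveries * 100 / queries ))%"   # 30%: headroom exists
echo "lift rate:      $(( lifts * 100 / discoveries ))%"     # 60%: loop closes
```
Even a perfect 100% lift rate on those 15 discoveries would improve only 15 of 50 queries, which is why a low discovery rate caps the layer's headroom regardless of lift rate.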
---
## What this is not
- **Not a benchmark.** No comparison against external systems; only internal cold-vs-warm.
- **Not a regression gate.** Each run is a snapshot. Scores will drift with corpus changes, judge updates, and playbook math tuning. Don't wire `just verify` to demand a minimum lift.
- **Not human-validated.** The LLM judge is the ground truth proxy. Sample 5-10 verdicts manually per run to sanity-check the judge isn't pathological; see the sketch below.
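For that manual sanity check, something like the following pulls a random sample out of the raw JSON evidence (the `.queries` field name is a guess at the per-query array; adjust to the actual JSON shape):
```bash
# Sample 5 random per-query judge verdicts from a run's raw evidence file.
jq -c '.queries[]' reports/reality-tests/playbook_lift_001.json | shuf -n 5
```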