root 3dd7d9fe30 reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
  rates top-K → record playbook entry pointing at the highest-rated
  result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
  ranking shift. Lift = real if recorded answer becomes top-1.

Files:
- scripts/playbook_lift/main.go         driver (391 LoC)
- scripts/playbook_lift.sh              stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt  query corpus (5 placeholders;
                                            J writes real 20+)
- reports/reality-tests/README.md       framework + interpretation
- .gitignore                            track reports/reality-tests/
                                        but ignore per-run JSON evidence

This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.

Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:22:36 -05:00

# reports/reality-tests — does the 5-loop substrate actually work?
Reality tests measure **product outcomes**, not substrate health. The 21 smokes prove the system *runs*; the proof harness proves the system *backs the claims it makes*; reality tests answer: **does the small-model pipeline + matrix indexer + playbook give measurably better results than raw cosine?**
This is the gate from `project_small_model_pipeline_vision.md`: *"the playbook + matrix indexer must give the results we're looking for."* Single load-bearing criterion. Throughput, scaling, and code elegance are secondary.
---
## What lives here
Each reality test is a numbered run that produces:
- `<test>_<NNN>.json` — raw structured evidence (per-query data, summary metrics)
- `<test>_<NNN>.md` — human-readable report with headline metrics, per-query table, honesty caveats, next moves
Runs are append-only. Earlier runs stay in tree as historical baseline.
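To make the shape of that evidence concrete, here is a rough Go sketch of what one run record could deserialize into; the field names are illustrative guesses, not the driver's actual schema:
```go
// Hypothetical shape of <test>_<NNN>.json. Field names are illustrative,
// not taken from the driver.
type QueryResult struct {
	Query     string `json:"query"`
	ColdTop1  string `json:"cold_top1"`  // best hit by raw cosine
	JudgeBest string `json:"judge_best"` // hit the LLM judge preferred
	WarmTop1  string `json:"warm_top1"`  // best hit with use_playbook=true
	Discovery bool   `json:"discovery"`  // judge_best != cold_top1
	Lifted    bool   `json:"lifted"`     // discovery && warm_top1 == judge_best
}

type RunSummary struct {
	RunID         string        `json:"run_id"`
	JudgeModel    string        `json:"judge_model"`
	Queries       []QueryResult `json:"queries"`
	Discoveries   int           `json:"discoveries"`
	Lifts         int           `json:"lifts"`
	DiscoveryRate float64       `json:"discovery_rate"` // discoveries / len(queries)
	LiftRate      float64       `json:"lift_rate"`      // lifts / discoveries
}
```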
---
## Test catalog
### `playbook_lift_<NNN>` — does the playbook actually lift the right answer?
**Driver:** `scripts/playbook_lift.sh` → `bin/playbook_lift`
**Queries:** `tests/reality/playbook_lift_queries.txt`
**Pipeline:** cold pass → LLM judge → playbook record → warm pass → measure ranking shift.
The headline question: **when the LLM judge finds a better answer than cosine top-1, can the playbook boost it to top-1 on the next run?** If yes, the learning loop closes; if no, the matrix layer + playbook is infrastructure for a thesis that doesn't pay rent.
See the run reports for honesty caveats — chiefly that the LLM judge IS the ground-truth proxy.
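In rough Go terms the per-query flow looks like the sketch below; `matrixSearch`, `judgeRate`, and `recordPlaybook` are stand-ins for the driver's real calls, not its actual API:
```go
// Hit is a stand-in for one search result.
type Hit struct {
	ID    string
	Score float64
}

// Placeholders for the real calls against matrix.search, the small-model
// judge, and the playbook recorder.
func matrixSearch(q string, k int, usePlaybook bool) ([]Hit, error) { return nil, nil }
func judgeRate(q string, hits []Hit) (Hit, error)                   { return Hit{}, nil }
func recordPlaybook(q, id string) error                             { return nil }

// runQuery walks one query through cold pass → judge → record → warm pass.
func runQuery(q string, k int) (discovery, lifted bool, err error) {
	// Pass 1 (cold): raw cosine ranking, playbook off.
	cold, err := matrixSearch(q, k, false)
	if err != nil || len(cold) == 0 {
		return false, false, err
	}
	// Small-model judge rates the top-K and picks a favorite.
	best, err := judgeRate(q, cold)
	if err != nil {
		return false, false, err
	}
	// Discovery: the judge preferred something other than cosine top-1.
	if best.ID == cold[0].ID {
		return false, false, nil
	}
	// Record a playbook entry pointing at the judge's pick.
	if err := recordPlaybook(q, best.ID); err != nil {
		return true, false, err
	}
	// Pass 2 (warm): same query with the playbook enabled.
	warm, err := matrixSearch(q, k, true)
	if err != nil || len(warm) == 0 {
		return true, false, err
	}
	// Lift: the recorded answer is now top-1.
	return true, warm[0].ID == best.ID, nil
}
```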
---
## Running a reality test
```bash
# Defaults: judge=qwen3.5:latest, workers limit 5000, run id 001
./scripts/playbook_lift.sh
# Re-run with a different judge to check inter-judge agreement
JUDGE_MODEL=qwen2.5:latest RUN_ID=002 ./scripts/playbook_lift.sh
# Smaller scale for fast iteration
WORKERS_LIMIT=1000 K=5 RUN_ID=dev ./scripts/playbook_lift.sh
```
Requires: Ollama on `:11434` with `nomic-embed-text` + the chosen judge model loaded. Skips cleanly (exit 0) if Ollama is absent.
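The skip amounts to a reachability probe before any work starts. A minimal sketch of the idea, assuming the stock Ollama `/api/tags` endpoint (the real script may check differently):
```go
package lift

import (
	"net/http"
	"time"
)

// ollamaUp reports whether an Ollama server answers on baseURL.
// Probing GET /api/tags is one cheap liveness check on a stock Ollama
// install; the real harness may decide the skip differently.
func ollamaUp(baseURL string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(baseURL + "/api/tags")
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}
```
A failed probe maps to exit 0 rather than a failure, which is what keeps unattended runs green on machines without Ollama.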
---
## Interpreting results
Three thresholds matter on the `playbook_lift` tests:
| Lift rate (lifts / discoveries) | Verdict |
|---|---|
| ≥ 50% | Loop closes — playbook is doing real work, move to paraphrase queries |
| 20-50% | Lift exists but inconsistent — investigate boost math (`score × 0.5`) or judge variance |
| < 20% | Loop is not pulling its weight; diagnose before adding more components |
A separate concern: **discovery rate** (cold judge-best ≠ cold top-1). If discovery is itself rare (< 30% of queries), cosine is already close to optimal on this query distribution and the matrix+playbook layer has little headroom. That's not necessarily a bug, but it means the value gate has to come from somewhere else (multi-corpus retrieval, domain-specific tags, drift signal).
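In code, the two rates are a few lines; this sketch uses hypothetical field names, and only the arithmetic comes from the definitions above:
```go
// perQuery carries just the two flags the headline metrics need.
type perQuery struct {
	Discovery bool // judge-best differed from cold top-1
	Lifted    bool // recorded answer became warm top-1
}

// rates returns discovery rate (discoveries / queries) and lift rate
// (lifts / discoveries). Lift rate only means anything when discoveries > 0.
func rates(results []perQuery) (discoveryRate, liftRate float64) {
	var discoveries, lifts int
	for _, r := range results {
		if r.Discovery {
			discoveries++
			if r.Lifted {
				lifts++
			}
		}
	}
	if len(results) > 0 {
		discoveryRate = float64(discoveries) / float64(len(results))
	}
	if discoveries > 0 {
		liftRate = float64(lifts) / float64(discoveries)
	}
	return discoveryRate, liftRate
}
```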
---
## What this is not
- **Not a benchmark.** No comparison against external systems; only internal cold-vs-warm.
- **Not a regression gate.** Each run is a snapshot. Scores will drift with corpus changes, judge updates, and playbook math tuning. Don't wire `just verify` to demand a minimum lift.
- **Not human-validated.** The LLM judge is the ground truth proxy. Sample 5-10 verdicts manually per run to sanity-check the judge isn't pathological.