root 3dd7d9fe30 reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
  rates top-K → record playbook entry pointing at the highest-rated
  result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
  ranking shift. Lift = real if recorded answer becomes top-1.

Files:
- scripts/playbook_lift/main.go         driver (391 LoC)
- scripts/playbook_lift.sh              stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt  query corpus (5 placeholders;
                                            J writes real 20+)
- reports/reality-tests/README.md       framework + interpretation
- .gitignore                            track reports/reality-tests/
                                        but ignore per-run JSON evidence

This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.

Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:22:36 -05:00

3.3 KiB
Raw Blame History

reports/reality-tests — does the 5-loop substrate actually work?

Reality tests measure product outcomes, not substrate health. The 21 smokes prove the system runs; the proof harness proves the system makes the claims it claims; reality tests answer: does the small-model pipeline + matrix indexer + playbook give measurably better results than raw cosine?

This is the gate from project_small_model_pipeline_vision.md: "the playbook + matrix indexer must give the results we're looking for." Single load-bearing criterion. Throughput, scaling, code elegance are secondary.


What lives here

Each reality test is a numbered run that produces:

  • <test>_<NNN>.json — raw structured evidence (per-query data, summary metrics)
  • <test>_<NNN>.md — human-readable report with headline metrics, per-query table, honesty caveats, next moves

Runs are append-only. Earlier runs stay in tree as historical baseline.


Test catalog

playbook_lift_<NNN> — does the playbook actually lift the right answer?

Driver: scripts/playbook_lift.shbin/playbook_lift Queries: tests/reality/playbook_lift_queries.txt Pipeline: cold pass → LLM judge → playbook record → warm pass → measure ranking shift.

The headline question: when the LLM judge finds a better answer than cosine top-1, can the playbook boost it to top-1 on the next run? If yes, the learning loop closes; if no, the matrix layer + playbook is infrastructure for a thesis that doesn't pay rent.

See the run reports for honesty caveats — chiefly that the LLM judge IS the ground-truth proxy.


Running a reality test

# Defaults: judge=qwen3.5:latest, workers limit 5000, run id 001
./scripts/playbook_lift.sh

# Re-run with a different judge to check inter-judge agreement
JUDGE_MODEL=qwen2.5:latest RUN_ID=002 ./scripts/playbook_lift.sh

# Smaller scale for fast iteration
WORKERS_LIMIT=1000 K=5 RUN_ID=dev ./scripts/playbook_lift.sh

Requires: Ollama on :11434 with nomic-embed-text + the chosen judge model loaded. Skips cleanly (exit 0) if Ollama is absent.


Interpreting results

Three thresholds matter on the playbook_lift tests:

Lift rate (lifts / discoveries) Verdict
≥ 50% Loop closes — playbook is doing real work, move to paraphrase queries
20-50% Lift exists but inconsistent — investigate boost math (score × 0.5) or judge variance
< 20% Loop is not pulling its weight — diagnose before adding more components

A separate concern: discovery rate (cold judge-best ≠ cold top-1). If discovery is itself rare (< 30% of queries), cosine is already close to optimal on this query distribution and the matrix+playbook layer has little headroom. That's not necessarily a bug — but it means the value gate has to come from somewhere else (multi-corpus retrieval, domain-specific tags, drift signal).


What this is not

  • Not a benchmark. No comparison against external systems; only internal cold-vs-warm.
  • Not a regression gate. Each run is a snapshot. Scores will drift with corpus changes, judge updates, and playbook math tuning. Don't wire just verify to demand a minimum lift.
  • Not human-validated. The LLM judge is the ground truth proxy. Sample 5-10 verdicts manually per run to sanity-check the judge isn't pathological.