2 Commits

Author SHA1 Message Date
root
848cbf5fef phase 3: playbook_lift harness reads judge from config
migrate the reality-test harness's judge-model default from a
hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge.

resolution priority: explicit -judge flag > $JUDGE_MODEL env >
cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback.

bumping the judge for run #N+1 now means editing one line in
lakehouse.toml [models].local_judge — no Go file or shell script
edits required.

changes:
- scripts/playbook_lift/main.go: -config flag added, judge default
  flips to "" so resolution chain runs. Imports internal/shared for
  config loader.
- scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash;
  EFFECTIVE_JUDGE resolved by mirror-of-the-Go-chain (env > config
  grep > qwen3.5:latest fallback). Used for the Ollama presence
  check + report header. Pre-flight grep avoids requiring jq just
  to read the toml.
- reports/reality-tests/README.md: documents the 4-step priority
  chain.

verified all 4 paths produce the expected judge:
- config (no env): qwen3.5:latest (from lakehouse.toml)
- env override:    env wins
- flag override:   flag wins over env
- missing config:  DefaultConfig fallback still gives qwen3.5:latest

just verify PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:57:28 -05:00
root
3dd7d9fe30 reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
  rates top-K → record playbook entry pointing at the highest-rated
  result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
  ranking shift. Lift = real if recorded answer becomes top-1.

Files:
- scripts/playbook_lift/main.go         driver (391 LoC)
- scripts/playbook_lift.sh              stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt  query corpus (5 placeholders;
                                            J writes real 20+)
- reports/reality-tests/README.md       framework + interpretation
- .gitignore                            track reports/reality-tests/
                                        but ignore per-run JSON evidence

This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.

Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:22:36 -05:00