migrate the reality-test harness's judge-model default from a
hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge.
resolution priority: explicit -judge flag > $JUDGE_MODEL env >
cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback.
bumping the judge for run #N+1 now means editing one line in
lakehouse.toml [models].local_judge — no Go file or shell script
edits required.
changes:
- scripts/playbook_lift/main.go: -config flag added, judge default
flips to "" so resolution chain runs. Imports internal/shared for
config loader.
- scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash;
EFFECTIVE_JUDGE resolved by mirror-of-the-Go-chain (env > config
grep > qwen3.5:latest fallback). Used for the Ollama presence
check + report header. Pre-flight grep avoids requiring jq just
to read the toml.
- reports/reality-tests/README.md: documents the 4-step priority
chain.
verified all 4 paths produce the expected judge:
- config (no env): qwen3.5:latest (from lakehouse.toml)
- env override: env wins
- flag override: flag wins over env
- missing config: DefaultConfig fallback still gives qwen3.5:latest
just verify PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
rates top-K → record playbook entry pointing at the highest-rated
result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
ranking shift. Lift = real if recorded answer becomes top-1.
Files:
- scripts/playbook_lift/main.go driver (391 LoC)
- scripts/playbook_lift.sh stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt query corpus (5 placeholders;
J writes real 20+)
- reports/reality-tests/README.md framework + interpretation
- .gitignore track reports/reality-tests/
but ignore per-run JSON evidence
This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.
Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audited stash-clean c7e3124 (30 commits past rerun-1 4840c10).
3 HIGH risks closed (R-002 internal/shared, R-003 internal/storeclient,
R-008 queryd/db.go). 3 advanced to partial (R-001 via fail-loud-bind +
opt-in auth, R-006 via g2_smoke_fixtures, R-007 via ADR-003 auth.go).
Biggest move: Agent Memory Correctness 4 → 9 — pathway Mem0 ops
(ADD/UPDATE/REVISE/RETIRE/HISTORY) all tested, including cycle-detection
and retired-trace-exclusion. Sprint 2 acceptance criteria are now
verified code, not design-bar work.
Two new findings:
- F1 (MED): cmd/{matrixd,observerd,pathwayd}/main_test.go absent —
reopens R-005 against new daemons.
- F2 (LOW): scripts/staffing_*/main.go flag-defaults reach
/home/profit/lakehouse/data/...
Evidence under reports/scrum/_evidence/rerun2/ (local; per
.gitkeep convention).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-runs the SCRUM.md framework against HEAD (4840c10) to score the
delta from the audit baseline at 91edd43. Composite +8.
Scoring deltas:
Reproducibility 7 → 9 (just verify, just doctor, pre-push hook)
Test Coverage 6 → 8 (168 proof harness assertions; Go-test
gaps in shared/storeclient remain)
Trust Boundary 7 → 7 (no code change; R-001/R-007 open)
Memory Correctness 3 → 4 (vectord persistence proven; Mem0
pathway/playbook still not ported)
Deployment Readiness 4 → 5 (just doctor; REPLICATION/systemd open)
Maintainability 8 → 8 (spine unchanged; harness obeys
CLAUDE_REFACTOR_GUARDRAILS)
Risk register changes:
R-004 (smokes not gated) CLOSED — just verify + pre-push hook
R-005 (cmd/main.go untested) partial — proof harness covers wiring
R-012 (empty tests/ dir) CLOSED — populated by harness
R-001/R-002/R-003/R-006/R-007/R-008/R-009/R-010 unchanged
Sprint 0 progress:
S0.1 just doctor DONE
S0.3 just verify + pre-push DONE
S0.6 tests/ dir cleanup DONE
S0.2 just smoke-fixtures open
S0.4 cmd/main_test × 6 partial (harness coverage; go-test gap)
S0.5 shared/storeclient tests open (HIGH risks still unaddressed)
New finding from this rerun (worth recording):
Queryd refresh-tick race in 04_query_correctness — cache-warm
binaries fire SELECTs faster than queryd's 500ms refresh tick.
Caught by integration mode going 104/0/1 → 102/1/1, fixed at
4840c10 with proof_wait_for_sql helper. Exactly the failure-mode
the harness was designed to catch.
Original 5 audit reports preserved as immutable history at
91edd43; this file documents the delta only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adapts docs/SCRUM.md framework (originally written for the
matrix-agent-validated repo) to the Go rewrite. Five deliverables:
golang-lakehouse-scrum-test.md top-line + scoring + verdict
risk-register.md 12 findings, R-001..R-012
claim-coverage-table.md claim/test/risk for Sprint 2
sprint-backlog.md 5 sprints, ~2 weeks of work
acceptance-gates.md DoD as runnable commands
Every claim cites file:line, command output, or "missing evidence."
Smoke chain ran clean (33s wall, all 9 PASS) and is captured in
reports/scrum/_evidence/smoke_chain.log (gitignored — runtime artifact).
Scoring:
Reproducibility 7/10 9 smokes deterministic, no just/CI gate
Test Coverage 6/10 internal/ packages tested, 6/7 cmd/ aren't
Trust Boundary 7/10 escapes ok, zero auth, /sql is RCE-eq off-loopback
Memory Correctness 3/10 pathway/playbook/observer not yet ported
Deployment Readiness 4/10 no REPLICATION, no env template, no systemd
Maintainability 8/10 no god-files, 7 lean binaries, ADRs current
Top three risks:
R-001 HIGH queryd /sql + DuckDB + non-loopback bind = RCE-equivalent
R-002 HIGH internal/shared (server.go + config.go) zero tests
R-003 HIGH internal/storeclient zero tests, used by 2 services
R-004 MED 9-smoke chain green but not gated (no justfile/hook)
The audit is the work; refactors come after. Sprint 0 owns coverage
+ CI gating; Sprint 1 owns trust-boundary decisions; Sprints 2-3 are
mostly design-bar work for unbuilt agent components.
.gitignore exception: /reports/* + !/reports/scrum/ keeps reports/
a runtime-artifact directory while exposing reports/scrum/ as
tracked documentation. Mirrors the pattern future audit passes will
land in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>