golangLAKEHOUSE

Author	SHA1	Message	Date
root	848cbf5fef	phase 3: playbook_lift harness reads judge from config migrate the reality-test harness's judge-model default from a hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge. resolution priority: explicit -judge flag > $JUDGE_MODEL env > cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback. bumping the judge for run #N+1 now means editing one line in lakehouse.toml [models].local_judge — no Go file or shell script edits required. changes: - scripts/playbook_lift/main.go: -config flag added, judge default flips to "" so resolution chain runs. Imports internal/shared for config loader. - scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash; EFFECTIVE_JUDGE resolved by mirror-of-the-Go-chain (env > config grep > qwen3.5:latest fallback). Used for the Ollama presence check + report header. Pre-flight grep avoids requiring jq just to read the toml. - reports/reality-tests/README.md: documents the 4-step priority chain. verified all 4 paths produce the expected judge: - config (no env): qwen3.5:latest (from lakehouse.toml) - env override: env wins - flag override: flag wins over env - missing config: DefaultConfig fallback still gives qwen3.5:latest just verify PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:57:28 -05:00
root	3dd7d9fe30	reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine? First reality test driver. Two-pass design: - Pass 1 (cold): matrix.search use_playbook=false → small-model judge rates top-K → record playbook entry pointing at the highest-rated result (which may NOT be top-1 by distance — that's the discovery). - Pass 2 (warm): same queries with use_playbook=true → measure ranking shift. Lift = real if recorded answer becomes top-1. Files: - scripts/playbook_lift/main.go driver (391 LoC) - scripts/playbook_lift.sh stack-bring-up + report gen - tests/reality/playbook_lift_queries.txt query corpus (5 placeholders; J writes real 20+) - reports/reality-tests/README.md framework + interpretation - .gitignore track reports/reality-tests/ but ignore per-run JSON evidence This answers the gate from project_small_model_pipeline_vision.md: "the playbook + matrix indexer must give the results we're looking for." Without ground-truth labels, the LLM judge is the proxy — the same small-model thesis applied to evaluation. Honest about that limitation in the generated reports. Driver compiles clean; full run requires Ollama + workers/candidates ingest. Skips cleanly if Ollama absent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:22:36 -05:00
root	c41698acae	scrum rerun-2 — 50/60 (Δ R1 +7, Δ baseline +15) at c7e3124 Audited stash-clean c7e3124 (30 commits past rerun-1 4840c10). 3 HIGH risks closed (R-002 internal/shared, R-003 internal/storeclient, R-008 queryd/db.go). 3 advanced to partial (R-001 via fail-loud-bind + opt-in auth, R-006 via g2_smoke_fixtures, R-007 via ADR-003 auth.go). Biggest move: Agent Memory Correctness 4 → 9 — pathway Mem0 ops (ADD/UPDATE/REVISE/RETIRE/HISTORY) all tested, including cycle-detection and retired-trace-exclusion. Sprint 2 acceptance criteria are now verified code, not design-bar work. Two new findings: - F1 (MED): cmd/{matrixd,observerd,pathwayd}/main_test.go absent — reopens R-005 against new daemons. - F2 (LOW): scripts/staffing_*/main.go flag-defaults reach /home/profit/lakehouse/data/... Evidence under reports/scrum/_evidence/rerun2/ (local; per .gitkeep convention). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:13:01 -05:00
root	ff9823b871	scrum audit re-run: 35 → 43 / 60 after Phase A-E + S0.3 Re-runs the SCRUM.md framework against HEAD (4840c10) to score the delta from the audit baseline at 91edd43. Composite +8. Scoring deltas: Reproducibility 7 → 9 (just verify, just doctor, pre-push hook) Test Coverage 6 → 8 (168 proof harness assertions; Go-test gaps in shared/storeclient remain) Trust Boundary 7 → 7 (no code change; R-001/R-007 open) Memory Correctness 3 → 4 (vectord persistence proven; Mem0 pathway/playbook still not ported) Deployment Readiness 4 → 5 (just doctor; REPLICATION/systemd open) Maintainability 8 → 8 (spine unchanged; harness obeys CLAUDE_REFACTOR_GUARDRAILS) Risk register changes: R-004 (smokes not gated) CLOSED — just verify + pre-push hook R-005 (cmd/main.go untested) partial — proof harness covers wiring R-012 (empty tests/ dir) CLOSED — populated by harness R-001/R-002/R-003/R-006/R-007/R-008/R-009/R-010 unchanged Sprint 0 progress: S0.1 just doctor DONE S0.3 just verify + pre-push DONE S0.6 tests/ dir cleanup DONE S0.2 just smoke-fixtures open S0.4 cmd/main_test × 6 partial (harness coverage; go-test gap) S0.5 shared/storeclient tests open (HIGH risks still unaddressed) New finding from this rerun (worth recording): Queryd refresh-tick race in 04_query_correctness — cache-warm binaries fire SELECTs faster than queryd's 500ms refresh tick. Caught by integration mode going 104/0/1 → 102/1/1, fixed at 4840c10 with proof_wait_for_sql helper. Exactly the failure-mode the harness was designed to catch. Original 5 audit reports preserved as immutable history at 91edd43; this file documents the delta only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 05:37:45 -05:00
root	91edd43164	scrum audit: 5 reports under reports/scrum/ · score 35/60 Adapts docs/SCRUM.md framework (originally written for the matrix-agent-validated repo) to the Go rewrite. Five deliverables: golang-lakehouse-scrum-test.md top-line + scoring + verdict risk-register.md 12 findings, R-001..R-012 claim-coverage-table.md claim/test/risk for Sprint 2 sprint-backlog.md 5 sprints, ~2 weeks of work acceptance-gates.md DoD as runnable commands Every claim cites file:line, command output, or "missing evidence." Smoke chain ran clean (33s wall, all 9 PASS) and is captured in reports/scrum/_evidence/smoke_chain.log (gitignored — runtime artifact). Scoring: Reproducibility 7/10 9 smokes deterministic, no just/CI gate Test Coverage 6/10 internal/ packages tested, 6/7 cmd/ aren't Trust Boundary 7/10 escapes ok, zero auth, /sql is RCE-eq off-loopback Memory Correctness 3/10 pathway/playbook/observer not yet ported Deployment Readiness 4/10 no REPLICATION, no env template, no systemd Maintainability 8/10 no god-files, 7 lean binaries, ADRs current Top three risks: R-001 HIGH queryd /sql + DuckDB + non-loopback bind = RCE-equivalent R-002 HIGH internal/shared (server.go + config.go) zero tests R-003 HIGH internal/storeclient zero tests, used by 2 services R-004 MED 9-smoke chain green but not gated (no justfile/hook) The audit is the work; refactors come after. Sprint 0 owns coverage + CI gating; Sprint 1 owns trust-boundary decisions; Sprints 2-3 are mostly design-bar work for unbuilt agent components. .gitignore exception: /reports/* + !/reports/scrum/ keeps reports/ a runtime-artifact directory while exposing reports/scrum/ as tracked documentation. Mirrors the pattern future audit passes will land in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 04:51:47 -05:00
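The lift criterion in 3dd7d9fe30 ("Lift = real if recorded answer becomes top-1") can be sketched as a small tally over per-query results. This is a minimal illustration, not the driver's actual code: the `queryResult` type, its field names, and `countLift` are hypothetical, and separating "already top-1 in the cold pass" from "lifted" is an assumption about how "becomes" should be read.

```go
package main

import "fmt"

// queryResult is a hypothetical per-query record; the real driver's
// types are not shown in the commit log.
type queryResult struct {
	Query      string
	RecordedID string // result the judge rated highest in the cold pass
	ColdTop1   string // top-1 by distance with use_playbook=false
	WarmTop1   string // top-1 with use_playbook=true
}

// countLift tallies queries where the recorded answer was not top-1 in
// the cold pass but is top-1 in the warm pass (lifted), and queries
// where it was already top-1 before the playbook was applied.
func countLift(results []queryResult) (lifted, alreadyTop int) {
	for _, r := range results {
		switch {
		case r.ColdTop1 == r.RecordedID:
			alreadyTop++
		case r.WarmTop1 == r.RecordedID:
			lifted++
		}
	}
	return lifted, alreadyTop
}

func main() {
	// Made-up IDs for illustration only; not taken from any real run.
	results := []queryResult{
		{Query: "q1", RecordedID: "doc-b", ColdTop1: "doc-a", WarmTop1: "doc-b"},
		{Query: "q2", RecordedID: "doc-c", ColdTop1: "doc-c", WarmTop1: "doc-c"},
		{Query: "q3", RecordedID: "doc-e", ColdTop1: "doc-d", WarmTop1: "doc-d"},
	}
	lifted, alreadyTop := countLift(results)
	fmt.Printf("lifted=%d alreadyTop=%d total=%d\n", lifted, alreadyTop, len(results))
}
```

Counting the "already top-1" cases separately keeps queries where raw cosine was already right from inflating the lift figure.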

5 Commits
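
The judge resolution order described in 848cbf5fef (explicit -judge flag > $JUDGE_MODEL env > cfg.Models.LocalJudge > hardcoded fallback) can be sketched roughly as below. This is an assumption-laden illustration rather than the code in scripts/playbook_lift/main.go: `resolveJudge` and `fallbackJudge` are hypothetical names, and the config value is passed in as a plain string instead of being loaded through internal/shared.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// fallbackJudge mirrors the hardcoded fallback named in the commit message.
const fallbackJudge = "qwen3.5:latest"

// resolveJudge is a hypothetical helper: it returns the first non-empty
// candidate in priority order (flag, env, config), else the fallback.
func resolveJudge(flagVal, envVal, cfgVal string) string {
	for _, v := range []string{flagVal, envVal, cfgVal} {
		if v != "" {
			return v
		}
	}
	return fallbackJudge
}

func main() {
	// The flag defaults to "" so that an unset flag falls through the chain.
	judge := flag.String("judge", "", "judge model (overrides env and config)")
	flag.Parse()

	// Stand-in for cfg.Models.LocalJudge; a real run would read lakehouse.toml.
	cfgLocalJudge := ""

	fmt.Println(resolveJudge(*judge, os.Getenv("JUDGE_MODEL"), cfgLocalJudge))
}
```

An empty-string flag default is what lets the lower-priority sources take effect; a non-empty default would always win.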