golangLAKEHOUSE

Author	SHA1	Message	Date
root	b2e45f7f26	playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%)

The 5-loop substrate's load-bearing gate is verified: playbook + matrix indexer give the results we're looking for. Per the report's rubric, lift ≥ 50% of discoveries means matrix is doing real work; 7/8 = 87.5% blew through that.

Harness was structurally hiding bugs behind a 5-daemon stripped boot. Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:

1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark), wrong domain for staffing queries; replaced with ethereal_workers (10K rows, real staffing schema, "e-" id prefix to avoid collision with workers' "w-"). staffing_workers driver gains -index-name + -id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running ~30s per judge call against the lift loop; reverted to qwen2.5:latest (~1s/call, 30× faster, held lift theory)

Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go so future drift fires in `go test`, not in a reality run.

R-005 closed:
- cmd/matrixd/main_test.go (new): playbook record drift detector + score bounds + 6 routes mounted
- cmd/queryd/main_test.go: wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new): 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new): 4 routes + invalid-op + unknown-mode

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.

Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
  Queries            21 (staffing-domain, 7 categories)
  Discoveries        8 (judge ≠ cosine top-1)
  Lifts              7/8 (87.5%)
  Boosts triggered   9
  Mean Δ distance    -0.053 (warm closer than cold)
  OOD honesty        dental/RN/SWE rated 1, no fake matches
  Cross-corpus       boosts confirmed (e- ↔ w- swaps in lifts)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 06:22:21 -05:00
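The field-name drift fixed in items 1 and 3 of the commit above can be caught at the decoding layer. The following is a minimal sketch of that idea, not the repository's actual test: the `searchRequest` type and `decodeStrict` helper are hypothetical, and only the `query` → `query_text` rename comes from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// searchRequest is a hypothetical request body; only the "query_text"
// field name is taken from the commit message.
type searchRequest struct {
	QueryText string `json:"query_text"`
}

// decodeStrict rejects bodies that carry unknown fields, so a caller that
// still sends the old "query" key fails loudly instead of silently
// producing an empty query.
func decodeStrict(body string) (searchRequest, error) {
	var req searchRequest
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&req); err != nil {
		return searchRequest{}, err
	}
	if req.QueryText == "" {
		return searchRequest{}, fmt.Errorf("query_text is required")
	}
	return req, nil
}

func main() {
	_, errOld := decodeStrict(`{"query": "forklift operator"}`)
	fmt.Println("old field rejected:", errOld != nil)

	req, errNew := decodeStrict(`{"query_text": "forklift operator"}`)
	fmt.Println("new field accepted:", errNew == nil, req.QueryText)
}
```

Running assertions like these under `go test` is one way to make a renamed field fail at build-check time rather than during a full reality run.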
root	848cbf5fef	phase 3: playbook_lift harness reads judge from config

migrate the reality-test harness's judge-model default from a hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge. resolution priority: explicit -judge flag > $JUDGE_MODEL env > cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback.

bumping the judge for run #N+1 now means editing one line in lakehouse.toml [models].local_judge; no Go file or shell script edits required.

changes:
- scripts/playbook_lift/main.go: -config flag added, judge default flips to "" so resolution chain runs. Imports internal/shared for config loader.
- scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash; EFFECTIVE_JUDGE resolved by mirror-of-the-Go-chain (env > config grep > qwen3.5:latest fallback). Used for the Ollama presence check + report header. Pre-flight grep avoids requiring jq just to read the toml.
- reports/reality-tests/README.md: documents the 4-step priority chain.

verified all 4 paths produce the expected judge:
- config (no env): qwen3.5:latest (from lakehouse.toml)
- env override: env wins
- flag override: flag wins over env
- missing config: DefaultConfig fallback still gives qwen3.5:latest

just verify PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:57:28 -05:00
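The four-step priority chain in the commit above (flag > env > config > fallback) can be sketched as a first-non-empty lookup. This is an illustration under assumptions, not the code in scripts/playbook_lift/main.go: `resolveJudge` and its parameters are hypothetical names, and the model strings in `main` are placeholders apart from the `qwen3.5:latest` fallback named in the commit.

```go
package main

import "fmt"

// resolveJudge returns the first non-empty value in priority order:
// explicit flag, then environment variable, then config file value,
// and finally the hardcoded fallback. All names here are hypothetical.
func resolveJudge(flagVal, envVal, cfgVal, fallback string) string {
	for _, candidate := range []string{flagVal, envVal, cfgVal} {
		if candidate != "" {
			return candidate
		}
	}
	return fallback
}

func main() {
	const fallback = "qwen3.5:latest"

	fmt.Println(resolveJudge("", "", "", fallback))                       // fallback only
	fmt.Println(resolveJudge("", "", "cfg-model", fallback))              // config wins over fallback
	fmt.Println(resolveJudge("", "env-model", "cfg-model", fallback))     // env wins over config
	fmt.Println(resolveJudge("flag-model", "env-model", "cfg-model", fallback)) // flag wins over env
}
```

Defaulting the flag to the empty string, as the commit describes, is what lets an unset flag fall through to the lower-priority sources.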
root	3dd7d9fe30	reality-tests: playbook-lift harness - does the 5-loop substrate beat raw cosine?

First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge rates top-K → record playbook entry pointing at the highest-rated result (which may NOT be top-1 by distance; that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure ranking shift. Lift = real if recorded answer becomes top-1.

Files:
- scripts/playbook_lift/main.go             driver (391 LoC)
- scripts/playbook_lift.sh                  stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt   query corpus (5 placeholders; J writes real 20+)
- reports/reality-tests/README.md           framework + interpretation
- .gitignore                                track reports/reality-tests/ but ignore per-run JSON evidence

This answers the gate from project_small_model_pipeline_vision.md: "the playbook + matrix indexer must give the results we're looking for." Without ground-truth labels, the LLM judge is the proxy, the same small-model thesis applied to evaluation. Honest about that limitation in the generated reports.

Driver compiles clean; full run requires Ollama + workers/candidates ingest. Skips cleanly if Ollama absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:22:36 -05:00
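The two-pass scoring described in the commit above can be sketched as follows. This is a hedged illustration, not the 391-line driver: the `queryResult` type, the `score` and `liftRate` helpers, and the sample ids are all hypothetical. A discovery is counted when the judge's highest-rated result differs from the cold top-1, and a lift when that recorded result becomes top-1 in the warm pass.

```go
package main

import "fmt"

// queryResult holds hypothetical per-query outcomes of the two passes.
type queryResult struct {
	ColdTop1  string // top-1 by distance with use_playbook=false
	JudgeBest string // result the judge rated highest in the cold pass
	WarmTop1  string // top-1 with use_playbook=true
}

// score counts discoveries (judge's pick differs from the cold top-1) and
// lifts (a discovery whose recorded answer becomes top-1 in the warm pass).
func score(results []queryResult) (discoveries, lifts int) {
	for _, r := range results {
		if r.JudgeBest == r.ColdTop1 {
			continue // judge agrees with raw cosine: nothing to lift
		}
		discoveries++
		if r.WarmTop1 == r.JudgeBest {
			lifts++
		}
	}
	return discoveries, lifts
}

// liftRate returns lifts/discoveries, or 0 when there were no discoveries.
func liftRate(discoveries, lifts int) float64 {
	if discoveries == 0 {
		return 0
	}
	return float64(lifts) / float64(discoveries)
}

func main() {
	// Sample data is invented for illustration only.
	results := []queryResult{
		{ColdTop1: "w-1", JudgeBest: "w-1", WarmTop1: "w-1"}, // not a discovery
		{ColdTop1: "w-2", JudgeBest: "e-7", WarmTop1: "e-7"}, // discovery, lifted
		{ColdTop1: "w-3", JudgeBest: "w-9", WarmTop1: "w-3"}, // discovery, not lifted
	}
	d, l := score(results)
	fmt.Printf("discoveries=%d lifts=%d rate=%.1f%%\n", d, l, 100*liftRate(d, l))
}
```

Guarding the zero-discovery case keeps the rate well defined when the judge agrees with raw cosine on every query.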
root	c41698acae	scrum rerun-2: 50/60 (Δ R1 +7, Δ baseline +15) at c7e3124

Audited stash-clean c7e3124 (30 commits past rerun-1 4840c10). 3 HIGH risks closed (R-002 internal/shared, R-003 internal/storeclient, R-008 queryd/db.go). 3 advanced to partial (R-001 via fail-loud-bind + opt-in auth, R-006 via g2_smoke_fixtures, R-007 via ADR-003 auth.go).

Biggest move: Agent Memory Correctness 4 → 9: pathway Mem0 ops (ADD/UPDATE/REVISE/RETIRE/HISTORY) all tested, including cycle-detection and retired-trace-exclusion. Sprint 2 acceptance criteria are now verified code, not design-bar work.

Two new findings:
- F1 (MED): cmd/{matrixd,observerd,pathwayd}/main_test.go absent; reopens R-005 against new daemons.
- F2 (LOW): scripts/staffing_*/main.go flag-defaults reach /home/profit/lakehouse/data/...

Evidence under reports/scrum/_evidence/rerun2/ (local; per .gitkeep convention).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 23:13:01 -05:00
root	ff9823b871	scrum audit re-run: 35 → 43 / 60 after Phase A-E + S0.3

Re-runs the SCRUM.md framework against HEAD (4840c10) to score the delta from the audit baseline at 91edd43. Composite +8.

Scoring deltas:
  Reproducibility       7 → 9  (just verify, just doctor, pre-push hook)
  Test Coverage         6 → 8  (168 proof harness assertions; Go-test gaps in shared/storeclient remain)
  Trust Boundary        7 → 7  (no code change; R-001/R-007 open)
  Memory Correctness    3 → 4  (vectord persistence proven; Mem0 pathway/playbook still not ported)
  Deployment Readiness  4 → 5  (just doctor; REPLICATION/systemd open)
  Maintainability       8 → 8  (spine unchanged; harness obeys CLAUDE_REFACTOR_GUARDRAILS)

Risk register changes:
  R-004 (smokes not gated)      CLOSED: just verify + pre-push hook
  R-005 (cmd/main.go untested)  partial: proof harness covers wiring
  R-012 (empty tests/ dir)      CLOSED: populated by harness
  R-001/R-002/R-003/R-006/R-007/R-008/R-009/R-010 unchanged

Sprint 0 progress:
  S0.1 just doctor               DONE
  S0.3 just verify + pre-push    DONE
  S0.6 tests/ dir cleanup        DONE
  S0.2 just smoke-fixtures       open
  S0.4 cmd/main_test × 6         partial (harness coverage; go-test gap)
  S0.5 shared/storeclient tests  open (HIGH risks still unaddressed)

New finding from this rerun (worth recording): Queryd refresh-tick race in 04_query_correctness. Cache-warm binaries fire SELECTs faster than queryd's 500ms refresh tick. Caught by integration mode going 104/0/1 → 102/1/1, fixed at 4840c10 with proof_wait_for_sql helper. Exactly the failure-mode the harness was designed to catch.

Original 5 audit reports preserved as immutable history at 91edd43; this file documents the delta only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 05:37:45 -05:00
root	91edd43164	scrum audit: 5 reports under reports/scrum/ · score 35/60

Adapts docs/SCRUM.md framework (originally written for the matrix-agent-validated repo) to the Go rewrite. Five deliverables:
  golang-lakehouse-scrum-test.md  top-line + scoring + verdict
  risk-register.md                12 findings, R-001..R-012
  claim-coverage-table.md         claim/test/risk for Sprint 2
  sprint-backlog.md               5 sprints, ~2 weeks of work
  acceptance-gates.md             DoD as runnable commands

Every claim cites file:line, command output, or "missing evidence." Smoke chain ran clean (33s wall, all 9 PASS) and is captured in reports/scrum/_evidence/smoke_chain.log (gitignored; runtime artifact).

Scoring:
  Reproducibility       7/10  9 smokes deterministic, no just/CI gate
  Test Coverage         6/10  internal/ packages tested, 6/7 cmd/ aren't
  Trust Boundary        7/10  escapes ok, zero auth, /sql is RCE-eq off-loopback
  Memory Correctness    3/10  pathway/playbook/observer not yet ported
  Deployment Readiness  4/10  no REPLICATION, no env template, no systemd
  Maintainability       8/10  no god-files, 7 lean binaries, ADRs current

Top three risks:
  R-001 HIGH  queryd /sql + DuckDB + non-loopback bind = RCE-equivalent
  R-002 HIGH  internal/shared (server.go + config.go) zero tests
  R-003 HIGH  internal/storeclient zero tests, used by 2 services
  R-004 MED   9-smoke chain green but not gated (no justfile/hook)

The audit is the work; refactors come after. Sprint 0 owns coverage + CI gating; Sprint 1 owns trust-boundary decisions; Sprints 2-3 are mostly design-bar work for unbuilt agent components.

.gitignore exception: /reports/* + !/reports/scrum/ keeps reports/ a runtime-artifact directory while exposing reports/scrum/ as tracked documentation. Mirrors the pattern future audit passes will land in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 04:51:47 -05:00

6 Commits