Three Phase 2 additions land in this commit:
1. matrix.SearchRequest gains ExcludeIDs ([]string) — filters specific
worker IDs out of results post-retrieval, AND skips them at the
playbook boost+inject step (so excluded answers can't sneak back
via Shape B). Real-world driver: coordinator placed N workers,
client asks for replacements, system needs alternatives, not the
same N. Threaded through retrieve.go after merge but before the
metadata filter so excluded IDs don't waste post-filter top-K
slots (see the sketch after this list).
2. New harness phase 2b: 200-worker swap simulation. Captures the
top-K from alice's warehouse query, then re-issues with
exclude_ids=<placed>. Result Jaccard(orig, swap) measures whether
the substrate finds genuine alternatives.
3. New harness phase 1b: fresh-resume mid-run injection. Three new
workers ingested via /v1/embed + /v1/vectors/index/workers/add,
then verified findable via semantic queries matching resume content.
Plus Hour labels on every event (operational narrative: 0/6/12/18/
24/30/36/42/48) and a refactor of captureEvent to take hour as a
param.
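A minimal sketch of the ExcludeIDs threading, assuming simplified
request/result shapes (the real types live in internal/matrix; field
names beyond ExcludeIDs and query_text are guesses):

    // SearchRequest is a stand-in for matrix.SearchRequest.
    type SearchRequest struct {
        QueryText  string   `json:"query_text"`
        ExcludeIDs []string `json:"exclude_ids,omitempty"` // worker IDs to drop
    }

    type Result struct {
        ID       string
        Distance float64
    }

    // applyExcludeIDs runs after the corpus merge but before the metadata
    // filter, so excluded IDs never occupy post-filter top-K slots. The
    // boost and inject paths apply the same exclusion separately.
    func applyExcludeIDs(results []Result, excludeIDs []string) []Result {
        if len(excludeIDs) == 0 {
            return results
        }
        drop := make(map[string]bool, len(excludeIDs))
        for _, id := range excludeIDs {
            drop[id] = true
        }
        kept := results[:0] // filter in place
        for _, r := range results {
            if !drop[r.ID] {
                kept = append(kept, r)
            }
        }
        return kept
    }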
Run #003 + #004 results (5K workers + 10K ethereal_workers):
Diversity (#004):
Same-role-across-contracts Jaccard = 0.080 (n=9)
Different-roles-same-contract Jaccard = 0.013 (n=18)
Determinism: 1.000 (#004 unchanged)
Verbatim handover: 4/4 = 100%
Paraphrase handover: 4/4 = 100%
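(All Jaccard figures in these reports are plain set overlap between two
top-K ID sets; a minimal sketch of the computation, helper name
hypothetical:)

    // jaccard returns |A ∩ B| / |A ∪ B| over two top-K ID slices.
    // 1.0 = identical sets (determinism), 0.0 = fully disjoint (diversity).
    func jaccard(a, b []string) float64 {
        inA := make(map[string]bool, len(a))
        for _, id := range a {
            inA[id] = true
        }
        inter, union := 0, len(inA)
        seenB := make(map[string]bool, len(b))
        for _, id := range b {
            if seenB[id] {
                continue
            }
            seenB[id] = true
            if inA[id] {
                inter++
            } else {
                union++
            }
        }
        if union == 0 {
            return 0 // both sets empty
        }
        return float64(inter) / float64(union)
    }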
Phase 2b — 200-worker swap (Jaccard 0.000):
8 originally-placed workers fully replaced by 8 alternatives.
ExcludeIDs substrate change works end-to-end — boost AND inject
both honor the exclusion, so excluded workers don't return via
the playbook either.
Phase 1b — fresh-resume injection: REAL PRODUCT FINDING.
Substrate ABSORPTION is fine — 3 /v1/vectors/index/workers/add
calls at 200 status, 3 vectors persisted. But none of the 3
fresh workers surfaced in top-8 even with semantic queries
matching their resume content (e.g. "Senior tower crane rigger
NCCCO Chicago" against fresh-001's resume "Senior rigger with
12 years tower-crane signaling...", which contains both NCCCO
and Chicago). Top-1s came from existing workers at distance
~0.25; the fresh workers' distances must therefore be > 0.25,
pushing them past rank 8.
Cause: dense retrieval at 5000+ workers means many existing
profiles cluster near any specific query in cosine space;
nomic-embed-text-v2 (137M) introduces enough noise that a
fresh worker doesn't reliably outrank them just because the
text content overlaps.
Workarounds (Phase 3 work): (a) hybrid retrieval (keyword +
semantic), (b) playbook-layer score boost for fresh adds,
(c) larger embedder. Documented in run #004 report.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 had two known gaps: (1) the 3 contracts had zero shared role
names, so same-role-across-contracts Jaccard was vacuous (n=0); (2)
the verbatim handover at 100% was the trivial case, not the hard
learning test (paraphrased queries against another coord's playbook).
Both fixed in this commit.
Contract redesign — all 3 contracts now share warehouse worker /
admin assistant / heavy equipment operator roles, plus a unique
specialist per contract (industrial electrician / bilingual safety
coord / drone surveyor — the "specialist not on the standard roster"
case from J's spec). Counts and skill mixes vary per region.
New driver phase 4b — paraphrase handover. Bob runs qwen2.5-paraphrased
versions of Alice's contract queries against Alice's playbook
namespace. Tests whether institutional memory propagates across
coordinators AND across natural wording variation that Bob would
introduce when running Alice's contract.
Run #002 result (5K workers + 10K ethereal_workers, 4 demand × 3
coords + paraphrase handover):
Diversity (the question J asked: locking or cycling?):
Same-role-across-contracts Jaccard = 0.119 (n=9)
→ 88% of workers DIFFER across regions for the same role name.
Milwaukee warehouse vs Indianapolis warehouse vs Chicago
warehouse pull mostly distinct top-K from the same population.
The system locks into geo+cert+skill context, not cycling.
Different-roles-same-contract Jaccard = 0.004 (n=18)
→ role-specific retrieval works (unchanged from Phase 1).
Determinism: Jaccard = 1.000 (n=12) — unchanged.
Learning:
Verbatim handover 4/4 = 100% (trivial case, expected)
Paraphrase handover 4/4 = 100% (HARD case — passes!)
Of those 4 paraphrase recoveries:
- 2 used boost (Alice's recording was already in Bob's
paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1)
- 2 used Shape B inject (recording wasn't in Bob's
paraphrase top-K; InjectPlaybookMisses brought it in)
The boost/inject mix is healthy — both paths are used and both
produce correct top-1s. Multi-coord institutional memory propagation
is empirically working under wording variation.
Sample warehouse worker top-1s across contracts (proves diversity):
alice / Milwaukee → w-713
bob / Indianapolis → e-8447
carol / Chicago → e-7145
Three different workers from the same 15K-person population,
selected on geo+cert+skill context.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three coordinators (alice / bob / carol) with three contracts
(Milwaukee distribution / Indianapolis manufacturing / Chicago
construction). 7-phase scenario runner: baseline → surge → merge →
handover → split → reissue → analysis. Each coord has a separate
playbook namespace (playbook_{name}) so institutional memory stays
isolated by default but transferable on demand.
Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
and Langfuse tracing — those are Phase 2/3.
Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors):
Diversity:
Different-roles-same-contract Jaccard = 0.004 (n=18)
→ role-specific retrieval is working perfectly. Different
roles within one contract pull totally different worker
pools. System is NOT cycling; locks into per-role retrieval.
Same-role-across-contracts Jaccard = N/A (n=0)
→ TEST-DESIGN ISSUE: the 3 contracts use distinct role names
per industry (warehouse worker / production worker / general
laborer), so no exact-name overlaps exist. Phase 2 should
either share at least one role across contracts OR add a
skill-based diversity metric.
Determinism: Jaccard = 1.000 (n=12)
→ HNSW + Ollama retrieval is fully deterministic on identical
query text. coder/hnsw + nomic-embed-text are stable.
Learning: handover hit rate = 4/4 = 100%
→ Bob inherits Alice's recordings perfectly when bob runs
identical queries with alice's playbook namespace. CAVEAT:
this tests the trivial verbatim case, not paraphrase handover.
The harder test (bob runs paraphrased queries with alice's
playbook) is Phase 2 work.
Per-event capture in JSON: every matrix.search response is logged
with phase / coordinator / contract / role / query / top-K IDs +
distances + per-corpus counts + boosted/injected counts. Reviewable
via:
jq '.events[] | select(.phase == "merge")'
jq '.events[] | select(.coordinator == "alice")'
jq '.events[] | select(.role == "warehouse worker")'
Notable finding from per-event: carol's "general laborer" and "crane
operator" queries both surface w-1009 as top-1, with crane operator
at distance 0.098 (very tight) and general laborer at 0.297. The
system found a worker who legitimately covers both roles — realistic
for small construction crews.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't
distinguish "Shape B surfaced a strictly-better answer" from "Shape B
shuffled ranks but quality is unchanged" from "Shape B replaced a good
answer with a wrong one." This commit adds Pass 4: judge warm top-1
with the same prompt as cold ratings, then bucket the comparison.
Implementation:
- New --with-rejudge driver flag (default off).
- New WITH_REJUDGE harness env (default 1, on for prod runs).
- queryRun gains WarmTop1Metadata (cached during Pass 2 for the
rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no
rejudge, 0..5 = rating).
- summary gains RejudgeAttempted, QualityLifted, QualityNeutral,
QualityRegressed (counts of warm-rating > / == / < cold-rating).
- Markdown headline gains a Quality block when rejudge ran.
- ~21 extra judge calls (~30s on qwen2.5).
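A minimal sketch of the Pass 4 bucketing, using the summary field names
above (the comparison is the whole trick; the pointer keeps "no rejudge"
distinguishable from a rating of 0):

    type summary struct {
        RejudgeAttempted int
        QualityLifted    int
        QualityNeutral   int
        QualityRegressed int
    }

    // bucketRejudge tallies one query's warm-vs-cold judge ratings.
    func bucketRejudge(s *summary, coldRating int, warmRating *int) {
        if warmRating == nil {
            return // rejudge didn't run for this query
        }
        s.RejudgeAttempted++
        switch {
        case *warmRating > coldRating:
            s.QualityLifted++
        case *warmRating < coldRating:
            s.QualityRegressed++
        default:
            s.QualityNeutral++
        }
    }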
Run #005 result (split inject threshold 0.20 + paraphrase + rejudge):
Quality lifted 5 / 21 (24%) — 3× +2 rating, 2× +1 rating
Quality neutral 13 / 21 (62%) — includes OOD queries holding 1
Quality regressed 3 / 21 (14%)
Net rating delta +3 across 21 queries (+0.14 average)
The 5 lifts were all rating-2 cold replaced with rating-3 or rating-4
warm — Shape B took mediocre matches and substituted substantively
better ones. The 3 regressions were small (-1, -1, -3).
Q11 is the cautionary tale: cold top-1 "production line worker"
(rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator"
e-5729 (rating 1). Adjacent-domain cross-pollination — production
worker and forklift operator embed within 0.20 cosine because both
are warehouse-adjacent staffing queries, even though the judge
correctly distinguishes them. The split-threshold defense (0.5 boost
/ 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed
neutral at rating 1) but not adjacent-domain cross-pollination.
Net product verdict: working, net-positive on quality, but the worst
case (Q11 4→1) is customer-visible and warrants a tighter inject
threshold OR an additional gate beyond cosine distance. Filed in
STATE_OF_PLAY OPEN as a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run #003 surfaced rampant cross-pollination: Q2's "OSHA-30 forklift
Wisconsin" recording (w-4435) became warm top-1 for Q19 (dental
hygienist), Q20 (RN), Q21 (software engineer), and 6 other unrelated
staffing queries. Cause: InjectPlaybookMisses inherited the same
DefaultPlaybookMaxDistance (0.5) as the boost path, but inject is
structurally riskier than boost — boost only re-ranks results that
already retrieved on their own merits, while inject FORCES a result
into top-K, so a loose match cross-pollinates wrong-domain answers.
Empirical motivation from v3:
Implied playbook hit distances for cross-pollinated cases: 0.20-0.46
Implied distances for the 6/6 paraphrase recoveries: 0.23-0.30
Threshold of 0.20 should keep most paraphrases, kill the OOD bleed.
Implementation:
- New DefaultPlaybookMaxInjectDistance = 0.20 in playbook.go.
- New PlaybookMaxInjectDistance field on SearchRequest (override).
- InjectPlaybookMisses signature gains maxInjectDist param; hits whose
Distance exceeds it are skipped (boost path may still re-rank them).
- TestInjectPlaybookMisses_RespectsInjectThreshold locks the contract
with one tight + one loose hit, asserting only the tight one injects.
- Existing tests pass explicit threshold (0 = default for tight tests,
0.5 for the dedupe test which uses 0.30 hits).
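A sketch of the gate described above, under the named defaults
(signatures simplified; the real InjectPlaybookMisses takes more
context):

    const (
        DefaultPlaybookMaxDistance       = 0.5  // boost path (unchanged)
        DefaultPlaybookMaxInjectDistance = 0.20 // inject path (this commit)
    )

    type PlaybookHit struct {
        AnswerID string  // recorded best answer
        Distance float64 // cosine distance from query to recorded query
    }

    // shouldInject gates a hit on the tighter inject threshold. Zero falls
    // back to the default, matching the "0 = default" test convention;
    // skipped hits may still be re-ranked by the looser boost path.
    func shouldInject(hit PlaybookHit, maxInjectDist float64) bool {
        if maxInjectDist == 0 {
            maxInjectDist = DefaultPlaybookMaxInjectDistance
        }
        return hit.Distance <= maxInjectDist
    }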
Run #004 result on identical queries with the split threshold:
Verbatim discovery 8 (vs v3's 6 — judge variance, separate)
Verbatim lift 6 / 8 (75%)
Paraphrase top-1 6 / 8 (75%)
Paraphrase any-rank in K 6 / 8
OOD queries Q19/Q20/Q21 ALL show warm top-1 = cold top-1 (no
injection) — cross-pollination eliminated where it was wrong-direction.
Mean Δ top-1 distance narrowed from -0.164 (v3, distorted) to
-0.071 (v4, comparable to v1's -0.053).
Two paraphrases missed in v4 (Q9, Q15) were ones where qwen2.5
rephrased liberally enough to drift past 0.20 — Q9: "Inventory
specialist..." → "Individual needed for inventory management..." and
Q15: "Engaged warehouse associate..." → "Warehouse associate currently
engaged with a robust history...". The system's refusal to inject
when it isn't confident is the right product behavior; the boost
path still re-ranks recorded answers when they appear in regular
retrieval.
The Q6 ↔ Q7 cross-pollination ("Forklift-certified loader" ↔
"Hazmat warehouse worker") is legitimate — these are genuinely similar
staffing queries and the judge ranks both directions as plausible.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v0 boost-only stance documented in internal/matrix/playbook.go:22-27
("the boost only re-ranks results that ALREADY surfaced from the regular
retrieval") couldn't promote recorded answers that dropped out of a
paraphrase's top-K. playbook_lift_002 surfaced exactly that gap: 0/2
paraphrase recoveries because the recorded answers weren't in regular
retrieval at all (rank=-1).
Shape B: when warm-pass retrieval doesn't surface a playbook hit's
answer, inject a synthetic Result for it directly. Distance =
playbook_hit_distance × BoostFactor — same formula as the boost path so
injections land in comparable distance space. Caller re-sorts +
truncates after both boost and inject have run.
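A minimal sketch of Shape B under those rules (types simplified;
hypothetical helper, not the real InjectPlaybookMisses):

    type Result struct {
        ID       string
        Distance float64
    }

    type PlaybookHit struct {
        AnswerID string
        Distance float64
    }

    // injectPlaybookMiss appends a synthetic Result for a playbook hit
    // whose answer never surfaced in warm retrieval. Distance uses the
    // same hit.Distance × boostFactor formula as the boost path, so the
    // injection lands in comparable distance space. The caller re-sorts
    // + truncates after both boost and inject have run.
    func injectPlaybookMiss(results []Result, hit PlaybookHit, boostFactor float64) []Result {
        for _, r := range results {
            if r.ID == hit.AnswerID {
                return results // already retrieved; boost path owns re-ranking
            }
        }
        return append(results, Result{
            ID:       hit.AnswerID,
            Distance: hit.Distance * boostFactor,
        })
    }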
Result on playbook_lift_003 (Shape B + paraphrase pass):
Verbatim discovery 6
Verbatim lift 2 / 6
Paraphrase top-1 6 / 6
Paraphrase any-rank in K 6 / 6
Mean Δ top-1 distance -0.1637 (warm closer than cold)
Every paraphrase the judge generated landed the v1-recorded answer at
top-1 of the new query's results. The learning property holds — cosine
on embed(paraphrase) finds the recorded query's vector within
DefaultPlaybookMaxDistance (0.5), and Shape B injects the answer.
Verbatim lift dropped from v1's 7/8 because Shape B cross-pollinates
recorded answers across queries. w-4435 (Q2's recording) appears as
warm top-1 for several other queries because their embeddings are
within the playbook hit threshold of "OSHA-30 forklift Wisconsin." This
is a feature, not a bug — the matrix layer's purpose is to share
knowledge across queries — but the lift metric only counts "warm top-1
== cold judge best," so cross-pollinated lifts don't register. A v3
metric would re-judge warm pass to measure true judge improvement.
Tests:
- TestInjectPlaybookMisses_AddsMissingAnswers — primary claim
- TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent — no double-inject
- TestInjectPlaybookMisses_DedupesPerAnswer — multi-hit same answer
- TestInjectPlaybookMisses_EmptyHits — fast-path no-op
Driver fix: ParaphraseRecordedRank int → *int. The `omitempty` int
silently dropped rank=0 (top-1, the WANTED value) from JSON, making the
v003 report show "null" instead of "0" for every successful recovery.
Pointer keeps nil/rank-0 distinguishable.
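The pitfall is standard encoding/json behavior: `omitempty` treats a
zero int as empty, so rank 0 vanishes. A sketch (JSON key name assumed):

    // Before: rank 0 (top-1, the value we WANT to report) is silently
    // omitted, so jq renders null for every successful recovery.
    //   ParaphraseRecordedRank int `json:"paraphrase_recorded_rank,omitempty"`
    //
    // After: a pointer keeps nil ("pass didn't run") distinct from 0.
    type queryRun struct {
        ParaphraseRecordedRank *int `json:"paraphrase_recorded_rank,omitempty"`
    }
    // nil          → field omitted from JSON
    // pointer-to-0 → "paraphrase_recorded_rank": 0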
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in Pass 3 to the lift driver: for each query whose Pass 1
recorded a playbook entry, ask the judge to rephrase the query, then
re-query
with playbook=true and check whether the recorded answer surfaces in
top-K. This is the test the v1 report's caveat #3 explicitly flagged
as the actual learning-property gate (not the cheap verbatim case).
Implementation:
- New flag --with-paraphrase on the driver (default off).
- New WITH_PARAPHRASE env in the harness (default 1, on for prod runs).
- New paraphrase_* fields on queryRun + summary, with a `// 0` fallback
  in jq so re-rendering verbatim-only evidence stays clean.
- generateParaphrase() calls the same judge model with format=json and
a tight schema; temperature=0.5 for variance without domain drift.
- Markdown report adds a paraphrase per-query table (only when the
pass ran) and an honesty caveat about judge-also-rephrases coupling.
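A sketch of the paraphrase call against Ollama's /api/generate,
assuming that endpoint and a schema like the one below (the driver's
actual prompt and schema aren't shown in this log):

    import (
        "bytes"
        "context"
        "encoding/json"
        "net/http"
    )

    // generateParaphrase asks the judge model for a JSON-wrapped rephrase.
    // temperature=0.5: variance without domain drift, per this commit.
    func generateParaphrase(ctx context.Context, judgeModel, query string) (string, error) {
        body, err := json.Marshal(map[string]any{
            "model": judgeModel,
            "prompt": `Rephrase this staffing query, preserving intent. ` +
                `Reply as {"paraphrase": "..."}: ` + query,
            "format":  "json", // constrain output to valid JSON
            "stream":  false,
            "options": map[string]any{"temperature": 0.5},
        })
        if err != nil {
            return "", err
        }
        req, err := http.NewRequestWithContext(ctx, http.MethodPost,
            "http://localhost:11434/api/generate", bytes.NewReader(body))
        if err != nil {
            return "", err
        }
        req.Header.Set("Content-Type", "application/json")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return "", err
        }
        defer resp.Body.Close()
        var out struct {
            Response string `json:"response"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            return "", err
        }
        var parsed struct {
            Paraphrase string `json:"paraphrase"`
        }
        if err := json.Unmarshal([]byte(out.Response), &parsed); err != nil {
            return "", err
        }
        return parsed.Paraphrase, nil
    }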
Run #002 result (reports/reality-tests/playbook_lift_002.{json,md}):
Verbatim lift 2/2 (100% — Q7 + Q13, both stable from v1)
Paraphrase top-1 0/2
Paraphrase any-rank in K 0/2
Both paraphrases dropped the recorded answer OUT of top-K entirely
(rank=-1). This isn't a paraphrase-quality problem — qwen2.5's outputs
preserved intent ("Hazmat-certified warehouse worker comfortable with
cold storage" → "Warehouse worker with Hazmat certification and
experience in cold storage"). It's the v0 boost-only stance documented
in internal/matrix/playbook.go:22-27: the boost only re-ranks results
that ALREADY surfaced from regular retrieval. If paraphrase's cosine
retrieval doesn't include the recorded answer in top-K, no boost can
promote it.
The "Shape B" upgrade mentioned in the playbook.go comment — inject
playbook hits directly even when they weren't in the top-K — is what
would close this gap. The reality test surfaced exactly the gap the
docs warned about. Worth filing as the next product gate.
Run-to-run variance also visible: v1 had 8 discoveries, v2 had 2.
HNSW insertion order + judge variance both contribute. Stability of
Q7 and Q13 across both runs (lifted in v1 AND v2) is the most reliable
signal in the dataset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.
Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:
1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
wrong domain for staffing queries; replaced with ethereal_workers
(10K rows, real staffing schema, "e-" id prefix to avoid collision
with workers' "w-"). staffing_workers driver gains -index-name +
-id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running
   ~30s per judge call against the lift loop; reverted to
   qwen2.5:latest (~1s/call, 30× faster; lift results held)
Each contract drift (fixes 1 and 3 above) is now locked into a
cmd/<bin>/main_test.go so future drift fires in `go test`, not in a
reality run. R-005 closed:
- cmd/matrixd/main_test.go (new) — playbook record drift detector +
score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode
`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.
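The drift-detector pattern those tests share, sketched with a stand-in
type (the real tests assert against the daemons' actual request structs):

    import (
        "encoding/json"
        "strings"
        "testing"
    )

    // TestSearchRequestWireContract locks the field name the driver
    // sends, so a rename of query_text fails in `go test`, not in a
    // reality run.
    func TestSearchRequestWireContract(t *testing.T) {
        type searchRequest struct { // stand-in for the real matrixd type
            QueryText string `json:"query_text"`
        }
        b, err := json.Marshal(searchRequest{QueryText: "forklift"})
        if err != nil {
            t.Fatal(err)
        }
        if !strings.Contains(string(b), `"query_text"`) {
            t.Fatalf("wire contract drifted: %s", b)
        }
    }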
Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
Queries 21 (staffing-domain, 7 categories)
Discoveries 8 (judge ≠ cosine top-1)
Lifts 7/8 (87.5%)
Boosts triggered 9
Mean Δ distance -0.053 (warm closer than cold)
OOD honesty dental/RN/SWE rated 1, no fake matches
Cross-corpus boosts confirmed (e- ↔ w- swaps in lifts)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
migrate the reality-test harness's judge-model default from a
hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge.
resolution priority: explicit -judge flag > $JUDGE_MODEL env >
cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback.
bumping the judge for run #N+1 now means editing one line in
lakehouse.toml [models].local_judge — no Go file or shell script
edits required.
changes:
- scripts/playbook_lift/main.go: -config flag added, judge default
flips to "" so resolution chain runs. Imports internal/shared for
config loader.
- scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash;
EFFECTIVE_JUDGE resolved by mirroring the Go chain (env > config
grep > qwen3.5:latest fallback). Used for the Ollama presence
check + report header. Pre-flight grep avoids requiring jq just
to read the toml.
- reports/reality-tests/README.md: documents the 4-step priority
chain.
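a sketch of the resolution chain as the Go driver now runs it (the
Config shape is guessed from cfg.Models.LocalJudge; the real loader
lives in internal/shared):

    import "os"

    // Config mirrors only the field this log mentions.
    type Config struct {
        Models struct {
            LocalJudge string
        }
    }

    // resolveJudgeModel: flag > $JUDGE_MODEL > lakehouse.toml > fallback.
    func resolveJudgeModel(flagJudge string, cfg *Config) string {
        if flagJudge != "" {
            return flagJudge // explicit -judge flag wins
        }
        if env := os.Getenv("JUDGE_MODEL"); env != "" {
            return env
        }
        if cfg != nil && cfg.Models.LocalJudge != "" {
            return cfg.Models.LocalJudge // [models].local_judge
        }
        return "qwen3.5:latest" // DefaultConfig fallback
    }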
verified all 4 paths produce the expected judge:
- config (no env): qwen3.5:latest (from lakehouse.toml)
- env override: env wins
- flag override: flag wins over env
- missing config: DefaultConfig fallback still gives qwen3.5:latest
just verify PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
rates top-K → record playbook entry pointing at the highest-rated
result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
ranking shift. Lift = real if recorded answer becomes top-1.
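The discovery and lift definitions reduce to two comparisons; a minimal
sketch (helper names hypothetical):

    // isDiscovery: the judge's best answer differs from cold cosine
    // top-1 — i.e. distance alone would have picked the wrong worker.
    func isDiscovery(coldTopK []string, judgeBestID string) bool {
        return len(coldTopK) > 0 && coldTopK[0] != judgeBestID
    }

    // liftHeld: the recorded (judge-picked) answer surfaced as warm top-1.
    func liftHeld(warmTopK []string, recordedID string) bool {
        return len(warmTopK) > 0 && warmTopK[0] == recordedID
    }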
Files:
- scripts/playbook_lift/main.go driver (391 LoC)
- scripts/playbook_lift.sh stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt query corpus (5 placeholders;
J writes real 20+)
- reports/reality-tests/README.md framework + interpretation
- .gitignore track reports/reality-tests/
but ignore per-run JSON evidence
This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.
Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audited stash-clean c7e3124 (30 commits past rerun-1 4840c10).
3 HIGH risks closed (R-002 internal/shared, R-003 internal/storeclient,
R-008 queryd/db.go). 3 advanced to partial (R-001 via fail-loud-bind +
opt-in auth, R-006 via g2_smoke_fixtures, R-007 via ADR-003 auth.go).
Biggest move: Agent Memory Correctness 4 → 9 — pathway Mem0 ops
(ADD/UPDATE/REVISE/RETIRE/HISTORY) all tested, including cycle-detection
and retired-trace-exclusion. Sprint 2 acceptance criteria are now
verified code, not design-bar work.
Two new findings:
- F1 (MED): cmd/{matrixd,observerd,pathwayd}/main_test.go absent —
reopens R-005 against new daemons.
- F2 (LOW): scripts/staffing_*/main.go flag defaults point at
  /home/profit/lakehouse/data/...
Evidence under reports/scrum/_evidence/rerun2/ (local; per
.gitkeep convention).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-runs the SCRUM.md framework against HEAD (4840c10) to score the
delta from the audit baseline at 91edd43. Composite +8.
Scoring deltas:
Reproducibility 7 → 9 (just verify, just doctor, pre-push hook)
Test Coverage 6 → 8 (168 proof harness assertions; Go-test
gaps in shared/storeclient remain)
Trust Boundary 7 → 7 (no code change; R-001/R-007 open)
Memory Correctness 3 → 4 (vectord persistence proven; Mem0
pathway/playbook still not ported)
Deployment Readiness 4 → 5 (just doctor; REPLICATION/systemd open)
Maintainability 8 → 8 (spine unchanged; harness obeys
CLAUDE_REFACTOR_GUARDRAILS)
Risk register changes:
R-004 (smokes not gated) CLOSED — just verify + pre-push hook
R-005 (cmd/main.go untested) partial — proof harness covers wiring
R-012 (empty tests/ dir) CLOSED — populated by harness
R-001/R-002/R-003/R-006/R-007/R-008/R-009/R-010 unchanged
Sprint 0 progress:
S0.1 just doctor DONE
S0.3 just verify + pre-push DONE
S0.6 tests/ dir cleanup DONE
S0.2 just smoke-fixtures open
S0.4 cmd/main_test × 6 partial (harness coverage; go-test gap)
S0.5 shared/storeclient tests open (HIGH risks still unaddressed)
New finding from this rerun (worth recording):
Queryd refresh-tick race in 04_query_correctness — cache-warm
binaries fire SELECTs faster than queryd's 500ms refresh tick.
Caught by integration mode going 104/0/1 → 102/1/1, fixed at
4840c10 with proof_wait_for_sql helper. Exactly the failure-mode
the harness was designed to catch.
Original 5 audit reports preserved as immutable history at
91edd43; this file documents the delta only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adapts docs/SCRUM.md framework (originally written for the
matrix-agent-validated repo) to the Go rewrite. Five deliverables:
golang-lakehouse-scrum-test.md top-line + scoring + verdict
risk-register.md 12 findings, R-001..R-012
claim-coverage-table.md claim/test/risk for Sprint 2
sprint-backlog.md 5 sprints, ~2 weeks of work
acceptance-gates.md DoD as runnable commands
Every claim cites file:line, command output, or "missing evidence."
Smoke chain ran clean (33s wall, all 9 PASS) and is captured in
reports/scrum/_evidence/smoke_chain.log (gitignored — runtime artifact).
Scoring:
Reproducibility 7/10 9 smokes deterministic, no just/CI gate
Test Coverage 6/10 internal/ packages tested, 6/7 cmd/ aren't
Trust Boundary 7/10 escapes ok, zero auth, /sql is RCE-eq off-loopback
Memory Correctness 3/10 pathway/playbook/observer not yet ported
Deployment Readiness 4/10 no REPLICATION, no env template, no systemd
Maintainability 8/10 no god-files, 7 lean binaries, ADRs current
Top three risks:
R-001 HIGH queryd /sql + DuckDB + non-loopback bind = RCE-equivalent
R-002 HIGH internal/shared (server.go + config.go) zero tests
R-003 HIGH internal/storeclient zero tests, used by 2 services
R-004 MED 9-smoke chain green but not gated (no justfile/hook)
The audit is the work; refactors come after. Sprint 0 owns coverage
+ CI gating; Sprint 1 owns trust-boundary decisions; Sprints 2-3 are
mostly design-bar work for unbuilt agent components.
.gitignore exception: /reports/* + !/reports/scrum/ keeps reports/
a runtime-artifact directory while exposing reports/scrum/ as
tracked documentation. Mirrors the pattern future audit passes will
land in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>