

# Scrum Loop Notes — Observations across iterations

Running notes from the 6x scrum loop (started 2026-04-23). One section per iteration. "Fix next loop" items accumulate here so the next scrum run picks them up — do not fix inline during a running iteration.

## Iteration tracker

| Iter | Status | Scrum started | Scrum finished | Fixes applied | Build green | Re-sweep findings |
|------|--------|---------------|----------------|---------------|-------------|-------------------|
| 1 | 🟡 scrum running | 2026-04-23 (brqz3jxgo) | - | - | - | - (baseline = 19) |
| 2 | 🟡 scrum running | 2026-04-23 (bzs6miehr) | - | - | - | pending |
| 3 | queued | - | - | - | - | - |
| 4 | queued | - | - | - | - | - |
| 5 | queued | - | - | - | - | - |
| 6 | queued | - | - | - | - | - |

## Iteration 1 — in flight

Target files: 21 source files extracted from the 19 Phase 0→42 findings. Ladder: cloud-first per feedback_scrum_cloud_first.md (gpt-oss:120b → qwen3.5:397b → devstral-2:123b → mistral-large-3:675b → gpt-oss:20b → qwen3.5:latest). Proposal: docs/SCRUM_FIX_WAVE.md (via LH_SCRUM_PROPOSAL env).

### Fix next loop — observations accumulating

Add items here as the scrum runs. Keep each item to one line with a pointer to file + reason. Don't fix inline.

[ITER 2 OBSERVATIONS]

  • [FORENSIC vs thin-detector mismatch] iter 2's first attempt on auth.rs triggered a "thin/unstructured" rejection at 2031 chars. Cause: the forensic prompt asks for strict JSON verdict output, while scrum's thin-answer detector expects markdown with a score + table. The detector logic needs a forensic-aware branch OR the forensic prompt should preserve the markdown output shape while still applying the 8 audit passes. File: tests/real-world/scrum_master_pipeline.ts, the function that scores accepted vs thin. Fix next loop: add isForensicAcceptable(text) that checks for a "verdict" field plus at least one of critical_failures/pseudocode_flags/required_next_actions (sketch after this list).

  • [OBSERVATION metric] 11 #[allow(dead_code)] markers cluster in crates/gateway/{auth,access,tools/registry,execution_loop,v1/truth} + crates/aibridge/providers/openrouter + crates/vectord/service. Each one maps cleanly to an audit finding. The execution_loop/mod.rs:85 comment even admits it: // reserved for Phase 42 truth-gate (step 6). Metric: fewer #[allow(dead_code)] markers per iteration = less pseudo-real code. Baseline = 11. Target after iter 6: ≤ 2 (only the ones that are genuinely optional helpers). A counting sketch follows this list.

  • [OBSERVATION gateway-as-router] scrum_master_pipeline currently fetches GATEWAY/v1/chat directly, but its LADDER is still a hardcoded const. It should be driven by config/routing.toml via RoutingEngine (blocked by P40-001 until the iter 1 fix lands). File: tests/real-world/scrum_master_pipeline.ts:53.

  • [OBSERVATION file-type] iter 1 target list is .rs only. Iter 2 must include tests/multi-agent/*.ts (executor, observer, kb consumer), auditor/checks/*.ts, sidecar/sidecar/*.py, and config/*.{json,toml}. The scrum pipeline handles any text file.

  • [OBSERVATION triangulation] auth.rs scrum review (first file out) independently identified P5-001 exactly: flagged #[allow(dead_code)], scored alignment 4/10, prescribed an AgentIdentity { name, role, hashed_key } type matching SCRUM_FIX_WAVE. Audit + scrum converged without seeing each other's output — strong signal the findings are real, not artifacts of one method.

  • [RULE from J 2026-04-23] Wiring-gap fixes happen AFTER the scrum completes, not inline. Accumulate observations, apply in one coherent pass. Matches feedback_audit_findings_log.md.

  • [OBSERVATION oversize-file] crates/gateway/src/execution_loop/mod.rs is 80,901 chars → 24 shards (the scrum pipeline's tree-split kicks in at the 6KB threshold). A single file of this size for an execution module is itself a smell — it's the Phase 43 scaffold we kept piling into. Split candidates: executor prompts, reviewer prompts, budget accounting, truth-gate hook, fixtures. Not a fix for this iter, but queue for iter 3.

  • [OBSERVATION cost-tracking] zero escalations across the first 8 files — $0.00 of cloud spend above the minimum. Per-request cost on gpt-oss:120b via Ollama Cloud is effectively $0 in this environment (self-hosted or flat-rate per the llm_team_config key). If we add per-iter token totals to scrum_loop_metrics.jsonl we can show trajectory even when cost is flat.
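
A minimal sketch of the forensic-aware acceptance branch proposed above, assuming the forensic output is strict JSON with the field names listed in these notes; the real detector in scrum_master_pipeline.ts may score things differently:

```typescript
// Forensic reviews are strict JSON, not markdown, so the markdown thin-answer
// heuristics reject them. Accept a response as forensic if it parses as JSON,
// carries a known verdict, and has at least one substantive section.
function isForensicAcceptable(text: string): boolean {
  let parsed: any;
  try {
    parsed = JSON.parse(text);
  } catch {
    return false; // not JSON: fall through to the existing markdown thin-check
  }
  const hasVerdict =
    typeof parsed.verdict === "string" &&
    ["pass", "fail", "needs_patch"].includes(parsed.verdict);
  const hasSubstance = ["critical_failures", "pseudocode_flags", "required_next_actions"]
    .some((key) => Array.isArray(parsed[key]) && parsed[key].length > 0);
  return hasVerdict && hasSubstance;
}
```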
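A rough counter for the #[allow(dead_code)] metric above, so the per-iteration number can be captured mechanically; the walk over crates/ and the output format are assumptions, not existing tooling:

```typescript
// Count #[allow(dead_code)] markers per file under crates/ (baseline 11,
// target after iter 6: ≤ 2). Run with bun or ts-node from the repo root.
import { readdirSync, readFileSync, statSync } from "fs";
import { join } from "path";

function countDeadCodeMarkers(dir: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const entry of readdirSync(dir)) {
    const path = join(dir, entry);
    if (statSync(path).isDirectory()) {
      for (const [file, n] of countDeadCodeMarkers(path)) counts.set(file, n);
    } else if (path.endsWith(".rs")) {
      const n = readFileSync(path, "utf8").split("#[allow(dead_code)]").length - 1;
      if (n > 0) counts.set(path, n);
    }
  }
  return counts;
}

const counts = countDeadCodeMarkers("crates");
const total = [...counts.values()].reduce((a, b) => a + b, 0);
console.log(`${total} #[allow(dead_code)] markers across ${counts.size} files`);
```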

[ITER 3 OBSERVATIONS]

  • [LARGE-HANDLER thin] kimi-k2:1t went thin on crates/gateway/src/tools/service.rs (~11KB, single large axum handler). deepseek-v3.1:671b rescued on attempt 2 (92.8s, 5408 chars, accepted). Pattern: very large routing files challenge even 1T models. Fix next loop: raise tree-split threshold for handler files OR shard by function boundaries not byte count.

  • [WRITE-ONLY INDICATORS STILL] 8 KB (knowledge-base) .jsonl files are still write-only after iter 3: audits.jsonl (189 rows/1.9MB — biggest waste), phase_sweep_findings.jsonl (35), distilled_facts.jsonl (17), human_overrides.jsonl (8), classifications.jsonl (5), scrum_loop_metrics.jsonl (2), distilled_config_hints.jsonl (2), distilled_procedures.jsonl (2). Fix next loop: extend auditor/checks/kb_query.ts to surface these on PR review, OR build a single "KB health dashboard" reader (a minimal reader sketch follows this list).

  • [ISOLATED AUTOTUNE] crates/vectord/src/agent.rs has zero refs to scrum/audit/human_override KB. It tunes HNSW but doesn't know which indexes are attached to files the scrum flagged. Fix next loop: add TriggerEvent::CodeReviewFlag { index_name, gradient_tier } that biases trial budget toward indexes of flagged files.

  • [CONFIDENCE WELL-CALIBRATED] kimi-k2:1t confidences span 75-98 across iter3 files, cluster 85-95. No 100% inflation; min 75 = honest edge-case uncertainty. Good signal — the model is calibrating, not performance-signaling. Do NOT "fix" this by prompt-boosting confidence.

  • [SCRUM→OBSERVER WIRED] 2026-04-24 fix landed in iter-3 source but applies to iter 4+ (bun loaded code pre-edit). Verify next loop: curl :3800/stats should show by_source.scrum > 0 after iter 4 runs. If zero, observer /event payload schema mismatch.

  • [LOW-CONFIDENCE BLOCK flag WORKING] crates/storaged/src/federation_service.rs scored 3/10 with confidences [40, 35, 50] → avg 42, min 35. Permission Gradient correctly classifies this file as block tier (min <50) — exactly the signal we want. kimi-k2:1t is honestly saying "I'm not confident about these suggestions." Human must review before any auto-apply. Validates Layer #6 working end-to-end. DO NOT fix this — it's the system protecting us.

  • [CONFIDENCE-COLUMN DROPOUT] crates/gateway/src/main.rs accepted review has 0 confidences extracted despite the prompt requiring them. Either kimi-k2:1t omitted the Confidence column in its table format (regex didn't match), or the specific content ran into token budget and confidence got trimmed. Fix next loop: strengthen the prompt instruction to say "if you omit the Confidence column you MUST provide Confidence: NN% inline after every suggestion" AND post-accept sanity-check the row (reject if findings_count=0 across 5+ suggestions).

  • [SCORE-PARSER MISS on some files] crates/queryd/src/service.rs and crates/queryd/src/delta.rs reviews return score='?' — the regex (\d)\s*/\s*10 isn't matching. Models may have written the score as "Score: 4.5/10" or "4/10 (mid)", which the parser can't see. Fix next loop: make the regex accept decimals and surrounding context (sketch after this list). Not critical — confidences still landed.

  • [SCORE DISTRIBUTION] across 12 iter-3 files: 3/10 (7 files), 4/10 (2), 6/10 (1), ?/? (2). vectord/index_registry.rs alone scored 6/10 — meaningfully higher. That's worth a closer look: what does that file do right? If we find the pattern there and replicate to peers, scores should rise. This is learning-loop signal.

  • [FORENSIC JSON OUTPUT IS RICHER THAN MARKDOWN] kimi-k2:1t emitted strict JSON-schema output (matching SCRUM_FORENSIC_PROMPT.md) for vectord/src/index_registry.rs. The structure separates critical_failures / pseudocode_flags / prd_mismatches / broken_pipelines / missing_components / risk_points / verified_components / required_next_actions — each with a confidence per entry. Markdown reviews never captured verified_components (what's confirmed working). Fix next loop: have the scrum pipeline detect JSON format in accepted output and extract structured fields into the KB row (verified_components_count, critical_failures_count, missing_components_count, verdict); an extraction sketch follows this list. Downstream consumers can then filter PR files by verdict: needs_patch or prioritize by missing_components_count. This is the biggest next-iter KB quality jump available — it goes from "confidence as a scalar" to "confidence per specific claim with an evidence field." verified_components in particular is the PROOF-OF-LIFE signal that tells us what's real — the flipside of the P9/P13/P42 gaps.

  • [🔴 TREE-SPLIT REVIEW SHARD-LEAKAGE] ingestd/src/service.rs (7 shards, 24.3KB) accepted output is titled "Forensic Audit Report ... (shard 3)". The review covers only the Postgres-import path (shard 3). The ingest_file handler where my P9-001 fix lives (journal.record_ingest call) is in shard 1 or 2 — that reviewer never saw the fix. tree_split_fired: true is supposed to mean the output is the reducer-merged summary of all shards, but this review retained shard-specific scope. Either (a) the reduce step didn't integrate shard summaries, (b) the accepted attempt was one individual shard response that slipped past the reducer, or (c) the reducer prompt doesn't instruct the model to present the file holistically. This is a real correctness bug — it means file-level findings can be ghost-negative (fix applied, reviewer blind to it) and ghost-positive (gap exists in unreviewed shard, reviewer gives clean bill). Fix next loop: examine the reduce-step prompt in tree-split path, ensure accepted output comes from reduce step not from any individual shard. Validate by running on a file with a known fix and confirming the review notices it.
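
A minimal sketch of the single "KB health dashboard" reader floated in the write-only-indicators item: row counts per .jsonl file, so write-only growth is visible at a glance. The kb/ directory is an assumption; the file list comes from the item above:

```typescript
// Print a row count per knowledge-base .jsonl file. Files that only ever grow
// and are never read are the write-only indicators flagged above.
import { readFileSync } from "fs";
import { join } from "path";

const KB_DIR = "kb"; // assumed location of the .jsonl knowledge-base files
const KB_FILES = [
  "audits.jsonl", "phase_sweep_findings.jsonl", "distilled_facts.jsonl",
  "human_overrides.jsonl", "classifications.jsonl", "scrum_loop_metrics.jsonl",
  "distilled_config_hints.jsonl", "distilled_procedures.jsonl",
];

for (const file of KB_FILES) {
  let rows = 0;
  try {
    rows = readFileSync(join(KB_DIR, file), "utf8")
      .split("\n")
      .filter((line) => line.trim().length > 0).length;
  } catch { /* file may not exist yet */ }
  console.log(`${file.padEnd(34)} ${rows} rows`);
}
```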
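A sketch of the more tolerant score parser suggested in the SCORE-PARSER MISS item, assuming the pipeline extracts the score from the accepted review as a plain string:

```typescript
// Accept decimals ("4.5/10"), optional "Score:" prefixes, and trailing
// qualifiers like "(mid)"; the old (\d)\s*/\s*10 regex missed these.
function parseAlignmentScore(review: string): number | null {
  // \d+(?:\.\d+)? matches an integer or decimal numerator before "/10".
  const match = review.match(/(\d+(?:\.\d+)?)\s*\/\s*10/);
  return match ? parseFloat(match[1]) : null;
}

// e.g. parseAlignmentScore("Score: 4.5/10 (mid)") === 4.5
```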
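A sketch of extracting the structured forensic fields into the KB row, as proposed in the FORENSIC JSON item. The field names mirror the schema described above; the returned shape is illustrative, not the pipeline's actual row type:

```typescript
// Structured fields pulled from an accepted forensic_json review. Markdown
// reviews get verdict=null and zeroed counters, matching the schema-v4 rows.
interface ForensicKbFields {
  output_format: "forensic_json" | "markdown";
  verdict: "pass" | "fail" | "needs_patch" | null;
  critical_failures_count: number;
  verified_components_count: number;
  missing_components_count: number;
}

function extractForensicFields(accepted: string): ForensicKbFields {
  try {
    const doc = JSON.parse(accepted);
    return {
      output_format: "forensic_json",
      verdict: doc.verdict ?? null,
      critical_failures_count: (doc.critical_failures ?? []).length,
      verified_components_count: (doc.verified_components ?? []).length,
      missing_components_count: (doc.missing_components ?? []).length,
    };
  } catch {
    // Not JSON: treat as a markdown review with no structured counters.
    return {
      output_format: "markdown",
      verdict: null,
      critical_failures_count: 0,
      verified_components_count: 0,
      missing_components_count: 0,
    };
  }
}
```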

[ITER 4 OBSERVATIONS]

  • [🔴 OBSERVER DROPS SCRUM RICH FIELDS] Scrum→observer wiring works (by_source.scrum appears in /stats on iter 4 file 1). BUT the observer.ts:262-283 ObservedOp = {...} literal only spreads known keys (endpoint, success, duration_ms, role, city, state, count, rescue_*). My scrum-specific fields (confidence_avg, confidence_min, gradient_tier, verdict, critical_failures_count, verified_components_count, missing_components_count, alignment_score, output_format, findings_count, attempts_made, thin_rejections) are silently discarded. The observer knows the scrum event happened but loses the review-quality data. Fix next loop: add a metadata?: Record<string, any> passthrough on ObservedOp, or declare the scrum-specific fields explicitly. Preferred: the metadata passthrough, so future sources (auditor, kb_extractor) land the same way (sketch after this list).
  • [SCHEMA V4 LANDING CORRECTLY] main.rs iter-4 KB row has alignment_score=3 (decimal parser fixed), output_format="markdown" (classifier works), verdict=null (correct — only forensic_json produces verdict), confidence_avg=91 (previous iter got 0 due to column dropout — run-to-run variance self-healed this). Structured counters (critical/verified/missing) = 0 on markdown rows, populated on forensic_json rows.
  • [RING BUFFER EVICTING LANGFUSE] observer ring hit 2000 cap; first 2 scrum events pushed 2 langfuse entries out (1999 → 1997). Not a bug — ring works as designed — but means old-context retention is bounded. If we care about historical Langfuse traces we need a larger ring OR a separate per-source ring.
  • [UI Playwright probe found 2 real bugs] (fixed 2026-04-24): (a) ui/server.ts tryFetch relied on the content-type header to decide JSON vs text; the observer's Bun.serve returns JSON without an application/json content-type, so stats came back as strings — the UI showed "0 ops" instead of 2000. Fixed: always attempt JSON.parse, fall back to raw text. (b) ui.js renderNodeContext used Object.entries(n.health), which iterates characters when given a string — gateway /health returns "lakehouse ok" and the panel showed rows like 0=l, 1=a, 2=k, .... Fixed: primitive-vs-object guard. Both were invisible in functional tests — only a real browser render exposed them. Worth adding a Playwright smoke test to CI for any future UI changes. Sketches of both fixes follow this list.
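
A sketch of the preferred metadata passthrough on ObservedOp. The interface is trimmed to a few keys for brevity, and the real observer.ts type and event shape are assumptions:

```typescript
// Keep the known keys typed, and funnel everything else into metadata so
// source-specific fields (scrum confidence_avg, gradient_tier, verdict, ...)
// are retained instead of being silently dropped.
interface ObservedOp {
  endpoint: string;
  success: boolean;
  duration_ms: number;
  source?: string;
  // Catch-all for source-specific fields; future sources (auditor,
  // kb_extractor) land the same way without new interface changes.
  metadata?: Record<string, unknown>;
}

function toObservedOp(event: Record<string, unknown>): ObservedOp {
  const { endpoint, success, duration_ms, source, ...rest } = event as any;
  return {
    endpoint: String(endpoint ?? ""),
    success: Boolean(success),
    duration_ms: Number(duration_ms ?? 0),
    source: source ? String(source) : undefined,
    metadata: Object.keys(rest).length ? rest : undefined, // nothing discarded
  };
}
```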
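Sketches of the two UI fixes, assuming tryFetch and the health rendering work roughly as described in the item; only the parse-then-fallback and primitive-guard behavior is taken from the notes:

```typescript
// (a) Always attempt JSON.parse regardless of content-type, fall back to text.
async function tryFetch(url: string): Promise<unknown> {
  const res = await fetch(url);
  const body = await res.text();
  try {
    return JSON.parse(body); // observer returns JSON without application/json
  } catch {
    return body; // plain-text endpoints, e.g. gateway /health "lakehouse ok"
  }
}

// (b) Guard before Object.entries so a string health value renders as one row,
// not one row per character.
function healthRows(health: unknown): Array<[string, string]> {
  if (health === null || typeof health !== "object") {
    return [["health", String(health)]];
  }
  return Object.entries(health).map(([k, v]) => [k, String(v)] as [string, string]);
}
```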

### Iter 1 results

Populated after scrum finishes.

## Iteration 2 — queued

Prompt shape change (from J 2026-04-23): iter 2+ uses docs/SCRUM_FORENSIC_PROMPT.md as the system prompt, replacing the softer iter-1 framing. Adversarial auditor tone with 8 audit passes. Strict JSON output format with verdict: pass|fail|needs_patch. If the system can't prove itself, the verdict is FAIL.

Scrum pipeline change: scrum_master_pipeline.ts needs a new env var, LH_SCRUM_SYSTEM_PROMPT, to inject the forensic frame alongside the proposal doc. The file-level loop still asks for suggestions per file, but under the 8-pass adversarial lens.
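
A minimal sketch of how the pipeline could pick up the new env var and combine it with the proposal doc; the file paths and request shape are assumptions about scrum_master_pipeline.ts, not its current code:

```typescript
// Inject the forensic frame (system prompt) alongside the proposal doc.
import { readFileSync } from "fs";

const systemPromptPath = process.env.LH_SCRUM_SYSTEM_PROMPT; // e.g. docs/SCRUM_FORENSIC_PROMPT.md
const proposalPath = process.env.LH_SCRUM_PROPOSAL;          // e.g. docs/SCRUM_FIX_WAVE.md

const systemPrompt = systemPromptPath ? readFileSync(systemPromptPath, "utf8") : "";
const proposal = proposalPath ? readFileSync(proposalPath, "utf8") : "";

// The per-file loop still asks for suggestions, but under the adversarial lens.
function buildMessages(filePath: string, fileText: string) {
  return [
    { role: "system", content: systemPrompt || "You are a code reviewer." },
    { role: "user", content: `${proposal}\n\nReview ${filePath}:\n\n${fileText}` },
  ];
}
```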

Goal: self-host. The pipeline loads its ladder from config/routing.toml via the RoutingEngine that iter 1 wired. If that still isn't loaded, note the gap, proceed with the hardcoded ladder, and flag it for iter 3. Target expansion: beyond .rs to .ts (tests/multi-agent, auditor/), .py (sidecar), .md (docs).
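
A sketch of that fallback behavior, assuming Bun's native TOML imports and a hypothetical ladder key in config/routing.toml; the real RoutingEngine API may look nothing like this:

```typescript
// Prefer the ladder from config/routing.toml; otherwise fall back to the
// hardcoded const and note the gap for the next iteration.
const HARDCODED_LADDER = [
  "gpt-oss:120b", "qwen3.5:397b", "devstral-2:123b",
  "mistral-large-3:675b", "gpt-oss:20b", "qwen3.5:latest",
];

let configLadder: unknown;
try {
  // Bun parses .toml imports natively; the `ladder` key is a guess at the layout.
  configLadder = ((await import("../../config/routing.toml")) as any).default?.ladder;
} catch {
  configLadder = undefined; // routing.toml missing or not yet wired
}

const ladder: string[] =
  Array.isArray(configLadder) && configLadder.length > 0
    ? (configLadder as string[])
    : HARDCODED_LADDER;

if (ladder === HARDCODED_LADDER) {
  console.warn("routing.toml ladder not loaded; using hardcoded ladder (note gap, flag for iter 3)");
}
```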

## Iterations 3-6 — queued

Goal: measure trajectory. Each iteration should reduce the finding count, raise the unit test count, and reduce the grep-for-fake-markers count. If an iteration doesn't improve, that itself is the data point.

## Metrics per iteration

Capture after each re-sweep:

  • findings_total (baseline: 19)
  • findings_by_severity (baseline: 3h / 8m / 8l)
  • phases_partial_count (baseline: 9)
  • phases_real_count (baseline: 25 of 35)
  • rust_test_count (baseline: 194+)
  • gateway_test_fail_count (baseline: 1 — P38-001)
  • grep_hits_unimplemented (run: `grep -rEc 'todo!\(\)|unimplemented!\(\)|FIXME' crates/`)
  • grep_hits_pseudo (run: `grep -rEc '"placeholder"|"stub"|"mock"|"fake"' crates/`)
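
A sketch of capturing these as one scrum_loop_metrics.jsonl row per re-sweep. The grep invocations come from the list above; the row shape, the kb/ path, and the ITER env var are assumptions:

```typescript
// Append one metrics row per re-sweep so the trajectory is queryable later.
import { execSync } from "child_process";
import { appendFileSync } from "fs";

function grepHits(pattern: string): number {
  try {
    // -r recursive, -E extended regex, -c per-file counts; sum the counts here.
    const out = execSync(`grep -rEc '${pattern}' crates/ || true`, { encoding: "utf8" });
    return out.split("\n")
      .map((line) => parseInt(line.split(":").pop() ?? "0", 10) || 0)
      .reduce((a, b) => a + b, 0);
  } catch {
    return 0;
  }
}

const row = {
  sweep_id: `phase_sweep_2026-04-23-iter${process.env.ITER ?? "?"}`,
  captured_at: new Date().toISOString(),
  grep_hits_unimplemented: grepHits("todo!\\(\\)|unimplemented!\\(\\)|FIXME"),
  grep_hits_pseudo: grepHits('"placeholder"|"stub"|"mock"|"fake"'),
  // findings_total, findings_by_severity, rust_test_count, ... filled in per sweep
};
appendFileSync("kb/scrum_loop_metrics.jsonl", JSON.stringify(row) + "\n");
```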

## Rules for this loop

  1. Cloud-first for every iteration. Per feedback_scrum_cloud_first.md, strategic review uses 120B+ tier.
  2. One cross-cutting PR per iteration when possible. Meta-pattern from audit: identity+auth+access+journal+truth share a pipe. Fix them together.
  3. Build must be green before next iteration starts. A broken build is evidence the last iteration regressed, not progressed.
  4. Log findings to the jsonl as new rows per iteration with sweep_id: phase_sweep_2026-04-23-iterN. Never overwrite prior iteration's findings — the trajectory is the whole point.
  5. Don't fix things during an iteration. Every observation goes into the "Fix next loop" section above. The next iteration's scrum picks them up.