# Scrum Loop Notes — Observations across iterations
Running notes from the 6x scrum loop (started 2026-04-23). One section per iteration. "Fix next loop" items accumulate here so the next scrum run picks them up — do not fix inline during a running iteration.
## Iteration tracker
| Iter | Status | Scrum started | Scrum finished | Fixes applied | Build green | Re-sweep findings |
|---|---|---|---|---|---|---|
| 1 | 🟡 scrum running | 2026-04-23 (brqz3jxgo) | - | - | - | - (baseline = 19) |
| 2 | 🟡 scrum running | 2026-04-23 (bzs6miehr) | - | - | - | pending |
| 3 | ⬜ queued | - | - | - | - | - |
| 4 | ⬜ queued | - | - | - | - | - |
| 5 | ⬜ queued | - | - | - | - | - |
| 6 | ⬜ queued | - | - | - | - | - |
## Iteration 1 — in flight
**Target files:** 21 source files extracted from the 19 Phase 0→42 findings.
**Ladder:** cloud-first per feedback_scrum_cloud_first.md (gpt-oss:120b → qwen3.5:397b → devstral-2:123b → mistral-large-3:675b → gpt-oss:20b → qwen3.5:latest).
**Proposal:** `docs/SCRUM_FIX_WAVE.md` (via LH_SCRUM_PROPOSAL env).
### Fix next loop — observations accumulating
*Add items here as the scrum runs. Keep each item to one line with a pointer to file + reason. Don't fix inline.*
**[ITER 2 OBSERVATIONS]**
- **[FORENSIC vs thin-detector mismatch]** iter 2 first attempt on auth.rs triggered "thin/unstructured" rejection at 2031 chars. Cause: forensic prompt asks for strict JSON verdict output, scrum's thin-answer detector expects markdown with score + table. The detector logic needs a forensic-aware branch OR the forensic prompt should preserve markdown output shape while still applying the 8 audit passes. File: `tests/real-world/scrum_master_pipeline.ts`, function that scores accepted vs thin. Fix next loop: add `isForensicAcceptable(text)` that checks for `"verdict"` field + at least one of `critical_failures`/`pseudocode_flags`/`required_next_actions`.
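A minimal sketch of that forensic-aware branch, assuming the strict-JSON shape from SCRUM_FORENSIC_PROMPT.md (a `verdict` field plus structured finding lists — field names are the ones cited above, not verified against the prompt file):

```typescript
// Sketch of a forensic-aware acceptance branch for the thin-answer
// detector. Accepts strict-JSON forensic output that the markdown
// score+table heuristic would otherwise reject as "thin".
function isForensicAcceptable(text: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    return false; // not JSON → fall through to the markdown thin-detector
  }
  if (typeof parsed !== "object" || parsed === null) return false;
  const obj = parsed as Record<string, unknown>;
  if (!("verdict" in obj)) return false;
  // Require at least one structured finding list to be present.
  return ["critical_failures", "pseudocode_flags", "required_next_actions"]
    .some((k) => Array.isArray(obj[k]));
}
```

The detector would try this first and only apply the markdown score+table check when it returns false.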
- **[OBSERVATION metric]** 11 `#[allow(dead_code)]` markers cluster in crates/gateway/{auth,access,tools/registry,execution_loop,v1/truth} + crates/aibridge/providers/openrouter + crates/vectord/service. Each one maps cleanly to an audit finding. The `execution_loop/mod.rs:85` comment even admits it: `// reserved for Phase 42 truth-gate (step 6)`. **Metric:** fewer `#[allow(dead_code)]` markers per iteration = less pseudo-real code. Baseline = 11. Target after iter 6: ≤ 2 (only ones that are genuinely optional helpers).
- **[OBSERVATION gateway-as-router]** scrum_master_pipeline currently fetches `GATEWAY/v1/chat` directly but its LADDER is still a hardcoded const. Should be driven by `config/routing.toml` via RoutingEngine (blocked by P40-001 until iter 1 lands fix). File: `tests/real-world/scrum_master_pipeline.ts:53`.
- **[OBSERVATION file-type]** iter 1 target list is `.rs` only. Iter 2 must include `tests/multi-agent/*.ts` (executor, observer, kb consumer), `auditor/checks/*.ts`, `sidecar/sidecar/*.py`, and `config/*.{json,toml}`. The scrum pipeline handles any text file.
- **[OBSERVATION triangulation]** auth.rs scrum review (first file out) independently identified P5-001 exactly: flagged `#[allow(dead_code)]`, scored alignment 4/10, prescribed an `AgentIdentity { name, role, hashed_key }` type matching SCRUM_FIX_WAVE. Audit + scrum converged without seeing each other's output — strong signal the findings are real, not artifacts of one method.
- **[RULE from J 2026-04-23]** Wiring-gap fixes happen AFTER the scrum completes, not inline. Accumulate observations, apply in one coherent pass. Matches feedback_audit_findings_log.md.
- **[OBSERVATION oversize-file]** `crates/gateway/src/execution_loop/mod.rs` is 80,901 chars → 24 shards (scrum pipeline's tree-split kicks in at the 6KB threshold). A single file of this size for an execution module is itself a smell — it's the Phase 43 scaffold we kept piling into. Split candidates: executor prompts, reviewer prompts, budget accounting, truth-gate hook, fixtures. Not a fix for this iter, but queue for iter 3.
- **[OBSERVATION cost-tracking]** zero escalations across the first 8 files — $0.00 of cloud spend above the minimum. Per-request cost on gpt-oss:120b via Ollama Cloud is effectively $0 in this environment (self-hosted or flat-rate per the llm_team_config key). If we add per-iter token totals to scrum_loop_metrics.jsonl we can show trajectory even when cost is flat.
**[ITER 3 OBSERVATIONS]**
- **[LARGE-HANDLER thin]** kimi-k2:1t went thin on `crates/gateway/src/tools/service.rs` (~11KB, single large axum handler). deepseek-v3.1:671b rescued on attempt 2 (92.8s, 5408 chars, accepted). Pattern: very large routing files challenge even 1T models. Fix next loop: raise tree-split threshold for handler files OR shard by function boundaries not byte count.
- **[WRITE-ONLY INDICATORS STILL]** 8 knowledge-base files remain write-only after iter 3: `audits.jsonl` (189 rows/1.9MB — biggest waste), `phase_sweep_findings.jsonl` (35), `distilled_facts.jsonl` (17), `human_overrides.jsonl` (8), `classifications.jsonl` (5), `scrum_loop_metrics.jsonl` (2), `distilled_config_hints.jsonl` (2), `distilled_procedures.jsonl` (2). Fix next loop: extend `auditor/checks/kb_query.ts` to surface these on PR review, OR build a single "KB health dashboard" reader.
- **[ISOLATED AUTOTUNE]** `crates/vectord/src/agent.rs` has zero refs to scrum/audit/human_override KB. It tunes HNSW but doesn't know which indexes are attached to files the scrum flagged. Fix next loop: add `TriggerEvent::CodeReviewFlag { index_name, gradient_tier }` that biases trial budget toward indexes of flagged files.
- **[CONFIDENCE WELL-CALIBRATED]** kimi-k2:1t confidences span 75-98 across iter3 files, cluster 85-95. No 100% inflation; min 75 = honest edge-case uncertainty. Good signal — the model is calibrating, not performance-signaling. Do NOT "fix" this by prompt-boosting confidence.
- **[SCRUM→OBSERVER WIRED]** The 2026-04-24 fix landed in iter-3 source but applies to iter 4+ (bun had already loaded the pre-edit code). Verify next loop: `curl :3800/stats` should show `by_source.scrum > 0` after iter 4 runs. If zero, the observer /event payload schema is mismatched.
- **[LOW-CONFIDENCE BLOCK flag WORKING]** `crates/storaged/src/federation_service.rs` scored 3/10 with confidences **[40, 35, 50]** → avg 42, min 35. Permission Gradient correctly classifies this file as `block` tier (min <50) — exactly the signal we want. kimi-k2:1t is honestly saying "I'm not confident about these suggestions." Human must review before any auto-apply. Validates Layer #6 working end-to-end. **DO NOT fix this** — it's the system protecting us.
- **[CONFIDENCE-COLUMN DROPOUT]** `crates/gateway/src/main.rs` accepted review has 0 confidences extracted despite the prompt requiring them. Either kimi-k2:1t omitted the Confidence column from its table (so the regex never matched), or that review hit the token budget and the confidence column was truncated. Fix next loop: strengthen the prompt instruction to say "if you omit the Confidence column you MUST provide Confidence: NN% inline after every suggestion" AND post-accept sanity-check the row (reject if zero confidences were extracted across 5+ suggestions).
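A sketch of that post-accept sanity check (the row shape here is illustrative, not the pipeline's actual type):

```typescript
// Hypothetical post-accept guard: a review carrying several suggestions
// but zero extracted confidences indicates a format/parse failure, not
// a genuinely unconfident model.
interface AcceptedRow {
  suggestions_count: number;
  confidences: number[]; // percentages parsed out of the review
}

function passesConfidenceSanity(row: AcceptedRow): boolean {
  return !(row.suggestions_count >= 5 && row.confidences.length === 0);
}
```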
- **[SCORE-PARSER MISS on some files]** `crates/queryd/src/service.rs` and `crates/queryd/src/delta.rs` reviews return score='?' — regex `(\d)\s*/\s*10` isn't matching. Models may have written score as "Score: 4.5/10" or "4/10 (mid)" which the parser can't see. Fix next loop: regex accepts decimals + surrounding context. Not critical — confidences still landed.
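A possible widened parser, tolerating decimals and trailing context (a proposed sketch, not landed code):

```typescript
// Widened score parser: accepts "4.5/10" and "4/10 (mid)", not just a
// bare single digit as in the regex quoted above.
function parseScore(review: string): number | null {
  const m = review.match(/(\d{1,2}(?:\.\d+)?)\s*\/\s*10/);
  return m ? parseFloat(m[1]) : null;
}
```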
- **[SCORE DISTRIBUTION]** across 12 iter-3 files: 3/10 (7 files), 4/10 (2), 6/10 (1), ?/? (2). vectord/index_registry.rs alone scored 6/10 — meaningfully higher. That's worth a closer look: what does that file do right? If we find the pattern there and replicate to peers, scores should rise. This is learning-loop signal.
- **[FORENSIC JSON OUTPUT IS RICHER THAN MARKDOWN]** kimi-k2:1t emitted strict JSON-schema output (matching SCRUM_FORENSIC_PROMPT.md) for `vectord/src/index_registry.rs`. Structure separates critical_failures / pseudocode_flags / prd_mismatches / broken_pipelines / missing_components / risk_points / **verified_components** / required_next_actions — each with confidence per entry. Markdown reviews never captured `verified_components` (what's confirmed working). Fix next loop: scrum pipeline detects JSON format in accepted output and extracts structured fields into KB row (`verified_components_count`, `critical_failures_count`, `missing_components_count`, `verdict`). Downstream consumers then filter PR files by `verdict: needs_patch` or prioritize by `missing_components_count`. **This is the biggest next-iter KB quality jump available** — goes from "confidence as a scalar" to "confidence per specific claim with evidence field." Verified_components in particular is the PROOF-OF-LIFE signal that tells us what's real — flipside of P9/P13/P42 gaps.
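One way the detection-plus-extraction could look (field names assume the SCRUM_FORENSIC_PROMPT.md schema described above; the flat counter names match the proposed KB columns):

```typescript
// Sketch of the proposed JSON-format branch: detect forensic output and
// flatten its structured lists into KB-row counters.
interface ForensicKbFields {
  verdict: string | null;
  critical_failures_count: number;
  verified_components_count: number;
  missing_components_count: number;
}

function extractForensicFields(output: string): ForensicKbFields | null {
  let doc: Record<string, unknown>;
  try {
    doc = JSON.parse(output);
  } catch {
    return null; // markdown review → keep the scalar-confidence path
  }
  const count = (v: unknown) => (Array.isArray(v) ? v.length : 0);
  return {
    verdict: typeof doc.verdict === "string" ? doc.verdict : null,
    critical_failures_count: count(doc.critical_failures),
    verified_components_count: count(doc.verified_components),
    missing_components_count: count(doc.missing_components),
  };
}
```

Downstream consumers could then filter PR files by `verdict === "needs_patch"` or sort by `missing_components_count`.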
- **[🔴 TREE-SPLIT REVIEW SHARD-LEAKAGE]** `ingestd/src/service.rs` (7 shards, 24.3KB) accepted output is titled "Forensic Audit Report – ... (shard 3)". The review covers only the Postgres-import path (shard 3). The `ingest_file` handler where my P9-001 fix lives (journal.record_ingest call) is in shard 1 or 2 — that reviewer never saw the fix. **`tree_split_fired: true` is supposed to mean the output is the reducer-merged summary of all shards**, but this review retained shard-specific scope. Either (a) the reduce step didn't integrate shard summaries, (b) the accepted attempt was one individual shard response that slipped past the reducer, or (c) the reducer prompt doesn't instruct the model to present the file holistically. **This is a real correctness bug** — it means file-level findings can be ghost-negative (fix applied, reviewer blind to it) and ghost-positive (gap exists in unreviewed shard, reviewer gives clean bill). Fix next loop: examine the reduce-step prompt in tree-split path, ensure accepted output comes from reduce step not from any individual shard. Validate by running on a file with a known fix and confirming the review notices it.
**[ITER 4 OBSERVATIONS]**
- **[🔴 OBSERVER DROPS SCRUM RICH FIELDS]** Scrum→observer wiring works (by_source.scrum appears in /stats on iter 4 file 1). BUT observer.ts:262-283 `ObservedOp = {...}` literal only spreads known keys (endpoint, success, duration_ms, role, city, state, count, rescue_*). My scrum-specific fields (confidence_avg, confidence_min, gradient_tier, verdict, critical_failures_count, verified_components_count, missing_components_count, alignment_score, output_format, findings_count, attempts_made, thin_rejections) are silently discarded. Observer knows the scrum event happened but loses review-quality data. Fix next loop: add `metadata?: Record<string, any>` passthrough on ObservedOp, or declare scrum-specific fields explicitly. Preferred: metadata passthrough so future sources (auditor, kb_extractor) land the same way.
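A sketch of the preferred metadata-passthrough shape (`ObservedOp` is reduced to a few representative keys here; the real type in observer.ts carries more):

```typescript
// Known keys stay typed; source-specific fields ride along in
// `metadata` instead of being silently dropped by the literal spread.
interface ObservedOp {
  endpoint: string;
  success: boolean;
  duration_ms: number;
  metadata?: Record<string, unknown>; // scrum/auditor/kb_extractor extras
}

function toObservedOp(raw: Record<string, unknown>): ObservedOp {
  const { endpoint, success, duration_ms, ...rest } = raw;
  return {
    endpoint: String(endpoint),
    success: Boolean(success),
    duration_ms: Number(duration_ms),
    metadata: Object.keys(rest).length > 0 ? rest : undefined,
  };
}
```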
- **[SCHEMA V4 LANDING CORRECTLY]** main.rs iter-4 KB row has alignment_score=3 (decimal parser fixed), output_format="markdown" (classifier works), verdict=null (correct — only forensic_json produces verdict), confidence_avg=91 (previous iter got 0 due to column dropout — run-to-run variance self-healed this). Structured counters (critical/verified/missing) = 0 on markdown rows, populated on forensic_json rows.
- **[RING BUFFER EVICTING LANGFUSE]** observer ring hit 2000 cap; first 2 scrum events pushed 2 langfuse entries out (1999 → 1997). Not a bug — ring works as designed — but means old-context retention is bounded. If we care about historical Langfuse traces we need a larger ring OR a separate per-source ring.
- **[UI Playwright probe found 2 real bugs]** (fixed 2026-04-24): (a) ui/server.ts tryFetch relied on content-type header to decide JSON vs text; observer Bun.serve returns JSON without `application/json` content-type, so stats were strings — UI showed "0 ops" instead of 2000. Fixed: always attempt JSON.parse, fall back to raw text. (b) ui.js renderNodeContext used Object.entries(n.health) which iterates characters on a string — gateway /health returns "lakehouse ok" and the panel showed rows like `0=l, 1=a, 2=k, ...`. Fixed: primitive-vs-object guard. **Both were invisible in functional tests — only a real browser render exposed them.** Worth adding a Playwright smoke test to CI for any future UI changes.
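Simplified sketches of the two guards described above (not the literal ui/server.ts / ui.js code):

```typescript
// (a) Parse bodies by content, not content-type: observer's Bun.serve
// returns JSON without an application/json header.
function parseBody(raw: string): unknown {
  try {
    return JSON.parse(raw); // works regardless of the header
  } catch {
    return raw; // genuinely plain text
  }
}

// (b) Never Object.entries a primitive: gateway /health returns the
// string "lakehouse ok", and entries() on a string iterates characters.
function healthRows(health: unknown): [string, unknown][] {
  if (typeof health !== "object" || health === null) {
    return [["status", health]];
  }
  return Object.entries(health as Record<string, unknown>);
}
```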
### Iter 1 results
*Populated after scrum finishes.*
## Iteration 2 — queued
**Prompt shape change (from J 2026-04-23):** iter 2+ uses `docs/SCRUM_FORENSIC_PROMPT.md` as the system prompt, replacing the softer iter-1 framing. Adversarial auditor tone with 8 audit passes. Strict JSON output format with `verdict: pass|fail|needs_patch`. If the system can't prove itself, the verdict is FAIL.
**Scrum pipeline change:** `scrum_master_pipeline.ts` needs an env `LH_SCRUM_SYSTEM_PROMPT` (new) to inject the forensic frame alongside the proposal doc. The file-level loop still asks for suggestions per file but under the 8-pass adversarial lens.
**Goal:** Self-host. Pipeline loads its ladder from `config/routing.toml` via the RoutingEngine that iter 1 wired. If that still isn't loaded, note gap, proceed with hardcoded ladder, flag for iter 3.
**Target expansion:** beyond `.rs` to `.ts` (tests/multi-agent, auditor/), `.py` (sidecar), `.md` (docs).
## Iterations 3-6 — queued
**Goal:** measure trajectory. Each iteration reduces finding count, raises unit test count, reduces grep-for-fake-markers count. If any iteration doesn't improve, that's the data point.
## Metrics per iteration
Capture after each re-sweep:
- `findings_total` (baseline: 19)
- `findings_by_severity` (baseline: 3h / 8m / 8l)
- `phases_partial_count` (baseline: 9)
- `phases_real_count` (baseline: 25 of 35)
- `rust_test_count` (baseline: 194+)
- `gateway_test_fail_count` (baseline: 1 — P38-001)
- `grep_hits_unimplemented` run: `grep -rEc 'todo!\(\)|unimplemented!\(\)|FIXME' crates/`
- `grep_hits_pseudo` run: `grep -rEc '"placeholder"|"stub"|"mock"|"fake"' crates/`
## Rules for this loop
1. **Cloud-first for every iteration.** Per feedback_scrum_cloud_first.md, strategic review uses 120B+ tier.
2. **One cross-cutting PR per iteration when possible.** Meta-pattern from audit: identity+auth+access+journal+truth share a pipe. Fix them together.
3. **Build must be green before next iteration starts.** A broken build is evidence the last iteration regressed, not progressed.
4. **Log findings to the jsonl as new rows per iteration** with `sweep_id: phase_sweep_2026-04-23-iterN`. Never overwrite prior iteration's findings — the trajectory is the whole point.
5. **Don't fix things during an iteration.** Every observation goes into "Fix next loop" section above. Next iteration's scrum picks them up.
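For illustration, one hypothetical finding row under the rule-4 convention (only `sweep_id` is prescribed above; the other field names and values are assumptions drawn from the iter-1 observations):

```json
{"sweep_id": "phase_sweep_2026-04-23-iter2", "finding_id": "P5-001", "severity": "high", "file": "crates/gateway/src/auth.rs", "summary": "dead_code marker on unwired identity path"}
```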