root f753e11157


docs: SCRUM_MASTER_SPEC timeline — productization wave + verified live state

Splits the existing 04-25/26 section into two waves:
- experiment wave (mode-runner build-out, pre-productization)
- productization wave (OpenAI-compat, Archon, answers corpus,
  staffing native runner, multi-corpus + downgrade gate, observer
  paid escalation, /v1/chat → observer event wiring)

Adds verified-live block at the end with the numbers a fresh session
needs to anchor on: pathway memory 88 traces / 11 successful replays
at 100% (probation gate crossed), strong-model auto-downgrade firing
on grok-4.1-fast, and the auditor blind spot at static.ts:117 (now
fixed in 107a682).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-26 20:50:05 -05:00


Scrum Master Pipeline — Spec + Current State

Status: Active iteration on branch scrum/auto-apply-19814 → PR #11 at git.agentview.dev/profit/lakehouse
Branch commit head: see git log --oneline -1 scrum/auto-apply-19814 (auto-stale; check it)

2026-04-25 — see also docs/MATRIX_AGENT_HANDOVER.md for the standalone matrix-agent-validated repo split + the Ansible playbook that deploys it. Note: VPS at 192.168.1.145 is a TEST VENV ONLY (partial deploy); the real destination is the matrix-test Incus container at 10.111.129.50.

This doc is the single handoff artifact for the scrum-master + auto-apply + pathway-memory loop. A fresh Claude Code session reading this + docs/DECISIONS.md (ADR-020 and ADR-021) + docs/MATRIX_AGENT_HANDOVER.md + MEMORY.md should have the same context as the session that wrote it.

▶ Refactor timeline (read in order)

The pipeline has been refactored substantially since the 2026-04-24 baseline described in §1-§12 below. Read the changes top-down to understand the current shape:

2026-04-23 → 24 (foundation, captured in §1-§12 below)

9-rung cloud ladder + tree-split + adversarial prompt
Pathway memory base + ADR-021 semantic-correctness layer
Hardened auto-applier (5 gates: confidence/size/cargo/warnings/rationale)
Hand-review wire (commit 3f166a5) — judgment moved out of inner loop
Anchor-grounding post-verifier (commit 9cc0ceb / 9ecc584)
Single-model retry with enrichment (commit d187bcd) — stop cascading on quality
Unified matrix retriever pulling from ALL KB corpora (commit a496ced)
Paid OpenRouter ladder + Kimi K2.6 + Gemini 2.5 (commit 4ac5656)
Goal-driven autonomous loop harness (commit e79e51e)

2026-04-25 → 26 morning (mode-runner experiment wave)

Observer health-probe TypeConfusion fix (54689d5) — r.json() on text/plain /health was crash-looping the observer; sealed in pathway_memory as TypeConfusion:fetch-health-json.
Adjacency-pollution relevance filter (0115a60) — observer /relevance endpoint + scrum wiring (LH_RELEVANCE_FILTER / LH_RELEVANCE_THRESHOLD). Drops chunks about symbols the focus file IMPORTS but doesn't define.
Audit-consensus → retire wire (626f18d) — when observer rejects a hot-swap-recommended attempt, immediately call /vectors/pathway/retire on the trace. HotSwapCandidate gained trace_uid for single-trace precision. Confidence ≥0.7 gate avoids retiring on heuristic-fallback verdicts.
/v1/mode router phase 1 (d277efb) — task_class → mode/model decision endpoint with config/modes.toml. Decision-only; doesn't execute.
Native enrichment runner (86f63a0) — codereview_lakehouse mode that COMPOSES every primitive (focus file + bug fingerprints + relevance-filtered matrix + adversarial framing) into ONE prompt for one-shot success. POST /v1/mode/execute. Modes-as-prompt-molders, not model-pickers — see ★ Insight from session 2026-04-26.
Parameterized runner + 5 experiment modes (7c47734) — codereview_lakehouse|null|isolation|matrix_only|playbook_only. Each isolates one architectural axis. scripts/mode_experiment.ts sweeps files × modes; scripts/mode_compare.ts aggregates with grounding check (catches confabulation by comparing cited symbols to real file content).
Scrum mode-runner fast path (7c47734) — gated by LH_USE_MODE_RUNNER=1, scrum tries /v1/mode/execute BEFORE the 9-rung ladder. Falls through to ladder if response < LH_MODE_MIN_CHARS or anything errors. Off by default until A/B-validated.
Mode-compare grounding column (52bb216) — emoji-tolerant section regex + control-flag tagging. Caught playbook_only confabulation that hand-grading also found.
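
The fast-path fallthrough described for 7c47734 can be sketched as follows. This is a minimal illustration, not the pipeline's code: `reviewFile`, `runModeRunner`, `runLadder`, and the `opts` shape are hypothetical names, and the two callbacks stand in for POST /v1/mode/execute and the 9-rung ladder.

```typescript
// Hypothetical shapes; the real scrum pipeline's function names are not shown in this doc.
type Review = { text: string; source: "mode_runner" | "ladder" };

async function reviewFile(
  file: string,
  runModeRunner: (file: string) => Promise<string>, // stands in for POST /v1/mode/execute
  runLadder: (file: string) => Promise<string>, // stands in for the 9-rung ladder
  opts: { useModeRunner: boolean; minChars: number }, // LH_USE_MODE_RUNNER, LH_MODE_MIN_CHARS
): Promise<Review> {
  if (opts.useModeRunner) {
    try {
      const text = await runModeRunner(file);
      if (text.length >= opts.minChars) return { text, source: "mode_runner" };
      // Response too short: fall through to the ladder.
    } catch {
      // Any error: fall through to the ladder.
    }
  }
  return { text: await runLadder(file), source: "ladder" };
}
```

With `useModeRunner: false` (the documented default) the mode runner is never called, so the ladder behaves exactly as before the fast path existed.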

2026-04-26 evening (productization wave)

Override knobs + staffing native runner (56bf30c) — pass 2/3/4 harnesses, mode runner now serves staffing.fill task class natively, not just code review.
Multi-corpus runner + variance harness + strong-model downgrade gate (2dbc8db) — three corpora (arch / findings / symbols) selectable per mode. Paid models auto-downgrade: skip matrix corpus, isolation framing only. Driven by feedback_composed_corpora_anti_additive.md (composed corpora LOST 5/5 vs isolation on grok-4.1-fast, p=0.031).
OpenAI-compat alias + smart provider routing (3a0b37e) — gateway is now a drop-in middleware for any OpenAI SDK consumer. Three routing flavors verified via /tmp/archon-test/sdk-test.ts: openai/gpt-4o-mini, bare gpt-4o-mini, x-ai/grok-4.1-fast.
OpenAI multimodal content shape (540a9a2) — accepts content: [...] array-of-parts.
/v1/chat fires observer event (d1d97a0) — every chat call now lands both Langfuse trace AND observer /event (was Langfuse only).
Archon workflow (69919d9) — .archon/workflows/lakehouse-architect-review.yaml. 3 Pi nodes (shape → weakness → improvement) using openrouter/x-ai/grok-4.1-fast through the gateway.
Observer KB enrichment preamble (d9bd4c9) — observer prepends KB context to escalation prompts (was raw failure cluster).
Observer escalation → paid OpenRouter (340fca2) — deepseek-v3.1-terminus instead of free-tier rescue. Verified: diagnoses cite architectural patterns (circuit breaker, adapter files) instead of generic timeouts.
Gold-standard answer corpus (0844206) — scripts/build_answers_corpus.ts indexes lakehouse_answers_v1 from scrum_reviews.jsonl + observer_escalations.jsonl. Doc ID prefixes (review: vs escalation:) let consumers same-file-gate or broaden. Auto-rebuilds from scrum epilogue (LH_SCRUM_SKIP_ANSWERS_REBUILD=1 to disable). Observer buildKbPreamble now blends three sources (pathway + arch + answers); preamble grew 416 → 727 chars.

Verified live state (2026-04-26 ~23:30)

Pathway memory: 88 traces, 11/11 successful replays = 100% — hot-swap probation gate crossed; live recommendations firing.
Strong-model auto-downgrade verified: scrum on grok-4.1-fast → matrix corpus dropped, isolation mode auto-selected, 3 files accepted on attempt 1, ~27s each.
Auditor verdict on PR #11 head 0844206: block on 8 false positives — auditor/checks/static.ts:117 "field added but never read" check doesn't follow serde derives. Fix is in the auditor, not the code.

Verified architectural insights (2026-04-26 experiment)

codereview_lakehouse produces 100% grounded findings, beats every challenger.
codereview_playbook_only (pathway-only, no file content) confabulates ~50% of findings — keep as control, NEVER as recommendation.
codereview_null (no enrichment, generic prompt) produces 0 ranked findings — adversarial framing is load-bearing.
Matrix corpus contributes ~2 grounded findings vs isolation. Small but real.

Where to read what

Loop architecture (this doc, §1-§12): original 2026-04-24 design.
Modes-as-enrichment vision: crates/gateway/src/v1/mode.rs doc comment + config/modes.toml.
Mode experiment results: data/_kb/mode_experiments.jsonl + bun run scripts/mode_compare.ts.
Pathway memory mechanics: crates/vectord/src/pathway_memory.rs + ADR-021 in docs/DECISIONS.md.
Handover to fresh box: docs/MATRIX_AGENT_HANDOVER.md.

1. What the loop is

An autonomous review-and-commit pipeline built from five components:

Scrum master (tests/real-world/scrum_master_pipeline.ts) — walks a target-file list, asks a 9-rung escalation ladder of cloud models to produce a forensic audit against PRD + a change proposal doc, retries with learning context until acceptance, emits a structured review row.
Pathway memory (crates/vectord/src/pathway_memory.rs) — stores the full backtrack context of each review (attempts, KB chunks, flags, bug fingerprints) indexed by a narrow fingerprint (task_class + file_prefix + signal_class). On every new review, it prepends historical bug patterns as a preamble so the reviewer preempts recurrences. Retired pathways auto-exclude themselves from hot-swap eligibility.
Auto-applier (tests/real-world/scrum_applier.ts) — filters schema_v4 review rows by gradient_tier + confidence, asks qwen3-coder:480b for concrete old_string/new_string patches, runs cargo check --workspace, commits on green OR reverts on red/warning-count-up/rationale-mismatch.
Observer (mcp-server/observer.ts) — receives per-file /event emissions, escalates failure clusters to LLM Team via /v1/chat with qwen3-coder:480b.
Auditor (auditor/audit.ts) — external N=3 consensus re-check of scrum findings; writes to data/_kb/audit_facts.jsonl.

The guiding principle: every KB write has a reader, every PR claim is diff-verifiable.

2. The 9-rung ladder (cloud-first, strongest-model-first)

Defined in tests/real-world/scrum_master_pipeline.ts at const LADDER:

| # | Provider | Model | Role |
|---|---|---|---|
| 1 | ollama_cloud | `kimi-k2:1t` | flagship, 1T params |
| 2 | ollama_cloud | `qwen3-coder:480b` | coding specialist, 480B |
| 3 | ollama_cloud | `deepseek-v3.1:671b` | reasoning, 671B |
| 4 | ollama_cloud | `mistral-large-3:675b` | deep analysis, 675B |
| 5 | ollama_cloud | `gpt-oss:120b` | reliable workhorse |
| 6 | ollama_cloud | `qwen3.5:397b` | dense 397B, final thinker |
| 7 | openrouter | `openai/gpt-oss-120b:free` | free-tier rescue |
| 8 | openrouter | `google/gemma-3-27b-it:free` | fastest rescue |
| 9 | ollama | `qwen3.5:latest` | last-resort local |

Each attempt is evaluated by isAcceptable() (chars ≥ 3800 AND not a malformed JSON-only dump). On reject, the next rung sees a learning preamble with the prior rejection reason.
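
A minimal sketch of the accept/escalate loop, under stated assumptions: the "malformed JSON-only dump" check is approximated as "looks like bare JSON but does not parse", `climb` and `callModel` are hypothetical names, and the model call is synchronous here only to keep the example short (the real calls are network requests).

```typescript
const MIN_CHARS = 3800; // acceptance threshold quoted above

// Assumption: a "malformed JSON-only dump" starts like JSON but fails to parse.
function isAcceptable(output: string): { ok: boolean; reason?: string } {
  const text = output.trim();
  if (text.length < MIN_CHARS) {
    return { ok: false, reason: `too short: ${text.length} < ${MIN_CHARS} chars` };
  }
  if (text.startsWith("{") || text.startsWith("[")) {
    try {
      JSON.parse(text);
    } catch {
      return { ok: false, reason: "malformed JSON-only dump" };
    }
  }
  return { ok: true };
}

type Rung = { provider: string; model: string };

// Walks the ladder top-down; each rejected attempt feeds its rejection
// reason to the next rung as a learning preamble.
function climb(
  ladder: Rung[],
  callModel: (rung: Rung, prompt: string) => string, // hypothetical model call
  basePrompt: string,
): { acceptedModel: string; attempt: number; output: string } | null {
  let preamble = "";
  let attempt = 0;
  for (const rung of ladder) {
    attempt += 1;
    const output = callModel(rung, preamble + basePrompt);
    const verdict = isAcceptable(output);
    if (verdict.ok) {
      return { acceptedModel: `${rung.provider}/${rung.model}`, attempt, output };
    }
    preamble = `Previous attempt by ${rung.model} was rejected: ${verdict.reason}\n\n`;
  }
  return null; // every rung was rejected
}
```

The returned `acceptedModel` and `attempt` correspond in spirit to the `accepted_model` and `accepted_on_attempt` fields of a review row.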

3. Tree-split reducer

Files larger than FILE_TREE_SPLIT_THRESHOLD = 6000 bytes get chunked into FILE_SHARD_SIZE = 3500-byte shards. Each shard gets summarized via a fast rung, summaries are concatenated with internal §N§ markers, then fed as a SCRATCHPAD to the reviewer. The §N§ markers are stripped before the reviewer sees the merged context so it cannot claim "(shard 3)" in titles.

Bug this fixed: before tree-split, reviewers claimed fields were "missing" when the field simply sat past the 6KB context cutoff and was not actually absent.
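
The sharding and marker-stripping steps can be illustrated with a short sketch. Assumptions: string length stands in for byte length (exact only for ASCII source), `shard` and `buildScratchpad` are hypothetical names, and the per-shard summarization call is left out.

```typescript
const FILE_TREE_SPLIT_THRESHOLD = 6000; // bytes, from the text above
const FILE_SHARD_SIZE = 3500; // bytes per shard

// Assumption: content.length approximates the byte length (true for ASCII files).
function shard(content: string): string[] {
  if (content.length <= FILE_TREE_SPLIT_THRESHOLD) return [content];
  const shards: string[] = [];
  for (let off = 0; off < content.length; off += FILE_SHARD_SIZE) {
    shards.push(content.slice(off, off + FILE_SHARD_SIZE));
  }
  return shards;
}

// Summaries are joined with internal §N§ markers, and the markers are removed
// again before the reviewer sees the merged scratchpad.
function buildScratchpad(summaries: string[]): string {
  const merged = summaries.map((s, i) => `§${i + 1}§ ${s}`).join("\n");
  return merged.replace(/§\d+§ ?/g, "");
}
```

An 8000-character file yields three shards (3500 + 3500 + 1000), while a file of exactly 6000 characters is left whole because the threshold is "larger than".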

4. Schema v4 KB rows

data/_kb/scrum_reviews.jsonl — one row per accepted review. Fields:

{
  "file": "crates/queryd/src/service.rs",
  "reviewed_at": "2026-04-24T11:06:56Z",
  "accepted_model": "ollama_cloud/kimi-k2:1t",
  "accepted_on_attempt": 1,
  "attempts_made": 1,
  "tree_split_fired": true,
  "suggestions_preview": "<truncated-2000-char>",
  "confidences_per_finding": [92, 90, 88, 85, 75],
  "confidence_avg": 86,
  "confidence_min": 75,
  "findings_count": 5,
  "gradient_tier": "dry_run",          // auto ≥90 / dry_run ≥70 / simulation ≥50 / block <50
  "gradient_tier_avg": "dry_run",
  "alignment_score": 3,                // 1-10 self-rated
  "output_format": "forensic_json",
  "verdict": "fail",                   // pass | needs_patch | fail
  "critical_failures_count": 3,
  "pseudocode_flags_count": 0,
  "prd_mismatches_count": 4,
  "missing_components_count": 6,
  "verified_components_count": 2,
  "risk_points_count": 3,
  "schema_version": 4,
  "scrum_master_reviewed": true,
  // ADR-021 fields on pathway trace (NOT this row, see pathway_memory state.json)
  "pathway_hot_swap_hit": false,
  "pathway_id": null,
  "pathway_similarity": null,
  "pathway_success_rate": null,
  "rungs_saved": 0
}
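
The tier thresholds in the `gradient_tier` comment can be written out as a small function. This is a sketch under an assumption inferred from the sample row only: `gradient_tier` appears to follow `confidence_min` and `gradient_tier_avg` follows `confidence_avg`; the rounding of the average is also assumed.

```typescript
type GradientTier = "auto" | "dry_run" | "simulation" | "block";

// Thresholds from the row comment: auto >=90 / dry_run >=70 / simulation >=50 / block <50.
function gradientTier(confidence: number): GradientTier {
  if (confidence >= 90) return "auto";
  if (confidence >= 70) return "dry_run";
  if (confidence >= 50) return "simulation";
  return "block";
}

// Values from the sample row above.
const confidences = [92, 90, 88, 85, 75];
const confidenceMin = Math.min(...confidences);
const confidenceAvg = Math.round(
  confidences.reduce((a, b) => a + b, 0) / confidences.length, // rounding is an assumption
);

const tierFromMin = gradientTier(confidenceMin); // assumed source of gradient_tier
const tierFromAvg = gradientTier(confidenceAvg); // assumed source of gradient_tier_avg
```

For the sample row both tiers come out as `dry_run`, since one finding at 75 is enough to keep the whole review out of the `auto` tier.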

5. Applier hardened gates (landed `5e8d87b`)

tests/real-world/scrum_applier.ts has 5 gates between emitter output and commit. A patch must pass ALL:

Confidence gate — emitter's self-reported confidence >= MIN_CONF (default 90; 85 with relaxed env). Rejected patches log reason confidence NN < MM.
Size gate — max 6 lines changed per patch. Prevents cascading cross-file refactors.
Cargo-green gate — cargo check --workspace must pass. Red build → git checkout -- file.
Warning-count gate — workspace baseline warning count measured at start; after patch, new count must be ≤ baseline. Catches unused-import additions and dead_code-after-allow-removed patterns. THIS GATE CAUGHT 96b46cd's HashSet noise.
Rationale-diff token alignment — rationale text must share at least one non-stopword token with the patch's new_string. Catches "Add destructive SQL filter" claims on use tracing; diffs.
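
The rationale-diff gate is the least obvious of the five, so here is a minimal sketch of the token-overlap idea. The stopword list and the tokenizer are hypothetical; the applier's real ones are not shown in this doc.

```typescript
// Hypothetical stopword list for illustration only.
const STOPWORDS = ["a", "an", "the", "to", "of", "in", "on", "for", "and", "or", "add", "use", "fix", "with"];

// Lowercases, splits on non-identifier characters, drops stopwords and 1-char tokens.
function contentTokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9_]+/)
    .filter((t) => t.length > 1 && STOPWORDS.indexOf(t) === -1);
}

// Passes when the rationale shares at least one non-stopword token with new_string.
function rationaleMatchesPatch(rationale: string, newString: string): boolean {
  const patchTokens = contentTokens(newString);
  return contentTokens(rationale).some((t) => patchTokens.indexOf(t) !== -1);
}
```

With this sketch, the rationale "Add destructive SQL filter" against the patch `use tracing;` shares no token and is rejected, which is the mismatch the gate exists to catch.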

Plus:

Deny-list: config/, ops/, auditor/, docs/, data/, mcp-server/, ui/, sidecar/, scripts/ can't be auto-applied (human review required).
Branch guard: refuses to run on main.
Dry-run workspace revert: in COMMIT=0 mode, file is reverted after check regardless of outcome — no state pollution between runs.

Every decision logs to data/_kb/auto_apply.jsonl (action: committed / build_red_reverted / warnings_increased_reverted / rationale_mismatch_reverted / all_rejected / no_patches / dry_run_would_commit).
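
The cheap checks that run before any build (deny-list, confidence, size) might look like the sketch below. Assumptions: paths are repo-relative, the check order is illustrative, and `Patch` and `preBuildReject` are hypothetical names; only the `confidence NN < MM` reason format comes from the text above.

```typescript
const DENY_PREFIXES = [
  "config/", "ops/", "auditor/", "docs/", "data/",
  "mcp-server/", "ui/", "sidecar/", "scripts/",
];
const MAX_LINES_CHANGED = 6; // size gate

// Hypothetical patch shape for illustration.
type Patch = { file: string; confidence: number; linesChanged: number };

// Returns null when the patch may proceed to cargo check, else a rejection reason.
function preBuildReject(p: Patch, minConf = 90): string | null {
  if (DENY_PREFIXES.some((d) => p.file.indexOf(d) === 0)) {
    return `deny-listed path: ${p.file}`;
  }
  if (p.confidence < minConf) {
    return `confidence ${p.confidence} < ${minConf}`;
  }
  if (p.linesChanged > MAX_LINES_CHANGED) {
    return `size ${p.linesChanged} > ${MAX_LINES_CHANGED} lines`;
  }
  return null;
}
```

Running these checks first keeps the expensive `cargo check --workspace` step for patches that could actually be committed.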

6. Pathway memory (ADR-021)

Full spec: docs/DECISIONS.md ADR-021. Code: crates/vectord/src/pathway_memory.rs.

Three-layer matrix index for compounding semantic-correctness signal:

Fingerprint (narrow)

pathway_id = SHA256(task_class + "|" + file_prefix + "|" + signal_class), where file_prefix is the first 2 path segments of the file (e.g. crates/queryd), so related files in the same crate share pathways.
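
A minimal sketch of the narrow fingerprint, assuming repo-relative paths and a hex-encoded digest (the output encoding is not stated in this doc). `filePrefix` and `pathwayId` are hypothetical helper names, and the `task_class` and `signal_class` values in the usage note are made up.

```typescript
import { createHash } from "node:crypto";

// First 2 path segments: "crates/queryd/src/delta.rs" -> "crates/queryd".
function filePrefix(filePath: string): string {
  return filePath.split("/").slice(0, 2).join("/");
}

// SHA256 over "task_class|file_prefix|signal_class"; hex encoding is an assumption.
function pathwayId(taskClass: string, filePath: string, signalClass: string): string {
  const key = [taskClass, filePrefix(filePath), signalClass].join("|");
  return createHash("sha256").update(key).digest("hex");
}
```

Under this sketch, `pathwayId("scrum_review", "crates/queryd/src/delta.rs", "forensic")` and the same call with `crates/queryd/src/service.rs` return the same id, because both files reduce to the `crates/queryd` prefix.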

Embedding (similarity vector)

32-bucket L2-normalized token hash. Tokens include: task_class, file_path, signal_class, per-attempt model+rung+accepted flag, KB chunk source_docs, observer class, bridge libraries, sub-pipeline calls, semantic_flags, and bug_fingerprints (flag+pattern_key).

TS and Rust implementations byte-match, as verified by a smoke test showing cosine=1.0 on the same input tokens. This is load-bearing: TS-written traces must be searchable against the Rust-indexed space.
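
To make the "32-bucket L2-normalized token hash" concrete, here is a sketch of the technique. The hash function is a stand-in: FNV-1a over UTF-16 code units is used purely for illustration, and the pipeline's actual hash (which must agree byte-for-byte between TS and Rust) is not specified in this doc.

```typescript
const BUCKETS = 32;

// FNV-1a 32-bit: a stand-in hash, NOT necessarily the one the pipeline uses.
function fnv1a(token: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < token.length; i++) {
    h ^= token.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return h;
}

// Counts tokens per bucket, then scales the vector to unit L2 length.
function embed(tokens: string[]): number[] {
  const v: number[] = new Array(BUCKETS).fill(0);
  for (const t of tokens) {
    const idx = fnv1a(t) % BUCKETS;
    v[idx] = (v[idx] ?? 0) + 1;
  }
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return norm === 0 ? v : v.map((x) => x / norm);
}

// For unit-length inputs, cosine similarity reduces to the dot product.
function cosine(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * (b[i] ?? 0), 0);
}
```

Because the vector is a bag of hashed tokens, token order does not matter, and two implementations that hash identically produce cosine 1.0 on the same tokens, which is what the smoke test checks.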

Hot-swap gate (6-factor AND)

narrow_fingerprint_matches
AND audit_consensus.pass != false (null OK during bootstrap)
AND replay_count >= 3 (probation)
AND success_rate >= 0.80
AND NOT retired
AND similarity(query_vec, stored.pathway_vec) >= 0.90

Replay bookkeeping: on hot-swap, replay_count++; if the recommended model succeeded, replays_succeeded++; if replay_count >= 3 AND success_rate < 0.80 → retired = true (sticky — prevents oscillation on noise).
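
The gate and the replay bookkeeping can be sketched together. The field names and the `Pathway` shape are hypothetical (the real `PathwayTrace` lives in crates/vectord/src/pathway_memory.rs and is written in Rust); the thresholds are the ones listed above.

```typescript
// Hypothetical, reduced view of a stored pathway.
type Pathway = {
  auditConsensusPass: boolean | null; // null is allowed during bootstrap
  replayCount: number;
  replaysSucceeded: number;
  retired: boolean;
};

const PROBATION_REPLAYS = 3;
const MIN_SUCCESS_RATE = 0.8;
const MIN_SIMILARITY = 0.9;

function successRate(p: Pathway): number {
  return p.replayCount === 0 ? 0 : p.replaysSucceeded / p.replayCount;
}

// All conditions must hold for a hot-swap recommendation.
function hotSwapEligible(p: Pathway, fingerprintMatches: boolean, similarity: number): boolean {
  return (
    fingerprintMatches &&
    p.auditConsensusPass !== false &&
    p.replayCount >= PROBATION_REPLAYS &&
    successRate(p) >= MIN_SUCCESS_RATE &&
    !p.retired &&
    similarity >= MIN_SIMILARITY
  );
}

// Updates counters after a replay; retirement is sticky once set.
function recordReplay(p: Pathway, succeeded: boolean): Pathway {
  const next: Pathway = {
    ...p,
    replayCount: p.replayCount + 1,
    replaysSucceeded: p.replaysSucceeded + (succeeded ? 1 : 0),
  };
  if (next.replayCount >= PROBATION_REPLAYS && successRate(next) < MIN_SUCCESS_RATE) {
    next.retired = true;
  }
  return next;
}
```

A pathway with 2 successes in 3 replays (about 0.67) is retired and stays retired even if later replays would lift its rate back to 0.80, which is the oscillation the sticky flag is meant to prevent.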

Semantic-correctness layer (ADR-021)

Each PathwayTrace carries:

semantic_flags: Vec<SemanticFlag> — one of 9 variants: UnitMismatch, TypeConfusion, NullableConfusion, OffByOne, StaleReference, PseudoImpl, DeadCode, WarningNoise, BoundaryViolation
bug_fingerprints: Vec<BugFingerprint> — {flag, pattern_key, example, occurrences} where pattern_key = "{Flag}:{sorted-top-3-identifiers-joined-by-hyphen}". Stable across prose variation.
type_hints_used: Vec<TypeHint> — {source, symbol, type_repr}. Phase E (not yet populated).

Pre-review enrichment: scrum calls POST /vectors/pathway/bug_fingerprints with {task_class, file_path, signal_class, limit} — returns aggregated fingerprints sorted by occurrences descending. If any, a 📚 PATHWAY MEMORY preamble is prepended to the reviewer prompt with "this file area had these patterns before — check for recurrences."
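
How the returned fingerprints might be turned into the preamble is sketched below. The line format and the `buildPathwayPreamble` name are assumptions; only the 📚 PATHWAY MEMORY heading, the recurrence hint, and the `{flag, pattern_key, example, occurrences}` shape come from this doc.

```typescript
type BugFingerprint = {
  flag: string;
  pattern_key: string;
  example: string;
  occurrences: number;
};

// Builds the text prepended to the reviewer prompt; empty when nothing is known.
function buildPathwayPreamble(fps: BugFingerprint[], limit = 5): string {
  if (fps.length === 0) return "";
  // The endpoint already sorts by occurrences; re-sorting a copy is harmless.
  const top = fps
    .slice()
    .sort((a, b) => b.occurrences - a.occurrences)
    .slice(0, limit);
  const lines = top.map((f) => `- ${f.pattern_key} (seen ${f.occurrences}x): ${f.example}`);
  return (
    ["📚 PATHWAY MEMORY", "This file area had these patterns before; check for recurrences."]
      .concat(lines)
      .join("\n") + "\n\n"
  );
}
```

Returning an empty string for an empty result keeps the reviewer prompt unchanged on a fresh install, where no fingerprints exist yet.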

Post-review extractor (Phase D, scrum_master_pipeline.ts): walks reviewer markdown line-by-line, finds lines containing a SemanticFlag variant, extracts identifier-shaped backtick-quoted tokens, filters out flag names + Rust keywords (self/mut/async/etc), sorts and takes top 3, builds pattern_key = "{Flag}:{tokens}".
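
A minimal sketch of the extractor steps just described. Assumptions: "sorts" is taken to mean alphabetical order, duplicates are dropped, lines with no surviving identifier are skipped, and the keyword list is a partial stand-in (the doc names only self/mut/async explicitly).

```typescript
const FLAGS = [
  "UnitMismatch", "TypeConfusion", "NullableConfusion", "OffByOne", "StaleReference",
  "PseudoImpl", "DeadCode", "WarningNoise", "BoundaryViolation",
];
// Partial keyword list for illustration.
const RUST_KEYWORDS = ["self", "mut", "async", "fn", "let", "pub", "impl", "use", "await"];

function extractPatternKeys(markdown: string): string[] {
  const keys: string[] = [];
  for (const line of markdown.split("\n")) {
    // Only lines that mention a SemanticFlag variant are considered.
    const flag = FLAGS.find((f) => line.indexOf(f) !== -1);
    if (!flag) continue;
    // Identifier-shaped tokens wrapped in backticks.
    const idents = (line.match(/`([A-Za-z_][A-Za-z0-9_]*)`/g) ?? [])
      .map((m) => m.slice(1, -1))
      .filter((t) => FLAGS.indexOf(t) === -1 && RUST_KEYWORDS.indexOf(t) === -1);
    const unique = idents.filter((t, i) => idents.indexOf(t) === i);
    const top = unique.sort().slice(0, 3); // alphabetical order is an assumption
    if (top.length > 0) keys.push(`${flag}:${top.join("-")}`);
  }
  return keys;
}
```

Filtering out flag names is what prevents degenerate keys such as `DeadCode:DeadCode`, and sorting makes the key independent of the order in which the reviewer happened to mention the identifiers.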

HTTP surface (on gateway port 3100)

| Endpoint | Purpose |
|---|---|
| `POST /vectors/pathway/insert` | write a full PathwayTrace |
| `POST /vectors/pathway/query` | hot-swap candidate check (returns `{candidate: null}` or `{candidate: {...}}`) |
| `POST /vectors/pathway/record_replay` | update replay_count + success_rate after hot-swap |
| `GET /vectors/pathway/stats` | totals + reuse_rate + replay_success_rate |
| `POST /vectors/pathway/bug_fingerprints` | aggregated fingerprints by narrow fingerprint (for pre-review preamble) |

State persistence

data/_pathway_memory/state.json — JSON dump of all buckets. Loaded at gateway boot (crates/gateway/src/main.rs has pwm.load_from_storage().await).

7. Current state (2026-04-24 end of session)

Commits on branch `scrum/auto-apply-19814` since iter-5 baseline

| # | SHA | Subject |
|---|---|---|
| 1 | `25ea3de` | observer fix: route LLM Team escalation to `/v1/chat` qwen3-coder |
| 2 | `8b77d67` | OpenRouter rescue ladder + tree-split reducer + first auto-applier |
| 3 | `96b46cd` | first auto-applied commit (later found misleading) |
| 4 | `5e8d87b` | cleanup + applier hardening (warning + rationale + dry-run gates) |
| 5 | `9cc0ceb` | P42-002: truth gate into queryd `/sql` + `/paged` paths |
| 6 | `2f8b347` | pathway_memory base (PathwayTrace, hot-swap, 18 tests) |
| 7 | `86901f8` | queryd/delta.rs 6-line unit-mismatch fix |
| 8 | `92df0e9` | ADR-021 spec |
| 9 | `0a0843b` | ADR-021 Phases A+B+C (semantic_flags, prompt tags, preamble endpoint) |
| 10 | `ee31424` | ADR-021 Phase D (fingerprint extractor) |
| 11 | `f4cff66` | Phase D fix: strip flag names + Rust keywords from pattern_keys |

Matrix index state

12 pathway traces in data/_pathway_memory/state.json
11 distinct bug fingerprints across 4 Flag categories on crates/queryd narrow fingerprint (1 manually seeded + 10 extracted)
0 hot-swaps fired (probation requires ≥3 replays per pathway; none reached yet)

Active in-flight

Iter 9 complete; next iter 10+ will use cleaner fingerprint extractor (f4cff66)
4 "noisy" pattern_keys from iter-9-file-1 pre-fix run (e.g., DeadCode:DeadCode) — dormant, won't match future output, acceptable dead entries

Queued (not yet implemented)

Phase E — type_hints_used population from catalogd column types, Arrow RecordBatch.schema(), Rust struct field types. Feeds typed context to reviewer prompt.
Auditor → pathway audit_consensus wire — activates the strict-audit gate (currently lenient: null bootstraps, only explicit false blocks).
VCP UI cards for "top bug fingerprints in last N iters" + "new patterns learned this iter"

8. How to run a new iteration

# Default 3 files (playbook_memory.rs, doc_drift.rs, auditor/audit.ts)
LH_SCRUM_FORENSIC=/home/profit/lakehouse/docs/SCRUM_FORENSIC_PROMPT.md \
LH_SCRUM_PROPOSAL=/home/profit/lakehouse/docs/SCRUM_FIX_WAVE.md \
bun run tests/real-world/scrum_master_pipeline.ts

# Targeted files:
LH_SCRUM_FILES="/home/profit/lakehouse/crates/queryd/src/delta.rs,/home/profit/lakehouse/crates/queryd/src/service.rs" \
LH_SCRUM_FORENSIC=... LH_SCRUM_PROPOSAL=... \
bun run tests/real-world/scrum_master_pipeline.ts

# Dry-run auto-applier against the latest scrum output:
LH_APPLIER_MIN_CONF=85 LH_APPLIER_MAX_FILES=10 \
LH_APPLIER_MODEL=qwen3-coder:480b \
LH_APPLIER_BRANCH=scrum/auto-apply-19814 \
bun run tests/real-world/scrum_applier.ts

# Actually commit (ONLY after dry-run looks clean):
LH_APPLIER_COMMIT=1 LH_APPLIER_MIN_CONF=85 LH_APPLIER_MAX_FILES=10 \
LH_APPLIER_MODEL=qwen3-coder:480b \
LH_APPLIER_BRANCH=scrum/auto-apply-19814 \
bun run tests/real-world/scrum_applier.ts

9. Verify services before running

# Gateway (port 3100) — must be up; pathway endpoints are here
curl -s http://localhost:3100/health            # "lakehouse ok"
curl -s http://localhost:3100/vectors/pathway/stats   # pathway memory totals

# UI (port 3950) — VCP dashboard + /data/pathway_stats aggregation
curl -s http://localhost:3950/data/pathway_stats

# Observer (port 3800) — event receiver + LLM Team escalation
curl -s http://localhost:3800/health 2>/dev/null || true

# Sidecar (port 3200) — Python embed
curl -s http://localhost:3200/health 2>/dev/null || true

# LLM Team (port 5000) — /api/run?mode=extract ONLY registered mode
# (others like code_review/patch/refactor return "Unknown mode")
curl -s http://localhost:5000/health 2>/dev/null || true

If the gateway is missing new routes after a code change, rebuild and restart it: cargo build --release -p gateway && sudo systemctl restart lakehouse.service.

If the UI is missing new routes, kill the old bun run ui/server.ts process and restart it (it is not a systemd service right now).

10. Where things live (code pointers)

| Concern | File |
|---|---|
| Scrum orchestrator | `tests/real-world/scrum_master_pipeline.ts` |
| Scrum ladder constant | same file, `const LADDER` line ~92 |
| Tree-split reducer | same file, `async function treeSplitFile` |
| Forensic prompt preamble (loaded via env) | `docs/SCRUM_FORENSIC_PROMPT.md` |
| Fix-wave proposal preamble | `docs/SCRUM_FIX_WAVE.md` |
| Scrum iter notes | `docs/SCRUM_LOOP_NOTES.md` |
| Auto-applier | `tests/real-world/scrum_applier.ts` |
| Applier audit trail | `data/_kb/auto_apply.jsonl` |
| Scrum reviews KB | `data/_kb/scrum_reviews.jsonl` |
| Model trust journal | `data/_kb/model_trust.jsonl` |
| Pathway memory module | `crates/vectord/src/pathway_memory.rs` |
| Pathway HTTP handlers | `crates/vectord/src/service.rs` (bottom) |
| Pathway state on disk | `data/_pathway_memory/state.json` |
| VCP UI server | `ui/server.ts` |
| VCP UI client | `ui/ui.js` + `ui/ui.css` + `ui/index.html` |
| Observer | `mcp-server/observer.ts` |
| Auditor | `auditor/audit.ts` |
| LLM Team extract client | `auditor/fact_extractor.ts` |
| ADR-021 spec | `docs/DECISIONS.md` ADR-021 |

11. Key memory files a fresh session should read

From /root/.claude/projects/-home-profit/memory/:

project_scrum_pipeline.md — updated state of the scrum iterations
project_first_auto_apply.md — 96b46cd story + cleanup + hardening evidence from iter 7
feedback_semantic_correctness_via_matrix.md — J's insight on compounding, the ADR-021 rule
feedback_endpoint_probe_discipline.md — GET 405 is not endpoint validation
reference_llm_team_modes.md — only extract is registered on port 5000
feedback_scrum_cloud_first.md — scrum/audit/enrich pipelines use cloud first
feedback_cloud_determinism.md — cloud N=3 consensus + qwen3-coder tie-breaker

12. Known gotchas

Gateway restart needed after Rust route additions. sudo systemctl restart lakehouse.service — the service is systemd-managed.
UI server needs manual restart after ui/server.ts changes (no systemd unit). Kill old bun pid, restart with bun run ui/server.ts &.
LLM Team mode code_review doesn't exist — only extract is registered in /root/llm_team_ui.py. Don't wire new features to "Unknown mode" endpoints. See reference_llm_team_modes.md.
OpenRouter free-tier 429s during consensus probes are normal (rate-limited upstream). In the production ladder the free-tier rungs are only hit as last-resort rescue, with seconds-to-minutes gaps between calls, so the traffic pattern differs from rapid-fire consensus runs.
OpenRouter minimax-m2.5:free has a 45s timeout. It is not in the ladder and is used only for one-off probes.
Probation period is 3 replays before hot-swap can fire. On a fresh install, no hot-swap fires until a pathway has been re-visited ≥3 times.
