lakehouse/config/modes.toml
root 20a039c379
Some checks failed
lakehouse/auditor 13 blocking issues: cloud: claim not backed — "Invariants enforced (proven by tests + real run):"
auditor: rebuild on mode runner + drop tree-split (use distillation substrate)
Architectural simplification leveraging Phase 5 distillation work:
the auditor no longer pre-extracts facts via per-shard summaries
because lakehouse_answers_v1 (gold-standard prior PR audits + observer
escalations corpus) supplies cross-PR context through the mode runner's
matrix retrieval. Same signal, ~50× fewer cloud calls per audit.

Per-audit cost:
  Before: 168 gpt-oss:120b shard summaries + 3 final inference calls
  After:  3 deepseek-v3.1:671b mode-runner calls (full retrieval included)

Wall-clock on PR #11 (1.36MB diff):
  Before: ~25 minutes
  After:  88 seconds (3/3 consensus succeeded)

Files:
  auditor/checks/inference.ts
    - Default MODEL kimi-k2:1t → deepseek-v3.1:671b. kimi-k2 is hitting
      sustained Ollama Cloud 500 ISE (verified via repeated trivial
      probes; multi-hour outage). deepseek is the proven drop-in from
      Phase 5 distillation acceptance testing.
    - Dropped treeSplitDiff invocation. Diff truncates to MAX_DIFF_CHARS
      and goes straight to /v1/mode/execute task_class=pr_audit; mode
      runner pulls cross-PR context from lakehouse_answers_v1 via
      matrix retrieval. SHARD_MODEL retained for legacy callCloud
      compatibility (default qwen3-coder:480b if it ever runs).
    - extractAndPersistFacts now reads from truncated diff (no
      scratchpad post-tree-split-removal).

  auditor/checks/static.ts
    - serde-derived struct exemption (commit 107a682 shipped this; this
      commit is the rest of the auditor rebuild it landed alongside)
    - multi-line template literal awareness in isInsideQuotedString —
      tracks backtick state across lines so todo!() inside docstrings
      doesn't trip BLOCK_PATTERNS.

  crates/gateway/src/v1/mode.rs
    - pr_audit native runner mode added to VALID_MODES + is_native_mode
      + flags_for_mode + framing_text. PrAudit framing produces strict
      JSON {claim_verdicts, unflagged_gaps} for the auditor to parse.

  config/modes.toml
    - pr_audit task class with default_model=deepseek-v3.1:671b and
      matrix_corpus=lakehouse_answers_v1. Documents kimi-k2 outage
      with link to the swap rationale.

Real-data audit on PR #11 head 1b433a9 (which is the PR with all the
distillation work + auditor rebuild itself):
  - Pipeline ran to completion (88s for inference; full audit ~3 min)
  - 3/3 consensus runs succeeded on deepseek-v3.1:671b
  - 156 findings: 12 block, 23 warn, 121 info
  - Block findings are legitimate signal: 12 reviewer claims like
    "Invariants enforced (proven by tests + real run):" that the
    truncated diff can't directly verify. The auditor is correctly
    flagging claim-vs-diff divergence — exactly its job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:32:44 -05:00

87 lines
3.7 KiB
TOML
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Mode router config — task_class → mode mapping
#
# `preferred_mode` is the first choice for a task class; `fallback_modes`
# get tried in order if the preferred one isn't available (LLM Team can
# return Unknown mode for some, OR the matrix has stronger signal for a
# fallback). `default_model` seeds the mode runner's model field if the
# caller doesn't override.
#
# Modes are dispatched against LLM Team UI (localhost:5000/api/run) for
# now; future Rust-native runners will short-circuit before the proxy.
# See crates/gateway/src/v1/mode.rs for the dispatch path.
[[task_class]]
name = "scrum_review"
# 2026-04-26 pass5 variance test (5 reps × 4 conditions, grok-4.1-fast,
# pathway_memory.rs): composed corpus LOST 5/5 vs isolation (Δ 1.8
# grounded findings, p=0.031). See docs/MODE_RUNNER_TUNING_PLAN.md.
# Default is now isolation — bug fingerprints + adversarial framing +
# file content carries strong models without matrix noise. The
# `codereview_lakehouse` matrix path remains available via force_mode
# (auto-downgrades to isolation on strong models — see the
# is_strong_model gate in crates/gateway/src/v1/mode.rs).
preferred_mode = "codereview_isolation"
fallback_modes = ["codereview_lakehouse", "codereview", "consensus", "ladder"]
default_model = "qwen3-coder:480b"
# Corpora kept defined so experimental modes (codereview_matrix_only,
# pass2/pass5 sweeps) and weak-model rescue rungs can still pull them.
# scrum_findings_v1 is built but EXCLUDED — bake-off showed 24% OOB
# line citations from cross-file drift, only safe with same-file gating.
matrix_corpus = ["lakehouse_arch_v1", "lakehouse_symbols_v1"]
[[task_class]]
name = "contract_analysis"
preferred_mode = "deep_analysis"
fallback_modes = ["research", "extract"]
default_model = "kimi-k2:1t"
matrix_corpus = "chicago_permits_v1"
[[task_class]]
name = "staffing_inference"
# Staffing-domain native enrichment runner — Pass 4 (2026-04-26).
# Same composer architecture as codereview_lakehouse but with staffing
# framing + workers corpus. Validates that the modes-as-prompt-molders
# pattern generalizes beyond code review.
preferred_mode = "staffing_inference_lakehouse"
fallback_modes = ["ladder", "consensus", "pipeline"]
default_model = "openai/gpt-oss-120b:free"
matrix_corpus = "workers_500k_v8"
[[task_class]]
name = "fact_extract"
preferred_mode = "extract"
fallback_modes = ["distill"]
default_model = "qwen2.5"
matrix_corpus = "kb_team_runs_v1"
[[task_class]]
name = "doc_drift_check"
preferred_mode = "drift"
fallback_modes = ["validator"]
default_model = "gpt-oss:120b"
matrix_corpus = "distilled_factual_v20260423095819"
[[task_class]]
name = "pr_audit"
# Auditor's claim-vs-diff verification mode (2026-04-26 rebuild).
# Replaces the auditor's hand-rolled inference check with the mode-runner
# composer: pathway memory (PR-level patterns) + lakehouse_answers_v1
# corpus (prior accepted reviews + observer escalations) + adversarial
# JSON-shaped framing. Default model is paid Ollama Cloud kimi-k2:1t for
# strong claim-grounding; tie-breaker via auditor-side env override.
preferred_mode = "pr_audit"
fallback_modes = ["consensus", "ladder"]
# kimi-k2:1t broken upstream 2026-04-27 (Ollama Cloud 500 ISE, multi-hour
# sustained outage verified by repeated probes). deepseek-v3.1:671b is
# the drop-in substitute — proven working end-to-end through pr_audit
# during Phase 5 distillation acceptance testing.
default_model = "deepseek-v3.1:671b"
matrix_corpus = "lakehouse_answers_v1"
# Fallback when task_class isn't in the table — useful for ad-hoc calls
# during development that don't yet have a mapped mode.
[default]
preferred_mode = "pipeline"
fallback_modes = ["consensus", "ladder"]
default_model = "qwen3.5:latest"