root 8aa7ee974f auditor: auto-promote to Claude Opus 4.7 on big diffs (>100k chars)

Smart-routing in kimi_architect: default model (Haiku 4.5 by env, or
Kimi K2.6 if not set) handles normal PR audits cheap and fast; diffs
above LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS (default 100k) get
promoted to Claude Opus 4.7 for the audit.

Why this split: the 2026-04-27 3-way bake-off (Kimi K2.6 vs Haiku 4.5
vs Opus 4.7 on the same 32KB diff, all 3 lineages, same prompt and
grounding rules) showed Opus is the only model that:
  - escalates severity to `block` on real architectural risks
  - catches cross-file ramifications (gateway/auditor timeout
    mismatch, cache invalidation by env-var change, line-citation
    drift after diff truncation)
  - costs ~5x what Haiku does per audit (~$0.10 vs $0.02)

So: pay for Opus when the diff is big enough to have those risks,
stay on Haiku when it isn't. 80% of refactor PRs cross 100KB; 90% of
single-feature PRs don't.

New env knobs (all optional, sensible defaults):
  LH_AUDITOR_KIMI_OPUS_MODEL              default claude-opus-4-7
  LH_AUDITOR_KIMI_OPUS_PROVIDER           default opencode
  LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS    default 100000
                                          (set very high to disable)

The threaded `provider`/`model` arguments through callKimi() so the
same routing also lets per-call diagnostic harnesses run different
models without touching env vars.

Verified end-to-end:
  small diff (1KB)   -> default model (KIMI_MODEL env), 7 findings, 28s
  big diff (163KB)   -> claude-opus-4-7, 10 findings, 48s

Bake-off report at reports/kimi/cross-lineage-bakeoff.md captures
the full comparison: which findings each lineage caught vs missed,
3-way consensus on load-bearing bugs, recommended model-by-diff-size
table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 06:48:38 -05:00

3.2 KiB

Raw Permalink Blame History

Cross-Lineage Auditor Bake-Off — 2026-04-27

Same diff (HEAD~5..HEAD~2, 32KB, 3 commits = the kimi-integration work) audited by three models from three vendor lineages. All three through the lakehouse gateway, all three with the same kimi_architect prompt template + grounding verification.

Results

	Kimi K2.6 (Moonshot)	Haiku 4.5 (Anthropic)	Opus 4.7 (Anthropic)
Provider	ollama_cloud	opencode/Zen	opencode/Zen
Latency	53.7s	20.5s	53.6s
Findings	10	9	10
Grounded	10/10	9/9	10/10
Severity (block/warn/info)	0 / 9 / 1	0 / 5 / 4	3 / 5 / 2
Cost	flat-sub (Ollama Pro)	~$0.02	~$0.10–0.15
Style	Architectural / migration	Boundary / resilience	Escalation / cross-file

Severity escalation pattern

Only Opus produced block-level findings. Haiku and Kimi described the same kind of issues as warn. This isn't randomness — it's training. Anthropic's Opus is calibrated to flag merge-stoppers more confidently than the lighter-weight or different-lineage models.

What ONLY Opus 4.7 caught

parseFindings rationale regex truncates on inline **bold** inside rationales — neither Haiku nor Kimi noticed
Cache-by-head-SHA survives LH_AUDITOR_KIMI_MODEL env flip (silently returns old findings under wrong model name)
Gateway/auditor timeout mismatch: kimi.rs 600s vs auditor curl 900s

What ALL three caught

(ev as any).contractor schema bypass (3/3)
Empty-env Number("") returns 0 trap (3/3)
readFileSync in async function (3/3)
mode.rs Rust test compile error (3/3)

Three-lineage consensus = high-confidence load-bearing real bug.

What only Kimi K2.6 caught

Schema version bump v1→v2 without explicit migration path
ISO timestamp precision in run_id derivation
Multimodal content array passed verbatim to Kimi (would 400)

Kimi favors architectural / API-contract concerns. Useful when the diff is a refactor rather than a feature.

What only Haiku 4.5 caught

appendMetrics mkdir target uses join(path, "..") not dirname
KIMI_MODEL cast in parseFindings not validated against type
Truncation of error messages in callKimi at 300 chars loses context

Haiku favors boundary cases — what happens when assumptions break.

Cost-vs-quality verdict

Diff size	Recommended model	Why
< 100k chars (normal PRs)	Haiku 4.5	80% of the same surface, 5x cheaper, 2.6x faster
> 100k chars (refactors, multi-file)	Opus 4.7	Cross-file ramifications + escalation that lighter models miss

Auto-promotion implemented in auditor/checks/kimi_architect.ts:74 via selectModel(diffLen). Threshold env-overridable (LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS, default 100000).

Methodology notes

Same prompt template, same grounding rules, same input bundle
Each call cached at data/_auditor/kimi_verdicts/<pr>-<sha>.json
Per-call metrics appended to data/_kb/kimi_audits.jsonl
Wall-clock measured from request POST to response parse
Cost computed as prompt_tokens * input_rate + completion_tokens * output_rate
usage.prompt_tokens underreports through opencode proxy path (verified ~7k input tokens vs reported 5); cost figures use observed prompt size rather than reported.

3.2 KiB Raw Permalink Blame History Unescape Escape