Smart-routing in kimi_architect: the default model (Haiku 4.5 via the
KIMI_MODEL env var, or Kimi K2.6 if it is unset) handles normal PR
audits cheaply and fast; diffs above
LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS (default 100k) get promoted to
Claude Opus 4.7 for the audit.
Why this split: the 2026-04-27 3-way bake-off (Kimi K2.6 vs Haiku 4.5
vs Opus 4.7 on the same 32KB diff, all 3 lineages, same prompt and
grounding rules) showed Opus is the only model that:
- escalates severity to `block` on real architectural risks
- catches cross-file ramifications (gateway/auditor timeout
mismatch, cache invalidation by env-var change, line-citation
drift after diff truncation)
Opus also costs ~5x what Haiku does per audit (~$0.10 vs ~$0.02).
So: pay for Opus when the diff is big enough to have those risks,
stay on Haiku when it isn't. 80% of refactor PRs cross 100KB; 90% of
single-feature PRs don't.
New env knobs (all optional, sensible defaults):
LH_AUDITOR_KIMI_OPUS_MODEL default claude-opus-4-7
LH_AUDITOR_KIMI_OPUS_PROVIDER default opencode
LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS default 100000
(set very high to disable)
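The routing described above can be sketched as follows. This is a hypothetical reconstruction, not the actual kimi_architect.ts code: `selectModel` is named in the report, but the fallback literals and structure here are assumptions.

```typescript
// Assumed sketch of the diff-size routing; mirrors the description, not the repo.
const OPUS_MODEL = process.env.LH_AUDITOR_KIMI_OPUS_MODEL ?? "claude-opus-4-7";
// Number(undefined) is NaN and Number("") is 0; `|| 100_000` catches both,
// so an unset or set-but-empty knob falls back to the default threshold.
const OPUS_THRESHOLD =
  Number(process.env.LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS) || 100_000;

function selectModel(diffLen: number): string {
  if (diffLen > OPUS_THRESHOLD) return OPUS_MODEL; // big diff: promote to Opus
  return process.env.KIMI_MODEL ?? "kimi-k2.6"; // normal PRs: cheap default
}
```

Setting the threshold knob very high pushes every realistic diff onto the default branch, which is the "disable" behavior noted above.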
`provider`/`model` arguments are now threaded through callKimi(), so
the same routing also lets per-call diagnostic harnesses run different
models without touching env vars.
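The per-call override might look like the sketch below. The interface name, defaults, and return value are illustrative assumptions; the real callKimi signature may differ.

```typescript
// Hypothetical shape of the per-call provider/model override.
interface KimiCallOpts {
  provider?: string; // e.g. "opencode" or "ollama_cloud"
  model?: string; // bypasses env-based routing for this one call
}

async function callKimi(prompt: string, opts: KimiCallOpts = {}): Promise<string> {
  const provider = opts.provider ?? "opencode"; // assumed default provider
  const model = opts.model ?? process.env.KIMI_MODEL ?? "kimi-k2.6";
  // ...real code would POST {provider, model, prompt} to the gateway here...
  return `${provider}/${model}`; // stand-in for the parsed audit response
}
```

A diagnostic harness can then pin a model per call, e.g. `callKimi(diff, { model: "claude-opus-4-7" })`, without flipping env vars between runs.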
Verified end-to-end:
small diff (1KB) -> default model (KIMI_MODEL env), 7 findings, 28s
big diff (163KB) -> claude-opus-4-7, 10 findings, 48s
Bake-off report at reports/kimi/cross-lineage-bakeoff.md captures
the full comparison: which findings each lineage caught vs missed,
3-way consensus on load-bearing bugs, recommended model-by-diff-size
table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-Lineage Auditor Bake-Off — 2026-04-27
Same diff (HEAD~5..HEAD~2, 32KB, 3 commits = the kimi-integration work)
audited by three models from three vendor lineages. All three through
the lakehouse gateway, all three with the same kimi_architect prompt
template + grounding verification.
Results
| | Kimi K2.6 (Moonshot) | Haiku 4.5 (Anthropic) | Opus 4.7 (Anthropic) |
|---|---|---|---|
| Provider | ollama_cloud | opencode/Zen | opencode/Zen |
| Latency | 53.7s | 20.5s | 53.6s |
| Findings | 10 | 9 | 10 |
| Grounded | 10/10 | 9/9 | 10/10 |
| Severity (block/warn/info) | 0 / 9 / 1 | 0 / 5 / 4 | 3 / 5 / 2 |
| Cost | flat-sub (Ollama Pro) | ~$0.02 | ~$0.10–0.15 |
| Style | Architectural / migration | Boundary / resilience | Escalation / cross-file |
Severity escalation pattern
Only Opus produced block-level findings. Haiku and Kimi described the
same kinds of issues as warn. That isn't randomness; it's calibration:
Opus is tuned to flag merge-stoppers more confidently than the
lighter-weight or different-lineage models.
What ONLY Opus 4.7 caught
- `parseFindings` rationale regex truncates on inline `**bold**` inside rationales; neither Haiku nor Kimi noticed
- Cache-by-head-SHA survives an `LH_AUDITOR_KIMI_MODEL` env flip (silently returns old findings under the wrong model name)
- Gateway/auditor timeout mismatch: kimi.rs 600s vs auditor curl 900s
What ALL three caught
- `(ev as any).contractor` schema bypass (3/3)
- Empty-env `Number("")` returns 0 trap (3/3)
- `readFileSync` in async function (3/3)
- mode.rs Rust test compile error (3/3)
Three-lineage consensus = high-confidence load-bearing real bug.
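The empty-env `Number("")` trap from the consensus list is easy to reproduce. A minimal illustration (the threshold fallback value here is just an example):

```typescript
// An env var that is set-but-empty (FOO="") is the string "", not undefined,
// and Number("") is 0 -- so a naive undefined-check silently yields 0.
const raw: string | undefined = ""; // simulate FOO="" in the shell
const buggy = raw !== undefined ? Number(raw) : 100_000; // 0: wrong fallback
const fixed = Number(raw) || 100_000; // 100_000: both 0 and NaN fall back
```

For a promotion threshold, the buggy form means every diff gets routed to the expensive model; the `||` form treats empty, unset, and garbage values alike.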
What only Kimi K2.6 caught
- Schema version bump v1→v2 without explicit migration path
- ISO timestamp precision in run_id derivation
- Multimodal content array passed verbatim to Kimi (would 400)
Kimi favors architectural / API-contract concerns. Useful when the diff is a refactor rather than a feature.
What only Haiku 4.5 caught
- `appendMetrics` mkdir target uses `join(path, "..")` not `dirname`
- `KIMI_MODEL` cast in `parseFindings` not validated against the type
- Truncation of error messages in `callKimi` at 300 chars loses context
Haiku favors boundary cases — what happens when assumptions break.
Cost-vs-quality verdict
| Diff size | Recommended model | Why |
|---|---|---|
| < 100k chars (normal PRs) | Haiku 4.5 | 80% of the same surface, 5x cheaper, 2.6x faster |
| > 100k chars (refactors, multi-file) | Opus 4.7 | Cross-file ramifications + escalation that lighter models miss |
Auto-promotion implemented in auditor/checks/kimi_architect.ts:74
via selectModel(diffLen). Threshold env-overridable
(LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS, default 100000).
Methodology notes
- Same prompt template, same grounding rules, same input bundle
- Each call cached at `data/_auditor/kimi_verdicts/<pr>-<sha>.json`
- Per-call metrics appended to `data/_kb/kimi_audits.jsonl`
- Wall-clock measured from request POST to response parse
- Cost computed as `prompt_tokens * input_rate + completion_tokens * output_rate`
- `usage.prompt_tokens` underreports through the opencode proxy path (verified ~7k input tokens vs reported 5); cost figures use the observed prompt size rather than the reported one
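The cost formula above is a straightforward dot product of token counts and rates. A sketch, where the rates and the completion-token count are placeholder assumptions (this report does not state the actual per-token prices):

```typescript
// Cost = prompt_tokens * input_rate + completion_tokens * output_rate.
// Rates are in $/token; the values used below are illustrative only.
function auditCost(
  promptTokens: number,
  completionTokens: number,
  inputRate: number,
  outputRate: number,
): number {
  return promptTokens * inputRate + completionTokens * outputRate;
}

// Using the observed ~7k prompt tokens rather than the underreported
// usage figure, with assumed rates and an assumed completion count:
const cost = auditCost(7_000, 1_500, 1e-5, 3e-5);
```

Feeding the observed rather than the proxy-reported prompt size is what keeps the cost figures in the table honest.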