Smart-routing in kimi_architect: default model (Haiku 4.5 by env, or
Kimi K2.6 if not set) handles normal PR audits cheap and fast; diffs
above LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS (default 100k) get
promoted to Claude Opus 4.7 for the audit.
Why this split: the 2026-04-27 3-way bake-off (Kimi K2.6 vs Haiku 4.5
vs Opus 4.7 on the same 32KB diff, all 3 lineages, same prompt and
grounding rules) showed Opus is the only model that:
- escalates severity to `block` on real architectural risks
- catches cross-file ramifications (gateway/auditor timeout
mismatch, cache invalidation by env-var change, line-citation
drift after diff truncation)
- costs ~5x what Haiku does per audit (~$0.10 vs $0.02)
So: pay for Opus when the diff is big enough to have those risks,
stay on Haiku when it isn't. 80% of refactor PRs cross 100KB; 90% of
single-feature PRs don't.
New env knobs (all optional, sensible defaults):
LH_AUDITOR_KIMI_OPUS_MODEL default claude-opus-4-7
LH_AUDITOR_KIMI_OPUS_PROVIDER default opencode
LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS default 100000
(set very high to disable)
The threaded `provider`/`model` arguments through callKimi() so the
same routing also lets per-call diagnostic harnesses run different
models without touching env vars.
Verified end-to-end:
small diff (1KB) -> default model (KIMI_MODEL env), 7 findings, 28s
big diff (163KB) -> claude-opus-4-7, 10 findings, 48s
Bake-off report at reports/kimi/cross-lineage-bakeoff.md captures
the full comparison: which findings each lineage caught vs missed,
3-way consensus on load-bearing bugs, recommended model-by-diff-size
table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
82 lines
3.2 KiB
Markdown
82 lines
3.2 KiB
Markdown
# Cross-Lineage Auditor Bake-Off — 2026-04-27
|
||
|
||
Same diff (`HEAD~5..HEAD~2`, 32KB, 3 commits = the kimi-integration work)
|
||
audited by three models from three vendor lineages. All three through
|
||
the lakehouse gateway, all three with the same `kimi_architect` prompt
|
||
template + grounding verification.
|
||
|
||
## Results
|
||
|
||
| | Kimi K2.6 (Moonshot) | Haiku 4.5 (Anthropic) | Opus 4.7 (Anthropic) |
|
||
|---|---|---|---|
|
||
| Provider | ollama_cloud | opencode/Zen | opencode/Zen |
|
||
| Latency | 53.7s | **20.5s** | 53.6s |
|
||
| Findings | 10 | 9 | 10 |
|
||
| Grounded | 10/10 | 9/9 | 10/10 |
|
||
| Severity (block/warn/info) | 0 / 9 / 1 | 0 / 5 / 4 | **3 / 5 / 2** |
|
||
| Cost | flat-sub (Ollama Pro) | ~$0.02 | ~$0.10–0.15 |
|
||
| Style | Architectural / migration | Boundary / resilience | Escalation / cross-file |
|
||
|
||
## Severity escalation pattern
|
||
|
||
Only Opus produced `block`-level findings. Haiku and Kimi described
|
||
the same kind of issues as `warn`. This isn't randomness — it's
|
||
training. Anthropic's Opus is calibrated to flag merge-stoppers more
|
||
confidently than the lighter-weight or different-lineage models.
|
||
|
||
## What ONLY Opus 4.7 caught
|
||
|
||
- `parseFindings` rationale regex truncates on inline `**bold**`
|
||
inside rationales — neither Haiku nor Kimi noticed
|
||
- Cache-by-head-SHA survives `LH_AUDITOR_KIMI_MODEL` env flip
|
||
(silently returns old findings under wrong model name)
|
||
- Gateway/auditor timeout mismatch: kimi.rs 600s vs auditor curl 900s
|
||
|
||
## What ALL three caught
|
||
|
||
- `(ev as any).contractor` schema bypass (3/3)
|
||
- Empty-env `Number("")` returns 0 trap (3/3)
|
||
- `readFileSync` in async function (3/3)
|
||
- mode.rs Rust test compile error (3/3)
|
||
|
||
Three-lineage consensus = high-confidence load-bearing real bug.
|
||
|
||
## What only Kimi K2.6 caught
|
||
|
||
- Schema version bump v1→v2 without explicit migration path
|
||
- ISO timestamp precision in run_id derivation
|
||
- Multimodal content array passed verbatim to Kimi (would 400)
|
||
|
||
Kimi favors architectural / API-contract concerns. Useful when the
|
||
diff is a refactor rather than a feature.
|
||
|
||
## What only Haiku 4.5 caught
|
||
|
||
- `appendMetrics` mkdir target uses `join(path, "..")` not `dirname`
|
||
- `KIMI_MODEL` cast in `parseFindings` not validated against type
|
||
- Truncation of error messages in callKimi at 300 chars loses context
|
||
|
||
Haiku favors boundary cases — what happens when assumptions break.
|
||
|
||
## Cost-vs-quality verdict
|
||
|
||
| Diff size | Recommended model | Why |
|
||
|---|---|---|
|
||
| < 100k chars (normal PRs) | Haiku 4.5 | 80% of the same surface, 5x cheaper, 2.6x faster |
|
||
| > 100k chars (refactors, multi-file) | Opus 4.7 | Cross-file ramifications + escalation that lighter models miss |
|
||
|
||
Auto-promotion implemented in `auditor/checks/kimi_architect.ts:74`
|
||
via `selectModel(diffLen)`. Threshold env-overridable
|
||
(`LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS`, default 100000).
|
||
|
||
## Methodology notes
|
||
|
||
- Same prompt template, same grounding rules, same input bundle
|
||
- Each call cached at `data/_auditor/kimi_verdicts/<pr>-<sha>.json`
|
||
- Per-call metrics appended to `data/_kb/kimi_audits.jsonl`
|
||
- Wall-clock measured from request POST to response parse
|
||
- Cost computed as `prompt_tokens * input_rate + completion_tokens * output_rate`
|
||
- `usage.prompt_tokens` underreports through opencode proxy path
|
||
(verified ~7k input tokens vs reported 5); cost figures use
|
||
observed prompt size rather than reported.
|