lakehouse/reports/kimi/cross-lineage-bakeoff.md
root 8aa7ee974f auditor: auto-promote to Claude Opus 4.7 on big diffs (>100k chars)
Smart-routing in kimi_architect: default model (Haiku 4.5 by env, or
Kimi K2.6 if not set) handles normal PR audits cheap and fast; diffs
above LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS (default 100k) get
promoted to Claude Opus 4.7 for the audit.

Why this split: the 2026-04-27 3-way bake-off (Kimi K2.6 vs Haiku 4.5
vs Opus 4.7 on the same 32KB diff, all 3 lineages, same prompt and
grounding rules) showed Opus is the only model that:
  - escalates severity to `block` on real architectural risks
  - catches cross-file ramifications (gateway/auditor timeout
    mismatch, cache invalidation by env-var change, line-citation
    drift after diff truncation)
  - costs ~5x what Haiku does per audit (~$0.10 vs $0.02)

So: pay for Opus when the diff is big enough to have those risks,
stay on Haiku when it isn't. 80% of refactor PRs cross 100KB; 90% of
single-feature PRs don't.

New env knobs (all optional, sensible defaults):
  LH_AUDITOR_KIMI_OPUS_MODEL              default claude-opus-4-7
  LH_AUDITOR_KIMI_OPUS_PROVIDER           default opencode
  LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS    default 100000
                                          (set very high to disable)

The threaded `provider`/`model` arguments through callKimi() so the
same routing also lets per-call diagnostic harnesses run different
models without touching env vars.

Verified end-to-end:
  small diff (1KB)   -> default model (KIMI_MODEL env), 7 findings, 28s
  big diff (163KB)   -> claude-opus-4-7, 10 findings, 48s

Bake-off report at reports/kimi/cross-lineage-bakeoff.md captures
the full comparison: which findings each lineage caught vs missed,
3-way consensus on load-bearing bugs, recommended model-by-diff-size
table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 06:48:38 -05:00

3.2 KiB
Raw Permalink Blame History

Cross-Lineage Auditor Bake-Off — 2026-04-27

Same diff (HEAD~5..HEAD~2, 32KB, 3 commits = the kimi-integration work) audited by three models from three vendor lineages. All three through the lakehouse gateway, all three with the same kimi_architect prompt template + grounding verification.

Results

Kimi K2.6 (Moonshot) Haiku 4.5 (Anthropic) Opus 4.7 (Anthropic)
Provider ollama_cloud opencode/Zen opencode/Zen
Latency 53.7s 20.5s 53.6s
Findings 10 9 10
Grounded 10/10 9/9 10/10
Severity (block/warn/info) 0 / 9 / 1 0 / 5 / 4 3 / 5 / 2
Cost flat-sub (Ollama Pro) ~$0.02 ~$0.100.15
Style Architectural / migration Boundary / resilience Escalation / cross-file

Severity escalation pattern

Only Opus produced block-level findings. Haiku and Kimi described the same kind of issues as warn. This isn't randomness — it's training. Anthropic's Opus is calibrated to flag merge-stoppers more confidently than the lighter-weight or different-lineage models.

What ONLY Opus 4.7 caught

  • parseFindings rationale regex truncates on inline **bold** inside rationales — neither Haiku nor Kimi noticed
  • Cache-by-head-SHA survives LH_AUDITOR_KIMI_MODEL env flip (silently returns old findings under wrong model name)
  • Gateway/auditor timeout mismatch: kimi.rs 600s vs auditor curl 900s

What ALL three caught

  • (ev as any).contractor schema bypass (3/3)
  • Empty-env Number("") returns 0 trap (3/3)
  • readFileSync in async function (3/3)
  • mode.rs Rust test compile error (3/3)

Three-lineage consensus = high-confidence load-bearing real bug.

What only Kimi K2.6 caught

  • Schema version bump v1→v2 without explicit migration path
  • ISO timestamp precision in run_id derivation
  • Multimodal content array passed verbatim to Kimi (would 400)

Kimi favors architectural / API-contract concerns. Useful when the diff is a refactor rather than a feature.

What only Haiku 4.5 caught

  • appendMetrics mkdir target uses join(path, "..") not dirname
  • KIMI_MODEL cast in parseFindings not validated against type
  • Truncation of error messages in callKimi at 300 chars loses context

Haiku favors boundary cases — what happens when assumptions break.

Cost-vs-quality verdict

Diff size Recommended model Why
< 100k chars (normal PRs) Haiku 4.5 80% of the same surface, 5x cheaper, 2.6x faster
> 100k chars (refactors, multi-file) Opus 4.7 Cross-file ramifications + escalation that lighter models miss

Auto-promotion implemented in auditor/checks/kimi_architect.ts:74 via selectModel(diffLen). Threshold env-overridable (LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS, default 100000).

Methodology notes

  • Same prompt template, same grounding rules, same input bundle
  • Each call cached at data/_auditor/kimi_verdicts/<pr>-<sha>.json
  • Per-call metrics appended to data/_kb/kimi_audits.jsonl
  • Wall-clock measured from request POST to response parse
  • Cost computed as prompt_tokens * input_rate + completion_tokens * output_rate
  • usage.prompt_tokens underreports through opencode proxy path (verified ~7k input tokens vs reported 5); cost figures use observed prompt size rather than reported.