lakehouse/reports/kimi/cross-lineage-bakeoff.md
root 8aa7ee974f auditor: auto-promote to Claude Opus 4.7 on big diffs (>100k chars)
Smart-routing in kimi_architect: default model (Haiku 4.5 by env, or
Kimi K2.6 if not set) handles normal PR audits cheap and fast; diffs
above LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS (default 100k) get
promoted to Claude Opus 4.7 for the audit.

Why this split: the 2026-04-27 3-way bake-off (Kimi K2.6 vs Haiku 4.5
vs Opus 4.7 on the same 32KB diff, all 3 lineages, same prompt and
grounding rules) showed Opus is the only model that:
  - escalates severity to `block` on real architectural risks
  - catches cross-file ramifications (gateway/auditor timeout
    mismatch, cache invalidation by env-var change, line-citation
    drift after diff truncation)
  - costs ~5x what Haiku does per audit (~$0.10 vs $0.02)

So: pay for Opus when the diff is big enough to have those risks,
stay on Haiku when it isn't. 80% of refactor PRs cross 100KB; 90% of
single-feature PRs don't.

New env knobs (all optional, sensible defaults):
  LH_AUDITOR_KIMI_OPUS_MODEL              default claude-opus-4-7
  LH_AUDITOR_KIMI_OPUS_PROVIDER           default opencode
  LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS    default 100000
                                          (set very high to disable)

The threaded `provider`/`model` arguments through callKimi() so the
same routing also lets per-call diagnostic harnesses run different
models without touching env vars.

Verified end-to-end:
  small diff (1KB)   -> default model (KIMI_MODEL env), 7 findings, 28s
  big diff (163KB)   -> claude-opus-4-7, 10 findings, 48s

Bake-off report at reports/kimi/cross-lineage-bakeoff.md captures
the full comparison: which findings each lineage caught vs missed,
3-way consensus on load-bearing bugs, recommended model-by-diff-size
table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 06:48:38 -05:00

82 lines
3.2 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Cross-Lineage Auditor Bake-Off — 2026-04-27
Same diff (`HEAD~5..HEAD~2`, 32KB, 3 commits = the kimi-integration work)
audited by three models from three vendor lineages. All three through
the lakehouse gateway, all three with the same `kimi_architect` prompt
template + grounding verification.
## Results
| | Kimi K2.6 (Moonshot) | Haiku 4.5 (Anthropic) | Opus 4.7 (Anthropic) |
|---|---|---|---|
| Provider | ollama_cloud | opencode/Zen | opencode/Zen |
| Latency | 53.7s | **20.5s** | 53.6s |
| Findings | 10 | 9 | 10 |
| Grounded | 10/10 | 9/9 | 10/10 |
| Severity (block/warn/info) | 0 / 9 / 1 | 0 / 5 / 4 | **3 / 5 / 2** |
| Cost | flat-sub (Ollama Pro) | ~$0.02 | ~$0.100.15 |
| Style | Architectural / migration | Boundary / resilience | Escalation / cross-file |
## Severity escalation pattern
Only Opus produced `block`-level findings. Haiku and Kimi described
the same kind of issues as `warn`. This isn't randomness — it's
training. Anthropic's Opus is calibrated to flag merge-stoppers more
confidently than the lighter-weight or different-lineage models.
## What ONLY Opus 4.7 caught
- `parseFindings` rationale regex truncates on inline `**bold**`
inside rationales — neither Haiku nor Kimi noticed
- Cache-by-head-SHA survives `LH_AUDITOR_KIMI_MODEL` env flip
(silently returns old findings under wrong model name)
- Gateway/auditor timeout mismatch: kimi.rs 600s vs auditor curl 900s
## What ALL three caught
- `(ev as any).contractor` schema bypass (3/3)
- Empty-env `Number("")` returns 0 trap (3/3)
- `readFileSync` in async function (3/3)
- mode.rs Rust test compile error (3/3)
Three-lineage consensus = high-confidence load-bearing real bug.
## What only Kimi K2.6 caught
- Schema version bump v1→v2 without explicit migration path
- ISO timestamp precision in run_id derivation
- Multimodal content array passed verbatim to Kimi (would 400)
Kimi favors architectural / API-contract concerns. Useful when the
diff is a refactor rather than a feature.
## What only Haiku 4.5 caught
- `appendMetrics` mkdir target uses `join(path, "..")` not `dirname`
- `KIMI_MODEL` cast in `parseFindings` not validated against type
- Truncation of error messages in callKimi at 300 chars loses context
Haiku favors boundary cases — what happens when assumptions break.
## Cost-vs-quality verdict
| Diff size | Recommended model | Why |
|---|---|---|
| < 100k chars (normal PRs) | Haiku 4.5 | 80% of the same surface, 5x cheaper, 2.6x faster |
| > 100k chars (refactors, multi-file) | Opus 4.7 | Cross-file ramifications + escalation that lighter models miss |
Auto-promotion implemented in `auditor/checks/kimi_architect.ts:74`
via `selectModel(diffLen)`. Threshold env-overridable
(`LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS`, default 100000).
## Methodology notes
- Same prompt template, same grounding rules, same input bundle
- Each call cached at `data/_auditor/kimi_verdicts/<pr>-<sha>.json`
- Per-call metrics appended to `data/_kb/kimi_audits.jsonl`
- Wall-clock measured from request POST to response parse
- Cost computed as `prompt_tokens * input_rate + completion_tokens * output_rate`
- `usage.prompt_tokens` underreports through opencode proxy path
(verified ~7k input tokens vs reported 5); cost figures use
observed prompt size rather than reported.