lakehouse/reports/kimi/cross-lineage-bakeoff.md

# Cross-Lineage Auditor Bake-Off — 2026-04-27

Same diff (`HEAD~5..HEAD~2`, 32KB, 3 commits = the kimi-integration work)
audited by three models from three vendor lineages. All three through
the lakehouse gateway, all three with the same `kimi_architect` prompt
template + grounding verification.

## Results

| | Kimi K2.6 (Moonshot) | Haiku 4.5 (Anthropic) | Opus 4.7 (Anthropic) |
|---|---|---|---|
| Provider | ollama_cloud | opencode/Zen | opencode/Zen |
| Latency | 53.7s | **20.5s** | 53.6s |
| Findings | 10 | 9 | 10 |
| Grounded | 10/10 | 9/9 | 10/10 |
| Severity (block/warn/info) | 0 / 9 / 1 | 0 / 5 / 4 | **3 / 5 / 2** |
| Cost | flat-sub (Ollama Pro) | ~$0.02 | ~$0.10–0.15 |
| Style | Architectural / migration | Boundary / resilience | Escalation / cross-file |

## Severity escalation pattern

Only Opus produced `block`-level findings. Haiku and Kimi described
the same kind of issues as `warn`. This isn't randomness — it's
training. Anthropic's Opus is calibrated to flag merge-stoppers more
confidently than the lighter-weight or different-lineage models.

## What ONLY Opus 4.7 caught

- `parseFindings` rationale regex truncates on inline `**bold**`
  inside rationales — neither Haiku nor Kimi noticed
- Cache-by-head-SHA survives `LH_AUDITOR_KIMI_MODEL` env flip
  (silently returns old findings under wrong model name)
- Gateway/auditor timeout mismatch: kimi.rs 600s vs auditor curl 900s

## What ALL three caught

- `(ev as any).contractor` schema bypass (3/3)
- Empty-env `Number("")` returns 0 trap (3/3)
- `readFileSync` in async function (3/3)
- mode.rs Rust test compile error (3/3)

Three-lineage consensus = high-confidence load-bearing real bug.

## What only Kimi K2.6 caught

- Schema version bump v1→v2 without explicit migration path
- ISO timestamp precision in run_id derivation
- Multimodal content array passed verbatim to Kimi (would 400)

Kimi favors architectural / API-contract concerns. Useful when the
diff is a refactor rather than a feature.

## What only Haiku 4.5 caught

- `appendMetrics` mkdir target uses `join(path, "..")` not `dirname`
- `KIMI_MODEL` cast in `parseFindings` not validated against type
- Truncation of error messages in callKimi at 300 chars loses context

Haiku favors boundary cases — what happens when assumptions break.

## Cost-vs-quality verdict

| Diff size | Recommended model | Why |
|---|---|---|
| < 100k chars (normal PRs) | Haiku 4.5 | 80% of the same surface, 5x cheaper, 2.6x faster |
| > 100k chars (refactors, multi-file) | Opus 4.7 | Cross-file ramifications + escalation that lighter models miss |

Auto-promotion implemented in `auditor/checks/kimi_architect.ts:74`
via `selectModel(diffLen)`. Threshold env-overridable
(`LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS`, default 100000).

## Methodology notes

- Same prompt template, same grounding rules, same input bundle
- Each call cached at `data/_auditor/kimi_verdicts/<pr>-<sha>.json`
- Per-call metrics appended to `data/_kb/kimi_audits.jsonl`
- Wall-clock measured from request POST to response parse
- Cost computed as `prompt_tokens * input_rate + completion_tokens * output_rate`
- `usage.prompt_tokens` underreports through opencode proxy path
  (verified ~7k input tokens vs reported 5); cost figures use
  observed prompt size rather than reported.