lakehouse/reports/kimi/audit-last-week-full.md
root 41b0a99ed2 chore: add real content that was sitting untracked
Surfaced by today's untracked-files audit. None of these are accidents —
multiple are referenced by name in CLAUDE.md and memory files but were
never added.

Categories:
- docs/PHASE_AUDIT_GUIDE.md (106 LOC) — Claude Code phase audit guidance
- ops/systemd/lakehouse-langfuse-bridge.service — Langfuse bridge unit
- package.json — top-level npm manifest
- scripts/e2e_pipeline_check.sh + production_smoke.sh — real test scripts
- reports/kimi/audit-last-week*.md — the "Two reports live" CLAUDE.md cites
- tests/multi-agent/scenarios/ — 44 staffing scenarios (cutover decision A)
- tests/multi-agent/playbooks/ — 102 playbook records
- tests/battery/, tests/agent_test/PRD.md, tests/real-world/* — real tests
- sidecar/sidecar/{lab_ui,pipeline_lab}.py — 888 LOC dev-only UIs that
  remain in service post-sidecar-drop (commit ba928b1 explicitly kept them)

Sensitivity check: scenarios use synthetic company names ("Heritage Foods",
"Cornerstone Fabrication"); audit reports describe code findings only;
no PII or secrets surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 22:22:10 -05:00

# Kimi Forensic Audit (FULL FILES) — distillation v1.0.0
**Generated:** 2026-04-27 by `kimi-for-coding` via gateway /v1/chat
**Latency:** 270.6s | **finish:** stop | **usage:** 66,338 prompt + 10,159 completion = 76,497 tokens
**Input:** /tmp/kimi-audit-full.md (238KB · 12 commits · 15 files · line-numbered, no truncation)
---
## Verdict
**Hold**: the substrate's TypeScript pipeline is architecturally coherent and the SFT firewall is genuine, but the committed Rust tests fail to compile, drift detection hardcodes an unverified integrity assertion, and supposedly deterministic code paths leak wall-clock time in multiple places.
## What's solid
- **Three-layer SFT contamination firewall is real.** Schema enum restricts `quality_score` to `["accepted", "partially_accepted"]` (`sft_sample.ts:13,62`), exporter constant `SFT_NEVER` blocks rejected/needs_human_review before synthesis (`export_sft.ts:51,205`), and `receipts.ts` re-reads the output to fail loud if any forbidden score leaked (`receipts.ts:231-236`).
- **Core scorer is pure and deterministic.** `scoreRecord` takes an `EvidenceRecord`, performs no I/O, no LLM calls, and uses no mutable state (`scorer.ts:1-5,257-273`).
- **Quarantine is exhaustive and observable.** Every exporter routes skips to structured `exports/quarantine/<exporter>.jsonl` with typed reasons; silent drops are impossible by construction (`quarantine.ts:1-6,14-26`).
- **Evidence provenance is mandatory on every row.** Every `EvidenceRecord` carries `source_file`, `line_offset`, `sig_hash`, and `recorded_at` (`build_evidence_index.ts:27-34`).
- **Local-first replay reduces cloud calls.** `replay.ts` defaults to a local model, augments via RAG retrieval, and only escalates on validation failure, directly supporting the cloud-call reduction claim (`replay.ts:24,349-376`).
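The firewall pattern from the first bullet can be sketched as follows. This is a hedged illustration, not the repo's code: `ALLOWED_SCORES`, `SftSample`, and the function names are assumptions (the actual exporter uses an exclusion constant, `SFT_NEVER`, rather than an allow-set).

```typescript
// Illustrative sketch of two of the three firewall layers: an allow-set
// applied before synthesis, and a post-write re-read that fails loud if
// any forbidden score leaked into the output file.
import * as fs from "node:fs";

const ALLOWED_SCORES = new Set(["accepted", "partially_accepted"]);

interface SftSample {
  quality_score: string;
  text: string;
}

// Layer 2 analogue: drop forbidden scores before synthesis.
function filterForExport(samples: SftSample[]): SftSample[] {
  return samples.filter((s) => ALLOWED_SCORES.has(s.quality_score));
}

// Layer 3 analogue: re-read the written JSONL and fail loud on any leak.
function verifyExport(jsonlPath: string): void {
  const lines = fs.readFileSync(jsonlPath, "utf8").split("\n").filter(Boolean);
  for (const line of lines) {
    const row = JSON.parse(line) as SftSample;
    if (!ALLOWED_SCORES.has(row.quality_score)) {
      throw new Error(
        `forbidden quality_score leaked into export: ${row.quality_score}`
      );
    }
  }
}
```

The value of the third layer is that it checks the artifact actually on disk, so a bug anywhere upstream of the write still gets caught.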
## What's risky
1. **receipts.ts:495** hardcodes `input_hash_match: true` in drift reports while comments on lines 467-469 admit input-hash comparison is unimplemented; this is false telemetry in a forensic system.
2. **score_runs.ts:159** deduplicates scored runs by `scored.provenance.sig_hash` (the *evidence* hash), not by a composite of evidence + scorer version, so scorer logic or `SCORER_VERSION` updates are silently ignored on re-runs against existing partition files.
3. **transforms.ts:181** `auto_apply` transform falls back to `new Date().toISOString()` when `row.ts` is missing, injecting wall-clock time into the supposedly deterministic materialization layer.
4. **mode.rs:1035,1042** Rust test code assigns `Some("...".into())` and `None` to a `Vec<String>` field (`matrix_corpus`), which would fail `cargo test` compilation; this contradicts the claim that the tag is fully tested.
5. **export_sft.ts:109-133** synthesizes fake instruction templates per source stem instead of using actual historical prompts; the SFT firewall prevents category contamination but not prompt-fidelity distortion.
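Point 1 can be made honest either with a real comparison or by reporting that no comparison happened. A minimal sketch, with field and function names assumed rather than taken from `receipts.ts`:

```typescript
// Honest drift reporting: compare a recorded input hash against a freshly
// computed one, and emit null (comparison not performed) instead of a
// hardcoded true when no recorded hash exists.
import { createHash } from "node:crypto";

function sha256(data: string): string {
  return createHash("sha256").update(data).digest("hex");
}

interface DriftReport {
  input_hash_match: boolean | null; // null = comparison not performed
}

function checkInputHash(
  recorded: string | undefined,
  currentInput: string
): DriftReport {
  if (recorded === undefined) return { input_hash_match: null };
  return { input_hash_match: sha256(currentInput) === recorded };
}
```

A nullable field is strictly more informative than the current behavior: downstream consumers can distinguish "verified match" from "never checked".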
## Specific findings
- **mode.rs:1035** — Compile error in test helper: `matrix_corpus: Some("distilled_procedural_v1".into())` mismatches the `Vec<String>` type declared at line 172. **Rationale:** Direct struct construction in the test module uses an `Option` where a `Vec` is required, so the Rust test suite cannot compile.
- **receipts.ts:495** — Drift detection hardcodes `input_hash_match: true`. **Rationale:** The adjacent comment admits input-hash comparison is simplified and unimplemented (lines 467-469); asserting a verified match is misleading telemetry that will hide real input-side regressions.
- **score_runs.ts:159** — Scored-run dedup ignores scorer version. **Rationale:** `loadSeenHashes` and the skip logic key only on the EvidenceRecord `sig_hash`, meaning an existing scored-run file from yesterday will block updated scores even if `SCORER_VERSION` or scorer logic changed today.
- **transforms.ts:181** — Non-deterministic timestamp fallback in `auto_apply` transform. **Rationale:** `row.ts ?? new Date().toISOString()` injects wall-clock time when the source row lacks a timestamp, violating the header claim that transforms are “deterministic by construction” and breaking bit-identical reproducibility for that stream.
- **export_sft.ts:126** — Unsafe property access via `as any`. **Rationale:** `(ev as any).contractor` bypasses the `EvidenceRecord` type contract; if the property is absent the template silently emits `"<contractor>"`, degrading SFT data quality without a type error.
- **scorer.ts:30** — Environmental dependency in deterministic scorer. **Rationale:** `process.env.LH_SCORER_VERSION` means identical evidence inputs produce different `scorer_version` stamps (and different downstream receipts) depending on the runtime environment, undermining bit-identical claims.
- **replay.ts:378** — Non-deterministic run identifier. **Rationale:** `` `replay:${task_hash.slice(0, 16)}:${Date.now()}` `` makes replay evidence rows non-reproducible and risks collision under rapid successive calls.
- **export_sft.ts:109-133** — Synthetic instruction generation replaces ground-truth prompts. **Rationale:** The exporter fabricates instruction strings from metadata (e.g., hardcoded scrum review phrasing) rather than retrieving the actual historical prompt, so the resulting SFT dataset trains on reconstructed, not authentic, user instructions.
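The `replay.ts:378` finding has a straightforward shape of fix: derive the run identifier from content hashes instead of `Date.now()`. A sketch under assumed names (`deterministicReplayId` and its parameters are not the repo's API):

```typescript
// Content-derived replay id: the same task and evidence inputs always
// produce the same id, so replay evidence rows are reproducible and
// rapid successive calls cannot collide unless inputs are identical.
import { createHash } from "node:crypto";

function deterministicReplayId(
  taskHash: string,
  evidenceSigHash: string
): string {
  const suffix = createHash("sha256")
    .update(`${taskHash}:${evidenceSigHash}`)
    .digest("hex")
    .slice(0, 16);
  return `replay:${taskHash.slice(0, 16)}:${suffix}`;
}
```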
## Direction recommendation
**Pause the staffing audit and harden the substrate first.** Before building the staffing inference mode (`staffing_inference_lakehouse` in `mode.rs:54`) on top of this substrate:
1. Fix the Rust test compile errors (`mode.rs:1035,1042`) and ensure `cargo test` runs in CI.
2. Replace the hardcoded `input_hash_match: true` in drift detection (`receipts.ts:495`) with a real hash comparison or remove the field until it is implemented.
3. Change scored-run dedup (`score_runs.ts:159`) to key on a composite hash of `evidence_sig_hash + scorer_version + SCORER_VERSION` so scorer updates force re-scoring.
4. Remove the `new Date().toISOString()` fallback in `transforms.ts:181` or fail the row so determinism is preserved.
5. Audit all `as any` casts in the export layer (`export_sft.ts:126`) for type-safe alternatives.
Once those fixes land and acceptance re-runs pass, proceed to the staffing audit wave; the architecture is sound enough to support it, but the forensic guarantees must be honest before downstream teams depend on them.
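Recommendation 3 amounts to folding scorer identity into the dedup key. A minimal sketch; `scoredRunKey`, `shouldSkip`, and their parameters are assumptions, not the `score_runs.ts` API:

```typescript
// Composite dedup key: evidence hash plus scorer version, so bumping the
// scorer invalidates stale entries and forces re-scoring on the next run.
import { createHash } from "node:crypto";

function scoredRunKey(evidenceSigHash: string, scorerVersion: string): string {
  return createHash("sha256")
    .update(`${evidenceSigHash}:${scorerVersion}`)
    .digest("hex");
}

// Skip only when this exact (evidence, scorer version) pair was scored before.
function shouldSkip(
  seen: Set<string>,
  evidenceSigHash: string,
  scorerVersion: string
): boolean {
  return seen.has(scoredRunKey(evidenceSigHash, scorerVersion));
}
```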