lakehouse/reports/kimi/audit-last-week-full.md
root 41b0a99ed2 chore: add real content that was sitting untracked
Surfaced by today's untracked-files audit. None of these are accidents —
multiple are referenced by name in CLAUDE.md and memory files but were
never added.

Categories:
- docs/PHASE_AUDIT_GUIDE.md (106 LOC) — Claude Code phase audit guidance
- ops/systemd/lakehouse-langfuse-bridge.service — Langfuse bridge unit
- package.json — top-level npm manifest
- scripts/e2e_pipeline_check.sh + production_smoke.sh — real test scripts
- reports/kimi/audit-last-week*.md — the "Two reports live" files that CLAUDE.md cites
- tests/multi-agent/scenarios/ — 44 staffing scenarios (cutover decision A)
- tests/multi-agent/playbooks/ — 102 playbook records
- tests/battery/, tests/agent_test/PRD.md, tests/real-world/* — real tests
- sidecar/sidecar/{lab_ui,pipeline_lab}.py — 888 LOC dev-only UIs that
  remain in service post-sidecar-drop (commit ba928b1 explicitly kept them)

Sensitivity check: scenarios use synthetic company names ("Heritage Foods",
"Cornerstone Fabrication"); audit reports describe code findings only;
no PII or secrets surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 22:22:10 -05:00


Kimi Forensic Audit (FULL FILES) — distillation v1.0.0

Generated: 2026-04-27 by kimi-for-coding via gateway /v1/chat
Latency: 270.6s | finish: stop | usage: {'prompt_tokens': 66338, 'completion_tokens': 10159, 'total_tokens': 76497}
Input: /tmp/kimi-audit-full.md (238KB · 12 commits · 15 files · line-numbered, no truncation)


Verdict

Hold: the substrate's TypeScript pipeline is architecturally coherent and the SFT firewall is genuine, but committed Rust tests fail to compile, drift detection hardcodes an unverified integrity assertion, and the deterministic guarantees leak wall-clock time in multiple places.

What's solid

  • Three-layer SFT contamination firewall is real. Schema enum restricts quality_score to ["accepted", "partially_accepted"] (sft_sample.ts:13,62), exporter constant SFT_NEVER blocks rejected/needs_human_review before synthesis (export_sft.ts:51,205), and receipts.ts re-reads the output to fail loud if any forbidden score leaked (receipts.ts:231-236). The pattern is sketched after this list.
  • Core scorer is pure and deterministic. scoreRecord takes an EvidenceRecord, performs no I/O, makes no LLM calls, and uses no mutable state (scorer.ts:1-5,257-273).
  • Quarantine is exhaustive and observable. Every exporter routes skips to structured exports/quarantine/<exporter>.jsonl with typed reasons; silent drops are impossible by construction (quarantine.ts:1-6,14-26).
  • Evidence provenance is mandatory on every row. Every EvidenceRecord carries source_file, line_offset, sig_hash, and recorded_at (build_evidence_index.ts:27-34).
  • Local-first replay reduces cloud calls. replay.ts defaults to a local model, augments via RAG retrieval, and only escalates on validation failure, directly supporting the cloud-call reduction claim (replay.ts:24,349-376).
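
For illustration, a minimal TypeScript sketch of the three-layer pattern in the first bullet. The quality_score values and the SFT_NEVER name come from the findings above; the record shapes, the exportSamples helper, and the JSONL handling are assumptions for the sketch, not the actual sft_sample.ts / export_sft.ts / receipts.ts code.

```ts
import { readFileSync, writeFileSync } from "node:fs";

// Layer 1: the schema enum only admits the two allowed scores.
type QualityScore = "accepted" | "partially_accepted";

// Layer 2: scores the exporter must never let through to synthesis.
const SFT_NEVER = new Set(["rejected", "needs_human_review"]);

interface SftSample {
  instruction: string;
  output: string;
  quality_score: QualityScore;
}

function exportSamples(
  rows: Array<{ score: string; instruction: string; output: string }>,
  outPath: string,
): void {
  const kept: SftSample[] = [];
  for (const row of rows) {
    if (SFT_NEVER.has(row.score)) continue; // layer 2: blocked before synthesis
    if (row.score === "accepted" || row.score === "partially_accepted") {
      kept.push({ instruction: row.instruction, output: row.output, quality_score: row.score });
    }
  }
  writeFileSync(outPath, kept.map((s) => JSON.stringify(s)).join("\n") + "\n");

  // Layer 3: re-read what was actually written and fail loud on any leak.
  const written = readFileSync(outPath, "utf8").trim();
  for (const line of written ? written.split("\n") : []) {
    const sample = JSON.parse(line) as SftSample;
    if (SFT_NEVER.has(sample.quality_score)) {
      throw new Error(`SFT contamination: "${sample.quality_score}" leaked into ${outPath}`);
    }
  }
}
```

The load-bearing layer is the third one: the guarantee is checked against the bytes actually written to disk rather than in-memory state, which is what makes a leak fail loud instead of slipping into training data.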

What's risky

  1. receipts.ts:495 hardcodes input_hash_match: true in drift reports while comments on lines 467-469 admit input-hash comparison is unimplemented; this is false telemetry in a forensic system (an honest alternative is sketched after this list).
  2. score_runs.ts:159 deduplicates scored runs by scored.provenance.sig_hash (the evidence hash), not by a composite of evidence + scorer version, so scorer logic or SCORER_VERSION updates are silently ignored on re-runs against existing partition files.
  3. transforms.ts:181 auto_apply transform falls back to new Date().toISOString() when row.ts is missing, injecting wall-clock time into the supposedly deterministic materialization layer.
  4. mode.rs:1035,1042 Rust test code assigns Some("...".into()) and None to a Vec<String> field (matrix_corpus), which would fail cargo test compilation; this contradicts the claim that the tag is fully tested.
  5. export_sft.ts:109-133 synthesizes fake instruction templates per source stem instead of using actual historical prompts; the SFT firewall prevents category contamination but not prompt-fidelity distortion.
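
To make item 1 concrete, here is a minimal sketch of an honest input_hash_match field, assuming a simplified DriftReport shape; every name other than input_hash_match is illustrative rather than the receipts.ts API.

```ts
import { createHash } from "node:crypto";

interface DriftReport {
  run_id: string;
  output_hash_match: boolean;
  // Tri-state instead of a hardcoded `true`: never claim a verification
  // that did not happen.
  input_hash_match: boolean | "unverified";
}

// Hash the serialized inputs that fed a run so baseline and current runs
// can be compared later.
function hashInputPayload(payload: string): string {
  return createHash("sha256").update(payload).digest("hex");
}

function buildDriftReport(
  runId: string,
  outputsMatch: boolean,
  baselineInputHash?: string,
  currentInputHash?: string,
): DriftReport {
  return {
    run_id: runId,
    output_hash_match: outputsMatch,
    input_hash_match:
      baselineInputHash !== undefined && currentInputHash !== undefined
        ? baselineInputHash === currentInputHash
        : "unverified",
  };
}
```

Until real input hashes are recorded and compared, the field reports "unverified" instead of asserting a match that never happened.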

Specific findings

  • mode.rs:1035 — Compile error in test helper: matrix_corpus: Some("distilled_procedural_v1".into()) mismatches the Vec<String> type declared at line 172. Rationale: Direct struct construction in the test module uses an Option where a Vec is required, so the Rust test suite cannot compile.
  • receipts.ts:495 — Drift detection hardcodes input_hash_match: true. Rationale: The adjacent comment admits input-hash comparison is simplified and unimplemented (lines 467-469); asserting a verified match is misleading telemetry that will hide real input-side regressions.
  • score_runs.ts:159 — Scored-run dedup ignores scorer version. Rationale: loadSeenHashes and the skip logic key only on the EvidenceRecord sig_hash, meaning an existing scored-run file from yesterday will block updated scores even if SCORER_VERSION or scorer logic changed today.
  • transforms.ts:181 — Non-deterministic timestamp fallback in auto_apply transform. Rationale: row.ts ?? new Date().toISOString() injects wall-clock time when the source row lacks a timestamp, violating the header claim that transforms are “deterministic by construction” and breaking bit-identical reproducibility for that stream (a deterministic alternative is sketched after this list).
  • export_sft.ts:126 — Unsafe property access via as any. Rationale: (ev as any).contractor bypasses the EvidenceRecord type contract; if the property is absent the template silently emits "<contractor>", degrading SFT data quality without a type error.
  • scorer.ts:30 — Environmental dependency in deterministic scorer. Rationale: process.env.LH_SCORER_VERSION means identical evidence inputs produce different scorer_version stamps (and different downstream receipts) depending on the runtime environment, undermining bit-identical claims.
  • replay.ts:378 — Non-deterministic run identifier. Rationale: `replay:${task_hash.slice(0, 16)}:${Date.now()}` makes replay evidence rows non-reproducible and risks collision under rapid successive calls.
  • export_sft.ts:109-133 — Synthetic instruction generation replaces ground-truth prompts. Rationale: The exporter fabricates instruction strings from metadata (e.g., hardcoded scrum review phrasing) rather than retrieving the actual historical prompt, so the resulting SFT dataset trains on reconstructed, not authentic, user instructions.
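
A minimal sketch of deterministic alternatives to the two wall-clock leaks above (the transforms.ts:181 fallback and the replay.ts:378 run id); the row shape, helper names, and attempt counter are assumptions, not the pipeline's actual API.

```ts
import { createHash } from "node:crypto";

interface SourceRow {
  id: string;
  ts?: string;
  payload: unknown;
}

// Instead of `row.ts ?? new Date().toISOString()`: fail the row (e.g. route
// it to quarantine) rather than inject wall-clock time into materialization.
function materializedTimestamp(row: SourceRow): string {
  if (row.ts) return row.ts;
  throw new Error(`row ${row.id} has no timestamp; refusing non-deterministic fallback`);
}

// Instead of `replay:${task_hash.slice(0, 16)}:${Date.now()}`: derive the id
// purely from the inputs so the same replay always yields the same id.
function replayRunId(taskHash: string, attempt: number): string {
  const suffix = createHash("sha256")
    .update(`${taskHash}:${attempt}`)
    .digest("hex")
    .slice(0, 12);
  return `replay:${taskHash.slice(0, 16)}:${suffix}`;
}
```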

Direction recommendation

Pause the staffing audit and harden the substrate first. Before building the staffing inference mode (staffing_inference_lakehouse in mode.rs:54) on top of this substrate:

  1. Fix the Rust test compile errors (mode.rs:1035,1042) and ensure cargo test runs in CI.
  2. Replace the hardcoded input_hash_match: true in drift detection (receipts.ts:495) with a real hash comparison or remove the field until it is implemented.
  3. Change scored-run dedup (score_runs.ts:159) to key on a composite hash of evidence_sig_hash + scorer_version + SCORER_VERSION so scorer updates force re-scoring (sketched below).
  4. Remove the new Date().toISOString() fallback in transforms.ts:181 or fail the row so determinism is preserved.
  5. Audit all as any casts in the export layer (export_sft.ts:126) for type-safe alternatives.
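
A minimal sketch of recommendation 3, assuming a simplified scored-run shape; everything except the sig_hash field and the SCORER_VERSION constant is illustrative.

```ts
import { createHash } from "node:crypto";

// Stand-in for the real scorer version constant.
const SCORER_VERSION = "v3";

interface ScoredRun {
  provenance: { sig_hash: string };
  scorer_version: string;
}

// Key on evidence hash + scorer version instead of the evidence hash alone,
// so any scorer change produces a new key and forces re-scoring.
function dedupKey(run: ScoredRun): string {
  return createHash("sha256")
    .update(`${run.provenance.sig_hash}:${run.scorer_version}:${SCORER_VERSION}`)
    .digest("hex");
}

function shouldSkip(run: ScoredRun, seen: Set<string>): boolean {
  return seen.has(dedupKey(run));
}
```

With this key, bumping SCORER_VERSION or the per-record scorer_version changes every key, so yesterday's partition files no longer suppress re-scored results.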

Once those fixes land and acceptance re-runs pass, proceed to the staffing audit wave; the architecture is sound enough to support it, but the forensic guarantees must be honest before downstream teams depend on them.