Surfaced by today's untracked-files audit. None of these files is accidental:
multiple are referenced by name in CLAUDE.md and memory files but were
never added to the repository.
Categories:
- docs/PHASE_AUDIT_GUIDE.md (106 LOC) — Claude Code phase audit guidance
- ops/systemd/lakehouse-langfuse-bridge.service — Langfuse bridge unit
- package.json — top-level npm manifest
- scripts/e2e_pipeline_check.sh + production_smoke.sh — real test scripts
- reports/kimi/audit-last-week*.md — the "Two reports live" CLAUDE.md cites
- tests/multi-agent/scenarios/ — 44 staffing scenarios (cutover decision A)
- tests/multi-agent/playbooks/ — 102 playbook records
- tests/battery/, tests/agent_test/PRD.md, tests/real-world/* — real tests
- sidecar/sidecar/{lab_ui,pipeline_lab}.py — 888 LOC dev-only UIs that
remain in service post-sidecar-drop (commit ba928b1 explicitly kept them)
Sensitivity check: scenarios use synthetic company names ("Heritage Foods",
"Cornerstone Fabrication"); audit reports describe code findings only;
no PII or secrets surfaced.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kimi Forensic Audit (FULL FILES) — distillation v1.0.0
Generated: 2026-04-27 by kimi-for-coding via gateway /v1/chat
Latency: 270.6s | finish: stop | usage: {'prompt_tokens': 66338, 'completion_tokens': 10159, 'total_tokens': 76497}
Input: /tmp/kimi-audit-full.md (238KB · 12 commits · 15 files · line-numbered, no truncation)
Verdict
Hold: the substrate’s TypeScript pipeline is architecturally coherent and the SFT firewall is genuine, but committed Rust tests fail to compile, drift detection hardcodes an unverified integrity assertion, and deterministic guarantees leak wall-clock time in multiple places.
What's solid
- Three-layer SFT contamination firewall is real. The schema enum restricts `quality_score` to `["accepted", "partially_accepted"]` (sft_sample.ts:13,62), the exporter constant `SFT_NEVER` blocks rejected/needs_human_review before synthesis (export_sft.ts:51,205), and `receipts.ts` re-reads the output to fail loud if any forbidden score leaked (receipts.ts:231-236).
- Core scorer is pure and deterministic. `scoreRecord` takes an `EvidenceRecord`, performs no I/O, no LLM calls, and uses no mutable state (scorer.ts:1-5,257-273).
- Quarantine is exhaustive and observable. Every exporter routes skips to structured `exports/quarantine/<exporter>.jsonl` with typed reasons; silent drops are impossible by construction (quarantine.ts:1-6,14-26).
- Evidence provenance is mandatory on every row. Every `EvidenceRecord` carries `source_file`, `line_offset`, `sig_hash`, and `recorded_at` (build_evidence_index.ts:27-34).
- Local-first replay reduces cloud calls. `replay.ts` defaults to a local model, augments via RAG retrieval, and only escalates on validation failure, directly supporting the cloud-call reduction claim (replay.ts:24,349-376).
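The layered guard described above can be illustrated roughly as follows. Only the identifiers `quality_score` and `SFT_NEVER` come from the audited code; the surrounding types and helpers are a hypothetical sketch, not the project's actual implementation:

```typescript
// Sketch of a three-layer SFT contamination firewall.
// Layer 1: the schema type only admits the two allowed scores.
type QualityScore = "accepted" | "partially_accepted";

// Layer 2: exporter-side denylist applied before any sample is synthesized.
const SFT_NEVER = new Set(["rejected", "needs_human_review"]);

interface ScoredRow {
  quality_score: string; // raw input, not yet narrowed to QualityScore
}

function exportRow(row: ScoredRow): ScoredRow | null {
  // Layer 2 in action: forbidden scores are blocked, never exported.
  if (SFT_NEVER.has(row.quality_score)) return null;
  return row;
}

function verifyExport(rows: ScoredRow[]): void {
  // Layer 3: re-read the written output and fail loud if anything leaked.
  for (const row of rows) {
    if (row.quality_score !== "accepted" && row.quality_score !== "partially_accepted") {
      throw new Error(`SFT firewall breach: forbidden score "${row.quality_score}" leaked`);
    }
  }
}
```

The value of the third layer is that it does not trust the first two: it validates the artifact on disk, so a future refactor that weakens the exporter guard still fails loudly at verification time.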
What's risky
- receipts.ts:495 hardcodes `input_hash_match: true` in drift reports while comments on lines 467-469 admit input-hash comparison is unimplemented; this is false telemetry in a forensic system.
- score_runs.ts:159 deduplicates scored runs by `scored.provenance.sig_hash` (the evidence hash), not by a composite of evidence + scorer version, so scorer logic or `SCORER_VERSION` updates are silently ignored on re-runs against existing partition files.
- transforms.ts:181 — the `auto_apply` transform falls back to `new Date().toISOString()` when `row.ts` is missing, injecting wall-clock time into the supposedly deterministic materialization layer.
- mode.rs:1035,1042 — Rust test code assigns `Some("...".into())` and `None` to a `Vec<String>` field (`matrix_corpus`), which would fail `cargo test` compilation; this contradicts the claim that the tag is fully tested.
- export_sft.ts:109-133 synthesizes fake instruction templates per source stem instead of using actual historical prompts; the SFT firewall prevents category contamination but not prompt-fidelity distortion.
Specific findings
- mode.rs:1035 — Compile error in test helper: `matrix_corpus: Some("distilled_procedural_v1".into())` mismatches the `Vec<String>` type declared at line 172. Rationale: direct struct construction in the test module uses an `Option` where a `Vec` is required, so the Rust test suite cannot compile.
- receipts.ts:495 — Drift detection hardcodes `input_hash_match: true`. Rationale: the adjacent comment admits input-hash comparison is simplified and unimplemented (lines 467-469); asserting a verified match is misleading telemetry that will hide real input-side regressions.
- score_runs.ts:159 — Scored-run dedup ignores scorer version. Rationale: `loadSeenHashes` and the skip logic key only on the `EvidenceRecord` `sig_hash`, meaning an existing scored-run file from yesterday will block updated scores even if `SCORER_VERSION` or scorer logic changed today.
- transforms.ts:181 — Non-deterministic timestamp fallback in the `auto_apply` transform. Rationale: `row.ts ?? new Date().toISOString()` injects wall-clock time when the source row lacks a timestamp, violating the header claim that transforms are “deterministic by construction” and breaking bit-identical reproducibility for that stream.
- export_sft.ts:126 — Unsafe property access via `as any`. Rationale: `(ev as any).contractor` bypasses the `EvidenceRecord` type contract; if the property is absent the template silently emits `"<contractor>"`, degrading SFT data quality without a type error.
- scorer.ts:30 — Environmental dependency in deterministic scorer. Rationale: `process.env.LH_SCORER_VERSION` means identical evidence inputs produce different `scorer_version` stamps (and different downstream receipts) depending on the runtime environment, undermining bit-identical claims.
- replay.ts:378 — Non-deterministic run identifier. Rationale: `` `replay:${task_hash.slice(0, 16)}:${Date.now()}` `` makes replay evidence rows non-reproducible and risks collision under rapid successive calls.
- export_sft.ts:109-133 — Synthetic instruction generation replaces ground-truth prompts. Rationale: the exporter fabricates instruction strings from metadata (e.g., hardcoded scrum review phrasing) rather than retrieving the actual historical prompt, so the resulting SFT dataset trains on reconstructed, not authentic, user instructions.
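Both wall-clock findings above have the same shape of fix: derive identifiers and timestamps from the input, never from the clock. A hedged sketch (the field and function names are invented for illustration; this is not a patch against the actual replay.ts or transforms.ts):

```typescript
// Sketch: deterministic replay identifier. Instead of
// `replay:${hash}:${Date.now()}`, fold an explicit attempt counter into
// the id so re-runs are reproducible and rapid calls cannot collide.
function replayRunId(taskHash: string, attempt: number): string {
  return `replay:${taskHash.slice(0, 16)}:${attempt}`;
}

// Sketch: deterministic timestamp handling. Instead of falling back to
// new Date().toISOString(), reject rows that lack a source timestamp so
// the materialization layer stays bit-identical across runs.
function requireRowTimestamp(row: { ts?: string }): string {
  if (row.ts === undefined) {
    throw new Error("row missing ts; refusing wall-clock fallback");
  }
  return row.ts;
}
```

Failing the row (and routing it to quarantine, which the substrate already does elsewhere) keeps the determinism guarantee intact while still making the missing-timestamp condition observable.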
Direction recommendation
Pause the staffing audit and harden the substrate first. Before building the staffing inference mode (`staffing_inference_lakehouse` in mode.rs:54) on top of this substrate:
- Fix the Rust test compile errors (mode.rs:1035,1042) and ensure `cargo test` runs in CI.
- Replace the hardcoded `input_hash_match: true` in drift detection (receipts.ts:495) with a real hash comparison, or remove the field until it is implemented.
- Change scored-run dedup (score_runs.ts:159) to key on a composite hash of `evidence_sig_hash + scorer_version + SCORER_VERSION` so scorer updates force re-scoring.
- Remove the `new Date().toISOString()` fallback in transforms.ts:181, or fail the row, so determinism is preserved.
- Audit all `as any` casts in the export layer (export_sft.ts:126) for type-safe alternatives.
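The composite-key recommendation could look roughly like this, assuming the evidence `sig_hash` and a scorer-version string are both available at dedup time (names are illustrative, not from score_runs.ts):

```typescript
import { createHash } from "node:crypto";

// Sketch: a dedup key covering both the evidence and the scorer that
// produced the score, so a scorer update forces re-scoring instead of
// being silently shadowed by yesterday's partition file.
function scoredRunKey(evidenceSigHash: string, scorerVersion: string): string {
  return createHash("sha256")
    .update(`${evidenceSigHash}\u0000${scorerVersion}`) // NUL separator avoids ambiguity
    .digest("hex");
}
```

With a key like this, a run scored under one `SCORER_VERSION` no longer blocks a re-score under the next: the hash changes whenever either input changes.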
Once those fixes land and acceptance re-runs pass, proceed to the staffing audit wave; the architecture is sound enough to support it, but the forensic guarantees must be honest before downstream teams depend on them.