Phase 2 ships the JOIN script that turns 12 source JSONL streams
into unified data/evidence/YYYY/MM/DD/<source>.jsonl rows conforming
to EvidenceRecord v1, plus a high-level health audit proving the
substrate is real before Phase 3 reads from it.
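For orientation, a hypothetical sketch of the row shape, reconstructed from
the fields the health audit below reads; NOT the authoritative schema, which
lives in auditor/schemas/distillation/types:

// Hypothetical EvidenceRecord v1 sketch, inferred from what
// check_evidence_health.ts inspects; the real schema may carry more fields.
interface EvidenceRecordSketch {
  provenance: {
    source_file: string;  // repo-relative path to the source JSONL
    line_offset: number;  // 0-based line index into that file
    sig_hash: string;     // canonicalSha256 of the ordered source row
    recorded_at: string;  // ISO timestamp of materialization
  };
  model_role?: string;
  model_name?: string;
  success_markers?: string[];
  failure_markers?: string[];
  observer_verdict?: string;
  observer_notes?: string[];
  latency_ms?: number;
  retrieved_context?: Record<string, unknown>;
  validation_results?: Record<string, unknown>;
  text?: string;
}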
Files:
scripts/distillation/build_evidence_index.ts materializeAll() + cli
scripts/distillation/check_evidence_health.ts provenance + coverage audit
tests/distillation/build_evidence_index.test.ts 9 acceptance tests
Test metrics:
9/9 pass · 85 expect() calls · 323ms
Real-data run (2026-04-27T03:33:53Z):
1053 rows read from 12 source streams
1051 written (99.8%) to data/evidence/2026/04/27/
2 skipped (outcomes.jsonl rows missing created_at — schema-level catch)
0 deduped on first run
Sources covered (priority order from recon):
TIER 1 (validated 100% in Phase 1, 8 sources):
distilled_facts/procedures/config_hints, contract_analyses,
mode_experiments, scrum_reviews, observer_escalations, audit_facts
TIER 2 (added by Phase 2):
auto_apply, observer_reviews, audits, outcomes
High-level audit results:
Provenance round-trip: 30/30 sampled rows trace cleanly to source
rows with matching canonicalSha256(orderedKeys(row)). Every output
has source_file + line_offset + sig_hash + recorded_at. Proven.
Score-readiness: 54% aggregate scoreable. Three-class taxonomy
emerges from coverage matrix:
- Verdict-bearing (100% scoreable): scrum_reviews, observer_reviews,
audits, contract_analyses — direct scoring inputs
- Telemetry-rich (0-70%): mode_experiments, audit_facts, outcomes
— Phase 3 will derive markers from latency/grounding/retrieval
- Pure-extraction (0%): distilled_*, observer_escalations
— context for OTHER scoring, not scoreable themselves
Invariants enforced (proven by tests + real-data audit):
- ZERO model calls in materializer (deterministic only)
- canonicalSha256(orderedKeys(row)) per source row → stable sig_hash
  (assumed behavior sketched after this list)
- Schema validator gates output: rejected rows go to skips, never to evidence/
- JSON.parse failures caught + logged, never crash the run
- Missing source files tallied as rows_present=false, never error
- Idempotent: second run on identical input writes 0 rows (proven on
real data: 1053 read, 0 written, 1051 deduped)
- Bit-stable: identical input produces byte-identical output (proven
by tests/distillation/build_evidence_index.test.ts case 3)
- Receipt self-validates against schema before write
- validation_pass = boolean (skipped == 0), never inferred
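The hashing invariant is the backbone of dedup and idempotency. A minimal
sketch of the assumed behavior (the shipped helper lives in
auditor/schemas/distillation/types; this is an illustration, not that code):

// Assumed behavior of canonicalSha256: recursively sort object keys so
// serialization is order-independent, then SHA-256 the canonical JSON.
// A sketch only; the shipped helper may differ in detail.
import { createHash } from "node:crypto";

function orderedKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(orderedKeys);
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    return Object.fromEntries(Object.keys(obj).sort().map(k => [k, orderedKeys(obj[k])]));
  }
  return value;
}

async function canonicalSha256(row: unknown): Promise<string> {
  // Rows that differ only in key order hash identically, which is what
  // makes sig_hash stable enough to drive dedup and idempotent re-runs.
  return createHash("sha256").update(JSON.stringify(orderedKeys(row))).digest("hex");
}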
Receipt at:
reports/distillation/2026-04-27T03-33-53-972Z/receipt.json
- schema_version=1, git_sha pinned, sha256 on every input/output
- record_counts: {in:1053, out:1051, skipped:2, deduped:0}
- validation_pass=false (skipped > 0; spec says explicit, never inferred)
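Illustrative receipt shape (field names beyond the bullets above are
assumptions, not the shipped schema):

// Hypothetical receipt.json shape; the inputs/outputs layout is a guess
// based on "sha256 on every input/output" above.
interface ReceiptSketch {
  schema_version: 1;
  git_sha: string;                 // pinned materializer commit
  inputs: Array<{ path: string; sha256: string }>;
  outputs: Array<{ path: string; sha256: string }>;
  record_counts: { in: number; out: number; skipped: number; deduped: number };
  validation_pass: boolean;        // explicitly skipped === 0, never inferred
}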
Skips at:
data/_kb/distillation_skips.jsonl (2 rows from outcomes.jsonl,
reason: timestamp field missing — schema layer caught it cleanly)
Health audit at:
data/_kb/evidence_health.md
Phase 2 done-criteria all met:
✓ tests pass
✓ ≥1 row from each Tier-1 source on real data (8/8 + 4 Tier 2 bonus)
✓ data/_kb/distillation_skips.jsonl populated with reasons
✓ Receipt JSON written + self-validates
✓ Provenance round-trip proven on real sampled rows
✓ Score-readiness coverage measured
Carry-overs to Phase 3:
- audit_discrepancies transform (needed before Phase 4c preference data)
- model_trust transform (needed before ModelLedgerEntry aggregation)
- outcomes.jsonl created_at: 2 rows fail materialization; decide
  between a transform-side fix and a source-side fix
- 11 untested streams from recon still have no transform; add as
Phase 3+ consumers need them
- mode_experiments + distilled_* are 0% scoreable; Phase 3 must
  JOIN to adjacent verdict-bearing records, NOT score in isolation
  (speculative sketch below)
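One speculative shape for that JOIN (the join key and time window are pure
assumptions for Phase 3 to pin down):

// Speculative Phase 3 JOIN sketch. Nothing here is pinned down by Phase 2:
// the verdict field, timestamp choice, and one-hour window are all guesses.
type Row = { observer_verdict?: string; provenance: { recorded_at: string } };

function verdictsNear(row: Row, verdictRows: Row[], windowMs = 60 * 60 * 1000): Row[] {
  const t = Date.parse(row.provenance.recorded_at);
  return verdictRows.filter(v =>
    typeof v.observer_verdict === "string" &&
    Math.abs(Date.parse(v.provenance.recorded_at) - t) <= windowMs);
}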
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
// check_evidence_health.ts — high-level audit of the materialized
// EvidenceRecord substrate. Answers two questions Phase 3 needs:
//
// 1. PROVENANCE ROUND-TRIP — sample N output rows, look up the
//    source row at the recorded (source_file, line_offset),
//    recompute canonicalSha256, confirm it matches provenance.sig_hash.
//    Hard pass/fail. If even one row fails, provenance is theater.
//
// 2. SCORE-READINESS COVERAGE — for each source, what fraction of
//    materialized rows carry the signals the Success Scorer will
//    need: model_role, success_markers, failure_markers,
//    observer_verdict, latency_ms, retrieved_context, text. Tells
//    Phase 3 which sources to read from for each gate.
//
// Output: markdown report to stdout + data/_kb/evidence_health.md.
//
// Run: bun run scripts/distillation/check_evidence_health.ts

import { existsSync, readFileSync, readdirSync, statSync, writeFileSync } from "node:fs";
import { resolve } from "node:path";
import { canonicalSha256 } from "../../auditor/schemas/distillation/types";

const ROOT = process.env.LH_DISTILL_ROOT ?? "/home/profit/lakehouse";
const SAMPLE_FOR_PROVENANCE = 30;

interface CoverageBucket {
  source: string;
  total: number;
  with_model_role: number;
  with_model_name: number;
  with_success_markers: number;
  with_failure_markers: number;
  with_observer_verdict: number;
  with_latency_ms: number;
  with_retrieved_context: number;
  with_text: number;
  scoreable: number; // has at least ONE signal the scorer can use
}

interface ProvenanceCheck {
  passed: number;
  failed: number;
  failures: Array<{ output_path: string; line: number; reason: string }>;
}

function listEvidenceFiles(evidence_root: string): string[] {
  const out: string[] = [];
  if (!existsSync(evidence_root)) return out;
  for (const yyyy of readdirSync(evidence_root).sort()) {
    const ydir = resolve(evidence_root, yyyy);
    if (!statSync(ydir).isDirectory()) continue;
    for (const mm of readdirSync(ydir).sort()) {
      const mdir = resolve(ydir, mm);
      if (!statSync(mdir).isDirectory()) continue;
      for (const dd of readdirSync(mdir).sort()) {
        const ddir = resolve(mdir, dd);
        if (!statSync(ddir).isDirectory()) continue;
        for (const f of readdirSync(ddir)) {
          if (f.endsWith(".jsonl")) out.push(resolve(ddir, f));
        }
      }
    }
  }
  return out;
}

// Has at least one deterministic signal the Phase 3 scorer can act on.
// Order is generous: any of these counts as "scoreable", because the
// scorer combines multiple signals.
function isScoreable(row: any): boolean {
  if (Array.isArray(row.success_markers) && row.success_markers.length > 0) return true;
  if (Array.isArray(row.failure_markers) && row.failure_markers.length > 0) return true;
  if (typeof row.observer_verdict === "string") return true;
  if (row.validation_results && Object.keys(row.validation_results).length > 0) return true;
  if (Array.isArray(row.observer_notes) && row.observer_notes.length > 0) return true;
  return false;
}

function bucketStart(source: string): CoverageBucket {
  return {
    source, total: 0,
    with_model_role: 0, with_model_name: 0,
    with_success_markers: 0, with_failure_markers: 0,
    with_observer_verdict: 0, with_latency_ms: 0,
    with_retrieved_context: 0, with_text: 0,
    scoreable: 0,
  };
}

function pct(n: number, total: number): string {
  if (total === 0) return "—";
  return Math.round(100 * n / total) + "%";
}

async function main() {
  const evidenceFiles = listEvidenceFiles(resolve(ROOT, "data/evidence"));
  if (evidenceFiles.length === 0) {
    console.error("No evidence files found. Run scripts/distillation/build_evidence_index.ts first.");
    process.exit(1);
  }

  // ── 1. Coverage scan ────────────────────────────────────────────
  const buckets = new Map<string, CoverageBucket>();
  const allOutputRows: Array<{ output_path: string; line: number; row: any }> = [];

  for (const evPath of evidenceFiles) {
    const sourceLabel = evPath.split("/").pop()!.replace(/\.jsonl$/, "");
    const b = buckets.get(sourceLabel) ?? bucketStart(sourceLabel);
    const lines = readFileSync(evPath, "utf8").split("\n").filter(Boolean);
    for (let i = 0; i < lines.length; i++) {
      const row = JSON.parse(lines[i]);
      b.total++;
      if (row.model_role) b.with_model_role++;
      if (row.model_name) b.with_model_name++;
      if (Array.isArray(row.success_markers) && row.success_markers.length > 0) b.with_success_markers++;
      if (Array.isArray(row.failure_markers) && row.failure_markers.length > 0) b.with_failure_markers++;
      if (typeof row.observer_verdict === "string") b.with_observer_verdict++;
      if (typeof row.latency_ms === "number") b.with_latency_ms++;
      if (row.retrieved_context && Object.keys(row.retrieved_context).length > 0) b.with_retrieved_context++;
      if (typeof row.text === "string" && row.text.length > 0) b.with_text++;
      if (isScoreable(row)) b.scoreable++;
      allOutputRows.push({ output_path: evPath, line: i, row });
    }
    buckets.set(sourceLabel, b);
  }

  // ── 2. Provenance round-trip on a stride sample ─────────────────
  const sampleSize = Math.min(SAMPLE_FOR_PROVENANCE, allOutputRows.length);
  const indices = new Set<number>();
  // Deterministic-ish sample: stride through evenly so we hit different sources.
  const stride = Math.max(1, Math.floor(allOutputRows.length / sampleSize));
  for (let i = 0; i < allOutputRows.length && indices.size < sampleSize; i += stride) indices.add(i);
  // Top up from the tail in case stride truncates early. Walk an explicit
  // cursor so a collision with a stride-picked index cannot stall the loop.
  for (let j = allOutputRows.length - 1; j >= 0 && indices.size < sampleSize; j--) indices.add(j);

  const provCheck: ProvenanceCheck = { passed: 0, failed: 0, failures: [] };
  // Cache source-file lines so we don't re-read big files repeatedly.
  const sourceCache = new Map<string, string[]>();

  for (const idx of indices) {
    const { output_path, line, row } = allOutputRows[idx];
    const prov = row.provenance;
    if (!prov || !prov.source_file || prov.line_offset == null || !prov.sig_hash) {
      provCheck.failed++;
      provCheck.failures.push({ output_path, line, reason: "missing provenance fields" });
      continue;
    }
    const sourceAbs = resolve(ROOT, prov.source_file);
    if (!sourceCache.has(sourceAbs)) {
      if (!existsSync(sourceAbs)) {
        provCheck.failed++;
        provCheck.failures.push({ output_path, line, reason: `source missing: ${prov.source_file}` });
        continue;
      }
      sourceCache.set(sourceAbs, readFileSync(sourceAbs, "utf8").split("\n"));
    }
    const sourceLines = sourceCache.get(sourceAbs)!;
    if (prov.line_offset >= sourceLines.length) {
      provCheck.failed++;
      provCheck.failures.push({ output_path, line, reason: `line_offset ${prov.line_offset} past EOF (source has ${sourceLines.length} lines)` });
      continue;
    }
    const sourceLine = sourceLines[prov.line_offset];
    let sourceRow: any;
    try { sourceRow = JSON.parse(sourceLine); }
    catch (e) {
      provCheck.failed++;
      provCheck.failures.push({ output_path, line, reason: `source line not JSON: ${(e as Error).message.slice(0, 60)}` });
      continue;
    }
    const recomputed = await canonicalSha256(sourceRow);
    if (recomputed !== prov.sig_hash) {
      provCheck.failed++;
      provCheck.failures.push({
        output_path, line,
        reason: `sig_hash mismatch: prov=${prov.sig_hash.slice(0, 16)}… recomputed=${recomputed.slice(0, 16)}…`,
      });
      continue;
    }
    provCheck.passed++;
  }

  // ── 3. Render markdown ──────────────────────────────────────────
  const md: string[] = [];
  md.push("# Evidence Health — Phase 2 high-level audit");
  md.push("");
  md.push(`**Run:** ${new Date().toISOString()}`);
  md.push(`**Evidence files:** ${evidenceFiles.length}`);
  md.push(`**Total records:** ${allOutputRows.length}`);
  md.push("");
  md.push("## 1. Provenance round-trip");
  md.push("");
  md.push(`Sample size: **${sampleSize}** rows (stride sample across all evidence).`);
  md.push("");
  md.push(`| Passed | Failed |`);
  md.push(`|---|---|`);
  md.push(`| ${provCheck.passed} | ${provCheck.failed} |`);
  md.push("");
  if (provCheck.failed > 0) {
    md.push("### Failures");
    for (const f of provCheck.failures.slice(0, 20)) {
      md.push(`- \`${f.output_path.split("/").slice(-2).join("/")}\` line ${f.line}: ${f.reason}`);
    }
  } else {
    md.push("**All sampled rows traced cleanly back to source rows with matching canonical sig_hash.**");
  }
  md.push("");
  md.push("## 2. Score-readiness coverage");
  md.push("");
  md.push("Per source, fraction of materialized rows carrying each signal the Phase 3 Success Scorer will read.");
  md.push("");
  md.push("| Source | Rows | role | name | success | failure | obs.verdict | latency | retrieval | text | scoreable |");
  md.push("|---|---|---|---|---|---|---|---|---|---|---|");
  const sortedBuckets = Array.from(buckets.values()).sort((a, b) => b.total - a.total);
  for (const b of sortedBuckets) {
    md.push(`| ${b.source} | ${b.total} | ${pct(b.with_model_role, b.total)} | ${pct(b.with_model_name, b.total)} | ${pct(b.with_success_markers, b.total)} | ${pct(b.with_failure_markers, b.total)} | ${pct(b.with_observer_verdict, b.total)} | ${pct(b.with_latency_ms, b.total)} | ${pct(b.with_retrieved_context, b.total)} | ${pct(b.with_text, b.total)} | **${pct(b.scoreable, b.total)}** |`);
  }
  md.push("");
  // Aggregate totals row
  const totals = sortedBuckets.reduce((acc, b) => ({
    total: acc.total + b.total,
    role: acc.role + b.with_model_role,
    name: acc.name + b.with_model_name,
    success: acc.success + b.with_success_markers,
    failure: acc.failure + b.with_failure_markers,
    obs: acc.obs + b.with_observer_verdict,
    lat: acc.lat + b.with_latency_ms,
    ret: acc.ret + b.with_retrieved_context,
    text: acc.text + b.with_text,
    score: acc.score + b.scoreable,
  }), { total: 0, role: 0, name: 0, success: 0, failure: 0, obs: 0, lat: 0, ret: 0, text: 0, score: 0 });
  md.push(`**Aggregate:** ${totals.total} rows · role ${pct(totals.role, totals.total)} · name ${pct(totals.name, totals.total)} · success ${pct(totals.success, totals.total)} · failure ${pct(totals.failure, totals.total)} · obs.verdict ${pct(totals.obs, totals.total)} · latency ${pct(totals.lat, totals.total)} · retrieval ${pct(totals.ret, totals.total)} · text ${pct(totals.text, totals.total)} · scoreable **${pct(totals.score, totals.total)}**`);
  md.push("");
  md.push("## 3. Phase 3 readiness");
  md.push("");
  if (provCheck.failed > 0) {
    md.push("**❌ NOT READY** — provenance round-trip failed. Fix materializer or transforms before Phase 3.");
  } else if (totals.score < totals.total * 0.5) {
    md.push(`**⚠️ PARTIAL READINESS** — only ${pct(totals.score, totals.total)} of records are scoreable. Phase 3 will produce many \`needs_human_review\` until transforms enrich more sources with markers.`);
  } else {
    md.push(`**✓ READY** — provenance traces, ${pct(totals.score, totals.total)} of records carry scorer signals. Phase 3 can begin.`);
  }
  md.push("");

  const out = md.join("\n");
  console.log(out);
  writeFileSync(resolve(ROOT, "data/_kb/evidence_health.md"), out);

  if (provCheck.failed > 0) process.exit(1);
}

if (import.meta.main) {
  main().catch(e => { console.error(e); process.exit(1); });
}