Runtime layer that takes a task → retrieves matching playbooks/RAG
records → builds a structured context bundle → feeds it to a LOCAL
model (qwen3.5:latest, ~7B class) → validates output → escalates only
when needed → logs the full run as new evidence. NOT model training.
Pure runtime behavior shaping via retrieval against the Phase 0-6
distillation substrate.
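The runtime flow above, in miniature — names and signatures here (replayOnce, Validator) are illustrative assumptions, not the actual replay.ts exports:

```typescript
// Sketch of the replay control flow: retrieve → local model → validate →
// escalate only on failure. Hypothetical shapes, not the real API.
type Validator = (output: string) => boolean;

function replayOnce(
  task: string,
  retrieve: (t: string) => string[],          // retrieval, never bypassed
  callLocal: (t: string, ctx: string[]) => string,
  validate: Validator,                        // deterministic gate, never an LLM
  escalate: (t: string, ctx: string[]) => string,
): { output: string; escalated: boolean } {
  const ctx = retrieve(task);
  const first = callLocal(task, ctx);         // local model attempt
  if (validate(first)) return { output: first, escalated: false };
  return { output: escalate(task, ctx), escalated: true }; // fallback model
}
```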
Files (3 new + 1 modified):
scripts/distillation/replay.ts ~370 lines
tests/distillation/replay.test.ts 10 tests, 19 expects
scripts/distillation/distill.ts +replay subcommand
reports/distillation/phase7-replay-report.md
Test metrics: 145 cumulative distillation tests pass · 0 fail · 372 expects · 618ms
Real-data A/B on 3 tasks (same qwen3.5:latest local model, with vs
without retrieval) — proves the spec claim "local model improves
with retrieval":
Task 1 "Audit phase 38 provider routing":
WITH retrieval: cited V1State, openrouter, /v1/chat, ProviderAdapter,
PRD.md line ranges — REAL Lakehouse internals
WITHOUT retrieval: invented "P99999, Z99999 placeholder codes" and
"production routing table" — pure fabrication
Task 2 "Verify pr_audit mode wired":
WITH: correct crates/gateway/src/main.rs path + lakehouse_answers_v1
WITHOUT: same conclusion, but no proof — asserts confidently without evidence
Task 3 "Audit phase 40 PRD circuit breaker drift":
WITH: anchored on the actual audit finding "no breaker class found"
WITHOUT: invented "0.0% failure rate vs 5.0% threshold" and signed
off as PASS on broken code — exact failure mode the
distillation pipeline was built to prevent
Both runs passed the structural validation gate (length, no hedges,
checklist token overlap) — the difference is grounding, supplied by
the retrieval layer pulling from exports/rag/playbooks.jsonl (446
records from earlier Phase 4 export).
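A deterministic gate like the one described (length, no hedges, checklist token overlap) can be sketched as below — the hedge list and thresholds (80 chars, 0.5 overlap) are illustrative assumptions, not the real replay.ts values:

```typescript
// Structural-only validation: catches short/hedged/off-checklist output,
// by design says nothing about semantic correctness.
const HEDGES = ["might be", "probably", "i think", "not sure"];

function structuralGate(
  output: string,
  checklist: string[],
): { passed: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (output.trim().length < 80) reasons.push("too short"); // assumed threshold
  const lower = output.toLowerCase();
  for (const h of HEDGES) if (lower.includes(h)) reasons.push(`hedge: "${h}"`);
  const tokens = new Set(lower.split(/\W+/));
  const hits = checklist.filter(c => tokens.has(c.toLowerCase()));
  if (hits.length / Math.max(checklist.length, 1) < 0.5)
    reasons.push("checklist overlap below 0.5"); // assumed threshold
  return { passed: reasons.length === 0, reasons };
}
```

Because the gate is pure string inspection, both A/B arms can pass it while only the retrieval-grounded arm is actually correct — which is exactly what the three tasks above show.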
Architecture:
Jaccard token overlap against the RAG corpus → top-K (default 8) split
into accepted exemplars (top 3) + partial-warnings (top 2) + extracted
validation_steps (lines starting verify|check|assert|ensure|confirm)
→ prompt assembly → qwen3.5:latest via /v1/chat (or OpenRouter
for namespaced/free models) → deterministic validation gate →
escalation to deepseek-v3.1:671b on fail with --allow-escalation
→ log to data/_kb/replay_runs.jsonl
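The ranking step of that pipeline, sketched — tokenizer and record shape are simplified assumptions; the real layer goes on to split the top-K into exemplars, partial-warnings, and validation_steps:

```typescript
// Jaccard token-overlap retrieval: score = |A ∩ B| / |A ∪ B| over word sets.
function tokenize(s: string): Set<string> {
  return new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
}

function jaccard(a: Set<string>, b: Set<string>): number {
  let inter = 0;
  for (const t of a) if (b.has(t)) inter++;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

function topK(task: string, corpus: { id: string; text: string }[], k = 8) {
  const q = tokenize(task);
  return corpus
    .map(r => ({ id: r.id, score: jaccard(q, tokenize(r.text)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```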
Spec invariants enforced:
- never bypass retrieval (--no-retrieval is explicit baseline, not default)
- never discard provenance (task_hash + rag_ids + full bundle logged)
- never allow free-form hallucinated output (validation gate is
deterministic code, never an LLM)
- log every run as new evidence (replay_run.v1 schema, append-only
to data/_kb/replay_runs.jsonl)
CLI:
./scripts/distill replay --task "<input>" [--local-only]
[--allow-escalation]
[--no-retrieval]
What this unlocks:
The substrate for "small-model bootstrapping" and "local inference
dominance" J flagged after Phase 5. Phase 8+ closes the loop:
schedule replay runs on common tasks, score outputs, feed accepted
ones back into corpus, measure escalation rate decreasing over time.
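The metric that loop would track is simple — fraction of runs needing the fallback model; the record shape is an assumption:

```typescript
// Escalation rate over a window of replay runs: the number to watch
// trending toward zero as the corpus absorbs accepted outputs.
function escalationRate(runs: { escalated: boolean }[]): number {
  if (runs.length === 0) return 0;
  return runs.filter(r => r.escalated).length / runs.length;
}
```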
Known limitations (documented in report):
- Validation gate is structural, not semantic (catches hedged/empty
  output but not plausible-but-wrong answers). Phase 13 wiring: run the
  auditor against every replay output.
- Retrieval is Jaccard keyword overlap. Works at the current 446-record
  corpus; scale via /vectors/search HNSW retrieval once the corpus crosses ~10k.
- Convergence claim is architectural (deterministic retrieval +
low-temp call); longitudinal empirical study is Phase 8+.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
166 lines
8.5 KiB
TypeScript
// distill.ts — single-entry CLI dispatcher for the distillation
// pipeline. Mirrors the spec's `./scripts/distill <command>` shape.
//
// USAGE
//   bun run scripts/distillation/distill.ts <command> [flags]
//
// COMMANDS
//   build-evidence     materialize EvidenceRecord rows from data/_kb/*.jsonl
//   score              run deterministic Success Scorer
//   export-rag         RAG export (--include-review opt-in)
//   export-sft         SFT export (--include-partial opt-in)
//   export-preference  preference export
//   export-all         RAG + SFT + preference (no opt-ins by default)
//   run-all            full pipeline with structured receipts (Phase 5)
//   receipts           read summary for a run (--run-id <id>)
//   acceptance         fixture-driven end-to-end gate (Phase 6)
//   replay             retrieval-driven local-model bootstrap (Phase 7)
//   health             evidence health audit
//
// All commands accept --dry-run.

import { materializeAll } from "./build_evidence_index";
import { scoreAll } from "./score_runs";
import { exportRag } from "./export_rag";
import { exportSft } from "./export_sft";
import { exportPreference } from "./export_preference";
import { runAllWithReceipts } from "./receipts";
import { replay } from "./replay";
import { TRANSFORMS } from "./transforms";
import { spawnSync } from "node:child_process";

const DEFAULT_ROOT = process.env.LH_DISTILL_ROOT ?? "/home/profit/lakehouse";

async function main() {
  const cmd = process.argv[2];
  const dry_run = process.argv.includes("--dry-run");
  const include_partial = process.argv.includes("--include-partial");
  const include_review = process.argv.includes("--include-review");
  const recorded_at = new Date().toISOString();

  switch (cmd) {
    case "build-evidence": {
      const r = await materializeAll({ root: DEFAULT_ROOT, transforms: TRANSFORMS, recorded_at, dry_run });
      console.log(`[build-evidence] in=${r.totals.rows_read} out=${r.totals.rows_written} skip=${r.totals.rows_skipped} dedup=${r.totals.rows_deduped}`);
      if (!dry_run) console.log(`[build-evidence] receipt: ${r.receipt_path}`);
      if (!r.receipt.validation_pass) process.exit(1);
      break;
    }
    case "score": {
      const r = await scoreAll({ root: DEFAULT_ROOT, recorded_at, dry_run });
      const c = r.totals.by_category;
      console.log(`[score] in=${r.totals.rows_read} out=${r.totals.rows_written} acc=${c.accepted ?? 0} part=${c.partially_accepted ?? 0} rej=${c.rejected ?? 0} hum=${c.needs_human_review ?? 0}`);
      if (!dry_run) console.log(`[score] receipt: ${r.receipt_path}`);
      break;
    }
    case "export-rag": {
      const r = await exportRag({ root: DEFAULT_ROOT, recorded_at, include_review, dry_run });
      console.log(`[export-rag] in=${r.records_read} out=${r.records_exported} ${r.quarantine_summary}`);
      console.log(`[export-rag] output: ${r.output_path}${include_review ? " (review included)" : ""}`);
      break;
    }
    case "export-sft": {
      const r = await exportSft({ root: DEFAULT_ROOT, recorded_at, include_partial, dry_run });
      console.log(`[export-sft] in=${r.records_read} out=${r.records_exported} ${r.quarantine_summary}`);
      console.log(`[export-sft] output: ${r.output_path}${include_partial ? " (partial included)" : ""}`);
      break;
    }
    case "export-preference": {
      const r = await exportPreference({ root: DEFAULT_ROOT, recorded_at, dry_run });
      console.log(`[export-preference] in=${r.records_read} pairs=${r.pairs_exported} task_ids_paired=${r.task_ids_with_pairs} ${r.quarantine_summary}`);
      console.log(`[export-preference] output: ${r.output_path}`);
      break;
    }
    case "export-all": {
      const rRag = await exportRag({ root: DEFAULT_ROOT, recorded_at, include_review, dry_run });
      const rSft = await exportSft({ root: DEFAULT_ROOT, recorded_at, include_partial, dry_run });
      const rPref = await exportPreference({ root: DEFAULT_ROOT, recorded_at, dry_run });
      console.log("");
      console.log("─── export-all summary ───");
      console.log(`  RAG:        in=${rRag.records_read} out=${rRag.records_exported} ${rRag.quarantine_summary}`);
      console.log(`  SFT:        in=${rSft.records_read} out=${rSft.records_exported} ${rSft.quarantine_summary}`);
      console.log(`  Preference: in=${rPref.records_read} pairs=${rPref.pairs_exported} ${rPref.quarantine_summary}`);
      break;
    }
    case "run-all": {
      // Phase 5 entry — full pipeline with structured receipts.
      const r = await runAllWithReceipts({ root: DEFAULT_ROOT, include_partial, include_review });
      console.log(`[run-all] run_id=${r.run_id} overall_passed=${r.summary.overall_passed}`);
      console.log(`[run-all] datasets: rag=${r.summary.rag_records} sft=${r.summary.sft_records} pref=${r.summary.preference_pairs}`);
      console.log(`[run-all] drift severity=${r.drift.severity}`);
      console.log(`[run-all] reports/distillation/${r.run_id}/summary.md`);
      if (!r.summary.overall_passed) process.exit(1);
      break;
    }
    case "replay": {
      const taskIdx = process.argv.indexOf("--task");
      if (taskIdx < 0 || !process.argv[taskIdx + 1]) {
        console.error("usage: distill.ts replay --task \"<input>\" [--local-only] [--allow-escalation] [--no-retrieval]");
        process.exit(2);
      }
      const r = await replay({
        task: process.argv[taskIdx + 1],
        local_only: process.argv.includes("--local-only"),
        allow_escalation: process.argv.includes("--allow-escalation"),
        no_retrieval: process.argv.includes("--no-retrieval"),
      }, DEFAULT_ROOT);
      console.log(`[replay] run_id=${r.recorded_run_id}`);
      console.log(`[replay] retrieval: ${r.context_bundle ? r.context_bundle.retrieved_playbooks.length + " playbooks" : "DISABLED"}`);
      console.log(`[replay] escalation_path: ${r.escalation_path.join(" → ")}`);
      console.log(`[replay] model_used: ${r.model_used} · ${r.duration_ms}ms`);
      console.log(`[replay] validation: ${r.validation_result.passed ? "PASS" : "FAIL"}${r.validation_result.reasons.length ? " (" + r.validation_result.reasons.join("; ") + ")" : ""}`);
      console.log("");
      console.log("─── response ───");
      console.log(r.model_response.slice(0, 1500));
      if (r.model_response.length > 1500) console.log(`... [${r.model_response.length - 1500} more chars]`);
      if (!r.validation_result.passed && !process.argv.includes("--allow-escalation")) process.exit(1);
      break;
    }
    case "acceptance": {
      // Phase 6 — fixture-driven end-to-end gate. Spawns the dedicated
      // acceptance script so its non-zero exit propagates.
      const r = spawnSync("bun", ["run", "scripts/distillation/acceptance.ts"], {
        cwd: DEFAULT_ROOT, stdio: "inherit",
      });
      process.exit(r.status ?? 1);
    }
    case "receipts": {
      // Read receipts for a previously-run pipeline.
      const idx = process.argv.indexOf("--run-id");
      if (idx < 0 || !process.argv[idx + 1]) {
        console.error("usage: distill.ts receipts --run-id <id>");
        process.exit(2);
      }
      const run_id = process.argv[idx + 1];
      const path = `${DEFAULT_ROOT}/reports/distillation/${run_id}/summary.md`;
      // Defer to bun's file APIs to keep this lean.
      const { readFileSync } = await import("node:fs");
      try { console.log(readFileSync(path, "utf8")); }
      catch { console.error(`run not found: ${path}`); process.exit(2); }
      break;
    }
    case "health":
    case "help":
    case undefined: {
      console.log("Usage: bun run scripts/distillation/distill.ts <command> [flags]");
      console.log("");
      console.log("Commands:");
      console.log("  build-evidence     materialize EvidenceRecord rows");
      console.log("  score              run deterministic Success Scorer");
      console.log("  export-rag         RAG export (--include-review opt-in)");
      console.log("  export-sft         SFT export (--include-partial opt-in)");
      console.log("  export-preference  preference export");
      console.log("  export-all         RAG + SFT + preference");
      console.log("  run-all            full pipeline with structured receipts (Phase 5)");
      console.log("  receipts           read summary for a run (--run-id <id>)");
      console.log("  acceptance         fixture-driven end-to-end gate (Phase 6)");
      console.log("  replay             retrieval-driven local-model bootstrap (Phase 7) — needs --task");
      console.log("");
      console.log("Flags: --dry-run, --include-partial, --include-review,");
      console.log("       --task \"<text>\", --local-only, --allow-escalation, --no-retrieval");
      break;
    }
    default:
      console.error(`unknown command: ${cmd}. Try 'help'.`);
      process.exit(2);
  }
}

main().catch(e => { console.error(e); process.exit(1); });