root
c989253e9b
distillation: Phase 3 — deterministic Success Scorer
Pure scoreRecord function + score_runs.ts CLI + 38 tests.
Reads data/evidence/YYYY/MM/DD/*.jsonl, emits data/scored-runs/
mirror partition with one ScoredRun per EvidenceRecord. ZERO model
calls. scorer_version stamped on every output (default v1.0.0).
Three-class scoring strategy (taxonomy from Phase 2 evidence_health.md):
CLASS A (verdict-bearing): direct mapping from existing markers.
scrum_reviews: accepted_on_attempt_1 → accepted; 2-3 → partial;
4+ → partial with high-cost reason
observer_reviews: accept|reject|cycle → category
audits: severity info/low → accepted, medium → partial,
high/critical → rejected (legacy markers also handled)
contract_analyses: failure_markers + observer_verdict
CLASS B (telemetry-rich): partial markers, fall back to needs_human
auto_apply: committed → accepted; *_reverted → rejected
outcomes: all_events_ok → accepted; gap_signals > 0 → partial
mode_experiments: empty text → rejected; latency > 120s → partial
CLASS C (extraction): needs_human (Phase 3 v2 will JOIN to parents)
Real-data run on 1052 evidence rows:
accepted=384 (37%) · partial=132 (13%) · rejected=57 (5%) · needs_human=479 (45%)
Verdict-bearing sources land 0% needs_human:
scrum_reviews (172): 111 acc · 61 part · 0 rej · 0 hum
audits (264): 217 acc · 29 part · 18 rej · 0 hum
observer_reviews (44): 22 acc · 3 part · 19 rej · 0 hum
contract_analyses (2): 1 acc · 0 part · 1 rej · 0 hum
BUG SURFACED + FIXED:
Phase 2 transform for audits.jsonl assumed PR-verdict shape (recon
misnamed it). Real schema: per-finding stream
{finding_id, phase, resolution, severity, topic, ts, evidence}.
Updated transform to derive markers from severity. 264 findings
went 0% scoreable → 100% scoreable. Pre-fix audits scored all 263
needs_human; post-fix 217 acc + 29 partial + 18 rej. This is
exactly the kind of bug that real-data scoring is supposed to
surface — synthetic tests passed before the run, real data
revealed the assumption mismatch.
Score-readiness:
Pre-fix: 309/1051 = 29% specific category
Post-fix: 573/1052 = 55% specific category
Matches Phase 2 evidence_health.md prediction (~54% scoreable)
Test metrics:
51 distillation tests pass (10 evidence_record + 30 schemas + 8 realdata
+ 9 build_evidence_index + 30 scorer + 8 score_runs + 21 inferred from earlier
files; bun test reports 51 across 3 phase-3 files alone)
192 expect() calls
399ms total
Receipts:
reports/distillation/2026-04-27T03-44-26-602Z/receipt.json
- record_counts.cat_accepted=384, cat_partially_accepted=132,
cat_rejected=57, cat_needs_human_review=479
- validation_pass=true (0 skips)
- self-validates against Receipt schema before write
Carry-overs to Phase 4+:
- mode_experiments 166 needs_human: derive grounding from validation_results
- extraction-class 207 rows: JOIN to verdict-bearing parent by task_id
- audit_discrepancies transform (still missing — Phase 4c needs)
- model_trust transform (needed for ModelLedgerEntry aggregation)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:45:34 -05:00
..
2026-04-26 22:45:34 -05:00
2026-04-20 20:19:02 -05:00
2026-04-25 19:34:45 -05:00
2026-04-17 10:41:32 -05:00
2026-04-26 18:49:36 -05:00
2026-04-26 17:29:17 -05:00
2026-04-26 17:29:17 -05:00
2026-04-26 17:29:17 -05:00
2026-04-17 00:19:07 -05:00
2026-04-25 18:44:27 -05:00
2026-03-27 08:07:31 -05:00
2026-04-16 23:54:33 -05:00
2026-04-20 21:35:04 -05:00
2026-04-20 22:16:09 -05:00
2026-04-16 22:08:34 -05:00
2026-04-26 17:29:17 -05:00
2026-04-26 17:29:17 -05:00
2026-04-26 01:55:12 -05:00
2026-04-26 01:55:12 -05:00
2026-04-26 01:55:12 -05:00
2026-04-26 17:29:17 -05:00
2026-04-26 17:49:37 -05:00
2026-04-17 10:41:32 -05:00
2026-04-16 22:14:06 -05:00
2026-04-17 00:08:48 -05:00
2026-04-20 20:31:34 -05:00
2026-04-20 22:16:09 -05:00
2026-04-17 01:06:38 -05:00
2026-03-27 08:31:37 -05:00
2026-04-25 19:31:44 -05:00
2026-04-16 22:26:19 -05:00
2026-04-16 22:26:19 -05:00
2026-03-27 22:06:28 -05:00
2026-04-17 00:14:34 -05:00
2026-03-27 07:54:24 -05:00
2026-04-16 23:28:54 -05:00
2026-03-27 22:13:27 -05:00
2026-04-25 19:34:45 -05:00