A 9-run empirical test showed 20 of 27 audit_lessons signatures were
singletons (count=1): the cloud produced slightly different summary
phrasings for the SAME underlying claim on each audit, each hashing
to a fresh signature. That's the creep J flagged: not explosive,
but a steady ~2 new sigs per run, unbounded over hundreds of runs.
Root cause: temperature=0.2 + think=true let variable prose leak
into the classification output. Fix: temperature=0 (greedy sampling,
so identical input yields identical output on the same model
version), think=false (no reasoning-trace variance), and max_tokens
3000→1500 (a tighter bound that prevents tail wander).
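The fixed sampling settings can be sketched as a config object. The option names and values follow the notes; the ClassifyOptions wrapper and constant name are hypothetical:

```typescript
// Hypothetical shape for the classification-call options; the values
// are the ones the fix settled on.
interface ClassifyOptions {
  temperature: number; // 0 = greedy decode: same input, same output on a fixed model version
  think: boolean;      // false = no reasoning-trace variance leaking into the output
  max_tokens: number;  // tighter cap discourages tail wander
}

const AUDIT_CLASSIFY_OPTS: ClassifyOptions = {
  temperature: 0,
  think: false,
  max_tokens: 1500, // was 3000
};
```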
The compounding policy itself was validated by the 9 runs:
- 7 recurring claims (the legitimate signals) all at conf 0.08-0.20
- ratingSeverity() correctly held them at info (below 0.3 threshold)
- cross-PR signal test separately confirmed conf=1.00 → sev=block
Also: an LH_AUDIT_RUNS env var so the test can validate with a smaller N.
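A minimal sketch of how the runner could honor that env var; the integer parsing and the 9-run default are assumptions:

```typescript
// Assumed: LH_AUDIT_RUNS is a plain positive integer; fall back to the
// full 9-run test when it is unset.
const AUDIT_RUNS: number = Number(process.env.LH_AUDIT_RUNS ?? "9");
if (!Number.isInteger(AUDIT_RUNS) || AUDIT_RUNS < 1) {
  throw new Error(`LH_AUDIT_RUNS must be a positive integer, got ${process.env.LH_AUDIT_RUNS}`);
}
```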
Phase 1 — definition-layer over append-only JSONL scratchpads.
auditor/kb_index.ts is the single shared aggregator:
aggregate<T>(jsonlPath, { keyFn, scopeFn, checkFn, tailLimit })
→ Map<signature, {count, distinct_scopes, confidence,
first_seen, last_seen, representative_summary, ...}>
ratingSeverity(agg) — confidence × count severity policy shared
across all KB readers. Kills the "same unfixed PR inflates its
own recurrence score" failure mode by design: confidence =
distinct_scopes/count, so same-scope noise stays below the 0.3
escalation threshold no matter how many times it repeats.
checkAuditLessons now routes through aggregate + ratingSeverity.
Net effect: the recurrence detector's bespoke Map/Set bookkeeping is
gone; same behavior, shared discipline, reusable by scrum/observer.
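The definition layer above can be sketched as follows. The confidence = distinct_scopes/count formula, the row fields, and the 0.3 threshold come from the notes; the JSONL file read is replaced with already-parsed records, the summaryFn/seenFn accessors are hypothetical, and the exact severity cut-offs in ratingSeverity are assumptions:

```typescript
type Severity = "info" | "warn" | "block";

interface AggRow {
  count: number;
  distinct_scopes: number;
  confidence: number; // distinct_scopes / count
  first_seen: string;
  last_seen: string;
  representative_summary: string;
}

// In-memory stand-in for aggregate(jsonlPath, ...): same keyFn/scopeFn
// shape, but fed parsed records instead of a JSONL path.
function aggregate<T>(
  records: T[],
  opts: {
    keyFn: (r: T) => string;     // stable signature for the claim
    scopeFn: (r: T) => string;   // e.g. PR id: the dedup dimension
    summaryFn: (r: T) => string; // hypothetical accessor for summary text
    seenFn: (r: T) => string;    // hypothetical accessor for a timestamp
  },
): Map<string, AggRow> {
  const out = new Map<string, AggRow & { scopes: Set<string> }>();
  for (const r of records) {
    const key = opts.keyFn(r);
    let row = out.get(key);
    if (!row) {
      row = {
        count: 0, distinct_scopes: 0, confidence: 0,
        first_seen: opts.seenFn(r), last_seen: opts.seenFn(r),
        representative_summary: opts.summaryFn(r),
        scopes: new Set<string>(),
      };
      out.set(key, row);
    }
    row.count += 1;
    row.scopes.add(opts.scopeFn(r));
    row.distinct_scopes = row.scopes.size;
    row.confidence = row.distinct_scopes / row.count; // same-scope repeats dilute confidence
    row.last_seen = opts.seenFn(r);
  }
  return out;
}

// Confidence × count policy: same-scope noise can never cross the 0.3
// escalation line, however often it repeats.
function ratingSeverity(row: { confidence: number; count: number }): Severity {
  if (row.confidence >= 1.0 && row.count >= 5) return "block"; // cut-offs assumed
  if (row.confidence >= 0.3 && row.count >= 2) return "warn";
  return "info";
}
```

Under this shape, nine same-PR repeats of one signature aggregate to count=9, distinct_scopes=1, confidence ≈ 0.11, which ratingSeverity holds at info.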
Also: symbolsExistInRepo now skips files >500KB so the audit can't
get stuck slurping a fixture.
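The size guard could look like this; the 500KB cap is from the notes, while the function name and the statSync-based check are assumptions:

```typescript
import { statSync } from "node:fs";

const MAX_SYMBOL_SCAN_BYTES = 500 * 1024; // files above this are skipped

// Hypothetical guard in front of the symbol scan: skip oversized or
// unreadable files instead of slurping them into memory.
function shouldScanForSymbols(path: string): boolean {
  try {
    return statSync(path).size <= MAX_SYMBOL_SCAN_BYTES;
  } catch {
    return false; // missing/unreadable: skip rather than stall the audit
  }
}
```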
Phase 2 — nine-consecutive audit runner.
tests/real-world/nine_consecutive_audits.ts pushes 9 empty commits,
waits for each verdict, captures the audit_lessons aggregate state
after each run, reports:
- sig_count trajectory (should stabilize, not grow linearly)
- max_count trajectory (same-signature repeat rate)
- max_confidence trajectory (must stay LOW on same-PR noise)
- verdict_stable across runs (must NOT oscillate)
This is the empirical proof that the KB compounds favorably:
noise doesn't escalate itself, and signal stays distinguishable.
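The pass/fail logic over the captured per-run snapshots can be sketched as follows; the snapshot shape and the exact stabilization test are assumptions (the real runner drives live audits):

```typescript
interface RunSnapshot {
  sig_count: number;      // total signatures after this run
  max_count: number;      // highest same-signature repeat count
  max_confidence: number; // highest confidence among same-PR signatures
  verdict: string;        // audit verdict for this run
}

function trajectoriesHealthy(runs: RunSnapshot[]): boolean {
  if (runs.length < 2) return false;
  const mid = runs[Math.floor(runs.length / 2)];
  const last = runs[runs.length - 1];
  // sig_count should plateau in the back half, not grow linearly
  const sigStabilized = last.sig_count - mid.sig_count <= 1;
  // same-PR noise must stay under the 0.3 escalation line in every run
  const noiseStaysLow = runs.every(r => r.max_confidence < 0.3);
  // the verdict must not oscillate across runs
  const verdictStable = runs.every(r => r.verdict === runs[0].verdict);
  return sigStabilized && noiseStaysLow && verdictStable;
}
```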
Unit-tested both failure modes: same-PR × 9 repeats = conf=0.11
(info); cross-PR × 5 distinct = conf=1.00 (block). The rating
function correctly discriminates.
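The arithmetic behind those two numbers reduces to the confidence formula alone (variable names are illustrative):

```typescript
// confidence = distinct_scopes / count, per the Phase 1 policy
const confidence = (distinctScopes: number, count: number) =>
  distinctScopes / count;

const samePrNoise = confidence(1, 9);   // one PR, nine repeats: ≈ 0.11, stays info
const crossPrSignal = confidence(5, 5); // five distinct PRs: 1.00, escalates to block
```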