9-run empirical test showed 20 of 27 audit_lessons signatures were
singletons (count=1) — the cloud producing slightly-different summary
phrasings for the SAME underlying claim on each audit, each hashing
to a fresh signature. That's the creep J flagged — not explosive,
but steady ~2 new sigs per run, unbounded over hundreds of runs.
Root cause: temperature=0.2 + think=true was letting variable prose
leak into the classification output. Fix: temp=0 (greedy sample →
identical input yields identical output on same model version),
think=false (no reasoning trace variance), max_tokens 3000→1500
(tighter bound prevents tail wander).
The compounding policy itself was validated by the 9 runs:
- 7 recurring claims (the legitimate signals) all at conf 0.08-0.20
- ratingSeverity() correctly held them at info (below 0.3 threshold)
- cross-PR signal test separately confirmed conf=1.00 → sev=block
Also: LH_AUDIT_RUNS env so the test can validate with smaller N.
lakehouse/auditor 1 blocking issue: cloud: claim not backed — "the proven escalation ladder with learning context, collects"
Phase 1 — definition-layer over append-only JSONL scratchpads.
auditor/kb_index.ts is the single shared aggregator:
aggregate<T>(jsonlPath, { keyFn, scopeFn, checkFn, tailLimit })
→ Map<signature, {count, distinct_scopes, confidence,
first_seen, last_seen, representative_summary, ...}>
ratingSeverity(agg) — confidence × count severity policy shared
across all KB readers. Kills the "same unfixed PR inflates its
own recurrence score" failure mode by design: confidence =
distinct_scopes/count, so same-scope noise stays below the 0.3
escalation threshold no matter how many times it repeats.
checkAuditLessons now routes through aggregate + ratingSeverity.
Net effect: the recurrence detector's bespoke Map/Set bookkeeping is
gone; same behavior, shared discipline, reusable by scrum/observer.
Also: symbolsExistInRepo now skips files >500KB so the audit can't
get stuck slurping a fixture.
Phase 2 — nine-consecutive audit runner.
tests/real-world/nine_consecutive_audits.ts pushes 9 empty commits,
waits for each verdict, captures the audit_lessons aggregate state
after each run, reports:
- sig_count trajectory (should stabilize, not grow linearly)
- max_count trajectory (same-signature repeat rate)
- max_confidence trajectory (must stay LOW on same-PR noise)
- verdict_stable across runs (must NOT oscillate)
This is the empirical proof that the KB compounds favorably:
noise doesn't escalate itself, and signal stays distinguishable.
Unit-tested both failure modes: same-PR × 9 repeats = conf=0.11
(info); cross-PR × 5 distinct = conf=1.00 (block). The rating
function correctly discriminates.
lakehouse/auditor 1 blocking issue: cloud: claim not backed — "the proven escalation ladder with learning context, collects"
Observed on PR #8 audit (de11ac4): 7 warn findings, all from the
cloud inference check. Investigation showed two distinct bug classes
that weren't "ship bad code", they were "auditor misreads the diff":
1. Cloud flagged "X not defined in this diff / missing implementation"
for symbols like `tailJsonl` and `stubFinding` that ARE defined —
just not in the added lines of this diff. Fix: extract candidate
symbols from the cloud's gap summary, grep the repo for their
definitions (function/const/let/def/class/struct/enum/trait/fn).
If every named symbol resolves, drop the finding; if some do,
demote to info with the resolution in evidence.
2. Cloud flagged runtime metrics like "58 cloud calls, 306s
end-to-end" as unbacked claims. These are empirical outputs
from running the test, not things a static diff can prove.
Fix: claim_parser now has an `empirical` strength class
matching iteration counts, cloud-call counts, duration metrics,
attempt counts, tier-count phrases. Inference drops empirical
claims from its cloud prompt (verifiable[] subset only) and
claim-index mapping uses verifiable[] so cloud responses still
line up.
Added `claims_empirical` to audit metrics so the verdict is
introspectable: how many claims WERE runtime-only vs how many
are diff-verifiable?
Verified: unit tests confirm empirical classification on 5
sample commit messages; symbol resolver found both false-positive
symbols (tailJsonl + stubFinding) and correctly skipped a known-
fake symbol.