2 Commits

Author SHA1 Message Date
profit
1e00eb4472 auditor: inference temp=0, think=false — kill signature creep
9-run empirical test showed 20 of 27 audit_lessons signatures were
singletons (count=1) — the cloud producing slightly-different summary
phrasings for the SAME underlying claim on each audit, each hashing
to a fresh signature. That's the creep J flagged — not explosive,
but steady ~2 new sigs per run, unbounded over hundreds of runs.

Root cause: temperature=0.2 + think=true was letting variable prose
leak into the classification output. Fix: temp=0 (greedy sample →
identical input yields identical output on same model version),
think=false (no reasoning trace variance), max_tokens 3000→1500
(tighter bound prevents tail wander).

The compounding policy itself was validated by the 9 runs:
  - 7 recurring claims (the legitimate signals) all at conf 0.08-0.20
  - ratingSeverity() correctly held them at info (below 0.3 threshold)
  - cross-PR signal test separately confirmed conf=1.00 → sev=block

Also: LH_AUDIT_RUNS env so the test can validate with smaller N.
2026-04-22 22:09:35 -05:00
profit
9d12a814e3 auditor: kb_index aggregator + nine-consecutive empirical test
Some checks failed
lakehouse/auditor 1 blocking issue: cloud: claim not backed — "the proven escalation ladder with learning context, collects"
Phase 1 — definition-layer over append-only JSONL scratchpads.

auditor/kb_index.ts is the single shared aggregator:

  aggregate<T>(jsonlPath, { keyFn, scopeFn, checkFn, tailLimit })
      → Map<signature, {count, distinct_scopes, confidence,
                        first_seen, last_seen, representative_summary, ...}>

  ratingSeverity(agg) — confidence × count severity policy shared
    across all KB readers. Kills the "same unfixed PR inflates its
    own recurrence score" failure mode by design: confidence =
    distinct_scopes/count, so same-scope noise stays below the 0.3
    escalation threshold no matter how many times it repeats.

checkAuditLessons now routes through aggregate + ratingSeverity.
Net effect: the recurrence detector's bespoke Map/Set bookkeeping is
gone; same behavior, shared discipline, reusable by scrum/observer.

Also: symbolsExistInRepo now skips files >500KB so the audit can't
get stuck slurping a fixture.

Phase 2 — nine-consecutive audit runner.

tests/real-world/nine_consecutive_audits.ts pushes 9 empty commits,
waits for each verdict, captures the audit_lessons aggregate state
after each run, reports:

  - sig_count trajectory (should stabilize, not grow linearly)
  - max_count trajectory (same-signature repeat rate)
  - max_confidence trajectory (must stay LOW on same-PR noise)
  - verdict_stable across runs (must NOT oscillate)

This is the empirical proof that the KB compounds favorably:
noise doesn't escalate itself, and signal stays distinguishable.

Unit-tested both failure modes: same-PR × 9 repeats = conf=0.11
(info); cross-PR × 5 distinct = conf=1.00 (block). The rating
function correctly discriminates.
2026-04-22 21:49:46 -05:00