lakehouse

Author	SHA1	Message	Date
profit	1e00eb4472	auditor: inference temp=0, think=false — kill signature creep 9-run empirical test showed 20 of 27 audit_lessons signatures were singletons (count=1) — the cloud producing slightly-different summary phrasings for the SAME underlying claim on each audit, each hashing to a fresh signature. That's the creep J flagged — not explosive, but steady ~2 new sigs per run, unbounded over hundreds of runs. Root cause: temperature=0.2 + think=true was letting variable prose leak into the classification output. Fix: temp=0 (greedy sample → identical input yields identical output on same model version), think=false (no reasoning trace variance), max_tokens 3000→1500 (tighter bound prevents tail wander). The compounding policy itself was validated by the 9 runs: - 7 recurring claims (the legitimate signals) all at conf 0.08-0.20 - ratingSeverity() correctly held them at info (below 0.3 threshold) - cross-PR signal test separately confirmed conf=1.00 → sev=block Also: LH_AUDIT_RUNS env so the test can validate with smaller N.	2026-04-22 22:09:35 -05:00
profit	9d12a814e3	auditor: kb_index aggregator + nine-consecutive empirical test Phase 1 — definition-layer over append-only JSONL scratchpads. auditor/kb_index.ts is the single shared aggregator: aggregate<T>(jsonlPath, { keyFn, scopeFn, checkFn, tailLimit }) → Map<signature, {count, distinct_scopes, confidence, first_seen, last_seen, representative_summary, ...}> ratingSeverity(agg) — confidence × count severity policy shared across all KB readers. Kills the "same unfixed PR inflates its own recurrence score" failure mode by design: confidence = distinct_scopes/count, so same-scope noise stays below the 0.3 escalation threshold no matter how many times it repeats. checkAuditLessons now routes through aggregate + ratingSeverity. Net effect: the recurrence detector's bespoke Map/Set bookkeeping is gone; same behavior, shared discipline, reusable by scrum/observer. Also: symbolsExistInRepo now skips files >500KB so the audit can't get stuck slurping a fixture. Phase 2 — nine-consecutive audit runner. tests/real-world/nine_consecutive_audits.ts pushes 9 empty commits, waits for each verdict, captures the audit_lessons aggregate state after each run, reports: - sig_count trajectory (should stabilize, not grow linearly) - max_count trajectory (same-signature repeat rate) - max_confidence trajectory (must stay LOW on same-PR noise) - verdict_stable across runs (must NOT oscillate) This is the empirical proof that the KB compounds favorably: noise doesn't escalate itself, and signal stays distinguishable. Unit-tested both failure modes: same-PR × 9 repeats = conf=0.11 (info); cross-PR × 5 distinct = conf=1.00 (block). The rating function correctly discriminates.	2026-04-22 21:49:46 -05:00
profit	f4be27a879	auditor: fix two false-positive classes from cloud inference Observed on PR #8 audit (de11ac4): 7 warn findings, all from the cloud inference check. Investigation showed two distinct bug classes that weren't "ship bad code", they were "auditor misreads the diff": 1. Cloud flagged "X not defined in this diff / missing implementation" for symbols like `tailJsonl` and `stubFinding` that ARE defined — just not in the added lines of this diff. Fix: extract candidate symbols from the cloud's gap summary, grep the repo for their definitions (function/const/let/def/class/struct/enum/trait/fn). If every named symbol resolves, drop the finding; if some do, demote to info with the resolution in evidence. 2. Cloud flagged runtime metrics like "58 cloud calls, 306s end-to-end" as unbacked claims. These are empirical outputs from running the test, not things a static diff can prove. Fix: claim_parser now has an `empirical` strength class matching iteration counts, cloud-call counts, duration metrics, attempt counts, tier-count phrases. Inference drops empirical claims from its cloud prompt (verifiable[] subset only) and claim-index mapping uses verifiable[] so cloud responses still line up. Added `claims_empirical` to audit metrics so the verdict is introspectable: how many claims WERE runtime-only vs how many are diff-verifiable? Verified: unit tests confirm empirical classification on 5 sample commit messages; symbol resolver found both false-positive symbols (tailJsonl + stubFinding) and correctly skipped a known-fake symbol.	2026-04-22 21:40:03 -05:00
profit	efc7b5ac44	Auditor: dynamic + inference checks auditor/checks/dynamic.ts — wraps runHybridFixture, maps layer results to Findings. Placeholder-style errors (404/unimplemented/slice N) → info; other failures → warn. Always emits a summary finding with real numbers (shipped/placeholder phase counts + per-layer latency). Live-tested against current stack: 2 info findings, 0 warnings — all shipped layers actually work. auditor/checks/inference.ts — wraps the run_codereview reviewer pattern from llm_team_ui.py, adapted for claim-vs-diff verification. Calls /v1/chat provider=ollama_cloud model=gpt-oss:120b. Requests strict JSON response with claim_verdicts[] and unflagged_gaps[]. A strong claim marked "not backed" by cloud → BLOCK severity; moderate → warn; weak → info. Cloud-unreachable or unparseable-output → info (never blocks on the reviewer being down). Live-tested against PR #1 (this PR, 20 claims, 39KB diff): - 36.9s round-trip - 7 block + 23 warn + 2 info findings - gpt-oss:120b correctly flagged "Fully-functional auditor (tasks 1-9 complete)" as not-backed (only 6/10 tasks done at that commit) — accurate catch - Some false positives from the original 15KB truncation threshold (cloud missed gitea.ts, flagged "no Gitea client present") - Bumped MAX_DIFF_CHARS from 15000 to 40000 to fit the full PR diff in context; reviewer precision improves accordingly Tasks 5 + 6 completed. Remaining: #7 (KB query), #8 (verdict + Gitea poster), #9 (poller), #10 (end-to-end proof), #12 (upsert UPDATE-drops-doc_refs).	2026-04-22 03:54:18 -05:00

4 Commits
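
The confidence policy described in commit 9d12a814e3 (confidence = distinct_scopes/count, with a 0.3 escalation threshold) can be illustrated with a minimal sketch. This is not the code in auditor/kb_index.ts, which is not shown in this log. The `LessonRow` shape, the in-memory `rows` input (instead of a JSONL path with keyFn/scopeFn callbacks), and the rule that a signature needs at least two distinct scopes before it can block are all assumptions made for illustration. The log does not say when `warn` is returned, so the sketch never returns it.

```typescript
// Hypothetical row shape; the real scratchpad schema is not given in the log.
interface LessonRow {
  signature: string; // hash of the finding summary
  scope: string;     // e.g. a PR identifier
  summary: string;
}

interface Aggregate {
  count: number;
  distinct_scopes: number;
  confidence: number; // distinct_scopes / count, as stated in the commit message
  representative_summary: string;
}

type Severity = "info" | "warn" | "block";

// Threshold taken from the commit message.
const ESCALATION_THRESHOLD = 0.3;

// Group rows by signature and compute the per-signature confidence.
function aggregate(rows: LessonRow[]): Map<string, Aggregate> {
  const scopes = new Map<string, Set<string>>();
  const out = new Map<string, Aggregate>();
  for (const row of rows) {
    const seen = scopes.get(row.signature) ?? new Set<string>();
    seen.add(row.scope);
    scopes.set(row.signature, seen);
    const prev = out.get(row.signature);
    const count = (prev?.count ?? 0) + 1;
    out.set(row.signature, {
      count,
      distinct_scopes: seen.size,
      confidence: seen.size / count,
      // Keep the first summary seen as the representative one (assumption).
      representative_summary: prev?.representative_summary ?? row.summary,
    });
  }
  return out;
}

// Same-scope repeats dilute confidence, so they stay at info.
function ratingSeverity(agg: Aggregate): Severity {
  if (agg.confidence < ESCALATION_THRESHOLD) return "info";
  // Assumption: a signature seen in a single scope is not yet a recurrence.
  if (agg.distinct_scopes < 2) return "info";
  return "block";
}
```

Under these assumptions, nine rows sharing one signature and one scope give confidence 1/9 (about 0.11) and rate as info, while five rows with the same signature from five different scopes give confidence 1.00 and rate as block, matching the two unit-tested cases quoted in the commit message.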