11 Commits

Author SHA1 Message Date
profit
2afad0f83f auditor/inference: N=3 consensus + qwen3-coder:480b tie-breaker
Closes the determinism gap observed in the 3-run baseline test: 1 of
8 findings (the "proven escalation ladder" block) was flipping across
identical-state audits. Root cause: cloud non-determinism at temp=0
is real in practice even though it shouldn't be in theory.

Fix: run the primary reviewer (gpt-oss:120b) N=3 times in PARALLEL
(Promise.all, wall-clock ≈ single call because they're independent
HTTP requests). Aggregate votes per claim_idx. Majority wins. On a
1-1-1 split, call a tie-breaker model with different architecture:
qwen3-coder:480b — newer coding specialist, 4x params of the primary,
distinct training lineage.
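
A minimal sketch of the vote aggregation described above, assuming each run
returns one verdict per claim_idx. The type names, the verdict labels, and the
`tallyVotes` / `runConsensus` helpers are illustrative, not the auditor's
actual API; a `null` verdict marks the no-majority case that would go to the
tie-breaker model.

```typescript
// Illustrative sketch: names and verdict labels are assumptions, not the auditor's API.
interface ClaimVerdict {
  claim_idx: number;
  verdict: string; // e.g. "backed" | "not_backed" | "partial" (labels assumed)
}

interface Resolution {
  claim_idx: number;
  verdict: string | null; // null = no majority, hand off to the tie-breaker
  votes: Record<string, number>;
  unanimous: boolean;
}

// Count votes per claim_idx across all runs and pick the strict majority, if any.
function tallyVotes(runs: ClaimVerdict[][]): Resolution[] {
  const votesByClaim = new Map<number, Record<string, number>>();
  for (const run of runs) {
    for (const { claim_idx, verdict } of run) {
      const votes = votesByClaim.get(claim_idx) ?? {};
      votes[verdict] = (votes[verdict] ?? 0) + 1;
      votesByClaim.set(claim_idx, votes);
    }
  }

  const resolutions: Resolution[] = [];
  votesByClaim.forEach((votes, claim_idx) => {
    let top: string | null = null;
    let topCount = 0;
    for (const verdict of Object.keys(votes)) {
      const count = votes[verdict] ?? 0;
      if (count > topCount) {
        top = verdict;
        topCount = count;
      }
    }
    resolutions.push({
      claim_idx,
      verdict: topCount > runs.length / 2 ? top : null,
      votes,
      unanimous: topCount === runs.length,
    });
  });
  return resolutions;
}

// Fire n independent reviewer calls in parallel, then aggregate.
// `callReviewer` is a caller-supplied stand-in for the real cloud call.
async function runConsensus(
  callReviewer: () => Promise<ClaimVerdict[]>,
  n = 3,
): Promise<Resolution[]> {
  const runs = await Promise.all(Array.from({ length: n }, () => callReviewer()));
  return tallyVotes(runs);
}
```

Any resolution with `unanimous === false` is a candidate row for the
discrepancy log, whether or not a majority resolved it.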

Every case where the 3 runs disagreed (even when majority resolved)
is logged to data/_kb/audit_discrepancies.jsonl with the vote counts
and resolution type. This is how we measure consensus drift over
time: the dashboard metric is literally `wc -l` on
audit_discrepancies.jsonl relative to audit count.

Verified: 2 back-to-back audits on unchanged PR #8 produced
identical 8 findings each (1 block + 7 warn). consensus=3/3 on every
claim, zero discrepancies logged. Cost: roughly 3x primary tokens (~7K
per audit vs ~2K), wall-clock ~unchanged because calls are parallel.

New env vars:
  LH_AUDITOR_CONSENSUS_N        default 3
  LH_AUDITOR_TIEBREAKER_MODEL   default qwen3-coder:480b

Factored the cloud call into runCloudInference() helper so the
consensus loop is clean and the tie-breaker reuses the same prompt
shape as the primary.
2026-04-22 23:38:17 -05:00
profit
77650c4ba3 auditor: inference curation layer + llm_team fact extraction → KB
Closes the cycle J asked for: curated cloud output lands structured
knowledge in the KB so future audits have architectural context, not
just a log of per-finding signatures.

Three pieces:

1. Inference curation (tree-split) — when diff > 30KB, shard at 4.5KB,
   summarize each shard via cloud (temp=0, think=false on small
   shards; think=true on main call). Merge into scratchpad. The cloud
   verification then runs against the scratchpad, not truncated raw.
   Eliminates the 40KB MAX_DIFF_CHARS truncation path for large PRs
   (PR #8 is 102KB — was losing 62KB). Anti-false-positive guard in
   the prompt: cloud is told scratchpad absence is NOT diff absence,
   so it doesn't flag curated-out symbols as missing. unflagged_gaps
   section is dropped entirely when curated (scratchpad can't ground
   them).

2. fact_extractor — TS client for llm_team_ui's extract-facts mode at
   localhost:5000/api/run. Sends curated scratchpad through qwen2.5
   extractor + gemma2 verifier, parses SSE stream, returns structured
   {facts, entities, relationships, verification, llm_team_run_id}.
   Best-effort: if llm_team is down, extraction fails silently and
   the audit still completes. AWAITED so CLI tools (audit_one.ts)
   don't exit before extraction lands — the systemd poller has 90s
   headroom so the extra ~15s doesn't matter.

3. audit_facts.jsonl + checkAuditFacts() — one row per curated audit
   with the extraction result. kb_query tails the jsonl, explodes
   entity rows, aggregates by entity name with distinct-PR counting,
   surfaces entities recurring in 2+ PRs as info findings. Filters
   out short names (<3 chars, extractor truncation artifacts) and
   generic types (string/number/etc.) so signal isn't drowned.
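
The entity aggregation in piece 3 could look roughly like the sketch below. It
assumes each exploded entity row carries a name and the PR it came from; the
`EntityRow` shape, the `recurringEntities` name, and the exact generic-type
list are hypothetical (the commit only says "string/number/etc.").

```typescript
// Hypothetical row shape: one row per extracted entity, tagged with its PR.
interface EntityRow {
  name: string;
  pr_number: number;
}

// Assumed filter list; the source only names "string/number/etc."
const GENERIC_TYPES = new Set(["string", "number", "boolean", "object", "any"]);

// Returns entity name -> distinct-PR count, keeping entities seen in minPrs+ PRs.
function recurringEntities(rows: EntityRow[], minPrs = 2): Map<string, number> {
  const prsByEntity = new Map<string, Set<number>>();
  for (const { name, pr_number } of rows) {
    // Drop truncation artifacts (<3 chars) and generic type names.
    if (name.length < 3 || GENERIC_TYPES.has(name.toLowerCase())) continue;
    const prs = prsByEntity.get(name) ?? new Set<number>();
    prs.add(pr_number);
    prsByEntity.set(name, prs);
  }

  const recurring = new Map<string, number>();
  prsByEntity.forEach((prs, name) => {
    if (prs.size >= minPrs) recurring.set(name, prs.size);
  });
  return recurring;
}
```

Counting a Set of PR numbers rather than rows means a re-audited PR cannot
push its own entities over the 2-PR bar.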

Verified end-to-end on PR #8: 102KB diff → 23 shards → 1KB scratchpad
→ qwen2.5 extracted 4 facts + 6 entities + 6 relationships (real
code-level knowledge: AggregateOptions<T> type, aggregate<T> async
function with real signature, typed relationships). llm_team_run_id
cross-references to llm_team's own team_runs table.
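
The shard count above is consistent with plain fixed-size chunking, sketched
here. This assumes "30KB" and "4.5KB" mean character counts and that shards
are cut at fixed offsets; the real splitter may cut on hunk or line
boundaries instead, and `shardDiff` is an illustrative name.

```typescript
// Assumed constants: the commit gives 30KB and 4.5KB; units as chars are a guess.
const SHARD_THRESHOLD_CHARS = 30_000;
const SHARD_SIZE_CHARS = 4_500;

// Returns the diff whole when it is small, otherwise fixed-size shards in order.
function shardDiff(
  diff: string,
  threshold = SHARD_THRESHOLD_CHARS,
  size = SHARD_SIZE_CHARS,
): string[] {
  if (diff.length <= threshold) return [diff];
  const shards: string[] = [];
  for (let start = 0; start < diff.length; start += size) {
    shards.push(diff.slice(start, start + size));
  }
  return shards;
}
```

Under these assumptions a 102,000-char diff yields ceil(102000 / 4500) = 23
shards, matching the verified run.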

Also: audit.ts passes (pr_number, head_sha) as InferenceContext so
extracted facts are scope-tagged for the KB index.
2026-04-22 23:09:14 -05:00
profit
47f1ca73e7 auditor: Level 1 correction — keep think=true, only temp=0 is needed
Some checks failed
lakehouse/auditor 4 warnings — see review
The previous Level 1 commit set think=false which broke the cloud
inference check on real PR audits. gpt-oss:120b is a reasoning model;
at think=false on large prompts (40KB diff + 14 claims) it returned
empty content — verified by inspecting verdict 8-8e4ebbe4b38a which
showed "cloud returned unparseable output — skipped" with 13421
tokens used and head:<empty>.

Small-prompt tests passed because the model could respond without
needing to think. Real audits with the full diff + claims context
require the reasoning channel to produce any output at all.

The determinism we need comes from temp=0 (greedy sampling). The
reasoning trace at think=true varies in prose but greedy sampling
converges to the same FINAL classification from identical starting
state, so signatures remain stable.

max_tokens restored to 3000 for the think trace + response.
2026-04-22 22:24:25 -05:00
profit
1e00eb4472 auditor: inference temp=0, think=false — kill signature creep
9-run empirical test showed 20 of 27 audit_lessons signatures were
singletons (count=1) — the cloud producing slightly-different summary
phrasings for the SAME underlying claim on each audit, each hashing
to a fresh signature. That's the creep J flagged — not explosive,
but steady ~2 new sigs per run, unbounded over hundreds of runs.

Root cause: temperature=0.2 + think=true was letting variable prose
leak into the classification output. Fix: temp=0 (greedy sample →
identical input yields identical output on same model version),
think=false (no reasoning trace variance), max_tokens 3000→1500
(tighter bound prevents tail wander).

The compounding policy itself was validated by the 9 runs:
  - 7 recurring claims (the legitimate signals) all at conf 0.08-0.20
  - ratingSeverity() correctly held them at info (below 0.3 threshold)
  - cross-PR signal test separately confirmed conf=1.00 → sev=block

Also: LH_AUDIT_RUNS env so the test can validate with smaller N.
2026-04-22 22:09:35 -05:00
profit
9d12a814e3 auditor: kb_index aggregator + nine-consecutive empirical test
Some checks failed
lakehouse/auditor 1 blocking issue: cloud: claim not backed — "the proven escalation ladder with learning context, collects"
Phase 1 — definition-layer over append-only JSONL scratchpads.

auditor/kb_index.ts is the single shared aggregator:

  aggregate<T>(jsonlPath, { keyFn, scopeFn, checkFn, tailLimit })
      → Map<signature, {count, distinct_scopes, confidence,
                        first_seen, last_seen, representative_summary, ...}>

  ratingSeverity(agg) — confidence × count severity policy shared
    across all KB readers. Kills the "same unfixed PR inflates its
    own recurrence score" failure mode by design: confidence =
    distinct_scopes/count, so same-scope noise stays below the 0.3
    escalation threshold no matter how many times it repeats.
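
A worked sketch of the confidence ratio and the 0.3 gate. The ratio
(distinct_scopes/count) and the threshold come from the text above; the
ramp applied once the gate is passed is an assumption, borrowed from the
distinct-PR ramp in commit 0306dd88c1 (3-4 = warn, 5+ = block), and
`ratingSeveritySketch` is not the real ratingSeverity().

```typescript
type Severity = "info" | "warn" | "block";

interface AggregateRow {
  count: number; // total rows seen for this signature
  distinct_scopes: number; // distinct PRs (scopes) that produced it
}

const ESCALATION_THRESHOLD = 0.3;

// confidence = distinct_scopes / count, as stated in the commit message.
function confidenceOf(agg: AggregateRow): number {
  return agg.count === 0 ? 0 : agg.distinct_scopes / agg.count;
}

// Assumed policy: below the gate everything stays info; above it, severity
// ramps on distinct scopes (the ramp itself is a guess, see lead-in).
function ratingSeveritySketch(agg: AggregateRow): Severity {
  if (confidenceOf(agg) < ESCALATION_THRESHOLD) return "info";
  if (agg.distinct_scopes >= 5) return "block";
  if (agg.distinct_scopes >= 3) return "warn";
  return "info";
}
```

One PR re-audited 9 times gives 1/9 ≈ 0.11, below the gate; five distinct PRs
hitting the same signature once each gives 5/5 = 1.00.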

checkAuditLessons now routes through aggregate + ratingSeverity.
Net effect: the recurrence detector's bespoke Map/Set bookkeeping is
gone; same behavior, shared discipline, reusable by scrum/observer.

Also: symbolsExistInRepo now skips files >500KB so the audit can't
get stuck slurping a fixture.

Phase 2 — nine-consecutive audit runner.

tests/real-world/nine_consecutive_audits.ts pushes 9 empty commits,
waits for each verdict, captures the audit_lessons aggregate state
after each run, reports:

  - sig_count trajectory (should stabilize, not grow linearly)
  - max_count trajectory (same-signature repeat rate)
  - max_confidence trajectory (must stay LOW on same-PR noise)
  - verdict_stable across runs (must NOT oscillate)

This is the empirical proof that the KB compounds favorably:
noise doesn't escalate itself, and signal stays distinguishable.

Unit-tested both failure modes: same-PR × 9 repeats = conf=0.11
(info); cross-PR × 5 distinct = conf=1.00 (block). The rating
function correctly discriminates.
2026-04-22 21:49:46 -05:00
profit
f4be27a879 auditor: fix two false-positive classes from cloud inference
Some checks failed
lakehouse/auditor 1 blocking issue: cloud: claim not backed — "the proven escalation ladder with learning context, collects"
Observed on PR #8 audit (de11ac4): 7 warn findings, all from the
cloud inference check. Investigation showed two distinct bug classes
that weren't "ship bad code", they were "auditor misreads the diff":

1. Cloud flagged "X not defined in this diff / missing implementation"
   for symbols like `tailJsonl` and `stubFinding` that ARE defined —
   just not in the added lines of this diff. Fix: extract candidate
   symbols from the cloud's gap summary, grep the repo for their
   definitions (function/const/let/def/class/struct/enum/trait/fn).
   If every named symbol resolves, drop the finding; if some do,
   demote to info with the resolution in evidence.

2. Cloud flagged runtime metrics like "58 cloud calls, 306s
   end-to-end" as unbacked claims. These are empirical outputs
   from running the test, not things a static diff can prove.
   Fix: claim_parser now has an `empirical` strength class
   matching iteration counts, cloud-call counts, duration metrics,
   attempt counts, tier-count phrases. Inference drops empirical
   claims from its cloud prompt (verifiable[] subset only) and
   claim-index mapping uses verifiable[] so cloud responses still
   line up.
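
The symbol-resolution rule in item 1 can be sketched as follows. The keyword
list is the one named above; everything else is an assumption: the function
names, the idea of passing repo contents in as strings (the real check greps
the repo), and keeping the finding when no symbol resolves.

```typescript
type GapOutcome = "drop" | "demote_to_info" | "keep";

// Build a definition matcher for one symbol, escaping regex metacharacters.
function definitionPattern(symbol: string): RegExp {
  const escaped = symbol.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(
    `\\b(?:function|const|let|def|class|struct|enum|trait|fn)\\s+${escaped}\\b`,
  );
}

// repoTexts stands in for file contents found by the repo grep.
function resolveGap(symbols: string[], repoTexts: string[]): GapOutcome {
  if (symbols.length === 0) return "keep"; // nothing to check, leave the finding
  const resolved = symbols.filter((symbol) => {
    const pattern = definitionPattern(symbol);
    return repoTexts.some((text) => pattern.test(text));
  });
  if (resolved.length === symbols.length) return "drop";
  return resolved.length > 0 ? "demote_to_info" : "keep";
}
```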

Added `claims_empirical` to audit metrics so the verdict is
introspectable: how many claims WERE runtime-only vs how many
are diff-verifiable?

Verified: unit tests confirm empirical classification on 5
sample commit messages; symbol resolver found both false-positive
symbols (tailJsonl + stubFinding) and correctly skipped a known-
fake symbol.
2026-04-22 21:40:03 -05:00
profit
0306dd88c1 auditor: close the verdict→playbook loop + fix rubric-string false positive
Some checks failed
lakehouse/auditor 2 blocking issues: unimplemented!() macro call in tests/real-world/hard_task_escalation.ts
Two changes that fell out of running the auto-loop for real on PR #8:

1. The systemd auditor blocked PR #8 on 'unimplemented!()' / 'todo!()'
   in tests/real-world/hard_task_escalation.ts — but those strings are
   the rubric itself, not macro calls. Added isInsideQuotedString()
   detection in static.ts: BLOCK_PATTERNS now skip matches that fall
   inside double-quoted / single-quoted / backtick string literals on
   the added line. WARN/INFO patterns still run — a TODO comment in
   a string is still a valid signal.

2. Verdicts were being persisted to disk but never fed back as
   learning signal. Added appendAuditLessons() — every block/warn
   finding writes a JSONL row to data/_kb/audit_lessons.jsonl with a
   path-agnostic signature (strips file paths, line numbers, commit
   hashes) so the SAME class of finding on DIFFERENT files dedups to
   one signature.

   kb_query now tails audit_lessons.jsonl and emits recurrence
   findings: 2 distinct PRs hit a signature = info, 3-4 = warn, 5+ =
   block. Severity ramps on distinct-PR count, not total rows, so a
   single unfixed PR being re-audited doesn't inflate its own
   recurrence score.
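
The recurrence ramp in item 2 reduces to counting distinct PRs per signature.
A minimal sketch, assuming each audit_lessons row exposes a signature and a
PR number (the `LessonRow` field names and `recurrenceSeverity` are
hypothetical):

```typescript
type Recurrence = "none" | "info" | "warn" | "block";

// Assumed row shape for audit_lessons.jsonl entries.
interface LessonRow {
  signature: string;
  pr_number: number;
}

// Severity ramps on distinct PRs, not row count: 2 = info, 3-4 = warn, 5+ = block.
function recurrenceSeverity(rows: LessonRow[], signature: string): Recurrence {
  const prs = new Set<number>();
  for (const row of rows) {
    if (row.signature === signature) prs.add(row.pr_number);
  }
  if (prs.size >= 5) return "block";
  if (prs.size >= 3) return "warn";
  if (prs.size >= 2) return "info";
  return "none";
}
```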

appendAuditLessons() runs post-verdict as fire-and-forget (a failed
disk write can't break the audit). The learning loop is now closed:
each audit contributes to the KB that guides the next audit.

Tested: unit tests for normalizedSignature confirmed path-agnostic
dedup; static.ts regression tests confirmed rubric strings no longer
trip BLOCK while real unquoted unimplemented!() still does.
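
One way the quoted-string check could work is a single left-to-right scan
that tracks the currently open quote character up to the match offset. This
is a sketch under that assumption, not the code in static.ts; it is
single-line only and would misread an apostrophe in a comment as an opening
quote.

```typescript
// Returns true when matchIndex falls inside a ", ' or ` literal on this line.
function isInsideQuotedStringSketch(line: string, matchIndex: number): boolean {
  let open: string | null = null;
  for (let i = 0; i < matchIndex && i < line.length; i++) {
    const ch = line[i];
    if (ch === "\\") {
      i++; // skip the escaped character, e.g. \" inside a string
      continue;
    }
    if (open === null) {
      if (ch === '"' || ch === "'" || ch === "`") open = ch;
    } else if (ch === open) {
      open = null; // matching close quote
    }
  }
  return open !== null;
}
```

A BLOCK pattern hit would then be skipped only when this returns true for the
match's start offset on the added line.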
2026-04-22 21:31:35 -05:00
profit
dc01ba0a3b auditor: kb_query surfaces scrum-master reviews for files in PR diff
Some checks failed
lakehouse/auditor 2 blocking issues: unimplemented!() macro call in tests/real-world/hard_task_escalation.ts
Wires the cohesion-plan Phase C link: the scrum-master pipeline writes
per-file reviews to data/_kb/scrum_reviews.jsonl on accept; the
auditor now reads that same file and emits one kb_query finding per
scrum review whose `file` matches a path in the PR's diff.

Severity heuristic: attempt 1-3 → info, attempt 4+ → warn. Reaching
the cloud specialist (attempt 4+) means the ladder had to escalate,
which is meaningful signal reviewers should see. Whether tree-split
fired is also surfaced in the finding summary.

audit.ts now passes pr.files.map(f => f.path) into runKbCheck (the
old signature dropped it on the floor). Also adds auditor/audit_one.ts
— a dry-run CLI for auditing a single PR without posting to Gitea,
useful for verifying check behavior without spamming review comments.

Verified: after writing scrum_reviews for auditor/audit.ts and
mcp-server/observer.ts (both in PR #7), audit_one 7 surfaced both as
info findings with preview + accepted_model + tree_split flag. A
scrum review for playbook_memory.rs (NOT in PR #7) was correctly
filtered out.
2026-04-22 21:18:21 -05:00
profit
039ed32411 Auditor: KB query check + verdict orchestrator + Gitea poster
All checks were successful
lakehouse/auditor all checks passed (4 findings, all info)
auditor/checks/kb_query.ts (task #7) — reads data/_kb/outcomes.jsonl,
error_corrections.jsonl, data/_observer/ops.jsonl, data/_bot/cycles/*.
Cheap/offline: no model calls, tail-reads only. Fail-rate >30% in
recent scenario outcomes → warn; otherwise info. Live-proven: 1
finding emitted against current KB state (69 scenario runs, 27.7%
fail rate — below warn threshold).

auditor/audit.ts (task #8) — orchestrator. Runs static + dynamic +
inference + kb_query in parallel, calls assembleVerdict, persists
to data/_auditor/verdicts/, posts to Gitea (commit status + issue
comment). AuditOptions supports skip_dynamic/skip_inference/dry_run
for iteration.

auditor/gitea.ts — added postIssueComment (author can comment on
own PR, unlike postReview which self-review-blocks).

static.ts — skip BLOCK_PATTERNS scan on auditor/checks/* and
auditor/fixtures/* because those files legitimately contain the
patterns as regex/string-literal data. WARN/INFO patterns (TODO
comments, hardcoded placeholders) still run. Live-proven: dry-run
audit of PR #1 after fix went from 13 block findings to 0 from
static; 11 warn from inference still fire on real overreach claims.

Dry-run audit against PR #1, skip_dynamic=true:
  verdict: block (BEFORE the static fix)
  verdict: request_changes (AFTER — inference correctly flagged
           "tasks 1-9 complete" as not backed; 0 false-positive
           blocks from static self-match)
  42.5s total across checks (mostly cloud inference: 36s)
  26 claims, 39KB diff

Tasks 5 + 6 + 7 + 8 complete. Remaining: #9 (poller) + #10
(end-to-end proof) + #12 (upsert UPDATE merge fix).
2026-04-22 03:59:38 -05:00
profit
efc7b5ac44 Auditor: dynamic + inference checks
auditor/checks/dynamic.ts — wraps runHybridFixture, maps layer
results to Findings. Placeholder-style errors (404/unimplemented/
slice N) → info; other failures → warn. Always emits a summary
finding with real numbers (shipped/placeholder phase counts + per-
layer latency). Live-tested against current stack: 2 info findings,
0 warnings — all shipped layers actually work.

auditor/checks/inference.ts — wraps the run_codereview reviewer
pattern from llm_team_ui.py, adapted for claim-vs-diff verification.
Calls /v1/chat provider=ollama_cloud model=gpt-oss:120b. Requests
strict JSON response with claim_verdicts[] and unflagged_gaps[]. A
strong claim marked "not backed" by cloud → BLOCK severity; moderate
→ warn; weak → info. Cloud-unreachable or unparseable-output → info
(never blocks on the reviewer being down).

Live-tested against PR #1 (this PR, 20 claims, 39KB diff):
  - 36.9s round-trip
  - 7 block + 23 warn + 2 info findings
  - gpt-oss:120b correctly flagged "Fully-functional auditor (tasks
    1-9 complete)" as not-backed (only 6/10 tasks done at that
    commit) — accurate catch
  - Some false positives from the original 15KB truncation threshold
    (cloud missed gitea.ts, flagged "no Gitea client present")
  - Bumped MAX_DIFF_CHARS from 15000 to 40000 to fit the full PR
    diff in context; reviewer precision improves accordingly

Tasks 5 + 6 completed. Remaining: #7 (KB query), #8 (verdict +
Gitea poster), #9 (poller), #10 (end-to-end proof), #12 (upsert
UPDATE-drops-doc_refs).
2026-04-22 03:54:18 -05:00
profit
b933334ae2 Auditor: static diff check — catches own Phase 45 placeholder
auditor/checks/static.ts — grep-style scan of PR diffs, no AST,
no LLM. High-signal patterns only.

Severity grading:
- BLOCK — unimplemented!(), todo!(), panic!("not implemented"),
  throw new Error("not implemented")
- WARN  — TODO/FIXME/XXX/HACK in added lines;
          new pub struct fields with <2 mentions in the diff
          (added but nobody reads it — placeholder state)
- INFO  — hardcoded "placeholder"/"dummy"/"foobar"/"changeme"/"xxx"
          strings in added lines
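
A rough sketch of the grading above as a first-match pattern table over the
added lines of a unified diff. The regexes are approximations written for
this example, the `scanAddedLines` name is hypothetical, and the
struct-field read-site rule is left out.

```typescript
type Severity = "block" | "warn" | "info";

interface Finding {
  severity: Severity;
  line: string;
}

// Ordered most severe first; the first matching pattern wins for a line.
const PATTERNS: { severity: Severity; re: RegExp }[] = [
  { severity: "block", re: /\b(?:unimplemented|todo)!\(\)/ },
  { severity: "block", re: /panic!\("not implemented"\)/ },
  { severity: "block", re: /throw new Error\("not implemented"\)/ },
  { severity: "warn", re: /\b(?:TODO|FIXME|XXX|HACK)\b/ },
  { severity: "info", re: /["'](?:placeholder|dummy|foobar|changeme|xxx)["']/ },
];

function scanAddedLines(diff: string): Finding[] {
  const findings: Finding[] = [];
  for (const raw of diff.split("\n")) {
    // Only added lines count; "+++" is the file header, not content.
    if (!raw.startsWith("+") || raw.startsWith("+++")) continue;
    const line = raw.slice(1);
    const hit = PATTERNS.find((p) => p.re.test(line));
    if (hit) findings.push({ severity: hit.severity, line: line.trim() });
  }
  return findings;
}
```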

Live-proven — the existential test J asked for:

  vs PR #1 (scaffold):
      0 findings (all scaffold fields cross-reference within the diff)

  vs commit 2a4b81b (Phase 45 first slice, I half-admitted placeholder):
      5 WARN: every DocRef field (tool, version_seen, snippet_hash,
      source_url, seen_at) added with 0 read-sites in the diff

That's the auditor flagging my own "Phase 45 first slice" commit as
state-without-consumer, which is exactly what I half-admitted it
was. If PR #1 had required auditor-pass (branch protection), the
DocRef commit would have been blocked pre-merge. The auditor works
because it agreed with the honest read.

Next: dynamic hybrid test fixture (task #4) — the never-run multi-
layer pipeline test.
2026-04-22 03:29:31 -05:00