7 Commits

Author SHA1 Message Date
root
20a039c379 auditor: rebuild on mode runner + drop tree-split (use distillation substrate)
Some checks failed
lakehouse/auditor 13 blocking issues: cloud: claim not backed — "Invariants enforced (proven by tests + real run):"
Architectural simplification leveraging Phase 5 distillation work:
the auditor no longer pre-extracts facts via per-shard summaries
because lakehouse_answers_v1 (gold-standard prior PR audits + observer
escalations corpus) supplies cross-PR context through the mode runner's
matrix retrieval. Same signal, ~50× fewer cloud calls per audit.

Per-audit cost:
  Before: 168 gpt-oss:120b shard summaries + 3 final inference calls
  After:  3 deepseek-v3.1:671b mode-runner calls (full retrieval included)

Wall-clock on PR #11 (1.36MB diff):
  Before: ~25 minutes
  After:  88 seconds (3/3 consensus succeeded)

Files:
  auditor/checks/inference.ts
    - Default MODEL kimi-k2:1t → deepseek-v3.1:671b. kimi-k2 is hitting
      sustained Ollama Cloud 500 ISE (verified via repeated trivial
      probes; multi-hour outage). deepseek is the proven drop-in from
      Phase 5 distillation acceptance testing.
    - Dropped treeSplitDiff invocation. Diff truncates to MAX_DIFF_CHARS
      and goes straight to /v1/mode/execute task_class=pr_audit; mode
      runner pulls cross-PR context from lakehouse_answers_v1 via
      matrix retrieval. SHARD_MODEL retained for legacy callCloud
      compatibility (default qwen3-coder:480b if it ever runs).
    - extractAndPersistFacts now reads from truncated diff (no
      scratchpad post-tree-split-removal).

  auditor/checks/static.ts
    - serde-derived struct exemption (commit 107a682 shipped this; this
      commit is the rest of the auditor rebuild it landed alongside)
    - multi-line template literal awareness in isInsideQuotedString —
      tracks backtick state across lines so todo!() inside docstrings
      doesn't trip BLOCK_PATTERNS.

  crates/gateway/src/v1/mode.rs
    - pr_audit native runner mode added to VALID_MODES + is_native_mode
      + flags_for_mode + framing_text. PrAudit framing produces strict
      JSON {claim_verdicts, unflagged_gaps} for the auditor to parse.

  config/modes.toml
    - pr_audit task class with default_model=deepseek-v3.1:671b and
      matrix_corpus=lakehouse_answers_v1. Documents kimi-k2 outage
      with link to the swap rationale.

Real-data audit on PR #11 head 1b433a9 (which is the PR with all the
distillation work + auditor rebuild itself):
  - Pipeline ran to completion (88s for inference; full audit ~3 min)
  - 3/3 consensus runs succeeded on deepseek-v3.1:671b
  - 156 findings: 12 block, 23 warn, 121 info
  - Block findings are legitimate signal: 12 reviewer claims like
    "Invariants enforced (proven by tests + real run):" that the
    truncated diff can't directly verify. The auditor is correctly
    flagging claim-vs-diff divergence — exactly its job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:32:44 -05:00
root
107a68224d auditor: skip serde-derived structs in unread-field check
Fields on structs that derive Serialize or Deserialize ARE read — by
the macro, on every JSON round-trip — but the static check only
looked for explicit `.field` references in the diff. Result: every
new response/request struct shipped through `/v1/*` was flagged as
"placeholder state without a consumer."

PR #11 head 0844206 surfaced 8 such false positives across mode.rs,
respond.rs, truth.rs, and profiles/memory.rs — same shape as the
existing string-literal exemption for BLOCK_PATTERNS, just at a
different syntactic layer.

Two helpers added:
- extractNewFieldsWithLine: keeps each field's diff-line index so the
  caller can locate the parent struct.
- parentStructHasSerdeDerive: walks back ≤80 lines for a `pub struct`
  boundary, then ≤8 lines above it for `#[derive(...)]` lines
  containing Serialize or Deserialize. Stops on closing-brace-at-col-0
  to avoid escaping the enclosing scope.

Verified on PR #11's actual diff: unread-field warnings dropped from
8 → 0. Synthetic cases confirm the check still fires on plain
(non-serde) structs with no in-diff reader, so the
genuine-placeholder catch is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:49:06 -05:00
7c1745611a Audit pipeline PR #9: determinism + fact extraction + verifier gate + KB stats + context injection (PR #9)
Bundles PR #9's work for the audit pipeline:

- N=3 consensus on cloud inference (gpt-oss:120b parallel) with qwen3-coder:480b tie-breaker
- audit_discrepancies.jsonl logs N-run disagreements
- scrum_master reviews route through llm_team fact extraction; source="scrum_review"
- Verifier-gated persistence: drops INCORRECT, keeps UNVERIFIABLE/UNCHECKED; schema_version:2
- scrum_master_reviewed flag on accepted reviews
- auditor/kb_stats.ts: on-demand observability script
- claim_parser history/proof pattern class (verified-on-PR, was-flipping, the-proven-X)
- claim_parser quoted-string guard (mirrors static.ts fix)
- fact_extractor project context injection via docs/AUDITOR_CONTEXT.md
- Fixed verifier-verdict parser to handle multiple gemma2 output formats

Empirical: 3-run determinism test on unchanged PR #9 SHA showed 7/7 warn findings stable; block count oscillation eliminated; llm_team quality scores 8-9 on context-injected extract runs.

See PR #9 for full run-by-run commit history.
2026-04-23 05:29:38 +00:00
156dae6732 Auditor self-test branch: real-world pipelines + cohesion Phase C + KB index (PR #8)
Bundles 12 commits validating the auditor + scrum_master architecture end-to-end:

- enrich_prd_pipeline / hard_task_escalation / scrum_master_pipeline stress tests
- Tree-split + scrum_reviews.jsonl + kb_query surfacing
- Verdict → audit_lessons feedback loop (closed)
- kb_index aggregator with confidence-based severity policy
- 9-run + 5-run empirical tests proved the predictive-compounding property
- Level 1 correction: temp=0 cloud inference for deterministic per-claim verdicts
- audit_one.ts dry-run CLI
- Fixes: static quoted-string guard, empirical-claim classification, symbol-resolver gate, repo-file size cap

See PR #8 for run-by-run commit history.
2026-04-23 03:28:32 +00:00
profit
039ed32411 Auditor: KB query check + verdict orchestrator + Gitea poster
All checks were successful
lakehouse/auditor all checks passed (4 findings, all info)
auditor/checks/kb_query.ts (task #7) — reads data/_kb/outcomes.jsonl,
error_corrections.jsonl, data/_observer/ops.jsonl, data/_bot/cycles/*.
Cheap/offline: no model calls, tail-reads only. Fail-rate >30% in
recent scenario outcomes → warn; otherwise info. Live-proven: 1
finding emitted against current KB state (69 scenario runs, 27.7%
fail rate — below warn threshold).

auditor/audit.ts (task #8) — orchestrator. Runs static + dynamic +
inference + kb_query in parallel, calls assembleVerdict, persists
to data/_auditor/verdicts/, posts to Gitea (commit status + issue
comment). AuditOptions supports skip_dynamic/skip_inference/dry_run
for iteration.

auditor/gitea.ts — added postIssueComment (author can comment on
own PR, unlike postReview which self-review-blocks).

static.ts — skip BLOCK_PATTERNS scan on auditor/checks/* and
auditor/fixtures/* because those files legitimately contain the
patterns as regex/string-literal data. WARN/INFO patterns (TODO
comments, hardcoded placeholders) still run. Live-proven: dry-run
audit of PR #1 after fix went from 13 block findings to 0 from
static; 11 warn from inference still fire on real overreach claims.

Dry-run audit against PR #1, skip_dynamic=true:
  verdict: block (BEFORE the static fix)
  verdict: request_changes (AFTER — inference correctly flagged
           "tasks 1-9 complete" as not backed; 0 false-positive
           blocks from static self-match)
  42.5s total across checks (mostly cloud inference: 36s)
  26 claims, 39KB diff

Tasks 5 + 6 + 7 + 8 complete. Remaining: #9 (poller) + #10
(end-to-end proof) + #12 (upsert UPDATE merge fix).
2026-04-22 03:59:38 -05:00
profit
efc7b5ac44 Auditor: dynamic + inference checks
auditor/checks/dynamic.ts — wraps runHybridFixture, maps layer
results to Findings. Placeholder-style errors (404/unimplemented/
slice N) → info; other failures → warn. Always emits a summary
finding with real numbers (shipped/placeholder phase counts + per-
layer latency). Live-tested against current stack: 2 info findings,
0 warnings — all shipped layers actually work.

auditor/checks/inference.ts — wraps the run_codereview reviewer
pattern from llm_team_ui.py, adapted for claim-vs-diff verification.
Calls /v1/chat provider=ollama_cloud model=gpt-oss:120b. Requests
strict JSON response with claim_verdicts[] and unflagged_gaps[]. A
strong claim marked "not backed" by cloud → BLOCK severity; moderate
→ warn; weak → info. Cloud-unreachable or unparseable-output → info
(never blocks on the reviewer being down).

Live-tested against PR #1 (this PR, 20 claims, 39KB diff):
  - 36.9s round-trip
  - 7 block + 23 warn + 2 info findings
  - gpt-oss:120b correctly flagged "Fully-functional auditor (tasks
    1-9 complete)" as not-backed (only 6/10 tasks done at that
    commit) — accurate catch
  - Some false positives from the original 15KB truncation threshold
    (cloud missed gitea.ts, flagged "no Gitea client present")
  - Bumped MAX_DIFF_CHARS from 15000 to 40000 to fit the full PR
    diff in context; reviewer precision improves accordingly

Tasks 5 + 6 completed. Remaining: #7 (KB query), #8 (verdict +
Gitea poster), #9 (poller), #10 (end-to-end proof), #12 (upsert
UPDATE-drops-doc_refs).
2026-04-22 03:54:18 -05:00
profit
b933334ae2 Auditor: static diff check — catches own Phase 45 placeholder
auditor/checks/static.ts — grep-style scan of PR diffs, no AST,
no LLM. High-signal patterns only.

Severity grading:
- BLOCK — unimplemented!(), todo!(), panic!("not implemented"),
  throw new Error("not implemented")
- WARN  — TODO/FIXME/XXX/HACK in added lines;
          new pub struct fields with <2 mentions in the diff
          (added but nobody reads it — placeholder state)
- INFO  — hardcoded "placeholder"/"dummy"/"foobar"/"changeme"/"xxx"
          strings in added lines

Live-proven — the existential test J asked for:

  vs PR #1 (scaffold):        0 findings (all scaffold fields cross-
                              reference within the diff)
  vs commit 2a4b81b (Phase    5 WARN: every DocRef field (tool,
  45 first slice — I          version_seen, snippet_hash, source_url,
  half-admitted placeholder): seen_at) added with 0 read-sites in
                              the diff

That's the auditor flagging my own "Phase 45 first slice" commit as
state-without-consumer, which is exactly what I half-admitted it
was. If PR #1 had required auditor-pass (branch protection), the
DocRef commit would have been blocked pre-merge. The auditor works
because it agreed with the honest read.

Next: dynamic hybrid test fixture (task #4) — the never-run multi-
layer pipeline test.
2026-04-22 03:29:31 -05:00