5 Commits

`77650c4ba3` auditor: inference curation layer + llm_team fact extraction → KB
Closes the cycle J asked for: curated cloud output lands structured knowledge in the KB, so future audits have architectural context, not just a log of per-finding signatures. Three pieces:

1. Inference curation (tree-split) — when the diff exceeds 30KB, shard it at 4.5KB, summarize each shard via cloud (temp=0, think=false on small shards; think=true on the main call), and merge the summaries into a scratchpad. Cloud verification then runs against the scratchpad, not truncated raw diff. This eliminates the 40KB MAX_DIFF_CHARS truncation path for large PRs (PR #8 is 102KB — 62KB was being lost). An anti-false-positive guard in the prompt tells the cloud that absence from the scratchpad is NOT absence from the diff, so it doesn't flag curated-out symbols as missing. The unflagged_gaps section is dropped entirely when curated (the scratchpad can't ground them).

2. fact_extractor — a TS client for llm_team_ui's extract-facts mode at localhost:5000/api/run. It sends the curated scratchpad through the qwen2.5 extractor + gemma2 verifier, parses the SSE stream, and returns a structured {facts, entities, relationships, verification, llm_team_run_id}. Best-effort: if llm_team is down, extraction fails silently and the audit still completes. It is awaited so CLI tools (audit_one.ts) don't exit before extraction lands — the systemd poller has 90s of headroom, so the extra ~15s doesn't matter.

3. audit_facts.jsonl + checkAuditFacts() — one row per curated audit with the extraction result. kb_query tails the jsonl, explodes entity rows, aggregates by entity name with distinct-PR counting, and surfaces entities recurring in 2+ PRs as info findings. It filters out short names (<3 chars, extractor truncation artifacts) and generic types (string/number/etc.) so signal isn't drowned.

Verified end-to-end on PR #8: 102KB diff → 23 shards → 1KB scratchpad → qwen2.5 extracted 4 facts + 6 entities + 6 relationships (real code-level knowledge: the AggregateOptions<T> type, the aggregate<T> async function with its real signature, typed relationships). llm_team_run_id cross-references llm_team's own team_runs table.
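A minimal sketch of the tree-split curation path described above, assuming the 30KB trigger and 4.5KB shard size from the commit message. The names (`shardDiff`, `buildScratchpad`, `CURATE_THRESHOLD`) and the injected `summarize` callback are illustrative, not the auditor's actual identifiers; the cloud call is stubbed out entirely:

```typescript
const CURATE_THRESHOLD = 30_000; // diff size (chars) before curation kicks in — assumption
const SHARD_SIZE = 4_500;        // per-shard budget sent to the cloud — assumption

// Split a large diff into fixed-size shards for per-shard summarization.
function shardDiff(diff: string, shardSize: number = SHARD_SIZE): string[] {
  const shards: string[] = [];
  for (let i = 0; i < diff.length; i += shardSize) {
    shards.push(diff.slice(i, i + shardSize));
  }
  return shards;
}

// `summarize` stands in for the real cloud call (temp=0, think=false on shards).
// Returns null for small diffs, which skip curation and verify against raw diff.
async function buildScratchpad(
  diff: string,
  summarize: (shard: string) => Promise<string>,
): Promise<string | null> {
  if (diff.length <= CURATE_THRESHOLD) return null;
  const summaries = await Promise.all(shardDiff(diff).map(summarize));
  return summaries.join("\n"); // merged scratchpad replaces the truncated raw diff
}
```

With these numbers a 102KB diff yields 23 shards, matching the PR #8 run above.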
Also: audit.ts passes (pr_number, head_sha) as InferenceContext so extracted facts are scope-tagged for the KB index.
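The distinct-PR entity aggregation described for checkAuditFacts() reduces to a small pure function. The row shape and names below are assumptions based on the commit message, not the real audit_facts.jsonl schema:

```typescript
// One jsonl row per curated audit (assumed shape).
interface FactRow {
  pr_number: number;
  entities: { name: string }[];
}

// Generic type names that would drown the signal — illustrative list.
const GENERIC_TYPES = new Set(["string", "number", "boolean", "object", "any"]);

// Aggregate entities by name, counting DISTINCT PRs (not rows), and surface
// those recurring in `minPrs`+ PRs as candidate info findings.
function recurringEntities(
  rows: FactRow[],
  minPrs = 2,
): { name: string; prs: number }[] {
  const byName = new Map<string, Set<number>>();
  for (const row of rows) {
    for (const e of row.entities) {
      if (e.name.length < 3) continue;         // extractor truncation artifacts
      if (GENERIC_TYPES.has(e.name)) continue; // generic types filtered out
      const prs = byName.get(e.name) ?? new Set<number>();
      prs.add(row.pr_number);
      byName.set(e.name, prs);
    }
  }
  return [...byName.entries()]
    .filter(([, prs]) => prs.size >= minPrs)
    .map(([name, prs]) => ({ name, prs: prs.size }));
}
```

Counting distinct PRs rather than rows is what keeps a re-audited PR from inflating its own entities.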

`f4be27a879` auditor: fix two false-positive classes from cloud inference
CI: some checks failed (lakehouse/auditor, 1 blocking issue): cloud: claim not backed — "the proven escalation ladder with learning context, collects"
Observed on the PR #8 audit (de11ac4): 7 warn findings, all from the cloud inference check. Investigation showed two distinct bug classes — neither was "ship bad code"; both were "auditor misreads the diff":

1. The cloud flagged "X not defined in this diff / missing implementation" for symbols like `tailJsonl` and `stubFinding` that ARE defined — just not in the added lines of this diff. Fix: extract candidate symbols from the cloud's gap summary and grep the repo for their definitions (function/const/let/def/class/struct/enum/trait/fn). If every named symbol resolves, drop the finding; if only some do, demote to info with the resolution in evidence.

2. The cloud flagged runtime metrics like "58 cloud calls, 306s end-to-end" as unbacked claims. These are empirical outputs from running the test, not things a static diff can prove. Fix: claim_parser now has an `empirical` strength class matching iteration counts, cloud-call counts, duration metrics, attempt counts, and tier-count phrases. Inference drops empirical claims from its cloud prompt (the verifiable[] subset only), and claim-index mapping uses verifiable[] so cloud responses still line up. Added `claims_empirical` to audit metrics so the verdict is introspectable: how many claims were runtime-only vs diff-verifiable?

Verified: unit tests confirm empirical classification on 5 sample commit messages; the symbol resolver found both false-positive symbols (tailJsonl + stubFinding) and correctly skipped a known-fake symbol.
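Hedged sketches of both fixes. The definition-keyword regex and the empirical patterns are approximations of what the commit describes, not the auditor's actual claim_parser or resolver patterns:

```typescript
// Fix 1: does this source text DEFINE `symbol` (vs merely reference it)?
// Mirrors the keyword list from the commit; `symbol` is assumed to be a
// plain identifier (no regex metacharacters).
function definesSymbol(source: string, symbol: string): boolean {
  const def = new RegExp(
    `\\b(?:function|const|let|var|def|class|struct|enum|trait|fn)\\s+${symbol}\\b`,
  );
  return def.test(source);
}

// Fix 2: is this claim an empirical runtime metric rather than something a
// static diff can prove? Illustrative patterns only.
const EMPIRICAL: RegExp[] = [
  /\b\d+\s*cloud calls?\b/i,                  // "58 cloud calls"
  /\b\d+(?:\.\d+)?s\b/,                       // duration metrics like "306s"
  /\b\d+\s*(?:iterations?|attempts?|tiers?)\b/i,
];

function isEmpiricalClaim(claim: string): boolean {
  return EMPIRICAL.some((re) => re.test(claim));
}
```

A claim matching `isEmpiricalClaim` would be dropped from the cloud prompt; a gap whose every named symbol passes `definesSymbol` somewhere in the repo would be dropped as a false positive.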

`0306dd88c1` auditor: close the verdict→playbook loop + fix rubric-string false positive
CI: some checks failed (lakehouse/auditor, 2 blocking issues): unimplemented!() macro call in tests/real-world/hard_task_escalation.ts
Two changes that fell out of running the auto-loop for real on PR #8:

1. The systemd auditor blocked PR #8 on 'unimplemented!()' / 'todo!()' in tests/real-world/hard_task_escalation.ts — but those strings are the rubric itself, not macro calls. Added isInsideQuotedString() detection in static.ts: BLOCK_PATTERNS now skip matches that fall inside double-quoted, single-quoted, or backtick string literals on the added line. WARN/INFO patterns still run — a TODO comment in a string is still a valid signal.

2. Verdicts were being persisted to disk but never fed back as learning signal. Added appendAuditLessons(): every block/warn finding writes a JSONL row to data/_kb/audit_lessons.jsonl with a path-agnostic signature (strips file paths, line numbers, commit hashes) so the SAME class of finding on DIFFERENT files dedups to one signature. kb_query now tails audit_lessons.jsonl and emits recurrence findings: 2 distinct PRs hitting a signature = info, 3-4 = warn, 5+ = block. Severity ramps on distinct-PR count, not total rows, so a single unfixed PR being re-audited doesn't inflate its own recurrence score. Runs post-verdict, fire-and-forget (a failed disk write can't break the audit). The learning loop is now closed: each audit contributes to the KB that guides the next audit.

Tested: unit tests for normalizedSignature confirmed path-agnostic dedup; static.ts regression tests confirmed rubric strings no longer trip BLOCK while real unquoted unimplemented!() still does.
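A simple single-line sketch of the quote-aware skip and the path-agnostic signature. The real static.ts and appendAuditLessons() may handle escapes, more token classes, and multi-line literals differently; the regexes here are illustrative:

```typescript
// Is position `index` on this added line inside a ", ', or ` string literal?
// Scans left-to-right, tracking the currently open quote and skipping escapes.
function isInsideQuotedString(line: string, index: number): boolean {
  let quote: string | null = null;
  for (let i = 0; i < index; i++) {
    const ch = line[i];
    if (ch === "\\") { i++; continue; }      // skip the escaped character
    if (quote) {
      if (ch === quote) quote = null;        // closing the open quote
    } else if (ch === '"' || ch === "'" || ch === "`") {
      quote = ch;                            // opening a quote
    }
  }
  return quote !== null;                     // still inside a literal
}

// Strip file paths, line numbers, and commit hashes so the same finding
// class on different files dedups to one signature. Patterns are assumed.
function normalizedSignature(finding: string): string {
  return finding
    .replace(/[\w.\/-]+\.(?:ts|rs|js|py)\b/g, "<path>") // file paths
    .replace(/:\d+/g, "")                               // line numbers
    .replace(/\b[0-9a-f]{7,40}\b/g, "<sha>");           // commit hashes
}
```

With this, the PR #8 rubric line no longer trips BLOCK, and two audits flagging `unimplemented!()` in different files share one lesson row.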

`dc01ba0a3b` auditor: kb_query surfaces scrum-master reviews for files in PR diff
CI: some checks failed (lakehouse/auditor, 2 blocking issues): unimplemented!() macro call in tests/real-world/hard_task_escalation.ts
Wires the cohesion-plan Phase C link: the scrum-master pipeline writes per-file reviews to data/_kb/scrum_reviews.jsonl on accept; the auditor now reads that same file and emits one kb_query finding per scrum review whose `file` matches a path in the PR's diff.

Severity heuristic: attempt 1-3 → info, attempt 4+ → warn. Reaching the cloud specialist (attempt 4+) means the ladder had to escalate, which is meaningful signal reviewers should see. Whether tree-split fired is also surfaced in the finding summary. audit.ts now passes pr.files.map(f => f.path) into runKbCheck (the old signature dropped it on the floor).

Also adds auditor/audit_one.ts — a dry-run CLI for auditing a single PR without posting to Gitea, useful for verifying check behavior without spamming review comments.

Verified: after writing scrum_reviews for auditor/audit.ts and mcp-server/observer.ts (both in PR #7), `audit_one 7` surfaced both as info findings with preview + accepted_model + tree_split flag. A scrum review for playbook_memory.rs (NOT in PR #7) was correctly filtered out.
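The diff-path filter and attempt-based severity ramp can be sketched as follows; the ScrumReview fields are inferred from the commit message, not the actual scrum_reviews.jsonl schema:

```typescript
// Assumed shape of one accepted scrum-master review row.
interface ScrumReview {
  file: string;
  attempt: number;
  accepted_model?: string;
  tree_split?: boolean;
}

interface Finding {
  severity: "info" | "warn";
  summary: string;
}

// One finding per review whose file is touched by the PR's diff.
// attempt 4+ means the escalation ladder reached the cloud specialist.
function scrumFindings(reviews: ScrumReview[], diffPaths: string[]): Finding[] {
  const inDiff = new Set(diffPaths);
  return reviews
    .filter((r) => inDiff.has(r.file))
    .map((r): Finding => ({
      severity: r.attempt >= 4 ? "warn" : "info",
      summary:
        `scrum review for ${r.file} (attempt ${r.attempt}` +
        `${r.tree_split ? ", tree-split" : ""})`,
    }));
}
```

Reviews for files outside the diff (like the playbook_memory.rs case above) simply fall out of the filter.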

`039ed32411` Auditor: KB query check + verdict orchestrator + Gitea poster
CI: all checks were successful (lakehouse/auditor all checks passed: 4 findings, all info)
auditor/checks/kb_query.ts (task #7) — reads data/_kb/outcomes.jsonl, error_corrections.jsonl, data/_observer/ops.jsonl, and data/_bot/cycles/*. Cheap and offline: no model calls, tail-reads only. Fail rate >30% in recent scenario outcomes → warn; otherwise info. Live-proven: 1 finding emitted against current KB state (69 scenario runs, 27.7% fail rate — below the warn threshold).

auditor/audit.ts (task #8) — the orchestrator. Runs static + dynamic + inference + kb_query in parallel, calls assembleVerdict, persists to data/_auditor/verdicts/, and posts to Gitea (commit status + issue comment). AuditOptions supports skip_dynamic/skip_inference/dry_run for iteration.

auditor/gitea.ts — added postIssueComment (an author can comment on their own PR, unlike postReview, which blocks self-review).

static.ts — skip the BLOCK_PATTERNS scan on auditor/checks/* and auditor/fixtures/*, because those files legitimately contain the patterns as regex/string-literal data. WARN/INFO patterns (TODO comments, hardcoded placeholders) still run. Live-proven: a dry-run audit of PR #1 after the fix went from 13 block findings from static to 0; 11 warns from inference still fire on real overreach claims.

Dry-run audit against PR #1 with skip_dynamic=true:
- verdict: block (BEFORE the static fix)
- verdict: request_changes (AFTER — inference correctly flagged "tasks 1-9 complete" as not backed; 0 false-positive blocks from static self-match)
- 42.5s total across checks (mostly cloud inference: 36s)
- 26 claims, 39KB diff

Tasks 5 + 6 + 7 + 8 complete. Remaining: #9 (poller) + #10 (end-to-end proof) + #12 (upsert UPDATE merge fix).
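The kb_query fail-rate heuristic reduces to a small pure function. The Outcome shape and names here are assumptions, not kb_query.ts's real types:

```typescript
// Assumed shape of one scenario-outcome row from outcomes.jsonl.
interface Outcome {
  ok: boolean;
}

// warn when more than `threshold` of recent scenario outcomes failed,
// otherwise info. Empty history is treated as info (nothing to flag).
function failRateSeverity(
  outcomes: Outcome[],
  threshold = 0.3,
): "info" | "warn" {
  if (outcomes.length === 0) return "info";
  const fails = outcomes.filter((o) => !o.ok).length;
  return fails / outcomes.length > threshold ? "warn" : "info";
}
```

On the state quoted above (69 runs, 27.7% fail rate) this stays at info, since 0.277 is below the 0.3 threshold.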