|
|
7c1745611a
|
Audit pipeline PR #9: determinism + fact extraction + verifier gate + KB stats + context injection (PR #9)
Bundles PR #9's work for the audit pipeline:
- N=3 consensus on cloud inference (gpt-oss:120b parallel) with qwen3-coder:480b tie-breaker
- audit_discrepancies.jsonl logs N-run disagreements
- scrum_master reviews route through llm_team fact extraction; source="scrum_review"
- Verifier-gated persistence: drops INCORRECT, keeps UNVERIFIABLE/UNCHECKED; schema_version:2
- scrum_master_reviewed flag on accepted reviews
- auditor/kb_stats.ts: on-demand observability script
- claim_parser history/proof pattern class (verified-on-PR, was-flipping, the-proven-X)
- claim_parser quoted-string guard (mirrors static.ts fix)
- fact_extractor project context injection via docs/AUDITOR_CONTEXT.md
- Fixed verifier-verdict parser to handle multiple gemma2 output formats
Empirical: 3-run determinism test on unchanged PR #9 SHA showed 7/7 warn findings stable; block count oscillation eliminated; llm_team quality scores 8-9 on context-injected extract runs.
See PR #9 for full run-by-run commit history.
|
2026-04-23 05:29:38 +00:00 |
|
|
|
156dae6732
|
Auditor self-test branch: real-world pipelines + cohesion Phase C + KB index (PR #8)
Bundles 12 commits validating the auditor + scrum_master architecture end-to-end:
- enrich_prd_pipeline / hard_task_escalation / scrum_master_pipeline stress tests
- Tree-split + scrum_reviews.jsonl + kb_query surfacing
- Verdict → audit_lessons feedback loop (closed)
- kb_index aggregator with confidence-based severity policy
- 9-run + 5-run empirical tests proved the predictive-compounding property
- Level 1 correction: temp=0 cloud inference for deterministic per-claim verdicts
- audit_one.ts dry-run CLI
- Fixes: static quoted-string guard, empirical-claim classification, symbol-resolver gate, repo-file size cap
See PR #8 for run-by-run commit history.
|
2026-04-23 03:28:32 +00:00 |
|
profit
|
efc7b5ac44
|
Auditor: dynamic + inference checks
auditor/checks/dynamic.ts — wraps runHybridFixture, maps layer
results to Findings. Placeholder-style errors (404/unimplemented/
slice N) → info; other failures → warn. Always emits a summary
finding with real numbers (shipped/placeholder phase counts + per-
layer latency). Live-tested against current stack: 2 info findings,
0 warnings — all shipped layers actually work.
auditor/checks/inference.ts — wraps the run_codereview reviewer
pattern from llm_team_ui.py, adapted for claim-vs-diff verification.
Calls /v1/chat provider=ollama_cloud model=gpt-oss:120b. Requests
strict JSON response with claim_verdicts[] and unflagged_gaps[]. A
strong claim marked "not backed" by cloud → BLOCK severity; moderate
→ warn; weak → info. Cloud-unreachable or unparseable-output → info
(never blocks on the reviewer being down).
Live-tested against PR #1 (this PR, 20 claims, 39KB diff):
- 36.9s round-trip
- 7 block + 23 warn + 2 info findings
- gpt-oss:120b correctly flagged "Fully-functional auditor (tasks
1-9 complete)" as not-backed (only 6/10 tasks done at that
commit) — accurate catch
- Some false positives from the original 15KB truncation threshold
(cloud missed gitea.ts, flagged "no Gitea client present")
- Bumped MAX_DIFF_CHARS from 15000 to 40000 to fit the full PR
diff in context; reviewer precision improves accordingly
Tasks 5 + 6 completed. Remaining: #7 (KB query), #8 (verdict +
Gitea poster), #9 (poller), #10 (end-to-end proof), #12 (upsert
UPDATE-drops-doc_refs).
|
2026-04-22 03:54:18 -05:00 |
|