Caught by running a side-test through LLM Team's run_codereview
flow (gpt-oss:120b reviewer) against this fixture, 2026-04-22.
BEFORE:
    const ourStart = Date.parse(
      l1.evidence.match(/tokens=/) ? result.ran_at : result.ran_at
    );
    // Both branches return result.ran_at, so the ternary is meaningless.
    // result.ran_at is the fixture start time, NOT the moment we fired
    // /v1/chat. Any trace created between fixture-start and chat-fetch
    // would false-negative.
AFTER:
    const chat_request_sent_ms = Date.now(); // captured before layer 1
    // ...
    const recent = items.filter(t =>
      Date.parse(t.timestamp) >= chat_request_sent_ms
    );
Re-ran the fixture against the live stack — layers 1,2,4 still pass
(no regression); layer 2 trace matched at age=2494ms which is within
the chat-to-trace propagation window. Layers 3,5 still fail for the
original unrelated reasons (UpsertOutcome serde panic + Phase 45
slice 3 endpoint not built).
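A minimal sketch of the trace-matching idea above, for illustration only.
The TraceItem shape, the matchTrace name, and the windowMs parameter are
assumptions, not the fixture's actual code; only the timestamp field and
the "compare against the chat-request time" rule come from the fix.

```typescript
// Hypothetical trace shape: only `timestamp` appears in the fix above.
interface TraceItem {
  id: string;
  timestamp: string; // ISO-8601
}

interface TraceMatch {
  trace: TraceItem;
  trace_age_ms: number;
}

// Return the earliest trace created at or after the chat request was sent
// and no more than `windowMs` later. `windowMs` is an assumed parameter.
function matchTrace(
  items: TraceItem[],
  chatRequestSentMs: number,
  windowMs: number
): TraceMatch | null {
  let best: TraceMatch | null = null;
  for (const trace of items) {
    const age = Date.parse(trace.timestamp) - chatRequestSentMs;
    // Skip unparseable timestamps, traces older than the request,
    // and traces outside the propagation window.
    if (isNaN(age) || age < 0 || age > windowMs) continue;
    if (best === null || age < best.trace_age_ms) {
      best = { trace, trace_age_ms: age };
    }
  }
  return best;
}
```

With a cutoff taken at request time, a trace written two seconds before
the request is skipped, while one written 2494 ms after it is matched
(assuming a window of at least that size).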
First concrete act-on-finding from a code-checker run. The process
works.
auditor/fixtures/hybrid_38_40_45.ts — the never-before-run hybrid
test. Exercises Phase 38 /v1/chat → Phase 40 Langfuse → Phase 45
slice 1 seed+doc_refs → Phase 45 slice 2 bridge drift → (expected-
fail) Phase 45 slice 3 drift-check endpoint.
auditor/fixtures/cli.ts — standalone runner. Human-readable summary
to stderr, machine-readable JSON to stdout, exit code 0/1/2 for
pass / fail / partial_pass.
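A rough sketch of the runner contract described above, not the contents
of cli.ts. LayerResult, overallStatus, EXIT_CODES, and report are
hypothetical names, and the rule "some layers pass, some fail means
partial_pass" is an assumption about how the three states are derived.

```typescript
type FixtureStatus = "pass" | "fail" | "partial_pass";

// Hypothetical per-layer result; latency_ms mirrors the published metric.
interface LayerResult {
  layer: string;
  passed: boolean;
  latency_ms: number;
}

// Exit codes as stated in the commit message: 0 / 1 / 2.
const EXIT_CODES: Record<FixtureStatus, number> = {
  pass: 0,
  fail: 1,
  partial_pass: 2,
};

// Assumed rule: all layers pass -> pass, none pass -> fail, otherwise partial.
function overallStatus(layers: LayerResult[]): FixtureStatus {
  const passed = layers.filter((l) => l.passed).length;
  if (layers.length > 0 && passed === layers.length) return "pass";
  if (passed === 0) return "fail";
  return "partial_pass";
}

// Human-readable summary goes to stderr, JSON goes to stdout,
// and the returned number is what the caller would exit with.
function report(layers: LayerResult[]): number {
  const status = overallStatus(layers);
  for (const l of layers) {
    console.error(`${l.passed ? "✓" : "✗"} ${l.layer} (${l.latency_ms}ms)`);
  }
  console.error(`status: ${status}`);
  console.log(JSON.stringify({ status, layers }));
  return EXIT_CODES[status];
}
```

Keeping the summary on stderr means stdout can be piped straight into a
JSON consumer without filtering.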
Live run results — honest measurements, not hand-waved:
  ✓ Phase 38 /v1/chat returns 9 visible tokens, 6.7s latency
    ("docker run is a common Docker command.")
  ✓ Phase 40 Langfuse trace 18a8a0b7 landed in 2.5s
  ✗ Phase 45.1 seed endpoint returns empty reply: discovered a
    PRE-EXISTING BUG unrelated to doc_refs:
      playbook_memory.rs:257 UpsertOutcome has newtype
      variants Added(String) and Noop(String) under
      #[serde(tag="mode")]; serde returns an error on
      serialize, which surfaces as a panic:
        panicked at crates/vectord/src/service.rs:2323:
        Error("cannot serialize tagged newtype variant
        UpsertOutcome::Added containing a string")
      Reproduced: curl /seed with AND without doc_refs
      both get "Empty reply from server" (socket closed
      mid-response). This bug has existed since Phase 26
      shipped (commit 640db8c, 2026-04-21). No test or
      caller in the repo exercised the response path live
      against the gateway until this fixture did.
  ✓ Phase 45.2 context7 bridge confirms drift: current hash
    475a0396ca436bba vs our stale input, upstream last
    updated 2026-04-20
  ✗ Phase 45.3 /doc_drift/check endpoint: correctly unreachable
    because layer 3 blocked us from getting a playbook_id;
    endpoint still doesn't exist independent of that
Real numbers published: per-layer latency_ms, token counts,
trace_age_ms, library_id, current_hash_length. All stored in the
JSON output for downstream audit.
Value delivered: the fixture's first live run found a bug that
unit tests, compile checks, and my own "phase shipped" commits all
missed. Exactly the gap J called out — the auditor is doing what
it's supposed to do.
Bug fix is a SEPARATE concern: new task #11 tracks a separate PR
(fix/upsert-outcome-serde) so the audit finding and the fix stay
cleanly attributed.
auditor/checks/static.ts — grep-style scan of PR diffs, no AST,
no LLM. High-signal patterns only.
Severity grading:
- BLOCK — unimplemented!(), todo!(), panic!("not implemented"),
throw new Error("not implemented")
- WARN — TODO/FIXME/XXX/HACK in added lines;
new pub struct fields with <2 mentions in the diff
(added but nobody reads it — placeholder state)
- INFO — hardcoded "placeholder"/"dummy"/"foobar"/"changeme"/"xxx"
strings in added lines
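A minimal sketch of the grep-style scan and severity grading above,
assuming a unified diff as input. scanDiff, RULES, and StaticFinding are
hypothetical names, and the "pub struct field with <2 mentions" rule is
left out to keep the example short.

```typescript
type Severity = "BLOCK" | "WARN" | "INFO";

interface StaticFinding {
  severity: Severity;
  line: string;
  pattern: string;
}

// Ordered most severe first; a line is reported once, at its highest severity.
const RULES: Array<{ severity: Severity; pattern: RegExp }> = [
  { severity: "BLOCK", pattern: /\b(?:unimplemented|todo)!\(\)/ },
  { severity: "BLOCK", pattern: /panic!\("not implemented"\)/ },
  { severity: "BLOCK", pattern: /throw new Error\("not implemented"\)/ },
  { severity: "WARN", pattern: /\b(?:TODO|FIXME|XXX|HACK)\b/ },
  { severity: "INFO", pattern: /"(?:placeholder|dummy|foobar|changeme|xxx)"/ },
];

function scanDiff(diff: string): StaticFinding[] {
  const findings: StaticFinding[] = [];
  for (const raw of diff.split("\n")) {
    // Only added lines count. "+++" is treated as the file header, which
    // also skips the rare added line whose content starts with "++".
    if (raw.charAt(0) !== "+" || raw.slice(0, 3) === "+++") continue;
    const line = raw.slice(1);
    for (const rule of RULES) {
      if (rule.pattern.test(line)) {
        findings.push({
          severity: rule.severity,
          line: line.trim(),
          pattern: rule.pattern.source,
        });
        break; // first (most severe) matching rule wins
      }
    }
  }
  return findings;
}
```

Because the patterns are case-sensitive, the Rust macro todo!() grades as
BLOCK while a TODO comment grades as WARN, and removed lines are ignored.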
Live-proven — the existential test J asked for:
vs PR #1 (scaffold):
    0 findings (all scaffold fields cross-reference within the diff)
vs commit 2a4b81b (Phase 45 first slice, which I half-admitted was a
placeholder):
    5 WARN: every DocRef field (tool, version_seen, snippet_hash,
    source_url, seen_at) added with 0 read-sites in the diff
That's the auditor flagging my own "Phase 45 first slice" commit as
state-without-consumer, which is exactly what I half-admitted it
was. If PR #1 had required auditor-pass (branch protection), the
DocRef commit would have been blocked pre-merge. The auditor works
because it agreed with the honest read.
Next: dynamic hybrid test fixture (task #4) — the never-run multi-
layer pipeline test.
auditor/claim_parser.ts — reads PR body + commit messages, extracts
ship-claims. Regex-based, intentionally not LLM-driven: the parser's
job is to surface claim substrates, not to judge them (that's the
inference check's job, which runs later with a cloud model).
Three strength tiers:
- strong — "verified end-to-end", "live-proven", "production-ready",
"phase N shipped", "proven"
- moderate — "shipped", "landed", "green", "passing", "works",
"complete", "done"
- weak — "should work", "expected to", "probably"
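A sketch of how the three tiers could be matched with plain regexes. It
is not the code in claim_parser.ts: extractClaims, TIERS, and the
ParsedClaim shape are hypothetical, and the overlap handling (so that
"phase N shipped" is not also counted as a moderate "shipped") is an
assumption about the desired behavior.

```typescript
type Strength = "strong" | "moderate" | "weak";

interface ParsedClaim {
  strength: Strength;
  phrase: string;
  index: number;
}

// Strong and weak phrases are tried before moderate ones so that a longer
// phrase claims its span before a shorter word inside it can match.
const TIERS: Array<[Strength, RegExp[]]> = [
  ["strong", [
    /\bverified end-to-end\b/,
    /\blive-proven\b/,
    /\bproduction-ready\b/,
    /\bphase\s+\d+(?:\.\d+)?\s+shipped\b/,
    /\bproven\b/,
  ]],
  ["weak", [/\bshould work\b/, /\bexpected to\b/, /\bprobably\b/]],
  ["moderate", [/\b(?:shipped|landed|green|passing|works|complete|done)\b/]],
];

function extractClaims(text: string): ParsedClaim[] {
  const claims: ParsedClaim[] = [];
  for (const [strength, patterns] of TIERS) {
    for (const pattern of patterns) {
      // Fresh global, case-insensitive copy so exec() can walk all matches.
      const re = new RegExp(pattern.source, "gi");
      let m: RegExpExecArray | null;
      while ((m = re.exec(text)) !== null) {
        const start = m.index;
        const end = start + m[0].length;
        const overlaps = claims.some(
          (c) => start < c.index + c.phrase.length && end > c.index
        );
        if (!overlaps) claims.push({ strength, phrase: m[0], index: start });
      }
    }
  }
  // Report claims in the order they appear in the text.
  return claims.sort((a, b) => a.index - b.index);
}
```

Under these rules, "live-proven" yields one strong claim rather than a
strong "live-proven" plus a second strong "proven".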
Live-proven against PR #1 (this PR): 4 claims extracted from
1 commit (2 strong, 2 moderate). "live-proven" correctly tagged as
strong (it IS a stronger claim than "shipped").
Next: static diff check consumes these claims + the PR diff to find
placeholder patterns — empty fns, TODO, unwired fields, etc.
All-Bun sub-agent that watches open PRs on Gitea, reads ship-claims,
and hard-blocks merges when the code doesn't back the claim. First
commit of N; this is the skeleton. Dynamic/static/inference/kb checks
+ poller land in follow-up commits on this same branch.
- auditor/types.ts — Claim, Finding, Verdict, PrSnapshot shapes
- auditor/gitea.ts — minimal API client (listOpenPrs, getPrDiff,
postCommitStatus, postReview). Live-proven: returned 0 open PRs
against our repo (which IS the current state — every commit today
went to main directly, which is the problem this auditor is meant
to prevent)
- auditor/policy.ts — stub `assembleVerdict` + severity rules.
Intentionally conservative defaults: strong claim + zero evidence
= block, not warn.
- auditor/README.md — how to run + the hard-block mechanism
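
A sketch of the conservative default named above ("strong claim + zero
evidence = block, not warn"). The shapes here are simplified stand-ins
for the ones in auditor/types.ts, and evidenceCount is a hypothetical
field. Only the strong-claim rule comes from this commit; treating an
unevidenced moderate claim as a warning is an assumption.

```typescript
type ClaimStrength = "strong" | "moderate" | "weak";
type Decision = "block" | "warn" | "pass";

// Simplified, hypothetical shapes for illustration.
interface Claim {
  text: string;
  strength: ClaimStrength;
  evidenceCount: number; // how many checks produced supporting evidence
}

interface Finding {
  severity: "block" | "warn" | "info";
  message: string;
}

interface Verdict {
  decision: Decision;
  reasons: string[];
}

function assembleVerdict(claims: Claim[], findings: Finding[]): Verdict {
  const blocks: string[] = [];
  const warns: string[] = [];

  for (const f of findings) {
    if (f.severity === "block") blocks.push(f.message);
    else if (f.severity === "warn") warns.push(f.message);
  }

  for (const c of claims) {
    if (c.evidenceCount > 0) continue;
    const reason = `${c.strength} claim "${c.text}" has no evidence`;
    // From the commit: strong claim + zero evidence = block, not warn.
    if (c.strength === "strong") blocks.push(reason);
    // Assumption: an unevidenced moderate claim only warns.
    else if (c.strength === "moderate") warns.push(reason);
  }

  if (blocks.length > 0) return { decision: "block", reasons: blocks };
  if (warns.length > 0) return { decision: "warn", reasons: warns };
  return { decision: "pass", reasons: [] };
}
```

A "block" decision is what would be reported back to Gitea as a failing
commit status so that branch protection can refuse the merge.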
Workflow discipline change: starting with this branch, no more
direct pushes to main. Every change lands as a PR. When this
auditor is fully built and running, it'll review its own
completion PR — the recursive self-test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>