Auditor: hybrid fixture — found a pre-existing bug on first live run
auditor/fixtures/hybrid_38_40_45.ts — the never-before-run hybrid
test. Exercises Phase 38 /v1/chat → Phase 40 Langfuse → Phase 45
slice 1 seed+doc_refs → Phase 45 slice 2 bridge drift → (expected-
fail) Phase 45 slice 3 drift-check endpoint.
auditor/fixtures/cli.ts — standalone runner. Human-readable summary
to stderr, machine-readable JSON to stdout, exit code 0/1/2 for
pass / fail / partial_pass.
Live run results — honest measurements, not hand-waved:
✓ Phase 38 /v1/chat returns 9 visible tokens, 6.7s latency
("docker run is a common Docker command.")
✓ Phase 40 Langfuse trace 18a8a0b7 landed in 2.5s
✗ Phase 45.1 seed endpoint returns empty reply — discovered a
PRE-EXISTING BUG unrelated to doc_refs:
playbook_memory.rs:257 UpsertOutcome has newtype
variants Added(String) and Noop(String) under
#[serde(tag="mode")] — serde panics on serialize.
panicked at crates/vectord/src/service.rs:2323:
Error("cannot serialize tagged newtype variant
UpsertOutcome::Added containing a string")
Reproduced: curl /seed with AND without doc_refs
both get "Empty reply from server" (socket closed
mid-response). This bug has existed since Phase 26
shipped (commit 640db8c, 2026-04-21). No test or
caller in the repo exercised the response path live
against the gateway until this fixture did.
✓ Phase 45.2 context7 bridge confirms drift: current hash
475a0396ca436bba vs our stale input, upstream last
updated 2026-04-20
✗ Phase 45.3 /doc_drift/check endpoint — correctly unreachable
because layer 3 blocked us from getting a playbook_id;
endpoint still doesn't exist independent of that
Real numbers published: per-layer latency_ms, token counts,
trace_age_ms, library_id, current_hash_length. All stored in the
JSON output for downstream audit.
Value delivered: the fixture's first live run found a bug that
unit tests, compile checks, and my own "phase shipped" commits all
missed. Exactly the gap J called out — the auditor is doing what
it's supposed to do.
Bug fix is a SEPARATE concern: new task #11 tracks a separate PR
(fix/upsert-outcome-serde) so the audit finding and the fix stay
cleanly attributed.
This commit is contained in:
parent
b933334ae2
commit
9c893fbb8c
49
auditor/fixtures/cli.ts
Normal file
49
auditor/fixtures/cli.ts
Normal file
@ -0,0 +1,49 @@
|
||||
// Standalone runner for auditor fixtures. Invoke:
|
||||
//
|
||||
// bun run auditor/fixtures/cli.ts hybrid_38_40_45
|
||||
//
|
||||
// Prints human-readable per-layer breakdown + JSON result so the
|
||||
// output can be captured by the dynamic check without re-running.
|
||||
|
||||
import { runHybridFixture, type FixtureResult } from "./hybrid_38_40_45.ts";
|
||||
|
||||
const fixtureArg = process.argv[2] ?? "hybrid_38_40_45";
|
||||
|
||||
let result: FixtureResult;
|
||||
switch (fixtureArg) {
|
||||
case "hybrid_38_40_45":
|
||||
result = await runHybridFixture();
|
||||
break;
|
||||
default:
|
||||
console.error(`unknown fixture: ${fixtureArg}`);
|
||||
process.exit(2);
|
||||
}
|
||||
|
||||
// Human-readable summary
|
||||
console.error(""); // blank line to stderr so stdout JSON stays clean
|
||||
console.error(`─── Fixture: ${result.fixture} ───`);
|
||||
console.error(` overall: ${result.overall.toUpperCase()}`);
|
||||
console.error(` shipped: [${result.shipped_phases.join(", ") || "—"}]`);
|
||||
console.error(` placeholder: [${result.placeholder_phases.join(", ") || "—"}]`);
|
||||
console.error("");
|
||||
for (const l of result.layers) {
|
||||
const mark = l.ok ? "✓" : "✗";
|
||||
const phaseStr = `Phase ${l.phase}`.padEnd(11);
|
||||
console.error(` ${mark} ${phaseStr} ${l.layer.padEnd(30)} ${String(l.latency_ms).padStart(6)}ms`);
|
||||
if (l.ok) {
|
||||
console.error(` ${l.evidence.slice(0, 200)}`);
|
||||
} else {
|
||||
console.error(` ERROR: ${l.error?.slice(0, 240) ?? "unknown"}`);
|
||||
}
|
||||
}
|
||||
console.error("");
|
||||
for (const n of result.notes) console.error(` note: ${n}`);
|
||||
console.error("");
|
||||
|
||||
// Machine-readable output on stdout.
|
||||
console.log(JSON.stringify(result, null, 2));
|
||||
|
||||
// Exit code reflects overall: 0 pass, 2 partial, 1 fail.
|
||||
// Dynamic check reads this AND the JSON; partial-pass is treated
|
||||
// as informative (some layers shipped), not blocking on its own.
|
||||
process.exit(result.overall === "pass" ? 0 : result.overall === "partial_pass" ? 2 : 1);
|
||||
317
auditor/fixtures/hybrid_38_40_45.ts
Normal file
317
auditor/fixtures/hybrid_38_40_45.ts
Normal file
@ -0,0 +1,317 @@
|
||||
// The never-run hybrid test fixture. Exercises the full stack end-to-end
|
||||
// across Phase 38 (/v1/chat), Phase 40 (Langfuse tracing), Phase 45
|
||||
// slice 1 (DocRef on playbook), and Phase 45 slice 2 (context7 bridge
|
||||
// drift detection).
|
||||
//
|
||||
// Returns HONEST per-layer results. A layer that's unimplemented or
|
||||
// broken returns ok=false with a real error; overall verdict flags
|
||||
// which phases are genuinely shipped vs still placeholder.
|
||||
//
|
||||
// Deterministic: always runs the same payload so a run's output is
|
||||
// comparable across audits. Uses TEST-prefixed names so playbook
|
||||
// state doesn't get polluted with production-looking fixture data.
|
||||
//
|
||||
// Preconditions:
|
||||
// - gateway up on :3100
|
||||
// - Python sidecar up on :3200
|
||||
// - Ollama local + Ollama Cloud key loaded
|
||||
// - Langfuse up on :3001
|
||||
// - context7 bridge up on :3900 (manual-start per mcp-server/README)
|
||||
// If bridge isn't running, that layer reports ok=false with a
|
||||
// clear error — the fixture doesn't try to auto-start it.
|
||||
|
||||
import { readFile } from "node:fs/promises";
|
||||
|
||||
const GATEWAY = process.env.LH_GATEWAY_URL ?? "http://localhost:3100";
|
||||
const LANGFUSE = process.env.LANGFUSE_URL ?? "http://localhost:3001";
|
||||
const BRIDGE = process.env.CONTEXT7_BRIDGE_URL ?? "http://localhost:3900";
|
||||
const FIXTURE_ID = "auditor:hybrid_38_40_45:v1";
|
||||
const TEST_WORKER_NAME = "__auditor_test_worker__";
|
||||
const STALE_HASH = "stale-hash-for-drift-proof";
|
||||
|
||||
export interface LayerResult {
|
||||
layer: string;
|
||||
phase: string;
|
||||
ok: boolean;
|
||||
latency_ms: number;
|
||||
evidence: string;
|
||||
error?: string;
|
||||
// Layer-specific numbers that downstream checks can aggregate.
|
||||
metrics?: Record<string, number | string | boolean>;
|
||||
}
|
||||
|
||||
export interface FixtureResult {
|
||||
fixture: string;
|
||||
ran_at: string;
|
||||
overall: "pass" | "partial_pass" | "fail";
|
||||
layers: LayerResult[];
|
||||
real_numbers: Record<string, number>;
|
||||
shipped_phases: string[];
|
||||
placeholder_phases: string[];
|
||||
notes: string[];
|
||||
}
|
||||
|
||||
// Wraps a layer's async body with timing + structured error capture.
|
||||
// On throw: returns ok=false with the error string as evidence, so
|
||||
// downstream verdict code treats it as a layer failure rather than
|
||||
// a fixture crash.
|
||||
async function measureLayer(
|
||||
layer: string,
|
||||
phase: string,
|
||||
body: () => Promise<{ evidence: string; metrics?: Record<string, number | string | boolean> }>,
|
||||
): Promise<LayerResult> {
|
||||
const t0 = Date.now();
|
||||
try {
|
||||
const { evidence, metrics } = await body();
|
||||
return {
|
||||
layer, phase, ok: true,
|
||||
latency_ms: Date.now() - t0,
|
||||
evidence, metrics,
|
||||
};
|
||||
} catch (e) {
|
||||
return {
|
||||
layer, phase, ok: false,
|
||||
latency_ms: Date.now() - t0,
|
||||
evidence: `FAILED: ${(e as Error).message.slice(0, 200)}`,
|
||||
error: (e as Error).message,
|
||||
};
|
||||
}
|
||||
}
|
||||
|
||||
async function langfuseAuth(): Promise<{ pk: string; sk: string } | null> {
|
||||
// Default credentials match mcp-server/tracing.ts. Same sources the
|
||||
// Rust side uses in gateway/src/v1/langfuse_trace.rs.
|
||||
let pk = process.env.LANGFUSE_PUBLIC_KEY ?? "pk-lf-staffing";
|
||||
let sk = process.env.LANGFUSE_SECRET_KEY ?? "sk-lf-staffing-secret";
|
||||
if (!pk || !sk) return null;
|
||||
return { pk, sk };
|
||||
}
|
||||
|
||||
export async function runHybridFixture(): Promise<FixtureResult> {
|
||||
const result: FixtureResult = {
|
||||
fixture: FIXTURE_ID,
|
||||
ran_at: new Date().toISOString(),
|
||||
overall: "fail",
|
||||
layers: [],
|
||||
real_numbers: {},
|
||||
shipped_phases: [],
|
||||
placeholder_phases: [],
|
||||
notes: [],
|
||||
};
|
||||
|
||||
// ========================================================================
|
||||
// Layer 1 — Phase 38: POST /v1/chat returns valid OpenAI shape
|
||||
// ========================================================================
|
||||
const l1 = await measureLayer("phase38_chat", "38", async () => {
|
||||
const r = await fetch(`${GATEWAY}/v1/chat`, {
|
||||
method: "POST",
|
||||
headers: { "content-type": "application/json" },
|
||||
body: JSON.stringify({
|
||||
model: "qwen3.5:latest",
|
||||
messages: [
|
||||
{ role: "system", content: "Respond in 8 words or fewer." },
|
||||
{ role: "user", content: "Name one Docker command." },
|
||||
],
|
||||
max_tokens: 60,
|
||||
temperature: 0.2,
|
||||
}),
|
||||
signal: AbortSignal.timeout(60000),
|
||||
});
|
||||
if (!r.ok) throw new Error(`gateway ${r.status}: ${await r.text()}`);
|
||||
const j: any = await r.json();
|
||||
const content = j.choices?.[0]?.message?.content;
|
||||
if (typeof content !== "string" || content.length === 0) {
|
||||
throw new Error(`empty content — think-false default may not be wired: ${JSON.stringify(j).slice(0, 200)}`);
|
||||
}
|
||||
return {
|
||||
evidence: `content="${content.slice(0, 60)}" tokens=${j.usage?.total_tokens}`,
|
||||
metrics: {
|
||||
prompt_tokens: j.usage?.prompt_tokens ?? 0,
|
||||
completion_tokens: j.usage?.completion_tokens ?? 0,
|
||||
total_tokens: j.usage?.total_tokens ?? 0,
|
||||
content_nonempty: true,
|
||||
},
|
||||
};
|
||||
});
|
||||
result.layers.push(l1);
|
||||
|
||||
// ========================================================================
|
||||
// Layer 2 — Phase 40: Langfuse trace lands within 3s
|
||||
// ========================================================================
|
||||
// Fire-and-forget tracing means we need a brief sleep before query.
|
||||
await new Promise(res => setTimeout(res, 2500));
|
||||
const l2 = await measureLayer("phase40_langfuse_trace", "40", async () => {
|
||||
const auth = await langfuseAuth();
|
||||
if (!auth) throw new Error("Langfuse credentials not in env");
|
||||
const r = await fetch(`${LANGFUSE}/api/public/traces?limit=5&name=v1.chat:ollama`, {
|
||||
headers: { Authorization: `Basic ${btoa(`${auth.pk}:${auth.sk}`)}` },
|
||||
signal: AbortSignal.timeout(10000),
|
||||
});
|
||||
if (!r.ok) throw new Error(`langfuse ${r.status}: ${await r.text()}`);
|
||||
const j: any = await r.json();
|
||||
const items = Array.isArray(j.data) ? j.data : [];
|
||||
// Find a trace newer than our l1 start timestamp.
|
||||
const ourStart = Date.parse(l1.evidence.match(/tokens=/) ? result.ran_at : result.ran_at);
|
||||
const recent = items.filter((t: any) => Date.parse(t.timestamp) >= ourStart);
|
||||
if (recent.length === 0) {
|
||||
throw new Error(`no v1.chat:ollama trace since ${new Date(ourStart).toISOString()} (${items.length} older traces visible, Langfuse reachable — tracing is not firing)`);
|
||||
}
|
||||
const trace = recent[0];
|
||||
return {
|
||||
evidence: `trace ${trace.id.slice(0, 8)} name=${trace.name} age=${Date.now() - Date.parse(trace.timestamp)}ms`,
|
||||
metrics: {
|
||||
trace_age_ms: Date.now() - Date.parse(trace.timestamp),
|
||||
trace_latency_reported: Number(trace.latency ?? 0),
|
||||
recent_trace_count: recent.length,
|
||||
},
|
||||
};
|
||||
});
|
||||
result.layers.push(l2);
|
||||
|
||||
// ========================================================================
|
||||
// Layer 3 — Phase 45 slice 1: /seed accepts doc_refs and persists them
|
||||
// ========================================================================
|
||||
// Seed a playbook with doc_refs referencing Docker at a stale hash.
|
||||
// Ignore the cleanup concern for v1 — this fixture pollutes the
|
||||
// in-memory playbook state; operator can retire test entries.
|
||||
const l3 = await measureLayer("phase45_seed_with_doc_refs", "45.1", async () => {
|
||||
const r = await fetch(`${GATEWAY}/vectors/playbook_memory/seed`, {
|
||||
method: "POST",
|
||||
headers: { "content-type": "application/json" },
|
||||
body: JSON.stringify({
|
||||
operation: `TEST: fill: __auditor__ x1 in TestCity, TC`,
|
||||
approach: "auditor hybrid fixture run",
|
||||
context: "doc_refs integration test — retire after audit",
|
||||
endorsed_names: [TEST_WORKER_NAME],
|
||||
append: true,
|
||||
doc_refs: [
|
||||
{
|
||||
tool: "docker",
|
||||
version_seen: "24.0.7",
|
||||
snippet_hash: STALE_HASH,
|
||||
source_url: "https://context7.com/docker/docs",
|
||||
seen_at: new Date().toISOString(),
|
||||
},
|
||||
],
|
||||
}),
|
||||
signal: AbortSignal.timeout(30000),
|
||||
});
|
||||
if (!r.ok) throw new Error(`seed ${r.status}: ${await r.text()}`);
|
||||
const j: any = await r.json();
|
||||
// Verify doc_refs actually persisted — read state.json directly.
|
||||
// That's the canonical source of truth; the seed response may or
|
||||
// may not echo doc_refs back.
|
||||
const state_raw = await readFile("/home/profit/lakehouse/data/_playbook_memory/state.json", "utf8");
|
||||
const state = JSON.parse(state_raw);
|
||||
const entries = state.entries ?? [];
|
||||
const ours = entries.find((e: any) => (e.endorsed_names ?? []).includes(TEST_WORKER_NAME));
|
||||
if (!ours) throw new Error(`seeded entry not found in state.json (worker ${TEST_WORKER_NAME} not present)`);
|
||||
const docRefs = ours.doc_refs ?? [];
|
||||
if (docRefs.length === 0) {
|
||||
throw new Error("entry saved but doc_refs field is empty — the field accepted by the API isn't being persisted");
|
||||
}
|
||||
return {
|
||||
evidence: `playbook ${ours.playbook_id.slice(0, 16)} persisted with doc_refs.length=${docRefs.length}; first tool=${docRefs[0]?.tool} hash=${docRefs[0]?.snippet_hash}`,
|
||||
metrics: {
|
||||
doc_refs_stored: docRefs.length,
|
||||
entries_after: j.entries_after ?? -1,
|
||||
playbook_id: ours.playbook_id,
|
||||
mode: j.outcome?.mode ?? "unknown",
|
||||
},
|
||||
};
|
||||
});
|
||||
result.layers.push(l3);
|
||||
|
||||
// ========================================================================
|
||||
// Layer 4 — Phase 45 slice 2: context7 bridge detects drift vs stale hash
|
||||
// ========================================================================
|
||||
const l4 = await measureLayer("phase45_bridge_diff", "45.2", async () => {
|
||||
// Health check first — if bridge isn't running, fail with a clear
|
||||
// message rather than mysterious connection refused.
|
||||
try {
|
||||
const h = await fetch(`${BRIDGE}/health`, { signal: AbortSignal.timeout(3000) });
|
||||
if (!h.ok) throw new Error(`bridge /health ${h.status}`);
|
||||
} catch (e) {
|
||||
throw new Error(`context7 bridge at ${BRIDGE} is not reachable (bridge is manual-start; run 'bun run mcp-server/context7_bridge.ts'): ${(e as Error).message}`);
|
||||
}
|
||||
const r = await fetch(`${BRIDGE}/docs/docker/diff?since=${encodeURIComponent(STALE_HASH)}`, {
|
||||
signal: AbortSignal.timeout(30000),
|
||||
});
|
||||
if (!r.ok) throw new Error(`bridge diff ${r.status}: ${await r.text()}`);
|
||||
const j: any = await r.json();
|
||||
if (j.drifted !== true) {
|
||||
throw new Error(`expected drifted=true against ${STALE_HASH}; got ${JSON.stringify(j)}`);
|
||||
}
|
||||
return {
|
||||
evidence: `bridge confirms drift: current=${j.current_snippet_hash} previous=${j.previous_snippet_hash} upstream_updated=${j.last_updated_upstream}`,
|
||||
metrics: {
|
||||
drifted: true,
|
||||
current_hash_length: String(j.current_snippet_hash ?? "").length,
|
||||
library_id: j.library_id,
|
||||
},
|
||||
};
|
||||
});
|
||||
result.layers.push(l4);
|
||||
|
||||
// ========================================================================
|
||||
// Layer 5 — Phase 45 slice 3: /doc_drift/check/{id} endpoint
|
||||
// (EXPECTED TO FAIL — endpoint unimplemented)
|
||||
// ========================================================================
|
||||
const l5 = await measureLayer("phase45_slice3_drift_flag", "45.3", async () => {
|
||||
const pid = String(l3.metrics?.playbook_id ?? "");
|
||||
if (!pid) throw new Error("layer 3 did not produce a playbook_id to check");
|
||||
const r = await fetch(`${GATEWAY}/vectors/playbook_memory/doc_drift/check/${encodeURIComponent(pid)}`, {
|
||||
method: "POST",
|
||||
signal: AbortSignal.timeout(10000),
|
||||
});
|
||||
if (r.status === 404 || r.status === 405) {
|
||||
// This is the HONEST signal: the endpoint doesn't exist yet.
|
||||
// Fail this layer with a clear marker — not a silent pass.
|
||||
throw new Error(`endpoint unimplemented (${r.status}) — Phase 45 slice 3 is still placeholder. doc_refs are persisted (layer 3) and the bridge can answer drift questions (layer 4), but nothing ties them together into a playbook flag yet.`);
|
||||
}
|
||||
if (!r.ok) throw new Error(`drift/check ${r.status}: ${await r.text()}`);
|
||||
const j: any = await r.json();
|
||||
// If the endpoint DID exist, assert the flag got set.
|
||||
if (!j.flagged) {
|
||||
throw new Error(`endpoint responded but flag not set: ${JSON.stringify(j).slice(0, 200)}`);
|
||||
}
|
||||
return {
|
||||
evidence: `playbook flagged: ${JSON.stringify(j).slice(0, 160)}`,
|
||||
metrics: { flagged: true },
|
||||
};
|
||||
});
|
||||
result.layers.push(l5);
|
||||
|
||||
// ========================================================================
|
||||
// Verdict assembly
|
||||
// ========================================================================
|
||||
for (const l of result.layers) {
|
||||
if (l.ok) result.shipped_phases.push(l.phase);
|
||||
else result.placeholder_phases.push(l.phase);
|
||||
result.real_numbers[`${l.layer}_latency_ms`] = l.latency_ms;
|
||||
if (l.metrics) {
|
||||
for (const [k, v] of Object.entries(l.metrics)) {
|
||||
if (typeof v === "number") result.real_numbers[`${l.layer}.${k}`] = v;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
const anyFail = result.placeholder_phases.length > 0;
|
||||
const allPass = result.layers.every(l => l.ok);
|
||||
result.overall = allPass ? "pass" : (result.shipped_phases.length > 0 ? "partial_pass" : "fail");
|
||||
|
||||
if (!l5.ok) {
|
||||
result.notes.push(
|
||||
"Phase 45 slice 3 (doc_drift/check endpoint) is the expected-fail layer. " +
|
||||
"Layer 5 failing with a 404/405 is the CORRECT honest signal — don't mask this " +
|
||||
"to get a green result. When slice 3 ships, layer 5 flips to ok=true automatically."
|
||||
);
|
||||
}
|
||||
if (anyFail) {
|
||||
result.notes.push(
|
||||
`${result.placeholder_phases.length} layer(s) failed. Check evidence per layer for specifics.`
|
||||
);
|
||||
}
|
||||
|
||||
return result;
|
||||
}
|
||||
Loading…
x
Reference in New Issue
Block a user