auditor: Level 1 correction — keep think=true, only temp=0 is needed
Some checks failed
lakehouse/auditor 4 warnings — see review
The previous Level 1 commit set think=false, which broke the cloud inference check on real PR audits. gpt-oss:120b is a reasoning model; at think=false on large prompts (a 40KB diff plus 14 claims) it returned empty content. This was verified by inspecting verdict 8-8e4ebbe4b38a, which showed "cloud returned unparseable output — skipped" with 13421 tokens used and head:<empty>. Small-prompt tests passed because the model could respond without needing to think; real audits with the full diff + claims context require the reasoning channel to produce any output at all.

The determinism we need comes from temp=0 (greedy sampling). The reasoning trace at think=true varies in prose, but greedy sampling converges to the same FINAL classification from identical starting state, so signatures remain stable. max_tokens is restored to 3000 to cover the think trace plus the response.
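The failure mode described above can be sketched as a small guard plus the restored request options. The `CloudResult` type and `isEmptyCloudOutput` helper below are hypothetical illustrations, not the actual lakehouse/auditor code; only the three option values mirror what this commit sets:

```typescript
// Hypothetical shape of a cloud inference result as described in the
// commit message: tokens were consumed but no text came back.
interface CloudResult {
  content: string;    // model output; "" in the think=false failure mode
  tokensUsed: number; // e.g. 13421 in verdict 8-8e4ebbe4b38a
}

// Detects the "unparseable output — skipped" condition: the model
// burned tokens (it reasoned internally) but emitted an empty head.
function isEmptyCloudOutput(r: CloudResult): boolean {
  return r.tokensUsed > 0 && r.content.trim().length === 0;
}

// The request options this commit restores: greedy sampling for stable
// signatures, think enabled so the reasoning model can emit any output,
// and enough headroom for the think trace plus the structured response.
const inferenceOptions = {
  max_tokens: 3000,
  temperature: 0,
  think: true,
};
```

The guard makes the skip path explicit instead of letting an empty body fail later in JSON parsing, which is roughly the behavior the verdict log reflects.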
parent: 8e4ebbe4b3
commit: 47f1ca73e7
@@ -112,19 +112,22 @@ export async function runInferenceCheck(claims: Claim[], diff: string): Promise<
       { role: "system", content: systemMsg },
       { role: "user", content: userMsg },
     ],
-    // Deterministic classification mode — temp=0 is greedy-sample,
-    // so identical input → identical output on the same model
-    // version. think=false disables the reasoning trace that was
-    // letting variable prose leak into the classification output
-    // and inflate the audit_lessons signature set (observed as
-    // sig_count creep across the 9-run empirical test).
+    // Deterministic classification — temp=0 is greedy-sample, so
+    // identical input yields identical output on the same model
+    // version. This kills the signature creep we observed in the
+    // 9-run empirical test (sig_count 16→27 from cloud phrasing
+    // variance at temp=0.2).
     //
-    // max_tokens tightened to 1500 — the structured JSON response
-    // fits comfortably in 1500 tokens for typical PRs (~7 claims);
-    // the old 3000 just gave the model room to wander.
-    max_tokens: 1500,
+    // IMPORTANT: keep think=true. gpt-oss:120b is a reasoning
+    // model; setting think=false caused it to return empty content
+    // on large prompts (observed during Level 1 validation: 13421
+    // tokens used, empty content returned). The reasoning trace is
+    // variable prose, but at temp=0 the FINAL classification is
+    // still deterministic because greedy sampling converges to
+    // the same conclusion from the same starting state.
+    max_tokens: 3000,
     temperature: 0,
-    think: false,
+    think: true,
   }),
   signal: AbortSignal.timeout(CALL_TIMEOUT_MS),
 });