Three coupled fixes that together turned the Riverfront Steel scenario
from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing
concerns rather than linter advice.
MODEL SWAP
- Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking).
mistral's decoder emitted malformed JSON on complex SQL filters
regardless of prompt; J called it — stop using mistral.
- Reviewer: qwen2.5 → qwen3:latest (40K ctx).
- Applied to scenario.ts, orchestrator.ts, network_proving.ts,
  run_e2e_rated.ts (constants sketched below).
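A minimal sketch of the swap's shape, assuming the four files share
plain model-name constants; the identifiers below are hypothetical,
only the model tags come from the note:

// Hypothetical constant names; values are the models named above.
export const EXECUTOR_MODEL = "qwen3.5:latest"; // was "mistral"
export const REVIEWER_MODEL = "qwen3:latest";   // was "qwen2.5"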
CONTINUATION PRIMITIVE (agent.ts)
- generateContinuable(): empty-response → geometric backoff retry;
truncated-JSON → continue from partial as scratchpad; bounded by
budget cap + max_continuations. No more "bump max_tokens until it
stops truncating" tourniquet.
- generateTreeSplit(): map-reduce for oversized input corpora with
running scratchpad digest, reduce pass for final synthesis.
- Empty text no longer throws; it's a signal to continuable that
  thinking ate the budget. (Both primitives sketched below.)
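A minimal TypeScript sketch of the two primitives. Only the names
generateContinuable and generateTreeSplit come from this note; the gen
callback, ContinuableOpts fields, prompt glue, and isCompleteJson
helper are illustrative, not the shipped agent.ts API:

interface ContinuableOpts {
  maxContinuations: number; // hard cap on continue/retry passes
  tokenBudget: number;      // total generated-token budget
  baseDelayMs?: number;     // first retry delay after an empty response
}

type Gen = (prompt: string) => Promise<{ text: string; tokens: number }>;

function isCompleteJson(s: string): boolean {
  try { JSON.parse(s); return true; } catch { return false; }
}

async function generateContinuable(
  prompt: string, gen: Gen, opts: ContinuableOpts,
): Promise<string> {
  let scratchpad = "";
  let spent = 0;
  let delay = opts.baseDelayMs ?? 500;

  for (let pass = 0; pass <= opts.maxContinuations; pass++) {
    const p = scratchpad
      ? `${prompt}\n\nPartial output so far (continue, do not repeat):\n${scratchpad}`
      : prompt;
    const { text, tokens } = await gen(p);
    spent += tokens;

    if (text === "") {
      // Empty response: thinking ate the budget. Geometric backoff, retry.
      await new Promise((r) => setTimeout(r, delay));
      delay *= 2;
      continue;
    }

    scratchpad += text;
    if (isCompleteJson(scratchpad)) return scratchpad; // complete: done
    if (spent >= opts.tokenBudget) break;              // budget cap
    // Otherwise truncated JSON: loop feeds the partial back as scratchpad.
  }
  throw new Error("generateContinuable: continuation/budget cap exhausted");
}

async function generateTreeSplit(
  chunks: string[], gen: Gen, opts: ContinuableOpts,
): Promise<string> {
  // Map: fold each chunk of the oversized corpus into a running digest.
  let digest = "";
  for (const chunk of chunks) {
    digest = await generateContinuable(
      `Digest so far:\n${digest}\n\nNew material:\n${chunk}\n\nReturn the updated digest as JSON.`,
      gen, opts,
    );
  }
  // Reduce: one final synthesis pass over the accumulated digest.
  return generateContinuable(
    `Synthesize the final answer (JSON) from:\n${digest}`, gen, opts,
  );
}

The point of the shape: both failure modes stay inside one bounded
loop. Empty responses cost a backoff retry, truncation costs a
continuation pass, and the token budget caps both.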
think:false FOR HOT PATH
- qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON
emission. For executor/reviewer/draft: think:false. For T3/T4/T5
overseers: thinking stays on (that's the point).
- Sidecar generate endpoint accepts `think` bool, passes it through to
  Ollama's /api/generate (caller example below; route source at the end
  of this note).
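A hedged caller-side example. The request body matches the FastAPI
route below; the sidecar's mount prefix and port are assumptions about
deployment, not taken from the note:

// Hot-path executor call: think:false keeps the whole num_predict
// budget on the visible JSON. Overseer (T3/T4/T5) calls omit `think`
// so thinking models keep reasoning. URL prefix/port are assumptions.
const resp = await fetch("http://localhost:8000/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    prompt: "Emit the shift-fill plan. JSON only, no prose.",
    model: "qwen3.5:latest",
    temperature: 0.2,
    max_tokens: 1024,
    think: false,
  }),
});
const { text, tokens_generated } = await resp.json();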
VERIFIED OUTCOMES
Riverfront Steel 2026-04-21, qwen3.5 + continuable + think:false:
  08:00  baseline_fill  3/3  4 turns
  10:30  recurring      2/2  3 turns (1 playbook citation)
  12:15  expansion      0/5  drift-aborted (5-fill orchestration
                             problem, separate work)
  14:00  emergency      4/4  3 turns (1 citation)
  15:45  misplacement   1/1  3 turns
  → T3 caught Patrick Ross double-booking across events.
  → T3 flagged forklift cert drift on the event that failed.
  → Cross-day lesson proposed "maintain buffer of ≥3 emergency
    candidates, pre-fetch certs for expansion, booking system
    cross-check": real staffing advice, not generic linter output.
PRD PHASE 21
- Rewritten to reflect the actual primitive shape (two-call map-reduce
  with scratchpad glue) instead of the tourniquet approach originally
  documented. Rust port queued for next sprint.
- scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits
  tests/multi-agent/playbooks/ab_scorecard.json.
SIDECAR GENERATE ROUTE (Python/FastAPI)

import os

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

from .ollama import client

router = APIRouter()

# Default model when the request doesn't name one.
GEN_MODEL = os.environ.get("GEN_MODEL", "qwen2.5")


class GenerateRequest(BaseModel):
    prompt: str
    model: str | None = None
    system: str | None = None
    temperature: float = 0.7
    max_tokens: int = 2048
    # think=false disables hidden reasoning blocks on thinking models
    # (qwen3, qwen3.5, gpt-oss). Required for hot-path JSON emitters
    # that need the whole token budget for the visible response.
    # Tri-state: None omits the field so the model keeps its default.
    think: bool | None = None


class GenerateResponse(BaseModel):
    text: str
    model: str
    tokens_evaluated: int | None = None
    tokens_generated: int | None = None


@router.post("", response_model=GenerateResponse)
async def generate(req: GenerateRequest):
    model = req.model or GEN_MODEL

    payload = {
        "model": model,
        "prompt": req.prompt,
        "stream": False,
        "options": {
            "temperature": req.temperature,
            "num_predict": req.max_tokens,
        },
    }
    if req.system:
        payload["system"] = req.system
    if req.think is not None:
        payload["think"] = req.think

    async with client() as c:
        resp = await c.post("/api/generate", json=payload)
        if resp.status_code != 200:
            raise HTTPException(502, f"Ollama error: {resp.text}")
        data = resp.json()

    return GenerateResponse(
        text=data.get("response", ""),
        model=model,
        tokens_evaluated=data.get("prompt_eval_count"),
        tokens_generated=data.get("eval_count"),
    )