lakehouse/sidecar/sidecar/generate.py
root 0c4868c191 qwen3.5 executor + continuation primitive + think:false
Three coupled fixes that together turned the Riverfront Steel scenario
from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing
concerns rather than linter advice.

MODEL SWAP
- Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking).
  mistral's decoder emitted malformed JSON on complex SQL filters
  regardless of prompt; J called it — stop using mistral.
- Reviewer: qwen2.5 → qwen3:latest (40K ctx)
- Applied to scenario.ts, orchestrator.ts, network_proving.ts,
  run_e2e_rated.ts

CONTINUATION PRIMITIVE (agent.ts)
- generateContinuable(): empty-response → geometric backoff retry;
  truncated-JSON → continue from partial as scratchpad; bounded by
  budget cap + max_continuations. No more "bump max_tokens until it
  stops truncating" tourniquet.
- generateTreeSplit(): map-reduce for oversized input corpora with
  running scratchpad digest, reduce pass for final synthesis.
- Empty text no longer throws — it's a signal to continuable that
  thinking ate the budget.
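The real generateContinuable() lives in agent.ts (TypeScript); a minimal Python sketch of the same control flow, with an injectable `call` function standing in for the Ollama-backed executor and illustrative prompt wording, looks roughly like:

```python
import json
import time

def generate_continuable(call, prompt, *, max_continuations=3,
                         max_retries=3, base_delay=0.0):
    """Sketch of the continuation primitive. `call(prompt) -> str` is any
    single-shot generate function (hypothetical stand-in for the real one)."""
    text = ""
    for _ in range(max_continuations + 1):
        # On continuation, feed the partial output back as a scratchpad
        # instead of restarting from scratch.
        p = prompt if not text else (
            prompt
            + "\n\nPartial output so far; continue exactly where it stops:\n"
            + text)
        out = call(p)
        retries = 0
        while out == "" and retries < max_retries:
            # Empty response: thinking ate the budget. Geometric backoff.
            time.sleep(base_delay * (2 ** retries))
            retries += 1
            out = call(p)
        text += out
        try:
            json.loads(text)   # parses cleanly: the JSON is complete
            return text
        except json.JSONDecodeError:
            continue           # truncated: next pass continues from partial
    raise RuntimeError("continuation budget exhausted")
```

The loop is bounded by max_continuations rather than by inflating max_tokens, which is what replaces the old tourniquet.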

think:false FOR HOT PATH
- qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON
  emission. For executor/reviewer/draft: think:false. For T3/T4/T5
  overseers: thinking stays on (that's the point).
- Sidecar generate endpoint accepts `think` bool, passes through to
  Ollama's /api/generate.
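The passthrough shape matters: `think` must be omitted entirely when unset, not sent as null. A small sketch of the payload construction (mirroring the endpoint below; function name is illustrative):

```python
def ollama_payload(prompt, *, model="qwen3.5:latest", temperature=0.7,
                   max_tokens=2048, system=None, think=None):
    """Build an Ollama /api/generate request body. `think` is only
    included when the caller sets it, so non-thinking models never
    receive the field at all."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "num_predict": max_tokens},
    }
    if system is not None:
        payload["system"] = system
    if think is not None:
        payload["think"] = think   # False = hot path, True = overseers
    return payload
```

Hot-path callers pass think=False; T3/T4/T5 overseers leave it unset and inherit the model default.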

VERIFIED OUTCOMES
Riverfront Steel 2026-04-21, qwen3.5+continuable+think:false:
  08:00 baseline_fill  3/3  4 turns
  10:30 recurring      2/2  3 turns (1 playbook citation)
  12:15 expansion      0/5  drift-aborted (5-fill orchestration
                            problem, separate work)
  14:00 emergency      4/4  3 turns (1 citation)
  15:45 misplacement   1/1  3 turns
  → T3 caught Patrick Ross double-booking across events
  → T3 flagged forklift cert drift on the event that failed
  → Cross-day lesson proposed "maintain buffer of ≥3 emergency
    candidates, pre-fetch certs for expansion, booking system
    cross-check" — real staffing advice, not generic linter output

PRD PHASE 21 rewritten to reflect the actual primitive shape (two-
call map-reduce with scratchpad glue) instead of the tourniquet
approach originally documented. Rust port queued for next sprint.
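The two-call map-reduce with scratchpad glue, as a hedged Python sketch (the actual generateTreeSplit() is in agent.ts; the fake `call` signature and prompt text are illustrative):

```python
def generate_tree_split(call, chunks, *, reduce_prompt, digest_cap=2000):
    """Map pass: summarize each chunk while carrying a running scratchpad
    digest forward, so later chunks see what earlier chunks established.
    Reduce pass: one final synthesis over the accumulated notes."""
    digest = ""
    notes = []
    for chunk in chunks:
        out = call(
            f"Scratchpad so far:\n{digest}\n\n"
            f"Input:\n{chunk}\n\nSummarize the new facts.")
        notes.append(out)
        digest = (digest + "\n" + out)[-digest_cap:]  # bounded digest
    return call(reduce_prompt + "\n\n" + "\n".join(notes))
```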

scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits
tests/multi-agent/playbooks/ab_scorecard.json.
2026-04-20 20:19:02 -05:00


import os

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

from .ollama import client

router = APIRouter()

GEN_MODEL = os.environ.get("GEN_MODEL", "qwen2.5")


class GenerateRequest(BaseModel):
    prompt: str
    model: str | None = None
    system: str | None = None
    temperature: float = 0.7
    max_tokens: int = 2048
    # think=False disables hidden reasoning blocks on thinking models
    # (qwen3, qwen3.5, gpt-oss). Required for hot-path JSON emitters
    # that need the whole token budget for the visible response.
    think: bool | None = None


class GenerateResponse(BaseModel):
    text: str
    model: str
    tokens_evaluated: int | None = None
    tokens_generated: int | None = None


@router.post("", response_model=GenerateResponse)
async def generate(req: GenerateRequest):
    model = req.model or GEN_MODEL
    payload = {
        "model": model,
        "prompt": req.prompt,
        "stream": False,
        "options": {
            "temperature": req.temperature,
            "num_predict": req.max_tokens,
        },
    }
    if req.system:
        payload["system"] = req.system
    if req.think is not None:
        payload["think"] = req.think
    async with client() as c:
        resp = await c.post("/api/generate", json=payload)
    if resp.status_code != 200:
        raise HTTPException(502, f"Ollama error: {resp.text}")
    data = resp.json()
    return GenerateResponse(
        text=data.get("response", ""),
        model=model,
        tokens_evaluated=data.get("prompt_eval_count"),
        tokens_generated=data.get("eval_count"),
    )