Three coupled fixes that together turned the Riverfront Steel scenario
from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing
concerns rather than linter advice.
MODEL SWAP
- Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking).
mistral's decoder emitted malformed JSON on complex SQL filters
regardless of prompt; J called it — stop using mistral.
- Reviewer: qwen2.5 → qwen3:latest (40K ctx)
- Applied to scenario.ts, orchestrator.ts, network_proving.ts,
run_e2e_rated.ts
CONTINUATION PRIMITIVE (agent.ts)
- generateContinuable(): empty-response → geometric backoff retry;
truncated-JSON → continue from partial as scratchpad; bounded by
budget cap + max_continuations. No more "bump max_tokens until it
stops truncating" tourniquet.
- generateTreeSplit(): map-reduce for oversized input corpora with
running scratchpad digest, reduce pass for final synthesis.
- Empty text no longer throws — it's a signal to continuable that
thinking ate the budget.
think:false FOR HOT PATH
- qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON
emission. For executor/reviewer/draft: think:false. For T3/T4/T5
overseers: thinking stays on (that's the point).
- Sidecar generate endpoint accepts `think` bool, passes through to
Ollama's /api/generate.
VERIFIED OUTCOMES
Riverfront Steel 2026-04-21, qwen3.5+continuable+think:false:
08:00 baseline_fill 3/3 4 turns
10:30 recurring 2/2 3 turns (1 playbook citation)
12:15 expansion 0/5 drift-aborted (5-fill orchestration
problem, separate work)
14:00 emergency 4/4 3 turns (1 citation)
15:45 misplacement 1/1 3 turns
→ T3 caught Patrick Ross double-booking across events
→ T3 flagged forklift cert drift on the event that failed
→ Cross-day lesson proposed "maintain buffer of ≥3 emergency
candidates, pre-fetch certs for expansion, booking system
cross-check" — real staffing advice, not generic linter output
PRD PHASE 21 rewritten to reflect the actual primitive shape (two-
call map-reduce with scratchpad glue) instead of the tourniquet
approach originally documented. Rust port queued for next sprint.
scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits
tests/multi-agent/playbooks/ab_scorecard.json.
PRD: add Phase 20 (model matrix, wired) and Phase 21 (context stability,
partial). Phase 21 exists because LLM Team hit this exact wall — running
multi-model ranking on large context silently truncated, rankings
degraded, no pipeline caught it. The stable answer: every agent call
goes through a budget check against the model's declared context_window
minus safety_margin, with a declared overflow_policy when the check
fails.
config/models.json:
- context_window + context_budget per tier
- overflow_policies block: summarize_oldest_tool_results_via_t3,
chunk_lessons_via_cosine_topk, two_pass_map_reduce,
escalate_to_kimi_k2_1t_or_split_decision
- chunking_cache spec (data/_chunk_cache/, corpus-hash keyed)
agent.ts:
- estimateTokens() chars/4 biased safe ~15%
- CONTEXT_WINDOWS table (fallback; prod reads models.json)
- assertContextBudget() — throws on overflow with exact numbers, can
bypass with bypass_budget:true for callers with their own policy
- Wired into generate() and generateCloud() so EVERY call is checked
scenario.ts:
- T3 lesson archive to data/_playbook_lessons/*.json (the old
/vectors/playbook_memory/seed path was silently failing with HTTP 400
because it requires 'fill: Role xN in City, ST' operation shape)
- loadPriorLessons() at scenario start — filters by city/state match,
date-sorted, takes top-3
- prior_lessons.json archived per-run (honest signal for A/B)
- guidanceFor() injects up to 2 prior lessons (≤500 chars each) into
the executor's per-event context
- Retrospective shows explicit "Prior lessons loaded: N" line
Verified: mistral correctly rejects a 150K-char prompt (7532 tokens
over), gpt-oss:120b accepts it with 90K headroom. The enforcement is
in-band on every call now, not an afterthought.
Full chunking service (Rust) remains deferred to the sprint this feeds:
crates/aibridge/src/budget.rs + chunk.rs + storaged/chunk_cache.rs
Hot path (T1/T2) stays mistral + qwen2.5. The new T3 tier runs a
thinking model SPARINGLY — after every misplacement, every N-th event
(default N=3), and once post-scenario for the cross-day lesson.
- agent.ts: generateCloud() for Ollama Cloud (gpt-oss:120b etc). Uses
the same /api/generate shape; thinking field is discarded.
- scenario.ts: runOverviewCheckpoint + runCrossDayLesson. Outputs land
in checkpoints.jsonl and lesson.md. Lesson also seeds playbook_memory
under operation "cross-day-lesson-{date}" — future runs pick it up
through the existing similarity boost.
- Env knobs: LH_OVERVIEW_CLOUD=1 routes T3 to cloud, LH_OVERVIEW_MODEL
overrides (default gpt-oss:20b local, gpt-oss:120b cloud),
LH_T3_CHECKPOINT_EVERY controls cadence, LH_T3_DISABLE=1 turns it off.
Why this shape: prior feedback_phase19_seed_text.md warned that verbose
seeds dilute the embedding and silently kill the boost. T3's rich prose
goes to lesson.md; the embedded "approach" + "context" stay terse.
Verified end-to-end: local 20b checkpoint 10.9s, lesson 4.0s; cloud
120b lesson 3.7s. Cloud output is both faster AND more specific than
local (sequenced, tactical, logging advice included).
parseAction now strips stray `)` before `}` and trailing commas —
qwen2.5 emits those regularly on tool_call outputs; soft-fix beats
retry-loops. hybrid_search no longer hard-requires `question`; defaults
to "qualified available workers" when the model drops it (mistral's
most common failure mode on complex events).
Kept original TOOL_CATALOG shape (args examples only, not full
action envelopes). The verbose few-shot version from the prior
iteration confused mistral into wrapping propose_done as tool_call.
Scenario V7 result: expansion (5 Forklift Ops) and emergency
(4 Loaders) — previously-failing complex events — now seal reliably.
Pool sizes: 687 and 380 from 500K corpus. Patterns endpoint produces
real operator-actionable signals:
expansion: "recurring certifications: Forklift (40%), OSHA-10 (40%)
· recurring skills: mill (40%) · archetype mostly: leader
· reliability median 0.83"
Baseline + recurring are now flaky (inverted trade-off, pure
model-reliability variance).
Backend:
- crates/vectord/src/playbook_memory.rs (new): Phase 19 in-memory boost
store with seed/rebuild/snapshot, plus temporal decay (e^-age/30 per
playbook), persist_to_sql endpoint backing successful_playbooks_live,
and discover_patterns endpoint for meta-index pattern aggregation
(recurring certs/skills/archetype/reliability across similar past fills).
- DEFAULT_TOP_K_PLAYBOOKS bumped 5 → 25; old default silently missed
most boosts when memory had > 25 entries.
- service.rs: new routes /vectors/playbook_memory/{seed,rebuild,stats,
persist_sql,patterns}.
Bun staffing co-pilot (mcp-server/):
- /search, /match, /verify, /proof, /simulation/run, MCP tools all
forward use_playbook_memory:true and playbook_memory_k:25 to the
hybrid endpoint. Boost was previously dark across the entire app.
- /log no longer POSTs to /ingest/file — that endpoint REPLACES the
dataset's object list, so single-row CSV writes were wiping all prior
rows in successful_playbooks (sp_rows went 33→1 in one /log call).
/log now seeds playbook_memory with canonical short text and calls
/persist_sql to keep successful_playbooks_live in sync.
- /simulation/run cumulative end-of-week CSV write removed for the same
reason. Per-day per-contract /seed (added in this session) is the
accumulating feedback path now.
- search.html addWorkerInsight renders a green "Endorsed · N playbooks"
chip with playbook citations when boost > 0.
Internal Dioxus UI (crates/ui/):
- Dashboard phase list rewritten through Phase 19 (was stuck at "Phase
16: File Watcher" / "Phase 17: DB Connector" — both wrong).
- Removed fabricated "27ms" stat label.
- Ask tab examples + SQL default replaced with real staffing prompts
against candidates/clients/job_orders (was referencing nonexistent
employees/products/events).
- New Playbook tab exposes /vectors/playbook_memory/{stats,rebuild} and
side-by-side hybrid search (boost OFF vs ON) with citations.
Tests (tests/multi-agent/):
- run_e2e_rated.ts: parallel two-agent (mistral + qwen2.5) build phase
+ verifier rating (geo, auth, persist, boost, speed → /10).
- network_proving.ts: continuous build → verify → repeat with
staffing-recruiter profile hot-swap; geo-discrimination check.
- chain_of_custody.ts: single recruiter operation traced through every
layer (Bun /search, direct /vectors/hybrid parity, /log, SQL,
playbook_memory growth, profile activation, post-op boost lift).