Three PRD gaps closed in one coherent batch — all were cosmetic or
scaffold-shaped, now real files:
Phase 39 (PRD:57):
+ config/providers.toml — provider registry (name/base_url/auth/
default_model) for ollama, ollama_cloud, openrouter. Commented
stubs for gemini + claude pending adapter work. Secrets stay in
/etc/lakehouse/secrets.toml or env, NEVER inline.
Phase 41 (PRD:115):
+ crates/vectord/src/activation.rs — ActivationTracker with the
PRD-named single-flight guard ("refuse new activation if one is
pending/running"). Per-profile granularity — activating A doesn't
block B. 5 tests cover the full state machine. Handler body stays
in service.rs for now; tracker usage integration is a follow-up.
Phase 41 (PRD:113):
+ crates/shared/src/profiles/ with 4 submodules:
* execution.rs — `pub use crate::types::ModelProfile as
ExecutionProfile` (backward-compat rename per PRD)
* retrieval.rs — top_k, rerank_top_k, freshness cutoff,
playbook boost, sensitivity-gate enforcement
* memory.rs — playbook boost ceiling, history cap, doc
staleness, auto-retire-on-failure
* observer.rs — failure cluster size, alert cooldown, ring
size, langfuse forwarding
All fields `#[serde(default)]` so existing ModelProfile files
load unchanged.
Still open from the same phases:
- Gemini + Claude provider adapters (Phase 40 — 100-200 LOC each)
- Full activate_profile handler extraction into activation.rs
(Phase 41 — module-structure refactor)
- Catalogd CRUD endpoints for retrieval/memory/observer profiles
(Phase 41 — exists at list level, no create/update/delete yet)
- truth/ repo-root directory for file-backed rules (Phase 42 —
TOML loader + schema)
- crates/validator crate (Phase 43 — full greenfield)
Workspace warnings still at 0. 5 new tests, all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three coupled fixes that together turned the Riverfront Steel scenario
from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing
concerns rather than linter advice.
MODEL SWAP
- Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking).
mistral's decoder emitted malformed JSON on complex SQL filters
regardless of prompt; J called it — stop using mistral.
- Reviewer: qwen2.5 → qwen3:latest (40K ctx)
- Applied to scenario.ts, orchestrator.ts, network_proving.ts,
run_e2e_rated.ts
CONTINUATION PRIMITIVE (agent.ts)
- generateContinuable(): empty-response → geometric backoff retry;
truncated-JSON → continue from partial as scratchpad; bounded by
budget cap + max_continuations. No more "bump max_tokens until it
stops truncating" tourniquet.
- generateTreeSplit(): map-reduce for oversized input corpora with
running scratchpad digest, reduce pass for final synthesis.
- Empty text no longer throws — it's a signal to continuable that
thinking ate the budget.
think:false FOR HOT PATH
- qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON
emission. For executor/reviewer/draft: think:false. For T3/T4/T5
overseers: thinking stays on (that's the point).
- Sidecar generate endpoint accepts `think` bool, passes through to
Ollama's /api/generate.
VERIFIED OUTCOMES
Riverfront Steel 2026-04-21, qwen3.5+continuable+think:false:
08:00 baseline_fill 3/3 4 turns
10:30 recurring 2/2 3 turns (1 playbook citation)
12:15 expansion 0/5 drift-aborted (5-fill orchestration
problem, separate work)
14:00 emergency 4/4 3 turns (1 citation)
15:45 misplacement 1/1 3 turns
→ T3 caught Patrick Ross double-booking across events
→ T3 flagged forklift cert drift on the event that failed
→ Cross-day lesson proposed "maintain buffer of ≥3 emergency
candidates, pre-fetch certs for expansion, booking system
cross-check" — real staffing advice, not generic linter output
PRD PHASE 21 rewritten to reflect the actual primitive shape (two-
call map-reduce with scratchpad glue) instead of the tourniquet
approach originally documented. Rust port queued for next sprint.
scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits
tests/multi-agent/playbooks/ab_scorecard.json.
PRD: add Phase 20 (model matrix, wired) and Phase 21 (context stability,
partial). Phase 21 exists because LLM Team hit this exact wall — running
multi-model ranking on large context silently truncated, rankings
degraded, no pipeline caught it. The stable answer: every agent call
goes through a budget check against the model's declared context_window
minus safety_margin, with a declared overflow_policy when the check
fails.
config/models.json:
- context_window + context_budget per tier
- overflow_policies block: summarize_oldest_tool_results_via_t3,
chunk_lessons_via_cosine_topk, two_pass_map_reduce,
escalate_to_kimi_k2_1t_or_split_decision
- chunking_cache spec (data/_chunk_cache/, corpus-hash keyed)
agent.ts:
- estimateTokens() chars/4 biased safe ~15%
- CONTEXT_WINDOWS table (fallback; prod reads models.json)
- assertContextBudget() — throws on overflow with exact numbers, can
bypass with bypass_budget:true for callers with their own policy
- Wired into generate() and generateCloud() so EVERY call is checked
scenario.ts:
- T3 lesson archive to data/_playbook_lessons/*.json (the old
/vectors/playbook_memory/seed path was silently failing with HTTP 400
because it requires 'fill: Role xN in City, ST' operation shape)
- loadPriorLessons() at scenario start — filters by city/state match,
date-sorted, takes top-3
- prior_lessons.json archived per-run (honest signal for A/B)
- guidanceFor() injects up to 2 prior lessons (≤500 chars each) into
the executor's per-event context
- Retrospective shows explicit "Prior lessons loaded: N" line
Verified: mistral correctly rejects a 150K-char prompt (7532 tokens
over), gpt-oss:120b accepts it with 90K headroom. The enforcement is
in-band on every call now, not an afterthought.
Full chunking service (Rust) remains deferred to the sprint this feeds:
crates/aibridge/src/budget.rs + chunk.rs + storaged/chunk_cache.rs
config/models.json is the authoritative catalog. Hot path (T1/T2) stays
local; cloud is consulted only for overview (T3), strategic (T4), and
gatekeeper (T5) calls. J named qwen3.5 + newer models (minimax-m2.7,
glm-5, qwen3-next) specifically — all mapped with real reachable IDs
verified against ollama.com/api/tags.
Tier shape:
- t1_hot mistral + qwen2.5 local — 50-200 calls/scenario
- t2_review qwen2.5 + qwen3 local — 5-14 calls/event
- t3_overview gpt-oss:120b cloud — 1-3 calls/scenario
- t4_strategic qwen3.5:397b + glm-4.7 — 1-10 calls/day
- t5_gatekeeper kimi-k2-thinking — 1-5 calls/day, audit-logged
Rate budgets are declared in-config — Ollama Cloud paid tier is generous
but we cap overview/strategic/gatekeeper so no single rogue scenario can
blow the day's quota.
Experimental rotation list wired but disabled by default. When enabled,
T4 randomly routes 10% of calls to a rotating minimax/GLM/qwen-next/
deepseek/nemotron/cogito/mistral-large candidate, logs comparisons, and
auto-promotes after 3 rotations of wins.
Playbook versioning SPEC embedded under `playbook_versioning` key: every
seed gets version + parent_id + retired_at + architecture_snapshot, so
when a schema migration breaks a playbook we can pinpoint which change
retired it. Implementation flagged for next sprint (touches gateway +
catalogd + mcp-server) — not wired here.
- scenario.ts now loads config/models.json at init, env vars still override
- mcp-server exposes /models/matrix read-only so UI can render it