lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
J's framing (2026-04-26): "Modes are how you ask ONCE and get BETTER
information — they mold the data, hyperfocus the prompt on this
codebase's needs, so the model gets it right the first time without
the cascading retry ladder."
Built the first concrete native enrichment runner (codereview_lakehouse)
that composes every context primitive the gateway exposes:
1. Focus file content (read from disk OR caller-supplied)
2. Pathway memory bug_fingerprints for this file area (ADR-021
preamble — "📚 BUGS PREVIOUSLY FOUND IN THIS FILE AREA")
3. Matrix corpus search via the task_class's matrix_corpus
4. Relevance filter (observer /relevance) drops adjacency pollution
5. Assembles ONE precise prompt with system framing
6. Single call to /v1/chat with the recommended model
POST /v1/mode/execute dispatches. Native mode → runs the composer.
Non-native mode → 501 NOT_IMPLEMENTED with hint (proxy to LLM Team
/api/run is queued).
Provider hint logic auto-routes by model name shape:
- vendor/model[:tag] → openrouter
- kimi-*/qwen3-coder*/deepseek-v*/mistral-large* → ollama_cloud
- everything else → local ollama
Live test against crates/queryd/src/delta.rs (10593 bytes, 10
historical bug fingerprints, 2 matrix chunks dropped by relevance):
- enriched_chars: 12876
- response_chars: 16346 (14 findings with confidence percentages)
- Model literally cited the pathway memory preamble in finding #7
- One call to free-tier gpt-oss:120b produced what previously
required the 9-rung escalation ladder
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
HANDOVER §queued (2026-04-25): "Mode router — port LLM Team multi-model
patterns. Pick the right TOOL/MODE for each task class via the matrix,
not cascade through models."
Two-stage architecture:
1. Decision (POST /v1/mode) — pure recommendation, no execution.
Returns {mode, model, decision: {source, fallbacks, matrix_corpus,
notes}} so callers see WHY this mode was picked.
2. Execution (future POST /v1/mode/execute) — proxy to LLM Team
/api/run for modes not yet ported to native Rust runners. Not
wired in this phase.
Splitting decision from execution lets us A/B-test the routing logic
without committing to running every recommendation. The decision
function is pure enough for exhaustive unit tests (3 added).
config/modes.toml — initial map for 5 task_classes (scrum_review,
contract_analysis, staffing_inference, fact_extract, doc_drift_check)
+ a default. matrix_corpus per task is reserved for the future
matrix-informed routing pass.
VALID_MODES list (24 modes) is kept in sync manually with LLM Team's
/api/run handler at /root/llm_team_ui.py:10581. Adding a mode here
without adding it upstream returns 400 from a future proxy.
GET /v1/mode/list — operator introspection so a UI can render the
registry table without re-parsing TOML.
Live-tested: 5 task classes match, unknown classes fall through to
default, force_mode override works + validates, bogus modes return
400 with the valid_modes list.
Updates reference_llm_team_modes.md memory — earlier note claiming
"only extract is registered" was wrong (all 25 are registered).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three PRD gaps closed in one coherent batch — all were cosmetic or
scaffold-shaped, now real files:
Phase 39 (PRD:57):
+ config/providers.toml — provider registry (name/base_url/auth/
default_model) for ollama, ollama_cloud, openrouter. Commented
stubs for gemini + claude pending adapter work. Secrets stay in
/etc/lakehouse/secrets.toml or env, NEVER inline.
Phase 41 (PRD:115):
+ crates/vectord/src/activation.rs — ActivationTracker with the
PRD-named single-flight guard ("refuse new activation if one is
pending/running"). Per-profile granularity — activating A doesn't
block B. 5 tests cover the full state machine. Handler body stays
in service.rs for now; tracker usage integration is a follow-up.
Phase 41 (PRD:113):
+ crates/shared/src/profiles/ with 4 submodules:
* execution.rs — `pub use crate::types::ModelProfile as
ExecutionProfile` (backward-compat rename per PRD)
* retrieval.rs — top_k, rerank_top_k, freshness cutoff,
playbook boost, sensitivity-gate enforcement
* memory.rs — playbook boost ceiling, history cap, doc
staleness, auto-retire-on-failure
* observer.rs — failure cluster size, alert cooldown, ring
size, langfuse forwarding
All fields `#[serde(default)]` so existing ModelProfile files
load unchanged.
Still open from the same phases:
- Gemini + Claude provider adapters (Phase 40 — 100-200 LOC each)
- Full activate_profile handler extraction into activation.rs
(Phase 41 — module-structure refactor)
- Catalogd CRUD endpoints for retrieval/memory/observer profiles
(Phase 41 — exists at list level, no create/update/delete yet)
- truth/ repo-root directory for file-backed rules (Phase 42 —
TOML loader + schema)
- crates/validator crate (Phase 43 — full greenfield)
Workspace warnings still at 0. 5 new tests, all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three coupled fixes that together turned the Riverfront Steel scenario
from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing
concerns rather than linter advice.
MODEL SWAP
- Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking).
mistral's decoder emitted malformed JSON on complex SQL filters
regardless of prompt; J called it — stop using mistral.
- Reviewer: qwen2.5 → qwen3:latest (40K ctx)
- Applied to scenario.ts, orchestrator.ts, network_proving.ts,
run_e2e_rated.ts
CONTINUATION PRIMITIVE (agent.ts)
- generateContinuable(): empty-response → geometric backoff retry;
truncated-JSON → continue from partial as scratchpad; bounded by
budget cap + max_continuations. No more "bump max_tokens until it
stops truncating" tourniquet.
- generateTreeSplit(): map-reduce for oversized input corpora with
running scratchpad digest, reduce pass for final synthesis.
- Empty text no longer throws — it's a signal to continuable that
thinking ate the budget.
think:false FOR HOT PATH
- qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON
emission. For executor/reviewer/draft: think:false. For T3/T4/T5
overseers: thinking stays on (that's the point).
- Sidecar generate endpoint accepts `think` bool, passes through to
Ollama's /api/generate.
VERIFIED OUTCOMES
Riverfront Steel 2026-04-21, qwen3.5+continuable+think:false:
08:00 baseline_fill 3/3 4 turns
10:30 recurring 2/2 3 turns (1 playbook citation)
12:15 expansion 0/5 drift-aborted (5-fill orchestration
problem, separate work)
14:00 emergency 4/4 3 turns (1 citation)
15:45 misplacement 1/1 3 turns
→ T3 caught Patrick Ross double-booking across events
→ T3 flagged forklift cert drift on the event that failed
→ Cross-day lesson proposed "maintain buffer of ≥3 emergency
candidates, pre-fetch certs for expansion, booking system
cross-check" — real staffing advice, not generic linter output
PRD PHASE 21 rewritten to reflect the actual primitive shape (two-
call map-reduce with scratchpad glue) instead of the tourniquet
approach originally documented. Rust port queued for next sprint.
scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits
tests/multi-agent/playbooks/ab_scorecard.json.
PRD: add Phase 20 (model matrix, wired) and Phase 21 (context stability,
partial). Phase 21 exists because LLM Team hit this exact wall — running
multi-model ranking on large context silently truncated, rankings
degraded, no pipeline caught it. The stable answer: every agent call
goes through a budget check against the model's declared context_window
minus safety_margin, with a declared overflow_policy when the check
fails.
config/models.json:
- context_window + context_budget per tier
- overflow_policies block: summarize_oldest_tool_results_via_t3,
chunk_lessons_via_cosine_topk, two_pass_map_reduce,
escalate_to_kimi_k2_1t_or_split_decision
- chunking_cache spec (data/_chunk_cache/, corpus-hash keyed)
agent.ts:
- estimateTokens() chars/4 biased safe ~15%
- CONTEXT_WINDOWS table (fallback; prod reads models.json)
- assertContextBudget() — throws on overflow with exact numbers, can
bypass with bypass_budget:true for callers with their own policy
- Wired into generate() and generateCloud() so EVERY call is checked
scenario.ts:
- T3 lesson archive to data/_playbook_lessons/*.json (the old
/vectors/playbook_memory/seed path was silently failing with HTTP 400
because it requires 'fill: Role xN in City, ST' operation shape)
- loadPriorLessons() at scenario start — filters by city/state match,
date-sorted, takes top-3
- prior_lessons.json archived per-run (honest signal for A/B)
- guidanceFor() injects up to 2 prior lessons (≤500 chars each) into
the executor's per-event context
- Retrospective shows explicit "Prior lessons loaded: N" line
Verified: mistral correctly rejects a 150K-char prompt (7532 tokens
over), gpt-oss:120b accepts it with 90K headroom. The enforcement is
in-band on every call now, not an afterthought.
Full chunking service (Rust) remains deferred to the sprint this feeds:
crates/aibridge/src/budget.rs + chunk.rs + storaged/chunk_cache.rs
config/models.json is the authoritative catalog. Hot path (T1/T2) stays
local; cloud is consulted only for overview (T3), strategic (T4), and
gatekeeper (T5) calls. J named qwen3.5 + newer models (minimax-m2.7,
glm-5, qwen3-next) specifically — all mapped with real reachable IDs
verified against ollama.com/api/tags.
Tier shape:
- t1_hot mistral + qwen2.5 local — 50-200 calls/scenario
- t2_review qwen2.5 + qwen3 local — 5-14 calls/event
- t3_overview gpt-oss:120b cloud — 1-3 calls/scenario
- t4_strategic qwen3.5:397b + glm-4.7 — 1-10 calls/day
- t5_gatekeeper kimi-k2-thinking — 1-5 calls/day, audit-logged
Rate budgets are declared in-config — Ollama Cloud paid tier is generous
but we cap overview/strategic/gatekeeper so no single rogue scenario can
blow the day's quota.
Experimental rotation list wired but disabled by default. When enabled,
T4 randomly routes 10% of calls to a rotating minimax/GLM/qwen-next/
deepseek/nemotron/cogito/mistral-large candidate, logs comparisons, and
auto-promotes after 3 rotations of wins.
Playbook versioning SPEC embedded under `playbook_versioning` key: every
seed gets version + parent_id + retired_at + architecture_snapshot, so
when a schema migration breaks a playbook we can pinpoint which change
retired it. Implementation flagged for next sprint (touches gateway +
catalogd + mcp-server) — not wired here.
- scenario.ts now loads config/models.json at init, env vars still override
- mcp-server exposes /models/matrix read-only so UI can render it