PRD: add Phase 20 (model matrix, wired) and Phase 21 (context stability, partial).

Phase 21 exists because LLM Team hit this exact wall — running multi-model ranking on large context silently truncated, rankings degraded, and no pipeline caught it. The stable answer: every agent call goes through a budget check against the model's declared context_window minus safety_margin, with a declared overflow_policy when the check fails.

config/models.json:
- context_window + context_budget per tier
- overflow_policies block: summarize_oldest_tool_results_via_t3, chunk_lessons_via_cosine_topk, two_pass_map_reduce, escalate_to_kimi_k2_1t_or_split_decision
- chunking_cache spec (data/_chunk_cache/, corpus-hash keyed)

agent.ts:
- estimateTokens(): chars/4, biased safe by ~15%
- CONTEXT_WINDOWS table (fallback; prod reads models.json)
- assertContextBudget() — throws on overflow with exact numbers; callers with their own policy can bypass via bypass_budget:true
- Wired into generate() and generateCloud() so EVERY call is checked

scenario.ts:
- T3 lesson archive to data/_playbook_lessons/*.json (the old /vectors/playbook_memory/seed path was silently failing with HTTP 400 because it requires the 'fill: Role xN in City, ST' operation shape)
- loadPriorLessons() at scenario start — filters by city/state match, date-sorted, takes top-3
- prior_lessons.json archived per-run (honest signal for A/B)
- guidanceFor() injects up to 2 prior lessons (≤500 chars each) into the executor's per-event context
- Retrospective shows an explicit "Prior lessons loaded: N" line

Verified: mistral correctly rejects a 150K-char prompt (7532 tokens over); gpt-oss:120b accepts it with 90K headroom. Enforcement is now in-band on every call, not an afterthought. The full chunking service (Rust) remains deferred to the sprint this feeds: crates/aibridge/src/budget.rs + chunk.rs + storaged/chunk_cache.rs
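The agent.ts helpers above can be sketched roughly as follows. This is a minimal illustration assuming the names and behavior described in this PRD (estimateTokens, CONTEXT_WINDOWS, assertContextBudget, bypass_budget); the real signatures, error type, and table contents may differ.

```typescript
// Sketch of the budget check described above. ContextBudgetError and the
// option names are illustrative, not the confirmed agent.ts API.

// chars/4 with a ~15% safety bias: overestimate rather than underestimate.
function estimateTokens(text: string): number {
  return Math.ceil((text.length / 4) * 1.15);
}

// Fallback table only — production reads config/models.json instead.
const CONTEXT_WINDOWS: Record<string, number> = {
  "mistral:latest": 32768,
  "qwen2.5:latest": 32768,
  "gpt-oss:120b": 131072,
};

class ContextBudgetError extends Error {}

function assertContextBudget(
  model: string,
  prompt: string,
  opts: { safetyMargin?: number; bypass_budget?: boolean } = {},
): void {
  if (opts.bypass_budget) return; // caller enforces its own overflow policy
  const window = CONTEXT_WINDOWS[model];
  if (window === undefined) return; // unknown model: skip rather than guess
  const margin = opts.safetyMargin ?? 2000;
  const budget = window - margin;
  const est = estimateTokens(prompt);
  if (est > budget) {
    // Fail loudly with exact numbers so overflow is never silent.
    throw new ContextBudgetError(
      `${model}: estimated ${est} tokens exceeds budget ${budget} ` +
        `(window ${window} - margin ${margin}); over by ${est - budget}`,
    );
  }
}
```

Calling this at the top of generate() and generateCloud() is what makes the check in-band: a 150K-char prompt against a 32K-window model throws before the request is ever sent.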
{
  "_description": "Lakehouse model matrix — authoritative routing for all agent tiers. Local models do the heavy lifting; cloud models are consulted sparingly for overview, strategic, and gatekeeper decisions. Read by tests/multi-agent/scenario.ts and mcp-server/index.ts.",
  "version": 1,
  "updated": "2026-04-21",
  "providers": {
    "ollama_local": {
      "base_url": "http://localhost:11434",
      "key_env": null
    },
    "ollama_cloud": {
      "base_url": "https://ollama.com",
      "key_env": "OLLAMA_CLOUD_KEY",
      "key_source": "/root/llm_team_config.json → providers.ollama_cloud.api_key",
      "rate_budget": {
        "calls_per_hour": 200,
        "calls_per_day": 2000,
        "notes": "Paid tier — generous. Policy: keep overview calls ≤ 3/scenario, strategic ≤ 10/day, gatekeeper ≤ 5/day."
      }
    }
  },
  "tiers": {
    "t1_hot": {
      "purpose": "Per tool call — SQL generation, hybrid_search, sql(). Runs 50-200 times per scenario. Latency-sensitive.",
      "kind": "local_fast",
      "primary": { "model": "mistral:latest", "provider": "ollama_local", "context_window": 32768 },
      "fallback": { "model": "qwen2.5:latest", "provider": "ollama_local", "context_window": 32768 },
      "max_tokens": 800,
      "temperature": 0.3,
      "never_route_cloud": true,
      "context_budget": {
        "system_prompt_cap": 4000,
        "prior_context_cap": 6000,
        "tool_results_cap": 8000,
        "safety_margin": 2000,
        "overflow_policy": "summarize_oldest_tool_results_via_t3"
      },
      "rationale": "Mistral produces valid JSON reliably. Qwen2.5 is the consensus reviewer. Known flakiness on 5-fill + misplacement events — do NOT mask by upgrading; route to T3 for post-hoc review instead."
    },
    "t2_review": {
      "purpose": "Per step consensus — executor ↔ reviewer loop critique. 5-14 calls per event.",
      "kind": "local_balanced",
      "primary": { "model": "qwen2.5:latest", "provider": "ollama_local", "context_window": 32768 },
      "fallback": { "model": "qwen3:latest", "provider": "ollama_local", "context_window": 40960 },
      "max_tokens": 600,
      "temperature": 0.3,
      "never_route_cloud": true,
      "context_budget": {
        "system_prompt_cap": 2000,
        "recent_turns_cap": 4000,
        "safety_margin": 1000
      },
      "rationale": "Reviewer only needs to detect schema violations and drift — a 7B model is sufficient."
    },
    "t3_overview": {
      "purpose": "Mid-day checkpoint after every misplacement + every Nth event, and cross-day lesson. 1-3 calls per scenario.",
      "kind": "thinking_cloud",
      "primary": { "model": "gpt-oss:120b", "provider": "ollama_cloud", "context_window": 131072 },
      "local_fallback": { "model": "gpt-oss:20b", "provider": "ollama_local", "context_window": 131072 },
      "max_tokens": 900,
      "temperature": 0.2,
      "cloud_budget_per_scenario": 5,
      "env_flag": "LH_OVERVIEW_CLOUD=1",
      "context_budget": {
        "event_digest_cap": 30000,
        "checkpoint_cap": 8000,
        "lesson_corpus_cap": 40000,
        "safety_margin": 8000,
        "overflow_policy": "chunk_lessons_via_cosine_topk"
      },
      "rationale": "Same prompt family as local 20b (gpt-oss series) — prompts port directly. 120b is faster via cloud than 20b local in practice, and lessons are noticeably more specific."
    },
    "t4_strategic": {
      "purpose": "Daily playbook board re-ranking, weekly gap audit, pattern discovery across accumulated playbooks. 1-10 calls per day.",
      "kind": "thinking_cloud_large",
      "primary": { "model": "qwen3.5:397b", "provider": "ollama_cloud", "context_window": 131072 },
      "fallback": { "model": "glm-4.7", "provider": "ollama_cloud", "context_window": 131072 },
      "local_fallback": { "model": "gpt-oss:20b", "provider": "ollama_local", "context_window": 131072 },
      "max_tokens": 2000,
      "temperature": 0.2,
      "cloud_budget_per_day": 10,
      "context_budget": {
        "playbook_corpus_cap": 80000,
        "pattern_history_cap": 20000,
        "safety_margin": 16000,
        "overflow_policy": "two_pass_map_reduce"
      },
      "rationale": "J named qwen3.5 specifically. GLM-4.7 is a promising alternate for the debate phase. Runs after all scenarios complete for the day."
    },
    "t5_gatekeeper": {
      "purpose": "MUST route here: architecture changes, new client onboarding, schema migrations, playbook retirements, index rebuilds, autotune config changes.",
      "kind": "thinking_cloud_deepest",
      "primary": { "model": "kimi-k2-thinking", "provider": "ollama_cloud", "context_window": 200000 },
      "fallback": { "model": "deepseek-v3.1:671b", "provider": "ollama_cloud", "context_window": 131072 },
      "secondary_fallback": { "model": "qwen3.5:397b", "provider": "ollama_cloud", "context_window": 131072 },
      "local_fallback": { "model": "gpt-oss:20b", "provider": "ollama_local", "context_window": 131072 },
      "max_tokens": 4000,
      "temperature": 0.1,
      "cloud_budget_per_day": 5,
      "audit_log": true,
      "context_budget": {
        "decision_doc_cap": 50000,
        "evidence_bundle_cap": 100000,
        "prior_gatekeeper_decisions_cap": 20000,
        "safety_margin": 20000,
        "overflow_policy": "escalate_to_kimi_k2_1t_or_split_decision"
      },
      "rationale": "Highest-stakes decisions — reasoning depth matters more than latency. Audit log so J can always see what the gatekeeper was asked and what it answered. No human approval required today; escalate later if mis-decisions show up."
    }
  },
  "context_management": {
    "_description": "Rule zero: NEVER call a model with more tokens than its context_window minus safety_margin. Every call goes through the budget checker first. If over budget → chunk, summarize, or escalate. This is the stability floor.",
    "token_estimator": {
      "method": "chars_div_4",
      "note": "Rough, biased safe by ~15%. For production, swap to tiktoken or the provider's tokenizer endpoint."
    },
    "overflow_policies": {
      "summarize_oldest_tool_results_via_t3": {
        "when": "T1 conversation history + tool results exceed context_budget.tool_results_cap",
        "how": "Send oldest N tool results to T3 with prompt 'summarize these into 500 tokens that preserve what the executor needs to know'; replace them with the summary in the running conversation."
      },
      "chunk_lessons_via_cosine_topk": {
        "when": "lesson corpus in data/_playbook_lessons/*.json exceeds lesson_corpus_cap",
        "how": "Embed the current scenario spec, cosine-rank all lessons, take top-K until budget exhausted. Fall back to date-sorted if embeddings unavailable."
      },
      "two_pass_map_reduce": {
        "when": "T4 playbook corpus exceeds playbook_corpus_cap",
        "how": "Pass 1: chunk playbooks into ≤30K token shards, run primary model on each shard to emit a 'shard summary'. Pass 2: feed all summaries to primary model for global synthesis. Logged as two audit entries."
      },
      "escalate_to_kimi_k2_1t_or_split_decision": {
        "when": "T5 decision evidence exceeds decision_doc_cap + evidence_bundle_cap",
        "how": "Prefer kimi-k2:1t, which has 1M context. If still over, split the decision into sub-decisions (e.g. 'retire playbooks by city' instead of 'retire playbooks globally') and loop."
      }
    },
    "chunking_cache": {
      "_description": "Precomputed shards of the playbook corpus, indexed by (corpus_version, shard_id). Avoids re-chunking on every T4 run.",
      "location": "data/_chunk_cache/",
      "invalidation": "Key includes corpus hash. When playbook_memory changes, the hash changes, the cache misses, and chunks regenerate.",
      "implementation_status": "SPEC — next sprint."
    },
    "implementation_status": "context_window + context_budget fields WIRED in config. estimateTokens() + assertContextBudget() WIRED in agent.ts; every generate() and generateCloud() call is routed through the check. Next: emit a metric when an overflow policy fires. Full chunking service (Rust) remains deferred to next sprint."
  },
  "experimental_rotation": {
    "enabled": false,
    "purpose": "Sample newer models on a schedule to collect comparison data without rate-limit risk.",
    "candidates": [
      { "model": "minimax-m2.7", "notes": "Newer minimax; unknown output stability" },
      { "model": "glm-5", "notes": "GLM next-gen; larger context" },
      { "model": "glm-5.1", "notes": "Incremental on GLM-5" },
      { "model": "qwen3-next:80b", "notes": "Qwen's experimental successor; smaller than 3.5" },
      { "model": "qwen3-coder-next", "notes": "Coder-optimized — good for SQL gen T1 experiments" },
      { "model": "deepseek-v3.2", "notes": "Smaller deepseek; reasoning/coding" },
      { "model": "nemotron-3-super", "notes": "NVIDIA 230B; general" },
      { "model": "cogito-2.1:671b", "notes": "671B general" },
      { "model": "mistral-large-3:675b", "notes": "Mistral's flagship; good T3 candidate" }
    ],
    "rotation": "weekly",
    "sample_rate": 0.1,
    "apply_to_tier": "t4_strategic",
    "notes": "When enabled, T4 routes 10% of calls to a rotating experimental model. Log comparison in /data/_model_eval/ — if the experimental consistently beats primary across 3 rotations, promote it to primary."
  },
  "playbook_versioning": {
    "enabled": true,
    "purpose": "A playbook can work, then break when architecture changes. Versioning lets us pin which change retired which playbook.",
    "dataset": "playbook_memory",
    "schema_additions": {
      "version": "integer — auto-increment per operation",
      "parent_id": "string — previous version entry_id for same operation (null for v1)",
      "retired_at": "timestamp — set when success_rate drops or architecture changes",
      "retirement_reason": "string — e.g. 'schema_migration:workers_500k 2026-05-03'",
      "architecture_snapshot": "object — crate versions, index name, schema fingerprint at seed time"
    },
    "retire_triggers": [
      "success_rate < 0.3 over last 20 citations",
      "schema_fingerprint mismatch detected at retrieval time",
      "architecture change event emitted by ingestd/vectord",
      "T5 gatekeeper explicitly retires via /vectors/playbook_memory/retire"
    ],
    "read_back_policy": "Retrieval returns only non-retired versions. History endpoint /vectors/playbook_memory/history/{operation} returns the full chain.",
    "ui_surface": "mcp-server to render a diff view: side-by-side of playbook versions with a timeline of what changed and when.",
    "implementation_status": "SPEC — not yet wired. Target: next sprint. Touches gateway + catalogd + mcp-server."
  },
  "matrix_index_hybrid_search_note": "Phase 22 candidate: elevate the hybrid_search T1 tool to consult T3 when a pool returns <3 matches OR when the same (role, city) combo has failed N times in 24h. Consult result is a reformulated sql_filter the executor retries with. Keeps T1 fast on the happy path, escalates to T3 only on low-recall signals."
}
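The chunk_lessons_via_cosine_topk policy above amounts to a greedy top-K selection under a token cap, with a date-sorted fallback. A hedged sketch: Lesson, selectLessons, and the inline chars/4 estimator are hypothetical names for illustration, and embeddings are assumed to be plain number arrays from whatever embedder the pipeline uses; the real scenario.ts code may differ.

```typescript
// Illustrative implementation of chunk_lessons_via_cosine_topk:
// rank archived lessons by cosine similarity to the scenario spec,
// then greedily take lessons until the token cap is exhausted.

interface Lesson {
  text: string;
  date: string; // ISO date, used for the fallback sort
  embedding?: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Same chars/4 estimator with ~15% safe bias as the budget checker.
const estimateTokens = (s: string) => Math.ceil((s.length / 4) * 1.15);

function selectLessons(
  lessons: Lesson[],
  specEmbedding: number[] | null, // null → embeddings unavailable
  capTokens: number, // e.g. lesson_corpus_cap = 40000 for t3_overview
): Lesson[] {
  // Cosine-rank against the scenario spec; fall back to date-sorted
  // (newest first) when embeddings are unavailable.
  const ranked = specEmbedding
    ? [...lessons].sort(
        (a, b) =>
          cosine(b.embedding ?? [], specEmbedding) -
          cosine(a.embedding ?? [], specEmbedding),
      )
    : [...lessons].sort((a, b) => b.date.localeCompare(a.date));

  const picked: Lesson[] = [];
  let used = 0;
  for (const l of ranked) {
    const cost = estimateTokens(l.text);
    if (used + cost > capTokens) break; // budget exhausted: stop
    picked.push(l);
    used += cost;
  }
  return picked;
}
```

The greedy cutoff keeps the selection strictly inside the tier's lesson_corpus_cap, so the T3 call that consumes these lessons never trips its own assertContextBudget check.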
|