diff --git a/mcp-server/spec.html b/mcp-server/spec.html
index 9b81834..c54646c 100644
--- a/mcp-server/spec.html
+++ b/mcp-server/spec.html
@@ -123,10 +123,11 @@ table.plain tr:hover td{background:#0d1117}
devop.live/lakehouse. Routes: /search /match /log /log_failure /clients/:c/blacklist /intelligence/* /memory/query /models/matrix /system/summary. Observer sibling at observer.ts with an HTTP listener on :3800 for scenario event ingest. Proxies to the Rust gateway for heavy work.

agent.ts (prompts, continuation + tree-split primitives, cloud routing), orchestrator.ts (single task), scenario.ts (contracts + staffer + tool_level), kb.ts (KB indexing, competence scoring, neighbor retrieval), normalize.ts (input normalizer — structured / regex / LLM), memory_query.ts (unified /memory/query), gen_scenarios.ts + gen_staffer_demo.ts (corpus generators), run_e2e_rated.ts (parallel pairs + rating), chain_of_custody.ts (layer-by-layer audit). Unit tests colocated (kb.test.ts, normalize.test.ts).

models.json — authoritative 5-tier model matrix (T1 hot local / T2 review local / T3 overview cloud / T4 strategic / T5 gatekeeper). Per-tier context_window + context_budget + overflow_policy. Read at runtime by scenario.ts; hot-swap friendly.

PRD.md, PHASES.md, DECISIONS.md (20 ADRs). Every significant architectural choice has an ADR with the alternatives that were rejected and why.

_playbook_memory/state.json (now with retirement fields — Phase 25), catalog manifests. Plus four learning-loop directories: _kb/ (signatures, outcomes, recommendations, error_corrections, config_snapshots, staffers), _playbook_lessons/ (T3 cross-day lessons archived per run), _observer/ops.jsonl (append-only journal, durable scenario outcome stream), _chunk_cache/ (spec'd for the Phase 21 Rust port).
Rebuildable from repo + this dir alone.

POST /vectors/profile/<id>/search rejects out-of-scope queries with 403 + a list of allowed bindings.

keep_alive=0; only one model in VRAM at a time.

Five tiers declared in config/models.json. Each call site picks the tier appropriate to its purpose: hot-path JSON emitters get fast local models, overview/strategic/gatekeeper decisions get thinking models on cloud. Every tier carries context_window, context_budget, and overflow_policy.
| Tier | Purpose | Primary model | Frequency |
|---|---|---|---|
| T1 hot | Per tool call — SQL gen, hybrid_search, propose_done | qwen3.5:latest local, think:false | 50-200/scenario |
| T2 review | Per-step consensus, drift flagging | qwen3:latest local, think:false | 5-14/event |
| T3 overview | Mid-day checkpoints + cross-day lesson distill | gpt-oss:120b Ollama Cloud, thinking on | 1-3/scenario |
| T4 strategic | Pattern re-ranking, weekly gap audit | qwen3.5:397b cloud | 1-10/day |
| T5 gatekeeper | Schema migrations, autotune config changes | kimi-k2-thinking cloud, audit-logged | 1-5/day |
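The per-tier budget check can be sketched as below. The matrix shape (context_window, context_budget, overflow_policy) follows the text; the specific budget numbers and the `assertContextBudget` signature are illustrative assumptions, not the values in config/models.json.

```typescript
// Sketch of tier selection against the 5-tier matrix. Budget numbers are
// placeholders; only the field names come from the spec above.
type Tier = "T1" | "T2" | "T3" | "T4" | "T5";

interface TierConfig {
  model: string;
  context_window: number;   // hard model limit (tokens)
  context_budget: number;   // soft cap callers must respect
  overflow_policy: "continue" | "tree_split" | "reject";
}

const matrix: Record<Tier, TierConfig> = {
  T1: { model: "qwen3.5:latest",   context_window: 32768,  context_budget: 8000,  overflow_policy: "continue" },
  T2: { model: "qwen3:latest",     context_window: 32768,  context_budget: 8000,  overflow_policy: "continue" },
  T3: { model: "gpt-oss:120b",     context_window: 131072, context_budget: 60000, overflow_policy: "tree_split" },
  T4: { model: "qwen3.5:397b",     context_window: 131072, context_budget: 60000, overflow_policy: "tree_split" },
  T5: { model: "kimi-k2-thinking", context_window: 131072, context_budget: 60000, overflow_policy: "reject" },
};

// Fails fast instead of silently truncating, in the spirit of assertContextBudget().
function assertContextBudget(tier: Tier, promptTokens: number): TierConfig {
  const cfg = matrix[tier];
  if (promptTokens > cfg.context_budget) {
    throw new Error(`${tier}: prompt ${promptTokens} tokens exceeds budget ${cfg.context_budget}`);
  }
  return cfg;
}

console.log(assertContextBudget("T1", 4000).model); // prints "qwen3.5:latest"
```

Keeping the matrix in a single config file means a hot swap is a JSON edit, not a code change.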
Key mechanical finding (2026-04-21): qwen3.5 and qwen3 are thinking models — they burn ~650 tokens of hidden reasoning before emitting the visible response. For hot-path JSON emitters this meant 400-token budgets returned empty strings. Fix: think: false plumbed through sidecar's /generate endpoint; hot path disables thinking (structure matters more than reasoning depth), overseer tiers keep it on. Mistral was dropped entirely after a 0/14 fill rate on complex scenarios (decoder-level malformed-JSON bug, not a prompt issue).
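The per-tier thinking toggle can be sketched as a request builder. The payload field names (`think`, `max_tokens`) are assumptions about the sidecar's /generate body; the 400/650-token figures come from the finding above.

```typescript
// Hypothetical helper: hot-path tiers (T1/T2) disable thinking, overseer
// tiers (T3-T5) keep it on. Payload shape is an assumption.
function buildGenerateRequest(
  tier: "T1" | "T2" | "T3" | "T4" | "T5",
  model: string,
  prompt: string,
) {
  const hotPath = tier === "T1" || tier === "T2";
  return {
    model,
    prompt,
    // With ~650 hidden reasoning tokens, a 400-token budget returns empty
    // strings unless thinking is disabled on the hot path.
    think: !hotPath,
    max_tokens: hotPath ? 400 : 4096,
  };
}
```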
Continuation primitive (Phase 21): generateContinuable() handles output-overflow without max_tokens tourniquets — empty response → geometric backoff retry; truncated-JSON → continue with partial as scratchpad. generateTreeSplit() handles input-overflow via map-reduce with running scratchpad. Both respect assertContextBudget() so silent truncation can't happen.
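A minimal sketch of the continuation primitive, assuming a `generate()` callback that returns raw model text. The retry constants and the continuation-prompt wording are illustrative; the empty-response backoff and brace-balance + JSON.parse truncation check follow the description above.

```typescript
// True when every "{" has a matching "}" and the string is non-empty.
function bracesBalanced(s: string): boolean {
  let depth = 0;
  for (const ch of s) {
    if (ch === "{") depth++;
    else if (ch === "}") depth--;
  }
  return depth === 0 && s.trim().length > 0;
}

async function generateContinuable(
  generate: (prompt: string) => Promise<string>,
  prompt: string,
  maxContinuations = 3,
): Promise<string> {
  let out = "";
  let backoffMs = 250;
  for (let i = 0; i <= maxContinuations; i++) {
    const chunk = await generate(
      out ? `${prompt}\n\nPartial output so far (continue it):\n${out}` : prompt,
    );
    if (chunk.trim() === "") {
      // Empty response: geometric backoff retry.
      await new Promise((r) => setTimeout(r, backoffMs));
      backoffMs *= 2;
      continue;
    }
    out += chunk;
    // Truncated JSON: detect via brace balance + JSON.parse, then loop with
    // the partial as scratchpad; never return silently truncated output.
    if (bracesBalanced(out)) {
      try { JSON.parse(out); return out; } catch { /* keep continuing */ }
    }
  }
  throw new Error(`still truncated after ${maxContinuations} continuations`);
}
```

The key property is that the caller either gets parseable JSON or an explicit error, never a silently clipped string.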
Scenarios can be scoped to a specific coordinator (staffer: {id, name, tenure_months, role, tool_level}). tool_level controls which tiers are available:
- full — T1/T2 local, T3 cloud, cloud rescue on failure
- local — T1/T2/T3 all local (gpt-oss:20b as overseer)
- basic — kimi-k2.5 cloud executor + local reviewer + local T3, no rescue
- minimal — kimi-k2.5 cloud executor, no T3, no rescue. Playbook inheritance is the only signal.

Measured 2026-04-21 on a 36-run demo (4 staffers × 3 contracts × 3 rounds): James Park (mid, local tools) ranked first at 92.9% fill and 36.8 cites/run; Maria Chen (senior, full tools) second at 81.0%. Cloud T3 adds latency without measurable benefit on this workload. Alex Rivera (trainee, minimal) still hit 59.5% fill purely from playbook inheritance — proof the memory carries knowledge when the model is capable.
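The tool_level mapping above can be expressed as a lookup table. The record shape is illustrative; the model names and rescue semantics come from the text.

```typescript
// Sketch of tool_level → tier availability. Shape is an assumption;
// values mirror the mapping described in the spec.
type ToolLevel = "full" | "local" | "basic" | "minimal";

const tierAccess: Record<ToolLevel, { executor: string; t3: string | null; rescue: boolean }> = {
  full:    { executor: "local T1/T2",                   t3: "cloud",               rescue: true  },
  local:   { executor: "local T1/T2",                   t3: "local (gpt-oss:20b)", rescue: false },
  basic:   { executor: "kimi-k2.5 + local reviewer",    t3: "local",               rescue: false },
  // minimal: playbook inheritance is the only signal.
  minimal: { executor: "kimi-k2.5",                     t3: null,                  rescue: false },
};
```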
- vector_backend: lance flip — disk-resident IVF_PQ scales past the RAM line (ADR-019).
- keep_alive=0; Phase 17 profile activation unloads the prior model on swap.
- valid_until + schema_fingerprint fields + POST /vectors/playbook_memory/retire endpoint (manual or schema-drift triggered). Active vs retired split surfaced on GET /vectors/playbook_memory/status. Brute-force cosine still sub-ms at current size; Letta-style working-memory hot cache deferred until entry count crosses ~100K.
- embed calls fail fast; systemctl restart lakehouse-sidecar; vector search falls back to pre-computed embeddings.
- state.json / trial journals.
- catalog::dedupe reports DedupeReport with winner selection (non-null row_count first, then newest updated_at); POST /catalog/dedupe collapses duplicates idempotently.
- gpt-oss:120b sees SQL filters attempted, row counts, reviewer drift notes, and contract terms; returns structured {retry, new_city, new_state, new_role, new_count, rationale}.
- generateContinuable() detects truncation via brace-balance + JSON.parse; no silent truncation. Bounded by max_continuations.
- POST /vectors/playbook_memory/retire with the current fingerprint retires all mismatched entries in scope; the diagnostic log shows counts.
- postObserverEvent() uses a 2s AbortSignal.timeout; silent skip if :3800 is down. data/_observer/ops.jsonl is append-only so prior events survive.
- Workspace activity log + per-staffer filter on the event journal gives "what did Sarah do today" as a direct query. The foundation for shift-handoff reports.
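The observer post with its 2s timeout and silent skip can be sketched as follows. The /ingest path and event shape are assumptions; the 2s AbortSignal.timeout and fail-silent behavior come from the text.

```typescript
// Sketch of postObserverEvent(): never block a scenario on the observer.
async function postObserverEvent(event: Record<string, unknown>): Promise<void> {
  try {
    await fetch("http://127.0.0.1:3800/ingest", { // path is an assumption
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(event),
      signal: AbortSignal.timeout(2000), // 2s cap on the whole request
    });
  } catch {
    // Observer down or slow: skip silently. ops.jsonl on the observer side
    // is append-only, so nothing already recorded is lost.
  }
}
```

Fire-and-forget with a hard timeout keeps the scenario's hot path independent of the observer's availability.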
Each scenario run carries an explicit staffer: {id, name, tenure_months, role, tool_level}. The KB aggregates per-staffer stats (data/_kb/staffers.jsonl) that roll up into a single competence_score:
competence_score = 0.45·fill_rate + 0.20·turn_efficiency + 0.20·citation_density + 0.15·rescue_rate
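The roll-up above, directly as code. The only assumption is that all four inputs are normalized to [0, 1], so the weights (which sum to 1.0) keep the score in [0, 1] too.

```typescript
// competence_score per the formula above; inputs assumed normalized to [0, 1].
function competenceScore(s: {
  fill_rate: number;
  turn_efficiency: number;
  citation_density: number;
  rescue_rate: number;
}): number {
  return 0.45 * s.fill_rate
       + 0.20 * s.turn_efficiency
       + 0.20 * s.citation_density
       + 0.15 * s.rescue_rate;
}
```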
When any query runs kb.findNeighbors(spec, k), the ranking isn't just cosine similarity — it's weighted_score = cosine × max_staffer_competence over the best coordinator who ran that signature. Senior staffers' playbooks surface above juniors' on similar scenarios, even when the juniors' scenario was marginally closer in embedding space.
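The competence-weighted ranking can be sketched like this. The neighbor shape is illustrative (kb.findNeighbors itself lives in kb.ts); the scoring rule is the weighted_score = cosine × max_staffer_competence formula above.

```typescript
// Sketch: rank KB neighbors by cosine × best competence among the staffers
// who ran that signature, so senior playbooks outrank marginally closer
// junior ones.
interface Neighbor {
  signature: string;
  cosine: number;
  staffer_competences: number[];
}

function rankNeighbors(neighbors: Neighbor[]): Neighbor[] {
  const weighted = (n: Neighbor) => n.cosine * Math.max(...n.staffer_competences);
  return [...neighbors].sort((a, b) => weighted(b) - weighted(a));
}
```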
The tool_level knob (full / local / basic / minimal) controls which tiers are available to a given staffer's runs. See Ch 3 for the mapping. Variance is real and measurable: the 36-run demo produced a 33pt fill-rate delta between James (local tools, 92.9%) and Alex (minimal tools, 59.5%) on identical contracts.
A second-order output of the competence path: when multiple staffers independently endorse the same worker on similar-signature playbooks, that worker accumulates cross-staffer endorsements. scripts/kb_staffer_report.py surfaces them — after 36 runs, Rachel D. Lewis (Welder Nashville) had 18 endorsements across 4 staffers, Angela U. Ward (Machine Op Indianapolis) 19. These are high-confidence "reliable" labels the system produced without human tagging. The UI could badge these workers on future queries; today they're visible via /memory/query's discovered_patterns bundle.
tests/multi-agent/scenario.ts gets report.md auto-generated. Workspace activity logs aggregate per staffer. GET /vectors/playbook_memory/status shows active vs retired counts. KB indexes the run (kb.indexRun) and the overview model synthesizes a pathway recommendation for the next matching signature. Every event outcome has already streamed to lakehouse-observer.service on :3800 for ERROR_ANALYZER + PLAYBOOK_BUILDER consumption.

POST /memory/query on the Bun MCP. The regex normalizer extracts role / city / count / deadline / intent in 0ms (no LLM call). The unified response returns playbook workers (auto-surfaced reliable performers for Joliet Forklift with citation counts), the pathway recommendation from the KB, prior T3 lessons for Joliet, and top staffers by competence, all in ~300ms.

detectErrorCorrections scans today's outcomes for fail→succeed deltas on the same signature; any correction gets logged to data/_kb/error_corrections.jsonl with the config diff. Tomorrow morning, the system is measurably better at something it got asked about today.

After each sealed fill (via scenario.ts or the manual /log flow with downstream hooks), generateArtifacts in the scenario runner produces (a) one SMS per worker (TO: Name, message under 180 chars) and (b) one client confirmation email. Drafts are saved to sms.md and emails.md under the scenario output dir. Ollama drafts them; the staffer reviews and sends. No auto-send; human-in-the-loop.
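The zero-latency normalizer path can be sketched with a few regexes. These patterns are illustrative, not the ones in normalize.ts; the point is that a structured fill request never needs an LLM call.

```typescript
// Sketch of the regex normalizer: role / city / count / deadline / intent
// extracted with no model call. Patterns are illustrative only.
function normalizeQuery(q: string) {
  const count = q.match(/\b(\d+)\b/);
  const role = q.match(/\b(forklift|welder|machine op(?:erator)?|picker)s?\b/i);
  const city = q.match(/\bin\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)?)\b/);
  const deadline = q.match(/\bby\s+(monday|tuesday|wednesday|thursday|friday|\d{4}-\d{2}-\d{2})\b/i);
  return {
    intent: /need|fill|find|staff/i.test(q) ? "fill_request" : "lookup",
    count: count ? Number(count[1]) : null,
    role: role ? role[1].toLowerCase() : null,
    city: city ? city[1] : null,
    deadline: deadline ? deadline[1] : null,
  };
}
```

Structured and regex inputs short-circuit here; only free-form queries that miss every pattern would fall through to the LLM normalizer.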
- pay_rate to workers, bill_rate to contracts, and a filter + warning path. Partially addressed via ContractTerms.budget_per_hour_max passed to T3/rescue prompts, but the match-time filter isn't wired yet.
- /seed only ADDs. The same (operation, date) pair appends a duplicate instead of refining an existing entry. Phase 26 item — cheap to add, moderate payoff.
- generateContinuable + generateTreeSplit are wired, but crates/aibridge/src/{continuation.rs, tree_split.rs} + crates/storaged/src/chunk_cache.rs remain queued. Gateway-side callers currently don't have the same protection against silent truncation that the TS test harness does.
- data/_observer/ops.jsonl; the autotune agent still runs on its own HNSW-trial schedule and hasn't subscribed to the outcome metric stream yet. Phase 26+ item — connects the last loop.