- Truth rules: evaluate(task_class, ctx) → Vec&lt;RuleOutcome&gt; (ADR-021 — semantic-correctness matrix layer). Loaded from truth/*.toml at gateway boot.
- aibridge: ProviderAdapter dispatch for cloud (ollama_cloud, openrouter, opencode, kimi). VRAM introspection via nvidia-smi. All LLM calls flow through here.
- Gateway: /v1/* (drop-in middleware), mode runner (/v1/mode/execute), validator (/v1/validate), iterate loop (/v1/iterate), tools registry, cost telemetry, Langfuse + observer fan-out on every chat. Every external request enters here.
- Validators: FillValidator, EmailValidator, ParquetWorkerLookup (loads workers_500k.parquet at boot). Fail-closed when the roster is absent.
- MCP server: devop.live/lakehouse. Pages: dashboard / console / profiler / contractor / proof / spec / onboard / alerts / workspaces. Routes: /search /match /log /log_failure /clients/:c/blacklist /intelligence/* /staffers /memory/query /models/matrix /system/summary. Observer sibling at observer.ts with an HTTP listener on :3800 for scenario event ingest. Proxies to the Rust gateway for heavy work.
- TS test harness: agent.ts (prompts, continuation + tree-split primitives, cloud routing), orchestrator.ts, scenario.ts (contracts + staffer + tool_level), kb.ts (KB indexing, competence scoring, neighbor retrieval), normalize.ts (input normalizer — structured / regex / LLM), memory_query.ts (unified /memory/query), gen_scenarios.ts + gen_staffer_demo.ts (corpus generators), run_e2e_rated.ts, chain_of_custody.ts. Unit tests colocated (kb.test.ts, normalize.test.ts).
- Distillation substrate: tagged distillation-v1.0.0 / commit e7636f2. 145 unit tests, 22/22 acceptance, 16/16 audit-full, bit-identical reproducibility. Multi-layer contamination firewall on SFT exports. Auditor verdicts archived under data/_auditor/kimi_verdicts/.
- config/: models.json — authoritative 5-tier model matrix (T1 hot local / T2 review local / T3 overview cloud / T4 strategic / T5 gatekeeper) with per-tier context_window + context_budget + overflow_policy; read at runtime by scenario.ts, hot-swap friendly. modes.toml — task_class → mode/model router (scrum_review, contract_analysis, staffing_inference, pr_audit, doc_drift_check, fact_extract). providers.toml — 5 active providers (ollama, ollama_cloud, openrouter, opencode 40-model, kimi direct). routing.toml — cost gates per task class.
- docs/: PRD.md, PHASES.md, DECISIONS.md (21 ADRs). Every significant architectural choice has an ADR recording the alternatives that were rejected and why.
- data/: _playbook_memory/state.json (now with retirement fields — Phase 25), _pathway_memory/state.json (88 traces, 11/11 successful replays, ADR-021), catalog manifests. Plus the learning-loop directories: _kb/ (signatures, outcomes, recommendations, error_corrections, config_snapshots, staffers), _playbook_lessons/ (T3 cross-day lessons archived per run), _observer/ops.jsonl (append-only journal, durable scenario outcome stream), _auditor/kimi_verdicts/, _chunk_cache/ (spec'd for Phase 21 Rust port). Rebuildable from repo + this dir alone.
- Local serving: keep_alive=0; only one model in VRAM at a time.

Five tiers declared in config/models.json. Each call site picks the tier appropriate to its purpose — hot-path JSON emitters get fast local, overview/strategic/gatekeeper decisions get thinking models on cloud. Every tier carries context_window, context_budget, and overflow_policy.
Declared in config/providers.toml + config/modes.toml. Gateway is an OpenAI-compatible drop-in middleware: any consumer that speaks POST /v1/chat/completions gets routing, audit, cost telemetry, and the full memory substrate behind every call.
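For a consumer, "drop-in" means a plain OpenAI-style chat completions request pointed at the gateway. A minimal sketch, assuming a local gateway host/port and model name (the route and the behavior behind it are from the spec; everything else here is illustrative):

```ts
// Any consumer that already speaks POST /v1/chat/completions can point at the
// gateway and pick up routing, audit, cost telemetry, and the memory substrate.
// Host, port, and model name below are assumptions for illustration.
async function chatViaGateway(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen3.5:latest",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content; // standard OpenAI-compatible response shape
}
```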
| Tier | Purpose | Primary model | Frequency |
|---|---|---|---|
| T1 hot | Per tool call — SQL gen, hybrid_search, propose_done | qwen3.5:latest local, think:false | 50-200/scenario |
| T2 review | Per-step consensus, drift flagging | qwen3:latest local, think:false | 5-14/event |
| T3 overview | Mid-day checkpoints + cross-day lesson distill | gpt-oss:120b Ollama Cloud, thinking on | 1-3/scenario |
| T4 strategic | Pattern re-ranking, weekly gap audit | qwen3.5:397b cloud | 1-10/day |
| T5 gatekeeper | Schema migrations, autotune config changes | kimi-k2-thinking cloud, audit-logged | 1-5/day |
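A rough sketch of the per-tier shape the matrix implies, written as TypeScript types rather than the actual config/models.json contents (tier ids come from the table; field names beyond context_window, context_budget, and overflow_policy are assumptions):

```ts
// Sketch only: the real schema lives in config/models.json and is read at
// runtime by scenario.ts. The OverflowPolicy variants are assumed.
type OverflowPolicy = "continue" | "tree_split" | "reject";

interface TierConfig {
  purpose: string;          // e.g. "hot-path JSON emitters"
  provider: string;         // "ollama", "ollama_cloud", ...
  model: string;            // e.g. "qwen3.5:latest"
  think: boolean;           // hot path keeps this false
  context_window: number;   // model limit, tokens
  context_budget: number;   // what a caller may actually spend
  overflow_policy: OverflowPolicy;
}

// Call sites pick a tier by purpose, never a model by name.
type ModelMatrix = Record<"T1" | "T2" | "T3" | "T4" | "T5", TierConfig>;
```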
| Provider | Reach | Use case |
|---|---|---|
| ollama | localhost:3200 — local sidecar over Ollama | Hot-path JSON emitters, embeddings, last-resort rescue |
| ollama_cloud | ollama.com bearer key — gpt-oss:120b, qwen3-coder:480b, deepseek-v3.1:671b, kimi-k2:1t, mistral-large-3:675b, qwen3.5:397b | Strong-model reviewer rungs, T3+ overview, scrum master pipeline |
| openrouter | openrouter.ai/api/v1 — 343 models incl. Anthropic/Google/OpenAI/MiniMax/Qwen, paid + free tiers | Paid ladder for observer escalations, free-tier rescue |
| opencode | opencode.ai/zen/v1 — 40 frontier models reachable through ONE sk-* key: Claude Opus 4.7 / Sonnet / Haiku, GPT-5.5-pro / 5.4 / codex variants, Gemini 3.1-pro, Kimi K2.6, GLM 5.1, DeepSeek, Qwen 3.6+, MiniMax, plus 4 free-tier | Cross-architecture tie-breakers, auditor cross-lineage (Haiku 4.5 + Opus 4.7), high-context reasoning (Opus on diffs >100k chars) |
| kimi | api.kimi.com/coding/v1 — direct Kimi For Coding | kimi_architect when ollama_cloud rate-limits; TOS-clean primary path |
Key mechanical finding (2026-04-21): qwen3.5 and qwen3 are thinking models — they burn ~650 tokens of hidden reasoning before emitting the visible response. For hot-path JSON emitters this meant 400-token budgets returned empty strings. Fix: think: false plumbed through sidecar's /generate endpoint; hot path disables thinking (structure matters more than reasoning depth), overseer tiers keep it on. Mistral was dropped entirely after a 0/14 fill rate on complex scenarios (decoder-level malformed-JSON bug, not a prompt issue).
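A minimal sketch of the hot-path fix, assuming a simple JSON body on the sidecar's /generate endpoint (only the think flag, the route, and the local port are taken from the text above; the rest of the request shape is illustrative):

```ts
// Hypothetical hot-path call: with thinking disabled, a small token budget is
// spent on the visible JSON instead of ~650 tokens of hidden reasoning.
async function hotPathGenerate(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:3200/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen3.5:latest",
      prompt,
      think: false,     // structure matters more than reasoning depth on the hot path
      max_tokens: 400,  // the budget that used to come back empty with thinking on
    }),
  });
  const data = await res.json();
  return data.response ?? "";
}
```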
Continuation primitive (Phase 21): generateContinuable() handles output-overflow without max_tokens tourniquets — empty response → geometric backoff retry; truncated-JSON → continue with partial as scratchpad. generateTreeSplit() handles input-overflow via map-reduce with running scratchpad. Both respect assertContextBudget() so silent truncation can't happen.
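A minimal sketch of the output-overflow side, assuming a generate() callback and a naive truncation check (the real generateContinuable lives in agent.ts; retry count, backoff base, and prompt wording here are assumptions):

```ts
// Empty response -> geometric backoff retry; truncated JSON -> continue with
// the partial output carried forward as a scratchpad.
async function generateContinuable(
  prompt: string,
  generate: (p: string) => Promise<string>,
  maxAttempts = 4,
): Promise<string> {
  let scratchpad = "";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const out = await generate(
      scratchpad ? `${prompt}\n\nPartial output so far, continue it:\n${scratchpad}` : prompt,
    );
    if (out.trim() === "") {
      await new Promise((r) => setTimeout(r, 1_000 * 2 ** attempt)); // geometric backoff
      continue;
    }
    scratchpad += out;
    if (!looksTruncated(scratchpad)) return scratchpad;
  }
  throw new Error("generateContinuable: attempts exhausted");
}

// Naive stand-in for the real truncation detector.
function looksTruncated(s: string): boolean {
  try { JSON.parse(s); return false; } catch { return true; }
}
```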
Defined in tests/real-world/scrum_master_pipeline.ts as const LADDER. Each attempt is evaluated by isAcceptable() = chars ≥ 3800 ∧ not malformed JSON-only. On reject, the next rung sees a learning preamble carrying the prior rejection reason.
| Rung | Provider / model | Role |
|---|---|---|
| 1 | ollama_cloud / kimi-k2:1t | 1T params · flagship |
| 2 | ollama_cloud / qwen3-coder:480b | coding specialist |
| 3 | ollama_cloud / deepseek-v3.1:671b | reasoning |
| 4 | ollama_cloud / mistral-large-3:675b | deep analysis |
| 5 | ollama_cloud / gpt-oss:120b | reliable workhorse |
| 6 | ollama_cloud / qwen3.5:397b | dense final thinker |
| 7 | openrouter / openai/gpt-oss-120b:free | rescue tier |
| 8 | openrouter / google/gemma-3-27b-it:free | fastest rescue |
| 9 | ollama / qwen3.5:latest | last-resort local |
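A sketch of the ladder walk and the acceptance gate, assuming a rung shape and a callModel() helper (LADDER and isAcceptable() are named in scrum_master_pipeline.ts; their exact signatures are not shown here, so this is illustrative):

```ts
interface Rung { provider: string; model: string; note: string } // shape assumed

// Gate from the text: long enough AND not a malformed JSON-only blob.
function isAcceptable(out: string): boolean {
  if (out.length < 3800) return false;
  const jsonOnly = /^[\[{]/.test(out.trim());
  if (!jsonOnly) return true;
  try { JSON.parse(out); return true; } catch { return false; }
}

async function runLadder(
  task: string,
  ladder: Rung[],
  callModel: (rung: Rung, prompt: string) => Promise<string>,
): Promise<string> {
  let preamble = "";
  for (const rung of ladder) {
    const out = await callModel(rung, preamble + task);
    if (isAcceptable(out)) return out;
    // The next rung sees a learning preamble carrying the prior rejection reason.
    preamble = `A previous attempt (${rung.model}) was rejected: ` +
               `${out.length} chars, too short or malformed JSON only.\n\n`;
  }
  throw new Error("all rungs rejected");
}
```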
Every audit and every consensus-required call fires the primary reviewer N=3 times in parallel (Promise.all — wall-clock = single call). Aggregate votes per claim_idx, majority wins. On a 1-1-1 split, a tie-breaker model with different architecture (qwen3-coder:480b vs primary gpt-oss/kimi) is invoked. Every disagreement, even when majority resolves, writes to data/_kb/audit_discrepancies.jsonl. Closes the cloud-non-determinism gap: temp=0 isn't actually deterministic in practice across hours; consensus + cross-architecture tie-break stabilizes verdicts.
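A sketch of the vote aggregation, assuming string verdict labels and simple reviewer callbacks (the N=3 parallel fan-out, per-claim majority, cross-architecture tie-break on a 1-1-1 split, and the discrepancy journal are from the text; everything else is illustrative):

```ts
import { appendFileSync } from "node:fs";

type Verdict = string;       // verdict labels assumed
type Review = Verdict[];     // one verdict per claim_idx

async function consensusAudit(
  nClaims: number,
  review: () => Promise<Review>,                    // primary reviewer, fired 3x
  tieBreak: (claimIdx: number) => Promise<Verdict>, // different model architecture
): Promise<Review> {
  // N=3 in parallel: wall-clock cost of a single call.
  const [a, b, c] = await Promise.all([review(), review(), review()]);
  const final: Review = [];
  for (let i = 0; i < nClaims; i++) {
    const votes = [a[i], b[i], c[i]];
    const tally = new Map<Verdict, number>();
    for (const v of votes) tally.set(v, (tally.get(v) ?? 0) + 1);
    const [top, count] = [...tally.entries()].sort((x, y) => y[1] - x[1])[0];
    if (tally.size > 1) {
      // Every disagreement is journaled, even when the majority resolves it.
      appendFileSync(
        "data/_kb/audit_discrepancies.jsonl",
        JSON.stringify({ claim_idx: i, votes, ts: new Date().toISOString() }) + "\n",
      );
    }
    // Majority wins; a 1-1-1 split goes to the cross-architecture tie-breaker.
    final.push(count >= 2 ? top : await tieBreak(i));
  }
  return final;
}
```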
Every push to PR #11 triggers auditor/audit.ts within ~90s. To prevent a single model lineage's blind spots from becoming the system's blind spots, audits alternate between Kimi K2.6 (Moonshot lineage) and Haiku 4.5 (Anthropic lineage) by head SHA. Diffs over 100k chars auto-promote to Claude Opus 4.7 (Anthropic frontier). Per-PR cap of 3 audits with auto-reset on each new head SHA prevents infinite-loop spend. Latest verdict on c3c9c21: Haiku 4.5, 24.6s, 100% grounding-verified across 10 findings.
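A sketch of the auditor selection policy; the alternation-by-SHA-parity rule and the model identifiers are assumptions, while the policy itself (alternate lineages per head SHA, promote very large diffs, cap audits per head) is from the text:

```ts
// Lineage alternation keyed off the head SHA so the same head always gets the
// same reviewer; the parity rule here is illustrative.
function pickAuditor(headSha: string, diffChars: number): string {
  if (diffChars > 100_000) return "claude-opus-4.7";        // high-context promotion
  const even = parseInt(headSha.slice(0, 8), 16) % 2 === 0;
  return even ? "kimi-k2.6" : "claude-haiku-4.5";           // Moonshot vs Anthropic lineage
}

// Spend guard: at most 3 audits per head SHA; a new head SHA starts fresh.
const auditCounts = new Map<string, number>();
function shouldAudit(headSha: string): boolean {
  const n = auditCounts.get(headSha) ?? 0;
  if (n >= 3) return false;
  auditCounts.set(headSha, n + 1);
  return true;
}
```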
The substrate the auditor and mode runner sit on is tagged at distillation-v1.0.0 / commit e7636f2. 145 unit tests pass · 22/22 acceptance invariants · 16/16 audit-full checks · bit-identical reproducibility verified. The distillation phase exports clean SFT / RAG / preference samples with a multi-layer contamination firewall (SFT_NEVER constant + scorer category mapping + acceptance fixtures); the auditor consumes the substrate. The frozen tag means: any future "the system regressed" question has a baseline to bisect against, byte-for-byte.
The continuation and tree-split primitives above (generateContinuable / generateTreeSplit, both gated by assertContextBudget) are now Rust-native in crates/aibridge/src/continuation.rs (Phase 44).
Scenarios can be scoped to a specific coordinator (staffer: {id, name, tenure_months, role, tool_level}). tool_level controls which tiers are available:
- /log_failure + 0.5n penalty
- /vectors/playbook_memory/patterns
- staffer_id on /intelligence/chat; the MARIA'S MEMORY pill labels the playbook context with the active coordinator
- /intelligence/chat Route 6 — pulls profile + 5 same-role same-geo backfills sorted by responsiveness + drafts client SMS in ~250ms
- /intelligence/permit_contracts — Chicago Socrata permit → role / headcount / deadline / fill probability / gross revenue per card
- /intelligence/profiler_index — direct + parent + co-permit associated tickers; live Stooq prices

Every sealed fill is seeded to playbook_memory. The boost fires inside /vectors/hybrid when use_playbook_memory: true. The boost math was tightened 2026-04-21 after a diagnostic pass found globally-ranked playbooks were missing the SQL-filtered candidate pool entirely.
Answers "who handled this" as a first-class matrix-index dimension. Each scenario carries staffer: {id, name, tenure_months, role, tool_level}. After every run, recomputeStafferStats(staffer_id) aggregates their fill_rate, turn efficiency, citation density, rescue rate into a single competence_score (0.45·fill + 0.20·turn_eff + 0.20·cites + 0.15·rescue).
findNeighbors returns weighted_score = cosine × max_staffer_competence — top-performer playbooks rank above juniors' on similar scenarios. Auto-discovery emerges: running 4 staffers × 3 contracts × 3 rounds surfaced Rachel D. Lewis (Welder Nashville) with 18 endorsements across all 4 staffers, Angela U. Ward (Machine Op Indianapolis) with 19 — reliable-performer labels the system built without human tagging.
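The scoring above reduces to two small functions; a sketch with the weights from the text (stat field names and the neighbor shape are assumptions):

```ts
interface StafferStats {
  fill_rate: number;        // 0..1
  turn_efficiency: number;  // 0..1
  citation_density: number; // 0..1
  rescue_rate: number;      // 0..1
}

// competence_score = 0.45·fill + 0.20·turn_eff + 0.20·cites + 0.15·rescue
function competenceScore(s: StafferStats): number {
  return 0.45 * s.fill_rate +
         0.20 * s.turn_efficiency +
         0.20 * s.citation_density +
         0.15 * s.rescue_rate;
}

// findNeighbors re-rank: cosine scaled by the best competence among the
// staffers behind the playbook, so top performers outrank juniors.
interface Neighbor { cosine: number; stafferCompetences: number[] }

function weightedScore(n: Neighbor): number {
  return n.cosine * Math.max(...n.stafferCompetences);
}
```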
Memory at the system layer, not the worker layer. Every accepted scrum review writes a PathwayTrace with the full backtrack: file fingerprint, model used, signal class, KB chunks consulted, observer events, semantic flags (UnitMismatch, TypeConfusion, OffByOne, StaleReference, DeadCode, BoundaryViolation, …), bug fingerprints. A new query that fingerprints to the same trace can hot-swap to the prior result without re-running the 9-rung escalation. Five-factor hot-swap gate: narrow fingerprint match AND audit consensus pass AND replay_count ≥ 3 (probation) AND success_rate ≥ 0.80 AND NOT retired AND vector cosine ≥ 0.90.
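The hot-swap gate as a predicate; thresholds are the ones stated above, the trace field names are assumptions:

```ts
interface PathwayTrace {
  fingerprint: string;
  consensus_passed: boolean;
  replay_count: number;
  success_rate: number;
  retired: boolean;
}

// A prior result is reused only when every check passes.
function canHotSwap(trace: PathwayTrace, queryFingerprint: string, cosine: number): boolean {
  return trace.fingerprint === queryFingerprint  // narrow fingerprint match
      && trace.consensus_passed                  // audit consensus pass
      && trace.replay_count >= 3                 // past probation
      && trace.success_rate >= 0.80
      && !trace.retired
      && cosine >= 0.90;                         // vector similarity floor
}
```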
Live state (verified on this load): 88 traces · 11 / 11 successful replays · 100% reuse rate · probation gate crossed. Endpoints: /vectors/pathway/insert · /query · /record_replay · /stats · /bug_fingerprints. Spec: docs/DECISIONS.md ADR-021.
Memory scoped to whoever's acting. /intelligence/chat accepts staffer_id; on match, defaults state filter to staffer territory, scopes playbook-pattern geo to staffer's primary city/state, and surfaces response.staffer.name so the UI relabels MEMORY → MARIA'S MEMORY. Same query "forklift operators" returns 167 IL workers as Maria, 89 IN as Devon, 16 WI as Aisha. The corpus stays intact; the relevance gradient is per coordinator; each accumulates fills independently.
Roster: /staffers endpoint reads from STAFFERS in mcp-server/index.ts. Three personas today (Maria/Devon/Aisha); architecture generalizes — every new metro adds territories, not code paths.
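A sketch of the per-coordinator scoping on /intelligence/chat, assuming a minimal STAFFERS shape and filter object (the scoping behavior and the three personas are from the text; roster fields not stated there are illustrative):

```ts
interface Staffer { id: string; name: string; state: string; city: string }

// Roster sketch; the real STAFFERS constant lives in mcp-server/index.ts.
const STAFFERS: Staffer[] = [
  { id: "maria", name: "Maria", state: "IL", city: "Chicago" },
  { id: "devon", name: "Devon", state: "IN", city: "Indianapolis" },
  { id: "aisha", name: "Aisha", state: "WI", city: "Milwaukee" }, // city assumed
];

function scopeQuery(query: string, staffer_id?: string) {
  const staffer = STAFFERS.find((s) => s.id === staffer_id);
  return {
    query,
    // Default the state filter to the coordinator's territory on a match.
    filters: staffer ? { state: staffer.state } : {},
    // Playbook-pattern geo scoped to their primary city/state.
    playbookGeo: staffer ? { city: staffer.city, state: staffer.state } : undefined,
    // Surfaced so the UI can relabel MEMORY to MARIA'S MEMORY.
    staffer: staffer ? { name: staffer.name } : undefined,
  };
}
```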
Memory at the network layer. Every contractor in the corpus is also a forward indicator on the public equities they touch via three attribution flavors: direct (contractor IS the public issuer — SEC tickers index match), parent (subsidiary of a public parent — curated KNOWN_PARENT_MAP, e.g. Turner → HOC.DE via Hochtief AG), associated (co-permit network — Bob's Electric appears with TARGET CORPORATION 3+ times → inherits TGT). The associated path is the moat: a staffing-permit dataset that maps contractor-to-public-issuer is not commercially available; we synthesize it from the Socrata co-occurrence graph.
BAI (Building Activity Index) = attribution-weighted average day-change across surfaced issuers. Indexed build value = total $ of permits attributable to ANY public issuer in scope. Network depth = issuers / total attribution edges. Cross-metro replication is explicit in the architecture — Chicago is Phase 1; NYC DOB / LA County / Houston BCD / Boston ISD / DC DCRA are all Socrata-shaped and ship as config-only adapters.
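The three index formulas as code, assuming a flat list of attribution edges (field names are illustrative; the formulas follow the definitions above):

```ts
interface AttributionEdge {
  issuer: string;         // ticker, e.g. "TGT"
  weight: number;         // attribution weight (direct / parent / associated), assumed numeric
  permitValueUsd: number; // permit $ attributed through this edge
  dayChangePct: number;   // live day change for the issuer
}

// BAI: attribution-weighted average day-change across surfaced issuers.
function buildingActivityIndex(edges: AttributionEdge[]): number {
  const totalWeight = edges.reduce((s, e) => s + e.weight, 0);
  if (totalWeight === 0) return 0;
  return edges.reduce((s, e) => s + e.weight * e.dayChangePct, 0) / totalWeight;
}

// Indexed build value: total $ of permits attributable to any public issuer in scope.
function indexedBuildValue(edges: AttributionEdge[]): number {
  return edges.reduce((s, e) => s + e.permitValueUsd, 0);
}

// Network depth: distinct issuers / total attribution edges.
function networkDepth(edges: AttributionEdge[]): number {
  if (edges.length === 0) return 0;
  return new Set(edges.map((e) => e.issuer)).size / edges.length;
}
```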
Observer runs as lakehouse-observer.service, now with an HTTP listener on :3800. Scenarios POST per-event outcomes to /event with full provenance (staffer_id, sig_hash, event_kind, role, city, state, rescue flags). Observer's ERROR_ANALYZER and PLAYBOOK_BUILDER loops consume them alongside MCP-wrapped ops. Persistence switched from the old /ingest/file REPLACE path to an append-only data/_observer/ops.jsonl journal so the trace survives across restarts.
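A sketch of both ends of the event path, assuming the /event body matches the provenance fields listed above (the observer port, the journal path, and the append-only persistence are from the text):

```ts
import { appendFileSync } from "node:fs";

interface ScenarioEvent {
  staffer_id: string;
  sig_hash: string;
  event_kind: string;
  role: string;
  city: string;
  state: string;
  rescue: boolean; // rescue flags collapsed to one field for illustration
}

// Scenario side: POST each per-event outcome to the observer listener.
async function postScenarioEvent(ev: ScenarioEvent): Promise<void> {
  await fetch("http://localhost:3800/event", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(ev),
  });
}

// Observer side: append-only journal so the trace survives restarts
// (replaces the old /ingest/file REPLACE path).
function journalEvent(ev: ScenarioEvent): void {
  appendFileSync("data/_observer/ops.jsonl", JSON.stringify(ev) + "\n");
}
```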
Maria runs Chicago. Devon runs Indianapolis. Aisha runs Wisconsin/Michigan. They share one corpus, but search results, recurring-skill patterns, and playbook context all reshape to whoever is acting, via the staffer_id scoping on /intelligence/chat described above. Verified end-to-end: the same "forklift operators" query returns 167 IL workers as Maria, 89 IN as Devon, 16 WI as Aisha (live numbers; refresh the profiler page to recompute). The corpus stays intact; the relevance gradient is per coordinator, and as each accumulates fills, their slice of the playbook compounds independently. Adding a staffer is one append to STAFFERS; the architecture is metro-agnostic by construction.
Scopes every search. A staffing-recruiter profile bound to workers_500k sees only that dataset. A security-analyst profile bound to threat_intel cannot see worker data. GET /vectors/profile/<id>/audit records every tool invocation by model identity.
/log_failure → mark_failed records a penalty. The next similar query dampens Dave's boost by 0.5. Sarah continues the refill — the refill excludes Dave and the 2 others already booked for this shift.

Known gaps and queued work:
- data/_kb/audit_baselines.jsonl append pattern; just hasn't run long enough.
- pay_rate to workers, bill_rate to contracts, and a filter + warning path. Partially addressed via ContractTerms.budget_per_hour_max passed to T3/rescue prompts, but the match-time filter isn't wired yet.
- /seed only ADDs. The same (operation, date) pair appends a duplicate instead of refining an existing entry. Phase 26 item — cheap to add, moderate payoff.
- generateContinuable + generateTreeSplit are wired, but crates/aibridge/src/{continuation.rs, tree_split.rs} + crates/storaged/src/chunk_cache.rs remain queued. Gateway-side callers currently don't have the same protection against silent truncation that the TS test harness does.
- Per-event outcomes land in data/_observer/ops.jsonl; the autotune agent still runs on its own HNSW-trial schedule and hasn't subscribed to the outcome metric stream yet. Phase 26+ item — connects the last loop.