diff --git a/mcp-server/spec.html b/mcp-server/spec.html
index 9b81834..c54646c 100644
--- a/mcp-server/spec.html
+++ b/mcp-server/spec.html
@@ -123,10 +123,11 @@ table.plain tr:hover td{background:#0d1117}
 crates/aibridge/   Rust ↔ Python sidecar. HTTP client over FastAPI wrapper around Ollama. VRAM introspection via nvidia-smi. All LLM calls (embed, generate, rerank) flow through here.
 crates/gateway/    Axum HTTP (:3100) + gRPC (:3101). Auth middleware, tools registry (Phase 12 — governed actions), CORS. Every external request enters here.
 crates/ui/         Dioxus WASM developer UI. Internal tool. Not exposed externally.
-mcp-server/        Bun/TypeScript recruiter-facing app. Serves devop.live/lakehouse. Routes: /search /match /log /log_failure /clients/:c/blacklist /intelligence/*. Proxies to the Rust gateway for heavy work.
+mcp-server/        Bun/TypeScript recruiter-facing app. Serves devop.live/lakehouse. Routes: /search /match /log /log_failure /clients/:c/blacklist /intelligence/* /memory/query /models/matrix /system/summary. Observer sibling at observer.ts with an HTTP listener on :3800 for scenario event ingest. Proxies to the Rust gateway for heavy work.
-tests/multi-agent/ Dual-agent scenario harness. agent.ts (prompts + protocol), orchestrator.ts (single task), scenario.ts (5-event warehouse week), run_e2e_rated.ts (parallel pairs + rating), chain_of_custody.ts (layer-by-layer audit).
+tests/multi-agent/ Dual-agent scenario harness + memory stack. agent.ts (prompts, continuation + tree-split primitives, cloud routing), orchestrator.ts, scenario.ts (contracts + staffer + tool_level), kb.ts (KB indexing, competence scoring, neighbor retrieval), normalize.ts (input normalizer — structured / regex / LLM), memory_query.ts (unified /memory/query), gen_scenarios.ts + gen_staffer_demo.ts (corpus generators), run_e2e_rated.ts, chain_of_custody.ts. Unit tests colocated (kb.test.ts, normalize.test.ts).
+config/models.json Authoritative 5-tier model matrix (T1 hot local / T2 review local / T3 overview cloud / T4 strategic / T5 gatekeeper). Per-tier context_window + context_budget + overflow_policy. Read at runtime by scenario.ts; hot-swap friendly.
 docs/              PRD.md, PHASES.md, DECISIONS.md (20 ADRs). Every significant architectural choice has an ADR with the alternatives that were rejected and why.
-data/              Default local object store. Parquet files per dataset, append-log batches, HNSW trial journals, promotion registries, playbook_memory state.json, catalog manifests. Rebuildable from repo + this dir alone.
+data/              Default local object store. Parquet files per dataset, append-log batches, HNSW trial journals, promotion registries, _playbook_memory/state.json (now with retirement fields — Phase 25), catalog manifests. Plus four learning-loop directories: _kb/ (signatures, outcomes, recommendations, error_corrections, config_snapshots, staffers), _playbook_lessons/ (T3 cross-day lessons archived per run), _observer/ops.jsonl (append journal, the durable scenario outcome stream), _chunk_cache/ (spec'd for the Phase 21 Rust port). Rebuildable from repo + this dir alone.
@@ -197,7 +198,33 @@ table.plain tr:hover td{background:#0d1117}
  • Search via POST /vectors/profile/<id>/search rejects out-of-scope queries with 403 + list of allowed bindings
  • Ollama swaps to the profile's model via keep_alive=0; only one model in VRAM at a time
-    Code: crates/vectord/src/{hnsw.rs, autotune.rs, agent.rs, promotion.rs} · ADR-019

    Model matrix (Phase 20)


    Five tiers declared in config/models.json. Each call site picks the tier appropriate to its purpose — hot-path JSON emitters get fast local models; overview, strategic, and gatekeeper decisions get thinking models on cloud. Every tier carries context_window, context_budget, and overflow_policy (a typed sketch of the file's shape follows the table).

    Tier          | Purpose                                              | Primary model                          | Frequency
    T1 hot        | Per tool call — SQL gen, hybrid_search, propose_done | qwen3.5:latest local, think:false      | 50-200/scenario
    T2 review     | Per-step consensus, drift flagging                   | qwen3:latest local, think:false        | 5-14/event
    T3 overview   | Mid-day checkpoints + cross-day lesson distill       | gpt-oss:120b Ollama Cloud, thinking on | 1-3/scenario
    T4 strategic  | Pattern re-ranking, weekly gap audit                 | qwen3.5:397b cloud                     | 1-10/day
    T5 gatekeeper | Schema migrations, autotune config changes           | kimi-k2-thinking cloud, audit-logged   | 1-5/day

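    A minimal typed loader for the matrix above — context_window, context_budget, and overflow_policy come from the spec; every other field name (model, endpoint, think) and the overflow_policy variants are illustrative assumptions, not the real schema:

    ```ts
    // Hypothetical shape for one tier entry in config/models.json.
    // context_window / context_budget / overflow_policy are from the spec;
    // the remaining names are assumptions for illustration only.
    interface TierConfig {
      model: string;                 // e.g. "qwen3.5:latest" (T1) or "gpt-oss:120b" (T3)
      endpoint: "local" | "cloud";   // where the call is routed
      think: boolean;                // thinking off on the hot path, on for overseers
      context_window: number;        // the model's raw window, in tokens
      context_budget: number;        // what a call site is allowed to spend
      overflow_policy: "continue" | "tree_split" | "reject"; // assumed variants
    }

    type ModelMatrix = Record<"T1" | "T2" | "T3" | "T4" | "T5", TierConfig>;

    // Read at runtime (Bun), so editing the file hot-swaps without a rebuild.
    const matrix: ModelMatrix = await Bun.file("config/models.json").json();
    console.log(matrix.T3.model, matrix.T3.context_budget);
    ```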
    Key mechanical finding (2026-04-21): qwen3.5 and qwen3 are thinking models — they burn ~650 tokens of hidden reasoning before emitting the visible response. For hot-path JSON emitters this meant 400-token budgets returned empty strings. The fix: a think: false flag plumbed through the sidecar's /generate endpoint (call sketch below); the hot path disables thinking (structure matters more than reasoning depth) while the overseer tiers keep it on. Mistral was dropped entirely after a 0/14 fill rate on complex scenarios (a decoder-level malformed-JSON bug, not a prompt issue).

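    A sketch of the hot-path call with thinking disabled, assuming the sidecar's /generate endpoint accepts a think flag in its JSON body; the port, payload contract, and response shape are assumptions:

    ```ts
    // Hot-path generate: structure matters more than reasoning depth,
    // so think is off and the 400-token budget is safe again.
    async function generateHotPath(prompt: string): Promise<string> {
      const res = await fetch("http://127.0.0.1:8008/generate", { // sidecar URL assumed
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({
          model: "qwen3.5:latest",
          prompt,
          think: false,   // the Phase 20 fix: no ~650-token hidden-reasoning burn
          max_tokens: 400,
        }),
        signal: AbortSignal.timeout(30_000),
      });
      if (!res.ok) throw new Error(`sidecar /generate -> ${res.status}`);
      return ((await res.json()) as { response: string }).response;
    }
    ```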

    Continuation primitive (Phase 21): generateContinuable() handles output-overflow without max_tokens tourniquets — empty response → geometric backoff retry; truncated-JSON → continue with partial as scratchpad. generateTreeSplit() handles input-overflow via map-reduce with running scratchpad. Both respect assertContextBudget() so silent truncation can't happen.

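    generateContinuable() is named by the spec; this is a compressed sketch of the described loop, with the signature, brace-balance helper, and retry bounds as assumptions about what agent.ts actually does:

    ```ts
    // Assumed low-level call; mirrors the hot-path helper above.
    declare function generate(prompt: string, maxTokens: number): Promise<string>;

    function jsonLooksComplete(text: string): boolean {
      // Cheap brace-balance check first, then the real test: does it parse?
      let depth = 0;
      for (const ch of text) {
        if (ch === "{") depth++;
        else if (ch === "}") depth--;
      }
      if (depth !== 0) return false;
      try { JSON.parse(text); return true; } catch { return false; }
    }

    async function generateContinuable(
      prompt: string,
      maxTokens = 400,
      maxContinuations = 3, // bound from the spec's max_continuations idea
    ): Promise<string> {
      let out = await generate(prompt, maxTokens);
      for (let i = 0; i < maxContinuations; i++) {
        if (out.trim() === "") {
          // Empty response: geometric backoff on the token budget.
          maxTokens *= 2;
          out = await generate(prompt, maxTokens);
          continue;
        }
        if (jsonLooksComplete(out)) return out;
        // Truncated JSON: continue, feeding the partial back as a scratchpad.
        const more = await generate(
          `${prompt}\n\nPartial answer so far (continue it, emit JSON only):\n${out}`,
          maxTokens,
        );
        out += more;
      }
      return out; // the caller still validates; no silent truncation
    }
    ```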

    Per-staffer tool_level (Phase 23)


    Scenarios can be scoped to a specific coordinator (staffer: {id, name, tenure_months, role, tool_level}). tool_level (full / local / basic / minimal) controls which tiers are available to that staffer's runs; a sketch of the gate follows, and Ch 3 has the full mapping.

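    The staffer shape is from the spec; the tier mapping below is a placeholder assumption (the real one lives in Ch 3) and only shows where the gate would sit:

    ```ts
    interface Staffer {
      id: string;
      name: string;
      tenure_months: number;
      role: string;
      tool_level: "full" | "local" | "basic" | "minimal";
    }

    // Hypothetical mapping — illustrative only; see Ch 3 for the real table.
    const ALLOWED_TIERS: Record<Staffer["tool_level"], string[]> = {
      full:    ["T1", "T2", "T3", "T4", "T5"],
      local:   ["T1", "T2"],
      basic:   ["T1"],
      minimal: [],              // playbook inheritance only
    };

    function canUseTier(staffer: Staffer, tier: string): boolean {
      return ALLOWED_TIERS[staffer.tool_level].includes(tier);
    }
    ```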

    Measured 2026-04-21 on a 36-run demo (4 staffers × 3 contracts × 3 rounds): James Park (mid, local tools) ranked first at 92.9% fill and 36.8 cites/run; Maria Chen (senior, full tools) was second at 81.0%. Cloud T3 adds latency without measurable benefit on this workload. Alex Rivera (trainee, minimal) still hit 59.5% fill purely from playbook inheritance — proof that the memory itself carries the knowledge even when the model tier is minimal.

    Code: crates/vectord/src/{hnsw.rs, autotune.rs, agent.rs, promotion.rs} · tests/multi-agent/{agent.ts, scenario.ts} · config/models.json · ADR-019
    @@ -337,7 +364,7 @@ boost[(city, state, name)] = min(Σ per_worker, 0.25)
  • Ollama inference is serial. Embedding 1M rows at ~50 chunks/sec through nomic-embed-text = ~6 hours. Acceptable for overnight refresh, not for "immediate." Mitigated by incremental refresh (only deltas).
  • RAM ceiling on HNSW. At around 5M vectors × 768d, HNSW no longer fits comfortably in 128GB. Mitigation: a per-profile vector_backend: lance flip — disk-resident IVF_PQ scales past the RAM line (ADR-019).
  • VRAM ceiling for model variety. A4000 16GB holds 1-2 loaded models. Multi-model recruiter surfaces are a sequential swap, not parallel (Ollama keep_alive=0). Phase 17 profile activation unloads the prior model on swap.
- • playbook_memory growth. Currently unbounded. 391 entries today; at this rate, ~5K in six months. Default k=200 is still sub-ms at 5K. Compaction policy (TTL + decay + merge) deferred.
+ • playbook_memory growth. 1936 entries today. Phase 25 (2026-04-21) added retirement via valid_until + schema_fingerprint fields plus a POST /vectors/playbook_memory/retire endpoint (manual or schema-drift triggered). The active vs retired split is surfaced on GET /vectors/playbook_memory/status. Brute-force cosine is still sub-ms at this size; a Letta-style working-memory hot cache is deferred until the entry count crosses ~100K.
@@ -360,6 +387,10 @@ boost[(city, state, name)] = min(Σ per_worker, 0.25)
 Failure | What happens | Recovery
 Ollama sidecar down | 502 Bad Gateway from aibridge; embed calls fail fast | Restart: systemctl restart lakehouse-sidecar; vector search falls back to pre-computed embeddings
 Gateway restart mid-operation | In-memory state (playbook_memory, HNSW) reloaded from persisted state.json / trial journals | Zero data loss; catalog, storage, and journals are all source-of-truth
 Schema fingerprint diverges across manifests | catalog::dedupe reports a DedupeReport with winner selection (non-null row_count first, then newest updated_at) | POST /catalog/dedupe collapses duplicates idempotently
+Scenario event fails on zero-supply city | Cloud rescue (Phase 22 item B) fires — gpt-oss:120b sees the SQL filters attempted, row counts, reviewer drift notes, and contract terms; returns structured {retry, new_city, new_state, new_role, new_count, rationale} | Retry-with-pivot runs the same executor loop with the new geography; verified Gary IN → South Bend IN filled 1/1 after the original drift-abort
+LLM response truncated mid-JSON (thinking model ate the token budget) | Phase 21 generateContinuable() detects via brace-balance + JSON.parse; no silent truncation | Auto-continue with the partial as scratchpad, or geometric backoff if the initial call returned empty; bounded by max_continuations
+Schema migration invalidates existing playbooks | Phase 25 — POST /vectors/playbook_memory/retire with the current fingerprint retires all mismatched entries in scope; the diagnostic log shows counts | Retired entries stay in the journal for forensics but are skipped by all boost calculations; scoped by (city, state) so unrelated geos aren't touched
+Observer fails to reach the scenario outcome stream | Scenario postObserverEvent() uses a 2s AbortSignal.timeout; silent skip if :3800 is down (sketch below) | The scenario log is still the source of truth; observer re-ingest on the next run restores the stream, and data/_observer/ops.jsonl is append-only so prior events survive
@@ -384,6 +415,15 @@ boost[(city, state, name)] = min(Σ per_worker, 0.25)

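    The observer row above implies a fire-and-forget poster. A minimal sketch: postObserverEvent() and the 2s timeout are from the table, while the ingest path and event shape are assumptions:

    ```ts
    // Best-effort event push to the observer on :3800. The scenario log
    // remains the source of truth, so a down observer is a silent skip.
    async function postObserverEvent(event: Record<string, unknown>): Promise<void> {
      try {
        await fetch("http://localhost:3800/ingest", { // path is an assumption
          method: "POST",
          headers: { "content-type": "application/json" },
          body: JSON.stringify(event),
          signal: AbortSignal.timeout(2_000), // the 2s bound from the table
        });
      } catch {
        // Observer down or slow: skip; re-ingest on the next run restores the stream.
      }
    }
    ```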
    Daily summary per staffer

    Workspace activity log + per-staffer filter on the event journal gives "what did Sarah do today" as a direct query. The foundation for shift-handoff reports.

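    A sketch of the "what did Sarah do today" filter, assuming the append-only journal at data/_observer/ops.jsonl and staffer_id / ts field names (the real event journal and fields may differ):

    ```ts
    // Per-staffer, per-day filter over a JSONL event journal.
    async function dailySummary(stafferId: string, day: string) {
      const text = await Bun.file("data/_observer/ops.jsonl").text();
      return text
        .split("\n")
        .filter(Boolean)
        .map((line) => JSON.parse(line))
        .filter((e) => e.staffer_id === stafferId && e.ts?.startsWith(day));
    }

    const events = await dailySummary("sarah", "2026-04-21");
    console.log(`${events.length} events`);
    ```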

    Staffer identity + competence-weighted retrieval (Phase 23)


    Each scenario run carries an explicit staffer: {id, name, tenure_months, role, tool_level}. The KB aggregates per-staffer stats (data/_kb/staffers.jsonl) that roll up into a single competence_score:

    competence_score = 0.45·fill_rate + 0.20·turn_efficiency + 0.20·citation_density + 0.15·rescue_rate

    When any query runs kb.findNeighbors(spec, k), the ranking isn't just cosine similarity: weighted_score = cosine × the maximum competence_score across coordinators who ran that signature (sketch below). Senior staffers' playbooks surface above juniors' on similar scenarios, even when the junior's scenario was marginally closer in embedding space.

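    The weighting rule is from the text; the record shapes below are assumptions sketching how kb.ts might apply it:

    ```ts
    interface Neighbor {
      signature: string;
      cosine: number;          // raw similarity from the embedding index
      staffer_ids: string[];   // coordinators who ran this signature
    }

    // competence_score per staffer, e.g. rolled up from data/_kb/staffers.jsonl:
    // 0.45·fill_rate + 0.20·turn_efficiency + 0.20·citation_density + 0.15·rescue_rate
    type CompetenceTable = Record<string, number>;

    function weighted(n: Neighbor, competence: CompetenceTable): number {
      const best = n.staffer_ids.length
        ? Math.max(...n.staffer_ids.map((id) => competence[id] ?? 0))
        : 0;
      return n.cosine * best; // senior playbooks outrank marginally-closer junior ones
    }

    function rankNeighbors(hits: Neighbor[], competence: CompetenceTable): Neighbor[] {
      return [...hits].sort((a, b) => weighted(b, competence) - weighted(a, competence));
    }
    ```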

    The tool_level knob (full / local / basic / minimal) controls which tiers are available to a given staffer's runs. See Ch 3 for the mapping. Variance is real and measurable: the 36-run demo produced a 33pt fill-rate delta between James (local tools, 93%) and Alex (minimal tools, 60%) on identical contracts.


    Auto-discovered reliable-performer labels


    A second-order output of the competence path: when multiple staffers independently endorse the same worker on similar-signature playbooks, that worker accumulates cross-staffer endorsements. scripts/kb_staffer_report.py surfaces them — after 36 runs, Rachel D. Lewis (Welder Nashville) had 18 endorsements across 4 staffers, Angela U. Ward (Machine Op Indianapolis) 19. These are high-confidence "reliable" labels the system produced without human tagging. The UI could badge these workers on future queries; today they're visible via /memory/query's discovered_patterns bundle.

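    A sketch of the cross-staffer rollup behind those labels; the endorsement record shape and the minimum-staffers threshold are assumptions:

    ```ts
    // A worker earns a "reliable" label when several staffers independently
    // endorse them on similar-signature playbooks.
    interface PlaybookEndorsement {
      worker: string;      // e.g. "Rachel D. Lewis"
      staffer_id: string;
      signature: string;   // e.g. "Welder Nashville"
    }

    function reliablePerformers(rows: PlaybookEndorsement[], minStaffers = 3) {
      const byWorker = new Map<string, { endorsements: number; staffers: Set<string> }>();
      for (const r of rows) {
        const agg = byWorker.get(r.worker) ?? { endorsements: 0, staffers: new Set<string>() };
        agg.endorsements++;
        agg.staffers.add(r.staffer_id);
        byWorker.set(r.worker, agg);
      }
      return [...byWorker]
        .filter(([, a]) => a.staffers.size >= minStaffers)
        .map(([worker, a]) => ({ worker, endorsements: a.endorsements, staffers: a.staffers.size }));
    }
    ```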
@@ -410,9 +450,11 @@ boost[(city, state, name)] = min(Σ per_worker, 0.25)
 15:00  New embeddings live. Hot-swap promotion. Searches now see all 1M new profiles. Sarah's noon query, re-run, would produce a different top-5.
-17:00  End-of-day retrospective. Any staffer who ran tests/multi-agent/scenario.ts gets report.md auto-generated. Workspace activity logs aggregate per staffer. GET /vectors/playbook_memory/stats shows the day's new entries.
+17:00  End-of-day retrospective. Any staffer who ran tests/multi-agent/scenario.ts gets report.md auto-generated. Workspace activity logs aggregate per staffer. GET /vectors/playbook_memory/status shows active vs retired counts. The KB indexes the run (kb.indexRun) and the overview model synthesizes a pathway recommendation for the next matching signature. Every event outcome has already streamed to lakehouse-observer.service on :3800 for ERROR_ANALYZER + PLAYBOOK_BUILDER consumption.
-22:00  Overnight trial cycle. Autotune agent continues in the background. Trial journal grows. Tomorrow morning, the system is measurably better at something it got asked about today.
+17:15  Kim fires a natural-language query from the search box. "need 3 forklift operators in Joliet by Monday" → POST /memory/query on the Bun MCP. The regex normalizer extracts role / city / count / deadline / intent in ~0ms, with no LLM call (sketch after this timeline). The unified response returns playbook workers (auto-surfaced reliable performers for Joliet Forklift, with citation counts), a pathway recommendation from the KB, prior T3 lessons for Joliet, and top staffers by competence — all in ~300ms.
+22:00  Overnight trial cycle. Autotune agent continues in the background. Trial journal grows. KB's detectErrorCorrections scans today's outcomes for fail→succeed deltas on the same signature; any correction is logged to data/_kb/error_corrections.jsonl with the config diff. Tomorrow morning, the system is measurably better at something it got asked about today.
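    The 17:15 row implies the regex tier of normalize.ts. A minimal sketch for queries of the "need N <role>s in <city> by <deadline>" shape; the pattern and field names are assumptions, only the zero-LLM behavior is from the spec:

    ```ts
    interface NormalizedQuery {
      intent: "fill";
      count: number;
      role: string;
      city: string;
      deadline: string;
    }

    // Zero-LLM path: one regex handles the common phrasing; anything it
    // can't parse falls through to the structured / LLM normalizers.
    function regexNormalize(q: string): NormalizedQuery | null {
      const m = q.match(/need\s+(\d+)\s+(.+?)\s+in\s+(.+?)\s+by\s+(.+)$/i);
      if (!m) return null;
      return {
        intent: "fill",
        count: Number(m[1]),
        role: m[2].replace(/s$/, ""), // crude singularizer, illustration only
        city: m[3],
        deadline: m[4],
      };
    }

    console.log(regexNormalize("need 3 forklift operators in Joliet by Monday"));
    // → { intent: "fill", count: 3, role: "forklift operator", city: "Joliet", deadline: "Monday" }
    ```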

    SMS + email drafts in the pipeline

    After each sealed fill (via scenario.ts or manual /log flow with downstream hooks), generateArtifacts in the scenario runner produces: (a) one SMS per worker (TO: Name, message under 180 chars), (b) one client confirmation email. Drafts are saved to sms.md and emails.md under the scenario output dir. Ollama drafts them; the staffer reviews and sends. No auto-send; human-in-the-loop.

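    A sketch of the artifact step: the file names, one-SMS-per-worker shape, and the 180-char cap are from the spec; the draft() helper and prompt wording are assumptions:

    ```ts
    // Post-fill draft generation; nothing is auto-sent.
    declare function draft(prompt: string): Promise<string>; // assumed Ollama-backed helper

    async function generateArtifacts(outDir: string, workers: string[], client: string) {
      const sms: string[] = [];
      for (const name of workers) {
        const msg = await draft(`Draft an SMS under 180 chars confirming ${name}'s shift.`);
        sms.push(`TO: ${name}\n${msg.slice(0, 180)}`);
      }
      const email = await draft(`Draft a confirmation email to ${client} for the sealed fill.`);
      await Bun.write(`${outDir}/sms.md`, sms.join("\n\n"));
      await Bun.write(`${outDir}/emails.md`, email);
      // The staffer reviews both files before anything goes out.
    }
    ```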
    @@ -426,11 +468,13 @@ boost[(city, state, name)] = min(Σ per_worker, 0.25)

    Deferred — real architectural work, just not shipped yet

@@ -452,6 +496,6 @@ boost[(city, state, name)] = min(Σ per_worker, 0.25)