matrix-agent-validated/mcp-server/spec.html

<!DOCTYPE html>
<html lang="en"><head>
<meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1">
<title>Lakehouse — Technical Specification</title>
<style>
*{margin:0;padding:0;box-sizing:border-box}
body{font-family:'Inter',-apple-system,system-ui,sans-serif;background:#090c10;color:#b0b8c4;font-size:14px;line-height:1.6;-webkit-font-smoothing:antialiased}
a{color:#58a6ff;text-decoration:none}
a:hover{color:#79c0ff}

.bar{background:#0d1117;padding:0 24px;height:56px;border-bottom:1px solid #171d27;display:flex;justify-content:space-between;align-items:center;position:sticky;top:0;z-index:10}
.bar h1{font-size:14px;font-weight:600;color:#e6edf3;letter-spacing:-0.2px}
.bar nav{display:flex;gap:2px}
.bar nav a{font-size:12px;color:#545d68;padding:6px 14px;border-radius:6px;transition:all 0.15s}
.bar nav a:hover{color:#e6edf3;background:#161b22}
.bar nav a.active{color:#e6edf3;background:#1c2333}
.bar .rt{font-size:11px;color:#545d68}

.layout{display:grid;grid-template-columns:220px 1fr;gap:0;max-width:1200px;margin:0 auto}
.toc{position:sticky;top:72px;align-self:start;padding:24px 12px;border-right:1px solid #171d27;height:calc(100vh - 72px);overflow-y:auto}
.toc .hdr{color:#545d68;font-size:10px;text-transform:uppercase;letter-spacing:1.4px;font-weight:600;margin-bottom:10px;padding-left:8px}
.toc a{display:block;color:#8b949e;font-size:12px;padding:6px 10px;border-radius:6px;margin-bottom:2px;line-height:1.4}
.toc a:hover{color:#e6edf3;background:#161b22}

.wrap{padding:28px 24px 60px;min-width:0}

.chapter{margin-bottom:48px;scroll-margin-top:72px}
.chapter .num{color:#545d68;font-size:11px;font-weight:600;letter-spacing:1.6px;text-transform:uppercase;margin-bottom:6px}
.chapter h2{color:#e6edf3;font-size:24px;font-weight:700;letter-spacing:-0.4px;margin-bottom:8px;line-height:1.2}
.chapter h3{color:#e6edf3;font-size:16px;font-weight:600;margin:22px 0 8px}
.chapter h4{color:#c9d1d9;font-size:14px;font-weight:600;margin:16px 0 6px}
.chapter .lede{color:#8b949e;font-size:14px;margin-bottom:18px;max-width:760px;line-height:1.7}
.chapter p{color:#b0b8c4;margin-bottom:12px;max-width:800px}
.chapter ul{color:#b0b8c4;margin:8px 0 16px 24px}
.chapter li{margin-bottom:6px;line-height:1.7}
.chapter strong{color:#e6edf3;font-weight:600}

.card{background:#0d1117;border:1px solid #171d27;border-radius:12px;padding:20px;margin:12px 0}
.accent-l{border-left:3px solid #2ea043}
.accent-b{border-left:3px solid #1f6feb}
.accent-a{border-left:3px solid #bc8cff}
.accent-w{border-left:3px solid #d29922}
.accent-r{border-left:3px solid #f85149}

code{background:#161b22;color:#e6edf3;padding:2px 6px;border-radius:4px;font-family:ui-monospace,Menlo,monospace;font-size:12px}
pre{background:#161b22;border:1px solid #171d27;border-radius:8px;padding:14px 16px;overflow-x:auto;font-family:ui-monospace,Menlo,monospace;font-size:12px;color:#c9d1d9;line-height:1.5;margin:8px 0}

table.plain{width:100%;border-collapse:collapse;font-size:12px;margin:10px 0}
table.plain th{text-align:left;padding:8px 12px;color:#545d68;font-weight:600;text-transform:uppercase;font-size:10px;letter-spacing:0.8px;border-bottom:1px solid #171d27;background:#0d1117}
table.plain td{padding:8px 12px;border-bottom:1px solid #171d27;color:#c9d1d9;vertical-align:top}
table.plain td.mono{font-family:ui-monospace,Menlo,monospace}
table.plain tr:hover td{background:#0d1117}

.narr{color:#8b949e;font-size:13px;line-height:1.7;margin:10px 0;padding:10px 14px;border-left:2px solid #21262d}
.narr strong{color:#c9d1d9}

.ref{color:#545d68;font-size:11px;margin-top:6px;font-family:ui-monospace,Menlo,monospace}
.ref strong{color:#79c0ff;font-weight:600}

.step{display:flex;gap:14px;margin-bottom:14px;padding:12px 16px;background:#0d1117;border:1px solid #171d27;border-radius:8px}
.step .n{color:#58a6ff;font-weight:700;font-size:18px;flex-shrink:0;min-width:30px}
.step .body{color:#b0b8c4;font-size:13px;line-height:1.7}
.step .body strong{color:#e6edf3}

.footer{border-top:1px solid #171d27;padding:20px;text-align:center;color:#3d444d;font-size:11px}

@media(max-width:900px){
  .layout{grid-template-columns:1fr}
  .toc{display:none}
  .wrap{padding:20px 14px 40px}
  .bar nav{display:none}
}
</style></head>
<body>

<div class="bar">
  <h1>Lakehouse — Technical Specification</h1>
  <nav>
    <a href=".">Dashboard</a>
    <a href="console">Walkthrough</a>
    <a href="proof">Architecture</a>
    <a href="spec" class="active">Spec</a>
    <a href="onboard">Onboard</a>
    <a href="alerts">Alerts</a>
    <a href="workspaces">Workspaces</a>
  </nav>
  <div class="rt">v1 · 2026-04-20</div>
</div>

<div class="layout">
<aside class="toc">
  <div class="hdr">Contents</div>
  <a href="#ch1">1. Repository layout</a>
  <a href="#ch2">2. Data ingest pipeline</a>
  <a href="#ch3">3. Measurement &amp; indexing</a>
  <a href="#ch4">4. Contract inference</a>
  <a href="#ch5">5. What a CRM can't do</a>
  <a href="#ch6">6. How it gets better over time</a>
  <a href="#ch7">7. Scale story — 20 staffers, 300 contracts, a surge</a>
  <a href="#ch8">8. Error surfaces &amp; recovery</a>
  <a href="#ch9">9. Per-staffer context</a>
  <a href="#ch10">10. A day in the life</a>
  <a href="#ch11">11. Known limits &amp; non-goals</a>
</aside>
<div class="wrap">

<!-- ═══ 1. REPO LAYOUT ═══ -->
<div class="chapter" id="ch1">
<div class="num">Chapter 1</div>
<h2>Repository layout</h2>
<div class="lede">What lives where. Every folder below has a single, bounded responsibility. A maintainer reading this should know — in under ten minutes — which crate owns a failing behavior.</div>
<table class="plain">
<thead><tr><th>Path</th><th>Owns</th></tr></thead>
<tbody>
<tr><td class="mono">crates/shared/</td><td>Types, errors, Arrow helpers, schema fingerprints, PII detection, secrets provider. Every other crate depends on this.</td></tr>
<tr><td class="mono">crates/storaged/</td><td>Raw object I/O. <code>BucketRegistry</code> (multi-bucket, rescue-aware), <code>AppendLog</code> (write-once batched append), <code>ErrorJournal</code> (bucket op failures). ADR-017 (federation), ADR-018 (append pattern).</td></tr>
<tr><td class="mono">crates/catalogd/</td><td>Metadata authority. Dataset manifests, schema fingerprints (ADR-020), tombstones (soft delete), AI-safe views, model profiles (Phase 17). In-memory index persisted as Parquet on storage.</td></tr>
<tr><td class="mono">crates/queryd/</td><td>SQL engine. DataFusion over Parquet + MemTable cache + delta merge-on-read + compaction. Registers every bucket as an object_store so SQL can join across them.</td></tr>
<tr><td class="mono">crates/ingestd/</td><td>Data on-ramp. CSV / JSON / PDF (+OCR via Tesseract) / Postgres streaming / MySQL streaming / inbox watcher / cron schedules. Every ingest path auto-tags PII (emails, phones, SSNs, addresses), records lineage, and marks embeddings stale.</td></tr>
<tr><td class="mono">crates/vectord/</td><td>The vector + learning surface. Embeddings stored as Parquet (ADR-008), HNSW index (Phase 15), trial system (autotune), promotion registry (Phase 16), playbook_memory (Phase 19). Core feedback loop lives here.</td></tr>
<tr><td class="mono">crates/vectord-lance/</td><td>Firewall crate. Lance 4.0 + Arrow 57, isolated from the main Arrow-55 workspace. Provides secondary vector backend for large-scale, random-access, and append-heavy workloads (ADR-019).</td></tr>
<tr><td class="mono">crates/journald/</td><td>Append-only mutation event log (ADR-012). Every insert/update/delete writes here — who, when, what, old/new value. Never mutated. Foundation for time-travel + compliance audit.</td></tr>
<tr><td class="mono">crates/aibridge/</td><td>Rust ↔ Python sidecar. HTTP client over FastAPI wrapper around Ollama. VRAM introspection via nvidia-smi. All LLM calls (embed, generate, rerank) flow through here.</td></tr>
<tr><td class="mono">crates/gateway/</td><td>Axum HTTP (:3100) + gRPC (:3101). Auth middleware, tools registry (Phase 12 — governed actions), CORS. Every external request enters here.</td></tr>
<tr><td class="mono">crates/ui/</td><td>Dioxus WASM developer UI. Internal tool. Not exposed externally.</td></tr>
<tr><td class="mono">mcp-server/</td><td>Bun/TypeScript recruiter-facing app. Serves <code>devop.live/lakehouse</code>. Routes: <code>/search /match /log /log_failure /clients/:c/blacklist /intelligence/* /memory/query /models/matrix /system/summary</code>. Observer sibling at <code>observer.ts</code> with HTTP listener on :3800 for scenario event ingest. Proxies to the Rust gateway for heavy work.</td></tr>
<tr><td class="mono">tests/multi-agent/</td><td>Dual-agent scenario harness + memory stack. <code>agent.ts</code> (prompts, continuation + tree-split primitives, cloud routing), <code>orchestrator.ts</code>, <code>scenario.ts</code> (contracts + staffer + tool_level), <code>kb.ts</code> (KB indexing, competence scoring, neighbor retrieval), <code>normalize.ts</code> (input normalizer — structured / regex / LLM), <code>memory_query.ts</code> (unified /memory/query), <code>gen_scenarios.ts</code> + <code>gen_staffer_demo.ts</code> (corpus generators), <code>run_e2e_rated.ts</code>, <code>chain_of_custody.ts</code>. Unit tests colocated (<code>kb.test.ts</code>, <code>normalize.test.ts</code>).</td></tr>
<tr><td class="mono">config/</td><td><code>models.json</code> — authoritative 5-tier model matrix (T1 hot local / T2 review local / T3 overview cloud / T4 strategic / T5 gatekeeper). Per-tier context_window + context_budget + overflow_policy. Read at runtime by scenario.ts; hot-swap friendly.</td></tr>
<tr><td class="mono">docs/</td><td><code>PRD.md</code>, <code>PHASES.md</code>, <code>DECISIONS.md</code> (20 ADRs). Every significant architectural choice has an ADR with the alternatives that were rejected and why.</td></tr>
<tr><td class="mono">data/</td><td>Default local object store. Parquet files per dataset, append-log batches, HNSW trial journals, promotion registries, <code>_playbook_memory/state.json</code> (now with retirement fields — Phase 25), catalog manifests. Plus four learning-loop directories: <code>_kb/</code> (signatures, outcomes, recommendations, error_corrections, config_snapshots, staffers), <code>_playbook_lessons/</code> (T3 cross-day lessons archived per run), <code>_observer/ops.jsonl</code> (append journal, durable scenario outcome stream), <code>_chunk_cache/</code> (spec'd for Phase 21 Rust port). Rebuildable from repo + this dir alone.</td></tr>
</tbody>
</table>
</div>

<!-- ═══ 2. INGEST PIPELINE ═══ -->
<div class="chapter" id="ch2">
<div class="num">Chapter 2</div>
<h2>Data ingest pipeline</h2>
<div class="lede">How staffing data gets into the system — whether from a CSV drop, an ATS export, a Postgres replica, or a PDF resume. Every path ends at the same place: a registered dataset with known schema, known lineage, known sensitivity.</div>

<div class="step"><div class="n">1</div><div class="body"><strong>Source arrives.</strong> Four shapes: (a) file upload via <code>POST /ingest/file</code>, (b) inbox watcher (drops in <code>./inbox/</code> → auto-ingested in under 15s), (c) Postgres or MySQL streaming connector (<code>POST /ingest/db</code> with DSN), (d) scheduled ingest via <code>ingestd::schedule</code> with cron.</div></div>

<div class="step"><div class="n">2</div><div class="body"><strong>Parse + normalize.</strong> CSV parser infers types per column; defaults to <code>String</code> on ambiguity (ADR-010 — better to ingest everything than reject on type mismatch). JSON parser flattens nested objects. PDF extractor uses <code>lopdf</code> first; falls back to Tesseract OCR for scanned/image PDFs. Output is always an Arrow <code>RecordBatch</code>.</div></div>

<div class="step"><div class="n">3</div><div class="body"><strong>Auto-detect PII.</strong> <code>shared::pii</code> scans column values and names. Identifies emails, phone numbers, SSNs, salaries, street addresses, medical terms. Tags columns with <code>sensitivity: PII | PHI | Financial | Internal | Public</code> (Phase 10 catalog v2).</div></div>

<div class="step"><div class="n">4</div><div class="body"><strong>Deduplicate by content hash.</strong> Every uploaded file's SHA-256 is checked against the catalog's seen-hash log. Re-ingesting the same file is a no-op (ADR invariant #5).</div></div>

<div class="step"><div class="n">5</div><div class="body"><strong>Write Parquet to object storage.</strong> <code>arrow_helpers::record_batch_to_parquet</code> → <code>storaged::ops::put</code> → file lands under <code>data/datasets/&lt;name&gt;.parquet</code> (or bucket-scoped via <code>BucketRegistry</code>). Schema fingerprint computed.</div></div>

<div class="step"><div class="n">6</div><div class="body"><strong>Register in catalog.</strong> <code>catalogd::Registry::register(name, fingerprint, objects)</code> — idempotent on (name, fingerprint). Same name + same fingerprint = reuse manifest, bump updated_at. Same name + different fingerprint = <code>409 Conflict</code> (ADR-020 — prevents silent schema drift). New name = create new manifest with owner, lineage, freshness SLA, column metadata, PII tags.</div></div>

<div class="step"><div class="n">7</div><div class="body"><strong>Mark embeddings stale.</strong> If the dataset already has a vector index, the new rows mean that index is now behind. <code>Registry::mark_embeddings_stale</code> flips a flag; <code>POST /vectors/refresh/&lt;dataset&gt;</code> runs an incremental re-embed (only new rows, not the whole corpus).</div></div>

<div class="step"><div class="n">8</div><div class="body"><strong>Queryable immediately.</strong> <code>queryd::context</code> picks up the new manifest on next query. Hot-cache warms on first hit. Delta merge-on-read means updates land without rewriting the base Parquet.</div></div>
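<p>The type-inference rule in step 2 can be sketched as follows. This is a minimal illustration, not the <code>ingestd</code> parser: the function name <code>inferColumnType</code>, the type labels, and the regular expressions are assumptions made for the example. The only behavior taken from the step is the fallback to <code>String</code> whenever a column's values do not agree on a narrower type.</p>

```typescript
// Hypothetical type labels; the real parser emits Arrow data types.
type InferredType = "Int64" | "Float64" | "Boolean" | "String";

// Assumed patterns for this sketch only.
const INT_RE = /^-?\d+$/;
const FLOAT_RE = /^-?\d+(\.\d+)?([eE][-+]?\d+)?$/;
const BOOL_RE = /^(true|false)$/i;

// Returns true when every value in the list matches the pattern.
function allMatch(values: string[], re: RegExp): boolean {
  for (const v of values) {
    if (!re.test(v)) return false;
  }
  return true;
}

// Infer one column's type from its raw CSV cells.
// Empty cells are treated as nulls and ignored (an assumption).
function inferColumnType(cells: string[]): InferredType {
  const values = cells.map((c) => c.trim()).filter((c) => c.length > 0);
  if (values.length === 0) return "String"; // nothing to infer from
  if (allMatch(values, INT_RE)) return "Int64";
  if (allMatch(values, FLOAT_RE)) return "Float64";
  if (allMatch(values, BOOL_RE)) return "Boolean";
  return "String"; // ambiguity: ingest as text rather than reject
}
```

<p>Under these assumptions a column of <code>["1", "abc"]</code> is ingested as <code>String</code> instead of failing the whole file, which is the trade-off ADR-010 describes.</p>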

<div class="ref"><strong>Code:</strong> crates/ingestd/src/{service.rs, csv.rs, json.rs, pdf.rs, pg_stream.rs, my_stream.rs, schedule.rs}</div>
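<p>The three register outcomes in step 6 can be sketched with an in-memory stand-in. The class name <code>SketchRegistry</code>, the manifest fields, and the 200/201 status codes are assumptions for illustration; only the <code>409</code> on a changed fingerprint and the "reuse and bump updated_at" behavior come from the step itself.</p>

```typescript
interface Manifest {
  name: string;
  fingerprint: string;
  objects: string[];
  createdAt: number;
  updatedAt: number;
}

// 409 comes from the spec; 200 and 201 are assumed for this sketch.
type RegisterResult =
  | { status: 200 | 201; manifest: Manifest }
  | { status: 409; error: string };

class SketchRegistry {
  // Prototype-free object so dataset names never collide with built-in keys.
  private manifests: { [name: string]: Manifest } = Object.create(null);

  register(name: string, fingerprint: string, objects: string[], now: number): RegisterResult {
    const existing = this.manifests[name];
    if (existing === undefined) {
      // New name: create a fresh manifest.
      const manifest: Manifest = { name, fingerprint, objects, createdAt: now, updatedAt: now };
      this.manifests[name] = manifest;
      return { status: 201, manifest };
    }
    if (existing.fingerprint === fingerprint) {
      // Same name and same fingerprint: reuse the manifest, bump the timestamp.
      existing.updatedAt = now;
      return { status: 200, manifest: existing };
    }
    // Same name, different fingerprint: refuse rather than drift silently.
    return { status: 409, error: `schema fingerprint changed for dataset "${name}"` };
  }
}
```

<p>Calling <code>register</code> twice with identical arguments leaves one manifest in place, so a retried ingest is harmless; a changed schema has to be handled explicitly by the caller.</p>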
</div>

<!-- ═══ 3. MEASUREMENT & INDEXING ═══ -->
<div class="chapter" id="ch3">
<div class="num">Chapter 3</div>
<h2>Measurement &amp; indexing</h2>
<div class="lede">Once data is in, the system describes it rigorously and builds fast-access indexes over the parts that will be queried. Every measurement is deterministic, versioned, and visible via HTTP.</div>

<h3>What gets measured per dataset</h3>
<ul>
<li><strong>Row count</strong> (read from the Parquet footer, not computed with <code>SELECT COUNT</code>). O(1).</li>
<li><strong>Schema fingerprint</strong> — SHA-256 over (column_name, type, nullability, sort) tuples. Drives ADR-020 idempotent register.</li>
<li><strong>Owner / sensitivity / freshness SLA</strong> — catalog v2 metadata. PII auto-detected; owner assigned on ingest.</li>
<li><strong>Lineage</strong> — source_system → ingest_job → dataset. Who put this here, when, from what.</li>
<li><strong>Last embedded at</strong> — when the vector index covering this dataset was last refreshed. Drives stale-detection.</li>
</ul>

<h3>How vector indexes are built</h3>
<p>Two backends, chosen per profile (ADR-019):</p>
<table class="plain">
<thead><tr><th></th><th>HNSW over Parquet (primary)</th><th>Lance (secondary)</th></tr></thead>
<tbody>
<tr><td>Storage</td><td>Embeddings as Parquet columns (<code>doc_id, chunk_text, vector</code>)</td><td>Lance native dataset</td></tr>
<tr><td>Index</td><td>HNSW in RAM, serialized sidecar</td><td>IVF_PQ on disk</td></tr>
<tr><td>Build time (100K × 768d)</td><td>~230s</td><td>~16s (14× faster)</td></tr>
<tr><td>Search p50 (100K)</td><td>~873μs</td><td>~7.4ms at recall 1.0</td></tr>
<tr><td>Append</td><td>Rewrite required</td><td>Structural (0.08s for 100 rows)</td></tr>
<tr><td>Random fetch by doc_id</td><td>Full scan</td><td>~311μs (112× faster)</td></tr>
<tr><td>RAM ceiling</td><td>~5M vectors</td><td>Scales past RAM — disk-resident</td></tr>
</tbody>
</table>

<h3>Autotune</h3>
<p>The <code>vectord::agent</code> background task runs continuously. Per index, it proposes HNSW configurations (<code>ef_construction × ef_search</code>), executes a trial against a stored eval set, journals the result as JSONL, and — if recall beats the min_recall gate (0.9) and latency wins the Pareto test — promotes the new config atomically via <code>promotion_registry</code>. No downtime. Rollback in milliseconds.</p>
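<p>The promotion decision described above can be sketched as a gate plus a dominance check. This is one plausible reading, not the <code>vectord</code> implementation: the names <code>TrialResult</code>, <code>dominates</code>, and <code>shouldPromote</code> are hypothetical, and "wins the Pareto test" is interpreted here as "no worse on recall and latency, strictly better on at least one". The 0.9 gate is the value quoted above; treating it as inclusive is an assumption.</p>

```typescript
interface TrialResult {
  efConstruction: number;
  efSearch: number;
  recall: number;       // fraction of true neighbors found, 0..1
  p50LatencyUs: number; // median search latency in microseconds
}

const MIN_RECALL = 0.9; // gate value quoted in the spec

// a dominates b when it is no worse on both axes and strictly better on one.
function dominates(a: TrialResult, b: TrialResult): boolean {
  const noWorse = a.recall >= b.recall && a.p50LatencyUs <= b.p50LatencyUs;
  const strictlyBetter = a.recall > b.recall || a.p50LatencyUs < b.p50LatencyUs;
  return noWorse && strictlyBetter;
}

// Promote only when the candidate clears the recall gate and dominates
// the currently promoted configuration.
function shouldPromote(candidate: TrialResult, current: TrialResult): boolean {
  return candidate.recall >= MIN_RECALL && dominates(candidate, current);
}
```

<p>With this reading, a candidate that is faster but loses recall is never promoted, even if it still clears the gate.</p>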

<h3>Per-profile / per-staffer indexing</h3>
<p>Model profiles (Phase 17) are not routing strings — they are named scopes. Each profile has <code>bound_datasets[]</code>, <code>hnsw_config</code>, <code>vector_backend</code>, and <code>bucket</code>. When a staffer activates a profile:</p>
<ul>
<li>EmbeddingCache warms for bound indexes only</li>
<li>HNSW is rebuilt with the profile's config (if different from current)</li>
<li>Search via <code>POST /vectors/profile/&lt;id&gt;/search</code> rejects out-of-scope queries with 403 + list of allowed bindings</li>
<li>Ollama swaps to the profile's model via <code>keep_alive=0</code>; only one model in VRAM at a time</li>
</ul>
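<p>The scope rule in the list above (reject out-of-scope searches with 403 and the allowed bindings) can be sketched as a pure check. The names <code>Profile</code>, <code>ScopeCheck</code>, and <code>checkSearchScope</code>, and the shape of the rejection payload, are assumptions for illustration; the real handler lives in the Rust gateway.</p>

```typescript
interface Profile {
  id: string;
  boundDatasets: string[]; // mirrors bound_datasets[] on the profile
}

// Hypothetical result shape: 200 when in scope, 403 plus the allowed list otherwise.
type ScopeCheck = { status: 200 } | { status: 403; allowed: string[] };

function checkSearchScope(profile: Profile, dataset: string): ScopeCheck {
  if (profile.boundDatasets.indexOf(dataset) >= 0) {
    return { status: 200 };
  }
  // Return a copy so callers cannot mutate the profile's bindings.
  return { status: 403, allowed: profile.boundDatasets.slice() };
}
```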

<h3>Model matrix (Phase 20)</h3>
<p>Five tiers declared in <code>config/models.json</code>. Each call site picks the tier appropriate to its purpose — hot-path JSON emitters get fast local, overview/strategic/gatekeeper decisions get thinking models on cloud. Every tier carries <code>context_window</code>, <code>context_budget</code>, and <code>overflow_policy</code>.</p>
<table class="plain">
<thead><tr><th>Tier</th><th>Purpose</th><th>Primary model</th><th>Frequency</th></tr></thead>
<tbody>
<tr><td>T1 hot</td><td>Per tool call — SQL gen, hybrid_search, propose_done</td><td><code>qwen3.5:latest</code> local, <code>think:false</code></td><td>50-200/scenario</td></tr>
<tr><td>T2 review</td><td>Per-step consensus, drift flagging</td><td><code>qwen3:latest</code> local, <code>think:false</code></td><td>5-14/event</td></tr>
<tr><td>T3 overview</td><td>Mid-day checkpoints + cross-day lesson distill</td><td><code>gpt-oss:120b</code> Ollama Cloud, thinking on</td><td>1-3/scenario</td></tr>
<tr><td>T4 strategic</td><td>Pattern re-ranking, weekly gap audit</td><td><code>qwen3.5:397b</code> cloud</td><td>1-10/day</td></tr>
<tr><td>T5 gatekeeper</td><td>Schema migrations, autotune config changes</td><td><code>kimi-k2-thinking</code> cloud, audit-logged</td><td>1-5/day</td></tr>
</tbody>
</table>
<p><strong>Key mechanical finding (2026-04-21):</strong> qwen3.5 and qwen3 are <em>thinking</em> models — they burn ~650 tokens of hidden reasoning before emitting the visible response. For hot-path JSON emitters this meant 400-token budgets returned empty strings. Fix: <code>think: false</code> plumbed through sidecar's <code>/generate</code> endpoint; hot path disables thinking (structure matters more than reasoning depth), overseer tiers keep it on. Mistral was dropped entirely after a 0/14 fill rate on complex scenarios (decoder-level malformed-JSON bug, not a prompt issue).</p>
<p><strong>Continuation primitive (Phase 21):</strong> <code>generateContinuable()</code> handles output-overflow without <code>max_tokens</code> tourniquets — empty response → geometric backoff retry; truncated-JSON → continue with partial as scratchpad. <code>generateTreeSplit()</code> handles input-overflow via map-reduce with running scratchpad. Both respect <code>assertContextBudget()</code> so silent truncation can't happen.</p>

<h3>Per-staffer tool_level (Phase 23)</h3>
<p>Scenarios can be scoped to a specific coordinator (<code>staffer: {id, name, tenure_months, role, tool_level}</code>). <code>tool_level</code> controls which tiers are available:</p>
<ul>
<li><code>full</code> — T1/T2 local, T3 cloud, cloud rescue on failure</li>
<li><code>local</code> — T1/T2/T3 all local (gpt-oss:20b as overseer)</li>
<li><code>basic</code> — kimi-k2.5 cloud executor + local reviewer + local T3, no rescue</li>
<li><code>minimal</code> — kimi-k2.5 cloud executor, no T3, no rescue. Playbook inheritance is the only signal.</li>
</ul>
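<p>The four levels above can be written down as a small lookup table. The type and field names (<code>TierPlan</code>, <code>executor</code>, <code>overseerT3</code>, <code>cloudRescue</code>) are hypothetical, and the table only records what the list states. One entry is inferred rather than quoted: <code>local</code> is marked as having no cloud rescue because it uses no cloud tier.</p>

```typescript
type ToolLevel = "full" | "local" | "basic" | "minimal";

// Hypothetical shape; the real scenario.ts representation is not shown in this spec.
interface TierPlan {
  executor: "local" | "cloud";
  overseerT3: "cloud" | "local" | "none";
  cloudRescue: boolean;
}

const TOOL_LEVELS: Record<ToolLevel, TierPlan> = {
  full: { executor: "local", overseerT3: "cloud", cloudRescue: true },
  local: { executor: "local", overseerT3: "local", cloudRescue: false }, // rescue inferred, not stated
  basic: { executor: "cloud", overseerT3: "local", cloudRescue: false },
  minimal: { executor: "cloud", overseerT3: "none", cloudRescue: false },
};

// True when the level has any T3 overseer, local or cloud.
function hasOverseer(level: ToolLevel): boolean {
  return TOOL_LEVELS[level].overseerT3 !== "none";
}
```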
<p>Measured 2026-04-21 on a 36-run demo (4 staffers × 3 contracts × 3 rounds): James Park (mid, local tools) ranked first at 92.9% fill and 36.8 cites/run, Maria Chen (senior, full tools) second at 81.0%. On this workload, cloud T3 added latency without measurable benefit. Alex Rivera (trainee, minimal) still hit 59.5% fill purely from playbook inheritance, which suggests the memory carries knowledge on its own as long as the executor model is capable.</p>

<div class="ref"><strong>Code:</strong> crates/vectord/src/{hnsw.rs, autotune.rs, agent.rs, promotion.rs} · tests/multi-agent/{agent.ts, scenario.ts} · config/models.json · ADR-019</div>
</div>

<!-- ═══ 4. CONTRACT INFERENCE ═══ -->
<div class="chapter" id="ch4">
<div class="num">Chapter 4</div>
<h2>Contract inference from external signal</h2>
<div class="lede">Most CRMs wait for a contract to land. This system watches upstream demand and pre-builds the ranking before the contract lands.</div>

<p>The concrete example running on <code>devop.live/lakehouse</code> is Chicago Department of Buildings permit data (public Socrata API). Every permit is a signal that construction — and therefore staffing — is coming.</p>

<h3>Flow</h3>
<div class="step"><div class="n">1</div><div class="body"><strong>Fetch.</strong> <code>/intelligence/market</code> and <code>/intelligence/permit_contracts</code> hit <code>data.cityofchicago.org/resource/ydr8-5enu.json</code> live. No caching of permit data — every page load is fresh.</div></div>
<div class="step"><div class="n">2</div><div class="body"><strong>Map work_type → role.</strong> Industry dictionary: "Electrical Work" → "Electrician", "Masonry Work" → "Production Worker", "Mechanical Work" → "Maintenance Tech", etc.</div></div>
<div class="step"><div class="n">3</div><div class="body"><strong>Derive worker count.</strong> Heuristic: ~1 worker per $150K of permit cost, capped 2-8 per contract for staffing realism. Operator-configurable when real client history is available.</div></div>
<div class="step"><div class="n">4</div><div class="body"><strong>Derive timeline.</strong> Permit issued → construction starts ~45 days later → staffing window opens ~14 days before construction. Classifies each permit as <code>overdue</code>, <code>urgent</code>, <code>soon</code>, <code>scheduled</code>.</div></div>
<div class="step"><div class="n">5</div><div class="body"><strong>Run hybrid search against the bench.</strong> For each derived contract, <code>POST /vectors/hybrid</code> with <code>sql_filter</code> on role+state+city+availability, <code>use_playbook_memory: true</code>, <code>playbook_memory_k: 200</code>. Returns top-5 candidates with boost + citations.</div></div>
<div class="step"><div class="n">6</div><div class="body"><strong>Query the meta-index.</strong> <code>POST /vectors/playbook_memory/patterns</code> aggregates traits across similar past playbooks — recurring certs, skills, archetype, reliability distribution. Surfaces signal the operator didn't query for.</div></div>
<div class="step"><div class="n">7</div><div class="body"><strong>Render on the dashboard.</strong> Each card shows permit + derived contract + top 3 candidates with memory chips + discovered pattern + urgency. All of this is pre-computed before any staffer opens the UI.</div></div>
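<p>Steps 2 through 4 are plain arithmetic and can be sketched directly. The function names are hypothetical, the role dictionary holds only the three entries quoted in step 2, and rounding to the nearest worker is an assumption. The urgency cutoffs (0, 7, and 21 days before the staffing window opens) are invented for this example, since the flow does not state them; the 45-day and 14-day offsets and the 2 to 8 cap are taken from the steps.</p>

```typescript
// Entries quoted in step 2; the real dictionary has more.
const ROLE_BY_WORK_TYPE: { [workType: string]: string } = {
  "Electrical Work": "Electrician",
  "Masonry Work": "Production Worker",
  "Mechanical Work": "Maintenance Tech",
};

function mapRole(workType: string): string | null {
  return Object.prototype.hasOwnProperty.call(ROLE_BY_WORK_TYPE, workType)
    ? ROLE_BY_WORK_TYPE[workType]
    : null;
}

// Step 3: about one worker per $150K, clamped to 2..8. Rounding is assumed.
function deriveWorkerCount(permitCostUsd: number): number {
  const raw = Math.round(permitCostUsd / 150000);
  return Math.min(8, Math.max(2, raw));
}

type Urgency = "overdue" | "urgent" | "soon" | "scheduled";

const DAY_MS = 24 * 60 * 60 * 1000;
const CONSTRUCTION_LAG_DAYS = 45; // permit issued -> construction start
const STAFFING_LEAD_DAYS = 14;    // staffing window opens this long before start

// Whole days from `todayIso` until the staffing window opens (negative if already open).
// Both arguments are ISO-8601 timestamps in UTC.
function daysUntilStaffingWindow(issueDateIso: string, todayIso: string): number {
  const issued = Date.parse(issueDateIso);
  const today = Date.parse(todayIso);
  const windowOpens = issued + (CONSTRUCTION_LAG_DAYS - STAFFING_LEAD_DAYS) * DAY_MS;
  return Math.floor((windowOpens - today) / DAY_MS);
}

// Hypothetical cutoffs; the spec names the classes but not the thresholds.
function classifyUrgency(daysUntilWindow: number): Urgency {
  if (daysUntilWindow < 0) return "overdue";
  if (daysUntilWindow <= 7) return "urgent";
  if (daysUntilWindow <= 21) return "soon";
  return "scheduled";
}
```

<p>Under these assumptions a $600K electrical permit yields an "Electrician" contract for 4 workers, and its staffing window opens 31 days after the issue date.</p>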

<h3>Coverage forecast</h3>
<p><code>/intelligence/staffing_forecast</code> aggregates the last 30 days of permits into predicted role-level demand, joins against the IL bench supply, computes coverage %, and classifies each role as <code>critical</code> / <code>tight</code> / <code>watch</code> / <code>ok</code>. The dashboard's top panel renders this — staffers see supply gaps before they query.</p>
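<p>The coverage calculation above reduces to a ratio and a bucket. In this sketch the names <code>RoleForecast</code>, <code>classifyCoverage</code>, and <code>forecastRole</code> are hypothetical, and the cutoffs (50%, 80%, 100%) are placeholders chosen for illustration; the endpoint's actual thresholds are not stated here.</p>

```typescript
type CoverageClass = "critical" | "tight" | "watch" | "ok";

interface RoleForecast {
  role: string;
  demand: number;      // predicted workers needed from recent permits
  supply: number;      // available bench workers for the role
  coveragePct: number; // supply / demand * 100
  status: CoverageClass;
}

// Placeholder thresholds, not taken from the spec.
function classifyCoverage(coveragePct: number): CoverageClass {
  if (coveragePct < 50) return "critical";
  if (coveragePct < 80) return "tight";
  if (coveragePct < 100) return "watch";
  return "ok";
}

function forecastRole(role: string, demand: number, supply: number): RoleForecast {
  // With no predicted demand, treat the role as fully covered.
  const coveragePct = demand > 0 ? (supply / demand) * 100 : 100;
  return { role, demand, supply, coveragePct, status: classifyCoverage(coveragePct) };
}
```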
</div>

<!-- ═══ 5. WHAT A CRM CAN'T DO ═══ -->
<div class="chapter" id="ch5">
<div class="num">Chapter 5</div>
<h2>What a CRM can't do (and why)</h2>
<div class="lede">A CRM stores. This system infers, predicts, re-ranks, and compounds. The nine capabilities marked "No" in the CRM column below are load-bearing: missing any of them is the gap between "software that logs calls" and "software that makes the next call better."</div>

<div class="card accent-b">
<table class="plain">
<thead><tr><th>Capability</th><th>CRM</th><th>This system</th></tr></thead>
<tbody>
<tr><td>Store candidate records</td><td>Yes</td><td>Yes (workers_500k, candidates)</td></tr>
<tr><td>Search by structured field</td><td>Yes</td><td>Yes (DataFusion SQL, sub-100ms on 3M rows)</td></tr>
<tr><td>Search by semantic meaning</td><td>No</td><td>Yes (HNSW + nomic-embed-text)</td></tr>
<tr><td>Combine SQL filter + semantic rank</td><td>No</td><td>Yes (<code>/vectors/hybrid</code>)</td></tr>
<tr><td>Boost workers based on past success</td><td>No</td><td>Yes (Phase 19 playbook_memory)</td></tr>
<tr><td>Penalize workers based on past failure</td><td>No</td><td>Yes (<code>/log_failure</code> + <code>0.5<sup>n</sup></code> penalty)</td></tr>
<tr><td>Surface traits across past fills</td><td>No</td><td>Yes (<code>/vectors/playbook_memory/patterns</code>)</td></tr>
<tr><td>Predict staffing demand from external data</td><td>No</td><td>Yes (Chicago permit feed + 30-day rolling forecast)</td></tr>
<tr><td>Count down to staffing deadline per contract</td><td>No</td><td>Yes (permit issue_date + heuristic timeline)</td></tr>
<tr><td>Explain why each candidate ranked</td><td>No</td><td>Yes (boost chip + narrative citations + memory pattern)</td></tr>
<tr><td>Improve ranking from operator actions</td><td>No</td><td>Yes (every Call/SMS/No-show click → re-rank signal)</td></tr>
</tbody>
</table>
</div>
</div>

<!-- ═══ 6. HOW IT GETS BETTER ═══ -->
<div class="chapter" id="ch6">
<div class="num">Chapter 6</div>
<h2>How it gets better over time</h2>
<div class="lede">Compounding learning across seven paths. The first three are automatic background loops. Paths 4-7 landed 2026-04-21 and turn the system into a reinforcement-learning pipeline: outcomes → knowledge base → pathway recommendations → cloud rescue → competence-weighted retrieval → observer analysis. All seven happen without operator intervention.</div>

<h3>Path 1 — Playbook boost with geo + role prefilter (Phase 19 + refinement)</h3>
<p>Every sealed fill is seeded to <code>playbook_memory</code>. The boost fires inside <code>/vectors/hybrid</code> when <code>use_playbook_memory: true</code>. Math, tightened 2026-04-21 after a diagnostic pass found globally-ranked playbooks were missing the SQL-filtered candidate pool entirely:</p>
<pre>per_worker = cosine(query_emb, playbook_emb) × 0.5 × e^(-age/30) × 0.5^failures / n_workers
boost[(city, state, name)] = min(Σ per_worker, 0.25)</pre>
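<p>The formula above translates almost line for line into code. This sketch makes three assumptions that the formula does not spell out: <code>age</code> is measured in days, <code>failures</code> is counted per worker (as suggested by the <code>/log_failure</code> penalty), and the sum runs over every playbook that lists the worker. The type and function names are hypothetical.</p>

```typescript
interface WorkerKey {
  city: string;
  state: string;
  name: string;
}

interface PlaybookHit {
  cosine: number;  // cosine(query_emb, playbook_emb)
  ageDays: number; // assumed unit for `age`
  workers: WorkerKey[];
}

const BOOST_CAP = 0.25;

function keyOf(w: WorkerKey): string {
  return [w.city, w.state, w.name].join("|");
}

// Sum per-worker contributions across playbooks, then cap each total at 0.25.
function computeBoosts(
  hits: PlaybookHit[],
  failuresByWorker: { [key: string]: number }
): { [key: string]: number } {
  const sums: { [key: string]: number } = {};
  for (const hit of hits) {
    const n = hit.workers.length;
    if (n === 0) continue;
    const base = hit.cosine * 0.5 * Math.exp(-hit.ageDays / 30);
    for (const w of hit.workers) {
      const k = keyOf(w);
      const failures = failuresByWorker[k] || 0;
      const perWorker = (base * Math.pow(0.5, failures)) / n;
      sums[k] = (sums[k] || 0) + perWorker;
    }
  }
  const boosts: { [key: string]: number } = {};
  for (const k of Object.keys(sums)) {
    boosts[k] = Math.min(sums[k], BOOST_CAP);
  }
  return boosts;
}
```

<p>A perfect-match playbook from today with a single worker contributes 0.5 before the cap, so the cap is what keeps one strong playbook from dominating the ranking.</p>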
<p><strong>Multi-strategy retrieval (new):</strong> before cosine, <code>compute_boost_for_filtered_with_role(target_geo, target_role)</code> prefilters to same-city playbooks, then gives exact (role, city, state) matches similarity=1.0 and fills up to half the top-k. Cosine fills the rest. Mirrors 2026 Mem0/Zep guidance on parallel-strategy rerank.</p>
<p><strong>Measured lift:</strong> before geo-filter, Nashville Welder query returned <code>boosts=170 matched=0</code> (zero intersection with candidate pool). After: <code>boosts=36 matched=11</code>. On the Riverfront Steel scenario, total playbook citations went from 2 → 28 per run — a 14× delta on identical inputs. The diagnostic log <code>playbook_boost: boosts=N sources=N parsed=N matched=N target_geo=? target_role=?</code> runs on every hybrid call so the class of silent-miss bug stays visible.</p>

<h3>Path 2 — Pattern discovery (meta-index)</h3>
<p><code>/vectors/playbook_memory/patterns</code> goes beyond "who was endorsed" to answer "what did past similar fills have in common?" Aggregates recurring certifications, skills, archetype, reliability distribution across the top-K semantically similar playbooks. Surfaces signal the operator didn't explicitly query for.</p>

<h3>Path 3 — Autotune agent</h3>
<p>The <code>vectord::agent</code> background task runs continuously. Watches the HNSW trial journal, proposes configs, executes trials, promotes Pareto winners — without human intervention. Operator sees "the index got faster overnight" and doesn't know why. The journal knows why.</p>

<h3>Path 4 — Knowledge Base + pathway recommender (Phase 22)</h3>
<p>Meta-layer over playbook_memory. Files under <code>data/_kb/</code>:</p>
<ul>
<li><code>signatures.jsonl</code> — sig_hash + embedding of every run's event shape</li>
<li><code>outcomes.jsonl</code> — per-run summary (models, fill rate, turns, citations, rescue stats, staffer, elapsed)</li>
<li><code>pathway_recommendations.jsonl</code> — AI-synthesized advice for the next run of a similar sig</li>
<li><code>error_corrections.jsonl</code> — fail→succeed deltas on the same sig</li>
<li><code>config_snapshots.jsonl</code> — hash of active models + tool_level at each run</li>
</ul>
<p>Cycle: scenario ends → <code>kb.indexRun()</code> appends outcome → <code>kb.recommendFor(nextSpec)</code> finds k-NN signatures, feeds outcome history to an overview model, writes structured JSON advice → next scenario reads it via <code>kb.loadRecommendation(spec)</code> and injects <code>pathway_notes</code> into the executor's context alongside prior T3 lessons.</p>

<h3>Path 5 — Cloud rescue on failure (Phase 22 item B)</h3>
<p>When an event fails (drift abort, JSON parse error, pool exhaustion) and cloud T3 is enabled, <code>requestCloudRemediation()</code> feeds the full failure trace (SQL filters attempted, row counts, reviewer drift notes, gap signals, contract terms) to <code>gpt-oss:120b</code> on Ollama Cloud. The cloud model returns a structured <code>{retry, new_city, new_state, new_role, new_count, rationale}</code> object, and the event retries once with the pivot. Verified on stress_01: an event misplaced in Gary, IN (zero workers indexed) → cloud proposed South Bend, IN → retry filled 1/1.</p>
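<p>Because the remediation arrives as model output, it is worth validating before the retry. The sketch below shows one way to do that. The field names match the structure quoted above, but their types (boolean, string, positive number) are assumptions, and <code>parseRemediation</code>, <code>EventSpec</code>, and <code>pivotEvent</code> are hypothetical helpers, not functions from the harness.</p>

```typescript
interface Remediation {
  retry: boolean;
  new_city: string;
  new_state: string;
  new_role: string;
  new_count: number;
  rationale: string;
}

// Hypothetical minimal event shape for this sketch.
interface EventSpec {
  role: string;
  city: string;
  state: string;
  count: number;
}

// Returns null for anything that is not valid JSON with the expected fields.
function parseRemediation(raw: string): Remediation | null {
  let data: any;
  try {
    data = JSON.parse(raw);
  } catch (e) {
    return null;
  }
  if (data === null || typeof data !== "object") return null;
  const valid =
    typeof data.retry === "boolean" &&
    typeof data.new_city === "string" &&
    typeof data.new_state === "string" &&
    typeof data.new_role === "string" &&
    typeof data.new_count === "number" &&
    data.new_count > 0 &&
    typeof data.rationale === "string";
  if (!valid) return null;
  return {
    retry: data.retry,
    new_city: data.new_city,
    new_state: data.new_state,
    new_role: data.new_role,
    new_count: data.new_count,
    rationale: data.rationale,
  };
}

// Builds the single retry attempt, or null when the model advises against retrying.
function pivotEvent(fix: Remediation): EventSpec | null {
  if (!fix.retry) return null;
  return { role: fix.new_role, city: fix.new_city, state: fix.new_state, count: fix.new_count };
}
```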

<h3>Path 6 — Staffer competence-weighted retrieval (Phase 23)</h3>
<p>Answers "who handled this" as a first-class matrix-index dimension. Each scenario carries <code>staffer: {id, name, tenure_months, role, tool_level}</code>. After every run, <code>recomputeStafferStats(staffer_id)</code> aggregates their fill_rate, turn efficiency, citation density, rescue rate into a single <code>competence_score</code> (0.45·fill + 0.20·turn_eff + 0.20·cites + 0.15·rescue).</p>
<p><code>findNeighbors</code> returns <code>weighted_score = cosine × max_staffer_competence</code> — top-performer playbooks rank above juniors' on similar scenarios. Auto-discovery emerges: running 4 staffers × 3 contracts × 3 rounds surfaced Rachel D. Lewis (Welder Nashville) with 18 endorsements across all 4 staffers, Angela U. Ward (Machine Op Indianapolis) with 19 — reliable-performer labels the system built without human tagging.</p>
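<p>The two formulas in this path can be sketched together. The weights are the ones quoted above; everything else is an assumption for illustration: each input is taken to be normalized to 0..1 with higher meaning better, <code>max_staffer_competence</code> is read as the highest competence among the staffers attached to a neighbor signature, and the type and function names are hypothetical.</p>

```typescript
// Each component is assumed to be normalized to 0..1, higher is better.
interface StafferStats {
  fill: number;
  turnEff: number;
  cites: number;
  rescue: number;
}

function competenceScore(s: StafferStats): number {
  return 0.45 * s.fill + 0.20 * s.turnEff + 0.20 * s.cites + 0.15 * s.rescue;
}

interface Neighbor {
  sigHash: string;
  cosine: number;
  stafferCompetences: number[]; // scores of staffers who ran this signature
}

interface RankedNeighbor {
  sigHash: string;
  weightedScore: number;
}

// weighted_score = cosine x max staffer competence, sorted best first.
function rankNeighbors(neighbors: Neighbor[]): RankedNeighbor[] {
  return neighbors
    .map((n) => {
      const maxCompetence = n.stafferCompetences.reduce((m, c) => Math.max(m, c), 0);
      return { sigHash: n.sigHash, weightedScore: n.cosine * maxCompetence };
    })
    .sort((a, b) => b.weightedScore - a.weightedScore);
}
```

<p>The effect is that a slightly less similar signature handled by a strong staffer can outrank a closer match handled only by a weak one.</p>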

<h3>Path 7 — Observer outcome ingest (Phase 24)</h3>
<p>Observer runs as <code>lakehouse-observer.service</code>, now with an HTTP listener on <code>:3800</code>. Scenarios POST per-event outcomes to <code>/event</code> with full provenance (staffer_id, sig_hash, event_kind, role, city, state, rescue flags). Observer's ERROR_ANALYZER and PLAYBOOK_BUILDER loops consume them alongside MCP-wrapped ops. Persistence switched from the old <code>/ingest/file</code> REPLACE path to an append-only <code>data/_observer/ops.jsonl</code> journal so the trace survives across restarts.</p>

<h3>Input normalizer + unified memory query</h3>
<p>Two surfaces added 2026-04-21 to make the memory stack respond coherently to any input shape:</p>
<ul>
<li><code>normalizeInput(raw)</code> — accepts structured JSON, natural language, or mixed. Three-tier: structured fast-path → regex (handles "need 3 welders in Nashville, TN" in 0ms without an LLM call) → qwen3 LLM fallback for low-signal inputs.</li>
<li><code>POST /memory/query</code> — one endpoint, returns every memory surface in a single bundle: <code>playbook_workers</code> (with boost + citations), <code>pathway_recommendation</code> (KB), <code>neighbor_signatures</code> (competence-weighted), <code>prior_lessons</code> (T3 overseer history), <code>top_staffers</code> (leaderboard), <code>discovered_patterns</code> (auto-surfaced reliable workers for this role+city), <code>latency_ms</code> (per-source). Natural-language query end-to-end: 319ms.</li>
</ul>
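<p>The regex tier of <code>normalizeInput</code> might look roughly like the sketch below. The pattern, the <code>NormalizedRequest</code> shape, and the <code>regexTier</code> name are illustrative assumptions; the real normalizer's patterns are not reproduced in this spec, and this one only covers the "need N ROLE in CITY, ST" phrasing.</p>
<pre>
interface NormalizedRequest {
  count: number;
  role: string;
  city: string;
  state: string;
}

// Illustrative pattern only: "need N ROLE in CITY, ST".
const NEED_PATTERN = /need\s+(\d+)\s+(.+?)\s+in\s+([A-Za-z .'-]+?),\s*([A-Z]{2})\b/i;

// Returns null on low-signal input so the caller can fall through to the LLM tier.
function regexTier(raw: string): NormalizedRequest | null {
  const m = NEED_PATTERN.exec(raw);
  if (m === null) return null;
  return {
    count: parseInt(m[1], 10),
    role: m[2].trim().toLowerCase(),
    city: m[3].trim(),
    state: m[4].toUpperCase(),
  };
}
</pre>
<p>Returning <code>null</code> instead of a partial guess is what keeps the tiers cleanly separated: the cheap path either answers fully or hands the input on unchanged.</p>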

<h3>Honest gaps — what we can still implement</h3>
<p>Three of the five 2026-era memory findings remain unwired. Flagged for near-term implementation, not hidden:</p>
<ul>
<li><strong>Zep-style validity windows</strong> — playbook entries have timestamps but no <code>valid_until</code> or <code>schema_fingerprint</code>. Load-bearing: when a schema migration changes a column, stale playbooks silently keep boosting. Biggest-value remaining fix.</li>
<li><strong>Mem0-style UPDATE / DELETE / NOOP ops</strong> — <code>/seed</code> only ADDs. Same (operation, date) pair appends a duplicate instead of refining an existing entry. Playbook file grows faster than necessary.</li>
<li><strong>Letta working-memory hot cache</strong> — every query scans all 1560 playbook entries from disk. Cheap today at 1.5K, not at 100K. Solution: LRU in-process cache for the last N playbooks or the current sig's neighborhood.</li>
</ul>
<p>Validity windows are next. They preserve the trust signal (boost only fires on playbooks that are still true given the current schema), which matters more today than the latency signal (which the current scale doesn't need yet).</p>
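<p>A validity-window check could be as small as the sketch below. Everything here is hypothetical: the <code>PlaybookEntry</code> shape, the <code>isBoostEligible</code> name, and the choice to treat entries without the new fields as still eligible are assumptions made for illustration.</p>
<pre>
// Hypothetical entry shape: valid_until and schema_fingerprint are the proposed fields.
interface PlaybookEntry {
  id: string;
  valid_until: string | null; // ISO-8601 timestamp; null means no expiry set
  schema_fingerprint: string | null; // null means recorded before fingerprints existed
}

// An entry may boost only if it has not expired and its schema still matches.
function isBoostEligible(entry: PlaybookEntry, currentFingerprint: string, now: Date): boolean {
  if (entry.valid_until !== null) {
    if (now.getTime() > Date.parse(entry.valid_until)) return false; // expired
  }
  if (entry.schema_fingerprint !== null) {
    if (entry.schema_fingerprint !== currentFingerprint) return false; // schema drifted
  }
  return true;
}
</pre>
<p>Letting legacy entries pass is a judgment call. The stricter alternative, rejecting any entry without a fingerprint, would silence the stale-boost problem immediately but would also discard every playbook recorded before the fields existed.</p>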

<div class="ref"><strong>Code:</strong> crates/vectord/src/{playbook_memory.rs, service.rs} · tests/multi-agent/{kb.ts, memory_query.ts, normalize.ts, scenario.ts} · mcp-server/{observer.ts, index.ts} · data/_kb/ · data/_observer/</div>
</div>

<!-- ═══ 7. SCALE STORY ═══ -->
<div class="chapter" id="ch7">
<div class="num">Chapter 7</div>
<h2>Scale story — 20 staffers, 300 contracts, a surge</h2>
<div class="lede">What happens when demo-level load becomes production-level load and, at midday, a client pushes 20 more contracts plus a 1M-row ATS delta. Honest: some of this is architectural headroom, not measured scale. The designed behaviors are below.</div>

<h3>20 concurrent staffers</h3>
<p><strong>Axum is async.</strong> The gateway handles concurrent requests on Tokio with work-stealing. No per-request thread. Tested at 10 parallel queries in 82ms total on this hardware.</p>
<p><strong>Per-staffer profile isolation.</strong> Each staffer activates their own profile (Phase 17) or workspace (Phase 8.5). Profile scopes their search to bound datasets. Workspace carries their in-progress contracts across sessions.</p>
<p><strong>Per-client blacklists.</strong> Auto-applied when the caller passes <code>client: "X"</code> on <code>/search</code>. Staffer A filling for Acme never sees Acme's flagged workers. Staffer B filling for MidState sees them normally.</p>

<h3>300 active contracts</h3>
<p><strong>SQL on <code>job_orders</code> is cheap.</strong> 300 rows is nothing — a scan is microseconds.</p>
<p><strong>Workspace per contract.</strong> Each contract gets its own workspace with saved searches, shortlists, activity log. Zero-copy handoff between staffers (pointer swap, not data copy).</p>
<p><strong>Forecast remains coherent.</strong> <code>/intelligence/staffing_forecast</code> aggregates 30-day permit data regardless of contract count. The bench supply query (<code>GROUP BY role</code> over workers_500k) is a single sub-second SQL query.</p>

<h3>Midday surge: +20 contracts, +1M profiles</h3>
<p>The delta arrives at 12:30. Here's what happens in the following minutes:</p>
<div class="step"><div class="n">1</div><div class="body"><strong>+20 contracts via /ingest/db or /ingest/file.</strong> Parsed, schema-checked, Parquet-written, catalog-registered. No queries blocked — register holds a write lock across the manifest write only.</div></div>
<div class="step"><div class="n">2</div><div class="body"><strong>+1M worker profiles arrive as a delta to workers_500k.</strong> The append-log pattern (ADR-018) means the new rows are written to a fresh batch file; the base Parquet is NOT rewritten. Queries against workers_500k immediately merge-on-read the new batches.</div></div>
<div class="step"><div class="n">3</div><div class="body"><strong>Embeddings marked stale.</strong> The vector index for workers_500k_v1 now has 1M rows it hasn't seen. <code>mark_embeddings_stale</code> flips the flag.</div></div>
<div class="step"><div class="n">4</div><div class="body"><strong>Incremental refresh fires.</strong> <code>POST /vectors/refresh/workers_500k</code> reads only the new rows (diff against existing embeddings), embeds them in batches of 64 via Ollama, writes delta embedding Parquet. Measured on threat_intel: 34 new rows in 970ms (6× faster than full re-embed).</div></div>
<div class="step"><div class="n">5</div><div class="body"><strong>Search degrades gracefully.</strong> During the refresh, searches against workers_500k_v1 still work — they serve from the old embeddings. Brute-force cosine over new-rows-without-embeddings is allowed but costs more. HNSW rebuild happens after all embeds complete.</div></div>
<div class="step"><div class="n">6</div><div class="body"><strong>Hot-swap promotion.</strong> When the new index is ready, <code>promotion_registry</code> atomically flips the active pointer. Next search hits the new config. Rollback stays available.</div></div>
<div class="step"><div class="n">7</div><div class="body"><strong>Autotune re-enters the loop.</strong> The agent queue picks up a <code>DatasetAppended</code> trigger and schedules a fresh HNSW trial cycle against the expanded index.</div></div>
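<p>The delta selection and batching in step 4 can be sketched as two small helpers. The function names and the use of string row ids are assumptions; the actual refresh lives in the Rust vector service and is not reproduced here.</p>
<pre>
// Rows in the dataset that have no embedding yet (the diff against existing embeddings).
function rowsNeedingEmbedding(allRowIds: string[], embeddedIds: string[]): string[] {
  const seen = new Set(embeddedIds);
  return allRowIds.filter((id) => !seen.has(id));
}

// Split ids into fixed-size batches; the final batch may be shorter.
function toBatches(ids: string[], batchSize: number): string[][] {
  if (!Number.isInteger(batchSize) || !(batchSize >= 1)) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: string[][] = [];
  for (let i = 0; ids.length > i; i += batchSize) {
    batches.push(ids.slice(i, i + batchSize));
  }
  return batches;
}

// Usage: only the unseen rows are embedded, 64 at a time.
const pending = rowsNeedingEmbedding(["r1", "r2", "r3"], ["r1"]);
const batches = toBatches(pending, 64); // one batch containing "r2" and "r3"
</pre>
<p>The set lookup keeps the diff linear in the number of rows, which matters when the base dataset is far larger than the delta.</p>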

<h3>Known pain points at this scale</h3>
<ul>
<li><strong>Ollama inference is serial.</strong> Embedding 1M rows at ~50 chunks/sec through nomic-embed-text = ~6 hours. Acceptable for overnight refresh, not for "immediate." Mitigated by incremental refresh (only deltas).</li>
<li><strong>RAM ceiling on HNSW.</strong> Around 5M vectors × 768d, HNSW stops fitting in 128GB comfortably. Mitigation: per-profile <code>vector_backend: lance</code> flip — disk-resident IVF_PQ scales past the RAM line (ADR-019).</li>
<li><strong>VRAM ceiling for model variety.</strong> A4000 16GB holds 1-2 loaded models. Multi-model recruiter surfaces are a sequential swap, not parallel (Ollama <code>keep_alive=0</code>). Phase 17 profile activation unloads the prior model on swap.</li>
<li><strong>playbook_memory growth.</strong> 1936 entries today. Phase 25 (2026-04-21) added retirement via <code>valid_until</code> + <code>schema_fingerprint</code> fields + <code>POST /vectors/playbook_memory/retire</code> endpoint (manual or schema-drift triggered). Active vs retired split surfaced on <code>GET /vectors/playbook_memory/status</code>. Brute-force cosine still sub-ms at current size; Letta-style working-memory hot cache deferred until entry count crosses ~100K.</li>
</ul>
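<p>The embedding-time estimate in the first bullet is straightforward arithmetic, shown here as a small sketch. It assumes one chunk per row and a steady 50 chunks/sec; the <code>embedSeconds</code> helper is hypothetical.</p>
<pre>
// Serial embedding time in seconds: rows divided by throughput.
function embedSeconds(rows: number, chunksPerSec: number): number {
  if (!(chunksPerSec > 0)) throw new Error("chunksPerSec must be positive");
  return rows / chunksPerSec;
}

const fullSeconds = embedSeconds(1_000_000, 50);
console.log(fullSeconds); // prints 20000
console.log((fullSeconds / 3600).toFixed(1)); // prints 5.6
</pre>
<p>That is roughly five and a half hours of serial inference for a full 1M-row pass, which is why the incremental path that embeds only the delta carries most of the load.</p>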
</div>

<!-- ═══ 8. ERROR SURFACES ═══ -->
<div class="chapter" id="ch8">
<div class="num">Chapter 8</div>
<h2>Error surfaces &amp; recovery</h2>
<div class="lede">Every failure mode has a named surface, a structured response, and a recovery path. No silent failures.</div>

<table class="plain">
<thead><tr><th>Failure mode</th><th>Surface / response</th><th>Recovery</th></tr></thead>
<tbody>
<tr><td>Ingest receives file with schema mismatch vs existing dataset</td><td><code>409 Conflict</code> with both fingerprints named (ADR-020)</td><td>Re-ingest under a new name, or migrate the existing via Phase 14 schema evolution</td></tr>
<tr><td>Bucket unreachable on write</td><td>Hard 503, error journaled to <code>primary://_errors/bucket_errors/</code></td><td><code>GET /storage/errors</code> lists failures; <code>GET /storage/bucket-health</code> shows per-bucket status</td></tr>
<tr><td>Bucket unreachable on read</td><td>Rescue bucket fallback, <code>X-Lakehouse-Rescue-Used: true</code> header on response</td><td>Response still succeeds; operator sees rescue flag</td></tr>
<tr><td>/log receives name that doesn't exist in workers_500k</td><td>Seed is SKIPPED; response includes <code>rejected_ghost_names: [...]</code> and a note</td><td>Operator sees exactly which names were rejected and why</td></tr>
<tr><td>Dual-agent executor malforms tool call</td><td>Result appended to log with <code>error</code> field; counter increments</td><td>After 3 consecutive: abort with full log dump at <code>tests/multi-agent/playbooks/&lt;id&gt;-FAILED.json</code></td></tr>
<tr><td>Dual-agent drifts from target</td><td>Reviewer verdict = <code>drift</code>, counter increments</td><td>After 3 consecutive drifts: abort with full log</td></tr>
<tr><td>Hybrid search finds zero candidates</td><td>Returns empty <code>sources[]</code> + <code>sql_matches: 0</code></td><td>Gap signal captured by scenario runner; operator prompted to broaden filter</td></tr>
<tr><td>Ollama sidecar down</td><td>502 Bad Gateway from aibridge; <code>embed</code> calls fail fast</td><td>Restart: <code>systemctl restart lakehouse-sidecar</code>; vector search falls back to pre-computed embeddings</td></tr>
<tr><td>Gateway restart mid-operation</td><td>In-memory state (playbook_memory, HNSW) reloaded from persisted <code>state.json</code> / trial journals</td><td>Zero data loss; catalog, storage, journals are all source-of-truth</td></tr>
<tr><td>Schema fingerprint diverges across manifests</td><td><code>catalog::dedupe</code> reports <code>DedupeReport</code> with winner selection (non-null row_count first, then newest updated_at)</td><td><code>POST /catalog/dedupe</code> collapses duplicates idempotently</td></tr>
<tr><td>Scenario event fails on zero-supply city</td><td>Cloud rescue (Phase 22 item B) fires — <code>gpt-oss:120b</code> sees SQL filters attempted, row counts, reviewer drift notes, contract terms; returns structured <code>{retry, new_city, new_state, new_role, new_count, rationale}</code></td><td>Retry with pivot runs same executor loop with new geography; verified Gary IN → South Bend IN filled 1/1 after original drift-abort</td></tr>
<tr><td>LLM response truncated mid-JSON (thinking model ate token budget)</td><td>Phase 21 <code>generateContinuable()</code> detects via brace-balance + JSON.parse; no silent truncation</td><td>Auto-continue with partial as scratchpad, or geometric backoff if initial call returned empty. Bounded by <code>max_continuations</code>.</td></tr>
<tr><td>Schema migration invalidates existing playbooks</td><td>Phase 25 — <code>POST /vectors/playbook_memory/retire</code> with current fingerprint retires all mismatched entries in scope; diagnostic log shows counts</td><td>Retired entries stay in journal for forensics but are skipped by all boost calculations. Scoped by (city, state) so unrelated geos aren't touched.</td></tr>
<tr><td>Observer fails to reach scenario outcome stream</td><td>Scenario <code>postObserverEvent()</code> uses 2s <code>AbortSignal.timeout</code>; silent skip if :3800 is down</td><td>Scenario log is still the source of truth; observer re-ingest on next run restores the stream. <code>data/_observer/ops.jsonl</code> is append-only so prior events survive.</td></tr>
</tbody>
</table>
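<p>The truncated-JSON row above describes detection via brace balance plus <code>JSON.parse</code>. A minimal sketch of that idea follows; it is not the <code>generateContinuable()</code> implementation, and the <code>checkJsonCompleteness</code> name and its three-way result are assumptions.</p>
<pre>
type Completeness = "complete" | "truncated" | "invalid";

// Track bracket depth outside of string literals, then confirm with JSON.parse.
function checkJsonCompleteness(text: string): Completeness {
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = 0; text.length > i; i += 1) {
    const ch = text.charAt(i);
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{" || ch === "[") depth += 1;
    else if (ch === "}" || ch === "]") depth -= 1;
  }
  // An unclosed string or bracket suggests the model ran out of tokens mid-object.
  if (inString || depth > 0) return "truncated";
  try {
    JSON.parse(text);
    return "complete";
  } catch {
    return "invalid";
  }
}
</pre>
<p>Separating "truncated" from "invalid" is the useful part: a truncated response is worth continuing from the partial text, while a balanced but unparseable one is better retried from scratch.</p>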
</div>

<!-- ═══ 9. PER-STAFFER CONTEXT ═══ -->
<div class="chapter" id="ch9">
<div class="num">Chapter 9</div>
<h2>Per-staffer context</h2>
<div class="lede">Twenty staffers don't see the same UI state. Each one's session is shaped by their active profile, their workspaces, their assigned contracts, and their client's blacklists.</div>

<h3>Active profile (Phase 17)</h3>
<p>Scopes every search. A <code>staffing-recruiter</code> profile bound to <code>workers_500k</code> sees only that dataset. A <code>security-analyst</code> profile bound to <code>threat_intel</code> cannot see worker data. Every tool invocation is recorded by model identity and can be read back via <code>GET /vectors/profile/&lt;id&gt;/audit</code>.</p>

<h3>Workspace (Phase 8.5)</h3>
<p>Per-contract state. Each workspace has daily/weekly/monthly tiers, saved searches, shortlists, activity logs. Survives across sessions. Instant zero-copy handoff between staffers — pointer swap, not data copy. Persisted to object storage, rebuilt on startup.</p>

<h3>Client blacklist</h3>
<p>Per-client worker exclusion. Populated via <code>POST /clients/:client/blacklist</code>. Auto-applied when the caller passes <code>client: "X"</code> on <code>/search</code>. JSON-backed; would move to catalog table under real client load.</p>
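<p>Applying the blacklist at search time amounts to a filter over the result set. The sketch below assumes a hypothetical in-memory shape for the JSON-backed store (client name mapped to worker ids) and a hypothetical <code>applyBlacklist</code> helper.</p>
<pre>
// Hypothetical shape for the JSON-backed store: client name to blacklisted worker ids.
type Blacklists = { [client: string]: string[] };

interface Candidate {
  worker_id: string;
  score: number;
}

// With no client given, or a client with no entries, results pass through unchanged.
function applyBlacklist(results: Candidate[], blacklists: Blacklists, client?: string): Candidate[] {
  if (client === undefined) return results;
  const blocked = new Set(blacklists[client] ?? []);
  return results.filter((c) => !blocked.has(c.worker_id));
}
</pre>
<p>Keying the exclusion on the <code>client</code> argument is what lets two staffers run the same query and see different candidates without any per-staffer state.</p>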

<h3>Audit trail</h3>
<p>Phase 12 tool registry logs every governed-action invocation (who called what, with what args, when, outcome). The log is queryable via <code>GET /tools/audit</code>. Phase 13 access control layers on top with role-based field masking and a query audit log.</p>

<h3>Daily summary per staffer</h3>
<p>Workspace activity log + per-staffer filter on the event journal gives <strong>"what did Sarah do today"</strong> as a direct query. The foundation for shift-handoff reports.</p>

<h3>Staffer identity + competence-weighted retrieval (Phase 23)</h3>
<p>Each scenario run carries an explicit <code>staffer: {id, name, tenure_months, role, tool_level}</code>. The KB aggregates per-staffer stats (<code>data/_kb/staffers.jsonl</code>) that roll up into a single <code>competence_score</code>:</p>
<pre>competence_score = 0.45·fill_rate + 0.20·turn_efficiency + 0.20·citation_density + 0.15·rescue_rate</pre>
<p>When any query runs <code>kb.findNeighbors(spec, k)</code>, the ranking isn't just cosine similarity — it's <code>weighted_score = cosine × max_staffer_competence</code> over the best coordinator who ran that signature. Senior staffers' playbooks surface above juniors' on similar scenarios, even when the juniors' scenario was marginally closer in embedding space.</p>
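<p>The re-ranking effect is easiest to see in a small sketch. The <code>Neighbor</code> shape and the <code>rankNeighbors</code> name are assumptions for illustration; this is not the body of <code>kb.findNeighbors</code>.</p>
<pre>
interface Neighbor {
  sig_hash: string;
  cosine: number; // similarity to the query signature
  staffer_competences: number[]; // competence_score of each staffer who ran this signature
}

// weighted_score = cosine * max staffer competence; a signature nobody has run scores 0.
function rankNeighbors(neighbors: Neighbor[], k: number) {
  return neighbors
    .map((n) => ({
      sig_hash: n.sig_hash,
      weighted_score: n.cosine * Math.max(0, ...n.staffer_competences),
    }))
    .sort((a, b) => b.weighted_score - a.weighted_score)
    .slice(0, k);
}

// The junior's scenario is closer in embedding space, but the senior's ranks first.
const ranked = rankNeighbors(
  [
    { sig_hash: "junior-run", cosine: 0.92, staffer_competences: [0.6] },
    { sig_hash: "senior-run", cosine: 0.88, staffer_competences: [0.93] },
  ],
  2,
);
</pre>
<p>With these illustrative numbers the weighted scores come out at roughly 0.55 for the junior run and 0.82 for the senior run, so a four-point cosine advantage is not enough to outrank the stronger coordinator.</p>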
<p>The <code>tool_level</code> knob (full / local / basic / minimal) controls which tiers are available to a given staffer's runs. See Ch 3 for the mapping. Variance is real and measurable: the 36-run demo produced a 33-point fill-rate delta between James (local tools, 93%) and Alex (minimal tools, 60%) on identical contracts.</p>

<h3>Auto-discovered reliable-performer labels</h3>
<p>A second-order output of the competence path: when multiple staffers independently endorse the same worker on similar-signature playbooks, that worker accumulates cross-staffer endorsements. <code>scripts/kb_staffer_report.py</code> surfaces them — after 36 runs, Rachel D. Lewis (Welder Nashville) had 18 endorsements across 4 staffers, Angela U. Ward (Machine Op Indianapolis) 19. These are high-confidence "reliable" labels the system produced without human tagging. The UI could badge these workers on future queries; today they're visible via <code>/memory/query</code>'s <code>discovered_patterns</code> bundle.</p>
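<p>The counting behind these labels can be sketched as follows. This is TypeScript for illustration only (the report itself is a Python script), and the <code>Endorsement</code> shape, the <code>discoverReliable</code> name, and the <code>minStaffers</code> threshold are all assumptions; the spec does not state a threshold.</p>
<pre>
interface Endorsement {
  worker: string;
  staffer_id: string;
}

interface ReliableWorker {
  worker: string;
  endorsements: number;
  distinct_staffers: number;
}

// Keep workers endorsed by at least minStaffers different staffers, most endorsed first.
function discoverReliable(log: Endorsement[], minStaffers: number): ReliableWorker[] {
  const staffersByWorker: { [worker: string]: string[] } = {};
  for (const e of log) {
    const list = staffersByWorker[e.worker] ?? [];
    list.push(e.staffer_id);
    staffersByWorker[e.worker] = list;
  }
  return Object.keys(staffersByWorker)
    .map((worker) => {
      const ids = staffersByWorker[worker];
      return { worker, endorsements: ids.length, distinct_staffers: new Set(ids).size };
    })
    .filter((w) => w.distinct_staffers >= minStaffers)
    .sort((a, b) => b.endorsements - a.endorsements);
}
</pre>
<p>Filtering on distinct staffers rather than raw endorsement count is the point of the "cross-staffer" qualifier: one enthusiastic staffer endorsing the same worker many times should not produce a reliable label on its own.</p>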
</div>

<!-- ═══ 10. A DAY IN THE LIFE ═══ -->
<div class="chapter" id="ch10">
<div class="num">Chapter 10</div>
<h2>A day in the life — from morning brief to EOD retrospective</h2>
<div class="lede">Concrete operator timeline. Every step touches a real endpoint that exists today.</div>

<div class="step"><div class="n">07:00</div><div class="body"><strong>Overnight housekeeping.</strong> Scheduled ingest runs — the configured cron picks up the client's latest ATS CSV delta, runs it through the pipeline in Ch2, marks workers_500k embeddings stale. Autotune agent promotes any Pareto-winner HNSW configs from overnight trials.</div></div>

<div class="step"><div class="n">07:30</div><div class="body"><strong>Embedding refresh.</strong> Background job re-embeds the new rows. Old index keeps serving. Hot-swap promotes when done.</div></div>

<div class="step"><div class="n">08:00</div><div class="body"><strong>Sarah (staffer) opens devop.live/lakehouse.</strong> Page loads in ~3s. Forecast panel shows: "$275M construction coming, 4 tight roles this week." Live Contracts section shows 6 Chicago permits with proposed fills + boost chips + pattern signals.</div></div>

<div class="step"><div class="n">08:15</div><div class="body"><strong>Sarah drills into a $5M permit.</strong> Top candidate card: Carmen Green, Endorsed · 3 playbooks chip, boost +0.166, pattern line reads "leader archetype · 47% OSHA-10." Sarah hovers the chip — narrative tooltip: "filled Welder x2 in Toledo (2026-04-15), Welder x1 in Toledo (2026-04-18)."</div></div>

<div class="step"><div class="n">08:30</div><div class="body"><strong>Sarah calls Carmen.</strong> Clicks Call button → <code>/log</code> fires → <code>playbook_memory.seed</code> → <code>persist_sql</code> → successful_playbooks_live grows by one. Button flashes "Logged" for 1.4s. No modal, no form, no second click.</div></div>

<div class="step"><div class="n">09:00</div><div class="body"><strong>Kim (another staffer) opens the same UI.</strong> Her profile loads. Her workspaces show her own contracts. She searches "reliable forklift Chicago" — MEMORY chip shows the pattern discovered across Sarah's morning work AND prior fills. Carmen, already logged by Sarah, shows up with an updated citation count.</div></div>

<div class="step"><div class="n">12:30</div><div class="body"><strong>Client pushes 20 new contracts + 1M ATS delta.</strong> Ch7 scale flow fires. Ingest in seconds; embedding refresh kicks off as a background job. Searches continue against old embeddings.</div></div>

<div class="step"><div class="n">14:00</div><div class="body"><strong>Emergency: worker Dave no-showed.</strong> Sarah clicks No-show button on Dave's card → <code>/log_failure</code> → <code>mark_failed</code> records a penalty. Next similar query dampens Dave's boost by 0.5. Sarah continues the refill — the refill excludes Dave and the 2 others already booked for this shift.</div></div>

<div class="step"><div class="n">15:00</div><div class="body"><strong>New embeddings live.</strong> Hot-swap promotion. Searches now see all 1M new profiles. Re-running Sarah's noon query would now produce a different top 5.</div></div>

<div class="step"><div class="n">17:00</div><div class="body"><strong>End-of-day retrospective.</strong> Any staffer who ran <code>tests/multi-agent/scenario.ts</code> gets <code>report.md</code> auto-generated. Workspace activity logs aggregate per staffer. <code>GET /vectors/playbook_memory/status</code> shows active vs retired counts. KB indexes the run (<code>kb.indexRun</code>) and the overview model synthesizes a pathway recommendation for the next matching signature. Every event outcome has already streamed to <code>lakehouse-observer.service</code> on :3800 for ERROR_ANALYZER + PLAYBOOK_BUILDER consumption.</div></div>

<div class="step"><div class="n">17:15</div><div class="body"><strong>Kim fires a natural-language query from the search box.</strong> "need 3 forklift operators in Joliet by Monday" → <code>POST /memory/query</code> on the Bun MCP. Regex normalizer extracts role / city / count / deadline / intent in 0ms (no LLM call). The unified response returns playbook workers (auto-surfaced reliable performers for Joliet Forklift with citation counts), pathway recommendation from KB, prior T3 lessons for Joliet, and top staffers by competence — all in ~300ms.</div></div>

<div class="step"><div class="n">22:00</div><div class="body"><strong>Overnight trial cycle.</strong> Autotune agent continues in the background. Trial journal grows. KB's <code>detectErrorCorrections</code> scans today's outcomes for fail→succeed deltas on the same signature; any correction gets logged to <code>data/_kb/error_corrections.jsonl</code> with the config diff. Tomorrow morning, the system is measurably better at something it got asked about today.</div></div>
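<p>The fail-then-succeed scan in the 22:00 step can be sketched like this. It is an illustration under assumptions, not the KB's <code>detectErrorCorrections</code>: the <code>RunOutcome</code> shape, the flat <code>config</code> object, and the helper names are hypothetical.</p>
<pre>
type Config = { [key: string]: string | number | boolean };
type ConfigDiff = { [key: string]: { from: unknown; to: unknown } };

interface RunOutcome {
  sig_hash: string;
  ts: string; // ISO-8601 timestamp
  ok: boolean;
  config: Config;
}

interface ErrorCorrection {
  sig_hash: string;
  failed_at: string;
  fixed_at: string;
  config_diff: ConfigDiff;
}

// Keys whose values differ between the failing and the succeeding run.
function diffConfig(before: Config, after: Config): ConfigDiff {
  const diff: ConfigDiff = {};
  for (const key of Object.keys(before).concat(Object.keys(after))) {
    if (before[key] !== after[key]) diff[key] = { from: before[key], to: after[key] };
  }
  return diff;
}

// Walk outcomes in time order; a success after a failure on the same signature is a correction.
function detectCorrections(outcomes: RunOutcome[]): ErrorCorrection[] {
  const sorted = [...outcomes].sort((a, b) => Date.parse(a.ts) - Date.parse(b.ts));
  const lastFailure: { [sig: string]: RunOutcome | undefined } = {};
  const corrections: ErrorCorrection[] = [];
  for (const run of sorted) {
    const prev = lastFailure[run.sig_hash];
    if (!run.ok) {
      lastFailure[run.sig_hash] = run;
    } else if (prev !== undefined) {
      corrections.push({
        sig_hash: run.sig_hash,
        failed_at: prev.ts,
        fixed_at: run.ts,
        config_diff: diffConfig(prev.config, run.config),
      });
      delete lastFailure[run.sig_hash];
    }
  }
  return corrections;
}
</pre>
<p>Clearing the remembered failure after a match means each failure is credited to at most one later success, so repeated successes on the same signature do not inflate the correction log.</p>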

<h3>SMS + email drafts in the pipeline</h3>
<p>After each sealed fill (via scenario.ts or manual <code>/log</code> flow with downstream hooks), <code>generateArtifacts</code> in the scenario runner produces: (a) one SMS per worker (TO: Name, message under 180 chars), (b) one client confirmation email. Drafts are saved to <code>sms.md</code> and <code>emails.md</code> under the scenario output dir. Ollama drafts them; the staffer reviews and sends. No auto-send; human-in-the-loop.</p>
</div>

<!-- ═══ 11. LIMITS & NON-GOALS ═══ -->
<div class="chapter" id="ch11">
<div class="num">Chapter 11</div>
<h2>Known limits &amp; non-goals</h2>
<div class="lede">Honesty is a feature. Everything below is either deferred or explicitly out of scope.</div>

<h4>Deferred — real architectural work, just not shipped yet</h4>
<ul>
<li><strong>Rate / margin awareness.</strong> Worker pay expectations vs contract bill rate not modeled. Requires adding <code>pay_rate</code> to workers, <code>bill_rate</code> to contracts, and a filter + warning path. Partially addressed via <code>ContractTerms.budget_per_hour_max</code> passed to T3/rescue prompts, but the match-time filter isn't wired yet.</li>
<li><strong>Mem0-style UPDATE / DELETE / NOOP operations on playbooks.</strong> Today <code>/seed</code> only ADDs. Same <code>(operation, date)</code> pair appends a duplicate instead of refining an existing entry. Phase 26 item — cheap to add, moderate payoff.</li>
<li><strong>Letta working-memory hot cache.</strong> Every boost query scans all active playbook entries from in-memory state. 1.9K today; cheap. Will bite somewhere north of 100K. LRU for the last-N playbooks or current-sig neighborhood deferred until that ceiling approaches.</li>
<li><strong>Chunking cache (Phase 21 Rust port).</strong> TS primitives <code>generateContinuable</code> + <code>generateTreeSplit</code> are wired, but <code>crates/aibridge/src/{continuation.rs, tree_split.rs}</code> + <code>crates/storaged/src/chunk_cache.rs</code> remain queued. Gateway-side callers currently don't have the same protection against silent truncation that the TS test harness does.</li>
<li><strong>Confidence calibration.</strong> Top-K is a rank, not a probability. No calibrated "85% likely to accept" score. Requires outcome-labeled training data.</li>
<li><strong>Neural re-ranker.</strong> Phase 19 is statistical + semantic (now with geo + role prefilter, Phase 25 retirement). A (query, candidate, outcome)-trained re-ranker stays deferred unless the statistical floor plateaus below usable recall; the current 14× citation lift on identical inputs suggests it hasn't.</li>
<li><strong>Observer → autotune feedback wire.</strong> Phase 24 streams scenario outcomes into <code>data/_observer/ops.jsonl</code>; autotune agent still runs on its own HNSW-trial schedule and hasn't subscribed to the outcome metric stream yet. Phase 26+ item — connects the last loop.</li>
<li><strong>call_log cross-reference.</strong> Infrastructure present; current synthetic candidates table is too small to cross-ref. Fixes when real ATS lands.</li>
</ul>
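<p>For the Mem0-style item above, the decision between ADD, UPDATE, and NOOP could be sketched as below. This is a hypothetical illustration: the <code>Playbook</code> shape, the merge-by-worker rule, and the <code>seed</code> function are assumptions, and DELETE is left out for brevity.</p>
<pre>
interface Playbook {
  operation: string;
  date: string;
  workers: string[];
}

type SeedOp = "ADD" | "UPDATE" | "NOOP";

// Entries are identified by their (operation, date) pair.
function keyOf(p: Playbook): string {
  return p.operation + "|" + p.date;
}

// Mutates store in place and reports which operation was applied.
function seed(store: Playbook[], incoming: Playbook): SeedOp {
  const idx = store.findIndex((p) => keyOf(p) === keyOf(incoming));
  if (idx === -1) {
    store.push(incoming);
    return "ADD";
  }
  const existing = store[idx];
  const known = new Set(existing.workers);
  const fresh = incoming.workers.filter((w) => !known.has(w));
  if (fresh.length === 0) return "NOOP"; // nothing new to record
  store[idx] = { ...existing, workers: existing.workers.concat(fresh) };
  return "UPDATE";
}
</pre>
<p>Under this rule a repeated seed for the same pair refines the existing entry instead of appending a duplicate, which addresses the growth problem described in the bullet without changing what a first-time seed does.</p>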

<h4>Non-goals — explicitly out of scope</h4>
<ul>
<li><strong>Cloud deployment.</strong> Local-first by design. Works offline after setup.</li>
<li><strong>Full ACID transactions.</strong> Single-writer model is sufficient; Delta Lake-grade MVCC is deliberately not attempted.</li>
<li><strong>Real-time streaming / CDC.</strong> Batch ingest is the model. Scheduled refresh, not transactional replication.</li>
<li><strong>Replacing the CRM.</strong> This is the analytical + AI layer <em>behind</em> the CRM. Operational CRUD stays with the existing system.</li>
<li><strong>Custom file formats.</strong> Parquet for datasets, sidecar indexes for vectors. No proprietary formats (ADR-008, ADR-018 reaffirm).</li>
<li><strong>Hard multi-tenant isolation.</strong> Profiles and federation provide soft isolation. Adversarial multi-tenant is not a goal — this system assumes a single-trust operator.</li>
</ul>

<div class="narr">
<strong>Overall bet.</strong> The substrate is conservative: Parquet + DataFusion + HNSW + Ollama + object storage. Every layer is replaceable, open, auditable. The intelligence layer (playbook_memory, patterns, autotune) is statistical, not neural — cheaper, explainable, rebuildable from the journal alone. If the statistical floor plateaus below what a real client needs, Phase 20+ adds neural re-rank on top. We don't make that call until measurement demands it.
</div>
</div>

</div>
</div>

<div class="footer">Lakehouse spec · v2 2026-04-21 · Phases 19-25 shipped (playbook boost, model matrix, continuation, KB, staffer competence, observer ingest, validity windows) · maintained from <code>docs/DECISIONS.md</code> · <a href="proof">architecture live-tested</a> · <a href="console">walkthrough</a></div>

</body></html>