
<!DOCTYPE html>
<html lang="en"><head>
<meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1">
<title>Lakehouse — Technical Specification</title>
<style>
*{margin:0;padding:0;box-sizing:border-box}
body{font-family:'Inter',-apple-system,system-ui,sans-serif;background:#090c10;color:#b0b8c4;font-size:14px;line-height:1.6;-webkit-font-smoothing:antialiased}
a{color:#58a6ff;text-decoration:none}
a:hover{color:#79c0ff}

.bar{background:#0d1117;padding:0 24px;height:56px;border-bottom:1px solid #171d27;display:flex;justify-content:space-between;align-items:center;position:sticky;top:0;z-index:10}
.bar h1{font-size:14px;font-weight:600;color:#e6edf3;letter-spacing:-0.2px}
.bar nav{display:flex;gap:2px}
.bar nav a{font-size:12px;color:#545d68;padding:6px 14px;border-radius:6px;transition:all 0.15s}
.bar nav a:hover{color:#e6edf3;background:#161b22}
.bar nav a.active{color:#e6edf3;background:#1c2333}
.bar .rt{font-size:11px;color:#545d68}

.layout{display:grid;grid-template-columns:220px 1fr;gap:0;max-width:1200px;margin:0 auto}
.toc{position:sticky;top:72px;align-self:start;padding:24px 12px;border-right:1px solid #171d27;height:calc(100vh - 72px);overflow-y:auto}
.toc .hdr{color:#545d68;font-size:10px;text-transform:uppercase;letter-spacing:1.4px;font-weight:600;margin-bottom:10px;padding-left:8px}
.toc a{display:block;color:#8b949e;font-size:12px;padding:6px 10px;border-radius:6px;margin-bottom:2px;line-height:1.4}
.toc a:hover{color:#e6edf3;background:#161b22}

.wrap{padding:28px 24px 60px;min-width:0}

.chapter{margin-bottom:48px;scroll-margin-top:72px}
.chapter .num{color:#545d68;font-size:11px;font-weight:600;letter-spacing:1.6px;text-transform:uppercase;margin-bottom:6px}
.chapter h2{color:#e6edf3;font-size:24px;font-weight:700;letter-spacing:-0.4px;margin-bottom:8px;line-height:1.2}
.chapter h3{color:#e6edf3;font-size:16px;font-weight:600;margin:22px 0 8px}
.chapter h4{color:#c9d1d9;font-size:14px;font-weight:600;margin:16px 0 6px}
.chapter .lede{color:#8b949e;font-size:14px;margin-bottom:18px;max-width:760px;line-height:1.7}
.chapter p{color:#b0b8c4;margin-bottom:12px;max-width:800px}
.chapter ul{color:#b0b8c4;margin:8px 0 16px 24px}
.chapter li{margin-bottom:6px;line-height:1.7}
.chapter strong{color:#e6edf3;font-weight:600}

.card{background:#0d1117;border:1px solid #171d27;border-radius:12px;padding:20px;margin:12px 0}
.accent-l{border-left:3px solid #2ea043}
.accent-b{border-left:3px solid #1f6feb}
.accent-a{border-left:3px solid #bc8cff}
.accent-w{border-left:3px solid #d29922}
.accent-r{border-left:3px solid #f85149}

code{background:#161b22;color:#e6edf3;padding:2px 6px;border-radius:4px;font-family:ui-monospace,Menlo,monospace;font-size:12px}
pre{background:#161b22;border:1px solid #171d27;border-radius:8px;padding:14px 16px;overflow-x:auto;font-family:ui-monospace,Menlo,monospace;font-size:12px;color:#c9d1d9;line-height:1.5;margin:8px 0}

table.plain{width:100%;border-collapse:collapse;font-size:12px;margin:10px 0}
table.plain th{text-align:left;padding:8px 12px;color:#545d68;font-weight:600;text-transform:uppercase;font-size:10px;letter-spacing:0.8px;border-bottom:1px solid #171d27;background:#0d1117}
table.plain td{padding:8px 12px;border-bottom:1px solid #171d27;color:#c9d1d9;vertical-align:top}
table.plain td.mono{font-family:ui-monospace,Menlo,monospace}
table.plain tr:hover td{background:#0d1117}

.narr{color:#8b949e;font-size:13px;line-height:1.7;margin:10px 0;padding:10px 14px;border-left:2px solid #21262d}
.narr strong{color:#c9d1d9}

.ref{color:#545d68;font-size:11px;margin-top:6px;font-family:ui-monospace,Menlo,monospace}
.ref strong{color:#79c0ff;font-weight:600}

.step{display:flex;gap:14px;margin-bottom:14px;padding:12px 16px;background:#0d1117;border:1px solid #171d27;border-radius:8px}
.step .n{color:#58a6ff;font-weight:700;font-size:18px;flex-shrink:0;min-width:30px}
.step .body{color:#b0b8c4;font-size:13px;line-height:1.7}
.step .body strong{color:#e6edf3}

.footer{border-top:1px solid #171d27;padding:20px;text-align:center;color:#3d444d;font-size:11px}

@media(max-width:900px){
  .layout{grid-template-columns:1fr}
  .toc{display:none}
  .wrap{padding:20px 14px 40px}
  .bar nav{display:none}
}
</style></head>
<body>

<div class="bar">
  <h1>Lakehouse — Technical Specification</h1>
  <nav>
    <a href=".">Dashboard</a>
    <a href="console">Walkthrough</a>
    <a href="proof">Architecture</a>
    <a href="spec" class="active">Spec</a>
    <a href="onboard">Onboard</a>
    <a href="alerts">Alerts</a>
  </nav>
  <div class="rt">v1 · 2026-04-20</div>
</div>

<div class="layout">
<aside class="toc">
  <div class="hdr">Contents</div>
  <a href="#ch1">1. Repository layout</a>
  <a href="#ch2">2. Data ingest pipeline</a>
  <a href="#ch3">3. Measurement &amp; indexing</a>
  <a href="#ch4">4. Contract inference</a>
  <a href="#ch5">5. What a CRM can't do</a>
  <a href="#ch6">6. How it gets better over time</a>
  <a href="#ch7">7. Scale story — 20 staffers, 300 contracts, a surge</a>
  <a href="#ch8">8. Error surfaces &amp; recovery</a>
  <a href="#ch9">9. Per-staffer context</a>
  <a href="#ch10">10. A day in the life</a>
  <a href="#ch11">11. Known limits &amp; non-goals</a>
</aside>
<div class="wrap">

<!-- ═══ 1. REPO LAYOUT ═══ -->
<div class="chapter" id="ch1">
<div class="num">Chapter 1</div>
<h2>Repository layout</h2>
<div class="lede">What lives where. Every folder below has a single, bounded responsibility. A maintainer reading this should know — in under ten minutes — which crate owns a failing behavior.</div>
<table class="plain">
<thead><tr><th>Path</th><th>Owns</th></tr></thead>
<tbody>
<tr><td class="mono">crates/shared/</td><td>Types, errors, Arrow helpers, schema fingerprints, PII detection, secrets provider. Every other crate depends on this.</td></tr>
<tr><td class="mono">crates/storaged/</td><td>Raw object I/O. <code>BucketRegistry</code> (multi-bucket, rescue-aware), <code>AppendLog</code> (write-once batched append), <code>ErrorJournal</code> (bucket op failures). ADR-017 (federation), ADR-018 (append pattern).</td></tr>
<tr><td class="mono">crates/catalogd/</td><td>Metadata authority. Dataset manifests, schema fingerprints (ADR-020), tombstones (soft delete), AI-safe views, model profiles (Phase 17). In-memory index persisted as Parquet on storage.</td></tr>
<tr><td class="mono">crates/queryd/</td><td>SQL engine. DataFusion over Parquet + MemTable cache + delta merge-on-read + compaction. Registers every bucket as an object_store so SQL can join across them.</td></tr>
<tr><td class="mono">crates/ingestd/</td><td>Data on-ramp. CSV / JSON / PDF (+OCR via Tesseract) / Postgres streaming / MySQL streaming / inbox watcher / cron schedules. Every ingest path auto-tags PII (emails, phones, SSNs, addresses), records lineage, and marks embeddings stale.</td></tr>
<tr><td class="mono">crates/vectord/</td><td>The vector + learning surface. Embeddings stored as Parquet (ADR-008), HNSW index (Phase 15), trial system (autotune), promotion registry (Phase 16), playbook_memory (Phase 19). Core feedback loop lives here.</td></tr>
<tr><td class="mono">crates/vectord-lance/</td><td>Firewall crate. Lance 4.0 + Arrow 57, isolated from the main Arrow-55 workspace. Provides a secondary vector backend for large-scale, random-access, and append-heavy workloads (ADR-019).</td></tr>
<tr><td class="mono">crates/journald/</td><td>Append-only mutation event log (ADR-012). Every insert/update/delete writes here — who, when, what, old/new value. Never mutated. Foundation for time-travel + compliance audit.</td></tr>
<tr><td class="mono">crates/aibridge/</td><td>Rust ↔ Python sidecar. HTTP client over FastAPI wrapper around Ollama. VRAM introspection via nvidia-smi. All LLM calls (embed, generate, rerank) flow through here.</td></tr>
<tr><td class="mono">crates/gateway/</td><td>Axum HTTP (:3100) + gRPC (:3101). Auth middleware, tools registry (Phase 12 — governed actions), CORS. Every external request enters here.</td></tr>
<tr><td class="mono">crates/ui/</td><td>Dioxus WASM developer UI. Internal tool. Not exposed externally.</td></tr>
<tr><td class="mono">mcp-server/</td><td>Bun/TypeScript recruiter-facing app. Serves <code>devop.live/lakehouse</code>. Routes: <code>/search /match /log /log_failure /clients/:c/blacklist /intelligence/*</code>. Proxies to the Rust gateway for heavy work.</td></tr>
<tr><td class="mono">tests/multi-agent/</td><td>Dual-agent scenario harness. <code>agent.ts</code> (prompts + protocol), <code>orchestrator.ts</code> (single task), <code>scenario.ts</code> (5-event warehouse week), <code>run_e2e_rated.ts</code> (parallel pairs + rating), <code>chain_of_custody.ts</code> (layer-by-layer audit).</td></tr>
<tr><td class="mono">docs/</td><td><code>PRD.md</code>, <code>PHASES.md</code>, <code>DECISIONS.md</code> (20 ADRs). Every significant architectural choice has an ADR with the alternatives that were rejected and why.</td></tr>
<tr><td class="mono">data/</td><td>Default local object store. Parquet files per dataset, append-log batches, HNSW trial journals, promotion registries, playbook_memory state.json, catalog manifests. Rebuildable from repo + this dir alone.</td></tr>
</tbody>
</table>
</div>

<!-- ═══ 2. INGEST PIPELINE ═══ -->
<div class="chapter" id="ch2">
<div class="num">Chapter 2</div>
<h2>Data ingest pipeline</h2>
<div class="lede">How staffing data gets into the system — whether from a CSV drop, an ATS export, a Postgres replica, or a PDF resume. Every path ends at the same place: a registered dataset with known schema, known lineage, known sensitivity.</div>

<div class="step"><div class="n">1</div><div class="body"><strong>Source arrives.</strong> Four shapes: (a) file upload via <code>POST /ingest/file</code>, (b) inbox watcher (drops in <code>./inbox/</code> → auto-ingested in under 15s), (c) Postgres or MySQL streaming connector (<code>POST /ingest/db</code> with DSN), (d) scheduled ingest via <code>ingestd::schedule</code> with cron.</div></div>

<div class="step"><div class="n">2</div><div class="body"><strong>Parse + normalize.</strong> CSV parser infers types per column; defaults to <code>String</code> on ambiguity (ADR-010 — better to ingest everything than reject on type mismatch). JSON parser flattens nested objects. PDF extractor uses <code>lopdf</code> first; falls back to Tesseract OCR for scanned/image PDFs. Output is always an Arrow <code>RecordBatch</code>.</div></div>

<div class="step"><div class="n">3</div><div class="body"><strong>Auto-detect PII.</strong> <code>shared::pii</code> scans column values and names. Identifies emails, phone numbers, SSNs, salaries, street addresses, medical terms. Tags columns with <code>sensitivity: PII | PHI | Financial | Internal | Public</code> (Phase 10 catalog v2).</div></div>

<div class="step"><div class="n">4</div><div class="body"><strong>Deduplicate by content hash.</strong> Every uploaded file's SHA-256 is checked against the catalog's seen-hash log. Re-ingesting the same file is a no-op (ADR invariant #5).</div></div>

<div class="step"><div class="n">5</div><div class="body"><strong>Write Parquet to object storage.</strong> <code>arrow_helpers::record_batch_to_parquet</code> → <code>storaged::ops::put</code> → file lands under <code>data/datasets/&lt;name&gt;.parquet</code> (or bucket-scoped via <code>BucketRegistry</code>). Schema fingerprint computed.</div></div>

<div class="step"><div class="n">6</div><div class="body"><strong>Register in catalog.</strong> <code>catalogd::Registry::register(name, fingerprint, objects)</code> — idempotent on (name, fingerprint). Same name + same fingerprint = reuse manifest, bump updated_at. Same name + different fingerprint = <code>409 Conflict</code> (ADR-020 — prevents silent schema drift). New name = create new manifest with owner, lineage, freshness SLA, column metadata, PII tags.</div></div>

<div class="step"><div class="n">7</div><div class="body"><strong>Mark embeddings stale.</strong> If the dataset already has a vector index, the new rows mean that index is now behind. <code>Registry::mark_embeddings_stale</code> flips a flag; <code>POST /vectors/refresh/&lt;dataset&gt;</code> runs an incremental re-embed (only new rows, not the whole corpus).</div></div>

<div class="step"><div class="n">8</div><div class="body"><strong>Queryable immediately.</strong> <code>queryd::context</code> picks up the new manifest on next query. Hot-cache warms on first hit. Delta merge-on-read means updates land without rewriting the base Parquet.</div></div>
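<p>Steps 4 and 6 can be sketched together. This is a minimal in-memory illustration, not the <code>catalogd</code> implementation: the <code>Catalog</code> class, its field names, and the <code>SchemaConflict</code> exception are hypothetical stand-ins for the seen-hash log, the manifest store, and the <code>409 Conflict</code> response.</p>

```python
import hashlib
from dataclasses import dataclass, field


class SchemaConflict(Exception):
    """Hypothetical stand-in for the 409 Conflict response in step 6."""


@dataclass
class Catalog:
    # SHA-256 hex digests of every file ingested so far (step 4).
    seen_hashes: set = field(default_factory=set)
    # Dataset name -> schema fingerprint of its manifest (step 6).
    manifests: dict = field(default_factory=dict)

    def is_duplicate(self, payload: bytes) -> bool:
        """Return True if these exact bytes were ingested before."""
        digest = hashlib.sha256(payload).hexdigest()
        if digest in self.seen_hashes:
            return True
        self.seen_hashes.add(digest)
        return False

    def register(self, name: str, fingerprint: str) -> str:
        """Idempotent on (name, fingerprint)."""
        current = self.manifests.get(name)
        if current is None:
            self.manifests[name] = fingerprint
            return "created"
        if current == fingerprint:
            # Same name + same fingerprint: reuse the manifest.
            return "reused"
        # Same name + different fingerprint: refuse instead of drifting.
        raise SchemaConflict(f"{name}: have {current}, got {fingerprint}")
```

<p>Registering the same name and fingerprint twice returns <code>"reused"</code> the second time; a second fingerprint under the same name raises, which is the behavior the step describes for schema drift.</p>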

<div class="ref"><strong>Code:</strong> crates/ingestd/src/{service.rs, csv.rs, json.rs, pdf.rs, pg_stream.rs, my_stream.rs, schedule.rs}</div>
</div>

<!-- ═══ 3. MEASUREMENT & INDEXING ═══ -->
<div class="chapter" id="ch3">
<div class="num">Chapter 3</div>
<h2>Measurement &amp; indexing</h2>
<div class="lede">Once data is in, the system describes it rigorously and builds fast-access indexes over the parts that will be queried. Every measurement is deterministic, versioned, and visible via HTTP.</div>

<h3>What gets measured per dataset</h3>
<ul>
<li><strong>Row count</strong> (read from the Parquet footer, not via <code>SELECT COUNT</code>). O(1).</li>
<li><strong>Schema fingerprint</strong> — SHA-256 over (column_name, type, nullability, sort) tuples. Drives ADR-020 idempotent register.</li>
<li><strong>Owner / sensitivity / freshness SLA</strong> — catalog v2 metadata. PII auto-detected; owner assigned on ingest.</li>
<li><strong>Lineage</strong> — source_system → ingest_job → dataset. Who put this here, when, from what.</li>
<li><strong>Last embedded at</strong> — when the vector index covering this dataset was last refreshed. Drives stale-detection.</li>
</ul>
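<p>A schema fingerprint of this kind might look like the sketch below. It is an illustration under assumptions: it reads "sort" in the tuple description as "tuples are sorted before hashing", so column order does not affect the result, and the byte separators are arbitrary. The real <code>shared</code> crate may hash in a different order or encode fields differently.</p>

```python
import hashlib


def schema_fingerprint(columns):
    """SHA-256 over (column_name, type, nullable) tuples.

    `columns` is an iterable of (str, str, bool). Sorting first is an
    assumption: it makes the fingerprint independent of column order.
    """
    h = hashlib.sha256()
    for name, dtype, nullable in sorted(columns):
        # Unit/record separators keep ("ab", "c") distinct from ("a", "bc").
        h.update(f"{name}\x1f{dtype}\x1f{int(nullable)}\x1e".encode("utf-8"))
    return h.hexdigest()
```

<p>Two ingests with the same columns produce the same digest; changing any column's type or nullability produces a different one, which is what lets the catalog distinguish "reuse" from "conflict".</p>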

<h3>How vector indexes are built</h3>
<p>Two backends, chosen per profile (ADR-019):</p>
<table class="plain">
<thead><tr><th></th><th>HNSW over Parquet (primary)</th><th>Lance (secondary)</th></tr></thead>
<tbody>
<tr><td>Storage</td><td>Embeddings as Parquet columns (<code>doc_id, chunk_text, vector</code>)</td><td>Lance native dataset</td></tr>
<tr><td>Index</td><td>HNSW in RAM, serialized sidecar</td><td>IVF_PQ on disk</td></tr>
<tr><td>Build time (100K × 768d)</td><td>~230s</td><td>~16s (14× faster)</td></tr>
<tr><td>Search p50 (100K)</td><td>~873μs</td><td>~7.4ms at recall 1.0</td></tr>
<tr><td>Append</td><td>Rewrite required</td><td>Structural (0.08s for 100 rows)</td></tr>
<tr><td>Random fetch by doc_id</td><td>Full scan</td><td>~311μs (112× faster)</td></tr>
<tr><td>RAM ceiling</td><td>~5M vectors</td><td>Scales past RAM — disk-resident</td></tr>
</tbody>
</table>

<h3>Autotune</h3>
<p>The <code>vectord::agent</code> background task runs continuously. Per index, it proposes HNSW configurations (<code>ef_construction × ef_search</code>), executes a trial against a stored eval set, journals the result as JSONL, and — if recall beats the min_recall gate (0.9) and latency wins the Pareto test — promotes the new config atomically via <code>promotion_registry</code>. No downtime. Rollback in milliseconds.</p>
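<p>The promotion gate can be sketched as follows. The text does not spell out the exact Pareto criterion, so this sketch assumes the common definition: a candidate wins if it is no worse than the active config on both recall and latency and strictly better on at least one. The <code>Trial</code> type and its field names are hypothetical.</p>

```python
from dataclasses import dataclass

MIN_RECALL = 0.9  # gate value quoted in the text


@dataclass(frozen=True)
class Trial:
    ef_construction: int
    ef_search: int
    recall: float
    p50_latency_us: float


def should_promote(candidate: Trial, active: Trial) -> bool:
    """Recall gate first, then an assumed Pareto-dominance check."""
    if candidate.recall < MIN_RECALL:
        return False
    no_worse = (candidate.recall >= active.recall
                and candidate.p50_latency_us <= active.p50_latency_us)
    strictly_better = (candidate.recall > active.recall
                       or candidate.p50_latency_us < active.p50_latency_us)
    return no_worse and strictly_better
```

<p>Under this reading, a config that trades recall for speed, or speed for recall, is journaled but not promoted; only a config that improves one axis without regressing the other replaces the active one.</p>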

<h3>Per-profile / per-staffer indexing</h3>
<p>Model profiles (Phase 17) are not routing strings — they are named scopes. Each profile has <code>bound_datasets[]</code>, <code>hnsw_config</code>, <code>vector_backend</code>, and <code>bucket</code>. When a staffer activates a profile:</p>
<ul>
<li>EmbeddingCache warms for bound indexes only</li>
<li>HNSW is rebuilt with the profile's config (if different from current)</li>
<li>Search via <code>POST /vectors/profile/&lt;id&gt;/search</code> rejects out-of-scope queries with 403 + list of allowed bindings</li>
<li>Ollama swaps to the profile's model via <code>keep_alive=0</code>; only one model in VRAM at a time</li>
</ul>
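<p>The scope check in the third bullet can be illustrated with a small sketch. The <code>Profile</code> fields mirror the names in the text; the handler function and the response body shape are assumptions made for illustration only.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    profile_id: str
    bound_datasets: tuple
    vector_backend: str = "hnsw"


def authorize_search(profile: Profile, dataset: str):
    """Return (status, body) the way an HTTP handler might."""
    if dataset not in profile.bound_datasets:
        # Out of scope: reject and tell the caller what is allowed.
        return 403, {"error": "dataset out of scope",
                     "allowed": list(profile.bound_datasets)}
    return 200, {"dataset": dataset}
```

<p>A <code>staffing-recruiter</code> profile bound only to <code>workers_500k</code> would get a 403 for <code>threat_intel</code>, with <code>workers_500k</code> listed as the allowed binding.</p>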
<div class="ref"><strong>Code:</strong> crates/vectord/src/{hnsw.rs, autotune.rs, agent.rs, promotion.rs} · ADR-019</div>
</div>

<!-- ═══ 4. CONTRACT INFERENCE ═══ -->
<div class="chapter" id="ch4">
<div class="num">Chapter 4</div>
<h2>Contract inference from external signal</h2>
<div class="lede">Most CRMs wait for a contract to land. This system watches upstream demand and pre-builds the ranking before the contract lands.</div>

<p>The concrete example running on <code>devop.live/lakehouse</code> is Chicago Department of Buildings permit data (public Socrata API). Every permit is a signal that construction — and therefore staffing — is coming.</p>

<h3>Flow</h3>
<div class="step"><div class="n">1</div><div class="body"><strong>Fetch.</strong> <code>/intelligence/market</code> and <code>/intelligence/permit_contracts</code> hit <code>data.cityofchicago.org/resource/ydr8-5enu.json</code> live. No caching of permit data — every page load is fresh.</div></div>
<div class="step"><div class="n">2</div><div class="body"><strong>Map work_type → role.</strong> Industry dictionary: "Electrical Work" → "Electrician", "Masonry Work" → "Production Worker", "Mechanical Work" → "Maintenance Tech", etc.</div></div>
<div class="step"><div class="n">3</div><div class="body"><strong>Derive worker count.</strong> Heuristic: ~1 worker per $150K of permit cost, clamped to 2–8 per contract for staffing realism. Operator-configurable when real client history is available.</div></div>
<div class="step"><div class="n">4</div><div class="body"><strong>Derive timeline.</strong> Permit issued → construction starts ~45 days later → staffing window opens ~14 days before construction, i.e. ~31 days after issue. Each permit is then classified as <code>overdue</code>, <code>urgent</code>, <code>soon</code>, or <code>scheduled</code>.</div></div>
<div class="step"><div class="n">5</div><div class="body"><strong>Run hybrid search against the bench.</strong> For each derived contract, <code>POST /vectors/hybrid</code> with <code>sql_filter</code> on role+state+city+availability, <code>use_playbook_memory: true</code>, <code>playbook_memory_k: 200</code>. Returns top-5 candidates with boost + citations.</div></div>
<div class="step"><div class="n">6</div><div class="body"><strong>Query the meta-index.</strong> <code>POST /vectors/playbook_memory/patterns</code> aggregates traits across similar past playbooks — recurring certs, skills, archetype, reliability distribution. Surfaces signal the operator didn't query for.</div></div>
<div class="step"><div class="n">7</div><div class="body"><strong>Render on the dashboard.</strong> Each card shows permit + derived contract + top 3 candidates with memory chips + discovered pattern + urgency. All of this is assembled on page load, before the staffer runs a single query.</div></div>
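<p>Steps 3 and 4 reduce to a few lines of arithmetic. The sketch below uses the constants quoted in the steps ($150K per worker, clamp to 2–8, 45-day construction lag, 14-day staffing lead). The urgency thresholds (7 and 21 days) and the meaning of <code>overdue</code> (window already open) are hypothetical; the text does not state them.</p>

```python
from datetime import date, timedelta

COST_PER_WORKER = 150_000
MIN_WORKERS, MAX_WORKERS = 2, 8
CONSTRUCTION_LAG_DAYS = 45
STAFFING_LEAD_DAYS = 14


def derive_worker_count(permit_cost: float) -> int:
    """~1 worker per $150K, clamped to the 2..8 range."""
    raw = round(permit_cost / COST_PER_WORKER)
    return max(MIN_WORKERS, min(MAX_WORKERS, raw))


def staffing_window_opens(issue_date: date) -> date:
    """Issue date + 45 days to construction - 14 days of lead time."""
    return issue_date + timedelta(days=CONSTRUCTION_LAG_DAYS - STAFFING_LEAD_DAYS)


def classify_urgency(issue_date: date, today: date) -> str:
    days_left = (staffing_window_opens(issue_date) - today).days
    if days_left < 0:
        return "overdue"
    if days_left <= 7:       # hypothetical threshold
        return "urgent"
    if days_left <= 21:      # hypothetical threshold
        return "soon"
    return "scheduled"
```

<p>For example, a $600K permit maps to 4 workers, a $5M permit hits the cap of 8, and a $100K permit is raised to the floor of 2.</p>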

<h3>Coverage forecast</h3>
<p><code>/intelligence/staffing_forecast</code> aggregates the last 30 days of permits into predicted role-level demand, joins against the IL bench supply, computes coverage %, and classifies each role as <code>critical</code> / <code>tight</code> / <code>watch</code> / <code>ok</code>. The dashboard's top panel renders this — staffers see supply gaps before they query.</p>
</div>

<!-- ═══ 5. WHAT A CRM CAN'T DO ═══ -->
<div class="chapter" id="ch5">
<div class="num">Chapter 5</div>
<h2>What a CRM can't do (and why)</h2>
<div class="lede">A CRM stores. This system infers, predicts, re-ranks, and compounds. The nine capabilities marked "No" in the CRM column below are load-bearing: missing any of them is the gap between "software that logs calls" and "software that makes the next call better."</div>

<div class="card accent-b">
<table class="plain">
<thead><tr><th>Capability</th><th>CRM</th><th>This system</th></tr></thead>
<tbody>
<tr><td>Store candidate records</td><td>Yes</td><td>Yes (workers_500k, candidates)</td></tr>
<tr><td>Search by structured field</td><td>Yes</td><td>Yes (DataFusion SQL, sub-100ms on 3M rows)</td></tr>
<tr><td>Search by semantic meaning</td><td>No</td><td>Yes (HNSW + nomic-embed-text)</td></tr>
<tr><td>Combine SQL filter + semantic rank</td><td>No</td><td>Yes (<code>/vectors/hybrid</code>)</td></tr>
<tr><td>Boost workers based on past success</td><td>No</td><td>Yes (Phase 19 playbook_memory)</td></tr>
<tr><td>Penalize workers based on past failure</td><td>No</td><td>Yes (<code>/log_failure</code> + <code>0.5<sup>n</sup></code> penalty)</td></tr>
<tr><td>Surface traits across past fills</td><td>No</td><td>Yes (<code>/vectors/playbook_memory/patterns</code>)</td></tr>
<tr><td>Predict staffing demand from external data</td><td>No</td><td>Yes (Chicago permit feed + 30-day rolling forecast)</td></tr>
<tr><td>Count down to staffing deadline per contract</td><td>No</td><td>Yes (permit issue_date + heuristic timeline)</td></tr>
<tr><td>Explain why each candidate ranked</td><td>No</td><td>Yes (boost chip + narrative citations + memory pattern)</td></tr>
<tr><td>Improve ranking from operator actions</td><td>No</td><td>Yes (every Call/SMS/No-show click → re-rank signal)</td></tr>
</tbody>
</table>
</div>
</div>

<!-- ═══ 6. HOW IT GETS BETTER ═══ -->
<div class="chapter" id="ch6">
<div class="num">Chapter 6</div>
<h2>How it gets better over time</h2>
<div class="lede">Compounding learning in three paths — all three happen automatically, no operator intervention required.</div>

<h3>Path 1 — Playbook boost (Phase 19)</h3>
<p>Every sealed fill is seeded to <code>playbook_memory</code> via <code>/vectors/playbook_memory/seed</code>. The next hybrid query for a semantically similar role+geo surfaces the past endorsed workers with a boost. Math:</p>
<pre>per_worker = cosine(query_emb, playbook_emb) × 0.5 × e^(-age/30) × 0.5^failures / n_workers
boost[(city, state, name)] = min(Σ per_worker, 0.25)</pre>
<p>Caps, decay, and negative signal mean one popular worker can't dominate, old playbooks fade, and no-shows stop boosting. Verified live: 3 identical seeds → +0.250 boost capped, 3 citations.</p>
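<p>A worked version of the formula, as a sketch. It assumes <code>age</code> is measured in days and that each matching playbook contributes one <code>per_worker</code> term to the sum; the function and constant names are illustrative, not taken from <code>vectord</code>.</p>

```python
import math

BOOST_CAP = 0.25     # upper bound on the summed boost
BASE_WEIGHT = 0.5    # the 0.5 multiplier in the formula
DECAY_DAYS = 30.0    # e^(-age/30), age assumed to be in days


def per_worker(cosine, age_days, failures, n_workers):
    """Contribution of one playbook to one endorsed worker."""
    return (cosine * BASE_WEIGHT * math.exp(-age_days / DECAY_DAYS)
            * 0.5 ** failures / n_workers)


def boost(contributions):
    """Sum the per-playbook contributions and apply the cap."""
    return min(sum(contributions), BOOST_CAP)
```

<p>With three fresh, identical seeds (cosine 1.0, no failures, one worker each), every seed contributes 0.5, the sum is 1.5, and the cap brings it to 0.25, which is consistent with the +0.250 figure quoted above. Two logged failures cut a single contribution from 0.5 to 0.125.</p>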

<h3>Path 2 — Pattern discovery (meta-index)</h3>
<p><code>/vectors/playbook_memory/patterns</code> goes beyond "who was endorsed" to answer "what did past similar fills have in common?" Aggregates recurring certifications, skills, archetype, reliability distribution across the top-K semantically similar playbooks. Surfaces signal the operator didn't explicitly query for.</p>

<h3>Path 3 — Autotune agent</h3>
<p>The <code>vectord::agent</code> background task runs continuously. Watches the HNSW trial journal, proposes configs, executes trials, promotes Pareto winners — without human intervention. Operator sees "the index got faster overnight" and doesn't know why. The journal knows why.</p>
</div>

<!-- ═══ 7. SCALE STORY ═══ -->
<div class="chapter" id="ch7">
<div class="num">Chapter 7</div>
<h2>Scale story — 20 staffers, 300 contracts, a surge</h2>
<div class="lede">What happens when the demo-level load becomes the production-level load, and midday a client pushes 20 more contracts plus a 1M-row ATS delta. Honest: some of this is architectural headroom, not measured scale. The designed behaviors are below.</div>

<h3>20 concurrent staffers</h3>
<p><strong>Axum is async.</strong> The gateway handles concurrent requests on Tokio with work-stealing. No per-request thread. Tested at 10 parallel queries in 82ms total on this hardware.</p>
<p><strong>Per-staffer profile isolation.</strong> Each staffer activates their own profile (Phase 17) or workspace (Phase 8.5). Profile scopes their search to bound datasets. Workspace carries their in-progress contracts across sessions.</p>
<p><strong>Per-client blacklists.</strong> Auto-applied when the caller passes <code>client: "X"</code> on <code>/search</code>. Staffer A filling for Acme never sees Acme's flagged workers. Staffer B filling for MidState sees them normally.</p>
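<p>The blacklist behavior amounts to a filter keyed on the client name. A minimal sketch, assuming candidates are dicts with a hypothetical <code>worker_id</code> key and blacklists are a mapping from client name to a set of worker ids:</p>

```python
def apply_blacklist(candidates, blacklists, client=None):
    """Drop candidates flagged by `client`; pass through if no client given."""
    if client is None:
        return list(candidates)
    blocked = blacklists.get(client, set())
    return [c for c in candidates if c["worker_id"] not in blocked]
```

<p>The same candidate list therefore yields different results for a search scoped to Acme and one scoped to MidState, without either staffer managing the exclusion by hand.</p>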

<h3>300 active contracts</h3>
<p><strong>SQL on <code>job_orders</code> is cheap.</strong> 300 rows is nothing — a scan is microseconds.</p>
<p><strong>Workspace per contract.</strong> Each contract gets its own workspace with saved searches, shortlists, activity log. Zero-copy handoff between staffers (pointer swap, not data copy).</p>
<p><strong>Forecast remains coherent.</strong> <code>/intelligence/staffing_forecast</code> aggregates 30-day permit data regardless of contract count. The bench supply query (<code>GROUP BY role</code> over workers_500k) is a single sub-second SQL query.</p>

<h3>Midday surge: +20 contracts, +1M profiles</h3>
<p>The delta arrives at 12:30. Here's what happens in the following minutes:</p>
<div class="step"><div class="n">1</div><div class="body"><strong>+20 contracts via /ingest/db or /ingest/file.</strong> Parsed, schema-checked, Parquet-written, catalog-registered. No queries blocked — register holds a write lock across the manifest write only.</div></div>
<div class="step"><div class="n">2</div><div class="body"><strong>+1M worker profiles arrive as a delta to workers_500k.</strong> The append-log pattern (ADR-018) means the new rows are written to a fresh batch file; the base Parquet is NOT rewritten. Queries against workers_500k immediately merge-on-read the new batches.</div></div>
<div class="step"><div class="n">3</div><div class="body"><strong>Embeddings marked stale.</strong> The vector index for workers_500k_v1 now has 1M rows it hasn't seen. <code>mark_embeddings_stale</code> flips the flag.</div></div>
<div class="step"><div class="n">4</div><div class="body"><strong>Incremental refresh fires.</strong> <code>POST /vectors/refresh/workers_500k</code> reads only the new rows (diff against existing embeddings), embeds them in batches of 64 via Ollama, writes delta embedding Parquet. Measured on threat_intel: 34 new rows in 970ms (6× faster than full re-embed).</div></div>
<div class="step"><div class="n">5</div><div class="body"><strong>Search degrades gracefully.</strong> During the refresh, searches against workers_500k_v1 still work: they serve from the old embeddings. Brute-force cosine over new rows that are embedded but not yet in the HNSW index is allowed but costs more. The HNSW rebuild happens after all embeds complete.</div></div>
<div class="step"><div class="n">6</div><div class="body"><strong>Hot-swap promotion.</strong> When the new index is ready, <code>promotion_registry</code> atomically flips the active pointer. Next search hits the new config. Rollback stays available.</div></div>
<div class="step"><div class="n">7</div><div class="body"><strong>Autotune re-enters the loop.</strong> The agent queue picks up a <code>DatasetAppended</code> trigger and schedules a fresh HNSW trial cycle against the expanded index.</div></div>
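<p>The incremental refresh in step 4 can be sketched as a diff followed by batched embedding. The batch size of 64 comes from the text; everything else is an assumption for illustration, including the function names and the <code>embed_batch</code> callable, which stands in for the call through the sidecar.</p>

```python
BATCH_SIZE = 64  # batch size quoted in step 4


def rows_needing_embeddings(all_ids, embedded_ids):
    """Ids present in the dataset but missing from the embedding store."""
    seen = set(embedded_ids)
    return [i for i in all_ids if i not in seen]


def batched(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def incremental_refresh(rows, embedded_ids, embed_batch):
    """Embed only the new rows and return {row_id: vector} for the delta.

    `rows` maps row id -> text; `embed_batch` takes a list of texts and
    returns one vector per text (hypothetical interface).
    """
    pending = rows_needing_embeddings(rows.keys(), embedded_ids)
    delta = {}
    for batch in batched(pending):
        vectors = embed_batch([rows[i] for i in batch])
        delta.update(zip(batch, vectors))
    return delta
```

<p>With 150 rows of which 20 are already embedded, the embedder is called three times (64, 64, and 2 texts) and the returned delta covers exactly the 130 new rows; the existing embeddings are never touched.</p>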

<h3>Known pain points at this scale</h3>
<ul>
<li><strong>Ollama inference is serial.</strong> Embedding 1M rows at ~50 chunks/sec through nomic-embed-text takes roughly 5.6 hours (1,000,000 ÷ 50 = 20,000 s), assuming one chunk per row. Acceptable for overnight refresh, not for "immediate." Mitigated by incremental refresh (only deltas).</li>
<li><strong>RAM ceiling on HNSW.</strong> Around 5M vectors × 768d, HNSW stops fitting in 128GB comfortably. Mitigation: per-profile <code>vector_backend: lance</code> flip — disk-resident IVF_PQ scales past the RAM line (ADR-019).</li>
<li><strong>VRAM ceiling for model variety.</strong> A4000 16GB holds 1-2 loaded models. Multi-model recruiter surfaces are a sequential swap, not parallel (Ollama <code>keep_alive=0</code>). Phase 17 profile activation unloads the prior model on swap.</li>
<li><strong>playbook_memory growth.</strong> Currently unbounded. 391 entries today; at the current rate that becomes ~5K in six months. Search at the default k=200 stays sub-millisecond at 5K entries. A compaction policy (TTL + decay + merge) is deferred.</li>
</ul>
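<p>The embedding-time estimate in the first bullet is simple division. The sketch below reproduces it, assuming one chunk per row and the ~50 chunks/sec throughput quoted above; the 10,000-row delta is a hypothetical size chosen only for contrast.</p>

```python
def embed_seconds(n_chunks, chunks_per_sec):
    """Wall-clock seconds to embed n_chunks serially."""
    return n_chunks / chunks_per_sec


def embed_hours(n_chunks, chunks_per_sec):
    return embed_seconds(n_chunks, chunks_per_sec) / 3600


full_delta_hours = embed_hours(1_000_000, 50)   # the 1M-row surge
small_delta_secs = embed_seconds(10_000, 50)    # hypothetical 10K-row delta
```

<p>The 1M-row surge works out to about 5.6 hours, while a 10K-row delta at the same throughput takes about 200 seconds, which is why keeping deltas small matters more than raw embedding speed.</p>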
</div>

<!-- ═══ 8. ERROR SURFACES ═══ -->
<div class="chapter" id="ch8">
<div class="num">Chapter 8</div>
<h2>Error surfaces &amp; recovery</h2>
<div class="lede">Every failure mode has a named surface, a structured response, and a recovery path. No silent failures.</div>

<table class="plain">
<thead><tr><th>Failure mode</th><th>Surface / response</th><th>Recovery</th></tr></thead>
<tbody>
<tr><td>Ingest receives file with schema mismatch vs existing dataset</td><td><code>409 Conflict</code> with both fingerprints named (ADR-020)</td><td>Re-ingest under a new name, or migrate the existing via Phase 14 schema evolution</td></tr>
<tr><td>Bucket unreachable on write</td><td>Hard 503, error journaled to <code>primary://_errors/bucket_errors/</code></td><td><code>GET /storage/errors</code> lists failures; <code>GET /storage/bucket-health</code> shows per-bucket status</td></tr>
<tr><td>Bucket unreachable on read</td><td>Rescue bucket fallback, <code>X-Lakehouse-Rescue-Used: true</code> header on response</td><td>Response still succeeds; operator sees rescue flag</td></tr>
<tr><td>/log receives name that doesn't exist in workers_500k</td><td>Seed is SKIPPED; response includes <code>rejected_ghost_names: [...]</code> and a note</td><td>Operator sees exactly which names were rejected and why</td></tr>
<tr><td>Dual-agent executor malforms tool call</td><td>Result appended to log with <code>error</code> field; counter increments</td><td>After 3 consecutive: abort with full log dump at <code>tests/multi-agent/playbooks/&lt;id&gt;-FAILED.json</code></td></tr>
<tr><td>Dual-agent drifts from target</td><td>Reviewer verdict = <code>drift</code>, counter increments</td><td>After 3 consecutive drifts: abort with full log</td></tr>
<tr><td>Hybrid search finds zero candidates</td><td>Returns empty <code>sources[]</code> + <code>sql_matches: 0</code></td><td>Gap signal captured by scenario runner; operator prompted to broaden filter</td></tr>
<tr><td>Ollama sidecar down</td><td>502 Bad Gateway from aibridge; <code>embed</code> calls fail fast</td><td>Restart: <code>systemctl restart lakehouse-sidecar</code>; vector search falls back to pre-computed embeddings</td></tr>
<tr><td>Gateway restart mid-operation</td><td>In-memory state (playbook_memory, HNSW) reloaded from persisted <code>state.json</code> / trial journals</td><td>Zero data loss; catalog, storage, journals are all source-of-truth</td></tr>
<tr><td>Schema fingerprint diverges across manifests</td><td><code>catalog::dedupe</code> reports <code>DedupeReport</code> with winner selection (non-null row_count first, then newest updated_at)</td><td><code>POST /catalog/dedupe</code> collapses duplicates idempotently</td></tr>
</tbody>
</table>
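<p>The winner-selection rule in the last row (non-null <code>row_count</code> first, then newest <code>updated_at</code>) can be expressed as a single sort key. This is a sketch: the <code>Manifest</code> type here is a hypothetical reduction of the real manifest to the two fields the rule uses.</p>

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Manifest:
    name: str
    row_count: Optional[int]
    updated_at: datetime


def pick_winner(duplicates):
    """Prefer manifests with a known row_count, then the newest one."""
    # Tuples compare left to right: (True, ...) beats (False, ...),
    # and ties on the first element fall through to updated_at.
    return max(duplicates,
               key=lambda m: (m.row_count is not None, m.updated_at))
```

<p>Because the rule is a pure function of the manifests, running it again over the already-collapsed set picks the same winner, which is what makes the dedupe idempotent.</p>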
</div>

<!-- ═══ 9. PER-STAFFER CONTEXT ═══ -->
<div class="chapter" id="ch9">
<div class="num">Chapter 9</div>
<h2>Per-staffer context</h2>
<div class="lede">Twenty staffers don't see the same UI state. Each one's session is shaped by their active profile, their workspaces, their assigned contracts, and their client's blacklists.</div>

<h3>Active profile (Phase 17)</h3>
<p>Scopes every search. A <code>staffing-recruiter</code> profile bound to <code>workers_500k</code> sees only that dataset. A <code>security-analyst</code> profile bound to <code>threat_intel</code> cannot see worker data. Every tool invocation is recorded by model identity; <code>GET /vectors/profile/&lt;id&gt;/audit</code> returns that record.</p>

<h3>Workspace (Phase 8.5)</h3>
<p>Per-contract state. Each workspace has daily/weekly/monthly tiers, saved searches, shortlists, activity logs. Survives across sessions. Instant zero-copy handoff between staffers — pointer swap, not data copy. Persisted to object storage, rebuilt on startup.</p>

<h3>Client blacklist</h3>
<p>Per-client worker exclusion. Populated via <code>POST /clients/:client/blacklist</code>. Auto-applied when the caller passes <code>client: "X"</code> on <code>/search</code>. JSON-backed; would move to catalog table under real client load.</p>

<h3>Audit trail</h3>
<p>The Phase 12 tool registry logs every governed-action invocation (who called what, with what args, when, outcome). The log is queryable via <code>GET /tools/audit</code>. Phase 13 access control layers on top: role-based field masking and a query audit log.</p>

<h3>Daily summary per staffer</h3>
<p>Workspace activity log + per-staffer filter on the event journal gives <strong>"what did Sarah do today"</strong> as a direct query. The foundation for shift-handoff reports.</p>
</div>

<!-- ═══ 10. A DAY IN THE LIFE ═══ -->
<div class="chapter" id="ch10">
<div class="num">Chapter 10</div>
<h2>A day in the life — from morning brief to EOD retrospective</h2>
<div class="lede">Concrete operator timeline. Every step touches a real endpoint that exists today.</div>

<div class="step"><div class="n">07:00</div><div class="body"><strong>Overnight housekeeping.</strong> Scheduled ingest runs: the configured cron picks up the client's latest ATS CSV delta, runs it through the pipeline in Ch2, and marks the <code>workers_500k</code> embeddings stale. The autotune agent promotes any Pareto-winner HNSW configs from overnight trials.</div></div>

<div class="step"><div class="n">07:30</div><div class="body"><strong>Embedding refresh.</strong> Background job re-embeds the new rows. Old index keeps serving. Hot-swap promotes when done.</div></div>

<div class="step"><div class="n">08:00</div><div class="body"><strong>Sarah (staffer) opens devop.live/lakehouse.</strong> Page loads in ~3s. Forecast panel shows: "$275M construction coming, 4 tight roles this week." Live Contracts section shows 6 Chicago permits with proposed fills + boost chips + pattern signals.</div></div>

<div class="step"><div class="n">08:15</div><div class="body"><strong>Sarah drills into a $5M permit.</strong> Top candidate card: Carmen Green, Endorsed · 3 playbooks chip, boost +0.166, pattern line reads "leader archetype · 47% OSHA-10." Sarah hovers the chip — narrative tooltip: "filled Welder x2 in Toledo (2026-04-15), Welder x1 in Toledo (2026-04-18)."</div></div>

<div class="step"><div class="n">08:30</div><div class="body"><strong>Sarah calls Carmen.</strong> Clicks Call button → <code>/log</code> fires → <code>playbook_memory.seed</code> → <code>persist_sql</code> → successful_playbooks_live grows by one. Button flashes "Logged" for 1.4s. No modal, no form, no second click.</div></div>

<div class="step"><div class="n">09:00</div><div class="body"><strong>Kim (another staffer) opens the same UI.</strong> Her profile loads. Her workspaces show her own contracts. She searches "reliable forklift Chicago" — MEMORY chip shows the pattern discovered across Sarah's morning work AND prior fills. Carmen, already logged by Sarah, shows up with an updated citation count.</div></div>

<div class="step"><div class="n">12:30</div><div class="body"><strong>Client pushes 20 new contracts + 1M ATS delta.</strong> Ch7 scale flow fires. Ingest in seconds; embedding refresh kicks off as a background job. Searches continue against old embeddings.</div></div>

<div class="step"><div class="n">14:00</div><div class="body"><strong>Emergency: worker Dave no-showed.</strong> Sarah clicks the No-show button on Dave's card → <code>/log_failure</code> → <code>mark_failed</code> records a penalty. The next similar query dampens Dave's boost by 0.5. Sarah continues the refill, which excludes Dave and the 2 others already booked for this shift.</div></div>

<div class="step"><div class="n">15:00</div><div class="body"><strong>New embeddings live.</strong> Hot-swap promotion. Searches now see all 1M new profiles. Re-running Sarah's noon query would produce a different top-5.</div></div>

<div class="step"><div class="n">17:00</div><div class="body"><strong>End-of-day retrospective.</strong> Any staffer who ran <code>tests/multi-agent/scenario.ts</code> gets an auto-generated <code>report.md</code>. Workspace activity logs aggregate per staffer. <code>GET /vectors/playbook_memory/stats</code> shows the day's new entries.</div></div>

<div class="step"><div class="n">22:00</div><div class="body"><strong>Overnight trial cycle.</strong> Autotune agent continues in the background. Trial journal grows. Tomorrow morning, the system is measurably better at something it got asked about today.</div></div>
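<p>The 07:30 and 15:00 steps rely on the old index serving queries until the rebuilt one is promoted. A minimal sketch of that hot-swap pattern follows; <code>VectorIndex</code> and <code>IndexHandle</code> are hypothetical names, and the real promotion logic may differ.</p>

```typescript
// Hypothetical index interface; stands in for an HNSW index over embeddings.
interface VectorIndex {
  version: number;
  search(query: string): string[];
}

// Searches always go through the handle, so promotion is a single reference swap.
class IndexHandle {
  private active: VectorIndex;

  constructor(initial: VectorIndex) {
    this.active = initial;
  }

  search(query: string): string[] {
    return this.active.search(query);
  }

  // Called once the background re-embed has finished building `next`.
  promote(next: VectorIndex): void {
    this.active = next;
  }

  activeVersion(): number {
    return this.active.version;
  }
}
```

<p>Until <code>promote</code> is called, every search is answered by the previous index, which matches the "searches continue against old embeddings" behavior in the 12:30 step.</p>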

<h3>SMS + email drafts in the pipeline</h3>
<p>After each sealed fill (via scenario.ts or manual <code>/log</code> flow with downstream hooks), <code>generateArtifacts</code> in the scenario runner produces: (a) one SMS per worker (TO: Name, message under 180 chars), (b) one client confirmation email. Drafts are saved to <code>sms.md</code> and <code>emails.md</code> under the scenario output dir. Ollama drafts them; the staffer reviews and sends. No auto-send; human-in-the-loop.</p>
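<p>Before a staffer reviews the drafts, the two stated constraints (one SMS per worker, message under 180 characters) can be checked mechanically. The sketch below is an assumption about how such a check might look; <code>SmsDraft</code> and <code>checkSmsDrafts</code> are hypothetical and are not part of <code>generateArtifacts</code>.</p>

```typescript
// Hypothetical draft shape: "TO: Name" plus the message body.
interface SmsDraft {
  to: string;
  message: string;
}

const SMS_MAX_EXCLUSIVE = 180; // messages must stay under 180 characters

// Returns a list of human-readable problems; an empty list means the drafts pass.
function checkSmsDrafts(workers: string[], drafts: SmsDraft[]): string[] {
  const problems: string[] = [];
  for (const w of workers) {
    const n = drafts.filter(d => d.to === w).length;
    if (n !== 1) problems.push(`${w}: expected 1 SMS draft, found ${n}`);
  }
  for (const d of drafts) {
    if (d.message.length >= SMS_MAX_EXCLUSIVE) {
      problems.push(`${d.to}: message is ${d.message.length} chars, must be under ${SMS_MAX_EXCLUSIVE}`);
    }
  }
  return problems;
}
```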
</div>

<!-- ═══ 11. LIMITS & NON-GOALS ═══ -->
<div class="chapter" id="ch11">
<div class="num">Chapter 11</div>
<h2>Known limits &amp; non-goals</h2>
<div class="lede">Honesty is a feature. Everything below is either deferred or explicitly out of scope.</div>

<h4>Deferred — real architectural work, just not shipped yet</h4>
<ul>
<li><strong>Rate / margin awareness.</strong> Worker pay expectations vs contract bill rate not modeled. Requires adding <code>pay_rate</code> to workers, <code>bill_rate</code> to contracts, and a filter + warning path. Phase 20 item.</li>
<li><strong>Push / background presence.</strong> The app requires being opened. No Slack/email/SMS push when a contract lands with a pre-ranked candidate list. Would make the "system is already thinking" claim more visible to phone-first shops.</li>
<li><strong>Confidence calibration.</strong> Top-K is a rank, not a probability. No calibrated "85% likely to accept" score. Requires outcome-labeled training data.</li>
<li><strong>Neural re-ranker.</strong> Phase 19 is statistical + semantic. A (query, candidate, outcome)-trained re-ranker is deferred to Phase 20+ per ADR, only if the statistical floor plateaus below usable recall.</li>
<li><strong>playbook_memory compaction.</strong> No TTL or merge policy. Entries accumulate. At the expected rate this hits 10K in a year, which is still tractable but warrants a policy.</li>
<li><strong>call_log cross-reference.</strong> Infrastructure present; the current synthetic candidates table is too small to cross-reference. Resolves once real ATS data lands.</li>
</ul>
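<p>For the deferred rate / margin item, the filter + warning path could look roughly like the sketch below. Only the <code>pay_rate</code> and <code>bill_rate</code> field names come from the list above; the <code>minMargin</code> threshold, the three verdicts, and the margin formula are assumptions, since none of this is implemented yet.</p>

```typescript
// Field names pay_rate and bill_rate follow the deferred item; the rest is hypothetical.
interface WorkerRate {
  id: string;
  pay_rate: number; // hourly pay expectation
}

interface ContractRate {
  id: string;
  bill_rate: number; // hourly rate billed to the client
}

type MarginVerdict = "ok" | "warn" | "exclude";

// Margin is the fraction of the bill rate left after pay (assumed definition).
function checkMargin(worker: WorkerRate, contract: ContractRate, minMargin: number): { margin: number; verdict: MarginVerdict } {
  if (!(contract.bill_rate > 0)) {
    throw new Error(`contract ${contract.id} has no positive bill_rate`);
  }
  const margin = (contract.bill_rate - worker.pay_rate) / contract.bill_rate;
  let verdict: MarginVerdict;
  if (margin >= minMargin) {
    verdict = "ok"; // healthy margin: rank as usual
  } else if (margin > 0) {
    verdict = "warn"; // thin margin: show with a warning
  } else {
    verdict = "exclude"; // pay meets or exceeds bill rate: filter out
  }
  return { margin, verdict };
}
```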

<h4>Non-goals — explicitly out of scope</h4>
<ul>
<li><strong>Cloud deployment.</strong> Local-first by design. Works offline after setup.</li>
<li><strong>Full ACID transactions.</strong> Single-writer model is sufficient; Delta Lake-grade MVCC is deliberately not attempted.</li>
<li><strong>Real-time streaming / CDC.</strong> Batch ingest is the model. Scheduled refresh, not transactional replication.</li>
<li><strong>Replacing the CRM.</strong> This is the analytical + AI layer <em>behind</em> the CRM. Operational CRUD stays with the existing system.</li>
<li><strong>Custom file formats.</strong> Parquet for datasets, sidecar indexes for vectors. No proprietary formats (ADR-008, ADR-018 reaffirm).</li>
<li><strong>Hard multi-tenant isolation.</strong> Profiles and federation provide soft isolation. Adversarial multi-tenant is not a goal — this system assumes a single-trust operator.</li>
</ul>

<div class="narr">
<strong>Overall bet.</strong> The substrate is conservative: Parquet + DataFusion + HNSW + Ollama + object storage. Every layer is replaceable, open, auditable. The intelligence layer (playbook_memory, patterns, autotune) is statistical, not neural — cheaper, explainable, rebuildable from the journal alone. If the statistical floor plateaus below what a real client needs, Phase 20+ adds neural re-rank on top. We don't make that call until measurement demands it.
</div>
</div>

</div>
</div>

<div class="footer">Lakehouse spec · v1 2026-04-20 · maintained from <code>docs/DECISIONS.md</code> · <a href="proof">architecture live-tested</a> · <a href="console">walkthrough</a></div>

</body></html>