Spec v2 — all chapters aligned with Phases 19-25

Full audit pass on devop.live/lakehouse/spec. Five chapters were
stale, and one contained an outright incorrect line. The scope was
bigger than ch6 alone: J asked "you want to update all?" and the
honest answer was yes.

Ch 1 (Repository layout):
- mcp-server row gains /memory/query, /models/matrix, /system/summary,
  observer.ts with :3800 listener
- tests/multi-agent/ row lists all new files: kb.ts, normalize.ts,
  memory_query.ts, gen_scenarios.ts, gen_staffer_demo.ts, and the
  colocated unit tests (kb.test.ts, normalize.test.ts)
- NEW config/ row documents models.json as the 5-tier matrix
- data/ row enumerates the four learning-loop directories:
  _kb/, _playbook_lessons/, _observer/, _chunk_cache/

Ch 3 (Measurement & indexing):
- NEW "Model matrix (Phase 20)" subsection — 5-tier table (T1 hot /
  T2 review / T3 overview / T4 strategic / T5 gatekeeper), per-tier
  primary model, frequency, the think:false mechanical finding
  called out with the 650-token reasoning-budget example
- NEW "Continuation primitive (Phase 21)" paragraph
- NEW "Per-staffer tool_level (Phase 23)" section with full/local/
  basic/minimal mapping and the 33pt fill-rate delta from the 36-run
  demo
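For reference, the tier/budget mechanics condense to something like the sketch below. Field names (context_window, context_budget, overflow_policy) and the think:false split come from the spec; the window/budget numbers and helper shapes are illustrative, not the shipped config/models.json:

```typescript
// Sketch of tier lookup against a models.json-shaped matrix.
// Numbers and helper names are illustrative.

type TierId = "T1" | "T2" | "T3" | "T4" | "T5";

interface TierConfig {
  model: string;
  think: boolean;          // hot path disables hidden reasoning
  context_window: number;  // model hard limit (tokens)
  context_budget: number;  // soft limit enforced before the call
  overflow_policy: "tree_split" | "reject";
}

const MATRIX: Record<TierId, TierConfig> = {
  T1: { model: "qwen3.5:latest",   think: false, context_window: 32768,  context_budget: 24000, overflow_policy: "reject" },
  T2: { model: "qwen3:latest",     think: false, context_window: 32768,  context_budget: 24000, overflow_policy: "reject" },
  T3: { model: "gpt-oss:120b",     think: true,  context_window: 131072, context_budget: 96000, overflow_policy: "tree_split" },
  T4: { model: "qwen3.5:397b",     think: true,  context_window: 131072, context_budget: 96000, overflow_policy: "tree_split" },
  T5: { model: "kimi-k2-thinking", think: true,  context_window: 131072, context_budget: 96000, overflow_policy: "tree_split" },
};

// Rough token estimate: ~4 chars per token.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

// The guard that makes silent truncation impossible: refuse or split
// before the call ever reaches the model.
function assertContextBudget(tier: TierId, prompt: string): "ok" | "tree_split" {
  const cfg = MATRIX[tier];
  if (estimateTokens(prompt) <= cfg.context_budget) return "ok";
  if (cfg.overflow_policy === "tree_split") return "tree_split";
  throw new Error(`${tier}: prompt exceeds context_budget ${cfg.context_budget}`);
}
```

T3's tree_split policy feeds the generateTreeSplit map-reduce path; T1/T2 reject because a hot-path JSON emitter with an oversized prompt is a caller bug.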

Ch 7 (Scale story):
- FIX: playbook_memory growth bullet was claiming "No TTL or merge
  policy" — Phase 25 added retirement via valid_until +
  schema_fingerprint + /retire endpoint. Rewritten to name current
  state (1936 entries, active vs retired split exposed).
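The retirement pass reduces to a scoped fingerprint filter, roughly as below. The endpoint and the valid_until / schema_fingerprint fields are the spec's; the in-memory entry shape and function name are assumptions (the real state lives in state.json):

```typescript
// Illustrative sketch of what POST /vectors/playbook_memory/retire
// does for one (city, state) scope. Entry shape is a guess.

interface PlaybookEntry {
  id: string;
  city: string;
  state: string;
  schema_fingerprint: string;
  valid_until: string | null; // ISO date; null = no expiry
  retired: boolean;
}

function retireMismatched(
  entries: PlaybookEntry[],
  scope: { city: string; state: string },
  liveFingerprint: string,
  now: Date = new Date(),
): { active: number; retired: number } {
  for (const e of entries) {
    // Scoped by (city, state) so unrelated geos are untouched.
    if (e.city !== scope.city || e.state !== scope.state) continue;
    const expired = e.valid_until !== null && new Date(e.valid_until) < now;
    if (e.schema_fingerprint !== liveFingerprint || expired) e.retired = true;
  }
  const retired = entries.filter((e) => e.retired).length;
  return { active: entries.length - retired, retired };
}
```

Retired entries stay in the array (journal forensics) and only the boost calculations skip them, which is what the active-vs-retired split on /status reports.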

Ch 8 (Error surfaces):
- Four new rows added to the failure-mode table:
  * Zero-supply city → cloud rescue (Phase 22 item B) with the
    Gary IN → South Bend IN concrete example
  * LLM truncation → generateContinuable (Phase 21)
  * Schema migration → /vectors/playbook_memory/retire (Phase 25)
  * Observer unreachable → scenario silent-skip + append journal
    survivability
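The truncation row hinges on the "brace-balance + JSON.parse" detection the spec attributes to generateContinuable. A minimal sketch of that classifier — function name and return shape are illustrative, not the shipped agent.ts:

```typescript
// Classify an LLM response: parseable, empty (backoff retry),
// or truncated mid-JSON (continue with partial as scratchpad).

type LlmOutcome = "ok" | "empty" | "truncated";

function classifyResponse(raw: string): LlmOutcome {
  const text = raw.trim();
  if (text.length === 0) return "empty"; // thinking model ate the budget
  try {
    JSON.parse(text);
    return "ok";
  } catch {
    // Count open braces/brackets outside string literals; positive
    // balance means the model ran out of tokens mid-object.
    let depth = 0, inString = false, escaped = false;
    for (const ch of text) {
      if (escaped) { escaped = false; continue; }
      if (ch === "\\") { escaped = inString; continue; }
      if (ch === '"') { inString = !inString; continue; }
      if (!inString && (ch === "{" || ch === "[")) depth++;
      if (!inString && (ch === "}" || ch === "]")) depth--;
    }
    // Balanced-but-malformed output is treated as a fresh retry.
    return depth > 0 ? "truncated" : "empty";
  }
}
```

On "truncated" the continuation loop re-prompts with the partial output as scratchpad, bounded by max_continuations.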

Ch 9 (Per-staffer context):
- NEW "Staffer identity + competence-weighted retrieval (Phase 23)"
  section with the competence_score formula and findNeighbors
  weighted_score
- NEW "Auto-discovered reliable-performer labels" section naming
  Rachel D. Lewis (18 endorsements) and Angela U. Ward (19) as
  concrete output of the 36-run demo
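The two Ch 9 formulas transcribe directly; the weights are the spec's, only the input record shape is assumed:

```typescript
// competence_score and findNeighbors weighted_score, per Ch 9.
// StafferStats field semantics assumed to be 0..1 normalized.

interface StafferStats {
  fill_rate: number;
  turn_efficiency: number;
  citation_density: number;
  rescue_rate: number;
}

const competenceScore = (s: StafferStats): number =>
  0.45 * s.fill_rate +
  0.20 * s.turn_efficiency +
  0.20 * s.citation_density +
  0.15 * s.rescue_rate;

// Cosine alone would let a junior's marginally closer scenario outrank
// a senior's; weighting by the best competence over staffers who ran
// that signature flips the ordering.
function weightedScore(cosine: number, staffersOnSignature: StafferStats[]): number {
  const maxCompetence = Math.max(...staffersOnSignature.map(competenceScore));
  return cosine * maxCompetence;
}
```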

Ch 10 (A day in the life):
- Added 17:15 timeline entry — Kim using /memory/query with natural
  language, regex normalizer extracting role/city/count in 0ms
- 17:00 entry updated to mention KB indexing + pathway recommendation
  + observer stream
- 22:00 entry updated to mention detectErrorCorrections nightly scan
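The 0ms claim in the 17:15 entry is the no-LLM regex path. A hypothetical version of that extraction — the patterns here are illustrative, not the shipped normalize.ts, which also has structured and LLM fallback tiers:

```typescript
// Fast-path normalizer: pull role / city / count / deadline from a
// natural-language request with regex only. Patterns are illustrative.

interface NormalizedQuery {
  count: number | null;
  role: string | null;
  city: string | null;
  deadline: string | null;
}

function normalizeFast(q: string): NormalizedQuery {
  // e.g. "need 3 forklift operators in Joliet by Monday"
  const count = q.match(/\b(\d+)\b/);
  const role = q.match(/\b(\d+)\s+([a-z ]+?)s?\s+in\b/i); // words between count and "in"
  const city = q.match(/\bin\s+([A-Z][a-zA-Z]+)/);
  const deadline = q.match(/\bby\s+(\w+)/i);
  return {
    count: count ? parseInt(count[1], 10) : null,
    role: role ? role[2].trim() : null,
    city: city ? city[1] : null,
    deadline: deadline ? deadline[1] : null,
  };
}
```

Any field the regexes miss stays null and would fall through to the slower LLM tier; the fast path never guesses.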

Ch 11 (Known limits & non-goals):
- FIX: "playbook_memory compaction" bullet rewritten since retirement
  is now wired; reframed as the honest Mem0 UPDATE/NOOP gap
- Added Letta hot cache deferred item with honest "cheap at 1.9K,
  will bite at 100K" framing
- Added Chunking cache (Phase 21 Rust port) deferred item
- Added Observer → autotune feedback wire deferred item (Phase 26+)

Footer bumped v1 2026-04-20 → v2 2026-04-21 with Phase list.

Verified all updates live on devop.live/lakehouse/spec.
This commit is contained in:
root 2026-04-21 00:16:41 -05:00
parent e0a843d1a5
commit 138592dc56


@ -123,10 +123,11 @@ table.plain tr:hover td{background:#0d1117}
<tr><td class="mono">crates/aibridge/</td><td>Rust ↔ Python sidecar. HTTP client over FastAPI wrapper around Ollama. VRAM introspection via nvidia-smi. All LLM calls (embed, generate, rerank) flow through here.</td></tr>
<tr><td class="mono">crates/gateway/</td><td>Axum HTTP (:3100) + gRPC (:3101). Auth middleware, tools registry (Phase 12 — governed actions), CORS. Every external request enters here.</td></tr>
<tr><td class="mono">crates/ui/</td><td>Dioxus WASM developer UI. Internal tool. Not exposed externally.</td></tr>
<tr><td class="mono">mcp-server/</td><td>Bun/TypeScript recruiter-facing app. Serves <code>devop.live/lakehouse</code>. Routes: <code>/search /match /log /log_failure /clients/:c/blacklist /intelligence/*</code>. Proxies to the Rust gateway for heavy work.</td></tr>
<tr><td class="mono">tests/multi-agent/</td><td>Dual-agent scenario harness. <code>agent.ts</code> (prompts + protocol), <code>orchestrator.ts</code> (single task), <code>scenario.ts</code> (5-event warehouse week), <code>run_e2e_rated.ts</code> (parallel pairs + rating), <code>chain_of_custody.ts</code> (layer-by-layer audit).</td></tr>
<tr><td class="mono">mcp-server/</td><td>Bun/TypeScript recruiter-facing app. Serves <code>devop.live/lakehouse</code>. Routes: <code>/search /match /log /log_failure /clients/:c/blacklist /intelligence/* /memory/query /models/matrix /system/summary</code>. Observer sibling at <code>observer.ts</code> with HTTP listener on :3800 for scenario event ingest. Proxies to the Rust gateway for heavy work.</td></tr>
<tr><td class="mono">tests/multi-agent/</td><td>Dual-agent scenario harness + memory stack. <code>agent.ts</code> (prompts, continuation + tree-split primitives, cloud routing), <code>orchestrator.ts</code>, <code>scenario.ts</code> (contracts + staffer + tool_level), <code>kb.ts</code> (KB indexing, competence scoring, neighbor retrieval), <code>normalize.ts</code> (input normalizer — structured / regex / LLM), <code>memory_query.ts</code> (unified /memory/query), <code>gen_scenarios.ts</code> + <code>gen_staffer_demo.ts</code> (corpus generators), <code>run_e2e_rated.ts</code>, <code>chain_of_custody.ts</code>. Unit tests colocated (<code>kb.test.ts</code>, <code>normalize.test.ts</code>).</td></tr>
<tr><td class="mono">config/</td><td><code>models.json</code> — authoritative 5-tier model matrix (T1 hot local / T2 review local / T3 overview cloud / T4 strategic / T5 gatekeeper). Per-tier context_window + context_budget + overflow_policy. Read at runtime by scenario.ts; hot-swap friendly.</td></tr>
<tr><td class="mono">docs/</td><td><code>PRD.md</code>, <code>PHASES.md</code>, <code>DECISIONS.md</code> (20 ADRs). Every significant architectural choice has an ADR with the alternatives that were rejected and why.</td></tr>
<tr><td class="mono">data/</td><td>Default local object store. Parquet files per dataset, append-log batches, HNSW trial journals, promotion registries, playbook_memory state.json, catalog manifests. Rebuildable from repo + this dir alone.</td></tr>
<tr><td class="mono">data/</td><td>Default local object store. Parquet files per dataset, append-log batches, HNSW trial journals, promotion registries, <code>_playbook_memory/state.json</code> (now with retirement fields — Phase 25), catalog manifests. Plus four learning-loop directories: <code>_kb/</code> (signatures, outcomes, recommendations, error_corrections, config_snapshots, staffers), <code>_playbook_lessons/</code> (T3 cross-day lessons archived per run), <code>_observer/ops.jsonl</code> (append journal, durable scenario outcome stream), <code>_chunk_cache/</code> (spec'd for Phase 21 Rust port). Rebuildable from repo + this dir alone.</td></tr>
</tbody>
</table>
</div>
@ -197,7 +198,33 @@ table.plain tr:hover td{background:#0d1117}
<li>Search via <code>POST /vectors/profile/&lt;id&gt;/search</code> rejects out-of-scope queries with 403 + list of allowed bindings</li>
<li>Ollama swaps to the profile's model via <code>keep_alive=0</code>; only one model in VRAM at a time</li>
</ul>
<div class="ref"><strong>Code:</strong> crates/vectord/src/{hnsw.rs, autotune.rs, agent.rs, promotion.rs} · ADR-019</div>
<h3>Model matrix (Phase 20)</h3>
<p>Five tiers declared in <code>config/models.json</code>. Each call site picks the tier appropriate to its purpose — hot-path JSON emitters get fast local, overview/strategic/gatekeeper decisions get thinking models on cloud. Every tier carries <code>context_window</code>, <code>context_budget</code>, and <code>overflow_policy</code>.</p>
<table class="plain">
<thead><tr><th>Tier</th><th>Purpose</th><th>Primary model</th><th>Frequency</th></tr></thead>
<tbody>
<tr><td>T1 hot</td><td>Per tool call — SQL gen, hybrid_search, propose_done</td><td><code>qwen3.5:latest</code> local, <code>think:false</code></td><td>50-200/scenario</td></tr>
<tr><td>T2 review</td><td>Per-step consensus, drift flagging</td><td><code>qwen3:latest</code> local, <code>think:false</code></td><td>5-14/event</td></tr>
<tr><td>T3 overview</td><td>Mid-day checkpoints + cross-day lesson distill</td><td><code>gpt-oss:120b</code> Ollama Cloud, thinking on</td><td>1-3/scenario</td></tr>
<tr><td>T4 strategic</td><td>Pattern re-ranking, weekly gap audit</td><td><code>qwen3.5:397b</code> cloud</td><td>1-10/day</td></tr>
<tr><td>T5 gatekeeper</td><td>Schema migrations, autotune config changes</td><td><code>kimi-k2-thinking</code> cloud, audit-logged</td><td>1-5/day</td></tr>
</tbody>
</table>
<p><strong>Key mechanical finding (2026-04-21):</strong> qwen3.5 and qwen3 are <em>thinking</em> models — they burn ~650 tokens of hidden reasoning before emitting the visible response. For hot-path JSON emitters this meant 400-token budgets returned empty strings. Fix: <code>think: false</code> plumbed through sidecar's <code>/generate</code> endpoint; hot path disables thinking (structure matters more than reasoning depth), overseer tiers keep it on. Mistral was dropped entirely after a 0/14 fill rate on complex scenarios (decoder-level malformed-JSON bug, not a prompt issue).</p>
<p><strong>Continuation primitive (Phase 21):</strong> <code>generateContinuable()</code> handles output-overflow without <code>max_tokens</code> tourniquets — empty response → geometric backoff retry; truncated-JSON → continue with partial as scratchpad. <code>generateTreeSplit()</code> handles input-overflow via map-reduce with running scratchpad. Both respect <code>assertContextBudget()</code> so silent truncation can't happen.</p>
<h3>Per-staffer tool_level (Phase 23)</h3>
<p>Scenarios can be scoped to a specific coordinator (<code>staffer: {id, name, tenure_months, role, tool_level}</code>). <code>tool_level</code> controls which tiers are available:</p>
<ul>
<li><code>full</code> — T1/T2 local, T3 cloud, cloud rescue on failure</li>
<li><code>local</code> — T1/T2/T3 all local (gpt-oss:20b as overseer)</li>
<li><code>basic</code> — kimi-k2.5 cloud executor + local reviewer + local T3, no rescue</li>
<li><code>minimal</code> — kimi-k2.5 cloud executor, no T3, no rescue. Playbook inheritance is the only signal.</li>
</ul>
<p>Measured 2026-04-21 on a 36-run demo (4 staffers × 3 contracts × 3 rounds): James Park (mid, local tools) ranked first at 92.9% fill and 36.8 cites/run, Maria Chen (senior, full tools) second at 81.0%. Cloud T3 adds latency without measurable benefit on this workload. Alex Rivera (trainee, minimal) still hit 59.5% fill purely from playbook inheritance — proof the memory carries knowledge when the model is capable.</p>
<div class="ref"><strong>Code:</strong> crates/vectord/src/{hnsw.rs, autotune.rs, agent.rs, promotion.rs} · tests/multi-agent/{agent.ts, scenario.ts} · config/models.json · ADR-019</div>
</div>
<!-- ═══ 4. CONTRACT INFERENCE ═══ -->
@ -337,7 +364,7 @@ boost[(city, state, name)] = min(Σ per_worker, 0.25)</pre>
<li><strong>Ollama inference is serial.</strong> Embedding 1M rows at ~50 chunks/sec through nomic-embed-text = ~6 hours. Acceptable for overnight refresh, not for "immediate." Mitigated by incremental refresh (only deltas).</li>
<li><strong>RAM ceiling on HNSW.</strong> Around 5M vectors × 768d, HNSW stops fitting in 128GB comfortably. Mitigation: per-profile <code>vector_backend: lance</code> flip — disk-resident IVF_PQ scales past the RAM line (ADR-019).</li>
<li><strong>VRAM ceiling for model variety.</strong> A4000 16GB holds 1-2 loaded models. Multi-model recruiter surfaces are a sequential swap, not parallel (Ollama <code>keep_alive=0</code>). Phase 17 profile activation unloads the prior model on swap.</li>
<li><strong>playbook_memory growth.</strong> Currently unbounded. 391 entries today at this rate becomes ~5K in six months. Default k=200 still sub-ms at 5K. Compaction policy (TTL + decay + merge) deferred.</li>
<li><strong>playbook_memory growth.</strong> 1936 entries today. Phase 25 (2026-04-21) added retirement via <code>valid_until</code> + <code>schema_fingerprint</code> fields + <code>POST /vectors/playbook_memory/retire</code> endpoint (manual or schema-drift triggered). Active vs retired split surfaced on <code>GET /vectors/playbook_memory/status</code>. Brute-force cosine still sub-ms at current size; Letta-style working-memory hot cache deferred until entry count crosses ~100K.</li>
</ul>
</div>
@ -360,6 +387,10 @@ boost[(city, state, name)] = min(Σ per_worker, 0.25)</pre>
<tr><td>Ollama sidecar down</td><td>502 Bad Gateway from aibridge; <code>embed</code> calls fail fast</td><td>Restart: <code>systemctl restart lakehouse-sidecar</code>; vector search falls back to pre-computed embeddings</td></tr>
<tr><td>Gateway restart mid-operation</td><td>In-memory state (playbook_memory, HNSW) reloaded from persisted <code>state.json</code> / trial journals</td><td>Zero data loss; catalog, storage, journals are all source-of-truth</td></tr>
<tr><td>Schema fingerprint diverges across manifests</td><td><code>catalog::dedupe</code> reports <code>DedupeReport</code> with winner selection (non-null row_count first, then newest updated_at)</td><td><code>POST /catalog/dedupe</code> collapses duplicates idempotently</td></tr>
<tr><td>Scenario event fails on zero-supply city</td><td>Cloud rescue (Phase 22 item B) fires — <code>gpt-oss:120b</code> sees SQL filters attempted, row counts, reviewer drift notes, contract terms; returns structured <code>{retry, new_city, new_state, new_role, new_count, rationale}</code></td><td>Retry with pivot runs same executor loop with new geography; verified Gary IN → South Bend IN filled 1/1 after original drift-abort</td></tr>
<tr><td>LLM response truncated mid-JSON (thinking model ate token budget)</td><td>Phase 21 <code>generateContinuable()</code> detects via brace-balance + JSON.parse; no silent truncation</td><td>Auto-continue with partial as scratchpad, or geometric backoff if initial call returned empty. Bounded by <code>max_continuations</code>.</td></tr>
<tr><td>Schema migration invalidates existing playbooks</td><td>Phase 25 — <code>POST /vectors/playbook_memory/retire</code> with current fingerprint retires all mismatched entries in scope; diagnostic log shows counts</td><td>Retired entries stay in journal for forensics but are skipped by all boost calculations. Scoped by (city, state) so unrelated geos aren't touched.</td></tr>
<tr><td>Observer fails to reach scenario outcome stream</td><td>Scenario <code>postObserverEvent()</code> uses 2s <code>AbortSignal.timeout</code>; silent skip if :3800 is down</td><td>Scenario log is still the source of truth; observer re-ingest on next run restores the stream. <code>data/_observer/ops.jsonl</code> is append-only so prior events survive.</td></tr>
</tbody>
</table>
</div>
@ -384,6 +415,15 @@ boost[(city, state, name)] = min(Σ per_worker, 0.25)</pre>
<h3>Daily summary per staffer</h3>
<p>Workspace activity log + per-staffer filter on the event journal gives <strong>"what did Sarah do today"</strong> as a direct query. The foundation for shift-handoff reports.</p>
<h3>Staffer identity + competence-weighted retrieval (Phase 23)</h3>
<p>Each scenario run carries an explicit <code>staffer: {id, name, tenure_months, role, tool_level}</code>. The KB aggregates per-staffer stats (<code>data/_kb/staffers.jsonl</code>) that roll up into a single <code>competence_score</code>:</p>
<pre>competence_score = 0.45·fill_rate + 0.20·turn_efficiency + 0.20·citation_density + 0.15·rescue_rate</pre>
<p>When any query runs <code>kb.findNeighbors(spec, k)</code>, the ranking isn't just cosine similarity — it's <code>weighted_score = cosine × max_staffer_competence</code> over the best coordinator who ran that signature. Senior staffers' playbooks surface above juniors' on similar scenarios, even when the juniors' scenario was marginally closer in embedding space.</p>
<p>The <code>tool_level</code> knob (full / local / basic / minimal) controls which tiers are available to a given staffer's runs. See Ch 3 for the mapping. Variance is real and measurable: the 36-run demo produced a 33pt fill-rate delta between James (local tools, 93%) and Alex (minimal tools, 60%) on identical contracts.</p>
<h3>Auto-discovered reliable-performer labels</h3>
<p>A second-order output of the competence path: when multiple staffers independently endorse the same worker on similar-signature playbooks, that worker accumulates cross-staffer endorsements. <code>scripts/kb_staffer_report.py</code> surfaces them — after 36 runs, Rachel D. Lewis (Welder Nashville) had 18 endorsements across 4 staffers, Angela U. Ward (Machine Op Indianapolis) 19. These are high-confidence "reliable" labels the system produced without human tagging. The UI could badge these workers on future queries; today they're visible via <code>/memory/query</code>'s <code>discovered_patterns</code> bundle.</p>
</div>
<!-- ═══ 10. A DAY IN THE LIFE ═══ -->
@ -410,9 +450,11 @@ boost[(city, state, name)] = min(Σ per_worker, 0.25)</pre>
<div class="step"><div class="n">15:00</div><div class="body"><strong>New embeddings live.</strong> Hot-swap promotion. Searches now see all 1M new profiles. Sarah's noon query re-run would produce different top-5.</div></div>
<div class="step"><div class="n">17:00</div><div class="body"><strong>End-of-day retrospective.</strong> Any staffer who ran <code>tests/multi-agent/scenario.ts</code> gets <code>report.md</code> auto-generated. Workspace activity logs aggregate per staffer. <code>GET /vectors/playbook_memory/stats</code> shows the day's new entries.</div></div>
<div class="step"><div class="n">17:00</div><div class="body"><strong>End-of-day retrospective.</strong> Any staffer who ran <code>tests/multi-agent/scenario.ts</code> gets <code>report.md</code> auto-generated. Workspace activity logs aggregate per staffer. <code>GET /vectors/playbook_memory/status</code> shows active vs retired counts. KB indexes the run (<code>kb.indexRun</code>) and the overview model synthesizes a pathway recommendation for the next matching signature. Every event outcome has already streamed to <code>lakehouse-observer.service</code> on :3800 for ERROR_ANALYZER + PLAYBOOK_BUILDER consumption.</div></div>
<div class="step"><div class="n">22:00</div><div class="body"><strong>Overnight trial cycle.</strong> Autotune agent continues in the background. Trial journal grows. Tomorrow morning, the system is measurably better at something it got asked about today.</div></div>
<div class="step"><div class="n">17:15</div><div class="body"><strong>Kim fires a natural-language query from the search box.</strong> "need 3 forklift operators in Joliet by Monday" → <code>POST /memory/query</code> on the Bun MCP. Regex normalizer extracts role / city / count / deadline / intent in 0ms (no LLM call). The unified response returns playbook workers (auto-surfaced reliable performers for Joliet Forklift with citation counts), pathway recommendation from KB, prior T3 lessons for Joliet, and top staffers by competence — all in ~300ms.</div></div>
<div class="step"><div class="n">22:00</div><div class="body"><strong>Overnight trial cycle.</strong> Autotune agent continues in the background. Trial journal grows. KB's <code>detectErrorCorrections</code> scans today's outcomes for fail→succeed deltas on the same signature; any correction gets logged to <code>data/_kb/error_corrections.jsonl</code> with the config diff. Tomorrow morning, the system is measurably better at something it got asked about today.</div></div>
<h3>SMS + email drafts in the pipeline</h3>
<p>After each sealed fill (via scenario.ts or manual <code>/log</code> flow with downstream hooks), <code>generateArtifacts</code> in the scenario runner produces: (a) one SMS per worker (TO: Name, message under 180 chars), (b) one client confirmation email. Drafts are saved to <code>sms.md</code> and <code>emails.md</code> under the scenario output dir. Ollama drafts them; the staffer reviews and sends. No auto-send; human-in-the-loop.</p>
@ -426,11 +468,13 @@ boost[(city, state, name)] = min(Σ per_worker, 0.25)</pre>
<h4>Deferred — real architectural work, just not shipped yet</h4>
<ul>
<li><strong>Rate / margin awareness.</strong> Worker pay expectations vs contract bill rate not modeled. Requires adding <code>pay_rate</code> to workers, <code>bill_rate</code> to contracts, and a filter + warning path. Phase 20 item.</li>
<li><strong>Push / background presence.</strong> The app requires being opened. No Slack/email/SMS push when a contract lands with a pre-ranked candidate list. Would make the "system is already thinking" claim more visible to phone-first shops.</li>
<li><strong>Rate / margin awareness.</strong> Worker pay expectations vs contract bill rate not modeled. Requires adding <code>pay_rate</code> to workers, <code>bill_rate</code> to contracts, and a filter + warning path. Partially addressed via <code>ContractTerms.budget_per_hour_max</code> passed to T3/rescue prompts, but the match-time filter isn't wired yet.</li>
<li><strong>Mem0-style UPDATE / DELETE / NOOP operations on playbooks.</strong> Today <code>/seed</code> only ADDs. Same <code>(operation, date)</code> pair appends a duplicate instead of refining an existing entry. Phase 26 item — cheap to add, moderate payoff.</li>
<li><strong>Letta working-memory hot cache.</strong> Every boost query scans all active playbook entries from in-memory state. 1.9K today; cheap. Will bite somewhere north of 100K. LRU for the last-N playbooks or current-sig neighborhood deferred until that ceiling approaches.</li>
<li><strong>Chunking cache (Phase 21 Rust port).</strong> TS primitives <code>generateContinuable</code> + <code>generateTreeSplit</code> are wired, but <code>crates/aibridge/src/{continuation.rs, tree_split.rs}</code> + <code>crates/storaged/src/chunk_cache.rs</code> remain queued. Gateway-side callers currently don't have the same protection against silent truncation that the TS test harness does.</li>
<li><strong>Confidence calibration.</strong> Top-K is a rank, not a probability. No calibrated "85% likely to accept" score. Requires outcome-labeled training data.</li>
<li><strong>Neural re-ranker.</strong> Phase 19 is statistical + semantic. A (query, candidate, outcome)-trained re-ranker is deferred to Phase 20+ per ADR, only if the statistical floor plateaus below usable recall.</li>
<li><strong>playbook_memory compaction.</strong> No TTL or merge policy. Entries accumulate. At expected rate this hits 10K in a year — still tractable but warrants a policy.</li>
<li><strong>Neural re-ranker.</strong> Phase 19 is statistical + semantic (now with geo + role prefilter, Phase 25 retirement). A (query, candidate, outcome)-trained re-ranker is deferred only if the statistical floor plateaus below usable recall — current 14× citation lift on identical inputs suggests it hasn't.</li>
<li><strong>Observer → autotune feedback wire.</strong> Phase 24 streams scenario outcomes into <code>data/_observer/ops.jsonl</code>; autotune agent still runs on its own HNSW-trial schedule and hasn't subscribed to the outcome metric stream yet. Phase 26+ item — connects the last loop.</li>
<li><strong>call_log cross-reference.</strong> Infrastructure present; current synthetic candidates table is too small to cross-ref. Fixes when real ATS lands.</li>
</ul>
@ -452,6 +496,6 @@ boost[(city, state, name)] = min(Σ per_worker, 0.25)</pre>
</div>
</div>
<div class="footer">Lakehouse spec · v1 2026-04-20 · maintained from <code>docs/DECISIONS.md</code> · <a href="proof">architecture live-tested</a> · <a href="console">walkthrough</a></div>
<div class="footer">Lakehouse spec · v2 2026-04-21 · Phases 19-25 shipped (playbook boost, model matrix, continuation, KB, staffer competence, observer ingest, validity windows) · maintained from <code>docs/DECISIONS.md</code> · <a href="proof">architecture live-tested</a> · <a href="console">walkthrough</a></div>
</body></html>