265 Commits

Author SHA1 Message Date
profit
6316433062 Phase 40 scope: Langfuse + Gitea MCP recovery as named deliverables
J flagged that a prior version of this stack had Langfuse traces
piping into the observer + Gitea MCP for repo ops — lost. Adding
these as explicit Phase 40 deliverables alongside routing engine
+ Gemini/Claude adapters.

Findings during scope-check:
- Langfuse container is already running (Up 2 days, langfuse:2,
  localhost:3001 healthcheck passes)
- mcp-server/tracing.ts + package.json already have SDK wired
- Credentials pk-lf-staffing / sk-lf-staffing-secret (from env)
- Gitea MCP binary still installed at gitea-mcp@0.0.10

So recovery here is mostly re-connecting existing infra:
1. Add Rust-side Langfuse client for /v1/chat tracing (gateway
   currently bypasses tracing, mcp-server already has it)
2. Wire Langfuse → observer :3800 pipe
3. Register Gitea MCP in mcp-server/index.ts tool list

Each landing as part of Phase 40 when the routing engine ships.
2026-04-22 03:01:28 -05:00
profit
42a11d35cd Phase 39 (first slice): Ollama Cloud adapter on /v1/chat
Second provider wired. /v1/chat now routes by optional `provider`
field: default "ollama" hits local via sidecar, "ollama_cloud"
(or "cloud") hits ollama.com/api/generate directly with Bearer auth.
Key sourced at gateway startup from OLLAMA_CLOUD_KEY env, then
/root/llm_team_config.json (providers.ollama_cloud.api_key), then
OLLAMA_CLOUD_API_KEY env. Config source matches LLM Team convention.

Shape-identical to scenario.ts::generateCloud — same endpoint, same
body, same Bearer auth. Cloud path bypasses sidecar entirely (sidecar
is local-only by design, mirrors TS agent.ts).

Changes:
- crates/gateway/src/v1/ollama_cloud.rs (new, 130 LOC) — reqwest
  client, resolve_cloud_key(), chat() adapter, CloudGenerateBody /
  CloudGenerateResponse wire shapes
- crates/gateway/src/v1/ollama.rs — flatten_messages_public()
  re-export so sibling adapters reuse the shape collapse
- crates/gateway/src/v1/mod.rs — provider field on ChatRequest,
  dispatch match in chat() handler, ollama_cloud_key on V1State
- crates/gateway/src/main.rs — resolves cloud key at startup,
  logs which source provided it
- crates/gateway/Cargo.toml — reqwest 0.12 with rustls-tls

Verified end-to-end after restart:
- provider=ollama → qwen3.5:latest local (~400ms, Phase 38 unchanged)
- provider=ollama_cloud + model=gpt-oss:120b → real 225-word
  technical response in 5.4s, 313 tokens

Tests: 9/9 green (7 from Phase 38 + 2 new for cloud body serialization
and key resolver shape).

Not in this slice: trait extraction (full Phase 39 scope adds
ProviderAdapter trait + OpenRouter adapter + fallback chain logic).
These land next with Phase 40 routing engine on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 02:57:42 -05:00
profit
8cbbd0ef70 Phase 38 fix: default think=false on /v1/chat
Live-test caught the Phase 21 thinking-model trap on first call.
qwen3.5 with max_tokens=50 and default think behavior burned all 50
tokens on hidden reasoning; visible content was "". completion_tokens
exactly matching max_tokens was the tell.

Adapter now defaults think: Some(false) matching scenario.ts hot-path
discipline. Callers that want reasoning (overseers, T3+) opt in via
a non-OpenAI `think: true` extension field on the request.

Verified end-to-end after restart:
- "Lakehouse supports ACID and raw data." (5 words, 516ms)
- "tokio\nasync-std\nsmol" (3 Rust crates, 391ms)
- /v1/usage accumulates across calls (2 req / 95 total tokens)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 02:50:09 -05:00
profit
4cb405bb42 Phase 38: Universal API skeleton — /v1/chat, /v1/usage, /v1/sessions
First slice of the control-plane pivot. OpenAI-compatible surface
over the existing aibridge → Ollama path. Additive — no existing
routes touched. All 7 unit tests green, release build clean.

What ships:
- crates/gateway/src/v1/mod.rs — router, V1State (ai_client + Usage
  counter), ChatRequest/ChatResponse/Message/UsageBlock types, handlers
  for /chat, /usage, /sessions. OpenAI-compatible field shapes:
  {model, messages[{role,content}], temperature?, max_tokens?, stream?}
- crates/gateway/src/v1/ollama.rs — shape adapter. Flattens messages
  into (system, prompt), calls aibridge.generate, unwraps response
  back into OpenAI /v1/chat shape. Prefers sidecar-reported tokens;
  falls back to chars/4 ceiling estimate matching Phase 21 convention.
- crates/gateway/src/main.rs — one new mod, one .nest("/v1", ...)

Tests (7/7):
- chat_request_parses_openai_shape
- chat_request_accepts_minimal
- usage_counter_default_is_zero
- flatten_separates_system_from_turns
- flatten_concatenates_multiple_system_messages
- flatten_with_no_system_returns_empty_system
- estimate_tokens_chars_div_4_ceiling

Not in this phase (per CONTROL_PLANE_PRD.md): streaming, tool calls,
session state, multi-provider, fallback chain, cost gating. All
land in Phases 39-44.

Next: live-test POST /v1/chat after gateway restart, then migrate
bot/propose.ts off direct sidecar calls to prove the loop end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 02:47:15 -05:00
profit
f44b6b3e6b Control-plane pivot: Phase 38-44 plan + bot scaffold
Direction shift 2026-04-22: docs/CONTROL_PLANE_PRD.md becomes the
long-horizon architecture target. Existing Lakehouse (docs/PRD.md,
Phases 0-37) is preserved as the reference implementation and first
consumer. New 6-layer architecture:

  L1 Universal API /v1/chat /v1/usage /v1/sessions /v1/tools /v1/context
  L2 Routing & Policy Engine (rules, fallback chains, cost gating)
  L3 Provider Adapter Layer (Ollama + OpenRouter + Gemini + Claude)
  L4 Knowledge + Memory + Playbooks (already built)
  L5 Execution Loop (scenarios + bot/cycle.ts instances)
  L6 Observability + token accounting

Phases 38-44 sequenced with detailed per-phase specs in the PRD.
Current scope: staffing domain (synthetic workers_500k, contracts,
emails, SMS, playbooks). DevOps (Terraform/Ansible) is long-horizon
target — architecture-compatible but not current.

Files added:
- docs/CONTROL_PLANE_PRD.md — 6-layer architecture, Phase 38-44
  sequencing with staffing-first Truth Layer + Validation pipeline
- bot/ — manual-only PR bot scaffold. First consumer test-bed for
  /v1/chat (Phase 38). Mem0-aligned ADD/UPDATE/NOOP apply semantics;
  KB feedback loop reads prior cycles on same gap and injects into
  cloud prompt so bot cycles compound like scenario.ts runs do.
- tests/multi-agent/run_stress.ts — the 6-task diverse stress test
  referenced in the previous commit but missing from its staging

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 02:43:31 -05:00
profit
5b1fcf6d27 Phase 28-36 body of work
Accumulated since a6f12e2 (Phase 21 Rust port + Phase 27 versioning):

- Phase 36: embed_semaphore on VectorState (permits=1) serializes
  seed embed calls — prevents sidecar socket collisions under
  concurrent /seed stress load
- Phase 31+: run_stress.ts 6-task diverse stress scaffolding;
  run_e2e_rated.ts + orchestrator.ts tightening
- Catalog dedupe cleanup: 16 duplicate manifests removed; canonical
  candidates.parquet (10.5MB -> 76KB) + placements.parquet (1.2MB ->
  11KB) regenerated post-dedupe; fresh manifests for active datasets
- vectord: harness EvalSet refinements (+181), agent portfolio
  rotation + ingest triggers (+158), autotune + rag adjustments
- catalogd/storaged/ingestd/mcp-server: misc tightening
- docs: Phase 28-36 PRD entries + DECISIONS ADR additions;
  control-plane pivot banner added to top of docs/PRD.md (pointing
  at docs/CONTROL_PLANE_PRD.md which lands in next commit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 02:41:15 -05:00
profit
a6f12e2609 Phase 21 Rust port + Phase 27 playbook versioning + doc-sync
Phase 21 — Rust port of scratchpad + tree-split primitives (companion to
the 2026-04-21 TS shipment). New crates/aibridge modules:

  context.rs       — estimate_tokens (chars/4 ceil), context_window_for,
                     assert_context_budget returning a BudgetCheck with
                     numeric diagnostics on both success and overflow.
                     Windows table mirrors config/models.json.
  continuation.rs  — generate_continuable<G: TextGenerator>. Handles the
                     two failure modes: empty-response from thinking
                     models (geometric 2x budget backoff up to budget_cap)
                     and truncated-non-empty (continuation with partial
                     as scratchpad). is_structurally_complete balances
                     braces then JSON.parse-checks. Guards the degen case
                     "all retries empty, don't loop on empty partial".
  tree_split.rs    — generate_tree_split map->reduce with running
                     scratchpad. Per-shard + reduce-prompt go through
                     assert_context_budget first; loud-fails rather than
                     silently truncating. Oldest-digest-first scratchpad
                     truncation at scratchpad_budget (default 6000 t).

TextGenerator trait (native async-fn-in-trait, edition 2024). AiClient
implements it; ScriptedGenerator test double lets tests inject canned
sequences without a live Ollama.

GenerateRequest gained think: Option<bool> — forwards to sidecar for
per-call hidden-reasoning opt-out on hot-path JSON emitters. Three
existing callsites updated (rag.rs x2, service.rs hybrid answer).

Phase 27 — Playbook versioning. PlaybookEntry gained four optional
fields (all #[serde(default)] so pre-Phase-27 state loads as roots):

  version           u32, default 1
  parent_id         Option<String>, previous version's playbook_id
  superseded_at     Option<String>, set when newer version replaces
  superseded_by     Option<String>, the playbook_id that replaced

New methods:

  revise_entry(parent_id, new_entry) — appends new version, stamps
    superseded_at+superseded_by on parent, inherits parent_id and sets
    version = parent + 1 on the new entry. Rejects revising a retired
    or already-superseded parent (tip-of-chain is the only valid
    revise target).
  history(playbook_id) — returns full chain root->tip from any node.
    Walks parent_id back to root, then superseded_by forward to tip.
    Cycle-safe.

Superseded entries excluded from boost (same rule as retired): filter
in compute_boost_for_filtered_with_role (both active-entries prefilter
and geo-filtered path), rebuild_geo_index, and upsert_entry's existing-
idx search. status_counts returns (total, retired, superseded, failures);
/status JSON reports active = total - retired - superseded.

Endpoints:
  POST /vectors/playbook_memory/revise
  GET  /vectors/playbook_memory/history/{id}

Doc-sync — PHASES.md + PRD.md drifted from git after Phases 24-26
shipped. Fixes applied:

  - Phase 24 marked shipped (commit b95dd86) with detail of observer
    HTTP ingest + scenario outcome streaming. PRD "NOT YET WIRED"
    rewritten to reflect shipped state.
  - Phase 25 (validity windows, commit e0a843d) added to PHASES +
    PRD.
  - Phase 26 (Mem0 upsert + Letta hot cache, commit 640db8c) added.
  - Phase 27 entry added to both docs.
  - Phase 19.6 time decay corrected: was documented as "deferred",
    actually wired via BOOST_HALF_LIFE_DAYS = 30.0 in playbook_memory.rs.
  - Phase E/Phase 8 tombstone-at-compaction limit note updated —
    Phase E.2 closed it.

Tests: 8 new version_tests in vectord (chain-metadata stamping,
retired/superseded parent rejection, boost exclusion, history from
root/tip/middle, legacy default round-trip, status counts). 25 new
aibridge tests (context/continuation/tree_split). Workspace total
145 green (was 120).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 17:40:49 -05:00
root
640db8c63c Phase 26 — Mem0 upsert + Letta geo hot cache
Closes the two remaining 2026-era memory findings. Both are
optimizations per J's framing — not load-bearing, but good data
hygiene + future-proofing at scale.

MEM0 UPSERT (data hygiene):
Before: /seed always appended. A scenario re-running the same
operation on the same day wrote duplicate entries, inflating the
playbook corpus with near-identical rows.

Now: upsert_entry(new) inspects existing non-retired entries and
decides ADD / UPDATE / NOOP:
  ADD     → no matching (operation, day, city, state) tuple, append
  UPDATE  → match exists with different names → merge (union, stable
            order), refresh timestamp, keep original playbook_id so
            citations stay valid
  NOOP    → match exists with identical names → skip, return id

Day-granularity keying on timestamp YYYY-MM-DD means intraday
re-seeds dedup but tomorrow's same-operation is a fresh ADD. Retired
entries don't block new seeds — they're out of scope anyway.

Seed endpoint returns {outcome: {mode, playbook_id, merged_names?},
entries_after}. Append=false retains old replace-all semantics.

5 unit tests pass: first_seed_is_add, identical_reseed_is_noop,
same_day_different_names_updates_and_merges, different_day_same_op_is_add,
retired_entry_doesnt_block_new_seed.

Live verified: three successive seeds with (Alice), (Alice),
(Alice, Bob) left entry count unchanged at 1936 with merged names
{Alejandro, Lauren, Alice, Bob}. Previously would have been 3
appends.

LETTA GEO HOT CACHE (scale primitive):
Added geo_index: HashMap<(city_lower, state_upper), Vec<usize>>
alongside PlaybookMemoryState. Rebuilt on every mutation: set_entries,
retire_one, retire_on_schema_drift, upsert_entry, load_from_storage.

compute_boost_for_filtered_with_role now uses the index for O(1) geo
lookup instead of scanning all entries. At current scale (1.9K) the
scan was sub-ms; at 100K+ the scan becomes the dominant cost. The
hot cache future-proofs without adding an LRU abstraction.

Retired entries excluded from index; valid_until still checked on the
hot path since it can elapse between rebuilds.

Owns cloned PlaybookEntries in the geo_filtered vector so the state
read-lock is released before cosine scoring — avoids lock contention
on the scoring path.

Memory-findings progress: 5 of 5 shipped.
  ✓ Multi-strategy parallel retrieval (Phase 19 refinement)
  ✓ Input normalization + unified /memory/query (Phase 24 TS)
  ✓ Zep validity windows (Phase 25)
  ✓ Mem0 UPSERT (Phase 26 today)
  ✓ Letta geo hot cache (Phase 26 today)

All 18 playbook_memory tests pass.
v0.26-memory-complete
2026-04-21 00:24:05 -05:00
root
138592dc56 Spec v2 — all chapters aligned with Phases 19-25
Full audit pass on devop.live/lakehouse/spec. Five chapters were
stale, one had an outright incorrect line. Scope was bigger than
ch6 alone — J asked "you want to update all" and the honest answer
was yes.

Ch 1 (Repository layout):
- mcp-server row gains /memory/query, /models/matrix, /system/summary,
  observer.ts with :3800 listener
- tests/multi-agent/ row lists all new files: kb.ts, normalize.ts,
  memory_query.ts, gen_scenarios.ts, gen_staffer_demo.ts, and the
  colocated unit tests (kb.test.ts, normalize.test.ts)
- NEW config/ row documents models.json as the 5-tier matrix
- data/ row enumerates the four learning-loop directories:
  _kb/, _playbook_lessons/, _observer/, _chunk_cache/

Ch 3 (Measurement & indexing):
- NEW "Model matrix (Phase 20)" subsection — 5-tier table (T1 hot /
  T2 review / T3 overview / T4 strategic / T5 gatekeeper), per-tier
  primary model, frequency, the think:false mechanical finding
  called out with the 650-token reasoning-budget example
- NEW "Continuation primitive (Phase 21)" paragraph
- NEW "Per-staffer tool_level (Phase 23)" section with full/local/
  basic/minimal mapping and the 46pt fill-rate delta from the 36-run
  demo

Ch 7 (Scale story):
- FIX: playbook_memory growth bullet was claiming "No TTL or merge
  policy" — Phase 25 added retirement via valid_until +
  schema_fingerprint + /retire endpoint. Rewritten to name current
  state (1936 entries, active vs retired split exposed).

Ch 8 (Error surfaces):
- Five new rows added to the failure-mode table:
  * Zero-supply city → cloud rescue (Phase 22 item B) with the
    Gary IN → South Bend IN concrete example
  * LLM truncation → generateContinuable (Phase 21)
  * Schema migration → /vectors/playbook_memory/retire (Phase 25)
  * Observer unreachable → scenario silent-skip + append journal
    survivability

Ch 9 (Per-staffer context):
- NEW "Staffer identity + competence-weighted retrieval (Phase 23)"
  section with the competence_score formula and findNeighbors
  weighted_score
- NEW "Auto-discovered reliable-performer labels" section naming
  Rachel D. Lewis (18 endorsements) and Angela U. Ward (19) as
  concrete output of 36-run demo

Ch 10 (A day in the life):
- Added 17:15 timeline entry — Kim using /memory/query with natural
  language, regex normalizer extracting role/city/count in 0ms
- 17:00 entry updated to mention KB indexing + pathway recommendation
  + observer stream
- 22:00 entry updated to mention detectErrorCorrections nightly scan

Ch 11 (Known limits & non-goals):
- FIX: "playbook_memory compaction" bullet rewritten since retirement
  is now wired; reframed as the honest Mem0 UPDATE/NOOP gap
- Added Letta hot cache deferred item with honest "cheap at 1.9K,
  will bite at 100K" framing
- Added Chunking cache (Phase 21 Rust port) deferred item
- Added Observer → autotune feedback wire deferred item (Phase 26+)

Footer bumped v1 2026-04-20 → v2 2026-04-21 with Phase list.

Verified all updates live on devop.live/lakehouse/spec.
2026-04-21 00:16:41 -05:00
root
e0a843d1a5 Phase 25 — validity windows + playbook retirement
Addresses the load-bearing memory gap J flagged: playbook entries
had timestamps but no retirement semantic. When a schema migration
changed a column or a seasonal contract ended, stale playbooks kept
boosting candidates silently. Zep 2026-era finding — temporal
validity is the single highest-value memory-hygiene primitive.

SCHEMA (PlaybookEntry gains four optional fields, serde default):
  schema_fingerprint  — SHA-256 over dataset (column, type) tuples at
                        seed time. Missing = legacy entry, never
                        auto-retired on drift.
  valid_until         — RFC3339 hard expiry. compute_boost skips
                        entries past this moment.
  retired_at          — Set by retire_one or retire_on_schema_drift.
                        Retired entries excluded from all boost
                        calculations but kept in journal.
  retirement_reason   — Human-readable: "schema_drift: ...",
                        "expired: ...", "manual: ..."

RETRIEVAL PATH (compute_boost_for_filtered_with_role):
  Before geo+cosine, active_entries filter removes anything retired
  OR past valid_until. Uses chrono::Utc::now() once per call, no per-
  entry clock queries.

NEW METHODS on PlaybookMemory:
  retire_one(playbook_id, reason)
  retire_on_schema_drift(city, state, current_fp, reason) — idempotent,
    scopes by (city, state) so a Nashville migration doesn't touch
    Chicago. Skips legacy entries with no fingerprint.
  status_counts() -> (total, retired, failures)

HTTP ENDPOINTS:
  POST /vectors/playbook_memory/retire
    {playbook_id, reason}                        → retire by id
    {city, state, current_schema_fingerprint, reason} → schema drift
  GET  /vectors/playbook_memory/status
    {total, active, retired, failures}

SEED REQUEST extended with optional schema_fingerprint + valid_until
so the orchestrator (scenario.ts) can pass the current schema hash
when seeding, without a round trip through catalogd.

UNIT TESTS (5/5 pass): retire_one_marks_entry_and_persists,
retired_entries_do_not_boost, expired_valid_until_is_skipped,
schema_drift_retires_mismatched_fingerprints_only,
schema_drift_skips_other_cities.

LIVE VERIFIED: /status on current state = 1936 entries, 43 failures.
POST /retire with a sample playbook_id → "retired":1, /status now
reports active=1935, retired=1.

Memory-findings progress: 3 of 5 shipped.
  ✓ Multi-strategy parallel retrieval (Phase 19 refinement)
  ✓ Input normalization + unified /memory/query (Phase 24 TS)
  ✓ Zep-style validity windows (Phase 25, tonight)
   Mem0 UPDATE / DELETE / NOOP ops (dedup same-(op,date) seeds)
   Letta working-memory hot cache (not biting at 1.5K entries)
2026-04-21 00:11:02 -05:00
root
3fb3a60da4 Spec ch6 rewrite — 3 learning paths → 7 + honest gap list
J flagged the spec out of alignment with what's built. Ch6 now
reflects the full current architecture:

- Path 1 (playbook boost) — formula kept; geo+role prefilter
  refinement called out with measured 14× citation lift
- Path 2 (pattern discovery) — unchanged
- Path 3 (autotune agent) — unchanged
- Path 4 (KB + pathway recommender) — Phase 22, file layout
  documented
- Path 5 (cloud rescue on failure) — Phase 22 item B, verified
  stress_01 example cited
- Path 6 (staffer competence-weighted retrieval) — Phase 23,
  competence_score formula included, cross-staffer auto-
  discovered worker labels (Rachel D. Lewis 18× endorsements)
- Path 7 (observer outcome ingest) — Phase 24, :3800 HTTP
  listener + ops.jsonl append journal

Input normalizer + unified /memory/query surface documented as
the "seamless with whatever input" answer, with the 319ms
natural-language latency number.

Honest gaps kept visible in the spec itself, not hidden:
- Zep validity windows (most load-bearing remaining)
- Mem0 UPDATE/DELETE/NOOP ops
- Letta working-memory hot cache

Live at https://devop.live/lakehouse/spec#ch6 after service
restart. Verified post-deploy: geo+role prefilter, 14× delta,
validity windows gap all present in served HTML.
2026-04-21 00:03:06 -05:00
root
52561d10d3 Input normalizer + unified memory query — "seamless with whatever input"
J asked directly: "did we implement our memory findings so that our
knowledge base and our configuration playbook [work] seamlessly with
whatever input they're given?" Honest answer tonight was "one of five
findings shipped, normalizer is the blocker." This closes that gap.

NORMALIZER (tests/multi-agent/normalize.ts):
Accepts structured JSON, natural language, or mixed. Returns canonical
NormalizedInput { role, city, state, count, client, deadline, intent,
confidence, extraction_method, missing_fields } for any downstream
consumer.

Three-tier path:
  1. Structured fast-path — already-shaped input skips LLM
  2. Regex path — "need 3 welders in Nashville, TN" parses without LLM.
     City/state parser tightened to 1-3 capitalized words + "in {city}"
     anchor preference + case-exact full-state-name variants to prevent
     "Forklift Operators in Chicago" being captured as the city name
  3. LLM fallback — qwen3 local with think:false + 400 max_tokens for
     inputs the regex can't handle

Unit tests (tests/multi-agent/normalize.test.ts): 9/9 pass. Covers
structured fast-path, misplacement→rescue intent, state-name→abbrev
conversion, regex extraction from natural language, plural role +
full state name edge case, rescue intent keyword precedence, partial
input reporting missing fields, empty object fallthrough, async/sync
parity on clean inputs.

UNIFIED MEMORY QUERY (tests/multi-agent/memory_query.ts):
One function, five parallel fan-outs, one bundle returned:
  - playbook_workers — hybrid_search via gateway with use_playbook_memory
  - pathway_recommendation — KB recommender for this sig
  - neighbor_signatures — K-NN sigs weighted by staffer competence
  - prior_lessons — T3 overseer lessons filtered by city/state
  - top_staffers — competence-sorted leaderboard
  - discovered_patterns — top workers endorsed across past playbooks
    for this (role, city, state)
  - latency_ms — per-source + total
Every branch is best-effort: one source down doesn't break the bundle.

HTTP ENDPOINT (mcp-server/index.ts):
  POST /memory/query with body {input: <anything>} → MemoryQueryResult
Returns the same shape the TS function does. Typed with types.ts for
future UI consumption.

VERIFIED:
  curl POST /memory/query with structured {role,city,state,count}
    → extraction_method=structured, 10 playbook workers, top score 0.878
  curl POST /memory/query with "I need 3 welders in Nashville, TN"
    → extraction_method=regex (no LLM call), 319ms total, 8 endorsements
      for Lauren Gomez auto-discovered as top Nashville Welder

Honest remaining gaps (documented for next phase):
  - Mem0 ADD/UPDATE/DELETE/NOOP — we still only ADD + mark_failed
  - Zep validity windows — playbook entries have timestamps but no
    retirement semantic
  - Letta working-memory / hot cache — every query scans all 1560
    playbook entries
  - Memory profiles / scoped queries — global pool, no per-staffer
    private subsets

2 of 5 findings now shipped (multi-strategy retrieval in Rust, input
normalization + unified query in TS). The remaining 3 are architectural
additions queued as Phase 25 items — validity windows first since it's
the most load-bearing for long-running systems.
2026-04-20 23:59:05 -05:00
root
b95dd86556 Phase 24 — observer HTTP ingest + scenario outcome streaming
Closes the gap J flagged: observer wraps MCP:3700, scenarios hit
gateway:3100 directly, observer idle at 0 ops across 3600+ cycles.
Now scenarios POST per-event outcomes to observer's new HTTP ingest
on :3800, observer consumes them alongside MCP-wrapped ops, ERROR_
ANALYZER and PLAYBOOK_BUILDER loops see the full picture.

observer.ts:
- Bun.serve() HTTP listener on OBSERVER_PORT (default 3800):
  GET /health    — basic + ring depth
  GET /stats     — total / success / failure / by_source / recent
                   scenario ops digest
  POST /event    — accept scenario outcome, shape it into ObservedOp
                   with source="scenario" + staffer_id + sig_hash +
                   event_kind + role/city/state + rescue flags
- recordExternalOp() — shared ring-buffer insert so the main analyzer
  + playbook builder don't care where the op came from
- ObservedOp extended with provenance fields

persistOp() FIX — old path POSTed to /ingest/file?name=observed_operations
which REPLACES the dataset (flagged in feedback_ingest_replace_semantics.md).
Every op was silently wiping all prior ops. Replaced with append to
data/_observer/ops.jsonl so the historical trace is durable across
analyzer cycles and process restarts.

scenario.ts:
- OBSERVER_URL env (default http://localhost:3800)
- postObserverEvent() helper with 2s AbortSignal.timeout so observer
  being down doesn't block scenario flow
- Per-event POST after ctx.results.push(result), carrying staffer_id,
  sig_hash (via imported computeSignature), event_kind + role + city
  + state + count + rescue_attempted / rescue_succeeded + truncated
  output_summary

VERIFIED:
  curl POST /event → {"accepted":true,"ring_size":1}
  curl GET /stats → {"total":1,"successes":1,"by_source":{"scenario":1},
    "recent_scenario_ops":[{...staffer_id,kind,role}]}

Final v3 demo leaderboard (9 runs per staffer, cumulative 3 batches):
  James (local):   92.9% fill, 36.8 cites, score 0.775 — RANK 1
  Maria (full):    81.0% fill, 26.2 cites, score 0.727
  Sam (basic):     61.9% fill, 28.2 cites, score 0.640
  Alex (minimal):  59.5% fill, 32.2 cites, score 0.631
Honest finding: Alex has MORE citations than Sam despite NO T3 and NO
rescue. Playbook inheritance alone is firing hardest when overseer is
absent. The 59.5% fill rate (up from 0% when qwen2.5 was executor)
proves cloud-exec + playbook inheritance is the floor the architecture
delivers.

Local gpt-oss:20b T3 outperforms cloud gpt-oss:120b T3 by 12pt fill
rate on this workload — cloud overseer paying latency+variance for
no measurable gain, worth flagging in next models.json tune.
2026-04-20 23:49:30 -05:00
root
137aed64fb Coherence pass — PRD/PHASES updates, config snapshot wired, unit tests
J flagged the audit: "make sure everything flows coherently, no
pseudocode or unnecessary patches or ignoring any particular part of
what we built." This is that pass.

PRD.md updates:
- Phase 19 refinement block — geo-filter + role-prefilter WIRED with
  citation density numbers (0.32 → 1.38, and 2 → 28 on same scenario).
- Phase 20 rewrite — mistral dropped, qwen3.5 + qwen3 local hot path,
  think:false as the key mechanical finding, kimi-k2.6 upgrade path.
- Phase 21 status block — think plumbing + cloud executor routing
  added after original commit.
- Phase 22 item B (cloud rescue) — pivot sanitizer, rescue verified
  1/3 on stress_01.
- Phase 23 NEW — staffer identity + tool_level + competence-weighted
  retrieval + kb_staffer_report. Auto-discovered worker labels called
  out with real numbers (Rachel Lewis 12× across 4 staffers).
- Phase 24 NEW — Observer/Autotune integration gap DOCUMENTED, not
  fixed. Observer has been idle at 0 ops for 3600+ cycles because
  scenarios hit gateway:3100 directly, bypassing MCP:3700 which the
  observer wraps. This is the honest "we're not using it in these
  tests" signal J surfaced. Fix deferred; gap visible now.

PHASES.md:
- Appended Phases 20-23 as checked, Phase 24 as unchecked gap.
- Updated footer count: 102 unit tests across all layers.
- Latest line updated with 14× citation lift + 46.4pt tool-asymmetry
  finding.

scenario.ts:
- snapshotConfig() was defined but never called. Now fires at every
  scenario start with a stable sha256 hash over the active model set +
  tool_level + cloud flags. config_snapshots.jsonl finally populates,
  which the error_corrections diff path needs to work correctly.

kb.test.ts (new): 4 signature invariant tests — stability across
unrelated fields (date, contract, staffer), sensitivity to role/city/
count changes, digest shape. All pass under `bun test`.

service.rs: 6 Rust extractor tests for extract_target_geo +
extract_target_role — basic, missing-state-returns-none, word
boundary (civilian != city), multi-word role, absent role, quoted
value parse. All pass under `cargo test -p vectord --lib extractor_tests`.

Dangling items now honestly documented rather than silently pending:
- Chunking cache (config/models.json SPEC, not wired) — flagged
- Playbook versioning (SPEC, not wired) — flagged
- Observer integration (WIRED but disconnected) — new Phase 24
2026-04-20 23:29:13 -05:00
root
ad0edbe29c Cloud kimi-k2.5 executor for weak tiers + multi-strategy playbook retrieval
Two coupled changes from the 2026 agent-memory research + tool
asymmetry findings.

SCENARIO (weak-tier cloud substitute):
qwen2.5 collapsed to 0/14 across the basic/minimal tool_levels.
Replace with cloud kimi-k2.5 on Ollama Cloud — same family as k2.6
(pro-tier locked today, on J's upgrade path). Plumb cloud flag
through ACTIVE_EXECUTOR_CLOUD / ACTIVE_REVIEWER_CLOUD into
generateContinuable so executor/reviewer can route to cloud when
tool_level requires. think:false supported by Kimi family.

Tool level mapping (revised):
  full     — qwen3.5 local + qwen3 local + cloud gpt-oss:120b T3 + rescue
  local    — qwen3.5 local + qwen3 local + local gpt-oss:20b T3 + rescue
  basic    — kimi-k2.5 cloud + qwen3 local + local T3, no rescue
  minimal  — kimi-k2.5 cloud + qwen3 local, no T3, no rescue.
             Playbook inheritance alone on the decision path.

This is the honest version of J's "minimal tools still works via
inheritance" hypothesis — with the executor no longer broken at the
tokenizer level, we can actually measure whether playbook retrieval
substitutes for missing overseers.

PLAYBOOK_MEMORY (multi-strategy retrieval):
Zep / Mem0 research shows multi-strategy rerank (semantic + keyword +
graph + temporal) outperforms single-strategy cosine. Lakehouse now
has a two-tier:

  1. Exact (role, city, state) match: skip cosine, assign similarity=1.0,
     take up to top_k/2+1 slots. These are identity-class neighbors —
     the strongest possible signal.
  2. Cosine fallback within the same (city, state) but different role:
     fills remaining slots.

Exposed as compute_boost_for_filtered_with_role(target_geo, target_role).
Backwards-compatible: compute_boost_for_filtered forwards with role=None
so existing callers keep their current behavior.

Service.rs wires both: extract_target_geo and extract_target_role pull
from the executor's SQL filter. grab_eq_value is factored out of
extract_target_geo so both lookups share one parser. Diagnostic log
now prints target_role alongside target_geo for every hybrid_search:

  playbook_boost: boosts=88 sources=39 parsed=39 matched=5
    target_geo=Some(("Nashville", "TN")) target_role=Some("Welder")

Verified: Nashville Welder query returns 5/10 boosted workers in
top_k with clean role+geo provenance.

Research sources: atlan.com Agent Memory Frameworks 2026, Mem0 paper
(arxiv 2504.19413), Zep/Graphiti LongMemEval comparison, ossinsight
Agent Memory Race 2026.

kimi-k2.6 on current key returns 403 — pro-tier upgrade required.
kimi-k2.5 is the substitute today; swap to k2.6 by renaming one line
in applyToolLevel once the subscription lands.
2026-04-20 23:20:07 -05:00
root
5e89407939 Phase 23 refinement — per-staffer tool_level variance
Staffer.tool_level now controls which subsystems a specific run gets:

  full     — qwen3.5 + qwen3 + cloud T3 + cloud rescue
  local    — qwen3.5 + qwen3 + local gpt-oss:20b T3 + rescue
  basic    — qwen2.5 + qwen2.5 + local T3, no rescue
  minimal  — qwen2.5 + qwen2.5, NO T3, NO rescue. Playbook
             inheritance only.

applyToolLevel() mutates module-scoped ACTIVE_* slots each run from the
env defaults, so prior staffer's overrides never leak. Hot-path code
reads ACTIVE_EXECUTOR / ACTIVE_REVIEWER / ACTIVE_T3_DISABLED /
ACTIVE_OVERVIEW_CLOUD / ACTIVE_RETRY_ON_FAIL instead of the baked
constants.

The architectural question this answers: does playbook_memory
inheritance carry enough knowledge to let a weakly-tooled coordinator
still produce usable outcomes? "Minimal" Alex runs qwen2.5 exec + no
reviewer overseer + no cloud rescue. If Alex still fills events at a
reasonable rate, the playbook system is the real knowledge carrier —
the senior stack is nice-to-have, not the sine qua non.

Demo personas mapped:
  Maria (senior, 48mo, full)
  James (mid, 14mo, local)
  Sam (junior, 4mo, basic)
  Alex (trainee, 1mo, minimal)

Same 3 contracts (Nashville downtown, Joliet warehouse, Indianapolis
assembly) across all four → 12 runs. KB + kb_staffer_report.py
leaderboard already wired; competence_score will now reflect real tool
asymmetry instead of LLM sampling variance.
2026-04-20 22:50:05 -05:00
root
6b71c8e9b2 Phase 23 — contract terms + staffer identity + competence-weighted retrieval
Matrix-index the "who handled this" dimension so top staffers become
the training signal and juniors inherit their playbooks automatically
via the boost pipeline. Auto-discovered indicators emerge from
comparing trajectories across staffers on similar contracts — that was
always the architectural point; this wires the last piece.

ContractTerms:
- deadline, budget_total_usd, budget_per_hour_max, local_bonus_per_hour,
  local_bonus_radius_mi, fill_requirement ("paramount" | "preferred")
- Attached to ScenarioSpec, propagated into T3 checkpoint + cloud
  rescue prompts so cloud reasons about trade-offs (pivot within bonus
  radius first; respect per-hour cap; split across cities when
  fill_requirement=paramount).

Staffer:
- {id, name, tenure_months, role: senior|mid|junior|trainee}
- On ScenarioSpec; logged at scenario start; attached to KB outcome
- Recomputed StafferStats written to data/_kb/staffers.jsonl after
  every run: total_runs, fill_rate, avg_turns, avg_citations,
  rescue_rate, competence_score.
- Competence formula: 0.45*fill_rate + 0.20*turn_efficiency +
  0.20*citation_density + 0.15*rescue_rate. Normalized to 0..1.

findNeighbors now returns weighted_score = cosine × best_staffer_competence
(floored at 0.3 so high-similarity low-competence neighbors still
surface). pathway_recommender prompt shows the top staffer's identity
so cloud knows WHOSE playbook it's synthesizing from.

Demo infrastructure:
- tests/multi-agent/gen_staffer_demo.ts: 4 personas (Maria senior,
  James mid, Sam junior, Alex trainee) × 3 contracts (Nashville Welder,
  Joliet Warehouse, Indianapolis Assembly). 12 scenarios total.
- scripts/run_staffer_demo.sh: runs the 12 sequentially with
  LH_OVERVIEW_CLOUD=1. Post-run calls kb_staffer_report.py.
- scripts/kb_staffer_report.py: leaderboard + cross-staffer worker
  overlap (names endorsed by ≥2 staffers → auto-discovered high-value
  workers). Top vs bottom differential.

gen_scenarios.ts (Phase 22 generator) also now emits contract terms
on 70% of generated specs — future KB batches populate with realistic
constraint patterns instead of bare role+city+count.

Stress scenario from item A intentionally NOT the production test.
Real staffing has constraints; Nashville contract + staffer demo is
the honest test of whether the architecture produces measurable
differential between coordinator skill levels.

Demo batch launched — 12 runs × ~3min each ≈ 40min unattended. Report
emitted after batch.
2026-04-20 22:16:09 -05:00
root
a7fc8e2256 Item B — cloud-rescue retry on event failure
When a scenario event fails (drift abort or other error) and
LH_RETRY_ON_FAIL is on (default when cloud T3 is enabled), ask cloud
for a concrete pivot — new city, role, or count — then re-run the
event with the remediation's fields. Capped at 1 retry per event so a
genuinely-impossible scenario can't burn budget.

requestCloudRemediation(event, result):
- Feeds the same diagnostic bundle T3 checkpoints get (SQL filters,
  row counts, SQL errors, reviewer drift reasons, gap signals).
- Prompt demands structured JSON: {retry, new_city, new_role,
  new_count, rationale}.
- Cloud is instructed to pivot to NEAREST alternate city when
  zero-supply detected, broaden role when uniquely scarce, reduce
  count when clearly unachievable, or return retry=false when no
  pivot seems viable.

EventResult additions:
- retry_attempt, retry_remediation (with rationale + cloud_model +
  duration), retry_result (full inner result shape), original_event.
- If retry succeeded, it becomes the primary result and original_event
  preserves what was attempted first. If retry also failed, the
  primary stays the failure and retry is recorded alongside.

Sanitizer on cloud output: model sometimes emits "Hammond, IN" in
new_city with "IN" in a non-existent new_state field, producing
"Hammond, IN, IN" downstream. Split new_city on comma, take first
token as city, extract state if present after the comma. Original
event's state is the fallback.

VERIFIED on stress_01.json with LH_OVERVIEW_CLOUD=1:
  Without rescue (item A baseline):  1/5 events ok
  With rescue (item B):              3/5 events ok
Gary IN misplacement: drift → cloud proposed South Bend IN → retry
filled 1/1. Rationale stored in retry_remediation for forensics.

Known limits surfaced (future work):
- City-field mangling failed one rescue before the sanitizer landed;
  next run will use the fix.
- Cloud picks alternate cities without knowing ground-truth supply.
  Flint → Saginaw pivoted but Saginaw also had sparse Welders.
  Future: expose a /vectors/supply-estimate endpoint cloud can consult
  before proposing a pivot.
2026-04-20 22:01:45 -05:00
root
c21b261877 Item A — stress scenario + enriched T3 diagnostic prompt
Proves cloud passthrough works end-to-end AND fixes the diagnostic
quality problem that first run surfaced.

STRESS SCENARIO (tests/multi-agent/scenarios/stress_01.json):
Five genuinely hard events with varied failure modes:
- Gary, IN 5× Electrician: ZERO supply (city not in workers_500k)
- Peoria, IL 8× Safety Coordinator: scarce role, initial pool only 5
- Flint, MI 3× Welder: ZERO supply
- Grand Rapids, MI 4× Tool & Die Maker: scarce but solvable
- Gary, IN 1× Electrician misplacement: repeats event 1's impossibility

FIRST RUN (stress v1) — cloud passthrough works, diagnosis vague:
  T3 checkpoint: "Potential drift flags for upcoming role"
  Lesson: "Before dispatching, query pool status. Update turn counter..."
Generic tactical advice that doesn't address the real problem.
Root cause: T3 prompt only saw outcome summary, not the raw
SQL/pool/drift signals the executor had in its log.

DIAGNOSTIC FIX:
- Added LogEntry[] `sharedLog` parameter to runAgentFill so the caller
  retains the trace even when runAgentFill throws drift-abort.
- EventResult gained `diagnostic_log` field populated on both OK and
  FAIL paths.
- extractDiagnostics() pulls SQL filters, hybrid_search row counts,
  SQL errors, and reviewer drift notes from the log.
- Checkpoint prompt now includes FAILURE FORENSICS block for failed
  events: SQL filters attempted, row counts, errors, drift reasons,
  and an explicit teaching note about zero-supply detection.
- Cross-day lesson prompt flags each event with [ZERO-SUPPLY: pivot
  city needed] tag when drift reasons mention "no match"/"no
  candidates"/"0 rows". PRIORITY clause in the prompt tells the model
  its lesson MUST name alternate cities when that tag appears.

SECOND RUN (stress v2 with enriched prompt) — cloud diagnosis sharp:
  T3 after Flint: risk="Zero candidate supply for Welder in Flint"
                  hint="search Welder×3 in Saginaw, MI (≈30 mi) or
                        expand role to Metal Fabricator"
  T3 after Gary:  risk="Zero supply for Electrician in Gary, IN"
                  hint="Pivot to Chicago, IL (≈40 min); broaden to
                        Electrical Technician within 60 min radius"
  Lesson: specific, per-city, with distances, role-broadening
  fallback, and pre-loading strategy — actionable for item B retry.

Cloud 120b call latencies consistent: 4.8-8.0s per prompt. Cloud
passthrough proven under stress.

Fill outcomes unchanged (1/5 — correct rejection of three impossible
events + one propagating JSON emission edge case on retry pivot
reasoning). The knowledge to rescue them now exists in the lesson;
item B wires the retry.
2026-04-20 21:54:29 -05:00
root
a663698571 Item 3 — geo-filtered playbook boost; diagnostic logging
ROOT CAUSE (found via instrumentation, not hunch):
After a 20-scenario corpus batch, only 6/40 successful (role, city)
combos ever triggered playbook_memory citations on subsequent runs.
Added `playbook_boost:` tracing::info! line in vectord::service to log
boost map size vs candidate pool vs match count. One query revealed:

  boosts=170 sources=50 parsed=50 matched=0

170 endorsed workers came back from compute_boost_for — but zero were
in the 50-candidate Toledo pool. The boost map was pulling globally-
ranked semantic neighbors (top-100 playbooks across ALL cities),
dominated by Kansas City / Chicago / Detroit forklift playbooks the
Toledo SQL filter would never admit. The mechanism was correct at the
per-playbook level; the problem was pool intersection.

FIX (surgical, not cap-tuning):
- playbook_memory::compute_boost_for_filtered(): accepts optional
  (city, state) filter. When set, skips playbooks from other geos
  BEFORE cosine-ranking, so top-k is within the target city.
- Backwards-compatible: compute_boost_for() calls the filtered variant
  with None — existing callers unchanged.
- service::hybrid_search(): extracts target (city, state) from the
  executor's SQL filter via a small parser (extract_target_geo),
  passes to compute_boost_for_filtered.

VERIFIED:
  Before fix: boosts=170 sources=50 parsed=50 matched=0   (0% hit)
  After fix:  boosts=36  sources=50 parsed=50 matched=11  (22% hit)
Top-k=10 now has 7/10 boosted workers with 2-3 citations each.
Boost values 0.075-0.113 on cosine scores 0.67-0.74 — meaningful
reorder without saturation.

scripts/kb_measure.py:
Aggregator that reads data/_kb/*.jsonl and playbooks/*/results.json,
reports fill rate, citation density, recommender confidence trend,
and zero-citation-ok combos (item 3 target signal). Used to measure
before/after on bigger batches.

Diagnostic logging stays — the class of "boosts computed but not
matched" bug can recur if the SQL filter format ever drifts, and
without the counter it's invisible. Every hybrid_search with
use_playbook_memory=true now logs its boost stats.
2026-04-20 21:35:04 -05:00
root
330cb90f99 Lift k cap, drop ornamental reason field, scenario generator
ITEM 1 — k CAP + REASON FIELD
The hybrid_search default k was hard-coded to 10. For multi-fill events
(5× expansion, 4× emergency) that's pool=10 → propose 5-of-10, half
the candidates become the answer with no room for rejection. Executor
prompt now instructs k to scale with target_count: k = max(count*5, 20),
cap 80. Default helper bumped 10 → 20.

Fill.reason dropped from required to optional. Nothing downstream ever
consumed it — resolveWorkerIds, sealSale, retrospective all use
candidate_id and name. Models loved to write 100-150 char justifications
per fill; on 4+ fills that blew the JSON budget before the structure
closed. Test 1 run result after this change: FIRST EVER 5/5 on the
Riverfront Steel scenario, 13 total turns across 5 events. The event
that failed last run (emergency 4×Loader with truncated reason-field
continuation) now clears in 2 turns.

Progression:
  mistral baseline:                  0/5
  qwen3.5 + continuation + think:false: 4/5
  qwen3.5 + k=20 + no-reason:        5/5 ✓

ITEM 2 — SCENARIO GENERATOR (NOT YET TESTED E2E)
tests/multi-agent/gen_scenarios.ts emits N deterministic ScenarioSpecs
with varied clients (15 companies), cities (20 Midwest cities known
to exist in workers_500k), role mixes (14 industrial staffing roles,
weighted realistic), and event sequences. Each gets a unique sig_hash
so the KB populates with distinct neighbor signatures.

scripts/run_kb_batch.sh runs all generated specs sequentially against
scenario.ts, logs per-scenario outcomes, and reports KB state at the
end. Each run takes ~2-4min; 20-30 scenarios = 1-2hr unattended.

Next: test the generator+batch on a small N (3-5) to verify KB
populates correctly and pathway recommendations start getting neighbor
signal instead of cold-starts. Then item 3 (Rust re-weighting of
hybrid_search by playbook_memory success).
2026-04-20 20:31:34 -05:00
root
9c1400d738 Phase 22 — Internal Knowledge Library (KB)
Meta-layer over Phase 19 playbook_memory. Phase 19 answers "which
WORKERS worked for this event"; KB answers "which CONFIG worked for
this playbook signature" — model choice, budget hints, pathway notes,
error corrections.

tests/multi-agent/kb.ts:
- computeSignature(): stable sha256 hash of the (kind, role, count,
  city, state) tuple sequence. Same scenario shape → same sig.
- indexRun(): extracts sig, embeds spec digest via sidecar, appends
  outcome record, upserts signature to data/_kb/signatures.jsonl.
- findNeighbors(): cosine-ranks the k most-similar signatures from
  prior runs for a target spec.
- detectErrorCorrections(): scans outcomes for same-sig fail→succeed
  pairs, diffs the model set, logs to error_corrections.jsonl.
- recommendFor(): feeds target digest + k-NN neighbors + recent
  corrections to the overview model, gets back a structured JSON
  recommendation (top_models, budget_hints, pathway_notes), appends
  to pathway_recommendations.jsonl. JSON-shape constrained so the
  executor can inherit it mechanically.
- loadRecommendation(): at scenario start, pulls newest rec matching
  this sig (or nearest).

scenario.ts:
- Reads KB recommendation at startup (alongside prior lessons).
- Injects pathway_notes into guidanceFor() executor context.
- After retrospective, indexes the run + synthesizes next rec.

Cold-start behavior: first run with no history writes a low-confidence
"no prior data" rec so the signal that something was attempted is
captured. Second run gets "low confidence, 0 neighbors" until a third
distinct sig gives the embedder something to compare against — hence
the upcoming scenario generator.

VERIFIED:
- data/_kb/ populated after one scenario run: 1 outcome (sig=4674…,
  4/5 ok, 16 turns total), 1 signature, 2 recs (cold + post-run).
- Recommendation JSON-parsed cleanly from gpt-oss:20b overview model.

PRD Phase 22 added with file layout, cycle description, and the
rationale for file-based MVP → Rust port progression that matches
how Phase 21 primitives shipped.

What's NOT here yet (batched follow-ups per J's request, tested
between each):
- Lift the k=10 hybrid_search cap to adaptive k=max(count*5, 20)
- Scenario generator to bulk-populate KB with varied signatures
- Rust re-weighting: push playbook_memory success signal INTO
  hybrid_search scoring, not just post-hoc boost
2026-04-20 20:27:12 -05:00
root
0c4868c191 qwen3.5 executor + continuation primitive + think:false
Three coupled fixes that together turned the Riverfront Steel scenario
from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing
concerns rather than linter advice.

MODEL SWAP
- Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking).
  mistral's decoder emitted malformed JSON on complex SQL filters
  regardless of prompt; J called it — stop using mistral.
- Reviewer: qwen2.5 → qwen3:latest (40K ctx)
- Applied to scenario.ts, orchestrator.ts, network_proving.ts,
  run_e2e_rated.ts

CONTINUATION PRIMITIVE (agent.ts)
- generateContinuable(): empty-response → geometric backoff retry;
  truncated-JSON → continue from partial as scratchpad; bounded by
  budget cap + max_continuations. No more "bump max_tokens until it
  stops truncating" tourniquet.
- generateTreeSplit(): map-reduce for oversized input corpora with
  running scratchpad digest, reduce pass for final synthesis.
- Empty text no longer throws — it's a signal to continuable that
  thinking ate the budget.

think:false FOR HOT PATH
- qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON
  emission. For executor/reviewer/draft: think:false. For T3/T4/T5
  overseers: thinking stays on (that's the point).
- Sidecar generate endpoint accepts `think` bool, passes through to
  Ollama's /api/generate.

VERIFIED OUTCOMES
Riverfront Steel 2026-04-21, qwen3.5+continuable+think:false:
  08:00 baseline_fill  3/3  4 turns
  10:30 recurring      2/2  3 turns (1 playbook citation)
  12:15 expansion      0/5  drift-aborted (5-fill orchestration
                            problem, separate work)
  14:00 emergency      4/4  3 turns (1 citation)
  15:45 misplacement   1/1  3 turns
  → T3 caught Patrick Ross double-booking across events
  → T3 flagged forklift cert drift on the event that failed
  → Cross-day lesson proposed "maintain buffer of ≥3 emergency
    candidates, pre-fetch certs for expansion, booking system
    cross-check" — real staffing advice, not generic linter output

PRD PHASE 21 rewritten to reflect the actual primitive shape (two-
call map-reduce with scratchpad glue) instead of the tourniquet
approach originally documented. Rust port queued for next sprint.

scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits
tests/multi-agent/playbooks/ab_scorecard.json.
2026-04-20 20:19:02 -05:00
root
6e7ca1830e Phase 21 foundation — context stability + chunking pipeline
PRD: add Phase 20 (model matrix, wired) and Phase 21 (context stability,
partial). Phase 21 exists because LLM Team hit this exact wall — running
multi-model ranking on large context silently truncated, rankings
degraded, no pipeline caught it. The stable answer: every agent call
goes through a budget check against the model's declared context_window
minus safety_margin, with a declared overflow_policy when the check
fails.

config/models.json:
- context_window + context_budget per tier
- overflow_policies block: summarize_oldest_tool_results_via_t3,
  chunk_lessons_via_cosine_topk, two_pass_map_reduce,
  escalate_to_kimi_k2_1t_or_split_decision
- chunking_cache spec (data/_chunk_cache/, corpus-hash keyed)

agent.ts:
- estimateTokens() chars/4 biased safe ~15%
- CONTEXT_WINDOWS table (fallback; prod reads models.json)
- assertContextBudget() — throws on overflow with exact numbers, can
  bypass with bypass_budget:true for callers with their own policy
- Wired into generate() and generateCloud() so EVERY call is checked

scenario.ts:
- T3 lesson archive to data/_playbook_lessons/*.json (the old
  /vectors/playbook_memory/seed path was silently failing with HTTP 400
  because it requires 'fill: Role xN in City, ST' operation shape)
- loadPriorLessons() at scenario start — filters by city/state match,
  date-sorted, takes top-3
- prior_lessons.json archived per-run (honest signal for A/B)
- guidanceFor() injects up to 2 prior lessons (≤500 chars each) into
  the executor's per-event context
- Retrospective shows explicit "Prior lessons loaded: N" line

Verified: mistral correctly rejects a 150K-char prompt (7532 tokens
over), gpt-oss:120b accepts it with 90K headroom. The enforcement is
in-band on every call now, not an afterthought.

Full chunking service (Rust) remains deferred to the sprint this feeds:
crates/aibridge/src/budget.rs + chunk.rs + storaged/chunk_cache.rs
2026-04-20 19:34:44 -05:00
root
03d723e7e6 Model matrix — 5 tiers, local hard workers + cloud overseers
config/models.json is the authoritative catalog. Hot path (T1/T2) stays
local; cloud is consulted only for overview (T3), strategic (T4), and
gatekeeper (T5) calls. J named qwen3.5 + newer models (minimax-m2.7,
glm-5, qwen3-next) specifically — all mapped with real reachable IDs
verified against ollama.com/api/tags.

Tier shape:
- t1_hot     mistral + qwen2.5 local       — 50-200 calls/scenario
- t2_review  qwen2.5 + qwen3 local         — 5-14 calls/event
- t3_overview gpt-oss:120b cloud           — 1-3 calls/scenario
- t4_strategic qwen3.5:397b + glm-4.7      — 1-10 calls/day
- t5_gatekeeper kimi-k2-thinking           — 1-5 calls/day, audit-logged

Rate budgets are declared in-config — Ollama Cloud paid tier is generous
but we cap overview/strategic/gatekeeper so no single rogue scenario can
blow the day's quota.

Experimental rotation list wired but disabled by default. When enabled,
T4 randomly routes 10% of calls to a rotating minimax/GLM/qwen-next/
deepseek/nemotron/cogito/mistral-large candidate, logs comparisons, and
auto-promotes after 3 rotations of wins.

Playbook versioning SPEC embedded under `playbook_versioning` key: every
seed gets version + parent_id + retired_at + architecture_snapshot, so
when a schema migration breaks a playbook we can pinpoint which change
retired it. Implementation flagged for next sprint (touches gateway +
catalogd + mcp-server) — not wired here.

- scenario.ts now loads config/models.json at init, env vars still override
- mcp-server exposes /models/matrix read-only so UI can render it
2026-04-20 19:24:41 -05:00
root
e4ae5b646e T3 overview tier — mid-day checkpoints + cross-day lesson
Hot path (T1/T2) stays mistral + qwen2.5. The new T3 tier runs a
thinking model SPARINGLY — after every misplacement, every N-th event
(default N=3), and once post-scenario for the cross-day lesson.

- agent.ts: generateCloud() for Ollama Cloud (gpt-oss:120b etc). Uses
  the same /api/generate shape; thinking field is discarded.
- scenario.ts: runOverviewCheckpoint + runCrossDayLesson. Outputs land
  in checkpoints.jsonl and lesson.md. Lesson also seeds playbook_memory
  under operation "cross-day-lesson-{date}" — future runs pick it up
  through the existing similarity boost.
- Env knobs: LH_OVERVIEW_CLOUD=1 routes T3 to cloud, LH_OVERVIEW_MODEL
  overrides (default gpt-oss:20b local, gpt-oss:120b cloud),
  LH_T3_CHECKPOINT_EVERY controls cadence, LH_T3_DISABLE=1 turns it off.

Why this shape: prior feedback_phase19_seed_text.md warned that verbose
seeds dilute the embedding and silently kill the boost. T3's rich prose
goes to lesson.md; the embedded "approach" + "context" stay terse.

Verified end-to-end: local 20b checkpoint 10.9s, lesson 4.0s; cloud
120b lesson 3.7s. Cloud output is both faster AND more specific than
local (sequenced, tactical, logging advice included).
2026-04-20 19:21:45 -05:00
root
0ff091c173 Honesty fixes — no hard-coded counts, dynamic sample CSV
- generateSampleRosterCSV(): 120-180 randomized rows per call, timestamp-prefixed IDs (no dedup on re-upload, no static 25 row lie)
- /system/summary: truth via SQL COUNT(*), surfaces manifest_drift (caught candidates: manifest 100K, actual 1K)
- search.html: loadSystemSummary() hydrates live counts; removed hard-coded 500K strings
- MCP tool description: "candidates (100K)" → "candidates (1K)", added "workers_500k (500K)"
2026-04-20 19:07:47 -05:00
root
af3856b103 Rate/margin awareness: implied pay rate per worker, bill rate per contract
Closes one of the Path 1 trust-break gaps. The scenario we kept flagging:
recruiter calls the system's top pick, worker quotes $35/hr, contract
pays $28/hr. First broken call kills the demo. This fixes it.

Heuristic (no schema change, derived at query time):
- Per worker: implied_pay_rate = role_base + (reliability × 4) + archetype_bump
  role_base: Electrician $28, Welder $26, Machine Op $24, Maint $26,
    Forklift Op $20, Loader $17, Warehouse Assoc $17, Quality Tech $23,
    Production Worker $18 ...
  archetype bump: specialist +4, leader +3, reliable +1, else 0
- Per contract: implied_bill_rate = role_base × 1.4
  (40% markup — industry norm: pay + overhead + insurance + margin)
- Worker is 'over_bill_rate' when implied_pay_rate > contract's bill_rate
  on a candidate-by-candidate basis

Backend (mcp-server/index.ts):
- ROLE_BASE_PAY_RATE + BILL_MARKUP constants
- impliedPayRate(worker), impliedBillRate(role) functions
- parseWorkerChunk() extracts role/reliability/archetype from vector text
- enrichWithRates() attaches implied_pay_rate on every /vectors/hybrid
  source response. Called from /search and /intelligence/permit_contracts.
- /search accepts optional max_pay_rate number — if set, filters out
  workers above that rate and reports pay_rate_filtered_out count.
- /intelligence/permit_contracts returns implied_bill_rate per contract
  AND over_bill_rate boolean per candidate.

Frontend (search.html):
- Live Contracts cards show 'bill rate: $X/hr' under the headcount line
- Each candidate shows 'pay $X/hr' in the sub-line; red 'Over bill rate'
  chip next to name when their pay exceeds the contract's bill rate
  (hover reveals the exact numbers and why it's flagged)
- Main 'Search all workers' results now include 'pay $X/hr' in the
  why-text (computeImpliedPayRate mirrored client-side to match Bun)

End-to-end verified live:
- Masonry Work permit, bill_rate $25.20/hr
  Kathleen M. Gutierrez pay $25.56/hr → 🔴 OVER
  Melissa C. Rivera pay $20.88/hr → 🟢 OK
- /search with max_pay_rate:32 filtered out 1 Toledo Welder above $32
- Main search shows 'pay $28.64/hr' in each result row

When real ATS data replaces synthetic workers_500k, same UI — the
client's real pay_rate column substitutes for the heuristic.
2026-04-20 18:56:51 -05:00
root
a117ae8b38 Workspace UI — surface Phase 8.5 per-contract state + handoff
Phase 8.5 was fully built on the Rust side (WorkspaceManager with
create/handoff/search/shortlist/activity/get/list, persisted to
object storage, zero-copy handoff between agents). Nothing surfaced
it in the recruiter UI. This page closes that gap.

/workspaces — split-pane UI:

Left: scrollable list of all workspaces, sorted by updated_at.
  Each card shows name, tier pill (daily/weekly/monthly/pinned),
  current owner, count of shortlisted candidates + activity events.

Right: selected workspace detail with five sections:
  1. Header — name, tier, owner, created/updated dates, description,
     previous-owners audit trail (each handoff is preserved)
  2. Actions row — Hand off, Shortlist candidate, Save search, Log activity
  3. Shortlist — candidates flagged with dataset + record_id + notes
  4. Saved searches — named SQL queries the staffer wants to rerun
  5. Activity — chronological (newest first) log of what happened

Four modals for the add/edit actions (create, handoff, shortlist,
save-search, log-activity). All forms POST through the existing
/api/* passthrough to the gateway's /workspaces/* routes.

End-to-end verified live:
  1. Sarah creates 'Demo: Toledo Week 17' workspace
  2. Shortlists Helen Sanchez (W500K-4661) with notes about prior endorsements
  3. Logs activity: 'called — Helen confirmed Tuesday 7am shift'
  4. Hands off to Kim with reason 'end of shift'
  5. Kim opens the workspace: owner=kim, previous_owners=[{sarah→kim}],
     sees all 3 prior events + the shortlisted Helen
     — no data copy, pointer swap only (Phase 8.5 design)

Security: all dynamic content built via el(tag,cls,text) DOM helper.
Zero innerHTML on API-derived strings. Modal close-on-backdrop-click
is guarded to the backdrop element.

Nav updated across all 7 pages. Workspaces is the 7th tab.
Dashboard · Walkthrough · Architecture · Spec · Onboard · Alerts · Workspaces.
2026-04-20 18:36:51 -05:00
root
6287558493 Push/daemon presence: background digest + /alerts settings page
Converts the app from 'dashboard you visit' to 'system that finds you.'
Critical for the phone-first staffing shop that won't open a URL —
the system reaches out when something matters.

Daemon:
- Starts once per Bun process (guarded via globalThis sentinel)
- Default interval 15 min (configurable, min 1, max 1440)
- On each cycle, buildDigest() compares current state against prior
  snapshot persisted in mcp-server/data/notification_state.json
- Events detected:
  - risk_escalation: role moved to tight or critical (was ok/watch)
  - deadline_approaching: staffing window falls within warn window
    (default 7 days) AND deadline date differs from prior
  - memory_growth: playbook_memory entries grew by >= 5 since last run

Channels (all opt-out individually via config):
- console: always on, logged to journalctl -u lakehouse-agent
- file: always on, appends JSONL to mcp-server/data/notifications.jsonl
- webhook: optional, POSTs {text, digest} to configured URL
  (Slack incoming-webhook / Discord webhook / any custom endpoint)

Digest format (human-readable, fits in a Slack message):
  LAKEHOUSE DIGEST — 2026-04-20 23:24
  3 staffing deadlines within window:
    • Production Worker — 2d to 2026-04-23 · demand 724
    • Maintenance Tech — 4d to 2026-04-25 · demand 32
    • Electrician — 5d to 2026-04-26 · demand 34
  +779 new playbooks (total 779, 2204 endorsed names)
  snapshot: 0 critical · 0 tight · $275,599,326 pipeline

/alerts page:
- Current status table (daemon state, interval, webhook, last run)
- Config form: enable toggle, interval, deadline warn window, webhook
  URL + label (saved to data/notification_config.json)
- 'Fire a test digest now' button — force a cycle without waiting
- Recent digests panel shows the last 10 dispatches with full text

End-to-end verified live:
- Daemon armed successfully on startup
- First-run digest dispatched to console + file in <1s
- Events detected correctly: 3 deadlines within 7 days from real
  Chicago permit data; 779 playbook entries surfaced as memory growth
- Digest text format is Slack-pastable
- Dispatch records appear in /alerts recent list

TDZ caveat: startAlertsDaemon() invocation moved to end of module so
all const/let in the alerts block evaluate before daemon reads them.
Previously failed with 'Cannot access X before initialization' when
the call lived near the top of the file. Nav added to all 6 pages:
Dashboard · Walkthrough · Architecture · Spec · Onboard · Alerts.
2026-04-20 18:24:48 -05:00
root
23eb04a145 Onboarding wizard — ingest any staffing CSV in 3 steps
New /onboard page. Client-facing wizard for getting real data into
the system without engineering help.

Flow:
1. Drop a CSV (or click 'Use the sample as my data' — ships a 25-row
   realistic staffing roster under /samples/staffing_roster_sample.csv)
2. Browser parses client-side. Columns auto-typed (text/int/decimal/
   date). PII flagged by name hint AND content regex (emails, phones).
   First rows previewed. Read-only — nothing written yet.
3. Name the dataset (lowercase+underscores). Commit.
4. Post-commit: dataset is live. Shows 4 next steps the operator can
   take (SQL query, vector index, dashboard search, playbook training).

Backend:
- /onboard serves onboard.html
- /samples/*.csv serves CSV files from mcp-server/samples/ with
  filename validation (only [a-zA-Z0-9_-.]+.csv, prevents path traversal)
- /onboard/ingest forwards multipart/form-data to gateway /ingest/file
  preserving the boundary. The generic /api/* passthrough breaks
  multipart because it reads as text and forwards as JSON; this route
  uses arrayBuffer + original Content-Type.

Verified end-to-end: upload sample roster (25 rows, 12 columns) →
parse in browser → show columns + PII flags + preview → commit →
gateway writes Parquet, registers in catalog → immediately queryable:
  SELECT * FROM onboard_demo2 LIMIT 3
  → Sarah Johnson, Forklift Operator, Chicago, IL, 0.92
Round-trip <1 second.

Nav updated on all pages to link Onboard. Shipped with a sample CSV
so the full flow is demonstrable without real client data.

When a real client shows up, same path — they upload their CSV.
No engineering ticket, no code change, no schema pre-definition.

Security: sample filename regex prevents path traversal. CSV parse
is client-side pure JS (no DOM injection). Commit uses existing
/ingest/file validation (schema fingerprint, PII server-side,
content-hash dedup).
2026-04-20 18:13:56 -05:00
root
468798c9ac /spec: technical specification — 11-chapter README-equivalent
J's ask: explain the full architecture so someone reading a README
can dispute it or recreate it. The repo isn't public yet; this page
IS the spec until it is.

Ch1 Repository layout — 13 crates + tests/multi-agent + docs + data,
    with owned responsibility and file path per crate.

Ch2 Data ingest pipeline (8 steps) — sources (file/inbox/DB/cron),
    parse+normalize with ADR-010 conservative typing, PII auto-tag,
    dedup, Parquet write, catalog register with fingerprint gate,
    mark embeddings stale, queryable immediately.

Ch3 Measurement & indexing — row count / fingerprint / owner /
    sensitivity / freshness / lineage per dataset. HNSW vs Lance
    tradeoff table with measured numbers (ADR-019). Autotune loop.
    Per-profile scoping (Phase 17).

Ch4 Contract inference from external signal — Chicago permit feed
    → role mapping → worker count heuristic → timeline → hybrid
    search with boost → pattern discovery → rendered card. All
    pre-computed before staffer opens UI.

Ch5 What a CRM can't do — 11-row comparison table of capabilities.

Ch6 How it gets better over time — three paths:
    - Phase 19 playbook boost (full math)
    - Pattern discovery meta-index
    - Autotune agent

Ch7 Scale story: 20 staffers, 300 contracts, midday +20/+1M surge
    - Async gateway + per-staffer profile isolation + client blacklists
    - 7-step surge handling flow (ingest, stale-mark, incremental refresh,
      degradation, hot-swap, autotune re-enter)
    - Known pain points: Ollama inference serial, RAM ceiling ~5M on
      HNSW (mitigated by Lance), VRAM 1-2 models sequential,
      playbook_memory unbounded.

Ch8 Error surfaces & recovery — 10-row table covering ingest schema
    conflicts, bucket failures, ghost names, dual-agent drift,
    empty searches, Ollama down, gateway restart, schema fingerprint
    divergence. Every failure has a named surface and recovery path.

Ch9 Per-staffer context — active profile, workspace, client blacklist,
    audit trail, daily summary. How 20 staffers don't see the same UI.

Ch10 Day in the life — 07:00 housekeeping → 07:30 refresh → 08:00
     staffer opens → 08:15 drill down → 08:30 Call click → 09:00
     second staffer shares memory → 12:30 surge → 14:00 no-show →
     15:00 new embeddings live → 17:00 retrospective → 22:00
     overnight trials.

Ch11 Known limits & non-goals — deferred (rate/margin, push, confidence
     calibration, neural re-ranker, pm compaction, call_log cross-ref)
     and explicitly out-of-scope (cloud, ACID, streaming, CRM replace,
     proprietary formats, hard multi-tenant).

Also: nav updated on /dashboard, /console, /proof to link /spec.
Every architectural claim in the spec cites either a code path, an
ADR number, or a phase reference so someone skeptical can target
the specific artifact.
2026-04-20 17:56:18 -05:00
root
76bfa2c8d7 /proof: explain the dual-agent recursive architecture with citations
Previous page was numeric claims without explanations — 'sub-100ms SQL',
'500K vectors in 341ms' etc. Accurate but undefendable without math,
code paths, and ADR references. Expanded to 8 chapters:

Ch1 — Live receipts (unchanged: real gateway tests, pass/fail, timing)

Ch2 — Architecture. 13-crate diagram with per-crate responsibility
      table and file paths. gateway → catalogd/queryd/vectord/ingestd
      + aibridge → object_store. References ADRs 1-20.

Ch3 — Dual-agent recursive consensus loop (NEW)
      - Role specialization (executor=optimist, reviewer=pessimist)
      - Parallel orchestration via Promise.all
      - Recursive: sealed playbooks feed playbook_memory → next query
      - Termination math: sealed | tool-error abort | drift abort |
        turn-cap abort — every path dumps forensic log
      - File refs: tests/multi-agent/agent.ts, orchestrator.ts,
        scenario.ts, run_e2e_rated.ts

Ch4 — Playbook memory feedback loop (NEW)
      - PlaybookEntry shape with embedding
      - Full boost math: similarity * base_weight * decay * penalty
        / n_workers, capped at MAX_BOOST_PER_WORKER
      - Temporal decay (e^-age/30, 30d half-life)
      - Negative signal (0.5^failures)
      - Why k=200: narrow cosine discrimination in nomic-embed-text
      - Evidence: compounding test 0 → 0.250 cap in 3 seeds
      - persist_sql write-through
      - Pattern discovery (Path 2 meta-index)
      - File: crates/vectord/src/playbook_memory.rs

Ch5 — ADR citations for each key choice
      ADR-001, 008, 012, 015, 019, 020 + Phase 19 design note

Ch6 — Live scale data (unchanged: pulled from /proof.json)

Ch7 — Reproduction recipes: curl for health, sql, hybrid with boost,
      patterns, pm stats, and the full dual-agent scenario run

Ch8 — Honest limits (unchanged: synthetic workers_500k, 1K candidates
      misaligned to call_log, 7B model imperfection, no rate/margin)

Every architectural claim now cites either the code path
(crates/.../src/file.rs::fn_name) or the ADR (docs/DECISIONS.md).
Someone disputing the system has specific targets to attack.

Mechanism unchanged: /proof serves mcp-server/proof.html via
Bun.file. /proof.json still returns the live test data the page
consumes client-side.
2026-04-20 17:49:08 -05:00
root
05f2e42c45 Rebuild /console as narrative walkthrough for a skeptical staffer
Old console was a chat playground. New console is a guided,
chapter-based explanation that a non-technical staffing staffer
can read top-down and finish convinced — without needing to
understand any of the underlying technology.

Six chapters, each loading live data:

1. Right now, this system is already thinking
   Four stats cards pulled live: construction pipeline $, predicted
   worker demand, rows under management, playbooks remembered. Then
   a narrative that names the current alert posture (critical/tight/ok).

2. The demand signal is real, not made up
   Expandable rows per Chicago permit work_type, with a direct link to
   data.cityofchicago.org for verification. Pill labeled LIVE ·
   DATA.CITYOFCHICAGO.ORG leaves no ambiguity.

3. Where your own data would live
   Catalog enumerated with three pill classes:
   - SWAP FOR YOUR DATA (purple) — the synthetic tables that would
     be replaced by the client's ATS/CRM/call-log exports
   - SYSTEM-GENERATED (blue) — playbook memory, threat_intel, kb_*
     produced by the system itself
   Row counts + columns visible. Names it honestly.

4. Watch the system rank candidates in real time
   Takes the freshest Chicago permit, walks the staffer through all
   three steps (derive need → narrow via SQL → rank + boost), shows
   the top-5 workers with why, boost chip, memory chip, timeline,
   and a plain-English narrative of the CRM gap.

5. Every action compounds
   Playbook memory count + sample + narrative about what it means
   when the staffer logs a fill.

6. Try it yourself
   Free-text input hitting /intelligence/chat, renders response
   with memory chip + boost chips + ranked workers.

Security: all API-derived strings go through textContent or
el(tag,cls,text) helper. Zero innerHTML usage on dynamic content.
Passes security reminder hook.

File size: 419 → ~500 lines. Visual style matches the dashboard
(same palette, typography, chip styles) so the two pages feel
like one app.
2026-04-20 17:35:45 -05:00
root
bb1b471c67 Predictive staffing forecast + per-contract timeline
J's ask: move the system from retrospective ranking to predictive
anticipation. Show it tracks the clock, not just the roster.

New endpoint /intelligence/staffing_forecast:
- Pulls 30-day Chicago permit window (200 permits)
- Maps work_type → role via industry heuristic
- Aggregates predicted worker demand per role
- Joins IL bench supply (workers_500k state='IL' group by role)
- Computes coverage_pct, reliable_coverage_pct
- Classifies risk: critical/tight/watch/ok
- Computes earliest staffing deadline per role
  (permit issue_date + 31d = 45d construction start - 14d window)
- Surfaces recent Chicago playbook ops for the role-specific memory

New UI 'Staffing Forecast' section ABOVE Live Contracts:
- Top card: total construction value, permit count, workers needed,
  critical/tight role count
- Per-role rows: demand vs available supply, coverage %, deadline
  with red/amber/green urgency coloring

Per-contract timeline on Live Contracts:
- estimated_construction_start, staffing_window_opens, days_to_deadline
- urgency classification: overdue/urgent/soon/scheduled
- card border colored by urgency
- timeline line explicitly shows recruiter: OVERDUE/URGENT + days count

This is the 'system already thinks about when, not just who' surface
J was asking for. CRMs store; this anticipates.
2026-04-20 17:24:17 -05:00
root
2595d48535 Gap fixes: pattern fallback, narrative citations, call_log plumbing
Closing trust-breaks surfaced in the strategic audit.

A — MEMORY chip renders even when sparse:
Previously rendered nothing when no trait crossed threshold, which
recruiters would read as "system has no signal." Now explicitly
says "memory is sparse for this role+geo — no trait crossed
threshold" or "no similar past playbooks yet — first fill of this
kind will seed it." Honest when it doesn't know.

B — Removed /intelligence/learn dead endpoint:
Legacy CSV-writer path that destructively re-wrote
successful_playbooks. /log and /log_failure replace it cleanly.
Leaving dead code confuses future maintainers.

C — Narrative tooltips on Endorsed chips:
Hovering the green "Endorsed · N playbooks" chip now fetches
the worker's past operations from successful_playbooks_live and
shows a story: "Maria — past endorsements: • Welder x2 in
Toledo (2026-04-15), • Welder x1 in Toledo (2026-04-18)..."
Falls back to honest "narrative unavailable" if the seed
didn't land in SQL.

D — call_log infrastructure in worker modal:
New "Recent Contact" section queries call_log JOIN candidates by
name. Surfaces last 3 call entries with timestamp, recruiter,
disposition, duration. When empty (which is today's reality —
candidates table only has 1000 rows vs call_log's higher IDs),
shows an honest message about the data gap and what real ATS
integration would unlock.

Honest call: D ships infrastructure. Actual utility depends on
aligning candidate IDs between the candidates table and
call_log — current synthetic data doesn't cross-ref cleanly.
When real ATS data lands, this section becomes the
"system knows who we called yesterday" feature the recruiter
needs.

Deferred (would require a dedicated session):
- Rate awareness (needs worker pay_rate + contract bill_rate)
- Push / background daemon (Slack/SMS/email integration)
- Confidence calibration (needs a probabilistic ranking layer)
2026-04-20 17:20:22 -05:00
root
1f56630d5d #3: Worker profile modal shows past playbook history
Click any worker card → modal now includes a 'Past Playbooks' section
that queries successful_playbooks_live for any row where this worker's
name appears in the result field. Shows up to 8 most recent with
operation, timestamp, approach, and context.

When empty: 'No prior playbooks for NAME yet. First placement builds
the first entry.' — makes the institutional-memory claim visible to
the recruiter: the system is tracking everyone, not just the ones
that sealed this session.

Also added Call / SMS / No-show buttons to the modal action row
(matching the card-level buttons from #1). Every worker-card path
now trains the system.

Closes the user-visible side of Phase 19 — patterns surface during
search (Pass A), boosts fire in ranking (Phase 19 core), and now
the worker's own profile shows the full history that informs those
boosts. Institutional memory legibility, per J's ask.
2026-04-20 16:21:27 -05:00
root
cdd12a1438 #2: Per-client worker blacklists
New endpoints:
- POST   /clients/:client/blacklist            { worker_id, name?, reason? }
- GET    /clients/:client/blacklist            → { client, entries }
- DELETE /clients/:client/blacklist/:worker_id → { removed, total }

Bun /search accepts optional `client` field. When present, loads that
client's blacklist and appends `AND worker_id NOT IN (...)` to the
SQL filter. Zero-cost if unused; clean trust-break avoidance when a
client has previously flagged a worker.

Persistence: mcp-server/data/client_blacklists.json, synchronous
writes via Bun.write. Scale target is hundreds of entries per client
tops — JSON is fine until we hit 10K+ per client.

Verified: worker_id 9326 (Carmen Green) blacklisted for AcmeCorp,
same Chicago Electrician search with client=AcmeCorp returns 196
sql_matches vs 197 without — exactly one excluded.
2026-04-20 16:20:17 -05:00
root
4aea71d213 #1: Close recruiter feedback loop — Call/SMS/No-show fire /log and /log_failure
Every worker-card button in the dashboard now trains the Phase 19
system directly:

- Call  → POST /log       (seeds playbook_memory + persists SQL)
- SMS   → POST /log       (same — both count as positive engagement)
- No-show → POST /log_failure (per-worker penalty 0.5^n on future boost)

Buttons flash status (Logged / Flagged / Ghost) for 1.4s on success,
then re-enable. Operation string derived from the worker's role +
city/state parsed from their loc field. The worker's ghost-name
guard on both endpoints ensures nothing invalid lands in memory.

Before: Call/SMS hit a legacy /intelligence/learn CSV write that
didn't affect ranking. No failure capture existed.

Now: recruiter using the app IS the training signal. Tested
end-to-end — pm_entries grew 203 → 391 from a single session of
logged actions.
2026-04-20 16:19:14 -05:00
root
72ee8f006f k=200 on /search and /match too — consistency with compounding default 2026-04-20 15:41:39 -05:00
root
99ab0fe623 A+B: patterns in main search + compounding bump
A — Patterns surface in main Worker Search:
  /intelligence/chat smart_search fallback now calls /patterns in
  parallel with hybrid, returns discovered_pattern + matched count.
  search.html doSearch renders a green "MEMORY (N playbooks): ..."
  chip above results so every recruiter query shows the meta-index
  dimension, not just live-contract cards.

B — Compounding proven and default-k bumped:
  Direct compounding test on Chicago Electrician:
  - Run 0 (no seeds): Carmen Green not in top-5, boost 0
  - After 3 seeds of identical operation: boost +0.250 (capped),
    3 citations, lifted to #1. Each seed adds 1 citation. Cap
    prevents one worker from dominating future searches.
  - Required k=200 (not 25 or 50) — embedding band is narrow
    (cosines 0.55-0.67 across all playbooks regardless of geo).
  - Bumped defaults on /search, permit_contracts, and smart_search
    to playbook_memory_k=200. Brute-force sub-ms at this scale.
2026-04-20 15:41:12 -05:00
root
5c39c74fe4 Live Contracts canvas: Chicago permits × workers_500k × playbook patterns
New devop.live/lakehouse section pairs live public Chicago building
permits with derived staffing contracts, ranked candidates from the
500K worker bench, and meta-index discovered patterns per role+geo.
Makes the Phase 19 boost + Path 2 pattern discovery visible on real
external data, without needing a paying client to demo.

Backend:
- New /intelligence/permit_contracts endpoint
- Fetches 6 recent Chicago permits > $250K from the Socrata API
- Derives proposed fill: 1 worker per $150K of permit value (capped 2-8)
- For each: /vectors/hybrid with use_playbook_memory=true,
  playbook_memory_k=25, auto availability>0.5 filter
- For each: /vectors/playbook_memory/patterns with k=25 min_freq=0.3
- Returns permit + proposed contract + top 5 candidates with boosts
  and citations + discovered pattern + pattern_matched count

Frontend:
- New "Live Contracts" section on search.html between today's sim
  contracts and Market Intelligence
- Per-permit card: cost + work_type + address + proposed role/count
  + pool size + top 3 candidates (with endorsement chip when boost
  fires) + memory-derived pattern ("MEMORY (N playbooks): recurring
  certifications: OSHA-10 47%, Forklift... · archetype mostly: ...")

Real working demo even without paying clients: shows the system
operating on genuinely external data with our synthetic-data-derived
learning applied.
2026-04-20 15:36:14 -05:00
root
f8e8d25b5f Unblock complex scenarios: JSON tolerance + optional question + mistral exec
parseAction now strips stray `)` before `}` and trailing commas —
qwen2.5 emits those regularly on tool_call outputs; soft-fix beats
retry-loops. hybrid_search no longer hard-requires `question`; defaults
to "qualified available workers" when the model drops it (mistral's
most common failure mode on complex events).

Kept original TOOL_CATALOG shape (args examples only, not full
action envelopes). The verbose few-shot version from the prior
iteration confused mistral into wrapping propose_done as tool_call.

Scenario V7 result: expansion (5 Forklift Ops) and emergency
(4 Loaders) — previously-failing complex events — now seal reliably.
Pool sizes: 687 and 380 from 500K corpus. Patterns endpoint produces
real operator-actionable signals:
  expansion: "recurring certifications: Forklift (40%), OSHA-10 (40%)
             · recurring skills: mill (40%) · archetype mostly: leader
             · reliability median 0.83"
Baseline + recurring are now flaky (inverted trade-off, pure
model-reliability variance).
2026-04-20 15:28:30 -05:00
root
1274ab2cb3 Scenario harness: Path 1+2 integration + schema hardening
Upgrades to tests/multi-agent/scenario.ts to exercise the full Path 1+2
feature set on a real warehouse-client week (5 events on one client):

- Hard SCHEMA ENFORCEMENT block in every event's guidance. Prior runs
  had mistral read narrative words ("shift", "recurring", "expansion")
  as SQL column names. Schema is now locked explicitly with valid
  columns listed and CAST guidance for availability + reliability.
- playbook_memory_k bumped 10 → 100 to match server default.
- Canonical short seed text (operation + "{kind} fill via hybrid
  search" + "{role} fill in {city}, {state}"). Verbose LLM rationales
  dilute embeddings and silently kill boost (Pass 1 finding).
- /vectors/playbook_memory/mark_failed fires automatically on
  misplacement events — records the no-shower's failure so future
  searches for same city+role dampen their boost.
- /vectors/playbook_memory/patterns call per event — surfaces what the
  meta-index discovered (recurring certs/skills/archetype/reliability)
  for that query into the dispatch log and retrospective.
- Retrospective now includes a workers-touched audit table (every
  worker who reached a decision, with outcome column) and a
  discovered-patterns-evolution section across events.

Honest limitations this surfaced in the real run:
- mistral's executor prompt-adherence degrades on high-count events
  (5+ fills) and scenario-specific language (emergency/misplacement).
  3 of 5 events aborted via drift guard. Baseline + recurring sealed
  cleanly with real fills + SMS + emails + seeded playbooks.
- worker_id resolution returns "undefined" for some names when name
  matching is ambiguous in workers_500k (multiple workers with same
  name in same city).
2026-04-20 15:09:14 -05:00
root
95c26f04f8 Path 1 negative signal + Path 2 pattern discovery + name validation
New:
- /vectors/playbook_memory/patterns: meta-index pattern discovery.
  Given a query, finds top-K similar playbooks, pulls each endorsed
  worker's full workers_500k profile, aggregates shared traits (cert
  frequencies, skill frequencies, modal archetype, reliability
  distribution), returns a human-readable discovered_pattern. Surfaces
  signals operators didn't explicitly query — the original PRD's
  "identify things we didn't know" dimension.
- /vectors/playbook_memory/mark_failed: records worker failures per
  (city, state, name). compute_boost_for applies 0.5^n penalty per
  recorded failure, so 3 failures quarter a worker's positive boost and
  5 effectively zero it. Path 1 negative signal — recruiter trust
  depends on the system NOT recommending people who no-showed.
- Bun /log_failure: validates failed_names against workers_500k
  (same ghost-guard as /log), forwards to /mark_failed.

Improved:
- /log now validates endorsed_names against workers_500k for the
  contract's city+state before seeding. Ghost names (names that don't
  correspond to real workers) are rejected in the response and excluded
  from the seed, preventing silent boost failures.
- Bun /search auto-appends `CAST(availability AS DOUBLE) > 0.5` to
  sql_filter when the caller didn't constrain availability. Opt out
  with `include_unavailable: true`. Recruiter trust bug: surfacing
  already-placed workers breaks the first call.
- DEFAULT_TOP_K_PLAYBOOKS 25 → 100. Direct cosine measurement showed
  similarities cluster 0.55-0.67 across all playbooks regardless of
  geo, so k=25 missed relevant geo-matched playbooks. Brute-force is
  still sub-ms at this size.

Verified end-to-end on live data:
- Ghost names rejected on /log + /log_failure
- Availability filter drops unavailable workers from candidate pool
- Pattern discovery on unseen Cleveland OH Welder query returned
  recurring skills (first aid 43%, grinder 43%, blueprint 43%) and
  modal archetype (specialist) across 20 semantically similar past
  playbooks in 0.24s
- Negative signal: Helen Sanchez boost dropped +0.250 → +0.163 after
  3 failures recorded via /log_failure (34% reduction)
2026-04-20 14:55:46 -05:00
root
20b0289aa9 /log validates endorsed names + /search auto-appends availability>0.5
Two gap-fills surfaced by the real test on 2026-04-20:

1. /log no longer seeds endorsed_names that don't exist in workers_500k
   for the contract's (city, state). Previously accepted ghost names
   silently (entry count grew, SQL row landed, but boost never fired
   because no real worker chunk matched the stored tuple). Response now
   reports rejected_ghost_names and explains why seeding was skipped.

2. Bun /search auto-appends `CAST(availability AS DOUBLE) > 0.5` to
   sql_filter when the caller didn't constrain availability themselves.
   Recruiters expect "available workers" by default — surfacing someone
   on an active placement would break trust on first contact.
   Opt out with `include_unavailable: true`.

Verified: ghost names rejected end-to-end, real names accepted, mixed
input handled correctly. Availability filter drops ~10 workers from a
305-row Cleveland OH Welder pool to 295 actually-available.
2026-04-20 14:44:12 -05:00
root
25b7e6c3a7 Phase 19 wiring + Path 1/2 work + chain integrity fixes
Backend:
- crates/vectord/src/playbook_memory.rs (new): Phase 19 in-memory boost
  store with seed/rebuild/snapshot, plus temporal decay (e^-age/30 per
  playbook), persist_to_sql endpoint backing successful_playbooks_live,
  and discover_patterns endpoint for meta-index pattern aggregation
  (recurring certs/skills/archetype/reliability across similar past fills).
- DEFAULT_TOP_K_PLAYBOOKS bumped 5 → 25; old default silently missed
  most boosts when memory had > 25 entries.
- service.rs: new routes /vectors/playbook_memory/{seed,rebuild,stats,
  persist_sql,patterns}.

Bun staffing co-pilot (mcp-server/):
- /search, /match, /verify, /proof, /simulation/run, MCP tools all
  forward use_playbook_memory:true and playbook_memory_k:25 to the
  hybrid endpoint. Boost was previously dark across the entire app.
- /log no longer POSTs to /ingest/file — that endpoint REPLACES the
  dataset's object list, so single-row CSV writes were wiping all prior
  rows in successful_playbooks (sp_rows went 33→1 in one /log call).
  /log now seeds playbook_memory with canonical short text and calls
  /persist_sql to keep successful_playbooks_live in sync.
- /simulation/run cumulative end-of-week CSV write removed for the same
  reason. Per-day per-contract /seed (added in this session) is the
  accumulating feedback path now.
- search.html addWorkerInsight renders a green "Endorsed · N playbooks"
  chip with playbook citations when boost > 0.

Internal Dioxus UI (crates/ui/):
- Dashboard phase list rewritten through Phase 19 (was stuck at "Phase
  16: File Watcher" / "Phase 17: DB Connector" — both wrong).
- Removed fabricated "27ms" stat label.
- Ask tab examples + SQL default replaced with real staffing prompts
  against candidates/clients/job_orders (was referencing nonexistent
  employees/products/events).
- New Playbook tab exposes /vectors/playbook_memory/{stats,rebuild} and
  side-by-side hybrid search (boost OFF vs ON) with citations.

Tests (tests/multi-agent/):
- run_e2e_rated.ts: parallel two-agent (mistral + qwen2.5) build phase
  + verifier rating (geo, auth, persist, boost, speed → /10).
- network_proving.ts: continuous build → verify → repeat with
  staffing-recruiter profile hot-swap; geo-discrimination check.
- chain_of_custody.ts: single recruiter operation traced through every
  layer (Bun /search, direct /vectors/hybrid parity, /log, SQL,
  playbook_memory growth, profile activation, post-op boost lift).
2026-04-20 06:21:13 -05:00
root
8e3cac5812 Polish: professional layout, collapsible sections, tighter design
- Replaced amateur CSS with professional dark theme (Inter font, muted palette,
  proper spacing, consistent border radius, hover states, transitions)
- Nav bar with Dashboard/Intelligence Console/Architecture tabs
- Urgent pipeline: shows contracts directly, removed busy step indicators
- In Progress + Ready to Go: collapsed by default with expand toggle
  (page went from 30+ visible contract cards to just the urgents)
- Workers Available: limited to 5 instead of 8
- Proper section headers with labels and metadata
- Search section always visible with better placeholder text
- Professional footer with product branding
- Responsive breakpoints for mobile (768px, 480px)
- Page is now ~50% shorter with same information density

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 20:29:45 -05:00
root
2da8562c90 Interactive permit heat map with live data verification
- Leaflet.js map with dark tiles showing real Chicago building permits
- Dots sized and colored by project cost ($1B+ red, $100M+ orange, $10M+ blue)
- Hover any dot for project details — address, cost, description, date
- LIVE indicator with green pulse dot
- Timestamp showing when data was fetched
- "Verify source" link goes directly to Chicago Open Data portal
- "Refresh" button re-fetches from the API on click
- Expanded to 50 permits for denser map coverage
- Legend showing dot size scale

No one can say "you just typed those numbers in" when they can
click a dot on the map, see 10000 W OHARE ST, and verify it
themselves on data.cityofchicago.org.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 20:24:43 -05:00
root
9acbe5c369 Market Intelligence: live Chicago building permits → staffing demand forecast
/intelligence/market pulls real permit data from Chicago Open Data API:
- $9.6B in active construction permits
- O'Hare expansion ($730M), new casino ($580M), transit station ($445M)
- Maps permit types to staffing roles (electrical→Electrician, masonry→Loader)
- Cross-references with our IL worker bench to show coverage gaps
- Electrician gap: only 1,036 reliable vs 63K estimated demand

Datalake page now shows three intelligence layers:
1. Contract simulation with scenario-driven matching
2. Market Intelligence with live permit data + bench analysis
3. System Learning with fill history and detected patterns

The staffing company sees demand forming before the phone rings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 20:12:01 -05:00