15 Commits

Author SHA1 Message Date
root
b95dd86556 Phase 24 — observer HTTP ingest + scenario outcome streaming
Closes the gap J flagged: observer wraps MCP:3700, scenarios hit
gateway:3100 directly, observer idle at 0 ops across 3600+ cycles.
Now scenarios POST per-event outcomes to the observer's new HTTP
ingest on :3800, the observer consumes them alongside MCP-wrapped ops,
and the ERROR_ANALYZER and PLAYBOOK_BUILDER loops see the full
picture.

observer.ts:
- Bun.serve() HTTP listener on OBSERVER_PORT (default 3800):
  GET /health    — basic + ring depth
  GET /stats     — total / success / failure / by_source / recent
                   scenario ops digest
  POST /event    — accept scenario outcome, shape it into ObservedOp
                   with source="scenario" + staffer_id + sig_hash +
                   event_kind + role/city/state + rescue flags
- recordExternalOp() — shared ring-buffer insert so the main analyzer
  + playbook builder don't care where the op came from
- ObservedOp extended with provenance fields
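
The shared insert path can be sketched like this (the ObservedOp fields, RING_MAX, and the stats shape here are illustrative stand-ins, not the real observer types):

```typescript
// Illustrative ObservedOp; the real observer carries more provenance fields.
interface ObservedOp { ts: number; source: "mcp" | "scenario"; ok: boolean; kind?: string }

const RING_MAX = 1024;
const ring: ObservedOp[] = [];

// Shared insert: the analyzer + playbook builder iterate `ring` and
// never care whether an op arrived via MCP wrapping or HTTP ingest.
function recordExternalOp(op: ObservedOp): number {
  ring.push(op);
  if (ring.length > RING_MAX) ring.shift(); // bounded: drop oldest
  return ring.length;
}

// Digest in the rough shape GET /stats returns.
function stats() {
  return {
    total: ring.length,
    successes: ring.filter((o) => o.ok).length,
    by_source: ring.reduce<Record<string, number>>((acc, o) => {
      acc[o.source] = (acc[o.source] ?? 0) + 1;
      return acc;
    }, {}),
  };
}
```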

persistOp() FIX — old path POSTed to /ingest/file?name=observed_operations
which REPLACES the dataset (flagged in feedback_ingest_replace_semantics.md).
Every op was silently wiping all prior ops. Replaced with append to
data/_observer/ops.jsonl so the historical trace is durable across
analyzer cycles and process restarts.
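
A minimal sketch of the append-only persistence, assuming a simplified op shape and an arbitrary file path:

```typescript
import { appendFileSync, mkdirSync, readFileSync } from "node:fs";
import { dirname } from "node:path";

// Hypothetical minimal record; the real ObservedOp has more fields.
interface OpRecord { ts: string; source: string; ok: boolean }

// Append one op per line. Unlike POST /ingest/file (replace semantics),
// this never destroys prior history.
function persistOp(path: string, op: OpRecord): void {
  mkdirSync(dirname(path), { recursive: true });
  appendFileSync(path, JSON.stringify(op) + "\n");
}

// Read the durable trace back for analyzer cycles.
function loadOps(path: string): OpRecord[] {
  return readFileSync(path, "utf8")
    .split("\n")
    .filter((l) => l.length > 0)
    .map((l) => JSON.parse(l));
}
```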

scenario.ts:
- OBSERVER_URL env (default http://localhost:3800)
- postObserverEvent() helper with 2s AbortSignal.timeout so observer
  being down doesn't block scenario flow
- Per-event POST after ctx.results.push(result), carrying staffer_id,
  sig_hash (via imported computeSignature), event_kind + role + city
  + state + count + rescue_attempted / rescue_succeeded + truncated
  output_summary
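
The timeout-guarded helper might look roughly like this (URL, payload shape, and the boolean return convention are assumptions, not the actual scenario.ts code):

```typescript
// Fire-and-forget outcome POST; the 2s timeout means a down observer
// never blocks scenario flow.
async function postObserverEvent(url: string, payload: unknown): Promise<boolean> {
  try {
    const res = await fetch(`${url}/event`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(payload),
      signal: AbortSignal.timeout(2000), // abort instead of hanging
    });
    return res.ok;
  } catch {
    return false; // observer down / timeout: swallow, keep the scenario moving
  }
}
```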

VERIFIED:
  curl POST /event → {"accepted":true,"ring_size":1}
  curl GET /stats → {"total":1,"successes":1,"by_source":{"scenario":1},
    "recent_scenario_ops":[{...staffer_id,kind,role}]}

Final v3 demo leaderboard (9 runs per staffer, cumulative 3 batches):
  James (local):   92.9% fill, 36.8 cites, score 0.775 — RANK 1
  Maria (full):    81.0% fill, 26.2 cites, score 0.727
  Sam (basic):     61.9% fill, 28.2 cites, score 0.640
  Alex (minimal):  59.5% fill, 32.2 cites, score 0.631
Honest finding: Alex has MORE citations than Sam despite NO T3 and NO
rescue. Playbook inheritance alone fires hardest when the overseer is
absent. The 59.5% fill rate (up from 0% when qwen2.5 was the executor)
proves cloud-exec + playbook inheritance is the floor the architecture
delivers.

Local gpt-oss:20b T3 outperforms cloud gpt-oss:120b T3 by 12pt fill
rate on this workload: the cloud overseer pays latency + variance for
no measurable gain. Worth flagging in the next models.json tune.
2026-04-20 23:49:30 -05:00
root
137aed64fb Coherence pass — PRD/PHASES updates, config snapshot wired, unit tests
J flagged the audit: "make sure everything flows coherently, no
pseudocode or unnecessary patches or ignoring any particular part of
what we built." This is that pass.

PRD.md updates:
- Phase 19 refinement block — geo-filter + role-prefilter WIRED with
  citation density numbers (0.32 → 1.38, and 2 → 28 on same scenario).
- Phase 20 rewrite — mistral dropped, qwen3.5 + qwen3 local hot path,
  think:false as the key mechanical finding, kimi-k2.6 upgrade path.
- Phase 21 status block — think plumbing + cloud executor routing
  added after original commit.
- Phase 22 item B (cloud rescue) — pivot sanitizer, rescue verified
  1/3 on stress_01.
- Phase 23 NEW — staffer identity + tool_level + competence-weighted
  retrieval + kb_staffer_report. Auto-discovered worker labels called
  out with real numbers (Rachel Lewis 12× across 4 staffers).
- Phase 24 NEW — Observer/Autotune integration gap DOCUMENTED, not
  fixed. Observer has been idle at 0 ops for 3600+ cycles because
  scenarios hit gateway:3100 directly, bypassing MCP:3700 which the
  observer wraps. This is the honest "we're not using it in these
  tests" signal J surfaced. Fix deferred; gap visible now.

PHASES.md:
- Appended Phases 20-23 as checked, Phase 24 as unchecked gap.
- Updated footer count: 102 unit tests across all layers.
- Latest line updated with 14× citation lift + 46.4pt tool-asymmetry
  finding.

scenario.ts:
- snapshotConfig() was defined but never called. Now fires at every
  scenario start with a stable sha256 hash over the active model set +
  tool_level + cloud flags. config_snapshots.jsonl finally populates,
  which the error_corrections diff path needs to work correctly.
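
A sketch of a stable config hash, assuming a flat key/value config; the real snapshot covers the active model set, tool_level, and cloud flags:

```typescript
import { createHash } from "node:crypto";

// Hypothetical flat config shape for illustration.
interface RunConfig { [key: string]: string | number | boolean }

// Key order must not change the hash, so serialize with sorted keys
// before hashing.
function configHash(cfg: RunConfig): string {
  const canonical = JSON.stringify(
    Object.keys(cfg).sort().map((k) => [k, cfg[k]])
  );
  return createHash("sha256").update(canonical).digest("hex");
}
```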

kb.test.ts (new): 4 signature invariant tests — stability across
unrelated fields (date, contract, staffer), sensitivity to role/city/
count changes, digest shape. All pass under `bun test`.

service.rs: 6 Rust extractor tests for extract_target_geo +
extract_target_role — basic, missing-state-returns-none, word
boundary (civilian != city), multi-word role, absent role, quoted
value parse. All pass under `cargo test -p vectord --lib extractor_tests`.

Dangling items now honestly documented rather than silently pending:
- Chunking cache (config/models.json SPEC, not wired) — flagged
- Playbook versioning (SPEC, not wired) — flagged
- Observer integration (WIRED but disconnected) — new Phase 24
2026-04-20 23:29:13 -05:00
root
ad0edbe29c Cloud kimi-k2.5 executor for weak tiers + multi-strategy playbook retrieval
Two coupled changes from the 2026 agent-memory research + tool
asymmetry findings.

SCENARIO (weak-tier cloud substitute):
qwen2.5 collapsed to 0/14 across the basic/minimal tool_levels.
Replace with cloud kimi-k2.5 on Ollama Cloud — same family as k2.6
(pro-tier locked today, on J's upgrade path). Plumb cloud flag
through ACTIVE_EXECUTOR_CLOUD / ACTIVE_REVIEWER_CLOUD into
generateContinuable so executor/reviewer can route to cloud when
tool_level requires. think:false supported by Kimi family.

Tool level mapping (revised):
  full     — qwen3.5 local + qwen3 local + cloud gpt-oss:120b T3 + rescue
  local    — qwen3.5 local + qwen3 local + local gpt-oss:20b T3 + rescue
  basic    — kimi-k2.5 cloud + qwen3 local + local T3, no rescue
  minimal  — kimi-k2.5 cloud + qwen3 local, no T3, no rescue.
             Playbook inheritance alone on the decision path.

This is the honest version of J's "minimal tools still works via
inheritance" hypothesis — with the executor no longer broken at the
tokenizer level, we can actually measure whether playbook retrieval
substitutes for missing overseers.

PLAYBOOK_MEMORY (multi-strategy retrieval):
Zep / Mem0 research shows multi-strategy rerank (semantic + keyword +
graph + temporal) outperforms single-strategy cosine. Lakehouse now
runs two-tier retrieval:

  1. Exact (role, city, state) match: skip cosine, assign similarity=1.0,
     take up to top_k/2+1 slots. These are identity-class neighbors —
     the strongest possible signal.
  2. Cosine fallback within the same (city, state) but different role:
     fills remaining slots.

Exposed as compute_boost_for_filtered_with_role(target_geo, target_role).
Backwards-compatible: compute_boost_for_filtered forwards with role=None
so existing callers keep their current behavior.
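
The two-tier logic can be sketched in TypeScript (the production store is Rust; the neighbor shape and slot math below follow the description above but are illustrative):

```typescript
// Illustrative neighbor shape; the real store lives in playbook_memory.rs.
interface Neighbor { role: string; city: string; state: string; vec: number[] }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Tier 1: exact (role, city, state) matches skip cosine, take sim 1.0,
// and claim up to top_k/2+1 slots. Tier 2: cosine fallback within the
// same (city, state) but a different role fills the rest.
function twoTierRetrieve(
  target: { role: string; city: string; state: string; vec: number[] },
  pool: Neighbor[],
  topK: number,
): Array<{ n: Neighbor; sim: number }> {
  const exact = pool
    .filter((n) => n.role === target.role && n.city === target.city && n.state === target.state)
    .slice(0, Math.floor(topK / 2) + 1)
    .map((n) => ({ n, sim: 1.0 }));
  const rest = pool
    .filter((n) => n.city === target.city && n.state === target.state && n.role !== target.role)
    .map((n) => ({ n, sim: cosine(target.vec, n.vec) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, topK - exact.length);
  return [...exact, ...rest];
}
```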

service.rs wires both: extract_target_geo and extract_target_role pull
from the executor's SQL filter. grab_eq_value is factored out of
extract_target_geo so both lookups share one parser. Diagnostic log
now prints target_role alongside target_geo for every hybrid_search:

  playbook_boost: boosts=88 sources=39 parsed=39 matched=5
    target_geo=Some(("Nashville", "TN")) target_role=Some("Welder")

Verified: Nashville Welder query returns 5/10 boosted workers in
top_k with clean role+geo provenance.

Research sources: atlan.com Agent Memory Frameworks 2026, Mem0 paper
(arxiv 2504.19413), Zep/Graphiti LongMemEval comparison, ossinsight
Agent Memory Race 2026.

kimi-k2.6 on current key returns 403 — pro-tier upgrade required.
kimi-k2.5 is the substitute today; swap to k2.6 by renaming one line
in applyToolLevel once the subscription lands.
2026-04-20 23:20:07 -05:00
root
5e89407939 Phase 23 refinement — per-staffer tool_level variance
Staffer.tool_level now controls which subsystems a specific run gets:

  full     — qwen3.5 + qwen3 + cloud T3 + cloud rescue
  local    — qwen3.5 + qwen3 + local gpt-oss:20b T3 + rescue
  basic    — qwen2.5 + qwen2.5 + local T3, no rescue
  minimal  — qwen2.5 + qwen2.5, NO T3, NO rescue. Playbook
             inheritance only.

applyToolLevel() mutates module-scoped ACTIVE_* slots each run from the
env defaults, so prior staffer's overrides never leak. Hot-path code
reads ACTIVE_EXECUTOR / ACTIVE_REVIEWER / ACTIVE_T3_DISABLED /
ACTIVE_OVERVIEW_CLOUD / ACTIVE_RETRY_ON_FAIL instead of the baked
constants.

The architectural question this answers: does playbook_memory
inheritance carry enough knowledge to let a weakly-tooled coordinator
still produce usable outcomes? "Minimal" Alex runs qwen2.5 exec + no
reviewer overseer + no cloud rescue. If Alex still fills events at a
reasonable rate, the playbook system is the real knowledge carrier —
the senior stack is nice-to-have, not the sine qua non.

Demo personas mapped:
  Maria (senior, 48mo, full)
  James (mid, 14mo, local)
  Sam (junior, 4mo, basic)
  Alex (trainee, 1mo, minimal)

Same 3 contracts (Nashville downtown, Joliet warehouse, Indianapolis
assembly) across all four → 12 runs. KB + kb_staffer_report.py
leaderboard already wired; competence_score will now reflect real tool
asymmetry instead of LLM sampling variance.
2026-04-20 22:50:05 -05:00
root
6b71c8e9b2 Phase 23 — contract terms + staffer identity + competence-weighted retrieval
Matrix-index the "who handled this" dimension so top staffers become
the training signal and juniors inherit their playbooks automatically
via the boost pipeline. Auto-discovered indicators emerge from
comparing trajectories across staffers on similar contracts — that was
always the architectural point; this wires the last piece.

ContractTerms:
- deadline, budget_total_usd, budget_per_hour_max, local_bonus_per_hour,
  local_bonus_radius_mi, fill_requirement ("paramount" | "preferred")
- Attached to ScenarioSpec, propagated into T3 checkpoint + cloud
  rescue prompts so cloud reasons about trade-offs (pivot within bonus
  radius first; respect per-hour cap; split across cities when
  fill_requirement=paramount).

Staffer:
- {id, name, tenure_months, role: senior|mid|junior|trainee}
- On ScenarioSpec; logged at scenario start; attached to KB outcome
- Recomputed StafferStats written to data/_kb/staffers.jsonl after
  every run: total_runs, fill_rate, avg_turns, avg_citations,
  rescue_rate, competence_score.
- Competence formula: 0.45*fill_rate + 0.20*turn_efficiency +
  0.20*citation_density + 0.15*rescue_rate. Normalized to 0..1.
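
The competence formula as a function, assuming each component is already normalized to 0..1 before weighting:

```typescript
interface StafferStats {
  fill_rate: number;        // 0..1
  turn_efficiency: number;  // 0..1, fewer turns = higher
  citation_density: number; // 0..1 after normalization
  rescue_rate: number;      // 0..1
}

// Weighted blend from the commit: 0.45/0.20/0.20/0.15, clamped to 0..1.
function competenceScore(s: StafferStats): number {
  const raw = 0.45 * s.fill_rate + 0.20 * s.turn_efficiency
            + 0.20 * s.citation_density + 0.15 * s.rescue_rate;
  return Math.min(1, Math.max(0, raw));
}
```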

findNeighbors now returns weighted_score = cosine × best_staffer_competence
(floored at 0.3 so high-similarity low-competence neighbors still
surface). pathway_recommender prompt shows the top staffer's identity
so cloud knows WHOSE playbook it's synthesizing from.

Demo infrastructure:
- tests/multi-agent/gen_staffer_demo.ts: 4 personas (Maria senior,
  James mid, Sam junior, Alex trainee) × 3 contracts (Nashville Welder,
  Joliet Warehouse, Indianapolis Assembly). 12 scenarios total.
- scripts/run_staffer_demo.sh: runs the 12 sequentially with
  LH_OVERVIEW_CLOUD=1. Post-run calls kb_staffer_report.py.
- scripts/kb_staffer_report.py: leaderboard + cross-staffer worker
  overlap (names endorsed by ≥2 staffers → auto-discovered high-value
  workers). Top vs bottom differential.

gen_scenarios.ts (Phase 22 generator) also now emits contract terms
on 70% of generated specs — future KB batches populate with realistic
constraint patterns instead of bare role+city+count.

Stress scenario from item A intentionally NOT the production test.
Real staffing has constraints; Nashville contract + staffer demo is
the honest test of whether the architecture produces measurable
differential between coordinator skill levels.

Demo batch launched — 12 runs × ~3min each ≈ 40min unattended. Report
emitted after batch.
2026-04-20 22:16:09 -05:00
root
a7fc8e2256 Item B — cloud-rescue retry on event failure
When a scenario event fails (drift abort or other error) and
LH_RETRY_ON_FAIL is on (default when cloud T3 is enabled), ask cloud
for a concrete pivot — new city, role, or count — then re-run the
event with the remediation's fields. Capped at 1 retry per event so a
genuinely-impossible scenario can't burn budget.

requestCloudRemediation(event, result):
- Feeds the same diagnostic bundle T3 checkpoints get (SQL filters,
  row counts, SQL errors, reviewer drift reasons, gap signals).
- Prompt demands structured JSON: {retry, new_city, new_role,
  new_count, rationale}.
- Cloud is instructed to pivot to NEAREST alternate city when
  zero-supply detected, broaden role when uniquely scarce, reduce
  count when clearly unachievable, or return retry=false when no
  pivot seems viable.

EventResult additions:
- retry_attempt, retry_remediation (with rationale + cloud_model +
  duration), retry_result (full inner result shape), original_event.
- If retry succeeded, it becomes the primary result and original_event
  preserves what was attempted first. If retry also failed, the
  primary stays the failure and retry is recorded alongside.

Sanitizer on cloud output: model sometimes emits "Hammond, IN" in
new_city with "IN" in a non-existent new_state field, producing
"Hammond, IN, IN" downstream. Split new_city on comma, take first
token as city, extract state if present after the comma. Original
event's state is the fallback.
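
A sketch of the sanitizer, assuming a two-letter state code after the comma (function name and return shape are illustrative):

```typescript
// "Hammond, IN" arriving in new_city would otherwise render as
// "Hammond, IN, IN" once the event's state is appended downstream.
function sanitizeCity(newCity: string, fallbackState: string): { city: string; state: string } {
  const parts = newCity.split(",").map((p) => p.trim());
  const city = parts[0];
  // If a two-letter state code follows the comma, use it; otherwise
  // keep the original event's state.
  const state = parts[1] && /^[A-Za-z]{2}$/.test(parts[1])
    ? parts[1].toUpperCase()
    : fallbackState;
  return { city, state };
}
```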

VERIFIED on stress_01.json with LH_OVERVIEW_CLOUD=1:
  Without rescue (item A baseline):  1/5 events ok
  With rescue (item B):              3/5 events ok
Gary IN misplacement: drift → cloud proposed South Bend IN → retry
filled 1/1. Rationale stored in retry_remediation for forensics.

Known limits surfaced (future work):
- City-field mangling failed one rescue before the sanitizer landed;
  next run will use the fix.
- Cloud picks alternate cities without knowing ground-truth supply.
  Flint → Saginaw pivoted but Saginaw also had sparse Welders.
  Future: expose a /vectors/supply-estimate endpoint cloud can consult
  before proposing a pivot.
2026-04-20 22:01:45 -05:00
root
c21b261877 Item A — stress scenario + enriched T3 diagnostic prompt
Proves cloud passthrough works end-to-end AND fixes the diagnostic
quality problem the first run surfaced.

STRESS SCENARIO (tests/multi-agent/scenarios/stress_01.json):
Five genuinely hard events with varied failure modes:
- Gary, IN 5× Electrician: ZERO supply (city not in workers_500k)
- Peoria, IL 8× Safety Coordinator: scarce role, initial pool only 5
- Flint, MI 3× Welder: ZERO supply
- Grand Rapids, MI 4× Tool & Die Maker: scarce but solvable
- Gary, IN 1× Electrician misplacement: repeats event 1's impossibility

FIRST RUN (stress v1) — cloud passthrough works, diagnosis vague:
  T3 checkpoint: "Potential drift flags for upcoming role"
  Lesson: "Before dispatching, query pool status. Update turn counter..."
Generic tactical advice that doesn't address the real problem.
Root cause: T3 prompt only saw outcome summary, not the raw
SQL/pool/drift signals the executor had in its log.

DIAGNOSTIC FIX:
- Added LogEntry[] `sharedLog` parameter to runAgentFill so the caller
  retains the trace even when runAgentFill throws drift-abort.
- EventResult gained `diagnostic_log` field populated on both OK and
  FAIL paths.
- extractDiagnostics() pulls SQL filters, hybrid_search row counts,
  SQL errors, and reviewer drift notes from the log.
- Checkpoint prompt now includes FAILURE FORENSICS block for failed
  events: SQL filters attempted, row counts, errors, drift reasons,
  and an explicit teaching note about zero-supply detection.
- Cross-day lesson prompt flags each event with [ZERO-SUPPLY: pivot
  city needed] tag when drift reasons mention "no match"/"no
  candidates"/"0 rows". PRIORITY clause in the prompt tells the model
  its lesson MUST name alternate cities when that tag appears.

SECOND RUN (stress v2 with enriched prompt) — cloud diagnosis sharp:
  T3 after Flint: risk="Zero candidate supply for Welder in Flint"
                  hint="search Welder×3 in Saginaw, MI (≈30 mi) or
                        expand role to Metal Fabricator"
  T3 after Gary:  risk="Zero supply for Electrician in Gary, IN"
                  hint="Pivot to Chicago, IL (≈40 min); broaden to
                        Electrical Technician within 60 min radius"
  Lesson: specific, per-city, with distances, role-broadening
  fallback, and pre-loading strategy — actionable for item B retry.

Cloud 120b call latencies consistent: 4.8-8.0s per prompt. Cloud
passthrough proven under stress.

Fill outcomes unchanged (1/5: three impossible events correctly
rejected, plus one JSON-emission edge case in the retry pivot
reasoning). The knowledge to rescue them now exists in the lesson;
item B wires the retry.
2026-04-20 21:54:29 -05:00
root
9c1400d738 Phase 22 — Internal Knowledge Library (KB)
Meta-layer over Phase 19 playbook_memory. Phase 19 answers "which
WORKERS worked for this event"; KB answers "which CONFIG worked for
this playbook signature" — model choice, budget hints, pathway notes,
error corrections.

tests/multi-agent/kb.ts:
- computeSignature(): stable sha256 hash of the (kind, role, count,
  city, state) tuple sequence. Same scenario shape → same sig.
- indexRun(): extracts sig, embeds spec digest via sidecar, appends
  outcome record, upserts signature to data/_kb/signatures.jsonl.
- findNeighbors(): cosine-ranks the k most-similar signatures from
  prior runs for a target spec.
- detectErrorCorrections(): scans outcomes for same-sig fail→succeed
  pairs, diffs the model set, logs to error_corrections.jsonl.
- recommendFor(): feeds target digest + k-NN neighbors + recent
  corrections to the overview model, gets back a structured JSON
  recommendation (top_models, budget_hints, pathway_notes), appends
  to pathway_recommendations.jsonl. JSON-shape constrained so the
  executor can inherit it mechanically.
- loadRecommendation(): at scenario start, pulls newest rec matching
  this sig (or nearest).
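
The signature computation above can be sketched as follows, assuming an event array where only the five tuple fields participate (the extra fields stand in for date/contract/staffer):

```typescript
import { createHash } from "node:crypto";

// Illustrative event shape; only (kind, role, count, city, state)
// enter the signature, so unrelated fields cannot change it.
interface SpecEvent {
  kind: string; role: string; count: number; city: string; state: string;
  [extra: string]: unknown;
}

function computeSignature(events: SpecEvent[]): string {
  const tuples = events.map((e) => [e.kind, e.role, e.count, e.city, e.state]);
  return createHash("sha256").update(JSON.stringify(tuples)).digest("hex");
}
```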

scenario.ts:
- Reads KB recommendation at startup (alongside prior lessons).
- Injects pathway_notes into guidanceFor() executor context.
- After retrospective, indexes the run + synthesizes next rec.

Cold-start behavior: first run with no history writes a low-confidence
"no prior data" rec so the signal that something was attempted is
captured. Second run gets "low confidence, 0 neighbors" until a third
distinct sig gives the embedder something to compare against — hence
the upcoming scenario generator.

VERIFIED:
- data/_kb/ populated after one scenario run: 1 outcome (sig=4674…,
  4/5 ok, 16 turns total), 1 signature, 2 recs (cold + post-run).
- Recommendation JSON-parsed cleanly from gpt-oss:20b overview model.

PRD Phase 22 added with file layout, cycle description, and the
rationale for file-based MVP → Rust port progression that matches
how Phase 21 primitives shipped.

What's NOT here yet (batched follow-ups per J's request, tested
between each):
- Lift the k=10 hybrid_search cap to adaptive k=max(count*5, 20)
- Scenario generator to bulk-populate KB with varied signatures
- Rust re-weighting: push playbook_memory success signal INTO
  hybrid_search scoring, not just post-hoc boost
2026-04-20 20:27:12 -05:00
root
0c4868c191 qwen3.5 executor + continuation primitive + think:false
Three coupled fixes that together turned the Riverfront Steel scenario
from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing
concerns rather than linter advice.

MODEL SWAP
- Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking).
  mistral's decoder emitted malformed JSON on complex SQL filters
  regardless of prompt; J called it — stop using mistral.
- Reviewer: qwen2.5 → qwen3:latest (40K ctx)
- Applied to scenario.ts, orchestrator.ts, network_proving.ts,
  run_e2e_rated.ts

CONTINUATION PRIMITIVE (agent.ts)
- generateContinuable(): empty-response → geometric backoff retry;
  truncated-JSON → continue from partial as scratchpad; bounded by
  budget cap + max_continuations. No more "bump max_tokens until it
  stops truncating" tourniquet.
- generateTreeSplit(): map-reduce for oversized input corpora with
  running scratchpad digest, reduce pass for final synthesis.
- Empty text no longer throws — it's a signal to continuable that
  thinking ate the budget.
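
A reduced sketch of the primitive with the generator injected, so the retry and continuation bounds are visible (the real code routes through the sidecar and tracks a token budget cap):

```typescript
type GenFn = (prompt: string) => Promise<string>;

function isCompleteJson(s: string): boolean {
  try { JSON.parse(s); return true; } catch { return false; }
}

// Empty response -> geometric-backoff retry; truncated JSON -> feed the
// partial back as a scratchpad and continue; both loops bounded so a
// stuck model can't burn the budget.
async function generateContinuable(
  gen: GenFn, prompt: string, maxRetries = 3, maxContinuations = 3,
): Promise<string> {
  let out = await gen(prompt);
  for (let i = 0; out.trim() === "" && i < maxRetries; i++) {
    await new Promise((r) => setTimeout(r, 2 ** i * 10)); // 10ms, 20ms, 40ms...
    out = await gen(prompt);
  }
  for (let i = 0; !isCompleteJson(out) && i < maxContinuations; i++) {
    out += await gen(`${prompt}\nPARTIAL SO FAR:\n${out}\nContinue exactly where it stopped.`);
  }
  return out;
}
```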

think:false FOR HOT PATH
- qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON
  emission. For executor/reviewer/draft: think:false. For T3/T4/T5
  overseers: thinking stays on (that's the point).
- Sidecar generate endpoint accepts `think` bool, passes through to
  Ollama's /api/generate.

VERIFIED OUTCOMES
Riverfront Steel 2026-04-21, qwen3.5+continuable+think:false:
  08:00 baseline_fill  3/3  4 turns
  10:30 recurring      2/2  3 turns (1 playbook citation)
  12:15 expansion      0/5  drift-aborted (5-fill orchestration
                            problem, separate work)
  14:00 emergency      4/4  3 turns (1 citation)
  15:45 misplacement   1/1  3 turns
  → T3 caught Patrick Ross double-booking across events
  → T3 flagged forklift cert drift on the event that failed
  → Cross-day lesson proposed "maintain buffer of ≥3 emergency
    candidates, pre-fetch certs for expansion, booking system
    cross-check" — real staffing advice, not generic linter output

PRD PHASE 21 rewritten to reflect the actual primitive shape (two-
call map-reduce with scratchpad glue) instead of the tourniquet
approach originally documented. Rust port queued for next sprint.

scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits
tests/multi-agent/playbooks/ab_scorecard.json.
2026-04-20 20:19:02 -05:00
root
6e7ca1830e Phase 21 foundation — context stability + chunking pipeline
PRD: add Phase 20 (model matrix, wired) and Phase 21 (context stability,
partial). Phase 21 exists because LLM Team hit this exact wall — running
multi-model ranking on large context silently truncated, rankings
degraded, no pipeline caught it. The stable answer: every agent call
goes through a budget check against the model's declared context_window
minus safety_margin, with a declared overflow_policy when the check
fails.

config/models.json:
- context_window + context_budget per tier
- overflow_policies block: summarize_oldest_tool_results_via_t3,
  chunk_lessons_via_cosine_topk, two_pass_map_reduce,
  escalate_to_kimi_k2_1t_or_split_decision
- chunking_cache spec (data/_chunk_cache/, corpus-hash keyed)

agent.ts:
- estimateTokens() chars/4 biased safe ~15%
- CONTEXT_WINDOWS table (fallback; prod reads models.json)
- assertContextBudget() — throws on overflow with exact numbers, can
  bypass with bypass_budget:true for callers with their own policy
- Wired into generate() and generateCloud() so EVERY call is checked
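
A sketch of the budget check, with an illustrative window table (prod reads models.json; the margin default is an assumption):

```typescript
// Fallback table for illustration; prod reads config/models.json.
const CONTEXT_WINDOWS: Record<string, number> = {
  "qwen3.5:latest": 262_000,
  "qwen3:latest": 40_000,
};

// chars/4 heuristic, biased ~15% toward overestimating so the check fails safe.
function estimateTokens(text: string): number {
  return Math.ceil((text.length / 4) * 1.15);
}

function assertContextBudget(model: string, prompt: string, safetyMargin = 2_000): void {
  const window = CONTEXT_WINDOWS[model] ?? 8_000; // conservative fallback
  const est = estimateTokens(prompt);
  const budget = window - safetyMargin;
  if (est > budget) {
    // Exact numbers in the error so an overflow_policy caller can act.
    throw new Error(`context overflow: ${model} est=${est} budget=${budget} over=${est - budget}`);
  }
}
```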

scenario.ts:
- T3 lesson archive to data/_playbook_lessons/*.json (the old
  /vectors/playbook_memory/seed path was silently failing with HTTP 400
  because it requires 'fill: Role xN in City, ST' operation shape)
- loadPriorLessons() at scenario start — filters by city/state match,
  date-sorted, takes top-3
- prior_lessons.json archived per-run (honest signal for A/B)
- guidanceFor() injects up to 2 prior lessons (≤500 chars each) into
  the executor's per-event context
- Retrospective shows explicit "Prior lessons loaded: N" line

Verified: mistral correctly rejects a 150K-char prompt (7532 tokens
over), gpt-oss:120b accepts it with 90K headroom. The enforcement is
in-band on every call now, not an afterthought.

Full chunking service (Rust) remains deferred to the sprint this feeds:
crates/aibridge/src/budget.rs + chunk.rs + storaged/chunk_cache.rs
2026-04-20 19:34:44 -05:00
root
03d723e7e6 Model matrix — 5 tiers, local hard workers + cloud overseers
config/models.json is the authoritative catalog. Hot path (T1/T2) stays
local; cloud is consulted only for overview (T3), strategic (T4), and
gatekeeper (T5) calls. J named qwen3.5 + newer models (minimax-m2.7,
glm-5, qwen3-next) specifically — all mapped with real reachable IDs
verified against ollama.com/api/tags.

Tier shape:
- t1_hot     mistral + qwen2.5 local       — 50-200 calls/scenario
- t2_review  qwen2.5 + qwen3 local         — 5-14 calls/event
- t3_overview gpt-oss:120b cloud           — 1-3 calls/scenario
- t4_strategic qwen3.5:397b + glm-4.7      — 1-10 calls/day
- t5_gatekeeper kimi-k2-thinking           — 1-5 calls/day, audit-logged

Rate budgets are declared in-config — Ollama Cloud paid tier is generous
but we cap overview/strategic/gatekeeper so no single rogue scenario can
blow the day's quota.

Experimental rotation list wired but disabled by default. When enabled,
T4 randomly routes 10% of calls to a rotating minimax/GLM/qwen-next/
deepseek/nemotron/cogito/mistral-large candidate, logs comparisons, and
auto-promotes after 3 rotations of wins.

Playbook versioning SPEC embedded under `playbook_versioning` key: every
seed gets version + parent_id + retired_at + architecture_snapshot, so
when a schema migration breaks a playbook we can pinpoint which change
retired it. Implementation flagged for next sprint (touches gateway +
catalogd + mcp-server) — not wired here.

- scenario.ts now loads config/models.json at init, env vars still override
- mcp-server exposes /models/matrix read-only so UI can render it
2026-04-20 19:24:41 -05:00
root
e4ae5b646e T3 overview tier — mid-day checkpoints + cross-day lesson
Hot path (T1/T2) stays mistral + qwen2.5. The new T3 tier runs a
thinking model SPARINGLY — after every misplacement, every N-th event
(default N=3), and once post-scenario for the cross-day lesson.

- agent.ts: generateCloud() for Ollama Cloud (gpt-oss:120b etc). Uses
  the same /api/generate shape; thinking field is discarded.
- scenario.ts: runOverviewCheckpoint + runCrossDayLesson. Outputs land
  in checkpoints.jsonl and lesson.md. Lesson also seeds playbook_memory
  under operation "cross-day-lesson-{date}" — future runs pick it up
  through the existing similarity boost.
- Env knobs: LH_OVERVIEW_CLOUD=1 routes T3 to cloud, LH_OVERVIEW_MODEL
  overrides (default gpt-oss:20b local, gpt-oss:120b cloud),
  LH_T3_CHECKPOINT_EVERY controls cadence, LH_T3_DISABLE=1 turns it off.

Why this shape: prior feedback_phase19_seed_text.md warned that verbose
seeds dilute the embedding and silently kill the boost. T3's rich prose
goes to lesson.md; the embedded "approach" + "context" stay terse.

Verified end-to-end: local 20b checkpoint 10.9s, lesson 4.0s; cloud
120b lesson 3.7s. Cloud output is both faster AND more specific than
local (sequenced, tactical, logging advice included).
2026-04-20 19:21:45 -05:00
root
f8e8d25b5f Unblock complex scenarios: JSON tolerance + optional question + mistral exec
parseAction now strips stray `)` before `}` and trailing commas —
qwen2.5 emits those regularly on tool_call outputs; soft-fix beats
retry-loops. hybrid_search no longer hard-requires `question`; defaults
to "qualified available workers" when the model drops it (mistral's
most common failure mode on complex events).
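
The two soft-fixes can be sketched as one tolerant parser (the regexes illustrate the approach, not the exact parseAction code):

```typescript
// Soft-fix the two malformations qwen2.5 emits regularly: a stray ")"
// immediately before "}", and trailing commas before "}" or "]".
function tolerantParse(raw: string): unknown {
  const fixed = raw
    .replace(/\)\s*\}/g, "}")        // `)}` -> `}`
    .replace(/,\s*([}\]])/g, "$1");  // `,}` / `,]` -> `}` / `]`
  return JSON.parse(fixed);
}
```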

Kept original TOOL_CATALOG shape (args examples only, not full
action envelopes). The verbose few-shot version from the prior
iteration confused mistral into wrapping propose_done as tool_call.

Scenario V7 result: expansion (5 Forklift Ops) and emergency
(4 Loaders) — previously-failing complex events — now seal reliably.
Pool sizes: 687 and 380 from 500K corpus. Patterns endpoint produces
real operator-actionable signals:
  expansion: "recurring certifications: Forklift (40%), OSHA-10 (40%)
             · recurring skills: mill (40%) · archetype mostly: leader
             · reliability median 0.83"
Baseline + recurring are now flaky (inverted trade-off, pure
model-reliability variance).
2026-04-20 15:28:30 -05:00
root
1274ab2cb3 Scenario harness: Path 1+2 integration + schema hardening
Upgrades to tests/multi-agent/scenario.ts to exercise the full Path 1+2
feature set on a real warehouse-client week (5 events on one client):

- Hard SCHEMA ENFORCEMENT block in every event's guidance. Prior runs
  had mistral read narrative words ("shift", "recurring", "expansion")
  as SQL column names. Schema is now locked explicitly with valid
  columns listed and CAST guidance for availability + reliability.
- playbook_memory_k bumped 10 → 100 to match server default.
- Canonical short seed text (operation + "{kind} fill via hybrid
  search" + "{role} fill in {city}, {state}"). Verbose LLM rationales
  dilute embeddings and silently kill boost (Pass 1 finding).
- /vectors/playbook_memory/mark_failed fires automatically on
  misplacement events — records the no-show's failure so future
  searches for same city+role dampen their boost.
- /vectors/playbook_memory/patterns call per event — surfaces what the
  meta-index discovered (recurring certs/skills/archetype/reliability)
  for that query into the dispatch log and retrospective.
- Retrospective now includes a workers-touched audit table (every
  worker who reached a decision, with outcome column) and a
  discovered-patterns-evolution section across events.

Honest limitations this surfaced in the real run:
- mistral's executor prompt-adherence degrades on high-count events
  (5+ fills) and scenario-specific language (emergency/misplacement).
  3 of 5 events aborted via drift guard. Baseline + recurring sealed
  cleanly with real fills + SMS + emails + seeded playbooks.
- worker_id resolution returns "undefined" for some names when name
  matching is ambiguous in workers_500k (multiple workers with the
  same name in the same city).
2026-04-20 15:09:14 -05:00
root
25b7e6c3a7 Phase 19 wiring + Path 1/2 work + chain integrity fixes
Backend:
- crates/vectord/src/playbook_memory.rs (new): Phase 19 in-memory boost
  store with seed/rebuild/snapshot, plus temporal decay (e^-age/30 per
  playbook), persist_to_sql endpoint backing successful_playbooks_live,
  and discover_patterns endpoint for meta-index pattern aggregation
  (recurring certs/skills/archetype/reliability across similar past fills).
- DEFAULT_TOP_K_PLAYBOOKS bumped 5 → 25; old default silently missed
  most boosts when memory had > 25 entries.
- service.rs: new routes /vectors/playbook_memory/{seed,rebuild,stats,
  persist_sql,patterns}.
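
The decay weighting in sketch form (the real store is Rust; this just shows the curve and how it scales a boost):

```typescript
// e^(-age/30): a playbook's weight halves roughly every 21 days and
// is down to ~37% at 30 days.
function decayWeight(ageDays: number): number {
  return Math.exp(-ageDays / 30);
}

// A stored boost contributes proportionally to its decayed weight.
function boostContribution(baseBoost: number, ageDays: number): number {
  return baseBoost * decayWeight(ageDays);
}
```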

Bun staffing co-pilot (mcp-server/):
- /search, /match, /verify, /proof, /simulation/run, MCP tools all
  forward use_playbook_memory:true and playbook_memory_k:25 to the
  hybrid endpoint. Boost was previously dark across the entire app.
- /log no longer POSTs to /ingest/file — that endpoint REPLACES the
  dataset's object list, so single-row CSV writes were wiping all prior
  rows in successful_playbooks (sp_rows went 33→1 in one /log call).
  /log now seeds playbook_memory with canonical short text and calls
  /persist_sql to keep successful_playbooks_live in sync.
- /simulation/run cumulative end-of-week CSV write removed for the same
  reason. Per-day per-contract /seed (added in this session) is the
  accumulating feedback path now.
- search.html addWorkerInsight renders a green "Endorsed · N playbooks"
  chip with playbook citations when boost > 0.

Internal Dioxus UI (crates/ui/):
- Dashboard phase list rewritten through Phase 19 (was stuck at "Phase
  16: File Watcher" / "Phase 17: DB Connector" — both wrong).
- Removed fabricated "27ms" stat label.
- Ask tab examples + SQL default replaced with real staffing prompts
  against candidates/clients/job_orders (was referencing nonexistent
  employees/products/events).
- New Playbook tab exposes /vectors/playbook_memory/{stats,rebuild} and
  side-by-side hybrid search (boost OFF vs ON) with citations.

Tests (tests/multi-agent/):
- run_e2e_rated.ts: parallel two-agent (mistral + qwen2.5) build phase
  + verifier rating (geo, auth, persist, boost, speed → /10).
- network_proving.ts: continuous build → verify → repeat with
  staffing-recruiter profile hot-swap; geo-discrimination check.
- chain_of_custody.ts: single recruiter operation traced through every
  layer (Bun /search, direct /vectors/hybrid parity, /log, SQL,
  playbook_memory growth, profile activation, post-op boost lift).
2026-04-20 06:21:13 -05:00