lakehouse

Author	SHA1	Message	Date
root	c21b261877	Item A — stress scenario + enriched T3 diagnostic prompt Proves cloud passthrough works end-to-end AND fixes the diagnostic quality problem that first run surfaced. STRESS SCENARIO (tests/multi-agent/scenarios/stress_01.json): Five genuinely hard events with varied failure modes: - Gary, IN 5× Electrician: ZERO supply (city not in workers_500k) - Peoria, IL 8× Safety Coordinator: scarce role, initial pool only 5 - Flint, MI 3× Welder: ZERO supply - Grand Rapids, MI 4× Tool & Die Maker: scarce but solvable - Gary, IN 1× Electrician misplacement: repeats event 1's impossibility FIRST RUN (stress v1) — cloud passthrough works, diagnosis vague: T3 checkpoint: "Potential drift flags for upcoming role" Lesson: "Before dispatching, query pool status. Update turn counter..." Generic tactical advice that doesn't address the real problem. Root cause: T3 prompt only saw outcome summary, not the raw SQL/pool/drift signals the executor had in its log. DIAGNOSTIC FIX: - Added LogEntry[] `sharedLog` parameter to runAgentFill so the caller retains the trace even when runAgentFill throws drift-abort. - EventResult gained `diagnostic_log` field populated on both OK and FAIL paths. - extractDiagnostics() pulls SQL filters, hybrid_search row counts, SQL errors, and reviewer drift notes from the log. - Checkpoint prompt now includes FAILURE FORENSICS block for failed events: SQL filters attempted, row counts, errors, drift reasons, and an explicit teaching note about zero-supply detection. - Cross-day lesson prompt flags each event with [ZERO-SUPPLY: pivot city needed] tag when drift reasons mention "no match"/"no candidates"/"0 rows". PRIORITY clause in the prompt tells the model its lesson MUST name alternate cities when that tag appears. 
SECOND RUN (stress v2 with enriched prompt) — cloud diagnosis sharp: T3 after Flint: risk="Zero candidate supply for Welder in Flint" hint="search Welder×3 in Saginaw, MI (≈30 mi) or expand role to Metal Fabricator" T3 after Gary: risk="Zero supply for Electrician in Gary, IN" hint="Pivot to Chicago, IL (≈40 min); broaden to Electrical Technician within 60 min radius" Lesson: specific, per-city, with distances, role-broadening fallback, and pre-loading strategy — actionable for item B retry. Cloud 120b call latencies consistent: 4.8-8.0s per prompt. Cloud passthrough proven under stress. Fill outcomes unchanged (1/5 — correct rejection of three impossible events + one propagating JSON emission edge case on retry pivot reasoning). The knowledge to rescue them now exists in the lesson; item B wires the retry.	2026-04-20 21:54:29 -05:00
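The zero-supply tagging described in this commit can be sketched as follows. This is an illustration under assumptions, not the code in scenario.ts: `EventSummary` and `tagZeroSupply` are hypothetical names, and only the three trigger phrases and the tag text come from the commit message.

```typescript
// Hypothetical shape: only `drift_reasons` matters for the tagging rule.
interface EventSummary {
  label: string;
  drift_reasons: string[];
}

// Tag text quoted in the commit message.
const ZERO_SUPPLY_TAG = "[ZERO-SUPPLY: pivot city needed]";

// Trigger phrases from the commit message. The leading \b keeps "0 rows"
// from firing on "10 rows"; there is no trailing boundary, so "no matches"
// still counts as a hit.
const ZERO_SUPPLY_PATTERN = /\b(no match|no candidates|0 rows)/i;

// Returns the event label, with the tag appended when any drift reason
// mentions one of the trigger phrases.
function tagZeroSupply(ev: EventSummary): string {
  const hit = ev.drift_reasons.some((reason) => ZERO_SUPPLY_PATTERN.test(reason));
  return hit ? `${ev.label} ${ZERO_SUPPLY_TAG}` : ev.label;
}
```

A regex with a leading word boundary is used instead of a plain substring check because a drift note such as "pool had 10 rows" would otherwise be misread as zero supply.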
root	330cb90f99	Lift k cap, drop ornamental `reason` field, scenario generator ITEM 1 — k CAP + REASON FIELD The hybrid_search default k was hard-coded to 10. For multi-fill events (5× expansion, 4× emergency) that's pool=10 → propose 5-of-10, half the candidates become the answer with no room for rejection. Executor prompt now instructs k to scale with target_count: k = max(count*5, 20), cap 80. Default helper bumped 10 → 20. Fill.reason dropped from required to optional. Nothing downstream ever consumed it — resolveWorkerIds, sealSale, retrospective all use candidate_id and name. Models loved to write 100-150 char justifications per fill; on 4+ fills that blew the JSON budget before the structure closed. Test 1 run result after this change: FIRST EVER 5/5 on the Riverfront Steel scenario, 13 total turns across 5 events. The event that failed last run (emergency 4×Loader with truncated reason-field continuation) now clears in 2 turns. Progression: mistral baseline: 0/5 qwen3.5 + continuation + think:false: 4/5 qwen3.5 + k=20 + no-reason: 5/5 ✓ ITEM 2 — SCENARIO GENERATOR (NOT YET TESTED E2E) tests/multi-agent/gen_scenarios.ts emits N deterministic ScenarioSpecs with varied clients (15 companies), cities (20 Midwest cities known to exist in workers_500k), role mixes (14 industrial staffing roles, weighted realistic), and event sequences. Each gets a unique sig_hash so the KB populates with distinct neighbor signatures. scripts/run_kb_batch.sh runs all generated specs sequentially against scenario.ts, logs per-scenario outcomes, and reports KB state at the end. Each run takes ~2-4min; 20-30 scenarios = 1-2hr unattended. Next: test the generator+batch on a small N (3-5) to verify KB populates correctly and pathway recommendations start getting neighbor signal instead of cold-starts. Then item 3 (Rust re-weighting of hybrid_search by playbook_memory success).	2026-04-20 20:31:34 -05:00
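In this commit the k rule lives in the executor prompt rather than in code. Expressed as a helper, the same rule (k = max(count*5, 20), cap 80) might look like the sketch below; `adaptiveK` is a hypothetical name used only for illustration.

```typescript
// Constants taken from the commit message: 5 candidates per requested fill,
// a floor of 20, and a ceiling of 80.
const PER_FILL = 5;
const MIN_K = 20;
const MAX_K = 80;

// Scales the hybrid_search pool size with the number of fills requested.
function adaptiveK(targetCount: number): number {
  return Math.min(MAX_K, Math.max(targetCount * PER_FILL, MIN_K));
}

// A single fill stays at the floor, a 5x expansion gets 25 candidates,
// and very large requests are clamped at the cap.
console.log(adaptiveK(1), adaptiveK(5), adaptiveK(20)); // prints "20 25 80"
```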
root	9c1400d738	Phase 22 — Internal Knowledge Library (KB) Meta-layer over Phase 19 playbook_memory. Phase 19 answers "which WORKERS worked for this event"; KB answers "which CONFIG worked for this playbook signature" — model choice, budget hints, pathway notes, error corrections. tests/multi-agent/kb.ts: - computeSignature(): stable sha256 hash of the (kind, role, count, city, state) tuple sequence. Same scenario shape → same sig. - indexRun(): extracts sig, embeds spec digest via sidecar, appends outcome record, upserts signature to data/_kb/signatures.jsonl. - findNeighbors(): cosine-ranks the k most-similar signatures from prior runs for a target spec. - detectErrorCorrections(): scans outcomes for same-sig fail→succeed pairs, diffs the model set, logs to error_corrections.jsonl. - recommendFor(): feeds target digest + k-NN neighbors + recent corrections to the overview model, gets back a structured JSON recommendation (top_models, budget_hints, pathway_notes), appends to pathway_recommendations.jsonl. JSON-shape constrained so the executor can inherit it mechanically. - loadRecommendation(): at scenario start, pulls newest rec matching this sig (or nearest). scenario.ts: - Reads KB recommendation at startup (alongside prior lessons). - Injects pathway_notes into guidanceFor() executor context. - After retrospective, indexes the run + synthesizes next rec. Cold-start behavior: first run with no history writes a low-confidence "no prior data" rec so the signal that something was attempted is captured. Second run gets "low confidence, 0 neighbors" until a third distinct sig gives the embedder something to compare against — hence the upcoming scenario generator. VERIFIED: - data/_kb/ populated after one scenario run: 1 outcome (sig=4674…, 4/5 ok, 16 turns total), 1 signature, 2 recs (cold + post-run). - Recommendation JSON-parsed cleanly from gpt-oss:20b overview model. 
PRD Phase 22 added with file layout, cycle description, and the rationale for file-based MVP → Rust port progression that matches how Phase 21 primitives shipped. What's NOT here yet (batched follow-ups per J's request, tested between each): - Lift the k=10 hybrid_search cap to adaptive k=max(count*5, 20) - Scenario generator to bulk-populate KB with varied signatures - Rust re-weighting: push playbook_memory success signal INTO hybrid_search scoring, not just post-hoc boost	2026-04-20 20:27:12 -05:00
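The cosine ranking that findNeighbors() is described as doing can be sketched as below. The record shape (`SignatureRecord` with an `embedding` array) and the function signature are assumptions for illustration; the actual kb.ts may store and load embeddings differently.

```typescript
// Hypothetical record shape: a signature hash plus its spec-digest embedding.
interface SignatureRecord {
  sig_hash: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every known signature against the target and keeps the k best.
function findNeighbors(
  target: number[],
  known: SignatureRecord[],
  k: number,
): { sig_hash: string; score: number }[] {
  return known
    .map((rec) => ({ sig_hash: rec.sig_hash, score: cosine(target, rec.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

With only one or two stored signatures this ranking has little to compare against, which is consistent with the cold-start behavior the commit describes.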
root	0c4868c191	qwen3.5 executor + continuation primitive + think:false Three coupled fixes that together turned the Riverfront Steel scenario from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing concerns rather than linter advice. MODEL SWAP - Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking). mistral's decoder emitted malformed JSON on complex SQL filters regardless of prompt; J called it — stop using mistral. - Reviewer: qwen2.5 → qwen3:latest (40K ctx) - Applied to scenario.ts, orchestrator.ts, network_proving.ts, run_e2e_rated.ts CONTINUATION PRIMITIVE (agent.ts) - generateContinuable(): empty-response → geometric backoff retry; truncated-JSON → continue from partial as scratchpad; bounded by budget cap + max_continuations. No more "bump max_tokens until it stops truncating" tourniquet. - generateTreeSplit(): map-reduce for oversized input corpora with running scratchpad digest, reduce pass for final synthesis. - Empty text no longer throws — it's a signal to continuable that thinking ate the budget. think:false FOR HOT PATH - qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON emission. For executor/reviewer/draft: think:false. For T3/T4/T5 overseers: thinking stays on (that's the point). - Sidecar generate endpoint accepts `think` bool, passes through to Ollama's /api/generate. 
VERIFIED OUTCOMES Riverfront Steel 2026-04-21, qwen3.5+continuable+think:false: 08:00 baseline_fill 3/3 4 turns 10:30 recurring 2/2 3 turns (1 playbook citation) 12:15 expansion 0/5 drift-aborted (5-fill orchestration problem, separate work) 14:00 emergency 4/4 3 turns (1 citation) 15:45 misplacement 1/1 3 turns → T3 caught Patrick Ross double-booking across events → T3 flagged forklift cert drift on the event that failed → Cross-day lesson proposed "maintain buffer of ≥3 emergency candidates, pre-fetch certs for expansion, booking system cross-check" — real staffing advice, not generic linter output PRD PHASE 21 rewritten to reflect the actual primitive shape (two- call map-reduce with scratchpad glue) instead of the tourniquet approach originally documented. Rust port queued for next sprint. scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits tests/multi-agent/playbooks/ab_scorecard.json.	2026-04-20 20:19:02 -05:00
root	6e7ca1830e	Phase 21 foundation — context stability + chunking pipeline PRD: add Phase 20 (model matrix, wired) and Phase 21 (context stability, partial). Phase 21 exists because LLM Team hit this exact wall — running multi-model ranking on large context silently truncated, rankings degraded, no pipeline caught it. The stable answer: every agent call goes through a budget check against the model's declared context_window minus safety_margin, with a declared overflow_policy when the check fails. config/models.json: - context_window + context_budget per tier - overflow_policies block: summarize_oldest_tool_results_via_t3, chunk_lessons_via_cosine_topk, two_pass_map_reduce, escalate_to_kimi_k2_1t_or_split_decision - chunking_cache spec (data/_chunk_cache/, corpus-hash keyed) agent.ts: - estimateTokens() chars/4 biased safe ~15% - CONTEXT_WINDOWS table (fallback; prod reads models.json) - assertContextBudget() — throws on overflow with exact numbers, can bypass with bypass_budget:true for callers with their own policy - Wired into generate() and generateCloud() so EVERY call is checked scenario.ts: - T3 lesson archive to data/_playbook_lessons/*.json (the old /vectors/playbook_memory/seed path was silently failing with HTTP 400 because it requires 'fill: Role xN in City, ST' operation shape) - loadPriorLessons() at scenario start — filters by city/state match, date-sorted, takes top-3 - prior_lessons.json archived per-run (honest signal for A/B) - guidanceFor() injects up to 2 prior lessons (≤500 chars each) into the executor's per-event context - Retrospective shows explicit "Prior lessons loaded: N" line Verified: mistral correctly rejects a 150K-char prompt (7532 tokens over), gpt-oss:120b accepts it with 90K headroom. The enforcement is in-band on every call now, not an afterthought. Full chunking service (Rust) remains deferred to the sprint this feeds: crates/aibridge/src/budget.rs + chunk.rs + storaged/chunk_cache.rs	2026-04-20 19:34:44 -05:00
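A minimal sketch of the in-band budget check described above. Two things here are assumptions: the "~15% safe bias" is read as multiplying the chars/4 estimate by 1.15, and the `BudgetOptions` shape is hypothetical. The real agent.ts reads windows from config/models.json and may differ in both respects.

```typescript
// Hypothetical options shape; field names follow the commit's vocabulary.
interface BudgetOptions {
  context_window: number; // model's declared context window, in tokens
  safety_margin: number;  // tokens held back from the window
  bypass_budget?: boolean; // callers with their own overflow policy
}

// chars/4 heuristic, inflated by an assumed 15% so the estimate errs high.
function estimateTokens(text: string): number {
  return Math.ceil((text.length / 4) * 1.15);
}

// Throws with exact numbers on overflow; otherwise returns the headroom
// (which can be negative when the check is bypassed).
function assertContextBudget(prompt: string, opts: BudgetOptions): number {
  const estimated = estimateTokens(prompt);
  const budget = opts.context_window - opts.safety_margin;
  if (!opts.bypass_budget && estimated > budget) {
    throw new Error(
      `context overflow: estimated ${estimated} tokens, budget ${budget} (${estimated - budget} over)`,
    );
  }
  return budget - estimated;
}
```

Returning the headroom instead of a boolean lets a caller log how close each call came to the limit, which fits the commit's goal of catching truncation before it happens silently.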
root	03d723e7e6	Model matrix — 5 tiers, local hard workers + cloud overseers config/models.json is the authoritative catalog. Hot path (T1/T2) stays local; cloud is consulted only for overview (T3), strategic (T4), and gatekeeper (T5) calls. J named qwen3.5 + newer models (minimax-m2.7, glm-5, qwen3-next) specifically — all mapped with real reachable IDs verified against ollama.com/api/tags. Tier shape: - t1_hot mistral + qwen2.5 local — 50-200 calls/scenario - t2_review qwen2.5 + qwen3 local — 5-14 calls/event - t3_overview gpt-oss:120b cloud — 1-3 calls/scenario - t4_strategic qwen3.5:397b + glm-4.7 — 1-10 calls/day - t5_gatekeeper kimi-k2-thinking — 1-5 calls/day, audit-logged Rate budgets are declared in-config — Ollama Cloud paid tier is generous but we cap overview/strategic/gatekeeper so no single rogue scenario can blow the day's quota. Experimental rotation list wired but disabled by default. When enabled, T4 randomly routes 10% of calls to a rotating minimax/GLM/qwen-next/ deepseek/nemotron/cogito/mistral-large candidate, logs comparisons, and auto-promotes after 3 rotations of wins. Playbook versioning SPEC embedded under `playbook_versioning` key: every seed gets version + parent_id + retired_at + architecture_snapshot, so when a schema migration breaks a playbook we can pinpoint which change retired it. Implementation flagged for next sprint (touches gateway + catalogd + mcp-server) — not wired here. - scenario.ts now loads config/models.json at init, env vars still override - mcp-server exposes /models/matrix read-only so UI can render it	2026-04-20 19:24:41 -05:00
root	e4ae5b646e	T3 overview tier — mid-day checkpoints + cross-day lesson Hot path (T1/T2) stays mistral + qwen2.5. The new T3 tier runs a thinking model SPARINGLY — after every misplacement, every N-th event (default N=3), and once post-scenario for the cross-day lesson. - agent.ts: generateCloud() for Ollama Cloud (gpt-oss:120b etc). Uses the same /api/generate shape; thinking field is discarded. - scenario.ts: runOverviewCheckpoint + runCrossDayLesson. Outputs land in checkpoints.jsonl and lesson.md. Lesson also seeds playbook_memory under operation "cross-day-lesson-{date}" — future runs pick it up through the existing similarity boost. - Env knobs: LH_OVERVIEW_CLOUD=1 routes T3 to cloud, LH_OVERVIEW_MODEL overrides (default gpt-oss:20b local, gpt-oss:120b cloud), LH_T3_CHECKPOINT_EVERY controls cadence, LH_T3_DISABLE=1 turns it off. Why this shape: prior feedback_phase19_seed_text.md warned that verbose seeds dilute the embedding and silently kill the boost. T3's rich prose goes to lesson.md; the embedded "approach" + "context" stay terse. Verified end-to-end: local 20b checkpoint 10.9s, lesson 4.0s; cloud 120b lesson 3.7s. Cloud output is both faster AND more specific than local (sequenced, tactical, logging advice included).	2026-04-20 19:21:45 -05:00
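The checkpoint cadence described above (after every misplacement, plus every N-th event, default N=3) could be expressed as the predicate below. `shouldCheckpoint` and its parameters are hypothetical; in the real harness the cadence and the off switch come from LH_T3_CHECKPOINT_EVERY and LH_T3_DISABLE, which this sketch takes as plain arguments.

```typescript
// Decides whether a T3 overview checkpoint runs after a given event.
// eventIndex is 1-based; `every` mirrors LH_T3_CHECKPOINT_EVERY and
// `disabled` mirrors LH_T3_DISABLE=1 (both passed in here, not read from env).
function shouldCheckpoint(
  eventIndex: number,
  kind: string,
  every: number = 3,
  disabled: boolean = false,
): boolean {
  if (disabled) return false;
  // Misplacements always get a checkpoint, regardless of cadence.
  if (kind === "misplacement") return true;
  return every > 0 && eventIndex % every === 0;
}
```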
root	f8e8d25b5f	Unblock complex scenarios: JSON tolerance + optional question + mistral exec parseAction now strips stray `)` before `}` and trailing commas — qwen2.5 emits those regularly on tool_call outputs; soft-fix beats retry-loops. hybrid_search no longer hard-requires `question`; defaults to "qualified available workers" when the model drops it (mistral's most common failure mode on complex events). Kept original TOOL_CATALOG shape (args examples only, not full action envelopes). The verbose few-shot version from the prior iteration confused mistral into wrapping propose_done as tool_call. Scenario V7 result: expansion (5 Forklift Ops) and emergency (4 Loaders) — previously-failing complex events — now seal reliably. Pool sizes: 687 and 380 from 500K corpus. Patterns endpoint produces real operator-actionable signals: expansion: "recurring certifications: Forklift (40%), OSHA-10 (40%) · recurring skills: mill (40%) · archetype mostly: leader · reliability median 0.83" Baseline + recurring are now flaky (inverted trade-off, pure model-reliability variance).	2026-04-20 15:28:30 -05:00
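The two soft fixes named in this commit (a stray `)` before `}` and trailing commas) can be sketched with two regex passes. The names `softFixJson` and `parseActionTolerant` are hypothetical, and the actual parseAction may apply the fixes differently; this version tries a strict parse first so that well-formed output is never rewritten.

```typescript
// Applies the two repairs described in the commit message.
function softFixJson(raw: string): string {
  return raw
    .replace(/\)\s*}/g, "}")        // stray ")" immediately before a closing brace
    .replace(/,\s*([}\]])/g, "$1"); // trailing comma before "}" or "]"
}

// Strict parse first; fall back to the repaired text only on failure.
// If the repaired text is still invalid, the JSON.parse error propagates.
function parseActionTolerant(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    return JSON.parse(softFixJson(raw));
  }
}
```

The regexes are not string-aware, so a string value that happens to contain `)}` would be altered on the fallback path. Trying the strict parse first limits that risk to inputs that were already malformed.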
root	1274ab2cb3	Scenario harness: Path 1+2 integration + schema hardening Upgrades to tests/multi-agent/scenario.ts to exercise the full Path 1+2 feature set on a real warehouse-client week (5 events on one client): - Hard SCHEMA ENFORCEMENT block in every event's guidance. Prior runs had mistral read narrative words ("shift", "recurring", "expansion") as SQL column names. Schema is now locked explicitly with valid columns listed and CAST guidance for availability + reliability. - playbook_memory_k bumped 10 → 100 to match server default. - Canonical short seed text (operation + "{kind} fill via hybrid search" + "{role} fill in {city}, {state}"). Verbose LLM rationales dilute embeddings and silently kill boost (Pass 1 finding). - /vectors/playbook_memory/mark_failed fires automatically on misplacement events — records the no-shower's failure so future searches for same city+role dampen their boost. - /vectors/playbook_memory/patterns call per event — surfaces what the meta-index discovered (recurring certs/skills/archetype/reliability) for that query into the dispatch log and retrospective. - Retrospective now includes a workers-touched audit table (every worker who reached a decision, with outcome column) and a discovered-patterns-evolution section across events. Honest limitations this surfaced in the real run: - mistral's executor prompt-adherence degrades on high-count events (5+ fills) and scenario-specific language (emergency/misplacement). 3 of 5 events aborted via drift guard. Baseline + recurring sealed cleanly with real fills + SMS + emails + seeded playbooks. - worker_id resolution returns "undefined" for some names when name matching is ambiguous in workers_500k (multiple workers with same name in same city).	2026-04-20 15:09:14 -05:00
root	25b7e6c3a7	Phase 19 wiring + Path 1/2 work + chain integrity fixes Backend: - crates/vectord/src/playbook_memory.rs (new): Phase 19 in-memory boost store with seed/rebuild/snapshot, plus temporal decay (e^-age/30 per playbook), persist_to_sql endpoint backing successful_playbooks_live, and discover_patterns endpoint for meta-index pattern aggregation (recurring certs/skills/archetype/reliability across similar past fills). - DEFAULT_TOP_K_PLAYBOOKS bumped 5 → 25; old default silently missed most boosts when memory had > 25 entries. - service.rs: new routes /vectors/playbook_memory/{seed,rebuild,stats, persist_sql,patterns}. Bun staffing co-pilot (mcp-server/): - /search, /match, /verify, /proof, /simulation/run, MCP tools all forward use_playbook_memory:true and playbook_memory_k:25 to the hybrid endpoint. Boost was previously dark across the entire app. - /log no longer POSTs to /ingest/file — that endpoint REPLACES the dataset's object list, so single-row CSV writes were wiping all prior rows in successful_playbooks (sp_rows went 33→1 in one /log call). /log now seeds playbook_memory with canonical short text and calls /persist_sql to keep successful_playbooks_live in sync. - /simulation/run cumulative end-of-week CSV write removed for the same reason. Per-day per-contract /seed (added in this session) is the accumulating feedback path now. - search.html addWorkerInsight renders a green "Endorsed · N playbooks" chip with playbook citations when boost > 0. Internal Dioxus UI (crates/ui/): - Dashboard phase list rewritten through Phase 19 (was stuck at "Phase 16: File Watcher" / "Phase 17: DB Connector" — both wrong). - Removed fabricated "27ms" stat label. - Ask tab examples + SQL default replaced with real staffing prompts against candidates/clients/job_orders (was referencing nonexistent employees/products/events). - New Playbook tab exposes /vectors/playbook_memory/{stats,rebuild} and side-by-side hybrid search (boost OFF vs ON) with citations. 
Tests (tests/multi-agent/): - run_e2e_rated.ts: parallel two-agent (mistral + qwen2.5) build phase + verifier rating (geo, auth, persist, boost, speed → /10). - network_proving.ts: continuous build → verify → repeat with staffing-recruiter profile hot-swap; geo-discrimination check. - chain_of_custody.ts: single recruiter operation traced through every layer (Bun /search, direct /vectors/hybrid parity, /log, SQL, playbook_memory growth, profile activation, post-op boost lift).	2026-04-20 06:21:13 -05:00
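The temporal decay noted for playbook_memory.rs, read here as e^(-age/30), is sketched below in TypeScript to match the test harness rather than the Rust crate. The age unit is assumed to be days, and `decayWeight` and `decayedBoost` are hypothetical names; how the real store combines the weight with the boost is not stated in the commit.

```typescript
// Decay constant from the commit message; the unit (days) is an assumption.
const DECAY_CONSTANT = 30;

// Weight in (0, 1]: 1 for a brand-new playbook, shrinking with age.
function decayWeight(age: number): number {
  if (age < 0) throw new RangeError("age must be non-negative");
  return Math.exp(-age / DECAY_CONSTANT);
}

// Assumed combination: scale a playbook's base boost by its decay weight.
function decayedBoost(baseBoost: number, age: number): number {
  return baseBoost * decayWeight(age);
}
```

Under this reading a playbook keeps its full boost at age 0 and about 37% of it (1/e) at age 30.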
root	19bdfab227	Phase 2: DataFusion query engine over Parquet - queryd: SessionContext with custom URL scheme to avoid path doubling with LocalFileSystem - queryd: ListingTable registration from catalog ObjectRefs with schema inference - queryd: POST /query/sql returns JSON {columns, rows, row_count} - queryd→catalogd wiring: reads all datasets, registers as named tables - gateway: wires QueryEngine with shared store + registry - e2e verified: SELECT *, WHERE/ORDER BY, COUNT/AVG all correct Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 05:48:20 -05:00

11 Commits