Item A — stress scenario + enriched T3 diagnostic prompt

Proves cloud passthrough works end-to-end AND fixes the diagnostic quality problem that first run surfaced. STRESS SCENARIO (tests/multi-agent/scenarios/stress_01.json): Five genuinely hard events with varied failure modes: - Gary, IN 5× Electrician: ZERO supply (city not in workers_500k) - Peoria, IL 8× Safety Coordinator: scarce role, initial pool only 5 - Flint, MI 3× Welder: ZERO supply - Grand Rapids, MI 4× Tool & Die Maker: scarce but solvable - Gary, IN 1× Electrician misplacement: repeats event 1's impossibility FIRST RUN (stress v1) — cloud passthrough works, diagnosis vague: T3 checkpoint: "Potential drift flags for upcoming role" Lesson: "Before dispatching, query pool status. Update turn counter..." Generic tactical advice that doesn't address the real problem. Root cause: T3 prompt only saw outcome summary, not the raw SQL/pool/drift signals the executor had in its log. DIAGNOSTIC FIX: - Added LogEntry[] `sharedLog` parameter to runAgentFill so the caller retains the trace even when runAgentFill throws drift-abort. - EventResult gained `diagnostic_log` field populated on both OK and FAIL paths. - extractDiagnostics() pulls SQL filters, hybrid_search row counts, SQL errors, and reviewer drift notes from the log. - Checkpoint prompt now includes FAILURE FORENSICS block for failed events: SQL filters attempted, row counts, errors, drift reasons, and an explicit teaching note about zero-supply detection. - Cross-day lesson prompt flags each event with [ZERO-SUPPLY: pivot city needed] tag when drift reasons mention "no match"/"no candidates"/"0 rows". PRIORITY clause in the prompt tells the model its lesson MUST name alternate cities when that tag appears. SECOND RUN (stress v2 with enriched prompt) — cloud diagnosis sharp: T3 after Flint: risk="Zero candidate supply for Welder in Flint" hint="search Welder×3 in Saginaw, MI (≈30 mi) or expand role to Metal Fabricator" T3 after Gary: risk="Zero supply for Electrician in Gary, IN" hint="Pivot to Chicago, IL (≈40 min); broaden to Electrical Technician within 60 min radius" Lesson: specific, per-city, with distances, role-broadening fallback, and pre-loading strategy — actionable for item B retry. Cloud 120b call latencies consistent: 4.8-8.0s per prompt. Cloud passthrough proven under stress. Fill outcomes unchanged (1/5 — correct rejection of three impossible events + one propagating JSON emission edge case on retry pivot reasoning). The knowledge to rescue them now exists in the lesson; item B wires the retry.
2026-04-20 21:54:29 -05:00 · 2026-04-20 21:54:29 -05:00 · c21b261877
commit c21b261877
parent a663698571
2 changed files with 131 additions and 8 deletions
--- a/tests/multi-agent/scenario.ts
+++ b/tests/multi-agent/scenario.ts
@ -136,6 +136,10 @@ interface EventResult {
  pool_size?: number;       // sql_matches from the first hybrid_search
  playbook_citations?: string[];
  discovered_pattern?: string; // Path 2 meta-index snapshot per event
+  diagnostic_log?: LogEntry[]; // SQL filters, pool sizes, drift reasons —
+                               // what T3 needs to diagnose supply issues
+                               // vs surface symptoms. Present on both
+                               // failed and successful events.
 }

 interface RosterEntry {
@ -341,9 +345,12 @@ async function runAgentFill(
  task: TaskSpec,
  extra_guidance: string,
  exclude_worker_ids: string[],
+  sharedLog?: LogEntry[],  // If provided, runAgentFill appends here too so
+                           // callers can diagnose AFTER a throw. Empty by
+                           // default — passing this in is opt-in.
 ): Promise<AgentFillOutcome> {
  const t0 = Date.now();
-  const log: LogEntry[] = [];
+  const log: LogEntry[] = sharedLog ?? [];
  let turn = 0;
  let consecutiveDrifts = 0;
  let sealed: { fills: Fill[]; approach: string } | null = null;
@ -677,8 +684,14 @@ async function runEvent(event: FillEvent, ctx: ScenarioContext): Promise<EventRe

  const gap_signals: string[] = [];
  let outcome: AgentFillOutcome;
+  // Share the log so a drift-abort or tool-error throw still leaves us
+  // with a diagnostic trail — T3 checkpoint needs SQL filters, pool
+  // sizes, and reviewer drift notes to diagnose ROOT CAUSE (0-supply
+  // cities, semantic fallback on empty SQL result) rather than surface
+  // symptoms ("drift flagged").
+  const sharedLog: LogEntry[] = [];
  try {
-    outcome = await runAgentFill(task, guidanceFor(event, ctx), excludeIds);
+    outcome = await runAgentFill(task, guidanceFor(event, ctx), excludeIds, sharedLog);
  } catch (e) {
    return {
      event,
@ -688,6 +701,7 @@ async function runEvent(event: FillEvent, ctx: ScenarioContext): Promise<EventRe
      duration_secs: (Date.now() - t0) / 1000,
      error: (e as Error).message,
      gap_signals: [`drift_or_tool: ${(e as Error).message}`],
+      diagnostic_log: sharedLog,
    };
  }

@ -820,6 +834,7 @@ async function runEvent(event: FillEvent, ctx: ScenarioContext): Promise<EventRe
    pool_size: outcome.first_sql_matches,
    playbook_citations: outcome.playbook_citations,
    discovered_pattern,
+    diagnostic_log: sharedLog,
  };
 }

@ -877,6 +892,41 @@ interface OverviewCheckpoint {
  risk: string;          // T3's named risk flag
 }

+// Extract supply/diagnostic signals from the event's log so T3 can
+// reason about ROOT CAUSES (zero-supply cities, semantic fallback on
+// empty SQL results, repeated drift reasons) instead of surface
+// symptoms ("drift flagged").
+function extractDiagnostics(log: LogEntry[] | undefined): {
+  sql_filters: string[];
+  hybrid_row_counts: number[];
+  sql_errors: string[];
+  drift_reasons: string[];
+} {
+  const sql_filters: string[] = [];
+  const hybrid_row_counts: number[] = [];
+  const sql_errors: string[] = [];
+  const drift_reasons: string[] = [];
+  if (!log) return { sql_filters, hybrid_row_counts, sql_errors, drift_reasons };
+  for (const e of log) {
+    if (e.kind === "tool_call" && (e.content?.tool === "hybrid_search" || e.content?.tool === "sql")) {
+      const args = e.content?.args ?? {};
+      if (args.sql_filter) sql_filters.push(String(args.sql_filter).slice(0, 200));
+      else if (args.query) sql_filters.push(String(args.query).slice(0, 200));
+    }
+    if (e.kind === "tool_result") {
+      const rows = e.content?.rows;
+      if (typeof rows === "number") hybrid_row_counts.push(rows);
+      const err = e.content?.error;
+      if (typeof err === "string") sql_errors.push(err.slice(0, 200));
+    }
+    if (e.kind === "critique" && e.content?.verdict === "drift") {
+      const note = e.content?.notes;
+      if (typeof note === "string") drift_reasons.push(note.slice(0, 200));
+    }
+  }
+  return { sql_filters, hybrid_row_counts, sql_errors, drift_reasons };
+}
+
 async function runOverviewCheckpoint(
  event: FillEvent,
  result: EventResult,
@ -886,9 +936,19 @@ async function runOverviewCheckpoint(
  const start = Date.now();

  const priorSummary = prior.slice(-3).map(p =>
-    `- ${p.event.at} ${p.event.kind} ${p.event.role}×${p.event.count} → ${p.ok ? p.fills.length + "/" + p.event.count + " filled" : "FAIL"}; pool=${p.pool_size ?? "?"}; cites=${p.playbook_citations?.length ?? 0}`
+    `- ${p.event.at} ${p.event.kind} ${p.event.role}×${p.event.count} in ${p.event.city},${p.event.state} → ${p.ok ? p.fills.length + "/" + p.event.count + " filled" : "FAIL(" + (p.error?.slice(0, 40) ?? "?") + ")"}; pool=${p.pool_size ?? "?"}; cites=${p.playbook_citations?.length ?? 0}`
  ).join("\n");

+  const diag = extractDiagnostics(result.diagnostic_log);
+  const diagBlock = result.ok
+    ? "" // only include for failures — successful events don't need the forensic detail
+    : `\n\nFAILURE FORENSICS (raw signals the agent saw — use to diagnose root cause):\n`
+      + `  SQL filters attempted (first 3):\n${diag.sql_filters.slice(0, 3).map(f => "    - " + f).join("\n") || "    (none)"}\n`
+      + `  Row counts per tool call: ${diag.hybrid_row_counts.slice(0, 5).join(", ") || "(none)"}\n`
+      + (diag.sql_errors.length > 0 ? `  SQL errors: ${diag.sql_errors.slice(0, 2).join(" | ")}\n` : "")
+      + `  Reviewer drift reasons (first 2): ${diag.drift_reasons.slice(0, 2).join(" | ") || "(none recorded)"}\n`
+      + `  KEY CHECK: if row counts were non-zero but the reviewer said "no matches for ${event.city}", the SQL filter yielded 0 rows in the target city and vector search FALLED BACK to workers in OTHER cities. That is a ZERO-SUPPLY condition — pivot city or broaden role.`;
+
  const thisOne = `This event: ${event.at} ${event.kind} ${event.role}×${event.count} in ${event.city}, ${event.state}. `
    + `Outcome: ${result.ok ? "filled " + result.fills.length + "/" + event.count : "FAILED: " + (result.error ?? "unknown")}. `
    + `Pool size: ${result.pool_size ?? "n/a"}. Turns: ${result.turns}. Playbook citations: ${result.playbook_citations?.length ?? 0}. `
@ -899,9 +959,9 @@ async function runOverviewCheckpoint(
 Recent events (most recent last):
 ${priorSummary || "(no prior events)"}

-${thisOne}
+${thisOne}${diagBlock}

-Your job: emit ONE risk flag (≤6 words) and ONE actionable hint (≤25 words) for the NEXT event. Be concrete: name the role, city, or worker class if relevant. Do not restate what happened. Think step by step, then output strictly as:
+Your job: emit ONE risk flag (≤8 words) and ONE actionable hint (≤40 words) for the NEXT event. Be concrete: name the role, city, worker class, OR a geographic pivot (e.g. "pivot Gary IN → Chicago IL, 40min drive"). Do not restate what happened. If the failure was zero-supply, your hint MUST propose a specific alternative city or role. Think step by step, then output strictly as:

 RISK: <flag>
 HINT: <hint>`;
@ -940,9 +1000,12 @@ HINT: <hint>`;
 async function runCrossDayLesson(ctx: ScenarioContext, checkpoints: OverviewCheckpoint[]): Promise<string | null> {
  if (T3_DISABLED) return null;

-  const eventDigest = ctx.results.map(r =>
-    `- ${r.event.at} ${r.event.kind} ${r.event.role}×${r.event.count} ${r.event.city},${r.event.state} → ${r.ok ? r.fills.length + " filled" : "FAIL"}; pool=${r.pool_size ?? "?"}; turns=${r.turns}; cites=${r.playbook_citations?.length ?? 0}; gaps=${r.gap_signals.length}`
-  ).join("\n");
+  const eventDigest = ctx.results.map(r => {
+    const diag = extractDiagnostics(r.diagnostic_log);
+    const zeroSupply = !r.ok && diag.drift_reasons.some(n => /no match|no candidat|0 rows/i.test(n));
+    const supplyTag = zeroSupply ? " [ZERO-SUPPLY: pivot city needed]" : "";
+    return `- ${r.event.at} ${r.event.kind} ${r.event.role}×${r.event.count} ${r.event.city},${r.event.state} → ${r.ok ? r.fills.length + " filled" : "FAIL(" + (r.error?.slice(0, 40) ?? "?") + ")"}; pool=${r.pool_size ?? "?"}; turns=${r.turns}; cites=${r.playbook_citations?.length ?? 0}; gaps=${r.gap_signals.length}${supplyTag}`;
+  }).join("\n");

  const checkpointDigest = checkpoints.length > 0
    ? checkpoints.map(c => `- after ${c.after_event} (${c.event_kind}): risk="${c.risk}" hint="${c.hint}"`).join("\n")
@ -960,6 +1023,8 @@ ${checkpointDigest}

 Your job: write ONE actionable lesson for future runs that face similar setups. Target audience: the agent tomorrow. Keep the lesson to 3-5 sentences. No filler, no restating. Think step by step about what pattern repeated, what to pre-fetch, or what to avoid — then write the lesson as plain prose.

+PRIORITY: If ANY event carries the [ZERO-SUPPLY: pivot city needed] tag, your lesson MUST name the specific cities that had zero supply and propose the nearest alternate city with roughly the required role count. That is the single most valuable signal for future runs — do not bury it under generic advice. Example: "Gary, IN has no Electricians in the corpus — pivot to Chicago, IL (40mi) or Hammond, IN (10mi) for future fills."
+
 LESSON:`;

  try {
--- a/tests/multi-agent/scenarios/stress_01.json
+++ b/tests/multi-agent/scenarios/stress_01.json
@ -0,0 +1,58 @@
+{
+  "client": "Ironclad Industrial",
+  "date": "2026-04-22",
+  "events": [
+    {
+      "kind": "baseline_fill",
+      "at": "07:00",
+      "role": "Electrician",
+      "count": 5,
+      "city": "Gary",
+      "state": "IN",
+      "shift_start": "07:00 AM",
+      "scenario_note": "Gary IN has ZERO Electricians in the index. Local WILL fail this. Cloud should diagnose no-supply and recommend pivoting to Chicago IL (40min drive) or relaxing to 'Maintenance Tech'."
+    },
+    {
+      "kind": "expansion",
+      "at": "09:30",
+      "role": "Safety Coordinator",
+      "count": 8,
+      "city": "Peoria",
+      "state": "IL",
+      "shift_start": "09:30 AM",
+      "scenario_note": "Safety Coordinator is the rarest role overall (~4500 nationally). 8× in a mid-sized city with availability > 0.5 is genuinely tight. Cloud should either confirm or suggest multi-city sourcing."
+    },
+    {
+      "kind": "emergency",
+      "at": "11:45",
+      "role": "Welder",
+      "count": 3,
+      "city": "Flint",
+      "state": "MI",
+      "shift_start": "12:00 PM",
+      "deadline": "13:30",
+      "scenario_note": "Flint MI has ZERO workers indexed — total data desert. Cloud must flag 'impossible supply' and recommend pivot (Detroit 60mi, Saginaw 40mi)."
+    },
+    {
+      "kind": "expansion",
+      "at": "14:00",
+      "role": "Tool & Die Maker",
+      "count": 4,
+      "city": "Grand Rapids",
+      "state": "MI",
+      "shift_start": "14:00 PM",
+      "scenario_note": "Tool & Die Maker is scarce (~9000 total). 4× in Grand Rapids, availability > 0.5 AND reliability > 0.75. Tight but solvable if playbook_memory has history; cloud should prioritize proven performers."
+    },
+    {
+      "kind": "misplacement",
+      "at": "15:30",
+      "role": "Electrician",
+      "count": 1,
+      "city": "Gary",
+      "state": "IN",
+      "shift_start": "15:30 PM",
+      "replaces_event": "07:00",
+      "scenario_note": "Refilling 1× Electrician in Gary after a no-show. Same data desert as event 1 — cloud should recognize the repeat and recommend the SAME pivot it gave earlier, proving it learns within-run."
+    }
+  ]
+}