lakehouse/scripts/ab_t3_test.sh
root 0c4868c191 qwen3.5 executor + continuation primitive + think:false
Three coupled fixes that together turned the Riverfront Steel scenario
from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing
concerns rather than linter advice.

MODEL SWAP
- Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking).
  mistral's decoder emitted malformed JSON on complex SQL filters
  regardless of prompt; J called it — stop using mistral.
- Reviewer: qwen2.5 → qwen3:latest (40K ctx)
- Applied to scenario.ts, orchestrator.ts, network_proving.ts,
  run_e2e_rated.ts
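
As a hedged illustration only, the role wiring after the swap might look
roughly like this in TypeScript (names and layout are assumptions, not the
actual scenario.ts code; per-role think flags are explained below):

// Hypothetical sketch; the real scenario.ts/orchestrator.ts wiring may differ.
const ROLE_MODELS = {
  executor: { model: "qwen3.5:latest", think: false }, // was mistral
  reviewer: { model: "qwen3:latest",   think: false }, // was qwen2.5
  // T3/T4/T5 overseer roles keep thinking enabled (see think:false section).
} as const;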

CONTINUATION PRIMITIVE (agent.ts)
- generateContinuable(): empty-response → geometric backoff retry;
  truncated-JSON → continue from partial as scratchpad; bounded by
  budget cap + max_continuations. No more "bump max_tokens until it
  stops truncating" tourniquet.
- generateTreeSplit(): map-reduce for oversized input corpora with
  running scratchpad digest, reduce pass for final synthesis.
- Empty text no longer throws — it's a signal to continuable that
  thinking ate the budget.
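
Hedged TypeScript sketch of the continuable loop described above; the
generate()/tryParseJson() helpers, option names, and prompt glue are
assumptions for illustration, not agent.ts's real signatures:

declare function generate(
  prompt: string,
  opts: { maxTokens: number },
): Promise<{ text: string; tokens: number }>;
declare function tryParseJson(s: string): unknown | undefined;

async function generateContinuable(
  prompt: string,
  { maxContinuations = 3, tokenBudget = 8192, baseBackoffMs = 500 } = {},
) {
  let scratchpad = "";
  let spent = 0;
  for (let attempt = 0; attempt <= maxContinuations && spent < tokenBudget; attempt++) {
    const res = await generate(
      scratchpad
        ? `${prompt}\n\nPartial output so far, continue from here:\n${scratchpad}`
        : prompt,
      { maxTokens: tokenBudget - spent },
    );
    spent += res.tokens;
    if (!res.text.trim()) {
      // Empty text is a signal (thinking ate the budget), not an error:
      // geometric backoff, then retry.
      await new Promise((r) => setTimeout(r, baseBackoffMs * 2 ** attempt));
      continue;
    }
    scratchpad += res.text;
    const parsed = tryParseJson(scratchpad);
    if (parsed !== undefined) return parsed; // complete JSON parsed, done
    // otherwise the JSON is truncated: loop and continue from the partial
  }
  throw new Error("generateContinuable: budget or continuation cap exhausted");
}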

think:false FOR HOT PATH
- qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON
  emission. For executor/reviewer/draft: think:false. For T3/T4/T5
  overseers: thinking stays on (that's the point).
- Sidecar generate endpoint accepts `think` bool, passes through to
  Ollama's /api/generate.
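
Minimal sketch of the pass-through, assuming the sidecar simply forwards the
flag in the request body (sidecar-side names are assumptions; the point is
that the same boolean reaches /api/generate):

interface GenerateReq { model: string; prompt: string; think?: boolean }

// Sidecar handler body, shape illustrative: forward `think` to Ollama untouched.
async function proxyGenerate(req: GenerateReq) {
  const res = await fetch("http://127.0.0.1:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: req.model,
      prompt: req.prompt,
      think: req.think ?? true, // hot-path callers (executor/reviewer/draft) send false
      stream: false,
    }),
  });
  if (!res.ok) throw new Error(`ollama generate failed: ${res.status}`);
  return (await res.json()) as { response: string };
}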

VERIFIED OUTCOMES
Riverfront Steel 2026-04-21, qwen3.5+continuable+think:false:
  08:00 baseline_fill  3/3  4 turns
  10:30 recurring      2/2  3 turns (1 playbook citation)
  12:15 expansion      0/5  drift-aborted (5-fill orchestration
                            problem, separate work)
  14:00 emergency      4/4  3 turns (1 citation)
  15:45 misplacement   1/1  3 turns
  → T3 caught Patrick Ross double-booking across events
  → T3 flagged forklift cert drift on the event that failed
  → Cross-day lesson proposed "maintain buffer of ≥3 emergency
    candidates, pre-fetch certs for expansion, booking system
    cross-check" — real staffing advice, not generic linter output

PRD PHASE 21 rewritten to reflect the actual primitive shape (two-
call map-reduce with scratchpad glue) instead of the tourniquet
approach originally documented. Rust port queued for next sprint.
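
One possible reading of that map-reduce-with-scratchpad shape, again only a
hedged sketch (chunking, prompts, and the generate() stub are assumptions;
the real generateTreeSplit may differ):

declare function generate(
  prompt: string,
  opts: { maxTokens: number },
): Promise<{ text: string }>;

// Map pass folds each chunk into a running scratchpad digest;
// one reduce pass then synthesizes the final answer.
async function generateTreeSplit(task: string, corpus: string[], chunkSize = 4) {
  let digest = "";
  for (let i = 0; i < corpus.length; i += chunkSize) {
    const chunk = corpus.slice(i, i + chunkSize).join("\n---\n");
    const res = await generate(
      `Task: ${task}\nDigest so far:\n${digest}\n\nNew material:\n${chunk}\n\nUpdate the digest.`,
      { maxTokens: 2048 },
    );
    digest = res.text;
  }
  const final = await generate(
    `Task: ${task}\nDigest of all material:\n${digest}\n\nWrite the final synthesis.`,
    { maxTokens: 4096 },
  );
  return final.text;
}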

scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits
tests/multi-agent/playbooks/ab_scorecard.json.
2026-04-20 20:19:02 -05:00

#!/usr/bin/env bash
# A/B test of T3 overseer: does it actually make subsequent runs better?
# Chains Run B (T3 seed) → Run C (T3 + read-back) → Run D (T3 cloud).
# Run A is assumed already complete (launched separately). Aggregates
# metrics at the end into ab_scorecard.json.
set -e
cd "$(dirname "$0")/.."
export OLLAMA_CLOUD_KEY="$(python3 -c "import json; print(json.load(open('/root/llm_team_config.json'))['providers']['ollama_cloud']['api_key'])")"
echo "▶ A/B test start at $(date -Iseconds)"
echo "▶ prior lessons dir: $(ls data/_playbook_lessons 2>/dev/null | wc -l) files"
# Run B — T3 enabled local, no prior lessons should exist yet
echo "──── RUN B: T3 local, seeds first lesson ────"
bun tests/multi-agent/scenario.ts > /tmp/lakehouse_ab_B.log 2>&1 && rc=0 || rc=$?  # capture real exit code; "|| true" would always report 0
echo " B exit=$rc"
ls data/_playbook_lessons/*.json 2>/dev/null | head -5
# Run C — T3 enabled local, B's lesson should load
echo "──── RUN C: T3 local, reads B's lesson ────"
bun tests/multi-agent/scenario.ts > /tmp/lakehouse_ab_C.log 2>&1 && rc=0 || rc=$?
echo " C exit=$rc"
# Run D — T3 enabled CLOUD (gpt-oss:120b), reads B+C lessons
echo "──── RUN D: T3 cloud, reads B+C lessons ────"
LH_OVERVIEW_CLOUD=1 bun tests/multi-agent/scenario.ts > /tmp/lakehouse_ab_D.log 2>&1 && rc=0 || rc=$?
echo " D exit=$rc"
echo "▶ all runs done at $(date -Iseconds)"
echo "▶ scorecard:"
# The 4 newest scenario dirs are D,C,B,A (reverse chronological); tac flips
# them to oldest-first so labels line up as A,B,C,D.
ls -1dt tests/multi-agent/playbooks/scenario-* | head -4 | tac | python3 -c "
import sys, os, json, datetime
runs = [l.strip() for l in sys.stdin if l.strip()]
labels = ['A(no-T3)', 'B(T3-seed)', 'C(T3-read)', 'D(T3-cloud)']
rows = []
for i, path in enumerate(runs):
    try:
        results = json.load(open(os.path.join(path, 'results.json')))
    except FileNotFoundError:
        continue
    try:
        prior = json.load(open(os.path.join(path, 'prior_lessons.json')))
    except FileNotFoundError:
        prior = []
    rows.append({
        'label': labels[i] if i < len(labels) else f'run{i}',
        'path': path,
        'ok_events': sum(1 for r in results if r.get('ok')),
        'total_events': len(results),
        'total_turns': sum(r.get('turns', 0) for r in results),
        'total_gaps': sum(len(r.get('gap_signals', [])) for r in results),
        'total_citations': sum(len(r.get('playbook_citations') or []) for r in results),
        'prior_lessons_loaded': len(prior),
    })
scorecard = {'generated_at': datetime.datetime.utcnow().isoformat() + 'Z', 'runs': rows}
with open('tests/multi-agent/playbooks/ab_scorecard.json', 'w') as f:
    f.write(json.dumps(scorecard, indent=2))
print(json.dumps(scorecard, indent=2))
"
echo "▶ saved: tests/multi-agent/playbooks/ab_scorecard.json"