profit 6d6a306d4e tests/real-world: add task-level 6-retry loop (per J 2026-04-22)
Two distinct retry loops now both cap at 6 and serve different
purposes:

1. Per-cloud-call continuation (Phase 21 primitive) — when a single
   cloud call returns empty or truncated, stitches up to 6
   continuation calls. Handles output-overflow.

2. Per-TASK retry (this commit) — when the whole task errors
   (500/404, thin answer, etc.), retries the full task up to 6
   times. Each retry gets PRIOR ATTEMPTS' failures injected into
   the prompt as learning context, so attempt N+1 is informed by
   what N failed at. Handles error-recovery with compounding
   context.

Both loops fired on iter 3 of the stress run, proving them
independent and composable:

  FORCING TASK-RETRY LOOP — iter 3 will cycle through 5 invalid
  models + 1 valid
    attempt 1/6: model=deliberately-invalid-model-attempt-1
        /v1/chat 502: ollama.com 404: model not found
    attempt 2/6: [with prior-failure context]
    ... (5 failures total, each with the full chain of prior errors)
    attempt 6/6: model=gpt-oss:20b [with prior-failure context]
        continuation retry 1..6 (empty responses)
        SUCCEEDED after 5 prior failures (441 chars)

What J was asking to prove:
  "I expect it to retry the process six times to build on the
   knowledge database... when an error is legitimately triggered
   that it will go through six times... without getting caught in
   a loop"

Proof:
  - 6/6 attempts fired on the FORCED iteration
  - Each retry embedded the preceding attempts' errors as "do not
    repeat" context
  - Hard cap at MAX_TASK_RETRIES (6) prevents infinite loops
  - Last-ditch local fallback exists if all 6 still fail
  - Other iterations succeed on attempt 1 — the loop ONLY fires
    when errors are legitimately triggered

Stress run totals (runs/moan4h71/):
  6/6 iterations complete, 58 cloud calls, 306s end-to-end
  tree-splits: 6/6   continuations: 10   rescues: 2
  iter 3: 8197+2800 tok, 6 task attempts, 6 continuation retries
  local stored summary + per-iter JSON for inspection

What this proves that prior stress runs did NOT:
  - Error-recovery at task granularity is live, not aspirational
  - Compounding failure context flows between retries as text
  - Loop bound is enforced; runaway cases aren't possible
  - Two retry mechanisms compose without deadlock (continuation
    inside task-retry inside tree-split)

Follow-ups worth doing (separate PRs):
  - Persist retry-history to observer :3800 so cross-run learning
    sees the failure patterns
  - Route retries through /vectors/hybrid to surface similar prior
    errors from the real KB (currently only in-memory across one
    iteration)
  - Fix citation regex in summary — iter 6 received 5 prior IDs
    but counter shows 0 (regex needs to tolerate hyphens in IDs)
2026-04-22 17:50:53 -05:00
..