- Add iteration tracking and stuck detection to orchestrator
- Add triggerAutoRecovery function for automatic pipeline respawn
- Store structured failure context (proposals, conflicts, reason)
- Force GAMMA agent on recovery attempts for conflict resolution
- Limit auto-recovery to 3 attempts to prevent infinite loops
- Add UI status badges for rebooting/aborted states
- Add failure-context API endpoint for orchestrator handoff
- Add test_auto_recovery.py with 6 passing tests
Exit codes: 0=success, 1=error, 2=consensus failure, 3=aborted
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements detection and recovery for when agents fail to reach consensus:
- Orchestrator exits with code 2 on consensus failure (distinct from error=1)
- Records failed run context (proposals, agent states, conflicts) to Dragonfly
- Provides fallback options: rerun same, rerun with GAMMA, escalate tier, accept partial
- Adds UI alert with action buttons for user-driven recovery
- Adds failure details modal and downloadable failure report
- Only marks pipeline complete when consensus achieved or user accepts fallback
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Vault token issuance per pipeline with 2-hour TTL
- Automatic token renewal loop every 30 minutes
- Error budget tracking with threshold-based revocation
- Observability-driven token revocation for policy violations
- Diagnostic pipeline spawning on error threshold breach
- Structured handoff reports for error recovery
- Agent lifecycle status API
- New API endpoints: /api/pipeline/token, /api/pipeline/errors,
/api/observability/handoff, /api/observability/diagnostic
Orchestrator now reports errors to parent pipeline's observability
system via PIPELINE_ID environment variable.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The orchestrator process was hanging after completing its work because:
1. Fire-and-forget Redis operations in MessageBus.handleMessage() left
unhandled promises that kept the event loop alive
2. No explicit process.exit() call after cleanup
Changes:
- coordination.ts: Add .catch(() => {}) to fire-and-forget Redis ops
- orchestrator.ts: Add explicit process.exit(exitCode) after cleanup
- orchestrator.ts: Improve error handling in main() with proper exit codes
Tested: Pipeline mksup1wq completed full flow and exited cleanly.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>