agent-governance

Author	SHA1	Message	Date
profit	ef18567674	Implement real supervisor-driven auto-recovery. Orchestrator changes: Force-spawn GAMMA on iteration_limit before abort; GAMMA.synthesize() creates emergency handoff payload; loadRecoveryContext() logs "Resuming from {task_id} handoff"; POST to /api/pipeline/log for resume message visibility. AgentGamma changes: Add synthesize() method for emergency abort synthesis; Merges existing proposals into coherent handoff; Stores as synthesis_type: "abort_recovery". Server changes: Add POST /api/pipeline/log endpoint for orchestrator logging; Recovery pipeline properly inherits GAMMA synthesis. Test coverage: test_auto_recovery.py: 6 unit tests; test_e2e_auto_recovery.py: 5 E2E tests; test_supervisor_recovery.py: 3 supervisor tests: Success on attempt 2 (recovery works), Max failures (3 retries then FAILED), Success on attempt 1 (no recovery needed). Recovery flow: 1. iteration_limit triggers; 2. GAMMA force-spawned for emergency synthesis; 3. Handoff dumped with GAMMA synthesis; 4. Exit code 3 triggers auto-recovery; 5. Recovery pipeline loads handoff; 6. Logs "Resuming from {prior_pipeline} handoff"; 7. Repeat up to 3 times or until success. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 19:47:56 -05:00
profit	c96919fe35	Implement real auto-recovery with handoff chain. Orchestrator changes: Add dumpAgentHandoff() to dump proposals/analysis before abort; Add loadRecoveryContext() to load inherited context on recovery runs; Add preseedBlackboard() to pre-seed inherited proposals; Force-spawn GAMMA immediately on recovery runs; Track isRecoveryRun, recoveryAttempt, inheritedContext, forceGamma. Server changes: Update recordConsensusFailure() to read orchestrator handoff JSON; Add collectFromBlackboard() helper as fallback; Update triggerAutoRecovery() with comprehensive context passing; Store inherited_handoff reference for recovery pipelines; Track retry_count, abort_reason, handoff_ref in recovery:* keys; Add recovery badge and prior pipeline link in UI. Test coverage: test_auto_recovery.py: 6 unit tests; test_e2e_auto_recovery.py: 5 E2E tests (handoff dump, recovery pipeline creation, inherited context, retry tracking, status update). Redis tracking keys: handoff:{pipeline_id}:agents (orchestrator dumps proposals here); handoff:{recovery_id}:inherited (recovery pipeline inherits from here); recovery:{pipeline_id} (retry_count, abort_reason, handoff_ref). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 19:39:52 -05:00
profit	a19535b580	Implement auto-recovery for consensus failures: Add iteration tracking and stuck detection to orchestrator; Add triggerAutoRecovery function for automatic pipeline respawn; Store structured failure context (proposals, conflicts, reason); Force GAMMA agent on recovery attempts for conflict resolution; Limit auto-recovery to 3 attempts to prevent infinite loops; Add UI status badges for rebooting/aborted states; Add failure-context API endpoint for orchestrator handoff; Add test_auto_recovery.py with 6 passing tests. Exit codes: 0=success, 1=error, 2=consensus failure, 3=aborted. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 19:28:27 -05:00
profit	8c6e7831e9	Add Phase 10-12 implementation: multi-tenant, marketplace, observability. Major additions: marketplace/: Agent template registry with FTS5 search, ratings, versioning; observability/: Prometheus metrics, distributed tracing, structured logging; ledger/migrations/: Database migration scripts for multi-tenant support; tests/governance/: 15 new test files for phases 6-12 (295 total tests); bin/validate-phases: Full 12-phase validation script. New features: Multi-tenant support with tenant isolation and quota enforcement; Agent marketplace with semantic versioning and search; Observability with metrics, tracing, and log correlation; Tier-1 agent bootstrap scripts. Updated components: ledger/api.py: Extended API for tenants, marketplace, observability; ledger/schema.sql: Added tenant, project, marketplace tables; testing/framework.ts: Enhanced test framework; checkpoint/checkpoint.py: Improved checkpoint management. Archived: External integrations (Slack/GitHub/PagerDuty) moved to .archive/; Old checkpoint files cleaned up. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 18:39:47 -05:00
profit	09be7eff4b	Add consensus failure handling with fallback options for multi-agent pipelines. Implements detection and recovery for when agents fail to reach consensus: Orchestrator exits with code 2 on consensus failure (distinct from error=1); Records failed run context (proposals, agent states, conflicts) to Dragonfly; Provides fallback options: rerun same, rerun with GAMMA, escalate tier, accept partial; Adds UI alert with action buttons for user-driven recovery; Adds failure details modal and downloadable failure report; Only marks pipeline complete when consensus achieved or user accepts fallback. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 18:24:19 -05:00
profit	8561d13728	Add Vault token management and observability integration for multi-agent pipelines: Vault token issuance per pipeline with 2-hour TTL; Automatic token renewal loop every 30 minutes; Error budget tracking with threshold-based revocation; Observability-driven token revocation for policy violations; Diagnostic pipeline spawning on error threshold breach; Structured handoff reports for error recovery; Agent lifecycle status API; New API endpoints: /api/pipeline/token, /api/pipeline/errors, /api/observability/handoff, /api/observability/diagnostic. Orchestrator now reports errors to parent pipeline's observability system via PIPELINE_ID environment variable. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 17:45:20 -05:00
profit	ccc3b01609	Fix orchestrator process hang after cleanup. The orchestrator process was hanging after completing its work because: 1. Fire-and-forget Redis operations in MessageBus.handleMessage() left unhandled promises that kept the event loop alive; 2. No explicit process.exit() call after cleanup. Changes: coordination.ts: Add .catch(() => {}) to fire-and-forget Redis ops; orchestrator.ts: Add explicit process.exit(exitCode) after cleanup; orchestrator.ts: Improve error handling in main() with proper exit codes. Tested: Pipeline mksup1wq completed full flow and exited cleanly. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 17:01:45 -05:00
profit	77655c298c	Initial commit: Agent Governance System Phase 8. Phase 8 Production Hardening with complete governance infrastructure: Vault integration with tiered policies (T0-T4); DragonflyDB state management; SQLite audit ledger; Pipeline DSL and templates; Promotion/revocation engine; Checkpoint system for session persistence; Health manager and circuit breaker for fault tolerance; GitHub/Slack integrations; Architectural test pipeline with bug watcher, suggestion engine, council review; Multi-agent chaos testing framework. Test Results: Governance tests: 68/68 passing; E2E workflow: 16/16 passing; Phase 2 Vault: 14/14 passing; Integration tests: 27/27 passing. Coverage: 57.6% average across 12 phases. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 22:07:06 -05:00

8 Commits
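
The exit codes and retry limit described in commits a19535b580 and ef18567674 can be illustrated with a minimal sketch. This is not the repository's code: `supervise`, `run_pipeline`, and the status strings are hypothetical names, and the sketch assumes that only exit code 3 (aborted) triggers a recovery run and that "up to 3 times" means three recovery runs after the original attempt.

```python
from enum import IntEnum
from typing import Callable, Tuple


class ExitCode(IntEnum):
    # Values taken from the commit message of a19535b580.
    SUCCESS = 0
    ERROR = 1
    CONSENSUS_FAILURE = 2
    ABORTED = 3


# Assumption: three recovery runs are allowed after the original attempt.
MAX_RECOVERY_ATTEMPTS = 3


def supervise(run_pipeline: Callable[[int], int]) -> Tuple[str, int]:
    """Run a pipeline and re-run it while it keeps aborting.

    run_pipeline is a hypothetical callable that receives the attempt
    number (1 for the original run) and returns the process exit code.
    Returns the final status and the number of attempts made.
    """
    attempt = 1
    code = run_pipeline(attempt)
    retries = 0
    # Assumption: only ABORTED (3) triggers auto-recovery in this sketch.
    while code == ExitCode.ABORTED and retries < MAX_RECOVERY_ATTEMPTS:
        retries += 1
        attempt += 1
        code = run_pipeline(attempt)
    status = "COMPLETED" if code == ExitCode.SUCCESS else "FAILED"
    return status, attempt
```

With this sketch, a pipeline that aborts once and then succeeds finishes on attempt 2, while one that always aborts is marked failed after the original run plus three recovery runs.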