Architectural Test Pipeline
Multi-layer oversight system ensuring no single hidden bug can compromise the Agent Governance System.
Overview
The Architectural Test Pipeline provides continuous validation across all 12 phases through multiple oversight layers that monitor, analyze, review, and report on system health.
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│                         ARCHITECTURAL TEST PIPELINE                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │   Bug Window    │───▶│   Suggestion    │───▶│     Council     │          │
│  │     Watcher     │    │     Engine      │    │     Review      │          │
│  │                 │    │                 │    │                 │          │
│  │ • Real-time     │    │ • Context-aware │    │ • Safety        │          │
│  │ • All phases    │    │ • Risk-ranked   │    │ • Performance   │          │
│  │ • Anomalies     │    │ • Auto-fixable  │    │ • Architecture  │          │
│  └────────┬────────┘    └────────┬────────┘    │ • Compliance    │          │
│           │                      │             │ • Quality       │          │
│           │                      │             └────────┬────────┘          │
│           │                      │                      │                   │
│           ▼                      ▼                      ▼                   │
│  ┌─────────────────────────────────────────────────────────────┐            │
│  │                       Phase Validator                       │            │
│  │  Phase 1 ✅ │ Phase 2 ✅ │ Phase 3 ✅ │ Phase 4 ✅ │ ...    │            │
│  │  Phase 5 ⭐ │ Phase 6 ✅ │ Phase 7 ✅ │ Phase 8 🚧 │ ...    │            │
│  └─────────────────────────────────────────────────────────────┘            │
│                                  │                                          │
│                                  ▼                                          │
│  ┌─────────────────┐    ┌─────────────────┐                                 │
│  │ Error Injector  │    │    Reporter     │                                 │
│  │                 │    │                 │                                 │
│  │ • Safe mode     │    │ • Markdown      │                                 │
│  │ • Scenarios     │    │ • Per-phase     │                                 │
│  │ • Validation    │    │ • Actions       │                                 │
│  └─────────────────┘    └─────────────────┘                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Oversight Layers
1. Bug Window Watcher (bug_watcher.py)
Real-time monitoring of every pipeline stage.
Features:
- Monitors all 12 phases continuously
- Detects anomalies: errors, regressions, missing artifacts, state inconsistencies
- Links findings to phase, directory, STATUS.md, and checkpoint entries
- Persists to DragonflyDB for cross-session tracking
Anomaly Types:
| Type | Description | Severity Range |
|---|---|---|
| UNHANDLED_ERROR | Uncaught exceptions | Medium-Critical |
| REGRESSION | Behavior change from baseline | High |
| MISSING_ARTIFACT | Required file/config missing | Low-High |
| STATE_INCONSISTENCY | Status mismatch | Medium |
| DEPENDENCY_UNAVAILABLE | Vault/Dragonfly/Ledger down | Critical |
| SECURITY_VIOLATION | Unacknowledged violation | Critical |
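The table above can be expressed as data. The following is a minimal sketch, not the contents of bug_watcher.py: `AnomalyType` is the enum name this README mentions, while `Severity`, `SEVERITY_RANGE`, and `clamp_severity` are hypothetical names introduced only for illustration.

```python
from enum import Enum, IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; integer ordering lets ranges be compared."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


class AnomalyType(Enum):
    """Anomaly types transcribed from the table above."""
    UNHANDLED_ERROR = "unhandled_error"
    REGRESSION = "regression"
    MISSING_ARTIFACT = "missing_artifact"
    STATE_INCONSISTENCY = "state_inconsistency"
    DEPENDENCY_UNAVAILABLE = "dependency_unavailable"
    SECURITY_VIOLATION = "security_violation"


# (lowest, highest) severity per anomaly type, as listed in the table.
SEVERITY_RANGE = {
    AnomalyType.UNHANDLED_ERROR: (Severity.MEDIUM, Severity.CRITICAL),
    AnomalyType.REGRESSION: (Severity.HIGH, Severity.HIGH),
    AnomalyType.MISSING_ARTIFACT: (Severity.LOW, Severity.HIGH),
    AnomalyType.STATE_INCONSISTENCY: (Severity.MEDIUM, Severity.MEDIUM),
    AnomalyType.DEPENDENCY_UNAVAILABLE: (Severity.CRITICAL, Severity.CRITICAL),
    AnomalyType.SECURITY_VIOLATION: (Severity.CRITICAL, Severity.CRITICAL),
}


def clamp_severity(kind: AnomalyType, proposed: Severity) -> Severity:
    """Force a proposed severity into the range allowed for this anomaly type."""
    lowest, highest = SEVERITY_RANGE[kind]
    return max(lowest, min(highest, proposed))
```

Clamping is one possible way to keep a detector from reporting, say, a missing artifact as Critical when the table caps it at High.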
2. Suggestion Engine (suggestion_engine.py)
AI-driven analysis using historical context.
Features:
- Gathers context from checkpoints, memory, STATUS files
- Pattern-based suggestions from known fixes
- Context-aware suggestions from historical outcomes
- Risk/impact ranking for prioritization
Suggestion Ranking:
Priority Score = Impact × (1 - Risk)
Impact Levels: transformative (1.0) > high (0.8) > medium (0.6) > low (0.4)
Risk Levels: critical (0.2) < high (0.4) < medium (0.6) < low (0.8)
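As a worked illustration, the formula and the level values above can be applied literally. This is a sketch under that assumption only; `priority_score` and `rank` are hypothetical helper names, not functions confirmed to exist in suggestion_engine.py.

```python
# Level values copied from the ranking description above.
IMPACT = {"transformative": 1.0, "high": 0.8, "medium": 0.6, "low": 0.4}
RISK = {"critical": 0.2, "high": 0.4, "medium": 0.6, "low": 0.8}


def priority_score(impact: str, risk: str) -> float:
    """Priority Score = Impact x (1 - Risk), using the values as listed."""
    return IMPACT[impact] * (1 - RISK[risk])


def rank(suggestions):
    """Sort (name, impact, risk) tuples by descending priority score."""
    return sorted(
        suggestions,
        key=lambda s: priority_score(s[1], s[2]),
        reverse=True,
    )


if __name__ == "__main__":
    candidates = [
        ("restore missing config", "high", "low"),
        ("rewrite checkpoint store", "transformative", "critical"),
        ("update STATUS.md", "low", "low"),
    ]
    for name, impact, risk in rank(candidates):
        print(f"{priority_score(impact, risk):.2f}  {name}")
```

Read literally, `1 - Risk` is largest for critical-risk suggestions, so a critical-risk suggestion outranks a low-risk one of equal impact. If the listed risk values are already inverted weights (higher meaning safer), the intended product may be Impact times that value directly. This README does not settle which reading is correct.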
3. Council Review (council.py)
Multi-perspective review with 5 specialized reviewers.
Reviewers:
| Role | Focus | Risk Tolerance |
|---|---|---|
| Safety | Security, access control | Very Low (0.2) |
| Performance | Latency, throughput | Medium (0.6) |
| Architecture | Design, maintainability | Medium (0.5) |
| Compliance | Governance, policies | Low (0.3) |
| Quality | Testing, documentation | Low (0.4) |
Decision Types:
- AUTO_APPROVE - Safe to auto-implement
- HUMAN_APPROVE - Approved, needs human
- DEFER - Needs more discussion
- REJECT - Do not implement
- ESCALATE - Needs higher authority
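The reviewer table and decision types can be sketched as plain data structures. `ReviewerRole`, `ReviewerProfile`, and `REVIEWERS` are names this README mentions; `DecisionType` and `objecting_roles` are assumed names. The objection rule (a reviewer objects when a suggestion's risk exceeds its tolerance, on a 0 to 1 scale where higher means riskier) is purely hypothetical, since how council.py actually combines votes is not described here.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewerRole(Enum):
    SAFETY = "safety"
    PERFORMANCE = "performance"
    ARCHITECTURE = "architecture"
    COMPLIANCE = "compliance"
    QUALITY = "quality"


class DecisionType(Enum):
    AUTO_APPROVE = "auto_approve"    # safe to auto-implement
    HUMAN_APPROVE = "human_approve"  # approved, needs human
    DEFER = "defer"                  # needs more discussion
    REJECT = "reject"                # do not implement
    ESCALATE = "escalate"            # needs higher authority


@dataclass(frozen=True)
class ReviewerProfile:
    role: ReviewerRole
    focus: str
    risk_tolerance: float


# Values transcribed from the reviewer table above.
REVIEWERS = {
    ReviewerRole.SAFETY: ReviewerProfile(ReviewerRole.SAFETY, "Security, access control", 0.2),
    ReviewerRole.PERFORMANCE: ReviewerProfile(ReviewerRole.PERFORMANCE, "Latency, throughput", 0.6),
    ReviewerRole.ARCHITECTURE: ReviewerProfile(ReviewerRole.ARCHITECTURE, "Design, maintainability", 0.5),
    ReviewerRole.COMPLIANCE: ReviewerProfile(ReviewerRole.COMPLIANCE, "Governance, policies", 0.3),
    ReviewerRole.QUALITY: ReviewerProfile(ReviewerRole.QUALITY, "Testing, documentation", 0.4),
}


def objecting_roles(risk: float) -> list[ReviewerRole]:
    """Hypothetical rule: a reviewer objects when risk exceeds its tolerance."""
    return [p.role for p in REVIEWERS.values() if risk > p.risk_tolerance]
```

Under this assumed rule, a suggestion with risk 0.45 would draw objections from the Safety, Compliance, and Quality reviewers but not from Performance or Architecture.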
4. Phase Validator (phase_validator.py)
Ensures all phases have required components.
Validation Levels:
| Level | Description |
|---|---|
| BASIC | Existence checks only |
| STANDARD | + Functionality tests |
| THOROUGH | + Integration tests |
| COMPREHENSIVE | + Chaos/edge cases |
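The "+" entries in the table suggest that each level includes everything below it. A minimal sketch of that cumulative reading follows; `ValidationLevel`, `CHECKS_ADDED`, and `checks_for` are assumed names, not taken from phase_validator.py.

```python
from enum import IntEnum


class ValidationLevel(IntEnum):
    """Levels ordered so that a higher level implies all lower ones."""
    BASIC = 1
    STANDARD = 2
    THOROUGH = 3
    COMPREHENSIVE = 4


# What each level adds on top of the previous one, per the table above.
CHECKS_ADDED = {
    ValidationLevel.BASIC: "existence checks",
    ValidationLevel.STANDARD: "functionality tests",
    ValidationLevel.THOROUGH: "integration tests",
    ValidationLevel.COMPREHENSIVE: "chaos/edge cases",
}


def checks_for(level: ValidationLevel) -> list[str]:
    """Return every check category that runs at the given level."""
    return [CHECKS_ADDED[lvl] for lvl in ValidationLevel if lvl <= level]
```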
Special Attention: Phase 5
Phase 5 (Agent Bootstrapping) receives extra validation as the current focus.
5. Error Injector (error_injector.py)
Controlled fault injection for testing oversight.
Predefined Scenarios:
| Scenario | Type | Target |
|---|---|---|
| missing_config | MISSING_FILE | agents/tier0-agent/config/agent.json |
| corrupted_status | INVALID_STATUS | checkpoint/STATUS.md |
| stale_checkpoint | STATE_INCONSISTENCY | checkpoint/storage/ |
| redis_key_missing | DEPENDENCY_FAILURE | oversight:watcher |
| violation_unacked | SECURITY_VIOLATION | violations table |
| blocked_directory | STATE_INCONSISTENCY | preflight/ |
Safe Mode: By default, injections are simulated without modifying files.
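One way to picture safe mode is an injector that only records what it would do unless safe mode is explicitly turned off. The sketch below illustrates that idea for a MISSING_FILE scenario. `InjectionResult`, `inject_missing_file`, and `restore_missing_file` are hypothetical names, and the rename-to-backup strategy is an assumption, not a description of error_injector.py.

```python
from dataclasses import dataclass
from pathlib import Path

BACKUP_SUFFIX = ".injected-bak"  # hypothetical suffix for the moved file


@dataclass
class InjectionResult:
    scenario: str
    target: Path
    simulated: bool  # True when nothing on disk was touched


def inject_missing_file(target: Path, safe_mode: bool = True) -> InjectionResult:
    """Make `target` appear missing; in safe mode, only report the intent."""
    if safe_mode:
        return InjectionResult("missing_config", target, simulated=True)
    # Real injection: move the file aside so it can be restored afterwards.
    target.rename(target.with_name(target.name + BACKUP_SUFFIX))
    return InjectionResult("missing_config", target, simulated=False)


def restore_missing_file(target: Path) -> None:
    """Undo a real injection by moving the backup back into place."""
    backup = target.with_name(target.name + BACKUP_SUFFIX)
    if backup.exists():
        backup.rename(target)
```

Defaulting `safe_mode` to `True` means a caller has to opt in before anything on disk changes, which matches the default described above.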
6. Reporter (reporter.py)
Comprehensive reporting with actionable insights.
Report Sections:
- Executive Summary
- Phase Status Matrix
- Bug Watcher Summary
- Suggestion Engine Summary
- Council Decisions
- Injection Test Results
- Pending Actions
- Critical Issues
CLI Usage
# Full pipeline execution
oversight run
# With injection tests
oversight run --inject
# Focus on specific phase
oversight run --phase 5
# Quick validation
oversight quick
# Validate specific phase in detail
oversight validate --phase 5
# Generate report only
oversight report
# Show phase matrix
oversight matrix
# JSON output
oversight run --json
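For scripted use, the commands above could be wrapped from Python. This is a sketch under several assumptions: that the flags shown can be combined in one invocation, that `--json` prints a single JSON document on stdout, and that a successful run exits with status 0. None of these is confirmed by this README, and `build_run_command` and `run_pipeline` are hypothetical helpers.

```python
import json
import shutil
import subprocess


def build_run_command(phase=None, inject=False, as_json=False):
    """Assemble an `oversight run` argument list from the documented flags."""
    cmd = ["oversight", "run"]
    if inject:
        cmd.append("--inject")
    if phase is not None:
        cmd += ["--phase", str(phase)]
    if as_json:
        cmd.append("--json")
    return cmd


def run_pipeline(phase=None, inject=False):
    """Run the pipeline and parse its stdout as JSON (assumed output format)."""
    cmd = build_run_command(phase=phase, inject=inject, as_json=True)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("the 'oversight' command is not on PATH")
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```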
Integration with Checkpoints/Status/Memory
Checkpoints
- Watcher checks checkpoint consistency and staleness
- Suggestions reference recent checkpoint context
- Reports include checkpoint link
STATUS.md Files
- Watcher scans STATUS.md for BLOCKED states
- Phase validator checks STATUS.md existence
- Reports track per-directory status
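The STATUS.md scan can be sketched as a directory walk. The format of STATUS.md is not specified here, so this example assumes a blocked directory simply contains the literal word BLOCKED somewhere in its STATUS.md; `find_blocked_directories` is a hypothetical helper, not the watcher's actual method.

```python
from pathlib import Path


def find_blocked_directories(root):
    """Return directories under `root` whose STATUS.md mentions BLOCKED."""
    blocked = []
    for status_file in sorted(Path(root).rglob("STATUS.md")):
        text = status_file.read_text(encoding="utf-8", errors="replace")
        if "BLOCKED" in text:  # assumed marker; real files may be structured
            blocked.append(status_file.parent)
    return blocked
```

For example, `find_blocked_directories("/opt/agent-governance")` would list every directory under that root whose STATUS.md contains the marker.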
Memory Layer
- Suggestion engine queries memory for related entries
- Context gathered from summaries directory
- Report counts available memory entries
Running Tests
Injection Test Suite
# Run all injection scenarios
oversight run --inject
# Or use injector directly
cd /opt/agent-governance/testing/oversight
python -m testing.oversight.error_injector test-all
Expected Results
A healthy system should:
- Detect all injected errors (100% detection rate)
- Generate relevant, accurate suggestions for the detected anomalies
- Produce council decisions for each suggestion
- Pass all injection tests
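The 100% detection criterion is simple arithmetic: detected scenarios divided by injected scenarios. A minimal sketch, assuming results are available as a mapping from scenario name to whether the watcher detected it (a hypothetical shape, not the injector's actual output):

```python
def detection_rate(results: dict[str, bool]) -> float:
    """Fraction of injected scenarios that were detected."""
    if not results:
        raise ValueError("no injection results to evaluate")
    return sum(results.values()) / len(results)


if __name__ == "__main__":
    # Scenario names from the predefined scenarios table; outcomes are made up.
    results = {
        "missing_config": True,
        "corrupted_status": True,
        "stale_checkpoint": True,
        "redis_key_missing": True,
        "violation_unacked": True,
        "blocked_directory": False,
    }
    rate = detection_rate(results)
    missed = [name for name, detected in results.items() if not detected]
    print(f"detection rate: {rate:.1%}, missed: {missed}")
```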
Extending the Pipeline
Adding a New Anomaly Type
- Add to AnomalyType enum in bug_watcher.py
- Add detection logic in _run_phase_specific_checks()
- Add fix patterns in SuggestionEngine.FIX_PATTERNS
Adding a New Council Reviewer
- Add role to ReviewerRole enum in council.py
- Create ReviewerProfile in REVIEWERS dict
- Implement _<role>_review() method
Adding a New Injection Scenario
- Add to SCENARIOS dict in error_injector.py
- Implement injection/cleanup in _perform_injection()
File Structure
testing/oversight/
├── __init__.py # Package exports
├── pipeline.py # Main orchestrator
├── bug_watcher.py # Real-time anomaly detection
├── suggestion_engine.py # Fix recommendations
├── council.py # Multi-agent review
├── phase_validator.py # Phase coverage
├── error_injector.py # Fault injection
├── reporter.py # Report generation
├── README.md # This file
└── reports/ # Generated reports
Example Report
# Architectural Test Pipeline Report
**Generated:** 2026-01-23T12:00:00Z
**Report ID:** rpt-20260123-120000
## Executive Summary
- **Phases Validated:** 12
- **Average Coverage:** 75.3%
- **Total Anomalies:** 8
- **Critical Gaps:** 2
## Phase Status Matrix
| Phase | Name | Status | Coverage | Bugs |
|-------|------|--------|----------|------|
| 1 | Foundation | ✅ complete | 95.0% | 0 |
| 5 | Agent Bootstrapping | 🚧 in_progress | 80.0% | 2 |
| 8 | Production Hardening | ❌ blocked | 40.0% | 3 |
...
Troubleshooting
Pipeline Fails to Start
- Verify DragonflyDB is running: redis-cli -p 6379 -a governance2026 PING
- Check Vault status: docker exec vault vault status
No Anomalies Detected
- Ensure STATUS.md files exist in directories
- Check checkpoint storage has recent entries
Injection Tests Fail
- Verify safe mode is enabled (default)
- Check file permissions in target directories
Related Documentation
- CONTEXT_MANAGEMENT.md - Checkpoints and STATUS
- MEMORY_LAYER.md - External memory
- STATUS_PROTOCOL.md - Directory status protocol