Three issues from J 2026-04-30:
1. Silent fail2ban-client / nginx subprocess failures
Pre-fix _execute_ban called fail2ban-client with capture_output=True
but threw away the result. _nginx_ban had bare `except: pass`
swallowing everything. So a non-zero fail2ban exit (jail not
configured, IP already banned, IPv6 quirk) or PermissionError
on /etc/nginx/banned_ips.conf logged "AI_BAN" while the attacker
walked through unimpeded. These were the "errors in logs" J was seeing.
Now every subprocess failure surfaces:
- FAIL2BAN_FAILED rc=N stderr=... — non-zero exit
- FAIL2BAN_TIMEOUT — client didn't return in 5s
- FAIL2BAN_NOT_INSTALLED — binary missing
- NGINX_BAN_WRITE_DENIED — permission error on conf file
- NGINX_RELOAD_FAILED rc=N stderr=... — systemctl reload non-zero
- NGINX_RELOAD_TIMEOUT / NGINX_RELOAD_NO_SYSTEMCTL — runtime gaps
sec_log.error records each of these, so journalctl -u llm-team-ui
shows the actual reason a ban didn't stick.
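A minimal sketch of the wrapper shape. The helper name (_run_checked)
is an assumption; only the log tokens above come from the actual
change:

    import logging
    import subprocess

    sec_log = logging.getLogger("security")  # the app's sec_log, name assumed

    def _run_checked(cmd, tag, timeout=5):
        """Run a privileged command; surface every failure mode in sec_log."""
        try:
            r = subprocess.run(cmd, capture_output=True, text=True,
                               timeout=timeout)
        except FileNotFoundError:
            sec_log.error("%s_NOT_INSTALLED cmd=%s", tag, cmd[0])
            return False
        except subprocess.TimeoutExpired:
            sec_log.error("%s_TIMEOUT cmd=%s", tag, cmd[0])
            return False
        if r.returncode != 0:
            sec_log.error("%s_FAILED rc=%d stderr=%s", tag, r.returncode,
                          (r.stderr or "").strip())
            return False
        return True

    # e.g. _run_checked(["fail2ban-client", "set", jail, "banip", ip], "FAIL2BAN")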
2. AI auto-scan failure callback when model is busy
Pre-fix, an unreachable / busy / timed-out Ollama silently preserved
the log position and skipped the scan. The operator only learned
about the gap by manually checking sentinel-status. Now:
- 1 retry inside same scan after SENTINEL_AI_RETRY_DELAY_SECS
(30s) on connection error / timeout / 429 / 503
- 4xx errors that won't recover (404 model missing, 400 bad
prompt) fail fast without retrying
- consecutive_ai_failures counter in _sentinel_stats
- On 3+ consecutive failures, send_security_alert() fires —
"Sentinel AI unreachable" email with last error + endpoint
+ model name. One alert per outage (ai_busy_alerted flag);
clears on first successful scan so flapping doesn't spam.
- AI_RECOVERED log line on first scan after a streak.
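The retry core in sketch form; the endpoint variable and helper name
are assumptions, and the consecutive-failure counter / alert flag
live in the caller:

    import time
    import requests

    SENTINEL_AI_RETRY_DELAY_SECS = 30
    RETRYABLE_STATUS = {429, 503}

    def _ai_scan_once(url, payload):
        """One scan attempt with a single in-scan retry on transient errors."""
        for attempt in (1, 2):
            try:
                r = requests.post(url, json=payload, timeout=60)
            except (requests.ConnectionError, requests.Timeout):
                if attempt == 1:
                    time.sleep(SENTINEL_AI_RETRY_DELAY_SECS)
                    continue
                raise
            if r.status_code in RETRYABLE_STATUS and attempt == 1:
                time.sleep(SENTINEL_AI_RETRY_DELAY_SECS)
                continue
            r.raise_for_status()  # 404/400 raise here — fail fast, no retry
            return r.json()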
3. Sentinel ban path still substring-matched 192.168
The same vulnerability class admin_ban_ip had — the old check
protected only one /16. Replaced 4 sites with is_allowlisted(ip)
(sketched after this list):
- threat-list display filter (line 7638): now hides ALL
allowlisted IPs from the panel
- mass-ban API (line 8016): refuses ban for any allowlisted IP
- sentinel analysis filter (line 12786): saves AI tokens by
never sending allowlisted-IP traffic to the judge
- sentinel ban verdict gate (line 12949): defense in depth —
even if the AI says "ban" on an allowlisted IP, this catches it
Combined with the layered defenses in b09b73c (track_violation,
_auto_escalate, _nginx_ban, admin_ban_ip), there is now no
code path that can ban an allowlisted IP. Operator self-ban
is structurally impossible.
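A sketch of what is_allowlisted could look like, assuming
ALLOWLIST_IPS holds single addresses and/or CIDR blocks (the commit
doesn't show its body; the entries below are illustrative):

    import ipaddress

    ALLOWLIST_IPS = ["127.0.0.1", "::1", "192.168.0.0/16", "10.0.0.0/8"]
    _ALLOW_NETS = [ipaddress.ip_network(e, strict=False) for e in ALLOWLIST_IPS]

    def is_allowlisted(ip: str) -> bool:
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            return False  # unparseable input is never treated as allowlisted
        return any(addr in net for net in _ALLOW_NETS)

Membership against parsed networks covers IPv6 (::1) and whole
ranges, which a substring match on "192.168" never could.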
Privilege note: the systemd unit at /root/llm-team-ui/llm-team-ui.service
runs as User=root, so subprocess.run(["fail2ban-client", ...]) and
systemctl reload nginx have permission. The "errors in logs" J was
seeing weren't permission-denied; they were silent non-zero exits.
The new subprocess wrappers surface those.
If the operator later splits the app into a non-root tier
(Opus OB-3 architectural recommendation, deferred), this same
infrastructure still works — the wrappers will then surface
"PermissionError" with full path + uid context, telling the
operator exactly which command needs sudo NOPASSWD or
PolicyKit rule.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UX request 2026-04-30: when sorting by threat in the threat intel
panel, banning a selection required clicking each per-row checkbox
individually. Pages with 20-50 threats made bulk-ban tedious.
Adds a master `[ ] all` checkbox to the toolbar (right of the
Sort buttons, left of the existing 'N selected' counter) that
toggles every per-row .ip-check on the page in one click. Then
'Ban Selected' / 'Unban Selected' work over the whole set.
Three-state: unchecked (none selected) / checked (all) /
indeterminate (partial — browsers render this as a "half-tick", so
operators get visual feedback when they've toggled some rows
manually after using the master). updateSelCount keeps the master
in sync as individual rows toggle, so the visual is always
truthful.
No backend change — `/api/admin/security/mass-ban` already
accepts an arbitrary IP list. This is purely a frontend
ergonomics improvement on top of the existing mass-action
infrastructure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The OB-4 fix at 939dfdd was incomplete. It bypassed the path-regex
EXPLOIT check for allowlisted-IP admins, but _track_violation is
called from THREE sites (exploit_scan, rate_limit, login_fail) and
only the exploit_scan path had the bypass. An admin who hit a rate
limit + had a session timeout + entered a wrong password within 60s
could still self-ban via the OTHER paths, ending up locked out
through the back door.
This commit adds 4 layers of defense in depth, each independently
sufficient to stop an allowlisted IP from being banned:
1. _track_violation: bail early if is_allowlisted(ip). Allowlisted
IPs never accumulate violations from ANY path. Plus an eviction
sweep when _violation_tracker grows >10K (same shape as the
_rate_limit eviction at 266de61). Layers 1-2 are sketched after
this list.
2. _auto_escalate: re-check is_allowlisted before issuing any ban.
Defense in depth — if a future call path bypasses #1, this
catches it.
3. _nginx_ban: refuse to write the deny rule for allowlisted IPs,
even if a buggy caller reached this far. Last write before nginx
reload; last place to stop a bad ban.
4. admin_ban_ip: replace `ip.startswith("192.168.")` substring
check with the canonical ALLOWLIST_IPS membership test. Pre-fix
this only protected one LAN; 10.0.0.0/8, IPv6 loopback ::1, and
custom allowlist entries (e.g. an external monitoring IP) were
all bannable by manual admin error. Now uses the same allowlist
as the auto-ban paths.
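Layers 1-2 in sketch form. Function bodies and the _violation_tracker
shape are assumptions; only the gate placement and log token come
from this message:

    import time

    def _track_violation(ip, kind):
        if is_allowlisted(ip):
            return  # layer 1: allowlisted IPs never accumulate violations
        if len(_violation_tracker) > 10_000:
            _evict_stale(_violation_tracker)  # hypothetical sweep helper
        _violation_tracker.setdefault(ip, []).append((time.time(), kind))

    def _auto_escalate(ip):
        if is_allowlisted(ip):  # layer 2: re-check before any ban issues
            sec_log.warning("AUTO_ESCALATE_BLOCKED ip=%s — allowlisted", ip)
            return
        _issue_ban(ip)  # hypothetical: real path calls fail2ban + nginx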
Operationally the admin can no longer self-ban through any path.
The auto-escalate audit log gets a corresponding
"AUTO_ESCALATE_BLOCKED ip=... — allowlisted" entry instead of the
ban firing silently. Same for nginx_ban: NGINX_BAN_BLOCKED entries
record each blocked ban for operator review.
Builds on 939dfdd (OB-4 path-regex bypass) + 266de61 (rate_limit
eviction + auth_login IP gate). Together these three commits close
the LLM Team UI's ban-system self-foot-shoot vulnerability surface.
Outstanding from the scrum (architectural, separate session):
- OB-3 root-running web app + privileged shell calls
- Sentinel prompt-injection WARN
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the 2 remaining surgical-fix WARNs from the 2026-04-30
cross-lineage scrum on this codebase. OB-3 (root-running web app
with shell calls to fail2ban-client / systemctl / nginx config) and
the sentinel prompt-injection WARN both need bigger architectural
work and stay deferred.
OB-rate-limit (Opus WARN) — _rate_limit dict unbounded
Pre-fix: per-worker dict with no eviction; an attacker slowly
rotating IPs could grow it without bound. Fix: lazy eviction sweep
triggered when dict grows beyond 10K entries (cheap because we
only scan when growth is unusual). Real production wants a
Redis-backed shared counter; this is the in-process band-aid
that prevents runaway growth without changing the deploy shape.
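A sketch of the lazy sweep, assuming _rate_limit maps ip → list of
hit timestamps (the window/limit values here are illustrative):

    import time

    _rate_limit: dict[str, list[float]] = {}
    EVICT_THRESHOLD = 10_000

    def check_rate_limit(ip, window=60.0, limit=30):
        now = time.time()
        if len(_rate_limit) > EVICT_THRESHOLD:
            # only scan when growth is unusual; drop entries whose newest
            # hit is already past the window — never an active count
            stale = [k for k, ts in _rate_limit.items() if now - ts[-1] > window]
            for k in stale:
                del _rate_limit[k]
        hits = [t for t in _rate_limit.get(ip, []) if now - t < window]
        hits.append(now)
        _rate_limit[ip] = hits
        return len(hits) <= limit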
OB-auth-setup (Opus WARN) — first-time setup grant from any IP
Pre-fix: /api/auth/login with setup=true was gated only by
COUNT(*) FROM users == 0. If the users table was ever truncated
or restored empty, the next external visitor (ANY IP) claimed
admin. Fix: also require the source IP to be in ALLOWLIST_IPS
(typically loopback + LAN gateway). Local operator setup still
works; remote attackers hitting the endpoint after an empty-
users state get 403.
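The gate in sketch form; route internals and helper names
(user_count, create_first_admin, normal_login) are assumptions:

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.post("/api/auth/login")
    def auth_login():
        body = request.get_json(force=True) or {}
        if body.get("setup"):
            if user_count() != 0:  # hypothetical: SELECT COUNT(*) FROM users
                return jsonify(error="setup already completed"), 403
            if not is_allowlisted(request.remote_addr):
                return jsonify(error="setup restricted to allowlisted sources"), 403
            return create_first_admin(body)  # hypothetical
        return normal_login(body)            # hypothetical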
Both fixes are surgical — single function, no behavior change for
the happy path. The eviction sweep runs O(n) only when n>10K and
only drops entries already past their useful window, so it never
removes an active rate-limit count.
Outstanding from the scrum (deferred):
- OB-3 root-running web app: needs split into non-root Flask tier
+ privileged sudo wrapper service. 2-4 hr architectural work.
- Sentinel prompt-injection WARN: feeds attacker-controlled UA/
path into LLM judge prompt. Needs prompt-template hardening or
output validation gate before LLM verdicts can issue ban actions.
- CSP unsafe-inline WARN: defeats most XSS protection. Removing
it requires moving inline scripts to external files (HTML
refactor).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-lineage scrum (Opus 4.7 + Kimi K2.6 + Qwen3-coder via the
local-review-harness chatd) on this codebase surfaced 5 BLOCK-class
issues from Opus + a convergent finding from the harness. This
commit lands the 4 surgical fixes; OB-3 (web app runs as root with
fail2ban-client + systemctl reload nginx + writes to
/etc/nginx/banned_ips.conf) needs an architectural split into a
non-root web tier + a privileged sudo wrapper, deferred for its
own session.
OB-1 — log file open at import crashes app on perm error
Pre-fix: `_sec_handler = logging.FileHandler("/var/log/llm-team-security.log")`
raised PermissionError at import time on any non-root or
fresh-install run, killing the app before Flask started — and the
failure was silent (no running Flask process whose logs could be
inspected).
Fix: try/except, fall back to StreamHandler(sys.stderr) when
the path is unwritable. App starts; sec_log events still land
in journald via stderr. LLM_TEAM_SECURITY_LOG env var lets
operators override the path.
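The fallback shape (handler wiring beyond the try/except is
assumed):

    import logging
    import os
    import sys

    _path = os.environ.get("LLM_TEAM_SECURITY_LOG",
                           "/var/log/llm-team-security.log")
    try:
        _sec_handler = logging.FileHandler(_path)
    except OSError:
        # unwritable path (non-root, fresh install): stderr → journald
        _sec_handler = logging.StreamHandler(sys.stderr)
    sec_log = logging.getLogger("security")
    sec_log.addHandler(_sec_handler)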
OB-2 — DB password hardcoded in source (CONVERGENT FINDING)
The `kbuser` Postgres credential
`IPbLBA0EQI8u4TeM2YZrbm1OAy5nSwqC` was leaked in source here
AND in voice-ai/audiosocket_bridge.py + voice-ai/sales_assistant.py.
Caught independently by harness LLM phase (qwen3.5 local) on
voice-ai earlier today AND Opus on this file just now. Same
password, same DB (`knowledge_base`) shared between services,
three reviewers converged.
Fix: source from LLM_TEAM_DB_DSN env var, fail loud on unset.
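The fail-loud pattern, sketched (variable name aside from the env
var is an assumption):

    import os

    DB_DSN = os.environ.get("LLM_TEAM_DB_DSN")
    if not DB_DSN:
        raise RuntimeError(
            "LLM_TEAM_DB_DSN not set — refusing to start without a DB "
            "credential (see /etc/llm-team-ui.env, mode 0600)"
        )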
Operator follow-ups:
1. Rotate the password in Postgres (still in git history;
redacting source doesn't un-leak it).
2. Set LLM_TEAM_DB_DSN in /etc/llm-team-ui.env (mode 0600,
loaded via systemd EnvironmentFile=).
3. Same DSN env-var pattern needs applying to
voice-ai/audiosocket_bridge.py:47 once that branch's
workspace_context WIP lands.
OB-5 — demo_mode default=True ships public access on first boot
Pre-fix: `_demo_mode = {"active": True, ...}` + the demo branch
in login_required let users through without a session. Combined
with /api/run + /api/imagegen proxies, fresh installs were an open
LLM/compute abuse surface from first boot.
Fix: default to False. An LLM_TEAM_DEMO_MODE=1 env override exists
for the public devop.live systemd unit, so that deployment doesn't
need a manual flip on every restart; everywhere else the mode
defaults closed.
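Sketch of the closed-by-default initializer (dict shape per the
pre-fix snippet above):

    import os

    # Closed by default; only the public devop.live unit sets the override.
    _demo_mode = {
        "active": os.environ.get("LLM_TEAM_DEMO_MODE") == "1",
        "showcase": False,
    }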
OB-4 — EXPLOIT_PATTERNS LAN/admin lockout
Pre-fix: regex matched on `request.path` + query string against
patterns like UNION, SELECT, ;--, <script, admin.php. Admin
URLs containing those keywords in legitimate ways (e.g. a team
name "select-rebrand" or a docs link /admin/select_a_mode) hit
3 violations in 60s and auto-banned the admin's IP. No allowlist.
Fix: bypass the path-based check for authenticated admins from
an ALLOWLIST_IPS source. Body/UA checks still apply (the prompt-
injection-as-DoS WARN in the scrum is separate). Combination
prevents self-ban without weakening the broader scanner defense.
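The bypass condition, sketched; the session key is an assumption:

    from flask import session

    def _path_scan_applies(ip: str) -> bool:
        """Skip the path/query regex for authenticated admins on
        allowlisted IPs; body/UA checks still run for everyone."""
        return not (session.get("is_admin") and is_allowlisted(ip))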
Plus a .gitignore entry: /.memory/ — the local-review-harness writes
JSONL findings under <repo>/.memory/ when scanning; harness's own
gitignore is at the harness repo root, not here, so without this
the .memory/ dir would show up as untracked on every harness run
against this tree.
Other Opus WARNs deferred:
- Sentinel feeds attacker-controlled UA into LLM prompt → can
steer ban verdicts. Fix needs prompt-template hardening or
output-validation gate.
- CSP `'unsafe-inline'` defeats most XSS protection (would break
inline scripts; needs HTML refactor).
- _rate_limit unbounded dict + per-worker (needs eviction loop or
Redis-backed counter).
- auth_login first-time setup gated only by COUNT(*)==0 (needs
network-source restriction or a setup token).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New mode: Deep Analysis — 6-phase autonomous pipeline:
1. Research: all selected models answer in parallel
2. Debate: models challenge each other's findings
3. Consensus: merge research with critiques, identify strong/weak points
4. Self-Eval: structured scoring (accuracy, completeness, actionability, nuance)
5. Final Synthesis: strongest model produces definitive answer
6. Knowledge Base: result stored for future RAG retrieval
Designed for cloud models (Ollama Cloud, OpenRouter). Every successful
run trains the local knowledge base so future adaptive runs benefit.
Purple accent in mode selector to distinguish from standard modes.
Token tracking fix:
- Added est_tokens, input_chars, output_chars columns to team_runs
- save_run() now calculates and stores token estimates for ALL runs
- Both logged-in and public/demo/showcase runs track tokens
- Enables accurate usage analytics across all users
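A plausible shape for the estimate; the commit doesn't give the
formula, so the ~4 chars/token heuristic below is an assumption:

    def estimate_tokens(input_chars: int, output_chars: int) -> int:
        # ~4 characters per token is a common rough heuristic for
        # English text; the exact formula save_run() uses may differ
        return (input_chars + output_chars) // 4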
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs preventing adaptive mode from working:
- ml-adaptive was missing from ML_IDS array (line 3160) — model checkboxes
never rendered, so no models were selected
- adaptive-synthesizer was missing from populateAllSelects() ids array
(line 3733) — synthesizer/judge dropdown was always empty
Both are single-line fixes. The backend pipeline (run_adaptive) was
complete and correct — self-eval, RAG retrieval, escalation, quality
scoring, KB storage all work. The UI just never wired the config.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Endpoint now returns all models, not just free ones
- Each model includes: name, context_length, free flag, prompt/completion cost
- UI shows pricing: "128K ctx · $2.50/M tok" for paid, "128K ctx · free" for free
- Filter dropdown: All Models / Free Only / Paid Only
- Search still works alongside the filter
- 29 free + 314 paid models available (GPT-5.4, Grok 4.20, Gemini 3.1, etc)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New provider: Ollama Cloud (ollama.com)
- Native Ollama chat API with bearer token auth
- Provider card in Admin → Providers tab
- "Ollama Cloud" tab with Pull Models button (fetches 36 models)
- Search/filter models, one-click Add
- Models route as ollama_cloud::modelname through query_ollama_cloud()
- Test button verifies connection
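A sketch of the bearer-token chat call; the request/response shape
follows the native Ollama chat API, the exact URL and error handling
in query_ollama_cloud() are assumptions:

    import requests

    def query_ollama_cloud(model: str, messages: list, api_key: str) -> str:
        r = requests.post(
            "https://ollama.com/api/chat",  # hosted native Ollama chat API
            headers={"Authorization": f"Bearer {api_key}"},
            json={"model": model, "messages": messages, "stream": False},
            timeout=180,
        )
        r.raise_for_status()
        return r.json()["message"]["content"]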
OpenRouter fix:
- Cleared bad API key from config (was dd00bea4..., not an sk-or- key)
- Real key from /home/profit/.env now used (sk-or-v1-579...)
- Fixed OpenAI provider that had wrong base_url (ollama.com→api.openai.com)
- Bumped OR timeout to 180s for free model rate limits
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- "Explore 3D" button on every response card — opens interactive Three.js viewer
- Orbit controls (drag to rotate, scroll to zoom, auto-rotate)
- 8 procedural GLB scene types: spiral galaxy, toroidal structure, crystal cavern, orbital rings, DNA helix, floating islands, wormhole, lattice world
- Blender exports scenes as GLB in ~1s, cached to disk
- Three.js with ACES filmic tone mapping, fog, 4-point cinematic lighting
- Auto-cleanup: stops rendering when card removed from DOM
- TripoSR pipeline fix: direct ComfyUI call (no self-deadlock)
- AI→3D Sculpt button renamed with clear pipeline labels
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Layout & UX:
- Full-screen composer on load, full-screen output on run (no split view)
- Output scrolls within container (flex-shrink:0 fix), no page scroll
- Progress panel: fixed position above output, collapses to thin bar, fades on completion
- Output cards edge-to-edge in output mode (no panel border/padding/max-width)
- Modern theme as default, theme persists across all pages (CSS injection fix)
- Reddit theme renamed to Coral (trademark caution)
Markdown & Typography:
- Marked.js + DOMPurify for proper markdown rendering in response cards
- Editorial typography: custom numbered list counters, gold bullet dots, blockquote styling
- Table striping, code blocks with purple syntax, link underlines
- Text normalization: strips excessive indentation, collapses blank lines between list items
- Removed mermaid.js (unreliable with LLM output, caused visible errors)
Response Cache:
- DB tables: response_cache + response_cache_history
- Cache key: SHA256(prompt + mode + sorted models)
- Instant return on cache hit, auto-upgrade when better score arrives
- Full version history for training data export
- "Fresh Run" button to bypass cache
- API: /api/cache/stats, /cache/history, /cache/export, /cache/clear
Adaptive Pipeline (new mode):
- Self-evaluating models: answer + confidence score + limitations
- Quality gates: below threshold triggers RAG retrieval + model escalation
- Knowledge base: successful responses embedded and stored for future retrieval
- Flywheel: system gets smarter with every run
- DB tables: knowledge_base + adaptive_runs
- API: /api/knowledge-base/stats, /search, /entries, /adaptive-runs
Model Error Handling:
- Full HTTP error code coverage (429, 402, 404, 401, 403, 500-504, timeouts)
- safe_query_with_fallback: failed model's role taken by next available model
- Runners updated: debate, pipeline, validator use fallback system
- UI shows "model X failed, model Y took over" notices
Image Generation:
- SDXL Turbo via diffusers (fallback) + ComfyUI + DreamShaper XL Turbo (primary)
- ComfyUI on :8188 with DPM++ SDE Karras sampler, 8 steps
- Abstract editorial style: 8 rotating prompts, forced no-people/no-text
- Disk cache for generated images
- "Illustrate" button on every response card
- Imagegen proxy at :3600, API endpoint /api/imagegen
- 1024x512 at ~3.5s per image (ComfyUI) or ~1s (diffusers fallback)
Prompt Effects:
- Sample chip animations: click-to-swap with exit/enter transitions
- Shuffle icon on hover to cycle prompts without typing
- Typewriter spam-click protection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Optimization & History:
- Fix optimization history display on both /history and main page slide-out panel
- Full card layout with score bars, "Use This" on all variations, A/B compare, export
- Deep Optimize: chain 2-3 rounds, feed each winner to the next
- Prompt template library: save winners, browse as quick-start chips
- Mode recommendation engine from historical scores
- Score calibration: strict anchor examples (scores now spread 4-8, not 7-9)
Security Hardening:
- Auto-escalation: 3 violations in 60s triggers instant ban + high-alert mode (30s scans)
- Sentinel prompt injection defense: sanitize log data, adversarial boundary instruction
- XSS fixes: escapeHtml on model names, mode labels in history panel
- Log redaction: passwords/tokens/secrets auto-redacted from log display
- Rate-limited /api/admin/logs endpoint (10 req/min)
- HSTS + COOP headers, persistent session secret, HttpOnly+SameSite cookies
- Concurrent ban execution via ThreadPoolExecutor
Prompt Window (pretext integration):
- Canvas particle system: keystroke particles, focus sparkle, paste explosion
- Ghost text typewriter: cycling placeholder with animated typing
- Pretext-powered line measurement for accurate metrics
- Mode-colored particle cascade on mode switch
- Sample prompt typewriter effect with spam-click protection
- Live metrics bar: chars, words, lines, est. tokens
Showcase mode now allows /optimize, /deep-optimize, /score endpoints.
CSP updated for Google Fonts + esm.sh (pretext).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Server-side save (survives page refresh/close):
- Moved save_run() from generator (client-dependent) into pipeline thread
- Pipeline thread collects responses server-side independently
- save_run() executes in pipeline thread's finally block — ALWAYS runs
- Even if user closes browser mid-run, the run completes and saves to DB
Public user tracking:
- Runs from demo/showcase users tagged with config.owner = "public"
- Admin runs tagged with actual username
- History list shows orange "PUB" badge on public user runs
- owner column added to history list query for fast filtering
Architecture change:
- _pipeline_collected[] built by pipeline thread (not generator loop)
- _run_config stored before generator starts, accessible by pipeline thread
- run_saved SSE event emitted from pipeline thread after save
- Generator's collected[] still tracks for display, but save is independent
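The thread/finally shape, sketched; run_pipeline and the argument
list of save_run are assumptions:

    import threading

    def _pipeline_worker(run_config, prompt):
        collected = []  # the commit's _pipeline_collected
        try:
            for chunk in run_pipeline(prompt, run_config):  # assumed generator
                collected.append(chunk)
        finally:
            save_run(run_config, collected)  # ALWAYS runs, client or no client

    threading.Thread(
        target=_pipeline_worker, args=(cfg, prompt), daemon=True
    ).start()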
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Select All checkbox in header row:
- Toggles all visible checkboxes at once
- Shows indeterminate state when partially selected
- Syncs with individual checkbox changes
Shift-click range selection:
- Hold Shift + click to select/deselect a range of rows
- Tracks last clicked index for range calculation
Selection count badge:
- Shows "N selected / M runs" in the run count badge when items selected
- Updates on every checkbox change
Header updated to include Score column matching the data grid.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Banner no longer covers navigation:
- Pushes content down with body padding instead of overlapping
- Fades out automatically after 10 seconds
- Click to dismiss immediately
- Remembered per session via sessionStorage (won't reappear after dismissal)
- Smooth transition: opacity fade + slide up
Demo mode runs ARE saved to database (confirmed /api/run is in
DEMO_ALLOWED_POSTS and save_run executes in the SSE generator).
The "read-only" restriction only applies to admin actions like
archiving, tagging, banning — not to running prompts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The analysis LLM sometimes returns strategies as objects like
[{"name": "clarity"}] instead of plain strings ["clarity"]. The
', '.join(strategies) call then fails with "expected str, got dict".
Fix: normalize each strategy to a string regardless of format —
handles str, dict with name/strategy keys, or fallback to str().
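The normalizer, sketched straight from that description (the helper
name is an assumption):

    def _strategy_name(s) -> str:
        """Normalize one strategy to a string whatever shape the LLM returned."""
        if isinstance(s, str):
            return s
        if isinstance(s, dict):
            return s.get("name") or s.get("strategy") or str(s)
        return str(s)

    # ", ".join(_strategy_name(s) for s in strategies)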
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Archive list improvements:
- Score column added to run list (color-coded: green 7+, amber 5+, red)
- OPT badge on runs generated by optimization (shows parent run)
- quality_score, score_method, tags, source, parent_run now in list query
Detail panel score display:
- Large color-coded score badge in header (e.g., "8.0/10" in green)
- Shows scoring method (auto/thumbs up/thumbs down)
- Persistent — visible every time you open the detail, not just once
Full optimization history section:
- Shows all optimization runs with timestamps, scores, call counts
- Each run lists ranked variations with strategy, mode, score
- Winner highlighted with star and green border
- "View" button opens any variation's full detail
- "Use" button on winner sends prompt to composer via sessionStorage
- Always loads from /api/optimize-history — no stale data
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The optimization "Use This" button was on the /history page but tried
to set document.getElementById('prompt') which only exists on /. The
JS value was lost on navigation.
Fix: store prompt in sessionStorage, pick it up on main page load.
Also opens the composer overlay so the user sees the loaded prompt
immediately instead of landing on an empty output view.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
History detail panel now shows optimization results:
- If a run has been optimized, shows results section with best score,
original score, and link to view the winning variation
- Fetches full optimization history via GET /api/optimize-history/<id>
- Shows count of optimizations run and child variation count
- Button changes to "Re-Optimize" for already-optimized runs
Reconnect to active optimizations:
- If optimization is already running, returns job_id in error response
- Frontend detects this and reconnects to the SSE stream
- No more losing progress when navigating away and coming back
- Refactored startOptimize() into startOptimize() + _showOptimizeStream()
New endpoint: GET /api/optimize-history/<run_id>
- Returns all pipeline_runs where pipeline='optimize' for that parent
- Returns all child team_runs created by optimization
- Includes scores, strategies, rankings
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lab experiment selection:
- Selected experiment now highlighted with accent border + glow
- Clicking auto-navigates to relevant tab (config if idle, monitor if running)
- No more silent toast-only feedback
Live status display:
- SSE "status" events now rendered in monitor (were silently dropped before)
- Shows real-time: "Proposing change... (trial 3/50)" during execution
- Error messages displayed inline instead of just toast
Stuck experiment fix:
- On app startup, reset all "running" experiments to "paused"
- Prevents ghost "running" status after service restart
- Fixed experiments 2, 3, 4 that showed running but had dead threads
Trial cap fix:
- Changed from lifetime cap (trial_num < 50) to per-run cap (trials_this_run < 50)
- Prevents runaway experiments like #1 that accumulated 3762 trials
- Shows trial progress in status: "trial 3/50"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When viewing any past run in History, click "Optimize" to trigger an
automated workflow that:
1. Analyzes the original prompt + responses + score
2. Identifies improvement strategies (clarity, depth, specificity, etc.)
3. Generates 3-5 improved prompt variations
4. Tests each variation across original mode + brainstorm
5. Auto-scores all results via background judge
6. Ranks results and highlights the winner
7. "Use This" button loads winning prompt into composer
Architecture:
- _run_optimize(job_id, run_id): background thread, 5-phase engine
- POST /api/runs/<id>/optimize: starts optimization job
- GET /api/optimize/<job_id>/stream: SSE for live progress
- Budget-capped at 15 model calls per optimization
- Child runs saved as real team_runs (source: "optimize")
- Auto-scored → feeds into analytics + routing table automatically
- Results saved to pipeline_runs (pipeline: "optimize")
Frontend:
- "Optimize" button in history detail panel (accent-colored)
- startOptimize(runId): replaces detail view with live optimization stream
- Phase cards: Analysis → Variations → Testing → Ranked Results
- Score bars with color coding (green/amber/red)
- Winner row highlighted with star + "Use This" button
Closes the learning loop: system studies its own history → generates
better prompts → tests them → scores results → routing table improves.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 1 — Run Quality Scoring:
- Auto-score every run in background via qwen2.5 judge (1-10)
- Thumbs up/down vote buttons on output cards
- POST /api/runs/<id>/score for user feedback
- run_saved SSE event enables vote buttons after run completes
- User votes override auto-scores (race-condition safe)
- DB: quality_score, score_method, score_metadata on team_runs
Phase 1 — Analytics Dashboard:
- GET /api/admin/analytics: score-by-mode, score-by-model, heatmap, trend
- New Analytics tab on Admin page with bar charts, heatmap table, trend sparkline
- Scoring coverage tracker (scored vs total runs)
- Model × Mode heatmap with color-coded cells
Phase 2 — Reactive Pipeline:
- _assess_stage(): orchestrator evaluates each stage's output mid-run
- _reactive_decide(): can insert/skip stages based on assessment
- Dynamic stage loop replaces fixed iteration in run_refine()
- Budget tracking prevents infinite loops (max_stages hard cap)
- Reactive decisions render as dashed notification bars between cards
- Pipeline adjusts in real-time: "Inserting VALIDATE — high severity gaps found"
Phase 3 — Cross-Run Learning:
- _build_routing_table(): queries historical scores for model×mode performance
- Best stage sequences per content_type from pipeline_runs
- Routing table cached with 30-min TTL
- Auto-Refine strategist prompt augmented with historical data
- GET /api/suggest-models?mode=X returns top 3 models for that mode
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each mode now has {basic: [...], mid: [...], advanced: [...]} with 5
prompts per difficulty level. Renderer picks one random prompt from
each tier on every mode switch, so users see fresh examples each time.
315 hand-crafted prompts designed to highlight each mode's strengths:
- brainstorm: creative problem-solving at increasing scale
- pipeline: multi-step transformations from simple to complex
- debate: ethical dilemmas with escalating nuance
- validator: common myths to complex historical misconceptions
- roundrobin: writing tasks that benefit from iterative refinement
- redteam: security vulnerabilities from obvious to systemic
- consensus: opinion questions from clear to deeply contested
- codereview: coding tasks from functions to distributed systems
- ladder: concepts that scale from kindergarten to PhD
- tournament: creative competitions from one-liners to algorithms
- evolution: optimization targets from names to city infrastructure
- blindassembly: decomposable projects from explanations to systems
- staircase: progressive constraints from party planning to treaties
- drift: factual claims from simple dates to complex event sequences
- mesh: stakeholder analysis from office policies to life-or-death
- hallucination: fact-checkable claims from simple to obscure
- timeloop: cascading failures from restaurants to civilization
- research: deep dives from single topics to geopolitical analysis
- eval: benchmark prompts from trivia to formal proofs
- extract: structured extraction from sentences to legal documents
- refine: documents from product blurbs to architecture specs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Auto-Refine mode (21st mode):
- AI strategist analyzes content type and quality
- Selects 3-5 optimal refinement stages from 8 available
(validate, critique, expand, structure, stakeholder, clarity,
edge_cases, align)
- Executes stages sequentially with output chaining
- Final synthesis produces polished version
- Stages are content-aware — PRD gets different pipeline than essay
- Saved to pipeline_runs DB
Composer UX overhaul:
- Initial state: full-screen centered composer overlay
- Mode grid + models + prompt front-and-center for new users
- On Run: composer closes, output takes full screen width
- "New Prompt" button in header nav bar (not floating)
- Close button (×) on composer overlay
- Works across all 4 themes + mobile
Dropdown fixes:
- Dark theme: select options get solid #1a1d23 bg
- Modern theme: select options get solid #18181b bg
- Light/Reddit: select options get white bg with dark text
- Native <option> elements now readable in all themes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Theme system (Dark/Light/Reddit/Modern):
- Injectable CSS/JS via after_request — zero template changes
- Dark: original gold accent on black
- Light: warm off-white with indigo accent, readable buttons
- Reddit: bluish-gray bg, orange accent, pill buttons, 8px corners
- Modern: glassmorphism dark, blue accent, frosted cards, 16px corners
- Toggle cycles all 4 themes, persists via localStorage
- Button injected into every page header automatically
Enrichment panel fix:
- threat-card changed from display:flex to display:grid
- enrich-panel now spans full width via grid-column:1/-1
- Added .enrich-section/.enrich-title/.enrich-grid CSS classes
- Sections (Geo, Deep Scan, AI) visually separated with dividers
Iterate/repipe modal themed for all modes:
- Light themes get white modal bg, proper contrast
- Reddit gets rounded corners + orange accent
- Modern gets glassmorphism modal with blue glow
Scrollbar styling across all themes:
- Rounded, properly sized (6-8px), theme-colored thumbs
- macOS-style inset look via background-clip
Layout improvements:
- Output area min-height 400px, padding-bottom 40px
- Empty state centered with more breathing room
- Docker + containerd enabled at boot for web-check survival
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fail2ban was using nftables action while UFW uses iptables-nft, so bans
were recorded but never enforced. Added three-layer ban enforcement:
1. nginx deny list (/etc/nginx/banned_ips.conf) for instant 403
2. ss -K to kill existing TCP connections on ban
3. Auto-sync nginx deny file on ban/unban (manual, mass, AI sentinel)
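Layers 1-2 in sketch form (layer 3 is the sync call at each
ban/unban site); exact function name is an assumption, and ss -K
needs a kernel built with CONFIG_INET_DIAG_DESTROY:

    import subprocess

    def enforce_ban(ip: str):
        # layer 1: nginx deny list → instant 403 on the next request
        with open("/etc/nginx/banned_ips.conf", "a") as f:
            f.write(f"deny {ip};\n")
        subprocess.run(["systemctl", "reload", "nginx"], check=False)
        # layer 2: kill the attacker's live TCP connections
        subprocess.run(["ss", "-K", "dst", ip], check=False)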
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Off: login required for everything
Demo: public gets Team UI + run modes + admin page (browse only)
Blocked: /logs, /admin/monitor, /history, threat intel APIs,
sentinel, wall-of-shame, meta-pipelines, self-reports, vectors
Showcase: public gets full read-only access to ALL pages
Allowed: admin, monitor, logs, threat intel, enrichment,
lab, history, self-analysis, meta-pipelines
Blocked: config changes, bans, deletes, bulk operations
Admin (logged in): full access to everything always
SHOWCASE_ONLY_ROUTES set defines which pages/APIs are
blocked in basic demo but allowed in showcase mode.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problem: plain toggle set showcase=true, so demo always became showcase.
No way to enable basic demo mode separately.
Fix:
- Three explicit buttons: [Demo] [Showcase] [Off]
- Demo mode: active=true, showcase=false (team UI only)
- Showcase mode: active=true, showcase=true (full read-only admin)
- Off: both false
- Plain toggle cycles demo on/off without touching showcase
- Clear status text shows which mode and what it means
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The demo toggle route was in DEMO_BLOCKED_POSTS, so once showcase
was enabled, the before_request handler blocked the toggle POST
even for admins (the before_request check ran before the route's
own admin check could verify the session).
Fix: removed /api/demo/toggle from blocked list. The route already
has its own admin-only check (line 460).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
before_request handler still referenced old variable name.
Updated to use DEMO_BLOCKED_POSTS with simpler path-in-set check.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New mode: Showcase (replaces basic demo mode for client demos)
- Visitors see EVERYTHING: Admin, Monitor, Logs, Threat Intel,
Lab, History, Meta-Pipelines — all without logging in
- Read-only: all GET requests allowed on all routes
- Allowed POSTs: team runs, self-analysis, IP enrichment
(read-like operations that don't modify system config)
- Blocked POSTs: config changes, bans, deletes, bulk archive
Admin UI (Security tab):
- "Enable Showcase" button (magenta) — one click to activate
- "Turn Off" button appears when active
- Clear description of what visitors can and can't do
- Status shows "SHOWCASE MODE" with magenta styling
Banner:
- Magenta gradient banner on all pages when showcase is active
- Shows: "Showcase Mode — Full Read-Only Access — Admin · Monitor · Logs · Lab · History"
- Demo button in nav shows "Showcase" in magenta
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Auto-refresh now skips when any detail panel is open (checks for
meta-detail-* elements). Panel stays stable while reading results.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each pipeline card now shows:
- Status dot + name + status tag + best score
- Stop button (red) when running
- Restart button (green) when stopped/completed
- Results button (magenta) to drill into iterations
- Live progress text when running
- Stages and iteration count on info line
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bugs fixed:
- Ratchet loop had no trial cap — experiment #1 ran 3762 trials
unchecked. Now capped at max_trials=50 per start cycle.
- meta_pipelines, meta_runs, self_reports tables had no GRANT
for kbuser — fixed permissions for all tables and sequences.
All 4 running experiments auto-paused on restart.
Stress test confirms all tables accessible, all models responding,
meta-pipeline creation working, self-report save/retrieve working.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Engine:
- Chains modes in sequence: extract → research → validate → debate → synthesize
- Each stage feeds its output to the next as input
- Runs same pipeline with different model sets (one model per iteration)
- Auto-scores final output using judge model (1-10)
- Keeps best result across all iterations
- All stage results + final outputs saved to meta_runs table
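The chaining loop, sketched; run_mode and judge_score are assumed
helper names standing in for the real stage/judge calls:

    def run_meta_pipeline(stages, source_text, models, judge_model):
        best_score, best_output = 0, None
        for model in models:                 # one model per iteration
            output = source_text
            for stage in stages:             # each stage feeds the next
                output = run_mode(stage, model, output)   # assumed helper
            score = judge_score(judge_model, output)      # assumed, 1-10
            if score > best_score:
                best_score, best_output = score, output
        return best_score, best_output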
4 preset pipelines:
1. Security Deep Dive — security logs through 5-stage analysis
2. Run History Insights — team run data through 4-stage extraction
3. Threat Intel Enrichment — profiled IPs through 5-stage analysis
4. Cross-Report Synthesis — past self-reports through 4-stage debate
Database:
- meta_pipelines: name, source, stages, status, best_score, iterations
- meta_runs: per-iteration stage results, final output, score, models
API:
- POST /api/meta-pipeline — create pipeline from preset
- POST /api/meta-pipeline/:id/start — run in background
- POST /api/meta-pipeline/:id/stop — halt execution
- GET /api/meta-pipelines — list all with live status
- GET /api/meta-pipeline/:id — full detail with all iteration results
UI (Lab page):
- Magenta-bordered Meta-Pipeline card with 4 clickable presets
- Click preset → creates + auto-starts pipeline
- Pipeline list with live status dots, progress, scores
- Click pipeline → drill-down with per-iteration results
- Each stage expandable (click to show output)
- Best output highlighted in green border
- Auto-refreshes every 5 seconds during runs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Database:
- self_reports table: report_type, model, report text, data_size, timestamp
- Reports auto-saved on generation (no extra step needed)
API:
- GET /api/self-reports — list all past reports (id, type, model, size, date)
- GET /api/self-reports/:id — full report text
UI:
- "✓ Saved as report #N" indicator after generation
- "Past Reports (N)" section below self-analysis buttons
- Click any past report → expands inline (toggle on/off)
- Shows: type, model, timestamp for each saved report
- Reports persist across page refreshes and restarts
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 one-click self-analysis reports in Lab:
1. Threat Intelligence Report — security logs → attack taxonomy,
attacker profiling, predictive analysis, recommendations
2. Model Performance Analysis — 96 team runs → usage patterns,
model workload, response efficiency, optimization opportunities
3. Usage Analytics — nginx access logs → traffic patterns, feature
usage, user journey mapping, UX recommendations
4. Security Posture Assessment — combined audit of security logs,
sentinel verdicts, fail2ban, threat intel DB → risk rating
API: POST /api/self-analyze
- type: threat_intel|model_performance|access_patterns|security_posture
- model: which local model to use (default qwen2.5)
- Returns structured report from real system data
Lab UI:
- Green-bordered Self-Analysis card above experiment templates
- Click any report → runs analysis in background → result panel
expands inline with full report (scrollable, closeable)
- Loading state shows "Analyzing..." during generation
Each report analyzes REAL data: actual security logs, actual run
history, actual nginx access patterns — not synthetic test data.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Templates section below experiment list:
BASIC — Better Summaries (3 eval cases)
Optimize summarization quality. Tests across biology, history,
and technical content. Shows the simplest Lab workflow.
INTERMEDIATE — Code Explainer (4 eval cases)
Find the best prompt+model to explain code to non-programmers.
Tests loops, recursion, error handling, comprehensions.
Shows how the ratchet evolves system prompts.
ADVANCED — Security Analyst Persona (5 eval cases)
Evolve a cybersecurity AI across threat classification, executive
summaries, developer education, incident response, and forensics.
Tests multi-audience adaptation and domain expertise.
Click any template → auto-fills the create form with name, objective,
metric, all eval cases, and selects all available models. User can
modify before creating.
Each template card shows: level badge (green/amber/red), name,
eval case count, and a description explaining what the experiment
does and why it matters.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Full theme swap: amber accents, JetBrains Mono, 2px borders
- Animated dot-grid background + scanlines
- Backdrop-filter blur on cards
- Status pills: square with borders (was rounded)
- Model chips: square with 2px borders
- Chart wraps: dark background with 2px borders
- Trial items: monospace numbers and scores
- Best config box: monospace with green border
- Nav bar with links to Team, History, Admin, Logs
- Toast: monospace with fade-out animation
- Config textarea: monospace font with dark background
- Responsive: tabs compact on mobile
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New /history page (replaces slide-out panel):
- Full-page data table: ID, Mode, Prompt, Models, Tags, Date
- Active/Archived/All view toggle
- Filter by mode, tag, or search text
- Checkbox select for bulk archive/restore
- Click any row → detail panel with full responses
Per-run detail:
- Inline tag editor: add tags (Enter), remove tags (click ✕)
- Notes textarea with auto-save (1s debounce)
- Archive/Restore/Delete buttons
- Collapsible response cards (click header to expand)
Database:
- tags TEXT[] column with GIN index for fast tag queries
- notes TEXT column for freeform annotations
APIs:
- POST /api/runs/:id/tags — update tags and/or notes
- GET /api/runs/tags — list all unique tags in use
- GET /api/runs/vectors — structured text documents for AI/embedding
Returns: mode, prompt, models, date, tags, notes + all response text
Filters: ?mode=, ?tag=, ?limit=
Each doc includes token estimate for embedding planning
Main UI: History button now links to /history page
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Database:
- Added 'archived' boolean column to team_runs (indexed)
- Active runs filtered by archived=false by default
API:
- GET /api/runs?show=active|archived|all
- POST /api/runs/:id/archive — archive single run
- POST /api/runs/:id/restore — restore single run
- POST /api/runs/bulk-archive — archive/restore by IDs or date
History panel UI:
- Active/Archived toggle tabs at top
- Per-run Archive button (magenta) in detail view
- Per-run Restore button (green) in detail view for archived runs
- "Archive All" bulk button when viewing active runs
- "Restore All" bulk button when viewing archived runs
- Archived runs hidden from active view, accessible anytime
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Sentinel thread sets next_scan_ts = time.time() + interval BEFORE sleeping
- API returns next_scan_in derived from real next_scan_ts, not estimated
- Frontend calculates server clock offset and counts down to the actual
target timestamp — refresh shows the same remaining time, not a reset
- Shows ✓ in green when scan fires, resumes countdown on next poll
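The timestamp-first shape, sketched (run_scan and the stats dict key
are assumptions):

    import time

    def sentinel_loop(interval: float):
        while True:
            _sentinel_stats["next_scan_ts"] = time.time() + interval  # BEFORE sleep
            time.sleep(interval)
            run_scan()  # assumed

    def next_scan_in() -> float:
        # derived from the real target, so a page refresh shows the same
        # remaining time instead of a reset estimate
        return max(0.0, _sentinel_stats["next_scan_ts"] - time.time())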
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Entire sentinel status fits in one header row now
- Mini 28px countdown ring (was 64px) inline with title
- Scans/bans counts inline as text, not grid boxes
- Verdicts collapsed by default — click to expand
- Card padding reduced (8px vs 14px)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SVG progress ring shows time until next scan (magenta arc)
- Countdown ticks every second: "245s → 244s → ... → scanning..."
- Ring fills as time progresses, resets on scan
- Turns green and shows "scanning..." when timer hits 0
- Stats grid: Scans count, AI Bans count, Last Run time, Interval
- Backend API returns elapsed_since_scan and next_scan_in
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Database:
- threat_intel table with full enrichment data per IP
- UPSERT on IP — re-enriching updates existing record
- Stores: geo, AI analysis, web-check results, indicators, raw JSON
- Indexed on IP (unique), threat_level, enriched_at
Auto-save:
- Every enrichment auto-saves to DB (step 5 in enrichment pipeline)
- "Saved to Wall of Shame database" indicator in enrichment panel
- No duplicate scans — re-enrich updates the existing record
Wall of Shame tab (/logs):
- Stats bar: Total Profiled, Critical, High, Proxies, Automated
- Sortable table: IP, Threat, Type, Summary, Country, Ports
- Click any row to expand full detail:
ISP, Org, ASN, City, Proxy/Hosting flags, Confidence,
Blocklist count, Pattern, Recommendation, Indicators
- All data persists across restarts — no re-scanning needed
API:
- /api/admin/wall-of-shame — list all enriched IPs with sorting/filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enrichment AI prompt:
- Explicitly states this is a PRIVATE application
- Strict threat level rules: 10+ blocklists = always critical,
exploit scans = always critical, SSH-only = suspicious
- Added "compromised_host" classification option
- Recommendation options: ban permanently, ban 24h, monitor, ignore
Sentinel batch prompt:
- "Err on the side of banning" directive
- .env.production/.env.local probing = targeted recon, instant ban
- When in doubt, BAN — private server has no public scanning excuse
- Tighter rules for automated UA detection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Now queries 6 web-check endpoints per IP:
- ports — open port scan
- dns — reverse DNS / PTR records
- block-lists — DNS blocklist check (AdGuard, CloudFlare, etc.)
- trace-route — full network path with per-hop latency
- headers — HTTP response headers (server, powered-by, etc.)
- status — HTTP status code and response time
Frontend rendering:
- Traceroute displayed as hop chips with latency: IP (45ms) → IP (56ms)
- HTTP status with response time
- Server headers inline
- Errors silently skipped (many endpoints fail on raw IPs)
AI analysis now includes:
- Blocklist count and names in prompt
- Traceroute hops in prompt for network path analysis
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Web-check (ports, DNS, blocklists) now runs as step 3, AI analysis
as step 4. AI prompt includes open ports and blocklist status for
richer threat verdicts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Setup:
- lissy93/web-check running in Docker on port 3000
- Queries ports, DNS, and blocklist endpoints per IP
Enrichment now includes 4 layers:
1. Geolocation (ip-api.com) — country, ISP, proxy/hosting flags
2. Web-Check deep scan — open ports, DNS/PTR, blocklist status
3. Security log aggregation — all activity for that IP
4. AI analysis (qwen2.5) — gets ALL above data as context
Frontend rendering:
- Open ports displayed in red (security risk indicators)
- Blocklist status: "3/8 blocked (AdGuard, AdGuard Family, ...)"
- Reverse DNS (PTR records)
- All data feeds into AI analysis prompt for richer verdicts
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>