Three issues from J 2026-04-30:
1. Silent fail2ban-client / nginx subprocess failures
Pre-fix _execute_ban called fail2ban-client with capture_output=True
but threw away the result. _nginx_ban had bare `except: pass`
swallowing everything. So a non-zero fail2ban exit (jail not
configured, IP already banned, IPv6 quirk) or PermissionError
on /etc/nginx/banned_ips.conf still logged "AI_BAN" while the
attacker walked through unimpeded. These were the "errors in logs"
J was seeing.
Now every subprocess failure surfaces:
- FAIL2BAN_FAILED rc=N stderr=... — non-zero exit
- FAIL2BAN_TIMEOUT — client didn't return in 5s
- FAIL2BAN_NOT_INSTALLED — binary missing
- NGINX_BAN_WRITE_DENIED — permission error on conf file
- NGINX_RELOAD_FAILED rc=N stderr=... — systemctl reload non-zero
- NGINX_RELOAD_TIMEOUT / NGINX_RELOAD_NO_SYSTEMCTL — runtime gaps
sec_log.error records each of these, so journalctl -u llm-team-ui
shows the actual reason a ban didn't stick.
2. AI auto-scan failure callback when model is busy
Pre-fix Ollama unreachable / busy / timeout silently preserved
log position + skipped the scan. Operator only learned about
the gap by manually checking sentinel-status. Now:
- 1 retry inside same scan after SENTINEL_AI_RETRY_DELAY_SECS
(30s) on connection error / timeout / 429 / 503
- 4xx errors that won't recover (404 model missing, 400 bad
prompt) fail fast without retrying
- consecutive_ai_failures counter in _sentinel_stats
- On 3+ consecutive failures, send_security_alert() fires —
"Sentinel AI unreachable" email with last error + endpoint
+ model name. One alert per outage (ai_busy_alerted flag);
clears on first successful scan so flapping doesn't spam.
- AI_RECOVERED log line on first scan after a streak.
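The retry/alert flow above can be sketched as follows. This is illustrative only: `call_ai` stands in for the real Ollama request, and every name except those quoted in the commit (`SENTINEL_AI_RETRY_DELAY_SECS`, `consecutive_ai_failures`, `ai_busy_alerted`, `send_security_alert`) is an assumption.

```python
import time

SENTINEL_AI_RETRY_DELAY_SECS = 30      # retry delay from the commit
RETRYABLE = {429, 503}                 # plus connection errors / timeouts

def scan_with_retry(call_ai, stats, send_security_alert,
                    delay=SENTINEL_AI_RETRY_DELAY_SECS):
    """One retry on transient errors, fail-fast on unrecoverable 4xx,
    alert once per outage, clear on first success."""
    status, result = None, None
    for attempt in (1, 2):
        try:
            status, result = call_ai()
        except (ConnectionError, TimeoutError) as exc:
            status, result = None, exc
        else:
            if status == 200:
                if stats.get("consecutive_ai_failures", 0):
                    print("AI_RECOVERED")      # first scan after a streak
                stats["consecutive_ai_failures"] = 0
                stats["ai_busy_alerted"] = False
                return result
            if status not in RETRYABLE:
                break                          # 404 / 400: won't recover
        if attempt == 1:
            time.sleep(delay)                  # single in-scan retry
    stats["consecutive_ai_failures"] = \
        stats.get("consecutive_ai_failures", 0) + 1
    if (stats["consecutive_ai_failures"] >= 3
            and not stats.get("ai_busy_alerted")):
        send_security_alert("Sentinel AI unreachable", detail=str(result))
        stats["ai_busy_alerted"] = True        # one alert per outage
    return None
```

The alert-once flag plus reset-on-success is what keeps a flapping model from spamming email while still guaranteeing exactly one notification per sustained outage.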
3. Sentinel ban path still substring-matched 192.168
Same vulnerability class as admin_ban_ip had — only protected
one /16. Replaced 4 sites with is_allowlisted(ip):
- threat-list display filter (line 7638): now hides ALL
allowlisted IPs from the panel
- mass-ban API (line 8016): refuses ban for any allowlisted IP
- sentinel analysis filter (line 12786): saves AI tokens by
never sending allowlisted-IP traffic to the judge
- sentinel ban verdict gate (line 12949): defense in depth —
even if the AI says "ban" on an allowlisted IP, this catches it
Combined with the layered defenses in b09b73c (track_violation,
_auto_escalate, _nginx_ban, admin_ban_ip), there is now no
code path that can ban an allowlisted IP. Operator self-ban
is structurally impossible.
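The difference between the old substring match and a real containment check, sketched with the stdlib `ipaddress` module (the allowlist contents here are illustrative; the real `is_allowlisted` presumably reads its networks from config):

```python
import ipaddress

# Illustrative allowlist; the real set is configured elsewhere.
ALLOWLIST_NETS = [
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("10.0.0.0/8"),
]

def is_allowlisted(ip: str) -> bool:
    """Proper network containment. Unlike `"192.168" in ip`, this
    cannot be fooled by addresses that merely contain the substring,
    and it covers every configured range, not one /16."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # malformed input is never allowlisted
    return any(addr in net for net in ALLOWLIST_NETS)
```

The substring version would also have matched a public address like 1.192.168.2, and missed every private range other than 192.168.0.0/16, which is exactly the class of bug the four call sites shared.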
Privilege note: the systemd unit at /root/llm-team-ui/llm-team-ui.service
runs as User=root, so subprocess.run(["fail2ban-client", ...]) and
systemctl reload nginx have permission. The "errors in logs" J was
seeing weren't permission-denied; they were silent non-zero exits.
The new subprocess wrappers surface those.
If the operator later splits the app into a non-root tier
(Opus OB-3 architectural recommendation, deferred), this same
infrastructure still works — the wrappers will then surface
"PermissionError" with full path + uid context, telling the
operator exactly which command needs sudo NOPASSWD or
PolicyKit rule.
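A wrapper that surfaces that context might look like this (a sketch under the assumption of a Unix host; the function name and log format are illustrative, and note that a non-root fail2ban-client typically fails with a non-zero exit on the socket rather than a PermissionError at exec time):

```python
import os
import subprocess

def run_privileged(cmd, sec_log, timeout=5):
    """Run a command and, on a permission failure, log which command
    failed and under which uid/euid, so the operator knows exactly
    what needs sudo NOPASSWD or a PolicyKit rule."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
    except PermissionError as exc:
        sec_log.error("PERMISSION_DENIED cmd=%s uid=%d euid=%d err=%s",
                      cmd[0], os.getuid(), os.geteuid(), exc)
        return None
```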
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>