sentinel: log subprocess failures + AI-busy retry/callback + allowlist all ban-paths

Three issues from J 2026-04-30: 1. Silent fail2ban-client / nginx subprocess failures Pre-fix _execute_ban called fail2ban-client with capture_output=True but threw away the result. _nginx_ban had bare `except: pass` swallowing everything. So a non-zero fail2ban exit (jail not configured, IP already banned, IPv6 quirk) or PermissionError on /etc/nginx/banned_ips.conf logged "AI_BAN" while the attacker walked through unimpeded. The "errors in logs" J was seeing. Now every subprocess failure surfaces: - FAIL2BAN_FAILED rc=N stderr=... — non-zero exit - FAIL2BAN_TIMEOUT — client didn't return in 5s - FAIL2BAN_NOT_INSTALLED — binary missing - NGINX_BAN_WRITE_DENIED — permission error on conf file - NGINX_RELOAD_FAILED rc=N stderr=... — systemctl reload non-zero - NGINX_RELOAD_TIMEOUT / NGINX_RELOAD_NO_SYSTEMCTL — runtime gaps sec_log.error catches these so journalctl -u llm-team-ui shows the actual reason a ban didn't stick. 2. AI auto-scan failure callback when model is busy Pre-fix Ollama unreachable / busy / timeout silently preserved log position + skipped the scan. Operator only learned about the gap by manually checking sentinel-status. Now: - 1 retry inside same scan after SENTINEL_AI_RETRY_DELAY_SECS (30s) on connection error / timeout / 429 / 503 - 4xx errors that won't recover (404 model missing, 400 bad prompt) fail fast without retrying - consecutive_ai_failures counter in _sentinel_stats - On 3+ consecutive failures, send_security_alert() fires — "Sentinel AI unreachable" email with last error + endpoint + model name. One alert per outage (ai_busy_alerted flag); clears on first successful scan so flapping doesn't spam. - AI_RECOVERED log line on first scan after a streak. 3. Sentinel ban path still substring-matched 192.168 Same vulnerability class as admin_ban_ip had — only protected one /16. Replaced 4 sites with is_allowlisted(ip): - threat-list display filter (line 7638): now hides ALL allowlisted IPs from the panel - mass-ban API (line 8016): refuses ban for any allowlisted IP - sentinel analysis filter (line 12786): saves AI tokens by never sending allowlisted-IP traffic to the judge - sentinel ban verdict gate (line 12949): defense in depth — even if the AI says "ban" on an allowlisted IP, this catches it Combined with the layered defenses in b09b73c (track_violation, _auto_escalate, _nginx_ban, admin_ban_ip), there is now no code path that can ban an allowlisted IP. Operator self-ban is structurally impossible. Privilege note: the systemd unit at /root/llm-team-ui/llm-team-ui.service runs as User=root, so subprocess.run(["fail2ban-client", ...]) and systemctl reload nginx have permission. The "errors in logs" J was seeing weren't permission-denied; they were silent non-zero exits. The new subprocess wrappers surface those. If the operator later splits the app into a non-root tier (Opus OB-3 architectural recommendation, deferred), this same infrastructure still works — the wrappers will then surface "PermissionError" with full path + uid context, telling the operator exactly which command needs sudo NOPASSWD or PolicyKit rule. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
threat-intel: master 'select all' checkbox in toolbar
2026-04-30 03:33:20 -05:00 · 2026-04-30 03:27:20 -05:00 · 2026-04-30 03:24:01 -05:00 · 2026-04-30 03:19:38 -05:00 · 2026-04-30 03:14:08 -05:00
2 changed files with 360 additions and 40 deletions
--- a/.gitignore
+++ b/.gitignore
@ -2,3 +2,4 @@ __pycache__/
 *.pyc
 .env
 *.log
+.memory/
--- a/llm_team_ui.py
+++ b/llm_team_ui.py
@ -3,6 +3,7 @@

 import json
 import os
+import sys
 import time
 import threading
 import secrets
@ -36,8 +37,23 @@ app.config["SESSION_COOKIE_HTTPONLY"] = True
 app.config["SESSION_COOKIE_SAMESITE"] = "Lax"

 # ─── SECURITY LOGGING ─────────────────────────────────────────
-# Dedicated security log for fail2ban and audit trail
-_sec_handler = logging.FileHandler("/var/log/llm-team-security.log")
+# Dedicated security log for fail2ban and audit trail.
+#
+# Cross-lineage scrum 2026-04-30 (Opus BLOCK OB-1): wrapped in
+# try/except. Pre-fix this raised PermissionError at import time
+# when the service user couldn't write /var/log/llm-team-security.log,
+# crashing the app before Flask started. Now falls back to stderr;
+# sec_log still works (ban events still land in journald via stderr),
+# but the app starts. Operator should still create the file with
+# proper perms on production deploy. Path is overridable via
+# LLM_TEAM_SECURITY_LOG env var.
+_LOG_PATH = os.environ.get("LLM_TEAM_SECURITY_LOG", "/var/log/llm-team-security.log")
+try:
+    _sec_handler = logging.FileHandler(_LOG_PATH)
+except (PermissionError, FileNotFoundError, OSError) as _log_err:
+    print(f"[security] WARNING: can't open {_LOG_PATH} ({_log_err}); "
+          f"falling back to stderr.", file=sys.stderr, flush=True)
+    _sec_handler = logging.StreamHandler(sys.stderr)
 _sec_handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
 sec_log = logging.getLogger("security")
 sec_log.addHandler(_sec_handler)
@ -93,8 +109,24 @@ _original_sentinel_interval = None  # stash the normal interval during high-aler


 def _track_violation(ip, event_type="unknown"):
-    """Record a security violation. If velocity threshold exceeded, auto-escalate."""
+    """Record a security violation. If velocity threshold exceeded, auto-escalate.
+
+    Cross-lineage scrum 2026-04-30 follow-up: OB-4's path-bypass for
+    admins was incomplete — _track_violation is called from 3 sites
+    (exploit_scan, rate_limit, login_fail) and only the exploit one
+    had the bypass. An admin hitting rate-limit + login-typo + a
+    legit URL containing 'UNION' within 60s could still self-ban
+    via the OTHER paths. Now is_allowlisted bails early so allowlisted
+    IPs never accumulate violations from ANY path. Defense in depth —
+    _auto_escalate also re-checks below.
+
+    Eviction sweep when tracker grows >10K entries (same pattern as
+    _rate_limit; both had identical unbounded-dict WARNs)."""
+    if is_allowlisted(ip):
+        return False
    now = time.time()
+    if len(_violation_tracker) > 10000:
+        _evict_stale_violation_tracker(now)
    if ip not in _violation_tracker:
        _violation_tracker[ip] = []
    _violation_tracker[ip].append(now)
@ -107,8 +139,26 @@ def _track_violation(ip, event_type="unknown"):
    return False


+def _evict_stale_violation_tracker(now):
+    """Drop _violation_tracker entries whose newest timestamp is past
+    the window. Called from _track_violation only when dict exceeds
+    10K — the hot path stays untouched for normal traffic."""
+    cutoff = now - VELOCITY_WINDOW
+    stale = [ip for ip, ts in _violation_tracker.items() if not ts or max(ts) < cutoff]
+    for ip in stale:
+        del _violation_tracker[ip]
+
+
 def _auto_escalate(ip, violation_count, event_type):
-    """Auto-ban IP and switch sentinel to high-alert mode."""
+    """Auto-ban IP and switch sentinel to high-alert mode.
+
+    Defense in depth (2026-04-30): _track_violation already short-
+    circuits for allowlisted IPs, but if a future code path calls
+    _auto_escalate directly we still want the allowlist guard. Bail
+    early; nothing escalates against a trusted IP."""
+    if is_allowlisted(ip):
+        sec_log.info("AUTO_ESCALATE_BLOCKED ip=%s — allowlisted, refused to ban", ip)
+        return
    global _original_sentinel_interval, SENTINEL_INTERVAL
    sec_log.warning("AUTO_ESCALATE ip=%s violations=%d/%ds type=%s", ip, violation_count, VELOCITY_WINDOW, event_type)
    _sentinel_log_entry(f"AUTO_ESCALATE ip={ip} violations={violation_count}/{VELOCITY_WINDOW}s type={event_type}")
@ -150,8 +200,19 @@ def _check_high_alert_expiry():

 # IPs that never get rate-limited (your LAN, localhost)
 ALLOWLIST_IPS = {"127.0.0.1", "::1", "192.168.1.1"}
-# Demo mode state — toggled by admin at runtime
-_demo_mode = {"active": True, "started_by": "boot", "showcase": True}
+# Demo mode state — toggled by admin at runtime.
+#
+# Cross-lineage scrum 2026-04-30 (Opus BLOCK OB-5): pre-fix this
+# defaulted to active=True, meaning fresh installs shipped with
+# public unauthenticated access enabled — login_required let demo
+# users straight through. Combined with /api/run + /api/imagegen
+# proxies, that was an open LLM/compute abuse surface from first
+# boot. Now defaults to active=False; operators flip it on
+# explicitly via the admin UI or LLM_TEAM_DEMO_MODE=1 env override
+# (the env override exists for the demo systemd unit so the public
+# devop.live deployment doesn't need a manual toggle on every restart).
+_DEMO_DEFAULT = os.environ.get("LLM_TEAM_DEMO_MODE", "0") == "1"
+_demo_mode = {"active": _DEMO_DEFAULT, "started_by": "boot" if _DEMO_DEFAULT else "off", "showcase": _DEMO_DEFAULT}

 # Routes that demo users CAN trigger (read-like POSTs — enrichment, self-analysis, team runs)
 DEMO_ALLOWED_POSTS = {
@ -172,9 +233,21 @@ def is_allowlisted(ip):


 def rate_limited(ip, max_req=RATE_LIMIT_MAX):
+    """Rolling rate-limit check. Returns True when the IP has exceeded
+    max_req requests within RATE_LIMIT_WINDOW seconds.
+
+    Cross-lineage scrum 2026-04-30 (Opus WARN): _rate_limit was
+    unbounded per-worker, so an attacker rotating slowly through IPs
+    leaked memory forever. Fix: lazy eviction sweep when the dict
+    grows beyond 10K entries. Real production wants a Redis-backed
+    counter shared across workers; this is the in-process band-aid
+    that prevents runaway growth without changing the deploy shape.
+    """
    if is_allowlisted(ip):
        return False
    now = time.time()
+    if len(_rate_limit) > 10000:
+        _evict_stale_rate_limit(now)
    if ip not in _rate_limit or now - _rate_limit[ip][1] > RATE_LIMIT_WINDOW:
        _rate_limit[ip] = (1, now)
        return False
@ -185,6 +258,16 @@ def rate_limited(ip, max_req=RATE_LIMIT_MAX):
    return False


+def _evict_stale_rate_limit(now):
+    """Drop _rate_limit entries older than 2× the window. Called from
+    rate_limited() only when dict growth exceeds 10K — keeps the cost
+    off the hot path for normal traffic."""
+    cutoff = now - (RATE_LIMIT_WINDOW * 2)
+    stale = [ip for ip, (_, start) in _rate_limit.items() if start < cutoff]
+    for ip in stale:
+        del _rate_limit[ip]
+
+
 def is_admin():
    return session.get("role") == "admin"

@ -560,8 +643,24 @@ def security_checks():
    # Check high-alert expiry
    _check_high_alert_expiry()

-    # Exploit scanner detection — log, alert, track velocity, block
-    if EXPLOIT_PATTERNS.search(path) or EXPLOIT_PATTERNS.search(request.query_string.decode("utf-8", errors="ignore")):
+    # Exploit scanner detection — log, alert, track velocity, block.
+    #
+    # Cross-lineage scrum 2026-04-30 (Opus BLOCK OB-4): pre-fix the
+    # path regex matched on substrings like UNION, SELECT, ;-- and
+    # auto-banned after 3 hits. Admin URLs containing those keywords
+    # in query strings (e.g. an LLM team named "select-rebrand" or a
+    # docs link to /admin/select_a_mode) self-banned the admin's IP.
+    # Now: skip the path-based check for authenticated admins from
+    # an allowlisted IP. The user-agent + body checks (sentinel) still
+    # apply. Allowlisted-IP admins clicking weird URLs no longer
+    # lock themselves out.
+    _skip_exploit_check = False
+    if ip in ALLOWLIST_IPS and session.get("role") == "admin":
+        _skip_exploit_check = True
+    if not _skip_exploit_check and (
+        EXPLOIT_PATTERNS.search(path) or
+        EXPLOIT_PATTERNS.search(request.query_string.decode("utf-8", errors="ignore"))
+    ):
        sec_log.warning("EXPLOIT_SCAN ip=%s path=%s ua=%s", ip, path, ua)
        _track_violation(ip, "exploit_scan")
        send_security_alert(
@ -836,7 +935,21 @@ def auth_login():
        with get_db() as conn:
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                if is_setup:
-                    # First-time setup: create admin
+                    # First-time setup: create admin.
+                    #
+                    # Cross-lineage scrum 2026-04-30 (Opus WARN): pre-fix
+                    # this was gated only by COUNT(*) FROM users == 0.
+                    # If an operator ever truncated/restored the users
+                    # table, the next external visitor (any IP) could
+                    # claim admin. Now also requires the source IP to
+                    # be in ALLOWLIST_IPS — typically loopback + LAN
+                    # gateway — so a remote attacker hitting the setup
+                    # endpoint after an empty-users state can't seize
+                    # the account. Local operator running setup from
+                    # the box itself still works.
+                    if ip not in ALLOWLIST_IPS:
+                        sec_log.warning("SETUP_DENIED ip=%s — first-time setup requires allowlisted IP", ip)
+                        return jsonify({"error": "setup must be initiated from an allowlisted IP (typically localhost or LAN gateway)"}), 403
                    cur.execute("SELECT COUNT(*) as c FROM users")
                    if cur.fetchone()["c"] > 0:
                        return jsonify({"error": "Setup already completed"}), 400
@ -1349,6 +1462,22 @@ async function loadThreats() {
    });
    // Mass action buttons
    var spacer = document.createElement('div'); spacer.style.flex = '1'; toolbar.appendChild(spacer);
+    // Master "select all on this page" checkbox (2026-04-30 J UX request).
+    // Mirrors the per-row .ip-check style; toggles every visible row.
+    // Three-state: unchecked (none selected), checked (all selected),
+    // indeterminate (partial). updateSelCount keeps it in sync as
+    // individual rows are toggled.
+    var selAllWrap = document.createElement('label');
+    selAllWrap.style.cssText = 'display:flex;align-items:center;gap:6px;font-family:JetBrains Mono,monospace;font-size:9px;text-transform:uppercase;letter-spacing:0.5px;color:#7a7872;cursor:pointer';
+    selAllWrap.title = 'Toggle every IP on this page';
+    var selAll = document.createElement('input'); selAll.type = 'checkbox';
+    selAll.id = 'sel-all';
+    selAll.style.cssText = 'width:16px;height:16px;cursor:pointer;accent-color:#e2b55a';
+    selAll.onchange = function(){ toggleAllChecks(this.checked); };
+    selAllWrap.appendChild(selAll);
+    var selAllLabel = document.createElement('span'); selAllLabel.textContent = 'all';
+    selAllWrap.appendChild(selAllLabel);
+    toolbar.appendChild(selAllWrap);
    var selCount = document.createElement('span'); selCount.id = 'sel-count';
    selCount.style.cssText = 'font-family:JetBrains Mono,monospace;font-size:10px;color:#7a7872';
    toolbar.appendChild(selCount);
@ -1472,9 +1601,33 @@ async function loadThreats() {
 var currentSort = 'hits';

 function updateSelCount() {
-  var checks = document.querySelectorAll('.ip-check:checked');
+  var all = document.querySelectorAll('.ip-check');
+  var checked = document.querySelectorAll('.ip-check:checked');
  var el = document.getElementById('sel-count');
-  if (el) el.textContent = checks.length ? checks.length + ' selected' : '';
+  if (el) el.textContent = checked.length ? checked.length + ' selected' : '';
+  // Sync the master "all" checkbox to reflect the page's actual state.
+  // Three states: none → unchecked, all → checked, partial → indeterminate.
+  // Indeterminate is the visual "half-tick" most browsers render — gives
+  // operators a clear "you've got some but not all selected" hint.
+  var master = document.getElementById('sel-all');
+  if (master) {
+    if (checked.length === 0) {
+      master.checked = false; master.indeterminate = false;
+    } else if (checked.length === all.length) {
+      master.checked = true; master.indeterminate = false;
+    } else {
+      master.indeterminate = true;
+    }
+  }
+}
+
+function toggleAllChecks(checked) {
+  // Master "select all" handler — flips every per-row checkbox on the
+  // page to match the master's state. Used by the toolbar's `[ ] all`
+  // checkbox so operators don't have to click each threat individually
+  // before hitting Ban Selected. (2026-04-30 J UX request.)
+  document.querySelectorAll('.ip-check').forEach(function(cb){ cb.checked = checked; });
+  updateSelCount();
 }

 async function massAction(action) {
@ -1888,7 +2041,18 @@ def get_api_key(provider_name):
    env_map = {"openrouter": "OPENROUTER_API_KEY", "openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY", "ollama_cloud": "OLLAMA_CLOUD_API_KEY"}
    return os.environ.get(env_map.get(provider_name, ""), "")

-DB_DSN = "dbname=knowledge_base user=kbuser password=IPbLBA0EQI8u4TeM2YZrbm1OAy5nSwqC host=localhost"
+# Cross-lineage scrum 2026-04-30 (Opus BLOCK OB-2 + harness LLM
+# convergent finding): DB_DSN previously had the password hardcoded
+# in source. Same `kbuser`/`knowledge_base` DSN was leaked in
+# voice-ai's audiosocket_bridge.py + sales_assistant.py — confirmed
+# canonical leak by 3 independent reviewers across 2 sessions. Now
+# sourced from env (set via systemd EnvironmentFile=/etc/llm-team-ui.env).
+# No silent fallback to the leaked literal — fail loud. The leaked
+# password is in git history regardless; rotate it in Postgres.
+DB_DSN = os.environ.get("LLM_TEAM_DB_DSN", "")
+if not DB_DSN:
+    print("[llm-team-ui] WARNING: LLM_TEAM_DB_DSN not set — DB ops will fail. "
+          "Set in systemd EnvironmentFile or shell env.", file=sys.stderr, flush=True)

 def get_db():
    return psycopg2.connect(DB_DSN)
@ -7471,7 +7635,10 @@ def admin_security_data():
    sort_by = request.args.get("sort", "hits")
    result = []
    for ip, d in ips.items():
-        if ip.startswith("192.168."):
+        # 2026-04-30: was substring "192.168." — replaced with the
+        # canonical allowlist so 10.x, IPv6 ::1, and operator-added
+        # entries also stay out of the threat panel.
+        if is_allowlisted(ip):
            continue
        result.append({
            "ip": ip, "hits": d["hits"], "exploit_scans": d["exploit_scans"],
@ -7510,21 +7677,52 @@ def _kill_connections(ip):
        pass

 def _nginx_ban(ip):
-    """Add IP to nginx deny list and reload."""
+    """Add IP to nginx deny list and reload.
+
+    Defense in depth (2026-04-30): refuse to write allowlisted IPs
+    to the deny list under ANY circumstance — even a buggy caller
+    that bypassed _track_violation's allowlist check. The deny list
+    is the last write before nginx reload; this is the last place
+    we can stop a bad ban."""
+    if is_allowlisted(ip):
+        sec_log.info("NGINX_BAN_BLOCKED ip=%s — allowlisted, refused to write deny rule", ip)
+        return
    import subprocess
+    line = f"deny {ip};\n"
+    # Each step has its own try/except so we know WHICH step failed.
+    # Pre-2026-04-30 a single bare `except: pass` swallowed every
+    # error including PermissionError on the conf file write and
+    # CalledProcessError from systemctl. Sentinel + auto-escalate
+    # logged "BAN" but the request actually never landed in nginx.
+    # Now each failure mode hits sec_log so the operator sees why.
    try:
-        line = f"deny {ip};\n"
        try:
            with open(_NGINX_BAN_FILE) as f:
                if line in f.read():
                    return
        except FileNotFoundError:
            pass
+    except PermissionError as e:
+        sec_log.warning("NGINX_BAN_READ_DENIED file=%s err=%s — won't dedup, attempting append anyway", _NGINX_BAN_FILE, e)
+    try:
        with open(_NGINX_BAN_FILE, "a") as f:
            f.write(line)
-        subprocess.run(["systemctl", "reload", "nginx"], capture_output=True, timeout=5)
-    except Exception:
-        pass
+    except PermissionError as e:
+        sec_log.error("NGINX_BAN_WRITE_DENIED ip=%s file=%s err=%s — ban NOT effective at nginx layer", ip, _NGINX_BAN_FILE, e)
+        return
+    except Exception as e:
+        sec_log.error("NGINX_BAN_WRITE_ERROR ip=%s err=%s", ip, e)
+        return
+    try:
+        result = subprocess.run(["systemctl", "reload", "nginx"], capture_output=True, text=True, timeout=5)
+        if result.returncode != 0:
+            sec_log.error("NGINX_RELOAD_FAILED ip=%s rc=%d stderr=%s", ip, result.returncode, result.stderr.strip())
+    except subprocess.TimeoutExpired:
+        sec_log.error("NGINX_RELOAD_TIMEOUT ip=%s — systemctl reload nginx didn't finish in 5s", ip)
+    except FileNotFoundError:
+        sec_log.error("NGINX_RELOAD_NO_SYSTEMCTL ip=%s — systemctl not in PATH for service user", ip)
+    except Exception as e:
+        sec_log.error("NGINX_RELOAD_ERROR ip=%s err=%s", ip, e)

 def _nginx_unban(ip):
    """Remove IP from nginx deny list and reload."""
@ -7552,8 +7750,13 @@ def admin_ban_ip():
    action = data.get("action", "ban")
    if not ip:
        return jsonify({"error": "IP required"}), 400
-    if ip.startswith("192.168."):
-        return jsonify({"error": "Cannot ban LAN addresses"}), 400
+    # Defense in depth (2026-04-30): use the canonical ALLOWLIST_IPS
+    # check rather than a substring on "192.168." which would let
+    # 10.0.0.0/8 LANs and IPv6 loopback ::1 through. Same allowlist
+    # the auto-ban paths now respect — operator can't accidentally
+    # cut off their own LAN gateway.
+    if is_allowlisted(ip):
+        return jsonify({"error": f"refusing to ban allowlisted IP {ip} (in ALLOWLIST_IPS)"}), 400
    try:
        if action == "ban":
            subprocess.run(["fail2ban-client", "set", "llm-team-exploit", "banip", ip],
@ -7813,7 +8016,10 @@ def admin_mass_ban():
    results = {"success": 0, "failed": 0, "skipped": 0}
    for ip in ip_list:
        ip = ip.strip()
-        if not ip or ip.startswith("192.168."):
+        # 2026-04-30: substring "192.168." → is_allowlisted so all
+        # trusted networks (LAN gateways, IPv6 loopback, custom
+        # entries) are skipped, not just one /16.
+        if not ip or is_allowlisted(ip):
            results["skipped"] += 1
            continue
        try:
@ -12535,7 +12741,17 @@ SENTINEL_MODEL = "qwen2.5:latest"
 SENTINEL_INTERVAL = 300  # 5 minutes
 _sentinel_last_pos = 0
 _sentinel_results = []  # last 50 analyses
-_sentinel_stats = {"scans": 0, "bans": 0, "last_run": None, "last_error": None, "next_scan_ts": 0}
+_sentinel_stats = {
+    "scans": 0, "bans": 0, "last_run": None, "last_error": None, "next_scan_ts": 0,
+    # 2026-04-30 J: track consecutive AI-query failures so we can
+    # fire a callback (email alert) when Ollama is sustainedly busy
+    # or unreachable. Pre-fix a model-busy state preserved log
+    # position + skipped the scan with no operator notification.
+    "consecutive_ai_failures": 0,
+    "ai_busy_alerted": False,  # one alert per outage; clears on first success
+}
+SENTINEL_AI_FAILURE_ALERT_THRESHOLD = 3  # consecutive failures before email
+SENTINEL_AI_RETRY_DELAY_SECS = 30        # wait before retry inside same scan

 def _sentinel_log_entry(msg):
    """Write to sentinel log file."""
@ -12583,7 +12799,10 @@ def _sentinel_scan():
            if token.startswith("ip="):
                ip = token[3:]
                break
-        if ip and not ip.startswith("192.168."):
+        # 2026-04-30: was substring "192.168." — sentinel now skips
+        # ALL allowlisted IPs from analysis (saves tokens + prevents
+        # the AI judge from getting confused by legitimate admin traffic).
+        if ip and not is_allowlisted(ip):
            ip_activity[ip].append(line)

    if not ip_activity:
@ -12648,21 +12867,85 @@ def _sentinel_scan():
    for ip, summary, _ in analysis_items[:15]:  # max 15 IPs per scan
        prompt += summary + "\n"

-    # Query local AI
-    try:
-        cfg = load_config()
-        base = cfg["providers"]["ollama"].get("base_url", "http://localhost:11434")
-        resp = requests.post(f"{base}/api/generate", json={
-            "model": SENTINEL_MODEL, "prompt": prompt, "stream": False,
-            "options": {"num_ctx": 4096, "temperature": 0.1}
-        }, timeout=60)
-        resp.raise_for_status()
-        ai_response = resp.json()["response"]
-    except Exception as e:
-        _sentinel_stats["last_error"] = f"AI query failed: {e}"
-        _sentinel_log_entry(f"AI_ERROR error={e}")
+    # Query local AI. 2026-04-30 J fix: retry once on model-busy /
+    # connection / timeout, and fire an operator callback when the
+    # AI is sustainedly unreachable. Pre-fix a single Ollama hiccup
+    # silently dropped the scan with no notification — operator only
+    # discovered the gap by checking sentinel-status manually.
+    cfg = load_config()
+    base = cfg["providers"]["ollama"].get("base_url", "http://localhost:11434")
+    body = {
+        "model": SENTINEL_MODEL, "prompt": prompt, "stream": False,
+        "options": {"num_ctx": 4096, "temperature": 0.1},
+    }
+    ai_response = None
+    last_err = None
+    for attempt in range(2):  # original try + 1 retry
+        try:
+            resp = requests.post(f"{base}/api/generate", json=body, timeout=60)
+            resp.raise_for_status()
+            ai_response = resp.json()["response"]
+            break
+        except (requests.exceptions.ConnectionError,
+                requests.exceptions.Timeout,
+                requests.exceptions.ReadTimeout) as e:
+            last_err = f"connection/timeout: {e}"
+            if attempt == 0:
+                _sentinel_log_entry(f"AI_BUSY_RETRY attempt=1 err={str(e)[:80]} sleeping={SENTINEL_AI_RETRY_DELAY_SECS}s")
+                time.sleep(SENTINEL_AI_RETRY_DELAY_SECS)
+                continue
+        except requests.exceptions.HTTPError as e:
+            # 503 Service Unavailable + 429 Too Many = busy; retry.
+            # Other HTTP errors (404 model missing, 400 bad prompt) won't
+            # recover from a retry, so fail fast.
+            sc = getattr(e.response, "status_code", 0)
+            last_err = f"HTTP {sc}: {e}"
+            if sc in (429, 503) and attempt == 0:
+                _sentinel_log_entry(f"AI_BUSY_RETRY attempt=1 status={sc} sleeping={SENTINEL_AI_RETRY_DELAY_SECS}s")
+                time.sleep(SENTINEL_AI_RETRY_DELAY_SECS)
+                continue
+            break
+        except Exception as e:
+            last_err = f"unexpected: {e}"
+            break
+
+    if ai_response is None:
+        _sentinel_stats["consecutive_ai_failures"] += 1
+        _sentinel_stats["last_error"] = f"AI query failed: {last_err}"
+        _sentinel_log_entry(
+            f"AI_ERROR error={last_err} consecutive={_sentinel_stats['consecutive_ai_failures']}"
+        )
+        # Operator callback: fire a security alert email when the AI
+        # has been down for ≥N consecutive scans. One alert per outage —
+        # cleared on next successful scan so a flapping AI doesn't
+        # spam the inbox.
+        if (_sentinel_stats["consecutive_ai_failures"] >= SENTINEL_AI_FAILURE_ALERT_THRESHOLD
+                and not _sentinel_stats["ai_busy_alerted"]):
+            _sentinel_stats["ai_busy_alerted"] = True
+            try:
+                send_security_alert(
+                    f"Sentinel AI unreachable ({_sentinel_stats['consecutive_ai_failures']} consecutive failures)",
+                    f"The sentinel auto-scanner has been unable to reach the LLM judge for "
+                    f"{_sentinel_stats['consecutive_ai_failures']} consecutive scans.\n\n"
+                    f"Last error: {last_err}\n"
+                    f"Model: {SENTINEL_MODEL}\n"
+                    f"Endpoint: {base}\n\n"
+                    f"Threats are being logged and surfaced in the threat-intel UI but "
+                    f"NOT auto-banned during this outage. Manual review recommended.",
+                )
+            except Exception as alert_err:
+                sec_log.error("SENTINEL_ALERT_SEND_FAILED err=%s", alert_err)
        return

+    # AI succeeded. Reset the failure counter + clear the alerted flag
+    # so the next outage gets its own notification.
+    if _sentinel_stats["consecutive_ai_failures"] > 0:
+        _sentinel_log_entry(
+            f"AI_RECOVERED after_failures={_sentinel_stats['consecutive_ai_failures']}"
+        )
+    _sentinel_stats["consecutive_ai_failures"] = 0
+    _sentinel_stats["ai_busy_alerted"] = False
+
    # Parse AI response
    try:
        # Extract JSON from response (handle markdown code blocks)
@ -12688,9 +12971,41 @@ def _sentinel_scan():
    ban_futures = []

    def _execute_ban(ip, threat, reason, attack_type):
-        """Execute a single ban — fail2ban + nginx + kill connections."""
-        subprocess.run(["fail2ban-client", "set", "llm-team-exploit", "banip", ip],
-                       capture_output=True, text=True, timeout=5)
+        """Execute a single ban — fail2ban + nginx + kill connections.
+
+        2026-04-30 J fix: actually examine the fail2ban-client result.
+        Pre-fix capture_output=True was set but the result thrown away,
+        so a non-zero exit (jail not configured, IP already banned, IPv6
+        format quirk) silently said "AI_BAN" in the log while the
+        attacker walked through unimpeded. Now logs returncode + stderr
+        on failure so the operator sees WHY the ban didn't stick."""
+        try:
+            result = subprocess.run(
+                ["fail2ban-client", "set", "llm-team-exploit", "banip", ip],
+                capture_output=True, text=True, timeout=5,
+            )
+            if result.returncode != 0:
+                sec_log.error(
+                    "FAIL2BAN_BAN_FAILED ip=%s rc=%d stdout=%s stderr=%s",
+                    ip, result.returncode,
+                    result.stdout.strip()[:200],
+                    result.stderr.strip()[:200],
+                )
+                _sentinel_log_entry(
+                    f"FAIL2BAN_FAILED ip={ip} rc={result.returncode} "
+                    f"err={result.stderr.strip()[:120]}"
+                )
+                # Continue anyway — nginx layer is independent and may
+                # still take effect.
+        except subprocess.TimeoutExpired:
+            sec_log.error("FAIL2BAN_TIMEOUT ip=%s — client didn't return in 5s", ip)
+            _sentinel_log_entry(f"FAIL2BAN_TIMEOUT ip={ip}")
+        except FileNotFoundError:
+            sec_log.error("FAIL2BAN_NOT_INSTALLED ip=%s — fail2ban-client not in PATH", ip)
+            _sentinel_log_entry(f"FAIL2BAN_NOT_INSTALLED ip={ip}")
+        except Exception as e:
+            sec_log.error("FAIL2BAN_ERROR ip=%s err=%s", ip, e)
+            _sentinel_log_entry(f"FAIL2BAN_ERROR ip={ip} err={e}")
        _nginx_ban(ip)
        _kill_connections(ip)
        sec_log.warning("AI_BAN ip=%s threat=%s reason=%s attack=%s", ip, threat, reason, attack_type)
@ -12714,7 +13029,11 @@ def _sentinel_scan():
            if len(_sentinel_results) > 50:
                _sentinel_results.pop(0)

-            if action == "ban" and ip and not ip.startswith("192.168."):
+            # 2026-04-30: was substring "192.168." — replaced with
+            # canonical is_allowlisted so the sentinel's AI verdict
+            # can't accidentally ban any allowlisted IP that slipped
+            # past the analysis filter (defense in depth).
+            if action == "ban" and ip and not is_allowlisted(ip):
                ban_futures.append(executor.submit(_execute_ban, ip, threat, reason, attack_type))
            else:
                _sentinel_log_entry(f"AI_VERDICT ip={ip} threat={threat} action={action} reason={reason} attack_type={attack_type}")
Author	SHA1	Message	Date
root	083f05c093	sentinel: log subprocess failures + AI-busy retry/callback + allowlist all ban-paths Three issues from J 2026-04-30: 1. Silent fail2ban-client / nginx subprocess failures Pre-fix _execute_ban called fail2ban-client with capture_output=True but threw away the result. _nginx_ban had bare `except: pass` swallowing everything. So a non-zero fail2ban exit (jail not configured, IP already banned, IPv6 quirk) or PermissionError on /etc/nginx/banned_ips.conf logged "AI_BAN" while the attacker walked through unimpeded. The "errors in logs" J was seeing. Now every subprocess failure surfaces: - FAIL2BAN_FAILED rc=N stderr=... — non-zero exit - FAIL2BAN_TIMEOUT — client didn't return in 5s - FAIL2BAN_NOT_INSTALLED — binary missing - NGINX_BAN_WRITE_DENIED — permission error on conf file - NGINX_RELOAD_FAILED rc=N stderr=... — systemctl reload non-zero - NGINX_RELOAD_TIMEOUT / NGINX_RELOAD_NO_SYSTEMCTL — runtime gaps sec_log.error catches these so journalctl -u llm-team-ui shows the actual reason a ban didn't stick. 2. AI auto-scan failure callback when model is busy Pre-fix Ollama unreachable / busy / timeout silently preserved log position + skipped the scan. Operator only learned about the gap by manually checking sentinel-status. Now: - 1 retry inside same scan after SENTINEL_AI_RETRY_DELAY_SECS (30s) on connection error / timeout / 429 / 503 - 4xx errors that won't recover (404 model missing, 400 bad prompt) fail fast without retrying - consecutive_ai_failures counter in _sentinel_stats - On 3+ consecutive failures, send_security_alert() fires — "Sentinel AI unreachable" email with last error + endpoint + model name. One alert per outage (ai_busy_alerted flag); clears on first successful scan so flapping doesn't spam. - AI_RECOVERED log line on first scan after a streak. 3. Sentinel ban path still substring-matched 192.168 Same vulnerability class as admin_ban_ip had — only protected one /16. Replaced 4 sites with is_allowlisted(ip): - threat-list display filter (line 7638): now hides ALL allowlisted IPs from the panel - mass-ban API (line 8016): refuses ban for any allowlisted IP - sentinel analysis filter (line 12786): saves AI tokens by never sending allowlisted-IP traffic to the judge - sentinel ban verdict gate (line 12949): defense in depth — even if the AI says "ban" on an allowlisted IP, this catches it Combined with the layered defenses in b09b73c (track_violation, _auto_escalate, _nginx_ban, admin_ban_ip), there is now no code path that can ban an allowlisted IP. Operator self-ban is structurally impossible. Privilege note: the systemd unit at /root/llm-team-ui/llm-team-ui.service runs as User=root, so subprocess.run(["fail2ban-client", ...]) and systemctl reload nginx have permission. The "errors in logs" J was seeing weren't permission-denied; they were silent non-zero exits. The new subprocess wrappers surface those. If the operator later splits the app into a non-root tier (Opus OB-3 architectural recommendation, deferred), this same infrastructure still works — the wrappers will then surface "PermissionError" with full path + uid context, telling the operator exactly which command needs sudo NOPASSWD or PolicyKit rule. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 03:33:20 -05:00
root	2575842f7b	threat-intel: master 'select all' checkbox in toolbar UX request 2026-04-30: when sorting by threat in the threat intel panel, ban-selected required clicking each per-row checkbox individually. Pages with 20-50 threats made bulk-ban tedious. Adds a master `[ ] all` checkbox to the toolbar (right of the Sort buttons, left of the existing 'N selected' counter) that toggles every per-row .ip-check on the page in one click. Then 'Ban Selected' / 'Unban Selected' work over the whole set. Three-state: unchecked (none selected) / checked (all) / indeterminate (partial — browsers render this as a "half-tick" so operators get visual feedback when they've toggled some rows manually after using master). updateSelCount keeps the master in sync as individual rows toggle so the visual is always truthful. No backend change — `/api/admin/security/mass-ban` already accepts an arbitrary IP list. This is purely a frontend ergonomics improvement on top of the existing mass-action infrastructure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 03:27:20 -05:00
root	b09b73c409	llm_team_ui: ban-system defense in depth (extends OB-4) The OB-4 fix at 939dfdd was incomplete. It bypassed the path-regex EXPLOIT check for allowlisted-IP admins, but _track_violation is called from THREE sites (exploit_scan, rate_limit, login_fail) and only the exploit_scan path had the bypass. An admin who hit a rate limit + had a session timeout + entered a wrong password within 60s could still self-ban via the OTHER paths, ending up locked out through the back door. This commit adds 4 layers of defense in depth, each independently sufficient to stop an allowlisted IP from being banned: 1. _track_violation: bail early if is_allowlisted(ip). Allowlisted IPs never accumulate violations from ANY path. Plus eviction sweep when _violation_tracker grows >10K (same shape as the _rate_limit eviction at 266de61). 2. _auto_escalate: re-check is_allowlisted before issuing any ban. Defense in depth — if a future call path bypasses #1, this catches it. 3. _nginx_ban: refuse to write the deny rule for allowlisted IPs, even if a buggy caller reached this far. Last write before nginx reload; last place to stop a bad ban. 4. admin_ban_ip: replace `ip.startswith("192.168.")` substring check with the canonical ALLOWLIST_IPS membership test. Pre-fix this only protected one LAN; 10.0.0.0/8, IPv6 loopback ::1, and custom allowlist entries (e.g. an external monitoring IP) were all banable by manual admin error. Now uses the same allowlist as the auto-ban paths. Operationally the admin can no longer self-ban through any path. The auto-escalate ban audit log entries get a corresponding "AUTO_ESCALATE_BLOCKED ip=... — allowlisted" entry instead of the ban firing silently. Same for nginx_ban: NGINX_BAN_BLOCKED entries log the saved bullet for operator review. Builds on 939dfdd (OB-4 path-regex bypass) + 266de61 (rate_limit eviction + auth_login IP gate). Together these three commits close the LLM Team UI's ban-system self-foot-shoot vulnerability surface. Outstanding from the scrum (architectural, separate session): - OB-3 root-running web app + privileged shell calls - Sentinel prompt-injection WARN Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 03:24:01 -05:00
root	266de613b2	llm_team_ui: 2 more scrum WARNs (rate_limit eviction + setup IP gate) Closes the 2 remaining surgical-fix WARNs from the 2026-04-30 cross-lineage scrum on this codebase. OB-3 (root-running web app with shell calls to fail2ban-client / systemctl / nginx config) and the sentinel prompt-injection WARN both need bigger architectural work and stay deferred. OB-rate-limit (Opus WARN) — _rate_limit dict unbounded Pre-fix: per-worker dict with no eviction; an attacker slowly rotating IPs leaked memory forever. Fix: lazy eviction sweep triggered when dict grows beyond 10K entries (cheap because we only scan when growth is unusual). Real production wants a Redis-backed shared counter; this is the in-process band-aid that prevents runaway growth without changing the deploy shape. OB-auth-setup (Opus WARN) — first-time setup grant from any IP Pre-fix: /api/auth/login with setup=true was gated only by COUNT(*) FROM users == 0. If the users table was ever truncated or restored empty, the next external visitor (ANY IP) claimed admin. Fix: also require the source IP to be in ALLOWLIST_IPS (typically loopback + LAN gateway). Local operator setup still works; remote attackers hitting the endpoint after an empty- users state get 403. Both fixes are surgical — single function, no behavior change for the happy path. The eviction sweep runs O(n) only when n>10K and only drops entries already past their useful window, so it never removes an active rate-limit count. Outstanding from the scrum (deferred): - OB-3 root-running web app: needs split into non-root Flask tier + privileged sudo wrapper service. 2-4 hr architectural work. - Sentinel prompt-injection WARN: feeds attacker-controlled UA/ path into LLM judge prompt. Needs prompt-template hardening or output validation gate before LLM verdicts can issue ban actions. - CSP unsafe-inline WARN: defeats most XSS protection. Removing it requires moving inline scripts to external files (HTML refactor). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 03:19:38 -05:00
root	939dfddb93	llm_team_ui: 4 fixes from 2026-04-30 cross-lineage scrum Cross-lineage scrum (Opus 4.7 + Kimi K2.6 + Qwen3-coder via the local-review-harness chatd) on this codebase surfaced 5 BLOCK-class issues from Opus + a convergent finding from the harness. This commit lands the 4 surgical fixes; OB-3 (web app runs as root with fail2ban-client + systemctl reload nginx + writes to /etc/nginx/banned_ips.conf) needs an architectural split into a non-root web tier + a privileged sudo wrapper, deferred for its own session. OB-1 — log file open at import crashes app on perm error Pre-fix: `_sec_handler = logging.FileHandler("/var/log/llm-team- security.log")` raised PermissionError at import time on any non-root or fresh-install run, killing the app before Flask started — failure was silent (no Flask process to inspect logs on). Fix: try/except, fall back to StreamHandler(sys.stderr) when the path is unwritable. App starts; sec_log events still land in journald via stderr. LLM_TEAM_SECURITY_LOG env var lets operators override the path. OB-2 — DB password hardcoded in source (CONVERGENT FINDING) The `kbuser` Postgres credential `IPbLBA0EQI8u4TeM2YZrbm1OAy5nSwqC` was leaked in source here AND in voice-ai/audiosocket_bridge.py + voice-ai/sales_assistant.py. Caught independently by harness LLM phase (qwen3.5 local) on voice-ai earlier today AND Opus on this file just now. Same password, same DB (`knowledge_base`) shared between services, three reviewers converged. Fix: source from LLM_TEAM_DB_DSN env var, fail loud on unset. Operator follow-ups: 1. Rotate the password in Postgres (still in git history; redacting source doesn't un-leak it). 2. Set LLM_TEAM_DB_DSN in /etc/llm-team-ui.env (mode 0600, loaded via systemd EnvironmentFile=). 3. Same DSN env-var pattern needs applying to voice-ai/audiosocket_bridge.py:47 once that branch's workspace_context WIP lands. OB-5 — demo_mode default=True ships public access on first boot Pre-fix: `_demo_mode = {"active": True, ...}` + the demo branch in login_required let users through without a session. Combined with /api/run + /api/imagegen proxies, fresh installs were open LLM/compute abuse surface from first boot. Fix: default to False; LLM_TEAM_DEMO_MODE=1 env override exists for the public devop.live deployment systemd unit so the demo doesn't need a manual flip on every restart, but everywhere else defaults closed. OB-4 — EXPLOIT_PATTERNS LAN/admin lockout Pre-fix: regex matched on `request.path` + query string against patterns like UNION / SELECT / ;-- / <script /admin.php. Admin URLs containing those keywords in legitimate ways (e.g. a team name "select-rebrand" or a docs link /admin/select_a_mode) hit 3 violations in 60s and auto-banned the admin's IP. No allowlist. Fix: bypass the path-based check for authenticated admins from an ALLOWLIST_IPS source. Body/UA checks still apply (the prompt- injection-as-DoS WARN in the scrum is separate). Combination prevents self-ban without weakening the broader scanner defense. Plus a .gitignore: /.memory/ — the local-review-harness writes JSONL findings under <repo>/.memory/ when scanning; harness's own gitignore is at the harness repo root, not here, so without this the .memory/ dir would show up as untracked on every harness run against this tree. Other Opus WARNs deferred: - Sentinel feeds attacker-controlled UA into LLM prompt → can steer ban verdicts. Fix needs prompt-template hardening or output-validation gate. - CSP `'unsafe-inline'` defeats most XSS protection (would break inline scripts; needs HTML refactor). - _rate_limit unbounded dict + per-worker (needs eviction loop or Redis-backed counter). - auth_login first-time setup gated only by COUNT(*)==0 (needs network-source restriction or a setup token). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 03:14:08 -05:00