llm_team_ui: 4 fixes from 2026-04-30 cross-lineage scrum

Cross-lineage scrum (Opus 4.7 + Kimi K2.6 + Qwen3-coder via the
local-review-harness chatd) on this codebase surfaced 5 BLOCK-class
issues from Opus + a convergent finding from the harness. This commit
lands the 4 surgical fixes; OB-3 (web app runs as root with
fail2ban-client + systemctl reload nginx + writes to
/etc/nginx/banned_ips.conf) needs an architectural split into a
non-root web tier + a privileged sudo wrapper, deferred for its own
session.

OB-1: log file open at import crashes app on perm error

Pre-fix: `_sec_handler = logging.FileHandler("/var/log/llm-team-security.log")`
raised PermissionError at import time on any non-root or fresh-install
run, killing the app before Flask started. The failure was silent (no
Flask process to inspect logs on).

Fix: try/except, fall back to StreamHandler(sys.stderr) when the path
is unwritable. App starts; sec_log events still land in journald via
stderr. LLM_TEAM_SECURITY_LOG env var lets operators override the path.

OB-2: DB password hardcoded in source (CONVERGENT FINDING)

The `kbuser` Postgres credential `IPbLBA0EQI8u4TeM2YZrbm1OAy5nSwqC` was
leaked in source here AND in voice-ai/audiosocket_bridge.py +
voice-ai/sales_assistant.py. Caught independently by harness LLM phase
(qwen3.5 local) on voice-ai earlier today AND Opus on this file just
now. Same password, same DB (`knowledge_base`) shared between services,
three reviewers converged.

Fix: source from LLM_TEAM_DB_DSN env var with no fallback to the leaked
literal. When the var is unset the app prints a warning to stderr at
import and DB operations fail.

Operator follow-ups:
  1. Rotate the password in Postgres (still in git history; redacting
     source doesn't un-leak it).
  2. Set LLM_TEAM_DB_DSN in /etc/llm-team-ui.env (mode 0600, loaded via
     systemd EnvironmentFile=).
  3. Same DSN env-var pattern needs applying to
     voice-ai/audiosocket_bridge.py:47 once that branch's
     workspace_context WIP lands.

OB-5: demo_mode default=True ships public access on first boot

Pre-fix: `_demo_mode = {"active": True, ...}` + the demo branch in
login_required let users through without a session. Combined with
/api/run + /api/imagegen proxies, fresh installs were an open
LLM/compute abuse surface from first boot.

Fix: default to False. An LLM_TEAM_DEMO_MODE=1 env override exists for
the public devop.live deployment's systemd unit so the demo doesn't need
a manual flip on every restart, but everywhere else defaults closed.

OB-4: EXPLOIT_PATTERNS LAN/admin lockout

Pre-fix: regex matched on `request.path` + query string against
patterns like UNION / SELECT / ;-- / <script / admin.php. Admin URLs
containing those keywords in legitimate ways (e.g. a team name
"select-rebrand" or a docs link /admin/select_a_mode) hit 3 violations
in 60s and auto-banned the admin's IP. No allowlist.

Fix: bypass the path and query-string check for authenticated admins
from an ALLOWLIST_IPS source. Body/UA checks still apply (the
prompt-injection-as-DoS WARN in the scrum is separate). The combination
prevents self-ban without weakening the broader scanner defense.

Plus a .gitignore entry: /.memory/. The local-review-harness writes
JSONL findings under <repo>/.memory/ when scanning; the harness's own
gitignore is at the harness repo root, not here, so without this the
.memory/ dir would show up as untracked on every harness run against
this tree.

Other Opus WARNs deferred:
  - Sentinel feeds attacker-controlled UA into the LLM prompt, which can
    steer ban verdicts. Fix needs prompt-template hardening or an
    output-validation gate.
  - CSP `'unsafe-inline'` defeats most XSS protection (removing it would
    break inline scripts; needs HTML refactor).
  - _rate_limit is an unbounded dict and per-worker (needs eviction loop
    or Redis-backed counter).
  - auth_login first-time setup gated only by COUNT(*)==0 (needs
    network-source restriction or a setup token).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
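Review note on the deferred `_rate_limit` WARN: one way the "eviction loop" option could look is a sliding-window counter with a capped key table. This is a minimal sketch, not code from this repo; `BoundedRateLimiter`, `max_keys` and the injectable `clock` are hypothetical names introduced for illustration, and the counter is still per-worker, so it does not replace the Redis-backed option.

```python
import time
from collections import OrderedDict, deque


class BoundedRateLimiter:
    """Sliding-window limiter whose key table cannot grow without bound (hypothetical)."""

    def __init__(self, limit, window_s, max_keys=10_000, clock=time.monotonic):
        self.limit = limit          # max hits allowed per key inside the window
        self.window_s = window_s    # window length in seconds
        self.max_keys = max_keys    # hard cap on tracked keys
        self._clock = clock         # injectable so tests can control time
        self._hits = OrderedDict()  # key -> deque of hit timestamps, in LRU order

    def allow(self, key):
        now = self._clock()
        hits = self._hits.pop(key, None)
        if hits is None:
            hits = deque()
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        allowed = len(hits) < self.limit
        if allowed:
            hits.append(now)
        # Re-insert so this key becomes the most recently used.
        self._hits[key] = hits
        # Evict least recently used keys once the cap is exceeded.
        while len(self._hits) > self.max_keys:
            self._hits.popitem(last=False)
        return allowed
```

Evicting a key also forgets its hit count, so a client that can cycle through more than `max_keys` source keys could reset its own window; the cap trades that risk against unbounded memory growth.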
2026-04-30 03:14:08 -05:00
commit 939dfddb93
parent 205eff64b4
2 changed files with 62 additions and 7 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -2,3 +2,4 @@ __pycache__/
 *.pyc
 .env
 *.log
+.memory/
--- a/llm_team_ui.py
+++ b/llm_team_ui.py
@@ -3,6 +3,7 @@
 import json
 import os
+import sys
 import time
 import threading
 import secrets
@@ -36,8 +37,23 @@ app.config["SESSION_COOKIE_HTTPONLY"] = True
 app.config["SESSION_COOKIE_SAMESITE"] = "Lax"
 # ─── SECURITY LOGGING ─────────────────────────────────────────
-# Dedicated security log for fail2ban and audit trail
-_sec_handler = logging.FileHandler("/var/log/llm-team-security.log")
+# Dedicated security log for fail2ban and audit trail.
+#
+# Cross-lineage scrum 2026-04-30 (Opus BLOCK OB-1): wrapped in
+# try/except. Pre-fix this raised PermissionError at import time
+# when the service user couldn't write /var/log/llm-team-security.log,
+# crashing the app before Flask started. Now falls back to stderr;
+# sec_log still works (ban events still land in journald via stderr),
+# but the app starts. Operator should still create the file with
+# proper perms on production deploy. Path is overridable via
+# LLM_TEAM_SECURITY_LOG env var.
+_LOG_PATH = os.environ.get("LLM_TEAM_SECURITY_LOG", "/var/log/llm-team-security.log")
+try:
+    _sec_handler = logging.FileHandler(_LOG_PATH)
+except (PermissionError, FileNotFoundError, OSError) as _log_err:
+    print(f"[security] WARNING: can't open {_LOG_PATH} ({_log_err}); "
+          f"falling back to stderr.", file=sys.stderr, flush=True)
+    _sec_handler = logging.StreamHandler(sys.stderr)
 _sec_handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
 sec_log = logging.getLogger("security")
 sec_log.addHandler(_sec_handler)
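Review note: the OB-1 fallback in the hunk above can be exercised in isolation when wrapped in a function. This is a minimal sketch; `make_security_handler` is a hypothetical helper, not part of the commit. It catches only `OSError`, which is sufficient because `PermissionError` and `FileNotFoundError` are both subclasses of it.

```python
import logging
import sys


def make_security_handler(path):
    """Return a FileHandler for `path`, or a stderr StreamHandler if it can't be opened."""
    try:
        # FileHandler opens the file immediately (delay defaults to False),
        # so an unwritable or missing path raises here, not at first log call.
        return logging.FileHandler(path)
    except OSError as err:
        print(f"[security] WARNING: can't open {path} ({err}); "
              f"falling back to stderr.", file=sys.stderr, flush=True)
        return logging.StreamHandler(sys.stderr)
```

When checking which branch was taken, test for `logging.FileHandler` first: `FileHandler` subclasses `StreamHandler`, so `isinstance(h, logging.StreamHandler)` is true in both cases.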
@@ -150,8 +166,19 @@ def _check_high_alert_expiry():
 # IPs that never get rate-limited (your LAN, localhost)
 ALLOWLIST_IPS = {"127.0.0.1", "::1", "192.168.1.1"}
-# Demo mode state — toggled by admin at runtime
-_demo_mode = {"active": True, "started_by": "boot", "showcase": True}
+# Demo mode state — toggled by admin at runtime.
+#
+# Cross-lineage scrum 2026-04-30 (Opus BLOCK OB-5): pre-fix this
+# defaulted to active=True, meaning fresh installs shipped with
+# public unauthenticated access enabled — login_required let demo
+# users straight through. Combined with /api/run + /api/imagegen
+# proxies, that was an open LLM/compute abuse surface from first
+# boot. Now defaults to active=False; operators flip it on
+# explicitly via the admin UI or LLM_TEAM_DEMO_MODE=1 env override
+# (the env override exists for the demo systemd unit so the public
+# devop.live deployment doesn't need a manual toggle on every restart).
+_DEMO_DEFAULT = os.environ.get("LLM_TEAM_DEMO_MODE", "0") == "1"
+_demo_mode = {"active": _DEMO_DEFAULT, "started_by": "boot" if _DEMO_DEFAULT else "off", "showcase": _DEMO_DEFAULT}
 # Routes that demo users CAN trigger (read-like POSTs — enrichment, self-analysis, team runs)
 DEMO_ALLOWED_POSTS = {
@@ -560,8 +587,24 @@ def security_checks():
    # Check high-alert expiry
    _check_high_alert_expiry()
-    # Exploit scanner detection — log, alert, track velocity, block
-    if EXPLOIT_PATTERNS.search(path) or EXPLOIT_PATTERNS.search(request.query_string.decode("utf-8", errors="ignore")):
+    # Exploit scanner detection — log, alert, track velocity, block.
+    #
+    # Cross-lineage scrum 2026-04-30 (Opus BLOCK OB-4): pre-fix the
+    # path regex matched on substrings like UNION, SELECT, ;-- and
+    # auto-banned after 3 hits. Admin URLs containing those keywords
+    # in query strings (e.g. an LLM team named "select-rebrand" or a
+    # docs link to /admin/select_a_mode) self-banned the admin's IP.
+    # Now: skip the path-based check for authenticated admins from
+    # an allowlisted IP. The user-agent + body checks (sentinel) still
+    # apply. Allowlisted-IP admins clicking weird URLs no longer
+    # lock themselves out.
+    _skip_exploit_check = False
+    if ip in ALLOWLIST_IPS and session.get("role") == "admin":
+        _skip_exploit_check = True
+    if not _skip_exploit_check and (
+        EXPLOIT_PATTERNS.search(path) or
+        EXPLOIT_PATTERNS.search(request.query_string.decode("utf-8", errors="ignore"))
+    ):
        sec_log.warning("EXPLOIT_SCAN ip=%s path=%s ua=%s", ip, path, ua)
        _track_violation(ip, "exploit_scan")
        send_security_alert(
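Review note: the OB-4 decision in this hunk reduces to a pure function, which makes the bypass easy to unit-test. This is a minimal sketch under stated assumptions: `should_flag_exploit` is a hypothetical name, and the regex below is a stand-in built from the keywords in the commit message, since the real `EXPLOIT_PATTERNS` is not shown in this diff.

```python
import re

# Stand-in pattern (assumption): the app's actual EXPLOIT_PATTERNS is not in this diff.
EXPLOIT_PATTERNS = re.compile(r"union|select|;--|<script|/admin\.php", re.IGNORECASE)
# Same literal set as the ALLOWLIST_IPS context line in the diff.
ALLOWLIST_IPS = {"127.0.0.1", "::1", "192.168.1.1"}


def should_flag_exploit(ip, role, path, query):
    """Return True when the request should count as an exploit-scan violation."""
    # Both conditions must hold to skip the scan: allowlisted IP AND admin session.
    if ip in ALLOWLIST_IPS and role == "admin":
        return False
    return bool(EXPLOIT_PATTERNS.search(path) or EXPLOIT_PATTERNS.search(query))
```

Requiring both conditions means an admin session from a non-allowlisted address and an unauthenticated client on the LAN are both still scanned.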
@@ -1888,7 +1931,18 @@ def get_api_key(provider_name):
    env_map = {"openrouter": "OPENROUTER_API_KEY", "openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY", "ollama_cloud": "OLLAMA_CLOUD_API_KEY"}
    return os.environ.get(env_map.get(provider_name, ""), "")
-DB_DSN = "dbname=knowledge_base user=kbuser password=IPbLBA0EQI8u4TeM2YZrbm1OAy5nSwqC host=localhost"
+# Cross-lineage scrum 2026-04-30 (Opus BLOCK OB-2 + harness LLM
+# convergent finding): DB_DSN previously had the password hardcoded
+# in source. Same `kbuser`/`knowledge_base` DSN was leaked in
+# voice-ai's audiosocket_bridge.py + sales_assistant.py — confirmed
+# canonical leak by 3 independent reviewers across 2 sessions. Now
+# sourced from env (set via systemd EnvironmentFile=/etc/llm-team-ui.env).
+# No silent fallback to the leaked literal — fail loud. The leaked
+# password is in git history regardless; rotate it in Postgres.
+DB_DSN = os.environ.get("LLM_TEAM_DB_DSN", "")
+if not DB_DSN:
+    print("[llm-team-ui] WARNING: LLM_TEAM_DB_DSN not set — DB ops will fail. "
+          "Set in systemd EnvironmentFile or shell env.", file=sys.stderr, flush=True)
 def get_db():
    return psycopg2.connect(DB_DSN)
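Review note: as committed, an unset LLM_TEAM_DB_DSN only prints a warning at import, and the failure surfaces later inside `psycopg2.connect`. If a hard stop is preferred, a stricter variant could raise up front. This is a minimal sketch; `load_db_dsn` and its `env` parameter are hypothetical and not part of this commit.

```python
import os


def load_db_dsn(env=None):
    """Return the DSN from LLM_TEAM_DB_DSN, raising if it is missing or blank."""
    # `env` is injectable so the check can be tested without touching os.environ.
    env = os.environ if env is None else env
    dsn = env.get("LLM_TEAM_DB_DSN", "").strip()
    if not dsn:
        raise RuntimeError(
            "LLM_TEAM_DB_DSN is not set; refusing to continue without a DB DSN"
        )
    return dsn
```

Whether to raise at import or keep the current warn-and-continue behaviour is a deployment choice: raising blocks routes that never touch the database, while warning keeps them up.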