From 814197cfd36d7f0833132639862d996ca5b35bf2 Mon Sep 17 00:00:00 2001 From: root Date: Thu, 30 Apr 2026 17:51:14 -0500 Subject: [PATCH] ADR-006: auth posture for non-loopback deploy + token rotation impl MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ADR-003 locked the auth substrate; ADR-006 ratifies the operator playbook + adds two implementation pieces needed for Sprint 4 deployment: env-resolved tokens and dual-token rotation. Six decisions locked in docs/DECISIONS.md: - 6.1: Non-loopback bind requires auth.token (mechanical gate at shared.Run, already implemented; this ratifies it). - 6.2: Token from env, not TOML. /etc/lakehouse/auth.env (mode 0600) loaded by systemd EnvironmentFile=. New TokenEnv field on AuthConfig defaults to "AUTH_TOKEN". - 6.3: AllowedIPs for inter-service same-trust-domain; Token for cross-trust-boundary (gateway ↔ external). - 6.4: /health stays unauthenticated; everything else under shared.Run is gated. Already implemented; ratified here. - 6.5: Token rotation is dual-token. New SecondaryTokens []string on AuthConfig — both primary and any secondary pass auth during the rotation window. Implemented in this commit. - 6.6: TLS terminates at the network edge (nginx/Caddy), not in-process. Daemons stay HTTP-only; internal traffic stays on private subnets per Decision 6.3. Implementation: - internal/shared/config.go: AuthConfig gains TokenEnv + SecondaryTokens fields. New resolveAuthFromEnv() called by LoadConfig fills Token from os.Getenv(TokenEnv) when Token is empty. TokenEnv defaults to "AUTH_TOKEN" so the happy path needs no TOML config. - internal/shared/auth.go: RequireAuth pre-encodes Bearer headers for primary + every secondary token; per-request constant-time compare walks the slice. Fast path is 1 compare (primary). Tests: - TestLoadConfig_AuthTokenFromEnv (3 sub-tests): default env name, custom token_env, explicit Token wins over env. 
- TestRequireAuth_SecondaryTokenAccepted: both primary + secondary tokens pass during rotation window. - TestRequireAuth_SecondaryTokensOnly: only-secondary path works for the case where primary was just promoted-to-empty mid-rotation. go test ./internal/shared all green; existing auth_test.go unchanged (constant-time compare path preserved). Co-Authored-By: Claude Opus 4.7 (1M context) --- STATE_OF_PLAY.md | 2 +- docs/DECISIONS.md | 156 +++++++++++++++++++++++++++++++++ internal/shared/auth.go | 31 +++++-- internal/shared/auth_test.go | 43 +++++++++ internal/shared/config.go | 31 +++++++ internal/shared/config_test.go | 65 ++++++++++++++ 6 files changed, 319 insertions(+), 9 deletions(-) diff --git a/STATE_OF_PLAY.md b/STATE_OF_PLAY.md index 62efffa..d024efd 100644 --- a/STATE_OF_PLAY.md +++ b/STATE_OF_PLAY.md @@ -201,6 +201,7 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition - **Shape B is the playbook stance now.** When `use_playbook=true`, both `ApplyPlaybookBoost` (re-rank in place) AND `InjectPlaybookMisses` (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries. - **Boost / inject use SEPARATE thresholds.** Boost stays at `DefaultPlaybookMaxDistance = 0.5` (safe — only re-ranks results already in retrieval). Inject uses tighter `DefaultPlaybookMaxInjectDistance = 0.20` (Shape B forces results in, so loose match cross-pollinates wrong-domain). Don't merge them. - **Multi-coord product theory is empirically VALIDATED** by run #011 (Phase 3). Per-coordinator playbook namespaces (`playbook_alice` etc.) with cross-coordinator handover (Bob takes Alice's contract using `playbook_corpus=playbook_alice`) work end-to-end including paraphrase handover. Don't propose to "test if multi-coord works" — it does. +- **Auth posture is locked per ADR-006.** Non-loopback bind requires `auth.token` (mechanical gate at `shared.Run`). 
Operators set the token via `token_env` (defaults to `AUTH_TOKEN`) loaded by systemd `EnvironmentFile=/etc/lakehouse/auth.env`, NOT in the committed TOML. Internal services use `AllowedIPs`; external boundary uses Bearer. Token rotation is dual-token via `secondary_tokens`. TLS terminates at the edge (nginx/Caddy), not in-process. Don't re-litigate. - **Fresh content uses two-tier indexing.** Fresh resumes go to `fresh_workers` corpus, not the main `workers` index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern. - **`embedd.default_model = "nomic-embed-text-v2-moe"`** (475M MoE, 768d). Don't bump to `nomic-embed-text` (137M) "for speed" — diversity scores fell from 0.000 → 0.080 same-role-across-contracts on the smaller model. Cost (~5× slower ingest) is acceptable for once-per-deploy work. - **Inbox flow: parse + search + judge + trace.** `/v1/observer/inbox` records the body; coordinator/driver parses via LLM (qwen2.5 format=json), runs `matrix.search` on the parsed query, judges top-1 against the ORIGINAL body, emits Langfuse spans through it all. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal. @@ -222,7 +223,6 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition | **Adjacent-query cross-pollination (lift suite Q6↔Q7)** | After lift v4's split threshold, OOD cross-pollination is gone but Q6 / Q7 still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Multi-coord run #008 inbox-judge re-rating proved the judge can distinguish — gating injection on "judge approves before injecting" closes this. ~1 hr. | When playbook injection quality matters more than retrieval throughput. 
| | **Liberal-paraphrase recovery loss (lift suite Q9, Q15)** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt or a per-pair `paraphrase_max_drift` measurement. | When real coordinator queries are available for a calibration run. | | **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. | -| **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. | | **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. | | **Distillation full port** | `57d0df1` shipped scorer + contamination firewall (E partial); SFT export pipeline + audit_baselines lineage not yet ported. | When distillation is needed for production. | | **Drift full quantification** | `be65f85` is "scorer drift first." Full distribution-drift signal underspecified everywhere — research gap, not a port. | Open research item. | diff --git a/docs/DECISIONS.md b/docs/DECISIONS.md index 4e237dd..484d330 100644 --- a/docs/DECISIONS.md +++ b/docs/DECISIONS.md @@ -500,3 +500,159 @@ invariants before the next consumer (scrum runner, distillation rebuild, or a real production workflow) takes a hard behavioral dependency. 
--- + +## ADR-006: Auth posture for non-loopback deploy + +**Date:** 2026-04-30 +**Status:** RATIFIED +**Scope:** `internal/shared/auth.go` + `internal/shared/bind.go` + every `cmd//main.go`'s `shared.Run` call site + +ADR-003 locked the substrate (Bearer token + IP allowlist, opt-in via +`cfg.Auth.Token`/`cfg.Auth.AllowedIPs`, `/health` exempt). ADR-006 +ratifies the **operator playbook + deploy-time invariants** — what +gets enforced when, what operators set where, what happens when keys +rotate. Required because Sprint 4 deployment work (REPLICATION.md, +systemd units, Dockerfile) needs a locked auth posture before it +touches production-shaped configs. + +### Decision 6.1 — Non-loopback bind requires `auth.token`; the gate is mechanical + +Already implemented in `requireAuthOnNonLoopback` (`internal/shared/bind.go:58-67`). +Locked: + +- Any binary that binds anything other than `127.0.0.0/8` / `::1` / + `localhost` MUST have `cfg.Auth.Token != ""`. Empty-token + + non-loopback-bind = startup error, not silent insecure mode. +- The check fires in `shared.Run` BEFORE `http.Server.Serve`, so a + misconfigured binary fails fast at startup rather than serving + one request. +- Pairs with `requireLoopbackOrOverride`: that gate refuses any + non-loopback bind without `LH__ALLOW_NONLOOPBACK=1`. Together + they make the audit's R-001+R-007 worst case (queryd `/sql` = + RCE-equivalent off-loopback with no auth) mechanically impossible. + +**Why mechanical, not policy:** policy gates rely on operator +discipline. The substrate gates work even when an operator copies a +dev `lakehouse.toml` into prod and forgets to set the token — +binary refuses to start, error message names the env override. + +### Decision 6.2 — Token comes from `cfg.Auth.Token` populated by env or secrets file + +Locked: + +- Operators do NOT put the production token in `lakehouse.toml` + directly. 
The TOML field is empty in the committed file; the + daemon's systemd unit sets `AUTH_TOKEN` (or whatever + `cfg.Auth.TokenEnv` names) via `EnvironmentFile=` pointing at + `/etc/lakehouse/auth.env` (mode 0600, root-owned). +- Same pattern as `chatd`'s provider keys (`OPENROUTER_API_KEY` etc.): + TOML names the env var, systemd loads the env file. +- Justification: keeps secrets out of git + out of the running + process's command line + audit-able via filesystem ACLs. + +### Decision 6.3 — `AllowedIPs` is the inter-service gate; `Token` is the cross-trust-boundary gate + +Locked: + +- Same-box deploys (10 daemons all on one host, all on `127.0.0.0/8` + or a private subnet) use **`AllowedIPs` only**. Each daemon's + `cfg.Auth.AllowedIPs` lists the gateway's address (and any other + daemon that legitimately calls it). No token shared between + internal services. +- Gateway-to-external traffic (a coordinator UI in another VPC, + a user's browser, an external integrator) goes through + **Bearer token**. The token is per-tenant; rotation is per-tenant. +- Mixed: a service can require BOTH (allowlist AND token) — the + middleware logic is `allowed = ip_allowed && token_valid` when + both are set. Use this for the gateway when binding non-loopback. + +**Why split:** token rotation is operationally expensive (every +caller updates a secret). IP allowlist rotation is free if the +network topology is stable. Splitting them by trust boundary lets +internal services treat allowlist drift as a network change while +external callers handle token rotation as a credential change. + +### Decision 6.4 — `/health` is unauthenticated; everything else under `shared.Run` is gated + +Already implemented (`internal/shared/server.go:84-92`). Locked: + +- Load balancers + monitor probes hit `/health` without a token. + The route returns `{"status":"ok","service":""}` and nothing + about service state — no version, no commit, no internal counts. 
+- Every other route registered via `shared.Run`'s `register` + callback lives inside the auth-gated chi.Group. New routes + inherit auth automatically; new daemons inherit it via `shared.Run`. +- A daemon that needs a public route MUST add it to the outer router + before the `register` group, with a code comment explaining the + exemption. No route other than `/health` is exempt today. + +### Decision 6.5 — Token rotation is operator-staged; old + new accepted during the window + +Dual-token acceptance is implemented in this commit; SIGHUP reload of +`cfg.Auth` is a Sprint 4 follow-up. Locked: + +- Operators stage a rotation by adding a second token to + `cfg.Auth.SecondaryTokens []string`. Both primary and secondary + pass auth during the window. +- After every caller is updated to the new token, operators + promote secondary → primary and clear secondary. A second + rotation can begin. +- Once daemons reload `cfg.Auth` on SIGHUP (a Sprint 4 follow-up), + no rolling restart will be required. Today they re-read config on + restart only. + +**Why dual-token instead of just single-rotation:** the caller pool can be +large (gateway + observerd + scrum runner + UI + external integrators). +A single-token rotation forces a flag-day. Dual-token windows let +operators rotate gradually and abort on failure. + +### Decision 6.6 — TLS is the network operator's job, not ours + +Locked: + +- Daemons speak HTTP, not HTTPS. TLS termination happens at the + network edge (nginx / Caddy / cloud LB), not in the Go process. +- Internal daemon-to-daemon traffic stays on plaintext HTTP because + it's all on `127.0.0.0/8` or a private subnet (per Decision 6.3). +- Justification: TLS in-process means cert management, rotation, + reload — all undifferentiated lift that nginx already solves + better. The Bearer token + allowlist gates are sufficient when + combined with a TLS-terminating reverse proxy. + +### Alternatives considered + +- **mTLS for inter-service auth** — every daemon issues + verifies + certs. Solves token-rotation pain but adds cert lifecycle as a + problem. 
Allowlist + plaintext on the private network is cheaper + and gets the same threat-model coverage. +- **JWT-only** — JWTs let callers carry richer claims (tenant id, + expiry, scopes). Overkill for the current threat model; the + Bearer token + allowlist split is honest about what each layer + actually defends against. Revisit when multi-tenant gateway + features land. +- **No auth, network is the boundary** — works for G0 dev and the + current single-box deployment. ADR-006 explicitly does NOT + recommend this for non-loopback prod (the mechanical gate + refuses it). + +### What this ADR does NOT do + +- **Does not specify how the gateway authenticates external callers.** + Token-vs-mTLS-vs-OAuth at the public edge is a separate decision + driven by who-calls-us. ADR-006 is about the inter-service + + same-trust-domain posture. +- **Does not implement token rotation hot-reload.** Decision 6.5 + documents the design; the implementation is Sprint 4 work. +- **Does not lock TLS termination details.** Where + how nginx/Caddy + goes is ops infrastructure, not ADR territory. + +### How this closes the OPEN list + +STATE_OF_PLAY listed ADR-006 as the gate before any Go binary binds +non-loopback in prod. The substrate gates were already present (R-001 ++ R-007 enforced via `requireLoopbackOrOverride` + +`requireAuthOnNonLoopback`); this ADR locks the operator playbook +that turns those gates into a deployable posture. Sprint 4 can now +write systemd units that set `AUTH_TOKEN` from `EnvironmentFile=` +without re-litigating the design. + +--- diff --git a/internal/shared/auth.go b/internal/shared/auth.go index 7989c44..6ba2346 100644 --- a/internal/shared/auth.go +++ b/internal/shared/auth.go @@ -30,7 +30,7 @@ import ( // RequireAuth returns a chi-compatible middleware that enforces // the configured AuthConfig. Empty config returns a pass-through. 
func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler { - tokenSet := cfg.Token != "" + tokenSet := cfg.Token != "" || len(cfg.SecondaryTokens) > 0 if !tokenSet && len(cfg.AllowedIPs) == 0 { // G0 dev mode — no auth wired. return passthrough @@ -59,9 +59,20 @@ func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler { allowedNets = append(allowedNets, n) } - // Pre-encode the wire-format Bearer token so per-request - // comparison is one allocation against a precomputed slice. - expectedHeader := []byte("Bearer " + cfg.Token) + // Pre-encode wire-format Bearer headers for primary + every + // secondary token. Per-request comparison walks the slice with + // constant-time compare on each — fast path is the primary + // (first), so the typical case is one compare. + var expectedHeaders [][]byte + if cfg.Token != "" { + expectedHeaders = append(expectedHeaders, []byte("Bearer "+cfg.Token)) + } + for _, sec := range cfg.SecondaryTokens { + if sec == "" { + continue + } + expectedHeaders = append(expectedHeaders, []byte("Bearer "+sec)) + } return func(next http.Handler) http.Handler { return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { @@ -79,10 +90,14 @@ func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler { if tokenSet { got := []byte(r.Header.Get("Authorization")) - // ConstantTimeCompare returns 0 if lengths differ, - // 1 on match. Anything else (would be 0 or 1) is - // treated as no-match. 
- if subtle.ConstantTimeCompare(got, expectedHeader) != 1 { + matched := false + for _, want := range expectedHeaders { + if subtle.ConstantTimeCompare(got, want) == 1 { + matched = true + break + } + } + if !matched { http.Error(w, "unauthorized", http.StatusUnauthorized) return } diff --git a/internal/shared/auth_test.go b/internal/shared/auth_test.go index c1e4acd..fa25405 100644 --- a/internal/shared/auth_test.go +++ b/internal/shared/auth_test.go @@ -189,6 +189,49 @@ func TestRequireAuth_InvalidCIDR_LoggedAndDropped(t *testing.T) { } } +// TestRequireAuth_SecondaryTokenAccepted locks ADR-006 Decision 6.5: +// during a token rotation, both primary and secondary token strings +// pass auth. After every caller updates, operators promote secondary +// → primary and clear secondary, completing the rotation. +func TestRequireAuth_SecondaryTokenAccepted(t *testing.T) { + cfg := AuthConfig{ + Token: "primary-tok", + SecondaryTokens: []string{"secondary-tok"}, + } + srv := httptest.NewServer(mountWithAuth(cfg)) + defer srv.Close() + + if status, _ := get(t, srv, "/data", "Bearer primary-tok"); status != http.StatusOK { + t.Errorf("primary token should pass, got %d", status) + } + if status, _ := get(t, srv, "/data", "Bearer secondary-tok"); status != http.StatusOK { + t.Errorf("secondary token should pass during rotation, got %d", status) + } + if status, _ := get(t, srv, "/data", "Bearer wrong-tok"); status != http.StatusUnauthorized { + t.Errorf("invalid token should still 401, got %d", status) + } +} + +// TestRequireAuth_SecondaryTokensOnly locks the case where primary +// is empty but secondaries are set. This can happen mid-rotation, +// when the previous primary has been cleared and the new primary +// hasn't been written yet. As long as ANY token is configured, auth +// is enforced. 
+func TestRequireAuth_SecondaryTokensOnly(t *testing.T) { + cfg := AuthConfig{ + SecondaryTokens: []string{"only-tok"}, + } + srv := httptest.NewServer(mountWithAuth(cfg)) + defer srv.Close() + + if status, _ := get(t, srv, "/data", "Bearer only-tok"); status != http.StatusOK { + t.Errorf("only-secondary should pass, got %d", status) + } + if status, _ := get(t, srv, "/data", ""); status != http.StatusUnauthorized { + t.Errorf("missing token should 401, got %d", status) + } +} + func TestRemoteIP_SplitHostPortShape(t *testing.T) { // Sanity: real httptest requests come through with "ip:port" // shape; ensure remoteIP returns the IP portion. diff --git a/internal/shared/config.go b/internal/shared/config.go index f92a2b6..2008272 100644 --- a/internal/shared/config.go +++ b/internal/shared/config.go @@ -298,6 +298,17 @@ func (m ModelsConfig) IsWeak(model string) bool { type AuthConfig struct { Token string `toml:"token"` AllowedIPs []string `toml:"allowed_ips"` + // TokenEnv names an environment variable; LoadConfig populates + // Token from os.Getenv(TokenEnv) when Token is empty. Per ADR-006 + // 6.2: production deploys put the secret in /etc/lakehouse/auth.env + // (mode 0600) loaded by systemd EnvironmentFile=, NOT in the + // committed TOML. TokenEnv defaults to "AUTH_TOKEN". + TokenEnv string `toml:"token_env"` + // SecondaryTokens lets operators stage a rotation: both primary + // and any secondary token pass auth during the rotation window. + // After every caller updates, operators promote secondary → + // primary and clear secondary. Per ADR-006 Decision 6.5. + SecondaryTokens []string `toml:"secondary_tokens"` } // DefaultConfig returns the G0 dev defaults. 
Ports are shifted to @@ -434,5 +445,25 @@ func LoadConfig(path string) (Config, error) { if err := toml.Unmarshal(b, &cfg); err != nil { return cfg, fmt.Errorf("parse config: %w", err) } + resolveAuthFromEnv(&cfg.Auth) return cfg, nil } + +// resolveAuthFromEnv populates cfg.Auth.Token from os.Getenv(TokenEnv) +// when Token is empty. Per ADR-006 Decision 6.2: production deploys +// keep the secret in /etc/lakehouse/auth.env (mode 0600), loaded by +// systemd EnvironmentFile=, never in the committed TOML. +// +// TokenEnv defaults to "AUTH_TOKEN" so operators don't have to +// configure both — setting AUTH_TOKEN env is enough. +func resolveAuthFromEnv(auth *AuthConfig) { + envName := auth.TokenEnv + if envName == "" { + envName = "AUTH_TOKEN" + } + if auth.Token == "" { + if v := os.Getenv(envName); v != "" { + auth.Token = v + } + } +} diff --git a/internal/shared/config_test.go b/internal/shared/config_test.go index 48e7e1e..c38bcd8 100644 --- a/internal/shared/config_test.go +++ b/internal/shared/config_test.go @@ -188,6 +188,71 @@ weak_models = ["custom-judge:latest", "qwen3:latest"] } } +// TestLoadConfig_AuthTokenFromEnv locks ADR-006 Decision 6.2: +// production deploys put the secret in /etc/lakehouse/auth.env (mode +// 0600), loaded by systemd EnvironmentFile=, NEVER in the committed +// TOML. The TOML names the env var via token_env; the loader fills +// Token from os.Getenv. TokenEnv defaults to "AUTH_TOKEN" so the +// happy path needs no TOML config at all. 
+func TestLoadConfig_AuthTokenFromEnv(t *testing.T) { + t.Run("default env name AUTH_TOKEN", func(t *testing.T) { + t.Setenv("AUTH_TOKEN", "from-default-env") + dir := t.TempDir() + path := filepath.Join(dir, "lakehouse.toml") + if err := os.WriteFile(path, []byte(`[auth] +allowed_ips = [] +`), 0o644); err != nil { + t.Fatal(err) + } + cfg, err := LoadConfig(path) + if err != nil { + t.Fatalf("LoadConfig: %v", err) + } + if cfg.Auth.Token != "from-default-env" { + t.Errorf("Token = %q, want from AUTH_TOKEN env", cfg.Auth.Token) + } + }) + + t.Run("custom env name from token_env", func(t *testing.T) { + t.Setenv("CUSTOM_AUTH_TOKEN", "from-custom-env") + dir := t.TempDir() + path := filepath.Join(dir, "lakehouse.toml") + if err := os.WriteFile(path, []byte(`[auth] +token_env = "CUSTOM_AUTH_TOKEN" +`), 0o644); err != nil { + t.Fatal(err) + } + cfg, err := LoadConfig(path) + if err != nil { + t.Fatalf("LoadConfig: %v", err) + } + if cfg.Auth.Token != "from-custom-env" { + t.Errorf("Token = %q, want from CUSTOM_AUTH_TOKEN env", cfg.Auth.Token) + } + }) + + t.Run("explicit token wins over env", func(t *testing.T) { + t.Setenv("AUTH_TOKEN", "from-env") + dir := t.TempDir() + path := filepath.Join(dir, "lakehouse.toml") + if err := os.WriteFile(path, []byte(`[auth] +token = "from-toml" +`), 0o644); err != nil { + t.Fatal(err) + } + cfg, err := LoadConfig(path) + if err != nil { + t.Fatalf("LoadConfig: %v", err) + } + // Explicit Token in TOML wins over env — the loader only + // fills from env when Token is empty. Lets local dev + // override prod env vars. + if cfg.Auth.Token != "from-toml" { + t.Errorf("Token = %q, want explicit TOML value", cfg.Auth.Token) + } + }) +} + func TestLoadConfig_InvalidTOML_ReturnsError(t *testing.T) { dir := t.TempDir() cfgPath := filepath.Join(dir, "bad.toml")
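
Reviewer note: the rotation window from Decision 6.5, combined with the env-resolved primary from Decision 6.2, could be staged in `lakehouse.toml` roughly as sketched below. This is a minimal sketch, not a shipped example file. The `[auth]` table and the keys `token_env`, `secondary_tokens`, and `allowed_ips` come from the `AuthConfig` struct tags in this patch; every value is a placeholder, and the CIDR form for `allowed_ips` is an assumption based on the existing `InvalidCIDR` test name.

```toml
# Sketch of the [auth] table during a rotation window.
# Keys are from this patch; all values are placeholders.

[auth]
# `token` is deliberately absent: LoadConfig fills it from the
# environment variable named here when `token` is empty.
# "AUTH_TOKEN" is also the default, so this line is optional.
token_env = "AUTH_TOKEN"

# Rotation window: any token listed here passes auth alongside the
# primary. Per Decision 6.5 this holds the NEW token until every
# caller has switched; then it is promoted to primary and removed here.
secondary_tokens = ["placeholder-new-token"]

# Inter-service gate (Decision 6.3). When both a token and an
# allowlist are configured, a request must satisfy both.
# Placeholder CIDR; assumed notation, see the note above.
allowed_ips = ["127.0.0.1/32"]
```

Note that the patch resolves only the primary token from the environment; `secondary_tokens` is read from TOML as written. Where the secondary value should live in production so that it stays out of the committed file is not specified in the patch, so treat the inline value above as illustrative only.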