ADR-006: auth posture for non-loopback deploy + token rotation impl

ADR-003 locked the auth substrate; ADR-006 ratifies the operator
playbook + adds two implementation pieces needed for Sprint 4
deployment: env-resolved tokens and dual-token rotation.

Six decisions locked in docs/DECISIONS.md:
- 6.1: Non-loopback bind requires auth.token (mechanical gate at
       shared.Run, already implemented; this ratifies it).
- 6.2: Token from env, not TOML. /etc/lakehouse/auth.env (mode 0600)
       loaded by systemd EnvironmentFile=. New TokenEnv field on
       AuthConfig defaults to "AUTH_TOKEN".
- 6.3: AllowedIPs for inter-service same-trust-domain; Token for
       cross-trust-boundary (gateway ↔ external).
- 6.4: /health stays unauthenticated; everything else under
       shared.Run is gated. Already implemented; ratified here.
- 6.5: Token rotation is dual-token. New SecondaryTokens []string
       on AuthConfig — both primary and any secondary pass auth
       during the rotation window. Implemented in this commit.
- 6.6: TLS terminates at the network edge (nginx/Caddy), not
       in-process. Daemons stay HTTP-only; internal traffic stays
       on private subnets per Decision 6.3.

Implementation:
- internal/shared/config.go: AuthConfig gains TokenEnv +
  SecondaryTokens fields. New resolveAuthFromEnv() called by
  LoadConfig fills Token from os.Getenv(TokenEnv) when Token is
  empty. TokenEnv defaults to "AUTH_TOKEN" so the happy path needs
  no TOML config.
- internal/shared/auth.go: RequireAuth pre-encodes Bearer headers
  for primary + every secondary token; per-request constant-time
  compare walks the slice. Fast path is 1 compare (primary).

Tests:
- TestLoadConfig_AuthTokenFromEnv (3 sub-tests): default env name,
  custom token_env, explicit Token wins over env.
- TestRequireAuth_SecondaryTokenAccepted: both primary + secondary
  tokens pass during rotation window.
- TestRequireAuth_SecondaryTokensOnly: auth is still enforced when
  only secondary tokens are set (primary cleared mid-rotation, new
  primary not yet written).

go test ./internal/shared all green; pre-existing tests in
auth_test.go unchanged (constant-time compare path preserved).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
root 2026-04-30 17:51:14 -05:00
parent 6c93a38093
commit 814197cfd3
6 changed files with 319 additions and 9 deletions


@@ -201,6 +201,7 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
- **Shape B is the playbook stance now.** When `use_playbook=true`, both `ApplyPlaybookBoost` (re-rank in place) AND `InjectPlaybookMisses` (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries.
- **Boost / inject use SEPARATE thresholds.** Boost stays at `DefaultPlaybookMaxDistance = 0.5` (safe — only re-ranks results already in retrieval). Inject uses tighter `DefaultPlaybookMaxInjectDistance = 0.20` (Shape B forces results in, so loose match cross-pollinates wrong-domain). Don't merge them.
- **Multi-coord product theory is empirically VALIDATED** by run #011 (Phase 3). Per-coordinator playbook namespaces (`playbook_alice` etc.) with cross-coordinator handover (Bob takes Alice's contract using `playbook_corpus=playbook_alice`) work end-to-end including paraphrase handover. Don't propose to "test if multi-coord works" — it does.
+- **Auth posture is locked per ADR-006.** Non-loopback bind requires `auth.token` (mechanical gate at `shared.Run`). Operators set the token via `token_env` (defaults to `AUTH_TOKEN`) loaded by systemd `EnvironmentFile=/etc/lakehouse/auth.env`, NOT in the committed TOML. Internal services use `AllowedIPs`; external boundary uses Bearer. Token rotation is dual-token via `secondary_tokens`. TLS terminates at the edge (nginx/Caddy), not in-process. Don't re-litigate.
- **Fresh content uses two-tier indexing.** Fresh resumes go to `fresh_workers` corpus, not the main `workers` index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern.
- **`embedd.default_model = "nomic-embed-text-v2-moe"`** (475M MoE, 768d). Don't bump to `nomic-embed-text` (137M) "for speed" — diversity scores fell from 0.000 → 0.080 same-role-across-contracts on the smaller model. Cost (~5× slower ingest) is acceptable for once-per-deploy work.
- **Inbox flow: parse + search + judge + trace.** `/v1/observer/inbox` records the body; coordinator/driver parses via LLM (qwen2.5 format=json), runs `matrix.search` on the parsed query, judges top-1 against the ORIGINAL body, emits Langfuse spans through it all. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal.
@@ -222,7 +223,6 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
| **Adjacent-query cross-pollination (lift suite Q6↔Q7)** | After lift v4's split threshold, OOD cross-pollination is gone but Q6 / Q7 still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Multi-coord run #008 inbox-judge re-rating proved the judge can distinguish — gating injection on "judge approves before injecting" closes this. ~1 hr. | When playbook injection quality matters more than retrieval throughput. |
| **Liberal-paraphrase recovery loss (lift suite Q9, Q15)** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt or a per-pair `paraphrase_max_drift` measurement. | When real coordinator queries are available for a calibration run. |
| **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
-| **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
| **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
| **Distillation full port** | `57d0df1` shipped scorer + contamination firewall (E partial); SFT export pipeline + audit_baselines lineage not yet ported. | When distillation is needed for production. |
| **Drift full quantification** | `be65f85` is "scorer drift first." Full distribution-drift signal underspecified everywhere — research gap, not a port. | Open research item. |


@@ -500,3 +500,159 @@ invariants before the next consumer (scrum runner, distillation rebuild,
or a real production workflow) takes a hard behavioral dependency.
---
## ADR-006: Auth posture for non-loopback deploy
**Date:** 2026-04-30
**Status:** RATIFIED
**Scope:** `internal/shared/auth.go` + `internal/shared/bind.go` + every `cmd/<bin>/main.go`'s `shared.Run` call site
ADR-003 locked the substrate (Bearer token + IP allowlist, opt-in via
`cfg.Auth.Token`/`cfg.Auth.AllowedIPs`, `/health` exempt). ADR-006
ratifies the **operator playbook + deploy-time invariants** — what
gets enforced when, what operators set where, what happens when keys
rotate. Required because Sprint 4 deployment work (REPLICATION.md,
systemd units, Dockerfile) needs a locked auth posture before it
touches production-shaped configs.
### Decision 6.1 — Non-loopback bind requires `auth.token`; the gate is mechanical
Already implemented in `requireAuthOnNonLoopback` (`internal/shared/bind.go:58-67`).
Locked:
- Any binary that binds anything other than `127.0.0.0/8` / `::1` /
`localhost` MUST have `cfg.Auth.Token != ""`. Empty-token +
non-loopback-bind = startup error, not silent insecure mode.
- The check fires in `shared.Run` BEFORE `http.Server.Serve`, so a
misconfigured binary fails fast at startup rather than serving
one request.
- Pairs with `requireLoopbackOrOverride`: that gate refuses any
non-loopback bind without `LH_<NAME>_ALLOW_NONLOOPBACK=1`. Together
they make the audit's R-001+R-007 worst case (queryd `/sql` =
RCE-equivalent off-loopback with no auth) mechanically impossible.
**Why mechanical, not policy:** policy gates rely on operator
discipline. The substrate gates work even when an operator copies a
dev `lakehouse.toml` into prod and forgets to set the token —
binary refuses to start, error message names the env override.
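
The two startup gates described above can be sketched as follows. This is a minimal illustration, not the code in `internal/shared/bind.go`: the names `isLoopbackAddr` and `checkBind` are hypothetical, and the override is passed as a plain boolean rather than read from `LH_<NAME>_ALLOW_NONLOOPBACK`.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isLoopbackAddr reports whether a host:port bind address targets
// loopback (127.0.0.0/8, ::1, or the literal name "localhost").
func isLoopbackAddr(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false // unparseable input is treated as non-loopback
	}
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// checkBind is a HYPOTHETICAL combination of the two gates: a
// non-loopback bind needs both the explicit override and a non-empty
// token. It is not the repository's requireLoopbackOrOverride or
// requireAuthOnNonLoopback.
func checkBind(addr, token string, allowNonLoopback bool) error {
	if isLoopbackAddr(addr) {
		return nil
	}
	if !allowNonLoopback {
		return errors.New("non-loopback bind refused: override not set")
	}
	if token == "" {
		return errors.New("non-loopback bind refused: auth token is empty")
	}
	return nil
}

func main() {
	// Run the check before the listener starts so a bad config fails fast.
	fmt.Println(checkBind("127.0.0.1:8080", "", false))
	fmt.Println(checkBind("0.0.0.0:8080", "", true))
}
```

Running the check before the server starts serving is what turns a forgotten token into a startup error rather than an open port.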
### Decision 6.2 — Token comes from `cfg.Auth.Token` populated by env or secrets file
Locked:
- Operators do NOT put the production token in `lakehouse.toml`
directly. The TOML field is empty in the committed file; the
daemon's systemd unit sets `AUTH_TOKEN` (or whatever
`cfg.Auth.TokenEnv` names) via `EnvironmentFile=` pointing at
`/etc/lakehouse/auth.env` (mode 0600, root-owned).
- Same pattern as `chatd`'s provider keys (`OPENROUTER_API_KEY` etc.):
TOML names the env var, systemd loads the env file.
- Justification: keeps secrets out of git + out of the running
process's command line + audit-able via filesystem ACLs.
### Decision 6.3 — `AllowedIPs` is the inter-service gate; `Token` is the cross-trust-boundary gate
Locked:
- Same-box deploys (10 daemons all on one host, all on `127.0.0.0/8`
or a private subnet) use **`AllowedIPs` only**. Each daemon's
`cfg.Auth.AllowedIPs` lists the gateway's address (and any other
daemon that legitimately calls it). No token shared between
internal services.
- Gateway-to-external traffic (a coordinator UI in another VPC,
a user's browser, an external integrator) goes through
**Bearer token**. The token is per-tenant; rotation is per-tenant.
- Mixed: a service can require BOTH (allowlist AND token) — the
middleware logic is `allowed = ip_allowed && token_valid` when
both are set. Use this for the gateway when binding non-loopback.
**Why split:** token rotation is operationally expensive (every
caller updates a secret). IP allowlist rotation is free if the
network topology is stable. Splitting them by trust boundary lets
internal services treat allowlist drift as a network change while
external callers handle token rotation as a credential change.
### Decision 6.4 — `/health` is unauthenticated; everything else under `shared.Run` is gated
Already implemented (`internal/shared/server.go:84-92`). Locked:
- Load balancers + monitor probes hit `/health` without a token.
The route returns `{"status":"ok","service":"<name>"}` and nothing
about service state — no version, no commit, no internal counts.
- Every other route registered via `shared.Run`'s `register`
callback lives inside the auth-gated chi.Group. New routes
inherit auth automatically; new daemons inherit it via `shared.Run`.
- A daemon that needs a public route MUST add it to the outer router
before the `register` group, with a code comment explaining the
exemption. `/health` is the only exempt route today.
### Decision 6.5 — Token rotation is operator-staged; old + new accepted during the window
Dual-token acceptance is implemented in this commit (`SecondaryTokens`
on `AuthConfig`); SIGHUP reload remains a Sprint 4 follow-up. Locked:
- Operators stage a rotation by adding a second token to
`cfg.Auth.SecondaryTokens []string`. Both primary and secondary
pass auth during the window.
- After every caller is updated to the new token, operators
promote secondary → primary and clear secondary. A second
rotation can begin.
- Once SIGHUP reload of `cfg.Auth` lands (a Sprint 4 follow-up), no
rolling restart is required. Until then daemons re-read the config
on restart only.
**Why dual-token instead of just single-rotation:** caller pool can be
large (gateway + observerd + scrum runner + UI + external integrators).
A single-token rotation forces a flag-day. Dual-token windows let
operators rotate gradually and abort on failure.
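
The staging sequence above (stage, wait for callers, promote) can be sketched as plain config transformations. The `AuthConfig` type here is a trimmed local copy of the two relevant fields, and `stage`, `promote`, and `accepts` are hypothetical helpers for illustration; operators would edit the config by hand rather than call code like this.

```go
package main

import (
	"errors"
	"fmt"
)

// AuthConfig is a trimmed local copy of the two fields this sketch needs.
type AuthConfig struct {
	Token           string
	SecondaryTokens []string
}

// accepts reports whether tok matches the primary or any secondary.
// Plain == keeps the sketch short; request-path code should use a
// constant-time compare.
func accepts(cfg AuthConfig, tok string) bool {
	if tok == "" {
		return false
	}
	if tok == cfg.Token {
		return true
	}
	for _, s := range cfg.SecondaryTokens {
		if tok == s {
			return true
		}
	}
	return false
}

// stage returns a copy of cfg with next added as a secondary token.
func stage(cfg AuthConfig, next string) AuthConfig {
	out := cfg
	out.SecondaryTokens = append(append([]string(nil), cfg.SecondaryTokens...), next)
	return out
}

// promote makes the first secondary the new primary and clears the
// secondary list, ending the rotation window.
func promote(cfg AuthConfig) (AuthConfig, error) {
	if len(cfg.SecondaryTokens) == 0 {
		return cfg, errors.New("no secondary token staged")
	}
	return AuthConfig{Token: cfg.SecondaryTokens[0]}, nil
}

func main() {
	cfg := stage(AuthConfig{Token: "old"}, "new")
	fmt.Println(accepts(cfg, "old"), accepts(cfg, "new")) // window open

	cfg, err := promote(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(accepts(cfg, "old"), accepts(cfg, "new")) // window closed
}
```

Aborting a rotation is the reverse of `stage`: drop the secondary and the old primary keeps working, which is what avoids a flag-day.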
### Decision 6.6 — TLS is the network operator's job, not ours
Locked:
- Daemons speak HTTP, not HTTPS. TLS termination happens at the
network edge (nginx / Caddy / cloud LB), not in the Go process.
- Internal daemon-to-daemon traffic stays on plaintext HTTP because
it's all on `127.0.0.0/8` or a private subnet (per Decision 6.3).
- Justification: TLS in-process means cert management, rotation,
reload — all undifferentiated lift that nginx already solves
better. The Bearer token + allowlist gates are sufficient when
combined with a TLS-terminating reverse proxy.
### Alternatives considered
- **mTLS for inter-service auth** — every daemon issues + verifies
certs. Solves token-rotation pain but adds cert lifecycle as a
problem. Allowlist + plaintext on the private network is cheaper
and gets the same threat-model coverage.
- **JWT-only** — JWTs let callers carry richer claims (tenant id,
expiry, scopes). Overkill for the current threat model; the
Bearer token + allowlist split is honest about what each layer
actually defends against. Revisit when multi-tenant gateway
features land.
- **No auth, network is the boundary** — works for G0 dev and the
current single-box deployment. ADR-006 explicitly does NOT
recommend this for non-loopback prod (the mechanical gate
refuses it).
### What this ADR does NOT do
- **Does not specify how the gateway authenticates external callers.**
Token-vs-mTLS-vs-OAuth at the public edge is a separate decision
driven by who-calls-us. ADR-006 is about the inter-service +
same-trust-domain posture.
- **Does not implement token rotation hot-reload.** Decision 6.5
documents the design; the implementation is Sprint 4 work.
- **Does not lock TLS termination details.** Where + how nginx/Caddy
goes is ops infrastructure, not ADR territory.
### How this closes the OPEN list
STATE_OF_PLAY listed ADR-006 as the gate before any Go binary binds
non-loopback in prod. The substrate gates were already present (R-001
+ R-007 enforced via `requireLoopbackOrOverride` +
`requireAuthOnNonLoopback`); this ADR locks the operator playbook
that turns those gates into a deployable posture. Sprint 4 can now
write systemd units that set `AUTH_TOKEN` from `EnvironmentFile=`
without re-litigating the design.
---


@@ -30,7 +30,7 @@ import (
// RequireAuth returns a chi-compatible middleware that enforces
// the configured AuthConfig. Empty config returns a pass-through.
func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler {
-tokenSet := cfg.Token != ""
+tokenSet := cfg.Token != "" || len(cfg.SecondaryTokens) > 0
if !tokenSet && len(cfg.AllowedIPs) == 0 {
// G0 dev mode — no auth wired.
return passthrough
@@ -59,9 +59,20 @@ func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler {
allowedNets = append(allowedNets, n)
}
-// Pre-encode the wire-format Bearer token so per-request
-// comparison is one allocation against a precomputed slice.
-expectedHeader := []byte("Bearer " + cfg.Token)
+// Pre-encode wire-format Bearer headers for primary + every
+// secondary token. Per-request comparison walks the slice with
+// constant-time compare on each — fast path is the primary
+// (first), so the typical case is one compare.
+var expectedHeaders [][]byte
+if cfg.Token != "" {
+expectedHeaders = append(expectedHeaders, []byte("Bearer "+cfg.Token))
+}
+for _, sec := range cfg.SecondaryTokens {
+if sec == "" {
+continue
+}
+expectedHeaders = append(expectedHeaders, []byte("Bearer "+sec))
+}
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@@ -79,10 +90,14 @@ func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler {
if tokenSet {
got := []byte(r.Header.Get("Authorization"))
-// ConstantTimeCompare returns 0 if lengths differ,
-// 1 on match. Anything else (would be 0 or 1) is
-// treated as no-match.
-if subtle.ConstantTimeCompare(got, expectedHeader) != 1 {
+matched := false
+for _, want := range expectedHeaders {
+if subtle.ConstantTimeCompare(got, want) == 1 {
+matched = true
+break
+}
+}
+if !matched {
http.Error(w, "unauthorized", http.StatusUnauthorized)
return
}


@@ -189,6 +189,49 @@ func TestRequireAuth_InvalidCIDR_LoggedAndDropped(t *testing.T) {
}
}
// TestRequireAuth_SecondaryTokenAccepted locks ADR-006 Decision 6.5:
// during a token rotation, both primary and secondary token strings
// pass auth. After every caller updates, operators promote secondary
// → primary and clear secondary, completing the rotation.
func TestRequireAuth_SecondaryTokenAccepted(t *testing.T) {
cfg := AuthConfig{
Token: "primary-tok",
SecondaryTokens: []string{"secondary-tok"},
}
srv := httptest.NewServer(mountWithAuth(cfg))
defer srv.Close()
if status, _ := get(t, srv, "/data", "Bearer primary-tok"); status != http.StatusOK {
t.Errorf("primary token should pass, got %d", status)
}
if status, _ := get(t, srv, "/data", "Bearer secondary-tok"); status != http.StatusOK {
t.Errorf("secondary token should pass during rotation, got %d", status)
}
if status, _ := get(t, srv, "/data", "Bearer wrong-tok"); status != http.StatusUnauthorized {
t.Errorf("invalid token should still 401, got %d", status)
}
}
// TestRequireAuth_SecondaryTokensOnly locks the case where primary
// is empty but secondaries are set, e.g. mid-rotation after the old
// primary has been cleared and before the new primary is written.
// As long as ANY token is configured, auth is enforced.
func TestRequireAuth_SecondaryTokensOnly(t *testing.T) {
cfg := AuthConfig{
SecondaryTokens: []string{"only-tok"},
}
srv := httptest.NewServer(mountWithAuth(cfg))
defer srv.Close()
if status, _ := get(t, srv, "/data", "Bearer only-tok"); status != http.StatusOK {
t.Errorf("only-secondary should pass, got %d", status)
}
if status, _ := get(t, srv, "/data", ""); status != http.StatusUnauthorized {
t.Errorf("missing token should 401, got %d", status)
}
}
func TestRemoteIP_SplitHostPortShape(t *testing.T) {
// Sanity: real httptest requests come through with "ip:port"
// shape; ensure remoteIP returns the IP portion.


@@ -298,6 +298,17 @@ func (m ModelsConfig) IsWeak(model string) bool {
type AuthConfig struct {
Token string `toml:"token"`
AllowedIPs []string `toml:"allowed_ips"`
// TokenEnv names an environment variable; LoadConfig populates
// Token from os.Getenv(TokenEnv) when Token is empty. Per ADR-006
// 6.2: production deploys put the secret in /etc/lakehouse/auth.env
// (mode 0600) loaded by systemd EnvironmentFile=, NOT in the
// committed TOML. TokenEnv defaults to "AUTH_TOKEN".
TokenEnv string `toml:"token_env"`
// SecondaryTokens lets operators stage a rotation: both primary
// and any secondary token pass auth during the rotation window.
// After every caller updates, operators promote secondary →
// primary and clear secondary. Per ADR-006 Decision 6.5.
SecondaryTokens []string `toml:"secondary_tokens"`
}
// DefaultConfig returns the G0 dev defaults. Ports are shifted to
@@ -434,5 +445,25 @@ func LoadConfig(path string) (Config, error) {
if err := toml.Unmarshal(b, &cfg); err != nil {
return cfg, fmt.Errorf("parse config: %w", err)
}
resolveAuthFromEnv(&cfg.Auth)
return cfg, nil
}
// resolveAuthFromEnv populates cfg.Auth.Token from os.Getenv(TokenEnv)
// when Token is empty. Per ADR-006 Decision 6.2: production deploys
// keep the secret in /etc/lakehouse/auth.env (mode 0600), loaded by
// systemd EnvironmentFile=, never in the committed TOML.
//
// TokenEnv defaults to "AUTH_TOKEN" so operators don't have to
// configure both — setting AUTH_TOKEN env is enough.
func resolveAuthFromEnv(auth *AuthConfig) {
envName := auth.TokenEnv
if envName == "" {
envName = "AUTH_TOKEN"
}
if auth.Token == "" {
if v := os.Getenv(envName); v != "" {
auth.Token = v
}
}
}


@@ -188,6 +188,71 @@ weak_models = ["custom-judge:latest", "qwen3:latest"]
}
}
// TestLoadConfig_AuthTokenFromEnv locks ADR-006 Decision 6.2:
// production deploys put the secret in /etc/lakehouse/auth.env (mode
// 0600), loaded by systemd EnvironmentFile=, NEVER in the committed
// TOML. The TOML names the env var via token_env; the loader fills
// Token from os.Getenv. TokenEnv defaults to "AUTH_TOKEN" so the
// happy path needs no TOML config at all.
func TestLoadConfig_AuthTokenFromEnv(t *testing.T) {
t.Run("default env name AUTH_TOKEN", func(t *testing.T) {
t.Setenv("AUTH_TOKEN", "from-default-env")
dir := t.TempDir()
path := filepath.Join(dir, "lakehouse.toml")
if err := os.WriteFile(path, []byte(`[auth]
allowed_ips = []
`), 0o644); err != nil {
t.Fatal(err)
}
cfg, err := LoadConfig(path)
if err != nil {
t.Fatalf("LoadConfig: %v", err)
}
if cfg.Auth.Token != "from-default-env" {
t.Errorf("Token = %q, want from AUTH_TOKEN env", cfg.Auth.Token)
}
})
t.Run("custom env name from token_env", func(t *testing.T) {
t.Setenv("CUSTOM_AUTH_TOKEN", "from-custom-env")
dir := t.TempDir()
path := filepath.Join(dir, "lakehouse.toml")
if err := os.WriteFile(path, []byte(`[auth]
token_env = "CUSTOM_AUTH_TOKEN"
`), 0o644); err != nil {
t.Fatal(err)
}
cfg, err := LoadConfig(path)
if err != nil {
t.Fatalf("LoadConfig: %v", err)
}
if cfg.Auth.Token != "from-custom-env" {
t.Errorf("Token = %q, want from CUSTOM_AUTH_TOKEN env", cfg.Auth.Token)
}
})
t.Run("explicit token wins over env", func(t *testing.T) {
t.Setenv("AUTH_TOKEN", "from-env")
dir := t.TempDir()
path := filepath.Join(dir, "lakehouse.toml")
if err := os.WriteFile(path, []byte(`[auth]
token = "from-toml"
`), 0o644); err != nil {
t.Fatal(err)
}
cfg, err := LoadConfig(path)
if err != nil {
t.Fatalf("LoadConfig: %v", err)
}
// Explicit Token in TOML wins over env — the loader only
// fills from env when Token is empty. Lets local dev
// override prod env vars.
if cfg.Auth.Token != "from-toml" {
t.Errorf("Token = %q, want explicit TOML value", cfg.Auth.Token)
}
})
}
func TestLoadConfig_InvalidTOML_ReturnsError(t *testing.T) {
dir := t.TempDir()
cfgPath := filepath.Join(dir, "bad.toml")