ADR-003: inter-service auth posture — Bearer + IP allowlist

Locks in the auth model that R-001 + R-007 will be retrofitted
against. Doc-only — wiring deferred to Sprint 1 when the first
non-loopback binding is needed.

Decision: Bearer token (from secrets-go.toml [auth] section) + IP
allowlist (CIDR list). Both layers required when auth is on; empty
token = G0 dev no-op. /health exempt.

Implementation shape (when it lands):
  - internal/shared/auth.go middleware: one chi r.Use line per binary
  - shared.Run gates: refuses non-loopback bind without configured token
  - subtle.ConstantTimeCompare for token equality (timing-safe)

Alternatives considered + rejected:
  mTLS         — too heavy for single-machine inter-service traffic
  JWT          — buys nothing over Bearer without external IdP
  IP-only      — one stolen IP entry = full access; no defense in depth
  OAuth2       — no external IdP commitment in G0-G3 timeline

What this doesn't do:
  - Doesn't implement (code lands Sprint 1)
  - Doesn't break G0 dev (empty token = middleware no-op)
  - Doesn't address gateway→end-user auth (different ADR shape)

Closes the design-decision blocker for R-001 and R-007. Wiring
ticket: Sprint 1 backlog story S1.2.

Also lifts ADR-002 (storaged per-prefix PUT cap) into the doc —
it was implemented in 423a381 but not yet recorded as an ADR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
root 2026-04-29 06:05:59 -05:00
parent 423a3817c5
commit 0d18ffa780


@@ -121,6 +121,127 @@ historical record.
---
## ADR-002: storaged per-prefix PUT cap (vectord _vectors/ → 4 GiB)
**Date:** 2026-04-29
**Decided by:** J
**Status:** Implemented (commit `423a381`)
`storaged` enforces a 256 MiB per-PUT body cap as DoS protection
(`MaxBytesReader` + Content-Length check). Keys under `_vectors/`
(vectord LHV1 persistence) get a raised cap of 4 GiB; everything
else stays at 256 MiB.
**Rationale:** the 500K staffing test surfaced that single-file LHV1
above ~150K vectors at d=768 hits the 256 MiB cap. `manager.Uploader`
already streams on the outbound side, so the cap is a safety gate
not a memory bottleneck — raising it for the vector path doesn't
introduce new memory pressure. Per-prefix preserves the safety
gate for routine traffic while opening the documented production
path. Splitting LHV1 across multiple keys was rejected because G1P
specifically shipped the single-Put framed format to eliminate
torn-write — multi-key would re-introduce that failure mode.
**Follow-up:** if production workloads exceed 4 GiB single-file
LHV1, refactor to operator-driven config (env/TOML) rather than
bumping the constant. The function-level `maxPutBytesFor(key)` in
`cmd/storaged/main.go` keeps that drop-in clean.
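A minimal sketch of the per-prefix cap, for illustration only. The name `maxPutBytesFor` comes from the text above; the constant names, the `capBody` helper, and the assumption that keys carry no leading slash are hypothetical and may not match the code in `423a381`.

```go
package main

import (
	"net/http"
	"strings"
)

const (
	// defaultMaxPutBytes is the routine per-PUT cap (256 MiB).
	defaultMaxPutBytes int64 = 256 << 20
	// vectorsMaxPutBytes is the raised cap for vectord LHV1 persistence (4 GiB).
	vectorsMaxPutBytes int64 = 4 << 30
	// vectorsPrefix marks the keys that get the raised cap.
	// Assumption: keys are stored without a leading slash.
	vectorsPrefix = "_vectors/"
)

// maxPutBytesFor returns the body-size cap for a PUT to key.
func maxPutBytesFor(key string) int64 {
	if strings.HasPrefix(key, vectorsPrefix) {
		return vectorsMaxPutBytes
	}
	return defaultMaxPutBytes
}

// capBody (hypothetical helper) applies both checks named in the ADR:
// an up-front Content-Length check and a MaxBytesReader on the body.
// It reports false after writing a 413 when the declared length is too large.
func capBody(w http.ResponseWriter, r *http.Request, key string) bool {
	limit := maxPutBytesFor(key)
	if r.ContentLength > limit {
		http.Error(w, "payload too large", http.StatusRequestEntityTooLarge)
		return false
	}
	r.Body = http.MaxBytesReader(w, r.Body, limit)
	return true
}
```

Keeping the decision in one function means a later move to env/TOML configuration only has to change where the two limits come from, not the call sites.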
---
## ADR-003: Inter-service auth posture — Bearer token + IP allowlist
**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — wiring deferred to Sprint 1
**Decision:** When inter-service auth is needed (the moment any
binary binds non-loopback or the deployment crosses a trust
boundary), the auth model is **a Bearer token loaded from
`secrets-go.toml` plus a configurable IP allowlist**. Both layers
required: the token authenticates the caller; the allowlist
narrows the network surface.
**Status today (G0):** zero auth middleware. Every binary binds
`127.0.0.1` by default; commit `6af0520` (R-001 partial fix) refuses
non-loopback bind unless the per-service `LH_<SVC>_ALLOW_NONLOOPBACK=1`
env override is set. The override-and-no-auth combination is the
worst case — this ADR locks in what we'll require before any
production override fires.
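The bind rule described here can be sketched as a small pure function. This is not the code from `6af0520`: the function names are hypothetical, and combining the existing override with the token requirement this ADR adds is an assumption about how the two gates will compose.

```go
package main

import (
	"errors"
	"net"
	"net/netip"
)

// isLoopbackBind reports whether addr ("host:port") binds only to loopback.
// An empty host (":8080") listens on all interfaces, so it is not loopback.
func isLoopbackBind(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false // unparseable address: treat as unsafe
	}
	if host == "localhost" {
		return true // assumption: localhost resolves to a loopback address
	}
	ip, err := netip.ParseAddr(host)
	return err == nil && ip.IsLoopback()
}

// checkBind decides whether a service may start on addr.
// allowNonLoopback stands for the LH_<SVC>_ALLOW_NONLOOPBACK=1 override;
// token is the configured Bearer token (empty means auth is off).
func checkBind(addr string, allowNonLoopback bool, token string) error {
	if isLoopbackBind(addr) {
		return nil
	}
	if !allowNonLoopback {
		return errors.New("non-loopback bind refused: override not set")
	}
	if token == "" {
		return errors.New("non-loopback bind refused: no auth token configured")
	}
	return nil
}
```

A caller would read the override from the environment and pass the result in, which keeps the rule itself testable without touching process state.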
### What gets implemented when auth lands
1. **`secrets-go.toml` adds a `[auth]` section:**
```toml
[auth]
token = "..." # 32+ random bytes, hex-encoded
allowed_ips = ["10.0.0.0/8", "127.0.0.1/32"] # CIDR list
```
2. **`internal/shared/auth.go`** ships a single chi middleware:
```go
func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler
```
- Empty `cfg.Token` → middleware is a no-op (G0 dev mode).
- Non-empty token → reject with 401 unless the request carries
`Authorization: Bearer <token>` and the token matches under
constant-time comparison.
- Non-empty `allowed_ips` → reject with 403 unless `r.RemoteAddr` (or
the first `X-Forwarded-For` hop, configurable) falls inside the CIDR set.
- `/health` exempt — load balancers + monitors need it open.
3. **Every `cmd/<svc>/main.go` adds one line:**
```go
r.Use(shared.RequireAuth(cfg.Auth))
```
Mounted before `register(r)` so it covers every route the binary
exposes after `/health`.
4. **`shared.Run` startup gate:** if bind is non-loopback AND
`cfg.Auth.Token == ""`, refuse to start. The implicit
"localhost is the auth layer" guarantee becomes explicit when
crossing the loopback boundary.
### Alternatives considered
| Option | Why rejected |
|---|---|
| **mTLS** | Strongest but heaviest — every binary needs cert provisioning, rotation tooling, and cert-aware client wiring. Overkill for inter-service traffic that already passes through a single gateway. Reconsider when Lakehouse-Go runs across machines. |
| **JWT with short TTL** | Buys nothing over Bearer here — there's no third-party identity provider, no claim hierarchy worth modelling. Pure token has the same security properties at half the wire complexity. |
| **No auth, IP-allowlist only** | One stolen IP allowlist entry → full access. Token + IP is defense in depth; either alone is too weak. |
| **OAuth2 via external IdP** | Rejected for G0-G3 timeline. No external IdP commitment. Revisit if Lakehouse-Go ever serves end-user requests directly (today everything fronts through the staffing co-pilot which has its own session model). |
### Constant-time comparison + token hygiene
Token comparison must use `crypto/subtle.ConstantTimeCompare`;
naive `==` is vulnerable to timing attacks from an attacker who
can issue many requests and measure round-trip time. Token rotation is
operator-driven via `secrets-go.toml` edit + restart; G0 doesn't
need rotate-without-restart.
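A short sketch of the token hygiene described above, under stated assumptions: `newToken` and `tokenEqual` are hypothetical helper names, and hashing both sides before comparing is an optional hardening step that this ADR does not require. It is shown because `subtle.ConstantTimeCompare` returns immediately when the two inputs differ in length.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
)

// newToken returns 32 random bytes, hex-encoded (64 characters),
// matching the size suggested for the [auth] token.
func newToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// tokenEqual compares two tokens in constant time. Hashing first gives
// both inputs the same length, so the comparison time does not depend on
// the length of the presented token.
func tokenEqual(presented, expected string) bool {
	a := sha256.Sum256([]byte(presented))
	b := sha256.Sum256([]byte(expected))
	return subtle.ConstantTimeCompare(a[:], b[:]) == 1
}
```

An operator could paste the output of `newToken` into `secrets-go.toml` and restart the services, which matches the edit-and-restart rotation model described above.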
### What this ADR does NOT do
- **Does not implement the middleware.** Code lands in Sprint 1.
- **Does not require token in G0 dev.** Empty token → no-op. Smokes
+ proof harness keep working without setting tokens.
- **Does not address gateway → end-user auth.** Gateway terminates
inter-service auth at its inbound; if end-users hit gateway from
a browser, that's a different ADR (likely cookie/session, fronted
by a reverse proxy that handles user auth).
### How this closes audit findings
- **R-001 (queryd /sql RCE-equivalent off-loopback):** the bind
gate prevents accidental exposure today; this ADR specifies the
guardrail when intentional exposure is needed.
- **R-007 (zero auth middleware):** answered by the design above;
R-007 stays open until the middleware is implemented but is no
longer "design TBD."
- **R-010 (no CORS posture):** orthogonal to inter-service auth,
but the `RequireAuth` middleware sits at the right layer to add
CORS handling later (browsers don't reach inter-service routes
in the current design, so CORS is also Sprint 1+ when end-user
requests start landing).
---
(Future ADRs from ADR-004 onward will be added as the Go
implementation accrues design decisions — e.g. HNSW parameter
choices, pathway-memory hash function, auditor model rotation, etc.)