diff --git a/docs/DECISIONS.md b/docs/DECISIONS.md
index 27fa542..5b5afdf 100644
--- a/docs/DECISIONS.md
+++ b/docs/DECISIONS.md
@@ -121,6 +121,127 @@ historical record.
 
 ---
 
-(Future ADRs from ADR-002 onward will be added as the Go
+## ADR-002: storaged per-prefix PUT cap (vectord _vectors/ → 4 GiB)
+**Date:** 2026-04-29
+**Decided by:** J
+**Status:** Implemented (commit `423a381`)
+
+`storaged` enforces a 256 MiB per-PUT body cap as DoS protection
+(`MaxBytesReader` + Content-Length check). Keys under `_vectors/`
+(vectord LHV1 persistence) get a raised cap of 4 GiB; everything
+else stays at 256 MiB.
+
+**Rationale:** the 500K staffing test surfaced that single-file LHV1
+above ~150K vectors at d=768 hits the 256 MiB cap. `manager.Uploader`
+already streams on the outbound side, so the cap is a safety gate
+not a memory bottleneck — raising it for the vector path doesn't
+introduce new memory pressure. Per-prefix preserves the safety
+gate for routine traffic while opening the documented production
+path. Splitting LHV1 across multiple keys was rejected because G1P
+specifically shipped the single-Put framed format to eliminate
+torn-write — multi-key would re-introduce that failure mode.
+
+**Follow-up:** if production workloads exceed 4 GiB single-file
+LHV1, refactor to operator-driven config (env/TOML) rather than
+bumping the constant. The function-level `maxPutBytesFor(key)` in
+`cmd/storaged/main.go` keeps that drop-in clean.
+
+---
+
+## ADR-003: Inter-service auth posture — Bearer token + IP allowlist
+**Date:** 2026-04-29
+**Decided by:** J + Claude
+**Status:** Decided — wiring deferred to Sprint 1
+
+**Decision:** When inter-service auth is needed (the moment any
+binary binds non-loopback or the deployment crosses a trust
+boundary), the auth model is **a Bearer token loaded from
+`secrets-go.toml` plus a configurable IP allowlist**. Both layers
+required: the token authenticates the caller; the allowlist
+narrows the network surface.
+
+**Status today (G0):** zero auth middleware. Every binary binds
+`127.0.0.1` by default; commit `6af0520` (R-001 partial fix) refuses
+non-loopback bind unless the per-service `LH__ALLOW_NONLOOPBACK=1`
+env override is set. The override-and-no-auth combination is the
+worst case — this ADR locks in what we'll require before any
+production override fires.
+
+### What gets implemented when auth lands
+
+1. **`secrets-go.toml` adds a `[auth]` section:**
+   ```toml
+   [auth]
+   token = "..."  # 32+ random bytes, hex-encoded
+   allowed_ips = ["10.0.0.0/8", "127.0.0.1/32"]  # CIDR list
+   ```
+
+2. **`internal/shared/auth.go`** ships a single chi middleware:
+   ```go
+   func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler
+   ```
+   - Empty `cfg.Token` → middleware is a no-op (G0 dev mode).
+   - Non-empty token → reject 401 unless request has
+     `Authorization: Bearer ` matching constant-time.
+   - Non-empty `allowed_ips` → reject 403 unless `r.RemoteAddr` (or
+     `X-Forwarded-For` first hop, configurable) is in CIDR set.
+   - `/health` exempt — load balancers + monitors need it open.
+
+3. **Every `cmd//main.go` adds one line:**
+   ```go
+   r.Use(shared.RequireAuth(cfg.Auth))
+   ```
+   Mounted before `register(r)` so it covers every route the binary
+   exposes after `/health`.
+
+4. **`shared.Run` startup gate:** if bind is non-loopback AND
+   `cfg.Auth.Token == ""`, refuse to start. The implicit
+   "localhost is the auth layer" guarantee becomes explicit when
+   crossing the loopback boundary.
+
+### Alternatives considered
+
+| Option | Why rejected |
+|---|---|
+| **mTLS** | Strongest but heaviest — every binary needs cert provisioning, rotation tooling, and cert-aware client wiring. Overkill for inter-service traffic that already passes through a single gateway. Reconsider when Lakehouse-Go runs across machines. |
+| **JWT with short TTL** | Buys nothing over Bearer here — there's no third-party identity provider, no claim hierarchy worth modelling. Pure token has the same security properties at half the wire complexity. |
+| **No auth, IP-allowlist only** | One stolen IP allowlist entry → full access. Token + IP is defense in depth; either alone is too weak. |
+| **OAuth2 via external IdP** | Rejected for G0–G3 timeline. No external IdP commitment. Revisit if Lakehouse-Go ever serves end-user requests directly (today everything fronts through the staffing co-pilot which has its own session model). |
+
+### Constant-time comparison + token hygiene
+
+Token comparison must use `crypto/subtle.ConstantTimeCompare` —
+naive `==` is vulnerable to timing attacks against an attacker who
+can issue many requests and measure round-trip. Token rotation is
+operator-driven via `secrets-go.toml` edit + restart; G0 doesn't
+need rotate-without-restart.
+
+### What this ADR does NOT do
+
+- **Does not implement the middleware.** Code lands in Sprint 1.
+- **Does not require token in G0 dev.** Empty token → no-op. Smokes +
+  proof harness keep working without setting tokens.
+- **Does not address gateway → end-user auth.** Gateway terminates
+  inter-service auth at its inbound; if end-users hit gateway from
+  a browser, that's a different ADR (likely cookie/session, fronted
+  by a reverse proxy that handles user auth).
+
+### How this closes audit findings
+
+- **R-001 (queryd /sql RCE-equivalent off-loopback):** the bind
+  gate prevents accidental exposure today; this ADR specifies the
+  guardrail when intentional exposure is needed.
+- **R-007 (zero auth middleware):** answered by the design above;
+  R-007 stays open until the middleware is implemented but is no
+  longer "design TBD."
+- **R-010 (no CORS posture):** orthogonal to inter-service auth,
+  but the `RequireAuth` middleware sits at the right layer to add
+  CORS handling later (browsers don't reach inter-service routes
+  in the current design, so CORS is also Sprint 1+ when end-user
+  requests start landing).
+
+---
+
+(Future ADRs from ADR-004 onward will be added as the Go
 implementation accrues design decisions — e.g. HNSW parameter
 choices, pathway-memory hash function, auditor model rotation, etc.)
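
Note on ADR-003: the `RequireAuth` rules listed under "What gets implemented when auth lands" could be sketched roughly as below. This is a minimal, stdlib-only illustration and not the Sprint 1 implementation. The `AuthConfig` field names and the helpers `parseCIDRs`, `tokenMatches`, and `ipAllowed` are hypothetical. Panicking on a malformed allowlist is an assumption, and the configurable `X-Forwarded-For` handling is left out. The returned function has the standard `func(http.Handler) http.Handler` shape, so it should be usable with chi's `r.Use`.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
)

// AuthConfig mirrors the [auth] TOML section (field names are assumed).
type AuthConfig struct {
	Token      string
	AllowedIPs []string // CIDR strings, e.g. "10.0.0.0/8"
}

// parseCIDRs converts CIDR strings to networks, failing on the first bad entry.
func parseCIDRs(cidrs []string) ([]*net.IPNet, error) {
	nets := make([]*net.IPNet, 0, len(cidrs))
	for _, c := range cidrs {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			return nil, err
		}
		nets = append(nets, n)
	}
	return nets, nil
}

// tokenMatches compares the Bearer credential to want in constant time.
func tokenMatches(header, want string) bool {
	if !strings.HasPrefix(header, "Bearer ") {
		return false
	}
	got := strings.TrimPrefix(header, "Bearer ")
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

// ipAllowed reports whether the host in remoteAddr lies in any network.
func ipAllowed(remoteAddr string, nets []*net.IPNet) bool {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr // no port present
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return false
	}
	for _, n := range nets {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

// RequireAuth follows the ADR's rules; X-Forwarded-For handling is omitted.
func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler {
	nets, err := parseCIDRs(cfg.AllowedIPs)
	if err != nil {
		panic(err) // assumption: fail fast at startup on a bad allowlist
	}
	return func(next http.Handler) http.Handler {
		if cfg.Token == "" {
			return next // empty token: no-op (G0 dev mode)
		}
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			switch {
			case r.URL.Path == "/health":
				next.ServeHTTP(w, r) // exempt for load balancers and monitors
			case !tokenMatches(r.Header.Get("Authorization"), cfg.Token):
				http.Error(w, "unauthorized", http.StatusUnauthorized)
			case len(nets) > 0 && !ipAllowed(r.RemoteAddr, nets):
				http.Error(w, "forbidden", http.StatusForbidden)
			default:
				next.ServeHTTP(w, r)
			}
		})
	}
}

func main() {
	// Demo values only; a real token would be 32+ random bytes, hex-encoded.
	mw := RequireAuth(AuthConfig{Token: "demo-token", AllowedIPs: []string{"127.0.0.1/32"}})
	h := mw(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	req := httptest.NewRequest(http.MethodGet, "/sql", nil)
	req.RemoteAddr = "127.0.0.1:4000"
	req.Header.Set("Authorization", "Bearer demo-token")
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	fmt.Println("status:", rec.Code)
}
```

`subtle.ConstantTimeCompare` returns immediately when the two slices differ in length, so this sketch can reveal the token's length but not its contents. The ADR's fixed-size hex token makes that an acceptable trade-off here.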