# Architecture Decision Records — Lakehouse-Go

ADRs from the Go era. Numbered fresh from 001 to start a clean lineage. Where a Rust ADR (numbered 001–021 in the Rust repo's `DECISIONS.md`) remains in force, this file references it explicitly. Where a Rust ADR is superseded, the new ADR records why.

---

## ADR-001: Foundational decisions for the Go rewrite

**Date:** 2026-04-28
**Decided by:** J
**Status:** Ratified — Phase G0 unblocked

The six questions that gated Phase G0 (per PRD.md / SPEC.md §8) are all answered.

### Decision 1.1 — DuckDB via cgo for the query engine

**Decision:** `queryd` uses `marcboeker/go-duckdb` (cgo bindings to DuckDB). A pure-Go alternative was rejected.

**Rationale:** DuckDB reads Parquet natively, supports the SQL surface DataFusion exposed in the Rust era (CTEs, window functions, hybrid joins), and runs in-process via cgo. The alternatives were:

- Hand-rolling a query planner over arrow-go RecordBatches — a multi-engineer-month research project with high risk of correctness bugs.
- Running DuckDB as an external process — adds an operational surface and a network hop to every query.

Cgo build complexity is the accepted cost. Single-binary deploy is preserved (the cgo dependency embeds at link time).

**Supersedes Rust ADR-001** (object storage as source of truth)? No. That ADR remains in force; what changes is the *engine* over the storage, not the storage model.

### Decision 1.2 — HTMX for the UI

**Decision:** The frontend is `html/template` + HTMX + Alpine.js, server-rendered by `cmd/gateway`. React/Vite in a separate repo is the fallback if UX requirements demand SPA-tier interactivity post-G5.

**Rationale:** The existing Lakehouse UIs (`/lakehouse/` demo + staffer console) are mostly server-rendered HTML with vanilla JS that already fits the HTMX style. Single-binary deploy is preserved (gateway serves templates + static assets). No build chain beyond `go build`. The React fallback is named explicitly so it is not relitigated unless an actual UX requirement triggers it.

### Decision 1.3 — Gitea hosts the new repo

**Decision:** The repo lives at `git.agentview.dev/profit/golangLAKEHOUSE` (the same Gitea server that hosts the Rust lakehouse).

**Rationale:** Single source of truth for repo hosting; existing auditor tooling (the `lakehouse-auditor` systemd service) already speaks the Gitea API; existing credentials work; no new ops surface.

### Decision 1.4 — Distillation rebuilt in Go, not ported verbatim

**Decision:** The distillation v1.0.0 substrate (tag `distillation-v1.0.0` at `e7636f2` in the Rust repo) is **not** ported bit-identically. The Go reimplementation:

- Ports the LOGIC: SFT export pipeline, contamination firewall (the `quality_score` enum + `SFT_NEVER` constant), category mapping rules, audit-baselines append-only pattern.
- Does NOT port the FIXTURES: `tests/fixtures/distillation/acceptance/` is rebuilt from scratch in Go with new ground-truth golden files.
- Does NOT port the bit-identical reproducibility PROPERTY: that was measured against the Rust implementation. The Go implementation establishes its own reproducibility baseline.

**Rationale:** Bit-identical reproducibility was a measured property of a specific implementation, not a portable invariant. Re-establishing it in Go means new fixtures, new gates, new audit-baselines. This is honest about what is transferring (logic) versus what is a Rust-era artifact (the specific bit-identical hashes).

**Risk:** The contamination firewall is the most consequential distillation safety net. The port must be reviewed line-by-line, and the new Go fixtures must include adversarial cases that prove the firewall works in the new implementation. See SPEC §7 acceptance gates.
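The firewall is the piece of logic that transfers, so a minimal sketch of its Go shape may help reviewers of the port. The type name `QualityScore`, the constant name `SFTNever`, the `Record` shape, and `FilterForSFT` are all hypothetical illustrations, not the ported code; only the `quality_score` enum and `SFT_NEVER` constant are named by this ADR.

```go
package distill

// QualityScore mirrors the Rust quality_score enum. The variant set here is a
// placeholder; the real values come from the Rust distillation v1.0.0 source.
type QualityScore string

const (
	// SFTNever is the contamination firewall constant: rows carrying it must
	// never reach an SFT export, regardless of any other selection rule.
	SFTNever QualityScore = "SFT_NEVER"
)

// Record is a stand-in for whatever row shape the SFT export pipeline consumes.
type Record struct {
	ID      string
	Quality QualityScore
}

// FilterForSFT is the firewall choke point: it drops SFT_NEVER rows. This is
// the function the adversarial Go fixtures should attack (missing scores,
// unexpected casing, records that carry SFT_NEVER content under another label).
func FilterForSFT(in []Record) []Record {
	out := make([]Record, 0, len(in))
	for _, r := range in {
		if r.Quality == SFTNever {
			continue // firewall: never export
		}
		out = append(out, r)
	}
	return out
}
```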
### Decision 1.5 — Pathway memory starts clean; old traces preserved as reference

**Decision:** Go pathway memory begins with zero traces. The existing 88 Rust traces at `/home/profit/lakehouse/data/_pathway_memory/state.json` are NOT loaded into the Go implementation. They are preserved as a historical record in the Rust repo and documented at `docs/RUST_PATHWAY_MEMORY_NOTE.md`.

**Rationale:** The Rust pathway memory's value compounded over months of scrum cycles. Loading those traces into a Go implementation that hasn't proven its byte-matching contract risks corrupting the new substrate's signal with semantically mismatched data. Starting clean keeps the Go pathway memory's lineage clean and lets byte-match correctness be proven on a known input (per SPEC §3.4 G3.4.B). The historical note records the 88 traces' value (11/11 successful replays at the time of freeze) so the Go implementation has a reference baseline to outperform.

### Decision 1.6 — Auditor longitudinal signal restarts

**Decision:** The Rust auditor's `audit_baselines.jsonl` (longitudinal drift signal accumulated across PRs #6–#13) is **not** ported to Go. The Go auditor begins a fresh `audit_baselines.jsonl` lineage on its first PR.

**Rationale:** The drift signal is anchored to specific Rust commits, verdict shapes, and Kimi/Haiku/Opus rotation traces. Carrying it into the Go era would be like grafting Rust-PR audit history onto the first Go PR's prologue — confusing more than informative. Restarting gives the Go auditor a clean baseline to measure drift against. The existing Rust `audit_baselines.jsonl` stays in the Rust repo as a historical record.

---

## ADR-002: storaged per-prefix PUT cap (vectord `_vectors/` → 4 GiB)

**Date:** 2026-04-29
**Decided by:** J
**Status:** Implemented (commit `423a381`)

`storaged` enforces a 256 MiB per-PUT body cap as DoS protection (`MaxBytesReader` + Content-Length check). Keys under `_vectors/` (vectord LHV1 persistence) get a raised cap of 4 GiB; everything else stays at 256 MiB.

**Rationale:** The 500K staffing test surfaced that single-file LHV1 above ~150K vectors at d=768 hits the 256 MiB cap. `manager.Uploader` already streams on the outbound side, so the cap is a safety gate, not a memory bottleneck — raising it for the vector path doesn't introduce new memory pressure. A per-prefix cap preserves the safety gate for routine traffic while opening the documented production path. Splitting LHV1 across multiple keys was rejected because G1P specifically shipped the single-Put framed format to eliminate torn writes — multi-key would reintroduce that failure mode.

**Follow-up:** If production workloads exceed 4 GiB single-file LHV1, refactor to operator-driven config (env/TOML) rather than bumping the constant. The function-level `maxPutBytesFor(key)` in `cmd/storaged/main.go` keeps that drop-in clean.
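A minimal sketch of the per-prefix cap shape, assuming the handler checks Content-Length before wrapping the body; the constant names and `handlePut` are illustrative, not the shipped `cmd/storaged/main.go` code:

```go
package storaged

import (
	"net/http"
	"strings"
)

const (
	defaultMaxPutBytes = 256 << 20 // 256 MiB: routine-traffic DoS gate
	vectorMaxPutBytes  = 4 << 30   // 4 GiB: raised cap for single-file LHV1
)

// maxPutBytesFor returns the PUT body cap for a given object key. Keys under
// _vectors/ (vectord LHV1 persistence) get the raised cap; everything else
// keeps the default.
func maxPutBytesFor(key string) int64 {
	if strings.HasPrefix(key, "_vectors/") {
		return vectorMaxPutBytes
	}
	return defaultMaxPutBytes
}

func handlePut(w http.ResponseWriter, r *http.Request, key string) {
	limit := maxPutBytesFor(key)
	// Cheap rejection first: a declared Content-Length over the cap never
	// reaches the body reader.
	if r.ContentLength > limit {
		http.Error(w, "body exceeds per-prefix PUT cap", http.StatusRequestEntityTooLarge)
		return
	}
	// MaxBytesReader enforces the cap even when Content-Length is absent or lies.
	r.Body = http.MaxBytesReader(w, r.Body, limit)
	// ... stream r.Body to object storage ...
}
```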
---

## ADR-003: Inter-service auth posture — Bearer token + IP allowlist

**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — wiring deferred to Sprint 1

**Decision:** When inter-service auth is needed (the moment any binary binds non-loopback or the deployment crosses a trust boundary), the auth model is **a Bearer token loaded from `secrets-go.toml` plus a configurable IP allowlist**. Both layers are required: the token authenticates the caller; the allowlist narrows the network surface.

**Status today (G0):** Zero auth middleware. Every binary binds `127.0.0.1` by default; commit `6af0520` (R-001 partial fix) refuses a non-loopback bind unless the per-service `LH_<SERVICE>_ALLOW_NONLOOPBACK=1` env override is set. The override-and-no-auth combination is the worst case — this ADR locks in what we'll require before any production override fires.

### What gets implemented when auth lands

1. **`secrets-go.toml` adds an `[auth]` section:**

   ```toml
   [auth]
   token = "..."                                 # 32+ random bytes, hex-encoded
   allowed_ips = ["10.0.0.0/8", "127.0.0.1/32"]  # CIDR list
   ```

2. **`internal/shared/auth.go`** ships a single chi middleware:

   ```go
   func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler
   ```

   - Empty `cfg.Token` → middleware is a no-op (G0 dev mode).
   - Non-empty token → reject with 401 unless the request carries an `Authorization: Bearer <token>` header that matches in constant time.
   - Non-empty `allowed_ips` → reject with 403 unless `r.RemoteAddr` (or the first `X-Forwarded-For` hop, configurable) is in the CIDR set.
   - `/health` is exempt — load balancers + monitors need it open.

3. **Every `cmd/<service>/main.go` adds one line:**

   ```go
   r.Use(shared.RequireAuth(cfg.Auth))
   ```

   Mounted before `register(r)` so it covers every route the binary exposes beyond `/health`.

4. **`shared.Run` startup gate:** if the bind is non-loopback AND `cfg.Auth.Token == ""`, refuse to start. The implicit "localhost is the auth layer" guarantee becomes explicit when crossing the loopback boundary.

### Alternatives considered

| Option | Why rejected |
|---|---|
| **mTLS** | Strongest but heaviest — every binary needs cert provisioning, rotation tooling, and cert-aware client wiring. Overkill for inter-service traffic that already passes through a single gateway. Reconsider when Lakehouse-Go runs across machines. |
| **JWT with short TTL** | Buys nothing over Bearer here — there's no third-party identity provider, no claim hierarchy worth modelling. A plain token has the same security properties at half the wire complexity. |
| **No auth, IP-allowlist only** | One stolen IP allowlist entry → full access. Token + IP is defense in depth; either alone is too weak. |
| **OAuth2 via external IdP** | Rejected for the G0–G3 timeline. No external IdP commitment. Revisit if Lakehouse-Go ever serves end-user requests directly (today everything fronts through the staffing co-pilot, which has its own session model). |

### Constant-time comparison + token hygiene

Token comparison must use `crypto/subtle.ConstantTimeCompare` — a naive `==` is vulnerable to timing attacks from an attacker who can issue many requests and measure round-trip times. Token rotation is operator-driven via a `secrets-go.toml` edit + restart; G0 doesn't need rotate-without-restart.

### What this ADR does NOT do

- **Does not implement the middleware.** Code lands in Sprint 1.
- **Does not require a token in G0 dev.** Empty token → no-op. Smokes + proof harness keep working without setting tokens.
- **Does not address gateway → end-user auth.** The gateway terminates inter-service auth at its inbound; if end-users hit the gateway from a browser, that's a different ADR (likely cookie/session, fronted by a reverse proxy that handles user auth).

### How this closes audit findings

- **R-001 (queryd /sql RCE-equivalent off-loopback):** the bind gate prevents accidental exposure today; this ADR specifies the guardrail for when intentional exposure is needed.
- **R-007 (zero auth middleware):** answered by the design above; R-007 stays open until the middleware is implemented but is no longer "design TBD."
- **R-010 (no CORS posture):** orthogonal to inter-service auth, but the `RequireAuth` middleware sits at the right layer to add CORS handling later (browsers don't reach inter-service routes in the current design, so CORS is also Sprint 1+ work, when end-user requests start landing).
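To make the constant-time and CIDR mechanics concrete, a minimal sketch of the middleware shape described above, written against `net/http` + `net/netip`. The `AuthConfig` fields and helper names are assumptions for illustration, not the Sprint 1 implementation:

```go
package shared

import (
	"crypto/subtle"
	"net"
	"net/http"
	"net/netip"
	"strings"
)

type AuthConfig struct {
	Token      string   // empty → middleware is a no-op (G0 dev mode)
	AllowedIPs []string // CIDR strings, e.g. "10.0.0.0/8"
}

// RequireAuth gates every route behind a Bearer token and an optional IP
// allowlist. /health is expected to be registered outside this middleware.
func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler {
	var nets []netip.Prefix
	for _, c := range cfg.AllowedIPs {
		if p, err := netip.ParsePrefix(c); err == nil {
			nets = append(nets, p)
		}
	}
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if cfg.Token == "" && len(nets) == 0 {
				next.ServeHTTP(w, r) // dev mode: nothing configured
				return
			}
			if len(nets) > 0 && !ipAllowed(r.RemoteAddr, nets) {
				http.Error(w, "forbidden", http.StatusForbidden)
				return
			}
			if cfg.Token != "" {
				got := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
				// Constant-time compare so response timing leaks nothing about the token.
				if subtle.ConstantTimeCompare([]byte(got), []byte(cfg.Token)) != 1 {
					http.Error(w, "unauthorized", http.StatusUnauthorized)
					return
				}
			}
			next.ServeHTTP(w, r)
		})
	}
}

func ipAllowed(remoteAddr string, nets []netip.Prefix) bool {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr
	}
	addr, err := netip.ParseAddr(host)
	if err != nil {
		return false
	}
	for _, p := range nets {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}
```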
---

## ADR-004: Pathway memory data model — Mem0-style versioned traces

**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — substrate landing in `internal/pathway/`

**Decision:** Pathway memory is an append-only event log of opaque traces with Mem0-style semantics: Add / Update / Revise / Retire / History / Search. Each trace has a UID; revisions chain backward via `predecessor_uid` so the full history is reconstructible. Persistence is JSONL append-only with full replay on load; corruption recovery skips bad lines without halting startup.

### Operations

| Op | Effect |
|---|---|
| `Add(content, tags...)` | New UID, stored fresh, replay_count=1. |
| `AddIdempotent(uid, content, tags...)` | If UID exists → replay_count++. Else → Add with that UID. |
| `Update(uid, content)` | In-place content replacement (same UID). Bumps `updated_at_ns`. NOT a revision — same trace, new content. |
| `Revise(predecessorUID, content, tags...)` | New UID with `predecessor_uid` set. Old trace stays accessible via History. Failure modes: predecessor missing → error; predecessor retired → still allowed (revisions of retired traces are valid). |
| `Retire(uid)` | Sets `retired=true`. Excluded from `Search` by default; still accessible via `Get` and `History`. |
| `Get(uid)` | Returns the trace (including if retired); error on missing. |
| `History(uid)` | Walks the `predecessor_uid` chain backward, returns a slice [self, parent, grandparent, ...]. Cycles are detected via a visited set; returns an error on cycle (which only happens if the persistence file was hand-edited). |
| `Search(filter)` | Returns matching traces. Default excludes retired; opt in via `IncludeRetired: true`. Filters: tag match, content substring, time range. |

### Why Mem0-style + why these specific ops

- **Mem0** (the memory pattern from the OpenAI Memories paper / Mem0 lib) is the canonical "agent memory" interface for the same reason Markdown is the canonical text format: it's the lowest-common-denominator that the entire ecosystem assumes. Adopting it lets agent loops written against any Mem0-aware substrate work here.
- Update vs Revise are deliberately separate. Update is "I noticed a typo in my note." Revise is "I now believe something different than I did when I wrote this; preserve the old belief for audit." Conflating them loses the audit trail.
- Retire vs Delete is deliberate. Retire stops a trace from surfacing in search but preserves it for history reconstruction. Delete (which we don't expose) would break references.

### Trace data shape

```go
type Trace struct {
	UID            string          // UUID v4 unless caller provides one
	Content        json.RawMessage // opaque, schema is caller's contract
	PredecessorUID string          // empty if root revision
	CreatedAtNs    int64
	UpdatedAtNs    int64
	Retired        bool
	ReplayCount    int      // ≥1 for any stored trace
	Tags           []string // for Search
}
```

`Content` is opaque JSON (not a struct) so callers can store any shape — the data model doesn't constrain semantics. Callers add their own validators on top.
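A toy, in-memory sketch of Add / Revise / History over the trace shape above, to show how the revision chain and the defensive cycle check compose. It is illustrative only; the real `internal/pathway` substrate adds persistence, idempotent adds, retire/search, and locking:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"

	"github.com/google/uuid"
)

type Trace struct {
	UID            string
	Content        json.RawMessage
	PredecessorUID string
	CreatedAtNs    int64
	Tags           []string
}

type store struct{ traces map[string]*Trace }

// Add stores a fresh trace under a new server-generated UID.
func (s *store) Add(content json.RawMessage, tags ...string) *Trace {
	t := &Trace{UID: uuid.NewString(), Content: content, CreatedAtNs: time.Now().UnixNano(), Tags: tags}
	s.traces[t.UID] = t
	return t
}

// Revise chains a new trace to an existing predecessor; the old trace stays put.
func (s *store) Revise(predecessorUID string, content json.RawMessage, tags ...string) (*Trace, error) {
	if _, ok := s.traces[predecessorUID]; !ok {
		return nil, fmt.Errorf("predecessor %s not found", predecessorUID)
	}
	t := s.Add(content, tags...)
	t.PredecessorUID = predecessorUID
	return t, nil
}

// History walks predecessor_uid backward with a visited set so a corrupted
// (hand-edited) chain cannot loop forever.
func (s *store) History(uid string) ([]*Trace, error) {
	var out []*Trace
	seen := map[string]bool{}
	for uid != "" {
		if seen[uid] {
			return nil, fmt.Errorf("cycle detected at %s", uid)
		}
		seen[uid] = true
		t, ok := s.traces[uid]
		if !ok {
			return nil, fmt.Errorf("missing trace %s", uid)
		}
		out = append(out, t)
		uid = t.PredecessorUID
	}
	return out, nil
}

func main() {
	s := &store{traces: map[string]*Trace{}}
	t1 := s.Add(json.RawMessage(`{"belief":"v1"}`), "demo")
	t2, _ := s.Revise(t1.UID, json.RawMessage(`{"belief":"v2"}`), "demo")
	chain, _ := s.History(t2.UID) // [t2, t1]
	for _, t := range chain {
		fmt.Println(t.UID, string(t.Content))
	}
}
```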
### Persistence

JSONL append-only log under `_pathway/<store>.jsonl`. Each mutation appends one JSON line:

```
{"op":"add",    "trace":{...}}
{"op":"update", "uid":"…", "content":"…"}
{"op":"revise", "trace":{…}}   # trace.PredecessorUID is set
{"op":"retire", "uid":"…"}
{"op":"replay", "uid":"…"}     # idempotent re-add hit
```

On startup, replay every line in order, building the in-memory state. A malformed line logs a warning and is skipped; the load continues. Corruption tolerance is non-optional — partial state is better than no state for an agent substrate.

Compaction is a future concern. A 100K-trace log replays in seconds; below that scale, JSONL append is the simplest correct choice. When compaction lands, the format will be: a snapshot file (full state JSON) + a tail JSONL since the snapshot. Detect the snapshot, load it, then replay the tail.

### Cycle safety

UIDs are generated server-side via `uuid.New()` (an existing dep — catalogd uses it). A new UID is minted for every Add and Revise. The data model itself can't form cycles — every Revise points at an EXISTING uid, and the new uid didn't exist a moment ago. History walks defensively anyway: a visited set tracks the UIDs seen this walk; if we encounter a duplicate, return an error. This protects against corruption (manual edit, bug in a future op) without constraining the happy path.

### Storage location

The JSONL file path is configurable per store. Default: `/var/lib/lakehouse/pathway/<store>.jsonl` for prod; tests use `t.TempDir()`. Persistence is OPTIONAL — an empty path means in-memory only (matches vectord G1's pattern).

### What this ADR does NOT do

- **No HTTP surface decision.** Whether `cmd/pathwayd` is its own binary or routes get added to `cmd/vectord` is the next ADR's concern. The substrate is a pure library either way.
- **No vector index integration.** Pathway traces can carry a vector embedding in `Content` (the caller decides), but this ADR doesn't define how the substrate integrates with `vectord`'s HNSW indexes. That's the staffing co-pilot's design problem when those layers compose.
- **No agent-loop semantics.** "When does an agent ADD vs REVISE?" is a workflow decision, not a substrate decision.

---

## ADR-005: Observer fail-safe semantics

**Date:** 2026-04-30
**Status:** RATIFIED
**Scope:** `internal/observer` (Store, Persistor) + `internal/workflow` (Runner) + `cmd/observerd`

The Rust legacy had a documented "verdict:accept on crash" anti-pattern: when the observer crashed mid-evaluation, the upstream interpreted the missing verdict as implicit acceptance. Several silent regressions traced to it. The Go observer's role is structurally different — it is a **witness** (records what happened) rather than a **gate** (decides accept/reject) — but adjacent fail-safe decisions still need locking now that observerd is on the prod-realistic stack via the lift harness (commit `b2e45f7`, 2026-04-30). This ADR ratifies the current behavior and locks the rationale so future consumers don't break the invariant by flipping the defaults.

### Decision 5.1 — Persist failure is logged-not-fatal; ring is the in-flight source of truth

Already implemented (`internal/observer/store.go:60-67`). Locked:

- If `persistor.Append` fails, log a warning and continue. Do NOT return an error to the caller of `Store.Record`.
- The in-memory ring buffer is the source of truth in flight; the JSONL is a best-effort durability shadow.
- Operators who need fail-closed, audit-grade trails configure that mode through a future opt-in (deferred to a later ADR; not the G0/G1/G2 default).

**Why fail-open here:** The observer's job is to keep recording even when the disk hiccups. A `persist-fail-fatal` mode would translate every transient I/O blip into an observer blackout, which is strictly worse for the witness role than missing a few persisted entries — the ring still has them, and operators can drain it on restart.

**Why this isn't the Rust anti-pattern:** The Go observer doesn't emit verdicts. A persist failure here means "we recorded fewer rows on disk than in memory," not "we accepted something we shouldn't have."
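A minimal sketch of the logged-not-fatal pattern locked here, with illustrative type and field names rather than the actual `internal/observer/store.go` code:

```go
package observer

import (
	"log"
	"sync"
)

// ObservedOp is the audit row; fields elided for the sketch.
type ObservedOp struct {
	Node    string
	Success bool
	Error   string
}

type Persistor interface {
	Append(op ObservedOp) error // best-effort JSONL shadow
}

type Store struct {
	mu       sync.Mutex
	ring     []ObservedOp // oldest evicted on overflow (see Decision 5.4)
	capacity int
	p        Persistor // may be nil (in-memory only)
}

// Record accepts into the ring unconditionally; a persistence failure is
// logged and never surfaced to the caller (Decision 5.1).
func (s *Store) Record(op ObservedOp) {
	s.mu.Lock()
	if len(s.ring) == s.capacity {
		s.ring = s.ring[1:] // shift-left eviction: drop the oldest entry
	}
	s.ring = append(s.ring, op)
	s.mu.Unlock()

	if s.p != nil {
		if err := s.p.Append(op); err != nil {
			log.Printf("observer: persist failed (ring retains entry): %v", err)
		}
	}
}
```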
### Decision 5.2 — Mode failure in workflow.Runner: `Success = (Error == "")`, no panic-swallow path

Already implemented (`internal/workflow/runner.go`). Locked:

- Mode errors are caught by the runner and surfaced via the node's `Error` field; `Success` is the boolean derived from `Error == ""`.
- `observerd` records an `ObservedOp` per node with `Success: false` and the error string when a mode fails.
- Cycles, missing deps, and unknown modes are aborting errors → 4xx from `/observer/workflow/run` with the failure encoded in the JSON response.

**Why this is the explicit anti-Rust:** Allowing a mode to silently swallow its panic and report `Success: true` is exactly how the Rust "verdict:accept on crash" pattern manifests. Forcing the runner to record `Success: false` on error makes the failure observable to downstream consumers (observerd queries, scrum review, distillation selection) instead of laundering it into a fake success.

### Decision 5.3 — Provenance is one-row-per-node, recorded post-run

Already implemented (`cmd/observerd/main.go:140-154`). Locked:

- `runner.Run` returns the full `RunResult` with per-node Success/Error; `handleWorkflowRun` then iterates `res.Nodes` and `store.Record`s an `ObservedOp` per node.
- One row per node, NOT a single per-workflow catch-all. A workflow with N nodes produces N audit rows.
- Crash semantics:
  - Crash *during* `runner.Run` → no provenance recorded; queries see absence, not a false acceptance.
  - Crash *during* the recording loop → some nodes recorded, some absent; queries see partial provenance, again not a false acceptance.
  - Recovery: re-run the whole workflow. No incremental resume in G0/G1/G2.

**Why one row per node:** Debugging a partial workflow is a one-grep operation when each node has its own row. A single catch-all row would be exactly the Rust anti-pattern surface — "we accepted this workflow" records that survive partial crashes look identical to genuine acceptances. Per-node rows make that structurally impossible.

**Known gap, not yet a follow-up ADR:** Recording happens after `runner.Run` returns, not as each node completes. A long workflow with a late-stage failure currently records the nodes that already finished only once the runner returns. For the G0/G1/G2 substrate this is fine — workflows are short. When workflows get long enough that mid-run visibility matters, a streaming-record callback is the right shape.

### Decision 5.4 — `/observer/event` accepts even when the ring is full

Already implemented via `Store.Record`'s shift-left eviction. Locked:

- Ring overflow is normal operation: oldest evicted, newest accepted.
- 200 OK from `/observer/event` means "we accepted into the ring"; it does NOT promise "we persisted." Persistence remains best-effort per Decision 5.1.
- 4xx is reserved for malformed `ObservedOp` payloads (validation failures).

**Why accept-on-full:** Treating a full ring as a 503 would translate every brief activity burst into client errors, which is exactly the wrong direction for an audit witness — the witness's job is to never refuse a write, only to lose the oldest data when capacity binds.
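A sketch of the Decisions 5.2/5.3 shape: per-node results where `Success` is derived from `Error`, and a recovered panic becomes a recorded failure rather than a silent success. Names are illustrative, not the `internal/workflow` code:

```go
package workflow

import "fmt"

// NodeResult mirrors the invariant: Success is never stored independently of Error.
type NodeResult struct {
	Name  string
	Error string
}

func (n NodeResult) Success() bool { return n.Error == "" }

// RunResult is what the per-node recording loop iterates (Decision 5.3:
// one ObservedOp per node, never a single per-workflow catch-all row).
type RunResult struct {
	Nodes []NodeResult
}

// runNode executes one mode and converts a panic into an ordinary error so a
// crashing mode can never report a fake success (the Rust anti-pattern).
func runNode(name string, mode func() error) (res NodeResult) {
	res.Name = name
	defer func() {
		if r := recover(); r != nil {
			res.Error = fmt.Sprintf("panic: %v", r)
		}
	}()
	if err := mode(); err != nil {
		res.Error = err.Error()
	}
	return res
}

// After the run, observerd records one row per node, roughly:
//   for _, n := range res.Nodes {
//       store.Record(observer.ObservedOp{Node: n.Name, Success: n.Success(), Error: n.Error})
//   }
```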
### Alternatives considered

- **Persist-required mode** — caller-configurable fail-closed for audit-grade workloads. The right approach when this lands is an opt-in on `Store` construction, leaving the default fail-open. Deferred to a future ADR.
- **Distributed ring with WAL** — persist before accept-into-ring, sync semantics. Too heavy for G0/G1 and breaks the ring's "in-flight source of truth" property.
- **Mode-result schema with explicit verdict field** — would force every mode to declare accept/reject. Overengineered for the witness role and reintroduces the gate-vs-witness confusion this ADR is trying to avoid.

### What this ADR does NOT do

- **No retention policy.** "How long do we keep observer entries on disk?" is a separate operations decision.
- **No mode-level retry.** If a mode fails, the runner records that and moves on. Whether to retry is a workflow-definition concern (Archon-style retry policies in the YAML), not the runner's.
- **No cross-process recovery.** A crashed observerd loses the ring; the persistor preserves what it managed to write. Operators read the JSONL after restart, not query a dead daemon.
- **No persist-required opt-in.** Mentioned in alternatives; lands in a separate ADR when an audit-grade consumer requires it.

### How this closes the OPEN list

STATE_OF_PLAY listed ADR-005 as a doc-only gate before the observer wired into production paths. The 2026-04-30 lift run wired observerd into the prod-realistic harness boot, which means the observer is now on the data path for every reality-test workflow. This ADR locks the fail-safe invariants before the next consumer (scrum runner, distillation rebuild, or a real production workflow) takes a hard behavioral dependency.

---

## ADR-006: Auth posture for non-loopback deploy

**Date:** 2026-04-30
**Status:** RATIFIED
**Scope:** `internal/shared/auth.go` + `internal/shared/bind.go` + every `cmd/<service>/main.go`'s `shared.Run` call site

ADR-003 locked the substrate (Bearer token + IP allowlist, opt-in via `cfg.Auth.Token`/`cfg.Auth.AllowedIPs`, `/health` exempt). ADR-006 ratifies the **operator playbook + deploy-time invariants** — what gets enforced when, what operators set where, what happens when keys rotate. Required because Sprint 4 deployment work (REPLICATION.md, systemd units, Dockerfile) needs a locked auth posture before it touches production-shaped configs.

### Decision 6.1 — Non-loopback bind requires `auth.token`; the gate is mechanical

Already implemented in `requireAuthOnNonLoopback` (`internal/shared/bind.go:58-67`). Locked:

- Any binary that binds anything other than `127.0.0.0/8` / `::1` / `localhost` MUST have `cfg.Auth.Token != ""`. Empty token + non-loopback bind = startup error, not silent insecure mode.
- The check fires in `shared.Run` BEFORE `http.Server.Serve`, so a misconfigured binary fails fast at startup rather than serving one request.
- Pairs with `requireLoopbackOrOverride`: that gate refuses any non-loopback bind without `LH_<SERVICE>_ALLOW_NONLOOPBACK=1`. Together they make the audit's R-001+R-007 worst case (queryd `/sql` = RCE-equivalent off-loopback with no auth) mechanically impossible.

**Why mechanical, not policy:** Policy gates rely on operator discipline. The substrate gates work even when an operator copies a dev `lakehouse.toml` into prod and forgets to set the token — the binary refuses to start, and the error message names the env override.
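A minimal sketch of the two mechanical gates named above. The function names follow the ADR's references, but the bodies and the exact env-var format are assumptions, not `internal/shared/bind.go`:

```go
package shared

import (
	"fmt"
	"net"
	"os"
	"strings"
)

// isLoopback reports whether the bind host resolves to a loopback address.
func isLoopback(host string) bool {
	if host == "" || strings.EqualFold(host, "localhost") {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// requireLoopbackOrOverride refuses a non-loopback bind unless the per-service
// env override is set (R-001 partial fix). The env-var naming is illustrative.
func requireLoopbackOrOverride(service, host string) error {
	if isLoopback(host) {
		return nil
	}
	envKey := fmt.Sprintf("LH_%s_ALLOW_NONLOOPBACK", strings.ToUpper(service))
	if os.Getenv(envKey) != "1" {
		return fmt.Errorf("%s: refusing non-loopback bind %q; set %s=1 to override", service, host, envKey)
	}
	return nil
}

// requireAuthOnNonLoopback makes Decision 6.1 mechanical: a non-loopback bind
// with an empty auth token is a startup error, checked before the server serves.
func requireAuthOnNonLoopback(host, token string) error {
	if !isLoopback(host) && token == "" {
		return fmt.Errorf("non-loopback bind %q requires a non-empty auth token (ADR-006)", host)
	}
	return nil
}
```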
### Decision 6.2 — Token comes from `cfg.Auth.Token`, populated by env or secrets file

Locked:

- Operators do NOT put the production token in `lakehouse.toml` directly. The TOML field is empty in the committed file; the daemon's systemd unit sets `AUTH_TOKEN` (or whatever `cfg.Auth.TokenEnv` names) via `EnvironmentFile=` pointing at `/etc/lakehouse/auth.env` (mode 0600, root-owned).
- Same pattern as `chatd`'s provider keys (`OPENROUTER_API_KEY` etc.): the TOML names the env var, systemd loads the env file.
- Justification: keeps secrets out of git, out of the running process's command line, and auditable via filesystem ACLs.

### Decision 6.3 — `AllowedIPs` is the inter-service gate; `Token` is the cross-trust-boundary gate

Locked:

- Same-box deploys (10 daemons all on one host, all on `127.0.0.0/8` or a private subnet) use **`AllowedIPs` only**. Each daemon's `cfg.Auth.AllowedIPs` lists the gateway's address (and any other daemon that legitimately calls it). No token is shared between internal services.
- Gateway-to-external traffic (a coordinator UI in another VPC, a user's browser, an external integrator) goes through the **Bearer token**. The token is per-tenant; rotation is per-tenant.
- Mixed: a service can require BOTH (allowlist AND token) — the middleware logic is `allowed = ip_allowed && token_valid` when both are set. Use this for the gateway when binding non-loopback.

**Why split:** Token rotation is operationally expensive (every caller updates a secret). IP allowlist rotation is free if the network topology is stable. Splitting them by trust boundary lets internal services treat allowlist drift as a network change while external callers handle token rotation as a credential change.

### Decision 6.4 — `/health` is unauthenticated; everything else under `shared.Run` is gated

Already implemented (`internal/shared/server.go:84-92`). Locked:

- Load balancers + monitor probes hit `/health` without a token. The route returns `{"status":"ok","service":"<name>"}` and nothing about service state — no version, no commit, no internal counts.
- Every other route registered via `shared.Run`'s `register` callback lives inside the auth-gated chi Group. New routes inherit auth automatically; new daemons inherit it via `shared.Run`.
- A daemon that needs a public route MUST add it to the outer router before the `register` group, with a code comment explaining the exemption. There are no others today.

### Decision 6.5 — Token rotation is operator-staged; old + new accepted during the window

Not yet implemented; locked as a Sprint 4 follow-up:

- Operators stage a rotation by adding a second token to `cfg.Auth.SecondaryTokens []string`. Both primary and secondary pass auth during the window.
- After every caller is updated to the new token, operators promote secondary → primary and clear the secondary. A second rotation can then begin.
- A rolling restart is not required; daemons reload `cfg.Auth` on SIGHUP (also a Sprint 4 follow-up — currently they re-read on restart only).

**Why dual-token instead of just single-rotation:** The caller pool can be large (gateway + observerd + scrum runner + UI + external integrators). A single-token rotation forces a flag day. Dual-token windows let operators rotate gradually and abort on failure.
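A sketch of what the dual-token window could look like when the Sprint 4 follow-up lands; the `SecondaryTokens` field follows the bullet above and the comparison keeps ADR-003's constant-time rule. Illustrative only, since this decision is not yet implemented:

```go
package shared

import "crypto/subtle"

// tokenValid accepts the primary token or any staged secondary token so a
// rotation window never forces a flag day (Decision 6.5). Assumes AuthConfig
// gains a SecondaryTokens field; this is a sketch, not the shipped middleware.
func tokenValid(presented, primary string, secondaries []string) bool {
	if constantTimeEqual(presented, primary) {
		return true
	}
	for _, s := range secondaries {
		if constantTimeEqual(presented, s) {
			return true
		}
	}
	return false
}

// constantTimeEqual never matches an empty configured token and compares in
// constant time so response timing leaks nothing about the token contents.
func constantTimeEqual(presented, configured string) bool {
	return len(configured) > 0 &&
		subtle.ConstantTimeCompare([]byte(presented), []byte(configured)) == 1
}
```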
### Decision 6.6 — TLS is the network operator's job, not ours

Locked:

- Daemons speak HTTP, not HTTPS. TLS termination happens at the network edge (nginx / Caddy / cloud LB), not in the Go process.
- Internal daemon-to-daemon traffic stays on plaintext HTTP because it's all on `127.0.0.0/8` or a private subnet (per Decision 6.3).
- Justification: TLS in-process means cert management, rotation, and reload — all undifferentiated lift that nginx already solves better. The Bearer token + allowlist gates are sufficient when combined with a TLS-terminating reverse proxy.

### Alternatives considered

- **mTLS for inter-service auth** — every daemon issues + verifies certs. Solves the token-rotation pain but adds cert lifecycle as a problem. Allowlist + plaintext on the private network is cheaper and gets the same threat-model coverage.
- **JWT-only** — JWTs let callers carry richer claims (tenant id, expiry, scopes). Overkill for the current threat model; the Bearer token + allowlist split is honest about what each layer actually defends against. Revisit when multi-tenant gateway features land.
- **No auth, network is the boundary** — works for G0 dev and the current single-box deployment. ADR-006 explicitly does NOT recommend this for non-loopback prod (the mechanical gate refuses it).

### What this ADR does NOT do

- **Does not specify how the gateway authenticates external callers.** Token-vs-mTLS-vs-OAuth at the public edge is a separate decision driven by who calls us. ADR-006 is about the inter-service + same-trust-domain posture.
- **Does not implement token rotation hot-reload.** Decision 6.5 documents the design; the implementation is Sprint 4 work.
- **Does not lock TLS termination details.** Where + how nginx/Caddy goes is ops infrastructure, not ADR territory.

### How this closes the OPEN list

STATE_OF_PLAY listed ADR-006 as the gate before any Go binary binds non-loopback in prod. The substrate gates were already present (R-001 + R-007 enforced via `requireLoopbackOrOverride` + `requireAuthOnNonLoopback`); this ADR locks the operator playbook that turns those gates into a deployable posture. Sprint 4 can now write systemd units that set `AUTH_TOKEN` from `EnvironmentFile=` without re-litigating the design.

---