ADR-003 locked the auth substrate; ADR-006 ratifies the operator
playbook + adds two implementation pieces needed for Sprint 4
deployment: env-resolved tokens and dual-token rotation.
Six decisions locked in docs/DECISIONS.md:
- 6.1: Non-loopback bind requires auth.token (mechanical gate at
shared.Run, already implemented; this ratifies it).
- 6.2: Token from env, not TOML. /etc/lakehouse/auth.env (mode 0600)
loaded by systemd EnvironmentFile=. New TokenEnv field on
AuthConfig defaults to "AUTH_TOKEN".
- 6.3: AllowedIPs for inter-service same-trust-domain; Token for
cross-trust-boundary (gateway ↔ external).
- 6.4: /health stays unauthenticated; everything else under
shared.Run is gated. Already implemented; ratified here.
- 6.5: Token rotation is dual-token. New SecondaryTokens []string
on AuthConfig — both primary and any secondary pass auth
during the rotation window. Implemented in this commit.
- 6.6: TLS terminates at the network edge (nginx/Caddy), not
in-process. Daemons stay HTTP-only; internal traffic stays
on private subnets per Decision 6.3.
Implementation:
- internal/shared/config.go: AuthConfig gains TokenEnv +
SecondaryTokens fields. New resolveAuthFromEnv() called by
LoadConfig fills Token from os.Getenv(TokenEnv) when Token is
empty. TokenEnv defaults to "AUTH_TOKEN" so the happy path needs
no TOML config.
- internal/shared/auth.go: RequireAuth pre-encodes Bearer headers
for primary + every secondary token; per-request constant-time
compare walks the slice. Fast path is 1 compare (primary).
Tests:
- TestLoadConfig_AuthTokenFromEnv (3 sub-tests): default env name,
custom token_env, explicit Token wins over env.
- TestRequireAuth_SecondaryTokenAccepted: both primary + secondary
tokens pass during rotation window.
- TestRequireAuth_SecondaryTokensOnly: only-secondary path works
for the case where primary was just promoted-to-empty mid-rotation.
go test ./internal/shared all green; existing auth_test.go
unchanged (constant-time compare path preserved).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Architecture Decision Records — Lakehouse-Go
ADRs from the Go era. Numbered fresh from 001 to start clean lineage.
Where a Rust ADR (numbered 001–021 in the Rust repo's DECISIONS.md)
remains in force, this file references it explicitly. Where a Rust
ADR is superseded, the new ADR records why.
ADR-001: Foundational decisions for the Go rewrite
Date: 2026-04-28
Decided by: J
Status: Ratified (Phase G0 unblocked)
The six questions that gated Phase G0 (per PRD.md / SPEC.md §8) are all answered.
Decision 1.1 — DuckDB via cgo for the query engine
Decision: queryd uses marcboeker/go-duckdb (cgo bindings to
DuckDB). Pure-Go alternative was rejected.
Rationale: DuckDB reads Parquet natively, supports the SQL surface DataFusion exposed in the Rust era (CTEs, window functions, hybrid joins), and runs in-process with cgo. The alternatives were:
- Hand-rolling a query planner over arrow-go RecordBatches — multi-engineer-month research project; high risk of correctness bugs.
- Running DuckDB as an external process — adds an operational surface and a network hop to every query.
Cgo build complexity is the accepted cost. Single-binary deploy preserved (the cgo dependency embeds at link time).
Does this supersede Rust ADR-001 (object storage as source of truth)? No. That ADR remains in force; the change is the engine over the storage, not the storage model.
Decision 1.2 — HTMX for the UI
Decision: Frontend is html/template + HTMX + Alpine.js,
server-rendered by cmd/gateway. React/Vite in a separate repo is the
fallback if UX requirements demand SPA-tier interactivity post-G5.
Rationale: The existing Lakehouse UIs (/lakehouse/ demo + staffer
console) are mostly server-rendered HTML with vanilla JS that already
fits the HTMX style. Single-binary deploy is preserved (gateway serves
templates + static assets). No build chain beyond go build.
The React fallback is named explicitly so it's not relitigated unless an actual UX requirement triggers it.
Decision 1.3 — Gitea hosts the new repo
Decision: Repo lives at git.agentview.dev/profit/golangLAKEHOUSE
(same Gitea server that hosts the Rust lakehouse).
Rationale: Single source of truth for repo hosting; existing
auditor tooling (lakehouse-auditor systemd service) already speaks
Gitea API; existing credentials work; no new ops surface.
Decision 1.4 — Distillation rebuilt in Go, not ported verbatim
Decision: The distillation v1.0.0 substrate (tag distillation-v1.0.0 at e7636f2 in the Rust repo) is not
bit-identical-ported. The Go reimplementation:
- Ports the LOGIC: SFT export pipeline, contamination firewall (the quality_score enum + SFT_NEVER constant), category mapping rules, audit-baselines append-only pattern.
- Does NOT port the FIXTURES: tests/fixtures/distillation/acceptance/ is rebuilt from scratch in Go with new ground-truth golden files.
- Does NOT port the bit-identical reproducibility PROPERTY: that was measured against the Rust implementation. The Go implementation establishes its own reproducibility baseline.
Rationale: Bit-identical reproducibility was a measured property of a specific implementation, not a portable invariant. Re-establishing it in Go means new fixtures, new gates, new audit-baselines. This is honest about what's transferring (logic) versus what's a Rust-era artifact (the specific bit-identical hashes).
Risk: the contamination firewall is the most consequential distillation safety net. The port must be reviewed line-by-line, and the new Go fixtures must include adversarial cases that prove the firewall works in the new implementation. See SPEC §7 acceptance gates.
Decision 1.5 — Pathway memory starts clean; old traces preserved as reference
Decision: Go pathway memory begins with zero traces. The existing
88 Rust traces at
/home/profit/lakehouse/data/_pathway_memory/state.json are NOT loaded
into the Go implementation. They are preserved as a historical record
in the Rust repo and documented at docs/RUST_PATHWAY_MEMORY_NOTE.md.
Rationale: The Rust pathway memory's value compounded over months of scrum cycles. Loading those traces into a Go implementation that hasn't proven its byte-matching contract risks corrupting the new substrate's signal with semantically-mismatched data. Starting clean keeps the Go pathway memory's lineage clean and lets the byte-match correctness be proven on a known input (per SPEC §3.4 G3.4.B).
The historical note records the 88 traces' value (11/11 successful replays at the time of freeze) so the Go implementation has a reference baseline to outperform.
Decision 1.6 — Auditor longitudinal signal restarts
Decision: The Rust auditor's audit_baselines.jsonl
(longitudinal drift signal accumulated across PRs #6–#13) is not
ported to Go. The Go auditor begins a fresh audit_baselines.jsonl
lineage on its first PR.
Rationale: The drift signal is anchored to specific Rust commits, verdict shapes, and Kimi/Haiku/Opus rotation traces. Carrying it into the Go era would be like grafting Rust-PR audit history onto the first Go PR's prologue — confusing more than informative. Restarting gives the Go auditor a clean baseline to measure drift against.
The existing Rust audit_baselines.jsonl stays in the Rust repo as a
historical record.
ADR-002: storaged per-prefix PUT cap (vectord _vectors/ → 4 GiB)
Date: 2026-04-29
Decided by: J
Status: Implemented (commit 423a381)
storaged enforces a 256 MiB per-PUT body cap as DoS protection
(MaxBytesReader + Content-Length check). Keys under _vectors/
(vectord LHV1 persistence) get a raised cap of 4 GiB; everything
else stays at 256 MiB.
Rationale: the 500K staffing test surfaced that single-file LHV1
above ~150K vectors at d=768 hits the 256 MiB cap. manager.Uploader
already streams on the outbound side, so the cap is a safety gate
not a memory bottleneck — raising it for the vector path doesn't
introduce new memory pressure. Per-prefix preserves the safety
gate for routine traffic while opening the documented production
path. Splitting LHV1 across multiple keys was rejected because G1P
specifically shipped the single-Put framed format to eliminate
torn-write — multi-key would re-introduce that failure mode.
Follow-up: if production workloads exceed 4 GiB single-file
LHV1, refactor to operator-driven config (env/TOML) rather than
bumping the constant. The function-level maxPutBytesFor(key) in
cmd/storaged/main.go keeps that drop-in clean.
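A minimal sketch of the per-prefix cap described above. Only the function name maxPutBytesFor and the two limits come from this ADR; the constant names and the function body are assumptions for illustration, not the code in cmd/storaged/main.go.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical constant names; the values are the ones stated in ADR-002.
const (
	defaultMaxPutBytes int64 = 256 << 20 // 256 MiB for routine traffic
	vectorMaxPutBytes  int64 = 4 << 30   // 4 GiB for vectord LHV1 persistence
	vectorPrefix             = "_vectors/"
)

// maxPutBytesFor returns the per-PUT body cap for a storage key.
// Only keys that start with the vector prefix get the raised cap.
func maxPutBytesFor(key string) int64 {
	if strings.HasPrefix(key, vectorPrefix) {
		return vectorMaxPutBytes
	}
	return defaultMaxPutBytes
}

func main() {
	for _, k := range []string{"_vectors/staffing.lhv1", "datasets/a.parquet"} {
		fmt.Printf("%s -> %d bytes\n", k, maxPutBytesFor(k))
	}
}
```

Keeping the decision in one function means a later move to env/TOML-driven limits would only change where the two numbers come from.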
ADR-003: Inter-service auth posture — Bearer token + IP allowlist
Date: 2026-04-29
Decided by: J + Claude
Status: Decided (wiring deferred to Sprint 1)
Decision: When inter-service auth is needed (the moment any
binary binds non-loopback or the deployment crosses a trust
boundary), the auth model is a Bearer token loaded from
secrets-go.toml plus a configurable IP allowlist. Both layers
required: the token authenticates the caller; the allowlist
narrows the network surface.
Status today (G0): zero auth middleware. Every binary binds
127.0.0.1 by default; commit 6af0520 (R-001 partial fix) refuses
non-loopback bind unless the per-service LH_<SVC>_ALLOW_NONLOOPBACK=1
env override is set. The override-and-no-auth combination is the
worst case — this ADR locks in what we'll require before any
production override fires.
What gets implemented when auth lands
- secrets-go.toml adds an [auth] section:

      [auth]
      token = "..."                                  # 32+ random bytes, hex-encoded
      allowed_ips = ["10.0.0.0/8", "127.0.0.1/32"]   # CIDR list

- internal/shared/auth.go ships a single chi middleware:

      func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler

  - Empty cfg.Token → middleware is a no-op (G0 dev mode).
  - Non-empty token → reject 401 unless the request has Authorization: Bearer <token>, matched in constant time.
  - Non-empty allowed_ips → reject 403 unless r.RemoteAddr (or the X-Forwarded-For first hop, configurable) is in the CIDR set.
  - /health is exempt: load balancers + monitors need it open.

- Every cmd/<svc>/main.go adds one line:

      r.Use(shared.RequireAuth(cfg.Auth))

  Mounted before register(r) so it covers every route the binary exposes after /health.

- shared.Run startup gate: if bind is non-loopback AND cfg.Auth.Token == "", refuse to start. The implicit "localhost is the auth layer" guarantee becomes explicit when crossing the loopback boundary.
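The middleware contract above can be sketched with the standard library alone, since a chi middleware has the same func(http.Handler) http.Handler shape. This is a sketch under assumptions: the AuthConfig field names, the check order (token before allowlist), the fail-closed handling of bad CIDRs, and the statusFor helper are all illustrative, and X-Forwarded-For handling is left out.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
)

// AuthConfig mirrors the [auth] section. Field names are assumptions.
type AuthConfig struct {
	Token      string
	AllowedIPs []string // CIDR strings, e.g. "10.0.0.0/8"
}

// RequireAuth returns middleware with the signature the ADR names.
func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler {
	var nets []*net.IPNet
	for _, c := range cfg.AllowedIPs {
		// Assumption: unparsable CIDRs are dropped, so a list made only
		// of bad entries denies every caller (fail closed).
		if _, n, err := net.ParseCIDR(c); err == nil {
			nets = append(nets, n)
		}
	}
	want := []byte("Bearer " + cfg.Token)
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if r.URL.Path == "/health" { // exempt for load balancers + monitors
				next.ServeHTTP(w, r)
				return
			}
			if cfg.Token != "" {
				got := []byte(r.Header.Get("Authorization"))
				if subtle.ConstantTimeCompare(got, want) != 1 {
					http.Error(w, "unauthorized", http.StatusUnauthorized)
					return
				}
			}
			if len(cfg.AllowedIPs) > 0 && !ipAllowed(nets, r.RemoteAddr) {
				http.Error(w, "forbidden", http.StatusForbidden)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

// ipAllowed reports whether the host part of remoteAddr is in any network.
func ipAllowed(nets []*net.IPNet, remoteAddr string) bool {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return false
	}
	for _, n := range nets {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

// statusFor is a hypothetical helper: it runs one request through the
// middleware and returns the resulting status code.
func statusFor(cfg AuthConfig, path, remoteAddr, authz string) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(http.MethodGet, path, nil)
	req.RemoteAddr = remoteAddr
	if authz != "" {
		req.Header.Set("Authorization", authz)
	}
	rec := httptest.NewRecorder()
	RequireAuth(cfg)(ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	cfg := AuthConfig{Token: "s3cret", AllowedIPs: []string{"10.0.0.0/8"}}
	fmt.Println(statusFor(cfg, "/sql", "10.1.2.3:5000", "Bearer s3cret"))
	fmt.Println(statusFor(cfg, "/sql", "10.1.2.3:5000", "Bearer wrong"))
	fmt.Println(statusFor(cfg, "/sql", "192.168.1.9:5000", "Bearer s3cret"))
}
```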
Alternatives considered
| Option | Why rejected |
|---|---|
| mTLS | Strongest but heaviest — every binary needs cert provisioning, rotation tooling, and cert-aware client wiring. Overkill for inter-service traffic that already passes through a single gateway. Reconsider when Lakehouse-Go runs across machines. |
| JWT with short TTL | Buys nothing over Bearer here — there's no third-party identity provider, no claim hierarchy worth modelling. Pure token has the same security properties at half the wire complexity. |
| No auth, IP-allowlist only | One stolen IP allowlist entry → full access. Token + IP is defense in depth; either alone is too weak. |
| OAuth2 via external IdP | Rejected for G0–G3 timeline. No external IdP commitment. Revisit if Lakehouse-Go ever serves end-user requests directly (today everything fronts through the staffing co-pilot which has its own session model). |
Constant-time comparison + token hygiene
Token comparison must use crypto/subtle.ConstantTimeCompare —
naive == is vulnerable to timing attacks against an attacker who
can issue many requests and measure round-trip. Token rotation is
operator-driven via secrets-go.toml edit + restart; G0 doesn't
need rotate-without-restart.
What this ADR does NOT do
- Does not implement the middleware. Code lands in Sprint 1.
- Does not require token in G0 dev. Empty token → no-op. Smokes + proof harness keep working without setting tokens.
- Does not address gateway → end-user auth. Gateway terminates inter-service auth at its inbound; if end-users hit gateway from a browser, that's a different ADR (likely cookie/session, fronted by a reverse proxy that handles user auth).
How this closes audit findings
- R-001 (queryd /sql RCE-equivalent off-loopback): the bind gate prevents accidental exposure today; this ADR specifies the guardrail when intentional exposure is needed.
- R-007 (zero auth middleware): answered by the design above; R-007 stays open until the middleware is implemented but is no longer "design TBD."
- R-010 (no CORS posture): orthogonal to inter-service auth, but the RequireAuth middleware sits at the right layer to add CORS handling later (browsers don't reach inter-service routes in the current design, so CORS is also Sprint 1+ when end-user requests start landing).
ADR-004: Pathway memory data model — Mem0-style versioned traces
Date: 2026-04-29
Decided by: J + Claude
Status: Decided — substrate landing in internal/pathway/
Decision: Pathway memory is an append-only event log of opaque
traces with Mem0-style semantics: Add / Update / Revise / Retire /
History / Search. Each trace has a UID; revisions chain backward
via predecessor_uid so the full history is reconstructible.
Persistence is JSONL append-only with full-replay on load;
corruption recovery skips bad lines without halting startup.
Operations
| Op | Effect |
|---|---|
| Add(content, tags...) | New UID, stored fresh, replay_count=1. |
| AddIdempotent(uid, content, tags...) | If UID exists → replay_count++. Else → Add with that UID. |
| Update(uid, content) | In-place content replacement (same UID). Bumps updated_at_ns. NOT a revision: same trace, new content. |
| Revise(predecessorUID, content, tags...) | New UID with predecessor_uid set. Old trace stays accessible via History. Failure modes: predecessor missing → error; predecessor retired → still allowed (revisions of retired traces are valid). |
| Retire(uid) | Sets retired=true. Excluded from Search by default; still accessible via Get and History. |
| Get(uid) | Returns the trace (including if retired); error on missing. |
| History(uid) | Walks predecessor_uid chain backward, returns slice [self, parent, grandparent, ...]. Cycle-detected via visited-set; returns error on cycle (which only happens if persistence file was hand-edited). |
| Search(filter) | Returns matching traces. Default excludes retired; opt in via IncludeRetired: true. Filters: tag-match, content-substring, time range. |
Why Mem0-style + Why these specific ops
- Mem0 (memory pattern from the OpenAI Memories paper / Mem0 lib) is the canonical "agent memory" interface for the same reason Markdown is the canonical text format: it's the lowest-common-denominator that the entire ecosystem assumes. Adopting it lets agent loops written against any Mem0-aware substrate work here.
- Update vs Revise are deliberately separate. Update is "I noticed a typo in my note." Revise is "I now believe something different than I did when I wrote this; preserve the old belief for audit." Conflating them loses the audit trail.
- Retire vs Delete is deliberate. Retire stops a trace from surfacing in search but preserves it for history reconstruction. Delete (which we don't expose) would break references.
Trace data shape
    type Trace struct {
        UID            string          // UUID v4 unless caller provides one
        Content        json.RawMessage // opaque, schema is caller's contract
        PredecessorUID string          // empty if root revision
        CreatedAtNs    int64
        UpdatedAtNs    int64
        Retired        bool
        ReplayCount    int             // ≥1 for any stored trace
        Tags           []string        // for Search
    }
Content is opaque JSON (not a struct) so callers can store any
shape — the data model doesn't constrain semantics. Callers add
their own validators on top.
Persistence
JSONL append-only log under _pathway/<store_name>.jsonl. Each
mutation appends one JSON line:
{"op":"add", "trace":{...}}
{"op":"update", "uid":"…", "content":"…"}
{"op":"revise", "trace":{…}} # trace.PredecessorUID is set
{"op":"retire", "uid":"…"}
{"op":"replay", "uid":"…"} # idempotent re-add hit
On startup, replay every line in order, building in-memory state. A malformed line logs a warn and is skipped; load continues. Corruption tolerance is non-optional — partial state is better than no state for an agent substrate.
Compaction is a future concern. A 100K-trace log replays in seconds; below that scale, JSONL append is the simplest correct choice. When compaction lands, the format will be: snapshot file (full state JSON) + tail JSONL since snapshot. Detect snapshot, load it, then replay tail.
Cycle safety
UIDs are generated server-side via uuid.New() (existing dep —
catalogd uses it). New UID for every Add and Revise. The data
model itself can't form cycles — every Revise points at an
EXISTING uid, and the new uid didn't exist a moment ago.
History walks defensively anyway: visited-set tracks UIDs seen this walk; if we encounter a duplicate, return error. Protects against corruption (manual edit, bug in a future op) without constraining the happy path.
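A minimal sketch of that defensive walk, assuming traces live in a map keyed by UID. The history function, the two error values, and the reduced Trace type are hypothetical names for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Trace is reduced to the two fields the walk needs.
type Trace struct {
	UID            string
	PredecessorUID string
}

var (
	ErrCycle    = errors.New("pathway: cycle in predecessor chain")
	ErrNotFound = errors.New("pathway: trace not found")
)

// history returns [self, parent, grandparent, ...]. The visited set turns
// a corrupted, cyclic chain into an error instead of an endless loop.
func history(traces map[string]Trace, uid string) ([]Trace, error) {
	visited := make(map[string]bool)
	var chain []Trace
	cur := uid
	for {
		if visited[cur] {
			return nil, ErrCycle
		}
		visited[cur] = true
		t, ok := traces[cur]
		if !ok {
			return nil, fmt.Errorf("%w: %q", ErrNotFound, cur)
		}
		chain = append(chain, t)
		if t.PredecessorUID == "" {
			return chain, nil
		}
		cur = t.PredecessorUID
	}
}

func main() {
	traces := map[string]Trace{
		"a": {UID: "a"},
		"b": {UID: "b", PredecessorUID: "a"},
		"c": {UID: "c", PredecessorUID: "b"},
	}
	chain, err := history(traces, "c")
	fmt.Println(len(chain), err)
}
```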
Storage location
JSONL file path is configurable per store. Default:
/var/lib/lakehouse/pathway/<name>.jsonl for prod; tests use
t.TempDir(). Persistence is OPTIONAL — empty path means
in-memory only (matches vectord G1's pattern).
What this ADR does NOT do
- No HTTP surface decision. Whether cmd/pathwayd is its own binary or routes get added to cmd/vectord is the next ADR's concern. The substrate is a pure library either way.
- No vector index integration. Pathway traces can carry a vector embedding in Content (caller decides), but this ADR doesn't define how the substrate integrates with vectord's HNSW indexes. That's the staffing co-pilot's design problem when those layers compose.
- No agent-loop semantics. "When does an agent ADD vs REVISE?" is a workflow decision, not a substrate decision.
ADR-005: Observer fail-safe semantics
Date: 2026-04-30
Status: RATIFIED
Scope: internal/observer (Store, Persistor) + internal/workflow (Runner) + cmd/observerd
The Rust legacy had a documented "verdict:accept on crash" anti-pattern:
when the observer crashed mid-evaluation, the upstream interpreted the
missing verdict as implicit acceptance. Several silent regressions traced
to it. The Go observer's role is structurally different — it is a
witness (records what happened) rather than a gate (decides
accept/reject) — but adjacent fail-safe decisions still need locking
now that observerd is on the prod-realistic stack via the lift harness
(commit b2e45f7, 2026-04-30). This ADR ratifies the current behavior
and locks the rationale so future consumers don't break the invariant
by flipping the defaults.
Decision 5.1 — Persist failure is logged-not-fatal; ring is the in-flight source of truth
Already implemented (internal/observer/store.go:60-67). Locked:
- If persistor.Append fails, log a warning and continue. Do NOT return an error to the caller of Store.Record.
- The in-memory ring buffer is the source of truth in flight; the JSONL is a best-effort durability shadow.
- Operators who need fail-closed audit-grade trails configure that mode through a future opt-in (deferred to a later ADR; not the G0/G1/G2 default).
Why fail-open here: the observer's job is to keep recording even
when the disk hiccups. A persist-fail-fatal mode would translate
every transient I/O blip into an observer-blackout, which is strictly
worse for the witness role than missing a few persisted entries — the
ring still has them, and operators can drain it on restart.
Why this isn't the Rust anti-pattern: the Go observer doesn't emit verdicts. A persist failure here means "we recorded fewer rows on disk than in memory," not "we accepted something we shouldn't have."
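A sketch of the logged-not-fatal contract, assuming a Persistor interface with a single Append method. The type shapes below are illustrative rather than the code in internal/observer/store.go, and ring capacity is ignored here to keep the focus on the error path.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// ObservedOp is a reduced stand-in for the real record type.
type ObservedOp struct {
	Node    string
	Success bool
	Error   string
}

// Persistor is an assumed interface for the durability shadow.
type Persistor interface {
	Append(op ObservedOp) error
}

type Store struct {
	mu        sync.Mutex
	ring      []ObservedOp
	persistor Persistor // nil means in-memory only
}

// Record has no error return on purpose: the in-memory ring accepts the
// entry first, and a persist failure is only logged.
func (s *Store) Record(op ObservedOp) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ring = append(s.ring, op)
	if s.persistor == nil {
		return
	}
	if err := s.persistor.Append(op); err != nil {
		log.Printf("observer: persist failed, entry kept in ring only: %v", err)
	}
}

// failingPersistor simulates a disk that rejects every write.
type failingPersistor struct{ calls int }

func (p *failingPersistor) Append(ObservedOp) error {
	p.calls++
	return errors.New("disk full")
}

func main() {
	p := &failingPersistor{}
	s := &Store{persistor: p}
	s.Record(ObservedOp{Node: "n1", Success: true})
	fmt.Println("in ring:", len(s.ring), "persist attempts:", p.calls)
}
```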
Decision 5.2 — Mode failure in workflow.Runner: Success = (Error == ""), no panic-swallow path
Already implemented (internal/workflow/runner.go). Locked:
- Mode errors are caught by the runner and surfaced via the node's Error field; Success is the boolean derived from Error == "".
- observerd records an ObservedOp per node with Success: false and the error string when a mode fails.
- Cycles, missing-deps, and unknown modes are aborting errors → 4xx from /observer/workflow/run with the failure encoded in the JSON response.
Why this is the explicit anti-Rust: allowing a mode to silently
swallow its panic and report Success: true is exactly how the Rust
"verdict:accept on crash" pattern manifests. Forcing the runner to
record Success: false on error makes the failure observable to
downstream consumers (observerd queries, scrum review, distillation
selection) instead of laundering it into a fake success.
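One way to make Success a pure function of Error is sketched below. Whether the real runner recovers panics is not stated in this ADR; the sketch assumes it converts a panic into an error string so that a crash can never surface as Success: true. The names runNode, Mode, and NodeResult are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode is a hypothetical unit of workflow work.
type Mode func() error

type NodeResult struct {
	Name    string
	Error   string
	Success bool
}

// runNode executes one mode. Success is derived from Error in exactly one
// place, after both the error path and the panic path have been handled.
func runNode(name string, mode Mode) (res NodeResult) {
	res.Name = name
	defer func() {
		if p := recover(); p != nil {
			res.Error = fmt.Sprintf("panic: %v", p)
		}
		res.Success = res.Error == ""
	}()
	if err := mode(); err != nil {
		res.Error = err.Error()
		if res.Error == "" {
			// Guard: an error with an empty message must not read as success.
			res.Error = "mode failed with empty error message"
		}
	}
	return res
}

func succeeding() error { return nil }
func failing() error    { return errors.New("boom") }
func panicking() error  { panic("kaboom") }

func main() {
	for name, m := range map[string]Mode{"ok": succeeding, "err": failing, "panic": panicking} {
		fmt.Printf("%+v\n", runNode(name, m))
	}
}
```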
Decision 5.3 — Provenance is one-row-per-node, recorded post-run
Already implemented (cmd/observerd/main.go:140-154). Locked:
- runner.Run returns the full RunResult with per-node Success/Error; handleWorkflowRun then iterates res.Nodes and calls store.Record with an ObservedOp per node.
- One row per node, NOT a single per-workflow catch-all. A workflow with N nodes produces N audit rows.
- Crash semantics:
  - Crash during runner.Run → no provenance recorded; queries see absence, not a false acceptance.
  - Crash during the recording loop → some nodes recorded, some absent; queries see partial provenance, again not a false acceptance.
- Recovery: re-run the whole workflow. No incremental resume in G0/G1/G2.
Why one row per node: debugging a partial workflow is a one-grep operation when each node has its own row. A single catch-all row would be exactly the Rust anti-pattern surface — "we accepted this workflow" records that survive partial crashes look identical to genuine acceptances. Per-node-row makes that structurally impossible.
Known gap, not yet a follow-up ADR: recording happens after
runner.Run returns, not as each node completes. A long workflow with
late-stage failure currently records nodes that already finished only
once the runner returns. For G0/G1/G2 substrate this is fine —
workflows are short. When workflows get long enough that mid-run
visibility matters, a streaming-record callback is the right shape.
Decision 5.4 — /observer/event accepts even when the ring is full
Already implemented via Store.Record's shift-left eviction. Locked:
- Ring overflow is normal operation: oldest evicted, newest accepted.
- 200 OK from /observer/event means "we accepted into the ring"; it does NOT promise "we persisted." Persistence remains best-effort per Decision 5.1.
- 4xx is reserved for malformed ObservedOp payloads (validation failures).
Why accept-on-full: treating a full ring as a 503 would translate every brief activity burst into client errors, which is exactly the wrong direction for an audit witness — the witness's job is to never refuse to write, only to lose oldest data when capacity binds.
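Shift-left eviction can be sketched in a few lines. The Ring type here stores strings as a stand-in for ObservedOp, and its names are hypothetical; the point is that Record never refuses a write and only drops the oldest entry once capacity binds.

```go
package main

import "fmt"

// Ring is a fixed-capacity buffer that evicts the oldest entry when full.
type Ring struct {
	buf      []string
	capacity int
}

// NewRing clamps capacity to at least 1 so Record can always accept.
func NewRing(capacity int) *Ring {
	if capacity < 1 {
		capacity = 1
	}
	return &Ring{buf: make([]string, 0, capacity), capacity: capacity}
}

// Record always accepts. On a full ring it shifts every entry one slot
// to the left (dropping the oldest) and writes the new entry last.
func (r *Ring) Record(v string) {
	if len(r.buf) == r.capacity {
		copy(r.buf, r.buf[1:])
		r.buf[len(r.buf)-1] = v
		return
	}
	r.buf = append(r.buf, v)
}

// Snapshot returns a copy of the entries, oldest first.
func (r *Ring) Snapshot() []string {
	return append([]string(nil), r.buf...)
}

func main() {
	r := NewRing(3)
	for _, v := range []string{"a", "b", "c", "d", "e"} {
		r.Record(v)
	}
	fmt.Println(r.Snapshot())
}
```

Shifting is O(n) per write on a full ring; a head-index ring would avoid the copy, at the cost of a slightly less obvious read path.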
Alternatives considered
- Persist-required mode: caller-configurable fail-closed for audit-grade workloads. The right approach when this lands is an opt-in on Store construction, leaving the default fail-open. Deferred to a future ADR.
- Distributed ring with WAL: persist before accept-into-ring, sync semantics. Too heavy for G0/G1 and breaks the ring's "in-flight source of truth" property.
- Mode-result schema with explicit verdict field — would force every mode to declare accept/reject. Overengineered for the witness role and reintroduces the gate-vs-witness confusion this ADR is trying to avoid.
What this ADR does NOT do
- No retention policy. "How long do we keep observer entries on disk?" is a separate operations decision.
- No mode-level retry. If a mode fails, the runner records that and moves on. Whether to retry is a workflow-definition concern (Archon-style retry policies in the YAML), not the runner's.
- No cross-process recovery. A crashed observerd loses the ring; the persistor preserves what it managed to write. Operators read the JSONL after restart, not query a dead daemon.
- No persist-required opt-in. Mentioned in alternatives; lands in a separate ADR when an audit-grade consumer requires it.
How this closes the OPEN list
STATE_OF_PLAY listed ADR-005 as a doc-only gate before observer wired into production paths. The 2026-04-30 lift run wired observerd into the prod-realistic harness boot, which means observer is now on the data path for every reality test workflow. This ADR locks the fail-safe invariants before the next consumer (scrum runner, distillation rebuild, or a real production workflow) takes a hard behavioral dependency.
ADR-006: Auth posture for non-loopback deploy
Date: 2026-04-30
Status: RATIFIED
Scope: internal/shared/auth.go + internal/shared/bind.go + every cmd/<bin>/main.go's shared.Run call site
ADR-003 locked the substrate (Bearer token + IP allowlist, opt-in via
cfg.Auth.Token/cfg.Auth.AllowedIPs, /health exempt). ADR-006
ratifies the operator playbook + deploy-time invariants — what
gets enforced when, what operators set where, what happens when keys
rotate. Required because Sprint 4 deployment work (REPLICATION.md,
systemd units, Dockerfile) needs a locked auth posture before it
touches production-shaped configs.
Decision 6.1 — Non-loopback bind requires auth.token; the gate is mechanical
Already implemented in requireAuthOnNonLoopback (internal/shared/bind.go:58-67).
Locked:
- Any binary that binds anything other than 127.0.0.0/8 / ::1 / localhost MUST have cfg.Auth.Token != "". Empty-token + non-loopback-bind = startup error, not silent insecure mode.
- The check fires in shared.Run BEFORE http.Server.Serve, so a misconfigured binary fails fast at startup rather than serving one request.
- Pairs with requireLoopbackOrOverride: that gate refuses any non-loopback bind without LH_<NAME>_ALLOW_NONLOOPBACK=1. Together they make the audit's R-001+R-007 worst case (queryd /sql = RCE-equivalent off-loopback with no auth) mechanically impossible.
Why mechanical, not policy: policy gates rely on operator
discipline. The substrate gates work even when an operator copies a
dev lakehouse.toml into prod and forgets to set the token —
binary refuses to start, error message names the env override.
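A sketch of the startup gate, under assumptions: the function name requireAuthOnNonLoopback comes from this ADR, but its signature, the isLoopback helper, and the error text are illustrative and not the code in internal/shared/bind.go.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

var errNoToken = errors.New("non-loopback bind requires auth.token")

// isLoopback reports whether a bind address stays on 127.0.0.0/8, ::1,
// or localhost. An empty host (":3100") binds every interface, so it
// counts as non-loopback.
func isLoopback(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		host = addr
	}
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// requireAuthOnNonLoopback fails startup when the bind leaves loopback
// and no token is configured.
func requireAuthOnNonLoopback(bind, token string) error {
	if isLoopback(bind) || token != "" {
		return nil
	}
	return fmt.Errorf("bind %q: %w", bind, errNoToken)
}

func main() {
	for _, bind := range []string{"127.0.0.1:3100", "[::1]:3100", "0.0.0.0:3100", ":3100"} {
		fmt.Printf("%-16s -> %v\n", bind, requireAuthOnNonLoopback(bind, ""))
	}
}
```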
Decision 6.2 — Token comes from cfg.Auth.Token populated by env or secrets file
Locked:
- Operators do NOT put the production token in lakehouse.toml directly. The TOML field is empty in the committed file; the daemon's systemd unit sets AUTH_TOKEN (or whatever cfg.Auth.TokenEnv names) via EnvironmentFile= pointing at /etc/lakehouse/auth.env (mode 0600, root-owned).
- Same pattern as chatd's provider keys (OPENROUTER_API_KEY etc.): TOML names the env var, systemd loads the env file.
- Justification: keeps secrets out of git + out of the running process's command line + audit-able via filesystem ACLs.
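The env resolution can be sketched as below. The name resolveAuthFromEnv, the TokenEnv field, and the "AUTH_TOKEN" default come from the commit that ratified this ADR; the signature is an assumption, and the getenv parameter is injected here only so the sketch can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
)

// AuthConfig is reduced to the two fields involved in env resolution.
type AuthConfig struct {
	Token    string
	TokenEnv string
}

// resolveAuthFromEnv fills Token from the environment when the TOML left
// it empty. An explicit Token always wins over the environment.
func resolveAuthFromEnv(cfg *AuthConfig, getenv func(string) string) {
	if cfg.TokenEnv == "" {
		cfg.TokenEnv = "AUTH_TOKEN"
	}
	if cfg.Token == "" {
		cfg.Token = getenv(cfg.TokenEnv)
	}
}

func main() {
	cfg := AuthConfig{}
	resolveAuthFromEnv(&cfg, os.Getenv)
	// Print only whether a token was found, never the token itself.
	fmt.Printf("env var %s set: %v\n", cfg.TokenEnv, cfg.Token != "")
}
```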
Decision 6.3 — AllowedIPs is the inter-service gate; Token is the cross-trust-boundary gate
Locked:
- Same-box deploys (10 daemons all on one host, all on 127.0.0.0/8 or a private subnet) use AllowedIPs only. Each daemon's cfg.Auth.AllowedIPs lists the gateway's address (and any other daemon that legitimately calls it). No token shared between internal services.
- Gateway-to-external traffic (a coordinator UI in another VPC, a user's browser, an external integrator) goes through Bearer token. The token is per-tenant; rotation is per-tenant.
- Mixed: a service can require BOTH (allowlist AND token). The middleware logic is allowed = ip_allowed && token_valid when both are set. Use this for the gateway when binding non-loopback.
Why split: token rotation is operationally expensive (every caller updates a secret). IP allowlist rotation is free if the network topology is stable. Splitting them by trust boundary lets internal services treat allowlist drift as a network change while external callers handle token rotation as a credential change.
Decision 6.4 — /health is unauthenticated; everything else under shared.Run is gated
Already implemented (internal/shared/server.go:84-92). Locked:
- Load balancers + monitor probes hit /health without a token. The route returns {"status":"ok","service":"<name>"} and nothing about service state: no version, no commit, no internal counts.
- Every other route registered via shared.Run's register callback lives inside the auth-gated chi.Group. New routes inherit auth automatically; new daemons inherit it via shared.Run.
- A daemon that needs a public route MUST add it to the outer router before the register group, with a code comment explaining the exemption. There are no others today.
Decision 6.5 — Token rotation is operator-staged; old + new accepted during the window
Not yet implemented; locked as a Sprint 4 follow-up:
- Operators stage a rotation by adding a second token to cfg.Auth.SecondaryTokens []string. Both primary and secondary pass auth during the window.
- After every caller is updated to the new token, operators promote secondary → primary and clear secondary. A second rotation can begin.
- Rolling restart not required; daemons reload cfg.Auth on SIGHUP (also a Sprint 4 follow-up; currently they re-read on restart only).
Why dual-token instead of just single-rotation: caller pool can be large (gateway + observerd + scrum runner + UI + external integrators). A single-token rotation forces a flag-day. Dual-token windows let operators rotate gradually and abort on failure.
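The dual-token check can be sketched as a slice of pre-encoded headers walked with a constant-time compare, primary first. The function names are hypothetical; skipping empty tokens is an assumption that keeps a bare "Bearer " header from matching when the primary is empty mid-rotation.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// encodeBearer pre-encodes the accepted Authorization header values,
// primary first. Empty tokens are skipped.
func encodeBearer(primary string, secondary []string) [][]byte {
	var out [][]byte
	for _, t := range append([]string{primary}, secondary...) {
		if t != "" {
			out = append(out, []byte("Bearer "+t))
		}
	}
	return out
}

// headerAccepted walks the slice with a constant-time compare per entry.
// A caller holding the primary token matches on the first compare.
func headerAccepted(accepted [][]byte, header string) bool {
	got := []byte(header)
	for _, want := range accepted {
		if subtle.ConstantTimeCompare(got, want) == 1 {
			return true
		}
	}
	return false
}

func main() {
	accepted := encodeBearer("primary-token", []string{"secondary-token"})
	for _, h := range []string{"Bearer primary-token", "Bearer secondary-token", "Bearer stale"} {
		fmt.Printf("%q accepted: %v\n", h, headerAccepted(accepted, h))
	}
}
```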
Decision 6.6 — TLS is the network operator's job, not ours
Locked:
- Daemons speak HTTP, not HTTPS. TLS termination happens at the network edge (nginx / Caddy / cloud LB), not in the Go process.
- Internal daemon-to-daemon traffic stays on plaintext HTTP because it's all on 127.0.0.0/8 or a private subnet (per Decision 6.3).
- Justification: TLS in-process means cert management, rotation, reload: all undifferentiated lift that nginx already solves better. The Bearer token + allowlist gates are sufficient when combined with a TLS-terminating reverse proxy.
Alternatives considered
- mTLS for inter-service auth — every daemon issues + verifies certs. Solves token-rotation pain but adds cert lifecycle as a problem. Allowlist + plaintext on the private network is cheaper and gets the same threat-model coverage.
- JWT-only — JWTs let callers carry richer claims (tenant id, expiry, scopes). Overkill for the current threat model; the Bearer token + allowlist split is honest about what each layer actually defends against. Revisit when multi-tenant gateway features land.
- No auth, network is the boundary — works for G0 dev and the current single-box deployment. ADR-006 explicitly does NOT recommend this for non-loopback prod (the mechanical gate refuses it).
What this ADR does NOT do
- Does not specify how the gateway authenticates external callers. Token-vs-mTLS-vs-OAuth at the public edge is a separate decision driven by who-calls-us. ADR-006 is about the inter-service + same-trust-domain posture.
- Does not implement token rotation hot-reload. Decision 6.5 documents the design; the implementation is Sprint 4 work.
- Does not lock TLS termination details. Where + how nginx/Caddy goes is ops infrastructure, not ADR territory.
How this closes the OPEN list
STATE_OF_PLAY listed ADR-006 as the gate before any Go binary binds non-loopback in prod. The substrate gates were already present (R-001 + R-007 enforced via requireLoopbackOrOverride + requireAuthOnNonLoopback); this ADR locks the operator playbook that turns those gates into a deployable posture. Sprint 4 can now write systemd units that set AUTH_TOKEN from EnvironmentFile= without re-litigating the design.