golangLAKEHOUSE/docs/DECISIONS.md
root 814197cfd3 ADR-006: auth posture for non-loopback deploy + token rotation impl
ADR-003 locked the auth substrate; ADR-006 ratifies the operator
playbook + adds two implementation pieces needed for Sprint 4
deployment: env-resolved tokens and dual-token rotation.

Six decisions locked in docs/DECISIONS.md:
- 6.1: Non-loopback bind requires auth.token (mechanical gate at
       shared.Run, already implemented; this ratifies it).
- 6.2: Token from env, not TOML. /etc/lakehouse/auth.env (mode 0600)
       loaded by systemd EnvironmentFile=. New TokenEnv field on
       AuthConfig defaults to "AUTH_TOKEN".
- 6.3: AllowedIPs for inter-service same-trust-domain; Token for
       cross-trust-boundary (gateway ↔ external).
- 6.4: /health stays unauthenticated; everything else under
       shared.Run is gated. Already implemented; ratified here.
- 6.5: Token rotation is dual-token. New SecondaryTokens []string
       on AuthConfig — both primary and any secondary pass auth
       during the rotation window. Implemented in this commit.
- 6.6: TLS terminates at the network edge (nginx/Caddy), not
       in-process. Daemons stay HTTP-only; internal traffic stays
       on private subnets per Decision 6.3.

Implementation:
- internal/shared/config.go: AuthConfig gains TokenEnv +
  SecondaryTokens fields. New resolveAuthFromEnv() called by
  LoadConfig fills Token from os.Getenv(TokenEnv) when Token is
  empty. TokenEnv defaults to "AUTH_TOKEN" so the happy path needs
  no TOML config.
- internal/shared/auth.go: RequireAuth pre-encodes Bearer headers
  for primary + every secondary token; per-request constant-time
  compare walks the slice. Fast path is 1 compare (primary).

Tests:
- TestLoadConfig_AuthTokenFromEnv (3 sub-tests): default env name,
  custom token_env, explicit Token wins over env.
- TestRequireAuth_SecondaryTokenAccepted: both primary + secondary
  tokens pass during rotation window.
- TestRequireAuth_SecondaryTokensOnly: only-secondary path works
  for the case where primary was just promoted-to-empty mid-rotation.

go test ./internal/shared all green; existing auth_test.go
unchanged (constant-time compare path preserved).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:51:14 -05:00

# Architecture Decision Records — Lakehouse-Go
ADRs from the Go era. Numbered fresh from 001 to start a clean lineage.
Where a Rust ADR (numbered 001–021 in the Rust repo's `DECISIONS.md`)
remains in force, this file references it explicitly. Where a Rust
ADR is superseded, the new ADR records why.
---
## ADR-001: Foundational decisions for the Go rewrite
**Date:** 2026-04-28
**Decided by:** J
**Status:** Ratified — Phase G0 unblocked
The six questions that gated Phase G0 (per PRD.md / SPEC.md §8) are
all answered.
### Decision 1.1 — DuckDB via cgo for the query engine
**Decision:** `queryd` uses `marcboeker/go-duckdb` (cgo bindings to
DuckDB). Pure-Go alternative was rejected.
**Rationale:** DuckDB reads Parquet natively, supports the SQL surface
DataFusion exposed in the Rust era (CTEs, window functions, hybrid
joins), and runs in-process with cgo. The alternatives were:
- Hand-rolling a query planner over arrow-go RecordBatches —
multi-engineer-month research project; high risk of correctness
bugs.
- Running DuckDB as an external process — adds an operational surface
and a network hop to every query.
Cgo build complexity is the accepted cost. Single-binary deploy
preserved (the cgo dependency embeds at link time).
**Supersedes Rust ADR-001?** (object storage as source of truth) — no.
That ADR remains in force; the change is the *engine* over the
storage, not the storage model.
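A minimal sketch of the in-process pattern this decision assumes: `marcboeker/go-duckdb` registers a `database/sql` driver, so queryd can read Parquet with plain SQL. The path and query below are illustrative, not queryd's actual surface.
```go
package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/marcboeker/go-duckdb" // cgo bindings; registers the "duckdb" driver
)

func main() {
    // Empty DSN opens an in-memory DuckDB instance inside this process;
    // single-binary deploy is preserved.
    db, err := sql.Open("duckdb", "")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // DuckDB reads Parquet natively; no import step required.
    var n int64
    if err := db.QueryRow(
        `SELECT count(*) FROM read_parquet('data/events.parquet')`,
    ).Scan(&n); err != nil {
        log.Fatal(err)
    }
    fmt.Println("rows:", n)
}
```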
### Decision 1.2 — HTMX for the UI
**Decision:** Frontend is `html/template` + HTMX + Alpine.js,
server-rendered by `cmd/gateway`. React/Vite in a separate repo is the
fallback if UX requirements demand SPA-tier interactivity post-G5.
**Rationale:** The existing Lakehouse UIs (`/lakehouse/` demo + staffer
console) are mostly server-rendered HTML with vanilla JS that already
fits the HTMX style. Single-binary deploy is preserved (gateway serves
templates + static assets). No build chain beyond `go build`.
The React fallback is named explicitly so it's not relitigated unless
an actual UX requirement triggers it.
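A sketch of the shape this implies for `cmd/gateway`: server-rendered `html/template` with HTMX swapping a fragment, nothing to build beyond `go build`. Routes and markup are illustrative.
```go
package main

import (
    "html/template"
    "log"
    "net/http"
)

// Illustrative page: HTMX polls /fragment/status and swaps the div in place.
var page = template.Must(template.New("page").Parse(`<!doctype html>
<html><body>
  <div id="status" hx-get="/fragment/status" hx-trigger="every 5s">loading…</div>
  <script src="https://unpkg.com/htmx.org@1.9.12"></script>
</body></html>`))

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        page.Execute(w, nil)
    })
    // The server stays the renderer: fragments are plain HTML, no JSON API.
    http.HandleFunc("/fragment/status", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte(`<div id="status">ok</div>`))
    })
    log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```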
### Decision 1.3 — Gitea hosts the new repo
**Decision:** Repo lives at `git.agentview.dev/profit/golangLAKEHOUSE`
(same Gitea server that hosts the Rust lakehouse).
**Rationale:** Single source of truth for repo hosting; existing
auditor tooling (`lakehouse-auditor` systemd service) already speaks
Gitea API; existing credentials work; no new ops surface.
### Decision 1.4 — Distillation rebuilt in Go, not ported verbatim
**Decision:** The distillation v1.0.0 substrate (`tag
distillation-v1.0.0` at `e7636f2` in the Rust repo) is **not**
bit-identical-ported. The Go reimplementation:
- Ports the LOGIC: SFT export pipeline, contamination firewall (the
`quality_score` enum + `SFT_NEVER` constant), category mapping
rules, audit-baselines append-only pattern.
- Does NOT port the FIXTURES: `tests/fixtures/distillation/acceptance/`
is rebuilt from scratch in Go with new ground-truth golden files.
- Does NOT port the bit-identical reproducibility PROPERTY: that was
measured against the Rust implementation. The Go implementation
establishes its own reproducibility baseline.
**Rationale:** Bit-identical reproducibility was a measured property
of a specific implementation, not a portable invariant. Re-establishing
it in Go means new fixtures, new gates, new audit-baselines. This is
honest about what's transferring (logic) versus what's a Rust-era
artifact (the specific bit-identical hashes).
**Risk:** the contamination firewall is the most consequential
distillation safety net. The port must be reviewed line-by-line, and
the new Go fixtures must include adversarial cases that prove the
firewall works in the new implementation. See SPEC §7 acceptance gates.
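A hedged sketch of the firewall shape being ported. `quality_score` and `SFT_NEVER` come from the source substrate; every other name below is illustrative, not the Go package's actual API.
```go
package distillation

// QualityScore mirrors the Rust-era quality_score enum. Only SFTNever
// (SFT_NEVER) is taken from the source; the other values are placeholders.
type QualityScore int

const (
    SFTNever QualityScore = iota // contaminated; must never reach SFT export
    SFTReview
    SFTClean
)

// Record is an illustrative export candidate.
type Record struct {
    Quality  QualityScore
    Category string
    Payload  []byte
}

// FilterForSFT is the contamination firewall: anything scored SFT_NEVER is
// dropped before the export pipeline sees it. Adversarial fixtures should
// prove this path in the Go rebuild (per the acceptance gates above).
func FilterForSFT(in []Record) []Record {
    out := make([]Record, 0, len(in))
    for _, r := range in {
        if r.Quality == SFTNever {
            continue
        }
        out = append(out, r)
    }
    return out
}
```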
### Decision 1.5 — Pathway memory starts clean; old traces preserved as reference
**Decision:** Go pathway memory begins with zero traces. The existing
88 Rust traces at
`/home/profit/lakehouse/data/_pathway_memory/state.json` are NOT loaded
into the Go implementation. They are preserved as a historical record
in the Rust repo and documented at `docs/RUST_PATHWAY_MEMORY_NOTE.md`.
**Rationale:** The Rust pathway memory's value compounded over months
of scrum cycles. Loading those traces into a Go implementation that
hasn't proven its byte-matching contract risks corrupting the new
substrate's signal with semantically-mismatched data. Starting clean
keeps the Go pathway memory's lineage clean and lets the byte-match
correctness be proven on a known input (per SPEC §3.4 G3.4.B).
The historical note records the 88 traces' value (11/11 successful
replays at the time of freeze) so the Go implementation has a
reference baseline to outperform.
### Decision 1.6 — Auditor longitudinal signal restarts
**Decision:** The Rust auditor's `audit_baselines.jsonl`
(longitudinal drift signal accumulated across PRs #6–#13) is **not**
ported to Go. The Go auditor begins a fresh `audit_baselines.jsonl`
lineage on its first PR.
**Rationale:** The drift signal is anchored to specific Rust commits,
verdict shapes, and Kimi/Haiku/Opus rotation traces. Carrying it into
the Go era would be like grafting Rust-PR audit history onto the first
Go PR's prologue — confusing more than informative. Restarting gives
the Go auditor a clean baseline to measure drift against.
The existing Rust `audit_baselines.jsonl` stays in the Rust repo as a
historical record.
---
## ADR-002: storaged per-prefix PUT cap (vectord _vectors/ → 4 GiB)
**Date:** 2026-04-29
**Decided by:** J
**Status:** Implemented (commit `423a381`)
`storaged` enforces a 256 MiB per-PUT body cap as DoS protection
(`MaxBytesReader` + Content-Length check). Keys under `_vectors/`
(vectord LHV1 persistence) get a raised cap of 4 GiB; everything
else stays at 256 MiB.
**Rationale:** the 500K staffing test surfaced that single-file LHV1
above ~150K vectors at d=768 hits the 256 MiB cap. `manager.Uploader`
already streams on the outbound side, so the cap is a safety gate,
not a memory bottleneck — raising it for the vector path doesn't
introduce new memory pressure. The per-prefix cap preserves the
safety gate for routine traffic while opening the documented
production path. Splitting LHV1 across multiple keys was rejected
because G1P specifically shipped the single-Put framed format to
eliminate torn writes — multi-key would re-introduce that failure mode.
**Follow-up:** if production workloads exceed 4 GiB single-file
LHV1, refactor to operator-driven config (env/TOML) rather than
bumping the constant. The function-level `maxPutBytesFor(key)` in
`cmd/storaged/main.go` keeps that drop-in clean.
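A sketch of the per-prefix cap described above; the real `maxPutBytesFor` lives in `cmd/storaged/main.go`, and this version just restates the documented constants.
```go
package main

import (
    "net/http"
    "strings"
)

const (
    defaultMaxPutBytes = 256 << 20 // 256 MiB: DoS-protection cap for routine traffic
    vectorMaxPutBytes  = 4 << 30   // 4 GiB: single-file LHV1 under _vectors/
)

// maxPutBytesFor returns the PUT body cap for a key; only the _vectors/
// prefix gets the raised cap.
func maxPutBytesFor(key string) int64 {
    if strings.HasPrefix(key, "_vectors/") {
        return vectorMaxPutBytes
    }
    return defaultMaxPutBytes
}

// At the handler, the cap is a body wrapper plus a Content-Length check.
func limitPutBody(w http.ResponseWriter, r *http.Request, key string) {
    r.Body = http.MaxBytesReader(w, r.Body, maxPutBytesFor(key))
}
```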
---
## ADR-003: Inter-service auth posture — Bearer token + IP allowlist
**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — wiring deferred to Sprint 1
**Decision:** When inter-service auth is needed (the moment any
binary binds non-loopback or the deployment crosses a trust
boundary), the auth model is **a Bearer token loaded from
`secrets-go.toml` plus a configurable IP allowlist**. Both layers
required: the token authenticates the caller; the allowlist
narrows the network surface.
**Status today (G0):** zero auth middleware. Every binary binds
`127.0.0.1` by default; commit `6af0520` (R-001 partial fix) refuses
non-loopback bind unless the per-service `LH_<SVC>_ALLOW_NONLOOPBACK=1`
env override is set. The override-and-no-auth combination is the
worst case — this ADR locks in what we'll require before any
production override fires.
### What gets implemented when auth lands
1. **`secrets-go.toml` adds a `[auth]` section:**
```toml
[auth]
token = "..." # 32+ random bytes, hex-encoded
allowed_ips = ["10.0.0.0/8", "127.0.0.1/32"] # CIDR list
```
2. **`internal/shared/auth.go`** ships a single chi middleware:
```go
func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler
```
- Empty `cfg.Token` → middleware is a no-op (G0 dev mode).
- Non-empty token → reject 401 unless request has
`Authorization: Bearer <token>` matching constant-time.
- Non-empty `allowed_ips` → reject 403 unless `r.RemoteAddr` (or
`X-Forwarded-For` first hop, configurable) is in CIDR set.
- `/health` exempt — load balancers + monitors need it open.
3. **Every `cmd/<svc>/main.go` adds one line:**
```go
r.Use(shared.RequireAuth(cfg.Auth))
```
Mounted before `register(r)` so it covers every route the binary
exposes after `/health`.
4. **`shared.Run` startup gate:** if bind is non-loopback AND
`cfg.Auth.Token == ""`, refuse to start. The implicit
"localhost is the auth layer" guarantee becomes explicit when
crossing the loopback boundary.
### Alternatives considered
| Option | Why rejected |
|---|---|
| **mTLS** | Strongest but heaviest — every binary needs cert provisioning, rotation tooling, and cert-aware client wiring. Overkill for inter-service traffic that already passes through a single gateway. Reconsider when Lakehouse-Go runs across machines. |
| **JWT with short TTL** | Buys nothing over Bearer here — there's no third-party identity provider, no claim hierarchy worth modelling. Pure token has the same security properties at half the wire complexity. |
| **No auth, IP-allowlist only** | One stolen IP allowlist entry → full access. Token + IP is defense in depth; either alone is too weak. |
| **OAuth2 via external IdP** | Rejected for G0–G3 timeline. No external IdP commitment. Revisit if Lakehouse-Go ever serves end-user requests directly (today everything fronts through the staffing co-pilot which has its own session model). |
### Constant-time comparison + token hygiene
Token comparison must use `crypto/subtle.ConstantTimeCompare` —
naive `==` is vulnerable to timing attacks against an attacker who
can issue many requests and measure round-trip. Token rotation is
operator-driven via `secrets-go.toml` edit + restart; G0 doesn't
need rotate-without-restart.
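A minimal sketch of the middleware this section specifies, assuming the `AuthConfig` fields named in the TOML above; the code that lands in Sprint 1 may differ in structure.
```go
package shared

import (
    "crypto/subtle"
    "net"
    "net/http"
)

// AuthConfig mirrors the [auth] section sketched above.
type AuthConfig struct {
    Token      string   // empty: middleware is a no-op (G0 dev mode)
    AllowedIPs []string // CIDR list; empty: no IP check
}

func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler {
    want := []byte("Bearer " + cfg.Token)
    var nets []*net.IPNet
    for _, c := range cfg.AllowedIPs {
        if _, n, err := net.ParseCIDR(c); err == nil {
            nets = append(nets, n)
        }
    }
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if r.URL.Path == "/health" { // load balancers + monitors stay open
                next.ServeHTTP(w, r)
                return
            }
            if len(nets) > 0 {
                host, _, _ := net.SplitHostPort(r.RemoteAddr)
                ip := net.ParseIP(host)
                allowed := false
                for _, n := range nets {
                    if ip != nil && n.Contains(ip) {
                        allowed = true
                        break
                    }
                }
                if !allowed {
                    http.Error(w, "forbidden", http.StatusForbidden)
                    return
                }
            }
            if cfg.Token != "" {
                got := []byte(r.Header.Get("Authorization"))
                // Constant-time compare; a naive == would leak timing.
                if subtle.ConstantTimeCompare(got, want) != 1 {
                    http.Error(w, "unauthorized", http.StatusUnauthorized)
                    return
                }
            }
            next.ServeHTTP(w, r)
        })
    }
}
```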
### What this ADR does NOT do
- **Does not implement the middleware.** Code lands in Sprint 1.
- **Does not require token in G0 dev.** Empty token → no-op. Smokes
+ proof harness keep working without setting tokens.
- **Does not address gateway → end-user auth.** Gateway terminates
inter-service auth at its inbound; if end-users hit gateway from
a browser, that's a different ADR (likely cookie/session, fronted
by a reverse proxy that handles user auth).
### How this closes audit findings
- **R-001 (queryd /sql RCE-equivalent off-loopback):** the bind
gate prevents accidental exposure today; this ADR specifies the
guardrail when intentional exposure is needed.
- **R-007 (zero auth middleware):** answered by the design above;
R-007 stays open until the middleware is implemented but is no
longer "design TBD."
- **R-010 (no CORS posture):** orthogonal to inter-service auth,
but the `RequireAuth` middleware sits at the right layer to add
CORS handling later (browsers don't reach inter-service routes
in the current design, so CORS is also Sprint 1+ when end-user
requests start landing).
---
## ADR-004: Pathway memory data model — Mem0-style versioned traces
**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — substrate landing in `internal/pathway/`
**Decision:** Pathway memory is an append-only event log of opaque
traces with Mem0-style semantics: Add / Update / Revise / Retire /
History / Search. Each trace has a UID; revisions chain backward
via `predecessor_uid` so the full history is reconstructible.
Persistence is JSONL append-only with full-replay on load;
corruption recovery skips bad lines without halting startup.
### Operations
| Op | Effect |
|---|---|
| `Add(content, tags...)` | New UID, stored fresh, replay_count=1. |
| `AddIdempotent(uid, content, tags...)` | If UID exists → replay_count++. Else → Add with that UID. |
| `Update(uid, content)` | In-place content replacement (same UID). Bumps `updated_at_ns`. NOT a revision — same trace, new content. |
| `Revise(predecessorUID, content, tags...)` | New UID with `predecessor_uid` set. Old trace stays accessible via History. Failure modes: predecessor missing → error; predecessor retired → still allowed (revisions of retired traces are valid). |
| `Retire(uid)` | Sets `retired=true`. Excluded from `Search` by default; still accessible via `Get` and `History`. |
| `Get(uid)` | Returns the trace (including if retired); error on missing. |
| `History(uid)` | Walks `predecessor_uid` chain backward, returns slice [self, parent, grandparent, ...]. Cycle-detected via visited-set; returns error on cycle (which only happens if persistence file was hand-edited). |
| `Search(filter)` | Returns matching traces. Default excludes retired; opt in via `IncludeRetired: true`. Filters: tag-match, content-substring, time range. |
### Why Mem0-style + Why these specific ops
- **Mem0** (memory pattern from the OpenAI Memories paper / Mem0 lib)
is the canonical "agent memory" interface for the same reason
Markdown is the canonical text format: it's the lowest common
denominator that the entire ecosystem assumes. Adopting it lets
agent loops written against any Mem0-aware substrate work here.
- Update vs Revise are deliberately separate. Update is "I noticed
a typo in my note." Revise is "I now believe something different
than I did when I wrote this; preserve the old belief for audit."
Conflating them loses the audit trail.
- Retire vs Delete is deliberate. Retire stops a trace from
surfacing in search but preserves it for history reconstruction.
Delete (which we don't expose) would break references.
### Trace data shape
```go
type Trace struct {
    UID            string          // UUID v4 unless caller provides one
    Content        json.RawMessage // opaque, schema is caller's contract
    PredecessorUID string          // empty if root revision
    CreatedAtNs    int64
    UpdatedAtNs    int64
    Retired        bool
    ReplayCount    int      // ≥1 for any stored trace
    Tags           []string // for Search
}
```
`Content` is opaque JSON (not a struct) so callers can store any
shape — the data model doesn't constrain semantics. Callers add
their own validators on top.
### Persistence
JSONL append-only log under `_pathway/<store_name>.jsonl`. Each
mutation appends one JSON line:
```
{"op":"add", "trace":{...}}
{"op":"update", "uid":"…", "content":"…"}
{"op":"revise", "trace":{…}} # trace.PredecessorUID is set
{"op":"retire", "uid":"…"}
{"op":"replay", "uid":"…"} # idempotent re-add hit
```
On startup, replay every line in order, building in-memory state.
A malformed line logs a warn and is skipped; load continues.
Corruption tolerance is non-optional — partial state is better
than no state for an agent substrate.
Compaction is a future concern. A 100K-trace log replays in
seconds; below that scale, JSONL append is the simplest correct
choice. When compaction lands, the format will be: snapshot file
(full state JSON) + tail JSONL since snapshot. Detect snapshot,
load it, then replay tail.
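A sketch of the load path this implies: replay every line, warn and skip on malformed JSON. The envelope follows the lines above; the function and field names are illustrative.
```go
package pathway

import (
    "bufio"
    "encoding/json"
    "log"
    "os"
)

// logLine is the append-only envelope: one op per JSONL line.
type logLine struct {
    Op      string          `json:"op"`
    UID     string          `json:"uid,omitempty"`
    Content json.RawMessage `json:"content,omitempty"`
    Trace   *Trace          `json:"trace,omitempty"`
}

// replayLog rebuilds in-memory state by applying every line in order.
// A malformed line is logged and skipped; partial state beats no state.
func replayLog(path string, apply func(logLine)) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()

    sc := bufio.NewScanner(f)
    sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // traces can exceed the default token size
    for lineNo := 1; sc.Scan(); lineNo++ {
        var op logLine
        if err := json.Unmarshal(sc.Bytes(), &op); err != nil {
            log.Printf("pathway: skipping malformed line %d: %v", lineNo, err)
            continue
        }
        apply(op)
    }
    return sc.Err()
}
```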
### Cycle safety
UIDs are generated server-side via `uuid.New()` (existing dep —
catalogd uses it). New UID for every Add and Revise. The data
model itself can't form cycles — every Revise points at an
EXISTING uid, and the new uid didn't exist a moment ago.
History walks defensively anyway: visited-set tracks UIDs seen
this walk; if we encounter a duplicate, return error. Protects
against corruption (manual edit, bug in a future op) without
constraining the happy path.
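A sketch of the defensive walk, assuming an in-memory index keyed by UID (the `Store` shape is illustrative):
```go
package pathway

import "fmt"

// Store is an illustrative in-memory index rebuilt from the JSONL log.
type Store struct {
    traces map[string]*Trace
}

// History returns [self, parent, grandparent, ...] by walking
// predecessor_uid backward. The visited set turns a corruption-induced
// cycle into an error instead of an infinite loop.
func (s *Store) History(uid string) ([]*Trace, error) {
    var chain []*Trace
    seen := make(map[string]bool)
    for uid != "" {
        if seen[uid] {
            return nil, fmt.Errorf("pathway: cycle detected at %q", uid)
        }
        seen[uid] = true
        t, ok := s.traces[uid]
        if !ok {
            return nil, fmt.Errorf("pathway: trace %q not found", uid)
        }
        chain = append(chain, t)
        uid = t.PredecessorUID
    }
    return chain, nil
}
```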
### Storage location
JSONL file path is configurable per store. Default:
`/var/lib/lakehouse/pathway/<name>.jsonl` for prod; tests use
`t.TempDir()`. Persistence is OPTIONAL — empty path means
in-memory only (matches vectord G1's pattern).
### What this ADR does NOT do
- **No HTTP surface decision.** Whether `cmd/pathwayd` is its own
binary or routes get added to `cmd/vectord` is the next ADR's
concern. The substrate is a pure library either way.
- **No vector index integration.** Pathway traces can carry a
vector embedding in `Content` (caller decides), but this ADR
doesn't define how the substrate integrates with `vectord`'s
HNSW indexes. That's the staffing co-pilot's design problem
when those layers compose.
- **No agent-loop semantics.** "When does an agent ADD vs
REVISE?" is a workflow decision, not a substrate decision.
---
## ADR-005: Observer fail-safe semantics
**Date:** 2026-04-30
**Status:** RATIFIED
**Scope:** `internal/observer` (Store, Persistor) + `internal/workflow` (Runner) + `cmd/observerd`
The Rust legacy had a documented "verdict:accept on crash" anti-pattern:
when the observer crashed mid-evaluation, the upstream interpreted the
missing verdict as implicit acceptance. Several silent regressions traced
to it. The Go observer's role is structurally different — it is a
**witness** (records what happened) rather than a **gate** (decides
accept/reject) — but adjacent fail-safe decisions still need locking
now that observerd is on the prod-realistic stack via the lift harness
(commit `b2e45f7`, 2026-04-30). This ADR ratifies the current behavior
and locks the rationale so future consumers don't break the invariant
by flipping the defaults.
### Decision 5.1 — Persist failure is logged-not-fatal; ring is the in-flight source of truth
Already implemented (`internal/observer/store.go:60-67`). Locked:
- If `persistor.Append` fails, log a warning and continue. Do NOT
return an error to the caller of `Store.Record`.
- The in-memory ring buffer is the source of truth in flight; the
JSONL is a best-effort durability shadow.
- Operators who need fail-closed audit-grade trails configure that
mode through a future opt-in (deferred to a later ADR; not the
G0/G1/G2 default).
**Why fail-open here:** the observer's job is to keep recording even
when the disk hiccups. A `persist-fail-fatal` mode would translate
every transient I/O blip into an observer-blackout, which is strictly
worse for the witness role than missing a few persisted entries — the
ring still has them, and operators can drain it on restart.
**Why this isn't the Rust anti-pattern:** the Go observer doesn't
emit verdicts. A persist failure here means "we recorded fewer rows
on disk than in memory," not "we accepted something we shouldn't have."
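A sketch of the fail-open shape Decision 5.1 locks; the shipped `internal/observer/store.go` differs in detail, and `ObservedOp` and the ring are simplified here.
```go
package observer

import (
    "log"
    "sync"
)

// ObservedOp is simplified for this sketch.
type ObservedOp struct {
    Node    string
    Success bool
    Error   string
}

// Persistor is the best-effort JSONL shadow.
type Persistor interface {
    Append(ObservedOp) error
}

type Store struct {
    mu        sync.Mutex
    ring      []ObservedOp
    capacity  int
    persistor Persistor
}

// Record treats the ring as the in-flight source of truth; a persist
// failure is logged and never returned to the caller.
func (s *Store) Record(op ObservedOp) {
    s.appendToRing(op) // never fails; eviction handled per Decision 5.4
    if s.persistor != nil {
        if err := s.persistor.Append(op); err != nil {
            log.Printf("observer: persist failed (logged, not fatal): %v", err)
        }
    }
}
```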
### Decision 5.2 — Mode failure in workflow.Runner: `Success = (Error == "")`, no panic-swallow path
Already implemented (`internal/workflow/runner.go`). Locked:
- Mode errors are caught by the runner and surfaced via the node's
`Error` field; `Success` is the boolean derived from `Error == ""`.
- `observerd` records an `ObservedOp` per node with `Success: false`
and the error string when a mode fails.
- Cycles, missing-deps, and unknown modes are aborting errors → 4xx
from `/observer/workflow/run` with the failure encoded in the JSON
response.
**Why this is the explicit anti-Rust:** allowing a mode to silently
swallow its panic and report `Success: true` is exactly how the Rust
"verdict:accept on crash" pattern manifests. Forcing the runner to
record `Success: false` on error makes the failure observable to
downstream consumers (observerd queries, scrum review, distillation
selection) instead of laundering it into a fake success.
### Decision 5.3 — Provenance is one-row-per-node, recorded post-run
Already implemented (`cmd/observerd/main.go:140-154`). Locked:
- `runner.Run` returns the full `RunResult` with per-node Success/Error;
`handleWorkflowRun` then iterates `res.Nodes` and `store.Record`s an
`ObservedOp` per node.
- One row per node, NOT a single per-workflow catch-all. A workflow with
N nodes produces N audit rows.
- Crash semantics:
- Crash *during* `runner.Run` → no provenance recorded; queries see
absence, not a false acceptance.
- Crash *during* the recording loop → some nodes recorded, some
absent; queries see partial provenance, again not a false
acceptance.
- Recovery: re-run the whole workflow. No incremental resume in G0/G1/G2.
**Why one row per node:** debugging a partial workflow is a one-grep
operation when each node has its own row. A single catch-all row would
be exactly the Rust anti-pattern surface — a "we accepted this workflow"
record that survives a partial crash looks identical to a genuine
acceptance. Per-node rows make that structurally impossible.
**Known gap, not yet a follow-up ADR:** recording happens after
`runner.Run` returns, not as each node completes. In a long workflow with
a late-stage failure, nodes that already finished are not recorded until
the runner returns. For the G0/G1/G2 substrate this is fine —
workflows are short. When workflows get long enough that mid-run
visibility matters, a streaming-record callback is the right shape.
### Decision 5.4 — `/observer/event` accepts even when the ring is full
Already implemented via `Store.Record`'s shift-left eviction. Locked:
- Ring overflow is normal operation: oldest evicted, newest accepted.
- 200 OK from `/observer/event` means "we accepted into the ring"; it
does NOT promise "we persisted." Persistence remains best-effort
per Decision 5.1.
- 4xx is reserved for malformed `ObservedOp` payloads (validation
failures).
**Why accept-on-full:** treating a full ring as a 503 would translate
every brief activity burst into client errors, which is exactly the
wrong direction for an audit witness — the witness's job is to never
refuse to write, only to lose oldest data when capacity binds.
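Continuing the illustrative `Store` from the Decision 5.1 sketch, accept-on-full is a shift-left eviction rather than an error:
```go
// appendToRing accepts unconditionally: a full ring evicts its oldest entry
// instead of turning an activity burst into 503s. 200 OK from
// /observer/event means "in the ring", not "persisted".
func (s *Store) appendToRing(op ObservedOp) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.capacity > 0 && len(s.ring) >= s.capacity {
        s.ring = s.ring[1:] // shift-left eviction: oldest out, newest in
    }
    s.ring = append(s.ring, op)
}
```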
### Alternatives considered
- **Persist-required mode** — caller-configurable fail-closed for
audit-grade workloads. The right approach when this lands is an
opt-in on `Store` construction, leaving the default fail-open.
Deferred to a future ADR.
- **Distributed ring with WAL** — persist before accept-into-ring,
sync semantics. Too heavy for G0/G1 and breaks the ring's "in-flight
source of truth" property.
- **Mode-result schema with explicit verdict field** — would force
every mode to declare accept/reject. Overengineered for the witness
role and reintroduces the gate-vs-witness confusion this ADR is
trying to avoid.
### What this ADR does NOT do
- **No retention policy.** "How long do we keep observer entries on
disk?" is a separate operations decision.
- **No mode-level retry.** If a mode fails, the runner records that
and moves on. Whether to retry is a workflow-definition concern
(Archon-style retry policies in the YAML), not the runner's.
- **No cross-process recovery.** A crashed observerd loses the ring;
the persistor preserves what it managed to write. Operators read the
JSONL after restart, not query a dead daemon.
- **No persist-required opt-in.** Mentioned in alternatives; lands in
a separate ADR when an audit-grade consumer requires it.
### How this closes the OPEN list
STATE_OF_PLAY listed ADR-005 as a doc-only gate before observer wired
into production paths. The 2026-04-30 lift run wired observerd into the
prod-realistic harness boot, which means observer is now on the data
path for every reality test workflow. This ADR locks the fail-safe
invariants before the next consumer (scrum runner, distillation rebuild,
or a real production workflow) takes a hard behavioral dependency.
---
## ADR-006: Auth posture for non-loopback deploy
**Date:** 2026-04-30
**Status:** RATIFIED
**Scope:** `internal/shared/auth.go` + `internal/shared/bind.go` + every `cmd/<bin>/main.go`'s `shared.Run` call site
ADR-003 locked the substrate (Bearer token + IP allowlist, opt-in via
`cfg.Auth.Token`/`cfg.Auth.AllowedIPs`, `/health` exempt). ADR-006
ratifies the **operator playbook + deploy-time invariants** — what
gets enforced when, what operators set where, what happens when keys
rotate. Required because Sprint 4 deployment work (REPLICATION.md,
systemd units, Dockerfile) needs a locked auth posture before it
touches production-shaped configs.
### Decision 6.1 — Non-loopback bind requires `auth.token`; the gate is mechanical
Already implemented in `requireAuthOnNonLoopback` (`internal/shared/bind.go:58-67`).
Locked:
- Any binary that binds anything other than `127.0.0.0/8` / `::1` /
`localhost` MUST have `cfg.Auth.Token != ""`. Empty-token +
non-loopback-bind = startup error, not silent insecure mode.
- The check fires in `shared.Run` BEFORE `http.Server.Serve`, so a
misconfigured binary fails fast at startup rather than serving
one request.
- Pairs with `requireLoopbackOrOverride`: that gate refuses any
non-loopback bind without `LH_<NAME>_ALLOW_NONLOOPBACK=1`. Together
they make the audit's R-001+R-007 worst case (queryd `/sql` =
RCE-equivalent off-loopback with no auth) mechanically impossible.
**Why mechanical, not policy:** policy gates rely on operator
discipline. The substrate gates work even when an operator copies a
dev `lakehouse.toml` into prod and forgets to set the token —
binary refuses to start, error message names the env override.
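The gate is small enough to sketch; names follow the text above, but the shipped `internal/shared/bind.go` may differ.
```go
package shared

import (
    "fmt"
    "net"
    "strings"
)

// requireAuthOnNonLoopback refuses to start a daemon that would expose its
// routes off-loopback with an empty token; the error names the fix.
func requireAuthOnNonLoopback(name, bindAddr string, auth AuthConfig) error {
    if isLoopback(bindAddr) {
        return nil
    }
    if auth.Token == "" {
        return fmt.Errorf(
            "%s: refusing non-loopback bind %q with empty auth.token; set AUTH_TOKEN "+
                "(or the variable named by auth.token_env) before setting LH_%s_ALLOW_NONLOOPBACK=1",
            name, bindAddr, strings.ToUpper(name))
    }
    return nil
}

func isLoopback(bindAddr string) bool {
    host, _, err := net.SplitHostPort(bindAddr)
    if err != nil {
        host = bindAddr
    }
    if host == "localhost" {
        return true
    }
    if host == "" { // ":8080" binds every interface, not loopback
        return false
    }
    ip := net.ParseIP(host)
    return ip != nil && ip.IsLoopback()
}
```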
### Decision 6.2 — Token comes from `cfg.Auth.Token` populated by env or secrets file
Locked:
- Operators do NOT put the production token in `lakehouse.toml`
directly. The TOML field is empty in the committed file; the
daemon's systemd unit sets `AUTH_TOKEN` (or whatever
`cfg.Auth.TokenEnv` names) via `EnvironmentFile=` pointing at
`/etc/lakehouse/auth.env` (mode 0600, root-owned).
- Same pattern as `chatd`'s provider keys (`OPENROUTER_API_KEY` etc.):
TOML names the env var, systemd loads the env file.
- Justification: keeps secrets out of git + out of the running
process's command line + audit-able via filesystem ACLs.
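A sketch of the env resolution the commit header calls `resolveAuthFromEnv`, assuming the ADR-003 `AuthConfig` gains the `TokenEnv` field; the operator side appears in comments because unit-file layout is ops territory.
```go
package shared

import "os"

// resolveAuthFromEnv fills Token from the named environment variable when
// the committed TOML left it empty, so the secret never lives in git.
// AuthConfig here is the ADR-003 struct plus the TokenEnv field this
// decision adds (SecondaryTokens omitted; see Decision 6.5).
//
// Operator side (illustrative):
//   /etc/lakehouse/auth.env (mode 0600, root-owned):
//     AUTH_TOKEN=<32+ random bytes, hex-encoded>
//   systemd unit:
//     EnvironmentFile=/etc/lakehouse/auth.env
func resolveAuthFromEnv(a *AuthConfig) {
    if a.TokenEnv == "" {
        a.TokenEnv = "AUTH_TOKEN" // default: the happy path needs no TOML at all
    }
    if a.Token == "" { // an explicit TOML token always wins over the env
        a.Token = os.Getenv(a.TokenEnv)
    }
}
```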
### Decision 6.3 — `AllowedIPs` is the inter-service gate; `Token` is the cross-trust-boundary gate
Locked:
- Same-box deploys (10 daemons all on one host, all on `127.0.0.0/8`
or a private subnet) use **`AllowedIPs` only**. Each daemon's
`cfg.Auth.AllowedIPs` lists the gateway's address (and any other
daemon that legitimately calls it). No token shared between
internal services.
- Gateway-to-external traffic (a coordinator UI in another VPC,
a user's browser, an external integrator) goes through
**Bearer token**. The token is per-tenant; rotation is per-tenant.
- Mixed: a service can require BOTH (allowlist AND token) — the
middleware logic is `allowed = ip_allowed && token_valid` when
both are set. Use this for the gateway when binding non-loopback.
**Why split:** token rotation is operationally expensive (every
caller updates a secret). IP allowlist rotation is free if the
network topology is stable. Splitting them by trust boundary lets
internal services treat allowlist drift as a network change while
external callers handle token rotation as a credential change.
### Decision 6.4 — `/health` is unauthenticated; everything else under `shared.Run` is gated
Already implemented (`internal/shared/server.go:84-92`). Locked:
- Load balancers + monitor probes hit `/health` without a token.
The route returns `{"status":"ok","service":"<name>"}` and nothing
about service state — no version, no commit, no internal counts.
- Every other route registered via `shared.Run`'s `register`
callback lives inside the auth-gated chi.Group. New routes
inherit auth automatically; new daemons inherit it via `shared.Run`.
- A daemon that needs a public route MUST add it to the outer router
before the `register` group, with a code comment explaining the
exemption. There are no others today.
### Decision 6.5 — Token rotation is operator-staged; old + new accepted during the window
Not yet implemented; locked as a Sprint 4 follow-up:
- Operators stage a rotation by adding a second token to
`cfg.Auth.SecondaryTokens []string`. Both primary and secondary
pass auth during the window.
- After every caller is updated to the new token, operators
promote secondary → primary and clear secondary. A second
rotation can begin.
- Rolling restart not required; daemons reload `cfg.Auth` on
SIGHUP (also a Sprint 4 follow-up — currently they re-read on
restart only).
**Why dual-token instead of just single-rotation:** caller pool can be
large (gateway + observerd + scrum runner + UI + external integrators).
A single-token rotation forces a flag-day. Dual-token windows let
operators rotate gradually and abort on failure.
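A sketch of the dual-token check described in the commit header: expected `Authorization` values are pre-encoded once, then walked with constant-time compares per request, primary first so the steady-state cost is one compare.
```go
package shared

import "crypto/subtle"

// acceptedHeaders pre-encodes "Bearer <token>" for the primary token and
// every staged secondary; built once when the middleware is constructed.
func acceptedHeaders(primary string, secondaries []string) [][]byte {
    var hs [][]byte
    if primary != "" {
        hs = append(hs, []byte("Bearer "+primary))
    }
    for _, t := range secondaries {
        if t != "" {
            hs = append(hs, []byte("Bearer "+t))
        }
    }
    return hs
}

// tokenOK walks the slice with constant-time compares; during a rotation
// window both the outgoing and the incoming token pass.
func tokenOK(authorization string, accepted [][]byte) bool {
    got := []byte(authorization)
    for _, want := range accepted {
        if subtle.ConstantTimeCompare(got, want) == 1 {
            return true
        }
    }
    return false
}
```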
### Decision 6.6 — TLS is the network operator's job, not ours
Locked:
- Daemons speak HTTP, not HTTPS. TLS termination happens at the
network edge (nginx / Caddy / cloud LB), not in the Go process.
- Internal daemon-to-daemon traffic stays on plaintext HTTP because
it's all on `127.0.0.0/8` or a private subnet (per Decision 6.3).
- Justification: TLS in-process means cert management, rotation,
reload — all undifferentiated lift that nginx already solves
better. The Bearer token + allowlist gates are sufficient when
combined with a TLS-terminating reverse proxy.
### Alternatives considered
- **mTLS for inter-service auth** — every daemon issues + verifies
certs. Solves token-rotation pain but adds cert lifecycle as a
problem. Allowlist + plaintext on the private network is cheaper
and gets the same threat-model coverage.
- **JWT-only** — JWTs let callers carry richer claims (tenant id,
expiry, scopes). Overkill for the current threat model; the
Bearer token + allowlist split is honest about what each layer
actually defends against. Revisit when multi-tenant gateway
features land.
- **No auth, network is the boundary** — works for G0 dev and the
current single-box deployment. ADR-006 explicitly does NOT
recommend this for non-loopback prod (the mechanical gate
refuses it).
### What this ADR does NOT do
- **Does not specify how the gateway authenticates external callers.**
Token-vs-mTLS-vs-OAuth at the public edge is a separate decision
driven by who-calls-us. ADR-006 is about the inter-service +
same-trust-domain posture.
- **Does not implement token rotation hot-reload.** Decision 6.5
documents the design; the implementation is Sprint 4 work.
- **Does not lock TLS termination details.** Where + how nginx/Caddy
goes is ops infrastructure, not ADR territory.
### How this closes the OPEN list
STATE_OF_PLAY listed ADR-006 as the gate before any Go binary binds
non-loopback in prod. The substrate gates were already present (R-001
+ R-007 enforced via `requireLoopbackOrOverride` +
`requireAuthOnNonLoopback`); this ADR locks the operator playbook
that turns those gates into a deployable posture. Sprint 4 can now
write systemd units that set `AUTH_TOKEN` from `EnvironmentFile=`
without re-litigating the design.
---