ADR-003 locked the auth substrate; ADR-006 ratifies the operator
playbook + adds two implementation pieces needed for Sprint 4
deployment: env-resolved tokens and dual-token rotation.
Six decisions locked in docs/DECISIONS.md:
- 6.1: Non-loopback bind requires auth.token (mechanical gate at
shared.Run, already implemented; this ratifies it).
- 6.2: Token from env, not TOML. /etc/lakehouse/auth.env (mode 0600)
loaded by systemd EnvironmentFile=. New TokenEnv field on
AuthConfig defaults to "AUTH_TOKEN".
- 6.3: AllowedIPs for inter-service same-trust-domain; Token for
cross-trust-boundary (gateway ↔ external).
- 6.4: /health stays unauthenticated; everything else under
shared.Run is gated. Already implemented; ratified here.
- 6.5: Token rotation is dual-token. New SecondaryTokens []string
on AuthConfig — both primary and any secondary pass auth
during the rotation window. Implemented in this commit.
- 6.6: TLS terminates at the network edge (nginx/Caddy), not
in-process. Daemons stay HTTP-only; internal traffic stays
on private subnets per Decision 6.3.
Implementation:
- internal/shared/config.go: AuthConfig gains TokenEnv +
SecondaryTokens fields. New resolveAuthFromEnv() called by
LoadConfig fills Token from os.Getenv(TokenEnv) when Token is
empty. TokenEnv defaults to "AUTH_TOKEN" so the happy path needs
no TOML config.
- internal/shared/auth.go: RequireAuth pre-encodes Bearer headers
for primary + every secondary token; per-request constant-time
compare walks the slice. Fast path is 1 compare (primary).
Tests:
- TestLoadConfig_AuthTokenFromEnv (3 sub-tests): default env name,
custom token_env, explicit Token wins over env.
- TestRequireAuth_SecondaryTokenAccepted: both primary + secondary
tokens pass during rotation window.
- TestRequireAuth_SecondaryTokensOnly: only-secondary path works
for the case where primary was just promoted-to-empty mid-rotation.
go test ./internal/shared all green; existing auth_test.go
unchanged (constant-time compare path preserved).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the OPEN item from STATE_OF_PLAY. Required because observerd is
now on the prod-realistic data path via the lift harness boot (b2e45f7),
so the next consumer (scrum runner / distillation rebuild / production
workflow) needs the fail-safe rationale locked, not implicit.
The Rust "verdict:accept on crash" anti-pattern doesn't translate
one-to-one to the Go observer (witness, not gate). But four adjacent
fail-safe decisions are real and live:
5.1 Persist failure is logged-not-fatal; ring is in-flight source of
truth. Persist-required mode deferred to a future opt-in ADR.
5.2 Mode failure → Success=false, no panic-swallow path. The runner
catches mode errors and surfaces them via node.Error; downstream
consumers see failures explicitly rather than as fake successes
(the Rust anti-pattern surface).
5.3 One row per node, recorded post-run. A workflow with N nodes
produces N audit rows, never a per-workflow catch-all that
survives partial crashes. Known gap: recording happens after
runner.Run returns (acceptable for short workflows; streaming
callback is the right shape when workflows get longer).
5.4 /observer/event accepts on full ring (oldest evicted). Refusing
to write would translate every burst into client errors — wrong
direction for an audit witness.
Mostly ratifies existing behavior; cross-checked claims against
actual code (caught one error in Decision 5.3 draft — recording is
post-run-batched, not per-node-as-it-completes — and the ADR now
states reality).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Sprint 2 design-bar work (audit reports/scrum/sprint-backlog.md):
S2.1 — ADR-004 documents the pathway-memory data model
S2.2 — pathway port lands with deterministic fixture corpus
and full test coverage on day one
S2.3 — retired traces are excluded from retrieval (test
passes; would fail without the filter)
Mem0-style operations: Add / AddIdempotent / Update / Revise /
Retire / Get / History / Search. Each operation is a method on
Store; persistence is JSONL append-only with corruption recovery
on Replay.
internal/pathway/types.go Trace + event + SearchFilter + sentinel errors
internal/pathway/store.go in-memory state + RWMutex + ops
internal/pathway/persistor.go JSONL append-only log with replay
internal/pathway/store_test.go 20 test funcs covering all 7
Sprint 2 claim rows + concurrency
internal/pathway/persistor_test.go 6 test funcs covering missing-
file, corruption recovery, long-line
handling, parent-dir auto-create,
apply-error skip behavior
Sprint 2 claim coverage row-by-row:
ADD TestAdd_AssignsUIDAndTimestamps + TestAdd_RejectsInvalidJSON
UPDATE TestUpdate_ReplacesContentSameUID + Update_MissingUID_Errors
REVISE TestRevise_LinksToPredecessorViaHistory +
TestRevise_PredecessorMissing_Errors +
TestRevise_ChainOfThree_BackwardWalk
RETIRE TestRetire_ExcludedFromSearch +
TestRetire_StillAccessibleViaGet +
TestRetire_StillAccessibleViaHistory
HISTORY/cycle TestHistory_CycleDetected (injected via internal map),
TestHistory_PredecessorMissing_TruncatesChain,
TestHistory_UnknownUID_ErrorsClean
REPLAY/dup TestAddIdempotent_IncrementsReplayCount (locks the
"replay preserves original content" rule per ADR-004)
CORRUPTION TestPersistor_CorruptedLines_Skipped +
TestPersistor_ApplyError_Skipped
ROUND-TRIP TestPersistor_RoundTrip locks the full Save → fresh
Store → Load → Stats-match contract
Two real bugs caught during testing:
- Add returned the same *Trace stored in the map, so callers
holding a reference saw later mutations. Fixed: clone before
return (matches Get's contract). Same fix in AddIdempotent
+ Revise.
- Test typo: {"v":different} isn't valid JSON; AddIdempotent's
json.Valid rejected it as ErrInvalidContent. Test fixed to
use {"v":"different"}; the validation behavior is correct.
Skipped this commit (next):
- cmd/pathwayd HTTP binary
- gateway routing for /v1/pathway/*
- end-to-end smoke
These add the wire surface; the substrate ships first so the
wire layer can be a pure proxy in the next commit.
Verified:
go test -count=1 ./internal/pathway/ — 26 tests green
just verify — vet + test + 9 smokes 34s
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks in the auth model that R-001 + R-007 will be retrofitted
against. Doc-only — wiring deferred to Sprint 1 when the first
non-loopback binding is needed.
Decision: Bearer token (from secrets-go.toml [auth] section) + IP
allowlist (CIDR list). Both layers required when auth is on; empty
token = G0 dev no-op. /health exempt.
Implementation shape (when it lands):
- internal/shared/auth.go middleware: one chi r.Use line per binary
- shared.Run gates: refuses non-loopback bind without configured token
- subtle.ConstantTimeCompare for token equality (timing-safe)
Alternatives considered + rejected:
mTLS — too heavy for single-machine inter-service traffic
JWT — buys nothing over Bearer without external IdP
IP-only — one stolen IP entry = full access; no defense depth
OAuth2 — no external IdP commitment in G0-G3 timeline
What this doesn't do:
- Doesn't implement (code lands Sprint 1)
- Doesn't break G0 dev (empty token = middleware no-op)
- Doesn't address gateway→end-user auth (different ADR shape)
Closes the design-decision blocker for R-001 and R-007. Wiring
ticket: Sprint 1 backlog story S1.2.
Also lifts ADR-002 (storaged per-prefix PUT cap) into the doc —
it was implemented in 423a381 but not yet recorded as an ADR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two documents only — no Go code yet. PRD restates the problem and
preserves the Rust PRD's invariants verbatim, then maps the locked
stack to Go libraries and surfaces four hard problems (DuckDB-via-cgo
for the query engine, Lance dropped, Dioxus → HTMX, arrow-go maturity).
SPEC walks each Rust crate + TS surface and tags the port with library
choice / effort estimate / risk + a 5-phase migration plan from
skeleton (Phase G0) to demo parity (Phase G5).
Six open questions remain that gate Phase G0:
- DuckDB cgo OK?
- HTMX vs React for the UI?
- Repo location?
- Distillation v1.0.0 port verbatim or rebuild?
- Pathway memory data — port 88 traces or start clean?
- Auditor lineage — port audit_baselines.jsonl or restart?
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>