golangLAKEHOUSE/docs/DECISIONS.md
root 0d18ffa780 ADR-003: inter-service auth posture — Bearer + IP allowlist
Locks in the auth model that R-001 + R-007 will be retrofitted
against. Doc-only — wiring deferred to Sprint 1 when the first
non-loopback binding is needed.

Decision: Bearer token (from secrets-go.toml [auth] section) + IP
allowlist (CIDR list). Both layers required when auth is on; empty
token = G0 dev no-op. /health exempt.

Implementation shape (when it lands):
  - internal/shared/auth.go middleware: one chi r.Use line per binary
  - shared.Run gates: refuses non-loopback bind without configured token
  - subtle.ConstantTimeCompare for token equality (timing-safe)

Alternatives considered + rejected:
  mTLS         — too heavy for single-machine inter-service traffic
  JWT          — buys nothing over Bearer without external IdP
  IP-only      — one stolen IP entry = full access; no defense depth
  OAuth2       — no external IdP commitment in G0-G3 timeline

What this doesn't do:
  - Doesn't implement (code lands Sprint 1)
  - Doesn't break G0 dev (empty token = middleware no-op)
  - Doesn't address gateway→end-user auth (different ADR shape)

Closes the design-decision blocker for R-001 and R-007. Wiring
ticket: Sprint 1 backlog story S1.2.

Also lifts ADR-002 (storaged per-prefix PUT cap) into the doc —
it was implemented in 423a381 but not yet recorded as an ADR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 06:05:59 -05:00

11 KiB
Raw Blame History

Architecture Decision Records — Lakehouse-Go

ADRs from the Go era. Numbered fresh from 001 to start clean lineage. Where a Rust ADR (numbered 001021 in the Rust repo's DECISIONS.md) remains in force, this file references it explicitly. Where a Rust ADR is superseded, the new ADR records why.


ADR-001: Foundational decisions for the Go rewrite

Date: 2026-04-28 Decided by: J Status: Ratified — Phase G0 unblocked

The six questions that gated Phase G0 (per PRD.md / SPEC.md §8) are all answered.

Decision 1.1 — DuckDB via cgo for the query engine

Decision: queryd uses marcboeker/go-duckdb (cgo bindings to DuckDB). Pure-Go alternative was rejected.

Rationale: DuckDB reads Parquet natively, supports the SQL surface DataFusion exposed in the Rust era (CTEs, window functions, hybrid joins), and runs in-process with cgo. The alternatives were:

  • Hand-rolling a query planner over arrow-go RecordBatches — multi-engineer-month research project; high risk of correctness bugs.
  • Running DuckDB as an external process — adds an operational surface and a network hop to every query.

Cgo build complexity is the accepted cost. Single-binary deploy preserved (the cgo dependency embeds at link time).

Supersedes Rust ADR-001 (object storage as source of truth) — no. That ADR remains in force; the change is the engine over the storage, not the storage model.

Decision 1.2 — HTMX for the UI

Decision: Frontend is html/template + HTMX + Alpine.js, server-rendered by cmd/gateway. React/Vite in a separate repo is the fallback if UX requirements demand SPA-tier interactivity post-G5.

Rationale: The existing Lakehouse UIs (/lakehouse/ demo + staffer console) are mostly server-rendered HTML with vanilla JS that already fits the HTMX style. Single-binary deploy is preserved (gateway serves templates + static assets). No build chain beyond go build.

The React fallback is named explicitly so it's not relitigated unless an actual UX requirement triggers it.

Decision 1.3 — Gitea hosts the new repo

Decision: Repo lives at git.agentview.dev/profit/golangLAKEHOUSE (same Gitea server that hosts the Rust lakehouse).

Rationale: Single source of truth for repo hosting; existing auditor tooling (lakehouse-auditor systemd service) already speaks Gitea API; existing credentials work; no new ops surface.

Decision 1.4 — Distillation rebuilt in Go, not ported verbatim

Decision: The distillation v1.0.0 substrate (tag distillation-v1.0.0 at e7636f2 in the Rust repo) is not bit-identical-ported. The Go reimplementation:

  • Ports the LOGIC: SFT export pipeline, contamination firewall (the quality_score enum + SFT_NEVER constant), category mapping rules, audit-baselines append-only pattern.
  • Does NOT port the FIXTURES: tests/fixtures/distillation/acceptance/ is rebuilt from scratch in Go with new ground-truth golden files.
  • Does NOT port the bit-identical reproducibility PROPERTY: that was measured against the Rust implementation. The Go implementation establishes its own reproducibility baseline.

Rationale: Bit-identical reproducibility was a measured property of a specific implementation, not a portable invariant. Re-establishing it in Go means new fixtures, new gates, new audit-baselines. This is honest about what's transferring (logic) versus what's a Rust-era artifact (the specific bit-identical hashes).

Risk: the contamination firewall is the most consequential distillation safety net. The port must be reviewed line-by-line, and the new Go fixtures must include adversarial cases that prove the firewall works in the new implementation. See SPEC §7 acceptance gates.

Decision 1.5 — Pathway memory starts clean; old traces preserved as reference

Decision: Go pathway memory begins with zero traces. The existing 88 Rust traces at /home/profit/lakehouse/data/_pathway_memory/state.json are NOT loaded into the Go implementation. They are preserved as a historical record in the Rust repo and documented at docs/RUST_PATHWAY_MEMORY_NOTE.md.

Rationale: The Rust pathway memory's value compounded over months of scrum cycles. Loading those traces into a Go implementation that hasn't proven its byte-matching contract risks corrupting the new substrate's signal with semantically-mismatched data. Starting clean keeps the Go pathway memory's lineage clean and lets the byte-match correctness be proven on a known input (per SPEC §3.4 G3.4.B).

The historical note records the 88 traces' value (11/11 successful replays at the time of freeze) so the Go implementation has a reference baseline to outperform.

Decision 1.6 — Auditor longitudinal signal restarts

Decision: The Rust auditor's audit_baselines.jsonl (longitudinal drift signal accumulated across PRs #6#13) is not ported to Go. The Go auditor begins a fresh audit_baselines.jsonl lineage on its first PR.

Rationale: The drift signal is anchored to specific Rust commits, verdict shapes, and Kimi/Haiku/Opus rotation traces. Carrying it into the Go era would be like grafting Rust-PR audit history onto the first Go PR's prologue — confusing more than informative. Restarting gives the Go auditor a clean baseline to measure drift against.

The existing Rust audit_baselines.jsonl stays in the Rust repo as a historical record.


ADR-002: storaged per-prefix PUT cap (vectord _vectors/ → 4 GiB)

Date: 2026-04-29 Decided by: J Status: Implemented (commit 423a381)

storaged enforces a 256 MiB per-PUT body cap as DoS protection (MaxBytesReader + Content-Length check). Keys under _vectors/ (vectord LHV1 persistence) get a raised cap of 4 GiB; everything else stays at 256 MiB.

Rationale: the 500K staffing test surfaced that single-file LHV1 above ~150K vectors at d=768 hits the 256 MiB cap. manager.Uploader already streams on the outbound side, so the cap is a safety gate not a memory bottleneck — raising it for the vector path doesn't introduce new memory pressure. Per-prefix preserves the safety gate for routine traffic while opening the documented production path. Splitting LHV1 across multiple keys was rejected because G1P specifically shipped the single-Put framed format to eliminate torn-write — multi-key would re-introduce that failure mode.

Follow-up: if production workloads exceed 4 GiB single-file LHV1, refactor to operator-driven config (env/TOML) rather than bumping the constant. The function-level maxPutBytesFor(key) in cmd/storaged/main.go keeps that drop-in clean.


ADR-003: Inter-service auth posture — Bearer token + IP allowlist

Date: 2026-04-29 Decided by: J + Claude Status: Decided — wiring deferred to Sprint 1

Decision: When inter-service auth is needed (the moment any binary binds non-loopback or the deployment crosses a trust boundary), the auth model is a Bearer token loaded from secrets-go.toml plus a configurable IP allowlist. Both layers required: the token authenticates the caller; the allowlist narrows the network surface.

Status today (G0): zero auth middleware. Every binary binds 127.0.0.1 by default; commit 6af0520 (R-001 partial fix) refuses non-loopback bind unless the per-service LH_<SVC>_ALLOW_NONLOOPBACK=1 env override is set. The override-and-no-auth combination is the worst case — this ADR locks in what we'll require before any production override fires.

What gets implemented when auth lands

  1. secrets-go.toml adds a [auth] section:

    [auth]
    token = "..."          # 32+ random bytes, hex-encoded
    allowed_ips = ["10.0.0.0/8", "127.0.0.1/32"]   # CIDR list
    
  2. internal/shared/auth.go ships a single chi middleware:

    func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler
    
    • Empty cfg.Token → middleware is a no-op (G0 dev mode).
    • Non-empty token → reject 401 unless request has Authorization: Bearer <token> matching constant-time.
    • Non-empty allowed_ips → reject 403 unless r.RemoteAddr (or X-Forwarded-For first hop, configurable) is in CIDR set.
    • /health exempt — load balancers + monitors need it open.
  3. Every cmd/<svc>/main.go adds one line:

    r.Use(shared.RequireAuth(cfg.Auth))
    

    Mounted before register(r) so it covers every route the binary exposes after /health.

  4. shared.Run startup gate: if bind is non-loopback AND cfg.Auth.Token == "", refuse to start. The implicit "localhost is the auth layer" guarantee becomes explicit when crossing the loopback boundary.

Alternatives considered

Option Why rejected
mTLS Strongest but heaviest — every binary needs cert provisioning, rotation tooling, and cert-aware client wiring. Overkill for inter-service traffic that already passes through a single gateway. Reconsider when Lakehouse-Go runs across machines.
JWT with short TTL Buys nothing over Bearer here — there's no third-party identity provider, no claim hierarchy worth modelling. Pure token has the same security properties at half the wire complexity.
No auth, IP-allowlist only One stolen IP allowlist entry → full access. Token + IP is defense in depth; either alone is too weak.
OAuth2 via external IdP Rejected for G0G3 timeline. No external IdP commitment. Revisit if Lakehouse-Go ever serves end-user requests directly (today everything fronts through the staffing co-pilot which has its own session model).

Constant-time comparison + token hygiene

Token comparison must use crypto/subtle.ConstantTimeCompare — naive == is vulnerable to timing attacks against an attacker who can issue many requests and measure round-trip. Token rotation is operator-driven via secrets-go.toml edit + restart; G0 doesn't need rotate-without-restart.

What this ADR does NOT do

  • Does not implement the middleware. Code lands in Sprint 1.
  • Does not require token in G0 dev. Empty token → no-op. Smokes
    • proof harness keep working without setting tokens.
  • Does not address gateway → end-user auth. Gateway terminates inter-service auth at its inbound; if end-users hit gateway from a browser, that's a different ADR (likely cookie/session, fronted by a reverse proxy that handles user auth).

How this closes audit findings

  • R-001 (queryd /sql RCE-equivalent off-loopback): the bind gate prevents accidental exposure today; this ADR specifies the guardrail when intentional exposure is needed.
  • R-007 (zero auth middleware): answered by the design above; R-007 stays open until the middleware is implemented but is no longer "design TBD."
  • R-010 (no CORS posture): orthogonal to inter-service auth, but the RequireAuth middleware sits at the right layer to add CORS handling later (browsers don't reach inter-service routes in the current design, so CORS is also Sprint 1+ when end-user requests start landing).

(Future ADRs from ADR-004 onward will be added as the Go implementation accrues design decisions — e.g. HNSW parameter choices, pathway-memory hash function, auditor model rotation, etc.)