Architectural snapshot of the lakehouse codebase at the point where the
full matrix-driven agent loop with Mem0 versioning + deletion was
validated end-to-end.
WHAT THIS REPO IS
A clean single-commit snapshot of the lakehouse code. Heavy test data
(.parquet datasets, vector indexes) excluded — see REPLICATION.md for
regen path. Full lakehouse history at git.agentview.dev/profit/lakehouse.
WHAT WAS PROVEN
- Vector retrieval across the multi-corpus matrix (chicago_permits + entity
briefs + sec_tickers + distilled procedural + llm_team runs)
- Observer hand-review (cloud + heuristic fallback) gating each candidate
- Local-model agent loop (qwen3.5:latest) with tool use + scratchpad
- Playbook seal on success → next-iter retrieval surfaces it as preamble
- Mem0 versioning + deletion in pathway_memory:
* UPSERT: ADD on new workflow, UPDATE bumps replay_count on identical
* REVISE: chains versions, parent.superseded_at + superseded_by stamped
* RETIRE: marks specific trace retired with reason, excluded from retrieval
* HISTORY: walks chain root→tip, cycle-safe
KEY DIRECTORIES
- crates/vectord/src/pathway_memory.rs — Mem0 ops live here
- crates/vectord/src/playbook_memory.rs — original Mem0 reference
- tests/agent_test/ — local-model agent harness + PRD + session archives
- scripts/dump_raw_corpus.sh — MinIO bucket dump (raw test corpus)
- scripts/vectorize_raw_corpus.ts — corpus → vector indexes
- scripts/analyze_chicago_contracts.ts — real inference pipeline
- scripts/seal_agent_playbook.ts — Mem0 upsert from agent traces
Replication: see REPLICATION.md for Debian 13 clean install + cloud-only
adaptation (no local Ollama).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# ADR-017: Federated Multi-Bucket Storage

**Status:** Accepted — 2026-04-16

**Owner:** J

**Implements:** Phase 15 horizon item "Federated multi-bucket query"

**Depends on:** ADR-001 (object storage as source of truth), ADR-009 (delta files)

---

## Problem

Today every dataset lives in exactly one object storage backend (`/home/profit/lakehouse/data` on local disk). Three scenarios break that assumption:

1. **Multi-tenant hosting** — Client A's staffing data must stay in Client A's S3 bucket, Client B's in Client B's. The lakehouse is the compute plane; their bucket is the data plane.
2. **Data residency** — European clients require that their data never leave an EU-region bucket. Same lakehouse instance, different physical storage.
3. **Cross-dataset analytics** — A user asks "average placement margin across all clients we manage." The query must scan Parquet across N buckets as a single logical table.

None of this is possible with a single `object_store` instance. `DataFusion` *can* register multiple object stores under different URL schemes, but nothing in the current stack surfaces that capability.

## Non-Goals (for the MVP)

- **Not** building cross-bucket JOIN optimization. DataFusion pushes predicates down per-partition already; that's sufficient.
- **Not** handling bucket-level auth policies (who-can-read-which-bucket). That's Phase 13's job.
- **Not** supporting heterogeneous backends in a single logical dataset (e.g. half in S3, half local). One dataset = one bucket.
- **Not** automatic replication across buckets. Each bucket stands alone.
- **Not** scheduled sync/migration. Manual today.

---

## Design

### Core invariant

Every `ObjectRef` belongs to exactly one named bucket. The catalog is the single authority over which bucket holds which dataset. Queries transparently span buckets by virtue of DataFusion's multi-store capability — the catalog tells the query engine where each file lives.

### Naming

Three classes of bucket:

- **`primary`** — system-wide default. Shared catalog, reference datasets, and vectors that aren't tenant-specific. Always present, always the fallback. Corresponds to today's `./data` directory.
- **`profile:{user_id}`** — per-user workspace bucket. Each user's personal data, their vector indexes, their workspace state. Provisioned on first use. Pre-loaded ("hot loaded") when the user activates their profile.
- **`{named}`** — tenant/client buckets. Explicit, configured up front. E.g. `client_a`, `client_eu`.

Conventions:

- **Bucket URL** = `{bucket_id}://{key}` internally (e.g. `primary://datasets/candidates.parquet`, `profile:daisy://vectors/notes.parquet`)
- DataFusion sees each bucket as a separate registered object store; `ListingTable` paths include the bucket scheme
- Profile buckets use the `profile:` prefix as both a namespace marker and a DataFusion URL scheme
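
To make the convention concrete, here is a minimal sketch of splitting an internal bucket URL into its bucket id and key. `split_bucket_url` is a hypothetical helper, not an existing function in the codebase:

```rust
/// Split an internal bucket URL like "profile:daisy://vectors/notes.parquet"
/// into ("profile:daisy", "vectors/notes.parquet").
/// Hypothetical helper; the real codebase may name this differently.
fn split_bucket_url(url: &str) -> Option<(&str, &str)> {
    // Split on the full "://" separator, not on ':', because a profile
    // bucket id contains a ':' of its own (e.g. "profile:daisy").
    let (bucket, key) = url.split_once("://")?;
    if bucket.is_empty() || key.is_empty() {
        return None;
    }
    Some((bucket, key))
}

fn main() {
    assert_eq!(
        split_bucket_url("primary://datasets/candidates.parquet"),
        Some(("primary", "datasets/candidates.parquet"))
    );
    assert_eq!(
        split_bucket_url("profile:daisy://vectors/notes.parquet"),
        Some(("profile:daisy", "vectors/notes.parquet"))
    );
}
```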

### Config shape — `lakehouse.toml`

```toml
[storage]
# Backward compat — the existing [storage] block still creates a "primary"
# bucket if no [[storage.buckets]] entries are defined.
root = "./data"
# Where profile buckets are rooted when not otherwise configured.
# A profile:{user} bucket resolves to {profile_root}/{user}/ unless the user
# config overrides.
profile_root = "./data/_profiles"
# Single shared fallback for failed reads (see §Failure mode). Lives under
# [storage] — it must precede the [[storage.buckets]] entries.
rescue_bucket = "rescue"

[[storage.buckets]]
name = "primary"
backend = "local"
root = "./data"

[[storage.buckets]]
name = "rescue"
backend = "local"
root = "./data/_rescue"

[[storage.buckets]]
name = "client_a"
backend = "s3"
bucket = "client-a-lakehouse"
region = "us-east-1"
endpoint = "https://s3.amazonaws.com"
secret_ref = "client_a_aws" # NOT the literal secret — a handle

[[storage.buckets]]
name = "client_eu"
backend = "s3"
bucket = "client-eu-lakehouse"
region = "eu-west-1"
secret_ref = "client_eu_aws"
```

Rules:

- Credentials **never appear in this file**. `secret_ref` is a handle resolved through the secrets layer (see §Secrets).
- If `[[storage.buckets]]` is absent, the existing `[storage]` block creates a single `"primary"` bucket. Zero-config upgrade.
- Adding a bucket requires a restart (MVP); runtime addition via API is a polish item.
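
For illustration, the `[storage]` section above could deserialize into structs shaped roughly like this — a sketch assuming `serde` and the `toml` crate; struct and field names follow the example config, not necessarily the real code:

```rust
use serde::Deserialize;

/// Shapes follow the lakehouse.toml example above; the real structs may differ.
#[derive(Deserialize)]
pub struct StorageConfig {
    pub root: String,
    #[serde(default = "default_profile_root")]
    pub profile_root: String,
    /// Single shared fallback bucket for failed reads (see §Failure mode).
    pub rescue_bucket: Option<String>,
    /// Absent => backward-compat path: synthesize a single "primary" bucket.
    #[serde(default)]
    pub buckets: Vec<BucketConfig>,
}

#[derive(Deserialize)]
pub struct BucketConfig {
    pub name: String,
    pub backend: String,            // "local" | "s3"
    pub root: Option<String>,       // local backend only
    pub bucket: Option<String>,     // s3 backend only
    pub region: Option<String>,
    pub endpoint: Option<String>,
    pub secret_ref: Option<String>, // a handle, never a literal credential
}

fn default_profile_root() -> String {
    "./data/_profiles".to_string()
}
```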

### Secrets

Credentials are never in `lakehouse.toml`. A pluggable `SecretsProvider` trait resolves `secret_ref` handles to credential structs at startup:

```rust
pub trait SecretsProvider: Send + Sync {
    async fn resolve(&self, handle: &str) -> Result<BucketCredentials, String>;
}
```

MVP ships **one** implementation — `FileSecretsProvider`, reading `/etc/lakehouse/secrets.toml` (or a path given by the `LAKEHOUSE_SECRETS` env var):

```toml
# /etc/lakehouse/secrets.toml — root:root, mode 0600, NEVER in git
[client_a_aws]
access_key = "AKIA..."
secret_key = "wJalrXUt..."

[client_eu_aws]
access_key = "AKIA..."
secret_key = "..."
```

Future providers plug into the same trait without touching core code:

- `VaultSecretsProvider` — HashiCorp Vault
- `SopsSecretsProvider` — age/gpg-encrypted files in git
- `KeyringSecretsProvider` — OS-level keyring

Startup fails fast if a `secret_ref` can't resolve — no silent fallback to anonymous S3.
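
A minimal sketch of what `FileSecretsProvider` could look like, assuming Rust 1.75+ (async fn in traits), the `toml` and `serde` crates, and that `BucketCredentials` carries exactly the two fields shown in `secrets.toml`:

```rust
use std::collections::HashMap;
use serde::Deserialize;

/// Assumed credential shape; the real BucketCredentials may carry more fields.
#[derive(Deserialize, Clone)]
pub struct BucketCredentials {
    pub access_key: String,
    pub secret_key: String,
}

pub struct FileSecretsProvider {
    /// Parsed once at startup; each top-level table is one secret_ref handle.
    secrets: HashMap<String, BucketCredentials>,
}

impl FileSecretsProvider {
    pub fn load(path: &str) -> Result<Self, String> {
        let raw = std::fs::read_to_string(path)
            .map_err(|e| format!("cannot read {path}: {e}"))?;
        let secrets = toml::from_str(&raw)
            .map_err(|e| format!("bad secrets file {path}: {e}"))?;
        Ok(Self { secrets })
    }
}

impl SecretsProvider for FileSecretsProvider {
    async fn resolve(&self, handle: &str) -> Result<BucketCredentials, String> {
        // Fail fast: an unknown handle is a hard error at startup, never a
        // silent fallback to anonymous S3.
        self.secrets
            .get(handle)
            .cloned()
            .ok_or_else(|| format!("unresolved secret_ref: {handle}"))
    }
}
```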

### Schema changes

**`ObjectRef`** already has a `bucket: String` field — currently it carries the S3 bucket name or `"local"` inconsistently. Repurpose it as the **catalog bucket name**:

```rust
pub struct ObjectRef {
    pub bucket: String, // NOW: catalog bucket name, e.g. "primary" or "client_a"
    pub key: String,    // object key within that bucket
    pub size_bytes: u64,
    pub created_at: DateTime<Utc>,
}
```

Migration: a `resync-missing`-style one-shot sets `bucket = "primary"` on every existing ObjectRef whose value is empty or ambiguous.
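
A sketch of that one-shot, with `Catalog`, `load_all_manifests`, and `save_manifest` as hypothetical stand-ins for whatever catalogd actually exposes:

```rust
/// One-shot migration sketch. `Catalog`, `load_all_manifests`, and
/// `save_manifest` are hypothetical stand-ins for the real catalogd API.
async fn normalize_bucket_names(catalog: &Catalog) -> Result<usize, String> {
    let mut touched = 0;
    for mut manifest in catalog.load_all_manifests().await? {
        for obj in manifest.objects.iter_mut() {
            // Empty values and the old ad-hoc "local" marker both collapse
            // to the canonical "primary" catalog bucket.
            if obj.bucket.is_empty() || obj.bucket == "local" {
                obj.bucket = "primary".to_string();
                touched += 1;
            }
        }
        catalog.save_manifest(&manifest).await?;
    }
    Ok(touched)
}
```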

**`DatasetManifest`** — no new top-level field. `objects: Vec<ObjectRef>` is already where bucket info lives. Per the design invariant (one dataset, one bucket), all ObjectRefs inside a manifest share a bucket, so we can add a convenience accessor `manifest.bucket_id()` → `objects[0].bucket.clone()`.
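
One way to write that accessor so it also stays total over an empty manifest (the `Option` return is an illustrative choice, not the decided signature):

```rust
impl DatasetManifest {
    /// Valid because the invariant guarantees every ObjectRef in a manifest
    /// shares one bucket; None only for an empty manifest.
    pub fn bucket_id(&self) -> Option<&str> {
        self.objects.first().map(|o| o.bucket.as_str())
    }
}
```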

### Storage layer changes

**`storaged::BucketRegistry`** — new struct; replaces the single `Arc<dyn ObjectStore>` currently threaded through the gateway.

```rust
pub struct BucketRegistry {
    buckets: HashMap<String, Arc<dyn ObjectStore>>,
    default: String, // "primary"
}

impl BucketRegistry {
    pub fn from_config(cfg: &StorageConfig) -> Result<Self, String>;
    pub fn get(&self, bucket_id: &str) -> Result<&Arc<dyn ObjectStore>, String>;
    pub fn default_store(&self) -> &Arc<dyn ObjectStore>;
    pub fn list_buckets(&self) -> Vec<BucketInfo>;
}
```
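
A plausible sketch of the two lookup methods, continuing the struct above; the real implementation may differ:

```rust
impl BucketRegistry {
    /// Lookup sketch: the error names the unknown bucket and lists the known
    /// ones, so a typo'd bucket id is diagnosable from the message alone.
    pub fn get(&self, bucket_id: &str) -> Result<&Arc<dyn ObjectStore>, String> {
        self.buckets.get(bucket_id).ok_or_else(|| {
            let known: Vec<&str> = self.buckets.keys().map(String::as_str).collect();
            format!("unknown bucket '{bucket_id}' (known: {known:?})")
        })
    }

    pub fn default_store(&self) -> &Arc<dyn ObjectStore> {
        // `default` is validated against the map in from_config, so this
        // index cannot panic once construction has succeeded.
        &self.buckets[&self.default]
    }
}
```

Until routing lands, call sites would read like `ops::put(&registry, "primary", key, data).await`, per the signatures below.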

**`storaged::ops`** gets a bucket parameter on every call:

```rust
pub async fn put(registry: &BucketRegistry, bucket: &str, key: &str, data: Bytes) -> Result<(), String>;
pub async fn get(registry: &BucketRegistry, bucket: &str, key: &str) -> Result<Bytes, String>;
// etc.
```

Every existing call site passes `"primary"` initially; specific call sites (ingestd, vectord) learn to route.

### Query layer changes

**`queryd`** — on startup, register every bucket as a separate `object_store` URL with DataFusion:

```rust
for (name, store) in bucket_registry.iter() {
    let url = format!("lakehouse-{}://", name).parse::<Url>()?;
    session_ctx.register_object_store(&url, store.clone());
}
```

**`ListingTable`** URLs reference the bucket:

```rust
let url = format!("lakehouse-{}://{}", obj.bucket, obj.key);
```

DataFusion handles cross-bucket scans natively — each partition gets routed to the correct store.

### Gateway API additions

| Endpoint | Purpose |
|---|---|
| `GET /storage/buckets` | List all configured buckets + their backend + reachability status |
| `GET /storage/errors` | Recent bucket operation failures (filtered by bucket/time) |
| `GET /storage/health` | Summary: which buckets have errored in the last 5 minutes |
| `POST /ingest/file` + `X-Lakehouse-Bucket: client_a` | Header selects target bucket; absent → `primary` |
| `POST /profile/{user}/activate` | Pre-load user's profile bucket: embedding caches warm, HNSW indexes rebuilt, workspace state hydrated |
| `POST /profile/{user}/deactivate` | Evict profile bucket's cached state |
| `POST /catalog/datasets/by-name/{name}/relocate` | Move a dataset to another bucket (polish) |

**Routing rule for all existing endpoints** (a sketch follows the list):

1. `X-Lakehouse-Bucket: {name}` header → use that bucket
2. No header → `primary`
3. Unknown bucket name → 404
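
Framework-agnostic sketch of those three rules; `route_bucket` is a hypothetical helper, and mapping the `Err` case to an HTTP 404 happens in the gateway layer:

```rust
/// Resolve the optional X-Lakehouse-Bucket header to a bucket id.
fn route_bucket<'a>(
    header: Option<&'a str>,
    registry: &BucketRegistry,
) -> Result<&'a str, String> {
    match header {
        // Rule 2: no header selects the system-wide default.
        None => Ok("primary"),
        // Rules 1 and 3: a named bucket must exist in the registry.
        Some(name) => {
            registry.get(name)?; // Err("unknown bucket ...") becomes a 404
            Ok(name)
        }
    }
}
```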

Catalog and query endpoints span all buckets by default — a query sees all buckets the user has access to unless explicitly scoped.

### Profile hot load

When `POST /profile/{user}/activate` fires:

1. Resolve the `profile:{user}` bucket — create if missing (first-time activation)
2. Scan the catalog for all datasets whose `owner == user` or which live in that bucket
3. For each vector index belonging to the profile: pre-load embeddings into `EmbeddingCache`, pre-build HNSW with the locked-in default config (ec=80 es=30)
4. Hydrate any saved workspace state (see Phase 8.5) into memory
5. Return a manifest summary — what's hot, what wasn't found, total memory used

Design contract: `/profile/{user}/activate` should be idempotent. Calling it twice is cheap; the second call is a no-op because the cache is already warm.
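
A sketch of that contract, with `warm_profile_bucket` as a hypothetical stand-in for steps 1-4 above:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Stand-in for steps 1-4 (resolve bucket, scan catalog, pre-load, hydrate).
async fn warm_profile_bucket(_user: &str) -> Result<(), String> {
    Ok(())
}

/// Idempotency sketch: a process-wide set of already-warm profiles.
pub struct ProfileActivator {
    warm: Mutex<HashSet<String>>,
}

impl ProfileActivator {
    pub async fn activate(&self, user: &str) -> Result<(), String> {
        if self.warm.lock().unwrap().contains(user) {
            return Ok(()); // second call: cache already warm, cheap no-op
        }
        // A real implementation would also guard the window between this
        // check and the insert so two concurrent activations don't both warm.
        warm_profile_bucket(user).await?;
        self.warm.lock().unwrap().insert(user.to_string());
        Ok(())
    }
}
```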

### Failure mode — rescue bucket + loud errors

The design goal is **failures stay findable**. When something breaks, you should know within seconds which bucket, which key, which operation, and what happened — without grepping logs.

**Writes**: always go to the target bucket. If the target bucket is unreachable, the request fails hard with 503 and the failure is recorded in the error journal. No queueing, no silent fallback — writes that vanish silently are the worst possible failure mode.

**Reads** (sketched after this list):

- First attempt: target bucket.
- On failure: fall through to the single configured `rescue_bucket`.
- If the key exists in rescue, serve it with `X-Lakehouse-Rescue-Used: true` and `X-Lakehouse-Original-Bucket: {target}` response headers so the caller knows fallback happened.
- If the key isn't in rescue either: 404 with both buckets listed in the error body.
- Every rescue fallback is appended to the **error journal** (see below).
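
The read path sketched, reusing the `ops::get` signature above; `ErrorJournal` and its `append` method are hypothetical hooks into the journal described below:

```rust
/// Target bucket first, then the shared rescue bucket; journal either way.
pub async fn get_with_rescue(
    registry: &BucketRegistry,
    journal: &ErrorJournal,
    target: &str,
    key: &str,
) -> Result<(Bytes, bool), String> {
    match ops::get(registry, target, key).await {
        Ok(bytes) => Ok((bytes, false)), // served from the target bucket
        Err(err) => {
            // Fall through to the configured rescue bucket ("rescue" here
            // for brevity; the real name comes from rescue_bucket in config).
            let rescued = ops::get(registry, "rescue", key).await;
            journal.append("read", target, key, &err, rescued.is_ok()).await;
            // The bool lets the gateway set X-Lakehouse-Rescue-Used: true
            // and X-Lakehouse-Original-Bucket on the response.
            rescued.map(|bytes| (bytes, true))
        }
    }
}
```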

**Rescue bucket** is a single, well-known bucket configured at the storage level. Typical setup: a periodic sync from `primary` → `rescue` so that when `primary` has an outage, recent data is still reachable. Sync is outside this ADR's scope — operator's job.

**Rest of the system keeps working**: a failed bucket never takes down the gateway. Other buckets remain queryable. Cross-bucket queries scan what they can and surface per-partition failures (which bucket, which key) in the error body.

### Error journal — "find errors ez"

A simple append-only log of every bucket operation that failed or fell through. Stored at `primary://_errors/bucket_errors.jsonl` (yes, primary — because we need it findable even if other buckets are down).

```json
{"ts":"2026-04-16T10:30:15Z","op":"read","target":"client_a","key":"datasets/candidates.parquet","error":"connection refused","rescued":true,"rescue_key_found":true}
{"ts":"2026-04-16T10:30:18Z","op":"write","target":"client_a","key":"datasets/new.parquet","error":"connection refused","rescued":false}
```

Exposed via:

- `GET /storage/errors?limit=50` — recent errors
- `GET /storage/errors?bucket=client_a` — filter by bucket
- `GET /storage/errors?since=2026-04-16T10:00` — filter by time
- `GET /storage/health` — summary: which buckets have errored in the last 5 minutes

The journal doesn't replace logs (tracing still gets everything), but it gives you a **single authoritative place** to answer the question "has anything been failing?" without tailing systemd journals. If the journal is empty, nothing is broken.
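
A sketch of writing one journal entry (assumes the `serde_json` and `chrono` crates; `append_line` is a hypothetical helper, since object stores have no true append a real implementation might buffer and rewrite, or keep the JSONL on local disk under primary's root):

```rust
use serde_json::json;

/// Hypothetical append helper — see the caveat above about object stores.
async fn append_line(
    _registry: &BucketRegistry,
    _bucket: &str,
    _key: &str,
    _line: &str,
) -> Result<(), String> {
    Ok(())
}

pub async fn journal_error(
    registry: &BucketRegistry,
    op: &str,
    target: &str,
    key: &str,
    error: &str,
    rescued: bool,
) -> Result<(), String> {
    let entry = json!({
        "ts": chrono::Utc::now().to_rfc3339(),
        "op": op,
        "target": target,
        "key": key,
        "error": error,
        "rescued": rescued,
    });
    // Deliberately targets primary: the journal must stay findable even
    // when tenant buckets are down.
    append_line(registry, "primary", "_errors/bucket_errors.jsonl", &entry.to_string()).await
}
```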

### Vectors are bucket-scoped

Vector indexes live in the same bucket as their source data:

- Data in `primary` → vectors in `primary://vectors/{index_name}.parquet`
- Data in `profile:daisy` → vectors in `profile:daisy://vectors/{index_name}.parquet`
- Data in `client_a` → vectors in `client_a://vectors/{index_name}.parquet`

The existing `_hnsw_trials/` and `_hnsw_evals/` prefixes also live per-bucket. The trial journal for a profile's index stays inside that profile's bucket — so tenants don't see each other's tuning history.
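
Illustrative helper for the bucket-scoped layout; the exact file names under the trial/eval prefixes are assumptions:

```rust
/// Bucket-scoped paths for one index's vector stack. Illustrative only.
fn vector_paths(bucket: &str, index: &str) -> (String, String, String) {
    (
        format!("{bucket}://vectors/{index}.parquet"),
        format!("{bucket}://_hnsw_trials/{index}.jsonl"),
        format!("{bucket}://_hnsw_evals/{index}.parquet"),
    )
}
```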

**Profile hot load consequence:** activating a profile pre-loads ALL vector indexes in that profile's bucket into `EmbeddingCache` + builds HNSW with the default config. First query after activation is warm.

---

## Implementation phases

### MVP (this is what I'd build next session)

1. **`shared::config`** — extend `StorageConfig` with `buckets: Vec<BucketConfig>` and `profile_root`, with backward-compat fallback from the single `[storage]` block
2. **`shared::secrets`** — `SecretsProvider` trait + `FileSecretsProvider` impl reading `/etc/lakehouse/secrets.toml`
3. **`storaged::BucketRegistry`** — multi-backend registry; rescue-bucket wiring; lazy profile bucket creation
4. **`storaged::ops`** — add `bucket: &str` param to every call; read path falls back to the rescue bucket on unreachable; write path hard-fails
5. **`gateway`** — `X-Lakehouse-Bucket` header middleware → routing decision per request
6. **`gateway`** — `GET /storage/buckets` + `POST /profile/{user}/activate|deactivate`
7. **`catalogd`** — one-shot migration: every ObjectRef without an explicit bucket gets `"primary"`
8. **`queryd::session`** — register every bucket as a DataFusion ObjectStore under its own URL scheme
9. **`vectord`** — every vector/trial/eval storage path becomes `{bucket}://vectors/...`; profile activate pre-loads HNSW
10. **`storaged::error_journal`** — append-only JSONL writer at `primary://_errors/bucket_errors.jsonl`, hooked into every ops call site
11. **Test harness** — configure `primary` + `profile:testuser` + a mock `client_a` pointing at a 2nd local dir + `rescue` as a 3rd local dir; verify cross-bucket join, profile activate, rescue fallback, error journal visibility
**Success gate #1 — cross-bucket query:** `SELECT SUM(hours_regular) FROM primary.timesheets t1 JOIN "profile:testuser".timesheets t2 USING (placement_id)` returns a sensible number.

**Success gate #2 — profile hot load:** `POST /profile/daisy/activate` pre-builds HNSW over her personal vector index; the next `/search` call against that index completes in <1ms even though it is the first call after activation.

**Success gate #3 — rescue fallback + error visibility:** rename `client_a`'s directory to simulate an outage. A GET for a key returns the copy from `rescue` with `X-Lakehouse-Rescue-Used: true` + `X-Lakehouse-Original-Bucket: client_a` headers. A PUT returns 503. `GET /storage/errors` shows both events with full context. `GET /storage/health` flags `client_a` as unhealthy.
### Polish (follow-up)

12. **Bucket health check** — `GET /storage/buckets` shows reachable/unreachable
13. **Bucket relocation** — `POST /catalog/datasets/{name}/relocate` (copy to new bucket, update manifest, delete from old)
14. **Runtime bucket addition** — `POST /storage/buckets` adds a bucket without restart (stores in a separate `buckets.toml`)
15. **S3 streaming optimization** — for very large Parquet files, stream the footer instead of loading the whole blob

---

## Decisions (2026-04-16)

1. **Credentials** — Pluggable `SecretsProvider` trait. MVP ships `FileSecretsProvider` at `/etc/lakehouse/secrets.toml` (root:root, 0600). Future providers (Vault, SOPS, OS keyring) plug in without touching core.
2. **Primary default + profile bucket model** — `primary` always exists as system-wide fallback. Per-user `profile:{user}` buckets are provisioned on first activate. Named tenant buckets configured explicitly.
3. **Ingest routing** — `X-Lakehouse-Bucket` header. Absent → `primary`. Unknown → 404.
4. **Failure** — Writes: hard fail on target, logged to error journal. Reads: fall through to a single shared `rescue_bucket`; response headers `X-Lakehouse-Rescue-Used` + `X-Lakehouse-Original-Bucket` make the fallback visible. Every failure appended to an error journal at `primary://_errors/bucket_errors.jsonl`, queryable via `GET /storage/errors`. Failed buckets never take down unrelated paths.
5. **Vectors in their bucket** — Vectors, trial journals, and eval sets all live in the same bucket as their source data. Profile activate pre-loads the full vector stack for that profile using the locked-in HNSW default (ec=80 es=30).

---
## Risks

| Risk | Severity | Mitigation |
|---|---|---|
| DataFusion's multi-store predicate pushdown underperforms on cross-bucket JOINs | **Medium** | Benchmark early. If bad, mark cross-bucket JOINs as a known sharp edge. |
| S3 credentials accidentally logged | **High** | Never stringify credential fields. Add a test that redacts `secret_key` in every code path. |
| Existing ObjectRefs have inconsistent `bucket` values (some `"local"`, some empty) | **Low** | One-shot migration normalizes to `"primary"`. |
| Bucket config drift between restarts | **Medium** | Config-driven means restart = canonical. A runtime API would need careful state reconciliation; defer. |

---
## Decision log

- **1 dataset = 1 bucket** (rather than datasets spanning buckets) — vastly simpler, matches the "tenant isolation" real use case. Revisit only if someone has an actual multi-bucket dataset requirement.
- **Config-first, runtime-later** — TOML config for the MVP; runtime bucket add/remove is polish. Deployment restart is cheap in this stack.
- **`bucket: String` repurposed, no new field** — avoids a schema migration on every manifest.
- **DataFusion's multi-store primitive handles cross-bucket queries** — don't reinvent. We're wiring it up, not building it.