Four shipped features and a PRD realignment, all measured end-to-end:
HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
/hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us
at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s,
80/30 = 1.00 recall in 230s
Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows
Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
/etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
/storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
X-Lakehouse-Rescue-Used observability headers on fallback
Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
(586 rows × 13 cols, 6 batches, 196ms end-to-end)
PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
decision rules
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ADR-017: Federated Multi-Bucket Storage
Status: Accepted — 2026-04-16
Owner: J
Implements: Phase 15 horizon item "Federated multi-bucket query"
Depends on: ADR-001 (object storage as source of truth), ADR-009 (delta files)
Problem
Today every dataset lives in exactly one object storage backend (/home/profit/lakehouse/data on local disk). Three scenarios break that assumption:
- Multi-tenant hosting — Client A's staffing data must stay in Client A's S3 bucket. Client B's in Client B's. The lakehouse is the compute plane, their bucket is the data plane.
- Data residency — European clients require their data never leaves an EU-region bucket. Same lakehouse instance, different physical storage.
- Cross-dataset analytics — A user asks "average placement margin across all clients we manage." The query must scan Parquet across N buckets as a single logical table.
None of this is possible with a single object_store instance. DataFusion can register multiple object stores under different URL schemes, but nothing in the current stack surfaces that capability.
Non-Goals (for the MVP)
- Not building cross-bucket JOIN optimization. DataFusion pushes predicates down per-partition already; that's sufficient.
- Not handling bucket-level auth policies (who-can-read-which-bucket). That's Phase 13's job.
- Not supporting heterogeneous backends in a single logical dataset (e.g. half in S3, half local). One dataset = one bucket.
- Not automatic replication across buckets. Each bucket stands alone.
- Not scheduled sync/migration. Manual today.
Design
Core invariant
Every ObjectRef belongs to exactly one named bucket. The catalog is the single authority over which bucket holds which dataset. Queries transparently span buckets by virtue of DataFusion's multi-store capability — the catalog tells the query engine where each file lives.
Naming
Three classes of bucket:
- `primary` — system-wide default. Shared catalog, reference datasets, and vectors that aren't tenant-specific. Always present, always the fallback. Corresponds to today's `./data` directory.
- `profile:{user_id}` — per-user workspace bucket. Each user's personal data, their vector indexes, their workspace state. Provisioned on first use. Pre-loaded ("hot loaded") when the user activates their profile.
- `{named}` — tenant/client buckets. Explicit, configured up front. E.g. `client_a`, `client_eu`.
Conventions:
- Bucket URL = `{bucket_id}://{key}` internally (e.g. `primary://datasets/candidates.parquet`, `profile:daisy://vectors/notes.parquet`)
- DataFusion sees each bucket as a separate registered object store; `ListingTable` paths include the bucket scheme
- Profile buckets use the `profile:` prefix as both a namespace marker and a DataFusion URL scheme
Config shape — lakehouse.toml
```toml
[storage]
# Backward compat — the existing [storage] block still creates a "primary"
# bucket if no [[storage.buckets]] entries are defined.
root = "./data"

# Where profile buckets are rooted when not otherwise configured.
# A profile:{user} bucket resolves to {profile_root}/{user}/ unless the user
# config overrides.
profile_root = "./data/_profiles"

[[storage.buckets]]
name = "primary"
backend = "local"
root = "./data"
rescue_bucket = "rescue" # single shared fallback for failed reads (see §Failure mode)

[[storage.buckets]]
name = "rescue"
backend = "local"
root = "./data/_rescue"

[[storage.buckets]]
name = "client_a"
backend = "s3"
bucket = "client-a-lakehouse"
region = "us-east-1"
endpoint = "https://s3.amazonaws.com"
secret_ref = "client_a_aws" # NOT the literal secret — a handle

[[storage.buckets]]
name = "client_eu"
backend = "s3"
bucket = "client-eu-lakehouse"
region = "eu-west-1"
secret_ref = "client_eu_aws"
```
Rules:
- Credentials never appear in this file. `secret_ref` is a handle resolved through the secrets layer (see §Secrets).
- If `[[storage.buckets]]` is absent, the existing `[storage]` block creates a single `"primary"` bucket. Zero-config upgrade.
- Adding a bucket requires a restart (MVP); runtime addition via API is a polish item.
Secrets
Credentials are never in lakehouse.toml. A pluggable SecretsProvider trait resolves secret_ref handles to credential structs at startup:
```rust
pub trait SecretsProvider: Send + Sync {
    async fn resolve(&self, handle: &str) -> Result<BucketCredentials, String>;
}
```
MVP ships one implementation — FileSecretsProvider reading /etc/lakehouse/secrets.toml (or a path given by LAKEHOUSE_SECRETS env var):
```toml
# /etc/lakehouse/secrets.toml — root:root, mode 0600, NEVER in git
[client_a_aws]
access_key = "AKIA..."
secret_key = "wJalrXUt..."

[client_eu_aws]
access_key = "AKIA..."
secret_key = "..."
```
Future providers plug into the same trait without touching core code:
- `VaultSecretsProvider` — HashiCorp Vault
- `SopsSecretsProvider` — age/gpg-encrypted files in git
- `KeyringSecretsProvider` — OS-level keyring
Startup fails fast if a secret_ref can't resolve — no silent fallback to anonymous S3.
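A minimal sketch of what `FileSecretsProvider` could look like, assuming the `SecretsProvider` trait above and the `access_key`/`secret_key` shape from the example file; the permission check, error strings, and helper names are illustrative, not the shipped code.

```rust
use std::collections::HashMap;
use std::os::unix::fs::PermissionsExt;
use std::path::PathBuf;

// Debug deliberately not derived; see the redaction sketch under §Risks.
#[derive(Clone, serde::Deserialize)]
pub struct BucketCredentials {
    pub access_key: String,
    pub secret_key: String,
}

pub struct FileSecretsProvider {
    path: PathBuf,
}

impl FileSecretsProvider {
    pub fn new(path: impl Into<PathBuf>) -> Self {
        Self { path: path.into() }
    }

    /// Load and validate the whole secrets file. Rejects group/world-readable
    /// files, enforcing the 0600 rule above.
    fn load(&self) -> Result<HashMap<String, BucketCredentials>, String> {
        let meta = std::fs::metadata(&self.path)
            .map_err(|e| format!("secrets file {}: {e}", self.path.display()))?;
        if meta.permissions().mode() & 0o077 != 0 {
            return Err(format!("{} must be mode 0600", self.path.display()));
        }
        let raw = std::fs::read_to_string(&self.path).map_err(|e| e.to_string())?;
        toml::from_str(&raw).map_err(|e| format!("secrets parse error: {e}"))
    }
}

impl SecretsProvider for FileSecretsProvider {
    // Native async-fn-in-trait (Rust 1.75+), matching the trait sketch above.
    async fn resolve(&self, handle: &str) -> Result<BucketCredentials, String> {
        self.load()?
            .remove(handle)
            .ok_or_else(|| format!("unknown secret_ref {handle:?}"))
    }
}
```

Honoring the `LAKEHOUSE_SECRETS` override is then a `std::env::var` check at construction time.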
Schema changes
ObjectRef already has a bucket: String field — currently it carries the S3 bucket name or "local" inconsistently. Repurpose it as the catalog bucket name:
```rust
use chrono::{DateTime, Utc};

pub struct ObjectRef {
    pub bucket: String,   // NOW: catalog bucket name, e.g. "primary" or "client_a"
    pub key: String,      // object key within that bucket
    pub size_bytes: u64,
    pub created_at: DateTime<Utc>,
}
```
Migration: a resync-missing-style one-shot sets bucket = "primary" on every existing ObjectRef whose value is empty or ambiguous.
DatasetManifest — no new top-level field. objects: Vec<ObjectRef> is already where bucket info lives. Per the design invariant (one dataset, one bucket), all ObjectRefs inside a manifest share a bucket, so we can add a convenience accessor manifest.bucket_id() → objects[0].bucket.clone().
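A sketch of that accessor, hedged to return `Option` rather than index blindly (the shipped signature may differ):

```rust
impl DatasetManifest {
    /// Per the one-dataset-one-bucket invariant, every ObjectRef in a manifest
    /// shares a bucket, so the first object's bucket is canonical. Option
    /// guards the shouldn't-happen empty-manifest case instead of panicking.
    pub fn bucket_id(&self) -> Option<String> {
        self.objects.first().map(|o| o.bucket.clone())
    }
}
```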
Storage layer changes
storaged::BucketRegistry — new struct, replaces the single Arc<dyn ObjectStore> currently threaded through gateway.
```rust
pub struct BucketRegistry {
    buckets: HashMap<String, Arc<dyn ObjectStore>>,
    default: String, // "primary"
}

impl BucketRegistry {
    pub fn from_config(cfg: &StorageConfig) -> Result<Self, String>;
    pub fn get(&self, bucket_id: &str) -> Result<&Arc<dyn ObjectStore>, String>;
    pub fn default_store(&self) -> &Arc<dyn ObjectStore>;
    pub fn list_buckets(&self) -> Vec<BucketInfo>;
}
```
storaged::ops gets a bucket parameter on every call:
```rust
pub async fn put(registry: &BucketRegistry, bucket: &str, key: &str, data: Bytes) -> Result<(), String>;
pub async fn get(registry: &BucketRegistry, bucket: &str, key: &str) -> Result<Bytes, String>;
// etc.
```
Every existing call site passes "primary" initially; specific call sites (ingestd, vectord) learn to route.
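A sketch of registry construction under the config shape above. The `BucketConfig` struct is an assumed mirror of a `[[storage.buckets]]` entry, only the local backend is shown, and S3 credential resolution through `SecretsProvider` is elided since `resolve` is async:

```rust
use std::{collections::HashMap, sync::Arc};

use object_store::{local::LocalFileSystem, ObjectStore};

// Assumed mirror of a [[storage.buckets]] entry; not the shipped struct.
pub struct BucketConfig {
    pub name: String,
    pub backend: String,      // "local" | "s3"
    pub root: Option<String>, // local backend only
    pub rescue_bucket: Option<String>,
}

impl BucketRegistry {
    pub fn from_config(buckets: &[BucketConfig]) -> Result<Self, String> {
        let mut map: HashMap<String, Arc<dyn ObjectStore>> = HashMap::new();
        for b in buckets {
            let store: Arc<dyn ObjectStore> = match b.backend.as_str() {
                "local" => {
                    let root = b.root.as_deref().ok_or_else(|| {
                        format!("bucket {:?}: local backend needs root", b.name)
                    })?;
                    Arc::new(LocalFileSystem::new_with_prefix(root).map_err(|e| e.to_string())?)
                }
                // "s3" would build an AmazonS3 store here after resolving
                // secret_ref through the SecretsProvider at startup.
                other => return Err(format!("unsupported backend {other:?}")),
            };
            map.insert(b.name.clone(), store);
        }
        // The design says primary always exists; enforce it at build time.
        if !map.contains_key("primary") {
            return Err("config must define a \"primary\" bucket".into());
        }
        Ok(Self { buckets: map, default: "primary".into() })
    }
}
```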
Query layer changes
queryd — on startup, register every bucket as a separate object_store URL with DataFusion:
```rust
for (name, store) in bucket_registry.iter() {
    let url = format!("lakehouse-{}://", name).parse::<Url>()?;
    session_ctx.register_object_store(&url, store.clone());
}
```
ListingTable URLs reference the bucket:
```rust
let url = format!("lakehouse-{}://{}", obj.bucket, obj.key);
```
DataFusion handles cross-bucket scans natively — each partition gets routed to the correct store.
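As a usage sketch: once the stores are registered under their `lakehouse-{bucket}://` schemes, exposing one bucket-scoped Parquet object to DataFusion is just a path in that scheme. The helper and its parameters are illustrative:

```rust
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

/// Illustrative helper: expose one bucket-scoped Parquet object as a table.
/// Assumes the bucket's object store was registered as in the loop above.
async fn register_dataset(
    ctx: &SessionContext,
    table_name: &str,
    bucket: &str,
    key: &str,
) -> Result<()> {
    // e.g. lakehouse-client_a://datasets/candidates.parquet
    let url = format!("lakehouse-{bucket}://{key}");
    ctx.register_parquet(table_name, &url, ParquetReadOptions::default())
        .await
}
```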
Gateway API additions
| Endpoint | Purpose |
|---|---|
| `GET /storage/buckets` | List all configured buckets + their backend + reachability status |
| `GET /storage/errors` | Recent bucket operation failures (filtered by bucket/time) |
| `GET /storage/health` | Summary: which buckets have errored in the last 5 minutes |
| `POST /ingest/file` + `X-Lakehouse-Bucket: client_a` | Header selects target bucket; absent → primary |
| `POST /profile/{user}/activate` | Pre-load user's profile bucket: embedding caches warm, HNSW indexes rebuilt, workspace state hydrated |
| `POST /profile/{user}/deactivate` | Evict profile bucket's cached state |
| `POST /catalog/datasets/by-name/{name}/relocate` | Move a dataset to another bucket (polish) |
Routing rule for all existing endpoints:
- `X-Lakehouse-Bucket: {name}` header → use that bucket
- No header → `primary`
- Unknown bucket name → 404
Catalog and query endpoints span all buckets by default — a query sees all buckets the user has access to unless explicitly scoped.
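A sketch of the routing rule itself, assuming an axum-style gateway (the `HeaderMap` and `StatusCode` types are its re-exports of the `http` crate) and the `BucketRegistry::get` lookup above:

```rust
use axum::http::{HeaderMap, StatusCode};

/// Resolve the target bucket for a request, per the routing rule:
/// header → that bucket, no header → "primary", unknown name → 404.
fn resolve_bucket(
    registry: &BucketRegistry,
    headers: &HeaderMap,
) -> Result<String, (StatusCode, String)> {
    let name = headers
        .get("x-lakehouse-bucket")
        .and_then(|v| v.to_str().ok())
        .unwrap_or("primary");
    match registry.get(name) {
        Ok(_) => Ok(name.to_string()),
        Err(_) => Err((StatusCode::NOT_FOUND, format!("unknown bucket {name:?}"))),
    }
}
```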
Profile hot load
When POST /profile/{user}/activate fires:
- Resolve the `profile:{user}` bucket — create if missing (first-time activation)
- Scan the catalog for all datasets whose `owner == user` or which live in that bucket
- For each vector index belonging to the profile: pre-load embeddings into `EmbeddingCache`, pre-build HNSW with the locked-in default config (ec=80 es=30)
- Hydrate any saved workspace state (see Phase 8.5) into memory
- Return a manifest summary — what's hot, what wasn't found, total memory used
Design contract: /profile/{user}/activate should be idempotent. Calling it twice is cheap; the second call is a no-op because the cache is already warm.
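One way to honor that contract is a per-profile warm marker so repeat activations short-circuit. `ProfileCache` and `warm_profiles` are illustrative names, not the shipped structure:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

pub struct ProfileCache {
    warm_profiles: Mutex<HashSet<String>>,
}

impl ProfileCache {
    /// Run the expensive warm-up at most once per profile; returns whether this
    /// call did the work. The lock is held across warm_up so two concurrent
    /// activations of the same profile can't duplicate it.
    pub fn activate(&self, user: &str, warm_up: impl FnOnce()) -> bool {
        let mut warm = self.warm_profiles.lock().unwrap();
        if warm.contains(user) {
            return false; // already warm: the cheap no-op second call
        }
        warm_up();
        warm.insert(user.to_string());
        true
    }
}
```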
Failure mode — rescue bucket + loud errors
The design goal is failures stay findable. When something breaks, you should know within seconds which bucket, which key, which operation, and what happened — without grepping logs.
Writes: always go to the target bucket. If the target bucket is unreachable, the request fails hard with 503 and the failure is recorded in the error journal. No queueing, no silent fallback — writes that vanish silently are the worst possible failure mode.
Reads:
- First attempt: target bucket.
- On failure: fall through to the single configured `rescue_bucket`.
- If the key exists in rescue, serve it with `X-Lakehouse-Rescue-Used: true` and `X-Lakehouse-Original-Bucket: {target}` response headers so the caller knows fallback happened.
- If the key isn't in rescue either: 404 with both buckets listed in the error body.
- Every rescue fallback is appended to the error journal (see below).
Rescue bucket is a single, well-known bucket configured at the storage level. Typical setup: a periodic sync from primary → rescue so that when primary has an outage, recent data is still reachable. Sync is outside this ADR's scope — operator's job.
The rest of the system keeps working: a failed bucket never takes down the gateway, other buckets remain queryable, and cross-bucket queries scan what they can and surface per-partition failures (which bucket, which key) in the error body.
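A sketch of the read path under those rules, layered on the `ops` signatures above. `ReadOutcome` is an illustrative name, and the journal hook is stubbed since the real writer is `storaged::error_journal` (ADR-018):

```rust
use bytes::Bytes;

/// Distinguishes a direct hit from a rescued read so the HTTP layer can set
/// X-Lakehouse-Rescue-Used / X-Lakehouse-Original-Bucket.
pub enum ReadOutcome {
    Direct(Bytes),
    Rescued { data: Bytes, original_bucket: String },
}

// Assumed journal hook; the real writer is storaged::error_journal.
pub struct ErrorJournal;
impl ErrorJournal {
    pub async fn record_read_failure(&self, _bucket: &str, _key: &str, _err: &str, _rescued: bool) {}
}

pub async fn get_with_rescue(
    registry: &BucketRegistry,
    journal: &ErrorJournal,
    bucket: &str,
    rescue: &str,
    key: &str,
) -> Result<ReadOutcome, String> {
    match get(registry, bucket, key).await {
        Ok(data) => Ok(ReadOutcome::Direct(data)),
        Err(primary_err) => {
            // Target bucket failed: try the shared rescue bucket, and journal
            // the fallback either way so the failure stays findable.
            let rescued = get(registry, rescue, key).await;
            journal
                .record_read_failure(bucket, key, &primary_err, rescued.is_ok())
                .await;
            match rescued {
                Ok(data) => Ok(ReadOutcome::Rescued {
                    data,
                    original_bucket: bucket.to_string(),
                }),
                Err(rescue_err) => Err(format!(
                    "read {key:?}: {bucket:?} failed ({primary_err}); rescue {rescue:?} failed ({rescue_err})"
                )),
            }
        }
    }
}
```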
Error journal — "find errors ez"
A simple append-only log of every bucket operation that failed or fell through. Stored at primary://_errors/bucket_errors.jsonl (yes, primary — because we need it findable even if other buckets are down).
{"ts":"2026-04-16T10:30:15Z","op":"read","target":"client_a","key":"datasets/candidates.parquet","error":"connection refused","rescued":true,"rescue_key_found":true}
{"ts":"2026-04-16T10:30:18Z","op":"write","target":"client_a","key":"datasets/new.parquet","error":"connection refused","rescued":false}
Exposed via:
- `GET /storage/errors?limit=50` — recent errors
- `GET /storage/errors?bucket=client_a` — filter by bucket
- `GET /storage/errors?since=2026-04-16T10:00` — filter by time
- `GET /storage/health` — summary: which buckets have errored in the last 5 minutes
The journal doesn't replace logs (tracing still gets everything), but it gives you a single authoritative place to answer the question "has anything been failing?" without tailing systemd journals. If the journal is empty, nothing is broken.
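The entry shape those JSONL lines imply, expressed with serde. Field names match the examples above; the `to_jsonl_line` helper is illustrative, and the actual write path is the append-log pattern from ADR-018:

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
pub struct BucketErrorEntry {
    pub ts: DateTime<Utc>,
    pub op: String,     // "read" | "write"
    pub target: String, // bucket the operation was aimed at
    pub key: String,
    pub error: String,
    pub rescued: bool,
    // Present only on rescued reads, matching the example lines above.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub rescue_key_found: Option<bool>,
}

/// One serialized entry per line is exactly the JSONL format shown above.
fn to_jsonl_line(entry: &BucketErrorEntry) -> serde_json::Result<String> {
    Ok(format!("{}\n", serde_json::to_string(entry)?))
}
```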
Vectors are bucket-scoped
Vector indexes live in the same bucket as their source data:
- Data in `primary` → vectors in `primary://vectors/{index_name}.parquet`
- Data in `profile:daisy` → vectors in `profile:daisy://vectors/{index_name}.parquet`
- Data in `client_a` → vectors in `client_a://vectors/{index_name}.parquet`
The existing _hnsw_trials/ and _hnsw_evals/ prefixes also live per-bucket. The trial journal for a profile's index stays inside that profile's bucket — so tenants don't see each other's tuning history.
Profile hot load consequence: activating a profile pre-loads ALL vector indexes in that profile's bucket into EmbeddingCache + builds HNSW with the default config. First query after activation is warm.
Implementation phases
MVP (this is what I'd build next session)
- `shared::config` — extend `StorageConfig` with `buckets: Vec<BucketConfig>` and `profile_root`, with backward-compat fallback from the single `[storage]` block
- `shared::secrets` — `SecretsProvider` trait + `FileSecretsProvider` impl reading `/etc/lakehouse/secrets.toml`
- `storaged::BucketRegistry` — multi-backend registry; rescue-bucket tracking; lazy profile bucket creation
- `storaged::ops` — add `bucket: &str` param to every call; read path falls back to the rescue bucket when the target is unreachable; write path hard-fails
- `gateway` — `X-Lakehouse-Bucket` header middleware → routing decision per request
- `gateway` — `GET /storage/buckets` + `POST /profile/{user}/activate|deactivate`
- `catalogd` — one-shot migration: every ObjectRef without an explicit bucket gets `"primary"`
- `queryd::session` — register every bucket as a DataFusion ObjectStore under its own URL scheme
- `vectord` — every vector/trial/eval storage path becomes `{bucket}://vectors/...`; profile activate pre-loads HNSW
- `storaged::error_journal` — append-only JSONL writer at `primary://_errors/bucket_errors.jsonl`, hooked into every ops call-site
- Test harness — configure `primary` + `profile:testuser` + a mock `client_a` pointing at a 2nd local dir + `rescue` as a 3rd local dir; verify cross-bucket join, profile activate, rescue fallback, error journal visibility
Success gate #1 — cross-bucket query: `SELECT SUM(hours_regular) FROM primary.timesheets t1 JOIN "profile:testuser".timesheets t2 USING (placement_id)` returns a sensible number.
Success gate #2 — profile hot load: `POST /profile/daisy/activate` pre-builds HNSW over her personal vector index; the next `/search` call against that index returns in under 1ms with no further warm-up.
Success gate #3 — rescue fallback + error visibility: rename client_a's directory to simulate an outage. `GET` for a key returns the copy from rescue with `X-Lakehouse-Rescue-Used: true` + `X-Lakehouse-Original-Bucket: client_a` headers. `PUT` returns 503. `GET /storage/errors` shows both events with full context. `GET /storage/health` flags client_a as unhealthy.
Polish (follow-up)
- Bucket health check — `GET /storage/buckets` shows reachable/unreachable
- Bucket relocation — `POST /catalog/datasets/{name}/relocate` (copy to new bucket, update manifest, delete from old)
- Runtime bucket addition — `POST /storage/buckets` adds a bucket without restart (stores in a separate `buckets.toml`)
- S3 streaming optimization — for very large Parquet files, stream the footer instead of loading the whole blob
Decisions (2026-04-16)
- Credentials — Pluggable `SecretsProvider` trait. MVP ships `FileSecretsProvider` at `/etc/lakehouse/secrets.toml` (root:root, 0600). Future providers (Vault, SOPS, OS keyring) plug in without touching core.
- Primary default + profile bucket model — `primary` always exists as the system-wide fallback. Per-user `profile:{user}` buckets are provisioned on first activate. Named tenant buckets are configured explicitly.
- Ingest routing — `X-Lakehouse-Bucket` header. Absent → `primary`. Unknown → 404.
- Failure — Writes: hard fail on target, logged to the error journal. Reads: fall through to a single shared `rescue_bucket`; response headers `X-Lakehouse-Rescue-Used` + `X-Lakehouse-Original-Bucket` make the fallback visible. Every failure is appended to an error journal at `primary://_errors/bucket_errors.jsonl`, queryable via `GET /storage/errors`. Failed buckets never take down unrelated paths.
- Vectors in their bucket — Vectors, trial journals, and eval sets all live in the same bucket as their source data. Profile activate pre-loads the full vector stack for that profile using the locked-in HNSW default (ec=80 es=30).
Risks
| Risk | Severity | Mitigation |
|---|---|---|
| DataFusion multi-store predicate pushdown underperforms on cross-bucket JOINs | Medium | Benchmark early. If bad, mark cross-bucket JOINs as a known sharp edge. |
| S3 credentials accidentally logged | High | Never stringify credential fields. Add a test that verifies `secret_key` is redacted in every code path. |
| Existing ObjectRefs have inconsistent bucket values (some "local", some empty) | Low | One-shot migration normalizes to `"primary"`. |
| Bucket config drift between restarts | Medium | Config-driven means restart = canonical. A runtime API would need careful state reconciliation; defer. |
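For the credential-logging risk, a sketch of the mitigation the table asks for: a hand-written `Debug` impl that redacts the secret, plus the test. The struct shape follows the secrets example; everything else is illustrative:

```rust
use std::fmt;

pub struct BucketCredentials {
    pub access_key: String,
    pub secret_key: String,
}

// Hand-written Debug so an accidental {:?} in a log line can never leak the key.
impl fmt::Debug for BucketCredentials {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("BucketCredentials")
            .field("access_key", &self.access_key)
            .field("secret_key", &"<redacted>")
            .finish()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn debug_never_leaks_secret_key() {
        let creds = BucketCredentials {
            access_key: "AKIAEXAMPLE".into(),
            secret_key: "wJalrXUtEXAMPLE".into(),
        };
        let rendered = format!("{creds:?}");
        assert!(!rendered.contains("wJalrXUt"));
        assert!(rendered.contains("<redacted>"));
    }
}
```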
Decision log
- 1 dataset = 1 bucket (rather than datasets spanning buckets) — vastly simpler, matches the "tenant isolation" real use case. Revisit only if someone has an actual multi-bucket dataset requirement.
- Config-first, runtime-later — YAML/TOML for the MVP; runtime bucket add/remove is polish. Deployment restart is cheap in this stack.
- `bucket: String` repurposed, no new field — avoids a schema migration on every manifest.
- DataFusion's multi-store primitive handles cross-bucket queries — don't reinvent. We're wiring it up, not building it.