Three pieces of the multi-bucket federation made real:
1. Catalog migration (POST /catalog/migrate-buckets)
- One-shot normalizer for ObjectRef.bucket field
- Empty -> "primary"; legacy "data"/"local" -> "primary"
- Idempotent; re-running on canonical state is no-op
- Ran on existing catalog: 12 refs renamed from "data", 2 already
"primary", all 14 now canonical
2. X-Lakehouse-Bucket header middleware on ingest
- resolve_bucket() helper extracts header, returns
(bucket_name, store) or 404 with valid bucket list
- ingest_file and ingest_db_stream now route writes per-request
- Defaults to "primary" when header absent
- pipeline::ingest_file_to_bucket records the actual bucket on the
ObjectRef so catalog stays the source of truth for "where does this
data live"
- Verified: ingest with X-Lakehouse-Bucket: testing lands in
data/_testing/, ingest without header lands in data/, bad header
returns 404 with hint
3. queryd registers every bucket with DataFusion
- QueryEngine now holds Arc<BucketRegistry> instead of single store
- build_context iterates all buckets, registers each as a separate
ObjectStore under URL scheme "lakehouse-{bucket}://"
- ListingTable URLs include the per-object bucket scheme so
DataFusion routes scans automatically based on ObjectRef.bucket
- Profile bucket names like "profile:user" sanitized to
"lakehouse-profile-user" since URL host segments can't contain ":"
- Tolerant of duplicate manifest entries (pre-existing
pipeline::ingest_file behavior creates a fresh dataset id per
ingest); duplicates skipped with debug log
- Backward compat: legacy "lakehouse://data/" URL still registered
pointing at primary
Success gate: cross-bucket CROSS JOIN
SELECT p.name, p.role, a.species
FROM people_test p (bucket: testing)
CROSS JOIN animals a (bucket: primary)
LIMIT 5
returns rows correctly. DataFusion routed each scan to its bucket's
ObjectStore based on the URL scheme.
No regressions: SELECT COUNT(*) FROM candidates still returns 100000
from the primary bucket.
Deferred to Phase 17:
- POST /profile/{user}/activate (HNSW hot-load on profile switch)
- vectord storage paths becoming bucket-scoped (trial journals,
eval sets per-profile)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
Rust-first object storage system
Languages
TypeScript
38.4%
Rust
35.8%
HTML
13.9%
Python
7.8%
Shell
2.1%
Other
2%