Per 2026-05-03 phase_1_6_gate_3a scrum (10 findings, 0 convergent
location-wise but opus + kimi flagged the same audit-failure issue).
Convergent + load-bearing fix:
Audit-write failure was silently swallowed (returned 200 with empty
hmac) after photo + manifest persisted. For BIPA defensibility this
is wrong — a successful response without an audit row is exactly
the silent-failure mode the spec exists to prevent. Now: full
transactional rollback. If audit append fails after photo + manifest
commit, we remove the photo AND revert the manifest to its
pre-upload state, then return 500 with error="audit_write_failed".
Other real fixes:
Orphan-file leak (opus WARN): if put_subject fails AFTER the photo
is written, the file would be orphaned on disk with no manifest pointer.
Now removes the photo on manifest-update failure, before returning 500.
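A minimal sketch of the rollback ordering described above, using an
in-memory stand-in rather than the real storage and audit writer.
FakeStore, UploadError, and upload_with_rollback are hypothetical names
for illustration only; the real process_upload is async and maps
failures to HTTP statuses.

```rust
/// Hypothetical error type; the real endpoint maps these to HTTP 500 bodies.
#[derive(Debug, PartialEq, Eq)]
pub enum UploadError {
    ManifestWriteFailed,
    AuditWriteFailed,
}

/// In-memory stand-in for the photo directory, manifest, and audit log.
#[derive(Debug, Default)]
pub struct FakeStore {
    pub photos: Vec<String>,
    pub manifest_photo: Option<String>,
    pub audit_rows: Vec<String>,
    /// Test switches that simulate a failing manifest or audit write.
    pub fail_manifest: bool,
    pub fail_audit: bool,
}

/// Write photo, then manifest, then audit row; undo earlier steps if a
/// later one fails. Returns the audit row count on success.
pub fn upload_with_rollback(store: &mut FakeStore, path: &str) -> Result<usize, UploadError> {
    // Snapshot the manifest so it can be restored to its pre-upload state.
    let manifest_before = store.manifest_photo.clone();

    store.photos.push(path.to_string());

    if store.fail_manifest {
        // Manifest update failed: remove the photo so it is not orphaned.
        store.photos.retain(|p| p.as_str() != path);
        return Err(UploadError::ManifestWriteFailed);
    }
    store.manifest_photo = Some(path.to_string());

    if store.fail_audit {
        // Audit append failed: remove the photo AND revert the manifest.
        store.photos.retain(|p| p.as_str() != path);
        store.manifest_photo = manifest_before;
        return Err(UploadError::AuditWriteFailed);
    }

    store.audit_rows.push(format!("biometric_collection:{}", path));
    Ok(store.audit_rows.len())
}
```

The point of the ordering is that no caller can observe a success
without an audit row, and no failure leaves a photo or manifest pointer
behind.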
Content-Type parameter handling (opus WARN): real-world clients send
`image/jpeg; charset=binary` etc. Parser now strips parameters per
RFC 9110 §8.3 and matches case-insensitively. New regression test
content_type_with_parameters_accepted exercises both.
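A sketch of the parameter-stripping rule, assuming the parser only
needs to recognise the two accepted media types. ImageKind and
parse_image_content_type are illustrative names, not necessarily the
ones in biometric_endpoint.rs.

```rust
/// Image formats the upload endpoint accepts (JPEG and PNG per this commit).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageKind {
    Jpeg,
    Png,
}

/// Parse a Content-Type header value, ignoring any `; name=value`
/// parameters and comparing the media type case-insensitively.
pub fn parse_image_content_type(header: &str) -> Option<ImageKind> {
    // Everything before the first ';' is the media type; the rest is parameters.
    let media_type = header.split(';').next().unwrap_or("").trim();
    if media_type.eq_ignore_ascii_case("image/jpeg") {
        Some(ImageKind::Jpeg)
    } else if media_type.eq_ignore_ascii_case("image/png") {
        Some(ImageKind::Png)
    } else {
        // Caller would answer 415 Unsupported Media Type.
        None
    }
}
```

With this shape, `image/jpeg; charset=binary` and `IMAGE/PNG` are both
accepted, while `text/plain` still falls through to the 415 path.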
data_path doc/code mismatch (opus WARN): doc said "relative to the
configured biometric storage root" but code stored absolute path.
Now stores a relative path: operators reading the manifest reconstruct
the absolute path with their own storage_root, so manifests are
portable across deployments. Tests updated.
Timestamp-nanosecond collision (kimi WARN): added 8-char uuid
suffix to filename. A same-nanosecond collision was implausible,
but defense-in-depth is cheap.
Dead code (opus + kimi INFO): removed unused require_legal_auth
function (process_upload reimplements the auth check inline)
and the `let _ = ConsentStatus::Given;` no-op type-shape reference.
Skipped (acceptable in v1):
- qwen BLOCK on image format validation: spec explicitly says "we
trust the caller; malformed images fail downstream when deepface
runs in Gate 3b". Documented in the file's module doc-comment.
- qwen WARN on directory create-then-chmod race: brief window
between create_dir_all and set_permissions. Mitigation would
require libc-level umask manipulation; accepted as v1 scope.
- qwen INFO on constant_time_eq duplication: comment explains the
cross-import boundary; acceptable short-term per the reviewer.
Tests: 11 unit tests pass (added content_type_with_parameters_accepted).
Live verification post-restart:
- Content-Type with `; charset=binary` accepted ✓
- data_path returned as relative `WORKER-2/<ts>_<uuid>.jpg` ✓
- Chain verified end-to-end (3 rows: validator + 2 biometric) ✓
- Cross-runtime parity probe still 6/6 byte-identical ✓
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per docs/PHASE_1_6_BIPA_GATES.md §1 Gate 3 (consent-gate substrate).
Deepface classification (Gate 3b) deferred to its own session — needs
Python subprocess design conversation after the 2026-05-02 sidecar drop.
What ships:
shared/types.rs:
- new BiometricCollection sub-struct: data_path, template_hash,
collected_at, consent_version_hash, classifications (Option<JSON>)
- SubjectManifest gains biometric_collection: Option<BiometricCollection>
with #[serde(default)] so existing on-disk manifests parse and
re-emit without drift
catalogd/biometric_endpoint.rs (NEW, ~600 LOC):
POST /subject/{candidate_id}/photo
- Auth: X-Lakehouse-Legal-Token, constant-time-eq compared against
same legal token file as /audit. Same 32-byte minimum.
- Content-Type: must be image/jpeg or image/png (415 otherwise)
- Body: raw image bytes, max 10MB
- 401: missing or wrong token
- 404: subject not registered
- 403: consent.biometric.status != "given" (returns current status)
- 403: subject status in {Withdrawn, Erased, RetentionExpired}
- 200: writes photo to data/biometric/uploads/<sanitized_id>/<ts>.<ext>
with mode 0700 dir + 0600 file, updates SubjectManifest with
BiometricCollection record, appends audit row
(kind="biometric_collection", purpose="photo_upload"), returns
UploadResponse with template_hash + audit_row_hmac.
Logic split: pure async fn process_upload() takes the headers-as-args
so unit tests exercise every branch without HTTP machinery; the
axum handler is just glue. 10 tests covering all 4 reject paths +
happy path + repeated uploads chaining + structural assertion that
the quarantine path is NOT under data/headshots/ (synthetic faces).
gateway/main.rs:
Mounts /biometric on the same condition as /audit — only when the
SubjectAuditWriter is present AND the legal token loads. Storage
root configurable via LH_BIOMETRIC_STORAGE_ROOT (default
./data/biometric/uploads).
Live verification on the running gateway (post-restart):
- GET /biometric/health → "biometric endpoint ready"
- POST without token → 401 auth_failed
- POST with token, no consent → 403 consent_required (status=NeverCollected)
- Flipped WORKER-2 to consent=given, POST → 200 with hash + path
- File at data/biometric/uploads/WORKER-2/<ts>.jpg, mode 0600
- Manifest biometric_collection field reflects the upload
- Audit row chain links cleanly off the prior validator_lookup row
- GET /audit/subject/WORKER-2 returns chain_verified=true, 2 rows
- Cross-runtime parity probe still 6/6 byte-identical post-change
Phase 1.6 status table updated: Gate 3a DONE, Gate 3b (deepface)
deferred. Calendar bottleneck remains counsel review of items 1/2/5/6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per 2026-05-03 step_7_8_retention_and_parity scrum (opus). 5 findings,
0 convergent — but two real fixes shipped:
1. WARN parity_subject_audit.rs:argv — replace .expect() panics with
stderr+exit(2). The parity script captures stdout for byte-compare;
a Rust panic backtrace lands in stdout (script merges 2>&1) and
reads as a parity break instead of a usage error. Added die() helper
that mirrors the Go side's error-exit pattern.
2. INFO parity_subject_audit.rs:5 — doc comment hardcoded the absolute
path /home/profit/golangLAKEHOUSE/... Replaced with repo-relative
reference.
INFO findings on retention_sweep argv style + --as-of report path
overwrite were noted but not actioned (style only / acceptable for
the forecast use case).
The major scrum-surfaced bug (Go json.Marshal HTML-escaping <>& while
serde_json keeps them literal) is fixed on the Go side in a parallel
commit. The Rust side here is correct as-is: serde_json::to_vec doesn't
HTML-escape by default, so no change is needed in canonical_json.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 8.
Cross-runtime parity helper consumed by:
golangLAKEHOUSE/scripts/cutover/parity/subject_audit_parity.sh
Two modes:
--known-answer
Print canonical-JSON + HMAC for a hardcoded fixture row. The Go
helper at golangLAKEHOUSE/scripts/cutover/parity/subject_audit_helper/
must produce byte-identical output. Catches algorithm drift
(canonical-JSON sort order, HMAC algorithm, hex encoding).
--verify <audit_log_path> --key <key_path>
Replay the chain on a real production audit log via the live
SubjectAuditWriter::verify_chain (no re-implementation; the actual
production verification path). Output: one JSON line with mode,
count, tip, verified, error.
The helper exercises the SAME verify_chain path the gateway calls, so
algorithm changes in subject_audit.rs automatically flow into the
parity probe.
Live-verified against 5 production audit logs in data/_catalog/subjects;
all 6 parity assertions pass after fixing two real cross-runtime drifts
on the Go side (omitempty trace_id stripping field; time.RFC3339Nano
stripping trailing zero in nanoseconds — both caught by this probe).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 7:
"Subjects whose retention.general_pii_until < now AND status != erased
get marked for review (don't auto-delete; legal needs to approve)."
Per shared::types::BiometricConsent doc-comment (BIPA requirement on
biometric data, max 3 years from last interaction):
"Implementation MUST enforce daily expiration sweep against this field."
Therefore the sweep checks BOTH retention clocks. Reports overdue
subjects to data/_catalog/subjects/_retention_sweep_<YYYY-MM-DD>.jsonl.
Idempotent: subjects already in {Erased, RetentionExpired} are skipped
so daily runs do not append duplicate rows.
Does NOT mutate subject manifests. Legal/operator owns the action
(extend, flip status, schedule erasure).
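A sketch of the two-clock overdue check, under the assumption that
timestamps are plain epoch seconds. RetentionView is a hypothetical
trimmed-down struct; the real is_overdue works on SubjectManifest and
SubjectStatus.

```rust
/// Hypothetical, trimmed-down view of a subject for this sketch.
pub struct RetentionView {
    /// True when status is already Erased or RetentionExpired.
    pub already_flagged: bool,
    /// General-PII retention deadline, seconds since the Unix epoch.
    pub general_pii_until: u64,
    /// Biometric retention deadline, if the subject has a biometric clock.
    pub biometric_until: Option<u64>,
}

/// A subject is overdue when EITHER retention clock is strictly in the
/// past, unless it was already flagged (keeps daily runs idempotent).
pub fn is_overdue(subject: &RetentionView, now: u64) -> bool {
    if subject.already_flagged {
        return false;
    }
    let general_expired = subject.general_pii_until < now;
    let biometric_expired = subject.biometric_until.map_or(false, |until| until < now);
    general_expired || biometric_expired
}
```

The check is read-only by design: it only decides what lands in the
report, and leaves any manifest change to legal or the operator.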
CLI:
retention_sweep # dry-run (default), stderr only
retention_sweep --apply # also write JSONL report
retention_sweep --as-of <RFC3339> # alternate clock for forecast/test
retention_sweep --storage-root <dir> # default ./data
Tests: 8 unit tests on is_overdue covering all 5 SubjectStatus values,
both clocks, BIPA-only path, and idempotency on already-flagged
subjects.
Live verification (100 subjects in ./data/_catalog/subjects):
- now (2026-05-03): 0 overdue (correct — 4-year retention)
- --as-of 2031-06-01: 100 overdue, 394 days past, jsonl report shape
verified with biometric fields correctly omitted via
serde skip_serializing_if when subject has no biometric clock.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second cross-lineage scrum on Steps 5+6 returned 13 distinct findings, 0 convergent.
Three BLOCK-class claims verified as false positives (cache IS written, per-subject
Mutex IS in place, spawn IS safe under writer's lock). Five real fixes shipped:
1. audit_endpoint: legal token min length 16->32 (HMAC-SHA256 best practice, kimi)
2. subject_audit: new chain_tip() returns last hash from full log; audit_endpoint
now reports chain_root from full chain instead of windowed slice (opus)
3. registry: rebuild loader now warns on sanitize collision (symmetric with
put_subject's collision guard - opus)
4. audit_endpoint: tampering detection - if manifest expects non-empty chain_root
but log returns 0 rows, flag chain_verified=false with explicit message (opus)
5. execution_loop::audit_result_state: tightened heuristic - error/denied/not_found
only classify when no rows/data/results sibling (opus INFO)
Tests: 17 catalogd subject + 6 gateway audit_result_state, all green.
New: audit_result_state_does_not_classify_error_when_data_sibling_present,
audit_result_state_status_is_authoritative_even_with_data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 6
+ §4 (response shape) + §6 (auth model). The defense-against-EEOC-
discovery surface is live: legal counsel hits one URL with one token,
gets back a signed-by-HMAC-chain audit response naming every PII access
for a subject in a time window.
New module: crates/catalogd/src/audit_endpoint.rs (~340 LOC)
- AuditEndpointState { registry, writer, legal_token }
- router() exposes:
GET /subject/{candidate_id}?from=ISO&to=ISO (full audit response)
GET /health (liveness + token check)
- require_legal_auth() — constant-time-eq compare against the
X-Lakehouse-Legal-Token header. Avoids timing leaks on the token
check without pulling in `subtle` for one comparison.
- Token loaded from /etc/lakehouse/legal_audit.token (env-overridable
via LH_LEGAL_AUDIT_TOKEN_FILE). Empty file or <16 chars = endpoint
serves 503 with a clear reason. Token value NEVER logged.
- Response schema: subject_audit_response.v1 with manifest +
audit_log (rows + chain verification) + datasets_referenced +
safe_views_available + completeness_attestation.
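For reference, a typical shape for a hand-rolled constant-time
comparison. This is a sketch, not necessarily the exact body in
audit_endpoint.rs; a vetted crate such as `subtle` gives stronger
guarantees against compiler optimisation.

```rust
/// Compare two byte slices without short-circuiting on the first
/// differing byte. A length mismatch returns early, so only the length
/// (not the content) can leak through timing.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // Accumulate differences with OR so every byte pair is always visited.
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```

Two empty slices compare equal here, which is why the token loader
rejects empty and short token files before the comparison is ever
reached.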
New helper on SubjectAuditWriter:
- read_rows_in_range(candidate_id, from, to) — returns rows in window,
used by the endpoint to assemble the response without re-reading
the entire chain.
- verify_chain() now returns Ok(0) when the audit log file doesn't
exist (empty = trivially valid). Prevents legitimate "no PII access
yet for this subject" from showing as integrity=BROKEN in the
audit response. Caller can detect "log was deleted" via comparison
to SubjectManifest.audit_log_chain_root (when that mirror lands).
main.rs:
- Audit endpoint mounted at /audit ONLY when both subject_audit
writer AND legal token are present. Disabled-by-default keeps the
surface from accidentally serving in dev/bring-up environments
without proper credentials.
Tests (9/9 passing):
- constant_time_eq (correctness on equal/diff/empty/length-mismatch)
- missing_legal_token_returns_503
- missing_header_returns_401
- wrong_token_returns_401
- correct_token_passes_auth
- audit_response_assembly_full_path (manifest + 3 rows + chain verify)
- audit_response_window_filters_rows (time-bounded window)
- empty_token_file_results_in_disabled_endpoint
- short_token_file_rejected_at_load (<16 char min)
LIVE end-to-end verification:
1. Plant signing key + legal token in /tmp/lakehouse_audit/
2. Restart gateway with LH_SUBJECT_AUDIT_KEY + LH_LEGAL_AUDIT_TOKEN_FILE
pointing at the test files
3. /audit/health → 200 "audit endpoint ready"
4. /audit/subject/WORKER-1 (no token) → 401 "missing X-Lakehouse-Legal-Token"
5. /audit/subject/WORKER-1 (wrong token) → 401 "X-Lakehouse-Legal-Token mismatch"
6. /audit/subject/WORKER-1 (correct token) → 200 + full manifest + 0 rows
+ chain_verified=true (empty log path)
7. POST /v1/validate with candidate_id=WORKER-1 → triggers WorkerLookup.find()
via the AuditingWorkerLookup wrapper from Step 5
8. data/_catalog/subjects/WORKER-1.audit.jsonl now exists with 1 row
(accessor.purpose=validator_worker_lookup, result=not_found,
prev_chain_hash=GENESIS, valid HMAC)
9. /audit/subject/WORKER-1 (correct token) → 200 + manifest + 1 row +
chain_verified=true + chain_rows_total=1 + completeness attestation
The full audit-trail loop (PII access → audit row → chain → audit response)
works end-to-end on the live gateway.
NOT in this commit (future steps):
- Step 7: Daily retention sweep
- Step 8: Cross-runtime parity (Go side reads the same shapes)
- Mirror chain root to SubjectManifest.audit_log_chain_root after
each append (so tampering detection can use the manifest's
cached root as ground truth)
- Live row projection from datasets (currently caller follows up
via /query/sql against the safe_views named in the response)
- Ed25519 signature on the response (chain verification IS the v1
attestation; signing is future hardening per spec §10)
cargo build --release clean. cargo test -p catalogd audit_endpoint
9/9 PASS. Live verification successful.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 cross-lineage scrum on the subjects_steps_1_to_4 wave
returned 14 distinct findings, 0 convergent. opus verdict was HOLD
with 3 BLOCKs around the audit-chain integrity. All real. Fixed:
──────────────────────────────────────────────────────────────────
BLOCK 1 — opus subject_audit.rs:172 + execution_loop.rs:391
Concurrency race: append_line is read-modify-write; the gateway
hook used tokio::spawn fan-out → two concurrent appends to the
same subject both read the same prev_hash, both compute their
HMAC from the same prev, second write silently overwrites first
→ row lost AND chain broken.
Fix:
- SubjectAuditWriter gains per-subject Mutex map. append() acquires
the subject's lock for the duration of the read-modify-write.
Different subjects still parallelize.
- Gateway hook switches from tokio::spawn to inline await. Per-row
cost is ~1ms (one object_store put); inline is correct AND cheap.
- New regression test: 50 concurrent appends to the same subject,
asserts all 50 land with intact chain.
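The locking shape can be sketched with std primitives. This is an
assumption-laden toy: the real writer is async and persists through
object_store, while LockedWriter, Row, and run_concurrent_appends below
are hypothetical and keep everything in memory.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// One chain row: `prev` is the number of rows that existed when it was appended.
#[derive(Debug, Clone)]
pub struct Row {
    pub prev: usize,
    pub payload: String,
}

type SubjectLog = Arc<Mutex<Vec<Row>>>;

#[derive(Default)]
pub struct LockedWriter {
    // Outer mutex guards the map only; it is held just long enough to
    // look up or create the per-subject entry.
    logs: Mutex<HashMap<String, SubjectLog>>,
}

impl LockedWriter {
    fn log_for(&self, subject: &str) -> SubjectLog {
        let mut map = self.logs.lock().unwrap();
        Arc::clone(map.entry(subject.to_string()).or_default())
    }

    /// Read-modify-write under the subject's own lock, so two appends to
    /// the same subject can never read the same previous link.
    pub fn append(&self, subject: &str, payload: &str) -> usize {
        let log = self.log_for(subject);
        let mut rows = log.lock().unwrap();
        let prev = rows.len();
        rows.push(Row { prev, payload: payload.to_string() });
        prev
    }

    pub fn prev_links(&self, subject: &str) -> Vec<usize> {
        let log = self.log_for(subject);
        let rows = log.lock().unwrap();
        let links: Vec<usize> = rows.iter().map(|r| r.prev).collect();
        links
    }
}

/// Spawn `writers` threads that all append to one subject, then return
/// the stored prev links in log order.
pub fn run_concurrent_appends(writers: usize) -> Vec<usize> {
    let writer = Arc::new(LockedWriter::default());
    let handles: Vec<_> = (0..writers)
        .map(|i| {
            let writer = Arc::clone(&writer);
            thread::spawn(move || writer.append("WORKER-1", &format!("row-{}", i)))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    writer.prev_links("WORKER-1")
}
```

An intact chain means row i links to exactly i predecessors, so
run_concurrent_appends(50) should yield 0 through 49 in order.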
BLOCK 2 — opus subject_audit.rs:108
Unstable canonicalization: serde_json serializes struct
fields in declaration order. Schema evolution (adding/reordering
fields) silently changes the bytes verify_chain hashes → chain
breaks even when nothing was actually tampered with.
Fix:
- New canonical_json() free fn — recursive value rewrite to sort
object keys alphabetically (BTreeMap projection), arrays preserve
order, scalars pass through. Stable across struct evolution.
- Both append() and verify_chain() now compute HMAC over canonical
bytes, not declaration-order bytes.
- New regression tests: alphabetical-key + array-order-preserved.
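A self-contained sketch of key-sorted canonicalization. The real
canonical_json operates on serde_json values; the Json enum here is a
hypothetical stand-in so the example needs no external crates.

```rust
use std::collections::BTreeMap;

/// Toy JSON value. Objects keep their fields in insertion order, the
/// way a struct serializes in declaration order.
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Null,
    Bool(bool),
    Num(i64),
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

/// Serialize with object keys sorted alphabetically at every level.
/// Arrays keep their order; scalars pass through unchanged.
pub fn canonical_json(value: &Json) -> String {
    match value {
        Json::Null => "null".to_string(),
        Json::Bool(b) => b.to_string(),
        Json::Num(n) => n.to_string(),
        Json::Str(s) => quote(s),
        Json::Arr(items) => {
            let parts: Vec<String> = items.iter().map(canonical_json).collect();
            format!("[{}]", parts.join(","))
        }
        Json::Obj(fields) => {
            // BTreeMap projection: iteration order is sorted by key.
            let sorted: BTreeMap<&str, &Json> =
                fields.iter().map(|(k, v)| (k.as_str(), v)).collect();
            let parts: Vec<String> = sorted
                .iter()
                .map(|(k, v)| format!("{}:{}", quote(k), canonical_json(v)))
                .collect();
            format!("{{{}}}", parts.join(","))
        }
    }
}

/// Minimal string escaping (quotes, backslash, control characters).
fn quote(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}
```

Because the bytes depend only on key names and values, reordering or
adding struct fields no longer changes the HMAC input for existing
keys.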
WARN — opus execution_loop:401
Audit row's `result` was hardcoded to "success" for every Ok(result)
including payloads like {"error":"not found"}. Misleads compliance.
Fix:
- New audit_result_state() free fn that inspects the payload
top-level for error/denied/not_found/status signals (per spec
§3.2 enum). Defaults to "success" only when no error signal.
- 4 new tests covering each enum case + falsy-signals defense.
WARN — opus registry.rs:735
Storage-key collision: sanitize_view_name(id) is the disk key,
but the in-memory HashMap was keyed by raw candidate_id. Two
distinct ids that sanitize to the same key (e.g. "CAND/1" and
"CAND_1") would collide on disk while appearing distinct in
memory; second put silently overwrites first; rebuild loads only
one.
Fix:
- put_subject() / get_subject() / delete_subject() / rebuild()
all key the in-memory HashMap by sanitize_view_name(id), matching
the storage key shape.
- Collision guard: put_subject() refuses (with clear error) when
the sanitized key matches an EXISTING subject with a DIFFERENT
raw candidate_id.
- New regression test: put("CAND/1") then put("CAND_1") errors
+ first subject survives.
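A sketch of the collision guard. The sanitize() rule below is an
assumption standing in for sanitize_view_name, and SubjectIndex is a
hypothetical reduction of the registry to just its key map.

```rust
use std::collections::HashMap;

/// Stand-in sanitizer: anything outside [A-Za-z0-9_-] becomes '_'.
/// The real sanitize_view_name rules may differ.
fn sanitize(id: &str) -> String {
    id.chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' || c == '-' { c } else { '_' })
        .collect()
}

#[derive(Default)]
pub struct SubjectIndex {
    // Keyed by the sanitized id (the storage key); value is the raw id that owns it.
    by_key: HashMap<String, String>,
}

impl SubjectIndex {
    /// Refuse the write when a DIFFERENT raw id already owns the storage key.
    pub fn put_subject(&mut self, raw_id: &str) -> Result<(), String> {
        let key = sanitize(raw_id);
        if let Some(existing) = self.by_key.get(&key) {
            if existing.as_str() != raw_id {
                return Err(format!(
                    "candidate_id {:?} collides with {:?} on storage key {:?}",
                    raw_id, existing, key
                ));
            }
        }
        self.by_key.insert(key, raw_id.to_string());
        Ok(())
    }

    /// Lookups go through the same sanitized key as the storage layer.
    pub fn get_subject(&self, raw_id: &str) -> Option<&str> {
        self.by_key.get(&sanitize(raw_id)).map(String::as_str)
    }
}
```

Re-putting the same raw id stays allowed, so ordinary updates are not
mistaken for collisions.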
WARN — opus backfill_subjects.rs:189
trim_start_matches strips REPEATED prefixes; the spec wanted
one-shot semantics. Edge case unlikely in practice but real.
Fix:
- Switched to strip_prefix(&prefix).unwrap_or(&cid). One-shot.
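The difference between the two std methods, shown side by side. The
function names and the "CAND-" prefix are illustrative only.

```rust
/// Remove `prefix` from the front of `cid` at most once.
pub fn strip_once<'a>(cid: &'a str, prefix: &str) -> &'a str {
    cid.strip_prefix(prefix).unwrap_or(cid)
}

/// The earlier behaviour, kept for comparison: removes every leading repeat.
pub fn strip_repeated<'a>(cid: &'a str, prefix: &str) -> &'a str {
    cid.trim_start_matches(prefix)
}
```

For an id such as "CAND-CAND-7", strip_once yields "CAND-7" while
strip_repeated yields "7", which is the edge case the finding
describes.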
INFO — opus subject_audit.rs:131
Per-byte format!("{:02x}", b) allocates each iteration. Hot path
on every append.
Fix:
- Replaced with const HEX lookup table + push() into preallocated
String. Same output bytes, no per-byte allocation.
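A sketch of the lookup-table approach; to_hex is an illustrative name
and the actual helper in subject_audit.rs may be shaped differently.

```rust
/// Lowercase hex digits, indexed by nibble value.
const HEX: &[u8; 16] = b"0123456789abcdef";

/// Encode bytes as lowercase hex with a single up-front allocation.
pub fn to_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX[(b >> 4) as usize] as char); // high nibble
        out.push(HEX[(b & 0x0f) as usize] as char); // low nibble
    }
    out
}
```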
──────────────────────────────────────────────────────────────────
Test summary post-fix:
catalogd subject_audit: 11/11 PASS (added 4 new — concurrency
race regression, parallel-different-subjects,
canonical-key sort, canonical-array order)
catalogd registry subject: 6/6 PASS (added 1 new — collision guard)
gateway execution_loop subject: 10/10 PASS (added 4 new —
audit_result_state enum coverage)
All 27 subject-related tests green. cargo build --release clean.
The convergent-zero scrum result was misleading on its face — opus
caught real BLOCKs that kimi/qwen missed. Per
feedback_cross_lineage_review.md: opus is the load-bearing reviewer;
single-opus BLOCKs warrant manual verification, which here confirmed
all three were correct.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md Step 2.
Per-subject append-only audit JSONL with HMAC-SHA256 chain. Local-first
— no Vault, no external anchor (those are v2 if SOC2 Type II becomes
contract-required; v1 deliberately stays small).
shared/types.rs additions:
- AuditAccessor — kind, daemon, purpose, trace_id
- SubjectAuditRow — schema/ts/candidate_id/accessor/fields_accessed/
result/prev_chain_hash/row_hmac
crates/catalogd/src/subject_audit.rs (NEW):
- SubjectAuditWriter — holds signing key + per-subject latest-hash cache
- from_key_file() — loads key from sealed file, requires ≥32 bytes
- with_inline_key() — for tests + bring-up
- append() — computes HMAC chain link, persists JSONL row, returns new
chain root (caller mirrors to SubjectManifest.audit_log_chain_root)
- verify_chain() — full re-verification of a subject's audit log,
catches both prev_hash drift AND row-level HMAC tampering
- scan_latest_hash() — cold-start path, finds prev_hash from JSONL tail
- append_line() — read-modify-write pattern (object stores have no
native append; same shape as the rest of catalogd's persistence)
Crypto: HMAC-SHA256 via the standard `hmac` crate (added to workspace
+ catalogd deps; not implementing crypto by hand). Output is lowercase
hex matching the rest of the codebase's SHA-256 conventions.
Security choices:
- NO Debug impl on SubjectAuditWriter — auto-deriving Debug would risk
leaking the signing key into log lines. Tests work around this by
matching on Result instead of using .unwrap_err().
- Key min length 32 bytes (the HMAC-SHA256 output size, which RFC 2104
recommends as the minimum key length).
- Failures are NOT swallowed — Result returned, caller decides whether
to log + continue (per spec §3.2 the gateway tool registry SHOULD
log + continue rather than block reads).
Tests (7/7 passing):
- first_append_uses_genesis_prev_hash
- chain_links_each_append (3-row chain verifies)
- separate_subjects_have_independent_chains (per-subject isolation)
- tamper_detected_on_verify (mutation in middle of chain breaks verify)
- cold_writer_picks_up_existing_chain (process restart preserves chain)
- empty_candidate_id_rejected
- key_too_short_rejected_via_file
NOT in this commit (future steps):
- Step 3: Backfill ETL from workers_500k.parquet (next per J)
- Step 4: Wire gateway tool registry to call append() on every
candidate_id returned by search_candidates / get_candidate
- Step 5: Wire validator WorkerLookup similarly
- Step 6: /audit/subject/{id} HTTP endpoint
- Step 7: Daily retention sweep
- Mirroring chain root to SubjectManifest.audit_log_chain_root
(separate concern; do at the call site)
cargo check --workspace clean. cargo test -p catalogd subject_audit
7/7 PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five threads of work landing as one milestone — all individually
verified end-to-end against real data, full release build clean,
46 unit tests pass.
## Phase 16.2 / 16.5 — autotune agent + ingest triggers
`vectord::agent` is a long-running tokio task that watches the trial
journal and autonomously proposes + runs new HNSW configs. Distinct
from `autotune::run_autotune` (synchronous one-shot grid). Triggered
on POST /vectors/agent/enqueue/{idx} or by the periodic wake; ingest
paths now push DatasetAppended events when an index's source dataset
gets re-ingested. Rate-limited (max_trials_per_hour) and cooldown-
gated so it can't saturate Ollama under live load.
The proposer is ε-greedy around the current champion: with prob 0.25
sample random from full bounds, otherwise perturb champion ± small
delta on both axes. Dedup against history. Deterministic — RNG seeded
from history.len() so the same journal state proposes the same next
config (helps offline replay debugging).
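A sketch of a deterministic ε-greedy proposer along these lines. The
bounds, the delta, the retry cap, the field names, and the SplitMix64
generator are all assumptions made for illustration; only ε = 0.25,
the dedup, and the history.len() seed come from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct HnswConfig {
    pub ef_construction: u32,
    pub ef_search: u32,
}

/// Small deterministic PRNG (SplitMix64) so the sketch needs no crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Draw from the inclusive range lo..=hi (modulo bias ignored).
    fn in_range(&mut self, lo: u32, hi: u32) -> u32 {
        lo + (self.next_u64() % u64::from(hi - lo + 1)) as u32
    }
}

// Hypothetical search bounds and perturbation size.
const EC_BOUNDS: (u32, u32) = (20, 400);
const ES_BOUNDS: (u32, u32) = (10, 200);
const DELTA: u32 = 10;

/// Pick a value within DELTA of `value`, clamped to `bounds`.
fn perturb(rng: &mut SplitMix64, value: u32, bounds: (u32, u32)) -> u32 {
    let v = value.clamp(bounds.0, bounds.1);
    let lo = v.saturating_sub(DELTA).max(bounds.0);
    let hi = v.saturating_add(DELTA).min(bounds.1);
    rng.in_range(lo, hi)
}

/// Propose the next config: explore with probability 0.25, otherwise
/// perturb the champion on both axes. Skips configs already in history.
pub fn propose(champion: HnswConfig, history: &[HnswConfig]) -> Option<HnswConfig> {
    // Seeding from history.len() makes the proposal a pure function of
    // the journal state.
    let mut rng = SplitMix64(history.len() as u64);
    for _ in 0..64 {
        let candidate = if rng.next_u64() % 100 < 25 {
            HnswConfig {
                ef_construction: rng.in_range(EC_BOUNDS.0, EC_BOUNDS.1),
                ef_search: rng.in_range(ES_BOUNDS.0, ES_BOUNDS.1),
            }
        } else {
            HnswConfig {
                ef_construction: perturb(&mut rng, champion.ef_construction, EC_BOUNDS),
                ef_search: perturb(&mut rng, champion.ef_search, ES_BOUNDS),
            }
        };
        if !history.contains(&candidate) {
            return Some(candidate);
        }
    }
    None // every attempt was a duplicate
}
```

Calling propose twice with the same champion and history returns the
same result, which is the property that makes offline replay possible.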
`[agent]` config section in lakehouse.toml; opt-in via enabled=true.
## Federation Layer 2 — runtime bucket lifecycle + per-index scoping
`BucketRegistry.buckets` moved to `std::sync::RwLock<HashMap>` so
buckets can be added/removed after startup. POST /storage/buckets
provisions at runtime; DELETE /storage/buckets/{name} unregisters
(refuses primary/rescue with 403). Local-backend buckets get their
root directory auto-created.
`IndexMeta.bucket` (default "primary" via serde) records each index's
home bucket. `TrialJournal` and `PromotionRegistry` now hold
Arc<BucketRegistry> + IndexRegistry; they resolve target store per-
index via IndexMeta.bucket. PromotionRegistry::list_all scans every
bucket and dedups by index_name. Pre-federation indexes keep working
unchanged — they just default to primary.
`ModelProfile.bucket: Option<String>` declares per-profile artifact
home. POST /vectors/profile/{id}/activate auto-provisions the
profile's bucket under storage.profile_root if not yet registered.
EvalSets stay primary-only for now — noted gap, low-risk to extend
later with the same resolver pattern.
## Phase 17 — VRAM-aware two-profile gate
Sidecar gains POST /admin/unload (Ollama keep_alive=0 trick — forces
immediate VRAM release), POST /admin/preload (keep_alive=5m with
empty prompt, takes the slot warm), and GET /admin/vram (combines
nvidia-smi snapshot with Ollama /api/ps). Exposed via aibridge as
unload_model / preload_model / vram_snapshot.
`VectorState.active_profile` is the GPU-slot singleton —
Arc<RwLock<Option<ActiveProfileSlot>>>. activate_profile checks for
a previous profile with a different ollama_name and unloads it
before preloading the new one; same-model reactivations skip the
unload (Ollama no-ops). New routes: POST /vectors/profile/{id}/
deactivate (unload + clear slot), GET /vectors/profile/active.
Verified live: staffing-recruiter (qwen2.5) → docs-assistant
(mistral) swap freed qwen2.5 from VRAM and loaded mistral. nomic-
embed-text persists across swaps because both profiles use it —
free optimization that fell out of the design. Scoped search
correctly 403s cross-profile in both directions.
## MySQL streaming connector
`crates/ingestd/src/my_stream.rs` mirrors pg_stream.rs for MySQL.
Pure-rust `mysql_async` driver (default-features=false to avoid C
deps). Same OFFSET pagination, same Parquet-streaming write shape.
Type mapping per ADR-010: int/bigint → Int32/Int64, decimal/float
→ Float64, tinyint(1)/bool → Boolean, everything else → Utf8 with
fallback parsers for date/time/json/uuid via Display.
POST /ingest/mysql parallel to /ingest/db. Same PII auto-detection,
same lineage capture (source_system="mysql"), same agent-trigger
hook. `redact_dsn` generalized — was hardcoded to "postgresql://"
length, now works for any scheme://user:pass@host/path URL (latent
PII leak fix for MySQL DSNs).
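A scheme-agnostic redaction can be sketched as below. The `***`
placeholder and the exact output format are assumptions; the real
redact_dsn may render the redacted form differently.

```rust
/// Replace the userinfo part of `scheme://user:pass@host/path` with a
/// placeholder, for any scheme. Inputs without a scheme or without
/// userinfo are returned unchanged.
pub fn redact_dsn(dsn: &str) -> String {
    let (scheme, rest) = match dsn.split_once("://") {
        Some(parts) => parts,
        None => return dsn.to_string(),
    };
    // The authority ends at the first '/', '?' or '#'.
    let authority_end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    let (authority, tail) = rest.split_at(authority_end);
    // Userinfo is everything before the LAST '@' in the authority, so a
    // password containing '@' is still fully removed.
    match authority.rfind('@') {
        Some(at) => format!("{}://***@{}{}", scheme, &authority[at + 1..], tail),
        None => dsn.to_string(),
    }
}
```

One caveat of this sketch: a password containing an unescaped '/' would
end the authority early, so the '@' is never seen and the DSN is
returned unredacted. Callers should percent-encode credentials.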
Verified live against MariaDB on localhost: 10 rows × 9 columns of
test data round-tripped through datatypes int/varchar/decimal/
tinyint/datetime/text. PII detection auto-flagged name + email.
Aggregation queries through DataFusion match the source values
exactly.
## Phase 18 — Hybrid Parquet+HNSW ⊕ Lance backend (ADR-019)
`vectord-lance` is a new firewall crate. Lance pulls Arrow 57 and
DataFusion 52 — incompatible with the rest of the workspace's
Arrow 55 / DataFusion 47. The firewall isolates that dep tree:
public API uses only std types (Vec<f32>, Vec<String>, Hit, Row,
*Stats), so no Arrow types cross the crate boundary and nothing
propagates to vectord. This is the ADR-019 path that didn't ship until now.
`vectord::lance_backend::LanceRegistry` lazy-creates a
LanceVectorStore per index, resolving bucket → URI via the
conventional local-bucket layout. `IndexMeta.vector_backend` and
`ModelProfile.vector_backend` carry the choice (default Parquet so
existing indexes unchanged).
Six routes under /vectors/lance/*:
- migrate/{idx}: convert binary-blob Parquet → Lance FixedSizeList
- index/{idx}: build IVF_PQ
- search/{idx}: vector search (embed via sidecar)
- doc/{idx}/{doc_id}: random row fetch
- append/{idx}: native fragment append
- stats/{idx}: row count + index presence
Verified live on the real resumes_100k_v2 corpus (100K × 768d):
- Migrate: 0.57s
- Build IVF_PQ index: 16.2s (matches ADR-019 bench; 14× faster than
HNSW's 230s for the same data)
- Search end-to-end (Ollama embed + Lance scan): 23-53ms
- Random doc_id fetch: 5-7ms (filter scan; faster than Parquet's
~35ms full-file scan, slower than the bench's 311us positional
take — would close that gap with a scalar btree on doc_id)
- Append 100 rows: 3.3ms / +320KB on disk vs Parquet's required
full ~330MB rewrite — the structural win
- Index survives append; both backends coexist cleanly
## Known follow-ups not in this milestone
- ModelProfile.vector_backend doesn't yet auto-route /vectors/profile/
{id}/search to Lance; callers go through /vectors/lance/* directly
- Scalar btree on doc_id (closes the 5-7ms → ~300us gap)
- vectord-lance built default-features=false → no S3 yet
- IVF_PQ recall not measured (ADR-019 caveat) — needs a Lance-aware
variant of the eval harness
- Watcher-path ingest doesn't push agent triggers (HTTP paths do)
- EvalSets still primary-only (federation gap)
- No PATCH endpoint to move an existing index between buckets
- The pre-existing storaged::append_log doctest fails to compile
(malformed `{prefix}/` parses as code fence) — pre-existing bug,
left for a focused fix
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase E gave us soft-delete at query time (tombstones hide rows via a
DataFusion filter view). This completes the invariant: after compact,
tombstoned rows are PHYSICALLY absent from the parquet on disk.
delta::compact changes:
- Signature adds tombstones: &[Tombstone]
- After merging base + deltas, apply_tombstone_filter builds a
BooleanArray keep-mask per batch (True where row_key_value is NOT
in the tombstone set) and applies arrow::compute::filter_record_batch
- Supports Utf8, Int32, Int64 key columns (matches refresh.rs coverage
for pg- and csv-derived schemas)
- CompactResult gains tombstones_applied + rows_dropped_by_tombstones
- Caller clears tombstone store on success
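The keep-mask idea, sketched over plain slices instead of Arrow
arrays. keep_mask and filter_rows are hypothetical stand-ins for the
BooleanArray construction and filter_record_batch call.

```rust
use std::collections::HashSet;

/// Build one bool per row: true where the row's key is NOT tombstoned.
pub fn keep_mask(row_keys: &[&str], tombstones: &[&str]) -> Vec<bool> {
    let dead: HashSet<&str> = tombstones.iter().copied().collect();
    row_keys.iter().map(|key| !dead.contains(key)).collect()
}

/// Keep only the rows whose mask entry is true.
pub fn filter_rows<T: Clone>(rows: &[T], mask: &[bool]) -> Vec<T> {
    rows.iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(row, _)| row.clone())
        .collect()
}
```

The number of false entries in the mask corresponds to what
rows_dropped_by_tombstones reports for a batch.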
Critical correctness fix surfaced during E2E testing:
The original Phase 8 compact concatenated N independent Parquet byte
streams from record_batch_to_parquet() — each with its own footer.
Parquet readers only see the FIRST footer's data; the rest is invisible.
Latent since Phase 8 shipped; triggered by tombstone-filtering
producing multiple batches. Corrupted candidates.parquet on first test run
(restored from UI fixture copy — good argument for test data in repo).
Fix:
- Single ArrowWriter per compaction, writes every batch into one
properly-footered Parquet
- Snappy compression to match ingest defaults (otherwise rewrite
inflated file 3× — 10.5MB → 34MB — because no compression was set)
- Verify-before-swap: parse written buf back to confirm row count
matches expected; refuses to overwrite base_key if verification fails
- Write to {base_key}.compact-{ts}.tmp first, then to base_key; delete
temp; only then delete delta files. Any error along the way leaves
the original base intact.
TombstoneStore::clear(dataset) drops all tombstone batch files and
evicts the per-dataset AppendLog from cache. Called after successful
compact.
QueryEngine::catalog() accessor exposes the Registry so queryd
handlers can reach the tombstone store without routing through gateway
state.
E2E on candidates (100K rows, 15 cols):
- Baseline: 10.59 MB, 100000 rows
- Tombstone CAND-000001/2/3 (soft-delete): 99997 visible, 100000 raw
- Compact: tombstones_applied=3, rows_dropped=3, final_rows=99997
- Post: 10.72 MB (Snappy), valid parquet (1 row_group), 99997 rows
- Restart: persists, tombstones list empty, __raw__candidates also
99997 (the 3 IDs are physically gone from disk)
PRD invariant closed: deletion is now actually deletion, not just
masking. GDPR erasure request → tombstone + schedule compact → data
gone.
Deferred:
- Compact-all-datasets cron (currently manual per-dataset via
POST /query/compact)
- Compaction of tombstone batch files themselves (they grow at
flush_threshold=1 per tombstone; TombstoneStore::compact exists
but not auto-called)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements PRD invariant 9 ("every reader gets its own profile") and
completes the multi-model substrate vision. Local models (or agents)
bind to a named set of datasets; activation pre-loads their vector
indexes into memory; search enforces scope.
Schema (shared::types):
- ModelProfile { id, ollama_name, description, bound_datasets,
hnsw_config, embed_model, created_at, created_by }
- ProfileHnswConfig mirrors vectord::trial::HnswConfig to avoid a
cross-crate dep cycle. Default (ec=80, es=30) matches the Phase 15
trial winner.
- bound_datasets can reference raw dataset names OR AiView names
(both register as DataFusion tables with the same name, so mixing
raw tables and PII-redacted views composes naturally)
Catalog (catalogd::registry):
- put_profile validates id is a slug (alphanumeric + -_ only) and
every binding resolves to an existing dataset or view
- Persistence at _catalog/profiles/{id}.json, loaded on rebuild
- get_profile / list_profiles / delete_profile
HTTP endpoints:
- POST /catalog/profiles (create/update)
- GET /catalog/profiles (list)
- GET/DELETE /catalog/profiles/{id}
- POST /vectors/profile/{id}/activate (HNSW hot-load)
- POST /vectors/profile/{id}/search (scope-enforced)
Activation (vectord::service::activate_profile):
- For each bound dataset, find vector indexes with matching source
- Pre-load embeddings into EmbeddingCache
- Build HNSW with profile's config
- Report warmed indexes + per-binding failures + duration
- Failures on individual bindings don't abort — "substrate keeps
working" per ADR-017
Scoped search (vectord::service::profile_scoped_search):
- Look up profile, verify index.source ∈ profile.bound_datasets
- Returns 403 with allowed bindings list if out-of-scope
- Uses HNSW if index is warm, brute-force cosine otherwise (graceful
degradation — no "must activate first" friction)
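The scope check and the HNSW / brute-force fallback described above
amount to a small decision function. A minimal sketch, assuming
hypothetical type and function names (SearchMethod, ScopeError,
plan_search) that are not taken from vectord:

```rust
/// Which search path would be used for an in-scope request.
#[derive(Debug, PartialEq)]
enum SearchMethod {
    Hnsw,
    BruteForce,
}

/// Out-of-scope result; a handler would map this to HTTP 403 and echo
/// the allowed bindings back to the caller.
#[derive(Debug, PartialEq)]
enum ScopeError {
    OutOfScope { allowed: Vec<String> },
}

/// Hypothetical planner: enforce scope first, then pick HNSW when the
/// index is already warm and brute-force cosine otherwise.
fn plan_search(
    bound_datasets: &[String],
    index_source: &str,
    index_is_warm: bool,
) -> Result<SearchMethod, ScopeError> {
    if !bound_datasets.iter().any(|b| b.as_str() == index_source) {
        return Err(ScopeError::OutOfScope {
            allowed: bound_datasets.to_vec(),
        });
    }
    Ok(if index_is_warm {
        SearchMethod::Hnsw
    } else {
        SearchMethod::BruteForce
    })
}
```

Checking scope before warmth means an out-of-scope request fails the
same way whether or not the target index happens to be loaded.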
Bug fix surfaced during testing: vectord::refresh::try_update_index_meta
was a no-op for first-time indexes, so threat_intel_v1 and
kb_team_runs_v1 (both built via refresh after Phase C shipped) didn't
show up in the index registry. Now it auto-infers the source from the
index name convention (`{source}_vN`) and registers new metadata with
reasonable defaults.
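One way the `{source}_vN` inference could look, as a sketch rather than
the actual try_update_index_meta logic. The function name infer_source
is hypothetical, and requiring N to be one or more ASCII digits is an
assumption:

```rust
/// Hypothetical helper: splits "{source}_vN" on the last "_v" and
/// returns the source part when N is one or more ASCII digits.
fn infer_source(index_name: &str) -> Option<&str> {
    let (source, version) = index_name.rsplit_once("_v")?;
    let is_version = !version.is_empty() && version.chars().all(|c| c.is_ascii_digit());
    if source.is_empty() || !is_version {
        None
    } else {
        Some(source)
    }
}
```

Names that do not follow the convention return None, so a caller can
fall back to leaving the metadata unregistered instead of guessing.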
End-to-end verified:
- Created security-analyst profile bound to [threat_intel]
- POST /vectors/profile/security-analyst/activate → warmed
threat_intel_v1 (54 vectors) in 156ms, HNSW built
- Within-scope search: method=hnsw, returned relevant IP indicators
- Out-of-scope: tried to search resumes_100k_v2 (source=candidates)
→ 403 "profile 'security-analyst' is not bound to 'candidates' —
allowed bindings: [\"threat_intel\"]"
- staffing-recruiter profile created bound to candidates + placements;
search without activation fell through to brute_force (graceful)
Deferred (Phase 17 followups):
- VRAM-aware activation (unload-then-load via Ollama keep_alive=0):
Ollama already handles this, so we don't need to reinvent it
- Model-identity in audit trail — Phase 13 has role-based audit;
adding model_id is ~20 LOC when we want it
- Profile bucket pre-load (profile:user bucket mount) — Phase 17.5
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements GDPR/CCPA-compatible row-level deletion without rewriting
the underlying Parquet. Tombstone markers live beside each dataset and
are applied at query time via a DataFusion view that excludes the
deleted row_key_values.
Schema (shared::types):
- Tombstone { dataset, row_key_column, row_key_value, deleted_at,
actor, reason }
- All tombstones for a dataset must share one row_key_column —
enforced at write so the query-time filter remains a single
WHERE NOT IN (...) clause
Storage (catalogd::tombstones):
- Per-dataset AppendLog at _catalog/tombstones/{dataset}/
- flush_threshold=1 + explicit flush after every append — tombstones
are high-value, low-frequency; durability on return is the contract
- Reuses storaged::append_log infra so compaction is already wired
(POST .../tombstones/compact will work once we expose it)
Catalog (catalogd::registry):
- add_tombstone validates dataset exists + key column compatibility
- list_tombstones for the GET endpoint
- TombstoneStore exposed via Registry::tombstones() for queryd
HTTP (catalogd::service):
- POST /catalog/datasets/by-name/{name}/tombstone
{ row_key_column, row_key_values[], actor, reason }
Returns rows_tombstoned count + per-value failure list (207 on
partial success).
- GET same path lists active tombstones with full audit info.
Query layer (queryd::context):
- Snapshot tombstones-by-dataset before registering tables
- Tombstoned tables: raw goes to "__raw__{name}", public "{name}"
becomes DataFusion view with
SELECT * FROM "__raw__{name}" WHERE CAST(col AS VARCHAR) NOT IN (...)
- CAST AS VARCHAR handles both string and integer key columns
- Untombstoned tables register as before — zero overhead
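The view SQL described above can be assembled roughly as below. This is
a sketch under assumptions, not the queryd code: tombstone_view_sql is a
hypothetical name, values are embedded as single-quoted literals with
quotes doubled, and the key column is written unquoted as in the commit
text (identifier escaping is left out).

```rust
/// Hypothetical builder for the public view over "__raw__{dataset}".
/// Returns None when there are no tombstones, mirroring the "register
/// as before" path for untombstoned tables.
fn tombstone_view_sql(dataset: &str, key_column: &str, deleted: &[&str]) -> Option<String> {
    if deleted.is_empty() {
        return None;
    }
    // Escape single quotes so each value stays a single SQL string literal.
    let literals: Vec<String> = deleted
        .iter()
        .map(|v| format!("'{}'", v.replace('\'', "''")))
        .collect();
    Some(format!(
        "SELECT * FROM \"__raw__{}\" WHERE CAST({} AS VARCHAR) NOT IN ({})",
        dataset,
        key_column,
        literals.join(", ")
    ))
}
```

Because every tombstone for a dataset shares one row_key_column, the
filter stays a single NOT IN clause regardless of how many rows are
deleted.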
End-to-end on candidates (100K rows):
- Pick CAND-000001/2/3 (Linda/Charles/Kimberly)
- POST tombstone -> rows_tombstoned: 3
- COUNT(*) drops 100000 -> 99997
- WHERE candidate_id IN (those 3) -> 0 rows
- candidates_safe view transitively excludes them
(Linda+Denver: __raw__candidates=159, candidates_safe=158)
- Restart: COUNT still 99997, 3 tombstones reload from disk
Reversibility: tombstones are reversible deletes, not destruction.
Power users can still query "__raw__{name}" to see deleted rows.
Phase 13 access control is what stops a non-admin from accessing
__raw__* tables.
Limits / follow-up:
- Physical compaction not yet integrated — Phase 8's compact_files
doesn't read tombstones during merge. Tombstoned rows are still
on disk until that integration ships.
- Phase 9 journald event emission for tombstones not wired —
tombstone records carry their own actor+reason+timestamp so the
audit trail is intact, but cross-referencing with the mutation
event log would help compliance reporting.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the llms3.com "AI-safe views" pattern: a named projection
that exposes only whitelisted columns, with optional row filter and
per-column redactions. AI agents (or Phase 13 roles) bind to the view;
they can never accidentally see PII even if they write raw SQL.
Schema (shared::types):
- AiView { name, base_dataset, columns: Vec<String>, row_filter,
column_redactions: HashMap<String, Redaction>, ... }
- Redaction enum: Null | Hash | Mask { keep_prefix, keep_suffix }
Catalog (catalogd::registry):
- put_view validates base dataset exists + columns non-empty
- Persists JSON at _catalog/views/{name}.json (sanitized name)
- rebuild() loads views alongside dataset manifests on startup
Query layer (queryd::context):
- build_context registers every AiView as a DataFusion view object
- Constructed SELECT applies whitelist projection, WHERE filter, and
redaction expressions per column
- Mask: substr(prefix) + repeat('*', mid_len) + substr(suffix)
- Hash: digest(value, 'sha256')
- Null: CAST(NULL AS VARCHAR) AS col
- DataFusion handles JOINs/aggregates over the view natively — it's a
real view, not a query rewrite
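To make the Mask semantics concrete, here is the same transformation in
plain Rust instead of as a SQL expression. The function name mask is
hypothetical, and fully masking values that are too short to keep both
prefix and suffix is an assumption the commit does not specify.

```rust
/// Illustrative Mask { keep_prefix, keep_suffix }: keep the first and
/// last characters, replace everything in between with '*'.
/// Assumption: values with no room for a masked middle are fully masked.
fn mask(value: &str, keep_prefix: usize, keep_suffix: usize) -> String {
    let chars: Vec<char> = value.chars().collect();
    let n = chars.len();
    if keep_prefix.saturating_add(keep_suffix) >= n {
        return "*".repeat(n);
    }
    let mut out = String::with_capacity(value.len());
    out.extend(chars[..keep_prefix].iter());
    out.extend(std::iter::repeat('*').take(n - keep_prefix - keep_suffix));
    out.extend(chars[n - keep_suffix..].iter());
    out
}
```

With prefix=3 and suffix=2, "CAND-000001" becomes "CAN******01".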
HTTP (catalogd::service):
- POST /catalog/views (create)
- GET /catalog/views (list)
- GET /catalog/views/{name} (full def)
- DELETE /catalog/views/{name}
End-to-end test on candidates (100K rows, 15 columns):
candidates_safe view:
columns: candidate_id, first_name, city, state, vertical,
skills, years_experience, status
row_filter: status != 'blocked'
redaction: candidate_id mask(prefix=3, suffix=2)
SELECT * FROM candidates_safe LIMIT 5
-> 8 columns only, candidate_id shown as "CAN******01"
(PII fields email/phone/last_name absent from result)
SELECT email FROM candidates_safe
-> fails (column not in projection)
SELECT email FROM candidates
-> succeeds (raw table still accessible by name —
Phase 13 access control is the gate, not the view itself)
Survives restart — view definitions reload from object storage.
Limits / not in MVP:
- View CANNOT shadow base table by name (DataFusion treats them as
separate identifiers; access control must restrict raw-table access)
- row_filter is treated as trusted SQL — operators must validate
before persisting; only authenticated admin path should call put_view
- Redaction expressions assume column is castable to VARCHAR; numeric
redactions could be misleading (a Hash on Int64 returns a hex string
that won't equi-join with another hash on the same value type)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three pieces of the multi-bucket federation made real:
1. Catalog migration (POST /catalog/migrate-buckets)
- One-shot normalizer for ObjectRef.bucket field
- Empty -> "primary"; legacy "data"/"local" -> "primary"
- Idempotent; re-running on canonical state is no-op
- Ran on existing catalog: 12 refs renamed from "data", 2 already
"primary", all 14 now canonical
2. X-Lakehouse-Bucket header middleware on ingest
- resolve_bucket() helper extracts header, returns
(bucket_name, store) or 404 with valid bucket list
- ingest_file and ingest_db_stream now route writes per-request
- Defaults to "primary" when header absent
- pipeline::ingest_file_to_bucket records the actual bucket on the
ObjectRef so catalog stays the source of truth for "where does this
data live"
- Verified: ingest with X-Lakehouse-Bucket: testing lands in
data/_testing/, ingest without header lands in data/, bad header
returns 404 with hint
3. queryd registers every bucket with DataFusion
- QueryEngine now holds Arc<BucketRegistry> instead of single store
- build_context iterates all buckets, registers each as a separate
ObjectStore under URL scheme "lakehouse-{bucket}://"
- ListingTable URLs include the per-object bucket scheme so
DataFusion routes scans automatically based on ObjectRef.bucket
- Profile bucket names like "profile:user" sanitized to
"lakehouse-profile-user" since URL host segments can't contain ":"
- Tolerant of duplicate manifest entries (pre-existing
pipeline::ingest_file behavior creates a fresh dataset id per
ingest); duplicates skipped with debug log
- Backward compat: legacy "lakehouse://data/" URL still registered
pointing at primary
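The bucket-name normalization from step 1 and the URL-scheme
sanitization from step 3 are both small string mappings. A sketch with
hypothetical function names (normalize_bucket, bucket_scheme); only the
mappings themselves come from the notes above:

```rust
/// Step 1: empty and legacy names collapse to "primary"; anything else
/// is already canonical, so re-running is a no-op.
fn normalize_bucket(bucket: &str) -> &str {
    match bucket {
        "" | "data" | "local" => "primary",
        other => other,
    }
}

/// Step 3: build the per-bucket URL scheme, replacing ':' because URL
/// host segments can't contain it ("profile:user" -> "lakehouse-profile-user").
fn bucket_scheme(bucket: &str) -> String {
    format!("lakehouse-{}", bucket.replace(':', "-"))
}
```

Applying normalize_bucket to its own output returns the same value,
which is what makes the migration idempotent.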
Success gate: cross-bucket CROSS JOIN
SELECT p.name, p.role, a.species
FROM people_test p        -- bucket: testing
CROSS JOIN animals a      -- bucket: primary
LIMIT 5
returns rows correctly. DataFusion routed each scan to its bucket's
ObjectStore based on the URL scheme.
No regressions: SELECT COUNT(*) FROM candidates still returns 100000
from the primary bucket.
Deferred to Phase 17:
- POST /profile/{user}/activate (HNSW hot-load on profile switch)
- vectord storage paths becoming bucket-scoped (trial journals,
eval sets per-profile)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive,
ingest marks the vector index stale, a later refresh embeds only the
delta (doc_ids not already in the index).
Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled
Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.
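The stale-flag semantics can be modelled as a two-field state machine.
A minimal sketch: EmbeddingState and its methods are hypothetical names,
timestamps are plain u64 seconds instead of DateTime, and keeping the
earliest stale timestamp across repeated writes is an assumption.

```rust
/// Illustrative subset of the manifest fields involved.
#[derive(Debug, Default, PartialEq)]
struct EmbeddingState {
    last_embedded_at: Option<u64>,
    embedding_stale_since: Option<u64>,
}

impl EmbeddingState {
    /// Called after an ingest write. No-op if the dataset was never
    /// embedded. Assumption: an already-set stale timestamp is kept.
    fn mark_stale(&mut self, now: u64) {
        if self.last_embedded_at.is_some() && self.embedding_stale_since.is_none() {
            self.embedding_stale_since = Some(now);
        }
    }

    /// Called after a successful refresh: records the time and clears
    /// the stale flag.
    fn mark_refreshed(&mut self, now: u64) {
        self.last_embedded_at = Some(now);
        self.embedding_stale_since = None;
    }
}
```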
Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
writes combined index, clears stale flag
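The delta step (keep only rows whose doc_id is not already indexed) is
a set-membership filter. A sketch with a hypothetical delta_rows
function; representing rows as (doc_id, text) String pairs is a
simplification of whatever the refresh path reads from Parquet.

```rust
use std::collections::HashSet;

/// Hypothetical delta filter: returns the rows whose doc_id is not in
/// the set of ids that already have embeddings. On a first-time build
/// `existing` is empty, so every row is returned.
fn delta_rows<'a>(
    rows: &'a [(String, String)],
    existing: &HashSet<String>,
) -> Vec<&'a (String, String)> {
    rows.iter()
        .filter(|row| !existing.contains(&row.0))
        .collect()
}
```

An empty result corresponds to the idempotent second refresh: nothing
to chunk or embed, so only the delta check itself costs time.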
Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set
End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
re-embed); stale_cleared = true
Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
per-row content hashing
- OnAppend policy auto-trigger - just declares intent; actual scheduler
deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
operators can see which datasets expect what, but the cron itself is
separate
Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DatasetManifest expanded: description, owner, sensitivity, columns,
lineage, freshness contract, tags, row_count
- All new fields use #[serde(default)] for backward compatibility
- PII auto-detection: scans column names for email, phone, SSN, salary,
address, DOB, medical terms — flags as PII/PHI/Financial
- Column-level metadata: name, type, sensitivity, is_pii flag
- Lineage tracking: source_system, source_file, ingest_job, timestamp
- Ingest pipeline auto-populates: PII scan, column meta, lineage, row count
- PATCH /catalog/datasets/by-name/{name}/metadata — update metadata
- Catalog responses now include all rich fields
- 25 unit tests passing (5 new PII detection tests)
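Name-based PII detection of the kind listed above could be sketched as
follows. Everything here is an assumption for illustration: the
Sensitivity variants, the keyword lists, the category each keyword maps
to, and the Public fallback are not taken from the actual scanner.

```rust
/// Hypothetical sensitivity levels; Public is an assumed fallback.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Sensitivity {
    Pii,
    Phi,
    Financial,
    Public,
}

/// Hypothetical classifier: case-insensitive substring match on the
/// column name. Substring matching can over-flag (e.g. "dob" inside an
/// unrelated word), which a real scanner would need to handle.
fn classify_column(name: &str) -> Sensitivity {
    const PHI: [&str; 3] = ["medical", "diagnosis", "prescription"];
    const FINANCIAL: [&str; 1] = ["salary"];
    const PII: [&str; 5] = ["email", "phone", "ssn", "address", "dob"];

    let lower = name.to_ascii_lowercase();
    let has = |terms: &[&str]| terms.iter().any(|t| lower.contains(*t));

    if has(&PHI) {
        Sensitivity::Phi
    } else if has(&FINANCIAL) {
        Sensitivity::Financial
    } else if has(&PII) {
        Sensitivity::Pii
    } else {
        Sensitivity::Public
    }
}
```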
Per ADR-013: datasets without metadata become mystery files.
This makes every ingested file self-describing from day one.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>