lakehouse

Author	SHA1	Message	Date
root	3708e6abf1	biometric endpoint: scrum-driven hardening Per 2026-05-03 phase_1_6_gate_3a scrum (10 findings, 0 convergent location-wise but opus + kimi flagged the same audit-failure issue). Convergent + load-bearing fix: Audit-write failure was silently swallowed (returned 200 with empty hmac) after photo + manifest persisted. For BIPA defensibility this is wrong — a successful response without an audit row is exactly the silent-failure mode the spec exists to prevent. Now: full transactional rollback. If audit append fails after photo + manifest commit, we remove the photo AND revert the manifest to its pre-upload state, then return 500 with error="audit_write_failed". Other real fixes: Orphan-file leak (opus WARN): if put_subject fails AFTER the photo is written, the file would orphan on disk with no manifest pointer. Now removes the photo on manifest-update failure, before returning 500. Content-Type parameter handling (opus WARN): real-world clients send `image/jpeg; charset=binary` etc. Parser now strips parameters per RFC 9110 §8.3 and matches case-insensitively. New regression test content_type_with_parameters_accepted exercises both. data_path doc/code mismatch (opus WARN): doc said "relative to the configured biometric storage root" but code stored absolute path. Now stores relative — operators reading the manifest reconstruct the absolute path with their own storage_root, manifests are portable across deployments. Tests updated. Timestamp-nanosecond collision (kimi WARN): added 8-char uuid suffix to filename. Sub-microsecond cadence collision was implausible but defense-in-depth is cheap. Dead code (opus + kimi INFO): removed unused require_legal_auth function (process_upload reimplements the auth check inline) and the `let _ = ConsentStatus::Given;` no-op type-shape reference. Skipped (acceptable in v1): - qwen BLOCK on image format validation: spec explicitly says "we trust the caller; malformed images fail downstream when deepface runs in Gate 3b". 
Documented in the file's module doc-comment. - qwen WARN on directory create-then-chmod race: brief window between create_dir_all and set_permissions. Mitigation would require libc-level umask manipulation; accepted as v1 scope. - qwen INFO on constant_time_eq duplication: comment explains the cross-import boundary; acceptable short-term per the reviewer. Tests: 11 unit tests pass (added content_type_with_parameters_accepted). Live verification post-restart: - Content-Type with `; charset=binary` accepted ✓ - data_path returned as relative `WORKER-2/<ts>_<uuid>.jpg` ✓ - Chain verified end-to-end (3 rows: validator + 2 biometric) ✓ - Cross-runtime parity probe still 6/6 byte-identical ✓ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 05:05:12 -05:00
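The Content-Type fix in the hardening commit above can be sketched as follows. This is a minimal illustration, not the endpoint's actual parser: the function name `image_extension` and the idea of returning a file extension are assumptions made for the example. It shows only the two behaviors the commit names, dropping parameters after the media type and matching case-insensitively.

```rust
/// Hypothetical helper (name and return shape are assumptions for this
/// sketch): map a raw Content-Type header value to a file extension,
/// ignoring any parameters and ASCII case.
fn image_extension(content_type: &str) -> Option<&'static str> {
    // Everything after the first ';' is a parameter list,
    // e.g. "charset=binary" in "image/jpeg; charset=binary".
    let media_type = content_type.split(';').next().unwrap_or("").trim();

    if media_type.eq_ignore_ascii_case("image/jpeg") {
        Some("jpg")
    } else if media_type.eq_ignore_ascii_case("image/png") {
        Some("png")
    } else {
        // A caller would answer 415 for anything else.
        None
    }
}
```

With this shape, `image/jpeg; charset=binary` and `IMAGE/PNG` are both accepted, while `text/plain` and an empty header are rejected.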
root	f1fa6e4e61	phase 1.6 Gate 3a: photo upload endpoint with consent gate Per docs/PHASE_1_6_BIPA_GATES.md §1 Gate 3 (consent-gate substrate). Deepface classification (Gate 3b) deferred to its own session — needs Python subprocess design conversation after the 2026-05-02 sidecar drop. What ships: shared/types.rs: - new BiometricCollection sub-struct: data_path, template_hash, collected_at, consent_version_hash, classifications (Option<JSON>) - SubjectManifest gains biometric_collection: Option<BiometricCollection> with #[serde(default)] so existing on-disk manifests parse and re-emit without drift catalogd/biometric_endpoint.rs (NEW, ~600 LOC): POST /subject/{candidate_id}/photo - Auth: X-Lakehouse-Legal-Token, constant-time-eq compared against same legal token file as /audit. Same 32-byte minimum. - Content-Type: must be image/jpeg or image/png (415 otherwise) - Body: raw image bytes, max 10MB - 401: missing or wrong token - 404: subject not registered - 403: consent.biometric.status != "given" (returns current status) - 403: subject status in {Withdrawn, Erased, RetentionExpired} - 200: writes photo to data/biometric/uploads/<sanitized_id>/<ts>.<ext> with mode 0700 dir + 0600 file, updates SubjectManifest with BiometricCollection record, appends audit row (kind="biometric_collection", purpose="photo_upload"), returns UploadResponse with template_hash + audit_row_hmac. Logic split: pure async fn process_upload() takes the headers-as-args so unit tests exercise every branch without HTTP machinery; the axum handler is just glue. 10 tests covering all 4 reject paths + happy path + repeated uploads chaining + structural assertion that the quarantine path is NOT under data/headshots/ (synthetic faces). gateway/main.rs: Mounts /biometric on the same condition as /audit — only when the SubjectAuditWriter is present AND the legal token loads. Storage root configurable via LH_BIOMETRIC_STORAGE_ROOT (default ./data/biometric/uploads). 
Live verification on the running gateway (post-restart): - GET /biometric/health → "biometric endpoint ready" - POST without token → 401 auth_failed - POST with token, no consent → 403 consent_required (status=NeverCollected) - Flipped WORKER-2 to consent=given, POST → 200 with hash + path - File at data/biometric/uploads/WORKER-2/<ts>.jpg, mode 0600 - Manifest biometric_collection field reflects the upload - Audit row chain links cleanly off the prior validator_lookup row - GET /audit/subject/WORKER-2 returns chain_verified=true, 2 rows - Cross-runtime parity probe still 6/6 byte-identical post-change Phase 1.6 status table updated: Gate 3a DONE, Gate 3b (deepface) deferred. Calendar bottleneck remains counsel review of items 1/2/5/6. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 04:55:32 -05:00
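The reject paths listed for the Gate 3a endpoint can be summarized as one pure decision function. The sketch below is an assumption-laden illustration: `upload_status`, the `Subject` struct, and the order in which the checks run are all hypothetical, and the real `process_upload()` may sequence them differently. Only the status codes and the enum variant names come from the commit messages.

```rust
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum SubjectStatus { PendingConsent, Active, Withdrawn, RetentionExpired, Erased }

#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum BiometricConsentStatus { NeverCollected, Pending, Given, Withdrawn, Expired }

/// Hypothetical, trimmed-down view of a subject manifest.
struct Subject {
    status: SubjectStatus,
    biometric_consent: BiometricConsentStatus,
}

/// Returns the HTTP status the upload would receive.
/// The check order here is an assumption of this sketch.
fn upload_status(token_ok: bool, content_type_ok: bool, subject: Option<&Subject>) -> u16 {
    if !token_ok {
        return 401; // missing or wrong legal token
    }
    if !content_type_ok {
        return 415; // not image/jpeg or image/png
    }
    let subject = match subject {
        Some(s) => s,
        None => return 404, // subject not registered
    };
    if matches!(
        subject.status,
        SubjectStatus::Withdrawn | SubjectStatus::Erased | SubjectStatus::RetentionExpired
    ) {
        return 403; // subject is no longer eligible for collection
    }
    if subject.biometric_consent != BiometricConsentStatus::Given {
        return 403; // consent gate
    }
    200
}
```

Keeping the decision free of HTTP types mirrors the stated split between `process_upload()` and the axum handler: every branch can be exercised with plain values.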
root	2222227c16	catalogd parity helper: scrum-driven hardening Per 2026-05-03 step_7_8_retention_and_parity scrum (opus). 5 findings, 0 convergent — but two real fixes shipped: 1. WARN parity_subject_audit.rs:argv — replace .expect() panics with stderr+exit(2). The parity script captures stdout for byte-compare; a Rust panic backtrace lands in stdout (script merges 2>&1) and reads as a parity break instead of a usage error. Added die() helper that mirrors the Go side's error-exit pattern. 2. INFO parity_subject_audit.rs:5 — doc comment hardcoded the absolute path /home/profit/golangLAKEHOUSE/... Replaced with repo-relative reference. INFO findings on retention_sweep argv style + --as-of report path overwrite were noted but not actioned (style only / acceptable for the forecast use case). The major scrum-surfaced bug (Go json.Marshal HTML-escaping <>& while serde_json keeps them literal) is fixed on the Go side in parallel commit. Rust side here is correct as-is — serde_json::to_vec doesn't HTML-escape by default, so no change needed in canonical_json. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 04:29:38 -05:00
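The `.expect()` to `die()` change described in the parity-helper commit above might look roughly like this. It is a sketch under assumptions: the `Mode` enum, `parse_args`, the usage string, and the fixed flag order are hypothetical; only the flag names, the stderr message, and exit code 2 are taken from the commit messages.

```rust
use std::process;

/// Hypothetical representation of the two documented modes.
#[derive(Debug, PartialEq)]
enum Mode {
    KnownAnswer,
    Verify { log_path: String, key_path: String },
}

/// Print a usage-style error to stderr and exit with code 2,
/// instead of panicking with a backtrace.
#[allow(dead_code)]
fn die(msg: &str) -> ! {
    eprintln!("parity_subject_audit: {msg}");
    process::exit(2)
}

/// Parse argv (without the program name). Accepting only this exact
/// flag order is a simplification made for the sketch.
fn parse_args(args: &[String]) -> Result<Mode, String> {
    match args {
        [flag] if flag == "--known-answer" => Ok(Mode::KnownAnswer),
        [verify, log, key_flag, key] if verify == "--verify" && key_flag == "--key" => {
            Ok(Mode::Verify { log_path: log.clone(), key_path: key.clone() })
        }
        _ => Err("usage: --known-answer | --verify <audit_log_path> --key <key_path>".to_string()),
    }
}
```

A `main` would then call `parse_args(&args).unwrap_or_else(|e| die(&e))`, so a malformed invocation produces a distinct exit code rather than output that looks like a byte mismatch.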
root	2413c96817	catalogd: Step 8 — parity_subject_audit binary (Rust side) Per docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 8. Cross-runtime parity helper consumed by: golangLAKEHOUSE/scripts/cutover/parity/subject_audit_parity.sh Two modes: --known-answer Print canonical-JSON + HMAC for a hardcoded fixture row. The Go helper at golangLAKEHOUSE/scripts/cutover/parity/subject_audit_helper/ must produce byte-identical output. Catches algorithm drift (canonical-JSON sort order, HMAC algorithm, hex encoding). --verify <audit_log_path> --key <key_path> Replay the chain on a real production audit log via the live SubjectAuditWriter::verify_chain (no re-implementation; the actual production verification path). Output: one JSON line with mode, count, tip, verified, error. The helper exercises the SAME verify_chain path the gateway calls, so algorithm changes in subject_audit.rs automatically flow into the parity probe. Live-verified against 5 production audit logs in data/_catalog/subjects; all 6 parity assertions pass after fixing two real cross-runtime drifts on the Go side (omitempty trace_id stripping field; time.RFC3339Nano stripping trailing zero in nanoseconds — both caught by this probe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 04:16:50 -05:00
root	8fc6238dea	catalogd: Step 7 — daily retention sweep binary Per docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 7: "Subjects whose retention.general_pii_until < now AND status != erased get marked for review (don't auto-delete; legal needs to approve)." Per shared::types::BiometricConsent doc-comment (BIPA requirement on biometric data, max 3 years from last interaction): "Implementation MUST enforce daily expiration sweep against this field." Therefore the sweep checks BOTH retention clocks. Reports overdue subjects to data/_catalog/subjects/_retention_sweep_<YYYY-MM-DD>.jsonl. Idempotent: subjects already in {Erased, RetentionExpired} are skipped so daily runs do not append duplicate rows. Does NOT mutate subject manifests. Legal/operator owns the action (extend, flip status, schedule erasure). CLI: retention_sweep # dry-run (default), stderr only retention_sweep --apply # also write JSONL report retention_sweep --as-of <RFC3339> # alternate clock for forecast/test retention_sweep --storage-root <dir> # default ./data Tests: 8 unit tests on is_overdue covering all 5 SubjectStatus values, both clocks, BIPA-only path, and idempotency on already-flagged subjects. Live verification (100 subjects in ./data/_catalog/subjects): - now (2026-05-03): 0 overdue (correct — 4-year retention) - --as-of 2031-06-01: 100 overdue, 394 days past, jsonl report shape verified with biometric fields correctly omitted via serde skip_serializing_if when subject has no biometric clock. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 04:05:03 -05:00
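The overdue rule from the retention-sweep commit above (check both retention clocks, skip subjects already in Erased or RetentionExpired) can be sketched as a small predicate. The assumptions are stated plainly: timestamps are plain epoch seconds here rather than RFC3339 values, and the field name `biometric_until` is hypothetical; only `general_pii_until` and the status names appear in the commit messages.

```rust
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum SubjectStatus { PendingConsent, Active, Withdrawn, RetentionExpired, Erased }

/// Simplified retention record. `biometric_until` is a hypothetical
/// name for the BIPA clock; it is None when no biometric data exists.
struct Retention {
    general_pii_until: u64,
    biometric_until: Option<u64>,
}

/// True when the subject should be reported for legal review.
fn is_overdue(status: SubjectStatus, retention: &Retention, now: u64) -> bool {
    // Idempotency: already-flagged or erased subjects are never re-reported.
    if matches!(status, SubjectStatus::Erased | SubjectStatus::RetentionExpired) {
        return false;
    }
    let general_overdue = retention.general_pii_until < now;
    let biometric_overdue = retention.biometric_until.map_or(false, |until| until < now);
    general_overdue || biometric_overdue
}
```

Passing `now` as an argument is what makes an `--as-of` style forecast possible: the same predicate runs against any clock.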
root	2a4b316a15	subjects: 2nd scrum fix wave (token min, chain_tip, tampering, rebuild collision warn) Second cross-lineage scrum on Steps 5+6 returned 13 distinct findings, 0 convergent. Three BLOCK-class claims verified as false positives (cache IS written, per-subject Mutex IS in place, spawn IS safe under writer's lock). Five real fixes shipped: 1. audit_endpoint: legal token min length 16->32 (HMAC-SHA256 best practice, kimi) 2. subject_audit: new chain_tip() returns last hash from full log; audit_endpoint now reports chain_root from full chain instead of windowed slice (opus) 3. registry: rebuild loader now warns on sanitize collision (symmetric with put_subject's collision guard - opus) 4. audit_endpoint: tampering detection - if manifest expects non-empty chain_root but log returns 0 rows, flag chain_verified=false with explicit message (opus) 5. execution_loop::audit_result_state: tightened heuristic - error/denied/not_found only classify when no rows/data/results sibling (opus INFO) Tests: 17 catalogd subject + 6 gateway audit_result_state, all green. New: audit_result_state_does_not_classify_error_when_data_sibling_present, audit_result_state_status_is_authoritative_even_with_data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 04:00:42 -05:00
root	15cfd76c04	catalogd + gateway: Step 6 — /audit/subject/{id} legal-tier HTTP endpoint Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 6 + §4 (response shape) + §6 (auth model). The defense-against-EEOC-discovery surface is live: legal counsel hits one URL with one token, gets back a signed-by-HMAC-chain audit response naming every PII access for a subject in a time window. New module: crates/catalogd/src/audit_endpoint.rs (~340 LOC) - AuditEndpointState { registry, writer, legal_token } - router() exposes: GET /subject/{candidate_id}?from=ISO&to=ISO (full audit response) GET /health (liveness + token check) - require_legal_auth() — constant-time-eq compare against the X-Lakehouse-Legal-Token header. Avoids timing leaks on the token check without pulling in `subtle` for one comparison. - Token loaded from /etc/lakehouse/legal_audit.token (env-overridable via LH_LEGAL_AUDIT_TOKEN_FILE). Empty file or <16 chars = endpoint serves 503 with a clear reason. Token value NEVER logged. - Response schema: subject_audit_response.v1 with manifest + audit_log (rows + chain verification) + datasets_referenced + safe_views_available + completeness_attestation. New helper on SubjectAuditWriter: - read_rows_in_range(candidate_id, from, to) — returns rows in window, used by the endpoint to assemble the response without re-reading the entire chain. - verify_chain() now returns Ok(0) when the audit log file doesn't exist (empty = trivially valid). Prevents legitimate "no PII access yet for this subject" from showing as integrity=BROKEN in the audit response. Caller can detect "log was deleted" via comparison to SubjectManifest.audit_log_chain_root (when that mirror lands). main.rs: - Audit endpoint mounted at /audit ONLY when both subject_audit writer AND legal token are present. Disabled-by-default keeps the surface from accidentally serving in dev/bring-up environments without proper credentials. 
Tests (9/9 passing): - constant_time_eq (correctness on equal/diff/empty/length-mismatch) - missing_legal_token_returns_503 - missing_header_returns_401 - wrong_token_returns_401 - correct_token_passes_auth - audit_response_assembly_full_path (manifest + 3 rows + chain verify) - audit_response_window_filters_rows (time-bounded window) - empty_token_file_results_in_disabled_endpoint - short_token_file_rejected_at_load (<16 char min) LIVE end-to-end verification: 1. Plant signing key + legal token in /tmp/lakehouse_audit/ 2. Restart gateway with LH_SUBJECT_AUDIT_KEY + LH_LEGAL_AUDIT_TOKEN_FILE pointing at the test files 3. /audit/health → 200 "audit endpoint ready" 4. /audit/subject/WORKER-1 (no token) → 401 "missing X-Lakehouse-Legal-Token" 5. /audit/subject/WORKER-1 (wrong token) → 401 "X-Lakehouse-Legal-Token mismatch" 6. /audit/subject/WORKER-1 (correct token) → 200 + full manifest + 0 rows + chain_verified=true (empty log path) 7. POST /v1/validate with candidate_id=WORKER-1 → triggers WorkerLookup.find() via the AuditingWorkerLookup wrapper from Step 5 8. data/_catalog/subjects/WORKER-1.audit.jsonl now exists with 1 row (accessor.purpose=validator_worker_lookup, result=not_found, prev_chain_hash=GENESIS, valid HMAC) 9. /audit/subject/WORKER-1 (correct token) → 200 + manifest + 1 row + chain_verified=true + chain_rows_total=1 + completeness attestation The full audit-trail loop (PII access → audit row → chain → audit response) works end-to-end on the live gateway. 
NOT in this commit (future steps): - Step 7: Daily retention sweep - Step 8: Cross-runtime parity (Go side reads the same shapes) - Mirror chain root to SubjectManifest.audit_log_chain_root after each append (so tampering detection can use the manifest's cached root as ground truth) - Live row projection from datasets (currently caller follows up via /query/sql against the safe_views named in the response) - Ed25519 signature on the response (chain verification IS the v1 attestation; signing is future hardening per spec §10) cargo build --release clean. cargo test -p catalogd audit_endpoint 9/9 PASS. Live verification successful. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 03:52:04 -05:00
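The constant-time token comparison mentioned for `require_legal_auth()` is commonly written along these lines. This is a generic sketch of the technique, not the repository's code: it accumulates byte differences with XOR and OR so the loop does not exit early on the first mismatch. Returning early on a length mismatch reveals only the length, which is the usual trade-off for this pattern; an optimizing compiler is also not strictly guaranteed to preserve the timing property.

```rust
/// Compare two byte strings without short-circuiting on the first
/// differing byte. Length mismatch returns false immediately.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        // Any differing bit in any position sets a bit in `diff`.
        diff |= x ^ y;
    }
    diff == 0
}
```

The cases below match the ones the commit's test list names: equal, different, empty, and length-mismatched inputs.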
root	e38f3573ff	subject manifests Steps 1-4 — fix scrum-flagged BLOCKs and WARNs 2026-05-03 cross-lineage scrum on the subjects_steps_1_to_4 wave returned 14 distinct findings, 0 convergent. opus verdict was HOLD with 3 BLOCKs around the audit-chain integrity. All real. Fixed: ────────────────────────────────────────────────────────────────── BLOCK 1 — opus subject_audit.rs:172 + execution_loop.rs:391 Concurrency race: append_line is read-modify-write; the gateway hook used tokio::spawn fan-out → two concurrent appends to the same subject both read the same prev_hash, both compute their HMAC from the same prev, second write silently overwrites first → row lost AND chain broken. Fix: - SubjectAuditWriter gains per-subject Mutex map. append() acquires the subject's lock for the duration of the read-modify-write. Different subjects still parallelize. - Gateway hook switches from tokio::spawn to inline await. Per-row cost is ~1ms (one object_store put); inline is correct AND cheap. - New regression test: 50 concurrent appends to the same subject, asserts all 50 land with intact chain. BLOCK 2 — opus subject_audit.rs:108 Non-deterministic canonicalization: serde_json serializes struct fields in declaration order. Schema evolution (adding/reordering fields) silently changes the bytes verify_chain hashes → chain breaks even when nothing was actually tampered with. Fix: - New canonical_json() free fn — recursive value rewrite to sort object keys alphabetically (BTreeMap projection), arrays preserve order, scalars pass through. Stable across struct evolution. - Both append() and verify_chain() now compute HMAC over canonical bytes, not declaration-order bytes. - New regression tests: alphabetical-key + array-order-preserved. WARN — opus execution_loop:401 Audit row's `result` was hardcoded to "success" for every Ok(result) including payloads like {"error":"not found"}. Misleads compliance. 
Fix: - New audit_result_state() free fn that inspects the payload top-level for error/denied/not_found/status signals (per spec §3.2 enum). Defaults to "success" only when no error signal. - 4 new tests covering each enum case + falsy-signals defense. WARN — opus registry.rs:735 Storage-key collision: sanitize_view_name(id) is the disk key, but the in-memory HashMap was keyed by raw candidate_id. Two distinct ids that sanitize to the same key (e.g. "CAND/1" and "CAND_1") would collide on disk while appearing distinct in memory; second put silently overwrites first; rebuild loads only one. Fix: - put_subject() / get_subject() / delete_subject() / rebuild() all key the in-memory HashMap by sanitize_view_name(id), matching the storage key shape. - Collision guard: put_subject() refuses (with clear error) when the sanitized key matches an EXISTING subject with a DIFFERENT raw candidate_id. - New regression test: put("CAND/1") then put("CAND_1") errors + first subject survives. WARN — opus backfill_subjects.rs:189 trim_start_matches strips REPEATED prefixes; the spec wanted one-shot semantics. Edge case unlikely in practice but real. Fix: - Switched to strip_prefix(&prefix).unwrap_or(&cid). One-shot. INFO — opus subject_audit.rs:131 Per-byte format!("{:02x}", b) allocates each iteration. Hot path on every append. Fix: - Replaced with const HEX lookup table + push() into preallocated String. Same output bytes, no per-byte allocation. ────────────────────────────────────────────────────────────────── Test summary post-fix: catalogd subject_audit: 11/11 PASS (added 4 new — concurrency race regression, parallel-different-subjects, canonical-key sort, canonical-array order) catalogd registry subject: 6/6 PASS (added 1 new — collision guard) gateway execution_loop subject: 10/10 PASS (added 4 new — audit_result_state enum coverage) All 27 subject-related tests green. cargo build --release clean. 
The convergent-zero scrum result was misleading on its face — opus caught real BLOCKs that kimi/qwen missed. Per feedback_cross_lineage_review.md: opus is the load-bearing reviewer; single-opus BLOCKs warrant manual verification, which here confirmed all three were correct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 03:37:45 -05:00
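The canonicalization fix for BLOCK 2 above (sort object keys, preserve array order, pass scalars through) can be illustrated without any crates. The sketch below uses a hypothetical minimal `Value` enum in place of `serde_json::Value`, and its string escaping is deliberately simplified, so its output is not claimed to be byte-identical to what the real `canonical_json()` emits. It only demonstrates why sorted keys make the bytes independent of field declaration order.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a JSON value. Object fields are stored in
/// insertion order, like struct fields serialized in declaration order.
#[allow(dead_code)]
enum Value {
    Null,
    Bool(bool),
    Num(i64),
    Str(String),
    Array(Vec<Value>),
    Object(Vec<(String, Value)>),
}

/// Simplified string quoting (assumption: not serde_json's exact rules).
fn quote(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Serialize with object keys sorted; arrays keep their order.
fn canonical_json(v: &Value) -> String {
    match v {
        Value::Null => "null".to_string(),
        Value::Bool(b) => b.to_string(),
        Value::Num(n) => n.to_string(),
        Value::Str(s) => quote(s),
        Value::Array(items) => {
            let parts: Vec<String> = items.iter().map(canonical_json).collect();
            format!("[{}]", parts.join(","))
        }
        Value::Object(fields) => {
            // BTreeMap iteration is ordered by key, which gives the sort.
            let sorted: BTreeMap<&str, &Value> =
                fields.iter().map(|(k, v)| (k.as_str(), v)).collect();
            let parts: Vec<String> = sorted
                .iter()
                .map(|(k, v)| format!("{}:{}", quote(k), canonical_json(v)))
                .collect();
            format!("{{{}}}", parts.join(","))
        }
    }
}
```

Two objects that differ only in field order serialize to the same bytes, so a keyed hash computed over them stays stable when a struct's fields are reordered.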
root	bce6dfd1ee	catalogd: Step 3 — backfill_subjects binary (BIPA-defensible defaults) Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 3. Reads a parquet source, creates one SubjectManifest per row with the spec-defined safe defaults, persists via Registry::put_subject(). Defaults baked in (per spec §2 + §5 Step 5): - vertical = unknown (HIPAA fail-closed) - consent.general_pii = pending_backfill_review (NOT inferred_existing — BIPA defense) - consent.biometric = never_collected (no biometric data backfilled) - retention.general_pii_until = now + 4 years - retention.policy = "4_year_default" Conservative ergonomics: - --limit 1000 by default. --all to do the full source. - --dry-run for parse + count + sample without writes. - --concurrency 32 (bounded via tokio::sync::Semaphore). - Idempotent: skips subjects that already exist in catalog. - Progress reports every ~5% (or 5K rows, whichever smaller). Live verification on workers_500k.parquet: --limit 100 dry-run: parsed 100 rows, sampled WORKER-1..5, 0 writes ✓ --limit 100 commit: 100 inserted, 0 failed, 100 files in data/_catalog/subjects/ ✓ --limit 100 re-run: 0 inserted, 100 skipped (idempotent) ✓ Sample manifest (data/_catalog/subjects/WORKER-1.json): { "schema": "subject_manifest.v1", "candidate_id": "WORKER-1", "status": "active", "vertical": "unknown", "consent": { "general_pii": {"status": "pending_backfill_review", ...}, "biometric": {"status": "never_collected", ...} }, "retention": {"general_pii_until": "2030-05-02T...", "policy": "4_year_default"}, "datasets": [{"name": "workers_500k", "key_column": "worker_id", "key_value": "1"}] } NOT in this commit (future steps): - Step 4: Wire gateway tool registry to write audit rows on every candidate_id returned (uses SubjectAuditWriter from Step 2) - Step 5: Wire validator WorkerLookup similarly - Step 6: /audit/subject/{id} HTTP endpoint - Step 7: Daily retention sweep - Backfill the full 500K (operator decision: --all when ready; note: 
500K JSON files in one dir will slow startup load — may want SQLite/single-file backend before that scale) Operator note: backfill is run-once. To extend to candidates table, re-run with --dataset candidates --key-column candidate_id (no prefix since candidate_id is already the canonical token there). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 03:22:54 -05:00
root	d16131bcab	catalogd: Step 2 — SubjectAuditWriter with HMAC chain Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md Step 2. Per-subject append-only audit JSONL with HMAC-SHA256 chain. Local-first — no Vault, no external anchor (those are v2 if SOC2 Type II becomes contract-required; v1 deliberately stays small). shared/types.rs additions: - AuditAccessor — kind, daemon, purpose, trace_id - SubjectAuditRow — schema/ts/candidate_id/accessor/fields_accessed/result/prev_chain_hash/row_hmac crates/catalogd/src/subject_audit.rs (NEW): - SubjectAuditWriter — holds signing key + per-subject latest-hash cache - from_key_file() — loads key from sealed file, requires ≥32 bytes - with_inline_key() — for tests + bring-up - append() — computes HMAC chain link, persists JSONL row, returns new chain root (caller mirrors to SubjectManifest.audit_log_chain_root) - verify_chain() — full re-verification of a subject's audit log, catches both prev_hash drift AND row-level HMAC tampering - scan_latest_hash() — cold-start path, finds prev_hash from JSONL tail - append_line() — read-modify-write pattern (object stores have no native append; same shape as the rest of catalogd's persistence) Crypto: HMAC-SHA256 via the standard `hmac` crate (added to workspace + catalogd deps; not implementing crypto by hand). Output is lowercase hex matching the rest of the codebase's SHA-256 conventions. Security choices: - NO Debug impl on SubjectAuditWriter — auto-deriving Debug would risk leaking the signing key into log lines. Tests work around this by matching on Result instead of using .unwrap_err(). - Key min length 32 bytes (matches the HMAC-SHA256 output size; the block size is 64 bytes). - Failures are NOT swallowed — Result returned, caller decides whether to log + continue (per spec §3.2 the gateway tool registry SHOULD log + continue rather than block reads). 
Tests (7/7 passing): - first_append_uses_genesis_prev_hash - chain_links_each_append (3-row chain verifies) - separate_subjects_have_independent_chains (per-subject isolation) - tamper_detected_on_verify (mutation in middle of chain breaks verify) - cold_writer_picks_up_existing_chain (process restart preserves chain) - empty_candidate_id_rejected - key_too_short_rejected_via_file NOT in this commit (future steps): - Step 3: Backfill ETL from workers_500k.parquet (next per J) - Step 4: Wire gateway tool registry to call append() on every candidate_id returned by search_candidates / get_candidate - Step 5: Wire validator WorkerLookup similarly - Step 6: /audit/subject/{id} HTTP endpoint - Step 7: Daily retention sweep - Mirroring chain root to SubjectManifest.audit_log_chain_root (separate concern; do at the call site) cargo check --workspace clean. cargo test -p catalogd subject_audit 7/7 PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 03:19:18 -05:00
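The chain structure described for `append()` and `verify_chain()` can be sketched with the MAC abstracted away. Everything here is an assumption-labeled illustration: the `Row` struct is trimmed to three fields, the MAC input format (`prev|payload`) is invented for the example, and `toy_mac` is a non-cryptographic stand-in used only so the sketch needs no crates. The real writer is described as using HMAC-SHA256. What the sketch does take from the commit messages is the `GENESIS` starting value, the two failure modes (prev-hash drift and row-level tampering), and `Ok(0)` for an empty log.

```rust
const GENESIS: &str = "GENESIS";

/// Trimmed-down audit row (assumption: the real row has more fields).
struct Row {
    payload: String,
    prev_chain_hash: String,
    row_hmac: String,
}

/// Stand-in for HMAC-SHA256 (FNV-1a, unkeyed). NOT a MAC; illustration only.
fn toy_mac(input: &str) -> String {
    let h = input
        .bytes()
        .fold(0xcbf29ce484222325u64, |h, b| (h ^ b as u64).wrapping_mul(0x100000001b3));
    format!("{h:016x}")
}

/// Link a new row to the current tip of the chain.
fn append(chain: &mut Vec<Row>, payload: &str, mac: impl Fn(&str) -> String) {
    let prev = chain.last().map_or(GENESIS.to_string(), |r| r.row_hmac.clone());
    let row_hmac = mac(&format!("{prev}|{payload}"));
    chain.push(Row { payload: payload.to_string(), prev_chain_hash: prev, row_hmac });
}

/// Re-walk the chain; returns the verified row count or the first break.
fn verify_chain(chain: &[Row], mac: impl Fn(&str) -> String) -> Result<usize, String> {
    let mut expected_prev = GENESIS.to_string();
    for (i, row) in chain.iter().enumerate() {
        if row.prev_chain_hash != expected_prev {
            return Err(format!("row {i}: prev_chain_hash drift"));
        }
        if mac(&format!("{}|{}", row.prev_chain_hash, row.payload)) != row.row_hmac {
            return Err(format!("row {i}: row_hmac mismatch"));
        }
        expected_prev = row.row_hmac.clone();
    }
    Ok(chain.len())
}
```

Because each row's MAC covers the previous row's MAC, editing any row in the middle invalidates that row and every link after it.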
root	d25990982c	catalogd: Step 1 — SubjectManifest type + Registry CRUD Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md Step 1. Mirrors the existing AiView put/get/list/delete pattern. NOT a separate daemon, NOT new infrastructure — extends catalogd's manifest layer with a fifth manifest type (subject) alongside dataset/view/tombstone/profile. shared/types.rs additions: - SubjectManifest (the wire format from spec §2) - SubjectStatus enum: pending_consent | active | withdrawn | retention_expired | erased - SubjectVertical enum: unknown | general | healthcare | finance | other (default = Unknown for fail-closed routing per spec §2.1) - ConsentStatus enum: pending_backfill_review | pending_first_contact | given | withdrawn | expired - BiometricConsentStatus enum: never_collected | pending | given | withdrawn | expired - GeneralPiiConsent + BiometricConsent + SubjectConsent - SubjectRetention (general_pii_until + policy) - SubjectDatasetRef (name + key_column + key_value pointing at existing catalogd dataset manifests) catalogd/registry.rs additions: - subjects: Arc<RwLock<HashMap<String, SubjectManifest>>> field on Registry - put_subject() — validates dataset refs, persists to _catalog/subjects/<id>.json, updates in-memory cache - get_subject() / list_subjects() / delete_subject() / subjects_count() - rebuild() now loads subject manifests at startup alongside views + profiles + tombstones Tests (5/5 passing): - put_subject_with_no_dataset_refs_succeeds - put_subject_rejects_dangling_dataset_ref (validation works) - put_subject_with_valid_dataset_ref_succeeds - subject_round_trips_through_object_store (persistence works) - delete_subject_removes_in_memory_and_persistence NOT in this commit (future steps): - Step 2: SubjectAuditWriter with HMAC chain - Step 3: Backfill ETL from workers_500k.parquet - Steps 4-5: Wire gateway tool registry + validator to write audit rows - Step 6: /audit/subject/{id} HTTP endpoint - Step 7: Daily retention 
sweep cargo check --workspace clean. cargo test -p catalogd subject 5/5 PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 03:13:08 -05:00
root	f59ddbebd4	Phase 41: Profile System Expansion - ProfileType enum: Execution, Retrieval, Memory, Observer - Per-type endpoints: /profiles/retrieval, /profiles/memory, /profiles/observer - profile_type field on ModelProfile - All tests pass	2026-04-23 03:07:22 -05:00
profit	5b1fcf6d27	Phase 28-36 body of work Accumulated since a6f12e2 (Phase 21 Rust port + Phase 27 versioning): - Phase 36: embed_semaphore on VectorState (permits=1) serializes seed embed calls — prevents sidecar socket collisions under concurrent /seed stress load - Phase 31+: run_stress.ts 6-task diverse stress scaffolding; run_e2e_rated.ts + orchestrator.ts tightening - Catalog dedupe cleanup: 16 duplicate manifests removed; canonical candidates.parquet (10.5MB -> 76KB) + placements.parquet (1.2MB -> 11KB) regenerated post-dedupe; fresh manifests for active datasets - vectord: harness EvalSet refinements (+181), agent portfolio rotation + ingest triggers (+158), autotune + rag adjustments - catalogd/storaged/ingestd/mcp-server: misc tightening - docs: Phase 28-36 PRD entries + DECISIONS ADR additions; control-plane pivot banner added to top of docs/PRD.md (pointing at docs/CONTROL_PLANE_PRD.md which lands in next commit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 02:41:15 -05:00
root	0d037cfac1	Phases 16.2 + L2 + 17 VRAM gate + MySQL + 18 Lance hybrid milestone Five threads of work landing as one milestone — all individually verified end-to-end against real data, full release build clean, 46 unit tests pass. ## Phase 16.2 / 16.5 — autotune agent + ingest triggers `vectord::agent` is a long-running tokio task that watches the trial journal and autonomously proposes + runs new HNSW configs. Distinct from `autotune::run_autotune` (synchronous one-shot grid). Triggered on POST /vectors/agent/enqueue/{idx} or by the periodic wake; ingest paths now push DatasetAppended events when an index's source dataset gets re-ingested. Rate-limited (max_trials_per_hour) and cooldown-gated so it can't saturate Ollama under live load. The proposer is ε-greedy around the current champion: with prob 0.25 sample random from full bounds, otherwise perturb champion ± small delta on both axes. Dedup against history. Deterministic — RNG seeded from history.len() so the same journal state proposes the same next config (helps offline replay debugging). `[agent]` config section in lakehouse.toml; opt-in via enabled=true. ## Federation Layer 2 — runtime bucket lifecycle + per-index scoping `BucketRegistry.buckets` moved to `std::sync::RwLock<HashMap>` so buckets can be added/removed after startup. POST /storage/buckets provisions at runtime; DELETE /storage/buckets/{name} unregisters (refuses primary/rescue with 403). Local-backend buckets get their root directory auto-created. `IndexMeta.bucket` (default "primary" via serde) records each index's home bucket. `TrialJournal` and `PromotionRegistry` now hold Arc<BucketRegistry> + IndexRegistry; they resolve target store per-index via IndexMeta.bucket. PromotionRegistry::list_all scans every bucket and dedups by index_name. Pre-federation indexes keep working unchanged — they just default to primary. `ModelProfile.bucket: Option<String>` declares per-profile artifact home. 
POST /vectors/profile/{id}/activate auto-provisions the profile's bucket under storage.profile_root if not yet registered. EvalSets stay primary-only for now — noted gap, low-risk to extend later with the same resolver pattern. ## Phase 17 — VRAM-aware two-profile gate Sidecar gains POST /admin/unload (Ollama keep_alive=0 trick — forces immediate VRAM release), POST /admin/preload (keep_alive=5m with empty prompt, takes the slot warm), and GET /admin/vram (combines nvidia-smi snapshot with Ollama /api/ps). Exposed via aibridge as unload_model / preload_model / vram_snapshot. `VectorState.active_profile` is the GPU-slot singleton — Arc<RwLock<Option<ActiveProfileSlot>>>. activate_profile checks for a previous profile with a different ollama_name and unloads it before preloading the new one; same-model reactivations skip the unload (Ollama no-ops). New routes: POST /vectors/profile/{id}/deactivate (unload + clear slot), GET /vectors/profile/active. Verified live: staffing-recruiter (qwen2.5) → docs-assistant (mistral) swap freed qwen2.5 from VRAM and loaded mistral. nomic-embed-text persists across swaps because both profiles use it — free optimization that fell out of the design. Scoped search correctly 403s cross-profile in both directions. ## MySQL streaming connector `crates/ingestd/src/my_stream.rs` mirrors pg_stream.rs for MySQL. Pure-rust `mysql_async` driver (default-features=false to avoid C deps). Same OFFSET pagination, same Parquet-streaming write shape. Type mapping per ADR-010: int/bigint → Int32/Int64, decimal/float → Float64, tinyint(1)/bool → Boolean, everything else → Utf8 with fallback parsers for date/time/json/uuid via Display. POST /ingest/mysql parallel to /ingest/db. Same PII auto-detection, same lineage capture (source_system="mysql"), same agent-trigger hook. `redact_dsn` generalized — was hardcoded to "postgresql://" length, now works for any scheme://user:pass@host/path URL (latent PII leak fix for MySQL DSNs). 
Verified live against MariaDB on localhost: 10 rows × 9 columns of test data round-tripped through datatypes int/varchar/decimal/tinyint/datetime/text. PII detection auto-flagged name + email. Aggregation queries through DataFusion match the source values exactly. ## Phase 18 — Hybrid Parquet+HNSW ⊕ Lance backend (ADR-019) `vectord-lance` is a new firewall crate. Lance pulls Arrow 57 and DataFusion 52 — incompatible with the rest of the workspace's Arrow 55 / DataFusion 47. The firewall isolates that dep tree: public API uses only std types (Vec<f32>, Vec<String>, Hit, Row, Stats), so no Arrow types cross the crate boundary and nothing propagates to vectord. The ADR-019 path that didn't ship until now. `vectord::lance_backend::LanceRegistry` lazy-creates a LanceVectorStore per index, resolving bucket → URI via the conventional local-bucket layout. `IndexMeta.vector_backend` and `ModelProfile.vector_backend` carry the choice (default Parquet so existing indexes unchanged). Six routes under /vectors/lance/: - migrate/{idx}: convert binary-blob Parquet → Lance FixedSizeList - index/{idx}: build IVF_PQ - search/{idx}: vector search (embed via sidecar) - doc/{idx}/{doc_id}: random row fetch - append/{idx}: native fragment append - stats/{idx}: row count + index presence Verified live on the real resumes_100k_v2 corpus (100K × 768d): - Migrate: 0.57s - Build IVF_PQ index: 16.2s (matches ADR-019 bench; 14× faster than HNSW's 230s for the same data) - Search end-to-end (Ollama embed + Lance scan): 23-53ms - Random doc_id fetch: 5-7ms (filter scan; faster than Parquet's ~35ms full-file scan, slower than the bench's 311us positional take — would close that gap with a scalar btree on doc_id) - Append 100 rows: 3.3ms / +320KB on disk vs Parquet's required full ~330MB rewrite — the structural win - Index survives append; both backends coexist cleanly ## Known follow-ups not in this milestone - ModelProfile.vector_backend doesn't yet auto-route /vectors/profile/{id}/search to 
Lance; callers go through /vectors/lance/* directly - Scalar btree on doc_id (closes the 5-7ms → ~300us gap) - vectord-lance built default-features=false → no S3 yet - IVF_PQ recall not measured (ADR-019 caveat) — needs a Lance-aware variant of the eval harness - Watcher-path ingest doesn't push agent triggers (HTTP paths do) - EvalSets still primary-only (federation gap) - No PATCH endpoint to move an existing index between buckets - The pre-existing storaged::append_log doctest fails to compile (malformed `{prefix}/` parses as code fence) — pre-existing bug, left for a focused fix 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 20:24:46 -05:00
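
The `redact_dsn` generalization described in the commit above can be illustrated with a minimal std-only sketch. This is not the code from `ingestd`; the function body, the `***` placeholder, and the handling of edge cases are assumptions made for illustration. It locates the userinfo part of any `scheme://user:pass@host/path` URL instead of relying on a fixed scheme length.

```rust
/// Hypothetical sketch of a scheme-agnostic DSN redactor.
/// Returns the DSN with the password replaced by `***`.
/// DSNs without a scheme or without a password are returned unchanged.
fn redact_dsn(dsn: &str) -> String {
    // Locate the end of the scheme instead of assuming "postgresql://".
    let Some(scheme_end) = dsn.find("://") else {
        return dsn.to_string();
    };
    let rest_start = scheme_end + 3;
    let rest = &dsn[rest_start..];

    // The authority component ends at the first '/', '?' or '#'.
    let authority_end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    let authority = &rest[..authority_end];

    // Userinfo is everything before the last '@' inside the authority.
    let Some(at) = authority.rfind('@') else {
        return dsn.to_string();
    };
    let userinfo = &authority[..at];

    // Only redact when a password is actually present.
    let Some(colon) = userinfo.find(':') else {
        return dsn.to_string();
    };
    let user = &userinfo[..colon];

    // Scheme prefix + user + placeholder + everything from '@' onward.
    format!("{}{}:***{}", &dsn[..rest_start], user, &rest[at..])
}
```

Because the scheme boundary is found by searching for `://`, the same function covers `mysql://` and `postgresql://` DSNs alike.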
root	4e1c400f5d	Phase E.2: Compaction integrates tombstones — physical deletion closes GDPR loop Phase E gave us soft-delete at query time (tombstones hide rows via a DataFusion filter view). This completes the invariant: after compact, tombstoned rows are PHYSICALLY absent from the parquet on disk. delta::compact changes: - Signature adds tombstones: &[Tombstone] - After merging base + deltas, apply_tombstone_filter builds a BooleanArray keep-mask per batch (True where row_key_value is NOT in the tombstone set) and applies arrow::compute::filter_record_batch - Supports Utf8, Int32, Int64 key columns (matches refresh.rs coverage for pg- and csv-derived schemas) - CompactResult gains tombstones_applied + rows_dropped_by_tombstones - Caller clears tombstone store on success Critical correctness fix surfaced during E2E testing: The original Phase 8 compact concatenated N independent Parquet byte streams from record_batch_to_parquet() — each with its own footer. Parquet readers only see the FIRST footer's data; the rest is invisible. Latent since Phase 8 shipped; triggered by tombstone-filtering producing multiple batches. Corrupted candidates.parquet on first test run (restored from UI fixture copy — good argument for test data in repo). Fix: - Single ArrowWriter per compaction, writes every batch into one properly-footered Parquet - Snappy compression to match ingest defaults (otherwise rewrite inflated file 3× — 10.5MB → 34MB — because no compression was set) - Verify-before-swap: parse written buf back to confirm row count matches expected; refuses to overwrite base_key if verification fails - Write to {base_key}.compact-{ts}.tmp first, then to base_key; delete temp; only then delete delta files. Any error along the way leaves the original base intact. TombstoneStore::clear(dataset) drops all tombstone batch files and evicts the per-dataset AppendLog from cache. Called after successful compact. 
QueryEngine::catalog() accessor exposes the Registry so queryd handlers can reach the tombstone store without routing through gateway state. E2E on candidates (100K rows, 15 cols): - Baseline: 10.59 MB, 100000 rows - Tombstone CAND-000001/2/3 (soft-delete): 99997 visible, 100000 raw - Compact: tombstones_applied=3, rows_dropped=3, final_rows=99997 - Post: 10.72 MB (Snappy), valid parquet (1 row_group), 99997 rows - Restart: persists, tombstones list empty, __raw__candidates also 99997 (the 3 IDs are physically gone from disk) PRD invariant close: deletion is now actually deletion, not just masking. GDPR erasure request → tombstone + schedule compact → data gone. Deferred: - Compact-all-datasets cron (currently manual per-dataset via POST /query/compact) - Compaction of tombstone batch files themselves (they grow at flush_threshold=1 per tombstone; TombstoneStore::compact exists but not auto-called) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 10:38:30 -05:00
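
The keep-mask step in the commit above (true where the row key is NOT in the tombstone set, then filter) can be sketched without Arrow. The following is a std-only stand-in, not the project's `apply_tombstone_filter`: plain slices take the place of a `BooleanArray` and `arrow::compute::filter_record_batch`, and both function names are hypothetical.

```rust
use std::collections::HashSet;

/// Build a keep-mask for one batch: `true` where the row key is NOT tombstoned.
/// (Stand-in for the BooleanArray the commit describes.)
fn keep_mask(row_keys: &[&str], tombstoned: &HashSet<&str>) -> Vec<bool> {
    row_keys.iter().map(|k| !tombstoned.contains(k)).collect()
}

/// Apply a keep-mask to a column of values, dropping masked-out rows.
/// (Stand-in for arrow::compute::filter_record_batch on a single column.)
fn filter_rows<T: Clone>(rows: &[T], mask: &[bool]) -> Vec<T> {
    rows.iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(row, _)| row.clone())
        .collect()
}
```

The number of `false` entries in the mask corresponds to what the commit reports as `rows_dropped_by_tombstones`.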
root	a293502265	Phase 17: Model profiles + scoped search — the LLM-brain keystone Implements PRD invariant 9 ("every reader gets its own profile") and completes the multi-model substrate vision. Local models (or agents) bind to a named set of datasets; activation pre-loads their vector indexes into memory; search enforces scope. Schema (shared::types): - ModelProfile { id, ollama_name, description, bound_datasets, hnsw_config, embed_model, created_at, created_by } - ProfileHnswConfig mirrors vectord::trial::HnswConfig to avoid a cross-crate dep cycle. Default (ec=80, es=30) matches the Phase 15 trial winner. - bound_datasets can reference raw dataset names OR AiView names (both register as DataFusion tables with the same name, so mixing raw tables and PII-redacted views composes naturally) Catalog (catalogd::registry): - put_profile validates id is a slug (alphanumeric + -_ only) and every binding resolves to an existing dataset or view - Persistence at _catalog/profiles/{id}.json, loaded on rebuild - get_profile / list_profiles / delete_profile HTTP endpoints: - POST /catalog/profiles (create/update) - GET /catalog/profiles (list) - GET/DELETE /catalog/profiles/{id} - POST /vectors/profile/{id}/activate (HNSW hot-load) - POST /vectors/profile/{id}/search (scope-enforced) Activation (vectord::service::activate_profile): - For each bound dataset, find vector indexes with matching source - Pre-load embeddings into EmbeddingCache - Build HNSW with profile's config - Report warmed indexes + per-binding failures + duration - Failures on individual bindings don't abort — "substrate keeps working" per ADR-017 Scoped search (vectord::service::profile_scoped_search): - Look up profile, verify index.source ∈ profile.bound_datasets - Returns 403 with allowed bindings list if out-of-scope - Uses HNSW if index is warm, brute-force cosine otherwise (graceful degradation — no "must activate first" friction) Bug fix surfaced during testing: vectord::refresh::try_update_index_meta 
was a no-op for first-time indexes, so threat_intel_v1 and kb_team_runs_v1 (both built via refresh after Phase C shipped) didn't show up in the index registry. Now it auto-infers the source from the index name convention (`{source}_vN`) and registers new metadata with reasonable defaults. End-to-end verified: - Created security-analyst profile bound to [threat_intel] - POST /vectors/profile/security-analyst/activate → warmed threat_intel_v1 (54 vectors) in 156ms, HNSW built - Within-scope search: method=hnsw, returned relevant IP indicators - Out-of-scope: tried to search resumes_100k_v2 (source=candidates) → 403 "profile 'security-analyst' is not bound to 'candidates' — allowed bindings: [\"threat_intel\"]" - staffing-recruiter profile created bound to candidates + placements; search without activation fell through to brute_force (graceful) Deferred (Phase 17 followups): - VRAM-aware activation (unload-then-load via Ollama keep_alive=0) — Ollama already handles this; we don't need to reinvent - Model-identity in audit trail — Phase 13 has role-based audit; adding model_id is ~20 LOC when we want it - Profile bucket pre-load (profile:user bucket mount) — Phase 17.5 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 10:09:43 -05:00
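
The `{source}_vN` naming convention used by the bug fix above lends itself to a small parser. The sketch below is a guess at the shape of that inference, not the code in `vectord::refresh`; the function name and the rule that `N` must be all digits are assumptions.

```rust
/// Hypothetical sketch: infer the source dataset from an index name of the
/// form `{source}_vN`, where N is one or more ASCII digits.
/// Returns None when the name does not follow the convention.
fn infer_source(index_name: &str) -> Option<&str> {
    // Split at the LAST "_v" so sources containing "_v" still parse.
    let (source, version) = index_name.rsplit_once("_v")?;
    let version_ok = !version.is_empty() && version.bytes().all(|b| b.is_ascii_digit());
    if source.is_empty() || !version_ok {
        return None;
    }
    Some(source)
}
```

The convention is only a fallback: the same commit notes that `resumes_100k_v2` has source `candidates`, so an explicitly registered source has to take precedence over the inferred one.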
root	d87f2ccac6	Phase E: Soft deletes (tombstones) for compliance-grade row deletion Implements GDPR/CCPA-compatible row-level deletion without rewriting the underlying Parquet. Tombstone markers live beside each dataset and are applied at query time via a DataFusion view that excludes the deleted row_key_values. Schema (shared::types): - Tombstone { dataset, row_key_column, row_key_value, deleted_at, actor, reason } - All tombstones for a dataset must share one row_key_column — enforced at write so the query-time filter remains a single WHERE NOT IN (...) clause Storage (catalogd::tombstones): - Per-dataset AppendLog at _catalog/tombstones/{dataset}/ - flush_threshold=1 + explicit flush after every append — tombstones are high-value, low-frequency; durability on return is the contract - Reuses storaged::append_log infra so compaction is already wired (POST .../tombstones/compact will work once we expose it) Catalog (catalogd::registry): - add_tombstone validates dataset exists + key column compatibility - list_tombstones for the GET endpoint - TombstoneStore exposed via Registry::tombstones() for queryd HTTP (catalogd::service): - POST /catalog/datasets/by-name/{name}/tombstone { row_key_column, row_key_values[], actor, reason } Returns rows_tombstoned count + per-value failure list (207 on partial success). - GET same path lists active tombstones with full audit info. Query layer (queryd::context): - Snapshot tombstones-by-dataset before registering tables - Tombstoned tables: raw goes to "__raw__{name}", public "{name}" becomes DataFusion view with SELECT * FROM "__raw__{name}" WHERE CAST(col AS VARCHAR) NOT IN (...) 
- CAST AS VARCHAR handles both string and integer key columns - Untombstoned tables register as before — zero overhead End-to-end on candidates (100K rows): - Pick CAND-000001/2/3 (Linda/Charles/Kimberly) - POST tombstone -> rows_tombstoned: 3 - COUNT(*) drops 100000 -> 99997 - WHERE candidate_id IN (those 3) -> 0 rows - candidates_safe view transitively excludes them (Linda+Denver: __raw__candidates=159, candidates_safe=158) - Restart: COUNT still 99997, 3 tombstones reload from disk Reversibility: tombstones are reversible deletes, not destruction. Power users can still query "__raw__{name}" to see deleted rows. Phase 13 access control is what stops a non-admin from accessing __raw__ tables. Limits / follow-up: - Physical compaction not yet integrated — Phase 8's compact_files doesn't read tombstones during merge. Tombstoned rows are still on disk until that integration ships. - Phase 9 journald event emission for tombstones not wired — tombstone records carry their own actor+reason+timestamp so the audit trail is intact, but cross-referencing with the mutation event log would help compliance reporting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 09:40:48 -05:00
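
The query-time view described in the commit above is a single `WHERE ... NOT IN (...)` over the raw table. A minimal sketch of generating that SQL follows. The function name and the quote-escaping rule are assumptions; the real `queryd::context` code may build the view differently.

```rust
/// Hypothetical sketch: build the SELECT behind the public "{name}" view,
/// hiding tombstoned keys from the "__raw__{name}" table.
fn tombstone_view_sql(name: &str, key_col: &str, tombstoned: &[&str]) -> String {
    // Quote each key as a SQL string literal, doubling embedded single quotes.
    let list = tombstoned
        .iter()
        .map(|v| format!("'{}'", v.replace('\'', "''")))
        .collect::<Vec<_>>()
        .join(", ");
    // CAST to VARCHAR so string and integer key columns share one code path.
    format!("SELECT * FROM \"__raw__{name}\" WHERE CAST({key_col} AS VARCHAR) NOT IN ({list})")
}
```

Requiring one `row_key_column` per dataset, as the commit does at write time, is what keeps this a single clause rather than a conjunction of per-column filters.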
root	09fd446c8d	Phase D: AI-safe views — capability-surface projections over base data Implements the llms3.com "AI-safe views" pattern: a named projection that exposes only whitelisted columns, with optional row filter and per-column redactions. AI agents (or Phase 13 roles) bind to the view; they can never accidentally see PII even if they write raw SQL. Schema (shared::types): - AiView { name, base_dataset, columns: Vec<String>, row_filter, column_redactions: HashMap<String, Redaction>, ... } - Redaction enum: Null | Hash | Mask { keep_prefix, keep_suffix } Catalog (catalogd::registry): - put_view validates base dataset exists + columns non-empty - Persists JSON at _catalog/views/{name}.json (sanitized name) - rebuild() loads views alongside dataset manifests on startup Query layer (queryd::context): - build_context registers every AiView as a DataFusion view object - Constructed SELECT applies whitelist projection, WHERE filter, and redaction expressions per column - Mask: substr(prefix) + repeat('*', mid_len) + substr(suffix) - Hash: digest(value, 'sha256') - Null: CAST(NULL AS VARCHAR) AS col - DataFusion handles JOINs/aggregates over the view natively — it's a real view, not a query rewrite HTTP (catalogd::service): - POST /catalog/views (create) - GET /catalog/views (list) - GET /catalog/views/{name} (full def) - DELETE /catalog/views/{name} End-to-end test on candidates (100K rows, 15 columns): candidates_safe view: columns: candidate_id, first_name, city, state, vertical, skills, years_experience, status row_filter: status != 'blocked' redaction: candidate_id mask(prefix=3, suffix=2) SELECT * FROM candidates_safe LIMIT 5 -> 8 columns only, candidate_id shown as "CAN******01" (PII fields email/phone/last_name absent from result) SELECT email FROM candidates_safe -> fails (column not in projection) SELECT email FROM candidates -> succeeds (raw table still accessible by name — Phase 13 access control is the gate, not the view itself) Survives restart — view 
definitions reload from object storage. Limits / not in MVP: - View CANNOT shadow base table by name (DataFusion treats them as separate identifiers; access control must restrict raw-table access) - row_filter is treated as trusted SQL — operators must validate before persisting; only authenticated admin path should call put_view - Redaction expressions assume column is castable to VARCHAR; numeric redactions could be misleading (a Hash on Int64 returns a hex string that won't equi-join with another hash on the same value type) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 09:16:44 -05:00
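
The `Mask { keep_prefix, keep_suffix }` redaction in the commit above is applied as a SQL expression inside the view. The same arithmetic in plain Rust looks like the sketch below. It is only an illustration, and fully masking values too short for the requested prefix and suffix is an assumption, not documented behaviour.

```rust
/// Hypothetical sketch of the Mask redaction: keep `keep_prefix` leading and
/// `keep_suffix` trailing characters, replace everything between with '*'.
fn mask(value: &str, keep_prefix: usize, keep_suffix: usize) -> String {
    let chars: Vec<char> = value.chars().collect();
    let n = chars.len();
    // Assumption: values too short to keep both ends are masked entirely.
    if keep_prefix + keep_suffix >= n {
        return "*".repeat(n);
    }
    let prefix: String = chars[..keep_prefix].iter().collect();
    let suffix: String = chars[n - keep_suffix..].iter().collect();
    let mid_len = n - keep_prefix - keep_suffix;
    format!("{}{}{}", prefix, "*".repeat(mid_len), suffix)
}
```

With `prefix=3, suffix=2`, the 11-character id `CAND-000001` keeps `CAN` and `01` and masks the 6 characters between them, which matches the `CAN******01` shown in the end-to-end test.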
root	24f1249a62	Federation layer 2: header routing + cross-bucket SQL Three pieces of the multi-bucket federation made real: 1. Catalog migration (POST /catalog/migrate-buckets) - One-shot normalizer for ObjectRef.bucket field - Empty -> "primary"; legacy "data"/"local" -> "primary" - Idempotent; re-running on canonical state is no-op - Ran on existing catalog: 12 refs renamed from "data", 2 already "primary", all 14 now canonical 2. X-Lakehouse-Bucket header middleware on ingest - resolve_bucket() helper extracts header, returns (bucket_name, store) or 404 with valid bucket list - ingest_file and ingest_db_stream now route writes per-request - Defaults to "primary" when header absent - pipeline::ingest_file_to_bucket records the actual bucket on the ObjectRef so catalog stays the source of truth for "where does this data live" - Verified: ingest with X-Lakehouse-Bucket: testing lands in data/_testing/, ingest without header lands in data/, bad header returns 404 with hint 3. queryd registers every bucket with DataFusion - QueryEngine now holds Arc<BucketRegistry> instead of single store - build_context iterates all buckets, registers each as a separate ObjectStore under URL scheme "lakehouse-{bucket}://" - ListingTable URLs include the per-object bucket scheme so DataFusion routes scans automatically based on ObjectRef.bucket - Profile bucket names like "profile:user" sanitized to "lakehouse-profile-user" since URL host segments can't contain ":" - Tolerant of duplicate manifest entries (pre-existing pipeline::ingest_file behavior creates a fresh dataset id per ingest); duplicates skipped with debug log - Backward compat: legacy "lakehouse://data/" URL still registered pointing at primary Success gate: cross-bucket CROSS JOIN SELECT p.name, p.role, a.species FROM people_test p (bucket: testing) CROSS JOIN animals a (bucket: primary) LIMIT 5 returns rows correctly. DataFusion routed each scan to its bucket's ObjectStore based on the URL scheme. 
No regressions: SELECT COUNT(*) FROM candidates still returns 100000 from the primary bucket. Deferred to Phase 17: - POST /profile/{user}/activate (HNSW hot-load on profile switch) - vectord storage paths becoming bucket-scoped (trial journals, eval sets per-profile) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 08:52:32 -05:00
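
Two small string rules from the commit above are easy to pin down in code: the one-shot bucket-name normalizer and the mapping from bucket name to DataFusion URL scheme. The sketch below uses hypothetical function names. The normalization cases come from the commit message, while the exact set of characters replaced in the scheme is an assumption (the commit only states that ":" cannot appear in a URL host segment).

```rust
/// Hypothetical sketch of the migrate-buckets rule:
/// empty and legacy names ("data", "local") become "primary".
/// Idempotent: canonical names pass through unchanged.
fn normalize_bucket(bucket: &str) -> &str {
    match bucket {
        "" | "data" | "local" => "primary",
        other => other,
    }
}

/// Hypothetical sketch: derive the per-bucket URL scheme `lakehouse-{bucket}`.
/// Assumption: anything other than ASCII alphanumerics and '-' becomes '-'.
fn bucket_scheme(bucket: &str) -> String {
    let safe: String = bucket
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '-' { c } else { '-' })
        .collect();
    format!("lakehouse-{safe}")
}
```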
root	97a376482c	Phase C: Decoupled embedding refresh Implements the llms3.com-inspired pattern: embeddings refresh asynchronously, decoupled from transactional row writes. New rows arrive, ingest marks the vector index stale, a later refresh embeds only the delta (doc_ids not already in the index). Schema additions (DatasetManifest): - last_embedded_at: Option<DateTime> - when the index was last refreshed - embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh - embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled Ingest paths (pipeline::ingest_file + pg_stream) call registry.mark_embeddings_stale after writing. No-op if the dataset has never been embedded — stale semantics only kick in once last_embedded_at is set. Refresh pipeline (vectord::refresh::refresh_index): - Reads the dataset Parquet, extracts (doc_id, text) pairs - Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas) - Loads existing embeddings via EmbeddingCache (empty on first-time build) - Filters to rows whose doc_id is NOT in the existing set - Chunks (chunker::chunk_column), embeds via Ollama (batches of 32), writes combined index, clears stale flag Endpoints: - POST /vectors/refresh/{dataset_name} - body {index_name, id_column, text_column, chunk_size?, overlap?} - GET /vectors/stale - lists datasets whose embedding_stale_since is set End-to-end verified on threat_intel (knowledge_base.threat_intel): - Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s, last_embedded_at set - Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check) - Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set - /vectors/stale surfaces threat_intel with timestamps + policy - Delta refresh: 34 new docs embedded in 970ms (6x faster than full re-embed); stale_cleared = true Not in MVP scope: - UPDATE semantics (same doc_id, different content) - would need per-row content hashing - OnAppend policy auto-trigger - 
just declares intent; actual scheduler deferred - Scheduler runtime - the Scheduled(cron) variant declares the intent so operators can see which datasets expect what, but the cron itself is separate Per ADR-019: when a profile switches to vector_backend=Lance, this refresh path benefits — Lance's native append replaces our "read all + rewrite" Parquet rebuild pattern. Current MVP works well enough at ~500-5K rows to validate the architecture; Lance unblocks the 5M+ case. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 03:00:43 -05:00
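
The delta check at the heart of the refresh pipeline above (embed only rows whose doc_id is not already in the index) can be sketched in a few lines. This is a simplified illustration with hypothetical names. It treats each row as one chunk, which matches the 20 rows -> 20 chunks case in the commit but not long documents in general.

```rust
use std::collections::HashSet;

/// Keep only (doc_id, text) rows whose doc_id is not already embedded.
fn delta_rows<'a>(
    rows: &'a [(String, String)],
    existing_ids: &HashSet<String>,
) -> Vec<&'a (String, String)> {
    rows.iter()
        .filter(|(doc_id, _)| !existing_ids.contains(doc_id))
        .collect()
}

/// Number of embedding requests for `n` chunks at a given batch size.
/// Assumes `batch_size > 0`.
fn batch_count(n: usize, batch_size: usize) -> usize {
    (n + batch_size - 1) / batch_size
}
```

For the re-ingest in the commit (54 rows, 20 already embedded) this leaves 34 new docs, which at a batch size of 32 means 2 embedding requests. An empty delta needs none, which fits the near-instant idempotent second refresh.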
root	dbe00d018f	Federation foundation + HNSW trial system + Postgres streaming + PRD reframe Four shipped features and a PRD realignment, all measured end-to-end: HNSW trial system (Phase 15 horizon item → complete) - vectord: EmbeddingCache, harness (eval sets + brute-force ground truth), TrialJournal, parameterized HnswConfig on build_index_with_config - /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best, /hnsw/evals/{name}/autogen, /hnsw/cache/stats - Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us at 100% recall@10. ec=80 es=30 locked as HnswConfig::default() - Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s, 80/30 = 1.00 recall in 230s Catalog manifest repair - catalogd: resync_from_parquet reads parquet footers to restore row_count and columns on drifted manifests - POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing - All 7 staffing tables recovered to PRD-matching 2,469,278 rows Federation foundation (ADR-017) - shared::secrets: SecretsProvider trait + FileSecretsProvider (reads /etc/lakehouse/secrets.toml, enforces 0600 perms) - storaged::registry::BucketRegistry — multi-bucket resolution with rescue_bucket read fallback and reachability probing - storaged::error_journal — bucket op failures visible in one HTTP call - storaged::append_log — write-once batched append pattern (fixes the RMW anti-pattern llms3.com calls out; errors and trial journals both use it) - /storage/buckets, /storage/errors, /storage/bucket-health, /storage/errors/{flush,compact} - Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with X-Lakehouse-Rescue-Used observability headers on fallback Postgres streaming ingest - ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination into ArrowWriter, lineage redacts password - POST /ingest/db — verified against live knowledge_base.team_runs (586 rows × 13 cols, 6 batches, 196ms end-to-end) PRD realignment (2026-04-16) - Dual use case: 
staffing analytics + local LLM knowledge substrate - Removed "multi-tenancy (single-owner system)" from non-goals - Added invariants 8-11: indexes hot-swappable, per-reader profiles, trials-as-data, operational failures findable in one HTTP call - New phases 16 (hot-swap generations), 17 (model profiles + dataset bindings), 18 (Lance vs Parquet+sidecar evaluation) - Known ceilings table documents the 5M vector wall and escape hatches - ADR-017 (federation), ADR-018 (append-log pattern) added - EXECUTION_PLAN.md sequences phases B-E with success gates and decision rules Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 01:50:05 -05:00
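
The HNSW trial numbers in the commit above are reported as recall@10 against brute-force ground truth. A minimal sketch of that metric follows. The function name and the use of `u32` ids are assumptions, and the project's harness may normalise differently.

```rust
use std::collections::HashSet;

/// Hypothetical sketch of recall@k: the fraction of the brute-force top-k
/// ids that also appear in the approximate (HNSW) top-k.
fn recall_at_k(ground_truth: &[u32], approx: &[u32], k: usize) -> f64 {
    let truth: HashSet<&u32> = ground_truth.iter().take(k).collect();
    if truth.is_empty() {
        return 0.0;
    }
    let hits = approx.iter().take(k).filter(|id| truth.contains(id)).count();
    hits as f64 / truth.len() as f64
}
```

Averaged over every query in an eval set, a value of 1.00 corresponds to the "100% recall@10" reported for ec=80 es=30, and 0.96 to the faster-building 20/30 configuration.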
root	9e53caaec3	Phase 10: Rich catalog v2 — metadata as product - DatasetManifest expanded: description, owner, sensitivity, columns, lineage, freshness contract, tags, row_count - All new fields use #[serde(default)] for backward compatibility - PII auto-detection: scans column names for email, phone, SSN, salary, address, DOB, medical terms — flags as PII/PHI/Financial - Column-level metadata: name, type, sensitivity, is_pii flag - Lineage tracking: source_system, source_file, ingest_job, timestamp - Ingest pipeline auto-populates: PII scan, column meta, lineage, row count - PATCH /catalog/datasets/by-name/{name}/metadata — update metadata - Catalog responses now include all rich fields - 25 unit tests passing (5 new PII detection tests) Per ADR-013: datasets without metadata become mystery files. This makes every ingested file self-describing from day one. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 09:15:09 -05:00
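
The name-based PII auto-detection in the commit above can be pictured as a keyword scan over column names. The sketch below is illustrative only: the enum, the keyword lists, and the mapping of keywords to the PII/PHI/Financial categories are assumptions, not the rules the ingest pipeline actually ships.

```rust
/// Hypothetical sensitivity categories, following the commit's PII/PHI/Financial split.
#[derive(Debug, PartialEq)]
enum Sensitivity {
    Pii,
    Phi,
    Financial,
    Public,
}

/// Hypothetical sketch: classify a column by substrings of its lowercased name.
/// Substring matching is deliberately crude and can over-flag.
fn classify_column(name: &str) -> Sensitivity {
    let lower = name.to_ascii_lowercase();
    let has_any = |needles: &[&str]| needles.iter().any(|n| lower.contains(*n));

    if has_any(&["medical", "diagnosis"]) {
        Sensitivity::Phi
    } else if has_any(&["salary"]) {
        Sensitivity::Financial
    } else if has_any(&["email", "phone", "ssn", "address", "dob"]) {
        Sensitivity::Pii
    } else {
        Sensitivity::Public
    }
}
```

A scan like this only sees column names, so a PII column with an uninformative name would pass unflagged.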
root	01373c0e45	Phase 5: hardening — gRPC, observability, auth, config - proto: lakehouse.proto with CatalogService, QueryService, StorageService, AiService - proto crate: tonic-build codegen from proto definitions - catalogd: gRPC CatalogService implementation - gateway: dual HTTP (:3100) + gRPC (:3101) servers - gateway: OpenTelemetry tracing with stdout exporter - gateway: API key auth middleware (toggleable) - shared: TOML config system with typed structs and defaults - lakehouse.toml config file - ADR-006 and ADR-007 documented Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 06:37:07 -05:00
root	655b6c0b37	Phase 1: storage + catalog layer - storaged: object_store backend (LocalFileSystem), PUT/GET/DELETE/LIST endpoints - shared: arrow_helpers with Parquet roundtrip + schema fingerprinting (2 tests) - catalogd: in-memory registry with write-ahead manifest persistence to object storage - catalogd: POST/GET /datasets, GET /datasets/by-name/{name} - gateway: wires storaged + catalogd with shared object_store state - Phase tracker updated: Phase 0 + Phase 1 gates passed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 05:15:27 -05:00
root	a52ca841c6	Phase 0: bootstrap Rust workspace - Cargo workspace with 6 crates: shared, storaged, catalogd, queryd, aibridge, gateway - shared: types (DatasetId, ObjectRef, SchemaFingerprint, DatasetManifest) + error enum - gateway: Axum HTTP entrypoint with nested service routers + tracing - All services expose /health stubs - justfile with build/test/run recipes - PRD, phase tracker, and ADR docs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 04:59:05 -05:00

25 Commits