lakehouse

Author	SHA1	Message	Date
root	bce6dfd1ee	catalogd: Step 3 - backfill_subjects binary (BIPA-defensible defaults)	2026-05-03 03:22:54 -05:00

    Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 3. Reads a parquet source, creates one SubjectManifest per row with the spec-defined safe defaults, persists via Registry::put_subject().

    Defaults baked in (per spec §2 + §5 Step 5):
    - vertical = unknown (HIPAA fail-closed)
    - consent.general_pii = pending_backfill_review (NOT inferred_existing; BIPA defense)
    - consent.biometric = never_collected (no biometric data backfilled)
    - retention.general_pii_until = now + 4 years
    - retention.policy = "4_year_default"

    Conservative ergonomics:
    - --limit 1000 by default. --all to do the full source.
    - --dry-run for parse + count + sample without writes.
    - --concurrency 32 (bounded via tokio::sync::Semaphore).
    - Idempotent: skips subjects that already exist in catalog.
    - Progress reports every ~5% (or 5K rows, whichever smaller).

    Live verification on workers_500k.parquet:
    - --limit 100 dry-run: parsed 100 rows, sampled WORKER-1..5, 0 writes ✓
    - --limit 100 commit: 100 inserted, 0 failed, 100 files in data/_catalog/subjects/ ✓
    - --limit 100 re-run: 0 inserted, 100 skipped (idempotent) ✓

    Sample manifest (data/_catalog/subjects/WORKER-1.json):
    {
      "schema": "subject_manifest.v1",
      "candidate_id": "WORKER-1",
      "status": "active",
      "vertical": "unknown",
      "consent": {
        "general_pii": {"status": "pending_backfill_review", ...},
        "biometric": {"status": "never_collected", ...}
      },
      "retention": {"general_pii_until": "2030-05-02T...", "policy": "4_year_default"},
      "datasets": [{"name": "workers_500k", "key_column": "worker_id", "key_value": "1"}]
    }

    NOT in this commit (future steps):
    - Step 4: Wire gateway tool registry to write audit rows on every candidate_id returned (uses SubjectAuditWriter from Step 2)
    - Step 5: Wire validator WorkerLookup similarly
    - Step 6: /audit/subject/{id} HTTP endpoint
    - Step 7: Daily retention sweep
    - Backfill the full 500K (operator decision: --all when ready; note: 500K JSON files in one dir will slow startup load, may want SQLite/single-file backend before that scale)

    Operator note: backfill is run-once. To extend to candidates table, re-run with --dataset candidates --key-column candidate_id (no prefix since candidate_id is already the canonical token there).

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
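The idempotent backfill behaviour described in commit bce6dfd1ee can be sketched as follows. This is a minimal illustration, not the actual backfill_subjects binary: the struct is a simplified stand-in for the spec's SubjectManifest, an in-memory HashMap stands in for the catalog behind Registry::put_subject(), and the names `with_backfill_defaults`, `progress_interval`, `backfill`, and `BackfillReport` are hypothetical. The `WORKER-` prefix is inferred from the sample manifest.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the spec's SubjectManifest.
#[derive(Debug, Clone, PartialEq)]
struct SubjectManifest {
    candidate_id: String,
    vertical: &'static str,
    general_pii_consent: &'static str,
    biometric_consent: &'static str,
    retention_policy: &'static str,
}

impl SubjectManifest {
    /// Safe defaults as listed in the commit message.
    fn with_backfill_defaults(candidate_id: &str) -> Self {
        Self {
            candidate_id: candidate_id.to_string(),
            vertical: "unknown",
            general_pii_consent: "pending_backfill_review",
            biometric_consent: "never_collected",
            retention_policy: "4_year_default",
        }
    }
}

/// Report roughly every 5% of the rows or every 5,000 rows,
/// whichever is smaller (never less than 1).
fn progress_interval(total_rows: usize) -> usize {
    (total_rows / 20).min(5_000).max(1)
}

#[derive(Debug, Default, PartialEq)]
struct BackfillReport {
    inserted: usize,
    skipped: usize,
}

/// Insert one manifest per worker id, skipping subjects already present.
/// With `dry_run` set, nothing is written to the catalog.
fn backfill(
    catalog: &mut HashMap<String, SubjectManifest>,
    worker_ids: &[u64],
    dry_run: bool,
) -> BackfillReport {
    let mut report = BackfillReport::default();
    for id in worker_ids {
        let candidate_id = format!("WORKER-{id}");
        if catalog.contains_key(&candidate_id) {
            report.skipped += 1;
        } else if !dry_run {
            let manifest = SubjectManifest::with_backfill_defaults(&candidate_id);
            catalog.insert(candidate_id, manifest);
            report.inserted += 1;
        }
    }
    report
}

fn main() {
    let mut catalog = HashMap::new();
    let ids: Vec<u64> = (1..=100).collect();
    let first = backfill(&mut catalog, &ids, false);
    let second = backfill(&mut catalog, &ids, false);
    println!("first run: {first:?}");
    println!("re-run:    {second:?}");
    let m = &catalog["WORKER-1"];
    println!(
        "{}: {} / {} / {} / {}",
        m.candidate_id, m.vertical, m.general_pii_consent, m.biometric_consent, m.retention_policy
    );
    println!("progress interval for 500000 rows: {}", progress_interval(500_000));
}
```

Checking for an existing key before writing is what makes a re-run safe: the second pass over the same ids inserts nothing and counts every row as skipped, matching the shape of the "--limit 100 re-run" verification above.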
root	d16131bcab	catalogd: Step 2 - SubjectAuditWriter with HMAC chain	2026-05-03 03:19:18 -05:00

    Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md Step 2. Per-subject append-only audit JSONL with HMAC-SHA256 chain. Local-first: no Vault, no external anchor (those are v2 if SOC2 Type II becomes contract-required; v1 deliberately stays small).

    shared/types.rs additions:
    - AuditAccessor: kind, daemon, purpose, trace_id
    - SubjectAuditRow: schema/ts/candidate_id/accessor/fields_accessed/result/prev_chain_hash/row_hmac

    crates/catalogd/src/subject_audit.rs (NEW):
    - SubjectAuditWriter: holds signing key + per-subject latest-hash cache
    - from_key_file(): loads key from sealed file, requires ≥32 bytes
    - with_inline_key(): for tests + bring-up
    - append(): computes HMAC chain link, persists JSONL row, returns new chain root (caller mirrors to SubjectManifest.audit_log_chain_root)
    - verify_chain(): full re-verification of a subject's audit log, catches both prev_hash drift AND row-level HMAC tampering
    - scan_latest_hash(): cold-start path, finds prev_hash from JSONL tail
    - append_line(): read-modify-write pattern (object stores have no native append; same shape as the rest of catalogd's persistence)

    Crypto: HMAC-SHA256 via the standard `hmac` crate (added to workspace + catalogd deps; not implementing crypto by hand). Output is lowercase hex matching the rest of the codebase's SHA-256 conventions.

    Security choices:
    - NO Debug impl on SubjectAuditWriter: auto-deriving Debug would risk leaking the signing key into log lines. Tests work around this by matching on Result instead of using .unwrap_err().
    - Key min length 32 bytes (the SHA-256 output length, per RFC 2104 key-length guidance; the block size is 64 bytes).
    - Failures are NOT swallowed: Result returned, caller decides whether to log + continue (per spec §3.2 the gateway tool registry SHOULD log + continue rather than block reads).

    Tests (7/7 passing):
    - first_append_uses_genesis_prev_hash
    - chain_links_each_append (3-row chain verifies)
    - separate_subjects_have_independent_chains (per-subject isolation)
    - tamper_detected_on_verify (mutation in middle of chain breaks verify)
    - cold_writer_picks_up_existing_chain (process restart preserves chain)
    - empty_candidate_id_rejected
    - key_too_short_rejected_via_file

    NOT in this commit (future steps):
    - Step 3: Backfill ETL from workers_500k.parquet (next per J)
    - Step 4: Wire gateway tool registry to call append() on every candidate_id returned by search_candidates / get_candidate
    - Step 5: Wire validator WorkerLookup similarly
    - Step 6: /audit/subject/{id} HTTP endpoint
    - Step 7: Daily retention sweep
    - Mirroring chain root to SubjectManifest.audit_log_chain_root (separate concern; do at the call site)

    cargo check --workspace clean. cargo test -p catalogd subject_audit 7/7 PASS.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
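The chain linkage behind append() and verify_chain() in commit d16131bcab can be illustrated with a small sketch. To keep it runnable with the standard library alone, it swaps in `DefaultHasher` as a keyed stand-in for HMAC-SHA256; that hasher is NOT cryptographic, and real code would use the `hmac` crate as the commit does. The `AuditRow` struct, the `GENESIS` value, `toy_mac`, and the choice to use the previous row's MAC as `prev_chain_hash` are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical genesis value used as prev_chain_hash for the first row.
const GENESIS: &str = "0000000000000000";

/// Simplified stand-in for SubjectAuditRow.
#[derive(Debug, Clone)]
struct AuditRow {
    payload: String,
    prev_chain_hash: String,
    row_mac: String,
}

/// Stand-in keyed hash over (key, prev hash, payload).
/// NOT cryptographic: a placeholder for HMAC-SHA256 only.
fn toy_mac(key: &[u8], prev: &str, payload: &str) -> String {
    let mut h = DefaultHasher::new();
    h.write(key);
    h.write(prev.as_bytes());
    h.write(&[0]); // separator between prev hash and payload
    h.write(payload.as_bytes());
    format!("{:016x}", h.finish())
}

/// Append a row linked to the previous row's MAC; returns the new chain root.
fn append(log: &mut Vec<AuditRow>, key: &[u8], payload: &str) -> String {
    let prev = log
        .last()
        .map(|r| r.row_mac.clone())
        .unwrap_or_else(|| GENESIS.to_string());
    let row_mac = toy_mac(key, &prev, payload);
    log.push(AuditRow {
        payload: payload.to_string(),
        prev_chain_hash: prev,
        row_mac: row_mac.clone(),
    });
    row_mac
}

/// Re-verify the whole log. Returns the index of the first bad row, whether
/// the fault is prev-hash drift or a row whose MAC no longer matches.
fn verify_chain(log: &[AuditRow], key: &[u8]) -> Result<(), usize> {
    let mut expected_prev = GENESIS.to_string();
    for (i, row) in log.iter().enumerate() {
        if row.prev_chain_hash != expected_prev {
            return Err(i);
        }
        if toy_mac(key, &row.prev_chain_hash, &row.payload) != row.row_mac {
            return Err(i);
        }
        expected_prev = row.row_mac.clone();
    }
    Ok(())
}

fn main() {
    let key = b"0123456789abcdef0123456789abcdef"; // 32-byte demo key
    let mut log = Vec::new();
    for payload in ["read:name", "read:email", "read:phone"] {
        append(&mut log, key, payload);
    }
    println!("intact chain:   {:?}", verify_chain(&log, key));
    log[1].payload = "read:ssn".to_string(); // tamper with the middle row
    println!("tampered chain: {:?}", verify_chain(&log, key));
}
```

Because each row's MAC covers the previous row's MAC, changing any row invalidates that row's own check, and replacing its MAC to hide the change would break the link to the next row.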
root	dbe00d018f	Federation foundation + HNSW trial system + Postgres streaming + PRD reframe	2026-04-16 01:50:05 -05:00

    Four shipped features and a PRD realignment, all measured end-to-end:

    HNSW trial system (Phase 15 horizon item → complete)
    - vectord: EmbeddingCache, harness (eval sets + brute-force ground truth), TrialJournal, parameterized HnswConfig on build_index_with_config
    - /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best, /hnsw/evals/{name}/autogen, /hnsw/cache/stats
    - Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
    - Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s, 80/30 = 1.00 recall in 230s

    Catalog manifest repair
    - catalogd: resync_from_parquet reads parquet footers to restore row_count and columns on drifted manifests
    - POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
    - All 7 staffing tables recovered to PRD-matching 2,469,278 rows

    Federation foundation (ADR-017)
    - shared::secrets: SecretsProvider trait + FileSecretsProvider (reads /etc/lakehouse/secrets.toml, enforces 0600 perms)
    - storaged::registry::BucketRegistry: multi-bucket resolution with rescue_bucket read fallback and reachability probing
    - storaged::error_journal: bucket op failures visible in one HTTP call
    - storaged::append_log: write-once batched append pattern (fixes the RMW anti-pattern llms3.com calls out; errors and trial journals both use it)
    - /storage/buckets, /storage/errors, /storage/bucket-health, /storage/errors/{flush,compact}
    - Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with X-Lakehouse-Rescue-Used observability headers on fallback

    Postgres streaming ingest
    - ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination into ArrowWriter, lineage redacts password
    - POST /ingest/db: verified against live knowledge_base.team_runs (586 rows × 13 cols, 6 batches, 196ms end-to-end)

    PRD realignment (2026-04-16)
    - Dual use case: staffing analytics + local LLM knowledge substrate
    - Removed "multi-tenancy (single-owner system)" from non-goals
    - Added invariants 8-11: indexes hot-swappable, per-reader profiles, trials-as-data, operational failures findable in one HTTP call
    - New phases 16 (hot-swap generations), 17 (model profiles + dataset bindings), 18 (Lance vs Parquet+sidecar evaluation)
    - Known ceilings table documents the 5M vector wall and escape hatches
    - ADR-017 (federation), ADR-018 (append-log pattern) added
    - EXECUTION_PLAN.md sequences phases B-E with success gates and decision rules

    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
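The batched ORDER BY + LIMIT/OFFSET pagination mentioned for ingestd::pg_stream in commit dbe00d018f might look roughly like the loop below. This is a sketch under assumptions rather than the crate's code: `page_query` and `stream_batches` are hypothetical names, the `id` ordering column and the batch size of 100 are guesses (100 is merely one size consistent with 586 rows arriving in 6 batches), and the database is simulated by a closure that returns a row count.

```rust
/// Build the SQL for one page. A stable ORDER BY is what makes
/// LIMIT/OFFSET pages non-overlapping between requests.
fn page_query(table: &str, order_by: &str, batch_size: usize, offset: usize) -> String {
    format!("SELECT * FROM {table} ORDER BY {order_by} LIMIT {batch_size} OFFSET {offset}")
}

/// Pull pages until the source returns an empty or short page.
/// `fetch(limit, offset)` returns how many rows that page contained;
/// the result lists the size of every non-empty batch.
fn stream_batches<F>(batch_size: usize, mut fetch: F) -> Vec<usize>
where
    F: FnMut(usize, usize) -> usize,
{
    let mut batch_sizes = Vec::new();
    let mut offset = 0;
    loop {
        let rows = fetch(batch_size, offset);
        if rows == 0 {
            break; // nothing left
        }
        batch_sizes.push(rows);
        offset += rows;
        if rows < batch_size {
            break; // short page: this was the last batch
        }
    }
    batch_sizes
}

/// Simulated table: returns the row count a real query would yield.
fn simulated_fetch(total_rows: usize) -> impl FnMut(usize, usize) -> usize {
    move |limit, offset| limit.min(total_rows.saturating_sub(offset))
}

fn main() {
    println!("{}", page_query("knowledge_base.team_runs", "id", 100, 200));
    let batches = stream_batches(100, simulated_fetch(586));
    println!("{} batches: {:?}", batches.len(), batches);
}
```

OFFSET pagination makes the server skip past all earlier rows on every page, so cost grows with table size; it is simple and adequate for small sources, while keyset pagination is a common alternative for large ones.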
root	01373c0e45	Phase 5: hardening - gRPC, observability, auth, config	2026-03-27 06:37:07 -05:00

    - proto: lakehouse.proto with CatalogService, QueryService, StorageService, AiService
    - proto crate: tonic-build codegen from proto definitions
    - catalogd: gRPC CatalogService implementation
    - gateway: dual HTTP (:3100) + gRPC (:3101) servers
    - gateway: OpenTelemetry tracing with stdout exporter
    - gateway: API key auth middleware (toggleable)
    - shared: TOML config system with typed structs and defaults
    - lakehouse.toml config file
    - ADR-006 and ADR-007 documented

    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
root	655b6c0b37	Phase 1: storage + catalog layer	2026-03-27 05:15:27 -05:00

    - storaged: object_store backend (LocalFileSystem), PUT/GET/DELETE/LIST endpoints
    - shared: arrow_helpers with Parquet roundtrip + schema fingerprinting (2 tests)
    - catalogd: in-memory registry with write-ahead manifest persistence to object storage
    - catalogd: POST/GET /datasets, GET /datasets/by-name/{name}
    - gateway: wires storaged + catalogd with shared object_store state
    - Phase tracker updated: Phase 0 + Phase 1 gates passed

    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
root	a52ca841c6	Phase 0: bootstrap Rust workspace	2026-03-27 04:59:05 -05:00

    - Cargo workspace with 6 crates: shared, storaged, catalogd, queryd, aibridge, gateway
    - shared: types (DatasetId, ObjectRef, SchemaFingerprint, DatasetManifest) + error enum
    - gateway: Axum HTTP entrypoint with nested service routers + tracing
    - All services expose /health stubs
    - justfile with build/test/run recipes
    - PRD, phase tracker, and ADR docs

    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

6 Commits