6 Commits

Author SHA1 Message Date
root
bce6dfd1ee catalogd: Step 3 — backfill_subjects binary (BIPA-defensible defaults)
Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 3.
Reads a parquet source, creates one SubjectManifest per row with the
spec-defined safe defaults, persists via Registry::put_subject().

Defaults baked in (per spec §2 + §5 Step 5):
  - vertical = unknown                     (HIPAA fail-closed)
  - consent.general_pii = pending_backfill_review  (NOT inferred_existing — BIPA defense)
  - consent.biometric  = never_collected   (no biometric data backfilled)
  - retention.general_pii_until = now + 4 years
  - retention.policy = "4_year_default"
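
The defaults above can be sketched as a single constructor. This is a minimal illustration, not the actual SubjectManifest type: the `BackfillDefaults` struct, its field names, and the reading of "4 years" as 4 × 365 days are all assumptions made for the example.

```rust
use std::time::{Duration, SystemTime};

// Hypothetical, flattened stand-in for the manifest fields listed above.
// The real SubjectManifest type is not shown in this commit message.
#[derive(Debug, Clone, PartialEq)]
struct BackfillDefaults {
    vertical: &'static str,
    general_pii_consent: &'static str,
    biometric_consent: &'static str,
    retention_policy: &'static str,
    general_pii_until: SystemTime,
}

// Assumption: "4 years" is approximated as 4 * 365 days.
const FOUR_YEARS: Duration = Duration::from_secs(4 * 365 * 24 * 60 * 60);

/// Build the safe defaults for one backfilled row, relative to `now`.
fn backfill_defaults(now: SystemTime) -> BackfillDefaults {
    BackfillDefaults {
        vertical: "unknown",
        general_pii_consent: "pending_backfill_review",
        biometric_consent: "never_collected",
        retention_policy: "4_year_default",
        general_pii_until: now + FOUR_YEARS,
    }
}

fn main() {
    let defaults = backfill_defaults(SystemTime::now());
    println!("{:?}", defaults);
}
```

Passing `now` in as a parameter keeps the constructor deterministic, which makes the retention date easy to test.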

Conservative ergonomics:
  - --limit 1000 by default. --all to do the full source.
  - --dry-run for parse + count + sample without writes.
  - --concurrency 32 (bounded via tokio::sync::Semaphore).
  - Idempotent: skips subjects that already exist in catalog.
  - Progress reports every ~5% (or 5K rows, whichever is smaller).
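
A rough sketch of how the row limit and the progress cadence described above could be computed. The function names and the exact rounding are assumptions for illustration; the binary's real flag handling is not shown here.

```rust
/// Rows between progress reports: about 5% of the total or 5,000 rows,
/// whichever is smaller, and never less than 1.
fn progress_interval(total_rows: usize) -> usize {
    (total_rows / 20).min(5_000).max(1)
}

/// Hypothetical mapping of `--limit N` / `--all` to the number of rows
/// that will actually be processed.
fn effective_limit(total_rows: usize, limit: usize, all: bool) -> usize {
    if all {
        total_rows
    } else {
        total_rows.min(limit)
    }
}

fn main() {
    let total = 500_000;
    let rows = effective_limit(total, 1_000, false);
    println!("processing {} rows, reporting every {} rows", rows, progress_interval(rows));
}
```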

Live verification on workers_500k.parquet:
  --limit 100 dry-run:  parsed 100 rows, sampled WORKER-1..5, 0 writes ✓
  --limit 100 commit:   100 inserted, 0 failed, 100 files in
                        data/_catalog/subjects/ ✓
  --limit 100 re-run:   0 inserted, 100 skipped (idempotent) ✓
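
The idempotent behaviour verified above (100 inserted, then 0 inserted and 100 skipped on re-run) can be illustrated with an in-memory stand-in for the catalog. The `HashSet` registry and the `BackfillStats` type are assumptions for this sketch; the real code persists through Registry::put_subject().

```rust
use std::collections::HashSet;

#[derive(Debug, Default, PartialEq)]
struct BackfillStats {
    inserted: usize,
    skipped: usize,
}

/// Insert every id that is not already present; count the rest as skipped.
/// `registry` is a hypothetical in-memory stand-in for the subject catalog.
fn backfill(registry: &mut HashSet<String>, ids: &[String]) -> BackfillStats {
    let mut stats = BackfillStats::default();
    for id in ids {
        if registry.contains(id) {
            stats.skipped += 1;
        } else {
            registry.insert(id.clone());
            stats.inserted += 1;
        }
    }
    stats
}

fn main() {
    let ids: Vec<String> = (1..=100).map(|i| format!("WORKER-{}", i)).collect();
    let mut registry = HashSet::new();
    println!("first run:  {:?}", backfill(&mut registry, &ids));
    println!("second run: {:?}", backfill(&mut registry, &ids));
}
```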

Sample manifest (data/_catalog/subjects/WORKER-1.json):
  {
    "schema": "subject_manifest.v1",
    "candidate_id": "WORKER-1",
    "status": "active",
    "vertical": "unknown",
    "consent": {
      "general_pii": {"status": "pending_backfill_review", ...},
      "biometric":   {"status": "never_collected",         ...}
    },
    "retention": {"general_pii_until": "2030-05-02T...", "policy": "4_year_default"},
    "datasets": [{"name": "workers_500k", "key_column": "worker_id", "key_value": "1"}]
  }

NOT in this commit (future steps):
  - Step 4: Wire gateway tool registry to write audit rows on every
    candidate_id returned (uses SubjectAuditWriter from Step 2)
  - Step 5: Wire validator WorkerLookup similarly
  - Step 6: /audit/subject/{id} HTTP endpoint
  - Step 7: Daily retention sweep
  - Backfill the full 500K (operator decision: --all when ready;
    note: 500K JSON files in one dir will slow startup load — may
    want SQLite/single-file backend before that scale)

Operator note: the backfill is a one-time run per source. To extend it
to the candidates table, run it again with --dataset candidates
--key-column candidate_id (no prefix, since candidate_id is already the
canonical token there).
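
To make the prefix rule in the operator note concrete, here is a small sketch of deriving the subject token from a dataset key. The helper name, its signature, and the "CAND-42" value are hypothetical; only the "WORKER-" + worker_id shape comes from the sample manifest.

```rust
/// Hypothetical helper: build the canonical subject token from a key value.
/// `prefix` is Some("WORKER-") for workers and None when the key column
/// already holds the canonical token.
fn candidate_id(prefix: Option<&str>, key_value: &str) -> String {
    match prefix {
        Some(p) => format!("{}{}", p, key_value),
        None => key_value.to_string(),
    }
}

fn main() {
    println!("{}", candidate_id(Some("WORKER-"), "1"));
    println!("{}", candidate_id(None, "CAND-42"));
}
```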

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:22:54 -05:00
root
d16131bcab catalogd: Step 2 — SubjectAuditWriter with HMAC chain
Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md Step 2.
Per-subject append-only audit JSONL with HMAC-SHA256 chain. Local-first
— no Vault, no external anchor (those are v2 if SOC2 Type II becomes
contract-required; v1 deliberately stays small).

shared/types.rs additions:
- AuditAccessor — kind, daemon, purpose, trace_id
- SubjectAuditRow — schema/ts/candidate_id/accessor/fields_accessed/
  result/prev_chain_hash/row_hmac

crates/catalogd/src/subject_audit.rs (NEW):
- SubjectAuditWriter — holds signing key + per-subject latest-hash cache
- from_key_file() — loads key from sealed file, requires ≥32 bytes
- with_inline_key() — for tests + bring-up
- append() — computes HMAC chain link, persists JSONL row, returns new
  chain root (caller mirrors to SubjectManifest.audit_log_chain_root)
- verify_chain() — full re-verification of a subject's audit log,
  catches both prev_hash drift AND row-level HMAC tampering
- scan_latest_hash() — cold-start path, finds prev_hash from JSONL tail
- append_line() — read-modify-write pattern (object stores have no
  native append; same shape as the rest of catalogd's persistence)
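
The chain linkage behind append() and verify_chain() can be sketched with the standard library alone. Important assumption: `toy_tag` below is a NON-cryptographic placeholder standing in for HMAC-SHA256 so the example runs without external crates. The `Row` shape, the "genesis" marker, and the `prev|payload` input format are also simplifications, not the actual SubjectAuditRow layout.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// PLACEHOLDER, NOT CRYPTOGRAPHIC: stands in for HMAC-SHA256 in this sketch.
fn toy_tag(key: &[u8], data: &str) -> String {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    data.hash(&mut h);
    format!("{:016x}", h.finish())
}

// Hypothetical genesis marker used as prev hash for the first row.
const GENESIS: &str = "genesis";

#[derive(Debug, Clone)]
struct Row {
    payload: String,
    prev_chain_hash: String,
    row_hmac: String,
}

/// Link a new row to the previous one and return the new chain root.
fn append(key: &[u8], log: &mut Vec<Row>, payload: &str) -> String {
    let prev = log
        .last()
        .map(|r| r.row_hmac.clone())
        .unwrap_or_else(|| GENESIS.to_string());
    let row_hmac = toy_tag(key, &format!("{}|{}", prev, payload));
    log.push(Row {
        payload: payload.to_string(),
        prev_chain_hash: prev,
        row_hmac: row_hmac.clone(),
    });
    row_hmac
}

/// Re-check both the prev-hash linkage and each row's tag.
fn verify_chain(key: &[u8], log: &[Row]) -> bool {
    let mut expected_prev = GENESIS.to_string();
    for row in log {
        if row.prev_chain_hash != expected_prev {
            return false; // prev_hash drift
        }
        let tag = toy_tag(key, &format!("{}|{}", row.prev_chain_hash, row.payload));
        if tag != row.row_hmac {
            return false; // row-level tampering
        }
        expected_prev = row.row_hmac.clone();
    }
    true
}

fn main() {
    let key = [7u8; 32];
    let mut log = Vec::new();
    append(&key, &mut log, "read:name");
    append(&key, &mut log, "read:email");
    println!("chain valid: {}", verify_chain(&key, &log));
}
```

Because each tag covers the previous tag, changing any row invalidates that row's check and every link after it.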

Crypto: HMAC-SHA256 via the RustCrypto `hmac` crate (added to workspace
+ catalogd deps; no hand-rolled crypto). Output is lowercase hex,
matching the rest of the codebase's SHA-256 conventions.
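
For reference, lowercase hex encoding of a digest can be done with the standard library as below. The helper name is an assumption; the codebase may use a crate or a different helper for this.

```rust
/// Encode bytes as lowercase hex, two characters per byte.
fn to_lower_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        out.push_str(&format!("{:02x}", b));
    }
    out
}

fn main() {
    // A 32-byte tag (the HMAC-SHA256 output size) becomes 64 hex characters.
    let tag = [0xabu8; 32];
    println!("{}", to_lower_hex(&tag));
}
```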

Security choices:
- NO Debug impl on SubjectAuditWriter — auto-deriving Debug would risk
  leaking the signing key into log lines. Tests work around this by
  matching on Result instead of using .unwrap_err().
- Key min length 32 bytes (the HMAC-SHA256 output size; RFC 2104
  recommends keys at least as long as the hash output).
- Failures are NOT swallowed — Result returned, caller decides whether
  to log + continue (per spec §3.2 the gateway tool registry SHOULD
  log + continue rather than block reads).
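
A minimal sketch of the first two choices above: a key holder with no Debug impl, a 32-byte minimum, and a test-style check that matches on the Result. `Writer`, `KeyError`, and `with_key` are hypothetical names, not the actual SubjectAuditWriter API.

```rust
// Deliberately no #[derive(Debug)]: the struct holds key material.
struct Writer {
    key: Vec<u8>,
}

#[derive(Debug, PartialEq)]
enum KeyError {
    TooShort { got: usize, min: usize },
}

const MIN_KEY_LEN: usize = 32;

impl Writer {
    /// Reject keys shorter than MIN_KEY_LEN instead of silently accepting them.
    fn with_key(key: Vec<u8>) -> Result<Writer, KeyError> {
        if key.len() < MIN_KEY_LEN {
            return Err(KeyError::TooShort { got: key.len(), min: MIN_KEY_LEN });
        }
        Ok(Writer { key })
    }

    fn key_len(&self) -> usize {
        self.key.len()
    }
}

fn main() {
    // `.unwrap_err()` would not compile here: it requires the Ok type
    // (Writer) to implement Debug. Matching on the Result avoids that.
    match Writer::with_key(vec![0u8; 16]) {
        Err(KeyError::TooShort { got, min }) => println!("rejected: {} < {}", got, min),
        Ok(_) => panic!("short key must be rejected"),
    }
    if let Ok(w) = Writer::with_key(vec![0u8; 32]) {
        println!("accepted key of {} bytes", w.key_len());
    }
}
```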

Tests (7/7 passing):
- first_append_uses_genesis_prev_hash
- chain_links_each_append (3-row chain verifies)
- separate_subjects_have_independent_chains (per-subject isolation)
- tamper_detected_on_verify (mutation in middle of chain breaks verify)
- cold_writer_picks_up_existing_chain (process restart preserves chain)
- empty_candidate_id_rejected
- key_too_short_rejected_via_file

NOT in this commit (future steps):
- Step 3: Backfill ETL from workers_500k.parquet (next per J)
- Step 4: Wire gateway tool registry to call append() on every
  candidate_id returned by search_candidates / get_candidate
- Step 5: Wire validator WorkerLookup similarly
- Step 6: /audit/subject/{id} HTTP endpoint
- Step 7: Daily retention sweep
- Mirroring chain root to SubjectManifest.audit_log_chain_root
  (separate concern; do at the call site)

cargo check --workspace clean. cargo test -p catalogd subject_audit
7/7 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:19:18 -05:00
root
dbe00d018f Federation foundation + HNSW trial system + Postgres streaming + PRD reframe
Four shipped features and a PRD realignment, all measured end-to-end:

HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
  TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
  /hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us
  at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s,
  80/30 = 1.00 recall in 230s
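
The recall@10 figures above compare HNSW results against brute-force ground truth. One common way to compute that metric is sketched below; the function name and the treatment of an empty ground-truth list are assumptions for this example, not the harness's actual code.

```rust
use std::collections::HashSet;

/// Fraction of the top-k ground-truth ids that also appear in the
/// top-k approximate results.
fn recall_at_k(ground_truth: &[usize], approx: &[usize], k: usize) -> f64 {
    let truth: HashSet<usize> = ground_truth.iter().take(k).copied().collect();
    if truth.is_empty() {
        // Convention chosen for this sketch: nothing to retrieve counts as complete.
        return 1.0;
    }
    let hits = approx.iter().take(k).filter(|id| truth.contains(*id)).count();
    hits as f64 / truth.len() as f64
}

fn main() {
    let truth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    let approx = [10, 9, 8, 7, 6, 5, 4, 3, 2, 99];
    println!("recall@10 = {:.2}", recall_at_k(&truth, &approx, 10));
}
```

Order within the top k does not matter for this metric, only membership.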

Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
  and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows

Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
  /etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
  rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
  anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
  /storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
  X-Lakehouse-Rescue-Used observability headers on fallback
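
The rescue_bucket read fallback can be pictured roughly as follows. This sketch uses in-memory maps in place of real buckets, and the `ReadOutcome` type and function name are hypothetical; the `rescue_used` flag is the kind of signal an X-Lakehouse-Rescue-Used header could carry.

```rust
use std::collections::HashMap;

// In-memory stand-in for a bucket: object key -> bytes.
type Bucket = HashMap<String, Vec<u8>>;

#[derive(Debug, PartialEq)]
struct ReadOutcome {
    data: Vec<u8>,
    rescue_used: bool,
}

/// Read from the primary bucket first; fall back to the rescue bucket
/// (if one is configured) and record that the fallback was taken.
fn read_with_rescue(primary: &Bucket, rescue: Option<&Bucket>, key: &str) -> Option<ReadOutcome> {
    if let Some(data) = primary.get(key) {
        return Some(ReadOutcome { data: data.clone(), rescue_used: false });
    }
    rescue
        .and_then(|bucket| bucket.get(key))
        .map(|data| ReadOutcome { data: data.clone(), rescue_used: true })
}

fn main() {
    let primary = Bucket::new();
    let mut rescue = Bucket::new();
    rescue.insert("a.parquet".to_string(), vec![1, 2, 3]);
    let outcome = read_with_rescue(&primary, Some(&rescue), "a.parquet");
    println!("{:?}", outcome);
}
```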

Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
  into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
  (586 rows × 13 cols, 6 batches, 196ms end-to-end)
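
A sketch of the two ideas above: LIMIT/OFFSET pagination and password redaction in a DSN. The batch size of 100 and the `id` ordering column are assumptions (100 is simply one size for which 586 rows gives 6 batches), and all function names are hypothetical. Identifiers are interpolated directly, so this shape is only safe for trusted table and column names.

```rust
/// Number of batches needed to cover `total_rows` (ceiling division).
fn batch_count(total_rows: u64, batch_size: u64) -> u64 {
    assert!(batch_size > 0, "batch_size must be positive");
    (total_rows + batch_size - 1) / batch_size
}

/// Build the query for one page. A stable ORDER BY keeps pages consistent.
fn page_query(table: &str, order_by: &str, batch_size: u64, batch_index: u64) -> String {
    format!(
        "SELECT * FROM {} ORDER BY {} LIMIT {} OFFSET {}",
        table,
        order_by,
        batch_size,
        batch_index * batch_size
    )
}

/// Replace the password in `scheme://user:password@host/...` with `***`.
/// DSNs without a password are returned unchanged.
fn redact_password(dsn: &str) -> String {
    fn try_redact(dsn: &str) -> Option<String> {
        let scheme_end = dsn.find("://")? + 3;
        let rest = &dsn[scheme_end..];
        let authority_end = rest.find('/').unwrap_or(rest.len());
        let at = rest[..authority_end].rfind('@')?;
        let colon = rest[..at].find(':')?;
        Some(format!("{}{}:***{}", &dsn[..scheme_end], &rest[..colon], &rest[at..]))
    }
    try_redact(dsn).unwrap_or_else(|| dsn.to_string())
}

fn main() {
    let batches = batch_count(586, 100);
    for i in 0..batches {
        println!("{}", page_query("knowledge_base.team_runs", "id", 100, i));
    }
    println!("{}", redact_password("postgres://app:s3cret@db.local:5432/kb"));
}
```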

PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
  trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
  bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
  decision rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 01:50:05 -05:00
root
01373c0e45 Phase 5: hardening — gRPC, observability, auth, config
- proto: lakehouse.proto with CatalogService, QueryService, StorageService, AiService
- proto crate: tonic-build codegen from proto definitions
- catalogd: gRPC CatalogService implementation
- gateway: dual HTTP (:3100) + gRPC (:3101) servers
- gateway: OpenTelemetry tracing with stdout exporter
- gateway: API key auth middleware (toggleable)
- shared: TOML config system with typed structs and defaults
- lakehouse.toml config file
- ADR-006 and ADR-007 documented
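
A rough illustration of typed config with defaults and a toggleable API-key check. The struct, field names, and the auth-off default are assumptions; only the two port numbers come from the list above. TOML parsing is omitted so the sketch stays standard-library only.

```rust
#[derive(Debug, Clone, PartialEq)]
struct GatewayConfig {
    http_port: u16,
    grpc_port: u16,
    auth_enabled: bool,
    api_key: Option<String>,
}

impl Default for GatewayConfig {
    fn default() -> Self {
        GatewayConfig {
            http_port: 3100,
            grpc_port: 3101,
            auth_enabled: false, // assumption: auth is off unless configured
            api_key: None,
        }
    }
}

/// Allow everything when auth is disabled; otherwise require a matching key.
/// Note: plain equality is not a constant-time comparison.
fn authorize(cfg: &GatewayConfig, presented: Option<&str>) -> bool {
    if !cfg.auth_enabled {
        return true;
    }
    match (&cfg.api_key, presented) {
        (Some(expected), Some(got)) => expected.as_str() == got,
        _ => false,
    }
}

fn main() {
    let cfg = GatewayConfig {
        auth_enabled: true,
        api_key: Some("example-key".to_string()),
        ..GatewayConfig::default()
    };
    println!("http={} grpc={}", cfg.http_port, cfg.grpc_port);
    println!("authorized: {}", authorize(&cfg, Some("example-key")));
}
```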

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:37:07 -05:00
root
655b6c0b37 Phase 1: storage + catalog layer
- storaged: object_store backend (LocalFileSystem), PUT/GET/DELETE/LIST endpoints
- shared: arrow_helpers with Parquet roundtrip + schema fingerprinting (2 tests)
- catalogd: in-memory registry with write-ahead manifest persistence to object storage
- catalogd: POST/GET /datasets, GET /datasets/by-name/{name}
- gateway: wires storaged + catalogd with shared object_store state
- Phase tracker updated: Phase 0 + Phase 1 gates passed
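
The write-ahead ordering for the in-memory registry can be sketched as: persist the manifest first, and update memory only if that write succeeds. The trait, the `MemStore` stand-in, and the object key layout are hypothetical; the real code goes through the object_store backend.

```rust
use std::collections::HashMap;

// Hypothetical minimal storage interface for this sketch.
trait ManifestStore {
    fn put(&mut self, key: &str, bytes: &[u8]) -> Result<(), String>;
}

// In-memory store that can be told to fail, to exercise the error path.
struct MemStore {
    objects: HashMap<String, Vec<u8>>,
    fail_writes: bool,
}

impl ManifestStore for MemStore {
    fn put(&mut self, key: &str, bytes: &[u8]) -> Result<(), String> {
        if self.fail_writes {
            return Err(format!("write failed for {}", key));
        }
        self.objects.insert(key.to_string(), bytes.to_vec());
        Ok(())
    }
}

struct Registry<S: ManifestStore> {
    store: S,
    by_name: HashMap<String, String>,
}

impl<S: ManifestStore> Registry<S> {
    fn new(store: S) -> Self {
        Registry { store, by_name: HashMap::new() }
    }

    /// Persist first; touch the in-memory map only after the write succeeds.
    fn register(&mut self, name: &str, manifest_json: &str) -> Result<(), String> {
        let key = format!("manifests/{}.json", name); // hypothetical key layout
        self.store.put(&key, manifest_json.as_bytes())?;
        self.by_name.insert(name.to_string(), manifest_json.to_string());
        Ok(())
    }

    fn get(&self, name: &str) -> Option<&String> {
        self.by_name.get(name)
    }
}

fn main() {
    let store = MemStore { objects: HashMap::new(), fail_writes: false };
    let mut registry = Registry::new(store);
    registry.register("workers", "{}").expect("persist should succeed");
    println!("registered: {:?}", registry.get("workers"));
}
```

With this ordering, a failed write never leaves the in-memory view ahead of what is on disk.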

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:15:27 -05:00
root
a52ca841c6 Phase 0: bootstrap Rust workspace
- Cargo workspace with 6 crates: shared, storaged, catalogd, queryd, aibridge, gateway
- shared: types (DatasetId, ObjectRef, SchemaFingerprint, DatasetManifest) + error enum
- gateway: Axum HTTP entrypoint with nested service routers + tracing
- All services expose /health stubs
- justfile with build/test/run recipes
- PRD, phase tracker, and ADR docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 04:59:05 -05:00