24 Commits

Author SHA1 Message Date
root
f1fa6e4e61 phase 1.6 Gate 3a: photo upload endpoint with consent gate
Per docs/PHASE_1_6_BIPA_GATES.md §1 Gate 3 (consent-gate substrate).
Deepface classification (Gate 3b) deferred to its own session — needs
Python subprocess design conversation after the 2026-05-02 sidecar drop.

What ships:

  shared/types.rs:
    - new BiometricCollection sub-struct: data_path, template_hash,
      collected_at, consent_version_hash, classifications (Option<JSON>)
    - SubjectManifest gains biometric_collection: Option<BiometricCollection>
      with #[serde(default)] so existing on-disk manifests parse and
      re-emit without drift
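
    A minimal sketch of the shape (serde derives, chrono timestamps, and the
    serde_json::Value payload type are assumptions; field names are the ones
    listed above):

      use chrono::{DateTime, Utc};
      use serde::{Deserialize, Serialize};

      #[derive(Serialize, Deserialize, Clone)]
      pub struct BiometricCollection {
          pub data_path: String,
          pub template_hash: String,
          pub collected_at: DateTime<Utc>,
          pub consent_version_hash: String,
          // deepface output (Gate 3b) lands here later; None until then
          pub classifications: Option<serde_json::Value>,
      }

      // On SubjectManifest the new field defaults to None, so manifests
      // already on disk deserialize and re-serialize unchanged:
      //   #[serde(default)]
      //   pub biometric_collection: Option<BiometricCollection>,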

  catalogd/biometric_endpoint.rs (NEW, ~600 LOC):
    POST /subject/{candidate_id}/photo
      - Auth: X-Lakehouse-Legal-Token, constant-time-eq compared against
        same legal token file as /audit. Same 32-byte minimum.
      - Content-Type: must be image/jpeg or image/png (415 otherwise)
      - Body: raw image bytes, max 10MB
      - 401: missing or wrong token
      - 404: subject not registered
      - 403: consent.biometric.status != "given" (returns current status)
      - 403: subject status in {Withdrawn, Erased, RetentionExpired}
      - 200: writes photo to data/biometric/uploads/<sanitized_id>/<ts>.<ext>
        with mode 0700 dir + 0600 file, updates SubjectManifest with
        BiometricCollection record, appends audit row
        (kind="biometric_collection", purpose="photo_upload"), returns
        UploadResponse with template_hash + audit_row_hmac.

    Logic split: a pure async fn process_upload() takes the headers as
    plain arguments so unit tests exercise every branch without HTTP
    machinery; the axum handler is just glue. 10 tests cover all 4
    reject paths + the happy path + chained repeated uploads + a
    structural assertion that the quarantine path is NOT under
    data/headshots/ (synthetic faces).
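
    A sketch of that split, with invented names and a simplified gate
    order (the real signature and checks live in biometric_endpoint.rs):

      use axum::http::StatusCode;

      // Pure gate logic the unit tests hit directly; the handler only
      // extracts headers/body and maps this result onto a response.
      pub struct UploadGate<'a> {
          pub token_ok: bool,
          pub content_type: Option<&'a str>,
          pub subject_known: bool,
          pub consent_given: bool,
          pub subject_open: bool,   // not Withdrawn/Erased/RetentionExpired
      }

      pub fn check_gates(g: &UploadGate) -> Result<(), (StatusCode, &'static str)> {
          if !g.token_ok {
              return Err((StatusCode::UNAUTHORIZED, "auth_failed"));
          }
          match g.content_type {
              Some("image/jpeg") | Some("image/png") => {}
              _ => return Err((StatusCode::UNSUPPORTED_MEDIA_TYPE, "jpeg or png only")),
          }
          if !g.subject_known {
              return Err((StatusCode::NOT_FOUND, "subject not registered"));
          }
          if !g.consent_given {
              return Err((StatusCode::FORBIDDEN, "consent_required"));
          }
          if !g.subject_open {
              return Err((StatusCode::FORBIDDEN, "subject status blocks collection"));
          }
          Ok(())
      }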

  gateway/main.rs:
    Mounts /biometric on the same condition as /audit — only when the
    SubjectAuditWriter is present AND the legal token loads. Storage
    root configurable via LH_BIOMETRIC_STORAGE_ROOT (default
    ./data/biometric/uploads).

Live verification on the running gateway (post-restart):
  - GET  /biometric/health          → "biometric endpoint ready"
  - POST without token              → 401 auth_failed
  - POST with token, no consent     → 403 consent_required (status=NeverCollected)
  - Flipped WORKER-2 to consent=given, POST → 200 with hash + path
  - File at data/biometric/uploads/WORKER-2/<ts>.jpg, mode 0600
  - Manifest biometric_collection field reflects the upload
  - Audit row chain links cleanly off the prior validator_lookup row
  - GET /audit/subject/WORKER-2 returns chain_verified=true, 2 rows
  - Cross-runtime parity probe still 6/6 byte-identical post-change

Phase 1.6 status table updated: Gate 3a DONE, Gate 3b (deepface)
deferred. Calendar bottleneck remains counsel review of items 1/2/5/6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:55:32 -05:00
root
2222227c16 catalogd parity helper: scrum-driven hardening
Per 2026-05-03 step_7_8_retention_and_parity scrum (opus). 5 findings,
0 convergent — but two real fixes shipped:

1. WARN parity_subject_audit.rs:argv — replace .expect() panics with
   stderr+exit(2). The parity script captures stdout for byte-compare;
   a Rust panic backtrace lands in stdout (script merges 2>&1) and
   reads as a parity break instead of a usage error. Added die() helper
   that mirrors the Go side's error-exit pattern.
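
   A sketch of the shape (helper name from this message; the call site is
   illustrative):

     // Usage errors go to stderr and exit(2); stdout stays byte-clean
     // for the parity compare.
     fn die(msg: &str) -> ! {
         eprintln!("parity_subject_audit: {msg}");
         std::process::exit(2);
     }

     // e.g. instead of args.next().expect("missing --key"):
     //   let key_path = args.next().unwrap_or_else(|| die("missing --key <path>"));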

2. INFO parity_subject_audit.rs:5 — doc comment hardcoded the absolute
   path /home/profit/golangLAKEHOUSE/... Replaced with repo-relative
   reference.

INFO findings on retention_sweep argv style + --as-of report path
overwrite were noted but not actioned (style only / acceptable for
the forecast use case).

The major scrum-surfaced bug (Go json.Marshal HTML-escaping <>& while
serde_json keeps them literal) is fixed on the Go side in a parallel
commit. The Rust side here is correct as-is — serde_json::to_vec doesn't
HTML-escape by default, so no change needed in canonical_json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:29:38 -05:00
root
2413c96817 catalogd: Step 8 — parity_subject_audit binary (Rust side)
Per docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 8.

Cross-runtime parity helper consumed by:
  golangLAKEHOUSE/scripts/cutover/parity/subject_audit_parity.sh

Two modes:

  --known-answer
    Print canonical-JSON + HMAC for a hardcoded fixture row. The Go
    helper at golangLAKEHOUSE/scripts/cutover/parity/subject_audit_helper/
    must produce byte-identical output. Catches algorithm drift
    (canonical-JSON sort order, HMAC algorithm, hex encoding).

  --verify <audit_log_path> --key <key_path>
    Replay the chain on a real production audit log via the live
    SubjectAuditWriter::verify_chain (no re-implementation; the actual
    production verification path). Output: one JSON line with mode,
    count, tip, verified, error.

The helper exercises the SAME verify_chain path the gateway calls, so
algorithm changes in subject_audit.rs automatically flow into the
parity probe.

Live-verified against 5 production audit logs in data/_catalog/subjects;
all 6 parity assertions pass after fixing two real cross-runtime drifts
on the Go side (omitempty stripping the empty trace_id field;
time.RFC3339Nano dropping trailing zeros from the nanosecond component —
both caught by this probe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:16:50 -05:00
root
8fc6238dea catalogd: Step 7 — daily retention sweep binary
Per docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 7:
  "Subjects whose retention.general_pii_until < now AND status != erased
   get marked for review (don't auto-delete; legal needs to approve)."

Per shared::types::BiometricConsent doc-comment (BIPA requirement on
biometric data, max 3 years from last interaction):
  "Implementation MUST enforce daily expiration sweep against this field."

Therefore the sweep checks BOTH retention clocks. Reports overdue
subjects to data/_catalog/subjects/_retention_sweep_<YYYY-MM-DD>.jsonl.

Idempotent: subjects already in {Erased, RetentionExpired} are skipped
so daily runs do not append duplicate rows.
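
A sketch of the dual-clock check (names and argument shapes are
assumptions; the real fn takes the SubjectManifest):

  use chrono::{DateTime, Utc};

  pub fn is_overdue(
      already_terminal: bool,                   // Erased / RetentionExpired
      general_pii_until: DateTime<Utc>,
      biometric_until: Option<DateTime<Utc>>,   // BIPA clock, when biometric data exists
      now: DateTime<Utc>,                       // --as-of plugs in here
  ) -> bool {
      if already_terminal {
          return false; // skip: keeps daily runs from appending duplicates
      }
      let general_overdue = general_pii_until < now;
      let bipa_overdue = biometric_until.map_or(false, |t| t < now);
      general_overdue || bipa_overdue
  }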

Does NOT mutate subject manifests. Legal/operator owns the action
(extend, flip status, schedule erasure).

CLI:
  retention_sweep                       # dry-run (default), stderr only
  retention_sweep --apply               # also write JSONL report
  retention_sweep --as-of <RFC3339>     # alternate clock for forecast/test
  retention_sweep --storage-root <dir>  # default ./data

Tests: 8 unit tests on is_overdue covering all 5 SubjectStatus values,
both clocks, BIPA-only path, and idempotency on already-flagged
subjects.

Live verification (100 subjects in ./data/_catalog/subjects):
  - now (2026-05-03): 0 overdue (correct — 4-year retention)
  - --as-of 2031-06-01: 100 overdue, 394 days past, jsonl report shape
    verified with biometric fields correctly omitted via
    serde skip_serializing_if when subject has no biometric clock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:05:03 -05:00
root
2a4b316a15 subjects: 2nd scrum fix wave (token min, chain_tip, tampering, rebuild collision warn)
Second cross-lineage scrum on Steps 5+6 returned 13 distinct findings, 0 convergent.
Three BLOCK-class claims verified as false positives (cache IS written, per-subject
Mutex IS in place, spawn IS safe under writer's lock). Five real fixes shipped:

1. audit_endpoint: legal token min length 16->32 (HMAC-SHA256 best practice, kimi)
2. subject_audit: new chain_tip() returns last hash from full log; audit_endpoint
   now reports chain_root from full chain instead of windowed slice (opus)
3. registry: rebuild loader now warns on sanitize collision (symmetric with
   put_subject's collision guard - opus)
4. audit_endpoint: tampering detection - if manifest expects non-empty chain_root
   but log returns 0 rows, flag chain_verified=false with explicit message (opus)
5. execution_loop::audit_result_state: tightened heuristic - error/denied/not_found
   only classify when no rows/data/results sibling (opus INFO)

Tests: 17 catalogd subject + 6 gateway audit_result_state, all green.
New: audit_result_state_does_not_classify_error_when_data_sibling_present,
     audit_result_state_status_is_authoritative_even_with_data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:00:42 -05:00
root
15cfd76c04 catalogd + gateway: Step 6 — /audit/subject/{id} legal-tier HTTP endpoint
Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 6
+ §4 (response shape) + §6 (auth model). The defense-against-EEOC-
discovery surface is live: legal counsel hits one URL with one token,
gets back a signed-by-HMAC-chain audit response naming every PII access
for a subject in a time window.

New module: crates/catalogd/src/audit_endpoint.rs (~340 LOC)
  - AuditEndpointState { registry, writer, legal_token }
  - router() exposes:
      GET /subject/{candidate_id}?from=ISO&to=ISO  (full audit response)
      GET /health                                    (liveness + token check)
  - require_legal_auth() — constant-time comparison of the
    X-Lakehouse-Legal-Token header against the loaded token (see the
    sketch after this list). Avoids timing leaks on the token check
    without pulling in `subtle` for one comparison.
  - Token loaded from /etc/lakehouse/legal_audit.token (env-overridable
    via LH_LEGAL_AUDIT_TOKEN_FILE). Empty file or <16 chars = endpoint
    serves 503 with a clear reason. Token value NEVER logged.
  - Response schema: subject_audit_response.v1 with manifest +
    audit_log (rows + chain verification) + datasets_referenced +
    safe_views_available + completeness_attestation.
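
  A sketch of that comparison (the early length check can short-circuit;
  only the byte walk needs to be constant-time):

    fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
        if a.len() != b.len() {
            return false;
        }
        // XOR-accumulate over every byte so timing does not depend on
        // where the first mismatch sits.
        let mut diff: u8 = 0;
        for (x, y) in a.iter().zip(b.iter()) {
            diff |= x ^ y;
        }
        diff == 0
    }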

New helper on SubjectAuditWriter:
  - read_rows_in_range(candidate_id, from, to) — returns rows in window,
    used by the endpoint to assemble the response without re-reading
    the entire chain.
  - verify_chain() now returns Ok(0) when the audit log file doesn't
    exist (empty = trivially valid). Prevents legitimate "no PII access
    yet for this subject" from showing as integrity=BROKEN in the
    audit response. Caller can detect "log was deleted" via comparison
    to SubjectManifest.audit_log_chain_root (when that mirror lands).

main.rs:
  - Audit endpoint mounted at /audit ONLY when both subject_audit
    writer AND legal token are present. Disabled-by-default keeps the
    surface from accidentally serving in dev/bring-up environments
    without proper credentials.

Tests (9/9 passing):
  - constant_time_eq (correctness on equal/diff/empty/length-mismatch)
  - missing_legal_token_returns_503
  - missing_header_returns_401
  - wrong_token_returns_401
  - correct_token_passes_auth
  - audit_response_assembly_full_path (manifest + 3 rows + chain verify)
  - audit_response_window_filters_rows (time-bounded window)
  - empty_token_file_results_in_disabled_endpoint
  - short_token_file_rejected_at_load (<16 char min)

LIVE end-to-end verification:
  1. Plant signing key + legal token in /tmp/lakehouse_audit/
  2. Restart gateway with LH_SUBJECT_AUDIT_KEY + LH_LEGAL_AUDIT_TOKEN_FILE
     pointing at the test files
  3. /audit/health → 200 "audit endpoint ready"
  4. /audit/subject/WORKER-1 (no token) → 401 "missing X-Lakehouse-Legal-Token"
  5. /audit/subject/WORKER-1 (wrong token) → 401 "X-Lakehouse-Legal-Token mismatch"
  6. /audit/subject/WORKER-1 (correct token) → 200 + full manifest + 0 rows
     + chain_verified=true (empty log path)
  7. POST /v1/validate with candidate_id=WORKER-1 → triggers WorkerLookup.find()
     via the AuditingWorkerLookup wrapper from Step 5
  8. data/_catalog/subjects/WORKER-1.audit.jsonl now exists with 1 row
     (accessor.purpose=validator_worker_lookup, result=not_found,
     prev_chain_hash=GENESIS, valid HMAC)
  9. /audit/subject/WORKER-1 (correct token) → 200 + manifest + 1 row +
     chain_verified=true + chain_rows_total=1 + completeness attestation

The full audit-trail loop (PII access → audit row → chain → audit response)
works end-to-end on the live gateway.

NOT in this commit (future steps):
  - Step 7: Daily retention sweep
  - Step 8: Cross-runtime parity (Go side reads the same shapes)
  - Mirror chain root to SubjectManifest.audit_log_chain_root after
    each append (so tampering detection can use the manifest's
    cached root as ground truth)
  - Live row projection from datasets (currently caller follows up
    via /query/sql against the safe_views named in the response)
  - Ed25519 signature on the response (chain verification IS the v1
    attestation; signing is future hardening per spec §10)

cargo build --release clean. cargo test -p catalogd audit_endpoint
9/9 PASS. Live verification successful.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:52:04 -05:00
root
e38f3573ff subject manifests Steps 1-4 — fix scrum-flagged BLOCKs and WARNs
2026-05-03 cross-lineage scrum on the subjects_steps_1_to_4 wave
returned 14 distinct findings, 0 convergent. opus verdict was HOLD
with 3 BLOCKs around the audit-chain integrity. All real. Fixed:

──────────────────────────────────────────────────────────────────
BLOCK 1 — opus subject_audit.rs:172 + execution_loop.rs:391
  Concurrency race: append_line is read-modify-write; the gateway
  hook used tokio::spawn fan-out → two concurrent appends to the
  same subject both read the same prev_hash, both compute their
  HMAC from the same prev, second write silently overwrites first
  → row lost AND chain broken.

  Fix:
  - SubjectAuditWriter gains per-subject Mutex map. append() acquires
    the subject's lock for the duration of the read-modify-write.
    Different subjects still parallelize.
  - Gateway hook switches from tokio::spawn to inline await. Per-row
    cost is ~1ms (one object_store put); inline is correct AND cheap.
  - New regression test: 50 concurrent appends to the same subject,
    asserts all 50 land with intact chain.
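
  A sketch of the per-subject lock map (shape only; the real one hangs
  off SubjectAuditWriter):

    use std::collections::HashMap;
    use std::sync::Arc;
    use tokio::sync::Mutex;

    // One async Mutex per candidate_id: appends to the SAME subject
    // serialize (read prev_hash -> compute HMAC -> write must be atomic
    // per chain); different subjects still append in parallel.
    #[derive(Default)]
    pub struct SubjectLocks {
        locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
    }

    impl SubjectLocks {
        pub async fn lock_for(&self, candidate_id: &str) -> Arc<Mutex<()>> {
            let mut map = self.locks.lock().await;
            map.entry(candidate_id.to_string())
                .or_insert_with(|| Arc::new(Mutex::new(())))
                .clone()
        }
    }

    // In append():
    //   let lock = self.locks.lock_for(candidate_id).await;
    //   let _guard = lock.lock().await;
    //   ... read prev_hash, compute row_hmac, write the line ...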

BLOCK 2 — opus subject_audit.rs:108
  Non-deterministic canonicalization: serde_json serializes struct
  fields in declaration order. Schema evolution (adding/reordering
  fields) silently changes the bytes verify_chain hashes → chain
  breaks even when nothing was actually tampered with.

  Fix:
  - New canonical_json() free fn — recursive value rewrite to sort
    object keys alphabetically (BTreeMap projection), arrays preserve
    order, scalars pass through. Stable across struct evolution.
  - Both append() and verify_chain() now compute HMAC over canonical
    bytes, not declaration-order bytes.
  - New regression tests: alphabetical-key + array-order-preserved.
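
  A sketch of the rewrite (shape only):

    use serde_json::Value;
    use std::collections::BTreeMap;

    // Sort object keys recursively; arrays keep their order, scalars
    // pass through. HMAC is computed over serde_json::to_vec(&canonical)
    // rather than the struct's declaration-order bytes.
    pub fn canonical_json(v: &Value) -> Value {
        match v {
            Value::Object(map) => {
                let sorted: BTreeMap<&String, Value> =
                    map.iter().map(|(k, val)| (k, canonical_json(val))).collect();
                serde_json::to_value(sorted).expect("string-keyed map always serializes")
            }
            Value::Array(items) => Value::Array(items.iter().map(canonical_json).collect()),
            scalar => scalar.clone(),
        }
    }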

WARN — opus execution_loop:401
  Audit row's `result` was hardcoded to "success" for every Ok(result)
  including payloads like {"error":"not found"}. Misleads compliance.

  Fix:
  - New audit_result_state() free fn that inspects the payload
    top-level for error/denied/not_found/status signals (per spec
    §3.2 enum). Defaults to "success" only when no error signal.
  - 4 new tests covering each enum case + falsy-signals defense.
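
  One plausible shape of the heuristic as described here (the later wave
  tightens it further when rows/data/results siblings are present):

    use serde_json::Value;

    pub fn audit_result_state(payload: &Value) -> &'static str {
        let obj = match payload.as_object() {
            Some(o) => o,
            None => return "success",
        };
        // Falsy-signals defense: "error": null is not an error.
        if obj.get("error").map_or(false, |v| !v.is_null()) {
            return "error";
        }
        match obj.get("status").and_then(Value::as_str) {
            Some("denied") => "denied",
            Some("not_found") => "not_found",
            _ => "success",
        }
    }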

WARN — opus registry.rs:735
  Storage-key collision: sanitize_view_name(id) is the disk key,
  but the in-memory HashMap was keyed by raw candidate_id. Two
  distinct ids that sanitize to the same key (e.g. "CAND/1" and
  "CAND_1") would collide on disk while appearing distinct in
  memory; second put silently overwrites first; rebuild loads only
  one.

  Fix:
  - put_subject() / get_subject() / delete_subject() / rebuild()
    all key the in-memory HashMap by sanitize_view_name(id), matching
    the storage key shape.
  - Collision guard: put_subject() refuses (with clear error) when
    the sanitized key matches an EXISTING subject with a DIFFERENT
    raw candidate_id.
  - New regression test: put("CAND/1") then put("CAND_1") errors
    + first subject survives.

WARN — opus backfill_subjects.rs:189
  trim_start_matches strips REPEATED prefixes; the spec wanted
  one-shot semantics. Edge case unlikely in practice but real.

  Fix:
  - Switched to strip_prefix(&prefix).unwrap_or(&cid). One-shot.

INFO — opus subject_audit.rs:131
  Per-byte format!("{:02x}", b) allocates each iteration. Hot path
  on every append.

  Fix:
  - Replaced with const HEX lookup table + push() into preallocated
    String. Same output bytes, no per-byte allocation.
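
  A sketch of the encode (same output bytes as the format! version):

    const HEX: &[u8; 16] = b"0123456789abcdef";

    fn to_hex(bytes: &[u8]) -> String {
        let mut out = String::with_capacity(bytes.len() * 2);
        for &b in bytes {
            out.push(HEX[(b >> 4) as usize] as char);
            out.push(HEX[(b & 0x0f) as usize] as char);
        }
        out
    }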

──────────────────────────────────────────────────────────────────
Test summary post-fix:
  catalogd subject_audit: 11/11 PASS (added 4 new — concurrency
                          race regression, parallel-different-subjects,
                          canonical-key sort, canonical-array order)
  catalogd registry subject: 6/6 PASS (added 1 new — collision guard)
  gateway execution_loop subject: 10/10 PASS (added 4 new —
                          audit_result_state enum coverage)

  All 27 subject-related tests green. cargo build --release clean.

The convergent-zero scrum result was misleading on its face — opus
caught real BLOCKs that kimi/qwen missed. Per
feedback_cross_lineage_review.md: opus is the load-bearing reviewer;
single-opus BLOCKs warrant manual verification, which here confirmed
all three were correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:37:45 -05:00
root
bce6dfd1ee catalogd: Step 3 — backfill_subjects binary (BIPA-defensible defaults)
Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 3.
Reads a parquet source, creates one SubjectManifest per row with the
spec-defined safe defaults, persists via Registry::put_subject().

Defaults baked in (per spec §2 + §5 Step 5):
  - vertical = unknown                     (HIPAA fail-closed)
  - consent.general_pii = pending_backfill_review  (NOT inferred_existing — BIPA defense)
  - consent.biometric  = never_collected   (no biometric data backfilled)
  - retention.general_pii_until = now + 4 years
  - retention.policy = "4_year_default"

Conservative ergonomics:
  - --limit 1000 by default. --all to do the full source.
  - --dry-run for parse + count + sample without writes.
  - --concurrency 32 (bounded via tokio::sync::Semaphore; see the
    sketch after this list).
  - Idempotent: skips subjects that already exist in catalog.
  - Progress reports every ~5% (or 5K rows, whichever smaller).
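
  The --concurrency bound is the usual acquire-owned pattern; a sketch
  (names invented):

    use std::sync::Arc;
    use tokio::sync::Semaphore;

    async fn backfill_rows(candidate_ids: Vec<String>, concurrency: usize) {
        let sem = Arc::new(Semaphore::new(concurrency));
        let mut handles = Vec::with_capacity(candidate_ids.len());
        for cid in candidate_ids {
            // At most `concurrency` put_subject calls in flight at once.
            let permit = sem.clone().acquire_owned().await.expect("semaphore open");
            handles.push(tokio::spawn(async move {
                let _permit = permit; // released when this task finishes
                // build manifest with the safe defaults above, put_subject(), ...
                let _ = cid;
            }));
        }
        for h in handles {
            let _ = h.await;
        }
    }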

Live verification on workers_500k.parquet:
  --limit 100 dry-run:  parsed 100 rows, sampled WORKER-1..5, 0 writes ✓
  --limit 100 commit:   100 inserted, 0 failed, 100 files in
                        data/_catalog/subjects/ ✓
  --limit 100 re-run:   0 inserted, 100 skipped (idempotent) ✓

Sample manifest (data/_catalog/subjects/WORKER-1.json):
  {
    "schema": "subject_manifest.v1",
    "candidate_id": "WORKER-1",
    "status": "active",
    "vertical": "unknown",
    "consent": {
      "general_pii": {"status": "pending_backfill_review", ...},
      "biometric":   {"status": "never_collected",         ...}
    },
    "retention": {"general_pii_until": "2030-05-02T...", "policy": "4_year_default"},
    "datasets": [{"name": "workers_500k", "key_column": "worker_id", "key_value": "1"}]
  }

NOT in this commit (future steps):
  - Step 4: Wire gateway tool registry to write audit rows on every
    candidate_id returned (uses SubjectAuditWriter from Step 2)
  - Step 5: Wire validator WorkerLookup similarly
  - Step 6: /audit/subject/{id} HTTP endpoint
  - Step 7: Daily retention sweep
  - Backfill the full 500K (operator decision: --all when ready;
    note: 500K JSON files in one dir will slow startup load — may
    want SQLite/single-file backend before that scale)

Operator note: backfill is a run-once job per source. To extend to the
candidates table, re-run with --dataset candidates --key-column
candidate_id (no prefix needed since candidate_id is already the
canonical token there).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:22:54 -05:00
root
d16131bcab catalogd: Step 2 — SubjectAuditWriter with HMAC chain
Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md Step 2.
Per-subject append-only audit JSONL with HMAC-SHA256 chain. Local-first
— no Vault, no external anchor (those are v2 if SOC2 Type II becomes
contract-required; v1 deliberately stays small).

shared/types.rs additions:
- AuditAccessor — kind, daemon, purpose, trace_id
- SubjectAuditRow — schema/ts/candidate_id/accessor/fields_accessed/
  result/prev_chain_hash/row_hmac

crates/catalogd/src/subject_audit.rs (NEW):
- SubjectAuditWriter — holds signing key + per-subject latest-hash cache
- from_key_file() — loads key from sealed file, requires ≥32 bytes
- with_inline_key() — for tests + bring-up
- append() — computes the HMAC chain link (see the sketch after this
  list), persists the JSONL row, returns the new chain root (caller
  mirrors it to SubjectManifest.audit_log_chain_root)
- verify_chain() — full re-verification of a subject's audit log,
  catches both prev_hash drift AND row-level HMAC tampering
- scan_latest_hash() — cold-start path, finds prev_hash from JSONL tail
- append_line() — read-modify-write pattern (object stores have no
  native append; same shape as the rest of catalogd's persistence)
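
A sketch of the link computation, assuming the `hmac` + `sha2` crates
(the real append() also persists the row and updates the latest-hash
cache):

  use hmac::{Hmac, Mac};
  use sha2::Sha256;

  type HmacSha256 = Hmac<Sha256>;

  // canonical_row_bytes: the row serialized with row_hmac left empty but
  // prev_chain_hash filled in, so each link commits to the whole history
  // before it. Output is lowercase hex.
  fn chain_link(key: &[u8], canonical_row_bytes: &[u8]) -> String {
      let mut mac = HmacSha256::new_from_slice(key).expect("any key length is valid");
      mac.update(canonical_row_bytes);
      mac.finalize()
          .into_bytes()
          .iter()
          .map(|b| format!("{b:02x}"))
          .collect()
  }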

Crypto: HMAC-SHA256 via the standard `hmac` crate (added to workspace
+ catalogd deps; not implementing crypto by hand). Output is lowercase
hex matching the rest of the codebase's SHA-256 conventions.

Security choices:
- NO Debug impl on SubjectAuditWriter — auto-deriving Debug would risk
  leaking the signing key into log lines. Tests work around this by
  matching on Result instead of using .unwrap_err().
- Key min length 32 bytes (HMAC-SHA256 guidance: keys at least as long
  as the 32-byte hash output).
- Failures are NOT swallowed — Result returned, caller decides whether
  to log + continue (per spec §3.2 the gateway tool registry SHOULD
  log + continue rather than block reads).

Tests (7/7 passing):
- first_append_uses_genesis_prev_hash
- chain_links_each_append (3-row chain verifies)
- separate_subjects_have_independent_chains (per-subject isolation)
- tamper_detected_on_verify (mutation in middle of chain breaks verify)
- cold_writer_picks_up_existing_chain (process restart preserves chain)
- empty_candidate_id_rejected
- key_too_short_rejected_via_file

NOT in this commit (future steps):
- Step 3: Backfill ETL from workers_500k.parquet (next per J)
- Step 4: Wire gateway tool registry to call append() on every
  candidate_id returned by search_candidates / get_candidate
- Step 5: Wire validator WorkerLookup similarly
- Step 6: /audit/subject/{id} HTTP endpoint
- Step 7: Daily retention sweep
- Mirroring chain root to SubjectManifest.audit_log_chain_root
  (separate concern; do at the call site)

cargo check --workspace clean. cargo test -p catalogd subject_audit
7/7 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:19:18 -05:00
root
d25990982c catalogd: Step 1 — SubjectManifest type + Registry CRUD
Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md Step 1.
Mirrors the existing AiView put/get/list/delete pattern. NOT a separate
daemon, NOT new infrastructure — extends catalogd's manifest layer with
a new manifest type (subject) alongside dataset/view/tombstone/profile.

shared/types.rs additions:
- SubjectManifest (the wire format from spec §2)
- SubjectStatus enum: pending_consent | active | withdrawn |
  retention_expired | erased
- SubjectVertical enum: unknown | general | healthcare | finance | other
  (default = Unknown for fail-closed routing per spec §2.1)
- ConsentStatus enum: pending_backfill_review | pending_first_contact |
  given | withdrawn | expired
- BiometricConsentStatus enum: never_collected | pending | given |
  withdrawn | expired
- GeneralPiiConsent + BiometricConsent + SubjectConsent
- SubjectRetention (general_pii_until + policy)
- SubjectDatasetRef (name + key_column + key_value pointing at existing
  catalogd dataset manifests)

catalogd/registry.rs additions:
- subjects: Arc<RwLock<HashMap<String, SubjectManifest>>> field on Registry
- put_subject() — validates dataset refs, persists to
  _catalog/subjects/<id>.json, updates in-memory cache
- get_subject() / list_subjects() / delete_subject() / subjects_count()
- rebuild() now loads subject manifests at startup alongside views +
  profiles + tombstones

Tests (5/5 passing):
- put_subject_with_no_dataset_refs_succeeds
- put_subject_rejects_dangling_dataset_ref (validation works)
- put_subject_with_valid_dataset_ref_succeeds
- subject_round_trips_through_object_store (persistence works)
- delete_subject_removes_in_memory_and_persistence

NOT in this commit (future steps):
- Step 2: SubjectAuditWriter with HMAC chain
- Step 3: Backfill ETL from workers_500k.parquet
- Steps 4-5: Wire gateway tool registry + validator to write audit rows
- Step 6: /audit/subject/{id} HTTP endpoint
- Step 7: Daily retention sweep

cargo check --workspace clean. cargo test -p catalogd subject 5/5 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:13:08 -05:00
root
f59ddbebd4 Phase 41: Profile System Expansion
- ProfileType enum: Execution, Retrieval, Memory, Observer
- Per-type endpoints: /profiles/retrieval, /profiles/memory, /profiles/observer
- profile_type field on ModelProfile
- All tests pass
2026-04-23 03:07:22 -05:00
profit
5b1fcf6d27 Phase 28-36 body of work
Accumulated since a6f12e2 (Phase 21 Rust port + Phase 27 versioning):

- Phase 36: embed_semaphore on VectorState (permits=1) serializes
  seed embed calls — prevents sidecar socket collisions under
  concurrent /seed stress load
- Phase 31+: run_stress.ts 6-task diverse stress scaffolding;
  run_e2e_rated.ts + orchestrator.ts tightening
- Catalog dedupe cleanup: 16 duplicate manifests removed; canonical
  candidates.parquet (10.5MB -> 76KB) + placements.parquet (1.2MB ->
  11KB) regenerated post-dedupe; fresh manifests for active datasets
- vectord: harness EvalSet refinements (+181), agent portfolio
  rotation + ingest triggers (+158), autotune + rag adjustments
- catalogd/storaged/ingestd/mcp-server: misc tightening
- docs: Phase 28-36 PRD entries + DECISIONS ADR additions;
  control-plane pivot banner added to top of docs/PRD.md (pointing
  at docs/CONTROL_PLANE_PRD.md which lands in next commit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 02:41:15 -05:00
root
0d037cfac1 Phases 16.2 + L2 + 17 VRAM gate + MySQL + 18 Lance hybrid milestone
Five threads of work landing as one milestone — all individually
verified end-to-end against real data, full release build clean,
46 unit tests pass.

## Phase 16.2 / 16.5 — autotune agent + ingest triggers

`vectord::agent` is a long-running tokio task that watches the trial
journal and autonomously proposes + runs new HNSW configs. Distinct
from `autotune::run_autotune` (synchronous one-shot grid). Triggered
on POST /vectors/agent/enqueue/{idx} or by the periodic wake; ingest
paths now push DatasetAppended events when an index's source dataset
gets re-ingested. Rate-limited (max_trials_per_hour) and cooldown-
gated so it can't saturate Ollama under live load.

The proposer is ε-greedy around the current champion: with prob 0.25
sample random from full bounds, otherwise perturb champion ± small
delta on both axes. Dedup against history. Deterministic — RNG seeded
from history.len() so the same journal state proposes the same next
config (helps offline replay debugging).

`[agent]` config section in lakehouse.toml; opt-in via enabled=true.

## Federation Layer 2 — runtime bucket lifecycle + per-index scoping

`BucketRegistry.buckets` moved to `std::sync::RwLock<HashMap>` so
buckets can be added/removed after startup. POST /storage/buckets
provisions at runtime; DELETE /storage/buckets/{name} unregisters
(refuses primary/rescue with 403). Local-backend buckets get their
root directory auto-created.

`IndexMeta.bucket` (default "primary" via serde) records each index's
home bucket. `TrialJournal` and `PromotionRegistry` now hold
Arc<BucketRegistry> + IndexRegistry; they resolve target store per-
index via IndexMeta.bucket. PromotionRegistry::list_all scans every
bucket and dedups by index_name. Pre-federation indexes keep working
unchanged — they just default to primary.

`ModelProfile.bucket: Option<String>` declares per-profile artifact
home. POST /vectors/profile/{id}/activate auto-provisions the
profile's bucket under storage.profile_root if not yet registered.

EvalSets stay primary-only for now — noted gap, low-risk to extend
later with the same resolver pattern.

## Phase 17 — VRAM-aware two-profile gate

Sidecar gains POST /admin/unload (Ollama keep_alive=0 trick — forces
immediate VRAM release), POST /admin/preload (keep_alive=5m with
empty prompt, takes the slot warm), and GET /admin/vram (combines
nvidia-smi snapshot with Ollama /api/ps). Exposed via aibridge as
unload_model / preload_model / vram_snapshot.

`VectorState.active_profile` is the GPU-slot singleton —
Arc<RwLock<Option<ActiveProfileSlot>>>. activate_profile checks for
a previous profile with a different ollama_name and unloads it
before preloading the new one; same-model reactivations skip the
unload (Ollama no-ops). New routes: POST /vectors/profile/{id}/
deactivate (unload + clear slot), GET /vectors/profile/active.

Verified live: staffing-recruiter (qwen2.5) → docs-assistant
(mistral) swap freed qwen2.5 from VRAM and loaded mistral. nomic-
embed-text persists across swaps because both profiles use it —
free optimization that fell out of the design. Scoped search
correctly 403s cross-profile in both directions.

## MySQL streaming connector

`crates/ingestd/src/my_stream.rs` mirrors pg_stream.rs for MySQL.
Pure-rust `mysql_async` driver (default-features=false to avoid C
deps). Same OFFSET pagination, same Parquet-streaming write shape.
Type mapping per ADR-010: int/bigint → Int32/Int64, decimal/float
→ Float64, tinyint(1)/bool → Boolean, everything else → Utf8 with
fallback parsers for date/time/json/uuid via Display.

POST /ingest/mysql parallel to /ingest/db. Same PII auto-detection,
same lineage capture (source_system="mysql"), same agent-trigger
hook. `redact_dsn` generalized — was hardcoded to "postgresql://"
length, now works for any scheme://user:pass@host/path URL (latent
PII leak fix for MySQL DSNs).
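
A sketch of the generalized shape (not the real function):

  // Keep scheme, user, host, and path; replace whatever sits in the
  // password slot. Works for postgresql://, mysql://, or anything else
  // shaped scheme://user:pass@host/path.
  fn redact_dsn(dsn: &str) -> String {
      let Some((scheme, rest)) = dsn.split_once("://") else {
          return dsn.to_string();
      };
      let Some((creds, host_and_path)) = rest.split_once('@') else {
          return dsn.to_string(); // no credentials embedded
      };
      let user = creds.split(':').next().unwrap_or("");
      format!("{scheme}://{user}:***@{host_and_path}")
  }

  // redact_dsn("mysql://app:s3cret@db.local:3306/hr")
  //   -> "mysql://app:***@db.local:3306/hr"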

Verified live against MariaDB on localhost: 10 rows × 9 columns of
test data round-tripped through datatypes int/varchar/decimal/
tinyint/datetime/text. PII detection auto-flagged name + email.
Aggregation queries through DataFusion match the source values
exactly.

## Phase 18 — Hybrid Parquet+HNSW ⊕ Lance backend (ADR-019)

`vectord-lance` is a new firewall crate. Lance pulls Arrow 57 and
DataFusion 52 — incompatible with the rest of the workspace's
Arrow 55 / DataFusion 47. The firewall isolates that dep tree:
public API uses only std types (Vec<f32>, Vec<String>, Hit, Row,
*Stats), so no Arrow types cross the crate boundary and nothing
propagates to vectord. The ADR-019 path that didn't ship until now.

`vectord::lance_backend::LanceRegistry` lazy-creates a
LanceVectorStore per index, resolving bucket → URI via the
conventional local-bucket layout. `IndexMeta.vector_backend` and
`ModelProfile.vector_backend` carry the choice (default Parquet so
existing indexes unchanged).

Six routes under /vectors/lance/*:
- migrate/{idx}: convert binary-blob Parquet → Lance FixedSizeList
- index/{idx}: build IVF_PQ
- search/{idx}: vector search (embed via sidecar)
- doc/{idx}/{doc_id}: random row fetch
- append/{idx}: native fragment append
- stats/{idx}: row count + index presence

Verified live on the real resumes_100k_v2 corpus (100K × 768d):
- Migrate: 0.57s
- Build IVF_PQ index: 16.2s (matches ADR-019 bench; 14× faster than
  HNSW's 230s for the same data)
- Search end-to-end (Ollama embed + Lance scan): 23-53ms
- Random doc_id fetch: 5-7ms (filter scan; faster than Parquet's
  ~35ms full-file scan, slower than the bench's 311us positional
  take — would close that gap with a scalar btree on doc_id)
- Append 100 rows: 3.3ms / +320KB on disk vs Parquet's required
  full ~330MB rewrite — the structural win
- Index survives append; both backends coexist cleanly

## Known follow-ups not in this milestone

- ModelProfile.vector_backend doesn't yet auto-route /vectors/profile/
  {id}/search to Lance; callers go through /vectors/lance/* directly
- Scalar btree on doc_id (closes the 5-7ms → ~300us gap)
- vectord-lance built default-features=false → no S3 yet
- IVF_PQ recall not measured (ADR-019 caveat) — needs a Lance-aware
  variant of the eval harness
- Watcher-path ingest doesn't push agent triggers (HTTP paths do)
- EvalSets still primary-only (federation gap)
- No PATCH endpoint to move an existing index between buckets
- The pre-existing storaged::append_log doctest fails to compile
  (malformed `{prefix}/` parses as a code fence); known bug, left
  for a focused fix

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 20:24:46 -05:00
root
4e1c400f5d Phase E.2: Compaction integrates tombstones — physical deletion closes GDPR loop
Phase E gave us soft-delete at query time (tombstones hide rows via a
DataFusion filter view). This completes the invariant: after compact,
tombstoned rows are PHYSICALLY absent from the parquet on disk.

delta::compact changes:
- Signature adds tombstones: &[Tombstone]
- After merging base + deltas, apply_tombstone_filter builds a
  BooleanArray keep-mask per batch (True where row_key_value is NOT
  in the tombstone set) and applies arrow::compute::filter_record_batch
- Supports Utf8, Int32, Int64 key columns (matches refresh.rs coverage
  for pg- and csv-derived schemas)
- CompactResult gains tombstones_applied + rows_dropped_by_tombstones
- Caller clears tombstone store on success

Critical correctness fix surfaced during E2E testing:
The original Phase 8 compact concatenated N independent Parquet byte
streams from record_batch_to_parquet() — each with its own footer.
Parquet readers only see the FIRST footer's data; the rest is invisible.
Latent since Phase 8 shipped; triggered by tombstone-filtering
producing multiple batches. Corrupted candidates.parquet on first test run
(restored from UI fixture copy — good argument for test data in repo).

Fix:
- Single ArrowWriter per compaction, writes every batch into one
  properly-footered Parquet
- Snappy compression to match ingest defaults (otherwise rewrite
  inflated file 3× — 10.5MB → 34MB — because no compression was set)
- Verify-before-swap: parse written buf back to confirm row count
  matches expected; refuses to overwrite base_key if verification fails
- Write to {base_key}.compact-{ts}.tmp first, then to base_key; delete
  temp; only then delete delta files. Any error along the way leaves
  the original base intact.
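
A sketch of the single-writer + verify step (assumes the arrow/parquet/
bytes/anyhow crates; the temp-then-swap handling stays in the caller):

  use arrow::record_batch::RecordBatch;
  use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
  use parquet::arrow::ArrowWriter;
  use parquet::basic::Compression;
  use parquet::file::properties::WriterProperties;

  fn write_and_verify(batches: &[RecordBatch]) -> anyhow::Result<Vec<u8>> {
      anyhow::ensure!(!batches.is_empty(), "nothing to compact");
      let expected: usize = batches.iter().map(|b| b.num_rows()).sum();

      let props = WriterProperties::builder()
          .set_compression(Compression::SNAPPY)   // match ingest defaults
          .build();
      let mut buf = Vec::new();
      let mut writer = ArrowWriter::try_new(&mut buf, batches[0].schema(), Some(props))?;
      for batch in batches {
          writer.write(batch)?;                   // every batch, ONE footer
      }
      writer.close()?;

      // Parse what was just written; refuse to let a short/corrupt file
      // replace base_key.
      let parsed = ParquetRecordBatchReaderBuilder::try_new(bytes::Bytes::from(buf.clone()))?;
      let actual = parsed.metadata().file_metadata().num_rows() as usize;
      anyhow::ensure!(actual == expected, "verify failed: {actual} rows != {expected}");
      Ok(buf)
  }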

TombstoneStore::clear(dataset) drops all tombstone batch files and
evicts the per-dataset AppendLog from cache. Called after successful
compact.

QueryEngine::catalog() accessor exposes the Registry so queryd
handlers can reach the tombstone store without routing through gateway
state.

E2E on candidates (100K rows, 15 cols):
- Baseline: 10.59 MB, 100000 rows
- Tombstone CAND-000001/2/3 (soft-delete): 99997 visible, 100000 raw
- Compact: tombstones_applied=3, rows_dropped=3, final_rows=99997
- Post: 10.72 MB (Snappy), valid parquet (1 row_group), 99997 rows
- Restart: persists, tombstones list empty, __raw__candidates also
  99997 (the 3 IDs are physically gone from disk)

PRD invariant close: deletion is now actually deletion, not just
masking. GDPR erasure request → tombstone + schedule compact → data
gone.

Deferred:
- Compact-all-datasets cron (currently manual per-dataset via
  POST /query/compact)
- Compaction of tombstone batch files themselves (they grow at
  flush_threshold=1 per tombstone; TombstoneStore::compact exists
  but not auto-called)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:38:30 -05:00
root
a293502265 Phase 17: Model profiles + scoped search — the LLM-brain keystone
Implements PRD invariant 9 ("every reader gets its own profile") and
completes the multi-model substrate vision. Local models (or agents)
bind to a named set of datasets; activation pre-loads their vector
indexes into memory; search enforces scope.

Schema (shared::types):
- ModelProfile { id, ollama_name, description, bound_datasets,
                 hnsw_config, embed_model, created_at, created_by }
- ProfileHnswConfig mirrors vectord::trial::HnswConfig to avoid a
  cross-crate dep cycle. Default (ec=80, es=30) matches the Phase 15
  trial winner.
- bound_datasets can reference raw dataset names OR AiView names
  (both register as DataFusion tables with the same name, so mixing
  raw tables and PII-redacted views composes naturally)

Catalog (catalogd::registry):
- put_profile validates id is a slug (alphanumeric + -_ only) and
  every binding resolves to an existing dataset or view
- Persistence at _catalog/profiles/{id}.json, loaded on rebuild
- get_profile / list_profiles / delete_profile

HTTP endpoints:
- POST /catalog/profiles  (create/update)
- GET  /catalog/profiles  (list)
- GET/DELETE /catalog/profiles/{id}
- POST /vectors/profile/{id}/activate  (HNSW hot-load)
- POST /vectors/profile/{id}/search    (scope-enforced)

Activation (vectord::service::activate_profile):
- For each bound dataset, find vector indexes with matching source
- Pre-load embeddings into EmbeddingCache
- Build HNSW with profile's config
- Report warmed indexes + per-binding failures + duration
- Failures on individual bindings don't abort — "substrate keeps
  working" per ADR-017

Scoped search (vectord::service::profile_scoped_search):
- Look up profile, verify index.source ∈ profile.bound_datasets
- Returns 403 with allowed bindings list if out-of-scope
- Uses HNSW if index is warm, brute-force cosine otherwise (graceful
  degradation — no "must activate first" friction)

Bug fix surfaced during testing: vectord::refresh::try_update_index_meta
was a no-op for first-time indexes, so threat_intel_v1 and
kb_team_runs_v1 (both built via refresh after Phase C shipped) didn't
show up in the index registry. Now it auto-infers the source from the
index name convention (`{source}_vN`) and registers new metadata with
reasonable defaults.

End-to-end verified:
- Created security-analyst profile bound to [threat_intel]
- POST /vectors/profile/security-analyst/activate → warmed
  threat_intel_v1 (54 vectors) in 156ms, HNSW built
- Within-scope search: method=hnsw, returned relevant IP indicators
- Out-of-scope: tried to search resumes_100k_v2 (source=candidates)
  → 403 "profile 'security-analyst' is not bound to 'candidates' —
    allowed bindings: [\"threat_intel\"]"
- staffing-recruiter profile created bound to candidates + placements;
  search without activation fell through to brute_force (graceful)

Deferred (Phase 17 followups):
- VRAM-aware activation (unload-then-load via Ollama keep_alive=0)
  — Ollama already handles this; we don't need to reinvent
- Model-identity in audit trail — Phase 13 has role-based audit;
  adding model_id is ~20 LOC when we want it
- Profile bucket pre-load (profile:user bucket mount) — Phase 17.5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:09:43 -05:00
root
d87f2ccac6 Phase E: Soft deletes (tombstones) for compliance-grade row deletion
Implements GDPR/CCPA-compatible row-level deletion without rewriting
the underlying Parquet. Tombstone markers live beside each dataset and
are applied at query time via a DataFusion view that excludes the
deleted row_key_values.

Schema (shared::types):
- Tombstone { dataset, row_key_column, row_key_value, deleted_at,
              actor, reason }
- All tombstones for a dataset must share one row_key_column —
  enforced at write so the query-time filter remains a single
  WHERE NOT IN (...) clause

Storage (catalogd::tombstones):
- Per-dataset AppendLog at _catalog/tombstones/{dataset}/
- flush_threshold=1 + explicit flush after every append — tombstones
  are high-value, low-frequency; durability on return is the contract
- Reuses storaged::append_log infra so compaction is already wired
  (POST .../tombstones/compact will work once we expose it)

Catalog (catalogd::registry):
- add_tombstone validates dataset exists + key column compatibility
- list_tombstones for the GET endpoint
- TombstoneStore exposed via Registry::tombstones() for queryd

HTTP (catalogd::service):
- POST /catalog/datasets/by-name/{name}/tombstone
    { row_key_column, row_key_values[], actor, reason }
  Returns rows_tombstoned count + per-value failure list (207 on
  partial success).
- GET same path lists active tombstones with full audit info.

Query layer (queryd::context):
- Snapshot tombstones-by-dataset before registering tables
- Tombstoned tables: raw goes to "__raw__{name}", public "{name}"
  becomes DataFusion view with
  SELECT * FROM "__raw__{name}" WHERE CAST(col AS VARCHAR) NOT IN (...)
- CAST AS VARCHAR handles both string and integer key columns
- Untombstoned tables register as before — zero overhead

End-to-end on candidates (100K rows):
- Pick CAND-000001/2/3 (Linda/Charles/Kimberly)
- POST tombstone -> rows_tombstoned: 3
- COUNT(*) drops 100000 -> 99997
- WHERE candidate_id IN (those 3) -> 0 rows
- candidates_safe view transitively excludes them
  (Linda+Denver: __raw__candidates=159, candidates_safe=158)
- Restart: COUNT still 99997, 3 tombstones reload from disk

Reversibility: tombstones are reversible deletes, not destruction.
Power users can still query "__raw__{name}" to see deleted rows.
Phase 13 access control is what stops a non-admin from accessing
__raw__* tables.

Limits / follow-up:
- Physical compaction not yet integrated — Phase 8's compact_files
  doesn't read tombstones during merge. Tombstoned rows are still
  on disk until that integration ships.
- Phase 9 journald event emission for tombstones not wired —
  tombstone records carry their own actor+reason+timestamp so the
  audit trail is intact, but cross-referencing with the mutation
  event log would help compliance reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 09:40:48 -05:00
root
09fd446c8d Phase D: AI-safe views — capability-surface projections over base data
Implements the llms3.com "AI-safe views" pattern: a named projection
that exposes only whitelisted columns, with optional row filter and
per-column redactions. AI agents (or Phase 13 roles) bind to the view;
they can never accidentally see PII even if they write raw SQL.

Schema (shared::types):
- AiView { name, base_dataset, columns: Vec<String>, row_filter,
           column_redactions: HashMap<String, Redaction>, ... }
- Redaction enum: Null | Hash | Mask { keep_prefix, keep_suffix }

Catalog (catalogd::registry):
- put_view validates base dataset exists + columns non-empty
- Persists JSON at _catalog/views/{name}.json (sanitized name)
- rebuild() loads views alongside dataset manifests on startup

Query layer (queryd::context):
- build_context registers every AiView as a DataFusion view object
- Constructed SELECT applies whitelist projection, WHERE filter, and
  redaction expressions per column (see the sketch after this list)
  - Mask: substr(prefix) + repeat('*', mid_len) + substr(suffix)
  - Hash: digest(value, 'sha256')
  - Null: CAST(NULL AS VARCHAR) AS col
- DataFusion handles JOINs/aggregates over the view natively — it's a
  real view, not a query rewrite
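
A sketch of what one redacted column turns into (illustrative; the real
builder and its quoting live in queryd::context):

  enum Redaction {
      Null,
      Hash,
      Mask { keep_prefix: usize, keep_suffix: usize },
  }

  fn redaction_expr(col: &str, r: &Redaction) -> String {
      let c = format!("CAST(\"{col}\" AS VARCHAR)");
      match r {
          Redaction::Null => format!("CAST(NULL AS VARCHAR) AS \"{col}\""),
          Redaction::Hash => format!("digest({c}, 'sha256') AS \"{col}\""),
          Redaction::Mask { keep_prefix: p, keep_suffix: s } => format!(
              "substr({c}, 1, {p}) || repeat('*', length({c}) - {p} - {s}) || \
               substr({c}, length({c}) - {s} + 1) AS \"{col}\""
          ),
      }
  }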

HTTP (catalogd::service):
- POST /catalog/views (create)
- GET  /catalog/views (list)
- GET  /catalog/views/{name} (full def)
- DELETE /catalog/views/{name}

End-to-end test on candidates (100K rows, 15 columns):

  candidates_safe view:
    columns: candidate_id, first_name, city, state, vertical,
             skills, years_experience, status
    row_filter: status != 'blocked'
    redaction: candidate_id mask(prefix=3, suffix=2)

  SELECT * FROM candidates_safe LIMIT 5
    -> 8 columns only, candidate_id shown as "CAN******01"
       (PII fields email/phone/last_name absent from result)

  SELECT email FROM candidates_safe
    -> fails (column not in projection)

  SELECT email FROM candidates
    -> succeeds (raw table still accessible by name —
       Phase 13 access control is the gate, not the view itself)

Survives restart — view definitions reload from object storage.

Limits / not in MVP:
- View CANNOT shadow base table by name (DataFusion treats them as
  separate identifiers; access control must restrict raw-table access)
- row_filter is treated as trusted SQL — operators must validate
  before persisting; only authenticated admin path should call put_view
- Redaction expressions assume column is castable to VARCHAR; numeric
  redactions could be misleading (a Hash on Int64 returns a hex string
  that won't equi-join with another hash on the same value type)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 09:16:44 -05:00
root
24f1249a62 Federation layer 2: header routing + cross-bucket SQL
Three pieces of the multi-bucket federation made real:

1. Catalog migration (POST /catalog/migrate-buckets)
   - One-shot normalizer for ObjectRef.bucket field
   - Empty -> "primary"; legacy "data"/"local" -> "primary"
   - Idempotent; re-running on canonical state is no-op
   - Ran on existing catalog: 12 refs renamed from "data", 2 already
     "primary", all 14 now canonical

2. X-Lakehouse-Bucket header middleware on ingest
   - resolve_bucket() helper extracts header, returns
     (bucket_name, store) or 404 with valid bucket list
   - ingest_file and ingest_db_stream now route writes per-request
   - Defaults to "primary" when header absent
   - pipeline::ingest_file_to_bucket records the actual bucket on the
     ObjectRef so catalog stays the source of truth for "where does this
     data live"
   - Verified: ingest with X-Lakehouse-Bucket: testing lands in
     data/_testing/, ingest without header lands in data/, bad header
     returns 404 with hint

3. queryd registers every bucket with DataFusion
   - QueryEngine now holds Arc<BucketRegistry> instead of single store
   - build_context iterates all buckets, registers each as a separate
     ObjectStore under URL scheme "lakehouse-{bucket}://"
   - ListingTable URLs include the per-object bucket scheme so
     DataFusion routes scans automatically based on ObjectRef.bucket
   - Profile bucket names like "profile:user" sanitized to
     "lakehouse-profile-user" since URL host segments can't contain ":"
   - Tolerant of duplicate manifest entries (pre-existing
     pipeline::ingest_file behavior creates a fresh dataset id per
     ingest); duplicates skipped with debug log
   - Backward compat: legacy "lakehouse://data/" URL still registered
     pointing at primary

Success gate: cross-bucket CROSS JOIN
  SELECT p.name, p.role, a.species
  FROM people_test p          (bucket: testing)
  CROSS JOIN animals a        (bucket: primary)
  LIMIT 5
returns rows correctly. DataFusion routed each scan to its bucket's
ObjectStore based on the URL scheme.

No regressions: SELECT COUNT(*) FROM candidates still returns 100000
from the primary bucket.

Deferred to Phase 17:
- POST /profile/{user}/activate (HNSW hot-load on profile switch)
- vectord storage paths becoming bucket-scoped (trial journals,
  eval sets per-profile)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 08:52:32 -05:00
root
97a376482c Phase C: Decoupled embedding refresh
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive,
ingest marks the vector index stale, a later refresh embeds only the
delta (doc_ids not already in the index).

Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled

Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.

Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set (see the
  sketch after this list)
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
  writes combined index, clears stale flag

Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
  text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set

End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
  last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
  re-embed); stale_cleared = true

Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
  per-row content hashing
- OnAppend policy auto-trigger - just declares intent; actual scheduler
  deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
  operators can see which datasets expect what, but the cron itself is
  separate

Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 03:00:43 -05:00
root
dbe00d018f Federation foundation + HNSW trial system + Postgres streaming + PRD reframe
Four shipped features and a PRD realignment, all measured end-to-end:

HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
  TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
  /hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us
  at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s,
  80/30 = 1.00 recall in 230s

Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
  and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows

Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
  /etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
  rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
  anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
  /storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
  X-Lakehouse-Rescue-Used observability headers on fallback

Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
  into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
  (586 rows × 13 cols, 6 batches, 196ms end-to-end)

PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
  trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
  bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
  decision rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 01:50:05 -05:00
root
9e53caaec3 Phase 10: Rich catalog v2 — metadata as product
- DatasetManifest expanded: description, owner, sensitivity, columns,
  lineage, freshness contract, tags, row_count
- All new fields use #[serde(default)] for backward compatibility
- PII auto-detection: scans column names for email, phone, SSN, salary,
  address, DOB, medical terms — flags as PII/PHI/Financial
- Column-level metadata: name, type, sensitivity, is_pii flag
- Lineage tracking: source_system, source_file, ingest_job, timestamp
- Ingest pipeline auto-populates: PII scan, column meta, lineage, row count
- PATCH /catalog/datasets/by-name/{name}/metadata — update metadata
- Catalog responses now include all rich fields
- 25 unit tests passing (5 new PII detection tests)

Per ADR-013: datasets without metadata become mystery files.
This makes every ingested file self-describing from day one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:15:09 -05:00
root
01373c0e45 Phase 5: hardening — gRPC, observability, auth, config
- proto: lakehouse.proto with CatalogService, QueryService, StorageService, AiService
- proto crate: tonic-build codegen from proto definitions
- catalogd: gRPC CatalogService implementation
- gateway: dual HTTP (:3100) + gRPC (:3101) servers
- gateway: OpenTelemetry tracing with stdout exporter
- gateway: API key auth middleware (toggleable)
- shared: TOML config system with typed structs and defaults
- lakehouse.toml config file
- ADR-006 and ADR-007 documented

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:37:07 -05:00
root
655b6c0b37 Phase 1: storage + catalog layer
- storaged: object_store backend (LocalFileSystem), PUT/GET/DELETE/LIST endpoints
- shared: arrow_helpers with Parquet roundtrip + schema fingerprinting (2 tests)
- catalogd: in-memory registry with write-ahead manifest persistence to object storage
- catalogd: POST/GET /datasets, GET /datasets/by-name/{name}
- gateway: wires storaged + catalogd with shared object_store state
- Phase tracker updated: Phase 0 + Phase 1 gates passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:15:27 -05:00
root
a52ca841c6 Phase 0: bootstrap Rust workspace
- Cargo workspace with 6 crates: shared, storaged, catalogd, queryd, aibridge, gateway
- shared: types (DatasetId, ObjectRef, SchemaFingerprint, DatasetManifest) + error enum
- gateway: Axum HTTP entrypoint with nested service routers + tracing
- All services expose /health stubs
- justfile with build/test/run recipes
- PRD, phase tracker, and ADR docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 04:59:05 -05:00