lakehouse/docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md
root ed1fcd3c26 specs: pathway_memory v1 + subject_manifests_on_catalogd v1
Two specifications addressing the framing J asked for after reading
the llms3.com blog: standardize what we have so future work doesn't
drift, and apply the local-first thesis to the audit problem instead
of the over-scoped SaaS-tier identity service.

PATHWAY_MEMORY_SPEC.md (~400 lines):
  Documents the existing crates/vectord/src/pathway_memory.rs as a
  spec — the third metadata layer alongside catalogd's data metadata
  and playbook_memory's operational memory. Defines:
    - PathwayTrace wire format
    - pathway_id = SHA256(task_class | file_prefix | signal_class)
    - file_prefix algorithm (first 2 path segments)
    - pathway_vec: 32-bucket bag-of-tokens hash, fixed dim per spec
    - Lifecycle: insert → revise → replay → probation gate retire
    - Mem0 versioning (trace_uid + parent_trace_uid + version chain)
    - Access patterns: query_for_hotswap / query_by_vec / list_versions
    - PII risk surface (reducer_summary + final_verdict)
    - Spec boundary: stable in v1 vs implementation-specific
  No new architecture. Descriptive, not prescriptive.

SUBJECT_MANIFESTS_ON_CATALOGD.md (~400 lines):
  The local-first audit-trail spec. Adds a fourth manifest type to
  catalogd alongside dataset/view/tombstone/profile. NOT a separate
  identity daemon. NOT Vault/KMS/dual-control JWT. Builds on
  primitives catalogd already ships:
    - SubjectManifest at data/_catalog/subjects/<id>.json
    - Per-subject HMAC-chained audit JSONL
    - Daily retention sweep using existing tombstone primitives
    - Vertical-aware routing (healthcare → local-only)
    - Legal-tier credential separate from gateway internal auth
  ~4 days estimated implementation effort vs 17-20 days for the
  IDENTITY_SERVICE_DESIGN approach. Same defensibility for the
  staffing-client launch window. Strictly additive and compatible
  with the v3 design if SOC2 Type II becomes a contract requirement.

These are SPECS — what the system already does (pathway) and what's
the smallest local-first thing that addresses the audit need
(subject manifests). Not 9-phase plans. Not new daemons.

The pathway spec is descriptive: writing down what exists so the
next person doesn't reinvent it. The subject-manifests spec is
prescriptive: J greenlights, implementation is days not weeks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:07:38 -05:00


Subject Manifests on Catalogd — Specification v1

Status: Draft v1 — 2026-05-03 · Layer: Catalogd extension (NOT a separate daemon) · Implementation: to be added to crates/catalogd/src/

What this is. A small extension to catalogd adding a fourth manifest type — subject — alongside the existing dataset / view / tombstone / profile types. A subject manifest answers: "for person X, which datasets contain their PII, which views project it safely, what consent + retention applies, and what's the access log."

What it is NOT. It is not a separate identity daemon. It is not a Postgres-backed identity service. It is not a HashiCorp-Vault-using KEK rotation system. It is not the IDENTITY_SERVICE_DESIGN.md v3 design (that doc is over-scoped for local-only — see its deprecation header). It is the smallest spec that gives you a defensible "show me everything we know about person X" capability for EEOC discovery / BIPA compliance, building on primitives catalogd ALREADY ships.

Why it can be small. Catalogd already has dataset manifests with per-column is_pii flags. It already has views with column_redactions (working example: candidates_safe.json). It already has tombstones for deletes. It already has profiles for per-agent scoping. The audit-trail need is "thread these primitives together by subject identifier" — not "build a new system." That's a ~300-500 LOC extension, not a 17-20 day phase plan.


1. Conceptual model

A subject is a real person whose PII flows through the system, identified by a stable token (current implementation: candidate_id in workers_500k.parquet).

A subject manifest is a JSON record under data/_catalog/subjects/<candidate_id>.json that points at:

  • which datasets contain rows for this subject (via foreign-key reference)
  • which views safely project this subject's data (via existing view manifest names)
  • what consent + retention metadata applies to this subject
  • what access log file holds this subject's audit trail

Subject manifests are written when a subject enters the system and updated when the subject's consent changes, their vertical is reclassified, or their retention period expires.

The audit log is a per-subject append-only JSONL file at data/_catalog/subjects/<candidate_id>.audit.jsonl. Every PII access for that subject writes one row. Each row is HMAC-chained to the previous one (see §3.1) for tamper-evidence.


2. Wire format — SubjectManifest

JSON document at data/_catalog/subjects/<candidate_id>.json:

{
  "schema": "subject_manifest.v1",
  "candidate_id": "CAND-000001",
  "created_at": "2026-05-15T12:00:00Z",
  "updated_at": "2026-05-15T12:00:00Z",
  "status": "active",
  "vertical": "general",
  "consent": {
    "general_pii": {
      "status": "given",
      "version": "v1-2026-05-15",
      "given_at": "2026-05-15T12:00:00Z"
    },
    "biometric": {
      "status": "never_collected",
      "retention_until": null
    }
  },
  "retention": {
    "general_pii_until": "2030-05-15T12:00:00Z",
    "policy": "4_year_default"
  },
  "datasets": [
    { "name": "workers_500k", "key_column": "candidate_id", "key_value": "CAND-000001" },
    { "name": "candidates",   "key_column": "candidate_id", "key_value": "CAND-000001" },
    { "name": "placements",   "key_column": "candidate_id", "key_value": "CAND-000001" },
    { "name": "timesheets",   "key_column": "candidate_id", "key_value": "CAND-000001" }
  ],
  "safe_views": ["workers_safe", "candidates_safe"],
  "audit_log_path": "data/_catalog/subjects/CAND-000001.audit.jsonl",
  "audit_log_chain_root": "sha256:..."
}

2.1 — Field semantics

| Field | Required | Notes |
|---|---|---|
| schema | yes | Always "subject_manifest.v1". Validates parser shape. |
| candidate_id | yes | Subject identifier. Stable token. Same value as appears in dataset key columns. |
| status | yes | One of: pending_consent, active, withdrawn, retention_expired, erased. |
| vertical | yes | One of: unknown, general, healthcare, finance, other. Default unknown; fail-closed routing treats unknown as healthcare-equivalent. |
| consent.general_pii.status | yes | One of: pending_backfill_review, pending_first_contact, given, withdrawn, expired. |
| consent.biometric.status | yes | One of: never_collected, pending, given, withdrawn, expired. |
| retention.general_pii_until | yes | ISO-8601. Drives daily expiration sweep. |
| datasets[].name | yes | References an existing catalogd dataset manifest by name. |
| datasets[].key_column | yes | The column in that dataset that contains the subject's identifier. |
| datasets[].key_value | yes | The specific value (the subject's id within that dataset's namespace). |
| safe_views | yes | Names of existing catalogd view manifests that safely project this subject's data (for non-legal-tier readers). |
| audit_log_path | yes | Relative path to the audit JSONL. |
| audit_log_chain_root | yes | SHA-256 of the most recent HMAC-chained checkpoint of the audit log. Updated by the audit-log writer on every write. |
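The field rules above can be sketched as a loader-side check. This is a minimal illustration in Python (chosen only so it runs with the standard library; the spec places the real implementation in crates/catalogd/src/). The helper name `validate_subject_manifest` and the list-of-strings error style are hypothetical; the allowed values are copied from §2.1.

```python
from datetime import datetime

# Allowed values copied from the §2.1 field table.
STATUS = {"pending_consent", "active", "withdrawn", "retention_expired", "erased"}
VERTICAL = {"unknown", "general", "healthcare", "finance", "other"}
GENERAL_PII = {"pending_backfill_review", "pending_first_contact",
               "given", "withdrawn", "expired"}
BIOMETRIC = {"never_collected", "pending", "given", "withdrawn", "expired"}


def validate_subject_manifest(m: dict) -> list:
    """Return a list of problems; an empty list means the manifest passed.

    Hypothetical helper for illustration, not part of catalogd.
    """
    errors = []
    if m.get("schema") != "subject_manifest.v1":
        errors.append("schema must be subject_manifest.v1")
    if not m.get("candidate_id"):
        errors.append("candidate_id is required")
    if m.get("status") not in STATUS:
        errors.append("status is not a known value")
    if m.get("vertical") not in VERTICAL:
        errors.append("vertical is not a known value")
    consent = m.get("consent", {})
    if consent.get("general_pii", {}).get("status") not in GENERAL_PII:
        errors.append("consent.general_pii.status is not a known value")
    if consent.get("biometric", {}).get("status") not in BIOMETRIC:
        errors.append("consent.biometric.status is not a known value")
    until = m.get("retention", {}).get("general_pii_until")
    try:
        # Accept the trailing "Z" used in the examples.
        datetime.fromisoformat(str(until).replace("Z", "+00:00"))
    except ValueError:
        errors.append("retention.general_pii_until must be ISO-8601")
    for i, d in enumerate(m.get("datasets", [])):
        for field in ("name", "key_column", "key_value"):
            if not d.get(field):
                errors.append(f"datasets[{i}].{field} is required")
    for field in ("safe_views", "audit_log_path", "audit_log_chain_root"):
        if field not in m:
            errors.append(f"{field} is required")
    return errors
```

Returning every problem at once, rather than raising on the first, is a design choice of this sketch: it suits a backfill run where an operator wants to see all defects in a manifest in one pass.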

3. Audit log format

Per-subject append-only JSONL at data/_catalog/subjects/<candidate_id>.audit.jsonl. One row per PII access:

{
  "schema": "subject_audit.v1",
  "ts": "2026-05-15T13:30:00Z",
  "candidate_id": "CAND-000001",
  "accessor": {
    "kind": "gateway_lookup",
    "daemon": "gateway",
    "purpose": "fill_validation",
    "trace_id": "X-Lakehouse-Trace-Id-..."
  },
  "fields_accessed": ["name"],
  "result": "success",
  "prev_chain_hash": "sha256:...",
  "row_hmac": "hmac-sha256:..."
}

3.1 — HMAC chain

Each row's row_hmac is HMAC-SHA256(key, prev_chain_hash || canonical_json_of_row_minus_hmac). The signing key is loaded once at startup from /etc/lakehouse/subject_audit.key (mode 0400). The chain root in the subject manifest references the latest row's row_hmac.

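Verifying the chain takes one pass:

import hashlib, hmac, json

def canonicalize(row):
    # Assumed canonical form: sorted keys, compact separators, row_hmac excluded.
    body = {k: v for k, v in row.items() if k != "row_hmac"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()

def verify_chain(key, manifest, rows):
    # key: bytes from the sealed key file; rows: parsed audit JSONL, oldest first.
    prev = "GENESIS"
    for row in rows:
        mac = hmac.new(key, prev.encode() + canonicalize(row), hashlib.sha256)
        assert row["prev_chain_hash"] == prev
        # Assumed encoding: "hmac-sha256:" prefix followed by the hex digest.
        assert hmac.compare_digest(row["row_hmac"], "hmac-sha256:" + mac.hexdigest())
        prev = row["row_hmac"]
    assert manifest["audit_log_chain_root"] == prev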

This is local (no S3 Object Lock, no Vault) but tamper-evident: modifying a past row breaks the chain at that row and at every row after it. The signing key being on disk is a real risk surface. Operators MUST set the file mode to 0400 (owner read-only) and back the key up to a location separate from the audit logs themselves, so that a single backup doesn't carry both the logs and the key needed to forge them.
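The write side of the chain can be sketched the same way. The following is a minimal Python illustration, not the catalogd writer: `seal_row` and `append_row` are hypothetical names, the chain is held in memory rather than appended to the JSONL file, and both the canonical JSON form (sorted keys, compact separators) and the "hmac-sha256:" hex encoding are assumptions that the spec does not pin down.

```python
import hashlib
import hmac
import json

GENESIS = "GENESIS"  # value of prev for the first row, as in the verification pass


def canonical_json(row: dict) -> bytes:
    # Assumption: sorted keys + compact separators, with row_hmac excluded.
    body = {k: v for k, v in row.items() if k != "row_hmac"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal_row(key: bytes, prev_chain_hash: str, row: dict) -> dict:
    """Return a copy of row with prev_chain_hash and row_hmac filled in."""
    sealed = dict(row, prev_chain_hash=prev_chain_hash)
    mac = hmac.new(
        key,
        prev_chain_hash.encode("utf-8") + canonical_json(sealed),
        hashlib.sha256,
    )
    sealed["row_hmac"] = "hmac-sha256:" + mac.hexdigest()
    return sealed


def append_row(key: bytes, chain: list, row: dict) -> str:
    """Append a sealed row to an in-memory chain and return the new chain root."""
    prev = chain[-1]["row_hmac"] if chain else GENESIS
    chain.append(seal_row(key, prev, row))
    return chain[-1]["row_hmac"]
```

The returned value is what a writer would store as audit_log_chain_root in the subject manifest. Because each row's HMAC covers the previous row's HMAC, changing any field in an earlier row changes every later row_hmac that a verifier recomputes.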

3.2 — When the audit log is written

Every code path that resolves PII for a subject MUST write an audit row before returning. Specifically:

  • The gateway's tool registry SQL templates (crates/gateway/src/tools/registry.rs) — when search_candidates / get_candidate queries return rows, write one audit row per returned candidate_id
  • The validator's WorkerLookup (crates/validator/src/staffing/parquet_lookup.rs) — when a lookup(candidate_id) succeeds, write one audit row
  • The audit-response endpoint (when implemented) — when /audit/subject/{id} is called, write one row of kind=audit_response
  • Any new code path that touches PII

Write failures MUST NOT be silently swallowed; they MUST be logged at error level (per the existing observability fabric). Write failures also MUST NOT block the read: accept the audit gap and flag it for post-hoc review (better to miss an audit row than to block legitimate operations).
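The failure policy above can be sketched as a small wrapper. This is an illustrative Python sketch only: `read_with_audit`, `read_fn`, and `audit_write_fn` are hypothetical stand-ins for a PII lookup and the audit-log writer, and the standard logging module stands in for the existing observability fabric.

```python
import logging

log = logging.getLogger("subject_audit")


def read_with_audit(read_fn, audit_write_fn, candidate_id: str):
    """Run a PII read, then attempt the audit write before returning.

    read_fn and audit_write_fn are hypothetical callables; a failed
    audit write is logged at error level and does not block the read.
    """
    result = read_fn(candidate_id)
    try:
        audit_write_fn(candidate_id)
    except Exception:
        # Not swallowed: logged with traceback so the gap can be reviewed later.
        log.exception("audit write failed for %s; audit gap flagged", candidate_id)
    return result
```

Errors raised by the read itself propagate unchanged; only the audit write is shielded, which matches the rule that a write failure is a flagged gap rather than a failed request.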


4. The /audit/subject/{candidate_id} response

The audit response is built from the subject manifest, the audit log, and dataset projections:

{
  "schema": "subject_audit_response.v1",
  "candidate_id": "CAND-000001",
  "generated_at": "2026-05-15T15:00:00Z",
  "generated_by": "catalogd@hostname",
  "manifest": { /* the SubjectManifest */ },
  "datasets": {
    "workers_500k": {
      "row_present": true,
      "safe_view_projection": { /* candidates_safe row for this subject */ }
    }
  },
  "audit_log_window": {
    "from": "2026-01-01T00:00:00Z",
    "to": "2026-05-15T15:00:00Z",
    "rows": [ /* matching audit rows */ ]
  },
  "chain_verification": {
    "verified": true,
    "rows_checked": 42,
    "chain_root": "sha256:..."
  },
  "completeness_attestation": "all dataset rows + audit log entries within the window per retention policy v1 are included",
  "signature": "ed25519:..."
}

The endpoint is auth-gated via a separate legal-tier credential (see §6). The response body is signed with an Ed25519 key separate from the HMAC chain key.
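One piece of the response that is easy to get subtly wrong is the audit_log_window selection. A minimal Python sketch follows, assuming the window bounds are inclusive (the spec does not say) and using a hypothetical helper name. Signing is left out because Ed25519 is not in the Python standard library.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # Accept the trailing "Z" used in the spec's examples.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def audit_log_window(rows: list, start: str, end: str) -> dict:
    """Select audit rows whose ts falls within [start, end] (inclusive, by assumption)."""
    lo, hi = parse_ts(start), parse_ts(end)
    matching = [r for r in rows if lo <= parse_ts(r["ts"]) <= hi]
    return {"from": start, "to": end, "rows": matching}
```

Timestamps are parsed before comparison rather than compared as strings, so rows written with a different but equivalent UTC offset still land in the correct window.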


5. Implementation plan (this is the SMALL plan)

This is the spec; the implementation is a small extension to catalogd. Estimated effort:

| Step | Effort | What |
|---|---|---|
| 1 | 0.5d | Add SubjectManifest struct + JSON load/save in crates/catalogd/src/subjects.rs. Mirror the existing views.rs pattern. |
| 2 | 0.5d | Add SubjectAuditWriter with HMAC chain in same file. Key loaded from sealed file at startup. |
| 3 | 0.5d | Backfill subject manifests from workers_500k.parquet rows. ETL: one manifest per row, default vertical=unknown, consent.general_pii.status=pending_backfill_review. |
| 4 | 0.5d | Wire the gateway tool registry to write audit rows. One audit row per candidate_id returned by search_candidates / get_candidate. |
| 5 | 0.5d | Wire the validator WorkerLookup to write audit rows. |
| 6 | 1d | /audit/subject/{id} HTTP endpoint in crates/catalogd/src/service.rs. Legal-tier auth. |
| 7 | 0.5d | Daily retention sweep: subjects whose retention.general_pii_until < now AND status != erased get marked for review (don't auto-delete; legal needs to approve). |
| 8 | 0.5d | Cross-runtime parity: Go side reads the same subject manifests + audit logs. Same shapes, same HMAC algorithm. |
| Total | ~4 days | Compared to 17-20 days for the IDENTITY_SERVICE_DESIGN approach. |

Each step is one commit, one revert path. No new daemons. No cloud infrastructure. No Vault. No S3 Object Lock. No dual-control JWT split-secret ceremony.
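The selection rule in step 7 can be sketched as a pure function over loaded manifests. This is an illustrative Python sketch with a hypothetical function name; it only selects subjects and deletes nothing, in line with the rule that legal must approve before any removal.

```python
from datetime import datetime, timezone


def due_for_review(manifests: list, now: datetime = None) -> list:
    """Return candidate_ids whose retention has lapsed and who are not erased.

    Nothing is deleted here; the sweep only marks subjects for legal review.
    """
    now = now or datetime.now(timezone.utc)
    due = []
    for m in manifests:
        until = datetime.fromisoformat(
            m["retention"]["general_pii_until"].replace("Z", "+00:00")
        )
        if until < now and m["status"] != "erased":
            due.append(m["candidate_id"])
    return due
```

Passing `now` explicitly keeps the sweep deterministic under test; a daily job would call it with the default.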


6. Auth model

Local-first, simple, defensible:

  • Service-tier reads (gateway tool registry resolving candidate names for fill scenarios): authenticated via the existing gateway internal credential. Audit row written.
  • Legal-tier reads (/audit/subject/{id}): requires a separate credential held in /etc/lakehouse/legal_audit.token (mode 0400, owner-only). Operators may load this only when fulfilling a legal request. The token is rotated per a documented runbook (operator + 1 witness; no cryptographic dual-control ceremony required for this scale).
  • Backups: subject manifests + audit logs are backed up daily. The HMAC signing key is backed up to a SEPARATE storage location (different physical/network boundary) so a single backup compromise doesn't enable forgery.

This is not as strong as Vault Transit + dual-control JWT + S3 Object Lock external anchoring. It IS strong enough to pass a normal small-business compliance review and to be defensible in IL/IN small-claims-discovery contexts. If the staffing client's contract requires SOC2 Type II or formal HSM, that's a separate phase — but it's strictly additive on top of this v1 spec.
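Two checks implied by the legal-tier rules can be sketched briefly: that the token file really is mode 0400, and that a presented token is compared in constant time. This is a minimal Python sketch for a POSIX host; both function names are hypothetical, and the ownership check (that the file belongs to the service user) is left out.

```python
import hmac
import os
import stat


def token_file_is_sealed(path: str) -> bool:
    """True if the file's permission bits are exactly 0400 (owner read-only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o400


def legal_token_matches(presented: str, expected: str) -> bool:
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

A service could refuse to start the legal-tier endpoint when `token_file_is_sealed` returns False, which turns the "mode 0400" requirement from a runbook item into an enforced precondition.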


7. What this spec gives you (load-bearing)

  1. Defensible response to discrimination discovery (worked example: John Martinez at Warehouse B). The endpoint produces a complete + signed + chain-verified record of every PII access affecting him.
  2. BIPA-compliant biometric tracking when real photos arrive. The consent.biometric field + retention timeline are first-class, not bolted on later.
  3. Per-subject right-to-be-forgotten via cryptographic erasure of the subject manifest's audit-key entry + tombstoning of the candidate's dataset rows. (The ability to verify "row is gone, trail is preserved as anonymous audit-event of the erasure" is what GDPR Art. 17 + CCPA expect.)
  4. HIPAA vertical routing via the vertical field. Healthcare-vertical subjects (and unknown defaults) route to local-only models per PRD line 70, so no PHI egresses to the cloud.
  5. Cross-runtime parity via the simple JSON+HMAC format that Go can read identically.
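The fail-closed routing in item 4 can be sketched as a single predicate. This is an illustrative Python sketch with a hypothetical function name. It assumes that general, finance, and other are not subject to the local-only restriction, since the spec names only healthcare and unknown as local-only; everything else, including a missing or misspelled value, stays local.

```python
# Verticals this sketch treats as free of the local-only restriction.
# Assumption: the spec only names healthcare and unknown as local-only.
NOT_RESTRICTED = {"general", "finance", "other"}


def requires_local_only(manifest: dict) -> bool:
    """Fail closed: anything not explicitly unrestricted stays local."""
    return manifest.get("vertical", "unknown") not in NOT_RESTRICTED
```

Writing the check as an allow-list of unrestricted verticals, rather than a deny-list of restricted ones, means a new or mistyped vertical defaults to the safe route.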

8. What this spec does NOT solve (and where it punts)

  • Per-row encryption of dataset PII: subject manifests + audit logs are local; the underlying workers_500k.parquet is not encrypted. If staffing-client contract requires at-rest encryption, that's a separate concern handled at the storage tier (filesystem encryption, S3 SSE).
  • Right to explanation (GDPR Art. 22 / EU AI Act): this spec captures decisions that touched a subject; it does not require the model to explain WHY each decision was made in human-readable form. That's a separate Phase capturing model reasoning.
  • Adverse-impact statistics: the comparator-pool snapshot per fill (per AUDIT_PHASE_1_DISCOVERY §10/C3) needs its own writer in the fill pipeline. This spec gives you the per-subject record; it doesn't cross-aggregate selection rates by protected class.
  • External tamper-evidence: the HMAC chain is local. A motivated insider with access to both the audit log AND the signing key could rewrite history. For the staffing-client scale this is acceptable; for higher-stakes deployments a separate timestamping service or external transparency log would be additive.

9. Why this is the right shape for J's deployment

  • It builds on what catalogd ALREADY ships (manifests, views, tombstones — the Iceberg-shape layer).
  • It runs locally — no cloud infrastructure to license, monitor, audit, or pay for.
  • Its primitives are JSON files an operator can read with cat and jq. Tamper-evidence works without trusting opaque crypto APIs.
  • Its implementation is days, not weeks. The timeline matches the staffing-client launch window without forcing them to wait on a SaaS-tier identity service that doesn't fit their data residency posture.
  • It is COMPATIBLE with the IDENTITY_SERVICE_DESIGN v3 path if the staffing client later requires SOC2 Type II — the v1 subject manifests can be migrated into a separate identity daemon when the scale demands it. But that's deferred until demand exists, not built speculatively.

This is the LOCAL-FIRST audit trail. It exists because the SaaS-tier version doesn't fit the deployment model J actually has.


10. Spec boundary

Stable in v1 (changing requires v2):

  • File layout: data/_catalog/subjects/<id>.json + data/_catalog/subjects/<id>.audit.jsonl
  • JSON schemas in §2 + §3 (additive fields OK; renames/removals require v2)
  • HMAC algorithm: HMAC-SHA256 with key from sealed file
  • Chain semantics: prev_chain_hash references previous row's row_hmac
  • Vertical default: unknown with fail-closed routing
  • Consent state machine in §2.1
  • Audit-row write requirements in §3.2

Implementation-specific (free to change):

  • Storage backend (file system v1; SQLite or Postgres acceptable as long as JSON shapes round-trip)
  • HTTP endpoint exact shape (body schema is spec; status codes / headers are implementation)
  • Backfill ETL details

Reserved for v2:

  • Per-row encryption (when staffing-client contract requires it)
  • External tamper-evidence anchor (when SOC2 Type II in scope)
  • Cross-tenant subject isolation (when multi-tenant in scope)

Change log

  • 2026-05-03 — v1 initial draft. Builds on catalogd primitives that already exist. Smaller than IDENTITY_SERVICE_DESIGN v3 by ~4× because it doesn't propose a new daemon.