# Subject Manifests on Catalogd — Specification v1

**Status:** Draft v1 — 2026-05-03 · **Layer:** Catalogd extension (NOT a separate daemon) · **Implementation:** to be added to `crates/catalogd/src/`

> **What this is.** A small extension to `catalogd` that adds a fifth manifest type, `subject`, alongside the existing dataset / view / tombstone / profile types. A subject manifest answers: "for person X, which datasets contain their PII, which views project it safely, what consent + retention applies, and what's the access log."
>
> **What it is NOT.** It is not a separate identity daemon. It is not a Postgres-backed identity service. It is not a HashiCorp-Vault-using KEK rotation system. It is not the `IDENTITY_SERVICE_DESIGN.md` v3 design (that doc is over-scoped for local-only — see its deprecation header). It is the smallest spec that gives you a defensible "show me everything we know about person X" capability for EEOC discovery / BIPA compliance, building on primitives catalogd ALREADY ships.
>
> **Why it can be small.** Catalogd already has dataset manifests with per-column `is_pii` flags. It already has views with `column_redactions` (working example: `candidates_safe.json`). It already has tombstones for deletes. It already has profiles for per-agent scoping. The audit-trail need is "thread these primitives together by subject identifier" — not "build a new system." That's a ~300-500 LOC extension, not a 17-20 day phase plan.

---

## 1. Conceptual model

A **subject** is a real person whose PII flows through the system. Each subject is identified by a stable token (current implementation: `candidate_id` in `workers_500k.parquet`).

A **subject manifest** is a JSON record under `data/_catalog/subjects/<candidate_id>.json` that points at:
- which datasets contain rows for this subject (via foreign-key reference)
- which views safely project this subject's data (via existing view manifest names)
- what consent + retention metadata applies to this subject
- what access log file holds this subject's audit trail

Subject manifests are written when a subject enters the system and updated when the subject's consent changes, their vertical is reclassified, or their retention period expires.

The **audit log** is a per-subject append-only JSONL file at `data/_catalog/subjects/<candidate_id>.audit.jsonl`. Every PII access for that subject writes one row. Each row is HMAC-chained to the previous one for tamper-evidence (§3.1).

---

## 2. Wire format — `SubjectManifest`

JSON document at `data/_catalog/subjects/<candidate_id>.json`:

```json
{
  "schema": "subject_manifest.v1",
  "candidate_id": "CAND-000001",
  "created_at": "2026-05-15T12:00:00Z",
  "updated_at": "2026-05-15T12:00:00Z",
  "status": "active",
  "vertical": "general",
  "consent": {
    "general_pii": {
      "status": "given",
      "version": "v1-2026-05-15",
      "given_at": "2026-05-15T12:00:00Z"
    },
    "biometric": {
      "status": "never_collected",
      "retention_until": null
    }
  },
  "retention": {
    "general_pii_until": "2030-05-15T12:00:00Z",
    "policy": "4_year_default"
  },
  "datasets": [
    { "name": "workers_500k", "key_column": "candidate_id", "key_value": "CAND-000001" },
    { "name": "candidates",   "key_column": "candidate_id", "key_value": "CAND-000001" },
    { "name": "placements",   "key_column": "candidate_id", "key_value": "CAND-000001" },
    { "name": "timesheets",   "key_column": "candidate_id", "key_value": "CAND-000001" }
  ],
  "safe_views": ["workers_safe", "candidates_safe"],
  "audit_log_path": "data/_catalog/subjects/CAND-000001.audit.jsonl",
  "audit_log_chain_root": "sha256:..."
}
```

### 2.1 — Field semantics

| Field | Required | Notes |
|---|---|---|
| `schema` | yes | Always `"subject_manifest.v1"`. Validates parser shape. |
| `candidate_id` | yes | Subject identifier. Stable token. Same value as appears in dataset key columns. |
| `status` | yes | `pending_consent` \| `active` \| `withdrawn` \| `retention_expired` \| `erased`. |
| `vertical` | yes | `unknown` \| `general` \| `healthcare` \| `finance` \| `other`. **Default `unknown`**; fail-closed routing treats `unknown` as healthcare-equivalent. |
| `consent.general_pii.status` | yes | `pending_backfill_review` \| `pending_first_contact` \| `given` \| `withdrawn` \| `expired`. |
| `consent.biometric.status` | yes | `never_collected` \| `pending` \| `given` \| `withdrawn` \| `expired`. |
| `retention.general_pii_until` | yes | ISO-8601. Drives daily expiration sweep. |
| `datasets[].name` | yes | References an existing catalogd dataset manifest by name. |
| `datasets[].key_column` | yes | The column in that dataset that contains the subject's identifier. |
| `datasets[].key_value` | yes | The specific value (the subject's id within that dataset's namespace). |
| `safe_views` | yes | Names of existing catalogd view manifests that safely project this subject's data (for non-legal-tier readers). |
| `audit_log_path` | yes | Relative path to the audit JSONL. |
| `audit_log_chain_root` | yes | Head of the HMAC chain: the `row_hmac` of the most recent audit row (see §3.1). Updated by the audit-log writer on every write. |
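
The table above maps directly onto a load-time check. The sketch below is illustrative Python rather than the planned `subjects.rs` code; the function name `validate_manifest` and the error strings are hypothetical, and only the enum values and required fields come from the table.

```python
from datetime import datetime

# Allowed values copied from the field table in §2.1.
STATUSES = {"pending_consent", "active", "withdrawn", "retention_expired", "erased"}
VERTICALS = {"unknown", "general", "healthcare", "finance", "other"}
GENERAL_PII = {"pending_backfill_review", "pending_first_contact",
               "given", "withdrawn", "expired"}
BIOMETRIC = {"never_collected", "pending", "given", "withdrawn", "expired"}


def validate_manifest(m: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest passed."""
    errors = []
    if m.get("schema") != "subject_manifest.v1":
        errors.append("schema must be 'subject_manifest.v1'")
    if not m.get("candidate_id"):
        errors.append("candidate_id is required")
    if m.get("status") not in STATUSES:
        errors.append(f"bad status: {m.get('status')!r}")
    if m.get("vertical") not in VERTICALS:
        errors.append(f"bad vertical: {m.get('vertical')!r}")

    consent = m.get("consent", {})
    if consent.get("general_pii", {}).get("status") not in GENERAL_PII:
        errors.append("bad consent.general_pii.status")
    if consent.get("biometric", {}).get("status") not in BIOMETRIC:
        errors.append("bad consent.biometric.status")

    # The examples use a trailing 'Z'; normalise it for older Python versions.
    until = m.get("retention", {}).get("general_pii_until")
    try:
        datetime.fromisoformat(str(until).replace("Z", "+00:00"))
    except ValueError:
        errors.append("retention.general_pii_until must be ISO-8601")

    datasets = m.get("datasets")
    if not isinstance(datasets, list):
        errors.append("datasets must be a list")
        datasets = []
    for i, entry in enumerate(datasets):
        for field in ("name", "key_column", "key_value"):
            if not entry.get(field):
                errors.append(f"datasets[{i}].{field} is required")

    if not isinstance(m.get("safe_views"), list):
        errors.append("safe_views must be a list")
    for field in ("audit_log_path", "audit_log_chain_root"):
        if not m.get(field):
            errors.append(f"{field} is required")
    return errors
```

Returning every problem at once, rather than failing on the first, is a design choice that suits the backfill step, where many manifests may need review in one pass.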

---

## 3. Audit log format

Per-subject append-only JSONL at `data/_catalog/subjects/<candidate_id>.audit.jsonl`. One row per PII access:

```json
{
  "schema": "subject_audit.v1",
  "ts": "2026-05-15T13:30:00Z",
  "candidate_id": "CAND-000001",
  "accessor": {
    "kind": "gateway_lookup",
    "daemon": "gateway",
    "purpose": "fill_validation",
    "trace_id": "X-Lakehouse-Trace-Id-..."
  },
  "fields_accessed": ["name"],
  "result": "success",
  "prev_chain_hash": "sha256:...",
  "row_hmac": "hmac-sha256:..."
}
```

### 3.1 — HMAC chain

Each row's `row_hmac` is `HMAC-SHA256(key, prev_chain_hash || canonical_json_of_row_minus_hmac)`, where `||` is byte concatenation and the first row uses the literal `"GENESIS"` as its `prev_chain_hash`. The signing key is loaded once at startup from `/etc/lakehouse/subject_audit.key` (mode 0400). The chain root in the subject manifest is the latest row's `row_hmac`.

Verifying the chain takes a single pass over the log:
```
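verify_chain(subject_id):
  # key: the signing key loaded at startup (same key the writer uses)
  manifest = read_subject_manifest(subject_id)
  rows = read_audit_log(subject_id)
  prev = "GENESIS"                          # prev_chain_hash of the first row
  for row in rows:
    expected = hmac_sha256(key, prev || canonicalize(row - row_hmac_field))
    assert row.prev_chain_hash == prev      # row links to its predecessor
    assert row.row_hmac == expected         # row content is unmodified
    prev = row.row_hmac
  assert manifest.audit_log_chain_root == prev   # log was not truncated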
```

This is local (no S3 Object Lock, no Vault) but tamper-evident: modifying a past row invalidates that row's `row_hmac` and every row after it. The signing key being on disk is a real risk surface. Operators MUST set the file mode to 0400 (owner read-only) and back the key up to a separate location from the audit logs themselves, so that a single backup never carries both the logs and the key needed to forge them.
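
For readers who want to try the chain end to end, here is a minimal Python sketch of both the writer side and the verification pass. Several details are assumptions, not part of this spec: the canonical form (sorted keys, compact separators, UTF-8), the function names `canonicalize`, `sign_row`, and `verify_chain`, and storing the chain root as the verbatim `row_hmac` string of the last row.

```python
import hashlib
import hmac
import json

GENESIS = "GENESIS"  # prev_chain_hash of the first row, as in the pseudocode


def canonicalize(row: dict) -> bytes:
    """Assumed canonical form: row minus row_hmac, sorted keys, compact JSON."""
    body = {k: v for k, v in row.items() if k != "row_hmac"}
    text = json.dumps(body, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def sign_row(key: bytes, prev: str, row: dict) -> dict:
    """Return a copy of row with prev_chain_hash and row_hmac filled in."""
    signed = dict(row, prev_chain_hash=prev)
    message = prev.encode("utf-8") + canonicalize(signed)
    digest = hmac.new(key, message, hashlib.sha256).hexdigest()
    signed["row_hmac"] = "hmac-sha256:" + digest
    return signed


def verify_chain(key: bytes, rows: list[dict], chain_root: str) -> bool:
    """Single pass: check each link and each HMAC, then the manifest's root."""
    prev = GENESIS
    for row in rows:
        if row.get("prev_chain_hash") != prev:
            return False  # link to predecessor is broken
        expected = sign_row(key, prev, row)["row_hmac"]
        actual = str(row.get("row_hmac", ""))
        # Constant-time comparison avoids leaking how many characters match.
        if not hmac.compare_digest(expected.encode(), actual.encode()):
            return False  # row content was modified
        prev = actual
    return chain_root == prev  # catches truncation of trailing rows
```

Any cross-runtime reader has to agree on the canonical byte form exactly, since a single differing byte changes the HMAC; that is the part of this sketch most worth pinning down before implementation.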

### 3.2 — When the audit log is written

Every code path that resolves PII for a subject MUST attempt to write an audit row before returning. Specifically:
- The gateway's tool registry SQL templates (`crates/gateway/src/tools/registry.rs`) — when `search_candidates` / `get_candidate` queries return rows, write one audit row per returned candidate_id
- The validator's WorkerLookup (`crates/validator/src/staffing/parquet_lookup.rs`) — when a `lookup(candidate_id)` succeeds, write one audit row
- The audit-response endpoint (when implemented) — when `/audit/subject/{id}` is called, write one row of `kind=audit_response`
- Any new code path that touches PII

Audit write failures MUST NOT be silently swallowed: they MUST be logged at error level (per the existing observability fabric). They also MUST NOT block the read. Accept the audit gap and flag it for post-hoc review; losing one audit row is better than blocking a legitimate operation.
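
The failure rule can be expressed as a thin wrapper around any lookup. This is a Python sketch under stated assumptions: `audited_lookup`, the `lookup` callable, and the `write_audit_row` callable are hypothetical names, and the audit row is trimmed to a few fields for brevity.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("subject_audit")


def audited_lookup(
    candidate_id: str,
    lookup: Callable[[str], Optional[dict]],
    write_audit_row: Callable[[dict], None],
) -> Optional[dict]:
    """Resolve PII for one subject, attempting an audit write before returning."""
    record = lookup(candidate_id)
    if record is None:
        return None  # lookup did not succeed, so no row is written here

    row = {
        "schema": "subject_audit.v1",
        "candidate_id": candidate_id,
        "result": "success",
    }
    try:
        write_audit_row(row)
    except Exception:
        # Not swallowed silently: logged at error level with the traceback,
        # so the gap can be found in post-hoc review. The read still returns.
        log.exception("audit write failed for %s", candidate_id)
    return record
```

Catching broadly here is deliberate: any writer failure, whether disk, permissions, or serialization, should surface in the error log without turning into a failed read.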

---

## 4. The `/audit/subject/{candidate_id}` response

The audit response builds from the subject manifest + audit log + dataset projections:

```json
{
  "schema": "subject_audit_response.v1",
  "candidate_id": "CAND-000001",
  "generated_at": "2026-05-15T15:00:00Z",
  "generated_by": "catalogd@hostname",
  "manifest": { /* the SubjectManifest */ },
  "datasets": {
    "workers_500k": {
      "row_present": true,
      "safe_view_projection": { /* candidates_safe row for this subject */ }
    }
  },
  "audit_log_window": {
    "from": "2026-01-01T00:00:00Z",
    "to": "2026-05-15T15:00:00Z",
    "rows": [ /* matching audit rows */ ]
  },
  "chain_verification": {
    "verified": true,
    "rows_checked": 42,
    "chain_root": "sha256:..."
  },
  "completeness_attestation": "all dataset rows + audit log entries within the window per retention policy v1 are included",
  "signature": "ed25519:..."
}
```

The endpoint is auth-gated via a separate legal-tier credential (see §6). The response body is signed with an Ed25519 key separate from the HMAC chain key.
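
The `audit_log_window` part of the response is a plain timestamp filter over the audit rows. A small Python sketch follows; the helper names `parse_ts` and `audit_log_window` are hypothetical, and treating both bounds as inclusive is an assumption this spec does not settle. Signing the response is left out.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse the ISO-8601 UTC timestamps used in the examples (trailing 'Z')."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def audit_log_window(rows: list[dict], start: str, end: str) -> dict:
    """Select audit rows with start <= ts <= end, preserving log order."""
    lo, hi = parse_ts(start), parse_ts(end)
    matching = [row for row in rows if lo <= parse_ts(row["ts"]) <= hi]
    return {"from": start, "to": end, "rows": matching}
```

Chain verification should still run over the full log, not just the window, since a row's HMAC depends on every row before it.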

---

## 5. Implementation plan (this is the SMALL plan)

This is the spec; the implementation is a small extension to catalogd. Estimated effort:

| Step | Effort | What |
|---|---|---|
| **1** | 0.5d | Add `SubjectManifest` struct + JSON load/save in `crates/catalogd/src/subjects.rs`. Mirror the existing `views.rs` pattern. |
| **2** | 0.5d | Add `SubjectAuditWriter` with HMAC chain in same file. Key loaded from sealed file at startup. |
| **3** | 0.5d | Backfill subject manifests from `workers_500k.parquet` rows. ETL: one manifest per row, default `vertical=unknown`, `consent.general_pii.status=pending_backfill_review`. |
| **4** | 0.5d | Wire the gateway tool registry to write audit rows. One audit row per candidate_id returned by search_candidates / get_candidate. |
| **5** | 0.5d | Wire the validator WorkerLookup to write audit rows. |
| **6** | 1d | `/audit/subject/{id}` HTTP endpoint in `crates/catalogd/src/service.rs`. Legal-tier auth. |
| **7** | 0.5d | Daily retention sweep: subjects whose `retention.general_pii_until` < now AND `status != erased` get marked for review (don't auto-delete; legal needs to approve). |
| **8** | 0.5d | Cross-runtime parity: Go side reads the same subject manifests + audit logs. Same shapes, same HMAC algorithm. |
| **Total** | **~4.5 days** | Compared to 17-20 days for the IDENTITY_SERVICE_DESIGN approach. |

Each step is one commit, one revert path. No new daemons. No cloud infrastructure. No Vault. No S3 Object Lock. No dual-control JWT split-secret ceremony.
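
As an illustration of how small these steps are, the selection logic of the step 7 retention sweep fits in a few lines. This Python sketch is not the planned catalogd code; `retention_sweep` is a hypothetical name, and how the "marked for review" state is persisted is left open here.

```python
from datetime import datetime, timezone


def retention_sweep(manifests: list[dict], now: datetime) -> list[str]:
    """Return candidate_ids due for legal review. Nothing is deleted."""
    due = []
    for m in manifests:
        raw = m["retention"]["general_pii_until"]
        until = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        # Step 7 condition: retention date has passed and not already erased.
        if until < now and m["status"] != "erased":
            due.append(m["candidate_id"])
    return due


# Usage: pass a timezone-aware "now" so it compares cleanly with the
# UTC timestamps stored in the manifests.
# due = retention_sweep(all_manifests, datetime.now(timezone.utc))
```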

---

## 6. Auth model

Local-first, simple, defensible:

- **Service-tier reads** (gateway tool registry resolving candidate names for fill scenarios): authenticated via the existing gateway internal credential. Audit row written.
- **Legal-tier reads** (`/audit/subject/{id}`): requires a separate credential held in `/etc/lakehouse/legal_audit.token` (mode 0400, owner-only). Operators may load this only when fulfilling a legal request. The token is rotated per a documented runbook (operator + 1 witness; no cryptographic dual-control ceremony required for this scale).
- **Backups**: subject manifests + audit logs are backed up daily. The HMAC signing key is backed up to a SEPARATE storage location (different physical/network boundary) so a single backup compromise doesn't enable forgery.

This is not as strong as Vault Transit + dual-control JWT + S3 Object Lock external anchoring. It IS strong enough to pass a normal small-business compliance review and to be defensible in IL/IN small-claims-discovery contexts. If the staffing client's contract requires SOC2 Type II or formal HSM, that's a separate phase — but it's strictly additive on top of this v1 spec.
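
Both the legal-tier token and the HMAC signing key rely on the mode 0400 requirement, so it is worth checking at load time rather than trusting the runbook. The following Python sketch assumes a POSIX host; `load_sealed_key` is a hypothetical helper, and the ownership check against the current effective user is an assumption about what "owner-only" should mean in practice.

```python
import os
import stat


def load_sealed_key(path: str) -> bytes:
    """Read a key file, refusing it unless mode is exactly 0400 and we own it."""
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)  # permission bits only
    if mode != 0o400:
        raise PermissionError(f"{path}: mode {mode:04o}, expected 0400")
    if st.st_uid != os.geteuid():
        raise PermissionError(f"{path}: not owned by the current user")
    with open(path, "rb") as f:
        key = f.read()
    if not key:
        raise ValueError(f"{path}: key file is empty")
    return key
```

Refusing to start on a loose file mode turns a silent misconfiguration into an immediate, visible failure.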

---

## 7. What this spec gives you (load-bearing)

1. **Defensible response to discrimination discovery** (worked example: John Martinez at Warehouse B). The endpoint produces a complete + signed + chain-verified record of every PII access affecting him.
2. **BIPA-compliant biometric tracking** when real photos arrive. The `consent.biometric` field + retention timeline are first-class, not bolted on later.
3. **Per-subject right-to-be-forgotten** via cryptographic erasure of the subject manifest's audit-key entry + tombstoning of the candidate's dataset rows. (The ability to verify "row is gone, trail is preserved as anonymous audit-event of the erasure" is what GDPR Art. 17 + CCPA expect.)
4. **HIPAA vertical routing** via the `vertical` field. Healthcare-vertical subjects (and `unknown` defaults) route to local-only models per PRD line 70 — no PHI to cloud egress.
5. **Cross-runtime parity** via the simple JSON+HMAC format that Go can read identically.
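
The fail-closed routing in item 4 can be sketched as an allow-list check. This is illustrative Python: the tier names `local` and `cloud_allowed` and the function `model_tier` are hypothetical, and treating `general`, `finance`, and `other` as cloud-eligible is an assumption, since this spec only states that healthcare and `unknown` subjects stay local.

```python
# Assumption: only these verticals may use non-local models.
CLOUD_ELIGIBLE = {"general", "finance", "other"}


def model_tier(manifest: dict) -> str:
    """Return 'cloud_allowed' only for known eligible verticals, else 'local'."""
    vertical = manifest.get("vertical", "unknown")
    # Allow-list, not deny-list: healthcare, unknown, a missing field, or a
    # value this code has never seen all fall through to local-only.
    return "cloud_allowed" if vertical in CLOUD_ELIGIBLE else "local"
```

Using an allow-list means a newly added vertical defaults to local-only until someone explicitly decides otherwise.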

---

## 8. What this spec does NOT solve (and where it punts)

- **Per-row encryption of dataset PII**: subject manifests + audit logs are local; the underlying `workers_500k.parquet` is not encrypted. If staffing-client contract requires at-rest encryption, that's a separate concern handled at the storage tier (filesystem encryption, S3 SSE).
- **Right to explanation (GDPR Art. 22 / EU AI Act)**: this spec captures which accesses touched a subject; it does not require the model to explain WHY each decision was made in human-readable form. Capturing model reasoning is a separate phase.
- **Adverse-impact statistics**: the comparator-pool snapshot per fill (per `AUDIT_PHASE_1_DISCOVERY` §10/C3) needs its own writer in the fill pipeline. This spec gives you the per-subject record; it doesn't cross-aggregate selection rates by protected class.
- **External tamper-evidence**: the HMAC chain is local. A motivated insider with access to both the audit log AND the signing key could rewrite history. For the staffing-client scale this is acceptable; for higher-stakes deployments a separate timestamping service or external transparency log would be additive.

---

## 9. Why this is the right shape for J's deployment

- It builds on what catalogd ALREADY ships (manifests, views, tombstones — the Iceberg-shape layer).
- It runs locally — no cloud infrastructure to license, monitor, audit, or pay for.
- Its primitives are JSON files an operator can read with `cat` and `jq`. Tamper-evidence works without trusting opaque crypto APIs.
- Its implementation is days, not weeks. The timeline matches the staffing-client launch window without forcing them to wait on a SaaS-tier identity service that doesn't fit their data residency posture.
- It is COMPATIBLE with the IDENTITY_SERVICE_DESIGN v3 path if the staffing client later requires SOC2 Type II — the v1 subject manifests can be migrated into a separate identity daemon when the scale demands it. But that's deferred until demand exists, not built speculatively.

This is the LOCAL-FIRST audit trail. It exists because the SaaS-tier version doesn't fit the deployment model J actually has.

---

## 10. Spec boundary

**Stable in v1 (changing requires v2):**
- File layout: `data/_catalog/subjects/<id>.json` + `data/_catalog/subjects/<id>.audit.jsonl`
- JSON schemas in §2 + §3 (additive fields OK; renames/removals require v2)
- HMAC algorithm: HMAC-SHA256 with key from sealed file
- Chain semantics: `prev_chain_hash` references previous row's `row_hmac`
- Vertical default: `unknown` with fail-closed routing
- Consent state machine in §2.1
- Audit-row write requirements in §3.2

**Implementation-specific (free to change):**
- Storage backend (file system v1; SQLite or Postgres acceptable as long as JSON shapes round-trip)
- HTTP endpoint exact shape (body schema is spec; status codes / headers are implementation)
- Backfill ETL details

**Reserved for v2:**
- Per-row encryption (when staffing-client contract requires it)
- External tamper-evidence anchor (when SOC2 Type II in scope)
- Cross-tenant subject isolation (when multi-tenant in scope)

---

## Change log

- 2026-05-03 — v1 initial draft. Builds on catalogd primitives that already exist. Smaller than IDENTITY_SERVICE_DESIGN v3 by ~4× because it doesn't propose a new daemon.