specs: pathway_memory v1 + subject_manifests_on_catalogd v1

Two specifications addressing the framing J asked for after reading the llms3.com blog: standardize what we have so future work doesn't drift, and apply the local-first thesis to the audit problem instead of the over-scoped SaaS-tier identity service.

PATHWAY_MEMORY_SPEC.md (307 lines):
Documents the existing crates/vectord/src/pathway_memory.rs as a spec: the third metadata layer alongside catalogd's data metadata and playbook_memory's operational memory. Defines:
- PathwayTrace wire format
- pathway_id = SHA256(task_class | file_prefix | signal_class)
- file_prefix algorithm (first 2 path segments)
- pathway_vec: 32-bucket bag-of-tokens hash, fixed dim per spec
- Lifecycle: insert → revise → replay → probation gate → retire
- Mem0 versioning (trace_uid + parent_trace_uid + version chain)
- Access patterns: query_for_hotswap / query_by_vec / list_versions
- PII risk surface (reducer_summary + final_verdict)
- Spec boundary: stable in v1 vs implementation-specific
No new architecture. Descriptive, not prescriptive.

SUBJECT_MANIFESTS_ON_CATALOGD.md (266 lines):
The local-first audit-trail spec. Adds a fourth manifest type to catalogd alongside dataset/view/tombstone/profile. NOT a separate identity daemon. NOT Vault/KMS/dual-control JWT. Builds on primitives catalogd already ships:
- SubjectManifest at data/_catalog/subjects/<id>.json
- Per-subject HMAC-chained audit JSONL
- Daily retention sweep using existing tombstone primitives
- Vertical-aware routing (healthcare → local-only)
- Legal-tier credential separate from gateway internal auth

~4 days estimated implementation effort vs 17-20 days for the IDENTITY_SERVICE_DESIGN approach. Same defensibility for the staffing-client launch window. The v3 design remains strictly additive on top of this spec if SOC2 Type II becomes a contract requirement.

These are SPECS: what the system already does (pathway) and the smallest local-first thing that addresses the audit need (subject manifests). Not 9-phase plans. Not new daemons. The pathway spec is descriptive: writing down what exists so the next person doesn't reinvent it. The subject-manifests spec is prescriptive: once J greenlights it, implementation is days, not weeks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:07:38 -05:00 · 2026-05-03 03:07:38 -05:00 · ed1fcd3c26
commit ed1fcd3c26
parent 991db7be1a
2 changed files with 573 additions and 0 deletions
--- a/docs/specs/PATHWAY_MEMORY_SPEC.md
+++ b/docs/specs/PATHWAY_MEMORY_SPEC.md
@@ -0,0 +1,307 @@
+# Pathway Memory — Specification v1
+
+**Status:** Draft v1 — 2026-05-03 · **Layer:** Decision metadata · **Implementation:** `crates/vectord/src/pathway_memory.rs`
+
+> **What this is.** Pathway Memory is a third metadata layer that exists alongside (not on top of, not under) the data-metadata layer (catalogd's manifests/views/tombstones) and the operational-memory layer (playbook_memory). It tracks **decision patterns**: "this code path, in this task class, produced this outcome." It is the AI-substrate analog of what Iceberg manifests are to data tables — pointers + provenance for things you might want to retrieve again or learn from.
+>
+> **What it is NOT.** It is not a vector database. It is not a knowledge graph. It is not an audit log (audit is per-subject; pathways are per-code-pattern). It is not a cache. It is a small, opinionated metadata layer with a defined wire format, defined access patterns, and defined lifecycle rules.
+>
+> **Why a spec exists.** No standard tells us what this layer is. Each implementation that drifts from this spec creates the wiring confusion currently visible in the system (data/_kb/ JSONLs that don't get consumed, version drift across `workers_500k_v1..v10`, unclear who reads `reducer_summary` vs `final_verdict`). The spec is the standardization contract that lets a single implementation become the canonical one.
+
+---
+
+## 1. Conceptual model
+
+A **pathway** is the identity-equivalence-class of code paths that exhibit similar decision behavior. Two distinct files may share a pathway if they belong to the same crate AND are reviewed under the same task class with the same signal class.
+
+A **trace** is a single observed instance of a pathway: one review pass, with its inputs, intermediate signals, ladder attempts, audit outcome, and final verdict.
+
+A pathway is **identified** by a stable hash of three keys: task class, file prefix, signal class. The hash is `pathway_id`.
+
+A trace is **identified** within a pathway by a generated UUID: `trace_uid`. The pair `(pathway_id, trace_uid)` is globally unique.
+
+Pathways are **append-only by trace**: every new observation creates a new trace under the existing pathway_id. Traces themselves are **mutable in version**: `revise()` produces a new version that supersedes the previous (Mem0-style chain).
+
+Pathways are **retirable** by id: when replay history shows the pattern no longer holds, the pathway is marked retired and excluded from hot-swap. New pathways may emerge with fresh ids if the underlying code/task characteristics shift.
+
+---
+
+## 2. Wire format — `PathwayTrace`
+
+Authoritative struct: `crates/vectord/src/pathway_memory.rs`. The fields below are the SPEC; the Rust struct is one implementation. All fields use JSON-serializable types when persisted.
+
+### 2.1 — Identity fields (immutable per trace)
+
+| Field | Type | Constraint |
+|---|---|---|
+| `pathway_id` | string | SHA256 hex of `task_class | file_prefix | signal_class` — see §3 |
+| `trace_uid` | string | UUID. Empty on legacy traces; implementations SHOULD populate it on insert from v2+. |
+| `version` | u32 | Mem0 version chain. Default 1 on insert. Bumped on `revise()`. |
+| `parent_trace_uid` | string \| null | trace_uid of the trace this one supersedes. null on root version. |
+| `superseded_at` | string \| null | ISO-8601 UTC. Set when this trace becomes a non-head version. |
+| `superseded_by_trace_uid` | string \| null | trace_uid of the new head when this version is superseded. |
+
+### 2.2 — Identity-source fields (the inputs that fed pathway_id)
+
+| Field | Type | Notes |
+|---|---|---|
+| `task_class` | string | Caller-defined class string (e.g. `scrum_review`, `staffing.fill`, `pr_audit`) |
+| `file_path` | string | Full path of the file under review. Reduced to file_prefix for hashing — see §3.2. |
+| `signal_class` | string \| null | Caller-defined behavior label (e.g. `CONVERGING`, `LOOPING`, `STUCK_RETRY`) |
+
+### 2.3 — Observation fields (what happened)
+
+| Field | Type | Description |
+|---|---|---|
+| `created_at` | RFC 3339 timestamp | When this trace was inserted. |
+| `ladder_attempts` | array of `LadderAttempt` | Each model+rung attempt: `{rung, model, latency_ms, accepted, reject_reason?}`. Order is dispatch order. |
+| `kb_chunks` | array of `KbChunkRef` | Knowledge-base chunks the reviewer was given as context: `{source_doc, chunk_id, cosine_score, rank}`. |
+| `observer_signals` | array of `ObserverSignal` | Behavior labels emitted by observer during this pathway's run: `{class, priors, prior_iter_outcomes}`. |
+| `bridge_hits` | array of `BridgeHit` | External-doc lookups (context7-style): `{library, version, result_summary}`. |
+| `sub_pipeline_calls` | array of `SubPipelineCall` | Calls to other sub-pipelines (extract, validate, etc.) made during this trace. |
+| `audit_consensus` | `AuditConsensus` \| null | Cross-lineage audit outcome if any: `{pass, models, disagreements}`. |
+| `reducer_summary` | string | Free-text summary of what the reducer concluded. **PII risk surface — see §6.3.** |
+| `final_verdict` | string | Free-text verdict label (`accepted`, `rejected`, `needs_review`, etc.). **PII risk surface — see §6.3.** |
+
+### 2.4 — Index fields (the matrix retrieval shape)
+
+| Field | Type | Description |
+|---|---|---|
+| `pathway_vec` | array of f32, length 32 | Bag-of-tokens hash embedding. Algorithm in §4. Fixed dimension per spec version. |
+| `replay_count` | u32 | Number of times this pathway has been replayed via hot-swap. Initial insert is NOT a replay. |
+| `replays_succeeded` | u32 | Replays where the post-replay outcome matched the trace's `final_verdict`. |
+| `retired` | bool | Marked true when probation gate fires (§5.4). Excluded from hot-swap forever. |
+
+### 2.5 — Optional semantic-correctness fields (ADR-021)
+
+| Field | Type | Description |
+|---|---|---|
+| `semantic_flags` | array of `SemanticFlag` | Each element is one of: `UnitMismatch`, `TypeConfusion`, `NullableConfusion`, `OffByOne`, `StaleReference`, `PseudoImpl`, `DeadCode`, `WarningNoise`, `BoundaryViolation`. Caller-emitted, not auto-derived. |
+| `type_hints_used` | array of `TypeHint` | Schema/type context fed to the reviewer for this trace: `{source, symbol, type_repr}`. |
+| `bug_fingerprints` | array of `BugFingerprint` | Structural pattern hashes the reviewer caught: `{flag, pattern_key, example, occurrences}`. |
+
+All §2.5 fields are additive — implementations MUST default them to empty when deserializing legacy traces.
+
+---
+
+## 3. Identity algorithms
+
+### 3.1 — `pathway_id`
+
+```
+pathway_id = sha256_hex(task_class || "|" || file_prefix(file_path) || "|" || signal_class_or_empty)
+```
+
+- Concatenation uses literal byte `|` (0x7C) as separator.
+- `signal_class_or_empty` is the empty string when signal_class is null.
+- SHA-256 output is 64 hex chars, lowercase.
+
+### 3.2 — `file_prefix`
+
+```
+file_prefix(path) = first two path segments joined by '/'
+```
+
+- Path separator is forward slash. Implementations on Windows MUST normalize to forward slashes before hashing.
+- Files with fewer than two segments use the full path verbatim.
+- Examples:
+  - `crates/queryd/src/service.rs` → `crates/queryd`
+  - `crates/gateway` → `crates/gateway`
+  - `README.md` → `README.md`
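The identity algorithms in §3.1 and §3.2 can be sketched in Python as follows. This is a minimal illustration rather than the Rust implementation: the function names simply mirror the spec terms, backslash replacement is the only Windows normalization shown, and UTF-8 encoding of the concatenated key is an assumption the spec does not state.

```python
import hashlib
from typing import Optional


def file_prefix(path: str) -> str:
    """First two path segments joined by '/', or the full path if shorter."""
    normalized = path.replace("\\", "/")  # normalize Windows separators first
    segments = normalized.split("/")
    if len(segments) < 2:
        return normalized
    return "/".join(segments[:2])


def pathway_id(task_class: str, file_path: str, signal_class: Optional[str]) -> str:
    """SHA-256 hex of task_class | file_prefix | signal_class (null becomes "")."""
    key = "|".join([task_class, file_prefix(file_path), signal_class or ""])
    # UTF-8 is an assumption; the spec only fixes the 0x7C separator byte.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Under this sketch, two files in the same crate (for example `crates/queryd/src/service.rs` and `crates/queryd/src/lib.rs`) reviewed under the same task class and signal class produce the same `pathway_id`, which is the grouping §3.3 describes.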
+
+### 3.3 — Why this fingerprint shape
+
+The fingerprint groups files within a crate (same `file_prefix`) under the same pathway when reviewed for the same task class with the same signal class. This is intentional: it lets the matrix index recognize "the same kind of bug appears in this crate" without needing per-file traces. Per-file granularity is preserved in the `file_path` field for retrieval but is not part of the identity.
+
+---
+
+## 4. `pathway_vec` algorithm
+
+Fixed dimension: **32 buckets**. Implementation: deterministic bag-of-tokens hash, normalized.
+
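+```
+import hashlib
+import math
+
+def build_pathway_vec(trace):
+    buckets = [0.0] * 32
+    # collect_metadata_tokens is defined by the token list below
+    tokens = collect_metadata_tokens(trace)
+    for tok in tokens:
+        h = hashlib.sha256(tok.encode("utf-8")).digest()
+        # first 4 digest bytes, big-endian, select the bucket
+        bucket = int.from_bytes(h[:4], "big") % 32
+        buckets[bucket] += 1.0
+    # L2-normalize; an all-zero vector stays all-zero
+    norm = math.sqrt(sum(b * b for b in buckets))
+    return [b / norm if norm > 0 else 0.0 for b in buckets]
+```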
+
+**Tokens MUST include**:
+- `task_class:<task_class>`
+- `file_prefix:<file_prefix>`
+- `signal_class:<signal_class_or_empty>`
+- For each `ladder_attempts[i].model`: `model:<model>`
+- For each `kb_chunks[i].source_doc`: `kb_doc:<source_doc>`
+- For each `observer_signals[i].class`: `signal:<class>`
+- For each `bug_fingerprints[i].flag`: `flag:<flag>`
+
+**Tokens MUST NOT include**:
+- `reducer_summary` text (free-form, would dominate the bucket distribution)
+- `final_verdict` text
+- `created_at` (would make every trace's vector unique)
+- candidate_id, name, email, phone, or any subject identifier (PII risk)
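One way to read the two token lists is as an allow-list: the collector only ever looks at the named metadata fields, so free text and subject identifiers cannot leak into the vector. The sketch below assumes the trace is a plain dict, that `file_prefix` has already been computed by the caller (§3.2), and uses hypothetical model names; it is illustrative, not the shipped implementation.

```python
import hashlib
import math


def collect_metadata_tokens(trace: dict) -> list:
    """Gather the MUST-include tokens; free-text and subject fields are never read."""
    # file_prefix is assumed to be precomputed by the caller (see section 3.2).
    tokens = [
        f"task_class:{trace['task_class']}",
        f"file_prefix:{trace['file_prefix']}",
        f"signal_class:{trace.get('signal_class') or ''}",
    ]
    tokens += [f"model:{a['model']}" for a in trace.get("ladder_attempts", [])]
    tokens += [f"kb_doc:{c['source_doc']}" for c in trace.get("kb_chunks", [])]
    tokens += [f"signal:{s['class']}" for s in trace.get("observer_signals", [])]
    tokens += [f"flag:{b['flag']}" for b in trace.get("bug_fingerprints", [])]
    return tokens


def build_pathway_vec(trace: dict, dim: int = 32) -> list:
    """Hash each token into one of `dim` buckets, then L2-normalize."""
    buckets = [0.0] * dim
    for tok in collect_metadata_tokens(trace):
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        buckets[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(b * b for b in buckets))
    return [b / norm if norm > 0 else 0.0 for b in buckets]


example = {
    "task_class": "scrum_review",
    "file_prefix": "crates/queryd",
    "signal_class": "CONVERGING",
    "ladder_attempts": [{"model": "model-a"}, {"model": "model-b"}],  # hypothetical names
    "reducer_summary": "free text that must not influence the vector",
}
vec = build_pathway_vec(example)
```

Because `reducer_summary`, `final_verdict`, and `created_at` are never read, changing them leaves `vec` unchanged, which is the property the MUST NOT list is protecting.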
+
+**Why 32 dimensions:** the consensus-validated round-3 ensemble selected "small metadata tokens, not full JSON." 32 buckets are sufficient to distinguish pathway combinations without requiring an external embedding model. An external embedding model would work too, but it adds a dependency, a failure mode, and a drift risk that the consensus flagged. **Spec version v1 fixes dim=32. v2 may revisit.**
+
+---
+
+## 5. Lifecycle
+
+### 5.1 — Insert
+
+A trace is inserted by the consumer (typically the scrum pipeline or auditor) calling the implementation's `insert()` API.
+
+- If no pathway with `pathway_id` exists, a new pathway bucket is created.
+- If a pathway exists, the new trace is appended to that bucket's traces list.
+- `trace_uid` is generated (UUID v7 RECOMMENDED for time-orderability).
+- `version` is set to 1.
+- `created_at` is set to the insertion timestamp.
+- `pathway_vec` is computed and stored.
+- `replay_count` and `replays_succeeded` start at 0.
+
+### 5.2 — Revise
+
+A trace is revised by calling `revise(trace_uid, new_fields)`.
+
+- A new trace with a fresh `trace_uid` and `version = previous.version + 1` is inserted.
+- The new trace's `parent_trace_uid` is set to the previous trace's `trace_uid`.
+- The previous trace's `superseded_at` is set to the revision timestamp and `superseded_by_trace_uid` is set to the new trace's `trace_uid`.
+- The previous trace remains in the bucket (history is preserved). Retrieval defaults to head versions only — see §7.
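The insert and revise rules in §5.1 and §5.2 can be sketched over an in-memory bucket (a list of trace dicts). Several details here are assumptions: `uuid4` stands in for the recommended UUID v7, replay counters are copied onto the revised version because the spec does not say whether they carry over, and `pathway_vec` computation is omitted.

```python
import uuid
from datetime import datetime, timezone


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def insert(bucket: list, fields: dict) -> dict:
    """Append a version-1 head trace to a pathway bucket."""
    trace = dict(fields)
    trace.update(
        trace_uid=str(uuid.uuid4()),  # spec recommends UUID v7; v4 used for portability
        version=1,
        parent_trace_uid=None,
        superseded_at=None,
        superseded_by_trace_uid=None,
        created_at=_now(),
        replay_count=0,
        replays_succeeded=0,
    )
    bucket.append(trace)
    return trace


def revise(bucket: list, trace_uid: str, new_fields: dict) -> dict:
    """Supersede an existing trace with a new head version; history is kept."""
    previous = next(t for t in bucket if t["trace_uid"] == trace_uid)
    revised = dict(previous)  # assumption: unrevised fields carry over
    revised.update(new_fields)
    revised.update(
        trace_uid=str(uuid.uuid4()),
        version=previous["version"] + 1,
        parent_trace_uid=previous["trace_uid"],
        superseded_at=None,
        superseded_by_trace_uid=None,
        created_at=_now(),
    )
    previous["superseded_at"] = revised["created_at"]
    previous["superseded_by_trace_uid"] = revised["trace_uid"]
    bucket.append(revised)
    return revised


def heads(bucket: list) -> list:
    """Head versions are the traces whose superseded_at is null."""
    return [t for t in bucket if t["superseded_at"] is None]
```

After one `insert` and one `revise`, the bucket holds both versions, while `heads(bucket)` returns only the version-2 trace.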
+
+### 5.3 — Replay
+
+A pathway is **eligible for replay** when `retired == false` AND the bucket has at least one head-version trace.
+
+The hot-swap consumer queries by `pathway_id` (or by similarity over `pathway_vec`), retrieves the head trace, applies its `final_verdict` shape to a new but similar input, and then reports back via `record_replay(trace_uid, succeeded: bool)`.
+
+- `replay_count` is incremented on every replay (success or failure).
+- `replays_succeeded` is incremented only on success.
+
+### 5.4 — Probation gate (retirement)
+
+Probation triggers when `replay_count >= 3` AND `success_rate < 0.80`.
+
+When triggered, the pathway is marked `retired = true`. It is excluded from hot-swap forever. If the underlying code/task characteristics genuinely change such that a retired pathway would work again, a NEW pathway with a FRESH id will be created (because the inputs to `pathway_id` would have changed, e.g., new signal_class). Retirement is per-id and irreversible.
+
+`success_rate` = `replays_succeeded / replay_count` (0.0 if `replay_count == 0`).
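A minimal sketch of the replay counters and the probation gate, assuming the counters live on a plain dict. The spec describes retirement per pathway while the counters are listed as trace fields in §2.4; this sketch does not take a position on that and simply applies the rule to whichever record holds the counters.

```python
def success_rate(replay_count: int, replays_succeeded: int) -> float:
    """replays_succeeded / replay_count, or 0.0 before any replay."""
    return replays_succeeded / replay_count if replay_count > 0 else 0.0


def record_replay(entry: dict, succeeded: bool) -> None:
    """Update counters, then apply the probation gate from section 5.4."""
    entry["replay_count"] += 1
    if succeeded:
        entry["replays_succeeded"] += 1
    rate = success_rate(entry["replay_count"], entry["replays_succeeded"])
    if entry["replay_count"] >= 3 and rate < 0.80:
        entry["retired"] = True  # irreversible: nothing here ever resets it
```

With these thresholds, two successes followed by one failure (rate 2/3) retires the entry, while four successes and one failure (rate exactly 0.80) does not, because the comparison is strict.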
+
+---
+
+## 6. Access patterns (read + write shapes)
+
+### 6.1 — Write paths (who inserts traces)
+
+Spec-conformant writers:
+- Scrum pipeline (`tests/real-world/scrum_master_pipeline.ts` and downstream consumers) on review acceptance
+- Auditor pipeline (`auditor/`) on verdict assembly
+- Validator (`crates/validator/`) on iterate session completion
+
+Each writer is responsible for setting the identity fields correctly. The implementation MUST reject writes where `task_class` is empty.
+
+### 6.2 — Read paths (who consumes traces)
+
+Spec-conformant readers:
+- **Hot-swap retrieval** — `query_for_hotswap(task_class, file_path, signal_class, k=5)` returns the top-k head-version traces from matching pathways, ordered by `(success_rate desc, replay_count desc)`.
+- **Similarity retrieval** — `query_by_vec(pathway_vec, k=10)` returns the top-k head-version traces by cosine similarity over `pathway_vec`.
+- **Audit / forensic** — `list_versions(trace_uid_root)` returns the full version chain. **Must be authorized** — version chains can include reducer_summary text that may contain PII.
+
+The implementation MUST NOT expose `list_versions` on an unauthenticated route.
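The two retrieval orderings can be illustrated over a flat list of trace dicts. This is a sketch under assumptions: matching traces to pathways is taken as already done, `rank_for_hotswap` is a hypothetical helper name for the ordering step of `query_for_hotswap`, and `query_by_vec` does not filter retired traces because the spec only excludes them from hot-swap.

```python
import math


def success_rate(t: dict) -> float:
    return t["replays_succeeded"] / t["replay_count"] if t["replay_count"] else 0.0


def rank_for_hotswap(traces: list, k: int = 5) -> list:
    """Head, non-retired traces ordered by (success_rate desc, replay_count desc)."""
    eligible = [
        t for t in traces
        if t.get("superseded_at") is None and not t.get("retired")
    ]
    eligible.sort(key=lambda t: (success_rate(t), t["replay_count"]), reverse=True)
    return eligible[:k]


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def query_by_vec(traces: list, query_vec: list, k: int = 10) -> list:
    """Top-k head-version traces by cosine similarity over pathway_vec."""
    head_traces = [t for t in traces if t.get("superseded_at") is None]
    head_traces.sort(key=lambda t: cosine(t["pathway_vec"], query_vec), reverse=True)
    return head_traces[:k]
```

Ties on `success_rate` fall through to `replay_count`, so a trace with ten clean replays ranks ahead of one with four.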
+
+### 6.3 — PII surface (load-bearing for audit-trail compliance)
+
+The fields `reducer_summary` and `final_verdict` are free-form strings written from review output. They are **highly likely** to contain PII when the reviewed task class involved real candidates (per `AUDIT_PHASE_1_DISCOVERY.md` §1F + §10/C1). Until subject-redaction-on-write lands, implementations:
+- MUST treat trace bodies as a suspected PII sink for any task class touching real subjects
+- SHOULD redact known PII patterns (candidate_id, email, phone) before persisting
+- SHOULD provide a per-trace `subject_ids[]` top-level field when known, so audit-response queries can filter by subject
+
+Until the redaction layer ships, do not advertise pathway_memory as PII-clean.
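As a rough illustration of the SHOULD-redact guidance, the sketch below masks three pattern families before a free-text field is persisted. The patterns are assumptions chosen for illustration: the `CAND-` prefix is taken from the example identifier used in the subject-manifest spec, the phone pattern only covers one common US layout, and pattern matching of this kind will miss names and other unstructured PII.

```python
import re

# Illustrative patterns only; a real deployment needs patterns fitted to its data.
PII_PATTERNS = [
    (re.compile(r"\bCAND-\d+\b"), "[candidate_id]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[email]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[phone]"),
]


def redact(text: str) -> str:
    """Replace each known PII pattern with a fixed placeholder."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

A writer would call `redact()` on `reducer_summary` and `final_verdict` before persisting the trace. That reduces the PII surface but does not make it safe to advertise the store as PII-clean.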
+
+---
+
+## 7. Storage representation
+
+The current implementation persists state to a single JSON file at `data/_pathway_memory/state.json`:
+
+```
+{
+  "pathways": {
+    "<pathway_id>": [
+      { "trace_uid": "...", "version": 1, "task_class": "...", ... },
+      { "trace_uid": "...", "version": 2, "parent_trace_uid": "...", ... }
+    ]
+  },
+  "last_updated_at": "2026-04-29T05:43:00Z"
+}
+```
+
+The flat-JSON representation is implementation-specific. A spec-conformant implementation could use SQLite, an append-only JSONL, or a Lance dataset. The spec dictates the wire shape per-trace and the access patterns; the storage backend is the implementation's choice.
+
+**Retrieval defaults**: queries return head versions only by default (a head version is one with `superseded_at == null`). Consumers explicitly opt in to historical versions via an `include_history: true` flag.
+
+---
+
+## 8. Spec boundary — what's stable vs implementation-specific
+
+### Stable (locked in v1; changing requires v2 spec)
+- The `pathway_id` algorithm (SHA256 of `task_class | file_prefix | signal_class`)
+- The `file_prefix` algorithm (first two path segments)
+- `pathway_vec` dimension (32) and token list
+- Field names and types in §2 (additive fields are OK; renames or removals require v2)
+- Lifecycle rules in §5
+- Access pattern names in §6 (`query_for_hotswap`, `query_by_vec`, `list_versions`)
+
+### Implementation-specific (free to change without spec bump)
+- Storage backend (JSON file, SQLite, Lance, etc.)
+- HTTP/gRPC/library API surface
+- Concurrency model
+- In-memory caching strategy
+- The exact text of `reducer_summary` (it's content, not schema)
+- Token-bucket hash function (SHA-256 is specified in §4, but any 32-bit hash is acceptable as long as it is deterministic and well-distributed)
+
+### Reserved for v2 (named so future implementations don't paint into a corner)
+- Cross-pathway similarity (`query_pathways_similar_to_pathway`)
+- External embedding model for `pathway_vec` (replaces the bag-of-tokens hash)
+- Subject-aware indexing (so audit-response can find all pathways involving subject X)
+- Differential privacy on `pathway_vec` aggregations
+
+---
+
+## 9. Open questions
+
+These need decisions before v1 is final-final:
+
+1. **Dimension lock-in**: 32 is round-3 consensus, but is it actually optimal? Worth a benchmark before final v1.
+2. **Subject indexing in v1 vs v2**: deferring to v2 means audit queries today have to grep `reducer_summary` text. Is that acceptable for the staffing-client launch window?
+3. **Cross-runtime parity**: the Go side has its own `internal/pathways/` implementation. As of 2026-05-03 the wire shape matches but no parity probe exists. Add `pathway_parity.sh` to the cross-runtime probe suite.
+4. **Migration from current state.json**: existing 91 traces lack `trace_uid`. Spec says `trace_uid` is empty on legacy traces — should v1 require backfill or accept legacy-trace queries indefinitely?
+
+---
+
+## 10. Why this spec exists
+
+Without it:
+- Every new consumer of pathway_memory invents their own read/write conventions
+- The Go-side reimplementation drifts from the Rust-side wire format
+- Future LLM-assistant work proposes new fields without considering the existing schema
+- Forensic queries grep free-text fields because no canonical retrieval exists
+- The PII concern stays "TODO" because there's no stable surface to gate
+
+With it:
+- One implementation can be canonical (Rust, currently); others can be conformance-tested
+- The wire shape is documented; serialization tests can pin it
+- New fields go through an additive-only review (don't break legacy)
+- The KbContext consumer in the gateway has a stable API to read against
+- The audit-trail PRD has a defined PII surface to protect
+
+This is the standardization the system was missing.
+
+---
+
+## Change log
+
+- 2026-05-03 — v1 initial draft. Codifies what's already in `crates/vectord/src/pathway_memory.rs`. The spec is descriptive of the existing implementation, not prescriptive of new behavior.
--- a/docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md
+++ b/docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md
@@ -0,0 +1,266 @@
+# Subject Manifests on Catalogd — Specification v1
+
+**Status:** Draft v1 — 2026-05-03 · **Layer:** Catalogd extension (NOT a separate daemon) · **Implementation:** to be added to `crates/catalogd/src/`
+
+> **What this is.** A small extension to `catalogd` adding a fourth manifest type — `subject` — alongside the existing dataset / view / tombstone / profile types. A subject manifest answers: "for person X, which datasets contain their PII, which views project it safely, what consent + retention applies, and what's the access log."
+>
+> **What it is NOT.** It is not a separate identity daemon. It is not a Postgres-backed identity service. It is not a HashiCorp-Vault-using KEK rotation system. It is not the `IDENTITY_SERVICE_DESIGN.md` v3 design (that doc is over-scoped for local-only — see its deprecation header). It is the smallest spec that gives you a defensible "show me everything we know about person X" capability for EEOC discovery / BIPA compliance, building on primitives catalogd ALREADY ships.
+>
+> **Why it can be small.** Catalogd already has dataset manifests with per-column `is_pii` flags. It already has views with `column_redactions` (working example: `candidates_safe.json`). It already has tombstones for deletes. It already has profiles for per-agent scoping. The audit-trail need is "thread these primitives together by subject identifier" — not "build a new system." That's a ~300-500 LOC extension, not a 17-20 day phase plan.
+
+---
+
+## 1. Conceptual model
+
+A **subject** is a real person whose PII flows through the system, identified by a stable token (current implementation: `candidate_id` in `workers_500k.parquet`).
+
+A **subject manifest** is a JSON record under `data/_catalog/subjects/<candidate_id>.json` that points at:
+- which datasets contain rows for this subject (via foreign-key reference)
+- which views safely project this subject's data (via existing view manifest names)
+- what consent + retention metadata applies to this subject
+- what access log file holds this subject's audit trail
+
+Subject manifests are written when a subject enters the system AND updated when the subject's consent changes, their vertical is reclassified, or their retention period expires.
+
+The **audit log** is a per-subject append-only JSONL file at `data/_catalog/subjects/<candidate_id>.audit.jsonl`. Every PII access for that subject writes one row. The file is signed periodically (HMAC chain) for tamper-evidence.
+
+---
+
+## 2. Wire format — `SubjectManifest`
+
+JSON document at `data/_catalog/subjects/<candidate_id>.json`:
+
+```json
+{
+  "schema": "subject_manifest.v1",
+  "candidate_id": "CAND-000001",
+  "created_at": "2026-05-15T12:00:00Z",
+  "updated_at": "2026-05-15T12:00:00Z",
+  "status": "active",
+  "vertical": "general",
+  "consent": {
+    "general_pii": {
+      "status": "given",
+      "version": "v1-2026-05-15",
+      "given_at": "2026-05-15T12:00:00Z"
+    },
+    "biometric": {
+      "status": "never_collected",
+      "retention_until": null
+    }
+  },
+  "retention": {
+    "general_pii_until": "2030-05-15T12:00:00Z",
+    "policy": "4_year_default"
+  },
+  "datasets": [
+    { "name": "workers_500k", "key_column": "candidate_id", "key_value": "CAND-000001" },
+    { "name": "candidates",   "key_column": "candidate_id", "key_value": "CAND-000001" },
+    { "name": "placements",   "key_column": "candidate_id", "key_value": "CAND-000001" },
+    { "name": "timesheets",   "key_column": "candidate_id", "key_value": "CAND-000001" }
+  ],
+  "safe_views": ["workers_safe", "candidates_safe"],
+  "audit_log_path": "data/_catalog/subjects/CAND-000001.audit.jsonl",
+  "audit_log_chain_root": "sha256:..."
+}
+```
+
+### 2.1 — Field semantics
+
+| Field | Required | Notes |
+|---|---|---|
+| `schema` | yes | Always `"subject_manifest.v1"`. Validates parser shape. |
+| `candidate_id` | yes | Subject identifier. Stable token. Same value as appears in dataset key columns. |
+| `status` | yes | `pending_consent` \| `active` \| `withdrawn` \| `retention_expired` \| `erased`. |
+| `vertical` | yes | `unknown` \| `general` \| `healthcare` \| `finance` \| `other`. **Default `unknown`**, fail-closed routing treats unknown as healthcare-equivalent. |
+| `consent.general_pii.status` | yes | `pending_backfill_review` \| `pending_first_contact` \| `given` \| `withdrawn` \| `expired`. |
+| `consent.biometric.status` | yes | `never_collected` \| `pending` \| `given` \| `withdrawn` \| `expired`. |
+| `retention.general_pii_until` | yes | ISO-8601. Drives daily expiration sweep. |
+| `datasets[].name` | yes | References an existing catalogd dataset manifest by name. |
+| `datasets[].key_column` | yes | The column in that dataset that contains the subject's identifier. |
+| `datasets[].key_value` | yes | The specific value (the subject's id within that dataset's namespace). |
+| `safe_views` | yes | Names of existing catalogd view manifests that safely project this subject's data (for non-legal-tier readers). |
+| `audit_log_path` | yes | Relative path to the audit JSONL. |
+| `audit_log_chain_root` | yes | SHA-256 of the most recent HMAC-chained checkpoint of the audit log. Updated by the audit-log writer on every write. |
+
+---
+
+## 3. Audit log format
+
+Per-subject append-only JSONL at `data/_catalog/subjects/<candidate_id>.audit.jsonl`. One row per PII access:
+
+```json
+{
+  "schema": "subject_audit.v1",
+  "ts": "2026-05-15T13:30:00Z",
+  "candidate_id": "CAND-000001",
+  "accessor": {
+    "kind": "gateway_lookup",
+    "daemon": "gateway",
+    "purpose": "fill_validation",
+    "trace_id": "X-Lakehouse-Trace-Id-..."
+  },
+  "fields_accessed": ["name"],
+  "result": "success",
+  "prev_chain_hash": "sha256:...",
+  "row_hmac": "hmac-sha256:..."
+}
+```
+
+### 3.1 — HMAC chain
+
+Each row's `row_hmac` is `HMAC-SHA256(key, prev_chain_hash || canonical_json_of_row_minus_hmac)`. The signing key is loaded once at startup from `/etc/lakehouse/subject_audit.key` (mode 0400). The chain root in the subject manifest references the latest row's `row_hmac`.
+
+A tamper-evident verification is one pass:
+```
+verify_chain(subject_id):
+  manifest = read_subject_manifest(subject_id)
+  rows = read_audit_log(subject_id)
+  prev = "GENESIS"
+  for row in rows:
+    expected = hmac_sha256(key, prev || canonicalize(row - row_hmac_field))
+    assert row.prev_chain_hash == prev
+    assert row.row_hmac == expected
+    prev = row.row_hmac
+  assert manifest.audit_log_chain_root == prev
+```
+
+This is local (no S3 Object Lock, no Vault) but tamper-evident: any modification to a past row breaks the chain at that point and at all subsequent rows. The signing key being on disk is a real risk surface: operators MUST set the file to mode 0400 (owner-only) and back it up to a separate location from the audit logs themselves, so that a single backup doesn't carry both the logs and the key needed to forge them.
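The chain described in §3.1 can be exercised end to end with Python's standard `hmac` module. This sketch makes several assumptions that the spec leaves open: canonical JSON is taken to mean sorted keys with compact separators, the `row_hmac` is stored as bare hex without the `hmac-sha256:` prefix, the genesis value is the literal string `GENESIS` from the pseudocode, and rows are held in memory instead of being read from the JSONL file.

```python
import hashlib
import hmac
import json

GENESIS = "GENESIS"


def canonicalize(row: dict) -> bytes:
    """Assumed canonical form: sorted keys, compact separators, row_hmac removed."""
    body = {k: v for k, v in row.items() if k != "row_hmac"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_row(key: bytes, prev: str, row: dict) -> str:
    """HMAC-SHA256(key, prev_chain_hash || canonical row without row_hmac)."""
    message = prev.encode("utf-8") + canonicalize(row)
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def append_row(key: bytes, rows: list, fields: dict) -> str:
    """Chain a new row onto the log and return the new chain root."""
    prev = rows[-1]["row_hmac"] if rows else GENESIS
    row = dict(fields, prev_chain_hash=prev)
    row["row_hmac"] = sign_row(key, prev, row)
    rows.append(row)
    return row["row_hmac"]


def verify_chain(key: bytes, rows: list, chain_root: str) -> bool:
    """One pass over the log; False at the first broken link or a stale root."""
    prev = GENESIS
    for row in rows:
        if row["prev_chain_hash"] != prev:
            return False
        if not hmac.compare_digest(row["row_hmac"], sign_row(key, prev, row)):
            return False
        prev = row["row_hmac"]
    return chain_root == prev
```

Editing any field of an earlier row changes its canonical bytes, so its stored `row_hmac` no longer matches and verification fails at that row. Deleting the last row is caught by the final comparison against the manifest's chain root.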
+
+### 3.2 — When the audit log is written
+
+Every code path that resolves PII for a subject MUST write an audit row before returning. Specifically:
+- The gateway's tool registry SQL templates (`crates/gateway/src/tools/registry.rs`) — when `search_candidates` / `get_candidate` queries return rows, write one audit row per returned candidate_id
+- The validator's WorkerLookup (`crates/validator/src/staffing/parquet_lookup.rs`) — when a `lookup(candidate_id)` succeeds, write one audit row
+- The audit-response endpoint (when implemented) — when `/audit/subject/{id}` is called, write one row of `kind=audit_response`
+- Any new code path that touches PII
+
+Write failures MUST NOT be silently swallowed: they MUST be logged at error level (per the existing observability fabric). Write failures MUST NOT block the read either; accept the audit gap and flag it for post-hoc review (better to miss an audit row than to block legitimate operations).
+
+---
+
+## 4. The `/audit/subject/{candidate_id}` response
+
+The audit response builds from the subject manifest + audit log + dataset projections:
+
+```json
+{
+  "schema": "subject_audit_response.v1",
+  "candidate_id": "CAND-000001",
+  "generated_at": "2026-05-15T15:00:00Z",
+  "generated_by": "catalogd@hostname",
+  "manifest": { /* the SubjectManifest */ },
+  "datasets": {
+    "workers_500k": {
+      "row_present": true,
+      "safe_view_projection": { /* candidates_safe row for this subject */ }
+    }
+  },
+  "audit_log_window": {
+    "from": "2026-01-01T00:00:00Z",
+    "to": "2026-05-15T15:00:00Z",
+    "rows": [ /* matching audit rows */ ]
+  },
+  "chain_verification": {
+    "verified": true,
+    "rows_checked": 42,
+    "chain_root": "sha256:..."
+  },
+  "completeness_attestation": "all dataset rows + audit log entries within the window per retention policy v1 are included",
+  "signature": "ed25519:..."
+}
+```
+
+The endpoint is auth-gated via a separate legal-tier credential (see §6). The response body is signed with an Ed25519 key separate from the HMAC chain key.
+
+---
+
+## 5. Implementation plan (this is the SMALL plan)
+
+This is the spec; the implementation is a small extension to catalogd. Estimated effort:
+
+| Step | Effort | What |
+|---|---|---|
+| **1** | 0.5d | Add `SubjectManifest` struct + JSON load/save in `crates/catalogd/src/subjects.rs`. Mirror the existing `views.rs` pattern. |
+| **2** | 0.5d | Add `SubjectAuditWriter` with HMAC chain in same file. Key loaded from sealed file at startup. |
+| **3** | 0.5d | Backfill subject manifests from `workers_500k.parquet` rows. ETL: one manifest per row, default `vertical=unknown`, `consent.general_pii.status=pending_backfill_review`. |
+| **4** | 0.5d | Wire the gateway tool registry to write audit rows. One audit row per candidate_id returned by search_candidates / get_candidate. |
+| **5** | 0.5d | Wire the validator WorkerLookup to write audit rows. |
+| **6** | 1d | `/audit/subject/{id}` HTTP endpoint in `crates/catalogd/src/service.rs`. Legal-tier auth. |
+| **7** | 0.5d | Daily retention sweep: subjects whose `retention.general_pii_until` < now AND `status != erased` get marked for review (don't auto-delete; legal needs to approve). |
+| **8** | 0.5d | Cross-runtime parity: Go side reads the same subject manifests + audit logs. Same shapes, same HMAC algorithm. |
+| **Total** | **~4 days** | Compared to 17-20 days for the IDENTITY_SERVICE_DESIGN approach. |
+
+Each step is one commit, one revert path. No new daemons. No cloud infrastructure. No Vault. No S3 Object Lock. No dual-control JWT split-secret ceremony.
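Step 7 is small enough to sketch in full. The version below assumes manifests are already loaded as dicts and only returns the identifiers that need legal review; how a subject is then "marked for review" (for example by moving `status` to `retention_expired`) is left open here because the spec does not pin it down.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def retention_sweep(manifests: list, now: datetime) -> list:
    """Return candidate_ids whose retention has lapsed; nothing is deleted here."""
    flagged = []
    for m in manifests:
        if m["status"] == "erased":
            continue  # already erased, nothing left to review
        if parse_ts(m["retention"]["general_pii_until"]) < now:
            flagged.append(m["candidate_id"])
    return flagged
```

Passing `now` in as a parameter keeps the sweep deterministic under test; a daily job would call it with `datetime.now(timezone.utc)`.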
+
+---
+
+## 6. Auth model
+
+Local-first, simple, defensible:
+
+- **Service-tier reads** (gateway tool registry resolving candidate names for fill scenarios): authenticated via the existing gateway internal credential. Audit row written.
+- **Legal-tier reads** (`/audit/subject/{id}`): requires a separate credential held in `/etc/lakehouse/legal_audit.token` (mode 0400, owner-only). Operators may load this only when fulfilling a legal request. The token is rotated per a documented runbook (operator + 1 witness; no cryptographic dual-control ceremony required for this scale).
+- **Backups**: subject manifests + audit logs are backed up daily. The HMAC signing key is backed up to a SEPARATE storage location (different physical/network boundary) so a single backup compromise doesn't enable forgery.
+
+This is not as strong as Vault Transit + dual-control JWT + S3 Object Lock external anchoring. It IS strong enough to pass a normal small-business compliance review and to be defensible in IL/IN small-claims-discovery contexts. If the staffing client's contract requires SOC2 Type II or formal HSM, that's a separate phase — but it's strictly additive on top of this v1 spec.
+
+---
+
+## 7. What this spec gives you (load-bearing)
+
+1. **Defensible response to discrimination discovery** (worked example: John Martinez at Warehouse B). The endpoint produces a complete + signed + chain-verified record of every PII access affecting him.
+2. **BIPA-compliant biometric tracking** when real photos arrive. The `consent.biometric` field + retention timeline are first-class, not bolted on later.
+3. **Per-subject right-to-be-forgotten** via cryptographic erasure of the subject manifest's audit-key entry + tombstoning of the candidate's dataset rows. (The ability to verify "row is gone, trail is preserved as anonymous audit-event of the erasure" is what GDPR Art. 17 + CCPA expect.)
+4. **HIPAA vertical routing** via the `vertical` field. Healthcare-vertical subjects (and `unknown` defaults) route to local-only models per PRD line 70, so no PHI egresses to the cloud.
+5. **Cross-runtime parity** via the simple JSON+HMAC format that Go can read identically.
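+
+The fail-closed routing in item 4 can be sketched as follows. The route labels, the `cloud_ok_verticals` allowlist, and the `"staffing"` vertical name are hypothetical placeholders. The only behavior taken from this spec is that `healthcare` and `unknown` (including a missing field) stay local-only.
+
+```python
+LOCAL_ONLY = "local_only"        # hypothetical route label
+CLOUD_ALLOWED = "cloud_allowed"  # hypothetical route label
+
+# Verticals that must never leave the local boundary, per this spec.
+ALWAYS_LOCAL = frozenset({"healthcare", "unknown"})
+
+
+def route_for_subject(manifest, cloud_ok_verticals):
+    """Pick a model route for a subject, failing closed.
+
+    A missing `vertical` is treated as "unknown". Anything not explicitly
+    allowlisted, and anything in ALWAYS_LOCAL, routes to local-only models.
+    """
+    vertical = manifest.get("vertical", "unknown")
+    if vertical in ALWAYS_LOCAL:
+        return LOCAL_ONLY
+    if vertical in cloud_ok_verticals:
+        return CLOUD_ALLOWED
+    return LOCAL_ONLY
+
+
+# "staffing" is an illustrative vertical name, not one defined by the spec.
+allow = {"staffing"}
+```
+
+Checking the deny set before the allowlist means a misconfigured allowlist cannot route healthcare subjects to the cloud.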
+
+---
+
+## 8. What this spec does NOT solve (and where it punts)
+
+- **Per-row encryption of dataset PII**: subject manifests + audit logs are local; the underlying `workers_500k.parquet` is not encrypted. If the staffing-client contract requires at-rest encryption, that's a separate concern handled at the storage tier (filesystem encryption, S3 SSE).
+- **Right to explanation (GDPR Art. 22 / EU AI Act)**: this spec captures decisions that touched a subject; it does not require the model to explain WHY each decision was made in human-readable form. Capturing model reasoning is a separate phase.
+- **Adverse-impact statistics**: the comparator-pool snapshot per fill (per `AUDIT_PHASE_1_DISCOVERY` §10/C3) needs its own writer in the fill pipeline. This spec gives you the per-subject record; it doesn't cross-aggregate selection rates by protected class.
+- **External tamper-evidence**: the HMAC chain is local. A motivated insider with access to both the audit log AND the signing key could rewrite history. For the staffing-client scale this is acceptable; for higher-stakes deployments a separate timestamping service or external transparency log would be additive.
+
+---
+
+## 9. Why this is the right shape for J's deployment
+
+- It builds on what catalogd ALREADY ships (manifests, views, tombstones — the Iceberg-shape layer).
+- It runs locally — no cloud infrastructure to license, monitor, audit, or pay for.
+- Its primitives are JSON files an operator can read with `cat` and `jq`. Tamper-evidence works without trusting opaque crypto APIs.
+- Its implementation is days, not weeks. The timeline matches the staffing-client launch window without forcing them to wait on a SaaS-tier identity service that doesn't fit their data residency posture.
+- It is COMPATIBLE with the IDENTITY_SERVICE_DESIGN v3 path if the staffing client later requires SOC2 Type II — the v1 subject manifests can be migrated into a separate identity daemon when the scale demands it. But that's deferred until demand exists, not built speculatively.
+
+This is the LOCAL-FIRST audit trail. It exists because the SaaS-tier version doesn't fit the deployment model J actually has.
+
+---
+
+## 10. Spec boundary
+
+**Stable in v1 (changing requires v2):**
+- File layout: `data/_catalog/subjects/<id>.json` + `data/_catalog/subjects/<id>.audit.jsonl`
+- JSON schemas in §2 + §3 (additive fields OK; renames/removals require v2)
+- HMAC algorithm: HMAC-SHA256 with key from sealed file
+- Chain semantics: `prev_chain_hash` references previous row's `row_hmac`
+- Vertical default: `unknown` with fail-closed routing
+- Consent state machine in §2.1
+- Audit-row write requirements in §3.2
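+
+To make the chain semantics concrete, here is a hedged sketch of appending and verifying rows. It uses the `prev_chain_hash` and `row_hmac` field names from the list above. The canonical serialization (sorted-key compact JSON), the empty-string genesis value, and the choice to cover every field except `row_hmac` are assumptions made for illustration; §3 governs the real format.
+
+```python
+import hashlib
+import hmac
+import json
+
+GENESIS = ""  # assumption: prev_chain_hash of the first row
+
+
+def row_hmac(key, row):
+    """HMAC-SHA256 over the row minus its own row_hmac (assumed canonical form)."""
+    body = {k: v for k, v in row.items() if k != "row_hmac"}
+    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
+    return hmac.new(key, payload, hashlib.sha256).hexdigest()
+
+
+def append_row(key, chain, event):
+    """Link a new event to the previous row's row_hmac and sign it."""
+    prev = chain[-1]["row_hmac"] if chain else GENESIS
+    row = dict(event, prev_chain_hash=prev)
+    row["row_hmac"] = row_hmac(key, row)
+    chain.append(row)
+    return row
+
+
+def verify_chain(key, chain):
+    """Return the index of the first broken row, or None if the chain is intact."""
+    prev = GENESIS
+    for i, row in enumerate(chain):
+        if row.get("prev_chain_hash") != prev:
+            return i  # link to the previous row is broken
+        if not hmac.compare_digest(row.get("row_hmac", ""), row_hmac(key, row)):
+            return i  # row contents do not match the signature
+        prev = row["row_hmac"]
+    return None
+```
+
+Because each row's HMAC covers `prev_chain_hash`, editing or removing a row breaks verification at that index unless the editor also holds the signing key, which is the limitation §8 describes under external tamper-evidence.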
+
+**Implementation-specific (free to change):**
+- Storage backend (file system v1; SQLite or Postgres acceptable as long as JSON shapes round-trip)
+- HTTP endpoint exact shape (body schema is spec; status codes / headers are implementation)
+- Backfill ETL details
+
+**Reserved for v2:**
+- Per-row encryption (when staffing-client contract requires it)
+- External tamper-evidence anchor (when SOC2 Type II is in scope)
+- Cross-tenant subject isolation (when multi-tenancy is in scope)
+
+---
+
+## Change log
+
+- 2026-05-03 — v1 initial draft. Builds on catalogd primitives that already exist. Smaller than IDENTITY_SERVICE_DESIGN v3 by ~4× because it doesn't propose a new daemon.