specs: pathway_memory v1 + subject_manifests_on_catalogd v1

Two specifications addressing the framing J asked for after reading
the llms3.com blog: standardize what we have so future work doesn't
drift, and apply the local-first thesis to the audit problem instead
of the over-scoped SaaS-tier identity service.

PATHWAY_MEMORY_SPEC.md (~400 lines):
  Documents the existing crates/vectord/src/pathway_memory.rs as a
  spec — the third metadata layer alongside catalogd's data metadata
  and playbook_memory's operational memory. Defines:
    - PathwayTrace wire format
    - pathway_id = SHA256(task_class | file_prefix | signal_class)
    - file_prefix algorithm (first 2 path segments)
    - pathway_vec: 32-bucket bag-of-tokens hash, fixed dim per spec
    - Lifecycle: insert → revise → replay → probation gate retire
    - Mem0 versioning (trace_uid + parent_trace_uid + version chain)
    - Access patterns: query_for_hotswap / query_by_vec / list_versions
    - PII risk surface (reducer_summary + final_verdict)
    - Spec boundary: stable in v1 vs implementation-specific
  No new architecture. Descriptive, not prescriptive.

SUBJECT_MANIFESTS_ON_CATALOGD.md (~400 lines):
  The local-first audit-trail spec. Adds a fourth manifest type to
  catalogd alongside dataset/view/tombstone/profile. NOT a separate
  identity daemon. NOT Vault/KMS/dual-control JWT. Builds on
  primitives catalogd already ships:
    - SubjectManifest at data/_catalog/subjects/<id>.json
    - Per-subject HMAC-chained audit JSONL
    - Daily retention sweep using existing tombstone primitives
    - Vertical-aware routing (healthcare → local-only)
    - Legal-tier credential separate from gateway internal auth
  ~4 days estimated implementation effort vs 17-20 days for the
  IDENTITY_SERVICE_DESIGN approach. Same defensibility for the
  staffing-client launch window. Compatible with the v3 design, which
  would be strictly additive on top of this spec if SOC2 Type II
  becomes a contract requirement.

These are SPECS — what the system already does (pathway) and what's
the smallest local-first thing that addresses the audit need
(subject manifests). Not 9-phase plans. Not new daemons.

The pathway spec is descriptive: writing down what exists so the
next person doesn't reinvent it. The subject-manifests spec is
prescriptive: J greenlights, implementation is days not weeks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
root 2026-05-03 03:07:38 -05:00
parent 991db7be1a
commit ed1fcd3c26
2 changed files with 573 additions and 0 deletions


@ -0,0 +1,307 @@
# Pathway Memory — Specification v1
**Status:** Draft v1 — 2026-05-03 · **Layer:** Decision metadata · **Implementation:** `crates/vectord/src/pathway_memory.rs`
> **What this is.** Pathway Memory is a third metadata layer that exists alongside (not on top of, not under) the data-metadata layer (catalogd's manifests/views/tombstones) and the operational-memory layer (playbook_memory). It tracks **decision patterns**: "this code path, in this task class, produced this outcome." It is the AI-substrate analog of what Iceberg manifests are to data tables — pointers + provenance for things you might want to retrieve again or learn from.
>
> **What it is NOT.** It is not a vector database. It is not a knowledge graph. It is not an audit log (audit is per-subject; pathways are per-code-pattern). It is not a cache. It is a small, opinionated metadata layer with a defined wire format, defined access patterns, and defined lifecycle rules.
>
> **Why a spec exists.** No standard tells us what this layer is. Each implementation that drifts from this spec creates the wiring confusion currently visible in the system (data/_kb/ JSONLs that don't get consumed, version drift across `workers_500k_v1..v10`, unclear who reads `reducer_summary` vs `final_verdict`). The spec is the standardization contract that lets a single implementation become the canonical one.
---
## 1. Conceptual model
A **pathway** is the identity-equivalence-class of code paths that exhibit similar decision behavior. Two distinct files may share a pathway if they belong to the same crate AND are reviewed under the same task class with the same signal class.
A **trace** is a single observed instance of a pathway: one review pass, with its inputs, intermediate signals, ladder attempts, audit outcome, and final verdict.
A pathway is **identified** by a stable hash of three keys: task class, file prefix, signal class. The hash is `pathway_id`.
A trace is **identified** within a pathway by a generated UUID: `trace_uid`. The pair `(pathway_id, trace_uid)` is globally unique.
Pathways are **append-only by trace**: every new observation creates a new trace under the existing pathway_id. Traces themselves are **mutable in version**: `revise()` produces a new version that supersedes the previous (Mem0-style chain).
Pathways are **retirable** by id: when replay history shows the pattern no longer holds, the pathway is marked retired and excluded from hot-swap. New pathways may emerge with fresh ids if the underlying code/task characteristics shift.
---
## 2. Wire format — `PathwayTrace`
Authoritative struct: `crates/vectord/src/pathway_memory.rs`. The fields below are the SPEC; the Rust struct is one implementation. All fields use JSON-serializable types when persisted.
### 2.1 — Identity fields (immutable per trace)
| Field | Type | Constraint |
|---|---|---|
| `pathway_id` | string | SHA256 hex of `task_class | file_prefix | signal_class` — see §3 |
| `trace_uid` | string | UUID. Empty on legacy traces; implementations from v2+ SHOULD populate it on insert. |
| `version` | u32 | Mem0 version chain. Default 1 on insert. Bumped on `revise()`. |
| `parent_trace_uid` | string \| null | trace_uid of the trace this one supersedes. null on root version. |
| `superseded_at` | string \| null | ISO-8601 UTC. Set when this trace becomes a non-head version. |
| `superseded_by_trace_uid` | string \| null | trace_uid of the new head when this version is superseded. |
### 2.2 — Identity-source fields (the inputs that fed pathway_id)
| Field | Type | Notes |
|---|---|---|
| `task_class` | string | Caller-defined class string (e.g. `scrum_review`, `staffing.fill`, `pr_audit`) |
| `file_path` | string | Full path of the file under review. Reduced to file_prefix for hashing — see §3.2. |
| `signal_class` | string \| null | Caller-defined behavior label (e.g. `CONVERGING`, `LOOPING`, `STUCK_RETRY`) |
### 2.3 — Observation fields (what happened)
| Field | Type | Description |
|---|---|---|
| `created_at` | RFC 3339 timestamp | When this trace was inserted. |
| `ladder_attempts` | array of `LadderAttempt` | Each model+rung attempt: `{rung, model, latency_ms, accepted, reject_reason?}`. Order is dispatch order. |
| `kb_chunks` | array of `KbChunkRef` | Knowledge-base chunks the reviewer was given as context: `{source_doc, chunk_id, cosine_score, rank}`. |
| `observer_signals` | array of `ObserverSignal` | Behavior labels emitted by observer during this pathway's run: `{class, priors, prior_iter_outcomes}`. |
| `bridge_hits` | array of `BridgeHit` | External-doc lookups (context7-style): `{library, version, result_summary}`. |
| `sub_pipeline_calls` | array of `SubPipelineCall` | Calls to other sub-pipelines (extract, validate, etc.) made during this trace. |
| `audit_consensus` | `AuditConsensus` \| null | Cross-lineage audit outcome if any: `{pass, models, disagreements}`. |
| `reducer_summary` | string | Free-text summary of what the reducer concluded. **PII risk surface — see §6.3.** |
| `final_verdict` | string | Free-text verdict label (`accepted`, `rejected`, `needs_review`, etc.). **PII risk surface — see §6.3.** |
### 2.4 — Index fields (the matrix retrieval shape)
| Field | Type | Description |
|---|---|---|
| `pathway_vec` | array of f32, length 32 | Bag-of-tokens hash embedding. Algorithm in §4. Fixed dimension per spec version. |
| `replay_count` | u32 | Number of times this pathway has been replayed via hot-swap. Initial insert is NOT a replay. |
| `replays_succeeded` | u32 | Replays where the post-replay outcome matched the trace's `final_verdict`. |
| `retired` | bool | Marked true when probation gate fires (§5.4). Excluded from hot-swap forever. |
### 2.5 — Optional semantic-correctness fields (ADR-021)
| Field | Type | Description |
|---|---|---|
| `semantic_flags` | array of `SemanticFlag` | One of: `UnitMismatch`, `TypeConfusion`, `NullableConfusion`, `OffByOne`, `StaleReference`, `PseudoImpl`, `DeadCode`, `WarningNoise`, `BoundaryViolation`. Caller-emitted, not auto-derived. |
| `type_hints_used` | array of `TypeHint` | Schema/type context fed to the reviewer for this trace: `{source, symbol, type_repr}`. |
| `bug_fingerprints` | array of `BugFingerprint` | Structural pattern hashes the reviewer caught: `{flag, pattern_key, example, occurrences}`. |
All §2.5 fields are additive — implementations MUST default them to empty when deserializing legacy traces.
---
## 3. Identity algorithms
### 3.1 — `pathway_id`
```
pathway_id = sha256_hex(task_class || "|" || file_prefix(file_path) || "|" || signal_class_or_empty)
```
- Concatenation uses literal byte `|` (0x7C) as separator.
- `signal_class_or_empty` is the empty string when signal_class is null.
- SHA-256 output is 64 hex chars, lowercase.
### 3.2 — `file_prefix`
```
file_prefix(path) = first two path segments joined by '/'
```
- Path separator is forward slash. Implementations on Windows MUST normalize to forward slashes before hashing.
- Files with fewer than two segments use the full path verbatim.
- Examples:
- `crates/queryd/src/service.rs` → `crates/queryd`
- `crates/gateway` → `crates/gateway`
- `README.md` → `README.md`
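
The two identity algorithms above can be sketched in a few lines of Python. This is a minimal illustration, not the Rust implementation: the UTF-8 encoding of the key string is an assumption (the spec does not name an encoding), and edge cases such as leading `/` or `./` are left unhandled because §3.2 does not define them.

```python
import hashlib
from typing import Optional


def file_prefix(path: str) -> str:
    """First two path segments joined by '/'; shorter paths come back whole."""
    normalized = path.replace("\\", "/")  # Windows separators are normalized first
    return "/".join(normalized.split("/")[:2])


def pathway_id(task_class: str, file_path: str, signal_class: Optional[str]) -> str:
    """Lowercase SHA-256 hex of task_class | file_prefix | signal_class."""
    # A null signal_class contributes the empty string, as §3.1 requires.
    key = "|".join([task_class, file_prefix(file_path), signal_class or ""])
    # Assumption: the key string is hashed as UTF-8 bytes.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Two files in the same crate, reviewed under the same task class and signal class, produce the same `pathway_id`, which is the grouping behavior §3.3 describes.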
### 3.3 — Why this fingerprint shape
The fingerprint groups files within a crate (same `file_prefix`) under the same pathway when reviewed for the same task class with the same signal class. This is intentional: it lets the matrix index recognize "the same kind of bug appears in this crate" without needing per-file traces. Per-file granularity is preserved in the `file_path` field for retrieval but is not part of the identity.
---
## 4. `pathway_vec` algorithm
Fixed dimension: **32 buckets**. Implementation: deterministic bag-of-tokens hash, normalized.
```
from hashlib import sha256
from math import sqrt

def build_pathway_vec(trace):
    buckets = [0.0] * 32
    tokens = collect_metadata_tokens(trace)  # see token list below
    for tok in tokens:
        h = sha256(tok.encode()).digest()  # raw bytes, so the first 4 can be sliced
        bucket = int.from_bytes(h[:4], 'big') % 32
        buckets[bucket] += 1.0
    norm = sqrt(sum(b * b for b in buckets))  # L2 norm of the bucket counts
    return [b / norm if norm > 0 else 0.0 for b in buckets]
```
**Tokens MUST include**:
- `task_class:<task_class>`
- `file_prefix:<file_prefix>`
- `signal_class:<signal_class_or_empty>`
- For each `ladder_attempts[i].model`: `model:<model>`
- For each `kb_chunks[i].source_doc`: `kb_doc:<source_doc>`
- For each `observer_signals[i].class`: `signal:<class>`
- For each `bug_fingerprints[i].flag`: `flag:<flag>`
**Tokens MUST NOT include**:
- `reducer_summary` text (free-form, would dominate the bucket distribution)
- `final_verdict` text
- `created_at` (would make every trace's vector unique)
- candidate_id, name, email, phone, or any subject identifier (PII risk)
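
The token rules above could be implemented roughly as follows. This is a sketch of the `collect_metadata_tokens` helper that `build_pathway_vec` calls, assuming the trace is a plain dict shaped like the §2 wire format; the token order shown here is an arbitrary choice, since bucket counts do not depend on order.

```python
from typing import List


def file_prefix(path: str) -> str:
    """First two path segments, as in §3.2."""
    return "/".join(path.replace("\\", "/").split("/")[:2])


def collect_metadata_tokens(trace: dict) -> List[str]:
    """Build the MUST-include tokens; free-text and PII fields are never read."""
    tokens = [
        f"task_class:{trace['task_class']}",
        f"file_prefix:{file_prefix(trace['file_path'])}",
        f"signal_class:{trace.get('signal_class') or ''}",
    ]
    # Array fields may be absent on legacy traces, so default to empty lists.
    tokens += [f"model:{a['model']}" for a in trace.get("ladder_attempts", [])]
    tokens += [f"kb_doc:{c['source_doc']}" for c in trace.get("kb_chunks", [])]
    tokens += [f"signal:{s['class']}" for s in trace.get("observer_signals", [])]
    tokens += [f"flag:{b['flag']}" for b in trace.get("bug_fingerprints", [])]
    return tokens
```

Because the function only reads an allow-list of fields, `reducer_summary`, `final_verdict`, `created_at`, and subject identifiers cannot leak into the vector by accident.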
**Why 32 dimensions:** the round-3 consensus ensemble selected "small metadata tokens, not full JSON." 32 buckets are sufficient to distinguish pathway combinations without requiring an external embedding model. An external embedding model would also work, but it adds a dependency, a failure mode, and drift risk that the consensus flagged. **Spec version v1 fixes dim=32. v2 may revisit.**
---
## 5. Lifecycle
### 5.1 — Insert
A trace is inserted by the consumer (typically the scrum pipeline or auditor) calling the implementation's `insert()` API.
- If no pathway with `pathway_id` exists, a new pathway bucket is created.
- If a pathway exists, the new trace is appended to that bucket's traces list.
- `trace_uid` is generated (UUID v7 RECOMMENDED for time-orderability).
- `version` is set to 1.
- `created_at` is set to the insertion timestamp.
- `pathway_vec` is computed and stored.
- `replay_count` and `replays_succeeded` start at 0.
### 5.2 — Revise
A trace is revised by calling `revise(trace_uid, new_fields)`.
- A new trace with a fresh `trace_uid` and `version = previous.version + 1` is inserted.
- The new trace's `parent_trace_uid` is set to the previous trace's `trace_uid`.
- The previous trace's `superseded_at` is set to the revision timestamp and `superseded_by_trace_uid` is set to the new trace's `trace_uid`.
- The previous trace remains in the bucket (history is preserved). Retrieval defaults to head versions only — see §7.
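
The revise rules can be illustrated with a small in-memory sketch. It assumes a bucket is a list of trace dicts; `uuid.uuid4()` stands in for the UUID v7 recommended in §5.1, and carrying every other field over unchanged (including `created_at` and the replay counters) is an assumption, since §5.2 does not specify it.

```python
import uuid
from datetime import datetime, timezone


def revise(bucket: list, trace_uid: str, new_fields: dict) -> dict:
    """Append a new head version and mark the previous trace as superseded."""
    previous = next(t for t in bucket if t["trace_uid"] == trace_uid)
    now = datetime.now(timezone.utc).isoformat()
    revised = {
        **previous,      # assumption: unspecified fields carry over unchanged
        **new_fields,
        # Identity fields are set last so new_fields cannot override them.
        "trace_uid": str(uuid.uuid4()),  # stand-in for UUID v7
        "version": previous["version"] + 1,
        "parent_trace_uid": previous["trace_uid"],
        "superseded_at": None,
        "superseded_by_trace_uid": None,
    }
    previous["superseded_at"] = now
    previous["superseded_by_trace_uid"] = revised["trace_uid"]
    bucket.append(revised)  # history is preserved; nothing is removed
    return revised
```

After a revision, filtering the bucket on `superseded_at is None` yields exactly the new head version.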
### 5.3 — Replay
A pathway is **eligible for replay** when `retired == false` AND the bucket has at least one head-version trace.
The hot-swap consumer queries by `pathway_id` (or by similarity over `pathway_vec`), retrieves the head trace, applies its `final_verdict` shape to a new but similar input, and then reports back via `record_replay(trace_uid, succeeded: bool)`.
- `replay_count` is incremented on every replay (success or failure).
- `replays_succeeded` is incremented only on success.
### 5.4 — Probation gate (retirement)
Probation triggers when `replay_count >= 3` AND `success_rate < 0.80`.
When triggered, the pathway is marked `retired = true`. It is excluded from hot-swap forever. If the underlying code/task characteristics genuinely change such that a retired pathway would work again, a NEW pathway with a FRESH id will be created (because the inputs to `pathway_id` would have changed, e.g., new signal_class). Retirement is per-id and irreversible.
`success_rate` = `replays_succeeded / replay_count` (0.0 if `replay_count == 0`).
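
The replay counters and the probation gate fit in a few lines. This sketch assumes the counters and the `retired` flag live on a single dict, and it takes that dict directly rather than a `trace_uid`, so the signature differs from the `record_replay(trace_uid, succeeded)` call named in §5.3.

```python
def success_rate(record: dict) -> float:
    """replays_succeeded / replay_count, or 0.0 before the first replay."""
    if record["replay_count"] == 0:
        return 0.0
    return record["replays_succeeded"] / record["replay_count"]


def record_replay(record: dict, succeeded: bool) -> None:
    """Update counters, then apply the probation gate from §5.4."""
    record["replay_count"] += 1
    if succeeded:
        record["replays_succeeded"] += 1
    # Gate: at least 3 replays AND success rate strictly below 0.80.
    if record["replay_count"] >= 3 and success_rate(record) < 0.80:
        record["retired"] = True  # never reset: retirement is irreversible
```

For example, two successes followed by one failure give 3 replays at a 2/3 success rate, which fires the gate, while four successes and one failure sit at exactly 0.80 and do not.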
---
## 6. Access patterns (read + write shapes)
### 6.1 — Write paths (who inserts traces)
Spec-conformant writers:
- Scrum pipeline (`tests/real-world/scrum_master_pipeline.ts` and downstream consumers) on review acceptance
- Auditor pipeline (`auditor/`) on verdict assembly
- Validator (`crates/validator/`) on iterate session completion
Each writer is responsible for setting the identity fields correctly. The implementation MUST reject writes where `task_class` is empty.
### 6.2 — Read paths (who consumes traces)
Spec-conformant readers:
- **Hot-swap retrieval**: `query_for_hotswap(task_class, file_path, signal_class, k=5)` returns the top-k head-version traces from matching pathways, ordered by `(success_rate desc, replay_count desc)`.
- **Similarity retrieval**: `query_by_vec(pathway_vec, k=10)` returns the top-k head-version traces by cosine similarity over `pathway_vec`.
- **Audit / forensic**: `list_versions(trace_uid_root)` returns the full version chain. **Must be authorized**: version chains can include reducer_summary text that may contain PII.
The implementation MUST NOT expose `list_versions` on an unauthenticated route.
### 6.3 — PII surface (load-bearing for audit-trail compliance)
The fields `reducer_summary` and `final_verdict` are free-form strings written from review output. They are **highly likely** to contain PII when the reviewed task class involved real candidates (per `AUDIT_PHASE_1_DISCOVERY.md` §1F + §10/C1). Until subject-redaction-on-write lands, implementations:
- MUST treat trace bodies as a suspected PII sink for any task class touching real subjects
- SHOULD redact known PII patterns (candidate_id, email, phone) before persisting
- SHOULD provide a per-trace `subject_ids[]` top-level field when known, so audit-response queries can filter by subject
Until the redaction layer ships, do not advertise pathway_memory as PII-clean.
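
As a rough illustration of the SHOULD-redact rule, a pattern-based pass over free-text fields might look like the sketch below. All three patterns are assumptions: the `CAND-<digits>` shape is taken from the example ids in the subject-manifest spec, the phone pattern only covers US-style 10-digit numbers, and pattern matching in general will miss names and other unstructured PII.

```python
import re

# Hypothetical patterns; a real deployment would need a reviewed list.
_PATTERNS = [
    (re.compile(r"\bCAND-\d+\b"), "[candidate_id]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[email]"),
    (
        re.compile(r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"),
        "[phone]",
    ),
]


def redact(text: str) -> str:
    """Replace known PII patterns with placeholder labels before persisting."""
    for pattern, label in _PATTERNS:
        text = pattern.sub(label, text)
    return text
```

A writer would apply `redact()` to `reducer_summary` and `final_verdict` before calling `insert()`. This reduces the PII surface but does not make the trace body PII-clean.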
---
## 7. Storage representation
The current implementation persists state to a single JSON file at `data/_pathway_memory/state.json`:
```
{
"pathways": {
"<pathway_id>": [
{ "trace_uid": "...", "version": 1, "task_class": "...", ... },
{ "trace_uid": "...", "version": 2, "parent_trace_uid": "...", ... }
]
},
"last_updated_at": "2026-04-29T05:43:00Z"
}
```
The flat-JSON representation is implementation-specific. A spec-conformant implementation could use SQLite, an append-only JSONL, or a Lance dataset. The spec dictates the wire shape per-trace and the access patterns; the storage backend is the implementation's choice.
**Retrieval defaults**: queries return head versions only by default (a head version is one with `superseded_at == null`). Consumers explicitly opt in to historical versions via an `include_history: true` flag.
---
## 8. Spec boundary — what's stable vs implementation-specific
### Stable (locked in v1; changing requires v2 spec)
- The `pathway_id` algorithm (SHA256 of `task_class | file_prefix | signal_class`)
- The `file_prefix` algorithm (first two path segments)
- `pathway_vec` dimension (32) and token list
- Field names and types in §2 (additive fields are OK; renames or removals require v2)
- Lifecycle rules in §5
- Access pattern names in §6 (`query_for_hotswap`, `query_by_vec`, `list_versions`)
### Implementation-specific (free to change without spec bump)
- Storage backend (JSON file, SQLite, Lance, etc.)
- HTTP/gRPC/library API surface
- Concurrency model
- In-memory caching strategy
- The exact text of `reducer_summary` (it's content, not schema)
- Token-bucket hash function (SHA-256 is specified in §4, but any well-distributed deterministic 32-bit hash is acceptable)
### Reserved for v2 (named so future implementations don't paint into a corner)
- Cross-pathway similarity (`query_pathways_similar_to_pathway`)
- External embedding model for `pathway_vec` (replaces the bag-of-tokens hash)
- Subject-aware indexing (so audit-response can find all pathways involving subject X)
- Differential privacy on `pathway_vec` aggregations
---
## 9. Open questions
These need decisions before v1 is final-final:
1. **Dimension lock-in**: 32 is round-3 consensus, but is it actually optimal? Worth a benchmark before final v1.
2. **Subject indexing in v1 vs v2**: deferring to v2 means audit queries today have to grep `reducer_summary` text. Is that acceptable for the staffing-client launch window?
3. **Cross-runtime parity**: the Go side has its own `internal/pathways/` implementation. As of 2026-05-03 the wire shape matches but no parity probe exists. Add `pathway_parity.sh` to the cross-runtime probe suite.
4. **Migration from current state.json**: existing 91 traces lack `trace_uid`. Spec says `trace_uid` is empty on legacy traces — should v1 require backfill or accept legacy-trace queries indefinitely?
---
## 10. Why this spec exists
Without it:
- Every new consumer of pathway_memory invents their own read/write conventions
- The Go-side reimplementation drifts from the Rust-side wire format
- Future LLM-assistant work proposes new fields without considering the existing schema
- Forensic queries grep free-text fields because no canonical retrieval exists
- The PII concern stays "TODO" because there's no stable surface to gate
With it:
- One implementation can be canonical (Rust, currently); others can be conformance-tested
- The wire shape is documented; serialization tests can pin it
- New fields go through an additive-only review (don't break legacy)
- The KbContext consumer in the gateway has a stable API to read against
- The audit-trail PRD has a defined PII surface to protect
This is the standardization the system was missing.
---
## Change log
- 2026-05-03 — v1 initial draft. Codifies what's already in `crates/vectord/src/pathway_memory.rs`. The spec is descriptive of the existing implementation, not prescriptive of new behavior.


@ -0,0 +1,266 @@
# Subject Manifests on Catalogd — Specification v1
**Status:** Draft v1 — 2026-05-03 · **Layer:** Catalogd extension (NOT a separate daemon) · **Implementation:** to be added to `crates/catalogd/src/`
> **What this is.** A small extension to `catalogd` adding a fourth manifest type — `subject` — alongside the existing dataset / view / tombstone / profile types. A subject manifest answers: "for person X, which datasets contain their PII, which views project it safely, what consent + retention applies, and what's the access log."
>
> **What it is NOT.** It is not a separate identity daemon. It is not a Postgres-backed identity service. It is not a HashiCorp-Vault-using KEK rotation system. It is not the `IDENTITY_SERVICE_DESIGN.md` v3 design (that doc is over-scoped for local-only — see its deprecation header). It is the smallest spec that gives you a defensible "show me everything we know about person X" capability for EEOC discovery / BIPA compliance, building on primitives catalogd ALREADY ships.
>
> **Why it can be small.** Catalogd already has dataset manifests with per-column `is_pii` flags. It already has views with `column_redactions` (working example: `candidates_safe.json`). It already has tombstones for deletes. It already has profiles for per-agent scoping. The audit-trail need is "thread these primitives together by subject identifier" — not "build a new system." That's a ~300-500 LOC extension, not a 17-20 day phase plan.
---
## 1. Conceptual model
A **subject** is a real person whose PII flows through the system. Identified by a stable token (current implementation: `candidate_id` in `workers_500k.parquet`).
A **subject manifest** is a JSON record under `data/_catalog/subjects/<candidate_id>.json` that points at:
- which datasets contain rows for this subject (via foreign-key reference)
- which views safely project this subject's data (via existing view manifest names)
- what consent + retention metadata applies to this subject
- what access log file holds this subject's audit trail
Subject manifests are written when a subject enters the system AND updated when a subject's consent changes, vertical reclassifies, or retention period expires.
The **audit log** is a per-subject append-only JSONL file at `data/_catalog/subjects/<candidate_id>.audit.jsonl`. Every PII access for that subject writes one row. The file is signed periodically (HMAC chain) for tamper-evidence.
---
## 2. Wire format — `SubjectManifest`
JSON document at `data/_catalog/subjects/<candidate_id>.json`:
```json
{
"schema": "subject_manifest.v1",
"candidate_id": "CAND-000001",
"created_at": "2026-05-15T12:00:00Z",
"updated_at": "2026-05-15T12:00:00Z",
"status": "active",
"vertical": "general",
"consent": {
"general_pii": {
"status": "given",
"version": "v1-2026-05-15",
"given_at": "2026-05-15T12:00:00Z"
},
"biometric": {
"status": "never_collected",
"retention_until": null
}
},
"retention": {
"general_pii_until": "2030-05-15T12:00:00Z",
"policy": "4_year_default"
},
"datasets": [
{ "name": "workers_500k", "key_column": "candidate_id", "key_value": "CAND-000001" },
{ "name": "candidates", "key_column": "candidate_id", "key_value": "CAND-000001" },
{ "name": "placements", "key_column": "candidate_id", "key_value": "CAND-000001" },
{ "name": "timesheets", "key_column": "candidate_id", "key_value": "CAND-000001" }
],
"safe_views": ["workers_safe", "candidates_safe"],
"audit_log_path": "data/_catalog/subjects/CAND-000001.audit.jsonl",
"audit_log_chain_root": "sha256:..."
}
```
### 2.1 — Field semantics
| Field | Required | Notes |
|---|---|---|
| `schema` | yes | Always `"subject_manifest.v1"`. Validates parser shape. |
| `candidate_id` | yes | Subject identifier. Stable token. Same value as appears in dataset key columns. |
| `status` | yes | `pending_consent` \| `active` \| `withdrawn` \| `retention_expired` \| `erased`. |
| `vertical` | yes | `unknown` \| `general` \| `healthcare` \| `finance` \| `other`. **Default `unknown`**; fail-closed routing treats `unknown` as healthcare-equivalent. |
| `consent.general_pii.status` | yes | `pending_backfill_review` \| `pending_first_contact` \| `given` \| `withdrawn` \| `expired`. |
| `consent.biometric.status` | yes | `never_collected` \| `pending` \| `given` \| `withdrawn` \| `expired`. |
| `retention.general_pii_until` | yes | ISO-8601. Drives daily expiration sweep. |
| `datasets[].name` | yes | References an existing catalogd dataset manifest by name. |
| `datasets[].key_column` | yes | The column in that dataset that contains the subject's identifier. |
| `datasets[].key_value` | yes | The specific value (the subject's id within that dataset's namespace). |
| `safe_views` | yes | Names of existing catalogd view manifests that safely project this subject's data (for non-legal-tier readers). |
| `audit_log_path` | yes | Relative path to the audit JSONL. |
| `audit_log_chain_root` | yes | The `row_hmac` of the most recent audit-log row, i.e. the head of the HMAC chain (§3.1). Updated by the audit-log writer on every write. |
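
The fail-closed behavior of the `vertical` field can be made concrete with a small check. This is a sketch under assumptions: the function name is hypothetical, the manifest is a plain dict, and treating unrecognized values the same as `unknown` goes slightly beyond what the table states.

```python
KNOWN_VERTICALS = {"unknown", "general", "healthcare", "finance", "other"}
LOCAL_ONLY_VERTICALS = {"healthcare", "unknown"}  # unknown is healthcare-equivalent


def requires_local_only(manifest: dict) -> bool:
    """True when this subject's data must stay on local-only models."""
    vertical = manifest.get("vertical", "unknown")  # a missing field defaults to unknown
    if vertical not in KNOWN_VERTICALS:
        return True  # assumption: unrecognized values also fail closed
    return vertical in LOCAL_ONLY_VERTICALS
```

The point of the default is that a backfilled manifest nobody has classified yet is routed as conservatively as a healthcare subject.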
---
## 3. Audit log format
Per-subject append-only JSONL at `data/_catalog/subjects/<candidate_id>.audit.jsonl`. One row per PII access:
```json
{
"schema": "subject_audit.v1",
"ts": "2026-05-15T13:30:00Z",
"candidate_id": "CAND-000001",
"accessor": {
"kind": "gateway_lookup",
"daemon": "gateway",
"purpose": "fill_validation",
"trace_id": "X-Lakehouse-Trace-Id-..."
},
"fields_accessed": ["name"],
"result": "success",
"prev_chain_hash": "sha256:...",
"row_hmac": "hmac-sha256:..."
}
```
### 3.1 — HMAC chain
Each row's `row_hmac` is `HMAC-SHA256(key, prev_chain_hash || canonical_json_of_row_minus_hmac)`. The signing key is loaded once at startup from `/etc/lakehouse/subject_audit.key` (mode 0400). The chain root in the subject manifest references the latest row's `row_hmac`.
Tamper-evidence verification is a single pass:
```
verify_chain(subject_id):
    manifest = read_subject_manifest(subject_id)
    rows = read_audit_log(subject_id)
    prev = "GENESIS"
    for row in rows:
        expected = hmac_sha256(key, prev || canonicalize(row - row_hmac_field))
        assert row.prev_chain_hash == prev
        assert row.row_hmac == expected
        prev = row.row_hmac
    assert manifest.audit_log_chain_root == prev
This is local (no S3 Object Lock, no Vault) but tamper-evident: any modification to a past row breaks the chain at that point and for all subsequent rows. The signing key being on disk is a real risk surface. Operators MUST set the file mode to 0400 (owner-only) and back the key up to a separate location from the audit logs themselves, so that a single backup doesn't carry both the logs and the key needed to forge them.
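
A runnable version of the chain, covering both append and verify, might look like the sketch below. It works on an in-memory list of rows rather than the JSONL file. The canonicalization (sorted keys, compact separators, UTF-8) and the `hmac-sha256:` hex prefix are assumptions, since the spec does not pin down `canonical_json` or the digest encoding.

```python
import hashlib
import hmac
import json

GENESIS = "GENESIS"


def _canonical(row: dict) -> bytes:
    """Assumed canonical form: the row without row_hmac, sorted keys, compact JSON."""
    body = {k: v for k, v in row.items() if k != "row_hmac"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def _row_hmac(key: bytes, prev: str, row: dict) -> str:
    mac = hmac.new(key, prev.encode("utf-8") + _canonical(row), hashlib.sha256)
    return "hmac-sha256:" + mac.hexdigest()


def append_row(key: bytes, rows: list, fields: dict) -> str:
    """Chain a new row onto the log and return the new chain root."""
    prev = rows[-1]["row_hmac"] if rows else GENESIS
    row = {**fields, "prev_chain_hash": prev}
    row["row_hmac"] = _row_hmac(key, prev, row)
    rows.append(row)
    return row["row_hmac"]


def verify_chain(key: bytes, rows: list, chain_root: str) -> bool:
    """Single pass: every link and every HMAC must match, then the root."""
    prev = GENESIS
    for row in rows:
        if row["prev_chain_hash"] != prev:
            return False
        if not hmac.compare_digest(row["row_hmac"], _row_hmac(key, prev, row)):
            return False
        prev = row["row_hmac"]
    return chain_root == prev
```

Editing any field of an earlier row changes its canonical bytes, so its recomputed HMAC no longer matches the stored `row_hmac` and verification fails at that row.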
### 3.2 — When the audit log is written
Every code path that resolves PII for a subject MUST write an audit row before returning. Specifically:
- The gateway's tool registry SQL templates (`crates/gateway/src/tools/registry.rs`) — when `search_candidates` / `get_candidate` queries return rows, write one audit row per returned candidate_id
- The validator's WorkerLookup (`crates/validator/src/staffing/parquet_lookup.rs`) — when a `lookup(candidate_id)` succeeds, write one audit row
- The audit-response endpoint (when implemented) — when `/audit/subject/{id}` is called, write one row of `kind=audit_response`
- Any new code path that touches PII
Write failures MUST NOT be silently swallowed: they MUST be logged at error level (per the existing observability fabric). Write failures also MUST NOT block the read; accept the audit gap and flag it for post-hoc review (better to lose an audit row than block legitimate operations).
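
The failure-handling rule can be sketched as a thin wrapper around a lookup. The function and parameter names here are hypothetical, and the two callables stand in for whatever lookup and audit writer a given code path uses.

```python
import logging

logger = logging.getLogger("subject_audit")


def read_with_audit(lookup, write_audit_row, candidate_id):
    """Resolve PII, attempt the audit write, and never let a write failure block the read."""
    record = lookup(candidate_id)
    if record is not None:
        try:
            write_audit_row(candidate_id)
        except Exception:
            # logger.exception logs at ERROR level with the traceback, so the
            # gap is visible for post-hoc review instead of being swallowed.
            logger.exception("audit write failed for %s", candidate_id)
    return record
```

The audit write is attempted before the function returns, and a failed write produces an error-level log entry while the caller still receives the record.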
---
## 4. The `/audit/subject/{candidate_id}` response
The audit response builds from the subject manifest + audit log + dataset projections:
```json
{
"schema": "subject_audit_response.v1",
"candidate_id": "CAND-000001",
"generated_at": "2026-05-15T15:00:00Z",
"generated_by": "catalogd@hostname",
"manifest": { /* the SubjectManifest */ },
"datasets": {
"workers_500k": {
"row_present": true,
"safe_view_projection": { /* candidates_safe row for this subject */ }
}
},
"audit_log_window": {
"from": "2026-01-01T00:00:00Z",
"to": "2026-05-15T15:00:00Z",
"rows": [ /* matching audit rows */ ]
},
"chain_verification": {
"verified": true,
"rows_checked": 42,
"chain_root": "sha256:..."
},
"completeness_attestation": "all dataset rows + audit log entries within the window per retention policy v1 are included",
"signature": "ed25519:..."
}
```
The endpoint is auth-gated via a separate legal-tier credential (see §6). The response body is signed with an Ed25519 key separate from the HMAC chain key.
---
## 5. Implementation plan (this is the SMALL plan)
This is the spec; the implementation is a small extension to catalogd. Estimated effort:
| Step | Effort | What |
|---|---|---|
| **1** | 0.5d | Add `SubjectManifest` struct + JSON load/save in `crates/catalogd/src/subjects.rs`. Mirror the existing `views.rs` pattern. |
| **2** | 0.5d | Add `SubjectAuditWriter` with HMAC chain in same file. Key loaded from sealed file at startup. |
| **3** | 0.5d | Backfill subject manifests from `workers_500k.parquet` rows. ETL: one manifest per row, default `vertical=unknown`, `consent.general_pii.status=pending_backfill_review`. |
| **4** | 0.5d | Wire the gateway tool registry to write audit rows. One audit row per candidate_id returned by search_candidates / get_candidate. |
| **5** | 0.5d | Wire the validator WorkerLookup to write audit rows. |
| **6** | 1d | `/audit/subject/{id}` HTTP endpoint in `crates/catalogd/src/service.rs`. Legal-tier auth. |
| **7** | 0.5d | Daily retention sweep: subjects whose `retention.general_pii_until` < now AND `status != erased` get marked for review (don't auto-delete; legal needs to approve). |
| **8** | 0.5d | Cross-runtime parity: Go side reads the same subject manifests + audit logs. Same shapes, same HMAC algorithm. |
| **Total** | **~4 days** | Compared to 17-20 days for the IDENTITY_SERVICE_DESIGN approach. |
Each step is one commit, one revert path. No new daemons. No cloud infrastructure. No Vault. No S3 Object Lock. No dual-control JWT split-secret ceremony.
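
Step 7, the daily retention sweep, reduces to a filter over the manifests. This is a sketch assuming manifests are already loaded as dicts in the §2 shape; the function name is hypothetical, and it only reports ids for review, since the plan explicitly rules out auto-deletion.

```python
from datetime import datetime, timezone


def subjects_due_for_review(manifests, now=None):
    """Return candidate_ids whose retention has lapsed and who are not yet erased."""
    now = now or datetime.now(timezone.utc)
    due = []
    for m in manifests:
        # Older Python versions reject a trailing 'Z' in fromisoformat,
        # so it is rewritten as an explicit UTC offset first.
        until = datetime.fromisoformat(
            m["retention"]["general_pii_until"].replace("Z", "+00:00")
        )
        if until < now and m["status"] != "erased":
            due.append(m["candidate_id"])
    return due
```

The returned ids would be handed to whatever review queue legal uses; the sweep itself changes nothing on disk.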
---
## 6. Auth model
Local-first, simple, defensible:
- **Service-tier reads** (gateway tool registry resolving candidate names for fill scenarios): authenticated via the existing gateway internal credential. Audit row written.
- **Legal-tier reads** (`/audit/subject/{id}`): requires a separate credential held in `/etc/lakehouse/legal_audit.token` (mode 0400, owner-only). Operators may load this only when fulfilling a legal request. The token is rotated per a documented runbook (operator + 1 witness; no cryptographic dual-control ceremony required for this scale).
- **Backups**: subject manifests + audit logs are backed up daily. The HMAC signing key is backed up to a SEPARATE storage location (different physical/network boundary) so a single backup compromise doesn't enable forgery.
This is not as strong as Vault Transit + dual-control JWT + S3 Object Lock external anchoring. It IS strong enough to pass a normal small-business compliance review and to be defensible in IL/IN small-claims-discovery contexts. If the staffing client's contract requires SOC2 Type II or formal HSM, that's a separate phase — but it's strictly additive on top of this v1 spec.
---
## 7. What this spec gives you (load-bearing)
1. **Defensible response to discrimination discovery** (worked example: John Martinez at Warehouse B). The endpoint produces a complete + signed + chain-verified record of every PII access affecting him.
2. **BIPA-compliant biometric tracking** when real photos arrive. The `consent.biometric` field + retention timeline are first-class, not bolted on later.
3. **Per-subject right-to-be-forgotten** via cryptographic erasure of the subject manifest's audit-key entry + tombstoning of the candidate's dataset rows. (The ability to verify "row is gone, trail is preserved as anonymous audit-event of the erasure" is what GDPR Art. 17 + CCPA expect.)
4. **HIPAA vertical routing** via the `vertical` field. Healthcare-vertical subjects (and `unknown` defaults) route to local-only models per PRD line 70 — no PHI to cloud egress.
5. **Cross-runtime parity** via the simple JSON+HMAC format that Go can read identically.
---
## 8. What this spec does NOT solve (and where it punts)
- **Per-row encryption of dataset PII**: subject manifests + audit logs are local; the underlying `workers_500k.parquet` is not encrypted. If staffing-client contract requires at-rest encryption, that's a separate concern handled at the storage tier (filesystem encryption, S3 SSE).
- **Right to explanation (GDPR Art. 22 / EU AI Act)**: this spec captures decisions that touched a subject; it does not require the model to explain WHY each decision was made in human-readable form. That's a separate Phase capturing model reasoning.
- **Adverse-impact statistics**: the comparator-pool snapshot per fill (per `AUDIT_PHASE_1_DISCOVERY` §10/C3) needs its own writer in the fill pipeline. This spec gives you the per-subject record; it doesn't cross-aggregate selection rates by protected class.
- **External tamper-evidence**: the HMAC chain is local. A motivated insider with access to both the audit log AND the signing key could rewrite history. For the staffing-client scale this is acceptable; for higher-stakes deployments a separate timestamping service or external transparency log would be additive.
---
## 9. Why this is the right shape for J's deployment
- It builds on what catalogd ALREADY ships (manifests, views, tombstones — the Iceberg-shape layer).
- It runs locally — no cloud infrastructure to license, monitor, audit, or pay for.
- Its primitives are JSON files an operator can read with `cat` and `jq`. Tamper-evidence works without trusting opaque crypto APIs.
- Its implementation is days, not weeks. The timeline matches the staffing-client launch window without forcing them to wait on a SaaS-tier identity service that doesn't fit their data residency posture.
- It is COMPATIBLE with the IDENTITY_SERVICE_DESIGN v3 path if the staffing client later requires SOC2 Type II — the v1 subject manifests can be migrated into a separate identity daemon when the scale demands it. But that's deferred until demand exists, not built speculatively.
This is the LOCAL-FIRST audit trail. It exists because the SaaS-tier version doesn't fit the deployment model J actually has.
---
## 10. Spec boundary
**Stable in v1 (changing requires v2):**
- File layout: `data/_catalog/subjects/<id>.json` + `data/_catalog/subjects/<id>.audit.jsonl`
- JSON schemas in §2 + §3 (additive fields OK; renames/removals require v2)
- HMAC algorithm: HMAC-SHA256 with key from sealed file
- Chain semantics: `prev_chain_hash` references previous row's `row_hmac`
- Vertical default: `unknown` with fail-closed routing
- Consent state machine in §2.1
- Audit-row write requirements in §3.2
**Implementation-specific (free to change):**
- Storage backend (file system v1; SQLite or Postgres acceptable as long as JSON shapes round-trip)
- HTTP endpoint exact shape (body schema is spec; status codes / headers are implementation)
- Backfill ETL details
**Reserved for v2:**
- Per-row encryption (when staffing-client contract requires it)
- External tamper-evidence anchor (when SOC2 Type II in scope)
- Cross-tenant subject isolation (when multi-tenant in scope)
---
## Change log
- 2026-05-03 — v1 initial draft. Builds on catalogd primitives that already exist. Smaller than IDENTITY_SERVICE_DESIGN v3 by ~4× because it doesn't propose a new daemon.