lakehouse/docs/runbooks/LEGAL_AUDIT_KEY_ROTATION.md
root b2c34b80b3 phase 1.6: lock Gate 3b = C, reconcile docs to shipped state, fix double-upload file leak
Four threads landing together — all driven by the audit J asked for before
production cutover.

(1) Gate 3b DECIDED: Option C (defer classifications). `BiometricCollection.classifications`
    stays `Option<JSON> = None` in v1. `docs/specs/GATE_3B_DEEPFACE_DESIGN.md` status
    flipped from "draft / awaits product" to DECIDED. Consent template + retention
    schedule revised to remove all "automated facial-classification" / "deepface"
    language so disclosed scope matches implemented scope.

(2) Endpoint-path drift reconciled across 3 docs. `PHASE_1_6_BIPA_GATES.md`,
    `BIPA_DESTRUCTION_RUNBOOK.md`, and `biometric_retention_schedule_v1.md` had
    references to legacy `/v1/identity/subjects/*` paths (proposed under a separate
    identityd daemon, never shipped) — corrected to actual shipped routes
    `/biometric/subject/*` (catalogd-local). Schema block in PHASE_1_6_BIPA_GATES
    rewritten to reflect JSON `SubjectManifest.biometric_collection` substrate
    (not the proposed Postgres `subjects` table).

(3) New operational artifacts:
    - `scripts/staffing/verify_biometric_erasure.sh` — checks 4 things post-erasure
      (manifest cleared, uploads dir empty, audit row matches, chain verified).
      Smoke-tested live against WORKER-2.
    - `scripts/staffing/biometric_destruction_report.sh` — monthly anonymized
      destruction-event aggregation. Smoke-tested clean.
    - `scripts/staffing/bundle_counsel_packet.sh` — tarballs the counsel-review
      packet with per-file SHA-256 manifest.
    - `docs/runbooks/LEGAL_AUDIT_KEY_ROTATION.md` — formal rotation procedure
      operationalized after the 2026-05-05 /tmp wipe incident.
    - `docs/counsel/COUNSEL_REVIEW_PACKET_2026-05-05.md` — cover note bundling
      all eng-staged BIPA docs for counsel review with per-doc questions, sign-off
      checklist, recommended review sequence.

(4) Double-upload file leak fixed in `crates/catalogd/src/biometric_endpoint.rs`.
    `verify_biometric_erasure.sh` smoked WORKER-2 and surfaced a stranded photo
    file. Investigation showed the file was 13-byte test-fixture bytes (zero PII,
    no biometric content); audit timeline showed two consecutive uploads followed
    by one erasure — the second upload had silently overwritten manifest.data_path,
    orphaning the first file. Patched `process_upload` to refuse a second upload
    with HTTP 409 + `error: "biometric_already_collected"` when
    `biometric_collection.is_some()` on the manifest. Operator must explicitly
    POST `/biometric/subject/{id}/erase` first.

    Tests: new `second_upload_without_erase_returns_409` (asserts 409 + manifest
    pointer unchanged + first file untouched on disk). Replaced
    `repeated_uploads_grow_the_chain` with `upload_erase_upload_grows_the_chain_cleanly`
    (covers the legitimate re-collection cycle: chain grows to 3 rows). Updated
    `content_type_with_parameters_accepted` to use 2 distinct subjects (was
    using 1 subject with 2 uploads to test ct parsing — would now 409).

    22/22 biometric_endpoint tests + 59/59 catalogd lib tests green post-patch.

Production posture: gateway needs `cargo build --release -p gateway` +
`systemctl restart lakehouse.service` to pick up the new 409 in live traffic.

Counsel calendar is now the only remaining blocker for first real-photo intake.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 06:19:40 -05:00


# Legal-Tier Audit Key & Token Rotation Runbook

**Spec companion:** `docs/PHASE_1_6_BIPA_GATES.md` §2 + `docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md`

**Audience:** Operators with root on the gateway host (J + named operators)

**Status:** Engineering-authored; ⚖ counsel review encouraged before formal adoption

> This runbook covers rotation of the two crypto-credentials that gate
> the Phase 1.6 audit substrate:
>
> 1. **`LH_SUBJECT_AUDIT_KEY`** — the 32-byte HMAC-SHA256 signing key
> that chains every per-subject audit row. If this key changes, all
> pre-rotation chain rows tamper-detect under the new key. That is
> correct, expected, BIPA-defensible behavior — the chain integrity
> it provided pre-rotation remains intact in the archive of the old
> key, and post-rotation chains remain intact going forward.
>
> 2. **`LH_LEGAL_AUDIT_TOKEN`** — the 32+-character bearer token that
> authorizes calls to `/audit/subject/{id}` and
> `/biometric/subject/{id}/erase`. Rotation does NOT touch any audit
> history; only access to the legal-tier endpoints flips.
>
> Both live at `/etc/lakehouse/` (mode 0400, owned by root) and are
> loaded by the gateway via systemd `Environment=` directives in
> `/etc/systemd/system/lakehouse.service.d/audit_env.conf`. They are
> NOT loaded from `/tmp` — a 2026-05-05 reboot incident wiped a
> `/tmp`-resident key and caused `/audit` + `/biometric` to fail-closed
> (which is what they should do); the rotation fix moved them to the
> persistent path.
---
## 1. When to rotate
Rotate when any of the following is true, within the urgency window shown:
| Trigger | Urgency | Notes |
|---|---|---|
| Suspected operator credential compromise | Within 1 hour | Token mismatch is fail-closed by default; immediate rotation closes the window. |
| Operator with legal-tier access leaves the team | Within 24 hours | Treat as compromise. |
| Key/token file's filesystem permissions were ever weakened (mode > 0400, group readable, etc.) | Within 24 hours | Filesystem audit may have leaked the bytes. |
| Token was ever transmitted over an untrusted channel (printed in CI log, sent over SMS, etc.) | Within 24 hours | Same reasoning. |
| Scheduled rotation (recommended) | Every 90 days | BIPA does not mandate a rotation cadence; counsel may set one. |

Do **not** rotate when:
- A subject's audit chain tamper-detects in isolation. That is normal
if the audit log was edited (which would itself be the BIPA finding,
not the key). Investigate the chain, not the key.
- Cross-runtime parity drift appears. That's an HMAC-input-shape bug
(Go vs Rust serialization), not a key issue. See
`STATE_OF_PLAY.md` "three runtime-divergence classes" entry.
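
The two file-level triggers in the table (weakened permissions and the 90-day schedule) can be checked mechanically. The following is a minimal sketch, assuming GNU `stat` and `find`; `cred_needs_rotation` is a hypothetical helper that is not part of the repository, and it uses the file's modification time as a stand-in for the last rotation date. It exits 0 when rotation is indicated and 1 when the file looks fine. Run it as root against the real files, for example `cred_needs_rotation /etc/lakehouse/subject_audit.key`.

```bash
# Sketch only: cred_needs_rotation is a hypothetical helper, not repo code.
# Usage: cred_needs_rotation <file> [expected_owner] [max_age_days]
# Exit status: 0 = rotate, 1 = file looks fine, 2 = could not inspect file.
cred_needs_rotation() {
  local file="$1" owner="${2:-root}" max_days="${3:-90}"
  local mode actual_owner
  mode=$(stat -c '%a' "$file") || return 2
  actual_owner=$(stat -c '%U' "$file") || return 2

  # Anything other than 0400 with the expected owner counts as weakened.
  if [ "$mode" != "400" ] || [ "$actual_owner" != "$owner" ]; then
    echo "ROTATE: $file is mode $mode owner $actual_owner (want 400 $owner)"
    return 0
  fi
  # find prints the path only if the file was modified more than
  # max_days days ago (mtime is an assumed proxy for the last rotation).
  if [ -n "$(find "$file" -mtime +"$max_days" -print)" ]; then
    echo "ROTATE: $file is older than $max_days days"
    return 0
  fi
  echo "OK: $file"
  return 1
}
```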
---
## 2. Pre-rotation checks (5 minutes)
Before generating new credentials, capture a clean baseline so you can
prove the rotation cause and sequence afterward.
### 2.1. Take the engineering snapshot
```bash
# Confirm the canonical files exist with correct permissions.
ls -la /etc/lakehouse/subject_audit.key /etc/lakehouse/legal_audit.token
# Hash the existing key + token (NEVER the bytes themselves) so the
# old credential is identifiable in retrospect without storing it.
# The files are mode 0400 and owned by root, so reading them needs sudo.
sudo sha256sum /etc/lakehouse/subject_audit.key
sudo sha256sum /etc/lakehouse/legal_audit.token
# Confirm the gateway is currently using these files.
sudo systemctl cat lakehouse.service | grep -E "Environment.*AUDIT"
# Verify the audit endpoint is healthy with the current credentials.
curl -sf http://localhost:3100/audit/health
```
If `/audit/health` is already 503, the rotation is **recovery**, not
preventive — note this in the rotation event record (§5).
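
The hashes are needed again in §5, so it can help to write them to a file instead of copying them from the terminal. The following is a minimal sketch; `snapshot_fingerprints` and the output path are hypothetical, and only hashes are written, never the credential bytes.

```bash
# Sketch only: snapshot_fingerprints is a hypothetical helper, not repo code.
# Usage: snapshot_fingerprints <out_file> <cred_file>...
# Appends one "<utc timestamp>  <sha256>  <path>" line per credential file.
snapshot_fingerprints() {
  local out="$1" f line
  shift
  for f in "$@"; do
    # Capture sha256sum's own exit status so an unreadable file is an error.
    line=$(sha256sum "$f") || return 1
    printf '%s  %s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${line%% *}" "$f" >> "$out"
  done
}
```

Reading the mode 0400 files requires root, so on the gateway host this would run from a root shell, for example `snapshot_fingerprints /root/pre_rotation_hashes.txt /etc/lakehouse/subject_audit.key /etc/lakehouse/legal_audit.token` (the `/root` path is an assumption).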
### 2.2. Capture a known-good chain root
Pick one or two subjects with non-empty audit logs and record their
chain roots **under the current key**:
```bash
# The token file is mode 0400 and owned by root, so reading it needs sudo.
TOKEN=$(sudo cat /etc/lakehouse/legal_audit.token)
for cid in WORKER-2 WORKER-100; do
  curl -sf -H "X-Lakehouse-Legal-Token: $TOKEN" \
    "http://localhost:3100/audit/subject/$cid" \
    | jq '{cid: .candidate_id, verified: .audit_log.chain_verified, root: .audit_log.chain_root, rows: .audit_log.chain_rows_total}'
done
```
Save the output. Post-rotation, those chains will tamper-detect under
the new key — that is **expected** and the saved snapshot is the proof
that the chain WAS intact under the old key, before rotation.

---
## 3. Generation + rotation
### 3.1. Generate the new key
```bash
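# 33 random bytes base64-encode to exactly 44 characters with no padding,
# which matches the existing 44-char convention. (32 bytes as hex would be
# 64 chars; either format works as an HMAC-SHA256 key.)
# The output is piped into /dev/stdin because sudo closes inherited file
# descriptors above 2, so `sudo install <(...)` cannot read a process
# substitution from the calling shell.
openssl rand -base64 33 | tr -d '\n=' | head -c 44 \
  | sudo install -m 0400 -o root -g root /dev/stdin /etc/lakehouse/subject_audit.key.new
openssl rand -base64 33 | tr -d '\n=' | head -c 44 \
  | sudo install -m 0400 -o root -g root /dev/stdin /etc/lakehouse/legal_audit.token.new
# Sanity: confirm 44-char content + correct mode.
sudo wc -c /etc/lakehouse/subject_audit.key.new /etc/lakehouse/legal_audit.token.new
sudo ls -la /etc/lakehouse/subject_audit.key.new /etc/lakehouse/legal_audit.token.new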
```
Both must be `mode 0400`, owned by root, exactly **44 chars** (the
audit endpoint refuses tokens shorter than 32 chars at load — see
`crates/catalogd/src/audit_endpoint.rs:73`).
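
The length requirement can be checked before the swap rather than discovered at gateway load. The following is a minimal sketch; `validate_new_cred` is a hypothetical helper, and the base64-alphabet check is an assumption based on how §3.1 generates the files, not a rule taken from the gateway code.

```bash
# Sketch only: validate_new_cred is a hypothetical helper, not repo code.
# Usage: validate_new_cred <file> [expected_bytes]
# Exit status: 0 = looks valid, 1 = wrong size or alphabet, 2 = unreadable.
validate_new_cred() {
  local file="$1" want="${2:-44}" size extra
  size=$(wc -c < "$file") || return 2
  # A stray trailing newline shows up here as 45 bytes instead of 44.
  if [ "$size" -ne "$want" ]; then
    echo "FAIL: $file is $size bytes, want $want"
    return 1
  fi
  # Delete every allowed character; anything left over is not base64.
  extra=$(LC_ALL=C tr -d 'A-Za-z0-9+/' < "$file" | wc -c)
  if [ "$extra" -ne 0 ]; then
    echo "FAIL: $file has $extra bytes outside the base64 alphabet"
    return 1
  fi
  echo "OK: $file"
}
```

From a root shell this would be called once per staged file, for example `validate_new_cred /etc/lakehouse/subject_audit.key.new`.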
### 3.2. Atomic swap
The gateway reads these files **once at boot** (per
`crates/catalogd/src/audit_endpoint.rs::AuditEndpointState::new` and
the equivalent for the writer). Atomic mv → restart is required.
```bash
# Move the old credentials to a quarantine path with timestamp so the
# old hashes remain identifiable post-rotation.
TS=$(date -u +%Y%m%dT%H%M%SZ)
sudo mkdir -p /etc/lakehouse/_archived
sudo install -d -m 0700 -o root -g root /etc/lakehouse/_archived
sudo mv /etc/lakehouse/subject_audit.key /etc/lakehouse/_archived/subject_audit.key.$TS
sudo mv /etc/lakehouse/legal_audit.token /etc/lakehouse/_archived/legal_audit.token.$TS
sudo mv /etc/lakehouse/subject_audit.key.new /etc/lakehouse/subject_audit.key
sudo mv /etc/lakehouse/legal_audit.token.new /etc/lakehouse/legal_audit.token
sudo ls -la /etc/lakehouse/subject_audit.key /etc/lakehouse/legal_audit.token
```
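The four `mv` commands leave a short window in which the live path does not exist, and a typo halfway through can strand the gateway without a credential. The following is a minimal sketch of the same archive-then-promote sequence with a guard and a rollback; `swap_credential` is a hypothetical helper, and it is written without `sudo` so that it can be rehearsed in a scratch directory before being run from a root shell.

```bash
# Sketch only: swap_credential is a hypothetical helper, not repo code.
# Usage: swap_credential <dir> <name> <timestamp>
# Archives <dir>/<name> to <dir>/_archived/<name>.<timestamp>, then
# promotes <dir>/<name>.new to <dir>/<name>.
swap_credential() {
  local dir="$1" name="$2" ts="$3"
  local live="$dir/$name" new="$dir/$name.new" archive="$dir/_archived"

  # Refuse to start unless the replacement is staged and non-empty;
  # otherwise the live file would be archived with nothing to replace it.
  [ -s "$new" ] || { echo "missing or empty: $new" >&2; return 1; }

  mkdir -p "$archive" && chmod 0700 "$archive" || return 1
  if [ -e "$live" ]; then
    mv "$live" "$archive/$name.$ts" || return 1
  fi
  # If the final rename fails, put the old file back so the gateway
  # can still boot with the previous credential.
  if ! mv "$new" "$live"; then
    [ -e "$archive/$name.$ts" ] && mv "$archive/$name.$ts" "$live"
    return 1
  fi
}
```

On the gateway host the equivalent call would be `swap_credential /etc/lakehouse subject_audit.key "$TS"`, followed by the same call for `legal_audit.token`.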
### 3.3. Restart the gateway
```bash
sudo systemctl restart lakehouse.service
sleep 2
sudo systemctl status lakehouse.service --no-pager | head -10
```
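A fixed `sleep 2` can be too short on a cold start. The following is a minimal sketch of a bounded retry loop that could replace it; `wait_until` is a hypothetical helper, and the probe command is passed in so the loop itself makes no assumption about the gateway. On the gateway host the probe would be the health check from §2.1, for example `wait_until 15 1 curl -sf http://localhost:3100/audit/health`.

```bash
# Sketch only: wait_until is a hypothetical helper, not repo code.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
# Runs the command up to <attempts> times, sleeping between failures.
# Exit status: 0 as soon as the command succeeds, 1 if it never does.
wait_until() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```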
Wait for the gateway to bind port 3100 cleanly. If it doesn't, check
`journalctl -u lakehouse.service -n 50 --no-pager` for the failure
mode; the most common cause is the new file having the wrong mode or owner.

---
## 4. Post-rotation verification (5 minutes)
### 4.1. Health probes
```bash
# Audit endpoint must be 200, not 503.
curl -sf http://localhost:3100/audit/health
# Expect: "audit endpoint ready"
# /v1/health must list the gateway's full provider set.
curl -sf http://localhost:3100/v1/health | jq '.providers, .worker_count'
```
### 4.2. Confirm the new token works
```bash
# The token file is mode 0400 and owned by root, so reading it needs sudo.
NEW_TOKEN=$(sudo cat /etc/lakehouse/legal_audit.token)
curl -sS -o /dev/null -w '%{http_code}\n' \
  -H "X-Lakehouse-Legal-Token: $NEW_TOKEN" \
  http://localhost:3100/audit/subject/WORKER-100
# Expect: 200
```
If the response is 401, the token the gateway loaded does NOT match the
file you wrote. Check the file's ownership and mode, and look for stray
trailing whitespace with
`sudo hexdump -C /etc/lakehouse/legal_audit.token | head`.
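
The status codes this runbook mentions can be turned into a quick triage step. The following is a minimal sketch; `explain_audit_status` is a hypothetical helper, and its messages only restate the causes described in §4.2 and §6.

```bash
# Sketch only: explain_audit_status is a hypothetical helper, not repo code.
# Usage: explain_audit_status <http_code>
# Exit status: 0 for a code this runbook explains, 1 for anything else.
explain_audit_status() {
  case "$1" in
    200) echo "ok: gateway accepted the token" ;;
    401) echo "token mismatch: gateway loaded different bytes than the file holds" ;;
    503) echo "fail-closed: gateway started without a usable key or token" ;;
    *)   echo "unexpected status $1: check journalctl -u lakehouse.service"
         return 1 ;;
  esac
}
```

It can wrap the probe above, for example `explain_audit_status "$(curl -sS -o /dev/null -w '%{http_code}' -H "X-Lakehouse-Legal-Token: $NEW_TOKEN" http://localhost:3100/audit/subject/WORKER-100)"`.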
### 4.3. Confirm the new chain works
Append-only chains are key-tied. Any *new* audit row written
post-rotation is signed under the new key and verifies cleanly:
```bash
# Issue a /v1/validate call against any worker — it spawns an audit row.
curl -sf -X POST http://localhost:3100/v1/validate \
-H 'Content-Type: application/json' \
-d '{"mode":"fill","candidate_id":"WORKER-100","worker_id":"WORKER-100","fields":["exists"]}' >/dev/null
# Read the chain back. Last row must verify under the new key.
curl -sf -H "X-Lakehouse-Legal-Token: $NEW_TOKEN" \
http://localhost:3100/audit/subject/WORKER-100 \
| jq '.audit_log | {verified: .chain_verified, rows: .chain_rows_total, last_kind: .rows[-1].accessor.kind}'
```
`chain_verified: true` confirms the new key is signing + verifying.
### 4.4. Confirm pre-rotation chains tamper-detect (expected)
```bash
curl -sf -H "X-Lakehouse-Legal-Token: $NEW_TOKEN" \
http://localhost:3100/audit/subject/WORKER-2 \
| jq '.audit_log | {verified: .chain_verified, error: .chain_verification_error}'
```
For any subject whose chain was written under the old key, this
returns `chain_verified: false` with an HMAC-mismatch error. **This
is correct behavior**, not a bug. The old chain was correctly signed
under the old key and verified under it; the new key cannot retroactively
verify rows it didn't sign. The pre-rotation snapshot you captured in
§2.2 is the defensible proof that those rows WERE valid pre-rotation.
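
The reason an old row cannot verify under a new key can be reproduced with any HMAC tool. The following is a minimal illustration using `openssl dgst`; the keys and the row payload are made-up stand-ins, and the gateway's actual row serialization and chaining are not shown here.

```bash
# Illustration only: the keys and payload below are stand-ins, not the
# gateway's real key material or audit-row format.
# Usage: hmac_hex <key> <message>   -> prints the hex HMAC-SHA256 tag
hmac_hex() {
  # The tag is the last field of the output across OpenSSL versions.
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

row='{"kind":"example","seq":1}'
old_tag=$(hmac_hex 'old-key-example' "$row")   # tag stored when the row was written
new_tag=$(hmac_hex 'new-key-example' "$row")   # tag recomputed after rotation

# Recomputing under the same key reproduces the stored tag.
[ "$old_tag" = "$(hmac_hex 'old-key-example' "$row")" ] && echo "old key: tag matches"
# Recomputing under a different key does not, so verification fails.
[ "$new_tag" != "$old_tag" ] && echo "new key: tag differs"
```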
If, instead, a chain that *should* verify post-rotation returns
`verified: false`, the rotation has gone wrong. The likely cause is an
old-key file that didn't get archived cleanly. Restore the previous
credentials from `/etc/lakehouse/_archived/subject_audit.key.<ts>` and
`/etc/lakehouse/_archived/legal_audit.token.<ts>` (the names written in
§3.2), then re-attempt.

---
## 5. Record the rotation event
Append a row to the rotation log:
```bash
sudo tee -a /etc/lakehouse/_archived/rotation_log.jsonl <<EOF
{"ts":"$(date -u +%Y-%m-%dT%H:%M:%SZ)","operator":"<your name>","reason":"<scheduled|compromise|cred_loss|recovery>","old_key_sha256":"<hash from §2.1>","new_key_sha256":"$(sudo sha256sum /etc/lakehouse/subject_audit.key | awk '{print $1}')","old_token_sha256":"<hash from §2.1>","new_token_sha256":"$(sudo sha256sum /etc/lakehouse/legal_audit.token | awk '{print $1}')","witness":"<witness name or N/A for routine>"}
EOF
sudo chmod 0600 /etc/lakehouse/_archived/rotation_log.jsonl
sudo chown root:root /etc/lakehouse/_archived/rotation_log.jsonl
```
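The angle-bracket placeholders in the row above are easy to leave unfilled. The following is a minimal sketch that builds the same row from arguments and rejects an unknown reason; `rotation_log_line` is a hypothetical helper, and it does not escape quotes or backslashes, so it assumes plain names and hex hashes. Its output can be piped into the same `sudo tee -a /etc/lakehouse/_archived/rotation_log.jsonl`.

```bash
# Sketch only: rotation_log_line is a hypothetical helper, not repo code.
# Usage: rotation_log_line <operator> <reason> <old_key_sha256> \
#          <new_key_sha256> <old_token_sha256> <new_token_sha256> <witness>
# Prints one JSON line in the field order used by the rotation log.
rotation_log_line() {
  # Only the four reasons named in this runbook are accepted.
  case "$2" in
    scheduled|compromise|cred_loss|recovery) ;;
    *) echo "unknown reason: $2" >&2; return 1 ;;
  esac
  printf '{"ts":"%s","operator":"%s","reason":"%s","old_key_sha256":"%s","new_key_sha256":"%s","old_token_sha256":"%s","new_token_sha256":"%s","witness":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}
```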
This file is the operator-side record of when the key changed and why.
It does NOT contain the key itself — only hashes — so it is safe to
back up and share with counsel on request.

---
## 6. Recovery from a lost key
If the active `subject_audit.key` is destroyed (filesystem corruption,
accidental delete, /tmp wipe per the 2026-05-05 incident), the gateway
will fail-closed at startup:
- `/audit/subject/{id}` → 503 ("audit endpoint disabled (legal token missing)" or equivalent for the signing key)
- `/biometric/subject/{id}/photo` → 503 (same fail-closed posture)

This is correct behavior: a server that cannot HMAC-sign new audit
rows must not accept new biometric writes.
**Recovery is rotation.** Generate a new key per §3.1, atomic-swap
per §3.2, restart per §3.3, verify per §4. Pre-loss chains tamper-detect
under the new key (the old key is gone — there is no way to verify
them). Treat the loss event as the BIPA-defensible boundary: pre-loss
chain verification was provided by the working key; post-loss new
chains are signed under the new key.
If a counsel-grade attestation of the pre-loss chains is needed, the
`/etc/lakehouse/_archived/` folder contains the historical hashes;
combined with the cross-runtime parity probe (Go reader gives the
same byte-identical view as Rust), the chain history pre-loss is
preservable as long as the on-disk JSONL files were not also lost.

---
## 7. ⚖ counsel notes
These are areas where counsel may want to opine before this runbook
is formally adopted:
1. **Rotation cadence.** BIPA itself does not require periodic rotation;
counsel may set a 90-day schedule to satisfy a separate compliance
posture (SOC2, internal policy).
2. **Custody of `/etc/lakehouse/_archived/`.** The archived hashes do
NOT contain the keys, but the archived raw key files DO. Counsel
may want a more aggressive destruction schedule for the raw archived
keys — say 1 year — to reduce a long-tail compromise surface.
3. **Notification obligations on rotation due to compromise.** §1
triggers a rotation; §1 does not address whether candidates whose
biometric data was protected by the compromised key must be notified.
This is a counsel call.
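
If counsel does set a destruction schedule for archived raw keys (note 2), the candidates can be listed without deleting anything. The following is a minimal sketch, assuming GNU `find` and the archive file names from §3.2; `list_stale_archived_keys` is a hypothetical helper, the 365-day default only mirrors the "say 1 year" example above, and age is measured from the file's modification time, which `mv` preserves, so it reflects when the credential was generated and not when it was archived.

```bash
# Sketch only: list_stale_archived_keys is a hypothetical helper, not repo
# code. It lists files and never deletes them.
# Usage: list_stale_archived_keys <archive_dir> [max_age_days]
list_stale_archived_keys() {
  local dir="$1" days="${2:-365}"
  # Match only archived raw credentials; rotation_log.jsonl is excluded
  # because it holds hashes, not key material.
  find "$dir" -maxdepth 1 -type f \
    \( -name 'subject_audit.key.*' -o -name 'legal_audit.token.*' \) \
    -mtime +"$days" -print
}
```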
---
## 8. Operator acknowledgment
| Operator | Date acknowledged | Signature |
|---|---|---|
| J | _____ | _______________ |
| _____ | _____ | _______________ |
---
## 9. Change log
- 2026-05-05 — Initial runbook authored after the /tmp wipe incident
on the same day (key was at `/tmp/subject_audit.key` and was deleted
on reboot, disabling `/audit` + `/biometric` until the key was
regenerated at `/etc/lakehouse/subject_audit.key`). Recovery of
that incident produced a working procedure; this runbook captures
it as the canonical playbook for any future rotation.