lakehouse/docs/runbooks/LEGAL_AUDIT_KEY_ROTATION.md
root b2c34b80b3 phase 1.6: lock Gate 3b = C, reconcile docs to shipped state, fix double-upload file leak
Four threads landing together — all driven by the audit J asked for before
production cutover.

(1) Gate 3b DECIDED: Option C (defer classifications). `BiometricCollection.classifications`
    stays `Option<JSON> = None` in v1. `docs/specs/GATE_3B_DEEPFACE_DESIGN.md` status
    flipped from "draft / awaits product" to DECIDED. Consent template + retention
    schedule revised to remove all "automated facial-classification" / "deepface"
    language so disclosed scope matches implemented scope.

(2) Endpoint-path drift reconciled across 3 docs. `PHASE_1_6_BIPA_GATES.md`,
    `BIPA_DESTRUCTION_RUNBOOK.md`, and `biometric_retention_schedule_v1.md` had
    references to legacy `/v1/identity/subjects/*` paths (proposed under a separate
    identityd daemon, never shipped) — corrected to actual shipped routes
    `/biometric/subject/*` (catalogd-local). Schema block in PHASE_1_6_BIPA_GATES
    rewritten to reflect JSON `SubjectManifest.biometric_collection` substrate
    (not the proposed Postgres `subjects` table).

(3) New operational artifacts:
    - `scripts/staffing/verify_biometric_erasure.sh` — checks 4 things post-erasure
      (manifest cleared, uploads dir empty, audit row matches, chain verified).
      Smoke-tested live against WORKER-2.
    - `scripts/staffing/biometric_destruction_report.sh` — monthly anonymized
      destruction-event aggregation. Smoke-tested clean.
    - `scripts/staffing/bundle_counsel_packet.sh` — tarballs the counsel-review
      packet with per-file SHA-256 manifest.
    - `docs/runbooks/LEGAL_AUDIT_KEY_ROTATION.md` — formal rotation procedure
      operationalized after the 2026-05-05 /tmp wipe incident.
    - `docs/counsel/COUNSEL_REVIEW_PACKET_2026-05-05.md` — cover note bundling
      all eng-staged BIPA docs for counsel review with per-doc questions, sign-off
      checklist, recommended review sequence.

(4) Double-upload file leak fixed in `crates/catalogd/src/biometric_endpoint.rs`.
    `verify_biometric_erasure.sh` smoked WORKER-2 and surfaced a stranded photo
    file. Investigation showed the file was 13-byte test-fixture bytes (zero PII,
    no biometric content); audit timeline showed two consecutive uploads followed
    by one erasure — the second upload had silently overwritten manifest.data_path,
    orphaning the first file. Patched `process_upload` to refuse a second upload
    with HTTP 409 + `error: "biometric_already_collected"` when
    `biometric_collection.is_some()` on the manifest. Operator must explicitly
    POST `/biometric/subject/{id}/erase` first.

    Tests: new `second_upload_without_erase_returns_409` (asserts 409 + manifest
    pointer unchanged + first file untouched on disk). Replaced
    `repeated_uploads_grow_the_chain` with `upload_erase_upload_grows_the_chain_cleanly`
    (covers the legitimate re-collection cycle: chain grows to 3 rows). Updated
    `content_type_with_parameters_accepted` to use 2 distinct subjects (was
    using 1 subject with 2 uploads to test ct parsing — would now 409).

    22/22 biometric_endpoint tests + 59/59 catalogd lib tests green post-patch.

Production posture: gateway needs `cargo build --release -p gateway` +
`systemctl restart lakehouse.service` to pick up the new 409 in live traffic.

Counsel calendar is now the only remaining blocker for first real-photo intake.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 06:19:40 -05:00


Legal-Tier Audit Key & Token Rotation Runbook

Spec companion: docs/PHASE_1_6_BIPA_GATES.md §2 + docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md
Audience: Operators with root on the gateway host (J + named operators)
Status: Engineering-authored; ⚖ counsel review encouraged before formal adoption

This runbook covers rotation of the two crypto-credentials that gate the Phase 1.6 audit substrate:

  1. LH_SUBJECT_AUDIT_KEY — the 32-byte HMAC-SHA256 signing key that chains every per-subject audit row. If this key changes, all pre-rotation chain rows tamper-detect under the new key. That is correct, expected, BIPA-defensible behavior — the chain integrity it provided pre-rotation remains intact in the archive of the old key, and post-rotation chains remain intact going forward.

  2. LH_LEGAL_AUDIT_TOKEN — the 32+-character bearer token that authorizes calls to /audit/subject/{id} and /biometric/subject/{id}/erase. Rotation does NOT touch any audit history; only access to the legal-tier endpoints flips.

Both live at /etc/lakehouse/ (mode 0400, owned by root) and are loaded by the gateway via systemd Environment= directives in /etc/systemd/system/lakehouse.service.d/audit_env.conf. They are NOT loaded from /tmp — a 2026-05-05 reboot incident wiped a /tmp-resident key and caused /audit + /biometric to fail-closed (which is what they should do); the rotation fix moved them to the persistent path.
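To make the key-tied behavior of item 1 concrete, here is a minimal sketch of an HMAC-SHA256 chain. The row strings, the `|` separator, and the way each MAC folds in the previous one are illustrative assumptions; the real HMAC input shape lives in the catalogd source and is not reproduced here. The only point shown is that identical rows produce a different chain root under a different key.

```shell
# Illustrative only: row format and chaining shape are assumptions,
# not the catalogd implementation.

# hmac_hex KEY DATA -> hex HMAC-SHA256 of DATA under KEY
hmac_hex() {
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

# chain_root KEY ROW... -> MAC of the last row in a chain signed with KEY.
# Each MAC covers the previous MAC plus the current row.
chain_root() {
  local key="$1" prev="" row
  shift
  for row in "$@"; do
    prev=$(hmac_hex "$key" "$prev|$row")
  done
  printf '%s\n' "$prev"
}

old_root=$(chain_root "old-key" "row-1" "row-2")
new_root=$(chain_root "new-key" "row-1" "row-2")
[ "$old_root" != "$new_root" ] && echo "same rows, different key: roots differ"
```

Because every MAC depends on the key, a verifier holding only the new key has no way to reproduce the old root, which is why pre-rotation rows fail verification after rotation.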


1. When to rotate

Rotate immediately when any of the following is true:

| Trigger | Urgency | Notes |
| --- | --- | --- |
| Suspected operator credential compromise | Within 1 hour | Token mismatch is fail-closed by default; immediate rotation closes the window. |
| Operator with legal-tier access leaves the team | Within 24 hours | Treat as compromise. |
| Key/token file's filesystem permissions were ever weakened (mode > 0400, group readable, etc.) | Within 24 hours | Filesystem audit may have leaked the bytes. |
| Token was ever transmitted over an untrusted channel (printed in CI log, sent over SMS, etc.) | Within 24 hours | Same reasoning. |
| Scheduled rotation (recommended) | Every 90 days | BIPA does not mandate a rotation cadence; counsel may set one. |
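The permissions trigger can be checked mechanically. The helper below is a sketch, not part of the shipped scripts: `check_cred_perms` is a hypothetical name, and it relies on GNU `stat -c`, whose flags differ on BSD/macOS. It only sees the current mode and owner, so it cannot prove the file was never weakened in the past.

```shell
# check_cred_perms FILE [OWNER]
# Returns 0 if FILE is exactly mode 400 and owned by OWNER (default: root),
# 1 if the mode or owner is wrong, 2 if the file cannot be inspected.
check_cred_perms() {
  local file="$1" want_owner="${2:-root}" mode owner
  mode=$(stat -c '%a' "$file") || return 2
  owner=$(stat -c '%U' "$file") || return 2
  if [ "$mode" != "400" ] || [ "$owner" != "$want_owner" ]; then
    echo "ROTATE: $file is mode $mode owner $owner (want 400 $want_owner)" >&2
    return 1
  fi
  echo "ok: $file"
}
```

For example, `check_cred_perms /etc/lakehouse/subject_audit.key && check_cred_perms /etc/lakehouse/legal_audit.token` could run from a periodic job, with a non-zero exit treated as a 24-hour rotation trigger.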

Do not rotate when:

  • A single subject's audit chain tamper-detects in isolation. That is what the chain is designed to report when the audit log itself was edited, and the edit would be the BIPA finding, not the key. Investigate the chain, not the key.
  • Cross-runtime parity drift appears. That's an HMAC-input-shape bug (Go vs Rust serialization), not a key issue. See STATE_OF_PLAY.md "three runtime-divergence classes" entry.

2. Pre-rotation checks (5 minutes)

Before generating new credentials, capture a clean baseline so you can prove the rotation cause and sequence afterward.

2.1. Take the engineering snapshot

# Confirm the canonical files exist with correct permissions.
ls -la /etc/lakehouse/subject_audit.key /etc/lakehouse/legal_audit.token

# Hash the existing key + token (NEVER the bytes themselves) so the
# old credential is identifiable in retrospect without storing it.
# sudo is needed because both files are mode 0400, owned by root.
sudo sha256sum /etc/lakehouse/subject_audit.key
sudo sha256sum /etc/lakehouse/legal_audit.token

# Confirm the gateway is currently using these files.
sudo systemctl cat lakehouse.service | grep -E "Environment.*AUDIT"

# Verify the audit endpoint is healthy with the current credentials.
curl -sf http://localhost:3100/audit/health

If /audit/health is already 503, the rotation is recovery, not preventive — note this in the rotation event record (§5).
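If the baseline needs to be kept alongside the rotation record, the hashes can be captured as timestamped lines instead of being copied from the terminal. This is a sketch with hypothetical helper names (`cred_fingerprint`, `snapshot_line`); it records digests only, never credential bytes.

```shell
# cred_fingerprint FILE -> SHA-256 hex digest of FILE (digest only, no filename)
cred_fingerprint() {
  sha256sum "$1" | awk '{print $1}'
}

# snapshot_line LABEL FILE -> "UTC-timestamp LABEL digest"
# The output contains no key material, so it is safe to store or share.
snapshot_line() {
  printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$(cred_fingerprint "$2")"
}
```

From a root shell (the files are not readable otherwise), `snapshot_line old_key /etc/lakehouse/subject_audit.key` and `snapshot_line old_token /etc/lakehouse/legal_audit.token` yield the two values that §5 later needs as `old_key_sha256` and `old_token_sha256`.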

2.2. Capture a known-good chain root

Pick one or two subjects with non-empty audit logs and record their chain roots under the current key:

# sudo is needed because the token file is mode 0400, owned by root.
TOKEN=$(sudo cat /etc/lakehouse/legal_audit.token)
for cid in WORKER-2 WORKER-100; do
  curl -sf -H "X-Lakehouse-Legal-Token: $TOKEN" \
    "http://localhost:3100/audit/subject/$cid" \
  | jq '{cid: .candidate_id, verified: .audit_log.chain_verified, root: .audit_log.chain_root, rows: .audit_log.chain_rows_total}'
done

Save the output. Post-rotation, those chains will tamper-detect under the new key — that is expected and the saved snapshot is the proof that the chain WAS intact under the old key, before rotation.


3. Generation + rotation

3.1. Generate the new key

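# 33 random bytes encode to exactly 44 base64 characters with no '=' padding,
# which matches the existing credential convention. The tr/head steps are a
# guard that strips the trailing newline and caps the length at 44.
# The value is piped in via /dev/stdin because sudo closes inherited file
# descriptors by default, which breaks bash process substitution (<(...)).
openssl rand -base64 33 | tr -d '\n=' | head -c 44 \
  | sudo install -m 0400 -o root -g root /dev/stdin /etc/lakehouse/subject_audit.key.new

openssl rand -base64 33 | tr -d '\n=' | head -c 44 \
  | sudo install -m 0400 -o root -g root /dev/stdin /etc/lakehouse/legal_audit.token.new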

# Sanity: confirm 44-char content + correct mode.
sudo wc -c /etc/lakehouse/subject_audit.key.new /etc/lakehouse/legal_audit.token.new
sudo ls -la /etc/lakehouse/*.new

Both must be mode 0400, owned by root, exactly 44 chars (the audit endpoint refuses tokens shorter than 32 chars at load — see crates/catalogd/src/audit_endpoint.rs:73).
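The length and character requirements can be checked in one step before the swap. The helper below is a sketch under the assumption that a valid credential is exactly 44 characters from the standard base64 alphabet; `check_cred_content` is a hypothetical name, and the gateway's own load-time validation may differ.

```shell
# check_cred_content FILE
# Returns 0 only if FILE holds exactly 44 bytes, all from A-Z a-z 0-9 + /.
# A trailing newline, '=' padding, or any whitespace counts as a failure.
check_cred_content() {
  local file="$1" bytes stray
  bytes=$(( $(wc -c < "$file") ))
  if [ "$bytes" -ne 44 ]; then
    echo "bad length: $bytes bytes (want 44)" >&2
    return 1
  fi
  # Delete every allowed character and count what is left over.
  stray=$(( $(LC_ALL=C tr -d 'A-Za-z0-9+/' < "$file" | wc -c) ))
  if [ "$stray" -ne 0 ]; then
    echo "$stray unexpected byte(s) in $file" >&2
    return 1
  fi
  echo "ok: $file"
}
```

Counting leftover bytes with `tr` rather than matching with `grep` is deliberate: `grep` works line by line and would not notice a trailing newline.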

3.2. Atomic swap

The gateway reads these files once at boot (per crates/catalogd/src/audit_endpoint.rs::AuditEndpointState::new and the equivalent for the writer). Atomic mv → restart is required.

# Move the old credentials to a quarantine path with timestamp so the
# old hashes remain identifiable post-rotation.
TS=$(date -u +%Y%m%dT%H%M%SZ)
# install -d creates the directory if needed and sets mode/owner in one step.
sudo install -d -m 0700 -o root -g root /etc/lakehouse/_archived

sudo mv /etc/lakehouse/subject_audit.key  /etc/lakehouse/_archived/subject_audit.key.$TS
sudo mv /etc/lakehouse/legal_audit.token  /etc/lakehouse/_archived/legal_audit.token.$TS

sudo mv /etc/lakehouse/subject_audit.key.new  /etc/lakehouse/subject_audit.key
sudo mv /etc/lakehouse/legal_audit.token.new  /etc/lakehouse/legal_audit.token

sudo ls -la /etc/lakehouse/subject_audit.key /etc/lakehouse/legal_audit.token
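The swap can be rehearsed in a scratch directory before touching /etc/lakehouse. The function below is a sketch of the same sequence with one added guard: it refuses to start unless both `.new` files exist, so a failed generation step cannot leave the directory without credentials. `swap_credentials` and its `BASE` parameter are hypothetical, introduced only for this example.

```shell
# swap_credentials BASE TS
# Archives BASE/{subject_audit.key,legal_audit.token} into BASE/_archived/
# with suffix TS (if they exist), then promotes the matching .new files.
swap_credentials() {
  local base="$1" ts="$2" name
  # Guard: both replacements must be present before anything is moved.
  for name in subject_audit.key legal_audit.token; do
    if [ ! -f "$base/$name.new" ]; then
      echo "refusing swap: $base/$name.new is missing" >&2
      return 1
    fi
  done
  mkdir -p "$base/_archived" && chmod 0700 "$base/_archived" || return 1
  for name in subject_audit.key legal_audit.token; do
    # The old file may be absent when recovering from a lost key.
    if [ -f "$base/$name" ]; then
      mv "$base/$name" "$base/_archived/$name.$ts" || return 1
    fi
    mv "$base/$name.new" "$base/$name" || return 1
  done
}
```

Each `mv` is a rename within one filesystem, so no reader ever sees a half-written file. There is still a brief window in which the canonical path does not exist, which is harmless here because the gateway only reads the files at boot.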

3.3. Restart the gateway

sudo systemctl restart lakehouse.service
sleep 2
sudo systemctl status lakehouse.service --no-pager | head -10

Wait for the gateway to bind port 3100 cleanly. If it doesn't, check journalctl -u lakehouse.service -n 50 --no-pager for the failure mode — the most common cause is the new file having wrong mode/owner.
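A fixed `sleep 2` can be too short on a loaded host. As an alternative, the sketch below polls a command until it succeeds or a retry budget runs out; `wait_for` is a hypothetical helper, not something the gateway ships.

```shell
# wait_for ATTEMPTS DELAY CMD...
# Runs CMD up to ATTEMPTS times, sleeping DELAY seconds after each failure.
# Returns 0 as soon as CMD succeeds, 1 if it never does.
wait_for() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

After the restart, `wait_for 15 1 curl -sf http://localhost:3100/audit/health` would wait up to roughly 15 seconds for the health probe to pass; on failure, fall back to the journalctl check described above.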


4. Post-rotation verification (5 minutes)

4.1. Health probes

# Audit endpoint must be 200, not 503.
curl -sf http://localhost:3100/audit/health
# Expect: "audit endpoint ready"

# /v1/health must list the gateway's full provider set.
curl -sf http://localhost:3100/v1/health | jq '.providers, .worker_count'

4.2. Confirm the new token works

# sudo is needed because the token file is mode 0400, owned by root.
NEW_TOKEN=$(sudo cat /etc/lakehouse/legal_audit.token)
curl -sS -o /dev/null -w '%{http_code}\n' \
  -H "X-Lakehouse-Legal-Token: $NEW_TOKEN" \
  http://localhost:3100/audit/subject/WORKER-100
# Expect: 200

If 401, the token the gateway loaded does NOT match the file you wrote. Check the file's ownership and mode, and look for trailing whitespace or a stray newline with sudo hexdump -C /etc/lakehouse/legal_audit.token | head.

4.3. Confirm the new chain works

Append-only chains are key-tied. Any new audit row written post-rotation is signed under the new key and verifies cleanly:

# Issue a /v1/validate call against any worker — it spawns an audit row.
curl -sf -X POST http://localhost:3100/v1/validate \
  -H 'Content-Type: application/json' \
  -d '{"mode":"fill","candidate_id":"WORKER-100","worker_id":"WORKER-100","fields":["exists"]}' >/dev/null

# Read the chain back. Last row must verify under the new key.
curl -sf -H "X-Lakehouse-Legal-Token: $NEW_TOKEN" \
  http://localhost:3100/audit/subject/WORKER-100 \
| jq '.audit_log | {verified: .chain_verified, rows: .chain_rows_total, last_kind: .rows[-1].accessor.kind}'

chain_verified: true confirms the new key is signing + verifying.

4.4. Confirm pre-rotation chains tamper-detect (expected)

curl -sf -H "X-Lakehouse-Legal-Token: $NEW_TOKEN" \
  http://localhost:3100/audit/subject/WORKER-2 \
| jq '.audit_log | {verified: .chain_verified, error: .chain_verification_error}'

For any subject whose chain was written under the old key, this returns chain_verified: false with an HMAC-mismatch error. This is correct behavior, not a bug. The old chain was correctly signed under the old key and verified under it; the new key cannot retroactively verify rows it didn't sign. The pre-rotation snapshot you captured in §2.2 is the defensible proof that those rows WERE valid pre-rotation.

If, instead, a chain that should verify post-rotation returns verified: false, the rotation has gone wrong, most likely because an old-key file did not get archived cleanly. Restore the archived files (/etc/lakehouse/_archived/subject_audit.key.<ts> and /etc/lakehouse/_archived/legal_audit.token.<ts>) to their canonical paths, then re-attempt.
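The outcomes in §4.3 and §4.4 reduce to a small decision table. The sketch below encodes it for chains that sit entirely on one side of the rotation; `chain_verdict` is a hypothetical helper. The "pre true" case is an inference rather than something stated above: if old rows still verify, the gateway may still be holding the old key.

```shell
# chain_verdict WRITTEN VERIFIED
#   WRITTEN  = pre | post    (were the rows signed before or after rotation?)
#   VERIFIED = true | false  (chain_verified as reported by /audit/subject/{id})
chain_verdict() {
  case "$1:$2" in
    pre:false)  echo "expected: old-key rows cannot verify under the new key" ;;
    post:true)  echo "expected: new key is signing and verifying" ;;
    post:false) echo "problem: new rows fail, rotation likely incomplete" ;;
    pre:true)   echo "problem: old rows still verify, gateway may still hold the old key" ;;
    *)          echo "usage: chain_verdict pre|post true|false" >&2; return 2 ;;
  esac
}
```

For example, `chain_verdict pre false` describes the WORKER-2 result above and reports it as expected.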


5. Record the rotation event

Append a row to the rotation log:

# The $(...) substitutions run in your own shell before tee starts, so the
# hash commands need their own sudo to read the mode-0400 files.
sudo tee -a /etc/lakehouse/_archived/rotation_log.jsonl <<EOF
{"ts":"$(date -u +%Y-%m-%dT%H:%M:%SZ)","operator":"<your name>","reason":"<scheduled|compromise|cred_loss|recovery>","old_key_sha256":"<hash from §2.1>","new_key_sha256":"$(sudo sha256sum /etc/lakehouse/subject_audit.key | awk '{print $1}')","old_token_sha256":"<hash from §2.1>","new_token_sha256":"$(sudo sha256sum /etc/lakehouse/legal_audit.token | awk '{print $1}')","witness":"<witness name or N/A for routine>"}
EOF

sudo chmod 0600 /etc/lakehouse/_archived/rotation_log.jsonl
sudo chown root:root /etc/lakehouse/_archived/rotation_log.jsonl

This file is the operator-side record of when the key changed and why. It does NOT contain the key itself — only hashes — so it is safe to back up and share with counsel on request.
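Hand-editing the placeholders in a long JSON line is error-prone. As a sketch, the record can instead be assembled by a small function that takes the values as arguments and rejects an unknown reason. `rotation_record` is a hypothetical helper; it inserts values verbatim, so they must not contain double quotes or backslashes.

```shell
# rotation_record OPERATOR REASON OLD_KEY NEW_KEY OLD_TOKEN NEW_TOKEN WITNESS
# Prints one JSON object on a single line with the same fields as the log row.
rotation_record() {
  case "$2" in
    scheduled|compromise|cred_loss|recovery) ;;
    *) echo "unknown reason: $2" >&2; return 1 ;;
  esac
  printf '{"ts":"%s","operator":"%s","reason":"%s","old_key_sha256":"%s","new_key_sha256":"%s","old_token_sha256":"%s","new_token_sha256":"%s","witness":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}
```

The output can be piped to `sudo tee -a /etc/lakehouse/_archived/rotation_log.jsonl`, with the four hash arguments taken from the §2.1 baseline and a fresh `sudo sha256sum` of the new files.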


6. Recovery from a lost key

If the active subject_audit.key is destroyed (filesystem corruption, accidental delete, /tmp wipe per the 2026-05-05 incident), the gateway will fail-closed at startup:

  • /audit/subject/{id} → 503 ("audit endpoint disabled (legal token missing)" or equivalent for the signing key)
  • /biometric/subject/{id}/photo → 503 (same fail-closed posture)

This is correct behavior — a server that cannot HMAC-sign new audit rows must not accept new biometric writes.

Recovery is rotation. Generate a new key per §3.1, atomic-swap per §3.2, restart per §3.3, verify per §4. Pre-loss chains tamper-detect under the new key (the old key is gone — there is no way to verify them). Treat the loss event as the BIPA-defensible boundary: pre-loss chain verification was provided by the working key; post-loss new chains are signed under the new key.

If a counsel-grade attestation of the pre-loss chains is needed, the /etc/lakehouse/_archived/ folder contains the historical hashes; combined with the cross-runtime parity probe (Go reader gives the same byte-identical view as Rust), the chain history pre-loss is preservable as long as the on-disk JSONL files were not also lost.


7. ⚖ counsel notes

These are areas where counsel may want to opine before this runbook is formally adopted:

  1. Rotation cadence. BIPA itself does not require periodic rotation; counsel may set a 90-day schedule to satisfy a separate compliance posture (SOC2, internal policy).
  2. Custody of /etc/lakehouse/_archived/. The archived hashes do NOT contain the keys, but the archived raw key files DO. Counsel may want a more aggressive destruction schedule for the raw archived keys — say 1 year — to reduce a long-tail compromise surface.
  3. Notification obligations on rotation due to compromise. §1 triggers a rotation; §1 does not address whether candidates whose biometric data was protected by the compromised key must be notified. This is a counsel call.

8. Operator acknowledgment

| Operator | Date acknowledged | Signature |
| --- | --- | --- |
| J | _____ | _______________ |
| _____ | _____ | _______________ |

9. Change log

  • 2026-05-05 — Initial runbook authored after the /tmp wipe incident on the same day (key was at /tmp/subject_audit.key and was deleted on reboot, disabling /audit + /biometric until the key was regenerated at /etc/lakehouse/subject_audit.key). Recovery of that incident produced a working procedure; this runbook captures it as the canonical playbook for any future rotation.