lakehouse/docs/AUDIT_TRAIL_PRD.md
root dbcd05c5c5 audit docs: deprecation headers — over-scoped for local-only deployment
Today's PRD-line-70 reframe (everything runs locally) means the audit-trail
docs I drafted earlier this session are over-engineered for J's actual
deployment model. They were sized for SaaS-tier infra (Vault/KMS/S3
Object Lock/dual-control JWT/separate Postgres) — appropriate for a
multi-tenant cloud service, wrong for a single-box local install.

Adding clear deprecation headers so future sessions don't read these
as authoritative and propose another 17-20 day plan involving cloud
infrastructure that would re-violate PRD line 70.

What STAYS valid (preserved in headers):
- The legal use case (John Martinez worked example)
- The IL/IN jurisdictional surface (counsel checklist)
- The Phase 1 + 1.5 discovery findings (PII flow paths file:line)
- Phase 1.6 BIPA gates (when real photos arrive)

What's OVER-SCOPED (flagged in headers):
- The 9-phase implementation plan
- The identity service design (Vault/KMS/dual-control)

Future v2 of these docs needs to be sized for local single-box: a few
hundred LOC of local writers + signed local audit file, not 17-20 days
of distributed-systems design.

No code changes. Just doc-level guardrails for future scope drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 02:42:05 -05:00


PRD: Production-Ready Audit Trail

⚠ OVER-SCOPED — 9-phase plan needs to shrink for local-only deployment.

2026-05-03 evening: J reframed the system as local-only per PRD line 70. The 9-phase plan in §8 was sized for SaaS-tier infrastructure with cloud HSM, separate identity daemon, dual-control JWT, etc. For a single-box local deployment, audit trail can be a few hundred LOC of local writers + a signed local file, not a 17-20 day phase plan.

What stays valid:

  • The legal use case (worked example: John Martinez at Warehouse B requests audit) — this is the real problem
  • The §10.5 jurisdictional surface (IL BIPA, IN, federal) — counsel reads this
  • The §3 surface map: where decisions get made today (file:line evidence — see AUDIT_PHASE_1_DISCOVERY.md)
  • Phase 1.6 BIPA pre-launch gates — those still apply when real photos arrive

What's over-scoped:

  • The 9-phase implementation plan (§8) — should compress to 3-4 phases for local-only
  • The identity service design (IDENTITY_SERVICE_DESIGN.md) — see that doc's deprecation header

Do NOT execute the §8 phase plan as-written. When J greenlights, draft a v2 plan sized for local single-box.

Status: Draft — 2026-05-03 · Owner: J · Drafted by: working session 2026-05-03

Why this document exists. Staffing client won't sign until we can prove the AI system can defend a discrimination claim. We've been claiming "production-ready" off smoke + parity tests; those prove the surface compiles, NOT that an audit response can be produced for a specific person. This PRD writes the audit-trail capability down before we start building it, so the phases are accountable and the scope doesn't drift mid-implementation.


1. The worked example

Scenario. John Martinez worked at Warehouse B as a placed candidate. Six months later he files a complaint claiming discrimination during the hiring process. His lawyer requests an audit under EEOC discovery: produce every AI-system decision affecting John between dates D1 and D2.

What we must produce. A response that proves either:

  • (a) John was treated identically to other candidates with comparable qualifications — same scoring criteria, same model invocations, same decision rules — and the outcome differences are explained by non-protected factors, OR
  • (b) The system surfaces exactly what factors led to outcomes, in a form a court can verify, so the claim can be defended on documented criteria rather than "trust the AI."

What we must NOT produce.

  • Other subjects' data (response leaks if even one other candidate's name appears)
  • Internal infrastructure details (DB paths, server names, internal IDs that aren't candidate-shaped)
  • Raw model prompts/completions that contain protected attributes (race, gender, age, etc.) — even if the model didn't use them, their presence in the audit log creates new evidence

The defensibility chain. The audit shows:

  1. Indexing-time decisions — when John was added to the candidate pool, what embedding the model produced, what features were extracted, what categories he was placed into
  2. Search-time decisions — every query that included him in candidate sets, what rank he received, what the model used to compute that rank
  3. Recommendation-time decisions — every fill/recommendation event involving him, what scoring drove it, what validators ran, what they returned
  4. Iteration decisions — any iterate retries that touched him (validator failures, model self-corrections)
  5. Outcome decisions — final fills, rejections, hand-offs

For each, the audit row must show: timestamp, decision type, model + provider, input features (sanitized of protected attributes — see §4), output decision, rationale.


2. The subject audit response — output format

GET /audit/subject/{candidate_id}?from=D1&to=D2
→ JSON or signed PDF (legal preference TBD)

Header section:

  • subject identifier (candidate_id), date range, response generation timestamp, signing daemon, integrity hash
  • pre-translation note: candidate_id ↔ PII mapping is held by the identity service (§5), NOT by this audit endpoint. Legal counsel re-correlates separately under their own access controls.

Per-decision row schema (shape, not exhaustive):

{
  "ts": "ISO-8601 UTC",
  "decision_kind": "embedding_create | search_inclusion | search_rank | fill_recommendation | validation_outcome | iterate_attempt | observer_signal",
  "daemon": "gateway | validatord | observerd | matrixd | ingestd",
  "model": "kimi-k2.6 | deepseek-v3.2 | ...",
  "provider": "ollama_cloud | opencode | openrouter",
  "input_features": { /* what the model SAW, sanitized per §4 */ },
  "output": { /* what the model decided */ },
  "rationale": "model's natural-language explanation, or rule-based justification",
  "trace_id": "X-Lakehouse-Trace-Id linking to Langfuse trace tree",
  "session_id": "iterate session that produced this row"
}
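A minimal sketch of assembling these rows for one subject, in Python for illustration only (the real endpoint lives in the Rust/Go runtimes). The `subject_id` field is the §4-style subject tagging assumed to be present on every row; the row store itself is hypothetical:

```python
from datetime import datetime

def audit_rows_for_subject(rows, candidate_id, d1, d2):
    """Filter decision rows to one subject inside [d1, d2], oldest first.

    `rows` is an iterable of dicts shaped like the schema above, each
    assumed to carry a `subject_id` tag (subject tagging per §4).
    """
    def ts(row):
        # "ts" is ISO-8601 UTC per the schema; normalize a trailing "Z"
        return datetime.fromisoformat(row["ts"].replace("Z", "+00:00"))

    hits = [r for r in rows
            if r.get("subject_id") == candidate_id and d1 <= ts(r) <= d2]
    return sorted(hits, key=ts)
```

Sorting by timestamp matters: the defensibility chain in §1 is a chronology, so the response must read in decision order.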

Footer section:

  • Coverage attestation: "this response includes ALL decisions about candidate_id between D1 and D2 that are retained per §6 retention policy"
  • Sign-off: cryptographic signature from a daemon whose key is in escrow (proves audit was generated by the system, not hand-edited)
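The sign-off could look like the following sketch (Python, illustrative only; a real build would use an asymmetric scheme such as Ed25519 with the verification key in escrow, and HMAC-SHA256 stands in here just to keep the example stdlib-only):

```python
import hashlib
import hmac
import json

def sign_audit_response(response: dict, signing_key: bytes) -> dict:
    """Attach an integrity hash and signature over a canonical serialization."""
    body = json.dumps(response, sort_keys=True, separators=(",", ":")).encode()
    signed = dict(response)
    signed["integrity_hash"] = hashlib.sha256(body).hexdigest()
    signed["signature"] = hmac.new(signing_key, body, hashlib.sha256).hexdigest()
    return signed

def verify_audit_response(signed: dict, signing_key: bytes) -> bool:
    """Recompute the signature over everything except the envelope fields."""
    unsigned = {k: v for k, v in signed.items()
                if k not in ("integrity_hash", "signature")}
    body = json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode()
    expected = hmac.new(signing_key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])
```

Because the signature covers a canonical JSON serialization, any post-generation hand edit changes the body and fails verification, which is the "not hand-edited" property the footer claims.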

3. Surface map — where decisions happen

| Decision happens at | Currently logged where | Audit-completeness gap |
|---|---|---|
| Ingestion (candidate added to pool) | data/_kb/outcomes.jsonl? journald mutation log? | UNKNOWN — needs walk |
| Embedding creation (vector built for candidate) | NOWHERE per-candidate; embed cache hits aren't subject-tagged | MAJOR GAP — need to subject-tag every embedding |
| Search inclusion (candidate appeared in a result set) | Pathway memory + session JSONL (?) | Partial — need subject-correlation |
| Search rank (position in result set) | Result set in chat traces, but not indexed by candidate | Partial |
| Fill recommendation | data/datasets/fill_events.parquet (per CLAUDE.md decision A) + pathway memory | Probably OK but not verified |
| Validation outcome (FillValidator/EmailValidator pass/fail) | /v1/iterate session JSONL — but validation_kind not populated per yesterday's misread | Partial — fix today |
| Iterate retry escalations | Session JSONL attempts[] array | OK |
| Observer signals | observerd events at :3800 (or :4219 Go side) | UNKNOWN — needs walk |
| Matrix-indexer compounding (semantic flags, bug fingerprints) | pathway_memory/state.json (currently 91 traces) | Probably leaks — these are tagged by file/task, not by subject |

Substantive finding from this walk: the matrix indexer + pathway memory are tagged by code not by subject. They surface "this code path failed for this task class" — they don't currently let us answer "every decision matrix-indexer made about John." If matrix-indexer fingerprints leak protected-attribute correlations (e.g., a fingerprint that says "candidates from [zip code with majority demographic X] got outcome Y"), that's a discrimination smoking gun that we currently have no way to audit cleanly.


4. PII handling rules

Tokenization rule: candidate_id is the only identifier that crosses runtime boundaries (logs, JSONL, traces, pathway memory, observer events, model prompts). Email / name / address / phone / SSN / DOB are NEVER in any of these surfaces.

Identity service (§5) holds the candidate_id ↔ PII mapping. Only legal-authorized access reads it.

Protected-attribute exclusion at decision time: the model NEVER receives:

  • Race, ethnicity, national origin
  • Sex, gender, marital status, pregnancy
  • Age, date of birth (allowed: years of experience)
  • Religion
  • Disability, genetic information
  • Veteran status (unless legally relevant for the role)
  • Sexual orientation, gender identity

If the model never sees these, no decision can be predicated on them. The audit row's input_features field proves this: by inspecting the row, a lawyer can confirm protected attributes were absent from input.
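A sketch of what the boundary filter could look like (Python, illustrative only; the attribute names are hypothetical since the real feature schema is not yet fixed):

```python
# Hypothetical field names; the real feature schema is TBD.
PROTECTED_ATTRIBUTES = {
    "race", "ethnicity", "national_origin", "sex", "gender", "marital_status",
    "pregnancy", "age", "date_of_birth", "religion", "disability",
    "genetic_information", "veteran_status", "sexual_orientation",
    "gender_identity",
}

def strip_protected(input_features: dict):
    """Drop protected attributes at the gateway boundary.

    Returns (sanitized features, sorted list of stripped keys) so the audit
    row can record both what the model saw AND that stripping ran.
    """
    stripped = sorted(k for k in input_features if k in PROTECTED_ATTRIBUTES)
    clean = {k: v for k, v in input_features.items()
             if k not in PROTECTED_ATTRIBUTES}
    return clean, stripped
```

Note the filter only removes explicit protected attributes; proxy features such as zip code pass straight through, which is exactly the inferred-attribute risk flagged below.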

Inferred-attribute risk. A model can infer protected attributes from non-protected proxies (zip code → race, name → ethnicity, photo → multiple). The audit must surface this risk. Open question: do we ban photo features from candidate scoring? Do we ban surname tokenization? These are policy calls.

Audit response sanitization: the response goes to the candidate's lawyer, not to the world. It contains the candidate's own name (re-correlated by legal). It must NOT contain other candidates' names, even in comparison/ranking rows.


5. Identity service — candidate_id ↔ PII mapping

Current state: data/datasets/workers_500k.parquet has the full PII (per CLAUDE.md). The candidates_safe view (post-fix c3c9c21) is the masked projection. GAP: candidate_id is currently the row position / a derived field — there's no separate identity service. This needs to change.

Target state:

  • identity/ subsystem (new) — holds the candidate_id → {email, name, address, phone, SSN_last4, DOB, ...} mapping
  • All other systems (gateway, validatord, observerd, matrixd, pathwayd) only ever see candidate_id
  • Identity reads require a separate auth credential held by legal-authorized operators
  • Every identity read is itself audited (log who accessed PII for which candidate when)
  • Identity service runs as its own daemon, port-isolated from the gateway
  • Cross-runtime: same identity service backs both Rust and Go

Open question: does the identity service need to be a separate physical daemon (most defensible) or a logically-separated process within an existing one (easier to ship)? Recommend separate daemon — gives legal a single attestable boundary.
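The read-audit contract above can be sketched as follows (Python, illustrative only; storage, credential handling, and field names are placeholders for the real daemon):

```python
import time

class IdentityService:
    """Sketch of the §5 contract: PII reads require a legal credential,
    and every read emits its own audit event."""

    def __init__(self, mapping, legal_tokens):
        self._mapping = mapping              # candidate_id -> PII dict
        self._legal_tokens = set(legal_tokens)
        self.read_log = []                   # real build: append-only signed log

    def read_pii(self, candidate_id, credential, reason):
        if credential not in self._legal_tokens:
            raise PermissionError("identity reads require a legal-authorized credential")
        # The read itself is audited: who, which candidate, when, why.
        self.read_log.append({
            "ts": time.time(),
            "candidate_id": candidate_id,
            "credential": credential,
            "reason": reason,
        })
        return self._mapping[candidate_id]
```

The point of the shape: the PII read path is narrower than the decision path, and it leaves its own trail, so "who looked up John's identity" is itself answerable.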


6. Retention policy

Current state: UNKNOWN. Pathway memory is append-only. Session JSONL is append-only. We have no documented retention SLA.

Target state (proposed):

  • Active retention: while client is in the system, all audit rows kept hot (queryable in <1s)
  • Legal hold: N years after client/candidate leaves the system, audit rows retained on warm storage. N is TBD — typical EEOC retention is 1-3 years; some state-level claims have 4-year statutes; Title VII discovery can subpoena older. Recommend 4 years minimum, configurable per client contract.
  • Right to be forgotten: if a candidate requests deletion under CCPA/GDPR, we apply tombstoning to the identity service (PII removed) BUT preserve the audit-decision rows under candidate_id (anonymized via PII removal at the source). The decision history remains; the human identification is severed.
  • Cryptographic erasure for append-only logs: pathway memory and matrix indexer can't be selectively deleted without breaking integrity. Encryption-at-rest with per-subject keys lets us "delete" by destroying the key — the encrypted row remains but is unreadable.
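The per-subject-key erasure semantics can be sketched like this (Python; the XOR keystream is a toy stand-in for illustration only, a real build would use AES-GCM or similar):

```python
import hashlib
import secrets

class SubjectKeystore:
    """Per-subject keys: destroying a key 'deletes' that subject's rows
    even though the encrypted bytes remain in the append-only log."""

    def __init__(self):
        self._keys = {}

    def key_for(self, subject_id):
        return self._keys.setdefault(subject_id, secrets.token_bytes(32))

    def erase(self, subject_id):
        self._keys.pop(subject_id, None)     # cryptographic erasure

    def _stream(self, key, n):
        # Toy SHA-256 counter keystream, NOT real crypto.
        out, counter = b"", 0
        while len(out) < n:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:n]

    def encrypt(self, subject_id, plaintext: bytes) -> bytes:
        ks = self._stream(self.key_for(subject_id), len(plaintext))
        return bytes(a ^ b for a, b in zip(plaintext, ks))

    def decrypt(self, subject_id, ciphertext: bytes) -> bytes:
        if subject_id not in self._keys:
            raise KeyError("subject erased: key destroyed, row unreadable")
        return self.encrypt(subject_id, ciphertext)  # XOR is symmetric
```

After `erase()`, the ciphertext still sits in pathway memory untouched, so append-only integrity holds, but no decryption is possible: exactly the RTBF behavior §6 asks for.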

Open question: does the staffing client want a documented retention SLA in their contract? If yes, this PRD becomes contract-grade and the numbers above need their sign-off.


7. Current state vs target state

| Capability | Today | Production-ready target | Gap |
|---|---|---|---|
| candidate_id as canonical token | partial (row position?) | UUID, separate from PII | Real — needs identity service |
| Identity service | none | separate daemon, audited reads | Real — build new |
| /audit/subject/{id} endpoint | none | live with the §2 schema | Real — build new |
| Subject-tagged embeddings | no | every embed creates an audit row | Real — instrument |
| Subject-tagged search results | partial | every result set logged with subject IDs | Partial — needs walk |
| Subject-tagged validation outcomes | yes (in session JSONL) | yes + integrity-signed | Partial |
| Subject-tagged matrix indexer entries | NO | yes (decide first whether matrix should be subject-aware at all) | Major |
| Protected-attribute filter at decision time | informal | enforced at gateway boundary, audited | Unknown — needs code walk |
| Retention policy | none | documented 4-year hot, configurable cold tier | Real — design + build |
| Right to be forgotten | none | per-subject cryptographic erasure | Real — design + build |
| Cross-runtime parity for all of the above | partial (5 algorithm probes) | new audit-parity probes | Real — extend probe set |

8. Implementation phases (proposed sequence)

Each phase has an exit criterion the next phase can lean on. Don't start phase N+1 until phase N's exit holds.

Phase 1 — Discovery walk (read-only, ~3-4 hours)

Walk every daemon and tag every code path that touches subject identifiers. Output: a complete map of where candidate_id lives today, where email/name/PII leak today, what's logged where. No code changes. Fills in all "UNKNOWN" entries in §3 and §7 with file:line references.

Exit: §3 surface map is fully populated with current-state evidence. §7 gap table has no "Unknown" cells.

Phase 2 — Identity service design (design doc, ~2 hours)

Write docs/IDENTITY_SERVICE.md: schema, port, auth model, read-audit format, cross-runtime contract, migration path from current state. No code changes.

Exit: J approves the design.

Phase 3 — Audit response endpoint (skeleton, ~4-6 hours)

Build /audit/subject/{id} endpoint that returns ALL information CURRENTLY logged about the subject — even before identity service is built, even if logs leak PII, even if subject-tagging is incomplete. This is the "what John Martinez would get today" baseline. Reading the output reveals exactly what's wrong.

Exit: endpoint returns a JSON response for any candidate_id in workers_500k. Contents are reviewed; gaps catalogued.

Phase 4 — Subject tagging across substrates

Instrument the missing decision points (embedding creation, search rank, observer signals, matrix indexer entries) with subject identifiers. Each daemon's instrumentation lands as its own commit. Cross-runtime: each Rust commit ships with a Go-side mirror.

Exit: /audit/subject/{id} response is complete for the worked example (John Martinez at Warehouse B can be reconstructed end-to-end).

Phase 5 — Identity service build

Stand up the identity daemon. Migrate candidate_id ↔ PII mapping out of workers_500k.parquet into the new service. Audit every read. Update all callers to never see PII directly.

Exit: PII grep across all log files / JSONL streams / pathway memory state returns 0 hits. Cross-runtime parity probe added: audit_parity.sh validates Rust + Go produce identical audit responses for the same subject.
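The exit check could be sketched like this (Python, illustrative; the regex patterns are rough PII shapes, and a real scan would check against the identity service's actual migrated values rather than guessed patterns):

```python
import re
from pathlib import Path

# Rough PII shapes for a first-pass scan; real check uses known values.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),         # SSN-shaped
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),   # email-shaped
    re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),   # US-phone-shaped
]

def scan_lines(lines):
    """Return 1-based line numbers matching any PII pattern."""
    return [i for i, line in enumerate(lines, 1)
            if any(p.search(line) for p in PII_PATTERNS)]

def pii_hits(root: str):
    """Scan log/JSONL files under `root`; return (path, line_no) per hit.

    Phase 5 exits only when this returns an empty list.
    """
    hits = []
    for path in Path(root).rglob("*"):
        if path.suffix in {".log", ".jsonl", ".json"}:
            for i in scan_lines(path.read_text(errors="ignore").splitlines()):
                hits.append((str(path), i))
    return hits
```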

Phase 6 — Protected-attribute boundary enforcement

Add a hard filter at the gateway: any model invocation must declare the input features it sees, and protected attributes are stripped at the boundary. Audit row's input_features becomes load-bearing.

Exit: can run discrimination-test scenario: feed protected attribute through, verify it's stripped before model sees it, verify audit row shows the stripping.

Phase 7 — Retention + right-to-be-forgotten

Document retention SLA. Implement tier-down (hot → warm → cold → encrypted-with-deletable-key). Implement subject-erasure endpoint.

Exit: test scenario: subject requests deletion, identity service tombstones, decision rows remain under candidate_id but are unreadable post-erasure, audit response for that subject returns "subject erased" header instead of decision rows.

Phase 8 — Legal export format

Decide JSON vs signed PDF for legal output. Build the export pipeline. Sign with a key in escrow.

Exit: can produce the John Martinez audit response in the format legal will accept; signature verifies.

Phase 9 — End-to-end discrimination defense rehearsal

Run the worked example: simulate John Martinez's complaint, generate the audit, walk through what a lawyer would see, identify any remaining gaps, fix them.

Exit: J + (eventually) the staffing client's legal team sign off on the format and completeness.


9. Cross-runtime requirement

Both Rust legacy and Go rewrite must satisfy every phase's exit criterion. The 5 existing parity probes (validator, extract_json, session_log, materializer, embed) cover algorithmic equivalence; they do NOT cover audit. New parity probe audit_parity.sh lands as part of phase 5.

The identity service is the new shared substrate — both runtimes call it; the daemon itself is one implementation (no per-runtime version).
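The comparison step inside the audit_parity probe could do something like the following (Python for illustration; the real probe is a shell script like the existing five, and the field names treated as volatile here are assumptions, not the confirmed schema):

```python
import json

def audit_parity(resp_a: dict, resp_b: dict) -> bool:
    """True when two runtimes' audit responses match on decision content.

    Signature/hash fields are excluded: each runtime signs independently,
    so parity is over decision rows, not over the envelope.
    """
    volatile = {"signature", "integrity_hash", "generated_at"}  # assumed names

    def canon(resp):
        return json.dumps({k: v for k, v in resp.items() if k not in volatile},
                          sort_keys=True)

    return canon(resp_a) == canon(resp_b)
```

Canonicalizing with sorted keys means field ordering differences between the Rust and Go JSON serializers don't produce false parity failures.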


10. Open questions blocking phase 2 (partially resolved 2026-05-03)

These were the original Phase 1 open questions. J answered the load-bearing 5 in conversation 2026-05-03; answers folded in below. Remaining items 1, 4, 6 still need J's call before Phase 2 design ships.

  1. Identity service: separate daemon vs in-process? Recommended: separate. Status: pending J confirmation.
  2. Retention period N years? — Out of scope for now; will be set at deployment time per client contract. Default to 4-year hot retention until set.
  3. Photos/video in scope? YES (J 2026-05-03). BIPA (740 ILCS 14) applies in full. Per-violation $1k-$5k statutory damages, written consent + retention schedule mandatory. This becomes a Phase 1.5 priority — see §10.5/§13 for revised phase ordering.
  4. JSON or signed PDF for legal export? Recommend signed JSON with PDF rendering option. Status: pending.
  5. RTBF under append-only: Cryptographic erasure approach approved in principle (J 2026-05-03 implicit via approval of plan).
  6. Audit endpoint auth model: Recommend legal-only credential separate from admin token. Status: pending.
  7. Matrix indexer subject-awareness: Per scrum review (AUDIT_PHASE_1_DISCOVERY.md §10/C1), matrix-indexer is suspected PII sink (trace bodies unverified). Action: sample state.json before deciding (a) keep code-only + add PII-redact-on-write to trace bodies, OR (b) remove subject-summary text from trace bodies entirely. Decision deferred until §8.1 sampling completes.

Newly answered 2026-05-03 (J)

  1. Langfuse hosting model: Self-hosted. Removes the GDPR Art. 44 cross-border-transfer concern that 3/3 scrum reviewers flagged. Langfuse retention config + Postgres/ClickHouse access controls still need to be audited as part of Phase 1.5 — but the boundary stays inside J's infrastructure, which is materially better than SaaS Langfuse.

  2. EU candidates in scope: Not currently — may need placeholder later. Design choice: build the identity-service interface to be EU-compatible (DPIA-shaped fields, lawful-basis tracking, SCC-ready transfer mechanism slots) but DO NOT gate Phase 2 on EU compliance. Phase 2 ships IL+IN-shaped; EU additions are a follow-up phase.

  3. Healthcare vertical / HIPAA: Same framework — yes. Healthcare staffing IS in scope. PHI in resume_text, communications, and call_log is realistic. Implications:

    • Business Associate Agreement (BAA) required with any third-party model provider that processes content from healthcare-vertical staffing requests
    • opencode + ollama_cloud + openrouter (per PR #13 routing) are external — BAAs needed OR healthcare requests must route to local-only models (Ollama on-box)
    • PHI redaction at the gateway boundary becomes mandatory before the model call leaves the box, OR the model call must stay on-box for healthcare requests
    • Vertical detection at the gateway boundary becomes a Phase 2 requirement
  4. Training / RAG re-runs may use historical outcomes: Yes — design as if it WILL. Implications:

    • outcomes.jsonl and overseer_corrections.jsonl cannot remain raw-PII forever — anything that lands in a training corpus or RAG re-index becomes effectively impossible to delete later (PII baked into model weights)
    • Phase 2 design must include a "training-safe export" pipeline that strips PII from outcomes before feeding to any training/RAG path
    • Crypto-erasure of historical outcomes becomes load-bearing — if a candidate exercises RTBF and their data already trained a model, we must be able to evidence "the source was destroyed; the model retains it indistinguishably from synthetic patterns"
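The training-safe export could be as small as the following sketch (Python, illustrative; the PII field names are hypothetical until the Phase 1 walk confirms what outcomes.jsonl actually contains):

```python
import json

# Hypothetical field names pending the Phase 1 content sample.
PII_FIELDS = {"name", "email", "phone", "address", "ssn", "dob"}

def training_safe_export(outcomes_jsonl: str):
    """Strip PII from outcome records before any training/RAG path sees them.

    candidate_id survives (it's the tokenized identifier per §4); anything
    identity-shaped is dropped so nothing irreversible lands in model weights.
    """
    safe = []
    for line in outcomes_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        safe.append({k: v for k, v in record.items() if k not in PII_FIELDS})
    return safe
```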

Effect on the §8 phase plan

The user-confirmed answers shift priorities. Revised ordering (incorporating scrum-driven priority changes from AUDIT_PHASE_1_DISCOVERY.md §10):

  • Phase 1.5 (NEW) — BIPA-specific photo/video schema audit + Langfuse boundary scoping + outcomes.jsonl content sample. Lands BEFORE Phase 2 design starts.
  • Phase 2 (identity service design) — now must include EU-placeholder fields, vertical-detection (healthcare flag), training-safe export interface, BIPA consent + retention metadata
  • Phase 3 (audit endpoint skeleton) — unchanged
  • Phase 4 (subject tagging) — must include healthcare-vertical routing decision at gateway boundary
  • Phase 5 (identity service build) — must include BIPA-compliant biometric metadata table
  • Phase 6 (protected-attribute boundary) — must include PHI redaction for healthcare-vertical requests
  • Phase 7 (retention + RTBF) — must include training-safe export evidence chain
  • Phase 8 (legal export) — unchanged
  • Phase 9 (rehearsal) — must include both EEOC discrimination scenario AND BIPA biometric scenario AND healthcare PHI breach scenario

10.5 Jurisdictional surface (IL + IN)

⚠ Not legal advice. This is a research-grade checklist for J to take into a conversation with actual employment + privacy counsel. The system is targeting Chicago (Illinois) and Indiana placements per 2026-05-03 conversation. Counsel needs to verify what currently applies, what's pending, and whether case law has moved any of these in 2026. Verify with counsel before claiming compliance with any item below.

Federal layer (always applies)

| Statute / framework | Relevance to this system |
|---|---|
| Title VII (Civil Rights Act) | Bans discrimination on race, color, religion, sex, national origin in hiring. Discrimination claim defense is the worked example in §1. |
| ADEA (Age Discrimination in Employment) | Bans age-based discrimination for workers 40+. DOB must be excluded from features per §4. |
| ADA (Americans with Disabilities Act) | Bans disability discrimination + requires reasonable accommodation. Disability-inferring features (gait, photo features, medical history) need exclusion. |
| EEOC enforcement | Receives complaints, issues right-to-sue. The audit response per §2 is what defends in an EEOC investigation. |
| OFCCP | Applies if our staffing client serves federal contractors. Adds affirmative-action recordkeeping on top of EEOC. |
| FCRA (Fair Credit Reporting Act) | Triggers if background checks are performed. Pre-adverse-action notice + dispute process needed. |
| Section 1981 | Race-based contract discrimination — staffing is a contract relationship. |

Illinois-specific (Chicago jurisdiction)

| Statute | What | What we need |
|---|---|---|
| BIPA (Biometric Information Privacy Act, 740 ILCS 14) | Bans collection of biometric identifiers (face geometry, fingerprints, voiceprints) without informed written consent + a retention schedule. Penalties: $1,000-$5,000 per violation per person. Class actions are common and aggressive. | If we use candidate photos for any feature (face match, headshot rendering, photo-derived attributes), BIPA almost certainly applies. The headshot pool we generate (per CLAUDE.md commit 5d93a71 area) needs careful review — synthetic faces are probably OK; real candidate photos are NOT without explicit BIPA-compliant consent. Counsel must review. |
| Illinois AI Video Interview Act (820 ILCS 42) | If AI analyzes recorded video interviews, the employer must disclose AI use, obtain consent, explain how the AI works, and limit who can review the video. | If we ever ingest video, this applies. Currently we don't, but worth flagging to counsel as a "what if we add this in 12 months" boundary. |
| Illinois Human Rights Act (775 ILCS 5) | Broader than federal Title VII — adds protected classes including arrest record, military status, marital status, order of protection, citizenship status (in some cases), unfavorable military discharge. | The protected-attribute exclusion list in §4 needs expanding to cover IL-specific classes. |
| Personal Information Protection Act (PIPA, 815 ILCS 530) | Breach notification — must notify Illinois residents whose unencrypted PII was breached. | If the identity service or workers parquet is breached, the notification clock starts. Need an incident response runbook. |
| Illinois Day and Temporary Labor Services Act (820 ILCS 175) | Specific to the staffing/temporary services industry. Includes equal-pay-for-equal-work + record-keeping requirements + worker notification. | Highly relevant — applies directly to staffing-company clients. Audit retention may interact with these recordkeeping requirements. |
| Workplace Transparency Act | Restrictions on non-disclosure agreements re: harassment/discrimination. | Tangential but worth noting. |
| City of Chicago Human Rights Ordinance (Title 6, Chicago Municipal Code) | Adds protected classes beyond IHRA (source of income, parental status, military discharge status, credit history). | Chicago-specific protected-attributes list. |
| Cook County Human Rights Ordinance | Similar additions county-wide. | Chicago is in Cook County, so this stacks. |
| Possible: AI hiring transparency | Several states/cities have proposed/passed laws modeled on NYC Local Law 144 (annual bias audit + candidate notification). I do not know whether IL or Chicago has such a law on the books as of the 2026-01 cutoff. | Counsel must check the current state. If it exists, we need annual bias audit reports (which IS what this PRD is building toward, but the report format may have specific requirements). |

Indiana-specific

| Statute | What | What we need |
|---|---|---|
| Indiana Data Breach Disclosure (IC 24-4.9) | Breach notification required "without unreasonable delay". | Same incident response runbook as IL PIPA. |
| Indiana Civil Rights Law (IC 22-9) | State-level employment discrimination. | Largely tracks federal Title VII; fewer expansions than IL. |
| Indiana Genetic Information Privacy Act | Bans use of genetic info in employment. | Already in the §4 protected list. |
| General observation | Indiana is generally less aggressive than Illinois on AI/employment regulation as of cutoff. The IL bar is higher — if we satisfy IL, IN typically follows. | Counsel must confirm this isn't backwards. |

Cross-cutting (security frameworks for SaaS sales)

These aren't laws but are commonly required by enterprise customers (including staffing clients) before sale.

| Framework | What | Relevance |
|---|---|---|
| SOC 2 Type II | Auditor attestation of operating effectiveness over 6-12 months across Trust Service Criteria (Security, Availability, Processing Integrity, Confidentiality, Privacy). | The Privacy criterion overlaps heavily with this PRD; Privacy + Security are the two load-bearing TSCs. Effort to first Type II report: 6-9 months. Type I (point-in-time) is faster (weeks) but enterprise buyers usually want Type II. |
| SOC 3 | Public-facing summary of SOC 2 (no detailed control descriptions). | Nice-to-have for marketing, but the staffing client will want the SOC 2 Type II report under NDA. |
| HIPAA | Healthcare data protection. | Triggers if staffing places workers into healthcare roles where they handle PHI. Per §10 (J, 2026-05-03), the healthcare vertical IS in scope — the BAA and PHI-redaction implications listed there apply. |
| PCI DSS | Payment card data. | Not currently in scope. |
| ISO 27001 | International information security management. | Alternative to SOC 2; more common in EU. Probably unnecessary for IL/IN-only deployments. |

What this means for phase ordering

The 9-phase plan in §8 is technically correct but may need re-ordering once counsel weighs in:

  • BIPA risk on photos is so high and so aggressive that if we use real candidate photos anywhere, that may need to be the FIRST thing we resolve — before the audit-trail work starts. Class-action exposure is enormous.
  • SOC 2 Type II prep runs in parallel with this work, not after. If the staffing client says "show us your SOC 2 report" we need to have started the engagement weeks/months before.
  • Day and Temporary Labor Services Act may impose recordkeeping that interacts with our retention SLA (§6) — counsel may say "no, retention has to be N years for THIS reason, not your defaulted 4."

Open questions for counsel (one ask)

  1. Does the staffing client have an existing SOC 2 report we leverage, or do we need our own?
  2. Are we using any real candidate photos? If yes, is BIPA consent in place?
  3. Does Illinois have an AI hiring transparency law on the books in 2026? If yes, what does the bias audit report need to look like?
  4. What's the IL Day and Temporary Labor Services Act recordkeeping retention period? Does it interact with our 4-year proposed SLA?
  5. Are background checks performed? If yes, do we need FCRA pre-adverse-action workflow integration?
  6. Any healthcare placements? (HIPAA scoping)
  7. Is the staffing client a federal contractor? (OFCCP scoping)

Counsel's answers shape whether the §8 phase plan ships as-is or needs reordering.


11. What this PRD is NOT

  • Not a contract with the staffing client. That document needs lawyers and signs after this is built.
  • Not a regulatory compliance attestation. We can build to the spirit of GDPR/CCPA/EEOC/BIPA/etc — passing actual certification is its own project.
  • Not a guarantee against discrimination claims. It's a guarantee that if a claim is filed, we can produce evidence about how decisions were made.
  • Not a substitute for human review. The audit shows what the AI did; humans still own the final call on hires.
  • Not legal advice. The §10.5 jurisdictional surface is a research-grade checklist, NOT counsel's analysis. Verify everything with actual employment + privacy counsel licensed in IL + IN before claiming compliance with anything in this document.

12. Appendix — terms

  • Subject — a person whose data flows through the system (candidate, worker, applicant). Identified by candidate_id.
  • Decision — any system action that changes a subject's standing (added to candidate pool, ranked in search, recommended for fill, validated, scored, etc.).
  • Audit row — one record in the audit response per decision, with the schema in §2.
  • PII — personally identifiable information per the broad CCPA/GDPR definitions. In this system: name, email, phone, address, SSN, DOB, plus inferred-from-photo attributes.
  • Protected attribute — characteristics that are illegal to discriminate on under federal/state law. The §4 list.
  • Inferred attribute — a protected attribute the model derives from a non-protected feature (zip → race correlation, name → ethnicity correlation).
  • Identity service — the daemon that holds candidate_id ↔ PII mapping. Separate auth.
  • Subject tagging — the practice of labeling every decision/embedding/log row with a candidate_id so the audit endpoint can find it.
  • Cryptographic erasure — making data unrecoverable by destroying its decryption key, even if the encrypted bytes remain on disk. Used for right-to-be-forgotten on append-only logs.

Change log

  • 2026-05-03 — Initial draft. Authored after J flagged the audit-trail gap as the production-readiness blocker.