profit/lakehouse

Commit Graph


Author

SHA1

Message

Date

root

b2c34b80b3

phase 1.6: lock Gate 3b = C, reconcile docs to shipped state, fix double-upload file leak

Four threads landing together — all driven by the audit J asked for before
production cutover.

(1) Gate 3b DECIDED: Option C (defer classifications). `BiometricCollection.classifications`
stays `Option<JSON> = None` in v1. `docs/specs/GATE_3B_DEEPFACE_DESIGN.md` status
flipped from "draft / awaits product" to DECIDED. Consent template + retention
schedule revised to remove all "automated facial-classification" / "deepface"
language so disclosed scope matches implemented scope.

(2) Endpoint-path drift reconciled across 3 docs. `PHASE_1_6_BIPA_GATES.md`,
`BIPA_DESTRUCTION_RUNBOOK.md`, and `biometric_retention_schedule_v1.md` had
references to legacy `/v1/identity/subjects/*` paths (proposed under a separate
identityd daemon, never shipped) — corrected to actual shipped routes
`/biometric/subject/*` (catalogd-local). Schema block in PHASE_1_6_BIPA_GATES
rewritten to reflect JSON `SubjectManifest.biometric_collection` substrate
(not the proposed Postgres `subjects` table).

(3) New operational artifacts:
- `scripts/staffing/verify_biometric_erasure.sh` — checks 4 things post-erasure
(manifest cleared, uploads dir empty, audit row matches, chain verified).
Smoke-tested live against WORKER-2.
- `scripts/staffing/biometric_destruction_report.sh` — monthly anonymized
destruction-event aggregation. Smoke-tested clean.
- `scripts/staffing/bundle_counsel_packet.sh` — tarballs the counsel-review
packet with per-file SHA-256 manifest.
- `docs/runbooks/LEGAL_AUDIT_KEY_ROTATION.md` — formal rotation procedure
operationalized after the 2026-05-05 /tmp wipe incident.
- `docs/counsel/COUNSEL_REVIEW_PACKET_2026-05-05.md` — cover note bundling
all eng-staged BIPA docs for counsel review with per-doc questions, sign-off
checklist, recommended review sequence.

(4) Double-upload file leak fixed in `crates/catalogd/src/biometric_endpoint.rs`.
`verify_biometric_erasure.sh`, smoke-tested against WORKER-2, surfaced a stranded
photo file. Investigation showed the file held 13 bytes of test-fixture data (zero PII,
no biometric content); audit timeline showed two consecutive uploads followed
by one erasure — the second upload had silently overwritten manifest.data_path,
orphaning the first file. Patched `process_upload` to refuse a second upload
with HTTP 409 + `error: "biometric_already_collected"` when
`biometric_collection.is_some()` on the manifest. Operator must explicitly
POST `/biometric/subject/{id}/erase` first.
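
A minimal sketch of the guard described in (4), not the shipped code in `biometric_endpoint.rs`. The `Store` type, the in-memory audit vector, the `erase` helper, and the choice to write no audit row for a rejected upload are all assumptions made for illustration. Only the 409 status, the `biometric_already_collected` error string, and the `biometric_collection.is_some()` check come from the message above.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the manifest's biometric record.
#[derive(Debug, Clone, PartialEq)]
struct BiometricCollection {
    data_path: String,
}

#[derive(Debug, Default)]
struct SubjectManifest {
    biometric_collection: Option<BiometricCollection>,
}

#[derive(Debug, PartialEq)]
enum UploadError {
    AlreadyCollected,
}

impl UploadError {
    /// HTTP status the handler would return.
    fn status(&self) -> u16 {
        match self {
            UploadError::AlreadyCollected => 409,
        }
    }
    /// Value of the `error` field in the response body.
    fn code(&self) -> &'static str {
        match self {
            UploadError::AlreadyCollected => "biometric_already_collected",
        }
    }
}

/// Hypothetical in-memory store: manifests plus a flat audit log.
#[derive(Default)]
struct Store {
    manifests: HashMap<String, SubjectManifest>,
    audit: Vec<(String, &'static str)>,
}

/// Refuse a second upload instead of overwriting `data_path`.
fn process_upload(store: &mut Store, subject_id: &str, data_path: &str) -> Result<(), UploadError> {
    let manifest = store.manifests.entry(subject_id.to_string()).or_default();
    if manifest.biometric_collection.is_some() {
        // Manifest pointer and first file stay untouched.
        return Err(UploadError::AlreadyCollected);
    }
    manifest.biometric_collection = Some(BiometricCollection {
        data_path: data_path.to_string(),
    });
    store.audit.push((subject_id.to_string(), "upload"));
    Ok(())
}

/// Clear the collection so a later upload is legitimate again.
fn erase(store: &mut Store, subject_id: &str) -> bool {
    match store.manifests.get_mut(subject_id) {
        Some(m) if m.biometric_collection.is_some() => {
            m.biometric_collection = None;
            store.audit.push((subject_id.to_string(), "erase"));
            true
        }
        _ => false,
    }
}
```

In this sketch, upload then upload leaves the first `data_path` in place and returns the 409 error, while upload, erase, upload succeeds and leaves three audit rows.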

Tests: new `second_upload_without_erase_returns_409` (asserts 409 + manifest
pointer unchanged + first file untouched on disk). Replaced
`repeated_uploads_grow_the_chain` with `upload_erase_upload_grows_the_chain_cleanly`
(covers the legitimate re-collection cycle: chain grows to 3 rows). Updated
`content_type_with_parameters_accepted` to use 2 distinct subjects (it was
using 1 subject with 2 uploads to test Content-Type parsing, which would now return 409).

22/22 biometric_endpoint tests + 59/59 catalogd lib tests green post-patch.

Production posture: gateway needs `cargo build --release -p gateway` +
`systemctl restart lakehouse.service` to pick up the new 409 in live traffic.

Counsel calendar is now the only remaining blocker for first real-photo intake.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-05 06:19:40 -05:00

root

8ec43e0721

phase 1.6 Gate 3b: deepface integration design doc (3 options + recommendation)

Per docs/PHASE_1_6_BIPA_GATES.md Gate 3b. Three viable paths for
populating BiometricCollection.classifications, sized + tradeoff'd:

  Option A — Python subprocess per upload (no daemon)
    ~80 LOC, 0.5-1 day. Smallest integration. Reintroduces a Python
    dependency the 2026-05-02 sidecar drop deliberately removed.

  Option B — ONNX models in Rust (no Python at all)
    ~200-400 LOC + model-build pipeline, 5-7 days. Fully consistent
    with sidecar drop. Need pre-trained models with appropriate
    licenses (or train ourselves, multi-week). Adds face detection
    preprocessing in Rust.

  Option C — Defer; classifications field stays None
    0.25 day. BIPA-safest position; substrate is forward-compatible.
    Forces the question "do we actually need classifications?" to be
    answered by a real product requirement, not by spec inertia.

Recommendation: **Option C (defer)**, conditional on confirming the
product requirement. Reasoning:
- All BIPA-load-bearing surfaces (consent + audit + retention +
  erasure) ship without classifications
- Riskiest BIPA position is collecting demographic-derived data
  without a documented business purpose
- Substrate accommodates A or B later in 1-3 days if real demand
  surfaces

Open questions for J are at the bottom of the doc; picking A/B/C is the
gating decision before any engineering happens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-03 05:25:45 -05:00

root

ed1fcd3c26

specs: pathway_memory v1 + subject_manifests_on_catalogd v1

Two specifications addressing the framing J asked for after reading
the llms3.com blog: standardize what we have so future work doesn't
drift, and apply the local-first thesis to the audit problem instead
of the over-scoped SaaS-tier identity service.

PATHWAY_MEMORY_SPEC.md (~400 lines):
  Documents the existing crates/vectord/src/pathway_memory.rs as a
  spec — the third metadata layer alongside catalogd's data metadata
  and playbook_memory's operational memory. Defines:
    - PathwayTrace wire format
    - pathway_id = SHA256(task_class | file_prefix | signal_class)
    - file_prefix algorithm (first 2 path segments)
    - pathway_vec: 32-bucket bag-of-tokens hash, fixed dim per spec
    - Lifecycle: insert → revise → replay → probation gate retire
    - Mem0 versioning (trace_uid + parent_trace_uid + version chain)
    - Access patterns: query_for_hotswap / query_by_vec / list_versions
    - PII risk surface (reducer_summary + final_verdict)
    - Spec boundary: stable in v1 vs implementation-specific
  No new architecture. Descriptive, not prescriptive.
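
A rough sketch of three items from the list above, under stated assumptions and not taken from `pathway_memory.rs`. The tokenizer (whitespace split, lowercased), the FNV-1a bucket hash, and the literal `|` separator in the pre-image are illustrative guesses. The message only fixes "first 2 path segments", "32-bucket bag-of-tokens hash", and the three inputs to SHA256. The SHA-256 step itself is omitted because it needs an external crate.

```rust
const BUCKETS: usize = 32;

/// First two non-empty path segments, e.g. "crates/vectord".
fn file_prefix(path: &str) -> String {
    path.split('/')
        .filter(|seg| !seg.is_empty())
        .take(2)
        .collect::<Vec<_>>()
        .join("/")
}

/// FNV-1a 64-bit. Assumption: the spec's real token hash is not stated.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Fixed-dimension bag-of-tokens vector: each token increments one bucket.
fn pathway_vec(text: &str) -> [f32; BUCKETS] {
    let mut v = [0.0f32; BUCKETS];
    for tok in text.split_whitespace() {
        let idx = (fnv1a(tok.to_lowercase().as_bytes()) % BUCKETS as u64) as usize;
        v[idx] += 1.0;
    }
    v
}

/// String that would be fed to SHA-256 to obtain pathway_id.
/// Assumption: the fields are joined with a literal '|'.
fn pathway_id_preimage(task_class: &str, file_prefix: &str, signal_class: &str) -> String {
    format!("{}|{}|{}", task_class, file_prefix, signal_class)
}
```

The vector length stays at 32 whatever the input, and the bucket counts always sum to the number of tokens, which is what makes the dimension "fixed per spec".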

SUBJECT_MANIFESTS_ON_CATALOGD.md (~400 lines):
  The local-first audit-trail spec. Adds a fourth manifest type to
  catalogd alongside dataset/view/tombstone/profile. NOT a separate
  identity daemon. NOT Vault/KMS/dual-control JWT. Builds on
  primitives catalogd already ships:
    - SubjectManifest at data/_catalog/subjects/<id>.json
    - Per-subject HMAC-chained audit JSONL
    - Daily retention sweep using existing tombstone primitives
    - Vertical-aware routing (healthcare → local-only)
    - Legal-tier credential separate from gateway internal auth
  ~4 days estimated implementation effort vs 17-20 days for the
  IDENTITY_SERVICE_DESIGN approach. Same defensibility for the
  staffing-client launch window. Strictly additive, and so compatible
  with the v3 design if SOC2 Type II becomes a contract requirement.
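
To illustrate the "HMAC-chained audit JSONL" idea, here is a sketch of the chaining logic only. The row fields (`event`, `prev_mac`, `mac`) are hypothetical, since the real JSONL schema is not shown here. The MAC is passed in as a function, and `toy_mac` is a non-cryptographic placeholder so the example runs with the standard library alone. A real chain would pass HMAC-SHA256 (for instance from the `hmac` and `sha2` crates) keyed with the legal-tier credential.

```rust
/// One audit row. Field names are assumptions for illustration.
#[derive(Debug, Clone)]
struct AuditRow {
    event: String,
    prev_mac: Vec<u8>,
    mac: Vec<u8>,
}

/// Bytes covered by a row's MAC: previous row's MAC followed by the event.
fn message(prev_mac: &[u8], event: &str) -> Vec<u8> {
    let mut m = prev_mac.to_vec();
    m.extend_from_slice(event.as_bytes());
    m
}

/// Append a row linked to the current tail of the chain.
fn append_row<F>(chain: &mut Vec<AuditRow>, event: &str, mac_fn: F)
where
    F: Fn(&[u8]) -> Vec<u8>,
{
    // The first row links to an empty "genesis" value.
    let prev_mac = chain.last().map(|r| r.mac.clone()).unwrap_or_default();
    let mac = mac_fn(&message(&prev_mac, event));
    chain.push(AuditRow { event: event.to_string(), prev_mac, mac });
}

/// Recompute every MAC and check every back-link.
fn verify_chain<F>(chain: &[AuditRow], mac_fn: F) -> bool
where
    F: Fn(&[u8]) -> Vec<u8>,
{
    let mut expected_prev: Vec<u8> = Vec::new();
    for row in chain {
        if row.prev_mac != expected_prev {
            return false;
        }
        if mac_fn(&message(&row.prev_mac, &row.event)) != row.mac {
            return false;
        }
        expected_prev = row.mac.clone();
    }
    true
}

/// Placeholder keyed hash (FNV-1a over key then message).
/// NOT cryptographic; stands in for HMAC-SHA256 in this sketch only.
fn toy_mac(key: &[u8], msg: &[u8]) -> Vec<u8> {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in key.iter().chain(msg.iter()) {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h.to_be_bytes().to_vec()
}
```

Because each row's MAC covers the previous row's MAC, altering or removing a row in the middle breaks verification for that row or the one after it, which is the property a per-subject chain relies on.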

These are SPECS — what the system already does (pathway) and what's
the smallest local-first thing that addresses the audit need
(subject manifests). Not 9-phase plans. Not new daemons.

The pathway spec is descriptive: writing down what exists so the
next person doesn't reinvent it. The subject-manifests spec is
prescriptive: J greenlights, implementation is days not weeks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-03 03:07:38 -05:00

3 Commits