lakehouse/docs/ADR-021-sparse-data-trust.md
root 13b01fee9f ADR-021: Sparse data trust path — start with nothing, earn everything
The staffing company said: 'we don't have any of that data.'
They're right. We showed a demo with 18-field profiles and they
have a name and a phone number.

This ADR documents the trust path:
  Phase 1 (Day 1): Work with name + phone + role. That's enough.
  Phase 2 (Week 1-4): Timesheets → reliability. Calls → history.
  Phase 3 (Month 2+): AI starts helping with real earned data.

Key principles:
- Never show empty fields or 0% bars
- Show what's THERE, not what's missing
- Trust indicators: 'based on 3 placements' not just 'Reliability: 87%'
- The system earns data by being useful, not by demanding it upfront

Also created sparse_workers dataset (200 workers, 74% have role,
34% have notes, 5 have ONLY name+phone) for realistic testing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 15:32:06 -05:00

ADR-021: Sparse Data Trust Path — start with nothing, earn everything

Status: Proposed, 2026-04-17
Triggered by: Legacy staffing company pushback: "we don't have that data"
Owner: J


The Problem

We demonstrated a system with rich worker profiles (18 fields, behavioral scores, certifications, communication history). The staffing company said: "We don't have any of that. We have a name, a phone number, and a contract."

They're right. Our demo assumed data that doesn't exist in their world. Showing AI that only works with perfect data is worse than useless — it builds distrust.

Their Reality

Day 1 data for a typical worker:

  • Name
  • Phone number
  • Maybe: city, role, one or two skills mentioned on a call

Day 1 data for a contract:

  • Client name
  • Role needed
  • Headcount
  • Start date
  • Maybe: required certs, location

That's it. No reliability scores. No availability metrics. No communication history. No certifications in a database. The staffing coordinator carries that knowledge in their HEAD.

The Trust Path

Phase 1: Work with what they have (Day 1)

  • Accept sparse profiles: name + phone + role. That's enough.
  • Match contracts to workers by role + location only. No scores.
  • The system is useful immediately: "here are the 12 people you have tagged as Forklift Operators in IL." That's already faster than scrolling a spreadsheet.
  • Don't show empty fields. Don't show 0% bars. Don't show what's missing — show what's THERE.
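As a minimal sketch of the Phase 1 matching rule (role + location only, no scores), assuming a hypothetical `Worker` record and a hypothetical `match_workers` helper. The sample names and phone numbers are made up for illustration. One assumption worth flagging: workers with no known location are kept in the results rather than filtered out, since a sparse profile should not be hidden for what it lacks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    name: str
    phone: str
    role: Optional[str] = None   # may be unknown on Day 1
    state: Optional[str] = None  # may be unknown on Day 1


def match_workers(workers, role, state=None):
    """Return workers tagged with `role`, optionally narrowed by `state`.

    No scoring and no ranking: input order is preserved.
    """
    matches = []
    for w in workers:
        if w.role is None or w.role.lower() != role.lower():
            continue
        # Only exclude on location when the worker's location is known
        # and differs. Unknown location is not a reason to hide someone.
        if state is not None and w.state is not None and w.state != state:
            continue
        matches.append(w)
    return matches


workers = [
    Worker("Ana Ruiz", "555-0101", role="Forklift Operator", state="IL"),
    Worker("Ben Cole", "555-0102", role="Forklift Operator"),
    Worker("Cy Dunn", "555-0103"),
    Worker("Dee Park", "555-0104", role="Forklift Operator", state="WI"),
]
print([w.name for w in match_workers(workers, "Forklift Operator", "IL")])
```

With this sample, the IL search returns Ana Ruiz and Ben Cole: one confirmed match and one worker whose location simply is not known yet.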

Phase 2: Earn data through use (Week 1-4)

  • Every placement generates a timesheet → reliability starts building
  • Every call logged → communication history accumulates
  • Every check-in → availability becomes real
  • Every cert verified → certification database grows
  • The staffer doesn't "enter data" — they just do their job, and the system learns from it.
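One way the timesheet side of Phase 2 could look, as a sketch only. This ADR does not define how reliability is calculated, so the formula here (shifts attended divided by shifts scheduled) is an assumption, as are the `worker_id` and `showed_up` field names and the `reliability_from_timesheets` helper.

```python
from collections import defaultdict


def reliability_from_timesheets(timesheets):
    """Map worker_id -> (fraction of shifts attended, number of shifts).

    Workers with no timesheets are absent from the result, so callers
    can tell "no data yet" apart from "0% reliable".
    """
    attended = defaultdict(int)
    scheduled = defaultdict(int)
    for row in timesheets:
        scheduled[row["worker_id"]] += 1
        if row["showed_up"]:
            attended[row["worker_id"]] += 1
    return {wid: (attended[wid] / n, n) for wid, n in scheduled.items()}


timesheets = [
    {"worker_id": "w1", "showed_up": True},
    {"worker_id": "w1", "showed_up": True},
    {"worker_id": "w1", "showed_up": False},
    {"worker_id": "w2", "showed_up": True},
]
scores = reliability_from_timesheets(timesheets)
print(scores.get("w1"))  # two of three shifts attended
print(scores.get("w3"))  # None: no timesheets, so no score
```

Returning the shift count alongside the fraction keeps the evidence attached to the number, which is what a "based on N placements" label needs later.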

Phase 3: AI starts helping (Month 2+)

  • Enough data to actually score workers → show reliability
  • Enough history to predict availability → surface it
  • Enough placements to know client preferences → suggest matches
  • The enrichment happened BECAUSE they used the system, not as a prerequisite TO use it.

What This Means for the UI

  • Worker cards must gracefully degrade. Name only? Show name only. Name + role? Show name + role. Full profile? Show everything.
  • Never show empty metrics. No "Reliability: 0%" — that looks like the worker is unreliable. Just don't show it until there's data.
  • Lead with what the staffer KNOWS: "you placed this worker at Company X last month" — that's information they trust because they lived it.
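The degradation rule above can be sketched as a card renderer that emits a line only for fields that exist. The `render_card` function and the dictionary keys are hypothetical, chosen for illustration; the real card component may look quite different.

```python
def render_card(worker):
    """Build card lines from the fields that exist; skip everything else."""
    lines = [worker["name"]]
    for key in ("phone", "role", "city"):
        value = worker.get(key)
        if value:
            lines.append(value)
    reliability = worker.get("reliability")
    placements = worker.get("placements", 0)
    # A score with no placements behind it is not shown at all,
    # so the card never displays "Reliability: 0%" for a new worker.
    if reliability is not None and placements > 0:
        noun = "placement" if placements == 1 else "placements"
        lines.append(
            f"Reliability: {reliability:.0%} (based on {placements} {noun})"
        )
    return lines


print(render_card({"name": "Ana Ruiz"}))
print(render_card({
    "name": "Ana Ruiz",
    "role": "Forklift Operator",
    "reliability": 0.87,
    "placements": 3,
}))
```

A name-only profile renders as a single line, and the reliability line appears only once there are placements to back it.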

What This Means for the Architecture

  • The vector index works on whatever text exists. A name + role is only a few dozen characters (well under 200), and that's still enough for an embedding. As more data arrives, the embeddings get richer and search gets better.
  • The hybrid search works with sparse SQL filters too. "role = 'Forklift Operator'" is a valid filter even when no reliability data exists yet.
  • The playbook system captures the staffer's decisions: "you picked this worker for this contract" → that IS the training data for future AI recommendations.
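To illustrate the first two points, here is a sketch using an in-memory SQLite table as a stand-in for the real store. The table layout, the `search` helper, and the `embedding_text` helper are all assumptions made for this example; the actual hybrid search and embedding pipeline are not specified in this ADR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workers (name TEXT NOT NULL, phone TEXT NOT NULL,"
    " role TEXT, state TEXT, reliability REAL)"
)
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, ?, ?, ?)",
    [
        ("Ana Ruiz", "555-0101", "Forklift Operator", "IL", None),
        ("Ben Cole", "555-0102", "Forklift Operator", "IL", 0.9),
        ("Cy Dunn", "555-0103", None, None, None),
    ],
)


def search(conn, **filters):
    """Filter on whichever columns the caller supplies; ignore the rest."""
    allowed = ("role", "state")  # column names come from this allowlist only
    clauses, params = [], []
    for col in allowed:
        if filters.get(col) is not None:
            clauses.append(f"{col} = ?")
            params.append(filters[col])
    where = " AND ".join(clauses) if clauses else "1"
    rows = conn.execute(
        f"SELECT name FROM workers WHERE {where} ORDER BY name", params
    )
    return [r[0] for r in rows]


def embedding_text(name, role=None, notes=None):
    """Concatenate whatever text exists; this is what would be embedded."""
    return " | ".join(part for part in (name, role, notes) if part)


print(search(conn, role="Forklift Operator"))
print(embedding_text("Ana Ruiz", "Forklift Operator"))
```

The role filter returns both forklift operators, including the one whose reliability is still NULL, because the query never touches that column.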

Implementation

  1. Sparse profile ingest: accept CSV with as few as 2 columns (name, phone). Everything else optional.
  2. Graceful UI degradation: worker cards only show fields that exist
  3. Progressive enrichment: timesheet ingest → auto-calculate reliability; check-in ingest → auto-calculate availability
  4. Trust indicators: "Reliability: 87%, based on 3 placements", never a bare "Reliability: 87%". Show WHERE the number comes from.
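Step 1 could be sketched along these lines, using only the standard library. The `ingest_workers` function, the exact column names `name` and `phone`, and the choice to skip rows missing either value are assumptions for illustration, not a settled design.

```python
import csv
import io

REQUIRED = ("name", "phone")


def ingest_workers(csv_file):
    """Yield one dict per usable row, keeping only non-empty columns."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    for row in reader:
        # Drop blank cells so "unknown" is an absent key, not an empty string.
        clean = {k: v.strip() for k, v in row.items() if k and v and v.strip()}
        # Rows without a name or phone are skipped in this sketch.
        if all(c in clean for c in REQUIRED):
            yield clean


sample = io.StringIO(
    "name,phone,role\n"
    "Ana Ruiz,555-0101,Forklift Operator\n"
    "Ben Cole,555-0102,\n"
)
rows = list(ingest_workers(sample))
print(rows)
```

Because blank cells are dropped at ingest, downstream code can test for a field with a plain key lookup, which lines up with the rule that cards only show fields that exist.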