2 Commits

Author SHA1 Message Date
root
23eb04a145 Onboarding wizard — ingest any staffing CSV in 3 steps
New /onboard page. Client-facing wizard for getting real data into
the system without engineering help.

Flow:
1. Drop a CSV (or click 'Use the sample as my data' — ships a 25-row
   realistic staffing roster under /samples/staffing_roster_sample.csv)
2. Browser parses client-side. Columns auto-typed (text/int/decimal/
   date). PII flagged by name hint AND content regex (emails, phones).
   First rows previewed. Read-only — nothing written yet.
3. Name the dataset (lowercase+underscores). Commit.
4. Post-commit: dataset is live. Shows 4 next steps the operator can
   take (SQL query, vector index, dashboard search, playbook training).

Backend:
- /onboard serves onboard.html
- /samples/*.csv serves CSV files from mcp-server/samples/ with
  filename validation (only [a-zA-Z0-9_-.]+.csv, prevents path traversal)
- /onboard/ingest forwards multipart/form-data to gateway /ingest/file
  preserving the boundary. The generic /api/* passthrough breaks
  multipart because it reads as text and forwards as JSON; this route
  uses arrayBuffer + original Content-Type.

Verified end-to-end: upload sample roster (25 rows, 12 columns) →
parse in browser → show columns + PII flags + preview → commit →
gateway writes Parquet, registers in catalog → immediately queryable:
  SELECT * FROM onboard_demo2 LIMIT 3
  → Sarah Johnson, Forklift Operator, Chicago, IL, 0.92
Round-trip <1 second.

Nav updated on all pages to link Onboard. Shipped with a sample CSV
so the full flow is demonstrable without real client data.

When a real client shows up, same path — they upload their CSV.
No engineering ticket, no code change, no schema pre-definition.

Security: sample filename regex prevents path traversal. CSV parse
is client-side pure JS (no DOM injection). Commit uses existing
/ingest/file validation (schema fingerprint, PII server-side,
content-hash dedup).
2026-04-20 18:13:56 -05:00
root
468798c9ac /spec: technical specification — 11-chapter README-equivalent
J's ask: explain the full architecture so someone reading a README
can dispute it or recreate it. The repo isn't public yet; this page
IS the spec until it is.

Ch1 Repository layout — 13 crates + tests/multi-agent + docs + data,
    with owned responsibility and file path per crate.

Ch2 Data ingest pipeline (8 steps) — sources (file/inbox/DB/cron),
    parse+normalize with ADR-010 conservative typing, PII auto-tag,
    dedup, Parquet write, catalog register with fingerprint gate,
    mark embeddings stale, queryable immediately.

Ch3 Measurement & indexing — row count / fingerprint / owner /
    sensitivity / freshness / lineage per dataset. HNSW vs Lance
    tradeoff table with measured numbers (ADR-019). Autotune loop.
    Per-profile scoping (Phase 17).

Ch4 Contract inference from external signal — Chicago permit feed
    → role mapping → worker count heuristic → timeline → hybrid
    search with boost → pattern discovery → rendered card. All
    pre-computed before staffer opens UI.

Ch5 What a CRM can't do — 11-row comparison table of capabilities.

Ch6 How it gets better over time — three paths:
    - Phase 19 playbook boost (full math)
    - Pattern discovery meta-index
    - Autotune agent

Ch7 Scale story: 20 staffers, 300 contracts, midday +20/+1M surge
    - Async gateway + per-staffer profile isolation + client blacklists
    - 7-step surge handling flow (ingest, stale-mark, incremental refresh,
      degradation, hot-swap, autotune re-enter)
    - Known pain points: Ollama inference serial, RAM ceiling ~5M on
      HNSW (mitigated by Lance), VRAM 1-2 models sequential,
      playbook_memory unbounded.

Ch8 Error surfaces & recovery — 10-row table covering ingest schema
    conflicts, bucket failures, ghost names, dual-agent drift,
    empty searches, Ollama down, gateway restart, schema fingerprint
    divergence. Every failure has a named surface and recovery path.

Ch9 Per-staffer context — active profile, workspace, client blacklist,
    audit trail, daily summary. How 20 staffers don't see the same UI.

Ch10 Day in the life — 07:00 housekeeping → 07:30 refresh → 08:00
     staffer opens → 08:15 drill down → 08:30 Call click → 09:00
     second staffer shares memory → 12:30 surge → 14:00 no-show →
     15:00 new embeddings live → 17:00 retrospective → 22:00
     overnight trials.

Ch11 Known limits & non-goals — deferred (rate/margin, push, confidence
     calibration, neural re-ranker, pm compaction, call_log cross-ref)
     and explicitly out-of-scope (cloud, ACID, streaming, CRM replace,
     proprietary formats, hard multi-tenant).

Also: nav updated on /dashboard, /console, /proof to link /spec.
Every architectural claim in the spec cites either a code path, an
ADR number, or a phase reference so someone skeptical can target
the specific artifact.
2026-04-20 17:56:18 -05:00