46 Commits

Author SHA1 Message Date
root
650f5e97b6 Fix chunker UTF-8 boundary panic (causes 120GB OOM in refresh path)
The chunker's &text[start..end] slice could land inside a multi-byte
UTF-8 character (e.g. narrow no-break space \u{202f}, em-dashes, smart
quotes — universal in pg-imported editorial data). Rust panics on
non-boundary string slicing. In the refresh path that panic is caught
by tokio's task machinery but somehow causes linear memory growth at
~540MB/sec until OOM at 120GB+.

Root cause: chunk boundaries computed by byte arithmetic without
checking is_char_boundary(). The existing "look for last sentence / \n
/ space" logic finds ASCII-safe positions, but the *primary* `end`
calculation `(start + chunk_size).min(text.len())` lands wherever.

Fix:
- ceil_char_boundary(s, idx) — forward-scan to the nearest valid
  UTF-8 char boundary. Used at end, actual_end, and next_start.
- Iteration cap — break if iterations exceed text.len(). Any
  non-progressing loop dies safely instead of burning memory.
- Forced forward advance — if overlap + boundary math produce a
  next_start <= start, force +1 char to guarantee termination.

Reproduced on kb_team_runs (585 pg-imported prompts with editorial
unicode): previous run grew memory linearly to 124GB over 240s then
OOM-killed. Same request after fix: peaks at <100MB, completes in
~4m42s to produce 12,693 embeddings. /vectors/search returns
relevant results.

Regression tests added:
- handles_multibyte_utf8_at_chunk_boundary — exact \u{202f} repro
- no_infinite_loop_on_no_spaces — 5KB text, no whitespace
- no_infinite_loop_on_degenerate_params — chunk_size == overlap

Surfaced by Phase C, but pre-existed as a latent bug since Phase 7.
Any Ollama-targeted RAG corpus with non-ASCII content would have hit
this once it grew past ~13KB per document.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 03:27:17 -05:00
root
97a376482c Phase C: Decoupled embedding refresh
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive,
ingest marks the vector index stale, a later refresh embeds only the
delta (doc_ids not already in the index).

Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled

Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.

Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
  writes combined index, clears stale flag

Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
  text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set

End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
  last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
  re-embed); stale_cleared = true

Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
  per-row content hashing
- OnAppend policy auto-trigger - just declares intent; actual scheduler
  deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
  operators can see which datasets expect what, but the cron itself is
  separate

Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 03:00:43 -05:00
root
76f6fba5de Phase B: Lance pilot — hybrid decision with measured benchmark
Standalone benchmark crate `crates/lance-bench` running Lance 4.0 against
our Parquet+HNSW at 100K × 768d (resumes_100k_v2) measured 8 dimensions.

Results (see docs/ADR-019-vector-storage.md for full scorecard):

  Cold load:        Parquet 0.17s   vs Lance 0.13s   (tie — not ≥2× threshold)
  Disk size:        330.3 MB        vs 330.4 MB      (tie)
  Search p50:       873us           vs 2229us        (Parquet 2.55× faster)
  Search p95:       1413us          vs 4998us        (Parquet 3.54× faster)
  Index build:      230s (ec=80)    vs 16s (IVF_PQ)  (Lance 14× faster)
  Random access:    35ms (scan)     vs 311us         (Lance 112× faster)
  Append 10K rows:  full rewrite    vs 0.08s/+31MB   (Lance structural win)

Decision (ADR-019): hybrid, not migrate-or-reject.

- Parquet+HNSW stays primary — our HNSW at ec=80 es=30 recall=1.00 is
  2.55× faster than Lance IVF_PQ at 100K in-RAM scale
- Lance joins as second backend per-profile for workloads where it wins
  architecturally: random row access (RAG text fetch), append-heavy
  pipelines (Phase C), hot-swap generations (Phase 16, 14× faster
  builds), and indexes past the ~5M RAM ceiling
- Phase 17 ModelProfile gets vector_backend: Parquet | Lance field
- Ceiling table in PRD updated — 5M ceiling now says "switch to Lance"
  instead of "migrate" since Lance runs alongside, not instead of

Isolation: lance-bench is a standalone workspace crate with its own dep
tree (Lance pulls DataFusion 52 + Arrow 57 incompatible with main stack
DataFusion 47 + Arrow 55). Kept off the critical path until API is
stable enough to promote into vectord::lance_store.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:37:11 -05:00
root
dbe00d018f Federation foundation + HNSW trial system + Postgres streaming + PRD reframe
Four shipped features and a PRD realignment, all measured end-to-end:

HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
  TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
  /hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us
  at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s,
  80/30 = 1.00 recall in 230s

Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
  and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows

Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
  /etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
  rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
  anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
  /storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
  X-Lakehouse-Rescue-Used observability headers on fallback

Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
  into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
  (586 rows × 13 cols, 6 batches, 196ms end-to-end)

PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
  trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
  bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
  decision rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 01:50:05 -05:00
root
84407eeb51 Stress test suite: 9/9 passed — architecture validated
Tests:
1. Concurrent (10 queries): avg 48ms, max 50ms, no contention
2. Cross-reference (1.3M rows): 130ms, 3 JOINs + anti-join
3. Restart recovery: 12 datasets, 100K rows identical after restart
4. Pagination: 100K rows in 1000 pages, random page fetch works
5. Sustained: 70 QPS over 100 queries, 0 errors
6. Journal: write, flush, read-back correct
7. Tool registry: 6 tools execute correctly with audit
8. Cache: hot/cold verified
9. MySQL comparison: schema-on-read, vector+SQL, portable backup, PII auto-detect

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 22:13:27 -05:00
root
037555802e Systemd services: gateway, sidecar, UI survive reboots
- lakehouse.service: release gateway on :3100, auto-restart
- lakehouse-sidecar.service: Python FastAPI on :3200, auto-restart
- lakehouse-ui.service: WASM file server on :3300, auto-restart
- All enabled at boot (multi-user.target)
- scripts/serve_ui.py for systemd-compatible file serving

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 22:06:28 -05:00
root
fdb2e9cda8 Fix browser crash: cache schema context, lazy Dashboard, default to Ask tab
Root cause: Dashboard auto-fired 6+ API calls on load, then Ask tab fired
7 DESCRIBE queries per question — 15+ concurrent requests from WASM.

Fixes:
- Schema context cached after first build (7 DESCRIBE → 0 on subsequent questions)
- Dashboard lazy-loads only when tab clicked (not on app mount)
- Default tab changed back to Ask (no background API storm)
- std::sync::Mutex for WASM compat (no tokio in browser)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 21:18:53 -05:00
root
238cb84d26 Server-side pagination for large result sets
- ResultStore: execute query, store batches server-side, serve pages on demand
- POST /query/paged → returns query_id + total_rows + page count (no rows)
- GET /query/page/{id}/{page}?size=100 → returns one page of rows
- RecordBatch slicing for efficient page extraction from Arrow batches
- LRU eviction: keeps 50 most recent query results in memory
- Tested: 100K rows → 1,000 pages of 100, any page fetchable by number
- Supervisor pattern: chunk results, serve on demand, retry-safe (idempotent GET)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:54:44 -05:00
root
ed17216005 Fix browser crash: limit schema context + cap table rows at 200
- Schema context limited to 7 core staffing tables (was all 12+)
- Results table capped at 200 rows to prevent DOM explosion
- Shows "first 200 of N rows" when truncated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:48:32 -05:00
root
0bd753294b Robust SQL extraction: handles explanations, markdown, prefixes
clean_sql now uses 3 strategies in priority order:
1. Extract from ```sql...``` markdown blocks
2. Find first SELECT/WITH/INSERT statement in text
3. Strip leading "sql" keyword fallback

Tested against 5 real model output patterns:
- Clean SQL ✓
- "sql" prefixed ✓
- Markdown fenced ✓
- Explanation before ```sql block ✓
- Explanation with SELECT buried in text ✓

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:42:11 -05:00
root
34c03894ae Auto-retry on ALL SQL errors, not just schema errors
Previous: only retried on "Schema error" or "No field named"
Now: retries on any error (type mismatches, execution errors, etc.)
Model gets full error message + schema to write corrected SQL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:38:27 -05:00
root
ddcbb9590c Fix SQL generation: clean_sql helper + relationship hints + verified
- clean_sql() strips markdown fences, leading "sql" keyword, trailing explanations
- Schema context now includes table relationships (JOIN paths)
- Explicit note: "vertical only in candidates/clients/job_orders, JOIN for others"
- Full column paths (table.column) in schema to reduce ambiguity
- Auto-retry on schema errors feeds error + schema back to model
- TESTED: 4 questions all return correct results:
  "highest avg salary" → IT $2,213 ✓
  "top 5 earning over $50/hr" → correct candidates ✓
  "most placements by vertical" → Industrial 10,096 ✓
  "revenue by client" → 1,996 clients ✓

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:36:55 -05:00
root
2c5aeaeada Fix SQL generation: stricter prompt + auto-retry on schema errors
- Prompt now says "CRITICAL: ONLY use columns from schema, do NOT invent"
- Strips markdown backticks from model output
- Auto-retry: if SQL fails with "Schema error" or "No field named",
  feeds the error + schema back to the model for a corrected query
- Both button click and Enter key paths have retry logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:32:25 -05:00
root
399fc81ab5 UI: Dashboard + Ingest tabs showing full system progression
- Dashboard: live stats (datasets, rows, embeddings, HNSW, tools, cache)
  Architecture overview (6 capability areas)
  Build progression timeline (all 17 phases listed)
- Ingest tab: Postgres table browser + import, file upload info, inbox watcher
- System tab: existing health checks
- Starts on Dashboard for immediate overview
- No futures::executor in WASM — all async/await

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:22:36 -05:00
root
9992b5f135 Database connector: PostgreSQL → Parquet import
- POST /ingest/postgres/tables — list all tables in a database
- POST /ingest/postgres/import — import table → Parquet → catalog → queryable
- Auto type mapping: int2/4/8 → Int, float4/8 → Float64, bool → Boolean,
  text/varchar/jsonb/timestamp → Utf8 (safe default per ADR-010)
- Auto PII detection + lineage on import
- Empty password support for trust auth
- Tested: imported lab_trials (40 rows, 10 cols) and threat_intel (20 rows, 30 cols)
  from local knowledge_base Postgres database — immediately queryable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:14:16 -05:00
root
294f3f6a49 Scheduled ingest: file watcher auto-ingests from ./inbox
- Drop CSV/JSON/PDF/text into ./inbox → auto-detected → Parquet → queryable
- Polls every 10 seconds (configurable)
- Processed files moved to ./inbox/processed/
- Failed files moved to ./inbox/failed/
- Dedup: same file dropped twice = no-op
- Watcher starts automatically on gateway boot
- Tested: CSV dropped → queryable in <15s

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:04:40 -05:00
root
04770c97eb HNSW vector index: 100K search in 27ms (58x faster than brute-force)
- instant-distance HNSW implementation for approximate nearest neighbors
- HnswStore: build from stored embeddings, in-memory index, thread-safe
- POST /vectors/hnsw/build — build index from Parquet (100K in 35s release)
- POST /vectors/hnsw/search — fast ANN search
- GET /vectors/hnsw/list — list loaded indexes

Benchmark (100K × 768d, release build):
  Brute-force: 1,567ms
  HNSW:           31ms (50x)
  HNSW warm:      27ms (58x)

Build cost: 35s one-time for 100K vectors (release mode)
ef_construction=40, ef_search=50 — good recall/speed balance

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:00:50 -05:00
root
8282842eaf Sync memory + phases: all 15 phases marked complete
PHASES.md and project memory updated to reflect actual build state.
Phases 11-14 were built but trackers weren't updated.

Final stats: 11 crates, 30 tests, 16 ADRs, 2.47M rows, 100K vectors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 19:34:15 -05:00
root
35f0559d78 Phase 14: Schema evolution with AI migration rules
- Schema diff detection: compare old vs new schema, identify changes
  (added, removed, type changed, renamed columns)
- Fuzzy rename detection: "first_name" → "full_name" detected by shared word parts
- Auto-generated migration rules: direct map, cast, concat, split, drop
  Each rule has confidence score (0.0-1.0)
- AI migration prompt builder: generates LLM prompt for complex schema changes
  LLM suggests JSON migration rules when heuristics aren't enough
- 5 new unit tests (detect added, removed, type change, rename, rule generation)
- 30 total unit tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 19:31:19 -05:00
root
d61096e26f 100K embedding COMPLETE: 177/sec, 9.5 min, zero failures
- Supervisor 4-pipeline: 100,000 chunks embedded successfully
- Peak throughput: 177 chunks/sec (4.1x vs single-pipeline 43/sec)
- Total time: 572s (9.5 minutes)
- Storage: 315 MB Parquet
- Brute-force search over 100K vectors: 4.5s
- Index metadata registered: nomic-embed-text, 768d, build stats
- Zero failures — supervisor retry handled all transient errors

Previous attempt (single pipeline): failed at 97K after 38 min
This attempt (supervisor): completed 100K in 9.5 min with retry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:53:47 -05:00
root
e5b7663c20 Phase 13: Access control — role-based sensitivity enforcement
- AccessControl: agent roles with allowed sensitivity levels
- 4 default roles: admin (all), recruiter (PII ok), analyst (financial ok), agent (internal only)
- Field-level masking: determines which columns to mask per agent based on sensitivity
- Query audit log: tracks every query with agent, datasets, PII fields accessed
- Endpoints: GET/POST /access/roles, GET /access/audit, POST /access/check
- Toggleable via config (auth.enabled)
- 100K embedding: supervisor now sustained 125/sec (2.9x vs single pipeline)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:47:47 -05:00
root
b2cd54e941 100K embedding: supervisor achieves 67.6/sec (57% faster than single pipeline)
- 4 parallel pipelines on i9 + A4000 via Ollama
- Previous single-pipeline: 43/sec, 39min for 100K
- Supervisor: 67.6/sec, 22min for 100K
- Previous 100K attempt failed at 97K (no retry) — supervisor handles this
- Checkpointing every 1000 chunks for crash recovery
- Round-robin retry on batch failure (3 attempts)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:45:59 -05:00
root
6f0f92a9e4 Phase 12: Tool registry — governed business actions for AI agents
- ToolRegistry: named tools with parameter validation and audit logging
- 6 built-in staffing tools:
  search_candidates (skills, city, state, experience, availability)
  get_candidate (by ID)
  revenue_by_client (top N by billed revenue)
  recruiter_performance (placements, revenue per recruiter)
  cold_leads (called N+ times, never placed)
  open_jobs (by vertical, city)
- Each tool: name, description, params, permission level (read/write/admin)
- SQL template with validated parameter substitution
- Full audit trail: every invocation logged with agent, params, result
- Endpoints: GET /tools (list), GET /tools/{name} (schema),
  POST /tools/{name}/call (execute), GET /tools/audit (log)
- Per ADR-015: governed interface before raw SQL for agents

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:31:42 -05:00
root
6cd1daeb51 Phase 11: Embedding versioning — model-proof vector layer
- IndexRegistry: tracks all vector indexes with model metadata
  (model_name, model_version, dimensions, build stats)
- Index metadata persisted as JSON in vectors/meta/
- Rebuilt on startup for crash recovery
- GET /vectors/indexes — list all indexes (filter by source/model)
- GET /vectors/indexes/{name} — get index metadata
- Background jobs auto-register metadata on completion
- Multi-version support: same data, different models, coexist
- Per ADR-014: enables incremental re-embed on model upgrade

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:27:10 -05:00
root
6d49f81ebf Add read-mem skill + comprehensive project memory
- /read-mem skill: reads PRD, phases, decisions, checks live services
- Updated PHASES.md with all 15 phases tracked
- Updated project_lakehouse.md memory with full context
- Updated CLAUDE.md with project reference
- Skill at ~/.claude/skills/read-mem/ and project level
- Triggers on: "read mem", "project status", "where were we", "catch me up"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:23:01 -05:00
root
9e53caaec3 Phase 10: Rich catalog v2 — metadata as product
- DatasetManifest expanded: description, owner, sensitivity, columns,
  lineage, freshness contract, tags, row_count
- All new fields use #[serde(default)] for backward compatibility
- PII auto-detection: scans column names for email, phone, SSN, salary,
  address, DOB, medical terms — flags as PII/PHI/Financial
- Column-level metadata: name, type, sensitivity, is_pii flag
- Lineage tracking: source_system, source_file, ingest_job, timestamp
- Ingest pipeline auto-populates: PII scan, column meta, lineage, row count
- PATCH /catalog/datasets/by-name/{name}/metadata — update metadata
- Catalog responses now include all rich fields
- 25 unit tests passing (5 new PII detection tests)

Per ADR-013: datasets without metadata become mystery files.
This makes every ingested file self-describing from day one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:15:09 -05:00
root
bf7cf96911 Phase 9: Event journal — append-only mutation history
- journald crate: immutable event log for every data mutation
- Events: entity_type, entity_id, field, action, old_value, new_value,
  actor, source, workspace_id, timestamp
- In-memory buffer with configurable flush threshold (default 100 events)
- Flush writes events as Parquet to journal/ directory
- Query: GET /journal/history/{entity_id} — full history of any record
- Query: GET /journal/recent?limit=50 — latest events across all entities
- Convenience methods: record_insert, record_update, record_ingest
- Stats: GET /journal/stats — buffer size, persisted file count
- Manual flush: POST /journal/flush
- Per ADR-012: events are never modified or deleted

This is the single most important future-proofing decision.
Once history is lost, it's gone forever.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:09:33 -05:00
root
3b695cd592 Dual-pipeline supervisor for embedding ingestion
- 4 parallel pipelines (tuned for i9 + A4000)
- Range-based work splitting (2500 chunks per range)
- Round-robin retry on failure (3 attempts before dead-letter)
- Checkpointing to disk every 1000 chunks (crash recovery)
- On restart, loads checkpoint and skips completed ranges
- Dead-letter queue for permanently failed ranges
- Vectors assembled in order after all pipelines finish
- Batch size 64 for GPU throughput

Architecture:
  Supervisor → splits 100K chunks into 40 ranges
    ├── Pipeline 0: grabs range, embeds, reports progress
    ├── Pipeline 1: grabs range, embeds, reports progress
    ├── Pipeline 2: grabs range, embeds, reports progress
    └── Pipeline 3: grabs range, embeds, reports progress
    Failed range → back to queue → next available pipeline retries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:06:28 -05:00
root
6a532cb248 Background job system for embedding — fixes 100K timeout
- JobTracker: create/update/complete/fail jobs with progress tracking
- POST /vectors/index now returns immediately with job_id (HTTP 202)
- Embedding runs in tokio::spawn background task
- GET /vectors/jobs/{id} returns live progress (chunks embedded, rate, ETA)
- GET /vectors/jobs lists all jobs
- Progress logged every 100 batches with chunks/sec and ETA
- 100K embedding job running successfully at 44 chunks/sec
- System stays responsive during embedding (queries in 23ms)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:03:07 -05:00
root
354c9c4a04 PRD v3: future-proofing roadmap — event journal, rich catalog, tool registry
Phases 9-15 designed based on "future regret" analysis:
- Phase 9: Event journal (append-only mutation history — can't retrofit)
- Phase 10: Rich catalog v2 (ownership, sensitivity, lineage, freshness)
- Phase 11: Embedding versioning (model-proof vector layer)
- Phase 12: Tool registry (governed agent actions via MCP)
- Phase 13: Security & access control (field-level, row-level, audit)
- Phase 14: Schema evolution with AI migration rules
- Phase 15+: Federated query, DB connectors, OCR, fine-tuned models

8 design principles: store truth openly, describe richly, never destroy
evidence, secure centrally, expose through tools, version everything,
unstructured first-class, separate storage/compute/intelligence.

ADR-012 through ADR-016 documenting key future-proofing decisions.
Updated benchmarks: 2.47M rows, hot cache 9.8x speedup.
Updated operating rules: cheap-now/expensive-later built first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:57:29 -05:00
root
0b9da45647 Agent workspaces: per-contract overlays with instant handoff
- WorkspaceManager: create/get/list workspaces with daily/weekly/monthly/pinned tiers
- Saved searches: agent stores SQL queries in workspace context
- Shortlist: tag candidates/records to a workspace with notes
- Activity log: track calls, emails, updates per workspace per agent
- Instant handoff: transfer workspace ownership with full history
  Zero data copy — just a pointer swap, receiving agent sees everything
- Persistence: workspaces stored as JSON in object storage, rebuilt on startup
- Endpoints: /workspaces/create, /{id}, /{id}/handoff, /{id}/search,
  /{id}/shortlist, /{id}/activity
- Tested: Sarah creates workspace, saves searches, shortlists 3 candidates,
  logs activity, hands off to Mike who continues seamlessly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:44:45 -05:00
root
6df904a03c Phase 8: Hot cache + incremental delta updates
- MemCache: LRU in-memory cache for hot datasets (configurable max, default 16GB)
  Pin/evict/stats endpoints: POST /query/cache/pin, /cache/evict, GET /cache/stats
- Delta store: append-only delta Parquet files for row-level updates
  Write deltas without rewriting base files, merge at query time
- Compaction: POST /query/compact merges deltas into base Parquet
- Query engine: checks cache first, falls back to Parquet, merges deltas
- Benchmarked on 2.47M rows:
  1M row JOIN: 854ms cold → 96ms hot (8.9x speedup)
  100K filter: 62ms cold → 21ms hot (3x speedup)
  1.1M rows cached in 408MB RAM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:37:28 -05:00
root
eae51977ab Scale test: 2.47M rows + 10K vector index benchmarked
Benchmarks on 128GB RAM server:
- 100K candidate filter (skills+city+status): 257ms
- 1M timesheet aggregation (revenue by client): 942ms
- 800K call log cross-reference (cold leads): 642ms
- Triple JOIN recruiter performance: 487ms
- 500K email open rate aggregation: 259ms
- COUNT all 2.47M rows: 84ms
- 10K vector search (cosine similarity): ~450ms
- Embedding throughput: 49 chunks/sec via Ollama
- RAG correctly refuses to hallucinate when no match exists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:31:37 -05:00
root
26fc98c885 Phase 7: Vector index + RAG pipeline
- vectord crate: chunk → embed → store → search → RAG
- chunker: configurable chunk size + overlap, sentence-boundary aware splitting
- store: embeddings as Parquet (binary blob f32 vectors), portable format
- search: brute-force cosine similarity (works up to ~100K vectors)
- rag: full pipeline — embed question → search index → retrieve context → LLM answer
- Endpoints: POST /vectors/index, /vectors/search, /vectors/rag
- Gateway wired with vectord service
- Tested: 200 candidate resumes indexed in 5.4s, semantic search + RAG working
- 20 unit tests passing (chunker, search, ingestd, shared)
- AI gives honest "no match found" when context doesn't support an answer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:12:28 -05:00
root
bb05c4412e Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support
- ingestd crate: detect file type → parse → schema detection → Parquet → catalog
- CSV: auto-detect column types (int, float, bool, string), handles $, %, commas
  Strips dollar signs from amounts, flexible row parsing, sanitized column names
- JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c)
- PDF: text extraction via lopdf, one row per page (source_file, page_number, text)
- Text/SMS: line-based ingestion with line numbers
- Dedup: SHA-256 content hash, re-ingest same file = no-op
- Gateway: POST /ingest/file multipart upload, 256MB body limit
- Schema detection per ADR-010: ambiguous types default to String
- 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup)
- Tested: messy CSV with missing data, dollar amounts, N/A values → queryable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:07:31 -05:00
root
6740a017c7 PRD v2: production roadmap with ingest, vector search, hot cache phases
- Phase 6: Ingest pipeline (CSV/JSON → schema detect → Parquet → catalog)
- Phase 7: Vector index + RAG (embed → HNSW → semantic search → LLM answer)
- Phase 8: Hot cache + incremental updates (MemTable, delta files, merge-on-read)
- ADR-008 through ADR-011: embeddings as Parquet, delta files not Delta Lake,
  schema defaults to string, not a CRM replacement
- Staffing company reference dataset (286K rows, 7 tables)
- Honest risk assessment: vector search at scale and incremental updates are hard

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:54:24 -05:00
root
b37e171e10 UI redesign: Ask, Explore, SQL, System tabs
- Ask: natural language → AI generates SQL → DataFusion executes → results
  Shows the AI-over-data-lake story: schema introspection → LLM → query
- Explore: click dataset → schema + preview + AI-generated summary
- SQL: raw DataFusion SQL editor with Ctrl+Enter
- System: health grid testing all 5 services + embeddings + generation
- Example prompts for quick demo
- Dark theme with accent styling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:24:51 -05:00
root
b235ef9201 Fix nginx route collision — namespace lakehouse API under /lakehouse/api/
Previous regex routes for /catalog, /storage, /health intercepted main site.
Now all lakehouse API calls go through /lakehouse/api/ prefix, stripped by nginx rewrite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:57:58 -05:00
root
387ce0074c UI: full-stack test coverage with tabs for Query, Storage, AI, Status
- Query tab: SQL editor with results table (existing)
- Storage tab: list objects, register datasets pointing at storage keys
- AI tab: embed (nomic-embed-text), generate (qwen2.5), rerank with scored results
- Status tab: health checks for all 5 services + functional tests (embed, generate, SQL)
- nginx: added /lakehouse/ and API proxy paths to devop.live config
- Loaded 3 sample datasets: employees, events, products
- Fixed Rust 2024 reserved keyword `gen`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:56:18 -05:00
root
01373c0e45 Phase 5: hardening — gRPC, observability, auth, config
- proto: lakehouse.proto with CatalogService, QueryService, StorageService, AiService
- proto crate: tonic-build codegen from proto definitions
- catalogd: gRPC CatalogService implementation
- gateway: dual HTTP (:3100) + gRPC (:3101) servers
- gateway: OpenTelemetry tracing with stdout exporter
- gateway: API key auth middleware (toggleable)
- shared: TOML config system with typed structs and defaults
- lakehouse.toml config file
- ADR-006 and ADR-007 documented

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:37:07 -05:00
root
50a8c8013f Phase 4: Dioxus frontend with dataset browser and SQL query editor
- ui: Dioxus WASM app with dataset sidebar, SQL editor (Ctrl+Enter), results table
- ui: dynamic API base URL (same-origin for nginx, port-based for local dev)
- gateway: CORS enabled for cross-origin requests
- nginx: lakehouse.devop.live proxies UI (:3300) + API (:3100) on same origin
- justfile: ui-build, ui-serve, sidecar, up commands added

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:24:15 -05:00
root
78266fdd05 Add __pycache__ to gitignore
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:54:09 -05:00
root
239e471223 Phase 3: AI integration with Ollama via Python sidecar
- sidecar: FastAPI app with /embed, /generate, /rerank hitting Ollama
- sidecar: Dockerfile, env var config (EMBED_MODEL, GEN_MODEL, RERANK_MODEL)
- aibridge: reqwest HTTP client with typed request/response structs
- aibridge: Axum proxy endpoints (POST /ai/embed, /ai/generate, /ai/rerank)
- gateway: wires AiClient with SIDECAR_URL env var
- e2e verified: nomic-embed-text returns 768d vectors, qwen2.5 generates text

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:53:56 -05:00
root
19bdfab227 Phase 2: DataFusion query engine over Parquet
- queryd: SessionContext with custom URL scheme to avoid path doubling with LocalFileSystem
- queryd: ListingTable registration from catalog ObjectRefs with schema inference
- queryd: POST /query/sql returns JSON {columns, rows, row_count}
- queryd→catalogd wiring: reads all datasets, registers as named tables
- gateway: wires QueryEngine with shared store + registry
- e2e verified: SELECT *, WHERE/ORDER BY, COUNT/AVG all correct

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:48:20 -05:00
root
655b6c0b37 Phase 1: storage + catalog layer
- storaged: object_store backend (LocalFileSystem), PUT/GET/DELETE/LIST endpoints
- shared: arrow_helpers with Parquet roundtrip + schema fingerprinting (2 tests)
- catalogd: in-memory registry with write-ahead manifest persistence to object storage
- catalogd: POST/GET /datasets, GET /datasets/by-name/{name}
- gateway: wires storaged + catalogd with shared object_store state
- Phase tracker updated: Phase 0 + Phase 1 gates passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:15:27 -05:00
root
a52ca841c6 Phase 0: bootstrap Rust workspace
- Cargo workspace with 6 crates: shared, storaged, catalogd, queryd, aibridge, gateway
- shared: types (DatasetId, ObjectRef, SchemaFingerprint, DatasetManifest) + error enum
- gateway: Axum HTTP entrypoint with nested service routers + tracing
- All services expose /health stubs
- justfile with build/test/run recipes
- PRD, phase tracker, and ADR docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 04:59:05 -05:00