matrix-agent-validated/docs/ADR-020-universal-id-mapping.md
profit ac01fffd9a checkpoint: matrix-agent-validated (2026-04-25)
Architectural snapshot of the lakehouse codebase at the point where the
full matrix-driven agent loop with Mem0 versioning + deletion was
validated end-to-end.

WHAT THIS REPO IS
A clean single-commit snapshot of the lakehouse code. Heavy test data
(.parquet datasets, vector indexes) excluded — see REPLICATION.md for
regen path. Full lakehouse history at git.agentview.dev/profit/lakehouse.

WHAT WAS PROVEN
- Vector retrieval across the multi-corpus matrix (chicago_permits + entity
  briefs + sec_tickers + distilled procedural + llm_team runs)
- Observer hand-review (cloud + heuristic fallback) gating each candidate
- Local-model agent loop (qwen3.5:latest) with tool use + scratchpad
- Playbook seal on success → next-iter retrieval surfaces it as preamble
- Mem0 versioning + deletion in pathway_memory:
    * UPSERT: ADD on new workflow, UPDATE bumps replay_count on identical
    * REVISE: chains versions, parent.superseded_at + superseded_by stamped
    * RETIRE: marks specific trace retired with reason, excluded from retrieval
    * HISTORY: walks chain root→tip, cycle-safe

KEY DIRECTORIES
- crates/vectord/src/pathway_memory.rs — Mem0 ops live here
- crates/vectord/src/playbook_memory.rs — original Mem0 reference
- tests/agent_test/ — local-model agent harness + PRD + session archives
- scripts/dump_raw_corpus.sh — MinIO bucket dump (raw test corpus)
- scripts/vectorize_raw_corpus.ts — corpus → vector indexes
- scripts/analyze_chicago_contracts.ts — real inference pipeline
- scripts/seal_agent_playbook.ts — Mem0 upsert from agent traces

Replication: see REPLICATION.md for Debian 13 clean install + cloud-only
adaptation (no local Ollama).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 19:43:27 -05:00


# ADR-020: Universal ID Mapping — the hybrid search identity problem
**Status:** Proposed — 2026-04-17

**Triggered by:** ID mismatch between vector doc_ids and SQL primary keys

**Owner:** J

---
## Problem
The hybrid search endpoint (`POST /vectors/hybrid`) filters SQL results
by primary key, then matches those keys against vector doc_ids. This
requires the two ID spaces to be compatible. Currently they're not:

| Source | SQL primary key | Vector doc_id | Match? |
|---|---|---|---|
| ethereal_workers | `worker_id = 4925` | `W-4925` | Only with W- strip |
| workers_500k | `worker_id = 41566` | `W500K-41566` | Only with W500K- strip |
| candidates | `candidate_id = CAND-055035` | `CAND-055035` | Yes (same format) |
| workers_5k_proof | `worker_id = 3200` | `W5K-3200` | Only with W5K- strip |

Every time a new dataset is ingested and embedded, a new prefix
appears and the hybrid search breaks until someone hardcodes another
`strip_prefix` line.
This violates the PRD invariant: **"Any data source can be ingested
without pre-defined schemas."** The ID format IS a pre-defined schema
that the hybrid search depends on.
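The hardcoded chain that accretes with each new dataset looks roughly like this (a sketch; `strip_known_prefix` and the prefix list are illustrative, not the actual handler code):

```rust
// Hypothetical sketch of the brittle matching this ADR describes: every new
// dataset prefix forces another hardcoded branch here.
fn strip_known_prefix(doc_id: &str) -> &str {
    doc_id
        .strip_prefix("W-")
        .or_else(|| doc_id.strip_prefix("W500K-"))
        .or_else(|| doc_id.strip_prefix("W5K-"))
        // CAND- ids already match the SQL key verbatim, so no strip is
        // needed -- until the next dataset ships yet another prefix.
        .unwrap_or(doc_id)
}
```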
## Root cause
When the embedding pipeline creates a vector index, it generates
doc_ids by concatenating a prefix + the source row's ID column. The
prefix is chosen at index creation time and baked into the Parquet
vector file. The SQL dataset knows nothing about this prefix.
The hybrid search then has to reverse-engineer the prefix to match
vector results against SQL rows. This is fragile and breaks on every
new data source.
## Solution: catalog-level ID mapping
### Option A: Normalize doc_ids at embedding time (simplest)
When creating a vector index, don't prefix the doc_id. Use the raw
value from the source column. If the source has `worker_id = 4925`,
the vector doc_id is just `"4925"` — no `W-`, no `W500K-`.
**Pros:** Simplest. Hybrid search just compares strings directly.
**Cons:** Doc_ids across different indexes could collide (two datasets
both have worker_id=1). Need to scope by index name.
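The scoping Option A needs can be as simple as matching on an (index_name, doc_id) pair instead of the bare doc_id. A minimal sketch, with hypothetical types:

```rust
use std::collections::HashSet;

// Raw doc_ids are only unique within one index, so candidate hits are
// keyed on (index_name, doc_id) before being matched against SQL keys.
fn filter_hits<'a>(
    index_name: &str,
    vector_hits: &'a [(String, String)], // (index_name, raw doc_id)
    sql_keys: &HashSet<String>,          // primary keys from the SQL filter
) -> Vec<&'a str> {
    vector_hits
        .iter()
        .filter(|(idx, id)| idx.as_str() == index_name && sql_keys.contains(id))
        .map(|(_, id)| id.as_str())
        .collect()
}
```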
### Option B: Catalog stores the mapping (most robust)
The catalog maintains a mapping table:
```
index_name | doc_id_prefix | source_dataset | source_id_column
-----------+---------------+----------------+-----------------
resumes | CAND- | candidates | candidate_id
workers_v1 | W500K- | workers_500k | worker_id
```
Hybrid search reads this mapping and applies the prefix/strip
automatically. New datasets register their mapping at index creation.
**Pros:** Handles all cases. Self-describing. No code changes for new data.
**Cons:** One more lookup per hybrid search (trivial perf cost).
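In code, the registered mapping reduces hybrid matching to one generic strip. A sketch assuming a hypothetical `IdMapping` row shaped like the table above (field names mirror the columns, not the real schema):

```rust
// One row of the catalog's hypothetical mapping table.
#[allow(dead_code)]
struct IdMapping {
    index_name: String,
    doc_id_prefix: String,
    source_dataset: String,
    source_id_column: String,
}

// Resolve a vector doc_id back to its SQL primary key using the registered
// mapping -- no hardcoded strip_prefix chain.
fn to_sql_key<'a>(mapping: &IdMapping, doc_id: &'a str) -> &'a str {
    doc_id
        .strip_prefix(mapping.doc_id_prefix.as_str())
        .unwrap_or(doc_id)
}
```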
### Option C: Pass the mapping in the request (pragmatic)
The hybrid search request already accepts `id_column`. Extend it
with `id_prefix` so the caller says "vector doc_ids have prefix
W500K- and the SQL column is worker_id."
```json
{
"index_name": "workers_500k_v1",
"id_column": "worker_id",
"id_prefix": "W500K-",
"sql_filter": "role = 'Forklift Operator' AND state = 'IL'",
...
}
```
**Pros:** Zero backend change. Caller already knows the context.
**Cons:** Caller has to know the prefix — not self-service.
## Recommendation
**Option A for new indexes, Option B for registry.**
1. Change the embedding pipeline to use RAW IDs (no prefix) as the
default. Existing indexes keep their prefixed IDs.
2. Add an `id_prefix` column to IndexMeta so the catalog knows how
to map between vector and SQL IDs.
3. Hybrid search reads IndexMeta.id_prefix and applies it automatically.
4. Remove the hardcoded strip_prefix chain.
This means: ingest a new dataset, embed it, hybrid search works
immediately. No code changes. The system is truly "any data source."
## Implementation
1. `IndexMeta` gains `id_prefix: Option<String>` (default None = no prefix)
2. Embedding pipeline: when `id_prefix` is None, use raw IDs from source
3. Hybrid search: read `IndexMeta.id_prefix`, prepend to SQL IDs before matching
4. Migration: existing indexes retain their prefix in IndexMeta
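Steps 1 and 3 together can be sketched as follows (a simplified `IndexMeta` for illustration; the real struct lives in the catalog and carries more fields):

```rust
// Simplified stand-in for the catalog's IndexMeta.
#[allow(dead_code)]
struct IndexMeta {
    name: String,
    id_prefix: Option<String>, // None = raw IDs (the new default)
}

// Map a SQL primary key to the doc_id stored in this index's vectors:
// prefixed for legacy indexes, identical for newly embedded ones.
fn sql_key_to_doc_id(meta: &IndexMeta, sql_key: &str) -> String {
    match &meta.id_prefix {
        Some(prefix) => format!("{prefix}{sql_key}"),
        None => sql_key.to_string(),
    }
}
```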