THE REAL PROBLEM: Every new data source produces different doc_id prefixes in vector indexes (W-, W500K-, W5K-, CAND-). Hybrid search had to hardcode strip_prefix for each one. New datasets broke hybrid until someone added another prefix. This violates "any data source without pre-defined schemas."

THE FIX: IndexMeta.id_prefix. The catalog records what prefix each index uses. Hybrid search reads it and strips automatically. Legacy indexes fall back to heuristic stripping. New indexes can set id_prefix=None to use raw IDs (no prefix, no stripping needed).

This means: ingest a new dataset, embed it, hybrid search works immediately without code changes. The system is truly source-agnostic.

Also: full ADR document at docs/ADR-020-universal-id-mapping.md with the three options considered and rationale for the chosen approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ADR-020: Universal ID Mapping — the hybrid search identity problem
Status: Proposed (2026-04-17)
Triggered by: ID mismatch between vector doc_ids and SQL primary keys
Owner: J
Problem
The hybrid search endpoint (POST /vectors/hybrid) filters SQL results
by primary key, then matches those keys against vector doc_ids. This
requires the two ID spaces to be compatible. Currently they're not:
| Source | SQL primary key | Vector doc_id | Match? |
|---|---|---|---|
| ethereal_workers | worker_id = 4925 | W-4925 | Only with W- strip |
| workers_500k | worker_id = 41566 | W500K-41566 | Only with W500K- strip |
| candidates | candidate_id = CAND-055035 | CAND-055035 | Yes (same format) |
| workers_5k_proof | worker_id = 3200 | W5K-3200 | Only with W5K- strip |
Every time a new dataset is ingested and embedded, a new prefix
appears and the hybrid search breaks until someone hardcodes another
strip_prefix line.
This violates the PRD invariant "Any data source can be ingested without pre-defined schemas." The ID format IS a pre-defined schema that the hybrid search depends on.
Root cause
When the embedding pipeline creates a vector index, it generates doc_ids by concatenating a prefix + the source row's ID column. The prefix is chosen at index creation time and baked into the Parquet vector file. The SQL dataset knows nothing about this prefix.
The hybrid search then has to reverse-engineer the prefix to match vector results against SQL rows. This is fragile and breaks on every new data source.
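To make the fragility concrete, here is a minimal sketch of what a hardcoded strip chain might look like. The actual strip_prefix code is not shown in this ADR, so the constant `KNOWN_PREFIXES` and the function `strip_known_prefix` are hypothetical names used only for illustration.

```rust
/// Hypothetical list of prefixes the hybrid search would have to know about.
/// Every newly embedded dataset requires another entry here.
const KNOWN_PREFIXES: [&str; 3] = ["W500K-", "W5K-", "W-"];

/// Strip the first matching known prefix from a vector doc_id.
/// If no known prefix matches, the doc_id is returned unchanged.
fn strip_known_prefix(doc_id: &str) -> &str {
    for prefix in KNOWN_PREFIXES.iter() {
        if let Some(rest) = doc_id.strip_prefix(*prefix) {
            return rest;
        }
    }
    doc_id
}
```

A doc_id such as "X9-17" from a dataset nobody has registered passes through unchanged, so it can never match the SQL key 17. That is the failure mode described above.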
Solution: catalog-level ID mapping
Option A: Normalize doc_ids at embedding time (simplest)
When creating a vector index, don't prefix the doc_id. Use the raw
value from the source column. If the source has worker_id = 4925,
the vector doc_id is just "4925" — no W-, no W500K-.
Pros: Simplest. Hybrid search just compares strings directly.
Cons: Doc_ids across different indexes could collide (two datasets both have worker_id=1). Need to scope by index name.
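One way to handle the collision concern is to treat the pair (index name, raw ID) as the identity rather than the raw ID alone. The sketch below assumes that approach; `ScopedDocId`, `scoped`, and `distinct_ids` are hypothetical names, not part of the existing codebase.

```rust
use std::collections::HashSet;

/// Hypothetical composite key: a raw doc_id scoped by the index it lives in.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct ScopedDocId {
    index_name: String,
    doc_id: String,
}

/// Build a scoped ID from an index name and a raw source ID.
fn scoped(index_name: &str, raw_id: &str) -> ScopedDocId {
    ScopedDocId {
        index_name: index_name.to_string(),
        doc_id: raw_id.to_string(),
    }
}

/// Count how many distinct scoped IDs a slice contains.
fn distinct_ids(ids: &[ScopedDocId]) -> usize {
    ids.iter().collect::<HashSet<_>>().len()
}
```

With this key, worker_id=1 in two different indexes stays distinct, while the same raw ID within one index still deduplicates.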
Option B: Catalog stores the mapping (most robust)
The catalog maintains a mapping table:
index_name | doc_id_prefix | source_dataset | source_id_column
-----------+---------------+----------------+-----------------
resumes | CAND- | candidates | candidate_id
workers_v1 | W500K- | workers_500k | worker_id
Hybrid search reads this mapping and applies the prefix/strip automatically. New datasets register their mapping at index creation.
Pros: Handles all cases. Self-describing. No code changes for new data.
Cons: One more lookup per hybrid search (trivial perf cost).
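The mapping table could be modeled as an in-memory map keyed by index name. This is a sketch under that assumption only: the ADR does not specify how the catalog stores the mapping, and `IdMapping`, `Catalog`, `register`, and `sql_id_for` are hypothetical names. The field names mirror the columns of the table above.

```rust
use std::collections::HashMap;

/// One row of the hypothetical mapping table.
#[derive(Debug, Clone, PartialEq, Eq)]
struct IdMapping {
    doc_id_prefix: String,
    source_dataset: String,
    source_id_column: String,
}

/// Hypothetical catalog holding one mapping per vector index.
#[derive(Debug, Default)]
struct Catalog {
    mappings: HashMap<String, IdMapping>,
}

impl Catalog {
    /// Register a mapping when an index is created.
    fn register(&mut self, index_name: &str, prefix: &str, dataset: &str, id_column: &str) {
        self.mappings.insert(
            index_name.to_string(),
            IdMapping {
                doc_id_prefix: prefix.to_string(),
                source_dataset: dataset.to_string(),
                source_id_column: id_column.to_string(),
            },
        );
    }

    /// Translate a vector doc_id into the SQL key for the given index.
    /// Returns None if the index is unknown or the doc_id lacks the registered prefix.
    fn sql_id_for<'a>(&self, index_name: &str, doc_id: &'a str) -> Option<&'a str> {
        let mapping = self.mappings.get(index_name)?;
        doc_id.strip_prefix(mapping.doc_id_prefix.as_str())
    }
}
```

Registering `("workers_v1", "W500K-", "workers_500k", "worker_id")` once at index creation is then enough for `sql_id_for("workers_v1", "W500K-41566")` to yield the SQL key, with no per-dataset branch in the search code.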
Option C: Pass the mapping in the request (pragmatic)
The hybrid search request already accepts id_column. Extend it
with id_prefix so the caller says "vector doc_ids have prefix
W500K- and the SQL column is worker_id."
{
  "index_name": "workers_500k_v1",
  "id_column": "worker_id",
  "id_prefix": "W500K-",
  "sql_filter": "role = 'Forklift Operator' AND state = 'IL'",
  ...
}
Pros: Zero backend change. Caller already knows the context.
Cons: Caller has to know the prefix, so it is not self-service.
Recommendation
Option A for new indexes, Option B as the catalog registry.
- Change the embedding pipeline to use RAW IDs (no prefix) as the default. Existing indexes keep their prefixed IDs.
- Add an id_prefix column to IndexMeta so the catalog knows how to map between vector and SQL IDs.
- Hybrid search reads IndexMeta.id_prefix and applies it automatically.
- Remove the hardcoded strip_prefix chain.
This means: ingest a new dataset, embed it, hybrid search works immediately. No code changes. The system is truly "any data source."
Implementation
- IndexMeta gains id_prefix: Option<String> (default None = no prefix)
- Embedding pipeline: when id_prefix is None, use raw IDs from source
- Hybrid search: read IndexMeta.id_prefix, prepend to SQL IDs before matching
- Migration: existing indexes retain their prefix in IndexMeta
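
The steps above can be sketched roughly as follows. Only the id_prefix field comes from this ADR; the real IndexMeta has other fields that are not modeled here, and the method `to_doc_id` and function `filter_hits` are hypothetical names chosen for illustration.

```rust
use std::collections::HashSet;

/// Simplified stand-in for IndexMeta, modeling only the new field.
struct IndexMeta {
    /// None means the index stores raw source IDs (no prefix).
    id_prefix: Option<String>,
}

impl IndexMeta {
    /// Map a SQL primary key to the doc_id format this index uses.
    fn to_doc_id(&self, sql_id: &str) -> String {
        match &self.id_prefix {
            Some(prefix) => format!("{}{}", prefix, sql_id),
            None => sql_id.to_string(),
        }
    }
}

/// Keep only the vector hits whose doc_id corresponds to a SQL-filtered row.
/// The order of `vector_hits` (the vector ranking) is preserved.
fn filter_hits<'a>(meta: &IndexMeta, sql_ids: &[&str], vector_hits: &[&'a str]) -> Vec<&'a str> {
    let allowed: HashSet<String> = sql_ids.iter().map(|id| meta.to_doc_id(id)).collect();
    vector_hits
        .iter()
        .copied()
        .filter(|hit| allowed.contains(*hit))
        .collect()
}
```

Under this sketch a legacy index carries `Some("W500K-".to_string())`, while a new raw-ID index (or candidates, whose SQL key already reads CAND-055035) carries `None`, and the same `filter_hits` call handles both without a per-dataset branch.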