lakehouse/docs/specs/PATHWAY_MEMORY_SPEC.md

# Pathway Memory — Specification v1

**Status:** Draft v1 — 2026-05-03 · **Layer:** Decision metadata · **Implementation:** `crates/vectord/src/pathway_memory.rs`

> **What this is.** Pathway Memory is a third metadata layer that exists alongside (not on top of, not under) the data-metadata layer (catalogd's manifests/views/tombstones) and the operational-memory layer (playbook_memory). It tracks **decision patterns**: "this code path, in this task class, produced this outcome." It is the AI-substrate analog of what Iceberg manifests are to data tables — pointers + provenance for things you might want to retrieve again or learn from.
>
> **What it is NOT.** It is not a vector database. It is not a knowledge graph. It is not an audit log (audit is per-subject; pathways are per-code-pattern). It is not a cache. It is a small, opinionated metadata layer with a defined wire format, defined access patterns, and defined lifecycle rules.
>
> **Why a spec exists.** No standard tells us what this layer is. Each implementation that drifts from this spec adds to the wiring confusion currently visible in the system (`data/_kb/` JSONLs that are never consumed, version drift across `workers_500k_v1..v10`, unclear ownership of who reads `reducer_summary` vs `final_verdict`). The spec is the standardization contract that lets a single implementation become the canonical one.

---

## 1. Conceptual model

A **pathway** is an equivalence class of code paths that exhibit similar decision behavior. Two distinct files share a pathway when they have the same file prefix (typically the same crate, see §3.2) AND are reviewed under the same task class with the same signal class.

A **trace** is a single observed instance of a pathway: one review pass, with its inputs, intermediate signals, ladder attempts, audit outcome, and final verdict.

A pathway is **identified** by a stable hash of three keys: task class, file prefix, signal class. The hash is `pathway_id`.

A trace is **identified** within a pathway by a generated UUID: `trace_uid`. The pair `(pathway_id, trace_uid)` is globally unique.

Pathways are **append-only by trace**: every new observation creates a new trace under the existing pathway_id. Traces themselves are **mutable in version**: `revise()` produces a new version that supersedes the previous (Mem0-style chain).

Pathways are **retirable** by id: when replay history shows the pattern no longer holds, the pathway is marked retired and excluded from hot-swap. New pathways may emerge with fresh ids if the underlying code/task characteristics shift.

---

## 2. Wire format — `PathwayTrace`

Authoritative struct: `crates/vectord/src/pathway_memory.rs`. The fields below are the SPEC; the Rust struct is one implementation. All fields use JSON-serializable types when persisted.

### 2.1 — Identity fields (immutable per trace)

| Field | Type | Constraint |
|---|---|---|
| `pathway_id` | string | SHA-256 hex of `task_class \| file_prefix \| signal_class`; see §3 |
| `trace_uid` | string | UUID. Empty on legacy traces; from v2+, implementations SHOULD populate it on insert. |
| `version` | u32 | Mem0 version chain. Default 1 on insert. Bumped on `revise()`. |
| `parent_trace_uid` | string \| null | trace_uid of the trace this one supersedes. null on root version. |
| `superseded_at` | string \| null | ISO-8601 UTC. Set when this trace becomes a non-head version. |
| `superseded_by_trace_uid` | string \| null | trace_uid of the new head when this version is superseded. |

### 2.2 — Identity-source fields (the inputs that fed pathway_id)

| Field | Type | Notes |
|---|---|---|
| `task_class` | string | Caller-defined class string (e.g. `scrum_review`, `staffing.fill`, `pr_audit`) |
| `file_path` | string | Full path of the file under review. Reduced to file_prefix for hashing — see §3.2. |
| `signal_class` | string \| null | Caller-defined behavior label (e.g. `CONVERGING`, `LOOPING`, `STUCK_RETRY`) |

### 2.3 — Observation fields (what happened)

| Field | Type | Description |
|---|---|---|
| `created_at` | RFC 3339 timestamp | When this trace was inserted. |
| `ladder_attempts` | array of `LadderAttempt` | Each model+rung attempt: `{rung, model, latency_ms, accepted, reject_reason?}`. Order is dispatch order. |
| `kb_chunks` | array of `KbChunkRef` | Knowledge-base chunks the reviewer was given as context: `{source_doc, chunk_id, cosine_score, rank}`. |
| `observer_signals` | array of `ObserverSignal` | Behavior labels emitted by observer during this pathway's run: `{class, priors, prior_iter_outcomes}`. |
| `bridge_hits` | array of `BridgeHit` | External-doc lookups (context7-style): `{library, version, result_summary}`. |
| `sub_pipeline_calls` | array of `SubPipelineCall` | Calls to other sub-pipelines (extract, validate, etc.) made during this trace. |
| `audit_consensus` | `AuditConsensus` \| null | Cross-lineage audit outcome if any: `{pass, models, disagreements}`. |
| `reducer_summary` | string | Free-text summary of what the reducer concluded. **PII risk surface — see §6.3.** |
| `final_verdict` | string | Free-text verdict label (`accepted`, `rejected`, `needs_review`, etc.). **PII risk surface — see §6.3.** |

### 2.4 — Index fields (the matrix retrieval shape)

| Field | Type | Description |
|---|---|---|
| `pathway_vec` | array of f32, length 32 | Bag-of-tokens hash embedding. Algorithm in §4. Fixed dimension per spec version. |
| `replay_count` | u32 | Number of times this pathway has been replayed via hot-swap. Initial insert is NOT a replay. |
| `replays_succeeded` | u32 | Replays where the post-replay outcome matched the trace's `final_verdict`. |
| `retired` | bool | Marked true when probation gate fires (§5.4). Excluded from hot-swap forever. |

### 2.5 — Optional semantic-correctness fields (ADR-021)

| Field | Type | Description |
|---|---|---|
| `semantic_flags` | array of `SemanticFlag` | Each element is one of: `UnitMismatch`, `TypeConfusion`, `NullableConfusion`, `OffByOne`, `StaleReference`, `PseudoImpl`, `DeadCode`, `WarningNoise`, `BoundaryViolation`. Caller-emitted, not auto-derived. |
| `type_hints_used` | array of `TypeHint` | Schema/type context fed to the reviewer for this trace: `{source, symbol, type_repr}`. |
| `bug_fingerprints` | array of `BugFingerprint` | Structural pattern hashes the reviewer caught: `{flag, pattern_key, example, occurrences}`. |

All §2.5 fields are additive — implementations MUST default them to empty when deserializing legacy traces.
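
The defaulting rule can be sketched as follows. This is a minimal illustration, assuming traces are persisted as JSON objects; the helper name `load_trace` and the choice to treat an explicit `null` like a missing key are assumptions, not part of the spec.

```python
import json

# Field names come from §2.5; the helper itself is hypothetical.
ADDITIVE_LIST_FIELDS = ("semantic_flags", "type_hints_used", "bug_fingerprints")

def load_trace(raw_json: str) -> dict:
    """Parse one persisted trace and default missing §2.5 fields to empty lists."""
    trace = json.loads(raw_json)
    for field in ADDITIVE_LIST_FIELDS:
        # Assumption: an absent key and an explicit null both mean "empty".
        if trace.get(field) is None:
            trace[field] = []
    return trace

# A legacy trace written before the §2.5 fields existed.
legacy = '{"task_class": "scrum_review", "file_path": "crates/queryd/src/service.rs"}'
loaded = load_trace(legacy)
```

With this shape, readers can iterate over `loaded["semantic_flags"]` without first checking whether the trace predates ADR-021.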

---

## 3. Identity algorithms

### 3.1 — `pathway_id`

```
pathway_id = sha256_hex(task_class || "|" || file_prefix(file_path) || "|" || signal_class_or_empty)
```

- Concatenation uses literal byte `|` (0x7C) as separator.
- `signal_class_or_empty` is the empty string when signal_class is null.
- SHA-256 output is 64 hex chars, lowercase.
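
A minimal Python sketch of the formula above, assuming the already-reduced `file_prefix` is passed in and that strings are encoded as UTF-8 before hashing (the spec fixes the separator byte but does not name an encoding, so the encoding is an assumption).

```python
import hashlib
from typing import Optional

def pathway_id(task_class: str, file_prefix: str, signal_class: Optional[str]) -> str:
    """Lowercase SHA-256 hex of task_class | file_prefix | signal_class."""
    # A null signal_class hashes as the empty string.
    signal = signal_class if signal_class is not None else ""
    # Assumption: UTF-8 encoding of the joined string.
    material = "|".join([task_class, file_prefix, signal]).encode("utf-8")
    return hashlib.sha256(material).hexdigest()

a = pathway_id("scrum_review", "crates/queryd", None)
b = pathway_id("scrum_review", "crates/queryd", "")
c = pathway_id("scrum_review", "crates/queryd", "LOOPING")
```

Here `a` and `b` are identical because null and the empty string hash the same way, while `c` differs. The sketch does not escape a literal `|` inside the inputs; the spec does not say how such inputs should be handled.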

### 3.2 — `file_prefix`

```
file_prefix(path) = first two path segments joined by '/'
```

- Path separator is forward slash. Implementations on Windows MUST normalize to forward slashes before hashing.
- Paths with fewer than two segments use the full path verbatim.
- Examples:
  - `crates/queryd/src/service.rs` → `crates/queryd`
  - `crates/gateway` → `crates/gateway`
  - `README.md` → `README.md`
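
The rule and its examples can be sketched in Python as below. The handling of leading slashes or empty segments is not defined by the spec, so this sketch deliberately does nothing special for them.

```python
def file_prefix(path: str) -> str:
    """First two forward-slash path segments, or the whole path if shorter."""
    # Normalize Windows separators before splitting.
    normalized = path.replace("\\", "/")
    segments = normalized.split("/")
    if len(segments) < 2:
        return normalized
    return "/".join(segments[:2])

# The three examples listed above, as input -> expected prefix.
examples = {
    "crates/queryd/src/service.rs": "crates/queryd",
    "crates/gateway": "crates/gateway",
    "README.md": "README.md",
}
results = {path: file_prefix(path) for path in examples}
```

A Windows-style path such as `crates\queryd\src\service.rs` reduces to the same `crates/queryd` prefix after normalization.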

### 3.3 — Why this fingerprint shape

The fingerprint groups files within a crate (same `file_prefix`) under the same pathway when reviewed for the same task class with the same signal class. This is intentional: it lets the matrix index recognize "the same kind of bug appears in this crate" without needing per-file traces. Per-file granularity is preserved in the `file_path` field for retrieval but is not part of the identity.

---

## 4. `pathway_vec` algorithm

Fixed dimension: **32 buckets**. Implementation: deterministic bag-of-tokens hash, normalized.

```
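import hashlib
import math

def build_pathway_vec(trace):
    buckets = [0.0] * 32
    tokens = collect_metadata_tokens(trace)  # see token list below
    for tok in tokens:
        h = hashlib.sha256(tok.encode("utf-8")).digest()
        # The first 4 digest bytes, read big-endian, select the bucket.
        bucket = int.from_bytes(h[:4], "big") % 32
        buckets[bucket] += 1.0
    # L2-normalize; an all-zero vector stays all-zero.
    norm = math.sqrt(sum(b * b for b in buckets))
    return [b / norm if norm > 0 else 0.0 for b in buckets]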
```

**Tokens MUST include**:
- `task_class:<task_class>`
- `file_prefix:<file_prefix>`
- `signal_class:<signal_class_or_empty>`
- For each `ladder_attempts[i].model`: `model:<model>`
- For each `kb_chunks[i].source_doc`: `kb_doc:<source_doc>`
- For each `observer_signals[i].class`: `signal:<class>`
- For each `bug_fingerprints[i].flag`: `flag:<flag>`

**Tokens MUST NOT include**:
- `reducer_summary` text (free-form, would dominate the bucket distribution)
- `final_verdict` text
- `created_at` (would make every trace's vector unique)
- candidate_id, name, email, phone, or any subject identifier (PII risk)
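
One possible shape for the token collector is sketched below. It assumes a trace is available as a plain dict keyed by the §2 field names; that dict shape, and the sample model name `model-a`, are illustrative assumptions rather than anything the spec mandates.

```python
def collect_metadata_tokens(trace: dict) -> list:
    """Build the §4 token list from a trace dict (dict shape is an assumption)."""
    # Inline file_prefix per §3.2 so the sketch stays self-contained.
    segments = trace["file_path"].replace("\\", "/").split("/")
    prefix = "/".join(segments[:2])
    tokens = [
        f"task_class:{trace['task_class']}",
        f"file_prefix:{prefix}",
        f"signal_class:{trace.get('signal_class') or ''}",
    ]
    tokens += [f"model:{a['model']}" for a in trace.get("ladder_attempts", [])]
    tokens += [f"kb_doc:{c['source_doc']}" for c in trace.get("kb_chunks", [])]
    tokens += [f"signal:{s['class']}" for s in trace.get("observer_signals", [])]
    tokens += [f"flag:{b['flag']}" for b in trace.get("bug_fingerprints", [])]
    # reducer_summary, final_verdict, created_at and subject identifiers
    # are deliberately never read here.
    return tokens

sample = {
    "task_class": "scrum_review",
    "file_path": "crates/queryd/src/service.rs",
    "signal_class": None,
    "ladder_attempts": [{"rung": 1, "model": "model-a", "latency_ms": 120, "accepted": True}],
    "observer_signals": [{"class": "CONVERGING"}],
    "reducer_summary": "free text that must not be tokenized",
}
tokens = collect_metadata_tokens(sample)
```

Because the collector only reads an allow-list of fields, adding new free-text fields to a trace cannot leak into `pathway_vec` by accident.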

**Why 32 dimensions:** the consensus-validated round-3 ensemble selected "small metadata tokens, not full JSON." 32 buckets are sufficient to distinguish pathway combinations without requiring an external embedding model. An external embedding model would also work, but it adds a dependency, a failure mode, and a drift risk that the consensus flagged. **Spec version v1 fixes dim=32. v2 may revisit.**

---

## 5. Lifecycle

### 5.1 — Insert

A trace is inserted by the consumer (typically the scrum pipeline or auditor) calling the implementation's `insert()` API.

- If no pathway with `pathway_id` exists, a new pathway bucket is created.
- If a pathway exists, the new trace is appended to that bucket's traces list.
- `trace_uid` is generated (UUID v7 RECOMMENDED for time-orderability).
- `version` is set to 1.
- `created_at` is set to the insertion timestamp.
- `pathway_vec` is computed and stored.
- `replay_count` and `replays_succeeded` start at 0.
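
The insert steps can be sketched against an in-memory store shaped like the §7 JSON. This is a hedged illustration, not the Rust implementation: `uuid.uuid4()` stands in for the recommended UUID v7, `retired` is assumed to start as `false`, and the `pathway_vec` computation from §4 is left out for brevity.

```python
import hashlib
import uuid
from datetime import datetime, timezone

def insert(store: dict, trace: dict) -> dict:
    """Append a new version-1 trace to its pathway bucket (in-memory sketch)."""
    if not trace.get("task_class"):
        raise ValueError("task_class must be non-empty")  # rejection rule from §6.1
    segments = trace["file_path"].replace("\\", "/").split("/")
    key = "|".join([trace["task_class"], "/".join(segments[:2]),
                    trace.get("signal_class") or ""])
    pid = hashlib.sha256(key.encode("utf-8")).hexdigest()
    record = dict(trace)
    record.update(
        pathway_id=pid,
        trace_uid=str(uuid.uuid4()),  # assumption: v4 stand-in for the recommended v7
        version=1,
        parent_trace_uid=None,
        superseded_at=None,
        superseded_by_trace_uid=None,
        created_at=datetime.now(timezone.utc).isoformat(),
        replay_count=0,
        replays_succeeded=0,
        retired=False,
    )
    # pathway_vec computation (§4) is omitted here for brevity.
    store.setdefault("pathways", {}).setdefault(pid, []).append(record)
    return record

store = {}
first = insert(store, {"task_class": "scrum_review",
                       "file_path": "crates/queryd/src/service.rs"})
second = insert(store, {"task_class": "scrum_review",
                        "file_path": "crates/queryd/src/lib.rs"})
```

Both sample files live under `crates/queryd`, so the two traces land in the same bucket with distinct `trace_uid` values.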

### 5.2 — Revise

A trace is revised by calling `revise(trace_uid, new_fields)`.

- A new trace with a fresh `trace_uid` and `version = previous.version + 1` is inserted.
- The new trace's `parent_trace_uid` is set to the previous trace's `trace_uid`.
- The previous trace's `superseded_at` is set to the revision timestamp and `superseded_by_trace_uid` is set to the new trace's `trace_uid`.
- The previous trace remains in the bucket (history is preserved). Retrieval defaults to head versions only — see §7.
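
A minimal sketch of the version chain, operating on one pathway bucket held as a list of dicts. Several points are assumptions because the spec is silent on them: the new version copies all fields of the previous one (including replay counters) before applying `new_fields`, revising a non-head trace is not rejected, and `uuid.uuid4()` again stands in for UUID v7.

```python
import uuid
from datetime import datetime, timezone

def revise(bucket: list, trace_uid: str, new_fields: dict) -> dict:
    """Append a new head version that supersedes the trace named by trace_uid."""
    # Raises StopIteration if the trace_uid is unknown.
    previous = next(t for t in bucket if t["trace_uid"] == trace_uid)
    now = datetime.now(timezone.utc).isoformat()
    head = {**previous, **new_fields}  # assumption: unchanged fields carry over
    head.update(
        trace_uid=str(uuid.uuid4()),  # assumption: v4 stand-in for UUID v7
        version=previous["version"] + 1,
        parent_trace_uid=previous["trace_uid"],
        superseded_at=None,
        superseded_by_trace_uid=None,
        created_at=now,
    )
    # The previous version stays in the bucket, marked as superseded.
    previous["superseded_at"] = now
    previous["superseded_by_trace_uid"] = head["trace_uid"]
    bucket.append(head)
    return head

bucket = [{"trace_uid": "t1", "version": 1, "parent_trace_uid": None,
           "superseded_at": None, "superseded_by_trace_uid": None,
           "final_verdict": "needs_review"}]
head = revise(bucket, "t1", {"final_verdict": "accepted"})
```

After the call the bucket holds two entries: the original, now pointing forward to the new head, and the head, pointing back through `parent_trace_uid`.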

### 5.3 — Replay

A pathway is **eligible for replay** when `retired == false` AND the bucket has at least one head-version trace.

The hot-swap consumer queries by `pathway_id` (or by similarity over `pathway_vec`), retrieves the head trace, applies its `final_verdict` shape to a new but similar input, and then reports back via `record_replay(trace_uid, succeeded: bool)`.

- `replay_count` is incremented on every replay (success or failure).
- `replays_succeeded` is incremented only on success.

### 5.4 — Probation gate (retirement)

Probation triggers when `replay_count >= 3` AND `success_rate < 0.80`.

When triggered, the pathway is marked `retired = true`. It is excluded from hot-swap forever. If the underlying code/task characteristics genuinely change such that a retired pathway would work again, a NEW pathway with a FRESH id will be created (because the inputs to `pathway_id` would have changed, e.g., new signal_class). Retirement is per-id and irreversible.

`success_rate` = `replays_succeeded / replay_count` (0.0 if `replay_count == 0`).
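
The counters from §5.3 and the gate above can be sketched together. This illustration assumes the gate is evaluated each time a replay is recorded and that `retired` is stored on the trace dict; the spec states the trigger condition but not when it is checked.

```python
def success_rate(trace: dict) -> float:
    """replays_succeeded / replay_count, or 0.0 before any replay."""
    if trace["replay_count"] == 0:
        return 0.0
    return trace["replays_succeeded"] / trace["replay_count"]

def record_replay(trace: dict, succeeded: bool) -> None:
    """Update counters, then apply the probation gate from §5.4."""
    trace["replay_count"] += 1
    if succeeded:
        trace["replays_succeeded"] += 1
    # Assumption: the gate is checked on every recorded replay.
    if trace["replay_count"] >= 3 and success_rate(trace) < 0.80:
        trace["retired"] = True  # irreversible: never reset to False

trace = {"replay_count": 0, "replays_succeeded": 0, "retired": False}
for outcome in (True, True, False):
    record_replay(trace, outcome)

healthy = {"replay_count": 0, "replays_succeeded": 0, "retired": False}
for outcome in (True, True, True):
    record_replay(healthy, outcome)
```

With these thresholds, two successes out of three replays give a rate of about 0.667, which is below 0.80, so a single failure within the first three replays retires the pathway. Three straight successes leave `healthy` active.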

---

## 6. Access patterns (read + write shapes)

### 6.1 — Write paths (who inserts traces)

Spec-conformant writers:
- Scrum pipeline (`tests/real-world/scrum_master_pipeline.ts` and downstream consumers) on review acceptance
- Auditor pipeline (`auditor/`) on verdict assembly
- Validator (`crates/validator/`) on iterate session completion

Each writer is responsible for setting the identity fields correctly. The implementation MUST reject writes where `task_class` is empty.

### 6.2 — Read paths (who consumes traces)

Spec-conformant readers:
- **Hot-swap retrieval** — `query_for_hotswap(task_class, file_path, signal_class, k=5)` returns the top-k head-version traces from matching pathways, ordered by `(success_rate desc, replay_count desc)`.
- **Similarity retrieval** — `query_by_vec(pathway_vec, k=10)` returns the top-k head-version traces by cosine similarity over `pathway_vec`.
- **Audit / forensic** — `list_versions(trace_uid_root)` returns the full version chain. **Must be authorized** — version chains can include reducer_summary text that may contain PII.

The implementation MUST NOT expose `list_versions` on an unauthenticated route.
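
The two retrieval orderings can be sketched over a flat list of trace dicts. The function name `rank_for_hotswap` is hypothetical and covers only the ordering step of `query_for_hotswap`; pathway matching is omitted. The sample vectors are shortened to two dimensions for readability, whereas the spec fixes 32. Whether `query_by_vec` should also skip retired pathways is not stated in the spec, so this sketch does not filter them there.

```python
import math

def success_rate(t: dict) -> float:
    return t["replays_succeeded"] / t["replay_count"] if t["replay_count"] else 0.0

def rank_for_hotswap(traces: list, k: int = 5) -> list:
    """Head, non-retired traces ordered by (success_rate desc, replay_count desc)."""
    eligible = [t for t in traces
                if t.get("superseded_at") is None and not t.get("retired", False)]
    eligible.sort(key=lambda t: (-success_rate(t), -t["replay_count"]))
    return eligible[:k]

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def query_by_vec(traces: list, query_vec: list, k: int = 10) -> list:
    """Top-k head traces by cosine similarity over pathway_vec."""
    heads = [t for t in traces if t.get("superseded_at") is None]
    heads.sort(key=lambda t: cosine(t["pathway_vec"], query_vec), reverse=True)
    return heads[:k]

traces = [
    {"trace_uid": "a", "replay_count": 5, "replays_succeeded": 4, "pathway_vec": [1.0, 0.0]},
    {"trace_uid": "b", "replay_count": 10, "replays_succeeded": 8, "pathway_vec": [0.0, 1.0]},
    {"trace_uid": "c", "replay_count": 2, "replays_succeeded": 2, "pathway_vec": [0.6, 0.8]},
    {"trace_uid": "d", "replay_count": 3, "replays_succeeded": 1, "retired": True,
     "pathway_vec": [0.0, 1.0]},
    {"trace_uid": "e", "replay_count": 1, "replays_succeeded": 1,
     "superseded_at": "2026-05-03T00:00:00Z", "pathway_vec": [1.0, 0.0]},
]
hotswap_order = [t["trace_uid"] for t in rank_for_hotswap(traces)]
nearest = [t["trace_uid"] for t in query_by_vec(traces, [1.0, 0.0], k=2)]
```

In the sample, `c` ranks first on success rate, and `b` beats `a` on the replay-count tiebreak because both sit at 0.8. The retired trace `d` and the superseded trace `e` never appear in the hot-swap result.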

### 6.3 — PII surface (load-bearing for audit-trail compliance)

The fields `reducer_summary` and `final_verdict` are free-form strings written from review output. They are **highly likely** to contain PII when the reviewed task class involved real candidates (per `AUDIT_PHASE_1_DISCOVERY.md` §1F + §10/C1). Until subject-redaction-on-write lands, implementations:
- MUST treat trace bodies as a suspected PII sink for any task class touching real subjects
- SHOULD redact known PII patterns (candidate_id, email, phone) before persisting
- SHOULD provide a per-trace `subject_ids[]` top-level field when known, so audit-response queries can filter by subject

Until the redaction layer ships, do not advertise pathway_memory as PII-clean.
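
As a rough illustration of redact-before-persist, the sketch below masks email-like and phone-like substrings in a free-text field. Both regular expressions are hypothetical and intentionally simple; they will produce false positives and false negatives, and no `candidate_id` pattern is shown because the source does not specify its format.

```python
import re

# Hypothetical patterns for illustration; real deployments need vetted rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace email- and phone-like substrings with fixed placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return PHONE_RE.sub("[REDACTED_PHONE]", text)

summary = "Reached candidate at jane.doe@example.com or +1 (555) 010-9999."
clean = redact(summary)
```

A writer would apply something like `redact()` to `reducer_summary` and `final_verdict` before calling `insert()`. Pattern-based masking is a stopgap and does not make the store PII-clean.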

---

## 7. Storage representation

The current implementation persists state to a single JSON file at `data/_pathway_memory/state.json`:

```
{
  "pathways": {
    "<pathway_id>": [
      { "trace_uid": "...", "version": 1, "task_class": "...", ... },
      { "trace_uid": "...", "version": 2, "parent_trace_uid": "...", ... }
    ]
  },
  "last_updated_at": "2026-04-29T05:43:00Z"
}
```

The flat-JSON representation is implementation-specific. A spec-conformant implementation could use SQLite, an append-only JSONL, or a Lance dataset. The spec dictates the wire shape per-trace and the access patterns; the storage backend is the implementation's choice.

**Retrieval defaults**: queries return head versions only by default (a head version is one with `superseded_at == null`). Consumers explicitly opt in to historical versions via an `include_history: true` flag.
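
The head-only default can be sketched as a small filter over one bucket. The function name `select_versions` is hypothetical; only the `include_history` flag and the `superseded_at == null` test come from the text above.

```python
def select_versions(bucket: list, include_history: bool = False) -> list:
    """Return head versions only unless the caller opts in to history."""
    if include_history:
        return list(bucket)
    return [t for t in bucket if t.get("superseded_at") is None]

bucket = [
    {"trace_uid": "t1", "version": 1, "superseded_at": "2026-04-29T05:43:00Z"},
    {"trace_uid": "t2", "version": 2, "superseded_at": None},
]
heads_only = [t["trace_uid"] for t in select_versions(bucket)]
everything = [t["trace_uid"] for t in select_versions(bucket, include_history=True)]
```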

---

## 8. Spec boundary — what's stable vs implementation-specific

### Stable (locked in v1; changing requires v2 spec)
- The `pathway_id` algorithm (SHA256 of `task_class | file_prefix | signal_class`)
- The `file_prefix` algorithm (first two path segments)
- `pathway_vec` dimension (32) and token list
- Field names and types in §2 (additive fields are OK; renames or removals require v2)
- Lifecycle rules in §5
- Access pattern names in §6 (`query_for_hotswap`, `query_by_vec`, `list_versions`)

### Implementation-specific (free to change without spec bump)
- Storage backend (JSON file, SQLite, Lance, etc.)
- HTTP/gRPC/library API surface
- Concurrency model
- In-memory caching strategy
- The exact text of `reducer_summary` (it's content, not schema)
- Token-bucket hash function (SHA-256 is specified in §4, but any 32-bit hash is acceptable as long as it is deterministic and well-distributed)

### Reserved for v2 (named so future implementations don't paint into a corner)
- Cross-pathway similarity (`query_pathways_similar_to_pathway`)
- External embedding model for `pathway_vec` (replaces the bag-of-tokens hash)
- Subject-aware indexing (so audit-response can find all pathways involving subject X)
- Differential privacy on `pathway_vec` aggregations

---

## 9. Open questions

These need decisions before v1 is final-final:

1. **Dimension lock-in**: 32 is round-3 consensus, but is it actually optimal? Worth a benchmark before final v1.
2. **Subject indexing in v1 vs v2**: deferring to v2 means audit queries today have to grep `reducer_summary` text. Is that acceptable for the staffing-client launch window?
3. **Cross-runtime parity**: the Go side has its own `internal/pathways/` implementation. As of 2026-05-03 the wire shape matches but no parity probe exists. Add `pathway_parity.sh` to the cross-runtime probe suite.
4. **Migration from current state.json**: existing 91 traces lack `trace_uid`. Spec says `trace_uid` is empty on legacy traces — should v1 require backfill or accept legacy-trace queries indefinitely?

---

## 10. Why this spec exists

Without it:
- Every new consumer of pathway_memory invents its own read/write conventions
- The Go-side reimplementation drifts from the Rust-side wire format
- Future LLM-assistant work proposes new fields without considering the existing schema
- Forensic queries grep free-text fields because no canonical retrieval exists
- The PII concern stays "TODO" because there's no stable surface to gate

With it:
- One implementation can be canonical (Rust, currently); others can be conformance-tested
- The wire shape is documented; serialization tests can pin it
- New fields go through an additive-only review (don't break legacy)
- The KbContext consumer in the gateway has a stable API to read against
- The audit-trail PRD has a defined PII surface to protect

This is the standardization the system was missing.

---

## Change log

- 2026-05-03 — v1 initial draft. Codifies what's already in `crates/vectord/src/pathway_memory.rs`. The spec is descriptive of the existing implementation, not prescriptive of new behavior.