lakehouse/docs/specs/PATHWAY_MEMORY_SPEC.md
root ed1fcd3c26 specs: pathway_memory v1 + subject_manifests_on_catalogd v1
Two specifications addressing the framing J asked for after reading
the llms3.com blog: standardize what we have so future work doesn't
drift, and apply the local-first thesis to the audit problem instead
of the over-scoped SaaS-tier identity service.

PATHWAY_MEMORY_SPEC.md (~400 lines):
  Documents the existing crates/vectord/src/pathway_memory.rs as a
  spec — the third metadata layer alongside catalogd's data metadata
  and playbook_memory's operational memory. Defines:
    - PathwayTrace wire format
    - pathway_id = SHA256(task_class | file_prefix | signal_class)
    - file_prefix algorithm (first 2 path segments)
    - pathway_vec: 32-bucket bag-of-tokens hash, fixed dim per spec
    - Lifecycle: insert → revise → replay → probation gate retire
    - Mem0 versioning (trace_uid + parent_trace_uid + version chain)
    - Access patterns: query_for_hotswap / query_by_vec / list_versions
    - PII risk surface (reducer_summary + final_verdict)
    - Spec boundary: stable in v1 vs implementation-specific
  No new architecture. Descriptive, not prescriptive.

SUBJECT_MANIFESTS_ON_CATALOGD.md (~400 lines):
  The local-first audit-trail spec. Adds a fourth manifest type to
  catalogd alongside dataset/view/tombstone/profile. NOT a separate
  identity daemon. NOT Vault/KMS/dual-control JWT. Builds on
  primitives catalogd already ships:
    - SubjectManifest at data/_catalog/subjects/<id>.json
    - Per-subject HMAC-chained audit JSONL
    - Daily retention sweep using existing tombstone primitives
    - Vertical-aware routing (healthcare → local-only)
    - Legal-tier credential separate from gateway internal auth
  ~4 days estimated implementation effort vs 17-20 days for the
  IDENTITY_SERVICE_DESIGN approach. Same defensibility for the
  staffing-client launch window. Strictly additive to compatibility
  with the v3 design if SOC2 Type II becomes a contract requirement.

These are SPECS — what the system already does (pathway) and what's
the smallest local-first thing that addresses the audit need
(subject manifests). Not 9-phase plans. Not new daemons.

The pathway spec is descriptive: writing down what exists so the
next person doesn't reinvent it. The subject-manifests spec is
prescriptive: J greenlights, implementation is days not weeks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:07:38 -05:00


Pathway Memory — Specification v1

Status: Draft v1 — 2026-05-03 · Layer: Decision metadata · Implementation: crates/vectord/src/pathway_memory.rs

What this is. Pathway Memory is a third metadata layer that exists alongside (not on top of, not under) the data-metadata layer (catalogd's manifests/views/tombstones) and the operational-memory layer (playbook_memory). It tracks decision patterns: "this code path, in this task class, produced this outcome." It is the AI-substrate analog of what Iceberg manifests are to data tables — pointers + provenance for things you might want to retrieve again or learn from.

What it is NOT. It is not a vector database. It is not a knowledge graph. It is not an audit log (audit is per-subject; pathways are per-code-pattern). It is not a cache. It is a small, opinionated metadata layer with a defined wire format, defined access patterns, and defined lifecycle rules.

Why a spec exists. No standard tells us what this layer is. Each implementation that drifts from this spec creates the wiring confusion currently visible in the system (data/_kb/ JSONLs that don't get consumed, version drift across workers_500k_v1..v10, unclear who reads reducer_summary vs final_verdict). The spec is the standardization contract that lets a single implementation become the canonical one.


1. Conceptual model

A pathway is the identity-equivalence-class of code paths that exhibit similar decision behavior. Two distinct files may share a pathway if they belong to the same crate AND are reviewed under the same task class with the same signal class.

A trace is a single observed instance of a pathway: one review pass, with its inputs, intermediate signals, ladder attempts, audit outcome, and final verdict.

A pathway is identified by a stable hash of three keys: task class, file prefix, signal class. The hash is pathway_id.

A trace is identified within a pathway by a generated UUID: trace_uid. The pair (pathway_id, trace_uid) is globally unique.

Pathways are append-only by trace: every new observation creates a new trace under the existing pathway_id. Traces themselves are mutable in version: revise() produces a new version that supersedes the previous (Mem0-style chain).

Pathways are retirable by id: when replay history shows the pattern no longer holds, the pathway is marked retired and excluded from hot-swap. New pathways may emerge with fresh ids if the underlying code/task characteristics shift.


2. Wire format — PathwayTrace

Authoritative struct: crates/vectord/src/pathway_memory.rs. The fields below are the SPEC; the Rust struct is one implementation. All fields use JSON-serializable types when persisted.

2.1 — Identity fields (immutable per trace)

| Field | Type | Constraint |
| --- | --- | --- |
| pathway_id | string | SHA256 hex of task_class \| file_prefix \| signal_class |
| trace_uid | string | UUID. Empty on legacy traces; populated on insert by SHOULD-implementations from v2+ |
| version | u32 | Mem0 version chain. Default 1 on insert. Bumped on revise(). |
| parent_trace_uid | string \| null | trace_uid of the trace this one supersedes. null on root version. |
| superseded_at | string \| null | ISO-8601 UTC. Set when this trace becomes a non-head version. |
| superseded_by_trace_uid | string \| null | trace_uid of the new head when this version is superseded. |

2.2 — Identity-source fields (the inputs that fed pathway_id)

| Field | Type | Notes |
| --- | --- | --- |
| task_class | string | Caller-defined class string (e.g. scrum_review, staffing.fill, pr_audit) |
| file_path | string | Full path of the file under review. Reduced to file_prefix for hashing (see §3.2). |
| signal_class | string \| null | Caller-defined behavior label (e.g. CONVERGING, LOOPING, STUCK_RETRY) |

2.3 — Observation fields (what happened)

| Field | Type | Description |
| --- | --- | --- |
| created_at | RFC 3339 timestamp | When this trace was inserted. |
| ladder_attempts | array of LadderAttempt | Each model+rung attempt: {rung, model, latency_ms, accepted, reject_reason?}. Order is dispatch order. |
| kb_chunks | array of KbChunkRef | Knowledge-base chunks the reviewer was given as context: {source_doc, chunk_id, cosine_score, rank}. |
| observer_signals | array of ObserverSignal | Behavior labels emitted by observer during this pathway's run: {class, priors, prior_iter_outcomes}. |
| bridge_hits | array of BridgeHit | External-doc lookups (context7-style): {library, version, result_summary}. |
| sub_pipeline_calls | array of SubPipelineCall | Calls to other sub-pipelines (extract, validate, etc.) made during this trace. |
| audit_consensus | AuditConsensus \| null | Cross-lineage audit outcome if any: {pass, models, disagreements}. |
| reducer_summary | string | Free-text summary of what the reducer concluded. PII risk surface (see §6.3). |
| final_verdict | string | Free-text verdict label (accepted, rejected, needs_review, etc.). PII risk surface (see §6.3). |

2.4 — Index fields (the matrix retrieval shape)

| Field | Type | Description |
| --- | --- | --- |
| pathway_vec | array of f32, length 32 | Bag-of-tokens hash embedding. Algorithm in §4. Fixed dimension per spec version. |
| replay_count | u32 | Number of times this pathway has been replayed via hot-swap. Initial insert is NOT a replay. |
| replays_succeeded | u32 | Replays where the post-replay outcome matched the trace's final_verdict. |
| retired | bool | Marked true when probation gate fires (§5.4). Excluded from hot-swap forever. |

2.5 — Optional semantic-correctness fields (ADR-021)

| Field | Type | Description |
| --- | --- | --- |
| semantic_flags | array of SemanticFlag | One of: UnitMismatch, TypeConfusion, NullableConfusion, OffByOne, StaleReference, PseudoImpl, DeadCode, WarningNoise, BoundaryViolation. Caller-emitted, not auto-derived. |
| type_hints_used | array of TypeHint | Schema/type context fed to the reviewer for this trace: {source, symbol, type_repr}. |
| bug_fingerprints | array of BugFingerprint | Structural pattern hashes the reviewer caught: {flag, pattern_key, example, occurrences}. |

All §2.5 fields are additive — implementations MUST default them to empty when deserializing legacy traces.
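A minimal sketch of the defaulting rule, assuming traces are persisted as JSON objects. The helper name load_trace is hypothetical, and treating an explicit null the same as a missing key is an assumption, not something the spec states.

```python
import json

# Field names come from §2.5; everything else in this sketch is illustrative.
SEMANTIC_FIELDS = ("semantic_flags", "type_hints_used", "bug_fingerprints")

def load_trace(raw_json: str) -> dict:
    """Parse one persisted trace and default missing §2.5 fields to []."""
    trace = json.loads(raw_json)
    for name in SEMANTIC_FIELDS:
        # Assumption: an absent key and an explicit null both mean "empty".
        if trace.get(name) is None:
            trace[name] = []
    return trace
```

A legacy trace that predates ADR-021 then reads back with three empty lists rather than failing deserialization.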


3. Identity algorithms

3.1 — pathway_id

pathway_id = sha256_hex(task_class || "|" || file_prefix(file_path) || "|" || signal_class_or_empty)
  • Concatenation uses literal byte | (0x7C) as separator.
  • signal_class_or_empty is the empty string when signal_class is null.
  • SHA-256 output is 64 hex chars, lowercase.
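The rule above can be sketched in a few lines. This is an illustration, not the Rust implementation; it takes an already-reduced file_prefix as input, and UTF-8 encoding of the joined string is an assumption (the spec fixes the separator byte but does not name an encoding).

```python
import hashlib
from typing import Optional

def pathway_id(task_class: str, file_prefix: str, signal_class: Optional[str]) -> str:
    """SHA-256 hex of task_class | file_prefix | signal_class (§3.1)."""
    # A null signal_class hashes as the empty string.
    signal = signal_class if signal_class is not None else ""
    material = "|".join([task_class, file_prefix, signal])
    # Assumption: the joined string is encoded as UTF-8 before hashing.
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because null and the empty string hash identically, a caller that omits signal_class lands in the same pathway as one that passes "".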

3.2 — file_prefix

file_prefix(path) = first two path segments joined by '/'
  • Path separator is forward slash. Implementations on Windows MUST normalize to forward slashes before hashing.
  • Files with fewer than two segments use the full path verbatim.
  • Examples:
    • crates/queryd/src/service.rs → crates/queryd
    • crates/gateway → crates/gateway
    • README.md → README.md
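A short sketch of the prefix rule, assuming paths arrive as plain strings. Handling of leading slashes or empty segments is not covered by the spec, so this sketch does not special-case them.

```python
def file_prefix(path: str) -> str:
    """Return the first two forward-slash segments of path (§3.2)."""
    # Normalize Windows separators before splitting.
    normalized = path.replace("\\", "/")
    segments = normalized.split("/")
    if len(segments) < 2:
        # Fewer than two segments: use the full path verbatim.
        return normalized
    return "/".join(segments[:2])
```

Normalizing first means crates\queryd\src\service.rs and crates/queryd/src/service.rs reduce to the same prefix and therefore the same pathway_id.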

3.3 — Why this fingerprint shape

The fingerprint groups files within a crate (same file_prefix) under the same pathway when reviewed for the same task class with the same signal class. This is intentional: it lets the matrix index recognize "the same kind of bug appears in this crate" without needing per-file traces. Per-file granularity is preserved in the file_path field for retrieval but is not part of the identity.


4. pathway_vec algorithm

Fixed dimension: 32 buckets. Implementation: deterministic bag-of-tokens hash, normalized.

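import hashlib
import math

def build_pathway_vec(trace):
    buckets = [0.0] * 32
    tokens = collect_metadata_tokens(trace)  # see token list below
    for tok in tokens:
        h = hashlib.sha256(tok.encode("utf-8")).digest()
        # The first 4 digest bytes, read big-endian, select the bucket.
        bucket = int.from_bytes(h[:4], "big") % 32
        buckets[bucket] += 1.0
    # L2-normalize; an empty token list yields the zero vector.
    norm = math.sqrt(sum(b * b for b in buckets))
    return [b / norm if norm > 0 else 0.0 for b in buckets]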

Tokens MUST include:

  • task_class:<task_class>
  • file_prefix:<file_prefix>
  • signal_class:<signal_class_or_empty>
  • For each ladder_attempts[i].model: model:<model>
  • For each kb_chunks[i].source_doc: kb_doc:<source_doc>
  • For each observer_signals[i].class: signal:<class>
  • For each bug_fingerprints[i].flag: flag:<flag>

Tokens MUST NOT include:

  • reducer_summary text (free-form, would dominate the bucket distribution)
  • final_verdict text
  • created_at (would make every trace's vector unique)
  • candidate_id, name, email, phone, or any subject identifier (PII risk)
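The two lists above can be read as a token collector. The following is a sketch under the assumption that a trace is a dict shaped like §2; it emits only the MUST tokens and never reads reducer_summary, final_verdict, or created_at.

```python
def file_prefix(path: str) -> str:
    # First two forward-slash segments (§3.2); shorter paths pass through.
    return "/".join(path.replace("\\", "/").split("/")[:2])

def collect_metadata_tokens(trace: dict) -> list:
    """Build the token list for pathway_vec from metadata fields only."""
    tokens = [
        f"task_class:{trace['task_class']}",
        f"file_prefix:{file_prefix(trace['file_path'])}",
        f"signal_class:{trace.get('signal_class') or ''}",
    ]
    tokens += [f"model:{a['model']}" for a in trace.get("ladder_attempts", [])]
    tokens += [f"kb_doc:{c['source_doc']}" for c in trace.get("kb_chunks", [])]
    tokens += [f"signal:{s['class']}" for s in trace.get("observer_signals", [])]
    tokens += [f"flag:{b['flag']}" for b in trace.get("bug_fingerprints", [])]
    return tokens
```

Token order does not matter for the resulting vector, since each token only increments a bucket count.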

Why 32 dimensions: the round-3 consensus ensemble selected "small metadata tokens, not full JSON." 32 buckets are sufficient to distinguish pathway combinations without requiring an external embedding model. An external embedding model would also work, but it adds a dependency, a failure mode, and drift risk that the consensus flagged. Spec version v1 fixes dim=32; v2 may revisit.


5. Lifecycle

5.1 — Insert

A trace is inserted by the consumer (typically the scrum pipeline or auditor) calling the implementation's insert() API.

  • If no pathway with pathway_id exists, a new pathway bucket is created.
  • If a pathway exists, the new trace is appended to that bucket's traces list.
  • trace_uid is generated (UUID v7 RECOMMENDED for time-orderability).
  • version is set to 1.
  • created_at is set to the insertion timestamp.
  • pathway_vec is computed and stored.
  • replay_count and replays_succeeded start at 0.
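The insert steps can be sketched against an in-memory dict keyed by pathway_id. This is an illustration, not the vectord API: the function signature is hypothetical, the caller is assumed to have computed pathway_id and pathway_vec already, and uuid4 stands in for the recommended UUID v7.

```python
import uuid
from datetime import datetime, timezone

def insert(pathways: dict, pathway_id: str, fields: dict, pathway_vec: list) -> dict:
    """Append a new version-1 trace to the bucket for pathway_id (§5.1)."""
    trace = dict(fields)
    trace.update(
        pathway_id=pathway_id,
        # Assumption: uuid4 as a stand-in; the spec recommends UUID v7.
        trace_uid=str(uuid.uuid4()),
        version=1,
        created_at=datetime.now(timezone.utc).isoformat(),
        pathway_vec=pathway_vec,
        replay_count=0,
        replays_succeeded=0,
    )
    # setdefault creates the bucket the first time this pathway_id is seen.
    pathways.setdefault(pathway_id, []).append(trace)
    return trace
```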

5.2 — Revise

A trace is revised by calling revise(trace_uid, new_fields).

  • A new trace with a fresh trace_uid and version = previous.version + 1 is inserted.
  • The new trace's parent_trace_uid is set to the previous trace's trace_uid.
  • The previous trace's superseded_at is set to the revision timestamp and superseded_by_trace_uid is set to the new trace's trace_uid.
  • The previous trace remains in the bucket (history is preserved). Retrieval defaults to head versions only — see §7.
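A sketch of the revise steps over a single bucket (a list of trace dicts). Two choices here are assumptions the spec does not state: only a head version may be revised, and unchanged fields (including replay counters) are copied from the previous version.

```python
import uuid
from datetime import datetime, timezone

def revise(bucket: list, trace_uid: str, new_fields: dict) -> dict:
    """Insert a new head version that supersedes trace_uid (§5.2)."""
    previous = next((t for t in bucket if t["trace_uid"] == trace_uid), None)
    if previous is None:
        raise KeyError(trace_uid)
    if previous.get("superseded_at") is not None:
        # Assumption: only the head may be revised, so the chain never forks.
        raise ValueError("trace is already superseded")
    now = datetime.now(timezone.utc).isoformat()
    new_trace = {
        **previous,
        **new_fields,
        "trace_uid": str(uuid.uuid4()),
        "version": previous["version"] + 1,
        "parent_trace_uid": previous["trace_uid"],
        "superseded_at": None,
        "superseded_by_trace_uid": None,
        "created_at": now,
    }
    # The previous version stays in the bucket, marked as non-head.
    previous["superseded_at"] = now
    previous["superseded_by_trace_uid"] = new_trace["trace_uid"]
    bucket.append(new_trace)
    return new_trace
```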

5.3 — Replay

A pathway is eligible for replay when retired == false AND the bucket has at least one head-version trace.

The hot-swap consumer queries by pathway_id (or by similarity over pathway_vec), retrieves the head trace, applies its final_verdict shape to a new but similar input, and then reports back via record_replay(trace_uid, succeeded: bool).

  • replay_count is incremented on every replay (success or failure).
  • replays_succeeded is incremented only on success.

5.4 — Probation gate (retirement)

Probation triggers when replay_count >= 3 AND success_rate < 0.80.

When triggered, the pathway is marked retired = true. It is excluded from hot-swap forever. If the underlying code/task characteristics genuinely change such that a retired pathway would work again, a NEW pathway with a FRESH id will be created (because the inputs to pathway_id would have changed, e.g., new signal_class). Retirement is per-id and irreversible.

success_rate = replays_succeeded / replay_count (0.0 if replay_count == 0).
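The counters from §5.3 and the gate from §5.4 fit in one small sketch. It operates on a trace dict directly rather than looking one up by trace_uid, which is a simplification; whether the counters live per trace or per pathway bucket is left to the implementation here.

```python
PROBATION_MIN_REPLAYS = 3
PROBATION_SUCCESS_FLOOR = 0.80

def success_rate(trace: dict) -> float:
    """replays_succeeded / replay_count, or 0.0 before any replay."""
    if trace["replay_count"] == 0:
        return 0.0
    return trace["replays_succeeded"] / trace["replay_count"]

def record_replay(trace: dict, succeeded: bool) -> bool:
    """Update replay counters, apply the probation gate, return retired."""
    trace["replay_count"] += 1
    if succeeded:
        trace["replays_succeeded"] += 1
    if (trace["replay_count"] >= PROBATION_MIN_REPLAYS
            and success_rate(trace) < PROBATION_SUCCESS_FLOOR):
        # Retirement is only ever set, never cleared.
        trace["retired"] = True
    return trace["retired"]
```

Worked through: two successes followed by one failure gives 2/3 ≈ 0.67 at replay_count = 3, so the gate fires. Four successes and one failure gives exactly 0.80, which is not below the floor, so the pathway stays live.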


6. Access patterns (read + write shapes)

6.1 — Write paths (who inserts traces)

Spec-conformant writers:

  • Scrum pipeline (tests/real-world/scrum_master_pipeline.ts and downstream consumers) on review acceptance
  • Auditor pipeline (auditor/) on verdict assembly
  • Validator (crates/validator/) on iterate session completion

Each writer is responsible for setting the identity fields correctly. The implementation MUST reject writes where task_class is empty.

6.2 — Read paths (who consumes traces)

Spec-conformant readers:

  • Hot-swap retrieval: query_for_hotswap(task_class, file_path, signal_class, k=5) returns the top-k head-version traces from matching pathways, ordered by (success_rate desc, replay_count desc).
  • Similarity retrieval: query_by_vec(pathway_vec, k=10) returns the top-k head-version traces by cosine similarity over pathway_vec.
  • Audit / forensic: list_versions(trace_uid_root) returns the full version chain. Must be authorized — version chains can include reducer_summary text that may contain PII.

The implementation MUST NOT expose list_versions on an unauthenticated route.
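A sketch of the chain walk behind list_versions, assuming the bucket is a list of trace dicts linked by superseded_by_trace_uid. Authorization is deliberately out of scope here and is the caller's responsibility; the cycle guard is a defensive assumption for corrupt data, not a spec requirement.

```python
def list_versions(bucket: list, trace_uid_root: str) -> list:
    """Return the version chain from the root trace to the current head."""
    by_uid = {t["trace_uid"]: t for t in bucket}
    chain = []
    seen = set()
    current = by_uid.get(trace_uid_root)
    while current is not None and current["trace_uid"] not in seen:
        seen.add(current["trace_uid"])
        chain.append(current)
        # Follow the forward link; a head version has no successor.
        current = by_uid.get(current.get("superseded_by_trace_uid"))
    return chain
```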

6.3 — PII surface (load-bearing for audit-trail compliance)

The fields reducer_summary and final_verdict are free-form strings written from review output. They are highly likely to contain PII when the reviewed task class involved real candidates (per AUDIT_PHASE_1_DISCOVERY.md §1F + §10/C1). Until subject-redaction-on-write lands, implementations:

  • MUST treat trace bodies as a suspected PII sink for any task class touching real subjects
  • SHOULD redact known PII patterns (candidate_id, email, phone) before persisting
  • SHOULD provide a per-trace subject_ids[] top-level field when known, so audit-response queries can filter by subject

Until the redaction layer ships, do not advertise pathway_memory as PII-clean.
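As an illustration of the SHOULD-redact rule, here is a deliberately crude pattern-based pass over free text. The regexes, placeholder strings, and function name are all assumptions; candidate_id is left out because its format is not given in this spec, and a real redaction layer would need far more than two patterns.

```python
import re

# Illustrative patterns only; both will miss cases and may over-match.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace email- and phone-shaped substrings before persisting."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return PHONE_RE.sub("[REDACTED_PHONE]", text)
```

A writer would apply this to reducer_summary and final_verdict just before insert, which keeps the redaction at the write boundary rather than in every reader.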


7. Storage representation

The current implementation persists state to a single JSON file at data/_pathway_memory/state.json:

{
  "pathways": {
    "<pathway_id>": [
      { "trace_uid": "...", "version": 1, "task_class": "...", ... },
      { "trace_uid": "...", "version": 2, "parent_trace_uid": "...", ... }
    ]
  },
  "last_updated_at": "2026-04-29T05:43:00Z"
}

The flat-JSON representation is implementation-specific. A spec-conformant implementation could use SQLite, an append-only JSONL, or a Lance dataset. The spec dictates the wire shape per-trace and the access patterns; the storage backend is the implementation's choice.

Retrieval defaults: queries return head versions only by default (a head version is one with superseded_at == null). Consumers explicitly opt in to historical versions via an include_history: true flag.
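The head-only default amounts to a single filter. A minimal sketch, assuming a bucket is a list of trace dicts; the function name select_traces is hypothetical.

```python
def select_traces(bucket: list, include_history: bool = False) -> list:
    """Return head versions only unless the caller opts in to history."""
    if include_history:
        return list(bucket)
    # A head version is one whose superseded_at is null.
    return [t for t in bucket if t.get("superseded_at") is None]
```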


8. Spec boundary — what's stable vs implementation-specific

Stable (locked in v1; changing requires v2 spec)

  • The pathway_id algorithm (SHA256 of task_class | file_prefix | signal_class)
  • The file_prefix algorithm (first two path segments)
  • pathway_vec dimension (32) and token list
  • Field names and types in §2 (additive fields are OK; renames or removals require v2)
  • Lifecycle rules in §5
  • Access pattern names in §6 (query_for_hotswap, query_by_vec, list_versions)

Implementation-specific (free to change without spec bump)

  • Storage backend (JSON file, SQLite, Lance, etc.)
  • HTTP/gRPC/library API surface
  • Concurrency model
  • In-memory caching strategy
  • The exact text of reducer_summary (it's content, not schema)
  • Token-bucket hash function (SHA-256 is specified in §4, but any 32-bit hash is acceptable as long as it is deterministic and well-distributed)

Reserved for v2 (named so future implementations don't paint into a corner)

  • Cross-pathway similarity (query_pathways_similar_to_pathway)
  • External embedding model for pathway_vec (replaces the bag-of-tokens hash)
  • Subject-aware indexing (so audit-response can find all pathways involving subject X)
  • Differential privacy on pathway_vec aggregations

9. Open questions

These need decisions before v1 is final-final:

  1. Dimension lock-in: 32 is round-3 consensus, but is it actually optimal? Worth a benchmark before final v1.
  2. Subject indexing in v1 vs v2: deferring to v2 means audit queries today have to grep reducer_summary text. Is that acceptable for the staffing-client launch window?
  3. Cross-runtime parity: the Go side has its own internal/pathways/ implementation. As of 2026-05-03 the wire shape matches but no parity probe exists. Add pathway_parity.sh to the cross-runtime probe suite.
  4. Migration from current state.json: existing 91 traces lack trace_uid. Spec says trace_uid is empty on legacy traces — should v1 require backfill or accept legacy-trace queries indefinitely?

10. Why this spec exists

Without it:

  • Every new consumer of pathway_memory invents their own read/write conventions
  • The Go-side reimplementation drifts from the Rust-side wire format
  • Future LLM-assistant work proposes new fields without considering the existing schema
  • Forensic queries grep free-text fields because no canonical retrieval exists
  • The PII concern stays "TODO" because there's no stable surface to gate

With it:

  • One implementation can be canonical (Rust, currently); others can be conformance-tested
  • The wire shape is documented; serialization tests can pin it
  • New fields go through an additive-only review (don't break legacy)
  • The KbContext consumer in the gateway has a stable API to read against
  • The audit-trail PRD has a defined PII surface to protect

This is the standardization the system was missing.


Change log

  • 2026-05-03 — v1 initial draft. Codifies what's already in crates/vectord/src/pathway_memory.rs. The spec is descriptive of the existing implementation, not prescriptive of new behavior.