root ed1fcd3c26 specs: pathway_memory v1 + subject_manifests_on_catalogd v1

Two specifications addressing the framing J asked for after reading
the llms3.com blog: standardize what we have so future work doesn't
drift, and apply the local-first thesis to the audit problem instead
of the over-scoped SaaS-tier identity service.

PATHWAY_MEMORY_SPEC.md (~400 lines):
  Documents the existing crates/vectord/src/pathway_memory.rs as a
  spec — the third metadata layer alongside catalogd's data metadata
  and playbook_memory's operational memory. Defines:
    - PathwayTrace wire format
    - pathway_id = SHA256(task_class | file_prefix | signal_class)
    - file_prefix algorithm (first 2 path segments)
    - pathway_vec: 32-bucket bag-of-tokens hash, fixed dim per spec
    - Lifecycle: insert → revise → replay → probation gate retire
    - Mem0 versioning (trace_uid + parent_trace_uid + version chain)
    - Access patterns: query_for_hotswap / query_by_vec / list_versions
    - PII risk surface (reducer_summary + final_verdict)
    - Spec boundary: stable in v1 vs implementation-specific
  No new architecture. Descriptive, not prescriptive.

SUBJECT_MANIFESTS_ON_CATALOGD.md (~400 lines):
  The local-first audit-trail spec. Adds a fourth manifest type to
  catalogd alongside dataset/view/tombstone/profile. NOT a separate
  identity daemon. NOT Vault/KMS/dual-control JWT. Builds on
  primitives catalogd already ships:
    - SubjectManifest at data/_catalog/subjects/<id>.json
    - Per-subject HMAC-chained audit JSONL
    - Daily retention sweep using existing tombstone primitives
    - Vertical-aware routing (healthcare → local-only)
    - Legal-tier credential separate from gateway internal auth
  ~4 days estimated implementation effort vs 17-20 days for the
  IDENTITY_SERVICE_DESIGN approach. Same defensibility for the
  staffing-client launch window. Strictly additive, and compatible
  with the v3 design if SOC2 Type II becomes a contract requirement.

These are SPECS — what the system already does (pathway) and what's
the smallest local-first thing that addresses the audit need
(subject manifests). Not 9-phase plans. Not new daemons.

The pathway spec is descriptive: writing down what exists so the
next person doesn't reinvent it. The subject-manifests spec is
prescriptive: J greenlights, implementation is days not weeks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-03 03:07:38 -05:00

17 KiB


Pathway Memory — Specification v1

Status: Draft v1 — 2026-05-03 · Layer: Decision metadata · Implementation: crates/vectord/src/pathway_memory.rs

What this is. Pathway Memory is a third metadata layer that exists alongside (not on top of, not under) the data-metadata layer (catalogd's manifests/views/tombstones) and the operational-memory layer (playbook_memory). It tracks decision patterns: "this code path, in this task class, produced this outcome." It is the AI-substrate analog of what Iceberg manifests are to data tables — pointers + provenance for things you might want to retrieve again or learn from.

What it is NOT. It is not a vector database. It is not a knowledge graph. It is not an audit log (audit is per-subject; pathways are per-code-pattern). It is not a cache. It is a small, opinionated metadata layer with a defined wire format, defined access patterns, and defined lifecycle rules.

Why a spec exists. No standard tells us what this layer is. Each implementation that drifts from this spec creates the wiring confusion currently visible in the system (data/_kb/ JSONLs that don't get consumed, version drift across workers_500k_v1..v10, unclear who reads reducer_summary vs final_verdict). The spec is the standardization contract that lets a single implementation become the canonical one.

1. Conceptual model

A pathway is the identity-equivalence-class of code paths that exhibit similar decision behavior. Two distinct files may share a pathway if they belong to the same crate AND are reviewed under the same task class with the same signal class.

A trace is a single observed instance of a pathway: one review pass, with its inputs, intermediate signals, ladder attempts, audit outcome, and final verdict.

A pathway is identified by a stable hash of three keys: task class, file prefix, signal class. The hash is pathway_id.

A trace is identified within a pathway by a generated UUID: trace_uid. The pair (pathway_id, trace_uid) is globally unique.

Pathways are append-only by trace: every new observation creates a new trace under the existing pathway_id. Traces themselves are mutable in version: revise() produces a new version that supersedes the previous (Mem0-style chain).

Pathways are retirable by id: when replay history shows the pattern no longer holds, the pathway is marked retired and excluded from hot-swap. New pathways may emerge with fresh ids if the underlying code/task characteristics shift.

2. Wire format — `PathwayTrace`

Authoritative struct: crates/vectord/src/pathway_memory.rs. The fields below are the SPEC; the Rust struct is one implementation. All fields use JSON-serializable types when persisted.

2.1 — Identity fields (immutable per trace)

Field	Type	Constraint
`pathway_id`	string	SHA256 hex of `task_class \| file_prefix \| signal_class`. 64 lowercase hex chars. See §3.1.
`trace_uid`	string	UUID. Empty on legacy traces; populated on insert by SHOULD-implementations from v2+
`version`	u32	Mem0 version chain. Default 1 on insert. Bumped on `revise()`.
`parent_trace_uid`	string \| null	trace_uid of the trace this one supersedes. null on root version.
`superseded_at`	string \| null	ISO-8601 UTC. Set when this trace becomes a non-head version.
`superseded_by_trace_uid`	string \| null	trace_uid of the new head when this version is superseded.

2.2 — Identity-source fields (the inputs that fed pathway_id)

Field	Type	Notes
`task_class`	string	Caller-defined class string (e.g. `scrum_review`, `staffing.fill`, `pr_audit`)
`file_path`	string	Full path of the file under review. Reduced to file_prefix for hashing — see §3.2.
`signal_class`	string \| null	Caller-defined behavior label (e.g. `CONVERGING`, `LOOPING`, `STUCK_RETRY`)

2.3 — Observation fields (what happened)

Field	Type	Description
`created_at`	RFC 3339 timestamp	When this trace was inserted.
`ladder_attempts`	array of `LadderAttempt`	Each model+rung attempt: `{rung, model, latency_ms, accepted, reject_reason?}`. Order is dispatch order.
`kb_chunks`	array of `KbChunkRef`	Knowledge-base chunks the reviewer was given as context: `{source_doc, chunk_id, cosine_score, rank}`.
`observer_signals`	array of `ObserverSignal`	Behavior labels emitted by observer during this pathway's run: `{class, priors, prior_iter_outcomes}`.
`bridge_hits`	array of `BridgeHit`	External-doc lookups (context7-style): `{library, version, result_summary}`.
`sub_pipeline_calls`	array of `SubPipelineCall`	Calls to other sub-pipelines (extract, validate, etc.) made during this trace.
`audit_consensus`	`AuditConsensus` \| null	Cross-lineage audit outcome if any: `{pass, models, disagreements}`.
`reducer_summary`	string	Free-text summary of what the reducer concluded. PII risk surface — see §6.3.
`final_verdict`	string	Free-text verdict label (`accepted`, `rejected`, `needs_review`, etc.). PII risk surface — see §6.3.

2.4 — Index fields (the matrix retrieval shape)

Field	Type	Description
`pathway_vec`	array of f32, length 32	Bag-of-tokens hash embedding. Algorithm in §4. Fixed dimension per spec version.
`replay_count`	u32	Number of times this pathway has been replayed via hot-swap. Initial insert is NOT a replay.
`replays_succeeded`	u32	Replays where the post-replay outcome matched the trace's `final_verdict`.
`retired`	bool	Marked true when probation gate fires (§5.4). Excluded from hot-swap forever.

2.5 — Optional semantic-correctness fields (ADR-021)

Field	Type	Description
`semantic_flags`	array of `SemanticFlag`	One of: `UnitMismatch`, `TypeConfusion`, `NullableConfusion`, `OffByOne`, `StaleReference`, `PseudoImpl`, `DeadCode`, `WarningNoise`, `BoundaryViolation`. Caller-emitted, not auto-derived.
`type_hints_used`	array of `TypeHint`	Schema/type context fed to the reviewer for this trace: `{source, symbol, type_repr}`.
`bug_fingerprints`	array of `BugFingerprint`	Structural pattern hashes the reviewer caught: `{flag, pattern_key, example, occurrences}`.

All §2.5 fields are additive — implementations MUST default them to empty when deserializing legacy traces.
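
As an illustration of the defaulting rule above, here is a minimal Python sketch. It assumes traces are handled as plain dicts parsed from JSON; the helper name `load_trace` is hypothetical and is not part of the Rust implementation.

```python
import json

# Field names come from §2.5; everything else here is illustrative.
SEMANTIC_FIELDS = ("semantic_flags", "type_hints_used", "bug_fingerprints")


def load_trace(raw: str) -> dict:
    """Parse one persisted trace, defaulting missing §2.5 fields to []."""
    trace = json.loads(raw)
    for field in SEMANTIC_FIELDS:
        # Treat both an absent key and an explicit null as "empty".
        if trace.get(field) is None:
            trace[field] = []
    return trace


# A legacy trace written before the §2.5 fields existed.
legacy = load_trace('{"pathway_id": "abc", "task_class": "scrum_review"}')
```

Defaulting at the deserialization boundary means downstream readers can iterate over these fields without checking whether the trace predates them.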

3. Identity algorithms

3.1 — `pathway_id`

pathway_id = sha256_hex(task_class || "|" || file_prefix(file_path) || "|" || signal_class_or_empty)

Concatenation uses literal byte | (0x7C) as separator.
signal_class_or_empty is the empty string when signal_class is null.
SHA-256 output is 64 hex chars, lowercase.
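
A minimal Python sketch of the rule above. It assumes the joined string is encoded as UTF-8 (the spec does not name an encoding) and takes the already-reduced file prefix from §3.2 as an argument. The function signature is illustrative, not the Rust API.

```python
import hashlib
from typing import Optional


def pathway_id(task_class: str, file_prefix: str, signal_class: Optional[str]) -> str:
    """SHA-256 hex of task_class | file_prefix | signal_class (§3.1)."""
    signal = signal_class or ""  # a null signal_class hashes as the empty string
    material = "|".join([task_class, file_prefix, signal])
    # Assumption: UTF-8 bytes. hexdigest() returns 64 lowercase hex chars.
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


pid = pathway_id("scrum_review", "crates/queryd", "CONVERGING")
```

Because the separator is always emitted, a null `signal_class` and an empty-string `signal_class` yield the same id.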

3.2 — `file_prefix`

file_prefix(path) = first two path segments joined by '/'

Path separator is forward slash. Implementations on Windows MUST normalize to forward slashes before hashing.
Files with fewer than two segments use the full path verbatim.
Examples:
- crates/queryd/src/service.rs → crates/queryd
- crates/gateway → crates/gateway
- README.md → README.md
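
The reduction can be sketched in a few lines of Python. This is an illustration under the rules listed above, not the Rust implementation; how a leading slash or an empty path should be treated is not stated in the spec, so the sketch does nothing special for them.

```python
def file_prefix(path: str) -> str:
    """First two forward-slash path segments, joined by '/' (§3.2)."""
    normalized = path.replace("\\", "/")  # Windows separators are normalized first
    segments = normalized.split("/")
    # Paths with fewer than two segments come back unchanged.
    return "/".join(segments[:2])


examples = [
    file_prefix("crates/queryd/src/service.rs"),
    file_prefix("crates/gateway"),
    file_prefix("README.md"),
]
```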

3.3 — Why this fingerprint shape

The fingerprint groups files within a crate (same file_prefix) under the same pathway when reviewed for the same task class with the same signal class. This is intentional: it lets the matrix index recognize "the same kind of bug appears in this crate" without needing per-file traces. Per-file granularity is preserved in the file_path field for retrieval but is not part of the identity.

4. `pathway_vec` algorithm

Fixed dimension: 32 buckets. Implementation: deterministic bag-of-tokens hash, L2-normalized.

import hashlib
from math import sqrt

def build_pathway_vec(trace):
    buckets = [0.0] * 32
    tokens = collect_metadata_tokens(trace)  # see token list below
    for tok in tokens:
        h = hashlib.sha256(tok.encode()).digest()  # raw 32-byte digest
        bucket = int.from_bytes(h[:4], 'big') % 32  # first 4 bytes pick the bucket
        buckets[bucket] += 1.0
    norm = sqrt(sum(b*b for b in buckets))  # L2 norm of the bucket counts
    return [b/norm if norm > 0 else 0.0 for b in buckets]

Tokens MUST include:

task_class:<task_class>
file_prefix:<file_prefix>
signal_class:<signal_class_or_empty>
For each ladder_attempts[i].model: model:<model>
For each kb_chunks[i].source_doc: kb_doc:<source_doc>
For each observer_signals[i].class: signal:<class>
For each bug_fingerprints[i].flag: flag:<flag>

Tokens MUST NOT include:

reducer_summary text (free-form, would dominate the bucket distribution)
final_verdict text
created_at (would make every trace's vector unique)
candidate_id, name, email, phone, or any subject identifier (PII risk)
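
One way the token list could be assembled, as a hedged Python sketch. It assumes a trace is a plain dict shaped like §2 and inlines the §3.2 prefix reduction. The sample trace values (`model-a`, `ADR-021` as a source document) are made up for illustration.

```python
def collect_metadata_tokens(trace: dict) -> list[str]:
    """Build the MUST-include tokens; free-text and PII fields are never read."""
    # Inline §3.2: first two forward-slash segments of file_path.
    prefix = "/".join(trace["file_path"].replace("\\", "/").split("/")[:2])
    tokens = [
        f"task_class:{trace['task_class']}",
        f"file_prefix:{prefix}",
        f"signal_class:{trace.get('signal_class') or ''}",
    ]
    tokens += [f"model:{a['model']}" for a in trace.get("ladder_attempts", [])]
    tokens += [f"kb_doc:{c['source_doc']}" for c in trace.get("kb_chunks", [])]
    tokens += [f"signal:{s['class']}" for s in trace.get("observer_signals", [])]
    tokens += [f"flag:{b['flag']}" for b in trace.get("bug_fingerprints", [])]
    return tokens


sample = {
    "task_class": "scrum_review",
    "file_path": "crates/queryd/src/service.rs",
    "signal_class": None,
    "ladder_attempts": [{"model": "model-a"}],
    "kb_chunks": [{"source_doc": "ADR-021"}],
    "observer_signals": [{"class": "LOOPING"}],
    "bug_fingerprints": [{"flag": "OffByOne"}],
    "reducer_summary": "free text that must not be tokenized",
}
tokens = collect_metadata_tokens(sample)
```

Building tokens from an explicit allow-list of fields, rather than walking the whole trace, is what keeps `reducer_summary`, `final_verdict`, and `created_at` out of the vector.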

Why 32 dimensions: the round-3 consensus ensemble selected "small metadata tokens, not full JSON." 32 buckets are sufficient to distinguish pathway combinations without requiring an external embedding model. An external embedding model would also work, but it adds a dependency, a failure mode, and a drift risk that the consensus flagged. Spec version v1 fixes dim=32; v2 may revisit.

5. Lifecycle

5.1 — Insert

A trace is inserted by the consumer (typically the scrum pipeline or auditor) calling the implementation's insert() API.

If no pathway with pathway_id exists, a new pathway bucket is created.
If a pathway exists, the new trace is appended to that bucket's traces list.
trace_uid is generated (UUID v7 RECOMMENDED for time-orderability).
version is set to 1.
created_at is set to the insertion timestamp.
pathway_vec is computed and stored.
replay_count and replays_succeeded start at 0.

5.2 — Revise

A trace is revised by calling revise(trace_uid, new_fields).

A new trace with a fresh trace_uid and version = previous.version + 1 is inserted.
The new trace's parent_trace_uid is set to the previous trace's trace_uid.
The previous trace's superseded_at is set to the revision timestamp and superseded_by_trace_uid is set to the new trace's trace_uid.
The previous trace remains in the bucket (history is preserved). Retrieval defaults to head versions only — see §7.
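
The insert and revise rules from §5.1 and §5.2 can be sketched with an in-memory store. Everything here is an assumption for illustration: the `PathwayStore` class, the dict-based traces, and the use of `uuid4` (the spec recommends UUID v7; `uuid4` is used only because it is in every Python 3 standard library). The `pathway_vec` computation is omitted, and the spec does not say whether replay counters carry over to a revision, so this sketch simply copies them.

```python
import uuid
from datetime import datetime, timezone


def utc_now() -> str:
    """ISO-8601 UTC timestamp."""
    return datetime.now(timezone.utc).isoformat()


class PathwayStore:
    """Hypothetical in-memory stand-in for the real store."""

    def __init__(self) -> None:
        self.pathways: dict[str, list[dict]] = {}

    def insert(self, pathway_id: str, fields: dict) -> dict:
        trace = dict(
            fields,
            pathway_id=pathway_id,
            trace_uid=str(uuid.uuid4()),
            version=1,
            parent_trace_uid=None,
            superseded_at=None,
            superseded_by_trace_uid=None,
            created_at=utc_now(),
            replay_count=0,
            replays_succeeded=0,
        )
        # Creates the bucket on first use, appends otherwise.
        self.pathways.setdefault(pathway_id, []).append(trace)
        return trace

    def revise(self, trace_uid: str, new_fields: dict) -> dict:
        for bucket in self.pathways.values():
            for prev in bucket:
                if prev["trace_uid"] == trace_uid:
                    now = utc_now()
                    head = dict(prev, **new_fields)  # copy, then override
                    head.update(
                        trace_uid=str(uuid.uuid4()),
                        version=prev["version"] + 1,
                        parent_trace_uid=prev["trace_uid"],
                        created_at=now,
                    )
                    # The previous version stays in the bucket as history.
                    prev["superseded_at"] = now
                    prev["superseded_by_trace_uid"] = head["trace_uid"]
                    bucket.append(head)
                    return head
        raise KeyError(trace_uid)


store = PathwayStore()
v1 = store.insert("p1", {"task_class": "scrum_review", "final_verdict": "accepted"})
v2 = store.revise(v1["trace_uid"], {"final_verdict": "needs_review"})
```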

5.3 — Replay

A pathway is eligible for replay when retired == false AND the bucket has at least one head-version trace.

The hot-swap consumer queries by pathway_id (or by similarity over pathway_vec), retrieves the head trace, applies its final_verdict shape to a new but similar input, and then reports back via record_replay(trace_uid, succeeded: bool).

replay_count is incremented on every replay (success or failure).
replays_succeeded is incremented only on success.

5.4 — Probation gate (retirement)

Probation triggers when replay_count >= 3 AND success_rate < 0.80.

When triggered, the pathway is marked retired = true. It is excluded from hot-swap forever. If the underlying code/task characteristics genuinely change such that a retired pathway would work again, a NEW pathway with a FRESH id will be created (because the inputs to pathway_id would have changed, e.g., new signal_class). Retirement is per-id and irreversible.

success_rate = replays_succeeded / replay_count (0.0 if replay_count == 0).
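
A small Python sketch of the replay counters and the probation gate, using the thresholds stated above. For brevity it keeps the counters and the `retired` flag on a single dict; the function names and the choice to raise on a retired record are assumptions, not the Rust API.

```python
PROBATION_MIN_REPLAYS = 3
PROBATION_MIN_SUCCESS_RATE = 0.80


def success_rate(trace: dict) -> float:
    """replays_succeeded / replay_count, or 0.0 before any replay."""
    if trace["replay_count"] == 0:
        return 0.0
    return trace["replays_succeeded"] / trace["replay_count"]


def record_replay(trace: dict, succeeded: bool) -> bool:
    """Update counters, apply the gate, and return the retired flag."""
    if trace["retired"]:
        raise ValueError("retired pathways are excluded from hot-swap")
    trace["replay_count"] += 1
    if succeeded:
        trace["replays_succeeded"] += 1
    if (trace["replay_count"] >= PROBATION_MIN_REPLAYS
            and success_rate(trace) < PROBATION_MIN_SUCCESS_RATE):
        trace["retired"] = True  # irreversible for this pathway_id
    return trace["retired"]


# Two successes then one failure: 2/3 is below 0.80 at the third replay.
flaky = {"replay_count": 0, "replays_succeeded": 0, "retired": False}
flaky_history = [record_replay(flaky, ok) for ok in (True, True, False)]

# Four successes then one failure: 4/5 equals 0.80, which is not below it.
steady = {"replay_count": 0, "replays_succeeded": 0, "retired": False}
steady_history = [record_replay(steady, ok) for ok in (True, True, True, True, False)]
```

With these thresholds, a single failure anywhere in the first four replays is enough to retire a pathway (2/3 and 3/4 are both below 0.80), while a first failure on the fifth replay is not.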

6. Access patterns (read + write shapes)

6.1 — Write paths (who inserts traces)

Spec-conformant writers:

Scrum pipeline (tests/real-world/scrum_master_pipeline.ts and downstream consumers) on review acceptance
Auditor pipeline (auditor/) on verdict assembly
Validator (crates/validator/) on iterate session completion

Each writer is responsible for setting the identity fields correctly. The implementation MUST reject writes where task_class is empty.

6.2 — Read paths (who consumes traces)

Spec-conformant readers:

Hot-swap retrieval — query_for_hotswap(task_class, file_path, signal_class, k=5) returns the top-k head-version traces from matching pathways, ordered by (success_rate desc, replay_count desc).
Similarity retrieval — query_by_vec(pathway_vec, k=10) returns the top-k head-version traces by cosine similarity over pathway_vec.
Audit / forensic — list_versions(trace_uid_root) returns the full version chain. Must be authorized — version chains can include reducer_summary text that may contain PII.

The implementation MUST NOT expose list_versions on an unauthenticated route.
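
The two ranking rules for hot-swap and similarity retrieval can be sketched in Python as follows. The sketch assumes the caller has already selected the candidate traces (for hot-swap, those under the matching `pathway_id`), and the function names `rank_for_hotswap` and `rank_by_vec` are hypothetical. The spec excludes retired pathways from hot-swap; whether similarity retrieval should also exclude them is not stated, so this sketch leaves them in.

```python
from math import sqrt


def is_head(trace: dict) -> bool:
    """A head version has never been superseded."""
    return trace.get("superseded_at") is None


def success_rate(trace: dict) -> float:
    count = trace.get("replay_count", 0)
    return trace.get("replays_succeeded", 0) / count if count else 0.0


def rank_for_hotswap(traces: list[dict], k: int = 5) -> list[dict]:
    """Head, non-retired traces by (success_rate desc, replay_count desc)."""
    eligible = [t for t in traces if is_head(t) and not t.get("retired", False)]
    eligible.sort(
        key=lambda t: (success_rate(t), t.get("replay_count", 0)),
        reverse=True,
    )
    return eligible[:k]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def rank_by_vec(traces: list[dict], query_vec: list[float], k: int = 10) -> list[dict]:
    """Head traces by cosine similarity to query_vec, highest first."""
    heads = [t for t in traces if is_head(t)]
    heads.sort(key=lambda t: cosine(t["pathway_vec"], query_vec), reverse=True)
    return heads[:k]
```

Using `replay_count` as the tie-breaker means that, between two traces with the same success rate, the one with more replay evidence is preferred.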

6.3 — PII surface (load-bearing for audit-trail compliance)

The fields reducer_summary and final_verdict are free-form strings written from review output. They are highly likely to contain PII when the reviewed task class involved real candidates (per AUDIT_PHASE_1_DISCOVERY.md §1F + §10/C1). Until subject-redaction-on-write lands, implementations:

MUST treat trace bodies as a suspected PII sink for any task class touching real subjects
SHOULD redact known PII patterns (candidate_id, email, phone) before persisting
SHOULD provide a per-trace subject_ids[] top-level field when known, so audit-response queries can filter by subject

Until the redaction layer ships, do not advertise pathway_memory as PII-clean.
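
A rough Python sketch of what pattern-based redaction before persisting might look like. The regular expressions are deliberately simple and illustrative, and the `cand_<id>` format for candidate identifiers is a made-up assumption since the spec does not define one. Pattern matching of this kind reduces exposure but does not make a field PII-clean.

```python
import re

# Illustrative patterns only; real identifiers and phone formats will differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CANDIDATE_ID_RE = re.compile(r"\bcand_[A-Za-z0-9]+\b")  # hypothetical id format
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace known PII patterns with fixed placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    text = CANDIDATE_ID_RE.sub("[REDACTED_CANDIDATE_ID]", text)
    text = PHONE_RE.sub("[REDACTED_PHONE]", text)
    return text


def redact_trace(trace: dict) -> dict:
    """Return a copy with the two free-text fields from §6.3 redacted."""
    return dict(
        trace,
        reducer_summary=redact(trace.get("reducer_summary", "")),
        final_verdict=redact(trace.get("final_verdict", "")),
    )


clean = redact("Contact jane.doe@example.com or 555-123-4567 about cand_42")
```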

7. Storage representation

The current implementation persists state to a single JSON file at data/_pathway_memory/state.json:

{
  "pathways": {
    "<pathway_id>": [
      { "trace_uid": "...", "version": 1, "task_class": "...", ... },
      { "trace_uid": "...", "version": 2, "parent_trace_uid": "...", ... }
    ]
  },
  "last_updated_at": "2026-04-29T05:43:00Z"
}

The flat-JSON representation is implementation-specific. A spec-conformant implementation could use SQLite, an append-only JSONL, or a Lance dataset. The spec dictates the wire shape per-trace and the access patterns; the storage backend is the implementation's choice.

Retrieval defaults: queries return head versions only by default (a head version is one with superseded_at == null). Consumers explicitly opt in to historical versions via an include_history: true flag.
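
As a sketch of the head-only default over the JSON layout shown in this section, assuming the state is loaded into a plain dict. The function name `select_traces` and the placeholder ids `p1`, `t1`, `t2` are illustrative.

```python
import json

STATE = json.loads("""
{
  "pathways": {
    "p1": [
      {"trace_uid": "t1", "version": 1,
       "superseded_at": "2026-04-29T05:43:00Z", "superseded_by_trace_uid": "t2"},
      {"trace_uid": "t2", "version": 2,
       "parent_trace_uid": "t1", "superseded_at": null}
    ]
  }
}
""")


def select_traces(state: dict, pathway_id: str, include_history: bool = False) -> list[dict]:
    """Head versions only, unless the caller opts in to history."""
    traces = state["pathways"].get(pathway_id, [])
    if include_history:
        return list(traces)
    # Legacy traces without the key count as heads.
    return [t for t in traces if t.get("superseded_at") is None]


heads = select_traces(STATE, "p1")
full_chain = select_traces(STATE, "p1", include_history=True)
```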

8. Spec boundary — what's stable vs implementation-specific

Stable (locked in v1; changing requires v2 spec)

The pathway_id algorithm (SHA256 of task_class | file_prefix | signal_class)
The file_prefix algorithm (first two path segments)
pathway_vec dimension (32) and token list
Field names and types in §2 (additive fields are OK; renames or removals require v2)
Lifecycle rules in §5
Access pattern names in §6 (query_for_hotswap, query_by_vec, list_versions)

Implementation-specific (free to change without spec bump)

Storage backend (JSON file, SQLite, Lance, etc.)
HTTP/gRPC/library API surface
Concurrency model
In-memory caching strategy
The exact text of reducer_summary (it's content, not schema)
Token-bucket hash function (SHA-256 is specified in §4, but any 32-bit hash is acceptable as long as it is deterministic and well-distributed)

Reserved for v2 (named so future implementations don't paint into a corner)

Cross-pathway similarity (query_pathways_similar_to_pathway)
External embedding model for pathway_vec (replaces the bag-of-tokens hash)
Subject-aware indexing (so audit-response can find all pathways involving subject X)
Differential privacy on pathway_vec aggregations

9. Open questions

These need decisions before v1 is final-final:

Dimension lock-in: 32 is round-3 consensus, but is it actually optimal? Worth a benchmark before final v1.
Subject indexing in v1 vs v2: deferring to v2 means audit queries today have to grep reducer_summary text. Is that acceptable for the staffing-client launch window?
Cross-runtime parity: the Go side has its own internal/pathways/ implementation. As of 2026-05-03 the wire shape matches but no parity probe exists. Add pathway_parity.sh to the cross-runtime probe suite.
Migration from current state.json: existing 91 traces lack trace_uid. Spec says trace_uid is empty on legacy traces — should v1 require backfill or accept legacy-trace queries indefinitely?

10. Why this spec exists

Without it:

Every new consumer of pathway_memory invents their own read/write conventions
The Go-side reimplementation drifts from the Rust-side wire format
Future LLM-assistant work proposes new fields without considering the existing schema
Forensic queries grep free-text fields because no canonical retrieval exists
The PII concern stays "TODO" because there's no stable surface to gate

With it:

One implementation can be canonical (Rust, currently); others can be conformance-tested
The wire shape is documented; serialization tests can pin it
New fields go through an additive-only review (don't break legacy)
The KbContext consumer in the gateway has a stable API to read against
The audit-trail PRD has a defined PII surface to protect

This is the standardization the system was missing.

Change log

2026-05-03 — v1 initial draft. Codifies what's already in crates/vectord/src/pathway_memory.rs. The spec is descriptive of the existing implementation, not prescriptive of new behavior.
