Phase 44 PRD (docs/CONTROL_PLANE_PRD.md:204) explicitly lists
`tests/multi-agent/agent.ts::generate()` as a migration target:
every internal LLM caller must flow through /v1/chat so usage
accounting + audit trail see all traffic.
generateCloud() bypassed the gateway entirely: it POSTed directly to
OLLAMA_CLOUD_URL/api/generate with the bearer key. As a result:
- /v1/usage missed every agent.ts cloud call
- No gateway-side caching, rate-limiting, or cost gating
- Callers needed OLLAMA_CLOUD_KEY in env (leak risk; gateway
already owns the key)
Migration:
- Endpoint: OLLAMA_CLOUD_URL/api/generate → GATEWAY/v1/chat
- Body shape: {prompt,options.num_predict,options.temperature} →
OpenAI-compatible {messages[],temperature,max_tokens}
- provider: "ollama_cloud" explicit in the request
- Response extraction: data.response → data.choices[0].message.content
- OLLAMA_CLOUD_KEY no longer required in agent.ts env
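The body and response mapping above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the actual
agent.ts code: the type names, the helpers toChatRequest() and
extractContent(), the `model` field, and the single "user" message are
all hypothetical. Only the field names in the mapping come from this
change.

```typescript
// Hypothetical legacy body previously sent to OLLAMA_CLOUD_URL/api/generate.
interface LegacyGenerateBody {
  model: string; // assumption: the commit does not list a model field
  prompt: string;
  options?: { num_predict?: number; temperature?: number };
}

// Hypothetical OpenAI-compatible body sent to GATEWAY/v1/chat.
interface ChatRequest {
  provider: "ollama_cloud";
  model: string;
  messages: { role: "user"; content: string }[];
  temperature?: number;
  max_tokens?: number;
}

// Map the old generate-style body onto the chat-style body.
function toChatRequest(legacy: LegacyGenerateBody): ChatRequest {
  return {
    provider: "ollama_cloud", // explicit provider, per the migration notes
    model: legacy.model,
    // assumption: the whole prompt becomes one user message
    messages: [{ role: "user", content: legacy.prompt }],
    temperature: legacy.options?.temperature,
    max_tokens: legacy.options?.num_predict,
  };
}

// Loose shape of the chat response; only the path we read is typed.
interface ChatResponse {
  choices?: { message?: { content?: string } }[];
}

// Old extraction was data.response; new is data.choices[0].message.content.
function extractContent(data: ChatResponse): string {
  const content = data.choices?.[0]?.message?.content;
  if (typeof content !== "string") {
    throw new Error("chat response has no choices[0].message.content");
  }
  return content;
}
```

A caller would serialize toChatRequest(...) with JSON.stringify (which
drops the undefined optional fields), POST it to the gateway, and pass
the parsed JSON to extractContent(). No bearer key is attached on the
agent side because the gateway owns it.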
Phase 44 gate verified: `grep -E 'localhost:3200/generate|/api/generate'`
now only hits (a) the ollama_cloud.rs adapter itself (legitimate, since
it is the gateway-side direct caller) and (b) this comment explaining
the migration history. Zero non-adapter code paths to /api/generate.
generate() (local Ollama) still goes direct to :3200 because it is the
t1_hot path. The Phase 44 PRD focuses on cloud callers; hot-path local
generation deliberately stays direct for latency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>