golangLAKEHOUSE/README.md
root 8f4c16fab1 mcpd: Go MCP SDK port — replaces Bun mcp-server tool surface
New cmd/mcpd binary using github.com/modelcontextprotocol/go-sdk
v1.5.0 over stdio transport. Exposes Lakehouse capabilities as MCP
tools: list_datasets, get_manifest, query_sql, embed_text,
search_vectors. Each tool proxies to the gateway via HTTP.

Replaces the MCP-tool subset of the Rust system's Bun mcp-server.ts
(the audit's "split this 2520-line empire" finding from R-005). HTTP
demo routes (the staffing co-pilot UI at /api/intelligence/*,
/headshots/*, etc.) stay Bun until G5 cutover — those are demo-
specific and depend on matrix-indexer signals not yet ported.

Architecture:
  cmd/mcpd/main.go (235 LoC)
    main() reads --gateway flag, builds server via buildServer(),
    runs on StdioTransport. Each tool's args is a typed struct with
    jsonschema tags (the SDK's canonical pattern); reflection
    generates the JSON Schema automatically.

    gatewayClient: thin HTTP wrapper over the configured gateway URL.
    30s per-request timeout. 16 MiB tool-response cap. Non-2xx
    surfaces as IsError CallToolResult (NOT as transport error) so
    the LLM caller sees the error text and can decide how to react.

    proxy() handles GET + POST + JSON body uniformly. errorResult()
    + jsonResult() helpers normalize CallToolResult shape.

  cmd/mcpd/main_test.go (13 test funcs)
    Tests the full MCP wire end-to-end without a subprocess: spin
    up a fake gateway via httptest, build the MCP server pointed at
    it, connect a client via in-memory transports (NewInMemoryTransports),
    call each tool. Each tool gets:
      - happy path (gateway returns 200 → tool returns content)
      - input validation (missing required fields → IsError)
      - upstream error (gateway 4xx → tool returns IsError)
    Plus TestListTools verifies all 5 tools register; TestGatewayUnreachable
    verifies network-level failures surface as IsError, not panics.

Setup for Claude Desktop / Code documented in README:
  {
    "mcpServers": {
      "lakehouse": {
        "command": "/path/to/bin/mcpd",
        "args": ["--gateway", "http://127.0.0.1:3110"]
      }
    }
  }

Verified:
  go test -count=1 ./cmd/mcpd/  — 13/13 green
  just verify                    — vet + test + 9 smokes 35s

Out of scope for this commit:
  - Resources (Server.AddResource): not needed yet; tools cover the
    interactive surface. Add when an LLM-side use case shows up.
  - Prompts (Server.AddPrompt): same.
  - Streamable transports (HTTP, SSE): stdio is the universal one;
    streamable can be added by serving mcp.NewStreamableHTTPHandler
    over net/http in place of srv.Run on stdio if a daemon-mode
    deploy makes sense.
  - mcpd inside the daemon-supervised stack: it's stdio-only and
    spawned by the MCP client, not run as a service. Adding a
    daemon-mode (HTTP transport on a port) is a follow-up if MCP
    consumers want long-lived sessions.

This is a tool-surface only port. The Bun mcp-server.ts also serves
HTTP demo routes (/api/catalog/datasets, /intelligence/*, /headshots/*)
that depend on the matrix-indexer signals from the Rust system; those
stay Bun until G5 cutover when the staffing co-pilot service ports
to Go.

Direct deps added:
  github.com/modelcontextprotocol/go-sdk v1.5.0

Transitive (resolved by go mod tidy):
  github.com/google/jsonschema-go        v0.4.2
  github.com/yosida95/uritemplate/v3     v3.0.2
  golang.org/x/oauth2                    v0.35.0
  github.com/segmentio/encoding          v0.5.4
  github.com/golang-jwt/jwt/v5           v5.3.1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 07:00:38 -05:00

# golangLAKEHOUSE
Go reimplementation of the Lakehouse — a versioned knowledge
substrate for staffing analytics + local AI workloads.
## Status
**Phase G0 complete + G1/G1P/G2 shipped.** Eight binaries in total:
the five-service G0 skeleton (gateway, storaged, catalogd, ingestd,
queryd), plus vectord, embedd, and the stdio-only mcpd. The HTTP
services sit behind a single gateway. Acceptance smokes green for
D1-D6 + G1 + G1P + G2.
End-to-end staffing co-pilot pipeline functional through the
gateway:
```
text → /v1/embed → /v1/vectors/index/<name>/add
text → /v1/embed → /v1/vectors/index/<name>/search → top-K hits
```
Plus the SQL path:
```
CSV → /v1/ingest (parses, writes Parquet via storaged, registers
      manifest with catalogd)
SQL → /v1/sql (DuckDB over the registered Parquets via httpfs)
```
See `docs/PHASE_G0_KICKOFF.md` for the day-by-day record (D1-D6 +
real-scale validation + G1/G1P/G2 pointer at the bottom).
## Service inventory
| Bin | Port | Role |
|---|---|---|
| `gateway` | 3110 | Reverse proxy fronting all backing services |
| `storaged` | 3211 | Object I/O over S3 (MinIO in dev) |
| `catalogd` | 3212 | Parquet manifest registry, ADR-020 idempotency |
| `ingestd` | 3213 | CSV → Parquet → register loop |
| `queryd` | 3214 | DuckDB SELECT over registered Parquets via httpfs |
| `vectord` | 3215 | HNSW vector search (+ optional persistence to storaged) |
| `embedd` | 3216 | Text → vector via Ollama (default `nomic-embed-text` 768-d) |
| `mcpd` | stdio | Model Context Protocol server (Claude Desktop / Code consumers) |
## MCP server
`bin/mcpd` exposes Lakehouse capabilities as MCP tools over stdio:
`list_datasets`, `get_manifest`, `query_sql`, `embed_text`, `search_vectors`.
All tools proxy to the gateway, so the gateway must be up first.
Wire into Claude Desktop / Claude Code by adding to the MCP config:
```json
{
  "mcpServers": {
    "lakehouse": {
      "command": "/path/to/golangLAKEHOUSE/bin/mcpd",
      "args": ["--gateway", "http://127.0.0.1:3110"]
    }
  }
}
```
Replaces the Bun `mcp-server.ts` MCP-tool surface from the Rust system.
HTTP demo routes (the staffing co-pilot UI) stay Bun until G5.
## Acceptance smokes
```
scripts/d1_smoke.sh # 5-binary skeleton + chi /health + gateway proxy probes
scripts/d2_smoke.sh # storaged GET/PUT/LIST/DELETE + 256 MiB cap + concurrency cap
scripts/d3_smoke.sh # catalogd register/manifest/list + rehydrate-across-restart
scripts/d4_smoke.sh # ingestd CSV → Parquet round-trip + schema-drift 409
scripts/d5_smoke.sh # queryd DuckDB SELECT through httpfs over MinIO
scripts/d6_smoke.sh # full ingest → query through gateway only
scripts/g1_smoke.sh # vectord HNSW recall + dim mismatch + duplicate-create 409
scripts/g1p_smoke.sh # vectord state survives kill+restart via storaged
scripts/g2_smoke.sh # embed → vectord add → search round-trip
```
Or run the full gate via the task runner (see below):
```
just verify # vet + tests + 9 smokes; ~33s wall
```
## Task runner
```
just # show available recipes
just verify # full Sprint 0 gate (vet + tests + 9 smokes)
just smoke <day> # single smoke (d1..d6, g1, g1p, g2)
just doctor # check cold-start deps; --json for CI
just install-hooks # install pre-push hook that runs just verify
```
After a fresh clone, run `just install-hooks` once so `git push` is
gated on the same green chain that ran here. Hook lives in
`.git/hooks/pre-push` (not tracked; recreated by the recipe).
## Cold-start dependencies
- Go 1.25+ at `/usr/local/go/bin` (arrow-go pulled the 1.25 floor)
- `gcc` + `libc-dev` for the DuckDB cgo binding (ADR-001 §1.1)
- `just` task runner (`apt install just` on Debian 13+)
- MinIO running on `:9000` with bucket `lakehouse-go-primary`
- Ollama running on `:11434` with `nomic-embed-text` loaded (G2)
- `/etc/lakehouse/secrets-go.toml` with `[s3.primary]` credentials
  (storaged + queryd both read this)

`just doctor` probes all of the above and reports the fix command
for each missing dep. CI / scripts can use `just doctor --json`.
## Layout
```
docs/      Direction + spec + ADRs + day-by-day
cmd/       One main package per binary
internal/  Shared packages: storeclient, catalogclient,
           secrets, shared, embed, gateway, plus
           per-service implementation packages
scripts/   Smokes + ancillary tooling
```
## Reading order
1. `docs/PRD.md` — what we're building and why
2. `docs/SPEC.md` — how, per-component
3. `docs/DECISIONS.md` — ADRs (ADR-001 foundational)
4. `docs/PHASE_G0_KICKOFF.md` — day-by-day from D1 through G2
5. `docs/RUST_PATHWAY_MEMORY_NOTE.md` — historical reference for the
   Rust era's pathway memory (not migrated, by ADR-001 #5)
## Predecessor
The Rust Lakehouse this rewrite supersedes lives at
`git.agentview.dev/profit/lakehouse`. It remains the live system
serving `devop.live/lakehouse/` until this Go implementation reaches
feature parity per `docs/SPEC.md` §7. Then Rust enters
maintenance-only mode.