1 change: 1 addition & 0 deletions Cargo.lock


1 change: 1 addition & 0 deletions Cargo.toml
@@ -38,6 +38,7 @@ uuid = { version = "1", features = ["v4"] }
chrondb = "0.2.3-dev.12298e3"
chrono = { version = "0.4", features = ["serde"] }
shell-words = "1"
socket2 = { version = "0.6", features = ["all"] }

[dev-dependencies]
tempfile = "3"
6 changes: 4 additions & 2 deletions README.md
@@ -219,13 +219,15 @@ This works with Claude Code, Cursor, Windsurf, or any MCP-compatible client. Too

Every MCP client (Claude Code, Cursor, Windsurf) spawns all backend servers at startup and keeps them alive forever. With 10 servers and 3 sessions open, that's 30 idle processes eating ~3 GB of RAM.

The `mcp` proxy fixes this with **persistent tool cache**, **lazy initialization**, **shared backends across clients**, and **adaptive idle shutdown**:

- **Instant startup** — tools are cached to disk and served immediately, even before backends connect
- **One backend = one process, no matter how many clients** — 5 editor sessions hitting `slack` share a single `slack-mcp-server` child. Calls run in parallel via JSON-RPC id multiplexing on stdio.
- Backends only connect when you actually use them (background refresh keeps the cache fresh)
- Idle backends are shut down automatically (1-5 min based on usage frequency), with a warm-up grace period so a brand-new backend isn't reaped before its first use
- Tools stay visible — reconnection is transparent on next call
- Cache invalidates automatically when backend config changes
- Zero orphans: every spawned child is reaped on shutdown, panic, cancel, or stalled close (`kill_on_drop` everywhere)

```json
{
73 changes: 65 additions & 8 deletions docs/guides/proxy-mode.md
@@ -43,6 +43,53 @@ graph LR
2. Client calls `tools/list` — the proxy returns tools instantly from persistent cache (if available), while refreshing from real backends in the background. On first run, it connects to all backends to discover their tool lists.
3. Client calls `tools/call` — the proxy reconnects the target backend on demand (if it was shut down), routes the request, and tracks usage for adaptive timeout

## Concurrency model

The proxy is built to be the **orchestration layer for N concurrent clients sharing the same set of backend processes**. This is the whole point of the project — without it, every editor session and every chat window spawns its own copy of every MCP server, multiplying RAM, CPU, and API rate-limit cost by the number of clients.

The guarantees the proxy makes:

- **One backend = one OS process**, regardless of how many clients are connected. 5 clients hitting `slack` share a single `slack-mcp-server` child.
- **Calls to different backends run in parallel.** Two clients calling `sentry__search_issues` and `github__list_repos` at the same time do not block each other.
- **Calls to the same backend also run in parallel.** The stdio transport multiplexes JSON-RPC requests on a single pipe via id matching, so 5 clients all hitting `slack__conversations_replies` simultaneously fan out and back through the same process — none of them serializes on the others.
- **A slow or hung backend only delays the requests targeting it.** The rest of the proxy keeps moving.
- **A dead client only loses its own request.** TCP keepalive (30s/10s) on the HTTP listener detects half-open sockets from crashed editors within ~60s, and `MCP_PROXY_REQUEST_TIMEOUT` (default 120s) is a final hard bound at the request boundary.
- **No orphan backends.** Every spawned child is registered with `kill_on_drop`, so a panicked task, a cancelled request, or a stalled graceful shutdown all converge on the same outcome: the child gets reaped.

You can verify the orchestration on a running proxy with `/health`:

```bash
curl -s http://127.0.0.1:7332/health | jq
```

```json
{
"status": "ok",
"backends_configured": 9,
"backends_connected": 9,
"active_clients": 5,
"tools": 213,
"version": "0.4.3"
}
```

`backends_connected` should never grow with `active_clients` — that's the win.

```mermaid
graph LR
C1["claude code #1"] --> P
C2["claude code #2"] --> P
C3["opencode"] --> P
C4["cursor"] --> P
C5["windsurf"] --> P
P["mcp serve --http<br/>(orchestrator)"] --> Slack["slack-mcp<br/>(1 process)"]
P --> Sentry["sentry-mcp<br/>(1 process)"]
P --> GH["github-mcp<br/>(1 process)"]
P --> N["...N backends"]

style P fill:#4a9,color:#fff
```

## Tool namespacing

Tools are prefixed with the server name using double underscore (`__`) as separator:
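As a sketch of the scheme (this assumes the first `__` is the separator, and the helper name is illustrative, not the crate's actual API):

```rust
/// Split a namespaced tool name like "slack__conversations_replies" into
/// (server, tool). Illustrative helper; the real parsing code may differ.
fn split_namespaced(name: &str) -> Option<(&str, &str)> {
    // split_once stops at the first "__", so a tool name that itself
    // contains "__" survives intact in the right-hand half.
    name.split_once("__")
}
```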
@@ -69,15 +116,22 @@ Diagnostics go to stderr:

```
[serve] discovering tools from sentry...
[serve] sentry: 8 tool(s)
[serve] discovering tools from slack...
[serve] slack: 12 tool(s)
[serve] ready — 2 backend(s), 20 tool(s)
[serve] shutting down idle backend: sentry (idle 74s, 1 reqs)
[serve] finalizing shutdown for sentry
[serve] connecting to sentry...
[serve] sentry: 8 tool(s) (reconnected)
```

A few things to note about reaping:

- A backend is **never reaped before its first request** until `max_idle_timeout` has elapsed since connect (warm-up grace). You won't see "shutting down idle backend: X (... 0 reqs)" inside the first few minutes after start anymore — the proxy waits for X to actually be used before considering it idle.
- When the reaper does fire, all eligible backends shut down **in parallel** — 8 idle backends take ~5s, not 8 × 5s.
- If a backend's graceful shutdown stalls past 5s, you'll see `shutdown timed out — force-killed via drop`. The child is guaranteed reaped via `kill_on_drop`; the proxy never leaks orphan processes.

## HTTP mode

Expose the proxy as an HTTP server so multiple developers can share a single MCP endpoint:
@@ -148,18 +202,21 @@ JSON-RPC responses are delivered as `message` events on the SSE stream. The conn

### GET /health

Returns the proxy status, including backend pool size and live client sessions:

```json
{
  "status": "ok",
  "backends_configured": 9,
  "backends_connected": 9,
  "active_clients": 5,
  "tools": 213,
  "version": "0.4.3"
}
```

`active_clients` is the number of SSE sessions currently registered. Combined with `backends_connected`, this is the metric that proves the proxy is doing its job: **N clients sharing M backends, not N × M processes**. If you ever see `backends_connected` grow proportionally to `active_clients`, something is wrong (clients should be hitting the proxy, not bypassing it to spawn their own backends).

### Graceful shutdown

The HTTP server shuts down cleanly on `SIGTERM` or `SIGINT` (Ctrl+C). It stops accepting new connections, finishes in-flight requests, and disconnects all backends.
15 changes: 15 additions & 0 deletions docs/mcp-servers-are-draining-your-hardware.md
@@ -134,6 +134,21 @@ All sessions share one proxy. The proxy manages backend lifecycles. You can conf

Full configuration reference: [idle timeout options](https://mcp.avelino.run/reference/config-file#idle-timeout). Proxy mode setup: [proxy mode guide](https://mcp.avelino.run/guides/proxy-mode).

### Update (April 2026): N clients, M backends

The original `mcp serve` only solved half the problem — it stopped *one* client from spawning duplicate backends, but if you had multiple editors connecting at the same time, the proxy itself could serialize them or, worse, accumulate orphan processes when a client died. After [#51](https://github.com/avelino/mcp/issues/51) the proxy is now an actual orchestrator: a single backend process is shared across **every connected client**, requests run in parallel through the stdio multiplexer, and dead clients can never leak backend children. The numbers from a real run with 5 editor sessions and 9 backends:

```json
{
"backends_configured": 9,
"backends_connected": 9,
"active_clients": 5,
"tools": 213
}
```

9 processes serving 5 clients, not 45. That's the full version of the win this post described.

## What should change in the ecosystem

This isn't just a `mcp` CLI problem. Every MCP client should implement some form of lazy lifecycle management:
47 changes: 39 additions & 8 deletions docs/reference/architecture.md
@@ -29,20 +29,24 @@ The whole thing is a single async pipeline. No daemon, no background process, no
The most important design decision is the `Transport` trait:

```rust
trait Transport: Send + Sync {
    async fn request(&self, msg: &JsonRpcRequest) -> Result<JsonRpcResponse>;
    async fn notify(&self, msg: &JsonRpcNotification) -> Result<()>;
    async fn close(&self) -> Result<()>;
}
```

Note the `&self` (not `&mut self`) and the `Sync` bound. This is what makes the proxy non-blocking under load: a single transport instance can be shared across many concurrent tasks via `Arc<dyn Transport>`, and each implementation uses interior mutability (channels, atomics, mutexes) for the small amount of state it needs to mutate. There is no global lock around the client.

Three implementations:

**StdioTransport** — Spawns a child process and runs it as a multiplexed pipe. A dedicated **writer task** owns the child's stdin and serializes outbound writes. A dedicated **reader task** consumes the child's stdout line-by-line and dispatches each response to its caller via a `oneshot` channel keyed by JSON-RPC `id`. The result: **multiple in-flight requests can run concurrently on the same backend process** — callers only block waiting for their own response. The child is spawned with `kill_on_drop(true)` so it is reaped on any cleanup path (graceful shutdown, panic, task abort, error). On `close()` the child gets a brief grace period and is then force-killed.
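The reader/writer split can be sketched with plain std threads and channels standing in for tokio tasks and the real child process (all names here are illustrative, not the actual source):

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Sketch of JSON-RPC id multiplexing on a single pipe. A "backend" thread
// plays the child process on stdio; the reader thread dispatches each
// response to its caller by id.
struct Multiplexer {
    to_backend: mpsc::Sender<(u64, String)>,
    pending: Arc<Mutex<HashMap<u64, mpsc::Sender<String>>>>,
}

impl Multiplexer {
    fn new() -> Self {
        let (to_backend, backend_rx) = mpsc::channel::<(u64, String)>();
        let (from_backend_tx, from_backend_rx) = mpsc::channel::<(u64, String)>();
        // "Backend": echoes each request back tagged with its id.
        thread::spawn(move || {
            for (id, req) in backend_rx {
                from_backend_tx.send((id, format!("reply to {req}"))).ok();
            }
        });
        let pending: Arc<Mutex<HashMap<u64, mpsc::Sender<String>>>> =
            Arc::new(Mutex::new(HashMap::new()));
        // Reader: route each response to the caller registered under its id.
        let p = Arc::clone(&pending);
        thread::spawn(move || {
            for (id, resp) in from_backend_rx {
                if let Some(tx) = p.lock().unwrap().remove(&id) {
                    tx.send(resp).ok();
                }
            }
        });
        Multiplexer { to_backend, pending }
    }

    // Each caller blocks only on its own per-request channel, so many
    // requests can be in flight on the same backend at once.
    fn request(&self, id: u64, body: &str) -> String {
        let (tx, rx) = mpsc::channel();
        self.pending.lock().unwrap().insert(id, tx);
        self.to_backend.send((id, body.to_string())).unwrap();
        rx.recv().unwrap()
    }
}
```

Two requests with different ids never wait on each other; that is the property the real transport relies on.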

**HttpTransport** — Sends HTTP POST requests with JSON-RPC bodies. Handles SSE (Server-Sent Events) responses by extracting the last `data:` line. Manages session IDs via `Mcp-Session-Id` headers. On 401 responses, triggers the authentication flow and retries once. Mutable state (`session_id`, `bearer_token`, `headers` after a 401) lives behind small `Mutex`es; `reqwest::Client` is already `Send + Sync`, so concurrent requests fan out at the HTTP layer.

**CliTransport** — Wraps any command-line tool as an MCP server (see [CLI as MCP](../guides/cli-as-mcp.md)). Discovery state lives behind an `RwLock` with double-checked locking, and each tool invocation spawns a fresh `Command` with `kill_on_drop(true)` so cancellation reaps the child instead of leaking it.

`McpClient` wraps the transport in `Arc<dyn Transport>` and uses an `AtomicU64` for request id generation, so a single `Arc<McpClient>` is safe to share across any number of tasks. Adding a new transport (WebSocket, for example) means implementing the three trait methods — nothing else changes.
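The id-generation piece can be sketched like this (struct and field names are assumptions, not the actual source):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Lock-free request-id generation: fetch_add hands out unique JSON-RPC ids
// with no mutex, so a shared client never serializes callers just to
// number their requests. Names are illustrative.
struct McpClient {
    counter: AtomicU64,
}

impl McpClient {
    fn next_id(&self) -> u64 {
        // fetch_add returns the previous value, so every caller gets a
        // distinct id even under concurrent use.
        self.counter.fetch_add(1, Ordering::Relaxed)
    }
}
```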

## Authentication: layered fallbacks

@@ -128,12 +132,39 @@ Cache invalidation is per-backend via SHA-256 hash of the raw config JSON. If a

Each backend tracks usage statistics: request count, first/last use timestamps, and an exponential moving average (EMA) of inter-request intervals. A background reaper task runs every 30 seconds and shuts down backends that exceed their idle timeout. The timeout is adaptive by default — frequently used backends (>20 req/h) get 5 minutes, moderately used (5-20 req/h) get 3 minutes, and rarely used (<5 req/h) get 1 minute. Users can override this per backend with fixed timeouts or `"never"`.

A **warm-up grace period** protects freshly-connected backends: a backend with `request_count == 0` is never reaped before its `max_idle_timeout` elapses, so the proxy doesn't kill a backend you haven't gotten around to using yet. Without this, the proxy would reap idle backends ~60 seconds after start and the very first real `tools/call` would always pay a full reconnect.
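A sketch of the reaping policy as described (thresholds taken from the prose above; struct and function names are illustrative, not the actual source):

```rust
use std::time::Duration;

// Adaptive idle timeout plus warm-up grace. >20 req/h gets 5 min,
// 5-20 req/h gets 3 min, <5 req/h gets 1 min; a backend with zero
// requests is only eligible once max_idle_timeout has elapsed.
struct UsageStats {
    request_count: u64,
    requests_per_hour: f64,
    idle: Duration,              // time since last use (or since connect)
    max_idle_timeout: Duration,
}

fn idle_timeout(stats: &UsageStats) -> Duration {
    if stats.requests_per_hour > 20.0 {
        Duration::from_secs(5 * 60)
    } else if stats.requests_per_hour >= 5.0 {
        Duration::from_secs(3 * 60)
    } else {
        Duration::from_secs(60)
    }
}

fn should_reap(stats: &UsageStats) -> bool {
    if stats.request_count == 0 {
        // Warm-up grace: never reap an unused backend early.
        return stats.idle >= stats.max_idle_timeout;
    }
    stats.idle >= idle_timeout(stats)
}
```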

When the reaper does fire, it shuts down all eligible backends **in parallel** via a `tokio::task::JoinSet`. If a backend's graceful `shutdown()` doesn't finish within 5 seconds, the reaper drops the `Arc<McpClient>` and `kill_on_drop(true)` force-reaps the child — orphaned backend processes are not possible by construction.

When a backend is shut down, its tools remain in the tool list (cached in memory and on disk). On the next `tools/call` targeting that backend, the proxy transparently reconnects, refreshes the tool cache, and forwards the request. Usage stats are preserved across reconnections for adaptive timeout continuity.

The proxy reuses the same `McpClient` and `Transport` abstractions — no new protocol code was needed. It just listens on stdin instead of connecting to a server's stdin.

Error handling is partial-availability: if one backend fails to connect, the others still work. If a backend dies mid-session, the proxy returns an MCP-level error for that tool call without crashing.

### Concurrency model

The proxy is the orchestrator for **N concurrent clients sharing the same set of backends**. The whole pipeline is built so that no single client, request, or backend can wedge any of the others.

Backends are pooled by name in a `HashMap<String, BackendState>` inside `ProxyServer`, and each connected backend is held as `Arc<McpClient>`. A request flows through `dispatch_request` in three carefully scoped phases:

1. **Resolve (under a brief proxy lock)** — look up the namespaced tool in `tool_map`, run the ACL check, and clone the `Arc<McpClient>` out of `BackendState::Connected`. The lock is released before any I/O.
2. **Connect (without the proxy lock)** — if no client exists yet, `connect_backend()` spawns the child, runs the MCP handshake and `tools/list`, and only then briefly re-acquires the lock to install the new client (deduplicating against any concurrent connector).
3. **Invoke (without the proxy lock)** — `client.call_tool().await` runs entirely outside the proxy lock. Because `McpClient` and `Transport` are `&self`, the same `Arc<McpClient>` is invoked in parallel by every concurrent caller; the stdio multiplexer described above handles fan-in/fan-out by id.
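A minimal sketch of the resolve-then-invoke split, with a stand-in `Client` in place of the real `McpClient` (names are illustrative, and the connect phase is elided):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// The lock is held only long enough to clone an Arc out of the pool;
// the actual call runs with the lock released.
struct Client {
    name: String,
}

impl Client {
    // &self, as in the real Transport trait, so one Arc<Client> can be
    // invoked by many callers at once.
    fn call_tool(&self, tool: &str) -> String {
        format!("{}::{}", self.name, tool)
    }
}

struct Proxy {
    backends: Mutex<HashMap<String, Arc<Client>>>,
}

impl Proxy {
    fn dispatch(&self, namespaced: &str) -> Option<String> {
        let (backend, tool) = namespaced.split_once("__")?;
        // Phase 1: resolve under a brief lock, clone the Arc, drop the lock.
        let client = {
            let map = self.backends.lock().unwrap();
            Arc::clone(map.get(backend)?)
        };
        // Phase 3: invoke entirely outside the proxy lock.
        Some(client.call_tool(tool))
    }
}
```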

Discovery — the act of connecting to a previously-unseen backend and listing its tools — used to run **under** the proxy lock, which meant a single slow backend (e.g. a 30-second OAuth handshake) could wedge every other client until it returned. That is fixed by a separate `discovery_lock: Arc<Mutex<()>>` on `ProxyServer`. Discovery batches now snapshot the pending set under a brief lock, drop the proxy lock, run all the connect attempts in parallel **without** holding the proxy mutex, and only re-acquire the lock briefly to commit each result. Two callers that both want to discover are serialized on the discovery lock (so they don't double-spawn), but request handlers targeting already-discovered backends fly through with zero contention while a discovery batch is in progress.

The HTTP+SSE legacy transport has its own backpressure trap: each client session is fed by a bounded `mpsc` channel, and a slow consumer can fill the buffer. The POST handler bounds its `tx.send(...)` with a 5s timeout — on failure or timeout, the session is **evicted** from the session map and the client is expected to reconnect. The SSE keepalive ping background task uses `try_send` instead of `send().await` so a momentarily-full buffer never blocks it; after ~1 minute of consecutive full-buffer pings the session is also evicted as wedged.
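The keepalive rule can be sketched with a bounded std channel in place of the tokio `mpsc` (the failure bound and all names are illustrative):

```rust
use std::sync::mpsc::{sync_channel, Receiver, TrySendError};

// Pings use try_send so a full per-session buffer never blocks the ping
// task; too many consecutive full-buffer pings mark the session wedged.
const MAX_FAILED_PINGS: u32 = 4;

/// Returns (evicted, receiver) after `pings` keepalive attempts against a
/// session buffer of `capacity` that nobody is draining.
fn ping_session(capacity: usize, pings: u32) -> (bool, Receiver<&'static str>) {
    let (tx, rx) = sync_channel::<&'static str>(capacity);
    let mut failed = 0;
    for _ in 0..pings {
        match tx.try_send("ping") {
            Ok(()) => failed = 0,
            Err(TrySendError::Full(_)) => {
                failed += 1;
                if failed >= MAX_FAILED_PINGS {
                    return (true, rx); // evict instead of blocking
                }
            }
            Err(TrySendError::Disconnected(_)) => return (true, rx),
        }
    }
    (false, rx)
}
```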

Practical consequences:

- Calls to **different** backends are fully parallel.
- Calls to the **same** backend are also parallel — they fan out through one shared process via the stdio multiplexer (or through `reqwest`'s native concurrency for HTTP backends). One backend = one OS process, regardless of how many clients are connected.
- A slow or hung backend only delays the requests targeting it. Other clients keep moving.
- A slow discovery (e.g. an unreachable backend hitting its 30s timeout) blocks only other callers that also need discovery. Already-discovered backends keep serving requests normally.
- A dead client only loses its own request. The HTTP listener is bound with TCP keepalive (30s idle / 10s interval) so half-open sockets from crashed clients are detected within ~60s, and `MCP_PROXY_REQUEST_TIMEOUT` (default 120s) is a final hard bound at the proxy boundary.
- A client request that is cancelled mid-flight cleans up after itself: the future is dropped, any spawned child process is reaped via `kill_on_drop`, and the backend's pending-request map is cleared by the reader task on EOF.

### Server-side authentication

The proxy supports an optional authentication layer for HTTP mode, designed to be transport-independent:
9 changes: 9 additions & 0 deletions docs/reference/environment-variables.md
@@ -8,6 +8,7 @@ These variables configure `mcp` behavior:
|---|---|---|
| `MCP_CONFIG_PATH` | `~/.config/mcp/servers.json` | Path to the config file |
| `MCP_TIMEOUT` | `60` | Timeout in seconds for stdio server responses |
| `MCP_PROXY_REQUEST_TIMEOUT` | `120` | (proxy mode) Hard upper bound, in seconds, that any single client request can spend inside `mcp serve` before the proxy returns a JSON-RPC error. Acts as a belt-and-suspenders boundary on top of the per-transport `MCP_TIMEOUT`. |

### `MCP_CONFIG_PATH`

@@ -27,6 +28,14 @@ MCP_TIMEOUT=120 mcp slack --list

Does not affect HTTP servers (they use reqwest's default timeouts).

### `MCP_PROXY_REQUEST_TIMEOUT`

Only applies to `mcp serve`. Bounds how long the proxy will wait for any single client JSON-RPC request to complete end-to-end (auth + routing + backend I/O). If the bound is hit, the client receives a JSON-RPC error with code `-32000` and the in-flight request is dropped — other concurrent clients are unaffected. Set lower for tighter SLAs, higher for backends that legitimately take a long time.

```bash
MCP_PROXY_REQUEST_TIMEOUT=60 mcp serve --http :7332
```
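As a sketch of the boundary semantics (std threads and channels standing in for the real tokio timeout wrapper; names are illustrative):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run the handler on its own task and give up after the bound with a
// JSON-RPC error code (-32000). Only the timed-out request is dropped;
// other clients are unaffected.
fn call_with_timeout<F>(handler: F, bound: Duration) -> Result<String, i64>
where
    F: FnOnce() -> String + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the caller already gave up, the send just fails harmlessly.
        tx.send(handler()).ok();
    });
    rx.recv_timeout(bound).map_err(|_| -32000)
}
```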

## Config variables

Environment variables referenced in `servers.json` with `${VAR_NAME}` syntax. These are user-defined and depend on which servers you've configured.
1 change: 1 addition & 0 deletions src/cache.rs
@@ -14,6 +14,7 @@ pub struct BackendToolCache {
pub cached_at: String,
}

#[derive(Clone)]
pub struct ToolCacheStore {
pool: Arc<DbPool>,
}
4 changes: 3 additions & 1 deletion src/cli_discovery.rs
@@ -132,8 +132,10 @@ async fn run_help(
        .and_then(|v| v.parse().ok())
        .unwrap_or(30);

    // kill_on_drop ensures a help-probe child is reaped if the timeout fires
    // or the discovery task is cancelled — no orphans from this path.
    let mut cmd = Command::new(command);
    cmd.args(args).arg(help_flag).envs(env).kill_on_drop(true);

    let output = timeout(Duration::from_secs(timeout_secs), cmd.output())
        .await