The Model Context Protocol started as a stdio pipe between Claude Desktop and local Python scripts. Now it’s moving to HTTP, and that shift surfaces every operational problem you’d expect when a local-only RPC protocol is put on the network.
FastMCP is one of the first frameworks to handle HTTP transport for MCP servers with streaming semantics. The implementation choices matter because they define how resources get cleaned up when streams break, how you prevent resource exhaustion, and whether your deployment can scale horizontally.
Why HTTP Changes Everything
Stdio MCP servers run in the same trust boundary as the client. The client spawns the server process, pipes JSON-RPC over stdin/stdout, and kills the process when done. Authentication is implicit (same user context), resource limits are inherited from the parent process, and cleanup happens when the pipe closes.
HTTP breaks all of that:
- Authentication is explicit. Every request needs credentials. You can’t rely on process ownership.
- Resource cleanup is manual. A broken TCP connection doesn’t kill the server process. You need timeouts and cancellation tokens.
- Rate limiting is required. A single client can open hundreds of connections. Stdio gave you one pipe per client.
- Observability is harder. Stdio logs go to the parent process. HTTP servers need structured logging, trace IDs, and correlation across chunked responses.
FastMCP addresses these by layering HTTP semantics on top of the MCP JSON-RPC protocol, but the design choices create new failure modes.
Transport Architecture
FastMCP supports two transport modes: stdio for local development and HTTP with Server-Sent Events (SSE) for production deployments. The HTTP implementation is built on Starlette, the ASGI toolkit that FastAPI itself builds on, which means you get ASGI compatibility and can deploy with Uvicorn, Hypercorn, or any ASGI-compliant server.
The protocol flow looks like this:
- Client opens an HTTP connection to the MCP server endpoint.
- Client sends a JSON-RPC 2.0 request (tool invocation, resource fetch, or prompt template request).
- Server validates the request, invokes the handler, and streams the response as SSE chunks.
- Client accumulates chunks until it receives the completion sentinel or an error.
Each SSE chunk is a JSON object with a type field (data, error, or done) and a payload. The client library reassembles these into the final response.
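To make the client side of that flow concrete, here is a minimal sketch of chunk reassembly. It assumes the chunk shape described above (a JSON object with type and payload fields, carried in SSE data: lines). The names reassemble and StreamError are hypothetical, and a real client library handles more of the SSE format (event IDs, multi-line data, reconnection).

```python
import json
from typing import Iterable


class StreamError(Exception):
    """Raised when the stream reports an error or ends without a sentinel."""


def reassemble(lines: Iterable[str]) -> list:
    """Collect payloads from SSE `data:` lines until the `done` chunk."""
    payloads = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        chunk = json.loads(line[len("data:"):])
        kind = chunk.get("type")
        if kind == "data":
            payloads.append(chunk.get("payload"))
        elif kind == "error":
            raise StreamError(str(chunk.get("payload")))
        elif kind == "done":
            return payloads
    # The connection closed before the completion sentinel arrived.
    raise StreamError("stream ended before completion sentinel")


example = [
    'data: {"type": "data", "payload": "row 1"}',
    "",
    'data: {"type": "data", "payload": "row 2"}',
    "",
    'data: {"type": "done", "payload": null}',
]
print(reassemble(example))
```

Treating a stream that ends without the done chunk as an error, rather than returning what arrived, keeps a truncated response from being mistaken for a complete one.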
Streaming Semantics and Partial Failures
MCP tools can return streaming responses. A tool that searches a large dataset might yield results incrementally instead of buffering everything in memory. FastMCP implements this with Server-Sent Events over HTTP.
The client opens a long-lived HTTP connection, sends a tool invocation request, and receives a stream of JSON chunks. Each chunk is a partial result. The stream ends with a sentinel value or an error.
What happens when the stream breaks mid-flight?
If the client closes the connection (network failure, timeout, user cancellation), the server’s SSE handler receives a CancelledError. FastMCP propagates this to the tool function. In an async tool the exception surfaces at the next await, so a tool that catches it (or wraps its work in try/finally or async with) can clean up resources (close database connections, delete temp files, release locks). A tool that blocks without awaiting never sees the cancellation and keeps running until it finishes or hits a timeout.
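import asyncio
from collections.abc import AsyncIterator

from fastmcp import FastMCP

mcp = FastMCP("example")

# db_pool is assumed to be an async connection pool created at startup.

@mcp.tool()
async def long_running_search(query: str) -> AsyncIterator[str]:
    async with db_pool.acquire() as conn:
        try:
            async for row in conn.stream(query):
                yield row
        except asyncio.CancelledError:
            # Client disconnected: roll back, then re-raise so the
            # cancellation keeps propagating.
            await conn.rollback()
            raise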
If the server crashes mid-stream, the client sees the connection drop and the stream end without a completion sentinel. The MCP spec doesn’t define retry semantics. The client can retry the entire tool call, but if the tool has side effects (wrote to a database, sent an email), you get duplicate operations. Idempotency is your problem.
Resource Exhaustion and Rate Limiting
A malicious or buggy agent can exhaust server resources by:
- Opening many concurrent HTTP connections and never closing them.
- Invoking expensive tools in a tight loop.
- Requesting large streaming responses and abandoning them mid-stream.
FastMCP doesn’t include rate limiting. You add it with middleware or a reverse proxy. For connection limits, you configure the ASGI server:
uvicorn main:app --limit-concurrency 100 --timeout-keep-alive 30
This caps concurrent connections at 100 and closes idle keep-alive connections after 30 seconds. If an agent opens a 101st connection, Uvicorn does not queue it; it rejects the request with an HTTP 503 response until a slot frees up.
For tool-level limits, you add semaphores:
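import asyncio

# Allow at most 10 database queries in flight across all callers.
db_semaphore = asyncio.Semaphore(10)

@mcp.tool()
async def query_database(query: str) -> str:
    # `mcp` and `db` are assumed to be defined elsewhere in the server module.
    async with db_semaphore:
        return await db.execute(query)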
This limits concurrent database queries to 10. If 11 agents call the tool simultaneously, the 11th waits.
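Semaphores bound concurrency but not request rate. For the rate limiting mentioned earlier in this section, a token bucket keyed by principal is a common choice. The following is a single-process sketch under stated assumptions: TokenBucket and allow_request are hypothetical names, and a multi-instance deployment would keep the counters in a shared store instead of a local dict.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TokenBucket:
    rate: float      # tokens added per second
    capacity: float  # maximum burst size
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self, now: Optional[float] = None) -> bool:
        """Consume one token if available; False means reject the call."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets: Dict[str, TokenBucket] = {}


def allow_request(principal: str, rate: float = 5.0, capacity: float = 10.0) -> bool:
    """Look up (or create) the bucket for an authenticated principal."""
    bucket = buckets.setdefault(principal, TokenBucket(rate, capacity))
    return bucket.allow()
```

A tool handler or middleware would call allow_request with the authenticated principal and return an error (for example HTTP 429) when it returns False.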
Authentication Patterns
FastMCP’s HTTP transport requires you to implement authentication. The framework provides hooks but no built-in auth mechanism. The Dev.to tutorial demonstrates two common patterns:
Bearer tokens. Each tool call includes an Authorization header. The server validates the token and maps it to a principal. No session state. This works for stateless tools but creates latency overhead when tools need to fetch user context on every call.
Session-based auth. The client authenticates once, receives a session cookie, and reuses it for subsequent tool calls. The server maintains session state in memory or a backing store (Redis, Postgres). This reduces per-request latency but requires session cleanup logic and complicates horizontal scaling.
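As a rough illustration of the bearer-token pattern, the sketch below maps an Authorization header to a principal. The TOKENS table and the principal_from_headers function are hypothetical, and the headers mapping is assumed to use lower-case keys. A production server would verify signed tokens or look them up in a secret store.

```python
import hmac
from typing import Dict, Mapping, Optional

# Hypothetical token table; real tokens would come from a secret store
# or be verified as signed tokens (for example JWTs).
TOKENS: Dict[str, str] = {"s3cr3t-token-for-alice": "alice"}


def principal_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the principal for a valid bearer token, or None if auth fails."""
    value = headers.get("authorization", "")
    scheme, _, presented = value.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return None
    for token, principal in TOKENS.items():
        # compare_digest avoids leaking the match position through timing.
        if hmac.compare_digest(presented.encode(), token.encode()):
            return principal
    return None


print(principal_from_headers({"authorization": "Bearer s3cr3t-token-for-alice"}))
```

Returning None for every failure case keeps the caller's logic simple: any request without a principal gets a 401, regardless of why validation failed.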
You implement authorization logic in your tool handlers. The framework won’t enforce permissions:
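@mcp.tool()
async def query_database(query: str, user_id: str) -> str:
    # Authorization happens here, not in the framework.
    # user_id should be derived from the authenticated request, never trusted
    # as an agent-supplied argument; can_user_run_query is your own check.
    if not await can_user_run_query(user_id, query):
        raise PermissionError("User cannot run this query")
    return await db.execute(query)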
If you skip the permission check, any authenticated user can call any tool.
Observability Across Chunked Responses
When a tool streams a response over SSE, you need to trace the entire operation: the initial HTTP request, each chunk, and the final completion or error. FastMCP doesn’t include tracing. You add it with OpenTelemetry:
from collections.abc import AsyncIterator

from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

tracer = trace.get_tracer(__name__)
# Assumes `app` is a FastAPI application that hosts the MCP server.
FastAPIInstrumentor.instrument_app(app)

@mcp.tool()
async def search(query: str) -> AsyncIterator[str]:
    # One span covers the whole invocation; each chunk becomes a span event.
    with tracer.start_as_current_span("search_tool") as span:
        span.set_attribute("query", query)
        async for result in perform_search(query):
            span.add_event("result_chunk", {"size": len(result)})
            yield result
This creates a trace span for the entire tool invocation and adds an event for each chunk. If the tool raises inside the block, the exception is recorded on the span, and the chunk events show how many results were sent before the failure.
You also need request IDs to correlate logs across services. Add middleware to generate or propagate them:
import uuid

from starlette.middleware.base import BaseHTTPMiddleware

class RequestIDMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Reuse the caller's ID when present so logs line up across services.
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        request.state.request_id = request_id
        response = await call_next(request)
        response.headers["X-Request-ID"] = request_id
        return response
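A request ID only helps if it reaches the log lines. One possible approach, sketched here with the standard library only, is to store the ID in a context variable and attach it to every record with a logging filter. The names request_id_var and RequestIDFilter are hypothetical, and the sketch assumes the middleware sets the variable before calling the handler.

```python
import contextvars
import io
import logging

# Holds the request ID for whichever request the current task is serving.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIDFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
)
handler.addFilter(RequestIDFilter())

logger = logging.getLogger("mcp.tools")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# In the middleware's dispatch method this would be request_id_var.set(request_id).
token = request_id_var.set("req-123")
logger.info("search started")
request_id_var.reset(token)

print(stream.getvalue(), end="")
```

Because context variables are scoped per task, concurrent requests served by the same process each see their own ID without passing it through every function call.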
Transport Comparison
| Aspect | stdio Transport | HTTP Transport |
|---|---|---|
| Authentication | Implicit (process owner) | Explicit (bearer token or session) |
| Resource cleanup | Automatic (pipe close kills process) | Manual (timeouts, cancellation tokens) |
| Rate limiting | One pipe per client | Requires middleware or reverse proxy |
| Observability | Logs to parent process | Structured logging, trace IDs required |
| Failure mode | Client crash kills server | Server keeps running, needs health checks |
| Multi-tenancy | One server per client | One server, many clients, auth required |
| Network exposure | Local only | Requires TLS, firewall rules, DDoS protection |
Tool Design Requirements
These transport differences change how you design the tools themselves. Before you deploy an HTTP MCP server, make sure your tools meet these requirements:
- Make tools idempotent. Retry logic will cause duplicate operations if tools have side effects. Add idempotency keys to prevent duplicate writes, charges, or emails.
- Implement cancellation checks. Tools should handle CancelledError at their await points and clean up resources (close connections, delete temp files, release locks).
- Add input validation. Don’t trust agent-provided arguments. Validate types, ranges, and permissions before executing operations.
- Set operation timeouts. Database queries, API calls, and file operations all need timeouts to prevent hung requests.
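Two of these requirements, input validation and operation timeouts, fit in a few lines of standard-library code. The sketch below is one way to do it. The helper names validate_limit and with_timeout, the ToolTimeout exception, and the limits used are all illustrative assumptions.

```python
import asyncio


class ToolTimeout(Exception):
    """Raised when a tool operation exceeds its time budget."""


def validate_limit(limit: int, maximum: int = 1000) -> int:
    """Reject agent-supplied values outside the range the tool supports."""
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise ValueError("limit must be an integer")
    if not 1 <= limit <= maximum:
        raise ValueError(f"limit must be between 1 and {maximum}")
    return limit


async def with_timeout(coro, seconds: float):
    """Await `coro`, cancelling it if it runs longer than `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        raise ToolTimeout(f"operation exceeded {seconds}s") from None


async def slow_query() -> str:
    await asyncio.sleep(1.0)  # stands in for a hung database call
    return "rows"


async def main() -> str:
    validate_limit(50)
    try:
        return await with_timeout(slow_query(), seconds=0.05)
    except ToolTimeout:
        return "timed out"


print(asyncio.run(main()))
```

asyncio.wait_for cancels the wrapped coroutine when the timeout fires, so the same cleanup paths that handle client disconnects also run on timeouts.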
Production Deployment Checklist
Before you expose an HTTP MCP server to production agents:
- Enable TLS. Use a reverse proxy (Caddy, nginx) to terminate TLS. Don’t run the ASGI server directly on the internet.
- Add authentication. Bearer tokens or session cookies. Validate on every request.
- Implement authorization. Check that the authenticated principal can invoke the requested tool with the provided arguments.
- Set connection limits. Cap concurrent connections and idle timeouts in the ASGI server config.
- Add rate limiting. Per-principal, not per-IP. Use Redis or a distributed rate limiter for multi-instance deployments.
- Configure timeouts. Tool invocations, HTTP requests, and database queries all need timeouts.
- Add observability. Structured logs, trace IDs, metrics for tool invocation latency and error rates.
- Handle cancellation. Propagate CancelledError to tools so they can clean up resources.
- Test partial failures. Kill the server mid-stream and verify the client handles it gracefully.
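The cancellation item can be checked in-process before any network test. The sketch below abandons an async generator mid-stream, the way a client disconnect would, and asserts that the cleanup path ran. The stream_results function is a stand-in for a real streaming tool, and this does not replace killing an actual server under load.

```python
import asyncio
from typing import AsyncIterator, List

events: List[str] = []


async def stream_results() -> AsyncIterator[str]:
    """Stand-in for a streaming tool that holds a resource while it yields."""
    events.append("acquired")
    try:
        for i in range(1000):
            await asyncio.sleep(0)  # an await point where cancellation can land
            yield f"row {i}"
    finally:
        events.append("released")  # runs on completion, error, or early close


async def consume_and_abandon() -> List[str]:
    """Read a few chunks, then drop the stream the way a disconnect would."""
    received = []
    stream = stream_results()
    async for row in stream:
        received.append(row)
        if len(received) == 3:
            break
    await stream.aclose()  # closing the generator triggers its finally block
    return received


rows = asyncio.run(consume_and_abandon())
assert rows == ["row 0", "row 1", "row 2"]
assert events == ["acquired", "released"]
```

If the second assertion fails for a real tool, the tool is leaking whatever it acquired each time a client disconnects.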
Technical Verdict
FastMCP’s HTTP transport buys multi-tenant scalability at the cost of operational complexity. Stdio is simpler but doesn’t scale beyond single-machine deployments.
Use FastMCP’s HTTP transport when:
- You need to expose MCP tools to agents running in different processes or on different machines.
- You’re building a multi-tenant agent platform where 10+ concurrent agents share the same tool server.
- Your deployment is latency-sensitive and you can implement per-principal rate limiting, so one noisy agent cannot degrade response times for the rest.
- Your tools are idempotent or you can add idempotency keys to prevent duplicate operations on retry.
Avoid it when:
- Your agent and tools run on the same machine and you don’t need network access. Stdio is simpler, faster, and eliminates auth overhead.
- You can’t implement proper authentication and rate limiting. HTTP without auth is a security hole. Without rate limiting, a single buggy agent can exhaust your server in seconds, resulting in 503 errors for all other clients and cascading failures across your agent fleet.
- Your tools have side effects and you can’t make them idempotent. Retry logic will cause duplicate operations. For example, an email tool without idempotency keys will send a duplicate message on every retry, and a payment tool will double-charge customers. The cost of those duplicates outweighs the benefit of HTTP scalability.
- You need guaranteed sub-millisecond tool invocation latency. HTTP adds network round-trip time, TLS handshake overhead, and SSE framing costs that stdio avoids.
The framework handles the HTTP and SSE plumbing, but authentication, observability, and resource management are your responsibility. If you skip those, you’ll have a working demo and a production incident waiting to happen.