mech.app

NVIDIA's Video Search Blueprint: How Vision Agents Orchestrate VLMs, LLMs, and Real-Time Streams

Three-tier architecture for GPU-accelerated vision agents: real-time feature extraction, downstream analytics, and MCP-based agentic orchestration.

Source: github.com

NVIDIA has released a reference blueprint for vision agents that shows how to combine real-time video intelligence, downstream analytics, and agentic orchestration in a single system. At the time of writing, the Video Search and Summarization (VSS) blueprint is trending at rank 3 among Python repositories on GitHub, and it powers the build.nvidia.com experience for natural-language video agents.

This is not a demo. It is a reference architecture that separates real-time feature extraction from downstream enrichment and agentic processing, all backed by NVIDIA NIM microservices and the Model Context Protocol.

Three-Tier Architecture

VSS splits video processing into three distinct layers, each with its own latency profile and state management:

  1. Real-time video intelligence: Feature extraction, embeddings, and stream understanding. Results publish to a message broker (Kafka or Redis Streams) for downstream consumption.
  2. Downstream analytics: Metadata enrichment into trajectories, incidents, and verified alerts. This layer consumes the message stream and writes to a time-series or vector store.
  3. Agentic and offline processing: Orchestrated tools for search, Q&A, summarization, and clip retrieval. This layer uses the Model Context Protocol to expose tools to LLMs and VLMs.

The separation matters because real-time processing cannot block on LLM inference, and agentic workflows need access to historical metadata without replaying video streams.

Orchestration Flow

The blueprint uses a message broker as the coordination point between tiers. Here is the flow for a live camera feed:

  1. Vision microservice extracts features from each frame (object detection, embeddings, scene classification).
  2. Features publish to a Kafka topic with frame timestamp and camera ID.
  3. Downstream analytics service consumes the topic, builds trajectories, and writes to a vector database.
  4. Agentic layer exposes MCP tools that query the vector database and invoke VLMs for visual Q&A.
  5. LLM orchestrator decides which tools to call based on user query.

For stored video, the flow is simpler: batch processing writes metadata directly to the vector store, and the agentic layer queries it on demand.
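The live-feed flow above can be sketched with an in-memory queue standing in for the Kafka topic. This is a minimal illustration, not the blueprint's code: the `FeatureMessage` fields, the `object_id` tracker ID, and both function names are assumptions made for the example.

```python
import queue
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureMessage:
    """One per-frame record published by the vision tier (fields are assumed)."""
    camera_id: str
    timestamp: float   # frame capture time in seconds
    object_id: str     # hypothetical tracker ID for a detected object
    position: tuple    # (x, y) centre of the detection


def publish_features(broker: queue.Queue, messages: list) -> None:
    """Tier 1: push extracted features to the broker without waiting on consumers."""
    for msg in messages:
        broker.put(msg)


def build_trajectories(broker: queue.Queue) -> dict:
    """Tier 2: drain the broker and group positions per (camera, object)."""
    grouped = defaultdict(list)
    while True:
        try:
            msg = broker.get_nowait()
        except queue.Empty:
            break
        grouped[(msg.camera_id, msg.object_id)].append((msg.timestamp, msg.position))
    # Order each trajectory by frame timestamp rather than arrival order.
    return {key: sorted(points) for key, points in grouped.items()}


broker = queue.Queue()  # stand-in for a Kafka topic
publish_features(broker, [
    FeatureMessage("cam-1", 2.0, "person-7", (12, 40)),
    FeatureMessage("cam-1", 1.0, "person-7", (10, 40)),
    FeatureMessage("cam-2", 1.5, "car-3", (88, 15)),
])
trajectories = build_trajectories(broker)
print(trajectories[("cam-1", "person-7")])  # [(1.0, (10, 40)), (2.0, (12, 40))]
```

The producer never waits for the consumer, which is the property the tier separation relies on. A real deployment would replace the queue with a broker client and persist the trajectories to the vector store.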

Model Context Protocol Integration

The blueprint implements MCP to expose video analytics as tools. Each tool is a microservice endpoint that the LLM can invoke via function calling. The tools include:

  • Search: Semantic search over video embeddings (text-to-video or image-to-video).
  • Q&A: Visual question answering over specific frames or clips.
  • Summarization: Generate natural-language summaries of video segments.
  • Clip retrieval: Return video clips matching a query with start and end timestamps.

The MCP server maintains a tool registry and handles authentication, rate limiting, and result caching. The LLM sees tools as function signatures with JSON schemas. When the LLM calls a tool, the MCP server routes the request to the appropriate microservice and returns the result.

This pattern keeps the LLM stateless. The MCP server owns the conversation context and tool state.
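To make the routing step concrete, here is a minimal sketch of a tool registry that exposes JSON-schema signatures and dispatches calls. `ToolRegistry`, `summarize_clip`, and the result shape are hypothetical names for illustration; the blueprint's actual MCP server may be organized quite differently.

```python
import json
from typing import Any, Callable, Dict


class ToolRegistry:
    """Hypothetical registry: maps a tool name to a JSON schema and a handler."""

    def __init__(self) -> None:
        self._tools: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, schema: Dict[str, Any],
                 handler: Callable[..., Any]) -> None:
        self._tools[name] = {"schema": schema, "handler": handler}

    def signatures(self) -> list:
        """What the LLM sees: tool names and parameter schemas, never handlers."""
        return [{"name": name, "parameters": tool["schema"]}
                for name, tool in self._tools.items()]

    def call(self, name: str, arguments_json: str) -> Dict[str, Any]:
        """Route one LLM function call to the matching handler."""
        tool = self._tools.get(name)
        if tool is None:
            return {"success": False, "error": f"unknown tool: {name}"}
        args = json.loads(arguments_json)
        missing = [p for p in tool["schema"].get("required", []) if p not in args]
        if missing:
            return {"success": False, "error": f"missing arguments: {missing}"}
        return {"success": True, "data": tool["handler"](**args)}


def summarize(clip_id: str) -> str:
    # Placeholder for a request to a summarization microservice.
    return f"summary of {clip_id}"


registry = ToolRegistry()
registry.register(
    "summarize_clip",
    {"type": "object",
     "properties": {"clip_id": {"type": "string"}},
     "required": ["clip_id"]},
    summarize,
)
ok = registry.call("summarize_clip", '{"clip_id": "clip-42"}')
bad = registry.call("summarize_clip", "{}")
print(ok)
print(bad)
```

Authentication, rate limiting, and caching would wrap `call` in a fuller server; they are left out here to keep the routing logic visible.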

State Management

The blueprint uses three state stores:

Store type | Purpose | Example
Message broker | Real-time feature stream | Kafka, Redis Streams
Vector database | Embeddings and metadata | Milvus, Weaviate, pgvector
Object storage | Video clips and frames | S3, MinIO

The message broker is append-only and retains events for a configurable window (e.g., 7 days). The vector database stores embeddings with metadata pointers to object storage. The object storage holds raw video and extracted frames.

This separation allows the agentic layer to query metadata without loading video into memory. When a user asks for a clip, the system returns a signed URL to object storage instead of streaming bytes through the LLM.
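The signed-URL idea can be illustrated with a simplified HMAC scheme. This is a sketch under stated assumptions: the secret, the host `storage.example.com`, and the `expires`/`sig` query parameters are invented for the example. Real object stores such as S3 and MinIO use their own presigning schemes, so use their SDKs in practice.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # assumption: a key shared with the storage gateway


def _signature(object_key: str, expires_at: int) -> str:
    payload = f"{object_key}:{expires_at}".encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def sign_url(base_url: str, object_key: str, expires_at: int) -> str:
    """Return a URL that grants access to one object until expires_at (epoch seconds)."""
    query = urlencode({"expires": expires_at, "sig": _signature(object_key, expires_at)})
    return f"{base_url}/{object_key}?{query}"


def verify_url(url: str, now: int) -> bool:
    """Check the signature and expiry as a storage gateway might."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["sig"][0]
    except (KeyError, ValueError):
        return False
    object_key = parsed.path.lstrip("/")
    expected = _signature(object_key, expires_at)
    return now <= expires_at and hmac.compare_digest(signature, expected)


clip_url = sign_url("https://storage.example.com", "clips/cam-1/0001.mp4", expires_at=1000)
print(verify_url(clip_url, now=999))   # True
print(verify_url(clip_url, now=1001))  # False
```

Only the short URL string passes through the tool result, so the LLM never handles video bytes.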

Security Boundaries

The blueprint enforces three security boundaries:

  1. Ingestion boundary: Vision microservices authenticate to the message broker with mTLS. Camera feeds use RTSP over TLS or WebRTC with DTLS.
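  2. Analytics boundary: Downstream services authenticate to the vector database and object storage with IAM roles. Because this tier consumes the feature stream, its broker access should be limited to reading the feature topics, with no publish or admin rights.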
  3. Agentic boundary: MCP server authenticates to all downstream services. LLM has no direct access to infrastructure. All tool calls go through the MCP server.

The MCP server also enforces rate limits per user and per tool. This prevents a runaway LLM from exhausting GPU quota or storage bandwidth.
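One common way to enforce limits per user and per tool is a token bucket keyed on the (user, tool) pair. The sketch below assumes that approach; the source does not say which algorithm the blueprint uses, and `RateLimiter` with its parameters is a hypothetical name.

```python
import time
from typing import Callable, Dict, Tuple


class RateLimiter:
    """Token bucket per (user, tool): capacity sets the burst, refill sets the rate."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        # (user, tool) -> (tokens remaining, time of last update)
        self._buckets: Dict[Tuple[str, str], Tuple[float, float]] = {}

    def allow(self, user: str, tool: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get((user, tool), (float(self.capacity), now))
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_second)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[(user, tool)] = (tokens, now)
        return allowed


fake_now = [0.0]  # injectable clock so the example is deterministic
limiter = RateLimiter(capacity=2, refill_per_second=1.0, clock=lambda: fake_now[0])
decisions = [limiter.allow("alice", "search_video") for _ in range(3)]
print(decisions)  # [True, True, False]
```

Keying on both user and tool means a burst of expensive VLM calls does not consume the same user's budget for cheap vector searches.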

Observability

The blueprint instruments each tier with OpenTelemetry. Traces span the entire request path: user query to LLM, LLM to MCP server, MCP server to tool microservice, tool microservice to vector database or VLM.

Key metrics to watch:

  • Frame processing latency: Time from camera frame to message broker publish. Target: <100ms.
  • Tool invocation latency: Time from LLM function call to MCP server response. Target: <2s.
  • VLM inference latency: Time for visual Q&A over a single frame. Target: <500ms.
  • Vector search latency: Time to retrieve top-k embeddings. Target: <50ms.
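A small check against these targets might look like the following sketch. The targets come from the list above; comparing the 95th percentile (nearest-rank method) is an assumption, since the source does not say which statistic the targets apply to, and the metric keys are hypothetical.

```python
import math

# Targets in milliseconds, taken from the list above.
TARGETS_MS = {
    "frame_processing": 100,
    "tool_invocation": 2000,
    "vlm_inference": 500,
    "vector_search": 50,
}


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def breaches(observed, pct=95):
    """Return the metrics whose chosen percentile is not below its target."""
    result = {}
    for metric, samples in observed.items():
        value = percentile(samples, pct)
        if value >= TARGETS_MS[metric]:
            result[metric] = value
    return result


observed = {
    "frame_processing": [40, 55, 60, 120],
    "vector_search": [10, 12, 15, 20],
}
print(breaches(observed))  # {'frame_processing': 120}
```

In a real deployment these samples would come from the OpenTelemetry spans rather than hard-coded lists.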

The blueprint also logs tool call failures and retries. If a tool fails three times, the MCP server returns an error to the LLM with a suggested fallback.

Deployment Shape

The reference implementation runs on a single GPU node with Docker Compose. For production, the blueprint provides Kubernetes manifests with the following shape:

  • Vision microservices: StatefulSet with GPU affinity. One pod per camera feed.
  • Message broker: StatefulSet with persistent volumes. Three replicas for high availability.
  • Downstream analytics: Deployment with horizontal pod autoscaling. Scales based on message lag.
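  • MCP server: Deployment with load balancer. Scales based on request rate. Replicas are interchangeable only if conversation context, tool state, and cached results live in a shared store rather than in pod memory.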
  • Vector database: StatefulSet with persistent volumes. Sharded by camera ID.

The blueprint also includes Helm charts for NVIDIA NIM microservices (VLM and LLM inference).
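Scaling the downstream analytics tier on message lag comes down to a proportional rule: divide total lag by the lag one replica should carry, round up, and clamp. The sketch below mirrors the way a Kubernetes HorizontalPodAutoscaler treats an average-value target; the function name, defaults, and example numbers are assumptions for illustration.

```python
import math


def desired_replicas(total_lag: int, target_lag_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replica count needed so each replica carries at most the target lag."""
    if target_lag_per_replica <= 0:
        raise ValueError("target_lag_per_replica must be positive")
    wanted = math.ceil(total_lag / target_lag_per_replica)
    # Clamp to the configured bounds, as an autoscaler would.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4500, 1000))   # 5
print(desired_replicas(0, 1000))      # 1
print(desired_replicas(50000, 1000))  # 10
```

Vision microservices do not fit this rule, because they are pinned one pod per camera feed and scale with the number of feeds.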

Failure Modes

Here are the likely failure modes and mitigations:

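Vision microservice crashes: The message broker retains messages that were already published, and downstream analytics replays from its last checkpoint, so no published features are lost. Frames that arrive while the service is down go unprocessed unless the feed is buffered or replayed.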

Vector database unavailable: MCP server returns cached results if available. Otherwise, returns error to LLM with retry suggestion.

VLM inference timeout: MCP server retries with exponential backoff. After three retries, returns error to LLM. LLM can fall back to text-only search.

Message broker partition rebalance: Downstream analytics pauses consumption during rebalance. Real-time latency spikes but recovers within seconds.

Object storage outage: Clip retrieval fails. MCP server returns metadata only (timestamps, descriptions). User can retry later.
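The VLM timeout mitigation can be sketched as a retry loop with exponential backoff and a fallback. This is a simplified illustration: in the described design the MCP server returns an error and the LLM chooses the fallback, whereas here both steps are collapsed into one function. `call_with_retries`, `vlm_qa`, and `text_search` are hypothetical names.

```python
import asyncio

attempts = []  # records each call to the primary, for the example only


async def call_with_retries(primary, fallback, retries=3, base_delay=0.01):
    """Call primary once, retry up to `retries` times with doubling delays,
    then use the fallback."""
    for attempt in range(retries + 1):
        try:
            return await primary()
        except asyncio.TimeoutError:
            if attempt == retries:
                break
            # Delays of base_delay, 2x, 4x, ... between attempts.
            await asyncio.sleep(base_delay * 2 ** attempt)
    return await fallback()


async def vlm_qa():
    # Stand-in for a VLM request that always times out.
    attempts.append(1)
    raise asyncio.TimeoutError()


async def text_search():
    # Stand-in for the text-only search fallback.
    return {"mode": "text-only", "results": []}


result = asyncio.run(call_with_retries(vlm_qa, text_search))
print(len(attempts), result["mode"])  # 4 text-only
```

Production delays would be far longer than the 10 ms base used here, and adding jitter helps avoid synchronized retries across clients.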

The blueprint does not handle corrupted embeddings or incorrect tool responses. You need application-level validation for those cases.
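As one example of such application-level validation, the sketch below rejects embeddings with the wrong dimension, non-finite values, or zero norm before they reach the vector store. The function name and the specific checks are assumptions; which checks matter depends on your embedding model.

```python
import math
from typing import Sequence


def is_valid_embedding(vector: Sequence[float], expected_dim: int) -> bool:
    """Basic sanity checks before writing an embedding to the vector store."""
    if len(vector) != expected_dim:
        return False
    if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vector):
        return False
    # A zero vector has no direction, so cosine similarity is undefined for it.
    return math.sqrt(sum(x * x for x in vector)) > 0.0


print(is_valid_embedding([0.6, 0.8], expected_dim=2))           # True
print(is_valid_embedding([float("nan"), 1.0], expected_dim=2))  # False
```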

Code Example: MCP Tool Registration

Here is how the blueprint registers a search tool with the MCP server:

from typing import Dict, List, Optional

from mcp import Tool, ToolParameter, ToolResult

# Note: MCP is an open protocol rather than an NVIDIA specification.
# ToolParameter and ToolResult are illustrative names and may not match the
# MCP SDK you deploy with; validate them against the blueprint repository,
# which includes the complete MCP server implementation.
# self.vlm_client, self.vector_db, and self._build_filter are assumed to be
# provided by the surrounding server code.

class VideoSearchTool(Tool):
    name = "search_video"
    description = "Search video embeddings by text or image query"

    parameters = [
        ToolParameter(
            name="query",
            type="string",
            description="Natural language search query",
            required=True
        ),
        ToolParameter(
            name="camera_ids",
            type="array",
            description="Filter by camera IDs",
            required=False
        ),
        ToolParameter(
            name="time_range",
            type="object",
            description="Start and end timestamps",
            required=False
        )
    ]

    async def execute(self, query: str,
                      camera_ids: Optional[List[str]] = None,
                      time_range: Optional[Dict] = None) -> ToolResult:
        # Generate query embedding
        embedding = await self.vlm_client.embed(query)
        
        # Build vector search filter
        filter_expr = self._build_filter(camera_ids, time_range)
        
        # Query vector database
        results = await self.vector_db.search(
            embedding=embedding,
            filter=filter_expr,
            limit=10
        )
        
        # Return results with metadata
        return ToolResult(
            success=True,
            data=[{
                "timestamp": r.metadata["timestamp"],
                "camera_id": r.metadata["camera_id"],
                "score": r.score,
                "clip_url": r.metadata["clip_url"]
            } for r in results]
        )

The MCP server exposes this tool to the LLM as a function signature. The LLM calls it with JSON arguments, and the server routes the request to the tool’s execute method. The MCP server implementation is provided in the blueprint repository.
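The example calls `_build_filter` without showing it. A minimal sketch of such a helper follows, written as a standalone function. It assumes the vector database accepts a boolean expression string, that the metadata fields are named `camera_id` and `timestamp`, and that `time_range` uses `start` and `end` keys. None of these details are confirmed by the source, so adapt the syntax to your store.

```python
import json
from typing import Dict, List, Optional


def build_filter(camera_ids: Optional[List[str]] = None,
                 time_range: Optional[Dict[str, float]] = None) -> str:
    """Combine optional camera and time constraints into one filter expression."""
    clauses = []
    if camera_ids:
        # json.dumps quotes and escapes each ID, e.g. ["cam-1", "cam-2"].
        clauses.append(f"camera_id in {json.dumps(camera_ids)}")
    if time_range:
        if "start" in time_range:
            clauses.append(f"timestamp >= {float(time_range['start'])}")
        if "end" in time_range:
            clauses.append(f"timestamp <= {float(time_range['end'])}")
    # An empty string means "no filter".
    return " and ".join(clauses)


expr = build_filter(["cam-1", "cam-2"], {"start": 100, "end": 200})
print(expr)
# camera_id in ["cam-1", "cam-2"] and timestamp >= 100.0 and timestamp <= 200.0
```

Casting the timestamps to `float` and serializing IDs with `json.dumps` keeps LLM-supplied arguments from being spliced into the expression unescaped.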

Technical Verdict

Use this blueprint when:

  • You need to build vision agents that combine real-time video streams with agentic orchestration.
  • You want a reference architecture that separates real-time processing from downstream analytics and agentic workflows.
  • You are already using NVIDIA NIM microservices or plan to deploy on NVIDIA GPUs.
  • You need MCP-based tool orchestration for multimodal workflows.
  • You have multiple camera feeds or video sources and need to scale horizontally.

Avoid this blueprint when:

  • You only need batch video processing (no real-time streams). Use a simpler pipeline with direct VLM inference.
  • You are building text-only agents. The three-tier architecture adds unnecessary complexity.
  • You need end-to-end latency under 200ms for agentic queries. MCP serialization, tool routing, and VLM inference typically add 150–500ms overhead depending on model size and hardware.
  • You want to run on CPU-only infrastructure. The blueprint assumes GPU acceleration for vision microservices.
  • You need a battle-tested, widely deployed solution. This is a reference implementation, not a turnkey product. You will need to implement custom vision microservices for your camera types and write downstream analytics logic for your domain.

The blueprint is reference-grade and shows NVIDIA’s recommended patterns for vision agents. It assumes you want to separate real-time and agentic processing, use a message broker for coordination, and expose tools via MCP. If those assumptions fit your use case, this is the fastest path to a working vision agent architecture.