VideoSeeker: How Native Tool Invocation Fixes Instance-Level Video Agent Failures

Large vision-language models (LVLMs) can caption a video or answer questions about what happened. They fail when you need frame-accurate object tracking, bounding boxes that follow a specific person through a crowd, or temporal slices that isolate a single event. The problem is not model capacity. It is the interface: text prompts are a lossy way to specify spatial coordinates and temporal windows.

VideoSeeker addresses this by giving the model native tool invocation. Instead of asking “where is the red car between 0:15 and 0:30,” the agent calls a detection primitive with frame indices and receives structured bounding-box data. The shift matters for surveillance pipelines, sports analytics, and compliance monitoring where instance-level precision determines whether the system works at all.

Why Text Prompts Break for Spatiotemporal Tasks

Text is ambiguous for spatial references. “The person on the left” depends on camera angle and changes frame to frame. “Around 0:20” might mean a three-second window or a single frame. Current LVLMs generate text responses that describe locations but cannot return structured coordinates or trigger follow-up perception calls.

The decoupling of perception and reasoning creates a second failure mode. The model reasons over a fixed set of visual features extracted at inference start. If the agent needs to zoom into a region or re-examine frames with different detection parameters, it has no mechanism to request new visual data. The reasoning loop is closed.

VideoSeeker internalizes tool calling so the model can:

Invoke detection tools mid-inference to retrieve bounding boxes or segmentation masks
Request frame subsets or temporal slices on demand
Chain tool outputs (detect object, track across frames, extract region features)

This is not a wrapper script that parses text and calls APIs. The model learns when to emit tool-call tokens instead of text tokens, and the training loop includes supervision on tool invocation correctness.

Architecture: Tool Registry and Invocation Flow

The system exposes a registry of spatiotemporal tools. Each tool has a schema that defines input parameters (frame indices, bounding box coordinates, object class) and output structure (coordinates, confidence scores, feature vectors).
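A registry of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names `ToolSpec`, `ToolRegistry`, `object_class`, and `fake_detect` are all hypothetical, and the stub detector returns empty results.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass
class ToolSpec:
    """Schema for one registered tool (all field names are illustrative)."""
    name: str
    input_types: Dict[str, type]        # parameter name -> expected Python type
    output_fields: Tuple[str, ...]      # keys the tool promises to return
    fn: Callable[..., Dict[str, Any]]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def call(self, name: str, **params: Any) -> Dict[str, Any]:
        spec = self._tools[name]
        # Validate inputs against the schema before running the tool.
        for key, expected in spec.input_types.items():
            if key not in params:
                raise ValueError(f"{name}: missing parameter {key!r}")
            if not isinstance(params[key], expected):
                raise TypeError(f"{name}: {key!r} must be {expected.__name__}")
        result = spec.fn(**params)
        # Validate outputs so the model never sees a malformed result.
        missing = [f for f in spec.output_fields if f not in result]
        if missing:
            raise ValueError(f"{name}: result is missing fields {missing}")
        return result


def fake_detect(frames, object_class):
    """Stub detector: returns no boxes for every requested frame."""
    return {"boxes": {f: [] for f in frames}, "confidence": 0.0}


registry = ToolRegistry()
registry.register(ToolSpec(
    name="detect_object",
    input_types={"frames": list, "object_class": str},
    output_fields=("boxes", "confidence"),
    fn=fake_detect,
))
```

Validating both sides of the call keeps malformed model output and malformed tool output from reaching the generation context.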

Tool invocation flow:

Decision point: During autoregressive generation, the model can emit a special <tool> token instead of continuing text generation.
Tool selection: The next tokens specify which tool to invoke and serialize parameters (e.g., detect_object(frames=[10,20,30], class="person")).
Execution: The runtime pauses generation, calls the tool (which may be a YOLO detector, a tracking model, or a frame extraction service), and injects the structured output back into the context.
Resumption: The model continues generation with access to the tool result, either invoking another tool or producing a text response.
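The tool-selection step implies that the runtime must turn a serialized call back into a name and a parameter dictionary. The sketch below parses the call syntax shown in the example above. It is an assumption about how such a parser could look, not the paper's serialization format. The helper names `parse_tool_call` and `split_top_level` are hypothetical. Because `class` is a reserved word in Python, the call string cannot be handed to `ast.parse` directly, so only the values are evaluated, with `ast.literal_eval`.

```python
import ast
import re


def split_top_level(body):
    """Split on commas that are not inside brackets or quotes.
    Escaped quotes inside strings are not handled in this sketch."""
    parts, current, depth, quote = [], [], 0, None
    for ch in body:
        if quote:
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
            current.append(ch)
        elif ch in "[({":
            depth += 1
            current.append(ch)
        elif ch in "])}":
            depth -= 1
            current.append(ch)
        elif ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if "".join(current).strip():
        parts.append("".join(current))
    return parts


def parse_tool_call(text):
    """Parse 'name(key=value, ...)' into (name, params)."""
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", text, re.DOTALL)
    if match is None:
        raise ValueError(f"not a tool call: {text!r}")
    name, body = match.group(1), match.group(2)
    params = {}
    for part in split_top_level(body):
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        # literal_eval accepts only Python literals, so no code is executed.
        params[key.strip()] = ast.literal_eval(value.strip())
    return name, params
```

A malformed call raises `ValueError` (or `SyntaxError` from `literal_eval`), which the runtime can convert into an error result instead of crashing generation.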

The paper describes a four-stage data synthesis pipeline that generates training examples where queries require tool use. A question like “show me all frames where the blue backpack appears” would trigger detection and temporal localization tools. A question like “what color is the backpack” might use cached detection results or generate text directly.

Training Pipeline: Cold Start and RL

The model is initialized from a pretrained LVLM and fine-tuned in two stages.

Stage 1: Cold-start supervision

The training set includes (video, query, tool call sequence, ground truth answer) tuples. The model learns to predict the correct sequence of tool invocations. Loss is computed over both tool-call tokens and final text output. This stage teaches the model the syntax of tool calls and when to use them.

Stage 2: Reinforcement learning

The model generates tool-call sequences and receives rewards based on task success. For instance-level localization, the reward is IoU (intersection over union) between predicted and ground-truth bounding boxes, averaged over time. For temporal grounding, the reward is precision and recall of frame ranges.
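The localization reward can be illustrated with a short IoU computation. This is a sketch under two assumptions that the text does not spell out: boxes are `(x1, y1, x2, y2)` tuples, and frames with a ground-truth box but no prediction score zero. The function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_reward(pred, truth):
    """Mean per-frame IoU over all ground-truth frames.

    pred and truth map frame index -> box. A frame that has a
    ground-truth box but no prediction contributes 0 (assumption).
    """
    if not truth:
        return 0.0
    total = 0.0
    for frame, gt_box in truth.items():
        if frame in pred:
            total += box_iou(pred[frame], gt_box)
    return total / len(truth)
```

For example, two 10x10 boxes offset by 5 pixels horizontally overlap in a 5x10 region, giving an IoU of 50 / 150, or one third.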

RL training incentivizes the model to invoke tools proactively. If the model can answer a question by reasoning over cached features, it skips tool calls. If it needs fresh detections, it learns to request them. The policy balances latency (fewer tool calls) against accuracy (more precise localization).

Evaluation Harness for Instance-Level Precision

Standard video QA benchmarks measure text accuracy. Instance-level tasks require structured metrics.

Eval pipeline:

Spatial precision: IoU between predicted and ground-truth bounding boxes, per frame
Temporal precision: Intersection over union of predicted and ground-truth time intervals
Identity consistency: Track ID must remain stable across frames (no ID switches)
Hallucination penalty: Predicted detections with no ground-truth match count as false positives
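The identity-consistency and hallucination checks in the list above can be sketched as two small functions. The matching strategy here (greedy, one-to-one, IoU threshold of 0.5) is an assumption chosen for brevity, not the harness's documented behavior, and all function names are hypothetical.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_id_switches(assignments):
    """Count how often a ground-truth object changes predicted track ID.

    assignments: (frame, gt_id, track_id) tuples sorted by frame.
    """
    last_track = {}
    switches = 0
    for _, gt_id, track_id in assignments:
        if gt_id in last_track and last_track[gt_id] != track_id:
            switches += 1
        last_track[gt_id] = track_id
    return switches


def count_false_positives(preds, truths, iou_threshold=0.5):
    """Predictions in one frame with no ground-truth match are false positives.

    Each ground-truth box can be matched at most once (greedy matching).
    """
    unmatched = list(truths)
    false_positives = 0
    for pred in preds:
        best = max(unmatched, key=lambda t: _iou(pred, t), default=None)
        if best is not None and _iou(pred, best) >= iou_threshold:
            unmatched.remove(best)
        else:
            false_positives += 1
    return false_positives
```

Greedy matching is order-dependent. A production harness would more likely use an optimal assignment, but the counting logic stays the same.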

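The harness runs the agent on a held-out video set, logs all tool calls, and compares structured outputs against annotations. Text responses are parsed to extract coordinates and frame indices. Temporal IoU is defined over intervals, so a bare timestamp has to be expanded into a window before it can be scored. If the model predicts 0:15 to 0:20 but the ground truth is 0:12 to 0:17, the overlap is 2 seconds out of an 8-second union and the temporal IoU is 0.25.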
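Temporal IoU is simple enough to show directly. The sketch assumes intervals are `(start, end)` pairs in seconds. The `as_interval` helper, which pads a single timestamp into a window, is a hypothetical convenience and not something the harness is known to provide.

```python
def temporal_iou(pred, truth):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def as_interval(timestamp, half_width=0.5):
    """Expand a point timestamp into a window so it can be scored."""
    return (timestamp - half_width, timestamp + half_width)
```

A prediction of 15 to 20 seconds against a ground truth of 12 to 17 seconds scores 2 / 8 = 0.25. Two point timestamps three seconds apart score 0 after padding by half a second on each side.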

The evaluation pipeline also measures tool call efficiency. Invoking detection on every frame is correct but wasteful. The best agents learn to sample frames intelligently (e.g., detect every 10th frame, then refine around high-confidence regions).
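The sampling strategy mentioned above (a coarse pass, then refinement near hits) can be written as a hand-coded baseline. In VideoSeeker the policy is learned, so this is only an illustration of the idea. The `detect` callback, which returns a confidence for one frame, is an assumed interface.

```python
def coarse_to_fine(num_frames, detect, stride=10, threshold=0.5):
    """Score every `stride`-th frame, then fill in the neighbors of hits.

    detect(frame_index) -> confidence in [0, 1] (assumed interface).
    Returns a dict mapping each evaluated frame to its confidence.
    """
    # Coarse pass: one detection per stride.
    scores = {f: detect(f) for f in range(0, num_frames, stride)}
    hits = [f for f, s in scores.items() if s >= threshold]
    # Fine pass: evaluate unseen frames within one stride of each hit.
    for hit in hits:
        lo = max(0, hit - stride + 1)
        hi = min(num_frames, hit + stride)
        for f in range(lo, hi):
            if f not in scores:
                scores[f] = detect(f)
    return scores
```

With 100 frames and an object visible in frames 38 through 52, this evaluates 37 frames instead of 100 and still finds all 15 positive frames. The limitation is that an object visible for fewer than `stride` frames can fall between coarse samples and be missed entirely.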

Error Recovery and Boundary Conditions

Tool calls fail. Bounding boxes drift when objects move out of frame. Detection confidence drops in low-light segments. The agent needs failure modes that do not cascade into hallucinated answers.

Several failure scenarios are likely:

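Bounding box out of bounds: When detection returns coordinates outside frame dimensions, the system must clip to valid ranges and flag reduced confidence. Clipping shrinks the box for an object that is partly out of frame, so downstream logic should treat a clipped box as a partial view rather than the full object.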

Object disappears mid-track: Tracking algorithms can lose objects when they exit the frame or become occluded. The system should return the last known position with a null or low-confidence flag, forcing downstream logic to handle gaps explicitly.

Detection tool timeout: If a tool takes too long or crashes, the agent can fall back to text-only responses or retry with relaxed parameters. This degrades to baseline LVLM behavior but prevents complete failure.

Conflicting tool outputs: When multiple detections overlap or contradict, the system uses confidence scores to select the most reliable result or surfaces the ambiguity to the user.
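Two of the scenarios above, out-of-bounds boxes and tool timeouts, can be sketched as follows. Both functions are hypothetical and rest on assumptions: boxes are `(x1, y1, x2, y2)` tuples, and the `relax` callback returns a cheaper parameter set for the retry. `slow_detector` is a stub that simulates a slow tool.

```python
import concurrent.futures
import time


def clip_box(box, width, height):
    """Clip a box to the frame and report how much of it survived."""
    x1, y1, x2, y2 = box
    cx1, cx2 = min(max(x1, 0), width), min(max(x2, 0), width)
    cy1, cy2 = min(max(y1, 0), height), min(max(y2, 0), height)
    original_area = max(0, x2 - x1) * max(0, y2 - y1)
    clipped_area = max(0, cx2 - cx1) * max(0, cy2 - cy1)
    visible = clipped_area / original_area if original_area > 0 else 0.0
    clipped = (cx1, cy1, cx2, cy2) != (x1, y1, x2, y2)
    return {"box": (cx1, cy1, cx2, cy2), "clipped": clipped,
            "visible_fraction": visible}


def call_with_timeout(fn, params, timeout_s, retries=1, relax=None):
    """Run fn(**params) with a timeout; retry with relaxed parameters."""
    attempt_params = dict(params)
    error = "not attempted"
    for _ in range(retries + 1):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn, **attempt_params)
        try:
            return {"ok": True, "result": future.result(timeout=timeout_s)}
        except concurrent.futures.TimeoutError:
            error = "timeout"
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
        finally:
            # Do not block on a worker that may still be running.
            pool.shutdown(wait=False)
        if relax is not None:
            attempt_params = relax(attempt_params)
    # Structured failure: the agent can fall back to a text-only answer.
    return {"ok": False, "error": error, "confidence": 0.0}


def slow_detector(frames):
    """Stub tool that is slow whenever it is given more than two frames."""
    time.sleep(0.5 if len(frames) > 2 else 0.0)
    return {"frames": frames}
```

Note that a thread-based timeout stops waiting but does not kill the worker. A real tool executor would more likely run tools in a separate process or service that can be cancelled.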

The training data includes examples where tools return empty results or low-confidence detections. The model learns to surface uncertainty (“I lost track of the object at 0:45”) rather than fabricate coordinates.

For long videos, intermediate tool outputs are cached. A 10-minute video at 30 FPS is 18,000 frames. Running object detection on every frame for every query is not viable. The system caches detection results keyed by video ID, frame index, and tool parameters. If the agent requests the same detection twice, it reads from cache. Entries for a video are invalidated when the user replaces that video. A changed detection threshold is part of the tool parameters, so it produces a new key and a cache miss rather than a stale hit.
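A cache with these properties can be sketched as a nested dictionary. The class name `DetectionCache` and its methods are hypothetical. The sketch assumes tool parameters are JSON-serializable and stores everything in memory, which a real deployment would replace with an external store.

```python
import json


class DetectionCache:
    """In-memory cache keyed by video, tool, frame, and parameters."""

    def __init__(self):
        self._store = {}   # video_id -> {key: result}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(tool_name, frame_index, params):
        # sort_keys makes the key independent of dict insertion order.
        return (tool_name, frame_index, json.dumps(params, sort_keys=True))

    def get_or_compute(self, video_id, tool_name, frame_index, params, compute):
        per_video = self._store.setdefault(video_id, {})
        key = self._key(tool_name, frame_index, params)
        if key in per_video:
            self.hits += 1
            return per_video[key]
        self.misses += 1
        per_video[key] = compute()
        return per_video[key]

    def invalidate_video(self, video_id):
        """Drop every cached result for one video, e.g. after re-upload."""
        self._store.pop(video_id, None)
```

Grouping entries by video makes invalidation a single dictionary removal, and including the parameters in the key means a threshold change can never return results computed under the old threshold.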

Implementation Example: Runtime Handler

Here is a simplified view of how the runtime handles tool calls during generation:

import json
from typing import Dict, Any

class ToolError(Exception):
    """Raised when tool execution fails"""
    pass

class VideoAgent:
    def __init__(self, model, tool_registry, cache):
        self.model = model
        self.tools = tool_registry
        self.cache = cache
    
    def _serialize_params(self, params: Dict[str, Any]) -> str:
        """
        Create a cache-safe key from tool parameters.
        Uses JSON serialization to handle nested dicts and lists.
        """
        return json.dumps(params, sort_keys=True)
    
    def generate(self, video, query, max_tokens=1024):
        """
        Generate response with tool invocation support.
        
        Args:
            video: Video object with frame data
            query: User query string
            max_tokens: Maximum tokens to generate
        """
        context = [video, query]
        
        for _ in range(max_tokens):
            token = self.model.next_token(context)
            
            if token == "<tool>":
                # Extract tool name and parameters from next tokens
                tool_name = self.model.next_token(context)
                tool_params = self.model.decode_params(context)
                
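                # Cache key covers the video, the tool name, and the serialized
                # parameters, so results never leak between videos.
                # `video.id` is an assumed attribute; id(video) is the fallback.
                video_key = getattr(video, "id", id(video))
                cache_key = (video_key, tool_name, self._serialize_params(tool_params))

                if cache_key in self.cache:
                    result = self.cache[cache_key]
                elif tool_name not in self.tools:
                    # The model emitted a tool name that is not registered.
                    result = {"error": f"unknown tool: {tool_name}", "confidence": 0.0}
                else:
                    try:
                        result = self.tools[tool_name].execute(
                            video, tool_params
                        )
                        # Cache successes only, so failed calls can be retried.
                        self.cache[cache_key] = result
                    except ToolError as e:
                        result = {"error": str(e), "confidence": 0.0}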
                
                context.append(result)
            else:
                context.append(token)
                if token == "<eos>":
                    break
        
        return self.model.decode(context)

Tool execution is synchronous and blocking. The model waits for the tool result before continuing. Async tool calls would require the runtime to track in-flight requests alongside partial generation state, and they add latency variance that makes SLAs harder to guarantee. For production systems with strict p99 requirements, synchronous execution makes latency easier to reason about.

Deployment Shape and Observability

A production deployment would likely split the agent into three services:

Inference service: Runs the LVLM, handles token generation, emits tool-call requests
Tool executor: Runs detection models, tracking algorithms, frame extraction (may be GPU-bound)
Cache layer: Stores intermediate tool outputs, keyed by video ID and tool parameters

The inference service decides when to emit <tool> tokens based on learned patterns from the RL training phase. During training, the model receives rewards for correct tool invocations and penalties for unnecessary calls. This teaches the model a policy: invoke tools when the query contains spatial or temporal specificity (“at 0:15”, “the person on the left”) and skip them for semantic questions (“what is the mood”).

The inference service sends tool requests to the executor over a queue. The executor returns structured results. If the executor is overloaded, requests queue and the agent experiences latency spikes. Observability must track:

Tool call latency (p50, p99)
Cache hit rate
Tool error rate by type (timeout, out-of-bounds, low confidence)
End-to-end query latency
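The tool-level metrics in the list above can be collected with a small in-process recorder. This is an illustrative sketch: `ToolMetrics` is a hypothetical class, the percentile uses the nearest-rank method, and a production system would export these values to a metrics backend rather than hold them in a list.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p in 0..100)."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


class ToolMetrics:
    def __init__(self):
        self.latencies_ms = []
        self.cache_hits = 0
        self.lookups = 0
        self.errors = {}   # error type -> count

    def record_call(self, latency_ms, cache_hit, error_type=None):
        self.latencies_ms.append(latency_ms)
        self.lookups += 1
        if cache_hit:
            self.cache_hits += 1
        if error_type is not None:
            self.errors[error_type] = self.errors.get(error_type, 0) + 1

    def summary(self):
        return {
            "p50_ms": percentile(self.latencies_ms, 50),
            "p99_ms": percentile(self.latencies_ms, 99),
            "cache_hit_rate": self.cache_hits / self.lookups,
            "errors": dict(self.errors),
        }
```

Tracking errors by type matters because a rising timeout count points at executor capacity, while a rising low-confidence count points at the detector or the video content.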

For long videos, the executor can precompute common tool outputs (e.g., run object detection on all frames at upload time). This shifts cost to ingest but reduces query latency.

Comparison: Text-Only vs. Tool-Native Agents

Dimension	Text-Only LVLM	Tool-Native Agent	Hybrid (Fallback)
Latency	200-500ms	1-5s (tool-dependent)	500ms-2s
Spatial Precision	Descriptive only	Frame-level IoU > 0.8	Variable
Deployment Complexity	Single service	3+ services + cache	2 services
Cache Overhead	Minimal	High (frame embeddings, detections)	Moderate
Failure Mode	Hallucinated descriptions	Tool timeout or empty result	Degrades to text-only
Best For	Semantic QA	Instance tracking, localization	General video understanding

When Tool Invocation Helps and When It Does Not

Use native tool invocation when:

Queries require frame-accurate localization or tracking
The task involves chaining perception primitives (detect, track, segment)
You need structured outputs (bounding boxes, timestamps) not text descriptions
The video is long and you cannot afford to process all frames upfront

Avoid it when:

Queries are purely semantic (“what is the mood of this scene”)
You need sub-second response times and cannot tolerate tool-call latency
The tool registry is unstable or tools have high error rates
The model is not trained on tool invocation and will hallucinate tool syntax

The architecture adds complexity. You now have a multi-service deployment, a cache layer, and a training pipeline that includes RL. The payoff is precision on tasks where text prompts fail.

Technical Verdict

VideoSeeker shows that video agents can move beyond text-only interfaces by learning to invoke detection and tracking tools natively. The architecture is sound: a tool registry, a learned policy for when to call tools, and a training loop that includes both supervised and RL phases. According to the paper, this enables proactive perception and on-demand retrieval of relevant video segments, which addresses the limits of text-only prompting for spatiotemporal localization.

The main operational risk is tool-call latency. If detection models are slow or the cache layer is cold, query times balloon. The system works best when common tool outputs are precomputed or cached aggressively.

For production video pipelines that need instance-level precision (surveillance, sports analytics, compliance), this approach is worth the complexity. For general video QA where text descriptions suffice, the simpler text-only LVLM is faster and easier to deploy.

The structured evaluation harness is critical. Measuring IoU over time and penalizing hallucinated detections gives you a real signal on whether the agent is grounding its answers in visual evidence or guessing.

Source Links

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation (arXiv:2605.16079v1)