Needle: Distilling Gemini Tool Calling into 26M Parameters for On-Device Agents

Needle is a 26M parameter model distilled from Gemini that enables trading agents, portfolio monitors, and algorithmic execution systems to run on budget phones and embedded devices without cloud dependencies. Most function-calling models live in the cloud because tool use requires reasoning over schemas, validating arguments, and handling multi-step workflows. For fintech, this means every market signal, order, or risk check incurs network latency and API costs. Needle compresses that behavior into a model small enough to run offline, enabling real-time trading agents and financial workflows on consumer hardware.

The project addresses a gap between frontier model capabilities and edge constraints. Cloud models like GPT-4 or Claude handle tool discovery and parameter validation well, but they introduce round-trip latency, privacy leakage, and dependency on external services. Needle targets offline-first agents, mobile assistants, and embedded systems where network access is intermittent or prohibited. The Show HN thread attracted 766 points and 210 comments, reflecting strong community interest in on-device agent infrastructure for financial applications.

Distillation Pipeline for Tool Calling

Knowledge distillation trains a small student model to mimic a large teacher model’s outputs. For tool calling, the challenge is preserving structured outputs: function names, argument schemas, and error handling logic.

Training flow:

Synthetic data generation: Generate diverse tool-calling prompts covering common patterns (API calls, database queries, file operations, multi-step workflows).
Teacher inference: Run Gemini on each prompt to produce tool invocations with correct argument types and validation logic.
Student training: Train the 26M parameter model to predict the same tool calls, using cross-entropy loss on token sequences and schema-aware loss functions that penalize malformed JSON or incorrect argument types.
Validation: Test the student model against held-out tool schemas to ensure it generalizes beyond training examples.

The distillation process compresses token prediction and the implicit reasoning about function signatures. A 26M model cannot store every possible API schema, so it learns patterns: how to map natural language intent to parameter names, how to infer required vs. optional fields, and how to handle missing information.

Fintech Use Cases

Needle’s on-device architecture enables financial workflows that cloud-based agents cannot support:

Real-time market data ingestion and technical analysis:

Mobile trading apps can process streaming price feeds and execute technical indicators (moving averages, RSI, Bollinger Bands) without sending data to cloud APIs.
Latency drops from 200-2000ms (cloud round-trip) to 5-20ms (local inference), enabling sub-second trade signals.
Privacy-sensitive traders avoid exposing positions or strategies to third-party services.

Order validation and execution without cloud round-trips:

Trading agents validate order parameters (symbol, quantity, price limits) locally before submitting to exchanges.
Eliminates network latency for pre-trade checks, reducing slippage on fast-moving markets.
Offline validation prevents malformed orders from reaching exchange APIs, reducing rejection rates.

Portfolio rebalancing and risk monitoring in air-gapped environments:

Institutional traders operating in secure facilities can run risk models and rebalancing logic on isolated devices.
No external API calls means no data exfiltration risk or compliance violations.
Agents monitor margin requirements, position limits, and concentration risk without cloud dependencies.

Compliance logging and audit trails for regulated trading:

All tool calls, arguments, and results are logged locally in SQLite or encrypted storage.
Audit trails remain on-device until synced to compliance systems, preventing real-time surveillance by cloud providers.
Regulators can inspect agent behavior without accessing third-party logs.

Quantization and Inference Engine

At 26M parameters, the model fits in ~50MB of memory when quantized to 4-bit or 8-bit precision. Needle achieves 6000 tok/s prefill by using:

Quantization-aware training: The model is trained with simulated quantization noise, so weight precision loss during deployment does not degrade tool-calling accuracy.
Optimized inference kernels: Custom CUDA or Metal kernels for matrix multiplication and attention, tuned for small batch sizes and low-latency decode.
Sparse attention: Tool-calling prompts often have long context (tool documentation, conversation history), so sparse attention patterns reduce compute without losing schema information.

The throughput numbers assume a modern mobile GPU (Apple M-series or Snapdragon 8 Gen 3). On older devices, prefill drops to ~2000 tok/s, but decode remains fast enough for interactive agents.

Tool Discovery and Parameter Validation

A 26M model has limited capacity for reasoning, so tool discovery relies on explicit schemas in the prompt. You provide a JSON schema for each available tool, and the model selects the correct function and fills arguments. Input sanitization is critical: user queries must be separated from system instructions to prevent prompt injection attacks that manipulate tool calls.

Example schema for market order placement:

{
  "name": "place_market_order",
  "description": "Execute a market order for a given symbol",
  "parameters": {
    "type": "object",
    "properties": {
      "symbol": {"type": "string", "description": "Trading symbol (e.g., AAPL)"},
      "quantity": {"type": "integer", "description": "Number of shares"},
      "side": {"type": "string", "enum": ["buy", "sell"]},
      "account_id": {"type": "string", "description": "Trading account identifier"}
    },
    "required": ["symbol", "quantity", "side", "account_id"]
  }
}

Model output:

{
  "tool": "place_market_order",
  "arguments": {
    "symbol": "AAPL",
    "quantity": 100,
    "side": "buy",
    "account_id": "ACC-12345"
  }
}

Parameter validation happens in two stages:

Schema compliance: The model outputs JSON that matches the schema structure. If it produces malformed JSON, the agent retries with an error message in the prompt.
Semantic validation: The calling code checks argument types and constraints (e.g., enum values, required fields). If validation fails, the agent appends the error to the conversation and asks the model to correct the invocation.

This two-stage approach offloads complex validation logic to deterministic code, reducing the reasoning burden on the small model.

Failure Modes and Recovery

Small models fail differently than frontier models. Common issues:

Failure Mode	Cause	Mitigation	Threshold
Hallucinated tool names	Model invents functions not in the schema	Strict schema matching; reject unknown tools	Occurs at >50 tools in registry
Missing required arguments	Model skips fields or uses wrong types	Schema-aware loss during training; retry with error feedback	~15% error rate on complex schemas
Context overflow	Long tool documentation exceeds context window	Summarize schemas; use retrieval to inject only relevant tools	Breaks at >4K tokens total context
Multi-step reasoning collapse	Model loses track of intermediate results	Explicit state management; store tool outputs in structured memory	Fails after >3 sequential tool calls
Stale market data	Model uses cached prices from earlier context	Refresh tool outputs on every call; timestamp all market data	>100ms latency acceptable for non-HFT workflows

The most brittle failure is multi-step reasoning. A 26M model cannot hold complex plans in its weights, so you need external orchestration. Store tool outputs in a key-value store or conversation history, and prompt the model with explicit state: “You called get_portfolio_value and received {total: 125000}. Now call check_margin_requirement.”

Deployment Shape

Needle runs entirely on-device, so the deployment architecture is simpler than cloud-based agents. Tool access control is enforced at the registry level: do not expose destructive operations (file deletion, network requests) without explicit user confirmation.

Model binary: 50MB quantized weights, loaded into GPU memory at startup.
Inference runtime: ONNX, llama.cpp, or custom engine with Metal/CUDA backends.
Tool registry: JSON file or in-memory map of available functions and schemas (limit to <50 tools to avoid hallucination).
State store: SQLite or in-memory dictionary for conversation history and tool outputs (encrypt sensitive data at rest to prevent leakage on compromised devices).
Orchestration loop: Python or Swift code that prompts the model, validates outputs, executes tools, and appends results to the conversation.

No external API calls, no network dependencies. The agent runs in airplane mode.

Observability and Debugging

On-device agents are harder to debug because you cannot log every inference to a centralized system. Needle deployments need local observability:

Prompt logging: Write every prompt and model output to a local file or database. This lets you replay failures and tune schemas.
Tool execution traces: Log tool calls, arguments, and results. If the agent fails, you can see which tool invocation broke the workflow.
Schema validation errors: Track how often the model produces malformed JSON or incorrect argument types. High error rates (>15%) indicate the distillation process did not capture the teacher’s validation logic.
Latency metrics: Measure prefill and decode times per request. Spikes indicate memory pressure or thermal throttling on mobile devices.

For production deployments, consider exporting traces to a local SQLite database and syncing them to a backend when the device reconnects. This gives you aggregate metrics without real-time telemetry.

Trade-Offs: Small Models vs. Cloud APIs

Dimension	Needle (26M)	Cloud API (GPT-4, Claude)
Latency	5-20ms per tool call	200-2000ms (network + inference)
Privacy	All data stays on-device	Data sent to third-party servers
Cost	Zero marginal cost per call	$0.01-$0.10 per request
Reasoning depth	<3 sequential tool calls before collapse	Handles >10 nested steps with complex logic
Tool discovery	Requires explicit schemas; breaks at >50 tools	Can infer tools from natural language descriptions
Offline capability	Full functionality without network	Requires internet connection
Context window	<4K tokens before overflow	32K-128K tokens
Regulatory compliance	All data on-device; audit trails stored locally	Centralized logging; easier compliance reporting

Needle wins on latency, privacy, and cost. Cloud APIs win on reasoning depth and flexibility. Choose Needle when you need offline-first agents, low-latency tool calls, or zero API costs. Choose cloud APIs when you need complex reasoning, broad tool discovery, or minimal engineering effort.

Technical Verdict

Use Needle when:

Your tool set is strictly <50 functions and your context window stays <4K tokens. The 26M parameter constraint means the model cannot handle large registries or long conversations without hallucination or context overflow.
Latency requirements are <50ms per tool call and you need 6000 tok/s prefill or 1200 tok/s decode speeds. The throughput enables real-time interactive agents that cloud APIs cannot match.
You need offline-first agents for mobile, embedded, or air-gapped environments where network access is unreliable or prohibited. This is critical for institutional trading desks operating in secure facilities or mobile traders in regions with poor connectivity.
Your workflows involve ≤3 sequential tool calls. Beyond that threshold, the model’s shallow reasoning capacity collapses and you need external state management.
Zero marginal cost per call is a hard requirement. Needle eliminates API costs entirely, making it viable for high-frequency portfolio monitoring or continuous risk checks.
Schema validation error rates <15% are acceptable for your use case. Higher error rates indicate the distillation process did not fully capture validation logic.
Trading latency <50ms is critical and you operate in regions with unreliable connectivity. Sub-second trade signals and order validation without cloud round-trips reduce slippage and improve execution quality.
Privacy-sensitive financial workflows require on-device processing. Hedge funds, proprietary trading firms, and institutional investors can avoid exposing positions or strategies to third-party cloud providers.

Avoid Needle when:

Your agent needs multi-hop reasoning over >3 sequential tool calls. The model will lose track of intermediate results and produce incorrect invocations.
Tool discovery must work with vague or ambiguous natural language descriptions. Needle requires explicit JSON schemas in the prompt.
Your tool registry exceeds 50 functions. The model will hallucinate tool names or confuse similar schemas.
Your deployment target is a server or cloud environment where network access is reliable and cheap. Cloud APIs offer better reasoning depth with less engineering effort.
You lack the capacity to build custom orchestration, state management, and observability infrastructure. Needle is not a drop-in replacement for GPT-4.
Context windows >4K tokens are required for your tool documentation or conversation history.
Your compliance framework requires real-time centralized audit trails. Cloud-based agents simplify regulatory reporting by aggregating logs in a single system, while Needle requires manual sync of on-device logs to compliance infrastructure.

Needle is a specialized tool for edge deployments where latency, privacy, and cost constraints outweigh reasoning depth. If you can tolerate shallow reasoning (≤3 steps), explicit schemas (<50 tools), and limited context (<4K tokens), you get a fast, private, and cost-free agent runtime.

The 766-point HN response reflects growing demand for edge-deployed financial agents that eliminate cloud dependencies, API costs, and latency. Needle addresses this gap for fintech teams building offline-first trading systems, mobile portfolio managers, and embedded risk monitors.