Trigger.dev V2: How a Temporal Alternative Handles Long-Running Agent Workflows in TypeScript

Trigger.dev started as a Zapier alternative and pivoted hard into durable execution territory after developers kept asking for long-running workflows with retries. The V2 redesign positions it as a TypeScript-native alternative to Temporal, targeting teams that need workflow orchestration without learning Go or managing a separate cluster. The V2 announcement drew 172 points (compared to 745 for V1), so the case for the pivot rests on those user requests rather than on announcement traffic.

The architecture matters for agent builders because, without durable execution, a 10-turn research agent loses its conversation history when an API timeout occurs mid-execution. Tool calls fail, rate limits hit, context windows overflow, and you need execution guarantees that survive process crashes. Trigger.dev’s approach trades Temporal’s battle-tested runtime for a simpler TypeScript surface that runs on your existing Node infrastructure.

Execution Model: Tasks as Durable Functions

Trigger.dev wraps your code in a task() primitive that handles state persistence automatically. Each task gets:

Automatic checkpointing: State snapshots after every await boundary
Idempotency keys: Deduplication across retries using task ID + run ID
Retry policies: Exponential backoff with configurable max attempts
Queue isolation: Named queues with concurrency limits

The runtime serializes execution context to Postgres between steps. When a task crashes or times out, the next worker picks up from the last checkpoint. This works for agent workflows where you need to preserve conversation history, tool call results, and partial outputs across failures.
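
The checkpoint-and-resume behavior described above can be illustrated with a minimal in-memory sketch. This is not Trigger.dev's implementation: `CheckpointStore`, `createStepRunner`, and the step keys are hypothetical names, and a `Map` stands in for the Postgres table. Each step's result is saved as JSON under a key, and a second attempt against the same store reuses saved results instead of re-running completed steps.

```typescript
// Hypothetical sketch: a Map stands in for the Postgres checkpoint table.
type CheckpointStore = Map<string, string>;

// Returns a step() function that runs fn at most once per key per store.
function createStepRunner(store: CheckpointStore) {
  return function step<T>(key: string, fn: () => T): T {
    const saved = store.get(key);
    if (saved !== undefined) {
      // Completed in an earlier attempt: reuse the stored result.
      return JSON.parse(saved) as T;
    }
    const result = fn();
    // Persist as JSON, mirroring the serialization boundary.
    store.set(key, JSON.stringify(result));
    return result;
  };
}

// Simulated workflow: the second step fails on the first attempt only.
const store: CheckpointStore = new Map();
let searchCalls = 0;
let attempt = 0;

function runWorkflow(): string {
  attempt++;
  const step = createStepRunner(store);
  const hits = step("search", () => {
    searchCalls++;
    return ["result-a", "result-b"];
  });
  return step("summarize", () => {
    if (attempt === 1) throw new Error("simulated timeout");
    return `summary of ${hits.length} results`;
  });
}

let output: string;
try {
  output = runWorkflow();
} catch {
  // Retry: "search" is restored from the store instead of re-running.
  output = runWorkflow();
}
```

The same idea also explains why results must be JSON-serializable: anything that does not survive `JSON.stringify` cannot be restored on the second attempt.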

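// Example pattern: research agent with tool calling
// Adapted from Trigger.dev documentation
// Assumes search, browse, and analyze are tool definitions and toolRegistry
// is your own executor; none of them are shown here.
// Field names (toolName, args, CoreMessage) follow AI SDK 4.x.
import { task } from "@trigger.dev/sdk/v3"; // entry point that exports task()
import { generateText, type CoreMessage } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

export const researchAgent = task({
  id: "research-agent",
  retry: {
    maxAttempts: 3,
    factor: 2,
    minTimeoutInMs: 1000,
  },
  queue: {
    name: "research",
    concurrencyLimit: 5,
  },
  run: async ({ topic }: { topic: string }) => {
    const messages: CoreMessage[] = [
      { role: "user", content: `Research: ${topic}` },
    ];

    for (let i = 0; i < 10; i++) {
      // Checkpoint happens here automatically
      const { text, toolCalls, steps } = await generateText({
        model: anthropic("claude-opus-4-20250514"),
        system: "You are a research assistant with web access.",
        messages,
        tools: { search, browse, analyze },
        maxSteps: 5,
      });

      if (!toolCalls.length) {
        return { summary: text, stepsUsed: steps.length };
      }

      // Record the assistant's tool calls so each result has a matching request
      messages.push({
        role: "assistant",
        content: toolCalls.map((call) => ({
          type: "tool-call" as const,
          toolCallId: call.toolCallId,
          toolName: call.toolName,
          args: call.args,
        })),
      });

      // Your tool registry handles the actual execution logic.
      // A failure here falls under the task-level retry policy above.
      for (const call of toolCalls) {
        const result = await toolRegistry.execute(call.toolName, call.args);
        messages.push({
          role: "tool",
          content: [
            { type: "tool-result", toolCallId: call.toolCallId, toolName: call.toolName, result },
          ],
        });
      }
    }

    // Turn budget exhausted without a final answer
    return { summary: null, stepsUsed: 10 };
  },
});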

State Persistence: What Gets Saved and When

Trigger.dev checkpoints at every await expression. The serialization boundary includes:

Function arguments and local variables
Async operation results (API responses, database queries)
Error stack traces for failed steps
Timing metadata for observability

This creates a replay log similar to Temporal’s event history, but stored as JSON in Postgres rather than a custom event store. The trade-off: simpler infrastructure at the cost of storage efficiency. A 100-step agent workflow writes 100+ checkpoint rows (approximately 500KB total) compared to Temporal’s event-sourced approach (approximately 50KB for the same workflow).

Checkpoint Storage Comparison

Component	Trigger.dev	Temporal
Storage backend	Postgres (JSONB)	Cassandra/MySQL/Postgres (custom schema)
Checkpoint granularity	Per await	Per activity/signal
Replay mechanism	Deserialize + resume	Event sourcing
Size limit	1MB per checkpoint	2MB per payload; 50MB event history
Query performance	Standard SQL	Optimized for workflow queries

The Postgres dependency means you can query workflow state with normal SQL, which helps debugging. But it also means checkpoint writes compete with your application’s database load unless you use a separate instance.

Retry Logic and Failure Boundaries

Trigger.dev distinguishes between transient and permanent failures using error types:

Retryable errors: Network timeouts, rate limits (HTTP 429), 5xx responses
Permanent errors: Other 4xx responses, validation failures, errors you explicitly mark as non-retryable
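
As a rough illustration of that split, the sketch below maps an HTTP status code to a failure category. `classifyStatus` and `Outcome` are hypothetical names, not part of the Trigger.dev API. The sketch treats 429 and 408 as retryable even though they are 4xx codes, since rate limits and request timeouts usually arrive with those statuses.

```typescript
type Outcome = "ok" | "retryable" | "permanent";

// Hypothetical helper: maps an HTTP status code to a failure category.
function classifyStatus(statusCode: number): Outcome {
  if (statusCode < 400) return "ok";
  // 429 (Too Many Requests) and 408 (Request Timeout) are transient despite being 4xx.
  if (statusCode === 429 || statusCode === 408) return "retryable";
  // Server-side failures are worth another attempt.
  if (statusCode >= 500 && statusCode <= 599) return "retryable";
  // Remaining 4xx codes indicate a request that will not succeed if repeated.
  return "permanent";
}

const outcomes = [200, 404, 429, 503].map(classifyStatus);
```

Network-level failures such as DNS errors or socket timeouts carry no status code at all; a fuller version would classify those as retryable before this function is reached.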

You control retry behavior per task or per operation:

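// Task-level retry policy
// validateSchema is assumed to be defined elsewhere and to throw on invalid data.
import { task, AbortTaskRunError } from "@trigger.dev/sdk/v3";

export const apiIntegration = task({
  id: "api-integration",
  retry: {
    maxAttempts: 5,
    factor: 2,
    randomize: true, // Jitter to prevent thundering herd
  },
  run: async (payload) => {
    const response = await fetch("https://api.example.com/data");
    if (!response.ok) {
      // fetch does not reject on HTTP errors, so throw to trigger
      // a retry with exponential backoff
      throw new Error(`Upstream returned ${response.status}`);
    }
    const data = await response.json();

    try {
      return validateSchema(data);
    } catch (error) {
      // Invalid data will not fix itself: abort instead of retrying
      throw new AbortTaskRunError(`Validation failed: ${String(error)}`);
    }
  },
});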
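
To make the backoff arithmetic concrete, here is a small sketch that computes the delay before each retry from a policy. `RetryPolicy`, `backoffDelays`, and `withJitter` are hypothetical names chosen for illustration, and the jitter formula is one common scheme rather than a statement of what `randomize` does internally.

```typescript
interface RetryPolicy {
  maxAttempts: number;   // total attempts, including the first
  factor: number;        // multiplier applied after each failure
  minTimeoutMs: number;  // delay before the first retry
  maxTimeoutMs: number;  // upper bound on any single delay
}

// Delay before each retry (maxAttempts - 1 retries), without jitter.
function backoffDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maxAttempts - 1; retry++) {
    const raw = policy.minTimeoutMs * Math.pow(policy.factor, retry);
    delays.push(Math.min(raw, policy.maxTimeoutMs));
  }
  return delays;
}

// Assumed jitter scheme: scale the delay by a random factor in [1, 2).
function withJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.round(delayMs * (1 + random()));
}

const schedule = backoffDelays({
  maxAttempts: 4,
  factor: 2,
  minTimeoutMs: 1000,
  maxTimeoutMs: 30000,
});
```

With four attempts, a factor of 2, and a 1-second minimum, the three retries wait 1s, 2s, and 4s. Jitter spreads those delays so that many failed runs do not all retry at the same instant.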

For agent workflows with parallel tool calls, you can use Promise.allSettled() to continue execution even when some tools fail:

const toolResults = await Promise.allSettled(
  toolCalls.map(call => toolRegistry.execute(call.toolName, call.args))
);

// Keep successful results, log failures
const successful = [];
for (const r of toolResults) {
  if (r.status === "fulfilled") successful.push(r.value);
  else console.error("Tool call failed:", r.reason);
}
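
A typed helper keeps the success and failure paths separate without losing the rejection reasons. The sketch below is one way to write it; `partitionSettled` and `Partitioned` are hypothetical names.

```typescript
interface Partitioned<T> {
  values: T[];       // results of fulfilled promises, in input order
  errors: unknown[]; // rejection reasons, in input order
}

// Splits Promise.allSettled output into successes and failures.
function partitionSettled<T>(results: PromiseSettledResult<T>[]): Partitioned<T> {
  const values: T[] = [];
  const errors: unknown[] = [];
  for (const r of results) {
    if (r.status === "fulfilled") {
      values.push(r.value);
    } else {
      errors.push(r.reason);
    }
  }
  return { values, errors };
}

// Hand-built sample that has the same shape as Promise.allSettled output.
const sample: PromiseSettledResult<number>[] = [
  { status: "fulfilled", value: 1 },
  { status: "rejected", reason: new Error("rate limited") },
  { status: "fulfilled", value: 3 },
];
const partitioned = partitionSettled(sample);
```

Returning the errors alongside the values lets the agent decide whether a partial result is good enough to continue or whether the turn should fail.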

Concurrency Control and Backpressure

Agent workflows often spawn hundreds of parallel operations (embedding chunks, API calls, database writes). Trigger.dev handles backpressure through:

Queue concurrency limits: Max concurrent tasks per queue
Global concurrency: Max tasks across all queues
Rate limiting: Per-integration throttling

The queue system uses a pull model. Workers poll for tasks, execute them, and checkpoint results. When queues fill up, new tasks wait in Postgres rather than crashing your application.

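import { queue, task } from "@trigger.dev/sdk/v3";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Limit concurrent embedding runs to avoid rate limits
export const embedQueue = queue({
  name: "embeddings",
  concurrencyLimit: 10,
});

export const embedDocument = task({
  id: "embed-document",
  queue: embedQueue,
  run: async ({ chunks }: { chunks: string[] }) => {
    // At most 10 runs of this task execute at once. Within a single run,
    // Promise.all still sends every chunk's request at the same time.
    const embeddings = await Promise.all(
      chunks.map(chunk => openai.embeddings.create({
        model: "text-embedding-3-small",
        input: chunk,
      }))
    );
    return embeddings;
  },
});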
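
A queue-level limit caps how many task runs execute at once; it does not throttle calls made inside a single run. If one run receives hundreds of chunks, bounding the fan-out inside the run is a separate step. The sketch below shows one simple approach, processing items in fixed-size batches. `toBatches` and `mapInBatches` are hypothetical helpers, not Trigger.dev APIs.

```typescript
// Splits items into consecutive batches of at most `size` elements.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Runs `worker` over one batch at a time, so at most `size` calls are in flight.
async function mapInBatches<T, R>(
  items: T[],
  size: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of toBatches(items, size)) {
    // Wait for the whole batch before starting the next one.
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}

const batches = toBatches([1, 2, 3, 4, 5], 2);
```

Batching is simpler than a sliding-window limiter but leaves capacity idle while the slowest call in each batch finishes. For embeddings specifically, passing a whole batch as an array `input` in one request is another option.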

Type Safety Boundaries

TypeScript types flow through task definitions, but serialization breaks type guarantees at runtime. The boundary issues:

Class instances: Don’t survive serialization. Use plain objects.
Functions: Can’t be checkpointed. Pass data, not callbacks.
Circular references: Break JSON serialization. Flatten your data structures.
Dates: Serialize as ISO strings, require manual parsing.

This matters when integrating with LLM SDKs that return complex objects. You need to extract serializable data before checkpointing:

// Bad: Class instance won't survive checkpoint
const response = await openai.chat.completions.create({...});
await someAsyncOperation(); // Checkpoint here loses response methods

// Good: Extract data immediately
const { choices, usage } = await openai.chat.completions.create({...});
const message = choices[0].message.content;
await someAsyncOperation(); // Checkpoint preserves plain data
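
The effect of the serialization boundary is easy to reproduce with a plain JSON round trip, which is a reasonable stand-in for a JSON-based checkpoint. In the sketch below, `ChatResult` is a hypothetical class used only to show what is lost.

```typescript
class ChatResult {
  text: string;
  createdAt: Date;

  constructor(text: string, createdAt: Date) {
    this.text = text;
    this.createdAt = createdAt;
  }

  wordCount(): number {
    return this.text.split(/\s+/).length;
  }
}

const original = new ChatResult(
  "hello durable world",
  new Date("2024-01-01T00:00:00.000Z"),
);

// Simulate a checkpoint: serialize to JSON, then restore.
const restored = JSON.parse(JSON.stringify(original));

// The prototype is gone, so methods no longer exist on the restored object.
const hasMethod = typeof restored.wordCount === "function";

// The Date became an ISO string and must be parsed back by hand.
const dateType = typeof restored.createdAt;
const revived = new Date(restored.createdAt);
```

After the round trip, `hasMethod` is `false` and `dateType` is `"string"`. Data fields such as `text` survive unchanged, which is why extracting plain data before the next `await` is the safer pattern.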

Observability and Debugging

The dashboard shows:

Real-time task execution with step-by-step traces
Retry history with error messages
Queue depth and concurrency metrics
Checkpoint sizes and timing

Each task run gets a unique URL with full execution history. You can replay failed runs from any checkpoint, which helps debugging non-deterministic failures in agent workflows.

The tracing format follows OpenTelemetry conventions, so you can export spans to Datadog, Honeycomb, or your existing observability stack.

Deployment Shape

Trigger.dev runs as:

Managed cloud: Hosted workers, you deploy task code
Self-hosted: Docker containers, you manage Postgres and workers
Hybrid: Cloud control plane, self-hosted workers for data residency

The managed option handles scaling automatically. Workers spin up based on queue depth, with configurable min/max instances. Self-hosted deployments require you to manage worker autoscaling and database capacity.
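
For self-hosted deployments, the scaling decision can be approximated with a simple rule: size the worker pool to the queue depth, clamped to a minimum and maximum. The sketch below is an assumption about how such a rule might look, not the managed service's actual algorithm; `ScalingConfig`, `desiredWorkers`, and `runsPerWorker` are hypothetical names.

```typescript
interface ScalingConfig {
  minWorkers: number;    // floor kept warm even when the queue is empty
  maxWorkers: number;    // ceiling that protects the database and budget
  runsPerWorker: number; // how many queued runs one worker should absorb
}

// Worker count needed for the current queue depth, clamped to [min, max].
function desiredWorkers(queueDepth: number, cfg: ScalingConfig): number {
  const needed = Math.ceil(queueDepth / cfg.runsPerWorker);
  return Math.min(cfg.maxWorkers, Math.max(cfg.minWorkers, needed));
}

const cfg: ScalingConfig = { minWorkers: 1, maxWorkers: 10, runsPerWorker: 5 };
const idle = desiredWorkers(0, cfg);
const moderate = desiredWorkers(23, cfg);
const spike = desiredWorkers(500, cfg);
```

The maximum matters as much as the minimum: every extra worker also adds checkpoint writes, so the ceiling should reflect database capacity as well as compute cost.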

For agent workloads with spiky traffic (batch processing, scheduled research tasks), the managed option avoids over-provisioning. For workflows that process sensitive data, self-hosted keeps everything in your VPC.

Comparison: Trigger.dev vs Temporal

Infrastructure trade-offs that affect agent workflow scalability and operational complexity:

Dimension	Trigger.dev	Temporal
Language	TypeScript only	Go, Java, TypeScript, Python, PHP, .NET
Runtime	Node.js/Bun	Go server + language SDKs
State storage	Postgres	Cassandra/MySQL/Postgres
Deployment	Managed or Docker	Temporal Cloud or self-hosted cluster (often Kubernetes)
Learning curve	Familiar async/await	Workflow/activity split
Type safety	TypeScript native	SDK-dependent
Ecosystem	Growing	Mature, battle-tested

Temporal’s Go runtime provides stronger execution guarantees and better performance at scale (10k+ concurrent workflows). Trigger.dev’s TypeScript-native approach reduces operational complexity for teams already running Node services.

The execution model differs fundamentally. Temporal requires splitting workflows (orchestration logic) from activities (side effects). Trigger.dev lets you write normal async TypeScript, which feels more natural but gives you less control over determinism.

When Type Safety Breaks Down

The serialization boundary creates runtime risks that TypeScript can’t catch:

Non-serializable data: Classes, functions, symbols
Large payloads: Checkpoints over 1MB fail silently
Circular references: Crash during JSON.stringify
Prototype loss: Objects lose methods after deserialization

You need runtime validation at checkpoint boundaries:

import { task } from "@trigger.dev/sdk/v3";
import { z } from "zod";

const CheckpointSchema = z.object({
  messages: z.array(z.object({
    role: z.enum(["user", "assistant", "tool"]),
    content: z.string(),
  })),
  metadata: z.record(z.string(), z.unknown()),
});

export const agentTask = task({
  id: "agent",
  run: async (input: unknown) => {
    // Validate untrusted input once at the task boundary
    let state = CheckpointSchema.parse(input);

    // someOperation is assumed to be defined elsewhere
    await someOperation();

    // Re-validate after the await, before the state is used or returned
    state = CheckpointSchema.parse(state);

    return state;
  },
});
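
Schema validation checks shape, but it does not report why a value would fail to serialize. A dependency-free pre-flight check can flag the four risks listed above before a checkpoint is written. The sketch below is one possible approach; `findSerializationIssues` is a hypothetical helper.

```typescript
// Returns a list of problems that would break or silently alter a JSON checkpoint.
function findSerializationIssues(
  value: unknown,
  path: string = "$",
  ancestors: Set<object> = new Set<object>(),
): string[] {
  if (typeof value === "function") return [`${path}: function`];
  if (typeof value === "symbol") return [`${path}: symbol`];
  if (typeof value === "bigint") return [`${path}: bigint`];
  if (value === null || typeof value !== "object") return [];

  // Seeing an object that is already on the current path means a cycle.
  if (ancestors.has(value)) return [`${path}: circular reference`];
  ancestors.add(value);

  const issues: string[] = [];
  const proto = Object.getPrototypeOf(value);
  if (!Array.isArray(value) && proto !== Object.prototype && proto !== null) {
    // Class instances (including Date) lose their prototype after a round trip.
    issues.push(`${path}: class instance (${value.constructor.name})`);
  }
  for (const [key, child] of Object.entries(value)) {
    issues.push(...findSerializationIssues(child, `${path}.${key}`, ancestors));
  }

  // Remove on the way out so shared (non-circular) references are allowed.
  ancestors.delete(value);
  return issues;
}

const cyclic: Record<string, unknown> = { a: 1 };
cyclic.self = cyclic;
const cyclicIssues = findSerializationIssues(cyclic);
```

Running this in development or behind a debug flag keeps the cost out of hot paths while still catching a stray callback or SDK object before it reaches a checkpoint.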

Likely Failure Modes

Agent workflows on Trigger.dev fail when:

Checkpoint size exceeds limits: Large conversation histories or tool outputs
Non-deterministic code: Random IDs, timestamps, or external state
Database connection exhaustion: Too many concurrent checkpoints
Memory leaks in long-running tasks: Node.js heap fills up over hours
Serialization errors: Unhandled circular references or class instances

The checkpoint-per-await model amplifies these issues. A 10-turn research agent with 5 tool calls per turn generates 50 checkpoints. At approximately 50KB per checkpoint, total state reaches 2.5MB. The per-checkpoint limit is usually the tighter constraint: if each checkpoint carries the full conversation history and that history grows by roughly 50KB per turn, a single checkpoint reaches a 1MB limit around turn 20. Monitor checkpoint sizes and split large workflows into smaller tasks when conversation history exceeds 500KB.
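
Monitoring checkpoint size can start with measuring the JSON that would be written. The sketch below measures a state object in UTF-8 bytes and prunes the oldest messages once a budget is exceeded. `checkpointBytes`, `pruneHistory`, and the `Message` shape are hypothetical; the 500KB figure from the text would be passed as `maxBytes`.

```typescript
interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

// UTF-8 size of the JSON a checkpoint would store.
function checkpointBytes(state: unknown): number {
  return new TextEncoder().encode(JSON.stringify(state)).length;
}

// Drops the oldest messages after the first (the original request)
// until the history fits the budget or only the first message remains.
function pruneHistory(messages: Message[], maxBytes: number): Message[] {
  const pruned = [...messages];
  while (pruned.length > 1 && checkpointBytes(pruned) > maxBytes) {
    pruned.splice(1, 1);
  }
  return pruned;
}

// Example: one request followed by ten tool outputs of about 1KB each.
const history: Message[] = [{ role: "user", content: "Research: durable execution" }];
for (let i = 0; i < 10; i++) {
  history.push({ role: "tool", content: "x".repeat(1000) });
}
const trimmed = pruneHistory(history, 5000);
```

Dropping messages blindly can separate a tool result from the assistant turn that requested it, which some model APIs reject. A production version would prune whole turns or replace old turns with a summary.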

Technical Verdict

Use Trigger.dev when:

Your agent workflows run 5-15 turns with moderate tool call volume (under 20 calls per turn) and checkpoint sizes stay under 500KB. The Postgres storage model handles this range efficiently without hitting row size limits or creating write contention. Beyond 20 turns or 1MB total state, you’ll need to implement manual state pruning or switch to Temporal’s event sourcing.
You need automatic retry logic for API timeouts and rate limits without writing custom error handling. A research agent that calls 3 external APIs per turn benefits from Trigger.dev’s exponential backoff (1s, 2s, 4s delays) and automatic checkpoint recovery. This eliminates 50+ lines of manual retry code per workflow.
Your team already runs Node.js services and wants to avoid deploying a separate Temporal cluster (3+ services: frontend, history, matching). Trigger.dev adds one Postgres table and reuses your existing database infrastructure. Self-hosting Temporal requires Kubernetes expertise and 2-4 GB RAM minimum for the Go runtime.
You process 100-1000 workflows per hour with predictable traffic patterns. Trigger.dev’s managed workers scale based on queue depth (5-second polling interval), which works for batch automation but adds latency for real-time agent interactions. Temporal’s push-based model provides sub-second task dispatch.

Avoid Trigger.dev when:

Your agent generates conversation histories over 1MB per workflow or runs 50+ turns with complex tool outputs. A document analysis agent that processes 100-page PDFs with embeddings per paragraph hits Postgres checkpoint limits quickly. Temporal’s event sourcing compresses state more efficiently (10:1 ratio for large workflows) and allows event histories up to 50MB, with individual payloads limited to 2MB.
You require deterministic replay for audit compliance or debugging. Trigger.dev’s checkpoint deserialization can break on non-deterministic code (Date.now(), Math.random(), external API calls without idempotency keys). Temporal’s event sourcing guarantees identical replay by isolating side effects in activities. This matters for financial workflows or regulated industries.
You need sub-500ms latency for real-time agent interactions. Trigger.dev’s checkpoint-per-await model adds 50-200ms serialization overhead per step, plus Postgres write latency (10-50ms). A customer support agent that responds in under 1 second cannot afford this overhead. Temporal’s in-memory state and optimized runtime provide 10-50ms task dispatch.
Your agent system spans multiple languages (Python ML inference, Go data processing, TypeScript orchestration). Trigger.dev locks you into TypeScript for all workflow logic. Temporal supports polyglot workflows where each activity runs in its native language, which matters for teams with existing ML pipelines or data infrastructure.
You need fine-grained control over retry boundaries. Trigger.dev’s retry policy applies to the task as a whole, so a task with 10 tool calls can repeat all 10 when one fails unless you split the calls into separate tasks or wrap them individually. Temporal’s activity isolation lets you retry individual operations independently. This reduces wasted compute for expensive operations like embedding generation or video processing.

The checkpoint-per-await model works well for scheduled automation (nightly report generation, batch document processing, email sequences) where 5-10 second latency is acceptable and state stays under 500KB. For real-time agent systems or workflows that accumulate multi-MB state, Temporal’s event sourcing and optimized runtime provide better scalability. The TypeScript-native approach reduces operational overhead but constrains you to Node’s execution model and Postgres storage limits.

Source Links

Trigger.dev V2 Announcement (172 points, 39 comments)
Official Documentation
GitHub Repository