Trigger.dev started as a Zapier alternative and pivoted hard into durable execution territory after developers kept asking for long-running workflows with retries. The V2 redesign positions it as a TypeScript-native alternative to Temporal, targeting teams that need workflow orchestration without adopting a workflow/activity programming model or managing a separate cluster. The V2 announcement drew 172 points (compared to 745 for V1), so the pivot reflects what existing users asked for rather than launch-day attention.
The architecture matters for agent builders because, without durable state, a 10-turn research agent loses its conversation history when an API timeout kills the process mid-execution. Tool calls fail, rate limits hit, context windows overflow, and you need execution guarantees that survive process crashes. Trigger.dev’s approach trades Temporal’s battle-tested runtime for a simpler TypeScript surface that runs on your existing Node infrastructure.
Execution Model: Tasks as Durable Functions
Trigger.dev wraps your code in a task() primitive that handles state persistence automatically. Each task gets:
- Automatic checkpointing: State snapshots after every await boundary
- Idempotency keys: Deduplication across retries using task ID + run ID
- Retry policies: Exponential backoff with configurable max attempts
- Queue isolation: Named queues with concurrency limits
The runtime serializes execution context to Postgres between steps. When a task crashes or times out, the next worker picks up from the last checkpoint. This works for agent workflows where you need to preserve conversation history, tool call results, and partial outputs across failures.
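// Example pattern: research agent with tool calling
// Adapted from Trigger.dev documentation
import { task } from "@trigger.dev/sdk/v3";
import { generateText, type CoreMessage } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
// Assumed local modules: tool definitions without `execute`, and a registry that runs them
import { search, browse, analyze } from "./tools";
import { toolRegistry } from "./tool-registry";

export const researchAgent = task({
  id: "research-agent",
  retry: {
    maxAttempts: 3,
    factor: 2,
    minTimeoutInMs: 1000,
  },
  queue: {
    name: "research",
    concurrencyLimit: 5,
  },
  run: async ({ topic }: { topic: string }) => {
    const messages: CoreMessage[] = [
      { role: "user", content: `Research: ${topic}` },
    ];
    for (let i = 0; i < 10; i++) {
      // Checkpoint happens here automatically
      const { text, toolCalls, steps, response } = await generateText({
        model: anthropic("claude-opus-4-20250514"),
        system: "You are a research assistant with web access.",
        messages,
        tools: { search, browse, analyze },
        maxSteps: 5,
      });
      if (!toolCalls.length) {
        return { summary: text, stepsUsed: steps.length };
      }
      // Keep the assistant turn (including its tool calls) in the history
      messages.push(...response.messages);
      // Each tool call is awaited separately, so each gets its own checkpoint
      // Your tool registry handles the actual execution logic
      for (const call of toolCalls) {
        const result = await toolRegistry.execute(call.toolName, call.args);
        messages.push({
          role: "tool",
          content: [{ type: "tool-result", toolCallId: call.toolCallId, toolName: call.toolName, result }],
        });
      }
    }
    // Fail loudly instead of returning undefined when the turn budget runs out
    throw new Error("Research agent exceeded 10 turns without a final answer");
  },
});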
State Persistence: What Gets Saved and When
Trigger.dev checkpoints at every await expression. The serialization boundary includes:
- Function arguments and local variables
- Async operation results (API responses, database queries)
- Error stack traces for failed steps
- Timing metadata for observability
This creates a replay log similar to Temporal’s event history, but stored as JSON in Postgres rather than a custom event store. The trade-off: simpler infrastructure at the cost of storage efficiency. A 100-step agent workflow writes 100+ checkpoint rows (approximately 500KB total) compared to Temporal’s event-sourced approach (approximately 50KB for the same workflow).
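The replay log described above can be illustrated with a minimal sketch. This is not Trigger.dev's implementation: `ReplayLog`, `runStep`, and the in-memory `Map` standing in for the Postgres table are hypothetical names chosen for illustration. Each step result is stored as JSON under a step key, and a later attempt reads completed steps back instead of re-executing them.

```typescript
// Hypothetical sketch of a checkpoint/replay log. An in-memory Map stands in
// for the Postgres table; none of these names come from Trigger.dev.
class ReplayLog {
  private rows = new Map<string, string>(); // step key -> JSON-encoded result

  // Run `fn` once per step key; later attempts reuse the stored result.
  async runStep<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const saved = this.rows.get(key);
    if (saved !== undefined) {
      return JSON.parse(saved) as T; // resume: deserialize instead of re-running
    }
    const result = await fn();
    this.rows.set(key, JSON.stringify(result)); // checkpoint once the step resolves
    return result;
  }

  get size(): number {
    return this.rows.size;
  }
}

// Counters let us observe how often each step body actually executes.
let searchCalls = 0;
let summarizeCalls = 0;

// A two-step workflow whose second step fails on its first execution.
async function workflow(log: ReplayLog): Promise<string> {
  const hits = await log.runStep("search", async () => {
    searchCalls++;
    return ["doc-a", "doc-b"];
  });
  return log.runStep("summarize", async () => {
    summarizeCalls++;
    if (summarizeCalls === 1) throw new Error("simulated timeout");
    return `summary of ${hits.length} documents`;
  });
}

// Re-run the whole workflow against the same log until it succeeds.
async function runWithReplay(log: ReplayLog, maxAttempts: number): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await workflow(log);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

On the second attempt the `search` step is served from the log, so only the failed `summarize` step executes again. One row per step is also why storage grows linearly with step count.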
Checkpoint Storage Comparison
| Component | Trigger.dev | Temporal |
|---|---|---|
| Storage backend | Postgres (JSONB) | Cassandra/MySQL/Postgres (custom schema) |
| Checkpoint granularity | Per await | Per activity/signal |
| Replay mechanism | Deserialize + resume | Event sourcing |
| Size limit | 1MB per checkpoint | 2MB per payload, 50MB event history |
| Query performance | Standard SQL | Optimized for workflow queries |
The Postgres dependency means you can query workflow state with normal SQL, which helps debugging. But it also means checkpoint writes compete with your application’s database load unless you use a separate instance.
Retry Logic and Failure Boundaries
Trigger.dev distinguishes between transient and permanent failures using error types:
- Retryable errors: Network timeouts, rate limits, 5xx responses
- Permanent errors: 4xx responses, validation failures, explicit throws
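The two categories above can be sketched as a small classifier. `HttpError`, `classifyFailure`, and the `TimeoutError` name check are hypothetical and are not part of the Trigger.dev SDK. Note one wrinkle the lists gloss over: a rate limit usually arrives as HTTP 429, which is a 4xx status, so this sketch treats 429 as retryable before falling back to the 4xx rule.

```typescript
// Hypothetical classifier mirroring the retryable/permanent lists above.
type FailureKind = "retryable" | "permanent";

class HttpError extends Error {
  readonly statusCode: number;

  constructor(statusCode: number, message: string) {
    super(message);
    this.name = "HttpError";
    this.statusCode = statusCode;
  }
}

function classifyFailure(err: unknown): FailureKind {
  if (err instanceof HttpError) {
    if (err.statusCode === 429) return "retryable"; // rate limit
    if (err.statusCode >= 500) return "retryable"; // server-side failure
    return "permanent"; // remaining 4xx: the request itself is wrong
  }
  if (err instanceof Error && err.name === "TimeoutError") {
    return "retryable"; // network timeout
  }
  return "permanent"; // validation failures and explicit throws
}
```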
You control retry behavior per task or per operation:
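// Task-level retry policy
import { task, AbortTaskRunError } from "@trigger.dev/sdk/v3";
// Assumed local helper that throws when the response shape is wrong
import { validateSchema } from "./schema";

export const apiIntegration = task({
  id: "api-integration",
  retry: {
    maxAttempts: 5,
    factor: 2,
    randomize: true, // Jitter to prevent thundering herd
  },
  run: async () => {
    // Network errors thrown by fetch are retried with exponential backoff
    const response = await fetch("https://api.example.com/data");
    // 4xx responses (other than rate limits) will not succeed on retry
    if (response.status >= 400 && response.status < 500 && response.status !== 429) {
      throw new AbortTaskRunError(`Request rejected with status ${response.status}`);
    }
    // fetch does not throw on 429 or 5xx, so throw to trigger a retry
    if (!response.ok) {
      throw new Error(`Upstream returned ${response.status}`);
    }
    const data = await response.json();
    // Validation failures are permanent: abort instead of retrying
    try {
      return validateSchema(data);
    } catch (err) {
      throw new AbortTaskRunError(`Invalid response: ${String(err)}`);
    }
  },
});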
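To see what a policy like this means in wall-clock terms, the sketch below computes the wait before each retry. `backoffDelays` and `withJitter` are hypothetical helpers, not SDK functions, and the jitter shown (a uniform pick between zero and the computed delay) is one common strategy rather than a statement of what `randomize: true` does internally.

```typescript
// Hypothetical helper: the delay before each retry under exponential backoff.
// The first attempt runs immediately, so there are maxAttempts - 1 waits.
function backoffDelays(
  maxAttempts: number,
  minTimeoutMs: number,
  factor: number,
  maxTimeoutMs: number = Infinity,
): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < maxAttempts - 1; retry++) {
    const raw = minTimeoutMs * Math.pow(factor, retry);
    delays.push(Math.min(raw, maxTimeoutMs)); // cap each delay at the maximum
  }
  return delays;
}

// "Full jitter": pick a random delay between 0 and the computed backoff.
// The random source is injectable so the behavior can be tested.
function withJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * delayMs);
}
```

With a 1000ms base and a factor of 2, four attempts produce waits of 1000ms, 2000ms, and 4000ms. Without a cap, the delay before a fifth attempt would already be 8000ms.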
For agent workflows with parallel tool calls, you can use Promise.allSettled() to continue execution even when some tools fail:
const toolResults = await Promise.allSettled(
  toolCalls.map((call) => toolRegistry.execute(call.toolName, call.args))
);
// Keep successful results, log failures
const successful = [];
for (const r of toolResults) {
  if (r.status === "fulfilled") {
    successful.push(r.value);
  } else {
    console.warn("Tool call failed:", r.reason);
  }
}
Concurrency Control and Backpressure
Agent workflows often spawn hundreds of parallel operations (embedding chunks, API calls, database writes). Trigger.dev handles backpressure through:
- Queue concurrency limits: Max concurrent tasks per queue
- Global concurrency: Max tasks across all queues
- Rate limiting: Per-integration throttling
The queue system uses a pull model. Workers poll for tasks, execute them, and checkpoint results. When queues fill up, new tasks wait in Postgres rather than crashing your application.
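// Import path assumes the v3 SDK entry point
import { queue, task } from "@trigger.dev/sdk/v3";
import OpenAI from "openai";

const openai = new OpenAI(); // Reads OPENAI_API_KEY from the environment

// Limit concurrent embedding runs to avoid rate limits
export const embedQueue = queue({
  name: "embeddings",
  concurrencyLimit: 10,
});

export const embedDocument = task({
  id: "embed-document",
  queue: embedQueue,
  run: async ({ chunks }: { chunks: string[] }) => {
    // At most 10 runs of this task execute at once. The queue does not
    // throttle the requests inside a single run, so keep chunk lists small.
    const embeddings = await Promise.all(
      chunks.map((chunk) =>
        openai.embeddings.create({
          model: "text-embedding-3-small",
          input: chunk,
        })
      )
    );
    return embeddings;
  },
});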
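Since queue concurrency limits count tasks, a single run that fans out with `Promise.all` may still issue every request at once. A small in-run limiter can bound that. `mapWithLimit` below is a hypothetical helper written for this article, not an SDK function.

```typescript
// Hypothetical helper: map over `items` with at most `limit` calls in flight.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each worker claims the next unprocessed index until the input is exhausted.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index], index);
    }
  }

  // Never start more workers than there are items, and always at least one.
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // results keep the input order
}
```

Inside a task, `mapWithLimit(chunks, 5, embedChunk)` would keep five embedding requests open at a time, where `embedChunk` is whatever function wraps your embedding call.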
Type Safety Boundaries
TypeScript types flow through task definitions, but serialization breaks type guarantees at runtime. The boundary issues:
- Class instances: Don’t survive serialization. Use plain objects.
- Functions: Can’t be checkpointed. Pass data, not callbacks.
- Circular references: Break JSON serialization. Flatten your data structures.
- Dates: Serialize as ISO strings, require manual parsing.
This matters when integrating with LLM SDKs that return complex objects. You need to extract serializable data before checkpointing:
// Bad: Class instance won't survive checkpoint
const response = await openai.chat.completions.create({...});
await someAsyncOperation(); // Checkpoint here loses response methods
// Good: Extract data immediately
const { choices, usage } = await openai.chat.completions.create({...});
const message = choices[0].message.content;
await someAsyncOperation(); // Checkpoint preserves plain data
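The boundary issues listed above are easy to reproduce with a plain JSON round trip, which this sketch uses as a stand-in for a checkpoint. The `Conversation` class is hypothetical, and the assumption that checkpoints behave like `JSON.stringify` followed by `JSON.parse` comes from this article's description rather than from inspecting the runtime.

```typescript
// A class with a method and a Date field, to show what a JSON round trip keeps.
class Conversation {
  turns: string[] = [];
  startedAt: Date = new Date("2024-01-01T00:00:00.000Z");

  add(turn: string): void {
    this.turns.push(turn);
  }
}

const original = new Conversation();
original.add("hello");

// Simulate a checkpoint: serialize, then deserialize.
const restored = JSON.parse(JSON.stringify(original));

// The prototype is gone, so the method no longer exists on the restored object.
const methodsSurvive = typeof restored.add === "function";
// The Date came back as an ISO string and needs manual parsing.
const dateType = typeof restored.startedAt;
const reparsedDate = new Date(restored.startedAt);

// Circular references cannot be serialized at all.
const loop: { self?: unknown } = {};
loop.self = loop;
let circularFails = false;
try {
  JSON.stringify(loop);
} catch {
  circularFails = true;
}
```

Plain data (`turns`) survives intact, which is why extracting primitives and plain objects before the next await is the safer pattern.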
Observability and Debugging
The dashboard shows:
- Real-time task execution with step-by-step traces
- Retry history with error messages
- Queue depth and concurrency metrics
- Checkpoint sizes and timing
Each task run gets a unique URL with full execution history. You can replay failed runs from any checkpoint, which helps debugging non-deterministic failures in agent workflows.
The tracing format follows OpenTelemetry conventions, so you can export spans to Datadog, Honeycomb, or your existing observability stack.
Deployment Shape
Trigger.dev runs as:
- Managed cloud: Hosted workers, you deploy task code
- Self-hosted: Docker containers, you manage Postgres and workers
- Hybrid: Cloud control plane, self-hosted workers for data residency
The managed option handles scaling automatically. Workers spin up based on queue depth, with configurable min/max instances. Self-hosted deployments require you to manage worker autoscaling and database capacity.
For agent workloads with spiky traffic (batch processing, scheduled research tasks), the managed option avoids over-provisioning. For workflows that process sensitive data, self-hosted keeps everything in your VPC.
Comparison: Trigger.dev vs Temporal
Infrastructure trade-offs that affect agent workflow scalability and operational complexity:
| Dimension | Trigger.dev | Temporal |
|---|---|---|
| Language | TypeScript only | Go, Java, Python, TypeScript, PHP, .NET |
| Runtime | Node.js/Bun | Go server + language SDKs |
| State storage | Postgres | Cassandra/MySQL/Postgres |
| Deployment | Managed or Docker | Kubernetes cluster |
| Learning curve | Familiar async/await | Workflow/activity split |
| Type safety | TypeScript native | SDK-dependent |
| Ecosystem | Growing | Mature, battle-tested |
Temporal’s Go runtime provides stronger execution guarantees and better performance at scale (10k+ concurrent workflows). Trigger.dev’s TypeScript-native approach reduces operational complexity for teams already running Node services.
The execution model differs fundamentally. Temporal requires splitting workflows (orchestration logic) from activities (side effects). Trigger.dev lets you write normal async TypeScript, which feels more natural but gives you less control over determinism.
When Type Safety Breaks Down
The serialization boundary creates runtime risks that TypeScript can’t catch:
- Non-serializable data: Classes, functions, symbols
- Large payloads: Checkpoints over 1MB fail silently
- Circular references: Crash during JSON.stringify
- Prototype loss: Objects lose methods after deserialization
You need runtime validation at checkpoint boundaries:
import { task } from "@trigger.dev/sdk/v3";
import { z } from "zod";

const CheckpointSchema = z.object({
  messages: z.array(z.object({
    role: z.enum(["user", "assistant", "tool"]),
    content: z.string(),
  })),
  metadata: z.record(z.unknown()),
});

export const agentTask = task({
  id: "agent",
  run: async (input: unknown) => {
    // Validate the incoming payload before doing any work
    let state = CheckpointSchema.parse(input);
    // someOperation is a placeholder for any awaited step
    await someOperation();
    // Re-validate after each step so bad state fails loudly instead of silently
    state = CheckpointSchema.parse(state);
    return state;
  },
});
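Schema validation catches shape errors but not size. A byte-count guard, sketched below, turns a silent oversize failure into an explicit one. `checkpointBytes` and `assertCheckpointFits` are hypothetical helpers, and the default limit is simply the 1MB figure quoted in this article.

```typescript
// Limit taken from the 1MB-per-checkpoint figure quoted in this article.
const MAX_CHECKPOINT_BYTES = 1024 * 1024;

// Size of the state as UTF-8 encoded JSON. JSON.stringify throws on circular
// references, so that failure mode surfaces here as well.
function checkpointBytes(state: unknown): number {
  const json = JSON.stringify(state);
  return new TextEncoder().encode(json).length;
}

// Throw before the state reaches the checkpoint boundary, not after.
function assertCheckpointFits(state: unknown, limit: number = MAX_CHECKPOINT_BYTES): number {
  const bytes = checkpointBytes(state);
  if (bytes > limit) {
    throw new Error(`checkpoint is ${bytes} bytes, over the ${limit}-byte limit`);
  }
  return bytes;
}
```

Measuring encoded bytes rather than string length matters for non-ASCII content, where one character can occupy several bytes.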
Likely Failure Modes
Agent workflows on Trigger.dev fail when:
- Checkpoint size exceeds limits: Large conversation histories or tool outputs
- Non-deterministic code: Random IDs, timestamps, or external state
- Database connection exhaustion: Too many concurrent checkpoints
- Memory leaks in long-running tasks: Node.js heap fills up over hours
- Serialization errors: Unhandled circular references or class instances
The checkpoint-per-await model amplifies these issues. A 10-turn research agent with 5 tool calls per turn generates 50 checkpoints. At approximately 50KB per checkpoint, total state reaches 2.5MB across those rows. The per-checkpoint limit is the sharper constraint: if each checkpoint carries the full conversation history and that history grows by about 50KB per turn, a single checkpoint reaches the 1MB limit around turn 20. Monitor checkpoint sizes and split large workflows into smaller tasks when conversation history exceeds 500KB.
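The figures above can be checked with a short calculation, followed by one possible mitigation. The model is an assumption: every checkpoint carries the full conversation history, which grows by a fixed amount per turn, and sizes use decimal units (1KB = 1,000 bytes). All three function names are hypothetical.

```typescript
// Total bytes written across all checkpoints of a workflow.
function totalCheckpointBytes(turns: number, callsPerTurn: number, bytesPerCheckpoint: number): number {
  return turns * callsPerTurn * bytesPerCheckpoint;
}

// First turn at which the accumulated history reaches the per-checkpoint limit.
function turnAtLimit(bytesPerTurn: number, limitBytes: number): number {
  return Math.ceil(limitBytes / bytesPerTurn);
}

// Mitigation sketch: keep only the most recent messages that fit a byte budget.
function pruneMessages(messages: string[], budgetBytes: number): string[] {
  const encoder = new TextEncoder();
  const kept: string[] = [];
  let used = 0;
  // Walk backwards so the newest messages are considered first.
  for (let i = messages.length - 1; i >= 0; i--) {
    const size = encoder.encode(messages[i]).length;
    if (used + size > budgetBytes) break;
    kept.unshift(messages[i]); // preserve chronological order
    used += size;
  }
  return kept;
}
```

`totalCheckpointBytes(10, 5, 50_000)` gives 2,500,000 bytes, and `turnAtLimit(50_000, 1_000_000)` gives 20, matching the numbers in the paragraph above. Dropping the oldest messages is the simplest pruning policy; summarizing them instead would preserve more context at the cost of an extra model call.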
Technical Verdict
Use Trigger.dev when:
- Your agent workflows run 5-15 turns with moderate tool call volume (under 20 calls per turn) and checkpoint sizes stay under 500KB. The Postgres storage model handles this range efficiently without hitting row size limits or creating write contention. Beyond 20 turns or 1MB total state, you’ll need to implement manual state pruning or switch to Temporal’s event sourcing.
- You need automatic retry logic for API timeouts and rate limits without writing custom error handling. A research agent that calls 3 external APIs per turn benefits from Trigger.dev’s exponential backoff (1s, 2s, 4s delays) and automatic checkpoint recovery. This eliminates 50+ lines of manual retry code per workflow.
- Your team already runs Node.js services and wants to avoid deploying a separate Temporal cluster (3+ services: frontend, history, matching). Trigger.dev adds one Postgres table and reuses your existing database infrastructure. Self-hosting Temporal requires Kubernetes expertise and 2-4 GB RAM minimum for the Go runtime.
- You process 100-1000 workflows per hour with predictable traffic patterns. Trigger.dev’s managed workers scale based on queue depth (5-second polling interval), which works for batch automation but adds latency for real-time agent interactions. Temporal’s push-based model provides sub-second task dispatch.
Avoid Trigger.dev when:
- Your agent generates conversation histories over 1MB per workflow or runs 50+ turns with complex tool outputs. A document analysis agent that processes 100-page PDFs with embeddings per paragraph hits Postgres checkpoint limits quickly. Temporal’s event sourcing compresses state more efficiently (10:1 ratio for large workflows) and allows event histories up to 50MB, with a 2MB cap on each individual payload.
- You require deterministic replay for audit compliance or debugging. Trigger.dev’s checkpoint deserialization can break on non-deterministic code (Date.now(), Math.random(), external API calls without idempotency keys). Temporal’s event sourcing guarantees identical replay by isolating side effects in activities. This matters for financial workflows or regulated industries.
- You need sub-500ms latency for real-time agent interactions. Trigger.dev’s checkpoint-per-await model adds 50-200ms serialization overhead per step, plus Postgres write latency (10-50ms). A customer support agent that responds in under 1 second cannot afford this overhead. Temporal’s in-memory state and optimized runtime provide 10-50ms task dispatch.
- Your agent system spans multiple languages (Python ML inference, Go data processing, TypeScript orchestration). Trigger.dev locks you into TypeScript for all workflow logic. Temporal supports polyglot workflows where each activity runs in its native language, which matters for teams with existing ML pipelines or data infrastructure.
- You need fine-grained control over retry boundaries. Trigger.dev attaches the retry policy to the task, so 10 tool calls inside one task share a single attempt budget and backoff schedule; giving each call its own policy means splitting it into its own task. Temporal’s activity isolation lets you retry individual operations independently. This reduces wasted compute for expensive operations like embedding generation or video processing.
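The determinism concern in the list above comes down to values such as `Date.now()` and `Math.random()` changing between the original run and a replay. One general technique, sketched here, is to record each such value the first time and serve the recording on replay. `SideEffectLog` and `makeRunId` are hypothetical names, and this is an illustration of the idea rather than either product's mechanism.

```typescript
// Hypothetical sketch: record non-deterministic values on the first run and
// replay them afterwards, so re-execution sees identical inputs.
class SideEffectLog {
  private entries: unknown[] = [];
  private cursor = 0;

  // Return the recorded value at this position if one exists; otherwise
  // produce a fresh value and record it.
  capture<T>(produce: () => T): T {
    if (this.cursor < this.entries.length) {
      return this.entries[this.cursor++] as T;
    }
    const value = produce();
    this.entries.push(value);
    this.cursor++;
    return value;
  }

  // Reset the read position before replaying the same code path.
  rewind(): void {
    this.cursor = 0;
  }
}

// Code under replay must route every non-deterministic call through the log.
function makeRunId(log: SideEffectLog): string {
  const timestamp = log.capture(() => Date.now());
  const nonce = log.capture(() => Math.random().toString(36).slice(2, 8));
  return `${timestamp}-${nonce}`;
}
```

Replay is only as deterministic as the discipline of routing calls through the log: a single direct `Date.now()` elsewhere in the code path reintroduces divergence, and the captures must happen in the same order on every run.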
The checkpoint-per-await model works well for scheduled automation (nightly report generation, batch document processing, email sequences) where 5-10 second latency is acceptable and state stays under 500KB. For real-time agent systems or workflows that accumulate multi-MB state, Temporal’s event sourcing and optimized runtime provide better scalability. The TypeScript-native approach reduces operational overhead but constrains you to Node’s execution model and Postgres storage limits.
Source Links
- Trigger.dev V2 Announcement (172 points, 39 comments)
- Official Documentation
- GitHub Repository