Trigger.dev started as a “Zapier alternative for developers” (745 HN points, February 2023). Eight months later, the team shipped V2 and repositioned as a “Temporal alternative for TypeScript” (172 points, October 2023). That pivot exposes a real infrastructure gap: developers want durable execution without learning Go or managing Temporal’s operational complexity.
The shift reveals what “durable execution” actually means when you strip away the workflow engine abstractions. It’s not about event routing or webhook fanout. It’s about persisting function call state, resuming execution after crashes, and handling retries without forcing developers to write explicit state machines.
What Changed Between V1 and V2
V1 focused on event-driven triggers. You connected integrations (GitHub, Slack, Stripe), defined event handlers, and Trigger.dev routed payloads. The model worked for short-lived webhooks but broke down for long-running jobs: multi-hour AI workflows, batch processing, or anything that needed to survive process restarts.
V2 introduced durable execution primitives:
- Task persistence: Function calls serialize to storage mid-execution
- Automatic retries: Exponential backoff without manual try/catch blocks
- Resumable state: Execution picks up from the last checkpoint after failure
- Observability hooks: Built-in tracing for jobs that span hours or days
The architecture change is fundamental. V1 was stateless request/response. V2 is stateful orchestration with explicit checkpoints.
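Trigger.dev applies retries for you, so the following is only a sketch of what "exponential backoff" means in practice: each failed attempt waits longer than the previous one, up to a cap. RetryOptions, backoffDelay, and withRetries are hypothetical names, and the numbers are illustrative rather than the platform's defaults. Production retry policies usually also add random jitter, which this sketch omits.

```typescript
// Minimal retry-with-exponential-backoff sketch. All names and values are
// hypothetical illustrations, not Trigger.dev's configuration surface.
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first one
  minDelayMs: number;  // wait before the first retry
  maxDelayMs: number;  // cap on any single wait
  factor: number;      // growth multiplier per retry
}

// Delay before retry number `retry` (0-based), capped at maxDelayMs.
function backoffDelay(retry: number, opts: RetryOptions): number {
  return Math.min(opts.maxDelayMs, opts.minDelayMs * opts.factor ** retry);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `fn` until it succeeds or attempts run out; rethrows the last error.
async function withRetries<T>(fn: (attempt: number) => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < opts.maxAttempts - 1) {
        await sleep(backoffDelay(attempt, opts)); // no wait after the final attempt
      }
    }
  }
  throw lastError;
}
```

With minDelayMs = 1000, factor = 2, and maxDelayMs = 30000, the waits before successive retries are 1 s, 2 s, 4 s, 8 s, and so on until they reach the 30 s cap.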
How Durable Execution Works in TypeScript
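Temporal achieves durability through event sourcing. Every workflow decision (activity call, timer, signal) appends an event to a history log. On replay, the runtime reconstructs state by re-running the workflow code against that history. This works, but workflow code must be deterministic so that replay makes the same decisions, and it runs inside Temporal's SDK runtime backed by a separately operated Temporal cluster.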
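To make the replay model concrete, here is a toy version of it. ReplayContext, step, and workflow are hypothetical names, not Temporal's SDK. The point is the mechanism: completed steps are recorded in a history log, and on replay the workflow function runs again from the top while each step returns its recorded result instead of repeating the side effect.

```typescript
// Toy model of event-sourced replay (not Temporal's API; names are hypothetical).
type HistoryEvent = { step: number; result: unknown };

class ReplayContext {
  private cursor = 0;
  private readonly history: HistoryEvent[];

  constructor(history: HistoryEvent[]) {
    this.history = history;
  }

  // Returns the recorded result if this step already ran; otherwise runs the
  // effect and appends its result to the history log.
  async step<T>(effect: () => Promise<T>): Promise<T> {
    const index = this.cursor++;
    const recorded = this.history.find((event) => event.step === index);
    if (recorded) return recorded.result as T;
    const result = await effect();
    this.history.push({ step: index, result });
    return result;
  }
}

// Workflow code re-executes from the top on every replay, so it must be
// deterministic: the same history has to produce the same sequence of steps.
async function workflow(ctx: ReplayContext, log: string[]): Promise<string> {
  const a = await ctx.step(async () => { log.push("fetch"); return 20; });
  const b = await ctx.step(async () => { log.push("compute"); return a + 22; });
  return `result=${b}`;
}
```

Running workflow a second time against the same history array returns "result=42" without pushing anything to the log, because both steps are answered from history. This is also why replayed code must be deterministic: if a branch depended on the current time or a random number, the second run could request a different step than the one recorded.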
Trigger.dev takes a different approach. Instead of replaying events, it checkpoints function state at specific boundaries:
- Entry point: Task function starts, initial state persists
- Await boundaries: Every await call checkpoints before and after
- Tool calls: External API calls serialize request/response pairs
- Completion: Final result persists with execution metadata
The runtime intercepts async operations and writes state snapshots to Postgres. On failure, it reloads the last checkpoint and resumes from the next await.
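```typescript
import { task } from "@trigger.dev/sdk/v3";

// Hypothetical helpers, implemented elsewhere in the project.
declare function downloadVideo(url: string): Promise<{ audioPath: string; duration: number }>;
declare function transcribeAudio(audioPath: string): Promise<{ text: string }>;
declare function generateSummary(text: string): Promise<string>;

export const processVideo = task({
  id: "process-video",
  run: async ({ videoUrl }: { videoUrl: string }) => {
    // Checkpoint 1: Task starts
    const download = await downloadVideo(videoUrl);
    // Checkpoint 2: Download completes
    const transcript = await transcribeAudio(download.audioPath);
    // Checkpoint 3: Transcription completes
    const summary = await generateSummary(transcript.text);
    // Checkpoint 4: Summary completes
    return { summary, duration: download.duration };
  },
});
```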
If the process crashes after transcription, the runtime skips downloadVideo and transcribeAudio on restart. It loads the cached transcript and continues with generateSummary.
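The resume behavior described above can be modeled in a few lines. This is a simplified sketch assuming each step's output is plain JSON and a checkpoint is written after every step; runDurably, InMemoryCheckpointStore, and the Map inside it are hypothetical stand-ins for the Postgres-backed runtime, not Trigger.dev's API. Because ordinary JavaScript cannot serialize a suspended async function's local variables, the sketch makes step boundaries explicit as an array of functions rather than intercepting await.

```typescript
// Simplified checkpoint-and-resume model. All names are hypothetical; an
// in-memory Map stands in for the Postgres table described in the text.
type Checkpoint<S> = { runId: string; nextStep: number; state: S };
type Step<S> = (state: S) => Promise<S>;

class InMemoryCheckpointStore<S> {
  private readonly rows = new Map<string, Checkpoint<S>>();
  load(runId: string): Checkpoint<S> | undefined {
    return this.rows.get(runId);
  }
  save(checkpoint: Checkpoint<S>): void {
    this.rows.set(checkpoint.runId, checkpoint);
  }
}

// Runs steps in order, persisting a JSON snapshot after each one. A rerun with
// the same runId starts at the first unfinished step with the saved state.
async function runDurably<S>(
  runId: string,
  initial: S,
  steps: Step<S>[],
  store: InMemoryCheckpointStore<S>,
): Promise<S> {
  const checkpoint = store.load(runId) ?? { runId, nextStep: 0, state: initial };
  let state = checkpoint.state;
  for (let i = checkpoint.nextStep; i < steps.length; i++) {
    state = await steps[i](state);
    // The JSON round trip mimics serializing the snapshot to the database.
    const snapshot = JSON.parse(JSON.stringify(state)) as S;
    store.save({ runId, nextStep: i + 1, state: snapshot });
  }
  return state;
}

// Usage: the video pipeline from the example, with fake step bodies.
type VideoState = { videoUrl: string; audioPath?: string; transcript?: string; summary?: string };

const calls: string[] = [];
let crashBeforeSummary = true; // simulates a crash on the first attempt

const videoSteps: Step<VideoState>[] = [
  async (s) => { calls.push("download"); return { ...s, audioPath: "/tmp/audio.wav" }; },
  async (s) => { calls.push("transcribe"); return { ...s, transcript: "hello world" }; },
  async (s) => {
    if (crashBeforeSummary) throw new Error("simulated crash");
    calls.push("summarize");
    return { ...s, summary: `summary of: ${s.transcript}` };
  },
];

async function demo(): Promise<VideoState> {
  const store = new InMemoryCheckpointStore<VideoState>();
  const input: VideoState = { videoUrl: "https://example.com/video.mp4" };
  await runDurably("run_1", input, videoSteps, store).catch(() => undefined); // first attempt crashes
  crashBeforeSummary = false;
  return runDurably("run_1", input, videoSteps, store); // resumes at the summary step
}
```

After demo() resolves, calls contains "download", "transcribe", and "summarize" exactly once each: the second attempt loaded the checkpoint saved after transcription and ran only the summary step. Nothing before the checkpoint is re-executed, not even as a cache lookup; the runner jumps straight to the saved step index.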
Retry Semantics and Idempotency
Automatic retries sound simple until you hit side effects. If a task sends an email, crashes, and retries, you don’t want duplicate emails.
Trigger.dev handles this with execution IDs and idempotency keys:
- Every task run gets a unique execution ID
- External API calls include the execution ID in headers or request metadata
- Idempotent APIs (Stripe, Twilio) deduplicate based on the key
- Non-idempotent calls require explicit guards (database checks, distributed locks)
The platform doesn’t enforce idempotency. It provides the primitives (execution IDs, checkpoint boundaries) and expects developers to handle side effects correctly.
| Failure Mode | Trigger.dev Behavior | Developer Responsibility |
|---|---|---|
| Network timeout | Automatic retry with exponential backoff | Ensure API calls are idempotent or use execution ID |
| Process crash | Resume from last checkpoint | Design checkpoints around side effects |
| Code deployment | Graceful drain, new version picks up pending tasks | Maintain backward-compatible state schemas |
| Database unavailable | Retry with circuit breaker | Handle transient failures in task logic |
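A minimal sketch of the idempotency pattern above, assuming a hypothetical welcome-email step: derive a stable key from the execution ID and a step name, then guard the non-idempotent call with a record of claimed keys. idempotencyKey, IdempotencyLog, sendWelcomeEmailOnce, and the in-memory Set (standing in for a table with a unique constraint) are all hypothetical. For APIs that deduplicate on a caller-supplied key, such as Stripe's Idempotency-Key request header, the derived key can be sent with the request instead.

```typescript
// Hypothetical names throughout. The key is stable across retries of the
// same run, so every attempt at this step produces the same key.
function idempotencyKey(executionId: string, stepName: string): string {
  return `${executionId}:${stepName}`;
}

// Stand-in for a database table with a UNIQUE constraint on the key.
class IdempotencyLog {
  private readonly keys = new Set<string>();
  // Returns false if an earlier attempt already claimed the key.
  tryClaim(key: string): boolean {
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }
  release(key: string): void {
    this.keys.delete(key);
  }
}

type Mailer = (to: string, subject: string) => Promise<void>;

// Guard for a non-idempotent side effect: claim the key first, and release it
// if the send reports failure so a retry can try again. This is not
// exactly-once: a crash between claiming and sending drops the email, and a
// send that times out after delivery can still duplicate it. Only a provider
// that deduplicates on the key closes that gap.
async function sendWelcomeEmailOnce(
  executionId: string,
  to: string,
  log: IdempotencyLog,
  send: Mailer,
): Promise<"sent" | "skipped"> {
  const key = idempotencyKey(executionId, "welcome-email");
  if (!log.tryClaim(key)) return "skipped"; // a previous attempt already sent it
  try {
    await send(to, "Welcome!");
    return "sent";
  } catch (err) {
    log.release(key); // reported failure: let the next retry try again
    throw err;
  }
}
```

Calling sendWelcomeEmailOnce twice with the same execution ID sends once and returns "skipped" the second time, which is the behavior you want when the platform retries the task.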
State Persistence and Queue Management
Trigger.dev uses Postgres for state storage. Each task execution writes:
- Execution metadata (ID, status, start time, retry count)
- Checkpoint snapshots (serialized function state at await boundaries)
- Event log (tool calls, errors, state transitions)
The queue is also Postgres-backed. Tasks enter a pending state, workers poll for available jobs, and execution moves to running. On completion or failure, the row updates with final state.
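The pending, running, and completed/failed lifecycle can be modeled in memory. This is a sketch, not Trigger.dev's schema: InMemoryRunQueue, RunRow, and the table and column names in the SQL comment are hypothetical. On Postgres, a common way to let many workers poll one table safely is SELECT ... FOR UPDATE SKIP LOCKED, shown in the comment; whether Trigger.dev uses exactly that query is an assumption.

```typescript
// In-memory model of a Postgres-backed job queue; names are hypothetical.
// On Postgres, the claim step is commonly a single statement such as:
//   UPDATE runs SET status = 'running', worker_id = $1
//   WHERE id = (SELECT id FROM runs WHERE status = 'pending'
//               ORDER BY created_at LIMIT 1 FOR UPDATE SKIP LOCKED)
//   RETURNING *;
type RunStatus = "pending" | "running" | "completed" | "failed";

interface RunRow {
  id: string;
  status: RunStatus;
  retryCount: number;
  createdAt: number;
  workerId?: string;
  output?: unknown;
  lastError?: string;
}

class InMemoryRunQueue {
  private readonly rows: RunRow[] = [];
  private clock = 0; // monotonic counter instead of wall-clock timestamps

  enqueue(): RunRow {
    this.clock++;
    const row: RunRow = { id: `run_${this.clock}`, status: "pending", retryCount: 0, createdAt: this.clock };
    this.rows.push(row);
    return row;
  }

  // Claims the oldest pending run. Node executes this synchronously, so the
  // check-and-set cannot interleave; in Postgres, FOR UPDATE row locks play
  // that role and SKIP LOCKED lets other workers move past a locked row.
  claim(workerId: string): RunRow | undefined {
    const next = this.rows
      .filter((r) => r.status === "pending")
      .sort((a, b) => a.createdAt - b.createdAt)[0];
    if (!next) return undefined;
    next.status = "running";
    next.workerId = workerId;
    return next;
  }

  complete(id: string, output: unknown): void {
    const row = this.get(id);
    row.status = "completed";
    row.output = output;
  }

  // Returns the run to the queue until maxRetries is exhausted.
  fail(id: string, error: string, maxRetries: number): RunStatus {
    const row = this.get(id);
    row.lastError = error;
    if (row.retryCount < maxRetries) {
      row.retryCount++;
      row.status = "pending";
      row.workerId = undefined;
    } else {
      row.status = "failed";
    }
    return row.status;
  }

  private get(id: string): RunRow {
    const row = this.rows.find((r) => r.id === id);
    if (!row) throw new Error(`unknown run ${id}`);
    return row;
  }
}
```

Failed runs return to pending until maxRetries is exhausted. A real implementation would also store a not-before timestamp so a retry waits out its backoff delay instead of being claimed immediately.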
This design trades throughput for simplicity. Postgres can’t match Redis or RabbitMQ for queue performance, but it eliminates operational complexity. No separate message broker, no distributed coordination, no split-brain scenarios between queue and state store.
The bottleneck appears around 10,000 concurrent tasks. Beyond that, you need connection pooling, read replicas, or a move to dedicated queue infrastructure.
Observability for Long-Running Jobs
When a task runs for six hours, you need more than logs. Trigger.dev exposes:
- Real-time execution view: Current checkpoint, elapsed time, next scheduled retry
- Trace timeline: Visual breakdown of await boundaries and tool calls
- State snapshots: Inspect serialized state at any checkpoint
- Retry history: See every attempt, failure reason, and backoff delay
The observability model assumes you’re debugging distributed systems, not single-process scripts. You need to answer: “Where did this task stall?” and “Why did this retry loop 47 times?”
The platform streams execution events to a dashboard. Each checkpoint emits a trace span. Tool calls include request/response payloads (with PII redaction hooks). Errors capture stack traces and serialized state.
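A sketch of what a redaction hook on tool-call spans could look like, assuming payloads are plain JSON and sensitive fields can be recognized by key name. redact, traceToolCall, ToolCallSpan, and the SENSITIVE_KEYS list are hypothetical, not Trigger.dev's hook API. The wrapped call still receives the raw payload; only the copy attached to the span is redacted.

```typescript
// Hypothetical redaction hook applied to tool-call payloads before they are
// attached to a trace span. The key list and all names are assumptions.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

const SENSITIVE_KEYS = new Set(["email", "phone", "password", "authorization", "ssn"]);

// Returns a deep copy with sensitive keys (matched case-insensitively) masked.
function redact(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(child);
    }
    return out;
  }
  return value;
}

interface ToolCallSpan {
  name: string;
  durationMs: number;
  request: Json;
  response: Json;
}

// Wraps a tool call and emits one span with redacted payloads.
// Failed calls are not traced in this sketch.
async function traceToolCall(
  name: string,
  request: Json,
  call: (req: Json) => Promise<Json>,
  emit: (span: ToolCallSpan) => void,
): Promise<Json> {
  const start = Date.now();
  const response = await call(request); // the real call sees unredacted data
  emit({ name, durationMs: Date.now() - start, request: redact(request), response: redact(response) });
  return response;
}
```

Key-name matching is the simplest policy; it misses PII embedded in free-text fields, so it is usually only one layer of a redaction strategy.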
Deployment Shape and Failure Modes
Trigger.dev offers cloud-hosted and self-hosted options. The cloud version runs workers in isolated containers with automatic scaling. Self-hosted requires:
- Postgres (state and queue storage)
- Redis (optional, for distributed locks and rate limiting)
- Worker processes (Node.js or Bun runtimes)
- Coordinator service (schedules tasks, manages retries)
The failure modes differ by deployment:
Cloud-hosted risks:
- Vendor lock-in for state storage
- Cold start latency for infrequent tasks
- Network egress costs for large payloads
Self-hosted risks:
- Database failover complexity
- Worker autoscaling configuration
- Checkpoint storage growth (unbounded if not pruned)
Both models share a core risk: checkpoint serialization failures. If your task holds non-serializable state (open file handles, WebSocket connections), the runtime can’t persist it. The task crashes and retries from the last valid checkpoint, potentially losing progress.
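One way to surface this failure before a checkpoint write, rather than after a crash, is to validate the state up front. The sketch below assumes checkpoints are stored as JSON and flags anything that would not round-trip: functions, symbols, bigints, non-finite numbers, class instances (which covers socket and file-handle objects, and Date values that JSON would turn into strings), and circular references. findNonSerializable and assertCheckpointable are hypothetical helpers, not part of Trigger.dev.

```typescript
// Hypothetical pre-checkpoint guard: reports every path whose value would not
// survive a JSON round trip to the state store.
function findNonSerializable(value: unknown, path = "$", ancestors = new Set<object>()): string[] {
  if (value === null || value === undefined) return []; // JSON drops undefined properties
  switch (typeof value) {
    case "string":
    case "boolean":
      return [];
    case "number":
      return Number.isFinite(value) ? [] : [`${path}: non-finite number`];
    case "function":
    case "symbol":
    case "bigint":
      return [`${path}: ${typeof value}`];
  }
  // Only objects and arrays remain.
  const obj = value as object;
  if (ancestors.has(obj)) return [`${path}: circular reference`];
  const proto = Object.getPrototypeOf(obj);
  if (!Array.isArray(obj) && proto !== Object.prototype && proto !== null) {
    return [`${path}: instance of ${proto.constructor?.name ?? "unknown class"}`];
  }
  ancestors.add(obj); // track only the current path, so shared references are allowed
  const issues = Array.isArray(obj)
    ? obj.flatMap((item, i) => findNonSerializable(item, `${path}[${i}]`, ancestors))
    : Object.entries(obj).flatMap(([key, child]) => findNonSerializable(child, `${path}.${key}`, ancestors));
  ancestors.delete(obj);
  return issues;
}

// Throws with every offending path instead of failing mid-write.
function assertCheckpointable(state: unknown): void {
  const issues = findNonSerializable(state);
  if (issues.length > 0) {
    throw new Error(`state is not checkpointable:\n${issues.join("\n")}`);
  }
}
```

For example, a state object holding a connection instance and a callback yields one issue per offending path, so the error points at the field to drop before the checkpoint and rebuild after resume.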
How It Compares to Temporal
Temporal is a workflow engine. Trigger.dev is a task runner with durable execution. The distinction matters:
Temporal strengths:
- Multi-language support (Go, Java, Python, TypeScript)
- Complex workflow patterns (sagas, child workflows, signals)
- Battle-tested at scale (Uber, Netflix, Stripe)
Trigger.dev strengths:
- Native TypeScript experience (no DSL, no code generation)
- Simpler operational model (Postgres alone, versus a Temporal server cluster backed by Cassandra, MySQL, or PostgreSQL, often plus Elasticsearch for visibility)
- Faster onboarding (minutes vs. days)
Temporal is the right choice for mission-critical workflows with complex coordination. Trigger.dev fits teams that want durable execution without the operational overhead.
Technical Verdict
Use Trigger.dev when:
- You’re building in TypeScript and want native async/await semantics
- Your tasks are mostly linear (API calls, batch jobs, AI workflows)
- You prefer operational simplicity over maximum throughput
- You need durable execution but not full workflow orchestration
Avoid it when:
- You need sub-second task latency (checkpointing adds overhead)
- Your workflows require complex coordination (parallel branches, dynamic fan-out)
- You’re already invested in Temporal or another workflow engine
- You need guaranteed exactly-once semantics (Trigger.dev provides at-least-once)
The platform’s real contribution is showing that durable execution doesn’t require a PhD in distributed systems. You can get retry semantics, state persistence, and resumable execution with Postgres and careful checkpoint design. That’s a useful middle ground between “hope the process doesn’t crash” and “run a Temporal cluster.”