Trigger.dev Architecture: Event Routing, State Persistence, and Retry Logic in a Code-First Workflow Platform

Trigger.dev positions itself between Zapier’s no-code GUI and Temporal’s heavyweight event sourcing model. It gives developers TypeScript-first workflow definitions with managed state, retry logic, and task queuing. Two Show HN posts (745 and 172 points) tracked the platform’s evolution from a Zapier alternative to a Temporal alternative, signaling a shift from event-driven integrations to durable workflow orchestration.

The core question: how does Trigger.dev persist workflow state across long-running tasks, route events between services, and handle retries without forcing developers into a database schema or event sourcing architecture?

Architecture Overview

Trigger.dev runs three components:

Orchestrator: Centralized control plane that schedules tasks, manages queues, and stores execution state.
Worker runtime: Executes task code in isolated environments (Docker containers or serverless functions).
Client SDK: TypeScript library embedded in your application that defines tasks and triggers.

Workflows are defined as task() functions in your codebase. The SDK registers these with the orchestrator at build time. When an event fires (webhook, cron, or manual trigger), the orchestrator queues the task, assigns it to a worker, and tracks execution state in Postgres.

State Persistence Model

Trigger.dev does not use event sourcing. Instead:

Each task execution gets a unique run ID.
The orchestrator writes execution state (status, output, error) to Postgres after each step.
Long-running tasks checkpoint automatically. If a worker dies, the orchestrator resumes the run from the last checkpoint on another worker.
Developers access state via ctx.run metadata, not by querying a database.

This differs from Temporal, which rebuilds state by replaying an event log. Trigger.dev trades replay determinism for simpler mental models: state is a row in a table, not a projection of events.
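The "state is a row" model can be illustrated with a minimal in-memory sketch. Everything here is hypothetical: `RunRow`, `runTable`, and `executeRun` are names invented for illustration, the `Map` stands in for a Postgres table, and the real schema and checkpoint mechanism are not described in this article.

```typescript
// Hypothetical sketch: these names are illustrative, not Trigger.dev internals.
type RunStatus = "QUEUED" | "EXECUTING" | "COMPLETED" | "FAILED";

interface RunRow {
  runId: string;
  status: RunStatus;
  completedSteps: number; // index of the next step to execute
  stepOutputs: unknown[]; // output saved at each checkpoint
  error?: string;
}

type Step = (previous: unknown) => unknown;

// Stand-in for a Postgres table keyed by run ID.
const runTable = new Map<string, RunRow>();

function executeRun(runId: string, steps: Step[]): RunRow {
  // Load the existing row if this run was attempted before, else create one.
  const row: RunRow = runTable.get(runId) ?? {
    runId,
    status: "QUEUED",
    completedSteps: 0,
    stepOutputs: [],
  };
  row.status = "EXECUTING";
  delete row.error;
  runTable.set(runId, row);

  // Start at the last checkpoint rather than at step 0.
  for (let i = row.completedSteps; i < steps.length; i++) {
    const previous = i > 0 ? row.stepOutputs[i - 1] : undefined;
    try {
      row.stepOutputs[i] = steps[i](previous);
      row.completedSteps = i + 1; // checkpoint: the row is updated in place
    } catch (err) {
      row.status = "FAILED";
      row.error = err instanceof Error ? err.message : String(err);
      return row;
    }
  }
  row.status = "COMPLETED";
  return row;
}
```

Calling `executeRun` a second time with the same run ID skips every step already recorded in the row. No event log is replayed; the current state is simply read back.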

Event Routing and Task Queuing

Tasks are triggered by:

Webhooks: Orchestrator exposes HTTP endpoints per task. Incoming requests queue a run.
Scheduled triggers: Cron expressions stored in the orchestrator. A scheduler service polls and enqueues tasks.
Manual invocation: SDK method tasks.trigger() sends a message to the orchestrator API.
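As a rough sketch of how the three trigger sources could converge on one queue, the snippet below normalizes each event into a queued run with a unique run ID. The types, the `routeEvent` function, and the counter-based ID scheme are assumptions made for illustration, not the orchestrator's actual routing code.

```typescript
// Hypothetical sketch: names and ID format are illustrative only.
type TriggerSource = "webhook" | "schedule" | "manual";

interface TriggerEvent {
  source: TriggerSource;
  taskId: string;
  payload: unknown;
}

interface EnqueuedRun extends TriggerEvent {
  runId: string;
}

// Tasks the SDK registered with the orchestrator at build time.
const registeredTasks = new Set<string>(["process-order"]);

const runQueue: EnqueuedRun[] = [];
let runCounter = 0;

function routeEvent(event: TriggerEvent): EnqueuedRun {
  // Reject events that target a task the orchestrator does not know about.
  if (!registeredTasks.has(event.taskId)) {
    throw new Error(`Unknown task: ${event.taskId}`);
  }
  runCounter += 1;
  const run: EnqueuedRun = { runId: `run_${runCounter}`, ...event };
  runQueue.push(run);
  return run;
}
```

Whatever the source, the result is the same shape: a run record waiting in a queue, which is what lets webhooks, cron schedules, and manual triggers share one scheduling path.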

The orchestrator maintains per-task queues with configurable concurrency limits. When a task is queued:

Orchestrator checks concurrency settings (max parallel runs, rate limits).
If capacity exists, it assigns the run to an available worker.
Worker pulls task code from the registry, executes it, and streams logs back.

If no workers are available, the run waits in the queue. The orchestrator does not pre-allocate workers; it scales them on demand (Kubernetes pods or serverless functions).
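The capacity check described above might look something like the following. This is a simplified in-memory sketch with hypothetical names (`TaskQueue`, `dequeue`, `complete`); it models only a max-parallel-runs limit and leaves out rate limiting and worker scaling.

```typescript
// Hypothetical sketch of a per-task queue with a concurrency limit.
interface QueuedRun {
  runId: string;
  taskId: string;
}

class TaskQueue {
  private waiting: QueuedRun[] = [];
  private running = new Set<string>();
  private readonly concurrencyLimit: number;

  constructor(concurrencyLimit: number) {
    this.concurrencyLimit = concurrencyLimit;
  }

  enqueue(run: QueuedRun): void {
    this.waiting.push(run);
  }

  // Hand out the oldest waiting run, but only if capacity exists.
  dequeue(): QueuedRun | undefined {
    if (this.running.size >= this.concurrencyLimit) return undefined;
    const next = this.waiting.shift();
    if (next) this.running.add(next.runId);
    return next;
  }

  // Free a concurrency slot when a run finishes or fails.
  complete(runId: string): void {
    this.running.delete(runId);
  }

  get depth(): number {
    return this.waiting.length;
  }
}
```

With a limit of 1, a second `dequeue()` returns `undefined` until `complete()` is called for the first run, so the extra run stays in the queue instead of being dropped.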

Retry and Failure Recovery

Retry logic is declarative:

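import { task } from "@trigger.dev/sdk/v3";

export const processOrder = task({
  id: "process-order",
  retry: {
    maxAttempts: 3, // total attempts, including the first
    factor: 2, // multiply the delay by this after each failure
    minTimeoutInMs: 1000, // delay before the first retry
    maxTimeoutInMs: 10000, // upper bound on any single delay
  },
  run: async (payload: { amount: number }) => {
    // externalAPI stands in for your payment client; a thrown error triggers a retry.
    const result = await externalAPI.charge(payload.amount);
    return result;
  },
});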

When a task fails:

The orchestrator records the error and schedules a retry based on exponential backoff.
If all retries are exhausted, the run enters a FAILED state.
Developers can manually retry from the dashboard or API.
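To make the backoff schedule concrete, here is a minimal sketch of how a delay could be derived from the retry settings. The formula (minimum delay multiplied by the factor once per prior failure, capped at the maximum) is a common convention and an assumption here, as are the names `RetryPolicy` and `nextRetryDelayMs`; jitter is left out for clarity.

```typescript
// Hypothetical sketch: assumes plain exponential backoff without jitter.
interface RetryPolicy {
  maxAttempts: number; // total attempts, including the first
  factor: number; // growth multiplier per failure
  minDelayMs: number; // corresponds to the minimum timeout setting
  maxDelayMs: number; // corresponds to the maximum timeout setting
}

// Delay before the next attempt after `failedAttempt` (1-based) has failed.
// Returns undefined when no attempts remain and the run should be FAILED.
function nextRetryDelayMs(
  failedAttempt: number,
  policy: RetryPolicy
): number | undefined {
  if (failedAttempt >= policy.maxAttempts) return undefined;
  const delay = policy.minDelayMs * Math.pow(policy.factor, failedAttempt - 1);
  return Math.min(delay, policy.maxDelayMs);
}

const policy: RetryPolicy = {
  maxAttempts: 3,
  factor: 2,
  minDelayMs: 1000,
  maxDelayMs: 10000,
};

// Schedule for the three attempts under this policy.
const schedule = [1, 2, 3].map((attempt) => nextRetryDelayMs(attempt, policy));
console.log(schedule); // [ 1000, 2000, undefined ]
```

Under these assumptions the example configuration waits 1 second after the first failure and 2 seconds after the second, then gives up. The 10-second cap only comes into play with a higher attempt count.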

Partial failures (e.g., step 2 of 5 fails) do not replay earlier steps. The orchestrator resumes from the failed step using the last checkpoint. This avoids repeating side effects from completed steps, but it does not remove the need for careful design: the failed step itself runs again, so any side effect it performed before failing must be safe to repeat.

Observability and Deployment Boundaries

Observability

The orchestrator exposes:

Run logs: Streamed from workers to the orchestrator, stored in Postgres, viewable in the dashboard.
Trace spans: Each task step emits OpenTelemetry spans. The orchestrator aggregates them into a trace tree.
Metrics: Task duration, queue depth, retry counts, and worker utilization.

Logs and traces are queryable via the dashboard or API. No external APM is required, but you can export spans to Datadog or Honeycomb.

Deployment Options

Trigger.dev offers two deployment modes:

Mode	Orchestrator	Workers	State Storage	Use Case
Cloud	Managed by Trigger.dev	Serverless (Fly.io or AWS Lambda)	Managed Postgres	Fast setup, no ops burden
Self-hosted	Docker Compose or Kubernetes	Your infrastructure (Docker, K8s, or serverless)	Your Postgres instance	Data residency, custom networking

Self-hosting requires:

Running the orchestrator as a stateless service (scales horizontally).
Configuring worker environments (Docker images or serverless runtimes).
Managing Postgres for state and logs.

The orchestrator and workers communicate over HTTP/WebSocket. Workers poll the orchestrator for tasks; the orchestrator does not push tasks to workers. This simplifies firewall rules but adds latency (polling interval is configurable, default 1 second).
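The latency cost of polling can be estimated with a little arithmetic. The sketch below assumes a single worker that polls on a fixed grid (at 0, one interval, two intervals, and so on) and ignores network and startup time; `pickupDelayMs` is a hypothetical helper, not an SDK function.

```typescript
// Hypothetical sketch: time a run waits in the queue before the next poll.
// Assumes polls happen at t = 0, pollIntervalMs, 2 * pollIntervalMs, ...
function pickupDelayMs(enqueuedAtMs: number, pollIntervalMs: number): number {
  const nextPollMs = Math.ceil(enqueuedAtMs / pollIntervalMs) * pollIntervalMs;
  return nextPollMs - enqueuedAtMs;
}

// A run enqueued 250 ms after a poll waits for the rest of the interval.
console.log(pickupDelayMs(250, 1000)); // 750
```

Under this model the added delay ranges from zero to just under one polling interval, so lowering the interval trades more requests to the orchestrator for faster pickup.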

Comparison: Trigger.dev vs. Temporal vs. Zapier

Dimension	Trigger.dev	Temporal	Zapier
State model	Postgres rows	Event sourcing	Opaque
Retry logic	Declarative, per-task	Workflow-level, deterministic replay	GUI-configured
Developer control	Full code access	Full code access	No code access
Deployment	Cloud or self-hosted	Temporal Cloud or self-hosted (complex)	SaaS only
Observability	Built-in dashboard	Built-in Web UI; metrics via external tooling	Limited logs

Trigger.dev sits between Temporal’s deterministic guarantees and Zapier’s simplicity. You get code-first workflows without event sourcing complexity, but you lose Temporal’s replay-based recovery and Zapier’s zero-ops model.

Failure Modes and Mitigations

Orchestrator Downtime

If the orchestrator crashes:

Queued tasks remain in Postgres; no data loss.
Running tasks continue in workers but cannot report status.
When the orchestrator restarts, it reconciles worker state and resumes scheduling.

Mitigation: Run multiple orchestrator replicas behind a load balancer. State is in Postgres, so replicas are stateless.

Worker Failures

If a worker crashes mid-task:

The orchestrator detects the missing heartbeat (default 30 seconds).
It marks the run as FAILED and schedules a retry.
The new worker resumes from the last checkpoint.

Mitigation: Shorten the heartbeat interval for time-sensitive tasks so that dead workers are detected sooner. Use idempotent operations in task steps, since a step interrupted by a crash may run again.
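Heartbeat-based detection comes down to comparing timestamps. The following is a minimal sketch assuming the orchestrator records the last heartbeat time for each running run; `RunningRun` and `findStalledRuns` are hypothetical names, and the 30-second default mirrors the figure quoted above.

```typescript
// Hypothetical sketch of stalled-run detection by missed heartbeat.
interface RunningRun {
  runId: string;
  lastHeartbeatMs: number; // timestamp of the most recent heartbeat
}

function findStalledRuns(
  runs: RunningRun[],
  nowMs: number,
  timeoutMs: number = 30000
): string[] {
  // A run is stalled when its worker has been silent longer than the timeout.
  return runs
    .filter((run) => nowMs - run.lastHeartbeatMs > timeoutMs)
    .map((run) => run.runId);
}
```

A shorter timeout detects crashes sooner but raises the chance of declaring a slow yet healthy worker dead, which is one more reason for steps to be idempotent.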

Postgres Bottleneck

High task throughput can saturate Postgres:

Writes: Every task step writes state and logs.
Reads: Dashboard queries and worker polling hit the database.

Mitigation: Use read replicas for dashboard queries. Batch log writes. Archive old runs to cold storage.
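Batching log writes is the simplest of these mitigations to sketch. The class below buffers log lines and hands them to a writer in groups; `LogBatcher` and the `writeBatch` callback are hypothetical, and the callback stands in for whatever multi-row insert your Postgres client provides.

```typescript
// Hypothetical sketch: buffer log lines and write them in batches.
class LogBatcher {
  private buffer: string[] = [];
  private readonly batchSize: number;
  private readonly writeBatch: (lines: string[]) => void;

  constructor(batchSize: number, writeBatch: (lines: string[]) => void) {
    this.batchSize = batchSize;
    this.writeBatch = writeBatch;
  }

  append(line: string): void {
    this.buffer.push(line);
    // One database round trip per batchSize lines instead of one per line.
    if (this.buffer.length >= this.batchSize) this.flush();
  }

  // Call on a timer and at shutdown so partial batches are not lost.
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.writeBatch(batch);
  }
}
```

The trade-off is durability: lines still in the buffer are lost if the process dies before a flush, so the batch size and flush interval bound how much log output a crash can drop.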

Webhook Delivery Failures

If a webhook sender (your application or a third-party service) cannot reach the orchestrator:

Webhooks are lost unless the sender retries.
No built-in webhook queue or dead-letter handling.

Mitigation: Use a message broker (SQS, Pub/Sub) between your app and Trigger.dev. The broker handles retries and durability.
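What the broker contributes is at-least-once delivery with a dead-letter path. The in-memory sketch below illustrates that behavior only; `WebhookBuffer` is a hypothetical stand-in, not the SQS or Pub/Sub API, and a real broker would also persist messages to disk.

```typescript
// Hypothetical sketch of at-least-once delivery with dead-lettering.
interface WebhookMessage {
  id: string;
  body: string;
  attempts: number;
}

class WebhookBuffer {
  private pending: WebhookMessage[] = [];
  private deadLetters: WebhookMessage[] = [];
  private readonly maxAttempts: number;

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  publish(id: string, body: string): void {
    this.pending.push({ id, body, attempts: 0 });
  }

  // Try to forward every pending message; `forward` returns true on success.
  drain(forward: (message: WebhookMessage) => boolean): void {
    const stillPending: WebhookMessage[] = [];
    for (const message of this.pending) {
      message.attempts += 1;
      if (forward(message)) continue; // delivered, drop from the buffer
      if (message.attempts >= this.maxAttempts) {
        this.deadLetters.push(message); // give up, keep for inspection
      } else {
        stillPending.push(message); // retry on the next drain
      }
    }
    this.pending = stillPending;
  }

  get pendingCount(): number {
    return this.pending.length;
  }

  get deadLetterCount(): number {
    return this.deadLetters.length;
  }
}
```

Because a message can be forwarded more than once (for example, when the orchestrator accepts it but the acknowledgment is lost), the receiving task should deduplicate on the message ID.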

Code Example: Multi-Step Task with External API

import { task, tasks } from "@trigger.dev/sdk/v3";
// db stands in for your application's database client (not part of the SDK).

export const syncCustomerData = task({
  id: "sync-customer-data",
  retry: { maxAttempts: 5 },
  run: async (payload: { customerId: string }) => {
    // Step 1: Fetch from external API
    const response = await fetch(
      `https://api.example.com/customers/${payload.customerId}`
    );
    // fetch does not reject on 4xx/5xx responses, so throw to trigger a retry
    if (!response.ok) {
      throw new Error(`Customer fetch failed: ${response.status}`);
    }
    const customer = await response.json();

    // Step 2: Transform data (checkpoint after this)
    const normalized = {
      id: customer.id,
      email: customer.email_address,
      createdAt: new Date(customer.created_at),
    };

    // Step 3: Write to database (upsert, so repeating it is harmless)
    await db.customers.upsert(normalized);

    // Step 4: Trigger downstream task
    await tasks.trigger("send-welcome-email", {
      email: normalized.email,
    });

    return { success: true, customerId: normalized.id };
  },
});

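If step 3 fails, the orchestrator retries from step 3. Steps 1 and 2 do not re-execute. Skipping step 1 (the API fetch) is only safe if its result survives in the checkpoint; otherwise the fetch must be idempotent or its result cached so that re-running it is harmless. Step 3 itself runs again on retry, which is why it is an upsert rather than an insert.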

Technical Verdict

Use Trigger.dev when:

You need code-first workflows with managed state and retries.
You want observability without external APM setup.
You can tolerate non-deterministic replay (state is checkpointed, not event-sourced).
You prefer TypeScript and want type-safe task definitions.

Avoid Trigger.dev when:

You need strict deterministic replay (use Temporal).
Your workflows span months or years (Postgres state storage gets expensive).
You require sub-second task latency (worker polling adds up to a second of delay at the default 1-second interval).
You need a GUI for non-developers (use Zapier or n8n).

Trigger.dev works best for developer-authored workflows that run minutes to hours, need retry logic, and benefit from centralized observability. It does not replace Temporal for mission-critical financial workflows or Zapier for marketing automation by non-engineers.