mech.app
Security

Tracecat: Open-Source SOAR Shows How Security Automation Differs from Business Workflows

Tracecat's architecture reveals security-specific orchestration: case state, evidence chains, analyst handoff, and compliance audit trails.

Source: Tracecat GitHub

Tracecat is an open-source SOAR (Security Orchestration, Automation, and Response) platform that reached 264 points on Hacker News. The project matters because it exposes the orchestration primitives that security automation requires but business workflow tools do not provide: case state management, evidence chain tracking, analyst handoff points, and compliance audit trails. When you automate incident response instead of marketing ops, the workflow layer needs different guarantees.

Security analysts can face 100 or more alerts a day. Most are false positives, but each requires investigation: query logs, contact affected users, check vulnerability databases, escalate to senior analysts. Established SOAR tools (Splunk SOAR, formerly Phantom, and Palo Alto Cortex XSOAR) are closed-source and aimed at enterprise buyers. Tracecat shows what changes when you build workflow automation for security response instead of business processes.

Why Security Automation Differs from Business Workflows

Business workflow tools (Zapier, n8n, Automatisch) connect APIs to move data: new Stripe payment triggers a Slack message and updates a Google Sheet. The workflow is stateless. Each execution is independent. If step 3 fails, you retry or discard the event.

Security automation has different requirements:

  • Case state: An alert creates a case that persists across multiple workflow runs. The case tracks investigation status (new, investigating, escalated, resolved), assigned analyst, and accumulated evidence.
  • Evidence chains: Every action (log query, file hash lookup, user contact) must be recorded with timestamps and actor identity for compliance audits.
  • Analyst handoff: Workflows pause for human decisions. An analyst reviews findings, decides whether to escalate, and resumes automation.
  • Parallel investigation with SLAs: Multiple tasks run concurrently (query SIEM for related events, check EDR for malware, contact victim) with different timeouts and failure modes. If the SIEM query times out after 30 seconds, the workflow continues with partial data.
  • Compliance audit trails: SOC2, ISO 27001, and incident response frameworks require immutable logs of who did what and when.

Tracecat’s architecture addresses these requirements. The core abstraction is a case: a stateful container for an investigation that spans multiple workflow executions.

Case State Management

A case in Tracecat represents a security incident under investigation. Unlike business workflows where each execution is independent, a case persists across multiple workflow runs and analyst interactions.

Case lifecycle:

  1. Alert ingestion: SIEM or EDR sends an alert webhook. Tracecat creates a case with status new.
  2. Automated triage: Workflow queries threat intelligence feeds, checks user activity logs, and assigns a severity score.
  3. Analyst review: Workflow pauses. Analyst reviews findings in the Tracecat UI and decides to investigate further or close as false positive.
  4. Investigation: Workflow resumes, queries additional data sources, contacts affected users, and collects evidence.
  5. Resolution: Analyst marks case as resolved with remediation notes. Workflow triggers post-incident tasks (update firewall rules, notify management).
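
The lifecycle above can be read as a small state machine. The following is a minimal sketch, assuming the five status values used in this article; the transition table and the `advance` helper are hypothetical names for illustration and are not taken from Tracecat's code.

```python
# Hypothetical transition table for the case lifecycle described above.
# Which moves are allowed is an assumption for illustration, not Tracecat's rules.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "new": {"investigating", "closed"},  # triage, or close as false positive
    "investigating": {"escalated", "resolved", "closed"},
    "escalated": {"investigating", "resolved"},
    "resolved": {"closed"},
    "closed": set(),  # terminal state
}


def advance(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move case from {current!r} to {target!r}")
    return target


status = advance("new", "investigating")
status = advance(status, "resolved")
print(status)  # resolved
```

Rejecting invalid moves in one place keeps automated steps and analyst actions subject to the same rules.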

The case object stores:

  • Status: new, investigating, escalated, resolved, closed
  • Assigned analyst: User ID of the analyst responsible for the case
  • Evidence: Array of artifacts collected during investigation (log entries, file hashes, IP addresses)
  • Timeline: Chronological list of actions taken (automated and manual)
  • Metadata: Alert source, severity, affected assets

Example case schema:

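# Simplified case model
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Literal, Optional

# Placeholder record types; a fuller model would define each as its own class
Evidence = dict[str, Any]
TimelineEvent = dict[str, Any]
AuditEntry = dict[str, Any]


@dataclass
class Case:
    id: str
    status: Literal["new", "investigating", "escalated", "resolved", "closed"]
    severity: Literal["low", "medium", "high", "critical"]
    created_at: datetime
    updated_at: datetime

    # Alert metadata
    alert_source: str  # "siem", "edr", "firewall"
    affected_assets: list[str]  # ["user@example.com", "10.0.1.42"]

    assigned_to: Optional[str] = None  # Analyst user ID

    # Evidence collected during investigation
    evidence: list[Evidence] = field(default_factory=list)

    # Timeline of actions (automated and manual)
    timeline: list[TimelineEvent] = field(default_factory=list)

    # Compliance fields: immutable log of all actions
    # Enforced via database triggers or append-only tables
    audit_log: list[AuditEntry] = field(default_factory=list)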

The key difference from business workflows: state persists and accumulates. A Zapier workflow processes a webhook and completes. A Tracecat case remains open for hours or days, accumulating evidence and analyst decisions.

Evidence Chain and Audit Trails

Security automation must prove what happened and who did it. Compliance frameworks require immutable audit logs. Tracecat records every action in the case timeline with:

  • Timestamp: When the action occurred (ISO 8601 with millisecond precision)
  • Actor: User ID (analyst) or system component (workflow engine, integration)
  • Action type: query_logs, contact_user, escalate_case, add_evidence
  • Input parameters: What data was queried or sent
  • Output: What was returned or changed
  • Execution context: Workflow run ID, step ID, retry count

Example timeline entry:

{
  "id": "evt_abc123",
  "case_id": "case_xyz789",
  "timestamp": "2026-05-17T14:32:15.342Z",
  "actor": {
    "type": "workflow",
    "workflow_id": "wf_threat_intel",
    "run_id": "run_def456"
  },
  "action": "query_threat_intel",
  "input": {
    "ip_address": "203.0.113.42",
    "sources": ["virustotal", "abuseipdb"]
  },
  "output": {
    "virustotal": {
      "malicious_score": 8,
      "detections": 12
    },
    "abuseipdb": {
      "abuse_confidence": 95,
      "reports": 47
    }
  },
  "duration_ms": 1205
}

The audit log is append-only. Entries cannot be modified or deleted. This preserves evidence integrity for compliance audits and legal proceedings.
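
One common way to make an append-only log tamper-evident is to chain entries by hash, so that altering any earlier entry invalidates every later one. The sketch below illustrates that technique; it is an assumption for illustration, not a description of how Tracecat stores its audit log. The `AuditLog` class and its field names are hypothetical, and timestamps are omitted to keep the example deterministic.

```python
import hashlib
import json


class AuditLog:
    """In-memory append-only log; each entry stores the hash of the previous one."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    @staticmethod
    def _digest(body: dict) -> str:
        # Canonical JSON (sorted keys) so identical content always hashes the same
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, actor: str, action: str, detail: dict) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else "0" * 64
        body = {"actor": actor, "action": action, "detail": detail, "prev_hash": prev_hash}
        entry = {**body, "hash": self._digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; False means an entry was altered or reordered."""
        prev_hash = "0" * 64
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(body):
                return False
            prev_hash = entry["hash"]
        return True


log = AuditLog()
log.append("wf_threat_intel", "query_threat_intel", {"ip_address": "203.0.113.42"})
log.append("analyst_42", "escalate_case", {"reason": "confirmed malicious IP"})
print(log.verify())  # True
```

A hash chain detects tampering but does not prevent it, so it complements database-level write restrictions rather than replacing them.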

Business workflow tools log execution history for debugging, but do not enforce immutability or actor attribution. Tracecat treats the audit log as a compliance artifact.

Analyst Handoff and Human-in-the-Loop

Security workflows pause for human decisions. An analyst reviews automated findings and decides whether to escalate, investigate further, or close the case. Tracecat implements handoff with approval steps.

An approval step:

  1. Pauses workflow execution
  2. Notifies the assigned analyst (Slack message, email, or in-app notification)
  3. Displays evidence collected so far in the Tracecat UI
  4. Waits for analyst decision (approve, reject, or request more data)
  5. Resumes workflow with the analyst’s decision as input to subsequent steps

Example workflow with analyst handoff:

# Simplified workflow definition
name: Investigate Suspicious Login
trigger:
  type: webhook
  source: siem

steps:
  - id: enrich_user
    action: query_user_directory
    input:
      username: "{{ trigger.username }}"
    
  - id: check_threat_intel
    action: query_threat_intel
    input:
      ip_address: "{{ trigger.source_ip }}"
    
  # Workflow pauses here until analyst makes a decision
  # Decision is recorded in audit log with analyst ID and timestamp
  - id: analyst_review
    action: request_approval
    input:
      assigned_to: "{{ case.assigned_to }}"
      message: "Suspicious login from {{ trigger.source_ip }}. User: {{ enrich_user.full_name }}. Threat score: {{ check_threat_intel.score }}"
      options: ["investigate", "escalate", "false_positive"]
    
  - id: investigate_further
    condition: "{{ analyst_review.decision == 'investigate' }}"
    action: query_logs
    input:
      username: "{{ trigger.username }}"
      time_range: "24h"
    
  - id: escalate_case
    condition: "{{ analyst_review.decision == 'escalate' }}"
    action: escalate_to_senior
    input:
      case_id: "{{ case.id }}"
      reason: "{{ analyst_review.notes }}"

The workflow pauses at analyst_review until the analyst makes a decision. The decision is recorded in the case timeline with the analyst’s user ID and timestamp. Subsequent steps branch based on the decision.
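
The pause-and-resume behavior can be sketched with an awaitable that the workflow suspends on until the analyst's decision arrives. This is a minimal in-process illustration using `asyncio`; the `ApprovalGate` class and the step names returned by `run_workflow` are hypothetical. A real engine would have to persist the pending approval so that a pause lasting hours or days survives restarts.

```python
import asyncio


class ApprovalGate:
    """Holds one pending decision per step; a stand-in for a persisted approval record."""

    def __init__(self) -> None:
        self._pending: dict[str, asyncio.Future] = {}

    async def wait_for_decision(self, step_id: str) -> dict:
        # The workflow coroutine suspends here until submit() is called
        future = asyncio.get_running_loop().create_future()
        self._pending[step_id] = future
        return await future

    def submit(self, step_id: str, analyst_id: str, decision: str) -> None:
        # Called from the UI or API side when the analyst decides
        result = {"analyst_id": analyst_id, "decision": decision}
        self._pending.pop(step_id).set_result(result)


async def run_workflow(gate: ApprovalGate) -> str:
    review = await gate.wait_for_decision("analyst_review")
    # Branch on the decision, mirroring the conditions in the workflow definition
    if review["decision"] == "investigate":
        return "query_logs"
    if review["decision"] == "escalate":
        return "escalate_to_senior"
    return "close_case"


async def demo() -> str:
    gate = ApprovalGate()
    workflow = asyncio.create_task(run_workflow(gate))
    await asyncio.sleep(0)  # let the workflow reach the approval step
    gate.submit("analyst_review", analyst_id="analyst_42", decision="escalate")
    return await workflow


print(asyncio.run(demo()))  # escalate_to_senior
```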

Business workflow tools (Zapier, n8n) are built around runs that finish in seconds or minutes. They offer delay and wait steps, but these are not designed around a case that stays open pending an analyst's decision. Tracecat workflows can pause for hours or days waiting for analyst input.

Parallel Investigation with Different SLAs

Security investigations query multiple data sources concurrently: SIEM for related events, EDR for malware indicators, threat intelligence feeds for IP reputation, user directory for account details. Each query has a different timeout and failure mode.

Tracecat executes steps in parallel when they do not depend on each other. The workflow engine tracks dependencies and launches independent steps concurrently.

Example parallel execution:

steps:
  # These three queries run concurrently
  # Each has a different timeout based on expected response time
  - id: query_siem
    action: query_siem
    input:
      query: "source_ip={{ trigger.ip }}"
      time_range: "1h"
    timeout: 30s  # SIEM queries can be slow
    
  - id: query_edr
    action: query_edr
    input:
      host: "{{ trigger.hostname }}"
    timeout: 10s
    
  - id: query_threat_intel
    action: query_threat_intel
    input:
      ip_address: "{{ trigger.ip }}"
    timeout: 5s
    
  # Waits for all three queries (or timeouts) before proceeding
  # Partial data is acceptable: security investigations often proceed with incomplete information
  - id: aggregate_findings
    depends_on: [query_siem, query_edr, query_threat_intel]
    action: aggregate
    input:
      siem: "{{ query_siem.results }}"
      edr: "{{ query_edr.results }}"
      threat_intel: "{{ query_threat_intel.results }}"

The workflow engine launches query_siem, query_edr, and query_threat_intel concurrently. The aggregate_findings step waits for all three to complete (or time out) before executing.

If query_siem times out after 30 seconds, the workflow continues with partial data. The timeout is recorded in the case timeline. The analyst sees that SIEM data is missing and can manually query the SIEM if needed.
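
The "continue with partial data" behavior can be sketched with `asyncio.gather` and a per-step `asyncio.wait_for`. This is an illustration under stated assumptions, not Tracecat's engine: `run_with_timeout` and `fake_query` are hypothetical helpers, and the timeouts are scaled down from 30s/10s/5s so the example finishes quickly.

```python
import asyncio


async def run_with_timeout(name: str, coro, timeout_s: float) -> tuple[str, dict]:
    """Run one query; on timeout return a marker instead of failing the workflow."""
    try:
        results = await asyncio.wait_for(coro, timeout_s)
        return name, {"status": "ok", "results": results}
    except asyncio.TimeoutError:
        return name, {"status": "timeout", "results": None}


async def fake_query(delay_s: float, payload: list) -> list:
    # Stand-in for a real SIEM, EDR, or threat intelligence call
    await asyncio.sleep(delay_s)
    return payload


async def investigate() -> dict:
    # All three queries start together; each has its own deadline
    pairs = await asyncio.gather(
        run_with_timeout("query_siem", fake_query(0.5, ["event"]), timeout_s=0.05),
        run_with_timeout("query_edr", fake_query(0.0, ["process"]), timeout_s=0.2),
        run_with_timeout("query_threat_intel", fake_query(0.0, ["score"]), timeout_s=0.1),
    )
    return dict(pairs)


findings = asyncio.run(investigate())
print(findings["query_siem"]["status"])  # timeout
print(findings["query_edr"]["results"])  # ['process']
```

Returning a status marker rather than raising lets the aggregation step see exactly which source is missing, which is what the analyst needs to know.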

Business workflow tools typically execute steps sequentially by default. Parallel execution usually requires explicit configuration, and partial failures are often handled poorly. Tracecat treats partial data as normal: security investigations often proceed with incomplete information.

Integration Boundaries for Security Tools

Security automation integrates with tools that have different authentication models, rate limits, and failure modes than business APIs.

Common security integrations:

  • SIEM (Splunk, Elastic Security): Query logs with complex search syntax. Queries can take 10-60 seconds. Rate limits are per-user, not per-API-key.
  • EDR (CrowdStrike, SentinelOne): Query endpoint telemetry. Requires OAuth or API key. Rate limits are strict (100 requests per minute).
  • Threat intelligence (VirusTotal, AlienVault OTX): Lookup IP/domain/file hash reputation. Free tiers have low rate limits (4 requests per minute for VirusTotal).
  • Ticketing (Jira, ServiceNow): Create and update incident tickets. Requires OAuth. No strict rate limits but slow API response times (1-3 seconds per request).
  • User directory (Active Directory, Okta): Query user account details. LDAP for Active Directory; API token or OAuth 2.0 for Okta. Fast queries (100-200ms).

Tracecat’s integration layer handles:

  • Credential management: Store API keys, OAuth tokens, and LDAP credentials encrypted at rest. Rotate tokens before expiration.
  • Rate limiting: Track requests per integration and back off when limits are hit. Queue requests and retry with exponential backoff.
  • Timeout handling: Set per-integration timeouts. If a SIEM query times out, log the failure and continue the workflow.
  • Error classification: Distinguish retryable errors (503 Service Unavailable, rate limit) from permanent failures (401 Unauthorized, 404 Not Found).
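
The retry and error-classification items above can be sketched as follows. This is a minimal illustration, assuming the status-code mapping given in the list; `RetryableError`, `PermanentError`, `classify`, and `call_with_backoff` are hypothetical names, and the small default delay is chosen only so the example runs quickly.

```python
import asyncio
import random


class RetryableError(Exception):
    """Transient failure, e.g. HTTP 429 or 503."""


class PermanentError(Exception):
    """Failure that retrying will not fix, e.g. HTTP 401 or 404."""


def classify(status_code: int) -> type[Exception]:
    # Follows the examples in the list above; other codes are treated as permanent
    return RetryableError if status_code in {429, 503} else PermanentError


async def call_with_backoff(func, max_attempts: int = 4, base_delay_s: float = 0.01):
    """Call func(); retry RetryableError with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except RetryableError:
            if attempt == max_attempts:
                raise
            delay = base_delay_s * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay))
        # PermanentError is not caught, so it propagates immediately
```

Jitter spreads retries out so that several workflows hitting the same limit do not all retry at the same instant.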

Example integration definition:

# Simplified integration for VirusTotal
import httpx


class RateLimitError(Exception):
    """Retryable: the caller should back off and try again."""


class AuthenticationError(Exception):
    """Permanent: retrying with the same key will not help."""


class VirusTotalIntegration:
    name = "virustotal"
    auth_type = "api_key"
    rate_limit = 4  # requests per minute

    async def query_ip(self, ip_address: str, api_key: str) -> dict:
        url = f"https://www.virustotal.com/api/v3/ip_addresses/{ip_address}"
        headers = {"x-apikey": api_key}

        async with httpx.AsyncClient(timeout=5.0) as client:
            try:
                response = await client.get(url, headers=headers)
                response.raise_for_status()
                return response.json()
            except httpx.HTTPStatusError as e:
                if e.response.status_code == 429:
                    # Rate limit hit; the caller retries after backoff
                    raise RateLimitError("VirusTotal rate limit exceeded") from e
                elif e.response.status_code == 401:
                    # Invalid API key, do not retry
                    raise AuthenticationError("Invalid VirusTotal API key") from e
                else:
                    raise
            except httpx.TimeoutException as e:
                # Query timed out; the caller logs the failure and continues
                raise TimeoutError("VirusTotal query timed out") from e

Business workflow tools integrate with APIs that have predictable rate limits and fast response times (Stripe, Slack, Google Sheets). Security tools have unpredictable query times and strict rate limits. Tracecat’s integration layer is tuned for these constraints.
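
A strict limit such as four requests per minute can be respected on the client side with a sliding-window limiter. The sketch below is one possible approach, not Tracecat's implementation; `SlidingWindowLimiter` is a hypothetical name, and the clock is passed in explicitly so the behavior is easy to follow.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any `window_s`-second window."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._calls: deque[float] = deque()

    def try_acquire(self, now: float) -> float:
        """Return 0.0 if the call may proceed, else the seconds to wait."""
        # Drop timestamps that have left the window
        while self._calls and now - self._calls[0] >= self.window_s:
            self._calls.popleft()
        if len(self._calls) < self.limit:
            self._calls.append(now)
            return 0.0
        return self.window_s - (now - self._calls[0])


limiter = SlidingWindowLimiter(limit=4, window_s=60.0)
waits = [limiter.try_acquire(now=t) for t in (0.0, 1.0, 2.0, 3.0, 4.0)]
print(waits)  # [0.0, 0.0, 0.0, 0.0, 56.0]
```

The fifth call is told to wait 56 seconds, at which point the first call has left the window. In production code, `now` would come from `time.monotonic()`.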

Deployment and Observability

Tracecat ships as a Docker Compose stack: Python backend (FastAPI), PostgreSQL for case state, Redis for job queue, and a React admin UI. The simplest deployment is a single server running all components.

For production, you split components:

  • API tier: Handles webhook ingestion, analyst UI, and workflow orchestration. Stateless, scales horizontally.
  • Worker tier: Executes workflow steps. Scales by adding worker processes that pull jobs from Redis.
  • Database: PostgreSQL with replication for high availability. Case state and audit logs are stored here. Audit log immutability is enforced via PostgreSQL triggers or append-only tables, not application-level logic, to prevent tampering.
  • Queue: Redis for job distribution across workers.
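
Enforcing immutability in the database rather than in application code can be sketched with triggers that reject every UPDATE and DELETE on the audit table. The example uses Python's built-in `sqlite3` module as a stand-in so that it runs anywhere; PostgreSQL trigger syntax differs, and the table and trigger names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE audit_log (
        id INTEGER PRIMARY KEY,
        case_id TEXT NOT NULL,
        actor TEXT NOT NULL,
        action TEXT NOT NULL
    );
    -- Reject any attempt to change or remove an existing row
    CREATE TRIGGER audit_log_no_update BEFORE UPDATE ON audit_log
    BEGIN
        SELECT RAISE(ABORT, 'audit_log is append-only');
    END;
    CREATE TRIGGER audit_log_no_delete BEFORE DELETE ON audit_log
    BEGIN
        SELECT RAISE(ABORT, 'audit_log is append-only');
    END;
    """
)

# Inserts are still allowed
conn.execute(
    "INSERT INTO audit_log (case_id, actor, action) VALUES (?, ?, ?)",
    ("case_xyz789", "analyst_42", "escalate_case"),
)


def is_rejected(statement: str) -> bool:
    """Return True if the database refuses the statement."""
    try:
        conn.execute(statement)
    except sqlite3.DatabaseError:
        return True
    return False


print(is_rejected("UPDATE audit_log SET actor = 'someone_else' WHERE id = 1"))  # True
print(is_rejected("DELETE FROM audit_log WHERE id = 1"))  # True
```

Because the rule lives in the database, a bug or a compromised application process cannot rewrite history through ordinary queries. An account with rights to drop the triggers still could, so database privileges need to be restricted as well.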

Observability requirements for security automation:

  • Case dashboard: Show open cases, assigned analysts, and investigation status.
  • Workflow execution logs: Display step-by-step execution with input/output for each step.
  • Integration health: Track API response times, rate limit usage, and error rates per integration.
  • Audit log export: Export case timelines to SIEM or compliance platform (Splunk, Datadog).
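
Audit log export can be as simple as writing the case timeline in JSON Lines, a format most log platforms can ingest. The sketch below is an assumption about what such an export might look like; `export_timeline` is a hypothetical helper, and the events reuse the field names from the timeline entry shown earlier.

```python
import json


def export_timeline(events: list[dict]) -> str:
    """Serialize timeline events as JSON Lines, oldest first."""
    # ISO 8601 UTC timestamps in one fixed format sort correctly as plain strings
    ordered = sorted(events, key=lambda event: event["timestamp"])
    return "\n".join(json.dumps(event, sort_keys=True) for event in ordered)


events = [
    {"timestamp": "2026-05-17T14:35:02.118Z", "actor": "analyst_42", "action": "escalate_case"},
    {"timestamp": "2026-05-17T14:32:15.342Z", "actor": "wf_threat_intel", "action": "query_threat_intel"},
]
print(export_timeline(events))
```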