mech.app
AI Agents

Dograh's Visual Workflow Builder: How Voice AI Platforms Route Speech-to-Speech Pipelines Without Code

How Dograh orchestrates STT/LLM/TTS components, handles telephony integration, and manages MCP-native tool calling in real-time voice agents.

Source: github.com
Dograh's Visual Workflow Builder: How Voice AI Platforms Route Speech-to-Speech Pipelines Without Code

Voice AI platforms promise drag-and-drop simplicity, but the plumbing underneath is a real-time state machine that routes audio streams through STT, LLM inference, TTS synthesis, and tool calls while maintaining sub-second latency. Dograh is an open-source, self-hosted alternative to Vapi and Retell that exposes this orchestration layer through a visual workflow builder. It hit #9 on GitHub Trending (Python) with a 155 score and positions itself as the transparent option for teams that need to see how voice agents actually work.

The interesting part is not the drag-and-drop UI. It’s how the platform compiles visual nodes into executable state machines, handles telephony integration (SIP/WebRTC), and manages MCP-native tool calling without breaking the conversational flow. This is the infrastructure behind “zero to working bot in under 2 minutes.”

The Speech-to-Speech Pipeline Problem

A voice agent is not a single API call. It’s a pipeline with at least four distinct stages:

  1. Speech-to-Text (STT): Convert incoming audio to text (Deepgram, AssemblyAI, Whisper)
  2. LLM Inference: Generate a response based on conversation state (OpenAI, Anthropic, local models)
  3. Text-to-Speech (TTS): Synthesize the response into audio (ElevenLabs, PlayHT, Cartesia)
  4. Tool Calling: Execute external actions mid-conversation (MCP servers, REST APIs, database queries)

Each stage has its own latency profile, failure modes, and provider quirks. STT might take 200ms, LLM inference 800ms, TTS another 300ms. That’s 1.3 seconds before the user hears a response, and you haven’t even handled interruptions, partial transcripts, or tool call timeouts.

The orchestration challenge is keeping all four stages in sync while maintaining conversational latency. If the user interrupts mid-response, you need to cancel TTS synthesis, flush the audio buffer, and restart the pipeline. If a tool call times out, you need to decide whether to retry, fallback, or surface an error to the user.

How the Visual Workflow Builder Compiles to State Machines

Dograh’s workflow builder lets you drag nodes for STT providers, LLM models, TTS engines, and tool calls onto a canvas. Each node has configuration options (API keys, model parameters, retry logic) and connection points that define the data flow.

When you save a workflow, the platform compiles it into a state machine that runs on the backend. The compilation step does three things:

  1. Validates the graph: Ensures every node has required inputs, no cycles exist (unless explicitly allowed for loops), and tool call nodes are properly connected to LLM nodes.
  2. Generates routing logic: Converts visual connections into conditional branches. If Node A outputs to Node B and Node C, the state machine needs to know when to fork execution.
  3. Injects observability hooks: Adds logging, metrics, and tracing at every transition so you can debug failures in production.

The resulting state machine is a Python object (likely using a library like transitions or a custom FSM) that gets instantiated per conversation. Each conversation has its own state machine instance, which means you can run thousands of concurrent voice agents without shared state collisions.

Example Workflow Compilation

A simple workflow might look like this:

[Telephony Input] → [Deepgram STT] → [OpenAI GPT-4] → [ElevenLabs TTS] → [Telephony Output]

                                      [MCP Tool Call]

The compiled state machine has states like LISTENING, TRANSCRIBING, INFERRING, CALLING_TOOL, SYNTHESIZING, and SPEAKING. Transitions are triggered by events: audio_chunk_received, transcript_ready, llm_response_ready, tool_result_ready, audio_synthesized.

If the LLM decides to call a tool, the state machine transitions to CALLING_TOOL, waits for the result, then transitions back to INFERRING with the tool output appended to the conversation context. If the tool call times out, the state machine can transition to a fallback state that returns a canned response or asks the user to retry.

MCP-Native Tool Calling in a Voice Context

MCP (Model Context Protocol) is Anthropic’s standard for connecting LLMs to external tools. Dograh claims to be “MCP native,” which means it can discover and invoke MCP servers without custom integration code.

In a voice context, MCP tool calling introduces latency and failure modes that don’t exist in text-based agents:

  • Latency budget: You have maybe 500ms to call a tool and get a result before the conversation feels broken. If the tool takes 2 seconds, the user will think the agent is stuck.
  • Partial results: Some tools return streaming results (e.g., database queries). The agent needs to decide whether to wait for the full result or start synthesizing audio with partial data.
  • Error handling: If the tool call fails, the agent needs to surface a useful error to the user without exposing internal details. “I couldn’t fetch your account balance” is better than “HTTP 500 from api.bank.com.”

Dograh handles this by treating tool calls as async operations with configurable timeouts. When the LLM emits a tool call, the state machine transitions to CALLING_TOOL, sends the request to the MCP server, and starts a timer. If the result arrives before the timeout, the state machine transitions back to INFERRING. If the timeout expires, the state machine transitions to a fallback state.

The platform also supports tool call batching. If the LLM emits multiple tool calls in a single response, the state machine can execute them in parallel and wait for all results before continuing. This reduces total latency but increases complexity (what if one tool succeeds and another fails?).

Telephony Integration: SIP, WebRTC, and the Audio Buffer Problem

Voice agents don’t run in a vacuum. They need to connect to phone systems (SIP trunks), web browsers (WebRTC), or mobile apps (native SDKs). Dograh supports all three, but the plumbing is different for each.

SIP Integration

SIP (Session Initiation Protocol) is the standard for VoIP telephony. When a user calls a phone number, the SIP trunk routes the call to Dograh’s backend, which starts a new conversation state machine.

The audio stream is typically G.711 or Opus codec at 8kHz or 16kHz. Dograh needs to:

  1. Buffer incoming audio: Collect enough samples to send to the STT provider (usually 100-500ms chunks).
  2. Handle jitter: Network delays can cause audio packets to arrive out of order. The platform needs a jitter buffer to reorder packets before sending them to STT.
  3. Manage echo cancellation: If the agent is speaking while the user is speaking, the microphone will pick up the agent’s audio. The platform needs acoustic echo cancellation (AEC) to filter this out.

SIP also introduces failure modes like dropped connections, codec mismatches, and DTMF tone interference. If the call drops mid-conversation, the state machine needs to clean up resources (close STT/TTS streams, cancel pending tool calls) and log the failure for debugging.

WebRTC Integration

WebRTC is the standard for browser-based voice/video. It’s more complex than SIP because it requires signaling (exchanging connection metadata), ICE negotiation (finding the best network path), and DTLS encryption (securing the media stream).

Dograh’s WebRTC integration likely uses a library like aiortc (Python) or pion (Go) to handle the protocol details. The platform needs to:

  1. Establish a peer connection: Exchange SDP offers/answers with the browser to negotiate codecs and network paths.
  2. Handle ICE candidates: Collect and exchange network addresses (STUN/TURN servers) to find the best route for audio packets.
  3. Manage media tracks: Receive incoming audio from the browser, send outgoing audio back, and handle track muting/unmuting.

WebRTC also supports data channels, which Dograh could use for out-of-band signaling (e.g., sending typing indicators, metadata, or tool call results without interrupting the audio stream).

State Management and Conversation Context

Every voice agent needs to maintain conversation state: what the user said, what the agent said, what tools were called, and what the current intent is. Dograh’s state machine keeps this context in memory (per conversation instance) and optionally persists it to a database for long-running conversations.

The context includes:

  • Transcript history: Full conversation log (user utterances + agent responses)
  • Tool call history: What tools were called, with what arguments, and what they returned
  • Session metadata: User ID, phone number, start time, current state
  • LLM context window: The last N tokens sent to the LLM (to stay within model limits)

When the LLM generates a response, it needs the full context to maintain coherence. But LLMs have token limits (e.g., GPT-4 supports 128k tokens, but that’s expensive). Dograh needs a strategy for context pruning: keep the most recent messages, summarize older messages, or use a sliding window.

The platform also needs to handle interruptions. If the user starts speaking while the agent is speaking, the state machine needs to:

  1. Stop TTS synthesis: Cancel the current audio generation.
  2. Flush the audio buffer: Clear any queued audio chunks.
  3. Restart STT: Start transcribing the user’s new utterance.
  4. Update context: Mark the interrupted response as incomplete and append the new user input.

This is harder than it sounds. TTS providers don’t always support cancellation (you might need to close the connection and open a new one). Audio buffers might have 200-500ms of latency, so the user will hear a brief overlap before the agent stops speaking.

Failure Modes and Observability

Voice agents fail in creative ways. Here are the most common failure modes and how Dograh handles them:

Failure ModeSymptomDograh’s Handling
STT timeoutUser speaks, no transcript arrivesRetry with exponential backoff, fallback to silence detection
LLM timeoutTranscript sent, no response after 5sReturn canned response (“I’m having trouble thinking right now”)
TTS timeoutResponse ready, no audio after 3sRetry once, then fallback to text-only response (for WebRTC)
Tool call timeoutTool invoked, no result after 2sReturn error to LLM, let it decide how to proceed
Network partitionAudio packets stop arrivingDetect via keepalive, close connection, log failure
Codec mismatchAudio is garbled or silentDetect via audio level monitoring, renegotiate codec

Observability is critical for debugging these failures. Dograh needs to log:

  • Every state transition: When did the conversation move from LISTENING to TRANSCRIBING?
  • Every API call: What was sent to the STT/LLM/TTS provider, and what came back?
  • Every tool call: What arguments were passed, what was returned, how long did it take?
  • Every audio chunk: How many bytes, what codec, what timestamp?

The platform likely uses structured logging (JSON) and exports metrics to Prometheus or a similar system. You should be able to query “show me all conversations where the LLM timeout exceeded 5 seconds” or “show me all tool calls that failed with HTTP 500.” Dograh exposes these logs through a dashboard interface and can export metrics to external observability platforms for production monitoring.

Deployment Shape and Resource Requirements

Dograh is self-hosted, which means you need to run it on your own infrastructure. The platform provides a Docker Compose file for local development and a Kubernetes Helm chart for production.

The resource requirements depend on concurrency:

  • Low concurrency (1-10 agents): 2 CPU cores, 4GB RAM, runs on a single VM or container.
  • Medium concurrency (10-100 agents): 8 CPU cores, 16GB RAM, consider horizontal scaling with a load balancer.
  • High concurrency (100+ agents): Kubernetes cluster with autoscaling, separate pods for STT/LLM/TTS to isolate failures.

The platform also needs external dependencies:

  • STT/LLM/TTS providers: You bring your own API keys (BYOK). Dograh doesn’t host these services.
  • Database: PostgreSQL or similar for conversation history, user metadata, and workflow definitions.
  • Message queue: Redis or RabbitMQ for async tool calls and background jobs.
  • Telephony gateway: Twilio, Telnyx, or a self-hosted SIP trunk for phone integration.

The deployment shape affects latency. If your STT provider is in us-east-1 and your Dograh instance is in eu-west-1, you’ll add 80-100ms of round-trip latency. Co-locating services in the same region (or even the same availability zone) is critical for sub-second response times.

Dograh is in early-stage adoption, so production deployment experience is limited compared to Vapi/Retell. Early adopters should expect to debug infrastructure issues and contribute fixes upstream.

Code Snippet: Simplified State Machine

Here’s a simplified example of what the compiled state machine might look like. This is pseudocode for illustration; Dograh’s actual implementation may use a different library or pattern.

from transitions import Machine

class VoiceAgent:
    states = ['idle', 'listening', 'transcribing', 'inferring', 
              'calling_tool', 'synthesizing', 'speaking']
    
    def __init__(self, workflow_config):
        self.machine = Machine(model=self, states=VoiceAgent.states, 
                               initial='idle')
        # Define transitions with callbacks that execute after state change
        self.machine.add_transition('start', 'idle', 'listening')
        self.machine.add_transition('audio_received', 'listening', 
                                    'transcribing', after='send_to_stt')
        self.machine.add_transition('transcript_ready', 'transcribing', 
                                    'inferring', after='send_to_llm')
        self.machine.add_transition('tool_call_needed', 'inferring', 
                                    'calling_tool', after='invoke_tool')
        self.machine.add_transition('tool_result_ready', 'calling_tool', 
                                    'inferring', after='append_tool_result')
        self.machine.add_transition('response_ready', 'inferring', 
                                    'synthesizing', after='send_to_tts')
        self.machine.add_transition('audio_ready', 'synthesizing', 
                                    'speaking', after='play_audio')
        self.machine.add_transition('done_speaking', 'speaking', 
                                    'listening')
        
        self.workflow = workflow_config
        self.context = []
    
    def send_to_stt(self):
        # Call STT provider with audio buffer
        pass
    
    def send_to_llm(self):
        # Call LLM with conversation context
        pass
    
    def invoke_tool(self):
        # Call MCP server with tool arguments
        pass
    
    def send_to_tts(self):
        # Call TTS provider with response text
        pass
    
    def play_audio(self):
        # Stream audio to telephony/WebRTC
        pass

This is a toy example, but it shows the core pattern: states, transitions, and callbacks that invoke external services. The real implementation would add error handling, timeouts, retries, and observability hooks at every step.

When to Use Dograh vs. SaaS Alternatives

ScenarioRecommendation
You need full control over data and infrastructureUse Dograh (self-hosted, open source)
You want to avoid per-minute pricingUse Dograh (BYOK model, no usage fees)
You need to customize the orchestration logicUse Dograh (full code access, extensible)
You want zero ops overheadUse Vapi or Retell (managed SaaS)
You need enterprise SLAs and supportUse Vapi or Retell (commercial backing)
You’re prototyping and don’t care about costUse Vapi or Retell (faster to start)

Dograh makes sense if you’re building a product where voice agents are a core feature and you need to understand (

Tags

agentic-ai orchestration infrastructure voice-ai mcp

Primary Source

github.com