AutoSubs' Direct DaVinci Integration: Why On-Device Transcription Agents Skip the API Layer

AutoSubs runs Whisper models on your machine and writes timestamped subtitles directly into DaVinci Resolve, Premiere Pro, and After Effects timelines. No cloud. No API keys. No subscription. The project has 3,514 stars and logs 11,000+ weekly app opens because it solves a specific integration problem: getting local AI output into professional video tools without forcing editors to copy-paste SRT files.

This is not a wrapper around OpenAI’s Whisper API. It’s an Electron app that runs inference locally, then bridges the gap between model output and application state using native plugins and filesystem contracts. The architecture reveals a pattern worth studying: when to bypass traditional API boundaries and write directly to application data structures.

The Integration Stack

AutoSubs uses three integration paths depending on the target application:

DaVinci Resolve: Writes a Lua script that Resolve’s scripting API can execute. The script creates subtitle tracks and inserts text clips with precise timecodes. No plugin installation required. Resolve’s scripting API supports both Python and Lua.

Adobe Premiere Pro & After Effects: Bundles a CEP (Common Extensibility Platform) extension that runs inside the Adobe host. The extension reads JSON files written by the Electron app and calls ExtendScript APIs to create text layers or caption tracks.

Standalone mode: Exports SRT, plain text, or JSON for manual import.

The key decision is where to place the integration boundary. AutoSubs could have exported SRT and relied on each application’s import dialog. Instead, it writes directly to the timeline, which requires understanding each tool’s plugin architecture.
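For reference, the SRT route that AutoSubs bypasses is simple to produce. A minimal sketch, assuming segments arrive as (start_seconds, end_seconds, text) tuples; that tuple shape and the function names are illustrative assumptions, not AutoSubs’ internal format:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start_seconds, end_seconds, text) tuples as an SRT document."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Blocks are separated by a blank line, as the SRT format expects.
    return "\n".join(blocks)
```

The file itself is trivial; the cost of this route is the manual import step in each editor, which is what the direct integrations remove.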

Whisper Model Execution

The app downloads quantized Whisper models (tiny, base, small, medium, large) and runs them using whisper.cpp bindings. Model selection trades accuracy for speed:

Model	Size	Speed (RTX 3060)	Use Case
Tiny	75 MB	10x realtime	Drafts, quick review
Base	142 MB	7x realtime	General use
Small	466 MB	4x realtime	Balanced accuracy
Medium	1.5 GB	2x realtime	High accuracy
Large	2.9 GB	1x realtime	Maximum accuracy
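
The speed column converts directly into wall-clock estimates: processing time is the audio duration divided by the realtime factor. A small sketch using the table’s RTX 3060 figures (the dictionary and function names are illustrative):

```python
# Realtime factors copied from the table above (RTX 3060).
REALTIME_FACTOR = {"tiny": 10.0, "base": 7.0, "small": 4.0, "medium": 2.0, "large": 1.0}


def estimated_minutes(audio_minutes: float, model: str) -> float:
    """Estimate processing time: audio duration divided by the realtime factor."""
    return audio_minutes / REALTIME_FACTOR[model]


for model in REALTIME_FACTOR:
    print(f"{model:>6}: {estimated_minutes(60, model):5.1f} min for a 60-minute file")
```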

Speaker diarization runs as a separate pass using pyannote.audio models. The app segments audio by speaker, then assigns labels (Speaker 1, Speaker 2) to each subtitle block. This requires a second model download and adds 20-30% processing time.
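
One common way to combine the two passes is to give each subtitle segment the speaker whose diarization turn overlaps it the most. The sketch below illustrates that idea with plain tuples; the data shapes and the function name are assumptions for illustration and are not taken from AutoSubs or pyannote.audio:

```python
def assign_speakers(segments, turns):
    """Label each (start, end, text) segment with the best-overlapping speaker.

    turns is a list of (start, end, speaker) tuples from a diarization pass.
    Segments with no overlapping turn get the label None.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the intersection of the two time intervals.
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((seg_start, seg_end, best_speaker, text))
    return labeled
```

A segment that straddles a speaker change gets a single label under this rule; splitting such segments at the turn boundary would be a further refinement.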

Translation uses Whisper’s built-in translate task. The model reads the source-language audio and emits English text in a single inference pass, so no separate translation model is required. Whisper’s translate task targets English only; any other target language would need a separate translation step.

DaVinci Resolve Integration Mechanics

Resolve’s scripting API exposes timeline objects through Python or Lua. AutoSubs generates a Lua script that:

Creates a new subtitle track in the active timeline
Iterates through transcription segments
Calls AddMarker() or CreateSubtitleFromText() for each segment with start/end frames
Sets text content and speaker labels

The generated script looks like this:

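-- Walk from the running Resolve instance down to the active timeline.
resolve = Resolve()
projectManager = resolve:GetProjectManager()
project = projectManager:GetCurrentProject()
timeline = project:GetCurrentTimeline()

-- Add a subtitle track to the timeline.
subtitleTrack = timeline:AddTrack("subtitle")

-- AddMarker arguments: frame, color, name, note, duration, custom data.
timeline:AddMarker(24, "Blue", "Speaker 1", "Hello, this is the first line.", 1, 48)
timeline:AddMarker(72, "Blue", "Speaker 2", "And this is the second.", 1, 60)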

The script runs inside Resolve’s scripting console. AutoSubs can trigger execution automatically if Resolve’s remote scripting is enabled (requires setting an environment variable and opening a TCP port). Otherwise, the user copies the script path and runs it manually.
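
Generating such a script is mostly timecode arithmetic and string escaping. A minimal sketch, assuming segments are (start_seconds, end_seconds, speaker, text) tuples and a known timeline frame rate. The function names are hypothetical, and the emitted calls mirror the marker format shown above while treating the fifth argument as the segment length in frames, which is an assumption about the intended call:

```python
def lua_quote(text: str) -> str:
    """Return text as a double-quoted Lua string literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def seconds_to_frame(seconds: float, fps: float) -> int:
    """Convert a time offset in seconds to the nearest frame index."""
    return round(seconds * fps)


def marker_lines(segments, fps: float, color: str = "Blue"):
    """Yield one timeline:AddMarker(...) line per segment."""
    for start, end, speaker, text in segments:
        start_frame = seconds_to_frame(start, fps)
        # Keep at least one frame so very short segments stay visible.
        duration = max(1, seconds_to_frame(end, fps) - start_frame)
        yield (
            f"timeline:AddMarker({start_frame}, {lua_quote(color)}, "
            f"{lua_quote(speaker)}, {lua_quote(text)}, {duration})"
        )
```

Escaping matters because transcribed speech routinely contains quotes and backslashes, and an unescaped quote would break the generated script.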

This approach has a critical limitation: it only works if Resolve’s scripting API is enabled and accessible. Corporate or locked-down installations may disable remote scripting for security reasons. In those cases, AutoSubs falls back to SRT export.

Adobe CEP Extension Architecture

Adobe apps use CEP (Common Extensibility Platform) for third-party panels. AutoSubs bundles a CEP extension that:

Registers as a panel in Premiere Pro or After Effects
Watches a filesystem directory for JSON files written by the Electron app
Parses transcription data and calls ExtendScript APIs to create caption tracks or text layers

The extension runs in a Chromium-based environment inside the Adobe host. It communicates with the Electron app through file-based IPC: the Electron app writes JSON, the CEP extension reads it and executes timeline operations.
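
File-based IPC of this kind is usually made safe by writing to a temporary file and renaming it into place, so a watcher never sees a half-written document. A sketch of that pattern, not AutoSubs’ actual code; the directory layout, file name, and payload shape are assumptions:

```python
import json
import os
import tempfile


def write_job_atomically(watch_dir: str, name: str, payload: dict) -> str:
    """Write payload as JSON so readers only ever see a complete file."""
    os.makedirs(watch_dir, exist_ok=True)
    final_path = os.path.join(watch_dir, name)
    # Create the temp file in the same directory so the rename stays on one
    # filesystem, which is what lets os.replace swap it in atomically.
    fd, tmp_path = tempfile.mkstemp(dir=watch_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, ensure_ascii=False)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temp file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

The reading side would then ignore anything ending in `.tmp` and only parse files with the final name.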

ExtendScript (Adobe’s scripting language) is single-threaded and synchronous. Creating 500 subtitle clips can take 10-15 seconds because each clip insertion blocks. The CEP extension batches operations where possible but cannot parallelize timeline writes.

State Management and Failure Modes

AutoSubs maintains state in three places:

Electron app: User settings, model downloads, transcription queue
Filesystem: Generated scripts, JSON files, SRT exports
Target application: Timeline state, subtitle tracks, text layers

The most common failure mode is desynchronization between the Electron app and the target application. AutoSubs does not read the timeline back, so it cannot tell what a previous run created or what the user has since edited or deleted. Rerunning the script adds a new subtitle track rather than updating the existing one, which leaves duplicates unless the user removes the old track first. There’s no two-way sync. The app writes forward only.

Another failure mode: model inference crashes or hangs on long files (2+ hours). The app doesn’t checkpoint progress, so a crash at 90% completion means restarting from zero. The solution is to split long files into chunks before transcription, but this requires manual intervention.
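
A checkpointed chunk loop would remove that manual step. The sketch below shows the idea under stated assumptions: `transcribe_chunk` is a hypothetical callback standing in for the real inference call, and the checkpoint file format is invented for illustration:

```python
import json
import os


def chunk_ranges(total_seconds: float, chunk_seconds: float):
    """Split [0, total_seconds) into consecutive (start, end) ranges."""
    ranges, start = [], 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        ranges.append((start, end))
        start = end
    return ranges


def transcribe_with_checkpoints(total_seconds, chunk_seconds, transcribe_chunk, checkpoint_path):
    """Transcribe chunk by chunk, saving results after each chunk completes."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as handle:
            done = json.load(handle)  # maps chunk index (as str) to its result
    ranges = chunk_ranges(total_seconds, chunk_seconds)
    for index, (start, end) in enumerate(ranges):
        key = str(index)
        if key in done:
            continue  # finished in an earlier run, so skip the work
        done[key] = transcribe_chunk(start, end)
        with open(checkpoint_path, "w", encoding="utf-8") as handle:
            json.dump(done, handle)
    return [done[str(index)] for index in range(len(ranges))]
```

With this shape, a crash at 90% costs one chunk of work on the next run instead of the whole file. Fixed-length cuts can split a word in half, so a real implementation would likely cut at silences.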

Security Boundaries

Running local models eliminates cloud API risks (data exfiltration, rate limits, cost) but introduces new ones:

Model provenance: AutoSubs downloads models from Hugging Face. If the download is compromised, the app feeds untrusted model files to its inference code.
Filesystem access: The app writes scripts and JSON to user-specified directories. A malicious script could overwrite system files if the user has write permissions.
Plugin execution: The CEP extension runs inside Adobe’s security sandbox, but ExtendScript has broad access to timeline data and project files.

The app does not sandbox model execution. Whisper runs in the same process as the Electron renderer. A malicious model could exploit vulnerabilities in whisper.cpp bindings to escape the renderer process.
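
One mitigation for the provenance risk is to pin a SHA-256 digest for each model file and refuse to load anything that does not match. A sketch using only the standard library; where the expected digest comes from (for example, a list shipped inside the signed app) is an assumption, and nothing here reflects what AutoSubs currently does:

```python
import hashlib
import hmac


def sha256_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so multi-gigabyte models do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_model(path: str, expected_sha256: str) -> bool:
    """Return True only if the file's digest matches the pinned value."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.lower())
```

A checksum only proves the file is the one that was pinned. It does not make the parser safe against a file that was malicious from the start.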

Observability and Debugging

AutoSubs logs transcription progress to the Electron console and displays a progress bar in the UI. Logs include:

Model load time
Audio preprocessing duration
Inference speed (tokens per second)
Speaker diarization results

There’s no structured logging or telemetry. If transcription fails, the user sees a generic error message. Debugging requires opening DevTools and inspecting console output.

The CEP extension logs to Adobe’s ExtendScript Toolkit, which most users don’t have installed. If timeline insertion fails, the error is silent. The extension should write logs to a user-accessible file, but it doesn’t.
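
Both gaps could be closed by writing one JSON object per line to a file the user can open and attach to a bug report. A minimal sketch with Python’s standard logging module; the logger name, field names, and log path are illustrative assumptions:

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Format each record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Structured fields arrive via logger.info(..., extra={"fields": {...}}).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(log_path: str) -> logging.Logger:
    """Create a logger that appends JSON lines to log_path."""
    logger = logging.getLogger("autosubs.sketch")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger
```

A call such as `logger.info("model_loaded", extra={"fields": {"model": "medium"}})` then produces a line that both a person and a script can read.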

Deployment Shape

AutoSubs ships as a signed Electron app for Windows and macOS. Linux users install via .deb or .rpm packages. The app bundles:

Electron runtime (150 MB)
whisper.cpp binaries for each platform
CEP extension (5 MB)
Python runtime for speaker diarization (200 MB)

Total install size: 400-500 MB before downloading models. The large bundle size is a trade-off for zero-config installation. Users don’t need to install Python, Node.js, or CUDA drivers separately.

Updates are manual. The app checks GitHub releases on startup and prompts the user to download new versions. It does not download or install updates on its own.

When On-Device Beats Cloud APIs

AutoSubs succeeds because it optimizes for a specific workflow: video editors who process dozens of files per day and don’t want to pay per-minute transcription fees. The on-device approach wins when:

Volume is high: Transcribing 50 videos/day costs $0 locally vs. $50-100 with cloud APIs.
Latency matters: No network round-trip. Transcription starts immediately.
Privacy is required: Medical, legal, or confidential content never leaves the machine.
Offline work is common: Editors on planes, in studios without internet, or in air-gapped environments.
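
The dollar range in the first item follows from simple arithmetic. A quick check, assuming the $0.10/minute cloud rate quoted later in this article and videos of 10 to 20 minutes (the video length is an assumption chosen to match the stated range):

```python
def cloud_cost(videos: int, minutes_per_video: float, rate_per_minute: float) -> float:
    """Cloud transcription cost in dollars for a batch of videos."""
    return videos * minutes_per_video * rate_per_minute


RATE = 0.10  # dollars per audio minute, the rate this article quotes

for minutes in (10, 20):
    daily = cloud_cost(50, minutes, RATE)
    print(f"50 videos/day at {minutes} min each: ${daily:.2f}/day")
```

The local cost is not literally zero once hardware and electricity are counted, but it does not grow with the number of minutes transcribed.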

The approach loses when:

Accuracy is critical: Cloud models (OpenAI, AssemblyAI) are larger and more accurate than quantized local models.
Hardware is limited: Transcribing a 1-hour video on a laptop without a GPU takes 2+ hours.
Integration is complex: Cloud APIs often provide richer metadata (per-word confidence scores, word-level timestamps) that AutoSubs does not expose.

Technical Verdict

Use AutoSubs if:

You process 15+ videos per week. At AssemblyAI’s $0.10/minute rate, a 20-minute video costs $2. AutoSubs pays for itself after 8-10 videos if you have suitable hardware.
Your GPU has 4GB+ VRAM. The medium model runs at 2x realtime on an RTX 3060. Without a GPU, expect 0.3-0.5x realtime on CPU, making it impractical for videos longer than 20 minutes.
You need 85-92% accuracy for draft subtitles or rough cuts. Quantized Whisper models hit this range on clear audio. Not suitable for broadcast, legal transcripts, or accessibility compliance where 98%+ accuracy is required.
You work in DaVinci Resolve with scripting enabled. The Lua integration is the smoothest path. If your IT department disables remote scripting, you lose the main automation benefit.

Avoid AutoSubs if:

You need real-time transcription during recording. The app processes files after recording completes. No streaming support.
You publish to YouTube, Vimeo, or other platforms with native caption APIs. Cloud services integrate directly with platform APIs. AutoSubs requires manual upload of SRT files.
Your workflow requires two-way sync. You cannot edit subtitles in the timeline and push changes back to AutoSubs. It’s a one-way bridge from audio to timeline.
You need word-level timestamps or confidence scores. The app outputs segment-level timestamps only. Cloud APIs provide per-word timing and confidence metadata.
Your hardware is a laptop without a discrete GPU. A 1-hour video will take 2-3 hours to process on CPU, blocking other work.

The break-even point depends on volume and hardware. If you have a GPU and process more than 15 videos weekly, the cost savings justify the accuracy trade-off. If you need maximum accuracy or work with cloud platforms, pay for cloud APIs.

The architecture is a useful reference for building local AI agents that integrate with desktop applications. The key insight: sometimes the best API is no API. Writing directly to application data structures (Lua scripts, JSON files, CEP extensions) can be faster and more reliable than waiting for official API support.

Source Links

Primary source: AutoSubs on GitHub
DaVinci Resolve scripting docs: Blackmagic Resolve API
Adobe CEP resources: CEP Cookbook