TL;DR
Speech-to-text has crossed the accuracy threshold — all three providers deliver excellent transcription. The differences are in streaming capability, audio intelligence features, and price. Deepgram Nova-2 is the real-time specialist — streaming WebSocket API with sub-300ms latency, speaker diarization, smart formatting, and a generous free tier; it's the default for voice apps and live transcription. OpenAI Whisper API is the simplest path to accurate transcription — one endpoint, excellent accuracy across 100 languages, affordable pricing, but batch-only (no real-time streaming). AssemblyAI is the audio intelligence platform — transcription plus sentiment analysis, entity detection, content safety, PII redaction, auto-chapters, and more; it's the choice when you need insights from audio, not just words. For real-time voice apps: Deepgram. For batch transcription of recorded audio: Whisper API. For audio that needs analysis beyond transcription: AssemblyAI.
Key Takeaways
- Deepgram streams results in < 300ms — suitable for real-time captions and voice UX
- OpenAI Whisper supports 100 languages — best language coverage in the comparison
- AssemblyAI's LeMUR — LLM-powered analysis of transcripts (summarization, Q&A)
- Deepgram free tier: 12,000 minutes/year — enough for substantial development
- Whisper API pricing: $0.006/minute — flat, predictable batch pricing
- AssemblyAI includes PII redaction — removes SSN, credit cards, phone numbers from transcripts
- Deepgram supports custom models — fine-tune on domain-specific vocabulary
Use Cases and the Right Provider
Voice app (real-time captions, dictation) → Deepgram (streaming)
Meeting transcription (Zoom, Meet, Teams) → Deepgram or AssemblyAI (diarization)
Podcast transcription (batch) → Whisper API or AssemblyAI
Audio content analysis / insights → AssemblyAI (LeMUR, sentiment, topics)
Multi-language content (100 languages) → Whisper API
Call center analytics → AssemblyAI (PII, sentiment, topics)
Lowest list rate per minute → AssemblyAI ($0.0037/min)
Deepgram: Real-Time Streaming Transcription
Deepgram's Nova-2 model delivers streaming transcription via WebSocket — results appear as words are spoken, not after the recording ends.
Installation
npm install @deepgram/sdk
Basic Transcription (Pre-Recorded)
import { createClient } from "@deepgram/sdk";
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
// Transcribe a URL
const { result, error } = await deepgram.listen.prerecorded.transcribeUrl(
{ url: "https://example.com/audio.mp3" },
{
model: "nova-2",
smart_format: true, // Punctuation, capitalization, number formatting
diarize: true, // Speaker identification
language: "en",
punctuate: true,
utterances: true, // Segment by speaker
}
);
if (error) throw error;
const transcript = result.results.channels[0].alternatives[0].transcript;
console.log("Transcript:", transcript);
// Speaker-segmented output (with diarize: true, utterances: true)
for (const utterance of result.results.utterances ?? []) {
console.log(`Speaker ${utterance.speaker}: ${utterance.transcript}`);
}
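Diarized output often splits one speaker's turn into several short utterances. A small helper can merge consecutive utterances from the same speaker into readable turns — a sketch, where `Utterance` is a minimal shape (not the SDK's full type):

```typescript
// Minimal utterance shape — a subset of the fields Deepgram returns.
interface Utterance {
  speaker: number;
  transcript: string;
}

interface Turn {
  speaker: number;
  text: string;
}

// Merge consecutive utterances by the same speaker into single turns.
function mergeTurns(utterances: Utterance[]): Turn[] {
  const turns: Turn[] = [];
  for (const u of utterances) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === u.speaker) {
      last.text += " " + u.transcript;
    } else {
      turns.push({ speaker: u.speaker, text: u.transcript });
    }
  }
  return turns;
}
```

This keeps the rendered transcript compact when a speaker's sentence is split across utterance boundaries.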
Transcribe a Local File
import fs from "fs";
const audioBuffer = fs.readFileSync("./audio.mp3");
const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
audioBuffer,
{
model: "nova-2",
smart_format: true,
diarize: true,
mimetype: "audio/mp3",
}
);
if (error) throw error;
const words = result.results.channels[0].alternatives[0].words;
// words: Array of { word, start, end, confidence, speaker }
Real-Time Streaming (WebSocket)
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
async function startLiveTranscription() {
const live = deepgram.listen.live({
model: "nova-2",
language: "en",
smart_format: true,
interim_results: true, // Get partial results as user speaks
endpointing: 300, // Silence threshold to finalize a segment (ms)
});
live.on(LiveTranscriptionEvents.Open, () => {
console.log("WebSocket connected");
});
live.on(LiveTranscriptionEvents.Transcript, (data) => {
const transcript = data.channel.alternatives[0].transcript;
const isFinal = data.is_final;
if (isFinal && transcript) {
console.log("Final:", transcript);
// Send to UI / process
} else if (transcript) {
console.log("Interim:", transcript);
// Update UI preview
}
});
live.on(LiveTranscriptionEvents.Error, console.error);
live.on(LiveTranscriptionEvents.Close, () => console.log("Disconnected"));
return live;
}
// Send audio chunks to the live session
const liveSession = await startLiveTranscription();
// From microphone (browser)
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream);
mediaRecorder.addEventListener("dataavailable", (event) => {
if (event.data.size > 0) {
liveSession.send(event.data);
}
});
mediaRecorder.start(250); // Emit audio chunks every 250ms
// Close when done
liveSession.finish();
Next.js API Route (Real-Time)
// app/api/transcribe/stream/route.ts
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";
import { NextRequest } from "next/server";
export async function GET(req: NextRequest) {
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
// Return Server-Sent Events for transcript updates
const encoder = new TextEncoder();
const stream = new ReadableStream({
start(controller) {
const live = deepgram.listen.live({ model: "nova-2", smart_format: true });
live.on(LiveTranscriptionEvents.Transcript, (data) => {
const transcript = data.channel.alternatives[0].transcript;
if (transcript) {
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ transcript })}\n\n`)
);
}
});
live.on(LiveTranscriptionEvents.Close, () => controller.close());
},
});
return new Response(stream, {
headers: {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
},
});
}
OpenAI Whisper API: Simplest Batch Transcription
OpenAI's Whisper API is the simplest transcription endpoint — upload audio, get text back. Excellent accuracy, 100 languages, no setup complexity.
Installation
npm install openai
Basic Transcription
import OpenAI from "openai";
import fs from "fs";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
// Transcribe a local audio file
async function transcribeFile(filePath: string): Promise<string> {
const transcript = await openai.audio.transcriptions.create({
file: fs.createReadStream(filePath),
model: "whisper-1",
language: "en", // Optional — auto-detects if omitted
response_format: "json", // "json" | "text" | "srt" | "vtt" | "verbose_json"
temperature: 0, // 0 = deterministic output; higher values add sampling randomness
});
return transcript.text;
}
// Get subtitle format directly
async function transcribeToSRT(filePath: string): Promise<string> {
const srt = await openai.audio.transcriptions.create({
file: fs.createReadStream(filePath),
model: "whisper-1",
response_format: "srt",
});
return srt as unknown as string;
}
With Verbose JSON (Word Timestamps)
const transcript = await openai.audio.transcriptions.create({
file: fs.createReadStream("audio.mp3"),
model: "whisper-1",
response_format: "verbose_json",
timestamp_granularities: ["word"], // Word-level timestamps
});
// Access word-level timing
for (const word of transcript.words ?? []) {
console.log(`"${word.word}" — ${word.start}s to ${word.end}s`);
}
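Word-level timestamps make it straightforward to build custom caption cues, for example by grouping words into lines of at most N words. A sketch — the `TimedWord` shape mirrors the fields shown above, and the grouping policy is purely illustrative:

```typescript
interface TimedWord {
  word: string;
  start: number; // seconds
  end: number; // seconds
}

interface Cue {
  text: string;
  start: number;
  end: number;
}

// Group words into caption cues of at most maxWords words each.
function buildCues(words: TimedWord[], maxWords = 7): Cue[] {
  const cues: Cue[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    const chunk = words.slice(i, i + maxWords);
    cues.push({
      text: chunk.map((w) => w.word).join(" "),
      start: chunk[0].start,
      end: chunk[chunk.length - 1].end,
    });
  }
  return cues;
}
```

Production caption pipelines usually also cap cue duration and break on punctuation, but the timing arithmetic is the same.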
Translation (Any Language → English)
// Whisper can translate to English from any of 100 languages
const translation = await openai.audio.translations.create({
file: fs.createReadStream("french-audio.mp3"),
model: "whisper-1",
// No language parameter needed — Whisper auto-detects and translates to English
});
console.log("English translation:", translation.text);
AssemblyAI: Audio Intelligence Platform
AssemblyAI goes beyond transcription — it analyzes audio for insights: sentiment, topics, entities, PII, and lets you query transcripts with LLMs.
Installation
npm install assemblyai
Basic Transcription
import { AssemblyAI } from "assemblyai";
const client = new AssemblyAI({ apiKey: process.env.ASSEMBLYAI_API_KEY! });
// Transcribe from URL
const transcript = await client.transcripts.transcribe({
audio_url: "https://example.com/podcast.mp3",
speaker_labels: true, // Speaker diarization
auto_highlights: true, // Key phrases and topics
sentiment_analysis: true, // Sentiment per sentence
entity_detection: true, // Named entities (persons, places, etc.)
iab_categories: true, // IAB content categorization
});
console.log("Transcript:", transcript.text);
// Speaker-segmented output
for (const utterance of transcript.utterances ?? []) {
console.log(`${utterance.speaker}: ${utterance.text}`);
}
// Sentiment analysis
for (const sentence of transcript.sentiment_analysis_results ?? []) {
console.log(`[${sentence.sentiment}] ${sentence.text}`);
}
PII Redaction
const transcript = await client.transcripts.transcribe({
audio_url: "https://example.com/call-recording.mp3",
redact_pii: true,
redact_pii_audio: true, // Also redact PII from audio file itself
redact_pii_policies: [
"ssn", // Social security numbers
"credit_card_number",
"phone_number",
"email_address",
"person_name",
"date_of_birth",
],
// Redacted transcript: "My SSN is #### and my card is ####..."
});
LeMUR: LLM Queries on Transcripts
// Ask questions about the audio content using LLMs
const response = await client.lemur.task({
transcript_ids: [transcript.id],
prompt: "Provide a concise 3-bullet summary of the key points discussed.",
final_model: "anthropic/claude-3-5-sonnet", // Or openai/gpt-4o
});
console.log("Summary:", response.response);
// Q&A from audio
const qaResponse = await client.lemur.questionAnswer({
transcript_ids: [transcript.id],
questions: [
{
question: "What was the main decision made in this meeting?",
answer_format: "Single sentence",
},
{
question: "What action items were mentioned?",
answer_format: "Bulleted list",
},
],
});
for (const qa of qaResponse.response) {
console.log(`Q: ${qa.question}\nA: ${qa.answer}\n`);
}
Feature Comparison
| Feature | Deepgram | Whisper API | AssemblyAI |
|---|---|---|---|
| Real-time streaming | ✅ < 300ms | ❌ Batch only | ❌ Batch only |
| Pricing per minute | $0.0043 | $0.006 | $0.0037 |
| Language support | 30+ | ✅ 100 languages | 100+ |
| Speaker diarization | ✅ | ❌ | ✅ |
| Sentiment analysis | ❌ | ❌ | ✅ |
| Entity detection | ❌ | ❌ | ✅ |
| PII redaction | ❌ | ❌ | ✅ |
| LLM transcript queries | ❌ | ❌ | ✅ LeMUR |
| Custom models | ✅ | ❌ | ❌ |
| Auto chapters | ❌ | ❌ | ✅ |
| Free tier | ✅ 12k min/year | ❌ | ✅ $50 credit |
| SRT/VTT output | ✅ | ✅ | ✅ |
When to Use Each
Choose Deepgram if:
- Real-time streaming transcription is required (voice UX, live captions)
- Sub-second latency from speech to text is critical
- Speaker identification (diarization) is needed in real-time
- You need domain-specific vocabulary or custom model fine-tuning
Choose OpenAI Whisper API if:
- Batch transcription of pre-recorded audio is the primary use case
- Maximum language coverage (100 languages, automatic detection)
- Simplest possible API — just openai.audio.transcriptions.create()
- Flat, predictable per-minute pricing for high-volume batch work
Choose AssemblyAI if:
- Audio insights beyond transcription: sentiment, topics, entities, PII
- Meeting intelligence — summaries, action items, speaker attribution
- Compliance use cases requiring PII redaction from audio and transcript
- You want to query transcripts with LLMs via LeMUR
Accuracy in Real-World Conditions
Word error rate (WER) benchmarks on clean studio audio are not the right measure for production applications. Real-world audio has background noise, speaker accents, domain-specific vocabulary, and crosstalk. How each provider handles these conditions matters more than clean-audio WER.
Deepgram Nova-2 excels on natural conversational speech — phone calls, customer service recordings, video meetings. The model was trained on diverse real-world audio, not primarily on broadcast speech. Nova-2 handles strong accents, overlapping speech, and noisy environments better than Whisper on conversational material. The smart_format: true option adds punctuation, capitalization, and converts spoken numbers to numerals — critical for meeting transcripts that need to be readable without editing.
OpenAI Whisper was trained on 680,000 hours of multilingual audio, with heavy representation of broadcast and structured speech (lectures, podcasts, documentaries). It produces cleaner punctuation and more standard English output on formal speech. On conversational audio with strong accents or significant background noise, accuracy can drop more than Deepgram. The verbose_json format with word-level timestamps is best-in-class for subtitle generation workflows.
AssemblyAI uses its own model pipeline that balances transcription with analysis. The transcription accuracy is comparable to Deepgram and Whisper on most audio types. The accuracy advantage comes from the post-processing layer: sentiment analysis, entity detection, and PII redaction all have their own accuracy characteristics. AssemblyAI's entity detection model performs well on business context — company names, product names, and proper nouns that generic models often miss.
Pricing at Scale
For applications handling significant audio volume, pricing differences compound quickly.
| Provider | 10 hours/day | 100 hours/day | 1000 hours/day |
|---|---|---|---|
| Deepgram Nova-2 | $2.58/day | $25.80/day | $258/day |
| Whisper API | $3.60/day | $36/day | $360/day |
| AssemblyAI | $2.22/day | $22.20/day | $222/day |
At 1,000 hours per day (roughly 1.8 million minutes per month), the spread between the lowest and highest list rate ($0.0037/min vs $0.006/min) works out to about $138/day, or roughly $50,000 annually. That is enough to justify negotiating enterprise volume pricing, and enough that the per-minute rate should drive provider choice for pure batch workloads.
The Deepgram free tier (12,000 minutes = 200 hours per year) is the most generous for development and testing. Both AssemblyAI and Deepgram offer enterprise pricing for high volumes that can materially improve on these list rates.
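For capacity planning, the arithmetic is simple enough to encode directly: daily audio hours times 60 minutes times the per-minute rate. A sketch, with the list rates cited in this comparison hardcoded (verify against current pricing pages before relying on them):

```typescript
// Per-minute list rates (USD) as cited in this comparison — verify before use.
const RATES_PER_MINUTE = {
  deepgram: 0.0043,
  whisper: 0.006,
  assemblyai: 0.0037,
} as const;

type Provider = keyof typeof RATES_PER_MINUTE;

// Daily transcription cost in USD for a given provider and audio volume.
function dailyCost(provider: Provider, hoursPerDay: number): number {
  return RATES_PER_MINUTE[provider] * hoursPerDay * 60;
}

// Annualized cost difference between two providers at the same volume.
function annualDelta(a: Provider, b: Provider, hoursPerDay: number): number {
  return (dailyCost(a, hoursPerDay) - dailyCost(b, hoursPerDay)) * 365;
}
```

Running `annualDelta` for your projected volume is a quick way to decide whether a pricing gap is material or noise for your workload.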
Production Integration Patterns
Meeting Transcription with Diarization
For a meeting transcription service (Zoom, Meet, Teams recordings), the recommended stack is Deepgram or AssemblyAI with diarize: true (or speaker_labels: true). The output is utterance-segmented by speaker, making it trivially parseable into a structured transcript format.
For AssemblyAI, the auto_chapters: true option automatically segments long meetings into logical sections with summaries — reducing post-processing work significantly for meeting intelligence applications.
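As a sketch of that parsing step, rendering diarized utterances into a timestamped transcript is a few lines. The `SpeakerUtterance` shape here is an assumption — speaker labels as strings and a `start` offset in milliseconds, roughly matching AssemblyAI's output:

```typescript
interface SpeakerUtterance {
  speaker: string; // e.g. "A", "B"
  text: string;
  start: number; // milliseconds from start of audio
}

// Format milliseconds as m:ss for transcript timestamps.
function timestamp(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}:${String(seconds).padStart(2, "0")}`;
}

// Render speaker-labeled utterances as a readable transcript.
function formatTranscript(utterances: SpeakerUtterance[]): string {
  return utterances
    .map((u) => `[${timestamp(u.start)}] Speaker ${u.speaker}: ${u.text}`)
    .join("\n");
}
```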
Real-Time Voice Applications
Deepgram is the only option here. The WebSocket streaming API returns transcript results as words are spoken, enabling real-time caption overlays, voice command processing, and live meeting transcription. AssemblyAI and Whisper both require complete audio before returning results.
For AI voice agents (STT → LLM → TTS), Deepgram pairs naturally with LiveKit's Agents framework, which handles the audio pipeline orchestration.
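On the browser side, a Server-Sent Events stream like the Next.js route shown earlier arrives as `data: {...}` lines. A minimal parser for those events — a sketch that assumes each event is a single `data:` line carrying the `{ transcript }` JSON payload that route emits:

```typescript
// Extract transcript strings from a buffer of raw SSE text.
function parseTranscriptEvents(raw: string): string[] {
  const transcripts: string[] = [];
  for (const line of raw.split("\n")) {
    if (!line.startsWith("data: ")) continue;
    try {
      const payload = JSON.parse(line.slice("data: ".length));
      if (typeof payload.transcript === "string") {
        transcripts.push(payload.transcript);
      }
    } catch {
      // Ignore malformed events (e.g. a partially received chunk).
    }
  }
  return transcripts;
}
```

In a real client you would feed this from an EventSource or a streamed fetch body and append results to the caption UI as they arrive.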
Compliance and Call Center Analytics
AssemblyAI is the only provider with built-in PII redaction from both transcript and audio file. For applications processing customer calls (financial services, healthcare, insurance), the combination of PII redaction, sentiment analysis, and LeMUR querying (ask the LLM what action items were discussed, what the customer's concern was) provides a complete analytics layer without building a custom post-processing pipeline.
Model Updates and API Stability
One operational consideration that's easy to overlook: how do these providers handle model updates? For applications where accuracy is a core product metric, an undocumented model upgrade can shift transcription behavior in ways that break downstream parsing.
Deepgram handles this with explicit model versioning. Pinning to nova-2 gives you a stable model; Deepgram doesn't deprecate models without notice. OpenAI's Whisper API currently exposes whisper-1 as the only model identifier — OpenAI can update the underlying model without incrementing the version name, meaning you may see accuracy changes between API calls over time. AssemblyAI similarly does not expose version pinning for their default model. For applications that need reproducible transcription behavior — legal transcript archives, compliance recordings — Deepgram's explicit version control is a meaningful advantage.
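One low-effort mitigation is to centralize model identifiers in a single config so any model change is an explicit, reviewable diff rather than a scattered string literal. A sketch using the identifiers from this article:

```typescript
// Pin model identifiers in one place so upgrades are deliberate code changes,
// not silent, scattered edits.
const STT_MODELS = {
  deepgram: "nova-2", // Deepgram supports explicit model versioning
  openai: "whisper-1", // OpenAI may update the model behind this name
} as const;

function sttModel(provider: keyof typeof STT_MODELS): string {
  return STT_MODELS[provider];
}
```

This doesn't stop a provider from changing behavior behind a pinned name (the Whisper caveat above), but it does make your side of the version contract auditable.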
Methodology
Data sourced from official Deepgram Nova-2 documentation (developers.deepgram.com), OpenAI Whisper API documentation (platform.openai.com/docs/guides/speech-to-text), AssemblyAI documentation (assemblyai.com/docs), pricing pages as of February 2026, and community benchmarks. Language counts come from each provider's official documentation. Pricing calculations use list rates and assume 60 minutes of billable audio per hour.
Related: ElevenLabs vs OpenAI TTS vs Cartesia for the text-to-speech side of voice AI, or Vercel AI SDK vs OpenAI SDK vs Anthropic SDK for the LLM client that powers audio intelligence workflows.