Guide

Deepgram vs OpenAI Whisper API vs AssemblyAI: STT 2026

Deepgram Nova-2, Whisper API, and AssemblyAI compared for speech-to-text in 2026. Streaming, accuracy, diarization, pricing, and Node.js integration guides.

·PkgPulse Team·

TL;DR

Speech-to-text has crossed the accuracy threshold — all three providers deliver excellent transcription. The differences are in streaming capability, audio intelligence features, and price. Deepgram Nova-2 is the real-time specialist — streaming WebSocket API with sub-300ms latency, speaker diarization, smart formatting, and a generous free tier; it's the default for voice apps and live transcription. OpenAI Whisper API is the simplest path to accurate transcription — one endpoint, excellent accuracy across 100 languages, affordable pricing, but batch-only (no real-time streaming). AssemblyAI is the audio intelligence platform — transcription plus sentiment analysis, entity detection, content safety, PII redaction, auto-chapters, and more; it's the choice when you need insights from audio, not just words. For real-time voice apps: Deepgram. For batch transcription of recorded audio: Whisper API. For audio that needs analysis beyond transcription: AssemblyAI.

Key Takeaways

  • Deepgram streams results in < 300ms — suitable for real-time captions and voice UX
  • OpenAI Whisper supports 100 languages — best language coverage in the comparison
  • AssemblyAI's LeMUR — LLM-powered analysis of transcripts (summarize, question-answer)
  • Deepgram free tier: 12,000 minutes/year — enough for substantial development
  • Whisper API pricing: $0.006/minute — cheapest per-minute rate
  • AssemblyAI includes PII redaction — removes SSN, credit cards, phone numbers from transcripts
  • Deepgram supports custom models — fine-tune on domain-specific vocabulary

Use Cases and the Right Provider

Voice app (real-time captions, dictation)    → Deepgram (streaming)
Meeting transcription (Zoom, Meet)           → Deepgram or AssemblyAI (diarization)
Podcast transcription (batch)                → Whisper API or AssemblyAI
Audio content analysis / insights            → AssemblyAI (LeMUR, sentiment, topics)
Multi-language content (100 languages)       → Whisper API
Call center analytics                        → AssemblyAI (PII, sentiment, topics)
Cheapest batch transcription                 → Whisper API ($0.006/min)

Deepgram: Real-Time Streaming Transcription

Deepgram's Nova-2 model delivers streaming transcription via WebSocket — results appear as words are spoken, not after the recording ends.

Installation

npm install @deepgram/sdk

Basic Transcription (Pre-Recorded)

import { createClient } from "@deepgram/sdk";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

// Transcribe a URL
const { result, error } = await deepgram.listen.prerecorded.transcribeUrl(
  { url: "https://example.com/audio.mp3" },
  {
    model: "nova-2",
    smart_format: true,        // Punctuation, capitalization, number formatting
    diarize: true,             // Speaker identification
    language: "en",
    punctuate: true,
    utterances: true,          // Segment by speaker
  }
);

if (error) throw error;

const transcript = result.results.channels[0].alternatives[0].transcript;
console.log("Transcript:", transcript);

// Speaker-segmented output (with diarize: true, utterances: true)
for (const utterance of result.results.utterances ?? []) {
  console.log(`Speaker ${utterance.speaker}: ${utterance.transcript}`);
}

Transcribe a Local File

import fs from "fs";

const audioBuffer = fs.readFileSync("./audio.mp3");

const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
  audioBuffer,
  {
    model: "nova-2",
    smart_format: true,
    diarize: true,
    mimetype: "audio/mpeg",
  }
);

const words = result.results.channels[0].alternatives[0].words;
// words: Array of { word, start, end, confidence, speaker }

Real-Time Streaming (WebSocket)

import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

async function startLiveTranscription() {
  const live = deepgram.listen.live({
    model: "nova-2",
    language: "en",
    smart_format: true,
    interim_results: true,     // Get partial results as user speaks
    endpointing: 300,          // Silence threshold to finalize a segment (ms)
  });

  live.on(LiveTranscriptionEvents.Open, () => {
    console.log("WebSocket connected");
  });

  live.on(LiveTranscriptionEvents.Transcript, (data) => {
    const transcript = data.channel.alternatives[0].transcript;
    const isFinal = data.is_final;

    if (isFinal && transcript) {
      console.log("Final:", transcript);
      // Send to UI / process
    } else if (transcript) {
      console.log("Interim:", transcript);
      // Update UI preview
    }
  });

  live.on(LiveTranscriptionEvents.Error, console.error);
  live.on(LiveTranscriptionEvents.Close, () => console.log("Disconnected"));

  return live;
}

// Send audio chunks to the live session
const liveSession = await startLiveTranscription();

// From a browser microphone (client-side)
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream);
mediaRecorder.start(250); // Emit an audio chunk every 250ms
mediaRecorder.addEventListener("dataavailable", (event) => {
  if (event.data.size > 0) {
    liveSession.send(event.data);
  }
});

// Close when done
liveSession.finish();

Next.js API Route (Real-Time)

// app/api/transcribe/stream/route.ts
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";
import { NextRequest } from "next/server";

export async function GET(req: NextRequest) {
  const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

  // Return Server-Sent Events for transcript updates
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    start(controller) {
      const live = deepgram.listen.live({ model: "nova-2", smart_format: true });

      live.on(LiveTranscriptionEvents.Transcript, (data) => {
        const transcript = data.channel.alternatives[0].transcript;
        if (transcript) {
          controller.enqueue(
            encoder.encode(`data: ${JSON.stringify({ transcript })}\n\n`)
          );
        }
      });

      // Close the SSE stream when the Deepgram connection ends
      live.on(LiveTranscriptionEvents.Close, () => controller.close());
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
}
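On the client, the SSE stream from a route like the one above can be consumed with a plain `EventSource`. A minimal sketch; the `/api/transcribe/stream` path and `#captions` element are assumptions matching the route above:

```typescript
// Parse one SSE message payload into the transcript string (pure, testable)
function parseTranscriptEvent(data: string): string {
  return (JSON.parse(data) as { transcript: string }).transcript;
}

// Browser wiring (sketch):
// const source = new EventSource("/api/transcribe/stream");
// source.onmessage = (e) => {
//   document.querySelector("#captions")!.textContent +=
//     parseTranscriptEvent(e.data) + " ";
// };
// source.onerror = () => source.close();
```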

OpenAI Whisper API: Simplest Batch Transcription

OpenAI's Whisper API is the simplest transcription endpoint — upload audio, get text back. Excellent accuracy, 100 languages, no setup complexity.

Installation

npm install openai

Basic Transcription

import OpenAI from "openai";
import fs from "fs";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

// Transcribe a local audio file
async function transcribeFile(filePath: string): Promise<string> {
  const transcript = await openai.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-1",
    language: "en",              // Optional — auto-detects if omitted
    response_format: "json",     // "json" | "text" | "srt" | "vtt" | "verbose_json"
    temperature: 0,              // 0 = deterministic decoding; higher values add sampling randomness
  });

  return transcript.text;
}

// Get subtitle format directly
async function transcribeToSRT(filePath: string): Promise<string> {
  const srt = await openai.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-1",
    response_format: "srt",
  });
  return srt as unknown as string;
}

With Verbose JSON (Word Timestamps)

const transcript = await openai.audio.transcriptions.create({
  file: fs.createReadStream("audio.mp3"),
  model: "whisper-1",
  response_format: "verbose_json",
  timestamp_granularities: ["word"],  // Word-level timestamps
});

// Access word-level timing
for (const word of transcript.words ?? []) {
  console.log(`"${word.word}" — ${word.start}s to ${word.end}s`);
}

Translation (Any Language → English)

// Whisper can translate to English from any of 100 languages
const translation = await openai.audio.translations.create({
  file: fs.createReadStream("french-audio.mp3"),
  model: "whisper-1",
  // No language parameter needed — Whisper auto-detects and translates to English
});

console.log("English translation:", translation.text);

AssemblyAI: Audio Intelligence Platform

AssemblyAI goes beyond transcription — it analyzes audio for insights: sentiment, topics, entities, PII, and lets you query transcripts with LLMs.

Installation

npm install assemblyai

Basic Transcription

import { AssemblyAI } from "assemblyai";

const client = new AssemblyAI({ apiKey: process.env.ASSEMBLYAI_API_KEY! });

// Transcribe from URL
const transcript = await client.transcripts.transcribe({
  audio_url: "https://example.com/podcast.mp3",
  speaker_labels: true,          // Speaker diarization
  auto_highlights: true,         // Key phrases and topics
  sentiment_analysis: true,      // Sentiment per sentence
  entity_detection: true,        // Named entities (persons, places, etc.)
  iab_categories: true,          // IAB content categorization
});

console.log("Transcript:", transcript.text);

// Speaker-segmented output
for (const utterance of transcript.utterances ?? []) {
  console.log(`${utterance.speaker}: ${utterance.text}`);
}

// Sentiment analysis
for (const sentence of transcript.sentiment_analysis_results ?? []) {
  console.log(`[${sentence.sentiment}] ${sentence.text}`);
}

PII Redaction

const transcript = await client.transcripts.transcribe({
  audio_url: "https://example.com/call-recording.mp3",
  redact_pii: true,
  redact_pii_audio: true,        // Also redact PII from audio file itself
  redact_pii_policies: [
    "ssn",                       // Social security numbers
    "credit_card_number",
    "phone_number",
    "email_address",
    "person_name",
    "date_of_birth",
  ],
  // Redacted transcript: "My SSN is #### and my card is ####..."
});

LeMUR: LLM Queries on Transcripts

// Ask questions about the audio content using LLMs
const response = await client.lemur.task({
  transcript_ids: [transcript.id],
  prompt: "Provide a concise 3-bullet summary of the key points discussed.",
  final_model: "anthropic/claude-3-5-sonnet",  // Or openai/gpt-4o
});

console.log("Summary:", response.response);

// Q&A from audio
const qaResponse = await client.lemur.questionAnswer({
  transcript_ids: [transcript.id],
  questions: [
    {
      question: "What was the main decision made in this meeting?",
      answer_format: "Single sentence",
    },
    {
      question: "What action items were mentioned?",
      answer_format: "Bulleted list",
    },
  ],
});

for (const qa of qaResponse.response) {
  console.log(`Q: ${qa.question}\nA: ${qa.answer}\n`);
}

Feature Comparison

| Feature | Deepgram | Whisper API | AssemblyAI |
| --- | --- | --- | --- |
| Real-time streaming | ✅ < 300ms | ❌ Batch only | ❌ Batch only |
| Pricing per minute | $0.0043 | $0.006 | $0.0037 |
| Language support | 30+ | ✅ 100 languages | 100+ |
| Speaker diarization | ✅ | ❌ | ✅ |
| Sentiment analysis | ❌ | ❌ | ✅ |
| Entity detection | ❌ | ❌ | ✅ |
| PII redaction | ❌ | ❌ | ✅ |
| LLM transcript queries | ❌ | ❌ | ✅ LeMUR |
| Custom models | ✅ | ❌ | ❌ |
| Auto chapters | ❌ | ❌ | ✅ |
| Free tier | ✅ 12k min/year | ❌ | ✅ $50 credit |
| SRT/VTT output | Via SDK helper | ✅ | ✅ |

When to Use Each

Choose Deepgram if:

  • Real-time streaming transcription is required (voice UX, live captions)
  • Sub-second latency from speech to text is critical
  • Speaker identification (diarization) is needed in real-time
  • You need domain-specific vocabulary or custom model fine-tuning

Choose OpenAI Whisper API if:

  • Batch transcription of pre-recorded audio is the primary use case
  • Maximum language coverage (100 languages, automatic detection)
  • Simplest possible API — just openai.audio.transcriptions.create()
  • Most affordable per-minute rate for high-volume batch work

Choose AssemblyAI if:

  • Audio insights beyond transcription: sentiment, topics, entities, PII
  • Meeting intelligence — summaries, action items, speaker attribution
  • Compliance use cases requiring PII redaction from audio and transcript
  • You want to query transcripts with LLMs via LeMUR

Accuracy in Real-World Conditions

Word error rate (WER) benchmarks on clean studio audio are not the right measure for production applications. Real-world audio has background noise, speaker accents, domain-specific vocabulary, and crosstalk. How each provider handles these conditions matters more than clean-audio WER.

Deepgram Nova-2 excels on natural conversational speech — phone calls, customer service recordings, video meetings. The model was trained on diverse real-world audio, not primarily on broadcast speech. Nova-2 handles strong accents, overlapping speech, and noisy environments better than Whisper on conversational material. The smart_format: true option adds punctuation, capitalization, and converts spoken numbers to numerals — critical for meeting transcripts that need to be readable without editing.

OpenAI Whisper was trained on 680,000 hours of multilingual audio, with heavy representation of broadcast and structured speech (lectures, podcasts, documentaries). It produces cleaner punctuation and more standard English output on formal speech. On conversational audio with strong accents or significant background noise, accuracy can drop more than Deepgram. The verbose_json format with word-level timestamps is best-in-class for subtitle generation workflows.

AssemblyAI uses its own model pipeline that balances transcription with analysis. The transcription accuracy is comparable to Deepgram and Whisper on most audio types. The accuracy advantage comes from the post-processing layer: sentiment analysis, entity detection, and PII redaction all have their own accuracy characteristics. AssemblyAI's entity detection model performs well on business context — company names, product names, and proper nouns that generic models often miss.

Pricing at Scale

For applications handling significant audio volume, pricing differences compound quickly.

| Provider | 10 hours/day | 100 hours/day | 1000 hours/day |
| --- | --- | --- | --- |
| Deepgram Nova-2 | $7.74/day | $77.40/day | $774/day |
| Whisper API | $3.60/day | $36/day | $360/day |
| AssemblyAI | $6.66/day | $66.60/day | $666/day |

At 1000 hours per day (~1.8 million minutes per month), Whisper API saves $306-414/day vs the other providers. That's roughly $110,000-$150,000 annually, enough to justify the batch-only constraint for many use cases.

The Deepgram free tier (12,000 minutes = 200 hours per year) is the most generous for development and testing. Both AssemblyAI and Deepgram offer enterprise pricing for high volumes that can materially improve on these list rates.
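Projections like the table above reduce to minutes × rate. A tiny helper for running your own numbers; the $0.006/min example rate is Whisper's quoted list price, and current pricing pages should be checked before relying on any figure:

```typescript
// Project daily/monthly/annual cost from hours of audio per day and a per-minute rate
function projectCost(hoursPerDay: number, ratePerMinute: number) {
  const daily = hoursPerDay * 60 * ratePerMinute;
  return { daily, monthly: daily * 30, annual: daily * 365 };
}

// Example: 10 hours/day at Whisper's $0.006/min list rate
const { daily, monthly } = projectCost(10, 0.006);
console.log(daily);   // ≈ $3.60/day
console.log(monthly); // ≈ $108/month
```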

Production Integration Patterns

Meeting Transcription with Diarization

For a meeting transcription service (Zoom, Meet, Teams recordings), the recommended stack is Deepgram or AssemblyAI with diarize: true (or speaker_labels: true). The output is utterance-segmented by speaker, making it trivially parseable into a structured transcript format.

For AssemblyAI, the auto_chapters: true option automatically segments long meetings into logical sections with summaries — reducing post-processing work significantly for meeting intelligence applications.
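With auto_chapters enabled, the returned transcript carries a chapters array (headline, summary, millisecond start/end, following AssemblyAI's documented response shape). A sketch of post-processing it into a table of contents, using hypothetical sample data in that shape:

```typescript
// Shape of one auto-chapter entry (subset of AssemblyAI's response fields)
interface Chapter {
  start: number;    // ms
  end: number;      // ms
  headline: string;
  summary: string;
}

// Render chapters as a readable table of contents
function chaptersToToc(chapters: Chapter[]): string {
  return chapters
    .map((c) => `[${(c.start / 60000).toFixed(1)} min] ${c.headline}`)
    .join("\n");
}

// Hypothetical sample shaped like transcript.chapters
const sample: Chapter[] = [
  { start: 0, end: 180000, headline: "Project status review", summary: "..." },
  { start: 180000, end: 420000, headline: "Q3 roadmap decisions", summary: "..." },
];

console.log(chaptersToToc(sample));
// [0.0 min] Project status review
// [3.0 min] Q3 roadmap decisions
```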

Real-Time Voice Applications

Deepgram is the only option here. The WebSocket streaming API returns transcript results as words are spoken, enabling real-time caption overlays, voice command processing, and live meeting transcription. AssemblyAI and Whisper both require complete audio before returning results.

For AI voice agents (STT → LLM → TTS), Deepgram pairs naturally with LiveKit's Agents framework, which handles the audio pipeline orchestration.

Compliance and Call Center Analytics

AssemblyAI is the only provider with built-in PII redaction from both transcript and audio file. For applications processing customer calls (financial services, healthcare, insurance), the combination of PII redaction, sentiment analysis, and LeMUR querying (ask the LLM what action items were discussed, what the customer's concern was) provides a complete analytics layer without building a custom post-processing pipeline.

Model Updates and API Stability

One operational consideration that's easy to overlook: how do these providers handle model updates? For applications where accuracy is a core product metric, an undocumented model upgrade can shift transcription behavior in ways that break downstream parsing.

Deepgram handles this with explicit model versioning. Pinning to nova-2 gives you a stable model; Deepgram doesn't deprecate models without notice. OpenAI's Whisper API currently exposes whisper-1 as the only model identifier — OpenAI can update the underlying model without incrementing the version name, meaning you may see accuracy changes between API calls over time. AssemblyAI similarly does not expose version pinning for their default model. For applications that need reproducible transcription behavior — legal transcript archives, compliance recordings — Deepgram's explicit version control is a meaningful advantage.
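One way to detect silent model drift is to keep a reference transcript for a fixed audio fixture, re-transcribe it periodically, and alert when word error rate against the reference crosses a threshold. A minimal, simplified WER sketch (word-level Levenshtein distance over lowercased whitespace tokens; the fixture strings are hypothetical):

```typescript
// Word error rate between a reference transcript and a new transcription
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);

  // Levenshtein distance over words: d[i][j] = edits to turn ref[0..i) into hyp[0..j)
  const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
    }
  }
  return d[ref.length][hyp.length] / ref.length;
}

// Alert if a re-run drifts more than 5% from the stored reference
const drifted = wordErrorRate("the quick brown fox", "the quick brown fox") > 0.05;
console.log(drifted); // false: identical transcripts have WER 0
```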

Methodology

Data sourced from official Deepgram Nova-2 documentation (developers.deepgram.com), OpenAI Whisper API documentation (platform.openai.com/docs/guides/speech-to-text), AssemblyAI documentation (assemblyai.com/docs), pricing pages as of February 2026, and community benchmarks from the AI builders community. Language counts from each provider's official documentation. Pricing calculations based on list rates at 60 minutes per hour of audio.


Related: ElevenLabs vs OpenAI TTS vs Cartesia for the text-to-speech side of voice AI, or Vercel AI SDK vs OpenAI SDK vs Anthropic SDK for the LLM client that powers audio intelligence workflows.

See also: Sass vs Tailwind CSS and Mastra vs LangChain.js vs Google GenKit
