
Deepgram vs OpenAI Whisper API vs AssemblyAI: STT 2026

PkgPulse Team

TL;DR

Speech-to-text has crossed the accuracy threshold — all three providers deliver excellent transcription. The differences are in streaming capability, audio intelligence features, and price. Deepgram Nova-2 is the real-time specialist — streaming WebSocket API with sub-300ms latency, speaker diarization, smart formatting, and a generous free tier; it's the default for voice apps and live transcription. OpenAI Whisper API is the simplest path to accurate transcription — one endpoint, excellent accuracy across 100 languages, affordable pricing, but batch-only (no real-time streaming). AssemblyAI is the audio intelligence platform — transcription plus sentiment analysis, entity detection, content safety, PII redaction, auto-chapters, and more; it's the choice when you need insights from audio, not just words. For real-time voice apps: Deepgram. For batch transcription of recorded audio: Whisper API. For audio that needs analysis beyond transcription: AssemblyAI.

Key Takeaways

  • Deepgram streams results in < 300ms — suitable for real-time captions and voice UX
  • OpenAI Whisper supports 100 languages — best language coverage in the comparison
  • AssemblyAI's LeMUR — LLM-powered analysis of transcripts (summarize, question-answer)
  • Deepgram free tier: 12,000 minutes/year — enough for substantial development
  • Whisper API pricing: $0.006/minute — cheapest per-minute rate
  • AssemblyAI includes PII redaction — removes SSN, credit cards, phone numbers from transcripts
  • Deepgram supports custom models — fine-tune on domain-specific vocabulary

Use Cases and the Right Provider

Voice app (real-time captions, dictation)    → Deepgram (streaming)
Meeting transcription (Zoom, Meet)           → Deepgram or AssemblyAI (diarization)
Podcast transcription (batch)                → Whisper API or AssemblyAI
Audio content analysis / insights            → AssemblyAI (LeMUR, sentiment, topics)
Multi-language content (100 languages)       → Whisper API
Call center analytics                        → AssemblyAI (PII, sentiment, topics)
Cheapest batch transcription                 → AssemblyAI ($0.0037/min) or Whisper API ($0.006/min)

Deepgram: Real-Time Streaming Transcription

Deepgram's Nova-2 model delivers streaming transcription via WebSocket — results appear as words are spoken, not after the recording ends.

Installation

npm install @deepgram/sdk

Basic Transcription (Pre-Recorded)

import { createClient } from "@deepgram/sdk";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

// Transcribe a URL
const { result, error } = await deepgram.listen.prerecorded.transcribeUrl(
  { url: "https://example.com/audio.mp3" },
  {
    model: "nova-2",
    smart_format: true,        // Punctuation, capitalization, number formatting
    diarize: true,             // Speaker identification
    language: "en",
    punctuate: true,
    utterances: true,          // Segment by speaker
  }
);

if (error) throw error;

const transcript = result.results.channels[0].alternatives[0].transcript;
console.log("Transcript:", transcript);

// Speaker-segmented output (with diarize: true, utterances: true)
for (const utterance of result.results.utterances ?? []) {
  console.log(`Speaker ${utterance.speaker}: ${utterance.transcript}`);
}

Transcribe a Local File

import fs from "fs";

const audioBuffer = fs.readFileSync("./audio.mp3");

const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
  audioBuffer,
  {
    model: "nova-2",
    smart_format: true,
    diarize: true,
    mimetype: "audio/mp3",
  }
);

const words = result.results.channels[0].alternatives[0].words;
// words: Array of { word, start, end, confidence, speaker }
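
The `words` array above can be folded into speaker turns for display. A minimal sketch based on the field shapes listed in the comment (the interfaces and the `groupBySpeaker` helper are ours, not part of the SDK):

```typescript
// Shape of one entry in result.results.channels[0].alternatives[0].words
interface DeepgramWord {
  word: string;
  start: number;      // seconds
  end: number;        // seconds
  confidence: number;
  speaker?: number;   // present when diarize: true
}

interface SpeakerTurn {
  speaker: number;
  start: number;
  end: number;
  text: string;
}

// Merge consecutive words from the same speaker into one turn
function groupBySpeaker(words: DeepgramWord[]): SpeakerTurn[] {
  const turns: SpeakerTurn[] = [];
  for (const w of words) {
    const speaker = w.speaker ?? 0;
    const last = turns[turns.length - 1];
    if (last && last.speaker === speaker) {
      last.text += ` ${w.word}`;
      last.end = w.end;
    } else {
      turns.push({ speaker, start: w.start, end: w.end, text: w.word });
    }
  }
  return turns;
}
```

Each turn then maps cleanly to one "Speaker N: …" line in a UI, with start/end timestamps for seeking.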

Real-Time Streaming (WebSocket)

import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

async function startLiveTranscription() {
  const live = deepgram.listen.live({
    model: "nova-2",
    language: "en",
    smart_format: true,
    interim_results: true,     // Get partial results as user speaks
    endpointing: 300,          // Silence threshold to finalize a segment (ms)
  });

  live.on(LiveTranscriptionEvents.Open, () => {
    console.log("WebSocket connected");
  });

  live.on(LiveTranscriptionEvents.Transcript, (data) => {
    const transcript = data.channel.alternatives[0].transcript;
    const isFinal = data.is_final;

    if (isFinal && transcript) {
      console.log("Final:", transcript);
      // Send to UI / process
    } else if (transcript) {
      console.log("Interim:", transcript);
      // Update UI preview
    }
  });

  live.on(LiveTranscriptionEvents.Error, console.error);
  live.on(LiveTranscriptionEvents.Close, () => console.log("Disconnected"));

  return live;
}

// Send audio chunks to the live session
const liveSession = await startLiveTranscription();

// From microphone (browser): obtain the stream via getUserMedia
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream);
mediaRecorder.addEventListener("dataavailable", (event) => {
  if (event.data.size > 0) {
    liveSession.send(event.data);
  }
});
mediaRecorder.start(250); // Emit audio chunks every 250ms

// Close when done
mediaRecorder.stop();
liveSession.finish();

Next.js API Route (Real-Time)

// app/api/transcribe/stream/route.ts
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";
import { NextRequest } from "next/server";

export async function GET(req: NextRequest) {
  const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

  // Return Server-Sent Events for transcript updates
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    start(controller) {
      const live = deepgram.listen.live({ model: "nova-2", smart_format: true });
      // NOTE: audio must still be sent to `live` (e.g. relayed from the client);
      // this route only streams transcript events back over SSE.

      live.on(LiveTranscriptionEvents.Transcript, (data) => {
        const transcript = data.channel.alternatives[0].transcript;
        if (transcript) {
          controller.enqueue(
            encoder.encode(`data: ${JSON.stringify({ transcript })}\n\n`)
          );
        }
      });

      live.on(LiveTranscriptionEvents.Close, () => controller.close());
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
}
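
On the client, the SSE stream from this route can be consumed with EventSource. A minimal sketch (the parse helper, the `render` callback, and the route path are our assumptions; the payload shape matches the `JSON.stringify({ transcript })` above):

```typescript
// Parse one SSE `data:` payload of the shape emitted by the route above
function parseTranscriptEvent(data: string): string | null {
  try {
    const parsed = JSON.parse(data) as { transcript?: string };
    return parsed.transcript ?? null;
  } catch {
    return null; // Ignore malformed events
  }
}

// Browser usage (assumes the route is mounted at /api/transcribe/stream):
// const source = new EventSource("/api/transcribe/stream");
// source.onmessage = (event) => {
//   const transcript = parseTranscriptEvent(event.data);
//   if (transcript) render(transcript);
// };
```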

OpenAI Whisper API: Simplest Batch Transcription

OpenAI's Whisper API is the simplest transcription endpoint — upload audio, get text back. Excellent accuracy, 100 languages, no setup complexity.

Installation

npm install openai

Basic Transcription

import OpenAI from "openai";
import fs from "fs";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

// Transcribe a local audio file
async function transcribeFile(filePath: string): Promise<string> {
  const transcript = await openai.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-1",
    language: "en",              // Optional — auto-detects if omitted
    response_format: "json",     // "json" | "text" | "srt" | "vtt" | "verbose_json"
    temperature: 0,              // 0 = deterministic; higher values add sampling randomness
  });

  return transcript.text;
}

// Get subtitle format directly
async function transcribeToSRT(filePath: string): Promise<string> {
  const srt = await openai.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-1",
    response_format: "srt",
  });
  // Non-JSON formats return a plain string; the cast works around the SDK's typings
  return srt as unknown as string;
}

With Verbose JSON (Word Timestamps)

const transcript = await openai.audio.transcriptions.create({
  file: fs.createReadStream("audio.mp3"),
  model: "whisper-1",
  response_format: "verbose_json",
  timestamp_granularities: ["word"],  // Word-level timestamps
});

// Access word-level timing
for (const word of transcript.words ?? []) {
  console.log(`"${word.word}" — ${word.start}s to ${word.end}s`);
}
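
Word-level timestamps also let you build caption cues yourself, e.g. with custom line lengths instead of Whisper's built-in srt output. A sketch of the timestamp math (the helper names and the `TimedWord` shape are ours; `verbose_json` reports start/end in seconds):

```typescript
// Convert seconds to an SRT timestamp: HH:MM:SS,mmm
function toSrtTimestamp(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const rem = ms % 1000;
  const pad = (n: number, width = 2) => String(n).padStart(width, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(rem, 3)}`;
}

interface TimedWord { word: string; start: number; end: number; }

// Build one SRT cue from a run of consecutive words
function toSrtCue(index: number, words: TimedWord[]): string {
  const start = toSrtTimestamp(words[0].start);
  const end = toSrtTimestamp(words[words.length - 1].end);
  const text = words.map((w) => w.word).join(" ");
  return `${index}\n${start} --> ${end}\n${text}\n`;
}
```

Chunk the `transcript.words` array into runs of a few words each and number the cues sequentially to get a complete subtitle file.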

Translation (Any Language → English)

// Whisper can translate to English from any of 100 languages
const translation = await openai.audio.translations.create({
  file: fs.createReadStream("french-audio.mp3"),
  model: "whisper-1",
  // No language parameter needed — Whisper auto-detects and translates to English
});

console.log("English translation:", translation.text);

AssemblyAI: Audio Intelligence Platform

AssemblyAI goes beyond transcription — it analyzes audio for insights: sentiment, topics, entities, PII, and lets you query transcripts with LLMs.

Installation

npm install assemblyai

Basic Transcription

import { AssemblyAI } from "assemblyai";

const client = new AssemblyAI({ apiKey: process.env.ASSEMBLYAI_API_KEY! });

// Transcribe from URL
const transcript = await client.transcripts.transcribe({
  audio_url: "https://example.com/podcast.mp3",
  speaker_labels: true,          // Speaker diarization
  auto_highlights: true,         // Key phrases and topics
  sentiment_analysis: true,      // Sentiment per sentence
  entity_detection: true,        // Named entities (persons, places, etc.)
  iab_categories: true,          // IAB content categorization
});

console.log("Transcript:", transcript.text);

// Speaker-segmented output
for (const utterance of transcript.utterances ?? []) {
  console.log(`${utterance.speaker}: ${utterance.text}`);
}

// Sentiment analysis
for (const sentence of transcript.sentiment_analysis_results ?? []) {
  console.log(`[${sentence.sentiment}] ${sentence.text}`);
}
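
The per-sentence sentiment results aggregate naturally into an overall distribution for a call or episode. A minimal sketch (the tally helper is ours; the labels match AssemblyAI's POSITIVE/NEUTRAL/NEGATIVE values):

```typescript
interface SentimentResult {
  text: string;
  sentiment: "POSITIVE" | "NEUTRAL" | "NEGATIVE";
}

// Count sentences per sentiment label
function tallySentiment(results: SentimentResult[]): Record<string, number> {
  const counts: Record<string, number> = { POSITIVE: 0, NEUTRAL: 0, NEGATIVE: 0 };
  for (const r of results) counts[r.sentiment] += 1;
  return counts;
}
```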

PII Redaction

const transcript = await client.transcripts.transcribe({
  audio_url: "https://example.com/call-recording.mp3",
  redact_pii: true,
  redact_pii_audio: true,        // Also redact PII from audio file itself
  redact_pii_policies: [
    "ssn",                       // Social security numbers
    "credit_card_number",
    "phone_number",
    "email_address",
    "person_name",
    "date_of_birth",
  ],
  // Redacted transcript: "My SSN is #### and my card is ####..."
});

LeMUR: LLM Queries on Transcripts

// Ask questions about the audio content using LLMs
const response = await client.lemur.task({
  transcript_ids: [transcript.id],
  prompt: "Provide a concise 3-bullet summary of the key points discussed.",
  final_model: "anthropic/claude-3-5-sonnet",  // Or openai/gpt-4o
});

console.log("Summary:", response.response);

// Q&A from audio
const qaResponse = await client.lemur.questionAnswer({
  transcript_ids: [transcript.id],
  questions: [
    {
      question: "What was the main decision made in this meeting?",
      answer_format: "Single sentence",
    },
    {
      question: "What action items were mentioned?",
      answer_format: "Bulleted list",
    },
  ],
});

for (const qa of qaResponse.response) {
  console.log(`Q: ${qa.question}\nA: ${qa.answer}\n`);
}

Feature Comparison

Feature                   Deepgram           Whisper API        AssemblyAI
Real-time streaming       ✅ < 300ms         ❌ batch only      ❌ batch only
Pricing per minute        $0.0043            $0.006             $0.0037
Language support          30+                ✅ 100 languages   100+
Speaker diarization       ✅                 ❌                 ✅
Sentiment analysis        ❌                 ❌                 ✅
Entity detection          ❌                 ❌                 ✅
PII redaction             ❌                 ❌                 ✅
LLM transcript queries    ❌                 ❌                 ✅ LeMUR
Custom models             ✅                 ❌                 ❌
Auto chapters             ❌                 ❌                 ✅
Free tier                 ✅ 12k min/year    ❌                 ✅ $50 credit
SRT/VTT output            ❌                 ✅                 ✅
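
The per-minute rates in the table translate directly into monthly cost. A quick sanity check using the table's list prices (the helper is ours):

```typescript
// Per-minute list prices from the comparison table (USD)
const PRICE_PER_MINUTE = {
  deepgram: 0.0043,
  whisper: 0.006,
  assemblyai: 0.0037,
} as const;

// Estimated cost for a given volume of audio minutes, rounded to cents
function monthlyCost(provider: keyof typeof PRICE_PER_MINUTE, minutes: number): number {
  return +(PRICE_PER_MINUTE[provider] * minutes).toFixed(2);
}
```

At 10,000 minutes per month, that works out to $43 for Deepgram, $60 for Whisper, and $37 for AssemblyAI, so base transcription price rarely decides the choice; streaming and intelligence features do.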

When to Use Each

Choose Deepgram if:

  • Real-time streaming transcription is required (voice UX, live captions)
  • Sub-second latency from speech to text is critical
  • Speaker identification (diarization) is needed in real-time
  • You need domain-specific vocabulary or custom model fine-tuning

Choose OpenAI Whisper API if:

  • Batch transcription of pre-recorded audio is the primary use case
  • Maximum language coverage (100 languages, automatic detection)
  • Simplest possible API — just openai.audio.transcriptions.create()
  • Most affordable per-minute rate for high-volume batch work

Choose AssemblyAI if:

  • Audio insights beyond transcription: sentiment, topics, entities, PII
  • Meeting intelligence — summaries, action items, speaker attribution
  • Compliance use cases requiring PII redaction from audio and transcript
  • You want to query transcripts with LLMs via LeMUR

Methodology

Data sourced from official Deepgram Nova-2 documentation (developers.deepgram.com), OpenAI Whisper API documentation (platform.openai.com/docs/guides/speech-to-text), AssemblyAI documentation (assemblyai.com/docs), pricing pages as of February 2026, and community benchmarks shared by AI builders. Language counts are from each provider's official documentation.


Related: ElevenLabs vs OpenAI TTS vs Cartesia for the text-to-speech side of voice AI, or Vercel AI SDK vs OpenAI SDK vs Anthropic SDK for the LLM client that powers audio intelligence workflows.
