Deepgram vs OpenAI Whisper API vs AssemblyAI: STT 2026
TL;DR
Speech-to-text has crossed the accuracy threshold — all three providers deliver excellent transcription. The differences are in streaming capability, audio intelligence features, and price. Deepgram Nova-2 is the real-time specialist — streaming WebSocket API with sub-300ms latency, speaker diarization, smart formatting, and a generous free tier; it's the default for voice apps and live transcription. OpenAI Whisper API is the simplest path to accurate transcription — one endpoint, excellent accuracy across 100 languages, affordable pricing, but batch-only (no real-time streaming). AssemblyAI is the audio intelligence platform — transcription plus sentiment analysis, entity detection, content safety, PII redaction, auto-chapters, and more; it's the choice when you need insights from audio, not just words. For real-time voice apps: Deepgram. For batch transcription of recorded audio: Whisper API. For audio that needs analysis beyond transcription: AssemblyAI.
Key Takeaways
- Deepgram streams results in < 300ms — suitable for real-time captions and voice UX
- OpenAI Whisper supports 100 languages — best language coverage in the comparison
- AssemblyAI's LeMUR — LLM-powered analysis of transcripts (summarize, question-answer)
- Deepgram free tier: 12,000 minutes/year — enough for substantial development
- Whisper API pricing: $0.006/minute — cheapest per-minute rate
- AssemblyAI includes PII redaction — removes SSN, credit cards, phone numbers from transcripts
- Deepgram supports custom models — fine-tune on domain-specific vocabulary
Use Cases and the Right Provider
Voice app (real-time captions, dictation) → Deepgram (streaming)
Meeting transcription (Zoom, Meets) → Deepgram or AssemblyAI (diarization)
Podcast transcription (batch) → Whisper API or AssemblyAI
Audio content analysis / insights → AssemblyAI (LeMUR, sentiment, topics)
Multi-language content (100 languages) → Whisper API
Call center analytics → AssemblyAI (PII, sentiment, topics)
Cheapest batch transcription → Whisper API ($0.006/min)
Deepgram: Real-Time Streaming Transcription
Deepgram's Nova-2 model delivers streaming transcription via WebSocket — results appear as words are spoken, not after the recording ends.
Installation
npm install @deepgram/sdk
Basic Transcription (Pre-Recorded)
import { createClient } from "@deepgram/sdk";
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
// Transcribe a URL
const { result, error } = await deepgram.listen.prerecorded.transcribeUrl(
{ url: "https://example.com/audio.mp3" },
{
model: "nova-2",
smart_format: true, // Punctuation, capitalization, number formatting
diarize: true, // Speaker identification
language: "en",
punctuate: true,
utterances: true, // Segment by speaker
}
);
if (error) throw error;
const transcript = result.results.channels[0].alternatives[0].transcript;
console.log("Transcript:", transcript);
// Speaker-segmented output (with diarize: true, utterances: true)
for (const utterance of result.results.utterances ?? []) {
console.log(`Speaker ${utterance.speaker}: ${utterance.transcript}`);
}
Transcribe a Local File
import fs from "fs";
const audioBuffer = fs.readFileSync("./audio.mp3");
const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
audioBuffer,
{
model: "nova-2",
smart_format: true,
diarize: true,
mimetype: "audio/mp3",
}
);
if (error) throw error;
const words = result.results.channels[0].alternatives[0].words;
// words: Array of { word, start, end, confidence, speaker }
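Those word-level timestamps are enough to generate caption files yourself. A minimal sketch that groups words into SRT cues (the Word shape is simplified from Deepgram's response; the cue size and helper names are our own, not SDK APIs):

```typescript
// Simplified shape of one word entry from Deepgram's response
interface Word {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Group words into fixed-size caption cues
function wordsToSrt(words: Word[], wordsPerCue = 8): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const chunk = words.slice(i, i + wordsPerCue);
    const text = chunk.map((w) => w.word).join(" ");
    const range = `${srtTime(chunk[0].start)} --> ${srtTime(chunk[chunk.length - 1].end)}`;
    cues.push(`${cues.length + 1}\n${range}\n${text}`);
  }
  return cues.join("\n\n");
}
```

Splitting on a fixed word count keeps the sketch short; a production version would also break cues on silence gaps or punctuation.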
Real-Time Streaming (WebSocket)
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
async function startLiveTranscription() {
const live = deepgram.listen.live({
model: "nova-2",
language: "en",
smart_format: true,
interim_results: true, // Get partial results as user speaks
endpointing: 300, // Silence threshold to finalize a segment (ms)
});
live.on(LiveTranscriptionEvents.Open, () => {
console.log("WebSocket connected");
});
live.on(LiveTranscriptionEvents.Transcript, (data) => {
const transcript = data.channel.alternatives[0].transcript;
const isFinal = data.is_final;
if (isFinal && transcript) {
console.log("Final:", transcript);
// Send to UI / process
} else if (transcript) {
console.log("Interim:", transcript);
// Update UI preview
}
});
live.on(LiveTranscriptionEvents.Error, console.error);
live.on(LiveTranscriptionEvents.Close, () => console.log("Disconnected"));
return live;
}
// Send audio chunks to the live session
const liveSession = await startLiveTranscription();
// From microphone (browser); obtain stream via navigator.mediaDevices.getUserMedia({ audio: true })
const mediaRecorder = new MediaRecorder(stream);
mediaRecorder.addEventListener("dataavailable", (event) => {
if (event.data.size > 0) {
liveSession.send(event.data);
}
});
// Close when done
liveSession.finish();
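Live WebSocket sessions drop on network blips, so production voice apps usually wrap the connect call in retry logic. A minimal exponential-backoff sketch (the reconnect policy is our own, not part of the Deepgram SDK):

```typescript
// Retry an async connect function with exponential backoff
async function connectWithBackoff<T>(
  connect: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await connect();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // Give up after maxAttempts failures
      const delay = baseDelayMs * 2 ** attempt;  // 500ms, 1s, 2s, 4s...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: const live = await connectWithBackoff(() => startLiveTranscription());
```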
Next.js API Route (Real-Time)
// app/api/transcribe/stream/route.ts
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";
import { NextRequest } from "next/server";
export async function GET(req: NextRequest) {
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
// Return Server-Sent Events for transcript updates
const encoder = new TextEncoder();
let live: ReturnType<typeof deepgram.listen.live> | undefined;
const stream = new ReadableStream({
  start(controller) {
    live = deepgram.listen.live({ model: "nova-2", smart_format: true });
    live.on(LiveTranscriptionEvents.Transcript, (data) => {
      const transcript = data.channel.alternatives[0].transcript;
      if (transcript) {
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify({ transcript })}\n\n`)
        );
      }
    });
    live.on(LiveTranscriptionEvents.Close, () => controller.close()); // End the SSE stream when Deepgram disconnects
  },
  cancel() {
    live?.finish(); // Release the Deepgram connection when the client disconnects
  },
});
return new Response(stream, {
headers: {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
},
});
}
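On the client side, the SSE route above can be consumed with fetch and a line parser (the endpoint path matches the route file; parseSseData and streamCaptions are hypothetical helper names):

```typescript
// Parse a single SSE line of the form `data: {...}`
function parseSseData(line: string): { transcript: string } | null {
  if (!line.startsWith("data: ")) return null;
  return JSON.parse(line.slice("data: ".length));
}

// Read the event stream and invoke a callback per transcript update
async function streamCaptions(url: string, onTranscript: (text: string) => void) {
  const res = await fetch(url);
  const reader = res.body!.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = "";
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += value;
    const lines = buffer.split("\n");
    buffer = lines.pop()!; // Hold back any partial line until the next chunk
    for (const line of lines) {
      const parsed = parseSseData(line.trim());
      if (parsed) onTranscript(parsed.transcript);
    }
  }
}

// Usage: streamCaptions("/api/transcribe/stream", (t) => console.log(t));
```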
OpenAI Whisper API: Simplest Batch Transcription
OpenAI's Whisper API is the simplest transcription endpoint — upload audio, get text back. Excellent accuracy, 100 languages, no setup complexity.
Installation
npm install openai
Basic Transcription
import OpenAI from "openai";
import fs from "fs";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
// Transcribe a local audio file
async function transcribeFile(filePath: string): Promise<string> {
const transcript = await openai.audio.transcriptions.create({
file: fs.createReadStream(filePath),
model: "whisper-1",
language: "en", // Optional — auto-detects if omitted
response_format: "json", // "json" | "text" | "srt" | "vtt" | "verbose_json"
temperature: 0, // 0 = deterministic; higher values add sampling randomness
});
return transcript.text;
}
// Get subtitle format directly
async function transcribeToSRT(filePath: string): Promise<string> {
const srt = await openai.audio.transcriptions.create({
file: fs.createReadStream(filePath),
model: "whisper-1",
response_format: "srt",
});
return srt as unknown as string; // With "srt" the API returns plain text, so cast past the SDK's JSON typing
}
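One operational constraint worth guarding against: the Whisper API rejects uploads over 25 MB, so long recordings must be split (e.g. with ffmpeg) before transcription. A small pre-flight check (the helper name is ours; the limit is OpenAI's documented cap):

```typescript
import fs from "node:fs";

const WHISPER_MAX_BYTES = 25 * 1024 * 1024; // OpenAI's documented per-request upload cap

// Fail fast locally instead of waiting for the API to reject an oversized file
function assertUploadable(filePath: string): void {
  const { size } = fs.statSync(filePath);
  if (size > WHISPER_MAX_BYTES) {
    const mb = (size / 1024 / 1024).toFixed(1);
    throw new Error(`${filePath} is ${mb} MB; split it before uploading to Whisper`);
  }
}
```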
With Verbose JSON (Word Timestamps)
const transcript = await openai.audio.transcriptions.create({
file: fs.createReadStream("audio.mp3"),
model: "whisper-1",
response_format: "verbose_json",
timestamp_granularities: ["word"], // Word-level timestamps
});
// Access word-level timing
for (const word of transcript.words ?? []) {
console.log(`"${word.word}" — ${word.start}s to ${word.end}s`);
}
Translation (Any Language → English)
// Whisper can translate to English from any of 100 languages
const translation = await openai.audio.translations.create({
file: fs.createReadStream("french-audio.mp3"),
model: "whisper-1",
// No language parameter needed — Whisper auto-detects and translates to English
});
console.log("English translation:", translation.text);
AssemblyAI: Audio Intelligence Platform
AssemblyAI goes beyond transcription — it analyzes audio for insights: sentiment, topics, entities, PII, and lets you query transcripts with LLMs.
Installation
npm install assemblyai
Basic Transcription
import { AssemblyAI } from "assemblyai";
const client = new AssemblyAI({ apiKey: process.env.ASSEMBLYAI_API_KEY! });
// Transcribe from URL
const transcript = await client.transcripts.transcribe({
audio_url: "https://example.com/podcast.mp3",
speaker_labels: true, // Speaker diarization
auto_highlights: true, // Key phrases and topics
sentiment_analysis: true, // Sentiment per sentence
entity_detection: true, // Named entities (persons, places, etc.)
iab_categories: true, // IAB content categorization
});
console.log("Transcript:", transcript.text);
// Speaker-segmented output
for (const utterance of transcript.utterances ?? []) {
console.log(`${utterance.speaker}: ${utterance.text}`);
}
// Sentiment analysis
for (const sentence of transcript.sentiment_analysis_results ?? []) {
console.log(`[${sentence.sentiment}] ${sentence.text}`);
}
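Those per-sentence sentiment results aggregate naturally into a distribution, which is often what a dashboard actually needs. A small sketch (the SentimentResult shape mirrors the sentiment_analysis_results entries above; the helper itself is illustrative):

```typescript
// Simplified shape of one AssemblyAI sentiment result
interface SentimentResult {
  sentiment: "POSITIVE" | "NEUTRAL" | "NEGATIVE";
  text: string;
}

// Count how many sentences fall into each sentiment bucket
function sentimentDistribution(results: SentimentResult[]): Record<string, number> {
  const counts: Record<string, number> = { POSITIVE: 0, NEUTRAL: 0, NEGATIVE: 0 };
  for (const r of results) counts[r.sentiment] += 1;
  return counts;
}
```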
PII Redaction
const transcript = await client.transcripts.transcribe({
audio_url: "https://example.com/call-recording.mp3",
redact_pii: true,
redact_pii_audio: true, // Also redact PII from audio file itself
redact_pii_policies: [
"ssn", // Social security numbers
"credit_card_number",
"phone_number",
"email_address",
"person_name",
"date_of_birth",
],
// Redacted transcript: "My SSN is #### and my card is ####..."
});
LeMUR: LLM Queries on Transcripts
// Ask questions about the audio content using LLMs
const response = await client.lemur.task({
transcript_ids: [transcript.id],
prompt: "Provide a concise 3-bullet summary of the key points discussed.",
final_model: "anthropic/claude-3-5-sonnet", // LeMUR also offers other Claude variants
});
console.log("Summary:", response.response);
// Q&A from audio
const qaResponse = await client.lemur.questionAnswer({
transcript_ids: [transcript.id],
questions: [
{
question: "What was the main decision made in this meeting?",
answer_format: "Single sentence",
},
{
question: "What action items were mentioned?",
answer_format: "Bulleted list",
},
],
});
for (const qa of qaResponse.response) {
console.log(`Q: ${qa.question}\nA: ${qa.answer}\n`);
}
Feature Comparison
| Feature | Deepgram | Whisper API | AssemblyAI |
|---|---|---|---|
| Real-time streaming | ✅ < 300ms | ❌ Batch only | ❌ Batch only |
| Pricing per minute | $0.0043 | $0.006 | $0.0037 |
| Language support | 30+ | ✅ 100 languages | 100+ |
| Speaker diarization | ✅ | ❌ | ✅ |
| Sentiment analysis | ❌ | ❌ | ✅ |
| Entity detection | ❌ | ❌ | ✅ |
| PII redaction | ❌ | ❌ | ✅ |
| LLM transcript queries | ❌ | ❌ | ✅ LeMUR |
| Custom models | ✅ | ❌ | ❌ |
| Auto chapters | ❌ | ❌ | ✅ |
| Free tier | ✅ 12k min/year | ❌ | ✅ $50 credit |
| SRT/VTT output | ✅ | ✅ | ✅ |
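The per-minute rates in the table translate into monthly spend like this (rates copied from the table; the volumes are hypothetical):

```typescript
// Per-minute list prices from the comparison table (USD)
const RATE_PER_MINUTE = {
  deepgram: 0.0043,
  whisper: 0.006,
  assemblyai: 0.0037,
} as const;

// Monthly cost in USD for a given number of audio hours
function monthlyCost(
  provider: keyof typeof RATE_PER_MINUTE,
  hoursPerMonth: number
): number {
  return +(RATE_PER_MINUTE[provider] * hoursPerMonth * 60).toFixed(2);
}

// At 100 hours/month: Deepgram $25.80, Whisper $36.00, AssemblyAI $22.20
```

At batch-only volumes the spread is small in absolute terms; the feature columns usually matter more than the rate column.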
When to Use Each
Choose Deepgram if:
- Real-time streaming transcription is required (voice UX, live captions)
- Sub-second latency from speech to text is critical
- Speaker identification (diarization) is needed in real-time
- You need domain-specific vocabulary or custom model fine-tuning
Choose OpenAI Whisper API if:
- Batch transcription of pre-recorded audio is the primary use case
- Maximum language coverage (100 languages, automatic detection)
- Simplest possible API — just openai.audio.transcriptions.create()
- Most affordable per-minute rate for high-volume batch work
Choose AssemblyAI if:
- Audio insights beyond transcription: sentiment, topics, entities, PII
- Meeting intelligence — summaries, action items, speaker attribution
- Compliance use cases requiring PII redaction from audio and transcript
- You want to query transcripts with LLMs via LeMUR
Methodology
Data sourced from official Deepgram Nova-2 documentation (developers.deepgram.com), OpenAI Whisper API documentation (platform.openai.com/docs/guides/speech-to-text), AssemblyAI documentation (assemblyai.com/docs), pricing pages as of February 2026, and community benchmarks. Language counts from each provider's official documentation.
Related: ElevenLabs vs OpenAI TTS vs Cartesia for the text-to-speech side of voice AI, or Vercel AI SDK vs OpenAI SDK vs Anthropic SDK for the LLM client that powers audio intelligence workflows.