TL;DR
Speech-to-text has crossed the accuracy threshold — all three providers deliver excellent transcription. The differences are in streaming capability, audio intelligence features, and price. Deepgram Nova-2 is the real-time specialist — streaming WebSocket API with sub-300ms latency, speaker diarization, smart formatting, and a generous free tier; it's the default for voice apps and live transcription. OpenAI Whisper API is the simplest path to accurate transcription — one endpoint, excellent accuracy across 100 languages, affordable pricing, but batch-only (no real-time streaming). AssemblyAI is the audio intelligence platform — transcription plus sentiment analysis, entity detection, content safety, PII redaction, auto-chapters, and more; it's the choice when you need insights from audio, not just words. For real-time voice apps: Deepgram. For batch transcription of recorded audio: Whisper API. For audio that needs analysis beyond transcription: AssemblyAI.
Key Takeaways
- Deepgram streams results in < 300ms — suitable for real-time captions and voice UX
- OpenAI Whisper supports 100 languages — best language coverage in the comparison
- AssemblyAI's LeMUR — LLM-powered analysis of transcripts (summarization, Q&A)
- Deepgram free tier: 12,000 minutes/year — enough for substantial development
- Whisper API pricing: $0.006/minute — flat, predictable batch pricing
- AssemblyAI includes PII redaction — removes SSN, credit cards, phone numbers from transcripts
- Deepgram supports custom models — fine-tune on domain-specific vocabulary
Use Cases and the Right Provider
Voice app (real-time captions, dictation) → Deepgram (streaming)
Meeting transcription (Zoom, Meet, Teams) → Deepgram or AssemblyAI (diarization)
Podcast transcription (batch) → Whisper API or AssemblyAI
Audio content analysis / insights → AssemblyAI (LeMUR, sentiment, topics)
Multi-language content (100 languages) → Whisper API
Call center analytics → AssemblyAI (PII, sentiment, topics)
Lowest list rate per minute → AssemblyAI ($0.0037/min)
Deepgram: Real-Time Streaming Transcription
Deepgram's Nova-2 model delivers streaming transcription via WebSocket — results appear as words are spoken, not after the recording ends.
Installation
npm install @deepgram/sdk
Basic Transcription (Pre-Recorded)
import { createClient } from "@deepgram/sdk";
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
// Transcribe a URL
const { result, error } = await deepgram.listen.prerecorded.transcribeUrl(
{ url: "https://example.com/audio.mp3" },
{
model: "nova-2",
smart_format: true, // Punctuation, capitalization, number formatting
diarize: true, // Speaker identification
language: "en",
punctuate: true,
utterances: true, // Segment by speaker
}
);
if (error) throw error;
const transcript = result.results.channels[0].alternatives[0].transcript;
console.log("Transcript:", transcript);
// Speaker-segmented output (with diarize: true, utterances: true)
for (const utterance of result.results.utterances ?? []) {
console.log(`Speaker ${utterance.speaker}: ${utterance.transcript}`);
}
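Diarized output often splits one speaker's turn into several short utterances. A small helper can merge consecutive utterances from the same speaker into readable turns — a sketch, where `Utterance` is a minimal shape (not the SDK's full type):

```typescript
// Minimal utterance shape — a subset of the fields Deepgram returns.
interface Utterance {
  speaker: number;
  transcript: string;
}

interface Turn {
  speaker: number;
  text: string;
}

// Merge consecutive utterances by the same speaker into single turns.
function mergeTurns(utterances: Utterance[]): Turn[] {
  const turns: Turn[] = [];
  for (const u of utterances) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === u.speaker) {
      last.text += " " + u.transcript;
    } else {
      turns.push({ speaker: u.speaker, text: u.transcript });
    }
  }
  return turns;
}
```

This keeps the rendered transcript compact when a speaker's sentence is split across utterance boundaries.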
Transcribe a Local File
import fs from "fs";
const audioBuffer = fs.readFileSync("./audio.mp3");
const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
audioBuffer,
{
model: "nova-2",
smart_format: true,
diarize: true,
mimetype: "audio/mp3",
}
);
if (error) throw error;
const words = result.results.channels[0].alternatives[0].words;
// words: Array of { word, start, end, confidence, speaker }
Real-Time Streaming (WebSocket)
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
async function startLiveTranscription() {
const live = deepgram.listen.live({
model: "nova-2",
language: "en",
smart_format: true,
interim_results: true, // Get partial results as user speaks
endpointing: 300, // Silence threshold to finalize a segment (ms)
});
live.on(LiveTranscriptionEvents.Open, () => {
console.log("WebSocket connected");
});
live.on(LiveTranscriptionEvents.Transcript, (data) => {
const transcript = data.channel.alternatives[0].transcript;
const isFinal = data.is_final;
if (isFinal && transcript) {
console.log("Final:", transcript);
// Send to UI / process
} else if (transcript) {
console.log("Interim:", transcript);
// Update UI preview
}
});
live.on(LiveTranscriptionEvents.Error, console.error);
live.on(LiveTranscriptionEvents.Close, () => console.log("Disconnected"));
return live;
}
// Send audio chunks to the live session
const liveSession = await startLiveTranscription();
// From microphone (browser)
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream);
mediaRecorder.addEventListener("dataavailable", (event) => {
if (event.data.size > 0) {
liveSession.send(event.data);
}
});
mediaRecorder.start(250); // Emit audio chunks every 250ms
// Close when done
liveSession.finish();
Next.js API Route (Real-Time)
// app/api/transcribe/stream/route.ts
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";
import { NextRequest } from "next/server";
export async function GET(req: NextRequest) {
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
// Return Server-Sent Events for transcript updates
const encoder = new TextEncoder();
const stream = new ReadableStream({
start(controller) {
const live = deepgram.listen.live({ model: "nova-2", smart_format: true });
live.on(LiveTranscriptionEvents.Transcript, (data) => {
const transcript = data.channel.alternatives[0].transcript;
if (transcript) {
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ transcript })}\n\n`)
);
}
});
live.on(LiveTranscriptionEvents.Close, () => controller.close());
},
});
return new Response(stream, {
headers: {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
},
});
}
OpenAI Whisper API: Simplest Batch Transcription
OpenAI's Whisper API is the simplest transcription endpoint — upload audio, get text back. Excellent accuracy, 100 languages, no setup complexity.
Installation
npm install openai
Basic Transcription
import OpenAI from "openai";
import fs from "fs";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
// Transcribe a local audio file
async function transcribeFile(filePath: string): Promise<string> {
const transcript = await openai.audio.transcriptions.create({
file: fs.createReadStream(filePath),
model: "whisper-1",
language: "en", // Optional — auto-detects if omitted
response_format: "json", // "json" | "text" | "srt" | "vtt" | "verbose_json"
temperature: 0, // 0 = deterministic output; higher values add sampling randomness
});
return transcript.text;
}
// Get subtitle format directly
async function transcribeToSRT(filePath: string): Promise<string> {
const srt = await openai.audio.transcriptions.create({
file: fs.createReadStream(filePath),
model: "whisper-1",
response_format: "srt",
});
return srt as unknown as string;
}
With Verbose JSON (Word Timestamps)
const transcript = await openai.audio.transcriptions.create({
file: fs.createReadStream("audio.mp3"),
model: "whisper-1",
response_format: "verbose_json",
timestamp_granularities: ["word"], // Word-level timestamps
});
// Access word-level timing
for (const word of transcript.words ?? []) {
console.log(`"${word.word}" — ${word.start}s to ${word.end}s`);
}
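Word-level timestamps make it straightforward to build custom caption cues, for example by grouping words into lines of at most N words. A sketch — the `TimedWord` shape mirrors the fields shown above, and the grouping policy is purely illustrative:

```typescript
interface TimedWord {
  word: string;
  start: number; // seconds
  end: number; // seconds
}

interface Cue {
  text: string;
  start: number;
  end: number;
}

// Group words into caption cues of at most maxWords words each.
function buildCues(words: TimedWord[], maxWords = 7): Cue[] {
  const cues: Cue[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    const chunk = words.slice(i, i + maxWords);
    cues.push({
      text: chunk.map((w) => w.word).join(" "),
      start: chunk[0].start,
      end: chunk[chunk.length - 1].end,
    });
  }
  return cues;
}
```

Production caption pipelines usually also cap cue duration and break on punctuation, but the timing arithmetic is the same.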
Translation (Any Language → English)
// Whisper can translate to English from any of 100 languages
const translation = await openai.audio.translations.create({
file: fs.createReadStream("french-audio.mp3"),
model: "whisper-1",
// No language parameter needed — Whisper auto-detects and translates to English
});
console.log("English translation:", translation.text);
AssemblyAI: Audio Intelligence Platform
AssemblyAI goes beyond transcription — it analyzes audio for insights: sentiment, topics, entities, PII, and lets you query transcripts with LLMs.
Installation
npm install assemblyai
Basic Transcription
import { AssemblyAI } from "assemblyai";
const client = new AssemblyAI({ apiKey: process.env.ASSEMBLYAI_API_KEY! });
// Transcribe from URL
const transcript = await client.transcripts.transcribe({
audio_url: "https://example.com/podcast.mp3",
speaker_labels: true, // Speaker diarization
auto_highlights: true, // Key phrases and topics
sentiment_analysis: true, // Sentiment per sentence
entity_detection: true, // Named entities (persons, places, etc.)
iab_categories: true, // IAB content categorization
});
console.log("Transcript:", transcript.text);
// Speaker-segmented output
for (const utterance of transcript.utterances ?? []) {
console.log(`${utterance.speaker}: ${utterance.text}`);
}
// Sentiment analysis
for (const sentence of transcript.sentiment_analysis_results ?? []) {
console.log(`[${sentence.sentiment}] ${sentence.text}`);
}
PII Redaction
const transcript = await client.transcripts.transcribe({
audio_url: "https://example.com/call-recording.mp3",
redact_pii: true,
redact_pii_audio: true, // Also redact PII from audio file itself
redact_pii_policies: [
"ssn", // Social security numbers
"credit_card_number",
"phone_number",
"email_address",
"person_name",
"date_of_birth",
],
// Redacted transcript: "My SSN is #### and my card is ####..."
});
LeMUR: LLM Queries on Transcripts
// Ask questions about the audio content using LLMs
const response = await client.lemur.task({
transcript_ids: [transcript.id],
prompt: "Provide a concise 3-bullet summary of the key points discussed.",
final_model: "anthropic/claude-3-5-sonnet", // Or openai/gpt-4o
});
console.log("Summary:", response.response);
// Q&A from audio
const qaResponse = await client.lemur.questionAnswer({
transcript_ids: [transcript.id],
questions: [
{
question: "What was the main decision made in this meeting?",
answer_format: "Single sentence",
},
{
question: "What action items were mentioned?",
answer_format: "Bulleted list",
},
],
});
for (const qa of qaResponse.response) {
console.log(`Q: ${qa.question}\nA: ${qa.answer}\n`);
}
Feature Comparison
| Feature | Deepgram | Whisper API | AssemblyAI |
|---|---|---|---|
| Real-time streaming | ✅ < 300ms | ❌ Batch only | ❌ Batch only |
| Pricing per minute | $0.0043 | $0.006 | $0.0037 |
| Language support | 30+ | ✅ 100 languages | 100+ |
| Speaker diarization | ✅ | ❌ | ✅ |
| Sentiment analysis | ❌ | ❌ | ✅ |
| Entity detection | ❌ | ❌ | ✅ |
| PII redaction | ❌ | ❌ | ✅ |
| LLM transcript queries | ❌ | ❌ | ✅ LeMUR |
| Custom models | ✅ | ❌ | ❌ |
| Auto chapters | ❌ | ❌ | ✅ |
| Free tier | ✅ 12k min/year | ❌ | ✅ $50 credit |
| SRT/VTT output | ✅ | ✅ | ✅ |
When to Use Each
Choose Deepgram if:
- Real-time streaming transcription is required (voice UX, live captions)
- Sub-second latency from speech to text is critical
- Speaker identification (diarization) is needed in real-time
- You need domain-specific vocabulary or custom model fine-tuning
Choose OpenAI Whisper API if:
- Batch transcription of pre-recorded audio is the primary use case
- Maximum language coverage (100 languages, automatic detection)
- Simplest possible API — just openai.audio.transcriptions.create()
- Flat, predictable per-minute pricing for high-volume batch work
Choose AssemblyAI if:
- Audio insights beyond transcription: sentiment, topics, entities, PII
- Meeting intelligence — summaries, action items, speaker attribution
- Compliance use cases requiring PII redaction from audio and transcript
- You want to query transcripts with LLMs via LeMUR
Accuracy in Real-World Conditions
Word error rate (WER) benchmarks on clean studio audio are not the right measure for production applications. Real-world audio has background noise, speaker accents, domain-specific vocabulary, and crosstalk. How each provider handles these conditions matters more than clean-audio WER.
Deepgram Nova-2 excels on natural conversational speech — phone calls, customer service recordings, video meetings. The model was trained on diverse real-world audio, not primarily on broadcast speech. Nova-2 handles strong accents, overlapping speech, and noisy environments better than Whisper on conversational material. The smart_format: true option adds punctuation, capitalization, and converts spoken numbers to numerals — critical for meeting transcripts that need to be readable without editing.
OpenAI Whisper was trained on 680,000 hours of multilingual audio, with heavy representation of broadcast and structured speech (lectures, podcasts, documentaries). It produces cleaner punctuation and more standard English output on formal speech. On conversational audio with strong accents or significant background noise, accuracy can drop more than Deepgram. The verbose_json format with word-level timestamps is best-in-class for subtitle generation workflows.
AssemblyAI uses its own model pipeline that balances transcription with analysis. The transcription accuracy is comparable to Deepgram and Whisper on most audio types. The accuracy advantage comes from the post-processing layer: sentiment analysis, entity detection, and PII redaction all have their own accuracy characteristics. AssemblyAI's entity detection model performs well on business context — company names, product names, and proper nouns that generic models often miss.
Pricing at Scale
For applications handling significant audio volume, pricing differences compound quickly.
| Provider | 10 hours/day | 100 hours/day | 1000 hours/day |
|---|---|---|---|
| Deepgram Nova-2 | $2.58/day | $25.80/day | $258/day |
| Whisper API | $3.60/day | $36/day | $360/day |
| AssemblyAI | $2.22/day | $22.20/day | $222/day |
At 1,000 hours per day (roughly 1.8 million minutes per month), the spread between the lowest and highest list rate ($0.0037/min vs $0.006/min) works out to about $138/day, or roughly $50,000 annually. That is enough to justify negotiating enterprise volume pricing, and enough that the per-minute rate should drive provider choice for pure batch workloads.
The Deepgram free tier (12,000 minutes = 200 hours per year) is the most generous for development and testing. Both AssemblyAI and Deepgram offer enterprise pricing for high volumes that can materially improve on these list rates.
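For capacity planning, the arithmetic is simple enough to encode directly: daily audio hours times 60 minutes times the per-minute rate. A sketch, with the list rates cited in this comparison hardcoded (verify against current pricing pages before relying on them):

```typescript
// Per-minute list rates (USD) as cited in this comparison — verify before use.
const RATES_PER_MINUTE = {
  deepgram: 0.0043,
  whisper: 0.006,
  assemblyai: 0.0037,
} as const;

type Provider = keyof typeof RATES_PER_MINUTE;

// Daily transcription cost in USD for a given provider and audio volume.
function dailyCost(provider: Provider, hoursPerDay: number): number {
  return RATES_PER_MINUTE[provider] * hoursPerDay * 60;
}

// Annualized cost difference between two providers at the same volume.
function annualDelta(a: Provider, b: Provider, hoursPerDay: number): number {
  return (dailyCost(a, hoursPerDay) - dailyCost(b, hoursPerDay)) * 365;
}
```

Running `annualDelta` for your projected volume is a quick way to decide whether a pricing gap is material or noise for your workload.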
Production Integration Patterns
Meeting Transcription with Diarization
For a meeting transcription service (Zoom, Meet, Teams recordings), the recommended stack is Deepgram or AssemblyAI with diarize: true (or speaker_labels: true). The output is utterance-segmented by speaker, making it trivially parseable into a structured transcript format.
For AssemblyAI, the auto_chapters: true option automatically segments long meetings into logical sections with summaries — reducing post-processing work significantly for meeting intelligence applications.
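As a sketch of that parsing step, rendering diarized utterances into a timestamped transcript is a few lines. The `SpeakerUtterance` shape here is an assumption — speaker labels as strings and a `start` offset in milliseconds, roughly matching AssemblyAI's output:

```typescript
interface SpeakerUtterance {
  speaker: string; // e.g. "A", "B"
  text: string;
  start: number; // milliseconds from start of audio
}

// Format milliseconds as m:ss for transcript timestamps.
function timestamp(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}:${String(seconds).padStart(2, "0")}`;
}

// Render speaker-labeled utterances as a readable transcript.
function formatTranscript(utterances: SpeakerUtterance[]): string {
  return utterances
    .map((u) => `[${timestamp(u.start)}] Speaker ${u.speaker}: ${u.text}`)
    .join("\n");
}
```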
Real-Time Voice Applications
Deepgram is the only option here. The WebSocket streaming API returns transcript results as words are spoken, enabling real-time caption overlays, voice command processing, and live meeting transcription. AssemblyAI and Whisper both require complete audio before returning results.
For AI voice agents (STT → LLM → TTS), Deepgram pairs naturally with LiveKit's Agents framework, which handles the audio pipeline orchestration.
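On the browser side, a Server-Sent Events stream like the Next.js route shown earlier arrives as `data: {...}` lines. A minimal parser for those events — a sketch that assumes each event is a single `data:` line carrying the `{ transcript }` JSON payload that route emits:

```typescript
// Extract transcript strings from a buffer of raw SSE text.
function parseTranscriptEvents(raw: string): string[] {
  const transcripts: string[] = [];
  for (const line of raw.split("\n")) {
    if (!line.startsWith("data: ")) continue;
    try {
      const payload = JSON.parse(line.slice("data: ".length));
      if (typeof payload.transcript === "string") {
        transcripts.push(payload.transcript);
      }
    } catch {
      // Ignore malformed events (e.g. a partially received chunk).
    }
  }
  return transcripts;
}
```

In a real client you would feed this from an EventSource or a streamed fetch body and append results to the caption UI as they arrive.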
Compliance and Call Center Analytics
AssemblyAI is the only provider with built-in PII redaction from both transcript and audio file. For applications processing customer calls (financial services, healthcare, insurance), the combination of PII redaction, sentiment analysis, and LeMUR querying (ask the LLM what action items were discussed, what the customer's concern was) provides a complete analytics layer without building a custom post-processing pipeline.
Model Updates and API Stability
One operational consideration that's easy to overlook: how do these providers handle model updates? For applications where accuracy is a core product metric, an undocumented model upgrade can shift transcription behavior in ways that break downstream parsing.
Deepgram handles this with explicit model versioning. Pinning to nova-2 gives you a stable model; Deepgram doesn't deprecate models without notice. OpenAI's Whisper API currently exposes whisper-1 as the only model identifier — OpenAI can update the underlying model without incrementing the version name, meaning you may see accuracy changes between API calls over time. AssemblyAI similarly does not expose version pinning for their default model. For applications that need reproducible transcription behavior — legal transcript archives, compliance recordings — Deepgram's explicit version control is a meaningful advantage.
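One low-effort mitigation is to centralize model identifiers in a single config so any model change is an explicit, reviewable diff rather than a scattered string literal. A sketch using the identifiers from this article:

```typescript
// Pin model identifiers in one place so upgrades are deliberate code changes,
// not silent, scattered edits.
const STT_MODELS = {
  deepgram: "nova-2", // Deepgram supports explicit model versioning
  openai: "whisper-1", // OpenAI may update the model behind this name
} as const;

function sttModel(provider: keyof typeof STT_MODELS): string {
  return STT_MODELS[provider];
}
```

This doesn't stop a provider from changing behavior behind a pinned name (the Whisper caveat above), but it does make your side of the version contract auditable.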
Methodology
Data sourced from official Deepgram Nova-2 documentation (developers.deepgram.com), OpenAI Whisper API documentation (platform.openai.com/docs/guides/speech-to-text), AssemblyAI documentation (assemblyai.com/docs), pricing pages as of February 2026, and community benchmarks. Language counts come from each provider's official documentation. Pricing calculations use list rates and assume 60 minutes of billable audio per hour.
Related: ElevenLabs vs OpenAI TTS vs Cartesia for the text-to-speech side of voice AI, or Vercel AI SDK vs OpenAI SDK vs Anthropic SDK for the LLM client that powers audio intelligence workflows.