ElevenLabs vs OpenAI TTS vs Cartesia: Text-to-Speech APIs 2026
TL;DR
Text-to-speech has crossed the uncanny valley — modern TTS produces voices indistinguishable from humans in most contexts. The providers have diverged on their strengths. ElevenLabs leads on voice quality and features — voice cloning from 1 minute of audio, 30+ languages, emotional range, dubbing, sound effects, and a generous library of pre-built voices; it's the default when audio quality matters most. OpenAI TTS is the developer-friendly choice — six voices, one endpoint, streaming support, and pricing so simple it's hard to optimize ($0.015/1K chars); best when you're already using OpenAI and need good-enough TTS fast. Cartesia is the ultra-low-latency specialist — sub-100ms time-to-first-audio via WebSocket streaming, purpose-built for real-time voice applications, conversational AI, and interactive IVR. For highest quality voice narration: ElevenLabs. For quick integration in an OpenAI stack: OpenAI TTS. For real-time voice AI where latency is the constraint: Cartesia.
Key Takeaways
- ElevenLabs voice cloning — create a custom voice from 1 minute of audio samples
- OpenAI TTS pricing: $0.015/1K characters — cheapest per-character rate of the three
- Cartesia sub-100ms latency — first audio byte before entire text is processed
- ElevenLabs supports 30+ languages — including Arabic, Hindi, Korean, Turkish
- OpenAI has 6 voices — Alloy, Echo, Fable, Onyx, Nova, Shimmer
- Cartesia Sonic model — optimized for conversational, low-latency streaming
- All three support streaming — partial audio output as text is processed
Use Cases by Provider
Long-form narration (audiobooks, podcasts) → ElevenLabs (best voice quality)
Short-form UI text (notifications, captions) → OpenAI TTS (simplest API)
Conversational AI / voice agents → Cartesia (lowest latency)
Voice cloning from existing speaker → ElevenLabs (best cloning)
30+ language support → ElevenLabs
Already on OpenAI stack → OpenAI TTS
Real-time IVR / phone bots → Cartesia
Video dubbing / translation → ElevenLabs Dubbing API
ElevenLabs: Highest Quality Voice AI
ElevenLabs produces the most human-like voices and supports the widest range of use cases — from narration to voice cloning to sound effects.
Installation
npm install elevenlabs
Basic Text-to-Speech
import { ElevenLabsClient } from "elevenlabs";
import fs from "fs";

const client = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY!,
});

// Convert text to speech with a specific voice
async function textToSpeech(text: string, outputPath: string): Promise<void> {
  const audio = await client.textToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
    // Voice ID "JBFqnCBsd6RMkjVDRZzb" = George (pre-built voice)
    text,
    model_id: "eleven_multilingual_v2", // Best quality
    // model_id: "eleven_turbo_v2_5" // Lower latency, slightly lower quality
    voice_settings: {
      stability: 0.5, // 0-1: higher = more consistent voice
      similarity_boost: 0.75, // 0-1: higher = closer to original voice
      style: 0.0, // 0-1: style exaggeration
      use_speaker_boost: true,
    },
    output_format: "mp3_44100_128",
  });
  // audio is an AsyncIterable<Uint8Array>
  const chunks: Uint8Array[] = [];
  for await (const chunk of audio) {
    chunks.push(chunk);
  }
  fs.writeFileSync(outputPath, Buffer.concat(chunks));
}

await textToSpeech("Hello, this is a test of ElevenLabs text to speech.", "output.mp3");
Streaming (Real-Time)
import { ElevenLabsClient } from "elevenlabs";

const client = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY! });

// Stream TTS to HTTP response (Next.js)
// app/api/tts/route.ts
export async function POST(req: Request) {
  const { text } = await req.json();
  const audio = await client.textToSpeech.convertAsStream(
    "JBFqnCBsd6RMkjVDRZzb",
    {
      text,
      model_id: "eleven_turbo_v2_5", // Lower latency for streaming
      output_format: "mp3_44100_128",
    }
  );
  // Pipe the AsyncIterable to a ReadableStream
  const stream = new ReadableStream({
    async start(controller) {
      for await (const chunk of audio) {
        controller.enqueue(chunk);
      }
      controller.close();
    },
  });
  return new Response(stream, {
    headers: { "Content-Type": "audio/mpeg" },
  });
}
Voice Cloning
// Create a custom voice from audio samples
async function cloneVoice(audioFilePaths: string[], voiceName: string) {
  const files = audioFilePaths.map((path) => ({
    audio: fs.createReadStream(path),
    name: `sample_${path}`,
    type: "audio/mp3" as const,
  }));
  const voice = await client.voices.add({
    name: voiceName,
    description: "Custom cloned voice",
    files,
    labels: { accent: "american", age: "middle_aged", gender: "male" },
  });
  console.log("Voice created:", voice.voice_id);
  return voice.voice_id;
}

// Use the cloned voice — pass the new voice ID, not the pre-built one
const voiceId = await cloneVoice(["./samples/sample1.mp3", "./samples/sample2.mp3"], "MyCustomVoice");
const clonedAudio = await client.textToSpeech.convert(voiceId, {
  text: "Using my cloned voice now.",
  model_id: "eleven_multilingual_v2",
});
// Collect and write the chunks exactly as in textToSpeech() above
Sound Effects Generation
// ElevenLabs also generates sound effects (not just voice)
const soundEffect = await client.textToSoundEffects.convert({
  text: "Heavy rain hitting a tin roof, distant thunder rolling",
  duration_seconds: 5,
  prompt_influence: 0.3,
});

const chunks: Uint8Array[] = [];
for await (const chunk of soundEffect) chunks.push(chunk);
fs.writeFileSync("rain.mp3", Buffer.concat(chunks));
OpenAI TTS: Simplest Integration
OpenAI's text-to-speech API is a single endpoint with six voices — no configuration required, just text in and audio out.
Installation
npm install openai
Basic TTS
import OpenAI from "openai";
import fs from "fs";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

// Six voices: alloy, echo, fable, onyx, nova, shimmer
async function textToSpeech(
  text: string,
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer" = "nova"
): Promise<Buffer> {
  const mp3 = await openai.audio.speech.create({
    model: "tts-1", // "tts-1" | "tts-1-hd" (higher quality, slower)
    voice,
    input: text,
    response_format: "mp3", // "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm"
    speed: 1.0, // 0.25 to 4.0
  });
  return Buffer.from(await mp3.arrayBuffer());
}

const audio = await textToSpeech("Welcome to PkgPulse!", "nova");
fs.writeFileSync("welcome.mp3", audio);
Streaming to HTTP Response
// app/api/tts/route.ts — Next.js streaming response
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

export async function POST(req: Request) {
  const { text, voice = "nova" } = await req.json();
  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice,
    input: text,
    response_format: "opus", // Opus streams better than MP3
  });
  // The SDK response exposes a Web-standard body stream we can return directly;
  // Next.js handles chunked transfer automatically — no manual Transfer-Encoding header
  return new Response(response.body, {
    headers: { "Content-Type": "audio/ogg" }, // OpenAI's opus output is Ogg-contained
  });
}
Client-Side Playback
// React component — stream TTS and play in browser
async function speakText(text: string) {
  const response = await fetch("/api/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, voice: "nova" }),
  });
  const arrayBuffer = await response.arrayBuffer();
  const audioContext = new AudioContext();
  const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);
  source.start();
}

// Or with HTMLAudioElement (simpler)
async function speakTextSimple(text: string) {
  const response = await fetch("/api/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  const blob = await response.blob();
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  await audio.play();
}
TTS with Word Timestamps (for Captions)
// OpenAI TTS doesn't natively return word timestamps
// Workaround pipeline: TTS → Whisper STT with verbose_json for alignment
async function generateAudioWithTimestamps(text: string) {
  // Step 1: Generate audio
  const mp3 = await openai.audio.speech.create({
    model: "tts-1",
    voice: "nova",
    input: text,
  });
  const audioBuffer = Buffer.from(await mp3.arrayBuffer());

  // Step 2: Transcribe with word timestamps to get alignment
  const { File } = await import("node:buffer");
  const transcript = await openai.audio.transcriptions.create({
    file: new File([audioBuffer], "audio.mp3", { type: "audio/mp3" }),
    model: "whisper-1",
    response_format: "verbose_json",
    timestamp_granularities: ["word"],
  });

  return {
    audioBuffer,
    words: transcript.words, // Array of { word, start, end }
  };
}
Cartesia: Ultra-Low-Latency TTS
Cartesia's Sonic model is built for real-time voice applications — designed to produce the first audio chunk in under 100ms, enabling natural-feeling conversational AI.
Installation
npm install @cartesia/cartesia-js
Basic TTS (REST)
import Cartesia from "@cartesia/cartesia-js";

const cartesia = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY! });

// Fetch available voices
const voices = await cartesia.voices.list();
console.log(voices.map((v) => ({ id: v.id, name: v.name })));

// Basic synthesis
async function synthesize(text: string): Promise<Buffer> {
  const response = await cartesia.tts.bytes({
    model_id: "sonic-2", // Latest Sonic model
    transcript: text,
    voice: {
      mode: "id",
      id: "a0e99841-438c-4a64-b679-ae501e7d6091", // Barbershop Man (example voice)
    },
    output_format: {
      container: "mp3",
      encoding: "mp3",
      sample_rate: 44100,
    },
    language: "en",
  });
  return Buffer.from(response);
}
WebSocket Streaming (Sub-100ms)
import Cartesia from "@cartesia/cartesia-js";

const cartesia = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY! });

// WebSocket stream — audio starts arriving before full text is processed
async function streamTTS(text: string) {
  const websocket = await cartesia.tts.websocket({
    container: "raw",
    encoding: "pcm_f32le",
    sample_rate: 44100,
  });

  // Send text, receive audio chunks in real-time
  const response = await websocket.send({
    model_id: "sonic-2",
    voice: {
      mode: "id",
      id: "a0e99841-438c-4a64-b679-ae501e7d6091",
    },
    transcript: text,
    language: "en",
    context_id: "session-123", // Maintain voice consistency across messages
    continue: false, // true = more text coming; false = finalize
  });

  // Audio chunks stream as text is synthesized
  const audioChunks: Buffer[] = [];
  for await (const chunk of response) {
    if (chunk.type === "chunk") {
      audioChunks.push(Buffer.from(chunk.data, "base64"));
      // In production: pipe directly to audio output
    } else if (chunk.type === "done") {
      break;
    }
  }
  await websocket.disconnect();
  return Buffer.concat(audioChunks);
}
// Streaming multi-turn conversation
// (streamToAudioOutput is a placeholder for your audio playback sink)
async function conversationalTTS() {
  const websocket = await cartesia.tts.websocket({
    container: "raw",
    encoding: "pcm_f32le",
    sample_rate: 44100,
  });
  const contextId = `conversation-${Date.now()}`;

  // Turn 1
  const turn1 = await websocket.send({
    model_id: "sonic-2",
    voice: { mode: "id", id: "a0e99841-438c-4a64-b679-ae501e7d6091" },
    transcript: "Hello! How can I help you today?",
    context_id: contextId,
    continue: false,
  });
  for await (const chunk of turn1) {
    if (chunk.type === "chunk") streamToAudioOutput(chunk.data);
    if (chunk.type === "done") break;
  }

  // Turn 2 — voice consistency maintained via context_id
  const turn2 = await websocket.send({
    model_id: "sonic-2",
    voice: { mode: "id", id: "a0e99841-438c-4a64-b679-ae501e7d6091" },
    transcript: "I can help you with that right away.",
    context_id: contextId,
    continue: false,
  });
  for await (const chunk of turn2) {
    if (chunk.type === "chunk") streamToAudioOutput(chunk.data);
    if (chunk.type === "done") break;
  }

  await websocket.disconnect();
}
Voice Cloning (Instant)
// Cartesia instant voice cloning from an audio sample
import fs from "fs";

async function cloneVoice(audioFilePath: string, voiceName: string) {
  const audioBlob = new Blob([fs.readFileSync(audioFilePath)], { type: "audio/wav" });
  const voice = await cartesia.voices.clone({
    clip: audioBlob,
    name: voiceName,
    description: "Cloned from audio sample",
    language: "en",
  });
  console.log("Cloned voice ID:", voice.id);
  return voice.id;
}
Node.js Voice Agent Pattern
// Pattern: LLM → Cartesia TTS → real-time audio output
// (playAudioChunk is a placeholder for your audio playback sink)
import Anthropic from "@anthropic-ai/sdk";
import Cartesia from "@cartesia/cartesia-js";

const anthropic = new Anthropic();
const cartesia = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY! });

async function voiceAgent(userMessage: string) {
  // Step 1: Get LLM response
  const llmResponse = await anthropic.messages.create({
    model: "claude-3-5-haiku-20241022",
    max_tokens: 200,
    messages: [{ role: "user", content: userMessage }],
  });
  const text = llmResponse.content[0].type === "text" ? llmResponse.content[0].text : "";

  // Step 2: Stream to Cartesia (sub-100ms first audio)
  const websocket = await cartesia.tts.websocket({
    container: "raw",
    encoding: "pcm_f32le",
    sample_rate: 44100,
  });
  const stream = await websocket.send({
    model_id: "sonic-2",
    voice: { mode: "id", id: "a0e99841-438c-4a64-b679-ae501e7d6091" },
    transcript: text,
    language: "en",
  });
  for await (const chunk of stream) {
    if (chunk.type === "chunk") {
      playAudioChunk(chunk.data); // Play as it arrives
    }
    if (chunk.type === "done") break;
  }
  await websocket.disconnect();
}
Feature Comparison
| Feature | ElevenLabs | OpenAI TTS | Cartesia |
|---|---|---|---|
| Voice quality | ✅ Best | ✅ Good | ✅ Good |
| Time to first audio | ~500ms | ~800ms | ✅ < 100ms |
| Voice cloning | ✅ (1-min sample) | ❌ | ✅ (instant) |
| Pre-built voices | 3,000+ | 6 | 200+ |
| Language support | 30+ | 57 | 10+ |
| Sound effects | ✅ | ❌ | ❌ |
| Video dubbing | ✅ | ❌ | ❌ |
| WebSocket streaming | ✅ | ❌ (HTTP chunked) | ✅ Native |
| Pricing (per 1K chars) | $0.03 (multilingual) | $0.015 | ~$0.065 |
| Free tier | 10K chars/month | ❌ | $5 credit |
| Conversation context | ❌ | ❌ | ✅ context_id |
When to Use Each
Choose ElevenLabs if:
- Audio quality is the top priority (audiobooks, narration, podcasts)
- Voice cloning from existing speaker audio is needed
- 30+ language support or emotional voice range required
- Sound effects generation alongside TTS
- Video dubbing and lip-sync translation
Choose OpenAI TTS if:
- You're already using OpenAI APIs and want minimal integration overhead
- Good-enough quality for UI text, notifications, or basic narration
- Simple pricing: $0.015/1K chars with no tiers or voice add-ons
- HTTP streaming without WebSocket complexity
Choose Cartesia if:
- Real-time conversational AI where users perceive the latency (sub-100ms time-to-first-audio matters)
- Voice agents, IVR systems, or phone bots
- WebSocket streaming for lowest possible time-to-first-audio
- Multi-turn conversation with consistent voice via context_id
- You're building on top of a voice AI framework (LiveKit, Daily.co, etc.)
Audio Format Selection and Streaming Performance
The audio format you choose has a significant impact on both bandwidth and latency. MP3 (MPEG-1 Audio Layer III) is the most compatible format — playable in all browsers via HTMLAudioElement without additional codecs — but it was not designed for streaming: MP3 frames can depend on data stored in neighboring frames (the bit reservoir), so a decoder cannot always decode a frame in isolation. Opus is dramatically better for streaming: it has lower latency (frames as small as 2.5ms vs. MP3's ~26ms), supports true incremental decoding, and achieves comparable or better audio quality at lower bitrates than MP3. OpenAI TTS's Opus output format (response_format: "opus") is the best choice for streaming applications where first-audio time matters — the browser's Media Source Extensions API can decode Opus in real time as chunks arrive. Cartesia's WebSocket streaming uses raw PCM (pulse-code modulation), which has zero decoding overhead but requires your application to handle playback with the Web Audio API rather than a simple HTMLAudioElement — more code, but the lowest possible latency. ElevenLabs supports both streaming MP3 and WebSocket-based PCM streaming through its WebSocket API, giving you the full range of format options.
Cost Optimization for High-Volume TTS Applications
Text-to-speech costs scale linearly with character count, which can become substantial for content-heavy applications. OpenAI TTS at $0.015 per 1,000 characters is the most economical for high-volume use cases — a podcast summarization app that converts 50,000 characters per day costs $0.75/day, or about $23/month. ElevenLabs' multilingual model at $0.03 per 1,000 characters doubles that cost, though the quality improvement for narration use cases may justify the premium. Cartesia's pricing at approximately $0.065 per 1,000 characters is the most expensive per character but may be cost-effective when the latency requirement (sub-100ms for conversational AI) eliminates the other providers as viable options. Implement character counting in your application before sending to the TTS API — truncate or summarize long inputs, use caching for frequently generated audio (navigation prompts, standard notifications), and avoid regenerating audio for identical text. A simple Redis cache keyed on sha256(text + voiceId + settings) can eliminate redundant API calls for repeated phrases.
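The caching strategy above can be sketched like this — an in-memory stand-in for Redis, where ttsCacheKey and cachedTTS are hypothetical helper names and the synthesize callback stands in for whichever provider call you use:

```typescript
import { createHash } from "node:crypto";

// Deterministic key: identical text + voice + settings produce identical audio,
// so a cached result can be served without a new API call
function ttsCacheKey(text: string, voiceId: string, settings: Record<string, unknown>): string {
  const payload = JSON.stringify({ text, voiceId, settings });
  return createHash("sha256").update(payload).digest("hex");
}

// In-memory stand-in for Redis; swap for redis get/set with a TTL in production
const audioCache = new Map<string, Buffer>();

async function cachedTTS(
  text: string,
  voiceId: string,
  settings: Record<string, unknown>,
  synthesize: (text: string) => Promise<Buffer> // your provider call
): Promise<Buffer> {
  const key = ttsCacheKey(text, voiceId, settings);
  const hit = audioCache.get(key);
  if (hit) return hit; // repeated phrase: zero API cost
  const audio = await synthesize(text);
  audioCache.set(key, audio);
  return audio;
}
```

Note that the settings object must be serialized in a stable order for keys to match; if you build settings dynamically, sort the keys before hashing.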
Voice Consistency and Conversational Context
Long-form audio applications like podcasts, audiobooks, or multi-turn voice agents require voice consistency — the synthesized voice should not shift in character, pacing, or tonal quality across different text segments. ElevenLabs' stability parameter (0–1) controls voice consistency: higher stability produces more monotone but consistent output, lower stability allows more expressiveness but introduces variance between generations. For audiobook narration where chapters are generated separately, use a high stability value (0.7–0.9) and the same voice ID consistently. OpenAI TTS with a fixed voice parameter is inherently consistent across calls — the voice does not vary based on stability settings. Cartesia's context_id parameter maintains voice context across multiple WebSocket sends within a session, ensuring that turn-by-turn dialogue in a conversational AI feels like the same person speaking throughout the conversation rather than a series of disconnected utterances. For voice agents where the same voice speaks hundreds of turns over a long session, Cartesia's context-aware streaming is the only option that explicitly manages this continuity.
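For separately generated chapters, pinning one voice ID and one settings object is a simple way to apply the guidance above. A sketch, assuming the ElevenLabs request shape from the earlier examples; the exact stability value (0.85 here) is illustrative within the 0.7–0.9 range:

```typescript
// Pin voice ID and settings once; reuse them for every chapter so separately
// generated segments sound like the same narrator
const NARRATION_VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"; // same pre-built voice as above
const NARRATION_SETTINGS = {
  stability: 0.85, // high: consistent tone across separately generated chapters
  similarity_boost: 0.75,
  style: 0.0, // no style exaggeration for long-form narration
  use_speaker_boost: true,
} as const;

// Build the request body for one chapter — settings never vary per chapter
function chapterRequest(chapterText: string) {
  return {
    text: chapterText,
    model_id: "eleven_multilingual_v2",
    voice_settings: NARRATION_SETTINGS, // identical for chapter 1 and chapter 40
  };
}
```

Passing chapterRequest(text) to client.textToSpeech.convert(NARRATION_VOICE_ID, ...) guarantees that no per-call drift in settings can creep in between generation runs.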
Integrating TTS into Production Voice AI Pipelines
Production voice AI applications combine speech recognition (STT), language model inference (LLM), and text-to-speech (TTS) in a pipeline that must minimize total end-to-end latency to feel natural. The target for a conversational AI response is under 2 seconds from when the user stops speaking to when they hear the AI's first word. With typical component latencies — Deepgram STT at 200–400ms, LLM first-token latency at 200–500ms on Groq, and Cartesia TTS at under 100ms to first audio — reaching sub-second total latency is achievable. The key optimization is streaming: start TTS synthesis as soon as the first complete sentence from the LLM is available, rather than waiting for the full LLM response. This requires sentence boundary detection (splitting the LLM stream at ., ?, ! characters), sending each sentence to Cartesia's WebSocket as a separate message with continue: true, and beginning audio playback before the LLM has finished generating the full response. LiveKit's Agent framework, Daily.co's real-time API, and Twilio's Voice Intelligence all provide scaffolding for this pipeline pattern, and all three have documented integrations with Cartesia for the TTS layer.
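The sentence-boundary step can be sketched as a small incremental splitter. This is a naive version that splits on ., ?, or ! followed by whitespace (it will mis-fire on abbreviations like "Dr."); createSentenceSplitter is a hypothetical helper, and onSentence is where you would forward each sentence to the TTS WebSocket with continue: true.

```typescript
// Incremental sentence splitter for an LLM token stream: flush each complete
// sentence to TTS as soon as its terminator arrives, before the LLM finishes
function createSentenceSplitter(onSentence: (sentence: string) => void) {
  let buffer = "";
  return {
    push(tokenText: string) {
      buffer += tokenText;
      let match: RegExpMatchArray | null;
      // Emit every complete sentence currently sitting in the buffer
      while ((match = buffer.match(/[.?!](\s|$)/)) !== null && match.index !== undefined) {
        const end = match.index + 1; // include the terminator
        const sentence = buffer.slice(0, end).trim();
        if (sentence) onSentence(sentence);
        buffer = buffer.slice(end).trimStart();
      }
    },
    flush() {
      const rest = buffer.trim();
      if (rest) onSentence(rest); // trailing partial sentence at stream end
      buffer = "";
    },
  };
}
```

Wire push() to the LLM's token callback and call flush() when the stream closes; each emitted sentence becomes one WebSocket send, with continue set to false only on the last one.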
Methodology
Data sourced from official ElevenLabs documentation (elevenlabs.io/docs), OpenAI TTS documentation (platform.openai.com/docs/guides/text-to-speech), Cartesia documentation (docs.cartesia.ai), pricing pages as of February 2026, latency benchmarks from the AI voice community, and community discussions from the ElevenLabs Discord and r/MachineLearning.
Related: Deepgram vs OpenAI Whisper vs AssemblyAI for the speech-to-text side of voice AI, or Vercel AI SDK vs OpenAI SDK vs Anthropic SDK for the LLM layer that drives voice agent responses.
See also: Mastra vs LangChain.js vs Google GenKit and Model Context Protocol (MCP) Libraries for Node.js 2026