ElevenLabs vs OpenAI TTS vs Cartesia: Text-to-Speech APIs 2026
TL;DR
Text-to-speech has crossed the uncanny valley — modern TTS produces voices indistinguishable from humans in most contexts. The providers have diverged on their strengths. ElevenLabs leads on voice quality and features — voice cloning from 1 minute of audio, 30+ languages, emotional range, dubbing, sound effects, and a generous library of pre-built voices; it's the default when audio quality matters most. OpenAI TTS is the developer-friendly choice — six voices, one endpoint, streaming support, and pricing simple enough that there's nothing to tune ($0.015/1K chars); best when you're already using OpenAI and need good-enough TTS fast. Cartesia is the ultra-low-latency specialist — sub-100ms time-to-first-audio via WebSocket streaming, purpose-built for real-time voice applications, conversational AI, and interactive IVR. For highest quality voice narration: ElevenLabs. For quick integration in an OpenAI stack: OpenAI TTS. For real-time voice AI where latency is the constraint: Cartesia.
Key Takeaways
- ElevenLabs voice cloning — create a custom voice from 1 minute of audio samples
- OpenAI TTS pricing: $0.015/1K characters — cheapest per-character rate of the three
- Cartesia sub-100ms latency — first audio byte before entire text is processed
- ElevenLabs supports 30+ languages — including Arabic, Hindi, Korean, Turkish
- OpenAI has 6 voices — Alloy, Echo, Fable, Onyx, Nova, Shimmer
- Cartesia Sonic model — optimized for conversational, low-latency streaming
- All three support streaming — partial audio output as text is processed
Use Cases by Provider
Long-form narration (audiobooks, podcasts) → ElevenLabs (best voice quality)
Short-form UI text (notifications, captions) → OpenAI TTS (simplest API)
Conversational AI / voice agents → Cartesia (lowest latency)
Voice cloning from existing speaker → ElevenLabs (best cloning)
Wide language support (30+ languages) → ElevenLabs
Already on OpenAI stack → OpenAI TTS
Real-time IVR / phone bots → Cartesia
Video dubbing / translation → ElevenLabs Dubbing API
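The routing above can be collapsed into a tiny helper for codebases that pick a provider per request. The use-case keys here are illustrative names invented for this sketch, not an official taxonomy:

```typescript
type Provider = "elevenlabs" | "openai" | "cartesia";

// Illustrative mapping of the use cases above to a default provider pick.
const providerFor: Record<string, Provider> = {
  "long-form-narration": "elevenlabs",
  "ui-text": "openai",
  "voice-agent": "cartesia",
  "voice-cloning": "elevenlabs",
  "realtime-ivr": "cartesia",
  "dubbing": "elevenlabs",
};

function pickProvider(useCase: string): Provider {
  // Unknown use cases fall back to the simplest API to integrate.
  return providerFor[useCase] ?? "openai";
}
```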
ElevenLabs: Highest Quality Voice AI
ElevenLabs produces the most human-like voices and supports the widest range of use cases — from narration to voice cloning to sound effects.
Installation
npm install elevenlabs
Basic Text-to-Speech
import { ElevenLabsClient } from "elevenlabs";
import fs from "fs";
const client = new ElevenLabsClient({
apiKey: process.env.ELEVENLABS_API_KEY!,
});
// Convert text to speech with a specific voice
async function textToSpeech(text: string, outputPath: string): Promise<void> {
const audio = await client.textToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
// Voice ID "JBFqnCBsd6RMkjVDRZzb" = George (pre-built voice)
text,
model_id: "eleven_multilingual_v2", // Best quality
// model_id: "eleven_turbo_v2_5" // Lower latency, slightly lower quality
voice_settings: {
stability: 0.5, // 0-1: higher = more consistent voice
similarity_boost: 0.75, // 0-1: higher = closer to original voice
style: 0.0, // 0-1: style exaggeration
use_speaker_boost: true,
},
output_format: "mp3_44100_128",
});
// audio is an AsyncIterable<Uint8Array>
const chunks: Uint8Array[] = [];
for await (const chunk of audio) {
chunks.push(chunk);
}
const buffer = Buffer.concat(chunks);
fs.writeFileSync(outputPath, buffer);
}
await textToSpeech("Hello, this is a test of ElevenLabs text to speech.", "output.mp3");
Streaming (Real-Time)
import { ElevenLabsClient } from "elevenlabs";

const client = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY! });
// Stream TTS to HTTP response (Next.js)
// app/api/tts/route.ts
export async function POST(req: Request) {
const { text } = await req.json();
const audio = await client.textToSpeech.convertAsStream(
"JBFqnCBsd6RMkjVDRZzb",
{
text,
model_id: "eleven_turbo_v2_5", // Lower latency for streaming
output_format: "mp3_44100_128",
}
);
// Pipe the AsyncIterable to a ReadableStream
const stream = new ReadableStream({
async start(controller) {
for await (const chunk of audio) {
controller.enqueue(chunk);
}
controller.close();
},
});
return new Response(stream, {
headers: { "Content-Type": "audio/mpeg" },
});
}
Voice Cloning
// Create a custom voice from audio samples
async function cloneVoice(audioFilePaths: string[], voiceName: string) {
const files = audioFilePaths.map((path) => ({
audio: fs.createReadStream(path),
name: `sample_${path}`,
type: "audio/mp3" as const,
}));
const voice = await client.voices.add({
name: voiceName,
description: "Custom cloned voice",
files,
labels: { accent: "american", age: "middle_aged", gender: "male" },
});
console.log("Voice created:", voice.voice_id);
return voice.voice_id;
}
// Use the cloned voice by passing its ID to convert()
// (the textToSpeech helper above hardcodes the George voice ID)
const voiceId = await cloneVoice(["./samples/sample1.mp3", "./samples/sample2.mp3"], "MyCustomVoice");
const clonedAudio = await client.textToSpeech.convert(voiceId, {
  text: "Using my cloned voice now.",
  model_id: "eleven_multilingual_v2",
});
Sound Effects Generation
// ElevenLabs also generates sound effects (not just voice)
const soundEffect = await client.textToSoundEffects.convert({
text: "Heavy rain hitting a tin roof, distant thunder rolling",
duration_seconds: 5,
prompt_influence: 0.3,
});
const chunks: Uint8Array[] = [];
for await (const chunk of soundEffect) chunks.push(chunk);
fs.writeFileSync("rain.mp3", Buffer.concat(chunks));
OpenAI TTS: Simplest Integration
OpenAI's text-to-speech API is a single endpoint with six voices — no configuration required, just text in and audio out.
Installation
npm install openai
Basic TTS
import OpenAI from "openai";
import fs from "fs";
import path from "path";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
// Six voices: alloy, echo, fable, onyx, nova, shimmer
async function textToSpeech(
text: string,
voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer" = "nova"
): Promise<Buffer> {
const mp3 = await openai.audio.speech.create({
model: "tts-1", // "tts-1" | "tts-1-hd" (higher quality, slower)
voice,
input: text,
response_format: "mp3", // "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm"
speed: 1.0, // 0.25 to 4.0
});
const buffer = Buffer.from(await mp3.arrayBuffer());
return buffer;
}
const audio = await textToSpeech("Welcome to PkgPulse!", "nova");
fs.writeFileSync("welcome.mp3", audio);
Streaming to HTTP Response
// app/api/tts/route.ts — Next.js streaming response
import OpenAI from "openai";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
export async function POST(req: Request) {
const { text, voice = "nova" } = await req.json();
const response = await openai.audio.speech.create({
model: "tts-1",
voice,
input: text,
    response_format: "opus", // Ogg Opus container; streams better than MP3
  });
  // response.body is a web ReadableStream — pass it straight through.
  // Chunked transfer is applied automatically; Transfer-Encoding is a
  // forbidden header and must not be set manually.
  return new Response(response.body, {
    headers: { "Content-Type": "audio/ogg" },
  });
}
Client-Side Playback
// React component — stream TTS and play in browser
async function speakText(text: string) {
const response = await fetch("/api/tts", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ text, voice: "nova" }),
});
const arrayBuffer = await response.arrayBuffer();
const audioContext = new AudioContext();
const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
source.connect(audioContext.destination);
source.start();
}
// Or with HTMLAudioElement (simpler)
async function speakTextSimple(text: string) {
const response = await fetch("/api/tts", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ text }),
});
const blob = await response.blob();
const url = URL.createObjectURL(blob);
const audio = new Audio(url);
await audio.play();
}
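The blob approach above waits for the full response before playback can begin. If the TTS route also accepts GET requests (an assumption — the example route only handles POST), the browser can start playing while bytes are still arriving by pointing an audio element straight at the URL. A sketch of the URL helper:

```typescript
// Hypothetical helper: assumes /api/tts also handles GET with text/voice query params.
function ttsUrl(text: string, voice = "nova"): string {
  const params = new URLSearchParams({ text, voice });
  return `/api/tts?${params.toString()}`;
}

// In the browser, progressive playback starts once enough audio is buffered:
// const audio = new Audio(ttsUrl("Hello there"));
// void audio.play();
```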
TTS with Word Timestamps (for Captions)
// OpenAI TTS doesn't natively support word timestamps in TTS
// Use a pipeline: TTS → Whisper STT with verbose_json for alignment
async function generateAudioWithTimestamps(text: string) {
// Step 1: Generate audio
const mp3 = await openai.audio.speech.create({
model: "tts-1",
voice: "nova",
input: text,
});
const audioBuffer = Buffer.from(await mp3.arrayBuffer());
// Step 2: Transcribe with word timestamps to get alignment
const { File } = await import("node:buffer");
const transcript = await openai.audio.transcriptions.create({
file: new File([audioBuffer], "audio.mp3", { type: "audio/mp3" }),
model: "whisper-1",
response_format: "verbose_json",
timestamp_granularities: ["word"],
});
return {
audioBuffer,
words: transcript.words, // Array of { word, start, end }
};
}
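Once you have the word-level array, converting it to captions is mechanical. A minimal SRT serializer (a sketch — the fixed words-per-cue grouping is a simplification; real caption pipelines also split on punctuation and line length):

```typescript
interface Word {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const rem = ms % 1000;
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(rem, 3)}`;
}

// Group words into fixed-size cues and serialize to SRT.
function wordsToSrt(words: Word[], wordsPerCue = 7): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    const text = group.map((w) => w.word).join(" ");
    cues.push(
      `${cues.length + 1}\n${srtTime(group[0].start)} --> ${srtTime(group[group.length - 1].end)}\n${text}`
    );
  }
  return cues.join("\n\n") + "\n";
}
```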
Cartesia: Ultra-Low-Latency TTS
Cartesia's Sonic model is built for real-time voice applications — designed to produce the first audio chunk in under 100ms, enabling natural-feeling conversational AI.
Installation
npm install @cartesia/cartesia-js
Basic TTS (REST)
import Cartesia from "@cartesia/cartesia-js";
const cartesia = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY! });
// Fetch available voices
const voices = await cartesia.voices.list();
console.log(voices.map((v) => ({ id: v.id, name: v.name })));
// Basic synthesis
async function synthesize(text: string): Promise<Buffer> {
const response = await cartesia.tts.bytes({
model_id: "sonic-2", // Latest Sonic model
transcript: text,
voice: {
mode: "id",
id: "a0e99841-438c-4a64-b679-ae501e7d6091", // Barbershop Man (example voice)
},
    output_format: {
      container: "mp3",
      bit_rate: 128000, // mp3 output takes a bit_rate rather than an encoding
      sample_rate: 44100,
    },
language: "en",
});
return Buffer.from(response);
}
WebSocket Streaming (Sub-100ms)
import Cartesia from "@cartesia/cartesia-js";
const cartesia = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY! });
// WebSocket stream — audio starts arriving before full text is processed
async function streamTTS(text: string) {
const websocket = await cartesia.tts.websocket({
container: "raw",
encoding: "pcm_f32le",
sample_rate: 44100,
});
// Send text, receive audio chunks in real-time
const response = await websocket.send({
model_id: "sonic-2",
voice: {
mode: "id",
id: "a0e99841-438c-4a64-b679-ae501e7d6091",
},
transcript: text,
language: "en",
context_id: "session-123", // Maintain voice consistency across messages
continue: false, // true = more text coming; false = finalize
});
// Audio chunks stream as text is synthesized
const audioChunks: Buffer[] = [];
for await (const chunk of response) {
if (chunk.type === "chunk") {
audioChunks.push(Buffer.from(chunk.data, "base64"));
// In production: pipe directly to audio output
} else if (chunk.type === "done") {
break;
}
}
await websocket.disconnect();
return Buffer.concat(audioChunks);
}
// Streaming multi-turn conversation
async function conversationalTTS() {
const websocket = await cartesia.tts.websocket({
container: "raw",
encoding: "pcm_f32le",
sample_rate: 44100,
});
const contextId = `conversation-${Date.now()}`;
// Turn 1
const turn1 = await websocket.send({
model_id: "sonic-2",
voice: { mode: "id", id: "a0e99841-438c-4a64-b679-ae501e7d6091" },
transcript: "Hello! How can I help you today?",
context_id: contextId,
continue: false,
});
for await (const chunk of turn1) {
    if (chunk.type === "chunk") streamToAudioOutput(chunk.data); // placeholder: your playback sink
if (chunk.type === "done") break;
}
// Turn 2 — voice consistency maintained via context_id
const turn2 = await websocket.send({
model_id: "sonic-2",
voice: { mode: "id", id: "a0e99841-438c-4a64-b679-ae501e7d6091" },
transcript: "I can help you with that right away.",
context_id: contextId,
continue: false,
});
for await (const chunk of turn2) {
    if (chunk.type === "chunk") streamToAudioOutput(chunk.data); // placeholder: your playback sink
if (chunk.type === "done") break;
}
await websocket.disconnect();
}
Voice Cloning (Instant)
// Cartesia instant voice cloning from an audio sample
import fs from "fs";

async function cloneVoice(audioFilePath: string, voiceName: string) {
  const audioBlob = new Blob([fs.readFileSync(audioFilePath)], { type: "audio/wav" });
const voice = await cartesia.voices.clone({
clip: audioBlob,
name: voiceName,
description: "Cloned from audio sample",
language: "en",
});
console.log("Cloned voice ID:", voice.id);
return voice.id;
}
Node.js Voice Agent Pattern
// Pattern: LLM → Cartesia TTS → real-time audio output
import Anthropic from "@anthropic-ai/sdk";
import Cartesia from "@cartesia/cartesia-js";
const anthropic = new Anthropic();
const cartesia = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY! });
async function voiceAgent(userMessage: string) {
// Step 1: Get LLM response
const llmResponse = await anthropic.messages.create({
model: "claude-3-5-haiku-20241022",
max_tokens: 200,
messages: [{ role: "user", content: userMessage }],
});
const text = llmResponse.content[0].type === "text" ? llmResponse.content[0].text : "";
// Step 2: Stream to Cartesia (sub-100ms first audio)
const websocket = await cartesia.tts.websocket({
container: "raw",
encoding: "pcm_f32le",
sample_rate: 44100,
});
const stream = await websocket.send({
model_id: "sonic-2",
voice: { mode: "id", id: "a0e99841-438c-4a64-b679-ae501e7d6091" },
transcript: text,
language: "en",
});
for await (const chunk of stream) {
if (chunk.type === "chunk") {
      playAudioChunk(chunk.data); // placeholder: play as it arrives
}
if (chunk.type === "done") break;
}
await websocket.disconnect();
}
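One refinement the sketch above omits: rather than waiting for the complete LLM response, production voice agents usually stream the LLM output and flush each completed sentence to TTS as it arrives. A rough sentence splitter for that pipeline (assumes English punctuation; real segmenters handle abbreviations and decimals):

```typescript
// Split text into completed sentences plus the unfinished remainder.
// Call this on the accumulated LLM stream buffer after each token delta.
function splitCompleteSentences(text: string): { sentences: string[]; rest: string } {
  const sentences: string[] = [];
  const re = /[^.!?]+[.!?]+\s*/g;
  let consumed = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    sentences.push(m[0].trim());
    consumed = re.lastIndex;
  }
  return { sentences, rest: text.slice(consumed) };
}
```

Each completed sentence can then be sent over the existing Cartesia WebSocket with the same context_id and continue: true, finishing with continue: false once the LLM stream ends, so the first sentence is playing while the rest is still being generated.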
Feature Comparison
| Feature | ElevenLabs | OpenAI TTS | Cartesia |
|---|---|---|---|
| Voice quality | ✅ Best | ✅ Good | ✅ Good |
| Time to first audio | ~500ms | ~800ms | ✅ < 100ms |
| Voice cloning | ✅ (1-min sample) | ❌ | ✅ (instant) |
| Pre-built voices | 3,000+ | 6 | 200+ |
| Language support | 30+ | 57 | 10+ |
| Sound effects | ✅ | ❌ | ❌ |
| Video dubbing | ✅ | ❌ | ❌ |
| WebSocket streaming | ✅ | ❌ (HTTP chunked) | ✅ Native |
| Pricing (per 1K chars) | $0.03 (multilingual) | $0.015 | ~$0.065 |
| Free tier | 10K chars/month | ❌ | $5 credit |
| Conversation context | ❌ | ❌ | ✅ context_id |
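The pricing row translates into a quick back-of-envelope comparison. The rates below are the table's figures, not live quotes — check current pricing pages before committing:

```typescript
// Per-1K-character rates from the comparison table (USD, as of this writing).
const RATE_PER_1K_CHARS = {
  elevenlabs: 0.03,
  openai: 0.015,
  cartesia: 0.065,
} as const;

// Estimated monthly TTS spend for a given character volume.
function monthlyCost(charsPerMonth: number, provider: keyof typeof RATE_PER_1K_CHARS): number {
  return (charsPerMonth / 1000) * RATE_PER_1K_CHARS[provider];
}

// e.g. 1M characters/month:
// monthlyCost(1_000_000, "openai")     // ≈ $15
// monthlyCost(1_000_000, "elevenlabs") // ≈ $30
```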
When to Use Each
Choose ElevenLabs if:
- Audio quality is the top priority (audiobooks, narration, podcasts)
- Voice cloning from existing speaker audio is needed
- 30+ language support or emotional voice range required
- Sound effects generation alongside TTS
- Video dubbing and lip-sync translation
Choose OpenAI TTS if:
- You're already using OpenAI APIs and want minimal integration overhead
- Good-enough quality for UI text, notifications, or basic narration
- Simple pricing: $0.015/1K chars with no tiers or voice add-ons
- HTTP streaming without WebSocket complexity
Choose Cartesia if:
- Real-time conversational AI where users perceive latency (sub-100ms first audio matters)
- Voice agents, IVR systems, or phone bots
- WebSocket streaming for lowest possible time-to-first-audio
- Multi-turn conversation with consistent voice via context_id
- You're building on top of a voice AI framework (LiveKit, Daily.co, etc.)
Methodology
Data sourced from official ElevenLabs documentation (elevenlabs.io/docs), OpenAI TTS documentation (platform.openai.com/docs/guides/text-to-speech), Cartesia documentation (docs.cartesia.ai), pricing pages as of February 2026, latency benchmarks from the AI voice community, and community discussions from the ElevenLabs Discord and r/MachineLearning.
Related: Deepgram vs OpenAI Whisper vs AssemblyAI for the speech-to-text side of voice AI, or Vercel AI SDK vs OpenAI SDK vs Anthropic SDK for the LLM layer that drives voice agent responses.