
ElevenLabs vs OpenAI TTS vs Cartesia: Text-to-Speech APIs 2026

PkgPulse Team


TL;DR

Text-to-speech has crossed the uncanny valley — modern TTS produces voices indistinguishable from humans in most contexts. The providers have diverged on their strengths. ElevenLabs leads on voice quality and features — voice cloning from 1 minute of audio, 30+ languages, emotional range, dubbing, sound effects, and a generous library of pre-built voices; it's the default when audio quality matters most. OpenAI TTS is the developer-friendly choice — six voices, one endpoint, streaming support, and pricing so simple there's nothing to tune ($0.015/1K chars); best when you're already using OpenAI and need good-enough TTS fast. Cartesia is the ultra-low-latency specialist — sub-100ms time-to-first-audio via WebSocket streaming, purpose-built for real-time voice applications, conversational AI, and interactive IVR. For highest-quality voice narration: ElevenLabs. For quick integration in an OpenAI stack: OpenAI TTS. For real-time voice AI where latency is the constraint: Cartesia.

Key Takeaways

  • ElevenLabs voice cloning — create a custom voice from 1 minute of audio samples
  • OpenAI TTS pricing: $0.015/1K characters — cheapest per-character rate of the three
  • Cartesia sub-100ms latency — first audio byte before entire text is processed
  • ElevenLabs supports 30+ languages — including Arabic, Hindi, Korean, Turkish
  • OpenAI has 6 voices — Alloy, Echo, Fable, Onyx, Nova, Shimmer
  • Cartesia Sonic model — optimized for conversational, low-latency streaming
  • All three support streaming — partial audio output as text is processed
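Despite their different SDKs, all three providers ultimately hand you synthesized audio as a stream of byte chunks. The sketch below shows a provider-agnostic shape for that common pattern — the `TTSProvider` interface and the fake provider are purely illustrative, not part of any of the three SDKs:

```typescript
// Minimal provider-agnostic TTS shape (illustrative, not an official API):
// each SDK ultimately yields audio as an async stream of byte chunks.
type AudioStream = AsyncIterable<Uint8Array>;

interface TTSProvider {
  name: string;
  synthesize(text: string): AudioStream;
}

// Collect a streamed response into one buffer (e.g. for saving to disk).
async function collectAudio(stream: AudioStream): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  let total = 0;
  for await (const chunk of stream) {
    chunks.push(chunk);
    total += chunk.length;
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

// A fake provider demonstrating the interface.
const fakeProvider: TTSProvider = {
  name: "fake",
  async *synthesize(text: string) {
    // Pretend each word becomes one audio chunk.
    for (const word of text.split(" ")) {
      yield new TextEncoder().encode(word);
    }
  },
};
```

Wrapping each real SDK behind this interface makes it straightforward to A/B providers without touching the rest of your pipeline.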

Use Cases by Provider

Long-form narration (audiobooks, podcasts)     → ElevenLabs (best voice quality)
Short-form UI text (notifications, captions)   → OpenAI TTS (simplest API)
Conversational AI / voice agents               → Cartesia (lowest latency)
Voice cloning from existing speaker            → ElevenLabs (best cloning)
100+ language support                          → ElevenLabs
Already on OpenAI stack                        → OpenAI TTS
Real-time IVR / phone bots                     → Cartesia
Video dubbing / translation                    → ElevenLabs Dubbing API

ElevenLabs: Highest Quality Voice AI

ElevenLabs produces the most human-like voices and supports the widest range of use cases — from narration to voice cloning to sound effects.

Installation

npm install elevenlabs

Basic Text-to-Speech

import { ElevenLabsClient } from "elevenlabs";
import fs from "fs";

const client = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY!,
});

// Convert text to speech with a specific voice
async function textToSpeech(text: string, outputPath: string): Promise<void> {
  const audio = await client.textToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
    // Voice ID "JBFqnCBsd6RMkjVDRZzb" = George (pre-built voice)
    text,
    model_id: "eleven_multilingual_v2",   // Best quality
    // model_id: "eleven_turbo_v2_5"      // Lower latency, slightly lower quality
    voice_settings: {
      stability: 0.5,         // 0-1: higher = more consistent voice
      similarity_boost: 0.75, // 0-1: higher = closer to original voice
      style: 0.0,             // 0-1: style exaggeration
      use_speaker_boost: true,
    },
    output_format: "mp3_44100_128",
  });

  // audio is an AsyncIterable<Uint8Array>
  const chunks: Uint8Array[] = [];
  for await (const chunk of audio) {
    chunks.push(chunk);
  }

  const buffer = Buffer.concat(chunks);
  fs.writeFileSync(outputPath, buffer);
}

await textToSpeech("Hello, this is a test of ElevenLabs text to speech.", "output.mp3");

Streaming (Real-Time)

import { ElevenLabsClient } from "elevenlabs";
import { PassThrough } from "stream";

const client = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY! });

// Stream TTS to HTTP response (Next.js)
// app/api/tts/route.ts
export async function POST(req: Request) {
  const { text } = await req.json();

  const audio = await client.textToSpeech.convertAsStream(
    "JBFqnCBsd6RMkjVDRZzb",
    {
      text,
      model_id: "eleven_turbo_v2_5",  // Lower latency for streaming
      output_format: "mp3_44100_128",
    }
  );

  // Pipe the AsyncIterable to a ReadableStream
  const stream = new ReadableStream({
    async start(controller) {
      for await (const chunk of audio) {
        controller.enqueue(chunk);
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "audio/mpeg" },
  });
}

Voice Cloning

// Create a custom voice from audio samples
async function cloneVoice(audioFilePaths: string[], voiceName: string) {
  const files = audioFilePaths.map((path) => ({
    audio: fs.createReadStream(path),
    name: `sample_${path}`,
    type: "audio/mp3" as const,
  }));

  const voice = await client.voices.add({
    name: voiceName,
    description: "Custom cloned voice",
    files,
    labels: { accent: "american", age: "middle_aged", gender: "male" },
  });

  console.log("Voice created:", voice.voice_id);
  return voice.voice_id;
}

// Use the cloned voice — note: the textToSpeech() helper above hardcodes
// George's voice ID, so pass the returned voice_id to convert() directly:
const voiceId = await cloneVoice(["./samples/sample1.mp3", "./samples/sample2.mp3"], "MyCustomVoice");
const clonedAudio = await client.textToSpeech.convert(voiceId, {
  text: "Using my cloned voice now.",
  model_id: "eleven_multilingual_v2",
});

Sound Effects Generation

// ElevenLabs also generates sound effects (not just voice)
const soundEffect = await client.textToSoundEffects.convert({
  text: "Heavy rain hitting a tin roof, distant thunder rolling",
  duration_seconds: 5,
  prompt_influence: 0.3,
});

const chunks: Uint8Array[] = [];
for await (const chunk of soundEffect) chunks.push(chunk);
fs.writeFileSync("rain.mp3", Buffer.concat(chunks));

OpenAI TTS: Simplest Integration

OpenAI's text-to-speech API is a single endpoint with six voices — no configuration required, just text in and audio out.

Installation

npm install openai

Basic TTS

import OpenAI from "openai";
import fs from "fs";
import path from "path";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

// Six voices: alloy, echo, fable, onyx, nova, shimmer
async function textToSpeech(
  text: string,
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer" = "nova"
): Promise<Buffer> {
  const mp3 = await openai.audio.speech.create({
    model: "tts-1",              // "tts-1" | "tts-1-hd" (higher quality, slower)
    voice,
    input: text,
    response_format: "mp3",      // "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm"
    speed: 1.0,                  // 0.25 to 4.0
  });

  const buffer = Buffer.from(await mp3.arrayBuffer());
  return buffer;
}

const audio = await textToSpeech("Welcome to PkgPulse!", "nova");
fs.writeFileSync("welcome.mp3", audio);

Streaming to HTTP Response

// app/api/tts/route.ts — Next.js streaming response
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

export async function POST(req: Request) {
  const { text, voice = "nova" } = await req.json();

  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice,
    input: text,
    response_format: "opus",  // Opus streams better than MP3
  });

  // The SDK response exposes a web ReadableStream body — pass it straight through.
  // (Don't set Transfer-Encoding manually; the runtime handles chunking.)
  return new Response(response.body, {
    headers: { "Content-Type": "audio/opus" },
  });
}

Client-Side Playback

// React component — stream TTS and play in browser
async function speakText(text: string) {
  const response = await fetch("/api/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, voice: "nova" }),
  });

  const arrayBuffer = await response.arrayBuffer();
  const audioContext = new AudioContext();
  const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);

  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);
  source.start();
}

// Or with HTMLAudioElement (simpler)
async function speakTextSimple(text: string) {
  const response = await fetch("/api/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });

  const blob = await response.blob();
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  await audio.play();
}

TTS with Word Timestamps (for Captions)

// OpenAI TTS doesn't return word timestamps directly
// Workaround pipeline: TTS → Whisper STT with verbose_json for alignment
async function generateAudioWithTimestamps(text: string) {
  // Step 1: Generate audio
  const mp3 = await openai.audio.speech.create({
    model: "tts-1",
    voice: "nova",
    input: text,
  });

  const audioBuffer = Buffer.from(await mp3.arrayBuffer());

  // Step 2: Transcribe with word timestamps to get alignment
  const { File } = await import("node:buffer");
  const transcript = await openai.audio.transcriptions.create({
    file: new File([audioBuffer], "audio.mp3", { type: "audio/mp3" }),
    model: "whisper-1",
    response_format: "verbose_json",
    timestamp_granularities: ["word"],
  });

  return {
    audioBuffer,
    words: transcript.words,  // Array of { word, start, end }
  };
}
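The `words` array from the pipeline above can be turned into a caption file. Here is a hypothetical helper that groups words into SRT cues — the cue size and formatting choices are assumptions for illustration, not part of either API:

```typescript
interface Word {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const rem = ms % 1000;
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(rem, 3)}`;
}

// Group words into fixed-size cues and emit SRT text.
function wordsToSrt(words: Word[], wordsPerCue = 6): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    const index = cues.length + 1;
    const text = group.map((w) => w.word).join(" ");
    cues.push(
      `${index}\n${srtTime(group[0].start)} --> ${srtTime(group[group.length - 1].end)}\n${text}\n`
    );
  }
  return cues.join("\n");
}
```

Pass `transcript.words` from the Whisper step straight into `wordsToSrt` to get captions aligned to the generated audio.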

Cartesia: Ultra-Low-Latency TTS

Cartesia's Sonic model is built for real-time voice applications — designed to produce the first audio chunk in under 100ms, enabling natural-feeling conversational AI.
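Time-to-first-audio is easy to verify yourself, since all three SDKs expose audio as an async iterable. A small hypothetical helper (the function name and return shape are ours, not any SDK's):

```typescript
// Measure time-to-first-audio (TTFA) for any async byte stream —
// works with ElevenLabs streams, OpenAI response bodies, or Cartesia chunks.
async function measureTTFA(
  start: () => AsyncIterable<Uint8Array>
): Promise<{ ttfaMs: number; totalBytes: number }> {
  const t0 = performance.now();
  let ttfaMs = -1;
  let totalBytes = 0;
  for await (const chunk of start()) {
    if (ttfaMs < 0) ttfaMs = performance.now() - t0; // first chunk arrived
    totalBytes += chunk.length;
  }
  return { ttfaMs, totalBytes };
}
```

Run it against each provider from the same region you'll deploy in — network distance often dominates the model's own latency.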

Installation

npm install @cartesia/cartesia-js

Basic TTS (REST)

import Cartesia from "@cartesia/cartesia-js";

const cartesia = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY! });

// Fetch available voices
const voices = await cartesia.voices.list();
console.log(voices.map((v) => ({ id: v.id, name: v.name })));

// Basic synthesis
async function synthesize(text: string): Promise<Buffer> {
  const response = await cartesia.tts.bytes({
    model_id: "sonic-2",         // Latest Sonic model
    transcript: text,
    voice: {
      mode: "id",
      id: "a0e99841-438c-4a64-b679-ae501e7d6091",  // Barbershop Man (example voice)
    },
    output_format: {
      container: "mp3",
      encoding: "mp3",
      sample_rate: 44100,
    },
    language: "en",
  });

  return Buffer.from(response);
}

WebSocket Streaming (Sub-100ms)

import Cartesia from "@cartesia/cartesia-js";

const cartesia = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY! });

// WebSocket stream — audio starts arriving before full text is processed
async function streamTTS(text: string) {
  const websocket = await cartesia.tts.websocket({
    container: "raw",
    encoding: "pcm_f32le",
    sample_rate: 44100,
  });

  // Send text, receive audio chunks in real-time
  const response = await websocket.send({
    model_id: "sonic-2",
    voice: {
      mode: "id",
      id: "a0e99841-438c-4a64-b679-ae501e7d6091",
    },
    transcript: text,
    language: "en",
    context_id: "session-123",  // Maintain voice consistency across messages
    continue: false,             // true = more text coming; false = finalize
  });

  // Audio chunks stream as text is synthesized
  const audioChunks: Buffer[] = [];
  for await (const chunk of response) {
    if (chunk.type === "chunk") {
      audioChunks.push(Buffer.from(chunk.data, "base64"));
      // In production: pipe directly to audio output
    } else if (chunk.type === "done") {
      break;
    }
  }

  await websocket.disconnect();
  return Buffer.concat(audioChunks);
}

// Streaming multi-turn conversation
async function conversationalTTS() {
  const websocket = await cartesia.tts.websocket({
    container: "raw",
    encoding: "pcm_f32le",
    sample_rate: 44100,
  });

  const contextId = `conversation-${Date.now()}`;

  // Turn 1
  const turn1 = await websocket.send({
    model_id: "sonic-2",
    voice: { mode: "id", id: "a0e99841-438c-4a64-b679-ae501e7d6091" },
    transcript: "Hello! How can I help you today?",
    context_id: contextId,
    continue: false,
  });

  // streamToAudioOutput is a placeholder for your audio sink (speaker, WebRTC track, etc.)
  for await (const chunk of turn1) {
    if (chunk.type === "chunk") streamToAudioOutput(chunk.data);
    if (chunk.type === "done") break;
  }

  // Turn 2 — voice consistency maintained via context_id
  const turn2 = await websocket.send({
    model_id: "sonic-2",
    voice: { mode: "id", id: "a0e99841-438c-4a64-b679-ae501e7d6091" },
    transcript: "I can help you with that right away.",
    context_id: contextId,
    continue: false,
  });

  for await (const chunk of turn2) {
    if (chunk.type === "chunk") streamToAudioOutput(chunk.data);
    if (chunk.type === "done") break;
  }

  await websocket.disconnect();
}

Voice Cloning (Instant)

// Cartesia instant voice cloning from an audio sample
import fs from "fs";

async function cloneVoice(audioFilePath: string, voiceName: string) {
  const audioBlob = new Blob([fs.readFileSync(audioFilePath)], { type: "audio/wav" });

  const voice = await cartesia.voices.clone({
    clip: audioBlob,
    name: voiceName,
    description: "Cloned from audio sample",
    language: "en",
  });

  console.log("Cloned voice ID:", voice.id);
  return voice.id;
}

Node.js Voice Agent Pattern

// Pattern: LLM → Cartesia TTS → real-time audio output
import Anthropic from "@anthropic-ai/sdk";
import Cartesia from "@cartesia/cartesia-js";

const anthropic = new Anthropic();
const cartesia = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY! });

async function voiceAgent(userMessage: string) {
  // Step 1: Get LLM response
  const llmResponse = await anthropic.messages.create({
    model: "claude-3-5-haiku-20241022",
    max_tokens: 200,
    messages: [{ role: "user", content: userMessage }],
  });

  const text = llmResponse.content[0].type === "text" ? llmResponse.content[0].text : "";

  // Step 2: Stream to Cartesia (sub-100ms first audio)
  const websocket = await cartesia.tts.websocket({
    container: "raw",
    encoding: "pcm_f32le",
    sample_rate: 44100,
  });

  const stream = await websocket.send({
    model_id: "sonic-2",
    voice: { mode: "id", id: "a0e99841-438c-4a64-b679-ae501e7d6091" },
    transcript: text,
    language: "en",
  });

  for await (const chunk of stream) {
    if (chunk.type === "chunk") {
      playAudioChunk(chunk.data);  // placeholder for your audio sink — play as it arrives
    }
    if (chunk.type === "done") break;
  }

  await websocket.disconnect();
}

Feature Comparison

Feature                  ElevenLabs            OpenAI TTS           Cartesia
Voice quality            ✅ Best               ✅ Good              ✅ Good
Time to first audio      ~500ms                ~800ms               ✅ < 100ms
Voice cloning            ✅ (1-min sample)     ❌                   ✅ (instant)
Pre-built voices         3,000+                6                    200+
Language support         30+                   57                   10+
Sound effects            ✅                    ❌                   ❌
Video dubbing            ✅                    ❌                   ❌
WebSocket streaming      ✅                    ❌ (HTTP chunked)    ✅ Native
Pricing (per 1K chars)   $0.03 (multilingual)  $0.015               ~$0.065
Free tier                10K chars/month       ❌                   $5 credit
Conversation context     ❌                    ❌                   ✅ context_id
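Using the per-character rates from the table above (a February 2026 snapshot — real prices vary by plan and model, so verify against each provider's pricing page), a quick cost comparison:

```typescript
// Per-1K-character rates from the comparison table (snapshot; verify before relying on them).
const RATE_PER_1K_CHARS = {
  elevenlabs: 0.03,   // multilingual v2
  openai: 0.015,      // tts-1
  cartesia: 0.065,    // approximate
} as const;

// Estimate synthesis cost in USD for a given character count.
function estimateCost(chars: number): Record<string, number> {
  const out: Record<string, number> = {};
  for (const [provider, rate] of Object.entries(RATE_PER_1K_CHARS)) {
    out[provider] = +((chars / 1000) * rate).toFixed(4);
  }
  return out;
}

// Example: a ~40,000-word audiobook chapter is roughly 240,000 characters.
// estimateCost(240_000) → { elevenlabs: 7.2, openai: 3.6, cartesia: 15.6 }
```

At long-form volumes the per-character gap compounds quickly, which is why narration workloads tend to weigh price against ElevenLabs' quality edge rather than default to the cheapest rate.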

When to Use Each

Choose ElevenLabs if:

  • Audio quality is the top priority (audiobooks, narration, podcasts)
  • Voice cloning from existing speaker audio is needed
  • 30+ language support or emotional voice range required
  • Sound effects generation alongside TTS
  • Video dubbing and lip-sync translation

Choose OpenAI TTS if:

  • You're already using OpenAI APIs and want minimal integration overhead
  • Good-enough quality for UI text, notifications, or basic narration
  • Simple pricing: $0.015/1K chars with no tiers or voice add-ons
  • HTTP streaming without WebSocket complexity

Choose Cartesia if:

  • Real-time conversational AI where latency is perceptible (< 100ms time-to-first-audio matters)
  • Voice agents, IVR systems, or phone bots
  • WebSocket streaming for lowest possible time-to-first-audio
  • Multi-turn conversation with consistent voice via context_id
  • You're building on top of a voice AI framework (LiveKit, Daily.co, etc.)
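The decision rules above can be sketched as a small helper, checked in priority order — this is purely illustrative of the guidance, not a product recommendation engine:

```typescript
type Provider = "elevenlabs" | "openai" | "cartesia";

interface Requirements {
  realtime: boolean;        // conversational agent, IVR, phone bot
  needsCloning: boolean;    // clone an existing speaker's voice
  qualityCritical: boolean; // audiobooks, narration, dubbing
  onOpenAIStack: boolean;   // already using OpenAI APIs
}

// Encode the "When to Use Each" guidance above, in priority order.
function pickProvider(req: Requirements): Provider {
  if (req.realtime) return "cartesia";                         // latency is the constraint
  if (req.needsCloning || req.qualityCritical) return "elevenlabs";
  if (req.onOpenAIStack) return "openai";                      // minimal integration overhead
  return "openai";                                             // good-enough default
}
```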

Methodology

Data sourced from official ElevenLabs documentation (elevenlabs.io/docs), OpenAI TTS documentation (platform.openai.com/docs/guides/text-to-speech), Cartesia documentation (docs.cartesia.ai), pricing pages as of February 2026, latency benchmarks from the AI voice community, and community discussions from the ElevenLabs Discord and r/MachineLearning.


Related: Deepgram vs OpenAI Whisper vs AssemblyAI for the speech-to-text side of voice AI, or Vercel AI SDK vs OpenAI SDK vs Anthropic SDK for the LLM layer that drives voice agent responses.
