Groq vs Together AI vs Fireworks AI: Fast LLM Inference APIs 2026
TL;DR
OpenAI is expensive and rate-limited. A new tier of inference providers runs open-source models — Llama 3, Mixtral, Qwen, Gemma — with OpenAI-compatible APIs at a fraction of the cost. Groq uses custom LPU (Language Processing Unit) hardware delivering 400-800 tokens/second — the fastest inference available, period. Together AI is the most flexible — 100+ open-source models, fine-tuning API, and custom deployment of your own models. Fireworks AI focuses on production-grade OSS model serving with dedicated deployments and function-calling optimizations. For maximum inference speed: Groq. For open-source model breadth and fine-tuning: Together AI. For production-grade OSS model serving with SLAs: Fireworks AI.
Key Takeaways
- Groq delivers 400-800 tokens/second for Llama 3 — 5-10x faster than OpenAI
- Together AI hosts 100+ models including Llama 3, Qwen 2.5, Mistral, and Flux image models
- All three expose OpenAI-compatible APIs — swap with one `baseURL` change
- Groq's free tier: 14,400 req/day — generous for prototyping
- Together AI fine-tuning — train on your data, serve the fine-tuned model via API
- Fireworks AI `accounts/fireworks/models` namespace — curated, optimized model versions
- Groq latency varies by model — Llama 3.3 70B is fast; larger models require Groq's On-Demand tier
Why Use Alternative Inference Providers?
OpenAI gpt-4o:
- Cost: $2.50 input / $10 output per 1M tokens
- Speed: ~50-100 tokens/sec
- Models: Proprietary only
Groq / Together / Fireworks:
- Cost: $0.05-$0.90 per 1M tokens (80-90% cheaper)
- Speed: 100-800 tokens/sec
- Models: Llama 3, Mistral, Qwen, Gemma, and 100+ open-source
Use cases that make sense:
- High-volume API calls (cost savings at scale)
- Real-time applications (speed matters)
- Needing a specific open-source model
- Avoiding vendor lock-in on proprietary model families
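To make the cost gap concrete, here is a back-of-envelope estimator using the per-1M-token prices quoted above (illustrative figures; verify against the current pricing pages):

```typescript
// Rough monthly cost from per-1M-token pricing. Prices here are the
// assumptions quoted in this article, not live rates.
type Pricing = { inputPerM: number; outputPerM: number };

function monthlyCostUSD(
  p: Pricing,
  inputTokensPerDay: number,
  outputTokensPerDay: number,
  days = 30,
): number {
  const daily =
    (inputTokensPerDay / 1_000_000) * p.inputPerM +
    (outputTokensPerDay / 1_000_000) * p.outputPerM;
  return daily * days;
}

// Example workload: 10M input + 2M output tokens per day
const gpt4o = monthlyCostUSD({ inputPerM: 2.5, outputPerM: 10 }, 10_000_000, 2_000_000);
const llama70bOnGroq = monthlyCostUSD({ inputPerM: 0.59, outputPerM: 0.79 }, 10_000_000, 2_000_000);
console.log(gpt4o, llama70bOnGroq); // 1350 vs ~224, roughly an 83% reduction
```

At that volume the difference is over $1,000/month, which is where the "80-90% cheaper" claim comes from.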
Groq: Custom Hardware, Maximum Speed
Groq's LPUs (Language Processing Units) are purpose-built chips for transformer inference. They achieve throughput that the GPU-based deployments behind OpenAI and Anthropic cannot match.
Installation
npm install groq-sdk
# Or use OpenAI SDK with baseURL override
npm install openai
Basic Completion
import Groq from "groq-sdk";
const client = new Groq({
apiKey: process.env.GROQ_API_KEY,
});
const completion = await client.chat.completions.create({
model: "llama-3.3-70b-versatile",
messages: [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Write a TypeScript function to debounce async calls." },
],
temperature: 0.7,
max_tokens: 1024,
});
console.log(completion.choices[0].message.content);
// Groq returns server-side timing fields in `usage`; derive tokens/sec from them
const { completion_tokens, completion_time } = completion.usage ?? {};
if (completion_tokens && completion_time) {
  console.log("Tokens per second:", Math.round(completion_tokens / completion_time));
}
Streaming (with speed measurement)
import Groq from "groq-sdk";
const client = new Groq();
const startTime = Date.now();
const stream = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: "Explain the CAP theorem in depth." }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
  // Groq attaches usage stats to the final chunk under `x_groq`
  if (chunk.x_groq?.usage) {
    const elapsed = (Date.now() - startTime) / 1000;
    const tps = chunk.x_groq.usage.completion_tokens / elapsed;
    console.log(`\n\nSpeed: ${tps.toFixed(0)} tokens/sec`);
  }
}
Available Models on Groq
// Groq model selection — each has different speed/capability trade-offs
const models = {
// Fastest — best for simple tasks, chatbots
"llama-3.1-8b-instant": { speed: "fastest", context: "128k", cost: "$0.05/$0.08" },
// Balanced — most tasks
"llama-3.3-70b-versatile": { speed: "fast", context: "128k", cost: "$0.59/$0.79" },
// High intelligence — complex reasoning
"llama-3.1-405b-reasoning": { speed: "moderate", context: "16k", cost: "$3/$3" },
// Coding specialist
"deepseek-r1-distill-llama-70b": { speed: "fast", context: "128k", cost: "$0.75/$0.99" },
// Multimodal (vision)
"llama-3.2-11b-vision-preview": { speed: "fast", context: "8k", cost: "$0.18/$0.18" },
};
// Using OpenAI SDK with Groq
import OpenAI from "openai";
const groq = new OpenAI({
apiKey: process.env.GROQ_API_KEY,
baseURL: "https://api.groq.com/openai/v1",
});
const response = await groq.chat.completions.create({
model: "llama-3.3-70b-versatile",
messages: [{ role: "user", content: "Hello!" }],
});
JSON Mode
// Structured output via JSON mode
const result = await client.chat.completions.create({
model: "llama-3.3-70b-versatile",
response_format: { type: "json_object" },
messages: [
{
role: "user",
content: `Extract the key entities from this text and return as JSON:
"Apple announced the M4 chip at WWDC 2024, featuring neural engine improvements."
Return: { entities: [{ name, type, description }] }`,
},
],
});
const data = JSON.parse(result.choices[0].message.content!);
console.log(data.entities);
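JSON mode guarantees syntactically valid JSON, but not that the payload matches the shape you asked for. A small validation wrapper (a sketch; the expected shape comes from the prompt above) avoids runtime surprises:

```typescript
// Defensive parse for the { entities: [...] } shape requested in the prompt
type Entity = { name: string; type: string; description?: string };

function parseEntities(raw: string): Entity[] {
  const data = JSON.parse(raw); // JSON mode ensures valid syntax, not shape
  if (!Array.isArray(data?.entities)) {
    throw new Error("Model returned JSON without an 'entities' array");
  }
  // Keep only entries that actually look like entities
  return data.entities.filter(
    (e: unknown): e is Entity =>
      typeof e === "object" && e !== null &&
      typeof (e as Entity).name === "string" &&
      typeof (e as Entity).type === "string",
  );
}
```

On a malformed payload you can retry the request or fall back to a stricter prompt.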
Together AI: Open-Source Model Breadth
Together AI runs 100+ open-source models and offers a self-serve fine-tuning API, so you can train custom model variants and serve them through the same endpoint.
Installation
npm install together-ai
# Or OpenAI SDK compatible
npm install openai
Chat Completion
import Together from "together-ai";
const client = new Together({
apiKey: process.env.TOGETHER_API_KEY,
});
const response = await client.chat.completions.create({
model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
messages: [
{ role: "user", content: "What are the differences between PostgreSQL and MySQL?" },
],
max_tokens: 512,
temperature: 0.7,
});
console.log(response.choices[0].message.content);
Available Model Categories
// Together AI model families (as of 2026)
const togetherModels = {
// Meta Llama
llama: [
"meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
"meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
"meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
"meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo", // Multimodal
],
// Qwen
qwen: [
"Qwen/Qwen2.5-7B-Instruct-Turbo",
"Qwen/Qwen2.5-72B-Instruct-Turbo",
"Qwen/QwQ-32B-Preview", // Reasoning model
],
// Mistral
mistral: [
"mistralai/Mistral-7B-Instruct-v0.3",
"mistralai/Mixtral-8x22B-Instruct-v0.1",
],
// Code-specialized
code: [
"Qwen/Qwen2.5-Coder-32B-Instruct",
"deepseek-ai/DeepSeek-V3",
],
// Image generation
image: [
"black-forest-labs/FLUX.1-schnell",
"black-forest-labs/FLUX.1-dev",
],
};
Image Generation
// Together AI also handles image generation
const imageResponse = await client.images.create({
model: "black-forest-labs/FLUX.1-schnell",
prompt: "A minimalist logo for a tech startup, clean lines, dark background",
n: 1,
width: 1024,
height: 1024,
steps: 4, // FLUX.1-schnell is fast — works in 4 steps
});
console.log(imageResponse.data[0].url);
Fine-Tuning API
import Together from "together-ai";
import fs from "fs";
const client = new Together();
// Upload training data (JSONL format)
const file = await client.files.upload({
file: fs.createReadStream("training-data.jsonl"),
purpose: "fine-tune",
});
// Create fine-tuning job
const job = await client.fineTuning.create({
training_file: file.id,
model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference", // Base model
n_epochs: 3,
learning_rate: 1e-5,
lora: true, // LoRA for efficient fine-tuning
suffix: "my-domain-model",
});
console.log("Fine-tune job created:", job.id);
// Monitor training progress
const finishedJob = await client.fineTuning.retrieve(job.id);
console.log("Status:", finishedJob.status);
// When "succeeded", use fine-tuned model ID in completions
Using OpenAI SDK with Together
import OpenAI from "openai";
const together = new OpenAI({
apiKey: process.env.TOGETHER_API_KEY,
baseURL: "https://api.together.xyz/v1",
});
// Exactly like OpenAI — just different models and baseURL
const response = await together.chat.completions.create({
model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
messages: [{ role: "user", content: "Explain async/await in JavaScript." }],
stream: true,
});
for await (const chunk of response) {
process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
Fireworks AI: Production OSS Model Serving
Fireworks AI specializes in production-grade open-source model serving with dedicated endpoints, function-calling optimization, and structured output support.
Installation
npm install openai # Fireworks uses OpenAI-compatible API
Setup and Basic Usage
import OpenAI from "openai";
const fireworks = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const response = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
messages: [
{ role: "user", content: "What is the best way to handle errors in TypeScript?" },
],
});
console.log(response.choices[0].message.content);
Function Calling
// Fireworks AI has optimized function calling for production
const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
{
type: "function",
function: {
name: "search_documentation",
description: "Search technical documentation",
parameters: {
type: "object",
properties: {
query: { type: "string", description: "Search query" },
technology: { type: "string", description: "Tech stack (react, node, etc.)" },
},
required: ["query"],
},
},
},
];
const response = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/firefunction-v2", // Optimized for tool use
messages: [{ role: "user", content: "How do I handle CORS in Express?" }],
tools,
tool_choice: "auto",
});
if (response.choices[0].message.tool_calls) {
const toolCall = response.choices[0].message.tool_calls[0];
const args = JSON.parse(toolCall.function.arguments);
console.log("Searching for:", args.query);
}
Structured Output with Pydantic/Zod
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";
const AnalysisSchema = z.object({
sentiment: z.enum(["positive", "negative", "neutral"]),
score: z.number().min(-1).max(1),
keywords: z.array(z.string()),
summary: z.string(),
});
const result = await fireworks.beta.chat.completions.parse({
model: "accounts/fireworks/models/llama-v3p1-8b-instruct",
messages: [
{
role: "user",
content: "Analyze the sentiment of: 'This product is amazing but shipping was slow.'",
},
],
response_format: zodResponseFormat(AnalysisSchema, "analysis"),
});
const analysis = result.choices[0].message.parsed;
// Fully typed: { sentiment: "positive", score: 0.6, keywords: [...], summary: "..." }
Dedicated Deployments (Production)
// Fireworks dedicated deployments = reserved capacity for consistent latency SLAs
// Configured via Fireworks dashboard or API
// Once deployed, use the deployment endpoint:
const response = await fireworks.chat.completions.create({
model: "accounts/YOUR_ACCOUNT/deployments/YOUR_DEPLOYMENT",
messages: [{ role: "user", content: "Process this user request." }],
});
Feature Comparison
| Feature | Groq | Together AI | Fireworks AI |
|---|---|---|---|
| Max throughput | ✅ 400-800 tok/s | ~100-200 tok/s | ~100-300 tok/s |
| OpenAI-compatible API | ✅ | ✅ | ✅ |
| Model count | ~20 curated | ✅ 100+ | ~50 curated |
| Fine-tuning | ❌ | ✅ LoRA fine-tuning | ✅ Limited |
| Image generation | ❌ | ✅ FLUX.1 | ❌ |
| Function calling | ✅ | ✅ | ✅ Optimized |
| Structured output | ✅ JSON mode | ✅ | ✅ Zod/Pydantic |
| Vision models | ✅ Llama 3.2 Vision | ✅ | ✅ |
| Dedicated deployments | ❌ | ❌ | ✅ |
| Free tier | ✅ 14,400 req/day | ✅ $1 credit | ✅ $1 credit |
| Pricing (70B model) | $0.59/$0.79/1M | $0.88/$0.88/1M | $0.90/$0.90/1M |
| Rate limits | Limited (free tier) | Higher | Higher |
When to Use Each
Choose Groq if:
- Latency is the primary concern — real-time chat, voice assistants, live feedback
- You're using Llama, Mixtral, or Gemma models and need maximum throughput
- High volume on a budget (cost per token is competitive at scale)
- Prototyping with the generous free tier (14,400 requests/day)
Choose Together AI if:
- You need a model Groq doesn't offer (Qwen, DeepSeek, FLUX image generation)
- Fine-tuning on your domain data is required
- You want to explore 100+ models to find the best fit for your use case
- Image generation alongside text completion in one provider
Choose Fireworks AI if:
- Production-grade SLAs with dedicated deployment capacity matter
- Function calling and structured outputs are core to your use case
- You need consistent latency guarantees (not shared capacity)
- Your use case involves complex agent workflows with tool use
Methodology
Data sourced from official documentation and pricing pages for Groq, Together AI, and Fireworks AI (as of February 2026), throughput benchmarks from ArtificialAnalysis.ai (independent LLM benchmarking service), and community reports from the AI Engineer Discord and Twitter/X. Model availability and pricing verified directly against provider pricing pages.
Related: Vercel AI SDK vs OpenAI SDK vs Anthropic SDK for SDK choice when integrating these providers, or Portkey vs LiteLLM vs OpenRouter for routing requests across multiple providers automatically.