Groq vs Together AI vs Fireworks AI: Fast LLM Inference APIs 2026
TL;DR
OpenAI is expensive and rate-limited. A new tier of inference providers runs open-source models — Llama 3, Mixtral, Qwen, Gemma — with OpenAI-compatible APIs at a fraction of the cost. Groq uses custom LPU (Language Processing Unit) hardware delivering 400-800 tokens/second — the fastest inference available, period. Together AI is the most flexible — 100+ open-source models, fine-tuning API, and custom deployment of your own models. Fireworks AI focuses on production-grade OSS model serving with dedicated deployments and function-calling optimizations. For maximum inference speed: Groq. For open-source model breadth and fine-tuning: Together AI. For production-grade OSS model serving with SLAs: Fireworks AI.
Key Takeaways
- Groq delivers 400-800 tokens/second for Llama 3 — 5-10x faster than OpenAI
- Together AI hosts 100+ models including Llama 3, Qwen 2.5, Mistral, and Flux image models
- All three expose OpenAI-compatible APIs — swap providers with a single baseURL change
- Groq's free tier: 14,400 req/day — generous for prototyping
- Together AI fine-tuning — train on your data, serve the fine-tuned model via API
- Fireworks AI's accounts/fireworks/models namespace — curated, optimized model versions
- Groq latency varies by model — Llama 3.3 70B is fast; larger models require Groq's On-Demand tier
Why Use Alternative Inference Providers?
OpenAI gpt-4o:
- Cost: $2.50 input / $10 output per 1M tokens
- Speed: ~50-100 tokens/sec
- Models: Proprietary only
Groq / Together / Fireworks:
- Cost: $0.05-$0.90 per 1M tokens (80-90% cheaper)
- Speed: 100-800 tokens/sec
- Models: Llama 3, Mistral, Qwen, Gemma, and 100+ open-source
Use cases that make sense:
- High-volume API calls (cost savings at scale)
- Real-time applications (speed matters)
- Needing a specific open-source model
- Avoiding vendor lock-in on proprietary model families
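To make the cost gap concrete, here is a back-of-envelope calculation using the per-1M-token prices quoted above. The 10M-input / 2M-output monthly volume is an assumed workload, not a benchmark:

```typescript
// Monthly cost at an assumed volume of 10M input + 2M output tokens,
// using the per-1M-token prices quoted above.
const prices = {
  "gpt-4o": { input: 2.5, output: 10.0 },
  "llama-3.3-70b (Groq)": { input: 0.59, output: 0.79 },
};

function monthlyCostUSD(
  p: { input: number; output: number },
  inputMTok: number,
  outputMTok: number
): number {
  return p.input * inputMTok + p.output * outputMTok;
}

const openaiCost = monthlyCostUSD(prices["gpt-4o"], 10, 2); // $45.00
const groqCost = monthlyCostUSD(prices["llama-3.3-70b (Groq)"], 10, 2); // ~$7.48
const savings = (1 - groqCost / openaiCost) * 100; // ~83% cheaper
console.log({ openaiCost, groqCost, savings: savings.toFixed(0) + "%" });
```

At this volume the 70B Llama model on Groq lands squarely in the "80-90% cheaper" range cited above; the ratio shifts with your input/output mix.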
Groq: Custom Hardware, Maximum Speed
Groq's LPUs (Language Processing Units) are purpose-built chips for transformer inference. They achieve throughput that the commodity-GPU cloud deployments of OpenAI and Anthropic cannot match.
Installation
npm install groq-sdk
# Or use OpenAI SDK with baseURL override
npm install openai
Basic Completion
import Groq from "groq-sdk";
const client = new Groq({
apiKey: process.env.GROQ_API_KEY,
});
const completion = await client.chat.completions.create({
model: "llama-3.3-70b-versatile",
messages: [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Write a TypeScript function to debounce async calls." },
],
temperature: 0.7,
max_tokens: 1024,
});
console.log(completion.choices[0].message.content);
// Groq's usage object includes timing fields; derive tokens/sec from them
const usage = completion.usage as any;
if (usage?.completion_tokens && usage?.completion_time) {
  console.log("Tokens per second:", Math.round(usage.completion_tokens / usage.completion_time));
}
Streaming (with speed measurement)
import Groq from "groq-sdk";
const client = new Groq();
const startTime = Date.now();
let tokenCount = 0;
const stream = await client.chat.completions.create({
model: "llama-3.3-70b-versatile",
messages: [{ role: "user", content: "Explain the CAP theorem in depth." }],
stream: true,
});
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content ?? "";
if (delta) {
process.stdout.write(delta);
tokenCount++;
}
if (chunk.x_groq?.usage) {
const elapsed = (Date.now() - startTime) / 1000;
const tps = chunk.x_groq.usage.completion_tokens / elapsed;
console.log(`\n\nSpeed: ${tps.toFixed(0)} tokens/sec`);
}
}
Available Models on Groq
// Groq model selection — each has different speed/capability trade-offs
const models = {
// Fastest — best for simple tasks, chatbots
"llama-3.1-8b-instant": { speed: "fastest", context: "128k", cost: "$0.05/$0.08" },
// Balanced — most tasks
"llama-3.3-70b-versatile": { speed: "fast", context: "128k", cost: "$0.59/$0.79" },
// High intelligence — complex reasoning
"llama-3.1-405b-reasoning": { speed: "moderate", context: "16k", cost: "$3/$3" },
// Coding specialist
"deepseek-r1-distill-llama-70b": { speed: "fast", context: "128k", cost: "$0.75/$0.99" },
// Multimodal (vision)
"llama-3.2-11b-vision-preview": { speed: "fast", context: "8k", cost: "$0.18/$0.18" },
};
// Using OpenAI SDK with Groq
import OpenAI from "openai";
const groq = new OpenAI({
apiKey: process.env.GROQ_API_KEY,
baseURL: "https://api.groq.com/openai/v1",
});
const response = await groq.chat.completions.create({
model: "llama-3.3-70b-versatile",
messages: [{ role: "user", content: "Hello!" }],
});
JSON Mode
// Structured output via JSON mode
const result = await client.chat.completions.create({
model: "llama-3.3-70b-versatile",
response_format: { type: "json_object" },
messages: [
{
role: "user",
content: `Extract the key entities from this text and return as JSON:
"Apple announced the M4 chip in May 2024, featuring neural engine improvements."
Return: { entities: [{ name, type, description }] }`,
},
],
});
const data = JSON.parse(result.choices[0].message.content!);
console.log(data.entities);
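JSON mode guarantees syntactically valid JSON, but not that the payload matches the shape you asked for, so validate before trusting it. The `safeParseEntities` helper below is an illustrative sketch, not part of any SDK:

```typescript
// Validate JSON-mode output before use: JSON.parse catches syntax errors,
// the shape checks catch schema drift. Retry the request on failure.
interface Entity {
  name: string;
  type: string;
  description: string;
}

function safeParseEntities(raw: string): Entity[] {
  const parsed = JSON.parse(raw); // throws on invalid JSON
  if (!Array.isArray(parsed.entities)) {
    throw new Error("Response missing 'entities' array; retry or fall back");
  }
  // Drop entries that don't match the expected shape
  return parsed.entities.filter(
    (e: any) => typeof e.name === "string" && typeof e.type === "string"
  );
}
```

Wrap the request in a retry loop that re-prompts on a thrown error; models occasionally return valid JSON with the wrong top-level shape.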
Together AI: Open-Source Model Breadth
Together AI runs 100+ open-source models and provides a self-serve fine-tuning API that lets you train custom model variants and deploy them for inference.
Installation
npm install together-ai
# Or OpenAI SDK compatible
npm install openai
Chat Completion
import Together from "together-ai";
const client = new Together({
apiKey: process.env.TOGETHER_API_KEY,
});
const response = await client.chat.completions.create({
model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
messages: [
{ role: "user", content: "What are the differences between PostgreSQL and MySQL?" },
],
max_tokens: 512,
temperature: 0.7,
});
console.log(response.choices[0].message.content);
Available Model Categories
// Together AI model families (as of 2026)
const togetherModels = {
// Meta Llama
llama: [
"meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
"meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
"meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
"meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo", // Multimodal
],
// Qwen
qwen: [
"Qwen/Qwen2.5-7B-Instruct-Turbo",
"Qwen/Qwen2.5-72B-Instruct-Turbo",
"Qwen/QwQ-32B-Preview", // Reasoning model
],
// Mistral
mistral: [
"mistralai/Mistral-7B-Instruct-v0.3",
"mistralai/Mixtral-8x22B-Instruct-v0.1",
],
// Code-specialized
code: [
"Qwen/Qwen2.5-Coder-32B-Instruct",
"deepseek-ai/DeepSeek-V3",
],
// Image generation
image: [
"black-forest-labs/FLUX.1-schnell",
"black-forest-labs/FLUX.1-dev",
],
};
Image Generation
// Together AI also handles image generation
const imageResponse = await client.images.create({
model: "black-forest-labs/FLUX.1-schnell",
prompt: "A minimalist logo for a tech startup, clean lines, dark background",
n: 1,
width: 1024,
height: 1024,
steps: 4, // FLUX.1-schnell is fast — works in 4 steps
});
console.log(imageResponse.data[0].url);
Fine-Tuning API
import Together from "together-ai";
import fs from "fs";
const client = new Together();
// Upload training data (JSONL format)
const file = await client.files.upload({
file: fs.createReadStream("training-data.jsonl"),
purpose: "fine-tune",
});
// Create fine-tuning job
const job = await client.fineTuning.create({
training_file: file.id,
model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference", // Base model
n_epochs: 3,
learning_rate: 1e-5,
lora: true, // LoRA for efficient fine-tuning
suffix: "my-domain-model",
});
console.log("Fine-tune job created:", job.id);
// Monitor training progress
const finishedJob = await client.fineTuning.retrieve(job.id);
console.log("Status:", finishedJob.status);
// When "succeeded", use fine-tuned model ID in completions
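The training file above is JSONL: one JSON object per line, each a complete chat example. The field names below follow the common chat-format convention; verify the exact schema your base model expects against Together's fine-tuning docs:

```typescript
import { writeFileSync } from "fs";

// One training example per line, as a messages array in the common chat format.
// The example content is illustrative.
const examples = [
  {
    messages: [
      { role: "user", content: "What is our refund window?" },
      { role: "assistant", content: "Refunds are accepted within 30 days of purchase." },
    ],
  },
  {
    messages: [
      { role: "user", content: "Do you ship internationally?" },
      { role: "assistant", content: "Yes, we ship to over 40 countries." },
    ],
  },
];

// JSONL = one JSON.stringify'd object per line, no trailing commas or wrapping array
const jsonl = examples.map((e) => JSON.stringify(e)).join("\n");
writeFileSync("training-data.jsonl", jsonl);
```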
Using OpenAI SDK with Together
import OpenAI from "openai";
const together = new OpenAI({
apiKey: process.env.TOGETHER_API_KEY,
baseURL: "https://api.together.xyz/v1",
});
// Exactly like OpenAI — just different models and baseURL
const response = await together.chat.completions.create({
model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
messages: [{ role: "user", content: "Explain async/await in JavaScript." }],
stream: true,
});
for await (const chunk of response) {
process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
Fireworks AI: Production OSS Model Serving
Fireworks AI specializes in production-grade open-source model serving with dedicated endpoints, function-calling optimization, and structured output support.
Installation
npm install openai # Fireworks uses OpenAI-compatible API
Setup and Basic Usage
import OpenAI from "openai";
const fireworks = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const response = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
messages: [
{ role: "user", content: "What is the best way to handle errors in TypeScript?" },
],
});
console.log(response.choices[0].message.content);
Function Calling
// Fireworks AI has optimized function calling for production
const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
{
type: "function",
function: {
name: "search_documentation",
description: "Search technical documentation",
parameters: {
type: "object",
properties: {
query: { type: "string", description: "Search query" },
technology: { type: "string", description: "Tech stack (react, node, etc.)" },
},
required: ["query"],
},
},
},
];
const response = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/firefunction-v2", // Optimized for tool use
messages: [{ role: "user", content: "How do I handle CORS in Express?" }],
tools,
tool_choice: "auto",
});
if (response.choices[0].message.tool_calls) {
const toolCall = response.choices[0].message.tool_calls[0];
const args = JSON.parse(toolCall.function.arguments);
console.log("Searching for:", args.query);
}
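After the model returns a tool_calls entry, your code executes the tool locally and feeds the result back in a follow-up request as a tool-role message. A minimal dispatch-table sketch; the body of search_documentation is a hypothetical stand-in:

```typescript
// Map tool names to local handlers so new tools are one entry, not new branching.
type ToolHandler = (args: Record<string, unknown>) => Promise<string>;

const handlers: Record<string, ToolHandler> = {
  search_documentation: async (args) => {
    // Hypothetical: query your docs index here and return a text summary
    return `Results for "${args.query}"`;
  },
};

async function dispatch(name: string, rawArgs: string): Promise<string> {
  const handler = handlers[name];
  if (!handler) throw new Error(`Unknown tool: ${name}`);
  return handler(JSON.parse(rawArgs));
}
```

The returned string goes back to the model as `{ role: "tool", tool_call_id, content }` so it can compose a final answer.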
Structured Output with Pydantic/Zod
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";
const AnalysisSchema = z.object({
sentiment: z.enum(["positive", "negative", "neutral"]),
score: z.number().min(-1).max(1),
keywords: z.array(z.string()),
summary: z.string(),
});
const result = await fireworks.beta.chat.completions.parse({
model: "accounts/fireworks/models/llama-v3p1-8b-instruct",
messages: [
{
role: "user",
content: "Analyze the sentiment of: 'This product is amazing but shipping was slow.'",
},
],
response_format: zodResponseFormat(AnalysisSchema, "analysis"),
});
const analysis = result.choices[0].message.parsed;
// Fully typed: { sentiment: "positive", score: 0.6, keywords: [...], summary: "..." }
Dedicated Deployments (Production)
// Fireworks dedicated deployments = reserved capacity for consistent latency SLAs
// Configured via Fireworks dashboard or API
// Once deployed, use the deployment endpoint:
const response = await fireworks.chat.completions.create({
model: "accounts/YOUR_ACCOUNT/deployments/YOUR_DEPLOYMENT",
messages: [{ role: "user", content: "Process this user request." }],
});
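Even with reserved capacity, it is prudent to fall back to the shared serverless model if the dedicated endpoint errors. A provider-agnostic sketch; the call signature and model IDs are placeholders:

```typescript
// Try the dedicated deployment first; on failure, retry once against the
// shared serverless model as a best-effort fallback.
async function completeWithFallback(
  call: (model: string) => Promise<string>,
  dedicatedModel: string,
  sharedModel: string
): Promise<string> {
  try {
    return await call(dedicatedModel);
  } catch {
    return await call(sharedModel);
  }
}
```

In practice `call` would wrap `fireworks.chat.completions.create` with your messages fixed and only the model ID varying.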
Feature Comparison
| Feature | Groq | Together AI | Fireworks AI |
|---|---|---|---|
| Max throughput | ✅ 400-800 tok/s | ~100-200 tok/s | ~100-300 tok/s |
| OpenAI-compatible API | ✅ | ✅ | ✅ |
| Model count | ~20 curated | ✅ 100+ | ~50 curated |
| Fine-tuning | ❌ | ✅ LoRA fine-tuning | ✅ Limited |
| Image generation | ❌ | ✅ FLUX.1 | ❌ |
| Function calling | ✅ | ✅ | ✅ Optimized |
| Structured output | ✅ JSON mode | ✅ | ✅ Zod/Pydantic |
| Vision models | ✅ Llama 3.2 Vision | ✅ | ✅ |
| Dedicated deployments | ❌ | ❌ | ✅ |
| Free tier | ✅ 14,400 req/day | ✅ $1 credit | ✅ $1 credit |
| Pricing (70B model) | $0.59/$0.79/1M | $0.88/$0.88/1M | $0.90/$0.90/1M |
| Rate limits | Limited (free tier) | Higher | Higher |
When to Use Each
Choose Groq if:
- Latency is the primary concern — real-time chat, voice assistants, live feedback
- You're using Llama, Mixtral, or Gemma models and need maximum throughput
- High volume on a budget (cost per token is competitive at scale)
- Prototyping with the generous free tier (14,400 requests/day)
Choose Together AI if:
- You need a model Groq doesn't offer (Qwen, DeepSeek, FLUX image generation)
- Fine-tuning on your domain data is required
- You want to explore 100+ models to find the best fit for your use case
- Image generation alongside text completion in one provider
Choose Fireworks AI if:
- Production-grade SLAs with dedicated deployment capacity matter
- Function calling and structured outputs are core to your use case
- You need consistent latency guarantees (not shared capacity)
- Your use case involves complex agent workflows with tool use
Rate Limits and Production Reliability at Scale
Groq's generous free tier (14,400 requests/day) is excellent for prototyping but masks significant rate-limiting constraints in production. The free tier is throttled to 30 requests per minute on the fastest models and enforces token limits per minute, not just per request. At production traffic levels these limits become a bottleneck — a user-facing application serving many concurrent users will hit the per-minute token caps quickly. Groq's paid On-Demand tier removes the per-minute limits but does not guarantee throughput during periods of high platform demand. Together AI and Fireworks AI offer more predictable throughput at scale because they run on GPU clusters with conventional batching rather than Groq's unique LPU architecture. For production applications with SLA requirements, Fireworks AI's dedicated deployments — reserved GPU capacity for your account — provide the consistent sub-second latency guarantees that shared capacity cannot offer. Build retry logic with exponential backoff into any LLM integration regardless of provider, since all three services experience occasional elevated latency or throttling during peak periods.
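The retry-with-exponential-backoff advice above can be sketched provider-agnostically. The status-code checks and delay parameters are illustrative defaults, not any SDK's built-ins:

```typescript
// Retry a request on 429 (rate limit) or 5xx errors, doubling the delay each
// attempt with random jitter to avoid thundering-herd retries.
async function withBackoff<T>(
  callFn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await callFn();
    } catch (err: any) {
      const retriable =
        err?.status === 429 || (err?.status >= 500 && err?.status < 600);
      if (!retriable || attempt >= maxRetries) throw err;
      // 500ms, 1s, 2s, 4s... plus up to 250ms of jitter
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage: `withBackoff(() => client.chat.completions.create({...}))` wraps any of the three providers' calls, since their SDKs surface HTTP status codes on thrown errors.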
OpenAI SDK Compatibility and Provider Switching
All three providers expose OpenAI-compatible APIs, which means a single integration with the OpenAI SDK can target any of them by changing only the baseURL and apiKey. This portability is practically valuable: you can start with Groq for its free tier, switch to Together AI when you need a model Groq doesn't offer, or route to Fireworks AI for production reliability, all without changing your application code. The baseURL swap pattern works cleanly for chat completions, streaming, and embeddings (where supported). Function-calling compatibility varies — Groq and Fireworks support OpenAI's tool_calls format, and Together AI uses the same format for most models, but some community models have inconsistent function-calling support. For structured output via response_format: { type: "json_object" }, all three providers support it on Llama 3 and Mistral models. Teams building multi-provider routing (using a gateway like LiteLLM or Portkey to route different request types to the cheapest or fastest provider) benefit from this standardization — a single interface can route simple tasks to Groq's fast, cheap models and complex reasoning to a larger model on Together AI.
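The routing idea reduces to a small config map over the three baseURLs from the sections above. The heuristic in routeRequest is purely illustrative; real routers weigh cost, latency, and model capability:

```typescript
// One endpoint entry per provider; the OpenAI SDK call site stays identical.
type Provider = "groq" | "together" | "fireworks";

const ENDPOINTS: Record<Provider, { baseURL: string; envKey: string }> = {
  groq: { baseURL: "https://api.groq.com/openai/v1", envKey: "GROQ_API_KEY" },
  together: { baseURL: "https://api.together.xyz/v1", envKey: "TOGETHER_API_KEY" },
  fireworks: { baseURL: "https://api.fireworks.ai/inference/v1", envKey: "FIREWORKS_API_KEY" },
};

// Illustrative heuristic: fine-tuned models live on Together; short prompts
// go to Groq for speed; everything else to Together for model breadth.
function routeRequest(promptTokens: number, needsFineTunedModel: boolean): Provider {
  if (needsFineTunedModel) return "together";
  return promptTokens < 2_000 ? "groq" : "together";
}
```

Feed the chosen entry into `new OpenAI({ baseURL, apiKey: process.env[envKey] })` and the rest of the call site is identical across providers.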
Model Freshness and Ecosystem Coverage
The open-source model ecosystem moves faster than proprietary models — Meta releases new Llama versions, Qwen releases major model families, and community fine-tunes appear monthly. Together AI maintains the broadest catalog with 100+ models, including niche fine-tunes (code-specialized Llama variants, long-context extensions, multilingual models) that Groq and Fireworks don't host. Groq's catalog is deliberately curated — they host perhaps 20 models at any time, selecting models where their LPU hardware delivers the most dramatic speed advantage. Fireworks focuses on production-proven models with strong function calling and structured output capabilities rather than maximizing model count. For research or applications that need to compare multiple model families on the same task (model evaluation pipelines, A/B testing different model sizes), Together AI's breadth is unmatched. For applications where model freshness matters — using the latest Llama or Qwen release within days of publication — Together AI typically onboards new models faster than the other providers.
Security and Data Privacy Considerations
Enterprise adopters of LLM inference APIs need to understand how input and output data is handled. All three providers process request and response data on their servers to generate completions — your prompts and user messages are sent to and processed by their infrastructure. None of the three providers offer on-premises deployment for their managed APIs. For sensitive data (PII, PHI, financial information), you should either redact sensitive fields before sending to the LLM, use prompt-level anonymization, or evaluate on-premises alternatives like self-hosted vLLM or Ollama for data-sensitive workloads. Together AI, Groq, and Fireworks all state they do not use your API request data for model training without explicit opt-in, but review their current Data Processing Agreements for your compliance requirements — these terms change and should be verified against current documentation rather than assumed from this article. For HIPAA-covered applications, obtain a Business Associate Agreement from your provider before processing any PHI through their APIs.
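As a minimal example of the prompt-level redaction mentioned above: strip obvious PII patterns before the prompt leaves your infrastructure. These regexes are illustrative and far from compliance-grade; real deployments need a proper DLP pipeline:

```typescript
// Replace obvious PII patterns with placeholder tokens before sending a
// prompt to any third-party inference API. Patterns are illustrative only.
const REDACTIONS: Array<[RegExp, string]> = [
  [/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, "[EMAIL]"],
  [/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"], // US SSN format
  [/\b(?:\d[ -]?){13,16}\b/g, "[CARD]"], // likely payment card numbers
];

function redact(prompt: string): string {
  return REDACTIONS.reduce(
    (text, [pattern, label]) => text.replace(pattern, label),
    prompt
  );
}
```

Run `redact()` on user input before it enters your messages array, and keep the original only in systems already approved for that data class.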
Methodology
Data sourced from official documentation and pricing pages for Groq, Together AI, and Fireworks AI (as of February 2026), throughput benchmarks from ArtificialAnalysis.ai (independent LLM benchmarking service), and community reports from the AI Engineer Discord and Twitter/X. Model availability and pricing verified directly against provider pricing pages.
Related: Vercel AI SDK vs OpenAI SDK vs Anthropic SDK for SDK choice when integrating these providers, or Portkey vs LiteLLM vs OpenRouter for routing requests across multiple providers automatically.
See also: Langfuse vs LangSmith vs Helicone: LLM Observability 2026 and Mastra vs LangChain.js vs Google GenKit