# Langfuse vs LangSmith vs Helicone: LLM Observability Platforms 2026
## TL;DR
LangSmith is LangChain's native observability layer — tightest integration with LangChain.js, but you're locked into the LangChain ecosystem. Langfuse is the open-source champion — self-hostable, framework-agnostic, and offers the most comprehensive evaluation and prompt management features for teams that care about data ownership. Helicone is the proxy-first option — drop one line of code, change your OpenAI base URL, and get immediate observability without SDK instrumentation. If you use LangChain, start with LangSmith. If you need open-source/self-hosted, go Langfuse. If you want zero-code integration, Helicone.
## Key Takeaways
- Langfuse GitHub stars: ~12k — the fastest-growing open-source LLM observability tool (Feb 2026)
- LangSmith is the only option with native LangGraph trace visualization for complex agent runs
- Helicone integration takes ~30 seconds — literally one URL change: `baseURL: "https://oai.helicone.ai/v1"`
- All three track token costs and latency — the differentiators are evaluation, prompt management, and self-hosting
- Langfuse is the only fully self-hostable option (Docker Compose available)
- LangSmith's free tier is capped at 5k traces/month — Langfuse and Helicone offer more generous free tiers
- Prompt versioning and A/B testing — Langfuse and LangSmith both have it; Helicone does not
## Why LLM Observability Matters
When you move an LLM application from prototype to production, you immediately hit problems that don't exist in traditional software:
- Why did the model give a bad answer? You need the exact prompt that was sent
- Which prompt version is performing better? You need A/B testing with LLM-specific metrics
- What's my actual token cost? Model pricing changes frequently and usage spikes unexpectedly
- Are my evals passing? You need automated evaluation pipelines, not manual spot-checking
LLM observability platforms solve all of these. They sit between your app and the LLM API, capturing traces with the full prompt/response/token data.
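To make the trace concept concrete, here is a simplified sketch of the kind of record these platforms capture for each LLM call, with a cost estimate derived from token counts. The field names, interface, and per-token prices are illustrative assumptions for this sketch, not any platform's actual schema or current pricing.

```typescript
// Illustrative only: a minimal shape for the data a single trace captures.
interface LLMTrace {
  traceId: string;
  model: string;
  prompt: string;            // the exact prompt sent: answers "why was this answer bad?"
  response: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  costUSD: number;
}

// Rough cost estimate from token counts, using assumed prices in USD per 1M tokens
function estimateCostUSD(
  promptTokens: number,
  completionTokens: number,
  inputPricePerM: number,
  outputPricePerM: number
): number {
  return (
    (promptTokens / 1_000_000) * inputPricePerM +
    (completionTokens / 1_000_000) * outputPricePerM
  );
}

const trace: LLMTrace = {
  traceId: "tr_001",
  model: "gpt-4o",
  prompt: "What is the capital of France?",
  response: "Paris.",
  promptTokens: 12,
  completionTokens: 3,
  latencyMs: 640,
  costUSD: estimateCostUSD(12, 3, 2.5, 10), // assumed prices, USD per 1M tokens
};
```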
## Helicone: Zero-Effort Integration
Helicone is a proxy service — you point your OpenAI/Anthropic/Gemini calls at Helicone's servers, and it records everything before forwarding to the actual model. No SDK instrumentation required.
### 30-Second Integration

```typescript
import OpenAI from "openai";

// Before Helicone
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// After Helicone — literally one change
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// Everything else stays the same — all calls are automatically traced
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello world" }],
});
```
### Anthropic Integration

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
  baseURL: "https://anthropic.helicone.ai",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});
```
### Custom Properties and Session Tracking

```typescript
// Add metadata to traces via headers
const response = await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
  },
  {
    headers: {
      // Group traces by session
      "Helicone-Session-Id": `session-${userId}-${Date.now()}`,
      "Helicone-Session-Name": "customer-support-chat",
      // Custom properties for filtering in dashboard
      "Helicone-Property-UserId": userId,
      "Helicone-Property-Plan": userPlan,
      "Helicone-Property-Feature": "chat",
      // User tracking
      "Helicone-User-Id": userId,
    },
  }
);
```
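If many call sites repeat this header set, the boilerplate can be centralized in a small helper. `heliconeHeaders` is a hypothetical function written for this sketch, not part of any Helicone SDK:

```typescript
// Hypothetical helper: builds the Helicone metadata header set shown above
// so every call site stays consistent.
function heliconeHeaders(opts: {
  userId: string;
  plan: string;
  feature: string;
  sessionName: string;
}): Record<string, string> {
  return {
    "Helicone-Session-Id": `session-${opts.userId}-${Date.now()}`,
    "Helicone-Session-Name": opts.sessionName,
    "Helicone-Property-UserId": opts.userId,
    "Helicone-Property-Plan": opts.plan,
    "Helicone-Property-Feature": opts.feature,
    "Helicone-User-Id": opts.userId,
  };
}

// Usage: pass the result as the per-request headers object
const headers = heliconeHeaders({
  userId: "user_42",
  plan: "pro",
  feature: "chat",
  sessionName: "customer-support-chat",
});
```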
### Helicone Caching (Reduce Costs)

```typescript
// Cache identical prompts — useful for deterministic queries
const response = await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [{ role: "user", content: "What is the capital of France?" }],
    temperature: 0, // Must be 0 for caching
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Bucket-Max-Size": "3",
    },
  }
);
// Second call with same prompt returns cached result — 0 tokens, instant
```
### Rate Limiting via Helicone

```typescript
// Policy-based rate limiting per user
const response = await client.chat.completions.create(
  { model: "gpt-4o", messages: [{ role: "user", content: prompt }] },
  {
    headers: {
      // 100 requests per 86400-second window (one day), segmented per user
      "Helicone-RateLimit-Policy": "100;w=86400;s=user",
      "Helicone-User-Id": userId,
    },
  }
);
```
## Langfuse: Open-Source Observability
Langfuse is a comprehensive open-source platform covering traces, evaluations, prompt management, datasets, and analytics. It's the only option you can fully self-host, making it the default choice for compliance-sensitive applications.
### Node.js SDK Integration

```typescript
import Langfuse from "langfuse";

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  secretKey: process.env.LANGFUSE_SECRET_KEY!,
  baseUrl: "https://cloud.langfuse.com", // Or your self-hosted URL
});

// Create a trace
const trace = langfuse.trace({
  name: "customer-support-response",
  userId: "user_123",
  sessionId: "session_456",
  metadata: { plan: "pro", channel: "chat" },
  tags: ["production", "v2"],
});

// Span for the LLM generation
const span = trace.span({
  name: "generate-response",
  input: { userMessage: prompt, context: retrievedDocs },
});

const generation = span.generation({
  name: "gpt4o-call",
  model: "gpt-4o",
  input: messages,
  modelParameters: { temperature: 0.7, maxTokens: 500 },
});

// Make the actual LLM call
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  temperature: 0.7,
  max_tokens: 500,
});

// Record the output
generation.end({
  output: response.choices[0].message,
  usage: {
    promptTokens: response.usage?.prompt_tokens,
    completionTokens: response.usage?.completion_tokens,
    totalCost: calculateCost(response.usage!),
  },
});
span.end({ output: response.choices[0].message.content });
```
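The `calculateCost` helper is not defined in the snippet above. One possible sketch, using made-up per-token prices (not current OpenAI pricing; keep a real price table up to date in production), might look like this:

```typescript
// One possible implementation of the calculateCost helper.
// Prices are assumptions for illustration, in USD per 1M tokens.
type Usage = { prompt_tokens?: number; completion_tokens?: number };

const ASSUMED_PRICES_PER_M: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10 },
};

function calculateCost(usage: Usage, model = "gpt-4o"): number {
  const price = ASSUMED_PRICES_PER_M[model];
  if (!price) return 0; // unknown model: report no cost rather than guess
  const inputM = (usage.prompt_tokens ?? 0) / 1_000_000;
  const outputM = (usage.completion_tokens ?? 0) / 1_000_000;
  return inputM * price.input + outputM * price.output;
}
```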
### OpenAI SDK Drop-In Wrapper

```typescript
import { observeOpenAI } from "langfuse/openai";
import OpenAI from "openai";

// Automatic instrumentation — similar to Helicone's proxy, but via the SDK
const openai = observeOpenAI(new OpenAI(), {
  clientInitParams: {
    publicKey: process.env.LANGFUSE_PUBLIC_KEY,
    secretKey: process.env.LANGFUSE_SECRET_KEY,
  },
  // Optional: link traces to a managed prompt fetched from Langfuse
  // langfusePrompt: await langfuse.getPrompt("my-prompt-template"),
});

// All calls are automatically traced
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
});
```
### Prompt Management

```typescript
// Versioned prompts — store in Langfuse dashboard, fetch at runtime
const promptTemplate = await langfuse.getPrompt("customer-support-v2");

const compiledPrompt = promptTemplate.compile({
  userQuery: userInput,
  productName: "PkgPulse",
  context: retrievedContext,
});

// The prompt version is automatically tracked in traces.
// You can compare performance across prompt versions in the dashboard.
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: compiledPrompt.messages,
});
```
### Evaluation with Langfuse

```typescript
// Add scores to traces — manual or automated
trace.score({
  name: "user-satisfaction",
  value: 1, // 1 = positive feedback
  comment: "User clicked 'helpful' button",
});

// LLM-as-judge evaluation — attach a score to an existing trace by ID
langfuse.score({
  traceId: trace.id,
  name: "response-quality",
  value: 0.85,
  dataType: "NUMERIC",
  comment: "Automated eval: factual accuracy score",
});

// Batch evaluation against a dataset
const dataset = await langfuse.getDataset("customer-questions-500");
for (const item of dataset.items) {
  const trace = langfuse.trace({ name: "eval-run" });
  const response = await generateResponse(item.input);
  // Score each output
  trace.score({
    name: "correctness",
    value: compareWithExpected(response, item.expectedOutput),
  });
}
```
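The `compareWithExpected` scorer above is left undefined. A minimal stand-in is a normalized exact-match scorer; real evaluation pipelines typically use semantic similarity or an LLM judge instead:

```typescript
// Minimal stand-in scorer: normalized exact match, returning 1 or 0.
// Trims whitespace, lowercases, and collapses internal whitespace before comparing.
function compareWithExpected(actual: string, expected: string): number {
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, " ");
  return norm(actual) === norm(expected) ? 1 : 0;
}
```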
### Self-Hosting with Docker Compose

```yaml
# docker-compose.yml for self-hosted Langfuse
services:
  langfuse-worker:
    image: langfuse/langfuse-worker:latest
    depends_on: [postgres, redis]
    environment:
      DATABASE_URL: "postgresql://postgres:password@postgres:5432/langfuse"
      REDIS_CONNECTION_STRING: "redis://redis:6379"
      LANGFUSE_TELEMETRY_ENABLED: "false"
  langfuse-web:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    depends_on: [postgres, redis, langfuse-worker]
    environment:
      DATABASE_URL: "postgresql://postgres:password@postgres:5432/langfuse"
      NEXTAUTH_URL: "http://localhost:3000"
      NEXTAUTH_SECRET: "your-secret-32-chars"
      SALT: "your-salt-string"
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: password
      POSTGRES_DB: langfuse
  redis:
    image: redis:7-alpine
```
## LangSmith: Native LangChain Integration

LangSmith is built by the LangChain team specifically to debug and evaluate LangChain applications. If you use LangChain.js, it provides the deepest integration — automatic tracing of every chain step, LCEL execution visualization, and native visualization of LangGraph runs.
### Setup — Environment Variables (Simplest Method)

```shell
# LangSmith traces automatically when these are set
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your-langsmith-api-key
export LANGCHAIN_PROJECT=my-production-project
# That's it — all LangChain.js calls are automatically traced
```
Manual SDK Integration
import { Client } from "langsmith";
import { traceable } from "langsmith/traceable";
import { wrapOpenAI } from "langsmith/wrappers";
import OpenAI from "openai";
const ls = new Client({ apiKey: process.env.LANGSMITH_API_KEY });
// Wrap OpenAI client — all calls auto-traced
const openai = wrapOpenAI(new OpenAI());
// Mark custom functions as traceable
const retrieveContext = traceable(
async (query: string): Promise<string[]> => {
// Your vector search logic
const results = await vectorStore.similaritySearch(query, 4);
return results.map((r) => r.pageContent);
},
{ name: "vector-retrieval", run_type: "retriever" }
);
const generateAnswer = traceable(
async (query: string, context: string[]): Promise<string> => {
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: `Answer based on:\n${context.join("\n\n")}`,
},
{ role: "user", content: query },
],
});
return response.choices[0].message.content!;
},
{ name: "answer-generation", run_type: "chain" }
);
// Top-level traced function
const ragPipeline = traceable(
async (query: string) => {
const context = await retrieveContext(query);
const answer = await generateAnswer(query, context);
return { answer, sourceDocs: context.length };
},
{ name: "rag-pipeline", run_type: "chain" }
);
const result = await ragPipeline("What is the capital of France?");
### LangSmith Evaluation

```typescript
import { evaluate } from "langsmith/evaluation";

// Define evaluators — each receives the run's outputs plus the
// reference outputs from the dataset example
const correctnessEvaluator = async ({ outputs, referenceOutputs }: any) => {
  // Use an LLM to evaluate correctness
  const score = await llmJudge(outputs.answer, referenceOutputs);
  return { key: "correctness", score: score > 0.7 ? 1 : 0 };
};

// Run evaluation against a dataset
const results = await evaluate(
  (inputs) => ragPipeline(inputs.query),
  {
    data: "customer-questions-dataset",
    evaluators: [correctnessEvaluator],
    experimentPrefix: "rag-v2-eval",
    metadata: { model: "gpt-4o", vectorStore: "pgvector" },
    maxConcurrency: 5,
  }
);

// Aggregate scores (e.g. mean correctness) are shown per experiment
// in the LangSmith UI
```
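The `llmJudge` call above is assumed to hit a model. For deterministic local testing, one option is a heuristic stand-in; `heuristicJudge` below scores word overlap with the reference text, and is a placeholder for experimentation, not a replacement for a real LLM judge:

```typescript
// Deterministic stand-in for an LLM judge: fraction of the reference's
// words that also appear in the prediction, in [0, 1].
function heuristicJudge(prediction: string, reference: string): number {
  const words = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const ref = words(reference);
  if (ref.size === 0) return 0;
  let hits = 0;
  for (const w of words(prediction)) {
    if (ref.has(w)) hits++;
  }
  return hits / ref.size;
}
```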
## Feature Comparison
| Feature | Helicone | Langfuse | LangSmith |
|---|---|---|---|
| Integration effort | Minimal (URL change) | Medium (SDK) | Low (env vars for LangChain) |
| Open source | Partial (OSS lite) | ✅ Fully open source | ❌ |
| Self-hosted | ❌ | ✅ Docker Compose | ❌ |
| Framework-agnostic | ✅ | ✅ | LangChain-optimized |
| Prompt management | ❌ | ✅ | ✅ |
| Prompt versioning | ❌ | ✅ | ✅ |
| Evaluation pipelines | ❌ | ✅ | ✅ |
| Dataset management | ❌ | ✅ | ✅ |
| Cost tracking | ✅ | ✅ | ✅ |
| Latency tracking | ✅ | ✅ | ✅ |
| Token usage | ✅ | ✅ | ✅ |
| Caching | ✅ | ❌ | ❌ |
| Rate limiting | ✅ | ❌ | ❌ |
| LangChain integration | Manual | ✅ | ✅ Native |
| LangGraph visualization | ❌ | ❌ | ✅ |
| Free tier | 100k req/mo | 50k obs/mo | 5k traces/mo |
| GitHub stars | 1.5k | 12k | ~2k |
## When to Use Each

### Choose Helicone if:
- You want observability in under 5 minutes with zero code changes
- Caching LLM responses to reduce costs is important (unique to Helicone)
- You don't use LangChain and want framework-agnostic logging
- Rate limiting LLM usage per user is a requirement
### Choose Langfuse if:
- Data ownership and compliance require self-hosted infrastructure
- You need prompt versioning, A/B testing, and structured evaluation pipelines
- Your team does serious LLM evaluation (datasets, scoring, regression testing)
- You use multiple AI providers and frameworks (not just OpenAI/LangChain)
### Choose LangSmith if:
- You're already using LangChain.js and want zero-configuration tracing
- You need LangGraph run visualization for complex multi-agent debugging
- You don't have self-hosting requirements and the SaaS model is fine
- The LangSmith evaluation framework fits your evaluation needs
## Methodology
Data sourced from GitHub repositories (star counts as of February 2026), official documentation, npm weekly download statistics (January 2026), and community discussion on Twitter/X and Discord. Free tier limits verified from official pricing pages. Self-hosting capabilities verified from official Docker Compose documentation.
Related: Mastra vs LangChain.js vs Google GenKit for AI agent frameworks, or OpenTelemetry vs Sentry vs Datadog for general application observability.