
Langfuse vs LangSmith vs Helicone: LLM Observability Platforms 2026

PkgPulse Team


TL;DR

LangSmith is LangChain's native observability layer — tightest integration with LangChain.js, but you're locked into the LangChain ecosystem. Langfuse is the open-source champion — self-hostable, framework-agnostic, and offers the most comprehensive evaluation and prompt management features for teams that care about data ownership. Helicone is the proxy-first option — drop one line of code, change your OpenAI base URL, and get immediate observability without SDK instrumentation. If you use LangChain, start with LangSmith. If you need open-source/self-hosted, go Langfuse. If you want zero-code integration, Helicone.

Key Takeaways

  • Langfuse GitHub stars: ~12k — the fastest-growing open-source LLM observability tool (Feb 2026)
  • LangSmith is the only option with native LangGraph trace visualization for complex agent runs
  • Helicone integration takes ~30 seconds — literally one URL change: baseURL: "https://oai.helicone.ai/v1"
  • All three track token costs and latency — the differentiators are evaluation, prompt management, and self-hosting
  • Langfuse is the only fully self-hostable option (Docker Compose available)
  • LangSmith's free tier is capped at 5k traces/month — Langfuse and Helicone offer more generous limits
  • Prompt versioning and A/B testing — Langfuse and LangSmith both have it; Helicone does not

Why LLM Observability Matters

When you move an LLM application from prototype to production, you immediately hit problems that don't exist in traditional software:

  • Why did the model give a bad answer? You need the exact prompt that was sent
  • Which prompt version is performing better? You need A/B testing with LLM-specific metrics
  • What's my actual token cost? Model pricing changes frequently and usage spikes unexpectedly
  • Are my evals passing? You need automated evaluation pipelines, not manual spot-checking

LLM observability platforms solve all of these. They sit between your app and the LLM API, capturing traces with the full prompt/response/token data.
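Whatever platform you pick, the captured payload looks roughly the same. A hypothetical TypeScript shape of one trace record — field names here are illustrative, not any vendor's actual schema:

```typescript
// Illustrative shape of a captured LLM trace — not any vendor's actual schema
interface LLMTrace {
  traceId: string;
  model: string;
  promptMessages: { role: string; content: string }[];
  completion: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  costUsd: number;
  metadata: Record<string, string>;
}

const example: LLMTrace = {
  traceId: "trace_001",
  model: "gpt-4o",
  promptMessages: [{ role: "user", content: "Hello world" }],
  completion: "Hi there!",
  promptTokens: 9,
  completionTokens: 4,
  latencyMs: 820,
  costUsd: 0.0001,
  metadata: { env: "production" },
};
```

Every question in the list above is answerable from these fields: bad answers from `promptMessages`, cost from the token counts, latency regressions from `latencyMs`.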


Helicone: Zero-Effort Integration

Helicone is a proxy service — you point your OpenAI/Anthropic/Gemini calls at Helicone's servers, and it records everything before forwarding to the actual model. No SDK instrumentation required.

30-Second Integration

import OpenAI from "openai";

// Before Helicone
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// After Helicone — literally one change
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// Everything else stays the same — all calls are automatically traced
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello world" }],
});

Anthropic Integration

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
  baseURL: "https://anthropic.helicone.ai",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

Custom Properties and Session Tracking

// Add metadata to traces via headers
const response = await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
  },
  {
    headers: {
      // Group traces by session
      "Helicone-Session-Id": `session-${userId}-${Date.now()}`,
      "Helicone-Session-Name": "customer-support-chat",
      // Custom properties for filtering in dashboard
      "Helicone-Property-UserId": userId,
      "Helicone-Property-Plan": userPlan,
      "Helicone-Property-Feature": "chat",
      // User tracking
      "Helicone-User-Id": userId,
    },
  }
);

Helicone Caching (Reduce Costs)

// Cache identical prompts — useful for deterministic queries
const response = await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [{ role: "user", content: "What is the capital of France?" }],
    temperature: 0, // Caching matches the exact request body; deterministic settings benefit most
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Bucket-Max-Size": "3",
    },
  }
);
// Second call with same prompt returns cached result — 0 tokens, instant
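A cached response is served only when the request matches a previous one exactly — conceptually, the cache is keyed on the full request body. A sketch of that idea (illustrative, not Helicone's actual implementation):

```typescript
import { createHash } from "node:crypto";

// Illustrative: derive a cache key from the full request body.
// Any change — model, messages, temperature — produces a different key.
function cacheKey(body: object): string {
  return createHash("sha256").update(JSON.stringify(body)).digest("hex");
}

const question = [{ role: "user", content: "What is the capital of France?" }];
const a = cacheKey({ model: "gpt-4o", messages: question, temperature: 0 });
const b = cacheKey({ model: "gpt-4o", messages: question, temperature: 0 });
const c = cacheKey({ model: "gpt-4o", messages: question, temperature: 0.7 });

console.log(a === b); // true — identical requests share a key
console.log(a === c); // false — different temperature, different key
```

This is why caching pays off for repeated deterministic queries (FAQ answers, classification prompts) and does nothing for free-form chat.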

Rate Limiting via Helicone

// Policy-based rate limiting per user
const response = await client.chat.completions.create(
  { model: "gpt-4o", messages: [{ role: "user", content: prompt }] },
  {
    headers: {
      "Helicone-RateLimit-Policy": "100;w=86400;s=user",
      // 100 requests per 86400-second window (one day), segmented per user
      "Helicone-User-Id": userId,
    },
  }
);
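The policy string packs quota, window, and segment into one header value. A small helper to build it — the `quota;w=<seconds>;s=<segment>` format follows Helicone's documented syntax at the time of writing, so verify against current docs:

```typescript
// Build a Helicone rate-limit policy string: "<quota>;w=<windowSeconds>[;s=<segment>]"
// Format per Helicone's docs at time of writing — verify against current documentation.
function rateLimitPolicy(
  quota: number,
  windowSeconds: number,
  segment?: string
): string {
  let policy = `${quota};w=${windowSeconds}`;
  if (segment) policy += `;s=${segment}`;
  return policy;
}

console.log(rateLimitPolicy(100, 86400, "user")); // "100;w=86400;s=user"
console.log(rateLimitPolicy(1000, 60));           // "1000;w=60" — global limit
```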

Langfuse: Open-Source Observability

Langfuse is a comprehensive open-source platform covering traces, evaluations, prompt management, datasets, and analytics. It's the only option you can fully self-host, making it the default choice for compliance-sensitive applications.

Node.js SDK Integration

import { Langfuse } from "langfuse";

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  secretKey: process.env.LANGFUSE_SECRET_KEY!,
  baseUrl: "https://cloud.langfuse.com", // Or your self-hosted URL
});

// Create a trace
const trace = langfuse.trace({
  name: "customer-support-response",
  userId: "user_123",
  sessionId: "session_456",
  metadata: { plan: "pro", channel: "chat" },
  tags: ["production", "v2"],
});

// Span for the LLM generation
const span = trace.span({
  name: "generate-response",
  input: { userMessage: prompt, context: retrievedDocs },
});

const generation = span.generation({
  name: "gpt4o-call",
  model: "gpt-4o",
  input: messages,
  modelParameters: { temperature: 0.7, maxTokens: 500 },
});

// Make the actual LLM call
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  temperature: 0.7,
  max_tokens: 500,
});

// Record the output
generation.end({
  output: response.choices[0].message,
  usage: {
    promptTokens: response.usage?.prompt_tokens,
    completionTokens: response.usage?.completion_tokens,
    totalCost: calculateCost(response.usage!),
  },
});

span.end({ output: response.choices[0].message.content });
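The `calculateCost` helper above is not part of the SDK — you supply it yourself. A minimal sketch with a hardcoded pricing table; the per-million-token numbers are placeholders, not current list prices:

```typescript
// Illustrative pricing table — placeholder numbers, not current list prices
const PRICES_PER_MILLION: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10 },
};

interface TokenUsage {
  prompt_tokens: number;
  completion_tokens: number;
}

// Cost in USD for one completion, given token usage and the model's rates
function calculateCost(usage: TokenUsage, model = "gpt-4o"): number {
  const price = PRICES_PER_MILLION[model];
  if (!price) return 0; // unknown model — report zero rather than guess
  return (
    (usage.prompt_tokens * price.input +
      usage.completion_tokens * price.output) /
    1_000_000
  );
}

console.log(calculateCost({ prompt_tokens: 1000, completion_tokens: 500 }));
// 0.0025 + 0.005 = 0.0075
```

Keeping the table in code means pricing changes are a one-line diff, and traces recorded before the change keep the cost that was accurate when they ran.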

OpenAI SDK Drop-In Wrapper

import { observeOpenAI } from "langfuse";
import OpenAI from "openai";

// Automatic instrumentation — similar to Helicone proxy but via SDK
const openai = observeOpenAI(new OpenAI(), {
  clientInitParams: {
    publicKey: process.env.LANGFUSE_PUBLIC_KEY,
    secretKey: process.env.LANGFUSE_SECRET_KEY,
  },
});

// All calls are automatically traced. To link a managed prompt to the trace,
// pass langfusePrompt in the observeOpenAI config rather than the request body.
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
});

Prompt Management

// Versioned prompts — store in Langfuse dashboard, fetch at runtime
const promptTemplate = await langfuse.getPrompt("customer-support-v2");

const compiledPrompt = promptTemplate.compile({
  userQuery: userInput,
  productName: "PkgPulse",
  context: retrievedContext,
});

// The prompt version is automatically tracked in traces
// You can compare performance across prompt versions in the dashboard
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: compiledPrompt.messages,
});
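Langfuse prompt templates use `{{variable}}` placeholders, and `compile()` substitutes the values you pass. A simplified re-implementation of the idea (not the SDK's actual code):

```typescript
// Simplified {{variable}} substitution — conceptually what compile() does.
// Unknown placeholders are left intact so missing variables are easy to spot.
function compileTemplate(
  template: string,
  vars: Record<string, string>
): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) => vars[key] ?? match);
}

const out = compileTemplate(
  "You support {{productName}}. Answer: {{userQuery}}",
  { productName: "PkgPulse", userQuery: "How do I export traces?" }
);
console.log(out);
// "You support PkgPulse. Answer: How do I export traces?"
```

Because the template lives in the Langfuse dashboard rather than your codebase, non-engineers can iterate on wording without a deploy, and every trace records which version produced the output.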

Evaluation with Langfuse

// Add scores to traces — manual or automated
trace.score({
  name: "user-satisfaction",
  value: 1, // 1 = positive feedback
  comment: "User clicked 'helpful' button",
});

// LLM-as-judge evaluation — score() queues the event; it is flushed asynchronously
langfuse.score({
  traceId: trace.id,
  name: "response-quality",
  value: 0.85,
  dataType: "NUMERIC",
  comment: "Automated eval: factual accuracy score",
});

// Batch evaluation against a dataset
const dataset = await langfuse.getDataset("customer-questions-500");

for (const item of dataset.items) {
  const trace = langfuse.trace({ name: "eval-run" });
  const response = await generateResponse(item.input);

  // Score each output against the dataset's expected output
  trace.score({
    name: "correctness",
    value: compareWithExpected(response, item.expectedOutput),
  });
}

// Flush queued events before the process exits
await langfuse.flushAsync();
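`compareWithExpected` above is your own scoring function, not an SDK export. A simple sketch using normalized exact match — real evaluations usually layer fuzzier metrics or an LLM judge on top of this:

```typescript
// Normalized exact match: 1 if outputs agree after trimming, case folding,
// and collapsing whitespace; 0 otherwise.
function compareWithExpected(actual: string, expected: string): number {
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, " ");
  return norm(actual) === norm(expected) ? 1 : 0;
}

console.log(compareWithExpected("  Paris ", "paris")); // 1 — match after normalizing
console.log(compareWithExpected("Lyon", "Paris"));     // 0 — different answers
```

Exact match is a reasonable floor for short factual answers; for longer responses, pair it with a semantic or judge-based score so paraphrases aren't punished.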

Self-Hosting with Docker Compose

# docker-compose.yml for self-hosted Langfuse (simplified — recent Langfuse
# versions also require ClickHouse and S3-compatible storage; see official docs)
services:
  langfuse-worker:
    image: langfuse/langfuse-worker:latest
    depends_on: [postgres, redis]
    environment:
      DATABASE_URL: "postgresql://postgres:password@postgres:5432/langfuse"
      REDIS_CONNECTION_STRING: "redis://redis:6379"
      LANGFUSE_TELEMETRY_ENABLED: "false"

  langfuse-web:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    depends_on: [postgres, redis, langfuse-worker]
    environment:
      DATABASE_URL: "postgresql://postgres:password@postgres:5432/langfuse"
      NEXTAUTH_URL: "http://localhost:3000"
      NEXTAUTH_SECRET: "your-secret-32-chars"
      SALT: "your-salt-string"

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: password
      POSTGRES_DB: langfuse

  redis:
    image: redis:7-alpine

LangSmith: Native LangChain Integration

LangSmith is built by the LangChain team specifically to debug and evaluate LangChain applications. If you use LangChain.js, it provides the deepest integration — automatic tracing of every chain step, LCEL execution visualization, and LangGraph run visualization.

Setup — Environment Variables (Simplest Method)

# LangSmith traces automatically when these are set
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your-langsmith-api-key
export LANGCHAIN_PROJECT=my-production-project
# That's it — all LangChain.js calls are automatically traced

Manual SDK Integration

import { Client } from "langsmith";
import { traceable } from "langsmith/traceable";
import { wrapOpenAI } from "langsmith/wrappers";
import OpenAI from "openai";

// Client enables programmatic access (datasets, feedback); tracing itself reads env vars
const ls = new Client({ apiKey: process.env.LANGSMITH_API_KEY });

// Wrap OpenAI client — all calls auto-traced
const openai = wrapOpenAI(new OpenAI());

// Mark custom functions as traceable
const retrieveContext = traceable(
  async (query: string): Promise<string[]> => {
    // Your vector search logic
    const results = await vectorStore.similaritySearch(query, 4);
    return results.map((r) => r.pageContent);
  },
  { name: "vector-retrieval", run_type: "retriever" }
);

const generateAnswer = traceable(
  async (query: string, context: string[]): Promise<string> => {
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [
        {
          role: "system",
          content: `Answer based on:\n${context.join("\n\n")}`,
        },
        { role: "user", content: query },
      ],
    });
    return response.choices[0].message.content!;
  },
  { name: "answer-generation", run_type: "chain" }
);

// Top-level traced function
const ragPipeline = traceable(
  async (query: string) => {
    const context = await retrieveContext(query);
    const answer = await generateAnswer(query, context);
    return { answer, sourceDocs: context.length };
  },
  { name: "rag-pipeline", run_type: "chain" }
);

const result = await ragPipeline("What is the capital of France?");

LangSmith Evaluation

import { evaluate } from "langsmith/evaluation";

// Define evaluators — recent langsmith versions pass run outputs and reference outputs
const correctnessEvaluator = async ({ outputs, referenceOutputs }: any) => {
  // Use an LLM judge to grade the answer against the reference
  const score = await llmJudge(outputs, referenceOutputs);
  return { key: "correctness", score: score > 0.7 ? 1 : 0 };
};

// Run evaluation against a dataset
const results = await evaluate(
  (inputs) => ragPipeline(inputs.query),
  {
    data: "customer-questions-dataset",
    evaluators: [correctnessEvaluator],
    experimentPrefix: "rag-v2-eval",
    metadata: { model: "gpt-4o", vectorStore: "pgvector" },
    maxConcurrency: 5,
  }
);

// Aggregate scores and per-example results appear in the LangSmith UI
// under the "rag-v2-eval" experiment prefix
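`llmJudge` above is a placeholder for your own LLM-as-judge call. Whichever prompt you use, parse the judge model's reply defensively — a sketch of the parsing half (the LLM call itself is omitted):

```typescript
// Extract a 0–1 score from a judge model's free-text reply.
// Takes the first number found and clamps it into [0, 1]; no number → 0.
function parseJudgeScore(reply: string): number {
  const match = reply.match(/-?\d+(\.\d+)?/);
  if (!match) return 0;
  return Math.min(1, Math.max(0, parseFloat(match[0])));
}

console.log(parseJudgeScore("Score: 0.85 — mostly accurate")); // 0.85
console.log(parseJudgeScore("I'd rate this 2 out of 1!"));     // 1 — clamped
console.log(parseJudgeScore("no score here"));                 // 0
```

Clamping matters: judge models occasionally return scores outside the scale you asked for, and a stray `10` flowing into a 0–1 metric silently wrecks your aggregates.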

Feature Comparison

| Feature | Helicone | Langfuse | LangSmith |
| --- | --- | --- | --- |
| Integration effort | Minimal (URL change) | Medium (SDK) | Low (env vars for LangChain) |
| Open source | Partial (OSS lite) | ✅ Fully open source | ❌ |
| Self-hosted | ❌ | ✅ Docker Compose | ❌ |
| Framework-agnostic | ✅ | ✅ | LangChain-optimized |
| Prompt management | ❌ | ✅ | ✅ |
| Prompt versioning | ❌ | ✅ | ✅ |
| Evaluation pipelines | ❌ | ✅ | ✅ |
| Dataset management | ❌ | ✅ | ✅ |
| Cost tracking | ✅ | ✅ | ✅ |
| Latency tracking | ✅ | ✅ | ✅ |
| Token usage | ✅ | ✅ | ✅ |
| Caching | ✅ | ❌ | ❌ |
| Rate limiting | ✅ | ❌ | ❌ |
| LangChain integration | Manual | ✅ (callback handler) | ✅ Native |
| LangGraph visualization | ❌ | ❌ | ✅ |
| Free tier | 100k req/mo | 50k obs/mo | 5k traces/mo |
| GitHub stars | ~1.5k | ~12k | ~2k |

When to Use Each

Choose Helicone if:

  • You want observability in under 5 minutes with zero code changes
  • Caching LLM responses to reduce costs is important (unique to Helicone)
  • You don't use LangChain and want framework-agnostic logging
  • Rate limiting LLM usage per user is a requirement

Choose Langfuse if:

  • Data ownership and compliance require self-hosted infrastructure
  • You need prompt versioning, A/B testing, and structured evaluation pipelines
  • Your team does serious LLM evaluation (datasets, scoring, regression testing)
  • You use multiple AI providers and frameworks (not just OpenAI/LangChain)

Choose LangSmith if:

  • You're already using LangChain.js and want zero-configuration tracing
  • You need LangGraph run visualization for complex multi-agent debugging
  • You don't have self-hosting requirements and the SaaS model is fine
  • The LangSmith evaluation framework fits your evaluation needs

Methodology

Data sourced from GitHub repositories (star counts as of February 2026), official documentation, npm weekly download statistics (January 2026), and community discussion on Twitter/X and Discord. Free tier limits verified from official pricing pages. Self-hosting capabilities verified from official Docker Compose documentation.


Related: Mastra vs LangChain.js vs Google GenKit for AI agent frameworks, or OpenTelemetry vs Sentry vs Datadog for general application observability.
