
Guide

franc vs langdetect vs cld3 (2026)

Compare franc, langdetect, and cld3 for detecting the language of text in JavaScript. Accuracy, language coverage, bundle size, short text detection, Node.js.

·PkgPulse Team·

TL;DR

franc is the most popular JavaScript language detection library — pure JavaScript, works in browsers and Node.js, covers 400+ languages, and is tree-shakable (use franc-min for smaller bundles). langdetect is a port of Google's language detection algorithm — accurate for longer texts, designed for Node.js. @google-cloud/language and cld3 (compiled to WASM) offer Google's production-grade detection but require more setup. For browser-compatible language detection: franc. For server-side with high accuracy: langdetect or cld3. For short texts (tweets, comments): all struggle — franc-min is usually fine.

Key Takeaways

  • franc: ~400K weekly downloads — 400+ languages, browser + Node.js, ESM-native, configurable
  • langdetect: ~50K weekly downloads — port of Google's LangDetect, probabilistic, Node.js
  • cld3 / @langion/cld3: WASM-compiled Compact Language Detector v3 — Google's production algorithm
  • All language detectors struggle with: short texts (<50 chars), code snippets, mixed-language text
  • franc provides confidence scores — filter low-confidence results
  • For production apps: consider server-side with langdetect or cld3 for better accuracy

franc

franc — pure JavaScript language detection:

Basic usage

import { franc } from "franc"
// Or: import { franc } from "franc-min"  // Fewer languages, smaller bundle

// Detect language:
franc("Hello, how are you?")    // "eng" (English)
franc("Bonjour, comment allez-vous?")   // "fra" (French)
franc("Guten Morgen, wie geht es Ihnen?")  // "deu" (German)
franc("こんにちは、お元気ですか?")  // "jpn" (Japanese)
franc("你好,你好吗?")  // "cmn" (Mandarin Chinese)
franc("مرحبا كيف حالك؟")  // "arb" (Arabic)

// Returns ISO 639-3 codes (3-letter codes, not ISO 639-1 2-letter codes)
// "eng" not "en", "fra" not "fr", "deu" not "de"

// Convert to ISO 639-1 if needed (franc has no built-in converter):
const lang3 = franc("Hello world")  // "eng"
// Map manually or use a lookup table:
const iso1Map: Record<string, string> = { eng: "en", fra: "fr", deu: "de", jpn: "ja" }
const lang1 = iso1Map[lang3] ?? lang3

Confidence scores

import { francAll } from "franc"

// Get all candidates with confidence scores:
const results = francAll("Hello, how are you?")
// [
//   ["eng", 1.0],   // English — 100% confidence
//   ["sco", 0.8],   // Scots
//   ["nob", 0.5],   // Norwegian Bokmål
//   ...
// ]

// Use the top result only if confidence is high:
function detectLanguage(text: string, minConfidence = 0.7): string | null {
  const results = francAll(text)
  const [lang, confidence] = results[0] ?? []

  if (!lang || confidence < minConfidence) {
    return null  // Not confident enough
  }

  return lang  // ISO 639-3 code
}

detectLanguage("Hello world")           // "eng" (high confidence)
detectLanguage("Hi")                    // null (too short/ambiguous)
detectLanguage("Bonjour tout le monde") // "fra"

Configuration options

import { franc, francAll } from "franc"

// Limit to specific languages (improves accuracy when domain is known):
franc("Hello world", { only: ["eng", "fra", "deu", "spa"] })
// "eng" (only considers English, French, German, Spanish)

// Exclude certain languages:
franc("Hello world", { ignore: ["sco", "nob"] })
// "eng" (doesn't confuse with Scots or Norwegian)

// Minimum text length (default: 10):
franc("Hi", { minLength: 0 })   // Attempt even with very short text
franc("Hi", { minLength: 10 })  // Returns "und" (undetermined) for short text

franc-min vs franc vs franc-all

// franc ships multiple variants:

// franc-min — 82 languages, ~540KB (best for browsers):
import { franc } from "franc-min"

// franc — 400 languages, ~1.5MB (more coverage):
import { franc } from "franc"

// franc-all — 400+ languages (most comprehensive):
import { franc } from "franc-all"

// For browser apps, use franc-min — significant bundle size difference
// For server-side: franc (400 languages) is fine

Content moderation use case

import { franc } from "franc-min"

interface UserContent {
  id: string
  text: string
  expectedLanguage: string  // ISO 639-1: "en", "fr", etc.
}

const iso1ToIso3: Record<string, string> = {
  en: "eng", fr: "fra", de: "deu", es: "spa",
  pt: "por", it: "ita", nl: "nld", ja: "jpn",
  zh: "cmn", ar: "arb", ru: "rus", ko: "kor",
}

function validateContentLanguage(content: UserContent): boolean {
  const expected = iso1ToIso3[content.expectedLanguage]
  const detected = franc(content.text, { minLength: 20 })

  if (detected === "und") {
    return true  // Too short to determine — let through
  }

  return detected === expected
}

langdetect

langdetect — Google's LangDetect algorithm for Node.js:

Basic usage

import langdetect from "langdetect"
// Note: langdetect uses ISO 639-1 (2-letter codes) by default

// Detect (returns most likely language):
langdetect.detect("Hello, how are you?")
// "en"

langdetect.detect("Bonjour, comment allez-vous?")
// "fr"

langdetect.detect("这是一段中文文本")
// "zh-cn"

// Detect with probabilities:
langdetect.detectOne("Hello, how are you?")
// { lang: "en", prob: 0.9999... }

langdetect.detectAll("Hello, how are you?")
// [
//   { lang: "en", prob: 0.9999 },
//   { lang: "af", prob: 0.0000... },  // Afrikaans
//   ...
// ]

Compared to franc accuracy

// langdetect performs better on longer texts (50+ words)
// franc performs better on very short texts (5-10 words)
// Both struggle with mixed-language content and code

// Test on short text:
import { franc } from "franc-min"
import langdetect from "langdetect"

const shortText = "Hello"
// Note: with franc's default minLength of 10, a 5-character input returns "und";
// lower minLength to force a guess:
franc(shortText, { minLength: 1 })         // often wrong on very short text (e.g. "sco")
langdetect.detect(shortText)               // "en" (usually correct)

// Test on longer text:
const paragraph = "This is a longer text that contains multiple sentences in English."
franc(paragraph)                           // "eng" ✓
langdetect.detect(paragraph)              // "en" ✓

// langdetect is probabilistic — runs multiple trials internally
// More accurate for longer texts due to statistical approach

cld3 (Compact Language Detector)

CLD3 — Google's production language detection (native bindings via node-cld, or WASM builds such as @langion/cld3):

// @langion/cld3 — WASM build of Google's CLD3:
import cld3 from "@langion/cld3"

await cld3.ready()  // Wait for WASM initialization

const result = cld3.findLanguage("Hello, how are you?")
// {
//   language: "en",
//   probability: 0.9999...,
//   isReliable: true,
//   proportion: 1.0,
// }

// Find top 3 languages (for mixed-language text):
const results = cld3.findTopNMostFreqLangs("Hello world, Bonjour monde!", 3)
// [
//   { language: "en", probability: 0.6, isReliable: true },
//   { language: "fr", probability: 0.3, isReliable: false },
// ]

// CLD3 is the most accurate for production use cases
// but requires WASM setup and is larger than franc/langdetect

Feature Comparison

| Feature              | franc        | langdetect | cld3        |
| -------------------- | ------------ | ---------- | ----------- |
| Language count       | 400+         | 55         | 107         |
| Short text           | ⚠️ Weak      | ⚠️ Weak    | ✅ Better   |
| Browser support      | ✅           | ❌         | ✅ (WASM)   |
| Bundle size          | ~540KB (min) | ~2MB       | ~8MB (WASM) |
| ISO codes            | 639-3        | 639-1      | 639-1       |
| Confidence score     | ✅           | ✅         | ✅          |
| ESM                  | ✅           | ❌         | ❌          |
| TypeScript           | ✅           | @types     | —           |
| No binary deps       | ✅           | ✅         | ✅ (WASM)   |
| Accuracy (long text) | Good         | Very good  | Excellent   |

When to Use Each

Choose franc if:

  • Browser compatibility required (React, Vue, Svelte apps)
  • You need 400+ language support
  • Lightweight detection (franc-min for browser bundles)
  • ESM-first codebase

Choose langdetect if:

  • Server-side Node.js only (no browser)
  • You need the probabilistic accuracy of Google's original algorithm
  • Text is typically 50+ words (langdetect shines with longer text)

Choose cld3 if:

  • Production apps requiring Google-grade accuracy
  • You can accept the WASM bundle overhead (~8MB)
  • Mixed-language text detection is important

Handle edge cases:

// All detectors struggle with these cases — handle gracefully:

// 1. Very short text:
if (text.length < 20) return "und"  // Don't trust detection

// 2. All-caps or all numbers:
if (/^[A-Z0-9\s]+$/.test(text)) return "und"

// 3. Mixed language (code-switching):
// Consider breaking into sentences first

// 4. Code/technical content:
// Package names, URLs, and code snippets routinely produce wrong results
// Strip before detecting

// 5. Confidence threshold:
const [lang, score] = francAll(text)[0]
if (score < 0.8) return "und"  // Threshold for "I'm sure"

Production Integration Patterns and Preprocessing

Language detection accuracy degrades predictably when applied to raw, unprocessed user input. A content moderation pipeline that feeds raw social media text directly to franc will see substantially worse results than one that preprocesses the input first. The most impactful preprocessing steps are: stripping URLs (which contain English-like ASCII path segments regardless of the document language), removing @mentions and #hashtags, and filtering out emoji-heavy segments. After this cleanup, even short texts of 30–50 characters detect significantly more reliably.
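That preprocessing step can be sketched as a small helper. The function name and the exact regexes are illustrative, not part of franc or any other library:

```typescript
// Hypothetical preprocessing helper: strip the tokens that most often
// skew detection before passing text to any detector.
function cleanForDetection(text: string): string {
  return text
    .replace(/https?:\/\/\S+/g, " ")              // URLs read as English-like ASCII
    .replace(/[@#][\w-]+/g, " ")                  // @mentions and #hashtags
    .replace(/\p{Extended_Pictographic}/gu, " ")  // emoji
    .replace(/\s+/g, " ")                         // collapse leftover whitespace
    .trim()
}

cleanForDetection("Bonjour @marie regarde https://example.com/page #super 🎉 c'est génial")
// "Bonjour regarde c'est génial"
```

Run this before franc (or any detector) so the surviving text reflects the author's actual language rather than ASCII noise.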

For multi-paragraph documents, a common pattern is to detect language per paragraph and use a majority vote, discarding paragraphs that return "und" (undetermined). This is more robust than detecting the full document as a single string, because introductory quotes in foreign languages or embedded code blocks can skew the overall detection. franc's francAll function, which returns confidence-ranked candidates, is useful here: take the top candidate only if it scores above 0.85, otherwise treat the paragraph as ambiguous.
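The majority-vote pattern might look like this; detectByMajority is a hypothetical helper, and the detector is injected so it can be a franc wrapper that applies a confidence threshold and returns "und" when unsure:

```typescript
// Per-paragraph majority vote: detect each paragraph, discard "und",
// return the language with the most votes.
function detectByMajority(doc: string, detect: (text: string) => string): string {
  const votes = new Map<string, number>()
  for (const para of doc.split(/\n{2,}/)) {
    const lang = detect(para.trim())
    if (lang === "und") continue                 // skip ambiguous paragraphs
    votes.set(lang, (votes.get(lang) ?? 0) + 1)
  }
  let best = "und"
  let max = 0
  for (const [lang, n] of votes) {
    if (n > max) { best = lang; max = n }
  }
  return best
}
```

Because the detector is a parameter, the same helper works with franc, langdetect, or a stub in tests.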

API-level language detection at the route level is a useful fallback when client-provided Accept-Language headers are absent. A middleware that calls franc on the request body and attaches the detected language to the request object costs roughly 1–3ms per request for typical API payloads under 500 characters. For multi-region applications where content routing depends on language, this server-side detection prevents mislabeled content from reaching the wrong regional index or translation queue. Use franc-min in this middleware path — the 82-language subset covers over 95% of internet traffic and keeps the module footprint small.
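A minimal sketch of such middleware, with the detector injected (franc-min in production, a stub in tests). Field names like req.detectedLanguage are illustrative assumptions, not a framework API:

```typescript
// Express-style middleware factory: detect the request body's language
// and attach it to the request object for downstream handlers.
interface DetectingRequest {
  body?: { text?: string }
  detectedLanguage?: string
}

function languageMiddleware(detect: (text: string) => string) {
  return (req: DetectingRequest, _res: unknown, next: () => void): void => {
    const text = req.body?.text ?? ""
    // Don't trust detection on very short payloads
    req.detectedLanguage = text.length >= 20 ? detect(text) : "und"
    next()
  }
}
```

Wiring it up would be the usual `app.use(languageMiddleware(franc))` pattern, assuming a JSON body parser runs first.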


Accuracy Trade-offs: Statistical vs N-gram Approaches

franc uses trigram frequency analysis — it builds profiles of the most common three-character sequences for each supported language and scores input text against those profiles. This approach is fast, requires no training data at runtime, and works well for texts of 50+ characters. The weakness is that trigram profiles for closely related languages overlap significantly: Dutch and Afrikaans, Spanish and Portuguese, Norwegian Bokmål and Danish share high-frequency trigrams, causing franc to frequently confuse them for short inputs.
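A toy scorer illustrates the idea. franc's real profiles are pre-trained, frequency-ranked tables, so this sketch demonstrates only the mechanism, not franc's actual scoring:

```typescript
// Extract the set of three-character sequences from a text,
// padded with spaces so word boundaries produce trigrams too.
function trigrams(text: string): Set<string> {
  const t = ` ${text.toLowerCase().replace(/\s+/g, " ")} `
  const grams = new Set<string>()
  for (let i = 0; i <= t.length - 3; i++) {
    grams.add(t.slice(i, i + 3))
  }
  return grams
}

// Score = number of trigrams the input shares with a language profile.
function overlap(input: string, profile: Set<string>): number {
  let score = 0
  for (const g of trigrams(input)) {
    if (profile.has(g)) score++
  }
  return score
}

const engProfile = trigrams("the quick brown fox jumps over the lazy dog and the cat sat on the mat")
overlap("the dog and the cat", engProfile)  // high: many shared English trigrams
overlap("le chien et le chat", engProfile)  // low: few shared trigrams
```

The overlap problem for related languages is visible even here: Dutch and Afrikaans inputs would share many trigrams with each other's profiles, which is exactly why short inputs get confused.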

langdetect's probabilistic approach runs multiple independent trials with randomized character n-gram sampling and averages the results. This is slower than franc (5–20× depending on text length) but produces more reliable probability estimates for the top candidate, especially for texts between 50–200 characters where franc's confidence scores are least reliable. langdetect also maintains a seed-based deterministic mode for reproducible results in testing — useful for regression tests that verify detection behavior for known inputs.

cld3's neural network classifier, compiled to WASM, was trained on web-scale multilingual data. It handles code-switching (text that alternates between two languages mid-sentence) better than either statistical approach, and it reports per-byte span annotations for mixed-language documents. The 8MB WASM bundle makes cld3 impractical for browser use and adds startup latency (100–300ms for WASM initialization) to server processes. The correct deployment pattern for cld3 is a dedicated microservice or a module that initializes once at process start rather than on each detection request. For applications where detection accuracy is a core quality metric — international content platforms, multilingual search indexing, cross-border compliance tools — the cld3 accuracy premium justifies the operational overhead.
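The initialize-once pattern can be sketched independently of any detector; init would wrap whatever the WASM build exposes (for @langion/cld3, the ready() call shown earlier):

```typescript
// Cache the initialization promise so every caller shares one WASM startup.
let readyPromise: Promise<void> | null = null

function ensureDetectorReady(init: () => Promise<void>): Promise<void> {
  readyPromise ??= init()  // only the first call triggers initialization
  return readyPromise
}
```

Calling ensureDetectorReady once at process start, before the server accepts requests, keeps the 100–300ms WASM startup out of the request path; later callers just await the already-resolved promise.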


Browser Bundle Size and Tree-Shaking

franc's approach to bundle optimization is worth examining for browser applications. The full franc package ships trigram models for 400+ languages, which accounts for most of its ~1.5MB size. For browser applications targeting a known set of languages (for example, a product launching only in English, Spanish, French, and German), franc-min (82 languages, ~540KB) covers that set at roughly one-third the size, and you can restrict it further with the only option to specify exactly which languages to consider: franc(text, { only: ['eng', 'spa', 'fra', 'deu'] }). This restricted mode is faster because fewer trigram profiles are evaluated, and more reliable because the algorithm is not confused by profiles from unrelated language families.

langdetect does not offer a comparable tree-shaking or language-subset approach — the detection model is loaded as a single unit. For server-side Node.js applications where bundle size is not a concern, this makes no difference. For edge functions or browser bundles where every kilobyte matters, franc's composable approach is uniquely suited. cld3's WASM binary is approximately 8MB and cannot be meaningfully subset — it ships Google's full language model. For applications that need cld3's accuracy on a small set of languages, consider running it as a separate microservice that the main application calls via HTTP, keeping the heavy binary isolated from latency-sensitive code paths.

Methodology

Download data from npm registry (weekly average, February 2026). Accuracy comparisons based on community benchmarks and documentation for franc v6.x, langdetect v1.x, and @langion/cld3 v1.x.

Compare NLP and text processing packages on PkgPulse →

In 2026, franc is the right choice when you want a single pure-JavaScript npm dependency with no native binaries, 400+ language support, and code that runs in both browsers and Node.js. langdetect works well for server-side Node.js where texts run longer and its probabilistic scoring pays off. cld3 provides the highest accuracy at the cost of a large WASM bundle (or a native addon and build step, via node-cld) and startup latency. For production systems handling user-generated content, language detection should always be combined with a manual override mechanism: no detector is 100% accurate on short texts, mixed-language content, or heavily code-switched text.

See also: AVA vs Jest, ohash vs object-hash vs hash-wasm, and acorn vs @babel/parser vs espree.
