
Transformers.js vs ONNX Runtime Web: ML in the Browser 2026

PkgPulse Team

Transformers.js v4 achieved 53% smaller bundle sizes and dropped build times from 2 seconds to 200 milliseconds — and it runs entirely in the browser, with no server required. The ability to run serious machine learning models client-side has gone from a curiosity to a production pattern, with embeddings, text classification, and even LLMs now feasible in the browser.

TL;DR

Transformers.js is the right choice if you want a high-level, model-specific API that abstracts away ONNX details — great for text classification, embeddings, question answering, and named entity recognition in the browser. ONNX Runtime Web is the right choice if you're bringing your own trained model from PyTorch or TensorFlow and need maximum control over inference. In practice, Transformers.js uses ONNX Runtime Web under the hood.

Key Takeaways

  • Transformers.js v4 (released February 2026): 53% smaller bundles, 200ms build time, WebGPU support
  • @xenova/transformers: ~300K weekly npm downloads; onnxruntime-web: ~150K weekly downloads
  • Transformers.js v3+ rebranded to @huggingface/transformers (the @xenova/transformers package is v2.x legacy)
  • ONNX Runtime Web supports WebGPU, WebAssembly (WASM), and WebGL backends
  • FP16 quantized models run ~40% faster than FP32 with minimal accuracy loss on WebGPU
  • Llama-3.2-1B achievable in browser at ~20 tok/s on M-series Mac; models >2B impractical on most devices
  • Both packages work in Node.js too — they're not browser-only

Understanding the Relationship

Before comparing them, it's important to understand that these tools are complementary, not competing:

Your Code
    ↓
Transformers.js    ←— High-level pipeline API (text classification, embeddings, etc.)
    ↓
ONNX Runtime Web   ←— Low-level model execution engine
    ↓
WebGPU / WASM      ←— Hardware acceleration layer

Transformers.js is built on top of ONNX Runtime Web. It provides task-specific APIs (pipelines) and a model hub integration, while ONNX Runtime Web does the actual computation.

Transformers.js

Package and Installation

# New package (v3+, actively developed by Hugging Face)
npm install @huggingface/transformers

# Legacy package (v2.x, still widely used)
npm install @xenova/transformers

Core Concept: Pipelines

Transformers.js uses the same pipeline API as Python's transformers library:

import { pipeline } from '@huggingface/transformers';

// Text classification
const classifier = await pipeline('text-classification', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english');
const result = await classifier('I love this product!');
// [{ label: 'POSITIVE', score: 0.9998 }]

// Generate embeddings
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const output = await embedder('Hello, world!', { pooling: 'mean', normalize: true });
const embedding = Array.from(output.data); // 384-dimensional vector

// Question answering
const qa = await pipeline('question-answering', 'Xenova/distilbert-base-cased-distilled-squad');
const answer = await qa({
  question: 'What year was TypeScript released?',
  context: 'TypeScript was first made public in October 2012 by Anders Hejlsberg.',
});
// { answer: 'October 2012', score: 0.98, start: 37, end: 49 }

// Automatic Speech Recognition (Whisper)
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
const transcript = await transcriber('audio.wav'); // pass a URL; the pipeline fetches and decodes the audio
// { text: 'Hello, world!' }

Supported Tasks (2026)

| Task | Example Model | Size |
|---|---|---|
| Text classification | distilbert-base-uncased-finetuned-sst-2 | 67 MB |
| Feature extraction (embeddings) | all-MiniLM-L6-v2 | 23 MB |
| Named entity recognition | bert-base-NER | 110 MB |
| Question answering | distilbert-base-cased-distilled-squad | 67 MB |
| Text generation | gpt2 | 124 MB |
| Speech-to-text | whisper-tiny | 39 MB |
| Translation | Helsinki-NLP/opus-mt-en-fr | 77 MB |
| Zero-shot classification | bart-large-mnli | 407 MB |
| Object detection | detr-resnet-50 | 166 MB |
| Image classification | vit-base-patch16-224 | 87 MB |

WebGPU Acceleration (v4)

import { pipeline } from '@huggingface/transformers';

const embedder = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2',
  { device: 'webgpu' } // GPU acceleration
);

With WebGPU enabled, embedding generation is 3-5x faster than WASM.
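WebGPU is not available in every browser, so it is worth feature-detecting before requesting the `webgpu` device and falling back to WASM otherwise. A minimal sketch, assuming a small helper of our own (`pickDevice` is not part of either library):

```javascript
// Hypothetical helper: choose the Transformers.js `device` option based on
// whether the WebGPU API is exposed. In browsers, WebGPU availability is
// signalled by the presence of `navigator.gpu`.
function pickDevice(hasWebGPU) {
  return hasWebGPU ? 'webgpu' : 'wasm';
}

// In the browser:
//   const device = pickDevice('gpu' in navigator);
//   const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', { device });

console.log(pickDevice(true));  // 'webgpu'
console.log(pickDevice(false)); // 'wasm'
```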

Browser Caching

Models are downloaded from Hugging Face Hub and cached in the browser's Cache API:

// First run: downloads ~23 MB model
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Subsequent runs: loads from cache, ~200ms
const embedder2 = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

ONNX Runtime Web

Package and Installation

npm install onnxruntime-web

Core Concept: Direct Model Inference

ONNX Runtime Web lets you run any ONNX model — regardless of the task or framework it was trained with:

import * as ort from 'onnxruntime-web';

// Configure backend priority
ort.env.wasm.wasmPaths = '/wasm/'; // Path to WASM files
ort.env.webgpu.powerPreference = 'high-performance';

// Load model
const session = await ort.InferenceSession.create('/models/my_model.onnx', {
  executionProviders: ['webgpu', 'wasm'], // Try WebGPU first, fall back to WASM
  graphOptimizationLevel: 'all',
});

// Prepare inputs as tensors
const inputTensor = new ort.Tensor('float32', Float32Array.from(inputData), [1, 128]);
const feeds = { input_ids: inputTensor };

// Run inference
const results = await session.run(feeds);
const output = results.logits.data; // TypedArray
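Unlike a pipeline, the raw session gives you logits, not labels, so post-processing is on you. For classification that usually means a softmax over the output TypedArray; a sketch in plain JavaScript, independent of the runtime:

```javascript
// Convert raw logits (e.g. the TypedArray returned by session.run) into
// probabilities with a numerically stable softmax: subtract the max
// before exponentiating to avoid overflow.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = Array.from(logits, (x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((x) => x / sum);
}

const probs = softmax(new Float32Array([2.0, 1.0, 0.1]));
console.log(probs); // probabilities summing to 1, largest at index 0
```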

Converting PyTorch Models

# Python: Export to ONNX
import torch
import torch.onnx

model = MyModel()
model.load_state_dict(torch.load('model.pth'))
model.eval()

dummy_input = torch.zeros(1, 128, dtype=torch.long)
torch.onnx.export(
    model, dummy_input, 'model.onnx',
    input_names=['input_ids'],
    output_names=['logits'],
    dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence_length'}},
)

// JavaScript: Run the exported model
const session = await ort.InferenceSession.create('/model.onnx');

Execution Providers

| Provider | When Used | Performance |
|---|---|---|
| webgpu | Modern browsers with GPU | Fastest |
| wasm | All browsers | Good |
| webgl | Older GPU path | Moderate |

// WebGPU with FP16 (40% faster, minimal accuracy loss)
const session = await ort.InferenceSession.create('/model_fp16.onnx', {
  executionProviders: [{ name: 'webgpu', preferredLayout: 'NHWC' }],
});

Performance Comparison

Embedding Generation (all-MiniLM-L6-v2, 1000 documents)

| Method | Backend | Time | Tok/sec |
|---|---|---|---|
| Transformers.js | WASM | 8.2s | ~5K |
| Transformers.js | WebGPU | 1.8s | ~22K |
| ONNX Runtime Web | WASM | 7.9s | ~5K |
| ONNX Runtime Web | WebGPU | 1.6s | ~25K |

Text Generation (GPT-2 small, 100 tokens)

| Method | Backend | Time |
|---|---|---|
| Transformers.js | WASM | 12s |
| Transformers.js | WebGPU | 3.1s |
| ONNX Runtime Web (direct) | WebGPU | 2.8s |

Bundle Size

| Package | Minified | With WASM files |
|---|---|---|
| @huggingface/transformers | 1.2 MB | +3.5 MB WASM |
| @xenova/transformers (v2) | 2.1 MB | +3.5 MB WASM |
| onnxruntime-web | 0.5 MB | +3.5 MB WASM |

The WASM files are loaded lazily and cached — they only download once.

Feature Comparison

| Feature | Transformers.js | ONNX Runtime Web |
|---|---|---|
| API level | High-level (pipelines) | Low-level (tensor ops) |
| Model hub | Hugging Face (auto-download) | Bring your own |
| Task coverage | 30+ NLP/vision tasks | Any ONNX model |
| WebGPU | Yes (v4) | Yes |
| Node.js support | Yes | Yes |
| Bundle size | Larger | Smaller |
| Custom models | Needs ONNX export | Native |
| Streaming generation | Yes (v4) | Manual |

Use Case Decision Guide

Use Transformers.js if:

  • You want a pre-trained model for a standard task (embeddings, classification, NER, ASR)
  • You're building client-side RAG — Transformers.js + ChromaDB is a complete in-browser RAG stack
  • You want models auto-downloaded from Hugging Face without managing model files
  • You need the Python transformers API to be portable to JavaScript
  • TypeScript inference from pipeline output is important

// Complete in-browser semantic search (no server!)
import { pipeline } from '@huggingface/transformers';

const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const embed = async (text: string) => {
  const out = await embedder(text, { pooling: 'mean', normalize: true });
  return Array.from(out.data);
};

// Index documents (store `embeddings` in a vector store such as ChromaDB)
const docs = ['Document 1...', 'Document 2...'];
const embeddings = await Promise.all(docs.map(embed));

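The indexing snippet above stops before the query side, which is a nearest-neighbor search over the stored vectors. Because the embeddings are normalized, cosine similarity reduces to a plain dot product. A sketch, where the `search` helper is our own illustration rather than a library API:

```javascript
// Dot product doubles as cosine similarity for normalized embeddings.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Rank indexed documents against a query embedding, best match first.
function search(queryEmbedding, docEmbeddings, topK = 3) {
  return docEmbeddings
    .map((emb, i) => ({ index: i, score: dot(queryEmbedding, emb) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// In the app above:
//   const results = search(await embed('my query'), embeddings);
```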
Use ONNX Runtime Web if:

  • You have a custom trained model in PyTorch or TensorFlow
  • You need maximum performance with direct tensor control
  • You're building a non-NLP use case (custom computer vision, signal processing)
  • You want the smallest possible bundle without Transformers.js overhead
  • You need fine-grained memory management for large model inference

LLMs in the Browser (2026)

With WebGPU and optimized quantization, small LLMs are now feasible in the browser:

| Model | Quantization | VRAM | Tok/sec (M3) |
|---|---|---|---|
| Llama 3.2 1B | Q4F16 | 800 MB | ~25 |
| Phi-3 mini 3.8B | Q4F16 | 2.3 GB | ~15 |
| Gemma 2 2B | Q4F16 | 1.4 GB | ~18 |
| Llama 3.2 3B | Q4F16 | 2.0 GB | ~12 |

These require WebGPU and devices with 2+ GB available GPU memory. Mobile support is limited to iPhone 15 Pro and recent Android flagships with 8+ GB RAM.
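As a rough rule of thumb, a 4-bit quantized model needs about half a byte per parameter for weights, plus headroom for the KV cache, activations, and runtime overhead. A back-of-the-envelope estimator, where the 1.5x overhead factor is our assumption rather than a published figure:

```javascript
// Rough VRAM estimate for a Q4-quantized model:
// weights ≈ params * 0.5 bytes, times an assumed 1.5x factor for
// KV cache, activations, and runtime overhead.
function estimateQ4VramMB(paramsBillions) {
  const weightsMB = (paramsBillions * 1e9 * 0.5) / (1024 * 1024);
  return Math.round(weightsMB * 1.5);
}

console.log(estimateQ4VramMB(1)); // 715, in the same ballpark as the 800 MB figure for Llama 3.2 1B
```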

// Running a small LLM in the browser with Transformers.js v4
import { pipeline, TextStreamer } from '@huggingface/transformers';

const generator = await pipeline('text-generation', 'onnx-community/Llama-3.2-1B-Instruct-q4f16', {
  device: 'webgpu',
});

// Stream tokens to the UI as they are generated
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  callback_function: (text) => console.log(text),
});
await generator('Tell me a joke', { max_new_tokens: 200, do_sample: false, streamer });

The WebLLM Alternative

For LLM-specific inference, @mlc-ai/web-llm (built on WebGPU/TVM) often outperforms both packages for generation tasks:

npm install @mlc-ai/web-llm

It focuses exclusively on LLM inference and achieves higher throughput than Transformers.js for generation.

Recommendation

For 2026 production browser ML work, the stack is:

  1. Use @huggingface/transformers for standard NLP tasks
  2. Use onnxruntime-web directly for custom models
  3. Use @mlc-ai/web-llm for LLM generation
  4. Always target WebGPU with WASM fallback

Compare download trends for these packages on PkgPulse.
