Transformers.js vs ONNX Runtime Web: ML in the Browser 2026
Transformers.js v4 achieved 53% smaller bundle sizes and dropped build times from 2 seconds to 200 milliseconds — and it runs entirely in the browser, with no server required. The ability to run serious machine learning models client-side has gone from a curiosity to a production pattern, with embeddings, text classification, and even LLMs now feasible in the browser.
TL;DR
Transformers.js is the right choice if you want a high-level, model-specific API that abstracts away ONNX details — great for text classification, embeddings, question answering, and named entity recognition in the browser. ONNX Runtime Web is the right choice if you're bringing your own trained model from PyTorch or TensorFlow and need maximum control over inference. In practice, Transformers.js uses ONNX Runtime Web under the hood.
Key Takeaways
- Transformers.js v4 (released February 2026): 53% smaller bundles, 200ms build time, WebGPU support
- `@xenova/transformers`: ~300K weekly npm downloads; `onnxruntime-web`: ~150K weekly downloads
- Transformers.js v3+ rebranded to `@huggingface/transformers` (the `@xenova/transformers` package is v2.x legacy)
- ONNX Runtime Web supports WebGPU, WebAssembly (WASM), and WebGL backends
- FP16 quantized models run ~40% faster than FP32 with minimal accuracy loss on WebGPU
- Llama-3.2-1B achievable in browser at ~20 tok/s on M-series Mac; models >2B impractical on most devices
- Both packages work in Node.js too — they're not browser-only
Understanding the Relationship
Before comparing them, it's important to understand that these tools are complementary, not competing:
Your Code
↓
Transformers.js ←— High-level pipeline API (text classification, embeddings, etc.)
↓
ONNX Runtime Web ←— Low-level model execution engine
↓
WebGPU / WASM ←— Hardware acceleration layer
Transformers.js is built on top of ONNX Runtime Web. It provides task-specific APIs (pipelines) and a model hub integration, while ONNX Runtime Web does the actual computation.
Transformers.js
Package and Installation
# New package (v3+, actively developed by Hugging Face)
npm install @huggingface/transformers
# Legacy package (v2.x, still widely used)
npm install @xenova/transformers
Core Concept: Pipelines
Transformers.js uses the same pipeline API as Python's transformers library:
import { pipeline } from '@huggingface/transformers';
// Text classification
const classifier = await pipeline('text-classification', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english');
const result = await classifier('I love this product!');
// [{ label: 'POSITIVE', score: 0.9998 }]
// Generate embeddings
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const output = await embedder('Hello, world!', { pooling: 'mean', normalize: true });
const embedding = Array.from(output.data); // 384-dimensional vector
// Question answering
const qa = await pipeline('question-answering', 'Xenova/distilbert-base-cased-distilled-squad');
const answer = await qa({
question: 'What year was TypeScript released?',
context: 'TypeScript was first made public in October 2012 by Anders Hejlsberg.',
});
// { answer: 'October 2012', score: 0.98, start: 37, end: 49 }
// Automatic Speech Recognition (Whisper)
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
const transcript = await transcriber('audio.wav'); // pass a URL (or a Float32Array of decoded audio samples)
// { text: 'Hello, world!' }
Supported Tasks (2026)
| Task | Example Model | Size |
|---|---|---|
| Text classification | distilbert-base-uncased-finetuned-sst-2 | 67 MB |
| Feature extraction (embeddings) | all-MiniLM-L6-v2 | 23 MB |
| Named entity recognition | bert-base-NER | 110 MB |
| Question answering | distilbert-base-cased-distilled-squad | 67 MB |
| Text generation | gpt2 | 124 MB |
| Speech-to-text | whisper-tiny | 39 MB |
| Translation | Helsinki-NLP/opus-mt-en-fr | 77 MB |
| Zero-shot classification | bart-large-mnli | 407 MB |
| Object detection | detr-resnet-50 | 166 MB |
| Image classification | vit-base-patch16-224 | 87 MB |
WebGPU Acceleration (v4)
import { pipeline } from '@huggingface/transformers';
const embedder = await pipeline(
'feature-extraction',
'Xenova/all-MiniLM-L6-v2',
{ device: 'webgpu' } // GPU acceleration
);
With WebGPU enabled, embedding generation is 3-5x faster than WASM.
Browser Caching
Models are downloaded from Hugging Face Hub and cached in the browser's Cache API:
// First run: downloads ~23 MB model
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
// Subsequent runs: loads from cache, ~200ms
const embedder2 = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
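Because `pipeline()` returns a promise, a small memoization wrapper ensures each model is initialized only once, even when several components request it concurrently. A minimal sketch — the `makePipelineCache` helper is our own, not part of the library:

```javascript
// Hypothetical memoization helper: create each pipeline at most once and
// hand every caller the same pending promise, so concurrent requests
// share a single download/initialization.
function makePipelineCache(factory) {
  const cache = new Map();
  return (task, model) => {
    const key = `${task}:${model}`;
    if (!cache.has(key)) cache.set(key, factory(task, model));
    return cache.get(key); // a Promise resolving to the pipeline instance
  };
}

// Usage with Transformers.js:
// const getPipeline = makePipelineCache((task, model) => pipeline(task, model));
// const embedder = await getPipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
```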
ONNX Runtime Web
Package and Installation
npm install onnxruntime-web
Core Concept: Direct Model Inference
ONNX Runtime Web lets you run any ONNX model — regardless of the task or framework it was trained with:
import * as ort from 'onnxruntime-web';
// Configure backend priority
ort.env.wasm.wasmPaths = '/wasm/'; // Path to WASM files
ort.env.webgpu.powerPreference = 'high-performance';
// Load model
const session = await ort.InferenceSession.create('/models/my_model.onnx', {
executionProviders: ['webgpu', 'wasm'], // Try WebGPU first, fall back to WASM
graphOptimizationLevel: 'all',
});
// Prepare inputs as tensors
const inputTensor = new ort.Tensor('float32', Float32Array.from(inputData), [1, 128]);
const feeds = { input_ids: inputTensor };
// Run inference
const results = await session.run(feeds);
const output = results.logits.data; // TypedArray
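Unlike a pipeline, the raw session gives you unnormalized logits; for classification you apply the softmax yourself. A small helper, assuming nothing beyond the TypedArray output above:

```javascript
// Numerically stable softmax: subtract the max logit before exponentiating
// so large values don't overflow to Infinity.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = Array.from(logits, (v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / sum);
}

// const probs = softmax(Array.from(results.logits.data));
// const predicted = probs.indexOf(Math.max(...probs));
```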
Converting PyTorch Models
# Python: Export to ONNX
import torch
import torch.onnx
model = MyModel()
model.load_state_dict(torch.load('model.pth'))
model.eval()
dummy_input = torch.zeros(1, 128, dtype=torch.long)
torch.onnx.export(
model, dummy_input, 'model.onnx',
input_names=['input_ids'],
output_names=['logits'],
dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence_length'}},
)
// JavaScript: Run the exported model
const session = await ort.InferenceSession.create('/model.onnx');
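Because the export marks batch size and sequence length as dynamic axes, the JavaScript side can feed any batch shape, as long as the flat data length matches the dims. A sketch of a padding helper — `toBatchTensor` is a hypothetical name, not an ort API; the int64 data type matches the `torch.long` dummy input above:

```javascript
// Pad a batch of token-id arrays to a uniform length and return the
// flat BigInt64Array + dims that ort.Tensor expects for int64 inputs.
function toBatchTensor(sequences, padId = 0) {
  const maxLen = Math.max(...sequences.map((s) => s.length));
  const data = new BigInt64Array(sequences.length * maxLen).fill(BigInt(padId));
  sequences.forEach((seq, i) => {
    seq.forEach((id, j) => { data[i * maxLen + j] = BigInt(id); });
  });
  return { data, dims: [sequences.length, maxLen] };
}

// const { data, dims } = toBatchTensor([[101, 2023, 102], [101, 102]]);
// const feeds = { input_ids: new ort.Tensor('int64', data, dims) };
```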
Execution Providers
| Provider | When Used | Performance |
|---|---|---|
| `webgpu` | Modern browsers with GPU | Fastest |
| `wasm` | All browsers | Good |
| `webgl` | Older GPU path | Moderate |
// WebGPU with FP16 (40% faster, minimal accuracy loss)
const session = await ort.InferenceSession.create('/model_fp16.onnx', {
executionProviders: [{ name: 'webgpu', deviceType: 'gpu', preferredLayout: 'NHWC' }],
});
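Since the `executionProviders` list is tried in order, you can feature-detect WebGPU once and keep a single code path for every browser. A minimal sketch — the helper is our own; `navigator.gpu` is the standard WebGPU entry point:

```javascript
// Build an execution-provider list from what the environment exposes.
// WASM always goes last as the universal fallback.
function pickProviders(nav = globalThis.navigator) {
  const providers = [];
  if (nav && 'gpu' in nav) providers.push('webgpu');
  providers.push('wasm');
  return providers;
}

// const session = await ort.InferenceSession.create('/model.onnx', {
//   executionProviders: pickProviders(),
// });
```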
Performance Comparison
Embedding Generation (all-MiniLM-L6-v2, 1000 documents)
| Method | Backend | Time | Tok/sec |
|---|---|---|---|
| Transformers.js | WASM | 8.2s | ~5K |
| Transformers.js | WebGPU | 1.8s | ~22K |
| ONNX Runtime Web | WASM | 7.9s | ~5K |
| ONNX Runtime Web | WebGPU | 1.6s | ~25K |
Text Generation (GPT-2 small, 100 tokens)
| Method | Backend | Time |
|---|---|---|
| Transformers.js | WASM | 12s |
| Transformers.js | WebGPU | 3.1s |
| ONNX Runtime Web (direct) | WebGPU | 2.8s |
Bundle Size
| Package | Minified | With WASM files |
|---|---|---|
| `@huggingface/transformers` | 1.2 MB | +3.5 MB WASM |
| `@xenova/transformers` (v2) | 2.1 MB | +3.5 MB WASM |
| `onnxruntime-web` | 0.5 MB | +3.5 MB WASM |
The WASM files are loaded lazily and cached — they only download once.
Feature Comparison
| Feature | Transformers.js | ONNX Runtime Web |
|---|---|---|
| API level | High-level (pipelines) | Low-level (tensor ops) |
| Model hub | Hugging Face (auto-download) | Bring your own |
| Task coverage | 30+ NLP/vision tasks | Any ONNX model |
| WebGPU | Yes (v4) | Yes |
| Node.js support | Yes | Yes |
| Bundle size | Larger | Smaller |
| Custom models | Needs ONNX export | Native |
| Streaming generation | Yes (v4) | Manual |
Use Case Decision Guide
Use Transformers.js if:
- You want a pre-trained model for a standard task (embeddings, classification, NER, ASR)
- You're building client-side RAG — Transformers.js handles embedding generation, paired with a vector store such as ChromaDB
- You want models auto-downloaded from Hugging Face without managing model files
- You need the Python transformers API to be portable to JavaScript
- TypeScript inference from pipeline output is important
// Complete in-browser semantic search (no server!)
import { pipeline } from '@huggingface/transformers';
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const embed = async (text: string) => {
const out = await embedder(text, { pooling: 'mean', normalize: true });
return Array.from(out.data);
};
// Index documents
const docs = ['Document 1...', 'Document 2...'];
const embeddings = await Promise.all(docs.map(embed));
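Because the embeddings were generated with `normalize: true`, cosine similarity reduces to a plain dot product, so querying the index is just a ranking step. A sketch completing the example above — the `topK` helper is our own:

```javascript
// Rank indexed documents against a query embedding.
// Assumes all embeddings are L2-normalized ({ normalize: true }),
// so the dot product equals cosine similarity.
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

function topK(queryEmbedding, embeddings, k = 3) {
  return embeddings
    .map((e, i) => ({ index: i, score: dot(queryEmbedding, e) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// const hits = topK(await embed('my query'), embeddings);
// const bestMatches = hits.map((h) => docs[h.index]);
```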
Use ONNX Runtime Web if:
- You have a custom trained model in PyTorch or TensorFlow
- You need maximum performance with direct tensor control
- You're building a non-NLP use case (custom computer vision, signal processing)
- You want the smallest possible bundle without Transformers.js overhead
- You need fine-grained memory management for large model inference
LLMs in the Browser (2026)
With WebGPU and optimized quantization, small LLMs are now feasible in the browser:
| Model | Quantization | VRAM | Tok/sec (M3) |
|---|---|---|---|
| Llama 3.2 1B | Q4F16 | 800 MB | ~25 |
| Phi-3 mini 3.8B | Q4F16 | 2.3 GB | ~15 |
| Gemma 2 2B | Q4F16 | 1.4 GB | ~18 |
| Llama 3.2 3B | Q4F16 | 2.0 GB | ~12 |
These require WebGPU and devices with 2+ GB available GPU memory. Mobile support is limited to iPhone 15 Pro and recent Android flagships with 8+ GB RAM.
// Running a small LLM in the browser with Transformers.js v4
import { pipeline, TextStreamer } from '@huggingface/transformers';
const generator = await pipeline('text-generation', 'onnx-community/Llama-3.2-1B-Instruct-q4f16', {
  device: 'webgpu',
});
// Stream tokens as they are generated via a TextStreamer callback
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  callback_function: (text) => process.stdout.write(text),
});
await generator('Tell me a joke', { max_new_tokens: 200, do_sample: false, streamer });
The WebLLM Alternative
For LLM-specific inference, @mlc-ai/web-llm (built on WebGPU/TVM) often outperforms both packages for generation tasks:
npm install @mlc-ai/web-llm
It focuses exclusively on LLM inference and achieves higher throughput than Transformers.js for generation.
Recommendation
For 2026 production browser ML work, the stack is:
- Use `@huggingface/transformers` for standard NLP tasks
- Use `onnxruntime-web` directly for custom models
- Use `@mlc-ai/web-llm` for LLM generation
- Always target WebGPU with WASM fallback
Compare download trends for these packages on PkgPulse.