Transformers.js vs ONNX Runtime Web: ML in the Browser 2026
Transformers.js v4 achieved 53% smaller bundle sizes and dropped build times from 2 seconds to 200 milliseconds — and it runs entirely in the browser, with no server required. The ability to run serious machine learning models client-side has gone from a curiosity to a production pattern, with embeddings, text classification, and even LLMs now feasible in the browser.
TL;DR
Transformers.js is the right choice if you want a high-level, model-specific API that abstracts away ONNX details — great for text classification, embeddings, question answering, and named entity recognition in the browser. ONNX Runtime Web is the right choice if you're bringing your own trained model from PyTorch or TensorFlow and need maximum control over inference. In practice, Transformers.js uses ONNX Runtime Web under the hood.
Key Takeaways
- Transformers.js v4 (released February 2026): 53% smaller bundles, 200ms build time, WebGPU support
- `@xenova/transformers`: ~300K weekly npm downloads; `onnxruntime-web`: ~150K weekly downloads
- Transformers.js v3+ rebranded to `@huggingface/transformers` (the `@xenova/transformers` package is v2.x legacy)
- ONNX Runtime Web supports WebGPU, WebAssembly (WASM), and WebGL backends
- FP16 quantized models run ~40% faster than FP32 with minimal accuracy loss on WebGPU
- Llama-3.2-1B achievable in browser at ~20 tok/s on M-series Mac; models >2B impractical on most devices
- Both packages work in Node.js too — they're not browser-only
Understanding the Relationship
Before comparing them, it's important to understand that these tools are complementary, not competing:
Your Code
↓
Transformers.js ←— High-level pipeline API (text classification, embeddings, etc.)
↓
ONNX Runtime Web ←— Low-level model execution engine
↓
WebGPU / WASM ←— Hardware acceleration layer
Transformers.js is built on top of ONNX Runtime Web. It provides task-specific APIs (pipelines) and a model hub integration, while ONNX Runtime Web does the actual computation.
Transformers.js
Package and Installation
# New package (v3+, actively developed by Hugging Face)
npm install @huggingface/transformers
# Legacy package (v2.x, still widely used)
npm install @xenova/transformers
Core Concept: Pipelines
Transformers.js uses the same pipeline API as Python's transformers library:
import { pipeline } from '@huggingface/transformers';
// Text classification
const classifier = await pipeline('text-classification', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english');
const result = await classifier('I love this product!');
// [{ label: 'POSITIVE', score: 0.9998 }]
// Generate embeddings
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const output = await embedder('Hello, world!', { pooling: 'mean', normalize: true });
const embedding = Array.from(output.data); // 384-dimensional vector
// Question answering
const qa = await pipeline('question-answering', 'Xenova/distilbert-base-cased-distilled-squad');
const answer = await qa({
question: 'What year was TypeScript released?',
context: 'TypeScript was first made public in October 2012 by Anders Hejlsberg.',
});
// { answer: 'October 2012', score: 0.98, start: 37, end: 49 }
// Automatic Speech Recognition (Whisper)
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
const transcript = await transcriber('audio.wav'); // pass a URL (or a Float32Array of decoded audio samples)
// { text: 'Hello, world!' }
Supported Tasks (2026)
| Task | Example Model | Size |
|---|---|---|
| Text classification | distilbert-base-uncased-finetuned-sst-2 | 67 MB |
| Feature extraction (embeddings) | all-MiniLM-L6-v2 | 23 MB |
| Named entity recognition | bert-base-NER | 110 MB |
| Question answering | distilbert-base-cased-distilled-squad | 67 MB |
| Text generation | gpt2 | 124 MB |
| Speech-to-text | whisper-tiny | 39 MB |
| Translation | Helsinki-NLP/opus-mt-en-fr | 77 MB |
| Zero-shot classification | bart-large-mnli | 407 MB |
| Object detection | detr-resnet-50 | 166 MB |
| Image classification | vit-base-patch16-224 | 87 MB |
WebGPU Acceleration (v4)
import { pipeline } from '@huggingface/transformers';
const embedder = await pipeline(
'feature-extraction',
'Xenova/all-MiniLM-L6-v2',
{ device: 'webgpu' } // GPU acceleration
);
With WebGPU enabled, embedding generation is 3-5x faster than WASM.
Browser Caching
Models are downloaded from Hugging Face Hub and cached in the browser's Cache API:
// First run: downloads ~23 MB model
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
// Subsequent runs: loads from cache, ~200ms
const embedder2 = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
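Because `pipeline()` returns a promise, a small memoization wrapper ensures each model is initialized only once, even when several components request it concurrently. A minimal sketch — the `makePipelineCache` helper is our own, not part of the library:

```javascript
// Hypothetical memoization helper: create each pipeline at most once and
// hand every caller the same pending promise, so concurrent requests
// share a single download/initialization.
function makePipelineCache(factory) {
  const cache = new Map();
  return (task, model) => {
    const key = `${task}:${model}`;
    if (!cache.has(key)) cache.set(key, factory(task, model));
    return cache.get(key); // a Promise resolving to the pipeline instance
  };
}

// Usage with Transformers.js:
// const getPipeline = makePipelineCache((task, model) => pipeline(task, model));
// const embedder = await getPipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
```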
ONNX Runtime Web
Package and Installation
npm install onnxruntime-web
Core Concept: Direct Model Inference
ONNX Runtime Web lets you run any ONNX model — regardless of the task or framework it was trained with:
import * as ort from 'onnxruntime-web';
// Configure backend priority
ort.env.wasm.wasmPaths = '/wasm/'; // Path to WASM files
ort.env.webgpu.powerPreference = 'high-performance';
// Load model
const session = await ort.InferenceSession.create('/models/my_model.onnx', {
executionProviders: ['webgpu', 'wasm'], // Try WebGPU first, fall back to WASM
graphOptimizationLevel: 'all',
});
// Prepare inputs as tensors
const inputTensor = new ort.Tensor('float32', Float32Array.from(inputData), [1, 128]);
const feeds = { input_ids: inputTensor };
// Run inference
const results = await session.run(feeds);
const output = results.logits.data; // TypedArray
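Unlike a pipeline, the raw session gives you unnormalized logits; for classification you apply the softmax yourself. A small helper, assuming nothing beyond the TypedArray output above:

```javascript
// Numerically stable softmax: subtract the max logit before exponentiating
// so large values don't overflow to Infinity.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = Array.from(logits, (v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / sum);
}

// const probs = softmax(Array.from(results.logits.data));
// const predicted = probs.indexOf(Math.max(...probs));
```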
Converting PyTorch Models
# Python: Export to ONNX
import torch
import torch.onnx
model = MyModel()
model.load_state_dict(torch.load('model.pth'))
model.eval()
dummy_input = torch.zeros(1, 128, dtype=torch.long)
torch.onnx.export(
model, dummy_input, 'model.onnx',
input_names=['input_ids'],
output_names=['logits'],
dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence_length'}},
)
// JavaScript: Run the exported model
const session = await ort.InferenceSession.create('/model.onnx');
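Because the export marks batch size and sequence length as dynamic axes, the JavaScript side can feed any batch shape, as long as the flat data length matches the dims. A sketch of a padding helper — `toBatchTensor` is a hypothetical name, not an ort API; the int64 data type matches the `torch.long` dummy input above:

```javascript
// Pad a batch of token-id arrays to a uniform length and return the
// flat BigInt64Array + dims that ort.Tensor expects for int64 inputs.
function toBatchTensor(sequences, padId = 0) {
  const maxLen = Math.max(...sequences.map((s) => s.length));
  const data = new BigInt64Array(sequences.length * maxLen).fill(BigInt(padId));
  sequences.forEach((seq, i) => {
    seq.forEach((id, j) => { data[i * maxLen + j] = BigInt(id); });
  });
  return { data, dims: [sequences.length, maxLen] };
}

// const { data, dims } = toBatchTensor([[101, 2023, 102], [101, 102]]);
// const feeds = { input_ids: new ort.Tensor('int64', data, dims) };
```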
Execution Providers
| Provider | When Used | Performance |
|---|---|---|
| `webgpu` | Modern browsers with GPU | Fastest |
| `wasm` | All browsers | Good |
| `webgl` | Older GPU path | Moderate |
// WebGPU with FP16 (40% faster, minimal accuracy loss)
const session = await ort.InferenceSession.create('/model_fp16.onnx', {
executionProviders: [{ name: 'webgpu', deviceType: 'gpu', preferredLayout: 'NHWC' }],
});
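Since the `executionProviders` list is tried in order, you can feature-detect WebGPU once and keep a single code path for every browser. A minimal sketch — the helper is our own; `navigator.gpu` is the standard WebGPU entry point:

```javascript
// Build an execution-provider list from what the environment exposes.
// WASM always goes last as the universal fallback.
function pickProviders(nav = globalThis.navigator) {
  const providers = [];
  if (nav && 'gpu' in nav) providers.push('webgpu');
  providers.push('wasm');
  return providers;
}

// const session = await ort.InferenceSession.create('/model.onnx', {
//   executionProviders: pickProviders(),
// });
```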
Performance Comparison
Embedding Generation (all-MiniLM-L6-v2, 1000 documents)
| Method | Backend | Time | Tok/sec |
|---|---|---|---|
| Transformers.js | WASM | 8.2s | ~5K |
| Transformers.js | WebGPU | 1.8s | ~22K |
| ONNX Runtime Web | WASM | 7.9s | ~5K |
| ONNX Runtime Web | WebGPU | 1.6s | ~25K |
Text Generation (GPT-2 small, 100 tokens)
| Method | Backend | Time |
|---|---|---|
| Transformers.js | WASM | 12s |
| Transformers.js | WebGPU | 3.1s |
| ONNX Runtime Web (direct) | WebGPU | 2.8s |
Bundle Size
| Package | Minified | With WASM files |
|---|---|---|
| `@huggingface/transformers` | 1.2 MB | +3.5 MB WASM |
| `@xenova/transformers` (v2) | 2.1 MB | +3.5 MB WASM |
| `onnxruntime-web` | 0.5 MB | +3.5 MB WASM |
The WASM files are loaded lazily and cached — they only download once.
Feature Comparison
| Feature | Transformers.js | ONNX Runtime Web |
|---|---|---|
| API level | High-level (pipelines) | Low-level (tensor ops) |
| Model hub | Hugging Face (auto-download) | Bring your own |
| Task coverage | 30+ NLP/vision tasks | Any ONNX model |
| WebGPU | Yes (v4) | Yes |
| Node.js support | Yes | Yes |
| Bundle size | Larger | Smaller |
| Custom models | Needs ONNX export | Native |
| Streaming generation | Yes (v4) | Manual |
Use Case Decision Guide
Use Transformers.js if:
- You want a pre-trained model for a standard task (embeddings, classification, NER, ASR)
- You're building client-side RAG — Transformers.js handles embedding generation, paired with a vector store such as ChromaDB
- You want models auto-downloaded from Hugging Face without managing model files
- You need the Python transformers API to be portable to JavaScript
- TypeScript inference from pipeline output is important
// Complete in-browser semantic search (no server!)
import { pipeline } from '@huggingface/transformers';
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const embed = async (text: string) => {
const out = await embedder(text, { pooling: 'mean', normalize: true });
return Array.from(out.data);
};
// Index documents
const docs = ['Document 1...', 'Document 2...'];
const embeddings = await Promise.all(docs.map(embed));
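Because the embeddings were generated with `normalize: true`, cosine similarity reduces to a plain dot product, so querying the index is just a ranking step. A sketch completing the example above — the `topK` helper is our own:

```javascript
// Rank indexed documents against a query embedding.
// Assumes all embeddings are L2-normalized ({ normalize: true }),
// so the dot product equals cosine similarity.
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

function topK(queryEmbedding, embeddings, k = 3) {
  return embeddings
    .map((e, i) => ({ index: i, score: dot(queryEmbedding, e) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// const hits = topK(await embed('my query'), embeddings);
// const bestMatches = hits.map((h) => docs[h.index]);
```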
Use ONNX Runtime Web if:
- You have a custom trained model in PyTorch or TensorFlow
- You need maximum performance with direct tensor control
- You're building a non-NLP use case (custom computer vision, signal processing)
- You want the smallest possible bundle without Transformers.js overhead
- You need fine-grained memory management for large model inference
LLMs in the Browser (2026)
With WebGPU and optimized quantization, small LLMs are now feasible in the browser:
| Model | Quantization | VRAM | Tok/sec (M3) |
|---|---|---|---|
| Llama 3.2 1B | Q4F16 | 800 MB | ~25 |
| Phi-3 mini 3.8B | Q4F16 | 2.3 GB | ~15 |
| Gemma 2 2B | Q4F16 | 1.4 GB | ~18 |
| Llama 3.2 3B | Q4F16 | 2.0 GB | ~12 |
These require WebGPU and devices with 2+ GB available GPU memory. Mobile support is limited to iPhone 15 Pro and recent Android flagships with 8+ GB RAM.
// Running a small LLM in the browser with Transformers.js v4
import { pipeline, TextStreamer } from '@huggingface/transformers';
const generator = await pipeline('text-generation', 'onnx-community/Llama-3.2-1B-Instruct-q4f16', {
  device: 'webgpu',
});
// Stream tokens as they are generated via a TextStreamer callback
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  callback_function: (text) => process.stdout.write(text),
});
await generator('Tell me a joke', { max_new_tokens: 200, do_sample: false, streamer });
The WebLLM Alternative
For LLM-specific inference, @mlc-ai/web-llm (built on WebGPU/TVM) often outperforms both packages for generation tasks:
npm install @mlc-ai/web-llm
It focuses exclusively on LLM inference and achieves higher throughput than Transformers.js for generation.
Recommendation
For 2026 production browser ML work, the stack is:
- Use `@huggingface/transformers` for standard NLP tasks
- Use `onnxruntime-web` directly for custom models
- Use `@mlc-ai/web-llm` for LLM generation
- Always target WebGPU with WASM fallback
Compare download trends for these packages on PkgPulse.