Best npm Packages for Web Scraping in 2026
Playwright now tops 4 million downloads per week. Cheerio consistently outpaces Puppeteer in weekly downloads despite being a simple HTML parser with no browser automation. Crawlee — Apify's framework that wraps Playwright, Puppeteer, and Cheerio — layers queuing, retry logic, proxy rotation, and anti-bot fingerprinting on top. In 2026, web scraping with Node.js has a clear hierarchy: use Cheerio for static HTML, Playwright for dynamic sites, and Crawlee when you're building a production crawler.
TL;DR
Cheerio for fast parsing of static HTML pages — 5x faster than a headless browser, tiny bundle, no browser needed. Playwright for dynamic sites that require JavaScript execution, login flows, or complex browser interactions — better cross-browser support than Puppeteer. Puppeteer for Chrome-specific automation and scraping — Google's library, battle-tested, Chrome DevTools Protocol. Crawlee when you need a full crawler with queuing, retries, proxy rotation, and anti-detection built in. Most production scrapers in 2026 use Crawlee + Playwright.
Key Takeaways
- Playwright: 4M+ weekly downloads, Microsoft, Chrome/Firefox/WebKit support, parallel execution
- Puppeteer: 3M+ weekly downloads, Google, Chrome/Chromium-only, Chrome DevTools Protocol
- Cheerio: 10M+ weekly downloads, lightweight HTML parser, no browser, jQuery-like API
- Crawlee: Apify's framework combining all three — adds queuing, retry, proxy rotation, fingerprinting
- Static sites: Cheerio is 5x+ faster than headless browsers, uses less memory
- Dynamic sites: Playwright > Puppeteer for cross-browser, anti-detection, modern TypeScript API
- Anti-bot: Crawlee automatically rotates browser fingerprints to avoid detection
The Web Scraping Stack
Different scraping needs require different tools:
Static HTML → Cheerio (fast, lightweight, no browser)
Dynamic JS → Playwright or Puppeteer (headless browser)
Production crawler → Crawlee (orchestration + anti-bot)
Simple HTTP → node-fetch / undici + Cheerio
Cheerio
Package: cheerio
Weekly downloads: 10M+
GitHub stars: 28K
Creator: Matt Mueller, maintained by Cheerio org
Cheerio parses HTML on the server with a jQuery-like API. It doesn't execute JavaScript — it's just an HTML parser. For static sites, it's dramatically faster than a headless browser.
Installation
npm install cheerio
Basic Usage
import * as cheerio from 'cheerio';
// Fetch HTML
const response = await fetch('https://news.ycombinator.com');
const html = await response.text();
// Load into Cheerio
const $ = cheerio.load(html);
// jQuery-like selectors
const stories = $('.athing').map((i, el) => {
const title = $(el).find('.titleline a').text();
const url = $(el).find('.titleline a').attr('href');
const score = $(el).next().find('.score').text();
return { title, url, score };
}).get();
console.log(stories);
// [{ title: 'Article Title', url: 'https://...', score: '142 points' }, ...]
Cheerio Performance
# Parsing 10MB HTML document:
Cheerio: ~50ms
Puppeteer: ~2000ms (browser startup + navigation)
Playwright: ~1500ms (browser startup + navigation)
# Memory usage:
Cheerio: ~15 MB
Headless Chrome: ~150-300 MB
For static content, Cheerio is the obvious choice — no browser overhead, no startup time, runs in any Node.js process.
Scraping with Selectors
import * as cheerio from 'cheerio';
async function scrapeProductPage(url: string) {
const response = await fetch(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
},
});
const html = await response.text();
const $ = cheerio.load(html);
return {
name: $('h1.product-title').text().trim(),
price: $('[data-price]').attr('data-price'),
description: $('meta[name="description"]').attr('content'),
images: $('img.product-image').map((_, el) => $(el).attr('src')).get(),
inStock: !$('.out-of-stock').length,
};
}
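One gotcha with the `images` array above: scraped `src` and `href` values are often relative (`/img/1.jpg`, `thumb.png`). Resolve them against the page URL before storing, using the built-in WHATWG URL API — a small helper sketch; the example URLs are made up.

```typescript
// Resolve a possibly-relative href/src against the URL of the page it came from.
function absolutize(href: string, pageUrl: string): string {
  return new URL(href, pageUrl).toString();
}

console.log(absolutize('/img/1.jpg', 'https://shop.example.com/products/42'));
// → https://shop.example.com/img/1.jpg
console.log(absolutize('thumb.png', 'https://shop.example.com/products/42'));
// → https://shop.example.com/products/thumb.png
console.log(absolutize('https://cdn.example.com/a.jpg', 'https://shop.example.com/'));
// → https://cdn.example.com/a.jpg (already absolute, left unchanged)
```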
Cheerio Limitations
- No JavaScript execution — dynamic content rendered by React/Vue/Angular won't be there
- No interactions (clicking, form submission, scrolling)
- Sees only the raw HTML response — if the server sends a different (or empty) page to clients without JavaScript, that's all Cheerio gets
Playwright
Package: playwright
Weekly downloads: 4M+
GitHub stars: 68K
Creator: Microsoft
Playwright is the modern standard for browser automation and for scraping JavaScript-heavy sites. It drives Chromium, Firefox, and WebKit (Safari's engine) from a single API.
Installation
npm install playwright
npx playwright install # Downloads browser binaries
# Or install only specific browsers:
npx playwright install chromium
Basic Scraping
import { chromium } from 'playwright';
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example.com');
// Wait for dynamic content to load
await page.waitForSelector('.product-list');
// Extract data
const products = await page.$$eval('.product-card', (cards) =>
cards.map((card) => ({
name: card.querySelector('.name')?.textContent?.trim(),
price: card.querySelector('.price')?.textContent?.trim(),
url: card.querySelector('a')?.href,
}))
);
await browser.close();
console.log(products);
Handling Dynamic Content
import { chromium } from 'playwright';
const browser = await chromium.launch();
const page = await browser.newPage();
// Handle infinite scroll
await page.goto('https://example.com/feed');
const items = [];
let previousHeight = 0;
while (true) {
// Scroll to bottom
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await page.waitForTimeout(1000); // Wait for content to load
const currentHeight = await page.evaluate(() => document.body.scrollHeight);
// Extract new items
const newItems = await page.$$eval('.feed-item:not([data-scraped])', (els) => {
return els.map((el) => {
el.setAttribute('data-scraped', 'true');
return { text: el.textContent?.trim(), href: el.querySelector('a')?.href };
});
});
items.push(...newItems);
if (currentHeight === previousHeight) break; // No more content
previousHeight = currentHeight;
}
await browser.close();
Login and Session
import { chromium } from 'playwright';
const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
// Login
await page.goto('https://example.com/login');
await page.fill('input[name="email"]', 'user@example.com');
await page.fill('input[name="password"]', 'password123');
await page.click('button[type="submit"]');
await page.waitForURL('**/dashboard');
// Save auth state for reuse
await context.storageState({ path: 'auth.json' });
await browser.close();
// Reuse in subsequent scrapes:
const browser2 = await chromium.launch();
const authedContext = await browser2.newContext({ storageState: 'auth.json' });
const authedPage = await authedContext.newPage();
await authedPage.goto('https://example.com/protected-data');
// Already logged in
Parallel Scraping
import { chromium } from 'playwright';
const browser = await chromium.launch();
// Scrape multiple pages in parallel — one isolated context per URL
const urls = ['https://example.com/page/1', 'https://example.com/page/2', /* ... */];
const results = await Promise.all(
urls.map(async (url) => {
const context = await browser.newContext();
const page = await context.newPage();
await page.goto(url);
const data = await page.$eval('.content', (el) => el.textContent);
await context.close();
return { url, data };
})
);
await browser.close();
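`Promise.all` as written opens every URL at once; for long URL lists that will exhaust memory. A dependency-free way to cap concurrency — a sketch, where `mapWithConcurrency` is a hypothetical helper, not a Playwright API:

```typescript
// Run `fn` over `items` with at most `limit` tasks in flight at a time.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe to claim synchronously: JS is single-threaded
      results[i] = await fn(items[i]);
    }
  }
  // Spawn `limit` workers that pull indices until the list is drained.
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

// Usage with the scraper above might look like:
// const results = await mapWithConcurrency(urls, 5, scrapeOne);
const doubled = await mapWithConcurrency([1, 2, 3, 4, 5], 2, async (n) => n * 2);
console.log(doubled); // [2, 4, 6, 8, 10]
```

Crawlee's `maxConcurrency` option (below) does this for you, with rate limiting on top.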
Puppeteer
Package: puppeteer
Weekly downloads: 3M+
GitHub stars: 88K
Creator: Google
Puppeteer is Google's Chrome automation library. It uses the Chrome DevTools Protocol and is Chromium/Chrome-only. If you need Chrome-specific features or are already invested in Puppeteer, it remains a solid choice.
Installation
npm install puppeteer # Downloads bundled Chromium
# Or use installed Chrome:
npm install puppeteer-core
Basic Usage
import puppeteer from 'puppeteer';
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
// Set realistic user agent
await page.setUserAgent(
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
);
await page.goto('https://example.com', { waitUntil: 'networkidle0' });
const data = await page.evaluate(() => {
return Array.from(document.querySelectorAll('.item')).map((el) => ({
text: el.textContent?.trim(),
href: (el as HTMLAnchorElement).href,
}));
});
await browser.close();
Playwright vs Puppeteer
| Feature | Playwright | Puppeteer |
|---|---|---|
| Browsers | Chromium, Firefox, WebKit (Safari) | Chrome/Chromium only |
| Auto-wait | Yes (built-in) | Manual (waitForSelector) |
| TypeScript | Excellent | Good |
| Anti-detection | Via stealth plugin | Via stealth plugin |
| Screenshots | Yes | Yes |
| PDF generation | Yes | Yes |
| Network mocking | Yes (built-in) | Via DevTools Protocol |
| Active development | Microsoft | Google |
Playwright has surpassed Puppeteer as the default for new projects because of better cross-browser support and built-in auto-waiting.
Crawlee
Package: crawlee
GitHub stars: 16K
Creator: Apify
Crawlee is what you use when you need a production-grade crawler, not just a script. It wraps Playwright, Puppeteer, and Cheerio, adding the infrastructure for running at scale.
Installation
npm install crawlee playwright
PlaywrightCrawler
import { PlaywrightCrawler, ProxyConfiguration, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
// Auto-rotate proxies
proxyConfiguration: new ProxyConfiguration({
proxyUrls: ['http://proxy1:8080', 'http://proxy2:8080'],
}),
// How many concurrent pages
maxConcurrency: 10,
// Retry failed requests
maxRequestRetries: 3,
async requestHandler({ request, page, enqueueLinks, log }) {
log.info(`Scraping: ${request.url}`);
const title = await page.title();
const data = await page.$$eval('.product', (products) =>
products.map((p) => ({
name: p.querySelector('h2')?.textContent,
price: p.querySelector('.price')?.textContent,
}))
);
// Store results
await Dataset.pushData({ url: request.url, title, data });
// Discover and queue links
await enqueueLinks({
selector: 'a.next-page',
label: 'PRODUCT_LIST',
});
},
});
// Start with initial URLs
await crawler.run(['https://example.com/products']);
Anti-Bot Fingerprinting
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
// Crawlee automatically rotates:
// - Browser fingerprints (canvas, WebGL, fonts)
// - User agents
// - Screen resolutions
// - Timezone and locale
// - Hardware concurrency
browserPoolOptions: {
useFingerprints: true, // Rotates fingerprints automatically
fingerprintOptions: {
fingerprintGeneratorOptions: {
browsers: ['chrome', 'firefox'],
operatingSystems: ['windows', 'macos'],
},
},
},
async requestHandler({ page }) {
// Looks like a real browser to anti-bot systems
},
});
CheerioCrawler (Fast Static)
import { CheerioCrawler, Dataset } from 'crawlee';
// Use Cheerio for static sites — no browser overhead
const crawler = new CheerioCrawler({
maxConcurrency: 50, // High concurrency without browser overhead
async requestHandler({ $, request, enqueueLinks }) {
// Cheerio API
const title = $('title').text();
const links = $('a.product-link').map((_, el) => $(el).attr('href')).get();
await Dataset.pushData({ url: request.url, title });
// Queue discovered links
await enqueueLinks({ selector: 'a.category-link' });
},
});
await crawler.run(['https://example.com']);
Choosing Your Scraping Tool
| Tool | JavaScript execution | Speed | Memory | Anti-bot |
|---|---|---|---|---|
| Cheerio | No | Fastest | ~15 MB | None |
| Playwright | Yes | Medium | ~200 MB | Via plugins |
| Puppeteer | Yes | Medium | ~200 MB | Via plugins |
| Crawlee | Yes (wraps above) | Medium | ~200 MB | Built-in |
Use Cheerio if:
- The site's HTML is fully rendered server-side (no JS required)
- You're scraping thousands of pages and performance matters
- You want minimal dependencies
Use Playwright if:
- The site uses React, Vue, Angular, or any client-side rendering
- You need to interact (click, scroll, fill forms)
- Cross-browser testing or Safari/WebKit support needed
Use Puppeteer if:
- You're already invested in Puppeteer's API
- Chrome-only is acceptable
- You need Chrome DevTools Protocol features directly
Use Crawlee if:
- Building a production crawler (not a one-off script)
- Anti-bot detection is a concern
- You need queuing, retries, and rate limiting built in
- Deploying to Apify's cloud platform
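If you stick with plain Playwright or Cheerio instead, you end up hand-rolling the pieces Crawlee ships with. For instance, its `maxRequestRetries` corresponds to a retry-with-exponential-backoff wrapper like this — a sketch, where `withRetries` is a hypothetical helper:

```typescript
// Retry an async scrape step up to `maxRetries` times, doubling the delay
// between attempts (500ms, 1s, 2s, ...). Rethrows the last error on exhaustion.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}

// Simulate a request that fails twice before succeeding:
let calls = 0;
const value = await withRetries(async () => {
  calls++;
  if (calls < 3) throw new Error('flaky');
  return 'ok';
}, 3, 1);
console.log(value, calls); // ok 3
```

Multiply this by queuing, deduplication, proxy rotation, and fingerprinting, and the case for Crawlee in production becomes clear.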
Compare web scraping package downloads on PkgPulse.
View the live comparison of the best npm packages for web scraping →