Best npm Packages for Web Scraping in 2026
Playwright now exceeds 4 million npm downloads per week. Cheerio consistently outpaces Puppeteer in monthly downloads despite being a plain HTML parser with no browser automation. Crawlee — Apify's framework that wraps Playwright, Puppeteer, and Cheerio — layers queuing, retry logic, proxy rotation, and browser-fingerprint rotation on top. In 2026, web scraping with Node.js has a clear hierarchy: use Cheerio for static HTML, Playwright for dynamic sites, and Crawlee when you're building a production crawler.
TL;DR
Cheerio for fast parsing of static HTML pages — 5x faster than a headless browser, tiny bundle, no browser needed. Playwright for dynamic sites that require JavaScript execution, login flows, or complex browser interactions — better cross-browser support than Puppeteer. Puppeteer for Chrome-specific automation and scraping — Google's library, battle-tested, Chrome DevTools Protocol. Crawlee when you need a full crawler with queuing, retries, proxy rotation, and anti-detection built in. Most production scrapers in 2026 use Crawlee + Playwright.
Key Takeaways
- Playwright: 4M+ weekly downloads, Microsoft, Chrome/Firefox/WebKit support, parallel execution
- Puppeteer: 3M+ weekly downloads, Google, Chrome/Chromium-only, Chrome DevTools Protocol
- Cheerio: 10M+ weekly downloads, lightweight HTML parser, no browser, jQuery-like API
- Crawlee: Apify's framework combining all three — adds queuing, retry, proxy rotation, fingerprinting
- Static sites: Cheerio is 5x+ faster than headless browsers, uses less memory
- Dynamic sites: Playwright > Puppeteer for cross-browser, anti-detection, modern TypeScript API
- Anti-bot: Crawlee automatically rotates browser fingerprints to avoid detection
The Web Scraping Stack
Different scraping needs require different tools:
Static HTML → Cheerio (fast, lightweight, no browser)
Dynamic JS → Playwright or Puppeteer (headless browser)
Production crawler → Crawlee (orchestration + anti-bot)
Simple HTTP → node-fetch / undici + Cheerio
Cheerio
Package: cheerio
Weekly downloads: 10M+
GitHub stars: 28K
Creator: Matt Mueller, maintained by Cheerio org
Cheerio parses HTML on the server with a jQuery-like API. It doesn't execute JavaScript — it's just an HTML parser. For static sites, it's dramatically faster than a headless browser.
Installation
npm install cheerio
Basic Usage
import * as cheerio from 'cheerio';
// Fetch HTML
const response = await fetch('https://news.ycombinator.com');
const html = await response.text();
// Load into Cheerio
const $ = cheerio.load(html);
// jQuery-like selectors
const stories = $('.athing').map((i, el) => {
const title = $(el).find('.titleline a').text();
const url = $(el).find('.titleline a').attr('href');
const score = $(el).next().find('.score').text();
return { title, url, score };
}).get();
console.log(stories);
// [{ title: 'Article Title', url: 'https://...', score: '142 points' }, ...]
Cheerio Performance
# Parsing 10MB HTML document:
Cheerio: ~50ms
Puppeteer: ~2000ms (browser startup + navigation)
Playwright: ~1500ms (browser startup + navigation)
# Memory usage:
Cheerio: ~15 MB
Headless Chrome: ~150-300 MB
For static content, Cheerio is the obvious choice — no browser overhead, no startup time, runs in any Node.js process.
Scraping with Selectors
import * as cheerio from 'cheerio';
async function scrapeProductPage(url: string) {
const response = await fetch(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
},
});
const html = await response.text();
const $ = cheerio.load(html);
return {
name: $('h1.product-title').text().trim(),
price: $('[data-price]').attr('data-price'),
description: $('meta[name="description"]').attr('content'),
images: $('img.product-image').map((_, el) => $(el).attr('src')).get(),
inStock: !$('.out-of-stock').length,
};
}
Cheerio Limitations
- No JavaScript execution — dynamic content rendered by React/Vue/Angular won't be there
- No interactions (clicking, form submission, scrolling)
- If the server sends different (often skeletal) HTML to clients without JavaScript, Cheerio only ever sees that stripped-down version
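Before reaching for a headless browser, it can help to sniff whether a page is client-rendered at all. A rough heuristic sketch — the mount-point IDs and the text-length threshold here are assumptions for illustration, not a robust detector:

```typescript
// Heuristic: does this HTML look like an SPA shell that needs JavaScript?
function looksClientRendered(html: string): boolean {
  // A nearly-empty mount point like <div id="root"></div> is the classic SPA signature
  const emptyMount = /<div id="(root|app|__next)">\s*<\/div>/i.test(html);
  // Very little visible text relative to markup also suggests JS-rendered content
  const textLength = html.replace(/<[^>]*>/g, '').trim().length;
  return emptyMount || textLength < 200;
}

// A React shell like this parses fine in Cheerio but contains no data:
const spaShell =
  '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>';
console.log(looksClientRendered(spaShell)); // true — reach for Playwright instead
```

If the check fires, fall back to a headless browser for that URL; otherwise stay on the fast Cheerio path.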
Playwright
Package: playwright
Weekly downloads: 4M+
GitHub stars: 68K
Creator: Microsoft
Playwright is the modern standard for browser automation and scraping of JavaScript-heavy sites. It supports Chromium, Firefox, and WebKit (Safari's engine) from a single API.
Installation
npm install playwright
npx playwright install # Downloads browser binaries
# Or install only specific browsers:
npx playwright install chromium
Basic Scraping
import { chromium } from 'playwright';
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example.com');
// Wait for dynamic content to load
await page.waitForSelector('.product-list');
// Extract data
const products = await page.$$eval('.product-card', (cards) =>
cards.map((card) => ({
name: card.querySelector('.name')?.textContent?.trim(),
price: card.querySelector('.price')?.textContent?.trim(),
url: card.querySelector('a')?.href,
}))
);
await browser.close();
console.log(products);
Handling Dynamic Content
import { chromium } from 'playwright';
const browser = await chromium.launch();
const page = await browser.newPage();
// Handle infinite scroll
await page.goto('https://example.com/feed');
const items = [];
let previousHeight = 0;
while (true) {
// Scroll to bottom
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await page.waitForTimeout(1000); // Wait for content to load
const currentHeight = await page.evaluate(() => document.body.scrollHeight);
// Extract new items
const newItems = await page.$$eval('.feed-item:not([data-scraped])', (els) => {
return els.map((el) => {
el.setAttribute('data-scraped', 'true');
return { text: el.textContent?.trim(), href: el.querySelector('a')?.href };
});
});
items.push(...newItems);
if (currentHeight === previousHeight) break; // No more content
previousHeight = currentHeight;
}
await browser.close();
Login and Session
import { chromium } from 'playwright';
const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
// Login
await page.goto('https://example.com/login');
await page.fill('input[name="email"]', 'user@example.com');
await page.fill('input[name="password"]', 'password123');
await page.click('button[type="submit"]');
await page.waitForURL('**/dashboard');
// Save auth state for reuse
await context.storageState({ path: 'auth.json' });
await browser.close();
// Reuse in subsequent scrapes:
const browser2 = await chromium.launch();
const authedContext = await browser2.newContext({ storageState: 'auth.json' });
const authedPage = await authedContext.newPage();
await authedPage.goto('https://example.com/protected-data');
// Already logged in
Parallel Scraping
import { chromium } from 'playwright';
const browser = await chromium.launch();
// Scrape multiple URLs in parallel, one isolated context per URL
const urls = ['https://example.com/page/1', 'https://example.com/page/2', /* ... */];
const results = await Promise.all(
urls.map(async (url) => {
// Each context gets its own cookies and cache, so sessions stay isolated
const context = await browser.newContext();
const page = await context.newPage();
await page.goto(url);
const data = await page.$eval('.content', (el) => el.textContent);
await context.close();
return { url, data };
})
);
await browser.close();
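Promise.all with one page per URL is fine for a handful of pages, but firing off hundreds at once will exhaust memory. A minimal concurrency-limiter sketch in plain TypeScript — `mapWithLimit` is a hypothetical helper, hand-rolled here to show the idea that libraries like p-limit package up:

```typescript
// Run async tasks over `items` with at most `max` in flight at once.
async function mapWithLimit<T, R>(
  items: T[],
  max: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Start `max` workers that pull the next index from a shared counter
  const workers = Array.from({ length: Math.min(max, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  });
  await Promise.all(workers);
  return results;
}

// Usage sketch: scrape 100 URLs but never hold more than 5 pages open
// const results = await mapWithLimit(urls, 5, async (url) => { /* goto + extract */ });
```

Because JavaScript is single-threaded, the `next++` handoff between workers needs no locking; each worker simply grabs the next unclaimed index when its previous task finishes.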
Puppeteer
Package: puppeteer
Weekly downloads: 3M+
GitHub stars: 88K
Creator: Google
Puppeteer is Google's Chrome automation library. It uses the Chrome DevTools Protocol and is Chromium/Chrome-only. If you need Chrome-specific features or are already invested in Puppeteer, it remains a solid choice.
Installation
npm install puppeteer # Downloads bundled Chromium
# Or use installed Chrome:
npm install puppeteer-core
Basic Usage
import puppeteer from 'puppeteer';
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
// Set realistic user agent
await page.setUserAgent(
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
);
await page.goto('https://example.com', { waitUntil: 'networkidle0' });
const data = await page.evaluate(() => {
return Array.from(document.querySelectorAll('.item')).map((el) => ({
text: el.textContent?.trim(),
href: (el as HTMLAnchorElement).href,
}));
});
await browser.close();
Playwright vs Puppeteer
| Feature | Playwright | Puppeteer |
|---|---|---|
| Browsers | Chromium, Firefox, WebKit (Safari) | Chrome/Chromium only |
| Auto-wait | Yes (built-in) | Manual (waitForSelector) |
| TypeScript | Excellent | Good |
| Anti-detection | Via stealth plugin | Via stealth plugin |
| Screenshots | Yes | Yes |
| PDF generation | Yes | Yes |
| Network mocking | Yes (built-in) | Via DevTools Protocol |
| Active development | Microsoft | Google |
Playwright has surpassed Puppeteer as the default for new projects because of better cross-browser support and built-in auto-waiting.
Crawlee
Package: crawlee
GitHub stars: 16K
Creator: Apify
Crawlee is what you use when you need a production-grade crawler, not just a script. It wraps Playwright, Puppeteer, and Cheerio, adding the infrastructure for running at scale.
Installation
npm install crawlee playwright
PlaywrightCrawler
import { PlaywrightCrawler, ProxyConfiguration, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
// Auto-rotate proxies
proxyConfiguration: new ProxyConfiguration({
proxyUrls: ['http://proxy1:8080', 'http://proxy2:8080'],
}),
// How many concurrent pages
maxConcurrency: 10,
// Retry failed requests
maxRequestRetries: 3,
async requestHandler({ request, page, enqueueLinks, log }) {
log.info(`Scraping: ${request.url}`);
const title = await page.title();
const data = await page.$$eval('.product', (products) =>
products.map((p) => ({
name: p.querySelector('h2')?.textContent,
price: p.querySelector('.price')?.textContent,
}))
);
// Store results
await Dataset.pushData({ url: request.url, title, data });
// Discover and queue links
await enqueueLinks({
selector: 'a.next-page',
label: 'PRODUCT_LIST',
});
},
});
// Start with initial URLs
await crawler.run(['https://example.com/products']);
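Crawlee's maxRequestRetries handles transient failures for you. If you're scraping with plain fetch instead, the same idea fits in a few lines — a sketch of retry with exponential backoff, where the default delays are arbitrary assumptions:

```typescript
// Retry an async operation with exponential backoff between attempts.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      // 500ms, 1s, 2s, ... — back off so a struggling server can recover
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw lastError;
}

// Usage sketch:
// const html = await withRetries(() => fetch(url).then((r) => r.text()));
```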
Anti-Bot Fingerprinting
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
// Crawlee automatically rotates:
// - Browser fingerprints (canvas, WebGL, fonts)
// - User agents
// - Screen resolutions
// - Timezone and locale
// - Hardware concurrency
browserPoolOptions: {
useFingerprints: true, // Rotates fingerprints automatically
fingerprintOptions: {
fingerprintGeneratorOptions: {
browsers: ['chrome', 'firefox'],
operatingSystems: ['windows', 'macos'],
},
},
},
async requestHandler({ page }) {
// Looks like a real browser to anti-bot systems
},
});
CheerioCrawler (Fast Static)
import { CheerioCrawler, Dataset } from 'crawlee';
// Use Cheerio for static sites — no browser overhead
const crawler = new CheerioCrawler({
maxConcurrency: 50, // High concurrency without browser overhead
async requestHandler({ $, request, enqueueLinks }) {
// Cheerio API
const title = $('title').text();
const links = $('a.product-link').map((_, el) => $(el).attr('href')).get();
await Dataset.pushData({ url: request.url, title, links });
// Queue discovered links
await enqueueLinks({ selector: 'a.category-link' });
},
});
await crawler.run(['https://example.com']);
Choosing Your Scraping Tool
| Tool | JavaScript execution | Speed | Memory | Anti-bot |
|---|---|---|---|---|
| Cheerio | No | Fastest | ~15 MB | None |
| Playwright | Yes | Medium | ~200 MB | Via plugins |
| Puppeteer | Yes | Medium | ~200 MB | Via plugins |
| Crawlee | Yes (wraps above) | Medium | ~200 MB | Built-in |
Use Cheerio if:
- The site's HTML is fully rendered server-side (no JS required)
- You're scraping thousands of pages and performance matters
- You want minimal dependencies
Use Playwright if:
- The site uses React, Vue, Angular, or any client-side rendering
- You need to interact (click, scroll, fill forms)
- Cross-browser testing or Safari/WebKit support needed
Use Puppeteer if:
- You're already invested in Puppeteer's API
- Chrome-only is acceptable
- You need Chrome DevTools Protocol features directly
Use Crawlee if:
- Building a production crawler (not a one-off script)
- Anti-bot detection is a concern
- You need queuing, retries, and rate limiting built in
- Deploying to Apify's cloud platform
Legal, Ethical, and robots.txt Considerations
Before deploying any web scraper into production, review the target site's robots.txt file and terms of service. Many sites explicitly prohibit automated scraping, and violating these terms can result in IP bans, legal action under the Computer Fraud and Abuse Act (in the US), or GDPR violations if you scrape personal data from EU users without a lawful basis. robots.txt is not legally enforceable but represents the site owner's stated preference — scraping pages marked Disallow is widely considered bad practice and increases the chance of your IP being blocked. For legitimate scraping use cases, implement polite crawl delays between requests (Crawlee does this automatically via its request queue), respect Crawl-delay directives, and identify your bot with a descriptive User-Agent string that includes contact information. Some sites offer official APIs that provide the same data more reliably and legally — always check for a public API before building a scraper.
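Checking a path against robots.txt Disallow rules is simple for the common case. A minimal sketch — it handles only plain prefix matching, ignoring wildcards, Allow precedence, and Crawl-delay; for production use a tested parser such as the robots-parser package:

```typescript
// Extract Disallow prefixes that apply to `userAgent` (or '*') from robots.txt.
function disallowedPaths(robotsTxt: string, userAgent = '*'): string[] {
  const paths: string[] = [];
  let applies = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.split('#')[0].trim(); // strip comments
    const [field, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(field.trim())) {
      applies = value === '*' || value.toLowerCase() === userAgent.toLowerCase();
    } else if (applies && /^disallow$/i.test(field.trim()) && value) {
      paths.push(value);
    }
  }
  return paths;
}

function isAllowed(robotsTxt: string, path: string, userAgent = '*'): boolean {
  return !disallowedPaths(robotsTxt, userAgent).some((p) => path.startsWith(p));
}
```

Fetch `https://example.com/robots.txt` once per host, cache the parsed rules, and gate every request through `isAllowed` before enqueueing it.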
Anti-Bot Detection and Fingerprinting
Modern anti-bot systems like Cloudflare Bot Management, DataDome, and PerimeterX use browser fingerprinting signals far beyond IP address: WebGL renderer strings, canvas fingerprints, font detection, battery API availability, mouse movement patterns, and TLS fingerprinting at the network layer. Playwright by itself is detectable — Chromium exposes navigator.webdriver = true and other automation flags. The playwright-extra package with puppeteer-extra-plugin-stealth addresses many of these signals by patching the browser's JavaScript environment before page load. Crawlee goes further by rotating browser fingerprints automatically across requests using fingerprint-generator, which produces statistically realistic fingerprint combinations based on real browser populations. For the most heavily protected sites, residential proxy networks (Bright Data, Oxylabs, Smartproxy) rotate your requests through real user IP addresses, making them significantly harder to block at the IP level. Budget around $10–30 per GB for residential proxy bandwidth.
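Even without a full fingerprint generator, rotating the most obvious signal per request avoids the trivially blockable pattern of one static User-Agent. A sketch that cycles through a pool — the strings below are illustrative; real pools should stay current and be weighted to match actual browser populations, which is what Crawlee's fingerprint-generator does statistically:

```typescript
// Illustrative pool — keep real pools up to date with current browser versions.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0',
];

// Returns a function that hands out the next User-Agent in round-robin order.
function createUserAgentRotator(pool: string[]) {
  let i = 0;
  return () => pool[i++ % pool.length];
}

const nextUserAgent = createUserAgentRotator(USER_AGENTS);
// Usage sketch with fetch:
// const res = await fetch(url, { headers: { 'User-Agent': nextUserAgent() } });
```

Note that User-Agent rotation alone will not defeat fingerprint-based systems — the header must stay consistent with the rest of the browser's observable environment, which is why full fingerprint rotation matters.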
Performance at Scale and Infrastructure Considerations
Scaling a scraper from a single-machine script to a production data pipeline introduces infrastructure challenges that Cheerio, Playwright, and Crawlee address differently. Cheerio's minimal resource footprint allows running hundreds of concurrent requests on a single EC2 instance — a t3.medium can comfortably handle 200+ concurrent Cheerio fetches. Headless browsers are fundamentally different: each Chromium instance consumes 150–300 MB of RAM and uses significant CPU for page rendering. On a machine with 8 GB of RAM, you can realistically run 20–25 concurrent Playwright pages before memory pressure causes instability. For production Playwright scrapers, containerize each browser session (Playwright supports --remote-debugging-port for external browser connections), use autoscaling to spin up worker instances based on queue depth, and store results incrementally rather than accumulating them in memory. Crawlee's cloud deployment on Apify's platform handles this automatically — their Actor runtime scales workers on demand and charges per compute unit rather than requiring you to manage server fleets.
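The "store results incrementally" advice can be as simple as appending newline-delimited JSON as each page completes, instead of accumulating everything in one array. A sketch using Node's built-in fs module — the file path and record shape are arbitrary:

```typescript
import { appendFileSync, readFileSync } from 'node:fs';

// Append one scraped record per line (NDJSON) so memory stays flat
// no matter how many pages the crawl visits.
function storeRecord(path: string, record: object): void {
  appendFileSync(path, JSON.stringify(record) + '\n');
}

// Read all records back, e.g. for a post-crawl processing step.
function loadRecords(path: string): object[] {
  return readFileSync(path, 'utf8')
    .split('\n')
    .filter(Boolean)
    .map((line) => JSON.parse(line));
}
```

NDJSON also survives crashes gracefully: every line written before the failure is still valid JSON, so a restarted crawl can resume from the last stored record.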
Data Quality, Deduplication, and Change Detection
Raw scraped data is rarely clean enough to use directly. HTML structures change without warning — a CSS class rename breaks your selectors silently, and your pipeline outputs empty strings or null values for days before anyone notices. Build selector health checks that assert scraped values are non-empty and match expected patterns (prices should be numeric, URLs should be valid), and alert on anomalous drop in data volume. Crawlee's dataset storage provides built-in deduplication via the request queue's URL fingerprinting, preventing you from re-scraping the same page on every run. For change detection — tracking when a page's content updates — compute a hash of the normalized HTML on each scrape and compare against your previously stored hash. Only process and store the full data when the hash changes. This dramatically reduces downstream processing costs when scraping frequently-updated sites where most pages haven't changed between runs.
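The hash-based change detection described above fits in a few lines with Node's built-in crypto module. A sketch — the normalization step here (collapsing whitespace) is a minimal assumption; real pipelines typically also strip timestamps, CSRF tokens, and ad markup before hashing:

```typescript
import { createHash } from 'node:crypto';

// Normalize volatile whitespace, then hash. If the hash matches the stored
// value, the page hasn't meaningfully changed — skip reprocessing it.
function contentHash(html: string): string {
  const normalized = html.replace(/\s+/g, ' ').trim();
  return createHash('sha256').update(normalized).digest('hex');
}

function hasChanged(html: string, previousHash: string | null): boolean {
  return contentHash(html) !== previousHash;
}
```

Store each URL's latest hash alongside its scraped data; on the next run, call `hasChanged` before running the full extraction and storage pipeline.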
Compare web scraping package downloads on PkgPulse.
See also: Playwright vs Puppeteer, Cypress vs Playwright, and Best npm Packages for Web Scraping 2026.
See the live comparison
View the best npm packages for web scraping on PkgPulse →