Best npm Packages for Web Scraping 2026: Crawlee vs Puppeteer vs Playwright
By PkgPulse Team
TL;DR
Crawlee (Apify) is the 2026 standard for production web scraping — it handles anti-bot fingerprinting, request queuing, retry logic, and session rotation out of the box. For simple page scraping: Playwright + Cheerio. For headless browser automation (not scraping): Playwright. For legacy projects: Puppeteer (no new features). For static HTML: Cheerio + node-fetch.
Key Takeaways
- Crawlee: Full scraping framework, stealth mode, queue management, Playwright/Puppeteer runner
- Playwright: Better than Puppeteer for scraping — multi-browser, better anti-detection
- Puppeteer: Chrome-only, legacy, 8M downloads/week (inertia), no new features coming
- Cheerio: HTML parsing only (no JS), fastest for static pages (~12M downloads/week)
- 2026 trend: Anti-bot measures intensified — headless detection requires stealth plugins
- Apify Cloud: Managed scraping infrastructure for Crawlee at scale
Downloads
| Package | Weekly Downloads | Trend |
|---|---|---|
| puppeteer | ~8M | → Stable (legacy) |
| playwright | ~5M | ↑ Growing |
| cheerio | ~12M | → Stable |
| crawlee | ~200K | ↑ Growing |
| playwright-extra | ~300K | ↑ Growing |
Crawlee: Production Scraping
npm install crawlee
# Or with specific crawler types:
npm install crawlee playwright
npm install crawlee puppeteer
// Full crawl with Crawlee + Playwright:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
// Stealth mode enabled by default in Crawlee:
launchContext: {
launchOptions: {
headless: true,
},
},
// Rate limiting:
maxRequestsPerCrawl: 100,
maxConcurrency: 3,
requestHandlerTimeoutSecs: 30,
// Retry failed requests:
maxRequestRetries: 3,
async requestHandler({ request, page, enqueueLinks, log }) {
log.info(`Scraping: ${request.url}`);
// Extract data:
const title = await page.title();
const description = await page.$eval(
'meta[name="description"]',
(el) => el.getAttribute('content') ?? ''
).catch(() => '');
// Extract product data:
const products = await page.$$eval('.product-card', (cards) =>
cards.map((card) => ({
title: card.querySelector('h2')?.textContent?.trim() ?? '',
price: card.querySelector('.price')?.textContent?.trim() ?? '',
url: card.querySelector('a')?.href ?? '',
}))
);
// Save to dataset (auto-persisted):
await Dataset.pushData({
url: request.url,
title,
description,
products,
scrapedAt: new Date().toISOString(),
});
// Follow pagination links:
await enqueueLinks({
selector: 'a.next-page',
label: 'PAGINATION',
});
},
failedRequestHandler({ request, log }) {
log.error(`Failed to scrape: ${request.url}`);
},
});
// Start crawl:
await crawler.run(['https://example.com/products']);
// Export results:
const dataset = await Dataset.open();
const { items } = await dataset.getData();
console.log(`Scraped ${items.length} items`);
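Crawlee deduplicates queued requests by a `uniqueKey` it derives from the URL, which is why `enqueueLinks` can be called on every page without flooding the queue with duplicates. A simplified sketch of that kind of URL normalization (our own illustration, not Crawlee's exact algorithm):

```typescript
// Sketch: normalize URLs so trivially-different links dedupe to one queue entry.
// Simplified illustration — not Crawlee's exact uniqueKey computation.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.hash = ''; // fragments never reach the server
  // Sort query params so ?a=1&b=2 and ?b=2&a=1 collide:
  const params = [...url.searchParams.entries()].sort(([a], [b]) => a.localeCompare(b));
  url.search = new URLSearchParams(params).toString();
  // Drop trailing slash on non-root paths:
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

console.log(normalizeUrl('https://Example.com/products/?b=2&a=1#top'));
// https://example.com/products?a=1&b=2
```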
// Crawlee HTTP crawler (no browser — often 10x+ faster for static pages):
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
maxConcurrency: 20, // Much higher — no browser overhead
maxRequestsPerCrawl: 10000,
async requestHandler({ $, request, enqueueLinks }) {
// $ is Cheerio — same API as jQuery:
const title = $('h1').first().text().trim();
const links = $('a[href]').map((_, el) => $(el).attr('href')).get();
await Dataset.pushData({ url: request.url, title, linkCount: links.length });
// Discover and enqueue linked pages:
await enqueueLinks({
selector: 'a',
baseUrl: new URL(request.url).origin,
transformRequestFunction: (req) => {
// Filter to same origin (a startsWith prefix check can be fooled by lookalike hosts):
if (new URL(req.url).origin !== new URL(request.url).origin) return false;
return req;
},
});
},
});
await crawler.run(['https://example.com']);
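Origin checks like the one in `transformRequestFunction` are worth factoring out: a naive `url.startsWith(origin)` prefix check passes lookalike hosts such as `https://example.com.evil.io`, and malformed hrefs make `new URL` throw. A standalone sketch (`isSameOrigin` is our own name, not a Crawlee API):

```typescript
// Sketch: robust same-origin check for crawl filtering.
function isSameOrigin(candidateUrl: string, pageUrl: string): boolean {
  try {
    return new URL(candidateUrl).origin === new URL(pageUrl).origin;
  } catch {
    return false; // malformed href — skip it
  }
}

console.log(isSameOrigin('https://example.com/about', 'https://example.com/')); // true
console.log(isSameOrigin('https://cdn.example.com/img', 'https://example.com/')); // false (subdomain = different origin)
```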
Anti-Bot: Stealth Mode
// Playwright with stealth plugin (playwright-extra):
import { chromium } from 'playwright-extra';
import stealth from 'puppeteer-extra-plugin-stealth';
chromium.use(stealth());
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
viewport: { width: 1920, height: 1080 },
// Realistic request headers:
extraHTTPHeaders: {
'Accept-Language': 'en-US,en;q=0.9',
},
});
const page = await context.newPage();
// Extra overrides (the stealth plugin already covers most of these):
await page.addInitScript(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
});
await page.goto('https://bot-detection-target.com');
const data = await page.$eval('#main-content', (el) => el.textContent);
await browser.close();
// Crawlee generates and injects fingerprints automatically via its browser pool;
// the same fingerprint-suite packages can also be driven manually:
import { PlaywrightCrawler } from 'crawlee';
import { FingerprintGenerator } from 'fingerprint-generator';
import { FingerprintInjector } from 'fingerprint-injector';
const generator = new FingerprintGenerator({ browsers: ['chrome'], operatingSystems: ['windows', 'macos'] });
const injector = new FingerprintInjector();
const crawler = new PlaywrightCrawler({
preNavigationHooks: [
async ({ page }) => {
// getFingerprint() returns the fingerprint plus matching headers;
// the injector attaches to the browser context, not the page:
const fingerprintWithHeaders = generator.getFingerprint();
await injector.attachFingerprintToPlaywright(page.context(), fingerprintWithHeaders);
},
],
requestHandler: async ({ page }) => {
// Now scraping with randomized browser fingerprint
},
});
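Crawlee's session pool (the "session rotation" mentioned earlier) pairs each request with a reusable identity — cookies, proxy, user agent — and retires identities that start failing. A stripped-down sketch of just the rotation part (our own class, not Crawlee's API; error tracking and retirement are omitted):

```typescript
// Sketch: round-robin rotation over a small pool of scraping identities.
interface ScrapingSession {
  userAgent: string;
  // A real pool would also hold: proxy URL, cookie jar, error count, etc.
}

class SessionRotator {
  private index = 0;
  constructor(private readonly sessions: ScrapingSession[]) {}

  // Hand out sessions in order, wrapping around at the end:
  next(): ScrapingSession {
    const session = this.sessions[this.index];
    this.index = (this.index + 1) % this.sessions.length;
    return session;
  }
}

const rotator = new SessionRotator([
  { userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...' },
  { userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...' },
]);
```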
Cheerio: Fast Static HTML Parsing
npm install cheerio undici
# undici is faster than node-fetch for many requests
// Cheerio for static pages (no JS rendering):
import { load } from 'cheerio';
import { fetch } from 'undici';
async function scrapeProductPage(url: string) {
const response = await fetch(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
'Accept': 'text/html,application/xhtml+xml',
},
});
const html = await response.text();
const $ = load(html);
return {
title: $('h1').first().text().trim(),
price: $('.price, [data-price]').first().text().trim(),
description: $('meta[name="description"]').attr('content') ?? '',
images: $('img.product-image').map((_, el) => $(el).attr('src')).get(),
inStock: $('.add-to-cart').length > 0,
};
}
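`scrapeProductPage` returns the price as raw text. Turning that into a number is locale-sensitive ("$1,299.00" vs "€1.299,00"); a heuristic sketch, assuming the last `.` or `,` is the decimal separator:

```typescript
// Sketch: parse scraped price text into a number. Heuristic only —
// real-world price parsing needs locale context.
function parsePrice(text: string): number | null {
  const cleaned = text.replace(/[^\d.,]/g, '');
  if (!cleaned) return null;
  // Treat the last separator as the decimal point:
  const lastSep = Math.max(cleaned.lastIndexOf('.'), cleaned.lastIndexOf(','));
  if (lastSep === -1) return Number(cleaned);
  const intPart = cleaned.slice(0, lastSep).replace(/[.,]/g, '');
  const fracPart = cleaned.slice(lastSep + 1);
  return Number(`${intPart}.${fracPart}`);
}

console.log(parsePrice('$1,299.00')); // 1299
console.log(parsePrice('€1.299,00')); // 1299
```

Note the known failure mode: "1.299" with no decimal part parses as 1.299, so validate against expected price ranges.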
// Batch scraping with concurrency control:
import pLimit from 'p-limit';
const limit = pLimit(5); // Max 5 concurrent requests
const urls = ['url1', 'url2', /* ... 1000 URLs */];
const results = await Promise.all(
urls.map(url => limit(() => scrapeProductPage(url)))
);
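Unlike Crawlee's `maxRequestRetries`, the plain-fetch path has no retry logic. A minimal hand-rolled equivalent with exponential backoff (our own helper, not a library API):

```typescript
// Sketch: retry a flaky async operation with exponential backoff + jitter.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        // Backoff doubles each attempt: 500ms, 1s, 2s (+ up to 100ms jitter)
        const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}

// Usage: const page = await withRetries(() => scrapeProductPage(url));
```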
Puppeteer vs Playwright for Scraping
Puppeteer (2026 status):
✅ 8M downloads/week (many legacy codebases)
✅ Chrome DevTools Protocol native
❌ Chrome/Chromium only
❌ No new scraping features
❌ Worse anti-detection than Playwright
→ Recommendation: Migrate to Playwright
Playwright (2026 status):
✅ Chrome, Firefox, Safari support
✅ Better anti-detection (fewer WebDriver artifacts)
✅ Better selector API (locators > $)
✅ Active development
✅ Built-in request interception
→ Recommendation: Use for browser scraping
Crawlee on Playwright (2026):
✅ All Playwright benefits
✅ + Queue management
✅ + Automatic retries
✅ + Session pool rotation
✅ + Built-in fingerprinting
→ Recommendation: Use for production crawling
Decision Guide
Use Crawlee if:
→ Crawling many pages (100+)
→ Need retry logic and queue management
→ Building a data pipeline or scraper product
→ Anti-bot is a concern
→ Need to scale (use with Apify Cloud)
Use Playwright if:
→ Scraping a few specific pages
→ Already using Playwright for testing
→ Need precise interaction (login, form fill, SPA)
→ One-off scraping tasks
Use Cheerio + undici if:
→ Pages are static HTML (no JS rendering needed)
→ Maximum performance (100+ req/sec possible)
→ Simple data extraction from known HTML structure
Use Puppeteer if:
→ Legacy codebase already using it
→ Don't have time to migrate to Playwright
→ Chrome-only is acceptable
Compare Crawlee, Puppeteer, and Playwright download trends on PkgPulse.