Skip to main content

Best npm Packages for Web Scraping 2026: Crawlee vs Puppeteer vs Playwright

·PkgPulse Team

TL;DR

Crawlee (Apify) is the 2026 standard for production web scraping — it handles anti-bot fingerprinting, request queuing, retry logic, and session rotation out of the box. For simple page scraping: Playwright + Cheerio. For headless browser automation (not scraping): Playwright. For legacy projects: Puppeteer (no new features). For static HTML: Cheerio + node-fetch.

Key Takeaways

  • Crawlee: Full scraping framework, stealth mode, queue management, Playwright/Puppeteer runner
  • Playwright: Better than Puppeteer for scraping — multi-browser, better anti-detection
  • Puppeteer: Chrome-only, legacy, 8M downloads/week (inertia), no new features coming
  • Cheerio: HTML parsing only (no JS), fastest for static pages (~5M downloads/week)
  • 2026 trend: Anti-bot measures intensified — headless detection requires stealth plugins
  • Apify Cloud: Managed scraping infrastructure for Crawlee at scale

Downloads

PackageWeekly DownloadsTrend
puppeteer~8M→ Stable (legacy)
playwright~5M↑ Growing
cheerio~12M→ Stable
crawlee~200K↑ Growing
playwright-extra~300K↑ Growing

Crawlee: Production Scraping

npm install crawlee
# Or with specific crawler types:
npm install crawlee playwright
npm install crawlee puppeteer
// Full crawl with Crawlee + Playwright:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // Stealth mode enabled by default in Crawlee:
  launchContext: {
    launchOptions: {
      headless: true,
    },
  },
  
  // Rate limiting:
  maxRequestsPerCrawl: 100,
  maxConcurrency: 3,
  requestHandlerTimeoutSecs: 30,
  
  // Retry failed requests:
  maxRequestRetries: 3,
  
  async requestHandler({ request, page, enqueueLinks, log }) {
    log.info(`Scraping: ${request.url}`);

    // Extract data:
    const title = await page.title();
    const description = await page.$eval(
      'meta[name="description"]',
      (el) => el.getAttribute('content') ?? ''
    ).catch(() => '');

    // Extract product data:
    const products = await page.$$eval('.product-card', (cards) =>
      cards.map((card) => ({
        title: card.querySelector('h2')?.textContent?.trim() ?? '',
        price: card.querySelector('.price')?.textContent?.trim() ?? '',
        url: card.querySelector('a')?.href ?? '',
      }))
    );

    // Save to dataset (auto-persisted):
    await Dataset.pushData({
      url: request.url,
      title,
      description,
      products,
      scrapedAt: new Date().toISOString(),
    });

    // Follow pagination links:
    await enqueueLinks({
      selector: 'a.next-page',
      label: 'PAGINATION',
    });
  },

  failedRequestHandler({ request, log }) {
    log.error(`Failed to scrape: ${request.url}`);
  },
});

// Start crawl:
await crawler.run(['https://example.com/products']);

// Export results:
const dataset = await Dataset.open();
const { items } = await dataset.getData();
console.log(`Scraped ${items.length} items`);
// Crawlee HTTP crawler (no browser — 100x faster for static pages):
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  maxConcurrency: 20,  // Much higher — no browser overhead
  maxRequestsPerCrawl: 10000,

  async requestHandler({ $, request, enqueueLinks }) {
    // $ is Cheerio — same API as jQuery:
    const title = $('h1').first().text().trim();
    const links = $('a[href]').map((_, el) => $(el).attr('href')).get();

    await Dataset.pushData({ url: request.url, title, linkCount: links.length });

    // Discover and enqueue linked pages:
    await enqueueLinks({
      selector: 'a',
      baseUrl: new URL(request.url).origin,
      transformRequestFunction: (req) => {
        // Filter to same domain:
        if (!req.url.startsWith(new URL(request.url).origin)) return false;
        return req;
      },
    });
  },
});

await crawler.run(['https://example.com']);

Anti-Bot: Stealth Mode

// Playwright with stealth plugin (playwright-extra):
import { chromium } from 'playwright-extra';
import stealth from 'puppeteer-extra-plugin-stealth';

chromium.use(stealth());

const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  viewport: { width: 1920, height: 1080 },
  // Disable WebDriver flag:
  extraHTTPHeaders: {
    'Accept-Language': 'en-US,en;q=0.9',
  },
});

const page = await context.newPage();

// Override WebDriver detection:
await page.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
  Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
});

await page.goto('https://bot-detection-target.com');
const data = await page.$eval('#main-content', (el) => el.textContent);
await browser.close();
// Crawlee has built-in fingerprinting via @crawlee/browser-pool:
import { PlaywrightCrawler } from 'crawlee';
import { FingerprintGenerator } from 'fingerprint-generator';
import { FingerprintInjector } from 'fingerprint-injector';

const generator = new FingerprintGenerator({ browsers: ['chrome'], operatingSystems: ['windows', 'macos'] });
const injector = new FingerprintInjector();

const crawler = new PlaywrightCrawler({
  preNavigationHooks: [
    async ({ page }) => {
      const fingerprint = generator.getFingerprint();
      await injector.attachFingerprintToPlaywright(page, fingerprint);
    },
  ],
  requestHandler: async ({ page }) => {
    // Now scraping with randomized browser fingerprint
  },
});

Cheerio: Fast Static HTML Parsing

npm install cheerio undici
# undici is faster than node-fetch for many requests
// Cheerio for static pages (no JS rendering):
import { load } from 'cheerio';
import { fetch } from 'undici';

async function scrapeProductPage(url: string) {
  const response = await fetch(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
      'Accept': 'text/html,application/xhtml+xml',
    },
  });

  const html = await response.text();
  const $ = load(html);

  return {
    title: $('h1').first().text().trim(),
    price: $('.price, [data-price]').first().text().trim(),
    description: $('meta[name="description"]').attr('content') ?? '',
    images: $('img.product-image').map((_, el) => $(el).attr('src')).get(),
    inStock: $('.add-to-cart').length > 0,
  };
}

// Batch scraping with concurrency control:
import pLimit from 'p-limit';

const limit = pLimit(5);  // Max 5 concurrent requests

const urls = ['url1', 'url2', /* ... 1000 URLs */];
const results = await Promise.all(
  urls.map(url => limit(() => scrapeProductPage(url)))
);

Puppeteer vs Playwright for Scraping

Puppeteer (2026 status):
  ✅ 8M downloads/week (many legacy codebases)
  ✅ Chrome DevTools Protocol native
  ❌ Chrome/Chromium only
  ❌ No new scraping features
  ❌ Worse anti-detection than Playwright
  → Recommendation: Migrate to Playwright

Playwright (2026 status):
  ✅ Chrome, Firefox, Safari support
  ✅ Better anti-detection (less WebDriver artifacts)
  ✅ Better selector API (locators > $)
  ✅ Active development
  ✅ Built-in request interception
  → Recommendation: Use for browser scraping

Crawlee on Playwright (2026):
  ✅ All Playwright benefits
  ✅ + Queue management
  ✅ + Automatic retries
  ✅ + Session pool rotation
  ✅ + Built-in fingerprinting
  → Recommendation: Use for production crawling

Decision Guide

Use Crawlee if:
  → Crawling many pages (100+)
  → Need retry logic and queue management
  → Building a data pipeline or scraper product
  → Anti-bot is a concern
  → Need to scale (use with Apify Cloud)

Use Playwright if:
  → Scraping a few specific pages
  → Already using Playwright for testing
  → Need precise interaction (login, form fill, SPA)
  → One-off scraping tasks

Use Cheerio + undici if:
  → Pages are static HTML (no JS rendering needed)
  → Maximum performance (100+ req/sec possible)
  → Simple data extraction from known HTML structure

Use Puppeteer if:
  → Legacy codebase already using it
  → Don't have time to migrate to Playwright
  → Chrome-only is acceptable

Compare Crawlee, Puppeteer, and Playwright download trends on PkgPulse.

Comments

Stay Updated

Get the latest package insights, npm trends, and tooling tips delivered to your inbox.