
Best npm Packages for Web Scraping in 2026

PkgPulse Team

Playwright's weekly downloads now exceed 4 million. Cheerio consistently outpaces Puppeteer in weekly downloads despite being a simple HTML parser with no browser automation. Crawlee — Apify's framework that wraps Playwright, Puppeteer, and Cheerio — adds queuing, retry logic, proxy rotation, and anti-bot fingerprinting on top. In 2026, web scraping with Node.js has a clear hierarchy: use Cheerio for static HTML, Playwright for dynamic sites, and Crawlee when you're building a production crawler.

TL;DR

  • Cheerio for fast parsing of static HTML pages — 5x faster than a headless browser, tiny bundle, no browser needed.
  • Playwright for dynamic sites that require JavaScript execution, login flows, or complex browser interactions — better cross-browser support than Puppeteer.
  • Puppeteer for Chrome-specific automation and scraping — Google's library, battle-tested, built on the Chrome DevTools Protocol.
  • Crawlee when you need a full crawler with queuing, retries, proxy rotation, and anti-detection built in.

Most production scrapers in 2026 use Crawlee + Playwright.

Key Takeaways

  • Playwright: 4M+ weekly downloads, Microsoft, Chrome/Firefox/WebKit support, parallel execution
  • Puppeteer: 3M+ weekly downloads, Google, Chrome/Chromium-only, Chrome DevTools Protocol
  • Cheerio: 10M+ weekly downloads, lightweight HTML parser, no browser, jQuery-like API
  • Crawlee: Apify's framework combining all three — adds queuing, retry, proxy rotation, fingerprinting
  • Static sites: Cheerio is 5x+ faster than headless browsers, uses less memory
  • Dynamic sites: Playwright > Puppeteer for cross-browser, anti-detection, modern TypeScript API
  • Anti-bot: Crawlee automatically rotates browser fingerprints to avoid detection

The Web Scraping Stack

Different scraping needs require different tools:

Static HTML → Cheerio (fast, lightweight, no browser)
Dynamic JS → Playwright or Puppeteer (headless browser)
Production crawler → Crawlee (orchestration + anti-bot)
Simple HTTP → node-fetch / undici + Cheerio

Cheerio

Package: cheerio · Weekly downloads: 10M+ · GitHub stars: 28K · Creator: Matt Mueller, maintained by the Cheerio org

Cheerio parses HTML on the server with a jQuery-like API. It doesn't execute JavaScript — it's just an HTML parser. For static sites, it's dramatically faster than a headless browser.

Installation

npm install cheerio

Basic Usage

import * as cheerio from 'cheerio';

// Fetch HTML
const response = await fetch('https://news.ycombinator.com');
const html = await response.text();

// Load into Cheerio
const $ = cheerio.load(html);

// jQuery-like selectors
const stories = $('.athing').map((i, el) => {
  const title = $(el).find('.titleline a').text();
  const url = $(el).find('.titleline a').attr('href');
  const score = $(el).next().find('.score').text();

  return { title, url, score };
}).get();

console.log(stories);
// [{ title: 'Article Title', url: 'https://...', score: '142 points' }, ...]

Cheerio Performance

# Parsing 10MB HTML document:
Cheerio:           ~50ms
Puppeteer:         ~2000ms (browser startup + navigation)
Playwright:        ~1500ms (browser startup + navigation)

# Memory usage:
Cheerio:           ~15 MB
Headless Chrome:   ~150-300 MB

For static content, Cheerio is the obvious choice — no browser overhead, no startup time, runs in any Node.js process.

Scraping with Selectors

import * as cheerio from 'cheerio';

async function scrapeProductPage(url: string) {
  const response = await fetch(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
    },
  });
  const html = await response.text();
  const $ = cheerio.load(html);

  return {
    name: $('h1.product-title').text().trim(),
    price: $('[data-price]').attr('data-price'),
    description: $('meta[name="description"]').attr('content'),
    images: $('img.product-image').map((_, el) => $(el).attr('src')).get(),
    inStock: !$('.out-of-stock').length,
  };
}

Cheerio Limitations

  • No JavaScript execution — dynamic content rendered by React/Vue/Angular won't be there
  • No interactions (clicking, form submission, scrolling)
  • If the server returns different HTML to clients without JavaScript, Cheerio only ever sees that version

Playwright

Package: playwright · Weekly downloads: 4M+ · GitHub stars: 68K · Creator: Microsoft

Playwright is the modern standard for browser automation and scraping of JavaScript-heavy sites. It supports Chromium, Firefox, and WebKit (Safari's engine) from a single API.

Installation

npm install playwright
npx playwright install  # Downloads browser binaries
# Or install only specific browsers:
npx playwright install chromium

Basic Scraping

import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();

await page.goto('https://example.com');

// Wait for dynamic content to load
await page.waitForSelector('.product-list');

// Extract data
const products = await page.$$eval('.product-card', (cards) =>
  cards.map((card) => ({
    name: card.querySelector('.name')?.textContent?.trim(),
    price: card.querySelector('.price')?.textContent?.trim(),
    url: card.querySelector('a')?.href,
  }))
);

await browser.close();
console.log(products);

Handling Dynamic Content

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Handle infinite scroll
await page.goto('https://example.com/feed');

const items = [];
let previousHeight = 0;

while (true) {
  // Scroll to bottom
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(1000);  // Wait for content to load

  const currentHeight = await page.evaluate(() => document.body.scrollHeight);

  // Extract new items
  const newItems = await page.$$eval('.feed-item:not([data-scraped])', (els) => {
    return els.map((el) => {
      el.setAttribute('data-scraped', 'true');
      return { text: el.textContent?.trim(), href: el.querySelector('a')?.href };
    });
  });

  items.push(...newItems);

  if (currentHeight === previousHeight) break;  // No more content
  previousHeight = currentHeight;
}

await browser.close();

Login and Session

import { chromium } from 'playwright';

const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();

// Login
await page.goto('https://example.com/login');
await page.fill('input[name="email"]', 'user@example.com');
await page.fill('input[name="password"]', 'password123');
await page.click('button[type="submit"]');
await page.waitForURL('**/dashboard');

// Save auth state for reuse
await context.storageState({ path: 'auth.json' });
await browser.close();

// Reuse in subsequent scrapes:
const browser2 = await chromium.launch();
const authedContext = await browser2.newContext({ storageState: 'auth.json' });
const authedPage = await authedContext.newPage();
await authedPage.goto('https://example.com/protected-data');
// Already logged in

Parallel Scraping

import { chromium } from 'playwright';

const browser = await chromium.launch();

// Scrape multiple pages in parallel (multiple contexts)
const urls = ['https://example.com/page/1', 'https://example.com/page/2', /* ... */];

const results = await Promise.all(
  urls.map(async (url) => {
    const page = await browser.newPage();
    await page.goto(url);
    const data = await page.$eval('.content', (el) => el.textContent);
    await page.close();
    return { url, data };
  })
);

await browser.close();

Puppeteer

Package: puppeteer · Weekly downloads: 3M+ · GitHub stars: 88K · Creator: Google

Puppeteer is Google's Chrome automation library. It uses the Chrome DevTools Protocol and is Chromium/Chrome-only. If you need Chrome-specific features or are already invested in Puppeteer, it remains a solid choice.

Installation

npm install puppeteer  # Downloads bundled Chromium
# Or use installed Chrome:
npm install puppeteer-core

Basic Usage

import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// Set realistic user agent
await page.setUserAgent(
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
);

await page.goto('https://example.com', { waitUntil: 'networkidle0' });

const data = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('.item')).map((el) => ({
    text: el.textContent?.trim(),
    href: (el as HTMLAnchorElement).href,
  }));
});

await browser.close();

Playwright vs Puppeteer

Feature          Playwright          Puppeteer
Browsers         Chrome, Firefox,    Chrome/Chromium only
                 WebKit (Safari)
Auto-wait        Yes (built-in)      Manual (waitForSelector)
TypeScript       Excellent           Good
Anti-detection   Via stealth plugin  Via stealth plugin
Screenshot       Yes                 Yes
PDF generation   Yes                 Yes
Network mock     Yes (built-in)      Via DevTools Protocol
Active dev       Microsoft           Google

Playwright has surpassed Puppeteer as the default for new projects because of better cross-browser support and built-in auto-waiting.

Crawlee

Package: crawlee · GitHub stars: 16K · Creator: Apify

Crawlee is what you use when you need a production-grade crawler, not just a script. It wraps Playwright, Puppeteer, and Cheerio, adding the infrastructure for running at scale.

Installation

npm install crawlee playwright

PlaywrightCrawler

import { PlaywrightCrawler, ProxyConfiguration, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // Auto-rotate proxies
  proxyConfiguration: new ProxyConfiguration({
    proxyUrls: ['http://proxy1:8080', 'http://proxy2:8080'],
  }),

  // How many concurrent pages
  maxConcurrency: 10,

  // Retry failed requests
  maxRequestRetries: 3,

  async requestHandler({ request, page, enqueueLinks, log }) {
    log.info(`Scraping: ${request.url}`);

    const title = await page.title();
    const data = await page.$$eval('.product', (products) =>
      products.map((p) => ({
        name: p.querySelector('h2')?.textContent,
        price: p.querySelector('.price')?.textContent,
      }))
    );

    // Store results
    await Dataset.pushData({ url: request.url, title, data });

    // Discover and queue links
    await enqueueLinks({
      selector: 'a.next-page',
      label: 'PRODUCT_LIST',
    });
  },
});

// Start with initial URLs
await crawler.run(['https://example.com/products']);

Anti-Bot Fingerprinting

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // Crawlee automatically rotates:
  // - Browser fingerprints (canvas, WebGL, fonts)
  // - User agents
  // - Screen resolutions
  // - Timezone and locale
  // - Hardware concurrency

  browserPoolOptions: {
    useFingerprints: true,  // Rotates fingerprints automatically
    fingerprintOptions: {
      fingerprintGeneratorOptions: {
        browsers: ['chrome', 'firefox'],
        operatingSystems: ['windows', 'macos'],
      },
    },
  },

  async requestHandler({ page }) {
    // Looks like a real browser to anti-bot systems
  },
});

CheerioCrawler (Fast Static)

import { CheerioCrawler, Dataset } from 'crawlee';

// Use Cheerio for static sites — no browser overhead
const crawler = new CheerioCrawler({
  maxConcurrency: 50,  // High concurrency without browser overhead

  async requestHandler({ $, request, enqueueLinks }) {
    // Cheerio API
    const title = $('title').text();
    const links = $('a.product-link').map((_, el) => $(el).attr('href')).get();

    await Dataset.pushData({ url: request.url, title });

    // Queue discovered links
    await enqueueLinks({ selector: 'a.category-link' });
  },
});

await crawler.run(['https://example.com']);

Choosing Your Scraping Tool

Tool         JavaScript execution   Speed     Memory     Anti-bot
Cheerio      No                     Fastest   ~15 MB     None
Playwright   Yes                    Medium    ~200 MB    Via plugins
Puppeteer    Yes                    Medium    ~200 MB    Via plugins
Crawlee      Yes (wraps above)      Medium    ~200 MB    Built-in

Use Cheerio if:

  • The site's HTML is fully rendered server-side (no JS required)
  • You're scraping thousands of pages and performance matters
  • You want minimal dependencies

Use Playwright if:

  • The site uses React, Vue, Angular, or any client-side rendering
  • You need to interact (click, scroll, fill forms)
  • Cross-browser testing or Safari/WebKit support needed

Use Puppeteer if:

  • You're already invested in Puppeteer's API
  • Chrome-only is acceptable
  • You need Chrome DevTools Protocol features directly

Use Crawlee if:

  • Building a production crawler (not a one-off script)
  • Anti-bot detection is a concern
  • You need queuing, retries, and rate limiting built in
  • Deploying to Apify's cloud platform

Compare web scraping package downloads on PkgPulse.
