
Best npm Packages for Web Scraping in 2026

PkgPulse Team

Playwright now exceeds 4 million weekly downloads. Cheerio consistently outpaces Puppeteer in weekly downloads despite being a simple HTML parser with no browser automation. Crawlee — Apify's framework that wraps Playwright, Puppeteer, and Cheerio — adds queuing, retry logic, proxy rotation, and anti-bot fingerprinting on top. In 2026, web scraping with Node.js has a clear hierarchy: use Cheerio for static HTML, Playwright for dynamic sites, and Crawlee when you're building a production crawler.

TL;DR

  • Cheerio: fast parsing of static HTML pages. 5x faster than a headless browser, tiny bundle, no browser needed.
  • Playwright: dynamic sites that require JavaScript execution, login flows, or complex browser interactions. Better cross-browser support than Puppeteer.
  • Puppeteer: Chrome-specific automation and scraping. Google's library, battle-tested, built on the Chrome DevTools Protocol.
  • Crawlee: a full crawler with queuing, retries, proxy rotation, and anti-detection built in.

Most production scrapers in 2026 use Crawlee + Playwright.

Key Takeaways

  • Playwright: 4M+ weekly downloads, Microsoft, Chrome/Firefox/WebKit support, parallel execution
  • Puppeteer: 3M+ weekly downloads, Google, Chrome/Chromium-only, Chrome DevTools Protocol
  • Cheerio: 10M+ weekly downloads, lightweight HTML parser, no browser, jQuery-like API
  • Crawlee: Apify's framework combining all three — adds queuing, retry, proxy rotation, fingerprinting
  • Static sites: Cheerio is 5x+ faster than headless browsers, uses less memory
  • Dynamic sites: Playwright > Puppeteer for cross-browser, anti-detection, modern TypeScript API
  • Anti-bot: Crawlee automatically rotates browser fingerprints to avoid detection

The Web Scraping Stack

Different scraping needs require different tools:

Static HTML → Cheerio (fast, lightweight, no browser)
Dynamic JS → Playwright or Puppeteer (headless browser)
Production crawler → Crawlee (orchestration + anti-bot)
Simple HTTP → node-fetch / undici + Cheerio

Cheerio

Package: cheerio
Weekly downloads: 10M+
GitHub stars: 28K
Creator: Matt Mueller, maintained by the Cheerio org

Cheerio parses HTML on the server with a jQuery-like API. It doesn't execute JavaScript — it's just an HTML parser. For static sites, it's dramatically faster than a headless browser.

Installation

npm install cheerio

Basic Usage

import * as cheerio from 'cheerio';

// Fetch HTML
const response = await fetch('https://news.ycombinator.com');
const html = await response.text();

// Load into Cheerio
const $ = cheerio.load(html);

// jQuery-like selectors
const stories = $('.athing').map((i, el) => {
  const title = $(el).find('.titleline a').text();
  const url = $(el).find('.titleline a').attr('href');
  const score = $(el).next().find('.score').text();

  return { title, url, score };
}).get();

console.log(stories);
// [{ title: 'Article Title', url: 'https://...', score: '142 points' }, ...]

Cheerio Performance

# Parsing 10MB HTML document:
Cheerio:           ~50ms
Puppeteer:         ~2000ms (browser startup + navigation)
Playwright:        ~1500ms (browser startup + navigation)

# Memory usage:
Cheerio:           ~15 MB
Headless Chrome:   ~150-300 MB

For static content, Cheerio is the obvious choice — no browser overhead, no startup time, runs in any Node.js process.

Scraping with Selectors

import * as cheerio from 'cheerio';

async function scrapeProductPage(url: string) {
  const response = await fetch(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
    },
  });
  const html = await response.text();
  const $ = cheerio.load(html);

  return {
    name: $('h1.product-title').text().trim(),
    price: $('[data-price]').attr('data-price'),
    description: $('meta[name="description"]').attr('content'),
    images: $('img.product-image').map((_, el) => $(el).attr('src')).get(),
    inStock: !$('.out-of-stock').length,
  };
}

Cheerio Limitations

  • No JavaScript execution — dynamic content rendered by React/Vue/Angular won't be there
  • No interactions (clicking, form submission, scrolling)
  • If the site returns different HTML without JavaScript enabled, Cheerio will miss it

Playwright

Package: playwright
Weekly downloads: 4M+
GitHub stars: 68K
Creator: Microsoft

Playwright is the modern standard for browser automation and scraping of JavaScript-heavy sites. It supports Chromium, Firefox, and WebKit (Safari's engine) from a single API.

Installation

npm install playwright
npx playwright install  # Downloads browser binaries
# Or install only specific browsers:
npx playwright install chromium

Basic Scraping

import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();

await page.goto('https://example.com');

// Wait for dynamic content to load
await page.waitForSelector('.product-list');

// Extract data
const products = await page.$$eval('.product-card', (cards) =>
  cards.map((card) => ({
    name: card.querySelector('.name')?.textContent?.trim(),
    price: card.querySelector('.price')?.textContent?.trim(),
    url: card.querySelector('a')?.href,
  }))
);

await browser.close();
console.log(products);

Handling Dynamic Content

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Handle infinite scroll
await page.goto('https://example.com/feed');

const items = [];
let previousHeight = 0;

while (true) {
  // Scroll to bottom
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(1000);  // Wait for content to load

  const currentHeight = await page.evaluate(() => document.body.scrollHeight);

  // Extract new items
  const newItems = await page.$$eval('.feed-item:not([data-scraped])', (els) => {
    return els.map((el) => {
      el.setAttribute('data-scraped', 'true');
      return { text: el.textContent?.trim(), href: el.querySelector('a')?.href };
    });
  });

  items.push(...newItems);

  if (currentHeight === previousHeight) break;  // No more content
  previousHeight = currentHeight;
}

await browser.close();

Login and Session

import { chromium } from 'playwright';

const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();

// Login
await page.goto('https://example.com/login');
await page.fill('input[name="email"]', 'user@example.com');
await page.fill('input[name="password"]', 'password123');
await page.click('button[type="submit"]');
await page.waitForURL('**/dashboard');

// Save auth state for reuse
await context.storageState({ path: 'auth.json' });
await browser.close();

// Reuse in subsequent scrapes:
const browser2 = await chromium.launch();
const authedContext = await browser2.newContext({ storageState: 'auth.json' });
const authedPage = await authedContext.newPage();
await authedPage.goto('https://example.com/protected-data');
// Already logged in

Parallel Scraping

import { chromium } from 'playwright';

const browser = await chromium.launch();

// Scrape multiple pages in parallel (multiple contexts)
const urls = ['https://example.com/page/1', 'https://example.com/page/2', /* ... */];

const results = await Promise.all(
  urls.map(async (url) => {
    const page = await browser.newPage();
    await page.goto(url);
    const data = await page.$eval('.content', (el) => el.textContent);
    await page.close();
    return { url, data };
  })
);

await browser.close();

Puppeteer

Package: puppeteer
Weekly downloads: 3M+
GitHub stars: 88K
Creator: Google

Puppeteer is Google's Chrome automation library. It uses the Chrome DevTools Protocol and is Chromium/Chrome-only. If you need Chrome-specific features or are already invested in Puppeteer, it remains a solid choice.

Installation

npm install puppeteer  # Downloads bundled Chromium
# Or use installed Chrome:
npm install puppeteer-core

Basic Usage

import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// Set realistic user agent
await page.setUserAgent(
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
);

await page.goto('https://example.com', { waitUntil: 'networkidle0' });

const data = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('.item')).map((el) => ({
    text: el.textContent?.trim(),
    href: (el as HTMLAnchorElement).href,
  }));
});

await browser.close();

Playwright vs Puppeteer

Feature          Playwright          Puppeteer
Browsers         Chrome, Firefox,    Chrome/Chromium only
                 WebKit (Safari)
Auto-wait        Yes (built-in)      Manual (waitForSelector)
TypeScript       Excellent           Good
Anti-detection   Via stealth plugin  Via stealth plugin
Screenshot       Yes                 Yes
PDF generation   Yes                 Yes
Network mock     Yes (built-in)      Via DevTools Protocol
Active dev       Microsoft           Google

Playwright has surpassed Puppeteer as the default for new projects because of better cross-browser support and built-in auto-waiting.
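
To make the auto-wait distinction concrete, here is a deliberately simplified polling loop in the spirit of what Playwright does internally before every action. This is an illustration of the concept only, not Playwright's actual implementation:

```typescript
// Simplified auto-wait: poll a probe until it yields a value or the timeout expires.
// Playwright runs checks like this (visibility, stability, enabled state) before
// clicking or filling, which is why explicit waitForSelector calls are rarely needed.
async function waitFor<T>(
  probe: () => T | undefined,
  timeoutMs = 5_000,
  intervalMs = 100
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const value = probe();
    if (value !== undefined) return value;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`waitFor: condition not met within ${timeoutMs}ms`);
}

// Simulate an element that "appears" after 300ms of client-side rendering.
let element: string | undefined;
setTimeout(() => { element = '<button>Buy</button>'; }, 300);

const found = await waitFor(() => element, 2_000, 50);
console.log(found); // "<button>Buy</button>"
```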

Crawlee

Package: crawlee
GitHub stars: 16K
Creator: Apify

Crawlee is what you use when you need a production-grade crawler, not just a script. It wraps Playwright, Puppeteer, and Cheerio, adding the infrastructure for running at scale.

Installation

npm install crawlee playwright

PlaywrightCrawler

import { PlaywrightCrawler, ProxyConfiguration, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // Auto-rotate proxies
  proxyConfiguration: new ProxyConfiguration({
    proxyUrls: ['http://proxy1:8080', 'http://proxy2:8080'],
  }),

  // How many concurrent pages
  maxConcurrency: 10,

  // Retry failed requests
  maxRequestRetries: 3,

  async requestHandler({ request, page, enqueueLinks, log }) {
    log.info(`Scraping: ${request.url}`);

    const title = await page.title();
    const data = await page.$$eval('.product', (products) =>
      products.map((p) => ({
        name: p.querySelector('h2')?.textContent,
        price: p.querySelector('.price')?.textContent,
      }))
    );

    // Store results
    await Dataset.pushData({ url: request.url, title, data });

    // Discover and queue links
    await enqueueLinks({
      selector: 'a.next-page',
      label: 'PRODUCT_LIST',
    });
  },
});

// Start with initial URLs
await crawler.run(['https://example.com/products']);

Anti-Bot Fingerprinting

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // Crawlee automatically rotates:
  // - Browser fingerprints (canvas, WebGL, fonts)
  // - User agents
  // - Screen resolutions
  // - Timezone and locale
  // - Hardware concurrency

  browserPoolOptions: {
    useFingerprints: true,  // Rotates fingerprints automatically
    fingerprintOptions: {
      fingerprintGeneratorOptions: {
        browsers: ['chrome', 'firefox'],
        operatingSystems: ['windows', 'macos'],
      },
    },
  },

  async requestHandler({ page }) {
    // Looks like a real browser to anti-bot systems
  },
});

CheerioCrawler (Fast Static)

import { CheerioCrawler, Dataset } from 'crawlee';

// Use Cheerio for static sites — no browser overhead
const crawler = new CheerioCrawler({
  maxConcurrency: 50,  // High concurrency without browser overhead

  async requestHandler({ $, request, enqueueLinks }) {
    // Cheerio API
    const title = $('title').text();
    const links = $('a.product-link').map((_, el) => $(el).attr('href')).get();

    await Dataset.pushData({ url: request.url, title });

    // Queue discovered links
    await enqueueLinks({ selector: 'a.category-link' });
  },
});

await crawler.run(['https://example.com']);

Choosing Your Scraping Tool

Tool         JavaScript execution   Speed     Memory    Anti-bot
Cheerio      No                     Fastest   ~15 MB    None
Playwright   Yes                    Medium    ~200 MB   Via plugins
Puppeteer    Yes                    Medium    ~200 MB   Via plugins
Crawlee      Yes (wraps above)      Medium    ~200 MB   Built-in

Use Cheerio if:

  • The site's HTML is fully rendered server-side (no JS required)
  • You're scraping thousands of pages and performance matters
  • You want minimal dependencies

Use Playwright if:

  • The site uses React, Vue, Angular, or any client-side rendering
  • You need to interact (click, scroll, fill forms)
  • Cross-browser testing or Safari/WebKit support needed

Use Puppeteer if:

  • You're already invested in Puppeteer's API
  • Chrome-only is acceptable
  • You need Chrome DevTools Protocol features directly

Use Crawlee if:

  • Building a production crawler (not a one-off script)
  • Anti-bot detection is a concern
  • You need queuing, retries, and rate limiting built in
  • Deploying to Apify's cloud platform

Before deploying any web scraper into production, review the target site's robots.txt file and terms of service. Many sites explicitly prohibit automated scraping, and violating these terms can result in IP bans, legal action under the Computer Fraud and Abuse Act (in the US), or GDPR violations if you scrape personal data from EU users without a lawful basis. robots.txt is not legally enforceable but represents the site owner's stated preference — scraping pages marked Disallow is widely considered bad practice and increases the chance of your IP being blocked. For legitimate scraping use cases, implement polite crawl delays between requests (Crawlee does this automatically via its request queue), respect Crawl-delay directives, and identify your bot with a descriptive User-Agent string that includes contact information. Some sites offer official APIs that provide the same data more reliably and legally — always check for a public API before building a scraper.
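
A minimal robots.txt check might look like the sketch below. It handles only `User-agent: *` sections and prefix `Disallow` rules; real projects should use a dedicated parser such as the robots-parser package:

```typescript
// Minimal robots.txt check (sketch only): respects prefix Disallow rules
// under User-agent: *. Ignores wildcards, Allow overrides, and Crawl-delay.
function isAllowed(robotsTxt: string, path: string): boolean {
  const disallowed: string[] = [];
  let applies = false;
  for (const raw of robotsTxt.split('\n')) {
    const [key, ...rest] = raw.trim().split(':');
    const value = rest.join(':').trim();
    if (key.toLowerCase() === 'user-agent') applies = value === '*';
    else if (key.toLowerCase() === 'disallow' && applies && value) disallowed.push(value);
  }
  return !disallowed.some((prefix) => path.startsWith(prefix));
}

const robots = `User-agent: *\nDisallow: /admin\nDisallow: /private/`;
console.log(isAllowed(robots, '/products/42')); // true
console.log(isAllowed(robots, '/admin/users')); // false
```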

Anti-Bot Detection and Fingerprinting

Modern anti-bot systems like Cloudflare Bot Management, DataDome, and PerimeterX use browser fingerprinting signals far beyond IP address: WebGL renderer strings, canvas fingerprints, font detection, battery API availability, mouse movement patterns, and TLS fingerprinting at the network layer. Playwright by itself is detectable — Chromium exposes navigator.webdriver = true and other automation flags. The playwright-extra package with puppeteer-extra-plugin-stealth addresses many of these signals by patching the browser's JavaScript environment before page load. Crawlee goes further by rotating browser fingerprints automatically across requests using fingerprint-generator, which produces statistically realistic fingerprint combinations based on real browser populations. For the most heavily protected sites, residential proxy networks (Bright Data, Oxylabs, Smartproxy) rotate your requests through real user IP addresses, making them significantly harder to block at the IP level. Budget around $10–30 per GB for residential proxy bandwidth.
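
The rotation idea itself can be sketched in a few lines. The fingerprint fields and values below are illustrative placeholders; Crawlee's fingerprint-generator samples statistically realistic combinations from real browser populations rather than picking uniformly from small hand-written lists:

```typescript
// Illustration of fingerprint rotation only — not Crawlee's actual generator.
interface Fingerprint {
  userAgent: string;
  viewport: { width: number; height: number };
  timezone: string;
}

const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0 Safari/537.36',
];
const viewports = [{ width: 1920, height: 1080 }, { width: 1440, height: 900 }];
const timezones = ['America/New_York', 'Europe/Berlin'];

function pick<T>(arr: T[]): T {
  return arr[Math.floor(Math.random() * arr.length)];
}

// Each new browser context gets a fresh combination, e.g. in Playwright:
// browser.newContext({ userAgent: fp.userAgent, viewport: fp.viewport, timezoneId: fp.timezone })
function randomFingerprint(): Fingerprint {
  return { userAgent: pick(userAgents), viewport: pick(viewports), timezone: pick(timezones) };
}

const fp = randomFingerprint();
console.log(fp.timezone);
```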

Performance at Scale and Infrastructure Considerations

Scaling a scraper from a single-machine script to a production data pipeline introduces infrastructure challenges that Cheerio, Playwright, and Crawlee address differently. Cheerio's minimal resource footprint allows running hundreds of concurrent requests on a single EC2 instance — a t3.medium can comfortably handle 200+ concurrent Cheerio fetches. Headless browsers are fundamentally different: each Chromium instance consumes 150–300 MB of RAM and uses significant CPU for page rendering. On a machine with 8 GB of RAM, you can realistically run 20–25 concurrent Playwright pages before memory pressure causes instability. For production Playwright scrapers, containerize browser sessions (Playwright can attach to an externally launched browser via chromium.connectOverCDP), use autoscaling to spin up worker instances based on queue depth, and store results incrementally rather than accumulating them in memory. Crawlee's cloud deployment on Apify's platform handles this automatically — their Actor runtime scales workers on demand and charges per compute unit rather than requiring you to manage server fleets.
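
One way to enforce such a concurrency cap without pulling in a library is a small limiter in the spirit of the p-limit package. The worker function here is a stand-in (it just returns the URL's length) so the sketch stays self-contained; in practice it would open a Playwright page:

```typescript
// Concurrency limiter sketch: run at most `limit` workers at once so
// browser memory stays bounded. Workers share a cursor into the item list.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await worker(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, run));
  return results;
}

// Usage sketch: cap at 20 concurrent "scrapes" of 100 URLs.
const urls = Array.from({ length: 100 }, (_, i) => `https://example.com/page/${i}`);
const lengths = await mapWithLimit(urls, 20, async (url) => url.length);
console.log(lengths.length); // 100
```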

Data Quality, Deduplication, and Change Detection

Raw scraped data is rarely clean enough to use directly. HTML structures change without warning — a CSS class rename breaks your selectors silently, and your pipeline outputs empty strings or null values for days before anyone notices. Build selector health checks that assert scraped values are non-empty and match expected patterns (prices should be numeric, URLs should be valid), and alert on an anomalous drop in data volume. Crawlee's dataset storage provides built-in deduplication via the request queue's URL fingerprinting, preventing you from re-scraping the same page on every run. For change detection — tracking when a page's content updates — compute a hash of the normalized HTML on each scrape and compare it against the previously stored hash. Only process and store the full data when the hash changes. This dramatically reduces downstream processing costs when scraping frequently updated sites where most pages haven't changed between runs.
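
The hash-comparison approach can be sketched with Node's built-in crypto module. The normalization step here is deliberately crude (whitespace collapsing only); a real pipeline would also strip volatile markup such as timestamps, CSRF tokens, and ad slots before hashing:

```typescript
import { createHash } from 'node:crypto';

// Change-detection sketch: hash normalized HTML and only reprocess on change.
function contentHash(html: string): string {
  const normalized = html.replace(/\s+/g, ' ').trim();
  return createHash('sha256').update(normalized).digest('hex');
}

const previous = contentHash('<h1>Price: $10</h1>\n  ');
const current = contentHash('<h1>Price: $10</h1>');
const changed = contentHash('<h1>Price: $12</h1>');

console.log(previous === current); // true  — whitespace-only diff is ignored
console.log(previous === changed); // false — real content change is detected
```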

Compare web scraping package downloads on PkgPulse.

See also: Playwright vs Puppeteer and Cypress vs Playwright.
