<!-- PkgPulse AI-readable guide source -->
<!-- Canonical: https://www.pkgpulse.com/guides/best-npm-packages-web-scraping-2026 -->
<!-- Raw Markdown: https://www.pkgpulse.com/guides/best-npm-packages-web-scraping-2026/raw.md -->
<!-- Source path: content/guides/best-npm-packages-web-scraping-2026.mdx -->

---
og_image: "/images/guides/best-npm-packages-web-scraping-2026.webp"
title: "Best npm Packages for Web Scraping in 2026"
description: "Best npm packages for web scraping in 2026. Playwright, Puppeteer, Cheerio, and Crawlee compared — browser automation, HTML parsing, anti-bot, and which tool."
date: "2026-03-08"
author: "PkgPulse Team"
tags: ["playwright", "puppeteer", "cheerio", "crawlee", "web-scraping", "nodejs", "2026"]
featured_comparison: "best-npm-packages-web-scraping"
---

Playwright downloads exceeded 4 million per week. Cheerio consistently outpaces Puppeteer in monthly downloads despite being a simple HTML parser with no browser automation. Crawlee — Apify's framework that wraps Playwright, Puppeteer, and Cheerio — adds queuing, retry logic, proxy rotation, and anti-bot fingerprinting on top. In 2026, web scraping with Node.js has a clear hierarchy: use Cheerio for static HTML, Playwright for dynamic sites, and Crawlee when you're building a production crawler.

## TL;DR

**Cheerio** for fast parsing of static HTML pages — 5x faster than a headless browser, tiny bundle, no browser needed. **Playwright** for dynamic sites that require JavaScript execution, login flows, or complex browser interactions — better cross-browser support than Puppeteer. **Puppeteer** for Chrome-specific automation and scraping — Google's library, battle-tested, Chrome DevTools Protocol. **Crawlee** when you need a full crawler with queuing, retries, proxy rotation, and anti-detection built in. Most production scrapers in 2026 use Crawlee + Playwright.

## Key Takeaways

- Playwright: 4M+ weekly downloads, Microsoft, Chrome/Firefox/WebKit support, parallel execution
- Puppeteer: 3M+ weekly downloads, Google, Chrome/Chromium-only, Chrome DevTools Protocol
- Cheerio: 10M+ weekly downloads, lightweight HTML parser, no browser, jQuery-like API
- Crawlee: Apify's framework combining all three — adds queuing, retry, proxy rotation, fingerprinting
- Static sites: Cheerio is 5x+ faster than headless browsers, uses less memory
- Dynamic sites: Playwright > Puppeteer for cross-browser, anti-detection, modern TypeScript API
- Anti-bot: Crawlee automatically rotates browser fingerprints to avoid detection

## The Web Scraping Stack

Different scraping needs require different tools:

```
Static HTML → Cheerio (fast, lightweight, no browser)
Dynamic JS → Playwright or Puppeteer (headless browser)
Production crawler → Crawlee (orchestration + anti-bot)
Simple HTTP → node-fetch / undici + Cheerio
```

## Cheerio

**Package**: `cheerio`
**Weekly downloads**: 10M+
**GitHub stars**: 28K
**Creator**: Matt Mueller, maintained by Cheerio org

Cheerio parses HTML on the server with a jQuery-like API. It doesn't execute JavaScript — it's just an HTML parser. For static sites, it's dramatically faster than a headless browser.

### Installation

```bash
npm install cheerio
```

### Basic Usage

```typescript
import * as cheerio from 'cheerio';

// Fetch HTML
const response = await fetch('https://news.ycombinator.com');
const html = await response.text();

// Load into Cheerio
const $ = cheerio.load(html);

// jQuery-like selectors
const stories = $('.athing').map((i, el) => {
  const title = $(el).find('.titleline a').text();
  const url = $(el).find('.titleline a').attr('href');
  const score = $(el).next().find('.score').text();

  return { title, url, score };
}).get();

console.log(stories);
// [{ title: 'Article Title', url: 'https://...', score: '142 points' }, ...]
```

### Cheerio Performance

```bash
# Parsing 10MB HTML document:
Cheerio:           ~50ms
Puppeteer:         ~2000ms (browser startup + navigation)
Playwright:        ~1500ms (browser startup + navigation)

# Memory usage:
Cheerio:           ~15 MB
Headless Chrome:   ~150-300 MB
```

For static content, Cheerio is the obvious choice — no browser overhead, no startup time, runs in any Node.js process.

### Scraping with Selectors

```typescript
import * as cheerio from 'cheerio';

async function scrapeProductPage(url: string) {
  const response = await fetch(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
    },
  });
  const html = await response.text();
  const $ = cheerio.load(html);

  return {
    name: $('h1.product-title').text().trim(),
    price: $('[data-price]').attr('data-price'),
    description: $('meta[name="description"]').attr('content'),
    images: $('img.product-image').map((_, el) => $(el).attr('src')).get(),
    inStock: !$('.out-of-stock').length,
  };
}
```

### Cheerio Limitations

- No JavaScript execution — dynamic content rendered by React/Vue/Angular won't be there
- No interactions (clicking, form submission, scrolling)
- If the site returns different HTML without JavaScript enabled, Cheerio will miss it

## Playwright

**Package**: `playwright`
**Weekly downloads**: 4M+
**GitHub stars**: 68K
**Creator**: Microsoft

Playwright is the modern standard for browser automation and scraping of JavaScript-heavy sites. It supports Chromium, Firefox, and WebKit (Safari's engine) from a single API.

### Installation

```bash
npm install playwright
npx playwright install  # Downloads browser binaries
# Or install only specific browsers:
npx playwright install chromium
```

### Basic Scraping

```typescript
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();

await page.goto('https://example.com');

// Wait for dynamic content to load
await page.waitForSelector('.product-list');

// Extract data
const products = await page.$$eval('.product-card', (cards) =>
  cards.map((card) => ({
    name: card.querySelector('.name')?.textContent?.trim(),
    price: card.querySelector('.price')?.textContent?.trim(),
    url: card.querySelector('a')?.href,
  }))
);

await browser.close();
console.log(products);
```

### Handling Dynamic Content

```typescript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Handle infinite scroll
await page.goto('https://example.com/feed');

const items = [];
let previousHeight = 0;

while (true) {
  // Scroll to bottom
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(1000);  // Wait for content to load

  const currentHeight = await page.evaluate(() => document.body.scrollHeight);

  // Extract new items
  const newItems = await page.$$eval('.feed-item:not([data-scraped])', (els) => {
    return els.map((el) => {
      el.setAttribute('data-scraped', 'true');
      return { text: el.textContent?.trim(), href: el.querySelector('a')?.href };
    });
  });

  items.push(...newItems);

  if (currentHeight === previousHeight) break;  // No more content
  previousHeight = currentHeight;
}

await browser.close();
```

### Login and Session

```typescript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();

// Login
await page.goto('https://example.com/login');
await page.fill('input[name="email"]', 'user@example.com');
await page.fill('input[name="password"]', 'password123');
await page.click('button[type="submit"]');
await page.waitForURL('**/dashboard');

// Save auth state for reuse
await context.storageState({ path: 'auth.json' });
await browser.close();

// Reuse in subsequent scrapes:
const browser2 = await chromium.launch();
const authedContext = await browser2.newContext({ storageState: 'auth.json' });
const authedPage = await authedContext.newPage();
await authedPage.goto('https://example.com/protected-data');
// Already logged in
```

### Parallel Scraping

```typescript
import { chromium } from 'playwright';

const browser = await chromium.launch();

// Scrape multiple pages in parallel (multiple contexts)
const urls = ['https://example.com/page/1', 'https://example.com/page/2', /* ... */];

const results = await Promise.all(
  urls.map(async (url) => {
    const page = await browser.newPage();
    await page.goto(url);
    const data = await page.$eval('.content', (el) => el.textContent);
    await page.close();
    return { url, data };
  })
);

await browser.close();
```

## Puppeteer

**Package**: `puppeteer`
**Weekly downloads**: 3M+
**GitHub stars**: 88K
**Creator**: Google

Puppeteer is Google's Chrome automation library. It uses the Chrome DevTools Protocol and is Chromium/Chrome-only. If you need Chrome-specific features or are already invested in Puppeteer, it remains a solid choice.

### Installation

```bash
npm install puppeteer  # Downloads bundled Chromium
# Or use installed Chrome:
npm install puppeteer-core
```

### Basic Usage

```typescript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// Set realistic user agent
await page.setUserAgent(
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
);

await page.goto('https://example.com', { waitUntil: 'networkidle0' });

const data = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('.item')).map((el) => ({
    text: el.textContent?.trim(),
    href: (el as HTMLAnchorElement).href,
  }));
});

await browser.close();
```

### Playwright vs Puppeteer

```
Feature          Playwright          Puppeteer
Browsers         Chrome, Firefox,    Chrome/Chromium only
                 WebKit (Safari)
Auto-wait        Yes (built-in)      Manual (waitForSelector)
TypeScript       Excellent           Good
Anti-detection   Via stealth plugin  Via stealth plugin
Screenshot       Yes                 Yes
PDF generation   Yes                 Yes
Network mock     Yes (built-in)      Via DevTools Protocol
Active dev       Microsoft           Google
```

Playwright has surpassed Puppeteer as the default for new projects because of better cross-browser support and built-in auto-waiting.

## Crawlee

**Package**: `crawlee`
**GitHub stars**: 16K
**Creator**: Apify

Crawlee is what you use when you need a production-grade crawler, not just a script. It wraps Playwright, Puppeteer, and Cheerio, adding the infrastructure for running at scale.

### Installation

```bash
npm install crawlee playwright
```

### PlaywrightCrawler

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // Auto-rotate proxies
  proxyConfiguration: new ProxyConfiguration({
    proxyUrls: ['http://proxy1:8080', 'http://proxy2:8080'],
  }),

  // How many concurrent pages
  maxConcurrency: 10,

  // Retry failed requests
  maxRequestRetries: 3,

  async requestHandler({ request, page, enqueueLinks, log }) {
    log.info(`Scraping: ${request.url}`);

    const title = await page.title();
    const data = await page.$$eval('.product', (products) =>
      products.map((p) => ({
        name: p.querySelector('h2')?.textContent,
        price: p.querySelector('.price')?.textContent,
      }))
    );

    // Store results
    await Dataset.pushData({ url: request.url, title, data });

    // Discover and queue links
    await enqueueLinks({
      selector: 'a.next-page',
      label: 'PRODUCT_LIST',
    });
  },
});

// Start with initial URLs
await crawler.run(['https://example.com/products']);
```

### Anti-Bot Fingerprinting

```typescript
import { PlaywrightCrawler, BrowserPool } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // Crawlee automatically rotates:
  // - Browser fingerprints (canvas, WebGL, fonts)
  // - User agents
  // - Screen resolutions
  // - Timezone and locale
  // - Hardware concurrency

  browserPoolOptions: {
    useFingerprints: true,  // Rotates fingerprints automatically
    fingerprintOptions: {
      fingerprintGeneratorOptions: {
        browsers: ['chrome', 'firefox'],
        operatingSystems: ['windows', 'macos'],
      },
    },
  },

  async requestHandler({ page }) {
    // Looks like a real browser to anti-bot systems
  },
});
```

### CheerioCrawler (Fast Static)

```typescript
import { CheerioCrawler, Dataset } from 'crawlee';

// Use Cheerio for static sites — no browser overhead
const crawler = new CheerioCrawler({
  maxConcurrency: 50,  // High concurrency without browser overhead

  async requestHandler({ $, request, enqueueLinks }) {
    // Cheerio API
    const title = $('title').text();
    const links = $('a.product-link').map((_, el) => $(el).attr('href')).get();

    await Dataset.pushData({ url: request.url, title });

    // Queue discovered links
    await enqueueLinks({ selector: 'a.category-link' });
  },
});

await crawler.run(['https://example.com']);
```

## Choosing Your Scraping Tool

| Tool | JavaScript execution | Speed | Memory | Anti-bot |
|------|---------------------|-------|--------|---------|
| Cheerio | No | Fastest | ~15 MB | None |
| Playwright | Yes | Medium | ~200 MB | Via plugins |
| Puppeteer | Yes | Medium | ~200 MB | Via plugins |
| Crawlee | Yes (wraps above) | Medium | ~200 MB | Built-in |

**Use Cheerio if:**
- The site's HTML is fully rendered server-side (no JS required)
- You're scraping thousands of pages and performance matters
- You want minimal dependencies

**Use Playwright if:**
- The site uses React, Vue, Angular, or any client-side rendering
- You need to interact (click, scroll, fill forms)
- Cross-browser testing or Safari/WebKit support needed

**Use Puppeteer if:**
- You're already invested in Puppeteer's API
- Chrome-only is acceptable
- You need Chrome DevTools Protocol features directly

**Use Crawlee if:**
- Building a production crawler (not a one-off script)
- Anti-bot detection is a concern
- You need queuing, retries, and rate limiting built in
- Deploying to Apify's cloud platform

## Legal, Ethical, and robots.txt Considerations

Before deploying any web scraper into production, review the target site's `robots.txt` file and terms of service. Many sites explicitly prohibit automated scraping, and violating these terms can result in IP bans, legal action under the Computer Fraud and Abuse Act (in the US), or GDPR violations if you scrape personal data from EU users without a lawful basis. `robots.txt` is not legally enforceable but represents the site owner's stated preference — scraping pages marked `Disallow` is widely considered bad practice and increases the chance of your IP being blocked. For legitimate scraping use cases, implement polite crawl delays between requests (Crawlee does this automatically via its request queue), respect `Crawl-delay` directives, and identify your bot with a descriptive `User-Agent` string that includes contact information. Some sites offer official APIs that provide the same data more reliably and legally — always check for a public API before building a scraper.

## Anti-Bot Detection and Fingerprinting

Modern anti-bot systems like Cloudflare Bot Management, DataDome, and PerimeterX use browser fingerprinting signals far beyond IP address: WebGL renderer strings, canvas fingerprints, font detection, battery API availability, mouse movement patterns, and TLS fingerprinting at the network layer. Playwright by itself is detectable — Chromium exposes `navigator.webdriver = true` and other automation flags. The `playwright-extra` package with `puppeteer-extra-plugin-stealth` addresses many of these signals by patching the browser's JavaScript environment before page load. Crawlee goes further by rotating browser fingerprints automatically across requests using `fingerprint-generator`, which produces statistically realistic fingerprint combinations based on real browser populations. For the most heavily protected sites, residential proxy networks (Bright Data, Oxylabs, Smartproxy) rotate your requests through real user IP addresses, making them significantly harder to block at the IP level. Budget around $10–30 per GB for residential proxy bandwidth.

## Performance at Scale and Infrastructure Considerations

Scaling a scraper from a single-machine script to a production data pipeline introduces infrastructure challenges that Cheerio, Playwright, and Crawlee address differently. Cheerio's minimal resource footprint allows running hundreds of concurrent requests on a single EC2 instance — a `t3.medium` can comfortably handle 200+ concurrent Cheerio fetches. Headless browsers are fundamentally different: each Chromium instance consumes 150–300 MB of RAM and uses significant CPU for page rendering. On a machine with 8 GB of RAM, you can realistically run 20–25 concurrent Playwright pages before memory pressure causes instability. For production Playwright scrapers, containerize each browser session (Playwright supports `--remote-debugging-port` for external browser connections), use autoscaling to spin up worker instances based on queue depth, and store results incrementally rather than accumulating them in memory. Crawlee's cloud deployment on Apify's platform handles this automatically — their Actor runtime scales workers on demand and charges per compute unit rather than requiring you to manage server fleets.

## Data Quality, Deduplication, and Change Detection

Raw scraped data is rarely clean enough to use directly. HTML structures change without warning — a CSS class rename breaks your selectors silently, and your pipeline outputs empty strings or null values for days before anyone notices. Build selector health checks that assert scraped values are non-empty and match expected patterns (prices should be numeric, URLs should be valid), and alert on anomalous drop in data volume. Crawlee's dataset storage provides built-in deduplication via the request queue's URL fingerprinting, preventing you from re-scraping the same page on every run. For change detection — tracking when a page's content updates — compute a hash of the normalized HTML on each scrape and compare against your previously stored hash. Only process and store the full data when the hash changes. This dramatically reduces downstream processing costs when scraping frequently-updated sites where most pages haven't changed between runs.

Compare web scraping package downloads on [PkgPulse](https://pkgpulse.com).

*See also: [Playwright vs Puppeteer](/compare/playwright-vs-puppeteer) and [Cypress vs Playwright](/compare/cypress-vs-playwright), [Best npm Packages for Web Scraping 2026](/guides/best-npm-packages-web-scraping-crawlee-puppeteer-2026).*
