

Best npm Packages for Web Scraping 2026

Crawlee, Puppeteer, and Playwright compared for web scraping in Node.js in 2026: anti-bot handling, headless browsers, HTTP scraping with Cheerio, and when to use each.

·PkgPulse Team·

TL;DR

Crawlee (Apify) is the 2026 standard for production web scraping — it handles anti-bot fingerprinting, request queuing, retry logic, and session rotation out of the box. For simple page scraping: Playwright + Cheerio. For headless browser automation (not scraping): Playwright. For legacy projects: Puppeteer (no new features). For static HTML: Cheerio + node-fetch.

Key Takeaways

  • Crawlee: Full scraping framework, stealth mode, queue management, Playwright/Puppeteer runner
  • Playwright: Better than Puppeteer for scraping — multi-browser, better anti-detection
  • Puppeteer: Chrome-only, legacy, 8M downloads/week (inertia), no new features coming
  • Cheerio: HTML parsing only (no JS), fastest for static pages (~5M downloads/week)
  • 2026 trend: Anti-bot measures intensified — headless detection requires stealth plugins
  • Apify Cloud: Managed scraping infrastructure for Crawlee at scale

Downloads

Package            Weekly Downloads   Trend
puppeteer          ~8M                → Stable (legacy)
playwright         ~5M                ↑ Growing
cheerio            ~12M               → Stable
crawlee            ~200K              ↑ Growing
playwright-extra   ~300K              ↑ Growing

Crawlee: Production Scraping

npm install crawlee
# Or with specific crawler types:
npm install crawlee playwright
npm install crawlee puppeteer

// Full crawl with Crawlee + Playwright:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // Stealth mode enabled by default in Crawlee:
  launchContext: {
    launchOptions: {
      headless: true,
    },
  },
  
  // Rate limiting:
  maxRequestsPerCrawl: 100,
  maxConcurrency: 3,
  requestHandlerTimeoutSecs: 30,
  
  // Retry failed requests:
  maxRequestRetries: 3,
  
  async requestHandler({ request, page, enqueueLinks, log }) {
    log.info(`Scraping: ${request.url}`);

    // Extract data:
    const title = await page.title();
    const description = await page.$eval(
      'meta[name="description"]',
      (el) => el.getAttribute('content') ?? ''
    ).catch(() => '');

    // Extract product data:
    const products = await page.$$eval('.product-card', (cards) =>
      cards.map((card) => ({
        title: card.querySelector('h2')?.textContent?.trim() ?? '',
        price: card.querySelector('.price')?.textContent?.trim() ?? '',
        url: card.querySelector('a')?.href ?? '',
      }))
    );

    // Save to dataset (auto-persisted):
    await Dataset.pushData({
      url: request.url,
      title,
      description,
      products,
      scrapedAt: new Date().toISOString(),
    });

    // Follow pagination links:
    await enqueueLinks({
      selector: 'a.next-page',
      label: 'PAGINATION',
    });
  },

  failedRequestHandler({ request, log }) {
    log.error(`Failed to scrape: ${request.url}`);
  },
});

// Start crawl:
await crawler.run(['https://example.com/products']);

// Export results:
const dataset = await Dataset.open();
const { items } = await dataset.getData();
console.log(`Scraped ${items.length} items`);

// Crawlee HTTP crawler (no browser; often 10-100x faster for static pages):
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  maxConcurrency: 20,  // Much higher — no browser overhead
  maxRequestsPerCrawl: 10000,

  async requestHandler({ $, request, enqueueLinks }) {
    // $ is Cheerio — same API as jQuery:
    const title = $('h1').first().text().trim();
    const links = $('a[href]').map((_, el) => $(el).attr('href')).get();

    await Dataset.pushData({ url: request.url, title, linkCount: links.length });

    // Discover and enqueue linked pages:
    await enqueueLinks({
      selector: 'a',
      baseUrl: new URL(request.url).origin,
      // (`strategy: 'same-origin'` is Crawlee's built-in shorthand for this filter)
      transformRequestFunction: (req) => {
        // Skip cross-origin links:
        if (!req.url.startsWith(new URL(request.url).origin)) return false;
        return req;
      },
    });
  },
});

await crawler.run(['https://example.com']);

Anti-Bot: Stealth Mode

// Playwright with stealth plugin (playwright-extra):
import { chromium } from 'playwright-extra';
import stealth from 'puppeteer-extra-plugin-stealth';

chromium.use(stealth());

const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  viewport: { width: 1920, height: 1080 },
  // Disable WebDriver flag:
  extraHTTPHeaders: {
    'Accept-Language': 'en-US,en;q=0.9',
  },
});

const page = await context.newPage();

// Override WebDriver detection:
await page.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
  Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
});

await page.goto('https://bot-detection-target.com');
const data = await page.$eval('#main-content', (el) => el.textContent);
await browser.close();

// Crawlee has built-in fingerprinting via @crawlee/browser-pool:
import { PlaywrightCrawler } from 'crawlee';
import { FingerprintGenerator } from 'fingerprint-generator';
import { FingerprintInjector } from 'fingerprint-injector';

const generator = new FingerprintGenerator({ browsers: ['chrome'], operatingSystems: ['windows', 'macos'] });
const injector = new FingerprintInjector();

const crawler = new PlaywrightCrawler({
  preNavigationHooks: [
    async ({ page }) => {
      const fingerprint = generator.getFingerprint();
      // attachFingerprintToPlaywright expects the browser context, not the page:
      await injector.attachFingerprintToPlaywright(page.context(), fingerprint);
    },
  ],
  requestHandler: async ({ page }) => {
    // Now scraping with randomized browser fingerprint
  },
});

Cheerio: Fast Static HTML Parsing

npm install cheerio undici
# undici is faster than node-fetch for many requests

// Cheerio for static pages (no JS rendering):
import { load } from 'cheerio';
import { fetch } from 'undici';

async function scrapeProductPage(url: string) {
  const response = await fetch(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
      'Accept': 'text/html,application/xhtml+xml',
    },
  });

  const html = await response.text();
  const $ = load(html);

  return {
    title: $('h1').first().text().trim(),
    price: $('.price, [data-price]').first().text().trim(),
    description: $('meta[name="description"]').attr('content') ?? '',
    images: $('img.product-image').map((_, el) => $(el).attr('src')).get(),
    inStock: $('.add-to-cart').length > 0,
  };
}

// Batch scraping with concurrency control:
import pLimit from 'p-limit';

const limit = pLimit(5);  // Max 5 concurrent requests

const urls = ['url1', 'url2', /* ... 1000 URLs */];
const results = await Promise.all(
  urls.map(url => limit(() => scrapeProductPage(url)))
);

Puppeteer vs Playwright for Scraping

Puppeteer (2026 status):
  ✅ 8M downloads/week (many legacy codebases)
  ✅ Chrome DevTools Protocol native
  ❌ Chrome/Chromium only
  ❌ No new scraping features
  ❌ Worse anti-detection than Playwright
  → Recommendation: Migrate to Playwright

Playwright (2026 status):
  ✅ Chrome, Firefox, Safari support
  ✅ Better anti-detection (fewer WebDriver artifacts)
  ✅ Better selector API (locators > $)
  ✅ Active development
  ✅ Built-in request interception
  → Recommendation: Use for browser scraping

Crawlee on Playwright (2026):
  ✅ All Playwright benefits
  ✅ + Queue management
  ✅ + Automatic retries
  ✅ + Session pool rotation
  ✅ + Built-in fingerprinting
  → Recommendation: Use for production crawling

Decision Guide

Use Crawlee if:
  → Crawling many pages (100+)
  → Need retry logic and queue management
  → Building a data pipeline or scraper product
  → Anti-bot is a concern
  → Need to scale (use with Apify Cloud)

Use Playwright if:
  → Scraping a few specific pages
  → Already using Playwright for testing
  → Need precise interaction (login, form fill, SPA)
  → One-off scraping tasks

Use Cheerio + undici if:
  → Pages are static HTML (no JS rendering needed)
  → Maximum performance (100+ req/sec possible)
  → Simple data extraction from known HTML structure

Use Puppeteer if:
  → Legacy codebase already using it
  → Don't have time to migrate to Playwright
  → Chrome-only is acceptable

Production Architecture for Large-Scale Scraping

Running web scrapers in production at scale requires architecture decisions beyond selecting a library. Crawlee's built-in RequestQueue persists the crawl state to disk or a key-value store, which means a crashed or restarted scraper continues from where it left off rather than starting over from the beginning. This is essential for scraping jobs that run for hours or days across tens of thousands of URLs. For distributed scraping at larger scales, Apify Cloud provides a managed platform that runs Crawlee actors across a pool of machines, handling scheduling, monitoring, and output storage without requiring you to manage infrastructure. Rate limiting should be configured conservatively in production — most sites that monitor traffic will temporarily ban IPs that make more than a few requests per second. Setting maxRequestsPerMinute and adding waitForSelector delays to ensure page content has loaded before extraction reduces both detection risk and server load on the target site.
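In Crawlee you would simply set maxRequestsPerMinute, but the "few requests per second" budget is easy to reason about in plain code too. A minimal sliding-window limiter sketch (the RateLimiter class and its names are ours for illustration, not a Crawlee API):

```typescript
// Sliding-window rate limiter: allows at most `limit` requests per `windowMs`.
// Illustrative sketch only — not Crawlee's internal implementation.
class RateLimiter {
  private timestamps: number[] = [];

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if a request may proceed now, false if the caller should wait.
  tryAcquire(now: number = Date.now()): boolean {
    // Drop timestamps that fell out of the window:
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(now);
    return true;
  }
}

// Example: at most 3 requests per second per target host.
const limiter = new RateLimiter(3, 1000);
```

A scraper loop would call tryAcquire() before each fetch and sleep briefly when it returns false, keeping the request rate under the budget regardless of how fast pages are processed.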

Storing and Processing Scraped Data

Scraped data pipelines require careful design around storage and incremental updates. Crawlee's Dataset API writes scraped items to a local JSON store during development, and in production on Apify Cloud, datasets are versioned and accessible via API for downstream processing. For self-managed production deployments, integrating Crawlee with a message queue (Redis Streams, BullMQ) to publish scraped items for downstream processing is a common pattern: the scraper publishes raw page data as messages, and a separate consumer service handles extraction, normalization, and database writes. This decouples the scraping rate from the database write rate and allows multiple downstream consumers to process the same scraped data for different purposes. Deduplication is an important operational concern — scraping the same URL multiple times and writing duplicate records to your database degrades data quality. Crawlee's request queue handles URL-level deduplication natively, but content-level deduplication (detecting when a previously scraped URL has not changed since the last crawl) requires storing and comparing content hashes.
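The content-hash comparison described above takes only a few lines with Node's crypto module. In this sketch the in-memory Map stands in for whatever persistent store you use, and contentChanged is an illustrative name, not a Crawlee API:

```typescript
import { createHash } from 'node:crypto';

// Content-level dedup: skip re-processing a URL whose page content hash
// hasn't changed since the last crawl.
const seenHashes = new Map<string, string>();

function contentChanged(url: string, html: string): boolean {
  const hash = createHash('sha256').update(html).digest('hex');
  if (seenHashes.get(url) === hash) return false; // unchanged since last crawl
  seenHashes.set(url, hash);
  return true;
}
```

In a real pipeline the hash would be stored alongside the URL in your database or key-value store, and the scraper would publish to the downstream queue only when contentChanged returns true.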

Anti-Detection and Rate Limiting

Modern websites deploy bot detection mechanisms that block naive scrapers — browser fingerprinting, TLS fingerprint analysis, honeypot traps, and behavioral analysis. Choosing the right scraping library affects how much anti-detection work you need to implement.

Crawlee's anti-detection features include automatic browser fingerprint randomization when using the PlaywrightCrawler or PuppeteerCrawler with the fingerprint option enabled. It rotates user agents, screen resolutions, browser language settings, and timing patterns between sessions. For most target sites that use fingerprinting-as-a-service (Datadome, PerimeterX), Crawlee's built-in fingerprinting handles the common cases without additional configuration.
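As a rough configuration sketch, enabling and narrowing the generated fingerprints looks like the following. Option names reflect recent Crawlee releases, and browser-pool internals have shifted between versions, so verify against the current docs before relying on this:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  browserPoolOptions: {
    useFingerprints: true, // on by default in recent versions
    fingerprintOptions: {
      fingerprintGeneratorOptions: {
        browsers: ['chrome'],
        operatingSystems: ['windows'],
      },
    },
  },
  requestHandler: async ({ page }) => {
    // each session gets a distinct generated fingerprint
  },
});
```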

Playwright's stealth capabilities require the playwright-extra package with the stealth plugin (or the puppeteer-stealth equivalent) to pass fingerprint checks. Without stealth plugins, Playwright's headless Chrome is detectable by standard detection libraries. The puppeteer-extra-plugin-stealth package patches 17+ detection vectors and is broadly compatible with Playwright via the playwright-extra wrapper.

Rate limiting and politeness matter for long-running scraping jobs. Crawlee's RequestQueue with configurable maxRequestsPerMinute prevents overwhelming target servers. For sites with explicit rate limits in robots.txt or HTTP Retry-After headers, respecting these limits avoids IP bans. Using rotating residential proxies (not datacenter IPs) combined with realistic request timing remains the most reliable anti-detection strategy for high-volume scraping.
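Honoring Retry-After is mechanical: the header carries either delay-seconds or an HTTP-date (per RFC 9110). A small parser sketch — the retryAfterMs helper is ours, not part of any library mentioned here:

```typescript
// Parse an HTTP Retry-After header into milliseconds to wait.
// Returns 0 if the header is absent or unparseable.
function retryAfterMs(header: string | null, now: number = Date.now()): number {
  if (!header) return 0;
  const seconds = Number(header);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000); // delay-seconds form
  const date = Date.parse(header); // HTTP-date form
  if (Number.isNaN(date)) return 0;
  return Math.max(0, date - now);
}
```

A polite scraper checks this on every 429 or 503 response and sleeps for the returned duration before retrying, rather than hammering the server with its default backoff.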

TypeScript and Crawlee Type Safety

Crawlee's TypeScript support has matured significantly and is now one of its stronger selling points for production teams. The PlaywrightCrawler and CheerioCrawler classes accept generic type parameters for the context, allowing you to type the crawl context and user data stored alongside requests in the queue. The Dataset API is also typed, so Dataset.pushData<ProductData>({ ... }) enforces that pushed objects match your defined product schema. When using the CheerioCrawler, the $ parameter is typed as Cheerio's CheerioAPI, providing autocomplete for selectors and DOM traversal methods. The requestHandler context is typed to include the crawler-specific APIs for each crawler type — PlaywrightCrawler's context includes page typed as Playwright's Page, while CheerioCrawler's context provides $ typed as Cheerio's API. This end-to-end type safety significantly reduces the category of runtime errors where scraped data doesn't match the expected shape.
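Compile-time types only catch mistakes in your own code; scraped HTML can still produce malformed records at runtime. The same schema idea can be enforced with a plain interface plus a type guard, so bad items are rejected before storage (the names below are illustrative, independent of Crawlee):

```typescript
// The scraped-item schema referenced in the text, as a plain interface:
interface ProductData {
  title: string;
  price: string;
  url: string;
}

// Runtime guard: narrows `unknown` to ProductData only if the shape matches.
function isProductData(value: unknown): value is ProductData {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.title === 'string' &&
    typeof v.price === 'string' &&
    typeof v.url === 'string'
  );
}
```

Calling isProductData on each extracted item before pushing it to a dataset or queue turns "scraped data doesn't match the expected shape" from a silent data-quality bug into an explicit, loggable rejection.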

Legal and Ethical Considerations

Web scraping operates in a legally complex space that developers should understand before deploying production scrapers. The legal landscape varies by jurisdiction and use case: scraping publicly available data for research or indexing is generally accepted, while scraping behind authentication, violating robots.txt, or exceeding rate limits may breach the Computer Fraud and Abuse Act (US) or similar laws. Many platforms' terms of service explicitly prohibit automated access, and violating ToS — while not always illegal — can result in account termination or IP bans. The ethical and practical baseline for responsible scraping is to respect robots.txt disallow rules (Crawlee ships a robots.txt helper for this), implement reasonable rate limiting that does not harm the target server's performance, identify your scraper with a descriptive User-Agent string that includes contact information, and cache scraped data to avoid redundant requests. For commercial scraping of public data, consulting legal counsel about the specific jurisdiction and target site is advisable before scaling to production volumes.
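For illustration, a bare-bones robots.txt disallow check might look like this. Real parsers (including library helpers) also handle wildcards, Allow precedence, and per-agent groups; this sketch covers only simple prefix rules under "User-agent: *":

```typescript
// Collect Disallow rules that apply to all agents ("User-agent: *").
function disallowedPaths(robotsTxt: string): string[] {
  const rules: string[] = [];
  let applies = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.split('#')[0].trim(); // strip comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key.trim())) applies = value === '*';
    else if (applies && /^disallow$/i.test(key.trim()) && value) rules.push(value);
  }
  return rules;
}

// Simple prefix match against the collected rules.
function isAllowed(robotsTxt: string, path: string): boolean {
  return !disallowedPaths(robotsTxt).some((rule) => path.startsWith(rule));
}
```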

Proxy Strategy and IP Rotation

Proxy selection is often the most consequential infrastructure decision in a web scraping project. Datacenter proxies are fast and inexpensive but carry high block rates on modern e-commerce and social platforms because their IP ranges are widely known and flagged. Residential proxies (IPs belonging to real consumer ISP connections) are significantly harder to detect but cost more per gigabyte of traffic. ISP proxies occupy the middle ground — static IPs from ISP allocations that behave like residential IPs but with more consistent speeds. For most production scraping with Crawlee, configuring a rotating residential proxy pool via the ProxyConfiguration class is the starting point: the crawler automatically rotates the proxy for each request and retires proxies that return too many failures. Smartproxy, Oxylabs, and Bright Data each offer Node.js SDK integrations that work directly with Crawlee's proxy configuration. For lower-volume scraping, free proxy lists are unreliable — they are monitored and blocked within hours of becoming public. Budgeting for quality residential proxies is usually more cost-effective than the engineering time spent debugging blocks caused by poor proxy quality.
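The rotate-and-retire behavior described above reduces to a small amount of bookkeeping. This sketch is not Crawlee's ProxyConfiguration, just the idea in isolation; the proxy URLs are placeholders:

```typescript
// Round-robin proxy rotation with failure-based retirement.
class ProxyPool {
  private failures = new Map<string, number>();
  private index = 0;

  constructor(private proxies: string[], private maxFailures = 3) {}

  // Next live proxy, or undefined when every proxy has been retired.
  next(): string | undefined {
    const live = this.proxies.filter(
      (p) => (this.failures.get(p) ?? 0) < this.maxFailures
    );
    if (live.length === 0) return undefined;
    return live[this.index++ % live.length];
  }

  // Call on a blocked/failed request; the proxy is retired after maxFailures.
  reportFailure(proxy: string): void {
    this.failures.set(proxy, (this.failures.get(proxy) ?? 0) + 1);
  }
}
```

Production pools add health-check probes and cool-down periods so retired proxies can re-enter the rotation, but the core contract — hand out live proxies, count failures, stop handing out bad ones — is the same.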

Headless Browser Resource Management

Browser-based scraping with Playwright or Puppeteer carries significant resource overhead compared to HTTP-only scraping, and managing browser resources correctly is critical for stable production deployments. Each browser context consumes approximately 100-150MB of RAM, and unclosed contexts accumulate over long scraping runs, eventually crashing the Node.js process with out-of-memory errors. Crawlee's browser pool manages context lifecycle automatically, creating new contexts as needed and closing them after the session pool TTL expires. When using Playwright or Puppeteer directly without Crawlee's pool management, explicitly closing the browser context in a finally block is mandatory — a single unclosed context per request adds up to gigabytes of leaked memory in a multi-hour crawl. For memory-constrained environments (containers with 512MB RAM limits), using CheerioCrawler for pages that don't require JavaScript execution and falling back to PlaywrightCrawler only for JS-rendered pages dramatically reduces peak memory usage and allows higher concurrent request rates.
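The close-in-finally pattern generalizes to a small helper that works for any closable resource, Playwright browser contexts included. The withContext name is ours; the stub in the usage note stands in for a real context:

```typescript
// Run `use` against a freshly created resource and guarantee close() runs,
// even when `use` throws — the leak described above becomes impossible.
async function withContext<C extends { close(): Promise<void> | void }, T>(
  create: () => Promise<C> | C,
  use: (ctx: C) => Promise<T>
): Promise<T> {
  const ctx = await create();
  try {
    return await use(ctx);
  } finally {
    await ctx.close(); // runs on both success and failure
  }
}
```

With Playwright this would be called as withContext(() => browser.newContext(), async (ctx) => { /* scrape with ctx.newPage() */ }), so every context created during a multi-hour crawl is closed deterministically.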

Compare Crawlee, Puppeteer, and Playwright download trends on PkgPulse.

See also: Playwright vs Puppeteer and Cypress vs Playwright.
