TL;DR
Crawlee is the full web scraping framework from Apify — request queuing, automatic retries, proxy rotation, browser pool management, and both HTTP and browser-based crawlers in one toolkit. got-scraping is got with anti-bot headers — generates realistic browser-like HTTP headers, TLS fingerprinting, automatic header rotation for scraping without a browser. puppeteer-extra is Puppeteer with plugins — stealth mode to bypass bot detection, ad blocking, reCAPTCHA solving, and other plugin extensions. In 2026: Crawlee for production scraping pipelines, got-scraping for fast HTTP-based scraping, puppeteer-extra for browser automation with stealth.
Key Takeaways
- Crawlee: ~50K weekly downloads — full framework, queue management, proxy rotation, Apify
- got-scraping: ~30K weekly downloads — HTTP scraping with realistic headers, TLS fingerprinting
- puppeteer-extra: ~200K weekly downloads — Puppeteer + stealth plugin, anti-detection
- got-scraping is for HTTP requests — fast, no browser overhead, works for many sites
- puppeteer-extra controls a real browser — handles JavaScript-rendered pages, stealth mode
- Crawlee combines both approaches — use HTTP crawlers or browser crawlers as needed
The Anti-Bot Challenge
Modern websites detect scrapers via:
🔍 HTTP headers — missing or wrong User-Agent, Accept, Accept-Language
🔍 TLS fingerprint — Node.js has a different TLS fingerprint than Chrome
🔍 JavaScript challenges — Cloudflare, PerimeterX, DataDome
🔍 Browser fingerprint — headless Chrome has detectable properties
🔍 Rate limiting — too many requests too fast
🔍 IP reputation — datacenter IPs flagged as bots
Solutions:
got-scraping → Fixes HTTP headers + TLS fingerprint
puppeteer-extra → Fixes browser fingerprint + JS challenges
Crawlee → Framework that orchestrates both approaches
got-scraping
got-scraping — HTTP scraping with realistic headers:
Basic usage
import { gotScraping } from "got-scraping"
// Makes requests that look like a real browser:
const response = await gotScraping({
url: "https://example.com/products",
// Automatically generates realistic headers:
// User-Agent, Accept, Accept-Language, Accept-Encoding
// Sec-Ch-Ua, Sec-Fetch-* headers (Chrome-like)
})
console.log(response.body) // HTML content
// With proxy:
const response2 = await gotScraping({
url: "https://example.com/api/data",
proxyUrl: "http://proxy:8080",
responseType: "json",
})
Header generation
import { gotScraping } from "got-scraping"
// got-scraping generates different realistic headers each time:
const response = await gotScraping({
url: "https://example.com",
headerGeneratorOptions: {
browsers: ["chrome", "firefox"], // Mimic Chrome or Firefox
devices: ["desktop"], // Desktop headers
operatingSystems: ["windows", "macos"],
locales: ["en-US"],
},
})
// Example generated headers:
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...
// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
// Accept-Language: en-US,en;q=0.9
// Sec-Ch-Ua: "Chromium";v="122", "Google Chrome";v="122"
// Sec-Fetch-Mode: navigate
Scraping with Cheerio
import { gotScraping } from "got-scraping"
import * as cheerio from "cheerio"
async function scrapeProducts(url: string) {
const { body } = await gotScraping({ url })
const $ = cheerio.load(body)
const products = $(".product-card").map((_, el) => ({
name: $(el).find(".product-name").text().trim(),
price: $(el).find(".product-price").text().trim(),
url: $(el).find("a").attr("href"),
})).get()
return products
}
// Fast — no browser needed:
const products = await scrapeProducts("https://example.com/products")
When got-scraping isn't enough
got-scraping works for:
✅ Server-rendered HTML pages
✅ REST APIs with anti-bot headers
✅ Sites that check User-Agent and headers
✅ High-volume scraping (fast — no browser)
got-scraping fails for:
❌ JavaScript-rendered content (SPAs)
❌ Cloudflare challenge pages
❌ Sites requiring browser fingerprinting
❌ Interactive elements (login forms, infinite scroll)
→ Use puppeteer-extra or Crawlee with browser crawler
puppeteer-extra
puppeteer-extra — Puppeteer with plugins:
Stealth plugin
import puppeteer from "puppeteer-extra"
import StealthPlugin from "puppeteer-extra-plugin-stealth"
// Add stealth plugin — hides headless Chrome indicators:
puppeteer.use(StealthPlugin())
const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()
// Now headless Chrome looks like a real browser:
await page.goto("https://bot-detection-site.com")
// Stealth plugin patches:
// ✅ navigator.webdriver → false
// ✅ Chrome runtime properties present
// ✅ Correct WebGL vendor/renderer
// ✅ Plugin array not empty
// ✅ Language and timezone consistent
// ✅ iframe contentWindow access
Scraping with stealth
import puppeteer from "puppeteer-extra"
import StealthPlugin from "puppeteer-extra-plugin-stealth"
puppeteer.use(StealthPlugin())
async function scrapeJSRenderedPage(url: string) {
const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()
await page.goto(url, { waitUntil: "networkidle0" })
// Wait for dynamic content:
await page.waitForSelector(".product-card")
// Extract data from JavaScript-rendered page:
const products = await page.evaluate(() => {
return Array.from(document.querySelectorAll(".product-card")).map((el) => ({
name: el.querySelector(".name")?.textContent?.trim(),
price: el.querySelector(".price")?.textContent?.trim(),
}))
})
await browser.close()
return products
}
Plugins ecosystem
import puppeteer from "puppeteer-extra"
import StealthPlugin from "puppeteer-extra-plugin-stealth"
import AdblockerPlugin from "puppeteer-extra-plugin-adblocker"
import RecaptchaPlugin from "puppeteer-extra-plugin-recaptcha"
// Stealth — bypass bot detection:
puppeteer.use(StealthPlugin())
// Ad blocker — faster page loads, less noise:
puppeteer.use(AdblockerPlugin({ blockTrackers: true }))
// reCAPTCHA solver (requires 2captcha API key):
puppeteer.use(RecaptchaPlugin({
provider: { id: "2captcha", token: "YOUR_2CAPTCHA_KEY" },
}))
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto("https://example.com/login")
// Automatically solves reCAPTCHA if present:
const { solved } = await page.solveRecaptchas()
Crawlee
Crawlee — full scraping framework:
HTTP crawler (fast)
import { CheerioCrawler, Dataset } from "crawlee"
const crawler = new CheerioCrawler({
maxRequestsPerCrawl: 100,
maxConcurrency: 10,
async requestHandler({ request, $, enqueueLinks, log }) {
log.info(`Processing ${request.url}`)
// Extract data with Cheerio:
const title = $("h1").text()
const products = $(".product").map((_, el) => ({
name: $(el).find(".name").text().trim(),
price: $(el).find(".price").text().trim(),
})).get()
// Store results:
await Dataset.pushData({ url: request.url, title, products })
// Follow links:
await enqueueLinks({
globs: ["https://example.com/products/**"],
})
},
})
await crawler.run(["https://example.com/products"])
Browser crawler (JavaScript-rendered)
import { PlaywrightCrawler, Dataset } from "crawlee"
const crawler = new PlaywrightCrawler({
maxConcurrency: 5,
headless: true,
async requestHandler({ page, request, enqueueLinks, log }) {
log.info(`Processing ${request.url}`)
// Wait for JavaScript content:
await page.waitForSelector(".product-card")
// Extract from rendered page:
const products = await page.evaluate(() =>
Array.from(document.querySelectorAll(".product-card")).map((el) => ({
name: el.querySelector(".name")?.textContent?.trim(),
price: el.querySelector(".price")?.textContent?.trim(),
}))
)
await Dataset.pushData({ url: request.url, products })
// Follow pagination:
await enqueueLinks({ selector: ".pagination a" })
},
})
await crawler.run(["https://example.com/products"])
Proxy rotation
import { CheerioCrawler, ProxyConfiguration } from "crawlee"
const proxyConfig = new ProxyConfiguration({
proxyUrls: [
"http://proxy1:8080",
"http://proxy2:8080",
"http://proxy3:8080",
],
// For Apify proxy, use Actor.createProxyConfiguration()
// from the apify SDK instead, e.g. { groups: ["RESIDENTIAL"] }
})
const crawler = new CheerioCrawler({
proxyConfiguration: proxyConfig,
// Automatically rotates proxies per request
// Retries with different proxy on failure
async requestHandler({ request, $ }) {
// Each request uses a different proxy
},
})
Queue management and retries
import { CheerioCrawler } from "crawlee"
const crawler = new CheerioCrawler({
maxRequestRetries: 3, // Retry failed requests 3 times
maxConcurrency: 10, // 10 parallel requests
maxRequestsPerCrawl: 1000, // Stop after 1000 requests
requestHandlerTimeoutSecs: 60, // Timeout per request
// Failed requests are retried automatically with backoff,
// rotating to a different proxy (when configured) between attempts
async requestHandler({ request, $ }) {
// Process page
},
async failedRequestHandler({ request }, error) {
console.error(`Failed after retries: ${request.url}`, error.message)
},
})
Feature Comparison
| Feature | got-scraping | puppeteer-extra | Crawlee |
|---|---|---|---|
| HTTP scraping | ✅ | ❌ (browser) | ✅ (Cheerio) |
| Browser scraping | ❌ | ✅ | ✅ (Playwright) |
| Anti-bot headers | ✅ | Via stealth | ✅ (got-scraping) |
| Browser stealth | ❌ | ✅ (plugin) | ✅ (built-in) |
| Proxy rotation | Manual | Manual | ✅ (built-in) |
| Request queue | ❌ | ❌ | ✅ |
| Auto retries | Via got | Manual | ✅ (built-in) |
| Concurrency control | Manual | Manual | ✅ (built-in) |
| Data storage | Manual | Manual | ✅ (Dataset) |
| Link following | Manual | Manual | ✅ (enqueueLinks) |
| reCAPTCHA solving | ❌ | ✅ (plugin) | Via plugin |
| Weekly downloads | ~30K | ~200K | ~50K |
When to Use Each
Use got-scraping if:
- Scraping server-rendered HTML (no JavaScript needed)
- Need high-volume, fast HTTP scraping
- Want realistic headers without a full browser
- Simple scraping scripts with Cheerio for parsing
Use puppeteer-extra if:
- Need to bypass sophisticated bot detection
- Scraping JavaScript-rendered pages (SPAs)
- Need reCAPTCHA solving or ad blocking
- Want a full browser with stealth capabilities
- Already using Puppeteer and need anti-detection
Use Crawlee if:
- Building a production scraping pipeline
- Need queue management, retries, and proxy rotation built-in
- Want to switch between HTTP and browser crawlers as needed
- Scraping at scale with concurrency control
- Using Apify cloud for deployment
Legal and Ethical Considerations
Web scraping occupies complex legal and ethical territory that every developer should understand before building a scraper. The robots.txt file is a convention (not technically enforceable) that specifies which paths a crawler may access — respecting it is both ethical and practically wise since sites that detect you ignoring robots.txt escalate their bot countermeasures. Terms of Service violations are a genuine legal risk: scraping in violation of a site's ToS has been the basis for lawsuits and CFAA (Computer Fraud and Abuse Act) claims in the US, though courts have ruled inconsistently. Public data that is accessible without authentication generally has stronger fair use arguments than scraping behind login walls. Rate limiting your scraper to 1-2 requests per second per domain is both courteous and practically effective — it avoids triggering rate-limit defenses while allowing continuous operation. If an API is available for the data you need, use it.
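The robots.txt check can be sketched without any dependency — the snippet below is a minimal Disallow matcher for the `*` user-agent group, for illustration only (a production scraper should use a full parser, since real robots.txt files also carry Allow rules, wildcards, and per-bot sections):

```typescript
// Minimal robots.txt check: collects Disallow paths for "User-agent: *"
// and tests a request path against them. Illustrative only — ignores
// Allow rules, wildcards, and bot-specific sections.
function disallowedPaths(robotsTxt: string): string[] {
  const paths: string[] = []
  let applies = false
  for (const rawLine of robotsTxt.split("\n")) {
    const line = rawLine.trim()
    const colon = line.indexOf(":")
    if (colon === -1) continue
    const key = line.slice(0, colon).toLowerCase()
    const value = line.slice(colon + 1).trim()
    if (key === "user-agent") applies = value === "*"
    else if (key === "disallow" && applies && value) paths.push(value)
  }
  return paths
}

function isAllowed(robotsTxt: string, path: string): boolean {
  return !disallowedPaths(robotsTxt).some((p) => path.startsWith(p))
}

const robots = "User-agent: *\nDisallow: /admin\nDisallow: /cart"
console.log(isAllowed(robots, "/products")) // true
console.log(isAllowed(robots, "/admin/users")) // false
```

For the rate-limit half, Crawlee's `maxRequestsPerMinute` option caps throughput declaratively, so you don't need manual sleep calls between requests.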
TypeScript Integration and Type-Safe Scraped Data
Scraped data is inherently untyped — you're parsing HTML or JSON from a third-party source with no guarantees about structure. TypeScript can't validate scraped data at compile time, but you can add runtime validation using Zod, Valibot, or similar tools to ensure parsed data matches expected shapes before it enters your application. The pattern is defining a Zod schema for the expected scraped structure and using schema.parse() (which throws on mismatch) or schema.safeParse() (which returns a Result-like object). This catches changes in the scraped site's HTML structure immediately at the parsing stage rather than propagating invalid data through your pipeline. For Crawlee-based scrapers, create a typed Dataset schema and validate before pushing data — this makes scraped data as reliable as database data within your application.
Production Deployment and Proxy Architecture
Running scrapers in production requires thinking about IP rotation, detection avoidance, and operational resilience. Datacenter IP ranges (AWS, Google Cloud, DigitalOcean) are aggressively blocked by commercial anti-bot solutions — serious production scrapers use residential proxy networks (Bright Data, Oxylabs, SOAX) that rotate through real residential IP addresses. The cost is significant (residential proxies cost $5-15 per GB of traffic) but often necessary for sites with sophisticated detection. Crawlee's ProxyConfiguration handles rotation automatically, selecting a different proxy per request and retrying with a new proxy on 429 or CAPTCHA responses. For scrapers that must run continuously, deploy them on a VPS or container with stable, predictable resource allocation rather than serverless functions — long-running browser sessions and connection pools work poorly in ephemeral serverless environments with 30-second execution limits.
Performance Comparison at Scale
At high volume, the performance gap between HTTP scraping (got-scraping) and browser scraping (puppeteer-extra) becomes decisive. A single got-scraping instance can make 10-50 concurrent requests with minimal memory — running 10 got-scraping workers on a 1GB VPS is feasible. A single Chromium instance in Puppeteer consumes 150-300MB of RAM and handles 3-5 concurrent pages before performance degrades. Browser-based scraping at scale requires careful resource management: limit concurrent pages, close pages after use, restart browsers periodically to prevent memory leaks. Crawlee's PlaywrightCrawler handles this automatically with its browser pool abstraction. For scraping sites that require JavaScript rendering but load their data via XHR, intercepting the API requests with Puppeteer's page.on('response') is often faster than waiting for the full page to render — you get the raw JSON without parsing HTML.
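The interception pattern can be sketched as below — the helper only assumes the Puppeteer `Page`'s `on("response", …)` event API (typed loosely here to keep the sketch dependency-free), and `/\/api\/products/` is a hypothetical endpoint pattern:

```typescript
// Collect JSON from XHR/fetch responses matching a URL pattern instead
// of scraping the rendered DOM.
type ResponseLike = {
  url(): string
  headers(): Record<string, string>
  json(): Promise<unknown>
}

function collectApiJson(
  page: { on(event: string, cb: (res: ResponseLike) => void): void },
  apiPattern: RegExp,
  results: unknown[],
): void {
  page.on("response", async (res) => {
    const contentType = res.headers()["content-type"] ?? ""
    if (apiPattern.test(res.url()) && contentType.includes("application/json")) {
      try {
        results.push(await res.json())
      } catch {
        // Body may be unavailable (e.g. redirect responses) — skip it.
      }
    }
  })
}

// Usage with a real Puppeteer page:
// const results: unknown[] = []
// collectApiJson(page, /\/api\/products/, results)
// await page.goto("https://example.com/products", { waitUntil: "networkidle0" })
// // results now holds the raw JSON payloads — no HTML parsing needed
```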
Integration with Data Pipelines
Scraped data rarely ends up in isolation — it flows into databases, data warehouses, or analytical pipelines. Crawlee's Dataset.pushData() stores results in a local JSON file dataset by default, which integrates with Apify's platform for managed storage. For custom pipelines, transform Crawlee's dataset output into the shape your database expects using a post-processing step, or push directly to your database inside the requestHandler with proper error handling to avoid losing data on individual page failures. got-scraping pairs naturally with cheerio for HTML parsing and then any HTTP client for further data enrichment — the result can be pushed to a message queue (Bull, BullMQ) for async processing by separate workers. This decoupled pattern lets you scale scraping independently from data processing: if processing is slow, requests queue up without blocking the scraper's HTTP concurrency.
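The decoupling can be sketched with an in-memory queue standing in for BullMQ — in production you'd swap `enqueue`/`drain` for `queue.add()` and a BullMQ `Worker`; the names and `RawItem` shape here are illustrative:

```typescript
// Decoupled scrape → process pipeline. The scraper only enqueues raw
// items; a separate worker loop drains and processes them, so slow
// processing never blocks scraping concurrency.
type RawItem = { url: string; html: string }

const queue: RawItem[] = [] // stand-in for a BullMQ queue

function enqueue(item: RawItem): void {
  queue.push(item)
}

async function drain(process: (item: RawItem) => Promise<void>): Promise<number> {
  let handled = 0
  while (queue.length > 0) {
    const item = queue.shift()!
    try {
      await process(item)
      handled++
    } catch (err) {
      // In production: move to a dead-letter queue / retry with backoff.
      console.error(`Processing failed for ${item.url}`, err)
    }
  }
  return handled
}

// Scraper side: push raw pages as they arrive.
enqueue({ url: "https://example.com/p/1", html: "<html>…</html>" })

// Worker side: process independently of the scraper.
const n = await drain(async (item) => {
  console.log(`Parsing ${item.url} (${item.html.length} bytes)`)
})
console.log(n) // 1
```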
Error Recovery and Resilient Scraping
Production scrapers fail — sites go down, DOM structures change, rate limits hit, CAPTCHAs appear. Building resilient scrapers requires systematic error classification and recovery strategies. Transient errors (503 responses, timeouts, connection resets) warrant automatic retry with exponential backoff. Structure-change errors (element not found, unexpected null) indicate the site's DOM has changed and require human intervention to update selectors — log these with the full page content for debugging. Authentication errors (401, redirect to login) require credential refresh logic. Crawlee's error classification system handles transient errors automatically; puppeteer-extra requires manual retry logic. For long-running scrapers that must maintain continuity across days or weeks, persist the request queue and completed URLs to disk or a database so the scraper can resume after a crash without re-scraping already-processed pages. got-scraping's stateless design makes it the easiest to wrap with custom retry and checkpoint logic since there's no persistent state to manage.
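The classification described above can be sketched as a pure function — the category names, status-code mapping, and message patterns are illustrative choices, not a Crawlee API:

```typescript
// Classify scraper failures so each category gets the right recovery:
// retry transients, alert humans on structure changes, refresh auth.
type FailureKind = "transient" | "structure-change" | "auth" | "unknown"

function classifyFailure(err: { statusCode?: number; message: string }): FailureKind {
  const { statusCode, message } = err
  if (statusCode === 401 || statusCode === 403) return "auth"
  if (statusCode === 429 || statusCode === 502 || statusCode === 503) return "transient"
  if (/timeout|ECONNRESET|ETIMEDOUT|socket hang up/i.test(message)) return "transient"
  if (/selector|not found|null|undefined/i.test(message)) return "structure-change"
  return "unknown"
}

// Exponential backoff delay for transient errors: 1s, 2s, 4s, …
function backoffMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt
}

console.log(classifyFailure({ statusCode: 503, message: "Service Unavailable" })) // "transient"
console.log(classifyFailure({ message: "waiting for selector .product-card failed" })) // "structure-change"
console.log(backoffMs(2)) // 4000
```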
Compare web scraping and automation tools on PkgPulse →
See also: Playwright vs Puppeteer, got vs node-fetch, and archiver vs adm-zip vs JSZip (2026).