TL;DR
Crawlee is the full web scraping framework from Apify — request queuing, automatic retries, proxy rotation, browser pool management, and both HTTP and browser-based crawlers in one toolkit. got-scraping is got with anti-bot headers — generates realistic browser-like HTTP headers, TLS fingerprinting, automatic header rotation for scraping without a browser. puppeteer-extra is Puppeteer with plugins — stealth mode to bypass bot detection, ad blocking, reCAPTCHA solving, and other plugin extensions. In 2026: Crawlee for production scraping pipelines, got-scraping for fast HTTP-based scraping, puppeteer-extra for browser automation with stealth.
Key Takeaways
- Crawlee: ~50K weekly downloads — full framework, queue management, proxy rotation, Apify
- got-scraping: ~30K weekly downloads — HTTP scraping with realistic headers, TLS fingerprinting
- puppeteer-extra: ~200K weekly downloads — Puppeteer + stealth plugin, anti-detection
- got-scraping is for HTTP requests — fast, no browser overhead, works for many sites
- puppeteer-extra controls a real browser — handles JavaScript-rendered pages, stealth mode
- Crawlee combines both approaches — use HTTP crawlers or browser crawlers as needed
The Anti-Bot Challenge
Modern websites detect scrapers via:
🔍 HTTP headers — missing or wrong User-Agent, Accept, Accept-Language
🔍 TLS fingerprint — Node.js has a different TLS fingerprint than Chrome
🔍 JavaScript challenges — Cloudflare, PerimeterX, DataDome
🔍 Browser fingerprint — headless Chrome has detectable properties
🔍 Rate limiting — too many requests too fast
🔍 IP reputation — datacenter IPs flagged as bots
Solutions:
got-scraping → Fixes HTTP headers + TLS fingerprint
puppeteer-extra → Fixes browser fingerprint + JS challenges
Crawlee → Framework that orchestrates both approaches
got-scraping
got-scraping — HTTP scraping with realistic headers:
Basic usage
import { gotScraping } from "got-scraping"
// Makes requests that look like a real browser:
const response = await gotScraping({
url: "https://example.com/products",
// Automatically generates realistic headers:
// User-Agent, Accept, Accept-Language, Accept-Encoding
// Sec-Ch-Ua, Sec-Fetch-* headers (Chrome-like)
})
console.log(response.body) // HTML content
// With proxy:
const response2 = await gotScraping({
url: "https://example.com/api/data",
proxyUrl: "http://proxy:8080",
responseType: "json",
})
Header generation
import { gotScraping } from "got-scraping"
// got-scraping generates different realistic headers each time:
const response = await gotScraping({
url: "https://example.com",
headerGeneratorOptions: {
browsers: ["chrome", "firefox"], // Mimic Chrome or Firefox
devices: ["desktop"], // Desktop headers
operatingSystems: ["windows", "macos"],
locales: ["en-US"],
},
})
// Example generated headers:
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...
// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
// Accept-Language: en-US,en;q=0.9
// Sec-Ch-Ua: "Chromium";v="122", "Google Chrome";v="122"
// Sec-Fetch-Mode: navigate
Scraping with Cheerio
import { gotScraping } from "got-scraping"
import * as cheerio from "cheerio"
async function scrapeProducts(url: string) {
const { body } = await gotScraping({ url })
const $ = cheerio.load(body)
const products = $(".product-card").map((_, el) => ({
name: $(el).find(".product-name").text().trim(),
price: $(el).find(".product-price").text().trim(),
url: $(el).find("a").attr("href"),
})).get()
return products
}
// Fast — no browser needed:
const products = await scrapeProducts("https://example.com/products")
When got-scraping isn't enough
got-scraping works for:
✅ Server-rendered HTML pages
✅ REST APIs with anti-bot headers
✅ Sites that check User-Agent and headers
✅ High-volume scraping (fast — no browser)
got-scraping fails for:
❌ JavaScript-rendered content (SPAs)
❌ Cloudflare challenge pages
❌ Sites requiring browser fingerprinting
❌ Interactive elements (login forms, infinite scroll)
→ Use puppeteer-extra or Crawlee with browser crawler
puppeteer-extra
puppeteer-extra — Puppeteer with plugins:
Stealth plugin
import puppeteer from "puppeteer-extra"
import StealthPlugin from "puppeteer-extra-plugin-stealth"
// Add stealth plugin — hides headless Chrome indicators:
puppeteer.use(StealthPlugin())
const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()
// Now headless Chrome looks like a real browser:
await page.goto("https://bot-detection-site.com")
// Stealth plugin patches:
// ✅ navigator.webdriver → false
// ✅ Chrome runtime properties present
// ✅ Correct WebGL vendor/renderer
// ✅ Plugin array not empty
// ✅ Language and timezone consistent
// ✅ iframe contentWindow access
Scraping with stealth
import puppeteer from "puppeteer-extra"
import StealthPlugin from "puppeteer-extra-plugin-stealth"
puppeteer.use(StealthPlugin())
async function scrapeJSRenderedPage(url: string) {
const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()
await page.goto(url, { waitUntil: "networkidle0" })
// Wait for dynamic content:
await page.waitForSelector(".product-card")
// Extract data from JavaScript-rendered page:
const products = await page.evaluate(() => {
return Array.from(document.querySelectorAll(".product-card")).map((el) => ({
name: el.querySelector(".name")?.textContent?.trim(),
price: el.querySelector(".price")?.textContent?.trim(),
}))
})
await browser.close()
return products
}
Plugins ecosystem
import puppeteer from "puppeteer-extra"
import StealthPlugin from "puppeteer-extra-plugin-stealth"
import AdblockerPlugin from "puppeteer-extra-plugin-adblocker"
import RecaptchaPlugin from "puppeteer-extra-plugin-recaptcha"
// Stealth — bypass bot detection:
puppeteer.use(StealthPlugin())
// Ad blocker — faster page loads, less noise:
puppeteer.use(AdblockerPlugin({ blockTrackers: true }))
// reCAPTCHA solver (requires 2captcha API key):
puppeteer.use(RecaptchaPlugin({
provider: { id: "2captcha", token: "YOUR_2CAPTCHA_KEY" },
}))
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto("https://example.com/login")
// Automatically solves reCAPTCHA if present:
const { solved } = await page.solveRecaptchas()
Crawlee
Crawlee — full scraping framework:
HTTP crawler (fast)
import { CheerioCrawler, Dataset } from "crawlee"
const crawler = new CheerioCrawler({
maxRequestsPerCrawl: 100,
maxConcurrency: 10,
async requestHandler({ request, $, enqueueLinks, log }) {
log.info(`Processing ${request.url}`)
// Extract data with Cheerio:
const title = $("h1").text()
const products = $(".product").map((_, el) => ({
name: $(el).find(".name").text().trim(),
price: $(el).find(".price").text().trim(),
})).get()
// Store results:
await Dataset.pushData({ url: request.url, title, products })
// Follow links:
await enqueueLinks({
globs: ["https://example.com/products/**"],
})
},
})
await crawler.run(["https://example.com/products"])
Browser crawler (JavaScript-rendered)
import { PlaywrightCrawler, Dataset } from "crawlee"
const crawler = new PlaywrightCrawler({
maxConcurrency: 5,
headless: true,
async requestHandler({ page, request, enqueueLinks, log }) {
log.info(`Processing ${request.url}`)
// Wait for JavaScript content:
await page.waitForSelector(".product-card")
// Extract from rendered page:
const products = await page.evaluate(() =>
Array.from(document.querySelectorAll(".product-card")).map((el) => ({
name: el.querySelector(".name")?.textContent?.trim(),
price: el.querySelector(".price")?.textContent?.trim(),
}))
)
await Dataset.pushData({ url: request.url, products })
// Follow pagination:
await enqueueLinks({ selector: ".pagination a" })
},
})
await crawler.run(["https://example.com/products"])
Proxy rotation
import { CheerioCrawler, ProxyConfiguration } from "crawlee"
const proxyConfig = new ProxyConfiguration({
proxyUrls: [
"http://proxy1:8080",
"http://proxy2:8080",
"http://proxy3:8080",
],
// For Apify proxy, use Actor.createProxyConfiguration()
// from the apify SDK instead, e.g. { groups: ["RESIDENTIAL"] }
})
const crawler = new CheerioCrawler({
proxyConfiguration: proxyConfig,
// Automatically rotates proxies per request
// Retries with different proxy on failure
async requestHandler({ request, $ }) {
// Each request uses a different proxy
},
})
Queue management and retries
import { CheerioCrawler } from "crawlee"
const crawler = new CheerioCrawler({
maxRequestRetries: 3, // Retry failed requests 3 times
maxConcurrency: 10, // 10 parallel requests
maxRequestsPerCrawl: 1000, // Stop after 1000 requests
requestHandlerTimeoutSecs: 60, // Timeout per request
// Failed requests are retried automatically with backoff,
// rotating to a different proxy (when configured) between attempts
async requestHandler({ request, $ }) {
// Process page
},
async failedRequestHandler({ request }, error) {
console.error(`Failed after retries: ${request.url}`, error.message)
},
})
Feature Comparison
| Feature | got-scraping | puppeteer-extra | Crawlee |
|---|---|---|---|
| HTTP scraping | ✅ | ❌ (browser) | ✅ (Cheerio) |
| Browser scraping | ❌ | ✅ | ✅ (Playwright) |
| Anti-bot headers | ✅ | Via stealth | ✅ (got-scraping) |
| Browser stealth | ❌ | ✅ (plugin) | ✅ (built-in) |
| Proxy rotation | Manual | Manual | ✅ (built-in) |
| Request queue | ❌ | ❌ | ✅ |
| Auto retries | Via got | Manual | ✅ (built-in) |
| Concurrency control | Manual | Manual | ✅ (built-in) |
| Data storage | Manual | Manual | ✅ (Dataset) |
| Link following | Manual | Manual | ✅ (enqueueLinks) |
| reCAPTCHA solving | ❌ | ✅ (plugin) | Via plugin |
| Weekly downloads | ~30K | ~200K | ~50K |
When to Use Each
Use got-scraping if:
- Scraping server-rendered HTML (no JavaScript needed)
- Need high-volume, fast HTTP scraping
- Want realistic headers without a full browser
- Simple scraping scripts with Cheerio for parsing
Use puppeteer-extra if:
- Need to bypass sophisticated bot detection
- Scraping JavaScript-rendered pages (SPAs)
- Need reCAPTCHA solving or ad blocking
- Want a full browser with stealth capabilities
- Already using Puppeteer and need anti-detection
Use Crawlee if:
- Building a production scraping pipeline
- Need queue management, retries, and proxy rotation built-in
- Want to switch between HTTP and browser crawlers as needed
- Scraping at scale with concurrency control
- Using Apify cloud for deployment
Legal and Ethical Considerations
Web scraping occupies complex legal and ethical territory that every developer should understand before building a scraper. The robots.txt file is a convention (not technically enforceable) that specifies which paths a crawler may access — respecting it is both ethical and practically wise since sites that detect you ignoring robots.txt escalate their bot countermeasures. Terms of Service violations are a genuine legal risk: scraping in violation of a site's ToS has been the basis for lawsuits and CFAA (Computer Fraud and Abuse Act) claims in the US, though courts have ruled inconsistently. Public data that is accessible without authentication generally has stronger fair use arguments than scraping behind login walls. Rate limiting your scraper to 1-2 requests per second per domain is both courteous and practically effective — it avoids triggering rate-limit defenses while allowing continuous operation. If an API is available for the data you need, use it.
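The robots.txt check can be sketched without any dependency — the snippet below is a minimal Disallow matcher for the `*` user-agent group, for illustration only (a production scraper should use a full parser, since real robots.txt files also carry Allow rules, wildcards, and per-bot sections):

```typescript
// Minimal robots.txt check: collects Disallow paths for "User-agent: *"
// and tests a request path against them. Illustrative only — ignores
// Allow rules, wildcards, and bot-specific sections.
function disallowedPaths(robotsTxt: string): string[] {
  const paths: string[] = []
  let applies = false
  for (const rawLine of robotsTxt.split("\n")) {
    const line = rawLine.trim()
    const colon = line.indexOf(":")
    if (colon === -1) continue
    const key = line.slice(0, colon).toLowerCase()
    const value = line.slice(colon + 1).trim()
    if (key === "user-agent") applies = value === "*"
    else if (key === "disallow" && applies && value) paths.push(value)
  }
  return paths
}

function isAllowed(robotsTxt: string, path: string): boolean {
  return !disallowedPaths(robotsTxt).some((p) => path.startsWith(p))
}

const robots = "User-agent: *\nDisallow: /admin\nDisallow: /cart"
console.log(isAllowed(robots, "/products")) // true
console.log(isAllowed(robots, "/admin/users")) // false
```

For the rate-limit half, Crawlee's `maxRequestsPerMinute` option caps throughput declaratively, so you don't need manual sleep calls between requests.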
TypeScript Integration and Type-Safe Scraped Data
Scraped data is inherently untyped — you're parsing HTML or JSON from a third-party source with no guarantees about structure. TypeScript can't validate scraped data at compile time, but you can add runtime validation using Zod, Valibot, or similar tools to ensure parsed data matches expected shapes before it enters your application. The pattern is defining a Zod schema for the expected scraped structure and using schema.parse() (which throws on mismatch) or schema.safeParse() (which returns a Result-like object). This catches changes in the scraped site's HTML structure immediately at the parsing stage rather than propagating invalid data through your pipeline. For Crawlee-based scrapers, create a typed Dataset schema and validate before pushing data — this makes scraped data as reliable as database data within your application.
Production Deployment and Proxy Architecture
Running scrapers in production requires thinking about IP rotation, detection avoidance, and operational resilience. Datacenter IP ranges (AWS, Google Cloud, DigitalOcean) are aggressively blocked by commercial anti-bot solutions — serious production scrapers use residential proxy networks (Bright Data, Oxylabs, SOAX) that rotate through real residential IP addresses. The cost is significant (residential proxies cost $5-15 per GB of traffic) but often necessary for sites with sophisticated detection. Crawlee's ProxyConfiguration handles rotation automatically, selecting a different proxy per request and retrying with a new proxy on 429 or CAPTCHA responses. For scrapers that must run continuously, deploy them on a VPS or container with stable, predictable resource allocation rather than serverless functions — long-running browser sessions and connection pools work poorly in ephemeral serverless environments with 30-second execution limits.
Performance Comparison at Scale
At high volume, the performance gap between HTTP scraping (got-scraping) and browser scraping (puppeteer-extra) becomes decisive. A single got-scraping instance can make 10-50 concurrent requests with minimal memory — running 10 got-scraping workers on a 1GB VPS is feasible. A single Chromium instance in Puppeteer consumes 150-300MB of RAM and handles 3-5 concurrent pages before performance degrades. Browser-based scraping at scale requires careful resource management: limit concurrent pages, close pages after use, restart browsers periodically to prevent memory leaks. Crawlee's PlaywrightCrawler handles this automatically with its browser pool abstraction. For scraping sites that require JavaScript rendering but load their data via XHR, intercepting the API requests with Puppeteer's page.on('response') is often faster than waiting for the full page to render — you get the raw JSON without parsing HTML.
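The interception pattern can be sketched as below — the helper only assumes the Puppeteer `Page`'s `on("response", …)` event API (typed loosely here to keep the sketch dependency-free), and `/\/api\/products/` is a hypothetical endpoint pattern:

```typescript
// Collect JSON from XHR/fetch responses matching a URL pattern instead
// of scraping the rendered DOM.
type ResponseLike = {
  url(): string
  headers(): Record<string, string>
  json(): Promise<unknown>
}

function collectApiJson(
  page: { on(event: string, cb: (res: ResponseLike) => void): void },
  apiPattern: RegExp,
  results: unknown[],
): void {
  page.on("response", async (res) => {
    const contentType = res.headers()["content-type"] ?? ""
    if (apiPattern.test(res.url()) && contentType.includes("application/json")) {
      try {
        results.push(await res.json())
      } catch {
        // Body may be unavailable (e.g. redirect responses) — skip it.
      }
    }
  })
}

// Usage with a real Puppeteer page:
// const results: unknown[] = []
// collectApiJson(page, /\/api\/products/, results)
// await page.goto("https://example.com/products", { waitUntil: "networkidle0" })
// // results now holds the raw JSON payloads — no HTML parsing needed
```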
Integration with Data Pipelines
Scraped data rarely ends up in isolation — it flows into databases, data warehouses, or analytical pipelines. Crawlee's Dataset.pushData() stores results in a local JSON file dataset by default, which integrates with Apify's platform for managed storage. For custom pipelines, transform Crawlee's dataset output into the shape your database expects using a post-processing step, or push directly to your database inside the requestHandler with proper error handling to avoid losing data on individual page failures. got-scraping pairs naturally with cheerio for HTML parsing and then any HTTP client for further data enrichment — the result can be pushed to a message queue (Bull, BullMQ) for async processing by separate workers. This decoupled pattern lets you scale scraping independently from data processing: if processing is slow, requests queue up without blocking the scraper's HTTP concurrency.
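The decoupling can be sketched with an in-memory queue standing in for BullMQ — in production you'd swap `enqueue`/`drain` for `queue.add()` and a BullMQ `Worker`; the names and `RawItem` shape here are illustrative:

```typescript
// Decoupled scrape → process pipeline. The scraper only enqueues raw
// items; a separate worker loop drains and processes them, so slow
// processing never blocks scraping concurrency.
type RawItem = { url: string; html: string }

const queue: RawItem[] = [] // stand-in for a BullMQ queue

function enqueue(item: RawItem): void {
  queue.push(item)
}

async function drain(process: (item: RawItem) => Promise<void>): Promise<number> {
  let handled = 0
  while (queue.length > 0) {
    const item = queue.shift()!
    try {
      await process(item)
      handled++
    } catch (err) {
      // In production: move to a dead-letter queue / retry with backoff.
      console.error(`Processing failed for ${item.url}`, err)
    }
  }
  return handled
}

// Scraper side: push raw pages as they arrive.
enqueue({ url: "https://example.com/p/1", html: "<html>…</html>" })

// Worker side: process independently of the scraper.
const n = await drain(async (item) => {
  console.log(`Parsing ${item.url} (${item.html.length} bytes)`)
})
console.log(n) // 1
```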
Error Recovery and Resilient Scraping
Production scrapers fail — sites go down, DOM structures change, rate limits hit, CAPTCHAs appear. Building resilient scrapers requires systematic error classification and recovery strategies. Transient errors (503 responses, timeouts, connection resets) warrant automatic retry with exponential backoff. Structure-change errors (element not found, unexpected null) indicate the site's DOM has changed and require human intervention to update selectors — log these with the full page content for debugging. Authentication errors (401, redirect to login) require credential refresh logic. Crawlee's error classification system handles transient errors automatically; puppeteer-extra requires manual retry logic. For long-running scrapers that must maintain continuity across days or weeks, persist the request queue and completed URLs to disk or a database so the scraper can resume after a crash without re-scraping already-processed pages. got-scraping's stateless design makes it the easiest to wrap with custom retry and checkpoint logic since there's no persistent state to manage.
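The classification described above can be sketched as a pure function — the category names, status-code mapping, and message patterns are illustrative choices, not a Crawlee API:

```typescript
// Classify scraper failures so each category gets the right recovery:
// retry transients, alert humans on structure changes, refresh auth.
type FailureKind = "transient" | "structure-change" | "auth" | "unknown"

function classifyFailure(err: { statusCode?: number; message: string }): FailureKind {
  const { statusCode, message } = err
  if (statusCode === 401 || statusCode === 403) return "auth"
  if (statusCode === 429 || statusCode === 502 || statusCode === 503) return "transient"
  if (/timeout|ECONNRESET|ETIMEDOUT|socket hang up/i.test(message)) return "transient"
  if (/selector|not found|null|undefined/i.test(message)) return "structure-change"
  return "unknown"
}

// Exponential backoff delay for transient errors: 1s, 2s, 4s, …
function backoffMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt
}

console.log(classifyFailure({ statusCode: 503, message: "Service Unavailable" })) // "transient"
console.log(classifyFailure({ message: "waiting for selector .product-card failed" })) // "structure-change"
console.log(backoffMs(2)) // 4000
```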
Compare web scraping and automation tools on PkgPulse →
See also: Playwright vs Puppeteer, got vs node-fetch, and archiver vs adm-zip vs JSZip (2026).