got-scraping vs Crawlee vs puppeteer-extra: Advanced Web Scraping in Node.js (2026)
TL;DR
Crawlee is the full web scraping framework from Apify — request queuing, automatic retries, proxy rotation, browser pool management, and both HTTP and browser-based crawlers in one toolkit. got-scraping is got with anti-bot defenses — it generates realistic browser-like HTTP headers, mimics browser TLS fingerprints, and rotates header profiles automatically, so many sites can be scraped without a browser. puppeteer-extra is Puppeteer with plugins — stealth mode to bypass bot detection, ad blocking, reCAPTCHA solving, and other plugin extensions. In 2026: Crawlee for production scraping pipelines, got-scraping for fast HTTP-based scraping, puppeteer-extra for browser automation with stealth.
Key Takeaways
- Crawlee: ~50K weekly downloads — full framework, queue management, proxy rotation, Apify
- got-scraping: ~30K weekly downloads — HTTP scraping with realistic headers, TLS fingerprinting
- puppeteer-extra: ~200K weekly downloads — Puppeteer + stealth plugin, anti-detection
- got-scraping is for HTTP requests — fast, no browser overhead, works for many sites
- puppeteer-extra controls a real browser — handles JavaScript-rendered pages, stealth mode
- Crawlee combines both approaches — use HTTP crawlers or browser crawlers as needed
The Anti-Bot Challenge
Modern websites detect scrapers via:
🔍 HTTP headers — missing or wrong User-Agent, Accept, Accept-Language
🔍 TLS fingerprint — Node.js has a different TLS fingerprint than Chrome
🔍 JavaScript challenges — Cloudflare, PerimeterX, DataDome
🔍 Browser fingerprint — headless Chrome has detectable properties
🔍 Rate limiting — too many requests too fast
🔍 IP reputation — datacenter IPs flagged as bots
Solutions:
got-scraping → Fixes HTTP headers + TLS fingerprint
puppeteer-extra → Fixes browser fingerprint + JS challenges
Crawlee → Framework that orchestrates both approaches
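To make the header problem concrete, here is a sketch contrasting what a bare Node.js HTTP client sends with a browser-like profile. The values are illustrative only; got-scraping generates and varies real profiles per request:

```typescript
// What a bare Node.js HTTP client typically sends (illustrative):
const nodeDefaultHeaders: Record<string, string> = {
  accept: "*/*",
  "user-agent": "got (https://github.com/sindresorhus/got)",
}

// A browser-like profile adds the headers anti-bot systems check for:
function browserLikeHeaders(chromeMajor: number): Record<string, string> {
  return {
    "user-agent": `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/${chromeMajor}.0.0.0 Safari/537.36`,
    accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "sec-ch-ua": `"Chromium";v="${chromeMajor}", "Google Chrome";v="${chromeMajor}"`,
    "sec-fetch-mode": "navigate",
  }
}
```

A detector that sees the first header set (no Sec-Ch-Ua, no Sec-Fetch-*, a library User-Agent) can flag the request before looking at anything else.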
got-scraping
got-scraping — HTTP scraping with realistic headers:
Basic usage
import { gotScraping } from "got-scraping"

// Makes requests that look like a real browser:
const response = await gotScraping({
  url: "https://example.com/products",
  // Automatically generates realistic headers:
  // User-Agent, Accept, Accept-Language, Accept-Encoding
  // Sec-Ch-Ua, Sec-Fetch-* headers (Chrome-like)
})

console.log(response.body) // HTML content

// With proxy:
const response2 = await gotScraping({
  url: "https://example.com/api/data",
  proxyUrl: "http://proxy:8080",
  responseType: "json",
})
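When scraping a JSON API as in the second request above, it pays to validate the payload shape before using it. A minimal type guard for a hypothetical product record (the fields here are assumptions, not any real API's schema):

```typescript
// Hypothetical shape of a scraped product record:
interface Product {
  name: string
  price: number
}

// Runtime check that an unknown JSON value matches the expected shape:
function isProduct(value: unknown): value is Product {
  return (
    typeof value === "object" && value !== null &&
    typeof (value as Product).name === "string" &&
    typeof (value as Product).price === "number"
  )
}
```

Filtering a scraped array with `isProduct` keeps malformed or changed API responses from crashing downstream code silently.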
Header generation
import { gotScraping } from "got-scraping"
// got-scraping generates different realistic headers each time:
const response = await gotScraping({
  url: "https://example.com",
  headerGeneratorOptions: {
    browsers: ["chrome", "firefox"], // Mimic Chrome or Firefox
    devices: ["desktop"], // Desktop headers
    operatingSystems: ["windows", "macos"],
    locales: ["en-US"],
  },
})
// Example generated headers:
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...
// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
// Accept-Language: en-US,en;q=0.9
// Sec-Ch-Ua: "Chromium";v="122", "Google Chrome";v="122"
// Sec-Fetch-Mode: navigate
Scraping with Cheerio
import { gotScraping } from "got-scraping"
import * as cheerio from "cheerio"
async function scrapeProducts(url: string) {
  const { body } = await gotScraping({ url })
  const $ = cheerio.load(body)
  const products = $(".product-card").map((_, el) => ({
    name: $(el).find(".product-name").text().trim(),
    price: $(el).find(".product-price").text().trim(),
    url: $(el).find("a").attr("href"),
  })).get()
  return products
}
// Fast — no browser needed:
const products = await scrapeProducts("https://example.com/products")
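One gotcha in the example above: `attr("href")` often returns a relative path. Resolving it against the page URL with Node's built-in URL class (no extra dependency) keeps the scraped links usable:

```typescript
// Resolve a possibly-relative href against the page it was scraped from.
function resolveUrl(href: string | undefined, pageUrl: string): string | undefined {
  if (!href) return undefined
  try {
    return new URL(href, pageUrl).href
  } catch {
    return undefined // malformed href — skip it rather than crash
  }
}

// Relative paths and absolute URLs both resolve correctly:
resolveUrl("/products/42", "https://example.com/products")
// → "https://example.com/products/42"
resolveUrl("https://cdn.example.com/a", "https://example.com")
// → "https://cdn.example.com/a"
```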
When got-scraping isn't enough
got-scraping works for:
✅ Server-rendered HTML pages
✅ REST APIs with anti-bot headers
✅ Sites that check User-Agent and headers
✅ High-volume scraping (fast — no browser)
got-scraping fails for:
❌ JavaScript-rendered content (SPAs)
❌ Cloudflare challenge pages
❌ Sites requiring browser fingerprinting
❌ Interactive elements (login forms, infinite scroll)
→ Use puppeteer-extra or Crawlee with browser crawler
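A common pattern is to try the cheap HTTP path first and escalate to a browser only when the response looks like a bot challenge. A minimal heuristic sketch (the status codes and body markers are illustrative, not an exhaustive list):

```typescript
// Decide whether a response likely needs a real browser to proceed.
function needsBrowser(status: number, body: string): boolean {
  // Challenge pages typically come back as 403 or 503:
  if (status === 403 || status === 503) return true
  // Common challenge-page markers (illustrative, not exhaustive):
  const markers = ["cf-challenge", "just a moment", "checking your browser"]
  const lower = body.toLowerCase()
  return markers.some((m) => lower.includes(m))
}
```

A scraper built this way sends most traffic through got-scraping and only pays the browser cost on the minority of URLs that demand it.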
puppeteer-extra
puppeteer-extra — Puppeteer with plugins:
Stealth plugin
import puppeteer from "puppeteer-extra"
import StealthPlugin from "puppeteer-extra-plugin-stealth"
// Add stealth plugin — hides headless Chrome indicators:
puppeteer.use(StealthPlugin())
const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()
// Now headless Chrome looks like a real browser:
await page.goto("https://bot-detection-site.com")
// Stealth plugin patches:
// ✅ navigator.webdriver → false
// ✅ Chrome runtime properties present
// ✅ Correct WebGL vendor/renderer
// ✅ Plugin array not empty
// ✅ Language and timezone consistent
// ✅ iframe contentWindow access
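The patched properties above are exactly what simple detector scripts probe. A simplified sketch of such a check (in practice it runs inside the page, e.g. via page.evaluate, against the real navigator object):

```typescript
// Simplified stand-in for the navigator properties a detector inspects:
interface NavigatorProps {
  webdriver?: boolean
  pluginCount: number
  languages: readonly string[]
}

// Illustrative bot check: vanilla headless Chrome trips all three signals,
// while the stealth plugin patches each of them.
function looksHeadless(nav: NavigatorProps): boolean {
  return Boolean(nav.webdriver) || nav.pluginCount === 0 || nav.languages.length === 0
}
```

Real detectors (Cloudflare, DataDome, etc.) combine dozens of such signals, which is why the stealth plugin patches a whole battery of properties rather than just `navigator.webdriver`.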
Scraping with stealth
import puppeteer from "puppeteer-extra"
import StealthPlugin from "puppeteer-extra-plugin-stealth"
puppeteer.use(StealthPlugin())
async function scrapeJSRenderedPage(url: string) {
  const browser = await puppeteer.launch({ headless: true })
  try {
    const page = await browser.newPage()
    await page.goto(url, { waitUntil: "networkidle0" })
    // Wait for dynamic content:
    await page.waitForSelector(".product-card")
    // Extract data from JavaScript-rendered page:
    return await page.evaluate(() => {
      return Array.from(document.querySelectorAll(".product-card")).map((el) => ({
        name: el.querySelector(".name")?.textContent?.trim(),
        price: el.querySelector(".price")?.textContent?.trim(),
      }))
    })
  } finally {
    // Always close the browser, even if navigation or the selector times out:
    await browser.close()
  }
}
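Even with stealth, request timing matters (see "Rate limiting" in the detection list above). A small jittered-delay helper, using only Node's standard library, avoids the perfectly uniform pacing that gives scripts away:

```typescript
import { setTimeout as sleep } from "node:timers/promises"

// Pick a random delay in [minMs, maxMs). Uniform gaps between requests
// are themselves a bot signal, so we add jitter.
function jitterMs(minMs: number, maxMs: number): number {
  return minMs + Math.floor(Math.random() * (maxMs - minMs))
}

async function politeDelay(minMs = 500, maxMs = 2000): Promise<void> {
  await sleep(jitterMs(minMs, maxMs))
}
```

Call `await politeDelay()` between page visits in standalone got-scraping or puppeteer scripts; Crawlee manages pacing for you through its concurrency controls.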
Plugins ecosystem
import puppeteer from "puppeteer-extra"
import StealthPlugin from "puppeteer-extra-plugin-stealth"
import AdblockerPlugin from "puppeteer-extra-plugin-adblocker"
import RecaptchaPlugin from "puppeteer-extra-plugin-recaptcha"
// Stealth — bypass bot detection:
puppeteer.use(StealthPlugin())
// Ad blocker — faster page loads, less noise:
puppeteer.use(AdblockerPlugin({ blockTrackers: true }))
// reCAPTCHA solver (requires 2captcha API key):
puppeteer.use(RecaptchaPlugin({
  provider: { id: "2captcha", token: "YOUR_2CAPTCHA_KEY" },
}))
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto("https://example.com/login")
// Automatically solves reCAPTCHA if present:
const { solved } = await page.solveRecaptchas()
Crawlee
Crawlee — full scraping framework:
HTTP crawler (fast)
import { CheerioCrawler, Dataset } from "crawlee"

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 100,
  maxConcurrency: 10,
  async requestHandler({ request, $, enqueueLinks, log }) {
    log.info(`Processing ${request.url}`)
    // Extract data with Cheerio:
    const title = $("h1").text()
    const products = $(".product").map((_, el) => ({
      name: $(el).find(".name").text().trim(),
      price: $(el).find(".price").text().trim(),
    })).get()
    // Store results:
    await Dataset.pushData({ url: request.url, title, products })
    // Follow links:
    await enqueueLinks({
      globs: ["https://example.com/products/**"],
    })
  },
})
await crawler.run(["https://example.com/products"])
Browser crawler (JavaScript-rendered)
import { PlaywrightCrawler, Dataset } from "crawlee"

const crawler = new PlaywrightCrawler({
  maxConcurrency: 5,
  headless: true,
  async requestHandler({ page, request, enqueueLinks, log }) {
    log.info(`Processing ${request.url}`)
    // Wait for JavaScript content:
    await page.waitForSelector(".product-card")
    // Extract from rendered page:
    const products = await page.evaluate(() =>
      Array.from(document.querySelectorAll(".product-card")).map((el) => ({
        name: el.querySelector(".name")?.textContent?.trim(),
        price: el.querySelector(".price")?.textContent?.trim(),
      }))
    )
    await Dataset.pushData({ url: request.url, products })
    // Follow pagination:
    await enqueueLinks({ selector: ".pagination a" })
  },
})
await crawler.run(["https://example.com/products"])
Proxy rotation
import { CheerioCrawler, ProxyConfiguration } from "crawlee"
const proxyConfig = new ProxyConfiguration({
  proxyUrls: [
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
  ],
  // Or use Apify proxy:
  // apifyProxyGroups: ["RESIDENTIAL"],
})

const crawler = new CheerioCrawler({
  proxyConfiguration: proxyConfig,
  // Automatically rotates proxies per request
  // Retries with different proxy on failure
  async requestHandler({ request, $ }) {
    // Each request uses a different proxy
  },
})
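Crawlee rotates proxies for you; with got-scraping or puppeteer-extra you would do this by hand. The "Manual" entries in the comparison table below boil down to something like this round-robin sketch (proxy URLs are placeholders):

```typescript
// Round-robin over a fixed proxy list: each call returns the next proxy URL.
function makeProxyRotator(proxyUrls: string[]): () => string {
  let i = 0
  return () => proxyUrls[i++ % proxyUrls.length]
}

const nextProxy = makeProxyRotator([
  "http://proxy1:8080",
  "http://proxy2:8080",
])

nextProxy() // → "http://proxy1:8080"
nextProxy() // → "http://proxy2:8080"
nextProxy() // → "http://proxy1:8080"
```

Pass `nextProxy()` as got-scraping's `proxyUrl` per request; Crawlee's version additionally tracks proxy health and retires failing proxies, which is real work a hand-rolled rotator skips.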
Queue management and retries
import { CheerioCrawler } from "crawlee"
const crawler = new CheerioCrawler({
  maxRequestRetries: 3, // Retry failed requests 3 times
  maxConcurrency: 10, // 10 parallel requests
  maxRequestsPerCrawl: 1000, // Stop after 1000 requests
  requestHandlerTimeoutSecs: 60, // Timeout per request
  // Automatic retry with backoff:
  // 1st retry: almost immediate
  // 2nd retry: after a few seconds
  // 3rd retry: after a longer wait, possibly through a different proxy
  async requestHandler({ request, $ }) {
    // Process page
  },
  // In Crawlee v3, the error arrives as a second argument:
  async failedRequestHandler({ request }, error) {
    console.error(`Failed after retries: ${request.url}`, error.message)
  },
})
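The retry timing described in the comments above follows the familiar exponential-backoff shape. A sketch of the idea (Crawlee's exact delays differ; this only illustrates the pattern):

```typescript
// Exponential backoff: the delay doubles with each retry, capped at maxMs.
function backoffDelayMs(retry: number, baseMs = 1000, maxMs = 30_000): number {
  return Math.min(baseMs * 2 ** (retry - 1), maxMs)
}

backoffDelayMs(1) // → 1000
backoffDelayMs(2) // → 2000
backoffDelayMs(3) // → 4000
backoffDelayMs(10) // → 30000 (capped)
```

The cap matters: without it, a stubborn URL on its tenth retry would wait over eight minutes instead of a bounded thirty seconds.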
Feature Comparison
| Feature | got-scraping | puppeteer-extra | Crawlee |
|---|---|---|---|
| HTTP scraping | ✅ | ❌ (browser) | ✅ (Cheerio) |
| Browser scraping | ❌ | ✅ | ✅ (Playwright) |
| Anti-bot headers | ✅ | Via stealth | ✅ (got-scraping) |
| Browser stealth | ❌ | ✅ (plugin) | ✅ (built-in) |
| Proxy rotation | Manual | Manual | ✅ (built-in) |
| Request queue | ❌ | ❌ | ✅ |
| Auto retries | Via got | Manual | ✅ (built-in) |
| Concurrency control | Manual | Manual | ✅ (built-in) |
| Data storage | Manual | Manual | ✅ (Dataset) |
| Link following | Manual | Manual | ✅ (enqueueLinks) |
| reCAPTCHA solving | ❌ | ✅ (plugin) | Via plugin |
| Weekly downloads | ~30K | ~200K | ~50K |
When to Use Each
Use got-scraping if:
- Scraping server-rendered HTML (no JavaScript needed)
- Need high-volume, fast HTTP scraping
- Want realistic headers without a full browser
- Simple scraping scripts with Cheerio for parsing
Use puppeteer-extra if:
- Need to bypass sophisticated bot detection
- Scraping JavaScript-rendered pages (SPAs)
- Need reCAPTCHA solving or ad blocking
- Want a full browser with stealth capabilities
- Already using Puppeteer and need anti-detection
Use Crawlee if:
- Building a production scraping pipeline
- Need queue management, retries, and proxy rotation built-in
- Want to switch between HTTP and browser crawlers as needed
- Scraping at scale with concurrency control
- Using Apify cloud for deployment
Methodology
Download data from npm registry (weekly average, February 2026). Feature comparison based on got-scraping v4.x, puppeteer-extra v3.x, and Crawlee v3.x.