TL;DR
cheerio is the right choice for parsing static HTML — it gives you a jQuery-like API ($('.className')) with minimal memory overhead. jsdom simulates a full browser DOM in Node.js — good when you need DOM APIs (querySelector, innerHTML) but the page doesn't require JavaScript execution. Playwright launches a real browser — necessary for dynamic SPAs, JavaScript-rendered content, or when you need to simulate user interaction. Each is 10-100x heavier than the previous.
Key Takeaways
- cheerio: ~10M weekly downloads — jQuery selector API, parses static HTML, ~1MB memory
- jsdom: ~8M weekly downloads — full DOM implementation, runs in Node.js, no JS execution
- playwright: ~3M weekly downloads — real Chromium/Firefox/WebKit, JavaScript execution, screenshots
- Static HTML scraping (GitHub, Stack Overflow, news sites): cheerio
- Testing with DOM APIs, server-rendered content: jsdom
- SPAs, login-gated content, interaction simulation: Playwright
- Memory cost: cheerio ~1MB, jsdom ~50MB, Playwright ~300MB per page
Download Trends
| Package | Weekly Downloads | Browser? | JS Execution | Memory |
|---|---|---|---|---|
| cheerio | ~10M | ❌ HTML parser | ❌ | ~1MB/page |
| jsdom | ~8M | ❌ DOM in Node | ❌ | ~50MB/page |
| playwright | ~3M | ✅ Real browser | ✅ | ~300MB/page |
The Decision Matrix
Is the content rendered by JavaScript (SPA/React/Vue)?
- Yes → Playwright
- No (static HTML): do you need DOM APIs (addEventListener, classList, etc.)?
  - Yes → jsdom
  - No → cheerio (fastest, lightest)
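The same decision tree can be encoded as a tiny helper. `chooseTool` and `PageProfile` are hypothetical names, purely illustrative:

```typescript
type Tool = "playwright" | "jsdom" | "cheerio"

interface PageProfile {
  jsRendered: boolean   // is the content produced client-side (SPA)?
  needsDomApis: boolean // addEventListener, classList, event simulation?
}

// Encodes the decision matrix: JS rendering forces a real browser,
// DOM APIs without JS rendering fit jsdom, everything else is cheerio.
function chooseTool(page: PageProfile): Tool {
  if (page.jsRendered) return "playwright"
  if (page.needsDomApis) return "jsdom"
  return "cheerio"
}

console.log(chooseTool({ jsRendered: true, needsDomApis: false }))  // "playwright"
console.log(chooseTool({ jsRendered: false, needsDomApis: true }))  // "jsdom"
console.log(chooseTool({ jsRendered: false, needsDomApis: false })) // "cheerio"
```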
cheerio
cheerio parses HTML with a jQuery-compatible selector API — no browser, no JavaScript runtime, just fast HTML traversal.
Basic Scraping
import * as cheerio from "cheerio"
// Parse HTML string:
const html = await fetch("https://npmjs.com/package/react").then(r => r.text())
const $ = cheerio.load(html)
// jQuery-like selectors:
const packageName = $("h1").text().trim()
const description = $('p[data-testid="package-description"]').text().trim()
const weeklyDownloads = $('[data-testid="weekly-downloads"]').text().trim()
// Attribute access:
const packageLink = $("a.package-name-link").attr("href")
// Iterate over elements:
$(".package-dependency-list li").each((index, element) => {
  const depName = $(element).find("a").text()
  const depVersion = $(element).find(".dep-version").text()
  console.log(`${depName}: ${depVersion}`)
})
// Table scraping:
const tableData: string[][] = []
$("table tbody tr").each((_, row) => {
  const cells = $(row).find("td").map((_, cell) => $(cell).text().trim()).get()
  tableData.push(cells)
})
HTML Transformation
cheerio is also excellent for transforming HTML, not just reading it:
const $ = cheerio.load(html)
// Add a class to all external links:
$("a[href^='http']").each((_, el) => {
  const $el = $(el)
  if (!$el.attr("href")?.includes("pkgpulse.com")) {
    $el.addClass("external-link")
    $el.attr("target", "_blank")
    $el.attr("rel", "noopener noreferrer")
  }
})
// Remove all script tags (sanitize HTML):
$("script, style, iframe").remove()
// Add rel="nofollow" to all links:
$("a").attr("rel", (_, existing) => {
  return existing ? `${existing} nofollow` : "nofollow"
})
// Get transformed HTML:
const cleanHtml = $.html()
Scraping with HTTP (fetch + cheerio):
import * as cheerio from "cheerio"
interface NpmPackageStats {
  name: string
  version: string
  weeklyDownloads: string
  description: string
}

async function scrapeNpmPage(packageName: string): Promise<NpmPackageStats> {
  const res = await fetch(`https://www.npmjs.com/package/${packageName}`, {
    headers: {
      "User-Agent": "Mozilla/5.0 (compatible; PkgPulseBot/1.0)",
      "Accept": "text/html",
    },
  })
  if (!res.ok) throw new Error(`HTTP ${res.status}`)
  const html = await res.text()
  const $ = cheerio.load(html)
  return {
    name: packageName,
    version: $('span[data-testid="version-link"]').first().text().trim(),
    weeklyDownloads: $('[aria-label*="weekly downloads"]').text().trim(),
    description: $('p[data-testid="package-description"]').text().trim(),
  }
}
// Batch scraping with concurrency limit:
import pLimit from "p-limit"
const limit = pLimit(3) // Max 3 concurrent requests
const packages = ["react", "vue", "angular", "svelte", "solid-js"]
const results = await Promise.all(
  packages.map((name) => limit(() => scrapeNpmPage(name)))
)
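p-limit's behavior is easy to approximate. A simplified, illustrative limiter (not the real library's implementation) shows the core idea: run up to `concurrency` tasks, queue the rest, and start a queued task whenever one finishes:

```typescript
// Simplified concurrency limiter in the spirit of p-limit (illustrative only).
function createLimit(concurrency: number) {
  let active = 0
  const queue: (() => void)[] = []

  const next = () => {
    active--
    queue.shift()?.() // start the oldest waiting task, if any
  }

  return <T>(fn: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      const run = () => {
        active++
        fn().then(resolve, reject).finally(next)
      }
      if (active < concurrency) run()
      else queue.push(run)
    })
}

// Usage: at most 2 tasks in flight at once.
const limitTwo = createLimit(2)
const tasks = [1, 2, 3, 4].map((n) => limitTwo(async () => n * 10))
// Promise.all(tasks) resolves to [10, 20, 30, 40].
```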
cheerio limitations:
- Static HTML only — no JavaScript execution
- Can't handle React/Vue/Angular rendered pages
- DOM events (addEventListener) don't work
- Some CSS selectors may differ from browser behavior
jsdom
jsdom implements the browser DOM spec in Node.js — useful when you need actual DOM APIs or are running tests that simulate a browser environment.
Basic DOM Parsing
import { JSDOM } from "jsdom"
const html = `
<html>
  <head><title>PkgPulse</title></head>
  <body>
    <nav id="main-nav">
      <a href="/" class="nav-link active">Home</a>
      <a href="/compare" class="nav-link">Compare</a>
    </nav>
    <main>
      <h1 class="title">Package Analytics</h1>
      <p>Weekly downloads: <span id="count">25,000,000</span></p>
    </main>
  </body>
</html>
`
const { window } = new JSDOM(html)
const { document } = window
// Standard DOM APIs:
const title = document.querySelector("h1.title")?.textContent // "Package Analytics"
const count = document.getElementById("count")?.textContent // "25,000,000"
const navLinks = document.querySelectorAll(".nav-link")
navLinks.forEach((link) => {
  console.log(link.textContent, link.getAttribute("href"), link.classList.contains("active"))
})
// TreeWalker for deep traversal:
const walker = document.createTreeWalker(
  document.body,
  window.NodeFilter.SHOW_TEXT, // NodeFilter lives on the jsdom window, not Node's globals
  null
)
const textNodes: string[] = []
let node = walker.nextNode()
while (node) {
  if (node.nodeValue?.trim()) textNodes.push(node.nodeValue.trim())
  node = walker.nextNode()
}
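Conceptually, the TreeWalker loop above is a depth-first walk that collects non-empty text nodes. The same logic over a hypothetical minimal tree type (`MiniNode` is a stand-in, not a DOM interface):

```typescript
// Minimal stand-in for a DOM tree, to illustrate what the TreeWalker loop does.
interface MiniNode {
  nodeValue?: string
  children?: MiniNode[]
}

// Depth-first walk: keep trimmed, non-empty text values in document order.
function collectText(node: MiniNode, out: string[] = []): string[] {
  const text = node.nodeValue?.trim()
  if (text) out.push(text)
  node.children?.forEach((child) => collectText(child, out))
  return out
}

const tree: MiniNode = {
  children: [
    { nodeValue: "Package Analytics" },
    { children: [{ nodeValue: "   " }, { nodeValue: "25,000,000" }] },
  ],
}
console.log(collectText(tree)) // ["Package Analytics", "25,000,000"]
```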
jsdom for Test Environments
// jsdom is used by Vitest and Jest as the browser environment for unit tests:
// vitest.config.ts:
import { defineConfig } from "vitest/config"
export default defineConfig({
  test: {
    environment: "jsdom", // Provides window, document, etc.
    setupFiles: "./test/setup.ts",
  },
})
// test/setup.ts:
import "@testing-library/jest-dom"
// Now tests can use document, window, etc.:
import { render, screen } from "@testing-library/react"
import { PackageCard } from "@/components/PackageCard"
test("renders package name", () => {
  render(<PackageCard name="react" downloads={25000000} />)
  expect(screen.getByText("react")).toBeInTheDocument()
})
jsdom with JavaScript Execution (Limited)
// jsdom can run inline scripts (with the right settings):
const { window } = new JSDOM(html, {
  runScripts: "dangerously", // Execute <script> tags — use only for trusted HTML
  resources: "usable", // Load external stylesheets/scripts
  url: "https://pkgpulse.com", // Required for scripts that access location
})
// Wait for scripts to execute:
window.addEventListener("load", () => {
  const value = (window as any).myGlobal // Access globals set by page scripts
})
⚠️ runScripts: "dangerously" runs untrusted code in Node.js — only use for content you control.
Playwright
Playwright launches a real browser — necessary when the page requires JavaScript to render content.
Dynamic Page Scraping
import { chromium } from "playwright"
async function scrapeDynamicPage(url: string) {
  const browser = await chromium.launch({ headless: true })
  const page = await browser.newPage()
  // Set realistic browser headers:
  await page.setExtraHTTPHeaders({
    "Accept-Language": "en-US,en;q=0.9",
  })
  await page.goto(url, { waitUntil: "networkidle" })
  // Wait for specific content to appear:
  await page.waitForSelector("[data-testid='download-count']")
  // Extract data using page.evaluate() — runs in browser context:
  const data = await page.evaluate(() => ({
    downloads: document.querySelector("[data-testid='download-count']")?.textContent,
    version: document.querySelector("[data-testid='latest-version']")?.textContent,
    // Can access window, localStorage, etc.
  }))
  await browser.close()
  return data
}
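In production, browser navigation fails transiently (timeouts, dropped connections, bot challenges), so scraper calls are usually wrapped in retry logic. A generic sketch, where the attempt count and delays are arbitrary choices rather than Playwright defaults:

```typescript
// Generic retry wrapper with exponential backoff (illustrative).
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      // Back off before the next attempt: 1s, 2s, 4s, ...
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i))
      }
    }
  }
  throw lastError
}
```

Wrapping a scraper is then one line, e.g. `withRetry(() => scrapeDynamicPage(url))`.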
Login and Authentication
interface Credentials {
  email: string
  password: string
}

async function scrapeAuthenticatedPage(url: string, credentials: Credentials) {
  const browser = await chromium.launch()
  const context = await browser.newContext({
    // Reuse saved login state from a previous run (the file must already exist):
    storageState: "auth/session.json",
  })
  const page = await context.newPage()
  // Log in:
  await page.goto("https://pkgpulse.com/login")
  await page.fill('[name="email"]', credentials.email)
  await page.fill('[name="password"]', credentials.password)
  await page.click('[type="submit"]')
  await page.waitForURL("**/dashboard")
  // Save auth state for reuse:
  await context.storageState({ path: "auth/session.json" })
  // Now scrape authenticated content:
  await page.goto(url)
  const privateData = await page.textContent(".private-content")
  await browser.close()
  return privateData
}
Scraping SPAs with Route Changes
async function scrapeReactApp(baseUrl: string) {
  const browser = await chromium.launch()
  const page = await browser.newPage()
  // Intercept API calls instead of scraping rendered HTML:
  const apiResponses: unknown[] = []
  page.on("response", async (response) => {
    if (response.url().includes("/api/packages")) {
      const json = await response.json()
      apiResponses.push(json)
    }
  })
  await page.goto(`${baseUrl}/packages`)
  await page.waitForLoadState("networkidle")
  // Capturing the JSON APIs is often more efficient than parsing the rendered HTML.
  console.log(apiResponses)
  await browser.close()
}
Feature Comparison
| Feature | cheerio | jsdom | Playwright |
|---|---|---|---|
| JavaScript execution | ❌ | ⚠️ Limited | ✅ Full |
| Real browser APIs | ❌ | ✅ DOM spec | ✅ |
| jQuery selectors | ✅ | ❌ | ❌ |
| CSS selectors | ✅ | ✅ | ✅ |
| Screenshots | ❌ | ❌ | ✅ |
| Form submission | ❌ | ⚠️ | ✅ |
| Login/session handling | ❌ | ❌ | ✅ |
| Network interception | ❌ | ❌ | ✅ |
| Memory per page | ~1MB | ~50MB | ~300MB |
| Startup time | Instant | Fast | ~2s |
| Anti-bot bypass | Moderate | Low | High |
| TypeScript | ✅ | ✅ | ✅ |
When to Use Each
Choose cheerio if:
- Scraping static HTML (server-rendered pages, RSS feeds, sitemaps)
- High-volume scraping where memory and speed matter
- Transforming or sanitizing HTML content
- The page content doesn't require JavaScript to render
Choose jsdom if:
- Running browser-based unit/component tests (Vitest, Jest)
- Testing code that uses DOM APIs without a real browser
- Processing HTML with standard DOM APIs (querySelector, classList, events)
- You need event simulation but not JavaScript-rendered content
Choose Playwright if:
- Scraping SPAs or any page that requires JavaScript to render content
- Automating form submission, login flows, or user interactions
- Taking screenshots or generating PDFs from web pages
- End-to-end testing of your own application
Anti-Bot Measures and Production Scraping Realities
The practical difficulty of scraping in 2026 is less about HTML parsing and more about getting through anti-bot systems. Cloudflare, DataDome, Kasada, and similar services detect automated browsers through behavioral fingerprinting — mouse movement patterns, canvas rendering signatures, JavaScript API inconsistencies, and timing analysis. cheerio is the weakest option here because it sends raw HTTP requests that look nothing like browser traffic; sites that serve JavaScript challenges or require cookie-based sessions will return bot detection pages rather than the content you want. jsdom is slightly better because it has a more browser-like request profile, but it lacks GPU fingerprinting and canvas rendering, which are common detection vectors. Playwright with a configured user agent, viewport, and human-like interaction patterns has the highest anti-bot bypass rate, but sophisticated systems like Cloudflare's challenge page still detect headless Chromium without additional configuration. For production scraping that must bypass serious anti-bot measures, tools like playwright-extra with the stealth plugin, residential proxies, or commercial scraping APIs (Apify, Bright Data) are necessary additions.
TypeScript Integration Across All Three Libraries
All three libraries support TypeScript in 2026, but the depth of type coverage differs. Cheerio ships its own TypeScript definitions and the types are generally accurate — CheerioAPI is the main interface, and methods like .text(), .attr(), and .each() are properly typed. The main gap is that cheerio's dynamic nature makes it difficult to type the shape of data you extract; you typically annotate your extraction results manually. jsdom's @types/jsdom package is comprehensive because the underlying DOM API has a well-defined W3C specification — document.querySelector() returns Element | null, and the standard DOM interfaces are precisely typed. The challenge with jsdom is that the full DOM type surface includes hundreds of interfaces that TypeScript must load, which can slow down type checking in large projects. Playwright's TypeScript types are first-class — the library is written in TypeScript, types ship with the package, and methods like page.evaluate() support generic type parameters for the return value. The page.evaluate<T>() signature lets you express the expected shape of extracted data, giving end-to-end type safety from browser to Node.js.
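The generic-return pattern described for `page.evaluate<T>()` is easy to see in isolation. A sketch with a stand-in `evaluate` function (a hypothetical mock, not Playwright's implementation) shows how the type parameter flows from the browser-side callback to the Node.js result:

```typescript
// Stand-in for page.evaluate<T>() to illustrate generic return typing.
// This mock just runs the callback locally; real Playwright serializes the
// function into the browser and ships the result back over the protocol.
interface PackageData {
  downloads: string
  version: string
}

async function evaluate<T>(fn: () => T): Promise<T> {
  return fn()
}

evaluate<PackageData>(() => ({
  downloads: "25,000,000",
  version: "19.0.0",
})).then((data) => {
  // `data` is typed as PackageData end to end — no casting needed.
  console.log(data.downloads) // "25,000,000"
})
```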
Performance Characteristics and Concurrency
For high-volume scraping, the performance gap between libraries becomes operationally significant. Cheerio can parse and query a typical news article page in under 5 milliseconds, meaning a single-threaded Node.js process can scrape hundreds of pages per second when network latency is the bottleneck. jsdom is roughly 10-20x slower than cheerio for the same HTML because it builds a full DOM tree with circular parent-child references and event listener infrastructure even when you do not use those features. Playwright is 100-300x slower than cheerio per page — launching a browser context, navigating, waiting for load events, and closing the page is measured in seconds, not milliseconds. This makes concurrency management critical for Playwright: running 50 simultaneous Playwright page contexts is possible but requires careful memory management (each page context uses 200-400MB of RAM) and a browser pool to avoid cold-starting a new browser for each request. For large-scale Playwright scraping, frameworks like Crawlee (by Apify) handle browser pool management and request queuing automatically.
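The browser-pool idea above can be sketched as a generic resource pool: acquire reuses an idle resource instead of cold-starting a new one. This is an illustrative toy; real pools such as Crawlee's also handle request queuing, health checks, and eviction:

```typescript
// Minimal generic resource pool (illustrative sketch of the browser-pool idea).
class Pool<T> {
  private idle: T[] = []
  constructor(private create: () => T, private max: number) {}

  acquire(): T {
    // Reuse an idle resource if available; otherwise pay the creation cost.
    return this.idle.pop() ?? this.create()
  }

  release(resource: T): void {
    // Keep up to `max` idle resources around for reuse.
    if (this.idle.length < this.max) this.idle.push(resource)
  }
}

// Usage with a fake "browser context" factory:
let created = 0
const pool = new Pool(() => ({ id: ++created }), 2)
const a = pool.acquire() // creates context #1
pool.release(a)
const b = pool.acquire() // reuses #1 — no second cold start
console.log(created, b.id) // 1 1
```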
Ethical and Legal Considerations
The technical capability to scrape a website does not imply a legal or ethical right to do so. Website terms of service frequently prohibit automated data collection, and courts in multiple jurisdictions have upheld ToS violations as grounds for injunctive relief, even when the data is publicly accessible. The hiQ Labs v. LinkedIn case established some protections for scraping publicly available data, but the legal landscape varies by jurisdiction and changes frequently. Beyond legal concerns, aggressive scraping can cause genuine harm to small websites with limited server capacity — a scraper that ignores rate limits or cache headers can effectively DDoS a resource-constrained site. The ethical scraping baseline is: respect robots.txt disallow rules, add appropriate delays between requests, identify your scraper with a descriptive User-Agent, and prefer official APIs when they exist. Playwright's stealth plugins that evade anti-bot detection are in a legal gray area when used against a site that has explicitly blocked automated access.
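A minimal good-citizen robots.txt check can be sketched without any library. This simplified parser handles only `User-agent: *` groups with prefix matching; real matching per RFC 9309 also covers wildcards, `Allow` precedence, and longest-match rules:

```typescript
// Very simplified robots.txt handling: collect Disallow prefixes for the
// `User-agent: *` group, then test a path against them (RFC 9309 is richer).
function parseDisallows(robotsTxt: string): string[] {
  const disallows: string[] = []
  let inStarGroup = false
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.trim()
    if (/^user-agent:/i.test(line)) {
      inStarGroup = line.slice(line.indexOf(":") + 1).trim() === "*"
    } else if (inStarGroup && /^disallow:/i.test(line)) {
      const path = line.slice(line.indexOf(":") + 1).trim()
      if (path) disallows.push(path)
    }
  }
  return disallows
}

function isAllowed(path: string, disallows: string[]): boolean {
  return !disallows.some((prefix) => path.startsWith(prefix))
}

const robots = `User-agent: *\nDisallow: /admin\nDisallow: /private/`
const rules = parseDisallows(robots)
console.log(isAllowed("/packages/react", rules)) // true
console.log(isAllowed("/admin/users", rules))    // false
```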
Choosing the Right Tool for Your Use Case
The decision between these three libraries should be driven by the nature of the HTML you are working with, not by familiarity or download counts. If you are writing a build tool that transforms MDX or HTML files, cheerio is the correct choice — it is fast, has minimal memory overhead, and the jQuery API makes HTML transformation readable and maintainable. If you are writing component tests that need to assert on rendered HTML structure without launching a browser, jsdom through Vitest or Jest is the established solution, backed by @testing-library/react and the full Testing Library ecosystem. If you are building an end-to-end test suite or scraping JavaScript-heavy pages, Playwright is the only option that guarantees you see what a real user sees. The 100x resource difference between cheerio and Playwright is rarely a reason to choose cheerio over Playwright for a test suite where correctness matters more than speed — but it is a strong reason to choose cheerio for a high-volume data pipeline where cheerio's capabilities are sufficient.
Methodology
Download data from the npm registry (weekly average, February 2026). Memory estimates based on community benchmarks and official documentation. Feature comparison based on cheerio v1.x, jsdom v25.x, and Playwright v1.4x.
Compare web automation and parsing packages on PkgPulse →
See also: Playwright vs Puppeteer and Cypress vs Playwright, archiver vs adm-zip vs JSZip (2026).