cheerio vs jsdom vs Playwright: HTML Parsing and Scraping in Node.js (2026)
TL;DR
cheerio is the right choice for parsing static HTML — it gives you a jQuery-like API ($('.className')) with minimal memory overhead. jsdom simulates a full browser DOM in Node.js — good when you need DOM APIs (querySelector, innerHTML) but the page doesn't require JavaScript execution. Playwright launches a real browser — necessary for dynamic SPAs, JavaScript-rendered content, or when you need to simulate user interaction. Each is 10-100x heavier than the previous.
Key Takeaways
- cheerio: ~10M weekly downloads — jQuery selector API, parses static HTML, ~1MB memory
- jsdom: ~8M weekly downloads — full DOM implementation, runs in Node.js, no JS execution by default
- playwright: ~3M weekly downloads — real Chromium/Firefox/WebKit, JavaScript execution, screenshots
- Static HTML scraping (GitHub, Stack Overflow, news sites): cheerio
- Testing with DOM APIs, server-rendered content: jsdom
- SPAs, login-gated content, interaction simulation: Playwright
- Memory cost: cheerio ~1MB, jsdom ~50MB, Playwright ~300MB per page
Download Trends
| Package | Weekly Downloads | Browser? | JS Execution | Memory |
|---|---|---|---|---|
| cheerio | ~10M | ❌ HTML parser | ❌ | ~1MB/page |
| jsdom | ~8M | ❌ DOM in Node | ⚠️ Limited | ~50MB/page |
| playwright | ~3M | ✅ Real browser | ✅ | ~300MB/page |
The Decision Matrix
Is the content rendered by JavaScript (SPA/React/Vue)?
Yes → Use Playwright
No (static HTML):
Do you need DOM APIs (addEventListener, classList, etc.)?
Yes → jsdom
No → cheerio (fastest, lightest)
cheerio
cheerio parses HTML with a jQuery-compatible selector API — no browser, no JavaScript runtime, just fast HTML traversal.
Basic Scraping
import * as cheerio from "cheerio"
// Parse HTML string:
const html = await fetch("https://npmjs.com/package/react").then(r => r.text())
const $ = cheerio.load(html)
// jQuery-like selectors:
const packageName = $("h1").text().trim()
const description = $('p[data-testid="package-description"]').text().trim()
const weeklyDownloads = $('[data-testid="weekly-downloads"]').text().trim()
// Attribute access:
const packageLink = $("a.package-name-link").attr("href")
// Iterate over elements:
$(".package-dependency-list li").each((index, element) => {
const depName = $(element).find("a").text()
const depVersion = $(element).find(".dep-version").text()
console.log(`${depName}: ${depVersion}`)
})
// Table scraping:
const tableData: string[][] = []
$("table tbody tr").each((_, row) => {
const cells = $(row).find("td").map((_, cell) => $(cell).text().trim()).get()
tableData.push(cells)
})
HTML Transformation
cheerio is also excellent for transforming HTML, not just reading it:
const $ = cheerio.load(html)
// Add a class to all external links:
$("a[href^='http']").each((_, el) => {
const $el = $(el)
if (!$el.attr("href")?.includes("pkgpulse.com")) {
$el.addClass("external-link")
$el.attr("target", "_blank")
$el.attr("rel", "noopener noreferrer")
}
})
// Remove all script tags (sanitize HTML):
$("script, style, iframe").remove()
// Add a nofollow to all links:
$("a").attr("rel", (_, existing) => {
return existing ? `${existing} nofollow` : "nofollow"
})
// Get transformed HTML:
const cleanHtml = $.html()
Scraping with HTTP (fetch + cheerio):
import * as cheerio from "cheerio"
interface NpmPackageStats {
name: string
version: string
weeklyDownloads: string
description: string
}
async function scrapeNpmPage(packageName: string): Promise<NpmPackageStats> {
const res = await fetch(`https://www.npmjs.com/package/${packageName}`, {
headers: {
"User-Agent": "Mozilla/5.0 (compatible; PkgPulseBot/1.0)",
"Accept": "text/html",
},
})
if (!res.ok) throw new Error(`HTTP ${res.status}`)
const html = await res.text()
const $ = cheerio.load(html)
return {
name: packageName,
version: $('span[data-testid="version-link"]').first().text().trim(),
weeklyDownloads: $('[aria-label*="weekly downloads"]').text().trim(),
description: $('p[data-testid="package-description"]').text().trim(),
}
}
// Batch scraping with concurrency limit:
import pLimit from "p-limit"
const limit = pLimit(3) // Max 3 concurrent requests
const packages = ["react", "vue", "angular", "svelte", "solid-js"]
const results = await Promise.all(
packages.map((name) => limit(() => scrapeNpmPage(name)))
)
cheerio limitations:
- Static HTML only — no JavaScript execution
- Can't handle React/Vue/Angular rendered pages
- DOM events (addEventListener) don't work
- Some CSS selectors may differ from browser behavior
jsdom
jsdom implements the browser DOM spec in Node.js — useful when you need actual DOM APIs or are running tests that simulate a browser environment.
Basic DOM Parsing
import { JSDOM } from "jsdom"
const html = `
<html>
<head><title>PkgPulse</title></head>
<body>
<nav id="main-nav">
<a href="/" class="nav-link active">Home</a>
<a href="/compare" class="nav-link">Compare</a>
</nav>
<main>
<h1 class="title">Package Analytics</h1>
<p>Weekly downloads: <span id="count">25,000,000</span></p>
</main>
</body>
</html>
`
const { window } = new JSDOM(html)
const { document } = window
// Standard DOM APIs:
const title = document.querySelector("h1.title")?.textContent // "Package Analytics"
const count = document.getElementById("count")?.textContent // "25,000,000"
const navLinks = document.querySelectorAll(".nav-link")
navLinks.forEach((link) => {
console.log(link.textContent, link.getAttribute("href"), link.classList.contains("active"))
})
// TreeWalker for deep traversal:
const walker = document.createTreeWalker(
document.body,
window.NodeFilter.SHOW_TEXT, // NodeFilter lives on the jsdom window, not as a Node.js global
null
)
const textNodes: string[] = []
let node = walker.nextNode()
while (node) {
if (node.nodeValue?.trim()) textNodes.push(node.nodeValue.trim())
node = walker.nextNode()
}
jsdom for Test Environments
// jsdom is used by Vitest and Jest as the browser environment for unit tests:
// vitest.config.ts:
import { defineConfig } from "vitest/config"
export default defineConfig({
test: {
environment: "jsdom", // Provides window, document, etc.
setupFiles: "./test/setup.ts",
},
})
// test/setup.ts:
import "@testing-library/jest-dom"
// Now tests can use document, window, etc.:
import { render, screen } from "@testing-library/react"
import { PackageCard } from "@/components/PackageCard"
test("renders package name", () => {
render(<PackageCard name="react" downloads={25000000} />)
expect(screen.getByText("react")).toBeInTheDocument()
})
jsdom with JavaScript Execution (Limited)
// jsdom can run inline scripts (with the right settings):
const { window } = new JSDOM(html, {
runScripts: "dangerously", // Execute <script> tags — use only for trusted HTML
resources: "usable", // Load external stylesheets/scripts
url: "https://pkgpulse.com", // Required for scripts that access location
})
// Wait for scripts to execute:
window.addEventListener("load", () => {
const value = window.myGlobal // Access values set by scripts
})
⚠️ runScripts: "dangerously" runs untrusted code in Node.js — only use for content you control.
Playwright
Playwright launches a real browser — necessary when the page requires JavaScript to render content.
Dynamic Page Scraping
import { chromium } from "playwright"
async function scrapeDynamicPage(url: string) {
const browser = await chromium.launch({ headless: true })
const page = await browser.newPage()
// Set realistic browser headers:
await page.setExtraHTTPHeaders({
"Accept-Language": "en-US,en;q=0.9",
})
await page.goto(url, { waitUntil: "networkidle" })
// Wait for specific content to appear:
await page.waitForSelector("[data-testid='download-count']")
// Extract data using page.evaluate() — runs in browser context:
const data = await page.evaluate(() => ({
downloads: document.querySelector("[data-testid='download-count']")?.textContent,
version: document.querySelector("[data-testid='latest-version']")?.textContent,
// Can access window, localStorage, etc.
}))
await browser.close()
return data
}
Login and Authentication
interface Credentials { email: string; password: string }
async function scrapeAuthenticatedPage(url: string, credentials: Credentials) {
const browser = await chromium.launch()
const context = await browser.newContext({
// Persist login state across pages:
storageState: "auth/session.json",
})
const page = await context.newPage()
// Log in:
await page.goto("https://pkgpulse.com/login")
await page.fill('[name="email"]', credentials.email)
await page.fill('[name="password"]', credentials.password)
await page.click('[type="submit"]')
await page.waitForURL("**/dashboard") // glob pattern — matches against the full URL
// Save auth state for reuse:
await context.storageState({ path: "auth/session.json" })
// Now scrape authenticated content:
await page.goto(url)
const privateData = await page.textContent(".private-content")
await browser.close()
return privateData
}
Scraping SPAs with Route Changes
async function scrapeReactApp(baseUrl: string) {
const browser = await chromium.launch()
const page = await browser.newPage()
// Intercept API calls instead of scraping rendered HTML:
const apiResponses: unknown[] = []
page.on("response", async (response) => {
if (response.url().includes("/api/packages")) {
const json = await response.json()
apiResponses.push(json)
}
})
await page.goto(`${baseUrl}/packages`)
await page.waitForLoadState("networkidle")
// Intercepting the JSON APIs is usually more efficient than parsing rendered HTML
console.log(apiResponses)
await browser.close()
}
Feature Comparison
| Feature | cheerio | jsdom | Playwright |
|---|---|---|---|
| JavaScript execution | ❌ | ⚠️ Limited | ✅ Full |
| Real browser APIs | ❌ | ✅ DOM spec | ✅ |
| jQuery selectors | ✅ | ❌ | ❌ |
| CSS querySelectorAll | ✅ | ✅ | ✅ |
| Screenshots | ❌ | ❌ | ✅ |
| Form submission | ❌ | ⚠️ | ✅ |
| Login/session handling | ❌ | ❌ | ✅ |
| Network interception | ❌ | ❌ | ✅ |
| Memory per page | ~1MB | ~50MB | ~300MB |
| Startup time | Instant | Fast | ~2s |
| Anti-bot bypass | Moderate | Low | High |
| TypeScript | ✅ | ✅ | ✅ |
When to Use Each
Choose cheerio if:
- Scraping static HTML (server-rendered pages, RSS feeds, sitemaps)
- High-volume scraping where memory and speed matter
- Transforming or sanitizing HTML content
- The page content doesn't require JavaScript to render
Choose jsdom if:
- Running browser-based unit/component tests (Vitest, Jest)
- Testing code that uses DOM APIs without a real browser
- Processing HTML with standard DOM APIs (querySelector, classList, events)
- You need event simulation but not JavaScript-rendered content
Choose Playwright if:
- Scraping SPAs or any page that requires JavaScript to render content
- Automating form submission, login flows, or user interactions
- Taking screenshots or generating PDFs from web pages
- End-to-end testing of your own application
Methodology
Download data from npm registry (weekly average, February 2026). Memory estimates based on community benchmarks and official documentation. Feature comparison based on cheerio v1.x, jsdom v25.x, and Playwright 1.4x.