
cheerio vs jsdom vs Playwright: HTML Parsing and Scraping in Node.js (2026)

PkgPulse Team

TL;DR

cheerio is the right choice for parsing static HTML — it gives you a jQuery-like API ($('.className')) with minimal memory overhead. jsdom simulates a full browser DOM in Node.js — good when you need DOM APIs (querySelector, innerHTML) but the page doesn't require JavaScript execution. Playwright launches a real browser — necessary for dynamic SPAs, JavaScript-rendered content, or when you need to simulate user interaction. Each is 10-100x heavier than the previous.

Key Takeaways

  • cheerio: ~10M weekly downloads — jQuery selector API, parses static HTML, ~1MB memory
  • jsdom: ~8M weekly downloads — full DOM implementation, runs in Node.js, no JS execution
  • playwright: ~3M weekly downloads — real Chromium/Firefox/WebKit, JavaScript execution, screenshots
  • Static HTML scraping (GitHub, Stack Overflow, news sites): cheerio
  • Testing with DOM APIs, server-rendered content: jsdom
  • SPAs, login-gated content, interaction simulation: Playwright
  • Memory cost: cheerio ~1MB, jsdom ~50MB, Playwright ~300MB per page

| Package    | Weekly Downloads | Browser?        | JS Execution | Memory      |
| ---------- | ---------------- | --------------- | ------------ | ----------- |
| cheerio    | ~10M             | ❌ HTML parser   | ❌           | ~1MB/page   |
| jsdom      | ~8M              | ❌ DOM in Node   | ⚠️ Limited   | ~50MB/page  |
| playwright | ~3M              | ✅ Real browser  | ✅           | ~300MB/page |

The Decision Matrix

Is the content rendered by JavaScript (SPA/React/Vue)?
  Yes → Use Playwright

No (static HTML):
  Do you need DOM APIs (addEventListener, classList, etc.)?
    Yes → jsdom
    No → cheerio (fastest, lightest)
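The first question in the matrix can often be answered programmatically: fetch the raw HTML and check whether it contains real content or only an empty mount point. A rough heuristic sketch — the `#root`/`#app` ids are common React/Vue conventions, not a standard, and the 200-character threshold is an arbitrary cutoff:

```typescript
// Heuristic: does this HTML look like an empty SPA shell that needs a browser?
function looksLikeSpaShell(html: string): boolean {
  // Strip scripts/styles, then all tags, and measure the remaining visible text.
  const visible = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim()

  // An empty mount div is a strong SPA signal (id names are conventions only).
  const hasEmptyMount = /<div[^>]*id=["'](root|app)["'][^>]*>\s*<\/div>/i.test(html)

  return hasEmptyMount || visible.length < 200
}
```

If this returns true, reach for Playwright; otherwise cheerio or jsdom will usually do.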

cheerio

cheerio parses HTML with a jQuery-compatible selector API — no browser, no JavaScript runtime, just fast HTML traversal.

Basic Scraping

import * as cheerio from "cheerio"

// Parse HTML string:
const html = await fetch("https://npmjs.com/package/react").then(r => r.text())
const $ = cheerio.load(html)

// jQuery-like selectors:
const packageName = $("h1").text().trim()
const description = $('p[data-testid="package-description"]').text().trim()
const weeklyDownloads = $('[data-testid="weekly-downloads"]').text().trim()

// Attribute access:
const packageLink = $("a.package-name-link").attr("href")

// Iterate over elements:
$(".package-dependency-list li").each((index, element) => {
  const depName = $(element).find("a").text()
  const depVersion = $(element).find(".dep-version").text()
  console.log(`${depName}: ${depVersion}`)
})

// Table scraping:
const tableData: string[][] = []
$("table tbody tr").each((_, row) => {
  const cells = $(row).find("td").map((_, cell) => $(cell).text().trim()).get()
  tableData.push(cells)
})
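If you also scrape the header cells (e.g. from `thead th`), the raw rows can be zipped into named records. A small helper sketch — purely illustrative, no cheerio required:

```typescript
// Zip a header row with data rows into keyed records.
// Missing cells become empty strings; extra cells are dropped.
function rowsToRecords(
  headers: string[],
  rows: string[][],
): Record<string, string>[] {
  return rows.map((cells) =>
    Object.fromEntries(headers.map((h, i) => [h, cells[i] ?? ""])),
  )
}

// e.g. const records = rowsToRecords(headerCells, tableData)
```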

HTML Transformation

cheerio is also excellent for transforming HTML, not just reading it:

const $ = cheerio.load(html)

// Add a class to all external links:
$("a[href^='http']").each((_, el) => {
  const $el = $(el)
  if (!$el.attr("href")?.includes("pkgpulse.com")) {
    $el.addClass("external-link")
    $el.attr("target", "_blank")
    $el.attr("rel", "noopener noreferrer")
  }
})

// Remove all script tags (sanitize HTML):
$("script, style, iframe").remove()

// Add rel="nofollow" to every link:
$("a").attr("rel", (_, existing) => {
  return existing ? `${existing} nofollow` : "nofollow"
})

// Get transformed HTML:
const cleanHtml = $.html()

Scraping with HTTP (fetch + cheerio):

import * as cheerio from "cheerio"

interface NpmPackageStats {
  name: string
  version: string
  weeklyDownloads: string
  description: string
}

async function scrapeNpmPage(packageName: string): Promise<NpmPackageStats> {
  const res = await fetch(`https://www.npmjs.com/package/${packageName}`, {
    headers: {
      "User-Agent": "Mozilla/5.0 (compatible; PkgPulseBot/1.0)",
      "Accept": "text/html",
    },
  })

  if (!res.ok) throw new Error(`HTTP ${res.status}`)

  const html = await res.text()
  const $ = cheerio.load(html)

  return {
    name: packageName,
    version: $('span[data-testid="version-link"]').first().text().trim(),
    weeklyDownloads: $('[aria-label*="weekly downloads"]').text().trim(),
    description: $('p[data-testid="package-description"]').text().trim(),
  }
}

// Batch scraping with concurrency limit:
import pLimit from "p-limit"
const limit = pLimit(3)  // Max 3 concurrent requests

const packages = ["react", "vue", "angular", "svelte", "solid-js"]
const results = await Promise.all(
  packages.map((name) => limit(() => scrapeNpmPage(name)))
)
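Real-world scraping also needs retries, since individual requests fail transiently. A minimal exponential-backoff wrapper — the attempt count and base delay are arbitrary defaults, not cheerio-specific:

```typescript
// Retry an async task, doubling the delay between attempts.
async function withRetry<T>(
  task: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await task()
    } catch (err) {
      lastError = err
      if (i < attempts - 1) {
        // Waits 500ms, then 1000ms, then 2000ms, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i))
      }
    }
  }
  throw lastError
}

// Usage with the scraper above:
// const stats = await withRetry(() => scrapeNpmPage("react"))
```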

cheerio limitations:

  • Static HTML only — no JavaScript execution
  • Can't handle React/Vue/Angular rendered pages
  • DOM events (addEventListener) don't work
  • Some CSS selectors may differ from browser behavior

jsdom

jsdom implements the browser DOM spec in Node.js — useful when you need actual DOM APIs or are running tests that simulate a browser environment.

Basic DOM Parsing

import { JSDOM } from "jsdom"

const html = `
  <html>
    <head><title>PkgPulse</title></head>
    <body>
      <nav id="main-nav">
        <a href="/" class="nav-link active">Home</a>
        <a href="/compare" class="nav-link">Compare</a>
      </nav>
      <main>
        <h1 class="title">Package Analytics</h1>
        <p>Weekly downloads: <span id="count">25,000,000</span></p>
      </main>
    </body>
  </html>
`

const { window } = new JSDOM(html)
const { document } = window

// Standard DOM APIs:
const title = document.querySelector("h1.title")?.textContent  // "Package Analytics"
const count = document.getElementById("count")?.textContent    // "25,000,000"

const navLinks = document.querySelectorAll(".nav-link")
navLinks.forEach((link) => {
  console.log(link.textContent, link.getAttribute("href"), link.classList.contains("active"))
})

// TreeWalker for deep traversal:
const walker = document.createTreeWalker(
  document.body,
  NodeFilter.SHOW_TEXT,
  null
)

const textNodes: string[] = []
let node = walker.nextNode()
while (node) {
  if (node.nodeValue?.trim()) textNodes.push(node.nodeValue.trim())
  node = walker.nextNode()
}
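Text nodes collected this way preserve source-formatting whitespace. A small normalizer — a convention of this sketch, not part of jsdom — tidies them for comparison or indexing:

```typescript
// Collapse runs of whitespace and drop empty fragments.
function normalizeTextNodes(nodes: string[]): string[] {
  return nodes
    .map((text) => text.replace(/\s+/g, " ").trim())
    .filter((text) => text.length > 0)
}
```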

jsdom for Test Environments

// jsdom is used by Vitest and Jest as the browser environment for unit tests:
// vitest.config.ts:
import { defineConfig } from "vitest/config"

export default defineConfig({
  test: {
    environment: "jsdom",  // Provides window, document, etc.
    setupFiles: "./test/setup.ts",
  },
})

// test/setup.ts:
import "@testing-library/jest-dom"

// Now tests can use document, window, etc.:
import { render, screen } from "@testing-library/react"
import { PackageCard } from "@/components/PackageCard"

test("renders package name", () => {
  render(<PackageCard name="react" downloads={25000000} />)
  expect(screen.getByText("react")).toBeInTheDocument()
})

jsdom with JavaScript Execution (Limited)

// jsdom can run inline scripts (with the right settings):
const { window } = new JSDOM(html, {
  runScripts: "dangerously",    // Execute <script> tags — use only for trusted HTML
  resources: "usable",          // Load external stylesheets/scripts
  url: "https://pkgpulse.com",  // Required for scripts that access location
})

// Wait for scripts to execute:
window.addEventListener("load", () => {
  const value = window.myGlobal  // Access values set by scripts
})

⚠️ runScripts: "dangerously" runs untrusted code in Node.js — only use for content you control.


Playwright

Playwright launches a real browser — necessary when the page requires JavaScript to render content.

Dynamic Page Scraping

import { chromium } from "playwright"

async function scrapeDynamicPage(url: string) {
  const browser = await chromium.launch({ headless: true })
  const page = await browser.newPage()

  // Set realistic browser headers:
  await page.setExtraHTTPHeaders({
    "Accept-Language": "en-US,en;q=0.9",
  })

  await page.goto(url, { waitUntil: "networkidle" })

  // Wait for specific content to appear:
  await page.waitForSelector("[data-testid='download-count']")

  // Extract data using page.evaluate() — runs in browser context:
  const data = await page.evaluate(() => ({
    downloads: document.querySelector("[data-testid='download-count']")?.textContent,
    version: document.querySelector("[data-testid='latest-version']")?.textContent,
    // Can access window, localStorage, etc.
  }))

  await browser.close()
  return data
}

Login and Authentication

async function scrapeAuthenticatedPage(url: string, credentials: Credentials) {
  const browser = await chromium.launch()
  const context = await browser.newContext({
    // Persist login state across pages:
    storageState: "auth/session.json",
  })
  const page = await context.newPage()

  // Log in:
  await page.goto("https://pkgpulse.com/login")
  await page.fill('[name="email"]', credentials.email)
  await page.fill('[name="password"]', credentials.password)
  await page.click('[type="submit"]')
  await page.waitForURL("/dashboard")

  // Save auth state for reuse:
  await context.storageState({ path: "auth/session.json" })

  // Now scrape authenticated content:
  await page.goto(url)
  const privateData = await page.textContent(".private-content")

  await browser.close()
  return privateData
}
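Reusing the saved session only helps while it is still valid. A small freshness check before skipping the login step — the 24-hour TTL is an arbitrary assumption; match it to your site's actual session lifetime:

```typescript
import { statSync } from "node:fs"

// Decide whether a saved storageState file is worth reusing.
function sessionIsFresh(path: string, maxAgeMs = 24 * 60 * 60 * 1000): boolean {
  try {
    const age = Date.now() - statSync(path).mtimeMs
    return age < maxAgeMs
  } catch {
    return false // File missing or unreadable → log in again
  }
}

// if (!sessionIsFresh("auth/session.json")) { /* run the login flow first */ }
```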

Scraping SPAs with Route Changes

async function scrapeReactApp(baseUrl: string) {
  const browser = await chromium.launch()
  const page = await browser.newPage()

  // Intercept API calls instead of scraping rendered HTML:
  const apiResponses: unknown[] = []
  page.on("response", async (response) => {
    if (response.url().includes("/api/packages")) {
      const json = await response.json()
      apiResponses.push(json)
    }
  })

  await page.goto(`${baseUrl}/packages`)
  await page.waitForLoadState("networkidle")

  // Often more efficient to intercept the JSON APIs than to parse rendered HTML
  console.log(apiResponses)

  await browser.close()
}
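Intercepted API responses usually arrive as paginated chunks, and pages can overlap when the UI refetches. A merge helper sketch — it assumes each item carries an `id` field, which you should adjust to the real payload shape:

```typescript
interface PackageRecord {
  id: string
  [key: string]: unknown
}

// Flatten captured response pages into one list,
// keeping the first occurrence of each id.
function mergePages(pages: PackageRecord[][]): PackageRecord[] {
  const seen = new Map<string, PackageRecord>()
  for (const page of pages) {
    for (const item of page) {
      if (!seen.has(item.id)) seen.set(item.id, item)
    }
  }
  return [...seen.values()]
}
```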

Feature Comparison

| Feature                | cheerio  | jsdom      | Playwright |
| ---------------------- | -------- | ---------- | ---------- |
| JavaScript execution   | ❌       | ⚠️ Limited | ✅ Full    |
| Real browser APIs      | ❌       | ✅ DOM spec | ✅        |
| jQuery selectors       | ✅       | ❌         | ❌         |
| CSS querySelectorAll   | ❌       | ✅         | ✅         |
| Screenshots            | ❌       | ❌         | ✅         |
| Form submission        | ❌       | ⚠️         | ✅         |
| Login/session handling | ❌       | ❌         | ✅         |
| Network interception   | ❌       | ❌         | ✅         |
| Memory per page        | ~1MB     | ~50MB      | ~300MB     |
| Startup time           | Instant  | Fast       | ~2s        |
| Anti-bot bypass        | Moderate | Low        | High       |
| TypeScript             | ✅       | ✅         | ✅         |

When to Use Each

Choose cheerio if:

  • Scraping static HTML (server-rendered pages, RSS feeds, sitemaps)
  • High-volume scraping where memory and speed matter
  • Transforming or sanitizing HTML content
  • The page content doesn't require JavaScript to render

Choose jsdom if:

  • Running browser-based unit/component tests (Vitest, Jest)
  • Testing code that uses DOM APIs without a real browser
  • Processing HTML with standard DOM APIs (querySelector, classList, events)
  • You need event simulation but not JavaScript-rendered content

Choose Playwright if:

  • Scraping SPAs or any page that requires JavaScript to render content
  • Automating form submission, login flows, or user interactions
  • Taking screenshots or generating PDFs from web pages
  • End-to-end testing of your own application

Methodology

Download data from the npm registry (weekly averages, February 2026). Memory estimates are based on community benchmarks and official documentation. Feature comparison based on cheerio v1.x, jsdom v25.x, and Playwright v1.4x.

Compare web automation and parsing packages on PkgPulse →
