Guide

unpdf vs pdf-parse vs pdfjs-dist 2026

unpdf, pdf-parse, and pdfjs-dist compared for PDF text extraction in Node.js in 2026. Edge runtime support, metadata reading, and page rendering capabilities.

PkgPulse Team

TL;DR

unpdf is the UnJS PDF extraction library — lightweight wrapper around pdf.js for extracting text and metadata from PDFs, works in Node.js, edge runtimes, and browsers. pdf-parse is the simple PDF text extractor — wraps pdf.js with a minimal API, returns text + metadata + page count, the most popular Node.js PDF parser. pdfjs-dist (pdf.js) is Mozilla's full PDF renderer — renders PDFs to canvas, extracts text, handles annotations, powers Firefox's PDF viewer. In 2026: unpdf for modern/edge PDF extraction, pdf-parse for simple Node.js text extraction, pdfjs-dist for full rendering and browser display.

Key Takeaways

  • unpdf: ~200K weekly downloads — UnJS, edge-compatible, text + metadata extraction
  • pdf-parse: ~2M weekly downloads — simple API, text + info + page count, Node.js
  • pdfjs-dist: ~3M weekly downloads — Mozilla, full PDF rendering, canvas support
  • All three use pdf.js internally — different abstraction levels
  • unpdf and pdf-parse focus on text extraction; pdfjs-dist does everything
  • For AI/LLM pipelines, any of these work for extracting text from PDFs

unpdf

unpdf — modern PDF extraction:

Basic text extraction

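import fs from "node:fs/promises"
import { extractText } from "unpdf"

// Simple text extraction (mergePages joins all pages into one string):
const buffer = await fs.readFile("document.pdf")
const { text, totalPages } = await extractText(new Uint8Array(buffer), {
  mergePages: true,
})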

console.log(`Pages: ${totalPages}`)
console.log(text)
// → Full text content of the PDF

Document proxy (advanced)

import fs from "node:fs/promises"
import { getDocumentProxy } from "unpdf"

// Get full document access (pdf.js expects a Uint8Array, not a Node Buffer):
const buffer = await fs.readFile("report.pdf")
const pdf = await getDocumentProxy(new Uint8Array(buffer))

console.log(`Pages: ${pdf.numPages}`)

// Extract text from specific page:
const page = await pdf.getPage(1)
const content = await page.getTextContent()
const pageText = content.items
  .map((item) => item.str)
  .join(" ")

console.log(pageText)

Metadata extraction

import fs from "node:fs/promises"
import { getDocumentProxy } from "unpdf"

const buffer = await fs.readFile("report.pdf")
const pdf = await getDocumentProxy(new Uint8Array(buffer))
const metadata = await pdf.getMetadata()

console.log(metadata.info)
// → {
//   Title: "Annual Report 2026",
//   Author: "PkgPulse Team",
//   Creator: "Google Docs",
//   Producer: "Skia/PDF m120",
//   CreationDate: "D:20260309...",
// }

Edge runtime / serverless

import { extractText } from "unpdf"

// unpdf works in edge runtimes (Cloudflare Workers, Vercel Edge):
export default {
  async fetch(request: Request): Promise<Response> {
    const formData = await request.formData()
    const file = formData.get("pdf") as File
    const buffer = await file.arrayBuffer()

    const { text, totalPages } = await extractText(
      new Uint8Array(buffer)
    )

    return Response.json({ text, totalPages })
  },
}

pdf-parse

pdf-parse — simple text extraction:

Basic usage

import fs from "node:fs"
import pdf from "pdf-parse"

const buffer = fs.readFileSync("document.pdf")
const data = await pdf(buffer)

console.log(data.numpages)    // Number of pages
console.log(data.numrender)   // Number of rendered pages
console.log(data.info)        // PDF metadata
console.log(data.metadata)    // PDF metadata (XML)
console.log(data.version)     // PDF version
console.log(data.text)        // All text content

With options

import pdf from "pdf-parse"

const data = await pdf(buffer, {
  // Limit pages to parse:
  max: 10,  // Only first 10 pages

  // Custom page render function:
  pagerender(pageData) {
    const textContent = pageData.getTextContent()
    return textContent.then((content) => {
      return content.items
        .map((item) => item.str)
        .join(" ")
    })
  },
})

console.log(data.text)

Extract text per page

import pdf from "pdf-parse"

const pages: string[] = []

const data = await pdf(buffer, {
  pagerender(pageData) {
    return pageData.getTextContent().then((content) => {
      const pageText = content.items
        .map((item) => item.str)
        .join(" ")
      pages.push(pageText)
      return pageText
    })
  },
})

// pages[0] = text from page 1
// pages[1] = text from page 2
// etc.

Common use: AI/LLM document processing

import fs from "node:fs"
import pdf from "pdf-parse"

async function extractForLLM(pdfPath: string): Promise<string> {
  const buffer = fs.readFileSync(pdfPath)
  const data = await pdf(buffer)

  // Clean up extracted text:
  const cleanText = data.text
    .replace(/[ \t]{2,}/g, " ")   // Collapse runs of spaces and tabs (keeps newlines)
    .replace(/\n{3,}/g, "\n\n")   // Collapse 3+ newlines into one paragraph break
    .trim()

  return cleanText
}

// Feed to LLM (`llm` is a placeholder for whatever chat client you use):
const text = await extractForLLM("contract.pdf")
const response = await llm.chat({
  messages: [
    { role: "system", content: "Summarize this document." },
    { role: "user", content: text },
  ],
})

pdfjs-dist

pdfjs-dist (pdf.js) — full PDF rendering:

Text extraction

import { getDocument } from "pdfjs-dist"

const doc = await getDocument("document.pdf").promise

// Extract text from all pages:
const texts: string[] = []
for (let i = 1; i <= doc.numPages; i++) {
  const page = await doc.getPage(i)
  const content = await page.getTextContent()
  const pageText = content.items
    .map((item) => item.str)
    .join(" ")
  texts.push(pageText)
}

console.log(texts.join("\n\n"))

Render to canvas (browser)

import { getDocument } from "pdfjs-dist"

const doc = await getDocument(pdfUrl).promise
const page = await doc.getPage(1)

const scale = 1.5
const viewport = page.getViewport({ scale })

const canvas = document.getElementById("pdf-canvas") as HTMLCanvasElement
const context = canvas.getContext("2d")!
canvas.height = viewport.height
canvas.width = viewport.width

await page.render({
  canvasContext: context,
  viewport,
}).promise

Node.js canvas rendering

import fs from "node:fs"
import { getDocument } from "pdfjs-dist/legacy/build/pdf.mjs"
import { createCanvas } from "canvas"

const buffer = fs.readFileSync("document.pdf")

const doc = await getDocument({
  data: new Uint8Array(buffer),
  useSystemFonts: true,
}).promise

const page = await doc.getPage(1)
const viewport = page.getViewport({ scale: 2.0 })

const canvas = createCanvas(viewport.width, viewport.height)
const context = canvas.getContext("2d")

await page.render({
  canvasContext: context,
  viewport,
}).promise

// Save as PNG:
const png = canvas.toBuffer("image/png")
fs.writeFileSync("page-1.png", png)

Annotations

import { getDocument } from "pdfjs-dist"

const doc = await getDocument(pdfUrl).promise
const page = await doc.getPage(1)

// Get annotations (links, form fields, etc.):
const annotations = await page.getAnnotations()
for (const annotation of annotations) {
  if (annotation.subtype === "Link") {
    console.log(`Link: ${annotation.url}`)
  }
  if (annotation.subtype === "Widget") {
    console.log(`Form field: ${annotation.fieldName} = ${annotation.fieldValue}`)
  }
}

Structured text with positions

import { getDocument } from "pdfjs-dist"

const doc = await getDocument(buffer).promise
const page = await doc.getPage(1)
const content = await page.getTextContent()

// Each item has position and font info:
for (const item of content.items) {
  if ("str" in item) {
    console.log({
      text: item.str,
      x: item.transform[4],       // X position
      y: item.transform[5],       // Y position
      width: item.width,
      height: item.height,
      fontName: item.fontName,
    })
  }
}

// Useful for table extraction, layout analysis, etc.
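
As a rough sketch of what layout analysis can look like, the helper below buckets positioned items into lines by baseline and orders each line left to right. It is not part of pdfjs-dist: groupIntoLines, the PositionedItem interface, and the 2-unit tolerance are hypothetical names and values chosen for illustration, and real documents (rotated text, multiple columns) need more care.

```typescript
// Minimal shape of a pdf.js text item; only the fields used here (assumed subset).
interface PositionedItem {
  str: string
  transform: number[] // [a, b, c, d, x, y]
}

interface Line {
  y: number
  items: PositionedItem[]
}

// Items whose baselines (y) differ by less than `tolerance` share a line.
// Lines come out top to bottom (PDF y grows upward), items left to right.
function groupIntoLines(items: PositionedItem[], tolerance = 2): string[] {
  const lines: Line[] = []
  for (const item of items) {
    if (item.str.trim() === "") continue // skip whitespace-only items
    const y = item.transform[5]
    let target: Line | undefined
    for (const line of lines) {
      if (Math.abs(line.y - y) < tolerance) {
        target = line
        break
      }
    }
    if (target) target.items.push(item)
    else lines.push({ y, items: [item] })
  }
  return lines
    .sort((a, b) => b.y - a.y)
    .map((line) =>
      line.items
        .sort((a, b) => a.transform[4] - b.transform[4])
        .map((item) => item.str)
        .join(" ")
    )
}
```

In practice you would pass content.items filtered with the "str" in item check shown above.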

Feature Comparison

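| Feature | unpdf | pdf-parse | pdfjs-dist |
|---|---|---|---|
| Purpose | Modern text extraction | Simple text extraction | Full PDF rendering |
| API complexity | Low | Low | High |
| Text extraction | ✅ | ✅ | ✅ |
| Metadata | ✅ | ✅ | ✅ |
| Canvas rendering | ❌ | ❌ | ✅ |
| Annotations | | | ✅ |
| Text positions | ✅ (via proxy) | | ✅ |
| Edge runtime | ✅ | ❌ | |
| Browser support | ✅ | | ✅ |
| Node.js | ✅ | ✅ | ✅ |
| TypeScript | ✅ | ✅ (@types) | ✅ |
| Based on | pdf.js | pdf.js | pdf.js (original) |
| Weekly downloads | ~200K | ~2M | ~3M |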

When to Use Each

Use unpdf if:

  • Need PDF text extraction in edge runtimes or serverless
  • Want a modern, lightweight API for text + metadata
  • In the UnJS ecosystem
  • Need isomorphic PDF parsing (Node.js + browser + edge)

Use pdf-parse if:

  • Need the simplest possible PDF text extraction
  • Building a Node.js script or backend service
  • Processing PDFs for AI/LLM pipelines
  • Want minimal API surface (one function call)

Use pdfjs-dist if:

  • Need to render PDFs visually (browser PDF viewer)
  • Need text positions, annotations, or form fields
  • Building a PDF viewer component
  • Need the full feature set of Mozilla's pdf.js

Text Extraction Quality and Edge Cases

All three libraries use Mozilla's pdf.js engine internally, but their text extraction quality varies based on how they configure the underlying engine and post-process the output. The fundamental challenge is that PDF is a page-description format, not a semantic text format: text in a PDF is positioned absolutely on a page with no inherent reading order, no semantic paragraph breaks, and no guaranteed relationship between adjacent text chunks. What you get from getTextContent() is a flat list of positioned strings, and assembling them into coherent prose is a heuristic process.

pdf-parse's default behavior concatenates all text items with spaces, which works reasonably well for simple single-column documents but produces jumbled output for multi-column layouts, tables, and text that wraps around images. The custom pagerender function gives you control over this assembly, but writing a robust layout-aware text extractor requires understanding pdf.js's item structure (x/y coordinates, font size, transformation matrices). For AI/LLM pipelines where you just need "approximately correct text" rather than layout-preserving extraction, pdf-parse's default output is usually sufficient — language models are robust to ordering artifacts.

unpdf's extractText function and pdfjs-dist's raw getTextContent both draw on the same underlying item array, but unpdf's API is more ergonomic for common cases. When you need to go beyond simple text extraction (identifying headers by font size, detecting tables by alignment, or extracting text within a specific bounding box), you have to work with pdf.js's low-level item API, either through pdfjs-dist directly or through the document proxy that unpdf exposes. The item objects include transform (a 6-element matrix encoding position, scale, and rotation), width, height, and fontName, which together contain enough information to reconstruct approximate document structure. Libraries like pdf-lib (which handles PDF creation and modification) are complementary: pdfjs-dist for reading, pdf-lib for writing.
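
To make the "headers by font size" idea concrete, here is a minimal sketch. It assumes item.height is a usable proxy for font size, which holds for many but not all PDFs. The names SizedItem, bodyHeight, and findHeadings, and the 1.3 ratio, are hypothetical and not part of any of the three libraries.

```typescript
// Minimal shape of a pdf.js text item; only the fields used here (assumed subset).
interface SizedItem {
  str: string
  height: number // treated as a rough proxy for font size
}

// The most common rounded height is taken to be the body text size.
function bodyHeight(items: SizedItem[]): number {
  const counts: Record<number, number> = {}
  let best = 0
  let bestCount = 0
  for (const item of items) {
    const h = Math.round(item.height)
    counts[h] = (counts[h] ?? 0) + 1
    if (counts[h] > bestCount) {
      best = h
      bestCount = counts[h]
    }
  }
  return best
}

// Items noticeably taller than body text are reported as heading candidates.
function findHeadings(items: SizedItem[], ratio = 1.3): string[] {
  const body = bodyHeight(items)
  return items
    .filter((item) => item.str.trim() !== "" && item.height > body * ratio)
    .map((item) => item.str)
}
```

A heuristic like this misfires on documents that set headings in bold at body size, so fontName is worth checking as a second signal.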

Handling Encrypted, Scanned, and Malformed PDFs

All three libraries struggle with the same classes of problematic PDFs, and understanding these limitations prevents production surprises. Password-encrypted PDFs require passing the password to the parser: pdfjs-dist accepts { data: buffer, password: "secret" } in getDocument(), while pdf-parse v1.x exposes no password option. If the password is wrong or missing, pdf.js rejects with a password error rather than returning garbled text. PDFs encrypted with 256-bit AES (common in modern tools) are supported; older 40-bit RC4 encryption may not be.

Scanned PDFs — where the document is a series of images without any embedded text layer — return empty or near-empty text extraction results from all three libraries. This is expected: they're extracting the text layer that PDF stores as vector glyphs, not performing optical character recognition on the image pixels. For scanned documents, you need an OCR pipeline: tesseract.js for browser/Node.js OCR, or cloud services like Google Document AI or AWS Textract. A common production pattern is to attempt text extraction first (fast, free), check if the result is suspiciously short (under ~50 characters per page), and fall back to OCR only when the text layer is absent.
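
The "extract first, fall back to OCR" check can be sketched as a small pure function. This is an illustration under stated assumptions: needsOcr is a hypothetical helper name, the input is whatever per-page text array your extractor produced, and the 50-character threshold is the rough figure from the paragraph above, not a tuned value.

```typescript
// Returns true when the extracted text is too sparse to be a real text layer.
// Whitespace is ignored so that pages of blank lines do not count as content.
function needsOcr(pageTexts: string[], minCharsPerPage = 50): boolean {
  if (pageTexts.length === 0) return true
  let totalChars = 0
  for (const text of pageTexts) {
    totalChars += text.replace(/\s+/g, "").length
  }
  return totalChars / pageTexts.length < minCharsPerPage
}
```

Averaging across pages keeps a short cover page from triggering OCR on an otherwise text-rich document; a stricter pipeline could run the same check per page and OCR only the sparse ones.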

Malformed PDFs are more nuanced. pdf.js has historically been the most permissive parser (since it needs to handle whatever Firefox users encounter), and pdfjs-dist inherits this tolerance. pdf-parse and unpdf benefit from the same tolerance since they wrap pdf.js. Still, some PDFs produced by older tools or with corrupted cross-reference tables will fail entirely. A defensive wrapper that catches parse errors and returns a partial result (text from successfully parsed pages) is worth adding in production document processing pipelines.
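
One possible shape for such a defensive wrapper is sketched below. It is library-agnostic by design: extractPartial and PartialResult are hypothetical names, and getPageText is a callback you supply (for example, one that calls getPage and getTextContent on a pdf.js document). Note that it only helps when the document opens at all; a file that fails in getDocument() still needs its own try/catch.

```typescript
interface PartialResult {
  pages: (string | null)[] // null marks a page that failed to parse
  errors: { page: number; message: string }[]
}

// Extract pages one by one so a single bad page does not lose the rest.
// `getPageText` takes a 1-based page number and resolves to that page's text.
async function extractPartial(
  numPages: number,
  getPageText: (pageNumber: number) => Promise<string>
): Promise<PartialResult> {
  const result: PartialResult = { pages: [], errors: [] }
  for (let n = 1; n <= numPages; n++) {
    try {
      result.pages.push(await getPageText(n))
    } catch (err) {
      result.pages.push(null)
      result.errors.push({
        page: n,
        message: err instanceof Error ? err.message : String(err),
      })
    }
  }
  return result
}
```

Keeping a null placeholder for failed pages preserves page numbering, which matters if downstream code cites page references.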

Runtime Compatibility and Bundle Size Tradeoffs

The most significant practical difference between these libraries for modern JavaScript projects is their runtime compatibility. pdfjs-dist ships a legacy/build/pdf.mjs entry for Node.js environments that need canvas rendering, and a standard build/pdf.mjs for browser and edge environments. The full pdfjs-dist package is large (the worker script that does the actual parsing is itself over a megabyte), but this is mainly a concern if you're bundling it for browser delivery. On a server, bundle size matters far less and pdfjs-dist's capabilities justify its footprint.

unpdf is specifically designed for edge runtimes: it bundles a minimal version of pdf.js without canvas rendering capabilities (which aren't available in Workers or Deno anyway) and is tested against Cloudflare Workers and Vercel Edge Functions. If your use case is a serverless PDF extraction endpoint, unpdf's explicit edge-runtime support means you won't encounter the "this module requires Node.js APIs" errors that pdfjs-dist's legacy build throws in Workers. pdf-parse v1.x has a different problem: its entry file contains a debug code path that synchronously reads a bundled test PDF when the module believes it is being run directly, and that read fails in bundled or filesystem-less environments. This is a known footgun when deploying pdf-parse to serverless or edge environments; the common workaround is to import the inner module, pdf-parse/lib/pdf-parse.js, which skips that code path.

Handling Encrypted and Password-Protected PDFs

All three libraries have limited support for encrypted PDFs. pdfjs-dist handles password-protected PDFs by accepting a password option in the getDocument() call, the most complete implementation of the three. pdf-parse does not support password-protected PDFs and will throw on encrypted documents. unpdf inherits pdfjs-dist's decryption capability, but the extractText() API does not expose the password option directly; for encrypted files, open the document through unpdf's getDocumentProxy(), which exposes the underlying pdf.js document, and supply the password there.

For PDFs with owner passwords (which restrict printing/copying but not viewing), all three libraries can extract text successfully since PDF viewers handle this transparently. User passwords (which prevent opening) require explicit decryption with the password string, which only pdfjs-dist/unpdf support.
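
When a user password is missing or wrong, it helps to tell that case apart from a genuinely broken file. The sketch below assumes pdf.js reports password failures as an error named "PasswordException" whose code is 1 when a password is needed and 2 when the supplied one is incorrect; verify this against the pdf.js version you run. classifyOpenError and PdfOpenFailure are hypothetical names.

```typescript
type PdfOpenFailure = "needs-password" | "wrong-password" | "other"

// Inspect an error thrown while opening a PDF and sort it into a bucket.
// Assumption: password errors carry name "PasswordException" and a numeric code
// (1 = password required, 2 = incorrect password).
function classifyOpenError(err: unknown): PdfOpenFailure {
  if (typeof err === "object" && err !== null) {
    const e = err as { name?: unknown; code?: unknown }
    if (e.name === "PasswordException") {
      return e.code === 2 ? "wrong-password" : "needs-password"
    }
  }
  return "other"
}
```

A caller would wrap the open call in try/catch and use the result to decide whether to prompt for a password, retry, or route the file to an error queue.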

Methodology

Download data from npm registry (weekly average, February 2026). Feature comparison based on unpdf v0.11.x, pdf-parse v1.x, and pdfjs-dist v4.x.
