<!-- PkgPulse AI-readable guide source -->
<!-- Canonical: https://www.pkgpulse.com/guides/unpdf-vs-pdf-parse-vs-pdfjs-dist-pdf-2026 -->
<!-- Raw Markdown: https://www.pkgpulse.com/guides/unpdf-vs-pdf-parse-vs-pdfjs-dist-pdf-2026/raw.md -->
<!-- Source path: content/guides/unpdf-vs-pdf-parse-vs-pdfjs-dist-pdf-2026.mdx -->

---
og_image: "/images/guides/unpdf-vs-pdf-parse-vs-pdfjs-dist-pdf-2026.webp"
title: "unpdf vs pdf-parse vs pdfjs-dist 2026"
description: "unpdf, pdf-parse, and pdfjs-dist compared for PDF text extraction in Node.js in 2026. Edge runtime support, metadata reading, and page rendering capabilities."
date: "2026-03-09"
tier: 2
authors: ["team"]
tags: ["nodejs", "typescript", "pdf", "developer-tools"]
---

## TL;DR

**unpdf** is the UnJS PDF extraction library: a lightweight wrapper around pdf.js that extracts text and metadata and runs in Node.js, edge runtimes, and browsers. **pdf-parse** is a simple PDF text extractor that wraps pdf.js behind a minimal API and returns text, metadata, and page count; it is the most widely downloaded extraction-only PDF parser for Node.js. **pdfjs-dist** (pdf.js) is Mozilla's full PDF renderer: it renders PDFs to canvas, extracts text, handles annotations, and powers Firefox's PDF viewer. In 2026: choose unpdf for modern or edge PDF extraction, pdf-parse for simple Node.js text extraction, and pdfjs-dist for full rendering and browser display.

## Key Takeaways

- **unpdf**: ~200K weekly downloads — UnJS, edge-compatible, text + metadata extraction
- **pdf-parse**: ~2M weekly downloads — simple API, text + info + page count, Node.js
- **pdfjs-dist**: ~3M weekly downloads — Mozilla, full PDF rendering, canvas support
- unpdf and pdf-parse wrap pdf.js; pdfjs-dist is the pdf.js distribution itself, so the three differ mainly in abstraction level
- unpdf and pdf-parse focus on text extraction; pdfjs-dist does everything
- For AI/LLM pipelines, any of these work for extracting text from PDFs

---

## unpdf

[unpdf](https://github.com/unjs/unpdf) — modern PDF extraction:

### Basic text extraction

```typescript
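import { readFile } from "node:fs/promises"
import { extractText } from "unpdf"

// Simple text extraction:
const buffer = await readFile("document.pdf")

// mergePages: true returns one string; without it, `text` is an array with one entry per page
const { text, totalPages } = await extractText(new Uint8Array(buffer), {
  mergePages: true,
})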

console.log(`Pages: ${totalPages}`)
console.log(text)
// → Full text content of the PDF
```

### Document proxy (advanced)

```typescript
import { readFile } from "node:fs/promises"
import { getDocumentProxy } from "unpdf"

// Get full document access:
const buffer = await readFile("report.pdf")
const pdf = await getDocumentProxy(new Uint8Array(buffer))

console.log(`Pages: ${pdf.numPages}`)

// Extract text from a specific page (page numbers are 1-based):
const page = await pdf.getPage(1)
const content = await page.getTextContent()
const pageText = content.items
  // Marked-content items have no `str`, so map them to an empty string
  .map((item) => ("str" in item ? item.str : ""))
  .join(" ")

console.log(pageText)
```

### Metadata extraction

```typescript
import { getDocumentProxy } from "unpdf"

// `buffer` holds the PDF bytes, read from disk as in the earlier examples
const pdf = await getDocumentProxy(new Uint8Array(buffer))
const metadata = await pdf.getMetadata()

console.log(metadata.info)
// → {
//   Title: "Annual Report 2026",
//   Author: "PkgPulse Team",
//   Creator: "Google Docs",
//   Producer: "Skia/PDF m120",
//   CreationDate: "D:20260309...",
// }
```

### Edge runtime / serverless

```typescript
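// unpdf works in edge runtimes (Cloudflare Workers, Vercel Edge):
import { extractText } from "unpdf"

export default {
  async fetch(request: Request): Promise<Response> {
    const formData = await request.formData()
    const file = formData.get("pdf")
    if (!(file instanceof File)) {
      // Reject requests that did not upload a file in the `pdf` field
      return new Response("Expected a PDF upload in the `pdf` field", { status: 400 })
    }
    const buffer = await file.arrayBuffer()

    // mergePages: true returns the whole document as one string
    const { text, totalPages } = await extractText(
      new Uint8Array(buffer),
      { mergePages: true }
    )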

    return Response.json({ text, totalPages })
  },
}
```

---

## pdf-parse

[pdf-parse](https://www.npmjs.com/package/pdf-parse) — simple text extraction:

### Basic usage

```typescript
import fs from "node:fs"
import pdf from "pdf-parse"

const buffer = fs.readFileSync("document.pdf")
const data = await pdf(buffer)

console.log(data.numpages)    // Number of pages
console.log(data.numrender)   // Number of rendered pages
console.log(data.info)        // Document info dictionary (Title, Author, ...)
console.log(data.metadata)    // XMP metadata (XML), if the PDF has any
console.log(data.version)     // Version of the bundled pdf.js used for parsing
console.log(data.text)        // All text content
```

### With options

```typescript
import pdf from "pdf-parse"

const data = await pdf(buffer, {
  // Limit pages to parse:
  max: 10,  // Only first 10 pages

  // Custom page render function:
  pagerender(pageData) {
    const textContent = pageData.getTextContent()
    return textContent.then((content) => {
      return content.items
        .map((item) => item.str)
        .join(" ")
    })
  },
})

console.log(data.text)
```

### Extract text per page

```typescript
import pdf from "pdf-parse"

const pages: string[] = []

const data = await pdf(buffer, {
  pagerender(pageData) {
    return pageData.getTextContent().then((content) => {
      const pageText = content.items
        .map((item) => item.str)
        .join(" ")
      pages.push(pageText)
      return pageText
    })
  },
})

// pages[0] = text from page 1
// pages[1] = text from page 2
// etc.
```

### Common use: AI/LLM document processing

```typescript
import fs from "node:fs"
import pdf from "pdf-parse"

async function extractForLLM(pdfPath: string): Promise<string> {
  const buffer = fs.readFileSync(pdfPath)
  const data = await pdf(buffer)

  // Clean up extracted text:
  const cleanText = data.text
    .replace(/[ \t]{2,}/g, " ")   // Collapse runs of spaces and tabs (not newlines)
    .replace(/\n{3,}/g, "\n\n")   // Collapse 3+ newlines, keeping paragraph breaks
    .trim()

  return cleanText
}

// Feed to LLM (`llm` is a placeholder for your LLM client):
const text = await extractForLLM("contract.pdf")
const response = await llm.chat({
  messages: [
    { role: "system", content: "Summarize this document." },
    { role: "user", content: text },
  ],
})
```

---

## pdfjs-dist

[pdfjs-dist](https://github.com/mozilla/pdf.js) (pdf.js) — full PDF rendering:

### Text extraction

```typescript
import { getDocument } from "pdfjs-dist"

const doc = await getDocument("document.pdf").promise

// Extract text from all pages:
const texts: string[] = []
for (let i = 1; i <= doc.numPages; i++) {
  const page = await doc.getPage(i)
  const content = await page.getTextContent()
  const pageText = content.items
    // Marked-content items have no `str`, so map them to an empty string
    .map((item) => ("str" in item ? item.str : ""))
    .join(" ")
  texts.push(pageText)
}

console.log(texts.join("\n\n"))
```

### Render to canvas (browser)

```typescript
import { getDocument } from "pdfjs-dist"

const doc = await getDocument(pdfUrl).promise
const page = await doc.getPage(1)

const scale = 1.5
const viewport = page.getViewport({ scale })

const canvas = document.getElementById("pdf-canvas") as HTMLCanvasElement
const context = canvas.getContext("2d")!
canvas.height = viewport.height
canvas.width = viewport.width

await page.render({
  canvasContext: context,
  viewport,
}).promise
```

### Node.js canvas rendering

```typescript
import fs from "node:fs"
import { getDocument } from "pdfjs-dist/legacy/build/pdf.mjs"
import { createCanvas } from "canvas"

const buffer = fs.readFileSync("document.pdf")
const doc = await getDocument({
  data: new Uint8Array(buffer),
  useSystemFonts: true,
}).promise

const page = await doc.getPage(1)
const viewport = page.getViewport({ scale: 2.0 })

const canvas = createCanvas(viewport.width, viewport.height)
const context = canvas.getContext("2d")

await page.render({
  // node-canvas's context type differs from the DOM's, so TypeScript needs a cast
  canvasContext: context as unknown as CanvasRenderingContext2D,
  viewport,
}).promise

// Save as PNG:
const png = canvas.toBuffer("image/png")
fs.writeFileSync("page-1.png", png)
```

### Annotations and links

```typescript
import { getDocument } from "pdfjs-dist"

const doc = await getDocument(pdfUrl).promise
const page = await doc.getPage(1)

// Get annotations (links, form fields, etc.):
const annotations = await page.getAnnotations()
for (const annotation of annotations) {
  if (annotation.subtype === "Link") {
    console.log(`Link: ${annotation.url}`)
  }
  if (annotation.subtype === "Widget") {
    console.log(`Form field: ${annotation.fieldName} = ${annotation.fieldValue}`)
  }
}
```

### Structured text with positions

```typescript
import { getDocument } from "pdfjs-dist"

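// Pass binary data as a Uint8Array; recent pdf.js versions reject a Node.js Buffer
const doc = await getDocument({ data: new Uint8Array(buffer) }).promise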
const page = await doc.getPage(1)
const content = await page.getTextContent()

// Each item has position and font info:
for (const item of content.items) {
  if ("str" in item) {
    console.log({
      text: item.str,
      x: item.transform[4],       // X position
      y: item.transform[5],       // Y position
      width: item.width,
      height: item.height,
      fontName: item.fontName,
    })
  }
}

// Useful for table extraction, layout analysis, etc.
```

---

## Feature Comparison

| Feature | unpdf | pdf-parse | pdfjs-dist |
|---------|-------|----------|-----------|
| Purpose | Modern text extraction | Simple text extraction | Full PDF rendering |
| API complexity | Low | Low | High |
| Text extraction | ✅ | ✅ | ✅ |
| Metadata | ✅ | ✅ | ✅ |
| Canvas rendering | ❌ | ❌ | ✅ |
| Annotations | ❌ | ❌ | ✅ |
| Text positions | ✅ (via proxy) | ❌ | ✅ |
| Edge runtime | ✅ | ❌ | ❌ |
| Browser support | ✅ | ❌ | ✅ |
| Node.js | ✅ | ✅ | ✅ |
| TypeScript | ✅ | ✅ (@types) | ✅ |
| Based on | pdf.js | pdf.js | pdf.js (original) |
| Weekly downloads | ~200K | ~2M | ~3M |

---

## When to Use Each

**Use unpdf if:**
- Need PDF text extraction in edge runtimes or serverless
- Want a modern, lightweight API for text + metadata
- In the UnJS ecosystem
- Need isomorphic PDF parsing (Node.js + browser + edge)

**Use pdf-parse if:**
- Need the simplest possible PDF text extraction
- Building a Node.js script or backend service
- Processing PDFs for AI/LLM pipelines
- Want minimal API surface (one function call)

**Use pdfjs-dist if:**
- Need to render PDFs visually (browser PDF viewer)
- Need text positions, annotations, or form fields
- Building a PDF viewer component
- Need the full feature set of Mozilla's pdf.js

---

## Text Extraction Quality and Edge Cases

All three libraries use Mozilla's pdf.js engine internally, but their text extraction quality varies based on how they configure the underlying engine and post-process the output. The fundamental challenge is that PDF is a page-description format, not a semantic text format: text in a PDF is positioned absolutely on a page with no inherent reading order, no semantic paragraph breaks, and no guaranteed relationship between adjacent text chunks. What you get from `getTextContent()` is a flat list of positioned strings, and assembling them into coherent prose is a heuristic process.
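To make that heuristic concrete, here is a minimal sketch of one common approach: bucket items into lines by their y coordinate, then order each line left to right. It works on plain objects shaped like pdf.js text items, so it does not depend on any of the three libraries. The names `PositionedText`, `itemsToLines`, and the `yTolerance` parameter are illustrative assumptions, not part of pdf.js, and the approach assumes a single-column page.

```typescript
// Minimal shape of a pdf.js text item; only the fields used here.
interface PositionedText {
  str: string
  transform: number[] // [a, b, c, d, x, y] in PDF user space
}

const xOf = (item: PositionedText): number => item.transform[4] ?? 0
const yOf = (item: PositionedText): number => item.transform[5] ?? 0

// Group items into lines by y coordinate, then order each line left to right.
// `yTolerance` is a hypothetical tuning knob, not a pdf.js setting.
function itemsToLines(items: PositionedText[], yTolerance = 2): string[] {
  // PDF y grows upward, so sort descending to read top to bottom.
  const sorted = [...items].sort((a, b) => yOf(b) - yOf(a))

  const lines: { y: number; items: PositionedText[] }[] = []
  for (const item of sorted) {
    const last = lines[lines.length - 1]
    if (last !== undefined && Math.abs(last.y - yOf(item)) <= yTolerance) {
      last.items.push(item) // close enough vertically: same line
    } else {
      lines.push({ y: yOf(item), items: [item] }) // start a new line
    }
  }

  return lines.map((line) =>
    [...line.items]
      .sort((a, b) => xOf(a) - xOf(b))
      .map((item) => item.str)
      .join(" ")
      .trim()
  )
}

// Items deliberately listed out of reading order.
const sample: PositionedText[] = [
  { str: "world", transform: [1, 0, 0, 1, 120, 700] },
  { str: "Second line", transform: [1, 0, 0, 1, 72, 680] },
  { str: "Hello", transform: [1, 0, 0, 1, 72, 700.5] },
]

console.log(itemsToLines(sample))
// → [ 'Hello world', 'Second line' ]
```

Multi-column pages need an extra step (splitting by x ranges before grouping), which is where most of the real complexity lives.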

pdf-parse's default page renderer joins text items in content-stream order and starts a new line whenever the y-coordinate changes. That works reasonably well for simple single-column documents but produces jumbled output for multi-column layouts, tables, and text that wraps around images. The custom `pagerender` function gives you control over this assembly, but a robust layout-aware extractor requires understanding pdf.js's item structure (x/y coordinates, font size, transformation matrices). For AI/LLM pipelines that need approximately correct text rather than layout-preserving extraction, pdf-parse's default output is usually sufficient, because language models tolerate most ordering artifacts.

unpdf's `extractText` is built on the same `getTextContent` item array that pdfjs-dist exposes directly, but its API is more ergonomic for common cases. When you need to go beyond simple text extraction (identifying headers by font size, detecting tables by alignment, or extracting text within a specific bounding box), you have to work with the items themselves, either through pdfjs-dist or through the document proxy that unpdf returns. The item objects include `transform` (a 6-element matrix encoding position, scale, and rotation), `width`, `height`, and `fontName`, which together contain enough information to reconstruct approximate document structure. Libraries like `pdf-lib` (which handles PDF creation and modification) are complementary: pdfjs-dist for reading, pdf-lib for writing.
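As a rough illustration of two of those tasks, the sketch below filters items by bounding box and flags likely headings by comparing each item's height with the page median. It assumes items shaped like pdf.js text items and treats `height` as a proxy for font size; `TextItemLike`, `Box`, `textInBox`, `likelyHeadings`, and the `ratio` threshold are hypothetical names chosen for this example.

```typescript
// Subset of the pdf.js text item fields used in this sketch.
interface TextItemLike {
  str: string
  transform: number[] // [a, b, c, d, x, y]
  width: number
  height: number
}

// Bounding box in PDF user space (origin at bottom-left, y grows upward).
interface Box {
  left: number
  bottom: number
  right: number
  top: number
}

// Keep only items that lie entirely inside the box.
function textInBox(items: TextItemLike[], box: Box): string[] {
  return items
    .filter((item) => {
      const x = item.transform[4] ?? 0
      const y = item.transform[5] ?? 0
      return (
        x >= box.left &&
        x + item.width <= box.right &&
        y >= box.bottom &&
        y + item.height <= box.top
      )
    })
    .map((item) => item.str)
}

// Flag items noticeably taller than the median item as likely headings.
// `ratio` is an assumed threshold; tune it per document set.
function likelyHeadings(items: TextItemLike[], ratio = 1.3): string[] {
  const heights = items
    .map((item) => item.height)
    .filter((h) => h > 0)
    .sort((a, b) => a - b)
  if (heights.length === 0) return []
  const median = heights[Math.floor(heights.length / 2)] ?? 0
  return items.filter((item) => item.height >= median * ratio).map((item) => item.str)
}

const pageItems: TextItemLike[] = [
  { str: "Quarterly Report", transform: [24, 0, 0, 24, 72, 740], width: 200, height: 24 },
  { str: "Revenue grew.", transform: [12, 0, 0, 12, 72, 680], width: 90, height: 12 },
  { str: "Costs fell.", transform: [12, 0, 0, 12, 72, 660], width: 70, height: 12 },
  { str: "Footer note", transform: [12, 0, 0, 12, 72, 500], width: 80, height: 12 },
]

console.log(likelyHeadings(pageItems))
// → [ 'Quarterly Report' ]
console.log(textInBox(pageItems, { left: 0, bottom: 600, right: 612, top: 700 }))
// → [ 'Revenue grew.', 'Costs fell.' ]
```

Rotated or skewed text would need the full `transform` matrix rather than just the translation components, which this sketch ignores.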

## Handling Encrypted, Scanned, and Malformed PDFs

All three libraries struggle with the same classes of problematic PDFs, and understanding these limitations prevents production surprises. Password-encrypted PDFs require passing the password to the parser: pdfjs-dist accepts `{ data: buffer, password: "secret" }` in `getDocument()`, whereas pdf-parse v1.x exposes no password option. If the password is wrong or missing, loading fails with a password error rather than returning garbled text. pdf.js implements the standard PDF security handler, so both AES-encrypted files (common in modern tools) and older RC4-encrypted files are generally readable, though it is worth testing against your own documents.
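One way to handle those failures is to classify the rejection before deciding whether to prompt for a password. The sketch below assumes the error follows pdf.js's `PasswordException` shape (a `name` of `"PasswordException"` plus a numeric `code` matching its `PasswordResponses` values); verify this against the pdf.js version you ship. `classifyLoadError` and `LoadFailure` are hypothetical names for this example.

```typescript
// Numeric codes assumed to mirror pdf.js's PasswordResponses enum.
const NEED_PASSWORD = 1
const INCORRECT_PASSWORD = 2

type LoadFailure = "needs-password" | "wrong-password" | "other"

// Inspect a rejected getDocument() promise without importing pdf.js types.
function classifyLoadError(err: unknown): LoadFailure {
  if (typeof err !== "object" || err === null) return "other"
  const { name, code } = err as { name?: unknown; code?: unknown }
  if (name !== "PasswordException") return "other"
  if (code === NEED_PASSWORD) return "needs-password"
  if (code === INCORRECT_PASSWORD) return "wrong-password"
  return "other"
}

// Simulated errors standing in for real rejections:
console.log(classifyLoadError({ name: "PasswordException", code: 1 }))
// → needs-password
console.log(classifyLoadError({ name: "PasswordException", code: 2 }))
// → wrong-password
console.log(classifyLoadError(new Error("Invalid PDF structure")))
// → other
```

In a real pipeline you would call this inside the `catch` around `getDocument({ data, password }).promise` and retry only on the two password cases.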

Scanned PDFs — where the document is a series of images without any embedded text layer — return empty or near-empty text extraction results from all three libraries. This is expected: they're extracting the text layer that PDF stores as vector glyphs, not performing optical character recognition on the image pixels. For scanned documents, you need an OCR pipeline: `tesseract.js` for browser/Node.js OCR, or cloud services like Google Document AI or AWS Textract. A common production pattern is to attempt text extraction first (fast, free), check if the result is suspiciously short (under ~50 characters per page), and fall back to OCR only when the text layer is absent.
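The fallback check described above can be sketched as a small pure function. It takes per-page text (from any of the three libraries) and reports whether the average falls under a threshold. The function name `needsOcr` and the decision to ignore whitespace when counting are assumptions of this example; the default of 50 characters per page comes from the rule of thumb in the text.

```typescript
// Decide whether extracted text is too sparse to trust, suggesting a scanned PDF.
// `pageTexts` holds one string per page; `minCharsPerPage` is a tunable threshold.
function needsOcr(pageTexts: string[], minCharsPerPage = 50): boolean {
  // No pages extracted at all: treat as needing OCR.
  if (pageTexts.length === 0) return true

  // Count non-whitespace characters so blank padding doesn't inflate the total.
  const totalChars = pageTexts.reduce(
    (sum, text) => sum + text.replace(/\s+/g, "").length,
    0
  )
  return totalChars / pageTexts.length < minCharsPerPage
}

const scannedLike = ["", "  \n ", "3"]
const textLayer = [
  "This page carries a real text layer with comfortably more than fifty characters.",
]

console.log(needsOcr(scannedLike)) // → true
console.log(needsOcr(textLayer))   // → false
```

A per-page variant (flagging individual pages rather than the whole document) is a natural extension for mixed documents that combine scanned and digital pages.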

Malformed PDFs are more nuanced. pdf.js has historically been the most permissive parser (since it needs to handle whatever Firefox users encounter), and pdfjs-dist inherits this tolerance. pdf-parse and unpdf benefit from the same tolerance since they wrap pdf.js. Still, some PDFs produced by older tools or with corrupted cross-reference tables will fail entirely. A defensive wrapper that catches parse errors and returns a partial result (text from successfully parsed pages) is worth adding in production document processing pipelines.
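A defensive wrapper along those lines might look like the following sketch. It is library-agnostic: the caller supplies a `loadPageText` callback (hypothetical, for example one that wraps `getPage()` and `getTextContent()`), and `Promise.allSettled` keeps one bad page from discarding the rest. `PartialExtraction`, `collectPartial`, and `extractPagesSafely` are names invented for this example.

```typescript
interface PartialExtraction {
  pages: (string | null)[] // null marks a page that failed to parse
  failedPages: number[]    // 1-based page numbers, matching pdf.js conventions
}

// Turn settled per-page results into text plus a list of failures.
function collectPartial(results: PromiseSettledResult<string>[]): PartialExtraction {
  const pages: (string | null)[] = []
  const failedPages: number[] = []
  results.forEach((result, index) => {
    if (result.status === "fulfilled") {
      pages.push(result.value)
    } else {
      pages.push(null)
      failedPages.push(index + 1)
    }
  })
  return { pages, failedPages }
}

// `loadPageText` is supplied by the caller and may reject for damaged pages.
async function extractPagesSafely(
  numPages: number,
  loadPageText: (pageNumber: number) => Promise<string>
): Promise<PartialExtraction> {
  const tasks = Array.from({ length: numPages }, (_, i) =>
    // Promise.resolve().then(...) also captures synchronous throws
    Promise.resolve().then(() => loadPageText(i + 1))
  )
  return collectPartial(await Promise.allSettled(tasks))
}

// Demo with a fake loader in which page 2 is corrupt.
extractPagesSafely(3, async (n) => {
  if (n === 2) throw new Error("corrupt page")
  return `text of page ${n}`
}).then((result) => console.log(result))
```

This only covers per-page failures. If the document cannot be opened at all (for example, a destroyed cross-reference table), the surrounding `try`/`catch` around the load call still has to handle it.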

## Runtime Compatibility and Bundle Size Tradeoffs

The most significant practical difference between these libraries for modern JavaScript projects is their runtime compatibility. pdfjs-dist ships a `legacy/build/pdf.mjs` entry for Node.js and older environments, and a standard `build/pdf.mjs` aimed at modern browsers. The full pdfjs-dist package is large (the worker script alone is on the order of a megabyte), but this is only a concern if you're bundling it for browser delivery. On a server, bundle size rarely matters and pdfjs-dist's capabilities justify its footprint.

unpdf is specifically designed for edge runtimes: it bundles a trimmed build of pdf.js without canvas rendering (which isn't available in Workers or Deno anyway) and is tested against Cloudflare Workers and Vercel Edge Functions. If your use case is a serverless PDF extraction endpoint, unpdf's explicit edge-runtime support means you won't encounter the "this module requires Node.js APIs" errors that pdfjs-dist's legacy build throws in Workers. pdf-parse has a different problem: its entry point (`index.js`) contains a debug block that synchronously reads a bundled test PDF whenever it decides it is not being loaded by a parent module. That check famously misfires under some bundlers and in environments without a real filesystem, which makes it a known footgun when deploying pdf-parse to serverless or edge targets. The usual workaround is to import the implementation directly from `pdf-parse/lib/pdf-parse.js`, which skips the debug block.

## Handling Encrypted and Password-Protected PDFs

All three libraries have limited support for encrypted PDFs. pdfjs-dist handles password-protected PDFs by accepting a `password` option in the `getDocument()` call, which makes it the most complete implementation of the three. pdf-parse does not expose a password option and will throw on documents that need one. unpdf inherits pdfjs-dist's decryption capability, but `extractText()` has no password parameter of its own; for encrypted files, open the document with `getDocumentProxy()`, passing `password` in its options object (which is forwarded to pdf.js), and hand the resulting proxy to `extractText()`.

For PDFs with owner passwords (which restrict printing/copying but not viewing), all three libraries can extract text successfully since PDF viewers handle this transparently. User passwords (which prevent opening) require explicit decryption with the password string, which only pdfjs-dist/unpdf support.

## Methodology

Download data from npm registry (weekly average, February 2026). Feature comparison based on unpdf v0.11.x, pdf-parse v1.x, and pdfjs-dist v4.x.

*[Compare PDF tooling and developer utilities on PkgPulse →](https://www.pkgpulse.com)*

*See also: [cac vs meow vs arg 2026](/guides/cac-vs-meow-vs-arg-lightweight-cli-argument-parsers-2026), [Ink vs @clack/prompts vs Enquirer](/guides/ink-vs-clack-vs-enquirer-interactive-cli-nodejs-2026), and [acorn vs @babel/parser vs espree](/guides/acorn-vs-babel-parser-vs-espree-javascript-ast-parsers-2026).*