unpdf vs pdf-parse vs pdf.js: PDF Parsing and Text Extraction in Node.js (2026)

PkgPulse Team

TL;DR

unpdf is the UnJS PDF extraction library: a lightweight wrapper around pdf.js that extracts text and metadata and runs in Node.js, edge runtimes, and browsers. pdf-parse is the simple PDF text extractor: it wraps pdf.js behind a minimal API that returns text, metadata, and page count, and it is the most popular Node.js PDF parser. pdfjs-dist (pdf.js) is Mozilla's full PDF renderer: it renders PDFs to canvas, extracts text, handles annotations, and powers Firefox's PDF viewer. In 2026, choose unpdf for modern or edge PDF extraction, pdf-parse for simple Node.js text extraction, and pdfjs-dist for full rendering and browser display.

Key Takeaways

  • unpdf: ~200K weekly downloads — UnJS, edge-compatible, text + metadata extraction
  • pdf-parse: ~2M weekly downloads — simple API, text + info + page count, Node.js
  • pdfjs-dist: ~3M weekly downloads — Mozilla, full PDF rendering, canvas support
  • All three are built on pdf.js (pdfjs-dist is pdf.js itself); they differ in abstraction level
  • unpdf and pdf-parse focus on text extraction; pdfjs-dist does everything
  • For AI/LLM pipelines, any of these work for extracting text from PDFs

unpdf

unpdf — modern PDF extraction:

Basic text extraction

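import { readFile } from "node:fs/promises"
import { extractText } from "unpdf"

// Simple text extraction:
const buffer = await readFile("document.pdf")
// mergePages: true returns a single string; without it, text is an array with one entry per page
const { text, totalPages } = await extractText(new Uint8Array(buffer), {
  mergePages: true,
})

console.log(`Pages: ${totalPages}`)
console.log(text)
// → Full text content of the PDF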

Document proxy (advanced)

import { readFile } from "node:fs/promises"
import { getDocumentProxy } from "unpdf"

// Get full document access:
const buffer = await readFile("report.pdf")
const pdf = await getDocumentProxy(new Uint8Array(buffer))

console.log(`Pages: ${pdf.numPages}`)

// Extract text from a specific page (page numbers start at 1):
const page = await pdf.getPage(1)
const content = await page.getTextContent()
const pageText = content.items
  .map((item) => ("str" in item ? item.str : "")) // marked-content items have no str
  .join(" ")

console.log(pageText)

Metadata extraction

import { getDocumentProxy } from "unpdf"

// `buffer` holds the PDF bytes, loaded as in the previous examples
const pdf = await getDocumentProxy(new Uint8Array(buffer))
const metadata = await pdf.getMetadata()

console.log(metadata.info)
// → {
//   Title: "Annual Report 2026",
//   Author: "PkgPulse Team",
//   Creator: "Google Docs",
//   Producer: "Skia/PDF m120",
//   CreationDate: "D:20260309...",
// }
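
The `CreationDate` value above uses the PDF date string format (`D:YYYYMMDDHHmmSS` plus an optional timezone offset), not ISO 8601. The following is a minimal sketch of converting such a string into a `Date`. `parsePdfDate` is a hypothetical helper written for this article, not part of unpdf, and it treats a missing offset as UTC, which is an assumption.

```typescript
// Hypothetical helper (not part of unpdf): converts a PDF date string such as
// "D:20260309143000+02'00'" into a JavaScript Date.
// Every field after the year is optional; missing fields fall back to defaults.
function parsePdfDate(value: string): Date | null {
  const pattern =
    /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?([Zz+-])?(?:(\d{2})'?)?(?:(\d{2})'?)?/
  const match = pattern.exec(value.trim())
  if (!match) return null

  const [
    ,
    year,
    month = "01",
    day = "01",
    hour = "00",
    minute = "00",
    second = "00",
    sign,
    tzHour = "00",
    tzMinute = "00",
  ] = match

  // Read the fields as UTC first, then shift by the declared offset.
  const utcMillis = Date.UTC(+year, +month - 1, +day, +hour, +minute, +second)
  if (sign !== "+" && sign !== "-") {
    return new Date(utcMillis) // "Z" or no offset: assumed to be UTC
  }

  const offsetMillis = (+tzHour * 60 + +tzMinute) * 60_000
  return new Date(sign === "+" ? utcMillis - offsetMillis : utcMillis + offsetMillis)
}
```

A call such as `parsePdfDate(metadata.info.CreationDate)` would return `null` for strings that do not start with a four-digit year, so callers can fall back to the raw value.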

Edge runtime / serverless

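import { extractText } from "unpdf"

// unpdf works in edge runtimes (Cloudflare Workers, Vercel Edge):
export default {
  async fetch(request: Request): Promise<Response> {
    const formData = await request.formData()
    const file = formData.get("pdf")
    // formData.get() can return null or a string, so check before reading bytes
    if (!(file instanceof File)) {
      return new Response("Expected a file in the 'pdf' field", { status: 400 })
    }
    const buffer = await file.arrayBuffer()

    const { text, totalPages } = await extractText(
      new Uint8Array(buffer)
    )

    return Response.json({ text, totalPages })
  },
}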
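
An upload field can carry any kind of file, so it may be worth checking the bytes before handing them to the parser. The sketch below tests for the `%PDF-` header that PDF files start with. `looksLikePdf` is a hypothetical helper, not an unpdf API, and it assumes the header sits at byte 0 (some lenient readers accept leading junk).

```typescript
// Hypothetical guard (not an unpdf API): true when the bytes begin with "%PDF-".
function looksLikePdf(bytes: Uint8Array): boolean {
  const header = [0x25, 0x50, 0x44, 0x46, 0x2d] // ASCII codes for "%PDF-"
  return (
    bytes.length >= header.length &&
    header.every((byte, index) => bytes[index] === byte)
  )
}
```

In a handler like the one above, this check could run on `new Uint8Array(buffer)` and return a 400 response before `extractText` is called.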

pdf-parse

pdf-parse — simple text extraction:

Basic usage

import fs from "node:fs"
import pdf from "pdf-parse"

const buffer = fs.readFileSync("document.pdf")
const data = await pdf(buffer)

console.log(data.numpages)    // Number of pages
console.log(data.numrender)   // Number of rendered pages
console.log(data.info)        // Document info (Title, Author, ...)
console.log(data.metadata)    // XMP metadata (XML), if present
console.log(data.version)     // pdf.js version used by pdf-parse
console.log(data.text)        // All text content

With options

import pdf from "pdf-parse"

const data = await pdf(buffer, {
  // Limit pages to parse:
  max: 10,  // Only first 10 pages

  // Custom page render function:
  pagerender(pageData) {
    const textContent = pageData.getTextContent()
    return textContent.then((content) => {
      return content.items
        .map((item) => item.str)
        .join(" ")
    })
  },
})

console.log(data.text)

Extract text per page

import pdf from "pdf-parse"

const pages: string[] = []

const data = await pdf(buffer, {
  pagerender(pageData) {
    return pageData.getTextContent().then((content) => {
      const pageText = content.items
        .map((item) => item.str)
        .join(" ")
      pages.push(pageText)
      return pageText
    })
  },
})

// pages[0] = text from page 1
// pages[1] = text from page 2
// etc.

Common use: AI/LLM document processing

import fs from "node:fs"
import pdf from "pdf-parse"

async function extractForLLM(pdfPath: string): Promise<string> {
  const buffer = fs.readFileSync(pdfPath)
  const data = await pdf(buffer)

  // Clean up extracted text:
  const cleanText = data.text
    .replace(/[ \t]{2,}/g, " ")  // Collapse runs of spaces/tabs (\s would also eat newlines)
    .replace(/\n{3,}/g, "\n\n")  // Collapse 3+ newlines into one blank line
    .trim()

  return cleanText
}

// Feed to LLM (`llm` is a placeholder for your chat client):
const text = await extractForLLM("contract.pdf")
const response = await llm.chat({
  messages: [
    { role: "system", content: "Summarize this document." },
    { role: "user", content: text },
  ],
})
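
Long documents can exceed a model's context window, so extracted text is often split before it is sent. Here is a minimal sketch of paragraph-aligned chunking, assuming a character budget as a rough stand-in for tokens. `chunkText` is a hypothetical helper, not part of pdf-parse.

```typescript
// Hypothetical helper: splits text into chunks of at most maxChars characters,
// breaking on blank lines (paragraph boundaries) where the text has them.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive")

  const paragraphs = text
    .split(/\n{2,}/)
    .map((paragraph) => paragraph.trim())
    .filter(Boolean)

  const chunks: string[] = []
  let current = ""

  for (const paragraph of paragraphs) {
    // Hard-split any single paragraph that is longer than the budget.
    for (let start = 0; start < paragraph.length; start += maxChars) {
      const piece = paragraph.slice(start, start + maxChars)
      const candidate = current ? `${current}\n\n${piece}` : piece
      if (candidate.length <= maxChars) {
        current = candidate // Still fits: keep accumulating
      } else {
        if (current) chunks.push(current)
        current = piece // Start a new chunk with this piece
      }
    }
  }

  if (current) chunks.push(current)
  return chunks
}
```

Each chunk can then be summarized separately and the partial summaries combined. A token-based budget would be more precise, but it depends on the tokenizer of the model in use.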

pdfjs-dist

pdfjs-dist (pdf.js) — full PDF rendering:

Text extraction

// In Node.js, import from "pdfjs-dist/legacy/build/pdf.mjs" instead (see the Node.js canvas example)
import { getDocument } from "pdfjs-dist"

const doc = await getDocument("document.pdf").promise

// Extract text from all pages (page numbers start at 1):
const texts: string[] = []
for (let i = 1; i <= doc.numPages; i++) {
  const page = await doc.getPage(i)
  const content = await page.getTextContent()
  const pageText = content.items
    .map((item) => ("str" in item ? item.str : "")) // marked-content items have no str
    .join(" ")
  texts.push(pageText)
}

console.log(texts.join("\n\n"))
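
Joining every item with a space, as above, flattens the line structure of the page. pdf.js text items also carry a `hasEOL` flag that marks the end of a line, and the sketch below uses it to keep line breaks. The local `TextItemLike` and `MarkedContentLike` types are simplified stand-ins for the pdf.js types, and `joinTextItems` is a hypothetical helper. It assumes that spacing between words arrives as part of the items themselves, which can vary between PDFs.

```typescript
// Simplified stand-ins for the pdf.js item types; only the fields read here.
interface TextItemLike {
  str: string
  hasEOL?: boolean
}
interface MarkedContentLike {
  type: string
  id?: string
}

// Hypothetical helper: rebuilds page text and inserts a newline wherever
// an item is flagged as the end of a line, instead of joining with spaces.
function joinTextItems(
  items: ReadonlyArray<TextItemLike | MarkedContentLike>
): string {
  let text = ""
  for (const item of items) {
    if (!("str" in item)) continue // Skip marked-content entries
    text += item.str
    if (item.hasEOL) text += "\n"
  }
  return text
}
```

Inside the page loop, `joinTextItems(content.items)` could replace the `map`/`join` pair when line breaks matter, for example before splitting text into paragraphs.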

Render to canvas (browser)

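import { getDocument, GlobalWorkerOptions } from "pdfjs-dist"

// In the browser, the worker script must be configured before loading a document.
// This URL form assumes a bundler that resolves new URL(..., import.meta.url), such as Vite.
GlobalWorkerOptions.workerSrc = new URL(
  "pdfjs-dist/build/pdf.worker.mjs",
  import.meta.url
).toString()

// `pdfUrl` is a placeholder for the address of the PDF to display
const doc = await getDocument(pdfUrl).promise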
const page = await doc.getPage(1)

const scale = 1.5
const viewport = page.getViewport({ scale })

const canvas = document.getElementById("pdf-canvas") as HTMLCanvasElement
const context = canvas.getContext("2d")!
canvas.height = viewport.height
canvas.width = viewport.width

await page.render({
  canvasContext: context,
  viewport,
}).promise

Node.js canvas rendering

import fs from "node:fs"
import { getDocument } from "pdfjs-dist/legacy/build/pdf.mjs"
import { createCanvas } from "canvas"

const buffer = fs.readFileSync("document.pdf")

const doc = await getDocument({
  data: new Uint8Array(buffer),
  useSystemFonts: true,
}).promise

const page = await doc.getPage(1)
const viewport = page.getViewport({ scale: 2.0 })

const canvas = createCanvas(viewport.width, viewport.height)
const context = canvas.getContext("2d")

await page.render({
  canvasContext: context,
  viewport,
}).promise

// Save as PNG:
const png = canvas.toBuffer("image/png")
fs.writeFileSync("page-1.png", png)

Annotations

import { getDocument } from "pdfjs-dist"

const doc = await getDocument(pdfUrl).promise
const page = await doc.getPage(1)

// Get annotations (links, form fields, etc.):
const annotations = await page.getAnnotations()
for (const annotation of annotations) {
  if (annotation.subtype === "Link") {
    console.log(`Link: ${annotation.url}`)
  }
  if (annotation.subtype === "Widget") {
    console.log(`Form field: ${annotation.fieldName} = ${annotation.fieldValue}`)
  }
}
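
For further processing it can be handier to collect annotations into a plain object than to log them one by one. The sketch below does that for links and form fields. `AnnotationLike` is a simplified local stand-in for the objects returned by `getAnnotations()`, and `summarizeAnnotations` is a hypothetical helper.

```typescript
// Simplified stand-in for a pdf.js annotation; the real objects carry more fields.
interface AnnotationLike {
  subtype: string
  url?: string
  fieldName?: string
  fieldValue?: unknown
}

// Hypothetical helper: separates link targets from form-field values.
function summarizeAnnotations(annotations: AnnotationLike[]) {
  const links: string[] = []
  const fields: Record<string, unknown> = {}

  for (const annotation of annotations) {
    if (annotation.subtype === "Link" && annotation.url) {
      links.push(annotation.url)
    }
    if (annotation.subtype === "Widget" && annotation.fieldName) {
      fields[annotation.fieldName] = annotation.fieldValue
    }
  }

  return { links, fields }
}
```

The result serializes cleanly with `JSON.stringify`, which makes it easy to return from an API route or store next to the extracted text.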

Structured text with positions

import { getDocument } from "pdfjs-dist"

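// pdf.js expects binary data as a Uint8Array rather than a Node.js Buffer
const doc = await getDocument({ data: new Uint8Array(buffer) }).promise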
const page = await doc.getPage(1)
const content = await page.getTextContent()

// Each item has position and font info:
for (const item of content.items) {
  if ("str" in item) {
    console.log({
      text: item.str,
      x: item.transform[4],       // X position
      y: item.transform[5],       // Y position
      width: item.width,
      height: item.height,
      fontName: item.fontName,
    })
  }
}

// Useful for table extraction, layout analysis, etc.
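
As a starting point for table extraction, positioned items can be clustered into rows by their y coordinate. The sketch below assumes the default, unrotated PDF coordinate space, where the origin is at the bottom-left and larger y values are higher on the page. `PositionedText`, `groupIntoRows`, and the `tolerance` default are hypothetical names and values chosen for illustration.

```typescript
// Shape matching the objects logged above, reduced to what grouping needs.
interface PositionedText {
  text: string
  x: number
  y: number
}

// Hypothetical helper: clusters items whose y coordinates lie within
// `tolerance` of the row's first item, ordered top to bottom, then left to right.
function groupIntoRows(items: PositionedText[], tolerance = 2): string[][] {
  // Larger y is higher on the page, so sort descending by y.
  const sorted = [...items].sort((a, b) => b.y - a.y || a.x - b.x)

  const rows: PositionedText[][] = []
  for (const item of sorted) {
    const lastRow = rows[rows.length - 1]
    if (lastRow && Math.abs(lastRow[0].y - item.y) <= tolerance) {
      lastRow.push(item) // Same baseline, within tolerance
    } else {
      rows.push([item]) // Start a new row
    }
  }

  return rows.map((row) =>
    row.sort((a, b) => a.x - b.x).map((item) => item.text)
  )
}
```

Real tables usually need more than this, such as column detection from x positions and handling of wrapped cells, but row grouping is often enough for simple lists and invoices.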

Feature Comparison

Feature | unpdf | pdf-parse | pdfjs-dist
Purpose | Modern text extraction | Simple text extraction | Full PDF rendering
API complexity | Low | Low | High
Text extraction | ✅ | ✅ | ✅
Metadata | ✅ | ✅ | ✅
Canvas rendering | ❌ | ❌ | ✅
Annotations | ❌ | ❌ | ✅
Text positions | ✅ (via proxy) | ❌ | ✅
Edge runtime | ✅ | ❌ | ❌
Browser support | ✅ | ❌ | ✅
Node.js | ✅ | ✅ | ✅
TypeScript | ✅ | ✅ (@types) | ✅
Based on | pdf.js | pdf.js | pdf.js (original)
Weekly downloads | ~200K | ~2M | ~3M

When to Use Each

Use unpdf if:

  • Need PDF text extraction in edge runtimes or serverless
  • Want a modern, lightweight API for text + metadata
  • In the UnJS ecosystem
  • Need isomorphic PDF parsing (Node.js + browser + edge)

Use pdf-parse if:

  • Need the simplest possible PDF text extraction
  • Building a Node.js script or backend service
  • Processing PDFs for AI/LLM pipelines
  • Want minimal API surface (one function call)

Use pdfjs-dist if:

  • Need to render PDFs visually (browser PDF viewer)
  • Need text positions, annotations, or form fields
  • Building a PDF viewer component
  • Need the full feature set of Mozilla's pdf.js

Methodology

Download data from npm registry (weekly average, February 2026). Feature comparison based on unpdf v0.11.x, pdf-parse v1.x, and pdfjs-dist v4.x.
