unpdf vs pdf-parse vs pdf.js: PDF Parsing and Text Extraction in Node.js (2026)
TL;DR
unpdf is the UnJS PDF extraction library: a lightweight wrapper around pdf.js that extracts text and metadata and runs in Node.js, edge runtimes, and browsers. pdf-parse is the simple option: it wraps pdf.js behind a single function that returns text, metadata, and page count, and it is the most downloaded of the dedicated Node.js text extractors. pdfjs-dist is the npm distribution of pdf.js, Mozilla's full PDF engine: it renders pages to canvas, extracts text, exposes annotations, and powers Firefox's built-in PDF viewer. In 2026: use unpdf for modern or edge-runtime extraction, pdf-parse for simple Node.js text extraction, and pdfjs-dist for full rendering and in-browser display.
Key Takeaways
- unpdf: ~200K weekly downloads — UnJS, edge-compatible, text + metadata extraction
- pdf-parse: ~2M weekly downloads — simple API, text + info + page count, Node.js
- pdfjs-dist: ~3M weekly downloads — Mozilla, full PDF rendering, canvas support
- unpdf and pdf-parse both wrap pdf.js, and pdfjs-dist is the pdf.js build itself, so the three differ in abstraction level rather than in engine
- unpdf and pdf-parse focus on text extraction; pdfjs-dist does everything
- For AI/LLM pipelines, any of the three can extract text, provided the PDF has a text layer (scanned, image-only PDFs need OCR first)
unpdf
unpdf — modern PDF extraction:
Basic text extraction
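import { readFile } from "node:fs/promises"
import { extractText } from "unpdf"
// Simple text extraction (pass a Uint8Array, not a Node.js Buffer):
const buffer = await readFile("document.pdf")
const { text, totalPages } = await extractText(new Uint8Array(buffer), {
  mergePages: true, // One string; without this, text is an array with one entry per page
})
console.log(`Pages: ${totalPages}`)
console.log(text)
// → Full text content of the PDF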
Document proxy (advanced)
import { readFile } from "node:fs/promises"
import { getDocumentProxy } from "unpdf"
// Get full document access (a pdf.js PDFDocumentProxy):
const buffer = await readFile("report.pdf")
const pdf = await getDocumentProxy(new Uint8Array(buffer))
console.log(`Pages: ${pdf.numPages}`)
// Extract text from a specific page (page numbers are 1-based):
const page = await pdf.getPage(1)
const content = await page.getTextContent()
const pageText = content.items
  .map((item) => ("str" in item ? item.str : "")) // Marked-content items carry no str
  .join(" ")
console.log(pageText)
Metadata extraction
import { readFile } from "node:fs/promises"
import { getDocumentProxy } from "unpdf"
const buffer = await readFile("report.pdf")
const pdf = await getDocumentProxy(new Uint8Array(buffer))
const metadata = await pdf.getMetadata()
console.log(metadata.info)
// → {
//   Title: "Annual Report 2026",
//   Author: "PkgPulse Team",
//   Creator: "Google Docs",
//   Producer: "Skia/PDF m120",
//   CreationDate: "D:20260309...",
// }
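The `CreationDate` value above is a raw PDF date string, not a JavaScript `Date`. As a rough sketch, a small helper can convert it, assuming the common `D:YYYYMMDDHHmmSS` layout with an optional `Z` or `+HH'mm'` offset. `parsePdfDate` is a hypothetical name, not part of unpdf or pdf.js, and the sample string is illustrative.

```typescript
// Hypothetical helper: parse a PDF date string into a Date.
// Returns null when the string does not match the expected layout.
function parsePdfDate(raw: string): Date | null {
  const pattern =
    /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?([Zz+\-])?(\d{2})?'?(\d{2})?'?$/
  const match = pattern.exec(raw.trim())
  if (!match) return null
  // Fields after the year are optional; missing ones fall back to these defaults.
  const [
    ,
    year,
    month = "01",
    day = "01",
    hour = "00",
    minute = "00",
    second = "00",
    sign = "Z",
    offsetHours = "00",
    offsetMinutes = "00",
  ] = match
  // Treat the fields as UTC first, then subtract the stated offset.
  const utcMillis = Date.UTC(+year, +month - 1, +day, +hour, +minute, +second)
  const direction = sign === "-" ? -1 : sign === "+" ? 1 : 0
  const offset = direction * (+offsetHours * 60 + +offsetMinutes)
  return new Date(utcMillis - offset * 60_000)
}

const created = parsePdfDate("D:20260309142530+01'00'")
console.log(created?.toISOString()) // "2026-03-09T13:25:30.000Z"
```

Returning `null` instead of throwing keeps the helper safe to call on PDFs whose producers write non-standard dates.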
Edge runtime / serverless
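// unpdf works in edge runtimes (Cloudflare Workers, Vercel Edge):
import { extractText } from "unpdf"
export default {
  async fetch(request: Request): Promise<Response> {
    const formData = await request.formData()
    const file = formData.get("pdf")
    // Reject requests where the "pdf" field is missing or is a plain string:
    if (!(file instanceof File)) {
      return Response.json({ error: 'Expected a file in the "pdf" field' }, { status: 400 })
    }
    const buffer = await file.arrayBuffer()
    const { text, totalPages } = await extractText(new Uint8Array(buffer), {
      mergePages: true, // One string instead of an array of per-page strings
    })
    return Response.json({ text, totalPages })
  },
}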
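An upload handler like the one above will receive files that are not PDFs. One cheap guard, sketched here, is to check for the `%PDF-` signature that PDF files begin with before handing the bytes to a parser. `looksLikePdf` is a hypothetical helper name, and the check is only a first filter: some lenient readers accept a few junk bytes before the signature, which this sketch rejects.

```typescript
// Hypothetical helper: true when the bytes begin with the "%PDF-" signature.
function looksLikePdf(bytes: Uint8Array): boolean {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d] // "%PDF-"
  if (bytes.length < signature.length) return false
  return signature.every((byte, i) => bytes[i] === byte)
}

// "%PDF-1.7" as raw bytes:
const pdfHeader = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37])
// "<html>" as raw bytes:
const htmlHeader = new Uint8Array([0x3c, 0x68, 0x74, 0x6d, 0x6c, 0x3e])

console.log(looksLikePdf(pdfHeader)) // true
console.log(looksLikePdf(htmlHeader)) // false
```

In the handler, a failed check would map naturally to a 400 or 415 response before any parsing work is done.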
pdf-parse
pdf-parse — simple text extraction:
Basic usage
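import fs from "node:fs"
import pdf from "pdf-parse"
const buffer = fs.readFileSync("document.pdf")
const data = await pdf(buffer)
console.log(data.numpages) // Total number of pages in the document
console.log(data.numrender) // Number of pages actually rendered (capped by the max option)
console.log(data.info) // PDF info dictionary (Title, Author, ...)
console.log(data.metadata) // XMP metadata (XML), or null if the PDF has none
console.log(data.version) // Version of the bundled pdf.js build, not the PDF file version
console.log(data.text) // All text content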
With options
import pdf from "pdf-parse"
const data = await pdf(buffer, {
  // Limit pages to parse:
  max: 10, // Only first 10 pages
  // Custom page render function:
  pagerender(pageData) {
    const textContent = pageData.getTextContent()
    return textContent.then((content) => {
      return content.items
        .map((item) => item.str)
        .join(" ")
    })
  },
})
console.log(data.text)
Extract text per page
import pdf from "pdf-parse"
const pages: string[] = []
const data = await pdf(buffer, {
  pagerender(pageData) {
    return pageData.getTextContent().then((content) => {
      const pageText = content.items
        .map((item) => item.str)
        .join(" ")
      pages.push(pageText)
      return pageText
    })
  },
})
// pages[0] = text from page 1
// pages[1] = text from page 2
// etc.
Common use: AI/LLM document processing
import fs from "node:fs"
import pdf from "pdf-parse"
async function extractForLLM(pdfPath: string): Promise<string> {
  const buffer = fs.readFileSync(pdfPath)
  const data = await pdf(buffer)
  // Clean up extracted text:
  const cleanText = data.text
    .replace(/[ \t]{2,}/g, " ") // Collapse runs of spaces and tabs, but keep newlines
    .replace(/\n{3,}/g, "\n\n") // Collapse 3+ newlines into one paragraph break
    .trim()
  return cleanText
}
// Feed to LLM (llm stands in for your provider's chat client):
const text = await extractForLLM("contract.pdf")
const response = await llm.chat({
  messages: [
    { role: "system", content: "Summarize this document." },
    { role: "user", content: text },
  ],
})
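Long PDFs can exceed a model's context window, so the extracted text is often split before it is sent. The sketch below shows one simple approach under stated assumptions: it measures chunks in characters rather than tokens, and it prefers paragraph breaks (blank lines) as split points. `chunkText` is a hypothetical helper, not part of pdf-parse.

```typescript
// Hypothetical helper: split text into chunks of at most maxChars characters,
// breaking on blank lines where possible.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive")
  const chunks: string[] = []
  let current = ""
  for (const paragraph of text.split(/\n{2,}/)) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph
    if (candidate.length <= maxChars) {
      current = candidate
      continue
    }
    // The paragraph does not fit: flush what has accumulated so far.
    if (current) chunks.push(current)
    // A single paragraph longer than the limit is cut into fixed-size slices.
    let rest = paragraph
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars))
      rest = rest.slice(maxChars)
    }
    current = rest
  }
  if (current) chunks.push(current)
  return chunks
}

console.log(chunkText("aaaa\n\nbbbb\n\ncccc", 10))
// [ "aaaa\n\nbbbb", "cccc" ]
```

Each chunk can then be summarized on its own and the partial summaries combined. A token-based limit would be more precise but depends on the model's tokenizer.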
pdfjs-dist
pdfjs-dist (pdf.js) — full PDF rendering:
Text extraction
// In Node.js, use the legacy build; in the browser, import from "pdfjs-dist" instead:
import { getDocument } from "pdfjs-dist/legacy/build/pdf.mjs"
const doc = await getDocument("document.pdf").promise
// Extract text from all pages:
const texts: string[] = []
for (let i = 1; i <= doc.numPages; i++) {
  const page = await doc.getPage(i)
  const content = await page.getTextContent()
  const pageText = content.items
    .map((item) => ("str" in item ? item.str : "")) // Marked-content items carry no str
    .join(" ")
  texts.push(pageText)
}
console.log(texts.join("\n\n"))
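Joining every item with a space flattens each page into one long line. pdf.js text items also carry a `hasEOL` flag that marks the end of a visual line, so a variant that preserves line breaks can be sketched as follows. `TextItemLike` and `itemsToText` are hypothetical names; the interface lists only the two pdf.js fields the helper reads, and the sketch assumes spaces between words arrive inside the items themselves.

```typescript
// Minimal shape of the pdf.js text items this helper reads.
interface TextItemLike {
  str: string
  hasEOL: boolean
}

// Hypothetical helper: concatenate items, inserting a newline wherever
// pdf.js flagged the end of a line.
function itemsToText(items: TextItemLike[]): string {
  let out = ""
  for (const item of items) {
    out += item.str
    if (item.hasEOL) out += "\n"
  }
  return out.trim()
}

const sampleItems: TextItemLike[] = [
  { str: "Hello", hasEOL: false },
  { str: " ", hasEOL: false },
  { str: "world", hasEOL: true },
  { str: "Next line", hasEOL: false },
]
console.log(itemsToText(sampleItems))
// Hello world
// Next line
```

With real pdf.js output, filter `content.items` down to the entries that have a `str` property before passing them in.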
Render to canvas (browser)
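import { getDocument, GlobalWorkerOptions } from "pdfjs-dist"
// pdf.js parses in a web worker, so point it at the worker script first.
// This URL form assumes a bundler that resolves import.meta.url (e.g. Vite); adjust for your setup:
GlobalWorkerOptions.workerSrc = new URL(
  "pdfjs-dist/build/pdf.worker.mjs",
  import.meta.url
).toString()
// pdfUrl: URL of the PDF to display
const doc = await getDocument(pdfUrl).promise
const page = await doc.getPage(1)
const scale = 1.5
const viewport = page.getViewport({ scale })
const canvas = document.getElementById("pdf-canvas") as HTMLCanvasElement
const context = canvas.getContext("2d")!
canvas.height = viewport.height
canvas.width = viewport.width
await page.render({
  canvasContext: context,
  viewport,
}).promise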
Node.js canvas rendering
import fs from "node:fs"
import { getDocument } from "pdfjs-dist/legacy/build/pdf.mjs"
import { createCanvas } from "canvas"
const buffer = fs.readFileSync("document.pdf")
const doc = await getDocument({
  data: new Uint8Array(buffer),
  useSystemFonts: true,
}).promise
const page = await doc.getPage(1)
const viewport = page.getViewport({ scale: 2.0 })
const canvas = createCanvas(viewport.width, viewport.height)
const context = canvas.getContext("2d")
await page.render({
  canvasContext: context,
  viewport,
}).promise
// Save as PNG:
const png = canvas.toBuffer("image/png")
fs.writeFileSync("page-1.png", png)
Annotations and links
import { getDocument } from "pdfjs-dist"
const doc = await getDocument(pdfUrl).promise
const page = await doc.getPage(1)
// Get annotations (links, form fields, etc.):
const annotations = await page.getAnnotations()
for (const annotation of annotations) {
  // Internal links (jumps within the document) have no url, so skip them:
  if (annotation.subtype === "Link" && annotation.url) {
    console.log(`Link: ${annotation.url}`)
  }
  if (annotation.subtype === "Widget") {
    console.log(`Form field: ${annotation.fieldName} = ${annotation.fieldValue}`)
  }
}
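Rather than logging annotations one by one, it is often handier to collect them into a single structure. The sketch below does that for plain objects shaped like the annotations used above. `AnnotationLike`, `AnnotationSummary`, and `summarizeAnnotations` are hypothetical names, and the interface covers only the four fields this example reads.

```typescript
// Only the annotation fields this sketch reads.
interface AnnotationLike {
  subtype: string
  url?: string
  fieldName?: string
  fieldValue?: unknown
}

interface AnnotationSummary {
  links: string[]
  fields: Record<string, string>
}

// Hypothetical helper: gather external link URLs and form field values.
function summarizeAnnotations(annotations: AnnotationLike[]): AnnotationSummary {
  const summary: AnnotationSummary = { links: [], fields: {} }
  for (const annotation of annotations) {
    if (annotation.subtype === "Link" && annotation.url) {
      // Links without a url are internal jumps and are skipped.
      summary.links.push(annotation.url)
    } else if (annotation.subtype === "Widget" && annotation.fieldName) {
      const value = annotation.fieldValue
      summary.fields[annotation.fieldName] = value == null ? "" : String(value)
    }
  }
  return summary
}

const summary = summarizeAnnotations([
  { subtype: "Link", url: "https://example.com" },
  { subtype: "Link" }, // Internal link, no url
  { subtype: "Widget", fieldName: "email", fieldValue: "a@example.com" },
])
console.log(summary)
// { links: [ "https://example.com" ], fields: { email: "a@example.com" } }
```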
Structured text with positions
import { getDocument } from "pdfjs-dist"
// buffer holds the PDF bytes; pdf.js expects a Uint8Array rather than a Node.js Buffer:
const doc = await getDocument({ data: new Uint8Array(buffer) }).promise
const page = await doc.getPage(1)
const content = await page.getTextContent()
// Each item has position and font info:
for (const item of content.items) {
  if ("str" in item) {
    console.log({
      text: item.str,
      x: item.transform[4], // X position
      y: item.transform[5], // Y position
      width: item.width,
      height: item.height,
      fontName: item.fontName,
    })
  }
}
// Useful for table extraction, layout analysis, etc.
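As a first step toward table extraction, positioned items can be grouped into rows by their y coordinate. The sketch below assumes items have already been reduced to `{ text, x, y }` objects (for example from `item.str`, `item.transform[4]`, and `item.transform[5]`). `PositionedItem`, `groupIntoLines`, and the default tolerance of 2 units are illustrative choices, not pdf.js API.

```typescript
interface PositionedItem {
  text: string
  x: number
  y: number
}

// Hypothetical helper: group items whose y values lie within `tolerance` of the
// first item in a line. Lines come out top to bottom (PDF y grows upward),
// and items inside each line are ordered left to right.
function groupIntoLines(items: PositionedItem[], tolerance = 2): string[][] {
  const sorted = [...items].sort((a, b) => b.y - a.y || a.x - b.x)
  const lines: { y: number; items: PositionedItem[] }[] = []
  for (const item of sorted) {
    const line = lines[lines.length - 1]
    if (line && Math.abs(line.y - item.y) <= tolerance) {
      line.items.push(item)
    } else {
      lines.push({ y: item.y, items: [item] })
    }
  }
  return lines.map((line) =>
    line.items.sort((a, b) => a.x - b.x).map((item) => item.text)
  )
}

const rows = groupIntoLines([
  { text: "Qty", x: 200, y: 700 },
  { text: "Item", x: 50, y: 700.5 },
  { text: "2", x: 200, y: 680 },
  { text: "Widget", x: 50, y: 681 },
])
console.log(rows)
// [ [ "Item", "Qty" ], [ "Widget", "2" ] ]
```

The tolerance absorbs small baseline differences between fonts on the same row. Real tables usually need a second pass that clusters x positions into columns.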
Feature Comparison
| Feature | unpdf | pdf-parse | pdfjs-dist |
|---|---|---|---|
| Purpose | Modern text extraction | Simple text extraction | Full PDF rendering |
| API complexity | Low | Low | High |
| Text extraction | ✅ | ✅ | ✅ |
| Metadata | ✅ | ✅ | ✅ |
| Canvas rendering | ❌ | ❌ | ✅ |
| Annotations | ❌ | ❌ | ✅ |
| Text positions | ✅ (via proxy) | ❌ | ✅ |
| Edge runtime | ✅ | ❌ | ❌ |
| Browser support | ✅ | ❌ | ✅ |
| Node.js | ✅ | ✅ | ✅ |
| TypeScript | ✅ | ✅ (@types) | ✅ |
| Based on | pdf.js | pdf.js | pdf.js (original) |
| Weekly downloads | ~200K | ~2M | ~3M |
When to Use Each
Use unpdf if:
- Need PDF text extraction in edge runtimes or serverless
- Want a modern, lightweight API for text + metadata
- In the UnJS ecosystem
- Need isomorphic PDF parsing (Node.js + browser + edge)
Use pdf-parse if:
- Need the simplest possible PDF text extraction
- Building a Node.js script or backend service
- Processing PDFs for AI/LLM pipelines
- Want minimal API surface (one function call)
Use pdfjs-dist if:
- Need to render PDFs visually (browser PDF viewer)
- Need text positions, annotations, or form fields
- Building a PDF viewer component
- Need the full feature set of Mozilla's pdf.js
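The guidance above can be condensed into a small decision function. This is only a sketch of the rules of thumb in this section: `Requirements` and `chooseLibrary` are hypothetical names, and conflicting needs (for example canvas rendering inside an edge runtime) are not resolved here.

```typescript
type PdfLibrary = "unpdf" | "pdf-parse" | "pdfjs-dist"

// Hypothetical requirement flags, mirroring rows of the feature comparison.
interface Requirements {
  rendering?: boolean // Need to draw pages to a canvas
  annotations?: boolean // Need links or form fields
  edgeRuntime?: boolean // Must run in Cloudflare Workers, Vercel Edge, etc.
  browser?: boolean // Must run in the browser
}

function chooseLibrary(req: Requirements): PdfLibrary {
  // Rendering and annotations are only covered by the full pdf.js build.
  if (req.rendering || req.annotations) return "pdfjs-dist"
  // Text extraction outside plain Node.js points to unpdf.
  if (req.edgeRuntime || req.browser) return "unpdf"
  // Plain Node.js text extraction: the simplest API wins.
  return "pdf-parse"
}

console.log(chooseLibrary({ edgeRuntime: true })) // "unpdf"
console.log(chooseLibrary({ rendering: true })) // "pdfjs-dist"
console.log(chooseLibrary({})) // "pdf-parse"
```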
Methodology
Download data from npm registry (weekly average, February 2026). Feature comparison based on unpdf v0.11.x, pdf-parse v1.x, and pdfjs-dist v4.x.