TL;DR
Web scraping for AI pipelines has different requirements than traditional data extraction — you need clean markdown (not HTML soup), images extracted separately, and content structured for LLM context windows. Crawl4AI is the open-source Python crawler purpose-built for AI — it outputs clean LLM-ready markdown, supports vision models for screenshots, crawls concurrently with async, and runs entirely locally for free. Firecrawl is the API-first LLM scraping service — call one endpoint and get back clean markdown without running your own browser; great for prototyping and when you don't want to manage infrastructure. Apify is the full-scale web scraping platform — actors (cloud-deployed scrapers), anti-bot evasion, residential proxies, and a marketplace of pre-built scrapers for major websites. For self-hosted AI data pipelines: Crawl4AI. For quick API-based scraping in your RAG pipeline: Firecrawl. For large-scale production scraping with anti-bot protection: Apify.
Key Takeaways
- Crawl4AI is free and open-source — runs locally, no API costs, Python-first
- Firecrawl converts any URL to LLM-ready markdown — one API call, handles JS rendering
- Apify has 1,500+ pre-built actors — Amazon, LinkedIn, Google scraping with anti-bot bypasses
- Crawl4AI supports LLM extraction strategies — use Claude/GPT to extract structured data directly from pages
- Firecrawl's scrape vs crawl — scrape = one page, crawl = entire site with URL discovery
- Apify residential proxies — rotates through real user IPs to avoid bot detection
- All support JavaScript rendering — modern SPAs, React apps, infinite scroll
Why Standard Scrapers Fall Short for AI
Traditional scraping output:
```text
Full HTML with:
- Navigation menus     (noise)
- Cookie banners       (noise)
- Script tags          (noise)
- Style sheets         (noise)
- Advertisements       (noise)
- Meaningful content   (what you actually want)

→ Pass to LLM → context-window waste, high cost, poor extraction
```
AI-optimized scraping output:
```text
Clean markdown:

# Article Title

Meaningful paragraph content here.

## Section Heading

More content...

[Link text](url)

→ Pass to LLM → clean, low-cost, high-quality extraction
```
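The savings are easy to demonstrate with nothing but the standard library. The sketch below strips boilerplate containers the way an AI-oriented scraper would before any content reaches the model; the tag list and sample HTML are illustrative, not any tool's actual cleaning pipeline:

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}  # boilerplate containers

class MainContentExtractor(HTMLParser):
    """Collect text nodes while skipping everything inside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0              # nesting depth inside skipped tags
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = MainContentExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)

raw = (
    "<html><nav><a>Home</a></nav><script>track();</script>"
    "<article><h1>Title</h1><p>Real content.</p></article></html>"
)
clean = extract_text(raw)
print(clean)                  # only "Title" and "Real content." survive
print(len(clean), "<", len(raw))
```

Every character saved here is context-window budget the LLM can spend on actual content instead of markup.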
Crawl4AI: Open-Source Python AI Crawler
Crawl4AI is a Python library built specifically for AI workloads — async crawling, LLM-optimized markdown output, and direct integration with extraction strategies.
Installation
```shell
pip install crawl4ai
playwright install chromium  # Required for browser automation
```
Basic Crawling
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def scrape_article(url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        # result.markdown          — clean markdown for LLMs
        # result.cleaned_html      — cleaned HTML
        # result.extracted_content — structured data (if an extraction strategy is set)
        # result.links             — discovered links
        # result.media             — images and videos found on the page
        return result.markdown

# Run
markdown = asyncio.run(scrape_article("https://example.com/article"))
print(markdown[:500])
```
Concurrent Crawling (Site-Wide)
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def crawl_documentation_site(base_url: str) -> list[dict]:
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,      # Skip cache for fresh data
        wait_for="css:.main-content",     # Wait for content to load
        page_timeout=30000,               # 30-second timeout
        word_count_threshold=50,          # Skip pages with < 50 words
        exclude_external_links=True,      # Only internal links
        exclude_social_media_links=True,
    )
    async with AsyncWebCrawler() as crawler:
        # Crawl multiple pages concurrently.
        # discover_docs_urls is a helper you supply (e.g. sitemap parsing).
        urls = await discover_docs_urls(base_url)
        results = await crawler.arun_many(
            urls=urls[:50],     # Process 50 pages
            config=config,
            max_concurrent=5,   # 5 browser tabs at once
        )
    documents = []
    for result in results:
        if result.success and result.markdown:
            documents.append({
                "url": result.url,
                "title": result.metadata.get("title", ""),
                "content": result.markdown,
                "word_count": len(result.markdown.split()),
            })
    return documents
```
LLM Extraction Strategy
```python
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class ProductInfo(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Price with currency symbol")
    rating: float = Field(description="Rating out of 5")
    review_count: int = Field(description="Number of reviews")
    features: list[str] = Field(description="Key product features")

async def extract_product(url: str) -> ProductInfo:
    extraction_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",  # Or anthropic/claude-3-haiku
        api_token=os.getenv("OPENAI_API_KEY"),
        schema=ProductInfo.model_json_schema(),
        extraction_type="schema",
        instruction="Extract the main product information from this page.",
    )
    config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return ProductInfo.model_validate_json(result.extracted_content)
```
Using Crawl4AI from Node.js
```typescript
// Crawl4AI is Python-native, but you can use it from Node.js via:
// 1. A Python subprocess
// 2. The Crawl4AI REST API (self-hosted)
// 3. Firecrawl (API wrapper, see below)
import { spawn } from "child_process";

const CRAWL_SCRIPT = `
import asyncio, json, sys
from crawl4ai import AsyncWebCrawler

async def crawl(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        print(json.dumps({
            "markdown": result.markdown,
            "title": result.metadata.get("title", ""),
            "success": result.success,
        }))

asyncio.run(crawl(sys.argv[1]))
`;

async function crawlWithPython(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    // Pass the URL as an argv entry rather than interpolating it into the
    // script, so quotes or backslashes in the URL can't break the Python code.
    const python = spawn("python3", ["-c", CRAWL_SCRIPT, url]);
    let output = "";
    python.stdout.on("data", (data) => (output += data));
    python.on("close", (code) => {
      if (code !== 0) reject(new Error(`Crawl failed (exit code ${code})`));
      else resolve(JSON.parse(output).markdown);
    });
  });
}
```
Firecrawl: API-First LLM Scraping
Firecrawl exposes scraping as a clean REST API — one call returns LLM-ready markdown without managing browsers or proxies.
Installation
```shell
npm install @mendable/firecrawl-js
```
Scraping a Single Page
```typescript
import FirecrawlApp from "@mendable/firecrawl-js";

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY! });

// Scrape a single URL → clean markdown
const scrapeResult = await app.scrapeUrl("https://example.com/docs/getting-started", {
  formats: ["markdown", "html"], // What to return
  onlyMainContent: true,         // Strip navigation, footer, ads
  waitFor: 2000,                 // Wait 2s for JS to render
});

if (scrapeResult.success) {
  console.log(scrapeResult.markdown);
  // Clean markdown, ready for LLM context
}
```
Crawling an Entire Site
```typescript
// Crawl multiple pages with URL discovery
const crawlResult = await app.crawlUrl("https://docs.example.com", {
  limit: 100,  // Max pages to crawl
  scrapeOptions: {
    formats: ["markdown"],
    onlyMainContent: true,
  },
  maxDepth: 3, // Max link depth from the start URL
});

// Wait for the async crawl to complete
if (crawlResult.success) {
  const pages = crawlResult.data;
  // pages: array of { url, markdown, metadata }
  // Build the RAG document store (vectorStore is your own abstraction)
  for (const page of pages) {
    await vectorStore.addDocument({
      id: page.metadata.sourceURL,
      content: page.markdown,
      metadata: {
        title: page.metadata.title,
        url: page.metadata.sourceURL,
      },
    });
  }
}
```
Structured Data Extraction
```typescript
import { z } from "zod";

const ProductSchema = z.object({
  name: z.string(),
  price: z.string(),
  rating: z.number().optional(),
  features: z.array(z.string()),
});

// Use LLM-based extraction with a schema
const extractResult = await app.extract(["https://example.com/product"], {
  prompt: "Extract the main product information from this page.",
  schema: ProductSchema,
});

const product = extractResult.data; // Typed to match ProductSchema
```
Building a RAG Pipeline with Firecrawl
```typescript
import FirecrawlApp from "@mendable/firecrawl-js";
import OpenAI from "openai";

const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY! });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

async function buildRagIndex(docsUrl: string) {
  // 1. Crawl the documentation
  const crawl = await firecrawl.crawlUrl(docsUrl, {
    limit: 200,
    scrapeOptions: { formats: ["markdown"], onlyMainContent: true },
  });

  // 2. Chunk and embed each page
  const embeddings: { url: string; embedding: number[]; content: string }[] = [];
  for (const page of crawl.data ?? []) {
    if (!page.markdown) continue;
    const chunks = chunkMarkdown(page.markdown, 1500); // 1,500 chars per chunk
    for (const chunk of chunks) {
      const embedding = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: chunk,
      });
      embeddings.push({
        url: page.metadata?.sourceURL ?? "",
        embedding: embedding.data[0].embedding,
        content: chunk,
      });
    }
  }
  return embeddings;
}
```
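The pipeline above leans on a chunkMarkdown helper that the snippet never defines. One reasonable implementation packs paragraphs greedily up to the character limit; here is a minimal Python sketch of that idea (the name chunk_markdown and the greedy paragraph-packing strategy are our choices, not part of Firecrawl):

```python
def chunk_markdown(text: str, max_chars: int = 1500) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would overflow.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Demo: a title plus ten ~213-character paragraphs, packed into 500-char chunks
doc = "# Title\n\n" + "\n\n".join(f"Paragraph {i}. " + "x" * 200 for i in range(10))
chunks = chunk_markdown(doc, 500)
print(len(chunks), [len(c) for c in chunks])  # 5 chunks, each under 500 chars
```

Splitting on paragraph boundaries rather than fixed offsets keeps each chunk semantically coherent, which matters for embedding quality; a single paragraph longer than the limit would still produce an oversized chunk, so production code usually adds a sentence-level fallback.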
Apify: Full-Scale Production Scraping
Apify is the complete web scraping platform — cloud actors, anti-bot evasion, proxy rotation, and a marketplace of pre-built scrapers for major websites.
Installation
```shell
npm install apify-client
```
Using Pre-Built Actors
```typescript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_API_TOKEN! });

// Use a pre-built actor — no scraper code required
// Actor: apify/website-content-crawler (Firecrawl competitor)
const run = await client.actor("apify/website-content-crawler").call({
  startUrls: [{ url: "https://docs.example.com" }],
  maxCrawlPages: 100,
  crawlerType: "playwright:firefox", // Use Firefox for better compatibility
  includeUrlGlobs: ["https://docs.example.com/**"],
});

// Get results
const dataset = await client.dataset(run.defaultDatasetId).listItems();
const pages = dataset.items;
// Each item: { url, text, markdown, ... }
```
Running Custom Scrapers
```typescript
// Actor: apify/web-scraper (custom JavaScript scraper)
// The actor input is JSON, so pageFunction travels as source code in a string.
const run = await client.actor("apify/web-scraper").call({
  startUrls: [{ url: "https://example.com/products" }],
  pageFunction: `async function pageFunction(context) {
    const { $, request, log } = context;
    // jQuery-like extraction
    return {
      url: request.url,
      products: $(".product-card").map((_, el) => ({
        name: $(el).find(".product-name").text().trim(),
        price: $(el).find(".product-price").text().trim(),
        imageUrl: $(el).find("img").attr("src"),
      })).get(),
    };
  }`,
  maxConcurrency: 5,
  proxyConfiguration: { useApifyProxy: true }, // Route through Apify proxy
});
```
Proxy Configuration
```typescript
// Apify proxy types
const runWithProxy = await client.actor("apify/web-scraper").call({
  startUrls: [{ url: "https://protected-site.com" }],
  proxyConfiguration: {
    useApifyProxy: true,
    apifyProxyGroups: ["RESIDENTIAL"], // RESIDENTIAL | DATACENTER | GOOGLE_SERP
    apifyProxyCountry: "US",
  },
  // Anti-bot mitigation
  browserPoolOptions: {
    useChrome: true,                    // Use real Chrome
    useStealth: true,                   // puppeteer-extra stealth plugin
    retireInstanceAfterRequestCount: 5, // Rotate browser instances
  },
});
```
Feature Comparison
| Feature | Crawl4AI | Firecrawl | Apify |
|---|---|---|---|
| Language | Python | REST API (any) | REST API / Node.js SDK |
| Self-hosted | ✅ | ❌ | ❌ |
| LLM-ready markdown | ✅ | ✅ | ✅ (some actors) |
| JS rendering | ✅ Playwright | ✅ | ✅ Playwright + Puppeteer |
| Anti-bot bypass | Limited | Limited | ✅ Stealth + proxy |
| Proxy rotation | DIY | Limited | ✅ Residential proxies |
| Pre-built scrapers | ❌ | ❌ | ✅ 1,500+ actors |
| LLM extraction | ✅ Native | ✅ | Limited |
| Concurrent crawl | ✅ | ✅ (async) | ✅ |
| Pricing | Free (open-source) | $16/month+ | $49/month+ |
| GitHub stars | 40k | 13k | 4k (SDK) |
When to Use Each
Choose Crawl4AI if:
- Python AI pipeline where you want zero API costs and full local control
- Complex extraction strategies with LLM integration (structured extraction)
- Privacy-sensitive data that shouldn't leave your infrastructure
- High-volume crawling where per-page API costs would be prohibitive
Choose Firecrawl if:
- Rapid prototyping of an LLM or RAG pipeline with minimal setup
- Node.js/TypeScript codebase without Python infrastructure
- Clean markdown output from arbitrary URLs without managing browser instances
- Pages under a few hundred per run where API costs are acceptable
Choose Apify if:
- Scraping major websites (Amazon, LinkedIn, Google) that aggressively block bots
- Residential proxy rotation is required to avoid IP blocks
- You need pre-built, maintained scrapers for specific sites from the actor marketplace
- Large-scale production scraping with monitoring and scheduling in the cloud
Handling JavaScript-Heavy Sites and Anti-Bot Measures
Modern web apps are almost entirely JavaScript-rendered — product pages, documentation sites, and news aggregators require a real browser to produce any meaningful content. All three tools run Playwright or Puppeteer under the hood, but their anti-bot posture differs significantly.
Crawl4AI's CrawlerRunConfig exposes wait_for as a CSS selector string — the crawler waits until that element appears in the DOM before capturing content. This solves infinite scroll and lazy-loaded components at the configuration level without writing custom JavaScript. For pages that require more control, js_code accepts raw JavaScript to execute before capture, letting you click "load more" buttons or dismiss cookie banners programmatically.
Firecrawl's waitFor parameter accepts a millisecond integer — simpler than a CSS selector but less precise. For well-behaved SPAs that finish rendering within a predictable window, this is sufficient. For dynamic content that loads on scroll or interaction, Firecrawl's API-based model limits how much control you have over browser behavior.
Apify is the only option here with production-grade anti-bot evasion. The useStealth flag activates puppeteer-extra's stealth plugin, which patches navigator.webdriver, randomizes canvas fingerprints, and spoofs plugin arrays. Combined with residential proxy rotation (real user IPs), Apify handles sites that block both datacenter IPs and headless browser signals. Crawl4AI and Firecrawl will get blocked on sites with aggressive bot detection (Cloudflare, PerimeterX, DataDome) — Apify is the tool when the target site actively fights scrapers.
Output Quality for RAG and LLM Pipelines
"LLM-ready markdown" means different things across these tools, and the difference matters when you're stuffing content into a context window.
Crawl4AI's markdown output strips navigation, ads, and boilerplate using a combination of HTML cleaning heuristics and configurable filters. The word_count_threshold setting (default: 200) discards pages with insufficient content automatically. For structured extraction, the LLMExtractionStrategy passes cleaned markdown to your choice of Claude, GPT-4, or Gemini — letting the LLM do schema extraction rather than relying on CSS selectors that break with site redesigns.
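If you post-process results yourself rather than relying on word_count_threshold, the same filter is a one-liner. A sketch, with the 50-word cutoff mirroring the earlier config example and the page dicts purely illustrative:

```python
def keep_substantive(pages: list[dict], min_words: int = 50) -> list[dict]:
    """Drop thin pages (nav stubs, redirects) before they reach the index."""
    return [p for p in pages if len(p.get("content", "").split()) >= min_words]

pages = [
    {"url": "/", "content": "home nav footer " * 3},          # 9 words: dropped
    {"url": "/guide", "content": "substantive content " * 60}, # 120 words: kept
]
kept = keep_substantive(pages)
print([p["url"] for p in kept])  # ['/guide']
```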
Firecrawl's onlyMainContent: true flag is the equivalent filter — it identifies and strips navigation, footers, and sidebar content. The output is generally cleaner than raw HTML scraping, though it occasionally loses content that's in non-standard layout containers. Firecrawl's extract() endpoint with a Zod schema handles structured extraction natively, similar to Crawl4AI's LLM strategy but managed by Firecrawl's hosted models.
For RAG pipelines specifically, Crawl4AI's result.links property returns all discovered links in the page — useful for building a crawl frontier for documentation sites. Firecrawl's crawlUrl() handles frontier expansion automatically, returning all discovered pages as a flat array. The metadata each returns differs: Firecrawl includes sourceURL, title, and status code; Crawl4AI's result.metadata is richer, including OG tags and canonical URL.
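Turning discovered links into a crawl frontier is mostly URL hygiene: resolve relative links against the page URL, stay on the same host, drop fragments, and dedupe. A standard-library sketch (the same-host policy and the seen-set are illustrative choices, not any tool's API):

```python
from urllib.parse import urljoin, urlparse, urldefrag

def expand_frontier(page_url: str, links: list[str], seen: set[str]) -> list[str]:
    """Return new same-host URLs to crawl, resolved and deduplicated."""
    host = urlparse(page_url).netloc
    new_urls: list[str] = []
    for link in links:
        absolute, _frag = urldefrag(urljoin(page_url, link))  # resolve + drop #anchors
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            new_urls.append(absolute)
    return new_urls

seen: set[str] = {"https://docs.example.com/intro"}  # already crawled
links = ["/intro", "/api#section", "../guide", "https://other.com/x"]
new_urls = expand_frontier("https://docs.example.com/start/", links, seen)
print(new_urls)  # ['https://docs.example.com/api', 'https://docs.example.com/guide']
```

The fragment stripping matters in practice: without it, `/api`, `/api#install`, and `/api#usage` count as three pages and get crawled three times.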
Rate Limiting and Ethical Scraping
Responsible scraping requires respecting robots.txt, honoring rate limits, and staying within terms of service — and the tools differ in how much enforcement is built in.
Crawl4AI respects robots.txt when respect_robots_txt=True is set in CrawlerRunConfig, but it's opt-in, not default. The delay_before_return_html and mean_delay parameters control request pacing. For high-volume crawls, configuring concurrency conservatively (max_concurrent=2 or 3) prevents triggering server-side rate limits.
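The pacing idea is straightforward to sketch with asyncio alone: bound concurrency with a semaphore and add a jittered delay before each request. This is a hand-rolled illustration of the pattern, not Crawl4AI's implementation:

```python
import asyncio
import random

async def polite_fetch_all(urls, fetch, max_concurrent: int = 3, mean_delay: float = 1.0):
    """Fetch URLs with bounded concurrency and a randomized per-request delay."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(url):
        async with sem:
            # Jittered delay around mean_delay keeps request spacing irregular,
            # which is gentler on servers than a fixed-interval hammer.
            await asyncio.sleep(random.uniform(0.5, 1.5) * mean_delay)
            return await fetch(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(one(u) for u in urls))

async def fake_fetch(url):  # stand-in for a real HTTP call
    return url.upper()

results = asyncio.run(polite_fetch_all(["a", "b", "c"], fake_fetch, mean_delay=0))
print(results)  # ['A', 'B', 'C']
```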
Firecrawl manages rate limiting server-side — if you exceed your plan's credit consumption rate, requests queue automatically. This means you can't accidentally hammer a target server with Firecrawl; the API gateway throttles on your behalf. The downside is less control over per-domain politeness policies.
Apify's scheduler and proxy rotation spread requests across IPs by design, which can obscure high-volume traffic from rate-limiting systems. For production use on sensitive targets, configuring requestHandlerTimeoutSecs and maxRequestsPerMinute in actor input keeps crawl rates defensible.
A note on terms of service: all three tools can access any public URL a browser can reach. Whether scraping a particular site is permitted under its terms of service is a legal question, not a technical one. Public data, your own sites, and sites that explicitly permit scraping are safe use cases; scraping competitors' pricing data or proprietary content may not be.
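Whichever tool you pick, checking robots.txt yourself costs only a few lines, since Python ships a parser in the standard library. A sketch with an illustrative robots.txt:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # in production: parser.set_url(...); parser.read()

allowed = parser.can_fetch("MyCrawler/1.0", "https://example.com/docs/intro")
blocked = parser.can_fetch("MyCrawler/1.0", "https://example.com/admin/users")
print(allowed, blocked)                        # True False
print(parser.crawl_delay("MyCrawler/1.0"))     # 2 — seconds between requests
```

Wiring the returned crawl delay into your request pacing keeps you within the site's own stated policy, which is the easiest part of ethical scraping to get right.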
Methodology
Data sourced from the Crawl4AI GitHub repository (github.com/unclecode/crawl4ai), Firecrawl documentation (firecrawl.dev/docs), Apify documentation (docs.apify.com), GitHub star counts and npm download statistics as of February 2026, and community discussions in the LangChain Discord and AI builders communities. Pricing from official pricing pages as of February 2026.
Related: Mastra vs LangChain.js vs Genkit for building the AI agents that consume the scraped content, or Vercel AI SDK vs OpenAI SDK vs Anthropic SDK for the LLM client that processes extracted data.
See also: ElevenLabs vs OpenAI TTS vs Cartesia and Langfuse vs LangSmith vs Helicone: LLM Observability 2026