Firecrawl Crawl

Bulk extracts content from entire websites or targeted site sections

What Is This?

Overview

Firecrawl Crawl is a command-line skill that enables bulk content extraction from entire websites or specific site sections. Rather than scraping individual pages one at a time, this skill follows links automatically, traverses site structures, and extracts content from multiple pages in a single operation. It is built on top of the Firecrawl CLI and handles the complexity of concurrent requests, depth management, and path filtering behind a straightforward interface.

The skill is particularly well suited for documentation sites, knowledge bases, and any web property where content is distributed across dozens or hundreds of linked pages. When you need everything under a specific URL path, such as all pages within a /docs or /blog section, Firecrawl Crawl retrieves and structures that content without requiring manual navigation or custom scraping scripts.

Under the hood, the skill respects configurable depth limits to prevent runaway crawls, supports path-based filtering to stay within relevant sections of a site, and runs extractions concurrently to reduce total processing time. The output is clean, structured content ready for downstream use in documentation pipelines, AI training datasets, or content analysis workflows.
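
To make the path-based filtering described above concrete, here is a minimal sketch of a scope check. It is an illustration only, not Firecrawl's implementation: the `in_scope` function, the sample paths, and the assumption that patterns behave like shell globs are all hypothetical, and the CLI's real matching rules may differ.

```shell
#!/bin/sh
# Illustrative only: decide whether a URL path falls inside a crawl scope.
# Assumes shell-style glob matching; Firecrawl's own rules may differ.

# in_scope PATH PATTERN -> exit status 0 if PATH matches PATTERN
in_scope() {
    case "$1" in
        $2) return 0 ;;   # unquoted so the pattern is treated as a glob
        *)  return 1 ;;
    esac
}

# Hypothetical paths discovered while following links from a start page.
for path in /docs/intro /docs/api/auth /blog/2024/launch /pricing; do
    if in_scope "$path" "/docs/*"; then
        echo "crawl $path"
    else
        echo "skip  $path"
    fi
done
```

In this sketch only the two /docs paths are marked for crawling, which is the behavior a path filter is meant to provide: links leading out of the section are seen but not followed.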

Who Should Use This

  • Developers building documentation ingestion pipelines or knowledge retrieval systems who need structured content from large sites.
  • Technical writers who want to audit or migrate content from an existing documentation site.
  • Data engineers preparing training datasets that require bulk web content from specific domains or sections.
  • AI and LLM practitioners who need to feed entire documentation sites into retrieval-augmented generation systems.

Why Use It?

Problems It Solves

  • Manually visiting and copying content from dozens of pages is time-consuming and error-prone. Firecrawl Crawl automates this entirely.
  • Custom scraping scripts require maintenance, handle edge cases poorly, and often break when site structures change. This skill provides a stable, maintained abstraction.
  • Extracting content from paginated or deeply linked documentation sites without a crawler means missing pages. Firecrawl follows links systematically, so pages reachable within the configured depth are not skipped.
  • Concurrent scraping without rate management can overload servers or result in blocked requests. The skill handles concurrency responsibly by default.

Core Highlights

  • Crawls entire websites or scoped site sections from a single command
  • Follows internal links automatically up to a configurable depth
  • Filters crawl scope by URL path prefix to avoid irrelevant sections
  • Runs concurrent extractions to minimize total crawl time
  • Returns clean, structured content stripped of navigation and boilerplate
  • Supports output formats suitable for downstream processing and AI ingestion
  • Works with both public sites and authenticated environments via the Firecrawl API
  • Integrates directly into shell scripts and automated pipelines

How to Use It?

Basic Usage

To crawl an entire documentation section, run the following command using the Firecrawl CLI:

firecrawl crawl https://docs.example.com/docs

To limit crawl depth and restrict the crawl to a specific path:

firecrawl crawl https://docs.example.com --include-paths "/docs/*" --max-depth 3

To run the CLI through npx without a global install:

npx firecrawl crawl https://docs.example.com/api --max-depth 2

Specific Scenarios

Scenario 1: Extracting all pages from a versioned docs site

When a documentation site organizes content under versioned paths, you can scope the crawl to a single version to avoid pulling duplicate or outdated content.

firecrawl crawl https://docs.example.com --include-paths "/v2/*" --max-depth 4

Scenario 2: Bulk extraction for an AI knowledge base

When building a retrieval-augmented generation system, crawl the target site and write the output as Markdown to a local path for further processing.

firecrawl crawl https://docs.example.com/guides --output ./output --format markdown
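
As one possible form of "further processing", the sketch below merges crawled Markdown files into a single corpus file with a marker showing where each section came from. It assumes the crawl left one .md file per page somewhere under the output directory. That layout is an assumption, not documented behavior, and `build_corpus` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: merge crawled Markdown files into one corpus file.
# Assumes one .md file per page under the output directory; the real
# layout produced by the CLI may differ.

# build_corpus SRC_DIR DEST_FILE   (keep DEST_FILE outside SRC_DIR)
build_corpus() {
    src=$1
    dest=$2
    : > "$dest"    # start with an empty corpus file
    find "$src" -type f -name '*.md' | sort | while IFS= read -r file; do
        # Record which file each section came from, then append it.
        printf '<!-- source: %s -->\n' "${file#"$src"/}" >> "$dest"
        cat "$file" >> "$dest"
        printf '\n' >> "$dest"
    done
}
```

Calling `build_corpus ./output ./corpus.md` after the crawl would produce one file that a chunking or embedding step can read in a single pass, while the source markers keep each chunk traceable to its page.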

Real-World Examples

  • A platform team crawls their internal developer portal nightly to keep a vector database synchronized with the latest documentation.
  • A technical writer uses the skill to extract all content from a legacy docs site before migrating it to a new platform.
  • An AI engineer crawls a third-party API reference to build a domain-specific assistant with accurate, up-to-date knowledge.
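
The nightly synchronization example could be driven by a small wrapper like the one sketched here. The flags are the ones shown earlier in this document. The `nightly_crawl` function, the dated directory layout, and the `DRY_RUN` switch are hypothetical choices made for illustration.

```shell
#!/bin/sh
# Sketch of a nightly crawl wrapper suitable for a scheduler such as cron.
# Flags are taken from the examples in this document; the directory
# layout and the DRY_RUN switch are hypothetical.

# nightly_crawl URL BASE_DIR
nightly_crawl() {
    url=$1
    out_dir=$2/$(date +%Y-%m-%d)    # one dated directory per run
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Print the command instead of running it.
        echo "firecrawl crawl $url --output $out_dir --format markdown"
        return 0
    fi
    mkdir -p "$out_dir" &&
        firecrawl crawl "$url" --output "$out_dir" --format markdown
}
```

Running it once with `DRY_RUN=1` prints the exact command that would be executed, which is a cheap way to verify the URL and output path before the job consumes API quota.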

When to Use It?

Use Cases

  • Ingesting full documentation sites into vector stores for semantic search
  • Auditing all pages in a site section for content quality or broken links
  • Migrating content from one documentation platform to another
  • Building offline documentation archives for air-gapped environments
  • Generating training data for domain-specific language models
  • Monitoring documentation sites for content changes over time
  • Extracting structured reference material for automated summarization
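
For the change-monitoring use case, one simple approach is to compare checksums of the crawled files between runs. The sketch below assumes the crawl output is a directory of files and uses the standard cksum and comm utilities. The function names and the manifest file are hypothetical, and deleted pages are not reported.

```shell
#!/bin/sh
# Sketch: report which crawled files are new or modified since the last run.
# Assumes crawl output is a directory of files. Keep the manifest outside
# that directory so it is not checksummed itself.

# snapshot DIR -> prints sorted "checksum size path" lines
snapshot() {
    ( cd "$1" && find . -type f -exec cksum {} + | sort )
}

# changed_since MANIFEST DIR -> prints paths that are new or modified
changed_since() {
    manifest=$1
    dir=$2
    [ -f "$manifest" ] || : > "$manifest"    # first run: empty baseline
    snapshot "$dir" > "$manifest.new"
    # comm -13 keeps lines that appear only in the new snapshot.
    comm -13 "$manifest" "$manifest.new" | cut -d ' ' -f 3-
    mv "$manifest.new" "$manifest"           # new snapshot becomes baseline
}
```

On the first run every file is listed. On later runs only files whose content changed appear, so a downstream step can re-index just those pages.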

Important Notes

Requirements

  • A valid Firecrawl API key configured in your environment or passed via CLI flags
  • Node.js installed if using the npx firecrawl invocation method
  • Network access to the target site from the machine running the crawl
  • Sufficient API quota for the expected number of pages in the crawl scope
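
The first two requirements can be checked before a crawl starts. This is a minimal sketch: `preflight` is a hypothetical helper, and FIRECRAWL_API_KEY is assumed to be the environment variable that holds the key, so confirm the name against your Firecrawl setup. A key passed via CLI flags would not be detected by this check.

```shell
#!/bin/sh
# Sketch of a pre-flight check for the requirements listed above.
# FIRECRAWL_API_KEY is an assumed variable name; verify it for your setup.

# preflight -> prints one line per missing requirement; non-zero if any
preflight() {
    missing=0
    if [ -z "${FIRECRAWL_API_KEY:-}" ]; then
        echo "missing: FIRECRAWL_API_KEY is not set"
        missing=1
    fi
    # Accept either a global firecrawl binary or npx (which needs Node.js).
    if ! command -v firecrawl >/dev/null 2>&1 &&
       ! command -v npx >/dev/null 2>&1; then
        echo "missing: neither firecrawl nor npx is on PATH"
        missing=1
    fi
    return "$missing"
}
```

A pipeline script could call `preflight || exit 1` as its first step so that a missing key or missing tooling fails fast instead of partway through a crawl.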