Firecrawl Crawl
Bulk extracts content from entire websites or targeted site sections
What Is This?
Overview
Firecrawl Crawl is a command-line skill that enables bulk content extraction from entire websites or specific site sections. Rather than scraping individual pages one at a time, this skill follows links automatically, traverses site structures, and extracts content from multiple pages in a single operation. It is built on top of the Firecrawl CLI and handles the complexity of concurrent requests, depth management, and path filtering behind a straightforward interface.
The skill is particularly well suited for documentation sites, knowledge bases, and any web property where content is distributed across dozens or hundreds of linked pages. When you need everything under a specific URL path, such as all pages within a /docs or /blog section, Firecrawl Crawl retrieves and structures that content without requiring manual navigation or custom scraping scripts.
Under the hood, the skill respects configurable depth limits to prevent runaway crawls, supports path-based filtering to stay within relevant sections of a site, and runs extractions concurrently to reduce total processing time. The output is clean, structured content ready for downstream use in documentation pipelines, AI training datasets, or content analysis workflows.
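To make the idea of path-based filtering concrete, here is a minimal sketch of how a URL could be tested against a pattern such as "/docs/*". The `in_scope` helper and its variable names are hypothetical and written only for illustration; this is not how Firecrawl itself implements scoping, and the real matching rules may differ.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the URL's path matches a glob pattern.
# Illustration only; Firecrawl's own matching rules may differ.
in_scope() {
  scope_url="$1"
  scope_pattern="$2"

  # Strip the scheme, then take everything after the host as the path.
  scope_rest="${scope_url#*://}"
  scope_path="/"
  case "$scope_rest" in
    */*) scope_path="/${scope_rest#*/}" ;;
  esac

  # Drop any query string or fragment before matching.
  scope_path="${scope_path%%\?*}"
  scope_path="${scope_path%%\#*}"

  # An unquoted expansion in a case pattern is treated as a glob.
  case "$scope_path" in
    $scope_pattern) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: decide which discovered links stay inside the /docs section.
for link in \
  "https://docs.example.com/docs/intro" \
  "https://docs.example.com/blog/launch"; do
  if in_scope "$link" "/docs/*"; then
    echo "crawl $link"
  else
    echo "skip $link"
  fi
done
```

Under this sketch's glob semantics, "/docs/*" matches anything below /docs/ but not the bare path /docs, which is worth keeping in mind when choosing a pattern.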
Who Should Use This
- Developers building documentation ingestion pipelines or knowledge retrieval systems who need structured content from large sites.
- Technical writers who want to audit or migrate content from an existing documentation site.
- Data engineers preparing training datasets that require bulk web content from specific domains or sections.
- AI and LLM practitioners who need to feed entire documentation sites into retrieval-augmented generation systems.
Why Use It?
Problems It Solves
- Manually visiting and copying content from dozens of pages is time-consuming and error-prone. Firecrawl Crawl automates this entirely.
- Custom scraping scripts require maintenance, handle edge cases poorly, and often break when site structures change. This skill provides a stable, maintained abstraction.
- Extracting content from paginated or deeply linked documentation sites without a crawler often means missing pages. Firecrawl follows links systematically, so pages reachable within the configured depth and path scope are covered.
- Concurrent scraping without rate management can overload servers or result in blocked requests. The skill handles concurrency responsibly by default.
Core Highlights
- Crawls entire websites or scoped site sections from a single command
- Follows internal links automatically up to a configurable depth
- Filters crawl scope by URL path prefix to avoid irrelevant sections
- Runs concurrent extractions to minimize total crawl time
- Returns clean, structured content stripped of navigation and boilerplate
- Supports output formats suitable for downstream processing and AI ingestion
- Works with both public sites and authenticated environments via the Firecrawl API
- Integrates directly into shell scripts and automated pipelines
How to Use It?
Basic Usage
To crawl an entire documentation section, run the following command using the Firecrawl CLI:
firecrawl crawl https://docs.example.com/docs
To limit crawl depth and scope the crawl to a specific path:
firecrawl crawl https://docs.example.com --include-paths "/docs/*" --max-depth 3
Using npx without a global install:
npx firecrawl crawl https://docs.example.com/api --max-depth 2
Specific Scenarios
Scenario 1: Extracting all pages from a versioned docs site
When a documentation site organizes content under versioned paths, you can scope the crawl to a single version to avoid pulling duplicate or outdated content.
firecrawl crawl https://docs.example.com --include-paths "/v2/*" --max-depth 4
Scenario 2: Bulk extraction for an AI knowledge base
When building a retrieval-augmented generation system, crawl the target site and write the output to a local path for further processing.
firecrawl crawl https://docs.example.com/guides --output ./output --format markdown
Real-World Examples
- A platform team crawls their internal developer portal nightly to keep a vector database synchronized with the latest documentation.
- A technical writer uses the skill to extract all content from a legacy docs site before migrating it to a new platform.
- An AI engineer crawls a third-party API reference to build a domain-specific assistant with accurate, up-to-date knowledge.
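A scheduled run like the nightly synchronization example could be wrapped in a small script. The sketch below reuses only the flags from the commands shown earlier in this document; the function names, the dated output directory layout, and the `DRY_RUN` switch are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of a nightly crawl wrapper. The flags mirror the commands shown earlier;
# the dated directory layout and the DRY_RUN switch are assumptions.

# Print the crawl command for a URL, an output root, and a date stamp.
# The stamp is passed in so this function stays deterministic.
build_crawl_cmd() {
  crawl_url="$1"
  crawl_out_root="$2"
  crawl_stamp="$3"
  printf 'firecrawl crawl %s --output %s/%s --format markdown\n' \
    "$crawl_url" "$crawl_out_root" "$crawl_stamp"
}

# Run tonight's crawl, or only print the command when DRY_RUN=1.
nightly_crawl() {
  crawl_cmd=$(build_crawl_cmd "$1" "$2" "$(date +%Y-%m-%d)")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$crawl_cmd"
  else
    # Word splitting is intentional: the sketch assumes no spaces in the URL or paths.
    $crawl_cmd
  fi
}
```

Calling `nightly_crawl https://docs.example.com/guides ./output` with `DRY_RUN=1` set prints the command instead of running it, which is a convenient check before handing the script to a scheduler such as cron.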
When to Use It?
Use Cases
- Ingesting full documentation sites into vector stores for semantic search
- Auditing all pages in a site section for content quality or broken links
- Migrating content from one documentation platform to another
- Building offline documentation archives for air-gapped environments
- Generating training data for domain-specific language models
- Monitoring documentation sites for content changes over time
- Extracting structured reference material for automated summarization
Important Notes
Requirements
- A valid Firecrawl API key configured in your environment or passed via CLI flags
- Node.js installed if using the npx firecrawl invocation method
- Network access to the target site from the machine running the crawl
- Sufficient API quota for the expected number of pages in the crawl scope
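The first two requirements above can be checked before starting a crawl. The following preflight sketch assumes the API key is read from an environment variable named `FIRECRAWL_API_KEY`; that name and the `check_requirements` function are assumptions for illustration, so confirm the variable your installation actually expects.

```shell
#!/bin/sh
# Preflight sketch. FIRECRAWL_API_KEY is an assumed variable name;
# check_requirements is a hypothetical helper, not part of the CLI.
check_requirements() {
  req_missing=""

  # An API key must be present in the environment.
  [ -n "${FIRECRAWL_API_KEY:-}" ] || req_missing="$req_missing api-key"

  # Need at least one way to invoke the CLI: a global install, or npx (which needs Node.js).
  if ! command -v firecrawl >/dev/null 2>&1 && ! command -v npx >/dev/null 2>&1; then
    req_missing="$req_missing cli"
  fi

  if [ -n "$req_missing" ]; then
    echo "missing:$req_missing"
    return 1
  fi
  echo "ok"
}
```

A script could call `check_requirements || exit 1` before the crawl command so that a missing key or missing CLI fails fast rather than partway through a run. Network access and API quota are not covered by this sketch.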