Firecrawl doesn't seem to crawl everything

I'm trying to run Firecrawl on PyTorch's documentation, and I only get ~15 results, with these URLs:
https://pytorch.org/docs/stable
https://pytorch.org/docs/stable/distributed.pipelining.html
https://pytorch.org/docs/stable/dynamo/index.html
https://pytorch.org/docs/stable/fsdp.html
https://pytorch.org/docs/stable/fx.html
https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
https://pytorch.org/docs/stable/generated/torch.set_num_threads.html
https://pytorch.org/docs/stable/hub.html
https://pytorch.org/docs/stable/index.html
https://pytorch.org/docs/stable/jit.html
https://pytorch.org/docs/stable/library.html
https://pytorch.org/docs/stable/torch.compiler.html
https://pytorch.org/docs/stable/torch.compiler_aot_inductor.html
https://pytorch.org/docs/stable/torch.compiler_get_started.html
https://pytorch.org/docs/stable/torch.compiler_troubleshooting.html
Clearly it's missing out on a whole lot of pages. Here's how I'm calling it:
crawl_result = firecrawl_app.crawl_url(
    "https://pytorch.org/docs/stable",
    params={"crawlerOptions": {"maxDepth": 10_000, "limit": 10_000}},
)
Am I missing something? Are any of the other default parameters truncating the crawl somehow?
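For what it's worth, here's how I sanity-check how much a crawl actually returned, by counting distinct source URLs in the result. This is a sketch that assumes the v0 firecrawl-py client returns a list of page dicts, each with a metadata.sourceURL field:

```python
def count_unique_urls(pages):
    # Assumes each crawled page is a dict carrying its source URL under
    # metadata["sourceURL"] (the shape I believe the v0 client returns).
    return len({page["metadata"]["sourceURL"] for page in pages})

# e.g. count_unique_urls(crawl_result)
```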
3 Replies
rafaelmiller · 15mo ago
hey @Julia from Storia, your parameters for maxDepth and limit don't seem right; you should use 10000. I tested the URL:
POST https://api.firecrawl.dev/v0/crawl HTTP/1.1
Authorization: Bearer fc-*
content-type: application/json

{
"url": "https://pytorch.org/docs/stable"
}
and I was able to get 15 URLs for this page. Another option to consider for retrieving more URLs is setting crawlerOptions.allowBackwardCrawling = true. This allows the crawler to retrieve URLs beyond those containing the base URL.
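In the request body, that option would look something like this (a sketch, assuming allowBackwardCrawling sits alongside limit under crawlerOptions):

```json
{
  "url": "https://pytorch.org/docs/stable",
  "crawlerOptions": {
    "limit": 10000,
    "allowBackwardCrawling": true
  }
}
```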
Julia from Storia (OP) · 15mo ago
I also got 15 URLs, but I was expecting hundreds. Also, in Python, 10_000 means 10000.
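A quick demonstration that the underscore in a numeric literal is purely a visual separator and is ignored by the parser:

```python
# PEP 515: underscores in numeric literals are stripped at parse time.
assert 10_000 == 10000
assert 1_000_000 == 1000000
print(10_000)  # prints 10000
```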
Julia from Storia (OP) · 15mo ago
PEP 515 – Underscores in Numeric Literals | peps.python.org
