Firecrawl doesn't seem to crawl everything

I'm trying to run Firecrawl on PyTorch's documentation, and I only get ~15 results, with these URLs:
https://pytorch.org/docs/stable
https://pytorch.org/docs/stable/distributed.pipelining.html
https://pytorch.org/docs/stable/dynamo/index.html
https://pytorch.org/docs/stable/fsdp.html
https://pytorch.org/docs/stable/fx.html
https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
https://pytorch.org/docs/stable/generated/torch.set_num_threads.html
https://pytorch.org/docs/stable/hub.html
https://pytorch.org/docs/stable/index.html
https://pytorch.org/docs/stable/jit.html
https://pytorch.org/docs/stable/library.html
https://pytorch.org/docs/stable/torch.compiler.html
https://pytorch.org/docs/stable/torch.compiler_aot_inductor.html
https://pytorch.org/docs/stable/torch.compiler_get_started.html
https://pytorch.org/docs/stable/torch.compiler_troubleshooting.html
Clearly it's missing out on a whole lot of pages. Here's how I'm calling it:
crawl_result = firecrawl_app.crawl_url(
    "https://pytorch.org/docs/stable",
    params={"crawlerOptions": {"maxDepth": 10_000, "limit": 10_000}},
)
Am I missing something? Are any of the other default parameters truncating the crawl somehow?
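For what it's worth, here's how I sanity-check how much a crawl actually returned, by counting distinct source URLs in the result. This is a sketch that assumes the v0 firecrawl-py client returns a list of page dicts, each with a metadata.sourceURL field:

```python
def count_unique_urls(pages):
    # Assumes each crawled page is a dict carrying its source URL under
    # metadata["sourceURL"] (the shape I believe the v0 client returns).
    return len({page["metadata"]["sourceURL"] for page in pages})

# e.g. count_unique_urls(crawl_result)
```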
3 Replies
rafaelmiller · 15mo ago
hey @Julia from Storia, your parameters for maxDepth and limit don't seem right; you should use 10000. I tested the URL:
POST https://api.firecrawl.dev/v0/crawl HTTP/1.1
Authorization: Bearer fc-*
content-type: application/json

{
"url": "https://pytorch.org/docs/stable"
}
and I was able to get 15 URLs for this page. Another option to consider for retrieving more URLs is setting crawlerOptions.allowBackwardCrawling = true. This allows the crawler to retrieve URLs beyond those containing the base URL.
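In the request body, that option would look something like this (a sketch, assuming allowBackwardCrawling sits alongside limit under crawlerOptions):

```json
{
  "url": "https://pytorch.org/docs/stable",
  "crawlerOptions": {
    "limit": 10000,
    "allowBackwardCrawling": true
  }
}
```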
Julia from Storia (OP) · 15mo ago
I also got 15 URLs, but I was expecting hundreds. Also, in Python, 10_000 means 10000.
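A quick demonstration that the underscore in a numeric literal is purely a visual separator and is ignored by the parser:

```python
# PEP 515: underscores in numeric literals are stripped at parse time.
assert 10_000 == 10000
assert 1_000_000 == 1000000
print(10_000)  # prints 10000
```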
Julia from Storia (OP) · 15mo ago
PEP 515 – Underscores in Numeric Literals | peps.python.org
