Firecrawl · 15mo ago
Janice

Crawl is not respecting limit crawl option

I set a limit of 500 for my crawl but I find that it keeps crawling beyond 500 pages and I have to interrupt it on my end. My code is below. Am I doing something wrong?

```python
app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)
start_time = time.time()
crawl_url = "https://www.lsu.edu/majors"
params = {
    "crawlerOptions": {
        "limit": 500,
        "maxDepth": 2,
        "ignoreSitemap": False,
        "ignoreRobots": False,
    },
    "pageOptions": {
        "onlyMainContent": True,
        "parsePDF": True,
        "removeTags": [
            "script", "style", "nav", "header", "footer",
            ".advertisement", ".sidebar", ".nav", ".menu",
            "#comments", "img", "svg", "iframe", "video", "audio"
        ],
    },
}
urls = []
job_id = app.crawl_url(crawl_url, params=params, wait_until_done=False)
```
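For context while the limit behaviour is sorted out, one stopgap is a client-side guard that polls the job and stops once enough pages have come back, instead of interrupting by hand. This is only a sketch continuing from the snippet above; it assumes the v0 Python SDK's `check_crawl_status(job_id)` helper, and the exact shape of the status payload (the `status` and `data` keys) is an assumption, not confirmed API.

```python
import time

MAX_PAGES = 500            # mirror of crawlerOptions["limit"]
POLL_INTERVAL_SECONDS = 10

collected = []
while True:
    # Assumed v0 SDK helper that returns the crawl job's status payload.
    status = app.check_crawl_status(job_id)
    docs = status.get("data") or []        # pages returned so far (assumed key)
    collected = docs[:MAX_PAGES]           # never keep more than the limit
    if status.get("status") == "completed" or len(docs) >= MAX_PAGES:
        break
    time.sleep(POLL_INTERVAL_SECONDS)
```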
7 Replies
Sachin · 15mo ago
@Janice, could you please try it without the "maxDepth" parameter. I believe that might be causing the crawler to crawl beyond the set limit.
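In other words, a sketch of the same call with maxDepth dropped (based on Janice's snippet above, with pageOptions trimmed for brevity):

```python
params = {
    "crawlerOptions": {
        "limit": 500,
        "ignoreSitemap": False,
        "ignoreRobots": False,
        # "maxDepth" removed to see whether the limit is then respected
    },
    "pageOptions": {
        "onlyMainContent": True,
        "parsePDF": True,
    },
}
job_id = app.crawl_url(crawl_url, params=params, wait_until_done=False)
```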
Janice (OP) · 15mo ago
Thank you. I will try this.
Caleb · 15mo ago
Hi there @Janice, I just created an issue on GitHub for this problem. Whether or not maxDepth is set, the crawl should still respect the limit of 500 you set. https://github.com/mendableai/firecrawl/issues/435 Let me know if the fix @Sachin offered works!
Janice (OP) · 15mo ago
Without maxDepth it stops at 453 pages. What is the default max depth?
rafaelmiller · 15mo ago
Hey @Janice, we noticed the sitemap for this page contains about 1700 images, which isn't typical. This was overloading our crawlers, causing them to crash due to memory limits and get stuck in loops. We've improved our memory handling and have now excluded images from the crawl process. As a result, we're successfully crawling around 160 pages, specifically under the /majors path. To answer your question: we don't have a default max depth. The stop at 453 pages was due to a memory crash, not a depth limit. We've fixed the memory issue to keep it from happening again! Let me know if you need any help.
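As a complement on the client side, here is a sketch of keeping image URLs out of a crawl by pattern. It assumes crawlerOptions accepts an excludes list of path patterns; treat that key name and the glob syntax as assumptions rather than confirmed API.

```python
params = {
    "crawlerOptions": {
        "limit": 500,
        # Assumed option: skip URLs that look like image assets so the
        # crawler does not spend its page budget (or memory) on them.
        "excludes": ["*.png", "*.jpg", "*.jpeg", "*.gif", "*.svg", "*.webp"],
    },
    "pageOptions": {
        "onlyMainContent": True,
    },
}
job_id = app.crawl_url("https://www.lsu.edu/majors", params=params, wait_until_done=False)
```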
Janice (OP) · 15mo ago
Interesting... Thank you for the update. I will run another round of my eval crawls tomorrow morning.

I just tried this again and the limit was not respected, whether or not I included the depth param.
Caleb · 15mo ago
Hey Janice. That's odd. I'm adding that to the GitHub issue.