Crawl is not respecting the limit crawl option
I set a limit of 500 for my crawl but I find that it keeps crawling beyond 500 pages and I have to interrupt it on my end. My code is below. Am I doing something wrong?
from firecrawl import FirecrawlApp  # firecrawl-py SDK
import time

app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)
start_time = time.time()
crawl_url = "https://www.lsu.edu/majors"
params = {
    "crawlerOptions": {
        "limit": 500,
        "maxDepth": 2,
        "ignoreSitemap": False,
        "ignoreRobots": False,
    },
    "pageOptions": {
        "onlyMainContent": True,
        "parsePDF": True,
        "removeTags": ["script", "style", "nav", "header", "footer",
                       ".advertisement", ".sidebar", ".nav", ".menu",
                       "#comments", "img", "svg", "iframe", "video",
                       "audio"],
    },
}
urls = []
# start the crawl asynchronously and keep the job ID
job_id = app.crawl_url(crawl_url, params=params, wait_until_done=False)
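To watch the run, a loop like the one below can poll the job and count pages as they come back. This is a sketch only: it assumes the SDK's check_crawl_status method and a response with 'status' and 'data' fields, which may differ across firecrawl-py versions.

# Poll the async crawl job and print how many pages have been returned so far,
# to verify whether the 500-page limit is honored.
# Assumption: check_crawl_status(job_id) returns a dict with 'status' and 'data'.
while True:
    status = app.check_crawl_status(job_id)
    pages = status.get("data") or []
    print(f"status={status.get('status')} pages_so_far={len(pages)}")
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(10)

print(f"Finished in {time.time() - start_time:.0f}s with {len(status.get('data') or [])} pages")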
@Janice, could you please try without the "maxDepth" parameter? I believe that might be causing the crawler to crawl beyond the set limit.
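For reference, the modified call could look roughly like this; params_no_depth is just an illustrative name, and pageOptions stays exactly as in the original snippet.

# Same crawl with maxDepth removed from crawlerOptions.
params_no_depth = {
    "crawlerOptions": {
        "limit": 500,
        "ignoreSitemap": False,
        "ignoreRobots": False,
    },
    "pageOptions": params["pageOptions"],  # reuse the original pageOptions
}
job_id = app.crawl_url(crawl_url, params=params_no_depth, wait_until_done=False)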
Thank you. I will try this.
Hi there @Janice, I just created an issue on GitHub for this problem. Whether or not maxDepth is set, the crawler should be respecting the limit of 500 you set.
https://github.com/mendableai/firecrawl/issues/435
Let me know if the fix @Sachin offered works!
Without maxDepth it stops at 453 pages. What is the default max depth?
Hey @Janice, we noticed the sitemap for this page contains about 1,700 images, which isn't typical. This was overloading our crawlers, causing them to crash due to memory limits and get stuck in loops. We've improved our memory handling and have now excluded images from the crawl process. As a result, we're successfully crawling around 160 pages, specifically under the /majors path.
Actually, we don't have a default max depth. The stop at 453 pages was due to a memory crash, not a depth limit. We've fixed the memory issue to keep it from happening again! Let me know if you need any help.
Interesting... Thank you for the update. I will run another round of my eval crawls tomorrow morning.
I just tried this again and the limit was not respected whether I included the depth param or not.
Hey Janice. That's odd. I'm adding that to the GitHub issue.