Crawl is not respecting the limit crawl option
I set a limit of 500 for my crawl but I find that it keeps crawling beyond 500 pages and I have to interrupt it on my end. My code is below. Am I doing something wrong?
from firecrawl import FirecrawlApp  # firecrawl-py SDK
import time

app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)
start_time = time.time()
crawl_url = "https://www.lsu.edu/majors"
params = {
    "crawlerOptions": {
        "limit": 500,
        "maxDepth": 2,
        "ignoreSitemap": False,
        "ignoreRobots": False,
    },
    "pageOptions": {
        "onlyMainContent": True,
        "parsePDF": True,
        "removeTags": ["script", "style", "nav", "header", "footer",
                       ".advertisement", ".sidebar", ".nav", ".menu",
                       "#comments", "img", "svg", "iframe", "video",
                       "audio"],
    },
}
urls = []
# start the crawl asynchronously and keep the job ID
job_id = app.crawl_url(crawl_url, params=params, wait_until_done=False)
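To watch the run, a loop like the one below can poll the job and count pages as they come back. This is a sketch only: it assumes the SDK's check_crawl_status method and a response with 'status' and 'data' fields, which may differ across firecrawl-py versions.

# Poll the async crawl job and print how many pages have been returned so far,
# to verify whether the 500-page limit is honored.
# Assumption: check_crawl_status(job_id) returns a dict with 'status' and 'data'.
while True:
    status = app.check_crawl_status(job_id)
    pages = status.get("data") or []
    print(f"status={status.get('status')} pages_so_far={len(pages)}")
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(10)

print(f"Finished in {time.time() - start_time:.0f}s with {len(status.get('data') or [])} pages")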
@Janice, could you please try without the "maxDepth" parameter? I believe that might be causing the crawler to crawl beyond the set limit.
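For reference, the modified call could look roughly like this; params_no_depth is just an illustrative name, and pageOptions stays exactly as in the original snippet.

# Same crawl with maxDepth removed from crawlerOptions.
params_no_depth = {
    "crawlerOptions": {
        "limit": 500,
        "ignoreSitemap": False,
        "ignoreRobots": False,
    },
    "pageOptions": params["pageOptions"],  # reuse the original pageOptions
}
job_id = app.crawl_url(crawl_url, params=params_no_depth, wait_until_done=False)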
Thank you. I will try this.
Hi there @Janice, I just created an issue on GitHub for this problem. Whether or not maxDepth is set, the crawler should be respecting the limit of 500 you set.
https://github.com/mendableai/firecrawl/issues/435
Let me know if the fix @Sachin offered works!
Without maxDepth it stops at 453 pages. What is the default max depth?
Hey @Janice, we noticed the sitemap for this page contains about 1,700 images, which isn't typical. This was overloading our crawlers, causing them to crash due to memory limits and get stuck in loops. We've improved our memory handling and have now excluded images from the crawl process. As a result, we're successfully crawling around 160 pages, specifically under the /majors path.
Actually, we don't have a default max depth. The stop at 453 pages was due to a memory crash, not a depth limit. We've fixed the memory issue to keep it from happening again! Let me know if you need any help.
Interesting... Thank you for the update. I will run another round of my eval crawls tomorrow morning.
I just tried this again and the limit was not respected whether I included the depth param or not.
Hey Janice. That's odd. I'm adding that to the GitHub issue.