'limit' not respected?
Hi there -
I'm having trouble getting the limit option respected; this was the case in v0 and is still the case in v1.
I have just built the Docker image from the main repo and am running it locally. Different permutations of limit and maxDepth return different results, but not in a way I understand.
Hopefully I just don't understand the documentation! Here's a minimal example:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key', api_url='http://localhost:3002/')
crawl_url = 'https://stats.oarc.ucla.edu/stata/'
params = {
    'limit': 5,
    'maxDepth': 2,
}
crawl_result = app.crawl_url(crawl_url, params=params)
However, len(crawl_result['data']) returns 8.
I tried this with varying limit (5, 10, 100) and maxDepth (1, 2, 5) values and got these results:
{'limit': 5, 'maxDepth': 1, 'num_urls': 1}
{'limit': 10, 'maxDepth': 1, 'num_urls': 1}
{'limit': 100, 'maxDepth': 1, 'num_urls': 1}
{'limit': 5, 'maxDepth': 2, 'num_urls': 8}
{'limit': 10, 'maxDepth': 2, 'num_urls': 8}
{'limit': 100, 'maxDepth': 2, 'num_urls': 8}
{'limit': 5, 'maxDepth': 5, 'num_urls': 189}
{'limit': 10, 'maxDepth': 5, 'num_urls': 179}
{'limit': 100, 'maxDepth': 5, 'num_urls': 182}
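Roughly, the sweep looked like the following (a minimal sketch reusing the app and crawl_url from the example above; not the exact script, but it produces output in the same shape):

for max_depth in (1, 2, 5):
    for limit in (5, 10, 100):
        result = app.crawl_url(crawl_url, params={'limit': limit, 'maxDepth': max_depth})
        # count how many pages actually came back vs. the requested limit
        print({'limit': limit, 'maxDepth': max_depth, 'num_urls': len(result['data'])})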
Any idea what could be going on? Perhaps something is off in my configuration?
Hi - just wondering if someone could tell me if I'm going crazy or not! I've tried every permutation I can think of in settings. No matter what I do, it seems like limit isn't respected at all.
Is it possible to verify this bug on self-hosted? Appreciated!
I have observed the same.
That's odd. What version of the SDK are you using?
I tested this in the playground and it seems to respect it there.
I wonder if there is an issue with earlier versions of the Python SDK.
I'm using the Node SDK. I'll do more extensive testing over the next couple of weeks. I have multiple configurations to test.
Sweet, def let me know 🙂
Yep - I think this is a self-hosted issue.
Interesting let me open up a GitHub issue so we can investigate it!
Tried with the latest python sdk, but also directly sending a curl request.
fwiw, 'waitFor' doesn't seem to be respected either atm!
Thanks! Will add that to the issue too!
@Adobe.Flash I noticed yesterday that setting a limit (via the API) results in the BullMQ job having limit set to null and maxCrawlLimit (if I recall) set to the limit that was initially requested.
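If anyone else wants to check this on their own instance, one way is to peek at the raw Bull job payloads in Redis. This is only a rough sketch: the key layout (bull:* hashes with a JSON data field), the default Redis port, and the crawlerOptions field name are assumptions about the self-hosted setup, so confirm against your own keys first.

import json
import redis

r = redis.Redis(host='localhost', port=6379)

for key in r.scan_iter('bull:*'):
    # Bull job payloads live in hashes; skip other key types.
    if r.type(key) != b'hash':
        continue
    raw = r.hget(key, 'data')
    if not raw:
        continue
    job = json.loads(raw)
    opts = job.get('crawlerOptions') or {}
    # print any limit-related options the API actually put on the job
    limit_fields = {k: v for k, v in opts.items() if 'limit' in k.lower()}
    if limit_fields:
        print(key.decode(), limit_fields)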
Limit in self-hosted is entirely ignored; over the past several weeks I have noticed the same. Crawl depth and include/exclude are respected, however.
Thanks for the insight @BrianJM! The team is working on fixing it! 🙂
Try this website: https://ghoroos.sa
Playground gets it perfectly; self-host goes BAZINGA.
With "maxDepth": 1, "limit": 10 it gets 78 pages. From what I understand, if maxDepth = 1 then it should return only 1 page, right?
Full params for reference:
params = {
    "maxDepth": 1,
    "limit": 10,
    "ignoreSitemap": True,
    "allowBackwardLinks": False,
    "allowExternalLinks": False,
    "scrapeOptions": {
        "formats": ["markdown"],
        "onlyMainContent": False,
        "waitFor": 1000
    }
}
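For anyone reproducing this without the SDK, the equivalent raw request would look roughly like this (a sketch assuming the default local port and the v1 crawl endpoint; add an Authorization header if your instance requires one):

import requests

payload = {
    "url": "https://ghoroos.sa",
    "maxDepth": 1,
    "limit": 10,
    "ignoreSitemap": True,
    "allowBackwardLinks": False,
    "allowExternalLinks": False,
    "scrapeOptions": {
        "formats": ["markdown"],
        "onlyMainContent": False,
        "waitFor": 1000,
    },
}

# Start the crawl; the response should contain an id that can be polled for status.
resp = requests.post("http://localhost:3002/v1/crawl", json=payload)
print(resp.status_code, resp.json())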
I found exactly where the issue is, and it makes sense why it works on the playground but not on self-host.
I am going to submit a PR tomorrow morning; it is too late at night here.
Sorry, couldn't sleep 😦
https://github.com/mendableai/firecrawl/pull/755