Firecrawl · 14mo ago
lawtj

'limit' not respected?

Hi there - I'm having trouble getting the `limit` option respected; this was true in v0 and is still true in v1. I've just built the Docker image from the main repo and am running it locally. Different permutations of `limit` and `maxDepth` return different results, but not in a way I understand. Hopefully I just don't understand the documentation! Here's a minimal example:

```python
app = FirecrawlApp(api_key='your_api_key', api_url='http://localhost:3002/')

crawl_url = 'https://stats.oarc.ucla.edu/stata/'
params = {
    'limit': 5,
    'maxDepth': 2,
}
crawl_result = app.crawl_url(crawl_url, params=params)
```

However, `len(crawl_result['data'])` returns 8. I tried this with varying limits (5, 10, 100) and maxDepths (1, 2, 5) and got this result:

```
{'limit': 5, 'maxDepth': 1, 'num_urls': 1}
{'limit': 10, 'maxDepth': 1, 'num_urls': 1}
{'limit': 100, 'maxDepth': 1, 'num_urls': 1}
{'limit': 5, 'maxDepth': 2, 'num_urls': 8}
{'limit': 10, 'maxDepth': 2, 'num_urls': 8}
{'limit': 100, 'maxDepth': 2, 'num_urls': 8}
{'limit': 5, 'maxDepth': 5, 'num_urls': 189}
{'limit': 10, 'maxDepth': 5, 'num_urls': 179}
{'limit': 100, 'maxDepth': 5, 'num_urls': 182}
```

Any idea what could be going on? Perhaps something is off in my configuration?
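For reference, the table above came from a loop roughly like this (the exact script isn't in the thread, so this is a reconstruction assuming the same `crawl_url(url, params=...)` call as in the example):

```python
from itertools import product

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key', api_url='http://localhost:3002/')
crawl_url = 'https://stats.oarc.ucla.edu/stata/'

# Try every (maxDepth, limit) pair and count how many pages come back.
for max_depth, limit in product([1, 2, 5], [5, 10, 100]):
    result = app.crawl_url(crawl_url, params={'limit': limit, 'maxDepth': max_depth})
    print({'limit': limit, 'maxDepth': max_depth, 'num_urls': len(result['data'])})
```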
15 Replies
lawtj (OP) · 13mo ago
Hi - just wondering if someone could tell me if I'm going crazy or not! I've tried every permutation of settings I can think of, and no matter what I do, it seems like `limit` isn't respected at all. Could someone verify this bug on self-hosted? Much appreciated!
BrianJM · 13mo ago
I have observed the same
Adobe.Flash · 13mo ago
That's odd. What version of the SDK are you using? I tested this in the playground and it seems to respect the limit. I wonder if there's an issue with earlier versions of the Python SDK.
BrianJM · 13mo ago
I'm using the Node SDK. I'll do more extensive testing over the next couple of weeks. I have multiple configurations to test.
Adobe.Flash · 13mo ago
Sweet, def let me know 🙂
lawtj (OP) · 13mo ago
Yep - I think this is a self-hosted issue.
Adobe.Flash · 13mo ago
Interesting, let me open a GitHub issue so we can investigate it!
lawtj (OP) · 13mo ago
Tried with the latest Python SDK, and also by sending a curl request directly. FWIW, `waitFor` doesn't seem to be respected either atm!
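The direct request looked roughly like this, written here with Python's `requests` instead of curl (a sketch; the `/v1/crawl` endpoint path and the exact body shape are my assumptions about the v1 API):

```python
import requests

# Assumed: default self-hosted port 3002 and the v1 crawl endpoint.
resp = requests.post(
    'http://localhost:3002/v1/crawl',
    json={
        'url': 'https://stats.oarc.ucla.edu/stata/',
        'limit': 5,
        'maxDepth': 2,
        'scrapeOptions': {'waitFor': 1000},
    },
)
# v1 crawls are async: the response should contain a job id to poll,
# not the scraped pages themselves.
print(resp.json())
```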
Adobe.Flash · 13mo ago
Thanks! Will add that to the issue too!
BrianJM · 13mo ago
@Adobe.Flash I noticed yesterday that setting a limit (via the API) results in the BullMQ job having `limit` set to null, and `maxCrawlLimit` (if I recall correctly) set to the limit I originally specified. The limit in self-hosted is entirely ignored; I've seen the same over the past several weeks. Crawl depth and include/exclude patterns are respected, however.
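One way to check this yourself is to read the job payload straight out of Redis, since BullMQ stores each job as a hash at `bull:<queue>:<job id>` with the payload JSON in its `data` field (a sketch; the queue name, job id, and Redis address below are placeholders, not the actual names Firecrawl uses):

```python
import json

import redis  # assumed: redis-py, pointing at the instance backing BullMQ

r = redis.Redis(host='localhost', port=6379)

# Hypothetical queue name and job id; substitute your crawl queue and job.
raw = r.hget('bull:scrapeQueue:your-crawl-job-id', 'data')

# If the bug described above is present, the stored options would show
# limit as null even though a limit was set in the API request.
print(json.dumps(json.loads(raw), indent=2))
```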
Adobe.Flash · 13mo ago
Thanks for the insight @BrianJM! The team is working on fixing it! 🙂
Abdulaziz · 12mo ago
Try this website: https://ghoroos.sa. The playground gets it perfectly; self-host goes BAZINGA. With `"maxDepth": 1, "limit": 10` it gets 78 pages. From what I understand, if maxDepth = 1 then it should return only 1 page, right?
Abdulaziz · 12mo ago
Full params for reference:

```python
params = {
    "maxDepth": 1,
    "limit": 10,
    "ignoreSitemap": True,
    "allowBackwardLinks": False,
    "allowExternalLinks": False,
    "scrapeOptions": {
        "formats": ["markdown"],
        "onlyMainContent": False,
        "waitFor": 1000
    }
}
```

I found exactly where the issue is, and it makes sense why it works on the playground but not on self-host. I'm going to submit a PR tomorrow morning; it's too late at night here.
