'limit' not respected?
Hi there -
I'm having trouble getting the limit option respected; this was the case in v0 and is still the case in v1.
I have just built the Docker image from the main repo and am running it locally. Different permutations of limit and maxDepth return different results, but not in a way I understand.
Hopefully I just don't understand the documentation! Here's a minimal example:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key', api_url='http://localhost:3002/')
crawl_url = 'https://stats.oarc.ucla.edu/stata/'
params = {
    'limit': 5,
    'maxDepth': 2,
}
crawl_result = app.crawl_url(crawl_url, params=params)
However, len(crawl_result['data']) returns 8.
I tried this with varying limit (5, 10, 100) and maxDepth (1, 2, 5) values and got these results:
{'limit': 5, 'maxDepth': 1, 'num_urls': 1}
{'limit': 10, 'maxDepth': 1, 'num_urls': 1}
{'limit': 100, 'maxDepth': 1, 'num_urls': 1}
{'limit': 5, 'maxDepth': 2, 'num_urls': 8}
{'limit': 10, 'maxDepth': 2, 'num_urls': 8}
{'limit': 100, 'maxDepth': 2, 'num_urls': 8}
{'limit': 5, 'maxDepth': 5, 'num_urls': 189}
{'limit': 10, 'maxDepth': 5, 'num_urls': 179}
{'limit': 100, 'maxDepth': 5, 'num_urls': 182}
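Roughly, the sweep looked like the following (a minimal sketch reusing the app and crawl_url from the example above; not the exact script, but it produces output in the same shape):

for max_depth in (1, 2, 5):
    for limit in (5, 10, 100):
        result = app.crawl_url(crawl_url, params={'limit': limit, 'maxDepth': max_depth})
        # count how many pages actually came back vs. the requested limit
        print({'limit': limit, 'maxDepth': max_depth, 'num_urls': len(result['data'])})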
Any idea what could be going on? Perhaps something is off in my configuration?
Hi - just wondering if someone could tell me if I'm going crazy or not! I've tried every permutation I can think of in settings. No matter what I do, it seems like limit isn't respected at all.
Is it possible to verify this bug on self-hosted? Appreciated!
I have observed the same.
That's odd. What version of the SDK are you using?
I tested this in the playground and it seems to respect it there.
I wonder if there is an issue with earlier versions of the Python SDK.
I'm using the Node SDK. I'll do more extensive testing over the next couple of weeks. I have multiple configurations to test.
Sweet, def let me know 🙂
Yep - I think this is a self-hosted issue.
Interesting let me open up a GitHub issue so we can investigate it!
Tried with the latest python sdk, but also directly sending a curl request.
fwiw, 'waitFor' doesn't seem to be respected either atm!
Thanks! Will add that to the issue too!
@Adobe.Flash I noticed yesterday that setting a limit (via the API) results in the BullMQ job having limit set to null and maxCrawlLimit (if I recall) set to the limit that was initially requested.
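If anyone else wants to check this on their own instance, one way is to peek at the raw Bull job payloads in Redis. This is only a rough sketch: the key layout (bull:* hashes with a JSON data field), the default Redis port, and the crawlerOptions field name are assumptions about the self-hosted setup, so confirm against your own keys first.

import json
import redis

r = redis.Redis(host='localhost', port=6379)

for key in r.scan_iter('bull:*'):
    # Bull job payloads live in hashes; skip other key types.
    if r.type(key) != b'hash':
        continue
    raw = r.hget(key, 'data')
    if not raw:
        continue
    job = json.loads(raw)
    opts = job.get('crawlerOptions') or {}
    # print any limit-related options the API actually put on the job
    limit_fields = {k: v for k, v in opts.items() if 'limit' in k.lower()}
    if limit_fields:
        print(key.decode(), limit_fields)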
Limit in self-hosted is entirely ignored; over the past several weeks I have noticed the same. Crawl depth and include/exclude are respected, however.
Thanks for the insight @BrianJM! The team is working on fixing it! 🙂
Try this website: https://ghoroos.sa
Playground gets it perfectly; self-host goes BAZINGA.
With "maxDepth": 1, "limit": 10 it gets 78 pages. From what I understand, if maxDepth = 1 then it should return only 1 page, right?
Full params for reference:
params = {
    "maxDepth": 1,
    "limit": 10,
    "ignoreSitemap": True,
    "allowBackwardLinks": False,
    "allowExternalLinks": False,
    "scrapeOptions": {
        "formats": ["markdown"],
        "onlyMainContent": False,
        "waitFor": 1000
    }
}
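For anyone reproducing this without the SDK, the equivalent raw request would look roughly like this (a sketch assuming the default local port and the v1 crawl endpoint; add an Authorization header if your instance requires one):

import requests

payload = {
    "url": "https://ghoroos.sa",
    "maxDepth": 1,
    "limit": 10,
    "ignoreSitemap": True,
    "allowBackwardLinks": False,
    "allowExternalLinks": False,
    "scrapeOptions": {
        "formats": ["markdown"],
        "onlyMainContent": False,
        "waitFor": 1000,
    },
}

# Start the crawl; the response should contain an id that can be polled for status.
resp = requests.post("http://localhost:3002/v1/crawl", json=payload)
print(resp.status_code, resp.json())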
I found exactly where the issue is, and it makes sense why it works on the playground but not on self-host.
I am going to submit a PR tomorrow morning; it is too late at night here.
Sorry, couldn't sleep 😦
https://github.com/mendableai/firecrawl/pull/755