Unstable behavior of Crawl Jobs for V1
I'm facing an issue of incomplete crawling: the actual count of pages for a website is different from what the crawl job is able to scrape.
The /map endpoint gives the correct count of URLs in the sitemap, which is 104. [ website - freenome.com ]
But the crawl job only scrapes 55, 28, or 29 pages on different runs.
A peculiar thing I have noticed is that credit usage is also high irrespective of the number of pages actually scraped.
Here's the output:
Job Info {'status': 'completed', 'completed': 69, 'total': 69, 'creditsUsed': 69, 'expiresAt': '2024-09-04T09:08:31.000Z', 'next': 'http://api.firecrawl.dev/v1/crawl/72dda60f-5384-498d-81b1-0a830dfa0cc8?skip=28'}
Shape of the output saved in a pandas DataFrame for the "data" key: (28, 2)
parameters used -
crawl_params = {
    "excludePaths": [],
    "includePaths": [],
    "maxDepth": 2,
    "ignoreSitemap": False,
    "limit": 400,
    "allowBackwardLinks": False,
    "allowExternalLinks": False,
    "scrapeOptions": {
        "formats": ["markdown"],
        "headers": {},
        "includeTags": [],
        "excludeTags": [],
        "onlyMainContent": True,
        "waitFor": 300
    }
}
NOTE - I have also tried setting ignoreSitemap to True, but that didn't help either.
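For context, I'm starting the crawl roughly like this (a simplified sketch; assumes firecrawl-py v1.x with FirecrawlApp and an API key in the FIRECRAWL_API_KEY environment variable - the exact signature may differ across SDK versions):

import os
from firecrawl import FirecrawlApp  # firecrawl-py

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
# start the crawl with the parameters above and wait for it to finish
result = app.crawl_url("https://freenome.com", params=crawl_params)
pages = result.get("data", [])
print(len(pages))  # /map reports ~104 URLs, but far fewer pages come back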
CCing - @Adobe.Flash @mogery @rafaelmiller
23 Replies
are you using pagination correctly?
you should query the next URL to get the next batch of pages
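i.e. after the job finishes, keep requesting the next URL until it's absent. A rough sketch using requests (assumes the API key goes in a bearer Authorization header and the v1 crawl status response shape shown above, with "data" and "next" keys):

import requests

def fetch_all_crawl_pages(status_url, api_key):
    # follow the paginated crawl status responses until there is no "next" URL
    headers = {"Authorization": f"Bearer {api_key}"}
    pages, url = [], status_url
    while url:
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        body = resp.json()
        pages.extend(body.get("data", []))
        url = body.get("next")  # missing once every batch has been fetched
    return pages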
to be fair this is on us -- somehow we missed this on the SDK side -- raising this issue internally
Hi @mogery ,
I have used the default value of True for the wait_until_done parameter.
My assumption was that the behavior would be similar to what we had in V0, which returned the output once the crawl process completed successfully.
Shouldn't that be the case by default? Or do we now have to perform the pagination explicitly?
It should be the default, but we messed up. Rafa is working on the fix, I'll let you know when it's up!
The reason you saw varied page counts is that we added pagination of crawl results: each response is capped at 10 MB of page data, to avoid issues with large response payloads
ok, thanks for the update.
@Sachin we pushed an update to the sdks for the above fix (1.2.2)
sure, thanks for the update @Adobe.Flash
@Adobe.Flash @mogery The credit usage is now in line with what is getting scraped.
Though, I still feel that the crawl job is not stable yet.
The /map endpoint returns an accurate count of URLs, while the crawl job returns fewer pages in the output.
some examples - omnyhealth.com, oncotab.com
Also, can we remove the default printing of the /map endpoint response containing the list of URLs?

@Sachin it is removed on 1.2.3
are you calling the next param to fetch the rest of the results?
Nope, do we need to now?
I'm just using the wait_until_done parameter to get the complete result once the process finishes successfully.
Is there something which I might be missing here?
V1 sdks don't have wait until done anymore. We now have async crawl and crawl. Basically the crawl method does that for you and waits until it is done.
I would recommend updating it if you can and see if it solves your issue!
The async crawl just returns the job id, and you can do your own polling, in which case you would need to call the next param.
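Roughly, the two usages look like this (a quick sketch, assuming the v1 python sdk method names crawl_url / async_crawl_url / check_crawl_status; double-check against your installed version):

# app is your FirecrawlApp instance, crawl_params as shown earlier

# blocking: waits for the job and aggregates the paginated batches for you
result = app.crawl_url("https://freenome.com", params=crawl_params)

# non-blocking: returns the job id; you poll the status (and the next param) yourself
job = app.async_crawl_url("https://freenome.com", params=crawl_params)
status = app.check_crawl_status(job["id"])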
wow, how did I miss that. The wait_until_done parameter is not there anymore.😑
But sometimes the crawl job is able to fetch the entire data. Maybe that's related to what mogery mentioned regarding the default pagination of up to 10 MB of page data.
Correct, if it exceeds 10mb, it gets paginated!
Is there a simpler way of achieving similar functionality to what /crawl of v0 was able to do?
Just using the crawl method in the python sdk (version 1.2.3), it will do that for you, so you will get all the data you need without needing to think about it.
But I'm using the crawl method from the python SDK.
Still the results are not complete.
I have upgraded to the version 1.2.3 now.
example websites - omnyhealth.com [50 pages via the map method but crawl returns only 35], oncotab.com [71 pages returned by map but crawl returns only 62 pages]
/map is different from crawl. Crawl won't get subdomains like map does.
map allows for a bit more flexibility than crawl in that regard.
ccing @rafaelmiller here to take a deeper look if that's not the case
I have set the subdomain and ignoreSitemap flags to False while fetching the list of URLs using /map.
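For reference, this is roughly the call I'm using for the comparison (a sketch; the param names and response shape are my assumption for the v1 python sdk's map_url, and may differ by version):

# app is the same FirecrawlApp instance as before
map_result = app.map_url("https://omnyhealth.com", params={
    "includeSubdomains": False,  # don't expand to subdomains
    "ignoreSitemap": False,      # do consult the sitemap
})
# some versions return the list directly, others a dict with a "links" key
links = map_result.get("links", []) if isinstance(map_result, dict) else map_result
print(len(links))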
Any update on this?
@rafaelmiller @Adobe.Flash
Chanced across this, and I'm experiencing the same as Sachin on the python sdk: /crawl might say 10 successful results, but actually only contain 7. I assumed it was a pagination thing and have a todo to switch away from the python sdk so that I can page. But from the above it sounds like that's not the case (as it should be loading the full set)?
@James Peterson , have you upgraded the version of your FireCrawl python SDK to 1.2.3?
hey @Sachin , I've been testing those urls and got the same result as you. I noticed that both pages use YoastSEO for generating sitemaps. Our request to /sitemap.xml (the usual sitemap page) is not being redirected to /sitemap_index.xml (the YoastSEO sitemap page), so we're relying on crawling and filtering pages that have links starting with the base url (similar to using the ignoreSitemap param).
I'm currently adding a check for /sitemap_index.xml if /sitemap.xml is not found, which should improve results for YoastSEO-generated sitemaps.
As for /map, it's an alpha feature that uses a completely different methodology to fetch urls. We combine sitemap values, links found by the crawler, and a semantic indexed search (similar to how Google does it), which is likely why you're seeing fewer results with crawl compared to map.
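Roughly the idea behind the /sitemap_index.xml fallback, as a simplified sketch (not the actual server code):

import requests
import xml.etree.ElementTree as ET

def fetch_sitemap_locs(base_url):
    # try the usual /sitemap.xml first, then fall back to the YoastSEO-style index
    for path in ("/sitemap.xml", "/sitemap_index.xml"):
        resp = requests.get(base_url.rstrip("/") + path, timeout=10)
        if resp.status_code == 200 and resp.content:
            root = ET.fromstring(resp.content)
            # collect every <loc> entry (page URLs for a urlset, child sitemaps for an index)
            return [el.text for el in root.iter() if el.tag.endswith("loc")]
    return []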
Let me know if you have any other questions.
@rafaelmiller got it, thanks for the update.
Quick question, are we thinking about merging the alpha feature built for the MAP method with the CRAWL method in the near future?
@Sachin most likely. We're still testing how many results map and crawl can fetch with their own strategies, but we plan to add a mode parameter for crawl in the near future.
awesome, thanks.