Discrepancy in the count of URLs returned by the FireCrawl API vs. the actual sitemap of the website
I'm trying to get the list of URLs present in the sitemap of a website using the FireCrawl API in CRAWL mode, with the following parameters:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")
url = "https://eonhealth.com"  # example site, see below

params = {
    'crawlerOptions': {
        "returnOnlyUrls": True  # only collect URLs, don't scrape page content
    }
}
firecrawl_urls = app.crawl_url(url, params=params)
But the count of URLs returned does not match the actual number of URLs in the website's sitemap.
For example, for eonhealth.com the actual sitemap has around 106 pages, but FireCrawl is only returning 1 page in the results list.
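For reference, I'm getting the "actual" sitemap count by parsing the sitemap directly, roughly like this (a quick sketch; it assumes the sitemap lives at /sitemap.xml and is a flat urlset rather than a sitemap index):

import requests
import xml.etree.ElementTree as ET

# Sitemaps use this namespace for their <url>/<loc> elements.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def count_sitemap_urls(site: str) -> int:
    """Fetch /sitemap.xml for a site and count its <loc> entries."""
    resp = requests.get(f"https://{site}/sitemap.xml", timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    return len(root.findall("sm:url/sm:loc", SITEMAP_NS))

print(count_sitemap_urls("eonhealth.com"))  # roughly 106 for me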
Can anyone please clarify what could be the possible reason behind this?
@rafaelmiller, @Adobe.Flash Any thoughts on this?
Hey @Sachin, I just tested this page and it's working when running locally, but not through the API. This is strange behavior and we're investigating why it's happening.
Yeah, it seems like the website blocked us, since we don't use any stealth proxies when fetching the sitemap.
Working on a solution right now
@rafaelmiller, @Adobe.Flash Thanks a lot guys for the quick response.
Much appreciated.
@rafaelmiller, @Adobe.Flash Hi folks, any update on this one?
Hey @Sachin we should be pushing a fix today / tomorrow.
@Adobe.Flash Great to hear that. Thanks.
Sharing another example which might be helpful for your testing: lunit.io
@Sachin the PR is in, we are merging it today: https://github.com/mendableai/firecrawl/pull/386
Hi @Adobe.Flash, thanks for the update.
Really appreciate it.
Will test it once it is merged successfully.
Sounds good, we are testing it and hopefully it's merged by the morning!
Cool. Additionally, could you please share your thoughts on the query below as well?
Let me know if something like this already exists.
That makes sense, I'm adding that to the GitHub board!
Also, we just merged the PR. It should be working after it goes through deployment.
Let me know!
Great, thanks for the update.
Will keep you posted.
Hi @Adobe.Flash, it seems like the issue still persists, assuming the changes have been deployed by now.
I'm still getting only the parent URL in the output.
Looking into it, @Sachin.
Yeah, they are heavily blocking us. Just escalated to our proxy engineer; he is trying to figure it out.
@Sachin It is finally fixed!!!
Just tested and it worked great! Let me know. That was quite some work to get around their blocking!
Hi @Adobe.Flash, I appreciate all the help you guys are providing on this issue.
However, I have tested the Crawl API for eonhealth.com and lunit.io, and it seems like the count of URLs still doesn't match the sitemap.
For example: eonhealth = 30 pages and lunit = 8 pages only.
I am still experiencing a similar issue
This sitemap, https://www.friesland.nl/sitemap-nl.xml, has 500+ URLs, but when crawling it in the playground I get only 140 results.
@Adobe.Flash is this fix already deployed to the playground? Or is my URL also being blocked somewhere?

Hey all, @Kyo @Sachin
That's odd. @rafaelmiller can you take a look into that?
Would be great! @Adobe.Flash @rafaelmiller Kind of stuck here right now; I'm currently testing whether FireCrawl suits our needs. If it does, we are looking to integrate it into our own SaaS, so we would probably end up on the Standard or Growth plan.
Hey @Kyo, I just checked and I think your 'exclude paths' setting might be filtering out paths you actually want. I tried without the exclude paths and got 689 URLs.
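If you only need the NL locale, an include pattern might be simpler than maintaining excludes. Roughly something like this, reusing the same app client from the first message (just a sketch; the includes field and glob format are what I recall from crawlerOptions, so double-check the exact names):

# Sketch: restrict the crawl to the Dutch locale instead of excluding
# every other locale. Field names assume the v0-style crawlerOptions.
params = {
    "crawlerOptions": {
        "returnOnlyUrls": True,
        "includes": ["nl/*"],  # only keep URLs under the /nl/ path
    }
}
firecrawl_urls = app.crawl_url("https://www.friesland.nl", params=params)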
Yeah because the site has multiple (4) locales and I only want to scrape the URLs of the NL path/locale.
Have you got 689 URLs with a /nl/* path? @rafaelmiller
Friesland.nl
@rafaelmiller thanks! did the job :pepeBlushHat:
@rafaelmiller any updates on my issue?
The fix doesn't seem to be working for eonhealth.com or lunit.io.
@Caleb adding for visibility.
Hey @Sachin - do those URLs have the /nl path? That specification will only work for websites with that site path structure.
Sorry, but I'm not sure what you mean by that.
Excuse me. Do the URLs (eonhealth etc.) have the /nl path in their site structure? Is there an eonhealth.com/nl path, for example?
Or lunit.io/nl?
Oh. I see. It's a .nl.
Thanks for the clarification.
Though I don't see such paths in the eonhealth.com sitemap

What if you pass the ignoreSitemap parameter and use eonhealth.com.nl as the URL?
For lunit.io, the paths are like this: lunit.io/en/
Let me try doing that.
I tried again with the parameters below, but I'm still getting 30 pages for eonhealth and 8 pages for lunit.
params = {
    'crawlerOptions': {
        "returnOnlyUrls": True,
        "ignoreSitemap": True,  # don't use the website's sitemap for URL discovery
    }
}
And with eonhealth.com.nl, I just got 1 page in the output.
Let's back up for a second. What links are you trying to gather on this site? Everything, or just particular paths? The solution @rafaelmiller suggested before only worked because the website had a .nl/nl section in its sitemap, which was all Kyo needed.
I think this is a different issue, probably with the crawler on this particular path. CCing @Adobe.Flash @rafaelmiller to look into this. Sorry.
Sure, I'm trying to scrape data for all the pages of this website.
Hi guys @Caleb @Adobe.Flash @rafaelmiller,
Any conclusive findings on the above matter?
Hey @Sachin! I've checked the URLs you sent:
eonhealth.com.nl: has an invalid certificate and redirects to a domain service (https://domein.com.nl/).
eonhealth.com: this site has an IP-based scraper blocker, and it seems to have detected our proxy system. We've been able to scrape/crawl this page locally but not through our servers. We are currently enhancing our system with new strategies to avoid such blocks.
lunit.io: this site has a malformed sitemap, which triggered a bug in our crawler. I'm working on a fix right now, and we should be able to crawl this page later today.
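In the meantime, if you want to sanity-check a sitemap yourself, a rough snippet like this (using requests and the standard-library XML parser) will at least tell you whether it downloads and parses as well-formed XML; the sitemap path here is an assumption:

import requests
import xml.etree.ElementTree as ET

def check_sitemap(url: str) -> None:
    """Report whether a sitemap URL downloads and parses as well-formed XML."""
    resp = requests.get(url, timeout=30)
    print(f"{url}: HTTP {resp.status_code}, {len(resp.content)} bytes")
    try:
        root = ET.fromstring(resp.content)
        print(f"  parsed OK, root element: {root.tag}")
    except ET.ParseError as exc:
        print(f"  malformed XML: {exc}")

check_sitemap("https://lunit.io/sitemap.xml")  # adjust to the site's real sitemap path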
Thanks @rafaelmiller for the updates.