Firecrawl•16mo ago

Discrepancy in the count of URLs returned by the FireCrawl API v/s actual sitemap of the Website

I'm trying to get the list of URLs in a website's sitemap using the FireCrawl API in CRAWL mode with the following parameters:

params = { 'crawlerOptions': { "returnOnlyUrls": True } }
firecrawl_urls = app.crawl_url(url, params=params)

But the count of URLs returned does not match the actual count of URLs in the website's sitemap. For example, for eonhealth.com the sitemap has around 106 pages, but FireCrawl returns only 1 page in the results list. Can anyone clarify what the possible reason could be? @rafaelmiller, @Adobe.Flash Any thoughts on this?
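One way to quantify a discrepancy like this is to count the sitemap entries independently and diff them against the crawl result. The sketch below is a minimal illustration using only the Python standard library; the helper names (`sitemap_locs`, `missing_from_crawl`) and the sample URLs are hypothetical, and it assumes a well-formed sitemap that uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_text: str) -> list[str]:
    """Return the <loc> values of a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]


def _normalize(url: str) -> str:
    # Treat "https://example.com/" and "https://example.com" as the same URL.
    return url.rstrip("/")


def missing_from_crawl(sitemap_urls, crawled_urls):
    """URLs listed in the sitemap but absent from the crawl result."""
    crawled = {_normalize(u) for u in crawled_urls}
    return [u for u in sitemap_urls if _normalize(u) not in crawled]


# Hypothetical sitemap content, for illustration only.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/contact</loc></url>
</urlset>"""

locs = sitemap_locs(sample)
# Pretend the crawl only returned the parent URL.
missing = missing_from_crawl(locs, ["https://example.com"])
print(len(locs), "URLs in sitemap;", len(missing), "missing from crawl")
```

In practice, the sitemap text would be fetched from the site and the crawled list would come from the `crawl_url` result; both are left out here so the sketch stays self-contained.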

31 Replies

rafaelmiller•16mo ago

Hey @Sachin, I just tested this page and it works when running locally, but not through the API. This is strange behavior, and we're investigating why it's happening.

Adobe.Flash•16mo ago

Yeah, it seems the website blocked us, since we don't use any stealth proxies when fetching the sitemap. Working on a solution right now.

SachinOP•16mo ago

@rafaelmiller, @Adobe.Flash Thanks a lot, guys, for the quick response. Much appreciated.
@rafaelmiller, @Adobe.Flash Hi folks, any update on this one?

Adobe.Flash•16mo ago

Hey @Sachin we should be pushing a fix today / tomorrow.

SachinOP•16mo ago

@Adobe.Flash Great to hear that. Thanks. Sharing another example which might be helpful for your testing. lunit.io

Adobe.Flash•15mo ago

@Sachin pr is in, we are merging it today https://github.com/mendableai/firecrawl/pull/386

GitHub

[Feat] Added fire-engine fallback for getting sitemaps by rafaelsid...

eonhealth.com usecase

SachinOP•15mo ago

Hi @Adobe.Flash , Thanks for the update. Really appreciate it. Will test it once it is merged successfully.

Adobe.Flash•15mo ago

Sounds good. We are testing it, and hopefully it will be merged by the morning!

SachinOP•15mo ago

Cool. Additionally, could you please share your thoughts on the query below as well?

Hi, would it make sense to add a new feature of returning the total count of URLs present in the sitemap of a website along with returning the link to the sitemap in a single request?
I guess the current workaround is to use the returnOnlyUrls parameter, but wouldn't that cause the consumption of equivalent credits as well? Like if the website has 500 pages, so this request would consume 500 credits as well as per my understanding.

Let me know if something like this already exists.

Adobe.Flash•15mo ago

That makes sense, I'm adding that to the GitHub board! Also, we just merged the PR. It should be working once it goes through deployment. Let me know 🙂

SachinOP•15mo ago

Great, thanks for the update. Will keep you posted.
Hi @Adobe.Flash, it seems the issue still persists, assuming the changes have been deployed by now. I'm still getting only the parent URL in the output.

{"status":"completed","current":1,"current_url":"","current_step":"SCRAPING","total":100,"data":[{"url":"https://eonhealth.com"}],"partial_data":[]}
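A response like the one above can be checked programmatically. The following is a small sketch that extracts the URLs from the status payload and flags the "only the parent URL came back" case; the helper names (`crawl_urls`, `returned_only_parent`) are hypothetical, and the field names are taken from the response shown in this thread rather than from any API reference.

```python
import json


def crawl_urls(status_json: str) -> list[str]:
    """Extract the URLs from the "data" array of a crawl status payload."""
    payload = json.loads(status_json)
    return [item["url"] for item in payload.get("data", []) if "url" in item]


def returned_only_parent(status_json: str) -> bool:
    """Heuristic: the crawl completed but produced at most one URL."""
    payload = json.loads(status_json)
    return payload.get("status") == "completed" and len(payload.get("data", [])) <= 1


# The status payload posted in this thread.
response = (
    '{"status":"completed","current":1,"current_url":"",'
    '"current_step":"SCRAPING","total":100,'
    '"data":[{"url":"https://eonhealth.com"}],"partial_data":[]}'
)

print(crawl_urls(response))
print(returned_only_parent(response))
```

A single-URL result is only a hint that the sitemap fetch or the crawl was blocked; a site that genuinely has one page would trigger the same check.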

Adobe.Flash•15mo ago

Looking into it, @Sachin. Yeah, they are heavily blocking us. Just escalated to our proxy engineer. He is trying to figure it out.
@Sachin It is finally fixed!!! Just tested and it worked great! Let me know. That was quite some work to get around their anti-bot blocking 🙂

SachinOP•15mo ago

Hi @Adobe.Flash, I appreciate all the help you guys are providing on this issue. However, I have tested the Crawl API for eonhealth.com and lunit.io, and the count of URLs still does not match the sitemap: 30 pages for eonhealth and only 8 pages for lunit.

Kyo•15mo ago

I am still experiencing a similar issue. The sitemap https://www.friesland.nl/sitemap-nl.xml has 500+ URLs, but crawling in the playground returns only 140 results. @Adobe.Flash is this fix already deployed to the playground? Or is my URL also being blocked somewhere?

Adobe.Flash•15mo ago

Hey all, @Kyo @Sachin that's odd. @rafaelmiller can you take a look into that?

Kyo•15mo ago

Would be great! @Adobe.Flash @rafaelmiller Kind of stuck here rn, currently testing if firecrawl suits the needs, if it does we are looking to integrate it into our own SaaS so we probably would end up on standard or growth.

rafaelmiller•15mo ago

Hey @Kyo, I just checked, and I think your 'exclude paths' might be matching paths you don't mean to exclude. I tried without the exclude paths and got 689 URLs.

Kyo•15mo ago

Yeah because the site has multiple (4) locales and I only want to scrape the URLs of the NL path/locale. Have you got 689 URLs with a /nl/* path? @rafaelmiller

rafaelmiller•15mo ago

The params I used:
URL: https://www.friesland.nl
Include Only Paths: nl/*

message.txt
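To see how include and exclude patterns interact before running a crawl, the filtering can be approximated locally. This is only a sketch: it uses shell-style glob matching from Python's `fnmatch` module, which is an assumption, since Firecrawl's own pattern syntax may differ. The helper name `filter_by_path` and the sample friesland.nl paths are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def filter_by_path(urls, include=(), exclude=()):
    """Keep URLs whose path matches an include glob (if any) and no exclude glob."""
    kept = []
    for url in urls:
        path = urlparse(url).path.lstrip("/")
        if include and not any(fnmatchcase(path, pat) for pat in include):
            continue
        if any(fnmatchcase(path, pat) for pat in exclude):
            continue
        kept.append(url)
    return kept


# Hypothetical URLs, for illustration only.
urls = [
    "https://www.friesland.nl/nl/plannen",
    "https://www.friesland.nl/en/plan",
    "https://www.friesland.nl/de/planen",
    "https://www.friesland.nl/nl/ontdek/steden",
]

# Restrict to the NL locale with an include pattern.
nl_only = filter_by_path(urls, include=("nl/*",))

# An over-broad exclude pattern silently drops NL pages too.
too_broad = filter_by_path(urls, exclude=("*/plan*",))

print(nl_only)
print(too_broad)
```

The second call shows the failure mode discussed above: an exclude pattern meant for other locales can also match pages in the locale you want, so an include pattern is often the safer choice.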

Kyo•15mo ago

@rafaelmiller thanks! did the job :pepeBlushHat:

SachinOP•15mo ago

@rafaelmiller any updates on my issue? The fix doesn't seem to be working for eonhealth.com or lunit.io. @Caleb adding for visibility.

Caleb•15mo ago

Hey @Sachin, do those URLs have the /nl path? That specification will only work for websites with that site path structure.

SachinOP•15mo ago

sorry, but not sure what you mean by that.

Caleb•15mo ago

Excuse me. Do the URLs (eonhealth.com etc.) have the /nl path in their site structure? Is there an eonhealth.com/nl path, for example? Or lunit.io/nl? Oh, I see. It's a .nl

SachinOP•15mo ago

Thanks for the clarification. Though I don't see such paths in the eonhealth.com sitemap

Caleb•15mo ago

What if you pass the ignoreSitemap parameter and use eonhealth.com.nl as the URL?

SachinOP•15mo ago

For lunit.io, the paths look like this: lunit.io/en/
Let me try that.
I tried again with the parameters below, but I'm still getting 30 pages for eonhealth and 8 pages for lunit:

params = { 'crawlerOptions': { "returnOnlyUrls": True, "ignoreSitemap": True, } }

And with eonhealth.com.nl, I got just 1 page in the output.

Caleb•15mo ago

Let's back up for a second. What links are you trying to gather on this site? Everything, or just particular paths? The solution @rafaelmiller suggested before only worked because the website had a .nl/nl section in its sitemap, which was all Kyo needed. I think this is a different issue, probably with the crawler on this particular path. CCing @Adobe.Flash @rafaelmiller to look into this. Sorry.

SachinOP•15mo ago

Sure, I'm trying to scrape data for all the pages of this website.
Hi guys @Caleb @Adobe.Flash @rafaelmiller, any conclusive findings on the above matter?

rafaelmiller•15mo ago

Hey @Sachin! I'm checking the URLs you sent:

eonhealth.com.nl: this domain has an invalid certificate and redirects to a domain service (https://domein.com.nl/).

eonhealth.com: this site has an IP-based scraper blocker, and it seems to have detected our proxy system. We've been able to scrape/crawl this page locally but not through our servers. We are currently enhancing our system with new strategies to avoid such blocks.

lunit.io: this site has a malformed sitemap, which caused a bug in our crawl. I'm working on a fix right now, and we should be able to crawl this page later today.
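For readers parsing sitemaps themselves, a malformed document can be handled defensively. The sketch below is not Firecrawl's fix, and the thread does not say how the lunit.io sitemap was malformed; it assumes one common failure (XML that is not well-formed, such as an unescaped `&`) and falls back to a regex when strict parsing fails. The helper name `extract_locs` and the sample content are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Fallback pattern for documents that are not well-formed XML.
LOC_RE = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.IGNORECASE | re.DOTALL)


def extract_locs(xml_text: str) -> list[str]:
    """Return <loc> values, tolerating sitemaps that fail strict XML parsing."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Strict parsing failed: salvage whatever <loc> entries are readable.
        return [m for m in LOC_RE.findall(xml_text) if m]
    # Match on the local tag name so a missing or unusual namespace still works.
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text and el.text.strip()
    ]


# Hypothetical malformed sitemap: the "&" in the first URL is not escaped.
broken = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a?x=1&y=2</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

print(extract_locs(broken))
```

The regex fallback trades correctness for coverage: it ignores XML structure entirely, so it is best used only after the strict parser has rejected the document.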

SachinOP•15mo ago

Thanks @rafaelmiller for the updates.
