Firecrawl•16mo ago
Sachin

Discrepancy in the count of URLs returned by the Firecrawl API vs. the actual sitemap of the website

I'm trying to get the list of URLs present in a website's sitemap using the Firecrawl API in CRAWL mode, with the following parameters:
params = { 'crawlerOptions': { "returnOnlyUrls": True } }
firecrawl_urls = app.crawl_url(url, params=params)
But the count of URLs does not match the actual count of URLs present in the website's sitemap. E.g., for eonhealth.com, the actual sitemap has around 106 pages, but Firecrawl returns only 1 page in the results list. Can anyone please clarify what the possible reason could be? @rafaelmiller, @Adobe.Flash Any thoughts on this?
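To make the comparison described above concrete, here is a minimal sketch of counting the `<loc>` entries in a sitemap and setting that number next to the length of a crawl result. It uses only the Python standard library. The sample sitemap, the `example.com` URLs, and the helper name `sitemap_locs` are assumptions for illustration; they are not part of the Firecrawl SDK.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_text):
    """Return every <loc> value from a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


# Hypothetical sample standing in for a downloaded sitemap.xml
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/contact</loc></url>
</urlset>"""

sitemap_urls = sitemap_locs(SAMPLE_SITEMAP)

# Stand-in for a crawl result; the list-of-dicts shape with a "url" key
# mirrors the output posted later in this thread.
crawl_result = [{"url": "https://example.com/"}]

print("sitemap:", len(sitemap_urls), "crawl:", len(crawl_result))
```

For a sitemap index, the `<loc>` values point at child sitemaps rather than pages, so each child would need to be fetched and counted in turn.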
31 Replies
rafaelmiller•16mo ago
Hey @Sachin, I just tested this page. It works when run locally, but not through the API. This is strange behavior, and we're investigating why it is happening.
Adobe.Flash•16mo ago
Yeah, it seems the website blocked us: we don't use any stealth proxies when fetching the sitemap. Working on a solution right now.
SachinOP•16mo ago
@rafaelmiller , @Adobe.Flash Thanks a lot guys for the quick response. Much appreciated. @rafaelmiller , @Adobe.Flash Hi folks, any update on this one?
Adobe.Flash•16mo ago
Hey @Sachin we should be pushing a fix today / tomorrow.
SachinOP•16mo ago
@Adobe.Flash Great to hear that. Thanks. Sharing another example which might be helpful for your testing. lunit.io
SachinOP•15mo ago
Hi @Adobe.Flash , Thanks for the update. Really appreciate it. Will test it once it is merged successfully.
Adobe.Flash•15mo ago
Sounds good. We are testing it, and hopefully it will be merged by the morning!
SachinOP•15mo ago
Cool. Additionally, could you please share your thoughts on the query below as well?
Hi, would it make sense to add a new feature of returning the total count of URLs present in the sitemap of a website along with returning the link to the sitemap in a single request?
I guess the current workaround is to use the returnOnlyUrls parameter, but wouldn't that cause the consumption of equivalent credits as well? Like if the website has 500 pages, so this request would consume 500 credits as well as per my understanding.
Let me know if something like this already exists.
Adobe.Flash•15mo ago
That makes sense, I'm adding that to the GitHub board! Also, we just merged the PR. It should be working once it goes through deployment. Let me know 🙂
SachinOP•15mo ago
great, thanks for the update. will keep you posted on the same. Hi @Adobe.Flash , seems like the issue still persists, assuming changes have been deployed by this time. I'm still getting the parent URL only in the output.
{"status":"completed","current":1,"current_url":"","current_step":"SCRAPING","total":100,"data":[{"url":"https://eonhealth.com"}],"partial_data":[]}
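A response like the one above can be checked programmatically before trusting the URL count. The sketch below flags a "completed" crawl whose `data` contains nothing beyond the start URL, which in this thread turned out to be a sign of blocking. The field names come from the pasted response; the helper name `only_start_url` and the second sample payload are assumptions for illustration.

```python
import json


def only_start_url(status_json, start_url):
    """True if a completed crawl returned nothing beyond the start URL."""
    payload = json.loads(status_json)
    urls = {item.get("url", "").rstrip("/") for item in payload.get("data", [])}
    # An empty result also counts as "nothing beyond the start URL"
    return payload.get("status") == "completed" and urls <= {start_url.rstrip("/")}


# Response copied from the message above
RESPONSE = (
    '{"status":"completed","current":1,"current_url":"",'
    '"current_step":"SCRAPING","total":100,'
    '"data":[{"url":"https://eonhealth.com"}],"partial_data":[]}'
)

# Hypothetical healthy response with a second page, for contrast
HEALTHY = (
    '{"status":"completed","data":['
    '{"url":"https://eonhealth.com"},{"url":"https://eonhealth.com/about"}]}'
)

print(only_start_url(RESPONSE, "https://eonhealth.com/"))
print(only_start_url(HEALTHY, "https://eonhealth.com/"))
```

A check like this cannot tell blocking apart from a genuinely single-page site, so it is only a hint to look at the sitemap by hand.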
Adobe.Flash•15mo ago
Looking into it, @Sachin. Yeah, they are heavily blocking us. I just escalated it to our proxy engineer, who is trying to figure it out.
@Sachin It is finally fixed!!! Just tested it and it worked great! Let me know. That was quite some work to get around their anti-block 🙂
SachinOP•15mo ago
Hi @Adobe.Flash , appreciate all the help you guys are providing on the issue. Though, I have tested the Crawl API for eonhealth.com and lunit.io but seems like the count of URLs is still not matching with the sitemap. e.g: for eonhealth = 30 pages and lunit = 8 pages only.
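When the two counts disagree, as with the 30 and 8 pages reported above, listing which sitemap URLs are absent from the crawl is usually more informative than the counts alone. Here is a minimal sketch, assuming both lists are already in memory. The `example.com` URLs are made up, and the normalization rules (ignore scheme, `www.` prefix, trailing slash, and query string) are one reasonable choice rather than anything Firecrawl itself applies.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to host + path so trivially different forms compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Scheme, query string, and fragment are deliberately ignored
    return host + parts.path.rstrip("/")


def missing_from_crawl(sitemap_urls, crawled_urls):
    """Sitemap URLs that have no normalized match among the crawled URLs."""
    crawled = {normalize(u) for u in crawled_urls}
    return sorted(u for u in sitemap_urls if normalize(u) not in crawled)


# Hypothetical inputs
sitemap_urls = [
    "https://example.com/",
    "https://example.com/pricing/",
    "https://www.example.com/blog",
]
crawled_urls = ["https://example.com", "http://example.com/blog/"]

missing = missing_from_crawl(sitemap_urls, crawled_urls)
print(missing)
```

If the missing URLs cluster under one path or locale, that points at a path filter; if they look random, blocking or a sitemap parsing problem is more likely.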
Kyo•15mo ago
I am still experiencing a similar issue https://www.friesland.nl/sitemap-nl.xml this sitemap has 500+ urls, however when crawling in playground it returns only 140 results. @Adobe.Flash is this fix already deployed to the playground? Or is my URL also blocking somewhere
Adobe.Flash•15mo ago
Hey all, @Kyo @Sachin that's odd. @rafaelmiller can you take a look into that?
Kyo•15mo ago
Would be great! @Adobe.Flash @rafaelmiller Kind of stuck here rn, currently testing if firecrawl suits the needs, if it does we are looking to integrate it into our own SaaS so we probably would end up on standard or growth.
rafaelmiller•15mo ago
Hey @Kyo, I just checked, and I think your 'exclude paths' might be matching paths you didn't mean to exclude. I tried without the exclude paths and got 689 URLs.
Kyo•15mo ago
Yeah because the site has multiple (4) locales and I only want to scrape the URLs of the NL path/locale. Have you got 689 URLs with a /nl/* path? @rafaelmiller
rafaelmiller•15mo ago
The params I used:
URL: https://www.friesland.nl
Include Only Paths: nl/*
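To preview what an include pattern such as `nl/*` would keep, the same filter can be approximated locally on a list of URLs. This is a sketch using shell-style globbing from the standard library. It is an assumption that this resembles how Firecrawl matches the pattern, and the friesland.nl paths below are invented for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def matches_include(url, patterns):
    """True if the URL path (without the leading slash) matches any glob pattern."""
    path = urlsplit(url).path.lstrip("/")
    return any(fnmatchcase(path, pattern) for pattern in patterns)


# Hypothetical URLs covering two locales plus the bare locale root
urls = [
    "https://www.friesland.nl/nl/ontdek",
    "https://www.friesland.nl/en/discover",
    "https://www.friesland.nl/nl",
]

kept = [u for u in urls if matches_include(u, ["nl/*"])]
print(kept)
```

Note that with plain globbing the bare `/nl` root does not match `nl/*`, so a second pattern would be needed if the locale landing page matters.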
Kyo•15mo ago
@rafaelmiller thanks! did the job :pepeBlushHat:
SachinOP•15mo ago
@rafaelmiller any updates on my issue? The fix doesn't seem to be working for eonhealth.com or lunit.io. @Caleb adding for visibility.
Caleb•15mo ago
Hey @Sachin. - do those gels have the /no domain? That specification will only work for websites with that site path structure
SachinOP•15mo ago
sorry, but not sure what you mean by that.
Caleb•15mo ago
Excuse me. Do the URLs (eonhealth etc.) have the /nl path in their site structure? Is there an eonhealth.com/nl path, for example? Or lunit.io/nl? Oh, I see. It's a .nl
SachinOP•15mo ago
Thanks for the clarification. Though I don't see such paths in the eonhealth.com sitemap
Caleb•15mo ago
What if you pass the ignoreSitemap parameter and use eonHealth.com.nl as the URL?
SachinOP•15mo ago
For lunit.io, the paths are like this: lunit.io/en/
Let me try doing that.
I tried again with the parameters below, but I'm still getting 30 pages for eonhealth and 8 pages for lunit:
params = { 'crawlerOptions': { "returnOnlyUrls": True, "ignoreSitemap": True, } }
And with eonhealth.com.nl, I got just 1 page in the output.
Caleb•15mo ago
Let's back up for a second. What links are you trying to gather on this site? Everything, or just particular paths? The solution @rafaelmiller suggested before only worked because the website had a .nl/nl section in its sitemap, which was all Kyo needed. I think this is a different issue, probably with the crawler on this particular path. CCing @Adobe.Flash @rafaelmiller to look into this. Sorry.
SachinOP•15mo ago
sure, I'm trying to scrape data for all the pages for this website. Hi guys @Caleb @Adobe.Flash @rafaelmiller, Any conclusive findings on the above matter?
rafaelmiller•15mo ago
Hey @Sachin! I'm checking the URLs you sent:
eonhealth.com.nl: this has an invalid certificate issue and redirects to a domain service (https://domein.com.nl/).
eonhealth.com: this site has an IP-based scraper blocker, and it seems to have detected our proxy system. We've been able to scrape/crawl this page locally but not through our servers. We are currently enhancing our system with new strategies to avoid such blocks.
lunit.io: this site has a malformed sitemap, which caused a bug in our crawl. I'm working on a fix right now, and we should be able to crawl this page later today.
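The thread does not say in what way the lunit.io sitemap is malformed, so the following is only a sketch of one defensive approach: try a strict XML parse first, and fall back to a regular expression over `<loc>` tags when the document is not well formed. The broken sample (an unescaped `&` and a missing `</url>`) and the helper name `extract_locs` are assumptions for illustration; this is not Firecrawl's actual fix.

```python
import re
import xml.etree.ElementTree as ET

# Fallback pattern for documents that are not well-formed XML
LOC_RE = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.IGNORECASE | re.DOTALL)


def extract_locs(xml_text):
    """Return <loc> values, tolerating sitemaps that fail strict XML parsing."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Entities such as &amp; are left undecoded on this path
        return LOC_RE.findall(xml_text)
    # Compare local tag names so a missing or unexpected namespace still works
    return [
        el.text.strip()
        for el in root.iter()
        if isinstance(el.tag, str)
        and el.tag.rsplit("}", 1)[-1] == "loc"
        and el.text
        and el.text.strip()
    ]


# Hypothetical malformed sitemap: unescaped "&" and an unclosed <url>
BROKEN = """<urlset>
  <url><loc>https://example.com/a?x=1&y=2</loc></url>
  <url><loc>https://example.com/b</loc>
</urlset>"""

# Hypothetical well-formed sitemap with the standard namespace
VALID = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/c</loc></url>
</urlset>"""

print(extract_locs(BROKEN))
print(extract_locs(VALID))
```

The regex path trades correctness for resilience: it will also pick up `<loc>` text inside comments or CDATA, so it is best treated as a last resort.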
SachinOP•15mo ago
Thanks @rafaelmiller for the updates.
