Unstable behavior of Crawl Jobs for V1
I'm facing an issue of incomplete crawling: the actual count of pages for a website is different from what the crawl job is able to scrape.
The /map endpoint gives the correct count of URLs in the sitemap, which is 104. [ website - freenome.com ]
But the crawl job only scrapes 55, 28, or 29 pages on different runs.
A peculiar thing I have noticed is that credit usage is also high irrespective of the number of pages actually scraped.
Here's the output:
Job Info {'status': 'completed', 'completed': 69, 'total': 69, 'creditsUsed': 69, 'expiresAt': '2024-09-04T09:08:31.000Z', 'next': 'http://api.firecrawl.dev/v1/crawl/72dda60f-5384-498d-81b1-0a830dfa0cc8?skip=28'}
Shape of the output saved in a pandas DataFrame for the "data" key: (28, 2)
parameters used -
crawl_params = {
    "excludePaths": [],
    "includePaths": [],
    "maxDepth": 2,
    "ignoreSitemap": False,
    "limit": 400,
    "allowBackwardLinks": False,
    "allowExternalLinks": False,
    "scrapeOptions": {
        "formats": ["markdown"],
        "headers": {},
        "includeTags": [],
        "excludeTags": [],
        "onlyMainContent": True,
        "waitFor": 300
    }
}
NOTE - I have also tried setting ignoreSitemap to True, but that didn't help either.
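For context, I'm starting the crawl roughly like this (a simplified sketch; assumes firecrawl-py v1.x with FirecrawlApp and an API key in the FIRECRAWL_API_KEY environment variable - the exact signature may differ across SDK versions):

import os
from firecrawl import FirecrawlApp  # firecrawl-py

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
# start the crawl with the parameters above and wait for it to finish
result = app.crawl_url("https://freenome.com", params=crawl_params)
pages = result.get("data", [])
print(len(pages))  # /map reports ~104 URLs, but far fewer pages come back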
CCing - @Adobe.Flash @mogery @rafaelmiller
23 Replies
are you using pagination correctly?
you should query the next URL to get the next batch of pages
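i.e. after the job finishes, keep requesting the next URL until it's absent. A rough sketch using requests (assumes the API key goes in a bearer Authorization header and the v1 crawl status response shape shown above, with "data" and "next" keys):

import requests

def fetch_all_crawl_pages(status_url, api_key):
    # follow the paginated crawl status responses until there is no "next" URL
    headers = {"Authorization": f"Bearer {api_key}"}
    pages, url = [], status_url
    while url:
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        body = resp.json()
        pages.extend(body.get("data", []))
        url = body.get("next")  # missing once every batch has been fetched
    return pages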
to be fair this is on us -- somehow we missed this on the SDK side -- raising this issue internally
Hi @mogery ,
I have used the default value of True for the wait_until_done parameter.
My assumption was that the behavior would be similar to what we had in V0, which returned the output once the crawl process completed successfully.
Shouldn't that be the case by default? Or do we now have to perform the pagination explicitly?
It should be the default, but we messed up. Rafa is working on the fix, I'll let you know when it's up!
The reason you saw varied page counts is that we added pagination of crawl results: each response is capped at 10 MB of page data, to avoid issues with large response payloads
ok, thanks for the update.
@Sachin we pushed an update to the sdks for the above fix (1.2.2)
sure, thanks for the update @Adobe.Flash
@Adobe.Flash @mogery The credit usage is now in line with what is getting scraped.
Though, I still feel that the crawl job is not stable yet.
The /map endpoint returns an accurate count of URLs, while the crawl job returns fewer pages in the output.
some examples - omnyhealth.com, oncotab.com
Also, can we remove the default printing of the /map endpoint response containing the list of URLs?

@Sachin it is removed on 1.2.3
are you calling the next param to fetch the rest of the results?
Nope, do we need to now?
I'm just using the wait_until_done parameter to get the complete result once the process finishes successfully.
Is there something which I might be missing here?
V1 sdks don't have wait until done anymore. We now have async crawl and crawl. Basically the crawl method does that for you and waits until it is done.
I would recommend updating it if you can and see if it solves your issue!
The async crawl just returns the job id, and you can do your own polling, in which case you would need to call the next param.
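Roughly, the two usages look like this (a quick sketch, assuming the v1 python sdk method names crawl_url / async_crawl_url / check_crawl_status; double-check against your installed version):

# app is your FirecrawlApp instance, crawl_params as shown earlier

# blocking: waits for the job and aggregates the paginated batches for you
result = app.crawl_url("https://freenome.com", params=crawl_params)

# non-blocking: returns the job id; you poll the status (and the next param) yourself
job = app.async_crawl_url("https://freenome.com", params=crawl_params)
status = app.check_crawl_status(job["id"])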
wow, how did I miss that. The wait_until_done parameter is not there anymore.😑
But sometimes the crawl job is able to fetch the entire data. Maybe that's related to what mogery mentioned regarding the default pagination of up to 10 MB of page data.
Correct, if it exceeds 10mb, it gets paginated!
Is there a simpler way of achieving similar functionality to what /crawl of v0 was able to do?
Just using the crawl method in the python sdk (version 1.2.3), it will do that for you, so you will get all the data you need without needing to think about it.
But I'm using the crawl method from the python SDK.
Still the results are not complete.
I have upgraded to the version 1.2.3 now.
example websites - omnyhealth.com [50 pages via the map method but crawl returns only 35], oncotab.com [71 pages returned by map but crawl returns only 62 pages]
/map is different from crawl. Crawl won't get subdomains like map does.
map allows for a bit more flexibility than crawl in that regard.
ccing @rafaelmiller here to take a deeper look if that's not the case
I have set the subdomain and ignoreSitemap flags to False while fetching the list of URLs using /map.
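For reference, this is roughly the call I'm using for the comparison (a sketch; the param names and response shape are my assumption for the v1 python sdk's map_url, and may differ by version):

# app is the same FirecrawlApp instance as before
map_result = app.map_url("https://omnyhealth.com", params={
    "includeSubdomains": False,  # don't expand to subdomains
    "ignoreSitemap": False,      # do consult the sitemap
})
# some versions return the list directly, others a dict with a "links" key
links = map_result.get("links", []) if isinstance(map_result, dict) else map_result
print(len(links))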
Any update on this?
@rafaelmiller @Adobe.Flash
Chanced across this, and I'm experiencing the same as Sachin on the python sdk: /crawl might say 10 successful results, but actually only contain 7. I assumed it was a pagination thing and have a todo to switch away from the python sdk so that I can page. But from the above it sounds like that's not the case (as it should be loading the full set)?
@James Peterson , have you upgraded the version of your FireCrawl python SDK to 1.2.3?
hey @Sachin , I've been testing those urls and got the same result as you. I noticed that both pages use YoastSEO for generating sitemaps. Our request to /sitemap.xml (the usual sitemap page) is not being redirected to /sitemap_index.xml (the YoastSEO sitemap page), so we're relying on crawling and filtering pages that have links starting with the base url (similar to using the ignoreSitemap param).
I'm currently adding a check for /sitemap_index.xml if /sitemap.xml is not found, which should improve results for YoastSEO-generated sitemaps.
As for /map, it's an alpha feature that uses a completely different methodology to fetch urls. We combine sitemap values, links found by the crawler, and a semantic indexed search (similar to how Google does it), which is likely why you're seeing fewer results with crawl compared to map.
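Roughly the idea behind the /sitemap_index.xml fallback, as a simplified sketch (not the actual server code):

import requests
import xml.etree.ElementTree as ET

def fetch_sitemap_locs(base_url):
    # try the usual /sitemap.xml first, then fall back to the YoastSEO-style index
    for path in ("/sitemap.xml", "/sitemap_index.xml"):
        resp = requests.get(base_url.rstrip("/") + path, timeout=10)
        if resp.status_code == 200 and resp.content:
            root = ET.fromstring(resp.content)
            # collect every <loc> entry (page URLs for a urlset, child sitemaps for an index)
            return [el.text for el in root.iter() if el.tag.endswith("loc")]
    return []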
Let me know if you have any other questions.
@rafaelmiller got it, thanks for the update.
Quick question, are we thinking about merging the alpha feature built for the MAP method with the CRAWL method in the near future?
@Sachin most likely. We're still testing how many results map and crawl can fetch with their own strategies, but we plan to add a mode parameter for crawl in the near future.
awesome, thanks.