Retry failed pages during batch scrape [FIRECRAWL SDK]

Is there a way to retry failed pages in batch scrape? I'm using the webhook method with batch scrape in the Firecrawl SDK, and only the successfully scraped pages are returned via `batch_scrape.page`. Am I missing something here? I can see in the dashboard that we are actually logging the failed pages, but I don't think they are being returned to the webhook.
Gaurav Chadha · 6d ago
Hi @pratosh, yes, failed pages in batch scrape are sent to your webhook, but as separate events with `success: false`. You're likely only processing the successful `batch_scrape.page` events and missing the failure events. There's no automatic retry mechanism, but you can:
1. Collect failed URLs from webhooks: listen for webhook events where `success: false` and collect the `sourceURL` from the metadata.
2. Use the errors endpoint: after a batch completes, fetch all errors:
```js
// JS SDK
const errors = await firecrawl.checkBatchScrapeErrors(batchId);
// errors.data contains failed URLs with error messages
```
Then retry with a new batch containing just the failed URLs (see the handler sketch below).
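Putting the first approach together, here's a minimal Express handler sketch. It is not verified against the actual webhook payload: the `type`, `success`, and `data[0].metadata.sourceURL` field paths are assumptions based on the discussion above, and `retryBatch` is a hypothetical helper you would back with the SDK's batch scrape call.
```js
const express = require("express");
const app = express();
app.use(express.json());

// URLs whose page events arrived with success: false
const failedUrls = new Set();

// Hypothetical retry helper: wire this up to the SDK's batch scrape method.
async function retryBatch(urls) {
  console.log("would re-submit a batch for:", urls);
}

app.post("/firecrawl-webhook", async (req, res) => {
  const event = req.body;

  if (event.type === "batch_scrape.page" && event.success === false) {
    // Field path is an assumption; adjust to the real payload shape.
    const url = event.data?.[0]?.metadata?.sourceURL;
    if (url) failedUrls.add(url);
  }

  if (event.type === "batch_scrape.completed" && failedUrls.size > 0) {
    await retryBatch([...failedUrls]);
    failedUrls.clear();
  }

  // Ack quickly so delivery isn't treated as failed and retried.
  res.sendStatus(200);
});

app.listen(3000);
```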
pratosh (OP) · 6d ago
From what I observed, the failed webhooks were never received by our server at all. All the batch_scrape.page events that I received had success: true. In cases where a few pages failed to be scraped, the batch_scrape.completed event was also not received. After increasing the waitFor limit to 3000, the pages now seem to be scraped and the completed webhook is also received. From the docs I can see "You'll receive one batch_scrape.page event for every URL successfully scraped.", which confirms what I described, right?
Gaurav Chadha · 6d ago
Yeah, that's correct, but a possible reason is the low waitFor limit: pages are timing out before they can be properly processed as either success or failure. This causes:
- no webhook events for timed-out pages
- no batch_scrape.completed event (because the batch doesn't cleanly finish)
Could you increase waitFor to a higher value? (A request sketch follows below.) If there are failures, they will be captured: https://github.com/firecrawl/firecrawl/blob/9f4f011a7834a2067cb40cc884379bbc719a968f/apps/api/openapi.json#L168
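If it helps, a raw v2 batch scrape request with a higher waitFor might look roughly like this. The exact body shape (top-level `waitFor`, the `webhook` object) is my assumption, so double-check it against the OpenAPI spec linked above.
```js
// Sketch: kick off a batch scrape with waitFor raised and a webhook attached.
const res = await fetch("https://api.firecrawl.dev/v2/batch/scrape", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    urls: ["https://example.com/a", "https://example.com/b"],
    waitFor: 3000, // ms to wait for each page to load before scraping
    webhook: { url: "https://your-server.example.com/firecrawl-webhook" },
  }),
});
const { id } = await res.json(); // keep the job id for the /errors call later
```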
pratosh (OP) · 5d ago
Yeah, I have increased the value of waitFor and it seems to be working as expected. Is 3000 a good enough value for waitFor, or is it too high? Also, is there a reason why failed pages are not emitted to the webhook? It would be really helpful to re-scrape them on the fly instead of waiting for the whole batch scrape to finish and then re-scraping the failed URLs. Also, I don't think https://docs.firecrawl.dev/api-reference/endpoint/batch-scrape-get-errors is working for v2. Can we check this?
Gaurav Chadha · 5d ago
Yeah, 3000 is good enough if it's working for your case. Regarding https://docs.firecrawl.dev/api-reference/endpoint/batch-scrape-get-errors, what response do you get when executing this API call?
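For reference, the raw call would look roughly like the sketch below (the path matches the error output that follows; the response shape isn't documented here, so verify it against the API reference):
```js
// Sketch: fetch the per-URL errors for a finished batch scrape job.
const batchId = "019ada83-1638-7235-b350-195725467a09"; // job id from this thread
const res = await fetch(
  `https://api.firecrawl.dev/v2/batch/scrape/${batchId}/errors`,
  { headers: { Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}` } }
);
console.log(res.status, await res.json());
```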
pratosh (OP) · 5d ago
```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot GET /v2/batch/scrape/019ada83-1638-7235-b350-195725467a09/errors</pre>
</body>
</html>
```
This is what I get.
Gaurav Chadha · 5d ago
Are you using the /v2 endpoint for this? If so, I've added a fix here: https://github.com/firecrawl/firecrawl/pull/2471
pratosh (OP) · 5d ago
Yes, I'm using the v2 endpoint, thanks for the help. Also, is it possible to get failed pages (scrapes) notified via webhooks as well?
Gaurav Chadha · 4d ago
I think you'll need to handle that in your webhook, but I'll have to check. Feel free to open a GitHub issue at https://github.com/firecrawl/firecrawl.
