Mark session as bad when request times out or proxy responds with 502

I'm using CheerioCrawler and I'd like to mark sessions as bad when the request either times out or there's a proxy error. Those cases trigger an error before reaching requestHandler and the request is added back to the queue without me having the opportunity to mark the session. Is there a hook somewhere that I can use? Or should I override _requestFunctionErrorHandler?
fair-rose · 3y ago
I would like to know this as well
optimistic-gold · 3y ago
You can mark a session as bad by calling session.markBad() inside errorHandler, which runs each time a request fails (unlike failedRequestHandler, which runs only once a request has exhausted its max retries):
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: router,
    errorHandler: ({ session }) => {
        session.markBad();
    },
});
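On the follow-up question about reading the error: errorHandler receives the error as its second argument, after the crawling context. Below is a minimal sketch that marks the session bad only for timeouts and proxy 502s. The names isTimeoutOrProxyError and markBadOnNetworkError and the message patterns are assumptions for illustration, so check them against the errors your crawler actually logs.

```javascript
// Hypothetical helper: guesses from the error message whether the failure
// looks like a timeout or a proxy 502. These patterns are assumptions,
// not messages guaranteed by Crawlee.
function isTimeoutOrProxyError(error) {
    const message = String(error?.message ?? '');
    return /timed out|timeout/i.test(message) || /\b502\b/.test(message);
}

// Has the shape of an errorHandler: (crawlingContext, error).
function markBadOnNetworkError({ session }, error) {
    // session may be undefined if the session pool is not in use.
    if (session && isTimeoutOrProxyError(error)) {
        session.markBad();
    }
}

// Usage (not executed here):
// new CheerioCrawler({ proxyConfiguration, requestHandler: router, errorHandler: markBadOnNetworkError });
```

Other kinds of errors leave the session untouched, so one parsing bug in requestHandler does not burn through the session pool.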
But if you just want a session to be thrown away after a single failure, you can set maxErrorScore in sessionPoolOptions instead:
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: router,
    sessionPoolOptions: {
        sessionOptions: {
            maxErrorScore: 1,
        },
    },
});
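To see why maxErrorScore: 1 retires a session after one failure, here is a toy model of the scoring rule. It assumes each markBad() adds 1 to the session's error score and that the session stops being usable once the score reaches maxErrorScore. ToySession and isUsable are hypothetical names for illustration, not Crawlee's Session implementation, and the default of 3 is likewise an assumption.

```javascript
// Toy model only: NOT Crawlee's Session class.
class ToySession {
    constructor({ maxErrorScore = 3 } = {}) { // default value is an assumption
        this.maxErrorScore = maxErrorScore;
        this.errorScore = 0;
    }

    // Each failure adds one point to the error score.
    markBad() {
        this.errorScore += 1;
    }

    // The session is usable only while the score is below the threshold.
    isUsable() {
        return this.errorScore < this.maxErrorScore;
    }
}

const strict = new ToySession({ maxErrorScore: 1 });
strict.markBad();
console.log(strict.isUsable()); // false: one failure reaches the threshold
```

With a higher threshold, the same single failure would leave the session in rotation.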
mute-gold (OP) · 3y ago
Amazing, thank you @thek1tten! I didn't know about errorHandler.
One more question: how can I access the error in errorHandler? Is it passed as a parameter?
All good, I found my answer in the docs!
@thek1tten can I prevent the request from being retried, depending on the error, from within errorHandler?
optimistic-gold · 3y ago
[Screenshot not preserved. Judging by the reply that follows, it suggested setting request.noRetry to true in errorHandler.]
This should work
mute-gold (OP) · 3y ago
So I've tried that, but without success: the request still ends up being retried. Is there any other way to prevent a retry? Maybe throwing a NonRetryableError?
mute-gold (OP) · 3y ago
As you can see in the logs, I print request right after setting request.noRetry to true in errorHandler, and the request is still retried right after.
[Screenshot of the logs not preserved.]
optimistic-gold · 3y ago
Hmm, that means it’s going off of the old value and reassigning it here does nothing. Let me look into it.
mute-gold (OP) · 3y ago
Thanks!
optimistic-gold · 3y ago
This feature doesn’t seem to exist yet. I’m making a PR on Crawlee’s GitHub to fix this
mute-gold (OP) · 3y ago
Thank you @thek1tten, let me know if there is a link to the issue that I can follow.
optimistic-gold · 3y ago
GitHub
feat(basic-crawler): allow request skipping by mstephen19 · Pull Re...
See this Discord post to fully understand the use case: https://discord.com/channels/801163717915574323/1019936393235017769 Didn't want to make big changes to existing code so kept the else sta...
optimistic-gold · 3y ago
@fab8203 It was merged into master
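Assuming a Crawlee version that includes this change, setting request.noRetry to true inside errorHandler should stop further retries of that request. A minimal sketch follows. isFatal and skipRetryOnFatalError are hypothetical names, and the 404 check is only an example of an error not worth retrying.

```javascript
// Hypothetical predicate: which errors are not worth retrying.
// The 404 pattern is an illustrative assumption.
function isFatal(error) {
    return /\b404\b/.test(String(error?.message ?? ''));
}

// Has the shape of an errorHandler: (crawlingContext, error).
function skipRetryOnFatalError({ request, session }, error) {
    if (isFatal(error)) {
        // Per this thread, the flag was not honored here before the PR was merged.
        request.noRetry = true;
        return;
    }
    // Separate concern: penalize the session for every other failure.
    session?.markBad();
}

// Usage (not executed here):
// new CheerioCrawler({ proxyConfiguration, requestHandler: router, errorHandler: skipRetryOnFatalError });
```

The two branches are kept apart on purpose: skipping a retry is a decision about the request, while markBad() is a decision about the session.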
mute-gold (OP) · 3y ago
Thank you for the follow up @thek1tten
exotic-emerald · 3y ago
Btw: retiring sessions and not retrying requests are two completely different concepts. Request and Session are separate objects that may be connected temporarily (a Request is made using a given Session).
