Redirect Control
I'm trying to build a simple crawler. How do I properly control redirects? Some bad proxies redirect to an auth page, and in that case I want to mark the request as failed if the redirect target URL contains something like /auth/login. What's the best way to handle this scenario and abort the request early?
Someone will reply to you shortly. In the meantime, this might help:
correct-apricot•2mo ago
Session Management | Crawlee · Build reliable crawlers. Fast.
Crawlee helps you build and maintain your crawlers. It's open source, but built by developers who scrape millions of pages every day for a living.
absent-sapphireOP•2mo ago
So each request is a session? Say I send 3 URLs to crawl, would this mark them all as failed once the session is marked as bad? I think I might have explained myself incorrectly: this still lets the page navigate to the auth-login page. My question was whether it's possible to prevent the redirect on the main document in the first place, and retire the session in case it happens.
correct-apricot•2mo ago
Sessions are defined by the session pool, so when you detect blocking, mark the request's session as bad so the remaining requests don't keep using a session that's already blocked.
metropolitan-bronze•2mo ago
You can do something like this:
You can also use the maxRedirects option: https://crawlee.dev/api/next/core/interface/HttpRequest#maxRedirects
And followRedirect: https://crawlee.dev/api/next/core/interface/HttpRequest#followRedirect