Redirect Control

I'm trying to make a simple crawler. How do I properly control redirects? Some bad proxies sometimes redirect to an auth page. In that case I want to mark the request as failed if the redirect URL (target) contains something like /auth/login. What's the best way to handle these scenarios and abort the request early?
5 Replies
Hall, 2mo ago
Someone will reply to you shortly. In the meantime, this might help:
correct-apricot, 2mo ago
Session Management | Crawlee · Build reliable crawlers. Fast.
Crawlee helps you build and maintain your crawlers. It's open source, but built by developers who scrape millions of pages every day for a living.
absent-sapphire (OP), 2mo ago
So each request is a session? Say I send 3 URLs to crawl: would this mark them all as failed once the session is marked as bad? I think I might have explained myself poorly. This still lets the page navigate to the auth login page; my question was whether it's possible to prevent a redirect on the main document and retire the session if one happens.
correct-apricot, 2mo ago
Sessions are defined by the session pool, so when a request is blocked, mark that request's session as "bad" so that other requests don't continue with it.
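To make the session point concrete, here is a minimal sketch of failing a single request and retiring only its session. It assumes the final URL is available after navigation and that the session object exposes retire(), as Crawlee sessions do. The helper names (isAuthRedirect, failOnAuthRedirect) and the fragment list are hypothetical, not part of Crawlee.

```javascript
// Hypothetical helpers, not part of Crawlee.
// The path fragments that count as "auth page" are an assumption.
const AUTH_PATH_FRAGMENTS = ['/auth/login'];

// Returns true when the final URL's path points at a login page.
// Only the path is checked, so a query string such as ?ref=/auth/login
// on a normal page does not trigger a false positive.
function isAuthRedirect(finalUrl, fragments = AUTH_PATH_FRAGMENTS) {
    if (!finalUrl) return false;
    let pathname;
    try {
        pathname = new URL(finalUrl).pathname;
    } catch {
        return false; // not an absolute URL, nothing to compare
    }
    return fragments.some((fragment) => pathname.includes(fragment));
}

// Retires the session used by this request and throws, which fails
// this request only. Other queued URLs are not touched by this helper.
function failOnAuthRedirect(finalUrl, session) {
    if (!isAuthRedirect(finalUrl)) return;
    if (session) session.retire();
    throw new Error(`Redirected to auth page: ${finalUrl}`);
}
```

Inside a request handler this could be called as failOnAuthRedirect(request.loadedUrl, session). Retiring takes the session out of rotation immediately, while markBad() only raises its error score.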
metropolitan-bronze, 2mo ago
You can do something like this:
// Option 1: Use the failedRequestHandler
// (runs only after the request has used up its retries)
failedRequestHandler: async ({ request, session, error }) => {
    // request.loadedUrl is the URL after redirects, if the page was loaded
    const finalUrl = request.loadedUrl ?? request.url;
    if (error.message.includes('/auth/login') || finalUrl.includes('/auth/login')) {
        console.log(`Request redirected to auth page: ${request.url}`);
        // Mark the proxy as bad if you're using a session pool
        if (session) {
            session.markBad();
        }
        // You can retry with a different proxy if needed
        // request.retryCount = 0;
        // await crawler.addRequests([request]);
    }
},

// Option 2: Handle redirects in the request handler
requestHandler: async ({ request, response, $, crawler, session }) => {
    // Check if we were redirected to an auth page.
    // request.url is the original URL; request.loadedUrl is the final one.
    const finalUrl = request.loadedUrl ?? response.url;
    if (finalUrl.includes('/auth/login')) {
        console.log(`Detected auth redirect: ${finalUrl}`);
        // Mark the session as bad
        if (session) {
            session.markBad();
        }
        // Throw an error to fail this request
        throw new Error('Redirected to auth page');
    }

    // Your normal processing code if not redirected
    // ...
},

// Option 3: Use the preNavigationHooks for Playwright/Puppeteer
// (page.route is the Playwright API)
preNavigationHooks: [
    async ({ request, page, session }) => {
        // Set up redirect interception.
        // Caveat: server-side redirect hops may not reach this handler
        // in every browser library, so verify this for your setup.
        await page.route('**', async (route) => {
            const url = route.request().url();
            if (url.includes('/auth/login')) {
                console.log(`Intercepted auth redirect: ${url}`);
                // Mark the session as bad
                if (session) {
                    session.markBad();
                }
                // Abort the navigation. Do not throw here: an error thrown
                // inside the route callback does not fail the request.
                // The aborted navigation is what makes it fail.
                await route.abort();
                return;
            }
            await route.continue();
        });
    },
],
You can also use the maxRedirects option: https://crawlee.dev/api/next/core/interface/HttpRequest#maxRedirects and followRedirect: https://crawlee.dev/api/next/core/interface/HttpRequest#followRedirect
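With redirects not followed automatically, the crawler should see the 3xx response itself, so the target can be checked before anything is loaded from it. The sketch below assumes the response exposes a numeric status code and a headers object with a lowercase location key. inspectRedirect is a hypothetical helper, and how followRedirect or maxRedirects is passed to your crawler is not shown here.

```javascript
// Hypothetical helper, not part of Crawlee: inspects a response that was
// fetched with redirects disabled and reports where it wanted to send us.
function inspectRedirect(statusCode, headers, requestUrl, blockedFragment = '/auth/login') {
    const location = headers && headers.location;
    // A 3xx without a Location header (e.g. 304) is not a redirect to follow.
    const isRedirect = statusCode >= 300 && statusCode < 400 && Boolean(location);
    if (!isRedirect) {
        return { isRedirect: false, target: null, blocked: false };
    }
    // Location may be relative, so resolve it against the request URL.
    const target = new URL(location, requestUrl);
    return {
        isRedirect: true,
        target: target.href,
        blocked: target.pathname.includes(blockedFragment),
    };
}
```

If blocked is true, the request can be failed and its session marked bad without ever loading the login page; otherwise the target can be enqueued or followed manually.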
