Is it possible to bypass proxies for specific requests?

I have a use case where I want to have a crawler running permanently. This crawler has a tiered proxy list set up that it iterates over in case some of the proxies don't work. For some pages I don't want to use proxies, to reduce the amount of money I am spending on them (when I scrape my own page I don't want to proxy, but I want to use the same logic / handlers). Is it possible to specify the proxy that should be used for specific requests? Or maybe even the proxy tier?

Basic setup:

const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [['proxyTier1'], ['proxyTier2']],
});
const crawler = new PlaywrightCrawler({
    keepAlive: true,
    proxyConfiguration: proxyConfiguration,
    // ...
});
// ...
crawler.addRequests(requestsWhereWeWantProxies);
crawler.addRequests(requestsWhereWeDontWantProxies);

It would be nice to be able to do something like:

crawler.addRequests(requestsWhereWeWantProxies);
crawler.addRequests(requestsWhereWeDontWantProxies.map((request) => ({...request, proxy: null})));

or

const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [['proxyTier1'], ['proxyTier2'], [null]],
});
// ...
crawler.addRequests(requestsWhereWeWantProxies);
crawler.addRequests(requestsWhereWeDontWantProxies.map((request) => ({...request, proxyTier: 2})));
4 Replies
afraid-scarlet (OP) • 3mo ago
I have also seen that

const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [[null], ['proxyTier1'], ['proxyTier2']],
});

would be a solution where the crawler always starts scraping without proxies, but I don't want the other requests to be scraped without a proxy 🙁
Louis Deconinck • 3mo ago
Yes, that's possible. You want to use something like this:
const proxyConfiguration = new ProxyConfiguration({
    newUrlFunction: (sessionId, { request }) => {
        if (request?.url.includes('crawlee.dev')) {
            return null; // for crawlee.dev, we don't use a proxy
        }

        return 'http://proxy-1.com'; // for all other URLs, we use this proxy
    }
});
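To get closer to the per-request opt-out asked for in the question, the same callback could check a flag on the request instead of matching on the URL. The following is only a sketch: the `skipProxy` key, the `withoutProxy` helper and the `RequestLike` type are made up for illustration, and it assumes the request passed to `newUrlFunction` carries the `userData` it was enqueued with.

```typescript
// Minimal shape of the request this sketch relies on (stand-in for Crawlee's Request).
interface RequestLike {
    url: string;
    userData?: Record<string, unknown>;
}

// Hypothetical flag name; any key in userData would do.
const SKIP_PROXY_FLAG = 'skipProxy';

// Placeholder proxy URL, taken from the answer above.
const DEFAULT_PROXY_URL = 'http://proxy-1.com';

// Same signature shape as the newUrlFunction in the answer above.
function newUrlFunction(
    _sessionId: string | number,
    options?: { request?: RequestLike },
): string | null {
    // Requests that were explicitly flagged go out without a proxy.
    if (options?.request?.userData?.[SKIP_PROXY_FLAG] === true) {
        return null;
    }
    // Everything else uses the proxy.
    return DEFAULT_PROXY_URL;
}

// Helper that marks a request so that newUrlFunction skips the proxy for it.
function withoutProxy(request: RequestLike): RequestLike {
    return {
        ...request,
        userData: { ...request.userData, [SKIP_PROXY_FLAG]: true },
    };
}

// Intended usage (not executed here):
// crawler.addRequests(requestsWhereWeWantProxies);
// crawler.addRequests(requestsWhereWeDontWantProxies.map(withoutProxy));
```

Keeping the decision in a flag rather than in URL matching means the handlers stay the same and only the code that enqueues requests decides which ones bypass the proxy.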
afraid-scarlet (OP) • 3mo ago
That was my initial approach, but I thought I'd lose the tieredProxy feature that way. I could, however, use a second instance of ProxyConfiguration and call the newUrlFn from that, like @Jan Buchar suggested here 👍
GitHub
Bypass proxy / pass specific proxy tier for requests · apify crawle...
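The two-instance idea from the last reply could look roughly like the sketch below. It is not taken from the linked discussion: `createBypassingNewUrlFunction`, `ProxyUrlSource` and `RequestLike` are hypothetical names, and the sketch assumes the tiered `ProxyConfiguration` exposes an async `newUrl(sessionId, options)` method that fits the `ProxyUrlSource` interface. Check this against the Crawlee version in use.

```typescript
// Minimal shape of the request this sketch relies on (stand-in for Crawlee's Request).
interface RequestLike {
    url: string;
}

// What the sketch needs from the inner, tiered proxy configuration.
// Assumption: a ProxyConfiguration built with tieredProxyUrls satisfies this.
interface ProxyUrlSource {
    newUrl(
        sessionId?: string | number,
        options?: { request?: RequestLike },
    ): Promise<string | undefined>;
}

type NewUrlFunction = (
    sessionId: string | number,
    options?: { request?: RequestLike },
) => Promise<string | null>;

// Builds a newUrlFunction that returns null (no proxy) for requests matching
// shouldBypass and otherwise delegates to the tiered configuration.
function createBypassingNewUrlFunction(
    tiered: ProxyUrlSource,
    shouldBypass: (request: RequestLike) => boolean,
): NewUrlFunction {
    return async (sessionId, options) => {
        const request = options?.request;
        if (request && shouldBypass(request)) {
            return null; // scrape this request without a proxy
        }
        // Forward the options unchanged so the inner configuration sees the request.
        const url = await tiered.newUrl(sessionId, options);
        return url ?? null;
    };
}

// Intended wiring (not executed here; the domain is a placeholder):
// const tiered = new ProxyConfiguration({ tieredProxyUrls: [['proxyTier1'], ['proxyTier2']] });
// const proxyConfiguration = new ProxyConfiguration({
//     newUrlFunction: createBypassingNewUrlFunction(tiered, (r) => r.url.includes('my-own-site.example')),
// });
```

Whether tier rotation behaves exactly as it does when the tiered configuration is passed to the crawler directly is not something this sketch verifies, so it is worth testing against a failing proxy before relying on it.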