Crawlee & Apify•3mo ago

Is it possible to bypass proxies for specific requests?

I have a use case where I want to have a crawler running permanently. This crawler has a tiered proxy list set up that it iterates over in case some of the proxies don't work. For some pages I don't want to use proxies, to reduce the amount of money I spend on them. (When I scrape my own page I don't want to use a proxy, but I do want to use the same logic / handlers.) Is it possible to specify the proxy that should be used for specific requests? Or maybe even the proxy tier?

Basic setup:

const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [['proxyTier1'], ['proxyTier2']],
});
const crawler = new PlaywrightCrawler({
    keepAlive: true,
    proxyConfiguration: proxyConfiguration,
    // ...
});
// ...
crawler.addRequests(requestsWhereWeWantProxies);
crawler.addRequests(requestsWhereWeDontWantProxies);

It would be nice to be able to do something like:

crawler.addRequests(requestsWhereWeWantProxies);
crawler.addRequests(requestsWhereWeDontWantProxies.map((request) => ({...request, proxy: null})));

or

const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [['proxyTier1'], ['proxyTier2'], [null]],
});
// ...
crawler.addRequests(requestsWhereWeWantProxies);
crawler.addRequests(requestsWhereWeDontWantProxies.map((request) => ({...request, proxyTier: 2})));

afraid-scarletOP•3mo ago

I have also seen that

const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [[null], ['proxyTier1'], ['proxyTier2']],
});

would be a solution where the crawler always starts scraping without proxies, but I don't want other requests to be scraped without a proxy 🙁

Louis Deconinck•3mo ago

Yes, that's possible. You want to use something like this:

import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // Called whenever the crawler needs a proxy URL for a request
    newUrlFunction: (sessionId, { request }) => {
        if (request?.url.includes('crawlee.dev')) {
            return null; // for crawlee.dev, we don't use a proxy
        }

        return 'http://proxy-1.com'; // for all other URLs, we use this proxy
    }
});
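
If matching on the URL is too coarse, a per-request flag may be closer to the `proxy: null` idea from the question. The following is only a sketch: the `userData.noProxy` flag, the `withoutProxy` helper and the proxy URL are made up for illustration and are not Crawlee features. It assumes the request handed to `newUrlFunction` carries the `userData` it was enqueued with.

```javascript
// Placeholder proxy URL, same as in the example above.
const DEFAULT_PROXY_URL = 'http://proxy-1.com';

// Candidate for the `newUrlFunction` option of ProxyConfiguration.
// `userData.noProxy` is a hypothetical flag chosen for this sketch.
function newUrlFunction(sessionId, options) {
    const request = options?.request;
    if (request?.userData?.noProxy === true) {
        return null; // flagged requests go out without a proxy
    }
    return DEFAULT_PROXY_URL; // everything else uses the proxy
}

// Marks plain request objects ({ url, ... }) as proxy-free,
// keeping any userData they already have.
function withoutProxy(requests) {
    return requests.map((request) => ({
        ...request,
        userData: { ...request.userData, noProxy: true },
    }));
}

// Wiring, assuming crawlee is installed:
// const proxyConfiguration = new ProxyConfiguration({ newUrlFunction });
// await crawler.addRequests(requestsWhereWeWantProxies);
// await crawler.addRequests(withoutProxy(requestsWhereWeDontWantProxies));
```

The handlers stay the same for both kinds of requests; only the flag decides whether a proxy URL is returned.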

afraid-scarletOP•3mo ago

That was my initial approach, but I thought I'd lose the tiered proxy feature that way. However, I could use a second instance of ProxyConfiguration and call the newUrlFn from that, like @Jan Buchar suggested here 👍
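
A rough sketch of that idea, under some assumptions: the second `ProxyConfiguration` holds the tiered list, and its `newUrl(sessionId, { request })` method is expected to pick a URL from the tiers. The `createNewUrlFunction` helper, the bypass predicate, the hostnames and the tier URLs are all hypothetical names for illustration.

```javascript
// Builds a newUrlFunction that skips the proxy for some requests and
// otherwise asks a second, tiered configuration for a proxy URL.
// `tieredConfiguration` only needs a `newUrl(sessionId, options)` method.
function createNewUrlFunction(tieredConfiguration, shouldBypass) {
    return async (sessionId, options) => {
        const request = options?.request;
        if (request && shouldBypass(request)) {
            return null; // no proxy for these requests
        }
        // Delegate to the tiered configuration (assumed to handle tiering).
        const proxyUrl = await tieredConfiguration.newUrl(sessionId, { request });
        return proxyUrl ?? null;
    };
}

// Wiring, assuming crawlee is installed (all URLs are placeholders):
// const tiered = new ProxyConfiguration({
//     tieredProxyUrls: [['http://tier-1.example'], ['http://tier-2.example']],
// });
// const proxyConfiguration = new ProxyConfiguration({
//     newUrlFunction: createNewUrlFunction(
//         tiered,
//         (request) => new URL(request.url).hostname === 'my-own-site.example',
//     ),
// });
```

Whether the tier selection behaves exactly as it does when `tieredProxyUrls` is passed to the crawler directly is worth verifying against the linked GitHub discussion.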

GitHub

Bypass proxy / pass specific proxy tier for requests · apify crawle...