Crawlee & Apify•3y ago

Ways to minimize traffic (save money) when crawling-scraping?

1. Block images, media files and similar things. This can be done either with preNavigationHooks, see https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#preNavigationHooks or with blockRequests, see https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#blockRequests As far as I know, blockRequests has some limitations (does it work in incognito mode with Firefox as the launcher?). This was discussed in this forum, see: https://discord.com/channels/801163717915574323/1039557325784105002 https://discord.com/channels/801163717915574323/1019949012415160370

2. Use the cache. As far as I understand, you cannot have both cache AND incognito mode. Well, there is the experimentalContainers option, which in theory should allow both cache and incognito. I tried it, see https://discord.com/channels/801163717915574323/1060738415370453032/1060952860868739192 and it looks like it is not really "incognito" when fingerprint.com recognizes you even when your IP is different. (You can prove me wrong; maybe my test was flawed, who knows?)

3. Something else to reduce traffic? Please suggest...

4. Actually I care more about money than about traffic... So one idea is to use a "Datacenter proxy" instead of "Residential". I see Datacenter proxies for about $0.7 per GB, much cheaper than Residential. Does it make sense to try? What is your experience with Datacenter proxies?
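For point 1, here is a minimal sketch of the blockRequests route. It assumes a Chromium-based browser and that the crawling context passed to preNavigationHooks exposes blockRequests with a urlPatterns option, as described on the linked API page. makeBlockingHook and HEAVY_PATTERNS are made-up names for illustration, and the pattern list is only an example.

```javascript
// Hypothetical helper: builds a pre-navigation hook that blocks URL patterns.
// Assumption: the crawling context provides blockRequests({ urlPatterns }).
function makeBlockingHook(urlPatterns) {
    return async ({ blockRequests }) => {
        // Ask the crawler to drop every request whose URL matches a pattern.
        await blockRequests({ urlPatterns });
    };
}

// Illustrative patterns for heavy files; adjust them per site.
const HEAVY_PATTERNS = ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.woff', '.mp4'];

// Intended use (not executed here):
// new PlaywrightCrawler({ preNavigationHooks: [makeBlockingHook(HEAVY_PATTERNS)] })
```

Blocking by URL pattern misses images served from extension-less URLs, so it is worth checking the actual traffic after enabling it.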

9 Replies

sensitive-blue•3y ago

Why not change the set of datacenter proxies every 5-6 days?

like-gold•3y ago

1. blockRequests doesn't work in Firefox at all, sadly.

2. Generally, you reduce traffic and compute by scraping via the API, which is possible on most modern websites. Sites that have everything in the HTML are heavier, but you almost never need a browser for them. https://developers.apify.com/academy/api-scraping

A browser is so much heavier that none of the above things really matter if you don't need it.
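To illustrate the API-scraping idea, here is a rough sketch that requests JSON directly instead of rendering a page. The /api/products path, the page parameter, and the function names are all hypothetical; a real site's endpoint has to be found in the browser's network tab. It assumes a runtime with a global fetch (for example Node 18+).

```javascript
// Hypothetical helper: builds the URL of an assumed paginated JSON endpoint.
function buildApiUrl(baseUrl, page) {
    const url = new URL('/api/products', baseUrl);
    url.searchParams.set('page', String(page));
    return url;
}

// Fetches one page of JSON without a browser.
// fetchImpl is injectable so the function can be exercised without network access.
async function fetchProducts(baseUrl, page, fetchImpl = globalThis.fetch) {
    const response = await fetchImpl(buildApiUrl(baseUrl, page), {
        headers: { accept: 'application/json' },
    });
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return response.json();
}
```

A single JSON response like this is usually far smaller than the HTML, scripts and images a browser would download for the same data.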

dependent-tanOP•3y ago

By the way: I want to avoid downloading unnecessary files (unnecessary requests), so I'm using the method described here: https://discord.com/channels/801163717915574323/1039557325784105002 The Playwright documentation at https://playwright.dev/docs/api/class-request#request-resource-type lists the resource types to check:

document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other.

BUT! Looking at https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest/ResourceType I see many more resource types, things like

"csp_report", "beacon", "imageset", "ping"

and many others. Should I include the elements of both lists in my BLOCKED array? (Imagine I'm paranoid and want to block everything except the main document.)

like-gold•3y ago

Sure, why not. Just keep in mind that the website might not work properly if you block some stuff.
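For the "block everything except the main document" case, an allowlist may be simpler than merging two blocklists. The sketch below is one way to express it; shouldAbort is a made-up helper, and the type list is the one quoted from the Playwright documentation above. Since that documentation says resourceType() returns one of those values, names that only appear in the MDN WebExtensions list should never match, though including them is harmless.

```javascript
// Resource types the Playwright docs list for request.resourceType().
const PLAYWRIGHT_TYPES = [
    'document', 'stylesheet', 'image', 'media', 'font', 'script', 'texttrack',
    'xhr', 'fetch', 'eventsource', 'websocket', 'manifest', 'other',
];

// Hypothetical helper: abort anything that is not explicitly allowed.
// Unknown type names are aborted too, so no second list is needed.
function shouldAbort(resourceType, allowed = ['document']) {
    return !allowed.includes(resourceType);
}

// Intended use inside a route handler (not executed here):
// page.route('**/*', (route) =>
//     shouldAbort(route.request().resourceType()) ? route.abort() : route.continue());
```

If a site breaks, types can be added back one at a time, for example shouldAbort(type, ['document', 'script', 'xhr']).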

dependent-tanOP•3y ago

...and here are the resource types I block:

const BLOCKED_IMG        =                                                               ['image', 'imageset', 'object', 'object_subrequest', 'ping', 'web_manifest', 'xslt',  'media', 'font', 'other', 'beacon', 'csp_report', 'speculative', 'sub_frame', 'xbl', 'xml_dtd', 'texttrack', 'fetch', 'eventsource', 'manifest'];

const BLOCKED_IMG_CSS    =                                                 ['stylesheet', 'image', 'imageset', 'object', 'object_subrequest', 'ping', 'web_manifest', 'xslt',  'media', 'font', 'other', 'beacon', 'csp_report', 'speculative', 'sub_frame', 'xbl', 'xml_dtd', 'texttrack', 'fetch', 'eventsource', 'manifest'];

const BLOCKED_IMG_CSS_JS = ['websocket', 'xhr', 'xmlhttprequest', 'script', 'stylesheet', 'image', 'imageset', 'object', 'object_subrequest', 'ping', 'web_manifest', 'xslt',  'media', 'font', 'other', 'beacon', 'csp_report', 'speculative', 'sub_frame', 'xbl', 'xml_dtd', 'texttrack', 'fetch', 'eventsource', 'manifest'];

For a new site I first try blocking everything in BLOCKED_IMG_CSS_JS. If the site does not work (pages are not rendered properly), I try BLOCKED_IMG_CSS. If it still does not work, I try BLOCKED_IMG, and then no blocking at all. Feel free to use and improve this approach, and share your improvements :-) And here is the code that implements the blocking:

    preNavigationHooks: [
        // Runs before each navigation and installs a route handler on the page.
        async ({ page, request }) => {
            // The blocking level is set per request in request.userData.headLessImg.
            const level = request.userData.headLessImg;
            await page.route('**/*', (route) => {
                const type = route.request().resourceType();
                // Abort the request if its resource type is on the list for this level.
                if (level === 'noimg' && BLOCKED_IMG.includes(type)) {
                    return route.abort();
                }
                if (level === 'noimgnocss' && BLOCKED_IMG_CSS.includes(type)) {
                    return route.abort();
                }
                if (level === 'noimgnocssnojs' && BLOCKED_IMG_CSS_JS.includes(type)) {
                    return route.abort();
                }
                // Any other level, or a type that is not listed, goes through unchanged.
                return route.continue();
            });
        },
    ],

So I set request.userData.headLessImg per request, and the code in preNavigationHooks just checks the value.
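A small sketch of how the per-request level and the "relax step by step" approach could fit together. makeRequest, relaxLevel and the 'none' label are made up for illustration; only the three headLessImg values come from the code above. The objects have the { url, userData } shape that can be handed to the crawler as requests.

```javascript
// Blocking levels from strictest to none. The first three match the
// userData values used above; 'none' is a hypothetical label for "no blocking".
const LEVELS = ['noimgnocssnojs', 'noimgnocss', 'noimg', 'none'];

// Hypothetical helper: request object carrying the blocking level in userData.
function makeRequest(url, level = LEVELS[0]) {
    return { url, userData: { headLessImg: level } };
}

// Hypothetical helper: the next, less strict level.
// Stays at 'none' once reached; unknown levels also map to 'none'.
function relaxLevel(level) {
    const index = LEVELS.indexOf(level);
    if (index === -1) return 'none';
    return LEVELS[Math.min(index + 1, LEVELS.length - 1)];
}

// Intended use (not executed here): start strict, and if the page does not
// render properly, re-enqueue it with makeRequest(url, relaxLevel(level)).
```

With the hook above, a request whose level is 'none' matches none of the conditions and so every resource is let through.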

sensitive-blue•3y ago

Great, thanks for sharing. If possible, can you please respond via DM?

dependent-tanOP•3y ago

Write the DM again (I cannot see it now).
