Ways to minimize traffic (save money) when crawling-scraping?
1. Block images, media files and similar things
It can be done either with preNavigationHooks (see https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#preNavigationHooks)
or with blockRequests (see https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#blockRequests).
As far as I know, blockRequests has some limitations (does it work in incognito mode with Firefox as the launcher?). This was discussed in this forum, see:
https://discord.com/channels/801163717915574323/1039557325784105002
https://discord.com/channels/801163717915574323/1019949012415160370
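A minimal sketch of the preNavigationHooks route, assuming Playwright's `page.route()` is used to abort requests by resource type. `RouteLike` and `PageLike` are local stand-ins for the Playwright types so the snippet compiles on its own; `BLOCKED_TYPES`, `shouldBlock` and `makeBlockingHook` are hypothetical names, and the set of blocked types is only an example.

```typescript
// Structural stand-ins for the parts of Playwright's Route/Page used here.
interface RouteLike {
    request(): { resourceType(): string };
    abort(): Promise<void> | void;
    continue(): Promise<void> | void;
}
interface PageLike {
    route(url: string, handler: (route: RouteLike) => unknown): Promise<void> | void;
}

// Example set only: which types are safe to block depends on the site.
const BLOCKED_TYPES: ReadonlySet<string> = new Set(['image', 'media', 'font']);

// Pure decision helper, so the rule can be checked without a browser.
function shouldBlock(resourceType: string, blocked: ReadonlySet<string>): boolean {
    return blocked.has(resourceType);
}

// Returns a function shaped like a pre-navigation hook: ({ page }) => Promise<void>.
function makeBlockingHook(blocked: ReadonlySet<string>) {
    return async ({ page }: { page: PageLike }): Promise<void> => {
        // Intercept every request and abort the ones whose type is blocked.
        await page.route('**/*', (route) =>
            shouldBlock(route.request().resourceType(), blocked)
                ? route.abort()
                : route.continue(),
        );
    };
}
```

Under these assumptions the hook would be passed to the crawler as `preNavigationHooks: [makeBlockingHook(BLOCKED_TYPES)]`. Because it relies on Playwright routing rather than blockRequests, it should not depend on the browser being Chromium-based.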
2. Use cache
As far as I understand, you cannot have both cache AND incognito mode.
Well, there is the experimentalContainers option: in theory it should allow both cache and incognito.
I tried it, see https://discord.com/channels/801163717915574323/1060738415370453032/1060952860868739192
It does not look truly "incognito": fingerprint.com recognizes you even when your IP is different.
(You are welcome to disprove me. Maybe my test was wrong, who knows?)
3. Something else to reduce traffic?
Please suggest...
4. Actually I care more about money than about traffic...
So one idea is to use "Datacenter proxy" instead of "Residential"...
I see Datacenter proxies for about $0.7 per GB, much cheaper than Residential.
Does it make sense to try?
What is your experience with Datacenter proxies?
sensitive-blue•3y ago
Why not change the set of datacenter proxies every 5-6 days?
like-gold•3y ago
1. blockRequests doesn't work in Firefox at all sadly
2. Generally, you reduce traffic and compute by scraping via an API, which is possible on most modern websites. Sites that have everything in the HTML are heavier, but you almost never need a browser for them. https://developers.apify.com/academy/api-scraping
A browser is so much heavier that none of the above really matters if you don't need one.
dependent-tanOP•3y ago
by the way:
I want to avoid downloading unnecessary files (unnecessary requests) so I'm using
the method described here:
https://discord.com/channels/801163717915574323/1039557325784105002
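Here https://playwright.dev/docs/api/class-request#request-resource-type (Playwright documentation) is the list of resource types to check:
document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other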
BUT!
Looking here
https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest/ResourceType
I see many more resource types there.
Should I include elements from both lists in my BLOCKED array? (Imagine I'm paranoid and want to block everything except the main document.)
like-gold•3y ago
Sure, why not. Just keep in mind that the website might not work properly if you block some stuff.
dependent-tanOP•3y ago
...and here are the resource types I block:
For a new site I first try to block everything in BLOCKED_IMG_CSS_JS.
If the site does not work (pages are not rendered properly), I try BLOCKED_IMG_CSS. If it still does not work, I try BLOCKED_IMG, and then no blocking at all.
Feel free to use/improve this approach.
Share your improvements :-)
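A minimal sketch of the fallback order described above. The tier names come from the post; the contents of each list are assumptions built from Playwright resource types, and `TIER_ORDER`, `NONE` and `nextTier` are hypothetical names.

```typescript
// Tier names are from the post; the contents of each list are assumptions.
const BLOCKED_IMG: readonly string[] = ['image', 'media', 'font'];
const BLOCKED_IMG_CSS: readonly string[] = [...BLOCKED_IMG, 'stylesheet'];
const BLOCKED_IMG_CSS_JS: readonly string[] = [...BLOCKED_IMG_CSS, 'script'];

// Ordered from most to least aggressive; NONE means "no blocking at all".
const TIER_ORDER = ['BLOCKED_IMG_CSS_JS', 'BLOCKED_IMG_CSS', 'BLOCKED_IMG', 'NONE'] as const;
type Tier = (typeof TIER_ORDER)[number];

const BLOCKED_BY_TIER: Record<Tier, readonly string[]> = {
    BLOCKED_IMG_CSS_JS,
    BLOCKED_IMG_CSS,
    BLOCKED_IMG,
    NONE: [],
};

// Returns the next, less aggressive tier to try when pages do not render
// properly with the current one. NONE is the last step and maps to itself.
function nextTier(current: Tier): Tier {
    const index = TIER_ORDER.indexOf(current);
    return TIER_ORDER[Math.min(index + 1, TIER_ORDER.length - 1)];
}
```

Deciding that a page is "not rendered properly" is left to the caller here, for example by checking that an expected selector is present.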
and the code that implements blocking:
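A minimal sketch of what such a hook might look like (not the original snippet). It assumes `request.userData.headLessImg` holds the name of a blocking tier and that Playwright's `page.route()` does the blocking; the tier contents, `RouteLike`, `HookContext`, `blockedFor` and `blockByUserData` are all hypothetical.

```typescript
// Structural stand-ins for the parts of the crawling context used here.
interface RouteLike {
    request(): { resourceType(): string };
    abort(): Promise<void> | void;
    continue(): Promise<void> | void;
}
interface HookContext {
    page: { route(url: string, handler: (route: RouteLike) => unknown): Promise<void> | void };
    request: { userData: Record<string, unknown> };
}

// Assumed tier contents, keyed by the tier names used in the post.
const TIERS = new Map<string, readonly string[]>([
    ['BLOCKED_IMG_CSS_JS', ['image', 'media', 'font', 'stylesheet', 'script']],
    ['BLOCKED_IMG_CSS', ['image', 'media', 'font', 'stylesheet']],
    ['BLOCKED_IMG', ['image', 'media', 'font']],
]);

// Looks up the resource types to block for one request.
// A missing or unknown headLessImg value means "block nothing".
function blockedFor(userData: Record<string, unknown>): readonly string[] {
    const tier = userData.headLessImg;
    return typeof tier === 'string' ? TIERS.get(tier) ?? [] : [];
}

// Shaped like a pre-navigation hook: it only checks the per-request value.
async function blockByUserData({ page, request }: HookContext): Promise<void> {
    const blocked = blockedFor(request.userData);
    if (blocked.length === 0) return; // nothing to block for this request
    await page.route('**/*', (route) =>
        blocked.includes(route.request().resourceType()) ? route.abort() : route.continue(),
    );
}
```

Under these assumptions, each request would be enqueued with something like `userData: { headLessImg: 'BLOCKED_IMG_CSS' }` and the function passed in `preNavigationHooks`.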
So I set request.userData.headLessImg per request, and the code in preNavigationHooks just checks the value.
sensitive-blue•3y ago
Great, thanks for sharing. If possible, can you please respond by DM?
dependent-tanOP•3y ago
Write the DM again (I cannot see it now).