High Volume Scraping

I'm evaluating Crawlee for my startup, which will need to scrape several hundred websites. These sites are neither e-commerce nor social media, and they require interaction with the page (feeding in a list of search parameters, clicking submit buttons, etc.). The documentation seems to imply that I need a headless browser in order to interact with a site, but headless browsers consume far more memory than their non-browser counterparts and are overkill for sites that do not render JavaScript. Other tools seem to have workarounds for these sites, such as adding the pageviewstate to the crawler's request with parameters taken from the site. Does Crawlee have options for interacting with pages without using the browser object?
2 Replies
rival-black (2y ago)
Hey there! I think you're talking about several different things, or there is some misunderstanding.

"Require interaction with the page": if the page really needs interaction, you can't go without a browser, as the browser is the only option in that case. On the other hand, you can intercept and replicate the XHR requests that are sent when you click a button or type something in a search field. If you can do that, you're not interacting with the page or browser per se; you're just receiving static HTML or JSON. That makes it fast, because nothing is rendered (there is no page or browser open), but you can't actually type or click anything. In this case you could just use CheerioCrawler: https://crawlee.dev/api/cheerio-crawler

There's also a third option, JSDOM (https://crawlee.dev/api/jsdom-crawler), which sits somewhere between the first two in terms of performance. Hope that helps.
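To make the "replicate the request" idea concrete, here is a minimal sketch of turning a list of search parameters into form-encoded POST requests instead of typing them into a page. The endpoint `https://example.com/search`, the field name `q`, the hidden field `token`, and the helper `buildSearchRequests` are all hypothetical; the real values would come from watching the site's network traffic in the browser dev tools. The object fields (`url`, `method`, `payload`, `headers`, `uniqueKey`, `userData`) follow the request options Crawlee accepts.

```typescript
// Shape of one request description; the field names mirror Crawlee's
// request options (url, method, payload, headers, uniqueKey, userData).
interface SearchRequest {
    url: string;
    method: 'POST';
    payload: string;
    headers: Record<string, string>;
    uniqueKey: string;
    userData: { term: string };
}

// Hypothetical helper: builds one form-encoded POST per search term.
// `hiddenFields` holds values copied from the form's hidden inputs
// (for example a view-state token), captured from an earlier GET of the page.
function buildSearchRequests(
    endpoint: string,
    terms: string[],
    hiddenFields: Record<string, string> = {},
): SearchRequest[] {
    return terms.map((term): SearchRequest => {
        // Assumed field name `q`; the real form may use a different one.
        const body = new URLSearchParams({ ...hiddenFields, q: term });
        return {
            url: endpoint,
            method: 'POST',
            payload: body.toString(),
            headers: { 'content-type': 'application/x-www-form-urlencoded' },
            // Every request targets the same URL, so each one gets its own
            // key; otherwise the request queue would treat them as duplicates.
            uniqueKey: `${endpoint}#${term}`,
            userData: { term },
        };
    });
}

const requests = buildSearchRequests(
    'https://example.com/search',
    ['alpha', 'beta gamma'],
    { token: 'abc123' },
);
```

An array like `requests` can then be handed to a `CheerioCrawler` via `crawler.run(requests)`, with the `requestHandler` parsing each response through `$`. Whether this works for a given site depends on whether its form submission can be reproduced as a plain HTTP request at all.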
flat-fuchsia (OP, 2y ago)
Thanks, I'll read through the documentation for the Cheerio and JSDOM options.
