Crawlee & Apify


This is the official developer community of Apify and Crawlee.


Unable to run crawlee in aws lambda (Protocol error (Target.setAutoAttach): Target closed)

I am trying to run Crawlee on AWS Lambda but am getting this error message: Reclaiming failed request back to the list or queue. Protocol error (Target.setAutoAttach): Target closed. Chromium version: 109, Node version: 16, code: ```...
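A common cause of "Target closed" in Lambda is that a stock Chromium build can't start in that sandbox, and that Crawlee tries to write storage outside /tmp. A minimal sketch, assuming the @sparticuz/chromium package (a Lambda-compatible Chromium build); the paths and handler shape are illustrative, not verified against any particular runtime:

```javascript
// Sketch: PuppeteerCrawler inside an AWS Lambda handler.
// Assumes @sparticuz/chromium is bundled with the function.
const { PuppeteerCrawler } = require('crawlee');
const chromium = require('@sparticuz/chromium');

// Lambda only allows writes under /tmp, so point Crawlee's storage there.
process.env.CRAWLEE_STORAGE_DIR = '/tmp/crawlee-storage';

exports.handler = async () => {
  const crawler = new PuppeteerCrawler({
    launchContext: {
      launchOptions: {
        args: chromium.args, // sandbox/GPU flags Chromium needs in Lambda
        executablePath: await chromium.executablePath(),
        headless: true,
      },
    },
    requestHandler: async ({ page, request }) => {
      console.log(request.url, await page.title());
    },
  });
  await crawler.run(['https://example.com']);
};
```

Also check that the chromium package version roughly matches the Chromium version Puppeteer expects.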

Bypassing cookies consent

Hello everyone. I want to scrape data from Google Maps using Crawlee. However, after scraping the content of a certain tag, I realized the content was from the cookie-consent page that Google Maps shows first. Some of you may know that if you visit Google Maps for the first time, you will see a different page asking you to accept cookies, and only after you click "Accept all" are you forwarded to Maps itself. How can I make sure that I go straigh...
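One approach is to detect and click the consent button before extracting anything. A minimal sketch; the selectors below are assumptions (Google changes its markup and localizes the button text), so inspect the consent page yourself:

```javascript
// Sketch: dismiss a consent interstitial if one is present.
// Returns true if a consent button was found and clicked.
async function acceptConsent(page, selectors = [
  'button[aria-label="Accept all"]',        // assumed English Maps consent button
  'form[action*="consent"] button',         // fallback: any button in a consent form
]) {
  for (const sel of selectors) {
    const btn = await page.$(sel);
    if (btn) {
      await btn.click();
      return true; // consent dismissed
    }
  }
  return false; // no consent page shown
}
```

In a PlaywrightCrawler requestHandler you would call `await acceptConsent(page)` (and wait for the resulting navigation) before reading the page content.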

Proxy fails on SSL-secured (HTTPS) websites

Hey! I'm trying different proxy providers and I've noticed the issue in the title. I'm setting the proxy in `proxyUrls` in the following format:...
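A frequent cause of HTTPS-only failures is a malformed proxy URL, especially credentials with special characters that break URL parsing. A small helper as a sketch (the parameter names are illustrative); note that for most providers the proxy scheme stays `http://` even for HTTPS targets, since the tunnel is made via CONNECT:

```javascript
// Sketch: build a proxyUrls entry with safely encoded credentials.
function buildProxyUrl({ scheme = 'http', username, password, host, port }) {
  const auth = username
    ? `${encodeURIComponent(username)}:${encodeURIComponent(password)}@`
    : '';
  return `${scheme}://${auth}${host}:${port}`;
}
```

The result would go into `new ProxyConfiguration({ proxyUrls: [...] })`; if HTTPS sites still fail, test the same URL with curl's `-x` flag to rule Crawlee out.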

How can I get more data when the site only provides 50 items per page and 40 pages per seller?

Good day guys. How can I get more data when the site only provides 50 items per page and 40 pages per seller? The scraper is only getting 2k items in total when the seller has 7,554 items in total. TIA...
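With a hard cap of 40 pages × 50 items = 2,000 results per query, the usual workaround is to split the listing into narrower filtered queries that each stay under the cap, then enqueue one start URL per slice. A sketch, assuming the site offers a numeric filter such as a price range (that filter is an assumption, not from the post):

```javascript
// Sketch: split a numeric filter range into `parts` sub-ranges,
// so each filtered query returns fewer than the site's result cap.
function splitRange(min, max, parts) {
  const step = Math.ceil((max - min + 1) / parts);
  const ranges = [];
  for (let lo = min; lo <= max; lo += step) {
    ranges.push([lo, Math.min(lo + step - 1, max)]);
  }
  return ranges;
}
```

For example, `splitRange(0, 99, 4)` yields four [lo, hi] pairs you could turn into URLs like `?price_min=lo&price_max=hi` (a hypothetical query format). If a slice still hits the cap, split it again recursively.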

Cannot add requests to my actor requestQueue

Hello guys, this is my first post on this Discord; first of all, thanks in advance for the potential help! I'm new to Apify and was trying out PuppeteerCrawler to get all the product information on a website. For that, I need to loop through all the products of a given page and get all the product links to enqueue them to my requestQueue. But when I add them, it doesn't seem to go through the other requests... I uploaded as files what I came up with as apify.js and the requestQueue object that was returned....

For use with Hero?

Are there any plans to integrate/use Hero [1] with Crawlee? [1] https://github.com/ulixee/hero...

PlaywrightCrawler - how often browser fingerprints are changed?

Are browser fingerprints changed every request? Every 1 min? Every... I do not know what else ))

New fingerprint per new page in browser-pool

Hi all, I'm trying out crawlee's browser-pool with just one browser (only puppeteer with chrome for now), and settings its fingerprintGeneratorOptions (https://crawlee.dev/api/browser-pool#changing-browser-fingerprints-aka-browser-signatures) to multi-OS multi-browser options, but each new page opened in the browser-pool has the same static fingerprint & headers. How can each new page opened via the browser-pool have a different fingerprint and headers?

Crawlee+PlaywrightCrawler+proxy - original IP leaking through WebRTC

I'm running this simple program from a server in German datacenter with IP 167.235... This program uses US residential proxies (rotating every 1min). And I see that pixelscan.net is able to detect my original IP: 167.235... On the attached screenshot you can find it under "WebRTC address"...
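WebRTC can open direct UDP connections that bypass the HTTP proxy, leaking the real IP. One common mitigation is to pass Chromium switches that force WebRTC onto the proxied route. A sketch; these are real Chromium switches, but verify them against your Chromium version, since WebRTC behavior has changed across releases:

```javascript
// Sketch: Chromium launch args that keep WebRTC on the proxied connection.
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
  launchContext: {
    launchOptions: {
      args: [
        // Only allow WebRTC over the default (proxied) route, no direct UDP.
        '--force-webrtc-ip-handling-policy=disable_non_proxied_udp',
        // Hide local IPs behind mDNS hostnames.
        '--enable-features=WebRtcHideLocalIpsWithMdns',
      ],
    },
  },
  requestHandler: async ({ page }) => {
    // ... extraction logic ...
  },
});
```

After changing the flags, re-check the "WebRTC address" field on pixelscan.net (or a similar leak tester) to confirm the datacenter IP no longer appears.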

Crawlee - how to set timezone?

Ok, I know in which country are my proxies/IPs, so I can set locale: ``` const crawler = new PlaywrightCrawler({ ... fingerprintOptions: {...
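Playwright browser contexts accept a `timezoneId` option (an IANA zone name), so one approach is to map the proxy's country to a timezone and pass it when the context is created. A sketch; the country-to-zone map is a small sample of my own, and for large countries with several zones you'd want a finer mapping:

```javascript
// Sketch: pick an IANA timezone matching the proxy's country code.
const COUNTRY_TZ = {
  US: 'America/New_York',
  DE: 'Europe/Berlin',
  GB: 'Europe/London',
  JP: 'Asia/Tokyo',
};

function timezoneForCountry(code) {
  return COUNTRY_TZ[code.toUpperCase()] || 'UTC'; // fall back to UTC when unknown
}
```

In plain Playwright this would feed `browser.newContext({ timezoneId: timezoneForCountry('DE') })`; how you thread the option through Crawlee depends on your crawler setup and version, so check the browser-pool hooks documentation for where contexts are created.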

Crawlee vs bot detection systems - Plugins length is not OK

I tested PlaywrightCrawler on three bot detection sites (see [1], [2], [3] and the attached screenshots). In all cases these sites complain about "0 plugins" or "Plugins length". If I open these sites with the browser I use every day (Firefox on Linux, by the way, the same as used in the PlaywrightCrawler settings), these sites say "5 plugins" and the field is green....

Share cache between multiple crawlee instances

I am using Crawlee with Chromium Playwright to scrape information about products from various retailers. For some of the information I need to extract, I have to run a headless browser to be able to interact with the page. I noticed that for one of my targets there are a lot of network transfers for scripts (js, json, css) that are the same for all the products. So if I scrape a long list of products, these resources get cached and their impact on the overall transferred data size is not big. On the other hand, if in a session I scrape only a few pages at the target, all these script resources need to be loaded, because the cache is initially empty for every Playwright session/context. Does anyone have an idea about how I could reuse the same cache in Playwright/Crawlee between 2 or more runs of my script?...

Controlling Crawlee run modes at runtime in a Dockerized environment

Hi there, I am running Crawlee workers in a dockerized environment and want to be able to switch between Cheerio/Playwright run modes during operation. I also want to switch between headless/headful at runtime when running the Playwright crawler. Is it even possible? Right now the only workaround I can think of is running different Docker containers for Cheerio/Playwright and also headless/headful, which would require Cheerio, Playwright headless, and Playwright headful containers, i.e. 3 containers in total....
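If restarting a container is acceptable, one image can serve every mode by reading the mode from the environment at startup instead of baking it into separate images. A sketch; `CRAWLER_MODE` and `HEADLESS` are hypothetical variable names of my own:

```javascript
// Sketch: derive crawler mode and headless flag from environment variables,
// so one Docker image can be started in any of the three configurations.
function crawlerConfigFromEnv(env) {
  const mode = (env.CRAWLER_MODE || 'cheerio').toLowerCase();
  return {
    mode,                               // 'cheerio' or 'playwright'
    headless: env.HEADLESS !== 'false', // headful only when explicitly requested
  };
}
```

At startup you would do `const { mode, headless } = crawlerConfigFromEnv(process.env);` and instantiate CheerioCrawler or PlaywrightCrawler (passing `headless` into its launch options) accordingly. Switching mode per request without a restart is harder: you would need both crawlers alive in the same process and a router deciding which queue each URL goes to.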

External Queue Provider

Hello, I started using Crawlee and am quite impressed with the speed and anti-bot protection. However, I want to scale the system horizontally between different machines (nodes) in a cluster and therefore need a shared queue broker (like Redis or RabbitMQ). Is it possible to configure Crawlee to use my own queue instead of the local in-process one? Or there is no out-of-the-box...

Export products with price from eshop

Hi, I would like to ask for help with scraping an e-shop and exporting products with prices. They will be products from a fashion e-shop. I tried to set it up based on the Apify docs but I am not able to do it. Is there anyone who can help me? Thank you :)

per-site interval between requests?

Imagine the request queue of Crawlee (PlaywrightCrawler) containing URLs of two (or more) sites: example.com/url1 another-site.com/url2 example.com/url3...
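One way to get a per-site interval is to track the last request time per hostname and delay each request accordingly. A sketch (the clock is injected so the logic is testable); newer Crawlee versions also ship a `sameDomainDelaySecs` crawler option that may cover this out of the box, so check your version's docs first:

```javascript
// Sketch: per-hostname minimum interval between requests.
class PerHostThrottle {
  constructor(intervalMs, now = () => Date.now()) {
    this.intervalMs = intervalMs;
    this.now = now;
    this.last = new Map(); // hostname -> timestamp of the last reserved slot
  }
  // Returns how many ms the caller should wait before hitting this URL's host.
  delayFor(url) {
    const host = new URL(url).hostname;
    const prev = this.last.get(host);
    const t = this.now();
    const wait = prev === undefined ? 0 : Math.max(0, prev + this.intervalMs - t);
    this.last.set(host, t + wait); // reserve the next slot for this host
    return wait;
  }
}
```

In a preNavigationHook you would sleep for `throttle.delayFor(request.url)` milliseconds, so example.com and another-site.com are throttled independently even when their URLs interleave in the queue.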

External request queue + external result storage, Crawlee as daemon process - how to implement it?

Hi all, I would like to run Crawlee (actually PlaywrightCrawler) all the time, even when there are no requests in the request queue. (Crawlee will run on a small Ubuntu box in a datacenter; I can handle all the devops work needed for this.) The requests/URLs should come from an external message queue (running outside of the Node.js process). The Node.js API to read from the external message queue exists. The scraping results should be stored in the same external message queue. ...
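The shape of this is a crawler created with `keepAlive: true` (so it does not exit on an empty queue; verify your Crawlee version supports the option) plus a pump loop that polls the external queue and feeds batches in. A sketch of the pump; `fetchBatch` is a hypothetical adapter for your external message queue, and `addRequests` would be `(reqs) => crawler.addRequests(reqs)`:

```javascript
// Sketch: poll an external source and push URL batches into a crawler.
async function pumpRequests({ fetchBatch, addRequests, pollMs = 5000, shouldStop = () => false }) {
  let moved = 0;
  while (!shouldStop()) {
    const urls = await fetchBatch(); // [] when the external queue is empty
    if (urls.length) {
      await addRequests(urls.map((url) => ({ url })));
      moved += urls.length;
    } else {
      await new Promise((r) => setTimeout(r, pollMs)); // idle until next poll
    }
  }
  return moved;
}
```

Results would flow the other way inside the requestHandler, which publishes each scraped item back to the external queue; run the whole thing under systemd or a process manager so the daemon restarts on crashes.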

Setting a cookie in Cheerio before the page request

I am trying to use Cheerio to crawl a site that authenticates via session-based cookies. I have the cookie value I want to set, but don't know where/how to set it so every page request of my Actor's run has that cookie set. Are there pre-request callbacks I can use in Cheerio to set a cookie, or perhaps a high-level per-Actor config where I can set cookie values that will persist across all sessions? I can't find any examples or documentation for how to access the session/sessionPool outside of a Cheerio requestHandler 🤷🏻‍♂️...
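Since a cookie is just a request header for CheerioCrawler, the simplest route is to stamp a Cookie header onto every request before it is enqueued. A sketch:

```javascript
// Sketch: attach a session cookie header to a batch of request objects
// without mutating the originals.
function withSessionCookie(requests, cookie) {
  return requests.map((req) => ({
    ...req,
    headers: { ...(req.headers || {}), Cookie: cookie },
  }));
}
```

Usage would look like `await crawler.addRequests(withSessionCookie([{ url: 'https://example.com' }], 'session=abc123'))`. CheerioCrawler also supports preNavigationHooks that receive the outgoing request options, which is the place to set the header centrally if you don't want to touch every enqueue site; check your version's documentation for the exact hook signature.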