Crawlee & Apify


This is the official developer community of Apify and Crawlee.


crawlee-js

apify-platform

crawlee-python

💻hire-freelancers

🚀actor-promotion

💫feature-request

💻devs-and-apify

🗣general-chat

🎁giveaways

programming-memes

🌐apify-announcements

🕷crawlee-announcements

👥community

What produces this error?

I'm getting page.goto: net::ERR_TIMED_OUT. Does it mean the page is blocking me? If I open it in a browser, it works fine.

setCookie and session.getCookies don't work together

I'm trying to run this code in my default handler: ``` if (request.loadedUrl === 'url-from-where-i-get-cookies'){ goodCokies = session.getCookies('url-from-where-i-get-cookies') await crawler.addRequests(['url-where-i-need-cookies'])...
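
A minimal sketch of one way this could look, assuming both URLs are served by the same session, persistCookiesPerSession is enabled, and the URLs and labels shown here are hypothetical: read the cookies off the session after the first page, re-attach them for the second origin, and only then enqueue it.

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    // Response cookies are stored on the session automatically.
    persistCookiesPerSession: true,
    // Keep a single session so the follow-up request reuses the same cookies.
    sessionPoolOptions: { maxPoolSize: 1 },
    async requestHandler({ request, session, crawler }) {
        if (request.label === 'GET_COOKIES' && session) {
            // Read the cookies the first page set on this session.
            const cookies = session.getCookies(request.loadedUrl ?? request.url);
            // Re-attach them for the second origin so the session also sends them there.
            session.setCookies(cookies, 'https://example.com/needs-cookies');
            await crawler.addRequests([
                { url: 'https://example.com/needs-cookies', label: 'USE_COOKIES' },
            ]);
        }
    },
});

await crawler.run([{ url: 'https://example.com/sets-cookies', label: 'GET_COOKIES' }]);
```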

One-proxy, many-sessions?

Context: Proxy providers often provide a single proxyUrl from which any number of connections can be opened, i.e. each connection gets a different IP even though the proxyUrl is seemingly the same. I wonder, is Crawlee able to create e.g. 100 Sessions (and rotate them upon retire/markBad) despite there being only one proxyUrl specified? From reading the docs, a round-robin rotation mechanism is referred to and a sessionId-proxyUrl pair is mentioned, so I get the sense that in Crawlee every Session locks exactly one proxyUrl, making the sessionPoolOptions.maxPoolSize setting redundant. In other words, each proxyConfiguration.proxyUrl can have at most 1 session attached to it. Hypothesis: when ProxyConfiguration.proxyUrls.length === 1, even though e.g. sessionPoolOptions.maxPoolSize === 100, Crawlee can/will only create one Session because Crawlee thinks there's only one IP available (or something). Does this happen? Sorry for the long-winded question; it's my first time using Crawlee and I'm unsure how the details fit together. Thanks for any attention this may get....
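
As far as I understand, a single proxy URL does not by itself cap the session pool at one session; the pool size and the proxy list are configured independently. A rough sketch (the endpoint URL, pool size, and blocking check are made up) of rotating many sessions over one rotating-proxy endpoint:

```ts
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// One rotating endpoint; every new connection through it can get a new exit IP.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@rotating-proxy.example.com:8000'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,
    sessionPoolOptions: {
        // Up to 100 sessions can coexist even though there is only one proxyUrl;
        // all of them simply resolve to that same URL.
        maxPoolSize: 100,
    },
    async requestHandler({ session, $ }) {
        // Retiring a session removes it from the pool; the retried request then
        // runs with a fresh session and a fresh connection through the endpoint.
        if ($('title').text().toLowerCase().includes('access denied')) {
            session?.retire();
            throw new Error('Blocked, retrying with a new session');
        }
    },
});
```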

Request works in Postman but doesn't work with CheerioCrawler, request object headers empty

Dear all, I am trying to scrape data from a public IP. For some reason CheerioCrawler is not getting the data back, but in Postman I can easily get the data. The proxy IP is whitelisted, because I am using the same IP for Postman and for Cheerio. Postman does add some default headers, but when I look at my request object the headers are empty. Does someone know at which point Cheerio sets the headers and generates fingerprints, and how can I see them? `Request {...
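
If it helps with debugging: the headers for CheerioCrawler are assembled by got-scraping at request time, so they won't necessarily appear on the Request object. A rough sketch of inspecting and overriding them in a preNavigationHook (the accept override is just an example; browser-like headers may still be generated later inside got-scraping, so this is not the full picture):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        async ({ request }, gotOptions) => {
            // gotOptions are the options handed to got-scraping for this request.
            gotOptions.headers = {
                ...gotOptions.headers,
                accept: 'application/json', // example override, purely illustrative
            };
            console.log('About to fetch', request.url, 'with', gotOptions.headers);
        },
    ],
    async requestHandler({ request, body }) {
        console.log(request.url, 'response length:', body.length);
    },
});
```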

Retire session after request handler timed out

Hi everyone, I have a quick question: I removed all the blocked status codes in all crawler sessions (Puppeteer) and set the page.defaultTimeout to 1 minute. I want to retire the session when the timeout is reached; how can I check whether it was?...
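
One option, as a sketch rather than the only way: let the timeout error propagate out of the request handler and retire the session in the crawler's errorHandler, which runs before each retry. The exact wording of the timeout message can differ between versions, so treat the string check below as an assumption.

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandlerTimeoutSecs: 60,
    async requestHandler({ page }) {
        page.setDefaultTimeout(60_000);
        // ... navigation / scraping that may time out
    },
    // Runs whenever requestHandler throws, before the request is retried.
    async errorHandler({ session, request }, error) {
        if (session && /time(d)? ?out/i.test(error.message)) {
            session.retire();
            console.log(`Retired session after timeout on ${request.url}`);
        }
    },
});
```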

parallel Login Scraping

Hello, I want to build a scaled scraper that scrapes data from the site after logging in. I want to run multiple instances, such that each instance appears to have scraped from a unique device/location, given proxies. Can you help me visualize a high-level overview of how I should solve this problem?

Error: browserController.newPage() failed on basic puppeteer example

Hi everyone. I've just started my work with Crawlee. I tried to test some basic examples. I picked PuppeteerCrawler and I have a problem running it. I didn't change anything in the code; it's the plain Crawlee boilerplate. I'm constantly getting an error like this: ```ERROR PuppeteerCrawler: Request failed and reached maximum retries. Error: browserController.newPage() failed: bNjhABNbboGcM1zBugHIV Cause: browserController.newPage() timed out.....

Elements not rendering

Hi, I am trying to crawl a page (it's a very small crawl; I used to do it manually every week). I am using the PlaywrightCrawler. I am running into a problem: when visiting the site I want to crawl, the elements that are most important to crawl are not appearing. They are not hidden with CSS or something; they are literally just not in the DOM. The website uses server-side rendering....
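
If the elements are injected client-side after the initial HTML (despite the server-side rendering elsewhere on the site), explicitly waiting for them before extraction sometimes helps. A minimal sketch; the selector is hypothetical:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, log }) {
        // Wait until the target elements are actually attached to the DOM.
        await page.waitForSelector('.important-item', { state: 'attached', timeout: 30_000 });
        const texts = await page.$$eval('.important-item', (els) =>
            els.map((el) => el.textContent?.trim()),
        );
        log.info(`${request.url}: found ${texts.length} items`);
    },
});
```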

Puppeteer unable to find element (dev tools show the element)

Hello everyone, I'm currently working on a web scraping tool that extracts data from the following web page: https://www.handelsregister.de/rp_web/documents-dk.xhtml. However, I'm running into some unexpected issues and I'm not sure how to debug them. Here's a brief overview of the issue:...

running multiple scrapers with speed

I already have a web scraper for Amazon outputting to a rawData.json file. It successfully scrapes product links and then goes through each of those product links to get the data I need, but I want to scale up to many, many scrapers, and I'm having trouble running multiple scrapers at once. I essentially made a new router to handle the other site, and I want to know how I can make sure that only the URL with the same label will run the router handler with the same label, but it won't let me define both routes like ...

How to authenticate PlaywrightCrawler

I see that there is a Session object, but I can't find any examples of how to instantiate it with user credentials. I have a TypeScript NodeJS application that I trigger with an HTTP call and run locally (all works nicely). I'm trying to crawl an internal CMS but can't get past the front door....
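
One common approach, sketched here under assumptions (a form-based login, hypothetical selectors, and env-var credentials): handle a LOGIN-labelled request first, submit the form in the page, and only then enqueue the protected pages; the session pool keeps the resulting cookies for subsequent requests.

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    persistCookiesPerSession: true,
    async requestHandler({ page, request, enqueueLinks, log }) {
        if (request.label === 'LOGIN') {
            // Hypothetical selectors; adjust to the CMS login form.
            await page.fill('#username', process.env.CMS_USER ?? '');
            await page.fill('#password', process.env.CMS_PASS ?? '');
            await page.click('button[type="submit"]');
            await page.waitForLoadState('networkidle');
            // Authenticated now; enqueue the pages behind the login.
            await enqueueLinks({ label: 'PAGE' });
            return;
        }
        log.info(`Scraping authenticated page: ${request.url}`);
        // ... extraction logic
    },
});

await crawler.run([{ url: 'https://cms.example.com/login', label: 'LOGIN' }]);
```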

Random disappearing requests

Cheerio crawler: ```/x/node_modules/proper-lockfile/lib/lockfile.js:213 onCompromised: (err) => { throw err; }, ^ ...

Errors when trying to send a request

Hello, I'm trying to send a request with Puppeteer with modified headers, and I'm getting the following error:
DEBUG Error while disabling request interception {"error":{"name":"ProtocolError","message":"Protocol error (Network.setCacheDisabled): Target closed","stack":"ProtocolError: Protocol error (Network.setCacheDisabled): Target closed\n at new Callback (x/Connection.js:61:35)\n at CallbackRegistry.create (x/Connection.js:106:26)\n at Connection._rawSend (x/Connection.js:216:26)\n at CDPSessionImpl.send x/Connection.js:425:78)\n at NetworkManager._NetworkManager_updateProtocolCacheDisabled (x/NetworkManager.js:198:69)\n at NetworkManager._NetworkManager_updateProtocolRequestInterception (x/NetworkManager.js:191:119)\n at NetworkManager.setRequestInterception (x/NetworkManager.js:163:127)\n at CDPPage.setRequestInterception (x/Page.js:297:88)\n at disableRequestInterception (x/puppeteer_request_interception.js:221:16)\n at ObservableSet.onDelete (x/puppeteer_request_interception.js:207:31)"}}
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. net::ERR_INVALID_ARGUMENT at https://www.x.com/grading/set_match/8897845
DEBUG Error while disabling request interception {"error":{"name":"ProtocolError","message":"Protocol error (Network.setCacheDisabled): Target closed","stack":"ProtocolError: Protocol error (Network.setCacheDisabled): Target closed\n at new Callback (x/Connection.js:61:35)\n at CallbackRegistry.create (x/Connection.js:106:26)\n at Connection._rawSend (x/Connection.js:216:26)\n at CDPSessionImpl.send x/Connection.js:425:78)\n at NetworkManager._NetworkManager_updateProtocolCacheDisabled (x/NetworkManager.js:198:69)\n at NetworkManager._NetworkManager_updateProtocolRequestInterception (x/NetworkManager.js:191:119)\n at NetworkManager.setRequestInterception (x/NetworkManager.js:163:127)\n at CDPPage.setRequestInterception (x/Page.js:297:88)\n at disableRequestInterception (x/puppeteer_request_interception.js:221:16)\n at ObservableSet.onDelete (x/puppeteer_request_interception.js:207:31)"}}
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. net::ERR_INVALID_ARGUMENT at https://www.x.com/grading/set_match/8897845
...

running numerous scrapers from one start file with speed

I already have a web scraper for Amazon outputting to a rawData.json file. It successfully scrapes product links and then goes through each of those product links to get the data I need, but I want to scale up to many, many scrapers, and I'm having trouble running multiple scrapers at once. I essentially made a new router to handle the other site, and I want to know how I can make sure that only the URL with the same label will run the router handler with the same label, but it won't let me define both routes like
requestHandler: [router, router2]
...
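
For what it's worth, requestHandler accepts a single function rather than an array. A sketch of one workaround (the router names and label prefixes are made up): dispatch to the right router yourself by label.

```ts
import { createCheerioRouter, CheerioCrawler } from 'crawlee';

const amazonRouter = createCheerioRouter();    // handlers registered with AMAZON_* labels
const otherSiteRouter = createCheerioRouter(); // handlers registered with OTHER_* labels

const crawler = new CheerioCrawler({
    // requestHandler takes one function, so route to the matching router by label prefix.
    async requestHandler(context) {
        if (context.request.label?.startsWith('AMAZON')) return amazonRouter(context);
        return otherSiteRouter(context);
    },
});
```

The same pattern should work with PlaywrightCrawler or PuppeteerCrawler; alternatively, registering both sites' labels on a single router avoids the dispatch entirely.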

Custom user agent playwright browser

How can I put a custom user agent into the Playwright crawler context?
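
A sketch of one way to do it, assuming launchContext.userAgent is available in your Crawlee version and that generated fingerprints would otherwise override the value (which is why they are disabled here); the UA string is made up:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Custom user agent for the launched browser.
        userAgent: 'Mozilla/5.0 (X11; Linux x86_64) MyCrawler/1.0',
    },
    browserPoolOptions: {
        // Fingerprint generation can set its own user agent, so turn it off
        // if the custom value must be kept exactly.
        useFingerprints: false,
    },
    async requestHandler({ page, log }) {
        log.info(await page.evaluate(() => navigator.userAgent));
    },
});
```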

RequestQueue.open issue in dockerized app

When I try to run my app with Crawlee inside a Docker container (built from apify/actor-node), the RequestQueue.open() call just runs forever without any errors and blocks further program execution. During normal execution on my PC it works fine.

cookies help

Hello, I first did the experiment in Postman: visit a page, take the cookies from there, go to another page (that requires those cookies), and it worked. I'm trying to do the same thing with Cheerio, but it fails. Here is my code: ```...

Could not find file at storage/key_value_stores/default/SDK_SESSION_POOL_STATE.json

Hi there! 👋 I'm crawling some pages from different countries using proxy configurations. The function that runs the crawl:...

Maintain the same browser/scope

Hi! I'm having an issue while scraping a web app. This app makes heavy use of context and cookies, and when I enqueue over 80 URLs using enqueueLinks, after about the 20th URL scraped, my algorithm opens another browser window, losing the scope and losing access to the URLs that need the context or cookies. So, is there any way to configure Crawlee to avoid opening more browsers? Or maybe a way to keep the first scope even between browsers?...
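
One thing that may help, as a sketch (the numbers are arbitrary assumptions): raise the browser-pool limits so a single browser serves more pages before another one is opened, and keep non-incognito pages so they share one context and its cookies.

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        // Let one browser serve more pages before another one is spawned.
        maxOpenPagesPerBrowser: 100,
        // Don't retire the browser after a handful of pages, so cookies/context survive.
        retireBrowserAfterPageCount: 500,
    },
    launchContext: {
        // Shared (non-incognito) pages reuse one browser context and its cookies.
        useIncognitoPages: false,
    },
    async requestHandler({ page }) {
        // ... scraping that relies on the shared context/cookies
    },
});
```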

Cheerio Crawler inner text

When creating a simple Cheerio crawler and retrieving all the contents within a specific div tag contained in the retrieved content, if I use the .text() method it appears to strip out all the HTML tags and concatenate content from different child tags without any spacing/delimiters. If I do a similar crawl using Puppeteer and call the innerText method on a particular retrieved tag, it appears to put spacing/newlines between the content contained in different child tags...
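
That matches how Cheerio's .text() works: it concatenates descendant text nodes with no separators, unlike the browser's innerText, which is layout-aware. A small sketch of a workaround ('div.target' is a hypothetical selector): join the children's text yourself with an explicit delimiter.

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $ }) {
        // .text() flattens everything into one string with no spacing between children.
        const flat = $('div.target').text();

        // Joining per-child text restores some structure, similar to innerText's newlines.
        const joined = $('div.target')
            .children()
            .map((_, el) => $(el).text().trim())
            .get()
            .join('\n');

        console.log({ flat, joined });
    },
});
```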