Crawlee & Apify

This is the official developer community of Apify and Crawlee.

Fingerprint and workers

Hello! I used the fingerprint generator, but CreepJS shows me inconsistencies between the fingerprint and the workers. Is it possible to patch the fingerprint so that the workers are patched as well?

How to set 'locale' and 'timezoneId' on browsers or pages?

In Playwright I can create a page this way and set `locale` and `timezoneId`:

```typescript
browser.newPage({ locale: 'zh-TW', ...
```
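
In Crawlee these are browser-context options rather than launch options, so one plausible route (as I understand the browser-pool API) is a `prePageCreateHooks` entry that mutates the options Crawlee passes when it opens a new page; the hook only takes effect when each page gets its own context, hence `useIncognitoPages`. A minimal sketch, with `zh-TW`/`Asia/Taipei` as example values:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // give each page its own context so per-page context options apply
        useIncognitoPages: true,
    },
    browserPoolOptions: {
        prePageCreateHooks: [
            (pageId, browserController, pageOptions) => {
                // pageOptions are forwarded to the new browser context,
                // which is where Playwright reads locale/timezoneId
                if (pageOptions) {
                    Object.assign(pageOptions, { locale: 'zh-TW', timezoneId: 'Asia/Taipei' });
                }
            },
        ],
    },
    async requestHandler({ page }) {
        console.log(await page.evaluate(() => navigator.language)); // should print 'zh-TW'
    },
});
```

If you also use the fingerprint generator, keep the two consistent, since injected fingerprints carry their own locale/timezone hints.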

crawlee eating memory like hell

It is eating 3 GB after running just for 2 days

Can't purge named datasets

When I create a named dataset like `const dataset = await Dataset.open("test");` and let the script run, the data gets appended after each run. I tried to call `purgeDefaultStorages()` but this has no effect. What am I doing wrong?
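
`purgeDefaultStorages()` only clears the *default* (unnamed) storages; named storages are deliberately persistent across runs. If I read the API right, dropping the named dataset explicitly should reset it:

```typescript
import { Dataset } from 'crawlee';

const dataset = await Dataset.open('test');
await dataset.drop();                      // deletes the named dataset and all its items
const fresh = await Dataset.open('test');  // reopen it empty for the new run
```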

SyntaxError Unexpected end of JSON input!

Hi! Does anyone know why I get this error and how I can solve it? I noticed it happens when I have many files loaded in the directory. Is there a way to get past this? Here is the full log message:

```
SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at findOrCacheDatasetByPossibleId (C:\Users\misag\OneDrive\Documents\Joiakim\Neontech\my-crawler\node_modules\@crawlee\memory-storage\cache-helpers.js:48:39)
    at async DatasetClient.get (C:\Users\misag\OneDrive\Documents\Joiakim\Neontech\my-crawler\node_modules\@crawlee\memory-storage\resource-clients\dataset.js:79:23)
    ...
```

How to create a new cheerio instance $?

I need to instantiate a new cheerio object. I'm doing a search in a set of elements and need to select just one element for further processing. My actual code is:

```typescript
function getOrigin($: typeof cheerioModule) {
    let origin = ""
    const specElements = $('#product_specs table tr').toArray()
    ...
```
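
You usually don't need a second cheerio instance here: the `$` you already have can re-wrap any element returned by `toArray()`, and if you truly want an isolated instance, `cheerio.load($.html(element))` creates one. A sketch of the first approach; the `th`/`td` layout and the `Origin` label are assumptions about your markup:

```typescript
import * as cheerio from 'cheerio';

function getOrigin($: cheerio.CheerioAPI): string {
    for (const el of $('#product_specs table tr').toArray()) {
        const row = $(el); // wrap a single element with the existing instance
        if (row.find('th').text().trim() === 'Origin') {
            return row.find('td').text().trim();
        }
    }
    return '';
}
```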

net::ERR_TUNNEL_CONNECTION_FAILED

I am trying to use a proxy with the Crawlee PlaywrightCrawler to connect to a page on a non-standard port (444), and I am getting this proxy error: `PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: net::ERR_TUNNEL_CONNECTION_FAILED`. Any suggestions? Without the proxy it works fine locally. On the platform I get a timeout, which could be because of a banned AWS IP range....

How to pass UserData when executing crawler

When I do `await crawler.run(['https://crawlee.dev'], { userData: { depth: 0 } });` I get this error: `Uncaught ArgumentError: Did not expect property userData to exist, got [object Object] in object options`. How can I set userData in the options?...
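
The second argument of `crawler.run()` doesn't accept `userData`; it belongs on the individual request. Passing request objects instead of bare URLs should work:

```typescript
// attach userData to the request itself, not to the run() options
await crawler.run([
    { url: 'https://crawlee.dev', userData: { depth: 0 } },
]);
```

Inside the `requestHandler`, it is then available as `request.userData.depth`.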

Crawler does not work anymore due to error

Hi all, I was updating some packages, and afterwards I wanted to test whether my crawler still worked. The console logged the error in the screenshot. I tried going back to old versions but the error was still there, and I have no idea how to solve this. Does anyone have an idea?...

Trying to combine a content checker with a login on Apify (new to Apify and web scraping)

The content checker actor is what I need to get an alert when the content on a web page changes. The page is behind a login and I have learned how to export cookies. But I can't seem to marry the two. What happens is: (1) I keep getting a picture of the login page! ...
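
One way to marry the two is to inject the exported cookies before each navigation, so the checker renders the logged-in page. A hedged sketch for a Puppeteer-based crawler; `cookies` is whatever array you exported from your browser:

```typescript
import { PuppeteerCrawler } from 'crawlee';

const cookies = [ /* your exported cookies, in Puppeteer's cookie format */ ];

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // set the session cookies before the page is requested
            await page.setCookie(...cookies);
        },
    ],
    async requestHandler({ page }) {
        // the page should now show the content behind the login
    },
});
```

If the site expires sessions quickly, the cookies will need refreshing, or the login itself has to be scripted.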

How to solve "Navigation timed out after 60 seconds"

INFO PlaywrightCrawler: Error analysis: {"totalErrors":53,"uniqueErrors":2,"mostCommonErrors":["46x: Navigation timed out after 60 seconds. (C:\Scrapers\ZolStock\my-crawler\node_modules\@crawlee\core\crawlers\crawler_utils.js:13:11)","7x: Navigation timed out after 60 seconds. (<anonymous>)"]}
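
The first knobs to try are the crawler-level navigation timeout and the retry count; whether 60 s is genuinely too short, or the proxy/target is just slow or blocking, is worth checking separately. A sketch, assuming PlaywrightCrawler:

```typescript
const crawler = new PlaywrightCrawler({
    navigationTimeoutSecs: 180, // default is 60
    maxRequestRetries: 5,       // give slow pages more chances
    async requestHandler({ page }) {
        // ...
    },
});
```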

Proxy Rotation Apify-Python

Hi, I am writing an actor in Python. The problem is: how can I let a user enable Apify proxy rotation via the actor input? I am unable to find that in the docs. I will highly appreciate any help.

Keep browser context alive in puppeteer crawler?

By default, a new context is created for each new request. This means that all data (localStorage, sessionStorage, ...) is wiped out. Is there a way to keep the context across multiple requests?...
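
If I read the browser-pool defaults right, the per-request incognito contexts come from `useIncognitoPages`; turning that off keeps pages in one shared context, and `persistCookiesPerSession` keeps cookies attached to the session. A sketch:

```typescript
const crawler = new PuppeteerCrawler({
    launchContext: {
        useIncognitoPages: false, // share one browser context across requests
    },
    persistCookiesPerSession: true, // cookies survive across requests in a session
    async requestHandler({ page }) {
        // localStorage written here stays visible to later pages in the same browser
    },
});
```

Note that `sessionStorage` is per-tab by spec, so it won't survive a page being closed no matter how the context is configured.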

Use JSDOMCrawler to crawl multiple consecutive links?

I want to crawl one page => get a link from it => crawl that link => get some other link from it => crawl the third link => get an HTML table from there. I need to do it like this because the 2nd and 3rd links change a lot. How can I chain link crawling like this?...
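
Labels let you chain hops like this inside one `requestHandler`: each stage extracts the next URL and enqueues it with the label of the following stage. A sketch; the `a.to-second`/`a.to-third` selectors are placeholders for however you find the links:

```typescript
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    async requestHandler({ request, window, addRequests }) {
        const { document } = window;
        // resolve a relative href against the page we actually loaded
        const next = (selector: string) => {
            const href = document.querySelector(selector)?.getAttribute('href');
            return href ? new URL(href, request.loadedUrl).href : undefined;
        };
        if (request.label === 'SECOND') {
            const url = next('a.to-third'); // hypothetical selector
            if (url) await addRequests([{ url, label: 'THIRD' }]);
        } else if (request.label === 'THIRD') {
            const table = document.querySelector('table');
            // ...extract the rows you need from `table` here
        } else {
            const url = next('a.to-second'); // hypothetical selector
            if (url) await addRequests([{ url, label: 'SECOND' }]);
        }
    },
});

await crawler.run(['https://example.com/start']);
```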

How to stop Puppeteer crawler without causing error?

I have forks in my script, and if certain conditions are met, I would like to stop the script. How should I do that? `page.close()` creates issues, especially if I run concurrently.
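
Rather than closing pages yourself, aborting the autoscaled pool lets in-flight requests finish and shuts the crawler down cleanly; the `crawler` instance is available on the handler context. A sketch with a hypothetical `stopConditionMet` check:

```typescript
const crawler = new PuppeteerCrawler({
    async requestHandler({ page, crawler }) {
        if (await stopConditionMet(page)) { // hypothetical predicate
            // stop fetching new requests and wind the crawler down gracefully
            await crawler.autoscaledPool?.abort();
        }
    },
});
```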

Sessions and proxies?

I am having a hard time understanding sessions and proxies. I have the following crawler setup:

```typescript
const crawler = new PuppeteerCrawler({
    requestList,
    ...
```
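
The mental model, as I understand it: a session is one "identity" (cookies plus, when a proxy configuration is present, a sticky proxy URL), and the session pool rotates and retires identities as they get blocked. A sketch wiring the pieces together:

```typescript
const crawler = new PuppeteerCrawler({
    requestList,
    proxyConfiguration,             // each session sticks to one proxy URL
    useSessionPool: true,
    persistCookiesPerSession: true, // cookies travel with the session, not the page
    sessionPoolOptions: {
        maxPoolSize: 20,            // at most 20 concurrent identities
        sessionOptions: { maxUsageCount: 50 }, // retire a session after 50 uses
    },
});
```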

Increasing a memory limit

Hello, I'm trying to increase the memory limit on my computer with 4 GB RAM total, from the default 1 GB to 2 GB. I tried to set `CRAWLEE_MEMORY_MBYTES` to 2048 via crawlee.json, global settings, and a custom configuration too, but it's still only 1 GB. Any idea where the problem could be? Thanks
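
If the env var and crawlee.json are not being picked up, passing a `Configuration` instance straight to the crawler is another route; `memoryMbytes` is the option behind `CRAWLEE_MEMORY_MBYTES`. A sketch:

```typescript
import { Configuration, PlaywrightCrawler } from 'crawlee';

const config = new Configuration({ memoryMbytes: 2048 });

// crawler constructors accept a Configuration as the second argument
const crawler = new PlaywrightCrawler({ /* ...options... */ }, config);
```

Also note the autoscaled pool only targets a fraction of the limit (`maxUsedMemoryRatio`, 0.7 by default), so logs will report less than the configured total.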

Stop crawler at specific request

Is it possible to stop the crawler at a specific request and leave the window open to inspect it via devtools? When using `headless: false`, it seems like the window is closed after the requestQueue has been processed. It would also be nice to have the `devtools: true` option in the Puppeteer config...
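
One workaround is to park the handler on the request you care about so the page never closes, with a long handler timeout and devtools enabled at launch. A sketch; the `inspect-me` URL marker is hypothetical:

```typescript
const crawler = new PuppeteerCrawler({
    launchContext: {
        // devtools is a plain Puppeteer launch option, so it can go here
        launchOptions: { headless: false, devtools: true },
    },
    requestHandlerTimeoutSecs: 3600, // don't time out while you inspect
    async requestHandler({ page, request }) {
        if (request.url.includes('inspect-me')) { // hypothetical marker
            await new Promise(() => {}); // park the handler; the window stays open
        }
    },
});
```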

Using SessionStorage in PuppeteerCrawler

How can we use sessionStorage in PuppeteerCrawler? I didn't find anything related to session storage in the documentation, so I tried to guess some reasonable config values. ...
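
If you mean the browser's `sessionStorage` (as opposed to Crawlee's session pool), there is no dedicated config option for it; one workaround is seeding it before each navigation via Puppeteer's `evaluateOnNewDocument`. A sketch with a hypothetical key:

```typescript
const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // runs in the page before any of its own scripts, on every navigation
            await page.evaluateOnNewDocument(() => {
                sessionStorage.setItem('token', 'value'); // hypothetical entry
            });
        },
    ],
    async requestHandler({ page }) {
        // the page sees the seeded sessionStorage from its first script onward
    },
});
```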

Specific timeout for single request in PuppeteerCrawler

I'm aware that it's possible to set a navigation timeout for the complete crawling process, but I need to wait longer for one specific page without slowing down the whole crawl. Is there a way to do so? Right now I'm just using a setTimeout function, but I wonder if there is a better way to achieve this...
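
A preNavigationHook receives the `gotoOptions` for that navigation, so the timeout can be raised only for requests you mark, leaving the crawl-wide default untouched. A sketch; the `SLOW` label is an assumption about how you'd tag such requests:

```typescript
const crawler = new PuppeteerCrawler({
    navigationTimeoutSecs: 30, // default for everything else
    preNavigationHooks: [
        async ({ request }, gotoOptions) => {
            if (request.label === 'SLOW' && gotoOptions) {
                gotoOptions.timeout = 120_000; // ms, only for this request
            }
        },
    ],
    async requestHandler({ page }) {
        // ...
    },
});
```

Enqueue the slow page as `{ url, label: 'SLOW' }` and everything else keeps the short timeout.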