Crawlee & Apify

This is the official developer community of Apify and Crawlee.

How to predict the required memory for calling an actor from a self-created actor (externally)?

Hi, I have a few questions regarding https://apify.com/maxcopell/zillow-api-scraper:
1. When I run this actor directly, I get no issues, but when I run it from my own actor calling this actor, I get errors about exceeding memory (see the picture with logs). Is there a difference, and would upgrading to Pro be sufficient?
2. I am paying about $0.20 per request (when I scrape only 3 results). Is there a way to lower this (for example no pictures, or HTML only with Cheerio)?
3. "The average cost of using the Zillow Scraper is about $0.25 for every 2,000 results scraped." This does not match what I get, which is around $0.20 per call, returning at most 3 results per call...
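
For the memory question, here is a sketch of calling the actor with an explicit memory limit via the Apify SDK's Actor.call(); the input shape and the 4096 MB value are placeholders, not the scraper's real schema:

```ts
import { Actor } from 'apify';

await Actor.init();

// Explicitly set the memory (in MB) for the called actor's run, so it does not
// inherit a limit that is too low for the target actor.
const run = await Actor.call(
    'maxcopell/zillow-api-scraper',
    { searchUrls: [{ url: 'https://www.zillow.com/...' }] }, // hypothetical input shape
    { memory: 4096 }, // adjust to the memory the actor uses when run directly
);

console.log(`Run finished with status: ${run.status}`);

await Actor.exit();
```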

Node-cron with CheerioCrawler

I'm currently using CheerioCrawler along with node-cron to run my web scraping tasks. However, I've been having an issue with the crawler not stopping once it's done with its tasks. I set up a node-cron job to run every 30 seconds, but the problem is that the crawler stays open after finishing its tasks. When the next cron tick arrives, it seems to create another instance; the first iteration works fine, but subsequent iterations do not start scraping all the pages as expected. The terminal says the crawler has finished its tasks, but it does not start scraping again. I know that I will eventually need to run it at a longer interval, such as every 30 minutes, so I need to figure out how to make it stop once the crawler has finished its tasks...
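
A sketch of one way to structure this: create a fresh crawler per cron tick and guard against overlapping runs. Note that the default request queue remembers handled URLs within the same process, so repeated ticks scraping the same URLs may also need unique request keys or a storage purge; the schedule and URL below are placeholders.

```ts
import cron from 'node-cron';
import { CheerioCrawler } from 'crawlee';

let running = false;

// Every 30 seconds; a fresh crawler is created per tick so a finished instance
// never lingers, and the flag prevents two runs from overlapping.
cron.schedule('*/30 * * * * *', async () => {
    if (running) return; // skip this tick if the previous run has not finished
    running = true;
    try {
        const crawler = new CheerioCrawler({
            requestHandler: async ({ request, $ }) => {
                console.log(`Scraped ${request.url}: ${$('title').text()}`);
            },
        });
        await crawler.run(['https://example.com']); // resolves once the queue is drained
    } finally {
        running = false;
    }
});
```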

CheerioCrawler hangs with 12 million URLs

```
const requestList = await RequestList.open('My-ReqList', allUrls, {
    persistStateKey: 'My-ReqList',
});
console.log(requestList.length());

const crawler = new CheerioCrawler({
    requestList,
    ...
```
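
For request sets this large, one alternative worth sketching (an assumption, not a confirmed fix) is to skip RequestList and feed the URLs into the crawler's request queue in chunks via crawler.addRequests(), so millions of Request objects are never materialised in memory at once:

```ts
import { CheerioCrawler } from 'crawlee';

// Stand-in for the 12 million URLs from the original snippet.
const allUrls: string[] = [];

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        console.log(request.url, $('title').text());
    },
});

// Enqueue in chunks; the request queue itself lives on disk, not in memory.
const BATCH_SIZE = 10_000;
for (let i = 0; i < allUrls.length; i += BATCH_SIZE) {
    await crawler.addRequests(allUrls.slice(i, i + BATCH_SIZE));
}

await crawler.run();
```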

CheerioCrawler works for Amazon.de but gets detected as a bot at amazon.com

Dear all, I am experimenting with CheerioCrawler to scrape Amazon. I followed the tutorial online and it works for Germany, but the same crawler gets detected as a bot for the US. For Germany I am using a German datacenter proxy and it works, but for the USA a US datacenter proxy doesn't work. Below is the configuration. I am building an Amazon scraper for multiple marketplaces, but this inconsistency makes it challenging. `const crawler = new CheerioCrawler({ proxyConfiguration,...`
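
One common workaround sketch for amazon.com is switching the US runs to residential proxies instead of datacenter ones; the RESIDENTIAL group, the selector, and the example product URL below are assumptions and depend on your Apify plan:

```ts
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// amazon.com tends to block datacenter IPs more aggressively, so the US run
// uses the residential proxy group restricted to US exit nodes.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'], // requires residential proxy access on your account
    countryCode: 'US',
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, $ }) => {
        console.log(request.url, $('#productTitle').text().trim()); // assumed selector
    },
});

await crawler.run(['https://www.amazon.com/dp/B08N5WRWNW']); // example product URL

await Actor.exit();
```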

Unable to use Crawlee on AWS Lambda: while loading shared libraries: libnss3.so: cannot open shared o

I have a problem when I deploy with CloudFormation on AWS Lambda. Node version: 16, Crawlee: 3.3, "aws-cdk-lib": "2.29.1", "aws-sdk": "^2.1163.0", ...

Download .xml.gz sitemaps

I'm trying to parse the sitemaps of a website that has .xml.gz sitemaps. In Python I could use gunzip to decompress them and then use them. In Crawlee we only have the `downloadListOfUrls` method; how could I decompress those files before using them? Sitemap: https://www.zoro.com/sitemaps/usa/sitemap-product-10.xml.gz...
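
A possible sketch using Node's built-in zlib and cheerio in XML mode; got-scraping handles the download here, but any HTTP client that returns a buffer would do:

```ts
import { gunzipSync } from 'node:zlib';
import { gotScraping } from 'got-scraping';
import * as cheerio from 'cheerio';

// Download the .xml.gz sitemap as a buffer, decompress it, then pull the <loc>
// URLs out of the XML.
async function loadGzippedSitemap(url: string): Promise<string[]> {
    const { body } = await gotScraping({ url, responseType: 'buffer' });
    const xml = gunzipSync(body as Buffer).toString('utf-8');
    const $ = cheerio.load(xml, { xmlMode: true });
    return $('loc').map((_, el) => $(el).text()).get();
}

const urls = await loadGzippedSitemap('https://www.zoro.com/sitemaps/usa/sitemap-product-10.xml.gz');
console.log(`Found ${urls.length} product URLs`);
```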

PerimeterX

has anyone been able to get past PerimeterX?

Deploying Crawlee in Self-hosted Servers

Hello, I'm quite new to using Crawlee and to scraping with JavaScript in general. I have experience using Python to build medium-scale scraping (multiple Playwright browsers orchestrated by Airflow). Is there an analogue for this in the JS/Node ecosystem, or a more Node-ish way of orchestrating multiple crawlers on a self-hosted server? For now, the option I can think of is wrapping the script in an Express.js app and hitting it with API calls periodically (like cron/Airflow). Is there a better way of doing this? * I've tried searching the forum for scaling/deployment but haven't found anything I could understand and implement...
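
A minimal sketch of that Express wrapper approach; the route, port, and result shape are placeholders. Repeated calls with the same URLs would additionally need a per-run request queue or a storage purge, since the default queue remembers already-handled requests within the process:

```ts
import express from 'express';
import { CheerioCrawler } from 'crawlee';

const app = express();
app.use(express.json());

// An orchestrator (cron, Airflow, etc.) POSTs the start URLs; each call runs a
// fresh crawler instance to completion and returns the scraped items.
app.post('/crawl', async (req, res) => {
    const { urls } = req.body as { urls: string[] };
    const results: { url: string; title: string }[] = [];

    const crawler = new CheerioCrawler({
        requestHandler: async ({ request, $ }) => {
            results.push({ url: request.url, title: $('title').text() });
        },
    });

    await crawler.run(urls);
    res.json(results);
});

app.listen(3000, () => console.log('Scraper service listening on :3000'));
```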

Is it possible to close any dialogs that pop up automatically?

Sometimes a dialog box might pop up on a site and I am not interested in the dialog and would just like it to be dismissed.
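
For native browser dialogs (alert/confirm/prompt), a sketch using Playwright's dialog event from a pre-navigation hook could look like the following; HTML cookie banners or modals are ordinary DOM elements and would need a click instead:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Auto-dismiss any alert/confirm/prompt dialog the page opens.
            page.on('dialog', (dialog) => dialog.dismiss().catch(() => {}));
        },
    ],
    requestHandler: async ({ request, page }) => {
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com']);
```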

How to scrape sites that generate elements with dynamic attributes?

I am trying to scrape a site that generates different CSS classes for the target elements each time the page is rendered. I need to get the value of each of these elements, there are no other attributes to select on or suitable parent elements to traverse, and I would prefer not to use XPath. Is it possible to decode this HTML to its original form to scrape it more easily? Also, is there any technique that would make it possible to detect changes or the addition of pages?...
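
Decoding the HTML back to an "original" form generally isn't possible, but a sketch of selecting by stable text or by a class-name prefix (both assumptions about how the page is structured) might look like this:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        // Hypothetical structure: the label text stays stable even though the class
        // names are regenerated, so anchor on the text and walk to the value sibling.
        const price = $('span')
            .filter((_, el) => $(el).text().trim() === 'Price')
            .next()
            .text()
            .trim();

        // Attribute-prefix selectors also survive hashed suffixes like "price__x1f3k".
        const altPrice = $('[class^="price__"]').first().text().trim();

        console.log(request.url, price || altPrice);
    },
});

await crawler.run(['https://example.com/product']);
```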

Cannot find module after build with TypeScript

This is my TypeScript config:
```json
{
    "extends": "@apify/tsconfig",
    "compilerOptions": {
        ...
```
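
The config above is truncated, so this is only a guess at a frequent cause: the Apify TypeScript templates compile to ES modules, and with that setup every relative import has to reference the compiled .js file explicitly, otherwise the built code fails at runtime with "Cannot find module". A minimal sketch, assuming a hypothetical sibling module src/routes.ts:

```ts
// src/main.ts
// The import uses the ".js" extension of the *compiled* file (not ".ts", not
// extensionless), because Node resolves it against the output in dist/.
import { router } from './routes.js';

console.log(typeof router);
```

It is also worth checking that the "main" field and the start script in package.json point at the compiled entry file rather than the TypeScript source.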

Adding requests via crawler.addRequests([]) is slow in an Express.js app.post() handler

Dear all, I am building a simple API that, when called, adds URLs via the crawler.addRequests() method. On the first call it's quite fast, but on the second and subsequent calls it's extremely slow. I thought this delay might come from me not using the request queue properly. This is what I found in the docs:
Note that RequestList can be used together with RequestQueue by the same crawler. In such cases, each request from RequestList is enqueued into RequestQueue first and then consumed from the latter. This is necessary to avoid the same URL being processed more than once (from the list first and then possibly from the queue). In practical terms, such a combination can be useful when there is a large number of initial URLs, but more URLs would be added dynamically by the crawler.
...
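
One workaround sketch, assuming the slowdown comes from the default request queue accumulating handled requests across calls: give each API call its own named queue and drop it afterwards. This would be the body of the app.post() handler:

```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';
import { randomUUID } from 'node:crypto';

// Called from the Express app.post() handler with the URLs from the request body.
async function scrapeOnce(urls: string[]) {
    // A fresh, uniquely named queue per call, so addRequests never has to
    // deduplicate against thousands of already-handled requests.
    const requestQueue = await RequestQueue.open(`run-${randomUUID()}`);

    const crawler = new CheerioCrawler({
        requestQueue,
        requestHandler: async ({ request, $ }) => {
            console.log(request.url, $('title').text());
        },
    });

    await crawler.addRequests(urls);
    await crawler.run();
    await requestQueue.drop(); // clean up the per-call queue
}
```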

Crawlee Playwright Access to Network requests

Hello, is there a method to access the "network" requests that are sent during the crawl? I'm trying to store image URLs, currently doing `page.$$eval`, however there are some variations in how certain sites embed image URLs...
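
A sketch using Playwright's response event from a pre-navigation hook, which records image URLs regardless of how the page embeds them (img tags, CSS backgrounds, fetch calls, etc.):

```ts
import { PlaywrightCrawler } from 'crawlee';

const imageUrls = new Set<string>();

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Inspect every network response and keep the ones that are images.
            page.on('response', (response) => {
                const type = response.headers()['content-type'] ?? '';
                if (type.startsWith('image/')) imageUrls.add(response.url());
            });
        },
    ],
    requestHandler: async ({ page }) => {
        await page.waitForLoadState('networkidle'); // let lazy-loaded images arrive
    },
});

await crawler.run(['https://example.com']);
console.log([...imageUrls]);
```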

Trying to use enqueueLinksByClickingElements

The page and requestQueue parameters are required by this function, but I don't know what I should pass. This is the doc: https://crawlee.dev/api/playwright-crawler/namespace/playwrightClickElements#enqueueLinksByClickingElements Thanks for the help...
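
A sketch of wiring those two parameters up inside a PlaywrightCrawler request handler, assuming the helper is importable via the playwrightClickElements namespace from that doc page; 'a.load-more' stands in for your real clickable selector:

```ts
import { PlaywrightCrawler, RequestQueue, playwrightClickElements } from 'crawlee';

const requestQueue = await RequestQueue.open();

const crawler = new PlaywrightCrawler({
    requestQueue,
    requestHandler: async ({ page, request }) => {
        console.log(`Processing ${request.url}`);
        // page comes from the crawling context; the queue is the same one the
        // crawler consumes, so clicked-open links feed back into the crawl.
        await playwrightClickElements.enqueueLinksByClickingElements({
            page,
            requestQueue,
            selector: 'a.load-more', // hypothetical selector for the clickable elements
        });
    },
});

await crawler.run(['https://example.com']);
```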

Configure Apify Proxy urls in a Crawlee Playwright crawler

I've been trying to use Apify proxies in my Crawlee crawler, but have had no luck with it, always getting a net::ERR_TUNNEL_CONNECTION_FAILED error. Evidently I'm doing something wrong, but the documentation has been extremely unhelpful, and it's been very hard to find tutorials on the matter. The relevant sections of code are as follows:
```ts
import {
    PlaywrightCrawler,
    ...
```
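
ERR_TUNNEL_CONNECTION_FAILED often points at bad proxy credentials when the proxy URL is assembled by hand. A sketch that lets the Apify SDK build the configuration instead (it reads APIFY_PROXY_PASSWORD locally, or the platform credentials when running there); the RESIDENTIAL group is optional and an assumption here:

```ts
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// Let the SDK construct the proxy URLs and rotate sessions.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'], // omit to use the default datacenter pool
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: async ({ page, request }) => {
        console.log(request.url, await page.title());
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();
```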

Node running out of memory

I'm scraping several e-commerce stores in a single project, and after about 30k products Node crashes because it runs out of memory. Raising the amount of memory allocated to Node is not a good solution, as I plan to increase the incoming data to at least 10x. The most obvious solution seems to be to scale horizontally and run a Node instance for each e-commerce store I want to scrape. However, is there any way to decrease Crawlee's memory load? I would be happy to use streaming for exporting the datasets, and the dataset items are already persisted to local files...
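
A sketch of two knobs that tend to lower Crawlee's footprint, with made-up values that need tuning against the real workload: capping concurrency, and telling the autoscaled pool how much memory it may assume via the Configuration's memoryMbytes option:

```ts
import { CheerioCrawler, Configuration } from 'crawlee';

const crawler = new CheerioCrawler({
    maxConcurrency: 20,            // fewer parallel requests held in memory at once
    requestHandlerTimeoutSecs: 60,
    requestHandler: async ({ request, $, pushData }) => {
        await pushData({ url: request.url, title: $('title').text() });
    },
}, new Configuration({
    memoryMbytes: 4096,            // memory budget the autoscaled pool scales against
}));

await crawler.run(['https://example.com']);
```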

I am trying to reset the Crawlee cache in Next.js but it's not working, can anyone help me?

This is my Next.js code. On the initial request the data is displayed, but if I make the request again it returns an empty array, or I have to restart the application. With this I get: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":445} ...
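
One sketch, assuming the empty second response comes from the default request queue already marking those URLs as handled (which would explain the "requestsTotal":0 statistics above): purge the default storages before every run.

```ts
import { CheerioCrawler, purgeDefaultStorages } from 'crawlee';

// Hypothetical helper called from the Next.js API route / server action.
export async function scrape(urls: string[]) {
    // Reset the default request queue and dataset so this run starts clean.
    await purgeDefaultStorages();

    const results: { url: string; title: string }[] = [];
    const crawler = new CheerioCrawler({
        requestHandler: async ({ request, $ }) => {
            results.push({ url: request.url, title: $('title').text() });
        },
    });

    await crawler.run(urls);
    return results;
}
```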

Is it possible to stop the crawler if a condition is met?

Hi. I'm making a crawler (CheerioCrawler) that scrapes a news website. I start the crawler by giving it a list of URLs with all the pages of the article list (an array containing site.com/?page=1, site.com/?page=2, etc.). For every article list page, I scrape every article inside it. I was wondering: if my URL site.com/?page=60 (for instance) doesn't have any articles on it, can I stop the execution of the crawler at that point? I know how to check whether there are any articles on the page, but I can't find how to stop the crawler at a certain point (without completing all the URLs in the list). Thank you very much!...
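
A sketch of one way to do this from inside the request handler, using the crawler's autoscaled pool; requests already in flight still finish, and '.article' is a placeholder for your real article selector:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $, crawler }) => {
        const articles = $('.article').length; // hypothetical selector for article cards

        if (articles === 0) {
            // Aborting the autoscaled pool stops the crawler from picking up any
            // further requests from the list or queue.
            console.log(`No articles on ${request.url}, stopping the crawl.`);
            await crawler.autoscaledPool?.abort();
            return;
        }
        // ... scrape the article list here
    },
});

await crawler.run(['https://site.com/?page=1', 'https://site.com/?page=2']);
```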

IP address of the current browser

Maybe someone has an idea of how to get the IP address of the current Puppeteer browser instance that is using a proxy? Is there another way than going to a "what's my IP" page and scraping it?...
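
If the goal is mainly to avoid driving the browser itself to an IP page, one sketch is to make a lightweight HTTP call through the same proxy URL the browser was launched with, taken from proxyInfo in the crawling context; the proxy URL below is a placeholder, and an IP-echo service is still needed unless the proxy provider exposes the exit IP directly:

```ts
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Hypothetical proxy list; replace with your own proxy URLs.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@my-proxy.example.com:8000'],
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    requestHandler: async ({ page, proxyInfo }) => {
        if (proxyInfo) {
            // Same proxy as this browser instance, but via a plain HTTP request.
            const { body } = await gotScraping({
                url: 'https://api.ipify.org',
                proxyUrl: proxyInfo.url,
            });
            console.log(`External IP for this browser: ${body}`);
        }
        console.log(await page.title());
    },
});

await crawler.run(['https://example.com']);
```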

How to wait for the browser to close, like Playwright's `await browser.close();`?
