Crawlee & Apify


This is the official developer community of Apify and Crawlee.


How does createSessionFunction create sessions when parallel requests are being made?

I have a custom function which opens a browser to get cookies. The problem is that my machine is very small, so when multiple sessions are created at once it tries to open many browsers at the same time. Can I somehow make session creation sequential? Even though I need thousands of sessions, at any point in time only one session should be created and none in parallel, so only one browser instance is ever running. `createSessionFunction: async (sessionPool, options) => { var new_session = new Session({ sessionPool }); ...`
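One way to get this behaviour is to serialize the browser launches yourself with a simple promise chain inside `createSessionFunction`. A minimal sketch, assuming a hypothetical `getCookiesWithBrowser()` helper that opens the browser and returns cookies:

```js
import { Session } from 'crawlee';

// Hypothetical helper that launches a browser and returns cookies
// (in the format produced by page.cookies()); not part of Crawlee.
// const getCookiesWithBrowser = async () => { ... };

// A promise chain acts as a mutex: each new session creation waits for the
// previous one to finish, so only one browser is ever open at a time.
let lastCreation = Promise.resolve();

const createSessionFunction = async (sessionPool, options) => {
    const creation = lastCreation.then(async () => {
        const session = new Session({ sessionPool });
        const cookies = await getCookiesWithBrowser(); // runs strictly one at a time
        session.setCookies(cookies, 'https://example.com');
        return session;
    });
    // Keep the chain alive even if one creation fails.
    lastCreation = creation.catch(() => {});
    return creation;
};
```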

Request queue with id Error

Hello, how do I manage multiple instances of Puppeteer running at the same time? If I launch one browser, everything works fine, but while the first link is being crawled, launching one more instance gives me `Request queue with id: 63925d40-c5fa-4b8d-a1eb-a4f7b93a5253 does not exist`. The same problem occurs both on localhost and on the Ubuntu prod server...
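One common cause is two processes sharing (or purging) the default request queue: Crawlee purges the default storages on startup, so a second instance can remove the queue the first one is still using. A hedged sketch of giving each crawler instance its own named queue, which is not purged automatically:

```js
import { PuppeteerCrawler, RequestQueue } from 'crawlee';

// Give each crawler instance its own named queue so parallel runs
// don't fight over (or purge) the shared default queue.
const requestQueue = await RequestQueue.open(`run-${Date.now()}`);

const crawler = new PuppeteerCrawler({
    requestQueue,
    async requestHandler({ page, request }) {
        console.log(`Crawling ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com']);
```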

How would I build a crawler that accepts API requests to submit forms for a user?

I'm thinking of putting it in a serverless function, but I suspect there's a better way.
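One possible shape, sketched below, is a small Express server that accepts a POST describing the form and runs a PuppeteerCrawler to fill and submit it. The route, payload shape, and selectors are all assumptions, and for real traffic you would likely keep one long-lived crawler rather than constructing one per request:

```js
import express from 'express';
import { PuppeteerCrawler } from 'crawlee';

const app = express();
app.use(express.json());

// Hypothetical endpoint: the client posts { url, fields } describing the form to submit.
app.post('/submit-form', async (req, res) => {
    const { url, fields } = req.body;

    const crawler = new PuppeteerCrawler({
        async requestHandler({ page }) {
            // Fill each field by CSS selector, then submit (selectors are assumptions).
            for (const [selector, value] of Object.entries(fields)) {
                await page.type(selector, value);
            }
            await page.click('button[type="submit"]');
        },
    });

    await crawler.run([url]);
    res.json({ status: 'done' });
});

app.listen(3000);
```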

deleting request queues

How do you delete a request queue once crawling is finished, and how can you tell that crawling is finished so you know when to delete it?
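`crawler.run()` resolves once the request queue has been fully processed, and a queue object can be removed with `drop()`. A minimal sketch:

```js
import { CheerioCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open('my-run'); // named queue; the name is arbitrary

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ $, request }) {
        console.log(request.url, $('title').text());
    },
});

// run() resolves once the queue has been fully processed,
// so anything after the await runs "when crawling is finished".
await crawler.run(['https://example.com']);

// Remove the queue and its underlying storage.
await requestQueue.drop();
```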

Suggestions to integrate Crawlee in a new cloud platform

TL;DR: I'm a developer working on Estela, a cloud web scraping platform. We want to integrate support for Crawlee to expand our technology options. We use Kafka to store requests, stats, logs, and items. We're exploring different solutions, such as middlewares, hooks, or a custom crawler, to make it work smoothly. Our goal is to require minimal code modification from users migrating their spiders from Crawlee to Crawlee + Estela. Any technical advice would be much appreciated. ----- Hello! I'm a developer working on https://estela.bitmaker.la/docs/, a platform for web scraping in the cloud. We currently support Scrapy and Requests, but our focus is on expanding the platform to include Crawlee....
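As one possible integration point (not an official recommendation), pre-navigation hooks and the context's `pushData` could be wrapped by the platform without users touching their handlers. A rough sketch, assuming `kafkajs` as the transport and placeholder broker/topic names:

```js
import { CheerioCrawler } from 'crawlee';
import { Kafka } from 'kafkajs'; // assumed transport, matching Estela's Kafka-based pipeline

const kafka = new Kafka({ brokers: ['localhost:9092'] });
const producer = kafka.producer();
await producer.connect();

const crawler = new CheerioCrawler({
    // Hooks let the platform observe outgoing requests without changing user handler code.
    preNavigationHooks: [
        async ({ request }) => {
            await producer.send({
                topic: 'requests',
                messages: [{ value: JSON.stringify({ url: request.url }) }],
            });
        },
    ],
    async requestHandler({ request, $, pushData }) {
        // User code stays as-is; pushData could likewise be wrapped by the
        // platform to forward items to Kafka.
        await pushData({ url: request.url, title: $('title').text() });
    },
});

await crawler.run(['https://example.com']);
```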

Why does Crawlee's CPU utilization drop and seemingly stop processing any requests?

This always happens after about one hour of running. Here is the link to most of the code: https://medium.com/p/how-to-efficiently-scrape-millions-of-google-businesses-on-a-large-scale-using-a-distributed-35b9140030eb Most of the machines are c5d.2xlarge SPOT instances. To work around the issue, I had to build a cron task that finds the instances with low CPU utilization and terminates them.
```
CRAWLEE_MIN_CONCURRENCY: "3"
CRAWLEE_MAX_CONCURRENCY: "15"
...
```
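Not a diagnosis of the stall, but when the autoscaled pool appears to idle it can help to set the same limits in code together with explicit timeouts, so hung navigations fail instead of occupying concurrency slots. A sketch with example values:

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Equivalent of CRAWLEE_MIN_CONCURRENCY / CRAWLEE_MAX_CONCURRENCY,
    // but set explicitly in code so the limits live next to the crawler.
    minConcurrency: 3,
    maxConcurrency: 15,
    // Fail a stuck request instead of letting it hang (values are examples).
    requestHandlerTimeoutSecs: 120,
    navigationTimeoutSecs: 60,
    async requestHandler({ page, request }) {
        console.log(request.url, await page.title());
    },
});

await crawler.run(['https://example.com']);
```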

saving data in apify actor and cleaning

I've tried saving the data I scrape from my actors to a rawdata.json file, but I don't get a JSON output even though the scraping works. How would I save the data to the Apify console so that I can then use MongoDB to take that data and put it into my database? ...
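On the platform, data pushed with `Actor.pushData()` lands in the run's default dataset, which shows up in the Apify console and can be fetched over the API (and from there inserted into MongoDB by your own service). A minimal sketch:

```js
import { Actor } from 'apify';
import { PuppeteerCrawler } from 'crawlee';

await Actor.init();

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        const title = await page.title();
        // Items pushed here end up in the run's default dataset, visible in the
        // Apify console and downloadable via the API as JSON or CSV.
        await Actor.pushData({ url: request.url, title });
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();
```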

❓ Help Needed: Downloading Linked PDF Files with Crawlee 🕸📥

Hello everyone, I need some help with Crawlee. I've been using CheerioCrawler to scrape pages and I've managed to extract links and store page titles and URLs into a dataset. Now I want to add functionality to download linked files, like PDFs, from the scraped pages. However, I'm unsure how to do this natively with Crawlee. Here's my current code:...
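One approach, sketched under the assumption that PDF links are plain `<a href="...pdf">` anchors, is to download the binaries with the context's `sendRequest()` and store them in a key-value store, rather than routing them through the Cheerio (HTML) pipeline:

```js
import { CheerioCrawler, KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open('pdfs'); // named key-value store; the name is arbitrary

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, sendRequest, pushData }) {
        await pushData({ url: request.url, title: $('title').text() });

        // Collect absolute PDF URLs from the page (selector is an assumption).
        const pdfUrls = $('a[href$=".pdf"]')
            .map((_, el) => new URL($(el).attr('href'), request.loadedUrl).href)
            .get();

        for (const pdfUrl of pdfUrls) {
            // Download the binary body with the context's HTTP client
            // instead of sending it through the HTML-parsing pipeline.
            const { body } = await sendRequest({ url: pdfUrl, responseType: 'buffer' });
            const key = pdfUrl.replace(/[^a-zA-Z0-9!\-_.'()]/g, '-');
            await store.setValue(key, body, { contentType: 'application/pdf' });
        }
    },
});

await crawler.run(['https://example.com']);
```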

Self-hosted API

Is there any way to expose a self-hosted API with Crawlee, similar to the Apify API: get the latest dataset, make requests, enqueue requests, etc.?
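There is no built-in server, but the storage classes can be wrapped in any HTTP framework. A rough sketch with Express and made-up route names:

```js
import express from 'express';
import { Dataset, RequestQueue } from 'crawlee';

const app = express();
app.use(express.json());

// Return items from the default dataset, roughly analogous to the Apify
// "get dataset items" endpoint (route names here are invented).
app.get('/datasets/default/items', async (req, res) => {
    const dataset = await Dataset.open();
    const { items } = await dataset.getData();
    res.json(items);
});

// Enqueue a URL into the default request queue for a running crawler to pick up.
app.post('/request-queues/default/requests', async (req, res) => {
    const queue = await RequestQueue.open();
    const result = await queue.addRequest({ url: req.body.url });
    res.json(result);
});

app.listen(3000, () => console.log('Self-hosted API listening on :3000'));
```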

Initiate a crawler's actor with a POST fetch & avoid browser

I have a use case where an Actor should start out with a POST request rather than the usual GET request. After that, I'll simply do a series of HTTP requests and parse the responses myself. There is no need for a browser, just plain fetch. Is BasicCrawler or HttpCrawler the right option? ...
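HttpCrawler is likely the better fit, since BasicCrawler leaves all fetching to you. A request can be given a method and payload directly; a minimal sketch with a placeholder endpoint:

```js
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, body }) {
        // `body` holds the raw response; parse it yourself (e.g. JSON.parse) as needed.
        console.log(`Response for ${request.url}:`, body.toString().slice(0, 200));
    },
});

await crawler.run([
    {
        url: 'https://example.com/api/search', // example endpoint, not a real one
        method: 'POST',
        payload: JSON.stringify({ query: 'shoes' }),
        headers: { 'content-type': 'application/json' },
        // Identical POST URLs need distinct uniqueKeys if several are enqueued.
        uniqueKey: 'search-shoes',
    },
]);
```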

Crawler works locally but not on cloud

Hello, I've built a Puppeteer crawler, nothing special about it. It works flawlessly locally. I tried to deploy it to AWS Batch with Fargate and got navigation timeouts after 60 seconds; I switched to EC2, navigation timeouts after 60 seconds; I increased the navigation timeout to 120 seconds, same error. I switched proxies between BrightData and Oxylabs, same issue. I deployed to Apify, same issue. ...
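Not a diagnosis, but the knobs that usually matter for containerized deployments are the proxy configuration, the navigation timeout, and container-friendly launch flags. A sketch with placeholder proxy credentials:

```js
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy URL; plug in the BrightData/Oxylabs endpoint already in use.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@proxy.example.com:8000'],
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    navigationTimeoutSecs: 120,
    launchContext: {
        launchOptions: {
            // Flags commonly needed inside containers without a Chrome sandbox.
            args: ['--no-sandbox', '--disable-dev-shm-usage'],
        },
    },
    async requestHandler({ page, request }) {
        console.log(request.url, await page.title());
    },
});

await crawler.run(['https://example.com']);
```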

No such file or directory storage/request_queues/default/JoxD7mAqj47ssmS.json

I'm trying to run a fairly simple scraper, but I keep getting this error. I want to scrape around 64,000 pages, but I get the "no such file" error every time. Setting waitForAllRequestsToBeAdded to true doesn't fix the issue. This is how I'm setting up and running the crawler:
```js
const opts = {
...
```
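Not necessarily the fix for the missing-file error, but with ~64,000 URLs it can help to add requests in smaller, fully awaited batches so each batch is persisted to the queue before the next begins. A sketch with placeholder URLs:

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        console.log(request.url, $('title').text());
    },
});

// `urls` would hold the ~64,000 page URLs; these two are placeholders.
const urls = ['https://example.com/page/1', 'https://example.com/page/2'];

// Add the URLs in modest, fully awaited batches instead of one huge call.
const BATCH_SIZE = 1000;
for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    await crawler.addRequests(urls.slice(i, i + BATCH_SIZE), {
        waitForAllRequestsToBeAdded: true,
    });
}

await crawler.run();
```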

Implement Apify Google Maps Scraper in an express server

Hello, I'm developing a SaaS related to scraping. I have developed the first module to scrape Google Maps places and contact details. I'm currently using the Bright Data API, and I have built a web scraper using Puppeteer and other packages....
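If the goal is to call the existing Apify Google Maps Scraper from an Express server rather than maintaining your own Puppeteer scraper, the `apify-client` package can start an Actor run and read its dataset. A sketch; the Actor ID and input fields should be double-checked against the Apify Store listing:

```js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Example Express-style handler; 'compass/crawler-google-places' is the ID the
// Google Maps Scraper is believed to be published under (verify in the Store).
export async function scrapePlaces(req, res) {
    const run = await client.actor('compass/crawler-google-places').call({
        searchStringsArray: [req.query.q],   // input fields here are illustrative
        maxCrawledPlacesPerSearch: 20,
    });

    // Results land in the run's default dataset.
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    res.json(items);
}
```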

Crawlee uses 500GB of storage

My problem is that after tens of thousands of crawls, I've stored hundreds of GBs worth of user profile data in the temp directory. How can I prevent Crawlee from storing so much data? ----- Use case ...
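Assuming the bulk of the data is per-browser user profiles, two options worth trying are incognito pages (which avoid persistent profiles) and retiring browsers after a bounded number of pages. A sketch; whether this removes the specific temp data you're seeing depends on what is writing it:

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        // Incognito pages (browser contexts) don't write a persistent user
        // profile to disk for every page, which limits temp-directory growth.
        useIncognitoPages: true,
    },
    browserPoolOptions: {
        // Recycle browsers after a bounded number of pages (the number is an example).
        retireBrowserAfterPageCount: 100,
    },
    async requestHandler({ page, request }) {
        console.log(request.url, await page.title());
    },
});

await crawler.run(['https://example.com']);
```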

How to persist the context with Crawlee?

Hey, so I am fairly new to Crawlee and have spent the better part of two days trying to figure out how to have a persistent context when crawling with Crawlee. I am scraping a website that requires login, and I would like to avoid the overhead of logging in every time a new crawler is run. ...
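One option is to point the browser at a persistent profile directory via `launchContext.userDataDir`, so cookies and local storage from the logged-in session survive between runs. A sketch with a placeholder directory:

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        // Persistent profile directory: cookies and localStorage from the
        // logged-in session are reused on the next run (path is an example).
        userDataDir: './storage/browser-profile',
    },
    // Keep cookies tied to sessions as well, so rotated sessions reuse them.
    persistCookiesPerSession: true,
    async requestHandler({ page, request }) {
        console.log(request.url, await page.title());
    },
});

await crawler.run(['https://example.com/account']);
```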

Override default terminal logging

I have a minor question, if anyone could point me in the right direction. I have a crawler set up using the CheerioCrawler where I am pulling URLs from a database in batches and adding them to the requestQueue. Everything is working fine; however, for my specific application I do not want to retry any URLs when any type of error occurs (SSL, status codes, etc.), mainly due to proxy rotation; the details are not important. Which brings me to my problem: is there any way to override Crawlee's default terminal logging? It is clogging up the terminal with errors and stack traces from requests reaching the maximum number of retries. ERROR CheerioCrawler: Request failed and reached maximum retries. RequestError: The proxy server rejected the request with status code 502 (Bad Gateway):...
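A combination that usually quiets this down: lower Crawlee's log level, disable retries, and supply your own `failedRequestHandler` so failures are handled without the default stack traces. A sketch:

```js
import { CheerioCrawler, log, LogLevel } from 'crawlee';

// Reduce Crawlee's built-in logging (LogLevel.OFF drops everything).
log.setLevel(LogLevel.ERROR);

const crawler = new CheerioCrawler({
    // Don't retry failed URLs at all.
    maxRequestRetries: 0,
    async requestHandler({ request, $ }) {
        console.log(request.url, $('title').text());
    },
    // Failed requests end up here instead of being logged with a full stack trace.
    failedRequestHandler({ request }, error) {
        console.log(`Skipped ${request.url}: ${error.message}`);
    },
});

await crawler.run(['https://example.com']);
```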

The code sometimes returns 'undefined' IDK why.

Hi, I'm trying to add the link to the next page to the requestQueue. I know the link should always be there, but for some reason I'm getting the error (see screenshot). I'm not sure why it's doing that. Any suggestions?

enqueueLinksByClickingElements help

This is the code:
```
await utils.puppeteer.enqueueLinksByClickingElements({
    page,
    requestQueue: RequestQueue.open(),
...
```
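One likely issue in the snippet is that `RequestQueue.open()` returns a Promise, so it has to be awaited before being passed in; the call also needs a `selector`. A corrected sketch keeping the same `utils.puppeteer` helper (here `page` and `utils` come from the surrounding code in the original snippet):

```js
// `page` and `utils` are assumed to come from the existing crawler setup.
// RequestQueue.open() is async, so await it before passing the queue in.
const requestQueue = await RequestQueue.open();

await utils.puppeteer.enqueueLinksByClickingElements({
    page,
    requestQueue,
    selector: 'a.next-page', // example selector, adjust to the actual elements
});
```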

Puppeteer Module not found on Vercel

Do you have a working example of PuppeteerCrawler running in a serverless function deployed on Vercel? I get the following error: `"MODULE_NOT_FOUND","path":"/var/task/node_modules/puppeteer/package.json`...