Crawlee & Apify


This is the official developer community of Apify and Crawlee.


Help with Apify API: Getting Input Schema or OpenAPI Schema for each Agent.

Hi Devs, does anyone know how I can get the input schema of an Actor? I've been trying this for several months with no success 🫠 (the OpenAPI schema would also work). I'm building an AI app that uses Apify as a backend for tools, so obviously users need to create an Apify account to make this work. I know I need the Actor ID, but from there I don't know what to do, because when I want to run the Actor synchronously I can't: it asks me for that input schema, which I don't know where to find. The best result would be a way of getting the OpenAPI schema or input schema for each Actor using the user's credentials....
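One approach that may work is fetching the Actor's build detail through the API, since input schemas are attached to builds. This is a hedged sketch: the exact path (`/v2/acts/:actorId/builds/:buildId`) and whether the build object exposes an `inputSchema` field should be verified against the Apify API v2 reference, and the token would come from the user's account.

```javascript
// Sketch: fetch an Actor build and read its input schema.
// Assumption: the build detail includes `inputSchema` (possibly as a
// JSON string) -- verify against the Apify API v2 reference.
const APIFY_BASE = 'https://api.apify.com/v2';

// Pure helper: the schema may arrive as a JSON string or an object.
function parseInputSchema(raw) {
  if (raw == null) return null;
  return typeof raw === 'string' ? JSON.parse(raw) : raw;
}

async function getActorInputSchema(actorId, buildId, token) {
  const res = await fetch(
    `${APIFY_BASE}/acts/${actorId}/builds/${buildId}?token=${token}`,
  );
  if (!res.ok) throw new Error(`Apify API returned ${res.status}`);
  const { data } = await res.json();
  return parseInputSchema(data.inputSchema);
}
```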

customData in input isn’t going through to output

I am using google-search-scraper and am successfully defining some customData, which is going into the input. How do I get google-search-scraper to return the customData in the output?...
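If the scraper's output items don't carry the customData through, one workaround is to re-attach it yourself after the run: you already know the customData (you passed it as input), so merge it into the dataset items client-side. This is a sketch, not the scraper's own behavior; the apify-client usage below is commented and the field names match the question.

```javascript
// Pure helper: stamp the run's customData onto every output item.
function attachCustomData(items, customData) {
  return items.map((item) => ({ ...item, customData }));
}

// Usage sketch with apify-client:
// const { ApifyClient } = require('apify-client');
// const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
// const { items } = await client.dataset(run.defaultDatasetId).listItems();
// const merged = attachCustomData(items, input.customData);
```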

600 Results but 867 Successful Crawls?

I have a discrepancy between my results and my successful crawls in a daily running actor. This is the first time I’ve run into this issue and would like help understanding the cause. Even the Actor’s stats say that I had 700+ successful crawls, but my results dataset only has 600 items (typically it was a little over 800).

Use terminal alternate screen

This is probably a long shot, but I wonder if you can provide more information on how the log is shown when running an actor. Is this the docker terminal? Or just some observability for the logs? I would like to have a small TUI to monitor the crawl, it works locally but the Apify logs don't show anything. Is there a way to use the log screen as a TUI?...

Question: About docker image size

Hello Team! Will reducing the Docker image size reduce loading time and make the Actor start faster, or is it the same? Thank you!...

Can actors instances run 24/7?

When looking at a run log (free plan) I see it always creates a container and starts the Actor. Is there a way in paid mode to have a container always up and running? Or is that not needed at all, since I'll be using an API? For time-sensitive data, that time is crucial......

APIFY PROXY ECONNREFUSED 127.0.0.1:80

Hello, I'm in trouble with Apify Proxy (on a free plan): I'm getting a connection refused error, with and without cloudscraper. Any ideas please? ...
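An `ECONNREFUSED 127.0.0.1:80` usually means the HTTP client never received a proxy URL (or received a malformed one) and fell back to a local default, rather than the proxy itself refusing you. A minimal sketch of the documented Apify Proxy URL shape (`proxy.apify.com:8000`, username `auto` for automatic group selection) is below; the password is your Apify Proxy password from the Console.

```javascript
// Sketch: build the Apify Proxy URL explicitly, so the client never
// silently falls back to localhost.
function buildApifyProxyUrl(username, password) {
  // `username` is e.g. 'auto' or 'groups-RESIDENTIAL'.
  return `http://${username}:${password}@proxy.apify.com:8000`;
}

// Usage sketch: pass the result as the proxyUrl / proxy option of your
// HTTP client, and log it (minus the password) to confirm it is set.
```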

New Error on Long-Running Actor

Starting this afternoon, the logs of one of my daily Actors (which has run successfully every day for the past 30 days) are being flooded with this message:
```
2024-06-13T04:28:54.490Z WARN  ApifyClient: API request failed 4 times. Max attempts: 9.
2024-06-13T04:28:54.492Z Cause:ApifyApiError: You have exceeded the rate limit of 30 requests per second
2024-06-13T04:28:54.495Z   clientMethod: RequestQueueClient.get
2024-06-13T04:28:54.497Z   statusCode: 429
```
...
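The 429s in a log like this come from the request-queue API being polled faster than the rate limit allows; the client already retries with backoff (that is what "failed 4 times, Max attempts: 9" means). A hedged sketch: apify-client exposes retry tuning at construction time (option names per the apify-client docs; values here are illustrative), and the backoff math itself is just exponential doubling with a cap.

```javascript
// Sketch: loosen the client's retry behaviour so bursts of 429s are
// absorbed instead of surfacing as warnings.
// const { ApifyClient } = require('apify-client');
// const client = new ApifyClient({
//   token: process.env.APIFY_TOKEN,
//   maxRetries: 9,                       // illustrative values
//   minDelayBetweenRetriesMillis: 500,
// });

// Pure helper: exponential backoff delay for attempt n (0-based).
function backoffMillis(attempt, baseMillis = 500, capMillis = 30_000) {
  return Math.min(baseMillis * 2 ** attempt, capMillis);
}
```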

Apify won't load the Crawlee/Playwright browser

Hi everyone. I've built a scraping tool using Crawlee and Playwright, and while it runs successfully locally, when I deploy to Apify it gives me an error:
```
Error processing sheet test-sheet: Failed to launch browser. Please check the following:
2024-06-11T16:04:09.691Z - Make sure your Dockerfile extends apify/actor-node-playwright-* (with a correct browser name).
2024-06-11T16:04:09.692Z - Try installing the required dependencies by running npx playwright install --with-deps (https://playwright.dev/docs/browsers).
2024-06-11T16:04:09.693Z
2024-06-11T16:04:09.694Z The original error is available in the `cause` property.
Below is the error received when trying to launch a browser:
```
...
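The error message itself points at the usual cause: locally Playwright finds a browser you installed yourself, but on the platform the image must bundle one. A sketch of a Dockerfile along the lines the error suggests, assuming a Chromium launcher (swap the base image for the browser you actually launch; the tag is illustrative):

```dockerfile
# Sketch: extend an Apify base image that bundles the matching browser.
FROM apify/actor-node-playwright-chrome:20
COPY package*.json ./
RUN npm ci --omit=dev
COPY . ./
CMD npm start --silent
```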

Contact details deduplicate

Hello, I'm using Contact Details Merge and Deduplicate by Lukas, but the problem I'm having is that when I export the results, they are mostly in a different order than what I inputted. That means I have to go through manually, one by one, updating my sheet, which takes a ton of time, instead of being able to just copy and paste the columns. Is there any way to fix this?
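One client-side workaround, independent of the Actor itself: re-sort the exported rows back into your input order by matching on a key you control. This is a sketch; `email` is an assumed field name, to be replaced with whatever column links your sheet rows to the results.

```javascript
// Sketch: restore input order by ranking each row against the original
// key list; rows with unknown keys sink to the end.
function sortByInputOrder(rows, inputKeys, keyField) {
  const rank = new Map(inputKeys.map((k, i) => [k, i]));
  return [...rows].sort(
    (a, b) =>
      (rank.get(a[keyField]) ?? Infinity) - (rank.get(b[keyField]) ?? Infinity),
  );
}
```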

How to Setup Alerts For Daily Run Statuses

Hi Team Apify. Is there a way on the Apify platform to set up email alerts based on the breakdown of daily run status for an Actor? For example, at the end of each day, we would like to receive an alert email if the number of runs that have timed-out exceeds a certain threshold.
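If the built-in monitoring options don't cover this exact breakdown, one sketch is to poll the Actor's runs list yourself once a day and alert when a status count crosses a threshold. The `runs().list()` call and the `'TIMED-OUT'` status string are per the apify-client docs; the email sending (`sendAlertEmail`) is a hypothetical placeholder for whatever notifier you use.

```javascript
// Pure helper: tally runs by platform status (e.g. SUCCEEDED, TIMED-OUT).
function countByStatus(runs) {
  return runs.reduce((acc, run) => {
    acc[run.status] = (acc[run.status] ?? 0) + 1;
    return acc;
  }, {});
}

// Usage sketch (scheduled once per day):
// const { items } = await client.actor(actorId).runs().list({ desc: true, limit: 200 });
// const counts = countByStatus(items);
// if ((counts['TIMED-OUT'] ?? 0) > THRESHOLD) sendAlertEmail(counts); // hypothetical notifier
```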

How to check status of Actor initiated via API?

I have a use-case where users of my app are able to initiate a new crawl from my website's front-end. I'd like to be able to pass a "crawl status" back to the user so they don't feel like they're waiting in the dark for the crawl to complete. Is there a way I can create a websocket to a running Actor to provide my users with real-time feedback on the status of the Actor run? All I need is something like "Actor is running" | "Actor completed: 1 succeeded, 1 failed" | "Actor failed". Thanks in advance!...
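Rather than a websocket into the Actor itself, one sketch is to poll the run detail endpoint (`GET /v2/actor-runs/:runId`) from your backend and push the mapped status over your own websocket to the browser. The status names below are the platform's documented run statuses; the polling/push part is commented as a usage sketch.

```javascript
// Pure helper: map an Apify run status to a user-facing message.
function statusMessage(status) {
  switch (status) {
    case 'READY':
    case 'RUNNING':
      return 'Actor is running';
    case 'SUCCEEDED':
      return 'Actor completed';
    case 'FAILED':
    case 'TIMED-OUT':
    case 'ABORTED':
      return 'Actor failed';
    default:
      return `Unknown status: ${status}`;
  }
}

// Usage sketch (server side, on an interval):
// const res = await fetch(`https://api.apify.com/v2/actor-runs/${runId}?token=${token}`);
// const { data } = await res.json();
// ws.send(statusMessage(data.status)); // push over your own websocket
```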

Using userData in the queue from the API

Not using JS or Python, so I have to interact with the Apify API directly. I need to store arbitrary data in request items in the queue. I've seen I can use the userData field when posting a request, but when getting a request from the head (https://docs.apify.com/api/v2#/reference/request-queues/queue-head/get-head) the response does not contain this userData field. Instead I get the request id from this response, and have to make a second API call to get the details for a specific request (https://docs.apify.com/api/v2#/reference/request-queues/request/get-request) based on the request ID. That's 2 API calls for 1 item from the queue; is there a better way? Why doesn't the get-head endpoint return the complete request (including userData)? I probably missed something there, thanks for your help....
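For reference, a hedged sketch of the two-call pattern the question describes, using the two endpoints linked above: the head response is treated as returning trimmed request objects, so each request is re-fetched by id to obtain its userData. Whether a single-call alternative exists should be checked against the API reference.

```javascript
const API = 'https://api.apify.com/v2/request-queues';

// Pure helper: pull the request ids out of a get-head response body.
function extractRequestIds(headBody) {
  return headBody.data.items.map((item) => item.id);
}

async function getHeadWithUserData(queueId, token, limit = 10) {
  const headRes = await fetch(`${API}/${queueId}/head?limit=${limit}&token=${token}`);
  const ids = extractRequestIds(await headRes.json());
  return Promise.all(
    ids.map(async (id) => {
      const res = await fetch(`${API}/${queueId}/requests/${id}?token=${token}`);
      return (await res.json()).data; // full request, including userData
    }),
  );
}
```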

Apify in NestJS scheduler

Hello everyone, I am using Apify + the Crawlee CheerioCrawler + the NestJS scheduler in my project, and I'm running into an issue: the NestJS process running the server quits when Apify.exit() is called. Below is my code:
```javascript
@Cron('0 */5 * * * *')
async handleEvery20Minutes() {
```
...

How to architect my actor and scraper

Hi friends! I've been hacking around with Apify and Crawlee for a few days now and it's a lot of fun. I'm getting stuck on how to architect my crawler for my use-case and could really use some input:...

Trying to use static proxy group

I'm trying to use a static proxy group to send traffic through our specific proxy server with a stable IP address. Below is the piece of code involved:
```javascript
const httpsAgent = new HttpsProxyAgent({
  host: 'proxy.apify.com',
```
...
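With Apify Proxy, the group is selected through the proxy username (documented as `groups-GROUPNAME`, optionally combined with `session-...`), not through a separate host. A sketch of building that username; `MYGROUP` is a placeholder for your static group's actual name, and the `+`-joining of multiple groups follows the proxy docs.

```javascript
// Sketch: compose an Apify Proxy username selecting a group and session.
function apifyProxyUsername({ groups = [], session } = {}) {
  const parts = [];
  if (groups.length) parts.push(`groups-${groups.join('+')}`);
  if (session) parts.push(`session-${session}`);
  return parts.length ? parts.join(',') : 'auto';
}

// Usage sketch: `${apifyProxyUsername({ groups: ['MYGROUP'] })}:${password}`
// as the credentials for proxy.apify.com:8000.
```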

Injecting local script tag onto page

I'm trying to inject a few local .js files onto pages so they can do some tidying before I save the page HTML. Locally, it works well like this:
```javascript
import path from 'node:path';
import { fileURLToPath } from 'node:url';

const __dirname = path.dirname(fileURLToPath(import.meta.url));
const SINGLE_FILES = [
  'single-file-bootstrap.js',
```
...

Monitoring CPU and memory usage in actors

Hi, I'm developing an Actor in Rust and I'm trying to access resource utilisation (CPU and memory). I've seen that Crawlee uses os.cpus() from Node, but I'm looking for a Rust equivalent. I can make it work locally by mounting the Docker socket on the container (docker run -v /var/run/docker.sock:/var/run/docker.sock <image_tag>), but it does not work on the Apify platform. Are there any resources/pointers I could check on how Apify runs an Actor's container? And how could I read this resource utilisation? Any help would be appreciated....

Maximum number of schedules

Hi there, reading the Apify documentation at this link https://docs.apify.com/platform/limits#platform-limits I can see that there is a limit of 100 schedules per user. Is it possible to extend that limit? If so, how?...