Crawlee & Apify


This is the official developer community of Apify and Crawlee.


correct-apricot · 5/15/2023

Bulk image downloader questions.

The downloader seems to work great at getting the images from the URLs, but how do you know which image came from which URL? I can't match the image to its source unless this is possible...
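One way to keep that association (a sketch of doing it yourself, not how the Bulk Image Downloader actor itself works) is to store each image under a key derived from its source URL and push a key-to-URL mapping record to a dataset:

```js
import { Actor } from 'apify';
import { gotScraping } from 'got-scraping';
import { createHash } from 'node:crypto';

await Actor.init();

// Hypothetical list of image URLs; in practice this would come from the actor input.
const urls = ['https://example.com/a.jpg', 'https://example.com/b.jpg'];

const store = await Actor.openKeyValueStore('images');

for (const url of urls) {
    // Download the raw image bytes.
    const { body, headers } = await gotScraping({ url, responseType: 'buffer' });

    // Derive a stable key from the source URL so each stored file can be traced back.
    const key = createHash('sha1').update(url).digest('hex');
    await store.setValue(key, body, { contentType: headers['content-type'] ?? 'image/jpeg' });

    // Record the key -> source URL mapping alongside the images.
    await Actor.pushData({ imageKey: key, sourceUrl: url });
}

await Actor.exit();
```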

Prevent an actor running in parallel

I want to prevent users from requesting too much. Is there a way to prevent an actor from running in parallel? When a user makes too many requests in parallel, the upstream server responds with a 429 error (too many requests), or with 503 or 403. I wish an actor had a flag like "Prevent parallel" or "Wait until the previous run has finished", something like that.
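No such flag is mentioned in the thread; a possible workaround sketch is to check the actor's own recent runs at startup via apify-client and exit early if another run is still RUNNING (checking the last 10 runs is an arbitrary choice):

```js
import { Actor } from 'apify';
import { ApifyClient } from 'apify-client';

await Actor.init();

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
// Both variables are set automatically when the actor runs on the platform.
const actorId = process.env.APIFY_ACTOR_ID;
const thisRunId = process.env.APIFY_ACTOR_RUN_ID;

const { items: recentRuns } = await client.actor(actorId).runs().list({ desc: true, limit: 10 });
const otherRunning = recentRuns.some(
    (run) => run.id !== thisRunId && run.status === 'RUNNING',
);

if (otherRunning) {
    console.log('Another run is already in progress, exiting to avoid hammering the upstream server.');
    await Actor.exit(); // terminates the process here
}

// ... normal scraping logic would go here ...

await Actor.exit();
```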
rising-crimson · 5/11/2023

Getting all replies to a single tweet

Is there a way for me to get all replies to a single tweet? The current Twitter scraper is only getting 240-260 replies when I am trying to get 24000+. Thanks!
equal-jade · 5/11/2023

Post last run via webhook

Is there a way to send the actor's last run as a payload via the webhook? Or is all I can do to send a notification to my server that the run was successful, and then initiate an API call with the SDK to get the last run? That seems like unnecessary double work that could be simpler if I could get the last run into a variable of the webhook...
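A sketch of the receiving side, assuming the default webhook payload template (which carries the run object under `resource`): the run and its default dataset ID arrive with the notification, so only the items themselves need to be fetched:

```js
import express from 'express';
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const app = express();
app.use(express.json());

// Hypothetical webhook endpoint registered for the "run succeeded" event.
app.post('/apify-webhook', async (req, res) => {
    const run = req.body.resource; // run object from the webhook payload
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(`Run ${run.id} finished with ${items.length} items`);
    res.sendStatus(200);
});

app.listen(3000);
```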

Maybe Bug 003: Can't reopen issue

Can't reopen an issue. Also, a name without a last name is displayed as null.
useful-bronze · 5/11/2023

Get Memory Size From Input?

Currently, is there a way to determine the memory the user selected without having them manually type it into the input? My program uses multithreading (it works better than async for my use case), and I want to automatically pick the optimal number of workers/threads based on the memory allocated to the run. I tried multiprocessing.cpu_count(), but it seems to respond with 16 every time. I looked at https://docs.apify.com/platform/actors/running/usage-and-resources but didn't see it there. I also noticed Lukas Krivka's message from 01/09/2023, but it seems to be more about handling this on the user side rather than the developer side: ...
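One detail that may help: the platform exposes the memory allocated to a run through the `APIFY_MEMORY_MBYTES` environment variable, which a Python actor can read via `os.environ` as well. A minimal JavaScript sketch (the one-worker-per-512 MB heuristic is made up for illustration):

```js
// The platform sets APIFY_MEMORY_MBYTES for every run; read it instead of
// asking the user to type the memory size into the input.
const memoryMbytes = Number(process.env.APIFY_MEMORY_MBYTES ?? 1024);

// Hypothetical heuristic: one worker per 512 MB, at least one.
const workers = Math.max(1, Math.floor(memoryMbytes / 512));
console.log(`Allocated ${memoryMbytes} MB -> using ${workers} workers`);
```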
ratty-blush · 5/11/2023

ACTOR: Notifying actor process about imminent migration to another host.

For the past week, I have been getting the above message while running a task on our development account. As a result, the tasks can't finish and randomly start a new build. This can happen multiple times during one run. Does anyone know why this is happening and how I can prevent it? It seems our production environment doesn't have this issue. Thank you! `ACTOR: Notifying actor process about imminent migration to another host. 2023-05-11T09:11:24.953Z ACTOR: Pulling Docker image from repository. 2023-05-11T09:11:37.483Z ACTOR: Creating Docker container....
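Migrations are initiated by the platform; a common mitigation (a sketch, not taken from this thread) is to persist progress when the SDK emits the `migrating` event, so the restarted run can resume instead of starting over:

```js
import { Actor } from 'apify';

await Actor.init();

// Load previously persisted progress, if any.
const state = (await Actor.getValue('STATE')) ?? { processed: 0 };

// When the platform announces an imminent migration, save the current progress
// so the run that replaces this one can pick up where it left off.
Actor.on('migrating', async () => {
    await Actor.setValue('STATE', state);
});

// ... main work loop updating `state.processed` would go here ...

await Actor.setValue('STATE', state);
await Actor.exit();
```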
eastern-cyan · 5/10/2023

Not possible to get details for specific dataset item from API?

I used the API to pull back all the items in a specific Dataset ID. In the results, Apify has an ID for each item inside the Dataset, but I don't see any documentation on how to use that ID to access the specific item's fields. Is it not possible?
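A possible workaround sketch, assuming the items endpoint is addressed by offset/limit rather than by a per-item ID: page through the dataset with apify-client and match the ID field client-side (the `id` field name and the dataset/item IDs below are placeholders):

```js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const wantedId = 'SOME_ITEM_ID'; // hypothetical
const limit = 1000;
let offset = 0;
let found;

// Walk the dataset page by page until the item with the matching id field turns up.
while (!found) {
    const { items, total } = await client.dataset('DATASET_ID').listItems({ offset, limit });
    found = items.find((item) => item.id === wantedId);
    offset += limit;
    if (offset >= total) break;
}

console.log(found ?? 'Item not found');
```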
stormy-gold · 5/10/2023

How to get the logs in json?

I run Crawlee with a Kubernetes cron job, and I have an agent that collects the output and puts it into Elasticsearch. I would like JSON output that is not prefixed by `main INFO PlaywrightCrawler:`. How can I do this?
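A sketch of one way to do this, assuming `@apify/log` (a dependency of Crawlee) is available: swap the shared logger for its JSON implementation before the crawler is constructed:

```js
import { log } from 'crawlee';
import { LoggerJson } from '@apify/log';

// Switch the shared logger to JSON output so each log line is a structured record
// (easier for a log shipper to forward into Elasticsearch) instead of the default
// "INFO PlaywrightCrawler: ..." text format. Do this before creating the crawler
// so that child loggers inherit the setting.
log.setOptions({ logger: new LoggerJson() });
```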
stormy-gold · 5/10/2023

How to stop the crawl after a certain time?

I use Crawlee to warm up my caches, and I set it to stop after 5000 pages. But sometimes it takes more or less time to crawl them. I would prefer to tell it to stop the crawl after x minutes. Is that possible via the configuration?
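A sketch of a time-based stop, assuming there is no single built-in "max run time" option: abort the crawler's autoscaled pool from a timer:

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 5000,
    async requestHandler({ request, enqueueLinks }) {
        // Visiting the page is enough to warm the cache; keep discovering links.
        console.log(`Warmed ${request.url}`);
        await enqueueLinks();
    },
});

// Stop the crawl after a fixed wall-clock time, regardless of how many pages are left.
const MAX_MINUTES = 10;
const timer = setTimeout(() => {
    console.log(`Time limit of ${MAX_MINUTES} minutes reached, aborting the crawl.`);
    crawler.autoscaledPool?.abort();
}, MAX_MINUTES * 60 * 1000);

await crawler.run(['https://example.com']);
clearTimeout(timer);
```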
conventional-tan · 5/10/2023

Dealing with images?

I need to display profile pictures (from scraped social media profiles) to my users on a website. After scraping profiles, I get back a URL pointing to a profile picture. I'm assuming I can't just set my website up to fetch the image from that URL for display, because I would need to do this potentially millions of times, which would result in IP banning and blacklisting, no? So how do I deal with this? Is there a way I can download images using Apify (ideally with settings to change resolution/size, etc.) that I can then store in my own database?...
stormy-gold · 5/9/2023

How to install Crawlee with Yarn 2 and PnP in a workspace?

I have a monorepo that gathers several small projects. It uses Yarn 2 (zero-install) and workspaces. I wanted to add a package that uses Crawlee, but I got a lot of errors at startup with `yarn start:dev`. After some research, specifying `nodeLinker: node_modules` in the package's `.yarnrc.yml` file solved the problem. I have this structure:...
xenial-black · 5/9/2023

"cityUrl" and "city"

Previously, in my scraper for firmy.cz, I used "cityUrl": "/kraj-stredocesky/kolin/3412-kolin" and "city": "Kolín". Can you please advise me what to write for "cityUrl" and "city" if I want to scrape the whole of the Czech Republic?...
fair-rose · 5/9/2023

Actor renaming not working

Hi, attempting to rename an existing Actor results in an error within the onChange event in human_names_modal.jsx. Nothing too groundbreaking, just wanted to let you know. ...
rare-sapphire · 5/8/2023

Help with website-content-crawler for specific website

Hi all, I'm an ML engineer but a complete beginner with Apify. It looks great in theory, but for my first project I ran into the roadblock that I cannot crawl this website properly: https://georgian.io/resources/ There are a bunch of articles and blogs there whose text I want, but running the web crawler only returns the main page without any sub-links, regardless of the max depth I specify. I essentially followed the LangChain tutorial here: https://apify.com/apify/website-content-crawler. ...
extended-salmon · 5/8/2023

I want to use Apify proxy configuration from a local app.

I'm trying to scrape a website using PuppeteerCrawler from my local machine, and I want to configure the Apify proxy for this project. How do I use it? Right now my code looks like this: `import { Actor } from 'apify'; await Actor.init(); const proxyConfiguration = await Actor.createProxyConfiguration({ password: "password"`...
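A sketch of how this could be wired up locally (the `RESIDENTIAL` group is a hypothetical choice that depends on the plan; the proxy password can also be supplied via the `APIFY_PROXY_PASSWORD` environment variable instead of hardcoding it):

```js
import { Actor } from 'apify';
import { PuppeteerCrawler } from 'crawlee';

await Actor.init();

// Locally, the SDK cannot read the proxy password from the platform,
// so it must come from the environment or be passed explicitly here.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'], // optional; omit to use the automatic proxy groups
    password: process.env.APIFY_PROXY_PASSWORD,
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration, // Puppeteer routes its traffic through the Apify proxy
    async requestHandler({ page, request }) {
        console.log(`Loaded ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();
```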
absent-sapphire · 5/5/2023

Proxy access with C# and custom docker image leads to 403 error on free plan

Hi, I'm currently on the free plan and wanted to create a custom actor using C# within a Docker image. Everything works as expected unless I try to use the 5 available proxies. No matter what I do, I keep getting the following error: One or more errors occurred. (The proxy tunnel request to proxy 'http://proxy.apify.com:8000/' failed with status code '403'.") Credentials are properly set. Without valid credentials I'd get a 407 error. I'm using the normal HttpClient from the .NET Framework. ...
rising-crimson · 5/4/2023

Hello, I'm using the Puppeteer actor. I want to open a new page inside the pageFunction.

So I have the initial page, but since I still need the data there, I need to create a new page to visit a link. I don't know how I can do that; if anyone can help me, thanks.
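A sketch of a pageFunction that does this with plain Puppeteer calls (`a.detail` is a hypothetical selector): the initial page stays open while a second tab from the same browser visits the link:

```js
async function pageFunction(context) {
    const { page, request, log } = context;

    // Data from the initial page stays available on `page`.
    const detailUrl = await page.$eval('a.detail', (a) => a.href); // hypothetical selector

    // Open a second tab in the same browser and visit the link there.
    const detailPage = await page.browser().newPage();
    await detailPage.goto(detailUrl, { waitUntil: 'domcontentloaded' });
    const detailTitle = await detailPage.title();
    await detailPage.close();

    log.info(`Visited ${detailUrl} from ${request.url}`);
    return { url: request.url, detailUrl, detailTitle };
}
```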

Remove Actor Builds

Hi, how do I remove previous actor builds permanently?
environmental-rose · 5/4/2023

Twitter Follower Count Help

Hi, I am a student looking to use the Twitter data scrapers to extract specific follower counts for a list of accounts. I'm looking for advice/instructions, as the scrapers I used were not giving me follower counts.