Marco
Crawlee & Apify
Created by Amal Chandran on 5/16/2025 in #crawlee-js
LinkedIn Session Timeout
It sounds like LinkedIn detected some suspicious activity and logged you out. This is a delicate matter and could even lead to your account being blocked: if you are using your main account, I would run my experiments on another, expendable one. Something must be telling LinkedIn that you are using a bot, or an unusual session. Take a look at this page about blocking bots: https://docs.apify.com/academy/anti-scraping, and at tools like this one: https://camoufox.com/. Or, you could try implementing the login process in your workflow, which would be slower, but maybe more reliable.
4 replies
Crawlee & Apify
Created by memo23 on 5/14/2025 in #apify-platform
Issue with opening issue link
Hello! I've reported the bug internally and will let you know once it's fixed.
4 replies
Crawlee & Apify
Created by fascinating-indigo on 2/28/2025 in #apify-platform
Website Content Crawler Actor - Get access to failed urls
Hello! Unfortunately, I can't find any option for that. I see that an issue has already been opened for the Actor; I was going to suggest that. In the meantime, if you have a fixed list of URLs, you could compare it against the Actor's output, but whether that is acceptable depends on your use case.
6 replies
Crawlee & Apify
Created by exotic-emerald on 2/27/2025 in #crawlee-js
Shared external queue between multiple crawlers
Hello! The request queue is managed by Crawlee, not by Cheerio or Playwright directly. What you could try is creating a custom RequestQueue that inherits from Crawlee's class: https://crawlee.dev/api/core/class/RequestQueue. Here is the source code: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L77. Then, you could pass the custom queue to the (Cheerio/Playwright) Crawler: https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler#requestQueue.
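To make the idea concrete, here is a toy, in-memory sketch of the main methods such a queue needs to cover. The method names follow the RequestQueue API linked above, but the class itself is purely illustrative (not a real Crawlee subclass); a shared implementation would back these methods with an external store instead of local structures.

```javascript
// Illustrative queue exposing the methods a custom RequestQueue would
// provide: addRequest, fetchNextRequest, markRequestHandled.
class SharedQueueSketch {
  constructor() {
    this.seen = new Set();       // uniqueKeys ever enqueued (deduplication)
    this.pending = [];           // requests waiting to be processed
    this.inProgress = new Map(); // uniqueKey -> request currently "locked"
    this.handled = new Set();    // uniqueKeys already processed
  }

  async addRequest(request) {
    // Deduplicate by uniqueKey, like the real queue does.
    if (this.seen.has(request.uniqueKey)) return;
    this.seen.add(request.uniqueKey);
    this.pending.push(request);
  }

  async fetchNextRequest() {
    // "Lock" the request by moving it to inProgress.
    const request = this.pending.shift();
    if (!request) return null;
    this.inProgress.set(request.uniqueKey, request);
    return request;
  }

  async markRequestHandled(request) {
    this.inProgress.delete(request.uniqueKey);
    this.handled.add(request.uniqueKey);
  }
}
```

In a shared setup, the interesting part is making fetchNextRequest atomic across crawlers, so two workers never lock the same request.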
4 replies
Crawlee & Apify
Created by genetic-orange on 2/26/2025 in #crawlee-python
Django Google Maps Reviews : Pulling Data into local Django app
I would suggest experimenting a bit with the Google Maps scrapers directly on the platform, to see if any of them works for you and to check the output format, then reading the documentation for the Python client to see how to call an Actor programmatically and read its output. 🙂
5 replies
Crawlee & Apify
Created by ratty-blush on 2/26/2025 in #crawlee-python
Django Google Maps Reviews : Pulling Data into local Django app
Hello! Are you talking about Apify's scraper? In that case, you probably want to use the Python client to start a run for the Google Maps scraper and gather the results: https://docs.apify.com/api/client/python/. Consider that there are quite a few actors for Google Maps. For instance:
- A legacy actor, which does everything, but is slower: https://apify.com/compass/crawler-google-places
- A faster actor to scrape only places' details: https://apify.com/compass/google-maps-extractor
- A faster actor to scrape only reviews: https://apify.com/compass/google-maps-reviews-scraper
If you already have the URLs, I would suggest using the latter to scrape the reviews.
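As an illustration of the call-and-collect flow, here is a sketch with the JS Apify client (the Python client linked above mirrors this API). The input field names are assumptions, so check the Actor's input schema before using them.

```javascript
// Hypothetical input for compass/google-maps-reviews-scraper
// (field names are assumptions; verify against the Actor's input schema).
const input = {
  startUrls: [{ url: 'https://www.google.com/maps/place/some-place' }],
  maxReviews: 100,
};

// Start the Actor, wait for the run to finish, then read its dataset.
async function fetchReviews(token) {
  const { ApifyClient } = require('apify-client'); // npm i apify-client
  const client = new ApifyClient({ token });
  const run = await client.actor('compass/google-maps-reviews-scraper').call(input);
  const { items } = await client.dataset(run.defaultDatasetId).listItems();
  return items;
}
```

From there you can write the items into your Django models however you like.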
5 replies
Crawlee & Apify
Created by rare-sapphire on 2/27/2025 in #crawlee-js
Reclaiming failed request back to the list or queue
Hello! Can you share some code, for instance how you instantiate your crawler? I'm wondering what could cause the "Request with ID MrkS5lrulGSjDt4 is not locked in queue default" error. Regarding the TLS error: are you using proxies?
3 replies
Crawlee & Apify
Created by Louis Deconinck on 2/24/2025 in #apify-platform
Set country for datacenter proxies
Datacenter proxies are selected by default when you use Apify proxies, so if you omit groups you will be using them, and you will be able to select a country for them: they support this, as stated here: https://docs.apify.com/platform/proxy/datacenter-proxy. Unfortunately, you can select only one country at a time with the countryCode parameter. Nevertheless, you could try using the groups array: since the datacenter proxies correspond to the auto group, and you can suffix the country you want, you could try something like:
const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['auto-FR', 'auto-DE', 'auto-IT'], // one 'auto-XX' entry per country
});
4 replies
Crawlee & Apify
Created by Louis Deconinck on 2/6/2025 in #apify-platform
RESIDENTIAL5 proxies
Hello! Try enabling Override restricted residential proxies in the admin settings for your actor. Also, for some websites, it could be necessary to ignore certificate errors: set ignoreSslErrors to true if you are using Cheerio: https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#ignoreSslErrors, or pass --ignore-certificate-errors to Chrome if you are using Playwright or Puppeteer.
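For instance, a sketch with Crawlee (adjust to your own crawler setup; the request handlers are stubs):

```javascript
const { CheerioCrawler, PlaywrightCrawler } = require('crawlee');

// CheerioCrawler: skip TLS certificate validation.
const cheerioCrawler = new CheerioCrawler({
  ignoreSslErrors: true,
  requestHandler: async ({ request, $ }) => { /* ... */ },
});

// PlaywrightCrawler: pass the flag to the browser process instead.
const playwrightCrawler = new PlaywrightCrawler({
  launchContext: {
    launchOptions: { args: ['--ignore-certificate-errors'] },
  },
  requestHandler: async ({ page }) => { /* ... */ },
});
```

Only do this for sites you trust, since it disables certificate checks entirely.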
3 replies
Crawlee & Apify
Created by magic-beige on 1/30/2025 in #apify-platform
Is it possible to setup a slack notification for a low account balance?
Unfortunately, there is no built-in solution for that: the platform only supports email notifications for some events, for instance when you are about to reach the platform limit (which has to be set manually, to avoid incurring extra costs). You could set up your own script (or even an Actor) to do what you need, periodically checking your usage through the API: https://docs.apify.com/api/v2/users-me-limits-get.
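A minimal sketch of such a check: the endpoint is the one linked above, but the response field names below are assumptions, so verify them against the API reference; the Slack call is left as a stub.

```javascript
// True when current-cycle usage has reached thresholdRatio of the hard limit.
function isNearLimit(usageUsd, limitUsd, thresholdRatio = 0.9) {
  return limitUsd > 0 && usageUsd >= limitUsd * thresholdRatio;
}

// Poll the Apify API and alert when close to the monthly limit.
// Field names (data.current.monthlyUsageUsd, data.limits.maxMonthlyUsageUsd)
// are assumptions based on the linked reference.
async function checkAndNotify(token) {
  const res = await fetch('https://api.apify.com/v2/users/me/limits', {
    headers: { Authorization: `Bearer ${token}` },
  });
  const { data } = await res.json();
  if (isNearLimit(data.current.monthlyUsageUsd, data.limits.maxMonthlyUsageUsd)) {
    // POST to a Slack incoming webhook here.
  }
}
```

Run it on a schedule (cron, or an Apify Schedule if you package it as an Actor).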
3 replies
Crawlee & Apify
Created by genetic-orange on 2/2/2025 in #apify-platform
How to Share Data Across Public Actor Runs?
Unfortunately, there is no such thing as a shared storage which could be accessed by multiple users, yet.
4 replies
Crawlee & Apify
Created by conscious-sapphire on 1/26/2025 in #apify-platform
Apify cli can't find python
If you used an Apify template, you can see which command the apify-cli executes when running an Actor at the bottom of the Dockerfile: in my case, it's python3 -m src. Check whether this command works without the apify-cli.
3 replies
Crawlee & Apify
Created by deep-jade on 1/26/2025 in #crawlee-python
error
Hello! Rental Actors require paying a monthly subscription to the original developer in order to be used: https://docs.apify.com/platform/actors/publishing/monetize#pricing-models
3 replies
Crawlee & Apify
Created by ratty-blush on 2/21/2025 in #crawlee-js
Disable write to disk
If the problem is that data, like queues, is persisted across runs, you can try using apify run --purge: https://docs.apify.com/cli/docs/reference#apify-run
7 replies
Crawlee & Apify
Created by correct-apricot on 7/29/2024 in #crawlee-python
how to pass proxies using selenium
Hello! For using proxies with Selenium, you should refer to their documentation: https://www.selenium.dev/documentation/webdriver/drivers/options/#proxy. You can generate URLs for Apify proxies to use them in other tools: https://crawlee.dev/python/docs/guides/proxy-management
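For reference, an Apify proxy URL has the form http://<username>:<password>@proxy.apify.com:8000, where the username encodes groups and country (see the proxy docs linked above). A small illustrative helper for building one, which you could then hand to Selenium's proxy options:

```javascript
// Build an Apify proxy URL. The username grammar
// (groups-A+B,country-XX, or plain "auto") follows the Apify proxy docs;
// the helper itself is just a sketch.
function apifyProxyUrl(password, { groups = [], country } = {}) {
  const parts = [];
  if (groups.length) parts.push(`groups-${groups.join('+')}`);
  if (country) parts.push(`country-${country}`);
  const username = parts.length ? parts.join(',') : 'auto';
  return `http://${username}:${password}@proxy.apify.com:8000`;
}
```

The resulting string goes wherever your tool expects an HTTP proxy URL.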
4 replies
Crawlee & Apify
Created by wee-brown on 12/30/2024 in #crawlee-js
Managing Queue using redis or something similar and having worker nodes listening on queue
To the latter question, I'd say no: Apify does not provide on-premises solutions. Regarding implementing a RequestQueue that uses Redis, I think it would be possible! You can take a look at the code here: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L55
9 replies
Crawlee & Apify
Created by stormy-gold on 12/31/2024 in #apify-platform
instagram
Hello! This is a very broad question, so there are many possible answers to your request. For scraping Instagram pictures, I would suggest Apify's Instagram scraper: https://console.apify.com/actors/shu8hvrXbJbY3Eb9W. Then, to perform OCR, you could use the free library Tesseract: https://tesseract-ocr.github.io/, or, since you are already using Firebase, you could take a look at Google Vision API: https://cloud.google.com/vision/docs/ocr. For publishing the data to Firebase, you should refer to their documentation: https://firebase.google.com/docs/. In general, we only discuss issues regarding the Apify platform here.
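If you go the Tesseract route from Node.js, there is tesseract.js, a JS port of the engine mentioned above (npm i tesseract.js). A hedged sketch; the exact recognize() signature can vary between versions, so check their docs:

```javascript
// Run OCR on an image file or URL and return the extracted text.
// Wrapped in a function so nothing runs (or downloads language data) on load.
async function extractText(imagePathOrUrl) {
  const Tesseract = require('tesseract.js');
  // recognize() fetches the English language data on first use.
  const { data } = await Tesseract.recognize(imagePathOrUrl, 'eng');
  return data.text;
}
```

You could call this on each image URL the scraper returns, then push the text to Firebase.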
3 replies
Crawlee & Apify
Created by extended-salmon on 12/30/2024 in #crawlee-js
Managing Queue using redis or something similar and having worker nodes listening on queue
I'm not aware of such a possibility. Actually, I don't think Crawlee's queues were intended for concurrent access, but rather for keeping track of to-do/done jobs within a single execution, or multiple but subsequent ones. You would have to develop your own solution to manage and scale workers, or look at existing solutions, such as Apify.
9 replies
Crawlee & Apify
Created by other-emerald on 12/30/2024 in #crawlee-python
Google Maps Extractor-Retrieve a location with place ID
Hello! Currently, Google Maps Extractor (the faster one) only supports search URLs (the ones including /maps/search), not URLs for a specific place ID.
13 replies
Crawlee & Apify
Created by extended-salmon on 10/31/2024 in #crawlee-js
autoscale pool trying to scale up without sufficient memory
The AutoscaledPool doesn't ensure that memory never goes above the limit; it just doesn't scale to more requests when it is close. So a sudden memory spike, for instance on a very heavy page, can still cause trouble. You can either limit maxConcurrency or play with the autoscaledPoolOptions to reduce memory scaling.
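As a sketch (the option names come from the Crawlee API; the values are arbitrary examples to tune for your workload, and the request handler is a stub):

```javascript
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
  // Hard cap on parallel requests, regardless of how much memory looks free.
  maxConcurrency: 10,
  autoscaledPoolOptions: {
    // Grow concurrency in smaller steps, so spikes have less room to compound.
    scaleUpStepRatio: 0.02,
    // Treat the system as memory-overloaded sooner than the default.
    systemStatusOptions: { maxMemoryOverloadedRatio: 0.1 },
  },
  requestHandler: async ({ page }) => { /* ... */ },
});
```

Lowering maxConcurrency is the blunt but reliable fix; the autoscaling knobs help when you want to keep throughput higher on average.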
7 replies