Marco
Crawlee & Apify
Created by Amal Chandran on 5/16/2025 in #crawlee-js
LinkedIn Session Timeout
It sounds like LinkedIn detected some suspicious activity and logged you out. This is a delicate matter and could even lead to your account being blocked: if you are using your main account, I would run my experiments on another, expendable one. Something must be telling LinkedIn that you are using a bot, or an unusual session. Take a look at this page about blocking bots: https://docs.apify.com/academy/anti-scraping, and at tools like this one: https://camoufox.com/. Or, you could try implementing the login process in your workflow, which would be slower, but maybe more reliable.
4 replies
Crawlee & Apify
Created by memo23 on 5/14/2025 in #apify-platform
Issue with opening issue link
Hello! I've reported the bug internally and will let you know once it's fixed.
4 replies
Crawlee & Apify
Created by fascinating-indigo on 2/28/2025 in #apify-platform
Website Content Crawler Actor - Get access to failed urls
Hello! Unfortunately, I can't find any option for that. I see that an issue has already been opened for the Actor; I was going to suggest that. In the meantime, if you have a fixed list of URLs, you could compare it against the Actor's output, but whether that is acceptable depends on your use case.
6 replies
Crawlee & Apify
Created by exotic-emerald on 2/27/2025 in #crawlee-js
Shared external queue between multiple crawlers
Hello! The request queue is managed by Crawlee, not by Cheerio or Playwright directly. What you could try is creating a custom RequestQueue that inherits from Crawlee's class: https://crawlee.dev/api/core/class/RequestQueue. Here is the source code: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L77. Then, you could pass the custom queue to the (Cheerio/Playwright) Crawler: https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler#requestQueue.
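To make the idea concrete, here is a toy, in-memory sketch of the main methods such a queue needs to cover. The method names follow the RequestQueue API linked above, but the class itself is purely illustrative (not a real Crawlee subclass); a shared implementation would back these methods with an external store instead of local structures.

```javascript
// Illustrative queue exposing the methods a custom RequestQueue would
// provide: addRequest, fetchNextRequest, markRequestHandled.
class SharedQueueSketch {
  constructor() {
    this.seen = new Set();       // uniqueKeys ever enqueued (deduplication)
    this.pending = [];           // requests waiting to be processed
    this.inProgress = new Map(); // uniqueKey -> request currently "locked"
    this.handled = new Set();    // uniqueKeys already processed
  }

  async addRequest(request) {
    // Deduplicate by uniqueKey, like the real queue does.
    if (this.seen.has(request.uniqueKey)) return;
    this.seen.add(request.uniqueKey);
    this.pending.push(request);
  }

  async fetchNextRequest() {
    // "Lock" the request by moving it to inProgress.
    const request = this.pending.shift();
    if (!request) return null;
    this.inProgress.set(request.uniqueKey, request);
    return request;
  }

  async markRequestHandled(request) {
    this.inProgress.delete(request.uniqueKey);
    this.handled.add(request.uniqueKey);
  }
}
```

In a shared setup, the interesting part is making fetchNextRequest atomic across crawlers, so two workers never lock the same request.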
4 replies
Crawlee & Apify
Created by genetic-orange on 2/26/2025 in #crawlee-python
Django Google Maps Reviews : Pulling Data into local Django app
I would suggest experimenting a bit with the Google Maps scrapers directly on the platform, to see if any of them works for you and to check the output format, then reading the documentation for the Python client to see how to call an Actor programmatically and read its output. 🙂
5 replies
Crawlee & Apify
Created by ratty-blush on 2/26/2025 in #crawlee-python
Django Google Maps Reviews : Pulling Data into local Django app
Hello! Are you talking about Apify's scraper? In that case, you probably want to use the Python client to start a run for the Google Maps scraper and gather the results: https://docs.apify.com/api/client/python/. Consider that there are quite a few actors for Google Maps. For instance:
- A legacy actor, which does everything, but is slower: https://apify.com/compass/crawler-google-places
- A faster actor to scrape only places' details: https://apify.com/compass/google-maps-extractor
- A faster actor to scrape only reviews: https://apify.com/compass/google-maps-reviews-scraper
If you already have the URLs, I would suggest using the latter to scrape the reviews.
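As an illustration of the call-and-collect flow, here is a sketch with the JS Apify client (the Python client linked above mirrors this API). The input field names are assumptions, so check the Actor's input schema before using them.

```javascript
// Hypothetical input for compass/google-maps-reviews-scraper
// (field names are assumptions; verify against the Actor's input schema).
const input = {
  startUrls: [{ url: 'https://www.google.com/maps/place/some-place' }],
  maxReviews: 100,
};

// Start the Actor, wait for the run to finish, then read its dataset.
async function fetchReviews(token) {
  const { ApifyClient } = require('apify-client'); // npm i apify-client
  const client = new ApifyClient({ token });
  const run = await client.actor('compass/google-maps-reviews-scraper').call(input);
  const { items } = await client.dataset(run.defaultDatasetId).listItems();
  return items;
}
```

From there you can write the items into your Django models however you like.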
5 replies
Crawlee & Apify
Created by rare-sapphire on 2/27/2025 in #crawlee-js
Reclaiming failed request back to the list or queue
Hello! Can you share some code, for instance how you instantiate your crawler? I'm wondering what could cause the "Request with ID MrkS5lrulGSjDt4 is not locked in queue default" error. Regarding the TLS error: are you using proxies?
3 replies
Crawlee & Apify
Created by Louis Deconinck on 2/24/2025 in #apify-platform
Set country for datacenter proxies
Datacenter proxies are selected by default when you use Apify proxies, so if you omit groups you will be using them, and you will be able to select a country for them: they support this, as stated here: https://docs.apify.com/platform/proxy/datacenter-proxy. Unfortunately, you can select only one country at a time with the countryCode parameter. Nevertheless, you could try using the groups array: since the datacenter proxies correspond to the auto group, and you can suffix the country you want, you could try something like:
const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['auto-FR', 'auto-DE', 'auto-IT'], // one 'auto-XX' entry per country
});
4 replies
Crawlee & Apify
Created by Louis Deconinck on 2/6/2025 in #apify-platform
RESIDENTIAL5 proxies
Hello! Try enabling Override restricted residential proxies in the admin settings for your actor. Also, for some websites, it could be necessary to ignore certificate errors: set ignoreSslErrors to true if you are using Cheerio: https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#ignoreSslErrors, or pass --ignore-certificate-errors to Chrome if you are using Playwright or Puppeteer.
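For instance, a sketch with Crawlee (adjust to your own crawler setup; the request handlers are stubs):

```javascript
const { CheerioCrawler, PlaywrightCrawler } = require('crawlee');

// CheerioCrawler: skip TLS certificate validation.
const cheerioCrawler = new CheerioCrawler({
  ignoreSslErrors: true,
  requestHandler: async ({ request, $ }) => { /* ... */ },
});

// PlaywrightCrawler: pass the flag to the browser process instead.
const playwrightCrawler = new PlaywrightCrawler({
  launchContext: {
    launchOptions: { args: ['--ignore-certificate-errors'] },
  },
  requestHandler: async ({ page }) => { /* ... */ },
});
```

Only do this for sites you trust, since it disables certificate checks entirely.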
3 replies
Crawlee & Apify
Created by magic-beige on 1/30/2025 in #apify-platform
Is it possible to setup a slack notification for a low account balance?
Unfortunately, there is no built-in solution for that: the platform only supports email notifications for some events, for instance when you are about to reach the platform limit (which has to be set manually, to avoid incurring extra costs). You could set up your own script (or even an Actor) to do what you need, periodically checking your usage through the API: https://docs.apify.com/api/v2/users-me-limits-get.
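A minimal sketch of such a check: the endpoint is the one linked above, but the response field names below are assumptions, so verify them against the API reference; the Slack call is left as a stub.

```javascript
// True when current-cycle usage has reached thresholdRatio of the hard limit.
function isNearLimit(usageUsd, limitUsd, thresholdRatio = 0.9) {
  return limitUsd > 0 && usageUsd >= limitUsd * thresholdRatio;
}

// Poll the Apify API and alert when close to the monthly limit.
// Field names (data.current.monthlyUsageUsd, data.limits.maxMonthlyUsageUsd)
// are assumptions based on the linked reference.
async function checkAndNotify(token) {
  const res = await fetch('https://api.apify.com/v2/users/me/limits', {
    headers: { Authorization: `Bearer ${token}` },
  });
  const { data } = await res.json();
  if (isNearLimit(data.current.monthlyUsageUsd, data.limits.maxMonthlyUsageUsd)) {
    // POST to a Slack incoming webhook here.
  }
}
```

Run it on a schedule (cron, or an Apify Schedule if you package it as an Actor).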
3 replies
Crawlee & Apify
Created by genetic-orange on 2/2/2025 in #apify-platform
How to Share Data Across Public Actor Runs?
Unfortunately, there is no such thing as a shared storage which could be accessed by multiple users, yet.
4 replies
Crawlee & Apify
Created by conscious-sapphire on 1/26/2025 in #apify-platform
Apify cli can't find python
If you used an Apify template, you can see which command the apify-cli executes when running an Actor at the bottom of the Dockerfile: in my case, it's python3 -m src. Check whether this command works without the apify-cli.
3 replies
Crawlee & Apify
Created by deep-jade on 1/26/2025 in #crawlee-python
error
Hello! Rental Actors require paying a monthly subscription to the original developer in order to be used: https://docs.apify.com/platform/actors/publishing/monetize#pricing-models
3 replies
Crawlee & Apify
Created by ratty-blush on 2/21/2025 in #crawlee-js
Disable write to disk
If the problem is that data, like queues, is persisted across runs, you can try using apify run --purge: https://docs.apify.com/cli/docs/reference#apify-run
7 replies
Crawlee & Apify
Created by correct-apricot on 7/29/2024 in #crawlee-python
how to pass proxies using selenium
Hello! For using proxies with Selenium, you should refer to their documentation: https://www.selenium.dev/documentation/webdriver/drivers/options/#proxy. You can generate URLs for Apify proxies to use them in other tools: https://crawlee.dev/python/docs/guides/proxy-management
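For reference, an Apify proxy URL has the form http://<username>:<password>@proxy.apify.com:8000, where the username encodes groups and country (see the proxy docs linked above). A small illustrative helper for building one, which you could then hand to Selenium's proxy options:

```javascript
// Build an Apify proxy URL. The username grammar
// (groups-A+B,country-XX, or plain "auto") follows the Apify proxy docs;
// the helper itself is just a sketch.
function apifyProxyUrl(password, { groups = [], country } = {}) {
  const parts = [];
  if (groups.length) parts.push(`groups-${groups.join('+')}`);
  if (country) parts.push(`country-${country}`);
  const username = parts.length ? parts.join(',') : 'auto';
  return `http://${username}:${password}@proxy.apify.com:8000`;
}
```

The resulting string goes wherever your tool expects an HTTP proxy URL.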
4 replies
Crawlee & Apify
Created by wee-brown on 12/30/2024 in #crawlee-js
Managing Queue using redis or something similar and having worker nodes listening on queue
To the latter question, I'd say no: Apify does not provide on-premises solutions. Regarding implementing a RequestQueue that uses Redis, I think it would be possible! You can take a look at the code here: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue_v2.ts#L55
9 replies
Crawlee & Apify
Created by stormy-gold on 12/31/2024 in #apify-platform
instagram
Hello! This is a very broad question, so there are many possible answers to your request. For scraping Instagram pictures, I would suggest Apify's Instagram scraper: https://console.apify.com/actors/shu8hvrXbJbY3Eb9W. Then, to perform OCR, you could use the free library Tesseract: https://tesseract-ocr.github.io/, or, since you are already using Firebase, you could take a look at Google Vision API: https://cloud.google.com/vision/docs/ocr. For publishing the data to Firebase, you should refer to their documentation: https://firebase.google.com/docs/. In general, we only discuss issues regarding the Apify platform here.
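If you go the Tesseract route from Node.js, there is tesseract.js, a JS port of the engine mentioned above (npm i tesseract.js). A hedged sketch; the exact recognize() signature can vary between versions, so check their docs:

```javascript
// Run OCR on an image file or URL and return the extracted text.
// Wrapped in a function so nothing runs (or downloads language data) on load.
async function extractText(imagePathOrUrl) {
  const Tesseract = require('tesseract.js');
  // recognize() fetches the English language data on first use.
  const { data } = await Tesseract.recognize(imagePathOrUrl, 'eng');
  return data.text;
}
```

You could call this on each image URL the scraper returns, then push the text to Firebase.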
3 replies
Crawlee & Apify
Created by extended-salmon on 12/30/2024 in #crawlee-js
Managing Queue using redis or something similar and having worker nodes listening on queue
I'm not aware of such a possibility. Actually, I don't think Crawlee's queues were intended for concurrent access, but rather for keeping track of to-do/done jobs within a single execution, or multiple but subsequent ones. You would have to develop your own solution to manage and scale workers, or look at existing solutions, such as Apify.
9 replies
Crawlee & Apify
Created by other-emerald on 12/30/2024 in #crawlee-python
Google Maps Extractor-Retrieve a location with place ID
Hello! Currently, Google Maps Extractor (the faster one) only supports search URLs (the ones including /maps/search), not URLs for a specific place ID.
13 replies
Crawlee & Apify
Created by extended-salmon on 10/31/2024 in #crawlee-js
autoscale pool trying to scale up without sufficient memory
The AutoscaledPool doesn't ensure that memory never goes above the limit; it just doesn't scale to more requests when it is close. So a sudden memory spike, for instance on a very heavy page, can still cause trouble. You can either limit maxConcurrency or play with the autoscaledPoolOptions to reduce memory scaling.
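As a sketch (the option names come from the Crawlee API; the values are arbitrary examples to tune for your workload, and the request handler is a stub):

```javascript
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
  // Hard cap on parallel requests, regardless of how much memory looks free.
  maxConcurrency: 10,
  autoscaledPoolOptions: {
    // Grow concurrency in smaller steps, so spikes have less room to compound.
    scaleUpStepRatio: 0.02,
    // Treat the system as memory-overloaded sooner than the default.
    systemStatusOptions: { maxMemoryOverloadedRatio: 0.1 },
  },
  requestHandler: async ({ page }) => { /* ... */ },
});
```

Lowering maxConcurrency is the blunt but reliable fix; the autoscaling knobs help when you want to keep throughput higher on average.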
7 replies