Crawlee & Apify•6mo ago

Concurrency Settings vs Autoscaling Pool

I am really curious about what I configure and what I see. I am deploying it on a beefy EC2 with the following settings: concurrency_settings = ConcurrencySettings( min_concurrency=10, max_concurrency=100, ) But my autoscaling pool tells me: [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 10; cpu = 0.0; mem = 0.0; event_loop = 0.212; client_info = 0.0 Using playwright crawler with a curl impersonate http client: return PlaywrightCrawler( request_handler=router, request_handler_timeout=timeout, max_request_retries=config.max_retries, concurrency_settings=concurrency_settings, http_client=http_client ) Is there any hints on optimizing for concurrency?

2 Replies

Hall•6mo ago

Someone will reply to you shortly. In the meantime, this might help:

magic-beige•6mo ago

Hi, it makes no sense for PlaywrightCrawler to use http_client because it doesn't use it. http_client is for HTTP based crawlers. PlaywrightCrawler is a browser-based crawler. You can also configure desired_concurrency to be initiated when the crawler starts. Also the number of tasks affects the calculation of the current current_concurrency. You can read more in this issue - https://github.com/apify/crawlee-python/issues/786#issuecomment-2527802437.

Gaming

Programming

Concurrency Settings vs Autoscaling Pool

Did you find this page helpful?