Concurrency Settings vs Autoscaling Pool
I'm puzzled by the gap between the concurrency I configure and what the logs report.
I am deploying it on a beefy EC2 with the following settings:
concurrency_settings = ConcurrencySettings(
    min_concurrency=10,
    max_concurrency=100,
)
But my autoscaled pool logs this:
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 10; cpu = 0.0; mem = 0.0; event_loop = 0.212; client_info = 0.0
I'm using a PlaywrightCrawler with a curl-impersonate HTTP client:
return PlaywrightCrawler(
    request_handler=router,
    request_handler_timeout=timeout,
    max_request_retries=config.max_retries,
    concurrency_settings=concurrency_settings,
    http_client=http_client,
)
Are there any hints on optimizing for concurrency?
2 Replies
magic-beige•6mo ago
Hi, it makes no sense to pass http_client to PlaywrightCrawler, because it doesn't use it. http_client is for HTTP-based crawlers, while PlaywrightCrawler is a browser-based crawler.
You can also set desired_concurrency in ConcurrencySettings so that the pool starts at that level when the crawler starts. The number of available tasks also affects how current_concurrency is calculated. You can read more in this issue: https://github.com/apify/crawlee-python/issues/786#issuecomment-2527802437
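To illustrate why the number of tasks matters, here is a rough toy model of an autoscaling step. It is not crawlee's actual code. It assumes the pool only raises desired_concurrency when the system is idle and the running tasks already fill most of the desired slots. The constants and the names next_desired_concurrency and simulate are hypothetical and chosen for illustration only, so check the library source for the real values.

```python
import math

# Assumed values for illustration only; not taken from the crawlee source.
DESIRED_CONCURRENCY_RATIO = 0.9  # share of desired slots that must be busy
SCALE_UP_STEP_RATIO = 0.05       # fraction of desired concurrency added per step


def next_desired_concurrency(desired, current, max_concurrency, system_idle=True):
    """Return the desired concurrency after one hypothetical autoscale tick."""
    # Scale up only if the system is idle and nearly all desired slots are in use.
    busy_enough = current >= math.floor(DESIRED_CONCURRENCY_RATIO * desired)
    if system_idle and busy_enough and desired < max_concurrency:
        step = math.ceil(SCALE_UP_STEP_RATIO * desired)
        return min(max_concurrency, desired + step)
    return desired


def simulate(pending_tasks, desired=10, max_concurrency=100, ticks=20):
    """Run several ticks; running tasks are capped by the tasks available."""
    for _ in range(ticks):
        current = min(desired, pending_tasks)
        desired = next_desired_concurrency(desired, current, max_concurrency)
    return desired


print(simulate(pending_tasks=5))     # prints 10: too few tasks, no scale-up
print(simulate(pending_tasks=1000))  # plenty of tasks: desired grows above 10
```

Under these assumptions, a crawler that only ever has a handful of requests queued stays at its starting desired_concurrency, no matter how high max_concurrency is set.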