Python crawlers running in parallel

Hi, I have a custom Python + requests Actor that works great. It's pretty simple: it works through a list of starting URLs and pulls out one piece of information per URL. My question: if (for example) one run of 1,000 input URLs takes an hour to complete, I would like to parallelize it 4 ways so that I can run 4,000 URLs in an hour. What's the best way to do this? I could kick off 4 copies of the run with segmented data, but this seems like something Apify could support natively.
I saw that if I were using Crawlee (and therefore JS) I could use autoscaling: https://docs.apify.com/platform/actors/running/usage-and-resources. But is there a way to build a single Python-based Actor that uses more threads/CPU cores when needed?
probable-pink
probable-pink•2y ago
Hi, we don't have similar functionality in the Python SDK yet (it's planned for the coming months). But for now, you could write a simple utility using asyncio.Queue to get what you need: https://docs.python.org/3/library/asyncio-queue.html#examples
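A minimal sketch of that queue-based worker pattern, under assumptions: `scrape_one` is a hypothetical stand-in for your per-URL extraction (in a real Actor you'd run your `requests` call via `asyncio.to_thread`, or switch to an async client such as `httpx`), and the concurrency level of 4 matches the parallelism you asked about:

```python
import asyncio

async def scrape_one(url: str) -> str:
    # Placeholder for the real per-URL work, e.g.:
    #   return await asyncio.to_thread(requests.get, url)
    await asyncio.sleep(0.01)  # simulate network I/O
    return f"data from {url}"

async def worker(queue: asyncio.Queue, results: list) -> None:
    # Each worker pulls URLs until the queue is drained and the task is cancelled.
    while True:
        url = await queue.get()
        try:
            results.append(await scrape_one(url))
        finally:
            queue.task_done()

async def crawl(urls: list, concurrency: int = 4) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: list = []
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(concurrency)]
    await queue.join()   # block until every queued URL has been processed
    for w in workers:
        w.cancel()       # workers would otherwise wait on the queue forever
    return results
```

You'd call it with something like `asyncio.run(crawl(start_urls, concurrency=4))`. Because the work is I/O-bound, a handful of concurrent tasks usually saturates the connection before the CPU does.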
xenial-black
xenial-blackOP•2y ago
Thank you ! I will try out the queue system.
xenial-black
xenial-blackOP•2y ago
Hello! I've ported my crawler to use asyncio queues, and it does run faster: with 4 tasks it's roughly 2x the speed. But increasing the task count beyond that doesn't help much. Going to 8 tasks gives a slight boost (measured in requests/second), but 16 is the same as 8 and sometimes slower. I've tried increasing memory, but that doesn't make anything faster, it just costs more. My RAM/CPU graph is pretty stable (this is at 16 tasks).

I do notice that if I just boot another Actor run, it runs at the same speed, so I know I can get a speed boost by running in parallel, but I don't know how to do that with just one Actor. Are we limited in bandwidth per Actor run / Docker container? Is there a way to increase that limit so I can keep a single Actor run for simplicity?
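In the meantime, the "kick off 4 copies with segmented data" workaround mentioned above can be scripted with the Apify API client for Python (`pip install apify-client`). This is a sketch under assumptions: the Actor ID `my-user/my-crawler` and the input field name `start_urls` are placeholders to adjust for your own Actor:

```python
def chunk(urls, n):
    """Split urls into n roughly equal, order-preserving chunks."""
    size = -(-len(urls) // n)  # ceiling division
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def start_parallel_runs(token, urls, n_runs=4):
    # apify-client is assumed to be installed; .start() returns immediately,
    # while .call() would block until the run finishes.
    from apify_client import ApifyClient
    client = ApifyClient(token)
    runs = []
    for part in chunk(urls, n_runs):
        runs.append(client.actor("my-user/my-crawler").start(
            run_input={"start_urls": part}  # hypothetical input field name
        ))
    return runs
```

Each run gets its own container (and its own share of network throughput), which matches the observation that a second Actor run proceeds at full speed alongside the first.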
