rival-black
Best practices/examples of hardening an actor that handles tens of thousands of records?
I was told by DanielDo to post this here instead of in #chat:
I'm looking for any helpful links/articles/source code for writing actors that split a collection of objects from a dataset into paged collections for batching. I want to support actor input for capping the total number of dataset records that are processed, the size of each page/batch, etc.
Each retrieved object has a URL in one of its keys that the actor then fetches and saves to the local filesystem, so I'd like to make sure the actor can stop and resume where it left off without redundant fetches or filesystem operations.
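To make it concrete, here's roughly the shape I have in mind (an untested sketch with the Apify SDK; the input fields maxRecords and batchSize are just placeholder names I made up):

```ts
import { Actor } from 'apify';

interface Input {
    datasetId: string;
    maxRecords?: number;
    batchSize?: number;
}

await Actor.init();

const { datasetId, maxRecords = Infinity, batchSize = 1000 } = (await Actor.getInput<Input>())!;
const dataset = await Actor.openDataset(datasetId);

// Persist the current offset so a restarted/migrated run resumes where it left off.
let offset = (await Actor.getValue<number>('OFFSET')) ?? 0;
Actor.on('persistState', async () => { await Actor.setValue('OFFSET', offset); });

while (offset < maxRecords) {
    const limit = Math.min(batchSize, maxRecords - offset);
    const { items } = await dataset.getData({ offset, limit });
    if (items.length === 0) break;

    for (const record of items) {
        // fetch record.image and save it under record.identifier (see below)
    }

    offset += items.length;
    await Actor.setValue('OFFSET', offset);
}

await Actor.exit();
```

My understanding is that persistState fires periodically and before migration, so the saved offset should survive restarts, but I'm not sure this is the recommended pattern.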
The end goal is to go from having a dataset with records in the shape of
{ image: 'https://..../x.png', identifier: 'My Image' }
to a zipped archive of all of the images, with each image nested under a parent directory named after the identifier key of its record.
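For the per-record step, this is the kind of idempotent save I'm picturing (again just a sketch; the identifier sanitizing and the images output directory are assumptions on my part):

```ts
import { mkdir, writeFile } from 'node:fs/promises';
import { existsSync } from 'node:fs';
import path from 'node:path';

// Skip files that already exist so a resumed run does no redundant fetches or writes.
async function saveImage(record: { image: string; identifier: string }, outDir = 'images') {
    const dir = path.join(outDir, record.identifier.replace(/[\\/]/g, '_'));
    const file = path.join(dir, path.basename(new URL(record.image).pathname));
    if (existsSync(file)) return;

    await mkdir(dir, { recursive: true });
    const res = await fetch(record.image);
    if (!res.ok) throw new Error(`Failed to fetch ${record.image}: ${res.status}`);
    await writeFile(file, Buffer.from(await res.arrayBuffer()));
}
```

The final step would then just walk outDir and zip it with something like archiver.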
rival-black OP • 2y ago
So, for a record of
{ image: 'https://..../x.png', identifier: 'My Image' }
I will end up with an archive that, when unzipped, will produce the following:
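Assuming the directory name comes straight from the identifier and the filename from the URL, that would be:

```
My Image/
└── x.png
```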
Anyone? I could really use some help on this. The docs give just enough to spark my interest, or only mention these pieces in passing.
It'd be great if RequestQueue could be used outside of scrapers. Can we use it for queueing up image URLs to download, or is it only intended to be passed into playwright/puppeteer/crawlee?
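Something along these lines is what I'm hoping is possible (untested; just my guess at how a queue could be drained by hand, with a placeholder URL):

```ts
import { Actor } from 'apify';

await Actor.init();
const queue = await Actor.openRequestQueue('image-downloads');

// Enqueue image URLs; the queue deduplicates by uniqueKey (the URL by default).
await queue.addRequest({
    url: 'https://example.com/x.png', // placeholder URL
    userData: { identifier: 'My Image' },
});

// Drain the queue manually, with no crawler involved.
while (!(await queue.isFinished())) {
    const request = await queue.fetchNextRequest();
    if (!request) continue;

    try {
        const res = await fetch(request.url);
        // ...save res under request.userData.identifier...
        await queue.markRequestHandled(request);
    } catch {
        await queue.reclaimRequest(request); // put it back to retry later
    }
}

await Actor.exit();
```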