Passing user data to the crawler?

Hello, I am trying to find out how best to handle the output of my scraper with Datasets. I have a main request handler dispatching to sub-handlers based on labels, and I would like to have a Dataset for each label/sub-handler, with data following a specific format (basically a database table). I could probably open and close named Datasets every time I process a request, but considering that (as far as I understand) Datasets are stored on disk, this seems quite wasteful in terms of disk I/O. Is there a way to pass my Datasets to the crawler so that any request handler can access them? I know about Request's userData, but that would require passing them explicitly to every new Request I create. I would like to avoid global variables, especially since I would have to initialize them later, which would be TypeScript-unfriendly. I thought useState would be what I was looking for, but answers on this server seem to indicate I am quite wrong, and I am fairly certain that SessionPools are not what I am looking for. If that makes any difference, I am using the CheerioCrawler. Thanks!
3 Replies
extended-salmon, 3y ago
Hey @Iridescent! The first thing that comes to my mind is to use the request label as the name of the dataset too. In each handler you have access to the label, and I assume you know the labels in advance. So you could open the datasets (probably hard-coded) at the very beginning of the actor, and then use some object/map to push items depending on the label.
crude-lavender (OP), 3y ago
Oh, initialize the Datasets before scraping and then await Dataset.open(label) on each request? I did not even think about this, but I imagine opening an already loaded Dataset would just return a handle to the previous instance rather than loading a new one from disk? That seems like a very good idea, thanks!
extended-salmon, 3y ago
I actually meant to .open() them before scraping, keep the datasets in some map/object, and then do something like datasets[label].pushData(). On the other hand, yes: if you open something that was already created and has the same name, it should just return the same instance.
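A rough sketch of the "open once, keep in a map" idea could look like the following. The label names, the `DatasetLike` interface, the `openDatasets` and `saveItem` helpers, and the `MemoryDataset` stand-in are all hypothetical and exist only so the snippet runs on its own. With Crawlee, the opener passed in would presumably be `(name) => Dataset.open(name)`, and `saveItem` would be called from each labelled handler.

```typescript
// Hypothetical labels; replace with the labels your router actually uses.
const LABELS = ['product', 'category'] as const;
type Label = (typeof LABELS)[number];

// Minimal shape needed from a dataset (an assumption for this sketch;
// Crawlee's Dataset exposes a compatible pushData method).
interface DatasetLike {
    pushData(item: Record<string, unknown>): Promise<void>;
}

// Open one dataset per label up front and return them keyed by label.
async function openDatasets(
    labels: readonly Label[],
    open: (name: string) => Promise<DatasetLike>,
): Promise<Map<Label, DatasetLike>> {
    const entries = await Promise.all(
        labels.map(async (label) => [label, await open(label)] as const),
    );
    return new Map(entries);
}

// What a labelled handler would call instead of reopening the dataset.
async function saveItem(
    datasets: Map<Label, DatasetLike>,
    label: Label,
    item: Record<string, unknown>,
): Promise<void> {
    const dataset = datasets.get(label);
    if (!dataset) throw new Error(`No dataset opened for label "${label}"`);
    await dataset.pushData(item);
}

// In-memory stand-in so the sketch runs without Crawlee installed.
class MemoryDataset implements DatasetLike {
    items: Record<string, unknown>[] = [];
    async pushData(item: Record<string, unknown>): Promise<void> {
        this.items.push(item);
    }
}
```

Because the map is built before the crawler starts, the handlers can close over it (or receive it through a small factory function), which avoids both global variables and passing anything through userData.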
