Passing user data to the crawler?
Hello,
I am trying to find how to best handle the output of my scraper with Datasets.
I have a main request handler dispatching to sub-handlers based on labels, and I would like to have a Dataset for each label/sub-handler with data following a specific format (basically a database table).
I could probably open and close named Datasets every time I process a request, but considering that (as far as I understand) Datasets are stored on disk, this would seem quite wasteful in terms of disk I/O.
Is there a way to pass my datasets to the crawler so that any request handler can access them?
I know about Request's userData but that would require passing them explicitly to every new Request I create.
I would like to avoid global variables, especially given that I would have to initialize them which would be TypeScript-unfriendly.
I thought useState would be what I was looking for but looking at answers on this server seems to indicate I am quite wrong, and I am fairly certain that SessionPools are not what I am looking for.
If that makes any difference, I am using the CheerioCrawler.
Thanks !
3 Replies
extended-salmon•3y ago
Hey @Iridescent! The first thing that comes to my mind is to use the request label as the name of the dataset too. In each handler you have access to the label, and I assume you know the labels in advance. So you could just open the datasets (probably hard-coded) at the very beginning of the actor, and then use some object/map to push items depending on the label?
crude-lavenderOP•3y ago
Oh, initialize the Datasets before scraping and then await Dataset.open(label) on each request? I did not even think about this, but I imagine opening an already loaded Dataset would just return a handle to the previous instance rather than loading a new one from disk, then? That seems like a very good idea, thanks!
extended-salmon•3y ago
I actually even meant to .open() them before scraping and keep the datasets in some map/object, and then do something like datasets[label].pushData().
On the other hand, yes: if you open something that was already created and has the same name, it should just return the same instance.