How to share an object between requests with Crawlee on Apify
Hello. While scraping a website, I need access to an object that is shared between all requests. I keep some data in this object, and every request can read/write it. When all requests are handled, I do some validation and calculations on the data and write the result to a Dataset.
This was easy in Apify SDK v2: I created an instance of the object and passed it as a parameter to the `handleXY` methods. Like this:
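A minimal sketch of that pattern, assuming SDK v2's `CheerioCrawler` (the `myData` shape, URL, and stored result are placeholders, not the original snippet):

```ts
import Apify from 'apify';

Apify.main(async () => {
    // shared object, created once and captured by the handler's closure
    const myData = { visited: [] as string[] };

    const crawler = new Apify.CheerioCrawler({
        requestList: await Apify.openRequestList('start', ['https://example.com']),
        handlePageFunction: async ({ request }) => {
            // every request can read/write the shared object
            myData.visited.push(request.url);
        },
    });

    await crawler.run();

    // all requests handled: validate, calculate, write the result to the Dataset
    await Apify.pushData({ total: myData.visited.length });
});
```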
This works without any problems. I need to achieve the same behavior with Crawlee, and I want to use routing. Since I can't pass any parameters to the handlers, I create an instance of `myData`, set this instance on the crawler, and then read it back from there. Like this:
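Roughly this sketch (a guess at the setup being described; the `MyData` shape and the type assertion are illustrative, not Crawlee API):

```ts
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

interface MyData { visited: string[] }

const router = createCheerioRouter();

router.addDefaultHandler(async ({ crawler, request }) => {
    // read the shared instance back off the crawler object
    const myData = (crawler as CheerioCrawler & { myData: MyData }).myData;
    myData.visited.push(request.url);
});

const crawler = new CheerioCrawler({ requestHandler: router });

// attach the shared instance to the crawler so handlers can reach it
(crawler as CheerioCrawler & { myData: MyData }).myData = { visited: [] };

await crawler.run(['https://example.com']);
```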
However, I found that sometimes the run is restarted somehow: it handles some requests, then a new Docker instance is created which handles the rest of the requests. When this new instance is created, I lose my instance of `myData`.
How can I solve this issue? Do I have to serialize this object to a Dataset/KeyValueStore? What about parallel requests? The best solution for me would be to keep all requests in one Docker instance. Is that possible somehow?
exotic-emeraldOP•3y ago
When working with parallel requests, I'm afraid of this: Request A deserializes the data and changes some of it. Meanwhile, Request B deserializes it as well and changes some data. Request A serializes it and saves it to the Dataset/KeyValueStore. Request B does the same, but the changes made by Request A are lost (because the data is overwritten by Request B).
rising-crimson•3y ago
There are two solutions for this: `userData` and `useState`. They are both quite different.
1. `userData`
When you create a request like this, with the `userData` property:
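A sketch, with the URL and payload as placeholders:

```ts
await crawler.addRequests([
    {
        url: 'https://example.com/product/1',
        // arbitrary JSON-serializable data that travels with this one request
        userData: { label: 'DETAIL', color: 'blue' },
    },
]);
```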
That data will be available in the handler for that request like this:
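Continuing the sketch, assuming a router with a `DETAIL` handler:

```ts
router.addHandler('DETAIL', async ({ request, log }) => {
    // the same object that was attached when the request was enqueued
    log.info(`color is ${request.userData.color}`);
});
```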
Read more about `userData` here: https://crawlee.dev/api/core/class/Request#userData
2. `useState`
This is a method available on a crawler instance that basically tries to copy what React's `useState` hook does.
You can manage global state easily for an entire crawler with this hook without needing to drill data down through requests:
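A sketch of the idea (the state shape is a placeholder):

```ts
router.addDefaultHandler(async ({ crawler }) => {
    // the first call initializes the state with the default value;
    // subsequent calls return the same mutable object
    const state = await crawler.useState({ counter: 0 });
    state.counter++;
});
```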
When the state is modified, the change will be reflected in other handlers:
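Continuing the same sketch:

```ts
router.addHandler('DETAIL', async ({ crawler, log }) => {
    const state = await crawler.useState({ counter: 0 });
    // sees every increment made by the default handler so far
    log.info(`counter: ${state.counter}`);
});
```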
Read more about `useState` here: https://crawlee.dev/api/basic-crawler/class/BasicCrawler#useState
For your use-case, I would recommend the `crawler.useState` route of doing things.
exotic-emeraldOP•3y ago
Thanks for the answer. I think the `userData` approach isn't suitable for this, because it only passes some data from one request to another. But I need to share a complex object which should be mutable: if the object changes in one request, it should be changed in all the others as well. `userData` wouldn't work this way, so I guess I'll have to use `useState`. I'm not sure whether a complex object (I mean an object which contains other data structures like `Map`s etc.) is supported there, but I'll check it out. 🙂
equal-jade•3y ago
The state is a data tree, `{ anything }`, so just ensure correct handling of concurrent conditions, i.e. if new data of the same type is expected from multiple requests, push it to an array, etc.
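For example (a sketch; collecting URLs stands in for whatever per-request data you gather):

```ts
router.addDefaultHandler(async ({ crawler, request }) => {
    const state = await crawler.useState({ urls: [] as string[] });
    // push() appends, so concurrent handlers add entries instead of
    // overwriting each other the way `state.urls = [...]` would
    state.urls.push(request.loadedUrl ?? request.url);
});
```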