How to share an object between requests with Crawlee on Apify

Hello. While scraping a website, I need access to an object that is shared between all requests. I keep some data in this object, and every request can read/write to it. When all requests are handled, I run some validation and calculations on the data and write the result to the Dataset. This was easy in Apify SDK v2: I created an instance of the object and passed it as a parameter to the handleXY methods. Like this:
const myData = new MyData();

const crawlerOptions = {
    handlePageFunction: async (context) => {
        switch (context.request.userData.type) {
            case "pageA": await handleBranch(myData); break;
            default: await handleStart(myData);
        }
    },
};

const crawler = new Apify.PuppeteerCrawler(crawlerOptions);
await crawler.run();
await Apify.pushData(myData.getData());
This works without any problems. I need to achieve the same behavior with Crawlee, and I want to use routing. Since I can't pass any parameters to the handlers, I create an instance of myData, set this instance on the crawler, and then read it from there. Like this:
// main.js
const crawler = new PuppeteerCrawler();
crawler.myData = new MyData();

// routes.js
router.addDefaultHandler(async ({ crawler }) => {
    const myData = crawler.myData;
});
However, I found that sometimes the task is restarted somehow. It handles some requests, then a new Docker instance is created, and that one handles the rest of the requests. When this new instance is created, I lose the instance of myData.
2022-10-11T12:56:01.157Z INFO Request N
2022-10-11T12:56:20.894Z ACTOR: Pulling Docker image from repository.
2022-10-11T12:56:42.031Z ACTOR: Creating Docker container.
2022-10-11T12:56:42.303Z ACTOR: Starting Docker container.
2022-10-11T12:56:54.251Z INFO Request N + 1
How can I solve this issue? Do I have to serialize this object to the Dataset/KeyValueStore? What about parallel requests? The best solution for me would be to keep all requests in one Docker instance. Is that possible somehow?
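One possible answer to the serialization question, as a hedged sketch rather than an official recipe: the apify package emits a persistState event that you can hook to write the object to the default KeyValueStore, and you can read it back on startup after a migration. MyData.toJSON/fromJSON are hypothetical helpers here:

import { Actor } from 'apify';

await Actor.init();

// Restore previously persisted data (undefined on the first run).
const stored = await Actor.getValue('MY_DATA');
const myData = stored ? MyData.fromJSON(stored) : new MyData(); // hypothetical (de)serialization helpers

// Persist periodically, so a migrated container can pick the data back up.
Actor.on('persistState', async () => {
    await Actor.setValue('MY_DATA', myData.toJSON());
});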
5 Replies
exotic-emerald (OP) • 3y ago
When working with parallel requests, I'm afraid of this: Request A deserializes the data and changes something. Meanwhile, Request B deserializes it as well and changes something else. Request A serializes it and saves it to the Dataset/KeyValueStore. Request B does the same, but the changes made by Request A are lost (because the data is overwritten by Request B).
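Spelled out as code, the lost-update race described here is the classic naive read-modify-write round-trip (a sketch; 'MY_DATA' and newItem are placeholders):

// Requests A and B each read the same snapshot...
const data = (await Actor.getValue('MY_DATA')) ?? { items: [] };
// ...mutate their own private copy...
data.items.push(newItem);
// ...and write it back: the last writer silently overwrites the other.
await Actor.setValue('MY_DATA', data);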
rising-crimson • 3y ago
There are two solutions for this: userData and useState. They are quite different. 1. userData: When you create a request with the userData property, like this:
router.addHandler('some-label', async ({ crawler }) => {
    const request = {
        url: 'https://foo.com',
        userData: {
            // this is shared data
            hello: 'world',
        },
    };

    await crawler.addRequests([request]);
});
That data will be available in the handler for that request like this:
router.addHandler('foo', ({ request }) => {
    const { hello } = request.userData as { hello: string };

    console.log(hello); // -> world
});
Read more about userData here: https://crawlee.dev/api/core/class/Request#userData 2. useState: This is a method available on the crawler instance that basically mirrors what React's useState hook does. It lets you manage global state for an entire crawler easily, without needing to drill data down through requests:
router.addHandler('some-label', async ({ crawler }) => {
    // access the state
    const state = await crawler.useState<Record<string, string>>({});

    // modify it in this handler
    state.hello = 'world';
});
When the state is modified, the change is reflected in any other handler:
router.addHandler('foo', async ({ crawler }) => {
    // access the state
    const state = await crawler.useState<Record<string, string>>({});

    console.log(state.hello); // -> "world"
});
Read more about useState here: https://crawlee.dev/api/basic-crawler/class/BasicCrawler#useState For your use case, I would recommend the crawler.useState approach.
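For completeness, here is a minimal sketch of the original use case ported to crawler.useState. The state shape and reading the state back after crawler.run() are assumptions, not a verbatim recipe; the state should also survive the migration shown in the log above, since useState is backed by the default KeyValueStore and persisted automatically:

import { Dataset, PuppeteerCrawler, createPuppeteerRouter } from 'crawlee';

const router = createPuppeteerRouter();

router.addDefaultHandler(async ({ crawler, request }) => {
    // Every handler gets the same in-memory object back.
    const state = await crawler.useState({ prices: {}, errors: [] });
    state.prices[request.url] = 42; // placeholder value; read/write shared data freely
});

const crawler = new PuppeteerCrawler({ requestHandler: router });
await crawler.run();

// After the run, fetch the same state again, then validate and push it.
const finalState = await crawler.useState({ prices: {}, errors: [] });
await Dataset.pushData(finalState);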
exotic-emerald (OP) • 3y ago
Thanks for the answer. I think the userData approach isn't suitable for this, because it only passes data from one request to another. I need to share a complex object that is mutable: if the object changes in one request, the change should be visible in all the others as well. userData wouldn't work this way, so I guess I'll have to use useState. I'm not sure whether a complex object (I mean an object containing other data structures like Maps etc.) is supported there, but I'll check it out. 🙂
equal-jade • 3y ago
The state is a data tree ({ anything }), so just ensure correct handling of concurrent conditions, i.e. if new data of the same type is expected from multiple requests, push it to an array, etc.
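As a small sketch of that advice (the 'detail' label and state shape are assumptions): appending to an array in the shared state means concurrent handlers add entries instead of overwriting one another:

router.addHandler('detail', async ({ crawler, request }) => {
    const state = await crawler.useState({ results: [] });
    state.results.push({ url: request.url }); // append, don't overwrite
});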
