Saving the working configurations & Sessions for each site
Hi!
I'm new to Crawlee and super excited to migrate my scraping architecture to it, but I can't find how to achieve this.
My use case:
I'm scraping 100 websites multiple times a day. I'd like to save the working configurations (cookies, headers, proxy) for each site.
From what I understand, Sessions are made for this.
However, I'd like to have the working Sessions in my database: this way, working sessions persist even if the script shuts down.
Also, saving the working configurations in a database would be useful when scaling Crawlee to multiple server instances.
My ideal scenario would be to save the full configuration for each site: the type of crawler used (cheerio, got, playwright), CSS selectors, proxy needs, headers, cookies, and so on (see the sketch below).
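For context, the kind of per-site record I have in mind would look something like this (all the field names are just illustrative):

```ts
// Hypothetical shape of a per-site configuration record (illustrative only).
interface SiteConfig {
    crawlerType: 'cheerio' | 'got' | 'playwright';
    selectors: Record<string, string>; // e.g. { title: 'h1.product-title' }
    proxyUrl?: string;
    headers?: Record<string, string>;
    cookies?: { name: string; value: string; domain: string; path: string }[];
}
```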
Thanks a lot for your help!
4 Replies
like-gold•2mo ago
For storing your sessions, you can use this option:
https://crawlee.dev/api/next/core/interface/SessionPoolOptions#persistStateKeyValueStoreId
It will persist session state in a named KeyValue Store. You can simply create a separate store for each target site.
You can also use Apify's platform (along with Apify SDK) to create your database for storing data. It's already well optimized for all scraping purposes.
More info about Apify's KV store:
- KV store on the platform:
https://docs.apify.com/platform/storage/key-value-store
- SDK docs:
https://docs.apify.com/sdk/js/docs/guides/session-management
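A minimal sketch of wiring this up with a CheerioCrawler; the per-site store naming scheme here is just an example, not a Crawlee convention:

```ts
import { CheerioCrawler } from 'crawlee';

// One named key-value store per target site (example naming scheme).
const siteKey = 'example-com';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        // Session pool state (sessions, cookies, usage stats) is persisted
        // into this named key-value store and restored on the next run.
        persistStateKeyValueStoreId: `sessions-${siteKey}`,
    },
    async requestHandler({ request, $ }) {
        // Site-specific scraping logic goes here.
    },
});

await crawler.run(['https://example.com']);
```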
unwilling-turquoiseOP•2mo ago
Thank you for your answer!
The Apify platform looks neat! I'll look into it thank you!
"You can simply create a separate store for each target site."
Does this mean having one crawler instance per target site? Performance-wise, would it be better to have a limited number of crawler instances?
like-gold•2mo ago
You don’t necessarily need one crawler instance per target site. Instead, you can use a single crawler instance and dynamically load the correct session data from the Key-Value Store (KV Store) or your own database based on the target site.
Performance Considerations:
- If you have 100+ websites, running one crawler per site might be too resource-intensive.
- A better approach is to use a single, centralized crawler that loads the required session (cookies, headers, proxy, etc.) before making a request.
You can implement this by fetching the saved session from the KV Store (or your DB) inside the requestHandler (https://crawlee.dev/api/next/basic-crawler/interface/BasicCrawlerOptions#requestHandler) or preNavigationHooks (https://crawlee.dev/api/next/browser-crawler/interface/BrowserCrawlerOptions#preNavigationHooks), as in the sketch below.
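Here is a rough sketch of that approach with a PlaywrightCrawler. The store naming scheme and the SavedSiteConfig shape are assumptions for illustration, not Crawlee APIs:

```ts
import { PlaywrightCrawler, KeyValueStore } from 'crawlee';

// Assumed shape of the per-site config saved earlier (not a Crawlee type).
interface SavedSiteConfig {
    cookies?: { name: string; value: string; domain: string; path: string }[];
    headers?: Record<string, string>;
}

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ request, page }) => {
            // Derive a store name from the target hostname (example scheme).
            const { hostname } = new URL(request.url);
            const store = await KeyValueStore.open(`config-${hostname.replace(/\W/g, '-')}`);

            // Load the saved config; this could just as well be a query
            // against your own database.
            const config = await store.getValue<SavedSiteConfig>('SITE_CONFIG');

            if (config?.cookies) await page.context().addCookies(config.cookies);
            if (config?.headers) await page.setExtraHTTPHeaders(config.headers);
        },
    ],
    async requestHandler({ request, page }) {
        // Site-specific scraping logic goes here.
    },
});
```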