genetic-orange
Adding session-cookies
After following the crawlee scraping tutorial, I cannot figure out how to add specific cookies (key-value pairs, e.g. sid=1234) to a request.
There is something like a Session and a SessionPool, but how do I reach these objects?
Also, the session pool's max_pool_size defaults to 1000. Should one then iterate through all the sessions in the pool and set the session ID in each session.cookies dict?
Imagine the code below from the tutorial: the default handler handles the incoming request and wants to enqueue requests to the category pages. Let's say these category pages require the sid cookie to be set; how can this be achieved?
Any help is very much appreciated, as no examples can be found via Google / ChatGPT / Perplexity.
15 Replies
I'm a little surprised that you need to set cookies for a PlaywrightCrawler.
For HTTP crawlers you could pass cookies inside headers, but for Playwright I can't think of a quick solution.
genetic-orangeOP•8mo ago
Dear Mantisus, thanks for your follow-up. How would you then handle a login page? The sid cookie is not shared across all 1000 sessions in the session pool, right? So instead of logging in once, would it then need a separate login (including 2FA resolution in the worst case) for every request?
Understood your use case. I'll dig into the crawlee-python code a bit and see if I can come up with some ideas.
@crawleexl
I would use something like this
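Roughly like this (a sketch, assuming the v0.3.5-era API where PlaywrightBrowserPlugin forwarded page_options to Playwright's browser.new_page(); sid=1234 and the start URL are placeholders):

```python
import asyncio

from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Send the sid cookie as a raw Cookie header on every page the
    # plugin opens (page_options is assumed to be forwarded to
    # Playwright's browser.new_page()).
    plugin = PlaywrightBrowserPlugin(
        page_options={'extra_http_headers': {'Cookie': 'sid=1234'}},
    )
    crawler = PlaywrightCrawler(browser_pool=BrowserPool(plugins=[plugin]))

    @crawler.router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext) -> None:
        # Category pages enqueued here are visited with the same header.
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```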
Or if you want to set a cookie after some action
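For example (a sketch; the domain is a placeholder, and add_cookies() is the standard Playwright BrowserContext API):

```python
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()


@crawler.router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    # After some action (e.g. submitting a login form), attach the sid
    # cookie to the browser context so later navigations send it along.
    await context.page.context.add_cookies([
        {'name': 'sid', 'value': '1234', 'domain': 'example.com', 'path': '/'},
    ])
    await context.enqueue_links()
```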
genetic-orangeOP•8mo ago
Thanks for your quick reply.
Crawlee v0.3.9
Just checking: trying out the first code example, it fails with:
'PlaywrightBrowserPlugin' object is not iterable
Or when simply doing:
It says:
BrowserContext.new_page() got an unexpected keyword argument 'extra_http_headers'
Then, looking in the Git repo:
https://github.com/apify/crawlee-python/blob/master/src/crawlee/browsers/_playwright_browser_plugin.py
It does not specify which page options are valid; extra_http_headers should be part of the normal specification.
I tested it on Crawlee v0.3.5.
I see they've changed something.
Until the development team provides public methods for passing parameters to the PlaywrightBrowserController, the only solution I can see is patching the HeaderGenerator.
Example
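A sketch of that patch; the import path, the method name get_common_headers(), and the controller's private attribute are assumptions read off the v0.3.9 source and may change between releases:

```python
from crawlee.browsers import PlaywrightBrowserController
from crawlee.fingerprint_suite import HeaderGenerator


class CookieHeaderGenerator(HeaderGenerator):
    """Adds a Cookie header on top of the generated common headers."""

    def get_common_headers(self) -> dict[str, str]:
        headers = dict(super().get_common_headers())
        headers['Cookie'] = 'sid=1234'
        return headers


# The controller asks its header generator for the headers it passes to
# new_context(); swapping the (private) instance patches every browser
# context created afterwards.
PlaywrightBrowserController._header_generator = CookieHeaderGenerator()
```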
They went from creating a single-page context to a full browser context, but they don't yet provide any methods to pass custom parameters to it:
https://github.com/apify/crawlee-python/blob/master/src/crawlee/browsers/_playwright_browser_controller.py#L155
Apparently these updates came with version 0.3.9; if you are using an earlier version, then my previous examples should work (at least on v0.3.5).
You can see the allowed parameters for single page context in the playwright documentation - https://playwright.dev/python/docs/api/class-browser#browser-new-page.
A cleaner solution for v0.3.9
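Along these lines (a sketch: subclass the controller and override its context factory; the private names _browser and _create_browser_context are assumptions based on the linked source):

```python
from crawlee.browsers import PlaywrightBrowserController


class CookieBrowserController(PlaywrightBrowserController):
    """Creates the shared browser context with a fixed Cookie header."""

    async def _create_browser_context(self):
        # v0.3.9 creates one full context for all pages; inject the
        # header here so every page opened from it sends the cookie.
        return await self._browser.new_context(
            extra_http_headers={'Cookie': 'sid=1234'},
        )
```

A browser plugin then has to hand out this controller instead of the stock one, so its controller factory needs the same treatment.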
genetic-orangeOP•8mo ago
That works like a charm for now with the override.
Just for future reference, from v0.4.0 onwards.
Let's say one sets the session cookie like this:
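For illustration, an assumed setup mirroring that override (the private names are guesses and may change):

```python
from crawlee.browsers import PlaywrightBrowserController

# Fixed at startup; the question is how to change it later on.
CUSTOM_HEADERS = {'Cookie': 'sid=1234'}


class CookieBrowserController(PlaywrightBrowserController):
    async def _create_browser_context(self):  # private name, may change
        return await self._browser.new_context(
            extra_http_headers=CUSTOM_HEADERS,
        )
```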
Then on a certain request it needs to re-authenticate. Is there a way, from within a request_handler, to retrieve the BrowserPool object and override the custom header?
I don't know what the developers' plans are for the next releases. I don't think we'll get access to context management from request_handler with the approaches being used now.
For rewriting headers now I would use this approach
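Something like this (a sketch, not an official API): keep the headers in a module-level mutable dict that the controller override reads when it creates a context, and mutate that dict from the handler. The private names are assumptions:

```python
from crawlee.browsers import PlaywrightBrowserController
from crawlee.playwright_crawler import PlaywrightCrawlingContext

# Shared, mutable header store.
CUSTOM_HEADERS = {'Cookie': 'sid=1234'}


class MutableHeaderController(PlaywrightBrowserController):
    async def _create_browser_context(self):  # private name, may change
        # Copy the store so each context snapshots the current headers.
        return await self._browser.new_context(
            extra_http_headers=dict(CUSTOM_HEADERS),
        )


async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # After re-authenticating, update the store; browser contexts
    # created from now on will carry the new sid.
    CUSTOM_HEADERS['Cookie'] = 'sid=5678'
```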
To contact the development team, the best way is to use https://github.com/apify/crawlee-python/discussions
Those who reply here are mostly developers like you and me who are just using the library.
genetic-orangeOP•8mo ago
Thanks for your help; it gives many clues. Much appreciated.
@crawleexl
Pay attention to
https://github.com/apify/crawlee-python/blob/master/src/crawlee/playwright_crawler/_playwright_pre_navigation_context.py (which will be in the next release)
and https://crawlee.dev/python/docs/examples/playwright-crawler (apparently published by mistake, as this functionality is not yet available in v0.3.9).
When this code is released, it should make it possible to do something like this
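Roughly (a sketch against the pre-release API linked above; the decorator name and the hook context's attributes are assumptions, and the domain is a placeholder):

```python
from crawlee.playwright_crawler import PlaywrightCrawler

crawler = PlaywrightCrawler()


# Runs before each navigation, so the cookie lands in the browser
# context before the target page is requested.
@crawler.pre_navigation_hook
async def set_sid_cookie(context) -> None:
    await context.page.context.add_cookies([
        {'name': 'sid', 'value': '1234', 'domain': 'example.com', 'path': '/'},
    ])
```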