fair-rose
Python Session Tracking
Is there a way to ensure that successive requests are made using the same session (with the same cookies, etc.) in the Python API? I am scraping a very fussy site that seems to have strict session continuity requirements so I need to ensure that for main page A, all requests to sub pages linked from there, A-1, A-2, A-3, etc. (as well as A-1-1, A-1-2, etc.,) are made within the same session as the original request.
Thanks as always.
11 Replies
This post was marked as solved by uberpea5000.
Unfortunately, I don't see a good way to do this at the moment, since the session is passed to the context at a pretty deep level: https://github.com/apify/crawlee-python/blob/master/src/crawlee/crawlers/_basic/_basic_crawler.py#L985
I think it has to do with some boundary cases, for example when the session gets blocked in the middle of a request chain.
I would consider two workarounds based on pre_navigation_hook (https://crawlee.dev/python/api/class/PlaywrightCrawler#pre_navigation_hook):
1. Check whether the session has the necessary cookies and, if not, make a request to the page that generates them.
2. Pass the cookies as user_data and use them to update the session that will make the request.
3. A third option: if you don't care about high parallelism, you can try using a single session for everything.
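Workaround 2 can be sketched without any crawler library. This is a minimal illustration only: `FakeSession`, `child_user_data` and `carry_cookies` are hypothetical names, not Crawlee API. In a real crawler, the `carry_cookies` step would run inside a pre_navigation_hook, and the `user_data` dict would travel with the enqueued request.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSession:
    """Hypothetical stand-in for a crawler session: an id plus a cookie dict."""
    id: str
    cookies: dict = field(default_factory=dict)


def child_user_data(session: FakeSession) -> dict:
    """Snapshot the parent's cookies so they can be attached to sub-page requests."""
    return {"cookies": dict(session.cookies)}


def carry_cookies(session: FakeSession, user_data: dict) -> None:
    """Before navigation: copy the carried cookies into whichever session was picked."""
    session.cookies.update(user_data.get("cookies", {}))


# Page A was fetched by one session; sub-page A-1 happens to get another one.
parent = FakeSession("s1", {"sessionid": "abc"})
user_data = child_user_data(parent)
other = FakeSession("s2", {"lang": "en"})
carry_cookies(other, user_data)
print(other.cookies)
```

The point of the design is that continuity is tied to the request chain rather than to a particular session object, so it survives the pool handing A-1 to a different session.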
fair-rose OP • 6mo ago
Thanks! These are great solutions. I'm going with option 3 for now (it works well enough for me), but I'll experiment with 1 and 2 as well.
Glad it's helpful for you
fair-rose • 5mo ago
Hey Mantisus,
I was wondering what the trade-off is between updating the session by passing the cookies in the pre_navigation_hook and setting them at the request header level, as you described in this issue:
https://github.com/apify/crawlee-python/issues/710
Just to clarify my understanding of these solutions: the session cookies will persist with each session, so we wouldn't need to store them ourselves?
Thanks super much.
GitHub
Add session cookies to crawling context · Issue #710 · apify/crawle...
Add to the context, the cookie of the session from which the request was made, both for HTTP crawlers and Playwright
Hey @Doigus
The key difference between these approaches is this: when you pass a cookie to a Request, it overwrites any other cookies. So this approach works best when you want all requests to be made with the same cookie.
With pre_navigation_hook you have more control over what happens.
For example, if your crawler is performing authorization on a site and you know that the sessionid cookie is responsible for this, you can cache it and pass it inside pre_navigation_hook for all sessions that do not have a sessionid, or do the same using use_state (available since version 0.5.0).
In this case, yes, the sessionid cookie will be in every session and it doesn't matter when it was created.
Note that this approach will not work for Playwright, as it is a bit more complicated.
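The "cache the sessionid and fill it in where missing" idea could look roughly like the sketch below. It is library-agnostic on purpose: `auth_cache`, `remember_auth` and `ensure_auth` are hypothetical helpers, and plain dicts stand in for session cookies. In a crawler, `ensure_auth` would be the body of the pre_navigation_hook, and the cache could live in shared crawler state instead of a module-level dict.

```python
# Hypothetical module-level cache; not part of any library.
auth_cache: dict = {}


def remember_auth(session_cookies: dict) -> None:
    """After a successful login: cache the sessionid if this session has one."""
    if "sessionid" in session_cookies:
        auth_cache["sessionid"] = session_cookies["sessionid"]


def ensure_auth(session_cookies: dict) -> dict:
    """Before navigation: add the cached sessionid only to sessions that lack it."""
    if "sessionid" not in session_cookies and "sessionid" in auth_cache:
        session_cookies["sessionid"] = auth_cache["sessionid"]
    return session_cookies


remember_auth({"sessionid": "abc", "theme": "dark"})
print(ensure_auth({"lang": "en"}))        # gains the cached sessionid
print(ensure_auth({"sessionid": "own"}))  # already has one, left untouched
```

Unlike setting a cookie on the request itself, this only fills a gap: a session that already carries its own sessionid and other cookies keeps them.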
fair-rose • 5mo ago
I'm using Playwright with Camoufox.
I'll give this a go, thank you!
Glad if this proves useful.
Oh, that's a pretty heavy decision. I've been testing Camoufox with PlaywrightCrawler for a while.
Interesting, but very resource intensive, although I realize that in some cases this is the best approach.
fair-rose • 5mo ago
You would suggest trying Chromium instead?
Am I right to assume that the sessions get set automatically after login?
I favor HTTP crawlers wherever possible.
Yes, in any browser-based system, cookies are set automatically in the context when you authorize. If you have a single context that won't be closed, you may not have to worry about cookies at all.
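The same "one long-lived context keeps its cookies" behaviour can be seen outside a browser. The sketch below is an analogy using only the standard library, not Playwright or Crawlee code: a `CookieJar` stands in for the browser context, and `DemoHandler` is a made-up local site whose `/login` sets a cookie while every other path echoes back the Cookie header it received.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy site: /login sets a cookie, every path echoes the Cookie header."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            self.send_header("Set-Cookie", "sessionid=abc123; Path=/")
        body = (self.headers.get("Cookie") or "").encode()
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def demo() -> str:
    """Log in once, then fetch a sub-page through the same cookie jar."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    try:
        jar = http.cookiejar.CookieJar()  # plays the role of the browser context
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),  # ignore any system proxy settings
            urllib.request.HTTPCookieProcessor(jar),
        )
        opener.open(f"{base}/login").read()
        # No cookie is passed by hand here; the jar supplies it.
        return opener.open(f"{base}/A-1").read().decode()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

As long as the same jar (or, in a browser, the same context) is reused, the login cookie follows every later request without any manual handling.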
If the site uses a lot of anti-scraping technologies, plain Chromium probably won't work.
But if Chromium works for you, then yes, it is better than Camoufox, as it will use significantly fewer resources.
This is a very promising PR: https://github.com/apify/crawlee-python/pull/829
It could cover many cases where plain Chromium does not work and Camoufox is excessive.