Crawlee & Apify • 6mo ago
fair-rose

Python Session Tracking

Is there a way to ensure that successive requests are made using the same session (with the same cookies, etc.) in the Python API? I am scraping a very fussy site that seems to have strict session-continuity requirements, so I need to ensure that for main page A, all requests to subpages linked from there (A-1, A-2, A-3, etc., as well as A-1-1, A-1-2, etc.) are made within the same session as the original request. Thanks as always.
11 Replies
Hall
Hall • 6mo ago
Someone will reply to you shortly. In the meantime, this might help: This post was marked as solved by uberpea5000.
Mantisus
Mantisus • 6mo ago
Unfortunately, I don't see a good way to do this at the moment, since the session is passed to the context at a pretty deep level - https://github.com/apify/crawlee-python/blob/master/src/crawlee/crawlers/_basic/_basic_crawler.py#L985 I think this is to handle some boundary cases - for example, when the session gets blocked in the middle of a request chain. I would consider 2 workarounds using https://crawlee.dev/python/api/class/PlaywrightCrawler#pre_navigation_hook. The first: check whether the session has the necessary cookies, and if not, make a request to the page that generates them:
@crawler.pre_navigation_hook
async def hook1(context: HttpCrawlingContext) -> None:
    # For labeled (non-initial) requests, make sure the session already has
    # the required cookie; if not, hit the page that generates it first.
    if context.request.label and 'basic' not in context.session.cookies:
        await context.send_request('https://httpbin.org/cookies/set/basic/100')
The second is to pass the cookies as user_data and update the session that will make the request with them:
@crawler.router.default_handler
async def handler_one(context: HttpCrawlingContext) -> None:
    # Capture this session's cookies and pass them along with the next request.
    session_cookie = context.session.cookies
    request = Request.from_url(
        url='https://httpbin.org/cookies/set/d/10',
        label='label_two',
        user_data={'session_cookie': session_cookie})
    await context.add_requests([request])

@crawler.pre_navigation_hook
async def hook1(context: HttpCrawlingContext) -> None:
    # Restore the captured cookies onto whichever session makes this request.
    if context.request.label:
        context.session.cookies.update(context.request.user_data['session_cookie'])
If you don't care about high parallelism, you can try using a single session for everything:
from crawlee.sessions import SessionPool

crawler = HttpCrawler(
    session_pool=SessionPool(
        max_pool_size=1,
        create_session_settings={
            'max_usage_count': float('inf'),
        }))
fair-rose
fair-rose OP • 6mo ago
Thanks! These are great solutions. I'm going with option 3 for now, which is working well enough for me, but I'll experiment with 1 and 2 as well.
Mantisus
Mantisus • 6mo ago
Glad it's helpful for you
fair-rose
fair-rose • 5mo ago
Hey Mantisus, I was wondering what the trade-off is between updating the session by passing the cookies in the pre_navigation_hook versus at the request header level, as you described in this issue: https://github.com/apify/crawlee-python/issues/710 Just to clarify my understanding of these solutions: the session cookies will persist with each session, so we wouldn't need to store them ourselves? Thanks super much.
GitHub
Add session cookies to crawling context · Issue #710 · apify/crawle...
Add to the context, the cookie of the session from which the request was made, both for HTTP crawlers and Playwright
Mantisus
Mantisus • 5mo ago
Hey @Doigus The key difference between these approaches: when you pass cookies to a Request, they overwrite any other cookies, so that approach works best when you want all requests to be made with the same cookies (a short sketch follows below). With pre_navigation_hook you have more control over what happens. For example, if your crawler performs authorization on a site and you know that the sessionid cookie is responsible for it, you can cache it and pass it inside pre_navigation_hook to all sessions that do not have a sessionid.
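A minimal sketch of the request-level variant, assuming Request.from_url accepts a headers mapping ('sessionid=1' is a placeholder value):

from crawlee import Request

# Hypothetical sketch: a Cookie header set on the request replaces whatever
# cookies the session would otherwise send ('sessionid=1' is a placeholder).
request = Request.from_url(
    url='https://httpbin.org/get',
    headers={'Cookie': 'sessionid=1'},
)

And the pre_navigation_hook variant, caching the sessionid across sessions: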
import asyncio

from crawlee import Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()
    _cache = {}

    @crawler.pre_navigation_hook
    async def hook(context: HttpCrawlingContext) -> None:
        # Give the cached sessionid to any session that doesn't have it yet.
        if 'sessionid' not in context.session.cookies and 'sessionid' in _cache:
            context.session.cookies['sessionid'] = _cache['sessionid']

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')

        # Cache the sessionid the first time any session obtains it.
        if 'sessionid' not in _cache and 'sessionid' in context.session.cookies:
            _cache['sessionid'] = context.session.cookies['sessionid']

        print(context.http_response.read())

        await context.add_requests([Request.from_url('https://httpbin.org/get')])

    await crawler.run([Request.from_url('https://httpbin.org/cookies/set/sessionid/1')])


if __name__ == '__main__':
    asyncio.run(main())
Or, since version 0.5.0, you can do the same with use_state:
import asyncio

from crawlee import Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.pre_navigation_hook
    async def hook(context: HttpCrawlingContext) -> None:
        # use_state provides persistent, crawler-wide key-value storage.
        _cache = await context.use_state()
        if 'sessionid' not in context.session.cookies and 'sessionid' in _cache:
            context.session.cookies['sessionid'] = _cache['sessionid']

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')
        _cache = await context.use_state()

        if 'sessionid' not in _cache and 'sessionid' in context.session.cookies:
            _cache['sessionid'] = context.session.cookies['sessionid']

        print(context.http_response.read())

        await context.add_requests([Request.from_url('https://httpbin.org/get')])

    await crawler.run([Request.from_url('https://httpbin.org/cookies/set/sessionid/1')])


if __name__ == '__main__':
    asyncio.run(main())
In this case, yes, the sessionid cookie will end up in every session, no matter when the session was created. Note that this approach will not work as-is for Playwright, where it is a bit more complicated:
import asyncio

from crawlee import Request
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.pre_navigation_hook
    async def hook(context: PlaywrightCrawlingContext) -> None:
        _cache = await context.use_state()
        if 'sessionid' in _cache:
            # Playwright stores full cookie records, not just name/value pairs.
            await context.page.context.add_cookies([_cache['sessionid']])

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context_cookies = await context.page.context.cookies(context.request.url)

        _cache = await context.use_state()

        target_cookie = None
        for cookie in context_cookies:
            if cookie['name'] == 'sessionid':
                target_cookie = cookie

        if 'sessionid' not in _cache and target_cookie:
            _cache['sessionid'] = target_cookie

        print(await context.page.content())

        # Clear cookies so the approach is verified even when the same
        # browser context is reused.
        await context.page.context.clear_cookies()

        await context.add_requests([Request.from_url('https://httpbin.org/get')])

    await crawler.run([Request.from_url('https://httpbin.org/cookies/set/sessionid/1')])


if __name__ == '__main__':
    asyncio.run(main())
fair-rose
fair-rose • 5mo ago
I'm using Playwright with Camoufox; I'll give this a go, thank you 🙂
Mantisus
Mantisus • 5mo ago
Glad if this proves useful. Oh, that's a pretty heavyweight choice. I've been testing Camoufox with PlaywrightCrawler for a while - interesting, but very resource-intensive, although I realize that in some cases it is the best approach 🙂
fair-rose
fair-rose • 5mo ago
Would you suggest trying Chromium instead? Am I right to assume that the sessions get automatically set after login?
Mantisus
Mantisus • 5mo ago
I favor HTTP crawlers wherever possible. 🙂 Yes, in any browser-based setup, cookies are set automatically in the context when you authorize. If you have a single context that won't be closed, you may not have to worry about cookies at all. If the site uses a lot of anti-scraping technology, plain Chromium probably won't work. But if Chromium does work for you, then yes, it is better than Camoufox, as it uses significantly fewer resources. There is also a very promising PR - https://github.com/apify/crawlee-python/pull/829 - which can cover many cases where plain Chromium does not work and Camoufox is excessive.
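If you want to try plain Chromium first, a minimal sketch, assuming PlaywrightCrawler accepts browser_type and headless arguments:

from crawlee.crawlers import PlaywrightCrawler

# Sketch under the assumption that browser_type selects the Playwright engine;
# plain Chromium uses far fewer resources than Camoufox.
crawler = PlaywrightCrawler(
    browser_type='chromium',
    headless=True,
)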
