Crawlee & Apify • 6mo ago
fair-rose

Python Session Tracking

Is there a way to ensure that successive requests are made using the same session (with the same cookies, etc.) in the Python API? I am scraping a very fussy site that seems to have strict session-continuity requirements, so I need to ensure that for main page A, all requests to subpages linked from there (A-1, A-2, A-3, etc., as well as A-1-1, A-1-2, etc.) are made within the same session as the original request. Thanks as always.
11 Replies
Hall
Hall • 6mo ago
Someone will reply to you shortly. In the meantime, this might help: This post was marked as solved by uberpea5000.
Mantisus
Mantisus • 6mo ago
Unfortunately, I don't see a good way to do this at the moment, since the session is passed to the context at a pretty deep level - https://github.com/apify/crawlee-python/blob/master/src/crawlee/crawlers/_basic/_basic_crawler.py#L985 I think this is to handle some boundary cases - for example, when the session gets blocked in the middle of a request chain. I would consider 2 workarounds using https://crawlee.dev/python/api/class/PlaywrightCrawler#pre_navigation_hook. The first: check whether the session has the necessary cookies, and if not, make a request to the page that generates them:
@crawler.pre_navigation_hook
async def hook1(context: HttpCrawlingContext) -> None:
    # For labeled (non-initial) requests, make sure the session already has
    # the required cookie; if not, hit the page that generates it first.
    if context.request.label and 'basic' not in context.session.cookies:
        await context.send_request('https://httpbin.org/cookies/set/basic/100')
The second is to pass the cookies as user_data and update the session that will make the request with them:
@crawler.router.default_handler
async def handler_one(context: HttpCrawlingContext) -> None:
    # Capture this session's cookies and pass them along with the next request.
    session_cookie = context.session.cookies
    request = Request.from_url(
        url='https://httpbin.org/cookies/set/d/10',
        label='label_two',
        user_data={'session_cookie': session_cookie})
    await context.add_requests([request])

@crawler.pre_navigation_hook
async def hook1(context: HttpCrawlingContext) -> None:
    # Restore the captured cookies onto whichever session makes this request.
    if context.request.label:
        context.session.cookies.update(context.request.user_data['session_cookie'])
If you don't care about high parallelism, you can try using a single session for everything:
from crawlee.sessions import SessionPool

crawler = HttpCrawler(
    session_pool=SessionPool(
        max_pool_size=1,
        create_session_settings={
            'max_usage_count': float('inf'),
        }))
fair-rose
fair-rose OP • 6mo ago
Thanks! These are great solutions. I'm going with option 3 for now, which is working well enough for me, but I'll experiment with 1 and 2 as well.
Mantisus
Mantisus • 6mo ago
Glad it's helpful for you
fair-rose
fair-rose • 5mo ago
Hey Mantisus, I was wondering what the trade-off is between updating the session by passing the cookies in the pre_navigation_hook versus at the request header level, as you described in this issue: https://github.com/apify/crawlee-python/issues/710 Just to clarify my understanding of these solutions: the session cookies will persist with each session, so we wouldn't need to store them ourselves? Thanks super much.
GitHub
Add session cookies to crawling context · Issue #710 · apify/crawle...
Add to the context, the cookie of the session from which the request was made, both for HTTP crawlers and Playwright
Mantisus
Mantisus • 5mo ago
Hey @Doigus The key difference between these approaches: when you pass cookies to a Request, they overwrite any other cookies, so that approach works best when you want all requests to be made with the same cookies (a short sketch follows below). With pre_navigation_hook you have more control over what happens. For example, if your crawler performs authorization on a site and you know that the sessionid cookie is responsible for it, you can cache it and pass it inside pre_navigation_hook to all sessions that do not have a sessionid.
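A minimal sketch of the request-level variant, assuming Request.from_url accepts a headers mapping ('sessionid=1' is a placeholder value):

from crawlee import Request

# Hypothetical sketch: a Cookie header set on the request replaces whatever
# cookies the session would otherwise send ('sessionid=1' is a placeholder).
request = Request.from_url(
    url='https://httpbin.org/get',
    headers={'Cookie': 'sessionid=1'},
)

And the pre_navigation_hook variant, caching the sessionid across sessions: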
import asyncio

from crawlee import Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()
    _cache = {}

    @crawler.pre_navigation_hook
    async def hook(context: HttpCrawlingContext) -> None:
        # Give the cached sessionid to any session that doesn't have it yet.
        if 'sessionid' not in context.session.cookies and 'sessionid' in _cache:
            context.session.cookies['sessionid'] = _cache['sessionid']

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')

        # Cache the sessionid the first time any session obtains it.
        if 'sessionid' not in _cache and 'sessionid' in context.session.cookies:
            _cache['sessionid'] = context.session.cookies['sessionid']

        print(context.http_response.read())

        await context.add_requests([Request.from_url('https://httpbin.org/get')])

    await crawler.run([Request.from_url('https://httpbin.org/cookies/set/sessionid/1')])


if __name__ == '__main__':
    asyncio.run(main())
Or, since version 0.5.0, you can do the same with use_state:
import asyncio

from crawlee import Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.pre_navigation_hook
    async def hook(context: HttpCrawlingContext) -> None:
        # use_state provides persistent, crawler-wide key-value storage.
        _cache = await context.use_state()
        if 'sessionid' not in context.session.cookies and 'sessionid' in _cache:
            context.session.cookies['sessionid'] = _cache['sessionid']

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')
        _cache = await context.use_state()

        if 'sessionid' not in _cache and 'sessionid' in context.session.cookies:
            _cache['sessionid'] = context.session.cookies['sessionid']

        print(context.http_response.read())

        await context.add_requests([Request.from_url('https://httpbin.org/get')])

    await crawler.run([Request.from_url('https://httpbin.org/cookies/set/sessionid/1')])


if __name__ == '__main__':
    asyncio.run(main())
In this case, yes, the sessionid cookie will end up in every session, no matter when the session was created. Note that this approach will not work as-is for Playwright, where it is a bit more complicated:
import asyncio

from crawlee import Request
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.pre_navigation_hook
    async def hook(context: PlaywrightCrawlingContext) -> None:
        _cache = await context.use_state()
        if 'sessionid' in _cache:
            # Playwright stores full cookie records, not just name/value pairs.
            await context.page.context.add_cookies([_cache['sessionid']])

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context_cookies = await context.page.context.cookies(context.request.url)

        _cache = await context.use_state()

        target_cookie = None
        for cookie in context_cookies:
            if cookie['name'] == 'sessionid':
                target_cookie = cookie

        if 'sessionid' not in _cache and target_cookie:
            _cache['sessionid'] = target_cookie

        print(await context.page.content())

        # Clear cookies so the approach is verified even when the same
        # browser context is reused.
        await context.page.context.clear_cookies()

        await context.add_requests([Request.from_url('https://httpbin.org/get')])

    await crawler.run([Request.from_url('https://httpbin.org/cookies/set/sessionid/1')])


if __name__ == '__main__':
    asyncio.run(main())
fair-rose
fair-rose • 5mo ago
I'm using Playwright with Camoufox; I'll give this a go, thank you 🙂
Mantisus
Mantisus • 5mo ago
Glad if this proves useful. Oh, that's a pretty heavyweight choice. I've been testing Camoufox with PlaywrightCrawler for a while - interesting, but very resource-intensive, although I realize that in some cases it is the best approach 🙂
fair-rose
fair-rose • 5mo ago
Would you suggest trying Chromium instead? Am I right to assume that the sessions get automatically set after login?
Mantisus
Mantisus • 5mo ago
I favor HTTP crawlers wherever possible. 🙂 Yes, in any browser-based setup, cookies are set automatically in the context when you authorize. If you have a single context that won't be closed, you may not have to worry about cookies at all. If the site uses a lot of anti-scraping technology, plain Chromium probably won't work. But if Chromium does work for you, then yes, it is better than Camoufox, as it uses significantly fewer resources. There is also a very promising PR - https://github.com/apify/crawlee-python/pull/829 - which can cover many cases where plain Chromium does not work and Camoufox is excessive.
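If you want to try plain Chromium first, a minimal sketch, assuming PlaywrightCrawler accepts browser_type and headless arguments:

from crawlee.crawlers import PlaywrightCrawler

# Sketch under the assumption that browser_type selects the Playwright engine;
# plain Chromium uses far fewer resources than Camoufox.
crawler = PlaywrightCrawler(
    browser_type='chromium',
    headless=True,
)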
