Crawlee & Apify•8mo ago
genetic-orange

Adding session-cookies

After following the Crawlee scraping tutorial, I cannot figure out how to add specific cookies (key-value pairs, e.g. sid=1234) to a request. There is something like a session and a session pool, but how do I reach these objects? And since max_pool_size of the session pool defaults to 1000, would one then have to iterate through the sessions in the pool to set the session ID in each session.cookies dict? Imagine the snippet below from the tutorial: the default handler processes the incoming request and wants to enqueue requests to the category pages. Let's say these category pages require the sid cookie to be set; how can this be achieved? Any help is very much appreciated, as no examples can be found via Google / ChatGPT / Perplexity.
@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    # This is a fallback route which will handle the start URL.
    context.log.info(f'default_handler is processing {context.request.url}')

    await context.page.wait_for_selector('.collection-block-item')

    await context.enqueue_links(
        selector='.collection-block-item',
        label='CATEGORY',
    )
15 Replies
Mantisus
Mantisus•8mo ago
I'm a little surprised that you need to set cookies for PlaywrightCrawler. For HTTP crawlers you could pass the cookies inside the request headers, but for Playwright I can't think of a quick solution.
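A rough sketch of the HTTP-crawler variant I mean, assuming your Crawlee version's Request.from_url() accepts a headers argument (the exact keyword and import paths may differ between releases):

import asyncio

from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # Assumption: Request.from_url() lets you attach headers here; the cookie
    # then travels as a plain request header, no browser involved.
    await crawler.run([
        Request.from_url('https://httpbin.org/get', headers={'cookie': 'sid=1234'}),
    ])


if __name__ == '__main__':
    asyncio.run(main())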
genetic-orange
genetic-orangeOP•8mo ago
Dear Mantisus, thanks for your follow-up. How would you then handle a login page? The sid cookie is not shared with all 1000 sessions in the session pool, right? So instead of logging in once, would it need a separate login (including 2FA resolution in the worst case) for every request?
Mantisus
Mantisus•8mo ago
Understood your use case. I'm going to dig into the crawlee-python code a bit and see if I can come up with some ideas. @crawleexl I would use something like this:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.browsers._playwright_browser_plugin import PlaywrightBrowserPlugin
from crawlee.browsers import BrowserPool


async def main() -> None:
    plugin = PlaywrightBrowserPlugin(
        page_options={"extra_http_headers": {"cookie": "auth=to_rule_over_everyone"}}
    )
    pool = BrowserPool(plugins=[plugin])
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10,
        browser_pool=pool,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        content = await context.page.content()
        print(content)

    await crawler.run(['https://httpbin.org/get'])


if __name__ == '__main__':
    asyncio.run(main())
Or, if you want to set a cookie after some action:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee import Request
from crawlee.browsers._playwright_browser_plugin import PlaywrightBrowserPlugin
from crawlee.browsers import BrowserPool


async def main() -> None:
    user_headers = {}
    user_plugin = PlaywrightBrowserPlugin(page_options={"extra_http_headers": user_headers})
    pool = BrowserPool(plugins=[user_plugin])
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10,
        browser_pool=pool,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        content = await context.page.content()
        print(content)
        user_headers["cookie"] = "auth=to_rule_over_everyone"

        await context.add_requests([Request.from_url(
            "https://httpbin.org/get?page=2"
        )])

    await crawler.run(['https://httpbin.org/get'])


if __name__ == '__main__':
    asyncio.run(main())
genetic-orange
genetic-orangeOP•8mo ago
Thanks for your quick reply. I'm on Crawlee v0.3.9. Just checking: trying out the first code example, it fails with 'PlaywrightBrowserPlugin' object is not iterable. Or, when simply doing:
# Create a browser pool with a Playwright browser plugin
pool = BrowserPool(
    plugins=[
        PlaywrightBrowserPlugin(
            browser_type='chromium',
            browser_options={'headless': False},
            page_options={
                'extra_http_headers': {
                    'Custom-Header': 'Value'
                }
            }
        )
    ]
)
It says: BrowserContext.new_page() got an unexpected keyword argument 'extra_http_headers'. Looking at the source (https://github.com/apify/crawlee-python/blob/master/src/crawlee/browsers/_playwright_browser_plugin.py), it does not specify what the valid page_options are; extra_http_headers should be part of the normal specification.
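For reference, in plain Playwright extra_http_headers is a valid option on Browser.new_page() and Browser.new_context(), but BrowserContext.new_page() takes no such keyword, which matches the error above. A minimal standalone sketch (plain Playwright, independent of Crawlee):

import asyncio

from playwright.async_api import async_playwright


async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch()

        # Valid: headers passed when the Browser creates the page.
        page = await browser.new_page(extra_http_headers={'cookie': 'sid=1234'})
        await page.goto('https://httpbin.org/get')
        print(await page.content())

        # Also valid: headers passed when a context is created, then pages made from it.
        context = await browser.new_context(extra_http_headers={'cookie': 'sid=1234'})
        page2 = await context.new_page()  # new_page() on a context accepts no options
        await page2.goto('https://httpbin.org/get')

        await browser.close()


if __name__ == '__main__':
    asyncio.run(main())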
Mantisus
Mantisus•8mo ago
I tested it on Crawlee v0.3.5; I see they've changed something. Until the development team provides public methods for passing parameters to the PlaywrightBrowserController, the only solution I can see is patching the HeaderGenerator. Example:
import asyncio

from crawlee.fingerprint_suite import HeaderGenerator
from crawlee._types import HttpHeaders
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


def get_common_headers(self):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'Accept-Language': 'en-US,en;q=0.9',
        'cookie': "auth=to_rule_over_everyone"
    }
    return HttpHeaders(headers)


HeaderGenerator.get_common_headers = get_common_headers


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        content = await context.page.content()
        print(content)

    await crawler.run(['https://httpbin.org/get'])


if __name__ == '__main__':
    asyncio.run(main())
Mantisus
Mantisus•8mo ago
They moved from creating a single-page context to a full browser context, but they don't provide any methods to pass custom parameters to it yet: https://github.com/apify/crawlee-python/blob/master/src/crawlee/browsers/_playwright_browser_controller.py#L155
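In plain Playwright terms, the options that previously went to Browser.new_page() would now have to go to the browser context instead, either when it is created or afterwards. A short sketch of just those two Playwright calls (make_page_with_cookie is only a hypothetical helper for illustration):

from playwright.async_api import Browser


async def make_page_with_cookie(browser: Browser):
    # Option 1: pass the headers when the context is created.
    context = await browser.new_context(
        extra_http_headers={'cookie': 'auth=to_rule_over_everyone'}
    )

    # Option 2: set them on an existing context (or on a single page).
    await context.set_extra_http_headers({'cookie': 'auth=to_rule_over_everyone'})

    return await context.new_page()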
Mantisus
Mantisus•8mo ago
Apparently these updates came with version 0.3.9; if you are using an earlier version, my previous examples should work (at least on version 0.3.5). You can see the allowed parameters for a single-page context in the Playwright documentation: https://playwright.dev/python/docs/api/class-browser#browser-new-page
Mantisus
Mantisus•8mo ago
A cleaner solution for v0.3.9:
import asyncio

from crawlee.browsers import BrowserPool
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


class CustomBrowserPool(BrowserPool):
    async def new_page(self, *args, **kwargs):
        page = await super().new_page(*args, **kwargs)
        await page.page.set_extra_http_headers({'cookie': "auth=to_rule_over_everyone"})
        return page


async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_pool=CustomBrowserPool(),
        max_requests_per_crawl=10
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        content = await context.page.content()
        print(content)

    await crawler.run(['https://httpbin.org/get'])


if __name__ == '__main__':
    asyncio.run(main())
genetic-orange
genetic-orangeOP•8mo ago
That works like a charm for now with the override. Just for future reference, from v0.4.0 onwards: let's say one sets the session cookie like this:
pool = BrowserPool(
    plugins=[
        PlaywrightBrowserPlugin(
            browser_type='chromium',
            browser_options={'headless': False},
            page_options={
                'extra_http_headers': {
                    'Custom-Header': 'Value'
                }
            }
        )
    ]
)
Then, on a certain request, it needs to re-authenticate. Is there a way, from within a request_handler, to retrieve the BrowserPool object and override the Custom-Header?
@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    # pseudo-code
    pool = BrowserPool()
    pool.plugins[0].page_options['extra_http_headers'] = {'Custom-Header': 'New-Value'}
Mantisus
Mantisus•8mo ago
I don't know what the developers' plans are for the next releases; I don't think we'll get access to context management from the request_handler with the approaches being used now. For rewriting headers today, I would use this approach:
import asyncio

from crawlee.browsers import BrowserPool
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee import Request

custom_headers = {}


class CustomBrowserPool(BrowserPool):
    async def new_page(self, *args, **kwargs):
        page = await super().new_page(*args, **kwargs)
        # Apply whatever is currently in custom_headers to every new page.
        await page.page.set_extra_http_headers(custom_headers)
        return page


async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_pool=CustomBrowserPool(),
        max_requests_per_crawl=10
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        content = await context.page.content()
        # Mutate the shared dict; pages created after this point get the cookie.
        custom_headers['cookie'] = "auth=to_rule_over_everyone"
        print(content)
        await context.add_requests([Request.from_url(
            "https://httpbin.org/get?page=2"
        )])

    await crawler.run(['https://httpbin.org/get'])


if __name__ == '__main__':
    asyncio.run(main())
To contact the development team, the best way is https://github.com/apify/crawlee-python/discussions. Those who reply here are mostly developers like you and me who are just using the library.
genetic-orange
genetic-orangeOP•8mo ago
Thanks for your help, it gives me many clues. Great help.
Mantisus
Mantisus•8mo ago
@crawleexl Pay attention to https://github.com/apify/crawlee-python/blob/master/src/crawlee/playwright_crawler/_playwright_pre_navigation_context.py, which will be in the next release, and https://crawlee.dev/python/docs/examples/playwright-crawler (obviously published by mistake, as this functionality is not yet available in v0.3.9). When this code is released, it should make it possible to do something like this:
@crawler.pre_navigation_hook
async def log_navigation_url(context: PlaywrightPreNavigationContext) -> None:
    await context.page.set_extra_http_headers(custom_headers)
    context.log.info(f'Navigating to {context.request.url} ...')
