Playwright increase timeout

While using Playwright with proxies, pages sometimes take longer to load. How can I increase the page load timeout?
Page.goto: Timeout 30000ms exceeded
other-emerald
other-emerald•6mo ago
Hi! Did you try to use this code?

try:
    await page.goto("https://example.com", timeout=60000)  # 60-second timeout
except Exception as e:
    print(f"Error loading the page: {e}")
rival-black
rival-black•6mo ago
If you're referring to the PlaywrightCrawler in crawlee, you can increase the default timeout by passing the appropriate parameter to the browser. Update: solution here: https://discord.com/channels/801163717915574323/1314296091650428948/1314315014118834207
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin


user_plugin = PlaywrightBrowserPlugin(browser_options={"timeout": 60000})

browser_pool = BrowserPool(plugins=[user_plugin])

crawler = PlaywrightCrawler(browser_pool=browser_pool)
You can pass any parameters that Playwright supports. https://playwright.dev/python/docs/api/class-browsertype#browser-type-launch
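For instance, a longer launch timeout can be combined with other launch options. A minimal sketch: slow_mo and headless below are standard Playwright launch parameters, and the values are only illustrative.

from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin

# Anything accepted by Playwright's browser_type.launch() can go in browser_options.
user_plugin = PlaywrightBrowserPlugin(
    browser_options={
        "timeout": 60000,   # ms to wait for the browser process to start
        "headless": True,   # run without a visible window
        "slow_mo": 250,     # slow each operation down by 250 ms (debugging aid)
    }
)
browser_pool = BrowserPool(plugins=[user_plugin])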
rival-black
rival-blackOP•6mo ago
I was referring to PlaywrightCrawler. Let me try this.
[crawlee.playwright_crawler._playwright_crawler] ERROR Request failed and reached maximum retries
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/crawlee/basic_crawler/_context_pipeline.py", line 65, in __call__
    result = await middleware_instance.__anext__()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawlee/playwright_crawler/_playwright_crawler.py", line 260, in _handle_blocked_request
    selector for selector in RETRY_CSS_SELECTORS if (await context.page.query_selector(selector))
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/playwright/async_api/_generated.py", line 8064, in query_selector
    await self._impl_obj.query_selector(selector=selector, strict=strict)
  File "/usr/local/lib/python3.12/site-packages/playwright/_impl/_page.py", line 414, in query_selector
    return await self._main_frame.query_selector(selector, strict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/playwright/_impl/_frame.py", line 304, in query_selector
    await self._channel.send("querySelector", locals_to_params(locals()))
  File "/usr/local/lib/python3.12/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/playwright/_impl/_connection.py", line 520, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.query_selector: Execution context was destroyed, most likely because of a navigation
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.playwright_crawler._playwright_crawler] INFO Error analysis: total_errors=3 unique_errors=1
I am getting this error
from apify import Actor, Request
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin

async def main() -> None:
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [url.get('url') for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])]
        proxy_settings = actor_input.get('proxy')
        proxy_configuration = ProxyConfiguration(proxy_urls=[
            'http://xxx:xxx@xxx:xxxx',
        ])

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        user_plugin = PlaywrightBrowserPlugin(browser_options={"timeout": 60000})
        browser_pool = BrowserPool(plugins=[user_plugin])

        # Create a crawler.
        crawler = PlaywrightCrawler(
            max_requests_per_crawl=50,
            proxy_configuration=proxy_configuration,
            browser_pool=browser_pool,
        )

        # Define a request handler, which will be called for every request.
        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            Actor.log.info("H")
            url = context.request.url
            Actor.log.info(f'Scraping {url}...')

        # Run the crawler with the starting requests.
        await crawler.run(start_urls)
This is the code I am trying
rival-black
rival-black•6mo ago
The same code without a proxy works correctly for me, even when slow_mo is set high to simulate a slow connection. Is it possible that the problem is with the proxy?
rival-black
rival-blackOP•6mo ago
Yes, the proxy is working; I checked locally.
rival-black
rival-black•6mo ago
Hmm, I don't have any ideas yet. The error looks like an attempt to work with a page that no longer exists in the browser's execution context. I would try raising request_handler_timeout, since its default value is 60 seconds; maybe the problem occurs when there is an interaction with an element and the handler is closed by the timeout.
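A minimal sketch of where that parameter goes, assuming the PlaywrightCrawler setup from the code above (the five-minute value is just an example):

from datetime import timedelta

from crawlee.playwright_crawler import PlaywrightCrawler

# request_handler_timeout bounds how long a single request handler may run.
crawler = PlaywrightCrawler(
    request_handler_timeout=timedelta(minutes=5),
)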
rival-black
rival-blackOP•6mo ago
Same error when using the Apify proxy. I tried on my local system with
user_plugin = PlaywrightBrowserPlugin(browser_options={"timeout": 600000, 'headless': False})
and
request_handler_timeout=timedelta(minutes=100)
but I am still getting this error:
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.TimeoutError: Page.goto: Timeout 30000ms exceeded.
Call log:
- navigating to "https://apify.com/", waiting until "load"
rival-black
rival-black•6mo ago
With Apify's auto proxy, it works on my side.
rival-black
rival-blackOP•6mo ago
the timeout is not changing
rival-black
rival-black•6mo ago
I apologize, my mistake. That timeout only affects the launch of the browser, not the page navigation 😢
rival-black
rival-blackOP•6mo ago
yes
@crawler.pre_navigation_hook
async def log_navigation_url(context: PlaywrightPreNavigationContext) -> None:
    context.log.info(f'Navigating to {context.request.url} ...')
    context.page.set_default_navigation_timeout(60000)
I think this should work
rival-black
rival-black•6mo ago
Yeah, I think that should help, too.
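(A side note, not verified in this thread: if non-navigation actions such as clicks or selector waits also time out, the same hook could raise their default as well. This sketch assumes the crawler object from the snippets above; set_default_timeout and set_default_navigation_timeout are both standard Playwright Page methods.)

@crawler.pre_navigation_hook
async def raise_timeouts(context: PlaywrightPreNavigationContext) -> None:
    # set_default_navigation_timeout covers page.goto(); set_default_timeout
    # covers other page actions such as clicks and waits. Values are in ms.
    context.page.set_default_navigation_timeout(60000)
    context.page.set_default_timeout(60000)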
rival-black
rival-blackOP•6mo ago
Now I am not receiving any timeout errors.
rival-black
rival-blackOP•6mo ago
BrowserContext | Playwright Python
BrowserContexts provide a way to operate multiple independent browser sessions.
rival-black
rival-black•6mo ago
Crawlee doesn't have access to the browser context right now; pre_navigation_hook is the only way available. So I think that's the best (and only) way. https://discord.com/channels/801163717915574323/1314296091650428948/1314315014118834207
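Putting the thread together, a minimal sketch of the working setup (the type of the pre-navigation context is left untyped here because its import path varies by crawlee version; timeout values are illustrative):

import asyncio
from datetime import timedelta

from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Browser launch timeout: affects browser start-up only, not navigation.
    user_plugin = PlaywrightBrowserPlugin(browser_options={'timeout': 60000})
    browser_pool = BrowserPool(plugins=[user_plugin])

    crawler = PlaywrightCrawler(
        browser_pool=browser_pool,
        request_handler_timeout=timedelta(minutes=5),  # per-request handler budget
    )

    @crawler.pre_navigation_hook
    async def raise_navigation_timeout(context) -> None:
        # Runs before every navigation; this is where the page.goto() timeout lives.
        context.page.set_default_navigation_timeout(60000)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Scraping {context.request.url}...')

    await crawler.run(['https://apify.com'])


if __name__ == '__main__':
    asyncio.run(main())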
rival-black
rival-blackOP•6mo ago
thank you
