Crawlee & Apify8mo ago
stormy-gold

How to retry when hit with 429

When using crawlee-js it works fine, but when using Python, 429 responses are not getting retried. Is there anything I am missing? I am using BeautifulSoupCrawler. Please help.
12 Replies
Hall
Hall8mo ago
Someone will reply to you shortly. In the meantime, we’ve found some posts that could help answer your question.
afraid-scarlet
afraid-scarlet7mo ago
Is it still an issue? Can you please provide a short code reproduction so we can check it?
stormy-gold
stormy-goldOP7mo ago
Hi, sorry for the delayed reply. Yes, when we get errors with status 429, 403, or anything else in the 400 range, it's not retrying.
Mantisus
Mantisus7mo ago
Hi, could you please show a code sample? I'm wondering how you configure the crawler (max_request_retries and max_session_rotations) and whether you handle the error cases in some additional way. It is possible that when you get a 429 response, a retry is executed, but it happens too quickly and all the retries get a 429 status as well. A 403 response signals that your access has been blocked; I don't think a retry should be performed in that case, it's more a matter of rotating the session. A 400 usually signals that the request itself is invalid, so I don't think such requests should be repeated at all. In general, it seems to me that if you are encountering 429, you should adjust ConcurrencySettings to reduce the aggressiveness of the scraping. Also, which HTTP client are you using?
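For context, a rough sketch of how those knobs fit together on the crawler. The values are purely illustrative, and max_tasks_per_minute assumes a reasonably recent Crawlee version:

from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

crawler = BeautifulSoupCrawler(
    # How many times a failed request is retried before it is marked as failed.
    max_request_retries=5,
    # How many times the session is rotated when the site blocks the request.
    max_session_rotations=10,
    # Lower concurrency and throughput to make 429 responses less likely.
    concurrency_settings=ConcurrencySettings(
        max_concurrency=2,
        max_tasks_per_minute=30,
    ),
)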
afraid-scarlet
afraid-scarlet7mo ago
@Shine Yes, please share some code to reproduce the issue, including the configuration of your scraper. Also, provide logs or proof showing that your requests are not being retried in case of a 429 response. Without this information, it’s difficult to assist, as your case seems quite unusual. By default, such requests should be retried automatically.
stormy-gold
stormy-goldOP7mo ago
Hi, below is the code:
from apify import Actor, Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from .routes import router
from crawlee import ConcurrencySettings

async def main() -> None:
    concurrency_settings = ConcurrencySettings(
        max_concurrency=3,
    )
    async with Actor:
        # Create a crawler.
        crawler = BeautifulSoupCrawler(
            request_handler=router,
            max_requests_per_crawl=100,
            max_request_retries=10,
            concurrency_settings=concurrency_settings,
        )

        # Run the crawler with the starting requests.
        await crawler.run(['https://example.com'])
stormy-gold
stormy-goldOP7mo ago
When there is a 403 error, the page is accessible again if we try once more, so I want retries for this status code as well.
Mantisus
Mantisus7mo ago
Yes, it looks like you can't trigger retries for status codes in the 400-499 range: https://github.com/apify/crawlee-python/blob/master/src/crawlee/basic_crawler/_basic_crawler.py#L653. I don't think it's supposed to work that way.
stormy-gold
stormy-goldOP7mo ago
For now, what I did is use
ignore_http_error_status_codes=[403]
so the error page is still passed to the request handler; the handler then fails on the missing elements, and the retry happens from there.
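Roughly, that workaround looks like this. This is only a sketch, assuming ignore_http_error_status_codes is passed to the HTTP client (HttpxHttpClient here) in your Crawlee version; the selector is just a placeholder:

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients import HttpxHttpClient

crawler = BeautifulSoupCrawler(
    # Don't treat 403 as an HTTP error, so the request handler still runs.
    http_client=HttpxHttpClient(ignore_http_error_status_codes=[403]),
    max_request_retries=10,
)

@crawler.router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    # On a blocked/error page the expected element is missing;
    # raising here makes Crawlee retry the request.
    title = context.soup.select_one('h1.title')  # placeholder selector
    if title is None:
        raise RuntimeError('Blocked or error page received')
    await context.push_data({'title': title.get_text(strip=True)})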
Mantisus
Mantisus6mo ago
I created an issue for this problem: https://github.com/apify/crawlee-python/issues/756. I'll post here when it's resolved. Once v0.5.0 is released, you will be able to trigger retries for 403 or 429 with additional_http_error_status_codes. See PR https://github.com/apify/crawlee-python/pull/812.
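Once that's out, usage should look roughly like this. A sketch based on the option described in the PR, assuming it is exposed on the crawler constructor; the import path may also differ in v0.5.0+:

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

crawler = BeautifulSoupCrawler(
    # Treat 403 and 429 as errors so these requests go through the retry logic.
    additional_http_error_status_codes=[403, 429],
    max_request_retries=10,
)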
stormy-gold
stormy-goldOP6mo ago
thank you for the update
