Crawlee & Apify8mo ago
stormy-gold

How to retry when hit with 429

When using crawlee-js it works fine, but when using Python, 429 responses are not getting retried. Is there anything I am missing? I am using BeautifulSoupCrawler. Please help.
12 Replies
Hall
Hall8mo ago
Someone will reply to you shortly. In the meantime, we’ve found some posts that could help answer your question.
afraid-scarlet
afraid-scarlet7mo ago
Is it still an issue? Can you please provide a short code reproduction so we can check it?
stormy-gold
stormy-goldOP7mo ago
Hi, sorry for the delayed reply. Yes, when we get errors with status 429, 403, or anything else in the 400 range, it's not retrying.
Mantisus
Mantisus7mo ago
Hi, could you please show a code sample? I'm wondering how you configure the crawler (max_request_retries and max_session_rotations) and whether you handle the error cases in some additional way. It is possible that when you get a 429 response, a retry is executed, but it happens too quickly and all the retries get a 429 status as well. A 403 response signals that your access has been blocked; I don't think a retry should be performed in that case, it's more a matter of rotating the session. A 400 usually signals that the request itself is invalid, so I don't think such requests should be repeated at all. In general, it seems to me that if you are encountering 429, you should adjust ConcurrencySettings to reduce the aggressiveness of the scraping. Also, which HTTP client are you using?
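For context, a rough sketch of how those knobs fit together on the crawler. The values are purely illustrative, and max_tasks_per_minute assumes a reasonably recent Crawlee version:

from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

crawler = BeautifulSoupCrawler(
    # How many times a failed request is retried before it is marked as failed.
    max_request_retries=5,
    # How many times the session is rotated when the site blocks the request.
    max_session_rotations=10,
    # Lower concurrency and throughput to make 429 responses less likely.
    concurrency_settings=ConcurrencySettings(
        max_concurrency=2,
        max_tasks_per_minute=30,
    ),
)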
afraid-scarlet
afraid-scarlet7mo ago
@Shine Yes, please share some code to reproduce the issue, including the configuration of your scraper. Also, provide logs or proof showing that your requests are not being retried in case of a 429 response. Without this information, it’s difficult to assist, as your case seems quite unusual. By default, such requests should be retried automatically.
stormy-gold
stormy-goldOP7mo ago
Hi, below is the code:
from apify import Actor, Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from .routes import router
from crawlee import ConcurrencySettings

async def main() -> None:
    concurrency_settings = ConcurrencySettings(
        max_concurrency=3,
    )
    async with Actor:
        # Create a crawler.
        crawler = BeautifulSoupCrawler(
            request_handler=router,
            max_requests_per_crawl=100,
            max_request_retries=10,
            concurrency_settings=concurrency_settings,
        )

        # Run the crawler with the starting requests.
        await crawler.run(['https://example.com'])
stormy-gold
stormy-goldOP7mo ago
When there is a 403 error, the page is accessible again if we try once more, so I want retries for this status code as well.
Mantisus
Mantisus7mo ago
Yes, it looks like you can't trigger retries for status codes in the 400-499 range: https://github.com/apify/crawlee-python/blob/master/src/crawlee/basic_crawler/_basic_crawler.py#L653. I don't think it's supposed to work that way.
stormy-gold
stormy-goldOP7mo ago
For now, what I did is use
ignore_http_error_status_codes=[403]
so the error page is still passed to the request handler; the handler then fails on the missing elements, and the retry happens from there.
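Roughly, that workaround looks like this. This is only a sketch, assuming ignore_http_error_status_codes is passed to the HTTP client (HttpxHttpClient here) in your Crawlee version; the selector is just a placeholder:

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients import HttpxHttpClient

crawler = BeautifulSoupCrawler(
    # Don't treat 403 as an HTTP error, so the request handler still runs.
    http_client=HttpxHttpClient(ignore_http_error_status_codes=[403]),
    max_request_retries=10,
)

@crawler.router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    # On a blocked/error page the expected element is missing;
    # raising here makes Crawlee retry the request.
    title = context.soup.select_one('h1.title')  # placeholder selector
    if title is None:
        raise RuntimeError('Blocked or error page received')
    await context.push_data({'title': title.get_text(strip=True)})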
Mantisus
Mantisus6mo ago
I created an issue for this problem: https://github.com/apify/crawlee-python/issues/756. I'll post here when it's resolved. Once v0.5.0 is released, you will be able to trigger retries for 403 or 429 with additional_http_error_status_codes. See PR https://github.com/apify/crawlee-python/pull/812.
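Once that's out, usage should look roughly like this. A sketch based on the option described in the PR, assuming it is exposed on the crawler constructor; the import path may also differ in v0.5.0+:

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

crawler = BeautifulSoupCrawler(
    # Treat 403 and 429 as errors so these requests go through the retry logic.
    additional_http_error_status_codes=[403, 429],
    max_request_retries=10,
)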
stormy-gold
stormy-goldOP6mo ago
thank you for the update
