Website language filtering (/en)
Hi, I am new to Crawlee. I was wondering if there is a way to crawl only the English version of a website when one exists and, when it doesn't, to scrape the regular version in its home language. The issue with only setting URLs like https://example/en/.... is that some websites don't have such paths, which means they will return an error. In those cases I'd still want to scrape the site even if it is in another language; it's just that wherever possible I'd prefer the English version to be scraped, and nothing else. Ideally I don't want to post-process the results, because I would already have paid for a lot of unnecessary crawling.
2 Replies
flat-fuchsia•2y ago
Do you have an example of such a website?
fair-rose•2y ago
Hi! What you can do is first check, in the request handler (`requestHandler` in Crawlee), whether the English version exists and does not return a 404, and scrape it if so. Otherwise, you can enqueue the page in its original language and scrape that one. Please refer to this guide for more information: https://crawlee.dev/docs/introduction/first-crawler
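One possible way to sketch the "try English first, fall back otherwise" idea is to probe the `/en` URL once per site before crawling, so that only one version is ever crawled. The helper names below (`englishVariant`, `pickStartUrl`, `Probe`, `fetchProbe`) are hypothetical and not part of Crawlee, and the sketch assumes the site serves English under an `/en` path prefix and that a global `fetch` is available (Node 18+).

```typescript
// Hypothetical helpers for illustration; none of these names come from Crawlee.

// A probe resolves to the HTTP status code returned for a URL.
type Probe = (url: string) => Promise<number>;

// Build the "/en" variant of a URL, e.g. https://example.com/about -> https://example.com/en/about.
// Assumption: the site exposes its English version under an "/en" path prefix.
function englishVariant(rawUrl: string): string {
    const url = new URL(rawUrl);
    // Leave URLs that already point at the English section untouched.
    if (url.pathname === '/en' || url.pathname.startsWith('/en/')) return url.href;
    url.pathname = '/en' + url.pathname;
    return url.href;
}

// Default probe: a single HEAD request using the global fetch.
// Some servers reject HEAD; swapping in GET is a reasonable alternative.
const fetchProbe: Probe = async (url) => {
    const response = await fetch(url, { method: 'HEAD', redirect: 'follow' });
    return response.status;
};

// Return the English URL if it responds successfully, otherwise the original URL.
async function pickStartUrl(rawUrl: string, probe: Probe = fetchProbe): Promise<string> {
    const candidate = englishVariant(rawUrl);
    try {
        const status = await probe(candidate);
        if (status >= 200 && status < 400) return candidate;
    } catch {
        // Network or DNS error: fall through and use the original URL.
    }
    return rawUrl;
}
```

The URL returned by `pickStartUrl` could then be used as the start URL, for example passed to `crawler.run([...])`. To keep the crawl inside the English section, `enqueueLinks` accepts a `globs` option, so a pattern built from the chosen start URL may help. Note that some sites answer `/en` with a 200 that is really the home page in the original language, so this check is a heuristic rather than a guarantee.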