Crawlee & Apify, 12mo ago
generous-apricot

Crawler terminates when URL is invalid.

I am trying to crawl a website that contains an invalid link. I add the links dynamically through context.enqueue_links. The detected link looks like this:
<a href="http://DoorLock-WA2 – DATENBLATT" target="_blank" rel="noreferrer noopener">DATENBLATT KXC-WA2-IP1, KXC-WA2-IP2</a>
I get the following error:
httpx.InvalidURL: Invalid IDNA hostname: 'DoorLock-WA2 – DATENBLATT'
The desired behavior is that the invalid link is skipped and the crawl continues. How can I achieve this?
2 Replies
Mantisus, 12mo ago
Hey @Thorin Thunderbeard. As I understand it, you are using the Apify team's new library, crawlee-python. According to the source code of the function (https://github.com/apify/crawlee-python/blob/master/src/crawlee/beautifulsoup_crawler/beautifulsoup_crawler.py#L109C19-L109C32), the only way to do this is to write a selector that excludes this type of URL.
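A rough sketch of what that could look like, not tested against the library itself. It assumes enqueue_links accepts a CSS selector through a `selector` argument and that the selector engine supports `:not()` with attribute substring matching. `LINK_SELECTOR` and `is_crawlable_url` are hypothetical names, not part of crawlee. The helper is a deliberately conservative plain-Python check for the same thing: it accepts only http(s) URLs with a plain ASCII hostname, so it would also reject legitimate internationalized or IPv6 hosts.

```python
import re
from urllib.parse import urlsplit

# Hypothetical selector: anchors with an href that contains no space.
# Assumed usage (not verified here):
#     await context.enqueue_links(selector=LINK_SELECTOR)
LINK_SELECTOR = 'a[href]:not([href*=" "])'

# One DNS label: ASCII letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_crawlable_url(url: str) -> bool:
    """Return True for http(s) URLs whose hostname is plain ASCII labels."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises on some malformed inputs, e.g. unbalanced brackets.
        return False
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host:
        return False
    labels = host.rstrip(".").split(".")
    return all(_LABEL.fullmatch(label) is not None for label in labels)


if __name__ == "__main__":
    for candidate in (
        "https://crawlee.dev/python/",
        "http://DoorLock-WA2 – DATENBLATT",
    ):
        print(candidate, is_crawlable_url(candidate))
```

The selector only covers hrefs containing a space, which matches the link in the question. Other malformed hostnames would need a broader rule, which is where a check like the helper could be applied to the hrefs before they are queued.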
generous-apricot (OP), 12mo ago
Hi Mantisus, good suggestion. I will try it later.
