generous-apricot

Crawler terminates when URL is invalid.

I am trying to crawl a website that contains an invalid link. I add the links dynamically through context.enqueue_links. The detected link looks like this:

 <a href="http://DoorLock-WA2 – DATENBLATT" target="_blank" rel="noreferrer noopener">DATENBLATT KXC-WA2-IP1, KXC-WA2-IP2</a>

I get the following error:

httpx.InvalidURL: Invalid IDNA hostname: 'DoorLock-WA2 – DATENBLATT'

The desired behavior is that the link is skipped. How can I achieve this?

2 Replies

Mantisus•12mo ago

Hey @Thorin Thunderbeard, as I understand it, you are using the Apify team's new library, crawlee-python. According to the source code of the function (https://github.com/apify/crawlee-python/blob/master/src/crawlee/beautifulsoup_crawler/beautifulsoup_crawler.py#L109C19-L109C32), the only way to do this is to write a selector that excludes this type of URL.
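To illustrate what "this type of URL" means in practice, here is a minimal sketch of a hostname check using only the Python standard library. It is not part of crawlee-python, and the function name has_plausible_host is hypothetical. It assumes that only http(s) URLs with plain ASCII hostnames or IP literals should be kept, so it also rejects legitimate internationalized domain names and hostnames containing underscores.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# One DNS label: 1-63 ASCII letters, digits or hyphens,
# not starting or ending with a hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def has_plausible_host(url: str) -> bool:
    """Hypothetical helper: True if the URL is http(s) with an ASCII hostname or IP literal."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        # urlsplit rejects some malformed netlocs outright.
        return False
    if parts.scheme not in ("http", "https") or not host:
        return False
    try:
        # Accept IPv4 and IPv6 literals as they are.
        ipaddress.ip_address(host)
        return True
    except ValueError:
        pass
    # Allow a single trailing dot (fully qualified form).
    if host.endswith("."):
        host = host[:-1]
    # Every dot-separated label must match the conservative ASCII pattern.
    return all(_LABEL.fullmatch(label) for label in host.split("."))


links = [
    "https://example.com/page",
    "http://DoorLock-WA2 – DATENBLATT",  # the link from the question
]
valid_links = [link for link in links if has_plausible_host(link)]
```

The link from the question fails the check because its hostname contains spaces and a non-ASCII dash. On the selector side, something like a[href]:not([href*=" "]) might exclude this particular link, assuming the crawler's selector engine supports :not() with attribute substring matching; I have not verified this against crawlee-python.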

generous-apricotOP•12mo ago

Hi Mantisus, good suggestion. I will try it later.
