Crawlee stops scanning for links with different anchors (#xyz) but the same base URL
I am trying to crawl a domain where every subpage is served from the same index.html base URL and only the hash fragment (anchor) differs. For example:
'https://myDomain.com/index.html#/welcome',
'https://myDomain.com/index.html#/documents',
'https://myDomain.com/index.html#/test'
I am using Crawlee with Playwright. The first URL is crawled correctly, but Crawlee just stops afterwards and does not scan the other URLs, even though they were actively added to the queue.
I assume this is because Crawlee considers them all to be the same URL and ignores the rest.
How can I configure Crawlee to also scan these URLs?
Thanks for your help and let me know if you have questions!
fascinating-indigoOP•10mo ago
The Crawlee AI Bot already helped me 🙂
The answer is to set the uniqueKey explicitly when adding the requests, so the hash fragment is not stripped during deduplication:
import { RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// Add initial URLs to the queue with explicit uniqueKeys so the hash fragments
// are kept and the requests are not deduplicated into a single URL.
await requestQueue.addRequests([
    { url: 'https://mydomain.com/index.html#/welcome', uniqueKey: 'https://mydomain.com/index.html#/welcome' },
    { url: 'https://mydomain.com/index.html#/documents', uniqueKey: 'https://mydomain.com/index.html#/documents' },
    { url: 'https://mydomain.com/index.html#/test', uniqueKey: 'https://mydomain.com/index.html#/test' },
]);
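The same idea can be applied to links discovered while crawling, not just the start URLs. Below is a minimal sketch, assuming a PlaywrightCrawler and using enqueueLinks with its transformRequestFunction option to set the uniqueKey to the full URL including the fragment; the domain is a placeholder:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);

        // Enqueue discovered links, keeping the hash fragment as part of the
        // deduplication key so #/welcome, #/documents and #/test are treated
        // as separate requests instead of being collapsed into one.
        await enqueueLinks({
            transformRequestFunction: (req) => {
                req.uniqueKey = req.url; // full URL including the fragment
                return req;
            },
        });
    },
});

await crawler.run([
    { url: 'https://mydomain.com/index.html#/welcome', uniqueKey: 'https://mydomain.com/index.html#/welcome' },
]);

If I remember correctly, Crawlee's request options also expose a keepUrlFragment flag, which should achieve the same result without computing the uniqueKey by hand, but setting the uniqueKey explicitly as above worked for me.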