crawlee misses links #depth #missing-urls
Happy Fourth everyone! Hoping someone can suggest how to address the following. I copied the simple example from the docs in an attempt to scrape all links to pages below https://weaviate.io/developers/weaviate. It runs and reports 32 links found, but it misses many links, particularly those 3 or more levels down. For instance, it misses all the pages below https://weaviate.io/developers/weaviate/api/graphql/ like https://weaviate.io/developers/weaviate/api/graphql/get. My code is essentially that docs example.
Crawlee's output shows no errors, and its logs report 32 URLs found, when in fact there are many more URLs under the starting URL. Something seems to be preventing Crawlee from descending further into the site. I can only get it to grab those deeper URLs if I explicitly add their direct parent to startUrls, which suggests there is nothing unique about those pages other than their depth. Of course it's impractical to manually add all those parents, and my logs show their direct parents are visited by Crawlee, but for some reason it never grabs the children. Any suggestions?
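For context, a minimal sketch of the kind of setup being described, assuming a router whose default handler enqueues links with a "devDocs" label (the label comes from the reply below; everything else, including the glob pattern, is an assumption rather than the original snippet):

```ts
// Hypothetical reconstruction of the setup described above, not the original code.
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// The start URL lands here: its links are discovered and labelled 'devDocs'.
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info('Enqueueing links from the start page');
    await enqueueLinks({
        globs: ['https://weaviate.io/developers/weaviate/**'],
        label: 'devDocs',
    });
});

// Every enqueued page lands here, but this handler never calls enqueueLinks,
// so links found on these pages are never followed and the crawl stops one level deep.
router.addHandler('devDocs', async ({ request, pushData, $ }) => {
    await pushData({ url: request.url, title: $('title').text() });
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://weaviate.io/developers/weaviate']);
```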
2 Replies
To dive deeper, put the same enqueueLinks call in the "devDocs" handler.
Right now you only enqueue the URLs found on the first page, so the crawl never goes any further.
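Assuming a router setup like the sketch above, the fix amounts to replacing the "devDocs" handler with one that also enqueues links, so every visited page keeps feeding new URLs back into the queue (the glob pattern is again an assumption):

```ts
// Sketch of the fix: the 'devDocs' handler also enqueues links,
// so the crawler keeps descending instead of stopping after one level.
router.addHandler('devDocs', async ({ request, enqueueLinks, pushData, $ }) => {
    await pushData({ url: request.url, title: $('title').text() });
    await enqueueLinks({
        globs: ['https://weaviate.io/developers/weaviate/**'],
        label: 'devDocs', // deeper pages are routed back to this same handler
    });
});
```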
absent-sapphireOP•2y ago
Thank you! That worked.