crawlee misses links #depth #missing-urls

Happy Fourth everyone! Hoping someone can suggest how to address the following. I copied the simple example from the docs in an attempt to scrape all links to pages below https://weaviate.io/developers/weaviate. It runs and reports 32 links found, but it misses many links, particularly those three or more levels down. For instance, it misses all the pages below https://weaviate.io/developers/weaviate/api/graphql/, such as https://weaviate.io/developers/weaviate/api/graphql/get. My code is:

import path from 'path';
import { fileURLToPath } from 'url';
import { PlaywrightCrawler, Dataset, createPlaywrightRouter } from 'crawlee';

const __dirname = path.dirname(fileURLToPath(import.meta.url));
const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        label: "devDocs", // has to match first arg of addHandler()
    });
});

router.addHandler("devDocs", async ({ request, page, log }) => {
    const title = await page.title();
    const url = request.loadedUrl;
    log.info(`${title}`, { url });
    await Dataset.pushData(await scrapePage(page)); // scrapePage is my own helper
});

const startUrls = ['https://weaviate.io/developers/weaviate'];
const storageDir = path.join(__dirname, '../storage/datasets/default');

const crawler = new PlaywrightCrawler({
    requestHandler: router,
});
await crawler.run(startUrls);

Crawlee's output shows no errors, and its logs say it found 32 URLs when actually there are many more under the starting URL. Something seems to be preventing Crawlee from descending further into the site. I can only get it to grab those deeper URLs if I explicitly add their direct parent to startUrls, which suggests there is nothing unique about those pages other than their depth. Of course it's impractical to manually add all those parents, and my logs indicate that Crawlee does visit the direct parents, yet for some reason it doesn't grab their children. Any suggestions?
2 Replies
NeoNomade · 2y ago
To dive deeper, call the same enqueueLinks function inside the “devDocs” handler as well. Right now you only enqueue the URLs found on the first page.
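The reason this works: enqueueLinks only queues links from the page currently being handled, so if only the default handler calls it, discovery stops at depth 1. Here is a toy sketch in plain Node (not Crawlee itself, and with a made-up page graph) contrasting enqueueing from the start page only versus re-enqueueing from every visited page:

```javascript
// Hypothetical site graph: each path maps to the links found on that page.
const site = {
    '/developers/weaviate': ['/api', '/config'],
    '/api': ['/api/graphql'],
    '/api/graphql': ['/api/graphql/get'],
    '/config': [],
    '/api/graphql/get': [],
};

// Strategy 1: only the start page's handler enqueues links
// (like calling enqueueLinks in the default handler only).
function crawlShallow(start) {
    const seen = new Set([start]);
    for (const link of site[start]) seen.add(link); // children of start only
    return seen;
}

// Strategy 2: every page's handler enqueues the links it finds
// (like also calling enqueueLinks inside the "devDocs" handler).
function crawlDeep(start) {
    const seen = new Set();
    const queue = [start];
    while (queue.length > 0) {
        const url = queue.shift();
        if (seen.has(url)) continue;
        seen.add(url);
        queue.push(...site[url]); // re-enqueue from every visited page
    }
    return seen;
}

console.log(crawlShallow('/developers/weaviate').size); // 3
console.log(crawlDeep('/developers/weaviate').size);    // 5
```

In the shallow version, /api/graphql/get is never reached even though its parent /api/graphql was visited, which matches the symptom described above.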
absent-sapphire (OP) · 2y ago
Thank you! That worked.