crawlee misses links #depth #missing-urls

Happy Fourth everyone! Hoping someone can suggest how to address the following. I copied the simple example from the docs in an attempt to scrape all links to pages below https://weaviate.io/developers/weaviate. It runs and reports 32 links found, but it misses many links, particularly those three or more levels down. For instance, it misses all the pages below https://weaviate.io/developers/weaviate/api/graphql/, such as https://weaviate.io/developers/weaviate/api/graphql/get. My code is:

import path from 'path';
import { fileURLToPath } from 'url';
import { PlaywrightCrawler, Dataset, createPlaywrightRouter } from 'crawlee';

const __dirname = path.dirname(fileURLToPath(import.meta.url));
const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        label: "devDocs", // has to match first arg of addHandler()
    });
});

router.addHandler("devDocs", async ({ request, page, log }) => {
    const title = await page.title();
    const url = request.loadedUrl;
    log.info(`${title}`, { url });
    await Dataset.pushData(await scrapePage(page)); // scrapePage is my own helper
});

const startUrls = ['https://weaviate.io/developers/weaviate'];
const storageDir = path.join(__dirname, '../storage/datasets/default');

const crawler = new PlaywrightCrawler({
    requestHandler: router,
});
await crawler.run(startUrls);

Crawlee's output shows no errors, and its logs say it found 32 URLs when actually there are many more under the starting URL. Something seems to be preventing Crawlee from descending further into the site. I can only get it to grab those deeper URLs if I explicitly add their direct parent to startUrls, which suggests there is nothing unique about those pages other than their depth. Of course it's impractical to manually add all those parents, and my logs indicate that Crawlee does visit the direct parents, yet for some reason it doesn't grab their children. Any suggestions?
2 Replies
NeoNomade · 2y ago
To dive deeper, call the same enqueueLinks function inside the “devDocs” handler as well. Right now you only enqueue the URLs found on the first page.
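The reason this works: enqueueLinks only queues links from the page currently being handled, so if only the default handler calls it, discovery stops at depth 1. Here is a toy sketch in plain Node (not Crawlee itself, and with a made-up page graph) contrasting enqueueing from the start page only versus re-enqueueing from every visited page:

```javascript
// Hypothetical site graph: each path maps to the links found on that page.
const site = {
    '/developers/weaviate': ['/api', '/config'],
    '/api': ['/api/graphql'],
    '/api/graphql': ['/api/graphql/get'],
    '/config': [],
    '/api/graphql/get': [],
};

// Strategy 1: only the start page's handler enqueues links
// (like calling enqueueLinks in the default handler only).
function crawlShallow(start) {
    const seen = new Set([start]);
    for (const link of site[start]) seen.add(link); // children of start only
    return seen;
}

// Strategy 2: every page's handler enqueues the links it finds
// (like also calling enqueueLinks inside the "devDocs" handler).
function crawlDeep(start) {
    const seen = new Set();
    const queue = [start];
    while (queue.length > 0) {
        const url = queue.shift();
        if (seen.has(url)) continue;
        seen.add(url);
        queue.push(...site[url]); // re-enqueue from every visited page
    }
    return seen;
}

console.log(crawlShallow('/developers/weaviate').size); // 3
console.log(crawlDeep('/developers/weaviate').size);    // 5
```

In the shallow version, /api/graphql/get is never reached even though its parent /api/graphql was visited, which matches the symptom described above.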
absent-sapphire (OP) · 2y ago
Thank you! That worked.