Crawlee & Apify•3y ago

New to Crawlee and after reading the docs, I'm not sure how to use it to crawl links in a website

So I'm quite new to Crawlee and I'm not sure how it really works 😦 I've read the docs and checked some examples but couldn't find anything really useful. I have a case where I need to log in to a website and then go to a page with a list of links that I'd like to crawl. Within each of those pages there are more links to crawl, and finally, within each of those pages I'd like to perform some actions. One of them is to get the URL of a video and download the video to Google Drive. I've read about enqueueLinks and RequestQueue but I really don't know how they work. I've checked the example on the home page but that's not really what I want. I'd like to log in, then go to the page https://www.my-site.com/categories and from there grab all links that match the glob. I have this:

await enqueueLinks({
  globs: ['https://www.my-site.com/categories']
})

So it would get the links https://www.my-site.com/categories/1, https://www.my-site.com/categories/2, etc. Then for each category it would get all links on the page: https://www.my-site.com/categories/1/posts/1, https://www.my-site.com/categories/1/posts/2, etc. And then in each of these pages I'd like to do something. I've tried to add links to the queue with the glob above but it only gets the root URL https://www.my-site.com. Any help would be greatly appreciated. Thank you

5 Replies

Pepa J•3y ago

Hello @hel.io, I am sorry to hear that. The login part is really specific to each website and there is not really any universal solution for it. RequestQueue is basically the source of URLs for the crawler. To fill the RequestQueue with URLs you may use the enqueueLinks method, as you mentioned. I briefly checked the docs and see that if you want to scrape URLs starting with https://www.my-site.com/categories you may want to set the glob to https://www.my-site.com/categories/*/posts/* with the trailing asterisk. Downloading the files and uploading them to Google Drive may be the hardest part.
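
To make the glob advice concrete, here is a minimal sketch of how such patterns select URLs. It assumes the common glob convention in which a single * stays within one path segment and ** spans segments. The globToRegExp and matchesGlob helpers are hypothetical and are not Crawlee's own matcher, which may differ in details; they only illustrate why the bare https://www.my-site.com/categories pattern does not pick up the category or post pages.

```typescript
// Simplified illustration only: NOT Crawlee's matcher.
// Assumption: `*` matches inside one path segment, `**` matches across segments.
function globToRegExp(glob: string): RegExp {
  const source = glob
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters (except *)
    .replace(/\*\*/g, "\u0000")            // protect ** with a placeholder
    .replace(/\*/g, "[^/]*")               // * stays within one path segment
    .replace(/\u0000/g, ".*");             // ** may cross segment boundaries
  return new RegExp(`^${source}$`);
}

function matchesGlob(glob: string, url: string): boolean {
  return globToRegExp(glob).test(url);
}

const base = "https://www.my-site.com/categories";

// The glob from the question has no wildcard, so it matches only itself.
const exactOnly = matchesGlob(base, `${base}/1`); // false

// One wildcard segment selects category pages but not the posts below them.
const category = matchesGlob(`${base}/*`, `${base}/1`);                // true
const categoryVsPost = matchesGlob(`${base}/*`, `${base}/1/posts/1`);  // false

// Two wildcard segments select the post pages.
const post = matchesGlob(`${base}/*/posts/*`, `${base}/1/posts/2`);    // true
```

Under these assumptions, a crawl of this site would need one pattern for the category pages and another for the post pages, each used on the page where those links actually appear.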

deep-jade•3y ago

Regarding the login logic, this article can be used for inspiration: https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/logging-into-a-website But yeah, each case is different and requires its own implementation.

wise-whiteOP•3y ago

Hi @Pepa J and @Oleg V. Thank you very much for your replies, very helpful. I have some experience writing E2E tests with Playwright; what I'm lacking is an understanding of how Crawlee manages all these things. For example, once I've logged in to the website, do I need to go to the specific page where I have the categories so enqueueLinks can see all links within this page, or will enqueueLinks automatically crawl all URLs that match the glob? The other thing is, once I have all the links that match this glob https://www.my-site.com/categories/*/posts/* how do I tell Crawlee to run a specific handler for each of these links? Let's say I have a function that grabs the video URL and downloads it to a Google Drive folder. How can I tell Crawlee to run this specific function for each link?

deep-jade•3y ago

I would recommend checking our Academy first (https://docs.apify.com/academy). It will give you a general picture of how Crawlee works.
1. Do I need to go to the specific page? Yes, enqueueLinks only searches for links on the page currently being handled.
2. How do I tell Crawlee to run a specific handler for each of these links? There is a router for that, plus labels: https://crawlee.dev/api/next/playwright-crawler/function/createPlaywrightRouter It works based on request labels (https://crawlee.dev/api/next/core/interface/RequestOptions#label). For each label you add a handler:

router.addHandler('your-label', async (ctx) => {
// your logic for specific request (label)
})
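
The control flow behind labels can be sketched with a toy model. This is not Crawlee code: the in-memory fakeSite, the CrawlRequest type, the label names and the runToyCrawl function are all hypothetical stand-ins, written only to show how a queue of labelled requests ends up calling one handler per label. In Crawlee the same roles are played by router.addHandler for registering a handler and by enqueueLinks, which accepts a label option alongside globs.

```typescript
// Toy model (NOT Crawlee): a queue of labelled requests and one handler per label.
type CrawlRequest = { url: string; label: string };
type Enqueue = (urls: string[], label: string) => void;
type Handler = (req: CrawlRequest, enqueue: Enqueue) => void;

// Hypothetical site: each URL maps to the links found on that page.
const root = "https://www.my-site.com";
const fakeSite: Record<string, string[]> = {
  [`${root}/categories`]: [`${root}/categories/1`, `${root}/categories/2`, `${root}/about`],
  [`${root}/categories/1`]: [`${root}/categories/1/posts/1`, `${root}/categories/1/posts/2`],
  [`${root}/categories/2`]: [`${root}/categories/2/posts/1`],
};

const handlers = new Map<string, Handler>();
const processedPosts: string[] = [];

// Start page: only links on THIS page are considered; keep the category links.
handlers.set("START", (req, enqueue) => {
  const links = (fakeSite[req.url] ?? []).filter((u) => /\/categories\/[^/]+$/.test(u));
  enqueue(links, "CATEGORY");
});

// Category page: keep the post links and label them for the post handler.
handlers.set("CATEGORY", (req, enqueue) => {
  const links = (fakeSite[req.url] ?? []).filter((u) => /\/categories\/[^/]+\/posts\/[^/]+$/.test(u));
  enqueue(links, "POST");
});

// Post page: this is where per-post work (e.g. finding a video URL) would go.
handlers.set("POST", (req) => {
  processedPosts.push(req.url);
});

function runToyCrawl(startUrl: string): void {
  const queue: CrawlRequest[] = [{ url: startUrl, label: "START" }];
  const seen = new Set<string>([startUrl]);
  while (queue.length > 0) {
    const req = queue.shift()!;
    const handler = handlers.get(req.label);
    if (!handler) continue; // no handler registered for this label
    handler(req, (urls, label) => {
      for (const url of urls) {
        if (seen.has(url)) continue; // each URL is queued at most once
        seen.add(url);
        queue.push({ url, label });
      }
    });
  }
}

runToyCrawl(`${root}/categories`);
```

The point of the model is that nothing is crawled "automatically": a page's links only enter the queue when the handler for that page enqueues them, and the label attached at that moment decides which handler runs later.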

wise-whiteOP•3y ago

Thanks a lot, I am definitely going to take a look at the academy 🙂
