New to Crawlee and after reading the docs, I'm not sure how to use it to crawl links in a website
So I'm quite new to Crawlee and I'm not sure how it really works 😦
I've read the docs and checked some examples but couldn't find anything really useful. I have a case where I need to log in to a website and then go to a page with a list of links that I'd like to crawl. Within each of those pages there are more links to crawl, and finally, within each of those pages I'd like to perform some actions. One of them is to get the URL of a video and download the video to Google Drive.
I've read about enqueueLinks and RequestQueue, but I really don't know how they work. I've checked the example on the home page, but that's not really what I want. I'd like to log in, then go to a page https://www.my-site.com/categories
and then from there grab all links that match the glob
I have this
So it would get the links https://www.my-site.com/categories/1, https://www.my-site.com/categories/2, etc. Then for each category it would get all links in the page: https://www.my-site.com/categories/1/posts/1, https://www.my-site.com/categories/1/posts/2, etc. And then in each of these pages I'd like to do something.
I've tried to add links to the queue with the glob above, but it only gets the root URL https://www.my-site.com.
Any help would be greatly appreciated.
Thank you
5 Replies
Hello @hel.io ,
I am sorry to hear that.
The login part is really specific to each website, and there is no universal solution for it.
RequestQueue is basically the source of URLs for the crawler. To fill the RequestQueue with URLs you may use the enqueueLinks method, as you mentioned. I briefly checked the docs and see that if you want to scrape URLs starting with https://www.my-site.com/categories you may want to set the glob to https://www.my-site.com/categories/*/posts/* with a trailing asterisk.
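To make the glob behaviour concrete, here is a minimal, dependency-free sketch of which URLs such a pattern would select. The globToRegExp helper is hypothetical and far simpler than the matcher Crawlee uses internally; it only assumes that a single * matches within one path segment and never crosses a /.

```typescript
// Hypothetical helper (not part of Crawlee): converts a simple glob into a RegExp.
// Assumption: `*` matches any run of characters except `/`.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .split('*')
    // Escape regex metacharacters in the literal parts of the glob.
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('[^/]*');
  return new RegExp(`^${escaped}$`);
}

const postGlob = globToRegExp('https://www.my-site.com/categories/*/posts/*');

const candidates = [
  'https://www.my-site.com',
  'https://www.my-site.com/categories/1',
  'https://www.my-site.com/categories/1/posts/2',
];

// Only the post URL satisfies the two-level pattern.
const matched = candidates.filter((url) => postGlob.test(url));
console.log(matched);
```

Under this assumption the category pages themselves do not match the posts glob, so a separate, shorter glob such as https://www.my-site.com/categories/* would be needed to pick them up first.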
The downloading of files and uploading to Google Drive may be the hardest part.
deep-jade•3y ago
Regarding login logic...
This article can be used for inspiration:
https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/logging-into-a-website
But yeah, each case is different and requires its own implementation.
wise-whiteOP•3y ago
Hi @Pepa J and @Oleg V.
Thank you very much for your replies, very helpful.
I have some experience writing E2E tests with Playwright; what I'm lacking is an understanding of how Crawlee manages all these things. For example, once I've logged in to the website, do I need to go to the specific page where I have the categories so enqueueLinks can see all links within this page, or will enqueueLinks automatically crawl all URLs that match the glob?
The other thing is, once I have all the links that match the glob https://www.my-site.com/categories/*/posts/*, how do I tell Crawlee to run a specific handler for each of these links?
Let's say I have a function that grabs the video URL and downloads it to a Google Drive folder. How can I tell Crawlee to run this specific function for each link?
deep-jade•3y ago
I would recommend checking our Academy first (https://docs.apify.com/academy).
It will give you a general idea of how Crawlee works.
1. do I need to go to the specific page?
- Yes, enqueueLinks only searches for links on the page currently being handled.
2. how do I tell that in each of these links to run a specific handler?
- There is a router for that, combined with labels:
https://crawlee.dev/api/next/playwright-crawler/function/createPlaywrightRouter
It works based on request labels (https://crawlee.dev/api/next/core/interface/RequestOptions#label)
for each label you add
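The two answers above can be illustrated with a small, dependency-free sketch of label-based dispatch. It does not use Crawlee at all: the site map, the handlers table, the crawl function and the label names DEFAULT, CATEGORY and POST are all hypothetical stand-ins, chosen only to show how a queue of labelled requests ends up running one handler per kind of page.

```typescript
// A request carries a URL plus an optional label (hypothetical shape for this sketch).
interface SketchRequest {
  url: string;
  label?: string;
}
type Enqueue = (url: string, label: string) => void;
type Handler = (req: SketchRequest, enqueue: Enqueue) => void;

// Hypothetical in-memory "site": page URL -> links found on that page.
const site: Record<string, string[]> = {
  'https://www.my-site.com/categories': [
    'https://www.my-site.com/categories/1',
    'https://www.my-site.com/categories/2',
  ],
  'https://www.my-site.com/categories/1': ['https://www.my-site.com/categories/1/posts/1'],
  'https://www.my-site.com/categories/2': ['https://www.my-site.com/categories/2/posts/1'],
};

const visitedPosts: string[] = [];

// One handler per label; each handler only sees links on its own page.
const handlers: Record<string, Handler> = {
  DEFAULT: (req, enqueue) => {
    for (const link of site[req.url] ?? []) enqueue(link, 'CATEGORY');
  },
  CATEGORY: (req, enqueue) => {
    for (const link of site[req.url] ?? []) enqueue(link, 'POST');
  },
  POST: (req) => {
    // This is where the per-post work (e.g. finding the video URL) would go.
    visitedPosts.push(req.url);
  },
};

function crawl(startUrl: string): void {
  const queue: SketchRequest[] = [{ url: startUrl }];
  const seen = new Set<string>([startUrl]);
  while (queue.length > 0) {
    const req = queue.shift()!;
    const handler = handlers[req.label ?? 'DEFAULT'];
    if (!handler) continue;
    handler(req, (url, label) => {
      // Skip URLs that were already queued, so each page is handled once.
      if (!seen.has(url)) {
        seen.add(url);
        queue.push({ url, label });
      }
    });
  }
}

crawl('https://www.my-site.com/categories');
console.log(visitedPosts);
```

In Crawlee itself the same shape would presumably be expressed with the router from createPlaywrightRouter and a label passed along when enqueueing links, as described in the links above; the sketch only mirrors the idea, not the library's API.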
wise-whiteOP•3y ago
Thanks a lot,
I am definitely going to take a look at the academy 🙂