enqueueLinks doesn't work.
Also, how can I check the function code for enqueueLinks etc. and the other functions mentioned in this trace? I can only see the function definitions and headers in a .ts file. No @crawlee/src folder shows up in node_modules.
Can you also post your code?
genetic-orangeOP•2y ago
sure
Here is the current code. I've put some robots.txt data and polite-crawling rules into userData, which I'll add to the enqueueLinks strategy, regexps, exclude, etc. later.
And this is a function I defined for taking the links out of the Crawlee request queue and adding them to my own persistent queue.
And the following is the launch context for the PlaywrightCrawler that I'm using.
genetic-orangeOP•2y ago
The basic idea of the crawler for me is that it's given a URL to a webpage from which it needs to scrape data and links. Both of these should then be given to me in text form so I can store them per my own requirements. Basically a scraper that provides links and raw HTML.
solid-orange•2y ago
Maybe try to use addRequests() instead? (https://crawlee.dev/docs/3.3/upgrading/upgrading-to-v3#crawleraddrequests)
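For reference, a minimal sketch of the v3 pattern from that upgrade guide, assuming a PlaywrightCrawler and using the thread's helpdesk URL as the start URL (adjust to your own code):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

// In v3, requests are added on the crawler instance itself;
// addRequests() accepts plain URL strings or request objects.
await crawler.addRequests(['https://helpdesk.egnyte.com']);
await crawler.run();
```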
it's strange that enqueueLinks fails... never happened to me in hundreds of written scrapers.
Maybe try to make it a bit more specific and not just strategy: 'all'.
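Something along these lines, inside the requestHandler; the globs and exclude patterns below are illustrative assumptions, not taken from the actual code:

```ts
// Narrow the scope instead of strategy: 'all'.
await enqueueLinks({
    strategy: 'same-domain',                       // stay on the current domain
    globs: ['https://helpdesk.egnyte.com/hc/**'],  // illustrative pattern
    exclude: ['**/*.pdf', '**/*.png'],             // illustrative: skip binary assets
});
```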
genetic-orangeOP•2y ago
But I want all the links in it for now, for testing, because if a page with only 100 links in total fails with 'all', then it will fail again when a page with 100 specific links I do want comes along.
Also, to copy all the links that enqueueLinks added to the request queue into my own queue / persistent storage, I have a function like this.
But this function sometimes has fetchNextRequest() returning null even though the requestQueue has links left.
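A hedged reconstruction of the kind of drain loop being described (the personal-queue interface is assumed); note that fetchNextRequest() can return null while requests are locked or in progress, not only when the queue is truly empty:

```ts
import { RequestQueue } from 'crawlee';

// Hypothetical shape of the personal persistent queue.
interface MyQueue {
    push(url: string): Promise<void>;
}

async function drainQueue(myQueue: MyQueue): Promise<void> {
    const requestQueue = await RequestQueue.open();
    let req = await requestQueue.fetchNextRequest();
    while (req !== null) {
        await myQueue.push(req.url);
        // Marking it handled removes it from the crawler's queue; if the
        // crawler should still process it, use reclaimRequest() instead.
        await requestQueue.markRequestHandled(req);
        req = await requestQueue.fetchNextRequest();
    }
}
```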
It threw the same error as in the enqueueLinks case. The error is as follows:
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Request blocked - received 403 status code.
genetic-orangeOP•2y ago
The error occurred on line number 126 of my code, which maps to the
Have you imported Request from crawlee?
genetic-orangeOP•2y ago
Without that line it works; there is no error in the whole code.
genetic-orangeOP•2y ago
Let me try with that imported. Is it different from the regular Request in JS?
Yes.
The built-in Request doesn't work with Crawlee.
And you also have to use the proper object keys.
genetic-orangeOP•2y ago
Is this the correct way of defining a request?
genetic-orangeOP•2y ago
It says some RequestOptions are expected when defining a new Request.
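For comparison, a minimal sketch of constructing a Crawlee Request: unlike the browser's built-in Request, it takes a single RequestOptions object in which url is the only required key (the label and userData below are optional, illustrative extras):

```ts
import { Request } from 'crawlee';

const req = new Request({
    url: 'https://helpdesk.egnyte.com',
    label: 'HELPDESK',        // optional routing label
    userData: { depth: 0 },   // optional custom payload
});
```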
look at my code above.
genetic-orangeOP•2y ago
Ok, thanks, will try that. Just a minute.
This is why I told you the Crawlee Request is different.
genetic-orangeOP•2y ago
It still gives the same error on the addRequests function line.
It can't be the same error.
You're probably actually blocked and really getting a 403.
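If it really is a 403 block, the usual Crawlee knobs look roughly like this; the proxy URL is a placeholder and the settings are illustrative, not a guaranteed fix:

```ts
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@proxy.example.com:8000'], // placeholder
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,   // retire sessions that get blocked
    maxRequestRetries: 5,
    async requestHandler({ page, request, log }) {
        log.info(`Loaded ${request.url}: ${await page.title()}`);
    },
});
```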
genetic-orangeOP•2y ago
But I am getting the data saved in the Dataset, and I also have the links printed from the hrefs.
All the URLs from the helpdesk page?
genetic-orangeOP•2y ago
This is the code for extracting the links from page.content.
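The original snippet isn't shown here; a common Playwright way to do this (evaluating in the page rather than parsing page.content()) would be:

```ts
// Collect the href of every anchor on the rendered page.
const hrefs = await page.$$eval('a[href]', (anchors) =>
    anchors.map((a) => (a as HTMLAnchorElement).href),
);
console.log(`found ${hrefs.length} links`, hrefs);
```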
genetic-orangeOP•2y ago
And here is a screenshot of the found links in the terminal.
genetic-orangeOP•2y ago
Yes, helpdesk.egnyte.com.
It can be from here, because you mark the request as handled. Why are you doing that?
genetic-orangeOP•2y ago
This function isn't getting called in the current flow.
Can you share the entire flow that is working now?
genetic-orangeOP•2y ago
Earlier I was doing that to extract links from the requestQueue, where enqueueLinks had added them, and then I would simply add them to my own task queue to schedule for later.
And addDataToDatabase just has Dataset.pushData in it, so no changes there.
This would normally be:
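Presumably something like the standard flow, where enqueueLinks feeds the crawler's own queue from inside the handler and the scraped data goes straight to the Dataset (a sketch, not the exact code that was posted):

```ts
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, enqueueLinks }) {
        // Save the raw HTML and URL, then let Crawlee schedule the links.
        await Dataset.pushData({
            url: request.url,
            html: await page.content(),
        });
        await enqueueLinks({ strategy: 'same-domain' });
    },
});

await crawler.run(['https://helpdesk.egnyte.com']);
```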
But addRequests() adds batches of 1k; in your case you have 100, so it doesn't make sense to use it and you can remove it completely.
genetic-orangeOP•2y ago
Not sure why this error is happening.
Try what I've written above and we'll take it step by step.
genetic-orangeOP•2y ago
When I try to scrape another link it gets scraped fine (www.100ms.live/docs and some other similar pages work fine), but on helpdesk.egnyte.com it fails.
it doesn't have anything to do with the webpage.
genetic-orangeOP•2y ago
The 100ms page has only 40-something links in total and it works fine for that, but not for this.
So could it have something to do with the size of the requestQueue?
absolutely nothing.
genetic-orangeOP•2y ago
I tried what you wrote; still the same error.
I've made spiders with millions of URLs in the requestQueue.
If you give me 15 minutes I'll run the same code on my machine to see what's wrong.
genetic-orangeOP•2y ago
Can you try scraping this website? A friend of mine tried to scrape it using the Apify API as well, and he also failed with similar errors.
genetic-orangeOP•2y ago
This file has the code that I am running.
genetic-orangeOP•2y ago
@NeoNomade any update?
Hey, any updates on this error?
Hello @NeoNomade, did it work on your PC?
The error originates from the addRequests() call; without it the code works. Any update on why that might be?