enqueueLinks doesn't work. Here is the stack trace I get:

at ArrayValidator.handle (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@sapphire/shapeshift/src/validators/ArrayValidator.ts:102:17)
at ArrayValidator.parse (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@sapphire/shapeshift/src/validators/BaseValidator.ts:103:2)
at RequestQueueClient.batchAddRequests (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/resource-clients/request-queue.ts:338:36)
at RequestQueue.addRequests (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/storages/request_queue.ts:376:46)
at enqueueLinks (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/enqueue_links/enqueue_links.ts:373:25)
at browserCrawlerEnqueueLinks (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/browser-crawler.ts:725:24)
at async PlaywrightCrawler.requestHandler [as userProvidedRequestHandler] (file:///home/chirag/Desktop/crawlee_test/my-crawler/src/main.ts:119:13)
at async wrap (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@apify/src/index.ts:77:27) {"id":"HNdzawiNRrxNKir","url":"https://helpdesk.egnyte.com/","method":"GET","uniqueKey":"https://helpdesk.egnyte.com"}
Also, how can I check the actual function code for enqueueLinks and the other functions mentioned in this trace? I can only see the function definitions and headers in a .d.ts file, and there is no @crawlee/src folder in node_modules.
NeoNomade
NeoNomade•2y ago
Can you also post your code?
genetic-orange
genetic-orangeOP•2y ago
sure
async requestHandler({ request, page, enqueueLinks, log }) {
    log.info('in crawling function');

    // wait for page to load and all the JS to render
    await page.waitForLoadState('networkidle');

    const title = await page.title();
    log.info(`Title of ${request.loadedUrl} is '${title}'`);

    // get page HTML
    const raw_html = await page.content();

    // Save results as JSON to Database
    await addDataToDatabase({ title, url: request.loadedUrl, raw_html });

    // Extract links from the current page and add to request queue
    if (request.userData.event.sitemap.length == 0) {
        const RobotRulesV2 = { allow: [], disallow: [] };
        request.userData.robotRules.forEach(rule => {
            // log.debug(rule.pattern);
            const pattern = rule.pattern.replaceAll('*', '[\x00-\x7F]*');
            // log.debug(pattern);
            rule.allow
                ? RobotRulesV2.allow.push(RegExp('^' + request.userData.event.seed_url + pattern + '$'))
                : RobotRulesV2.disallow.push(RegExp('^' + request.userData.event.seed_url + pattern + '$'));
        });

        await enqueueLinks({
            strategy: 'all',
        });

        // Extract links from request queue to add on system pipeline
        await addLinksTOQueue(this.requestQueue, log, request.userData);
    }
}
Here is the current code. I have put some robots.txt rules for polite crawling into userData, which I will later add to the enqueueLinks strategy, regexps, exclude, etc. addLinksTOQueue is a function I defined to take the links out of the Crawlee request queue and add them to my own persistent queue. The following is the launchContext for the PlaywrightCrawler I am using:
launchContext: {
    launcher: chromium,
    launchOptions: {
        chromiumSandbox: true,
    },
    // ignoreHTTPSErrors: true,
},
genetic-orange
genetic-orangeOP•2y ago
The basic idea of the crawler, for me, is that it is given a URL to a webpage from which it needs to scrape data and links. Both should then be returned to me as text so I can store them according to my own requirements. Basically a scraper that provides links and raw HTML.
NeoNomade
NeoNomade•2y ago
It's strange that enqueueLinks fails... it has never happened to me in hundreds of scrapers I've written. Maybe try to make it a bit more particular, not just strategy: 'all'.
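For example, something along these lines (a minimal sketch using enqueueLinks' documented strategy/globs/exclude options; the patterns are placeholders, not values from your run):

// Sketch: a more particular enqueueLinks call, restricting discovery by
// strategy and URL pattern instead of enqueueing every link on the page.
await enqueueLinks({
    strategy: 'same-domain',
    globs: ['https://helpdesk.egnyte.com/**'],
    exclude: ['**/login/**'],
});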
genetic-orange
genetic-orangeOP•2y ago
But I want all links for now, for testing: if a page with only 100 links in total fails on 'all', then it will fail again when a page comes along with 100 specific links I want. Also, to add all the links that enqueueLinks put into the request queue to my own queue / persistent memory, I have a function like this:
const addLinksTOQueue = async (requestQueue, log, event) => {
    while (true) {
        const link = await requestQueue?.fetchNextRequest();
        if (!link) break;
        log.info(`Found link ${link.url}`);
        await addLinkToQueue(link.url, event);
        const done = await requestQueue?.markRequestHandled(link);
    }
    log.info(`Done with all links`);
};
But this function sometimes has fetchNextRequest returning null even though the request queue still has links left, and it also threw the same error as in the enqueueLinks case.
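(Apparently fetchNextRequest() can return null while requests are still locked by in-progress handlers, so a drain loop can check isFinished() to tell a temporarily empty queue head from a truly finished queue. A minimal sketch, assuming Crawlee's RequestQueue API and my addLinkToQueue helper from above:)

// Sketch: treat a null fetchNextRequest() as "maybe not ready yet" rather
// than "done", and let isFinished() decide when to stop.
const addLinksToQueueSafe = async (requestQueue, log, event) => {
    while (!(await requestQueue.isFinished())) {
        const link = await requestQueue.fetchNextRequest();
        if (!link) {
            // the queue head can be empty while requests are still in flight
            await new Promise((resolve) => setTimeout(resolve, 500));
            continue;
        }
        log.info(`Found link ${link.url}`);
        await addLinkToQueue(link.url, event); // my own persistence helper
        await requestQueue.markRequestHandled(link);
    }
    log.info('Done with all links');
};

Separately, I also tried pulling the links out of the page myself: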
const hrefs = await page.evaluate(() => {
    return Array.from(document.links).map(item => item.href);
});
for (const href of hrefs) {
    log.info(`Found link ${href}`);
}

const requests_array = [];
for (const href of hrefs) {
    const req = new Request(href);
    req.userData = request.userData;
    requests_array.push(req);
}
// console.log(requests_array);
const result = await crawler.addRequests(requests_array);
await result.waitForAllRequestsToBeAdded;
The error is as follows:
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Received one or more errors
at async PlaywrightCrawler.requestHandler [as userProvidedRequestHandler] (file:///home/chirag/Desktop/crawlee_test/my-crawler/src/main.ts:126:24) {"id":"HNdzawiNRrxNKir","url":"https://helpdesk.egnyte.com/","retryCount":3}
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Request blocked - received 403 status code.
at PlaywrightCrawler._throwOnBlockedRequest (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/basic-crawler.ts:825:21)
at PlaywrightCrawler._responseHandler (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/browser-crawler.ts:637:22)
at PlaywrightCrawler._runRequestHandler (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/browser-crawler.ts:498:24)
at async PlaywrightCrawler._runRequestHandler (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/playwright-crawler.ts:240:9)
at async wrap (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@apify/src/index.ts:77:27) {"id":"HNdzawiNRrxNKir","url":"https://helpdesk.egnyte.com/","method":"GET","uniqueKey":"https://helpdesk.egnyte.com"}
NeoNomade
NeoNomade•2y ago
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Request blocked - received 403 status code.
genetic-orange
genetic-orangeOP•2y ago
The error occurs on line 126 of my code, which maps to
const result = await crawler.addRequests(requests_array);
NeoNomade
NeoNomade•2y ago
have you imported Request from crawlee?
genetic-orange
genetic-orangeOP•2y ago
Without that line it works; there is no error in the whole code.
genetic-orange
genetic-orangeOP•2y ago
Let me try with that imported. Is it different from the regular Request in JS?
NeoNomade
NeoNomade•2y ago
Yes, the regular one doesn't work with Crawlee, and you also have to use the proper object keys.
genetic-orange
genetic-orangeOP•2y ago
const request = new Request();
request.url = event.URL;
request.userData = {event, robotRules};
Is this the correct way to define a request?
NeoNomade
NeoNomade•2y ago
// (Request here is Crawlee's Request class, so it must be imported:)
import { Request } from 'crawlee';

const requests_array = [];
for (const href of hrefs) {
    const req = new Request({
        url: href,
        userData: { yourKey: yourValue },
    });
    requests_array.push(req);
}
genetic-orange
genetic-orangeOP•2y ago
It says some RequestOptions are expected when defining a new request.
NeoNomade
NeoNomade•2y ago
look at my code above.
genetic-orange
genetic-orangeOP•2y ago
ohk thanks will try that just a minute
NeoNomade
NeoNomade•2y ago
This is why I told you the Crawlee Request is different.
genetic-orange
genetic-orangeOP•2y ago
It still gives the same error on the addRequests line.
NeoNomade
NeoNomade•2y ago
It can't be the same error. Probably you are really blocked and really getting a 403.
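If it really is blocking, the usual first knobs to try are the browser engine and launch options, e.g. (a sketch of common PlaywrightCrawler options; firefox comes from the playwright package, and the values are illustrative, not from this thread):

import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// Sketch: switching the browser engine and running headful sometimes gets
// past basic bot checks; proxies are the next step if this isn't enough.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: { headless: false },
    },
    maxRequestRetries: 3,
    async requestHandler({ request, log }) {
        log.info(`Loaded ${request.loadedUrl}`);
    },
});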
genetic-orange
genetic-orangeOP•2y ago
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Received one or more errors
at async PlaywrightCrawler.requestHandler [as userProvidedRequestHandler] (file:///home/chirag/Desktop/crawlee_test/my-crawler/src/main.ts:127:24) {"id":"HNdzawiNRrxNKir","url":"https://helpdesk.egnyte.com","retryCount":3}
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Request blocked - received 403 status code.
at PlaywrightCrawler._throwOnBlockedRequest (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/basic-crawler.ts:825:21)
at PlaywrightCrawler._responseHandler (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/browser-crawler.ts:637:22)
at PlaywrightCrawler._runRequestHandler (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/browser-crawler.ts:498:24)
at runNextTicks (node:internal/process/task_queues:60:5)
at processImmediate (node:internal/timers:447:9)
at process.topLevelDomainCallback (node:domain:161:15)
at process.callbackTrampoline (node:internal/async_hooks:128:24)
at async PlaywrightCrawler._runRequestHandler (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/playwright-crawler.ts:240:9)
at async wrap (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@apify/src/index.ts:77:27) {"id":"HNdzawiNRrxNKir","url":"https://helpdesk.egnyte.com","method":"GET","uniqueKey":"https://helpdesk.egnyte.com"}
But I am getting the data saved in the dataset, and I also have the links printed from the hrefs.
NeoNomade
NeoNomade•2y ago
All the URLs from the helpdesk page?
genetic-orange
genetic-orangeOP•2y ago
const hrefs = await page.evaluate(() => {
    return Array.from(document.links).map(item => item.href);
});
for (const href of hrefs) {
    log.info(`Found link ${href}`);
}
This is the code for extracting the links from the page content.
genetic-orange
genetic-orangeOP•2y ago
And here is a screenshot of the found links in the terminal:
genetic-orange
genetic-orangeOP•2y ago
yes helpdesk.egnyte.com
NeoNomade
NeoNomade•2y ago
const addLinksTOQueue = async (requestQueue, log, event) => {
    while (true) {
        const link = await requestQueue?.fetchNextRequest();
        if (!link) break;
        log.info(`Found link ${link.url}`);
        await addLinkToQueue(link.url, event);
        const done = await requestQueue?.markRequestHandled(link);
    }
    log.info(`Done with all links`);
};
It can be from here, because you mark the request as handled. Why are you doing that?
genetic-orange
genetic-orangeOP•2y ago
This function is not getting called in the current flow.
NeoNomade
NeoNomade•2y ago
Can you share the entire flow that is working now?
genetic-orange
genetic-orangeOP•2y ago
Earlier I was doing that to extract links from the requestQueue, where enqueueLinks had added them; I would then simply add them to my own task queue to schedule them for later.
async requestHandler({ request, page, enqueueLinks, log }) {
    log.info('in crawling function');

    // wait for page to load and all the JS to render
    await page.waitForLoadState('networkidle');

    // get page title
    const title = await page.title();
    log.info(`Title of ${request.loadedUrl} is '${title}'`);

    // get page HTML
    const raw_html = await page.content();
    // log.info(`HTML of the page: ${html}`)

    // Save results as JSON to Database
    await addDataToDatabase({ title, url: request.loadedUrl, raw_html });

    const hrefs = await page.evaluate(() => {
        return Array.from(document.links).map(item => item.href);
    });
    for (const href of hrefs) {
        log.info(`Found link ${href}`);
    }

    const requests_array = [];
    for (const href of hrefs) {
        const req = new Request({
            url: href,
            userData: request.userData,
        });
        requests_array.push(req);
    }
    // console.log(requests_array);
    const result = await crawler.addRequests(requests_array);
    await result.waitForAllRequestsToBeAdded;
},
And addDataToDatabase just calls Dataset.pushData, so no changes there.
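In other words, roughly this (a sketch, assuming the helper only forwards to Crawlee's Dataset.pushData and nothing else):

import { Dataset } from 'crawlee';

// Sketch: addDataToDatabase as a thin wrapper over Dataset.pushData,
// matching the fields used in the requestHandler above.
const addDataToDatabase = async (item: { title: string; url?: string; raw_html: string }) => {
    await Dataset.pushData(item);
};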
NeoNomade
NeoNomade•2y ago
const result = await crawler.addRequests(requests_array);
await result.waitForAllRequestsToBeAdded;
This would normally be:
const result = await crawler.addRequests(requests_array, {waitForAllRequestsToBeAdded: true})
But addRequests adds requests in batches of 1k; in your case you have 100, so the option doesn't do anything and you can remove it completely.
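Also, if the whole point of the manual Request loop is just to carry userData onto each discovered link, enqueueLinks can do that by itself (a sketch using its documented userData option):

// Sketch: let enqueueLinks attach the current request's userData to every
// link it discovers, instead of building Request objects by hand.
await enqueueLinks({
    strategy: 'all',
    userData: request.userData,
});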
genetic-orange
genetic-orangeOP•2y ago
Not sure why this error is happening.
NeoNomade
NeoNomade•2y ago
Try what I've written above and we'll take it step by step.
genetic-orange
genetic-orangeOP•2y ago
When I try to scrape another link, it gets scraped fine (www.100ms.live/docs and some other similar pages work fine), but on helpdesk.egnyte.com it fails.
NeoNomade
NeoNomade•2y ago
it doesn't have anything to do with the webpage.
genetic-orange
genetic-orangeOP•2y ago
The 100ms page has only 40-something links in total and it works fine for that but not for this, so could it have something to do with the size of the request queue?
NeoNomade
NeoNomade•2y ago
absolutely nothing.
genetic-orange
genetic-orangeOP•2y ago
I wrote
const result = await crawler.addRequests(requests_array, {waitForAllRequestsToBeAdded: true})
and still get the same error.
NeoNomade
NeoNomade•2y ago
I've made spiders with millions of URLs in the requestQueue. If you give me 15 minutes, I'll run the same code on my machine to see what is wrong.
genetic-orange
genetic-orangeOP•2y ago
Can you try scraping this website? A friend of mine tried to scrape it using the Apify API as well, and he also failed with similar errors.
genetic-orange
genetic-orangeOP•2y ago
This file has the code that I am running.
genetic-orange
genetic-orangeOP•2y ago
@NeoNomade hey, any updates on this error? Did it work on your PC? The error originates from the addRequests call; without it the code works. Any update on why that might be?
