enqueueLinks doesn't work. Here is the stack trace I get:

at ArrayValidator.handle (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@sapphire/shapeshift/src/validators/ArrayValidator.ts:102:17)
at ArrayValidator.parse (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@sapphire/shapeshift/src/validators/BaseValidator.ts:103:2)
at RequestQueueClient.batchAddRequests (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/resource-clients/request-queue.ts:338:36)
at RequestQueue.addRequests (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/storages/request_queue.ts:376:46)
at enqueueLinks (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/enqueue_links/enqueue_links.ts:373:25)
at browserCrawlerEnqueueLinks (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/browser-crawler.ts:725:24)
at async PlaywrightCrawler.requestHandler [as userProvidedRequestHandler] (file:///home/chirag/Desktop/crawlee_test/my-crawler/src/main.ts:119:13)
at async wrap (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@apify/src/index.ts:77:27) {"id":"HNdzawiNRrxNKir","url":"https://helpdesk.egnyte.com/","method":"GET","uniqueKey":"https://helpdesk.egnyte.com"}
Also, how can I check the actual function code for enqueueLinks and the other functions mentioned in this trace? I can only see the function definitions and headers in a .d.ts file, and there is no @crawlee/src folder in node_modules.
NeoNomade
NeoNomade•2y ago
Can you also post your code?
genetic-orange
genetic-orangeOP•2y ago
sure
async requestHandler({ request, page, enqueueLinks, log }) {
    log.info('in crawling function');

    // wait for page to load and all the JS to render
    await page.waitForLoadState('networkidle');

    const title = await page.title();
    log.info(`Title of ${request.loadedUrl} is '${title}'`);

    // get page HTML
    const raw_html = await page.content();

    // Save results as JSON to Database
    await addDataToDatabase({ title, url: request.loadedUrl, raw_html });

    // Extract links from the current page and add to request queue
    if (request.userData.event.sitemap.length == 0) {
        const RobotRulesV2 = { allow: [], disallow: [] };
        request.userData.robotRules.forEach(rule => {
            // log.debug(rule.pattern);
            const pattern = rule.pattern.replaceAll('*', '[\x00-\x7F]*');
            // log.debug(pattern);
            rule.allow
                ? RobotRulesV2.allow.push(RegExp('^' + request.userData.event.seed_url + pattern + '$'))
                : RobotRulesV2.disallow.push(RegExp('^' + request.userData.event.seed_url + pattern + '$'));
        });

        await enqueueLinks({
            strategy: 'all',
        });

        // Extract links from request queue to add on system pipeline
        await addLinksTOQueue(this.requestQueue, log, request.userData);
    }
}
Here is the current code. I have put some robots.txt rules for polite crawling into userData, which I will later add to the enqueueLinks strategy, regexps, exclude, etc. addLinksTOQueue is a function I defined to take the links out of the Crawlee request queue and add them to my own persistent queue. The following is the launchContext for the PlaywrightCrawler I am using:
launchContext: {
    launcher: chromium,
    launchOptions: {
        chromiumSandbox: true,
    },
    // ignoreHTTPSErrors: true,
},
genetic-orange
genetic-orangeOP•2y ago
The basic idea of the crawler, for me, is that it is given a URL to a webpage from which it needs to scrape data and links. Both should then be returned to me as text so I can store them according to my own requirements. Basically a scraper that provides links and raw HTML.
NeoNomade
NeoNomade•2y ago
It's strange that enqueueLinks fails... it has never happened to me in hundreds of scrapers I've written. Maybe try to make it a bit more particular, not just strategy: 'all'.
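For example, something along these lines (a minimal sketch using enqueueLinks' documented strategy/globs/exclude options; the patterns are placeholders, not values from your run):

// Sketch: a more particular enqueueLinks call, restricting discovery by
// strategy and URL pattern instead of enqueueing every link on the page.
await enqueueLinks({
    strategy: 'same-domain',
    globs: ['https://helpdesk.egnyte.com/**'],
    exclude: ['**/login/**'],
});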
genetic-orange
genetic-orangeOP•2y ago
But I want all links for now, for testing: if a page with only 100 links in total fails on 'all', then it will fail again when a page comes along with 100 specific links I want. Also, to add all the links that enqueueLinks put into the request queue to my own queue / persistent memory, I have a function like this:
const addLinksTOQueue = async (requestQueue, log, event) => {
    while (true) {
        const link = await requestQueue?.fetchNextRequest();
        if (!link) break;
        log.info(`Found link ${link.url}`);
        await addLinkToQueue(link.url, event);
        const done = await requestQueue?.markRequestHandled(link);
    }
    log.info(`Done with all links`);
};
But this function sometimes has fetchNextRequest returning null even though the request queue still has links left, and it also threw the same error as in the enqueueLinks case.
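(Apparently fetchNextRequest() can return null while requests are still locked by in-progress handlers, so a drain loop can check isFinished() to tell a temporarily empty queue head from a truly finished queue. A minimal sketch, assuming Crawlee's RequestQueue API and my addLinkToQueue helper from above:)

// Sketch: treat a null fetchNextRequest() as "maybe not ready yet" rather
// than "done", and let isFinished() decide when to stop.
const addLinksToQueueSafe = async (requestQueue, log, event) => {
    while (!(await requestQueue.isFinished())) {
        const link = await requestQueue.fetchNextRequest();
        if (!link) {
            // the queue head can be empty while requests are still in flight
            await new Promise((resolve) => setTimeout(resolve, 500));
            continue;
        }
        log.info(`Found link ${link.url}`);
        await addLinkToQueue(link.url, event); // my own persistence helper
        await requestQueue.markRequestHandled(link);
    }
    log.info('Done with all links');
};

Separately, I also tried pulling the links out of the page myself: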
const hrefs = await page.evaluate(() => {
    return Array.from(document.links).map(item => item.href);
});
for (const href of hrefs) {
    log.info(`Found link ${href}`);
}

const requests_array = [];
for (const href of hrefs) {
    const req = new Request(href);
    req.userData = request.userData;
    requests_array.push(req);
}
// console.log(requests_array);
const result = await crawler.addRequests(requests_array);
await result.waitForAllRequestsToBeAdded;
The error is as follows:
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Received one or more errors
at async PlaywrightCrawler.requestHandler [as userProvidedRequestHandler] (file:///home/chirag/Desktop/crawlee_test/my-crawler/src/main.ts:126:24) {"id":"HNdzawiNRrxNKir","url":"https://helpdesk.egnyte.com/","retryCount":3}
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Request blocked - received 403 status code.
at PlaywrightCrawler._throwOnBlockedRequest (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/basic-crawler.ts:825:21)
at PlaywrightCrawler._responseHandler (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/browser-crawler.ts:637:22)
at PlaywrightCrawler._runRequestHandler (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/browser-crawler.ts:498:24)
at async PlaywrightCrawler._runRequestHandler (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/playwright-crawler.ts:240:9)
at async wrap (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@apify/src/index.ts:77:27) {"id":"HNdzawiNRrxNKir","url":"https://helpdesk.egnyte.com/","method":"GET","uniqueKey":"https://helpdesk.egnyte.com"}
NeoNomade
NeoNomade•2y ago
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Request blocked - received 403 status code.
genetic-orange
genetic-orangeOP•2y ago
The error occurs on line 126 of my code, which maps to
const result = await crawler.addRequests(requests_array);
NeoNomade
NeoNomade•2y ago
have you imported Request from crawlee?
genetic-orange
genetic-orangeOP•2y ago
Without that line it works; there is no error in the whole code.
genetic-orange
genetic-orangeOP•2y ago
Let me try with that imported. Is it different from the regular Request in JS?
NeoNomade
NeoNomade•2y ago
Yes, the regular one doesn't work with Crawlee, and you also have to use the proper object keys.
genetic-orange
genetic-orangeOP•2y ago
const request = new Request();
request.url = event.URL;
request.userData = {event, robotRules};
Is this the correct way to define a request?
NeoNomade
NeoNomade•2y ago
// (Request here is Crawlee's Request class, so it must be imported:)
import { Request } from 'crawlee';

const requests_array = [];
for (const href of hrefs) {
    const req = new Request({
        url: href,
        userData: { yourKey: yourValue },
    });
    requests_array.push(req);
}
genetic-orange
genetic-orangeOP•2y ago
It says some RequestOptions are expected when defining a new request.
NeoNomade
NeoNomade•2y ago
look at my code above.
genetic-orange
genetic-orangeOP•2y ago
ohk thanks will try that just a minute
NeoNomade
NeoNomade•2y ago
This is why I told you the Crawlee Request is different.
genetic-orange
genetic-orangeOP•2y ago
It still gives the same error on the addRequests line.
NeoNomade
NeoNomade•2y ago
It can't be the same error. Probably you are really blocked and really getting a 403.
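If it really is blocking, the usual first knobs to try are the browser engine and launch options, e.g. (a sketch of common PlaywrightCrawler options; firefox comes from the playwright package, and the values are illustrative, not from this thread):

import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// Sketch: switching the browser engine and running headful sometimes gets
// past basic bot checks; proxies are the next step if this isn't enough.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: { headless: false },
    },
    maxRequestRetries: 3,
    async requestHandler({ request, log }) {
        log.info(`Loaded ${request.loadedUrl}`);
    },
});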
genetic-orange
genetic-orangeOP•2y ago
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Received one or more errors
at async PlaywrightCrawler.requestHandler [as userProvidedRequestHandler] (file:///home/chirag/Desktop/crawlee_test/my-crawler/src/main.ts:127:24) {"id":"HNdzawiNRrxNKir","url":"https://helpdesk.egnyte.com","retryCount":3}
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Request blocked - received 403 status code.
at PlaywrightCrawler._throwOnBlockedRequest (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/basic-crawler.ts:825:21)
at PlaywrightCrawler._responseHandler (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/browser-crawler.ts:637:22)
at PlaywrightCrawler._runRequestHandler (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/browser-crawler.ts:498:24)
at runNextTicks (node:internal/process/task_queues:60:5)
at processImmediate (node:internal/timers:447:9)
at process.topLevelDomainCallback (node:domain:161:15)
at process.callbackTrampoline (node:internal/async_hooks:128:24)
at async PlaywrightCrawler._runRequestHandler (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/internals/playwright-crawler.ts:240:9)
at async wrap (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@apify/src/index.ts:77:27) {"id":"HNdzawiNRrxNKir","url":"https://helpdesk.egnyte.com","method":"GET","uniqueKey":"https://helpdesk.egnyte.com"}
But I am getting the data saved in the dataset, and I also have the links printed from the hrefs.
NeoNomade
NeoNomade•2y ago
All the URLs from the helpdesk page?
genetic-orange
genetic-orangeOP•2y ago
const hrefs = await page.evaluate(() => {
    return Array.from(document.links).map(item => item.href);
});
for (const href of hrefs) {
    log.info(`Found link ${href}`);
}
This is the code for extracting the links from the page content.
genetic-orange
genetic-orangeOP•2y ago
And here is a screenshot of the found links in the terminal:
genetic-orange
genetic-orangeOP•2y ago
yes helpdesk.egnyte.com
NeoNomade
NeoNomade•2y ago
const addLinksTOQueue = async (requestQueue, log, event) => {
    while (true) {
        const link = await requestQueue?.fetchNextRequest();
        if (!link) break;
        log.info(`Found link ${link.url}`);
        await addLinkToQueue(link.url, event);
        const done = await requestQueue?.markRequestHandled(link);
    }
    log.info(`Done with all links`);
};
It can be from here, because you mark the request as handled. Why are you doing that?
genetic-orange
genetic-orangeOP•2y ago
This function is not getting called in the current flow.
NeoNomade
NeoNomade•2y ago
Can you share the entire flow that is working now?
genetic-orange
genetic-orangeOP•2y ago
Earlier I was doing that to extract links from the requestQueue, where enqueueLinks had added them; I would then simply add them to my own task queue to schedule them for later.
async requestHandler({ request, page, enqueueLinks, log }) {
    log.info('in crawling function');

    // wait for page to load and all the JS to render
    await page.waitForLoadState('networkidle');

    // get page title
    const title = await page.title();
    log.info(`Title of ${request.loadedUrl} is '${title}'`);

    // get page HTML
    const raw_html = await page.content();
    // log.info(`HTML of the page: ${html}`)

    // Save results as JSON to Database
    await addDataToDatabase({ title, url: request.loadedUrl, raw_html });

    const hrefs = await page.evaluate(() => {
        return Array.from(document.links).map(item => item.href);
    });
    for (const href of hrefs) {
        log.info(`Found link ${href}`);
    }

    const requests_array = [];
    for (const href of hrefs) {
        const req = new Request({
            url: href,
            userData: request.userData,
        });
        requests_array.push(req);
    }
    // console.log(requests_array);
    const result = await crawler.addRequests(requests_array);
    await result.waitForAllRequestsToBeAdded;
},
And addDataToDatabase just calls Dataset.pushData, so no changes there.
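In other words, roughly this (a sketch, assuming the helper only forwards to Crawlee's Dataset.pushData and nothing else):

import { Dataset } from 'crawlee';

// Sketch: addDataToDatabase as a thin wrapper over Dataset.pushData,
// matching the fields used in the requestHandler above.
const addDataToDatabase = async (item: { title: string; url?: string; raw_html: string }) => {
    await Dataset.pushData(item);
};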
NeoNomade
NeoNomade•2y ago
const result = await crawler.addRequests(requests_array);
await result.waitForAllRequestsToBeAdded;
This would normally be:
const result = await crawler.addRequests(requests_array, {waitForAllRequestsToBeAdded: true})
But addRequests adds requests in batches of 1k; in your case you have 100, so the option doesn't do anything and you can remove it completely.
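Also, if the whole point of the manual Request loop is just to carry userData onto each discovered link, enqueueLinks can do that by itself (a sketch using its documented userData option):

// Sketch: let enqueueLinks attach the current request's userData to every
// link it discovers, instead of building Request objects by hand.
await enqueueLinks({
    strategy: 'all',
    userData: request.userData,
});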
genetic-orange
genetic-orangeOP•2y ago
Not sure why this error is happening.
NeoNomade
NeoNomade•2y ago
Try what I've written above and we'll take it step by step.
genetic-orange
genetic-orangeOP•2y ago
When I try to scrape another link, it gets scraped fine (www.100ms.live/docs and some other similar pages work fine), but on helpdesk.egnyte.com it fails.
NeoNomade
NeoNomade•2y ago
it doesn't have anything to do with the webpage.
genetic-orange
genetic-orangeOP•2y ago
The 100ms page has only 40-something links in total and it works fine for that but not for this, so could it have something to do with the size of the request queue?
NeoNomade
NeoNomade•2y ago
absolutely nothing.
genetic-orange
genetic-orangeOP•2y ago
I wrote
const result = await crawler.addRequests(requests_array, {waitForAllRequestsToBeAdded: true})
and still get the same error.
NeoNomade
NeoNomade•2y ago
I've made spiders with millions of URLs in the requestQueue. If you give me 15 minutes, I'll run the same code on my machine to see what is wrong.
genetic-orange
genetic-orangeOP•2y ago
Can you try scraping this website? A friend of mine tried to scrape it using the Apify API as well, and he also failed with similar errors.
genetic-orange
genetic-orangeOP•2y ago
This file has the code that I am running.
genetic-orange
genetic-orangeOP•2y ago
@NeoNomade hey, any updates on this error? Did it work on your PC? The error originates from the addRequests call; without it the code works. Any update on why that might be?
