Crawlee & Apify•3y ago
robust-apricot

How can I make Playwright browser not skip page if a selector for textContent is not matched

I'm trying to crawl some websites. I know the selector I have should work for some pages, and I want to fall back to body textContent when the selector doesn't apply. It looks to me like it's actually just failing the whole request here. Any other options I should pass to the Crawler or to the handler?
const crawler = new PlaywrightCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: router,
});
router.addHandler('detail', async ({ request, page, log }) => {
    log.info(`processing detail page ${request.loadedUrl}`);
    const title = await page.title();
    const html = await page.content();
    let text;
    try {
        // Throws if the selector never matches within the timeout
        text = await page.textContent('.css-h7put4', { timeout: 5000 });
    } catch (e) {
        console.log('Error getting text content', e);
    }
    if (!text) {
        // Fall back to the full page text
        text = await page.textContent('body');
    }
});
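For reference, a non-throwing variant of the same fallback, sketched with Playwright's locator API (same selector as above):

// locator.count() resolves immediately and never throws, unlike
// page.textContent() with a timeout
const matches = page.locator('.css-h7put4');
text = (await matches.count()) > 0
    ? await matches.first().textContent()
    : await page.textContent('body');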
23 Replies
HonzaS
HonzaS•3y ago
Can you share the error message?
robust-apricot
robust-apricotOP•3y ago
👋 yes!
robust-apricot
robust-apricotOP•3y ago
[attached: screenshot of the error message]
robust-apricot
robust-apricotOP•3y ago
Separately, I'm noticing that it doesn't seem to crawl all the pages on the domain. I'm curious what I might be doing wrong there too, or if it's just erroring out on some pages and that's what's causing the issue. If you're up for hopping on a call, Honza, I'd be happy to compensate you for your time to get some of this debugged 🙂
HonzaS
HonzaS•3y ago
From the error it looks like it did not find body, as there is a default 30000ms timeout, but there should be a line of code in the error to be sure. I have never used page.textContent; I am mostly using this: https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#parseWithCheerio
As for crawling all pages, I would need to see the whole code. Generally, if you enqueue the request, then it should be processed sooner or later. I am on a train now, so I cannot call.
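A minimal sketch of that approach, assuming a Crawlee v3 PlaywrightCrawler (handler inlined here for brevity; selector is the OP's):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, log, parseWithCheerio }) => {
        log.info(`processing ${request.loadedUrl}`);
        // Parse the rendered HTML with Cheerio; .text() returns an empty
        // string instead of throwing when the selector matches nothing.
        const $ = await parseWithCheerio();
        const text = $('.css-h7put4').text() || $('body').text();
        log.info(`extracted ${text.length} characters`);
    },
});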
robust-apricot
robust-apricotOP•3y ago
Oh, that's interesting, just grab the HTML and parse it with Cheerio. OK, let me try that. Yeah, I see it error out on like 3 pages, and then it stops. Wondering if there is some other setting like maxRequests, or something to that effect, but I haven't found anything in the docs that seemed relevant.
Just made the Cheerio change and am re-running now. Looking good 🙌 thank you!
Hmm, it definitely still stopped, i.e. didn't get all the links. Full code is:
HonzaS
HonzaS•3y ago
There is this: https://crawlee.dev/docs/next/introduction/adding-urls#limit-your-crawls-with-maxrequestspercrawl but if it is not set, then it should not limit anything.
robust-apricot
robust-apricotOP•3y ago
const startUrls = ['https://docs.snyk.io/'];

const crawler = new PlaywrightCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: router,
    maxRequestsPerCrawl: 1000,
});

await crawler.run(startUrls);
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info('enqueueing new URLs');
    await enqueueLinks({
        strategy: EnqueueStrategy.SameDomain,
        label: 'detail',
    });
});
router.addHandler('detail', async ({ request, page, log, parseWithCheerio }) => {
    log.info(`processing detail page ${request.loadedUrl}`);
    const title = await page.title();
    const html = await page.content();
    let text;
    // try {
    //     text = await page.textContent('.css-h7put4', { timeout: 5000 });
    // } catch (e) {
    //     console.log('Error getting text content', e);
    // }
    // if (!text) {
    // }
    const $ = await parseWithCheerio();
    text = $('main').text();
    if (!text) {
        text = $('body').text();
    }

    const dataset = await Dataset.open(applicationId);

    try {
        // See if a document with this URL already exists
        const existingDocument = // GET THE DOC FROM MY API

        console.log({
            existingDocument: existingDocument.data,
        });

        if (existingDocument.data.documents.length > 0) {
            log.info(`Document with URL ${request.loadedUrl} already exists in application ${applicationId}`);
            await dataset.pushData({
                url: request.loadedUrl,
                title,
                existingDocument: existingDocument.data.documents[0],
            });
            return;
        } else {
            // INSERT DOC VIA MY API
            const insertedDocument = // the result

            await dataset.pushData({
                url: request.loadedUrl,
                title,
                insertedDocument,
            });
            return;
        }
    } catch (e: any) {
        console.log('error inserting document');
        console.log({
            res: e.response.data,
        });
    }
});
robust-apricot
robust-apricotOP•3y ago
Seems to stop at 30 pages, but there are definitely more than that at https://docs.snyk.io/
HonzaS
HonzaS•3y ago
So it looks like it enqueues just those 30 links. You can log which links it enqueued, or you can try to enqueue them by hand: just get the links you want and enqueue them yourself. That way you will have more control.
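A rough sketch of both options (the 'a[href]' selector in the manual variant is illustrative, not from the thread):

router.addDefaultHandler(async ({ page, enqueueLinks, log }) => {
    // Option 1: enqueueLinks reports what it queued, so log it
    const { processedRequests } = await enqueueLinks({
        strategy: EnqueueStrategy.SameDomain,
        label: 'detail',
    });
    log.info(`enqueued ${processedRequests.length} links`);

    // Option 2: collect hrefs by hand for full control over what gets queued
    const hrefs = await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.href));
    log.info(`found ${hrefs.length} raw links on the page`);
});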
robust-apricot
robust-apricotOP•3y ago
Interesting, any idea why it only enqueues those 30?
robust-apricot
robust-apricotOP•3y ago
My sense from the docs was that it should include all the links on the page if I do something like strategy: EnqueueStrategy.SameDomain
HonzaS
HonzaS•3y ago
There are 35 links on the page, as far as I can see.
robust-apricot
robust-apricotOP•3y ago
But does it not do it recursively? I.e., does it not crawl beyond the first level of depth? I assumed it would continue spidering on to the next page. Ah, I see.
HonzaS
HonzaS•3y ago
Nope, as far as I know; you would need to implement that logic.
robust-apricot
robust-apricotOP•3y ago
The detail handler is taking precedence, and then I need to enqueue there too.
HonzaS
HonzaS•3y ago
Yeah, you can put the same code there.
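I.e., a sketch of the fix the thread converges on: the detail handler calls enqueueLinks too, so the crawl keeps spidering beyond the first level:

router.addHandler('detail', async ({ request, log, enqueueLinks, parseWithCheerio }) => {
    log.info(`processing detail page ${request.loadedUrl}`);

    // Keep discovering links from every page, not just the start page
    await enqueueLinks({
        strategy: EnqueueStrategy.SameDomain,
        label: 'detail',
    });

    const $ = await parseWithCheerio();
    const text = $('main').text() || $('body').text();
    // ... store the text as in the full handler above
});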
robust-apricot
robust-apricotOP•3y ago
Trying now. Thanks again for your help, HonzaS. I'd love to send you a token of my appreciation if you send me a DM, or if you have a link.
That worked, by the way! The problem was I wasn't enqueueing in the detail handler. Thanks again!
HonzaS
HonzaS•3y ago
Yeah, if you want to go through all the pages, you need to enqueue on all pages to find all the links. No problem, I have nothing to do on the train anyway 😄
robust-apricot
robust-apricotOP•3y ago
It's late where I am (west coast USA), but I'd love some help tomorrow trying to deploy this as a service to the platform, which I haven't done before. Let me know if you think you'd have some time (I'd compensate you). I'm still running it locally 🙂 Your suggestion definitely fixed my issues!
HonzaS
HonzaS•3y ago
I am using this: https://www.npmjs.com/package/apify-cli
I am mostly developing locally too, and then just run the command apify push. The program is then pushed to the platform and built there. Let me know if you run into any issues.
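Roughly, the workflow with apify-cli looks like this (apify init is only needed when the project wasn't created with the CLI):

npm install -g apify-cli    # install the CLI
apify login                 # authenticate with your Apify account
apify init                  # add platform config to an existing project, if needed
apify push                  # upload the project and build it on the platform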
