Crawlee & Apify•3y ago
robust-apricot

How can I make Playwright browser not skip page if a selector for textContent is not matched

I'm trying to crawl some websites. I know the selector I have should work for some pages, and I want to fall back to body textContent when the selector doesn't apply. It looks to me like it's actually just failing the whole request here. Any other options I should pass to the Crawler or to the handler?
const crawler = new PlaywrightCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: router,
});
router.addHandler('detail', async ({ request, page, log }) => {
    log.info(`processing detail page ${request.loadedUrl}`);
    const title = await page.title();
    const html = await page.content();
    let text;
    try {
        // Throws if the selector never matches within the timeout
        text = await page.textContent('.css-h7put4', { timeout: 5000 });
    } catch (e) {
        console.log('Error getting text content', e);
    }
    if (!text) {
        // Fall back to the full page text
        text = await page.textContent('body');
    }
});
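For reference, a non-throwing variant of the same fallback, sketched with Playwright's locator API (same selector as above):

// locator.count() resolves immediately and never throws, unlike
// page.textContent() with a timeout
const matches = page.locator('.css-h7put4');
text = (await matches.count()) > 0
    ? await matches.first().textContent()
    : await page.textContent('body');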
23 Replies
HonzaS
HonzaS•3y ago
Can you share the error message?
robust-apricot
robust-apricotOP•3y ago
👋 yes!
robust-apricot
robust-apricotOP•3y ago
[attached: screenshot of the error message]
robust-apricot
robust-apricotOP•3y ago
Separately, I'm noticing that it doesn't seem to crawl all the pages on the domain. I'm curious what I might be doing wrong there too, or if it's just erroring out on some pages and that's what's causing the issue. If you're up for hopping on a call, Honza, I'd be happy to compensate you for your time to get some of this debugged 🙂
HonzaS
HonzaS•3y ago
From the error it looks like it did not find body, as there is a default 30000ms timeout, but there should be a line of code in the error to be sure. I have never used page.textContent; I am mostly using this: https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#parseWithCheerio
As for crawling all pages, I would need to see the whole code. Generally, if you enqueue the request, then it should be processed sooner or later. I am on a train now, so I cannot call.
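A minimal sketch of that approach, assuming a Crawlee v3 PlaywrightCrawler (handler inlined here for brevity; selector is the OP's):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, log, parseWithCheerio }) => {
        log.info(`processing ${request.loadedUrl}`);
        // Parse the rendered HTML with Cheerio; .text() returns an empty
        // string instead of throwing when the selector matches nothing.
        const $ = await parseWithCheerio();
        const text = $('.css-h7put4').text() || $('body').text();
        log.info(`extracted ${text.length} characters`);
    },
});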
robust-apricot
robust-apricotOP•3y ago
Oh, that's interesting, just grab the HTML and parse it with Cheerio. OK, let me try that. Yeah, I see it error out on like 3 pages, and then it stops. Wondering if there is some other setting like maxRequests, or something to that effect, but I haven't found anything in the docs that seemed relevant.
Just made the Cheerio change and am re-running now. Looking good 🙌 thank you!
Hmm, it definitely still stopped, i.e. didn't get all the links. Full code is:
HonzaS
HonzaS•3y ago
There is this: https://crawlee.dev/docs/next/introduction/adding-urls#limit-your-crawls-with-maxrequestspercrawl but if it is not set, then it should not limit anything.
robust-apricot
robust-apricotOP•3y ago
const startUrls = ['https://docs.snyk.io/'];

const crawler = new PlaywrightCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: router,
    maxRequestsPerCrawl: 1000,
});

await crawler.run(startUrls);
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info('enqueueing new URLs');
    await enqueueLinks({
        strategy: EnqueueStrategy.SameDomain,
        label: 'detail',
    });
});
router.addHandler('detail', async ({ request, page, log, parseWithCheerio }) => {
    log.info(`processing detail page ${request.loadedUrl}`);
    const title = await page.title();
    const html = await page.content();
    let text;
    // try {
    //     text = await page.textContent('.css-h7put4', { timeout: 5000 });
    // } catch (e) {
    //     console.log('Error getting text content', e);
    // }
    // if (!text) {
    // }
    const $ = await parseWithCheerio();
    text = $('main').text();
    if (!text) {
        text = $('body').text();
    }

    const dataset = await Dataset.open(applicationId);

    try {
        // See if a document with this URL already exists
        const existingDocument = // GET THE DOC FROM MY API

        console.log({
            existingDocument: existingDocument.data,
        });

        if (existingDocument.data.documents.length > 0) {
            log.info(`Document with URL ${request.loadedUrl} already exists in application ${applicationId}`);
            await dataset.pushData({
                url: request.loadedUrl,
                title,
                existingDocument: existingDocument.data.documents[0],
            });
            return;
        } else {
            // INSERT DOC VIA MY API
            const insertedDocument = // the result

            await dataset.pushData({
                url: request.loadedUrl,
                title,
                insertedDocument,
            });
            return;
        }
    } catch (e: any) {
        console.log('error inserting document');
        console.log({
            res: e.response.data,
        });
    }
});
robust-apricot
robust-apricotOP•3y ago
Seems to stop at 30 pages, but there are definitely more than that at https://docs.snyk.io/
HonzaS
HonzaS•3y ago
So it looks like it enqueues just those 30 links. You can log which links it enqueued, or you can try to enqueue them by hand: just get the links you want and enqueue them yourself. That way you will have more control.
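A rough sketch of both options (the 'a[href]' selector in the manual variant is illustrative, not from the thread):

router.addDefaultHandler(async ({ page, enqueueLinks, log }) => {
    // Option 1: enqueueLinks reports what it queued, so log it
    const { processedRequests } = await enqueueLinks({
        strategy: EnqueueStrategy.SameDomain,
        label: 'detail',
    });
    log.info(`enqueued ${processedRequests.length} links`);

    // Option 2: collect hrefs by hand for full control over what gets queued
    const hrefs = await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.href));
    log.info(`found ${hrefs.length} raw links on the page`);
});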
robust-apricot
robust-apricotOP•3y ago
Interesting, any idea why it only enqueues those 30?
robust-apricot
robust-apricotOP•3y ago
My sense from the docs was that it should include all the links on the page if I do something like strategy: EnqueueStrategy.SameDomain
HonzaS
HonzaS•3y ago
There are 35 links on the page, as far as I can see.
robust-apricot
robust-apricotOP•3y ago
But does it not do it recursively? I.e., does it not crawl beyond the first level of depth? I assumed it would continue spidering on to the next page. Ah, I see.
HonzaS
HonzaS•3y ago
Nope, as far as I know; you would need to implement that logic.
robust-apricot
robust-apricotOP•3y ago
The detail handler is taking precedence, and then I need to enqueue there too.
HonzaS
HonzaS•3y ago
Yeah, you can put the same code there.
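I.e., a sketch of the fix the thread converges on: the detail handler calls enqueueLinks too, so the crawl keeps spidering beyond the first level:

router.addHandler('detail', async ({ request, log, enqueueLinks, parseWithCheerio }) => {
    log.info(`processing detail page ${request.loadedUrl}`);

    // Keep discovering links from every page, not just the start page
    await enqueueLinks({
        strategy: EnqueueStrategy.SameDomain,
        label: 'detail',
    });

    const $ = await parseWithCheerio();
    const text = $('main').text() || $('body').text();
    // ... store the text as in the full handler above
});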
robust-apricot
robust-apricotOP•3y ago
Trying now. Thanks again for your help, HonzaS. I'd love to send you a token of my appreciation if you send me a DM, or if you have a link.
That worked, by the way! The problem was I wasn't enqueueing in the detail handler. Thanks again!
HonzaS
HonzaS•3y ago
Yeah, if you want to go through all the pages, you need to enqueue on all pages to find all the links. No problem, I have nothing to do on the train anyway 😄
robust-apricot
robust-apricotOP•3y ago
It's late where I am (west coast USA), but I'd love some help tomorrow trying to deploy this as a service to the platform, which I haven't done before. Let me know if you think you'd have some time (I'd compensate you). I'm still running it locally 🙂 Your suggestion definitely fixed my issues!
HonzaS
HonzaS•3y ago
I am using this: https://www.npmjs.com/package/apify-cli
I am mostly developing locally too, and then just run the command apify push. The program is then pushed to the platform and built there. Let me know if you run into any issues.
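Roughly, the workflow with apify-cli looks like this (apify init is only needed when the project wasn't created with the CLI):

npm install -g apify-cli    # install the CLI
apify login                 # authenticate with your Apify account
apify init                  # add platform config to an existing project, if needed
apify push                  # upload the project and build it on the platform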
