robust-apricot
How can I make Playwright browser not skip page if a selector for textContent is not matched
I'm trying to crawl some websites. I know the selector I have should work for some pages, and I want to fall back to `body` textContent when the selector doesn't apply. It looks to me like it's actually just failing the whole request here.
Any other options I should pass to the Crawler or to the handler?
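Roughly what I'm trying, as a minimal sketch (`.article-body` stands in for my real selector):
```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        let text: string | null;
        try {
            // Short timeout so a missing selector fails fast instead of
            // waiting out the default 30s and failing the whole request.
            text = await page.textContent('.article-body', { timeout: 5_000 });
        } catch {
            // Selector not matched on this page: fall back to the whole body.
            text = await page.textContent('body');
        }
    },
});
```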
23 Replies
Can you share the error message?
robust-apricotOP•3y ago
👋 yes!
robust-apricotOP•3y ago
[error screenshot attached]
robust-apricotOP•3y ago
Separately, I'm noticing it looks like it doesn't crawl all the pages on the domain; I'm curious what I might be doing wrong there too
or if it's just erroring out on some and that's what's causing the issue
If you're up for hopping on a call Honza, I'd be happy to compensate you for your time to get some of this debugged 🙂
from the error it looks like it did not find `body`, as there is a default 30000ms timeout, but there should be a line of code in the error to be sure
I have never used page.textContent
I am mostly using this (rough sketch below): https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#parseWithCheerio
as for crawling all pages I would need to see the whole code; generally, if you enqueue a request then it should be processed sooner or later
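A rough sketch of what I mean (the handler shape is assumed):
```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ parseWithCheerio }) {
        // Parses the rendered HTML with Cheerio; a selector that matches
        // nothing just yields an empty result instead of a timeout error.
        const $ = await parseWithCheerio();
        const text = $('body').text();
    },
});
```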
I am on the train now so cannot call
robust-apricotOP•3y ago
oh that's interesting, just grab the HTML and parse it with Cheerio
ok let me try that
yeah, I see it error out on like 3 pages, and then it stops
wondering if there is some other setting, like maxRequests or something to that effect
but I haven't found anything in the docs that seemed relevant
just made the cheerio change and re-running now
looking good 🙌
thank you!
hmm it definitely still stopped
i.e. didn't get all links
full code is:
there is this https://crawlee.dev/docs/next/introduction/adding-urls#limit-your-crawls-with-maxrequestspercrawl
but if it is not set, it should not limit anything
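for reference, this is roughly how it would look if you did set it (the value is made up):
```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: router, // your existing router
    // If set, the crawl stops once roughly this many requests have been
    // processed; leave it out entirely for no cap.
    maxRequestsPerCrawl: 100,
});
```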
robust-apricotOP•3y ago
```ts
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info('enqueueing new URLs');
    await enqueueLinks({
        strategy: EnqueueStrategy.SameDomain,
        label: 'detail',
    });
});
```
robust-apricotOP•3y ago
seems to stop at 30 pages, but there's definitely more than that at: https://docs.snyk.io/
so it looks like it enqueues just those 30 links
you can log what links it enqueued
or you can try to enqueue by hand: just get the links you want and enqueue them, that way you will have more control
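something like this, as a sketch (the URL filter is just an example):
```ts
router.addDefaultHandler(async ({ page, crawler, log }) => {
    // Collect hrefs ourselves so we can log exactly what gets enqueued.
    const links = await page.$$eval('a[href]', (anchors) =>
        anchors.map((a) => (a as HTMLAnchorElement).href),
    );
    log.info(`found ${links.length} links`);
    await crawler.addRequests(links.filter((url) => url.startsWith('https://docs.snyk.io')));
});
```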
robust-apricotOP•3y ago
interesting, any idea why it only enqueues those 30?
robust-apricotOP•3y ago
my sense from the docs was that it should include all the links on the page if I do something like:
`strategy: EnqueueStrategy.SameDomain`
there are 35 links on the page as far as I can see
robust-apricotOP•3y ago
but does it not do it recursively?
i.e. does it not crawl beyond the first level of depth?
I assume it would continue spidering on to the next page
ah I see
nope, as far as I know you would need to implement that logic yourself
robust-apricotOP•3y ago
the `detail` handler is taking precedence, and then I need to enqueue there too
yeah, you can put the same code there
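i.e. roughly this in the detail handler (a sketch; the extraction part is omitted):
```ts
import { EnqueueStrategy } from 'crawlee';

router.addHandler('detail', async ({ request, enqueueLinks, log }) => {
    log.info(`scraping ${request.loadedUrl}`);
    // ...extract the page content here...
    // Enqueue from every page so the crawl keeps spidering deeper.
    await enqueueLinks({
        strategy: EnqueueStrategy.SameDomain,
        label: 'detail',
    });
});
```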
robust-apricotOP•3y ago
trying now
thanks again for your help HonzaS
I'd love to send you a token of my appreciation for your help if you send me a DM or you have a link
that worked by the way, the problem was I wasn't enqueuing in the detail handler
thanks again!
yeah, if you want to go through all the pages you need to enqueue on all pages to find all the links
no problem I have nothing to do on the train anyway 😄
robust-apricotOP•3y ago
it's late where I am (west coast USA), but I'd love some help tomorrow trying to deploy this as a service to the platform, which I haven't done before. Let me know if you think you'd have some time (I'd compensate you)
I'm still running it locally 🙂
your suggestion def fixed my issues!
I am using this https://www.npmjs.com/package/apify-cli
I am developing locally mostly too and then just run `apify push`. The program is then pushed to the platform and built there.
Let me know if you run into any issues.