Looking right now
You need to set allowBackwardLinks to true, since none of the URLs on that page are children of the original (looking at the URL hierarchy).
Setting that works fine
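For reference, a minimal sketch of what that crawl call could look like with the js-sdk (parameter names assume the v1 API; the API key, URL, and limit are placeholders):

```ts
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'fc-YOUR_API_KEY' });

// allowBackwardLinks lets the crawler follow links that are not
// children of the starting URL in the path hierarchy.
const crawlResult = await app.crawlUrl('https://example.com/some/page', {
  limit: 50,
  allowBackwardLinks: true,
});

console.log(crawlResult);
```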


Sorry, I meant /scrape, not /crawl. I just want to scrape this single page. There is no allowBackwardLinks option there
Oh, I saw from the title that you were unable to /crawl
let me see
Seems to be working, can you try again?

If not, try setting waitFor to 5000
Hmm, that's weird. I set waitFor to 5000. Let me try with the js-sdk
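For reference, a minimal sketch of that scrape call with the js-sdk (assuming the v1 scrapeUrl signature; the API key and URL are placeholders):

```ts
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'fc-YOUR_API_KEY' });

// waitFor gives the page up to 5000 ms to render before scraping,
// which helps with content injected by client-side JS.
const result = await app.scrapeUrl('https://example.com/page', {
  formats: ['markdown'],
  waitFor: 5000,
});

console.log(result);
```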

That's super odd, I just tested 5 times and they all show up. Where are you located (region-wise)?
Bay area
hm
I found the reason now. It's the main content checkbox causing the problem

Did you check this box?
It seems the scraper incorrectly removed the main content as if it were the footer or header
hmm
interesting
let me check the website's HTML structure
and see if there is anything that could have triggered it to remove the main body
my guess is that they have an overlay/cookie/nav div as the parent of their main content, which is pretty bad HTML semantics-wise.
Okay. This issue is not that important now because we can just disable the onlyMainContent option. That option doesn't seem reliable.
It is not a problem with the option, I can assure you of that
as we never had an issue with that
it's a lot more likely that the page has an invalid overlay over it, which triggers the removal of the parent tags that contain most of the content
I assumed that onlyMainContent just removes the <header> and <footer> elements, but it does more than that.
You really cannot control what websites your clients give you. What the client wants to extract is just the visible part of that website... But anyway, we can close this ticket now as we have a workaround.
Thanks for investigating!
Sounds good, I will keep investigating, and you are right! Hard to control that!
np!
For example, in the context of this ticket: https://discord.com/channels/1226707384710332458/1281649719168339968/1281649719168339968
You will see that your scraper only extracts the cookie overlay but misses the main content.
I personally think that making too many assumptions is worse than making none at all. Otherwise, you will over-prune some of the important cases.
Yeah, that's fair. We're investigating that one, as it should have 100% grabbed the content even if it grabs the cookie banner too
btw
Found out the issue with this one
We remove the elements
.tag #tag
by default as part of onlyMainContent, and that website has .tag
in its body classname, which is quite uncommon, but also fair as the web is full of weird stuff haha. I just PR'd a removal of that from the pruning.
Not dropping anything at all might be easier for you and covers the broadest case for the /scrape API endpoint. As users of your endpoint, we don't mind doing the pruning ourselves. However, the reverse is impossible: we cannot add the dropped content back.
Funny that we used to not prune by default in v0, btw. We then changed that in v1 due to customer requests and feedback.
Maybe add an option to skip pruning, like a noPruning flag
The customer feedback might assume that you can prune the HTML for that particular case, or assume that you can think of every possibility to handle the pruning in the "intended" way. But the internet is full of weird pages, and you cannot assume that the assumptions you made cover all the cases.
It's always good practice to leave a backdoor that does no modification at all. Assumptions might fail for some corner cases.
You can set onlyMainContent to false
As the default is true, that will prevent the pruning
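A minimal sketch of that workaround with the js-sdk (again assuming the v1 scrapeUrl signature; the API key and URL are placeholders):

```ts
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'fc-YOUR_API_KEY' });

// onlyMainContent defaults to true; setting it to false skips the
// header/footer/overlay pruning and returns the full page body.
const fullPage = await app.scrapeUrl('https://example.com/page', {
  formats: ['markdown'],
  onlyMainContent: false,
});

console.log(fullPage);
```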
Agreed with the assumptions thing though and will provide the feedback to the team 🙂