Looking right now
You need to set allowBackwardLinks to true, since none of the URLs on that page are children of the original (looking at the URL hierarchy).
Setting that works fine
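For reference, a minimal sketch of what that crawl call could look like with the js-sdk (parameter names assume the v1 API; the API key, URL, and limit are placeholders):

```ts
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'fc-YOUR_API_KEY' });

// allowBackwardLinks lets the crawler follow links that are not
// children of the starting URL in the path hierarchy.
const crawlResult = await app.crawlUrl('https://example.com/some/page', {
  limit: 50,
  allowBackwardLinks: true,
});

console.log(crawlResult);
```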


Sorry, I meant /scrape, not /crawl. I just want to scrape this single page. There is no allowBackwardLinks option there
Oh, I saw from the title that you were unable to /crawl
let me see
Seems to be working, can you try again?

If not, try setting waitFor to 5000
Hmm, that's weird. I set waitFor to 5000. Let me try with the js-sdk
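For reference, a minimal sketch of that scrape call with the js-sdk (assuming the v1 scrapeUrl signature; the API key and URL are placeholders):

```ts
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'fc-YOUR_API_KEY' });

// waitFor gives the page up to 5000 ms to render before scraping,
// which helps with content injected by client-side JS.
const result = await app.scrapeUrl('https://example.com/page', {
  formats: ['markdown'],
  waitFor: 5000,
});

console.log(result);
```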

That's super odd, I just tested 5 times and they all show up. Where are you located (region-wise)?
Bay area
hm
I found the reason now. It's the main content checkbox causing the problem

Did you check this box?
It seems the scraper incorrectly removed the main content as if it were the footer or header
hmm
interesting
let me check the website's HTML structure
and see if there is anything that could have triggered it to remove the main body
my guess is that they have an overlay/cookie/nav div as the parent of their main content, which is pretty bad HTML semantics-wise.
Okay. This issue is not that important now because we can just disable the onlyMainContent option. That option doesn't seem reliable.
It is not a problem with the option, I can assure you of that
as we never had an issue with that
it's a lot more likely that the page has an invalid overlay over it, which triggers the removal of the parent tags that contain most of the content
I assumed that onlyMainContent just removes the <header> and <footer> elements, but it does more than that.
You really cannot control what websites your clients give you. What the client wants to extract is just the visible part of that website... But anyway, we can close this ticket now as we have a workaround.
Thanks for investigating!
Sounds good, I will keep investigating, and you are right! Hard to control that!
np!
For example, in the context of this ticket: https://discord.com/channels/1226707384710332458/1281649719168339968/1281649719168339968
You will see that your scraper only extracts the cookie overlay but misses the main content.
I personally think that making too many assumptions is worse than making none at all. Otherwise, you will over-prune some of the important cases.
Yeah, that's fair. We're investigating that one, as it should have 100% grabbed the content even if it grabs the cookie banner too
btw
Found out the issue with this one
We remove the elements
.tag #tag
by default as part of onlyMainContent, and that website has .tag
in its body classname, which is quite uncommon, but also fair as the web is full of weird stuff haha. I just PR'd a removal of that from the pruning.
Not dropping anything at all might be easier for you and covers the broadest case for the /scrape API endpoint. As users of your endpoint, we don't mind doing the pruning ourselves. However, the reverse is impossible: we cannot add the dropped content back.
Funny that we used to not prune by default in v0, btw. We then changed that in v1 due to customer requests and feedback.
Maybe add an option to skip pruning, like a noPruning flag
The customer feedback might assume that you can prune the HTML for that particular case, or assume that you can think of every possibility to handle the pruning in the "intended" way. But the internet is full of weird pages, and you cannot assume that the assumptions you made cover all the cases.
It's always good practice to leave a backdoor that does no modification at all. Assumptions might fail for some corner cases.
You can set onlyMainContent to false
As the default is true, that will prevent the pruning
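A minimal sketch of that workaround with the js-sdk (again assuming the v1 scrapeUrl signature; the API key and URL are placeholders):

```ts
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'fc-YOUR_API_KEY' });

// onlyMainContent defaults to true; setting it to false skips the
// header/footer/overlay pruning and returns the full page body.
const fullPage = await app.scrapeUrl('https://example.com/page', {
  formats: ['markdown'],
  onlyMainContent: false,
});

console.log(fullPage);
```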
Agreed with the assumptions thing though and will provide the feedback to the team 🙂