Firecrawl · 14mo ago
Emma

Unable to crawl: https://consaludmental.org/tag/fedeafes/

25 Replies
Adobe.Flash · 14mo ago
Looking right now. You need to set allowBackwardLinks to true, since the URLs on that page are not children of the original (looking at the URL hierarchy). Setting that works fine.
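For anyone landing here later, a minimal sketch of that /crawl call with the JS SDK (assuming the v1 @mendable/firecrawl-js API; option names may differ in other versions):

```ts
import FirecrawlApp from "@mendable/firecrawl-js";

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

// allowBackwardLinks lets the crawler follow links that are not
// path-children of the start URL (e.g. /tag/fedeafes/ -> /2024/some-post/).
const crawlResult = await app.crawlUrl("https://consaludmental.org/tag/fedeafes/", {
  limit: 50,
  allowBackwardLinks: true,
});

console.log(crawlResult);
```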
Adobe.Flash · 14mo ago
[screenshots]
Emma (OP) · 14mo ago
Sorry, I meant /scrape, not /crawl. I just want to scrape this single page. There is no allowBackwardLinks option there.
Adobe.Flash · 14mo ago
Oh, I saw the title said unable to /crawl. Let me see.
Adobe.Flash · 14mo ago
Seems to be working, can you try again?
[screenshot]
Adobe.Flash · 14mo ago
Try setting waitFor to 5000 if not.
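For reference, a hedged sketch of that /scrape call with waitFor via the JS SDK (again assuming v1 option names):

```ts
import FirecrawlApp from "@mendable/firecrawl-js";

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

// waitFor (in ms) gives client-rendered content time to appear
// before the page is captured.
const page = await app.scrapeUrl("https://consaludmental.org/tag/fedeafes/", {
  formats: ["markdown"],
  waitFor: 5000,
});

console.log(page);
```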
Emma (OP) · 14mo ago
Hmm, that's weird. I set waitFor to 5000. Let me try with the js-sdk.
[screenshot]
Adobe.Flash · 14mo ago
That's super odd, I just tested 5 times and they all show up. Where are you located (region-wise)?
Emma (OP) · 14mo ago
Bay Area
Adobe.Flash · 14mo ago
hm
Emma (OP) · 14mo ago
I found the reason now. It's the "main content" checkbox causing the problem.
[screenshot]
Did you check this box? It seems the scraper incorrectly removed the main content as if it were a footer or header.
Adobe.Flash · 14mo ago
Hmm, interesting. Let me check the website's HTML structure and see if there is anything that could have triggered it to remove the main body. My guess is that they have an overlay/cookie/nav div as the parent of their main content, which is pretty bad HTML semantics-wise.
Emma (OP) · 14mo ago
Okay. This issue is not that important now, because we can just disable the onlyMainContent option. That option doesn't seem reliable.
Adobe.Flash · 14mo ago
It is not a problem with the option, I can assure you of that, as we have never had an issue with it. It is a lot more likely that the page has an invalid overlay over it, which triggers the removal of parent tags that contain most of the content.
Emma (OP) · 14mo ago
I assumed that onlyMainContent was there to remove <header> and <footer> elements, but it does more than that. You really cannot control what websites your clients give you, and what the client wants to extract is only the visible part of the website... But anyway, we can close this ticket now, as we have a workaround. Thanks for investigating!
Adobe.Flash · 14mo ago
Sounds good, I will keep investigating, and you are right! Hard to control that! np!
Emma (OP) · 14mo ago
For example, in the content of this ticket: https://discord.com/channels/1226707384710332458/1281649719168339968/1281649719168339968 you will see that your scraper only extracts the cookie overlay but misses the main content. I personally think that making too many assumptions is worse than making none at all; otherwise you will break some of the important cases.
Adobe.Flash · 14mo ago
Yeah, that's fair. We are investigating that one, as it should have 100% grabbed the content even if it grabs the cookie too. Btw, I found the issue with this one: we remove elements matching .tag and #tag by default under onlyMainContent, and that website has .tag in its body class name. That's quite uncommon, but also fair, as the web is full of weird stuff haha. I just PR'd a removal of that selector from the pruning.
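To make the failure mode concrete (this is not Firecrawl's actual code, just an illustrative sketch with cheerio):

```ts
import * as cheerio from "cheerio";

// A page whose <body> happens to carry the class "tag".
const html = `<html><body class="tag"><main>Article text</main></body></html>`;
const $ = cheerio.load(html);

// A generic "drop tag-cloud widgets" rule like this also matches
// <body class="tag">, so it detaches the entire page body.
$(".tag, #tag").remove();

console.log($("main").length); // 0 — the main content is gone
```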
Emma (OP) · 14mo ago
Not dropping anything at all might be easier for you and covers the broadest case for the /scrape API endpoint. As users of your endpoint, we don't mind doing the pruning ourselves; the reverse is impossible, though, since we cannot add the dropped content back.
Adobe.Flash · 14mo ago
Funny enough, we used to not prune by default in v0, btw. We then changed that in v1 due to customer requests and feedback.
Emma (OP) · 14mo ago
Maybe add an option to skip pruning, like a noPruning flag. The customer feedback might assume that you can prune the HTML for that particular case, or that you can think of every possibility to handle the pruning in the "intended" way. But the internet is full of weird pages, and you cannot assume that the assumptions you made cover all the cases. It's always good practice to leave an escape hatch that does not do any modification at all; assumptions might fail for some corner cases.
Adobe.Flash · 14mo ago
You can set onlyMainContent to false. As the default is true, setting it to false will prevent the pruning. Agreed with the assumptions point though, and I will pass the feedback to the team 🙂
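As a sketch, that workaround with the JS SDK (again assuming v1 option names):

```ts
import FirecrawlApp from "@mendable/firecrawl-js";

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

// onlyMainContent defaults to true; setting it to false skips the
// header/footer/nav pruning and returns the page as-is.
const page = await app.scrapeUrl("https://consaludmental.org/tag/fedeafes/", {
  formats: ["markdown"],
  onlyMainContent: false,
});
```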
