application/octet-stream in Cheerio

I'm trying to scrape a second page in a working scraper, but this page returns the response as "application/octet-stream". Is there something I could do to fix this, or should I swap to Puppeteer/Playwright? It seems like that would make little difference, since the page is fully static. Here's the error message:
ERROR CheerioCrawler: Request failed and reached maximum retries. Error: Resource http://127.0.0.1/website.web/part_to_scrape served Content-Type application/octet-stream, but only text/html, text/xml, application/xhtml+xml, application/xml, application/json are allowed. Skipping resource.
Thank you so much
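The restriction in the error comes from CheerioCrawler's default list of allowed content types. A minimal sketch, assuming Crawlee's CheerioCrawler, of widening that list with its additionalMimeTypes option (the URL is taken from the error above):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Accept responses the server labels application/octet-stream,
    // in addition to the default HTML/XML/JSON content types.
    additionalMimeTypes: ['application/octet-stream'],
    async requestHandler({ request, body, contentType }) {
        // body holds the raw payload; contentType shows what the server sent.
        console.log(request.url, contentType.type, body.length);
    },
});

await crawler.run(['http://127.0.0.1/website.web/part_to_scrape']);
```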
NeoNomade
NeoNomade2y ago
Change the headers to match the request in your browser, especially the content type if it's application/json.
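A minimal sketch of that suggestion, assuming Crawlee's CheerioCrawler; the Accept header value is an example, and the URL is taken from the error above:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, json }) {
        // json is populated when the response is served as application/json.
        console.log(request.url, json);
    },
});

// Headers can be overridden per request to mirror what the browser sends.
await crawler.run([
    {
        url: 'http://127.0.0.1/website.web/part_to_scrape',
        headers: { Accept: 'application/json' },
    },
]);
```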
Pepa J
Pepa J2y ago
Hello @Lamp, can you share a link to such a website? Generally, application/octet-stream is used when the page provides some data to download.
rival-black
rival-blackOP2y ago
Well, actually it's a website that I downloaded via wget, so I could test my stuff locally before running it against the real site. Now that you mention it, I noticed that looking at the network requests I get text/html. No idea why wget changes it. Thank you for the help, I'll look for another way to download the site.
Pepa J
Pepa J2y ago
So you are browsing/scraping the website from your filesystem?
rival-black
rival-blackOP2y ago
Yep, http://127.0.0.1/. I usually follow this process of first downloading a portion and then launching against the actual site. With tools like HTTrack or just wget it generally works! :3
Pepa J
Pepa J2y ago
How do you serve the content? I don't think this is wget-related; Content-Type is determined by the webserver. Does the page have a proper extension like .html?
rival-black
rival-blackOP2y ago
Nope, the website serves it as text/html
rival-black
rival-blackOP2y ago
Maybe wget just started digging into some other stuff. I probably just have to refine the options.
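One such refinement, sketched here with a placeholder URL: wget's --adjust-extension flag appends .html to pages served as text/html, so a local webserver can later infer the right Content-Type from the file name:

```sh
# Mirror the site, rewriting links and adding .html extensions where needed.
wget --mirror --page-requisites --convert-links --adjust-extension http://example.com/
```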
Pepa J
Pepa J2y ago
By website, do you mean the original one or your local one? The content type is not saved in the file; it is added to the HTTP response by the webserver if it is not specified at the application level.
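To illustrate that point, a minimal sketch (the ./mirror directory and port are hypothetical) of how a typical static file server derives Content-Type from the file extension and falls back to application/octet-stream when there is none:

```ts
import { createServer } from 'node:http';
import { readFile } from 'node:fs/promises';
import { extname } from 'node:path';

// Extensions the server recognizes; anything else (including the
// extensionless files wget can produce) falls back to octet-stream.
const TYPES: Record<string, string> = {
    '.html': 'text/html',
    '.xml': 'application/xml',
    '.json': 'application/json',
};

createServer(async (req, res) => {
    const path = `./mirror${req.url}`; // hypothetical local mirror root
    try {
        const body = await readFile(path);
        res.setHeader('Content-Type', TYPES[extname(path)] ?? 'application/octet-stream');
        res.end(body);
    } catch {
        res.statusCode = 404;
        res.end('Not found');
    }
}).listen(8080);
```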
rival-black
rival-blackOP2y ago
This was the original one, let me check the 127.0.0.1 one.
Pepa J
Pepa J2y ago
But yeah, tools like HTTrack may be smart enough to save everything with an .html filename extension, so in the end that might solve your issue.
rival-black
rival-blackOP2y ago
No content type on the local one.
rival-black
rival-blackOP2y ago
Yeah, generally that works perfectly. Though only HTTrack could get past this website's restriction.
rival-black
rival-blackOP2y ago
This one is another website, one that I could get with HTTrack.
