application/octet-stream in Cheerio

I'm trying to scrape a second page in a working scraper, but this page returns the response as "application/octet-stream". Is there something I could do to fix this, or should I swap to Puppeteer/Playwright? It seems like that would make little difference, since the page is fully static. Here's the error message:
ERROR CheerioCrawler: Request failed and reached maximum retries. Error: Resource http://127.0.0.1/website.web/part_to_scrape served Content-Type application/octet-stream, but only text/html, text/xml, application/xhtml+xml, application/xml, application/json are allowed. Skipping resource.
Thank you so much
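The restriction in the error comes from CheerioCrawler's default list of allowed content types. A minimal sketch, assuming Crawlee's CheerioCrawler, of widening that list with its additionalMimeTypes option (the URL is taken from the error above):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Accept responses the server labels application/octet-stream,
    // in addition to the default HTML/XML/JSON content types.
    additionalMimeTypes: ['application/octet-stream'],
    async requestHandler({ request, body, contentType }) {
        // body holds the raw payload; contentType shows what the server sent.
        console.log(request.url, contentType.type, body.length);
    },
});

await crawler.run(['http://127.0.0.1/website.web/part_to_scrape']);
```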
NeoNomade
NeoNomade2y ago
Change the headers to match the request in your browser, especially the content type if it's application/json.
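A minimal sketch of that suggestion, assuming Crawlee's CheerioCrawler; the Accept header value is an example, and the URL is taken from the error above:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, json }) {
        // json is populated when the response is served as application/json.
        console.log(request.url, json);
    },
});

// Headers can be overridden per request to mirror what the browser sends.
await crawler.run([
    {
        url: 'http://127.0.0.1/website.web/part_to_scrape',
        headers: { Accept: 'application/json' },
    },
]);
```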
Pepa J
Pepa J2y ago
Hello @Lamp, can you share a link to such a website? Generally, application/octet-stream is used when the page provides some data to download.
rival-black
rival-blackOP2y ago
Well, actually it's a website that I downloaded via wget, so I could test my stuff locally before running it against the real site. Now that you mention it, I noticed that looking at the network requests I get text/html. No idea why wget changes it. Thank you for the help, I'll look for another way to download the site.
Pepa J
Pepa J2y ago
So you are browsing/scraping the website from your filesystem?
rival-black
rival-blackOP2y ago
Yep, http://127.0.0.1/. I usually follow this process of first downloading a portion and then launching against the actual site. With tools like HTTrack or just wget it generally works! :3
Pepa J
Pepa J2y ago
How do you serve the content? I don't think this is wget-related; Content-Type is determined by the webserver. Does the page have a proper extension like .html?
rival-black
rival-blackOP2y ago
Nope, the website serves it as text/html
rival-black
rival-blackOP2y ago
Maybe wget just started digging into some other stuff. I probably just have to refine the options.
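One such refinement, sketched here with a placeholder URL: wget's --adjust-extension flag appends .html to pages served as text/html, so a local webserver can later infer the right Content-Type from the file name:

```sh
# Mirror the site, rewriting links and adding .html extensions where needed.
wget --mirror --page-requisites --convert-links --adjust-extension http://example.com/
```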
Pepa J
Pepa J2y ago
By website, do you mean the original one or your local one? The content type is not saved in the file; it is added to the HTTP response by the webserver if it is not specified at the application level.
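To illustrate that point, a minimal sketch (the ./mirror directory and port are hypothetical) of how a typical static file server derives Content-Type from the file extension and falls back to application/octet-stream when there is none:

```ts
import { createServer } from 'node:http';
import { readFile } from 'node:fs/promises';
import { extname } from 'node:path';

// Extensions the server recognizes; anything else (including the
// extensionless files wget can produce) falls back to octet-stream.
const TYPES: Record<string, string> = {
    '.html': 'text/html',
    '.xml': 'application/xml',
    '.json': 'application/json',
};

createServer(async (req, res) => {
    const path = `./mirror${req.url}`; // hypothetical local mirror root
    try {
        const body = await readFile(path);
        res.setHeader('Content-Type', TYPES[extname(path)] ?? 'application/octet-stream');
        res.end(body);
    } catch {
        res.statusCode = 404;
        res.end('Not found');
    }
}).listen(8080);
```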
rival-black
rival-blackOP2y ago
This was the original one, let me check the 127.0.0.1 one.
Pepa J
Pepa J2y ago
But yeah, tools like HTTrack may be smart enough to save everything with an .html filename extension, so in the end that might solve your issue.
rival-black
rival-blackOP2y ago
No content type on the local one.
rival-black
rival-blackOP2y ago
Yeah, generally that works perfectly. Though only HTTrack could get past this website's restriction.
rival-black
rival-blackOP2y ago
This one is another website, one that I could get with HTTrack.
