robust-apricot
Issues with charset
Hi everyone,
I am new to Apify. I love how useful it is, so I decided to learn it by solving a real-life problem: scraping data from an official government website.
I am using Cheerio Scraper to get data from a list (link below) containing Czech text. My problem is that I cannot get the data in the correct encoding: characters from the Czech alphabet come out garbled.
It scrapes this: "Apoďż˝tolskďż˝ cďż˝rkev, 1. sbor Praha" (with windows-1250) or this: Apo�tolsk� c�rkev, 1. sbor Praha (with utf8) instead of this: Apoštolská církev, 1. sbor Praha
I have tried forcing different response encodings (utf8, windows-1250) and sending different headers, but without success.
After many hours I feel like I am getting nowhere. Do you have any suggestions?
Start URL:
https://www-cns.mkcr.cz/cns_internet/CNS/Seznam_cpo.aspx?id_subj=148&str_zpet=Seznam_CPO.aspx
Glob pattern: https://www-cns.mkcr.cz/cns_internet/CNS/Detail_cpo.aspx?id_subj=*&str_zpet=Seznam_CPO.aspx
Link selector: td > a
Code:
BTW: I am using a proxy located in CZ to reach the site.
9 Replies
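The two garbled outputs quoted in the question can be reproduced locally, which may help narrow down where the decoding goes wrong. The sketch below assumes the page is served as windows-1250 (the thread does not confirm this) and that Node.js was built with full ICU so that `TextDecoder` knows that encoding. The byte values are hand-written for illustration.

```javascript
// Windows-1250 bytes for "Apoštolská církev" (š = 0x9A, á = 0xE1, í = 0xED).
// Assumption: this is how the server encodes the text; not confirmed in the thread.
const cp1250Bytes = Uint8Array.from([
  0x41, 0x70, 0x6f, 0x9a, 0x74, 0x6f, 0x6c, 0x73, 0x6b, 0xe1, // "Apoštolská"
  0x20,
  0x63, 0xed, 0x72, 0x6b, 0x65, 0x76, // "církev"
]);

// 1. Decoding with the matching charset yields the expected text.
const correct = new TextDecoder('windows-1250').decode(cp1250Bytes);

// 2. Decoding the same bytes as UTF-8 turns each accented letter
//    into U+FFFD, the replacement character.
const wrongUtf8 = new TextDecoder('utf-8').decode(cp1250Bytes);

// 3. Re-encoding that damaged string as UTF-8 and decoding it as windows-1250
//    turns each U+FFFD (bytes EF BF BD) into a three-character sequence.
const doubleDecoded = new TextDecoder('windows-1250')
  .decode(new TextEncoder().encode(wrongUtf8));

console.log(correct);
console.log(wrongUtf8);
console.log(doubleDecoded);
```

If the assumption holds, step 3 matching the "windows-1250" output would suggest that the text had already been decoded as UTF-8 before the forced encoding was applied. At that point the original bytes are lost, so a later re-decode cannot recover them.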
When you add the request, try setting headers:
{ url, headers: { 'accept-encoding': ENC } }
See details: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding
In other words, the crawler gets whatever the server provides; if the default encoding is wrong, just change it.
robust-apricotOP•3y ago
Thank You for Your response.
No matter what header I give it, the result turns out wrong.
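For context on the headers discussed here: `Accept-Encoding` negotiates content coding (compression such as gzip or br), while the character set is normally declared in the `charset` parameter of the response `Content-Type` header. A minimal sketch of reading that parameter and decoding raw bytes with it follows. The helper names `charsetFromContentType` and `decodeBody` are hypothetical and not part of Apify or Crawlee, and the sample bytes are made up for illustration.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type value.
// Returns the charset in lower case, or null when none is declared.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: decode raw body bytes using the declared charset,
// falling back to a caller-supplied charset when the header has none.
function decodeBody(bytes, contentType, fallbackCharset = 'utf-8') {
  const charset = charsetFromContentType(contentType) || fallbackCharset;
  return new TextDecoder(charset).decode(bytes);
}

// "církev" encoded as windows-1250 (í = 0xED); sample data for illustration.
const body = Uint8Array.from([0x63, 0xed, 0x72, 0x6b, 0x65, 0x76]);

// Charset declared by the server.
console.log(decodeBody(body, 'text/html; charset=windows-1250'));
// No charset declared, so the fallback decides the result.
console.log(decodeBody(body, 'text/html', 'windows-1250'));
```

When the server omits the charset, the fallback determines the result, which is one plausible reason why the same page could decode differently in two environments.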
In your example you sent the Content-Type header, so did you try that instead of Accept-Encoding? In any case, try copying all the headers from a real browser; maybe the server returns the correct encoding only for a certain set of headers.
robust-apricotOP•3y ago
That's what I tried 😦 Still nothing.
(These are my headers from Safari and I tried them in different permutations.)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: cs-CZ,cs;q=0.9
Accept-Encoding: gzip, deflate, br
What I also tried was installing Crawlee on my computer and running it locally on my Mac. There I had no issues; it worked as it should. So I wonder, could it be something between the server and the dockerized Linux on Apify? I don't know :/
That could be it. As a quick check, try running https://apify.com/apify/cheerio-scraper and see whether you get the correct encoding; if not, then I guess it's an OS-level issue.
robust-apricotOP•3y ago
I guess I give up. I simply cannot make Apify resolve Czech characters correctly on this specific website.
I just ran the public actor: https://console.apify.com/view/runs/88SUu8J13mIHtFRmw (see https://api.apify.com/v2/datasets/eTbvUiQ8ljaIRroEy/items?clean=true&format=json). I simply saved the entire content as
{ title: $('title').text(), text: $('body').text() }
but you can split it into more detailed fields. The text fragments are already there and in the correct encoding, for example:
t06723331Apoštolská církev, sbor Bystřice pod Hostýnem
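For anyone wanting to try the same thing, the object above could sit inside a Cheerio Scraper page function roughly as sketched below. The `url` field is an addition for illustration and is not in the reply. The `fake$` stub and the example URL are hypothetical stand-ins so the snippet runs outside Apify; on the platform, `context.$` is the real Cheerio handle.

```javascript
// Sketch of a page function that saves the whole page text, as in the reply.
async function pageFunction(context) {
  const { $, request } = context;
  return {
    url: request.url, // added for illustration, not in the original reply
    title: $('title').text(),
    text: $('body').text(),
  };
}

// Hypothetical stand-in for Cheerio, for local illustration only:
// it maps a selector to a fixed piece of text.
const fakePage = {
  title: 'Detail',
  body: 'Apoštolská církev, sbor Bystřice pod Hostýnem',
};
const fake$ = (selector) => ({ text: () => fakePage[selector] || '' });

pageFunction({ $: fake$, request: { url: 'https://example.com/detail' } })
  .then((item) => console.log(item));
```

Splitting `text` into separate fields would then mean replacing `$('body')` with more specific selectors for the page in question.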

robust-apricotOP•3y ago
Oh man, this one worked. I am still confused because I don't understand the problem, but this seems to be the way to go.