robust-apricot

Issues with charset

Hi everyone, I am new to Apify. I love its utility, so I decided to learn it by solving a real-life problem: scraping data from an official government website. I am using Cheerio Scraper to get data from a list (link below) with Czech text. My problem is that I cannot get the data with the correct encoding; characters from the Czech alphabet come out wrong.

It scrapes this: "Apoďż˝tolskďż˝ cďż˝rkev, 1. sbor Praha" (with windows-1250)
or this: "Apo�tolsk� c�rkev, 1. sbor Praha" (with utf8)
instead of this: "Apoštolská církev, 1. sbor Praha"

I have tried forcing different response encodings (utf8, windows-1250) and sending different headers, but without success. After many hours I feel like I am getting nowhere. Do you have any suggestions?

Start URL: https://www-cns.mkcr.cz/cns_internet/CNS/Seznam_cpo.aspx?id_subj=148&str_zpet=Seznam_CPO.aspx
Glob pattern: https://www-cns.mkcr.cz/cns_internet/CNS/Detail_cpo.aspx?id_subj=*&str_zpet=Seznam_CPO.aspx
Link selector: td > a

Code:
async function pageFunction(context) {
    const { $, request, log } = context;
    const pageTitle = $('title').first().text();
    const url = request.url;
    const churchName = $('td:contains("zev:")').next().text();

    log.info('Church Name:', { churchName });
    return {
        url,
        churchName
    };
}
BTW: I am using proxy located in CZ to get to it.
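The two broken outputs quoted above are consistent with one specific failure, which the following minimal sketch reproduces. It assumes the page bytes are windows-1250 and that the runtime's `TextDecoder` supports the `windows-1250` label (true for Node.js builds with full ICU); the variable names are illustrative only. Decoding the bytes as UTF-8 turns each accented letter into the replacement character U+FFFD, and decoding that already-damaged text as windows-1250 yields the "ďż˝" sequence.

```javascript
// "Apoštolská církev" as windows-1250 bytes (š = 0x9A, á = 0xE1, í = 0xED).
const raw = Uint8Array.from([
    0x41, 0x70, 0x6f, 0x9a, 0x74, 0x6f, 0x6c, 0x73, 0x6b, 0xe1, // Apoštolská
    0x20,                                                       // space
    0x63, 0xed, 0x72, 0x6b, 0x65, 0x76,                         // církev
]);

// Correct: decode the original bytes with the charset they were written in.
const correct = new TextDecoder('windows-1250').decode(raw);

// Wrong charset: bytes that are invalid UTF-8 become U+FFFD.
const asUtf8 = new TextDecoder('utf-8').decode(raw);

// Too late: re-encode the damaged string and decode it as windows-1250.
// Each U+FFFD is the UTF-8 bytes EF BF BD, which windows-1250 reads as "ďż˝".
const reencoded = new TextEncoder().encode(asUtf8);
const doubleDecoded = new TextDecoder('windows-1250').decode(reencoded);

console.log(correct);
console.log(asUtf8);
console.log(doubleDecoded);
```

If this is what happens, the original bytes are already lost once the UTF-8 decode has run, so the charset has to be applied to the raw response body rather than to the decoded string.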
9 Replies
Alexey Udovydchenko
When you add a request, try setting headers: { url, headers: { 'accept-encoding': ENC } }. See details at https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding. In other words, the crawler gets what the server provides; if the default encoding is wrong, just change it.
robust-apricot (OP), 3y ago
Thank you for your response. No matter what header I give it, the result turns out wrong.
"startUrls": [
    {
        "url": "https://www-cns.mkcr.cz/cns_internet/CNS/Seznam_cpo.aspx?id_subj=148&str_zpet=Seznam_CPO.aspx",
        "method": "GET",
        "headers": {
            "Content-Type": "text/html; charset=windows-1250"
        }
    }
]
Alexey Udovydchenko
In your example you provided a Content-Type header, so you tried that instead of Accept-Encoding? In any case, try copying all headers from a real browser; maybe the server provides the correct encoding only for a certain set of headers.
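For context on the headers discussed here: a `Content-Type` header on a GET request describes the request body, not the response, while the charset of the returned page is normally declared in the response's own `Content-Type` header. A minimal sketch of reading that declaration and decoding the raw body with it follows. The helper names `charsetFromContentType` and `decodeBody` are hypothetical and not part of Apify or Crawlee, and `windows-1250` support in `TextDecoder` is assumed (Node.js with full ICU).

```javascript
// Hypothetical helper: extract the charset label from a response
// Content-Type value such as "text/html; charset=windows-1250".
function charsetFromContentType(contentType, fallback = 'utf-8') {
    const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
    return match ? match[1].toLowerCase() : fallback;
}

// Hypothetical helper: decode raw body bytes with the declared charset.
function decodeBody(bytes, contentType) {
    const charset = charsetFromContentType(contentType);
    return new TextDecoder(charset).decode(bytes);
}

// "církev" encoded as windows-1250 (í = 0xED).
const body = Uint8Array.from([0x63, 0xed, 0x72, 0x6b, 0x65, 0x76]);

// Charset declared by the server: decodes cleanly.
const declared = decodeBody(body, 'text/html; charset=windows-1250');

// No charset declared: the UTF-8 fallback damages the accented letter.
const undeclared = decodeBody(body, 'text/html');

console.log(declared);
console.log(undeclared);
```

When the server omits or misstates the charset, the fallback decides the result, which is one way the same page can decode correctly in one environment and incorrectly in another.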
robust-apricot (OP), 3y ago
That's what I tried 😦 Still nothing. These are my headers from Safari, and I tried them in different permutations:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: cs-CZ,cs;q=0.9
Accept-Encoding: gzip, deflate, br

What else I tried was installing Crawlee on my computer and running it locally on my Mac. There I had no issues; it worked as it should. So I wonder, could it be something between the server and the dockerized Linux on Apify? I don't know :/
Alexey Udovydchenko
That could be the case. As a quick check, try running https://apify.com/apify/cheerio-scraper and see whether you get the correct encoding; if not, then I guess it's an OS-level issue.
robust-apricot (OP), 3y ago
I guess I give up. I simply cannot make Apify resolve Czech characters correctly on this specific website.
Alexey Udovydchenko
I just ran the public actor: https://console.apify.com/view/runs/88SUu8J13mIHtFRmw
See the results at https://api.apify.com/v2/datasets/eTbvUiQ8ljaIRroEy/items?clean=true&format=json
I simply saved the entire content as { title: $('title').text(), text: $('body').text() }, but you can split it into more detail. The text fragments are already there and in the correct encoding, for example: t06723331Apoštolská církev, sbor Bystřice pod Hostýnem
robust-apricot (OP), 3y ago
Oh man, this one worked. I am still confused because I don't understand the problem, but this seems to be the way to go.