HonzaS
Crawlee & Apify
Created by HonzaS on 3/22/2024 in #apify-platform
requestQueue write costs
No description
10 replies
Crawlee & Apify
Created by HonzaS on 3/21/2024 in #apify-platform
cheerio works on local but RequestError: Proxy responded with 400 Bad Request: 30 bytes on platform
Hi there, I have a problem running CheerioCrawler with the Apify Czech proxies on the platform because I get this error. A crawler with the same proxies works locally; what could be the reason?
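A minimal sketch of how the proxy setup in question might look, assuming the Apify SDK v3 API and Czech residential proxies (the RESIDENTIAL group and CZ country code are assumptions about the setup, not taken from the post):

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Assumed proxy settings: residential group restricted to the Czech Republic.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'CZ',
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, log }) => {
        log.info(`Loaded ${request.url}`);
    },
});

await crawler.run(['https://example.com/']); // placeholder URL
await Actor.exit();

A 400 from the proxy on the platform but not locally often points to the run using different proxy credentials or groups than the local environment; comparing the output of await proxyConfiguration.newUrl() in both places can help confirm that.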
9 replies
Crawlee & Apify
Created by HonzaS on 10/16/2023 in #crawlee-js
playwright and proxy problem
When I try to access the page with a proxy in Playwright, I get a captcha. Without the proxy it works with no problem. But what is weird is that if I use the same proxy in a regular browser via the SwitchyOmega extension, the page also loads without problems. So I think the page somehow detects that an automated browser is using the proxy. Has anyone encountered this problem?
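One thing commonly tried in this situation is letting Crawlee generate browser fingerprints and running headful; this is a rough sketch under the assumption of Crawlee v3's PlaywrightCrawler, not a guaranteed fix (the proxy and target URLs are placeholders):

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:password@proxy.example.com:8000'], // placeholder proxy
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // Generate consistent browser fingerprints to reduce obvious automation signals.
    browserPoolOptions: { useFingerprints: true },
    launchContext: {
        launchOptions: { headless: false },
    },
    requestHandler: async ({ page, log }) => {
        log.info(`Title: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com/']); // placeholder URL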
3 replies
Crawlee & Apify
Created by HonzaS on 9/22/2023 in #apify-platform
google drive integration question
Hi, I need to upload a CSV file to Google Drive. I see there is a new integration tab on Apify now, so I have some questions. Sadly there are a lot of settings but no documentation. Is it possible to upload a file from the key-value store via this integration? I have managed to convert the CSV file to JSON, push it to the dataset and then push it via the integration to Drive, but there are some issues: 1. I do not know how to preserve the filename. 2. Converting CSV to JSON changes the data a little. I know there is an actor for uploading to Google Drive, but I like that the integration has a Google sign-in button, so I do not need to care about permissions. Thanks for any suggestions.
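For reference, storing the CSV unchanged in the key-value store is straightforward; whether the Drive integration can pick it up from there is exactly the open question above. A sketch assuming the Apify SDK v3 (the record key and CSV content are illustrative):

import { Actor } from 'apify';

await Actor.init();

// Keep the CSV as-is under a named record; the key also preserves the intended filename.
const csv = 'name,price\n"Widget, large",42\n'; // illustrative CSV content
await Actor.setValue('export.csv', csv, { contentType: 'text/csv' });

await Actor.exit();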
7 replies
Crawlee & Apify
Created by HonzaS on 9/1/2023 in #apify-platform
how to run headful on the platform?
Failed to launch the browser process! undefined
2023-09-01T21:35:32.962Z [141:164:0901/213532.948704:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-09-01T21:35:32.964Z [141:141:0901/213532.952831:ERROR:ozone_platform_x11.cc(240)] Missing X server or $DISPLAY
2023-09-01T21:35:32.966Z [141:141:0901/213532.952846:ERROR:env.cc(255)] The platform failed to initialize. Exiting.
I got this error on the platform when trying to run this code:
import { launchPuppeteer } from 'crawlee'; // assuming Crawlee's launchPuppeteer helper

const browser = await launchPuppeteer({
    useChrome: true,
    // Native Puppeteer options
    launchOptions: {
        headless: false,
    },
});
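The log above shows the real blocker: there is no X server in the container ("Missing X server or $DISPLAY"), so headful Chrome cannot start unless the image provides one (e.g. Xvfb). A defensive sketch that only goes headful when a display is actually available; Actor.isAtHome() and the fallback logic are an assumption about how one might handle it, not platform guidance:

import { Actor } from 'apify';
import { launchPuppeteer } from 'crawlee';

await Actor.init();

// Run headful only when a display exists; otherwise fall back to headless
// so the actor still works in a plain platform container without Xvfb.
const hasDisplay = Boolean(process.env.DISPLAY);
const headless = Actor.isAtHome() ? !hasDisplay : false;

const browser = await launchPuppeteer({
    useChrome: true,
    launchOptions: { headless },
});

// ... use the browser ...
await browser.close();
await Actor.exit();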
1 reply
Crawlee & Apify
Created by HonzaS on 5/3/2023 in #apify-platform
Google Sheets Import & Export Actor
No description
13 replies
Crawlee & Apify
Created by HonzaS on 4/27/2023 in #apify-platform
crawler stops when there are still pending requests
No description
2 replies
Crawlee & Apify
Created by HonzaS on 2/14/2023 in #apify-platform
parsing input urls from google sheet
Hi, I have tried this feature: https://docs.apify.com/platform/tutorials/crawl-urls-from-a-google-sheet It looks like there is a bug: it does not parse out the whole URL when there is a comma inside it. I have tried it on this sheet: https://docs.google.com/spreadsheets/d/14eS_kezUiZ13U1zEaDrb4s7xnmerJuHwG7wiRIPwBIM/edit#gid=0 I even tried to wrap the URL in double quotes, but it did not help. Here is the result; you can see that the requested URLs are not the same as in the sheet: https://api.apify.com/v2/datasets/vlTmoYRiFWawRdJsZ/items?clean=true&format=json
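A possible workaround (not the built-in feature) is to fetch the sheet's CSV export yourself and run it through a CSV-aware parser so quoted commas survive. This sketch assumes Node 18+ for the global fetch and the csv-parse package as an extra dependency:

import { parse } from 'csv-parse/sync';
import { CheerioCrawler } from 'crawlee';

// Sheet ID taken from the question above.
const SHEET_ID = '14eS_kezUiZ13U1zEaDrb4s7xnmerJuHwG7wiRIPwBIM';
const csv = await (await fetch(
    `https://docs.google.com/spreadsheets/d/${SHEET_ID}/export?format=csv&gid=0`,
)).text();

// The first column of each row is expected to hold one URL; quoted commas stay intact.
const urls = parse(csv, { skip_empty_lines: true })
    .map((row) => row[0])
    .filter((value) => value.startsWith('http')); // skip a possible header row

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, log }) => log.info(`Loaded ${request.url}`),
});
await crawler.run(urls);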
15 replies
Crawlee & Apify
Created by HonzaS on 11/28/2022 in #apify-platform
Is it possible to have a dataset with a constant URL?
I want the actor to fill the dataset on each run, but I do not want it to keep adding items; I want only the items from that run. So before inserting I drop the dataset and then create a new named dataset with the same name and insert the items. The problem is that the URL is based on the ID of the dataset, and that is different every time. So is there some way to have a constant URL?
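A sketch of that drop-and-recreate flow, relying on the Apify API convention that named storages can also be addressed as <username>~<storage-name> instead of by ID, which keeps the URL stable across runs (the dataset name, username and items below are illustrative):

import { Actor } from 'apify';

await Actor.init();

const DATASET_NAME = 'latest-results'; // illustrative name

// Drop last run's dataset and create a fresh one under the same name.
const oldDataset = await Actor.openDataset(DATASET_NAME);
await oldDataset.drop();
const dataset = await Actor.openDataset(DATASET_NAME);

const items = [{ title: 'example' }]; // items produced by this run (illustrative)
await dataset.pushData(items);

// Stable URL, independent of the new dataset's ID:
const username = 'your-username'; // assumption: your Apify account name
console.log(`https://api.apify.com/v2/datasets/${username}~${DATASET_NAME}/items?clean=true&format=json`);

await Actor.exit();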
3 replies
Crawlee & Apify
Created by HonzaS on 11/15/2022 in #crawlee-js
how to set payload in cheerio crawler preNavigationHooks
doing it like this:
preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        const { request } = crawlingContext;
        request.payload = `.......`;
    },
],
This does not work; I get the error ReferenceError: page is not defined. Also, when I want to set headers, should I use gotOptions.headers or request.headers, and what is the difference?
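A sketch of the two places where this can go, based on Crawlee's CheerioCrawler API: the method and payload usually belong on the Request itself when it is created, while gotOptions inside a preNavigationHook override what the crawler builds from the Request for that single navigation (the endpoint and header below are hypothetical):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        async ({ request }, gotOptions) => {
            // Per-navigation override of the outgoing got-scraping options.
            gotOptions.headers = {
                ...gotOptions.headers,
                'x-example-header': 'value', // hypothetical header
            };
        },
    ],
    requestHandler: async ({ request, body, log }) => {
        log.info(`Got ${body.length} bytes from ${request.url}`);
    },
});

// Method, payload and headers set up front on the Request:
await crawler.run([{
    url: 'https://example.com/api', // hypothetical endpoint
    method: 'POST',
    payload: JSON.stringify({ query: 'example' }),
    headers: { 'content-type': 'application/json' },
}]);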
5 replies
Crawlee & Apify
Created by HonzaS on 11/8/2022 in #crawlee-js
pass the cloudflare browser check
Does anybody know how to pass the Cloudflare browser check with Crawlee's PlaywrightCrawler? The site I have a problem with is https://www.g2.com/ I have tried residential proxies, no proxies, the Chrome and Firefox browsers, headful and headless, but nothing works. My own Chrome browser passes the check both without proxies and with residential proxies, so I guess the proxy is not the problem. The problem is that Cloudflare somehow knows it is an automated browser. In the Apify Store there is a working scraper for G2, but it is written in Python; at least I know it is possible.
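For completeness, this is roughly what the strongest combination mentioned above looks like when wired together in Crawlee (Firefox, headful, residential proxies, generated fingerprints); it is a sketch of that setup, not a guaranteed Cloudflare bypass:

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'], // assumes access to Apify residential proxies
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    browserPoolOptions: { useFingerprints: true },
    launchContext: {
        launcher: firefox,
        launchOptions: { headless: false },
    },
    requestHandler: async ({ page, log }) => {
        log.info(`Title: ${await page.title()}`);
    },
});

await crawler.run(['https://www.g2.com/']);
await Actor.exit();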
20 replies
Crawlee & Apify
Created by HonzaS on 10/30/2022 in #crawlee-js
net::ERR_TUNNEL_CONNECTION_FAILED
I am trying to use a proxy with the Crawlee PlaywrightCrawler to connect to a page on a non-standard port (444), and I am getting this proxy error: PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: net::ERR_TUNNEL_CONNECTION_FAILED. Any suggestions? Without the proxy it works fine locally. On the platform I get a timeout, which could be because of a banned AWS IP range.
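ERR_TUNNEL_CONNECTION_FAILED typically means the proxy refused the CONNECT to that host:port, and some proxies only allow tunnelling to ports 80 and 443. A quick diagnostic outside the browser, using got-scraping with the same proxy (the target URL and proxy URL below are placeholders):

import { gotScraping } from 'got-scraping';

// If this also fails, the proxy itself is rejecting the tunnel to port 444
// and the browser is not the problem.
const response = await gotScraping({
    url: 'https://example.com:444/', // hypothetical target on port 444
    proxyUrl: 'http://user:password@proxy.example.com:8000', // the proxy used by the crawler
    throwHttpErrors: false,
});
console.log(response.statusCode);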
7 replies
Crawlee & Apify
Created by HonzaS on 9/30/2022 in #apify-platform
run logs on the platform
I have run a Cheerio crawler on the platform and it logs lines like this: 2022-09-30T08:02:44.883Z WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Cannot read properties of null (reading 'match') {"id":"PxlxlTCgnI7zPOi","url":"https://www........","retryCount":1} I have two questions: 1. Why is there WARN instead of ERROR? I would prefer ERROR, in red; I believe it was always like that, was it changed? 2. Why can't I see the file and line where the error occurred? What should I change to solve this?
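One way to get at the file and line is to log the error's stack yourself: Crawlee's errorHandler fires on each retry and failedRequestHandler after retries are exhausted, and both receive the thrown error as their second argument. A sketch assuming Crawlee v3:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $ }) => {
        // ... scraping logic that may throw ...
    },
    // Called on every failed attempt that will still be retried.
    errorHandler: async ({ request, log }, error) => {
        log.error(`Request ${request.url} failed:\n${error.stack}`);
    },
    // Called once the request has exhausted all retries.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Request ${request.url} gave up after retries:\n${error.stack}`);
    },
});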
3 replies