Crawlee & Apify

This is the official developer community of Apify and Crawlee.

Canada411 site failing after 4 hours

I am using a CheerioCrawler actor to process input files of 500,000 records against a dynamically populated URL (https://www.canada411.ca/search/?stype=re&what=). The actor has been failing mysteriously after 4 to 4.5 hours, which we have not observed before. I have included below the log from the end of the failed run (#KcMSz5QQp8qIQnbYF). Any insight on this error message would be greatly appreciated. Thank you!...

How do I delay requests with HttpCrawler?

I am working with an API that has rate limiting in place. The API tells me, in seconds, when the current rate limit will expire. I need to delay my next request by that many seconds, which is usually about 15 minutes. I tried adding a delay with setTimeout wrapped in a Promise and awaiting it, like this: ```ts export function delay(seconds: number): Promise<void> {...

How to keep browser open when in development and how to clear cache when closing it?

I'd like to know how I can prevent the browser from closing when the crawl is finished. Also, I have a crawl that needs to sign the user in, but if the user is already signed in, the site redirects them. How can I clear the browser's cache whenever the browser is closed? I'd expect the cache to be cleared each time, but that doesn't seem to be the case....

Pass the Cloudflare browser check

Does anybody know how to pass the Cloudflare browser check with Crawlee's PlaywrightCrawler? The site I have a problem with: https://www.g2.com/ I have tried residential proxies, no proxies, Chrome and Firefox, headful and headless, but nothing works. My own Chrome browser passes the check both without proxies and with residential proxies, so I guess the proxy is not the problem. The problem is that Cloudflare somehow knows that it is an automated browser. In Apify Store there is a working scraper for g2; it is written in Python, but at least I know it is possible to do it....

How to avoid requesting some static resources?

When crawling with Playwright or Puppeteer, a lot of static assets (e.g. JS, CSS, PNG, JPG) are loaded. Is it possible to request static resources only the first time, and reuse the cached data on subsequent crawls without making a request?...

PuppeteerCrawler navigation timeout

I am using PuppeteerCrawler. Most of the time, I get a timeout error on the 10th page. I have tried:
* increasing the timeout to 100s => does not help
* increasing memory/CPU => 0.25 to 1 CPU => does not help

When I click through the site manually in an anonymous Chrome window, it works just fine. ...

Downloadlistofurloptions

How do I crawl nested XML (an XML file that links to further XML files)? I have to collect the links from there. Any suggestions using CheerioCrawler would help me move on. #crawlee

Fingerprint-suite and python

Hello there! Is it possible to inject fingerprints generated by fingerprint-suite into a playwright-python app?

Request method DELETE

I am trying to make a DELETE request to an API, but it doesn't work (the script hangs at the request, then the request times out). ```js router.addDefaultHandler(async ({ request, json, enqueueLinks, log }) => { const requests = [...

Super slow keyboard input in puppeteer on Apify

When running on the Apify platform, it takes ~600-700 ms per character typed, compared to ~200 ms when running locally. Local log: ``` INFO Typing in username...

How to make crawlee try to refetch?

If the response from the HTTP API I crawl does not meet expectations but the HTTP status is 200, how can I mark the request as failed and have Crawlee fetch it again with the next proxy?...

Keep getting this error message

Could you please help me understand what this means? I have tried many of the free scrapers and this keeps popping up.

Getting puppeteer-har and autoconsent to work with puppeteer crawler

Hi guys, I am totally new to Crawlee, so this might or might not be an easy question. I want to get all the cookies and third-party trackers or resources from our website and monitor any changes. The changes are made by a website agency, and I want to be sure we stay compliant with privacy regulations. ...

Disable cookies

Hi, I would like to disable cookies. I am using HttpCrawler. Can anyone point me to the relevant docs? Thank you.

How to clear named KeyStores before every run?

I have this function ```ts async function initialSetup() { // Clear previous data sets const storeKeys = ["categories_store", "details_store"]...

What is the URL for the proxy?

I'm confused. Is the entire proxy URL formatted like the one below? http://auto:[email protected]:8000...

How to use MemoryStorage (mainly for RequestQueue) on the platform?

My actor runs typically use Cheerio, take under 20 minutes, and make around 1k requests. For this scenario, the costs of RequestQueue writes/reads are often higher than the compute units. I wanted to experiment with in-memory storage to optimize costs (I think I understand the associated risks and I'm OK with them). I've tried setting storage: new MemoryStorage() in the second argument of Actor.main, as noted in the docs and TS definitions, but actor runs on the platform still seem to use the "platform RQ", not the "in-memory one". Any pointers? https://console.apify.com/actors/64sLcqgxq4IB5hZrI/runs/QIkqpBDa846Ftn5xK#storage...

Fingerprint and workers

Hello! I used the fingerprint generator, but CreepJS shows inconsistencies between the fingerprint and the workers. Is it possible to patch the fingerprint so that the workers are patched as well?

How to set 'locale' and 'timezoneId' on browsers or pages?

In Playwright, I can create a page this way and set locale and timezoneId: ```typescript browser.newPage({ locale: 'zh-TW',...

Crawlee eating memory like hell

It is using 3 GB after running for just 2 days.