Crawlee & Apify


This is the official developer community of Apify and Crawlee.


Setting a cookie in Cheerio before the page request

I am trying to use Cheerio to crawl a site that authenticates via session-based cookies. I have the cookie value I want to set, but I don't know where/how to set it so that every page request of my Actor's run has that cookie set. Are there pre-request callbacks I can use in Cheerio to set a cookie, or perhaps a high-level per-Actor config where I can set cookie values that will persist across all sessions? I can't find any examples or documentation on how to access the session/sessionPool outside of a Cheerio requestHandler 🤷🏻‍♂️...
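A hedged pointer: CheerioCrawler accepts preNavigationHooks, which run before every HTTP request and receive the underlying got options, so a cookie header can be injected there. A minimal sketch (the cookie name/value and the URL are placeholders):

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // preNavigationHooks fire before each request the crawler makes.
    preNavigationHooks: [
        async ({ request }, gotOptions) => {
            // Inject the session cookie into the outgoing request headers.
            gotOptions.headers = {
                ...gotOptions.headers,
                Cookie: 'sessionid=YOUR_COOKIE_VALUE', // placeholder value
            };
        },
    ],
    async requestHandler({ $, request }) {
        // ... scrape the authenticated page
    },
});

await crawler.run(['https://example.com/account']);
```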

Resume after crash

Hey, I've had a Cheerio crawler running for a couple of hours, but it crashed. I'm wondering if it's possible to resume the crawl from the place where it stopped. I can see there are some files left in the key_value_stores dir:...
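Crawlee persists the request queue to local storage, so a crashed run can usually be resumed by disabling the default purge-on-start behavior. A minimal sketch, assuming the storage directory from the crashed run is intact:

```js
import { CheerioCrawler, Configuration } from 'crawlee';

// Keep the previous run's storage instead of purging it on startup.
// (Equivalently, set the CRAWLEE_PURGE_ON_START=0 environment variable.)
const config = new Configuration({ purgeOnStart: false });

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // ... same handler as the original run
    },
}, config);

// Requests already marked as handled in the persisted queue are skipped.
await crawler.run();
```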

enqueueLinks with a selector doesn't work?

I'm trying to grab the next page link from: https://www.haskovo.net/news with:
await enqueueLinks({
    selector: '.pagination li:last-child > a',
    label: 'LIST',
})
...
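If the crawler is Cheerio-based, one common cause is that the pagination element is rendered client-side and is therefore absent from the raw HTML that Cheerio sees. A hedged debugging step is to log what the selector actually matches inside the requestHandler (selector and label taken from the question):

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request, enqueueLinks, log }) {
        // Inspect what the selector matches in the server-rendered HTML.
        const matches = $('.pagination li:last-child > a');
        log.info(`Matched ${matches.length} element(s), href=${matches.attr('href')}`);

        await enqueueLinks({
            selector: '.pagination li:last-child > a',
            label: 'LIST',
        });
    },
});

await crawler.run(['https://www.haskovo.net/news']);
```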

How to clear the request queue without stopping the crawler

I want the crawler to run constantly, and I want to remove old links to clear memory and storage.
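There doesn't appear to be a supported way to prune a queue while a running crawler owns it, so one hedged workaround is to crawl in batches and drop the queue between them. RequestQueue.open() and queue.drop() are real Crawlee calls; the batching loop and fetchSeedUrls() are hypothetical:

```js
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Hypothetical long-running loop: crawl a batch, then discard the queue
// (including its handled-request bookkeeping) to reclaim memory and disk.
while (true) {
    const requestQueue = await RequestQueue.open();
    const crawler = new CheerioCrawler({
        requestQueue,
        async requestHandler({ $, request }) {
            // ... process the page
        },
    });

    await crawler.run(await fetchSeedUrls()); // fetchSeedUrls() is hypothetical
    await requestQueue.drop(); // deletes the queue and all stored requests
}
```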

Requesting proxy rotation for an individual organization

Is it possible to request proxy rotation or reset for an individual organization? If so, who would I need to contact to make that happen? Thanks!

Google shopping

Hi, I'm new to the whole web scraping thing, so I'm looking for some pointers. I don't have a background in coding, but I'm looking to scrape information for e-commerce purposes. I've been trying a couple of Actors to scrape pricing on Google ads but am coming across a few issues. 1. Most appear to scrape from the US and I am based in the UK. Is there a way to change this to Google Shopping UK, or any suggestions of Actors to use? ...

How to scrape emails to one level of nesting and return the results via API

The main question is probably how to send the response correctly instead of saving the data in the Dataset. For example, with Express.
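A hedged sketch of one way to do this: run the crawler inside an Express route handler and collect results in a local array rather than calling Dataset.pushData (the route path and the mailto: extraction are assumptions):

```js
import express from 'express';
import { CheerioCrawler } from 'crawlee';

const app = express();

app.get('/scrape', async (req, res) => {
    const results = []; // collect in memory instead of Dataset.pushData()

    const crawler = new CheerioCrawler({
        async requestHandler({ $, request }) {
            // Hypothetical extraction: pull mailto: links from the page.
            $('a[href^="mailto:"]').each((_, el) => {
                results.push({
                    url: request.url,
                    email: $(el).attr('href').replace('mailto:', ''),
                });
            });
        },
    });

    await crawler.run([req.query.url]);
    res.json(results); // respond with the scraped data directly
});

app.listen(3000);
```

For repeated calls in one process you would likely want a uniquely named request queue per request, since the default queue persists handled URLs between runs.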

There is a major problem, Crawlee is unable to bypass the Cloudflare protecti...

@Helper @gahabeen There is a major problem: Crawlee is unable to bypass the Cloudflare protection (captcha solution tried 5 times). The useChrome method was tried and failed. Manual login was successful when done in Chrome (outside of Node, and also tried with incognito mode, etc.) https://abrahamjuliot.github.io/creepjs/ Despite Crawlee receiving a higher trust score from the Chrome browser I am currently using, it is unable to pass the Cloudflare page....

Waiting for CF bot check

I'm trying to pass CF's bot check using Firefox without much luck. I found the thread about using Firefox to get cookies for Cheerio, but I need to use Firefox all the way. The issue I'm running into is that the CF bot page gives a 403, which causes Crawlee to think it's a bad request. I was able to use errorHandler to wait out the bot check, but now I can't find a way to keep the CF cookies for the session. If I do session.setCookies(...) inside the errorHandler, nothing gets stored and the retry connection uses a new session. I also tried session.markGood() but it didn't help. Any ideas?...
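One hedged idea: by default the session pool treats 401/403/429 responses as "blocked" and retires the session, which would explain the cookies disappearing on retry. SessionPool accepts a blockedStatusCodes option, so excluding 403 may let the session (and its cookies) survive the bot-check response. A sketch under that assumption, using Playwright's Firefox:

```js
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: { launcher: firefox }, // use Firefox "all the way"
    persistCookiesPerSession: true, // keep cookies tied to their session
    sessionPoolOptions: {
        // The default blocked codes include 403, which retires the session
        // and discards its cookies; an empty list keeps the session alive.
        blockedStatusCodes: [],
    },
    async requestHandler({ page, session }) {
        // ... wait out the CF check, then scrape
    },
});
```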

How to add data to the SDK_CRAWLER_STATISTICS

I want to add some counts to the SDK_CRAWLER_STATISTICS JSON. I found a solution by creating my own record, like this: await KeyValueStore.setValue("statistics", statistics); but I would prefer to add it to the existing statistics.
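There doesn't appear to be a public API for injecting custom fields into the crawler's statistics, so one hedged workaround is to merge extra counts into the persisted record after the run finishes. The SDK_CRAWLER_STATISTICS_0 key name is an assumption based on the default statistics instance:

```js
import { CheerioCrawler, KeyValueStore } from 'crawlee';

const myCounts = { pdfSkipped: 0, loginPages: 0 }; // custom counters

const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        // ... increment myCounts as needed
    },
});

await crawler.run(['https://example.com']);

// Merge the custom counts into the persisted statistics record.
const store = await KeyValueStore.open();
const stats = await store.getValue('SDK_CRAWLER_STATISTICS_0');
await store.setValue('SDK_CRAWLER_STATISTICS_0', { ...stats, ...myCounts });
```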

How can I skip .pdf files in PuppeteerCrawler

I want to skip all .pdf and .docx files from crawling.
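A hedged sketch using enqueueLinks' transformRequestFunction, which drops a request when it returns false (the extension list mirrors the question):

```js
import { PuppeteerCrawler } from 'crawlee';

const SKIPPED_EXTENSIONS = ['.pdf', '.docx'];

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, enqueueLinks }) {
        await enqueueLinks({
            // Returning false from the transform skips the request entirely.
            transformRequestFunction: (request) => {
                const path = new URL(request.url).pathname.toLowerCase();
                if (SKIPPED_EXTENSIONS.some((ext) => path.endsWith(ext))) return false;
                return request;
            },
        });
    },
});
```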

get stats

How to get stats after the run succeeds.
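A hedged pointer: crawler.run() resolves with the final statistics object, so the stats are available right after the run completes:

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $ }) { /* ... */ },
});

// run() resolves with the final statistics once the crawl finishes.
const stats = await crawler.run(['https://example.com']);
console.log(stats.requestsFinished, stats.requestsFailed);
```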

How to increase memory of PuppeteerCrawler

I run the crawler and it shows a warning: "WARN PuppeteerCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 4141 MB of 3932 MB (105%). Consider increasing available memory." How to increase this memory...
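A hedged note: the memory budget the Snapshotter measures against is configurable. In Crawlee this is the memoryMbytes configuration option, also settable via the CRAWLEE_MEMORY_MBYTES environment variable (the 8192 MB value is an arbitrary example):

```js
import { PuppeteerCrawler, Configuration } from 'crawlee';

// Raise the memory budget (in MB) that the autoscaled pool works with.
// Equivalent to setting CRAWLEE_MEMORY_MBYTES=8192 in the environment.
Configuration.getGlobalConfig().set('memoryMbytes', 8192);

const crawler = new PuppeteerCrawler({
    async requestHandler({ page }) { /* ... */ },
});
```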

Use page.on('request') in PuppeteerCrawler

How can I do that with the PuppeteerCrawler?

```
page.on('request', async (request) => {
    // check if request.url contains any of the google domains...
```
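A hedged sketch: PuppeteerCrawler exposes the page before navigation via preNavigationHooks, which is where request interception is normally wired up (the google.com filter is an assumption based on the comment in the question):

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Interception must be enabled before listening for requests.
            await page.setRequestInterception(true);
            page.on('request', (request) => {
                // Example filter: block requests to Google domains.
                if (new URL(request.url()).hostname.endsWith('google.com')) {
                    return request.abort();
                }
                request.continue();
            });
        },
    ],
    async requestHandler({ page }) { /* ... */ },
});
```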

Bet 365 crawler

I'm having some issues web scraping the Bet365 website. Does anyone know how to bypass the Bet365 security?

Is there a way to close the browser in PuppeteerCrawler?

My crawler got stuck getting request timeouts with a concurrency of 20. If I could close the browser on request timeout, that could solve the issue.
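A hedged sketch: each browser-based crawling context carries a browserController, so an errorHandler (which runs before a failed request is retried) can force-close the browser on timeout. Matching on the error message is a simplification:

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxConcurrency: 20,
    navigationTimeoutSecs: 60,
    // errorHandler runs before a failed request is retried.
    errorHandler: async ({ browserController, session, log }, error) => {
        if (error.message.includes('timeout')) {
            log.warning('Timeout; closing the browser so the retry gets a fresh one.');
            session?.retire();
            await browserController.close(); // shuts down the whole browser
        }
    },
    async requestHandler({ page }) { /* ... */ },
});
```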

Error while trying to use Apify

I'm trying to use one of the Actors to scrape Instagram data, but I keep getting this error: "TypeError: Cannot read properties of undefined (reading 'join')". I'm using TypeScript (Node.js), as you can see in the img. Does anyone know what might be the issue? Thank you.

Custom storage provider for RequestQueue?

It's probably a little out of the ordinary, but I'm building a crawler project that stores a pretty large pile of information in a database, rather than Crawlee's native KVS and DataSet. I'm curious whether there are any examples of using alternative backends to store Crawlee's own datasets and request queue. If possible, I'd love to consolidate the storage in one place, particularly since it would allow me to query and manage the request pool a bit more easily…
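A hedged pointer: Crawlee routes datasets, key-value stores, and request queues through a StorageClient implementation set on the Configuration object, which is the seam where a database-backed client would plug in. MyDatabaseStorageClient below is hypothetical; the built-in memory storage and @apify/storage-local are the existing reference implementations of the interface:

```js
import { CheerioCrawler, Configuration } from 'crawlee';
// Hypothetical: a class implementing Crawlee's StorageClient interface
// (request queues, key-value stores, datasets) on top of a database.
import { MyDatabaseStorageClient } from './my-database-storage-client.js';

const config = new Configuration({
    storageClient: new MyDatabaseStorageClient({ connectionString: 'postgres://…' }),
});

const crawler = new CheerioCrawler({
    async requestHandler({ $ }) { /* ... */ },
}, config);
```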

How to scroll page

Hi, I am using PuppeteerCrawler. How do I scroll to load more content in the handler?
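A hedged sketch using Crawlee's infiniteScroll helper, which is exposed on the Puppeteer crawling context and keeps scrolling until no new content loads (the timeout value is an arbitrary example):

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, infiniteScroll }) {
        // Scroll down repeatedly until the page stops growing
        // (or the time budget runs out).
        await infiniteScroll({ timeoutSecs: 30 });

        // ... extract the fully loaded content from `page`
    },
});
```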

Exclude query parameter URLs from crawl jobs

Hello, I'm currently researching methods to exclude URLs with query parameters, for example: https://domain[.]com/path?query1=test&query2=test2. I've tried hooking into the enqueueLinks options like:...
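A hedged sketch along the same lines as the .pdf filter above: enqueueLinks' transformRequestFunction can return false for any URL that carries a query string:

```js
await enqueueLinks({
    transformRequestFunction: (request) => {
        // Drop any URL that has a non-empty query string.
        if (new URL(request.url).search !== '') return false;
        return request;
    },
});
```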