Crawlee & Apify


This is the official developer community of Apify and Crawlee.


Setting a cookie in Cheerio before the page request

I am trying to use Cheerio to crawl a site that authenticates via session-based cookies. I have the cookie value I want to set, but I don't know where/how to set it so that every page request of my Actor's run has that cookie set. Are there pre-request callbacks I can use in Cheerio to set a cookie, or perhaps a high-level per-Actor config where I can set cookie values that will persist across all sessions? I can't find any examples or documentation on how to access the session/sessionPool outside of a Cheerio requestHandler 🤷🏻‍♂️...
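A hedged pointer: CheerioCrawler accepts preNavigationHooks, which run before every HTTP request and receive the underlying got options, so a cookie header can be injected there. A minimal sketch (the cookie name/value and the URL are placeholders):

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // preNavigationHooks fire before each request the crawler makes.
    preNavigationHooks: [
        async ({ request }, gotOptions) => {
            // Inject the session cookie into the outgoing request headers.
            gotOptions.headers = {
                ...gotOptions.headers,
                Cookie: 'sessionid=YOUR_COOKIE_VALUE', // placeholder value
            };
        },
    ],
    async requestHandler({ $, request }) {
        // ... scrape the authenticated page
    },
});

await crawler.run(['https://example.com/account']);
```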

Resume after crash

Hey, I've had a Cheerio crawler running for a couple of hours, but it crashed. I'm wondering if it's possible to resume the crawl from the place where it stopped. I can see there are some files left in the key_value_stores dir:...
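Crawlee persists the request queue to local storage, so a crashed run can usually be resumed by disabling the default purge-on-start behavior. A minimal sketch, assuming the storage directory from the crashed run is intact:

```js
import { CheerioCrawler, Configuration } from 'crawlee';

// Keep the previous run's storage instead of purging it on startup.
// (Equivalently, set the CRAWLEE_PURGE_ON_START=0 environment variable.)
const config = new Configuration({ purgeOnStart: false });

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // ... same handler as the original run
    },
}, config);

// Requests already marked as handled in the persisted queue are skipped.
await crawler.run();
```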

enqueueLinks with a selector doesn't work?

I'm trying to grab the next page link from: https://www.haskovo.net/news with:
await enqueueLinks({
    selector: '.pagination li:last-child > a',
    label: 'LIST',
})
...
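If the crawler is Cheerio-based, one common cause is that the pagination element is rendered client-side and is therefore absent from the raw HTML that Cheerio sees. A hedged debugging step is to log what the selector actually matches inside the requestHandler (selector and label taken from the question):

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request, enqueueLinks, log }) {
        // Inspect what the selector matches in the server-rendered HTML.
        const matches = $('.pagination li:last-child > a');
        log.info(`Matched ${matches.length} element(s), href=${matches.attr('href')}`);

        await enqueueLinks({
            selector: '.pagination li:last-child > a',
            label: 'LIST',
        });
    },
});

await crawler.run(['https://www.haskovo.net/news']);
```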

How to clear the request queue without stopping the crawler

I want the crawler to run constantly, and I want to remove old links to clear memory and storage.
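There doesn't appear to be a supported way to prune a queue while a running crawler owns it, so one hedged workaround is to crawl in batches and drop the queue between them. RequestQueue.open() and queue.drop() are real Crawlee calls; the batching loop and fetchSeedUrls() are hypothetical:

```js
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Hypothetical long-running loop: crawl a batch, then discard the queue
// (including its handled-request bookkeeping) to reclaim memory and disk.
while (true) {
    const requestQueue = await RequestQueue.open();
    const crawler = new CheerioCrawler({
        requestQueue,
        async requestHandler({ $, request }) {
            // ... process the page
        },
    });

    await crawler.run(await fetchSeedUrls()); // fetchSeedUrls() is hypothetical
    await requestQueue.drop(); // deletes the queue and all stored requests
}
```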

Requesting proxy rotation for an individual organization

Is it possible to request proxy rotation or reset for an individual organization? If so, who would I need to contact to make that happen? Thanks!

Google shopping

Hi, I'm new to the whole web scraping thing, so I'm looking for some pointers. I don't have a background in coding, but I'm looking to scrape information for e-commerce purposes. I've been trying a couple of Actors to scrape pricing on Google ads but am coming across a few issues. 1. Most appear to scrape from the US and I am based in the UK. Is there a way to change this to Google Shopping UK, or any suggestions of Actors to use? ...

How to scrape emails to one level of nesting and return the results via API

The main question is probably how to send the response correctly instead of saving the data in the Dataset. For example, with Express.
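A hedged sketch of one way to do this: run the crawler inside an Express route handler and collect results in a local array rather than calling Dataset.pushData (the route path and the mailto: extraction are assumptions):

```js
import express from 'express';
import { CheerioCrawler } from 'crawlee';

const app = express();

app.get('/scrape', async (req, res) => {
    const results = []; // collect in memory instead of Dataset.pushData()

    const crawler = new CheerioCrawler({
        async requestHandler({ $, request }) {
            // Hypothetical extraction: pull mailto: links from the page.
            $('a[href^="mailto:"]').each((_, el) => {
                results.push({
                    url: request.url,
                    email: $(el).attr('href').replace('mailto:', ''),
                });
            });
        },
    });

    await crawler.run([req.query.url]);
    res.json(results); // respond with the scraped data directly
});

app.listen(3000);
```

For repeated calls in one process you would likely want a uniquely named request queue per request, since the default queue persists handled URLs between runs.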

There is a major problem, Crawlee is unable to bypass the Cloudflare protecti...

@Helper @gahabeen There is a major problem: Crawlee is unable to bypass the Cloudflare protection (captcha solution tried 5 times). The useChrome method was tried and failed. Manual login was successful when done in Chrome (outside of Node, and also tried with incognito mode, etc.) https://abrahamjuliot.github.io/creepjs/ Despite Crawlee receiving a higher trust score from the Chrome browser I am currently using, it is unable to pass the Cloudflare page....

Waiting for CF bot check

I'm trying to pass CF's bot check using Firefox without much luck. I found the thread about using Firefox to get cookies for Cheerio, but I need to use Firefox all the way. The issue I'm running into is that the CF bot page gives a 403, which causes Crawlee to think it's a bad request. I was able to use errorHandler to wait out the bot check, but now I can't find a way to keep the CF cookies for the session. If I do session.setCookies(...) inside the errorHandler, nothing gets stored and the retry connection uses a new session. I also tried session.markGood() but it didn't help. Any ideas?...
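One hedged idea: by default the session pool treats 401/403/429 responses as "blocked" and retires the session, which would explain the cookies disappearing on retry. SessionPool accepts a blockedStatusCodes option, so excluding 403 may let the session (and its cookies) survive the bot-check response. A sketch under that assumption, using Playwright's Firefox:

```js
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: { launcher: firefox }, // use Firefox "all the way"
    persistCookiesPerSession: true, // keep cookies tied to their session
    sessionPoolOptions: {
        // The default blocked codes include 403, which retires the session
        // and discards its cookies; an empty list keeps the session alive.
        blockedStatusCodes: [],
    },
    async requestHandler({ page, session }) {
        // ... wait out the CF check, then scrape
    },
});
```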

How to add data to the SDK_CRAWLER_STATISTICS

I want to add some counts to the SDK_CRAWLER_STATISTICS JSON. I found a solution by creating my own record, like this: await KeyValueStore.setValue("statistics", statistics); but I would prefer to add it to the existing statistics.
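There doesn't appear to be a public API for injecting custom fields into the crawler's statistics, so one hedged workaround is to merge extra counts into the persisted record after the run finishes. The SDK_CRAWLER_STATISTICS_0 key name is an assumption based on the default statistics instance:

```js
import { CheerioCrawler, KeyValueStore } from 'crawlee';

const myCounts = { pdfSkipped: 0, loginPages: 0 }; // custom counters

const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        // ... increment myCounts as needed
    },
});

await crawler.run(['https://example.com']);

// Merge the custom counts into the persisted statistics record.
const store = await KeyValueStore.open();
const stats = await store.getValue('SDK_CRAWLER_STATISTICS_0');
await store.setValue('SDK_CRAWLER_STATISTICS_0', { ...stats, ...myCounts });
```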

How can I skip .pdf files in PuppeteerCrawler

I want to skip all .pdf and .docx files from crawling.
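A hedged sketch using enqueueLinks' transformRequestFunction, which drops a request when it returns false (the extension list mirrors the question):

```js
import { PuppeteerCrawler } from 'crawlee';

const SKIPPED_EXTENSIONS = ['.pdf', '.docx'];

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, enqueueLinks }) {
        await enqueueLinks({
            // Returning false from the transform skips the request entirely.
            transformRequestFunction: (request) => {
                const path = new URL(request.url).pathname.toLowerCase();
                if (SKIPPED_EXTENSIONS.some((ext) => path.endsWith(ext))) return false;
                return request;
            },
        });
    },
});
```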

get stats

How to get stats after the run succeeds.
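A hedged pointer: crawler.run() resolves with the final statistics object, so the stats are available right after the run completes:

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $ }) { /* ... */ },
});

// run() resolves with the final statistics once the crawl finishes.
const stats = await crawler.run(['https://example.com']);
console.log(stats.requestsFinished, stats.requestsFailed);
```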

How to increase memory of PuppeteerCrawler

I run the crawler and it shows a warning: "WARN PuppeteerCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 4141 MB of 3932 MB (105%). Consider increasing available memory." How to increase this memory...
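A hedged note: the memory budget the Snapshotter measures against is configurable. In Crawlee this is the memoryMbytes configuration option, also settable via the CRAWLEE_MEMORY_MBYTES environment variable (the 8192 MB value is an arbitrary example):

```js
import { PuppeteerCrawler, Configuration } from 'crawlee';

// Raise the memory budget (in MB) that the autoscaled pool works with.
// Equivalent to setting CRAWLEE_MEMORY_MBYTES=8192 in the environment.
Configuration.getGlobalConfig().set('memoryMbytes', 8192);

const crawler = new PuppeteerCrawler({
    async requestHandler({ page }) { /* ... */ },
});
```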

Use page.on('request') in PuppeteerCrawler

How can I do that with the PuppeteerCrawler?

```
page.on('request', async (request) => {
    // check if request.url contains any of the google domains...
```
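A hedged sketch: PuppeteerCrawler exposes the page before navigation via preNavigationHooks, which is where request interception is normally wired up (the google.com filter is an assumption based on the comment in the question):

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Interception must be enabled before listening for requests.
            await page.setRequestInterception(true);
            page.on('request', (request) => {
                // Example filter: block requests to Google domains.
                if (new URL(request.url()).hostname.endsWith('google.com')) {
                    return request.abort();
                }
                request.continue();
            });
        },
    ],
    async requestHandler({ page }) { /* ... */ },
});
```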

Bet 365 crawler

I'm having some issues web scraping the Bet365 website. Does anyone know how to bypass the Bet365 security?

Is there a way to close the browser in PuppeteerCrawler?

My crawler got stuck getting request timeouts with a concurrency of 20. If I could close the browser on request timeout, that could solve the issue.
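A hedged sketch: each browser-based crawling context carries a browserController, so an errorHandler (which runs before a failed request is retried) can force-close the browser on timeout. Matching on the error message is a simplification:

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxConcurrency: 20,
    navigationTimeoutSecs: 60,
    // errorHandler runs before a failed request is retried.
    errorHandler: async ({ browserController, session, log }, error) => {
        if (error.message.includes('timeout')) {
            log.warning('Timeout; closing the browser so the retry gets a fresh one.');
            session?.retire();
            await browserController.close(); // shuts down the whole browser
        }
    },
    async requestHandler({ page }) { /* ... */ },
});
```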

Error while trying to use Apify

I'm trying to use one of the Actors to scrape Instagram data, but I keep getting this error: "TypeError: Cannot read properties of undefined (reading 'join')". I'm using TypeScript (Node.js), as you can see in the img. Does anyone know what might be the issue? Thank you.

Custom storage provider for RequestQueue?

It's probably a little out of the ordinary, but I'm building a crawler project that stores a pretty large pile of information in a database, rather than Crawlee's native KVS and DataSet. I'm curious whether there are any examples of using alternative backends to store Crawlee's own datasets and request queue. If possible, I'd love to consolidate the storage in one place, particularly since it would allow me to query and manage the request pool a bit more easily…
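A hedged pointer: Crawlee routes datasets, key-value stores, and request queues through a StorageClient implementation set on the Configuration object, which is the seam where a database-backed client would plug in. MyDatabaseStorageClient below is hypothetical; the built-in memory storage and @apify/storage-local are the existing reference implementations of the interface:

```js
import { CheerioCrawler, Configuration } from 'crawlee';
// Hypothetical: a class implementing Crawlee's StorageClient interface
// (request queues, key-value stores, datasets) on top of a database.
import { MyDatabaseStorageClient } from './my-database-storage-client.js';

const config = new Configuration({
    storageClient: new MyDatabaseStorageClient({ connectionString: 'postgres://…' }),
});

const crawler = new CheerioCrawler({
    async requestHandler({ $ }) { /* ... */ },
}, config);
```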

How to scroll page

Hi, I am using PuppeteerCrawler. How do I scroll to load more content in the handler?
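A hedged sketch using Crawlee's infiniteScroll helper, which is exposed on the Puppeteer crawling context and keeps scrolling until no new content loads (the timeout value is an arbitrary example):

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, infiniteScroll }) {
        // Scroll down repeatedly until the page stops growing
        // (or the time budget runs out).
        await infiniteScroll({ timeoutSecs: 30 });

        // ... extract the fully loaded content from `page`
    },
});
```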

Exclude query parameter URLs from crawl jobs

Hello, I'm currently researching methods to exclude URLs with query parameters, for example: https://domain[.]com/path?query1=test&query2=test2. I've tried hooking into the enqueueLinks options like:...
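A hedged sketch along the same lines as the .pdf filter above: enqueueLinks' transformRequestFunction can return false for any URL that carries a query string:

```js
await enqueueLinks({
    transformRequestFunction: (request) => {
        // Drop any URL that has a non-empty query string.
        if (new URL(request.url).search !== '') return false;
        return request;
    },
});
```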