Scraping auth-protected pages with CheerioCrawler, should I use Session?

I am trying to scrape some pages that only show certain information when the user is logged in (as a personal project, I understand the risks). At first, I tried to add a request to the queue that executes a POST request to perform the login, then save those cookies into the route handler's session using session.setCookiesFromResponse, and afterwards add the starting point for my scraping. However, for some reason the session is always empty (since the session was destroyed) and the next handler always receives a brand-new session, even though I set the following configuration on my crawler:
useSessionPool: true,
persistCookiesPerSession: true,
sessionPoolOptions: {
    maxPoolSize: 1,
},
I've seen that session.isBlocked() and session.isExpired() are always true, even before I set the cookies from the login response. Am I misunderstanding sessions? Are they supposed to be available only when running Apify Actors? If so, what kind of flow should I use to include the authentication headers in all my requests? Thank you in advance 🙂 PS: I want to run this scraper only in my local environment. PS2: Basically, what I want to do is something similar to the Apify Store scraping tutorial (https://crawlee.dev/docs/introduction/scraping), but using CheerioCrawler, and imagining that the Apify Actor pages are auth-protected so you need login cookies. How would you do it then?
20 Replies
harsh-harlequin
harsh-harlequin•3y ago
For fine-grained control of sessions, it's better to set your cookies in the preNavigationHooks. "Sessions" are kind of a misnomer, since they are picked at random and you can't handpick one unless you use createSessionFunction.
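A minimal sketch of what that preNavigationHooks approach could look like with CheerioCrawler. The cookie string and URL below are placeholders; in practice they would come from your own login request and target site.

import { CheerioCrawler } from 'crawlee';

// Placeholder value: in practice, build this from the Set-Cookie header of the login response.
const authCookie = 'PHPSESSID=xxx; dle_user_id=123456';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        async ({ request }) => {
            // Attach the auth cookies to every outgoing request.
            request.headers = { ...request.headers, Cookie: authCookie };
        },
    ],
    requestHandler: async ({ $, request, log }) => {
        log.info(`Scraped ${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.website/protected-page']);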
wise-white
wise-whiteOP•3y ago
So, if I understand correctly, the flow would basically be: 1 - Execute an HTTP request before initialising the crawler to get the auth cookies. 2 - Define the crawler and add the cookies from step 1 directly in preNavigationHooks. 3 - Run the crawler. Is that right, or could it be done in a better way? Because I was thinking: if I define a createSessionFunction, and the session is going to be created and restored after every request, wouldn't I be calling the login action once per page request?
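A rough sketch of that three-step flow, assuming got-scraping is used for the standalone login call; the login URL and form field names are invented for illustration, not taken from the thread.

import { gotScraping } from 'got-scraping';
import { CheerioCrawler } from 'crawlee';

// 1 - Log in once, before the crawler starts.
const loginResponse = await gotScraping({
    url: 'https://example.website/?do=login',
    method: 'POST',
    form: { login_name: 'fake-username', login_password: 'fake-password' },
});
const setCookieHeaders = loginResponse.headers['set-cookie'] ?? [];
// Keep only the name=value pairs, dropping attributes such as Path or Expires.
const cookieHeader = setCookieHeaders.map((c) => c.split(';')[0]).join('; ');

// 2 - Define the crawler and attach the cookies in a pre-navigation hook.
const crawler = new CheerioCrawler({
    preNavigationHooks: [
        async ({ request }) => {
            request.headers = { ...request.headers, Cookie: cookieHeader };
        },
    ],
    requestHandler: async ({ $, log }) => {
        log.info(`Title: ${$('title').text()}`);
    },
});

// 3 - Run the crawler against the protected pages.
await crawler.run(['https://example.website/protected-page']);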
harsh-harlequin
harsh-harlequin•3y ago
You can have two routes: one that does the auth itself, once, and then you keep that auth until it's invalid. You'll need custom logic to define what counts as valid or not.
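A sketch of that two-route idea. The selector, URLs and form fields are placeholders, and it assumes the captured cookies get attached to outgoing requests via a pre-navigation hook like the one sketched above.

import { createCheerioRouter } from 'crawlee';

// Holds the Cookie header once the auth route has run; a pre-navigation hook
// would read this and attach it to every outgoing request.
let authCookie: string | undefined;

const router = createCheerioRouter();

router.addHandler('auth', async ({ response, log }) => {
    const setCookieHeaders = response.headers['set-cookie'] ?? [];
    authCookie = setCookieHeaders.map((c) => c.split(';')[0]).join('; ');
    log.info('Login finished, auth cookies captured.');
});

router.addHandler('product', async ({ $, crawler, log }) => {
    // Custom validity check: look for an element only logged-in users see (placeholder selector).
    const isLoggedIn = $('a[href*="do=logout"]').length > 0;
    if (!isLoggedIn) {
        log.warning('Auth looks invalid, enqueueing a fresh login...');
        await crawler.addRequests([{
            url: 'https://example.website/?do=login',
            method: 'POST',
            payload: 'login_name=fake-username&login_password=fake-password', // placeholder fields
            headers: { 'content-type': 'application/x-www-form-urlencoded' },
            label: 'auth',
            uniqueKey: `login-${Date.now()}`, // avoid request deduplication for repeated logins
        }]);
        return;
    }
    // ...scrape the protected data here...
});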
equal-aqua
equal-aqua•3y ago
Your understanding is totally correct, and if the approach is not working it means the SDK does not handle it properly, maybe because of something site-specific or maybe just a bug. Either way, make sure the login was actually performed: usually there are session cookies related to the login, so check whether you are getting them in the response, or whether you are redirected to the logged-in content.
wise-white
wise-whiteOP•3y ago
So my first approach was to run the crawler with an array of requests like this: [getLoginRequestOptions(), getStartingUrlToScrape()], and I set up two different route handlers (labeled auth and product). When the auth route handler was called (the first request), I used the session object directly from the handler definition and called session.setCookiesFromResponse (the server returns an OK response with the login cookies set). However, before I even called that method, the session was already marked to be deleted; specifically, I remember session.errorScore was 3 before I even called the method (and it was the first URL, I even tried setting the array of requests to only the auth request). That was the most confusing thing to me: why is errorScore 3 if I have only made a single request?
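For context, the two request factories mentioned above presumably look something like this; the URL, form fields and labels are guesses for illustration, not the original helpers.

import type { RequestOptions } from 'crawlee';

// Hypothetical reconstruction of the helpers referenced above.
const getLoginRequestOptions = (): RequestOptions => ({
    url: 'https://example.website/?do=login',
    method: 'POST',
    payload: 'login_name=fake-username&login_password=fake-password',
    headers: { 'content-type': 'application/x-www-form-urlencoded' },
    label: 'auth',
});

const getStartingUrlToScrape = (): RequestOptions => ({
    url: 'https://example.website/catalog',
    label: 'product',
});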
MEE6
MEE6•3y ago
@Welcius just advanced to level 1! Thanks for your contributions! 🎉
equal-aqua
equal-aqua•3y ago
An OK response does not mean you are logged in. For example, if the website tracks new devices on login, you might hit an email confirmation code after logging in, and at the HTTP level it is still a 200 response. If you can access data as a logged-in user, i.e. the dashboard is reachable, then you are not supposed to change the session in code. The expected flow: do the login, check whether you are logged in (the protected content is available to the scraper), and if yes, continue with the rest of the requests.
wise-white
wise-whiteOP•3y ago
This is the Set-Cookie header I receive after executing the POST request (I censored some fields):
PHPSESSID=xxx; path=/; secure; HttpOnly, dle_user_id=123456; expires=Mon, 02-Oct-2023 21:05:26 GMT; Max-Age=31536000; path=/; secure; HttpOnly, dle_password=xxx; expires=Mon, 02-Oct-2023 21:05:26 GMT; Max-Age=31536000; path=/; secure; HttpOnly, dle_newpm=0; expires=Mon, 02-Oct-2023 21:05:26 GMT; Max-Age=31536000; path=/; secure; HttpOnly
equal-aqua
equal-aqua•3y ago
In other words, when you log in through a browser, you normally verify the login by some selector or URL specific to the logged-in user, right? The same is true for Cheerio, with the difference that you are not executing JavaScript.
wise-white
wise-whiteOP•3y ago
When I log in through the browser, it just makes a POST request to https://example.website/?do=login. The payload is just a URL-encoded form with the username and the password (literally the same thing I do when I craft the login request). I did attach the cookies in the preNavigationHooks and it works, but it bugs me because the session problem looks odd.
equal-aqua
equal-aqua•3y ago
OK, it might be faster to check what you are getting on the second step: leave the login request as is and check the second request, for a page with data available only to logged-in users. If the cookies are not set in the headers, it means the session is not being handled properly by the crawler for whatever reason.
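One way to inspect that, sketched here as a small debugging crawler; it assumes Crawlee's pre-navigation hook context and the Session.getCookieString() helper, and simply logs what the session would attach before each navigation.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: { maxPoolSize: 1 },
    maxConcurrency: 1,
    preNavigationHooks: [
        async ({ request, session, log }) => {
            // The Cookie header itself may only be applied by the crawler after this hook runs,
            // so inspect the session's cookie jar rather than request.headers.
            const cookieString = session?.getCookieString(request.url) ?? '(no session)';
            log.info(`Cookies the session holds for ${request.url}: ${cookieString}`);
        },
    ],
    requestHandler: async ({ request, log }) => {
        log.info(`Handled ${request.url}`);
    },
});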
wise-white
wise-whiteOP•3y ago
I did check it: the cookies were not in the headers. I did a bit of research, and basically what happens is that the session is flagged as no longer valid, and a new one is created before the second route handler runs.
equal-aqua
equal-aqua•3y ago
Thanks for the clarification, then it's an SDK bug, since the session is supposed to be kept for further requests. A session is marked as "bad" when the request handler function fails; if there were no failures and your session was removed, it's a bug. Please consider reporting an issue on GitHub.
wise-white
wise-whiteOP•3y ago
This is a log of what is happening:
INFO Starting the crawl
DEBUG AutoscaledPool:Snapshotter: Setting max memory of this run to 8011 MB. Use the CRAWLEE_MEMORY_MBYTES or CRAWLEE_AVAILABLE_MEMORY_RATIO environment variable to override it.
DEBUG SessionPool: Created new Session - session_DJsukzcfNx
DEBUG Page opened. {"label":"auth","url":"https://example.website/?do=login"}
INFO Login was successful! Saving session cookies...
INFO session is not null
INFO sessionId is session_DJsukzcfNx
INFO >>> whole cookiejar where I can see the cookies set
INFO session.usageCount: 2
INFO session.isBlocked: true
INFO session.isExpired: true
INFO session.isMaxUsageCountReached: false
INFO session.errorScore: 3
INFO session.maxErrorScore: 3
DEBUG SessionPool: Removed Session - session_DJsukzcfNx
DEBUG SessionPool: Created new Session - session_XRLwcX3yDc
and this is some rough code for the auth route handler:
import { createCheerioRouter, CriticalError } from 'crawlee';

const router = createCheerioRouter();

router.addHandler('auth', async ({ $, response, session, log }) => {
    // The error box is only rendered when the credentials were rejected.
    const loginWasSuccessful = $('h3.errorberrors').length === 0;
    if (!loginWasSuccessful) {
        throw new CriticalError('Login failed: please check the supplied scraper credentials.');
    }
    log.info('Login was successful! Saving session cookies...');
    log.info(`session is ${session != null ? 'not null' : 'null'}`);
    session?.setCookiesFromResponse(response);
    if (!session) {
        return;
    }
    log.info(`sessionId is ${session.id}`);
    // Dump the session state to see why it gets discarded afterwards.
    log.info(JSON.stringify(session.cookieJar));
    log.info(`session.usageCount: ${session.usageCount}`);
    log.info(`session.isBlocked: ${session.isBlocked()}`);
    log.info(`session.isExpired: ${session.isExpired()}`);
    log.info(`session.isMaxUsageCountReached: ${session.isMaxUsageCountReached()}`);
    // errorScore and maxErrorScore are not exposed in the public typings, hence the ts-ignore.
    // @ts-ignore
    log.info(`session.errorScore: ${session.errorScore}`);
    // @ts-ignore
    log.info(`session.maxErrorScore: ${session.maxErrorScore}`);
});
I am going to create an issue on GitHub in the next few days then! Thank you and Paulo for the help, I was genuinely doubting whether I was understanding sessions.
equal-aqua
equal-aqua•3y ago
Thanks for your patience and feedback. Just one note: you did not forget to set max concurrency to 1, and you are not firing both requests at the same time, right?
wise-white
wise-whiteOP•3y ago
This is the exact config as of now:
const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 1,
    },
    requestHandler: router,
    maxConcurrency: 1,
    maxRequestsPerCrawl: 5,
    log: (() => {
        defaultLog.setLevel(LogLevel.DEBUG)
        return defaultLog
    })()
});
and the requests are being triggered like this:
await crawler.run([
    getLoginPageRequestOptions('fake-username', 'fake-password'),
    getSearchPageRequestOptions()
]);
Also, let me mention that you guys are doing awesome work with this library. I am very happy to use it, it covers all of my use cases, and it's making scraping fun again for me!
MEE6
MEE6•3y ago
@Welcius just advanced to level 2! Thanks for your contributions! 🎉
equal-aqua
equal-aqua•3y ago
Thanks for all your efforts, patience and feedback. I personally try to avoid doing the login from the scraper, since it's not the best approach; the safer way is to receive the auth cookies as input, but for your case that doesn't look necessary.
quickest-silver
quickest-silver•3y ago
It probably won't help, but try increasing the session limits:
sessionPoolOptions: {
    maxPoolSize: 1,
    sessionOptions: {
        maxUsageCount: 9999,
        maxErrorScore: 9999,
        maxAgeSecs: 99999
    },
},
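For completeness, a sketch of where those options sit in the full crawler config shown earlier in the thread; the inline request handler is a stand-in for the real router.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    maxConcurrency: 1,
    sessionPoolOptions: {
        maxPoolSize: 1,
        sessionOptions: {
            maxUsageCount: 9999,
            maxErrorScore: 9999,
            maxAgeSecs: 99999,
        },
    },
    requestHandler: async ({ request, log }) => {
        log.info(`Handled ${request.url}`); // replace with the router from the earlier config
    },
});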
frail-apricot
frail-apricot•3y ago
@Welcius Did you file the GitHub issue? If so, can you please share the link so I can follow up? I'm running into a similar situation and I wonder if you found a solution.
