Crawlee & Apify•5mo ago

How to implement persistent login with crawlee-js/playwright?

I need to scrape content from multiple pages on a social network (x.com) that requires authentication. Where should I implement the login so that it happens before the crawler follows any URLs, and so that the session is persisted and reused for as long as it remains valid?

8 Replies

flat-fuchsia•5mo ago

await page.context().storageState({ path: authFilePath })

Look up storageState() in the Playwright docs.
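A minimal sketch of how that call might be wrapped, assuming the login steps live in a caller-supplied function. The helper name loginAndSaveState and the performLogin parameter are hypothetical; the only Playwright calls used are page.context() and storageState({ path }).

```javascript
// Hypothetical helper: run a login routine, then dump cookies and
// localStorage to a JSON file via Playwright's storageState().
async function loginAndSaveState(page, authFilePath, performLogin) {
    // performLogin is supplied by the caller (fill the form, submit, wait).
    await performLogin(page);
    // Writes { cookies, origins } to authFilePath and also returns the object.
    return page.context().storageState({ path: authFilePath });
}
```

With a real Playwright page this could be called as loginAndSaveState(page, 'auth.json', myLoginSteps), where myLoginSteps contains the site-specific form handling.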

sensitive-blue•4mo ago

That gets the cookie and localStorage state, but how do you load it into a new session? I'm facing the same dilemma.

flat-fuchsia•4mo ago

You can store your cookies in a named KV store and then modify your session with the createSessionFunction option: https://crawlee.dev/api/next/core/interface/SessionPoolOptions#createSessionFunction Or you can do the same (update your requests with those cookies) in preNavigationHooks (https://crawlee.dev/api/next/browser-crawler/interface/BrowserCrawlerOptions#preNavigationHooks).
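The preNavigationHooks route could look roughly like the sketch below. It assumes the saved value is a Playwright storageState() object, whose cookies array can be passed to context.addCookies() as is. The names makeCookieHook and loadState are hypothetical.

```javascript
// Hypothetical factory: builds a preNavigationHook that copies saved cookies
// into the browser context before each navigation.
// `loadState` is any async function returning a storageState object
// ({ cookies, origins }) or null, e.g. a read from a named KV store.
function makeCookieHook(loadState) {
    return async ({ page }) => {
        const state = await loadState();
        if (!state || !Array.isArray(state.cookies) || state.cookies.length === 0) {
            return; // nothing saved yet, navigate without auth cookies
        }
        // Cookies from storageState() already have the shape addCookies() accepts.
        await page.context().addCookies(state.cookies);
    };
}
```

It would be wired in as preNavigationHooks: [makeCookieHook(() => authStorage.getValue('state'))]. This restores cookies only; localStorage is not touched.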

sensitive-blue•4mo ago

What about local storage? I'm surprised there seems to be no easy way to seed a session with local data. Also, setCookie wants a raw cookie string and a URL rather than the format that storageState() gives you.

@osenvosem I have found a solution, though it isn't great. If you set useIncognitoPages to true when creating your scraper, you can modify the pageOptions to set your cookies and localStorage inside a prePageCreateHook:

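import { KeyValueStore, PlaywrightCrawler } from 'crawlee';
import { chromium } from 'playwright';
import { router } from './routes.js'; // assumed location of the router

// Named store holding the storageState() output saved after login.
const authStorage = await KeyValueStore.open('auth');

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: chromium,
        // Each page gets its own browser context, so pageOptions are applied.
        useIncognitoPages: true,
    },
    requestHandler: router,
    browserPoolOptions: {
        prePageCreateHooks: [
            async (pageId, browserController, pageOptions) => {
                if (!pageOptions) { // pageOptions is only exposed in incognito
                    throw new Error('no page options');
                }
                // Seed the new context with the saved cookies and localStorage.
                pageOptions.storageState = (await authStorage.getValue('state')) ?? undefined;
            },
        ],
    },
});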
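For the setCookie mismatch mentioned above, one option is to convert each storageState() cookie into a raw cookie string plus a URL. The sketch below is an illustration under that assumption; toRawCookie and cookieUrl are hypothetical helpers, not part of Crawlee or Playwright.

```javascript
// Build a Set-Cookie style string from a storageState() cookie object.
function toRawCookie(cookie) {
    const parts = [`${cookie.name}=${cookie.value}`];
    if (cookie.domain) parts.push(`Domain=${cookie.domain}`);
    if (cookie.path) parts.push(`Path=${cookie.path}`);
    // storageState() uses -1 for session cookies, otherwise Unix seconds.
    if (cookie.expires > 0) {
        parts.push(`Expires=${new Date(cookie.expires * 1000).toUTCString()}`);
    }
    if (cookie.httpOnly) parts.push('HttpOnly');
    if (cookie.secure) parts.push('Secure');
    if (cookie.sameSite) parts.push(`SameSite=${cookie.sameSite}`);
    return parts.join('; ');
}

// Derive a URL the cookie applies to (leading dot on the domain dropped).
function cookieUrl(cookie) {
    const host = cookie.domain.replace(/^\./, '');
    return `${cookie.secure ? 'https' : 'http'}://${host}${cookie.path || '/'}`;
}
```

Each pair could then be handed to session.setCookie(toRawCookie(c), cookieUrl(c)) for every cookie in the saved state. This still leaves localStorage unhandled.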

Unfortunately, this seems to add a fair chunk of overhead, though.

Crawl with incognito:

INFO PlaywrightCrawler: Final request statistics:

{"requestsFinished":37,"requestsFailed":0,"retryHistogram":[37],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":3514,"requestsFinishedPerMinute":53,"requestsFailedPerMinute":0,"requestTotalDurationMillis":130001,"requestsTotal":37,"crawlerRuntimeMillis":42227}

Without incognito (and the hook):

INFO PlaywrightCrawler: Final request statistics:

{"requestsFinished":37,"requestsFailed":0,"retryHistogram":[37],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":2238,"requestsFinishedPerMinute":67,"requestsFailedPerMinute":0,"requestTotalDurationMillis":82813,"requestsTotal":37,"crawlerRuntimeMillis":32955}
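To put a number on that overhead, the two statistics lines can be compared directly. The snippet below uses only figures copied from the logs above; percentIncrease is a hypothetical helper. It is a single 37-request run per mode, so the result is indicative at best.

```javascript
// Figures copied from the two "Final request statistics" lines above.
const incognito = { requestAvgFinishedDurationMillis: 3514, crawlerRuntimeMillis: 42227 };
const shared = { requestAvgFinishedDurationMillis: 2238, crawlerRuntimeMillis: 32955 };

// Relative increase of `value` over `baseline`, as a whole percentage.
function percentIncrease(value, baseline) {
    return Math.round((value / baseline - 1) * 100);
}

const perRequest = percentIncrease(
    incognito.requestAvgFinishedDurationMillis,
    shared.requestAvgFinishedDurationMillis,
);
const wholeRun = percentIncrease(incognito.crawlerRuntimeMillis, shared.crawlerRuntimeMillis);

console.log(`per request: +${perRequest}%, whole run: +${wholeRun}%`);
// prints "per request: +57%, whole run: +28%"
```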

sensitive-blue•4mo ago

In the future, it may be possible to do the same using experimentalContainers instead of useIncognitoPages, though it doesn't seem to work yet.

environmental-rose•4mo ago

You can implement the login in the requestHandler (named handlePageFunction in the older Apify SDK) of Crawlee's PlaywrightCrawler. Handle the login first, before navigating to any other URLs, and then persist the cookies or local storage so the session can be reused across subsequent requests.
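A rough sketch of that login-then-persist flow, assuming the state is kept in a key-value store with getValue and setValue methods (as Crawlee's KeyValueStore provides). The helper name ensureLoggedIn and the performLogin parameter are hypothetical. The sketch does not check whether the site still accepts a saved state.

```javascript
// Hypothetical helper for use inside a requestHandler.
// `store` needs getValue(key) and setValue(key, value);
// `performLogin` holds the site-specific login steps.
async function ensureLoggedIn(page, store, performLogin, key = 'state') {
    const saved = await store.getValue(key);
    if (saved) return saved; // a state was already persisted, reuse it

    await performLogin(page);
    // Called without a path, storageState() just returns { cookies, origins }.
    const state = await page.context().storageState();
    await store.setValue(key, state);
    return state;
}
```

Deleting the stored key forces a fresh login on the next call. With several pages running concurrently, more than one handler may log in before the first state is saved, so a lock or a dedicated login request may be needed.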
