Is there a way to initiate Crawlee crawl + scraping jobs from a server?

Context:
- I'm currently using Playwright in my Next.js API routes and persist some data in my database (Postgres)
- since I need IP rotation with session management though, I'd love to offload the scraping to Crawlee
- I'm also considering Apify as the platform to deploy this Crawlee scraper to (as that seems to be the recommended setup?)
Is there an example somewhere of how to trigger my Crawlee scraping job on Apify from another server?
Edit: I'm cross-posting to #apify-platform as it seems like I'm interested in triggering a serverless Actor.
8 Replies
absent-sapphire (OP) · 2y ago
And can I share a browser session between runs?
absent-sapphire (OP) · 2y ago
I'd like to execute the first scraping step and then schedule the next ones to run in the background. The background jobs need the same session as the first run, though. Not sure if Apify supports streaming responses or returning early data or something?
Pepa J · 2y ago
schedule the next ones to run in the background.
This is possible by using the API.
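For example (a rough sketch; the token, Actor ID, and input fields below are placeholders), you can start a run from your own server with the apify-client package and then read the results from the run's default dataset:
import { ApifyClient } from 'apify-client';

// token and Actor ID are placeholders – replace with your own values
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// start the Actor run and wait for it to finish
const run = await client.actor('username/my-crawlee-scraper').call({
    startUrls: [{ url: 'https://example.com' }],
});

// read the scraped items from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();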
The background jobs need the same session as the first run though
You need to pass the sessionId from the first run to the others and then always use the same session based on that id.
Not sure if apify supports streaming responses or returning early data or something?
It is not supported by default, but there is nothing stopping you from implementing it yourself: you could stream the data to a specific endpoint, or, perhaps a better solution, use websockets.
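As a rough sketch of the first option (CALLBACK_URL and the scraped fields are just placeholders), the request handler could POST each item back to your server as soon as it is scraped, instead of waiting for the whole run to finish:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, pushData }) {
        const title = await page.title();
        // store the item in the run's dataset as usual
        await pushData({ url: request.url, title });

        // early delivery: send the item to the calling server immediately
        // (requires Node 18+ for the global fetch)
        await fetch(process.env.CALLBACK_URL, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ url: request.url, title }),
        });
    },
});

await crawler.run(['https://example.com']);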
absent-sapphire (OP) · 2y ago
Thank you, I'll try out the session sharing first. 🙏 Okay, let's see if I get this right (I haven't found a way to run it successfully so far). If I provide a sessionId as an input to an actor, so that I run it via actor.run({ sessionId }), then, within the actor, I set the persistStateKey of the SessionPool based on this very input:
import { Actor } from "apify";
import { PlaywrightCrawler } from "crawlee";

await Actor.init();
// note the parentheses: getInput() returns a promise
const sessionId = (await Actor.getInput())?.sessionId;

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,
    sessionPoolOptions: {
        // set the persistence key based on the input
        persistStateKey: sessionId,
        sessionOptions: {
            maxUsageCount: 60,
            maxErrorScore: 1,
        },
    },
    maxConcurrency: 50,
    requestHandler: router, // router defined elsewhere (e.g. via createPlaywrightRouter())
});
Can I then have a separate Actor use the same browser session?
// a different actor that also loads the persistStateKey from the input
aDifferentActor.run({ sessionId })
The different actor would then use useState():
import { Actor } from "apify";

await Actor.init();
const sessionId = (await Actor.getInput())?.sessionId;

// now, does this different actor use the same session as the other one? 🤔
const state = await Actor.useState(sessionId);
Will these two separate actors then share the same browser session?
Pepa J · 2y ago
I was thinking more about creating a custom createSessionFunction:
import { Actor } from 'apify';
import { PuppeteerCrawler, Session } from 'crawlee';

// named store shared between Actors
const sessionStore = await Actor.openKeyValueStore('session-store');

const crawler = new PuppeteerCrawler({
    // ...
    sessionPoolOptions: {
        createSessionFunction: async (sessionPool) => {
            const sessionId = await sessionStore.getValue('sessionId');
            if (!sessionId) {
                // first run: create a fresh session and store its id for the other Actors
                const session = new Session({ sessionPool });
                await sessionStore.setValue('sessionId', session.id);
                return session;
            }
            // subsequent runs: recreate the session with the shared id
            return new Session({ sessionPool, id: sessionId });
        },
    },
});
Now all the Actors with this configuration should share the same session.
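And to tie it back to the background scheduling: the first Actor can then start the follow-up Actor itself without waiting for it (the Actor name and input below are made up). Since both Actors open the same named 'session-store', the follow-up run picks up the stored session id:
// at the end of the first Actor's run (Actor imported from 'apify' as above)
await Actor.start('username/background-scraper', {
    startUrls: [{ url: 'https://example.com/next-step' }],
});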
absent-sapphire (OP) · 2y ago
Thank you! 🙏
