Running crawlee multiple times with the same URL

Hi! I am trying to build a crawler using PuppeteerCrawler. The crawler is started by sending a POST request to an API endpoint, which is implemented with Azure Durable Functions. The first time I call the API it works as expected. The next time I call it, I get no output. This is the log output on the second run:
INFO PuppeteerCrawler: Initializing the crawler.
INFO PuppeteerCrawler: All requests from the queue have been processed, the crawler will shut down.
How do I configure Crawlee so that every call to the API runs a new crawl? Here is my current implementation. This function is called from an orchestrator function.
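import * as os from "os";
import { AzureFunction, Context } from "@azure/functions";
import { PuppeteerCrawler } from "crawlee";

const activityFunction: AzureFunction = async function (
    context: Context,
    crawlerParameters: CrawlingParameters
) {
    // Keep Crawlee's storage (including the request queue) in the temp directory.
    process.env.CRAWLEE_STORAGE_DIR = os.tmpdir();

    // Pages visited during this run.
    const nettsider = new Set();
    const crawler = new PuppeteerCrawler({
        async requestHandler({ request, page, enqueueLinks }) {
            nettsider.add({ title: await page.title(), url: request.url });
            // Skip links to PDF and DOC files (dots escaped so they match literally).
            await enqueueLinks({ exclude: [/\.pdf$/, /\.doc$/] });
        },
        maxRequestsPerCrawl: crawlerParameters.maxLenker,
    });
    await crawler.run([crawlerParameters.startUrl]);

    return Array.from(nettsider);
};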
2 Replies
NeoNomade, 2y ago
You have to use "useExtendedUniqueKey" on your requests.
wise-white (OP), 2y ago
Thanks! "useExtendedUniqueKey" didn't work for me because, in my case, the request URLs are identical. But you put me on the right track: I solved it by modifying the uniqueKey of each request to include the "invocationId" from Azure, which is unique for each run.
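
For anyone landing here later, the fix described above might look roughly like the sketch below. It is not the exact code from the thread: the helper name withInvocationKey and the RequestLike interface are made up for illustration, and the key format (invocation ID, a colon, then the URL) is just one possible choice. It assumes context.invocationId is available on the Azure Functions context and that the same helper is applied both to the start URL and to links added through the transformRequestFunction option of enqueueLinks.

```typescript
// Minimal shape of the request options this sketch needs; Crawlee's own
// request options carry more fields, which the spread below preserves.
interface RequestLike {
    url: string;
    uniqueKey?: string;
}

// Hypothetical helper: prefixes the uniqueKey with the invocation ID so the
// same URL counts as a new request in every run, while duplicates within a
// single run still map to the same key.
function withInvocationKey<T extends RequestLike>(request: T, invocationId: string): T {
    return { ...request, uniqueKey: `${invocationId}:${request.url}` };
}

// Intended wiring inside the activity function (assumed, not executed here):
//   await crawler.run([
//       withInvocationKey({ url: crawlerParameters.startUrl }, context.invocationId),
//   ]);
//   await enqueueLinks({
//       exclude: [/\.pdf$/, /\.doc$/],
//       transformRequestFunction: (req) => withInvocationKey(req, context.invocationId),
//   });

const first = withInvocationKey({ url: "https://example.com/" }, "run-1");
const second = withInvocationKey({ url: "https://example.com/" }, "run-2");
console.log(first.uniqueKey !== second.uniqueKey); // true
```

One trade-off to be aware of: a hand-built uniqueKey bypasses the URL normalization Crawlee applies when it derives the key itself, so URLs that differ only in trivial ways may be crawled more than once within a run.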
