❓ Help Needed: Downloading Linked PDF Files with Crawlee 🕸📥

Hello everyone, I need some help with Crawlee. I've been using CheerioCrawler to scrape pages, and I've managed to extract links and store page titles and URLs in a dataset. Now I want to add functionality to download linked files, such as PDFs, from the scraped pages, but I'm not sure how to do this natively with Crawlee. Here's my current code:
import { CheerioCrawler, Dataset } from 'crawlee';

// CheerioCrawler crawls the web using HTTP requests
// and parses HTML using the Cheerio library.
const crawler = new CheerioCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://cloudflare.net/events-and-presentations']);
Could anyone guide me on how to modify this code to download linked files, specifically PDFs, from the scraped pages? Any help would be appreciated, thank you!
5 Replies
deep-jade (OP) • 2y ago
can anyone help?
Pepa J • 2y ago
Hello @Alex Here is some code I haven't tested, but you may get the idea from it:
async requestHandler({ $, sendRequest }) {
    const urls = $('a[href]')
        .filter((_, el) => /\.pdf$/.test($(el).attr('href')!)) // keep only links ending with .pdf
        .map((_, el) => $(el).attr('href'))
        .toArray();

    for (const url of urls) { // `of`, not `in` — we want the values, not the indices
        // Do a request for the PDF file
        const pdfFileResponse = await sendRequest({ url });

        // Use the last path segment of the URL as the file name
        const fileName = url.split('/').reverse()[0];

        // Actor comes from the Apify SDK: import { Actor } from 'apify';
        await Actor.setValue(fileName, pdfFileResponse.rawBody, { contentType: 'application/pdf' });
    }
};
Basically, it will store all the PDFs to ./storage/key_value_stores/default when running locally.
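For completeness, here is an equally untested sketch of how that snippet could be folded into your original crawler using plain Crawlee (no Apify SDK): it swaps Actor.setValue for Crawlee's KeyValueStore, resolves relative hrefs against the page URL, and sanitizes the key, since key-value store keys only allow letters, digits, and !-_.'() characters. Treat it as a starting point, not a verified solution:

import { CheerioCrawler, Dataset, KeyValueStore } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, sendRequest, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Collect PDF links and resolve relative hrefs against the page URL.
        const pdfUrls = $('a[href]')
            .map((_, el) => $(el).attr('href')!)
            .toArray()
            .filter((href) => /\.pdf$/i.test(href))
            .map((href) => new URL(href, request.loadedUrl ?? request.url).href);

        const store = await KeyValueStore.open();
        for (const url of pdfUrls) {
            // Fetch the PDF with Crawlee's built-in HTTP client.
            const pdfFileResponse = await sendRequest({ url });

            // Keys may only contain a-z, A-Z, 0-9 and !-_.'(), so sanitize
            // the last path segment before using it as the key.
            const fileName = url.split('/').reverse()[0]
                .replace(/[^a-zA-Z0-9!\-_.'()]/g, '_');

            await store.setValue(fileName, pdfFileResponse.rawBody, {
                contentType: 'application/pdf',
            });
        }

        await enqueueLinks();
    },
});

await crawler.run(['https://cloudflare.net/events-and-presentations']);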
deep-jade (OP) • 2y ago
Hi Pepa, thx, very helpful! Do you have any hint on how to use Firebase Storage instead of the local key-value store? My goal is to analyze the PDFs with an LLM and store the results in a vector database.
Pepa J • 2y ago
I believe there would be an npm package for Firebase with proper documentation; I have no personal experience with it.
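In case it helps, a minimal untested sketch of what the upload might look like with the firebase-admin package: the bucket name and the pdfs/ path prefix are placeholders, the savePdfToFirebase helper is hypothetical, and it assumes credentials are provided via GOOGLE_APPLICATION_CREDENTIALS (Application Default Credentials). Please check the Firebase docs before relying on any of this:

import { initializeApp, applicationDefault } from 'firebase-admin/app';
import { getStorage } from 'firebase-admin/storage';

// Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service-account key;
// replace the bucket name with your own project's bucket.
initializeApp({
    credential: applicationDefault(),
    storageBucket: 'your-project.appspot.com',
});

// Hypothetical helper: upload a PDF buffer to Firebase Storage instead of
// the local key-value store.
async function savePdfToFirebase(fileName: string, pdfBuffer: Buffer): Promise<void> {
    await getStorage()
        .bucket()
        .file(`pdfs/${fileName}`) // the 'pdfs/' prefix is just an example
        .save(pdfBuffer, { contentType: 'application/pdf' });
}

Inside the request handler you would then call await savePdfToFirebase(fileName, pdfFileResponse.rawBody) in place of the setValue() call.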
deep-jade (OP) • 2y ago
Will look into that. Thank you very much, Pepa!
