❓ Help Needed: Downloading Linked PDF Files with Crawlee 🕸📥

Hello everyone, I need some help with Crawlee. I've been using CheerioCrawler to scrape pages, and I've managed to extract links and store page titles and URLs in a dataset. Now I want to add functionality to download linked files, such as PDFs, from the scraped pages, but I'm not sure how to do this natively with Crawlee. Here's my current code:
import { CheerioCrawler, Dataset } from 'crawlee';

// CheerioCrawler crawls the web using HTTP requests
// and parses HTML using the Cheerio library.
const crawler = new CheerioCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://cloudflare.net/events-and-presentations']);
Could anyone guide me on how to modify this code to download linked files, specifically PDFs, from the scraped pages? Any help would be appreciated, thank you!
5 Replies
deep-jade (OP) • 2y ago
can anyone help?
Pepa J • 2y ago
Hello @Alex Here is some code I haven't tested, but you may get the idea from it:
async requestHandler({ $, sendRequest }) {
    const urls = $('a[href]')
        .filter((_, el) => /\.pdf$/.test($(el).attr('href')!)) // keep only links ending with .pdf
        .map((_, el) => $(el).attr('href'))
        .toArray();

    for (const url of urls) { // `of`, not `in` — we want the values, not the indices
        // Do a request for the PDF file
        const pdfFileResponse = await sendRequest({ url });

        // Use the last path segment of the URL as the file name
        const fileName = url.split('/').reverse()[0];

        // Actor comes from the Apify SDK: import { Actor } from 'apify';
        await Actor.setValue(fileName, pdfFileResponse.rawBody, { contentType: 'application/pdf' });
    }
};
Basically, it will store all the PDFs to ./storage/key_value_stores/default when running locally.
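For completeness, here is an equally untested sketch of how that snippet could be folded into your original crawler using plain Crawlee (no Apify SDK): it swaps Actor.setValue for Crawlee's KeyValueStore, resolves relative hrefs against the page URL, and sanitizes the key, since key-value store keys only allow letters, digits, and !-_.'() characters. Treat it as a starting point, not a verified solution:

import { CheerioCrawler, Dataset, KeyValueStore } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, sendRequest, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Collect PDF links and resolve relative hrefs against the page URL.
        const pdfUrls = $('a[href]')
            .map((_, el) => $(el).attr('href')!)
            .toArray()
            .filter((href) => /\.pdf$/i.test(href))
            .map((href) => new URL(href, request.loadedUrl ?? request.url).href);

        const store = await KeyValueStore.open();
        for (const url of pdfUrls) {
            // Fetch the PDF with Crawlee's built-in HTTP client.
            const pdfFileResponse = await sendRequest({ url });

            // Keys may only contain a-z, A-Z, 0-9 and !-_.'(), so sanitize
            // the last path segment before using it as the key.
            const fileName = url.split('/').reverse()[0]
                .replace(/[^a-zA-Z0-9!\-_.'()]/g, '_');

            await store.setValue(fileName, pdfFileResponse.rawBody, {
                contentType: 'application/pdf',
            });
        }

        await enqueueLinks();
    },
});

await crawler.run(['https://cloudflare.net/events-and-presentations']);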
deep-jade (OP) • 2y ago
Hi Pepa, thx, very helpful! Do you have any hint on how to use Firebase Storage instead of the local key-value store? My goal is to analyze the PDFs with an LLM and store the results in a vector database.
Pepa J • 2y ago
I believe there would be an npm package for Firebase with proper documentation; I have no personal experience with it.
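In case it helps, a minimal untested sketch of what the upload might look like with the firebase-admin package: the bucket name and the pdfs/ path prefix are placeholders, the savePdfToFirebase helper is hypothetical, and it assumes credentials are provided via GOOGLE_APPLICATION_CREDENTIALS (Application Default Credentials). Please check the Firebase docs before relying on any of this:

import { initializeApp, applicationDefault } from 'firebase-admin/app';
import { getStorage } from 'firebase-admin/storage';

// Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service-account key;
// replace the bucket name with your own project's bucket.
initializeApp({
    credential: applicationDefault(),
    storageBucket: 'your-project.appspot.com',
});

// Hypothetical helper: upload a PDF buffer to Firebase Storage instead of
// the local key-value store.
async function savePdfToFirebase(fileName: string, pdfBuffer: Buffer): Promise<void> {
    await getStorage()
        .bucket()
        .file(`pdfs/${fileName}`) // the 'pdfs/' prefix is just an example
        .save(pdfBuffer, { contentType: 'application/pdf' });
}

Inside the request handler you would then call await savePdfToFirebase(fileName, pdfFileResponse.rawBody) in place of the setValue() call.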
deep-jade (OP) • 2y ago
Will look into that. Thank you very much, Pepa!
