Scrape JSON and HTML responses in different handlers

I don't know how to scrape a website that returns both JSON and HTML responses. My scraper needs to:
1. Send a request and parse a JSON response that contains a list of URLs to enqueue.
2. Scrape those URLs as HTML, using cheerio or whatever else is required to do so.
fair-rose · 7mo ago
Hey,
For your task, I'd use two request handlers:
- a JSON handler, which handles the JSON response: it parses it and enqueues the HTML requests
- an HTML handler, which parses the HTML response as usual with cheerio's $
JSON and HTML here are request labels; you can read more about labels here. Basically, if you label a request with e.g. the HTML label, it will be handled by the HTML request handler.
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// add request handler for handling `JSON` labelled requests
router.addHandler('JSON', async ({ body, crawler }) => {
    // parse the JSON response
    const json = JSON.parse(body.toString());

    // enqueue the HTML requests extracted from the JSON, labelled `HTML`
    await crawler.addRequests([{ url: '...', userData: { label: 'HTML' } }]);
});

// add request handler for handling `HTML` labelled requests
router.addHandler('HTML', async ({ $ }) => {
    // parse the HTML response as usual with cheerio's $
});

const crawler = new CheerioCrawler({
    // proxyConfiguration and maxRequestsPerCrawl are assumed to be defined elsewhere
    proxyConfiguration,
    maxRequestsPerCrawl,
    requestHandler: router,
    // if the crawler rejects the JSON content type, allow it explicitly:
    // additionalMimeTypes: ['application/json'],
});

// the initial request is labelled `JSON`, so it is routed to the JSON handler
await crawler.run([{ url: '...', userData: { label: 'JSON' } }]);
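
To make the handler bodies a bit more concrete, here is a minimal sketch that fills in the placeholders above. The json.urls field, the title selector, and https://example.com/list.json are assumptions for illustration only; swap them for whatever your target site actually returns.

import { CheerioCrawler, Dataset, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

router.addHandler('JSON', async ({ body, crawler }) => {
    const json = JSON.parse(body.toString());

    // assumption: the JSON response looks like { urls: ['https://...', ...] }
    await crawler.addRequests(
        json.urls.map((url) => ({ url, userData: { label: 'HTML' } })),
    );
});

router.addHandler('HTML', async ({ $, request }) => {
    // extract whatever you need with cheerio and store it in the default dataset
    await Dataset.pushData({
        url: request.loadedUrl,
        title: $('title').text().trim(),
    });
});

const crawler = new CheerioCrawler({ requestHandler: router });

// assumption: this URL returns the JSON list; label it so the JSON handler picks it up
await crawler.run([{ url: 'https://example.com/list.json', userData: { label: 'JSON' } }]);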
Let me know if you have any questions
Crawling the Store | Crawlee
