Best Crawler for YouTube JS?
I'm looking to scrape YouTube captions. There is a timedtext URL in the page's JS, but it doesn't seem to show up in a standard HTTP request made with curl or similar tools.
Can you please suggest a method?
Here is my routes.ts file:
import { Dataset, createPuppeteerRouter } from 'crawlee';

export const router = createPuppeteerRouter();

router.addDefaultHandler(async ({ request, page, log }) => {
    log.info(`Default handler triggered for: ${request.url}`);

    // Wait for the watch page's main element to appear
    await page.waitForSelector('ytd-watch-flexy', { timeout: 10000 }); // Adjust the selector and timeout as necessary
    log.info(`Page is ready for processing: ${request.url}`);

    // Execute JS in the page context to find the caption URL
    const captionUrl = await page.evaluate(() => {
        const scripts = Array.from(document.querySelectorAll('script'));
        for (const script of scripts) {
            // Slashes, dots, and the "?" must be escaped inside a regex literal
            const match = script.textContent?.match(/https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/);
            if (match) {
                return match[0];
            }
        }
        return null;
    });

    if (captionUrl) {
        log.info(`Found caption URL: ${captionUrl}`);
        await Dataset.pushData({
            url: request.loadedUrl,
            title: await page.title(),
            captionUrl,
        });
    } else {
        log.warning(`No caption URL found on ${request.loadedUrl}`);
    }
});
3 Replies
Hi @Mike Bruggs
Discord probably damaged your code, but I tried:
in my Puppeteer extension and it worked well.
variable-lime•9mo ago
Why use Puppeteer or Playwright? It consumes more time and resources. Unless you are scraping an embedded video, rendering JS is not necessary in this case, and you don't need YouTube's internal API or public API.
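A minimal sketch of that browser-free approach, in case it helps. It assumes the timedtext URL appears inside an inline JSON string in the raw watch-page HTML with "&" written as \u0026. The thread does not confirm that, nor that YouTube serves the same inline data to a non-browser client. The names extractCaptionUrl and fetchCaptionUrl are hypothetical helpers, not part of Crawlee or any YouTube API.

```typescript
// Hypothetical helper: find the first timedtext URL in raw HTML.
// Assumption: the URL is embedded in a JSON string, so it ends at the
// closing double quote and "&" may be escaped as "\u0026".
function extractCaptionUrl(html: string): string | null {
    const match = html.match(/https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/);
    if (!match) {
        return null;
    }
    // Turn the JSON escape sequence back into a literal ampersand
    return match[0].replace(/\\u0026/g, '&');
}

// Hypothetical wrapper: download the page with a plain HTTP request
// (global fetch, available in Node 18+) and run the extraction on it.
async function fetchCaptionUrl(videoUrl: string): Promise<string | null> {
    const response = await fetch(videoUrl);
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return extractCaptionUrl(await response.text());
}
```

If fetchCaptionUrl returns null for a video that does have captions, the response probably differs from what a browser receives (for example a consent page), and request headers or cookies may need adjusting before falling back to Puppeteer.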