Best Crawler for YouTube JS?

I'm looking to scrape YouTube captions. There is a timedtext URL in the page JS, but it doesn't seem to show up in a standard HTTP request made with curl, etc. Can you please suggest a method? Here is my routes.ts file:

import { Dataset, createPuppeteerRouter } from 'crawlee';

export const router = createPuppeteerRouter();

router.addDefaultHandler(async ({ request, page, log }) => {
    log.info(`Default handler triggered for: ${request.url}`);

    // Wait for the page to fully load
    await page.waitForSelector('ytd-watch-flexy', { timeout: 10000 }); // Adjust the selector and timeout as necessary
    log.info(`Page is ready for processing: ${request.url}`);

    // Execute JS in the page context to find the caption URL
    const captionUrl = await page.evaluate(() => {
        const scripts = Array.from(document.querySelectorAll('script'));
        for (const script of scripts) {
            const match = script.textContent?.match(/https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/);
            if (match) {
                return match[0];
            }
        }
        return null;
    });

    if (captionUrl) {
        log.info(`Found caption URL: ${captionUrl}`);
        await Dataset.pushData({
            url: request.loadedUrl,
            title: await page.title(),
            captionUrl,
        });
    } else {
        log.warning(`No caption URL found on ${request.loadedUrl}`);
    }
});
Pepa J, 9mo ago
Hi @Mike Bruggs. Discord probably damaged your code, but I tried:
// await page.waitForSelector('ytd-watch-flexy', { timeout: 10000 }); // Adjust the selector and timeout as necessary

console.log('Page is ready for processing:');

// Execute JS in the page context to find the caption URL
const captionUrl = await page.evaluate(() => {
    const scripts = Array.from(document.querySelectorAll('script'));
    for (const script of scripts) {
        // "." and "?" are escaped so they match literally
        const match = script.textContent?.match(/https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/);
        if (match) {
            return match[0];
        }
    }
    return null;
});

if (captionUrl) {
    console.log(`Found caption URL: ${captionUrl}`);

    console.log({
        url: request.loadedUrl,
        title: await page.title(),
        captionUrl,
    });
} else {
    console.log(`No caption URL found on ${request.loadedUrl}`);
}
I ran this in my Puppeteer extension and it worked well.
variable-lime, 9mo ago
Why use Puppeteer or Playwright? It consumes more time and resources. Unless you are scraping an embedded video, rendering JS is not necessary in this case, and you don't need YouTube's internal API or public API.
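
A minimal sketch of that browserless approach, under assumptions this thread does not confirm: that the watch page HTML returned to a plain HTTP request still embeds the timedtext URL in an inline script, and that "&" appears there JSON-escaped as "\u0026". The names extractCaptionUrl and fetchCaptionUrl are hypothetical helpers, not part of Crawlee or any YouTube API, and the global fetch call assumes Node 18 or newer. Consent or bot-check pages may return HTML without the URL, in which case this returns null.

```typescript
// Sketch only: assumes the caption URL sits in an inline JSON blob in the
// raw HTML, with "&" escaped as "\u0026". Not a confirmed YouTube contract.

/** Returns the first timedtext URL found in raw HTML, or null if none. */
function extractCaptionUrl(html: string): string | null {
    // Same pattern as the Puppeteer version, with "." and "?" escaped.
    const match = html.match(/https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/);
    if (!match) {
        return null;
    }
    // Undo the JSON escaping of "&" so the URL can be requested directly.
    return match[0].replace(/\\u0026/g, '&');
}

/** Downloads a watch page without a browser and extracts the caption URL. */
async function fetchCaptionUrl(videoUrl: string): Promise<string | null> {
    // Assumption: a plain GET is enough; extra headers or cookies may be needed.
    const response = await fetch(videoUrl, {
        headers: { 'Accept-Language': 'en-US,en;q=0.9' },
    });
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return extractCaptionUrl(await response.text());
}
```

Keeping the extraction separate from the network call means the same extractCaptionUrl helper could also be applied to page.content() inside the Puppeteer handler if the plain request turns out not to include the URL.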
