Crawlee & Apify•9mo ago

Best Crawler for Youtube JS?

I'm looking to scrape YouTube captions. There is a timedtext URL in the page JS, but it doesn't seem to show up in a standard HTTP request (curl etc.). Can you please suggest a method? Here is my routes.ts file:

import { Dataset, createPuppeteerRouter } from 'crawlee';

export const router = createPuppeteerRouter();

router.addDefaultHandler(async ({ request, page, log }) => {
    log.info(`Default handler triggered for: ${request.url}`);

    // Wait for the page to fully load
    await page.waitForSelector('ytd-watch-flexy', { timeout: 10000 }); // Adjust the selector and timeout as necessary
    log.info(`Page is ready for processing: ${request.url}`);

    // Execute JS in the page context to find the caption URL
    const captionUrl = await page.evaluate(() => {
        const scripts = Array.from(document.querySelectorAll('script'));
        for (const script of scripts) {
            // `\?` matches the literal "?" that starts the query string
            const match = script.textContent?.match(/https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/);
            if (match) {
                return match[0];
            }
        }
        return null;
    });

    if (captionUrl) {
        log.info(`Found caption URL: ${captionUrl}`);
        await Dataset.pushData({
            url: request.loadedUrl,
            title: await page.title(),
            captionUrl,
        });
    } else {
        log.warning(`No caption URL found on ${request.loadedUrl}`);
    }
});

3 Replies

Hall•9mo ago

Post created!

This post has been synced with the Apify community site and will be indexed by search engines

Pepa J•9mo ago

Hi @Mike Bruggs, Discord probably mangled your code, but I tried this:

// await page.waitForSelector('ytd-watch-flexy', { timeout: 10000 }); // Adjust the selector and timeout as necessary

console.log('Page is ready for processing');

// Execute JS in the page context to find the caption URL
const captionUrl = await page.evaluate(() => {
    const scripts = Array.from(document.querySelectorAll('script'));
    for (const script of scripts) {
        // `\?` matches the literal "?" that starts the query string
        const match = script.textContent?.match(/https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/);
        if (match) {
            return match[0];
        }
    }
    return null;
});

// Debug output: show what the in-page search returned
console.log('captionUrl:', captionUrl);

if (captionUrl) {
    console.log(`Found caption URL: ${captionUrl}`);

    console.log({
        url: request.loadedUrl,
        title: await page.title(),
        captionUrl,
    });
} else {
    console.log(`No caption URL found on ${request.loadedUrl}`);
}

I ran this in my Puppeteer extension and it worked well.
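One follow-up that may be worth checking: because the regex matches against raw script text, the returned string can still carry JSON-style escapes. The sketch below assumes the inline script encodes "&" as \u0026 (an assumption, not confirmed in this thread) and decodes such escapes before the URL is requested. The helper name decodeUnicodeEscapes and the sample value are hypothetical.

```typescript
// Decode JSON-style \uXXXX escapes (for example \u0026 for "&") that can
// remain in a URL matched out of inline script text.
function decodeUnicodeEscapes(raw: string): string {
    return raw.replace(/\\u([0-9a-fA-F]{4})/g, (_match: string, hex: string) =>
        String.fromCharCode(parseInt(hex, 16)),
    );
}

// Hypothetical sample of what the regex match might return (VIDEO_ID is a placeholder).
const rawMatch = 'https://www.youtube.com/api/timedtext?v=VIDEO_ID\\u0026lang=en';
const decodedUrl = decodeUnicodeEscapes(rawMatch);

console.log(decodedUrl);
```

If the matched string contains no escapes, the helper returns it unchanged, so it should be safe to apply to every match.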

variable-lime•9mo ago

Why use Puppeteer or Playwright? They consume more time and resources. Unless you are scraping an embedded video, rendering JS is not necessary in this case, and you don't need YouTube's internal or public API.
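A minimal sketch of what this browser-free approach could look like: download the watch page HTML and run the same regex over the raw text. Whether YouTube serves the inline caption data to a plain HTTP client is not confirmed in this thread (the original poster reported it did not show up via curl), so treat this as a starting point to test. The names findCaptionUrl, fetchCaptionUrl, and HtmlFetcher are hypothetical.

```typescript
// Pull the first timedtext URL out of raw watch-page HTML, or null if absent.
function findCaptionUrl(html: string): string | null {
    const match = html.match(/https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/);
    return match ? match[0] : null;
}

// Hypothetical fetcher type: any function that resolves to the HTML of a URL.
// Injecting it keeps the parsing logic independent of the HTTP client.
type HtmlFetcher = (url: string) => Promise<string>;

// Download the page with the supplied fetcher and search it for a caption URL.
async function fetchCaptionUrl(
    videoUrl: string,
    fetchHtml: HtmlFetcher,
): Promise<string | null> {
    const html = await fetchHtml(videoUrl);
    return findCaptionUrl(html);
}

// Example wiring with the global fetch available in Node 18+:
// const url = await fetchCaptionUrl(watchUrl, (u) => fetch(u).then((r) => r.text()));
```

The returned string is the raw match, so it may still contain JSON escapes such as \u0026 that need decoding before the URL is requested. If the page returns a consent or bot-check page instead of the video page, findCaptionUrl simply returns null.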
