Pushed to Dataset but got nothing

Hi, I'm new. I tried to follow https://crawlee.dev/docs/examples/playwright-crawler but I get no data in storage :/
import { Dataset, PlaywrightCrawler, createPlaywrightRouter } from "crawlee";

export const wiki_lexique_de_orgue = createPlaywrightRouter();

wiki_lexique_de_orgue.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        globs: ['https://fr.wikipedia.org/wiki/Lexique_de_l%27orgue'],
        label: 'detail',
    });
});

wiki_lexique_de_orgue.addHandler('detail', async ({ request, page, log }) => {
    const title = await page.title();
    // const XPATH = "xpath=//main/div[3]/div[3]/div[1]/ul/li"
    // const STR = (await page.$$eval(...));
    const data = await page.$$eval('.dxp-node', ($posts: HTMLElement[]) => {
        const scrapedData: { title: string; desc: string }[] = [];
        // Collect the bold term and the full text of each matched node.
        $posts.forEach(($post) => {
            scrapedData.push({
                title: $post.querySelector('b')!.innerText,
                desc: $post.innerText,
            });
        });
        return scrapedData;
    });
    await Dataset.pushData(data);
    log.info(`${title}`, { url: request.loadedUrl });
});

export const WIKI_LEXIQUE_DE_ORGUE = { route: wiki_lexique_de_orgue, start: ["https://fr.wikipedia.org/wiki/Lexique_de_l%27orgue"] };

const crawler = new PlaywrightCrawler({ requestHandler: WIKI_LEXIQUE_DE_ORGUE.route });

await crawler.run(WIKI_LEXIQUE_DE_ORGUE.start);
I really don't understand how this works: I do get the URL in the log, so Playwright seems fine?????? With ".dxp-node" I expected to fetch 153 text nodes ...
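For what it's worth, the extraction logic inside the `$$eval` callback can be sketched outside the browser. This is a hypothetical, browser-free version (the `ElementLike` shape and `extractEntries` name are illustration only, standing in for the DOM elements matched by `.dxp-node`):

```typescript
// Browser-free sketch of the $$eval callback above. ElementLike is a
// hypothetical stand-in for a matched DOM element: innerText is the node's
// full text, boldText the text of its <b> child (or null if there is none).
type ElementLike = { innerText: string; boldText: string | null };

function extractEntries(posts: ElementLike[]): { title: string; desc: string }[] {
    const scrapedData: { title: string; desc: string }[] = [];
    for (const post of posts) {
        // The original callback does $post.querySelector('b')!.innerText,
        // which crashes inside the page for any node without a <b> child;
        // skipping such nodes is safer.
        if (post.boldText === null) continue;
        scrapedData.push({ title: post.boldText, desc: post.innerText });
    }
    return scrapedData;
}
```

Testing this in plain Node first makes it easier to tell a selector problem apart from a callback problem.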
10 Replies
sensitive-blue (OP) · 2y ago
Does anyone have an idea what's happening? (Bump)
rare-sapphire · 2y ago
Hey there, what are you trying to enqueue? Currently you start the crawler with this URL, https://fr.wikipedia.org/wiki/Lexique_de_l%27orgue, and then try to find new links with the provided glob patterns. But the pattern is the page URL itself, so enqueueLinks does not find anything on the page and the crawler just shuts down. Actually, the question isn't really what you are trying to enqueue: please share what the overall workflow is supposed to be.
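To see concretely why that glob enqueued nothing: enqueueLinks compares the patterns against links *found on the page*, never against the start page's own URL. A simplified, hypothetical glob matcher (handles only the `*` wildcard; a real glob library does more) illustrates it:

```typescript
// Hypothetical, '*'-only glob matcher for illustration: converts the glob
// to a regex by escaping everything except '*', which becomes '.*'.
function matchesGlob(glob: string, url: string): boolean {
    const pattern = glob
        .split("*")
        .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
        .join(".*");
    return new RegExp(`^${pattern}$`).test(url);
}

const glob = "https://fr.wikipedia.org/wiki/Lexique_de_l%27orgue";
// A link discovered on the page (some other article) does not match this
// glob, so nothing is enqueued; a broader pattern such as
// "https://fr.wikipedia.org/wiki/*" would match.
```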
sensitive-blue (OP) · 2y ago
Thank you for taking the time. I understand now why it goes nowhere. I just want to fetch data from one page only, https://fr.wikipedia.org/wiki/Lexique_de_l%27orgue. I understand that I could also do it in different ways, but obviously sometimes we have to deal with exceptions. I found this: https://crawlee.dev/api/core/function/enqueueLinks
await enqueueLinks({
    urls: aListOfFoundUrls,
    requestQueue,
    selector: 'a.product-detail',
    globs: [
        'https://www.example.com/handbags/*',
        'https://www.example.com/purses/*',
    ],
});
Maybe I will try putting my link in "urls".
rare-sapphire · 2y ago
I am still not following. You already run the crawler with the above start URL, meaning that in the requestHandler the first page is loaded with it. If you want to scrape data from this page, why are you trying to enqueue more pages? You either have to use the detail handler as the default handler, or add the start URL as an object with url set to your start URL and label set to detail.
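The second option can be sketched as plain data: in Crawlee, crawler.run() accepts request objects, and a label routes each one to the matching router.addHandler instead of the default handler. The helper function here is hypothetical, just to show the shape:

```typescript
// Hypothetical helper: build start requests that carry a label, so the
// router sends them straight to the labeled handler (e.g. 'detail') with
// no enqueueLinks step needed for a single page.
type StartRequest = { url: string; label?: string };

function buildStartRequests(urls: string[], label?: string): StartRequest[] {
    return urls.map((url) => (label ? { url, label } : { url }));
}

const start = buildStartRequests(
    ["https://fr.wikipedia.org/wiki/Lexique_de_l%27orgue"],
    "detail",
);
// start[0] is { url: "...", label: "detail" }; pass it to the crawler:
// await crawler.run(start);
```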
sensitive-blue (OP) · 2y ago
const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: 'https://fr.wikipedia.org/wiki/Lexique_de_l%27orgue' });
const wiki_lex_v2 = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ $, request }) {
        const title = ($ as any)('title').text();
        console.log(`The title of "${request.url}" is: ${title}.`);
        await Dataset.pushData({ url: request.loadedUrl, title });
    }
});
wiki_lex_v2.run()
I tested that, but it doesn't work .....
const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: 'https://fr.wikipedia.org/wiki/Lexique_de_l%27orgue' });
export const wiki_lexique_de_orgue = createPlaywrightRouter();

wiki_lexique_de_orgue.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        requestQueue,
        urls: ["https://fr.wikipedia.org/wiki/Lexique_de_l%27orgue"],
        globs: [],
        label: 'detail',
    });
    await Dataset.pushData({ lol: "" });
});

wiki_lexique_de_orgue.addHandler('detail', async ({ request, page, log }) => {
    const title = await page.title();
    // const XPATH = "xpath=//main/div[3]/div[3]/div[1]/ul/li"
    // const STR = (await page.$$eval(...));
    const data = await page.$$eval('.dxp-node', ($posts: HTMLElement[]) => {
        const scrapedData: { title: string; desc: string }[] = [];
        // Collect the bold term and the full text of each matched node.
        $posts.forEach(($post) => {
            scrapedData.push({ title: $post.querySelector('b')!.innerText, desc: $post.innerText });
        });
        return scrapedData;
    });
    await Dataset.pushData(data);
    log.info(`${title}`, { url: request.loadedUrl });
});

export const WIKI_LEXIQUE_DE_ORGUE = { route: wiki_lexique_de_orgue, start: ["https://fr.wikipedia.org/wiki/Lexique_de_l%27orgue"] };
const crawler = new PlaywrightCrawler({ requestHandler: WIKI_LEXIQUE_DE_ORGUE.route });
await crawler.run();
"why are you trying to enqueue more pages? " you expected that I write it correctly, if it is easy, I would love to watch some code that works [...] . I not able to execute the code in the documentation : / . ANYWAY "await Dataset.pushData" is not executed ... It should be ... I guess
rare-sapphire · 2y ago
Have you checked https://docs.apify.com/academy? In the first snippet you are missing an await on the crawler.run() call. Also, $ is not part of the PlaywrightCrawlingContext. In the second snippet you're not adding any requests to the crawler: you open the request queue explicitly, but you don't pass it to the crawler. And pretty much the same comments as in my previous message apply. Also, I don't see any elements matching the selector you provide, .dxp-node, on the page.
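Putting those fixes together, a minimal single-page version might look like the sketch below. It is an assumption-laden sketch, not a verified scraper: in particular, the `dl dt` selector is only a guess that the glossary entries live in a definition list, and must be checked in the browser dev tools against the real page markup.

```typescript
import { Dataset, PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, log }) {
        const title = await page.title();
        // Assumption: entries are <dl><dt>term</dt><dd>definition</dd></dl>;
        // verify the real selector in dev tools before relying on it.
        const data = await page.$$eval("dl dt", (terms) =>
            terms.map((dt) => ({
                title: (dt as HTMLElement).innerText,
                desc: (dt.nextElementSibling as HTMLElement | null)?.innerText ?? "",
            })),
        );
        await Dataset.pushData(data);
        log.info(title, { url: request.loadedUrl });
    },
});

// run() accepts the start URLs directly, so no explicit RequestQueue is
// needed; awaiting it matters, or the script can exit before storing data.
await crawler.run(["https://fr.wikipedia.org/wiki/Lexique_de_l%27orgue"]);
```

With no router and no enqueueLinks, the single requestHandler runs on the one start URL and the results land in the default dataset under storage/datasets/default.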
sensitive-blue (OP) · 2y ago
Thanks for pointing that out (.dxp-node indeed doesn't exist ... anymore ...); maybe Wikipedia has been updated. I will test some changes later, and I'll respond once I've gone through your resource/suggestion. Aaaah ... OK ... so the crawling part only fetches links ...
Pepa J · 2y ago
Hello @Landerfine l'écarlate, I'm not sure I follow. Crawling is the process of getting URLs from a website, navigating through them, and obtaining more links. Crawlee is a framework capable of doing this, but it also allows you to get "any" information from the website and store it, so you can use it later. At this point I am not sure whether your issue is that your data was not stored. If so, please let me know and we can investigate further. 🙂
sensitive-blue (OP) · 2y ago
It's resolved for me, thanks!
