Crawlee & Apify · 9mo ago
sensitive-blue

infinite_scroll | how to get the updated page

Hey, I created this simple script:
import asyncio

# Instead of BeautifulSoupCrawler let's use Playwright to be able to render JavaScript.
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
import urllib.parse


async def main(terms: str) -> None:
    crawler = PlaywrightCrawler(headless=False)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Wait for the collection cards to render on the page. This ensures that
        # the elements we want to interact with are present in the DOM.
        await context.page.wait_for_load_state("networkidle")
        await context.infinite_scroll()

    url = f"https://www.youtube.com/results?search_query={urllib.parse.quote(terms)}&sp=EgIwAQ%253D%253D"
    await crawler.run([url])


if __name__ == "__main__":
    asyncio.run(main("music"))
But I want to get the content of the page while infinite_scroll scrolls it, so that I can see the new content and act on it as it appears. The problem is that await context.infinite_scroll() never returns, so I can't put any action after it to do what I need. How can I manage that? (I want to get the new links of YouTube videos.)
3 Replies
Hall · 9mo ago
This post has been pushed to the community knowledgebase. Any replies in this thread will be synced to the community site.
adverse-sapphire · 9mo ago
In general, infinite scroll is not reliable, as it can be truly infinite, which can lead to memory leaks. On websites with lazy-loading pagination, if API scraping is a viable option, it is a much better approach in terms of reliability and performance. AFAIK, YouTube has API endpoints for scrolling results; try checking them in DevTools. Otherwise, I'd advise creating your own scrolling logic that processes new items in a loop. E.g. here (it's JS, but the logic remains the same): https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/paginating-through-results#lazy-loading-pagination
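For instance, a rough sketch of such a loop, adapted from the original script. The a#video-title selector, the max_scrolls limit, and the scroll/wait values are illustrative assumptions, not verified against YouTube's current markup:

import asyncio
import urllib.parse

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main(terms: str) -> None:
    crawler = PlaywrightCrawler(headless=False)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        await context.page.wait_for_load_state("networkidle")

        seen_links: set[str] = set()
        max_scrolls = 20  # hard limit so the loop cannot scroll forever

        for _ in range(max_scrolls):
            # Collect the video links currently rendered in the DOM.
            hrefs = await context.page.eval_on_selector_all(
                "a#video-title", "els => els.map(el => el.href)"
            )
            new_links = [h for h in hrefs if h and h not in seen_links]
            for link in new_links:
                seen_links.add(link)
                await context.push_data({"url": link})  # act on each new item here

            if not new_links:
                break  # nothing new loaded, assume we reached the end

            # Scroll down and give the page time to lazy-load the next batch.
            await context.page.mouse.wheel(0, 5000)
            await context.page.wait_for_timeout(1500)

    url = f"https://www.youtube.com/results?search_query={urllib.parse.quote(terms)}&sp=EgIwAQ%253D%253D"
    await crawler.run([url])


if __name__ == "__main__":
    asyncio.run(main("music"))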
sensitive-blue (OP) · 9mo ago
Thanks for the answer! I found a workaround (context.page.on("request", track_request)), but I will try to use the API endpoint for that.
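That workaround might look roughly like this inside the request handler; the track_request name and the youtubei/v1/search URL filter are illustrative assumptions rather than confirmed details:

from playwright.async_api import Request


async def request_handler(context: PlaywrightCrawlingContext) -> None:
    def track_request(request: Request) -> None:
        # YouTube fetches additional results via its internal search endpoint,
        # so new batches show up here while infinite_scroll keeps running.
        if "youtubei/v1/search" in request.url:
            print("new results batch requested:", request.url)

    context.page.on("request", track_request)
    await context.page.wait_for_load_state("networkidle")
    await context.infinite_scroll()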
