Crawlee & Apify · 9mo ago
sensitive-blue

infinite_scroll | how to get the updated page

Hey, I created this simple script:
import asyncio

# Instead of BeautifulSoupCrawler let's use Playwright to be able to render JavaScript.
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
import urllib.parse


async def main(terms: str) -> None:
    crawler = PlaywrightCrawler(headless=False)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Wait for the collection cards to render on the page. This ensures that
        # the elements we want to interact with are present in the DOM.
        await context.page.wait_for_load_state("networkidle")
        await context.infinite_scroll()

    url = f"https://www.youtube.com/results?search_query={urllib.parse.quote(terms)}&sp=EgIwAQ%253D%253D"
    await crawler.run([url])


if __name__ == "__main__":
    asyncio.run(main("music"))
But I want to get the content of the page while infinite_scroll scrolls it, so that I can see the new content and act on it as it appears. The problem is that await context.infinite_scroll() never returns, so I can't put any action after it to do what I need. How can I manage that? (I want to get the new links of YouTube videos.)
3 Replies
Hall · 9mo ago
This post has been pushed to the community knowledgebase. Any replies in this thread will be synced to the community site.
adverse-sapphire · 9mo ago
In general, infinite scroll is not reliable, as it can be truly infinite, which can lead to memory leaks. On websites with lazy-loading pagination, if API scraping is a viable option, it is a much better approach in terms of reliability and performance. AFAIK, YouTube has API endpoints for scrolling results; try checking them in DevTools. Otherwise, I'd advise creating your own scrolling logic that processes new items in a loop. E.g. here (it's JS, but the logic remains the same): https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/paginating-through-results#lazy-loading-pagination
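For instance, a rough sketch of such a loop, adapted from the original script. The a#video-title selector, the max_scrolls limit, and the scroll/wait values are illustrative assumptions, not verified against YouTube's current markup:

import asyncio
import urllib.parse

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main(terms: str) -> None:
    crawler = PlaywrightCrawler(headless=False)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        await context.page.wait_for_load_state("networkidle")

        seen_links: set[str] = set()
        max_scrolls = 20  # hard limit so the loop cannot scroll forever

        for _ in range(max_scrolls):
            # Collect the video links currently rendered in the DOM.
            hrefs = await context.page.eval_on_selector_all(
                "a#video-title", "els => els.map(el => el.href)"
            )
            new_links = [h for h in hrefs if h and h not in seen_links]
            for link in new_links:
                seen_links.add(link)
                await context.push_data({"url": link})  # act on each new item here

            if not new_links:
                break  # nothing new loaded, assume we reached the end

            # Scroll down and give the page time to lazy-load the next batch.
            await context.page.mouse.wheel(0, 5000)
            await context.page.wait_for_timeout(1500)

    url = f"https://www.youtube.com/results?search_query={urllib.parse.quote(terms)}&sp=EgIwAQ%253D%253D"
    await crawler.run([url])


if __name__ == "__main__":
    asyncio.run(main("music"))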
sensitive-blue (OP) · 9mo ago
Thanks for the answer! I found a workaround (context.page.on("request", track_request)), but I will try to use the API endpoint for that.
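That workaround might look roughly like this inside the request handler; the track_request name and the youtubei/v1/search URL filter are illustrative assumptions rather than confirmed details:

from playwright.async_api import Request


async def request_handler(context: PlaywrightCrawlingContext) -> None:
    def track_request(request: Request) -> None:
        # YouTube fetches additional results via its internal search endpoint,
        # so new batches show up here while infinite_scroll keeps running.
        if "youtubei/v1/search" in request.url:
            print("new results batch requested:", request.url)

    context.page.on("request", track_request)
    await context.page.wait_for_load_state("networkidle")
    await context.infinite_scroll()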
