Crawlee & Apify•8mo ago

How can I pass data extracted in the first part of the scraper to items that will be extracted later

Hi. I'm extracting prices of products. In the process, I have the main page where I can extract all the information I need except for the fees. If I go through every product individually, I can get the price and fees, but sometimes I lose the fee information because I get blocked on some products. I want to handle this situation. If I extract the fees, I want to add them to my product_item, but if I get blocked, I want to pass this data as empty. I'm using the "Router" class as the Crawlee team explains here: https://crawlee.dev/python/docs/introduction/refactoring. When I add my URL extracted from the first page as shown below, I cannot pass data extracted before: await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES') I want something like this: await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES', data=product_item # type: dict) But I cannot do the above. How can I do it? So, my final data will be showed as: If I handle the data correctly I want something like this: product_item = {product_id: 1234, price: 50$, fees: 3$} If I get blocked, I have something like this: product_item = {product_id: 1234, price: 50$, fees: ''}

5 Replies

Hall•8mo ago

View post on community site

This post has been pushed to the community knowledgebase. Any replies in this thread will be synced to the community site.

Apify Community

overseas-lavender•8mo ago

Hi @frankman You can use this approach

await context.add_requests([
    Request.from_url(
            url='product_url',
            label='PRODUCT_WITH_FEES',
            user_data={"product_item": product_item}
            )
    ])

await context.add_requests([
    Request.from_url(
            url='product_url',
            label='PRODUCT_WITH_FEES',
            user_data={"product_item": product_item}
            )
    ])

enqueue_links - It also supports the user_data variable, but it seems to me that add_requests is better for your case

sunny-greenOP•8mo ago

Thank you Mantisus, that works for me. Now I know how can I pass data between requests. And how can I handle the data upload depending on whether the request failed or was successful? If I handle the data correctly I want something like this: product_item = {product_id: 1234, price: 50$, fees: 3$} If I get blocked, I have something like this: product_item = {product_id: 1234, price: 50$, fees: ''} In my final function with the "label": PRODUCT_WITH_FEES I'm using Apify.push(product_item) (same than crawlee.push()). I have to do the following way?

try: 
   ...
   await context.add_requests([
       Request.from_url(
               url='product_url',
               label='PRODUCT_WITH_FEES',
               user_data={"product_item":    product_item}
               )
       ])

except Exception as e:
   Apify.push(produc_item)  # product_item without fees.

overseas-lavender•8mo ago

I can't be certain as I don't know exactly what behavior you are observing. But it's more likely to be something like this

@crawler.failed_request_handler
async def blocked_item_handle(context, error) -> None:
    if context.request.label == "PRODUCT_WITH_FEES":
        Apify.push(context.request.produc_item)

@crawler.failed_request_handler
async def blocked_item_handle(context, error) -> None:
    if context.request.label == "PRODUCT_WITH_FEES":
        Apify.push(context.request.produc_item)

https://crawlee.dev/python/api/class/BasicCrawler#failed_request_handler Either at the try ... except in the route for PRODUCT_WITH_FEES

sunny-greenOP•8mo ago

Thank you, that works fine!

Gaming

Programming

How can I pass data extracted in the first part of the scraper to items that will be extracted later

Did you find this page helpful?