How can I pass data extracted in the first part of the scraper to items that will be extracted later?

Hi. I'm extracting product prices. I have a main page where I can extract all the information I need except for the fees. If I go through every product individually, I can get both the price and the fees, but sometimes I lose the fee information because I get blocked on some products. I want to handle this situation: if I extract the fees, I add them to my product_item, but if I get blocked, I pass the fees as empty.

I'm using the Router class as the Crawlee team explains here: https://crawlee.dev/python/docs/introduction/refactoring. When I enqueue the URL extracted from the first page as shown below, I cannot pass along the data extracted earlier:

await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES')

I want something like this:

await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES', data=product_item)  # product_item is a dict

But the above is not possible. How can I do it? My final data should look like this. If I handle the data correctly:

product_item = {'product_id': 1234, 'price': '50$', 'fees': '3$'}

If I get blocked:

product_item = {'product_id': 1234, 'price': '50$', 'fees': ''}
5 Replies
Hall · 8mo ago
This post has been pushed to the community knowledgebase. Any replies in this thread will be synced to the community site.
overseas-lavender · 8mo ago
Hi @frankman, you can use this approach:
from crawlee import Request

await context.add_requests([
    Request.from_url(
        url='product_url',
        label='PRODUCT_WITH_FEES',
        user_data={'product_item': product_item},
    )
])
enqueue_links also supports a user_data parameter, but it seems to me that add_requests is the better fit for your case.
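For completeness, the enqueue_links variant would look roughly like this — a minimal sketch, where the a.product-link selector is a hypothetical placeholder for whatever matches the product detail links on the listing page:

await context.enqueue_links(
    selector='a.product-link',  # hypothetical selector for the product detail links
    label='PRODUCT_WITH_FEES',
    user_data={'product_item': product_item},
)

Note that this attaches the same user_data to every link the selector matches, which is one reason add_requests fits better when each product needs its own item.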
sunny-green (OP) · 8mo ago
Thank you Mantisus, that works for me. Now I know how I can pass data between requests. But how can I handle the data upload depending on whether the request failed or succeeded? If I handle the data correctly, I want something like this:

product_item = {'product_id': 1234, 'price': '50$', 'fees': '3$'}

If I get blocked, I want something like this:

product_item = {'product_id': 1234, 'price': '50$', 'fees': ''}

In my final handler for the PRODUCT_WITH_FEES label I'm using Apify.push(product_item) (same as crawlee.push()). Do I have to do it the following way?

try:
    ...
    await context.add_requests([
        Request.from_url(
            url='product_url',
            label='PRODUCT_WITH_FEES',
            user_data={'product_item': product_item},
        )
    ])
except Exception as e:
    Apify.push(product_item)  # product_item without fees
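For reference, the success path reads the item back out of user_data in the PRODUCT_WITH_FEES handler — a minimal sketch, where router comes from the Crawlee refactoring guide linked above and extract_fees is a hypothetical helper, not an API from the thread:

@router.handler('PRODUCT_WITH_FEES')
async def product_with_fees_handler(context) -> None:
    # Recover the partially built item passed along with the request
    product_item = dict(context.request.user_data['product_item'])
    product_item['fees'] = extract_fees(context)  # hypothetical extraction helper
    await context.push_data(product_item)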
overseas-lavender · 8mo ago
I can't be certain, since I don't know exactly what behavior you are observing, but it would most likely be something like this:
@crawler.failed_request_handler
async def blocked_item_handle(context, error) -> None:
    if context.request.label == 'PRODUCT_WITH_FEES':
        # The partially built item travels on the request's user_data
        await context.push_data(context.request.user_data['product_item'])
See https://crawlee.dev/python/api/class/BasicCrawler#failed_request_handler. Alternatively, use a try ... except inside the route handler for PRODUCT_WITH_FEES.
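The try ... except variant would sit inside the route handler itself — a sketch under the same assumption as above (extract_fees is a hypothetical helper), with the empty-string fallback matching the output the OP described:

@router.handler('PRODUCT_WITH_FEES')
async def product_with_fees_handler(context) -> None:
    product_item = dict(context.request.user_data['product_item'])
    try:
        product_item['fees'] = extract_fees(context)  # hypothetical extraction helper
    except Exception:
        # Blocked or failed extraction: keep the item, just with empty fees
        product_item['fees'] = ''
    await context.push_data(product_item)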
sunny-green (OP) · 8mo ago
Thank you, that works fine!
