How can I pass data extracted in the first part of the scraper to items that will be extracted later?
Hi. I'm extracting product prices. On the main page I can extract all the information I need except the fees. If I visit every product individually, I can get both the price and the fees, but I sometimes lose the fee information because I get blocked on some products. I want to handle this situation: if I extract the fees, I want to add them to my product_item, but if I get blocked, I want to pass the fees as empty. I'm using the "Router" class as the Crawlee team explains here: https://crawlee.dev/python/docs/introduction/refactoring. When I enqueue the URL extracted from the first page as shown below, I cannot pass along data extracted before:
await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES')
I want something like this:
await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES', data=product_item # type: dict)
But I cannot do the above. How can I do it?
So, my final data will look like one of the following.
If I handle the data correctly, I want something like this:
product_item = {'product_id': 1234, 'price': '50$', 'fees': '3$'}
If I get blocked, I have something like this:
product_item = {'product_id': 1234, 'price': '50$', 'fees': ''}
5 Replies
overseas-lavender•8mo ago
Hi @frankman
You can use this approach: enqueue_links also supports a user_data argument, but it seems to me that add_requests is better for your case.
sunny-greenOP•8mo ago
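For illustration, here is a crawlee-free sketch of the data flow that add_requests with user_data gives you: the listing handler attaches the partially filled product_item to the request, and the detail handler reads it back from the request's user_data and fills in the fees. The Request and handler shapes below are simplified stand-ins, not the real crawlee classes, and the URL is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeRequest:
    # Simplified stand-in for crawlee's Request: a URL, a label,
    # and the user_data dict that travels with the request.
    url: str
    label: str
    user_data: dict[str, Any] = field(default_factory=dict)


def listing_handler(product_url: str, product_item: dict) -> FakeRequest:
    # Mirrors: await context.add_requests([Request.from_url(
    #     product_url, label='PRODUCT_WITH_FEES',
    #     user_data={'product_item': product_item})])
    return FakeRequest(
        url=product_url,
        label='PRODUCT_WITH_FEES',
        user_data={'product_item': product_item},
    )


def product_with_fees_handler(request: FakeRequest, scraped_fees: str) -> dict:
    # Mirrors reading context.request.user_data in the routed handler,
    # then filling in the fees scraped from the product page.
    item = dict(request.user_data['product_item'])
    item['fees'] = scraped_fees
    return item


# Data extracted on the listing page survives into the detail handler.
req = listing_handler('https://example.com/p/1234',
                      {'product_id': 1234, 'price': '50$'})
final_item = product_with_fees_handler(req, '3$')
print(final_item)  # {'product_id': 1234, 'price': '50$', 'fees': '3$'}
```

The key point is that user_data is serialized with the request, so whatever dict you attach in one handler is available on context.request.user_data in the handler that the label routes to.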
Thank you Mantisus, that works for me. Now I know how I can pass data between requests. But how can I handle the data upload depending on whether the request failed or succeeded?
If I handle the data correctly I want something like this:
product_item = {'product_id': 1234, 'price': '50$', 'fees': '3$'}
If I get blocked, I have something like this:
product_item = {'product_id': 1234, 'price': '50$', 'fees': ''}
In my final handler with the label PRODUCT_WITH_FEES, I'm using Apify.push(product_item) (same as crawlee.push()).
Do I have to do it the following way?
try:
    ...
    await context.add_requests([
        Request.from_url(
            url='product_url',
            label='PRODUCT_WITH_FEES',
            user_data={"product_item": product_item}
        )
    ])
except Exception as e:
    Apify.push(product_item)  # product_item without fees.
??
overseas-lavender•8mo ago
I can't be certain, as I don't know exactly what behavior you are observing, but it's most likely something like this:
https://crawlee.dev/python/api/class/BasicCrawler#failed_request_handler
Either that, or a try ... except in the route for PRODUCT_WITH_FEES.
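For the blocked case, the failed_request_handler linked above still receives the failed request (including its user_data), so that's where you can push the item with empty fees. Here is a simplified, crawlee-free sketch of that fallback logic; push_data and the two callbacks are stand-ins for the real crawlee/Apify handlers, and the sample items are hypothetical.

```python
from typing import Any

pushed_items: list[dict[str, Any]] = []


def push_data(item: dict[str, Any]) -> None:
    # Stand-in for context.push_data() / Actor.push_data().
    pushed_items.append(item)


def on_request_finished(user_data: dict[str, Any], scraped_fees: str) -> None:
    # Success path: the PRODUCT_WITH_FEES handler fills in the fees.
    item = dict(user_data['product_item'])
    item['fees'] = scraped_fees
    push_data(item)


def on_request_failed(user_data: dict[str, Any]) -> None:
    # Failure path: mirrors a failed_request_handler. The request's
    # user_data is still available, so push the item with empty fees.
    item = dict(user_data['product_item'])
    item['fees'] = ''
    push_data(item)


# One product succeeds, one gets blocked.
on_request_finished({'product_item': {'product_id': 1234, 'price': '50$'}}, '3$')
on_request_failed({'product_item': {'product_id': 5678, 'price': '20$'}})
print(pushed_items)
# [{'product_id': 1234, 'price': '50$', 'fees': '3$'},
#  {'product_id': 5678, 'price': '20$', 'fees': ''}]
```

The advantage of the failed_request_handler route over a try ... except inside the handler is that it also covers requests the crawler gives up on after exhausting its retries, not just exceptions raised in your own code.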
sunny-greenOP•8mo ago
Thank you, that works fine!