like-gold

Question about userData as option in enqueueLinks

I'm scraping a few pages of a forum. Some of the information is on the thread list page and the rest is inside each thread. In the end, all of it has to be loaded into a Postgres database. Here is an example of the thread list; from it I scrape the href (to feed into enqueueLinks), the views, and the replies:
| Title | Views | Replies |
|---------------------|-------|---------|
| My great post[href] | 2 | 1 |
| Asd asd[href] | 23 | 6 |
| TestPost[href] | 1 | 0 |
Inside each thread I generally need the title, post body, author, date, etc. Since I need data from both the list page and the thread page, the userData option of enqueueLinks() looks useful. Right now my scraper runs on Cheerio, and the main focus is on this piece of code. Since replies and views differ for every thread, how can I pass this information when the links are passed in bulk as a string[]?
if (request.label === 'THREAD_LIST') {
    const links: string[] = []
    $('tr > td.col_f_content > h4 > a')
        .each((i, el) => { links.push(new URL($(el).attr('href'), request.url).toString()) })

    console.log(chalk.green(`Found ${links.length} thread links, cool!`))
    await enqueueLinks({
        urls: links,
        label: "THREAD",
        userData: {
            "replies_count": ?,
        }
    });
}
I see multiple options, though none of them looks neat:
- make one enqueueLinks call per thread, each time passing a one-element array with the related userData
- store the data locally, using a dataset or some other key-value store as temporary storage

The ideal option would be to send two parallel arrays, get the index of the link currently being processed, and recover the related data from those arrays. However, I don't understand whether there is a way to get the index. I would like to hear the opinion of someone more experienced than me on this. Thankies ^^
3 Replies
adverse-sapphire (3y ago)
Hey there! Sorry for double-checking, but I am a bit lost in the description. To pass data between requests, as you described, you can use userData. And as you mentioned, you could enqueue some requests with one label (and one userData) and others with a different label and userData. If you want more granular control and you know ahead of time which links should go to which vector, you could just create objects with url, label and userData and call await crawler.addRequests([...]) directly: https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler#addRequests (the link is for the Cheerio crawler, but other crawlers work the same). Saving temporary data to the Key-Value store is also an option, but it adds some overhead (as you need to store and read the data). Hope this helps!
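A minimal sketch of the addRequests suggestion, assuming each row of the thread list has already been parsed into an object. The `ThreadRow` shape, the `buildThreadRequests` helper, and the `forum.example.com` URLs are made up for illustration; only `url`, `label`, and `userData` come from the reply above.

```typescript
// Hypothetical shape of one row scraped from the thread list table.
interface ThreadRow {
  href: string;
  views: number;
  replies: number;
}

// One request per thread, each carrying its own userData.
interface ThreadRequest {
  url: string;
  label: string;
  userData: { views_count: number; replies_count: number };
}

// Turn the scraped rows into request objects with absolute URLs.
function buildThreadRequests(rows: ThreadRow[], baseUrl: string): ThreadRequest[] {
  return rows.map((row) => ({
    url: new URL(row.href, baseUrl).toString(),
    label: 'THREAD',
    userData: { views_count: row.views, replies_count: row.replies },
  }));
}

const requests = buildThreadRequests(
  [
    { href: '/thread/1', views: 2, replies: 1 },
    { href: '/thread/2', views: 23, replies: 6 },
  ],
  'https://forum.example.com/list',
);

// Inside the THREAD_LIST handler this would be followed by:
//   await crawler.addRequests(requests);
// and the THREAD handler would read request.userData.replies_count.
```

Compared with enqueueLinks, which applies one userData object to every URL in the batch, this keeps the per-thread values attached to the request they belong to.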
like-gold (OP, 3y ago)
Hey, thank you for the response. In the end I decided to pass a dictionary to userData, shaped as Dictionary[discovered_url_to_enqueue][replies_count/views_count]. So when a request is handled, I read the data from userData like this:
request.userData[request.url].replies_count,
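For reference, the dictionary approach might look roughly like the sketch below. The `ListRow` shape, the helper names, and the example URLs are assumptions rather than the OP's actual code; the commented calls show where enqueueLinks and request.userData would fit.

```typescript
// Per-thread values scraped from the list page (key names mirror the post).
interface ThreadStats {
  views_count: number;
  replies_count: number;
}

// Hypothetical shape of one parsed row of the thread list table.
interface ListRow {
  href: string;
  views: number;
  replies: number;
}

// Build the dictionary keyed by the absolute URL that will be enqueued.
function buildStatsByUrl(rows: ListRow[], baseUrl: string): Record<string, ThreadStats> {
  const statsByUrl: Record<string, ThreadStats> = {};
  for (const row of rows) {
    const url = new URL(row.href, baseUrl).toString();
    statsByUrl[url] = { views_count: row.views, replies_count: row.replies };
  }
  return statsByUrl;
}

// Look up the stats for the request being handled; undefined if the key is missing.
function statsFor(userData: Record<string, unknown>, url: string): ThreadStats | undefined {
  return userData[url] as ThreadStats | undefined;
}

const statsByUrl = buildStatsByUrl(
  [
    { href: 'topic/10-my-great-post', views: 2, replies: 1 },
    { href: 'topic/11-asd-asd', views: 23, replies: 6 },
  ],
  'https://forum.example.com/index.php',
);

// In the THREAD_LIST handler:
//   await enqueueLinks({ urls: Object.keys(statsByUrl), label: 'THREAD', userData: statsByUrl });
// In the THREAD handler:
//   const stats = statsFor(request.userData, request.url);
```

One trade-off worth keeping in mind: every enqueued request carries the whole dictionary, so the stored userData grows with the number of links per list page. For a few dozen threads per page that is probably fine.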
adverse-sapphire (3y ago)
So yeah - totally an option 👍 Glad this is resolved 👌
