Difference between the scraped amount in browser vs on Python
I am using the Twitter scraper from Python. In the browser console I get all 300 tweets that I request, but a counter in my Python script that increments with each item from client.dataset(run['defaultDatasetId']).iterate_items() ends up at around 90, so it seems I am only getting about a third of the tweets I scrape. Does anyone know why, or can you recommend what to do?
9 Replies
Hello @Ruuubear , could you provide us with a better example of the code, and maybe send the runId (if you are running it on the platform) via PM, so we can try to reproduce/investigate it on our side?
Also be aware that scraping through a proxy from a different region, or with a logged-out account, may give you different results than you see in the browser.
rival-blackOP•3y ago
Hi @Pepa J , the run id is: taVGBlEaesj8eUwii.
This is my run input:
run_input = {
    "profilesDesired": 1,
    "handle": [f"{user_profile}"],
    "searchMode": "user",
    "tweetsDesired": maxtweets,
    # "mode": "replies",
    "proxyConfig": {"useApifyProxy": True},
    # Returns each scraped item unchanged.
    "extendOutputFunction": """async ({ data, item, page, request, customData, Apify }) => {
return item;
}""",
    # No-op hook; the stray double comma in the parameter list has been removed.
    "extendScraperFunction": """async ({ page, request, addSearch, addProfile, addThread, addEvent, customData, Apify, signal, label }) => {
}""",
    "customData": {},
    "handlePageTimeoutSecs": 5000,
    "maxRequestRetries": 6,
    "maxIdleTimeoutSecs": 60,
    "initialCookies": [],
}
maxtweets was set to 300
@Ruuubear
I just tested it and it worked well:
My implementation:
Be sure you provide the right datasetId (and not the actorId).
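The implementation mentioned above is not preserved in this thread. As a rough sketch of that approach (not the original code), assuming the apify-client Python package and an already constructed ApifyClient passed in as client: the function name fetch_finished_run_items and the actor_id argument are hypothetical and only here for illustration.

```python
def fetch_finished_run_items(client, actor_id, run_input):
    """Run an actor, wait for it to finish, then read its default dataset.

    `client` is assumed to be an apify_client.ApifyClient instance;
    `actor_id` is a placeholder for whichever actor you are running.
    """
    # call() blocks until the run has finished and returns the run record.
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("The actor run did not return a run record")

    # Use the run's defaultDatasetId here, not the actor id or the run id.
    dataset_id = run["defaultDatasetId"]
    return list(client.dataset(dataset_id).iterate_items())
```

Counting len() of the returned list after the run has finished should match what the platform shows for that dataset; counting while the run is still going will usually give a smaller number.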
rival-blackOP•3y ago
Thanks for your reply. In your example are you pulling the dataset already scraped? I'm trying to get the data as it is scraped live.
@Ruuubear Yes, I wait for the run to finish. Otherwise you would have to do some active waiting: checking whether the actor is still running, working out the offset parameter for listing items based on the items you have already downloaded, etc.
rival-blackOP•3y ago
Is it possible to do it live? It is quite essential for what I am building.
I am afraid you would need to solve this yourself; I don't know of any dataset streaming mechanism available on the platform.
But you may solve this by polling the data every few secs and checking the run status (I suggest to wait another 5 secs after the run ends, because there could still be some items being stored to the dataset).
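The polling approach described above could be sketched roughly like this. It assumes the apify-client Python package, where client.run(run_id).get() returns the run record and client.dataset(dataset_id).list_items(offset=...) returns a page with an items list. The name stream_items, the poll_secs and grace_secs defaults, and the injectable sleep argument are hypothetical choices for illustration, and the set of terminal statuses is my understanding of the platform's run statuses.

```python
import time

# Run statuses treated as "the run is over" (assumed list).
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED-OUT", "ABORTED"}


def stream_items(client, run_id, dataset_id, poll_secs=5, grace_secs=5,
                 sleep=time.sleep):
    """Yield dataset items while the run is still in progress.

    `client` is assumed to be an apify_client.ApifyClient instance.
    """
    offset = 0          # number of items already yielded
    finished = False    # set once the run reaches a terminal status

    while True:
        # Only ask for items we have not seen yet.
        page = client.dataset(dataset_id).list_items(offset=offset)
        for item in page.items:
            yield item
        offset += len(page.items)

        if finished:
            # After the run ended, keep reading until nothing new arrives.
            if not page.items:
                break
            continue

        run = client.run(run_id).get()
        if run["status"] in TERMINAL_STATUSES:
            # Items may still be written shortly after the run ends.
            sleep(grace_secs)
            finished = True
        else:
            sleep(poll_secs)
```

To use it, start the run without waiting, for example with client.actor(actor_id).start(run_input=run_input), then pass run["id"] and run["defaultDatasetId"] to stream_items and process each item in a for loop as it is yielded.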
rival-blackOP•3y ago
Ok, thanks for your help