Difference between the scraped amount in the browser vs. in Python

I am using the Twitter scraper in Python. In the browser console I get all 300 tweets that I request, but a counter in my Python script that increments for each item in client.dataset(run['defaultDatasetId']).iterate_items() ends up at around 90, so it seems I am only getting about a third of the tweets I scrape. Does anyone know why, or can you recommend what to do?
9 Replies
Pepa J•3y ago
Hello @Ruuubear, could you provide a fuller example of your code, and maybe send the run ID (if you are running it on the platform) in a PM, so we can try to reproduce and investigate it on our side? Also be aware that scraping through a proxy from a different region, or with a logged-out account, may give you different results than you see in the browser.
rival-blackOP•3y ago
Hi @Pepa J, the run ID is taVGBlEaesj8eUwii. This is my run input:

run_input = {
    "profilesDesired": 1,
    "handle": [f"{user_profile}"],
    "searchMode": "user",
    "tweetsDesired": maxtweets,
    # "mode": "replies",
    "proxyConfig": {"useApifyProxy": True},
    "extendOutputFunction": """async ({ data, item, page, request, customData, Apify }) => { return item; }""",
    "extendScraperFunction": """async ({ page, request, addSearch, addProfile, , addThread, addEvent, customData, Apify, signal, label }) => { }""",
    "customData": {},
    "handlePageTimeoutSecs": 5000,
    "maxRequestRetries": 6,
    "maxIdleTimeoutSecs": 60,
    "initialCookies": [],
}

maxtweets was set to 300.
Pepa J•3y ago
@Ruuubear I just tested it and it worked well. My implementation:
from apify import Actor
from apify_client import ApifyClient


async def main():
    async with Actor:
        # Get the value of the actor input
        actor_input = await Actor.get_input() or {}

        apify_client = ApifyClient('apify_api_************************')

        dataset = apify_client.dataset('Sh*************zz')

        dataset_items = dataset.list_items().items

        i = 1
        for item in dataset_items:
            print(i)
            i += 1
Be sure you provide the right datasetId (and not the actorId).
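To illustrate the datasetId point, here is a minimal sketch that counts items using the dataset ID taken from the run object (run['defaultDatasetId'], as in the original question). The helper name count_run_items is made up for this example, and the token and actor ID in the usage comments are placeholders, not values from this thread.

```python
def count_run_items(client, run):
    """Count the items stored in the default dataset of a run.

    `client` is expected to be an apify_client.ApifyClient and `run` the
    dict describing a run. `count_run_items` is a hypothetical helper
    name used only for this sketch.
    """
    # The dataset ID lives on the run object; it is not the actor ID.
    dataset_id = run["defaultDatasetId"]
    # iterate_items() pages through the whole dataset lazily.
    return sum(1 for _ in client.dataset(dataset_id).iterate_items())


# Usage (placeholders, not real values):
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_API_TOKEN>")
#   run = client.actor("<ACTOR_ID>").call(run_input=run_input)  # waits for the run to finish
#   print(count_run_items(client, run))
```

Because call() waits for the run to finish, the count taken this way covers the whole run rather than a partial snapshot.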
rival-blackOP•3y ago
Thanks for your reply. In your example, are you pulling a dataset that has already been scraped? I'm trying to get the data live, as it is scraped.
Pepa J•3y ago
@Ruuubear Yes, I wait for the run to finish. Otherwise you would have to do some active waiting: checking whether the actor is still running, working out the offset parameter for listing items based on the items you have already downloaded, etc.
rival-blackOP•3y ago
Is it possible to do it live? It is quite essential for what I am building.
Pepa J•3y ago
I am afraid you would need to solve this yourself; I don't know of any dataset streaming mechanism available on the platform. But you can work around it by polling the data every few seconds and checking the run status (I suggest waiting another 5 seconds after the run ends, because some items could still be in the process of being stored to the dataset).
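The polling approach described above could be sketched roughly as follows. This is an untested-against-the-platform outline, not an official streaming API: the function name stream_items and its parameters are made up for this example, and it assumes the run was started without waiting (for instance with the client's start() call) so that its run ID is available while the actor is still working.

```python
import time

# Run statuses after which no more items are expected.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def stream_items(client, run_id, poll_secs=5.0, grace_secs=5.0,
                 page_size=1000, sleep=time.sleep):
    """Yield dataset items of a run while the run is still in progress.

    Hypothetical helper: `client` is expected to be an
    apify_client.ApifyClient; the parameter names are illustrative.
    """
    run = client.run(run_id).get()
    dataset = client.dataset(run["defaultDatasetId"])
    offset = 0          # number of items already downloaded
    finished = False    # True once the run ended and the grace period passed

    while True:
        # Ask only for items we have not seen yet.
        page = dataset.list_items(offset=offset, limit=page_size).items
        for item in page:
            yield item
        offset += len(page)

        if page:
            continue    # more items may be waiting, fetch again right away
        if finished:
            break       # run ended, grace period passed, dataset drained

        status = client.run(run_id).get()["status"]
        if status in TERMINAL_STATUSES:
            # Wait a little longer, as suggested above, then do a final pass.
            sleep(grace_secs)
            finished = True
        else:
            sleep(poll_secs)


# Usage (placeholders, not real values):
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_API_TOKEN>")
#   run = client.actor("<ACTOR_ID>").start(run_input=run_input)  # returns immediately
#   for tweet in stream_items(client, run["id"]):
#       print(tweet)
```

Tracking the offset means each poll only downloads new items, and the extra pass after the grace period picks up anything written just as the run ended.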
rival-blackOP•3y ago
OK, thanks for your help.
