Crawlee & Apify•3y ago

Difference between the scraped amount in browser vs on Python

I am using the Twitter scraper from Python. In the browser console I get all 300 tweets that I request, but a counter in my Python script, which increments for each item from client.dataset(run['defaultDatasetId']).iterate_items(), ends up at around 90, so it seems I am only getting about a third of the tweets I scrape. Does anyone know why, or can you recommend what to do?

9 Replies

Pepa J•3y ago

Hello @Ruuubear, could you provide us with a better example of the code, and maybe send the runId (if you are running it on the platform) by PM, so we can try to reproduce and investigate it on our side? Also beware that scraping through a proxy from a different region, or with a logged-out account, may give you different results than you see in the browser.

rival-blackOP•3y ago

Hi @Pepa J, the run ID is taVGBlEaesj8eUwii. This is my run input:

run_input = {
    "profilesDesired": 1,
    "handle": [f"{user_profile}"],
    "searchMode": "user",
    "tweetsDesired": maxtweets,
    # "mode": "replies",
    "proxyConfig": {"useApifyProxy": True},
    "extendOutputFunction": """async ({ data, item, page, request, customData, Apify }) => { return item; }""",
    "extendScraperFunction": """async ({ page, request, addSearch, addProfile, , addThread, addEvent, customData, Apify, signal, label }) => { }""",
    "customData": {},
    "handlePageTimeoutSecs": 5000,
    "maxRequestRetries": 6,
    "maxIdleTimeoutSecs": 60,
    "initialCookies": [],
}

maxtweets was set to 300.

Pepa J•3y ago

@Ruuubear I just tested it and it worked well. My implementation:

from apify import Actor
from apify_client import ApifyClient


async def main():
    async with Actor:
        # Get the value of the actor input
        actor_input = await Actor.get_input() or {}

        apify_client = ApifyClient('apify_api_************************')

        dataset = apify_client.dataset('Sh*************zz')

        dataset_items = dataset.list_items().items

        i = 1
        for item in dataset_items:
            print(i)
            i += 1

Be sure you provide the right datasetId (and not the actorId).
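One way to avoid mixing up the IDs is to take the dataset ID from the run object itself, after the run has finished. The following is only a sketch: `count_scraped_items` is a hypothetical helper name, and it assumes `client` is an `ApifyClient` instance from the `apify_client` package, whose `actor(...).call(...)` waits for the run to finish and returns the run as a dict.

```python
from typing import Any, Optional


def count_scraped_items(client: Any, actor_id: str, run_input: dict) -> Optional[int]:
    """Run an actor to completion and count the items in its default dataset.

    Hypothetical helper; `client` is assumed to be an apify_client.ApifyClient.
    """
    # call() starts the run and blocks until it has finished.
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        return None

    # The dataset ID comes from the run, not from the actor.
    dataset_id = run["defaultDatasetId"]

    # iterate_items() pages through the whole dataset.
    return sum(1 for _ in client.dataset(dataset_id).iterate_items())
```

If the count from a finished run still differs from what the browser shows, the cause is more likely the proxy region or login state mentioned earlier than the way the items are read.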

rival-blackOP•3y ago

Thanks for your reply. In your example are you pulling the dataset already scraped? I'm trying to get the data as it is scraped live.

Pepa J•3y ago

@Ruuubear Yes, I wait for the run to finish. Otherwise you would have to do some active waiting: checking whether the actor is still running, working out the offset parameter for listing items based on the items already downloaded, etc.

rival-blackOP•3y ago

Is it possible to do it live? It is quite essential for what I am building.

Pepa J•3y ago

I am afraid you would need to solve that yourself; I don't know of any dataset streaming mechanism available on the platform. But you could solve this by polling the data every few seconds and checking the run status (I suggest waiting another 5 seconds after the run ends, because some items could still be being stored to the dataset).
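The polling approach described above could look roughly like this. It is a minimal sketch, not an official streaming API: `stream_dataset_items` and its parameters are hypothetical names, `client` is assumed to be an `ApifyClient` whose `run(...).get()` returns the run dict and whose `dataset(...).list_items(offset=..., limit=...)` returns a page with an `.items` list, and the run is assumed to have been started without waiting (for example with `client.actor(...).start(...)`).

```python
import time
from typing import Any, Callable, Iterator

# Run statuses after which no further items are expected.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def stream_dataset_items(
    client: Any,
    run_id: str,
    dataset_id: str,
    poll_secs: float = 5.0,
    grace_secs: float = 5.0,
    page_size: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield dataset items while the run is still in progress.

    Hypothetical helper; `client` is assumed to be an apify_client.ApifyClient.
    """
    offset = 0          # number of items already yielded
    finished = False    # set once the run reached a terminal status

    while True:
        # Drain everything stored since the previous poll.
        while True:
            page = client.dataset(dataset_id).list_items(offset=offset, limit=page_size)
            items = page.items
            if not items:
                break
            offset += len(items)
            yield from items

        if finished:
            return

        run = client.run(run_id).get()
        if run is None or run["status"] in TERMINAL_STATUSES:
            # Wait a little longer, then do one final drain, because
            # some items may still be on their way into the dataset.
            sleep(grace_secs)
            finished = True
        else:
            sleep(poll_secs)
```

Tracking `offset` means each poll only requests items that have not been seen yet, so the consumer can process tweets as they arrive instead of re-reading the whole dataset.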

rival-blackOP•3y ago

Ok, thanks for your help
