Multiple usernames for single actor run not working

I am trying to use the apify-client for Python to kick off a run of the Instagram Reel Scraper (apify/instagram-reel-scraper) via the API. I have been successful in a "one-to-one" scenario, where I run it for a single Instagram handle and it returns ten results as expected. However, I want to pass ~1,600 Instagram IDs in a single API call. You can do this through the web console quite easily and it runs without a problem; I can even edit the JSON to execute all ~1,600. However, whenever I pass my payload through the API, it skips all of the Instagram handles. I am curious whether I am not building my payload correctly, or whether it is something else. I have been unable to find documentation that speaks specifically to this use case, but I would be happy to read up on it if it exists. Thanks in advance!
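For reference, a minimal sketch of what a multi-handle run looks like with the Python client, assuming the actor's username field accepts an array of handles (the same shape the web console JSON uses); the token and handles below are placeholders:

from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

run_input = {
    "username": ["handle1", "handle2", "handle3"],  # a real list, not a JSON-encoded string
    "resultsLimit": 10,
}

# Start the actor, wait for it to finish, then read the default dataset.
run = client.actor("apify/instagram-reel-scraper").call(run_input=run_input)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"))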
3 Replies
metropolitan-bronze (OP) · 15mo ago
Here is an example of the log output I get when the actor runs.
2024-03-04T03:31:29.502Z ACTOR: Pulling Docker image of build ZzBn4x8zhwvmc44ao from repository.
2024-03-04T03:31:30.279Z ACTOR: Creating Docker container.
2024-03-04T03:31:30.475Z ACTOR: Starting Docker container.
2024-03-04T03:31:32.879Z INFO System info {"apifyVersion":"3.1.15","apifyClientVersion":"2.9.1","crawleeVersion":"3.8.1","osType":"Linux","nodeVersion":"v18.19.1"}
2024-03-04T03:31:32.981Z WARN Skipped incorrect URL: {"urlStringOrObject":"[\"instagramhandle1 \", \"instagramhandle2\", \"instagramhandle3\", \"instagramhandle4\", \"instagramhandle5\",... [line-too-long]
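The escaped, quoted array inside that WARN line suggests the actor received a single JSON-encoded string rather than a list of handles. A minimal sketch of how that happens (this matches the fix found later in the thread):

import json

handles = ["instagramhandle1 ", "instagramhandle2", "instagramhandle3"]

# json.dumps() collapses the whole list into ONE string...
payload = json.dumps(handles)
print(type(payload).__name__)  # str
print(payload)                 # ["instagramhandle1 ", "instagramhandle2", ...]

# ...so {"username": payload} hands the actor a single string it cannot
# parse as a handle or profile URL, hence "Skipped incorrect URL".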
ratty-blush · 15mo ago
I see there is WARN Skipped incorrect URL: {"urlStringOrObject":"[\"instagramhandle1 \", \"instagramhandle2\", \"instagramhandle3\", \"instagramhandle4\", \"instagramhandle5\" in your log, so try checking the input structure (your payload) and make sure everything is correct there, especially since your input works correctly on the platform. It must be something with the payload. Try to double-check it, or provide a reproduction so we can check it (a snippet where you make the call to the scraper).
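One quick way to sanity-check the payload shape before calling the scraper; check_run_input is just an illustrative helper, and the field names follow the code later in this thread:

import json

def check_run_input(run_input: dict) -> None:
    # The actor expects "username" to be a list of handle strings.
    username = run_input.get("username")
    assert isinstance(username, list), f"username should be a list, got {type(username).__name__}"
    assert all(isinstance(h, str) for h in username), "every handle should be a string"
    # Print exactly what will be sent, for comparison with the web console JSON.
    print(json.dumps(run_input, indent=2))

check_run_input({"username": ["handle1", "handle2"], "resultsLimit": 10})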
metropolitan-bronze (OP) · 15mo ago
Certainly. Here is the code that is supposed to be building the payload:
from datetime import datetime
import csv
import json

# Get the current date and time
now = datetime.now()
# Format it as a string, e.g., "2024-02-21_17-51-57"
formatted_now = now.strftime("%Y-%m-%d_%H-%M-%S")

# Define a single CSV file name with the current date and timestamp
csv_file_name = f"instagram_scrape_results_{formatted_now}.csv"
# Define the CSV file path, e.g., "/mnt/data/" for saving in this environment
csv_file_path = f"/home/wesgelpi/Downloads/{csv_file_name}"

from google.cloud import bigquery

# Import the scrape_instagram function from apifyClient.py
from apifyClient import scrape_instagram
from data_processor import process_csv

# Initialize the BigQuery client
client = bigquery.Client()

def get_users():
    # Retrieve user data from the BigQuery table.
    query = """
    SELECT * FROM `self-tape-may.self_tape_may_data.tblInstagramUsers`
    """
    query_job = client.query(query)  # Make an API request.

    try:
        users = query_job.result()  # Waits for the query to finish
        return users
    except Exception as e:
        print("Error in get_users:", e)
        return None

users = get_users()

# Assuming 'users' is an iterable of user information
instagram_handles = [user['instagramHandle'] for user in users]  # Collect all handles

# Convert the list of handles to a JSON string
json_payload = json.dumps(instagram_handles)

# Now, call scrape_instagram with the string of Instagram handles
print("Scraping results starting at: ", datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
scrape_result = scrape_instagram(json_payload)
print("Completed scraping results at: ", datetime.now().strftime("%Y-%m-%d %H:%M:%S"), "writing data to csv...")
The function def scrape_instagram(user): is in a separate Python file. That code is:
import requests
from apify_client import ApifyClient


# Apify URL build
memLimit = "&memory=32"
resultsLimit = 10
filePath = "/home/wesgelpi/secrets/apifySecret.txt"
apiURL = "https://api.apify.com/v2/acts/"
actor_id = "apify~instagram-reel-scraper/run-sync"
apiURL = apiURL + actor_id + "?token="

with open(filePath, 'r') as file:
    fileContent = file.read().strip()

apiURL = apiURL + fileContent + memLimit

# Initialize the ApifyClient with my API Token
client = ApifyClient(fileContent)

def scrape_instagram(user):
    # Function to call the Apify API and scrape Instagram data
    # Build the Apify API payload
    run_input = {
        "username": user,
        "resultsLimit": resultsLimit,
    }

    print("Sending run_input:", run_input)

    scraped_data = []  # List to hold the results

    try:
        # Run the Actor and wait for it to finish
        run = client.actor("xMc5Ga1oCONPmWJIa").call(run_input=run_input)

        # Fetch Actor results from the run's dataset (if there are any),
        # extracting only the desired fields
        for item in client.dataset(run["defaultDatasetId"]).iterate_items():
            data_entry = {
                'id': item.get('id'),
                'type': item.get('type'),
                'ownerUsername': item.get('ownerUsername'),
                'hashtags': item.get('hashtags'),
                'url': item.get('url'),
                'timestamp': item.get('timestamp'),
                'childPosts': item.get('childPosts', [])
            }
            scraped_data.append(data_entry)

        return scraped_data
    except requests.exceptions.RequestException as e:
        print(f"Error while calling Apify API: {e}")
        return None
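As a side note, the hand-built apiURL (the run-sync endpoint with &memory=32) is never actually used; the ApifyClient call is what runs the actor. If the goal was to set the run's memory, the Python client accepts that directly. A sketch, with a placeholder token and 4096 MB as an example value:

from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

run = client.actor("xMc5Ga1oCONPmWJIa").call(
    run_input={"username": ["handle1"], "resultsLimit": 10},
    memory_mbytes=4096,  # run memory in megabytes
)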
Ok, I (finally) resolved the problem. You were correct, @Oleg V.: it was a malformed payload. The solution for me was to abandon the conversion to JSON via json_payload = json.dumps(instagram_handles) and instead pass the result of instagram_handles = [user['instagramHandle'] for user in users] directly. The other key change was to drop the square brackets in the run_input: instead of "username": [user] I changed it to "username": user.
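For anyone hitting the same issue, a minimal sketch of the corrected flow (the token and handles are placeholders; names follow the code above):

from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

# Pass the Python list straight through - no json.dumps().
instagram_handles = ["handle1", "handle2", "handle3"]

run_input = {
    "username": instagram_handles,  # the list itself, not [user] and not a JSON string
    "resultsLimit": 10,
}

run = client.actor("xMc5Ga1oCONPmWJIa").call(run_input=run_input)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"))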
