Twitter scraping by both keyword and profile
It is too computationally intensive and slow for me to make the API call with one of the filters and then post-process the results with the second filter. Can a single API call scrape while filtering by both keyword and profile? Is this possible, or can I only do one or the other? Thanks!
11 Replies
foreign-sapphireOP•3y ago
I see this question is similar to the Facebook scraper post. Is it the same case here, that you cannot filter by both simultaneously in one API call?
Hello @Deleted User, Twitter has its own advanced search. Could you fill in the advanced search form ( https://twitter.com/search-advanced?lang=en ) and then copy and paste the resulting query into the Actor's input? If that does not help, what combination of keywords and profiles are you trying to scrape?
foreign-sapphireOP•3y ago
For some reason, when I use advanced search with both a user and a keyword on Apify, it only searches by the keyword. Is that supposed to happen?
@Deleted User which specific actor do you use? I just tried Twitter Scraper, and 90% of the results are from the user I set in the input, with the right keywords.
foreign-sapphireOP•3y ago
I use the same one. I’m asking if it’s possible to set both a keyword and a user and have the results include only tweets that match both.
vicious-gold•3y ago
Can you give us more specific examples and a step-by-step description of what you are trying to achieve?
foreign-sapphireOP•3y ago
Sure. Say I want to scrape all tweets by https://twitter.com/JoeBiden containing the word "president". I am currently using this code:
import json
import requests

# api_token and api_endpoint are assumed to be defined earlier.
actor_input = {
    "addTweetViewCount": True,
    "addUserInfo": False,
    "browserFallback": False,
    "debugLog": False,
    "extendOutputFunction": "async ({ data, item, page, request, customData, Apify }) => {\n return item;\n}",
    "extendScraperFunction": "async ({ page, request, addSearch, addProfile, addThread, addEvent, customData, Apify, signal, label }) => {\n \n}",
    "fromDate": "2021-11-02",
    "handle": [
        "https://twitter.com/JoeBiden"
    ],
    "handlePageTimeoutSecs": 5000,
    "maxIdleTimeoutSecs": 60,
    "maxRequestRetries": 6,
    "mode": "own",
    "profilesDesired": 10,
    "proxyConfig": {
        "useApifyProxy": True
    },
    "searchTerms": [
        "president"
    ],
    "tweetsDesired": 10000,
    "useAdvancedSearch": True,
    "useCheerio": True
}
headers = {
    'Content-Type': 'application/json; charset=utf-8',
    'Authorization': f'Bearer {api_token}'
}
# Serialize the input to JSON (Python True/False become true/false).
data = json.dumps(actor_input)
response = requests.post(api_endpoint, headers=headers, data=data)
foreign-sapphireOP•3y ago
However, it looks like the actor is retrieving tweets from any user that contain the search term 'president'. I am only interested in tweets from "https://twitter.com/JoeBiden" containing the term 'president'. Thanks!
@Deleted User yes, with this general input I am also receiving a lot of irrelevant results.
That's why I suggested generating the expression from the advanced search form (on the Twitter website) and using it for the searchTerms attribute. The input then looks like this:
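As a minimal sketch of what such an input could look like: it assumes the advanced search form yields a query using Twitter's from: and since: search operators. The helper name build_search_term is hypothetical, and leaving "handle" out so that the query alone restricts the profile is an assumption, not something confirmed in this thread.

```python
import json
from typing import Optional


def build_search_term(keyword: str, handle: str, since: Optional[str] = None) -> str:
    """Combine a keyword and a profile into one advanced-search query string.

    Hypothetical helper: it mimics the kind of expression the Twitter
    advanced search form generates, using the from: and since: operators.
    """
    # Accept "JoeBiden" or "@JoeBiden" and normalize to the bare handle.
    query = f"{keyword} (from:{handle.lstrip('@')})"
    if since:
        # since: expects a date in YYYY-MM-DD format.
        query += f" since:{since}"
    return query


# Assumption: "handle" is omitted so the query itself limits the profile.
actor_input = {
    "searchTerms": [build_search_term("president", "JoeBiden", since="2021-11-02")],
    "tweetsDesired": 10000,
    "useAdvancedSearch": True,
}

print(json.dumps(actor_input, indent=2))
```

The remaining keys from the earlier input (proxy settings, timeouts, and so on) can be merged into this dictionary unchanged.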
Now all the results belong to the specified Twitter account.
foreign-sapphireOP•3y ago
Ahh okay, I was wrongly under the impression that the API would do this for me. Thank you so much!