Author: Priyanka Arvind Linge
On Twitter, users exchange information, ideas, and comments that are not vetted. In addition, Twitter allows bot accounts on the platform, which adds to the misinformation. A Twitter bot autonomously performs actions such as tweeting, retweeting, following, unfollowing, liking, or even messaging other accounts. A bot's intent and activity can be beneficial or harmful. Bots can automatically retweet posts without verifying the facts or checking the credibility of the source, and they can mislead and manipulate social media discourse with rumors, spam, malware, misinformation, slander, or even just noise. Some studies suggest that between 9% and 15% of active Twitter accounts are bots. Worldwide, around 12% of people use Twitter as a news source, and around 17% of Americans use Twitter to get the news. This study aims to classify bot and non-bot accounts and thereby make a small contribution to limiting the spread of misinformation.
The bot list for the dataset was obtained by scraping botwiki.org. The site was created mainly for bot enthusiasts, developers, and artists to share ideas and resources on bot creation. Its aim is to preserve examples of creative online bots and to provide tutorials and resources for people interested in making bots.
To obtain the list of non-bot accounts, we collected the usernames of known friends and then gathered all the ‘followers’ and ‘following’ accounts of those users, which were identified as non-bots with the help of those friends. In total, we collected 1600 bot and non-bot accounts for this study. To make the dataset more robust, we included different types of accounts, for example verified accounts and accounts with different frequencies of Twitter use.
To classify bots and non-bots, we used ensemble techniques such as Gradient Boosted Trees and a Random Forest classifier. Initially we trained the models using the basic variables, and the highest accuracy obtained was 95%, using the XGBoost model. After adding additional variables such as average time of tweet, user mention count, URL count, retweet count, has default profile image, age of account, and FollowsBackAllFollowers, the highest accuracy obtained was 98.26% on the training dataset, using Random Forest.
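The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the study's actual pipeline: it assumes scikit-learn is available, uses scikit-learn's GradientBoostingClassifier as a stand-in for the XGBoost model, and trains on synthetic data from make_classification rather than the collected Twitter accounts. The function name compare_models, the 75/25 split, and all hyperparameters are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_models(X, y, seed=0):
    """Fit both ensemble models and return train and held-out accuracy for each."""
    # Stratified split keeps the bot / non-bot ratio the same in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    models = {
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=seed),
        # Stand-in for XGBoost; both are gradient boosted tree ensembles.
        "gradient_boosting": GradientBoostingClassifier(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = {
            "train": accuracy_score(y_train, model.predict(X_train)),
            "test": accuracy_score(y_test, model.predict(X_test)),
        }
    return results


# Synthetic placeholder data: 1600 rows and 14 columns, mirroring only the
# number of accounts and variables in the study (values are NOT real).
X, y = make_classification(
    n_samples=1600, n_features=14, n_informative=8, random_state=0
)
results = compare_models(X, y)
for name, scores in results.items():
    print(f"{name}: train={scores['train']:.4f} test={scores['test']:.4f}")
```

The sketch reports held-out accuracy next to training accuracy because a random forest with fully grown trees usually scores close to 100% on the data it was trained on, so the held-out figure is the more informative of the two.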
Basic variables used in training: verified, location, description, followers count, following count, favorites count, listed count
Additional variables used: average time of tweet, user mention count, URL count, retweet count, has default profile image, age of account, FollowsBackAllFollowers
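As a rough illustration of how the additional variables might be derived from raw account data, the sketch below computes them with the Python standard library. The source does not define the variables precisely, so several interpretations here are assumptions: "average time of tweet" is taken as the mean gap between consecutive tweets, "retweet count" as the number of the account's tweets that start with "RT @", and FollowsBackAllFollowers as "every follower is also followed". The dictionary keys (created_at, text, follower_ids, following_ids, default_profile_image) and the function name derive_features are hypothetical.

```python
import re
from datetime import datetime, timezone

MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")


def derive_features(account, tweets, now):
    """Compute the additional variables for one account (assumed definitions)."""
    times = sorted(t["created_at"] for t in tweets)
    # Gaps in seconds between consecutive tweets.
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    texts = [t["text"] for t in tweets]
    followers = set(account["follower_ids"])
    following = set(account["following_ids"])
    return {
        "avg_seconds_between_tweets": sum(gaps) / len(gaps) if gaps else 0.0,
        "user_mention_count": sum(len(MENTION_RE.findall(x)) for x in texts),
        "url_count": sum(len(URL_RE.findall(x)) for x in texts),
        "retweet_count": sum(1 for x in texts if x.startswith("RT @")),
        "has_default_profile_image": bool(account["default_profile_image"]),
        "account_age_days": (now - account["created_at"]).days,
        # True only if the account has followers and follows every one of them.
        "follows_back_all_followers": bool(followers) and followers <= following,
    }


# Made-up example account and tweets, for illustration only.
utc = timezone.utc
account = {
    "created_at": datetime(2020, 1, 1, tzinfo=utc),
    "default_profile_image": False,
    "follower_ids": [1, 2],
    "following_ids": [1, 2, 3],
}
tweets = [
    {"created_at": datetime(2020, 12, 1, 12, 0, tzinfo=utc),
     "text": "RT @alice: check https://example.com"},
    {"created_at": datetime(2020, 12, 1, 12, 10, tzinfo=utc),
     "text": "hello @bob and @carol"},
    {"created_at": datetime(2020, 12, 1, 12, 30, tzinfo=utc),
     "text": "plain text"},
]
features = derive_features(account, tweets, now=datetime(2021, 1, 1, tzinfo=utc))
print(features)
```

Each account would yield one such dictionary, which could then be combined with the basic variables to form a row of the training table.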