978-1-5386-4110-1/18/$31.00 ©2018 IEEE

Real-time Twitter Sentiment Analysis using 3-way classifier

Alaa S. Al Shammari

Computer Science Department, King Saud University, Riyadh, Saudi Arabia

Email: [email protected]

Abstract—Sentiment analysis, or opinion mining, is an important problem because a huge amount of information expressing users' opinions is spread across all countries of the world. This paper presents an online system for real-time Twitter sentiment analysis and classification. The proposed system lets users enter a query and obtain a graphical representation of the polarity of the matching tweets. Out of the various classification algorithms, the Simple Voter and Naïve Bayes algorithms are used to classify tweets. The results show that the system is more accurate with the Naïve Bayes classifier (81%) than with the Simple Voter classifier (74%).

Keywords-Sentiment analysis, twitter, opinion mining, polarity, R.

I. INTRODUCTION

Opinion mining and sentiment analysis is one of the most important research topics at present. It refers to the study of people's emotions, sentiments, or opinions as expressed in written text.

Companies and organizations try to find out users' opinions about their products, services, and so on in order to make better decisions. Microblogging and social platforms such as Twitter, Facebook, and Weibo can be used to extract and analyze users' opinions and reviews. The amount of this information is too large for ordinary users to analyze manually, so sentiment analysis techniques can be used instead.

Twitter is one of the most popular social networks and microblogging platforms. It gives senders a convenient way to write and share their opinions about many aspects of life within a 140-character limit. The number of tweets is enormous and has grown rapidly. Tweets are quite difficult to analyze because of misspellings, emoji, and slang, so a preprocessing step is needed before the polarity (positive or negative) of each tweet can be detected.

The rest of the paper is structured as follows. Section II discusses previous related work on Twitter sentiment analysis and opinion mining. Sections III and IV describe the system model and the methodology for collecting and analyzing the Twitter dataset. Section V presents the results. Finally, Section VI discusses the conclusion and future work.

II. LITERATURE REVIEW

There has been great interest in the sentiment analysis problem, and a number of studies have addressed it from different angles.

In [1], the authors used Twitter to collect a corpus, performed a linguistic analysis of the corpus, and built a sentiment classifier using Naïve Bayes to determine the polarity of documents. The experimental evaluation shows that the proposed techniques improve on previous methods.

In [2], the authors collected three datasets (HASH, EMOT, and iSieve) and preprocessed them. Their evaluation of the training data found that some features may not combine well with other features.

In [3], the authors took a unigram model as the baseline, with a reported gain of 4% on both binary and 3-way classification. They then investigated a tree kernel model and feature-based models, both of which outperformed the unigram model. Their feature analysis showed that the combination of part-of-speech and prior polarity features outperforms the other features.

In [4], the authors focused on real-time political sentiment analysis, using Naïve Bayes to classify tweets into four categories (positive, negative, neutral, and unsure). The system is designed to analyze tweets during election events. The method is generic and can be applied to other domains, such as movie events.

The work in [5] studied sentiment analysis in the electronic products domain by analyzing tweets related to that field. Machine learning techniques outperformed symbolic techniques at identifying sentiment. The proposed enhanced feature vector was evaluated with several classifiers, namely Naïve Bayes, Maximum Entropy, SVM, and ensemble classifiers, whose accuracy was almost the same. The proposed feature vector improved performance in the electronic products opinion domain.

Significant effort has also been directed toward developing sentiment analysis systems that support the research community.

In [6], Microsoft Azure developed a real-time Twitter sentiment analysis demo that displays the polarity of tweets on Bing Maps, where each tweet is classified as positive, negative, or neutral and shown on the map in a different color. Several other online tools and applications that offer Twitter sentiment analysis services are listed in [7].

This study focuses on sentiment analysis and opinion mining of Twitter data extracted in real time, based on polarity classification of the tweets.

III. SYSTEM MODEL

The system aims to generate a graphical representation of tweet polarity based on the user's query. It is composed of three components, as shown in Fig. 1. The first component takes the input (the query) from the user together with the number of tweets to retrieve (from 0 to 2000). The sentiment classification component then categorizes each tweet as positive, negative, or neutral. Finally, a bar chart of the results is produced as the output for the user's query.

Figure 1. Block diagram of the system.
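The three components of the system model can be sketched as a small pipeline. The Python sketch below is illustrative only: the original system was implemented in R, and the names `validate_request` and `run_pipeline`, as well as the `fetch` and `classify` callables, are hypothetical placeholders rather than parts of the original system. Only the 0 to 2000 range comes from the text.

```python
from collections import Counter

MAX_TWEETS = 2000  # upper bound of the tweet range stated in the system model

def validate_request(query, count):
    """Input component: check the user's query and the requested tweet count."""
    query = query.strip()
    if not query:
        raise ValueError("query must not be empty")
    if not 0 <= count <= MAX_TWEETS:
        raise ValueError(f"count must be between 0 and {MAX_TWEETS}")
    return query, count

def run_pipeline(query, count, fetch, classify):
    """Chain the three components: input, classification, output counts.

    fetch(query, count) is assumed to return a list of tweet texts, and
    classify(text) is assumed to return 'positive', 'negative' or 'neutral'.
    """
    query, count = validate_request(query, count)
    tweets = fetch(query, count)
    labels = Counter(classify(text) for text in tweets)
    # The per-class counts are what the output component would plot.
    return {name: labels.get(name, 0) for name in ("positive", "negative", "neutral")}
```

Passing the extraction and classification steps in as functions keeps the sketch independent of any particular Twitter client or classifier.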

IV. METHODOLOGY

The main question of this paper is: how can we classify real-time tweets by polarity in an efficient and fast way? The performance measure of our system is the accuracy on the retrieved tweets, while the factors are the query and the number of requested tweets.

A. Data Extraction

Real-time tweets are extracted using the Twitter API. I created a Twitter application called TwitterSentimentAnalysis44 to obtain the four keys (Consumer Key, Consumer Secret, Access Token, and Access Token Secret) that are needed for Twitter authentication. Tweets are then extracted according to the number of requested tweets.

B. Preprocessing

The text of Twitter posts differs from the text of books or articles in that it includes many idiosyncratic elements such as user mentions and retweets. Tweets therefore need to be preprocessed to remove all data that is unimportant for our classifier. The proposed system cleans each tweet by removing hashtags (#), URLs, retweet markers (RT), the @ of user account names, multiple white spaces, punctuation, control characters, and digits. Each tweet is then tokenized.
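The cleaning steps listed above can be sketched with regular expressions. This is a minimal Python illustration, not the original R code. It makes three assumptions that the paper does not confirm: a whole `@username` mention is dropped, only the `#` symbol of a hashtag is removed (the word is kept), and tokens are lowercased before whitespace tokenization. The function names are hypothetical.

```python
import re

def clean_tweet(text):
    """Remove the tweet elements named in the preprocessing step."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\bRT\b", " ", text)                  # retweet marker
    text = re.sub(r"@\w+", " ", text)                    # user mentions (assumed: whole mention)
    text = text.replace("#", " ")                        # hashtag symbol (assumed: keep the word)
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)         # control characters
    text = re.sub(r"\d+", " ", text)                     # digits
    text = re.sub(r"[^\w\s]", " ", text)                 # punctuation and other symbols
    return re.sub(r"\s+", " ", text).strip()             # collapse multiple white spaces

def tokenize(text):
    """Clean a tweet and split it into lowercase word tokens (lowercasing is assumed)."""
    return clean_tweet(text).lower().split()
```

URLs are removed first so that the later punctuation and digit rules do not leave URL fragments behind.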

C. Sentiment Classification

Several classifiers are used for opinion mining and sentiment analysis of microblogging platforms. I used the Simple Voter algorithm and the Naïve Bayes algorithm, which classify tweets as positive, negative, or neutral. The positive and negative dictionaries were downloaded from the internet and contain a total of 2014 positive words and 4783 negative words. The proposed system works at the sentence level, so each tweet is decomposed into separate words. At this level, we score each tweet using the following equation:

Score = Number of positive words - Number of negative words    (1)

• If Score > 0, then the sentence is positive.
• If Score < 0, then the sentence is negative.
• If Score = 0, then the sentence is neutral.
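Equation (1) and the three rules translate directly into code. The Python sketch below is illustrative, with hypothetical function names. The two word sets are tiny stand-ins for the downloaded dictionaries (2014 positive and 4783 negative words), which are not reproduced here.

```python
# Tiny stand-in lexicons; the real dictionaries are far larger.
POSITIVE_WORDS = {"happy", "good", "love", "great"}
NEGATIVE_WORDS = {"sad", "bad", "hate", "awful"}

def score_tweet(tokens):
    """Equation (1): number of positive words minus number of negative words."""
    positive = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negative = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return positive - negative

def classify_by_score(tokens):
    """Map the score to one of the three polarity classes."""
    score = score_tweet(tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A tweet with equal numbers of positive and negative words scores 0 and is therefore labeled neutral, just like a tweet with no dictionary words at all.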

As shown in (1), each tweet is tokenized into separate words, and each word is compared against the terms of the positive dictionary and the negative dictionary. A score is then assigned to each tweet, based on the probability when the Naïve Bayes algorithm is used or on the majority when the Simple Voter algorithm is used. The tweets are thus classified into three categories (positive, negative, or neutral) based on the score.
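The paper does not specify how the Naïve Bayes probabilities are estimated. As one possible reading, the Python sketch below shows a generic multinomial Naïve Bayes classifier with add-one (Laplace) smoothing, trained on labeled token lists. It is an assumption for illustration and not the original R implementation. The toy training data and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (tokens, label) pairs.
TRAINING = [
    (["happy", "great", "day"], "positive"),
    (["love", "this"], "positive"),
    (["sad", "awful", "day"], "negative"),
    (["hate", "this"], "negative"),
    (["meeting", "at", "noon"], "neutral"),
    (["see", "you", "tomorrow"], "neutral"),
]

def train(labeled_tweets):
    """Estimate log class priors and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_tweets:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    total_docs = sum(class_counts.values())
    log_priors = {c: math.log(n / total_docs) for c, n in class_counts.items()}
    return log_priors, word_counts, vocabulary

def predict(tokens, model):
    """Return the class with the highest posterior log-probability."""
    log_priors, word_counts, vocabulary = model
    scores = {}
    for label, log_prior in log_priors.items():
        total_words = sum(word_counts[label].values())
        score = log_prior
        for token in tokens:
            if token in vocabulary:  # words never seen in training are skipped
                count = word_counts[label][token]
                # Add-one smoothing avoids zero probabilities.
                score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)
```

Working in log space keeps the product of many small word probabilities from underflowing.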

D. Graphical Representation

Several packages and libraries are available for graphical representations such as pie charts, line charts, column charts, and histograms.

In this system, a bar chart (which we also call a histogram) is used to display the result of the sentiment classification, as shown in Fig. 2.

Figure 2. The graphical user interface of the system.
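The data behind such a chart is just the number of tweets per class. The Python sketch below counts the predicted labels and renders them as a plain-text bar chart. It is a stand-in for illustration only: the original interface was built in R with a graphical plot, and the function names and the `width` parameter are hypothetical.

```python
from collections import Counter

CLASSES = ("positive", "negative", "neutral")

def polarity_counts(labels):
    """Count how many tweets fall into each polarity class."""
    counts = Counter(labels)
    return [(name, counts.get(name, 0)) for name in CLASSES]

def render_bar_chart(labels, width=40):
    """Render one text bar per class, scaled so the largest class fills `width`."""
    rows = polarity_counts(labels)
    peak = max(count for _, count in rows) or 1  # avoid dividing by zero
    lines = []
    for name, count in rows:
        bar = "#" * round(width * count / peak)
        lines.append(f"{name:<8} | {bar} {count}")
    return "\n".join(lines)

print(render_bar_chart(["positive", "positive", "neutral"], width=10))
```

Scaling against the largest class keeps the chart readable whether the query returns ten tweets or two thousand.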

V. RESULTS AND DISCUSSION

Our system was implemented using RStudio Version 1.0.136 on an Asus laptop with an Intel Core i7 processor and 4 GB of memory. The proposed system was evaluated on the accuracy of the retrieved tweets with respect to the user's information needs. The experiment was run on 1500 tweets for the query "Happy", which were labeled manually to form a gold-labeled dataset for measuring the performance of the system with the Naïve Bayes and Simple Voter algorithms. In addition, the characteristics of the gold-labeled dataset are summarized in Table I.

TABLE I. THE MAIN CHARACTERISTICS OF THE GOLD LABELED DATASET

Sentiment   Count
Positive     1074
Negative      100
Neutral       326
Total        1500

I compared the output of the Simple Voter classifier with the manually labeled dataset and obtained the confusion matrix summarized in Table II.

TABLE II. CONFUSION MATRIX OF SIMPLE VOTER CLASSIFICATION

                       Predicted class
Actual class   Positive   Negative   Neutral
Positive            953         25       142
Negative             14         47        68
Neutral             107         28       116

Then, I compared the output of the Naïve Bayes classifier with the manually labeled dataset and obtained the confusion matrix summarized in Table III.

TABLE III. CONFUSION MATRIX OF NAÏVE BAYES CLASSIFICATION

                       Predicted class
Actual class   Positive   Negative   Neutral
Positive           1022         27       144
Negative             19         53        35
Neutral              33         20       147

Next, I calculated the accuracy of the Simple Voter and Naïve Bayes classifiers as follows:

• Accuracy of Simple Voter classification = 0.74
• Accuracy of Naïve Bayes classification = 0.81

The results show that the system performs better with the Naïve Bayes classifier than with the Simple Voter classifier: the accuracy is 81% with Naïve Bayes and 74% with Simple Voter.
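Both accuracy figures can be reproduced from the confusion matrices in Tables II and III. Accuracy is the sum of the diagonal (correctly classified tweets) divided by the total number of tweets, so it is the same whichever axis holds the gold labels. The Python sketch below is for illustration, with a hypothetical function name. It also checks that the column totals of both matrices (1074, 100, 326) equal the gold counts in Table I.

```python
# Confusion matrices copied from Tables II and III (class order: positive, negative, neutral).
SIMPLE_VOTER = [
    [953, 25, 142],
    [14, 47, 68],
    [107, 28, 116],
]
NAIVE_BAYES = [
    [1022, 27, 144],
    [19, 53, 35],
    [33, 20, 147],
]

def accuracy(matrix):
    """Correctly classified tweets (the diagonal) divided by all tweets."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def column_totals(matrix):
    """Sum each column of the matrix."""
    return [sum(column) for column in zip(*matrix)]

print(round(accuracy(SIMPLE_VOTER), 2))  # 1116 / 1500
print(round(accuracy(NAIVE_BAYES), 2))   # 1222 / 1500
```

The Simple Voter diagonal sums to 953 + 47 + 116 = 1116 and the Naïve Bayes diagonal to 1022 + 53 + 147 = 1222, each out of 1500 tweets.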

A. Limitations

• Sometimes the Twitter API returns fewer tweets than the number requested.
• The 3-way classifier cannot handle semantic analysis of tweets.

VI. CONCLUSION AND FUTURE WORK

This paper investigated the problem of real-time Twitter sentiment analysis, providing a graphical representation of tweets by opinion category (positive, negative, and neutral) to help companies, organizations, and agencies focus on users' opinions of their products, services, and so on. It can also support society in fighting terrorism, racism, and deviant thought by detecting negative tweets.

In this paper, I proposed a system that displays a histogram (bar chart) representation based on the user's query. The system was evaluated on the accuracy of the proposed 3-way classifier using the Naïve Bayes and Simple Voter algorithms. Our experiment considered the user query and the number of requested tweets, and the results show that the system was more accurate with the Naïve Bayes classifier than with the Simple Voter classifier. In future work, I will focus on supporting other languages, such as Arabic and Spanish, and on semantic analysis.

REFERENCES

[1] A. Pak and P. Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining", in International Conference on Language Resources and Evaluation, Valletta, Malta, 2010.

[2] E. Kouloumpis, T. Wilson and J. Moore, "Twitter sentiment analysis: The good the bad and the omg!", in Fifth International AAAI Conference on Weblogs and Social Media, Barcelona, Spain, 2011, pp. 538-541.

[3] A. Agarwal, B. Xie, I. Vovsha, O. Rambow and R. Passonneau, "Sentiment analysis of Twitter data", in Workshop on Languages in Social Media, Portland, Oregon, 2011, pp. 30-38.

[4] "A system for real-time Twitter sentiment analysis of 2012 U.S. presidential election cycle", in ACL 2012 System Demonstrations, Jeju Island, Korea, 2012, pp. 115-120.

[5] M. Neethu and R. Rajasree, "Sentiment analysis in twitter using machine learning techniques", 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT), 2013.

[6] "Analyze real-time Twitter sentiment with HBase", Docs.microsoft.com, 2017. [Online]. Available: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hbaseanalyze-twitter-sentiment. [Accessed: 20-Jan-2018].

[7] "Free Nuts", Yourhandphone.blogspot.com, 2017. [Online]. Available: http://yourhandphone.blogspot.com/2011/09/free-nuts25.html. [Accessed: 20-Jan-2018].