
2018 Conference on Information and Communication Technology (CICT’18)

Twitter Sentiment Analysis using Dynamic Vocabulary

Hrithik Katiyar1, Monika1, Parveen Kumar1,2 and Ambalika Sharma2

1National Institute of Technology, Uttarakhand, Srinagar Garhwal, India

2Indian Institute of Technology Roorkee, India

{hrithik.cse15, monika.cse16, parveen.cse}@nituk.ac.in, [email protected]

Abstract—Technology has always driven advances in research. Today, Twitter is one of the most visited social networking sites, used by millions of people, and the Internet has become a significant way for people to express their opinions. Opinions tend to reflect beliefs as well as feelings. Sentiment analysis determines the polarity of an opinion, that is, whether the opinion is negative, neutral or positive. It has many applications, such as finding out customers’ opinions of a particular product for a company, analyzing movie reviews, and analyzing political opinions. This paper introduces a dynamic vocabulary technique, in which the vocabulary is built up as training proceeds. The experimental results show that the performance of the proposed technique is satisfactory.

Index Terms—Sentiment Analysis, Support Vector Machine, Index Construction.

INTRODUCTION

Nowadays, a major part of the world is connected to the Internet, so the Internet has become a popular way for people to express their opinions. The need for efficient and effective text mining tools and techniques has grown with the staggering amount of textual data now available. An important part of this data comes from social networking sites. Organizations can extract the polarity and sentiment of this massive body of information and use it to their benefit [17]. Twitter is one of the most popular social media sites, where users express their views in short messages called tweets. Millions of tweets are generated every day. Tweets often contain a user’s perspective and opinion on a topic, and research has shown that they provide valuable insights on some issues [13]. Sentiment analysis is the process of finding the opinion of a user about a topic or a text under consideration [18]; it reveals the polarity of that opinion. Text sentiment analysis is an automated process of determining whether a text segment contains objective or opinionated content, and it can furthermore determine the sentiment polarity of the text [15]. Sentiment analysis is useful in many settings. For example, when a company needs to know its customers’ reviews of its products, it can base future decisions on the information gained from sentiment analysis [2], [3]. Many traditional sentiment analysis approaches use the bag-of-words method [4]–[6]. Well-known machine learning techniques include Maximum Entropy, Stochastic Gradient Descent, Random Forest, SailAil Sentiment Analyser, Multilayer Perceptron, Naive Bayes, Multinomial Naive Bayes, and Support Vector Machine (SVM). In this paper, SVM is used for sentiment analysis. The Support Vector Machine was formally introduced in [19] and has proved to be one of the most widely used supervised machine learning algorithms for classification [17].

Some works have used an ontology to understand text [22]. At the phrase level, a system should be able to recognize the polarity of a phrase, as discussed in [23]. A tree kernel and a feature-based model are used for sentiment analysis on Twitter in [24].

Several sentiment analysis techniques have drawbacks. The n-gram technique does not capture long-range dependencies and relies on having a corpus of data to train on. The Naive Bayes technique assumes conditional independence among the linguistic features and is mainly used when the training data is small. The k-Nearest Neighbor technique requires large storage and has computationally intensive recall. In the rule-based approach, effectiveness and precision depend on how the rules are defined. The lexicon-based approach requires powerful linguistic resources, which are not always available. The bag-of-words technique does not consider language morphology, and it can incorrectly classify two phrases as having the same meaning because they share the same bag of words [5].

The SVM classifier has an advantage over the others for this task because the input space is high dimensional, only a few of the features are irrelevant, and the document vectors are sparse.

The rest of the paper is organized as follows. Section II describes related work, covering the techniques used by different researchers for sentiment analysis. Section III discusses the technique and algorithm developed in this paper. Section IV gives the results, and Section V gives the conclusion.

RELATED WORK

978-1-5386-8215-9/18/$31.00 ©2018 IEEE

Twitter sentiment analysis is a specific problem within sentiment analysis, a prominent area of research in the field of computational linguistics. Approaches to sentiment analysis identify and evaluate opinions expressed in text using automated methods [14]. The majority of current Twitter sentiment classification methods follow the technique proposed by Pang et al. [16] and apply machine learning techniques to build a classifier from tweets with manually annotated sentiment polarity labels [15]. In [21], a way is proposed to train an SVM model with pre-labeled data; Twitter hashtags were used to determine the polarity of the tweets, and the technique achieved an accuracy of 85 percent. Many classification techniques have been tried in this field. They can be broadly classified into two groups:

Text Classification Techniques: Two strategies for using neural networks for text classification are discussed in [7].

Traditional method [7]: In the traditional method, documents are encoded as numerical vectors. The documents are first joined into one long string. The string is tokenized on spaces and punctuation marks, each word is stemmed to its root form, and stop words are removed. The remaining words are known as feature candidates. Because the number of features is large, feature selection techniques are applied to select a subset. Even after feature selection the numerical vector can still be very large, which leads to a high preprocessing time and a drop in classification performance.

Novel method [7]: In the novel method, documents are mapped to string vectors instead of numerical vectors. A d-dimensional string vector contains d words, sorted by their frequencies in the text. This string vector is used as the input to the novel neural network.
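As a rough illustration of the string-vector idea, the sketch below keeps the d most frequent words of a tokenized document. The function name string_vector, the descending sort order, the alphabetical tie-break and the empty-string padding are assumptions made for this example; they are not taken from [7].

```python
from collections import Counter

def string_vector(tokens, d):
    """Return the d most frequent words of a document as a list of strings."""
    counts = Counter(tokens)
    # Assumption: most frequent first, ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    # Assumption: pad with empty strings when the document has fewer than d words.
    padding = [""] * max(0, d - len(ranked))
    return ranked[:d] + padding

tokens = "the cat sat on the mat the cat".split()
print(string_vector(tokens, 3))
```

Unlike a numerical vector, the result has a fixed, small length d regardless of the overall vocabulary size.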

Common preprocessing steps used in text classification are tokenization, stemming and stop-word removal [7]–[9]. Because text contains a large set of distinct words, classification techniques typically face a large number of features. Tokenization is the procedure of separating the text into words, and stemming is the procedure of reducing words that are not in their root form to that root form. Stop-word removal is the removal of common function words. These steps are performed to improve classification performance.

The selection techniques used in [8] are:

• df method,

• cf-df method,

• tf-idf method,

• method of principal component analysis.

Experiments showed that principal component analysis was the most effective. Neural networks other than feedforward networks, such as the Self-Organizing Map (SOM) and the Growing Hierarchical Self-Organizing Map (GHSOM), were used in [9]. SOM was chosen because it works well for mapping high-dimensional data into a two-dimensional representation space.
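The exact selection rules of [8] are not reproduced here. As a generic sketch of two of the quantities named above, the code below computes document frequency (df) and one common tf-idf weighting, tf × log(N / df). The function names and the top-k selection rule are assumptions for illustration only.

```python
import math
from collections import Counter

def document_frequency(docs):
    """Count in how many documents each word occurs."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a word at most once
    return df

def top_k_by_df(docs, k):
    """Assumed selection rule: keep the k words with the highest df."""
    df = document_frequency(docs)
    return sorted(df, key=lambda word: (-df[word], word))[:k]

def tf_idf(docs):
    """Weight each word of each document by tf * log(N / df)."""
    n = len(docs)
    df = document_frequency(docs)
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = [["good", "film"], ["bad", "film"], ["good", "good", "plot"]]
print(top_k_by_df(docs, 2))
print(tf_idf(docs)[2])
```

With this weighting, a word that occurs in every document receives weight zero, which is why tf-idf favors less common words.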

Sentiment Analysis Techniques: As per [5], one approach to sentiment analysis is the Recurrent Neural Network (RNN). This type of neural network performs better than traditional neural networks at structured prediction on variable-length input. In this architecture, the input layer at time t contains the bag-of-words feature vector. The hidden layer holds the history of the data and has recursive connections to itself. The input layer is connected to the hidden layer, and the hidden layer is connected to the output layer. The hidden layer and the output layer also contain neurons that store values at time t. The recursion makes the network deeper than a conventional neural network [5]. Rong et al. [5] proposed a semi-supervised dual RNN for sentiment analysis. Like the traditional RNN, this network can cover a long historical memory over time. Unlike the traditional RNN, its output layer also has recursive connections back to itself for better sentiment analysis.
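The recurrences described above can be written compactly as follows. This is a generic formulation under assumed notation, not necessarily the exact equations of [5]: x_t is the bag-of-words input at time t, h_t the hidden state, y_t the output, U, W, V and R are weight matrices, and f and g are activation functions.

```latex
\begin{align}
h_t &= f\left(U x_t + W h_{t-1}\right), \\
y_t &= g\left(V h_t\right) && \text{(traditional RNN)}, \\
y_t &= g\left(V h_t + R\, y_{t-1}\right) && \text{(output layer with recursion back to itself)}.
\end{align}
```

The first line is the recursive connection of the hidden layer to itself; the third line adds the recursive connection of the output layer that distinguishes the dual network from the traditional one.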

In [6], sentiment analysis techniques specific to Twitter are discussed, with a focus on the electronic products domain. Twitter sentiment analysis differs from traditional sentiment analysis because the character limit leads to more slang and misspelled words on Twitter, so preprocessing is required before feature extraction. Preprocessing involves two steps. First, Twitter-specific features such as hashtags and emoticons are extracted. Emoticons are given a positive or negative value based on their polarity class, and scores are also assigned to the hashtags. Second, after the Twitter-specific features are removed, a unigram approach is used and the plain text of the tweet is represented as a collection of words. The feature vector uses negative and positive tweets. Other text classification techniques such as Support Vector Machine (SVM), Naive Bayes and the Maximum Entropy classifier are also discussed in [6].

PROPOSED MODEL

In this paper, a model is proposed for Twitter sentiment analysis, with a Support Vector Machine (SVM) as the classifier. The dynamic vocabulary technique is used to build the vocabulary for analyzing tweets. SVM is a non-probabilistic classifier that requires a large set of training data. It is based on the concept of decision planes, which define decision boundaries: a decision plane separates sets of objects that belong to different classes. In the dynamic vocabulary technique, there is no vocabulary initially. The vocabulary is built during training by adding unique words to it one by one. The resulting dictionary is then used for testing. This technique keeps a record of modified forms of words that cannot be known beforehand and are otherwise difficult to track. Some of these words are important for sentiment analysis.
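For reference, the decision plane of a linear soft-margin SVM in the standard formulation of [19] is given below. The paper does not state which kernel or regularization constant was used, so the linear form and the symbols w, b, C and ξ_i are assumptions for illustration. The labels are taken as y_i ∈ {−1, +1}, so the paper's Negative label 0 would map to −1.

```latex
\begin{align}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to}\quad & y_i\left(w^{\top} x_i + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n, \\
\hat{y}(x) &= \operatorname{sign}\left(w^{\top} x + b\right).
\end{align}
```

Here x_i is the numerical vector of tweet i, and the hyperplane w^T x + b = 0 is the decision plane that separates the two sentiment classes.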

Preprocessing

Before the data is passed to the SVM classifier, it should be preprocessed and prepared for training. The basic preprocessing steps are:


Removal of Irrelevant Parts:

• Removal of Punctuation Marks and Symbols: The punctuation marks and symbols used in the tweets should be removed, as they play a smaller role than the words in deciding the sentiment of an opinion.

• Removal of Single Characters: The single characters present in the tweets should be removed, as they play no role in the sentiment analysis of an opinion.

• Removal of Stop Words like a, are, is, the, etc.: The stop words should be removed, as they occur more often than any other terms. Less frequent words are more important to sentiment analysis than highly frequent words. Here the list of stop words is taken from the Internet.

• Removal of @ and the username following it: The characters following @ represent the username of the user to whom the tweet is addressed. The username has no significance in deciding the sentiment of an opinion.

• Stemming of the Words: Words that are not in their root form are stemmed to their root form for further processing. For example, the words waiting, waits and waited all become wait.

These actions reduce the number of terms used in later steps. The algorithm used for stemming is Porter’s Stemming Algorithm.
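The preprocessing steps above could be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list is an abbreviated assumption, the order of the steps is assumed, and naive_stem is a crude suffix stripper standing in for Porter's algorithm, which has many more rules. A real Porter stemmer can be passed in through the stem parameter.

```python
import re

# Assumed, abbreviated stop-word list; the paper takes its list from the Internet.
STOP_WORDS = {"a", "an", "and", "are", "is", "the", "to", "of", "in", "for"}

def naive_stem(word):
    """Crude suffix stripper used only as a stand-in for Porter's algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(tweet, stem=naive_stem):
    """Turn a raw tweet into a list of stemmed tokens."""
    text = tweet.lower()
    text = re.sub(r"@\w+", " ", text)          # drop @ and the username after it
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation marks and symbols
    tokens = text.split()
    tokens = [t for t in tokens if len(t) > 1]           # drop single characters
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    return [stem(t) for t in tokens]

print(preprocess("@bob I was waiting, he waits & waited for the bus!"))
```

In this sketch the mention is removed before the punctuation, since stripping the @ symbol first would leave the username behind as an ordinary word.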

Data Preparation for Twitter Sentiment Analysis:

• Vocabulary List or Dictionary: After the above steps, the stemmed words from every tweet that are unique overall are added to a list structure. This list is then sorted in alphabetical order. The sorted list is known as the Vocabulary List or Dictionary. In this way, the vocabulary is built dynamically. Once fully developed after training, this vocabulary is used for testing.

• Mapping Structures: A mapping variable contains the location index of each word in the vocabulary, which acts as the termID of that word. It also contains the sentiment value (0: Negative, +1: Positive) of the corresponding tweet.

• Numerical Vector: A numerical vector is formed and used as the input to the SVM classifier model. Its size is (number of tweets) x (number of words in vocabulary list). The rows correspond to the tweets in the training set, and the columns correspond to the unique words in the vocabulary list. Each value is the frequency of occurrence of the column’s word in the tweet represented by the row. The mapping variable is used to map the column number to the words in the vocabulary list or dictionary. This numerical vector is fed to the classifier.

The complete procedure of Twitter Sentiment Analysis (TSA) is shown in Algorithm 1.

Algorithm 1 TSA Algorithm
  data ← <Dataset>
  vocabulary ← <>
  tokens ← <>
  input ← <>
  for each tweet(i) in data do
    tokens(i) ← tokenize(tweet(i))
    vocabulary ← addtokens(vocabulary, tokens(i))
  end for
  vocabulary ← sort(vocabulary)
  for each tweet(i) in data do
    for each word(j) in vocabulary do
      input(i, j) ← 0
    end for
  end for
  for each tweet(i) in data do
    for each word(j) in vocabulary do
      for each token(k) in tokens(i) do
        if word(j) == token(k) then
          input(i, j) ← input(i, j) + 1
        end if
      end for
    end for
  end for
  return input
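A compact Python rendering of Algorithm 1 might look like the sketch below. It assumes the tweets have already been tokenized and stemmed; the function names build_vocabulary and vectorize are hypothetical. Instead of comparing every vocabulary word with every token, it looks each token up in a word-to-termID mapping, which yields the same counts. Skipping words that were never seen in training is an assumption about how the fixed vocabulary is applied at test time.

```python
def build_vocabulary(token_lists):
    """Collect the unique tokens seen in training and sort them alphabetically."""
    vocabulary = set()
    for tokens in token_lists:
        vocabulary.update(tokens)
    return sorted(vocabulary)

def vectorize(token_lists, vocabulary):
    """Return a (tweets x vocabulary) matrix of term frequencies."""
    term_id = {word: j for j, word in enumerate(vocabulary)}  # word -> column index
    matrix = [[0] * len(vocabulary) for _ in token_lists]
    for i, tokens in enumerate(token_lists):
        for token in tokens:
            j = term_id.get(token)
            if j is not None:  # assumption: words unseen in training are skipped
                matrix[i][j] += 1
    return matrix

train = [["good", "movi", "good"], ["bad", "movi"]]
vocabulary = build_vocabulary(train)
X_train = vectorize(train, vocabulary)
X_test = vectorize([["good", "plot"]], vocabulary)
print(vocabulary, X_train, X_test)
```

The resulting rows, together with the 0/+1 sentiment labels, are what would be passed to the SVM classifier.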

Fig. 1: Twitter Sentiment Analysis Model

The Twitter Sentiment Analysis Model is depicted in Fig. 1.

CONCLUSION

This model performs well when the number of tweets used for training is small. As the number of tweets increases, so does the number of unique words and hence the size of the vocabulary, which slows the model down and makes limited memory an issue. If memory is not an issue, accuracy increases with the size of the training set, because the vocabulary also grows and can store more of the unique words used in the training tweets. For this classification task, the SVM model is more efficient than the neural network model: with the dynamic vocabulary, the neural network model takes more computation time than the SVM model, and the SVM model also gives better results.

TABLE I: Comparison of Results of some Techniques

Reference            Technique   Dataset        Accuracy (%)
Pang et al. [25]     MaxEnt      Movie Review   81.00
Pang et al. [25]     NB          Movie Review   81.50
Gautam et al. [26]   MaxEnt      Twitter        83.80
Go et al. [27]       MaxEnt      Twitter        83.00
Go et al. [27]       SVM         Twitter        82.20
Proposed Technique   Dyn. Voc.   Twitter        84.23

FUTURE WORK

Limited memory becomes an issue when training the model, so additional techniques for reducing the number of features can be tried in order to reduce the size of the vocabulary and hence of the input vector. Other types of classifiers, such as recursive neural networks, self-organizing maps and recursive deep neural networks, can also be tried to address the memory problem. Emoticons may also carry sentiment value, so text-based emoticons such as :) can be taken into account during preprocessing.
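One possible way to take text-based emoticons into account, sketched here as an untested idea rather than part of the evaluated model, is to replace them with placeholder words before punctuation is stripped. The emoticon lists and the placeholder tokens emopos and emoneg are hypothetical.

```python
# Hypothetical emoticon lists and placeholder tokens, for illustration only.
POSITIVE_EMOTICONS = [":)", ":-)", ":D", ";)"]
NEGATIVE_EMOTICONS = [":(", ":-(", ":'("]

def replace_emoticons(tweet):
    """Replace text emoticons with word tokens that survive punctuation removal."""
    for emoticon in POSITIVE_EMOTICONS:
        tweet = tweet.replace(emoticon, " emopos ")
    for emoticon in NEGATIVE_EMOTICONS:
        tweet = tweet.replace(emoticon, " emoneg ")
    return tweet

print(replace_emoticons("great phone :)").split())
print(replace_emoticons("battery died :( :-(").split())
```

This step would need to run before lowercasing and punctuation removal, since those steps would otherwise alter or delete the emoticons. The placeholders then enter the dynamic vocabulary like any other word.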

REFERENCES

[1] Porter, Martin F. "Readings in information retrieval." (1997): 313-316.
[2] Wang, Wei. "Sentiment analysis of online product reviews with Semi-supervised topic sentiment mixture model." Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on. Vol. 5. IEEE, 2010.
[3] Zhou, Xujuan, et al. "Sentiment analysis on tweets for social events." Computer Supported Cooperative Work in Design (CSCWD), 2013 IEEE 17th International Conference on. IEEE, 2013.
[4] Socher, Richard, et al. "Recursive deep models for semantic compositionality over a sentiment treebank." Proceedings of the 2013 conference on empirical methods in natural language processing. 2013.
[5] Rong, Wenge, et al. "Semi-supervised dual recurrent neural network for sentiment analysis." 2013 IEEE International Conference on Dependable, Autonomic and Secure Computing (DASC). IEEE, 2013.
[6] Neethu, M. S., and R. Rajasree. "Sentiment analysis in twitter using machine learning techniques." Computing, Communications and Networking Technologies (ICCCNT), 2013 Fourth International Conference on. IEEE, 2013.
[7] Jo, Taeho. "NTC (Neural Text Categorizer): Neural network for text categorization." International Journal of Information Studies 2.2 (2010): 83-96.
[8] Lam, Savio LY, and Dik Lun Lee. "Feature reduction for neural network based text categorization." Database Systems for Advanced Applications, 1999. Proceedings., 6th International Conference on. IEEE, 1999.
[9] Murthy, Kavi Narayana. "Automatic categorization of Telugu news articles." Department of Computer and Information Sciences (2003).
[10] Yao, Kaisheng, et al. "Recurrent neural networks for language understanding." Interspeech. 2013.
[11] Anjaria, Malhar, and Ram Mohana Reddy Guddeti. "Influence factor based opinion mining of Twitter data using supervised learning." Communication Systems and Networks (COMSNETS), 2014 Sixth International Conference on. IEEE, 2014.
[12] Duncan, Brett, and Yanqing Zhang. "Neural networks for sentiment analysis on Twitter." Cognitive Informatics & Cognitive Computing (ICCI*CC), 2015 IEEE 14th International Conference on. IEEE, 2015.
[13] Jansen, Bernard J., et al. "Twitter power: Tweets as electronic word of mouth." Journal of the American society for information science and technology 60.11 (2009): 2169-2188.
[14] Zimbra, David, Manoochehr Ghiassi, and Sean Lee. "Brand-related Twitter sentiment analysis using feature engineering and the dynamic architecture for artificial neural networks." System Sciences (HICSS), 2016 49th Hawaii International Conference on. IEEE, 2016.
[15] Jianqiang, Zhao, Gui Xiaolin, and Zhang Xuejun. "Deep Convolution Neural Networks for Twitter Sentiment Analysis." IEEE Access 6 (2018): 23253-23260.
[16] Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. "Thumbs up?: sentiment classification using machine learning techniques." Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10. Association for Computational Linguistics, 2002.
[17] Ahmad, Munir, Shabib Aftab, and Iftikhar Ali. "Sentiment Analysis of Tweets using SVM." Int. J. Comput. Appl 177.5 (2017): 25-29.
[18] Bholane Savita, D., and Deipali Gore. "Sentiment Analysis on Twitter Data Using Support Vector Machine." International Journal of Computer Science Trends and Technology (IJCST)–Volume 4: 365.
[19] Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine learning 20.3 (1995): 273-297.
[20] https://datahub.ckan.io/dataset/a7effce8-ec55-4c23-b291-ba279b83b705/resource/091d6b4b-22e9-4a64-85c4-bdc8028183ac/download/finalizedfull.csv
[21] Zgheib, Wassim A., and Aziz M. Barbar. "A Study using Support Vector Machines to Classify the Sentiments of Tweets."
[22] Boguslavsky, I. "Semantic Descriptions for a Text Understanding System." Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference "Dialogue" (2017). 2017.
[23] Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. "Recognizing contextual polarity in phrase-level sentiment analysis." Proceedings of the conference on human language technology and empirical methods in natural language processing. Association for Computational Linguistics, 2005.
[24] Agarwal, Apoorv, et al. "Sentiment analysis of twitter data." Proceedings of the workshop on languages in social media. Association for Computational Linguistics, 2011.
[25] Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. "Thumbs up?: sentiment classification using machine learning techniques." Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10. Association for Computational Linguistics, 2002.
[26] Gautam, Geetika, and Divakar Yadav. "Sentiment analysis of twitter data using machine learning approaches and semantic analysis." Contemporary computing (IC3), 2014 seventh international conference on. IEEE, 2014.
[27] Go, Alec, Richa Bhayani, and Lei Huang. "Twitter sentiment classification using distant supervision." CS224N Project Report, Stanford 1.12 (2009).
