
SPECIAL SECTION ON EMERGING TRENDS, ISSUES AND CHALLENGES FOR ARRAY SIGNAL PROCESSING AND ITS APPLICATIONS IN SMART CITY

Received September 9, 2018, accepted October 8, 2018, date of publication October 18, 2018, date of current version November 19, 2018.

Digital Object Identifier 10.1109/ACCESS.2018.2876674

Multi-Class Sentiment Analysis in Twitter: What If Classification Is Not the Answer

MONDHER BOUAZIZI AND TOMOAKI OHTSUKI (Senior Member, IEEE)

Graduate School of Science and Technology, Keio University, Yokohama 223-8522, Japan

Corresponding author: Mondher Bouazizi (bouazizi@ohtsuki.ics.keio.ac.jp)

This work was supported by the Keio Leading-Edge Laboratory of Science and Technology-Japan under Grant KEIO-KLL-000081.

ABSTRACT With the rapid growth of online social media content, and the impact it has had on people's behavior, many researchers have been interested in studying these media platforms. A major part of their work has focused on sentiment analysis and opinion mining. These refer to the automatic identification of the opinions of people toward specific topics by analyzing their posts and publications. Multi-class sentiment analysis, in particular, addresses the identification of the exact sentiment conveyed by the user rather than the overall sentiment polarity of their text message or post. That being the case, we introduce a task different from the conventional multi-class classification, which we run on a data set collected from Twitter. We refer to this task as “quantification.” By the term “quantification,” we mean the identification of all the existing sentiments within an online post (i.e., tweet) instead of attributing a single sentiment label to it. To this end, we propose an approach that automatically attributes a score to each sentiment in a tweet, and selects the sentiments with the highest scores, which we judge as conveyed in the text. To reach this target, we added to our previously introduced tool SENTA the components necessary to perform such a task. Throughout this work, we present the added components, study the feasibility of quantification, and propose an approach to perform it on a data set made of tweets for 11 different sentiment classes. The data set was manually labeled and the results of the automatic analysis were checked against the human annotation. Our experiments show the feasibility of this task and reach an F1 score equal to 45.9%.

INDEX TERMS Twitter, sentiment analysis, machine learning.

I. INTRODUCTION

Twitter has attracted the research community over the last few years because of several properties it possesses. These include the open nature of Twitter, which, unlike other platforms, allows people to access posts published by Twitter users without requiring a registration. In addition, relations between users in Twitter are not necessarily mutual. In other words, for a user “A” to follow another user “B”, “B” is not required to follow “A” as well. This property made it possible to have influential users, whose followers are numerous and whose tweets usually have a huge impact on public opinion [1], [2], or even to “create” ones [3]. Twitter also has a free-to-use API through which anyone can collect tweets dealing with any topic in real time as they are posted.

However, the nature of tweets themselves (i.e., microblogs posted or published by users of Twitter) and their content are probably the most interesting part of this microblogging platform. Tweets are limited to 140 characters per tweet, making them brief and easy to read. They usually include slang words, abbreviations and emoticons. Nonetheless, the employment of hashtags (words or phrases preceded by the sign “#”) made this platform very suited for spreading news or discussions by using them (the hashtags) to refer to the topics being discussed, and even for creating sub-communities and micro-celebrities [4]. While the use of hashtags is unmoderated and not subject to any restrictions, they have been widely adopted, mainly in Twitter, and became one effective way to identify the trending topics, and to collect tweets related to a certain subject, product, service or others.

That being the case, it is easy to collect tweets related to any specific topic from several users and to analyze them, which has been done by several companies and organizations. One particular type of analysis of these data is what we refer to as “sentiment analysis”. In this context, sentiment analysis refers to the process of computationally identifying the opinion of people towards specific topics from their online posts.

Sentiment analysis has been studied in depth in the literature: several approaches have been proposed to perform this task on data collected from Twitter [5]–[8] as well as other sources of online data [9], [10]. In a previous work [11], we proposed an approach that performs this task on data collected from Twitter for several topics, where tweets were classified as positive or negative.

2169-3536 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

VOLUME 6, 2018

https://orcid.org/0000-0001-7055-9318

https://orcid.org/0000-0003-3961-1426

In a more recent work [12], we dealt with a more challenging task, which we refer to as “multi-class sentiment analysis”, where tweets were classified into one of 7 different sentiment classes. However, as we discussed in [12], this task presents several challenges. A major challenge we discussed at length is the fact that a tweet might simply contain more than one sentiment. That being the case, in the current work, we aim to deal with this problem. We propose an approach that tries to detect all the sentiments existing in a given tweet and to attribute different scores to these sentiments showing their weight, i.e., how relevant they are in the tweet. We refer to this task as “quantification”.

The contributions of this work are the following:

1) we introduce the task of sentiment quantification as described above and as we will describe in more detail later in this work,

2) we propose an approach that relies on writing patterns along with other sets of features to perform a ternary sentiment classification of tweets (i.e., the classification into “positive”, “negative” and “neutral”),

3) upon classification, the writing patterns are used again to attribute scores for each sentiment in every tweet. These scores are used to filter the sentiments we judge as being conveyed in the tweet (within the process we refer to as quantification),

4) we added the required quantification components to our previously introduced tool SENTA [12], to make it easy to run the approach.

The remainder of this paper is structured as follows: in Section II, we discuss the limitations of multi-class sentiment analysis and present our motivations for this work. In Section III, we present some of the work related to the subject we discuss in this paper. In Section IV, we describe the modules and components that we have added to SENTA. In Section V, we describe in detail our proposed approach for sentiment quantification. In Section VI, we show the results of our experiments using the approach on a data set made of tweets, analyze the obtained results, and discuss the potential and limits of the approach. Finally, Section VII concludes this work and proposes possible directions for future work.

II. MOTIVATIONS

A. MULTI-CLASS CLASSIFICATION: POTENTIAL AND LIMITS

In previous works [12], [13], we explored the task of multi-class sentiment analysis in Twitter: for a given tweet, instead of telling whether it is positive, negative or neutral, our aim was to identify the most dominant sentiment in it, that being “Happiness”, “Love”, “Sadness”, etc.

Such a task is interesting given that it allows companies, for example, to distinguish between comments regarding their products that are dissatisfaction-driven and those which relate to physical damage or other issues. This can be seen in the following two tweets, which show 2 different sentiments despite both being negative:

• “C'mon Valve!! get a solution for these bastard cheaters?? They are ruining the game and soon enough there won't be anyone playing CSGO!”

• “I bought it yesterday, and now it's discounted. Just why Valve why? :(”

Even though both tweets could interest the company in question, the first tweet could be judged as more important, being useful feedback from a frustrated and angry user, whereas the second rather shows a sentiment of sadness for the bad luck the user had.

The tweets in question are not unique, nor few in number. A negative tweet could have several interpretations, depending on the actual sentiment shown. The same can be said about positive tweets.

This highlights the importance of multi-class classification, and shows why it is indeed needed. However, as we will see in more detail in the next sections, tweets tend to show more than one sentiment. In the data set we have used in this work, we asked human annotators to attribute one sentiment or more to every tweet, and the results show that more than 55% of the tweets actually contain more than one sentiment. That is not surprising though: in [12], we studied the performance of multi-class classification, and concluded that this is indeed common: a sentimental tweet (i.e., a tweet that is not neutral) usually shows more than one sentiment. Moreover, some sentiments are highly correlated. As a matter of fact, tweets showing hate tend to show anger and frustration as well.

B. WHY QUANTIFICATION?

The presence of several sentiments within a tweet, as shown above, makes the task of multi-class classification somewhat obsolete given that, out of all the sentiments present, only one is identified. That being the case, the identification of all the existing sentiments is a very challenging task [12], [14]. Not only does it suggest that the different sentiments co-exist within the tweet, but also that these might have different weights and manifestations. This leads to a more challenging question: is it possible to identify these sentiments and attribute different scores to them, each showing the weight of the corresponding sentiment?

In this work, we refer to the task of identifying these sentiments and attributing scores to them as “quantification”.

C. SENTA: REQUIREMENT FOR AN UPDATE

SENTA has previously been introduced for the purpose of multi-class classification [12]: it helps extract several sets of features and export them in several formats, allowing the user to perform the classification later on with any program or tool. However, to make it easy for users to experiment with their data, it would be more convenient to allow them to run the classification within SENTA.


Nonetheless, as part of the quantification process, tweets are initially classified into 3 classes: positive, negative and neutral (ternary classification). Performing the classification somewhere else separately, and re-introducing the results is very inconvenient and impractical. Therefore arises the need for adding a classifier component to the tool so that the classification is performed internally.

In addition, for the sake of quantification, other sets of features need to be introduced, notably what we will refer to as “Advanced Pattern Features”. These features are very important for quantification; however, they can also be used for classification.

III. RELATED WORK

Twitter, being one of the biggest web destinations and a very active microblogging service, has attracted an important part of the attention of researchers [15]. This is due partially to the several properties of Twitter that we introduced in Section I. It is also due to the abundance of Twitter-collected data and the ease of manual annotation of tweets to experiment with.

Twitter analysis has covered several of its properties, and was not restricted to its content. Some works studied the relations between users and the identification of hidden communities [16], [17], and the influence users might have on each other [18]. Tweets have also proven to be able to influence false memory [19] and spread fake information [20], making it interesting to understand how this platform (i.e., Twitter) orients public opinion and influences it [21]. In this context, Achananuparp et al. [22] studied user behavior with regard to information propagation through microblogging websites, taking Twitter as an example. They used retweets as indicators of originating and promoting behaviors. They proposed several models to measure these two behaviors and demonstrated their applicability.

In a related context, Twitter has been studied as a potential teaching and learning tool [23], [24]. Kassens-Noor [23] conducted experiments to explore the teaching practice of Twitter as an active, informal learning tool, while Dhir et al. [24] focused on the impact of Twitter, whether positive or negative, on informal learning, class dynamics, motivations, and the academic and psychological development of students.

However, sentiment analysis in social media in general, and Twitter in particular, has been among the hottest topics of research in recent years: while sentiment analysis has been a subject of research for decades and goes back to the 90s of the previous century (and even way back to the early years of the 20th century) [15], the rise of the internet, followed by the exponential growth of online content and the spread of social media usage, made the topic of high interest to companies and organizations [15]. This is because, nowadays, the amount of data generated by end users is very rich and covers several aspects of the users' lives as well as their opinions towards various topics and subjects. Performing sentiment analysis on such data is of great use to companies, for example, that want to know the opinion of average consumers [25], [26]. This is because data collected from online shops or dedicated movie review websites tend to be polarized, and people who are very satisfied or dissatisfied are more likely to share their experiences on these websites.

That being the case, we find in the literature several works that have dealt with the topic of sentiment analysis in Twitter. These works revolve mostly around the use of machine learning and a pre-labeled data set to learn how to classify tweets. They started with simple approaches that re-applied existing works proposed previously for other types of texts, and soon after evolved into more sophisticated ones that use features that are very specific to Twitter, such as the use of slang words [27] or emoticons [28].

A particular task in sentiment analysis, referred to as aspect-based sentiment analysis, has also attracted the attention of researchers. Aspect-based sentiment analysis refers to the classification of sentiments for the different aspects present in a given piece of text. Zainuddin et al. [29] proposed a hybrid sentiment classification approach in which they use Twitter attributes as features to improve Twitter aspect-based sentiment analysis. They ran their approach on several existing data sets to validate its efficiency. Similarly, Bhoi and Joshi [30] proposed to use various classification approaches involving conventional machine learning and deep learning techniques to perform aspect-based sentiment analysis.

Multi-class sentiment analysis on Twitter has attracted part of the attention as well, but it has not matured yet: the state-of-the-art works are good, but the task requires deeper study. Multi-class classification refers to the identification of the exact sentiment(s) present in a given piece of text rather than just determining its overall polarity (whether it is positive, negative or neutral). To begin with, most of these works have dealt with this task in a way different from ours. In fact, multi-class classification has conventionally referred to the attribution of one of several sentiment strengths to a text or a tweet. A typical classification task was to attribute one of the following sentiment classes to tweets: {“very negative”, “negative”, “neutral”, “positive” and “very positive”}, or simply to attribute a score ranging from −1 to 1, showing at the same time the polarity and the strength of the sentiment [31], [32]. With the wide adoption of Deep Learning as a cutting-edge technology, this task has also been dealt with in works such as that of Yu and Chang [33] and that of Araque et al. [34].

However, there have been several approaches which dealt with multi-class classification the way we do in this work: detecting one (or more) sentiment(s) for a given text or tweet. For instance, Lin et al. [35], [36] proposed an approach in which they extracted features they qualified as “similarity features”, which they used to classify tweets into reader-emotion categories. A similar task has been tackled by Ye et al. [37], who proposed an approach that tries to identify the sentiments of readers of news articles. Liang et al. [38] proposed a system that recommends emoticons (which eventually show emotions) to users while they are typing a text message. These emoticons are generated by analyzing the sentiment in the text being typed. In a more recent work, Krawczyk et al. [39] tackled the problem of multi-class sentiment analysis in imbalanced data collected from Twitter. They proposed an approach that relies on a binarization scheme and pairwise dimensionality reduction to reduce the task to an easier one: they generate pairwise dichotomies, then for each pair of classes they reduce the feature dimensions and use several classifiers to perform the binary classification.

In a related, yet somewhat distant context, the term “quantification” has been used in the sentiment analysis literature to refer to the estimation of the relative frequency of the different classes that the instances of a given data set are to be classified into. In other words, in most cases, the party performing sentiment analysis cares more about the percentage of data showing each sentiment (mainly in the case of binary or ternary classification). Therefore, it might be interesting to find ways to identify these percentages instead of actually finding the class labels of the individual tweets. This idea has been developed and several approaches were proposed to solve this problem [40]–[43], even for a poor initial classification accuracy of the individual tweets [44]. It is important to understand that the task we are dealing with in this work is completely different. It aims to identify the actual labels of the individual tweets. It is fair to assume it is closer to the context of multi-class classification.

IV. SENTA - INTEGRATING THE QUANTIFICATION COMPONENTS

A. TOOLS

To recall, SENTA was built using Java 8 and JavaFX, a platform used to make desktop applications.

We have also used the Apache OpenNLP¹ Application Programming Interface (API) to perform the different Natural Language Processing (NLP) tasks, such as tokenization, Part-of-Speech (PoS) tagging, lemmatization, etc.

In the current work, we have used the Weka² API [45] to make use of its built-in classifiers. While Weka has a Graphical User Interface (GUI), we have built our own for the different classifiers that we have implemented so far.

B. CONVENTION

As we previously stated in [12], the term “user” will be used to refer to the user of SENTA, whereas, if needed, the term “Twitterer” will be used to refer to a Twitter user. In addition, in this section, the term “interface” will be used to refer to the graphical user interface of SENTA.

Furthermore, the interfaces and components of SENTA, which have been previously introduced in [12] will not be detailed here.

¹ https://opennlp.apache.org

² https://www.cs.waikato.ac.nz/ml/weka/

C. GRAPHICAL USER INTERFACES

1) ADVANCED FEATURES CUSTOMIZATION

The sets of features we have introduced previously were enough for tasks such as multi-class classification. However, for quantification, our experiments have shown their limits in the detection of all the existing sentiments within a tweet. To begin with, only a few sets actually take into account the different sentiments (i.e., unigram features, top words and pattern features). Other features, such as punctuation features, do not refer to the sentiments in the tweets, nor do they have any direct correlation with a given sentiment.

That being the case, we believe that adding more features is required to perform the task of quantification: we refer to these as “Advanced Features”. Two sets of features have been fully integrated so far, as shown in Fig. 1. These are:

• Advanced Unigram Features

• Advanced Pattern Features

In the rest of this subsection, we describe these two sets of features, what they refer to and how they are extracted.

a: ADVANCED PATTERN FEATURES

Advanced pattern features are similar to the old pattern features [12]. They are extracted from a given set (that could be the training set), and are used in two different ways (either each pattern is a unique feature, or several patterns can be scored and summed up together, as we will explain later on). We rely on both Part-of-Speech tags and sentiment scores of words to extract the different advanced patterns. First of all, a word can be sentimental or not: if a word has the PoS of a verb, an adverb, a noun or an adjective, it is qualified as sentimental given that only these words (as well as some interjections) could convey sentiments; a word having any of the remaining PoS is qualified as non-sentimental. In addition, in the same way we previously extracted words correlated with a given sentiment [12] (unigram features) with the help of WordNet [46], we extract words correlated with each sentiment that we use in our data set. Obviously, these can only be verbs, adverbs, nouns or adjectives.

Unlike basic patterns, which are extracted for a given tweet regardless of its sentiment, advanced patterns are extracted differently for different sentiments. An advanced pattern is created as follows:

- For training tweets (tweets of known sentiments): given a tweet having sentiments {s1, · · · ,sN}, for the sentiment si, the corresponding pattern will be extracted as follows: for each token, if it is a sentimental word, we verify whether it conveys the sentiment si. If it does, it is replaced in the pattern by its simplified PoS-Tag as shown in TABLE 1 along with the sentiment. Otherwise, if it is sentimental but does not convey si or if it is not sentimental, it is simply replaced by the corresponding simplified PoS-Tag as shown in TABLE 1.

- For test tweets (tweets whose sentiments are unknown): for all the sentiments that are being studied, we do the same: for each sentiment si, we extract a separate pattern using the same approach.
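As an illustration only, the extraction rule described above can be sketched in Python as follows. This is not SENTA's code (SENTA is written in Java): the function names, the set of tags treated as sentimental, the hand-tagged tokens and the small sentiment lexicon are all assumptions made for the example, and the tag names do not reproduce TABLE 1.

```python
# Illustrative sketch only; names, tags and lexicon are hypothetical.
from typing import Dict, List, Set, Tuple

# Assumption: simplified PoS tags that may carry a sentiment.
SENTIMENTAL_TAGS = {"VERB", "ADVERB", "NOUN", "ADJECTIVE"}


def advanced_pattern(tagged: List[Tuple[str, str]], sentiment: str,
                     lexicon: Dict[str, Set[str]]) -> List[str]:
    """Build the full advanced pattern of one tweet for one sentiment.

    tagged  : (lemma, simplified PoS tag) pairs of the tweet
    lexicon : sentiment -> lemmas assumed to convey that sentiment
    """
    words = lexicon.get(sentiment, set())
    pattern = []
    for lemma, tag in tagged:
        if tag in SENTIMENTAL_TAGS and lemma in words:
            # Sentimental word conveying the sentiment: SENTIMENT_TAG
            pattern.append(f"{sentiment.upper()}_{tag}")
        else:
            # Non-sentimental word, or word not conveying it: plain tag
            pattern.append(tag)
    return pattern


def patterns_for_tweet(tagged, sentiments, lexicon):
    """One pattern per sentiment (all studied sentiments for a test tweet)."""
    return {s: advanced_pattern(tagged, s, lexicon) for s in sentiments}


# Hand-tagged toy input (hypothetical lemmas and tags).
TWEET = [("i", "PRONOUN"), ("like", "VERB"), ("it", "PRONOUN"),
         ("sooo", "INTERJECTION"), ("much", "ADVERB"), (".", "."),
         ("thank", "NOUN"), ("a", "PARTICLE"), ("lot", "ADJECTIVE")]
LEXICON = {"Happiness": {"like", "thank"}, "Love": {"like"}}

for name, pat in patterns_for_tweet(
        TWEET, ["Happiness", "Love", "Sadness"], LEXICON).items():
    print(name, " ".join(pat))
```

For a training tweet, `sentiments` would be the annotated labels only; for a test tweet, it would be the full list of sentiments under study.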


FIGURE 1. Advanced features – main window.

TABLE 1. List of simplified part-of-speech tags.

To make this concrete, consider the following tweet: “I liked it sooo much. Thanks a lot!”

If we suppose that this is a tweet of known sentiments that human annotators have labeled with the two sentiments “Happiness” and “Love”, it generates the following two full patterns:

- Happiness: [PRONOUN HAPPINESS_VERB PRONOUN INTERJECTION ADVERB . HAPPINESS_NOUN PARTICLE ADJECTIVE]

- Love: [PRONOUN LOVE_VERB PRONOUN INTERJECTION ADVERB . NOUN PARTICLE ADJECTIVE]

given that the word “like” shows both happiness and love, while “thank” shows only happiness.

If the sentiments of this tweet are unknown and need to be detected, then, in addition to the aforementioned patterns, we need to extract the patterns for all the other possible sentiments, including:

- Sadness: [PRONOUN VERB PRONOUN INTERJECTION ADVERB . NOUN PARTICLE ADJECTIVE]

- Neutral: [PRONOUN VERB PRONOUN INTERJECTION ADVERB . NOUN PARTICLE ADJECTIVE]

- etc.

Patterns are defined as ordered sequences of words with very specific length(s). They are extracted from the known data set. For a given tweet and a given sentiment, it is possible to extract several patterns. If a pattern occurs both in a tweet of negative sentiments and in a tweet of positive ones, it is discarded. Additionally, a pattern needs to occur several times in tweets of a given sentiment to make sure it really characterizes that sentiment. Patterns can be either used as unique features or summed up.
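The selection rules above (fixed length, minimum number of occurrences, and rejection of patterns seen under both polarities) can be sketched as follows. This is a minimal Python illustration under stated assumptions: patterns are taken here as contiguous windows of the full pattern, occurrences are counted once per tweet, and the names `ngrams` and `select_patterns` are hypothetical rather than part of SENTA.

```python
# Illustrative sketch only; the windowing and counting rules are assumptions.
from collections import Counter


def ngrams(full_pattern, length):
    """All contiguous sub-sequences of the given length."""
    return [tuple(full_pattern[i:i + length])
            for i in range(len(full_pattern) - length + 1)]


def select_patterns(labelled, length, min_occ):
    """Keep the patterns that characterize a sentiment.

    labelled : iterable of (full_pattern, sentiment, polarity), where
               polarity is "positive", "negative" or "neutral"
    Returns  : dict sentiment -> set of retained patterns
    """
    counts = Counter()   # (sentiment, pattern) -> number of tweets
    polarities = {}      # pattern -> polarities it was seen under
    for full, sentiment, polarity in labelled:
        for p in set(ngrams(full, length)):
            counts[(sentiment, p)] += 1
            polarities.setdefault(p, set()).add(polarity)

    kept = {}
    for (sentiment, p), c in counts.items():
        # Discard patterns seen in both positive and negative tweets.
        conflicting = {"positive", "negative"} <= polarities[p]
        if c >= min_occ and not conflicting:
            kept.setdefault(sentiment, set()).add(p)
    return kept


# Toy training data (hypothetical full patterns).
LABELLED = [
    (["PRONOUN", "LOVE_VERB", "PRONOUN"], "Love", "positive"),
    (["PRONOUN", "LOVE_VERB", "NOUN"], "Love", "positive"),
    (["PRONOUN", "VERB", "NOUN"], "Sadness", "negative"),
    (["PRONOUN", "VERB", "NOUN"], "Happiness", "positive"),
]

print(select_patterns(LABELLED, length=2, min_occ=2))
```

In this toy set, the windows of the last two tweets occur under both polarities and are therefore dropped whatever the value of `min_occ`.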

In the case where patterns are used as unique features, they must all have the same length, and each pattern extracted from the known data set is used to generate a single feature as follows:

For a tweet T and a reference pattern p extracted earlier from the known data set, we first extract the full patterns from the tweet and then use the following resemblance function [47] to measure how much T resembles p:

\[
\mathrm{res}(p, T) =
\begin{cases}
1, & \text{if the tweet vector contains the pattern as it is, in the same order,} \\
\alpha, & \text{if all the words of the pattern appear in the tweet in the correct order but with other words in between,} \\
\gamma \cdot n/N, & \text{if } n \text{ words out of the } N \text{ words of the pattern appear in the tweet in the correct order,} \\
0, & \text{if no word of the pattern appears in the tweet.}
\end{cases}
\]

The result of the resemblance function is attributed to the corresponding feature for the tweet T.

Obviously, this adds a few parameters that the user can adjust to maximize the results of sentiment detection: the length of a pattern, the values of α and γ, and the minimum number of occurrences of a pattern.
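One possible reading of the four cases of the resemblance function is sketched below in Python. It is an interpretation rather than the implementation of [47] or SENTA: in particular, n is taken here to be the length of the longest common subsequence between the pattern and the tweet vector, and the values of α and γ in the example are arbitrary.

```python
# Illustrative sketch only; the use of the longest common subsequence
# for the partial-match case is an assumption.

def contains_contiguous(seq, pat):
    """True if pat occurs in seq as a contiguous block."""
    n = len(pat)
    return any(seq[i:i + n] == pat for i in range(len(seq) - n + 1))


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def res(pattern, tweet, alpha, gamma):
    """Resemblance of a tweet vector to a pattern, in [0, 1]."""
    pattern, tweet = list(pattern), list(tweet)
    if not pattern:
        return 0.0
    if contains_contiguous(tweet, pattern):
        return 1.0                      # exact, contiguous match
    n = lcs_length(pattern, tweet)
    if n == len(pattern):
        return alpha                    # all words, in order, with gaps
    return gamma * n / len(pattern)     # partial match; 0.0 when n == 0


TWEET_VECTOR = ["PRONOUN", "LOVE_VERB", "PRONOUN", "ADVERB"]
for pat in (["PRONOUN", "LOVE_VERB"], ["LOVE_VERB", "ADVERB"],
            ["LOVE_VERB", "NOUN"], ["NOUN", "ADJECTIVE"]):
    print(pat, res(pat, TWEET_VECTOR, alpha=0.5, gamma=0.2))
```

The four example patterns exercise the four cases in order: contiguous match, in-order match with gaps, partial match, and no match.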

In the case where patterns can have multiple lengths, they are taken such that their length satisfies the following:

\[
L_{Min} \le \mathrm{Len}(\mathit{pattern}) \le L_{Max} \tag{1}
\]

where LMin and LMax refer to the minimal and the maximal allowed length for a pattern, while Len(pattern) is the length of the pattern. In addition to the aforementioned parameters, one last parameter, which we refer to as knn, is to be optimized. Given all the patterns extracted for the sentiment class si and the length Lj, one feature is extracted. The value of this feature, which we refer to as Fij, is calculated as follows [48]:

\[
F_{ij} = \sum_{k=1}^{\mathit{knn}} \mathrm{res}(p_k, T) \tag{2}
\]

where the different patterns pk here are ones that have the highest resemblance to the tweet T . Fij as defined measures the degree of resemblance of a tweet T to patterns of the sentiment class si and length j.
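Equation (2) amounts to summing the knn largest resemblance values obtained for the patterns of one sentiment class and one length. A minimal Python sketch, assuming the resemblance values have already been computed and using a hypothetical function name:

```python
# Illustrative sketch only; the function name is hypothetical.
import heapq


def pattern_feature(resemblances, knn):
    """F_ij: sum of the knn highest resemblance values res(p_k, T).

    resemblances : res(p, T) for every pattern p of sentiment class s_i
                   and length L_j
    """
    return sum(heapq.nlargest(knn, resemblances))


# Resemblance of one tweet to five patterns of the same class and length.
values = [1.0, 0.5, 0.1, 0.0, 0.5]
print(pattern_feature(values, knn=3))
```

When fewer than knn patterns are available, `heapq.nlargest` simply returns all of them, so the feature is the sum of every available resemblance value.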

The different parameters related to advanced patterns can be optimized via the window shown in Fig. 2.

As stated previously, this set of features can be used for both classification and quantification. However, in the case of quantification, the user can only use patterns of multiple lengths (later on, we explain the reason).

b: ADVANCED UNIGRAM FEATURES

Advanced unigram features are unigrams that the user specifies manually, and that will be checked against a given tweet. If a unigram exists in that tweet, the corresponding feature is attributed the value “True”; otherwise, it is attributed the value “False”.

Fig. 3 shows the window through which the user configures the advanced unigram features. The user needs to save the unigrams to check in a file (one unigram per line), and can then select the file location by pressing “Select”. Optionally, the user can choose whether to compare the lemmas of the words of the tweets to those of the provided list, or the actual words. For example, if the list of words contains the word “love” and the tweet contains a word such as “loving”: if the user chooses to check for words, the corresponding feature for the word “love” will be attributed “False”, whereas if the user chooses to compare lemmas, the feature will be attributed the value “True”.
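The word-versus-lemma option can be illustrated with the following Python sketch. It is only an assumption-laden example: the toy lemma table stands in for a real lemmatizer (SENTA relies on OpenNLP for this), lower-casing is our own choice, and the function names are hypothetical.

```python
# Illustrative sketch only; the lemma table is a stand-in for a lemmatizer.

TOY_LEMMAS = {"loving": "love", "loved": "love", "hated": "hate"}


def toy_lemmatize(word):
    """Hypothetical lemmatizer: table lookup, identity otherwise."""
    return TOY_LEMMAS.get(word, word)


def advanced_unigram_features(tokens, unigrams, lemmatize=None):
    """Return {unigram: True/False} for one tweet.

    If lemmatize is given, lemmas are compared; otherwise the actual
    (lower-cased) words are compared.
    """
    if lemmatize is None:
        norm = lambda w: w.lower()
    else:
        norm = lambda w: lemmatize(w.lower())
    present = {norm(t) for t in tokens}
    return {u: norm(u) in present for u in unigrams}


tokens = ["I", "am", "loving", "it"]
unigrams = ["love", "hate"]   # would be read from the user's file
print(advanced_unigram_features(tokens, unigrams))
print(advanced_unigram_features(tokens, unigrams, toy_lemmatize))
```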

Advanced unigram features are meant to be used in case the basic unigram features or the top words are not enough. They do not include useful information for quantification, though, so they will not be used in the current work.

2) CLASSIFICATION WINDOW

In addition to the new sets of features described above, we have implemented several classifier interfaces using the Weka API. The classifiers added in the current version include, but are not limited to:

• Naive Bayes classifier,

• Random Forest classifier [49],

• C4.5 decision tree (J48) classifier [50].

Once SENTA has finished extracting the different features selected by the user, the user can press the button “Proceed to classification” in the interface shown in Fig. 4. Since we are using the Weka API, proceeding to classification requires the files with the “*.arff” extension (i.e., the Weka file format) to be generated for both the training and the test set. If the user has not chosen to generate these files, they will be generated automatically.

Upon proceeding, the interface shown in Fig. 5 is displayed. The user chooses the classifier to use, sets the different parameters of the classifier and selects the operation to perform (e.g., training set cross validation, experimenting with the test set, etc.). In Fig. 6, we show an example of a parameter optimization window (that of the Random Forest classifier). The default parameters offered by Weka are used as default parameters here.

The classification results are saved every time, and the user can go back to check them by selecting the corresponding iteration from the table and clicking “Display”. However, only the results are saved, not the classification model. Additionally, SENTA stores the classification results of the individual tweets only for the last classification operation (these are the results used later on for quantification).


FIGURE 2. Advanced pattern features – customization window.

3) QUANTIFICATION WINDOW

Once the classification is done (on the test set or the validation set), the user can proceed to the quantification. Basically, if the user has chosen to perform a quantification task, regardless of the number of sentiment classes initially selected and that the tweets might contain, the classification task will classify tweets into one of 3 classes: positive, negative or neutral. The sentiment classes the user has specified will be used in quantification. This assumes that a tweet contains exclusively positive, negative or neutral sentiments (i.e., a tweet cannot have two sentiments of different polarities at once). Although this assumption is not always satisfied (e.g., in our data set, less than 3% of the tweets actually had sentiments of different polarities), it is needed in order for the ternary classification to make sense. Technically, SENTA is implemented in such a way that, in case a tweet contains sentiments of different polarities, the polarity of the first sentiment present in the list of sentiments of that tweet is taken into account.
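The rule used to derive the ternary label of a tweet annotated with several sentiments can be sketched as follows. The sentiment names and their polarities in this Python example are hypothetical placeholders (the 11 classes of the data set are not listed in this section), and returning “neutral” for an empty list is our own assumption.

```python
# Illustrative sketch only; the sentiment-to-polarity table is hypothetical.

POLARITY = {
    "Happiness": "positive",
    "Love": "positive",
    "Sadness": "negative",
    "Anger": "negative",
    "Neutral": "neutral",
}


def ternary_label(sentiments, polarity=POLARITY):
    """Polarity of the first sentiment in the tweet's list of sentiments."""
    if not sentiments:
        return "neutral"   # assumption: unlabeled tweets are neutral
    return polarity[sentiments[0]]


def has_mixed_polarity(sentiments, polarity=POLARITY):
    """True if the tweet carries both positive and negative sentiments."""
    seen = {polarity[s] for s in sentiments}
    return {"positive", "negative"} <= seen


print(ternary_label(["Sadness", "Love"]))
print(has_mixed_polarity(["Sadness", "Love"]))
```

With this rule, the order in which annotators list the sentiments decides the ternary label of the few tweets whose sentiments have different polarities.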

The quantification task will use the results of the classification, and the values of the following sets of features:

• Unigram features,

• Basic pattern features,

• Advanced Unigram features.

To recall, unigram features work as follows: we have several lists of words, each of which we judged to be highly correlated with a certain sentiment. For every tweet, we count how many words from each list appear in it.

For a given tweet, suppose that the corresponding features have the format $[U_1, U_2, \cdots, U_N]$, where $U_i$ is the $i$-th feature, corresponding to the $i$-th sentiment. These values are then normalized by dividing all of them by the maximum value (if they are all equal to 0, they are kept as they are). We refer to the resulting scores as $S^{U}_i(T)$, where $i \in \{1, \cdots, N\}$.

We do the same for the different patterns (basic and advanced patterns work the same way): given that the user has set the parameters $L_{Min}$ and $L_{Max}$ for the minimal and maximal pattern lengths respectively, and the parameters α and γ, the features will have the format shown in TABLE 2, as detailed in [12].
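The unigram counting and max-normalization described above can be sketched as follows. This is only an illustration: the word lists, the whitespace tokenization and the function names are assumptions made for the example, not the lists or code used by SENTA.

```python
from typing import Dict, Set

# Hypothetical word lists (assumption): one list per sentiment.
WORD_LISTS: Dict[str, Set[str]] = {
    "Happiness": {"happy", "glad", "joy"},
    "Sadness": {"sad", "cry", "miss"},
    "Anger": {"angry", "furious"},
}


def unigram_counts(tweet: str, word_lists: Dict[str, Set[str]] = WORD_LISTS) -> Dict[str, int]:
    """Count, for each sentiment, how many tokens of the tweet are in its list."""
    tokens = tweet.lower().split()  # naive whitespace tokenization (assumption)
    return {s: sum(tok in words for tok in tokens) for s, words in word_lists.items()}


def normalize_by_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Divide all values by the maximum; all-zero scores are kept as they are."""
    top = max(scores.values(), default=0)
    if top == 0:
        return dict(scores)
    return {s: v / top for s, v in scores.items()}


counts = unigram_counts("so happy and glad but also sad")
unigram_scores = normalize_by_max(counts)
```

For this toy tweet, the counts are 2, 1 and 0 for Happiness, Sadness and Anger, so the normalized scores are 1.0, 0.5 and 0.0.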

FIGURE 3. Advanced unigram features – customization window.

TABLE 2. Pattern features.

Given that these features are extracted, we need to derive two scores (one using basic pattern features and one using advanced pattern features) for each sentiment for a given tweet T. The scores will have the following format:

$$S^{P}_i(T) = \sum_{j=1}^{M} \left( \beta_j \cdot F_{i,j} \right) = \sum_{j=1}^{M} \left( \beta_j \cdot \sum_{k=1}^{knn} \mathrm{res}(p_k, T) \right) \tag{3}$$

where $S^{P}_i(T)$ is the score, generated using patterns, of the sentiment $i$ for the tweet $T$, $M$ is the number of pattern lengths (to recall, the lengths are $\{L_1, \cdots, L_M\}$) and $\beta_j$ is a weight given to patterns of length $L_j$ (regardless of their class). Currently, we set the values for $\beta_j$ as follows:

$$\beta_j = \frac{L_j - 1}{L_j + 1}, \tag{4}$$

where $L_j$ is the length of the pattern. Again, these scores are normalized by dividing them by the highest score for $T$. The resemblance function $\mathrm{res}(p_k, T)$ is the one that we have defined in Section 4.3.1.
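As a rough illustration of the pattern score and of the weights β_j, the sketch below takes the per-length pattern features F_{i,j} as already computed inputs (the resemblance function is not reproduced here). The function names and the handling of the all-zero case are assumptions made for the example.

```python
from typing import Dict, List


def beta(length: int) -> float:
    """Weight of patterns of a given length: (L - 1) / (L + 1)."""
    return (length - 1) / (length + 1)


def pattern_scores(features: Dict[str, List[float]], lengths: List[int]) -> Dict[str, float]:
    """Weighted sum of the per-length features, normalized by the highest score.

    features[s][j] is the feature F_{i,j} of sentiment s for pattern length lengths[j].
    If every score is 0, the scores are returned unchanged (assumption, mirroring
    the rule stated for the unigram scores).
    """
    raw = {
        s: sum(beta(length) * f for length, f in zip(lengths, row))
        for s, row in features.items()
    }
    top = max(raw.values(), default=0.0)
    if top == 0:
        return raw
    return {s: v / top for s, v in raw.items()}


# Toy feature values for two pattern lengths (hypothetical numbers).
scores = pattern_scores({"Love": [5.0, 2.0], "Fun": [0.0, 1.0]}, lengths=[1, 3])
```

Note that patterns of length 1 get a weight of 0 under this formula, and that the weight approaches 1 as the pattern length grows, so longer patterns count more.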

We refer to the Basic Pattern score and the Advanced Pattern score of the $i$-th sentiment in the tweet $T$ as $S^{BP}_i(T)$ and $S^{AP}_i(T)$ respectively.

Finally, the user gets to choose a coefficient that sets the importance of each of the given scores (i.e., $S^{U}_i(T)$, $S^{BP}_i(T)$ and $S^{AP}_i(T)$) in detecting the sentiments existing in the tweet. In other words, given the following total score:

$$S_i(T) = \tau \cdot S^{U}_i(T) + \mu \cdot S^{BP}_i(T) + \nu \cdot S^{AP}_i(T) \tag{5}$$

the user can adjust the values of τ, µ and ν to set the importance of the 3 sub-scores, under the constraint τ + µ + ν = 1.

FIGURE 4. The main window showing the summary of the project.

The different scores $S_i(T)$ are normalized as well. Sentiments that have a score higher than a certain threshold are the ones judged as detected. The threshold is also a parameter to optimize.
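The combination of the three sub-scores and the thresholding step can be sketched as below. The function names, the toy sub-scores and the threshold value are assumptions made for the example; only the weighted sum, the normalization and the "score higher than the threshold" rule come from the text.

```python
from typing import Dict, List


def total_scores(
    s_u: Dict[str, float],
    s_bp: Dict[str, float],
    s_ap: Dict[str, float],
    tau: float,
    mu: float,
    nu: float,
) -> Dict[str, float]:
    """Weighted sum of the unigram, basic pattern and advanced pattern scores."""
    if abs(tau + mu + nu - 1.0) > 1e-9:
        raise ValueError("tau + mu + nu must be equal to 1")
    raw = {s: tau * s_u[s] + mu * s_bp[s] + nu * s_ap[s] for s in s_u}
    top = max(raw.values(), default=0.0)
    if top == 0:
        return raw  # all-zero scores are left unchanged (assumption)
    return {s: v / top for s, v in raw.items()}


def detect(scores: Dict[str, float], threshold: float) -> List[str]:
    """Sentiments whose score is strictly above the threshold, highest first."""
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    return [s for s, v in ranked if v > threshold]


# Toy sub-scores for three positive sentiments (hypothetical numbers).
s_u = {"Love": 1.0, "Fun": 0.0, "Relief": 0.0}
s_bp = {"Love": 0.5, "Fun": 1.0, "Relief": 0.0}
s_ap = {"Love": 1.0, "Fun": 0.5, "Relief": 0.25}

scores = total_scores(s_u, s_bp, s_ap, tau=0.0, mu=0.2, nu=0.8)
detected = detect(scores, threshold=0.5)
```

With these toy values, "Love" and "Fun" are above the threshold while "Relief" is not.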

In Fig. 7, we show the interface through which the user can set these parameters. The user can also choose to let SENTA automatically optimize these parameters for him. The function to optimize is the F1-Score, which we will introduce and explain later in this work.

D. FUTURE EXTENSION

In the current version of SENTA, we have introduced a few new sets of features. However, 2 of them are still experimental and require some tuning to be usable. They will be used exclusively for classification purposes and will not contribute to the quantification. The next version will include these sets of features.

In addition, we have implemented only a few classifiers. These are the ones that we have found to fit best in the context of multi-class sentiment analysis (mainly Random Forest). However, a user might need to compare several machine learning algorithms, or perform a task different from the one SENTA was designed for (e.g., sarcasm detection or hate speech detection), which will require a different classifier such as Support Vector Machine (SVM) or others. These classifiers will be added as well in the next version of SENTA.

Finally, it might be interesting for a user to save the classification model built using his training set, or to import one that he has already built externally using Weka. Such features need to be added as well.

V. SENTIMENT QUANTIFICATION - PROPOSED APPROACH

A. PROBLEM STATEMENT

Although the multi-class classification of tweets has its advantages and makes sense in the context of detecting the actual sentiment of a given tweet, it has its limitations, as we explained in Section II. Among these limitations, we highlighted the particular issue of not being able to identify all the existing sentiments within a tweet if it contains more than one. In other words, even if a tweet presents more than one sentiment, the classification task will attribute only a single sentiment label to it.

This makes it more reasonable to try to detect all the existing sentiments. As a matter of fact, in the training set we are using in this work, for example, over 59% of the tweets contain more than 2 sentiments (the details of the structure of the data sets used will be given in the next subsection). That being the case, the task we tackle here is as follows: given a tweet, we first try to detect its sentiment polarity (i.e., whether it is positive, negative or neutral). We then try to identify all the existing sentiments by attributing a score to each sentiment. The sentiments are then ranked according to the attributed scores, and the ones that have the highest scores are judged as conveyed in the tweet. In other words, a tweet will be classified into one of the 3 classes described, and then into a finer granularity level, while being allowed to have multiple classes.

FIGURE 5. Classifiers main window.

B. DATA

For the sake of this work, we have prepared a data set made of tweets collected using the Twitter API. These tweets were manually annotated by several annotators using the services of CrowdFlower.³ We asked the annotators to attribute 1 or more sentiments (out of 11) to each tweet, and encouraged them to choose more than one. However, we have not made this requirement mandatory.

Two annotators annotated each tweet, and the outputs of their judgement were merged. Tweets with inconsistent judgement were discarded from our data set. By the expression ‘‘tweets with inconsistent judgement’’, we mean ones for which the annotators did not agree on a single sentiment shown in them. We have also discarded tweets with sentiments of opposite polarities (i.e., tweets which have at least one positive sentiment and at least one negative sentiment).

³ https://www.crowdflower.com/

As stated above, when running the task, we have asked the annotators to attribute one or more sentiment(s) for each tweet, from the following sentiment classes:

- Positive sentiments: Enthusiasm, Fun, Happiness, Love and Relief,

- Negative sentiments: Anger, Boredom, Hate, Sadness and Worry,

- Neutral sentiment: Neutral.

FIGURE 6. Classifier parameters optimization window.

This data set has then been divided into 5 data sets, as follows:

• A pattern extraction set: as we described in [12] and in Section IV, we need to collect what we qualified as patterns, which we use later to attribute pattern scores (which we refer to as ‘‘Pattern Features’’ and ‘‘Advanced Pattern Features’’, and which we use to perform both the classification and the quantification). In [12], we extracted these patterns from the training set itself. However, we believe that this would make the classification favor these features over the others, because they fit the training set very well. Therefore, in the current work, we use an independent data set (thus the name ‘‘Pattern Extraction Set’’) to avoid this problem. This set is used only for the extraction of the patterns of each sentiment class, and is discarded afterwards.

• A training set: This set is used to train our model for classification.

• A test set: This set is used to run our experiments. The classification and quantification results obtained in this work are ones that were run on this set.

• A validation set: Throughout our experiments, we have optimized several parameters that we defined for SENTA. To make sure that these parameters are optimal (or at least sub-optimal), we validate them using a separate data set. This set will be referred to, in the rest of this work, as the ‘‘Validation Set’’.

As stated above, it is important to note that several tweets were judged by the annotators as containing sentiments of opposite polarities (i.e., containing at least one positive sentiment and one negative sentiment). These tweets were discarded as well, since they do not fit the problem we stated in the previous subsection.
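The polarity grouping of the 11 sentiment classes and the removal of tweets with opposite polarities can be sketched as follows. The data layout (a list of `(text, labels)` pairs), the function names and the sample tweets are assumptions made for the example; the class lists and the "first sentiment decides the polarity" rule are the ones stated in this paper.

```python
from typing import List, Tuple

POSITIVE = {"Enthusiasm", "Fun", "Happiness", "Love", "Relief"}
NEGATIVE = {"Anger", "Boredom", "Hate", "Sadness", "Worry"}


def polarity(sentiment: str) -> str:
    """Map one of the 11 sentiment classes to its polarity."""
    if sentiment in POSITIVE:
        return "positive"
    if sentiment in NEGATIVE:
        return "negative"
    if sentiment == "Neutral":
        return "neutral"
    raise ValueError(f"unknown sentiment: {sentiment}")


def tweet_polarity(sentiments: List[str]) -> str:
    """Polarity of the first sentiment in the list, as SENTA is described to do."""
    return polarity(sentiments[0])


def has_opposite_polarities(sentiments: List[str]) -> bool:
    """True if the labels contain at least one positive and one negative sentiment."""
    polarities = {polarity(s) for s in sentiments}
    return "positive" in polarities and "negative" in polarities


def keep_single_polarity(dataset: List[Tuple[str, List[str]]]) -> List[Tuple[str, List[str]]]:
    """Discard the tweets annotated with sentiments of opposite polarities."""
    return [(text, labels) for text, labels in dataset if not has_opposite_polarities(labels)]


# Hypothetical annotated tweets.
dataset = [
    ("tweet a", ["Love", "Happiness"]),
    ("tweet b", ["Love", "Worry"]),
    ("tweet c", ["Neutral"]),
]
kept = keep_single_polarity(dataset)
```

Here the second tweet, labeled with both "Love" and "Worry", is the only one removed.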

The structure of the data sets is given in TABLES 3 and 4. In the first table, we describe the number of tweets having each sentiment in each of the data sets; in the second, we describe the number of sentiments per tweet in each of the data sets.

TABLE 3. Number of tweets having each sentiment in the different data sets.

TABLE 4. Distribution of sentiments in the different data sets.

Fig. 8 shows a diagram of the proposed approach. Initially, from the data set we have qualified as ‘‘Pattern Set’’, basic and advanced patterns are extracted following the rules we have described previously. These two sets of features are then used, along with the other sets of features described in [12], to train a classification model on the training set. The model is optimized for the test set. After classification, the quantification process is run on the test set. The optimal parameter values for classification and quantification are then checked on a totally independent set, which we refer to as the validation set, to verify whether they overfit the test set or present good (probably sub-optimal) performances on other sets.

C. FEATURES EXTRACTION

From the tweets, we extract different sets of features that we use to perform the classification and, later on, the quantification. SENTA offers the option to extract the features we need for this work.

1) BASIC FEATURES

Here, we refer to our previous work [12] and extract the same features, with the same parameters. To recall, the features extracted are the following:

1) Sentiment Features: these are features that help detect the sentiment polarity of the different components of the tweet (e.g., words, emoticons, hashtags, etc.).

2) Punctuation Features: these are features related to the use of punctuation in the tweet.

3) Syntactic and Stylistic Features: these are features related to the use of words and expressions in a tweet.

4) Semantic Features: these are features related to the meaning of words, the relations between them and the logic behind them.

5) Unigram Features: these are features extracted with reference to word lists, where each list presents the words that are highly correlated with a given sentiment.

6) Basic Pattern Features: these are features that try to identify the common patterns or expressions used in different contexts to show certain emotions. They are extracted with reference to the data set with manually labeled data.

FIGURE 7. Quantifier main window.

2) ADVANCED FEATURES

In the current work, we restrict ourselves to one set of advanced features, which we qualified as ‘‘Advanced Pattern Features’’. Advanced pattern features resemble basic pattern features, but are more specific to the different sentiments, as we explained previously in Section IV.

We use all the features together to perform the classification and the quantification. However, unlike [12], we have not referred to the training set to extract the patterns that we use for the classification, but rather to a separate data set that we qualified as ‘‘pattern extraction set’’.

VI. EXPERIMENTAL RESULTS

In this section, we present the results obtained for ternary classification and quantification, on both the test set and the validation set. As we explained earlier, the classification parameters and model, as well as the quantification parameters, are optimized for the test set. The validation set is used to check the validity of these parameters and model on a new data set that has not been involved in the optimization.

FIGURE 8. Flowchart of the proposed approach.

A. KEY PERFORMANCE INDICATORS

After the extraction of features, we run different tests using the ‘‘Random Forest’’ [49] classifier. We use 4 Key Performance Indicators (KPIs) to evaluate the classification and quantification results: True Positive Rate (which is the same as Recall), False Positive Rate, Precision and F1-score:

• True Positive Rate (TPR or Recall) measures the rate of tweets correctly classified as part of a given class over the total number of tweets of that class:

$$\mathrm{TPR} = \mathrm{Rec} = \frac{TP}{TP + FN} \tag{6}$$

• False Positive Rate (FPR) measures the rate of tweets falsely classified as part of a given class over the total number of tweets that are not part of that class:

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{7}$$

• Precision (Prec) measures the rate of tweets correctly classified as being part of a class, over the total number of tweets classified as belonging to that class:

$$\mathrm{Prec} = \frac{TP}{TP + FP} \tag{8}$$

• F1 score is a combination of both precision and recall, defined as follows:

$$\mathrm{F1\ score} = 2 \cdot \frac{\mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}} = \frac{2\,TP}{2\,TP + FP + FN}. \tag{9}$$

In the context of classification, the terms TP, FP, TN and FN are measured for all the tweets at once and are defined, for a given class C, as follows:

• TP (True Positive) refers to the fraction of tweets belonging to C and identified as belonging to C,

• FP (False Positive) refers to the fraction of tweets not belonging to C and identified as belonging to C,

• TN (True Negative) refers to the fraction of tweets not belonging to C and identified as not belonging to C,

• FN (False Negative) refers to the fraction of tweets belonging to C and identified as not belonging to C.
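For a given class C, the four KPIs can be computed from the counts of TP, FP, TN and FN, as in the minimal sketch below. The function names and the toy labels are assumptions made for the example, and returning 0 when a denominator is empty is a convention chosen here, not one stated in the text.

```python
from typing import Dict, List, Tuple


def confusion_counts(y_true: List[str], y_pred: List[str], cls: str) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for the class cls."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == cls and truth == cls:
            tp += 1
        elif pred == cls:
            fp += 1
        elif truth == cls:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(num: float, den: float) -> float:
    """Safe division: 0.0 when the denominator is 0 (convention assumed here)."""
    return num / den if den else 0.0


def kpis(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """TPR (recall), FPR, precision and F1 score, following Eqs. (6) to (9)."""
    return {
        "TPR": _ratio(tp, tp + fn),
        "FPR": _ratio(fp, fp + tn),
        "Prec": _ratio(tp, tp + fp),
        "F1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy ternary labels (hypothetical).
y_true = ["pos", "pos", "neg", "neu", "neg"]
y_pred = ["pos", "neg", "neg", "pos", "neg"]
pos_counts = confusion_counts(y_true, y_pred, "pos")
pos_kpis = kpis(*pos_counts)
```

Because the KPIs are ratios, using counts or fractions of tweets for TP, FP, TN and FN gives the same values.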

In the context of quantification, these terms are measured differently: they are computed at the level of a single tweet, as in the quantification results of the single tweet shown in TABLE 5, where:

• TP (True Positive) refers to the sentiments that are identified correctly by our code as being shown in the tweet,

• FP (False Positive) refers to the sentiments that were judged as being shown in the tweet when, according to the annotators, they are not,

• FN (False Negative) refers to the sentiments that are present in the tweet, according to the annotators, but that our code could not identify,

• TN (True Negative) refers to the sentiments that are not present in the tweet, and were not judged as present in the tweet.

TABLE 5. Sentiments confusion matrix for a given tweet.

TABLE 6. Ternary classification performances on the test set.

In this sense, the overall KPIs measured for the entire test set (and validation set) are the average of the values of these KPIs measured at tweet level.
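The tweet-level measurement and its averaging over a data set can be sketched as follows. The function names and the toy annotations are assumptions made for the example, and the value 0 used when a denominator is empty is a convention chosen here.

```python
from typing import Dict, List, Set, Tuple


def tweet_kpis(annotated: Set[str], detected: Set[str]) -> Dict[str, float]:
    """Precision, recall and F1 score for the sentiments of a single tweet."""
    tp = len(annotated & detected)   # sentiments correctly detected
    fp = len(detected - annotated)   # detected but not annotated
    fn = len(annotated - detected)   # annotated but missed
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"Prec": prec, "Rec": rec, "F1": f1}


def average_kpis(pairs: List[Tuple[Set[str], Set[str]]]) -> Dict[str, float]:
    """Average of the tweet-level KPIs over a (non-empty) list of tweets."""
    per_tweet = [tweet_kpis(annotated, detected) for annotated, detected in pairs]
    return {k: sum(d[k] for d in per_tweet) / len(per_tweet) for k in ("Prec", "Rec", "F1")}


# Two hypothetical tweets: (annotated sentiments, detected sentiments).
pairs = [
    ({"Love", "Fun"}, {"Love"}),
    ({"Anger"}, {"Anger", "Hate"}),
]
overall = average_kpis(pairs)
```

In this toy example, each tweet has an F1 score of 2/3, and the averaged precision and recall are both 0.75.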

B. TERNARY CLASSIFICATION RESULTS

1) TERNARY CLASSIFICATION ON THE TEST SET

We first run the classification on the test set. The classification results returned by the Random Forest classifier are the best, compared with other classifiers. This goes along with our previous observations in [12], [13], and [48]. The results of classification are given in TABLE 6.

The results show that, in the current data set, the positive tweets are easier to detect than the negative or the neutral ones. The classification TPR of positive tweets reaches 90.2%, whereas that of negative tweets is 68.3% and that of neutral ones is only 31.9%. As we have explained in [12], in such data sets, tweets tend to be polarized (classified either as positive or negative, but rarely neutral) for several reasons, including the nature of the features themselves, which are engineered to detect the presence of sentimental components, as well as the unbalanced amount of training data in favor of the non-neutral tweets.

TABLE 7. Ternary classification performances on the validation set.

The overall accuracy is equal to 77.4%, with a precision level equal to 76.6%, a recall equal to 77.4% and an F1-score equal to 76.2%. These results are promising, even though they are lower than those obtained in [12].

2) TERNARY CLASSIFICATION ON THE VALIDATION SET

Given the same classifier parameters we have used in the previous classification task, we run the classification on the validation set. The results of classification are given in TABLE 7.

As we can observe, the classification results do not differ much from those on the test set. While we notice a slight decrease in the overall accuracy, by about 1.1%, the results are very close. The overall accuracy on the validation set is equal to 76.3%, with a precision equal to 75.6%, a recall equal to 76.3% and an F1-score equal to 74.9%.

Moreover, the classification performances per class are also very similar: the TPR (i.e., the recall) of the positive tweets is the highest, at 89.7%. Neutral tweets are again the hardest to identify, with a TPR equal to 28.5%. Their precision, however, is comparatively high (61.7%): once identified as neutral, a tweet is most likely to be neutral. This shows again that the main reason for the misclassification of these tweets is the tendency to polarize tweets.

The important conclusion we can draw is that the classification performances do not depend on the test set, and that we can proceed to the quantification part with no overfitting issue for the classification part.

C. QUANTIFICATION RESULTS

Consider a tweet that was annotated by the human annotators with m sentiments, and that is attributed n sentiments by our method.

While the different KPIs are being measured, we only focus on optimizing the F1 score, given that it is the most significant KPI. In other words, for a high precision, a high threshold can be used, which will result in a low recall, given that the process of minimizing the False Positives tends to favor the detection of a single sentiment. The same goes the other way around: for a high recall, a very low threshold can be used, which will result in a low precision, given that the process of minimizing the False Negatives tends to favor the detection of almost all sentiments, so that no True Positive escapes.

TABLE 8. Quantification results on the test set.

Running the quantification on the test set gave us the results shown in TABLE 8. The results shown are the top ones for different values of the tuple [τ, µ, ν]. For convenience and ease of display, we discarded the combinations that gave lower values.

The values obtained reach a maximal F1 score equal to 45.9% when [τ, µ, ν] = [0, 0.2, 0.8]. More interestingly, all these top values are obtained for a value of τ equal to 0, or very small. This means that unigram scores do not contribute much to the detection of sentiments. In fact, this feature returns a score equal to 0 for many tweets, meaning that they do not contain any of the words we associated with the sentiments.

In TABLE 9, we show the results of quantification using the same tuples [τ, µ, ν] (in the same order). The tuple that gives the best result on the test set gives a sub-optimal, yet very good, result on the validation set. The best F1-score (47.7%) is obtained when [τ, µ, ν] = [0, 0.3, 0.7]. However, the tuple [0, 0.2, 0.8] also presents very good results, reaching 44.6%.

TABLE 9. Quantification results on the validation set.

FIGURE 9. F1-Score for different values of µ and ν on the test set and the validation set.

As stated previously, if we opt for the optimization of the recall, we observe that the optimization sets the sentiment threshold to 0, so that every correctly classified tweet is attributed all the sentiments of its polarity. In other words, given a positive tweet for example, optimizing the recall results in attributing all the positive sentiments to the tweet, to make sure the correct sentiments are detected. In a similar way, the optimization of the precision results in a very strict selection, leading to the attribution of a single sentiment per tweet. Therefore, we opted for the optimization of the F1-score. The corresponding values of recall and precision are not the optimal ones, but are more meaningful.
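One possible way to search for the tuple [τ, µ, ν] and the threshold that maximize the average F1-score is a plain grid search, sketched below. The grid step of 0.1, the candidate thresholds, the data layout and all the function names are assumptions made for the example; the text does not specify how SENTA performs its automatic optimization, and the restriction to the sentiments of the predicted polarity is left out for brevity.

```python
from itertools import product
from typing import Dict, Iterator, List, Optional, Set, Tuple

Weights = Tuple[float, float, float]
SubScores = Dict[str, Tuple[float, float, float]]  # sentiment -> (S_U, S_BP, S_AP)


def weight_grid(steps: int = 10) -> Iterator[Weights]:
    """All tuples (tau, mu, nu) on a grid of step 1/steps that sum to 1."""
    for a, b in product(range(steps + 1), repeat=2):
        if a + b <= steps:
            yield a / steps, b / steps, (steps - a - b) / steps


def detect(sub: SubScores, weights: Weights, threshold: float) -> Set[str]:
    """Sentiments whose normalized total score is strictly above the threshold."""
    tau, mu, nu = weights
    raw = {s: tau * u + mu * bp + nu * ap for s, (u, bp, ap) in sub.items()}
    top = max(raw.values(), default=0.0)
    if top > 0:
        raw = {s: v / top for s, v in raw.items()}
    return {s for s, v in raw.items() if v > threshold}


def f1(annotated: Set[str], detected: Set[str]) -> float:
    """Tweet-level F1; note that 2TP + FP + FN = |annotated| + |detected|."""
    denom = len(annotated) + len(detected)
    return 2 * len(annotated & detected) / denom if denom else 0.0


def grid_search(
    tweets: List[Tuple[SubScores, Set[str]]], thresholds: List[float]
) -> Tuple[float, Optional[Weights], Optional[float]]:
    """Return (best average F1, weights, threshold); tweets must be non-empty."""
    best: Tuple[float, Optional[Weights], Optional[float]] = (-1.0, None, None)
    for weights in weight_grid():
        for threshold in thresholds:
            scores = [f1(truth, detect(sub, weights, threshold)) for sub, truth in tweets]
            mean_f1 = sum(scores) / len(scores)
            if mean_f1 > best[0]:
                best = (mean_f1, weights, threshold)
    return best


# One hypothetical tweet whose unigram score points to the wrong sentiment.
tweets = [({"Love": (0.0, 0.5, 1.0), "Fun": (1.0, 1.0, 0.2)}, {"Love"})]
best_f1, best_weights, best_threshold = grid_search(tweets, [0.1, 0.5, 0.9])
```

On this toy input, the search settles on a tuple with τ = 0, since the unigram sub-score favors the sentiment that was not annotated.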

Since the contribution of the unigram score is minimal, we collected the different combinations that have τ set to 0. The F1-scores of these combinations on the test set and the validation set are given in Fig. 9.

The figure shows a very similar behavior on both the test set and the validation set. It also highlights the fact that the advanced patterns, which are part of the contribution of this paper, are more valuable in terms of detection of sentiments and quantification in general. As a matter of fact, even if we discard the basic pattern scores (i.e., set µ to 0), the results obtained are very close to the ones obtained for the optimal values (i.e., [µ, ν] = [0.2, 0.8]).

D. COMPARISON WITH A BASELINE APPROACH

To the best of our knowledge, the task we have defined in this work is new, and no previous work we encountered dealt with it. Therefore, to evaluate our approach, we define a baseline and compare the performances of our approach to its performances.

The baseline approach is defined as follows: given a tweet T, we run a binary classification for each sentiment to decide whether or not that sentiment is present in the tweet. We use all the sets of features, except advanced pattern features (which are part of the contribution of this work).

TABLE 10. Comparison between the proposed approach and the baseline one.

This baseline has given very poor results, so it has been adjusted so that it makes use of the output of the ternary classification. Instead of running the classification on all the sentiments, we use the output of the ternary classification to restrict the number of sentiments to be verified. For example, if a tweet is judged as positive, the binary classification of only the five positive sentiments is run.
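The adjusted baseline can be sketched as follows. The binary classifiers are represented by hypothetical callables that take a feature vector and return a boolean; how they are trained is outside the scope of this sketch. Using "Neutral" as the only candidate for a neutral tweet is also an assumption made for the example.

```python
from typing import Callable, Dict, List, Sequence

# Candidate sentiments for each polarity returned by the ternary classification.
CANDIDATES: Dict[str, List[str]] = {
    "positive": ["Enthusiasm", "Fun", "Happiness", "Love", "Relief"],
    "negative": ["Anger", "Boredom", "Hate", "Sadness", "Worry"],
    "neutral": ["Neutral"],  # assumption: a neutral tweet only has this candidate
}

BinaryClassifier = Callable[[Sequence[float]], bool]


def baseline_quantify(
    features: Sequence[float],
    predicted_polarity: str,
    binary_classifiers: Dict[str, BinaryClassifier],
) -> List[str]:
    """Run the binary classifiers of the predicted polarity only."""
    return [
        sentiment
        for sentiment in CANDIDATES[predicted_polarity]
        if binary_classifiers[sentiment](features)
    ]


# Hypothetical stand-ins for trained binary classifiers.
classifiers: Dict[str, BinaryClassifier] = {
    s: (lambda x: False) for group in CANDIDATES.values() for s in group
}
classifiers["Love"] = lambda x: x[0] > 0.5
classifiers["Fun"] = lambda x: x[1] > 0.5
classifiers["Anger"] = lambda x: True  # never consulted for a positive tweet

detected = baseline_quantify([0.9, 0.7], "positive", classifiers)
```

In this example, the "Anger" classifier would fire, but it is never run because the tweet was judged as positive.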

A comparison between the performances of the proposed approach and the baseline one on the test set is given in TABLE 10.

E. DISCUSSION

In this work, we have introduced a task different from the conventional sentiment analysis one, and even from the multi-class classification task introduced in [12]. Throughout this work, we have tried to identify all the existing sentiments within tweets, by attributing different scores to each sentiment in a tweet and selecting the ones with the highest scores. We referred to this task as quantification.

The quantification results we observed were promising. However, we believe that these results, which are not exceptionally good, can be enhanced in more than one way. Several factors have led to poor classification and/or quantification results for many tweets.

To begin with, the quantification task is a challenging one that is highly subject to the annotators’ opinion. This is actually a property of sentiment analysis in general, even for simple tasks such as binary classification, where texts are to be classified into positive or negative. However, the finer the granularity level of classification is, the harder the task gets, and the more discrepancy between annotators there is. As a matter of fact, we have studied the data set we used in [12], and we found the ratio of agreement between annotators on a sample of 300 tweets to be 67.3% for the 7-class classification, an agreement that jumps to 82.7% for ternary classification. Therefore, we expect even more disagreement (i.e., a lower agreement level) on a data set in which each tweet is attributed one or more of 11 sentiment classes.

Nevertheless, it is important to mention that the values of the two parameters α and γ set for the basic and advanced patterns were optimized for the classification. This means that they might not be the optimal values for the quantification. In fact, setting these two values to 0.1 and 0.02 respectively greatly decreases the value of sparse and incomplete resemblance of patterns to the tweet. This leads us to believe that different values for these parameters might give different results for the quantification. This dilemma is settled in favor of the classification, given that a misclassified tweet has an F1-score equal to 0 anyway.

On a related note, we have noticed that the classification accuracy of the neutral tweets on both the test and validation sets was very low, much lower than that observed in [12]. Again, this is due, among other reasons, to the small amount of training data for these tweets. A neutral tweet that is misclassified has an F1-score equal to 0, which decreases the overall F1-score.

Over and above that, we believe that more training data instances and, more importantly, a training set that is balanced among all the sentiment classes could noticeably improve the results. As we can see in TABLE 3, which we described in Section V-B, the tweets are very unbalanced among the different sentiment classes. This is because it is hard to collect a balanced set a priori, especially given that it is totally up to the annotators to decide which sentiments exist in a tweet and, more importantly, how many. In fact, we started with a data set extracted from a bigger one that was automatically annotated into positive and negative (using a previously trained model). The data set we uploaded for manual annotation was balanced between these two sentiments.

Another criticism that we address is the fact that we assumed that a tweet could contain exclusively positive or negative sentiments (neutral tweets are by definition ones that show no sentiment), which is a strong assumption that is not always true. In fact, as we explained in Section V-B, several tweets were annotated as having sentiments of opposite polarities, and we have discarded them for the sake of this work. As observed on our initial data set (before discarding any tweet), some sentiments tend to co-occur more than others. Namely, the sentiments ‘‘Love’’ and ‘‘Worry’’ co-occurred in many tweets where the tweeter is worried about something precious to him, or someone he cares about. In a similar way, in some tweets, the tweeters have shown both ‘‘Boredom’’ and ‘‘Relief’’, to express how bored they were of some event and how relieved they are that it was over. We consider only tweets with sentiments of a single polarity because we rely on the results of classification to choose the set of sentiments from which we guess the actual sentiments of the tweet. This limits the potential of the proposed approach, and needs to be addressed in future work.

VII. CONCLUSION

In this paper, we have introduced the task of sentiment quantification in Twitter: for a given tweet, we tried to identify in a first step its sentiment polarity (whether it is positive, negative or neutral), and in a second step we tried to identify all the sentiments conveyed within it. We added several components to our previously introduced tool SENTA to make the quantification task feasible and automated. Our proposed approach has proven to be good at detecting sentiments hidden in tweets, with an average F1-score equal to 45.9% for 11 different sentiment classes.

We have also discussed the different potential reasons for misclassification, and presented some solutions to enhance the performances of the proposed approach, which we will be dealing with as part of our future work. In our future work, we will also address the case of tweets with sentiments belonging to different polarities (i.e., tweets which have at the same time positive sentiments and negative ones), and try to find possible ways to identify these sentiments.

REFERENCES

[1] M. Cha, H. Haddadi, F. Benevenuto, and P. K. Gummadi, ‘‘Measuring user influence in Twitter: The million follower fallacy,’’ in Proc. ICWSM, vol. 10, nos. 10–17, 2010, p. 30.

[2] M. Trusov, A. Bodapati, and R. E. Bucklin, ‘‘Determining influential users in Internet social networks,’’ J. Marketing Res., vol. 47, no. 4, pp. 643–658, 2010.

[3] J. Messias, L. Schmidt, R. A. R. de Oliveira, and F. R. Benevenuto, ‘‘You followed my Bot! Transforming robots into influential users in Twitter,’’ First Monday, vol. 18, no. 7, pp. 1–14, 2013.

[4] R. Page, ‘‘The linguistics of self-branding and micro-celebrity in Twitter: The role of hashtags,’’ Discourse Commun., vol. 6, no. 2, pp. 181–201, 2012.

[5] A. Kumar and T. M. Sebastian, ‘‘Sentiment analysis on Twitter,’’ Int. J. Comput. Sci. Issue, vol. 9, no. 4, pp. 372–378, 2012.

[6] E. Martínez-Cámara, M. T. Martín-Valdivia, L. A. Ureña-López, and A. R. Montejo-Ráez, ‘‘Sentiment analysis in Twitter,’’ Natural Lang. Eng., vol. 20, no. 1, pp. 1–28, 2014.

[7] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau, ‘‘Sentiment analysis of Twitter data,’’ in Proc. Workshop Lang. Social Media, 2011, pp. 30–38.

[8] H. Saif, Y. He, and H. Alani, ‘‘Semantic sentiment analysis of Twitter,’’ in Proc. Int. Semantic Web Conf., 2012, pp. 508–524.

[9] C. Troussas, M. Virvou, K. J. Espinosa, K. Llaguno, and J. Caro, ‘‘Sentiment analysis of Facebook statuses using naive Bayes classifier for language learning,’’ in Proc. Int. Conf. Inf., Intell., Syst. Appl. (IISA), Jul. 2013, pp. 1–6.

[10] H. M. Zin, N. Mustapha, M. A. A. Murad, and N. M. Sharef, ‘‘Term weighting scheme effect in sentiment analysis of online movie reviews,’’ Adv. Sci. Lett., vol. 24, no. 2, pp. 933–937, 2018.

[11] M. Bouazizi and T. Ohtsuki, ‘‘Sentiment analysis in Twitter for multiple topics: How to detect the polarity of tweets regardless of their topic,’’ in Proc. IEICE ASN, Feb. 2015, pp. 91–96.

[12] M. Bouazizi and T. Ohtsuki, ‘‘A pattern-based approach for multi-class sentiment analysis in Twitter,’’ IEEE Access, vol. 5, pp. 20617–20639, 2017.

[13] M. Bouazizi and T. Ohtsuki, ‘‘Sentiment analysis: From binary to multi-class classification: A pattern-based approach for multi-class sentiment analysis in Twitter,’’ in Proc. IEEE ICC, May 2016, pp. 1–6.

[14] K. Ghag and K. Shah, ‘‘Comparative analysis of the techniques for sentiment analysis,’’ in Proc. Int. Conf. Adv. Technol. Eng., Jan. 2013, pp. 1–7.

[15] M. V. Mäntylä, D. Graziotin, and M. Kuutila, ‘‘The evolution of sentiment analysis—A review of research topics, venues, and top cited papers,’’ Comput. Sci. Rev., vol. 27, pp. 16–32, Feb. 2018.

[16] A. Java, X. Song, T. Finin, and B. Tseng, ‘‘Why we Twitter: Understanding microblogging usage and communities,’’ in Proc. 9th WebKDD 1st SNA-KDD Workshop Web Mining Social Netw. Anal., Aug. 2007, pp. 56–65.

[17] Z. Kuncheva and G. Montana, ‘‘Community detection in multiplex networks using Locally Adaptive Random walks,’’ in Proc. IEEE/ACM ASONAM, Aug. 2015, pp. 1308–1315.

[18] I. Bizid, N. Nayef, P. Boursier, S. Faiz, and J. Morcos, ‘‘Prominent users detection during specific events by learning on- and off-topic features of user activities,’’ in Proc. IEEE/ACM ASONAM, Aug. 2015, pp. 500–503.

[19] N. R. Griffin, C. R. Fleck, M. G. Uitvlugt, S. M. Ravizza, and K. M. Fenn, ‘‘The Tweeter matters: Factors that affect false memory from Twitter,’’ Comput. Hum. Behav., vol. 77, pp. 63–68, Dec. 2017.

[20] H. Allcott and M. Gentzkow, ‘‘Social media and fake news in the 2016 election,’’ J. Econ. Perspect., vol. 31, no. 2, pp. 211–236, 2017.

[21] B. O’Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith, ‘‘From tweets to polls: Linking text sentiment to public opinion time series,’’ in Proc. Int. AAAI Conf. Weblogs Social Media, May 2010, pp. 26–33.

[22] P. Achananuparp, E.-P. Lim, J. Jiang, and T.-A. Hoang, ‘‘Who is retweeting the tweeters? Modeling, originating, and promoting behaviors in the Twitter network,’’ Trans. Manage. Inf. Syst., vol. 3, no. 3, Oct. 2012, Art. no. 13.

[23] E. Kassens-Noor, ‘‘Twitter as a teaching practice to enhance active and informal learning in higher education: The case of sustainable tweets,’’ Act. Learn. Higher Educ., vol. 13, no. 1, pp. 9–21, Feb. 2012.

[24] A. Dhir, K. Buragga, and A. A. Boreqqah, ‘‘Tweeters on campus: Twitter a learning tool in classroom?’’ J. Universal Comput. Sci., vol. 19, no. 5, pp. 672–691, 2013.

[25] U. R. Hodeghatta, ‘‘Sentiment analysis of Hollywood movies on Twitter,’’ in Proc. IEEE/ACM ASONAM, Aug. 2013, pp. 1401–1404.

[26] M. A. Cabanlit and K. J. Espinosa, ‘‘Optimizing N-Gram based text feature selection in sentiment analysis for commercial products in Twitter through polarity lexicons,’’ in Proc. 5th Int. Conf. Inf., Intell., Syst. Appl., Jul. 2014, pp. 94–97.

[27] K. Manuel, K. V. Indukuri, and P. R. Krishna, ‘‘Analyzing Internet slang for sentiment mining,’’ in Proc. 2nd Vaagdevi Int. Conf. Inform. Technol. Real World Problems, Dec. 2010, pp. 9–11.

[28] M. Boia, B. Faltings, C.-C. Musat, and P. Pu, ‘‘A :) Is worth a thousand words: How people attach sentiment to emoticons and words in tweets,’’ in Proc. Int. Conf. Soc. Comput., Sep. 2013, pp. 345–350.

[29] N. Zainuddin, A. Selamat, and R. Ibrahim, ‘‘Hybrid sentiment classification on Twitter aspect-based sentiment analysis,’’ Appl. Intell., vol. 48, no. 5, pp. 1218–1232, May 2018.

[30] A. Bhoi and S. Joshi. (May 2018). ‘‘Various approaches to aspect-based sentiment analysis.’’ [Online]. Available: https://arxiv.org/abs/1805.01984

[31] R. Srivastava and M. P. S. Bhatia, ‘‘Quantifying modified opinion strength: A fuzzy inference system for Sentiment Analysis,’’ in Proc. Int. Conf. Adv. Comput., Commun. Inform., Aug. 2013, pp. 1512–1519.

[32] Y. H. P. P. Priyadarshana, K. I. H. Gunathunga, K. K. A. N. N. Perera, L. Ranathunga, P. M. Karunaratne, and T. M. Thanthriwatta, ‘‘Sentiment analysis: Measuring sentiment strength of call centre conversations,’’ in Proc. IEEE ICECCT, Mar. 2015, pp. 1–9.

[33] A. Yu and D. Chang, ‘‘Multiclass sentiment prediction using yelp business reviews,’’ CS224n, Natural Lang. Process. Deep Learn., Stanford Univ., Stanford, CA, USA, Tech. Rep. 62, 2015.

[34] O. Araque, I. Corcuera-Platas, J. F. Sánchez-Rada, and C. A. Iglesias, ‘‘Enhancing deep learning sentiment analysis with ensemble techniques in social applications,’’ Expert Syst. Appl., vol. 77, pp. 236–246, Jul. 2017.

[35] K. H.-Y. Lin, C. Yang, and H.-H. Chen, ‘‘What emotions do news articles trigger in their readers?’’ in Proc. ACM SIGIR, Jul. 2007, pp. 733–734.

[36] K. H.-Y. Lin, C. Yang, and H.-H. Chen, ‘‘Emotion classification of online news articles from the reader’s perspective,’’ in Proc. IEEE/WIC/ACM WI-IAT, vol. 1, Dec. 2008, pp. 220–226.

[37] L. Ye, R.-F. Xu, and J. Xu, ‘‘Emotion prediction of news articles from reader’s perspective based on multi-label classification,’’ in Proc. Int. Conf. Mach. Learn. Cybern., vol. 5, Jul. 2012, pp. 2019–2024.

[38] W. B. Liang, H. C. Wang, Y. A. Chu, and C. H. Wu, ‘‘Emoticon recommendation in microblog using affective trajectory model,’’ in Proc. Annu. Summit Conf. Asia–Pacific Signal Inf. Process. Assoc. (APSIPA), Dec. 2014, pp. 1–5.

[39] B. Krawczyk, B. T. McInnes, and A. Cano, ‘‘Sentiment classification from multi-class imbalanced Twitter data using binarization,’’ in Proc. Int. Conf. Hybrid Artif. Intell. Syst., Jun. 2017, pp. 26–37.

[40] J. Barranquero, J. Díez, and J. J. del Coz, ‘‘Quantification-oriented learning based on reliable classifiers,’’ Pattern Recognit., vol. 48, no. 2, pp. 591–604, 2015.

[41] A. Bella, C. Ferri, J. Hernandez-Orallo, and M. J. Ramirez-Quintana, ‘‘Quantification via probability estimators,’’ in Proc. 11th IEEE Int. Conf. Data Mining (ICDM), Sydney, NSW, Australia, Dec. 2010, pp. 737–742.

[42] A. Esuli and F. Sebastiani, ‘‘Optimizing text quantifiers for multivariate loss functions,’’ Trans. Knowl. Discovery Data, vol. 9, no. 4, 2015, Art. no. 27.

[43] G. Forman, ‘‘Quantifying counts and costs via classification,’’ Data Mining Knowl. Discovery, vol. 17, no. 2, pp. 164–206, 2008.

[44] W. Gao and F. Sebastiani, ‘‘Tweet sentiment: From classification to quantification,’’ in Proc. IEEE/ACM ASONAM, Aug. 2015, pp. 97–104.

[45] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, ‘‘The WEKA data mining software: An update,’’ ACM SIGKDD Explor. Newslett., vol. 11, no. 1, pp. 10–18, Jun. 2009.

[46] C. Fellbaum and G. A. Miller, WordNet: An Electronic Lexical Database. Cambridge, MA, USA: MIT Press, 1998.

[47] D. Davidov, O. Tsur, and A. Rappoport, ‘‘Semi-supervised recognition of sarcastic sentences in Twitter and Amazon,’’ in Proc. 14th Conf. Comput. Natural Lang. Learn., Jul. 2010, pp. 107–116.

[48] M. Bouazizi and T. Ohtsuki, ‘‘Sarcasm detection in Twitter: ‘All your products are incredibly amazing!!!’—Are they really?’’ in Proc. IEEE Globecom, Dec. 2015, pp. 1–6.

[49] L. Breiman, ‘‘Random forests,’’ Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.

[50] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA, USA: Morgan Kaufmann, 1993.

MONDHER BOUAZIZI received the Bachelor Engineering Diploma degree in communications from SUPCOM, Carthage University, Tunisia, in 2010, and the master’s degree from Keio University in 2017, where he is currently pursuing the Ph.D. degree. In 2015, he enrolled as a master’s student. He was a Telecommunication Engineer in access network quality and optimization with Ooredoo Tunisia (Ex. Tunisiana) for three years.

TOMOAKI OHTSUKI (S’91–M’92–SM’01) received the B.E., M.E., and Ph.D. degrees in electrical engineering from Keio University, Yokohama, Japan, in 1990, 1992, and 1994, respectively. From 1993 to 1995, he was a special Research Fellow, Japan Society for the Promotion of Science, for Japanese junior scientists. From 1994 to 1995, he was a Post-Doctoral Fellow and a Visiting Researcher in electrical engineering with Keio University. From 1995 to 2005, he was with the Science University of Tokyo. From 1998 to 1999, he was with the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley. In 2005, he joined Keio University, where he is currently a Professor. He is currently involved in research on wireless communications, optical communications, signal processing, and information theory. He was a recipient of the 5th International Communication Foundation Research Award, the 27th TELECOM System Technology Award, the 1997 Inoue Research Award for Young Scientist, the 1997 Hiroshi Ando Memorial Young Engineering Award, the 2000 Ericsson Young Scientist Award, the IEEE 1st Asia-Pacific Young Researcher Award in 2001, the 2002 Funai Information and Science Award for Young Scientist, the 2011 IEEE SPCE Outstanding Service Award, the ETRI Journal’s 2012 Best Reviewer Award, and the 9th International Conference on Communications and Networking Best Paper Award in China, in 2014.

He gave tutorials and keynote speeches at many international conferences, including the IEEE VTC and the IEEE PIMRC. He has published over 140 journal papers and 340 international conference papers. He was the Vice President of Communications Society of the IEICE. He is a fellow of the IEICE. He has served as the General Co-Chair and the Symposium Co-Chair for many conferences, including the IEEE GLOBECOM 2008, SPC, the IEEE ICC 2011, CTS, the IEEE GCOM 2012, SPC, and the IEEE SPAWC. He served as the Chair for the IEEE Communications Society, Signal Processing for Communications and Electronics Technical Committee. He served as a Technical Editor for the IEEE Wireless Communications Magazine and an Editor for Physical Communications (Elsevier). He is currently serving as an Area Editor for the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY and an Editor for the IEEE COMMUNICATIONS SURVEYS AND TUTORIALS.
