
Received July 10, 2020, accepted July 17, 2020, date of publication July 22, 2020, date of current version August 17, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.3011202

Exploring Diverse Features for Sentiment Quantification Using Machine Learning Algorithms

KASHIF AYYUB1, SAQIB IQBAL2, EHSAN ULLAH MUNIR1 (Senior Member, IEEE), MUHAMMAD WASIF NISAR1, AND MOMNA ABBASI1

1Department of Computer Science, COMSATS University Islamabad–Wah, Wah Cantonment 47040, Pakistan
2Department of Software Engineering, Al Ain University, Al Ain, UAE

Corresponding author: Kashif Ayyub ([email protected])

ABSTRACT In the era of Web 2.0, online forums, blogs and Twitter are becoming primary sources for sharing views, opinions and comments about different topics. Classifying these views, opinions and comments is known as sentiment analysis, which is an active research area. Sentiment analysis has vast applications in fields such as marketing, e-commerce and business. Under the umbrella of sentiment analysis, sentiment quantification, which deals with estimating the relative frequency of the class of interest, is currently being investigated by researchers. In sentiment quantification, the effect of new features and the comparative effectiveness of diverse types of classifiers need further investigation. In this paper, we explore diverse feature sets and classifiers for sentiment quantification. In addition, we perform an empirical performance analysis of conventional machine learning techniques, ensemble-based methods and state-of-the-art deep learning algorithms on the basis of these feature sets. The computed results show that the choice of feature set affects the performance of classifiers in sentiment quantification. The results also confirm that the deep learning techniques perform better than the conventional machine learning algorithms.

INDEX TERMS Deep learning, feature engineering, lexical feature, machine learning, sentiment quantification.

I. INTRODUCTION

Web 2.0 has become a major platform for exchanging views and sharing thoughts with other users. Analyzing emotions, opinions or sentiments from electronic word of mouth (eWOM) is known as sentiment analysis [1]. Sentiment analysis is one of the most widely researched areas of natural language processing. It has vast applications in fields such as social science, political science [2], [3], marketing [4], e-commerce and customer satisfaction [5], [6]. Sentiment-based analysis of online reviews of products and services also helps in understanding customer satisfaction [7]. E-commerce has given customers the opportunity to review products online, and these reviews strongly influence the purchase decisions of other customers. Sentiment analysis has proved to be an important method for ranking products on the basis of reviews [8]. Most of the time, users are not interested in the labelling of individual items but in results at the aggregate level [9]. For example, in research on monitoring the reputation of a company, the highest priority is given to the topics that affect the company's reputation [10]. In this task, the authors are interested not only in the sentiment classification of individual tweets but also in the aggregate level.

The associate editor coordinating the review of this manuscript and approving it for publication was Bohui Wang.

The task of correctly estimating the relative frequency of data in the class of interest is called “sentiment quantification” [11]. It is also known as prevalence estimation [11] and class distribution estimation. The simplest approach to sentiment quantification is the “classify-and-count” approach. In this approach, each item is first classified individually using a supervised learning classifier. The next step is to count the examples assigned to the class of interest, which is also called the positive class; the count divided by the total number of items gives the estimated prevalence.
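The classify-and-count idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_classifier` is a hypothetical keyword rule standing in for a trained supervised classifier, and the four tweets are made up.

```python
from collections import Counter

def classify_and_count(items, classify, classes):
    """Estimate class prevalences by labelling every item and counting labels."""
    counts = Counter(classify(item) for item in items)
    total = len(items)
    return {c: counts.get(c, 0) / total for c in classes}

# Hypothetical keyword-based classifier, used only for illustration.
def toy_classifier(tweet):
    return "positive" if "good" in tweet.lower() else "negative"

tweets = ["Good phone", "bad battery", "really good screen", "awful"]
prevalence = classify_and_count(tweets, toy_classifier, ["positive", "negative"])
print(prevalence)
```

The quantity of interest is the returned prevalence dictionary rather than the individual labels, which is what distinguishes quantification from classification.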

The relevant literature presents several methods for sentiment quantification, which are divided into two categories: aggregative and non-aggregative. Aggregative methods require classification of unlabeled data as an intermediate step, whereas non-aggregative methods do not require the classification step. The majority of the methods proposed for quantification are aggregative, such as Adjusted Classify and Count (ACC) [12], Probabilistic Classify and Count (PCC) [13], and Probabilistic Adjusted Classify and Count (PACC) [13]. Fewer methods are non-aggregative, such as Hellinger Distance x (HDx) and Hellinger Distance y (HDy) [14].

VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ 142819

K. Ayyub et al.: Exploring Diverse Features for Sentiment Quantification Using Machine Learning Algorithms

Sentiment quantification has applications in many fields. Quantification has been applied to marine life for ecological analysis of coral reefs and plankton [15]. It has also been used to explore the homophily effect among different social networks [16]. Quantification is used for finding sense priors of words from news domains and for improving the accuracy of word sense disambiguation [17]. Prevalence estimation of movie reviews and of datasets from health sciences, for collecting health-related information, has also been performed using sentiment quantification [18].
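Two of the aggregative methods named in this introduction, ACC and PCC, can be sketched as small corrections on top of classify-and-count. The formulas below follow the usual textbook formulation and are not taken from this paper; the true-positive rate `tpr`, the false-positive rate `fpr` and the posterior values are hypothetical numbers chosen for illustration.

```python
def adjusted_classify_and_count(cc_estimate, tpr, fpr):
    """ACC: correct a classify-and-count estimate using the classifier's
    true-positive and false-positive rates (estimated on held-out data)."""
    if tpr == fpr:
        return cc_estimate  # correction undefined; fall back to plain CC
    adjusted = (cc_estimate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, adjusted))  # clip to a valid prevalence

def probabilistic_classify_and_count(posteriors):
    """PCC: average the posterior probabilities of the positive class."""
    return sum(posteriors) / len(posteriors)

acc = adjusted_classify_and_count(cc_estimate=0.40, tpr=0.80, fpr=0.10)
pcc = probabilistic_classify_and_count([0.9, 0.2, 0.6, 0.3])
print(round(acc, 4), round(pcc, 4))
```

Both functions still need a classifier's output for every item, which is why these methods are called aggregative.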

In this research study, our aim is to explore the performance of feature sets with different classifiers and to compare the performance of conventional and deep learning-based techniques for sentiment quantification. The major contributions of the study include:

• Analysis of the effects of different feature sets on classifiers for sentiment quantification.

• Comparison of diverse types of conventional machine learning algorithms to assess their performance using feature sets such as N-gram, TF-IDF, and their combinations.

• Exploration of multilayer feed-forward artificial neural networks (ANN) using diverse activation functions such as Rectifier, Maxout and Tanh.

• Exploration of word embedding-based features such as Word2vec and GloVe for sentiment quantification.

• Application of state-of-the-art deep learning algorithms such as CNN-LSTM, RNN, and DBN.

The rest of this paper is organized as follows: Section 2 describes the state-of-the-art methods used in sentiment quantification. Section 3 presents the experimental setup, with details of the proposed framework, the features used, the datasets used for experimentation, and the measures for evaluating the performance of the methods. Section 4 presents and discusses the results of the applied methods, and Section 5 concludes the study.

II. RELATED WORK

In this section we discuss the existing methods for sentiment quantification from the relevant literature. Researchers have used different learning approaches for quantification of data. We discuss these existing methods in three categories: aggregated, non-aggregated and ensemble-based methods.

A. AGGREGATED METHOD

Aggregated methods require classification of data as an intermediate step. The following is a summary of research work in this domain.

A method has been proposed by Gao and Sebastiani [11] for calculating a priori probability based on the Expectation Maximization (EM) algorithm. The EM algorithm is also used along with two quantification algorithms, CC and the Confusion-Matrix Correction method. CNNs are trained for experimentation purposes, and experiments are performed on two marine-life image datasets. The results show that EM outperforms all other methods on one dataset, while the CC approach outperforms all methods on the second dataset [15]. Different words have different sense priors depending on their sources, which gives rise to the word sense disambiguation problem. A method is proposed by Daughton and Paul [17] for estimation of class priors. Classifiers are used to obtain posterior probabilities, and these probabilities are then used by the EM algorithm to calculate prior probabilities.

CC is the simplest approach for quantification of data, and AC is an improved version of the CC approach. Three quantification methods, CC, AC and Mixture Model (MM), are compared in terms of absolute error on four datasets in a study conducted by Esuli and Sebastiani [12]. The results show that MM outperformed all other methods. In another study, different algorithms are compared for two tasks, quantification and cost quantification, on HP datasets. For quantification, CC, AC, T50, X, Max, MM and MS were compared by Milli et al. [19]. The results show that Median Sweep (MS) outperformed all other methods for quantification. Different quantification methods, namely CC, PCC, ACC, PACC, Expectation Maximization for Quantification (EMQ), Support Vector Machine (Kullback-Leibler Divergence) SVM(KLD), Support Vector Machine (Normalized Kullback-Leibler Divergence) SVM(NKLD), and Support Vector Machine (Q-measure) SVM(Q), are compared on different datasets on the basis of different evaluation measures in a study by Xue and Weiss [20]. Two different base learners are used for experimentation: SVM in the first case and Logistic Regression (LR) in the second. In both cases PCC outperformed both SVM and LR.

The Distribution Mismatch (DM) problem occurs frequently due to changes in class distribution over time. A method is proposed by Bella et al. [21] for dealing with such changes. In the proposed method, quantification is considered an intermediate task, whereas improving classification on the new distribution is the major task. Three different techniques are compared for this purpose: Class Distribution Estimation (CDE) methods, semi-supervised learning (SSL) methods and hybrid methods. Three CDE-based models are proposed: CDE-Iterative, which estimates NewCD by working in iterations; CDE-AC, which is based on AC; and CDE-Oracle, which uses an oracle to get the NewCD value. Two SSL-based methods, SSL-Naïve and SSL-Self-Train, are proposed. Five UCI datasets are used for training and testing purposes. The results show that, in terms of accuracy, the CDE-AC method outperformed all other methods.

TABLE 1. Summary of the papers that used aggregated methods.

For probability estimation, a method known as Average Probability is proposed by Barranquero et al. [22], which is based on the idea of averaging probability estimates. After the classifier is trained, probabilities are obtained for each instance in a test set and finally their average is calculated. In addition, Scaled Probability Estimators and Scaled Classify and Count are proposed. Another approach, based on weighted nearest neighbors, is proposed for quantification of data [23]. A further approach, based on homophily, is proposed for quantification of data [16]. For empirical analysis, two types of networks, Community Discovery and Ego-Networks, are used for this task.

Better quantification results are obtained by improving the classifier's performance. Q-measure is proposed by Narasimhan et al. [24] for balancing classification and quantification, based on the F-measure. Three different versions of Q-measure are used for this purpose. An approach named SVM(KLD) is proposed based on improving the classifier's performance [13]. Another technique, for optimizing performance measures of quantification and for balancing classifier and quantifier, is proposed in [25], together with a family of algorithms named NEMSIS.

In SemEval-2016, a study conducted by Esuli [26], Task 4 is subdivided into tasks A, B, C, D and E, where task D is binary quantification and task E is ordinal quantification of data. A team named QCRI proposed an ordinal tree for task E and, for task D, compared methods already proposed in the literature for binary quantification [27]. Another team, ISTI-CNR, also participated in SemEval-2016 in all tasks [28]. For ordinal quantification of data, the Regress and Count method is proposed, while for adjusted data the Adjusted Regress and Count method is proposed. For binary quantification, four quantification methods from the literature are compared.

One of the most common issues in sentiment analysis is the change in data distribution over time. In SemEval-2017, Task 4 is again divided into five tasks, where task D and task E are binary and ordinal quantification of data respectively. Sentiment quantification has also been applied to content other than English; for instance, the research study [29] uses Arabic-language content. A team named NRU-HSE [28] used LSTM for classification of data for task D, and the results were compared in terms of KLD with those of the approach proposed in 2016. Another team, TwiSE [29], also participated in SemEval-2017. Logistic Regression is used as the base classifier with the Classify and Count algorithm, and promising results were observed. A team named CrystalNest [30] participated in SemEval-2017 for tasks A to D. This team used an approach based on detection of sarcasm to improve classification and quantification. A feature set known as ACS is proposed, where A stands for Affect-related, C for Cognition-related and S for Sociolinguistics-related features. SVM is used as the classifier, based on the ACS model.

The relevant literature also presents tools and systems developed for sentiment quantification. A single tweet can hold more than one sentiment, but in classification only one sentiment can be identified. The SENTA tool, developed by Esuli et al. [31] and previously used for classification, is updated for quantification. A system based on three neural networks, named AffecThor, is proposed in SemEval-2018 [32] for ordinal quantification. Another deep learning-based approach, named QuaNet [33], is proposed for quantification of data; it uses a bi-directional LSTM for quantifying data. An error-adjusted bootstrapping procedure is proposed by Saerens et al. [18] for quantification of data, with adjusted count used as the baseline method. A summarized comparison of aggregated methods is shown in Table 1.

B. NON-AGGREGATED METHODS

As discussed in the Introduction, non-aggregated methods do not require the classification of each individual item as an intermediate step and estimate class prevalences holistically [11]. An approach based on a divergence measure has been proposed by Beijbom et al. [10] for quantification of data. The approach uses the Hellinger distance, which measures the mismatch between the test and validation data distributions; the prior probability is estimated as the value that minimizes this divergence. Two different Hellinger distance approaches are compared: HDx and HDy, where HDx does not require any classifier while HDy needs the output of a classifier. A non-parametric approach for quantification of data, which does not require classification of data, is proposed by Gállego et al. [35]. For experimentation purposes, American presidential blogs are used. A summarized comparison of non-aggregated methods is shown in Table 2.

TABLE 2. Summary of the papers that used non-aggregated methods.
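The Hellinger-distance idea described in this subsection can be illustrated with a simplified HDy-style search. The sketch assumes that classifier scores have already been binned into normalized histograms (`pos_hist`, `neg_hist` and `test_hist` are hypothetical) and uses a plain grid search over the prevalence; published HDy implementations differ in binning and search details.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same bins."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

def hdy_prevalence(pos_hist, neg_hist, test_hist, steps=100):
    """Return the mixing weight alpha whose mixture of the validation
    histograms is closest (in Hellinger distance) to the test histogram."""
    best_alpha, best_dist = 0.0, float("inf")
    for k in range(steps + 1):
        alpha = k / steps
        mixture = [alpha * p + (1 - alpha) * n for p, n in zip(pos_hist, neg_hist)]
        dist = hellinger(mixture, test_hist)
        if dist < best_dist:
            best_alpha, best_dist = alpha, dist
    return best_alpha

# Hypothetical score histograms of positive and negative validation items.
pos_hist = [0.1, 0.2, 0.7]
neg_hist = [0.6, 0.3, 0.1]
# Synthetic test histogram built with a true positive prevalence of 0.3.
test_hist = [0.3 * p + 0.7 * n for p, n in zip(pos_hist, neg_hist)]
estimate = hdy_prevalence(pos_hist, neg_hist, test_hist)
print(estimate)
```

No individual test item receives a label here; the prevalence is read off directly from the distribution match.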

C. ENSEMBLE-BASED METHODS

Ensemble models are developed by combining models using some aggregation function. For the issue of changing data distribution, an ensemble-based method is proposed by Marquez et al. [34]. For the ensemble-based models, two methods, Adjusted Classify and Count and HDy, are used. Another method, by Saif et al. [35], uses five algorithms for learning ensembles: CC, AC, PCC, PAC and HDy. Two strategies are used in this approach: first all learners make predictions, and then four selected measures are defined based on the best selected method.

III. RESEARCH METHODOLOGY

This section discusses data pre-processing, feature extraction techniques and algorithms for sentiment quantification. For estimating the prevalence of the class of interest, an approach based on aggregated methods is proposed. After pre-processing, the data is classified into classes, and the number of instances in the class of interest is then counted.

A. THE PROPOSED FRAMEWORK

Figure 1 presents the steps of the proposed framework for sentiment quantification. The first step is acquisition of datasets. The data is pre-processed by applying different pre-processing techniques. Features are extracted using RapidMiner, a data mining tool. These extracted features are then fed into different classifiers to classify the data into classes. After classification, the number of instances in the class of interest is counted. The last step is to apply different evaluation measures to evaluate the performance of the applied methodology.

TABLE 3. Summary of the papers that used a mixture of aggregated and non-aggregated methods.

FIGURE 1. The Proposed Framework for Sentiment Quantification.

B. DATA PRE-PROCESSING

After data acquisition, the next step is pre-processing of data, for which the RapidMiner tool is used. First, tokenization is performed and then stop-words are removed. Stop-words are words that have little impact on text processing, such as: is, I, am, are, the. Removing stop-words reduces the dimensionality of the feature space and improves efficiency. Stemming is also performed using RapidMiner. Stemming is the process of reducing words to their root form. For example, the words “extraction” and “extracting” become “extract” after stemming. Like stop-word removal, stemming also reduces the dimensionality of the feature space.
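The three pre-processing steps (tokenization, stop-word removal, stemming) can be sketched in Python as follows. This is not the RapidMiner pipeline used in the study: the stop-word set contains only the examples listed above, and `toy_stem` is a hypothetical suffix stripper standing in for a real stemmer.

```python
import re

STOP_WORDS = {"is", "i", "am", "are", "the"}  # only the examples listed in the text

def toy_stem(word):
    """Very rough suffix stripper; a stand-in for a real stemming algorithm."""
    for suffix in ("ion", "ing"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z0-9']+", text.lower())      # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [toy_stem(t) for t in tokens]                  # stemming

print(preprocess("I am extracting the features; the extraction is fast"))
```

After these steps both “extracting” and “extraction” map to the same token, so they share one feature dimension instead of two.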

C. FEATURE ENGINEERING

In this study, diverse features are used. We explore conventional features such as TF-IDF, n-gram and their combination. In addition, the word embedding-based features Word2vec and GloVe are also used.

1) TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a Bag of Words method. Term Frequency represents how many times a word appears in a document, normalized by the document's length, while Inverse Document Frequency is the logarithm of the total number of documents divided by the number of documents in which the word appears. The first step of this method is to represent all words in the documents as a word list. This list is then assigned to every document. If there are 500 words in the list, each document is represented by a vector of one row and 500 columns, where each column corresponds to a word from the word list and holds the computed term frequency of that word in the given document. Inverse Document Frequency can be given as:

$$\mathrm{IDF} = \log\frac{N}{n} \tag{1}$$

where $n$ is the number of documents containing the given word and $N$ is the total number of documents. The TF-IDF value is then obtained by multiplying the TF and IDF values:

$$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF} \tag{2}$$

Suppose a document contains 100 words and the word “good” appears 3 times in it; then the TF is $3/100 = 0.03$. If the collection contains 500 documents and the word “good” appears in 50 of them, then the IDF (using a base-10 logarithm) is $\log(500/50) = 1$. The TF-IDF is therefore $0.03 \times 1 = 0.03$.
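The worked example can be checked with a short Python sketch of equations (1) and (2). The base-10 logarithm is an assumption inferred from the example value log(500/50) = 1; the function names are illustrative only.

```python
import math

def term_frequency(term_count, doc_length):
    """TF: occurrences of the term divided by the number of words in the document."""
    return term_count / doc_length

def inverse_document_frequency(total_docs, docs_with_term):
    """IDF as in equation (1), with a base-10 logarithm (assumed)."""
    return math.log10(total_docs / docs_with_term)

tf = term_frequency(3, 100)                 # "good" appears 3 times in 100 words
idf = inverse_document_frequency(500, 50)   # 50 of 500 documents contain "good"
tf_idf = tf * idf                           # equation (2)
print(tf, idf, tf_idf)
```

A word that appears in every document gets an IDF of 0, so its TF-IDF weight vanishes regardless of how often it occurs.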

2) N-GRAM

In n-gram models, probabilities are assigned to sequences of words. An n-gram is a contiguous sequence of n words. An n-gram of size one is referred to as a unigram, one of size two as a bigram and one of size three as a trigram. For example, “Thanks”, “Thank you” and “Thank you so much” are a unigram, a bigram and a four-gram respectively.
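Extracting n-grams from a token list is a sliding-window operation, sketched below in Python. The helper name `ngrams` and the lower-cased example sentence are illustrative choices, not part of the study's RapidMiner setup.

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "thank you so much".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 4))  # the single four-gram
```

A text with fewer than n tokens yields no n-grams, which matters for short tweets when a large n is used.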

3) COMBINATION OF TF-IDF ALONG WITH N-GRAM

TF-IDF is a Bag of Words method, while n-gram features are based on word sequences. In the TF-IDF along with n-gram approach, the n-gram feature extraction technique is first applied to the dataset, which yields a word list for each dataset. TF-IDF is then applied to obtain the term frequency-inverse document frequency of the entries in these word lists.


4) WORD2VEC

Word embedding has become one of the most popular ways of representing text documents in vector form. It is gaining popularity because it captures the context of text, semantic and syntactic similarity, and relations with other words. Word2vec takes text as input and produces a large vector space. Word embedding-based approaches are widely used in recent research studies in sentiment analysis [38]–[40].

5) GLOVE

GloVe (Global Vectors for Word Representation) is a model for word representation. It is an unsupervised approach that maps words into a vector space in which similar words cluster together and dissimilar words lie far apart. The advantage of GloVe is that it does not rely only on local statistics, as Word2vec does, but also incorporates word-to-word co-occurrence to obtain word vectors. In other words, it captures both the global and the local statistics of a corpus in order to form word vectors. GloVe has been used in sentiment analysis [41], [42].
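The word-to-word co-occurrence statistics that GloVe builds on can be collected with a simple windowed count, sketched below. The sketch covers only the counting step under an assumed symmetric window; it does not implement the GloVe training objective, and the two-sentence corpus is made up.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = [["good", "phone", "good", "price"], ["bad", "phone"]]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("good", "phone")], counts[("bad", "phone")])
```

Because the counts are accumulated over the whole corpus, they reflect global statistics, in contrast to the local context windows processed one at a time by Word2vec.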

IV. EXPERIMENTAL SETUP

This section discusses the datasets, techniques and evaluation measures that have been used to evaluate performance.

A. DATASETS

We used three datasets for experimentation purposes. The first dataset is Stanford Twitter Sentiment, also known as the Sentiment140 dataset [43]. It contains two parts, training data and testing data. The training data consists of 1.6 million tweets, which are labelled automatically on the basis of emoticons: tweets containing :-), :) or :D are labelled positive, while tweets containing :-( or :( are labelled negative. The testing data contains 498 manually labelled tweets: 177 negative, 139 neutral and 182 positive. For experimentation purposes we have used only this testing data. The second dataset is STS-Gold [44], a newer dataset built specifically for sentiment analysis of tweets. It contains 2034 manually labelled tweets, of which 1402 are negative and 632 are positive. The third dataset is Sanders, which is widely used for sentiment analysis. The Sanders dataset has 5513 tweets overall, but 1786 of them are not labelled as positive, negative or neutral, so 3727 tweets are available for the sentiment quantification task in this paper. The Sanders dataset has already been used for sentiment classification [45], [46]. Table 4 shows detailed characteristics of the datasets.

B. CONVENTIONAL MACHINE LEARNING TECHNIQUES

Machine learning gives a system the ability to learn and to improve its performance from experience. We used Naïve Bayes, Support Vector Machine and Decision Tree for the experiments.

TABLE 4. Characteristics of Datasets.

1) NAÏVE BAYES

Naïve Bayes (NB) is a widely used classification algorithm. It is based on conditional probability and uses Bayes' rule. The formula of Bayes' theorem is:

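$$P(x \mid H) = \frac{P(x)\,P(H \mid x)}{P(H)} \tag{3}$$

Using this equation, we can find the class $x$ with maximum probability:

$$x = \arg\max_{x} P(x) \prod_{i=1}^{n} P(h_i \mid x) \tag{4}$$

$$P_{NB}(x \mid H) = \frac{P(x) \prod_{i=1}^{m} P(f_i \mid x)^{n_i(H)}}{P(H)} \tag{5}$$

where $h_i$ are the features observed in $H$, and $n_i(H)$ is the number of times feature $f_i$ occurs in $H$.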

NB performs better on categorical data than on numerical data.
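A minimal multinomial Naïve Bayes classifier following the arg-max rule in equation (4) might look as follows. Two details are assumptions that the text does not state: add-one (Laplace) smoothing for unseen words, and working with log-probabilities to avoid numerical underflow. The four training documents are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class maximizing log P(x) + sum_i log P(h_i | x)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for t in tokens:
            # add-one smoothing (assumed) keeps unseen words from zeroing the product
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [["good", "phone"], ["great", "good", "screen"],
        ["bad", "battery"], ["awful", "bad", "phone"]]
labels = ["positive", "positive", "negative", "negative"]
model = train_nb(docs, labels)
print(predict_nb(model, ["good", "screen"]), predict_nb(model, ["bad", "battery"]))
```

The denominator P(H) is omitted because it is the same for every class and does not change the arg-max.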

2) SUPPORT VECTOR MACHINE

Support Vector Machine (SVM) is very useful for high-dimensional data. In its basic form it is a linear model, and it is widely used in text processing. It determines an optimal boundary between the classes. The formula for SVM optimization is:

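$$\max_{a}\; f(a_1, \ldots, a_n) = \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i a_i \,(x_i \cdot x_j)\, y_j a_j$$

$$\text{subject to} \quad \sum_{i=1}^{n} a_i y_i = 0, \qquad 0 \le a_i \le C \tag{6}$$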

SVM works well even when data is semi-structured or unstructured, but it does not perform well with noisy data and requires a long training time for large datasets.
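To make the optimization problem in (6) concrete, the sketch below evaluates the dual objective and checks the constraints for a given set of multipliers. It does not solve the optimization; the two training points and the multipliers `alphas` are a hand-picked toy case, and all function names are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alphas, labels, points):
    """Value of the SVM dual objective f(a_1, ..., a_n)."""
    n = len(alphas)
    linear = sum(alphas)
    quadratic = sum(
        labels[i] * alphas[i] * dot(points[i], points[j]) * labels[j] * alphas[j]
        for i in range(n) for j in range(n)
    )
    return linear - 0.5 * quadratic

def is_feasible(alphas, labels, C, tol=1e-9):
    """Check sum_i a_i y_i = 0 and 0 <= a_i <= C."""
    balanced = abs(sum(a * y for a, y in zip(alphas, labels))) <= tol
    boxed = all(-tol <= a <= C + tol for a in alphas)
    return balanced and boxed

points = [(1.0, 0.0), (-1.0, 0.0)]   # one point per class
labels = [1, -1]
alphas = [0.5, 0.5]                  # hand-picked multipliers for this toy case
print(dual_objective(alphas, labels, points), is_feasible(alphas, labels, C=1.0))
```

In practice a quadratic-programming solver searches over the multipliers; the pairwise inner products in the double sum are one reason training becomes slow on large datasets.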

3) DECISION TREE

Decision Tree (DT) works on the principle of dividing data. This algorithm is based on inductive learning. In this method, questions appear at the internal nodes. Data passes through these questions and reaches the classes represented by the leaves. Entropy and information gain are used to build a decision tree. The formula of entropy is:

$$H(Y) = -\sum_{j=1}^{n} P_j \log_2 P_j \tag{7}$$

and the information gain of an attribute $x$ is:

$$IG(Y, x) = H(Y) - H(Y \mid x) \tag{8}$$

Decision trees require less effort for pre-processing of data. Missing values in the data do not affect the building of a decision tree, but the training time for a decision tree is higher than for the other classifiers.
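Entropy and information gain, as used to choose decision-tree splits, can be computed as in the following sketch. Entropy is taken with the conventional leading minus sign so that it is non-negative; the labels and the binary feature `has_good` are made-up illustration data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy of the labels minus their expected entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

labels = ["pos", "pos", "neg", "neg"]
has_good = [1, 1, 0, 0]   # hypothetical feature that separates the classes perfectly
random_bit = [1, 0, 1, 0]  # hypothetical feature unrelated to the class
print(information_gain(labels, has_good), information_gain(labels, random_bit))
```

A tree builder would place the feature with the highest information gain at the current node and repeat the computation on each resulting subset.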

C. ENSEMBLE-BASED TECHNIQUES

Ensemble-based methods combine multiple algorithms to obtain better results. We use two ensemble-based methods in our proposed work, AdaBoost and Random Forest.

1) AdaBoost

AdaBoost is a boosting algorithm that trains weak classifiers in successive rounds, with each round giving more weight to the instances misclassified by the previous classifiers. All the predictions are then combined to obtain the final prediction. Each instance in the training dataset is weighted. The initial weight is set to:

$$\mathrm{weight}(y_i) = \frac{1}{K} \tag{9}$$

where $K$ is the number of training instances. For the trained model, the misclassification rate is calculated as:

$$\mathrm{error} = \frac{K - \mathrm{correct}}{K} \tag{10}$$

where correct is the number of training instances classified correctly. AdaBoost combines weak classifiers to improve on their individual performance, but training is time-consuming and imbalanced data can decrease its performance.
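The weighting scheme can be illustrated with one boosting round in Python. The initial weights and the misclassification rate follow equations (9) and (10); the re-weighting step uses the standard binary AdaBoost update, which the text does not spell out and is therefore an assumption, as are the toy targets and predictions.

```python
import math

def initial_weights(k):
    """Equation (9): every one of the k training instances starts with weight 1/k."""
    return [1.0 / k] * k

def misclassification_rate(predictions, targets):
    """Equation (10), written as the fraction of instances predicted wrongly."""
    wrong = sum(p != t for p, t in zip(predictions, targets))
    return wrong / len(targets)

def reweight(weights, predictions, targets):
    """One round of the standard binary AdaBoost update (assumes 0 < err < 1)."""
    err = sum(w for w, p, t in zip(weights, predictions, targets) if p != t)
    alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this weak classifier
    updated = [w * math.exp(alpha if p != t else -alpha)
               for w, p, t in zip(weights, predictions, targets)]
    norm = sum(updated)
    return [w / norm for w in updated], alpha

targets = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]   # hypothetical weak classifier, one mistake
weights = initial_weights(4)
new_weights, alpha = reweight(weights, predictions, targets)
print(misclassification_rate(predictions, targets), new_weights)
```

After the update the single misclassified instance carries half of the total weight, which forces the next weak classifier to concentrate on it.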

2) RANDOM FOREST

Random forest is another ensemble-based learning algorithm for classification and regression. Multiple decision trees operate as an ensemble. Each decision tree makes a class prediction, and the class with the most votes becomes the prediction of the model.
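The voting step of a random forest reduces to a majority count over the trees' outputs, sketched below. Only the aggregation is shown; training the individual trees on bootstrap samples and random feature subsets is omitted, and the per-tree predictions are made up.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical outputs of five trees for one tweet.
votes = ["positive", "negative", "positive", "positive", "negative"]
print(majority_vote(votes))
```

Using an odd number of trees avoids ties in the binary case.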

D. DEEP LEARNING-BASED TECHNIQUES

Deep learning is a field of machine learning that deals with networks that can learn from unstructured and unlabeled data. The models are trained using stochastic gradient descent and the back-propagation algorithm. Multiple hidden layers, each containing neurons, are used; for our experiments we used 50 hidden layers. Deep learning is commonly used to improve performance in classification and text analysis [47]. We used different activation functions, with and without dropout, for training our models.

1) RECTIFIER

Rectifier is one of the most commonly used activation functions in deep learning. A unit that applies the rectifier activation function is known as a Rectified Linear Unit (ReLU). The rectifier activation function is widely used in speech recognition [48] and in computer vision [49]. The rectifier function can be given as:

$$f(x) = \max(0, x) \tag{11}$$

This function returns 0 whenever it gets a negative input and returns the value of $x$ otherwise.

2) MAXOUT

The Maxout activation function is a generalized version of the rectifier activation function and the leaky rectifier activation function. Besides its strengths, one weakness of this function is that the number of parameters is doubled for every neuron, which increases the training overhead. The Maxout function can be given as:

$$\max\left(w_1^{T} x + b_1,\; w_2^{T} x + b_2\right) \tag{12}$$

where $b_1, b_2$ are the biases and $w_1, w_2$ are the weight vectors.

3) TANH

Tanh is a non-linear activation function with range (-1, 1). Tanh is a sigmoidal function, which means its graph is S-shaped. An advantage of the Tanh function is that negative inputs are mapped to negative outputs, while inputs near 0 are mapped near 0. Tanh can be given as:

$$\tanh(x) = \frac{2}{1 + e^{-2x}} - 1 \tag{13}$$
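The three activation functions in equations (11) to (13) can be written directly in Python. The sketch treats a single neuron with list-valued weights; the function names and the example inputs are illustrative only.

```python
import math

def rectifier(x):
    """Equation (11): max(0, x)."""
    return max(0.0, x)

def maxout(x, w1, b1, w2, b2):
    """Equation (12): the larger of two affine functions of the input vector x."""
    z1 = sum(w * v for w, v in zip(w1, x)) + b1
    z2 = sum(w * v for w, v in zip(w2, x)) + b2
    return max(z1, z2)

def tanh(x):
    """Equation (13): 2 / (1 + exp(-2x)) - 1."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

print(rectifier(-3.0), rectifier(2.5))
print(maxout([1.0, 2.0], [1.0, 0.0], 0.0, [0.0, 1.0], 0.5))
print(tanh(0.7), math.tanh(0.7))
```

Setting the second weight vector and bias of `maxout` to zero reproduces the rectifier, which illustrates why Maxout is described as a generalization of it and why it needs twice the parameters.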

4) RNN

Recurrent Neural Networks (RNN) are a class of artificial neural networks that are useful for unsegmented tasks such as handwriting or speech recognition. RNNs keep a memory that influences their decisions. The difference between conventional neural networks and RNNs is that conventional networks retain only what they learnt during the training phase, whereas RNNs also retain what they have learnt from previous inputs while generating output. The other difference is that outputs in an RNN are affected not only by the weights applied to the input, as in conventional neural networks, but also by a hidden vector representing the knowledge learnt from previous inputs. RNN-based approaches are used in sentiment analysis [50]–[52].
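The role of the hidden vector can be illustrated with a tiny recurrent cell in plain Python. The update rule h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) is the common Elman-style formulation, assumed here for illustration; the weights are arbitrary hand-picked values, not trained parameters.

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """One recurrent update: combine the current input with the previous hidden state."""
    h_new = []
    for i in range(len(h_prev)):
        total = b_h[i]
        total += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(w_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(sequence, h0, w_xh, w_hh, b_h):
    """Feed a sequence of input vectors through the cell and return the final state."""
    h = h0
    for x in sequence:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
    return h

w_xh = [[0.5], [-0.5]]            # input-to-hidden weights (arbitrary)
w_hh = [[0.1, 0.0], [0.0, 0.1]]   # hidden-to-hidden weights (arbitrary)
b_h = [0.0, 0.0]

h_a = run_rnn([[1.0], [0.0]], [0.0, 0.0], w_xh, w_hh, b_h)
h_b = run_rnn([[0.0], [0.0]], [0.0, 0.0], w_xh, w_hh, b_h)
print(h_a, h_b)
```

Both sequences end with the same input, yet the final states differ because the first sequence left a trace in the hidden vector.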

5) CNN-LSTM

A CNN (Convolutional Neural Network) can extract local information but may fail to capture long-distance dependencies. LSTM (Long Short-Term Memory) can address this limitation by sequentially modeling texts across sentences [53]. The CNN-LSTM architecture uses CNN layers for feature extraction on the input data, combined with an LSTM to support sequence prediction.

6) DEEP BELIEF NETWORKS

Deep Belief Networks (DBN) are a class of deep neural networks. A DBN combines machine learning and neural networks with statistics and probability. It is composed of multiple layers of hidden units. The layers are connected to each other, but there are no connections among the hidden units within a layer. It generates all possible values for a given case. Researchers are now exploring DBNs for sentiment analysis of data [54]–[56].


E. PERFORMANCE EVALUATION MEASURES

The proposed techniques are evaluated using three evaluation measures: absolute error (AE), relative error (RE) and normalized absolute error (NAE). The simplest measure for binary quantification is AE, which can be given as:

$$AE(p, \hat{p}) = \frac{1}{|C|} \sum_{c \in C} \left| \hat{p}(c) - p(c) \right| \tag{14}$$

where $C$ is the set of classes, $|C|$ is the number of classes, and $\left| \hat{p}(c) - p(c) \right|$ is the difference between the true and the estimated prevalence of class $c$. Normalized absolute error is also used for evaluating performance; NAE gives the normalized value of the absolute error.

RE is also used for evaluating the performance of classifiers in terms of sentiment quantification:

$$RE(p, \hat{p}) = \frac{1}{|C|} \sum_{c \in C} \frac{\left| \hat{p}(c) - p(c) \right|}{p(c)} \times 100 \tag{15}$$
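The evaluation measures can be sketched in Python as follows. AE and RE follow equations (14) and (15). The text gives no formula for NAE, so the normalizer used below, 2(1 - min_c p(c)), is an assumption taken from the usual definition in the quantification literature; the prevalence values are made up.

```python
def absolute_error(true_prev, est_prev):
    """Equation (14): mean absolute difference between estimated and true prevalences."""
    return sum(abs(est_prev[c] - true_prev[c]) for c in true_prev) / len(true_prev)

def relative_error(true_prev, est_prev):
    """Equation (15): mean relative difference, in percent (needs p(c) > 0)."""
    total = sum(abs(est_prev[c] - true_prev[c]) / true_prev[c] for c in true_prev)
    return 100 * total / len(true_prev)

def normalized_absolute_error(true_prev, est_prev):
    """NAE with the assumed normalizer 2 * (1 - min_c p(c)), so values lie in [0, 1]."""
    total = sum(abs(est_prev[c] - true_prev[c]) for c in true_prev)
    return total / (2 * (1 - min(true_prev.values())))

true_prev = {"positive": 0.25, "negative": 0.75}  # hypothetical true prevalences
est_prev = {"positive": 0.5, "negative": 0.5}     # hypothetical estimates
print(absolute_error(true_prev, est_prev),
      relative_error(true_prev, est_prev),
      normalized_absolute_error(true_prev, est_prev))
```

RE penalizes the same absolute deviation more heavily on the rare class, which is why it is reported alongside AE.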

V. RESULTS AND DISCUSSIONS

In this section we discuss the results obtained on the three datasets with the different feature sets. For lexical features, when N-gram is applied, the best results are obtained with n = 5, so we consider only this setting in the detailed discussion. When the N-gram technique is applied to the datasets, a word list is obtained for each dataset. TF-IDF is used as a Bag of Words-based feature extraction technique. When TF-IDF is applied, 3859, 1724, and 1933 attributes are extracted for the STS-Gold, Sentiment140, and Sanders datasets respectively. For further exploration, TF-IDF along with n-gram is used as a feature extraction technique. In this case 10123, 3788, and 16268 attributes are extracted for the STS-Gold, Sentiment140, and Sanders datasets respectively.

After feature extraction, different machine learning-based algorithms are applied for classification of instances. After classification, the examples in the class of interest are counted.

The results show that, for the STS-Gold dataset, the best quantification results with the N-gram feature extraction technique are obtained by SVM (NAE = 0.269). With TF-IDF, DT produced a better result (NAE = 0.009), and with TF-IDF along with n-gram, AdaBoost produced the best result (NAE = 0.009). Table 5 gives the complete details of the machine learning-based algorithms on the STS-Gold dataset.

For the Sentiment140 dataset, with the N-gram feature extraction technique RF produced NAE = 0.620. When TF-IDF is used, the results produced by AdaBoost are better than those of the other classifiers (NAE = 0.014). Again, when TF-IDF is applied along with n-gram on this dataset, AdaBoost produced better results than the other classifiers (NAE = 0.013). Table 6 gives detailed results of applying machine learning-based algorithms on the Sentiment140 dataset.

For the Sanders dataset, with the N-Gram feature extraction technique, NB produced NAE = 0.250. When TF-IDF is used, the results produced by SVM are better than those of the other classifiers (NAE = 0.139). Again, when TF-IDF is applied along with n-gram on this dataset, SVM produced better results than the other classifiers (NAE = 0.063). Table 7 gives the detailed results of applying the machine learning-based algorithms on the Sanders dataset.

Overall, these results show that, in terms of sentiment quantification, better results are produced using TF-IDF along with n-gram.

TABLE 5. Results of Machine Learning Algorithm on STS-Gold Dataset.

TABLE 6. Results of Machine Learning Algorithm on Sentiment140 Dataset.

Deep learning-based activation functions are used for the classification step while carrying out sentiment quantification. Rectifier, maxout, and tanh are used with and without dropout. The dropout value is set to 0.5 and 50 hidden layers are used in these models. The results show that n-gram does not perform well as a feature extraction technique; TF-IDF and TF-IDF along with n-gram produce better results. For the STS-Gold dataset, the rectifier activation function without dropout on features extracted by TF-IDF along with n-gram outperformed the other configurations. On the Sentiment140 dataset, the best results are obtained using the Tanh activation function on features extracted by TF-IDF along with n-gram. For the Sanders dataset, the best results are obtained using the Maxout activation function without dropout on features extracted by TF-IDF along with n-gram. Table 8, Table 9, and Table 10 provide a detailed summary of the results of the deep learning-based algorithms on the STS-Gold, Sentiment140, and Sanders datasets, respectively.

142826 VOLUME 8, 2020

K. Ayyub et al.: Exploring Diverse Features for Sentiment Quantification Using Machine Learning Algorithms

TABLE 7. Results of Machine Learning Algorithm on Sanders Dataset.

FIGURE 2. Comparison of Machine Learning Algorithms based on N-Gram.
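For reference, the three activation functions and the dropout setting named in this section can be sketched as below. This is a minimal illustration, not the implementation used in the experiments. The "inverted" dropout scaling (surviving activations divided by the keep probability) is an assumed convention, and all function names are hypothetical.

```python
import math
import random


def rectifier(x):
    """Rectifier (ReLU): passes positive inputs, zeroes out negative ones."""
    return max(0.0, x)


def tanh(x):
    """Hyperbolic tangent: squashes the input into (-1, 1)."""
    return math.tanh(x)


def maxout(pieces):
    """Maxout: the maximum over several linear pre-activations of one unit."""
    return max(pieces)


def dropout(activations, rate=0.5, rng=None):
    """Zero each activation with probability `rate` during training.

    Survivors are scaled by 1 / (1 - rate) (assumed inverted-dropout convention).
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


hidden = [rectifier(-2.0), rectifier(3.0), tanh(0.0), maxout([-1.0, 0.5, 0.2])]
dropped = dropout([1.0, 1.0, 1.0, 1.0], rate=0.5)
print(hidden, dropped)
```

With a dropout value of 0.5, each hidden activation is either removed or doubled in this sketch, so the expected activation is unchanged during training.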

TABLE 8. Results of Deep Learning Algorithm on STS-Gold Dataset.

TABLE 9. Results of Deep Learning Algorithm on Sentiment140 Dataset.

TABLE 10. Results of Deep Learning Algorithm on Sanders Dataset.

Overall, the results obtained by deep learning-based activation functions for sentiment quantification are better than the results obtained by conventional machine learning methods.

A. COMPARISON OF USED MACHINE LEARNING ALGORITHMS ON THE BASIS OF DIVERSE FEATURE SETS

The results show that TF-IDF along with N-Gram as the feature extraction technique produces the best results. The results further show that using deep learning algorithms for the classification step reduces errors. For two datasets, better results are obtained using deep learning-based techniques, whereas SVM performed better on the Sanders dataset. The results also show that DT and Maxout perform better with TF-IDF as the feature extraction technique on the three datasets. On the STS-Gold dataset, the best result (NAE = 0.008) is obtained with the Rectifier activation function without dropout on features extracted by TF-IDF along with n-gram. On the Sentiment140 dataset, the best result (NAE = 0.006) is obtained with the Tanh activation function without dropout on features extracted by TF-IDF along with n-gram. On the Sanders dataset, the best deep learning result (NAE = 0.092) is obtained with the Maxout activation function with dropout on features extracted by TF-IDF along with n-gram. Figure 2 to Figure 7 show the comparisons of the algorithms based on the feature engineering techniques used.

FIGURE 3. Comparison of Machine Learning Algorithms based on TF-IDF.

FIGURE 4. Comparison of Machine Learning Algorithms based on TF-IDF + N-GRAM.

FIGURE 5. Comparison of Deep Learning Algorithms based on N-GRAM.

FIGURE 6. Comparison of Deep Learning Algorithms based on TF-IDF.

FIGURE 7. Comparison of Deep Learning Algorithms based on TF-IDF + N-GRAM.

TABLE 11. Best Results produced by Algorithms on basis of Features Engineering Technique for STS-Gold Data.

Table 11, Table 12, and Table 13 provide the details of the performance of the algorithms based on the feature engineering techniques used.
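The comparison can be checked against the NAE values quoted in the running text of this section. The sketch below collects only those quoted values (not the full contents of the tables) and selects the lowest NAE per dataset. The dictionary layout and label strings are illustrative choices.

```python
# NAE values quoted in the text of Section V (lower is better).
reported_nae = {
    "STS-Gold": {
        "SVM, n-gram": 0.269,
        "DT, TF-IDF": 0.009,
        "AdaBoost, TF-IDF + n-gram": 0.009,
        "Rectifier without dropout, TF-IDF + n-gram": 0.008,
    },
    "Sentiment140": {
        "RF, n-gram": 0.620,
        "AdaBoost, TF-IDF": 0.014,
        "AdaBoost, TF-IDF + n-gram": 0.013,
        "Tanh without dropout, TF-IDF + n-gram": 0.006,
    },
    "Sanders": {
        "NB, n-gram": 0.250,
        "SVM, TF-IDF": 0.139,
        "SVM, TF-IDF + n-gram": 0.063,
        "Maxout with dropout, TF-IDF + n-gram": 0.092,
    },
}

# Pick the configuration with the smallest quoted NAE for each dataset.
best = {
    dataset: min(results, key=results.get)
    for dataset, results in reported_nae.items()
}

for dataset, config in best.items():
    print(f"{dataset}: {config} (NAE = {reported_nae[dataset][config]})")
```

Among the quoted values, a deep learning configuration has the lowest NAE on STS-Gold and Sentiment140, while SVM with TF-IDF along with n-gram has the lowest NAE on Sanders, and all three winners use the combined feature set.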

VI. CONCLUSION

In this paper we focused on diverse feature sets and classification algorithms for sentiment quantification. We used three Twitter datasets: STS-Gold, Sentiment140, and Sanders. For feature extraction, TF-IDF, n-gram, and the combination of TF-IDF and n-gram are used. The results show that the performance of classifiers is highly dependent on the feature sets and that the majority of the classifiers work well with TF-IDF combined with n-gram. It can be said that the best feature set is obtained by combining TF-IDF and n-gram. Deep learning algorithms performed better than the conventional machine learning algorithms and the ensemble-based algorithms. The best results were obtained by the rectifier activation function and the tanh activation function on the STS-Gold and Sentiment140 datasets, respectively. The results reveal that deep learning, which is widely used in different fields such as image processing and sentiment classification, can also be used for sentiment quantification to obtain better results.

TABLE 12. Best Results produced by Algorithms on basis of Features Engineering Technique for Sentiment140 Data.

TABLE 13. Best Results produced by Algorithms on basis of Features Engineering Technique for Sanders Data.

TABLE 14. Better Results produced by State of the art Deep-Learning Algorithms Based on the Features Engineering Technique for Different Datasets.

REFERENCES [1] B. Liu, ‘‘Sentiment analysis and opinion mining,’’ Synth. Lectures Hum.

Lang. Technol., vol. 5, no. 1, pp. 1–167, 2012. [2] H. U. Khan and A. Daud, ‘‘Using machine learning techniques for subjec-

tivity analysis based on lexical and nonlexical features,’’ Int. Arab J. Inf. Technol., vol. 14, no. 4, pp. 481–487, 2017.

[3] M. Marchetti-Bowick and N. Chambers, ‘‘Learning for microblogs with distant supervision: Political forecasting with twitter,’’ in Proc. 13th Conf. Eur. Chapter Assoc. Comput. Linguistics. Stroudsburg, PA, USA: Associ- ation for Computational Linguistics, 2012, pp. 603–612.

[4] G. Amato, L. Candela, D. Castelli, A. Esuli, F. Falchi, C. Gennaro, F. Giannotti, A. Monreale, M. Nanni, P. Pagano, L. Pappalardo, D. Pedreschi, F. Pratesi, F. Rabitti, S. Rinzivillo, G. Rossetti, S. Ruggieri, F. Sebastiani, and M. Tesconi, ‘‘How data mining and machine learning evolved from relational data base to data science,’’ in A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years. Springer, 2018, pp. 287–306.

[5] J.-W. Bi, Y. Liu, and Z.-P. Fan, ‘‘Crowd intelligence: Conducting asym- metric impact-performance analysis based on online reviews,’’ IEEE Intel- ligent Systems, vol. 35, no. 2, pp. 92–98, Mar./Apr. 2020.

[6] J.-W. Bi, Y. Liu, Z.-P. Fan, and J. Zhang, ‘‘Wisdom of crowds: Conducting importance-performance analysis (IPA) through online reviews,’’ Tourism Manage., vol. 70, pp. 460–478, Feb. 2019.

[7] J.-W. Bi, Y. Liu, Z.-P. Fan, and E. Cambria, ‘‘Modelling customer satisfac- tion from online reviews using ensemble neural network and effect-based Kano model,’’ Int. J. Prod. Res., vol. 57, no. 22, pp. 7068–7088, Nov. 2019.

[8] D. Zhang, C. Wu, and J. Liu, ‘‘Ranking products with online reviews: A novel method based on hesitant fuzzy set and sentiment word frame- work,’’ J. Oper. Res. Soc., vol. 71, no. 3, pp. 528–542, Mar. 2020.

[9] M. A. Qureshi, C. O’Riordan, and G. Pasi, ‘‘Clustering with error- estimation for monitoring reputation of companies on Twitter,’’ in Proc. Asia Inf. Retr. Symp. Berlin, Germany: Springer, 2013, pp. 170–180.

[10] R. Alaiz-Rodríguez, A. Guerrero-Curieses, and J. Cid-Sueiro, ‘‘Class and subclass probability re-estimation to adapt a classifier in the pres- ence of concept drift,’’ Neurocomputing, vol. 74, no. 16, pp. 2614–2623, Sep. 2011.

[11] W. Gao and F. Sebastiani, ‘‘From classification to quantification in tweet sentiment analysis,’’ Social Netw. Anal. Mining, vol. 6, no. 1, p. 19, Dec. 2016.

[12] A. Esuli and F. Sebastiani, ‘‘Optimizing text quantifiers for multivariate loss functions,’’ ACM Trans. Knowl. Discovery from Data, vol. 9, no. 4, pp. 1–27, Jun. 2015.

[13] V. González-Castro, R. Alaiz-Rodríguez, and E. Alegre, ‘‘Class distri- bution estimation based on the hellinger distance,’’ Inf. Sci., vol. 218, pp. 146–164, Jan. 2013.

[14] O. Beijbom, J. Hoffman, E. Yao, T. Darrell, A. Rodriguez-Ramirez, M. Gonzalez-Rivero, and O. Hoegh- Guldberg, ‘‘Quantification in-the- wild: Data-sets and baselines,’’ 2015, arXiv:1510.04811. [Online]. Avail- able: http://arxiv.org/abs/1510.04811

[15] G. Forman, ‘‘Quantifying counts and costs via classification,’’ DataMining Knowl. Discovery, vol. 17, no. 2, pp. 164–206, Oct. 2008.

[16] Y. S. Chan and H. T. Ng, ‘‘Estimating class priors in domain adaptation for word sense disambiguation,’’ in Proc. 21st Int. Conf. Comput. Linguistics 44th Annu. Meeting ACL (ACL). Association for Computational Linguis- tics, 2006, pp. 89–96.

[17] A. R. Daughton and M. J. Paul, ‘‘Constructing accurate confidence inter- vals when aggregating social media data for public health monitoring,’’ in Proc. Int. Workshop Health Intell. Honolulu, HI, USA: Springer, 2019, pp. 9–17.

[18] M. Saerens, P. Latinne, and C. Decaestecker, ‘‘Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure,’’ Neural Comput., vol. 14, no. 1, pp. 21–41, Jan. 2002.

[19] L. Milli, A. Monreale, G. Rossetti, D. Pedreschi, F. Giannotti, and F. Sebastiani, ‘‘Quantification in social networks,’’ in Proc. IEEEInt.Conf. Data Sci. Adv. Analytics (DSAA), Oct. 2015, pp. 1–10.

[20] J. C. Xue and G. M. Weiss, ‘‘Quantification and semi-supervised classifi- cation methods for handling changes in class distribution,’’ in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2009, pp. 897–906.

[21] A. Bella, C. Ferri, J. Hernandez-Orallo, and M. J. Ramirez-Quintana, ‘‘Quantification via probability estimators,’’ in Proc. IEEE Int. Conf. Data Mining, Dec. 2010, pp. 737–742.

[22] J. Barranquero, P. González, J. Díez, and J. J. del Coz, ‘‘On the study of nearest neighbor algorithms for prevalence estimation in binary problems,’’ Pattern Recognit., vol. 46, no. 2, pp. 472–482, Feb. 2013.

[23] J. Barranquero, J. Díez, and J. José del Coz, ‘‘Quantification-oriented learning based on reliable classifiers,’’ Pattern Recognit., vol. 48, no. 2, pp. 591–604, Feb. 2015.

[24] H. Narasimhan, S. Li, P. Kar, S. Chawla, and F. Sebastiani, ‘‘Stochastic optimization techniques for quantification performance measures,’’ Stat, vol. 1050, p. 13, May 2016.

VOLUME 8, 2020 142829

K. Ayyub et al.: Exploring Diverse Features for Sentiment Quantification Using Machine Learning Algorithms

[25] G. Da San Martino, W. Gao, and F. Sebastiani, ‘‘QCRI at SemEval-2016 task 4: Probabilistic methods for binary and ordinal quantification,’’ in Proc. 10th Int. Workshop Semantic Eval. (SemEval-), 2016, pp. 58–63.

[26] A. Esuli, ‘‘ISTI-CNR at SemEval-2016 task 4: Quantification on an ordinal scale,’’ in Proc. 10th Int. Workshop Semantic Eval. (SemEval-), 2016, pp. 92–95.

[27] S. Rosenthal, N. Farra, and P. Nakov, ‘‘SemEval-2017 task 4: Sentiment analysis in Twitter,’’ in Proc. 11th Int. Workshop Semantic Eval. (SemEval- ), 2017, pp. 502–518.

[28] N. Karpov, ‘‘NRU-HSE at SemEval-2017 task 4: Tweet quantification using deep learning architecture,’’ in Proc. 11th Int. Workshop Semantic Eval. (SemEval-), 2017, pp. 683–688.

[29] G. Balikas, ‘‘TwiSe at SemEval-2017 task 4: Five-point Twitter sentiment classification and quantification,’’ in Proc. 11th Int. Workshop Semantic Eval. (SemEval-), 2017, pp. 755–759.

[30] R. K. Gupta and Y. Yang, ‘‘CrystalNest at SemEval-2017 task 4: Using sarcasm detection for enhancing sentiment classification and quantifi- cation,’’ in Proc. 11th Int. Workshop Semantic Eval. (SemEval-), 2017, pp. 626–633.

[31] M. Bouazizi and T. Ohtsuki, ‘‘Multi-class sentiment analysis in Twitter: What if classification is not the answer,’’ IEEE Access, vol. 6, pp. 64486–64502, 2018.

[32] M. Abdou, A. Kulmizev, and J. Ginés i Ametllé, ‘‘AffecThor at SemEval- 2018 task 1: A cross-linguistic approach to sentiment intensity quantifi- cation in tweets,’’ in Proc. 12th Int. Workshop Semantic Eval., 2018, pp. 210–217.

[33] A. Esuli, A. Moreo Fernández, and F. Sebastiani, ‘‘A recurrent neural network for sentiment quantification,’’ in Proc. 27th ACM Int. Conf. Inf. Knowl. Manage., Oct. 2018, pp. 1775–1778.

[34] G. Forman, ‘‘Counting positives accurately despite inaccurate classifica- tion,’’ in Proc. Eur. Conf. Mach. Learn. Torino, Italy: Springer, 2005, pp. 564–575.

[35] P. Pérez-Gállego, J. R. Quevedo, and J. J. del Coz, ‘‘Using ensembles for problems with characterizable changes in data distribution: A case study on quantification,’’ Inf. Fusion, vol. 34, pp. 87–100, Mar. 2017.

[36] D. J. Hopkins and G. King, ‘‘A method of automated nonparametric content analysis for social science,’’ Amer. J. Political Sci., vol. 54, no. 1, pp. 229–247, Jan. 2010.

[37] P. Pérez-Gállego, A. Castano, J. R. Quevedo, and J. J. del Coz, ‘‘Dynamic ensemble selection for quantification tasks,’’ Inf. Fusion, vol. 45, pp. 1–15, Jan. 2019.

[38] E. M. Alshari, A. Azman, S. Doraisamy, N. Mustapha, and M. Alkeshr, ‘‘Effective method for sentiment lexical dictionary enrichment based on Word2 Vec for sentiment analysis,’’ in Proc. 4th Int. Conf. Inf. Retr. Knowl. Manage. (CAMP), Mar. 2018, pp. 1–5.

[39] E. M. Alshari, A. Azman, S. Doraisamy, N. Mustapha, and M. Alkeshr, ‘‘Improvement of sentiment analysis based on clustering of Word2 Vec fea- tures,’’ in Proc. 28th Int. Workshop Database Expert Syst. Appl. (DEXA), Aug. 2017, pp. 123–126.

[40] M. Al-Amin, M. S. Islam, and S. Das Uzzal, ‘‘Sentiment analysis of bengali comments with Word2 Vec and sentiment information of words,’’ in Proc. Int. Conf. Electr., Comput. Commun. Eng. (ECCE), Feb. 2017, pp. 186–190.

[41] P. Lauren, G. Qu, J. Yang, P. Watta, G.-B. Huang, and A. Lendasse, ‘‘Generating word embeddings from an extreme learning machine for sentiment analysis and sequence labeling tasks,’’ Cognit. Comput., vol. 10, no. 4, pp. 625–638, Aug. 2018.

[42] Y. Sharma, G. Agrawal, P. Jain, and T. Kumar, ‘‘Vector representation of words for sentiment analysis using GloVe,’’ in Proc. Int. Conf. Intell. Commun. Comput. Techn. (ICCT), Dec. 2017, pp. 279–284.

[43] A. Go, R. Bhayani, and L. Huang, ‘‘Twitter sentiment classification using distant supervision,’’ CS224N Project Rep., Stanford, vol. 1, no. 12, p. 2009, 2009.

[44] H. Saif, M. Fernandez, Y. He, and H. Alani, ‘‘Evaluation datasets for Twitter sentiment analysis: A survey and a new dataset, the STS-gold,’’ in Proc. Workshop, Emotion Sentiment Social Expressive Media, Approaches Perspective AI (ESSEM), Turin, Italy, 2013.

[45] A. Deshwal and S. K. Sharma, ‘‘Twitter sentiment analysis using various classification algorithms,’’ in Proc. 5th Int. Conf. Rel., Infocom Technol. Optim. (Trends Future Directions) (ICRITO), Sep. 2016, pp. 251–257.

[46] A. Walha, F. Ghozzi, and F. Gargouri, ‘‘A lexicon approach to multidi- mensional analysis of tweets opinion,’’ in Proc. IEEE/ACS 13th Int. Conf. Comput. Syst. Appl. (AICCSA), Nov. 2016, pp. 1–8.

[47] Y. LeCun, Y. Bengio, and G. Hinton, ‘‘Deep learning,’’ Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[48] L. Toth, ‘‘Phone recognition with deep sparse rectifier neural networks,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., May 2013, pp. 6985–6989.

[49] L. Xu, C.-S. Choy, and Y.-W. Li, ‘‘Deep sparse rectifier neural networks for speech denoising,’’ in Proc. IEEE Int. Workshop Acoustic Signal Enhance- ment (IWAENC), Sep. 2016, pp. 1–5.

[50] Y. Wang, A. Sun, J. Han, Y. Liu, and X. Zhu, ‘‘Sentiment analysis by capsules,’’ in Proc. World Wide Web Conf. World Wide Web (WWW), 2018, pp. 1165–1174.

[51] A. Agarwal, A. Yadav, and D. K. Vishwakarma, ‘‘Multimodal sentiment analysis via RNN variants,’’ in Proc. IEEE Int. Conf. Big Data, Cloud Comput., Data Sci. Eng. (BCD), May 2019, pp. 19–23.

[52] M. A. Jerbi, H. Achour, and E. Souissi, ‘‘Sentiment analysis of code- switched tunisian dialect: Exploring RNN-based techniques,’’ in Proc. Int. Conf. Arabic Lang. Process. Loria, France: Springer, 2019, pp. 122–131.

[53] J. Wang, L.-C. Yu, K. R. Lai, and X. Zhang, ‘‘Dimensional sentiment analysis using a regional CNN-LSTM model,’’ inProc.54thAnnu.Meeting Assoc. Comput. Linguistics (Short Papers), vol. 2, 2016, pp. 225–230.

[54] R.-C. Chen, ‘‘User rating classification via deep belief network learning and sentiment analysis,’’ IEEE Trans. Comput. Social Syst., vol. 6, no. 3, pp. 535–546, May 2019.

[55] M. Wang, Z.-H. Ning, C. Xiao, and T. Li, ‘‘Sentiment classification based on information geometry and deep belief networks,’’ IEEE Access, vol. 6, pp. 35206–35213, 2018.

[56] M. Jiang, Y. Liang, X. Feng, X. Fan, Z. Pei, Y. Xue, and R. Guan, ‘‘Text classification based on deep belief network and softmax regression,’’ Neural Comput. Appl., vol. 29, no. 1, pp. 61–70, Jan. 2018.

KASHIF AYYUB received the master's degree in computer science from Bahauddin Zakariya University, Multan, Pakistan, in 2002. He is currently serving as an Assistant Professor with the Department of Computer Science, COMSATS University Islamabad–Wah, Wah Cantonment, Pakistan. His research interests include algorithms, machine learning, and sentiment analysis.

SAQIB IQBAL received the M.Sc. degree in software engineering from the Queen Mary University of London, and the Ph.D. degree in software engineering from the University of Huddersfield, U.K. He is currently an Assistant Professor and a Program Director of the Software Engineering Program, Al-Ain University, UAE. His research interests include software analysis and design, aspect oriented software development, requirements engineering, model-based software development, data mining, and machine learning.


EHSAN ULLAH MUNIR (Senior Member, IEEE) received the master's degree in computer science from the Barani Institute of Information Technology, Pakistan, in 2001, and the Ph.D. degree in computer science and theory from the Harbin Institute of Technology, Harbin, China, in 2008. He is currently serving as an Associate Professor with the Department of Computer Science, COMSATS University Islamabad–Wah, Wah Cantonment, Pakistan. His research interests include heterogeneous parallel and distributed computing systems (cluster grid and cloud and peer-to-peer systems), computer and wireless networks, information systems, and information retrieval.

MUHAMMAD WASIF NISAR received the master's degree in computer science from the University of Peshawar, Peshawar, Pakistan, in 2000, and the Ph.D. degree in computer science from the Laboratory for Internet and Software Technologies, Institute of Software, Computer Applied Technology, Graduate University of Chinese Academy of Sciences (GUCAS), Beijing, China, in 2009. He is currently serving as an Associate Professor and the Head of the Department of Computer Science, COMSATS University Islamabad–Wah, Wah Cantonment, Pakistan. His research interests include software estimation, software process improvement, distributed systems, data mining, information security, and CMMI-based project management and algorithms.

MOMNA ABBASI is currently pursuing the M.S. degree in computer science with COMSATS University Islamabad–Wah, Wah Cantonment, Pakistan. Her research interests include data mining, machine learning, information retrieval, and scientometrics.
