
Received September 3, 2019, accepted September 12, 2019, date of publication September 18, 2019, date of current version October 2, 2019.

Digital Object Identifier 10.1109/ACCESS.2019.2942112

Class Association and Attribute Relevancy Based Imputation Algorithm to Reduce Twitter Data for Optimal Sentiment Analysis

MARYUM BIBI¹, MALIK SAJJAD AHMED NADEEM¹, IMTIAZ HUSSAIN KHAN², SEONG-O SHIM³, ISHTIAQ RASOOL KHAN³, UZMA NAQVI¹, AND WAJID AZIZ¹,³

¹Department of Computer Sciences and Information Technology, The University of Azad Jammu and Kashmir, Muzaffarabad 13100, Pakistan
²Department of Computer Science, King Abdulaziz University, Jeddah 21959, Saudi Arabia
³College of Computer Science and Engineering, University of Jeddah 21959, Saudi Arabia

Corresponding author: Maryum Bibi (mariyam.hamdani@gmail.com)

ABSTRACT Twitter sentiment analysis is a challenging task that involves various preprocessing steps, including dimensionality reduction. Dimensionality reduction helps ensure low computational complexity and improved performance during the classification process. In Twitter data, each tweet has feature values which may or may not reflect a person’s response. Therefore, a large number of sparse data points are generated when tweets are represented as a feature matrix, eventually increasing computational overheads and error rates in Twitter sentiment analysis. This study proposes a novel preprocessing technique called class association and attribute relevancy based imputation algorithm (CAARIA) to reduce the Twitter data size. CAARIA achieves the dimensionality reduction goal by imputing those tweets that belong to the same class and also share useful information. The performance of two classifiers (Naïve Bayes and support vector machines) is evaluated on three Twitter datasets in terms of classification accuracy, measured as area under curve, and time efficiency. CAARIA is also compared against two widely used feature selection (dimensionality reduction) techniques, information gain (IG) and Pearson’s correlation (PC). The findings reveal that CAARIA outperforms IG and PC in terms of classification accuracy and time efficiency. These results suggest that CAARIA is a robust data preprocessing technique for the classification task.

INDEX TERMS Classification, class association, dimensionality reduction, imputation, machine learning, preprocessing, Twitter sentiment analysis.

I. INTRODUCTION
Twitter sentiment analysis is a challenging task for various reasons, including the limit of 280 characters per tweet and the use of slang words and abbreviations. Therefore, it is crucial to use intelligent techniques for automatic Twitter sentiment analysis. Machine learning is among the available techniques; it aims to learn from large volumes of historical data to discover patterns such that human intervention is minimized [1], [2]. These techniques are popular for addressing problems in different areas, including biology [3], banking [4], marketing [5], and social media [6], [7]. Machine learning techniques are broadly divided into supervised and unsupervised learning. Generally, these techniques cannot process raw text directly and therefore require preprocessing to transform the input text into a feature representation. A feature represents a characteristic of the data. One of the most widely used feature representation techniques is the bag-of-words (BoWs) model. However, representing textual data as BoWs produces sparse matrices [8]. To address this issue, dimensionality reduction has been used, which may be feature-wise (i.e., feature selection) [9] or sample-wise [10]. Feature selection refers to a reduced representation of features obtained by using a subset of relevant features from the complete feature space and eliminating irrelevant ones based on certain criteria [8]. It provides a way to minimize feature sparseness and computational complexity, and hence improves the performance of machine learning techniques [11]. Different feature selection methods, including information gain (IG) [12], T-test [13], relief [14]

The associate editor coordinating the review of this manuscript and approving it for publication was Jafar A. Alzubi.

VOLUME 7, 2019 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ 136535


M. Bibi et al.: CAARIA to Reduce Twitter Data for Optimal Sentiment Analysis

and Pearson’s correlation (PC) [15], have been proposed and evaluated in different domains. One limitation of feature selection for BoWs is that removing features may result in the loss of important words. On the other hand, sample-wise dimensionality reduction is achieved by removing outliers, which are identified by different outlier detection techniques [16]. However, in the BoWs representation of textual documents, the sparseness of the samples itself makes them appear as outliers. Various studies, e.g., in text categorization, show that classification accuracy can be improved by employing dimensionality reduction methods [9], [17], [18]. However, straightforward dimensionality reduction may be counterproductive for Twitter sentiment analysis because tweets already have a limited size in terms of attributes. One solution is data imputation [19], which we pursue in this study. Although the literature suggests several methods of imputation, including using the attribute mean of all samples belonging to the same class [1], these methods are applicable to numeric data only. For Twitter sentiment analysis, such methods may not be feasible due to the binary nature of the tweet representation.

The main aim of this study is the design and evaluation of a novel preprocessing technique called class association and attribute relevancy based imputation algorithm, henceforth CAARIA. This technique intends to reduce sample sparseness by combining the feature values of those samples that share the most common information. The idea is to enhance the usefulness of features by minimizing sparseness via imputation rather than eliminating them. CAARIA combines the features of tweets that fulfill two criteria: a) the tweets belong to the same class, and b) they share at least 50 percent of the features with response value 1 (the 50 percent threshold was set experimentally; details follow). This feature-combining technique is expected to reduce the number of tweets, thereby resulting in more efficient Twitter analysis in less computational time. To evaluate CAARIA, an experimental setup is built in which the performance of two commonly used classifiers is compared (with CAARIA embedded as a preprocessing step for each classifier) along two dimensions: classification accuracy and time efficiency. We used two widely explored classifiers, Naïve Bayes and support vector machine (SVM), to test the performance of our proposed algorithm. The literature suggests that these classifiers perform well for sentiment analysis tasks [20]–[25]. Naïve Bayes is a supervised learning technique that aims to predict the class label of unseen data. It is a probability-based classifier: it calculates the posterior probability of each class for an unseen instance and assigns the instance to the class with the maximum posterior probability. SVM [26] is also a supervised learning technique, which has been widely used for text classification and sentiment analysis tasks [27], [28]. In this study, SVM is implemented using a polynomial kernel with the default hyperparameters (c = 1.0). Three English-language Twitter datasets are evaluated using 10-fold cross validation. The performance of CAARIA is compared with two existing feature selection measures, IG and PC, and with a simple baseline (unigram) that keeps all features intact. The experimental results show that CAARIA outperforms both IG and PC, which indicates that it is a robust preprocessing step for classification.

The paper is organized as follows. The related work is discussed in Section II. Our proposed algorithm, CAARIA, is outlined in Section III. A detailed empirical study is presented in Section IV. Section V discusses the experimental results. Finally, Section VI concludes the paper.

II. RELATED WORK
This section presents related work on supervised learning techniques for sentiment analysis and on dimensionality reduction of features. Supervised learning techniques have been widely explored for Twitter sentiment analysis [23], [25], [29]–[33]. In [23], the authors investigated Naïve Bayes, SVM and maximum entropy, where SVM outperformed the other classifiers. In another study, researchers proposed a cooperative framework of Naïve Bayes, random forest, SVM and logistic regression to predict the polarity of tweets [29]. Better performance is reported for the proposed framework. In a recent study [34], a deep learning based approach using a convolutional neural network for Twitter sentiment analysis in the Spanish language is proposed. In [35], word embeddings using latent contextual semantic relationships for Twitter sentiment analysis are presented. Alzubi [36] proposed a coalition-based ensemble design using ensemble classifiers, which yielded encouraging results. In [37], diversity-based boosting (DivBoosting) is proposed to improve the performance of traditional boosting algorithms. Experimental results revealed that DivBoosting is a promising method for ensemble pruning compared to traditional boosting algorithms. Alzubi et al. [38] proposed a novel approach called the consensus-based combining method (CCM) for combining an ensemble of classifiers. The findings revealed that the classification accuracy of CCM is significantly better than that of the product and average methods, whereas it is better than or comparable to that of majority voting. In [30], Naïve Bayes, SVM, k-nearest neighbor and C4.5, along with their ensembles, were used for sentiment classification. The results revealed the effectiveness of ensemble-based classifiers for improving the accuracy of Twitter sentiment analysis. In [31], multinomial Naïve Bayes, conditional random fields and SVM are examined for Twitter data analysis. Experimental results indicated that the Naïve Bayes classifier outperforms the other two classification approaches. In [39], a combination of lexicon-based and learning-based methods is adopted for Twitter sentiment analysis. The empirical results demonstrated the effectiveness of the proposed method. Lalji and Deshmukh [40] proposed a hybrid approach based on SVM and a lexicon-based method for Twitter sentiment analysis. Experimental results showed better performance for the proposed framework. In [35], a deep convolutional neural network algorithm is employed for Twitter sentiment analysis. Experimental results on five Twitter datasets show highly accurate performance. In [41], a novel neural network design for formalizing sentiment information into market views is proposed. In [42], the sentiment polarity of a whole document is discovered by finding the polarity of its constituent sentences. Online reviews in different languages are collected for experimentation using machine learning techniques. It is shown that the sentence-level approach to sentiment classification provides promising results. Poria et al. [43] presented a method for detecting sentiment polarity in short video clips using visual (facial expressions), audio (pitch) and textual (context of the uttered sentence) cues. They performed experiments using SVM (multi-kernel version) and reported encouraging results. In [44], the authors proposed a hybrid approach called the cluster-then-predict model, which combines k-means clustering with decision trees and SVM. The results predicted using the proposed mechanism showed better accuracy compared to traditional approaches. Recently, sentiment analysis has also gained popularity in community detection frameworks. In [45], a recommendation model, resonant sentimental interest community, is proposed to compute the resonance relationship among users for community detection. The proposed system integrates sentiment and semantic relationships among users.

Numerous studies suggest that irrelevant features in data are computationally very expensive [2], [46]. Feature selection methods have been widely examined for selecting a subset of features from high-dimensional data to reduce the feature space. In [47], the authors selected optimal features from the whole feature space using a mutual information measure. They performed experiments on Twitter datasets using supervised learning techniques and reported impressive results. In [12], the authors investigated feature selection methods, including IG and minimum redundancy maximum relevancy (mRMR), for sentiment analysis of movie review data. In [48], a feature selection methodology is proposed for sentiment analysis of movie reviews, which is based on the long- and short-range dependencies given by the Hidden Markov Model with Latent Dirichlet Allocation, commonly known as the HMM-LDA model. Better results are reported. In [18], an entropy weighted genetic algorithm based on the IG heuristic has been developed. Evaluation is performed on a movie review dataset with impressive results. In [49], an ensemble approach based on IG and a genetic algorithm is analyzed for optimized feature reduction in sentiment classification. The proposed approach is evaluated using movie review and multidomain datasets with promising results. In [50], the authors reported high classification accuracy when PC is used with other feature selection methods in an ensemble feature-selection framework. In [51], it is shown that PC is useful for the feature construction process. In [52], sentiment analysis of online reviews is performed to rank product aspects by an econometric model, which used IG to discover the information carried by each aspect. Experiments showed promising results on datasets collected from Amazon.

In our proposed work, instead of discarding features, as is mostly done in the aforementioned literature, the intention is to impute new feature values from the existing feature space such that dimensionality reduction can be performed along the number of instances (tweets). A novel algorithm, CAARIA, is proposed, which aims to combine (merge) features based on commonly shared information. The proposed algorithm is expected to be an important contribution in advancing research on improving classifiers’ accuracy for Twitter sentiment analysis.

III. PROPOSED ALGORITHM
In this section, the proposed algorithm CAARIA is presented, which is the main contribution of the current study. The algorithm takes an input matrix of original tweets represented as feature vectors and produces an imputed output matrix. The sample size (number of tweets) in the output matrix is expected to be greatly reduced compared to the input matrix. The pseudocode of CAARIA given in Algorithm 1 makes this data reduction more precise. In line 4, the algorithm takes each tweet ti ∈ T in turn. In lines 5-7, placeholder variables are initialized, which are used to inform the merging of tweets (later, lines 23-25). The Count variable keeps track of the number of features whose value is 1 for both classmates (lines 12-13). CountMaxTIndex is used to determine the best classmate for ti having at least 50 percent features with value 1 (lines 17-19). The threshold value of 50% was set experimentally, as it gives better results in comparison with other threshold values. For example, at a 75% threshold many related tweets were excluded, whereas at a 25% threshold many unrelated tweets were grouped together. The Flag variable is used to test whether the tweets to be merged satisfy two criteria: the tweets belong to the same class (line 9) and they share at least 50 percent features with value 1 (lines 16-17). If the two conditions are not satisfied, Flag is set to False (line 19). In this case, the tweets are not merged and the tweet ti itself is added to the output imputed matrix (line 33). On the other hand, if Flag is True (i.e., both conditions are satisfied), ti is merged with its best classmate using a Boolean OR (‖) operator, resulting in an imputed tweet tc (lines 23-25). The imputed tweet tc is inserted back into the original matrix T (line 26) so that it can be further merged with other best classmate(s), if any, in subsequent steps (the example below clarifies this). The merged tweets are removed from the original tweets to avoid further merging with any other tweets (line 27). In case no classmate is found that satisfies both criteria for merging, the current tweet is directly added to the imputed matrix IT and removed from the original matrix (lines 32-35). Finally, the last remaining tweet from the original matrix is added to the imputed matrix to complete the process (line 38).

Algorithm 1 CAARIA
1: Input: m×n matrix of original tweets T
2: Output: m′×n matrix of imputed dataset IT
3: repeat ▷ Repeat steps 4 to 36 while more than one tweet remains in T
4:   for each tweet ti ∈ T do ▷ i from 1 to m
5:     Flag ← False
6:     Count ← 0
7:     CountMaxTIndex ← 0
8:     for each tweet tj ∈ T do ▷ j from i+1 to m
9:       if cl(ti) = cl(tj) then
10:        Flag ← True
11:        for each feature fk ∈ F do
12:          if ti(fk) = tj(fk) and ti(fk) = 1 then
13:            Count ← Count + 1
14:          end if
15:        end for
16:        if Count >= n/2 then
17:          CountMaxTIndex ← j
18:        else
19:          Flag ← False
20:        end if
21:      end if
22:      if Flag = True then
23:        for each fk ∈ F do
24:          tc(fk) ← ti(fk) ‖ tCountMaxTIndex(fk) ▷ Merging tweets
25:        end for
26:        Insert tc to the front of T
27:        Remove ti and tCountMaxTIndex from T
28:        j ← m ▷ Terminating the inner loop
29:        i ← 1 ▷ Resetting i
30:      end if
31:    end for
32:    if Flag = False then
33:      Add ti to IT ▷ No best classmate for ti was found
34:      Remove ti from T
35:    end if
36:  end for
37: until |T| ≤ 1
38: Remove the last tweet from T and add to IT

TABLE 1. Unigram Boolean representation of tweets (T).

TABLE 2. Imputed data (IT) generated by CAARIA.

Example: We demonstrate the working of CAARIA using an example. Consider Table 1, where six tweets {t1, t4, t7, t8, t9, t10} belong to class c1 and six tweets {t2, t3, t5, t6, t11, t12} belong to class c2. CAARIA starts with the first tweet t1 and finds its classmates, which are t4, t7, t8, t9 and t10 because all of these belong to the same class c1. In the next step, CAARIA determines that the number of shared features with value 1 between t1 and each of its classmates (t4, t7, t8, t9 and t10) is 1, 1, 2, 0, and 0, respectively. From this distribution, t8 (with value 2) is selected as the best classmate of t1 and therefore these two tweets are merged: ∀fk(t1) ‖ ∀fk(t8) = {1,0,0,1} ‖ {1,1,1,1} = {1,1,1,1}. We call this imputed tweet tc = {1,1,1,1}. At this stage, both t1 and t8 are replaced with tc in the original tweet matrix T. In the next iteration, the process starts with tc, whose classmates are t4, t7, t9 and t10; among these tweets, t7 is selected as the best classmate of tc (with three shared features having value 1). Therefore, in this step, tc is merged with t7. At this stage, three of the original tweets (t1, t7 and t8) have been merged together, yielding a new imputed tweet tc. In the next iteration, the second criterion (at least 50 percent shared features having value 1) is not satisfied for any of the tweets (tc, t4, t9 and t10) belonging to class c1; therefore, tc is added to IT. Subsequently, the same process continues for all other tweets in T. Eventually, when the process terminates, the total number of tweets is reduced to 7 in the imputed matrix, as shown in Table 2.
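The merging procedure described in this section can be sketched in Python. This is a minimal, simplified reading of the prose description and not the authors' implementation: the function names `shared_ones` and `caaria_sketch`, the list-based data layout, and the toy tweets are all assumptions made for illustration, and the sketch explicitly picks the same-class tweet with the largest shared count, as the text describes for the "best classmate".

```python
from typing import List, Tuple

Tweet = Tuple[List[int], str]  # (Boolean feature vector, class label)


def shared_ones(a: List[int], b: List[int]) -> int:
    """Count the features that have value 1 in both vectors."""
    return sum(1 for x, y in zip(a, b) if x == 1 and y == 1)


def caaria_sketch(tweets: List[Tweet], threshold: float = 0.5) -> List[Tweet]:
    """Simplified CAARIA-style reduction (illustrative assumption, see lead-in)."""
    pending = [(list(feats), label) for feats, label in tweets]  # working copy of T
    imputed: List[Tweet] = []                                    # plays the role of IT
    while len(pending) > 1:
        feats, label = pending[0]
        n = len(feats)
        best_j, best_count = None, -1
        for j in range(1, len(pending)):
            other_feats, other_label = pending[j]
            if other_label != label:
                continue  # criterion (a): same class only
            count = shared_ones(feats, other_feats)
            # criterion (b): at least threshold * n shared features with value 1
            if count >= threshold * n and count > best_count:
                best_j, best_count = j, count
        if best_j is None:
            imputed.append(pending.pop(0))  # no best classmate: move tweet to IT
        else:
            mate_feats, _ = pending.pop(best_j)
            pending.pop(0)
            merged = [x | y for x, y in zip(feats, mate_feats)]  # Boolean OR
            pending.insert(0, (merged, label))  # merged tweet goes to the front of T
    imputed.extend(pending)  # the last remaining tweet is added to IT
    return imputed


# Hypothetical toy data (not Table 1 of the paper)
toy = [
    ([1, 0, 0, 1], "c1"),
    ([0, 1, 0, 0], "c2"),
    ([1, 1, 1, 1], "c1"),
    ([0, 0, 1, 0], "c1"),
    ([0, 1, 1, 0], "c2"),
]
reduced = caaria_sketch(toy)
```

On this toy input, the first and third tweets share two of four features and are merged, while the remaining tweets fall below the 50 percent threshold and are copied over unchanged.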

The theoretical time complexity of CAARIA depends on the number of tweets and the number of features within a tweet. To report the time complexity of CAARIA, we use the random-access machine (RAM) model, which measures the run time of an algorithm by adding up the number of steps needed to execute the algorithm on a set of data. Let n be the number of tweets and k be the number of features; the worst-case time complexity of CAARIA is then O(n² × k).
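One way to see where a bound of this form can come from is the following counting argument. It rests on a simplifying assumption that is not stated in the text: each pass either merges two tweets or moves one tweet to IT, so the number of tweets remaining in T shrinks by one per pass, and a pass over r remaining tweets compares the current tweet with at most r − 1 others across k features.

```latex
\sum_{r=2}^{n} (r-1)\,k \;=\; k\,\frac{n(n-1)}{2} \;=\; O(n^{2} \times k)
```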

IV. EXPERIMENTAL FRAMEWORK
To test the performance of our proposed algorithm, CAARIA, an experimental setup is built using Weka.¹ The experimental framework is depicted in Figure 1, where CAARIA is embedded in the framework as a preprocessing step to impute Twitter data. Briefly, the framework starts with a Twitter dataset as input, and after the necessary preprocessing the tweets are transformed into unigram-based feature vectors. Then, dimensionality reduction is invoked, where the data size is reduced by three techniques: CAARIA, IG and PC. The reduced data from each technique (including the data imputed by CAARIA) are used to train two classifiers, Naïve Bayes and SVM, in a k-fold cross validation manner; the training time is recorded for each classifier on each dataset. Finally, the performance of the classifiers is evaluated in terms of classification error rate. In what follows, each step of the framework is briefly described in turn.

¹https://www.cs.waikato.ac.nz/ml/weka/

FIGURE 1. Sentiment analysis framework for Twitter data with CAARIA incorporated in preprocessing.

A. TWITTER DATASETS
Three Twitter datasets in the English language are considered for evaluating CAARIA. Health Care Reform (HCR) and the Sentiment Strength Twitter Dataset (SS-Tweet) are existing datasets, which have been widely investigated in previous studies [29], [30], [53]–[55]. The third one is a newly collected indigenous dataset. The details of these datasets are as follows:

• HCR is a publicly available Twitter dataset collected and manually labeled (positive, negative and neutral) by its authors in March 2010 [55]. It contains a total of 2156 tweets. In this study, since only positive and negative tweets are considered for the experiments, the neutral tweets are excluded, resulting in 1922 tweets. The statistical details of this dataset are shown in Table 3.

• SS-Tweet is a collection of 2289 tweets, which was prepared for sentiment strength detection [56]. In this dataset, positive and negative polarity is assigned to tweets. The statistics of the SS-Tweet dataset are shown in Table 3.

• As part of our current study, an indigenous dataset is also created that comprises 201 tweets (Table 3). This dataset, henceforth FluTweetsPak, is collected using the keyword flu with the help of the Twitter4j API.² It is a collection of tweets from Pakistani Twitter users. This dataset was manually labeled by medical-domain specialists. To the best of our knowledge, such a dataset has not been collected or analyzed earlier in any study. The data size is kept manageable so that the domain experts can label it within their limited time constraints.

TABLE 3. Twitter datasets’ statistics.

B. PREPROCESSING
We developed an indigenous tool in Microsoft Visual C# to carry out the preprocessing task. The process starts by converting all tweets to lowercase letters, followed by tokenization. Subsequently, stopwords are removed using a stoplist downloaded from the WordNet.³ All punctuation (e.g., !, =, ;), URLs, numbers, emoticons (e.g., :-( and :-)) and repeated words are also removed. This is achieved by implementing a simple pattern matching routine.
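The preprocessing steps listed above can be sketched as a small pattern matching routine. The sketch is in Python rather than the C# tool mentioned in the text, and everything in it is an assumption for illustration: the function name `preprocess`, the regular expressions, and the tiny stopword set, which stands in for the full stoplist.

```python
import re

# Illustrative subset only; the study used a downloaded stoplist.
STOPWORDS = {"the", "a", "is", "to", "and", "i", "of"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z]+")


def preprocess(tweet: str) -> list:
    """Lowercase, strip URLs, tokenize, and drop stopwords and repeated words."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    # Keeping alphabetic runs only discards digits, punctuation and symbol-only
    # emoticons such as :-( ; emoticons containing letters (e.g. :-D) would
    # leave a stray one-letter token and need an extra rule.
    tokens = TOKEN_RE.findall(text)
    seen = set()
    result = []
    for tok in tokens:
        if tok in STOPWORDS or tok in seen:
            continue  # skip stopwords and repeated words
        seen.add(tok)
        result.append(tok)
    return result


example = preprocess("I have the FLU :-( flu again!! http://t.co/abc 2019")
```

The order of operations matters here: URLs are removed before tokenization so that fragments such as "http" or "co" do not end up as features.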

C. FEATURE REPRESENTATION
After preprocessing, the tweets are transformed into feature vectors. In this study, the widely reported unigram feature representation technique is used [57]–[63]. This method works as follows. Let T = {t1, t2, t3, ..., tm} be the collection of tweets and C = {c1, c2, ..., cl} be the set of classes assigned to the tweets. From this tweet collection, unique tokens (terms) are generated that represent the features. Let F = (f1, f2, f3, ..., fn) be the n-dimensional feature set. The tweets are then represented as an m × n matrix, where m is the number of tweets and n is the number of features. The unigram representation considers the individual terms as features [64]. A Boolean approach is used to weigh the terms/features in this matrix. That is, if a term/feature exists in a tweet, it is assigned the Boolean value 1, otherwise 0. As an example dataset, Table 1 shows a chunk of the unigram Boolean feature representation for twelve pre-labeled tweets, i.e., T = {t1, t2, t3, ..., t12}, four features F = {f1, f2, f3, f4} and two class labels C = {c1, c2}. The imputed data IT are generated from the existing Twitter data T.
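The Boolean unigram representation can be illustrated with a short Python sketch. The function name `boolean_unigram_matrix`, the alphabetical ordering of the vocabulary, and the three toy tweets are assumptions chosen for the example; the study itself does not specify a feature order.

```python
def boolean_unigram_matrix(tokenized_tweets):
    """Build the vocabulary and a tweets-by-features 0/1 matrix."""
    # Unique tokens across all tweets become the features (sorted for determinism).
    vocab = sorted({tok for toks in tokenized_tweets for tok in toks})
    index = {tok: k for k, tok in enumerate(vocab)}
    matrix = []
    for toks in tokenized_tweets:
        row = [0] * len(vocab)
        for tok in toks:
            row[index[tok]] = 1  # 1 if the term occurs in the tweet, else 0
        matrix.append(row)
    return vocab, matrix


# Hypothetical tokenized tweets
vocab, matrix = boolean_unigram_matrix([["flu", "bad"], ["flu", "shot"], ["good"]])
```

Even in this tiny example most entries are 0, which is the sparseness that the study sets out to reduce.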

D. CLASSIFICATION FOR TWITTER SENTIMENT ANALYSIS
Using supervised learning for sentiment analysis involves two main phases, learning and classification. In the first phase, a classification model (classifier) is built through some learning algorithm. The second phase uses the learned model to classify unseen tuples into predefined categories. For this purpose, the available data are generally divided into training data (for building the model) and test data (for testing the performance of the learned model). Naïve Bayes and SVM classifiers are used to evaluate the performance of our proposed algorithm. We trained SVM using the sequential minimal optimization (SMO) algorithm [65].

²http://twitter4j.org/en/
³http://www.d.umn.edu/~tpederse/Group01/WordNet/wordnet-stoplist.html

The performance of SVM largely depends on the kernel and hyperparameters; we used a polynomial kernel and the default parameters available in Weka, e.g., c = 1.0.
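The two phases of supervised classification, learning a model and then assigning an unseen instance to the class with the maximum posterior probability, can be sketched with a small Naïve Bayes classifier over Boolean features. This is not the Weka implementation used in the study: the Bernoulli formulation, the Laplace smoothing, the function names `train_bernoulli_nb` and `predict`, and the toy data are all assumptions for illustration.

```python
import math


def train_bernoulli_nb(X, y):
    """Learning phase: estimate class priors and per-feature probabilities."""
    model = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Laplace-smoothed estimate of P(f_k = 1 | class c)
        probs = [(sum(r[k] for r in rows) + 1) / (len(rows) + 2)
                 for k in range(n_features)]
        model[c] = (prior, probs)
    return model


def predict(model, x):
    """Classification phase: return the class with the maximum log-posterior."""
    best_class, best_score = None, -math.inf
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for xk, p in zip(x, probs):
            score += math.log(p if xk == 1 else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical training data: rows are Boolean unigram vectors
X_train = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
y_train = ["pos", "pos", "neg", "neg"]
nb_model = train_bernoulli_nb(X_train, y_train)
```

Log-probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow when the number of features is large.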

E. EXISTING FEATURE SELECTION METHODS
Feature selection methods reduce the dimensionality of the feature space by ranking the features on the basis of certain criteria and selecting the useful ones. In this study, two popular feature selection methods, IG and PC, which were investigated earlier for sentiment analysis purposes [1], are used as the existing dimensionality reduction methods for comparative analysis with our own algorithm.

1) INFORMATION GAIN (IG)
IG is an entropy-based filter approach, which ranks the subsets of features with high information gain from the whole feature space [1]. It is assumed that features with high information gain are more informative with respect to the classification target. It has gained popularity in the domain of sentiment analysis [9], [17], [18]. IG is calculated using the relation shown in equation 1.

$$\mathrm{InfoGain}(\mathit{Class}, \mathit{Attribute}) = H(\mathit{Class}) - H(\mathit{Class} \mid \mathit{Attribute}) \tag{1}$$

where H represents the entropy.
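Equation 1 can be evaluated directly for a single Boolean feature, as in the following Python sketch. The function names `entropy` and `info_gain`, the use of base-2 logarithms, and the toy labels are assumptions for illustration and are not taken from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def info_gain(feature_values, labels):
    """InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute)."""
    total = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


labels = ["pos", "pos", "neg", "neg"]
ig_informative = info_gain([1, 1, 0, 0], labels)    # feature separates the classes
ig_uninformative = info_gain([1, 0, 1, 0], labels)  # feature is independent of class
```

Ranking features by this score and keeping the top ones is the filter-style selection that IG performs.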

2) PEARSON’S CORRELATION (PC) The PC measure selects those attributes for which a high correlation trend is found with respect to its corresponding class [15]. The mathematical relation is specified in equation 2 below:

$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B} \tag{2}$$

where r is the correlation coefficient, n represents the number of tuples, $\bar{A}$ and $\bar{B}$ represent the mean values of attributes A and B, and $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B.
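Equation 2 translates almost term by term into code. In the Python sketch below, the function name `pearson_r`, the use of population standard deviations (dividing by n, which is what makes the formula yield values in [−1, 1]), and the choice to return 0 for a constant column are assumptions made for illustration.

```python
import math


def pearson_r(a, b):
    """Pearson's correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations (divide by n)
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if sd_a == 0 or sd_b == 0:
        return 0.0  # constant column: correlation is undefined, treated as 0 here
    numerator = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    return numerator / (n * sd_a * sd_b)


# Hypothetical feature column and class column encoded as 0/1
r_same = pearson_r([1, 1, 0, 0], [1, 1, 0, 0])
r_opposite = pearson_r([1, 1, 0, 0], [0, 0, 1, 1])
```

For feature selection, the class label is encoded numerically and features are ranked by the magnitude of their correlation with it.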

V. RESULTS AND DISCUSSION
Two performance measures, area under curve (AUC) and time efficiency, are used to evaluate the proposed algorithm. For a useful classifier, the value of AUC ranges between 0.5 (no better than chance) and 1.0, where 1.0 indicates perfect classification. We first created a confusion matrix of (mis)classified tweets and then calculated the weighted F-measure and AUC. Time efficiency (in milliseconds) is computed in terms of the training time taken by the learning algorithm to build the model. We used 10-fold cross validation, which has previously been explored for sentiment analysis [66]. In this technique, the Twitter data are divided into 10 mutually exclusive folds. Various studies argue that cross validation is suitable when datasets are small [67], [68]. Table 4 shows the AUC results of each classifier on the existing dimensionality reduction approaches (IG and PC) and CAARIA using 10-fold cross validation.
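The two evaluation ingredients described above, AUC and a split into mutually exclusive folds, can be sketched in Python as follows. This is not how Weka computes them internally: the pairwise-ranking formulation of AUC, the round-robin fold assignment, the function names `auc` and `k_fold_indices`, and the toy scores are assumptions for illustration.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels use 1 for the positive class and 0 for the negative class;
    tied scores count as half a correct pair.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10):
    """Assign sample indices to k mutually exclusive folds (round-robin)."""
    return [list(range(i, n_samples, k)) for i in range(k)]


# Hypothetical classifier scores for four tweets
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
folds = k_fold_indices(25, k=10)
```

In each of the 10 rounds, one fold serves as test data and the remaining nine as training data, so every tweet is used for testing exactly once.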


TABLE 4. Performance of the classifiers in terms of AUC(F-measure) using 10-fold cross validation.

FIGURE 2. Average processing time for 10-fold cross validation.

It is evident from Table 4 that, on average, SVM and Naïve Bayes offer comparable performance. Interestingly, each classifier gave a better AUC score (Naïve Bayes: 0.72 and SVM: 0.70) when CAARIA is used. With CAARIA, SVM achieved the best AUC score (0.79) for the FluTweetsPak dataset. The average performance of CAARIA is almost 20% better than the baseline and 15% better than the existing dimensionality reduction approaches. These results can be interpreted as a very high probability (∼80%) that a classifier trained on CAARIA-imputed data would be able to discriminate between positive and negative tweets, which is encouraging.

Figure 2 depicts the average time taken by each classifier to train the model. Overall, Naïve Bayes is faster to train than SVM. For 10-fold cross validation, CAARIA took the minimum time compared to IG and PC. The best training time of Naïve Bayes was 650 milliseconds when the data were imputed using CAARIA, whereas the best training times achieved by Naïve Bayes for PC and IG were 1710 and 3600 milliseconds, respectively. Interestingly, the baseline method offers the worst performance, taking the maximum time to train SVM.
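Training time in milliseconds can be measured along the following lines. The study recorded its timings in Weka; the Python helper below, including its name `time_training_ms`, its `repeats` parameter, and the use of `sorted` as a stand-in for a training routine, is purely a hypothetical illustration of averaging wall-clock time over repeated runs.

```python
import time


def time_training_ms(train_fn, *args, repeats=10):
    """Average wall-clock time of train_fn(*args) in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        train_fn(*args)
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)


# Stand-in workload; in practice train_fn would fit a classifier on one fold split.
avg_ms = time_training_ms(sorted, list(range(1000)), repeats=3)
```

Averaging over several runs reduces the influence of one-off delays such as caching or scheduling noise.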

During this study some interesting results came to the fore, which are worth discussing. As conjectured, CAARIA outperformed both IG and PC. The gain in classification performance can be interpreted as follows: CAARIA effectively combines related features to uncover hidden information in tweets where there is apparently no response from the user. This is important because the existing feature reduction techniques generally treat such features as sparse data and hence discard them [9], [17], [18]. In this way, useful information is lost, which could be detrimental in Twitter analysis, where the size of tweets is already very small. Therefore, instead of removing apparently unimportant features, our technique imputes them with other related features to increase their utility. Equally important are the results on time efficiency, because it is generally observed that an algorithm A may beat an algorithm B on accuracy while the latter wins on speed. However, one important limitation of the present results is that they do not shed much light on how well CAARIA would scale when evaluated on relatively larger datasets. This could be an important aspect to explore further, perhaps by applying our algorithm in other domains such as text classification [69], spam detection [70], and architecture recovery [71].

An important design choice of CAARIA is the selection of tweets for merging. Our aim is to merge those tweets which share useful information. Thus, we merge those tweets which share a reasonable number of features. In this regard, we experimented with different threshold values and found that merging tweets with 50% shared features gives better results in comparison with other threshold values. For example, at a 75% threshold many related tweets were excluded, whereas at a 25% threshold many unrelated tweets were grouped together. A limitation of CAARIA, though, is its dependency on pre-classified data. This requirement is driven by necessity, because class associations are considered while merging the features of tweets. To address this issue, modifications can be proposed such that CAARIA can be adapted to work in an unsupervised manner.
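The effect of the threshold on merge eligibility can be seen in a short Python sketch. The function name `qualifies` and the two example vectors are assumptions for illustration; the vectors mirror the {1,0,0,1} and {1,1,1,1} pattern used in the worked example of Section III.

```python
def qualifies(a, b, threshold):
    """True if the two Boolean vectors share enough features with value 1."""
    shared = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    return shared >= threshold * len(a)


t_i = [1, 0, 0, 1]
t_j = [1, 1, 1, 1]
# The pair shares 2 of 4 features with value 1.
decisions = {th: qualifies(t_i, t_j, th) for th in (0.25, 0.50, 0.75)}
```

The same pair is accepted at 25% and 50% but rejected at 75%, which matches the trade-off described above: a stricter threshold excludes more candidate pairs, while a looser one admits pairs with little in common.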

VI. CONCLUSION
In this study, we developed and evaluated a novel algorithm, CAARIA, that exploits class association and attribute relevancy for dimensionality reduction of Twitter data as a preprocessing step for classification. We empirically investigated to what extent the proposed algorithm improves the quality of classification for Twitter sentiment analysis with better time efficiency. For this purpose, two well-known classification algorithms, Naïve Bayes and support vector machines (SVM), were evaluated on three sizeable Twitter datasets using 10-fold cross validation. The AUC measure was used to evaluate the quality of the classification results, and CPU elapsed time was computed to assess the time efficiency of each classifier. The performance of CAARIA was compared against two well-studied dimensionality reduction methods, information gain (IG) and Pearson's correlation (PC). The experimental results show that, overall, CAARIA outperformed its competitors in both AUC score and time efficiency. We therefore conclude that combining tweets' features on the basis of their common information, as CAARIA does, appears well suited to producing high-quality classification results with low time consumption.
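The evaluation protocol summarized above (10-fold cross validation, AUC, and CPU time) can be sketched in a few lines. This is an illustrative harness under stated assumptions, not the experimental code used in the study: the stratified fold assignment, the helper names, and the `centroid_scorer` stand-in (which is neither Naïve Bayes nor an SVM) are hypothetical, and the toy data is invented purely to make the sketch runnable.

```python
import random
import time
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Scorer = Callable[[List[Vector], List[int], List[Vector]], List[float]]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes in the test fold")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(y: Sequence[int], k: int, seed: int = 0) -> List[List[int]]:
    """Deal the shuffled indices of each class into k folds in turn."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def evaluate(X: List[Vector], y: List[int], fit_and_score: Scorer,
             k: int = 10, seed: int = 0) -> Tuple[float, float]:
    """Return (mean AUC over k folds, CPU seconds spent on the whole loop)."""
    aucs = []
    start = time.process_time()
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        scores = fit_and_score([X[i] for i in train_idx],
                               [y[i] for i in train_idx],
                               [X[i] for i in test_idx])
        aucs.append(auc([y[i] for i in test_idx], scores))
    return sum(aucs) / len(aucs), time.process_time() - start


def centroid_scorer(X_train: List[Vector], y_train: List[int],
                    X_test: List[Vector]) -> List[float]:
    """Placeholder classifier: project onto the difference of class centroids."""
    dim = len(X_train[0])

    def mean(rows: List[Vector]) -> List[float]:
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    pos = mean([x for x, t in zip(X_train, y_train) if t == 1])
    neg = mean([x for x, t in zip(X_train, y_train) if t == 0])
    w = [p - n for p, n in zip(pos, neg)]
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in X_test]


# Invented, trivially separable toy data: 10 tweets per class, 2 features.
X_toy: List[Vector] = [[1.0, 0.0]] * 10 + [[0.0, 1.0]] * 10
y_toy: List[int] = [1] * 10 + [0] * 10
mean_auc, cpu_seconds = evaluate(X_toy, y_toy, centroid_scorer, k=10)
print(f"mean AUC = {mean_auc:.2f}, CPU time = {cpu_seconds:.4f} s")
```

Any scoring function with the same signature can be passed to `evaluate`, so a reduced feature matrix and the unreduced one can be compared under identical folds. `time.process_time` measures CPU time of the current process, which is one reasonable reading of the CPU elapsed time mentioned above.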

VII. FUTURE WORK
In future work, we intend to use weighting schemes so that the features of tweets are merged on the basis of weights assigned to feature values. Other classification techniques, including deep recurrent neural networks, will be investigated for further evaluation. The proposed algorithm will also be evaluated on larger Twitter datasets and perhaps applied to other domains, including text classification [69], spam detection [70], and architecture recovery [71].

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Mateo, CA, USA: Morgan Kaufmann, 2006.

[2] J. Alzubi, A. Nayyar, and A. Kumar, ‘‘Machine learning from theory to algorithms: An overview,’’ J. Phys., Conf. Ser., vol. 1142, Nov. 2018, Art. no. 012012.

[3] S. Min, B. Lee, and S. Yoon, ‘‘Deep learning in bioinformatics,’’ Brief Bioinform., vol. 18, no. 5, pp. 851–869, 2016.

[4] P. Gogas, T. Papadimitriou, and A. Agrapetidou, ‘‘Forecasting bank failures and stress testing: A machine learning approach,’’ Int. J. Forecasting, vol. 34, no. 3, pp. 440–455, Jul./Sep. 2018.

[5] K. Siau, ‘‘Impact of artificial intelligence, robotics, and automation on higher education,’’ in Proc. 23rd Americas Conf. Inf. Syst., Boston, MA, USA, 2017, pp. 1–47.

[6] M. S. Neethu and R. Rajasree, ‘‘Sentiment analysis in Twitter using machine learning techniques,’’ in Proc. 4th Int. Conf. Comput., Commun. Netw. Technol. (ICCCNT), Jul. 2013, pp. 1–5.

[7] M. Hagen, M. Potthast, M. Büchner, and B. Stein, ‘‘Twitter sentiment detection via ensemble classification using averaged confidence scores,’’ in Proc. Eur. Conf. Inf. Retr., 2015, pp. 741–754.

[8] R. Bekkerman and J. Allan, ‘‘Using bigrams in text categorization,’’ Center Intell. Inf. Retr., Univ. Massachusetts Amherst, Amherst, MA, USA, Tech. Rep. IR-408, 2004.

[9] J. Singh, G. Singh, and R. Singh, ‘‘Optimization of sentiment analysis using machine learning classifiers,’’ Hum.-Centric Comput. Inf. Sci., vol. 7, no. 1, p. 32, Dec. 2017.

[10] S. Lukasik and P. Kulczycki, ‘‘An algorithm for sample and data dimensionality reduction using fast simulated annealing,’’ in Proc. Int. Conf. Adv. Data Mining Appl. Cham, Switzerland: Springer, 2011, pp. 152–161.

[11] G. Chandrashekar and F. Sahin, ‘‘A survey on feature selection methods,’’ Comput. Elect. Eng., vol. 40, no. 1, pp. 16–28, Jan. 2014.

[12] A. Sharma and S. Dey, ‘‘A comparative study of feature selection and machine learning techniques for sentiment analysis,’’ in Proc. ACM Res. Appl. Comput. Symp., Oct. 2012, pp. 1–7.

[13] D. Wang, H. Zhang, R. Liu, W. Lv, and D. Wang, ‘‘T-test feature selection approach based on term frequency for text categorization,’’ Pattern Recognit. Lett., vol. 45, pp. 1–10, Aug. 2014.

[14] K. Kira and L. A. Rendell, ‘‘A practical approach to feature selection,’’ in Machine Learning Proceedings. Amsterdam, The Netherlands: Elsevier, 1992, pp. 249–256.

[15] A. H. Narayanan, M. Prabhakar, B. L. Priya, and J. A. P. Singh, ‘‘Comparative study between classification algorithms based on prediction performance,’’ in Proc. Annu. Conv. Comput. Soc. India, Jan. 2018, pp. 185–196.

[16] T. Walkowiak, S. Datko, and H. Maciejewski, ‘‘Algorithm based on modified angle-based outlier factor for open-set classification of text documents,’’ Appl. Stochastic Models Bus. Ind., vol. 34, no. 5, pp. 718–729, Sep./Oct. 2018.

[17] H. Bagheri and M. J. Islam, ‘‘Sentiment analysis of Twitter data,’’ 2017, arXiv:1711.10377. [Online]. Available: https://arxiv.org/abs/1711.10377

[18] A. Abbasi, H. Chen, and A. Salem, ‘‘Sentiment analysis in multiple languages: Feature selection for opinion classification in Web forums,’’ ACM Trans. Inf. Syst., vol. 26, no. 3, Jun. 2008, Art. no. 12.

[19] M. S. Santos, J. P. Soares, P. H. Abreu, H. Araujo, and J. Santos, ‘‘Influence of data distribution in missing data imputation,’’ in Proc. Conf. Artif. Intell. Med. Eur. Cham, Switzerland: Springer, 2017, pp. 285–294.

[20] R. Narayanan, B. Liu, and A. Choudhary, ‘‘Sentiment analysis of conditional sentences,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2009, pp. 180–189.

[21] A. Balahur and M. Turchi, ‘‘Multilingual sentiment analysis using machine translation?’’ in Proc. 3rd Workshop Comput. Approaches Subjectivity Sentiment Anal., 2012, pp. 52–60.

[22] J. Khairnar and M. Kinikar, ‘‘Machine learning algorithms for opinion mining and sentiment classification,’’ Int. J. Sci. Res. Publications, vol. 3, no. 6, pp. 1–6, Jun. 2013.

[23] A. Go, R. Bhayani, and L. Huang, ‘‘Twitter sentiment classification using distant supervision,’’ Stanford Univ., Stanford, CA, USA, CS224N Project Rep., 2009, vol. 1, no. 12.

[24] B. Pang, L. Lee, and S. Vaithyanathan, ‘‘Thumbs up?: Sentiment classification using machine learning techniques,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., vol. 10, 2002, pp. 79–86.

[25] M. Al-Smadi, O. Qawasmeh, M. Al-Ayyoub, Y. Jararweh, and B. Gupta, ‘‘Deep Recurrent neural network vs. support vector machine for aspect-based sentiment analysis of Arabic hotels’ reviews,’’ J. Comput. Sci., vol. 27, pp. 386–393, Jul. 2018.

[26] V. Vapnik, The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

[27] A. Tripathy, A. Agrawal, and S. K. Rath, ‘‘Classification of sentiment reviews using n-Gram machine learning approach,’’ Expert Syst. Appl., vol. 57, pp. 117–126, Sep. 2016.

[28] S. Kiritchenko, X. Zhu, and S. M. Mohammad, ‘‘Sentiment analysis of short informal texts,’’ J. Artif. Intell. Res., vol. 50, no. 1, pp. 723–762, 2014.

[29] N. F. F. Da Silva, E. R. Hruschka, and E. R. Hruschka, Jr., ‘‘Tweet sentiment analysis with classifier ensembles,’’ Decis. Support Syst., vol. 66, pp. 170–179, Oct. 2014.

[30] C. Troussas, A. Krouska, and M. Virvou, ‘‘Evaluation of ensemble-based sentiment classifiers for Twitter data,’’ in Proc. 17th Int. Conf. Inf., Intell., Syst. Appl. (IISA), Jul. 2016, pp. 1–6.

[31] A. Pak and P. Paroubek, ‘‘Twitter as a corpus for sentiment analysis and opinion mining,’’ in Proc. Eur. Lang. Resour. Assoc. (ELRA), Valletta, Malta, vol. 10, 2010, pp. 1320–1326.

[32] A. Ortigosa, J. M. Martín, and R. M. Carro, ‘‘Sentiment analysis in Facebook and its application to e-learning,’’ Comput. Hum. Behav., vol. 31, pp. 527–541, Feb. 2014.

[33] M. Bouazizi and T. Ohtsuki, ‘‘Multi-class sentiment analysis in Twitter: What if classification is not the answer,’’ IEEE Access, vol. 6, pp. 64486–64502, 2018.

[34] M. A. Paredes-Valverde, R. Colomo-Palacios, M. D. P. Salas-Zarate, and R. Valencia-Garcia, ‘‘Sentiment analysis in Spanish for improvement of products and services: A deep learning approach,’’ Sci. Program., vol. 2017, Oct. 2017, Art. no. 1329281.

[35] Z. Jianqiang, G. Xiaolin, and Z. Xuejun, ‘‘Deep convolution neural networks for twitter sentiment analysis,’’ IEEE Access, vol. 6, pp. 23253–23260, 2018.

[36] J. A. Alzubi, ‘‘Optimal classifier ensemble design based on cooperative game theory,’’ Res. J. Appl. Sci., Eng. Technol., vol. 11, no. 12, pp. 1336–1343, Dec. 2015.

[37] J. A. Alzubi, ‘‘Diversity-based boosting algorithm,’’ Int. J. Adv. Comput. Sci. Appl., vol. 7, no. 5, pp. 524–529, 2016.

[38] O. A. Alzubi, J. A. Alzubi, S. Tedmori, H. Rashaideh, and O. Almomani, ‘‘Consensus-based combining method for classifier ensembles,’’ Int. Arab J. Inf. Technol., vol. 15, no. 1, pp. 76–86, Jan. 2018.


[39] L. Zhang, R. Ghosh, M. Dekhil, M. Hsu, and B. Liu, ‘‘Combining lexicon-based and learning-based methods for Twitter sentiment analysis,’’ HP Lab., Palo Alto, CA, USA, Tech. Rep. HPL-2011, 2011.

[40] T. Lalji and S. Deshmukh, ‘‘Twitter sentiment analysis using hybrid approach,’’ Int. Res. J. Eng. Technol., vol. 3, no. 6, pp. 2887–2890, 2016.

[41] F. Z. Xing, E. Cambria, and R. E. Welsch, ‘‘Intelligent asset allocation via market sentiment views,’’ IEEE Comput. Intell. Mag., vol. 13, no. 4, pp. 25–34, Nov. 2018.

[42] H. Wang, P. Yin, L. Zheng, and J. N. Liu, ‘‘Sentiment classification of online reviews: Using sentence-based language model,’’ J. Exp. Theor. Artif. Intell., vol. 26, no. 1, pp. 13–31, 2014.

[43] S. Poria, E. Cambria, and A. Gelbukh, ‘‘Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., Sep. 2015, pp. 2539–2544.

[44] R. Soni and K. J. Mathai, ‘‘Improved Twitter sentiment prediction through cluster-then-predict model,’’ 2015, arXiv:1509.02437. [Online]. Available: https://arxiv.org/abs/1509.02437

[45] J. Zheng and Y. Wang, ‘‘Personalized recommendations based on sentimental interest community detection,’’ Sci. Program., vol. 2018, Aug. 2018, Art. no. 8503452.

[46] L. Yu and H. Liu, ‘‘Efficient feature selection via analysis of relevance and redundancy,’’ J. Mach. Learn. Res., vol. 5, no. 10, pp. 1205–1224, 2004.

[47] D. A. Alboaneen, H. Tianfield, and Y. Zhang, ‘‘Sentiment analysis via multi-layer perceptron trained by meta-heuristic optimisation,’’ in Proc. Int. Conf. Big Data (Big Data), Dec. 2017, pp. 4630–4635.

[48] A. Duric and F. Song, ‘‘Feature selection for sentiment analysis based on content and syntax models,’’ Decis. Support Syst., vol. 53, no. 4, pp. 704–711, Nov. 2012.

[49] P. Kalaivani and K. Shunmuganathan, ‘‘Feature reduction based on genetic algorithm and hybrid model for opinion mining,’’ Sci. Program., vol. 2015, 2015, Art. no. 961454.

[50] B. Agarwal and N. Mittal, ‘‘Prominent feature extraction for review analysis: An empirical study,’’ J. Exp. Theor. Artif. Intell., vol. 28, no. 3, pp. 485–498, 2016.

[51] S. Arora, E. Mayfield, C. Penstein-Rosé, and E. Nyberg, ‘‘Sentiment classification using automatically extracted subgraph features,’’ in Proc. NAACL HLT Workshop Comput. Approaches Anal. Gener. Emotion Text, Jun. 2010, pp. 131–139.

[52] W. Wang, H. Wang, and Y. Song, ‘‘Ranking product aspects through sentiment analysis of online reviews,’’ J. Exp. Theor. Artif. Intell., vol. 29, no. 2, pp. 227–246, 2017.

[53] H. Saif, M. Fernandez, Y. He, and H. Alani, ‘‘Evaluation datasets for Twitter sentiment analysis: A survey and a new dataset, the STS-gold,’’ in Proc. 1st Int. Workshop Emotion Sentiment Social Expressive Media, Approaches Perspect. AI (ESSEM), 2013, pp. 1–14.

[54] L. F. S. Coletta, N. F. F. da Silva, E. R. Hruschka, and E. R. Hruschka, ‘‘Combining classification and clustering for tweet sentiment analysis,’’ in Proc. Brazilian Conf. Intell. Syst. (BRACIS), Oct. 2014, pp. 210–215.

[55] M. Speriosu, N. Sudan, S. Upadhyay, and J. Baldridge, ‘‘Twitter polarity classification with label propagation over lexical links and the follower graph,’’ in Proc. 1st Workshop Unsupervised Learn. NLP, Jul. 2011, pp. 53–63.

[56] M. Thelwall, K. Buckley, and G. Paltoglou, ‘‘Sentiment strength detection for the social Web,’’ J. Amer. Soc. Inf. Sci. Technol., vol. 63, no. 1, pp. 163–173, Jan. 2012.

[57] J. Jonnagaddala, T. R. Jue, and H. Dai, ‘‘Binary classification of Twitter posts for adverse drug reactions,’’ in Proc. Social Media Mining Shared Task Workshop Pacific Symp. Biocomput., Jan. 2016, pp. 4–8.

[58] L. Barbosa and J. Feng, ‘‘Robust sentiment detection on Twitter from biased and noisy data,’’ in Proc. 23rd Int. Conf. Comput. Linguistics, Posters, Aug. 2010, pp. 36–44.

[59] E. Kouloumpis, T. Wilson, and J. D. Moore, ‘‘Twitter sentiment analysis: The good the bad and the omg!’’ in Proc. Int. Conf. Web Social Media, Jul. 2011, pp. 538–541.

[60] L. Hong, O. Dan, and B. D. Davison, ‘‘Predicting popular messages in Twitter,’’ in Proc. 20th Int. Conf. Companion World Wide Web, Mar./Apr. 2011, pp. 57–58.

[61] O. Phelan, K. McCarthy, and B. Smyth, ‘‘Using Twitter to recommend real-time topical news,’’ in Proc. 3rd ACM Conf. Recommender Syst., Oct. 2009, pp. 385–388.

[62] S. Phuvipadawat and T. Murata, ‘‘Breaking news detection and tracking in Twitter,’’ in Proc. IEEE/WIC/ACM Int. Conf. Web Intell. Intell. Agent Technol., Aug./Sep. 2010, pp. 120–123.

[63] K. Lee, D. Palsetia, R. Narayanan, M. M. A. Patwary, A. Agrawal, and A. Choudhary, ‘‘Twitter trending topic classification,’’ in Proc. IEEE 11th Int. Conf. Data Mining Workshops (ICDMW), Dec. 2011, pp. 251–258.

[64] R. Al-Shalabi and R. Obeidat, ‘‘Improving KNN arabic text classification with n-grams based document indexing,’’ in Proc. 6th Int. Conf. Informat. Syst., Cairo, Egypt, 2008, pp. 108–112.

[65] J. C. Platt, ‘‘Using analytic QP and sparseness to speed training of support vector machines,’’ in Proc. Adv. Neural Inf. Process. Syst., 1999, pp. 557–563.

[66] R. Prabowo and M. Thelwall, ‘‘Sentiment analysis: A combined approach,’’ J. Informetrics, vol. 3, no. 2, pp. 143–157, 2009.

[67] S. Varma and R. Simon, ‘‘Bias in error estimation when using cross-validation for model selection,’’ BioMed Central, vol. 7, no. 1, p. 91, 2006.

[68] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Hoboken, NJ, USA: Wiley, 2012.

[69] W. Aziguli, Y. Zhang, Y. Xie, D. Zhang, X. Luo, C. Li, and Y. Zhang, ‘‘A robust text classifier based on denoising deep neural network in the analysis of big data,’’ Sci. Program., vol. 2017, Nov. 2017, Art. no. 3610378.

[70] J. Fdez-Glez, D. Ruano-Ordás, R. Laza, J. R. Méndez, R. Pavón, and F. Fdez-Riverola, ‘‘WSF2: A novel framework for filtering Web spam,’’ Sci. Program., vol. 2016, Nov. 2016, Art. no. 6091385.

[71] M. Bibi, O. Maqbool, and J. Kanwal, ‘‘Supervised learning for orphan adoption problem in software architecture recovery,’’ Malaysian J. Comput. Sci., vol. 29, no. 4, pp. 287–313, 2016.

MARYUM BIBI received the MPhil degree in computer science from Quaid-i-Azam University, Islamabad, Pakistan, in 2012. She is currently pursuing the Ph.D. degree with the Department of Computer Sciences and Information Technology, University of Azad Jammu and Kashmir, Muzaffarabad. She has publications in the area of machine learning and reverse engineering, including impact factor publication. She has various achievements, including a Gold Medal for securing First position in HSSC.

MALIK SAJJAD AHMED NADEEM received the Ph.D. degree from the University of Paris, in 2011. He is currently an Assistant Professor with the Department of Computer Sciences and Information Technology, University of Azad Jammu and Kashmir, Muzaffarabad. He has published various journal articles in the area of machine learning.

IMTIAZ HUSSAIN KHAN received the M.S. degree in computer science from the University of Essex, U.K., in 2005, and the Ph.D. degree in artificial intelligence from the University of Aberdeen, U.K., in 2010. He is currently an Associate Professor with the Department of Computer Science, King Abdulaziz University, Jeddah, KSA. He is an author of more than 20 research articles. His research interests include natural language processing, cognitive computing, and evolutionary computation.


SEONG-O SHIM received the B.S. degree in electronics engineering from Ajou University, Suwon, South Korea, in 1999, and the M.S. degree in mechatronics and the Ph.D. degree in information and mechatronics from the Gwangju Institute of Science and Technology, Gwangju, South Korea, in 2001 and 2011, respectively. He was with the LG Electronics DTV Labs, Seoul, South Korea, engaged in research and development of digital TV, from 2003 to 2007. He is currently an Associate Professor with the College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia. His research interests include computer vision, image processing, 3D shape recovery, and medical imaging.

ISHTIAQ RASOOL KHAN received the B.Sc. degree in electrical engineering from the University of Engineering and Technology, Taxila, Pakistan, in 1992, the M.S. degree in systems engineering from Quaid-i-Azam University, Islamabad, Pakistan, in 1994, and the M.S. degree in information engineering and the Ph.D. degree in digital signal processing from Hokkaido University, Japan, in 1998 and 2000, respectively, where he was a JSPS Fellow, from 2000 to 2002. He was previously with the University of Kitakyushu, Japan, Kyushu Institute of Technology, Japan, and Institute for Infocomm Research, A*STAR, Singapore. He is currently a Professor with the College of Computer Science and Engineering, University of Jeddah. His research interests include high dynamic range imaging, data analytics, and digital signal processing.

UZMA NAQVI received the M.S. (SE) degree from Bahria University, Islamabad, Pakistan. She is currently an Assistant Professor with the Department of Computer Sciences and Information Technology, The University of Azad Jammu and Kashmir. She has published various research articles in the area of machine learning.

WAJID AZIZ received the B.Sc. and M.Sc. degrees from the University of Azad Jammu and Kashmir (UAJ&K), Muzaffarabad, Pakistan, and the Ph.D. degree from the Pakistan Institute of Engineering and Applied Sciences (PIEAS), Islamabad, Pakistan. He was a lecturer with UAJ&K, in 1998. He is currently a Professor with the College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia. His core research expertise is in biomedical information systems. He has published three books and more than fifty research articles in reputed national and international journals and conference proceedings. His research interests include biomedical signal processing, time series analysis, and biomedical data analytics. Based on his academic and research contributions, he was a recipient of the HEC University Best Teacher Award for the year 2012-2013, awarded by HEC Pakistan in 2014, and the University Best Teacher Award from the University of AJ&K, in 2013.

