We observe that sentiment analysis tools exhibit intra-tool inconsistency, which is the prediction of different polarity for semantically equivalent documents by the same tool, and inter-tool inconsistency, which is the prediction of different polarity for semantically equivalent documents across different tools. We introduce a heuristic to assess the data quality of an augmented dataset and a new set of metrics to evaluate tool inconsistencies. Our results indicate that tool inconsistency is still an open problem, and they point towards promising research directions and accuracy improvements that can be obtained if such inconsistencies are resolved.

PVLDB Reference Format:
Wissam Mammar Kouadri, Mourad Ouziri, Salima Benbernou, Karima Echihabi, Themis Palpanas, and Iheb Ben Amor. Quality of Sentiment Analysis Tools: The Reasons of Inconsistency. PVLDB, 14(4): 668-681, 2021.
doi:10.14778/3436905.3436924

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/AbCd256/Quality-of-sentiment-analysis-tools.git.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 4 ISSN 2150-8097.

1 INTRODUCTION

With the growing popularity of social media, the internet is replete with sentiment-rich data like reviews, comments, and ratings. A sentiment analysis tool automates the process of extracting sentiments from a massive volume of data by identifying an opinion and deriving its polarity, i.e., Positive, Negative, or Neutral. In the last decade, the topic of sentiment analysis has flourished in the research community [15, 22, 26, 29, 31, 39, 69, 76, 80, 82], and has also attracted the attention of the data management community, which has studied problems related to polarity aggregation, opinion mining, sentiment correlation, fact-checking, and others [2, 3, 23, 49, 56, 64, 67, 68, 70, 71, 74, 75, 82]. Therefore, organizations are showing great interest in adopting sentiment analysis tools to exploit this data for decision making. For instance, sentiment analysis is used in politics to predict election results, in marketing to measure user satisfaction [68], in crowdsourcing to measure workers' relevance [63], and in the health care domain [36]. Nevertheless, sentiment analysis of social media data is still a challenging task due to the complexity and variety of natural language, through which the same idea can be expressed and interpreted using different words. Let us illustrate this issue by considering two texts (documents) expressed differently, but having the same meaning:

(1) China urges the US to stop its unjustifiable crackdown on Huawei.
(2) China urges the United States to stop its unreasonable crackdown on Huawei.

Although the two documents are worded differently, they are, in fact, semantically equivalent paraphrases, because they convey the same meaning.

The sentiment analysis research community has adopted the consensus that semantically equivalent text should have the same sentiment polarity [8, 19, 28, 62, 72, 77]. For example, [19] proposed an approach that learns the polarity of affective events in a narrative text based on weakly-supervised labels, where semantically equivalent pairs (event/effect) get the same polarity, and opposite pairs (event/effect) get opposite polarities. The authors of [26] extended their dataset with opinion paraphrases by labeling the generated paraphrases with the polarity labels of the original text.

However, recent works show that sentiment analysis systems assign different polarity labels to semantically equivalent documents, i.e., those systems predict different outputs for semantically equivalent inputs [1, 9, 35, 44, 48, 61, 73]. Such documents are called adversarial examples. The latest works propose new tools that generate adversarial examples to extend datasets, which leads to more general and robust models [1, 35, 61]. The work in [35] proposes a syntactically controlled paraphrase generation model, and the one in [1] generates semantically equivalent documents using population-based genetic optimization algorithms. A rule-based framework that generates semantically equivalent adversarial examples is also proposed in [61].


In this work, we conduct an empirical study that quantifies and explains the non-robustness of sentiment analysis tools in the presence of adversarial examples. Our evaluation uncovers two different anomalies that occur in sentiment analysis tools: intra-tool inconsistency, which is the prediction of different polarity for semantically equivalent documents by the same tool, and inter-tool inconsistency, which is the prediction of different polarity for semantically equivalent documents across different tools. In our in-depth analysis, we offer a detailed explanation of the causes of such inconsistencies. Existing works [1, 35, 53, 73] have typically focused on generating data for adversarial training and for debugging models [61]. In our work, we focus on studying the problem from a data quality point of view, and on understanding the reasons behind the aforementioned inconsistencies. We also demonstrate that sentiment analysis tools are not robust in the presence of adversarial examples, regardless of whether they are based on machine learning models, lexicons, or rules.

In summary, we make the following contributions.

• We address the problem of adversarial examples in sentiment analysis tools as a data quality issue (Sections 3 and 5), evaluate the prediction quality of tools on two levels (intra-tool and inter-tool), and review five datasets in terms of quality and consistency (Section 5.2). Our experiments reveal that inconsistencies are frequent and present in different configurations, so we define a non-exhaustive list of causes that may influence the inconsistency.

• We build a benchmark of paraphrases with polarity labels to evaluate the consistency and the accuracy of sentiment analysis tools, and propose a method to reduce the cost of verifying the benchmark quality (Section 5.1).

• We survey six state-of-the-art sentiment analysis algorithms (Section 4) and provide an in-depth analysis of them in the presence of adversarial examples across different domains (Sections 5.3 and 5.4).

• We propose new metrics that allow the evaluation of intra-tool and inter-tool inconsistencies (Section 5.1).

• We present a rich discussion of our experimental results and show the effects of reducing inconsistencies on the accuracy of sentiment analysis tools (Section 5.4).

• We propose recommendations for choosing sentiment analysis tools under different scenarios (Section 6), which help data managers select the relevant tool for their frameworks. For the sake of reproducibility, we make our code and datasets publicly available¹.

¹ https://github.com/AbCd256/Quality-of-sentiment-analysis-tools.git

2 RELATED WORK

The problem of inconsistency in sentiment analysis has attracted the interest of researchers in the data management community [23, 56, 70]. Studies performed on this problem can be classified into two categories: time-level inconsistency (or sentiment diversity), and polarity inconsistency, represented by both intra-tool and inter-tool inconsistencies. For instance, the authors of [70] studied the inconsistency of opinion polarities (on a specific subject) in a given time window, as well as its evolution over time, and proposed a tree-based method that aggregates opinion polarities by taking into consideration their disagreements at large scale. The authors of [56] proposed a set of hybrid machine learning methods that rank tweet events based on their controversy score (inconsistency score) in a time window. As an example of polarity inconsistency work, the authors of [23] studied the polarity consistency of sentiment analysis dictionaries on two levels (intra- and inter-dictionary). However, their study focused on only one type of sentiment analysis tool (lexicon-based methods with a word dictionary), while our study is more thorough and includes a wider range of algorithms (rule-based, machine learning-based, and concept lexicon methods) and data. Besides, our study is performed on the document level, and analyzes tool inconsistency from different points of view: statistical, structural, and semantic. Other studies of intra-tool inconsistency have been limited to earlier research that studied sentiment preservation in translated text using lexicon-based methods [4, 10]. The results of our study do not contradict previous findings, but confirm, deepen, and generalize them.

To the best of our knowledge, this is the first study that carries out a comprehensive structural and semantic analysis of sentiment analysis tools on large-scale datasets, across different domains and varieties of algorithms.

3 PROBLEM STATEMENT

3.1 Motivating Example

Our research work is motivated by a real observation from social media. We present here a data sample collected from Twitter about Trump's restrictions on Chinese technology.

Let $D = \{d_1, \ldots, d_8\}$ be a sample set of documents/tweets such that:

• $d_1$: Chinese technological investment is the next target in Trump's crackdown. $d_2$: Chinese technological investment in the US is the next target in Trump's crackdown.

• $d_3$: China urges end to United States crackdown on Huawei. $d_4$: China slams United States over unreasonable crackdown on Huawei. $d_5$: China urges the US to stop its unjustifiable crackdown on Huawei.

• $d_6$: Trump softens stance on China technology crackdown. $d_7$: Trump drops new restrictions on China investment. $d_8$: Donald Trump softens tone on Chinese investments.

Note that $D$ is constructed from subsets of semantically equivalent documents. For instance, $d_1$ and $d_2$ are semantically equivalent, as they both express the idea that the US is restricting Chinese technological investments. We denote this set by $A_1$ and write $A_1 = \{d_1, d_2\}$. Similarly, $A_2 = \{d_3, d_4, d_5\}$ expresses that the Chinese government demands that the US stop the crackdown on Huawei, and $A_3 = \{d_6, \ldots, d_8\}$ conveys the idea that Trump reduces restrictions on Chinese investments. The sets $A_i$ form a partition of $D$, that is:

$$D = \bigcup_{i=1}^{n} A_i \quad \text{and} \quad \bigcap_{i=1}^{n} A_i = \emptyset.$$

We analyse $D$ using four sentiment analysis tools: the Stanford Sentiment Treebank [66], Senticnet5 [8], Sentiwordnet [25], and Vader [29]. We associate with those sentiment analysis tools the following polarity functions, respectively: $P_{rec\_nn}$, $P_{senticnet}$, $P_{sentiwordnet}$, and $P_{vader}$; $P_h$ is associated with human annotations as the ground truth. Table 1 summarizes the results of the analysis. Note that different tools can attribute different sentiment labels to the same document (e.g., only $P_{rec\_nn}$ attributes the correct label to $d_3$ in $A_2$), and the same tool can attribute different labels to semantically equivalent documents (e.g., $P_{senticnet}$ considers $d_6$ as Positive and $d_7$ as Negative).

Table 1: Predicted polarity on dataset D by different tools

| $A_i$ | Id    | $P_{rec\_nn}$ | $P_{senticnet}$ | $P_{sentiwordnet}$ | $P_{vader}$ | $P_h$    |
|-------|-------|---------------|-----------------|--------------------|-------------|----------|
| $A_1$ | $d_1$ | Neutral       | Negative        | Negative           | Neutral     | Negative |
|       | $d_2$ | Negative      | Negative        | Negative           | Neutral     | Negative |
| $A_2$ | $d_3$ | Negative      | Positive        | Positive           | Neutral     | Negative |
|       | $d_4$ | Negative      | Positive        | Negative           | Neutral     | Negative |
|       | $d_5$ | Negative      | Positive        | Negative           | Neutral     | Negative |
| $A_3$ | $d_6$ | Neutral       | Positive        | Positive           | Neutral     | Positive |
|       | $d_7$ | Negative      | Negative        | Positive           | Neutral     | Positive |
|       | $d_8$ | Neutral       | Positive        | Negative           | Neutral     | Positive |

3.2 Definitions and Terminology

3.2.1 Sentiment Analysis. Sentiment analysis is the process of extracting the quintuplet $\langle E_i, F_{ij}, H_z, O_{ijz}, t \rangle$ from a document $d$, such that: $E_i$ is an entity $i$ mentioned in document $d$, $F_{ij}$ is the $j$-th feature of the entity $i$, and $H_z$ is the holder, i.e., the person who gives the opinion $O_{ijz}$ on the feature $F_{ij}$ at time $t$ [45]. Notice that the same document can contain several entities and that each entity can have a number of features.

In our work, we consider sentiment analysis as the assignment $\langle d_i, O_i \rangle$, where a polarity label $O_i$ is extracted from document $d_i$.

3.2.2 Polarity.

Polarity. We call polarity the semantic orientation of the sentiment in the text. Polarity can be Negative, Neutral, or Positive and is denoted by $p \in \Pi$ with $\Pi = \{Negative, Neutral, Positive\}$.

Polarity function. For each tool $t_k$, we associate a polarity function $P_{t_k}$ defined as: $\forall P_{t_k} \in \Gamma,\ P_{t_k}: D \to \Pi$, with $\Gamma$ the set of all polarity functions and $D$ a set of documents. In other words, this function assigns a polarity label to each document $d_i$.

Ground truth. We denote by $P_h$ the polarity given by human annotation, which we consider as the ground truth.

Polar Word. We define a polar word as a word that has a polarity different from Neutral ($P_h(w_j) \neq Neutral$).

Opinionated Document. Let us consider a document $d_i$ as a set of words such that $d_i = \{w_1, \ldots, w_m\}$. An opinionated document is a document that contains a polar word, formally: $\exists\, 1 \leq j \leq m$ such that $P_h(w_j) \neq Neutral$.

Polar Fact/Objective Document. A polar fact, a.k.a. an objective document, is a document that does not contain a polar word, i.e., $d_i$ is a polar fact iff: $\forall w_j \in d_i$ such that $1 \leq j \leq m$, $P_h(w_j) = Neutral$.
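The definitions of polar word, opinionated document, and polar fact can be illustrated with a minimal Python sketch. It assumes a hypothetical word-level ground truth (the dictionary `WORD_POLARITY` below is invented for illustration and is not a resource used in the paper) and a naive whitespace tokenizer.

```python
# Hypothetical word-level ground truth P_h(w); any word not listed is Neutral.
WORD_POLARITY = {
    "unjustifiable": "Negative",
    "crackdown": "Negative",
    "softens": "Positive",
}


def word_polarity(word):
    """P_h(w): polarity of a single word under the toy ground truth."""
    return WORD_POLARITY.get(word.lower(), "Neutral")


def is_polar_word(word):
    """A polar word has a polarity different from Neutral."""
    return word_polarity(word) != "Neutral"


def words(document):
    """Naive tokenizer: split on whitespace and strip punctuation."""
    return [w.strip(".,;:!?") for w in document.split()]


def is_opinionated(document):
    """An opinionated document contains at least one polar word."""
    return any(is_polar_word(w) for w in words(document))


def is_polar_fact(document):
    """A polar fact (objective document) contains no polar word."""
    return not is_opinionated(document)
```

With this toy ground truth, document $d_5$ of the motivating example is opinionated because it contains "unjustifiable" and "crackdown".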

3.2.3 Analogical Set. An analogical set is a set of semantically equivalent documents: $A_l = \{d_1, \ldots, d_n\}$ s.t. $\forall d_i, d_j \in A_l,\ d_i \overset{s}{\iff} d_j$, where $\overset{s}{\iff}$ denotes semantic equivalence, and $d_i$ and $d_j$ are called analogical documents.

3.2.4 Sentiment Consistency. For each dataset $D$ and set of polarity functions $\Gamma$, we define sentiment consistency by the following two rules:

1) Intra-tool consistency. Intra-tool consistency assesses the contradiction of tools when considered individually. In other words, there is an intra-tool inconsistency for a tool $P_{t_k}$ if it assigns different polarities to two analogical documents $d_i$ and $d_j$. Intra-tool consistency is defined as:

$$\forall A_l,\ \forall d_i, d_j \in A_l,\ \forall P_{t_k} \in \Gamma:\quad P_{t_k}(d_i) = P_{t_k}(d_j) \tag{1}$$

We call adversarial examples the documents that belong to the same analogical set and that violate intra-tool consistency. The documents $d_i$ and $d_j$ are adversarial for $P_*$ if:

$$P_*(d_i) \neq P_*(d_j) \quad \text{s.t.}\ P_* \in \Gamma \tag{2}$$

For example, documents $d_6$ and $d_7$ in Table 1 violate intra-tool consistency, so they are adversarial examples for the Senticnet sentiment analysis tool.

2) Inter-tool consistency. Inter-tool consistency assesses the contradiction between the tools. In other words, there is an inter-tool inconsistency between the tools $P_{t_k}$ and $P_{t'_k}$ if they assign different polarities to a given document $d_i$. Inter-tool consistency is defined as:

$$\forall A_l,\ \forall d_i \in D,\ \forall P_{t_k}, P_{t'_k} \in \Gamma:\quad P_{t_k}(d_i) = P_{t'_k}(d_i) \tag{3}$$

In the example of Section 3.1, inter-tool consistency is not respected for the document $d_3$, since we have $P_{rec\_nn}(d_3) \neq P_{senticnet}(d_3)$. We call this case inter-tool inconsistency.

Inconsistency classes. We distinguish three classes of inconsistencies based on the predicted polarity of analogical documents: (1) inconsistency of type Positive/Neutral, (2) inconsistency of type Negative/Neutral, and (3) inconsistency of type Negative/Positive.
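The two consistency rules can be checked mechanically on the motivating example. The following Python sketch is only an illustration: the predictions are copied from Table 1, while the function names and data layout are assumptions made here and are not part of the paper's code.

```python
from itertools import combinations

# Predictions copied from Table 1: tool -> document -> predicted polarity.
PREDICTIONS = {
    "rec_nn": dict(d1="Neutral", d2="Negative", d3="Negative", d4="Negative",
                   d5="Negative", d6="Neutral", d7="Negative", d8="Neutral"),
    "senticnet": dict(d1="Negative", d2="Negative", d3="Positive", d4="Positive",
                      d5="Positive", d6="Positive", d7="Negative", d8="Positive"),
    "sentiwordnet": dict(d1="Negative", d2="Negative", d3="Positive", d4="Negative",
                         d5="Negative", d6="Positive", d7="Positive", d8="Negative"),
    "vader": {f"d{i}": "Neutral" for i in range(1, 9)},
}
ANALOGICAL_SETS = {"A1": ["d1", "d2"], "A2": ["d3", "d4", "d5"], "A3": ["d6", "d7", "d8"]}
# Ordering used only to name a class the way the text does (e.g. Positive/Neutral).
ORDER = ["Negative", "Positive", "Neutral"]


def intra_tool_violations(tool):
    """Pairs of analogical documents labeled differently by one tool (Eq. 1 violated)."""
    preds = PREDICTIONS[tool]
    return [(di, dj)
            for docs in ANALOGICAL_SETS.values()
            for di, dj in combinations(docs, 2)
            if preds[di] != preds[dj]]


def inter_tool_violations(doc):
    """Pairs of tools that disagree on one document (Eq. 3 violated)."""
    return [(a, b) for a, b in combinations(PREDICTIONS, 2)
            if PREDICTIONS[a][doc] != PREDICTIONS[b][doc]]


def inconsistency_class(p, q):
    """Name of the inconsistency class for two different polarities."""
    first, second = sorted((p, q), key=ORDER.index)
    return f"{first}/{second}"
```

For Senticnet, the sketch reports the pairs (d6, d7) and (d7, d8) as adversarial, which matches the example given after Equation 2; both pairs fall in the Negative/Positive class.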

4 APPROACHES

In this section, we describe the six popular sentiment analysis tools that we cover in our study. They are representative tools from the state of the art, selected based on their accuracy, the polarity classes they predict, and the methods they are based on (lexicon-based, rule-based, and machine learning-based).

4.1 Lexicon-Based Approaches

This family of approaches is considered a trivial solution to sentiment analysis. It encompasses unsupervised methods that use a word matching scheme to find the sentiment of a document based on a predefined dictionary of polar words, concepts, and events/effects [12, 13, 18, 19]. We describe the general approach adopted by lexicon-based methods as a template, displayed in Algorithm 1. To represent this class of methods, we select Sentiwordnet [25] because it is a widely used lexicon for sentiment analysis [17, 19, 32, 41, 54]. It is based on Wordnet [52] and is annotated automatically using a semi-supervised classification method. It uses a ternary classifier (a classifier for each polarity label) on Wordnet that is trained on a dataset of labeled synsets (sets of synonyms of a word). The dataset is created automatically by clustering the Wordnet synsets around small samples of manually labeled synsets.

We run Algorithm 1 using Sentiwordnet as the lexicon on document $d_8$ from the example of Section 3.1. First, we tokenize the document and represent it in a bag-of-words model (step 1): Donald:1, Trump:1, soften:1, tone:1, on:1, Chinese:1, investments:1, where the numbers represent the frequency of the tokens in the sentence. Then, we look up the sentiment score of each token in the dictionary (step 2) and find $list = \{0, 0, 0, -0.125, 0, 0, 0\}$ (we multiply the polarity of each token by its frequency). We aggregate the tokens' polarities by calculating the mean of the polarity scores (step 3), and get the polarity label associated with this score (step 4). We find that $P_{sentiwordnet}(d_8) = Negative$.


Algorithm 1 Lexicon-Based Sentiment Analysis
Input: $d_i$: Document
Output: Polarity label $l_i \in \Pi$
procedure LexiconBased
    // Step 1: Tokenize the document
    W = tokenize($d_i$)
    // Step 2: Calculate the polarity of each token in W
    for each $w_i \in W$:
        list.add($P_*(w_i)$)
    // Step 3: Aggregate word polarities (using the mean in most cases)
    score = mean(list)
    // Step 4: Get the polarity label based on the polarity score
    l = getLabel(score)
    return l
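A runnable Python sketch of the template in Algorithm 1 follows. It is not the Sentiwordnet implementation: the lexicon is a toy dictionary in which only the score of "tone" (-0.125) is taken from the worked example on $d_8$, and both the regular-expression tokenizer and the zero threshold in `get_label` are assumptions made for illustration.

```python
import re
from statistics import mean

# Toy lexicon (assumption): only "tone" = -0.125 comes from the worked example;
# every other word is treated as having score 0.0. This is NOT SentiWordNet.
TOY_LEXICON = {"tone": -0.125}


def tokenize(document):
    """Step 1: lowercase the document and split it into word tokens."""
    return re.findall(r"[a-z']+", document.lower())


def get_label(score, eps=0.0):
    """Step 4: map a polarity score to a label (eps is a hypothetical threshold)."""
    if score > eps:
        return "Positive"
    if score < -eps:
        return "Negative"
    return "Neutral"


def lexicon_based(document, lexicon=TOY_LEXICON):
    """Template of Algorithm 1: tokenize, look up, average, label."""
    tokens = tokenize(document)
    # Step 2: one score per token occurrence, which is equivalent to
    # multiplying each distinct token's score by its frequency.
    scores = [lexicon.get(tok, 0.0) for tok in tokens]
    # Step 3: aggregate with the mean; an empty document gets score 0.0.
    score = mean(scores) if scores else 0.0
    return get_label(score)
```

On $d_8$, the seven tokens yield a mean of $-0.125 / 7$, which is below zero, so the sketch returns Negative as in the worked example.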

4.2 Rule-Based Approaches

Rule-based methods are similar to lexicon-based methods, with an additional inference layer that allows deducing the correct polarity of documents based on linguistic rules. First, a lexicon of words, concepts, or events/effects is created. Then, a layer of syntactic and part-of-speech rules is added to allow the inference of the correct polarity label. For example, the Vader tool [29] uses the following syntactic rules: 1) if all characters of a polar word are in upper case (e.g., GOOD), the polarity intensity increases by 0.733; and 2) the polarity intensity decreases by 50% in the text preceding the conjunction "but" and increases by 50% in the text that follows it. Senticnet [8] is a semantic network of concepts (a lexicon with relations between concepts) that has a layer for polarity inference. It uses linguistic rules named "sentic patterns" to infer the right polarity of the sentence based on its structure. As an example, if a concept is in the scope of a polarity switcher, such as negation, its polarity is inverted (e.g., from Positive to Negative). We propose Algorithm 2, which is the template adopted by rule-based methods.

Algorithm 2 Rule-Based Sentiment Analysis
Input: $d_i$: Document
Output: Polarity label $l_i \in \Pi$
procedure RuleBased
    // Step 1: Split the sentence into units (words, concepts, event/effect)
    W = tokenize($d_i$)
    // Step 2: Get the polarity of each token in W from the semantic network/lexicon
    for each $w_i \in W$:
        list.add($P_*(w_i)$)
    // Step 3: Infer the final polarity of tokens using the set of defined syntactic rules s
    list = infer_polarity(list, s)
    // Step 4: Aggregate word polarities
    score = aggregate(list)
    // Step 5: Get the polarity label based on the polarity score
    l = getLabel(score)
    return l

Consider the document "China does not want to supplant the US, but it will keep growing", which we analyze using Senticnet:

(step 1) Senticnet divides this sentence into concepts using the tokenize function, which produces the list $W$ of tokens (a.k.a. concepts): $W = \{china, supplant\_US, keep\_grow\}$.

(step 2) Senticnet determines the polarity of each concept as a list of couples $\langle Concept : polarity \rangle$: $list = \{china : Neutral,\ supplant\_US : Positive,\ keep\_grow : Positive\}$.

(step 3) The function infer_polarity activates sentic patterns to infer each concept's polarity. The concept $\langle supplant\_US \rangle$ is in the scope of a polarity switching operator, which inverts the polarity of this concept to Negative.

(step 4) When aggregating the concepts' polarities, Senticnet encounters the conjunction "but"; therefore, it infers the polarity as being Positive, because it applies the following adversative rule defined in sentic patterns: the polarity of the document is equivalent to the polarity of the text following the conjunction "but".
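The worked example above can be approximated with a small Python sketch of the template in Algorithm 2. This is a word-level simplification, not Senticnet: the lexicon scores, the negation list, and the way the two rules (negation scope and the "but" rule) are implemented are all assumptions chosen to reproduce the example.

```python
import re

# Hypothetical word-level lexicon standing in for the concept polarities
# of the example (supplant_US and keep_grow are both Positive).
TOY_LEXICON = {"supplant": 0.5, "growing": 0.5}
NEGATIONS = {"not", "no", "never"}  # assumed polarity switchers


def tokenize(document):
    """Step 1: lowercase word tokens (concept extraction is not modeled)."""
    return re.findall(r"[a-z']+", document.lower())


def infer_polarity(tokens, lexicon):
    """Steps 2-3: look up scores and invert them inside a negation scope.
    The scope is assumed to run from the negation word to the next 'but'."""
    scored, negated = [], False
    for tok in tokens:
        if tok == "but":
            negated = False
            scored.append((tok, 0.0))
        elif tok in NEGATIONS:
            negated = True
            scored.append((tok, 0.0))
        else:
            score = lexicon.get(tok, 0.0)
            scored.append((tok, -score if negated else score))
    return scored


def aggregate(scored):
    """Step 4: adversative rule, keep only the text after the last 'but'."""
    toks = [tok for tok, _ in scored]
    if "but" in toks:
        last_but = len(toks) - 1 - toks[::-1].index("but")
        scored = scored[last_but + 1:]
    return sum(score for _, score in scored)


def get_label(score):
    """Step 5: map the aggregated score to a polarity label."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


def rule_based(document, lexicon=TOY_LEXICON):
    return get_label(aggregate(infer_polarity(tokenize(document), lexicon)))
```

Without the clause introduced by "but", the same sketch labels "China does not want to supplant the US" as Negative, which shows how a single rule can flip the document-level polarity.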

4.3 Learning-Based Approaches

Recently, researchers have widely explored deep learning models [21, 34, 46, 57, 65, 66] in sentiment analysis to exploit their ability to extract the relevant features from the text. For instance, [66] proposed a Recursive Neural Network model (RecNN) for sentiment classification. This model learns a vector representation of sentences (sentence embedding) through a recursive compositional computation, then feeds it to a Softmax sentiment classifier.

On the embedding level, a standard Convolutional Neural Network (Text_CNN) was proposed in [37] for text classification tasks, including sentiment analysis. It uses a word embedding model pre-trained on the Google News corpus [51], which guarantees a good coverage of the language space and captures the semantics of the sentence. The authors of [21] refined the sentence embedding level and proposed a character-to-sentence CNN for short text classification (Char_CNN) with two convolutional layers that handle sentiment features at both the character and sentence levels (word order, character characteristics), since on social media the way a word is written may itself carry a sentiment.

We describe the methods that use CNNs through a general template, displayed in Algorithm 3 for training and Algorithm 4 for classification. CNNs perform document compositionality and transform the document matrix into a vector of features used by a classifier. Thus, in this case, we first split the document $d_i$ into a sequence of $n$ words such that $d_i = \{w_{i1}, \ldots, w_{in}\}$. Then, we represent each word as a vector to obtain a matrix representation of the document. Many research works [21, 37, 43, 65] have used a word embedding model for this task. Some methods [51, 55] use a pre-trained model to initialize the word embedding vectors, while others initialize the vectors randomly. In this case, our document $d_i$ is represented as $d_i = \{U_{i1}, \ldots, U_{in}\}$, where $U_{ij} \in \mathbb{R}^k$, $k$ is the size of the embedding vector, $U_{ij} = g(w_{ij})$, and $g$ is a word embedding function. We concatenate the vectors to obtain the matrix representation of the document as follows:

$$M_i = U_{i1} \oplus U_{i2} \oplus \cdots \oplus U_{in}, \quad M_i \in \mathbb{R}^{k \times |d_i|}$$

This matrix is the input of the convolution layer, which extracts features from the matrix $M_i$ by applying convolutions between filters $W$ and the input matrix on a window of size $l$. The convolution operation is defined by:

$$z_j = h(W \cdot M_i[j, j + l - 1] + b) \tag{4}$$

where $h$ is the activation function, which could be tanh, ReLU, or Softmax, $z_j$ is the feature map (the resulting features), $W \in \mathbb{R}^{l \times o}$ is the filter to learn, where $l$ and $o$ are hyperparameters (typically, $o = |d_i|$), and $b$ is the bias. Pooling is applied to extract only the significant features. After that, the extracted vector of features is fed into a standard classifier (Neural Network, SVM, Softmax, etc.). The filters are the parameters to learn. The training process consists of forward propagation followed by weight updates using stochastic back-propagation, with the objective of maximizing the log-likelihood² with respect to the model parameters $\theta$:

$$\max_{\theta} \sum_{i=1}^{n} \log pr(P_*(d_i) \mid d_i, \theta) \tag{5}$$

Algorithm 3 CNN for Sentiment Analysis (Training)
Input: training dataset $D = \{d_1, \ldots, d_n\}$; $L$: list of hyperparameters // batch size, window size, number of filters, ...
Output: Trained model
procedure Training
    // Step 1: Initialize weights
    // Step 2: Document representation (word embedding)
    for each $d_i \in D$:
        $M_i = \{U_{ij} \mid \forall w_{ij} \in d_i,\ U_{ij} = g(w_{ij})\}$
        list.add($M_i$)
    // Step 3: Train the network
    for each epoch in nb_epochs do:
        Convolution: Equation 4
        Pooling
        Classifier
        Optimization (backpropagation): Equation 5

Algorithm 4 CNN for Sentiment Analysis (Classification)
Input: $d_i$: Document, model
Output: Polarity label $l \in \Pi$
procedure LearningBasedPrediction
    load(model)
    Convolution: Equation 4
    Pooling
    v = Classifier // returns a value v
    return get_polarity(v)
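To make the forward pass of Algorithm 4 concrete, here is a dependency-free Python sketch of Equation 4 followed by max pooling and a Softmax classifier. It rests on several assumptions that are not taken from the paper: the document matrix is stored as a list of $n$ word vectors of size $k$, each filter is an $l \times k$ list of rows (one row per word in the window), pooling is max-over-time, and the classifier is a single linear layer `C`. Training (Equation 5) is not shown.

```python
import math


def conv_feature_map(M, W, b, h=math.tanh):
    """Equation 4 for one filter: z_j = h(W . M[j : j + l] + b).
    M: list of n word vectors (each of size k); W: l rows of size k (assumed layout)."""
    l = len(W)
    return [
        h(sum(W[a][c] * M[j + a][c] for a in range(l) for c in range(len(W[a]))) + b)
        for j in range(len(M) - l + 1)
    ]


def max_pool(z):
    """Max-over-time pooling: keep the strongest response of the filter."""
    return max(z)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(M, filters, biases, C, labels=("Negative", "Neutral", "Positive")):
    """Convolution, pooling, then a linear Softmax classifier with weights C
    (one row of C per label, one column per filter)."""
    features = [max_pool(conv_feature_map(M, W, b)) for W, b in zip(filters, biases)]
    scores = [sum(w * f for w, f in zip(row, features)) for row in C]
    probs = softmax(scores)
    return labels[probs.index(max(probs))], probs


# Tiny usage example with hand-picked numbers (n = 3 words, k = 2, l = 2).
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
label, probs = classify(M, [W], [0.0], C=[[-1.0], [0.0], [1.0]])
```

In this toy setting the filter responds most strongly to the first window, the pooled feature is $\tanh(2)$, and the hand-picked classifier weights map a large feature value to the Positive label.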

² Equivalently, some methods [21] minimize the negative log-likelihood.

5 EXPERIMENTS

In this section, we conduct a thorough experimental evaluation of the robustness of existing sentiment analysis tools in the presence of adversarial examples. First, we introduce our experimental setup; then, we evaluate the different techniques based on a statistical analysis followed by an in-depth structural and semantic analysis. For the sake of reproducibility, we make our datasets and codebases publicly available. All our experiments are implemented in Python and Java, and were conducted on a server with 75GB of RAM and two Intel Xeon E5-2650 v4 2.2GHz CPUs.

Table 2: Comparison of sentiment analysis tools

| Technique              | Method            | Mean Accuracy                      | Test Dataset                                                         |
|------------------------|-------------------|------------------------------------|----------------------------------------------------------------------|
| Lexicon-based method   | Sentiwordnet [25] | Binary: 51% [17]                   | MPQA dataset [78]                                                    |
| Rule-based methods     | SenticNet [8]     | Binary: 93.7%                      | Blitzer [7]; Movie Reviews [47]                                      |
|                        | Vader [29]        | Fine-grained: 79%                  | Amazon Reviews; Movie Reviews; Tweets; NY Times Editorials           |
| Machine learning (RNN) | RecNN [66]        | Fine-grained: 45%; Binary: 85.4%   | Sentiment Treebank [66]                                              |
| Machine learning (CNN) | CNN [37]          | Fine-grained: 48%; Binary: 84.5%   | Movie Reviews [47]; Sentiment Treebank [66]; Customer Reviews [50]   |
|                        | CharCNN [21]      | Fine-grained: 48.7%; Binary: 85.8% | Sentiment Treebank [66]; Stanford Twitter Sentiment [30]             |

5.1 Experimental Setup

Tools. In our experiments, we use the sentiment analysis tools summarized in Table 2. We use SentiWordnet [25] as a representative method for the lexicon-based category because of its popularity and coverage. We also use two representative methods for rule-based approaches: Vader [29], whose high quality has been verified by human experts, and SenticNet [8], which is a lexicon of concepts. Using these different lexicons allows us to evaluate inconsistencies on two different levels: the concept level and the word level. For the learning-based methods, we use three machine learning tools: RecNN [66], which learns word embeddings, Text_CNN [37], which uses a pre-trained word embedding method, and Char_CNN [21], which uses two levels of embedding. All these methods are powerful and widely used, and they use different embedding types, which allows us to study the relationship between the embedding type and inconsistencies. We note that we implemented the two methods [21] and [37], and re-trained the networks on the datasets used. In the rest of the paper, we refer to sentiment analysis tools by their polarity functions, i.e., SenticNet as $P_{senticnet}$, Sentiwordnet as $P_{sentiwordnet}$, Vader as $P_{vader}$, Text_CNN as $P_{text\_cnn}$, Char_CNN as $P_{char\_cnn}$, and RecNN as $P_{rec\_nn}$.

Benchmark. To evaluate the tools' inconsistencies and accuracy, a dataset of analogical sets with polarity labels is needed. To the best of our knowledge, no such dataset is publicly available. Therefore, we built a benchmark of analogical sets labeled with polarities, based on publicly available sentiment datasets that we augmented with paraphrases. This method allows generating analogical sets of up to ten documents, as opposed to the straightforward method, which consists of manually labeling paraphrase datasets. For this, we extended five publicly available datasets using syntactically controlled paraphrase networks (SCPNs) [35].

Base Datasets. The five datasets were chosen based on their popularity [11, 16, 27, 34, 37, 38, 76, 79], their size, and having at least three polarity labels. (1) The News Headlines dataset [14] contains short sentences related to financial news headlines, collected from tweets and news RSS feeds and annotated by financial news experts with polarity values $p \in [-1, 1]$. (2) The Stanford Sentiment Treebank (SST) [66] is built from movie reviews and annotated using the Mechanical Turk crowdsourcing platform³ with values $o \in [0, 1]$. (3) The Amazon Reviews Dataset [33] contains product reviews from the Amazon websites, including text reviews and the product's five-level rating as a polarity score. (4) The US airlines tweets dataset⁴ contains tweets about US airline companies, annotated based on the emojis present in the tweets. (5) The first GOP debate dataset⁵ contains tweets related to the debate between Trump and Clinton. We refer to these datasets as the News, SST, Amazon, US airlines, and FGD datasets, respectively, in the rest of the paper. We also consider another dataset, the Microsoft Research paraphrases corpus [59], which contains valid paraphrases.

Algorithm 5 Clean_dataset
Input: $D = \{A_1 = [S_{11}, \ldots, S_{1m}], \ldots, A_n = [S_{n1}, \ldots, S_{nm'}]\}$; golden dataset $B$
Output: Clean dataset
// Step 1: Get the distance threshold
t = Calculate_threshold(B)
// Step 2: Verify that generated elements respect the max threshold t
for each $A_i \in D$:
    NewA = []
    $S_o = A_i[0]$
    for each $S_{ij} \in A_i$:
        if WMD($S_{ij}$, $S_o$) < t:
            NewA.add($S_{ij}$)
    list.add(NewA)
return list

Augmented Datasets. We extend the previously mentioned sentiment-labeled datasets using the method described in [35], which automatically generates syntactically controlled paraphrases for each document (between 2 and 10 documents) that should have the same polarity.

Refined Datasets. The problem with the tool in [35] is that it produces wrong predictions in 20% of the cases, which may affect the quality of the analysis. To resolve this problem and to improve the quality of the benchmark by reducing the error rate, we propose a protocol that minimizes the human effort and the cost of data quality verification. The two-step protocol is described in Algorithm 5, which produces a clean dataset, and Algorithm 6, which calculates the similarity threshold needed in step 2 of the protocol.

We consider two analogical datasets $B$ and $D$, where $B$ is the valid set of paraphrases from the Microsoft Research paraphrases corpus [59] and $D$ is generated automatically. We calculate the Word Mover's Distance (WMD) [42] between each pair of paraphrases in $B$. For that, we consider the word2vec embedding matrix $X \in \mathbb{R}^{m \times n}$ such that $m$ is the size of the word embedding vector [51]. The distance between two words is the Euclidean distance between their embedding vectors, called the transportation cost $c$. We also define the transformation matrix $T \in \mathbb{R}^{n \times n}$, which represents the proportion of each word in the transformation of the document $d$ to the document $d'$. The weights of this matrix must verify the conditions $\sum_{j=1}^{n} T_{ij} = d_i$ and $\sum_{i=1}^{n} T_{ij} = d'_j$. Then, WMD is the minimum of the objective function:

$$\min_{T \geq 0} \sum_{i,j=1}^{n} T_{ij}\, c(i, j) \quad \text{such that} \quad \sum_{j=1}^{n} T_{ij} = d_i \ \text{and} \ \sum_{i=1}^{n} T_{ij} = d'_j.$$

Once the WMD has been calculated between all couples of paraphrases in $B$, we calculate the population's median value, which we consider as the threshold distance. The main steps of this part of the method are described briefly in Algorithm 6.

Algorithm 6 Calculate_threshold
Input: Golden dataset $B = \{A_1 = [S_{11}, S_{12}], \ldots, A_n = [S_{n1}, S_{n2}]\}$
Output: Similarity threshold t
for each $A_i \in B$:
    // Step 1: Calculate the WMD distance
    dist = WMD($A_i[0]$, $A_i[1]$)
    list.add(dist)
// Step 2: Calculate the threshold distance
t = median(list)
return t

Table 3: Statistics of used datasets.

| Dataset     | # Elements | # Clusters | # Positive | # Neutral | # Negative |
|-------------|------------|------------|------------|-----------|------------|
| Amazon      | 21483      | 10704      | 16944      | 2028      | 2511       |
| News        | 1580       | 831        | 890        | 36        | 654        |
| SST         | 3504       | 1033       | 1492       | 628       | 1384       |
| FGD         | 28192      | 9298       | 4319       | 6296      | 17577      |
| US airlines | 38236      | 12894      | 5107       | 6816      | 26313      |
| Total       | 92995      | 34760      | 28752      | 15804     | 48439      |

³ https://www.mturk.com/
⁴ https://www.kaggle.com/crowdflower/twitter-airline-sentiment
⁵ https://www.kaggle.com/crowdflower/first-gop-debate-twitter-sentiment

After defining the maximum distance threshold 𝑡, we calculate the WMD between each pair of documents 𝑆𝑖𝑗 and 𝑆𝑖𝑘 in each analogical set 𝐴𝑖 of 𝐷. Then, we keep only the generated documents whose distance is lower than 𝑡. The output of this step is a refined dataset of analogical sets containing semantically close documents. In the rest of the paper, we use only the refined datasets, except for training machine learning models, for which we use the base datasets. For simplicity, we refer to the generated datasets using the same names as the base datasets. For example, 𝑆𝑆𝑇 refers to the extended, then refined version of the dataset 𝑆𝑆𝑇. The statistics of our benchmark are displayed in Table 3.
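As an illustration, the threshold computation of Algorithm 6 and the filtering step described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used for the benchmark: `token_distance` is a hypothetical stand-in for WMD (a Jaccard distance over tokens, so that the example runs without word embeddings), and `refine` assumes that the first element of an analogical set is the original document against which each generated paraphrase is compared.

```python
from statistics import median
from typing import Callable, List, Sequence, Tuple

Distance = Callable[[str, str], float]


def token_distance(a: str, b: str) -> float:
    """Hypothetical stand-in for WMD: Jaccard distance over lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def calculate_threshold(golden: Sequence[Tuple[str, str]], dist: Distance) -> float:
    """Algorithm 6: median of the distances over the golden paraphrase pairs."""
    return median(dist(s1, s2) for s1, s2 in golden)


def refine(analogical_set: Sequence[str], t: float, dist: Distance) -> List[str]:
    """Keep the source document plus the paraphrases whose distance to it is below t.

    Assumption: the first element of the set is the original document.
    """
    source, generated = analogical_set[0], analogical_set[1:]
    return [source] + [g for g in generated if dist(source, g) < t]
```

Passing the distance as a parameter keeps the two steps independent of the embedding model, so a real WMD implementation can be substituted for the toy distance without changing the protocol.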

Metrics. In this section, we present the metrics used in our experiments besides the WMD described previously:

- We evaluate Accuracy as the rate of correctly predicted polarities compared to the golden polarity (human annotation):

$$Accuracy = \frac{\sum_{i=1}^{n} \mathbb{1}\left(P_h(d_i) = P_*(d_i)\right)}{n}$$

- The Cos similarity between two vectors $V$ and $U$ is $Cos\_sim(V, U) = \frac{U \cdot V}{\|V\| \, \|U\|}$.

- We evaluate the intra-tool inconsistency rate for each document $d_i$ in the analogical set $A$ and sentiment analysis tool $t_k$ as the proportion of documents $d_j$ that have a different polarity than $P_{t_k}(d_i)$ with regard to the tool $t_k$. We write:

$$\forall d_i \in A,\ P_{t_k} \in \Gamma, \quad inc_{in}(d_i, P_{t_k}) = \frac{card(S)}{n - 1} \quad \text{s.t.} \quad S = \{d_j \in A \mid P_{t_k}(d_i) \neq P_{t_k}(d_j)\} \tag{6}$$

The intra-tool inconsistency rate of an analogical set $A$ is the mean of the intra-tool inconsistency rates of its documents:

$$inc_{in}(A, P_{t_k}) = \frac{\sum_{j=1}^{n} inc_{in}(d_j, P_{t_k})}{n}, \quad n = card(A) \tag{7}$$

- We measure the inter-tool inconsistency rate for each tool $t_k$ and document $d_j$ as the rate of tools $t_{k'}$ that give the document $d_j$ a different polarity than $P_{t_k}$. We write:

$$\forall P_{t_k} \in \Gamma,\ \forall d_j \in A, \quad inc_{inter}(d_j, P_{t_k}) = \frac{card(S')}{m - 1} \quad \text{s.t.} \quad S' = \{P_{t_{k'}} \in \Gamma \mid P_{t_{k'}}(d_j) \neq P_{t_k}(d_j)\} \tag{8}$$

The inter-tool inconsistency rate in the set $\Gamma$ is the mean of the inconsistency rates of the different tools:

$$inc_{inter}(d_j, \Gamma) = \frac{\sum_{k=1}^{m} inc_{inter}(d_j, P_{t_k})}{m}, \quad m = card(\Gamma) \tag{9}$$

Figure 1: Intra-tool inconsistency distribution

Figure 2: Inter-tool inconsistency rate
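The inconsistency rates of Equations (6) to (9) share the same shape: for each element, the share of the other elements carrying a different label, then the mean over all elements. The following Python sketch makes this explicit. The function names are ours, and the convention that a singleton list has inconsistency 0 is an assumption, since the equations are undefined when n = 1 or m = 1.

```python
from typing import Sequence


def disagreement(labels: Sequence[str], i: int) -> float:
    """Share of the other labels that differ from labels[i] (Equations 6 and 8)."""
    others = len(labels) - 1
    if others == 0:
        return 0.0  # assumption: a single element is treated as consistent
    differing = sum(1 for j, lab in enumerate(labels) if j != i and lab != labels[i])
    return differing / others


def mean_disagreement(labels: Sequence[str]) -> float:
    """Mean of the per-element disagreement rates (Equations 7 and 9)."""
    return sum(disagreement(labels, i) for i in range(len(labels))) / len(labels)


# Intra-tool: the polarities one tool assigns to the documents of an analogical set.
intra = mean_disagreement(["Positive", "Positive", "Negative"])

# Inter-tool: the polarities all tools assign to one document.
inter = mean_disagreement(["Positive", "Negative", "Neutral"])
```

For the intra-tool rate, `labels` holds the predictions of one tool over an analogical set. For the inter-tool rate, it holds the predictions of all tools for one document.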

5.2 Statistical Analysis

Our goal is to answer the following questions: (1) Is intra-tool inconsistency a rare event? (2) What types of inconsistencies exist? (3) Are there domains that are more prone to inconsistencies?

5.2.1 Intra-tool Inconsistency Rate. The purpose of this experiment is to evaluate the intra-tool inconsistencies that occur in different sentiment analysis tools. For this, we calculate the inconsistency of each analogical set using Equation (7), then we calculate the mean inconsistency of each tool on each dataset. We display the results in Figure 1. Sub-figures 1-6 represent the tools' mean inconsistency on the datasets, while sub-figure 7 shows the proportion of analogical sets that have a non-zero intra-tool inconsistency. We notice that 44% of analogical sets have a non-zero intra-tool inconsistency on 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛, 40% on 𝑃𝑡𝑒𝑥𝑡_𝑐𝑛𝑛, 26.6% on 𝑃𝑠𝑒𝑛𝑡𝑖𝑤𝑜𝑟𝑑𝑛𝑒𝑡, 33.6% on 𝑃𝑠𝑒𝑛𝑡𝑖𝑐𝑛𝑒𝑡, 44.8% on 𝑃𝑟𝑒𝑐_𝑛𝑛, and 8.5% on 𝑃𝑣𝑎𝑑𝑒𝑟. The intra-tool inconsistency degree is high on the machine learning-based methods 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛 and 𝑃𝑟𝑒𝑐_𝑛𝑛 (averages of 0.17 and 0.188, respectively) and lower on the lexicon and rule-based methods (0.104 on 𝑃𝑠𝑒𝑛𝑡𝑖𝑤𝑜𝑟𝑑𝑛𝑒𝑡 and 0.11 on 𝑃𝑣𝑎𝑑𝑒𝑟). We also notice a low inconsistency degree of 0.11 on 𝑃𝑡𝑒𝑥𝑡_𝑐𝑛𝑛. We can therefore deduce that lexicon-based methods are more consistent than machine learning methods. We also notice that tools are more inconsistent on the 𝐹𝐺𝐷 and US airlines datasets, and that inconsistency is high on 𝑆𝑆𝑇 even for 𝑃𝑟𝑒𝑐_𝑛𝑛, which was trained on the 𝑆𝑆𝑇 dataset.

Summary: Intra-tool inconsistencies are frequent across different sentiment analysis tools, which motivates a more in-depth study to uncover the causes of such anomalies in the next sections (Sections 5.3 and 5.4).

5.2.2 Inter-tool Inconsistency Rate. In this experiment, we calculate the inter-tool inconsistency degree according to Equation (8), then we calculate the mean inter-tool inconsistency of the tools on the datasets. Sub-figures 1-6 of Figure 2 report the mean inter-tool inconsistency between tools on our datasets, and sub-figure 7 shows the percentage of documents where tools have an inconsistency degree of 1. We notice that there is a large degree of inter-tool inconsistency between different sentiment analysis tools. For example, the mean inter-tool inconsistency is 0.535 on 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛, 0.52 on 𝑃𝑟𝑒𝑐_𝑛𝑛, 0.575 on 𝑃𝑠𝑒𝑛𝑡𝑖𝑐𝑛𝑒𝑡, 0.547 on 𝑃𝑠𝑒𝑛𝑡𝑖𝑤𝑜𝑟𝑑𝑛𝑒𝑡, and 0.53 on 𝑃𝑡𝑒𝑥𝑡_𝑐𝑛𝑛. We also notice that 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛 contradicts all other tools (inconsistency degree = 1) in 12.35% of cases, 𝑃𝑠𝑒𝑛𝑡𝑖𝑤𝑜𝑟𝑑𝑛𝑒𝑡 in 11.7%, 𝑃𝑡𝑒𝑥𝑡_𝑐𝑛𝑛 in 18.2%, and 𝑃𝑣𝑎𝑑𝑒𝑟 in 13.7%.

The highest inter-tool inconsistencies are on the News dataset, with a mean inconsistency degree of 0.62. This can be explained by the challenging nature of the documents in this dataset, which are short in length, factual, and meaningful. Therefore, it is difficult to capture the polarity features of the sentences.

Summary: These experiments show that inter-tool inconsistency is a frequent event that occurs with a high degree on the Financial News dataset, which represents relevant data for critical decision-making processes. This motivates us to refine our analysis to unveil and explain the causes behind these inconsistencies in the next sections (Sections 5.2.4 and 5.2.5).

5.2.3 Intra-tool Inconsistency Type. In this experiment, we calculate the mean of polarity inconsistencies by class (see Definition 3.2.4). We represent the consistency of a document as a matrix $M \in \mathbb{R}^{3 \times 3}$, where $M_{ij}$ is the proportion of equivalent documents that received the polarities $i$ and $j$, respectively, from the sentiment analysis tool $k$. Formally, $M_{ij} = \frac{card(S)}{n}$ such that, for a document $d_z$, $S = \{d_{z'} \in A \mid P_k(d_z) = i \land P_k(d_{z'}) = j\}$. The consistency matrix of the analogical set $A_i$ is the average of the consistency matrices of the documents in $A_i$, and the inconsistency matrix of the dataset $D$ is the average of the inconsistency matrices of all analogical sets in $D$.

Figure 3: Intra-tool inconsistency type

Figure 4: Inter-tool inconsistency type
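The consistency matrix defined in Section 5.2.3 can be sketched in Python as follows. This is an illustrative reading of the definition, not the authors' code: the class ordering in `CLASSES` is our choice, and, following the set definition literally, a document is counted among its own equivalents, which is why the diagonal entries are non-zero even for a singleton set.

```python
from typing import List, Sequence

CLASSES = ("Negative", "Neutral", "Positive")  # assumed row/column order
INDEX = {c: k for k, c in enumerate(CLASSES)}


def document_matrix(polarities: Sequence[str], z: int) -> List[List[float]]:
    """Consistency matrix of document z: row = its polarity, column = polarity of each d_z'."""
    n = len(polarities)
    m = [[0.0] * 3 for _ in range(3)]
    i = INDEX[polarities[z]]
    for p in polarities:
        m[i][INDEX[p]] += 1.0 / n
    return m


def set_matrix(polarities: Sequence[str]) -> List[List[float]]:
    """Consistency matrix of an analogical set: average of its document matrices."""
    n = len(polarities)
    total = [[0.0] * 3 for _ in range(3)]
    for z in range(n):
        dm = document_matrix(polarities, z)
        for i in range(3):
            for j in range(3):
                total[i][j] += dm[i][j] / n
    return total
```

The off-diagonal entries of the resulting matrix correspond to inconsistencies, for instance the (Positive, Negative) entry for a set in which one tool predicts both polarities.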

In Figure 3, we display the intra-tool inconsistency matrices for each tool and dataset (inconsistencies are the off-diagonal values). This experiment demonstrates that the inconsistencies are not only of type (Neutral/Positive)⁶ or (Neutral/Negative). In fact, an important proportion of inconsistencies are of type (Positive/Negative). For instance, most inconsistencies on 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛 are of type Positive/Negative, while the degree of Positive/Negative inconsistencies is 0.038 on 𝑃𝑠𝑒𝑛𝑡𝑖𝑐𝑛𝑒𝑡, 0.043 on 𝑃𝑟𝑒𝑐_𝑛𝑛, and 0.038 on 𝑃𝑠𝑒𝑛𝑡𝑖𝑤𝑜𝑟𝑑𝑛𝑒𝑡. We observe that inconsistencies of type (Negative/Positive) are more frequent in learning-based tools than in lexicon-based tools. We also note that lexicon-based methods predict the polarity of an important proportion of documents as Neutral (0.28 for 𝑃𝑣𝑎𝑑𝑒𝑟 and 0.16 for 𝑃𝑠𝑒𝑛𝑡𝑖𝑤𝑜𝑟𝑑𝑛𝑒𝑡) compared to other tools, because those methods require the presence of a polar word in the document to classify it as Positive or Negative, which is not always the case in real data.

Summary: This experiment shows that we have several types of intra-tool inconsistencies including Negative/Positive.

5.2.4 Inter-tool Inconsistency Type. In this experiment, we categorize the prediction results of the sentiment tools on documents according to polarity. The inter-tool consistency is represented by a matrix $M \in \mathbb{R}^{3 \times 3}$, where $M_{ij}$ is the proportion of tools that attribute the polarities $i$ and $j$, respectively, to the document $d$. Formally, $M_{ij} = \frac{card(S)}{m}$, with $m$ the size of $\Gamma$, such that for a document $d_z \in A$ and a polarity function $P_{t_k} \in \Gamma$, $S = \{P_{t_{k'}} \in \Gamma \mid P_{t_k}(d_z) = i \land P_{t_{k'}}(d_z) = j\}$.

The experiment results are displayed in Figure 4. We observe that most inconsistencies between all tools are of type (Positive/Negative).

5.2.5 Discussion. In the above experiments, we have demonstrated that intra-tool inconsistency is a frequent anomaly in sentiment analysis tools, in particular for machine learning-based methods. After refining our analysis by categorizing inconsistencies by type, we found fewer intra-tool inconsistencies on methods that use a word dictionary. These methods require the presence of a polar word in the document to classify it as Positive or Negative. Otherwise, they consider the review as Neutral, contrary to machine learning methods and lexicon-based methods with a concept dictionary. We notice the most intra-tool inconsistencies on the 𝐹𝐺𝐷, US airlines, and SST datasets, and the most inter-tool inconsistencies on the News dataset. We explain the inconsistencies on the 𝐹𝐺𝐷 and US airlines datasets by the presence of abbreviations and syntax errors. We analyzed the News and SST datasets to unveil the causes of inconsistencies, and we found that in the News dataset, the documents are meaningful and short in length. Hence, it is difficult for tools to extract the sentiment features present in the document. On SST, we notice the presence of semantically equivalent items with different polarities. Table 4 presents a snapshot of inconsistencies on SST. We see that the documents with identifiers 43060 and 5 express the same information. However, one of them is missing punctuation (. and ’), which does not affect the document's polarity, and the movie's title (The Cockettes) is written without any uppercase letters, which generally does not affect the polarity. Certainly, uppercase words may embed sentiment, as mentioned in [29], but this impacts the strength of the sentiment and does not switch the polarity from Positive to Negative.

Table 4: Example of inconsistencies on the SST dataset

| Id | Document | Score | Polarity |
|---|---|---|---|
| 43060 | ’ (The Cockettes ) provides a window into a subculture hell-bent on expressing itself in every way imaginable. ’ | 0.625 | Positive |
| | ’ ( the cockettes ) provides a window into a subculture hell-bent on expressing itself in every way imaginable | 0.375 | Negative |
| 179516 | ’s a conundrum not worth solving | 0.38889 | Negative |
| 179517 | ’s a conundrum not worth solving | 0.5 | Neutral |

6 The notation "(Neutral/Positive)" refers to two analogical documents with polarities Neutral and Positive, respectively.

5.3 Structural Analysis

In this section, we study the impact of analogical set structure on the intra-tool inconsistency by checking whether inconsistencies are due to the semantic or the syntactic distance between documents. In other words, we check whether the inconsistencies depend on 𝐶𝑜𝑠_𝑆𝑖𝑚, a measure that captures the syntactic difference between sentences, or on 𝑊𝑀𝐷_𝑆𝑖𝑚, a measure that captures their semantics. We calculate the Cos similarity between the bag-of-words representations of the documents. The Cos similarity measures the syntactic similarity between the documents, since we consider synonyms and abbreviations as different words. The WMD-based similarity is calculated between the word2vec representations of the words in the documents, which allows us to handle semantics. In this experiment, we verify the relation between the syntactic difference measured by 𝐶𝑜𝑠_𝑆𝑖𝑚, the semantic difference measured by 𝑊𝑀𝐷_𝑆𝑖𝑚, and the inconsistencies between every two analogical documents $d_i, d_j \in A$.

For this, we first represent the documents as bags of words such that $d_i = [v_1, \dots, v_n]$, with $n$ being the vocabulary size and $v_w$ the number of occurrences of the word $w$ in the document $d_i$. Then, we calculate $Cos\_Sim(d_i, d_j)$ between each two documents $d_i$ and $d_j$ of $A$. We also calculate the WMD between the word2vec representations of the two documents $d_i, d_j \in A$ and convert it to a similarity score using $WMD\_Sim(d_i, d_j) = \frac{1}{1 + WMD(d_i, d_j)}$. After that, we calculate the ratio $\delta = \frac{Cos\_Sim(d_i, d_j)}{WMD\_Sim(d_i, d_j)}$, which is high for large values of 𝐶𝑜𝑠_𝑆𝑖𝑚 and low values of 𝑊𝑀𝐷_𝑆𝑖𝑚, and low for large values of 𝑊𝑀𝐷_𝑆𝑖𝑚 and low values of 𝐶𝑜𝑠_𝑆𝑖𝑚. The inconsistency of a pair of documents is given by the indicator $\mathbb{1}(P_{t_k}(d_i) \neq P_{t_k}(d_j))$ for all $d_i, d_j \in A$ and $P_{t_k} \in \Gamma$. We group the results by 𝛿 value (X-axis) and calculate the proportion of inconsistencies (adversarial examples) among the total number of pairs that have that 𝛿 value (Y-axis). The results are displayed in Figure 5. We notice that we have more inconsistencies for a low 𝛿 (a high 𝑊𝑀𝐷_𝑆𝑖𝑚 and a low 𝐶𝑜𝑠_𝑆𝑖𝑚) on all tools and all datasets than for a high 𝛿 (a low 𝑊𝑀𝐷_𝑆𝑖𝑚 and a high 𝐶𝑜𝑠_𝑆𝑖𝑚), which indicates that sentiment analysis tools are strongly influenced by the syntactic variation of the sentence.

Figure 5: Inconsistencies and similarity

Table 5: Accuracy and inconsistency (Acc: accuracy; Intra: intra-tool inconsistency; Inter: inter-tool inconsistency)

| Tools | 𝐴𝑚𝑎𝑧𝑜𝑛 Acc | 𝐴𝑚𝑎𝑧𝑜𝑛 Intra | 𝐴𝑚𝑎𝑧𝑜𝑛 Inter | 𝑁𝑒𝑤𝑠 Acc | 𝑁𝑒𝑤𝑠 Intra | 𝑁𝑒𝑤𝑠 Inter | 𝑆𝑆𝑇 Acc | 𝑆𝑆𝑇 Intra | 𝑆𝑆𝑇 Inter | 𝐹𝐺𝐷 Acc | 𝐹𝐺𝐷 Intra | 𝐹𝐺𝐷 Inter | US airlines Acc | US airlines Intra | US airlines Inter | Mean Acc | Mean Intra | Mean Inter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛 | 46.6% | 0.18 | 0.54 | 47.9% | 0.197 | 0.608 | 76.14% | 0.078 | 0.43 | 47.4% | 0.24 | 0.56 | 62.4% | 0.175 | 0.538 | 56.1% | 0.17 | 0.535 |
| 𝑃𝑟𝑒𝑐_𝑛𝑛 | 52.5% | 0.155 | 0.45 | 40.52% | 0.188 | 0.649 | 65.08% | 0.175 | 0.46 | 54.6% | 0.22 | 0.538 | 64.4% | 0.2 | 0.513 | 55.42% | 0.188 | 0.522 |
| 𝑃𝑠𝑒𝑛𝑡𝑖𝑐𝑛𝑒𝑡 | 59.5% | 0.077 | 0.48 | 52.67% | 0.132 | 0.607 | 45.84% | 0.164 | 0.48 | 31.15% | 0.18 | 0.64 | 24.9% | 0.15 | 0.67 | 42.8% | 0.14 | 0.575 |
| 𝑃𝑠𝑒𝑛𝑡𝑖𝑤𝑜𝑟𝑑𝑛𝑒𝑡 | 41.98% | 0.07 | 0.52 | 33.01% | 0.102 | 0.636 | 48.64% | 0.137 | 0.46 | 41.69% | 0.159 | 0.57 | 45.85% | 0.17 | 0.55 | 42.23% | 0.104 | 0.547 |
| 𝑃𝑡𝑒𝑥𝑡_𝑐𝑛𝑛 | 84.02% | 0.03 | 0.47 | 49.04% | 0.167 | 0.647 | 46.3% | 0.09 | 0.48 | 70.01% | 0.13 | 0.547 | 64.11% | 0.15 | 0.51 | 62.696% | 0.113 | 0.53 |
| 𝑃𝑣𝑎𝑑𝑒𝑟 | 56.5% | 0.04 | 0.44 | 45.5% | 0.1 | 0.6 | 51.4% | 0.124 | 0.46 | 43.8% | 0.14 | 0.55 | 49.9% | 0.15 | 0.52 | 49.42% | 0.109 | 0.514 |
| MV | 67.22% | | | 56.3% | | | 61.4% | | | 53.8% | | | 60% | | | N/A | | |

Summary: We observed more inconsistencies for high values of 𝑊𝑀𝐷_𝑆𝑖𝑚 and low values of 𝐶𝑜𝑠_𝑆𝑖𝑚 on all tools.
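To make the three quantities of this experiment concrete, the sketch below computes the bag-of-words Cos similarity, converts a WMD value into a similarity with 1 / (1 + WMD), and derives the ratio 𝛿. It is a minimal illustration under assumptions: tokenization is a plain lowercase whitespace split, and the WMD value is passed in as a number, since computing it requires word embeddings that are outside the scope of the example.

```python
import math
from collections import Counter


def cos_sim(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two documents."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty document has similarity 0
    return dot / (norm_a * norm_b)


def wmd_sim(wmd: float) -> float:
    """Convert a WMD distance (>= 0) into a similarity in (0, 1]."""
    return 1.0 / (1.0 + wmd)


def delta(doc_a: str, doc_b: str, wmd: float) -> float:
    """Ratio Cos_Sim / WMD_Sim used to group document pairs."""
    return cos_sim(doc_a, doc_b) / wmd_sim(wmd)
```

Two paraphrases that share few words but are close in embedding space have a low Cos similarity and a high WMD similarity, hence a low 𝛿, which is the region where the experiment reports the most inconsistencies.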

5.3.1 Discussion. In the above experiment, we have evaluated the relationship between analogical set structure, document structure, and inconsistency. The results show that most tools have a large proportion of inconsistencies between documents that have a high degree of semantic similarity and a low degree of syntactic similarity. We learn from these experiments that the structure of the sentence affects the tools' vulnerability to inconsistencies. According to [31], the document structure may embed a sentiment, but not every modification of the structure implies a sentiment variation, as shown in the examples of Table 4. It is crucial to consider this point when developing tools, since most sentiment analysis applications run in uncontrolled environments like social media and review websites. In an uncontrolled environment, there is a lot of unintentional document restructuring that does not imply a polarity switch.

5.4 Semantic Analysis

In this section, we answer the following questions: (1) Are inconsistencies frequent between polar facts or opinionated documents? (2) Do fewer inconsistencies imply higher accuracy? In other words, we evaluate the efficiency of tools in terms of accuracy and inconsistency.

5.4.1 Inconsistencies and Polar Facts. In this experiment, we study whether inconsistencies occur only between polar facts (i.e., documents that do not include a polar word, such as "this product does its job"), or also between opinionated documents (i.e., documents that contain polar words, such as "yeah! this product does its job"). For this, we first extract two sub-datasets from the original ones by separating polar facts from opinionated documents using the word-lexicon-based method 𝑃𝑠𝑒𝑛𝑡𝑖𝑤𝑜𝑟𝑑𝑛𝑒𝑡. We first eliminate objective documents based on the ground truth, i.e., documents with Neutral polarity. We then consider the documents classified as Positive or Negative to be opinionated, and the Positive and Negative documents misclassified as Neutral to be polar facts. Then, we calculate the proportion of intra-tool inconsistencies for the different tools on the opinionated/fact sub-datasets (Y-axis). The results are displayed in Figure 6, where 𝑠𝑢𝑏_𝑠𝑢𝑏 represents the proportion of inconsistencies between opinionated documents, 𝑓𝑎𝑐𝑡_𝑓𝑎𝑐𝑡 represents the proportion of inconsistencies between polar facts, and 𝑠𝑢𝑏_𝑓𝑎𝑐𝑡 the proportion of inconsistencies between polar facts and opinionated documents. We notice that the largest proportion of inconsistencies is between polar facts and opinionated documents for all tools and datasets. However, we have no inconsistencies between polar facts in lexicon-based methods, because those tools classify polar facts as Neutral most of the time.

Figure 6: Inconsistencies and polar fact

Table 6: Accuracy after resolving inconsistencies using MV

| Dataset | 𝑃𝑡𝑒𝑥𝑡_𝑐𝑛𝑛 | 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛 | 𝑃𝑠𝑒𝑛𝑡𝑖𝑐𝑛𝑒𝑡 | 𝑃𝑠𝑒𝑛𝑡𝑖𝑤𝑜𝑟𝑑𝑛𝑒𝑡 | 𝑃𝑟𝑒𝑐_𝑛𝑛 | 𝑃𝑣𝑎𝑑𝑒𝑟 | MV |
|---|---|---|---|---|---|---|---|
| 𝐴𝑚𝑎𝑧𝑜𝑛 | 84.4% | 49.26% | 60.7% | 42.8% | 54.5% | 57.3% | 66.7% |
| 𝑁𝑒𝑤𝑠 | 50.6% | 51.4% | 53.67% | 33.48% | 47.4% | 47.2% | 57.4% |
| 𝑆𝑆𝑇 | 45.9% | 77.3% | 47.9% | 49% | 66% | 51.5% | 63.4% |
| 𝐹𝐺𝐷 | 49.2% | 57.5% | 30.9% | 42.07% | 71.05% | 44.1% | 58.7% |
| US airlines | 64.6% | 68.1% | 24.5% | 46.11% | 65.4% | 50.3% | 65.5% |

For other methods, we notice a difference in the proportion of inconsistencies. For instance, on 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛, 𝑃𝑟𝑒𝑐_𝑛𝑛, and 𝑃𝑡𝑒𝑥𝑡_𝑐𝑛𝑛, we have more inconsistencies between polar facts than between opinionated documents, with proportions of 29.8%, 29.6%, and 32.2%, respectively. On 𝑃𝑠𝑒𝑛𝑡𝑖𝑐𝑛𝑒𝑡, we have more inconsistencies between opinionated documents, with a proportion of 27.3%.

Summary: This experiment shows that most inconsistencies occur between polar facts and opinionated documents, which motivates us to run more experiments and verify which of the two is closer to the ground truth.
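The partition into polar facts and opinionated documents, and the breakdown of inconsistencies by pair type, can be sketched as follows. This is one possible reading of the experiment, stated as an assumption: we interpret the reported proportions as shares of the inconsistent pairs of an analogical set, and the function names are ours.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Optional, Sequence, Tuple


def document_nature(gold: str, lexicon_pred: str) -> Optional[str]:
    """Label a document 'sub' (opinionated) or 'fact' (polar fact).

    Objective documents (gold polarity Neutral) are discarded (None). A polar
    document that the word-lexicon tool labels Neutral is a polar fact.
    """
    if gold == "Neutral":
        return None
    return "fact" if lexicon_pred == "Neutral" else "sub"


def inconsistency_shares(docs: Sequence[Tuple[str, str]]) -> Dict[str, float]:
    """Share of inconsistent pairs per pair type.

    docs holds (nature, polarity predicted by the evaluated tool) for the
    documents of one analogical set.
    """
    counts = Counter({"sub_sub": 0, "fact_fact": 0, "sub_fact": 0})
    for (nature_a, pol_a), (nature_b, pol_b) in combinations(docs, 2):
        if pol_a != pol_b:
            key = "sub_fact" if nature_a != nature_b else f"{nature_a}_{nature_a}"
            counts[key] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}
```

For example, a set with two opinionated documents predicted Positive and two polar facts predicted Neutral and Negative yields five inconsistent pairs, four of which mix a polar fact with an opinionated document.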

5.4.2 Inconsistencies and Accuracy. In this section, we show the relationship between inconsistency and accuracy in different tools. For this, we calculate on each dataset the accuracy and the mean intra-tool and inter-tool inconsistency rates following Equations (7) and (9). The results are summarized in Table 5. We observe a higher accuracy on the machine learning-based tools 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛, 𝑃𝑡𝑒𝑥𝑡_𝑐𝑛𝑛, and 𝑃𝑟𝑒𝑐_𝑛𝑛 (mean accuracies of 56.1%, 62.696%, and 55.42%, respectively) compared to the lexicon-based methods (mean accuracies of 42.23% on 𝑃𝑠𝑒𝑛𝑡𝑖𝑤𝑜𝑟𝑑𝑛𝑒𝑡, 49.42% on 𝑃𝑣𝑎𝑑𝑒𝑟, and 42.8% on 𝑃𝑠𝑒𝑛𝑡𝑖𝑐𝑛𝑒𝑡). We notice that a low intra-tool inconsistency does not imply good accuracy, since the lexicon-based methods have both a low intra-tool inconsistency and a low accuracy. We observe an inverse relation between inter-tool inconsistency and accuracy, which means that a high inter-tool inconsistency is a sign of low accuracy.
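One simple way to quantify the relation stated above is a Pearson correlation over the per-tool means of Table 5. The sketch below is only an illustration on six data points: the `pearson` helper is ours, the values are copied from the Mean columns of Table 5, and the paper does not state that this coefficient was used.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long, non-constant samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)


# Mean columns of Table 5, in the order:
# char_cnn, rec_nn, senticnet, sentiwordnet, text_cnn, vader
mean_acc = [56.1, 55.42, 42.8, 42.23, 62.696, 49.42]
mean_inter = [0.535, 0.522, 0.575, 0.547, 0.53, 0.514]

r_inter = pearson(mean_acc, mean_inter)
```

On these six points the coefficient for inter-tool inconsistency comes out negative, which agrees with the inverse relation described in the text. With so few points, it should be read as a descriptive summary rather than a statistical test.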

5.4.3 Hyper-parameters and Inconsistency. In this section, we focus on studying the impact of the learning hyper-parameters on the intra-tool inconsistency and the accuracy of the CNN. For that, we consider the basic CNN described in [37], which we train each time by varying one of the hyper-parameters: the training dataset, the embedding type, the batch size, the number of cross validation folds, the learning rate, the dropout probability, and the number of training epochs.

We answer the following questions: (1) Is intra-tool inconsistency present for all CNN configurations? (2) What is the best configuration that respects both accuracy and consistency? (3) Which training dataset guarantees a general and accurate model?

Inconsistency and training dataset. In this experiment, we evaluate the robustness of the learned model with respect to the training dataset. We trained the model each time on one of the original, non-extended datasets (Amazon, News, 𝑆𝑆𝑇, 𝐹𝐺𝐷, and US airlines) and tested the model on our benchmark. This experiment allows us to verify the generalization of the model on the extended dataset and on the other datasets. The results are presented in Figure 7 (Inc & training dataset) and Figure 8 (Acc & training dataset). We observe a mean accuracy of 50% on the model trained with the Amazon dataset (92.6% on 𝐴𝑚𝑎𝑧𝑜𝑛, 56% on 𝑁𝑒𝑤𝑠, 47% on 𝑆𝑆𝑇, 18% on 𝐹𝐺𝐷, and 20% on US airlines), 55.4% on the model trained with the 𝐹𝐺𝐷 dataset, 52.8% on the model trained with the News dataset, 58.2% on the model trained with 𝑆𝑆𝑇, and 54.4% on the model trained with US airlines. For the inconsistency rate, the model trained on the Amazon dataset has a mean inconsistency of 0.066, while the models trained on 𝐹𝐺𝐷, News, 𝑆𝑆𝑇, and US airlines achieve 0.124, 0.186, 0.15, and 0.1, respectively. We notice that the model trained with Amazon is more general, since it leads to good accuracy and inconsistency when applied to other datasets (i.e., news and reviews), but not to the tweets datasets, which contain many abbreviations and errors and need models specifically trained for them.

Inconsistency and embedding type. In this experiment, we check the robustness of the model with different embedding types: a pre-trained version of BERT as an embedding layer of the CNN, a word embedding using Google News vectors, a word embedding using GloVe, and an embedding combining both words and characters. The results of this experiment are presented in Figure 7 (Inc & embedding type) and Figure 8 (Acc & embedding type). We notice a mean accuracy of 58.7% on the model with the BERT embedding, 66.6% on the model with both word and character embeddings, 53.4% on the model with GloVe, and 64.1% on the model with the Google News vectors word embedding. We notice a high accuracy on social media data (𝐹𝐺𝐷 and US airlines datasets) on models that use both character embedding and word2vec embedding. We also notice a mean inconsistency degree of 0.117 on BERT and 0.158 on models with both character and word embeddings. On the models with a word embedding layer that uses GloVe, we notice a mean inconsistency of 0.104.

We notice that the model using the pre-trained BERT as embedding is not adapted to Twitter data. Unsurprisingly, the use of the pre-trained BERT for embedding has improved the model's generalization on the News and Amazon datasets. We can explain this generalization by the context-dependent nature of BERT, which allows us to generate different embeddings of a word depending on its context. The best (lowest) inconsistency score was obtained on the model that uses the pre-trained GloVe embeddings.

Inconsistency and batch size. In this experiment, we train the model described in [37] on the News dataset with batch sizes of 50, 100, 150, 200, 300, and 500. On the News dataset, we observe that the accuracy decreases when the batch size increases (from 0.756 for a batch size of 50 to 0.724 for a batch size of 500). On the Amazon dataset, the accuracy increases with the batch size (from 0.56 for a batch size of 50 to 0.68 for a batch size of 500). On the 𝑆𝑆𝑇 dataset, the accuracy slightly drops from 0.455 to 0.433. We also observe that, contrary to the other datasets, inconsistency increases on the News dataset. The optimal batch size is 300. A smaller batch size does not help the model generalize well, as good inconsistency and accuracy results are only observed for the training dataset.

Inconsistency and cross validation. In this experiment, we evaluate the impact of 𝑛-fold cross validation on both the inconsistency and the accuracy. We trained the model [37] on the News dataset and varied 𝑛 using the values 2, 5, 10, and 20. We observe that the accuracy increases with the number of folds (𝑛) on all datasets, while the inconsistency decreases with 𝑛 on the News and Amazon datasets.

Cross validation increases the accuracy, consistency, and generalization of the model. We recommend using 10-fold cross validation to optimize the consistency/accuracy trade-off.

Inconsistency and dropout probability. In this experiment, we verify the impact of the dropout probability on the accuracy, inconsistency, and generalization of the model. To achieve this, we train our model and vary the dropout probability using the values 0.2, 0.3, and 0.5. The results are presented in Figure 7 (Inc & dropout p) and Figure 8 (Acc & dropout p). We notice that the inconsistency on the datasets decreases when the dropout probability increases. The accuracy increases with the dropout probability on all datasets except the Amazon dataset.

Inconsistency and learning rate. To evaluate the impact of the learning rate on inconsistency and accuracy, we train our model on the News dataset and vary the learning rate using the values 1𝑒−4, 1.5𝑒−4, 2𝑒−4, and 3𝑒−4. The results are presented in Figure 7 (Inc & learning rate) and Figure 8 (Acc & learning rate). We observe that when the learning rate increases, the inconsistency decreases on the News dataset and increases on the other datasets. We also observe that the accuracy increases with the learning rate on the News, 𝐹𝐺𝐷, and 𝑆𝑆𝑇 datasets and decreases on the Amazon and US airlines datasets. The learning rate affects both accuracy and inconsistency, since models with a low learning rate are more general and consistent.

The number of epochs. In this experiment, we verify the influence of the number of epochs on the inconsistency and the accuracy. We train the model for 50, 100, 200, 300, and 3000 epochs, then test the model using our benchmark. The results in Figure 8 (Acc & nb_epochs) show that the accuracy of the model on the datasets increases then decreases as the number of epochs increases, which means that the model loses its generalization when trained for a high number of epochs (overfitting). As for the inconsistency, Figure 7 (Inc & nb_epochs) shows that it first decreases then increases, which confirms the overfitting explanation. Even if training the model for a high number of epochs may improve its training accuracy, it degrades intra-tool consistency and generalization.

5.4.4 Discussion. From the findings of the previous experiments, we observe that inconsistencies are present on different 𝐶𝑁𝑁 configurations. We also notice that inconsistencies depend on the subjectivity of the document, and that most inconsistencies occur between polar facts and opinionated documents. To verify which of the two types is closer to the ground truth, we calculate the accuracy on the sub-datasets of opinionated documents and facts. We found that predictions on polar facts are more accurate on 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛, with an accuracy of 62% for polar facts and 57.2% for opinionated documents, while on other tools we notice a higher accuracy on opinionated data (74.5%, 68.4%, and 73.5% for opinionated data and 55.7%, 56.1%, and 64.4% for polar facts on 𝑃𝑟𝑒𝑐_𝑛𝑛, 𝑃𝑠𝑒𝑛𝑡𝑖𝑐𝑛𝑒𝑡, and 𝑃𝑡𝑒𝑥𝑡_𝑐𝑛𝑛, respectively). In this case, we can use the document nature as a feature when resolving inconsistency. We also learned that there is an inverse correlation between inter-tool inconsistency and accuracy. Therefore, the most accurate tools are consistent.

These findings motivate us to study the impact of resolving inconsistency on accuracy. We display in Table 6 the accuracy of the tools after resolving intra-tool inconsistency for each tool using majority voting (MV), and after resolving both intra-tool and inter-tool inconsistency (column MV). We observe an important improvement of the accuracy when resolving intra-tool inconsistency. This may be explained by the fact that a tool's prediction may be erroneous on some documents and correct on others. Hence, resolving intra-tool inconsistency by unifying the polarity in the analogical set may increase the accuracy in most cases.
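Majority voting over an analogical set can be sketched as follows. The sketch rests on assumptions that the text does not specify: ties are broken in favor of the label that appears first, and the combined intra-tool and inter-tool resolution is modeled as a vote within each tool followed by a vote across tools. Function names are ours.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Most frequent label of a non-empty sequence; ties go to the first-seen label."""
    counts = Counter(labels)
    top = max(counts.values())
    return next(lab for lab in labels if counts[lab] == top)


def resolve_intra(predictions: Sequence[str]) -> List[str]:
    """Unify the polarities one tool assigns to the documents of an analogical set."""
    winner = majority_vote(predictions)
    return [winner] * len(predictions)


def resolve_intra_and_inter(predictions_by_tool: Sequence[Sequence[str]]) -> str:
    """Vote inside each tool first, then across tools (assumed order of resolution)."""
    per_tool = [majority_vote(p) for p in predictions_by_tool]
    return majority_vote(per_tool)
```

After `resolve_intra`, every document of the set carries the same polarity, so the intra-tool inconsistency of Equation (7) is 0 by construction. Whether accuracy improves depends on whether the majority label matches the golden polarity.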

We observe an accuracy degradation when resolving inconsistencies on the 𝑆𝑆𝑇, 𝐹𝐺𝐷, and Amazon datasets, because the tools are not compatible in terms of accuracy on these datasets. To overcome this problem and guarantee the accuracy improvement, we recommend applying majority voting between tools that are compatible in terms of accuracy. Few works have exploited inconsistency to improve the accuracy of classification, such as the work in [60], where the authors minimize the inter-tool inconsistency of various labeling functions based on the accuracy, correlation, and label absence. These solutions outperform majority voting only for a low setting of redundancy (i.e., when the number of tools is low), which confirms that the inconsistency problem is not yet resolved. However, these works do not consider both intra-tool and inter-tool inconsistencies, which would lead to better results. Since the inconsistency problem has been widely studied in data management [5, 6, 20, 58] and in the fact inference field [24, 40, 81], where tangible progress has been made, we suggest referring to methods from fact inference, considering workers as tools in the formalization, to resolve inconsistency and improve accuracy [24, 40].

5.4.5 Scalability experiments. In this experiment, we evaluate the scalability of tools with respect to dataset and document size. Figures 9(a-b) show the execution time when we vary the data size: 25%, 50%, 75%, and 100% of the full benchmark. (Figure 9(b) focuses on the performance of 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛 and 𝑃𝑡𝑒𝑥𝑡_𝑐𝑛𝑛 only.) We observe that for lexicon-based methods and 𝑃𝑟𝑒𝑐_𝑛𝑛, the time scales (almost) linearly with the size. The machine learning methods 𝑃𝑡𝑒𝑥𝑡_𝑐𝑛𝑛 and 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛 are faster than the other tools. This is explained by the fact that these tools perform prediction in parallel (prediction by batch), which accelerates the prediction process.

Figure 7: Inconsistency and learning hyper-parameters

Figure 8: Accuracy and learning hyper-parameters

Figure 9: Scalability

We also evaluate the scalability of tools in terms of response time when varying the document size (number of words), using a synthetic dataset sampled from the word corpus of nltk. The results are presented in Figures 9(c-d), where the x-axis is the document size expressed as a number of words and the y-axis is the execution time. (Figure 9(d) focuses on the performance of 𝑃𝑐ℎ𝑎𝑟_𝑐𝑛𝑛 and 𝑃𝑡𝑒𝑥𝑡_𝑐𝑛𝑛 only.) We observe that the execution time scales linearly with the document size for all methods except the machine learning ones: these methods use a fixed-size embedding, and they crop or pad the text (if it is too long or too short, respectively), which makes them inaccurate for long text.
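The cropping and padding behavior mentioned above can be illustrated with a short sketch. The padding token and the function name are hypothetical, and the actual preprocessing of the evaluated tools may differ.

```python
from typing import List, Sequence


def pad_or_crop(tokens: Sequence[str], size: int, pad_token: str = "[PAD]") -> List[str]:
    """Return exactly `size` tokens: crop a long document, pad a short one."""
    if size < 0:
        raise ValueError("size must be non-negative")
    cropped = list(tokens[:size])
    return cropped + [pad_token] * (size - len(cropped))
```

Since the model always receives the same number of tokens, its prediction time does not grow with the document length, and any word beyond the fixed size is ignored, which is consistent with the loss of accuracy on long text.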

6 CONCLUSIONS

This paper presents the first large study on data quality for sentiment analysis tools regarding the quality of polarity extraction from documents across different domains. Our study offers several findings that are of interest to the machine learning, NLP, and data management communities.

We have built a benchmark, which we make available, along

with our code, for reproducibility and future research. We provide

statistical, structural, and semantic analysis for the inconsistency

problem, show that the inconsistency problem is not yet resolved,

and argue for the potential for improvement that can be obtained on

accuracy when resolving intra-tool and inter-tool inconsistencies.

Based on our analysis and findings, we present the following recommendations on which tool to use for which data type:

• For short text, we recommend CNN-based methods. Short texts have few sentiment features, which cannot be effectively handled by word-lexicon or concept-based methods.

• From the explainability point of view, we recommend lexicon-based methods, especially those based on concept lexicons, thanks to their performance in terms of accuracy.

• For streaming data with long text and a large window, we recommend an ensemble of lexicon-based methods that are compatible in terms of accuracy for inter-tool inconsistency. For small window sizes (i.e., number of documents < 1000 per 𝛿𝑡), we recommend CNN.

• For social media data, we recommend both character and word embeddings, since we can train the model to be fault tolerant. For reviews, word embeddings are enough, and for news and factual data, we recommend BERT embeddings.

• Inconsistencies are influenced by the analogical set structure and the document nature (fact or opinionated). Our insights have implications for various use cases, such as crowd-sourced fact checking, where we can accurately classify workers, remove inconsistent labels, or alleviate biased labeling.

• The scalability experiments point to interesting research opportunities for the data management community in order to improve the scalability properties of existing solutions to better serve the machine learning and NLP communities.
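The data-type recommendations in the list above can be read as a small lookup, sketched below. The function name, the data-type keys, and the treatment of 1000 or more documents per window as a "large" window are assumptions made for illustration; the explainability and research-direction items are not encoded.

```python
from typing import Optional


def recommend(data_type: str, docs_per_window: Optional[int] = None) -> str:
    """Map a data type to the family of tools recommended in the list above."""
    if data_type == "streaming_long_text":
        if docs_per_window is None:
            raise ValueError("streaming data needs a window size")
        # The list names < 1000 documents per time window as the small-window
        # case; treating everything else as "large" is an assumption.
        if docs_per_window < 1000:
            return "CNN"
        return "ensemble of lexicon-based methods"
    table = {
        "short_text": "CNN-based methods",
        "social_media": "character and word embeddings",
        "reviews": "word embeddings",
        "news": "BERT embeddings",
    }
    if data_type not in table:
        raise ValueError(f"no recommendation for {data_type!r}")
    return table[data_type]
```

For example, `recommend("streaming_long_text", docs_per_window=500)` selects the CNN branch, while `recommend("news")` returns the BERT embedding recommendation.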

To the best of our knowledge, no previous research work has

explored the problem of resolving both intra-tool and inter-tool

inconsistencies. The results of this study could be used to develop

more accurate frameworks in this area.

ACKNOWLEDGMENTS

We thank Khodor Hamoud for his support, and the reviewers for their numerous constructive comments. This work is partially supported by IMBA Consulting and the ANRT French program through the grant nr.2018/0576 e-reputation.


