
Iterative Feature Normalization Scheme for Automatic Emotion Detection from Speech

Carlos Busso, Senior Member, IEEE, Soroosh Mariooryad, Student Member, IEEE,

Angeliki Metallinou, Student Member, IEEE, and Shrikanth Narayanan, Fellow, IEEE

Abstract—The externalization of emotion is intrinsically speaker-dependent. A robust emotion recognition system should be able to compensate for these differences across speakers. A natural approach is to normalize the features before training the classifiers. However, the normalization scheme should not affect the acoustic differences between emotional classes. This study presents the iterative feature normalization (IFN) framework, which is an unsupervised front-end, especially designed for emotion detection. The IFN approach aims to reduce the acoustic differences between the neutral speech across speakers, while preserving the inter-emotional variability in expressive speech. This goal is achieved by iteratively detecting neutral speech for each speaker, and using this subset to estimate the feature normalization parameters. Then, an affine transformation is applied to both neutral and emotional speech. This process is repeated until the results from the emotion detection system are consistent between consecutive iterations. The IFN approach is exhaustively evaluated using the IEMOCAP database and a data set obtained under free uncontrolled recording conditions with different evaluation configurations. The results show that the systems trained with the IFN approach achieve better performance than systems trained either without normalization or with global normalization.

Index Terms—Emotion recognition, speaker normalization, emotion, features normalization

1 INTRODUCTION

AUTOMATIC emotion recognition has the potential to change the way humans interact and communicate with machines. Some of the domains that could be enhanced by adding emotional capabilities include call centers, games, tutoring systems, ambient intelligent environments and health care. However, the promising results shown by the early technological developments in automatic emotion recognition have not materialized into significant advances in real-life applications. One of the main barriers toward detecting the emotional state of a human is the inherent inter-speaker variability found in emotional manifestations. In this context, we propose a novel normalization scheme designed to reduce the speaker variability, while preserving the discrimination between emotional states.

Data normalization is an important aspect that needs to be considered for a robust automatic emotion recognition system [1]. The normalization step should reduce all sources of variability, while preserving the differences between normal and emotionally expressive speech. In particular, it should compensate for speaker variability. Speech production is the result of controlled movements of an individual’s vocal tract apparatus that include the lungs, trachea, larynx, pharyngeal cavity, oral cavity, and nasal cavity. As a result, the properties of speech are intrinsically speaker-dependent. For example, the fundamental frequency is physically constrained by the anatomy of the larynx, which explains some of the gender differences observed in speech. Likewise, a robust emotion recognition system should cope with differences in the recording conditions. The quality and property of the speech highly depend on the sensor and technology used to capture and transmit the speech (e.g., mobile versus landline systems, close-talking versus far-field environment microphones). Any mismatch in the recording condition between the training and testing speech sets will affect the features extracted from the signals. For instance, it is well-known that the energy tends to increase with angry or happy speech [2]. If the energy of the speech signal is not properly normalized, any difference in the microphone gain will affect the performance of the system (e.g., high signal amplitude speech may be confused with emotional speech).

Most of the current approaches to normalize speech or speech features are based on gross manipulation of the speech at utterance or dialog turn level. In many cases, either speech is not normalized or the approach is not clearly defined. Given the importance of the normalization step, the limited progress in this area is surprising. Some of the approaches that have been widely used are z-standardization (subtract the mean and divide by the standard deviation) [3], min-max normalization (scaling features between −1 and 1) [4], and subtraction of mean values [1]. All these manipulations applied across speakers and speech samples affect the discrimination between emotional classes. An interesting approach was proposed by Batliner et al. [5]. For a given lexical unit (i.e., word or phoneme), the speech features were normalized by estimating reference values for “average speakers” learned from a training database. These reference values were used to scale the duration and energy of the speech. However, this normalization scheme does not cope well with inter-speaker variation.

• C. Busso and S. Mariooryad are with the Erik Jonsson School of Engineering & Computer Science, The University of Texas at Dallas, Richardson TX 75080-3021. E-mail: [email protected], [email protected].

• A. Metallinou is with Pearson Knowledge Technologies, Menlo Park, CA 94025. E-mail: [email protected].

• S. Narayanan is with the Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90089. E-mail: [email protected].

Manuscript received 27 Mar. 2013; revised 13 Sept. 2013; accepted 03 Oct. 2013; date of publication 22 Oct. 2013; date of current version 13 Mar. 2014. Recommended for acceptance by R. Mihalcea. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/T-AFFC.2013.26

386 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 4, NO. 4, OCTOBER-DECEMBER 2013

1949-3045 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: University of the Cumberlands. Downloaded on July 24, 2021 at 03:25:03 UTC from IEEE Xplore. Restrictions apply.

In previous work, we proposed a speaker-dependent normalization scheme [6], [7]. The main idea of the approach is to estimate the linear scaling parameters for each speaker based only on his/her neutral speech. Then, the normalization parameters are applied to all speech samples from that speaker, including the emotional speech set. Given that the normalization is an affine transformation, the scaling factors will not affect emotional discrimination in the speech, since the differences observed in the features across emotional categories will be preserved. The rationale behind this normalization is that, while individuals may express emotions differently, the patterns observed in their neutral utterances should be similar across speakers. The assumptions made in this normalization approach are that the identity of the subjects is known, and that the labels for a neutral speech set are available for each speaker. These assumptions are not realistic in many practical applications. This paper describes an iterative feature normalization (IFN) scheme to overcome this issue in the context of emotion detection [8], [9]. This unsupervised feature normalization approach extends the aforementioned ideas by iteratively detecting a neutral speech subset for each speaker, which is used to estimate his/her normalization parameters. The speaker-dependent scaling parameters are then applied to the entire corpus including expressive speech. The approach is implemented using z-normalization of statistics derived from acoustic features. Notice that the mean and standard deviation are estimated only from the detected neutral subset.

Although the IFN approach removes the assumption that neutral speech is available to estimate the normalization parameters, it still requires the speaker identity (i.e., it is a speaker-dependent normalization scheme). However, our results suggest that the normalization is not sensitive to errors in automatically determined speaker assignments. The experimental results demonstrate the potential of the proposed approach.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 introduces the proposed normalization scheme for emotion recognition. Section 4 describes the database and acoustic features used in this study. Section 5 presents the experimental design and results of the proposed normalization scheme. The paper concludes with Section 6, which gives the discussion, future work and final remarks.

2 BACKGROUND

2.1 Motivation

A variety of acoustic features have been used to recognize emotions from speech [10], [11]. Features that are commonly used include the fundamental frequency, energy, formants, speech rate, and voice quality features. These features are not only affected by the externalization of emotion, but also by the speaker and phonetic variabilities. Therefore, a normalization scheme is important to reduce variability while preserving the emotion-related patterns. This normalization step is critical for building a robust speech emotion recognition system.

Let us consider, for example, the fundamental frequency (F0) mean, which is widely used as a feature for emotion recognition. In fact, our previous analysis has indicated that the F0 mean is one of the most emotionally prominent aspects of the F0 contour, when properly normalized [6]. The fundamental frequency is directly constrained by the structure and size of the larynx and vocal folds [12]. The F0 contour for men is in the range 50-250 Hz, while for women it is higher (120-500 Hz) [12]. Fig. 1a shows the distribution of the F0 mean for neutral and angry sentences recorded from five men and five women (IEMOCAP corpus, see Section 4.1). Although angry speech has higher F0 values than neutral speech [13], speaker variability increases the confusion between both emotional classes. As a result, emotional differences are blurred by inter-speaker differences. Fig. 1b shows the same distribution after speaker normalization (see Section 3). The overlap between both classes is now reduced. The shift in the distributions can be directly associated with emotional variations.

Fig. 1. Speaker versus emotion variability. (a) F0 mean distribution without normalization, (b) F0 mean distribution with normalization (see Section 3).

BUSSO ET AL.: ITERATIVE FEATURE NORMALIZATION SCHEME FOR AUTOMATIC EMOTION DETECTION FROM SPEECH 387

2.2 Related Work

Despite the importance of feature normalization, as illustrated in Section 2.1, few studies have addressed this problem for emotion detection and recognition. Various types of normalization schemes can be implemented according to the target problem. For example, speaker normalization can be used to compensate for speaker-dependent variability. However, this scheme assumes that we know or can infer the speaker identity. Similarly, a normalization scheme can be used to compensate for phonetic (lexical) variability. This approach assumes that either the transcriptions or an automatic speech recognition (ASR) system is available. For multi-corpora studies, feature normalization can be designed to compensate for variability in recording conditions. This approach can prove to be useful for the deployment of emotion recognition systems in real applications. This section discusses some of the most common approaches for feature normalization. Table 1 summarizes the feature normalization approaches proposed in previous studies.

A simple approach that requires no speaker or phonetic information is to normalize all the available features regardless of the speaker or lexical content. A common approach is z-standardization, in which the features are separately normalized by subtracting their mean and dividing by their standard deviation (i.e., zero mean, unit variance). These parameters are estimated across the entire data [3], [14], [15], [16], [17]. An alternative normalization method is min-max normalization, where features are scaled within a predefined range (e.g., from −1 to 1) [4], [16], [18]. Pao et al. [19] used zero mean normalization for all features. Another global feature normalization is the approach described by Yan et al. [20]. They proposed an exponential transformation to normalize speech features so that they follow a normal distribution. They showed that the approach generates features that improve the performance of emotion classification using quadratic discrimination functions (QDFs).
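The two global schemes described above can be illustrated with a minimal sketch. It assumes a single feature stored as a plain list of values pooled over all speakers; the function names `z_standardize` and `min_max_scale` are hypothetical, and the use of the population standard deviation is an assumption, since the cited studies do not all state which estimator they use.

```python
from statistics import mean, pstdev


def z_standardize(values):
    """Global z-standardization: zero mean and unit variance over all samples."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - mu) / sigma for v in values]


def min_max_scale(values, low=-1.0, high=1.0):
    """Global min-max normalization: map the observed range onto [low, high]."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        return [(low + high) / 2 for _ in values]
    scale = (high - low) / (largest - smallest)
    return [low + (v - smallest) * scale for v in values]


# Hypothetical F0 means (Hz) pooled across speakers.
f0_means = [100.0, 200.0, 300.0]
z_scores = z_standardize(f0_means)
scaled = min_max_scale(f0_means)
```

Because both functions pool all speakers, a high-pitched neutral speaker and a low-pitched angry speaker can end up with similar normalized values, which is the limitation the speaker-dependent schemes try to address.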

Previous studies have considered speaker-dependent normalizations. A common approach is to use z-standardization by computing speaker-specific means and variances [16], [21], [22], [23], [24]. The min-max normalization scheme can also be implemented by considering speaker-specific transformations [16], [25]. Wollmer et al. [16] compared these normalization approaches (z-standardization, min-max and combination of both techniques), implemented in a speaker-dependent and speaker-independent manner. Their results, while instructive, are not conclusive, since the best normalization performance varies across the adopted machine learning approaches. Other speaker-specific normalization approaches include division of speech energy by its mean [1], [26], and the common cepstral mean subtraction of Mel frequency cepstral coefficients (MFCCs) [21], [27]. Sethu et al. [28] have described another interesting approach. They proposed a feature warping scheme that maps the initial feature distribution into a pre-determined distribution (i.e., standard normal distribution). They separately warp the features of each speaker, including his/her neutral and emotional samples. Their results indicate that the proposed speaker-dependent feature warping approach improves emotion recognition performance.

TABLE 1 Summary of Feature Normalization Approaches for Speech Emotion Recognition

Schuller et al. [24] explored the case when the training and testing partitions were created from different corpora. For this problem, they normalized the features by estimating corpus-specific and speaker-specific z-standardization parameters. A speaker and lexical dependent normalization approach was presented by Fu et al. [31]. The scheme uses relative features that capture the differences of a feature with respect to a neutral reference. This reference is computed using neutral, speaker-specific and utterance-specific samples from the database. However, it is not clear how this approach can be generalized for cases where the information about speaker, emotion and lexical content is not available. Mariooryad and Busso [29] proposed a shape-based lexical normalization scheme that uses the whitening transformation to compensate for the variability observed across different phonemes. Finally, Zeng et al. [30] presented the approach closest to this work. Each feature is divided by the feature mean, computed per speaker from held-out samples of neutral speech. These samples are assumed to be available in advance.

We have shown in our previous work that global normalization (GN) is not always efficient in improving the performance of an emotion recognition system [8]. We have argued that this type of normalization affects the discrimination between emotional classes. While estimating normalization parameters that are dependent on the underlying lexicon, speaker or emotional classes may give better performance, an unsupervised approach is needed for practical applications. This work proposes a robust unsupervised front-end to normalize the acoustic features for emotion detection. Although the approach is speaker-dependent, the evaluation results reveal that the scheme is not sensitive to errors made on the assumed speaker identity. This novel normalization approach produces higher accuracies in data obtained from both controlled and uncontrolled recording settings, as demonstrated in Section 5.

3 METHODOLOGY

3.1 Ideal Feature Normalization

We have previously proposed a speaker-dependent normalization scheme to compensate for inter-speaker variability and recording conditions [6], [7]. The main idea is to reduce the differences observed in neutral speech across speakers, while preserving the emotional discrimination observed between emotional categories. This goal is achieved by separately estimating the normalization parameters for each speaker using only his/her neutral speech. For each subject, the estimated normalization parameters are then applied to his/her entire speech data, including the emotional set. These normalizations are performed such that the properties of their neutral speech become similar across speakers (e.g., first and second order statistics). If the normalization uses an affine transformation, the relative differences between neutral and emotional speech in the feature space will be preserved.
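The claim that an affine transformation preserves relative differences can be checked with a short worked derivation. It assumes that the affine map is the z-standardization used later in Section 3.3, with neutral mean $\mu_{neu}$ and neutral standard deviation $\sigma_{neu} > 0$ for a given speaker and feature; the superscripts $emo$ and $neu$ are notation introduced here for two samples of that speaker.

```latex
\begin{align*}
\hat{f}^{\,emo} - \hat{f}^{\,neu}
  &= \frac{f^{emo} - \mu_{neu}}{\sigma_{neu}} - \frac{f^{neu} - \mu_{neu}}{\sigma_{neu}} \\
  &= \frac{f^{emo} - f^{neu}}{\sigma_{neu}} .
\end{align*}
```

The offset $\mu_{neu}$ cancels, and every within-speaker difference is scaled by the same positive constant $1/\sigma_{neu}$. The ordering and relative spacing of a speaker's neutral and emotional samples are therefore unchanged, while the neutral samples of all speakers are moved to zero mean and unit variance.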

The motivation for this ideal normalization (IN) scheme is given in Fig. 2. Fig. 2a shows a schematic characterization of emotional clusters in the feature space for two subjects before normalization. Emotional classes overlap given the intrinsic speaker dependency. For example, neutral samples from speaker 1 are mixed with sad samples from speaker 2. Fig. 2b describes the approach, which aims to normalize the corpus such that neutral sets across speakers have similar properties. Fig. 2c gives the clustering of the emotional classes in the feature space after this ideal normalization. Notice that this normalization approach is general and can be applied using different transformations (e.g., z-standardization, min-max normalization and feature warping). Likewise, it can be applied to other modalities such as facial features. For example, Zeng et al. [30] proposed to use neutral facial poses to normalize their facial features.

One assumption made in this approach is that neutral speech will be available for each speaker to estimate his/her normalization parameters. There are two major implications/limitations of this assumption: 1) the identities of the speakers are known, and 2) reference neutral speech is available for each speaker. For real-life applications, these assumptions are reasonable when the speakers are known, and a few seconds of their neutral speech can be prerecorded. For example, this speaker-dependent normalization scheme can be used for personalized interfaces with emotional capabilities designed for mobile devices. However, in many applications either the identity of the speaker is unknown or neutral speech data are not readily available. In those cases, this normalization scheme cannot be implemented.

3.2 Iterative Feature Normalization

Fig. 2. Schematic description of ideal feature normalization from neutral speech. (a) emotional classes in feature space before normalization, (b) corpus is normalized such that neutral portions of the corpus have similar statistics across speakers, and (c) emotional classes in feature space after normalization.

Given the limitations of the ideal feature normalization approach, the present paper proposes the IFN framework. This scheme is an unsupervised front-end that overcomes the requirement of having prerecorded neutral sets for the test speakers. However, we assume that for the training speakers, neutral speech samples are provided, which is a reasonable assumption given that classifiers are built with emotional labels. For each training speaker, the acoustic features are normalized with respect to the statistics extracted from their neutral samples (i.e., ideal normalization). Then, inspired by co-training strategies [32], the IFN approach iteratively detects the neutral speech set to estimate the normalization parameters for the test speaker. The parameters are then applied to all the samples of that speaker, including his/her emotional sentences, preserving the differences between emotional classes.

Fig. 3 describes the IFN approach. The following procedure is separately implemented for each speaker in the test data. First, the normalization parameters are initialized. Then, the speech features are normalized using an affine transformation, which can be implemented using various approaches (e.g., z-standardization, see Section 3.3). Then, an emotional speech detection algorithm is used to discriminate between neutral and emotional speech (details are given in Section 3.3). The emotional labels assigned to the speech files are compared with the emotional labels from the previous iteration. If the percentage of emotional labels that are modified during the iteration is higher than a given threshold, a new iteration is computed. An alternative stopping criterion is setting a predefined maximum number of iterations. For each iteration, only the speech samples labeled as neutral are used to estimate the normalization parameters. The performance of the emotion detection system is expected to increase as the estimation error of the normalization parameters decreases. In turn, a better classification result will produce a better estimate of the neutral subset used to compute the normalization parameters.
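The procedure above can be sketched for a single speaker and a single feature. This is a simplified illustration under stated assumptions rather than the implementation used in the paper: the detector is passed in as a hypothetical callable `detect_neutral` (the paper uses an SVM over many features), the threshold value `change_threshold=0.05` is an arbitrary placeholder, and the early exit when too few neutral samples are detected is an added safeguard.

```python
from statistics import mean, pstdev


def ifn_normalize(features, detect_neutral, init_mu, init_sigma,
                  change_threshold=0.05, max_iterations=10):
    """Iteratively normalize one speaker's values of one feature.

    features       : raw sentence-level values (non-empty list) for one speaker
    detect_neutral : hypothetical callable; takes the normalized values and
                     returns a list of booleans (True = detected as neutral)
    init_mu, init_sigma : normalization parameters for the first iteration
    """
    mu, sigma = init_mu, init_sigma
    previous_labels = None
    for iteration in range(1, max_iterations + 1):
        # Affine transformation with the current parameters.
        normalized = [(f - mu) / sigma for f in features]
        labels = detect_neutral(normalized)
        # Stop when few enough labels changed since the previous iteration.
        if previous_labels is not None:
            changed = sum(a != b for a, b in zip(labels, previous_labels))
            if changed / len(labels) <= change_threshold:
                break
        previous_labels = labels
        # Re-estimate the parameters from the detected neutral subset only.
        neutral = [f for f, is_neutral in zip(features, labels) if is_neutral]
        if len(neutral) < 2 or pstdev(neutral) == 0:
            break  # added safeguard: not enough data to re-estimate
        mu, sigma = mean(neutral), pstdev(neutral)
    return normalized, labels, iteration


# Toy usage: five neutral-like and three emotional-like F0 means (Hz).
speaker_f0 = [195.0, 200.0, 205.0, 198.0, 202.0, 255.0, 260.0, 265.0]
# Stand-in detector (an assumption): neutral if the normalized value is below 2.
toy_detector = lambda values: [v < 2.0 for v in values]
normalized, labels, iterations = ifn_normalize(
    speaker_f0, toy_detector, init_mu=190.0, init_sigma=30.0)
```

In this toy run the detected neutral subset stabilizes after the second pass, at which point the neutral samples have zero mean under the re-estimated parameters.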

Notice that the IFN scheme is a speaker-dependent normalization (i.e., the identity of the speaker is required). In some applications, it is safe to assume that only one speaker uses the system at a time (e.g., call center application). In other cases, we may be interested in tracking emotion in multi-person interaction (e.g., in smart room environment [33]). In these applications, the identity of the speakers should be predicted using either supervised or unsupervised speaker identification. However, the experiments in Section 5.3 reveal that the IFN approach is not very sensitive to errors made in identifying the speakers.

3.3 Implementation

The proposed approach in Fig. 3 is general and can be implemented with different affine transformations and emotion detection systems. In this paper, we implemented the approach with z-standardization and a support vector machine (SVM) classifier.

Equation (1) describes the z-standardization approach. This affine transformation aims to preserve the first (mean) and second (variance) order statistics between the sentence level features derived from the neutral subsets across speakers. For a given speech feature from the speaker $s$, $f_s$, its mean value, $\mu^{f_s}_{neu}$, and standard deviation, $\sigma^{f_s}_{neu}$, are estimated using only the neutral samples. Then, the normalized feature $\hat{f}_s$ is estimated as described in Equation (1).

$$\hat{f}_s = \frac{f_s - \mu^{f_s}_{neu}}{\sigma^{f_s}_{neu}}. \tag{1}$$
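As a small numerical illustration of Equation (1), the sketch below estimates the parameters from a speaker's neutral samples and applies them to all of that speaker's samples. The function name `neutral_z_standardize`, the toy F0 values, and the choice of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def neutral_z_standardize(values, is_neutral):
    """Equation (1): neutral-only statistics, applied to every sample."""
    neutral = [v for v, flag in zip(values, is_neutral) if flag]
    mu_neu = mean(neutral)
    sigma_neu = pstdev(neutral)  # assumption: population standard deviation
    return [(v - mu_neu) / sigma_neu for v in values]


flags = [True, True, True, False]  # three neutral sentences, one angry
# Hypothetical F0 means (Hz) for a low-pitched and a high-pitched speaker.
speaker_a = neutral_z_standardize([110.0, 120.0, 130.0, 170.0], flags)
speaker_b = neutral_z_standardize([200.0, 210.0, 220.0, 260.0], flags)
```

Before normalization the angry sample of the first speaker (170 Hz) lies below every neutral sample of the second speaker. After normalization both neutral sets are centered at zero and both angry samples map to the same positive value, so a single decision boundary can separate them.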

One important component of the proposed approach is the automatic emotional speech detector (see Fig. 3). Unlike other emotion classification systems that recognize multiple emotional labels, the goal of this detection system is to identify neutral speech with high precision rate. This study uses a linear kernel SVM with sequential minimal optimization (SMO). Among many hyperplanes, SVM selects the one that has the largest margin between two classes (i.e., maximum margin classifier). We have successfully used this machine learning framework in paralinguistic recognition problems such as emotion recognition [9], [34] and sleepiness detection [35]. The SVM is trained and tested with the WEKA data mining toolkit [36]. For consistency across the evaluations, we set the complexity parameter of the classifier to 0.1. This value provided good performance in preliminary experiments.

Fig. 3. Iterative feature normalization. This unsupervised front-end uses an automatic emotional speech detector to identify neutral samples, which are used to estimate the normalization parameters. The process is iteratively repeated until the labels are not modified any further (see Sections 3.2 and 3.3).

An important aspect of the IFN approach is setting the values of the normalization parameters in the first iteration (i.e., initialization). These parameters can be initialized with different approaches. Given that it is not clear which approach provides better performance, we evaluate different possibilities. First, we estimate $\mu^{f_s}_{neu}$ and $\sigma^{f_s}_{neu}$ using all the training data (IFNTr). Following the ideas behind the ideal normalization approach, we also initialize the parameters using only the neutral samples of the training data (IFNTrN). Likewise, we estimate the initial parameters with all the test data, including emotional and neutral samples (IFNTe). Finally, we calculate the parameters using the neutral samples of the test data (IFNTeN). Notice that the IFNTeN setting initializes the parameters with the ones used for ideal normalization. This initialization is implemented for comparison purposes only, since it is not suitable for real applications (the emotional labels are unknown).
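The four initialization options can be summarized in a short sketch for one feature. The function name `init_parameters`, the string codes, and the data layout (lists of value and neutral-flag pairs) are illustrative assumptions; the paper does not prescribe an interface.

```python
from statistics import mean, pstdev


def init_parameters(train, test, scheme):
    """Return (mu, sigma) for the first IFN iteration.

    train, test : lists of (feature_value, is_neutral) pairs
    scheme      : "Tr"  = all training data
                  "TrN" = neutral training data only
                  "Te"  = all test data
                  "TeN" = neutral test data only (needs labels, so it is
                          usable only as a reference condition)
    """
    if scheme not in ("Tr", "TrN", "Te", "TeN"):
        raise ValueError(f"unknown initialization scheme: {scheme}")
    data = train if scheme.startswith("Tr") else test
    if scheme.endswith("N"):
        data = [(value, flag) for value, flag in data if flag]
    values = [value for value, _ in data]
    return mean(values), pstdev(values)


# Hypothetical F0 means (Hz) with neutral flags.
train = [(100.0, True), (120.0, True), (180.0, False)]
test = [(200.0, True), (220.0, True), (300.0, False)]
mu_trn, sigma_trn = init_parameters(train, test, "TrN")
mu_te, sigma_te = init_parameters(train, test, "Te")
```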

4 DATABASE AND ACOUSTIC FEATURES

4.1 IEMOCAP Database

The proposed approach is evaluated with the interactive emotional dyadic motion capture (IEMOCAP) corpus [37]. Ten trained actors (five male and five female) took part in five dyadic interactions using scripts and spontaneous improvisations, which were carefully selected to elicit happiness, anger, sadness, and frustration. As a result of the spontaneous dialog between the subjects, other emotions were also observed. The spontaneous interactions between actors elicit expressive reactions with mixed and ambiguous emotions that are similar to the ones observed in naturalistic human interactions (i.e., non-acted data). This is an important characteristic of this corpus that differentiates it from other acted databases in which actors are asked to read sentences portraying a given emotion.

The corpus contains approximately 12 hours of data, which was manually segmented and transcribed at the turn level. We evaluate the emotional content of the corpus using perceptual evaluations by external observers. While it is not clear that perceived emotions match the actual felt emotions [38], emotional labels derived from subjective evaluations are commonly accepted as a good approximation of the intended emotion conveyed by the speaker. Each turn was annotated by three evaluators with the following categorical labels: anger, sadness, happiness, disgust, fear, surprise, frustration, excited, neutral, and other. An analysis of the assigned labels reveals fair/moderate agreement across evaluators; details are given in Busso et al. [37]. We only consider samples that reached agreement by majority vote. Since we are interested in emotion detection (i.e., neutral versus emotional speech), we discard emotional samples that receive neutral labels. We implement this approach to increase the inter-evaluator agreement, reducing the ambiguity in the emotional labels.

4.2 Acoustic Features and Feature Selection

The study considers the set of acoustic features proposed for the INTERSPEECH 2011 Speaker State Challenge [39]. The set includes an extensive number of sentence-level features that have been commonly used for emotion recognition. First, 59 frame-by-frame features are extracted from each sentence, including prosodic (e.g., energy and F0 contour), spectral (e.g., MFCCs, RASTA, spectral flux, and spectral entropy) and voice quality features (e.g., jitter, shimmer). Table 2 lists these low level descriptors (LLDs). Then, functionals such as mean, maximum, range, kurtosis, skewness, and quartiles are extracted for each of the frame-by-frame features. Table 3 reports these functionals. Altogether, the set includes 4,368 sentence-level features, which are extracted using the openSMILE toolkit [40].
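To make the step from frame-level descriptors to sentence-level features concrete, the following sketch computes a handful of the functionals named above over one contour. It is an illustration only: the exact definitions used by openSMILE (for example, its quartile interpolation or kurtosis convention) may differ, and the function name `functionals` and the toy contour are assumptions.

```python
from statistics import mean, pstdev, quantiles


def functionals(contour):
    """Map one frame-level contour (at least two frames) to sentence-level statistics."""
    mu = mean(contour)
    sigma = pstdev(contour)
    q1, median, q3 = quantiles(contour, n=4)
    if sigma > 0:
        z = [(x - mu) / sigma for x in contour]
        skewness = mean([v ** 3 for v in z])
        kurtosis = mean([v ** 4 for v in z])  # non-excess kurtosis
    else:
        skewness, kurtosis = 0.0, 0.0  # flat contour
    return {
        "mean": mu,
        "max": max(contour),
        "range": max(contour) - min(contour),
        "skewness": skewness,
        "kurtosis": kurtosis,
        "quartile1": q1,
        "median": median,
        "quartile3": q3,
    }


# Hypothetical F0 contour (Hz) for one sentence, one value per voiced frame.
stats = functionals([100.0, 110.0, 120.0, 130.0, 140.0])
```

Applying a set of such functionals to each of the 59 descriptors yields one fixed-length vector per sentence regardless of its duration, which is what the normalization and the classifier operate on.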

TABLE 2 The Set of Frame-Level Acoustic Features Used in This Study

This set is referred to as LLDs in the INTERSPEECH 2011 Speaker State Challenge [39].

TABLE 3 The Set of Sentence-Level Functionals Extracted from the LLDs (see Table 2)

Given the high dimension of the feature space, we reduce the set using feature selection. Instead of using a wrapper method, in which the performance of a classifier is used as a criterion, we implement a correlation feature selection (CFS) technique [41]. CFS aims to identify a feature set that has low correlation between the selected features, but high correlation between the features and the emotional labels. We implement CFS with the best first search approach, which is a greedy hill-climbing search method with backtracking capability. Starting from an empty set, the best first search method sequentially adds features to the current subset and grades the subset using a correlation-based criterion. If five consecutive features do not improve the performance of the feature set, the feature selection stops. This strategy helps the search to avoid local maxima. Although a wrapper feature selection approach may provide a more discriminative feature set, we select CFS to fix the feature set for all the experiments. Since this feature selection approach does not depend on any particular classifier, the reported results are not biased to any condition and can be directly compared. Notice that feature selection is conducted after ideal normalization (see Section 3.1). For emotion detection, CFS selected 172 features out of 4,368 sentence-level features. Fig. 4 shows the distribution of the selected features across the six most important LLD categories listed in Table 2.
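The selection criterion can be sketched with the standard CFS merit, which rewards high feature-label correlation and penalizes high feature-feature correlation. The code below is a simplified stand-in, not the WEKA implementation: it uses the absolute Pearson correlation as the correlation measure, and it replaces full best first search (which keeps a queue of subsets for backtracking) with a plain forward search that stops after `patience` consecutive non-improving additions. All function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cfs_merit(subset, features, labels):
    """Merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff) for a non-empty subset."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[j], labels)) for j in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (sum(abs(pearson(features[a], features[b])) for a, b in pairs)
            / len(pairs)) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def forward_cfs(features, labels, patience=5):
    """Greedy forward search; stops after `patience` non-improving additions."""
    selected, best_subset, best_merit, stale = [], [], 0.0, 0
    remaining = list(range(len(features)))
    while remaining and stale < patience:
        candidate = max(remaining,
                        key=lambda j: cfs_merit(selected + [j], features, labels))
        selected.append(candidate)
        remaining.remove(candidate)
        merit = cfs_merit(selected, features, labels)
        if merit > best_merit + 1e-12:
            best_subset, best_merit, stale = list(selected), merit, 0
        else:
            stale += 1
    return best_subset, best_merit


# Toy data: feature 0 matches the labels, feature 1 duplicates it, feature 2 is weak.
labels = [0, 0, 1, 1, 0, 1]
features = [[0, 0, 1, 1, 0, 1], [0, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1]]
subset, merit = forward_cfs(features, labels)
```

In the toy data the duplicated feature adds no merit because it is fully redundant with the first one, so the search keeps a single feature.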

5 EXPERIMENTS AND RESULTS

Preliminary versions of the proposed front-end scheme have shown promising results [8], [9], [42], [43]. For example, the normalization scheme was part of the framework that was awarded the first place in the Intoxication Sub-Challenge at Interspeech 2011 [42], [43]. This section provides an extensive evaluation of the proposed front-end scheme. The evaluation includes emotion detection problems (see Section 5.1, neutral versus emotional speech), convergence analysis (see Section 5.2), sensitivity analysis against errors on speaker identity (see Section 5.3) and performance in recordings obtained under unconstrained experimental conditions (see Section 5.4).

5.1 Emotionally-Expressive Speech Detection

The first experiment consists of evaluating the effect of the IFN approach in emotion detection problems, that is, binary classification between neutral and emotional speech. The sentences with emotional labels are clustered together, forming two classes (neutral versus emotional). This scheme yields 1,125 neutral and 3,839 emotional utterances. The evaluation considers classification 1) without normalization (WN), 2) with global normalization (GN), 3) with ideal normalization (IN) (see Section 3.1), and 4) with the IFN approach (see Section 3.2). In global normalization, the mean and standard deviation of the features are estimated across the entire training or testing data of the speakers (speaker-independent normalization). Ideal normalization corresponds to the speaker-dependent normalization scheme, estimated from the neutral portions of each speaker (see Section 3.1). For consistency, all the experiments are conducted with linear kernel SVM classifiers trained with SMO. The evaluation is implemented with a leave-one-speaker-out, ten-fold cross-validation approach (i.e., speaker-independent evaluation). To avoid dealing with unbalanced classes for the experiments, the emotional utterances are randomly selected to match the number of neutral sentences for both training and testing sets (i.e., training with under-sampling). To make use of the entire database, this random selection is repeated ten times. Hence, the results correspond to the average results achieved across all random selections of the ten folds.
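The under-sampling step described above can be illustrated with a short Python sketch. It is not the implementation used in the experiments: the function name `balanced_subsample`, the label strings, and the use of the repetition index as a random seed are assumptions made for the example, and the selection is applied to a single label list rather than to the training and testing partitions of each fold.

```python
import random

def balanced_subsample(labels, seed):
    """Return sorted indices of a class-balanced subset.

    Keeps every neutral sample and randomly draws the same number of
    emotional samples (assumes neutral is the minority class).
    """
    rng = random.Random(seed)
    neutral = [i for i, y in enumerate(labels) if y == "neutral"]
    emotional = [i for i, y in enumerate(labels) if y == "emotional"]
    picked = rng.sample(emotional, k=min(len(neutral), len(emotional)))
    return sorted(neutral + picked)

# Toy label list with the class counts reported in the text.
labels = ["neutral"] * 1125 + ["emotional"] * 3839

# Repeat the random selection ten times (hypothetical seeds 0..9).
subsets = [balanced_subsample(labels, seed) for seed in range(10)]
```

Each subset contains all 1,125 neutral utterances and an equal number of randomly drawn emotional utterances, so averaging over the ten draws lets most emotional utterances contribute to the results.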

Fig. 5 gives the performance of the classifiers trained with different normalization strategies. The results are presented in terms of accuracy. With this metric, the chance level is 50 percent. The best performance is achieved with the ideal feature normalization approach described in Section 3.1. This result confirms the intuition behind this feature normalization (see Fig. 2). The figure shows that global normalization can also improve the emotion discrimination. However, it is not as effective as the ideal normalization. The SVM implementation of WEKA provides a prediction probability for each emotional class. This probability is used to select the neutral samples for the IFN parameter estimations. Our preliminary analysis indicated that selecting samples with neutral probability of 0.3 or higher yields better performance than 0.5. Notice that this lower threshold increases the number of samples used to estimate the neutral parameters. Therefore, it gives a more robust estimation with lower variance. The accuracies across iterations are given in Fig. 5 for each of the four initialization approaches described in Section 3.3. Notice that the first iteration of IFNTeN corresponds to the ideal normalization. However, after the first iteration the errors in detecting the neutral and emotional labels cause a deviation from the optimal parameters. Therefore, the performance decreases for the IFNTeN setting. Ten iterations are used as the stopping criterion. However, after a few iterations, different initializations of the IFN converge almost to the same accuracy level. According to the proportion hypothesis test, the differences in performance across different initialization

Fig. 4. Distribution of the sentence level features, or functionals, selected by CFS for each feature group listed in Table 2.

Fig. 5. Performance of the emotional speech detection system with different feature normalization conditions. Results are reported in terms of average accuracy across all the speakers. IN: ideal normalization, GN: global normalization, WN: without any feature normalization, IFN: iterative feature normalization. Normalization parameters of the IFN approach are initialized with IFNTr: all the training set, IFNTrN: the neutral samples of the training set, IFNTe: all the testing set, IFNTeN: the neutral samples of the testing set.

392 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 4 NO. 4 OCTOBER-DECEMBER 2013

Authorized licensed use limited to: University of the Cumberlands. Downloaded on July 24,2021 at 03:25:03 UTC from IEEE Xplore. Restrictions apply.

schemes are not statistically significant after 10 iterations (p-value > 0.46). This is an important finding since it indicates the stability of the proposed method regardless of the initial parameter estimation. The IFN approach improves the performance of the classifiers, compared to the cases with global normalization and without normalization. Notice that the ideal normalization is the upper bound performance of the IFN approach (case where the emotion detection system does not have errors). According to the proportion hypothesis test, the effect of ideal normalization is statistically significant compared to the cases without normalization (p-value < 1e-20) or with global normalization (p-value < 0.003). Also, the IFN approach gives statistically significant improvement over the classifier trained without feature normalization (p-value < 0.026).
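The exact form of the proportion hypothesis test is not detailed here. As an illustration only, the following Python sketch assumes a two-sided, pooled two-proportion z-test between the accuracies of two systems; the function name and the counts in the example are hypothetical and are not results from this study.

```python
import math

def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided pooled z-test for the difference between two accuracies."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical counts: 1,600 versus 1,500 correct decisions out of 2,250 trials.
z, p_value = two_proportion_z_test(1600, 2250, 1500, 2250)
```

A small p-value indicates that the difference between the two accuracies is unlikely under the hypothesis that both systems have the same underlying accuracy.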

5.2 Convergence of IFN

Fig. 6 shows the number of emotional labels (neutral versus emotional) changed by the emotion detection system in each iteration of the IFN approach. This figure shows that after five iterations all instances of the IFN approach converge, regardless of the initialization process. When the parameters are initialized with neutral samples (IFNTrN, IFNTeN), the number of samples changing labels from emotional to neutral is almost the same as the number of samples changing from neutral to emotional. However, when we initialize the parameters with all the samples (IFNTr, IFNTe), we observe more samples changing from neutral to emotional classes than from emotional to neutral class. During training, the features are normalized by statistics from neutral samples. However, the initialization of the normalization parameters during testing is done with all the samples. Therefore, the emotion detection systems yield a higher number of false neutral detections. On average, the number of samples that changed labels after five iterations is less than 11 samples, which only represents 0.48 percent of the data. This criterion can be used as a stopping threshold.
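The iteration and the label-change stopping criterion can be sketched as follows. This is a simplified illustration for a single one-dimensional feature, not the system evaluated here: `toy_detector` is a hypothetical stand-in for the trained SVM, the 0.5 percent stopping fraction is an assumed value in the spirit of the threshold discussed above, and the feature values are invented.

```python
import math
import statistics

def ifn(features, neutral_prob, mean, std,
        select_threshold=0.3, max_iter=10, stop_fraction=0.005):
    """Iteratively re-estimate one speaker's normalization parameters.

    features:     raw 1-D feature values of one speaker
    neutral_prob: callable mapping a normalized value to P(neutral)
    mean, std:    initial normalization parameters
    """
    previous = None
    for iteration in range(1, max_iter + 1):
        normalized = [(x - mean) / std for x in features]
        probs = [neutral_prob(z) for z in normalized]
        labels = [p >= 0.5 for p in probs]  # True means "neutral"
        if previous is not None:
            changed = sum(a != b for a, b in zip(labels, previous))
            if changed / len(features) < stop_fraction:
                break  # labels are consistent between iterations
        previous = labels
        # Re-estimate the parameters from samples detected as neutral.
        selected = [x for x, p in zip(features, probs) if p >= select_threshold]
        if len(selected) < 2:
            break
        new_std = statistics.pstdev(selected)
        if new_std == 0.0:
            break
        mean, std = statistics.fmean(selected), new_std
    return mean, std, iteration

def toy_detector(z):
    """Hypothetical detector: values near zero are likely neutral."""
    return 1.0 / (1.0 + math.exp(4.0 * (abs(z) - 2.5)))

neutral = [96, 98, 100, 102, 104] * 4   # invented neutral feature values
emotional = [150, 155, 160, 165, 170]   # invented emotional feature values
mean, std, iterations = ifn(neutral + emotional, toy_detector,
                            mean=120.0, std=15.0)
```

In this toy run, the emotional samples are progressively excluded from the parameter estimation, and the loop stops with the mean and standard deviation of the neutral samples alone.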

In each iteration, the IFN tries to improve the estimate of the normalization parameters, converging to a sub-optimal set of normalization parameters. These sets are close to the optimal normalization parameters (i.e., ideal normalization). Figs. 7 and 8 show the average absolute errors in the estimation of $\mu^{f_s}_{neu}$ and $\sigma^{f_s}_{neu}$ across the entire selected feature set, with respect to the optimal parameters

Fig. 6. Number of labels changed in the test data for different iterations of the IFN approach. The normalization parameters of the IFN method are initialized with IFNTr: all the training set, IFNTrN: the neutral samples of the training set, IFNTe: all the testing set, and IFNTeN: the neutral samples of the testing set. (a) All samples that changed labels, (b) samples changed from emotional to neutral, (c) samples changed from neutral to emotional.

Fig. 7. Estimation error of the mean in the z-standardization for different iterations. The normalization parameters of the IFN approach are initialized with IFNTr: all the training set, IFNTrN: the neutral samples of the training set, IFNTe: all the testing set, and IFNTeN: the neutral samples of the testing set.

Fig. 8. Estimation error of the standard deviation in the z-standardization for different iterations. The normalization parameters of the IFN method are initialized with IFNTr: all the training set, IFNTrN: the neutral samples of the training set, IFNTe: all the testing set, and IFNTeN: the neutral samples of the testing set.

BUSSO ET AL.: ITERATIVE FEATURE NORMALIZATION SCHEME FOR AUTOMATIC EMOTION DETECTION FROM SPEECH 393


corresponding to the ideal normalization. Notice that the features have different dynamic ranges. For better visualization, we have normalized the errors by dividing them by the corresponding standard deviation of each feature. This normalization approach has the same effect as using z-normalization before running the experiments. Notice that the initial step of IFNTeN corresponds to the ideal normalization, which yields zero estimation errors. In the following iterations, the emotion detection errors increase the errors of the estimated parameters. For other initialization settings, the error decreases during the IFN iterations. These results indicate the stability of the proposed IFN method.
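The error measure plotted in Figs. 7 and 8 can be written as a short Python sketch. The function name and the numerical values are hypothetical; the sketch only assumes that each absolute error is divided by the standard deviation of the corresponding feature before averaging across features, as described above.

```python
def normalized_abs_error(estimated, optimal, feature_std):
    """Average of |estimated - optimal| / feature_std across features."""
    errors = [abs(e - o) / s
              for e, o, s in zip(estimated, optimal, feature_std)]
    return sum(errors) / len(errors)

# Two hypothetical features with very different dynamic ranges.
estimated_means = [1.0, 10.0]   # parameters estimated by IFN
optimal_means = [0.0, 20.0]     # parameters from ideal normalization
feature_stds = [2.0, 5.0]       # standard deviation of each feature

error = normalized_abs_error(estimated_means, optimal_means, feature_stds)
```

Dividing by the feature standard deviation keeps a feature with a large dynamic range from dominating the average.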

5.3 Performance with Speaker Identification Error

The proposed IFN approach relies on the speaker identity given for the test data. We assume that test samples come from a single speaker. This section studies the effect of speaker identification errors in the test set. The data is divided into two sets of five speakers each. The emotion detection models are built on one set by using the correct speaker identity. Then, the data of each of the test speakers is mixed with the other four test speakers according to the target percentage in speaker errors. This scheme simulates the errors introduced by a speaker identification system. Notice that the parameters of IN are estimated with all the neutral samples associated with each speaker, which may contain neutral samples of the other speakers as well. Given the differences in experimental settings, the results of this section are different from the ones presented in Section 5.1, even when no error in speaker identity is introduced (classifiers are trained with only five speakers). Similar to Section 5.1, the training and testing partitions are randomly balanced. Different error rates in speaker identification are introduced among the samples in the testing partition (e.g., 0, 5, 10, 20, 25, and 50 percent). Then, the different normalization schemes are evaluated under different error rates on speaker identification.
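One way to simulate these speaker identification errors is sketched below in Python. The exact mixing procedure is not specified in the text, so this sketch assumes that a fixed fraction of the test samples is reassigned, uniformly at random, to one of the other test speakers; the function name, the speaker labels, and the seed are hypothetical.

```python
import random

def corrupt_speaker_ids(speaker_ids, error_rate, seed=0):
    """Reassign a fraction of samples to a different, randomly chosen speaker."""
    rng = random.Random(seed)
    speakers = sorted(set(speaker_ids))
    n_errors = round(error_rate * len(speaker_ids))
    wrong = set(rng.sample(range(len(speaker_ids)), n_errors))
    corrupted = []
    for i, spk in enumerate(speaker_ids):
        if i in wrong:
            # Pick any test speaker other than the true one.
            others = [s for s in speakers if s != spk]
            corrupted.append(rng.choice(others))
        else:
            corrupted.append(spk)
    return corrupted

# Five hypothetical test speakers with 40 samples each.
true_ids = [f"spk{k}" for k in range(5) for _ in range(40)]
noisy_ids = corrupt_speaker_ids(true_ids, error_rate=0.25)
```

The speaker-dependent normalization parameters would then be estimated per value of `noisy_ids` rather than per true speaker, so each speaker's statistics are contaminated by samples from the others.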

The reported results correspond to the average obtained by considering each of the two sets as training and testing partitions (two-fold cross-validation). Fig. 9 displays the achieved average performance for each of the normalization schemes. Since the condition without feature normalization (WN) and GN perform the classifications without any assumption on the speaker identity, the accuracy is constant regardless of the speaker identity error rate. However, the performance of ideal normalization drops when the error rate of the speaker identity increases. IFN, as a speaker-dependent front-end, normalizes the feature space to better match the feature space in the training data. According to Fig. 9, IFN presents robust performance against speaker identity errors, dropping the accuracy only when speaker identity error is 50 percent. However, it still outperforms the global normalization setting. This result is important since, if needed, the accuracy of a speaker identification system does not need to be perfect for the IFN system to provide accurate emotion detection results.

5.4 Performance in Uncontrolled Recordings

The proposed normalization approach is finally evaluated on data from realistic recordings in uncontrolled settings. For this purpose, we downloaded from a video-sharing website several talks and interviews given by a recognized celebrity (only the audio is used in this study). The videos span different ages of the target individual (from 15 to 30 years), and were recorded in different environmental conditions. They include various undesirable factors such as background music, background noise and overlapped speech of multiple speakers (the recordings come from interviews during TV shows). Dealing with all these factors is a challenging problem that requires insights from speech enhancement, voice activity detection, speaker diarization, among other related fields. Therefore, we manually remove noisy and overlapped segments to simplify the evaluation of the IFN approach in naturalistic recordings.

The corpus was split into five-second segments, which were emotionally annotated by six graduate students. Unlike similar studies that consider emotional dimensions such as valence and arousal [44], [45], we directly asked the evaluators to assess whether the samples are emotional or neutral. The subjects used a slider bar to assess each speech segment on a continuous scale from 0 (neutral) to 1 (emotional). To estimate the reliability of the perceptual evaluation, we estimated the correlation between each evaluator and the average scores across the other five evaluators. On average, we observe a correlation of ρ = 0.52, which represents a moderate agreement for this task. Notice that this level of agreement is commonly observed across perceptual emotional evaluations. The average of the scores across evaluators is considered as ground truth. The assigned scores are skewed toward 0 (neutral samples). This result is expected, since most of the sentences do not convey emotional information in natural recordings. To address the unbalanced emotional content of the corpus, the speech segment is considered as neutral if its average value is lower than 0.4. Otherwise, it is considered as emotional. This threshold is similar to the ones used in previous studies [46], [47]. Altogether, the corpus includes 837 speech files, with more neutral (727) than emotional (110) samples. We evaluate the IFN approach with unbalanced and balanced classes in the testing set (see Fig. 10). Detailed information about this corpus is given by Rahman and Busso [9].
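The agreement measure and the labeling rule described above can be sketched in Python. The sketch assumes that the correlation is the Pearson coefficient between each evaluator and the mean of the remaining evaluators; the function names and the ratings (three evaluators, four segments) are invented for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def evaluator_agreement(scores):
    """Mean correlation between each evaluator and the average of the others.

    scores[e][i] is the rating given by evaluator e to segment i.
    """
    n_eval, n_seg = len(scores), len(scores[0])
    correlations = []
    for e in range(n_eval):
        others = [sum(scores[o][i] for o in range(n_eval) if o != e)
                  / (n_eval - 1) for i in range(n_seg)]
        correlations.append(pearson(scores[e], others))
    return sum(correlations) / n_eval

def binarize(scores, threshold=0.4):
    """Label a segment neutral if its average rating is below the threshold."""
    n_eval, n_seg = len(scores), len(scores[0])
    means = [sum(s[i] for s in scores) / n_eval for i in range(n_seg)]
    return ["neutral" if m < threshold else "emotional" for m in means]

# Invented ratings on the 0 (neutral) to 1 (emotional) scale.
ratings = [
    [0.1, 0.2, 0.8, 0.5],
    [0.0, 0.3, 0.9, 0.6],
    [0.2, 0.1, 0.7, 0.6],
]
agreement = evaluator_agreement(ratings)
segment_labels = binarize(ratings)
```

Comparing each evaluator with the average of the others, rather than with the average of all evaluators, keeps an evaluator's own rating from inflating the correlation.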

Fig. 9. Accuracy (%) of the emotion detection system with errors in speaker identification.


The uncontrolled recordings are only used for testing the IFN approach. The training of the emotion detection system is implemented with the IEMOCAP database. Similar to the experiments presented in Section 5.1, we select a balanced set of neutral and emotional sentences for each of the 10 speakers in the IEMOCAP corpus. Then, the SVM-based emotion detection models are built with this data. Given that we present results with balanced and unbalanced classes in the testing partitions, we measure the performance with F-score (F), in addition to accuracy (A). Equation (2) gives the formula for F, where $\bar{R}$ and $\bar{P}$ are the average recall and average precision of the two classes, respectively. Notice that we estimate both the recall and precision rate for the neutral and emotional classes, separately, and then we take their corresponding averages. Therefore, F is not sensitive to the selection of the positive class. A random classifier gives an F-score of 50 percent regardless of the prior distribution of the classes.

$$F = \frac{2\,\bar{P}\,\bar{R}}{\bar{P} + \bar{R}}. \qquad (2)$$
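Equation (2) can be computed as in the following Python sketch, which averages the per-class precision and recall before combining them. The function name, the label strings, and the toy predictions are assumptions made for the example.

```python
def macro_f_score(y_true, y_pred, classes=("neutral", "emotional")):
    """F-score of Equation (2) from class-averaged precision and recall."""
    recalls, precisions = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        n_true = sum(t == c for t in y_true)
        n_pred = sum(p == c for p in y_pred)
        # Define the rate as 0 when a class is absent or never predicted.
        recalls.append(tp / n_true if n_true else 0.0)
        precisions.append(tp / n_pred if n_pred else 0.0)
    avg_r = sum(recalls) / len(recalls)
    avg_p = sum(precisions) / len(precisions)
    if avg_p + avg_r == 0.0:
        return 0.0
    return 2.0 * avg_p * avg_r / (avg_p + avg_r)

# Toy example with four segments.
y_true = ["neutral", "neutral", "emotional", "emotional"]
y_pred = ["neutral", "emotional", "emotional", "emotional"]
f_score = macro_f_score(y_true, y_pred)
```

Because both classes enter the averages symmetrically, swapping the order of `classes` leaves the score unchanged, which is the property noted above about the choice of the positive class.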

5.4.1 Evaluation with Unbalanced Classes

Figs. 10a and 10b give the evaluations on the uncontrolled recordings when the classes in the testing set are highly unbalanced (110 versus 727). These figures give the performance using the same initialization conditions described in Section 5.1. They also provide the performance of the emotion detection system with global feature normalization and without feature normalization. According to the proportion hypothesis test, the F-scores reported in Fig. 10b are significantly higher than 50 percent (p-value < 1e-20). Fig. 10 shows that when no normalization approach is used, the emotion detection system cannot cope with the mismatched condition of training with the IEMOCAP corpus and testing with the uncontrolled recordings (18.2 percent accuracy). Similarly, the global normalization does not yield satisfactory performance (55 percent accuracy). However, the ideal normalization provides more robust results. Notice that the accuracy of ideal normalization is more than 19 percent higher than that of global normalization. In spite of the challenging recording conditions and the highly unbalanced corpus, the system trained with the IFN approach is also able to provide results comparable to the ideal normalization. These results suggest the potential of the IFN approach on real-life, non-laboratory data.

5.4.2 Evaluation with Balanced Classes

We also evaluate the performance of the IFN approach when the emotional classes are balanced in the test set. We select the top 111 neutral samples with the lowest ratings to achieve a fairly balanced set (110 emotional, 111 neutral). Figs. 10c and 10d show the accuracy and F-score achieved by the corresponding emotion detection systems using the balanced data set. The figures show that the IFN approach outperforms global normalization. It also provides better performance than when no feature normalization is used. The figures reveal trends consistent with the rest of the evaluation, highlighting the benefits of using the IFN approach.

6 DISCUSSION AND CONCLUSIONS

This paper presented the IFN framework as an unsupervised front-end for emotion recognition systems. The speaker-dependent approach iteratively detects emotionally neutral samples which are used to estimate the normalization parameters. The normalization is applied to the entire corpus, including the emotional samples. This normalization reduces the differences in neutral speech across speakers and recordings, while preserving the differences between emotional classes in the feature space. An exhaustive evaluation is conducted to assess the performance of the IFN approach. The results reveal that the performance of the emotion detection (neutral versus emotional speech) based on the IFN framework gives better accuracies than the ones achieved with classifiers trained without normalization or with global normalization. The IFN approach also improves the accuracy in detecting emotional speech obtained from real life, unconstrained recordings. While the approach is speaker-dependent, the evaluations reveal that the performance is not very sensitive to speaker identification errors, which makes it suitable for practical applications.

There are several interesting directions that we are considering to improve the proposed IFN approach. First, the current implementation based on z-standardization corresponds to a simple affine transformation. We are exploring the benefits of using other transformations including feature warping. Another research direction is developing approaches to recognize the identity of the speakers, which is a nontrivial task with emotional speech. We are exploring speaker clustering strategies that will reduce the impact of speaker identification errors. We also leave as future work the evaluation of the IFN approach with overlapped speech collected in noisy environments during multiparty

Fig. 10. Performance of emotion detection methods in uncontrolled recordings for unbalanced and balanced settings (WN = without normalization, GN = global normalization, IFN = iterative feature normalization, and IN = ideal normalization).


interactions. These challenging conditions will affect the estimation of the speaker-dependent normalization parameters. Potential solutions are to detect overlapped speech, so that these segments can be discarded, and to apply speech enhancement solutions. Finally, this unsupervised front-end framework can be coupled with model adaptation strategies.

ACKNOWLEDGMENTS

This study was funded by US National Science Foundation (NSF) Grant IIS 1217104 and Samsung Telecommunications America as well as Grants to USC SAIL from NSF (IIS- 0911009, CCF-1029373) and DoD.

REFERENCES

[1] O. Küstner, R. Tato, T. Kemp, and B. Meffert, “Towards Real Life Applications in Emotion Recognition,” Affective Dialogue Systems (ADS ’05), E. André, L. Dybkaer, W. Minker, P. Heisterkamp, eds., pp. 25-35, Springer Verlag, May 2004.

[2] R. Cowie and R. Cornelius, “Describing the Emotional States That Are Expressed in Speech,” Speech Comm., vol. 40, no. 1-2, pp. 5-32, April 2003.

[3] C. Lee and S. Narayanan, “Toward Detecting Emotions in Spoken Dialogs,” IEEE Trans. Speech and Audio Processing, vol. 13, no. 2, pp. 293-303, Mar. 2005.

[4] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, and T. Ehrette, “Fear-Type Emotion Recognition for Future Audio-Based Surveillance Systems,” Speech Comm., vol. 50, no. 6, pp. 487-503, June 2008.

[5] A. Batliner, A. Buckow, H. Niemann, E. Nöth, and V. Warnke, “The Prosody Module,” VERBMOBIL: Foundations of Speech-to-Speech Translations, M. Maybury, O. Stock, W. Wahlster, eds., pp. 106-121, Springer Verlag, 2000.

[6] C. Busso, S. Lee, and S. Narayanan, “Analysis of Emotionally Salient Aspects of Fundamental Frequency for Emotion Detection,” IEEE Trans. Audio, Speech and Language Processing, vol. 17, no. 4, pp. 582-596, May 2009.

[7] C.-C. Lee, E. Mower, C. Busso, S. Lee, and S. Narayanan, “Emotion Recognition Using a Hierarchical Binary Decision Tree Approach,” Proc. Interspeech, pp. 320-323, Sept. 2009.

[8] C. Busso, A. Metallinou, and S. Narayanan, “Iterative Feature Normalization for Emotional Speech Detection,” Proc. Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP ’11), pp. 5692-5695, May 2011.

[9] T. Rahman and C. Busso, “A Personalized Emotion Recognition System Using an Unsupervised Feature Adaptation Scheme,” Proc. Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP ’12), pp. 5117-5120, Mar. 2012.

[10] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, “Emotion Recognition in Human-Computer Interaction,” IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32-80, Jan. 2001.

[11] B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson, “The Relevance of Feature Type for the Automatic Classification of Emotional User States: Low Level Descriptors and Functionals,” Proc. Interspeech, pp. 2253-2256, Aug. 2007.

[12] J. Deller, J. Hansen, and J. Proakis, Discrete-Time Processing of Speech Signals. IEEE Press, 2000.

[13] S. Yildirim, M. Bulut, C. Lee, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S. Narayanan, “An Acoustic Study of Emotions Expressed in Speech,” Proc. Eighth Int’l Conf. Spoken Language Processing (ICSLP ’04), pp. 2193-2196, Oct. 2003.

[14] B. Schuller, G. Rigoll, and M. Lang, “Hidden Markov Model-Based Speech Emotion Recognition,” Proc. Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP ’03), vol. 2, pp. 1-4, Apr. 2003.

[15] C.-C. Lee, E. Mower, C. Busso, S. Lee, and S. Narayanan, “Emotion Recognition Using a Hierarchical Binary Decision Tree Approach,” Speech Comm., vol. 53, no. 9-10, pp. 1162-1171, Nov.-Dec. 2011.

[16] M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, and R. Cowie, “Abandoning Emotion Classes—Towards Continuous Emotion Recognition with Modelling of Long-Range Dependencies,” Proc. Interspeech, pp. 597-600, September 2008.

[17] A. Metallinou, A. Katsamanis, and S. Narayanan, “A Hierarchical Framework for Modeling Multimodality and Emotional Evolution in Affective Dialogs,” Proc. Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP ’12), pp. 2401-2404, Mar. 2012.

[18] T.-L. Pao, J.-H. Yeh, Y.-T. Chen, Y.-M. Cheng, and Y.-Y. Lin, “A Comparative Study of Different Weighting Schemes on KNN-Based Emotion Recognition in Mandarin Speech,” Advanced Intelligent Computing Theories and Applications. With Aspects of Theoretical and Methodological Issues, D.-S. Huang, L. Heutte, M. Loog, eds., pp. 997-1005, Springer Verlag, July 2007.

[19] T.-L. Pao, C. Chien, Y.-T. Chen, J.-H. Yeh, Y.-M. Cheng, and W.-Y. Liao, “Combination of Multiple Classifiers for Improving Emotion Recognition in Mandarin Speech,” Proc. Third Int’l Conf. Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP ’07), vol. 1, pp. 35-38, Nov. 2007.

[20] Z. Yan, Z. Li, Z. Cairong, and Y. Yinhua, “Speech Emotion Recognition Using Modified Quadratic Discrimination Function,” J. Electronics, vol. 25, no. 6, pp. 840-844, Nov. 2008.

[21] B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll, “Frame vs. Turn-Level: Emotion Recognition from Speech Considering Static and Dynamic Processing,” Affective Computing and Intelligent Interaction, A. Paiva, R. Prada, R. Picard, eds., pp. 139-147, Springer, Sept. 2007.

[22] B. Schuller, B. Vlasenko, R. Minguez, G. Rigoll, and A. Wendemuth, “Comparing One and Two-Stage Acoustic Modeling in the Recognition of Emotion in Speech,” Proc. IEEE Workshop Automatic Speech Recognition & Understanding (ASRU ’07), pp. 596-600, Dec. 2007.

[23] D. Bitouk, R. Verma, and A. Nenkova, “Class-Level Spectral Features for Emotion Recognition,” Speech Comm., vol. 52, pp. 613-625, July-Aug. 2010.

[24] B. Schuller, B. Vlasenko, F. Eyben, M. Wöllmer, A. Stuhlsatz, A. Wendemuth, and G. Rigoll, “Cross-Corpus Acoustic Emotion Recognition: Variances and Strategies,” IEEE Trans. Affective Computing, vol. 1, no. 2, pp. 119-131, July-Dec. 2010.

[25] X. Le, G. Quénot, and E. Castelli, “Recognizing Emotions for the Audio-Visual Document Indexing,” Proc. Ninth Int’l Symp. Computers and Comm. (ISCC ’04), vol. 2, pp. 580-584, June-July 2004.

[26] O. W. Kwon, K. Chan, J. Hao, and T. W. Lee, “Emotion Recognition by Speech Signals,” Proc. Eighth European Conf. Speech Comm. and Technology (EUROSPEECH 2003), pp. 125-128, Sept. 2003.

[27] T. Zhang, M. Hasegawa-Johnson, and S. Levinson, “Mental State Detection of Dialogue System Users via Spoken Language,” Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR ’03), Apr. 2003.

[28] V. Sethu, E. Ambikairajah, and J. Epps, “Speaker Normalisation for Speech Based Emotion Detection,” Proc. 15th Int’l Conf. Digital Signal Processing (DSP ’07), pp. 611-614, July 2007.

[29] S. Mariooryad and C. Busso, “Compensating for Speaker or Lexical Variabilities in Speech for Emotion Recognition,” In Press, Speech Comm., vol. 57, pp. 1-12, 2014.

[30] Z. Zeng, J. Tu, B. Pianfetti, M. Liu, T. Zhang, Z. Zhang, T. Huang, and S. Levinson, “Audio-Visual Affect Recognition through Multi-Stream Fused HMM for HCI,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR ’05), vol. 2, pp. 967-972, June 2005.

[31] L. Fu, X. Mao, and L. Chen, “Relative Speech Emotion Recognition Based Artificial Neural Network,” Proc. Pacific-Asia Workshop on Computational Intelligence and Industrial Application (PACIIA ’08), vol. 2, pp. 140-144, Dec. 2008.

[32] A. Blum and T. Mitchell, “Combining Labeled and Unlabeled Data with Co-Training,” Proc. 11th Ann. Conf. Computational Learning Theory (COLT ’98), pp. 92-100, July 1998.

[33] C. Busso, P. Georgiou, and S. Narayanan, “Real-Time Monitoring of Participants Interaction in a Meeting Using Audio-Visual Sensors,” Proc. Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP ’07), vol. 2, pp. 685-688, Apr. 2007.

[34] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, “Analysis of Emotion Recognition Using Facial Expressions, Speech and Multimodal Information,” Proc. Sixth Int’l Conf. Multimodal Interfaces (ICMI ’04), pp. 205-211, Oct. 2004.


[35] T. Rahman, S. Mariooryad, S. Keshavamurthy, G. Liu, J. Hansen, and C. Busso, “Detecting Sleepiness by Fusing Classifiers Trained with Novel Acoustic Features,” Proc. 12th Ann. Conf. Int’l Speech Comm. Association (Interspeech ’11), pp. 3285-3288, Aug. 2011.

[36] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten, “The WEKA Data Mining Software: An Update,” ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10-18, June 2009.

[37] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. Narayanan, “IEMOCAP: Interactive Emotional Dyadic Motion Capture Database,” J. Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, Dec. 2008.

[38] C. Busso and S. Narayanan, “The Expression and Perception of Emotions: Comparing Assessments of Self versus Others,” Proc. Interspeech, pp. 257-260, Sept. 2008.

[39] B. Schuller, S. Steidl, A. Batliner, F. Schiel, and J. Krajewski, “The INTERSPEECH 2011 Speaker State Challenge,” Proc. 12th Ann. Conf. Int’l Speech Comm. Assoc. (Interspeech ’11), pp. 3201-3204, Aug. 2011.

[40] F. Eyben, M. Wöllmer, and B. Schuller, “OpenSMILE: The Munich Versatile and Fast Open-Source Audio Feature Extractor,” Proc. ACM Int’l Conf. Multimedia (MM ’10), pp. 1459-1462, Oct. 2010.

[41] M.A. Hall, “Correlation Based Feature-Selection for Machine Learning,” PhD dissertation, The Univ. of Waikato, Apr. 1999.

[42] D. Bone, M.P. Black, M. Li, A. Metallinou, S. Lee, and S. Narayanan, “Intoxicated Speech Detection by Fusion of Speaker Normalized Hierarchical Features and GMM Supervectors,” Proc. 12th Ann. Conf. Int’l Speech Comm. Assoc. (Interspeech ’11), pp. 3217-3220, Aug. 2011.

[43] D. Bone, M. Li, M. Black, and S. Narayanan, “Intoxicated Speech Detection: A Fusion Framework with Speaker-Normalized Hierarchical Functionals and GMM Supervectors,” Computer, Speech, and Language, Oct. 2012.

[44] R. Cowie, E. Douglas-Cowie, S. Savvidou, E. McMahon, M. Sawey, and M. Schröder, “‘FEELTRACE’: An Instrument for Recording Perceived Emotion in Real Time,” Proc. ISCA Tutorial and Research Workshop (ITRW ’00) on Speech and Emotion, pp. 19-24, Sept. 2000.

[45] M. Grimm, K. Kroschel, E. Mower, and S. Narayanan, “Primitives-Based Evaluation and Estimation of Emotions in Speech,” Speech Comm., vol. 49, no. 10-11, pp. 787-800, Oct.-Nov. 2007.

[46] B. Schuller, A. Batliner, S. Steidl, and D. Seppi, “Recognising Realistic Emotions and Affect in Speech: State of the Art and Lessons Learnt from the First Challenge,” Speech Comm., vol. 53, no. 1, pp. 1062-1087, Dec. 2011.

[47] J. Arias, C. Busso, and N. Yoma, “Shape-Based Modeling of the Fundamental Frequency Contour for Emotion Detection in Speech,” Computer Speech and Language, vol. 28, pp. 278-294, 2014.

Carlos Busso (S’02-M’09-SM’13) received the BS and MS degrees with high honors in electrical engineering from the University of Chile, Santiago, Chile, in 2000 and 2003, respectively, and the PhD in electrical engineering from the University of Southern California (USC), Los Angeles, in 2008. He is an assistant professor at the Electrical Engineering Department of The University of Texas at Dallas (UTD). He was selected by the School of Engineering of Chile as the best Electrical Engineer graduated in 2003 across Chilean universities. At USC, he received a Provost Doctoral Fellowship from 2003 to 2005 and a Fellowship in Digital Scholarship from 2007 to 2008. At UTD, he leads the Multimodal Signal Processing (MSP) laboratory [http://msp.utdallas.edu]. He received the Hewlett Packard Best Paper Award at the IEEE ICME 2011 (with J. Jain). He is the co-author of the winning paper of the Classifier Sub-Challenge event at the Interspeech 2009 emotion challenge. His research interests are in digital signal processing, speech and video processing, and multimodal interfaces. His current research includes the broad areas of affective computing, multimodal human-machine interfaces, modeling and synthesis of verbal and nonverbal behaviors, sensing human interaction, in-vehicle active safety system, and machine learning methods for multimodal processing.

Soroosh Mariooryad (S’12) received the BS degree with high honors in computer engineering from the Ferdowsi University of Mashhad, and the MS degree in computer engineering (artificial intelligence) from Sharif University of Technology (SUT), Tehran, Iran, in 2007 and 2010, respectively. He is currently working toward the PhD degree in electrical engineering at the University of Texas at Dallas (UTD), Richardson, Texas. In summer 2013, he interned at Microsoft Research working on analyzing speaking style characteristics. His research interests include speech and video signal processing, probabilistic graphical models and multimodal interfaces. His current research includes modeling and analyzing human nonverbal behaviors, with applications to speech-driven facial animations and emotion recognition. He has also worked on statistical speech enhancement and fingerprint recognition. From 2008 to 2010, he was a member of the Speech Processing Lab (SPL) at SUT. In 2010, he joined the Multimodal Signal Processing (MSP) laboratory at UTD as a research assistant.

Angeliki Metallinou received the Diploma in electrical and computer engineering from the National Technical University of Athens, Greece, in 2007, and the master’s and PhD degrees in electrical engineering in 2009 and 2013, respectively, from the University of Southern California (USC). During summer 2012, she interned at Microsoft Research, working on spoken dialog systems. She is currently working as a research scientist at Pearson Knowledge Technologies, on automatic speech recognition and language assessment for education applications, and on remote healthcare monitoring. Her research interests include speech and multimodal signal processing, affective computing, machine learning, and dialog systems. Between 2007 and 2013, she was a member of the Signal Analysis and Interpretation Lab (SAIL) at USC, working on spoken and multimodal emotion recognition and computational approaches for healthcare.

Shrikanth Narayanan (StM’88-M’95-SM’02-F’09) is Andrew J. Viterbi professor of engineering at the University of Southern California (USC) and holds appointments as a professor of electrical engineering, computer science, linguistics, and psychology, and as the founding director of the Ming Hsieh Institute. Prior to USC, he was with AT&T Bell Labs and AT&T Research from 1995 to 2000. At USC, he directs the Signal Analysis and Interpretation Laboratory (SAIL) [http://sail.usc.edu]. His research interests include human-centered information processing and communication technologies, with a special emphasis on behavioral signal processing and informatics. He is also an editor for the Computer Speech and Language Journal and an associate editor for the IEEE Transactions on Affective Computing, APSIPA Transactions on Signal and Information Processing, and the Journal of the Acoustical Society of America. He was previously an associate editor of the IEEE Transactions on Speech and Audio Processing (2000-2004), IEEE Signal Processing Magazine (2005-2008), and IEEE Transactions on Multimedia (2008-2011). He is a recipient of a number of honors, including Best Transactions Paper Awards from the IEEE Signal Processing Society in 2005 (with A. Potamianos) and in 2009 (with C.M. Lee), and selection as an IEEE Signal Processing Society Distinguished Lecturer for 2010-2011. Papers co-authored with his students have won awards at Interspeech 2013 Paralinguistics Challenge, Interspeech 2012 Speaker Trait Challenge, Interspeech 2011 Speaker State Challenge, InterSpeech 2010, InterSpeech 2009-Emotion Challenge, IEEE DCOSS 2009, IEEE MMSP 2007, IEEE MMSP 2006, ICASSP 2005, and ICSLP 2002. He has published more than 500 papers and has been granted 14 US patents. He is a fellow of the Acoustical Society of America and the American Association for the Advancement of Science (AAAS) and a member of Tau Beta Pi, Phi Kappa Phi, and Eta Kappa Nu.
