Extracting Emotions and Communication Styles from Vocal Signals

Licia Sbattella, Luca Colombo, Carlo Rinaldi, Roberto Tedesco, Matteo Matteucci and Alessandro Trivilini

Politecnico di Milano, Dip. di Elettronica, Informazione e Bioingegneria, P.zza Leonardo da Vinci 32, Milano, Italy

Keywords: Natural Language Processing, Communication Style Recognition, Emotion Recognition.

Abstract: Many psychological and social studies highlighted the two distinct channels we use to exchange information among us—an explicit, linguistic channel, and an implicit, paralinguistic channel. The latter contains information about the emotional state of the speaker, providing clues about the implicit meaning of the message. In particular, the paralinguistic channel can improve applications requiring human-machine interactions (for example, Automatic Speech Recognition systems or Conversational Agents), as well as support the analysis of human-human interactions (think, for example, of clinic or forensic applications). In this work we present PrEmA, a tool able to recognize and classify both emotions and communication style of the speaker, relying on prosodic features. In particular, communication-style recognition is, to our knowledge, new, and could be used to infer interesting clues about the state of the interaction. We selected two sets of prosodic features, and trained two classifiers, based on Linear Discriminant Analysis. The experiments we conducted, with Italian speakers, provided encouraging results (Ac=71% for classification of emotions, Ac=86% for classification of communication styles), showing that the models were able to discriminate among emotions and communication styles, associating phrases with the correct labels.

1 INTRODUCTION

Many psychological and sociological studies highlighted the two distinct channels we use to exchange information among us—a linguistic (i.e., explicit) channel used to transmit the contents of a conversation, and a paralinguistic (i.e., implicit) channel responsible for providing clues about the emotional state of the speaker and the implicit meaning of the message.

Information conveyed by the paralinguistic channel, in particular prosody, is useful for many research fields where the study of the rhythmic and intonational properties of speech is required (Leung et al., 2010). The ability to guess the emotional state of the speaker, as well as her/his communication style, is particularly interesting for Conversational Agents, as it could allow them to select the most appropriate reaction to the user's requests, making the conversation more natural and thus improving the effectiveness of the system (Pleva et al., 2011; Moridis and Economides, 2012). Moreover, being able to extract paralinguistic information is interesting in clinical applications, where psychological profiles of subjects and the clinical relationships they establish with doctors could be created. Finally, in forensic applications, paralinguistic information could be useful for observing how defendants, witnesses, and victims behave under interrogation.

Our contribution lies in the latter research field; in particular, we explore techniques for emotion and communication style recognition. In this paper we present an original model, a prototype (PrEmA - Prosodic Emotion Analyzer), and the results we obtained.

The paper is structured as follows. In Section 2 we provide a brief introduction about the relationship among voice, emotions, and communication styles; in Section 3 we present some research projects about emotion recognition; in Section 4 we introduce our model; in Section 5 we illustrate the experiments we conducted, and discuss the results we gathered; in Section 6 we introduce PrEmA, the prototype we built; finally, in Section 7 we draw some conclusions and outline our future research directions.


2 BACKGROUND

2.1 The Prosodic Elements

Several studies investigate the issue of characterizing human behaviors through vocal expressions; such studies rely on prosodic elements that transmit essential information about the speaker's attitude, emotion, intention, context, gender, age, and physical condition (Caldognetto and Poggi, 2004; Tesser et al., 2004; Asawa et al., 2012).

Intonation makes spoken language very different from written language. In written language, white spaces and punctuation are used to separate words, sentences and phrases, inducing a particular "rhythm" in the sentences. Punctuation also contributes to specifying the meaning of the whole sentence, stressing words, creating emphasis on certain parts of the sentence, etc. In spoken language, a similar task is done by means of prosody—changes in speech rate, duration of syllables, intonation, loudness, etc.

Such so-called suprasegmental characteristics play an important role in the process of utterance understanding; they are key elements in expressing the intention of a message (interrogative, affirmative, etc.) and its style (aggressive, assertive, etc.). In this work we focused on the following prosodic characteristics (Pinker and Prince, 1994): intonation, loudness, duration, pauses, timbre, and rhythm.

Intonation (or Tonal Variation, or Melodic Contour) is the most important prosodic effect, and determines the evolution of speech melody. Intonation is tightly related to the illocutionary force of the utterance (e.g., assertion, direction, commission, expression, or declaration). For example, in Italian intonation is the sole way to distinguish among requests (by raising the intonation of the final part of the sentence), assertions (ascending intonation at the beginning of the sentence, then descending intonation in the final part), and commands (descending intonation); thus, it is possible to distinguish the question "vieni domani?" (are you coming tomorrow?) from the assertion "vieni domani" (you are coming tomorrow) or the imperative "vieni domani!" (come tomorrow!).

Moreover, intonation provides clues on the distribution of information in the utterance. In other words, it helps in emphasizing new or important facts the speaker is introducing in the discourse (for example, in Italian, by means of a peak in the intonation contour). Thus, intonation takes part in clarifying the syntactic structure of the utterance.

Finally, and most important for our work, intonation is also related to emotions; for example, the melodic contour of anger is rough and full of sudden variations on accented syllables, while joy exhibits a smooth, rounded, and slow-varying intonation. Intonation also conveys the attitude of the speaker, leading the hearer to grasp nuances of meaning, like irony, kindness, impatience, etc.

Loudness is another important prosodic feature, and is directly related to the volume of the voice. Loudness can emphasize differences in terms of meaning—an increase of loudness, for example, can be related to anger.

Duration (or Speech Rate) indicates the length of phonetic segments. Duration can transmit a wide range of meanings, such as the speaker's emotions; in general, emotional states that imply psychophysiological activation (like fear, anger, and joy) are correlated to short durations and high speech rate (Bonvino, 2000), while sadness is typically related to slow speech. Duration also correlates with the speaker's attitudes (it gives clues about courtesy, impatience, or insecurity of the speaker), as well as types of discourse (a homily will have a slower speech rate than, for example, a sport running commentary).

Pauses allow the speaker to take breath, but can also be used to emphasize parts of the utterance, by inserting breaks in the intonation contour; from this point of view, pauses correspond to the punctuation we add in written language. Pauses, however, are much more general and can convey a larger variety of nuances than punctuation.

Timbre –such as falsetto, whisper, hoarse voice, quavering voice– often provides information about the emotional state and health of the speaker (for example, a speaker feeling insecure is easily associated with a quavering voice). Timbre also depends on the amount of noise affecting the vocal emission.

Rhythm is a complex prosodic element, emerging from the joint action of several factors, in particular intonation, loudness, and duration. It is an intrinsic and unique attribute of each language.

In the following, we present some studies that try to model the relationship among prosody, emotions, and communication styles.

2.2 Speech and Emotions

Emotion is a complex construct and represents a component of how we react to external stimuli (Scherer, 2005). In emotions we can distinguish:

- A neurophysiological component of activation (arousal).
- A cognitive component, through which an individual evaluates the situation-stimulus in relation to her/his needs.
- A motoric component, which aims at transforming intentions into actions.
- An expressive component, through which an individual expresses her/his intentions in relation to her/his level of social interaction.
- A subjective component, which is related to the experience of the individual.

The emotional expression is not only based on linguistic events, but also on paralinguistic events, which can be acoustic (such as screams or particular vocal inflections), visual (such as facial expressions or gestures), tactile (for example, a caress), gustatory, olfactory, and motoric (Balconi and Carrera, 2005; Planet and Iriondo, 2012). In particular, the contribution of non-verbal channels to the communication process is huge; according to (Mehrabian, 1972) the linguistic, paralinguistic, and motoric channels constitute, respectively, 7%, 38%, and 55% of the communication process. In this work, we focused on the acoustic paralinguistic channel.

According to (Stern, 1985), emotions can be divided into: vital affects (floating, vanishing, spending, exploding, increasing, decreasing, bloated, exhausted, etc.) and categorical affects (happiness, sadness, anger, fear, disgust, surprise, interest, shame). The former are very difficult to define and recognize, while the latter can be more easily treated. Thus, in this work we focused on categorical affects.

Finally, emotions have two components—a hedonic tone, which refers to the degree of pleasure, or displeasure, connected to the emotion; and an activation, which refers to the intensity of the physiological activation (Mandler, 1984). In this work we relied on the latter component, which is easier to measure.

2.2.1 Classifying Emotions

Several well-known theories for classifying emotions have been proposed. In (Russell and Snodgrass, 1987) the authors consider a huge number of characteristics about emotions, identifying two primary axes: pleasantness / unpleasantness and arousal / inhibition.

In (Izard, 1971) the author lists 10 primary emotions: interest, joy, surprise, sadness, anger, disgust, contempt, fear, shame, and guilt; in (Tomkins, 1982) the last one is eliminated; in (Ekman et al., 1994) a more restrictive classification (happiness, surprise, fear, sadness, anger, disgust) is proposed.

In particular, Ekman distinguishes between primary emotions, quickly activated and difficult to control (for example, anger, fear, disgust, happiness, sadness, surprise), and secondary emotions, which undergo social control and cognitive filtering (for example, shame, jealousy, pride). In this work we focused on primary emotions.

2.2.2 Mapping Speech and Emotions

As stated before, voice is considered a very reliable indicator of emotional states. The relationship between voice and emotion is based on the assumption that the physiological responses typical of an emotional state, such as the modification of breathing, phonation and articulation of sounds, produce detectable changes in the acoustic indexes associated to the production of speech.

Several theories have been developed in an effort to find a correlation between speech characteristics and emotions. For example, for Italian (Anolli and Ciceri, 1997):

- Fear is expressed as a subtle, tense, and tight tone.
- Sadness is communicated using a low tone, with the presence of long pauses and slow speech rate.
- Joy is expressed with a very sharp tone and with a progressive intonation profile, with increasing loudness and, sometimes, with an acceleration in speech rate.

In (Anolli, 2002) it is suggested that active emotions produce faster speech, with higher frequencies and wider loudness range, while low-activation emotions are associated with slow voice and low frequencies.

In (Juslin, 1997) the author proposes a detailed study of the relationship between emotion and prosodic characteristics. His approach is based on time, loudness, spectrum, attack, articulation, and differences in duration (Juslin, 1998). Table 1 shows such prosodic characterization, for the four primary emotions; our work started from such clues, trying to derive measurable acoustic features.

Relying on the aforementioned works, we decided to focus on the following emotions: joy, fear, anger, sadness, and neutral.

2.3 Speech and Communication Styles

The process of communication has been studied from many points of view. Communication not only conveys information and expresses emotions, it is also characterized by a particular relational style (in other words, a communication style). Everyone has a relational style that, from time to time, may be more or less dominant or passive, sociable or withdrawn, aggressive or friendly, welcoming or rejecting.

2.3.1 Classifying Communication Styles

We chose to rely on the following simple classification and description that includes three communication styles (Michel, 2008):

- Passive
- Assertive
- Aggressive

Passive communication implies not expressing honest feelings, thoughts and beliefs, thereby allowing others to violate your rights; expressing thoughts and feelings in an apologetic, self-effacing way, so that others easily disregard them; sometimes showing a subtle lack of respect for the other person's ability to take disappointments, shoulder some responsibility, or handle their own problems.

Table 1: Prosodic characterization of emotions.

Joy: quick meters; moderate duration variations; high average sound level; tendency to tighten up the contrasts between long and short words; articulation predominantly detached; quick attacks; brilliant tone; slight or missing vibrato; slightly rising intonation.

Sadness: slow meter; relatively large variations in duration; low noise level; tendency to attenuate the contrasts between long and short words; articulation linked; soft attacks; slow and wide vibrato; final delaying; soft tone; intonation (at times) slightly declining.

Anger: quick meters; high noise level; relatively sharp contrasts between long and short words; articulation mostly not linked; very dry attacks; sharp timbre; distorted notes.

Fear: quick meters; high noise level; relatively sharp contrasts between long and short words; articulation mostly not linked; very dry attacks; sharp timbre; distorted notes.

Persons with aggressive communication style stand up for their personal rights and express their thoughts, feelings and beliefs in a way which is usually inappropriate and always violates the rights of the other person. They tend to maintain their superiority by putting others down. When threatened, they tend to attack.

Finally, assertive communication is a way of communicating feelings, thoughts, and beliefs in an open, honest manner without violating the rights of others. It is an alternative to being aggressive, where we abuse other people's rights, and passive, where we abuse our own rights.

It is useful to learn the distinction among the aggressive, passive, and assertive communication behaviors, because such psychological characteristics provide clues on the prosodic parameters we can expect.

2.3.2 Mapping Speech and Communication Styles

Starting from the aforementioned characteristics of communication styles, considering the prosodic clues provided in (Michel, 2008), and taking into account other works (Hirshberg and Avesani, 2000; Shriberg et al., 2000; Shriberg and Stolcke, 2001; Hastie et al., 2001; Hirst, 2001), we came out with the prosodic characterization shown in Table 2.

2.4 Acoustic Features

As we discussed above, characterizing emotional states and communication styles associated to a vocal signal implies measuring some acoustic features, which, in turn, are derived from physiological reactions. Table 1 and Table 2 provide some clues about how to relate such physiological reactions to prosodic characteristics, but we need to define a set of measurable acoustic features.

2.4.1 Acoustic Features for Emotions

We started from the most studied acoustic features (Murray and Arnott, 1995; McGilloway et al., 2000; Cowie et al., 2001; Wang and Li, 2012).

Table 2: Prosodic characterization of communication styles.

Passive: flickering; voice often dull and monotonous; tone may be sing-song or whining; low volume; hesitant, filled with pauses; slow-fast or fast-slow; frequent throat clearing.

Aggressive: very firm voice; often abrupt, clipped; often fast; tone sarcastic, cold, harsh; grinding; fluent, without hesitations; voice can be strident, often shouting, rising at end.

Assertive: firm, relaxed voice; steady, even pace; tone is middle range, rich and warm; not over-loud or quiet; fluent, few hesitations.

Pitch measures the intonation, and is represented by the fundamental harmonic (F0); it tends to increase for anger, joy, and fear; it decreases for sadness. Pitch tends to be more variable for anger and joy.

Intensity represents the amplitude of the vocal signal, and measures the loudness; intensity tends to increase for anger and joy, decrease for sadness, and stay constant for fear.

Time measures duration and pauses, as voiced and unvoiced segments. High speech rate is associated to anger, joy, and fear, while low speech rate is associated to sadness. Irregular speech rate is often associated with anger and sadness. Time is also an important parameter for distinguishing articulation breaks (speaker's breathing) from unvoiced segments. The unvoiced segments represent silences—parts of the signal where the information of the pitch and/or intensity is below a certain threshold.

Voice Quality measures the timbre and is related to variations of the voice spectrum, as well as to the signal-to-noise ratio. In particular:

- Changes in amplitude of the waveform between successive cycles (called shimmer).
- Changes in the frequency of the waveform between successive cycles (called jitter).
- Hammarberg's index, which covers the difference between the energy in the 0-2000 Hz and 2000-5000 Hz bands.
- The harmonic/noise ratio (HNR) between the energy of the harmonic part of the signal and the remaining part of the signal; see (Hammarberg et al., 1980; Banse and Sherer, 1996; Gobl and Chasaide, 2000).

High values of shimmer and jitter characterize, for example, disgust and sadness, while fear and joy are distinguished by different values of the Hammarberg's index.
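As an illustration, the classic local definitions of jitter and shimmer can be computed from the sequence of glottal cycle periods and peak amplitudes of a voiced segment; the sketch below is a minimal Python/NumPy version under that assumption (the cycle extraction itself, which Praat performs, is not shown, and the function names are ours, not PrEmA's).

```python
# Minimal sketch (not the PrEmA code): local jitter and shimmer from the
# per-cycle periods and peak amplitudes of a voiced segment.
import numpy as np

def local_jitter(periods):
    """Mean absolute difference between consecutive periods, over the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Mean absolute difference between consecutive peak amplitudes, over the mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Example with hypothetical cycle data (periods in seconds, amplitudes unitless):
print(local_jitter([0.0102, 0.0100, 0.0104, 0.0101]))
print(local_shimmer([0.81, 0.78, 0.83, 0.80]))
```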

2.4.2 Acoustic Features for Communication Styles

Starting from the clues provided by Table 2, we decided to rely on the same acoustic features we used for emotion recognition (pitch, intensity, time, and voice quality). However, in order to recognize the complex prosodic variations that particularly affect communication style, we reviewed the literature and found that research mostly focuses on the variations of tonal accents within a sentence and at the level of prominent syllables (Avesani et al., 2003; D'Anna and Petrillo, 2001; Delmonte, 2000). Thus, we decided to add two more elements to our acoustic feature set:

- Contour of the pitch curve
- Contour of the intensity curve

In Section 4.2 we will show how we measured the feature set we defined for emotions and communication style.

3 RELATED WORK

Several approaches exist in the literature for the task of emotion recognition, based on classifiers like Support Vector Machines (SVM), decision trees, Neural Networks (NN), etc. In the following, we present some of such approaches.

The system described in (Koolagudi et al., 2011) made use of SVM for classifying emotions expressed by a group of professional speakers. The authors underlined that, for extreme emotions (anger, happiness and fear), the most useful information was contained in the first words of the sentence, while last words were more discriminative in case of neutral emotion. The recognition Precision of the system, on average, using prosodic parameters and considering only the beginning words, was around 36%.

(Note that the performance indexes provided in this section are indicative and cannot be compared with each other, since each system used its own vocal dataset.)


The approach described in (Borchert and Düsterhöft, 2005) used SVM, too, applying it to the German language. In particular, this project developed a prototype for analyzing the mood of customers in call centers. This research showed that pitch and intensity were the most important features for emotional speech, while features on spectral energy distribution were the most important voice quality features. The recognition Precision they obtained was, on average, around 70%.

Another approach leveraged Alternating Decision Trees (ADTree) for the analysis of humorous spoken conversations from a classic comedy TV show (Purandare and Litman, 2006); speaker turns were classified as humorous or non-humorous. They used a combination of prosodic (e.g., pitch, energy, duration, silences, etc.) and non-prosodic features (e.g., words, turn length, etc.). The authors discovered that the best set of features was related to the gender of the speaker. Their classifier obtained Accuracies of 64.63% for males and 64.8% for females.

The project described in (Shi and Song, 2010) made use of NN. The project used two databases of Chinese utterances. One was composed of speech recorded by non-professional speakers, while the other was composed of TV recordings. They used Mel-Frequency Cepstral Coefficients for analyzing the utterances, considering four speech emotions: angry, happiness, sadness, and surprised. They obtained the following Precisions: angry 66%, happiness 57.8%, sadness 85.1%, and surprised 58.7%.

The approach described in (Lee and Narayanan, 2005) used a combination of three different sources of information: acoustic, lexical, and discourse. They proposed a case study for detecting negative and non-negative emotions using spoken language coming from a call center application. In particular, the samples were obtained from real users involved in spoken dialog with an automatic agent over the telephone. In order to capture the emotional features at the lexical level, they introduced a new concept named "emotional salience"—an emotionally salient word, with respect to a category, tends to appear more often in that category than in other categories. For the acoustic analysis they compared a K-Nearest Neighborhood classifier and a Linear Discriminant Classifier. The results of the project demonstrated that the best performance was obtained when acoustic and language features were combined. The best performing results of this project, in terms of classification errors, were 10.65% for males and 7.95% for females.

Finally, in (Lopez-de Ipina et al., 2013) the authors focus on "emotional temperature" (ET) as a biomarker for early Alzheimer disease detection. They leverage non-linear features, such as the Fractal Dimension, and rely on an SVM for classifying the ET of voice frames as pathological or non-pathological. They claim an Accuracy of 90.7% to 97.7%.

Our project is based on a classifier that leverages Linear Discriminant Analysis (LDA) (McLachlan, 2004); such a model is simpler than SVM and NN, and easier to train. Moreover, with respect to approaches making use of textual features, our model is considerably simpler. Nevertheless, our approach provides good results (see Section 5).

Finally, we didn’t find any system able to classifycommunication styles so, to our knowledge, this fea-ture provided by our system is novel.

4 THE MODEL

For each voiced segment, two sets of features –one for recognizing emotions and one for communication style– were calculated; then, by means of two LDA-based classifiers, such segments were associated with an emotion and a communication style.

The LDA-based classifier provided a good trade-off between performance and classification correctness. LDA projects vectors of features, which represent the samples to analyze, onto a smaller space. The method maximizes the ratio of between-class variance to within-class variance, thus maximizing class separability. More formally, LDA finds the eigenvectors $\vec{f}_i$ that solve:

$$B\vec{f}_i - \lambda W\vec{f}_i = 0 \quad (1)$$

where $B$ is the between-class scatter matrix and $W$ is the within-class scatter matrix. Once a sample $\vec{x}_j$ is projected on the new space provided by the eigenvectors, the class $k$ corresponding to the projection $\vec{y}_j$ is chosen according to (Boersma and Weenink, 2013):

$$k = \arg\max_k \, p(k \mid \vec{y}_j) = \arg\max_k \, -d_k^2(\vec{y}_j) \quad (2)$$

where $d_k^2(\cdot)$ is the generalized squared distance function:

$$d_k^2(\vec{y}) = (\vec{y} - \vec{\mu}_k)^T S_k^{-1} (\vec{y} - \vec{\mu}_k) + \frac{\ln|S_k|}{2} - \ln p(k) \quad (3)$$

where $S_k$ is the covariance matrix for the class $k$ and $p(k)$ is the a-priori probability of the class $k$:

$$p(k) = \frac{n_k}{\sum_{i=1}^{K} n_i} \quad (4)$$

where $n_k$ is the number of samples belonging to the class $k$, and $K$ is the number of classes.
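For readers who want to experiment with this kind of classifier outside Praat, the following is a minimal sketch using scikit-learn (an assumption on our part; the authors used Praat's discriminant module). Note that scikit-learn's LinearDiscriminantAnalysis assumes a covariance matrix shared across classes, whereas Eq. (3) uses a per-class covariance S_k, so the decision rule is close but not identical.

```python
# Minimal sketch (not the authors' implementation): LDA classification of
# prosodic feature vectors, with priors estimated from class frequencies as in Eq. (4).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: one row of acoustic features per voiced segment.
rng = np.random.default_rng(0)
X_train = rng.random((100, 19))           # e.g., the 19 emotion features of Table 3
y_train = rng.integers(0, 5, 100)         # 5 classes: joy, neutral, fear, anger, sadness

lda = LinearDiscriminantAnalysis()        # priors default to the class frequencies
lda.fit(X_train, y_train)

X_new = rng.random((3, 19))
print(lda.predict(X_new))                 # most probable class per segment
print(lda.predict_proba(X_new))           # posterior p(k | y), cf. Eq. (2)
```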


4.1 Creating a Corpus

Our model was trained and tested on a corpus of sentences, labeled with the five basic emotions and the three communication styles we introduced. We collected 900 sentences, uttered by six Italian professional speakers, asking them to simulate emotions and communication styles. This way, we obtained good samples, showing clear emotions and expressing the desired communication styles.

4.2 Measuring and Selecting Acoustic Features

Figure 1 shows the activities that lead to the calculation of the acoustic features: preprocessing, segmentation, and feature extraction. The result is the dataset we used for training and testing the classifiers.

In the following, the aforementioned phases are presented. The values shown for the various parameters needed by the voice-processing routines have been chosen experimentally (see Section 4.3.2 for details on how the values of such parameters were selected; see Section 6 for details on Praat, the voice-processing tool we adopted).

Figure 1: The feature calculation process. Recorded vocal signals undergo preprocessing (band-pass filter, normalization), segmentation (voiced/unvoiced segment labeling), and feature extraction (raw data extraction, feature calculation), producing the training set and the test set.

4.2.1 Preprocessing and Segmentation

We used a Hann band filter for removing useless harmonics (F_lo=100 Hz, F_hi=6 kHz, and smoothing w=100 Hz). Then, we normalized the intensity of the different audio files, so that the average intensity of different recordings was uniform and matched a predefined setpoint. Finally, we divided the audio signal into segments; in particular, we identified voiced segments, where the average, normalized intensity was above the threshold I_voicing=0.45, and silenced segments, where the average, normalized intensity was below the threshold I_silence=0.03. Segments having average, normalized intensity between the two thresholds were not considered (such segments were too loud to be clear silences, but too quiet to provide a clear voiced signal).
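A minimal sketch of this preprocessing/segmentation step is shown below, in Python with NumPy/SciPy rather than Praat (an assumption on our part); the file name, frame length, and the use of RMS frame energy as "normalized intensity" are our own illustrative choices, while the thresholds follow the values reported above.

```python
# Minimal sketch (not the PrEmA scripts): band-pass filtering, intensity
# normalization, and voiced/silence labeling of fixed-length frames.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def label_frames(path, i_voicing=0.45, i_silence=0.03, frame_ms=10):
    sr, x = wavfile.read(path)                      # assumes a mono WAV sampled above 12 kHz
    x = x.astype(np.float64)
    sos = butter(4, [100, 6000], btype="band", fs=sr, output="sos")
    x = sosfiltfilt(sos, x)                         # 100 Hz - 6 kHz band-pass (stand-in for the Hann band filter)
    x /= np.max(np.abs(x)) + 1e-12                  # crude amplitude normalization

    frame = int(sr * frame_ms / 1000)
    n = len(x) // frame
    rms = np.sqrt(np.mean(x[:n * frame].reshape(n, frame) ** 2, axis=1))
    rms /= rms.max() + 1e-12                        # normalized frame intensity in [0, 1]

    labels = np.full(n, "ignored", dtype=object)    # between the two thresholds
    labels[rms >= i_voicing] = "voiced"
    labels[rms <= i_silence] = "silence"
    return labels

# Example: labels = label_frames("utterance.wav")
```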

4.2.2 Feature Calculation

For features related to Pitch, we used the range F_floor=75 Hz, F_ceiling=600 Hz (such values are well suited for male voices, as we used male subjects for our experiments).

Among the features related with Time, the articulation ratio refers to the amount of time taken by voiced segments, excluding articulation breaks, divided by the total recording time; an articulation break is a pause –in a voiced segment– longer than a given threshold (we used the threshold T_break=0.25 s), and is used to capture the speaker's breathing. The speech ratio, instead, is the percentage of voiced segments over the total recording time. These two parameters are very similar for short utterances, because articulation breaks are negligible; for long utterances, however, these parameters definitely differ, revealing that articulation breaks are an intrinsic property of the speaker. Finally, the unvoiced frame ratio is the total time of unvoiced frames, divided by the total recording time.
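Under one plausible reading of these definitions, the three Time ratios can be computed from a labeled segmentation as in the sketch below; the segment representation, labels, and function name are hypothetical, not taken from PrEmA.

```python
# Sketch of the three Time features described above. Segments are assumed to be
# (start, end, label) tuples with labels "speech" (actual phonation), "break"
# (a pause inside an utterance longer than 0.25 s), and "unvoiced" (silence).
def time_features(segments, total_time):
    dur = lambda lab: sum(end - start for start, end, label in segments if label == lab)
    speech, breaks, unvoiced = dur("speech"), dur("break"), dur("unvoiced")
    return {
        "speech_ratio": (speech + breaks) / total_time,   # voiced segments, including their internal breaks
        "articulation_ratio": speech / total_time,        # voiced time excluding articulation breaks
        "unvoiced_frame_ratio": unvoiced / total_time,    # unvoiced (silent) time over total time
    }

# Example: time_features([(0.0, 1.2, "speech"), (1.2, 1.6, "break"),
#                         (1.6, 2.4, "speech"), (2.4, 3.0, "unvoiced")], 3.0)
```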

The speech signal, even if produced with maximum stationarity, contains variations of F0 and intensity (Hammarberg et al., 1980); such variations represent the perceived voice quality. The random changes in the short term (micro disturbances) of F0 are defined as jitter, while the variations of the amplitude are known as shimmer. The Harmonic-to-Noise ratio (HNR) value is the "degree of hoarseness" of the signal—the extent to which noise replaces the harmonic structure in the spectrogram (Boersma, 1993).

Finally, the following features are meant to represent the Pitch and Intensity contours (a short computation sketch follows the list):

- Pitch Contour
  – Number of peaks per second. The number of maxima in the pitch contour, within a voiced segment, divided by the duration of the segment.
  – Average and variance of peak values.
  – Average gradient. The average gradient between two consecutive sampling points in the pitch curve.
  – Variance of gradients. The variance of such pitch gradients.

- Intensity Contour
  – Number of peaks per second. The number of maxima in the intensity curve, within a voiced segment, divided by the duration of the segment.
  – Mean and variance of peak values.
  – Average gradient. The average gradient between two consecutive sampling points in the intensity curve.
  – Variance of gradients. The variance of such intensity gradients.
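The sketch below shows one way to compute these contour features for a single voiced segment, assuming the pitch or intensity contour is available as a uniformly sampled NumPy array; the function name and the use of scipy.signal.find_peaks are our own choices, not taken from the PrEmA scripts.

```python
# Sketch: contour features (peaks per second, peak statistics, gradient statistics)
# for one voiced segment, given its contour sampled every `dt` seconds.
import numpy as np
from scipy.signal import find_peaks

def contour_features(curve, dt):
    """curve: 1-D numpy array (F0 in Hz or intensity in dB); dt: sampling step in seconds."""
    duration = len(curve) * dt
    peaks, _ = find_peaks(curve)            # indices of local maxima in the contour
    grads = np.diff(curve) / dt             # gradient between consecutive sampling points
    return {
        "peaks_per_second": len(peaks) / duration,
        "avg_peak_value": float(np.mean(curve[peaks])) if len(peaks) else 0.0,
        "var_peak_values": float(np.var(curve[peaks])) if len(peaks) else 0.0,
        "avg_gradient": float(np.mean(grads)),
        "var_gradients": float(np.var(grads)),
    }

# Example with a hypothetical F0 contour sampled every 10 ms:
# feats = contour_features(np.array([120, 125, 131, 128, 122, 127, 133, 130.0]), 0.01)
```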

Table 3 and Table 4 summarize the acoustic features we measured, for emotions and communication styles, respectively.

Table 3: Measured acoustic features for emotions.

Pitch (F0): Average [Hz]; Standard deviation [Hz]; Maximum [Hz]; Minimum [Hz]; 25th quantile [Hz]; 75th quantile [Hz]; Median [Hz].
Intensity: Average [dB]; Standard deviation [dB]; Maximum [dB]; Minimum [dB]; Median [dB].
Time: Unvoiced frame ratio [%]; Articulation break ratio [%]; Articulation ratio [%]; Speech ratio [%].
Voice quality: Jitter [%]; Shimmer [%]; HNR [dB].

The features we defined underwent a selection process, aiming at discarding highly correlated measurements, in order to obtain the minimum set of features. In particular, we used the ANOVA and the LSD tests.

Table 4: Measured acoustic features for communication style (features in italics have been removed).

Pitch (F0): Average [Hz]; Standard deviation [Hz]; Maximum [Hz]; Minimum [Hz]; 10th quantile [Hz]; 90th quantile [Hz]; Median [Hz].
Pitch contour: Peaks per second [#peaks/s]; Average peak height [Hz]; Variance of peak heights [Hz]; Average peak gradient [Hz/s]; Variance of peak gradients [Hz/s].
Intensity: Average [dB]; Standard deviation [dB]; Maximum [dB]; Minimum [dB]; 10th quantile [dB]; 90th quantile [dB]; Median [dB].
Intensity contour: Peaks per second [#peaks/s]; Average peak height [dB]; Variance of peak heights [dB]; Average peak gradient [dB/s]; Variance of peak gradients [dB/s].
Time: Unvoiced frame ratio [%]; Articulation break ratio [%]; Articulation ratio [%]; Speech ratio [%].
Voice quality: Jitter [%]; Shimmer [%]; HNR [dB].

The ANOVA analysis for features related to emotions (assuming 0.01 as significance threshold) found all the features to be significant, except Average Intensity. For Average Intensity we leveraged Fisher's LSD test, which showed that Average Intensity was not useful for discriminating Joy from Neutral and Sadness, Neutral from Sadness, and Fear from Anger. Nevertheless, Average Intensity was retained, as LSD proved it useful for discriminating Sadness, Joy, and Neutral from Anger and Fear.

The ANOVA analysis for features related to communication style (assuming 0.01 as significance threshold) found eight potentially useless features: Average Intensity, 90th quantile, Unvoiced frame ratio, Peaks per second, Average peak gradient, Standard deviation of peak gradients, Median intensity, and Standard deviation of intensity. For such features we performed the LSD test, which showed that Peaks per second was not able to discriminate Aggressive vs Assertive, but was useful for discriminating all other communication styles, and thus we decided to retain it. The other seven features were dropped, as LSD showed that they were not useful for discriminating communication styles.
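The ANOVA screening step can be reproduced with standard tools; the sketch below uses scipy.stats.f_oneway on a hypothetical pandas DataFrame (the column names and helper function are ours), while the follow-up Fisher's LSD test, which amounts to pairwise comparisons between classes, is not shown.

```python
# Sketch (not the authors' code): one-way ANOVA screening of acoustic features,
# using the 0.01 significance threshold reported in the text.
import pandas as pd
from scipy.stats import f_oneway

ALPHA = 0.01

def anova_screen(df, feature_cols, label_col="label"):
    keep, review = [], []
    for feat in feature_cols:
        groups = [group[feat].values for _, group in df.groupby(label_col)]
        _, p = f_oneway(*groups)                 # F-test of the feature across the classes
        (keep if p < ALPHA else review).append(feat)
    return keep, review                          # `review` would then go to the LSD test

# Example (hypothetical columns):
# keep, review = anova_screen(df, ["avg_pitch", "jitter", "shimmer", "speech_ratio"])
```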

After this selection phase, the set of features for the emotion recognition task remained unchanged, while the set of features for the communication-style recognition task was reduced (in Table 4, text in italics indicates removed features).

4.3 Training

4.3.1 The Vocal Dataset

For the creation of the vocal corpus we examined the public vocal databases available for the Italian language (EUROM0, EUROM1 and AIDA), public audiobooks, and different resources provided by professional actors. After a detailed evaluation of the available resources, we realized that they were not suitable for our study, due to the scarcity of sequences where emotion and communication style were unambiguously expressed.

We therefore opted for the development of our own datasets, composed of:

- A series of sentences, with different emotional intentions
- A series of monologues, with different communication styles

We carefully selected –taking into account the work presented in (Canepari, 1985)– 10 sentences for each emotion, expressing strong and clear emotional states. This way, it was easier for the actor to communicate the desired emotional state, because the meaning of the sentence already contained the emotional intention. With the same approach we selected 3 monologues (about ten to fifteen rows long, each)—they were chosen to help the actor in identifying himself with the desired communication style.

For example, to represent the passive style we chose some monologues by Woody Allen; to represent the aggressive style, we chose "The night before the trial" by Anton Chekhov; and to represent the assertive style, we used the book "Redesigning the company" by Richard Normann.

We selected six male actors; each one was recorded independently and individually, in order to avoid mutual conditioning. In addition, each actor received the texts in advance, in order to review them and practice before the recording.

4.3.2 The Learning Process

The first step of the learning process was to select the parameters needed by the voice-processing routines. Using the whole vocal dataset we trained several classifiers, varying the parameters, and selected the best combination according to the performance indexes we defined (see Section 5). We did this for both the emotion-recognition classifier and the communication-style recognition classifier, obtaining two parameter sets.

Table 5: Confusion matrix for emotions (%); rows are actual emotions, columns are predicted emotions.

          Joy    Neutral  Fear   Anger  Sadness
Joy      63.81    0.00   18.35  11.79    6.05
Neutral   3.47   77.51    2.14   1.79   15.09
Fear     33.75    0.00   58.35   6.65    1.25
Anger    10.24    1.16    8.16  77.28    3.16
Sadness   5.14   14.44    0.28   0.81   79.33

Once the parameter sets were defined, a subset of the vocal dataset –the training dataset– was used to train the two classifiers. In particular, for the emotional dataset –containing 900 voiced segments– and the communication-style dataset –containing 54 paragraphs– we defined a training dataset containing 90% of the initial dataset, and an evaluation dataset containing the remaining 10%.

Then, we trained the two classifiers on the training dataset. This process was repeated 10 times, with different training set/test set subdivisions.
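This repeated 90%/10% evaluation can be mimicked as below; the sketch uses scikit-learn's StratifiedShuffleSplit and LDA as stand-ins for the Praat-based pipeline (our assumption), with random data in place of the real feature tables.

```python
# Sketch (not the authors' code): 10 repeated 90/10 train/test splits with an
# LDA classifier, reporting the mean accuracy over the repetitions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
X = rng.random((900, 19))                 # hypothetical emotion dataset: 900 voiced segments
y = rng.integers(0, 5, 900)               # 5 emotion labels

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
scores = []
for train_idx, test_idx in splitter.split(X, y):
    clf = LinearDiscriminantAnalysis().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean accuracy over 10 splits: {np.mean(scores):.3f}")
```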

5 EVALUATION AND DISCUSSION

During the evaluation phase, the 10 pairs of LDA-based classifiers we trained (10 for emotions and 10 for communication styles) tagged each voiced segment in the evaluation dataset with an emotion and a communication style. Then performance metrics were calculated for each classifier; finally, average performance metrics were calculated (see Section 6 for details on Praat, the voice-processing tool we adopted).

5.1 Emotions

The validation dataset consists of 18 voiced segments chosen at random for each of the five emotions, for a total of 90 voiced segments (10% of the whole emotion dataset).

The average performance indexes of the 10 trained classifiers are shown in Table 5 and Table 6.

Table 6: Precision, Recall, and F-measure for emotions (%).

      Joy    Neutral  Fear   Anger  Sadness
Pr   56.03   75.00   64.84  80.36   80.28
Re   63.53   76.36   58.42  77.10   79.39
F1   59.54   75.68   61.46  78.70   79.83

Table 7: Error rates for emotions.

        Joy   Neutral  Fear   Anger  Sadness
Fp      164     56      96     65      70
Fn      120     52     126     79      74
Te (%) 18.25   6.94   14.27   9.25    9.25

Table 8: Average Pr, Re, F1, and Ac, for emotions (%).

Avg Pr  71.44
Avg Re  71.06
Avg F1  71.16
Ac      71.27

Precision and F-measure are good for Neutral, Anger, and Sadness, while Fear and Joy are more problematic (especially Joy, which has the worst value). The issue is confirmed by the confusion matrix of Table 5, which shows that Joy phrases were tagged as Fear 33% of the time, lowering the Precision of both. Recall is good for all the emotions, and also for Joy, which exhibits the better value. The average values for Precision, Recall, and F-measure are about 71%; Accuracy exhibits a similar value.

The K value, the agreement between the classifier and the dataset, is K=0.63541, meaning a good agreement was found.

Finally, for each class, we calculated false positives Fp (number of voiced segments belonging to another class, incorrectly tagged in the class), false negatives Fn (number of voiced segments belonging to this class, incorrectly classified in another class), and thus the error rate Te (see Table 7).
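These per-class counts can be obtained directly from a confusion matrix of raw counts; the small sketch below is one plausible reading of the definitions above (the normalization of the error rate by the total number of segments is our interpretation, consistent with the figures in Table 7).

```python
# Sketch: per-class false positives, false negatives, and error rate from a
# confusion matrix C of raw counts, where C[i, j] is the number of segments of
# true class i tagged as class j.
import numpy as np

def per_class_errors(C):
    fp = C.sum(axis=0) - np.diag(C)     # tagged as the class, but belonging to another one
    fn = C.sum(axis=1) - np.diag(C)     # belonging to the class, but tagged as another one
    te = (fp + fn) / C.sum()            # error rate, normalized by the total number of segments
    return fp, fn, te

# Example with a hypothetical 3-class count matrix:
# fp, fn, te = per_class_errors(np.array([[50, 3, 2], [4, 45, 6], [1, 5, 40]]))
```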

Joy and Fear exhibit the highest errors, as the classifier often confused them. We argue this result is due to the high degree of arousal that characterizes both Joy and Fear.

5.2 Communication Styles

The validation dataset consists of 2 randomly chosen paragraphs for each of the three communication styles, for a total of 6 paragraphs, which corresponds to 10% of the communication-style dataset.

The average performance indexes of the 10 trained models are shown in Table 9 and Table 10.

Precision, Recall, and F-measure indicate very good performances for the Aggressive and Passive communication styles; acceptable but much smaller values are obtained for Assertive sentences, as they are often tagged as Aggressive (24.26% of the time, as shown in the confusion matrix). The average values for Precision, Recall, and F-measure are about 86%; Accuracy exhibits a similar value.

Table 9: Confusion matrix for communication styles (%); rows are actual styles, columns are predicted styles.

            Aggressive  Assertive  Passive
Aggressive     99.30       0.70      0.00
Assertive      24.26      62.68     13.06
Passive         7.08      10.30     82.62

Table 10: Precision, Recall, and F-measure for communication styles (%).

      Aggressive  Assertive  Passive
Pr       85.55      68.32     93.61
Re       99.33      60.53     83.25
F1       91.93      64.19     88.13

Table 11: Error rate for communication styles.

        Aggressive  Assertive  Passive
Fp          100         64       36
Fn            4         90      106
Te (%)     7.14      10.57     9.75

Table 12: Average Pr, Re, F1, and Ac, for communication styles (%).

Avg Pr  86.10
Avg Re  85.87
Avg F1  85.61
Ac      86.00

The K value, the agreement between the classifier and the dataset, is K=0.777214, meaning that a good agreement was found.

Finally, Table 11 shows the error rates for each class. As expected, Assertive exhibits the highest error (10.57%), while the best result is achieved by the recognition of Aggressive, with an error rate of 7.14%. Analyzing the Fp and Fn values, we noted that only Aggressive had Fp > Fn, which means that the classifier tended to mistakenly associate such a class to segments where it was not appropriate.

6 THE PROTOTYPE

The application architecture is composed of five modules (see Figure 2): GUI, Feature Extraction, Emotion Recognition, Communication-style Recognition, and Praat.

Praat (Boersma, 2001) is a well-known open-source tool for speech analysis (http://www.fon.hum.uva.nl/praat/); it provides several functionalities for the analysis of vocal signals as well as a statistical module (containing the LDA-based classifier we described in Section 4). The scripting functionalities provided by Praat permitted us to quickly implement our prototype.

Figure 2: The PrEmA architecture overview (GUI, Emotion Recognition, Communication-style Recognition, Feature Extraction, and Praat, which hosts the LDA classifiers for emotion and speech-style recognition).

Figure 3: Recognition of emotions and communication styles: (a) recognition of emotions; (b) recognition of communication styles.

The GUI module permits the user to choose the audio file to analyze, and shows the results; the Feature Extraction module performs the calculations presented in Section 4 (preprocessing, segmentation, and feature calculation); the Emotion Recognition and Communication-style Recognition modules rely on the two best-performing models to classify the input file according to its emotional state and communication style. All the calculations are implemented by means of scripts that leverage functionalities provided by Praat. (From the 10 LDA-based classifiers generated for the emotion classification task, the one with the best performance indexes was chosen as the final model; the same approach was followed for the communication-style classifier.)

Figure 3 shows two screenshots of the PrEmA application (translated into English) recognizing, respectively, emotions and communication styles. In particular, in Figure 3(a) the application analyzed a sentence of 6.87 s expressing anger, divided into four segments (i.e., three silences were found); each segment was assigned an emotion: Anger, Joy (mistakenly), Anger, Anger. Figure 3(b) shows the first 10 s fragment of a 123.9 s aggressive speech; the application found 40 segments and assigned each of them a communication style (in the example, the segments were all classified as Aggressive).

7 CONCLUSIONS AND FUTURE WORK

We presented PrEmA, a tool able to recognize emotions and communication styles from vocal signals, providing clues about the state of the conversation. In particular, we consider communication-style recognition as our main contribution, since it could provide a potentially powerful means for understanding the user's needs, problems and desires.

The tool, written using the Praat scripting language, relies on two sets of prosodic features and two LDA-based classifiers. The experiments, performed on a custom corpus of tagged audio recordings, showed encouraging results: for classification of emotions, we obtained a value of about 71% for average Pr, average Re, average F1, and Ac, with K=0.64; for classification of communication styles, we obtained a value of about 86% for average Pr, average Re, average F1, and Ac, with K=0.78.

As future work, we plan to test other classification approaches, such as HMM and CRF, experimenting with them on a bigger corpus. Moreover, we plan to investigate text-based features provided by NLP tools, like POS taggers and parsers. Finally, the analysis will be enhanced according to the "musical behavior" methodology (Sbattella, 2006; Sbattella, 2013).

REFERENCES

Anolli, L. (2002). Le emozioni. Ed. Unicopoli.
Anolli, L. and Ciceri, R. (1997). The voice of emotions. Milano, Angeli.
Asawa, K., Verma, V., and Agrawal, A. (2012). Recognition of vocal emotions from acoustic profile. In Proceedings of the International Conference on Advances in Computing, Communications and Informatics.
Avesani, C., Cosi, P., Fauri, E., Gretter, R., Mana, N., Rocchi, S., Rossi, F., and Tesser, F. (2003). Definizione ed annotazione prosodica di un database di parlato-letto usando il formalismo ToBI. In Proc. of Il Parlato Italiano, Napoli, Italy.
Balconi, M. and Carrera, A. (2005). Il lessico emotivo nel decoding delle espressioni facciali. ESE - Psychofenia - Salento University Publishing.
Banse, R. and Sherer, K. R. (1996). Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology.
Boersma, P. (1993). Accurate Short-Term Analysis of the Fundamental Frequency and the Harmonics-to-Noise Ratio of a Sampled Sound. Institute of Phonetic Sciences, University of Amsterdam, Proceedings, 17:97–110.
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9/10):341–345.
Boersma, P. and Weenink, D. (2013). Manual of Praat: doing phonetics by computer [computer program].
Bonvino, E. (2000). Le strutture del linguaggio: un'introduzione alla fonologia. Milano: La Nuova Italia.
Borchert, M. and Düsterhöft, A. (2005). Emotions in speech - experiments with prosody and quality features in speech for use in categorical and dimensional emotion recognition environments. Natural Language Processing and Knowledge Engineering, IEEE.
Caldognetto, E. M. and Poggi, I. (2004). Il parlato emotivo. Aspetti cognitivi, linguistici e fonetici. In Il parlato italiano. Atti del Convegno Nazionale, Napoli 13-15 febbraio 2003.
Canepari, L. (1985). L'intonazione. Linguistica e paralinguistica. Liguori Editore.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., and Fellenz, W. (2001). Emotion recognition in human-computer interaction. Signal Processing Magazine, IEEE.
D'Anna, L. and Petrillo, M. (2001). APA: un prototipo di sistema automatico per l'analisi prosodica. In Atti delle 11e giornate di studio del Gruppo di Fonetica Sperimentale.
Delmonte, R. (2000). Speech communication. In Speech Communication.
Ekman, D., Ekman, P., and Davidson, R. (1994). The Nature of Emotion: Fundamental Questions. New York Oxford, Oxford University Press.
Gobl, C. and Chasaide, A. N. (2000). Testing affective correlates of voice quality through analysis and resynthesis. In ISCA Workshop on Emotion and Speech.
Hammarberg, B., Fritzell, B., Gauffin, J., Sundberg, J., and Wedin, L. (1980). Perceptual and acoustic correlates of voice qualities. Acta Oto-laryngologica, 90(1–6):441–451.
Hastie, H. W., Poesio, M., and Isard, S. (2001). Automatically predicting dialog structure using prosodic features. In Speech Communication.
Hirshberg, J. and Avesani, C. (2000). Prosodic disambiguation in English and Italian. In Botinis, Ed., Intonation, Kluwer.
Hirst, D. (2001). Automatic analysis of prosody for multilingual speech corpora. In Improvements in Speech Synthesis.
Izard, C. E. (1971). The face of emotion. Ed. Appleton Century Crofts.
Juslin, P. (1998). A functionalist perspective on emotional communication in music performance. Acta Universitatis Upsaliensis, 1st edition.
Juslin, P. N. (1997). Emotional communication in music performance: A functionalist perspective and some data. In Music Perception.
Koolagudi, S. G., Kumar, N., and Rao, K. S. (2011). Speech emotion recognition using segmental level prosodic analysis. Devices and Communications (ICDeCom), IEEE.
Lee, C. M. and Narayanan, S. (2005). Toward detecting emotions in spoken dialogs. Transactions on Speech and Audio Processing, IEEE.
Leung, C., Lee, T., Ma, B., and Li, H. (2010). Prosodic attribute model for spoken language identification. In Acoustics, Speech and Signal Processing, IEEE International Conference (ICASSP 2010).
Lopez-de Ipina, K., Alonso, J.-B., Travieso, C. M., Sole-Casals, J., Egiraun, H., Faundez-Zanuy, M., Ezeiza, A., Barroso, N., Ecay-Torres, M., Martinez-Lage, P., and Lizardui, U. M. d. (2013). On the selection of non-invasive methods based on speech analysis oriented to automatic Alzheimer disease diagnosis. Sensors, 13(5):6730–6745.
Mandler, G. (1984). Mind and Body: Psychology of Emotion and Stress. New York: Norton.
McGilloway, S., Cowie, R., Cowie, E. D., Gielen, S., Westerdijk, M., and Stroeve, S. (2000). Approaching automatic recognition of emotion from voice: a rough benchmark. In ISCA Workshop on Speech and Emotion.
McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley.
Mehrabian, A. (1972). Nonverbal communication. Aldine-Atherton.
Michel, F. (2008). Assert Yourself. Centre for Clinical Interventions, Perth, Western Australia.
Moridis, C. N. and Economides, A. A. (2012). Affective learning: Empathetic agents with emotional facial and tone of voice expressions. IEEE Transactions on Affective Computing, 99(PrePrints).
Murray, E. and Arnott, J. L. (1995). Towards a simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. Journal of the Acoustical Society of America.
Pinker, S. and Prince, A. (1994). Regular and irregular morphology and the psychological status of rules of grammar. In The Reality of Linguistic Rules.
Planet, S. and Iriondo, I. (2012). Comparison between decision-level and feature-level fusion of acoustic and linguistic features for spontaneous emotion recognition. In Information Systems and Technologies (CISTI).
Pleva, M., Ondas, S., Juhar, J., Cizmar, A., Papaj, J., and Dobos, L. (2011). Speech and mobile technologies for cognitive communication and information systems. In Cognitive Infocommunications (CogInfoCom), 2011 2nd International Conference on, pages 1–5.
Purandare, A. and Litman, D. (2006). Humor: Prosody analysis and automatic recognition for F*R*I*E*N*D*S*. In Proc. of the Conference on Empirical Methods in Natural Language Processing, Sydney, Australia.
Russell, J. A. and Snodgrass, J. (1987). Emotion and the environment. Handbook of Environmental Psychology.
Sbattella, L. (2006). La Mente Orchestra. Elaborazione della risonanza e autismo. Vita e pensiero.
Sbattella, L. (2013). Ti penso, dunque suono. Costrutti cognitivi e relazionali del comportamento musicale: un modello di ricerca-azione. Vita e pensiero.
Scherer, K. (2005). What are emotions? And how can they be measured? Social Science Information.
Shi, Y. and Song, W. (2010). Speech emotion recognition based on data mining technology. In Sixth International Conference on Natural Computation.
Shriberg, E. and Stolcke, A. (2001). Prosody modeling for automatic speech recognition and understanding. In Proc. of ISCA Workshop on Prosody in Speech Recognition and Understanding.
Shriberg, E., Stolcke, A., Hakkani-Tür, D., and Tür, G. (2000). Prosody-based automatic segmentation of speech into sentences and topics. Ed. Speech Communication.
Stern, D. (1985). Il mondo interpersonale del bambino. Bollati Boringhieri, 1st edition.
Tesser, F., Cosi, P., Orioli, C., and Tisato, G. (2004). Modelli prosodici emotivi per la sintesi dell'italiano. ITC-IRST, ISTC-CNR.
Tomkins, S. (1982). Affect theory. Approaches to Emotion, Ed. Lawrence Erlbaum Associates.
Wang, C. and Li, Y. (2012). A study on the search of the most discriminative speech features in the speaker dependent speech emotion recognition. In Parallel Architectures, Algorithms and Programming, International Symposium (PAAP 2012).

Extracting�Emotions�and�Communication�Styles�from�Vocal�Signals

195


Recommended