
Interdependencies among Voice Source Parameters in Emotional Speech

Johan Sundberg, Sona Patel, Eva Björkner, and Klaus R. Scherer

Abstract—Emotions have strong effects on the voice production mechanisms and consequently on voice characteristics. The magnitude of these effects, measured using voice source parameters, and the interdependencies among parameters have not been examined. To better understand these relationships, voice characteristics were analyzed in 10 actors' productions of a sustained /a/ vowel in five emotions. Twelve acoustic parameters were studied and grouped according to their physiological backgrounds: three related to subglottal pressure, five related to the transglottal airflow waveform derived from inverse filtering the audio signal, and four related to vocal fold vibration. Each emotion appeared to possess a specific combination of acoustic parameters reflecting a specific mixture of physiologic voice control parameters. Features related to subglottal pressure showed strong within-group and between-group correlations, demonstrating the importance of accounting for vocal loudness in voice analyses. Multiple discriminant analysis revealed that a parameter selection that was based, in a principled fashion, on production processes could yield rather satisfactory discrimination outcomes (87.1 percent based on 12 parameters and 78 percent based on three parameters). The results of this study suggest that systems to automatically detect emotions use a hypothesis-driven approach to selecting parameters that directly reflect the physiological parameters underlying voice and speech production.

Index Terms—Paralanguage analysis, affect sensing and analysis, affective computing, voice source, vocal physiology.


1 INTRODUCTION

It is an everyday experience that emotions and moods have salient effects on voice and speech. The ability to predict the emotion of a speech sample has a number of applications for improving human-machine interactions (such as call directing in call centers and natural speech synthesis). In consequence, research on emotional expression has attempted to identify the acoustic features that describe the prosodic patterns for different emotions. Rather than testing specific hypotheses regarding the relations between the emotions and their acoustic effects on speech, the strategy has been an exploratory approach to identify the acoustic correlates for different vocal expressions that are readily computable by a variety of standard algorithms. In many of these attempts, mel-frequency cepstral coefficients (MFCCs), fundamental frequency of phonation (f0), intensity, parameters to measure the energy distribution of the spectrum, and timing measurements have been tested [1]. Typically, these initial and extremely large acoustic feature sets are reduced through a passive feature selection process in which the results of statistical procedures or data reduction algorithms such as principal components analysis (PCA), Fisher projection, multiple or stepwise regressions, or sequential forward or backward search [2], [3], [4] determine the features to be used in emotion classification. No attempt has been made to narrow down the feature set based on physiological relevance and highly correlated (or redundant) parameters. Only a handful of studies have investigated classification using voice source features [5], [6], [7], [8], [9]. These studies have been interested in the predictive power of the voice source features alone rather than in forming an optimal feature set with both voice source and waveform or spectral features.

Most acoustic parameters have been linked to an underlying arousal dimension (ranging from highly alert and excited to relaxed and calm [1], [10]). Psychological models often use multiple dimensions to describe various emotion-related phenomena. For example, Fontaine et al. [11] suggest that four dimensions (valence: dividing positive and negative emotions; potency: degree of control; arousal; and unpredictability) may be necessary to differentiate among the perceived similarity in emotion words. Indeed, their results show that the power/control dimension explains a larger percentage of the variance than the arousal dimension. Early on, Green and Cliff [12] suggested that a two-dimensional model (pleasantness and excitement) may be necessary to differentiate among the emotions in speech. Subsequent research has shown the importance of specific vocal cues in communicating emotional arousal but has failed to identify the acoustic cues that carry the valence information (see review in [1]). This review also points out the need for identifying acoustic cues differentiating further dimensions of the emotional space, especially power/potency, as this is essential to differentiate certain emotions that have similar valence and arousal loadings, particularly anger and fear. Although it is not clear exactly how many dimensions may be needed to differentiate the emotions in vocally expressed speech devoid of any semantic information, the evidence suggests that more than one dimension is necessary.

J. Sundberg and E. Björkner are with the Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Drottning Kristinas v. 31, Stockholm SE-100 44, Sweden. E-mail: [email protected], [email protected].

S. Patel and K.R. Scherer are with the Swiss Centre for Affective Sciences (CISA), University of Geneva, 7 Rue des Battoirs, Geneva 1205, Switzerland. E-mail: {Sona.Patel, Klaus.Scherer}@unige.ch.

Manuscript received 8 Nov. 2010; revised 4 May 2011; accepted 18 May 2011; published online 6 June 2011. Recommended for acceptance by S. Narayanan. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TAFFC-2010-11-0106. Digital Object Identifier no. 10.1109/T-AFFC.2011.14.

The search for acoustic parameters that may better quantify aspects of other psychological dimensions has motivated the physiology-driven approach adopted for the analyses reported here. The direction of this work is similar to recent research on facial expressions, where physiological changes due to emotion expression are described by facial "action units" or groups of muscles that collaborate in creating a particular movement [13]. Using the facial action coding system, automatic facial expression detection has moved beyond blind identification of movement changes to an informed feature selection procedure that looks for groups of changes that co-occur [14]. Knowledge of the significant relations among vocal parameters from a physiological perspective may help in creating "vocal action units" to provide an informed feature selection process for the vocal domain. Thus, the basic strategy of the present study was to view the results of an acoustic analysis of emotional vocalizations from the point of view of voice physiology, i.e., voice production, particularly as reflected in voice quality.

Voice quality is determined by two main factors, vocal fold vibration and vocal tract resonance. These factors are controlled by the speaker's phonatory and articulatory behaviors. Thus, the quality of voice sounds is controlled by movements in the voice organ, including the respiratory system, the larynx, and the vocal tract. Indeed, every change of vocal sound quality mirrors movements within the voice organ. Hence, the voice organ is a transformer of movement to sound. Movement patterns are the central characteristics of the expression of different emotions and moods, as shown by the copious research on facial and bodily movement in emotional expression [15], [16]. Contrary to the latter expression modalities, the movements of the voice organs are largely invisible (except for some articulatory movements such as mouth opening). Thus, the movements that produce vocalization—those that are affected by specific physiological changes in emotion—need to be inferred from the vocal sounds they give rise to.

In recent years important progress has been made in understanding the relationships between the physiology and acoustics of voice production. The properties of the waveform of the transglottal airflow, i.e., the voice source, are determined by and hence reflect vocal fold vibration characteristics, which in turn are determined by a combination of the driving transglottal pressure drop and various laryngeal adjustments. The voice source properties are controlled by three main physiological parameters: the transglottal pressure drop, i.e., the air pressure difference across the glottis; the glottal adduction force, i.e., the force by which the folds are pressed together; and vocal fold length and tension.

The acoustic correlates of various voice source waveform characteristics have been theoretically and empirically analyzed. Thus, a shortening of the closed phase and an increase of the peak-to-peak air pulse amplitude tend to increase the amplitude of the voice source fundamental. Further, a greater drop in the rate of airflow during the closing phase (i.e., a larger maximum flow declination rate or MFDR) increases the excitation strength of the vocal tract, resulting in an increase of the level of the radiated sound and a decrease in the overall slope of the voice source spectrum, other things being constant [17], [18].

Some investigations have studied how variations of the main physiological voice control parameters subglottal pressure, glottal adduction, and vocal fold length and tension affect the waveform of the transglottal airflow, i.e., the properties of the voice source [19], [20]. Thus it has been shown that an increase of subglottal pressure increases the MFDR, the relative duration of the closed phase, and the pulse amplitude. An increase of glottal adduction increases the duration of the closed phase and decreases the pulse amplitude. In particular, some investigations suggest that the ratio between the pulse amplitude and MFDR, frequently referred to as the amplitude quotient (AQ), is particularly sensitive to variation of glottal adduction [21]. Finally, increasing vocal fold length and tension as a result of stretching the vocal folds increases f0.

The spectral consequences of these waveform responses to changes in physiological voice control parameters are not specific. Thus, an increase of subglottal pressure will increase MFDR, but MFDR may also increase because of a decrease in glottal adduction, since this tends to increase the pulse amplitude [18]. Likewise, an increase of the closed quotient may be caused not only by an increase in glottal adduction, but also by an increase in subglottal pressure. Also, an increase of f0 may be caused not only by an increase of vocal fold length and tension, but also by an increase of subglottal pressure.

The waveform characteristics lead to spectrum effects that are also not entirely specific. An increase of MFDR increases the overall sound level in addition to reducing the spectral slope. An increase of the pulse amplitude will result in an increase in the level difference between the first and second voice source partials, generally referred to as H1-H2 [18], [22], and the overall sound level, but a decrease in the overall spectrum slope (due to the increase in MFDR).

Summarizing, the physiology underlying the waveform and spectrum effects of vocal sounds is complex. There are reasons to suspect that the physiological characteristics of voice production are more relevant to perception than the acoustic characteristics. For example, according to a classical study by Ladefoged and McKinney [23], subglottal pressure correlates better with perceived vocal loudness than with SPL. This supports the assumption that the key to the emotional code of voice and speech would hide in the physiological rather than the acoustic voice characteristics.

An attempt was recently made to find the acoustic characteristics of the emotions sadness, fear, anger, relief, and joy by analyzing a material where 10 actors tried to represent these emotions by sustaining the vowel /a/ [24]. In the current investigation, we perform a variety of correlation analyses within and across speakers and emotions using the same data set in order to understand the interdependencies among voice source parameters. This discussion provides a rather novel contribution to the fields of emotion psychology and affective computing, as automatic detection of emotions and moods has become a target of much research in recent years (for example, [25], [26], [27]), particularly in the interest of the machine classification of expressed emotions or the vocal synthesis of emotional expressions in autonomous agents. An understanding of the physiological mechanisms involved in voice production as measured by voice source parameters can help identify groups of related structures and parameters, thereby leading to an improved measurement of vocal expressions.

2 METHODS

2.1 Speech Material

The stimulus materials were previously collected as part of the Geneva Multimodal Emotional Portrayal (GEMEP) database, in which 10 professional French-speaking actors (5 male, 5 female; mean age: 37.1 years) expressed two nonsense sentences and an extended /a/ vowel in at least 12 emotional contexts. Written scenarios were given to help invoke the emotion during an interaction with a professional stage director (complete details of the emotion induction technique, recording procedures, and perceptual accuracy are given in [28]). For the present investigation, the latter material expressed in five emotions (relief, sadness, joy, panic fear, and hot anger) was used, thus avoiding linguistic prosody related changes. The five emotions were chosen to include strong differences along the arousal and valence dimensions and possibly power or potency, particularly between hot anger and panic fear.

The recordings were made in a sound-treated studio at the University of Geneva. The audio signal was obtained using a head-mounted Sennheiser microphone located at the participant's left ear. The microphone amplification was held constant for all recordings. The recorded signals were digitized at a sampling rate of 41 kHz and saved as wav files. The mean token length was 2.11 seconds (standard deviation = 1.07 seconds). Two samples of each emotion were obtained from each actor, resulting in 100 samples (10 speakers × 5 emotions × 2 repetitions). All recordings were submitted to perceptual evaluation as part of the validation study. Perceptual accuracy computed for only the samples used in the present study is as follows: angry (59.3 percent), joy (18.1 percent), panic fear (72.5 percent), relief (61.4 percent), and sadness (21.0 percent). Since the study was a classification task of 17 items (15 emotions, "no emotion," and "other emotion"), these rates are well above chance level (5.88 percent).

2.2 Voice Source Analysis

A mix of voice source, waveform, and spectral parameters was extracted from the speech signals (see Table 1). A representative time period (T0) from each audio signal was selected for voice source analysis. Inverse filtering was performed by means of custom-made software (Decap by Svante Granqvist, KTH, Stockholm). In the present application, the program displayed the inverse-filtered waveform, its derivative, as well as the spectrum before and after the inverse filtering (see Fig. 1). A ripple-free closed phase and a smoothly falling source spectrum envelope, void of dips and peaks, were used as the criteria for manual tuning of the filters. The emerging flow glottograms were saved together with their derivatives in new files. This inverse filtering method yields particularly reliable formant frequency measurement, being determined on the basis of both spectrum and waveform information.
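For readers without access to Decap, the general idea can be illustrated with a simplified, fully automated approximation: estimate the vocal tract filter by linear prediction and remove it from the signal to obtain an estimate of the flow derivative and, by integration, the flow. The Python sketch below is only an illustration under those assumptions; it does not reproduce the manually tuned, formant-by-formant procedure used in the study, and the LPC order rule of thumb and pre-emphasis constant are generic defaults.

```python
# Simplified, automated approximation of inverse filtering (NOT the Decap
# procedure used in the study, which relied on manual filter tuning). The LPC
# order rule of thumb and the 0.97 pre-emphasis constant are generic defaults.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC via the Yule-Walker equations."""
    frame = frame * np.hanning(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))            # A(z) = 1 - sum(a_k z^-k)

def inverse_filter(x, fs, order=None):
    """Estimate the transglottal flow and its derivative from a sustained vowel."""
    if order is None:
        order = int(fs / 1000) + 2                # common LPC order rule of thumb
    pre = lfilter([1.0, -0.97], [1.0], x)         # pre-emphasis (source/radiation tilt)
    a = lpc_coefficients(pre, order)              # vocal tract model
    d_flow = lfilter(a, [1.0], x)                 # inverse filter -> flow derivative
    flow = np.cumsum(d_flow)                      # crude integration -> flow
    flow -= np.linspace(flow[0], flow[-1], len(flow))   # remove drift
    return flow, d_flow
```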

The Decap program provides the flow glottogram, showing transglottal airflow versus time. A number of parameters were computed from the flow glottograms, including the closed phase, AC pulse amplitude (ACAmp), MFDR (i.e., the negative peak value of the derivative of the flow glottogram), closed quotient (QClosed, defined as the ratio between closed phase and T0), and the normalized amplitude quotient (NAQ, defined as the ratio between the ACAmp and the product of MFDR and T0). In addition, the level difference between the two lowest source spectrum partials ("H1-H2") of the flow glottogram was measured by means of the Spectrum Section program of the SoundSwell Core Signal Workstation (v. 4.0, Saven Hitech, Täby, Sweden). These measurements are illustrated in Fig. 2.
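Given one period of the resulting flow glottogram, these time-domain measures reduce to a few lines of arithmetic. The following sketch assumes a single period of flow sampled at fs; the amplitude threshold used to detect the closed phase is an illustrative choice, not the criterion applied in the study.

```python
# Minimal sketch of the per-period flow glottogram measures defined above.
# `flow` is one period of transglottal airflow sampled at `fs`; the 10 percent
# amplitude threshold used to detect the closed phase is an illustrative choice.
import numpy as np

def glottogram_measures(flow, fs, closed_thresh=0.10):
    t0 = len(flow) / fs                           # period duration T0
    d_flow = np.gradient(flow) * fs               # flow derivative
    ac_amp = flow.max() - flow.min()              # AC pulse amplitude (peak-to-peak)
    mfdr = -d_flow.min()                          # maximum flow declination rate
    closed = flow < flow.min() + closed_thresh * ac_amp   # near-minimum flow samples
    q_closed = closed.sum() / len(flow)           # closed quotient = closed phase / T0
    naq = ac_amp / (mfdr * t0)                    # normalized amplitude quotient
    # H1-H2: level difference (dB) between the two lowest partials of one period.
    spectrum = np.abs(np.fft.rfft(flow - flow.mean()))
    h1_h2 = 20 * np.log10(spectrum[1] / spectrum[2])
    return {"ACAmp": ac_amp, "MFDR": mfdr, "QClosed": q_closed,
            "NAQ": naq, "H1-H2": h1_h2}
```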

The ACAmp, the peak-to-peak airflow amplitude, tends to increase with increased subglottal pressure and with decreased glottal adduction; a large ACAmp tends to produce a strong voice source fundamental. In contrast, QClosed increases with glottal adduction and, at low degrees of vocal loudness, also with subglottal pressure. MFDR represents how strongly the voice source is exciting the vocal tract resonator, and is thus strongly affiliated with vocal loudness. NAQ is closely related to glottal adduction, decreasing when adduction increases.

Inverse filtering could mostly be performed, except in samples where the amplitudes of the overtones were weak. This was the case for the sadness samples, which failed to show a clear closed phase due to the soft and sometimes hypofunctional breathy phonation. This was evidenced by the absence of the clearly demarcated flow discontinuity that typically results from vocal fold collision. Because of such problems, no samples of "sadness" could be analyzed. In addition, some of the samples produced with a loud voice were inverse-filtered at instances within the first 25 percent of the vowel so as to avoid clipped regions.

TABLE 1
List of Acoustic Parameters (and Abbreviations) Extracted from the Emotional Expressions

The original wav files were analyzed by means of the SoundSwell software. The Hist program was used to compute the equivalent sound level (Leq) average. The Corr autocorrelation subroutine was used for extracting f0 and its average (Mf0), determined by means of the Spectrum Section program. An average spectrum over the entire /a/ utterance was obtained as the long-term-average spectrum (LTAS), also from the Spectrum Section program. This was measured between 0 and 6,700 Hz using an analysis bandwidth of 100 Hz. Two measurements were computed from the LTAS data. The first parameter, commonly referred to as the alpha ratio (Alpha), is defined as the ratio between the summed sound energy in the spectrum below and above 1,000 Hz. This parameter, expressed in dB, is highly dependent on subglottal pressure and, hence, vocal loudness, but it is also influenced by formant characteristics [29]. The second parameter, H1-H2LTAS, was defined as the difference in mean LTAS level, expressed in dB, over two or three of the filter bands that surrounded Mf0 and 2 × Mf0. This is a novel parameter and was assumed to reflect the average of the level difference between the first two spectrum partials. This parameter should be closely related to H1-H2 derived from the flow glottogram; however, the H1-H2LTAS has the advantage of allowing measurement for utterances that cannot be inverse-filtered. This parameter should be influenced by type of phonation.
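As an illustration, the three level-based measures (Leq, Alpha, and H1-H2LTAS) can be approximated directly from the audio signal as sketched below. The Welch-based LTAS, the uncalibrated level reference, and the choice of the three bands nearest Mf0 and 2 × Mf0 are assumptions, not the SoundSwell settings.

```python
# Illustrative approximations of Leq, the alpha ratio, and H1-H2_LTAS.
# The Welch-based LTAS, the uncalibrated level reference, and the use of the
# three bands nearest Mf0 and 2*Mf0 are assumptions, not the SoundSwell settings.
import numpy as np
from scipy.signal import welch

def leq_db(x, ref=1.0):
    """Equivalent level: 10*log10 of the mean squared signal (uncalibrated)."""
    return 10 * np.log10(np.mean(x ** 2) / ref ** 2)

def ltas(x, fs, bandwidth=100.0, fmax=6700.0):
    """Long-term average spectrum as mean power in `bandwidth`-wide bands."""
    f, pxx = welch(x, fs, nperseg=int(fs / bandwidth) * 4)
    edges = np.arange(0.0, fmax + bandwidth, bandwidth)
    centers = (edges[:-1] + edges[1:]) / 2
    power = np.array([pxx[(f >= lo) & (f < hi)].mean()
                      for lo, hi in zip(edges[:-1], edges[1:])])
    return centers, power

def alpha_ratio_db(centers, power, cutoff=1000.0):
    """Level difference (dB) between summed energy below and above the cutoff."""
    return 10 * np.log10(power[centers < cutoff].sum() / power[centers >= cutoff].sum())

def h1_h2_ltas_db(centers, power, mf0, n_bands=3):
    """Mean LTAS level near Mf0 minus the mean level near 2*Mf0, in dB."""
    def band_level(freq):
        idx = np.argsort(np.abs(centers - freq))[:n_bands]
        return 10 * np.log10(power[idx].mean())
    return band_level(mf0) - band_level(2 * mf0)
```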

Additional analysis of the original audio files was performed using Praat [30]. From this analysis, the following parameters were extracted: the relative average perturbation (rap) jitter ("Jitter"), i.e., the average absolute difference between a period and the average of it and its two neighbors, divided by the mean period; shimmer (local), i.e., the average absolute difference between the amplitudes of consecutive periods, divided by the mean amplitude ("Shimmer"); and the harmonics-to-noise ratio (cc, or the cross-correlation method; "HNR").
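These Praat measures can also be scripted, for example through the parselmouth wrapper around Praat. The sketch below assumes common Praat default settings (a 75-500 Hz pitch range and the standard jitter/shimmer window parameters); the exact settings used in the study are not reported.

```python
# Sketch of the Praat perturbation measures via the parselmouth wrapper.
# The pitch range and the jitter/shimmer window settings below are common
# Praat defaults and are assumptions; the paper does not report the values used.
import parselmouth
from parselmouth.praat import call

def perturbation_measures(wav_path, f0min=75, f0max=500):
    snd = parselmouth.Sound(wav_path)
    points = call(snd, "To PointProcess (periodic, cc)", f0min, f0max)
    jitter_rap = call(points, "Get jitter (rap)", 0, 0, 0.0001, 0.02, 1.3)
    shimmer_local = call([snd, points], "Get shimmer (local)",
                         0, 0, 0.0001, 0.02, 1.3, 1.6)
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, f0min, 0.1, 1.0)
    hnr = call(harmonicity, "Get mean", 0, 0)
    return {"Jitter": jitter_rap, "Shimmer": shimmer_local, "HNR": hnr}
```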

2.3 Statistical Analyses

Fig. 1. Decap program display. Upper panel: Flow glottogram waveform and its derivative. Lower panel: Spectrum of the signal before and after inverse filtering; F1, F2, F3, F4 show the formant frequencies and bandwidths of the filters, the latter on an arbitrary scale along the vertical axis. The smooth curves represent the typical bandwidth variation range.

Fig. 2. Flow glottogram measures. The flow glottogram is shown in the top panel and its time derivative is shown in the middle panel. The bottom panel shows the spectrum of the flow glottogram marked with the two lowest spectrum partials, H1 and H2.

Two types of correlation analyses were performed in SPSS (SPSS, Inc.). First, the relationships between the 12 parameters were analyzed within each speaker using Pearson's correlation coefficient in order to examine the stability of parameter coherence across speakers. Next, the Pearson's correlation was computed between parameters, this time across all speakers and emotions. The purpose of this analysis was to determine the parameters that are strongly related in emotional speech, which in turn may reveal the important physiological mechanisms used in vocal expression.
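With the measurements arranged in a long-format table, both correlation passes reduce to a few calls. A minimal sketch follows, assuming a pandas DataFrame with hypothetical "speaker" and "emotion" columns plus the 12 parameter columns; the column names are illustrative, not the authors' data layout.

```python
# Sketch of the two correlation passes described above, assuming a pandas
# DataFrame `df` with hypothetical "speaker" and "emotion" columns plus the
# 12 parameter columns listed in `params`.
from itertools import combinations
from scipy.stats import pearsonr

def correlation_passes(df, params, alpha=0.05):
    # Pass 1: within each speaker, count significant parameter pairs,
    # signed by the direction of the correlation (cf. Table 2).
    within_counts = {pair: 0 for pair in combinations(params, 2)}
    for _, grp in df.groupby("speaker"):
        for a, b in within_counts:
            sub = grp[[a, b]].dropna()        # voice source values may be missing
            r, p = pearsonr(sub[a], sub[b])
            if p < alpha:
                within_counts[(a, b)] += 1 if r > 0 else -1
    # Pass 2: pooled across all speakers and emotions (cf. Table 3).
    pooled = df[params].corr(method="pearson")
    return within_counts, pooled
```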

Two additional analyses were performed, one to examine the change in parameters within each of the physiological mechanisms identified and a second to examine the discriminability between emotions for each of the 12 parameters by means of discriminant analysis. For this analysis, the data were first normalized. Most voice parameters are more or less influenced by the individual characteristics of the voice organ. For example, persons with shorter vocal folds tend to have higher f0 ranges than persons with longer vocal folds [31]. Therefore, it is beneficial in certain analyses to normalize each parameter by speaker. A change score was computed for each parameter as the difference between the mean measured score (across repetitions) and the speaker's baseline. In the past, this difference has been computed relative to a "neutral" emotion. Since a "neutral" emotional expression was not recorded (as forced neutrality tends to produce unnatural vocalizations), the speaker baselines were defined for each parameter as the mean value across all expressions. In most cases, the average was computed across 10 samples (5 emotions × 2 repetitions); however, as inverse filtering of the sadness samples was not feasible, the average of the voice source parameters was computed across eight samples. Hence, except for the sadness samples, the total number of remaining cases was 50 (1 mean sample × 10 speakers × 5 emotions) for each parameter. The data were further normalized by z-transforming the values by speaker.

These values allowed comparisons of the direction and strength of variation for a given parameter by emotion.
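The baseline and normalization steps just described can be summarized as follows; column names are again hypothetical.

```python
# Sketch of the normalization described above: average over the two
# repetitions, subtract the speaker baseline (the mean over all of that
# speaker's expressions), then z-transform by speaker.
import pandas as pd

def normalize_by_speaker(df, params):
    # One value per speaker and emotion: mean over the repetitions.
    means = df.groupby(["speaker", "emotion"])[params].mean().reset_index()
    frames = []
    for speaker, grp in means.groupby("speaker"):
        baseline = grp[params].mean()              # speaker baseline
        change = grp[params] - baseline            # change scores
        z = change / change.std(ddof=0)            # z-transform within speaker
        frames.append(z.assign(speaker=speaker, emotion=grp["emotion"].values))
    return pd.concat(frames, ignore_index=True)
```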

3 RESULTS

Pearson's correlations were calculated between the 12 parameters for each speaker. The number of significant correlations between each pair of parameters (summed across the 10 speakers) is shown in matrix form in Table 2, with the sign specifying the direction of the relationship. This table suggests that Leq was the most influential parameter with 25 significant correlations. Other influential parameters include Alpha, H1-H2LTAS, MFDR, Mf0, and HNR. In seven speakers an increase of Leq was associated with a decrease of Alpha and in half of the speakers with a decrease of H1-H2LTAS. An increase of Mf0 was frequently associated with a decrease of Alpha.

TABLE 2
Number of Significant Correlations between the Parameters Analyzed within Participants
Note: Negative values indicate the number of negative correlations. The bottom row provides the total number of significant correlations found for each parameter.

In a second analysis, the Pearson's correlation between parameters was calculated across all speakers and emotions. The correlation coefficients are shown in Table 3. One of the results observed from this table is the strong significant correlations among the parameters Leq, Alpha, and MFDR. This is not surprising since MFDR represents the excitation strength of the vocal tract and thus affects the level of the radiated sound and the overall source spectrum slope. Thus, these three parameters are all closely related to vocal loudness. An interesting result is that the parameters of the loudness group were also significantly correlated with many other parameters including the two H1-H2 measures and Jitter. The four parameters primarily associated with glottal adduction, QClosed, H1-H2, ACAmp, and NAQ, also showed rather strong correlations between themselves, except for ACAmp. NAQ also failed to correlate with H1-H2LTAS. Since NAQ was highly correlated with H1-H2, it is likely that the inability to compute NAQ for the sadness samples (H1-H2LTAS was computed for the sadness samples but H1-H2 was not) affected this result. Finally, all parameters associated with regulation of vocal fold vibration, i.e., Mf0, Jitter, Shimmer, and HNR, were significantly correlated with each other.

By and large, these findings support the assumption that the parameters can be divided into three groups according to their main physiologic correlates. Thus, Leq, Alpha, and MFDR are all heavily influenced by subglottal pressure, which is used for variation of vocal loudness. Similarly, the two H1-H2 estimates, QClosed, ACAmp, and NAQ are all closely related to glottal adduction, which is a factor relevant to phonation type; a firm glottal adduction produces a pressed or hyperfunctional type, and a weak adduction leads to a hypofunctional or breathy phonation. Finally, Mf0, Jitter, Shimmer, and HNR are all related to vocal fold vibration, which would be influenced by vocal fold length and tension, glottal adduction, and subglottal pressure.

To compare the changes in parameter values within each physiological group, the parameters were normalized and grouped according to their main physiological correlate mentioned above and displayed in Figs. 3, 4, and 5. Each graph shows the mean and range of values for males and females for each emotion (recall that these measurements are z-transformed differences from each speaker's baseline). As these measures are normalized change scores, large positive or negative values indicate stronger deviations from the speaker's baseline, and values around zero indicate that the parameter value was near the speaker's baseline.


TABLE 3
Significant Correlations among Parameters, Calculated across Participants and Emotions
Note: Correlations shown are significant at the 0.05 level; values in bold are also significant at the 0.01 level. The bottom row provides the total number of significant correlations found for each parameter.

Fig. 3. Female and male actors' normalized change scores from speaker baselines for Leq, Alpha, and MFDR in the indicated emotions. Symbols indicate the mean value for each gender, and the bars indicate the range of values across speakers.

The parameters included in the vocal loudness group (Leq, Alpha, and MFDR) are shown in Fig. 3. These parameters show a reasonably coherent pattern with a relatively small variation across emotions. Sadness and relief, the emotions low in arousal, have negative Leq and positive Alpha values. These two parameters are inversely related, and therefore this pattern is expected. Fear, anger, and joy, the high-arousal emotions, show the opposite pattern—positive Leq and negative Alpha. The MFDR is positive for anger and intermediate for fear, distinguishing between two emotions high in arousal, and suggesting that vocal loudness, i.e., subglottal pressure, was higher in anger than in fear. This suggests a stronger glottal adduction in fear than in anger. Fig. 4 similarly shows the parameter differences in emotion for the glottal adduction group (the two measures of H1-H2, the ACAmp, the QClosed, and the NAQ). These parameters show a less coherent picture. Relief is high in both H1-H2 measures and low in ACAmp and QClosed, suggesting weak glottal adduction. Fear showed a low H1-H2, suggesting strong glottal adduction, but, somewhat surprisingly, fear also showed high ACAmp, which typically is associated with a high H1-H2. The high ACAmp may have been caused by a high subglottal pressure and may also account for the high ACAmp observed for anger. The remaining parameters comprising the third group, those related to vocal fold length and tension, are shown in Fig. 5. Among these parameters, Mf0 was the most systematic, with large values in the high-arousal emotions fear, anger, and joy, and low values in sadness and relief. Shimmer was low in fear and high in anger.

Table 4 summarizes the trends represented in Figs. 3, 4, and 5. The parameters are grouped here according to their main physiological mechanism: Leq, Alpha, and MFDR in the subglottal pressure group; the two H1-H2 measures, ACAmp, NAQ, and QClosed under the adduction group; and Mf0, Jitter, Shimmer, and HNR under the vocal fold vibration group. The subglottal pressure parameters show a quite coherent pattern, low in sadness and relief, high in fear, anger, and joy, and thus reflect the arousal dimension. The adduction parameters show a somewhat less systematic pattern, H1-H2LTAS being high in sadness and relief and low in joy, fear, and anger.

There are also several instances with unexpected parameter patterns. For instance, the two H1-H2 measures disagree in anger. For the female voices, the H1-H2LTAS measure is sensitive to a change in the first formant frequency (F1), which they raised in anger. A rise in F1 may very well bring this formant closer to and thus enhance the second partial in the /a/ vowel as produced by a female voice. This may cause a decrease of the H1-H2LTAS. A high degree of glottal adduction should lead to low H1-H2, ACAmp, and NAQ values and to a high value of QClosed. In fear, the low values of the H1-H2 parameters and the ACAmp were associated with a high NAQ, and in relief, a high H1-H2 appeared in combination with a low value of QClosed. These cases of conflicting parameter trends may have been caused by interindividual variation.

Fig. 4. Female and male actors' normalized change scores from speaker baselines for QClosed, H1-H2, H1-H2LTAS, ACAmp, and NAQ in the indicated emotions. Symbols indicate the mean value for each gender, and the bars indicate the range of values across speakers.

Fig. 5. Female and male actors' normalized change scores from speaker baselines for Mf0, Jitter, Shimmer, and HNR in the indicated emotions. Symbols indicate the mean value for each gender, and the bars indicate the range of values across speakers.

TABLE 4
Direction of Change in the Indicated Parameters for Female and Male Actors' Normalized Scores
Note: The direction of the change in each parameter is indicated with "high" for an increase in value or "low" for a decrease in value (no change is marked by "NC"). The changes in direction for males and females are given separately unless the change was in the same direction for both genders.

To complement this analysis and determine which of the measured parameters would most successfully allow differentiation among the encoded emotions, a multiple discriminant analysis (MDA) was performed using stepwise entry of the 12 parameters. Five variables were entered (Alpha, H1-H2LTAS, Shimmer, Leq, and NAQ, in this order, using standard entry and removal criteria), resulting in three discriminant functions. As the inverse filtering parameters could not be determined for sadness, only the remaining four emotions were entered into the analysis. Table 5 shows the structure matrix for this solution. The first function is determined by parameters that are, as described above, related to subglottal pressure and vocal loudness, the second function represents a perturbation or noise factor, and the third function includes variables obtained through inverse filtering and mainly marking the degree of vocal fold adduction.

Table 6 shows the values for the four emotions at the group centroids. The first (loudness) function differentiates the high arousal emotions (panic fear and hot anger) from relief, the second (perturbation) function differentiates the high power and energy emotion of hot anger from the other emotions, and the third (adduction) function differentiates joy from the remaining emotions. The classification matrix based on cross-validation is shown in Table 7. This solution correctly classified 87.2 percent of the cases using cross-validation (100 percent of the original cases). Confusions are particularly noticeable between joy and panic fear.

In a second MDA, only the waveform and spectral variables were tested in order to be able to include sadness in the discrimination. Based on the preceding structure matrix (and an examination of the correlation patterns reported above), we decided to perform the MDA with only three variables, one for each of the three discriminant functions shown above: Alpha for loudness, Shimmer for perturbation, and H1-H2LTAS for adduction. The structure matrix in Table 8 confirms that these selected parameters indeed represent the three production-based factors. The values of the emotions at the group centroids for these functions are shown in Table 9. It can be seen that relief and sadness are characterized by lower vocal loudness, hot anger being separated from the other emotions by high perturbation. In a somewhat less spectacular way, the adduction factor seems to separate the positive from the negative emotions.

Table 10 shows the classification matrix. This solution correctly classified 78 percent of the cases using cross-validation (84 percent of the original cases), suggesting that these parameters do indeed provide a simplified representation of the components and can be used to successfully discriminate among emotions. In comparison with the earlier discriminant function set, based on five parameters, including NAQ based on time-consuming inverse filtering, and discriminating only a subset of four emotions, the current solution based on only three waveform-extracted parameters distinguishes all five emotions with only 5 percent loss of accuracy. As can be seen in the classification matrix, the confusion pattern remains essentially unchanged.
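A comparable (though not identical) classification setup can be sketched with scikit-learn's linear discriminant analysis and leave-one-out cross-validation; the stepwise variable entry of the SPSS procedure is not replicated here, and the three-parameter feature set follows the text above.

```python
# Sketch of a discriminant classification comparable to the MDA above, using
# scikit-learn; SPSS's stepwise entry is not reproduced. X holds the
# normalized change scores for Alpha, H1-H2_LTAS, and Shimmer; y the emotions.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def mda_classification(X, y):
    lda = LinearDiscriminantAnalysis()
    y_pred = cross_val_predict(lda, X, y, cv=LeaveOneOut())
    return accuracy_score(y, y_pred), confusion_matrix(y, y_pred)
```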

4 DISCUSSION

The correlation analyses between the various parameters illustrate the complexity of the relations between physiological and acoustical voice properties. The first set of correlations performed for each speaker across emotions revealed the parameters that covaried as a result of expressivity and phonatory and articulatory habits. These results also demonstrated that the same physiological setting can result in different acoustic values across voices because of morphological differences. For instance, a fundamental frequency produced with a certain subglottal pressure and degree of glottal adduction can generate different pulse amplitudes if the glottis length (i.e., the vocal fold length) is not the same. Likewise, Leq is influenced by the frequency of the first formant, which varies across speakers for a particular vowel depending on factors such as pronunciation habits and vocal tract dimensions.

TABLE 5
Structure Matrix Showing Three Functions Resulting from a Multiple Discriminant Analysis
Note: Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function (where "*" indicates the largest absolute correlation between each variable and any discriminant function, and "a" refers to a variable not used in the analysis).

TABLE 6
Discriminant Analysis Functions at Group Centroids
Note: Unstandardized canonical discriminant functions evaluated at group means.

TABLE 7
Classification Matrix Based on Cross-Validation
Note: Values represent number of cases.

TABLE 8
Structure Matrix from Multiple Discriminant Analysis with Three Parameters
Note: Pooled within-groups correlations between discriminating variables and standardized discriminant functions. Variables ordered by absolute size of correlation within function, where "*" indicates the largest absolute correlation between each variable and any discriminant function.

These correlations shed some light on the parameters that covary for acoustic and/or physiological reasons, irrespective of emotions and articulatory habits. The influence of such factors should be reduced if the number of data points is increased by pooling speakers. Hence, the second set of correlations among all parameters was computed across all samples (all speakers and emotions). This test revealed a number of significant correlations between the parameters analyzed. An examination of the significant correlations suggested that the parameters belonged to one of three physiological mechanisms: subglottal pressure, glottal adduction, or vocal fold vibration frequency. These findings suggest that the vocal variability associated with the emotional coloring of utterances concealed the correlations between the three groups of parameters, presumably because they were systematically varied between emotions.

The novel H1-H2LTAS measure, defined as the difference between the mean LTAS level across the filter bands near Mf0 and the mean LTAS level one octave higher, had the advantage of being applicable for phonation void of a closed phase (as was the case for the sadness samples, where inverse filtering could not be performed). A significant correlation between this measure and the H1-H2 measure was found in only two actors and with a correlation of no more than 0.67, suggesting that this parameter provides information unique from the H1-H2 parameter. The low correlation may arise since the H1-H2LTAS measure is influenced by both f0 and F1, whereas the H1-H2 measure reflects only the amplitude of the voice source fundamental.

The NAQ parameter has been found to correlate inversely with perceived degree of glottal adduction [21]. In other words, NAQ decreases as the degree of glottal adduction increases, since it mainly represents the ratio between ACAmp and MFDR. In the present analysis, the role of NAQ in differentiating emotions was minimal. The lack of significant findings for this parameter may arise since it is also influenced by other phonatory parameters, such as Mf0 and subglottal pressure.

Although the stimulus set used a moderate number of speakers, thereby ensuring a certain degree of generalization of the results, there were a few limitations of this study. Specifically, it should be noted that no voice source data could be obtained from the sadness samples because of the lack of a clearly visible closed phase, a characteristic of very low subglottal pressure, i.e., very soft phonation. In addition, in several samples of anger, the nonclipped part used in the inverse filtering was taken from the beginning of the sample. These two types of exclusions narrowed the dynamic range of the flow glottogram data. Nevertheless, these outcomes are based on correlations of five points per parameter (one for each emotion). Hence, this was a rather strict test since a correlation needed to be quite strong to survive the great voice variability caused by the emotional coloring of the utterances. Therefore, it is not surprising that comparatively few significant correlations were found and that the greatest number of such correlations was found for the highly influential subglottal pressure parameters Leq, Alpha, and MFDR.

Leq and Alpha have not been frequently included in previous investigations. Nevertheless, Leq is more sensitive than the average of SPL to the strongest parts of an utterance and is thus less sensitive to, for example, pauses. Hence, it should be a better measure of the mean loudness of an utterance with varying SPL. Alpha is strongly correlated with vocal loudness and should therefore be a valuable complement to Leq and MFDR [29].

TABLE 9
Discriminant Analysis Functions at Group Centroids for Three Parameters
Note: Unstandardized canonical discriminant functions evaluated at group means. Results based on the inclusion of three parameters.

TABLE 10
Classification Matrix for Three Parameters Based on Cross-Validation
Note: Values represent number of cases. Classification results produced using the MDA solution based on the inclusion of three parameters.

In light of these findings, it is now relevant to reflect back on the findings reported by Patel et al. [24] and relate our three physiological groups to the three components found through PCA analysis of the acoustic parameters. They reported three components, including phonatory effort, phonation perturbation, and phonation frequency (note that results were derived based on the change scores relative to speaker baselines). Their phonatory effort component was described by a positive contribution of QClosed and MFDR and a negative contribution of the two H1-H2 parameters. Thus, it combined a subglottal pressure parameter (MFDR) with two glottal adduction parameters (QClosed and H1-H2). A strong glottal adduction would require a high subglottal pressure, which would be the reason why MFDR was found to belong to this component. The perturbation component was described by a positive contribution of Jitter and Shimmer and a negative contribution of HNR. Perturbation can be produced by turbulence generated by a glottal leakage due to a low degree of glottal adduction. It can also be produced by strong airflow caused by high subglottal pressure combined with a narrow vocal tract constriction. This component differentiated hot anger from joy and panic fear, a distinction that is particularly visible in the present graphs of Fig. 5, except in Mf0. This may be the reason why their PCA analysis identified a third component (phonation frequency) consisting of Mf0 only. Fundamental frequency is mainly controlled by vocal fold length and tension. High Mf0 is typically associated with high vocal effort produced with firm glottal adduction and high subglottal pressure, which is why it was included within the vocal fold length and tension parameter group in the results reported here. The present parameter grouping has the advantage of physiological relevance, which should be closely related to human behavior.

Results from the two multiple discriminant analyses suggest that these components were able to discriminate among the emotions. The first MDA, based on the 12 parameters, provided a classification accuracy of 87.2 percent for four emotions (without sadness). For the second MDA, one parameter was selected from each physiological group: Alpha from the subglottal pressure group, H1-H2LTAS from the glottal adduction group, and Shimmer from the vocal fold vibration group. Results revealed a successful classification of the five emotions (including sadness) at 78 percent. Thus, a parameter selection that was based, in a principled fashion, on production processes can yield rather satisfactory discrimination outcomes. These parameters thus seem particularly rewarding to analyze in future investigations of emotional speech. Clearly, additional efforts are needed to distinguish joy more clearly from panic fear. Further comparisons between direct measurements of production parameters, using EGG or inverse filtering, and a principled selection of waveform-extracted parameters constitute a promising approach in that direction.

An overview of the present results suggests physiological profiles of the emotions studied. Sadness is characterized by low subglottal pressure, weak glottal adduction, and low f0, or, in other words, a low degree of overall vocal effort. By contrast, fear corresponded to high subglottal pressure, high adduction, and high f0. Anger was also related to high subglottal pressure and low degree of Shimmer; however, it failed to show clear patterns with regard to adduction and f0. Relief seemed to share the physiological properties of sadness. Joy was characterized by high subglottal pressure and high f0. By and large, these profiles appear intuitively convincing. The lack of differentiation between sadness and relief indicates that they differ in respects other than those considered here or that either or both can be expressed with differing physiological patterns. While these results are shown for affect bursts expressed by actors, it is unclear how these results may differ from physiologically driven affect bursts experienced in everyday life.

It should be pointed out that a complicating factor in the present experiment is the lack of specificity of acoustic voice parameters. For example, an increase of f0 will automatically lead to a decrease of Alpha since this leads to fewer overtones below 1,000 Hz. Also, an increase of subglottal pressure not only will increase the level of the sound produced, but it will also increase Mf0 and QClosed. Direct measurement of a physiological parameter such as subglottal pressure may increase our understanding of the physiological characteristics of emotional expressivity in the voice. Another issue concerns our method of computing the speaker baselines. The emotions used in this study were selected to represent differences along the three dimensions (arousal, valence, and power), thereby providing a balanced baseline. Unfortunately, we could not foresee the difficulty with inverse filtering the sadness samples. Nevertheless, we predict that any correlations between parameters that did and did not include the sadness samples would be adversely affected and, as a result, would be less likely to show significance. The use of averaged baselines may be useful in real-time applications of natural speech when an ideal "neutral" sample is not available (for example, an initial x seconds of speech may be available as a reference for predicting emotional changes in the remainder of the sample). We hope to see a comparison of classification performance obtained using this sort of "naturally-obtained" averaged baselines with "laboratory-collected" neutral baselines in the future.

5 CONCLUSIONS

Most of the emotions considered here show differing acoustic and physiological characteristics, and thus show general promise of an approach informed by production mechanisms. The present results revealed differences from the results obtained through conventional statistical methods in psychology like the PCA. The present analyses suggested that the emotional samples could be better described by three physiological mechanisms, namely, the parameters that quantified subglottal pressure (Leq, Alpha, and MFDR), glottal adduction (H1-H2LTAS, H1-H2, ACAmp, QClosed, and NAQ), and vocal fold length and tension (Mf0, Jitter, Shimmer, and HNR). In addition, multiple discriminant analysis showed that the single-parameter estimation of the three components, Alpha, H1-H2LTAS, and Shimmer, could be used for a surprisingly successful classification. Hence, an approach informed by the underlying biological mechanisms may enable a better understanding of the relationship between the theoretical and empirical evidence on physiological change patterns for different emotions as determined primarily by the action tendencies elicited by the emotions. For example, while anger will generate aggression tendencies, fear will prepare flight responses. The physiological response patterns will differ accordingly, thereby influencing the vocal output.

These results are very relevant to the area of affective computing, specifically for the development of models to predict emotions from speech. With an understanding of the underlying physiological mechanisms involved in emotion expression, it is possible to develop a vocal action coding system in which vocal changes are described by changes in groups of parameters that co-occur during certain expressions. This technique allows an "informed" feature selection procedure in which the parameters are now chosen based on physiological relevance. This methodology has been successfully used in automatic emotion recognition in the face (based on the facial action coding system developed by Ekman and Friesen [13]), and may be a promising new avenue for voice.

Further refinement of linking the acoustic parameters to production mechanisms may proceed by including a larger number of samples expressed in each emotion by lay speakers in addition to trained speakers to improve the ecological validity of the work. It would also be useful to study a larger, balanced set of emotions in terms of one or more psychological dimensions to obtain a large variation in parameters. In addition, the present study was interested in the patterns of correlations that were highly significant across individuals; however, it would be interesting to further examine why certain correlations are strong for only some individuals, and the perceptual consequences of these differences. Still further refinement of this technique requires experimental work that systematically manipulates production characteristics and that predicts and then tests the production mechanisms for different emotions (such as the underlying determinants like sympathetic arousal or specific action tendencies), potentially using automatic inverse filtering procedures.

ACKNOWLEDGMENTS

This research was supported through grants to Klaus Scherer by the Swiss National Science Foundation (100014-122491) and the European Research Council (ERC-2008-AdG-230331-PROPEREMO). The authors wish to thank the anonymous reviewers for their comments and suggestions on earlier versions of this manuscript.

REFERENCES

[1] P.N. Juslin and K.R. Scherer, “Vocal Expression of Affect,” The New Handbook of Methods in Nonverbal Behavior Research, J.A. Harrigan, R. Rosenthal, and K.R. Scherer, eds., pp. 65-135, Oxford Univ. Press, 2005.

[2] B. Schuller, S. Reiter, R. Muller, M. Al-Hames, M. Lang, and G. Rigoll, “Speaker Independent Speech Emotion Recognition by Ensemble Classification,” Proc. IEEE Int’l Conf. Multimedia and Expo, pp. 864-867, 2005.

[3] R. Picard, E. Vyzas, and J. Healy, “Toward Machine Emotional Intelligence: Analysis of Affective Physiological State,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 10, pp. 1175-1191, Oct. 2001.

[4] C.M. Lee, S. Narayanan, and R. Pieraccini, “Recognition of Negative Emotions from the Speech Signal,” Proc. IEEE Workshop Automatic Speech Recognition and Understanding, 2001.

[5] A.I. Iliev, M.S. Scordilis, J.P. Papa, and A.X. Falco, “Spoken Emotion Recognition through Optimum-Path Forest Classification Using Glottal Features,” Computer, Speech, and Language, in press.

[6] J.F. Torres, E. Moore II, and E. Bryant, “A Study of Glottal Waveform Features for Deceptive Speech Classification,” Proc. IEEE Int’l Conf. Acoustics, Speech, and Signal Processing, pp. 4489-4492, 2008.

[7] M. Airas and P. Alku, “Emotions in Short Vowel Segments: Effects of the Glottal Flow as Reflected by the Normalized Amplitude Quotient,” Proc. Affective Dialogue Systems Workshop, pp. 13-24, 2004.

[8] E. Moore, M.A. Clements, J.W. Peifer, and L. Weisser, “Critical Analysis of the Impact of Glottal Features in the Classification of Clinical Depression in Speech,” IEEE Trans. Biomedical Eng., vol. 55, no. 1, pp. 96-107, Jan. 2008.

[9] J. Toivanen et al., “Emotions in [a]: A Perceptual and Acoustic Study,” Logopedics, Phoniatrics, Vocology, vol. 31, pp. 43-48, 2006.

[10] C. Pereira, “Dimensions of Emotional Meaning in Speech,” Proc. ISCA Workshop Speech and Emotion, pp. 25-28, 2000.

[11] J. Fontaine, K.R. Scherer, E. Roesch, and P. Ellsworth, “The World of Emotions Is Not Two-Dimensional,” Psychological Science, vol. 18, pp. 1050-1057, 2007.

[12] R.S. Green and N. Cliff, “Multidimensional Comparisons of Structures of Vocally and Facially Expressed Emotion,” Perception and Psychophysics, vol. 17, no. 5, pp. 429-438, 1975.

[13] P. Ekman and W. Friesen, Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, 1978.

[14] Y.I. Tian, T. Kanade, and J.F. Cohn, “Recognizing Action Units for Facial Expression Analysis,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 97-115, Feb. 2001.

[15] E. Krumhuber and A. Kappas, “Moving Smiles: The Role of Dynamic Components for the Perception of the Genuineness of Smiles,” J. Nonverbal Behavior, vol. 29, pp. 3-24, 2005.

[16] K.R. Scherer, “Appraisal Considered as a Process of Multilevel Sequential Checking,” Appraisal Processes in Emotion: Theory, Methods, Research, K.R. Scherer, A. Schorr, and T. Johnstone, eds., pp. 92-120, Oxford Univ. Press, 2001.

[17] G. Fant, Speech Acoustics and Phonetics. Kluwer Academic Publishers, 2004.

[18] J. Gauffin and J. Sundberg, “Spectral Correlates of Glottal Voice Source Waveform Characteristics,” J. Speech and Hearing Research, vol. 32, pp. 556-565, 1989.

[19] J. Sundberg, M. Andersson, and C. Hultqvist, “Effects of Subglottal Pressure Variation on Professional Baritone Singers’ Voice Sources,” J. Acoustical Soc. Am., vol. 105, pp. 1965-1971, 1999.

[20] J. Sundberg, E. Fahlstedt, and A. Morell, “Effects on the Glottal Voice Source of Vocal Loudness Variation in Untrained Female and Male Voices,” J. Acoustical Soc. Am., vol. 117, no. 2, pp. 879-885, 2005.

[21] J. Sundberg, M. Thalen, P. Alku, and E. Vilkman, “Estimating Perceived Phonatory Pressedness in Singing from Flow Glottograms,” J. Voice, vol. 18, pp. 56-62, 2004.

[22] H.M. Hanson, “Glottal Characteristics of Female Speakers: Acoustic Correlates,” J. Acoustical Soc. Am., vol. 101, no. 1, pp. 466-481, 1997.

[23] P. Ladefoged and N.P. McKinney, “Loudness, Sound Pressure, and Subglottal Pressure in Speech,” J. Acoustical Soc. Am., vol. 35, pp. 454-460, 1963.

[24] S. Patel, K.R. Scherer, J. Sundberg, and E. Bjorkner, “Mapping Emotions into Acoustic Space: The Role of Voice Production,” Biological Psychology, vol. 87, pp. 93-98, 2011.

[25] A. Batliner, K. Fisher, R. Huber, J. Spilker, and E. Noth, “Desperately Seeking Emotions or: Actors, Wizards, and Human Beings,” Proc. ISCA Workshop Speech and Emotion, pp. 195-200, 2000.

[26] T. Vogt and E. Andre, “Improving Automatic Emotion Recognition from Speech via Gender Differentiation,” Proc. IEEE Int’l Conf. Multimedia, 2005.

[27] S. Lee, S. Yildirim, A. Kazemzadeh, and S. Narayanan, “An Articulatory Study of Emotional Speech Production,” Proc. Interspeech, pp. 497-500, 2005.

[28] T. Banziger and K.R. Scherer, “Introducing the Geneva Multimodal Emotion Portrayal (GEMEP) Corpus,” A Blueprint for an Affectively Competent Agent: Cross-Fertilization between Emotion Psychology, Affective Neuroscience, and Affective Computing, K.R. Scherer, T. Banziger, and E. Roesch, eds., Oxford Univ. Press, 2010.

[29] J. Sundberg and M. Nordenberg, “Effects of Vocal Loudness Variation on Spectrum Balance as Reflected by the Alpha Measure of Long-Term-Average Spectra of Speech,” J. Acoustical Soc. Am., vol. 120, pp. 453-457, 2006.

[30] P. Boersma and D. Weenink, “Praat: Doing Phonetics by Computer [Computer Program],” version 5.1.43, retrieved 4 Aug. 2010 from http://www.praat.org/, 2010.

[31] F. Roers, D. Murbe, and J. Sundberg, “Predicted Singers’ Vocal Fold Lengths and Voice Classification—A Study of X-Ray Morphological Measures,” J. Voice, vol. 23, no. 4, pp. 408-413, 2009.


Johan Sundberg received the Filosofie kandidat degree in 1961, the Filosofie licentiat degree in 1963, and the Filosofie doktor degree and docent (musicology) in 1966, and held a personal chair in music acoustics at KTH, Stockholm, from 1979 until he retired in 2001. He is an associate editor of the Journal of Voice, Journal of New Music Research, Archives of Acoustics, and Music Perception. He received the Silver Medal in Musical Acoustics from the Acoustical Society of America and the Quintana Award from the Voice Foundation. His current research interests include music acoustics, acoustics of the singing voice, theory of music performance, and music perception. He is a fellow of the Acoustical Society of America and a member of the Royal Swedish Academy of Music.

Sona Patel received the BS degree in electrical engineering from Boston University in 2004 and the MA and PhD degrees in communication sciences and disorders from the University of Florida in 2008 and 2009, respectively. Directly after, she joined the Swiss Center for Affective Sciences as a postdoctoral researcher, where she is currently developing a biologically inspired computational model of emotions in speech with an emphasis on dynamic acoustic measurements, measurement of voice quality, and listener perception. Her other interests are in the rehabilitation of affective disorders and social signal processing. She is a member of the Acoustical Society of America and the International Speech Communication Association.

Eva Bjorkner received the master’s degree as a singing teacher in 1998 and the PhD degree in voice acoustics from KTH in 2006. She is a professional singer and a company owner, doing voice teaching, lecturing, and voice analysis. Her current research interests include the voice production of different singing styles and the techniques behind them.

Klaus R. Scherer received the PhD degree from Harvard University in 1970. After teaching at the University of Pennsylvania, Philadelphia, and the University of Kiel, Germany, he was appointed a full professor of social psychology at the University of Giessen, Germany, in 1973. He was a full professor of psychology at the University of Geneva, Switzerland, and director of the Geneva Emotion Research Group from 1985 to 2008. He is currently a professor emeritus at the University of Geneva and director of the Swiss Center for Affective Sciences (CISA) in Geneva. He is a fellow of the Acoustical Society of America, the American Psychological Association, and the Association for Psychological Science. He is a member of the Academia Europaea and an honorary foreign member of the American Academy of Arts and Sciences.



