

CONTENTS

FOREWORD by Hans von Leden vii
PREFACE by Krzysztof Izdebski ix
ACKNOWLEDGMENTS xiii
CONTRIBUTORS xv
INTRODUCTION by Krzysztof Izdebski xix

1 Research on Emotional Perception of Voices Based on a Morphing Method 1
Kazuhiko Kakehi, Yuko Sogabe, and Hideki Kawahara

2 A Paralinguistic Template for Creating Persona in Interactive Voice Response (IVR) Systems 15
Osamuyimen Thompson Stewart

3 Memory for Emotional Tone of Voice 35
John W. Mullennix

4 Assessing Voice Characteristics of Depression among English- and Spanish-Speaking Populations 49
Gerardo M. González and Amy L. Ramos

5 Automatic Discrimination of Emotion from Voice: A Review of Research Paradigms 67
Juhani Toivanen, Tapio Seppänen, and Eero Väyrynen

6 Dazed and Confused: Possible Processing Constraints on Emotional Response to Information-Dense Motivational Speech 79
Claude Steinberg

7 Emotion Processing Deficits in Functional Voice Disorders 105
Janet E. Baker and Richard D. Lane

8 Emotions, Anthropomorphism of Speech Synthesis, and Psychophysiology 137
Mirja Ilves and Veikko Surakka


9 LUCIA, a New Emotive/Expressive Italian Talking Head 153
Piero Cosi and Carlo Drioli

10 Perceptions of Japanese Anime Voices by Hebrew Speakers 177
Mihoko Teshigawara, Noam Amir, Ofer Amir, Edna Milano Wlosko, and Meital Avivi

11 Recognition of Vocal and Facial Emotions: Comparison between Japanese and North Americans 187
Sumi Shigeno

12 Automatic Recognition of Emotive Voice and Speech 205
Julia Sidorova, John McDonough, and Toni Badia

13 The Context of Voice and Emotion: A Voice-Over Artist’s Perspective 238
Kathleen Antonia Tarr

14 Tokin Tuf: True Grit in the Voice of Virility 239
Claude Steinberg

15 Vocal Expressions of Emotions and Personalities in Japanese Anime 263
Mihoko Teshigawara

16 Preserving Vocal Emotions while Dubbing into Brazilian Portuguese: An Analysis of Characters’ Voices in Children’s Movies 277
Mara Behlau and Gisele Gasparini

17 Voice and Emotions in the Philippine Culture 289
Juliana Sustento Seneriches

18 The Strains of the Voice 297
Steven Connor

19 Approaches to Emotional Expressivity in Synthetic Speech 307
Marc Schröder

INDEX 323

VOICE AND EMOTION: VOLUME 3



CHAPTER 4

Assessing Voice Characteristics of Depression among English- and Spanish-Speaking Populations

Gerardo M. González and Amy L. Ramos

Abstract

Here we examine the integration of computerized speech recognition and digital voice analyses (VIDAS) to assess depressed mood and symptoms in English- and Spanish-speaking populations. The findings show that VIDAS consistently administers reliable, valid, and culturally sensitive screening of depression in these populations. VIDAS has been implemented in high-volume health care settings that serve diverse patient populations but lack bilingual personnel. VIDAS quickly and unobtrusively collects participant data, scores the data, and generates a report to inform health care staff of the participant’s mood and symptoms. As a result, VIDAS assesses many individuals who are unlikely to initially seek out mental health services. However, further study is needed to enhance and refine the VIDAS interview as a viable alternative method of assessment.

The relationship between the gender of the participant and choice of digitized voice showed a preference for a female digitized voice. Several voice characteristics showed significant relationships to depression levels, such as vocal energy and variability; however, the findings have not been consistent across the various VIDAS studies. Shortcomings with the analysis of voice characteristics are discussed, and the role of a baseline measurement is stressed, as it may be difficult to discriminate between a person who is depressed and one who normally speaks with a monotonic voice, and because psychiatric comorbidity and medications also distort vocal markers for depression. Also, gender, age, linguistic, and physical factors that interact with speech characteristics may require developing unique models of vocal emotional properties.

Assessing Voice Characteristics of Depression among English and Spanish Speakers

Depressive disorders afflict 6% to 7% of the general population in the United States (Smith & Weissmann, 1992). Major depressive disorder is the leading cause of disability in the United States and developing countries (World Health Organization, 2001). Many depressed individuals are treated at primary care medical settings, where up to 30% of the patients may be clinically depressed (Broadhead, Clapp-Channing, Finch, & Copeland, 1989). Primary health care settings, however, suffer from deficiencies in screening practices, high patient volume, and enormous time constraints that hinder the adequate assessment of depression. Pérez-Stable, Miranda, Muñoz, & Ying (1990) found that depression was accurately detected in only 36% of primary care medical patients.

Latinos constitute nearly 13% of the U.S. population, comprise the second largest ethnic group in the United States, and are the fastest growing ethnic group in the country (U.S. Census, 2000). Past research suggests that Latinos are at higher risk for depression than non-Latinos. For example, Kessler, McGonagle, Zhao, & Nelson (1994) found that Latinos reported an 8.1% prevalence rate for current affective disorders (7% is the norm). In fact, Mexican Americans reported a higher prevalence for affective disorders than their Mexican-born counterparts (Vega et al., 1998).

Latinos in the United States generally lack accessibility to culturally responsive and linguistically compatible mental health services (González, 1997). An Epidemiological Catchment Area (ECA) study indicated that only 11% of Mexican Americans (vs. 22% of non-Hispanic Whites) who met the criteria for clinical depression sought a mental health care provider for treatment (Hough et al., 1987; Shapiro et al., 1984). Latinos underutilize mental health services because of cultural, linguistic, financial, and service delivery barriers (Woodward, Dwinell, & Arons, 1992). Moreover, 40% of the U.S. Latino population primarily speaks Spanish or has limited English proficiency (U.S. Census, 2000).

The absolute number of Latino therapists in the United States (29 for every 100,000 Latinos compared to 173 clinicians per 100,000 non-Hispanic Whites) represents an insufficient number to feasibly meet the present mental health needs of U.S. Latino populations (Center for Mental Health Services, 2000). Clearly, more appropriately trained, culturally sensitive bilingual mental health professionals are needed. Yet the growing disparity between the Latino population (estimated to increase over 50% in the next decade) and the current pool of Latino clinical psychology doctoral students in the training pipeline (levels static since 1980) makes it unlikely that ample Spanish-speaking professionals will be available to provide necessary services (e.g., Bernal & Castro, 1994). Alternative strategies for delivering culturally responsive mental health assessment services for the detection of depression in Spanish-speaking communities are needed.

Computerized psychological assessment represents several major advantages in the structure, flexibility, and ease of test administration (Kobak, 1996). Structured computerized interviewing improves the quality, quantity, and integrity of clinical data by accurately transcribing, scoring, and storing patient responses, standardizing administration procedures, and minimizing errors attributable to human oversight (Erdman, Klein, & Greist, 1985). For example, a clinician may inadvertently omit up to 35% of clinically meaningful inquiries during an open-ended face-to-face interview (Climent, Plutchik, & Estrada, 1975). Many depressed patients report a preference for computer interactive interviews over face-to-face interviews, even when patients knew the clinician (e.g., Carr, Ghosh, & Ancill, 1983). One possible explanation for such a preference is that computerized interviewing may increase respondent self-disclosure because of discomfort with revealing sensitive issues (e.g., suicidal ideation) to a clinician (Levine, Ancill, & Roberts, 1989). Another appealing aspect of computerized assessment is that it produces cost savings through the use of more efficient professional time to conduct assessment batteries and treatment (Butcher, 1987). Thus, computerized screening provides a cost-effective and efficient means for assessing depression.

Recent advances in computerized technology offer viable alternative screening methods for populations not reliably assessed with standard paper-and-pencil questionnaires (Starkweather & Muñoz, 1989). For example, illiterate persons or non-English speakers are less likely to utilize mental health services because of written assessment or language barriers. For such populations, computerized technology has the potential to minimize the obstacles that contribute to the underidentification of depression. Among the technologies that have strong potential is computerized speech recognition. A computerized speech recognition application is capable of administering a discrete choice questionnaire by presenting an item (visually on a computer screen or aurally by a prerecorded prompt) and recognizing a spoken response. Based on the capabilities of speech recognition technology and the imminent need for alternative depression screening methods in English- and Spanish-speaking communities, González and colleagues developed bilingual computerized speech recognition applications for screening depression.
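The discrete-choice recognition step described above can be sketched as a lookup from a recognized utterance to a response code. The phrase vocabulary and the convention that an unrecognized answer returns None (so the interview would repeat the item) are illustrative assumptions, not the actual grammar used by González and colleagues.

```python
# Sketch: map a recognized utterance to a discrete response code.
# The phrase set and codes below are illustrative, not the actual
# vocabulary used in the studies described in this chapter.

RESPONSE_CODES = {
    "less than one day": 0,
    "one to two days": 1,
    "three to four days": 2,
    "five to seven days": 3,
}

def code_response(recognized_text):
    """Return the response code for a recognized utterance, or None if
    the utterance is outside the expected vocabulary (in which case an
    interview application would typically repeat the item)."""
    return RESPONSE_CODES.get(recognized_text.strip().lower())

print(code_response("Three to four days"))  # -> 2
print(code_response("um, maybe"))           # -> None
```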

Research also indicates that voice analysis may improve the accuracy of detecting depression. Digital analysis of voice characteristics represents a powerful methodology for the objective assessment of depression (Starkweather, 1992). Voice characteristics serve as useful clinical indices for depression symptoms because vocalizations (respiration, articulation, and tension or relaxation of larynx and oral muscles) are mediated by psychomotor disturbance stemming from neurophysiological and subcortical (mesolimbic) dysfunction (Flint, Black, Campbell-Taylor, Gailey, & Levington, 1993; Nilsonne, Sunberg, Ternstrom, & Askenfelt, 1988).

Research demonstrates that several quantitative voice characteristics are good predictors of depression, such as narrow variability in tone (monotone), low fundamental frequency (pitch), and low amplitude or loudness (Hargreaves & Starkweather, 1964; Vanger, Summerfield, Rosen, & Watson, 1992). Multilingual research has generated a model of depressed voice prosody (tempo and rhythm) represented by slower, flatter, and softer voice waves (Darby, Simmons, & Berger, 1984; Kuny & Stassen, 1993; Scherer & Zei, 1988). Cross-cultural studies also suggest that depressed individuals display distinctive speech patterns compared to nondepressed persons, including more pauses and fewer utterances (e.g., Friedman & Sanders, 1992; Stassen, Bomben, & Günther, 1991) and longer vocal response latency (vocal reaction time) to answer a presented item (e.g., Stout, 1981; Talavera, Sáiz-Ruiz, & García-Toro, 1994). Furthermore, changes in speech variables are better predictors of mood change for patients in treatment than psychiatrists’ impressions (Siegman, 1987). Thus, voice analysis can help to discern between the acoustic characteristics of depressed and nondepressed persons.


Quantitative acoustic variables include speech rate (number of utterances per time frame), mean pitch (average fundamental frequency of utterances), pitch variability, changes in pitch, and vocal intensity (energy values of an utterance). For example, a sad mood displays identifiable vocal markers (e.g., slow, soft, monotonic speech) that are distinguishable from vocal effects in normal mood and other emotional states. Table 4–1 summarizes the general research findings on vocal characteristics for several emotional states (Murray & Arnott, 1993).
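As a rough illustration of these variables, the sketch below estimates mean pitch, pitch variability, and vocal intensity from a waveform using framewise autocorrelation. This is a generic textbook method, not the analysis pipeline used in the studies described here; the frame length and pitch search range are arbitrary illustrative choices.

```python
import numpy as np

def frame_pitch(frame, sr, fmin=75.0, fmax=400.0):
    """Estimate F0 of one frame via autocorrelation (a common, simple method)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for the F0 search
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

def voice_stats(signal, sr, frame_len=2048):
    """Mean pitch, pitch variability (SD), and mean intensity (RMS) over frames."""
    pitches, rms = [], []
    for i in range(0, len(signal) - frame_len, frame_len):
        frame = signal[i:i + frame_len]
        pitches.append(frame_pitch(frame, sr))
        rms.append(np.sqrt(np.mean(frame ** 2)))
    return np.mean(pitches), np.std(pitches), np.mean(rms)

# Synthetic 200 Hz tone as a stand-in for a voiced utterance.
sr = 16000
t = np.arange(sr) / sr
mean_f0, f0_sd, intensity = voice_stats(np.sin(2 * np.pi * 200 * t), sr)
```

On the synthetic tone, the estimated mean pitch lands on 200 Hz with essentially zero variability, the pattern a monotone (depressed-sounding) voice would show.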

The two most common voice analyses of depression models are the structured speech and free-form speech approaches (Alpert, Pouget, & Silva, 2000). The structured speech approach requires the respondent to repeat a determined sound (please say “A”) or to read text (please read the following paragraph). The recorded repetition or text is assessed for mood with short-time, long-time, and spectral analyses. The free-form speech approach involves the assessment of natural open-ended speech. The respondent is asked an open-ended question and the free-form response is recorded and analyzed. Also, it is common to obtain a pretest (baseline) of an individual’s voice characteristics and a post-test (after treatment or intervention) to assess change in mood or emotion.

Speech Behavior

González and colleagues initiated speech recognition research for investigating speech behavior to increase the detection of depression. Initially, the researchers explored speech behavior, such as vocal response latency (VRL) and speech recognition accuracy (SRA), i.e., the computer’s accuracy level for recognizing a participant’s utterances. The researchers hypothesized that longer VRL and lower SRA would be related to depressed mood.

The speech recognition applications are based on the Center for Epidemiological Studies-Depression scale (CES-D). The CES-D is a 20-item self-report screening measure developed by the National Institute of Mental Health (NIMH) for assessing the frequency of depressive mood and symptoms during the past week (less than 1 day, 1–2 days, 3–4 days, 5–7 days). In the general population, a cut point score of 16 or greater suggests a high level of depressive symptoms (Radloff, 1977). The CES-D has strong psychometric sensitivity for identifying symptomatic individuals; well established normative, reliability, and validity data with English- and Spanish-speaking samples; and extensive testing with clinical and nonclinical populations (Mosciki, Locke, Rae, & Boyd, 1989; Myers & Weissman, 1980).
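A minimal scoring sketch for the CES-D as described above (20 items coded 0–3, totals 0–60, cut point of 16). Item wording and the reverse scoring of the positively worded items are omitted, so this assumes responses have already been coded.

```python
# Minimal CES-D scoring sketch. Each of the 20 items is coded 0-3
# (frequency of the symptom over the past week), so totals range 0-60;
# a score of 16 or greater flags a high level of depressive symptoms
# (Radloff, 1977). Reverse scoring of the positively worded items is
# assumed to have been applied already.

def score_cesd(item_codes):
    if len(item_codes) != 20 or any(c not in (0, 1, 2, 3) for c in item_codes):
        raise ValueError("expected 20 item codes in the range 0-3")
    total = sum(item_codes)
    return total, total >= 16   # (total score, above-cut-point flag)

total, flagged = score_cesd([1] * 20)   # total 20 -> above the cut point
```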

González, Costello, La Tourette, Joyce, & Valenzuela (1997) evaluated a bilingual speaker-dependent cellular telephone-assisted computerized speech recognition CES-D. In a single session counterbalanced design, 32 English (ES) and 23 Spanish speakers (SS) completed randomly ordered computer-telephone (CT) and face-to-face (FF) CES-D methods (0–7 days’ response format), the Beck Depression Inventory (BDI) (Beck & Steer, 1993), and the Short Acculturation Scale (SAS). VRL and SRA were measured. The results suggested that the two CES-D methods displayed strong internal consistency estimates (a > .85), good alternate forms reliabilities (> .85), and high correlations to the BDI (r > .80) for both language groups. The two groups rated both methods equally high, but the ES preferred the FF mode because it was more personable. Among SS, the correlation between depression and acculturation was not significant. For the CT method, depression scores directly correlated with VRL (.45) and inversely related to speech recognition accuracy (−.37) across both language groups. Thus, longer VRL and lower SRA (more recognition complications) served as general indices of depression.

Table 4–1. Summary of the research findings on vocal emotional effects relative to neutral speech

                   Sadness               Fear                Disgust                     Anger                          Happiness
Speech rate        Slightly slower       Much faster         Very much slower            Slightly faster                Faster or slower
Pitch average      Slightly lower        Very much higher    Very much lower             Very much higher               Much higher
Pitch range        Slightly narrower     Much wider          Slightly wider              Much wider                     Much wider
Pitch changes      Downward inflections  Normal              Wide, downward inflections  Abrupt, on stressed syllables  Smooth, upward inflections
Vocal intensity    Lower                 Normal              Lower                       Higher                         Higher
Voice quality      Resonant              Irregular voicing   Grumbled, chest tone        Breathy, chest tone            Breathy, blaring
Articulation       Slurring              Precise             Normal                      Tense                          Normal

Note. From “Toward the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion,” by I. A. Murray and J. L. Arnott, 1993, Journal of the Acoustical Society of America, 93, pp. 1097–1108.

González and colleagues conducted two studies with large Spanish- and English-speaking samples and collected retest data on participants in a second session. The purpose of the two one-year studies was to develop, test, and evaluate an English and Spanish continuous speaker-independent speech recognition CES-D application for screening depression symptoms by digital cellular telephone. A continuous speaker-independent system is designed to recognize natural continuous speech across multiple independent users. The system does not require template training; thus, interview time is significantly reduced. Also, the system presented the interview using a prerecorded digitized female or male voice selected by the participant. In previous prototypes, only a prerecorded digitized male voice was presented.

Study 1 assessed the psychometric congruence of two speech recognition CES-D methods (0–7 days’ choices) for detecting depression levels in ES and SS. A 2 (language) × 2 (method) × 2 (session) repeated measures experimental design was employed. The CES-D was randomly administered to 82 ES and 85 SS in CT or FF form in two sessions (at least a 2-week interval). Additional measures included a structured demographic interview, the Bidimensional Acculturation Scale (BAS) (Marín & Gamba, 1996), and the BDI. VRL and SRA were also measured. The results suggested that both methods displayed strong psychometric properties. The means for the two methods were generally not significantly different for both ES and SS. The two methods demonstrated high inter-item consistencies (a range .83 to .94) and strong correlations to the BDI (range .68 to .88) for both languages. Test-retest reliabilities were very good (range .84 to .89); however, reliability of the ES CT method was moderate (.47). Although the two language groups rated both methods highly, both groups preferred the FF method. Analyses of the digitized interviewer gender showed that ES chose a female voice significantly more often in the first session while SS selected a female voice more frequently in the second session. FF VRL was positively correlated to depression scores for the ES sample in the first (.29) and second sessions (.46). SRA was negatively correlated with depression scores in the ES first session (−.28) and SS second session (−.45). In other words, depressed persons tended to experience more voice recognition complications during the computer interview, requiring more repetitions of the items and more time to complete the interview (González et al., 2000).
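The inter-item consistency figures quoted throughout these studies are Cronbach’s alpha coefficients. One way to compute alpha from a respondents-by-items score matrix is sketched below; this is the standard formula, not code from the studies themselves.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()      # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)
```

For example, three respondents answering two items identically ([[0, 0], [1, 1], [2, 2]]) yield an alpha of 1.0, the maximum internal consistency.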

Study 2 was a validation study of an English and Spanish telephone-assisted speaker-independent CES-D. The aim of Study 2 was to evaluate the sensitivity (detecting true depressives) and specificity (detecting true nondepressives) of the CES-D for assessing major depression in ES and SS. Presentation of the CES-D (0–7 days’ choices) was refined based on the findings of Study 1. The unique features of Study 2 included administering the Composite International Diagnostic Interview (CIDI) (Kessler, Nelson, McGonagle, & Liu, 1996) to identify depressed and nondepressed participants. The relationship of depression scores to the BDI-II, BAS, and VRL was also assessed. Study 2 utilized a 2 × 2 × 2 (language × diagnosis × session) repeated measures (test-retest) design. A total of 160 participants (80 ES and 80 SS), divided by diagnosis group (depressed and nondepressed), were interviewed.

Data analyses revealed that there were no significant language group differences for the means and variabilities of the CES-D across both sessions. The CES-D displayed strong internal consistency for both language groups in both sessions (a ranged from .88 to .94). Test-retest reliabilities were .85 and .64 for the SS and ES, respectively. There were strong convergent validity coefficients between the CES-D and the BDI-II in both sessions (.69 and .67 for SS and .64 and .87 for ES). CIDI analyses indicated that the CES-D displayed good sensitivity (.76) and specificity (.50) for the first session and similar sensitivity (.77) and specificity (.58) in the second session. More than two thirds of all participants selected a female digitized voice for the first session. In the second session, 60% of the participants selected a female voice. Participants positively rated (1 = very uncomfortable to 6 = very comfortable) the CES-D in the first session (both group means over 4.3, no significant differences). VRL and CES-D total scores were positively related in the first session (r = .14) but not in the second session (González, 2000).
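Sensitivity and specificity as defined above (the true positive rate among diagnosed cases and the true negative rate among non-cases) reduce to simple counts against a diagnostic reference such as the CIDI. A sketch with made-up toy data, not the study’s data:

```python
def sensitivity_specificity(screen_positive, diagnosed):
    """Sensitivity = true positives / all diagnosed cases;
    specificity = true negatives / all non-cases. Inputs are parallel
    lists of booleans: screening result and diagnostic (e.g., CIDI) status."""
    tp = sum(s and d for s, d in zip(screen_positive, diagnosed))
    tn = sum((not s) and (not d) for s, d in zip(screen_positive, diagnosed))
    pos = sum(diagnosed)
    neg = len(diagnosed) - pos
    return tp / pos, tn / neg

# Toy data: one true case caught by the screen, one false positive.
sens, spec = sensitivity_specificity(
    [True, False, True, False],   # screen result
    [True, False, False, False],  # diagnostic status
)
```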

Digital voice analysis packages (e.g., Avaaz Interactive Voice Analysis System, IVANS) conduct complex short-time and long-time acoustic analyses for detecting emotion in voice characteristics. Short-time analysis examines a segment of a voice signal, such as a phoneme (basic sound). Long-time spectrum analysis assesses the entire voice signal. Spectral analyses of voice samples generate spectrograms, which are two- and three-dimensional visual representations plotted along various acoustic variables (time, frequency, and amplitude). Table 4–2 summarizes the definitions of common acoustic voice measures (Avaaz, 1998). Digital voice analysis was implemented in the next phase of research.
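The long-time analysis mentioned here can be sketched as the average of framewise magnitude spectra over the whole signal. The frame length and Hanning window below are arbitrary illustrative choices, not IVANS’s actual settings.

```python
import numpy as np

def ltass(signal, sr, frame_len=512):
    """Long-term average speech spectrum: the magnitude spectrum of each
    frame, averaged over the whole signal. Returns (freqs, mean_spectrum)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)                       # reduce spectral leakage
    spectra = np.abs(np.fft.rfft(frames * window, axis=1))
    return np.fft.rfftfreq(frame_len, d=1.0 / sr), spectra.mean(axis=0)
```

On a pure 1 kHz tone the averaged spectrum peaks at the 1 kHz bin; on running speech it summarizes how vocal energy is distributed across frequency, which is what the depression analyses operate on.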

Voice-Interactive Depression Assessment System

González and colleagues evaluated the Voice-Interactive Depression Assessment System (VIDAS) to detect depression symptoms (using the CES-D) among English and Spanish speakers. The researchers developed VIDAS using speaker-independent continuous speech recognition technology (Schalkwyk, Colton, & Fanty, 1998). The researchers administered VIDAS to the participants using a Pentium laptop computer (Windows XP) with a microphone/speaker handset.

VIDAS presented a discrete choice questionnaire in English or Spanish by playing digitally recorded .wav audio files, recognizing a respondent’s spoken answers, scoring the responses, and storing the data. Two bilingual professionals, one male and one female, fluent in both English and Spanish, recorded the prompts, instructions, and items in a neutral tone to reduce potential biases from participant reactivity. VIDAS randomly ordered the digitized male or female voice to which the participant vocally responded. VIDAS also recorded participant vocal data for subsequent voice analysis using IVANS. Table 4–3 summarizes the basic VIDAS interview sequence.

Table 4–2. Definitions of acoustic voice measures

Long-Term Average Speech Spectrum (LTASS): Summary of how energy in an utterance is distributed across frequency, on average, over the duration of the specified signal.

Spectral tilt: Rate at which the energy of the speech signal declines as frequency increases.

Flatness: Represents the flatness of the LTASS. For speech signals that have more noise content (“breathy” signals).

Centroid: A weighted measure that determines the effective fulcrum of the LTASS. For unvoiced sounds, the spectral centroid is usually around 2–3 kHz, while voiced sounds have a lower spectral centroid.

Skewness: Quantifies the spread of the LTASS. For a spectrum that has a Gaussian shape, the skewness is equal to zero. Positive skewness values indicate more energy in the high frequency region, while negative skewness values reflect low frequency spectra.

Kurtosis: Quantifies the shape of the spectrum. Lower kurtosis values indicate flat spectra, while higher values indicate spectra with varying peaks.

Speech Measures: Acoustic measures gathered from running speech.

Tilt: Similar to the spectral tilt parameter, except only the voiced segments of speech are included in the computation of the LTASS.

Harmonic-to-noise ratio: The effect of both pitch and amplitude perturbations. It also accounts for such conditions as the increased noise in the main formant frequency region, increased high frequency noise, and decreased higher harmonics.

Linear prediction signal-to-noise ratio (SNR): Relies on linear prediction modeling of the input speech sample. The SNR measure is taken as the ratio of the input signal energy and the energy of the residual signal at the output of the linear prediction model. Normal talkers typically have high LP-SNR values, which reflect good linear prediction modeling performance.

Pitch amplitude: The amplitude of the second largest peak of the normalized autocorrelation function of the residual signal.

Spectral flatness ratio (SFR): A measure of how successfully the LP technique was able to model the input signal. If the LP model is successful, the residual signal is made up of a series of impulses, one at each glottal excitation period.

Note. From Interactive Voice Analysis System (IVANS) User’s Guide, by Avaaz Innovations Inc., 1998. Reprinted with permission of Avaaz Innovations Inc.: London, Ontario, Canada.
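For illustration, the LTASS shape measures above (centroid, skewness, kurtosis) are moment statistics of the spectrum treated as a distribution over frequency, and flatness is commonly computed as a geometric-to-arithmetic mean ratio. The sketch below uses these standard definitions, which may differ in detail from IVANS’s implementation.

```python
import numpy as np

def spectral_shape(freqs, spectrum):
    """Centroid, skewness, kurtosis, and flatness of a magnitude spectrum.
    The first three treat the normalized spectrum as a probability
    distribution over frequency; flatness is the geometric mean over the
    arithmetic mean (1.0 for a perfectly flat, noise-like spectrum)."""
    p = spectrum / spectrum.sum()
    centroid = (freqs * p).sum()
    var = ((freqs - centroid) ** 2 * p).sum()
    skew = ((freqs - centroid) ** 3 * p).sum() / var ** 1.5
    kurt = ((freqs - centroid) ** 4 * p).sum() / var ** 2
    flatness = np.exp(np.mean(np.log(spectrum))) / spectrum.mean()
    return centroid, skew, kurt, flatness
```

A perfectly flat spectrum yields a centroid at the middle of the band, zero skewness, and a flatness of 1.0, which is why these measures can separate breathy, noise-heavy voices from resonant ones.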



Study 1 involved the development and pilot testing of bilingual telephone and microphone speech recognition VIDAS-1 prototypes. VIDAS-1 was a computer-telephone or computer-microphone (CM) form of the CES-D (0–7 days’ choices). Fifty-eight English speakers and 60 Spanish speakers completed a randomly assigned CT or CM method. Other measures included demographics, the BAS, the BDI-II, and the CIDI.

The results suggested that the CT and CM methods did not significantly differ in total score means and variabilities. VIDAS-1 demonstrated good reliability (a > .80 for CT and CM in both language groups) and strong validity with the BDI-II (r range .69 to .73 for CT and CM in both languages). VIDAS-1 demonstrated good sensitivity (.83) and moderate specificity (.38) across language groups and methods. Although ES rated (M = 4.5) both VIDAS methods higher than SS (M = 3.8), there was no significant language and method interaction. ES and SS were significantly more likely (80%) to select a female digitized voice for the VIDAS interview.

Table 4–3. Summary of VIDAS interview sequence

1. Introduction
   a. Interviewer instructs participant (English or Spanish) for completing VIDAS
   b. Interviewer asks participant to choose the gender of the digitized interviewer voice (male or female)
   c. Interviewer initiates the VIDAS application
   d. Over the handset, VIDAS greets participant in primary language (English or Spanish) and presents brief instructions for completing a scale (randomized)

2. Pretest Recording
   VIDAS instructs participant to repeat a phrase, “This computer responds to my voice.”

3. CES-D Items
   a. VIDAS presents brief instructions for completing the CES-D items orally
   b. VIDAS begins by presenting an item and waits for the participant’s response
   c. Participant verbally responds to the item
   d. VIDAS registers and records the participant’s recognized spoken response
   e. VIDAS continues to the next item until all the items are completed
   f. VIDAS proceeds to the conclusion

4. Post-Test Recording
   VIDAS instructs participant to repeat a phrase, “This computer responds to my voice.”

5. Conclusion
   a. VIDAS thanks the participant, requests that the interviewer be advised, and terminates
   b. VIDAS scores and analyzes the responses
   c. VIDAS saves the results in a database
   d. VIDAS generates a brief interpretative report (summary of responses and interpretation)

A free-form approach for assessing participants’ individual responses to the first two, middle two, and last two CES-D items was utilized. VRL was significantly longer for depressed (M = 5.5 sec) than nondepressed participants (M = 3.3). VRL was also longer for the SS CT group, M = 5.5 (SD = 6.3), than the ES CT group, M = 3.1 (SD = 1.2), and for the SS CM group, M = 3.8 (SD = 3.1), than the ES CM group, M = 2.1 (SD = .61), respectively. Correlations between CES-D total scores and VRL were examined by VIDAS method and language group. There were no significant correlations between CES-D total scores and VRL for either method or for Spanish speakers. However, a significant correlation was found between CES-D total score and VRL for English speakers, r (26) = .47, p < .05 (Ramos, G. M. González, P. González, Goldwaser, & Preble, 2002).
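The VRL–depression correlations reported throughout these studies are Pearson coefficients. As a reference for how r is computed from paired observations, a standard-formula sketch with toy data:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two paired samples,
    e.g., CES-D total scores and mean vocal response latencies."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()          # deviations from each mean
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())
```

Perfectly proportional pairs give r = 1.0; a positive r between latency and CES-D totals is what the finding above reports for English speakers.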

Study 2 compared VIDAS-1 and VIDAS-2. VIDAS-2 differed from VIDAS-1 in that new depression items (20) and response formats were designed for three subscales: subscale 1 (yes/no), subscale 2 (discrete choices, e.g., “All of the time”), and subscale 3 (open-ended response to questions, e.g., “How was your appetite?”). For the purpose of brevity, only VIDAS-2 subscale 2 data will be summarized. In total, 130 ES and 95 SS participants completed the BAS, BDI-II, CIDI, VIDAS-1, and VIDAS-2.

VIDAS-2 demonstrated strong inter-item consistency (ES a .90 and SS a .80) and positive correlations to the BDI-II (ES .65 and SS .53). VIDAS-2 demonstrated strong sensitivity (.82) and moderate specificity (.39) across both language groups. In choosing the gender of the digitized voice, 62% of ES and 83% of SS females and 84% of ES and 66% of SS males selected a female voice. Both language groups positively rated VIDAS-2 (scale 1 to 6), but ES had significantly higher levels of comfort (M = 4.4) than SS (M = 4.0). Using a free-form analysis of participant responses to selected subscale 2 items (first two, middle two, and last two), VIDAS-1 depression levels were significantly correlated to measures of voice intensity, such as spectral tilt (−.20) and speech tilt (−.34); thus, depressed individuals displayed less vocal energy. VIDAS-2 did not show significant relationships between voice properties and subscale 2 total scores (Ramos, Shriver, Reza, & González, 2003).

VIDAS-3 was a bilingual computerized speech recognition application for screening depression using two subscales based on CES-D and DSM-IV criteria. In this study, 128 English and 128 Spanish speakers completed a demographic interview, the BAS, the BDI-II, the CIDI-Short Form, and VIDAS-3. Recordings of participant repetitions of a phrase, “This computer responds to my voice,” were obtained before (pretest) and after completion (post-test) of the CES-D.

The results suggested that the VIDAS-3 subscales demonstrated high inter-item reliability (.81 to .92), strong criterion validity (.58 to .67), and adequate sensitivity (.64 to .87) and specificity (.44 to .71). Both language groups rated VIDAS-3 positively. Male and female participants most often selected a digitized female voice to present VIDAS-3. Long-term average speech spectrum (LTASS) measures (kurtosis, flatness, skewness, and centroid) that assess the tone and pitch of an individual’s vocal characteristics were used as the dependent variables in a multivariate analysis of variance (MANOVA).
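The four LTASS measures can be illustrated with a short sketch: average the magnitude spectra of overlapping frames, then treat the averaged spectrum as a distribution over frequency and take its centroid, skewness, and kurtosis, plus spectral flatness of the averaged power. This is a generic reconstruction, not the analysis software used in the studies; the frame length, hop, and normalization are assumptions:

```python
import numpy as np

def ltass_measures(signal, sr, frame=1024, hop=512):
    """Illustrative LTASS descriptors: average the magnitude spectra of
    overlapping windowed frames, then take moments of the result."""
    window = np.hanning(frame)
    spectra = []
    for start in range(0, len(signal) - frame + 1, hop):
        seg = signal[start:start + frame] * window
        spectra.append(np.abs(np.fft.rfft(seg)))
    ltass = np.mean(spectra, axis=0)            # long-term average spectrum
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)

    p = ltass / ltass.sum()                     # normalize to a distribution
    centroid = np.sum(freqs * p)
    var = np.sum(((freqs - centroid) ** 2) * p)
    skewness = np.sum(((freqs - centroid) ** 3) * p) / var ** 1.5
    kurtosis = np.sum(((freqs - centroid) ** 4) * p) / var ** 2
    # Spectral flatness: geometric over arithmetic mean of the power spectrum
    power = ltass ** 2
    flatness = np.exp(np.mean(np.log(power + 1e-12))) / np.mean(power)
    return {"centroid": centroid, "skewness": skewness,
            "kurtosis": kurtosis, "flatness": flatness}
```

Intuitively, a duller, less energetic voice pulls the centroid down and the flatness toward the tonal (near-zero) end of its 0-to-1 range, which is the direction of the group differences reported below.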

The results revealed a significant main effect for depression in the participants’ pretest recorded phrase across both language groups [Pillai’s Trace = .064, F(1, 236) = 6.81; p = .016]. Separate follow-up ANOVAs for each dependent variable showed significant differences in centroid, F(1, 241) = 9.55, p = .002, skewness, F(1, 241) = 5.11, p = .025, and kurtosis, F(1, 241) = 11.60, p = .001. Thus, depressed participants had less vocal energy than nondepressed participants. A MANOVA of LTASS measures revealed a significant main effect for depression in participants’ post-test recording [Pillai’s Trace = .076, F(6, 224) = 3.092; p = .006]. Separate ANOVAs for each dependent variable showed significant differences in flatness, F(1, 229) = 5.15, p = .024. Depressed participants’ vocal responses were flatter than those of nondepressed participants. As with the pretest results, there were significant differences for centroid, F(1, 229) = 10.67, p = .001, skewness, F(1, 229) = 7.18, p = .008, and kurtosis, F(1, 229) = 12.74, p < .0001.

A MANOVA of speech measures [tilt, voiced tilt, harmonic-to-noise ratio, low pitch-to-signal-to-noise ratio (LP-SNR), pitch amplitude, and signal frequency ratio (SFR)] revealed a significant main effect for depression in participants’ pretest recording [Pillai’s Trace = .08, F(7, 235) = 2.91; p = .006]. Separate ANOVAs for each dependent variable suggested significant differences in harmonic-to-noise ratio, F(1, 241) = 5.48, p = .02, and pitch amplitude, F(1, 241) = 7.40, p = .007. Thus, nondepressed participants displayed less noise in their vocal sounds, while depressed participants had more hoarse and breathy vocal responses.
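A common way to estimate a harmonic-to-noise ratio, sketched here only for illustration, is the normalized-autocorrelation approach (a simplified version of the method Boersma implemented in Praat): the autocorrelation peak r at the pitch lag estimates the harmonic fraction of the energy, giving HNR = 10·log10(r / (1 − r)). The chapter does not state how the VIDAS studies computed HNR, so the details below are assumptions:

```python
import numpy as np

def hnr_db(frame, sr, f0_min=75.0, f0_max=500.0):
    """Rough harmonics-to-noise ratio for one voiced frame, in dB, via the
    normalized autocorrelation peak within the plausible pitch-lag range."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                         # normalize so lag 0 equals 1
    lo, hi = int(sr / f0_max), int(sr / f0_min)
    r = np.clip(np.max(ac[lo:hi]), 1e-6, 1 - 1e-6)
    return 10 * np.log10(r / (1 - r))
```

A hoarse or breathy voice has a weaker autocorrelation peak at the pitch lag, hence a lower HNR, which matches the direction of the depressed-group finding above.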

Finally, to test for differences across time between depressed and nondepressed participants, dependent t-tests were conducted for the LTASS and speech measures that had shown significant differences between depressed and nondepressed participants. There were more significant differences across time for nondepressed than for depressed individuals. Specifically, for nondepressed individuals there were significant differences across time in skewness, t(171) = −2.43, p = .016, harmonic-to-noise ratio, t(171) = −4.26, p < .0001, and pitch amplitude, t(171) = −4.97, p < .0001. For depressed individuals, however, there was only a significant difference in pitch amplitude, t(59) = −2.95, p = .005. In sum, nondepressed participants displayed greater variability in their voice characteristics than depressed participants (González & Shriver, 2004).

Two studies evaluated VIDAS-4 for screening depression and anxiety symptoms in English and Spanish. Study 1 involved 48 ES and 45 SS; Study 2 involved 112 ES and 108 SS. Participants completed a demographic scale, BAS, BDI-II, BAI (Beck & Steer, 1993), CIDI-SF, and the VIDAS-4 depression and anxiety subscales. The studies examined the psychometric properties, comfort ratings, and selection of digitized gender for VIDAS-4. Study 2 examined the sensitivity and specificity of VIDAS-4 in detecting depression and anxiety levels among comorbid, depressed, anxious, and no-disorder groups. As with VIDAS-3, participant pre- and post-recordings were obtained. The studies found that the VIDAS-4 subscales generally demonstrated adequate inter-item reliability (.80–.94), convergent validity (.62–.89), sensitivity (.84–.90), and specificity (.44–.69). Most participants regarded VIDAS-4 as comfortable. Three of four participants selected a female digitized voice. Comorbid participants reported the most severe levels of depression or anxiety.

Participants’ pretest and post-test recorded phrases were analyzed using MANOVA to assess differences between the four diagnostic groups by language. Roy’s Largest Root was used instead of more traditional statistics such as Pillai’s Trace because Roy’s Largest Root is considered the best statistic when the difference among groups is concentrated on the first discriminant function (which was the case here; the test for homogeneity of covariance matrices was also positive). The results revealed a significant main effect for LTASS measures among the four diagnostic groups for English-speaking participants [Roy’s Largest Root = .11, F(3, 100) = 2.66; p = .037] and Spanish-speaking participants [Roy’s Largest Root = .2, F(3, 81) = 3.99; p = .005]. Follow-up analyses did not reveal any significant differences between the four diagnostic groups.

Speech measures assessing the amount of vocal variability and energy between the four diagnostic groups were used as dependent variables in a MANOVA. The results revealed a significant main effect among the four groups, regardless of language [Roy’s Largest Root = .094, F(3, 104) = 1.97; p = .009]. Follow-up analyses revealed no significant differences between the four diagnostic groups or for each language (Shriver, Ramos, & González, 2003).

VIDAS-5 is a computerized speech recognition application for screening depression and anxiety symptoms in English and Spanish. Study 1 was a pilot study of 50 ES and 47 SS. Study 2 involved 108 ES and 109 SS in diverse settings. Participants completed a demographic scale, BAS, BDI-II, BAI, CIDI-SF, and VIDAS-5 (aural or visual method). The audio portion of VIDAS was the same for the aural and visual methods; the visual method differed in adding visual cues, such as text and graphical messages, to present the items and to reply to a recognized spoken answer. As in VIDAS-3 and -4, pre- and post-test participant recordings were obtained.


Studies 1 and 2 examined the psychometric properties and participant comfort ratings for VIDAS-5. Study 2 also examined psychometric sensitivity and specificity, and participant selection of digitized gender for VIDAS. The studies found that VIDAS-5 generally demonstrated adequate inter-item reliability (.71–.91), convergent validity (.40–.86), sensitivity (.79–1.0), and specificity (.39–.44). Discriminant validity results demonstrated high overlap between the depression and anxiety scales (.31–.79). Several differences were observed in the psychometric properties of the VIDAS subscales by language and method, such that the DAS and the aural method displayed lower reliability and validity. Participants in both language groups rated the two VIDAS methods favorably, but the visual method received more positive reactions. Participant comfort ratings of the digitized voice demonstrated an interaction such that the female visual voice and the male aural voice were rated more favorably.

In a preliminary analysis of voice characteristics among depressed and nondepressed individuals, correlations were computed between BDI total scores, CES-D total scores, LTASS measures, and speech measures. Among English-speaking participants, four pretest LTASS measures were significantly correlated with the BDI, including harmonic-to-noise ratio (r = .314, p < .01), signal-to-noise ratio (r = −.267, p < .05), pitch amplitude (r = .262, p < .05), and signal frequency ratio (r = −.245, p < .05). None of the voice variables was significantly correlated with the BDI in Spanish or with the CES-D in English or Spanish. Among Spanish-speaking participants, two post-test speech measures were significantly correlated with the BDI: tilt (r = −.344, p < .01) and voiced tilt (r = −.348, p < .01). There were no significant correlations between any of the voice variables and the BDI in English or the CES-D in either language. Thus, depressed participants displayed less vocal intensity and variability (Gorzeman, Carter, & González, 2005).

Summary

The research presented here examined the integration of computerized speech recognition and digital voice analyses to assess depressed mood and symptoms in English and Spanish. The findings suggest that VIDAS is a feasible-to-administer, reliable, valid, well-received, and culturally sensitive application for screening depression in English- and Spanish-speaking populations. The relationship between participant gender and choice of digitized interviewer is complex, but most participants more often selected a female digitized voice. The preference for a female therapist has been documented in previous research (Kaplan, Becker, & Tenke, 1991). Most importantly, several voice characteristics, such as vocal energy and variability, demonstrate significant relationships to depression levels; however, the findings have not been consistent across the various VIDAS studies.

There are several shortcomings with the analysis of voice characteristics. The literature reports difficulty differentiating between labile and transitional emotional states such as sadness, boredom, and indifference (Scherer, 1986). Moreover, without a baseline measurement, it may be difficult to discriminate between a person who is depressed and one who normally speaks with a monotonic voice. Psychiatric comorbidity also distorts vocal markers for depression. For instance, depressed persons may display mixed voice characteristics that represent both psychomotor retardation and agitation (Mandal, Srivastava, & Singh, 1990). In addition, psychotropic medications alter the vocal expression of depression symptoms, such as changes in pitch and voice energy (Standke & Scherer, 1984). There are also gender, age, linguistic, and physical factors that interact with speech characteristics (Scherer, Banse, Wallbott, & Goldbeck, 1991). Differences between male and female voice ranges, age groups (children and geriatric populations), regional pronunciations, and speech impediments may require developing unique models of vocal emotional properties (Scherer, Ladd, & Silverman, 1984). Occasionally, the speech recognition system did not recognize vocal responses accurately. The speaker-independent speech recognition technology used in VIDAS is based on syntax and a phonetic structure. Such systems have limitations in recognizing variations in vocal utterances. Differences between the respondent’s pronunciation and the system’s phonetic structure may significantly diminish recognition and affect the interaction between computer and user (Noyes & Frankish, 1994).

Obviously, the limitations of speech recognition and voice analyses need to be addressed. Overall, significant progress has been made toward developing a tool to increase the early and accurate detection of depression. VIDAS has been implemented in high-volume health care settings that serve diverse patient populations but lack bilingual personnel. VIDAS quickly and unobtrusively collects participant data, scores the data, and generates a report to inform health care staff of the participant’s mood and symptoms.


As a result, VIDAS assesses many individuals who are unlikely to seek out mental health services on their own. However, further study is needed to enhance and refine the VIDAS interview as a viable alternative method of assessment.

By and large, past research has focused on standard voice variables, such as pitch, tempo, and speech rate. Acoustic variables that measure fine-grained variations in voice signals, such as shimmer (modulation in amplitude) and jitter (irregularity in vocal vibration), offer new insights into the relationship between mood and voice characteristics (Bachorowski & Owren, 1995). Advancements in experimental methodologies (structured, free-form, and pre-post designs) and digital voice analysis (short-time, long-time, and spectral analyses) can overcome the limitations in the evaluation of speech variables associated with the quality of voice sampling (Murray & Arnott, 1993). State-of-the-art voice analysis software packages that can detect subtle changes in voice properties will aid in evaluating vocal emotion. These new developments offer possibilities for developing a reliable and valid English and Spanish language voice analysis that can accurately discern between depression and nondepression.
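Once cycle periods and peak amplitudes have been extracted from the waveform (the pitch-marking step, which is the hard part, is assumed done elsewhere), jitter and shimmer as defined above reduce to simple formulas. A minimal sketch of the common "local" variants, expressed as percentages:

```python
import numpy as np

def local_jitter(periods):
    """Local jitter (%): mean absolute difference between consecutive
    glottal cycle periods, relative to the mean period."""
    periods = np.asarray(periods, dtype=float)
    return 100 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Local shimmer (%): the same cycle-to-cycle measure applied to
    per-cycle peak amplitudes instead of periods."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return 100 * np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
```

A perfectly regular voice yields 0% on both measures; irregular vocal-fold vibration drives jitter up, and amplitude modulation drives shimmer up, which is the fine-grained variation these authors argue standard pitch and rate variables miss.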

Acknowledgments. The primary author thanks Colby Carter, Gali Goldwaser, Patricia Gonzalez, Paul Hernandez, Jennifer Reza, Carlos Rodriguez, and Chris Shriver for their efforts in the data collection and data analyses. MBRS grant number MS4567 from the National Institute of General Medical Sciences (NIGMS) and the LRP program of the National Institutes of Health (NIH) supported the development of this manuscript.


References

Alpert, M., Pouget, E. R., & Silva, R. R. (2000). Reflections of depression in acoustic measures of the patient’s speech. Journal of Affective Disorders, 66, 59–69.

Avaaz Innovations Inc. (1998). Interactive Voice Analysis System™ (IVANS) user’s guide. London: Avaaz Innovations.

Bachorowski, J. A., & Owren, M. (1995). Vocal expression of emotion: Acoustic properties of speech are associated with emotional intensity and context. Psychological Science, 6(4), 219–224.

Beck, A. T., & Steer, R. A. (1987). Manual for the revised Beck Depression Inventory. San Antonio, TX: Psychological Corporation.

Beck, A. T., & Steer, R. A. (1993). Manual for the revised Beck Anxiety Inventory. San Antonio, TX: Psychological Corporation.

Bernal, M. A., & Castro, F. G. (1994). Are clinical psychologists prepared for service and research with ethnic minorities? Report of a decade of progress. American Psychologist, 49, 797–808.

Broadhead, W., Clapp-Channing, N., Finch, J., & Copeland, J. (1989). Effects of medical illness and somatic symptoms on treatment of depression in a family medicine residency practice. General Hospital Psychiatry, 11(3), 194–200.

Butcher, J. (1987). Computerized clinical and personality assessment using the MMPI. In J. Butcher (Ed.), Computerized psychological assessment: A practitioner’s guide (pp. 161–197). New York: Basic Books.

Carr, A., Ghosh, A., & Ancill, R. (1983). Can a computer take a psychiatric history? Psychological Medicine, 13(1), 151–158.

Center for Mental Health Services. (2000). Cultural competence standards in managed care mental health services: Four underserved/underrepresented racial/ethnic groups. Retrieved July 26, 2001, from http://www.mentalhealth.org/publications/allpubs/SMA00–3457

Climent, C., Plutchik, R., & Estrada, H. (1975). A comparison of traditional and symptom-



CHAPTER 13

The Context of Voice and Emotion: A Voice-Over Artist’s Perspective

Kathleen Antonia Tarr

Abstract

This chapter reflects the perspective of an actor on the creation and understanding of believable vocalized emotions.


Introduction

Radio ads and commentary. Audio books. Phone calls and voice mail. There are very few other venues that are home to voice and emotion without visual context. However, whether or not the audience sees the speaker or another visual cue, the body is key to emotional expression. As a voice-over artist, I cannot act the part vocally if I am physically disconnected from the emotion. If I am disconnected, the audience will be disconnected, too. If I am on stage, any personal disconnection from the emotion I intend to portray also disconnects the audience. The same is true of film, and the same is true whether one can see me or only hear my voice.

The Sense, the Context, and the Know-How

In addition to my emotional intention, the effort to produce emotions in voice requires a physical context. It is not enough to be a snapshot: lips upturned, a furrowed brow, a feeling of disgust. Context requires incorporation of the moments before and after, the bookends. Context in written form is how one knows the difference between tear (to rip) and tear (to cry). In visual form, an onlooker can look at a smile and try to decipher the supporting emotion of the smiler, but can fail to detect whether this person is truly happy, or perhaps instead complicit, if she doesn’t know the situation that inspired the moment. Hence, one crucial key to understanding emotion in voice is understanding the context.

Take, for example, the quick image of me in a Sunsweet prunes (rather, “dried plums”) commercial (still airing as of this printing: Editor). I look down at a dried plum held between my fingers and say, “Wow.” That’s it.

After this commercial aired during the 2006 Golden Globe Awards, I received several phone calls asking about the ad, always with the tag, “You sure were enjoying that prune!”

Was I? The context suggests so. Images of others bookended mine, all with more dialog, all truly enjoying Sunsweet dried plums. Although I am not shown biting into the prune, there seems to be something in my mouth, and with prune in hand, the “wow” is correctly interpreted as referring to that item. My nose isn’t scrunched up, my eyes and brow are raised, and I am looking at the prune. Context. I could have meant, “Wow! This is the most distasteful thing I’ve eaten all day!” but in addition to the context of my own expression and others’ in the commercial, audiences prejudge correctly that a company is not going to include the image of someone who hates its product in a promotional ad, and thus correctly conclude that I like the taste. It is interesting that of those who watched the Golden Globes (wherein there was an overflow of enjoyment and “Wow!”), many made the observation, “You sure were enjoying that prune!” During previous airings, say, during The Amazing Race (a show with, yes, enjoyment, but also quite a bit of “ugh”), I did not receive characterizations of my dried plums performance that stressed the depth of my delight.

Contrast two other wows. A friend of mine once at dinner accidentally put an entire chili pepper in her mouth, thinking it was a sweet green pepper, and kept repeating “Wow” as she brushed her tongue with her napkin and followed up with a gallon of water. My sister, whom I had forewarned about the foulness of fermenting yeast, put Vegemite on toast and took a bite. Her wows were interspersed with chugs of guava juice. In fact, both my friend and my sister were smiling during their ordeals. One snapshot of the action, and they could easily have taken my spot in the Sunsweet dried plums commercial and fooled an audience into thinking they were enjoying their moments. But they weren’t. Quite the opposite. Why would the snapshot work?

Well, all three of us were sharing the feeling of “That’s incredible.” I was actually enjoying the flavor of my item, but my sense at that moment regarded the infusion of orange into the prune. Surprising! Delicious! My friend was also surprised but taken aback by “Pain! Help!” My sister’s sense was dominated by “Yuck! My God, this is the most disgusting thing I have ever tasted!” And because they both have excellent senses of humor: smiles, raised eyes, and raised brows.

More of Context

Because I know my friend and sister, and because I was there for their entire taste-disaster experiences, I knew that the smiles under the wows were not expressing happiness or enjoyment about the flavors consuming their consumption. Someone on the scene with less experience or detail about the players and the stories might not have figured it out as quickly. They might have wondered why these two women were having so much dramatic fun. As they watched, they might have figured out that the moments actually involved quite a bit of displeasure, but only in the context of the players, two people who see comedy in almost everything.

It is easy to conclude, then, that understanding the emotion of a particular vocal moment is not only about external context but also about understanding one’s own biases, expectations, and interpretive patterns. Biases can be benign or even insightful, as above. But here is also where interpretations of emotions and voice can take sickening turns.

Racism, sexism, homophobia, and other mental delusions all carry with them fallacies of interpretation of emotions and of the voice. A racist with a bias against black people may interpret the same phrase as a casual statement when uttered by a white person but as an assertion of superiority when uttered by an African American: “I disagree. I think all people are equal.” Someone without racist delusions can hear that statement from someone of another race and not usually be threatened.

Someone with racist delusions most often cannot. It depends upon the context one projects. If it is simply an intellectual discussion, and both parties have a similar understanding, disagreements are not typically threatening. If one party thinks that disagreement means the other doesn’t know her “place” or thinks he is smarter (emotionally, that the other feels confident when she should feel subservient or when he should feel deferential), then problems arise.

I used to play tackle football with men who had been high school and college team standouts. Notoriously, a couple of them who had never played with or against me taunted me in the days before a game: “You’re not even going to show up. Women can’t play football. What are you? The cheerleader?” “I’ll see you on Sunday,” was my common reply. “I’ll see you on Sunday” infuriated many of them. They believed I was saying that they were weak, that they weren’t really men, that they might as well play the cheerleader. What did I mean? That none of their posturing meant anything to me. Skill would be decided on the field, not by argument. Did I think I was the better player? Hell yes! But did I think they were weak, unmanly, cheerleader types? No. I wasn’t thinking of them at all. I had no feeling for them at all. But for what they believed I thought and felt, I was threatened on several occasions with violence.

One’s own bias and prejudgment are the great interpreters of voice and emotions. They are as important as, if not more important than, any other measurement. Compare the “problem” with people affected by physical disorders that impact vocal tone or physicality, including facial expression. The difficulties they face with connecting the emotion to the voice are in the interpretation, not in the manifestation. (More on this subject is provided in Chapters 8, 16, and 17 in Volume 2 of this series: Editor.)

The speaker has all of the emotions, speaks with all of the emotions supporting the speech. The failure in understanding the emotion is in the interpretation of the voice. Someone more familiar with the speaker will do a better job of understanding. It is like any language. Someone without fluency will not get it as easily.

The search for the perfect robotic display of vocal emotion is the search for the language of the masses. Attempts to create authentic computerized voices are not in fact efforts to make the speech more decipherable but instead more comfortable for the listener, somehow more sincere, warmer. If people took a moment, they would realize that the computerized voice pleasantly speaking to them has no feeling for them whatsoever, and in the end, it doesn’t really matter whether it is monotonous or “kind.” In fact, much of the fury over the delighted voices of multiprompters seems to be the result of frustration over not being acknowledged by the voice on the other end of the phone, a voice that sounds warm, more human, less monotone and computerized.

I’m on the phone having a fit because I’m on my 10th prompt transfer and “someone” is having a lovely day, telling me, “I’m sorry” yet again, and with an audible smile! I’ve been less put out by monotones prompting me to enter my account number over and over again. I can hear that they don’t care. Hey, I admit it. I’m human, biased and deceived and led places by voices without real feeling, like a lamb to the slaughter, which brings me back to my job.

Tricks

There are certain tricks I use to help audiences correctly interpret the emotions I intend to project. When I play restrained anger, I might smile with my mouth, but I am predatory with my eyes: scrunched, pinpointing attack.

When I play restrained derangement, I smile with my mouth, and my eyes remain neutral. The words then flow from these places. In anger, they do not leave the throat easily; in derangement, they flow smoothly. My face might transform through a thousand compositions, but because the voice and eyes are specific, the emotion is fairly transparent. In fact, acting any part, voice only or not, I rely primarily upon my throat and eyes. My face follows, but its expressions are only important insofar as they reflect the secondary emotion, i.e., the one my character is hoping to project in the scene. What the character is actually feeling, the emotion I primarily want understood, is not found there.

There is a scene in Kill Bill, Vol. 2 in which Uma Thurman’s character is going to be buried alive. The camera gives the audience a shot of her face as the gravedigger exclaims, “Look at those eyes. This bitch is furious!” She is still above ground during this scene, not yet confined to her casket six feet under. The shot is only of her eyes, partial forehead, and bridge of nose. She doesn’t look furious to me. She looks frightened. When her casket is being closed, the camera reveals a cut of the same sequence. The lighting is the same. The point of view is the same. She looks frightened. It makes sense. The earlier usage is obviously the same take, nonsensical above ground, used only by virtue of the editor’s effort to substitute for an otherwise unavailable shot. Had she spoken and sounded furious in the earlier shot, I might have thought the character was deranged. When the eyes and voice don’t match: insanity. Open the eyes wide in utter surprise and smile. No, I mean it. Go look at yourself in the mirror. Say “Hi there” or something. Next Bride of Chucky, huh?

But here again, it is my projection, my bias, that does most of the emotional interpretation. A wide-eyed smile may just mean someone had really bad plastic surgery. I’m fairly astute, so usually I can detect bad surgery, and usually I simply know what someone is feeling, even if the emotion is only found in the voice and eyes. It’s why people think I listen well. I hear more than what is being said.

I can reflect more than what is said. But as I wrote, it is like any language, and I just happen to be very good at language.

Ability, Bias, and Observations

If one is unable to learn a new language, unable to precisely imitate tones and inflections and body movements, one undoubtedly has limitations in the ability to interpret emotions. If one has no baseline understanding of how an individual or group expresses itself, then the context is askew.

In language, I best learn new words by closely watching how sounds are used. Watching how they are used. My bias: if someone walks into a room and speaks, she is probably going to say “Hello” or some similar greeting. Next, she will probably be asked how she is, and she will answer.

Observation: if she answers with the equivalent of “I’m well,” her voice will rise and she will smile. If she is not so well, she will support the words with emotional consistency, maybe heavy, slumping shoulders, downturned eyes, and a cringe for “awful.” One can interpret more complicated responses accordingly. When I learn a language, I come to understand not only the tone, and thus the vocabulary and grammar, but also the presentation. More importantly, I learn to mimic the presentation. The context. I very rarely do a word-for-word translation, even in my native American English, and likely few of us did when we were first learning to speak. It is the sound of the whole thought or phrase, and the trumpeter’s performance, that teaches the voice associated with the feeling, and that becomes the music to play.


Job

What am I, a poet? Back to my job as a voice-over artist. I passed over it fairly quickly, but did you know that you can hear a smile? “Smile” is a frequent instruction in the recording studio, particularly for commercials, jingles, or spoken word. Is it actually the case that the smile itself changes the voice? Very little. It is the emotion behind the smile that does most of the work. It is what causes you to smile that changes the tone of voice. I can smile and sound like I am about to enter a boxing ring. But if I think of something that makes me happy (“be happy” probably being the more precise direction than “smile”), I can then sing or talk and sell to an unsuspecting audience the notion that the product I’m singing or talking about is making me happy.

I must say that the best voice-over audition I ever had, one for which I did not book the job, was for one of George Lucas’s animated projects. I auditioned for the role of an intelligent older woman whose temper was frightening and mystical. Let’s suppose the dialog was:

That’s fine, but if you ever again persist in disregarding my direct order, I shall see to it that you fall into the deepest trenches of Hell and burn for an eternity without one hope of escape that is not suffocated by the stench of your damnation.

Wow. That’s pretty wicked. I’m sure George Lucas wrote a kinder, gentler script . . . but let’s get on with it.

A good performance has contrasts, conflict, and intrigue. “That’s fine.” Spoken with a gentle, flowing vocal projection, intended to give the audience a (false) sense of safety, ease, and relaxation. Eyes soft. “But if you ever again persist” begins with a steadily widening eye squint and a steady increase in vocal volume and tempo, interlaced with normal spoken variants, that peaks at stench and rumbles to a stop at damnation.

For the age effect, the baseline voice grumbled deeply in my throat. The result is a line that reads as a growl with small barks interlaced, building until one major threatening bark that then falls to a growl to conclude the thought. Maintaining some restraint in vocal projection, even at the peak, builds the threat by making the speaker seem on the verge of full-out attack at any moment. This means, for the audience, that as scary as the character may sound, the situation can get scarier. Deepening the tone without altering the pitch builds the threat similarly. The audience begins relieved, then becomes tense, then ends assured that the immediate threat has passed but remains on edge that danger still lingers. Mission accomplished! . . . except for the booking, but honestly, just because a performer doesn’t book a job, it doesn’t mean the audition wasn’t stupendous. Isn’t that right, George?

Even without hearing my voice, you have an imagining of what I would sound like asking, “Isn’t that right, George?,” and of what I would look like while I speak. Perhaps you envision me laughing or smiling, at least in the eyes. You have given me a voice without hearing me or seeing me, and the emotion you attribute is the real measure. Perhaps even more interesting is the emotion with which you respond to the projection.

It’s the feeling that counts.


