
Recognizing emotional speech in Persian: A validated database of Persian emotional speech (Persian ESD)

Niloofar Keshtiari & Michael Kuhlmann & Moharram Eslami & Gisela Klann-Delius

© Psychonomic Society, Inc. 2014

Abstract Research on emotional speech often requires valid stimuli for assessing perceived emotion through prosody and lexical content. To date, no comprehensive emotional speech database for Persian is officially available. The present article reports the process of designing, compiling, and evaluating a comprehensive emotional speech database for colloquial Persian. The database contains a set of 90 validated novel Persian sentences classified in five basic emotional categories (anger, disgust, fear, happiness, and sadness), as well as a neutral category. These sentences were validated in two experiments by a group of 1,126 native Persian speakers. The sentences were articulated by two native Persian speakers (one male, one female) in three conditions: (1) congruent (emotional lexical content articulated in a congruent emotional voice), (2) incongruent (neutral sentences articulated in an emotional voice), and (3) baseline (all emotional and neutral sentences articulated in a neutral voice). The speech materials comprise about 470 sentences. The validity of the database was evaluated by a group of 34 native speakers in a perception test. Utterances recognized better than five times chance performance (71.4 %) were regarded as valid portrayals of the target emotions. Acoustic analysis of the valid emotional utterances revealed differences in pitch, intensity, and duration, attributes that may help listeners to correctly classify the intended emotion. The database is designed to be used as a reliable material source (for both text and speech) in future cross-cultural or cross-linguistic studies of emotional speech, and it is available for academic research purposes free of charge. To access the database, please contact the first author.

Keywords Emotion recognition · Speech · Emotional speech database · Prosody · Persian

Copyright 2010–2012 Niloofar Keshtiari. All rights reserved. This database, despite being available to researchers, is subject to copyright law. Any unauthorized use, copying, or distribution of material contained in the database without written permission from the copyright holder will lead to copyright infringement with possible ensuing litigation. Directive 96/9/EC of the European Parliament and the Council of March 11 (1996) describes the legal protection of databases. Published work that refers to the Persian Emotional Speech Database (Persian ESD) should cite this article.

Electronic supplementary material The online version of this article (doi:10.3758/s13428-014-0467-x) contains supplementary material, which is available to authorized users.

N. Keshtiari (*), G. Klann-Delius: Cluster of Excellence Languages of Emotion, Freie Universität Berlin, Habelschwerdter Allee 45, 14195 Berlin, Germany; e-mail: [email protected]

M. Kuhlmann: Department of General Psychology and Cognitive Neuroscience, Freie Universität Berlin, Berlin, Germany

M. Eslami: Department of Persian Language and Literature, Zanjan University, Zanjan, Iran

Communicating and understanding emotions is crucial for human social interaction. Emotional prosody (i.e., modulations in the acoustic parameters of speech, such as intensity, rate, and pitch) encompasses non-verbal aspects of human language and provides a rich source of information about a speaker's emotions and social intentions (Banse & Scherer, 1996; Wilson & Wharton, 2006). Aside from prosody, emotions are also conveyed verbally through the lexical content of the spoken utterance. These two channels of information (prosody and lexical cues) are inextricably linked and may reinforce or contradict each other (e.g., by conveying sarcasm or irony; Pell & Kotz, 2011). Therefore, to interpret the intended meaning or the emotions and attitudes of a speaker, listeners should effectively monitor both prosodic and lexical information (Tanenhaus & Brown-Schmidt, 2007). To date, only scarce empirical data are available on the extent to which listeners harness prosody versus lexical cues, or combine both channels of information, to activate and retrieve emotional meanings during ongoing speech processing (Pell, Jaywant, Monetta, & Kotz, 2011). The existing literature on emotional speech has usually focused on prosody and disregarded the complex interaction of prosody and lexical content (Ben-David, van Lieshout, & Leszcz, 2011; Pell et al., 2011).

Considering the interaction of lexicon and prosody, emotional speech could be conveyed through the following three conditions: (a) emotional lexical content articulated in congruent emotional prosody, (b) emotional and neutral lexical content articulated in neutral prosody, and (c) neutral lexical content conveyed in emotional prosody. The first two conditions are reported as being useful for functional neuroimaging studies (see, e.g., Mitchell, Elliott, Barry, Cruttenden, & Woodruff, 2004), whereas the third condition is useful for emotion perception and identification tasks (Russ, Gur, & Bilker, 2008). Therefore, conducting a study including any of these conditions requires the development of a validated database of emotional speech (Russ et al., 2008).

This article reports the process involved in the design, creation, and validation of a Persian database of acted emotional speech. Persian is an Indo-European language (Anvari & Givi, 1996), spoken by almost 110 million people around the world (Sims-Williams & Bailey, 2002). The dialect examined in this database is Modern Conversational Persian, as spoken in Tehran. This database contains a validated set of 90 sentences articulated in five basic emotions (anger, disgust, fear, happiness, and sadness) and a non-emotional category (neutral). The authors covered all three of the aforementioned conditions in this database in order to create a comprehensive language resource that enables researchers to separately identify the impact of prosody and lexical content, as well as their interaction, on the recognition of emotional speech.

Emotional speech in Persian has been studied in limited ways by researchers. The existing research has mainly focused on the recognition of emotional prosody in speech-processing systems. For instance, Gharavian and Ahadi (2009) documented the substantial changes in speech parameters, such as pitch, caused by emotional prosody. In a recent study, Gharavian, Sheikhan, Nazerieh, and Garoucy (2012) studied the effect of using a rich set of features, such as pitch frequency and energy, to improve the performance of speech emotion recognition systems. Despite extant research on speech processing in Persian, no well-controlled stimulus database of emotional speech is available to researchers. The studies referenced earlier recruited native speakers with no expertise in acting to articulate neutral or emotional lexical content in the intended emotional prosody. Neither the lexical nor the vocal stimuli were validated in these studies. Furthermore, the previous studies investigated only a limited number of emotions (anger, happiness, and sadness; e.g., Gharavian & Ahadi, 2009; Gharavian & Sheikhan, 2010).

Pell (2001) suggests that emotional prosody can be affected by the linguistic features of a language. Moreover, as Liu and Pell (2012) have argued, it is essential for researchers to generate valid emotional stimuli that are suitable for the linguistic background of the participants of a study. Considering this issue, recordings of all the basic emotions are required for a comprehensive study of emotional speech in the language under study. Therefore, by generating a robust set of validated stimuli, the authors aimed to fill this gap for the Persian language. Such stimuli can minimize the influence of individual bias and avoid subjectivity in stimulus selection in future studies of Persian emotional speech.

Within a discrete emotion framework, anger, disgust, fear, happiness, pleasant surprise, and sadness are frequently regarded as basic emotions, each having a distinct biological basis and qualities that are universally shared across cultures and languages (Ekman, 1999). When designing the database, the authors considered a set of basic emotions known as "the big six" (Cowie & Cornelius, 2003) and added a neutral mode. However, pleasant surprise was later omitted from the list of target emotions, due to its close resemblance to happiness (for more details, please see the Lexical Content Validation section).

To date, numerous databases of vocal expressions of the basic emotions have been established in several languages, including English (Cowie & Cornelius, 2003; Petrushin, 1999), German (Burkhardt, Paeschke, Rolfes, Sendlmeier, & Weiss, 2005), Chinese (Liu & Pell, 2012; Yu, Chang, Xu, & Shum, 2001), Japanese (Niimi, Kasamatsu, Nishinoto, & Araki, 2001), and Russian (Makarova & Petrushin, 2002), as well as many other languages (for reviews, see Douglas-Cowie, Campbell, Cowie, & Roach, 2003; Juslin & Laukka, 2003; Ververidis & Kotropoulos, 2003). However, this has not been achieved in Persian. Therefore, the present database provides a useful language resource for conducting basic research on a range of vocal emotions in Persian, as well as for conducting cross-cultural/cross-linguistic studies of vocal emotion communication. Moreover, the stimuli in this database can be used as a reliable language resource (lexical and vocal) for the assessment and rehabilitation of communication skills in patients with brain injuries.

One of the greatest challenges in emotion and speech research is obtaining authentic data. Researchers have developed a number of strategies to obtain recordings of emotional speech, each with its own merits and shortcomings (Campbell, 2000); for a review of the strategies used for the various databases, see Douglas-Cowie et al. (2003). Some researchers have employed "spontaneous emotional speech" to gain the greatest authenticity in conveying the emotions (see, e.g., Roach, 2000; Roach, Stibbard, Osborne, Arnfield, & Setter, 1998; Scherer, Ladd, & Silverman, 1984). Eliciting authentic emotions from speakers in the laboratory is another approach used by researchers (see Gerrards-Hesse, Spies, & Hesse, 1994; Johnstone & Scherer, 1999). Although these two methods generate naturalistic emotional expressions, they are reported to be restricted to non-full-blown states and to have a very low level of agreement with the decoders' judgments on a specific emotion (Johnson, Emde, Scherer, & Klinnert, 1986; Pakosz, 1983). These methods also suffer from various technical and social limitations that hinder their usefulness in a systematic study of emotions (for a detailed explanation, see Ekman, Friesen, & O'Sullivan, 1988; Ververidis & Kotropoulos, 2003). One of the oldest, and still most frequently used, approaches for obtaining emotional speech data consists of "acted emotions" (Banse & Scherer, 1996). The advantages of this approach are control over the verbal and prosodic content (i.e., all of the intended emotional categories can be produced using the same lexical content), the possibility of employing a number of speakers to utter the same set of verbal content in all intended emotions, and the production of high-quality recordings in an anechoic chamber. This production strategy allows direct comparison of the acoustic and prosodic realizations of the various intended emotions portrayed. Critics of the acted emotions approach question the authenticity of the actors' portrayals. However, this drawback can be minimized by employing the "Stanislavski method"1 (Banse & Scherer, 1996). As a result, the authors employed the acted emotions approach and the Stanislavski method in order to achieve vocal emotional portrayals that were as authentic as possible for the stimuli of the present database.

The present article consists of six different parts. The first part describes the construction of the lexical content, whereas the second part concerns its validation. In the third part, we measured the emotional intensity of the lexical content. The fourth part describes the elicitation and recording procedure of the vocal portrayals. The fifth part concerns the validation of the vocal material, and the sixth part is a summary and general discussion of all previous parts.

This study started in October 2010 and finished in October 2012. As discussed below, all procedures were conducted entirely in Persian at each stage of the investigation.

Part One: Lexical content construction

The existing literature on emotional speech has usually emphasized the role of prosody and neglected the role of lexical content (Ben-David et al., 2011). In studying emotional speech, various researchers have often prepared their own, study-specific lists of sentences without validating the emotional lexical content (see, e.g., Luo, Fu, & Galvin, 2007; Maurage, Joassin, Philippot, & Campanella, 2007). However, to conduct a study on emotional speech, a set of validated sentences is required to separate the impact of lexical content from prosody on the processing of emotional speech (Ben-David et al., 2011). Therefore, in this study three experiments were conducted to generate a set of validated sentences (lexical material) in colloquial Persian.

In all three experiments, participants were recruited worldwide through online advertisements and referrals from other participants, and they attended the tests on a voluntary basis without financial compensation.2 The criteria for participant recruitment were the same in all experiments: Participants were all native speakers of Persian, and this was their working and everyday language. In cases in which participants knew other languages besides Persian, we asked further questions to make sure that Persian was their dominant language. Participants did not suffer from any psychopathological condition or neurological illness, had no head trauma, and took no psychoactive medication.

As the first step, a set of validated sentences that convey a specific emotional content or a neutral state was produced. A total of 252 declarative sentences (36 for each of the target emotions, plus 36 sentences that were intended to be neutral) were created using the same simple Persian grammatical structure: subject + object + prepositional phrase + verb. Female and male Persian proper names were used as the subjects of the sentences. To avoid gender effects, the authors used an equal number of male and female proper names. Each of the sentences describes a scenario that is often associated with one of the target emotions. See Example 1 for a sample sentence.

In developing an emotional speech database, it is desirable to match the lexical content for word frequency and phonetic neighborhood density (i.e., the number of words one can obtain by replacing one letter with another within a single word). However, this was not possible because the existing Persian corpora (see Ghayoomi, Momtazi, & Bijankhan, 2010, for a review) did not fully cover the domain of colloquial speech when the sentences were compiled. Nevertheless, in order to produce a set of stimuli that was as authentic as possible, everyday Persian words were chosen, and all of the sentences were checked for naturalness and fluency by four native speakers of Persian (two psychologists, two linguists).

However, to develop a reliable set of sentences conveying a particular emotion or a neutral state, a validation procedure was needed.

1 The Stanislavski method is a progression of techniques (e.g., imagination and other mental or muscular techniques) used to train actors to experience a state similar to the intended emotion. This method, which is based on the concept of emotional memory, helps actors to draw on believable emotions in their performances (O'Brien, 2011).

2 In order to prevent the same participant from taking the test twice, the IP address of each participant's computer was checked.


Part Two: Lexical content validation

To make sure that each generated sentence only conveys one specific emotion or no emotion at all (neutral), we performed two perceptual studies to validate the 252 sentences.

In doing so, we carefully considered the ecological validity of the sentences (Schmuckler, 2001). Furthermore, it was important to obtain recognition accuracy data for the sentences from a sample that is more representative of the general population than only student volunteers. Therefore, we recruited a large group of participants to serve as "decoders."

The method employed to analyze emotion recognition warrants special consideration. On the one hand, forcing participants to choose an option from a short list of emotions may inflate agreement scores and produce artifacts (Russell, 1994). On the other hand, providing participants with more options or allowing them to label the emotions freely would result in very high variability (Banse & Scherer, 1996; Russell, 1994). However, if participants are provided with the response option "none of the above," together with a discrete number of emotion choices, some of the artifacts can be avoided (Frank & Stennett, 2001). Therefore, in the following experiments, we used nominal scales including the intended emotions, and we added the option "none of the above."

Experiment 1

Method

Participants A total of 1,126 individuals with no training in this type of task were recruited as participants. The data for 132 participants were not included in the analysis due to their excessively high error rates (i.e., above 25 %). Thus, the data from 994 participants (486 female, 508 male) were analyzed. The mean age of the participants was 32.6 years (SD = 13.9), ranging from 18 to 65 years. Participants were roughly equivalent in years of formal education (14.6 ± 1.8).

Materials and procedure The 252 sentences were presented to participants in an online questionnaire. Participants were asked to complete the survey individually in a quiet environment. They were instructed to read each sentence, imagine the scenario explained, and, as quickly as possible, select the emotion that best matches the scenario explained in the sentence. Responses were on an eight-point nominal scale corresponding to anger, disgust, fear, happiness, pleasant surprise, sadness, neutral, and "none of the above." Based on this eight-choice paradigm (six emotions, neutral, and none of the above), chance level was 12.5 %.

To avoid effects of presentation sequence, the sentences were presented in three blocks in a fully randomized design. Following Ben-David et al. (2011), seven control sentences (one in each emotional category) were presented in each questionnaire (in a randomized order) to control for inconsistent responses. These seven control sentences were repetitions of seven previous items in exactly the same wording. Participants who did not mark the repeated trials of these control sentences consistently (i.e., a marking difference of more than three) were removed from the analysis. A similar method has been used by Ben-David et al. to control for inconsistency of responses.

Results and discussion

Previous work on emotion recognition (Scherer, Banse, Wallbott, & Goldbeck, 1991) suggests that emotional content is recognized at approximately four times chance performance. Accordingly, to develop the best possible exemplars, a minimum of five times chance performance in the eight-choice emotion recognition task (i.e., 62.5 %) was set as the cutoff level in this study. A set of 102 sentences from the emotional categories (anger: 18; disgust: 23; fear: 17; happiness: 21; pleasant surprise: 0; sadness: 23) and another 21 sentences from the neutral category fulfilled the quality criteria. However, pleasant surprise was recognized most poorly overall (ranging between 16.9 % and 50.5 %; mean = 39.4 %), and no token met the quality criterion (i.e., five times chance performance in the eight-choice emotion recognition task, 62.5 %). The data revealed that there was a high probability (two to four times chance level, i.e., 25 % to 50 %) that pleasant surprise was considered to be analogous to happiness. A similar confusion between happiness and pleasant surprise has been reported in the literature (Paulmann, Pell, & Kotz, 2008; Pell, 2002). In addition, Wallbott and Scherer (1986) found relative difficulties in distinguishing the two emotions in a study of emotional vocal portrayals. These two points suggest an overlap between happiness and pleasant surprise. Some researchers believe that misclassifications of these two categories occur because of their similar valence (Scherer, 1986). This confusion may be more pronounced when the linguistic context is ambiguous as to the expected emotional interpretation (e.g., happiness and pleasant surprise).
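As a quick check of the cutoff arithmetic used here and again in Experiment 2, the snippet below recomputes the chance level and the five-times-chance criterion for an n-choice task; it is an illustrative Python calculation, not part of the original study.

```python
# Illustrative check of the cutoff criterion (not from the original analysis):
# chance level for an n-choice forced-choice task and the 5x-chance cutoff.

def chance_and_cutoff(n_choices, multiple=5):
    chance = 100.0 / n_choices          # chance level in percent
    return chance, multiple * chance    # cutoff used to keep an item

# Experiment 1: six emotions + neutral + "none of the above" = 8 choices
print(chance_and_cutoff(8))   # (12.5, 62.5)

# Experiment 2 and the perception test: 7 choices after dropping pleasant surprise
print(chance_and_cutoff(7))   # (approx. 14.3, approx. 71.4)
```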

Additionally, analysis of the participants' comments showed that having proper names as the subjects of the sentences caused a bias. Therefore, a second experiment was performed to remove this bias effect.

Experiment 2

We conducted another experiment to avoid the bias effects previously mentioned. In doing so, proper names were replaced with profession titles such as "Ms. Teacher" and "Mr. Farmer." The English translations of these "profession titles" could be construed as common English surnames. However, in Persian they imply only the person's gender and profession; for example, "Mr. Farmer" should be understood to mean only a male farmer. Unlike in English, in Persian it is not typical for profession titles to also exist as surnames. Using profession titles instead of proper names is also beneficial in that it indirectly provides the reader/listener (the participants of the study) with extra contextual information that more easily evokes a mental image of the scenarios. Consider the following two sentences:

(a) Sara lost sight in both her eyes forever.
(b) Ms. Tailor3 lost sight in both her eyes forever.

Once participants have read that "a lady tailor has lost sight in both her eyes forever," it is very likely that they will imagine that this lady is in great trouble, since she can no longer work as a tailor. Compared with the first sentence (Sara lost sight in both her eyes forever), in which the reader has no information about Sara, the additional information provided in sentence (b) can help elicit the intended emotion (sadness, in this example) more easily.

To avoid gender effects, the authors used an equal number of male and female profession titles as the subjects of the sentences. Then, a second experiment was conducted with the aforementioned modifications (i.e., the omission of pleasant surprise from the list of intended emotions, and the replacement of proper names with profession titles).

Method

Participants A total of 716 participants responded to this questionnaire. The data from 83 of the participants had to be discarded because of their excessively high error rates (i.e., above 25 %). These participants selected the same response option for most of the items (probably to finish the experiment as soon as possible). This resulted in 633 participants (329 male, 304 female) being retained for the present data set. The mean age of the participants was 30.2 years (SD = 12.6), ranging from 18 to 62 years. The participants were roughly equivalent in years of formal education (15.9 ± 2.4).

Materials and procedure A set of 123 modified sentences was presented to participants in an online questionnaire. The procedure was the same as in the previous experiment. However, due to the removal of pleasant surprise, participants were provided with a seven-point nominal scale (i.e., chance level was 14.3 %). These seven points corresponded to anger, disgust, fear, sadness, happiness, neutral, and none of the above. The instructions for this experiment were the same as those for the previous one. Presentation sequence effects were avoided by displaying the sentences in a random order. A set of six control sentences (one for each emotional mode) was presented to control for inconsistent responses. Participants with more than two inconsistent responses were excluded.

Results and discussion

Scherer et al. (1991) suggested that in emotion recognition tasks, emotional content is recognized at approximately four times chance performance. Accordingly, to develop the best possible exemplars, a minimum of five times chance performance in the seven-choice emotion recognition task (i.e., 71.4 %) was set as the cutoff level. Applying this criterion led to a set of 90 validated Persian sentences that were reliably associated either with one particular emotion (anger, 17 sentences; disgust, 15; fear, 15; sadness, 14; happiness, 15) or with no emotion at all (neutral, 14). See Appendix A for examples of the Persian sentences and their English translations.

To date, in studying emotional speech, many researchers have prepared their own study-specific lists of sentences without validating the emotional lexical content (see Luo et al., 2007, for such studies). However, to conduct a reliable study on emotional speech, a set of validated sentences is required (Ben-David et al., 2011). Accordingly, on the basis of the results of this experiment, a set of 90 sentences was categorized into one of the five emotional categories or the neutral category.

However, since the intensity of the emotion conveyed through the lexical content of the sentences could affect the participants' recognition of the intended emotions, it was necessary to determine the emotional intensity of each sentence.

Part Three: Measuring the emotional intensity of the lexical content

The intensity of the emotion conveyed through the lexical content of the sentences could affect the participants' recognition of the intended emotions. Therefore, as the next step, the authors performed a third experiment with the validated 90-sentence set in order to identify the emotional intensity of each sentence.

Method

Participants A total of 250 Persian speakers (117 male, 133 female) took part in the experiment, none of whom had participated in the previous studies. Of these, 50 participants were excluded, either because Persian was no longer their dominant language (34 %) or because they did not follow the instructions (16 %). The participant recruitment procedure was the same as in the previous experiments. The mean age of the remaining participants (105 male, 95 female) was 29.6 years (SD = 12.3), ranging from 18 to 61 years. Participants were similar in years of formal education (15.7 ± 2.2).

3 As explained earlier, Ms. Tailor is a lady who works as a tailor, but whose family name is not Tailor.

Materials and procedure The 90 sentences were presented, in random order, to each participant in an online questionnaire. Participants were instructed to attend to the questions individually in a quiet environment. They were asked to read each sentence and to imagine the scenario depicted in it. Participants were required to rate the intensity of the intended emotion (given at the end of each sentence) on a five-point Likert scale (Likert, 1936), corresponding to very little, little, mild, high, and very high intensity. A five-point Likert scale is a reliable method to measure the extent to which a stimulus is characterized by a specific property (Calder, 1998). In addition, in order to provide participants with the possibility of rejecting the intended emotion, we added the response option "not at all." For each item, this option was provided below the Likert scale, and participants were instructed to select "not at all" if they believed another emotion was being described by the sentence.

Results and discussion

The mean emotional intensity of each of the 90 sentences was calculated, thereby identifying the intensity of the lexical content of each sentence (see Table 1 for the details).

This additional piece of information will allow researchers to use a matched set of sentences in future studies.

At the end of Part Three of this study, we had generated and validated the first list of emotional and neutral sentences for Persian. This sentence set served as the finalized lexical content for recording the vocal emotional portrayals.

Part Four: Elicitation and recording procedure

The ultimate goal of this study was to establish and validate a database of emotional vocal portrayals. Therefore, as the next step, the validated sentences were articulated by two native Persian speakers (encoders) in the five emotional categories (anger, disgust, fear, happiness, and sadness) and the neutral mode. These vocal portrayals were recorded in a professional recording studio in Berlin, Germany.

Method

Encoders Actors learn to express emotions in an exaggerated way; therefore, professional actors were not used in this study. Instead, two middle-aged native Persian speakers (male, 50 years old; female, 49 years old) who had taken acting lessons and had practiced acting for a while were chosen to articulate the verbal content of the database. Both speakers had learned Persian from birth and spoke Persian without an accent. The speakers received €25/h as financial compensation for their services.

Materials The 90 sentences selected in the lexical content validation phase (anger, 17 sentences; disgust, 15; fear, 15; sadness, 14; happiness, 15; and neutral, 14) were used as materials to elicit emotional speech from the two speakers in the following three conditions: (1) congruent: emotional lexical content articulated in a congruent emotional voice (76 sentences by two speakers); (2) incongruent: neutral sentences articulated in an emotional voice (70 sentences by two speakers); and (3) baseline: all emotional and neutral sentences articulated in a neutral voice (90 sentences by two speakers). This resulted in the generation of 472 vocal stimuli.

Procedure Prior to recording, each speaker had four practice sessions with the first author of this article. Each practice session started with a review and discussion of the literal and figurative meanings of a given emotion, its ranges, and the ways it could be portrayed in speech. After these discussions, the speakers were provided with standardized emotion portrayal instructions based on a scenario approach (Scherer et al., 1991). Five scenarios (one corresponding to each emotion) were used in the portrayal instructions (see Appendix B for the list of scenarios). The same scenarios had been used in a similar study by Scherer et al. (1991), and they had been checked for cultural appropriateness and translated into Persian prior to the recording sessions.

Table 1 Descriptive statistical values of the emotional intensity of the lexical content

Emotional Category Mean Intensity SD Max Min

Anger 4.9 0.33 5.4 4.2

Disgust 5.3 0.51 5.7 4.7

Fear 4.8 0.46 5.4 4.3

Happiness 4.8 0.61 5.5 4.2

Sadness 5.2 0.43 5.6 4.6

Neutral 5.2 0.27 5.5 4.9


The speakers were asked to read each scenario, imagine experiencing the situation described (Stanislavski method), and then articulate the given list of sentences in the way they would have uttered them in that situation. Scherer et al. (1991) selected these scenarios on the basis of intercultural studies on emotion experience, in which representative emotion-eliciting situations were gathered from almost 3,000 participants living on five continents (Wallbott & Scherer, 1986). These scenarios are likely to elicit the target emotions and were used both in the practice and in the recording sessions.

The criterion for selecting the speech samples was having audio tokens that could serve as representative examples of particular prosodic emotions. Therefore, the speakers were encouraged to avoid exaggerated or dramatic use of prosody. Once the authors and the speakers were satisfied with the simulations in the practice sessions, the speakers made the final recordings. The speakers were recorded separately in a professional recording studio in Berlin under the supervision of an acoustic engineer and the first author. Each of the five emotions and the neutral portrayals were recorded in separate sessions.

All utterances were recorded on digital tapes under identical conditions, using a high-quality fixed microphone (Sennheiser MKH 20 P48). The recordings were digitized at a 16-bit/44.1 kHz sampling rate. The sound files were recorded on digital tapes (TASCAM DA-20 MK II), digitally transferred to a computer, and edited to mark the onset and offset of each sentence. Following Pell and Skorup (2008), each audio sentence was normalized to a peak intensity of 70 dB using Adobe Audition version 1.5, to control for unavoidable differences in the sound level of the source recordings across actors.
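The authors performed this normalization in Adobe Audition. Purely as a rough illustration, the sketch below shows how a comparable batch intensity normalization could be scripted with Praat through the parselmouth Python package; note that Praat's "Scale intensity" command sets the average (RMS) intensity rather than the peak intensity reported above, so this is an approximation, and the folder and file names are hypothetical.

```python
# Illustrative sketch (not the authors' pipeline): batch-normalize WAV files toward
# a 70-dB intensity target using Praat via parselmouth (pip install praat-parselmouth).
import glob
import parselmouth
from parselmouth.praat import call

TARGET_DB = 70.0  # target average intensity; the paper reports a 70-dB *peak* target

for path in glob.glob("recordings/*.wav"):          # hypothetical folder
    snd = parselmouth.Sound(path)
    call(snd, "Scale intensity", TARGET_DB)         # Praat: scale to average intensity (dB)
    out_path = path.replace(".wav", "_norm.wav")
    snd.save(out_path, parselmouth.SoundFileFormat.WAV)
```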

Accordingly, a total of 472 vocal utterances were generated. These vocal portrayals encompass the three conditions of congruent (76 sentences by two speakers), incongruent (70 sentences by two speakers), and baseline (90 sentences by two speakers). It was anticipated that difficulties with the elicitation and simulation procedure would lead to some of the recorded stimuli not serving as typical portrayals of the target emotions (Pell, 2002; Scherer et al., 1991), or that other nuances of emotional categories would be identifiable with specific vocal portrayals (Scherer et al., 1991). Therefore, a perceptual study was essential to eliminate the poor portrayals.

Part Five: Validation of the vocal materials

To develop a well-controlled database of emotional speech for Persian, we conducted a perceptual study to eliminate the poor vocal portrayals. We then conducted an acoustic analysis to check whether the vocal portrayals reveal obvious differences in acoustic parameters that might help participants to distinguish the intended emotional category correctly.

The perceptual study

Method

Participants To date, similar studies on emotional speech have recruited between ten and twenty-four participants as decoders to eliminate poor vocal portrayals (e.g., Pell, 2002; Pell, Paulmann, Dara, Alasseri, & Kotz, 2009). Nevertheless, in order to obtain robust results, we recruited a total of 34 participants as decoders (17 males; mean age 26.3 years, SD = 2.6).

Four participants had to be excluded for not following the instructions of the experiment. All of the participants were Iranian undergraduate or graduate students studying in Berlin. They had all learned Persian from birth and had been away from Iran for less than three years. They all reported good hearing and had normal or corrected-to-normal vision, as verified by the examiner at the beginning of the study. Participants did not suffer from any psychopathological conditions, had no history of neurological problems, and took no psychoactive medication, as assessed by a detailed questionnaire. A detailed language questionnaire was completed by each participant prior to testing to ensure that Persian was their native and dominant language. Participants received €8/h as financial compensation for their cooperation.

Materials and procedure A total of 472 vocal utterances (all of the emotional and neutral portrayals), encompassing the three conditions of congruent (76 sentences by two speakers), incongruent (70 sentences by two speakers), and baseline (90 sentences by two speakers), were included in a perception study. Each participant was tested individually in a dimly lit, sound-attenuated room. Participants were presented with the vocal utterances previously recorded. They were instructed to listen to the utterances and to identify their emotional prosody, regardless of their lexical content. They were asked to mark their answers on a seven-button answer panel. The seven choices available were anger, disgust, fear, happiness, sadness, neutral, and none of the above. The stimulus set was presented in four blocks in a fully randomized design. The experiment took almost 90 min for each participant. To limit fatigue effects and possible inattention to the stimuli, participants were tested during four 20-min sessions, with a five-minute break after each session.

The experiment was run as follows: Acoustic exemplars were presented via a laptop computer using the E-Prime software (Schneider, Eschman, & Zuccolotto, 2002). Each participant heard the audio stimuli binaurally (through Sennheiser HD 600 headphones) at his or her comfortable loudness level (manually adjusted by each participant at the onset of the study). Each trial sequence consisted of (1) the presentation of a fixation cross for 200 ms, (2) a blank screen for 200 ms, (3) audio presentation of an exemplar with simultaneous display of an image of a loudspeaker, (4) display of a question mark, indicating that an emotion judgment decision should be made, and (5) a blank screen for 2,000 ms. See Fig. 1 for a schematic illustration of the procedure.
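The original experiment was implemented in E-Prime. Purely to illustrate the trial structure described above, the following sketch re-creates the same sequence in Python with the PsychoPy library; this is a substitution for illustration, not the authors' code, and the stimulus file names and response keys are hypothetical.

```python
# Illustrative re-implementation of one trial (fixation 200 ms, blank 200 ms,
# audio + loudspeaker image, question mark until response, blank 2,000 ms).
# Assumes PsychoPy is installed; "stimulus.wav" and "speaker.png" are hypothetical files.
from psychopy import visual, core, event, sound

win = visual.Window(fullscr=False, color="black")
fixation = visual.TextStim(win, text="+")
question = visual.TextStim(win, text="?")
speaker_img = visual.ImageStim(win, image="speaker.png")
audio = sound.Sound("stimulus.wav")

fixation.draw(); win.flip(); core.wait(0.2)     # (1) fixation cross, 200 ms
win.flip(); core.wait(0.2)                      # (2) blank screen, 200 ms
speaker_img.draw(); win.flip(); audio.play()    # (3) audio with loudspeaker image
core.wait(audio.getDuration())
question.draw(); win.flip()                     # (4) question mark until response
keys = event.waitKeys(keyList=["1", "2", "3", "4", "5", "6", "7"])
win.flip(); core.wait(2.0)                      # (5) blank screen, 2,000 ms
win.close()
```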

Results and discussion

The percentage of native listeners who accurately categorized the target emotion expressed in each sentence was computed for each item and speaker. The percentages of accurate responses, as well as the error patterns averaged for the two speakers, are presented in Table 2.

In order to create a set of exemplars that portray the intended emotions as accurately as possible, we adopted the criterion described below. Previous work suggests that vocal emotions (except for pleasant surprise and neutral) are recognized at almost four times chance level (Scherer et al., 1991). Therefore, in order to select the best possible exemplars, a minimum of five times chance performance in the seven-choice emotion recognition task (i.e., 71.42 %) was set as the cutoff level in the present study. Application of this criterion led to the exclusion of only one token, from the incongruent condition, articulated by the female speaker and intended to communicate disgust. In addition, in order to exclude any systematic but uncontrolled confusion, all of the tokens whose response percentage fell between 71.4 % and 85.7 % (i.e., between five and six times chance level) were scrutinized carefully for their error patterns. Tokens that showed repetition of the same wrong answer above chance level (i.e., 14.3 %) were then omitted. This resulted in the omission of three exemplars. All three omitted portrayals belonged to the incongruent condition and were meant to portray fear (one token by the female speaker) and sadness (two tokens by the male speaker). As a result, a total of 468 vocal portrayals that fulfilled the quality criteria were kept. These vocal portrayals, conveying five emotional meanings and the neutral mode, serve as the vocal materials of the database of Persian emotional speech. See Fig. 2 for a tree chart of the validated database, articulated by two speakers.
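The selection rule above (a five-times-chance cutoff plus an error-pattern check for tokens between five and six times chance) is straightforward to express in code. The sketch below is an illustrative Python version operating on hypothetical per-token response percentages, not the authors' actual analysis script.

```python
# Illustrative token-screening rule (hypothetical data, not the published analysis):
# keep a token if (a) target-emotion accuracy >= 5x chance, and (b) when accuracy
# falls between 5x and 6x chance, no single wrong answer exceeds chance level.
CHANCE = 100.0 / 7          # seven-choice task: ~14.3 %
CUTOFF = 5 * CHANCE         # ~71.4 %
UPPER = 6 * CHANCE          # ~85.7 %

def keep_token(target_pct, wrong_answer_pcts):
    """target_pct: % of listeners choosing the intended emotion;
    wrong_answer_pcts: % values for each non-target response option."""
    if target_pct < CUTOFF:
        return False
    if target_pct <= UPPER and max(wrong_answer_pcts, default=0.0) > CHANCE:
        return False
    return True

# Example: 75 % correct, but one wrong category chosen by 20 % of listeners -> dropped
print(keep_token(75.0, [20.0, 3.0, 2.0]))  # False
print(keep_token(92.0, [4.0, 4.0]))        # True
```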

The results obtained from the perceptual study (shown in Table 2) reveal that all of the emotional portrayals were recognized very accurately (ranging between 90.5 % and 98 %). The most difficult emotion to recognize was disgust (90.5 % in the incongruent and 95.65 % in the congruent condition). Interestingly, relative difficulties in recognizing vocal portrayals of disgust have been reported in the literature (Scherer et al., 1991). The participants also had more difficulty recognizing fear portrayals than the other emotional categories (94.5 % in the incongruent and 97.7 % in the congruent condition). Analyzing the error patterns can provide valuable cues as to the nature of the inference process, as well as the cues used by the participants (Banse & Scherer, 1996). Cursory examination of the error patterns suggests that in the congruent condition, sadness and fear were often confused with one another (i.e., fear was mistaken for sadness by 1.2 %, and sadness was confused with fear by 1.05 %). Similar confusion patterns for fear and sadness have been reported in the literature on emotional vocal portrayals (Paulmann et al., 2008; Pell et al., 2009). Portrayals of disgust were also mistaken for sadness (1.3 % in the congruent and 2.35 % in the incongruent condition). It has been argued that emotions that are acoustically similar (e.g., disgust and sadness) are very likely to be misclassified (Banse & Scherer, 1996). Some researchers have also claimed that misclassifications often involve emotions of similar valence (e.g., fear and anger) and arousal (e.g., sadness and disgust; Scherer, 1986). These reasons may explain some of the errors observed in this study. As expected, the vocal tokens in the congruent condition showed higher rates of emotion recognition than did those in the incongruent condition, which could be due to the absence of lexical cues in the incongruent condition. The results also reveal that vocal portrayals of neutrality, anger, and happiness were associated with the least confusion.

Acoustic analysis

Acoustic analyses were performed to determine whether the vocal portrayals would show obvious differences in acoustic parameters that might help participants to distinguish the intended emotions correctly.

Fig. 1 Schematic illustration of a trial presentation


Fig. 2 Tree chart of the validated database

Table 2 Distribution (as percentages) of the responses given to each of the intended expressions

Condition / Lexical Content of the Sentences / Target Vocal Emotion

Percentage of Responses: Anger / Disgust / Fear / Sadness / Happiness / Neutral / None of the Above

Congruent Anger Anger 97.55 0.7 0.5 0.1 1.15

Disgust Disgust 0.35 95.65 0.2 1.3 0.55 0.45 1.8

Fear Fear 0.1 97.7 1.2 1

Sadness Sadness 0.6 1.05 98.35

Happiness Happiness 0.55 0.65 97.7 1.1

Incongruent Neutral Anger 98 0.45 0.8 0.1 0.1 0.55

Neutral Disgust 1.05 90.5 0.1 2.35 0.1 1.8 4.1

Neutral Fear 1.4 94.5 0.6 0.35 3.15

Neutral Sadness 0.45 3.95 95.4 0.1 0.1

Neutral Happiness 0.5 0.15 0.7 97 0.45 1.2

Baseline Anger Neutral 0.1 99.7 0.2

Disgust Neutral 0.65 99.35

Fear Neutral 0.2 99.8

Sadness Neutral 0.25 0.8 98.95

Happiness Neutral 0.2 99.8

Neutral Neutral 100

Recognition accuracy rates (sensitivity) are indicated in bold. Values are averaged across speakers.


The analyses were limited to three critical parameters (mean pitch, mean intensity, and duration) that have been reported to differentiate well among vocal emotion categories and perceptual terms (Juslin & Laukka, 2003). A total of 468 vocal utterances (all of the validated emotional and neutral portrayals), encompassing the congruent, incongruent, and baseline conditions, were included in this analysis. These vocal utterances were analyzed using the Praat speech analysis software (Boersma & Weenink, 2006). See Table 3 for the mean values of the normalized acoustic measures of the valid emotional portrayals.

Please also see the supplementary materials for a detailed list of the file names, along with their values for the acoustic measures.
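The acoustic measures were extracted with Praat. As a rough illustration, the snippet below shows how mean pitch, mean intensity, and duration could be pulled from one sound file with the parselmouth Python interface to Praat; this is an illustrative substitute for the original Praat workflow, and the file name is hypothetical.

```python
# Illustrative extraction of the three acoustic measures analyzed in the paper
# (mean pitch, mean intensity, duration) using parselmouth; not the authors' script.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("utterance.wav")            # hypothetical file

pitch = snd.to_pitch()                              # default pitch analysis
mean_pitch_hz = call(pitch, "Get mean", 0, 0, "Hertz")

intensity = snd.to_intensity()
mean_intensity_db = call(intensity, "Get mean", 0, 0, "energy")

duration_s = snd.get_total_duration()

print(f"pitch: {mean_pitch_hz:.1f} Hz, intensity: {mean_intensity_db:.1f} dB, "
      f"duration: {duration_s:.2f} s")
```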

The sentences that served as the lexical content of the database were not matched for number of syllables. Therefore, in the first step, only the mean pitch and mean intensity of the vocal utterances were entered into a series of univariate analyses of variance (ANOVAs). The two acoustic measures (mean pitch and mean intensity) served as the dependent variables, and the six-level independent variable comprised the five emotion types and the neutral mode. The results revealed highly significant differences across the emotional categories. We found main effects of mean pitch [F(5, 84) = 142.307, p < .01, η2 = .894] and mean intensity [F(5, 84) = 7.626, p < .01, η2 = .312] in the congruent condition (emotional lexical content articulated in an emotional voice), as well as of mean pitch [F(5, 78) = 54.41, p < .01, η2 = .777] and mean intensity [F(5, 78) = 9.424, p < .01, η2 = .377] in the incongruent condition (neutral lexical content portrayed in emotional voices). More specifically, in the incongruent condition, anger (279.18 Hz) and happiness (280.40 Hz) had the highest pitch values, fear (250.18 Hz) and sadness (247.35 Hz) had similar but lower pitch values, and disgust (216.34 Hz) had the lowest pitch value. In the congruent condition, anger (274.76 Hz), disgust (266.67 Hz), and happiness (268.78 Hz) had the highest pitch values, fear (249.56 Hz) had a lower pitch value, and sadness (226.74 Hz) had the lowest pitch value. Using the Tukey–Kramer HSD test, a pairwise comparison between the mean pitch values of the emotional categories and the neutral mode was conducted, separately for each of the two conditions. The results of this comparison revealed a highly significant difference (p < .01) between each of the five emotions and the neutral mode in both conditions. Figures 3 and 4 display the mean pitch values of the sentences portrayed in the congruent and incongruent conditions, respectively; highly significant effects are marked by two asterisks.
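As an illustration of this kind of analysis, the sketch below runs a one-way ANOVA on mean pitch followed by a Tukey HSD post-hoc test using statsmodels; it assumes a hypothetical long-format table of per-utterance measures (columns `emotion` and `mean_pitch`) rather than the authors' actual data files.

```python
# Illustrative one-way ANOVA plus Tukey HSD on mean pitch across emotion categories.
# Assumes a pandas DataFrame with columns "emotion" and "mean_pitch"
# (one row per validated utterance in one condition); hypothetical data layout.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("congruent_measures.csv")          # hypothetical file

model = smf.ols("mean_pitch ~ C(emotion)", data=df).fit()
print(anova_lm(model, typ=2))                       # F test for the emotion factor

tukey = pairwise_tukeyhsd(endog=df["mean_pitch"], groups=df["emotion"], alpha=0.01)
print(tukey.summary())                              # pairwise category comparisons
```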

As for intensity, in both the congruent and incongruent conditions all of the emotional categories had similar values (congruent condition: anger = 72.11, disgust = 71.75, fear = 73.40, happiness = 73.63, sadness = 72.11 dB; incongruent condition: anger = 72.61, disgust = 71.53, fear = 71.22, happiness = 73.81, sadness = 72.89 dB). A pairwise comparison was performed between the mean intensities of the five emotional portrayals and that of the neutral portrayals, using Tukey–Kramer HSD tests. In the congruent condition, the results revealed a highly significant difference (p < .01) between the mean intensities of the happiness and neutral portrayals. Fear portrayals also showed a significant difference (p < .05). However, we found no significant differences between the mean intensities of the anger, disgust, and sadness portrayals and that of the neutral portrayals.

Table 3 Normalized acoustic measures of the valid emotional portrayals, per condition (averaged for both speakers)

Condition / Lexical Content of the Sentences / Target Vocal Emotion / Pitch (in Hz): Mean Value, SD / Intensity (in dB): Mean Value, SD

Congruent Anger Anger 274.76 17.50 72.11 0.90

Disgust Disgust 266.67 16.41 71.75 1.15

Fear Fear 249.56 11.24 73.40 0.50

Happiness Happiness 268.78 13.64 73.63 1.53

Sadness Sadness 226.74 18.74 72.11 1.32

Incongruent Anger Anger 279.18 13.91 72.61 0.98

Disgust Disgust 216.34 17.24 71.53 1.41

Fear Fear 250.18 31.80 71.22 1.07

Happiness Happiness 280.40 22.34 73.81 1.11

Sadness Sadness 247.35 38.32 72.89 1.30


On the basis of the results of this pairwise comparison, in the incongruent condition only the mean intensity of happiness showed a highly significant difference (p < .01) from that of the neutral portrayals. The differences between the mean intensities of the other emotional portrayals (anger, disgust, fear, and sadness) and that of the neutral portrayals were not significant.

Figures 5 and 6 display the mean intensity values of the emotional and neutral portrayals in the congruent and incongruent conditions; significant effects are marked by one asterisk, and highly significant effects by two asterisks.

Fig. 3 Mean pitch values of the sentences, portrayed in the congruent condition. Error bars represent ±1 SD. **p < .01

Fig. 4 Mean pitch values of the sentences, portrayed in the incongruent condition and the neutral mode. Error bars represent ±1 SD. **p < .01


Table 4 shows the mean values of the normalized acoustic measures per condition (averaged for both speakers).

In the next step, a repeated measures ANOVA was conducted with a 6 (prosody) × 2 (speaker) design, with the mean duration of the vocal utterances in the incongruent condition as the dependent variable (neutral sentences portrayed in the five emotional categories plus neutral). The results revealed highly significant main effects of speaker [F(1, 10) = 64.425, p < .01] and prosody [F(5, 50) = 377.275, p < .01], as well as a highly significant interaction [F(5, 50) = 31.697, p < .01].

Fig. 5 Mean intensities of the emotional portrayals in the congruent condition and the neutral mode. Error bars represent ±1 SD. **p < .01, *p < .05

Fig. 6 Mean intensities of the neutral portrayals and of the emotional portrayals in the incongruent condition. Error bars represent ±1 SD. **p < .01


See Figs. 7 and 8 for a comparison of the durations of the vocal utterances across the various emotions. Next, each emotion was compared to the neutral portrayal by within-subjects contrasts. Apart from anger, which revealed no significant effect [F(1, 10) = 0.246, p = .63], all of the other emotions (disgust, fear, happiness, and sadness) displayed highly significant differences [all Fs(1, 10) > 27.915, ps < .01]. However, we did obtain a highly significant result for anger [F(1, 10) = 13.684, p < .01] when the other variable, speaker, was taken into account in this comparison. As can be seen in Fig. 7, when articulating the sentences in an angry voice, the female speaker slowed down to a meaningful extent.

In particular, fear portrayals (3.84 s) were the fastest to be uttered, anger (4.21 s) and happiness (4.85 s) utterances were slower, sadness vocalizations (5.52 s) were slower still, and disgust portrayals (7.88 s) were the slowest of all.
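For illustration, a comparable 6 (prosody) × 2 (speaker) repeated measures ANOVA on utterance duration could be run with statsmodels' AnovaRM, as sketched below on a hypothetical long-format table with columns `sentence`, `prosody`, `speaker`, and `duration`; this is not the authors' analysis code.

```python
# Illustrative 6 (prosody) x 2 (speaker) repeated measures ANOVA on duration,
# treating each neutral sentence as the repeated-measures "subject".
# Assumes a long-format DataFrame with columns: sentence, prosody, speaker, duration.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

df = pd.read_csv("incongruent_durations.csv")       # hypothetical file

res = AnovaRM(df, depvar="duration", subject="sentence",
              within=["prosody", "speaker"]).fit()
print(res)   # F tests for prosody, speaker, and their interaction
```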

Table 4 Comparison of mean durations of the emotional sentences, portrayed in the corresponding emotions (congruent condition) as well as in a neutral voice (baseline condition)

Condition / Lexical Content of the Sentences / Target Vocal Emotion / Mean Duration per Sentence (in Seconds): Male Speaker (SD), Female Speaker (SD)

Congruent Anger Anger 4.73 (0.42) 5.00 (0.35)

Disgust Disgust 7.15 (0.81) 6.78 (0.75)

Fear Fear 4.06 (0.34) 4.46 (0.46)

Happiness Happiness 5.31 (0.54) 5.23 (0.54)

Sadness Sadness 5.43 (0.5) 6.00 (0.42)

Baseline Anger Neutral 4.85 (0.34) 4.25 (0.3)

Disgust Neutral 4.33 (0.52) 3.81 (0.35)

Fear Neutral 4.59 (0.24) 4.01 (0.28)

Happiness Neutral 4.45 (0.39) 3.82 (0.26)

Sadness Neutral 4.81 (0.35) 4.06 (0.32)

Fig. 7 Comparison of mean durations of the neutral sentences, portrayed in all emotional categories for each speaker. Error bars represent ±1 SD


The durations of the emotional portrayals were then compared with the neutral portrayals of the emotional sentences for each emotion. Figure 9 and Table 4 display the comparison of the mean durations of the emotional sentences, portrayed in the corresponding emotions as well as in a neutral voice. These pairwise comparisons, through paired-sample t tests, showed significant differences for the mean durations of anger [t(33) = 3.18, p < .01], disgust [t(29) = 30.237, p < .01], happiness [t(29) = 9.044, p < .01], and sadness [t(26) = 16.51, p < .01].

Fig. 8 Comparison of mean durations of the neutral sentences, portrayed in all emotional categories averaged for the two speakers. Error bars represent ±1 SD

Fig. 9 Comparison of mean durations of the emotional sentences, portrayed in the corresponding emotions as well as neutral voice. Error bars represent ±1 SD

Behav Res

Page 15: Recognizing emotional speech in Persian: A validated database of Persian emotional speech (Persian ESD)

[t(29) = –0.353, n.s.]. By making multiple comparisons, wereduced the significance level to .01, in accordance withBonferroni correction.
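A pairwise comparison of this kind could be run, for example, with SciPy, as sketched below. The duration values are placeholders rather than data from the database; only the Bonferroni-corrected threshold (.05 / 5 = .01) mirrors the procedure described above.

```python
# A minimal sketch of one pairwise comparison (emotional vs. neutral
# portrayal of the same sentences); the numbers below are placeholders,
# not values from the database.
from scipy import stats

emotional_durations = [5.1, 4.8, 5.4, 5.0, 4.9]  # emotion-congruent voice (s)
neutral_durations   = [4.3, 4.1, 4.6, 4.2, 4.0]  # same sentences, neutral voice (s)

# Five emotions are each compared against neutral, so the per-test alpha is
# Bonferroni-corrected: .05 / 5 = .01.
alpha = 0.05 / 5

t_stat, p_value = stats.ttest_rel(emotional_durations, neutral_durations)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant: {p_value < alpha}")
```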

Discriminant analysis

A discriminant function analysis for all three conditions (i.e., congruent, incongruent, and baseline) was performed to determine how well the six categories (five emotions plus neutral) could be classified on the basis of the selected acoustic measures (mean pitch, mean intensity, and duration). A total of 468 vocal utterances (all of the validated emotional and neutral portrayals), encompassing the three conditions of congruent, incongruent, and baseline, were included in the discriminant analysis.

Because the conditions differed in nature, not all three measures (mean pitch, mean intensity, and duration) were taken into account for each condition. In each analysis, the intended emotional category served as the dependent variable, and the acoustic measurements served as independent variables. In the congruent condition, the analysis was conducted only on the basis of the two measures of mean pitch and mean intensity (because of the different lengths of the sentences in this condition, the mean speech rate was not taken into account here). The vast majority (95.1 %) of the variance was accounted for by the first function, with which mean pitch showed the highest pooled within-group correlation (r = .998). Mean intensity had the largest pooled within-group correlation with the canonical discriminant function score (r = .999) in a second function that accounted for 4.9 % of the variance. The classifications resulting from the discriminant analysis revealed that the model identified 62.2 % of the sentences correctly (anger, 47.1 %; disgust, 33.3 %; fear, 66.7 %; happiness, 53.3 %; sadness, 78.6 %; and neutral, 100 %). Figure 10 illustrates how the canonical discriminant functions separated the emotional categories for each sentence. As can be seen, with the exceptions of anger and disgust, which could often be mistaken for each other, the first two functions successfully separated the sentences by emotional category.

For the incongruent condition, the speech rate for each emotional category was calculated by subtracting the duration of the neutrally portrayed sentences from that of the emotionally portrayed sentences. To avoid using the same values twice, the discriminant analysis was conducted only on the neutral sentences that were portrayed emotionally (i.e., the incongruent condition). In this condition, therefore, the analysis was conducted on the basis of the three measures of mean pitch, mean intensity, and mean duration.

The vast majority (95.0 %) of the variance was accounted for by the first function described by this discriminant analysis. Pooled within-group correlations between the acoustic parameters and the first canonical discriminant function scores revealed that mean duration demonstrated the highest correlation (r = .861). Mean intensity had the largest pooled within-group correlation with the canonical discriminant function score (r = .895) in a second function that accounted for 4.7 % of the variance. In a third function, which accounted for 0.3 % of the variance, mean pitch had the highest pooled within-group correlation with the canonical discriminant function score (r = .789). Figure 11 illustrates how the canonical discriminant function scores for Functions 1 and 2 separated the emotional categories for each sentence. As can be seen, the first two functions clearly separated the sentences by emotional category. The classification results obtained from the discriminant analysis revealed that the model identified 81.4 % of the sentences correctly (anger, 71.4 %; disgust, 100 %; fear, 78.6 %; happiness, 71.4 %; sadness, 85.7 %).

In the baseline condition, only mean pitch and mean intensity were taken into account (because of the different lengths of the sentences in this condition, the mean speech rate was not considered). Here, the vast majority (95 %) of the variance was accounted for by the first function described by this discriminant analysis. Pooled within-group correlations between the acoustic parameters and the first canonical discriminant function scores revealed that mean pitch demonstrated the highest correlation (r = .999). Mean intensity had the largest pooled within-group correlation with the canonical discriminant function score (r = .994) in a second function that accounted for 5 % of the variance. As expected, even the best model did not perform well in predicting category membership, averaging 43.4 % correct (anger, 35.3 %; disgust, 53.3 %; fear, 46.7 %; happiness, 33.3 %; sadness, 50.0 %). As can be seen in Fig. 12, the two functions did not separate the categories clearly. At best, Function 1 separated anger and disgust from the rest.
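As an illustration of this procedure, the following sketch shows how a comparable discriminant analysis of the incongruent-condition measures could be run in Python with scikit-learn. The file name, column names, and the choice of scikit-learn's linear discriminant implementation are assumptions made for the sake of the example, not the authors' original setup.

```python
# A minimal sketch of a linear discriminant analysis on the acoustic
# measures, using scikit-learn as a stand-in for the original analysis;
# the file and column names are hypothetical.
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

df = pd.read_csv("incongruent_acoustics.csv")
X = df[["mean_pitch", "mean_intensity", "mean_duration"]]
y = df["emotion"]  # intended emotional category

lda = LinearDiscriminantAnalysis().fit(X, y)

# Proportion of between-class variance captured by each discriminant
# function (compare the 95.0 %, 4.7 %, and 0.3 % reported above).
print(lda.explained_variance_ratio_)

# Re-classify the same utterances to estimate how well the three acoustic
# measures separate the intended emotion categories.
print("Proportion classified correctly:", lda.score(X, y))
```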

Part Six: Summary and general discussion

The present study was designed to create a well-controlled database of emotional speech for Persian. A set of Persian sentences for the five basic emotions (anger, disgust, fear, happiness, and sadness), as well as a neutral mode, were generated and validated in a set of rating studies. Then, the emotional intensity of each sentence was calculated. Having information on the intensity of the emotional meanings will allow researchers to use a matched set of sentences for clinical and neurological studies. Two native speakers of Persian articulated the sentences in the intended emotional categories. Their vocal expressions were validated in a perceptual study by a group of 34 native listeners. The validated vocal portrayals were then subjected to acoustic analysis. In both the perceptual and acoustic patterns, the expected variations were observed among the five emotion categories.

The present study has a number of limitations. First, having only two speakers (one per gender) as encoders makes it difficult to gauge the extent of interspeaker variability (Douglas-Cowie et al., 2003). This might have led to greater variability in the acoustic and perceptual measures.

Fig. 11 Results of a discriminant analysis that demonstrates how the canonical discriminant function scores for Functions 1 and 2 separate the emotional categories for each sentence (in the incongruent condition)

Fig. 10 Results of a discriminant feature analysis that illustrates how the canonical discriminant function scores for Functions 1 and 2 separate the emotional categories for each sentence (in the congruent condition)


Moreover, having a small number of speakers might have introduced certain artifacts (Pell, 2002). To mitigate this drawback, however, the speakers were selected from a group of semiprofessional actors, and the Stanislavski method was employed to help them articulate the emotions naturally. In addition, their utterances were recorded in a professional recording studio under the supervision of an acoustic engineer. Another potential problem is that a relatively small number of decoders were recruited for the perceptual validation study, which aimed to select the best possible exemplars; a larger number of decoders would likely improve the reliability of the results. To minimize this drawback, we recruited more decoders than had taken part in comparable studies (i.e., ten decoders in Pell, 2002, and 20 decoders in Pell et al., 2009). In addition, the cutoff level used in previous studies (i.e., three times chance level; Pell et al., 2009) was raised to a minimum of five times chance level in the seven-choice emotion recognition task (i.e., 71.42 %). Finally, only a small number of acoustic parameters were taken into account in the acoustic analysis; a larger set of parameters would provide more information about the acoustic features of Persian emotional speech. To compensate for this limitation and to ensure that the vocal utterances contained detectable acoustic contrasts, a discriminant function analysis was performed on the selected acoustic measures (pitch, intensity, and duration). The resulting classifications showed that the model identified 62.2 % of the sentences correctly in the congruent condition, 81.4 % in the incongruent condition, and 43.4 % in the baseline condition.

Despite these limitations, the emotional stimuli (both textual and vocal) of the present study were perceptually validated. The database (Persian ESD) encompasses a meaningful set of validated lexical (90 items) and vocal (468 utterances) stimuli conveying five emotional meanings. Because the database covers the three conditions of (a) congruent, (b) incongruent, and (c) baseline, it offers a unique opportunity to identify separately the effects of prosody and lexical content on the identification of emotions in speech. The database could also be used in neuroimaging and clinical studies to assess a person's ability to identify emotions in spoken language. Additionally, it can open up new opportunities for future investigations in speech synthesis research, as well as in gender studies. To access the database and the supplementary information, please contact [email protected].

Author note The authors express their appreciation to Silke Paulmann, Maria Macuch, Klaus Scherer, Luna Beck, Dar Meshi, Francesca Citron, Pooya Keshtiari, Arsalan Kahnemuyipour, Saeid Sheikh Rohani, Georg Hosoya, Jörg Dreyer, Masood Ghayoomi, Elif Alkan Härtwig, Lea Gutz, Reza Nilipour, Yahya Modarresi Tehrani, Fatemeh Izadi, Trudi Falamaki-Zehnder, Liila Taruffi, Laura Hahn, Karl Brian Northeast, Arash Aryani, Christa Bös, and Afsaneh Fazly for their help with sentence construction and validation, recordings, data collection and organization, and manuscript preparation. A special thank you to our two speakers, Mithra Zahedi and Vahid Etemad. The authors also thank all of the participants who took part in the various experiments in this study. This research was financially supported by a grant from the German Research Society (DFG) to N.K.

Fig. 12 Results of a discriminant feature analysis that reveals that, as expected, the canonical discriminant function scores for Functions 1 and 2 do not separate the emotional categories for each sentence (in the baseline condition)


Appendix A: Sample of the Persian sentences included in the database, along with their transliteration, glosses, and English translation

Note that Persian is written from right to left. The abbreviations used are as follows: Ez: ezafe particle; CL: clitic; CL.3SG: third person singular clitic; DOM: direct object marker; 3SG: third person singular.


Appendix B: List of scenarios

Anger: The director is late for the rehearsal again and we have to work until late at night. Once again I have to cancel an important date.

Disgust: I have a summer job in a restaurant. Today I have to clean the toilets, which are incredibly filthy and smell very strongly.

Fear: While I am on a tour bus, the driver loses control of the bus while trying to avoid another car. The bus comes to a standstill at the edge of a precipice, threatening to fall over.

Happiness: I am acting in a new play. From the start, I get along extremely well with my colleagues, who even throw a party for me.

Sadness: I get a call telling me that my best friend died suddenly.


References

Anvari, H., & Givi, H. (1996). Persian grammar (2 vols.). Tehran, Iran: Fatemi.

Banse, R., & Scherer, K. R. (1996). Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70, 614–636. doi:10.1037/0022-3514.70.3.614

Ben-David, B. M., van Lieshout, P. H., & Leszcz, T. (2011). A resource of validated affective and neutral sentences to assess identification of emotion in spoken language after a brain injury. Brain Injury, 25, 206–220.

Boersma, P., & Weenink, D. (2006). Praat: Doing phonetics by computer (Version 4.4.11) [Computer program]. Retrieved February 26, 2010, from www.praat.org

Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005, September). A database of German emotional speech. Paper presented at the 9th European Conference on Speech Communication and Technology, Lisbon, Portugal.

Calder, J. (1998). Survey research methods. Medical Education, 32, 636–652.

Campbell, N. (2000, September). Databases of emotional speech. Paper presented at the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, Newcastle, Northern Ireland, UK.

Cowie, R., & Cornelius, R. R. (2003). Describing the emotional states that are expressed in speech. Speech Communication, 40, 5–32.

Douglas-Cowie, E., Campbell, N., Cowie, R., & Roach, P. (2003). Emotional speech: Towards a new generation of databases. Speech Communication, 40, 33–60.

Ekman, P. (1999). Basic emotions. In T. Dalgleish & T. Power (Eds.), The handbook of cognition and emotion (pp. 45–60). Hove, UK: Wiley.

Ekman, P., Friesen, W. V., & O'Sullivan, M. (1988). Smiles when lying. Journal of Personality and Social Psychology, 54, 414–420. doi:10.1037/0022-3514.54.3.414

Frank, M. G., & Stennett, J. (2001). The forced-choice paradigm and the perception of facial expressions of emotion. Journal of Personality and Social Psychology, 80, 75–85. doi:10.1037/0022-3514.80.1.75

Gerrards-Hesse, A., Spies, K., & Hesse, F. W. (1994). Experimental inductions of emotional states and their effectiveness: A review. British Journal of Psychology, 85, 55–78.

Gharavian, D., & Ahadi, S. M. (2009). Emotional speech recognition and emotion identification in Farsi language. Modares Technical and Engineering, 34(13), 2.

Gharavian, D., & Sheikhan, M. (2010). Emotion recognition and emotion spotting improvement using formant-related features. Majlesi Journal of Electrical Engineering, 4(4).

Gharavian, D., Sheikhan, M., Nazerieh, A., & Garoucy, S. (2012). Speech emotion recognition using FCBF feature selection method and GA-optimized fuzzy ARTMAP neural network. Neural Computing and Applications, 21, 2115–2126.

Ghayoomi, M., Momtazi, S., & Bijankhan, M. (2010). A study of corpus development for Persian. International Journal on Asian Language Processing, 20, 17–33.

Johnson, W. F., Emde, R. N., Scherer, K. R., & Klinnert, M. D. (1986). Recognition of emotion from vocal cues. Archives of General Psychiatry, 43, 280–283. doi:10.1001/archpsyc.1986.01800030098011

Johnstone, T., & Scherer, K. R. (1999, August). The effects of emotions on voice quality. In Proceedings of the 14th International Congress of Phonetic Sciences (pp. 2029–2032). San Francisco, CA: University of California, Berkeley.

Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin, 129, 770–814. doi:10.1037/0033-2909.129.5.770

Likert, R. (1936). A method for measuring the sales influence of a radio program. Journal of Applied Psychology, 20, 175–182.

Liu, P., & Pell, M. D. (2012). Recognizing vocal emotions in Mandarin Chinese: A validated database of Chinese vocal emotional stimuli. Behavior Research Methods, 44, 1042–1051. doi:10.3758/s13428-012-0203-3

Luo, X., Fu, Q. J., & Galvin, J. J. (2007). Vocal emotion recognition by normal-hearing listeners and cochlear implant users. Trends in Amplification, 11, 301–315.

Makarova, V., & Petrushin, V. A. (2002, September). RUSLANA: A database of Russian emotional utterances. Paper presented at the International Conference of Spoken Language Processing, Colorado, USA.

Maurage, P., Joassin, F., Philippot, P., & Campanella, S. (2007). A validated battery of vocal emotional expressions. Neuropsychological Trends, 2, 63–74.

Mitchell, R. L., Elliott, R., Barry, M., Cruttenden, A., & Woodruff, P. W. (2004). Neural response to emotional prosody in schizophrenia and in bipolar affective disorder. British Journal of Psychiatry, 184, 223–230.

Niimi, Y., Kasamatsu, M., Nishinoto, T., & Araki, M. (2001, August). Synthesis of emotional speech using prosodically balanced VCV segments. Paper presented at the 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis, Perthshire, Scotland.

O'Brien, N. (2011). Stanislavski in practice: Exercises for students. New York, NY: Routledge.

Pakosz, M. (1983). Attitudinal judgments in intonation: Some evidence for a theory. Journal of Psycholinguistic Research, 12, 311–326.

Paulmann, S., Pell, M. D., & Kotz, S. A. (2008). How aging affects the recognition of emotional speech. Brain and Language, 104, 262–269. doi:10.1016/j.bandl.2007.03.002

Pell, M. D. (2001). Influence of emotion and focus location on prosody in matched statements and questions. Journal of the Acoustical Society of America, 109, 1668–1680. doi:10.1121/1.1352088

Pell, M. D. (2002). Evaluation of nonverbal emotion in face and voice: Some preliminary findings on a new battery of tests. Brain and Cognition, 48, 499–514.

Pell, M. D., Jaywant, A., Monetta, L., & Kotz, S. A. (2011). Emotional speech processing: Disentangling the effects of prosody and semantic cues. Cognition and Emotion, 25, 834–853. doi:10.1080/02699931.2010.516915

Pell, M. D., & Kotz, S. A. (2011). On the time course of vocal emotion recognition. PLoS ONE, 6, e27252. doi:10.1371/journal.pone.0016505

Pell, M. D., Paulmann, S., Dara, C., Alasseri, A., & Kotz, S. A. (2009). Factors in the recognition of vocally expressed emotions: A comparison of four languages. Journal of Phonetics, 37, 417–435.

Pell, M. D., & Skorup, V. (2008). Implicit processing of emotional prosody in a foreign versus native language. Speech Communication, 50, 519–530.

Petrushin, V. (1999, November). Emotion in speech: Recognition and application to call centers. Paper presented at the Conference on Artificial Neural Networks in Engineering, St. Louis, USA.

Roach, P. (2000, September). Techniques for the phonetic description of emotional speech. Paper presented at the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, Newcastle, Northern Ireland, UK.

Roach, P., Stibbard, R., Osborne, J., Arnfield, S., & Setter, J. (1998). Transcription of prosodic and paralinguistic features of emotional speech. Journal of the International Phonetic Association, 28, 83–94.

Russ, J. B., Gur, R. C., & Bilker, W. B. (2008). Validation of affective and neutral sentence content for prosodic testing. Behavior Research Methods, 40, 935–939. doi:10.3758/BRM.40.4.935

Russell, J. A. (1994). Is there universal recognition of emotion from facial expressions? A review of the cross-cultural studies. Psychological Bulletin, 115, 102–141. doi:10.1037/0033-2909.115.1.102

Scherer, K. R. (1986). Vocal affect expression: A review and a model for future research. Psychological Bulletin, 99, 143–165. doi:10.1037/0033-2909.99.2.143

Scherer, K. R., Banse, R., Wallbott, H. G., & Goldbeck, T. (1991). Vocal cues in emotion encoding and decoding. Motivation and Emotion, 15, 123–148.

Scherer, K. R., Ladd, D. R., & Silverman, K. E. A. (1984). Vocal cues to speaker affect: Testing two models. Journal of the Acoustical Society of America, 76, 1346–1356. doi:10.1121/1.391450

Schmuckler, M. A. (2001). What is ecological validity? A dimensional analysis. Infancy, 2, 419–436.

Schneider, W., Eschman, A., & Zuccolotto, A. (2002). E-Prime 1.0 user's guide. Pittsburgh, PA: Psychological Software Tools.

Sims-Williams, N., & Bailey, H. W. (Eds.). (2002). Indo-Iranian languages and peoples. Oxford, UK: Oxford University Press.

Tanenhaus, M. K., & Brown-Schmidt, S. (2007). Language processing in the natural world. Philosophical Transactions of the Royal Society B, 363, 1105–1122. doi:10.1098/rstb.2007.2162

Ververidis, D., & Kotropoulos, C. (2003, October). A state of the art review on emotional speech databases. Paper presented at the 1st Richmedia Conference, Lausanne, Switzerland.

Wallbott, H. G., & Scherer, K. R. (1986). How universal and specific is emotional experience? Evidence from 27 countries on five continents. Social Science Information, 25, 763–795.

Wilson, D., & Wharton, T. (2006). Relevance and prosody. Journal of Pragmatics, 38, 1559–1579.

Yu, F., Chang, E., Xu, Y., & Shum, H. Y. (2001, October). Emotion detection from speech to enrich multimedia content. Paper presented at the 2nd IEEE Pacific Rim Conference on Multimedia, London, United Kingdom.
