
26th Annual Conference of the International Association for Forensic Phonetics and Acoustics

Split, Croatia, 9th – 12th July 2017

Book of Abstracts

Faculty of Humanities and Social Sciences, University of Split

Croatian Philological Association

Faculty of Humanities and Social Sciences, University of Zagreb

Publisher: Croatian Philological Association, Zagreb

Editors: Gordana Varošanec-Škarić, Anita Runjić-Stoilova

Proofreading: Diana Tomić

Cover Design: Zoran Stoilov

Print: Redak d.o.o.

Number of Copies: 100

ISBN 978-953-296-139-3

The book was printed in July 2017. This book is not for sale.


IAFPA 2017

Organising Committee

1. Anita Runjić-Stoilova (chair) – University of Split, Croatia
2. Tina Cambier-Langeveld – Netherlands Forensic Institute
3. Gordana Varošanec-Škarić – University of Zagreb, Croatia
4. Zdravka Biočina – University of Zagreb, Croatia
5. Jelena Novaković – University of Split, Croatia

Scientific Committee

1. Peter French (honorary chair) – University of York, UK
2. Gordana Varošanec-Škarić (chair) – University of Zagreb, Croatia
3. Francis Nolan – Cambridge University, UK
4. Hermann Künzel – Marburg University, Germany
5. Angelika Braun – Trier University, Germany
6. Volker Dellwo – Zurich University, Switzerland
7. Paul Foulkes – University of York, UK
8. Geoff Morrison – Independent forensic consultant, University of Alberta, Canada
9. Jonas Lindh – Gothenburg University, Sweden
10. Michael Jessen – Bundeskriminalamt, Germany
11. Ruth Bahr – University of South Florida, USA
12. Sylvia Moosmüller – Acoustics Research Institute, Austria
13. Tina Cambier-Langeveld – Netherlands Forensic Institute
14. Marko Liker – University of Zagreb, Croatia
15. Gabrijela Kišiček – University of Zagreb, Croatia


Content

Introduction .........................................................................................11

Programme .........................................................................................13

Keynote Talk

Damir Kovačić
Voice gender identification in cochlear implant users .........................23

Peter French, Philip Harrison, Christin Kirchhübel, Richard Rhodes, and Jessica Wormald
From Receipt of Recordings to Dispatch of Report: Opening the Blinds on Lab Practices .................................................29

Jos Vermeulen, and Tina Cambier-Langeveld
Outstanding cases: about case reports with a “strong” conclusion ...................................................................31

Isolde Wagner
The BKA Standard Operation Procedure of Forensic Speaker Comparison and Examples of Case Work ...........................34

Helen Fraser
Forensic Transcription: Where to from here to create a better system for handling of indistinct covert recordings? .............37

Richard Rhodes, Peter French, Philip Harrison, Christin Kirchhübel, and Jessica Wormald
Which questions, propositions and ‘relevant populations’ should a speaker comparison expert assess? ..................................40

Maria Sundqvist, Therese Leinonen, Jonas Lindh, and Joel Åkesson
Blind Test Procedure to Avoid Bias in Perceptual Analysis for Forensic Speaker Comparison Casework .....................................45

Vincent J. van Heuven, and Paula Cortés
Speaker specificity of filled pauses compared with vowels and consonants in Dutch ........................................................48

Erica Gold, Sula Ross, and Kate Earnshaw
Delimiting the West Yorkshire population: Examining the regional-specificity of hesitation markers......................................50


Sandra Ferrari Disner, and Andrés Benítez
Case study: Earwitness reliability through a lens of psycholinguistics and acoustics..........................................................53

Lei He, and Volker Dellwo
Between-speaker intensity variability is maintained in different frequency bands of amplitude demodulated signal ............................55

Radek Skarnitzl, and Alžběta Růžičková
The malleability of speech production: An examination of sophisticated voice disguise .................................59

Kieran Dorreen, and Vica Papp
Bilingual speakers’ long-term fundamental frequency distributions as cross-linguistic speaker discriminants ...............................................61

Nadja Tschäpe, Michael Jessen, and Stefan Gfroerer
Analysis of i-vector-based false-accept trials in a dialect-labelled telephone corpus .....................................................65

Willemijn Heeren
Speaker-dependency of /s/ in spontaneous telephone conversation ......................................................................68

Finnian Kelly, Oscar Forth, Alankar Atreya, Samuel Kent, and Anil Alexander
What your voice says about you: Automatic Speaker Profiling using i-vectors ......................................72

Cuiling Zhang, Geoffrey Stewart Morrison, and Ewald Enzinger
Forensic voice comparison: Older sister or younger sister?...............76

Anil Alexander, Oscar Forth, Alankar Atreya, Samuel Kent, and Finnian Kelly
Not a Lone Voice: Automatically Identifying Speakers in Multi-Speaker Recordings ..............................................................80

Vincent Hughes, Philip Harrison, Paul Foulkes, Peter French, Colleen Kavanagh, and Eugenia San Segundo
The complementarity of automatic, semi-automatic, and phonetic measures of vocal tract output in forensic voice comparison ...................................................83


Jonas Lindh, Andreas Nautsch, Therese Leinonen, and Joel Åkesson
Comparison Between Perceptual and Automatic Systems on Finnish Phone Speech Data (FinEval1) – a pilot test using score simulations .....................................................86

Georgina Brown, and Dominic Watt
Strengths and Weaknesses of using Feature Selection in Automatic Accent Recognition ........................................................88

Wang Li, Kang Jintao, Li Jingyang, and Wang Xiaodi
Speaker-specific dynamic features of diphthongs in Standard Chinese ...........................................................................91

Jun-jie Yang
Comparing Chinese Identical Twins’ Speech Using Frequent Speech Acoustic Characteristics .........................................96

Elliott Land, and Erica Gold
Speaker Identification Using Laughter in a Close Social Network .........................................................................99

Kostis Dimos, Volker Dellwo, and Lei He
Rhythm and speaker-specific variability in shouted speech ............................................................................102

Thayabaran Kathiresan, and Volker Dellwo
Cepstral Dynamics in MFCCs using Conventional Deltas for Emotion and Speaker Recognition...................................105

Zdravka Biočina, and Gordana Varošanec-Škarić
Speaker recognition from the island of Brač.....................................109

Posters

Sandra Schwab, Michael S. Amato, Volker Dellwo, and Marianela Fernández Trinidad
Can we hear nicotine craving? .........................................................115

Potapova Rodmonga, Agibalova Tatiana, Bobrov Nikolay, and Zabello Natalia
Perceptual auditory speech features of drug-intoxicated female speakers (preliminary results) ....................118


Elisa Pellegrino, Lei He, and Volker Dellwo
The effect of aging on between-speaker rhythmic variability...........................................................................................122

Therese Leinonen, Jonas Lindh, and Joel Åkesson
Creating Linguistic Feature Set Templates for Perceptual Forensic Speaker Comparison in Finnish and Swedish ........................................................................126

Kirsty McDougall, and Martin Duckworth
Fluency Profiling for Forensic Speaker Comparison: A Comparison of Syllable- and Time-Based Approaches .................129

Eugenia San Segundo, Lei He, and Volker Dellwo
Prosody can help distinguish identical twins: implications for forensic speaker comparison...................................132

Gea de Jong-Lendle, Roland Kehrein, Frederike Urke, Janina Mołczanow, Anna Lena Georg, Belinda Fingerling, Sarah Franchini, Olaf Köster, and Christiane Ulbrich
Language identification from a foreign accent in German ............................................................................135

Dominic Watt, Megan Jenkins, and Georgina Brown
Performance of human listeners vs. the Y-ACCDIST automatic accent classifier in an accent authentication task............................................................................139

Gordana Varošanec-Škarić, Iva Bašić, and Gabrijela Kišiček
Comparison of vowel space of male speakers of Croatian, Serbian and Slovenian language ......................................142

Michael Jessen
A study on language differences in the score distributions of automatic speaker recognition systems ........................................147

Finnian Kelly, and John H. L. Hansen
Automatic detection of the Lombard effect ........................................150

Vincent Hughes, and Jessica Wormald
WikiDialects: a resource for assessing typicality in forensic voice comparison ............................................................153


Homa Asadi, Lei He, Elisa Pellegrino, and Volker Dellwo
Between-speaker rhythmic variability in Persian ...............................155

Eugenia San Segundo, Almut Braun, Vincent Hughes, and Paul Foulkes
Speaker-similarity perception of Spanish twins and non-twins by native speakers of Spanish, German and English ........................................................................159

Lei He, and Volker Dellwo
Speaker-specific temporal organizations of intensity contours .........................................................................163

Milana Milošević, and Željko Nedeljković
Emotional speech databases in Slavic languages – an overview ......................................................................................167

Kristina Tomić
Cross-Language Accent Analysis for Determination of Origin ..........171

Sandro Bizozzero, Nele Netzschwitz, and Adrian Leemann
The effect of fundamental frequency f0, syllable rate and pitch range on listeners’ perception of fear in a female speaker’s voice ..................................................................173

Francesca Hippey, and Erica Gold
Detecting remorse in the voice: A preliminary investigation into the perception of remorse using a voice line-up methodology ................................................................179

Sarah Franchini
Construction of a voice profile: An acoustic study of /l/ .....................183

Linda Albers, and Willemijn Heeren
Indexical information as a function of linguistic condition: does prosodic prominence affect speaker-specificity? ..........................................................................187

Katharina Klug
Refining the Vocal Profile Analysis (VPA) scheme for forensic purposes ......................................................................190

Anna Lena Georg
The effect of dialect on age estimation ..............................................192


Benjamin Cohen-Lhyver, Sylvain Argentieri, and Bruno Gas
Speaker Identification Enhancement by Inclusion of Perceptual Context: an Application of the Head Turning Modulation Model ................................................................195

Belinda Fingerling
Constructing a voice profile: Reconstruction of the L1 vowel set for a L2 speaker ...........................................................197


Introduction

Dear participants of the 26th Annual Conference of the International Association for Forensic Phonetics and Acoustics, dear colleagues and friends!

We are happy to welcome you on behalf of the Organizing and Scientific Committee of the 26th Annual Conference of the IAFPA in Split.

Although this year we do not celebrate an anniversary, as we did last year at the wonderful 25th Conference in York, Great Britain, this is the first conference taking place in Croatia, in Split. Therefore, we feel Mediterranean joy at becoming a part of this tradition. There are 94 authors participating at the 26th Annual IAFPA Conference with 53 papers, of which 27 are talks and 26 are poster presentations. This conference in Split was organized by the Department of Phonetics, University of Zagreb, and the Department of Croatian Language, University of Split. The authors come from 19 different countries: the majority from the UK, but also from Germany, Switzerland, the Netherlands, China, Sweden, the USA, Croatia, Serbia, Australia, the Czech Republic, Spain, France, Hungary, Iran, Malta, New Zealand, Poland and Russia. The number of young authors present at the conference is quite promising; they stand next to the doyens of forensic science and experienced practitioners. Some papers indicate that researchers and practitioners in forensic speech science are not familiar with research from other laboratories and research centers. Thus, the annual conferences organized by IAFPA definitely foster information exchange. I hope that this gathering in Split will also help networking, especially the creation of new connections between young researchers and practitioners from famous and renowned laboratories. I believe that the young phoneticians and experts in acoustics will deepen their knowledge and gain valuable experience in future joint research.


My wish is that the papers presented in Split will be remembered as the Gay or Happy Science.

On behalf of the Organizing and Scientific Committees, I wish you a pleasant stay in Split and, moreover, a fruitful conference with plenty of ideas!

Gordana Varošanec-Škarić
Conference Chair


Programme

SUNDAY, 9th JULY 2017

18:30 – 20:00 Pre-conference reception

Croatian National Theatre, Trg Gaje Bulata 1, Split

MONDAY, 10th JULY 2017

9:00 – 9:30 Introduction

Chair: Jonas Lindh

9:30 - 9:55

Peter French, Philip Harrison, Christin Kirchhübel, Richard Rhodes, and Jessica Wormald: From Receipt of Recordings to Dispatch of Report: Opening the Blinds on Lab Practices

9:55 - 10:20 Jos Vermeulen, and Tina Cambier-Langeveld: Outstanding cases: about case reports with a “strong” conclusion

10:20 - 10:45 Isolde Wagner: The BKA Standard Operation Procedure of Forensic Speaker Comparison and Examples of Case Work

10:45 - 11:10 Helen Fraser: Forensic Transcription: Where to from here to create a better system for handling of indistinct covert recordings?

11:10 – 11:40 COFFEE BREAK

Chair: Tina Cambier-Langeveld

11:40 - 12:05

Richard Rhodes, Peter French, Philip Harrison, Christin Kirchhübel, and Jessica Wormald: Which questions, propositions and ‘relevant populations’ should a speaker comparison expert assess?


12:05 - 12:30

Maria Sundqvist, Therese Leinonen, Jonas Lindh, and Joel Åkesson: Blind Test Procedure to Avoid Bias in Perceptual Analysis for Forensic Speaker Comparison Casework

12:30 - 12:55 Vincent J. van Heuven, and Paula Cortés: Speaker specificity of filled pauses compared with vowels and consonants in Dutch

12:55 - 13:20 Erica Gold, Sula Ross, and Kate Earnshaw: Delimiting the West Yorkshire population: Examining the regional-specificity of hesitation markers

13:20 - 14:20 LUNCH

14:20 – 15:20 Poster session 1

Chair: Erica Gold

15:20 - 15:45 Sandra Ferrari Disner, and Andrés Benítez: Case study: Earwitness reliability through a lens of psycholinguistics and acoustics

15:45 - 16:10 Lei He, and Volker Dellwo: Between-speaker intensity variability is maintained in different frequency bands of amplitude demodulated signal

16:10 – 16:30 COFFEE BREAK

16:30 – 16:55 Radek Skarnitzl, and Alžběta Růžičková: The malleability of speech production: An examination of sophisticated voice disguise

16:55 – 17:20 Kieran Dorreen, and Vica Papp: Bilingual speakers’ long-term fundamental frequency distributions as cross-linguistic speaker discriminants

18:30 – 20:00 Ivan Meštrović Gallery – sightseeing
Šetalište Ivana Meštrovića 46, Split


TUESDAY, 11th JULY 2017

Chair: Peter French

9:30 – 9:55 Nadja Tschäpe, Michael Jessen, and Stefan Gfroerer: Analysis of i-vector-based false-accept trials in a dialect-labelled telephone corpus

9:55 – 10:20 Willemijn Heeren: Speaker-dependency of /s/ in spontaneous telephone conversation

10:20 – 10:45

Finnian Kelly, Oscar Forth, Alankar Atreya, Samuel Kent, and Anil Alexander: What your voice says about you: Automatic Speaker Profiling using i-vectors

10:45 – 11:10 Cuiling Zhang, Geoffrey Stewart Morrison, and Ewald Enzinger: Forensic voice comparison: Older sister or younger sister?

11:10 – 11:30 COFFEE BREAK

Chair: Richard Rhodes

11:30 – 11:55

Anil Alexander, Oscar Forth, Alankar Atreya, Samuel Kent, and Finnian Kelly: Not a Lone Voice: Automatically Identifying Speakers in Multi-Speaker Recordings

11:55 – 12:20

Vincent Hughes, Philip Harrison, Paul Foulkes, Peter French, Colleen Kavanagh, and Eugenia San Segundo: The complementarity of automatic, semi-automatic, and phonetic measures of vocal tract output in forensic voice comparison

12:20 – 12:45

Jonas Lindh, Andreas Nautsch, Therese Leinonen, and Joel Åkesson: Comparison Between Perceptual and Automatic Systems on Finnish Phone Speech Data (FinEval1) – a pilot test using score simulations


12:45 – 13:10 Georgina Brown, and Dominic Watt: Strengths and Weaknesses of using Feature Selection in Automatic Accent Recognition

13:10 – 14:10 LUNCH

14:10 – 15:10 Poster session 2 – work in progress/student posters

Chair: Kirsty McDougall

15:10 – 15:35 Wang Li, Kang Jintao, Li Jingyang, and Wang Xiaodi: Speaker-specific dynamic features of diphthongs in Standard Chinese

15:35 – 16:00 Jun-jie Yang: Comparing Chinese Identical Twins’ Speech Using Frequent Speech Acoustic Characteristics

16:00 – 16:20 COFFEE BREAK

16:00 – 17:30 AGM

18:15 – 00:00 Trip and Conference dinner
Island of Brač, Lovrečina bay

19:45 – 20:15 Drinks reception

20:30 – 00:00 Dinner and entertainments

WEDNESDAY, 12th JULY 2017

Chair: Gabrijela Kišiček

9:30 – 10:30 KEYNOTE: Damir Kovačić: Voice gender identification in cochlear implant users

10:30 – 10:50 COFFEE BREAK

Chair: Gordana Varošanec-Škarić


10:50 - 11:15 Elliott Land, and Erica Gold: Speaker Identification Using Laughter in a Close Social Network

11:15 - 11:40 Kostis Dimos, Volker Dellwo, and Lei He: Rhythm and speaker-specific variability in shouted speech

11:40 - 12:05 Thayabaran Kathiresan, and Volker Dellwo: Cepstral Dynamics in MFCCs using Conventional Deltas for Emotion and Speaker Recognition

12:05 - 12:30 Zdravka Biočina, and Gordana Varošanec-Škarić: Speaker recognition from the island of Brač

Monday 10th July Poster Session 1 (Posters / Work in progress)

14:20 - 15:20

Sandra Schwab, Michael S. Amato, Volker Dellwo, and Marianela Fernández Trinidad

Can we hear nicotine craving?

Potapova Rodmonga, Agibalova Tatiana, Bobrov Nikolay, and Zabello Natalia

Perceptual auditory speech features of drug-intoxicated female speakers (preliminary results)

Elisa Pellegrino, Lei He, and Volker Dellwo

The effect of aging on between-speaker rhythmic variability

Therese Leinonen, Jonas Lindh, and Joel Åkesson

Creating Linguistic Feature Set Templates for Perceptual Forensic Speaker Comparison in Finnish and Swedish

Kirsty McDougall, and Martin Duckworth

Fluency Profiling for Forensic Speaker Comparison: A Comparison of Syllable- and Time-Based Approaches

Eugenia San Segundo, Lei He, and Volker Dellwo

Prosody can help distinguish identical twins: implications for forensic speaker comparison


Gea de Jong-Lendle, Roland Kehrein, Frederike Urke, Janina Mołczanow, Anna Lena Georg, Belinda Fingerling, Sarah Franchini, Olaf Köster, and Christiane Ulbrich

Language identification from a foreign accent in German

Dominic Watt, Megan Jenkins, and Georgina Brown

Performance of human listeners vs. the Y-ACCDIST automatic accent classifier in an accent authentication task

Gordana Varošanec-Škarić, Iva Bašić, and Gabrijela Kišiček

Comparison of vowel space of male speakers of Croatian, Serbian and Slovenian language

Michael Jessen

A study on language differences in the score distributions of automatic speaker recognition systems

Finnian Kelly, and John H. L. Hansen

Automatic detection of the Lombard effect

Vincent Hughes, and Jessica Wormald

WikiDialects: a resource for assessing typicality in forensic voice comparison

Homa Asadi, Lei He, Elisa Pellegrino, and Volker Dellwo

Between-speaker rhythmic variability in Persian

Eugenia San Segundo, Almut Braun, Vincent Hughes, and Paul Foulkes

Speaker-similarity perception of Spanish twins and non-twins by native speakers of Spanish, German and English

Lei He, and Volker Dellwo

Speaker-specific temporal organizations of intensity contours


Tuesday 11th July Poster Session 2 - Work in progress/Student posters

14:10 - 15:10

Milana Milošević, and Željko Nedeljković

Overview of databases of emotional speech in Slavic languages

Kristina Tomić

Cross-Language Accent Analysis for Determination of Origin

Sandro Bizozzero, Nele Netzschwitz, and Adrian Leemann

The effect of fundamental frequency f0, syllable rate and pitch range on listeners’ perception of fear in a female speaker’s voice

Francesca Hippey, and Erica Gold

Detecting remorse in the voice: A preliminary investigation into the perception of remorse using a voice line-up methodology

Sarah Franchini

Construction of a voice profile: An acoustic study of /l/

Linda Albers, and Willemijn Heeren

Indexical information as a function of linguistic condition: does prosodic prominence affect speaker-specificity?

Katharina Klug

Refining the Vocal Profile Analysis (VPA) scheme for forensic purposes

Anna Lena Georg

The effect of dialect on age estimation

Benjamin Cohen-Lhyver, Sylvain Argentieri, and Bruno Gas

Speaker Identification Enhancement by Inclusion of Perceptual Context: an Application of the Head Turning Modulation Model

Belinda Fingerling

Constructing a voice profile: Reconstruction of the L1 vowel set for a L2 speaker


Keynote Talk



Voice gender identification in cochlear implant users

Damir Kovačić

Department of Physics, Faculty of Science, University of Split
[email protected]

The cochlear implant (CI) is currently the prevailing neuro-prosthetic treatment for partial restoration of hearing in deaf people. A CI comprises a linear electrode array containing up to 26 electrodes that is surgically inserted into the cochlea, typically within the perilymph-filled scala tympani (Rask-Andersen et al., 2012). This arrangement functionally bypasses some or all of the approximately 3400 malfunctioning hair cells in the inner ear by directly stimulating the preserved spiral ganglion neurons which form the auditory nerve (Zeng, Rebscher, Harrison, Sun, & Feng, 2008). Acoustic information about the perceived sound is encoded as electrical pulse trains determined by spectro-temporal processing based on physiological and engineering principles of hearing (Loizou, 1998). CIs often result in good outcomes, with high speech perception scores and the joy of listening to music, albeit typically only in ideal conditions with minimal background noise (Wilson & Dorman, 2008). However, a hallmark of CI outcomes is that the efficacy of electrical stimulation via a CI is highly variable, resulting in an extremely broad range of speech perception scores in clinical assessments of hearing and in specific experimental procedures (Dorman & Spahr, 2006; Sarant, Blamey, Dowell, Clark, & Gibson, 2001). We have shown that this enormous variability extends even to simple perceptual tasks such as voice gender identification (Kovačić & Balaban, 2009, 2010).
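The spectro-temporal processing mentioned above can be pictured with a toy sketch: the signal is split into a few frequency bands and each band's slowly varying envelope is extracted, the quantity that would modulate the electrode pulse trains in a real device. The band edges, filter order and envelope method below are illustrative assumptions, not clinical parameter values or any implant's actual processing.

```python
# Toy sketch of a CI-style band-envelope front end (illustrative only).
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def band_envelopes(signal, fs, edges=(300, 700, 1500, 3000, 6000)):
    """Return one amplitude envelope per band (a crude n-channel vocoder front end)."""
    envelopes = []
    for low, high in zip(edges[:-1], edges[1:]):
        # 4th-order band-pass filter for this channel
        sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, signal)
        # envelope via the analytic signal; in an implant this would
        # modulate the pulse train delivered on one electrode
        envelopes.append(np.abs(hilbert(band)))
    return np.array(envelopes)

fs = 16_000
t = np.arange(fs) / fs
toy = np.sin(2 * np.pi * 500 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
print(band_envelopes(toy, fs).shape)  # (4, 16000): four channel envelopes
```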

In this talk, I will present voice gender identification as an experimental perceptual paradigm to assess the role of temporal and spectral information in the transmission of voice pitch. Voice pitch is one of the most important indexical cues transmitting information about the vocal properties of speakers that allows listeners to recognize a speaker’s identity, sex and body size. Indexical cues are represented by spectro-temporal modulations in the speech signal which are affected by vocal tract length (VTL) and by laryngeal fold size, which determines the fundamental frequency (F0) of the speaker. When combined, VTL and F0 information produce almost perfect recognition of voice gender in normal-hearing subjects (Bachorowski & Owren, 1999; Owren, Berkowitz, & Bachorowski, 2007; Smith, Walters, & Patterson, 2007) and have been shown to produce impressive performance in automated voice recognition systems (Childers & Wu, 1991; Wu & Childers, 1991). In contrast, CI users show wide variation in performance when identifying individual voices and their gender (Cleary & Pisoni, 2002; Cleary, Pisoni, & Kirk, 2005; Fu, Chinchilla, Nogaki, & Galvin, 2005; Fuller et al., 2014; Kovačić & Balaban, 2009, 2010; Vongphoe & Zeng, 2005). Finally, I will show how findings from research on voice gender identification, in particular with CI users, may be used to address the challenges of determining the identity, sex and body size of speakers in acoustic forensics.
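As a minimal illustration of how these two cues could be combined, the sketch below classifies a voice from its mean F0 and a VTL proxy (average formant spacing). The function name and the midpoint thresholds are assumptions chosen for illustration; they are not taken from the studies cited above.

```python
# Naive two-cue voice gender classifier (illustrative thresholds only).
def classify_voice_gender(f0_hz: float, formant_spacing_hz: float) -> str:
    """Return 'male' or 'female' from mean F0 and average formant spacing.

    Wider formant spacing indicates a shorter vocal tract (VTL cue),
    which, like higher F0, correlates with female voices.
    """
    # Rough population midpoints: ~155 Hz for F0, ~1100 Hz for spacing.
    votes = 0
    votes += 1 if f0_hz > 155.0 else -1
    votes += 1 if formant_spacing_hz > 1100.0 else -1
    return "female" if votes > 0 else "male"

print(classify_voice_gender(210.0, 1180.0))  # -> female
print(classify_voice_gender(110.0, 980.0))   # -> male
```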

References

Bachorowski, J. A., & Owren, M. J. (1999). Acoustic correlates of talker sex and individual talker identity are present in a short vowel segment produced in running speech. J Acoust Soc Am, 106(2), 1054–1063.

Childers, D. G., & Wu, K. (1991). Gender recognition from speech. Part II: Fine analysis. J Acoust Soc Am, 90(4 Pt 1), 1841–1856.

Cleary, M., & Pisoni, D. B. (2002). Talker discrimination by prelingually deaf children with cochlear implants: preliminary results. Ann Otol Rhinol Laryngol Suppl, 189, 113–118.

Cleary, M., Pisoni, D. B., & Kirk, K. I. (2005). Influence of voice similarity on talker discrimination in children with normal hearing and children with cochlear implants. J Speech Lang Hear Res, 48(1), 204–223.

Dorman, M. F., & Spahr, A. J. (2006). Speech Perception by Adults with Multichannel Cochlear Implants. In S. B. Waltzman & J. T. Roland (Eds.), Cochlear Implants (2nd ed., pp. 193–204). New York: Thieme Medical Publishers.

Fu, Q. J., Chinchilla, S., Nogaki, G., & Galvin, J. J. (2005). Voice gender identification by cochlear implant users: the role of spectral and temporal resolution. J Acoust Soc Am, 118(3), 1711–1718.

Fuller, C. D., Gaudrain, E., Clarke, J. N., Galvin, J. J., Fu, Q.-J., Free, R. H., & Başkent, D. (2014). Gender categorization is abnormal in cochlear implant users. Journal of the Association for Research in Otolaryngology : JARO, 15(6), 1037–48. http://doi.org/10.1007/s10162-014-0483-7


Kovačić, D., & Balaban, E. (2009). Voice gender perception by cochlear implantees. J Acoust Soc Am, 126(2), 762–775.

Kovačić, D., & Balaban, E. (2010). Hearing History Influences Voice Gender Perceptual Performance in Cochlear Implant Users. Ear and Hearing, 31(6), 806–814. http://doi.org/10.1097/AUD.0b013e3181ee6b64

Loizou, P. C. (1998). Mimicking the human ear. IEEE Signal Process Mag, 15(5), 101–130.

Owren, M. J., Berkowitz, M., & Bachorowski, J.-A. (2007). Listeners judge talker sex more efficiently from male than from female vowels. Percept Psychophys, 69(6), 930–941.

Rask-Andersen, H., Liu, W., Erixon, E., Kinnefors, A., Pfaller, K., Schrott-Fischer, A., & Glueckert, R. (2012). Human cochlea: anatomical characteristics and their relevance for cochlear implantation. Anatomical Record (Hoboken, N.J. : 2007), 295(11), 1791–811. http://doi.org/10.1002/ar.22599

Sarant, J. Z., Blamey, P. J., Dowell, R. C., Clark, G. M., & Gibson, W. P. (2001). Variation in speech perception scores among children with cochlear implants. Ear Hear, 22(1), 18–28.

Smith, D. R. R., Walters, T. C., & Patterson, R. D. (2007). Discrimination of speaker sex and size when glottal-pulse rate and vocal-tract length are controlled. J Acoust Soc Am, 122, 3628–3639.

Vongphoe, M., & Zeng, F.-G. (2005). Speaker recognition with temporal cues in acoustic and electric hearing. J Acoust Soc Am, 118(2), 1055–1061.

Wilson, B. S., & Dorman, M. F. (2008). Cochlear implants: a remarkable past and a brilliant future. Hear Res, 242(1–2), 3–21.

Wu, K., & Childers, D. G. (1991). Gender recognition from speech. Part I: Coarse analysis. J Acoust Soc Am, 90(4 Pt 1), 1828–1840.

Zeng, F.-G., Rebscher, S., Harrison, W., Sun, X., & Feng, H. (2008). Cochlear Implants: System Design, Integration, and Evaluation. IEEE Reviews in Biomedical Engineering, 1, 115–142.

Oral Presentations

Abstracts are presented in the running order of the programme.


From Receipt of Recordings to Dispatch of Report: Opening the Blinds on Lab Practices

Peter French, Philip Harrison, Christin Kirchhübel, Richard Rhodes, and Jessica Wormald

J P French Associates & Dept. of Language & Linguistic Science, University of York, UK

{peter.french|firstname.surname}@jpfrench.com

Recent surveys have made available information about the relative prevalence and geographical distribution of the various approaches to forensic speaker comparison (Gold and French, 2011; Morrison et al., 2016). However, given constraints on the time of participating colleagues, the information yielded by these surveys has been of a rather general nature. Further, some government/police laboratories operate under strictures of confidentiality and, in a few cases at least, this extends to non-disclosure of certain aspects of analytic practice (Gold and French, in prep).

An initiative towards ‘opening the blinds’ on casework practices was taken by the Bundeskriminalamt Forensic Speaker Identification Lab in Wiesbaden in 2002 [1]. This took the form of a meeting principally concerning automatic methods, but at which practitioners from European countries were invited to describe normal operating procedures in their own labs. This was followed by a small-scale but promising survey of methods by Cambier-Langeveld in 2007. Since that time, however, protocols and working practices have moved on, but these initiatives have not been systematically followed up or continued.

A consequence of this is that, despite a widely acknowledged need for greater transparency in all aspects of forensic science, practitioners and researchers in forensic speech science still know very little of what actually happens in different laboratories.

[1] Stand der Technik und Evaluierung aktueller Verfahren zur automatischen forensischen Sprecheridentifizierung und Authentisierung, 7th – 29th May.


This presentation describes in some detail the stages in processing recordings in a typical forensic speaker comparison case in the authors’ laboratory. Beginning with the receipt of recordings, it describes the preparatory work undertaken, the speech parameters that are examined, the techniques used to analyse them, the recording and collating of results, the checking procedures in place, and the evaluation of the evidence and formulation of conclusions for the ensuing report. Each stage of the work is illustrated with material - sound files and records of examinations - from a real case.

It is intended that the presentation will both stimulate discussion of working practices and be followed by other labs also ‘opening the blinds’.

References

Gold, E. & French, P. (2011). International practices in forensic speaker comparison. International Journal of Speech Language and the Law, 18(2), 293–307.

Gold, E. & French, P. (in prep) International practices in forensic speaker comparison: follow up survey.

Morrison, G. S., Sahito, F. H., Jardine, G., Djokic, D., Clavet, S., Berghs, S., & Goemans Dorny, C. (2016). Interpol Survey of the Use of Speaker Identification by Law Enforcement Agencies. Forensic Science International, 263, 92–100.

Cambier-Langeveld, T. (2007). Current methods in forensic speaker identification: Results of a collaborative exercise. The International Journal of Speech, Language and the Law, 14, 223–243.


Outstanding cases: about case reports with a “strong” conclusion

Jos Vermeulen, and Tina Cambier-Langeveld
Netherlands Forensic Institute, Ministry of Security and Justice, The Hague, the Netherlands
j.vermeulen|[email protected]

At the Netherlands Forensic Institute, the conclusion of a forensic speaker comparison based on the auditory-acoustic approach is nowadays reported as the strength of evidence on a verbal likelihood scale. The likelihood of the findings of the comparative examination is evaluated under two hypotheses: (1) that the questioned sample was produced by the speaker of the reference sample, and (2) that the questioned sample was produced by a different speaker (of the same sex and broadly the same linguistic background). An example of the wording of such a conclusion is:

- The findings of the examination provide strong support for the first hypothesis rather than the alternative hypothesis.

or

- The findings of the examination are much more probable under the first hypothesis than under the alternative hypothesis.

Some institutes, like the Swedish National Forensic Centre (NFC) and the Netherlands Forensic Institute (NFI), each with their own adaptations, prescribe a mapping of the verbal scale onto the numerical scale as suggested in the ENFSI 2015 Guidelines. The strength of the example conclusions above (strong support or much more probable) can, for example, be mapped onto a likelihood ratio in the range of 1,000–10,000 (ENFSI, 2015, p. 17).
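To make the mapping concrete, a small sketch follows. Only the 'strong support' band (1,000–10,000) is stated in the abstract; the other bands and all names (VERBAL_TO_LR, verbal_conclusion) are illustrative placeholders, not a quotation of the ENFSI table.

```python
# Illustrative mapping from verbal conclusions to likelihood-ratio (LR) bands.
# Only the "strong support" band is taken from the text above; the remaining
# bands are placeholders, not the official ENFSI scale.
VERBAL_TO_LR = {
    "weak support":              (2, 10),
    "moderate support":          (10, 1_00),
    "moderately strong support": (100, 1_000),
    "strong support":            (1_000, 10_000),   # per ENFSI (2015, p. 17)
    "very strong support":       (10_000, 1_000_000),
}

def verbal_conclusion(lr: float) -> str:
    """Return the verbal label whose LR band contains the given ratio."""
    for label, (low, high) in VERBAL_TO_LR.items():
        if low <= lr < high:
            return label
    return "outside the scale"

print(verbal_conclusion(3_500))  # -> strong support
```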

For this presentation, reports of forensic speaker comparison with a “strong” conclusion (minimally ‘much more probable’) were selected from cases performed at the NFI over the past 5 years. The findings of the reports were analysed in order to examine whether a certain type of finding or circumstance provided the major support for the strong conclusion.

Results

Three categories of circumstances and findings were observed that led to a strong conclusion:

1. Cases with almost identical technical, acoustic and conversational circumstances of the questioned and reference recordings, allowing optimal comparison of voice quality and other features

2. Speakers with a speech impediment

3. Speakers with highly distinct speech behaviour

Typical of forensic casework is that the circumstances of the samples to be compared are different, making the first category the exception rather than the rule. However, in some cases where, for example, wire-tapped telephone conversations are to be compared to other wire-tapped telephone conversations, strong conclusions are possible due to the optimal comparability of the samples.

In the second category of speakers with a speech impediment or possibly a speech disorder, an estimation can be made based on (medical) information about the population with the speech impediment in question (e.g. cluttering, stuttering). In the presentation, an example with voice samples will be given. Of interest in this case is that a specific alternative hypothesis was formulated by the accused: yes, the questioned sample is from someone who sounds just like me, but it is not me. Note that features such as voice quality will have low evidential value if they are to be interpreted in light of hypotheses involving sound-alikes.

The third category contains very distinct features that set the speaker apart from many other speakers. Remarkably, the features in question are often not in the linguistic domain, but in the behavioural and communicative domain: rare greetings and goodbyes, disfluencies, conversation style, etc. We will have a look at the features in question in more detail in the presentation.


Discussion

The second and third categories are especially interesting, firstly because they are independent of the spectral features used in automatic speaker comparisons, and secondly because background statistics can be obtained to ascertain the strength of the evidence. Examples will be discussed in the presentation. For some findings a comparison could be made with results from the analysis of a database.

In conclusion, analysis of the most distinctive features encountered in casework reveals that it would be fruitful to pay more attention to the analysis of behavioural features that have little to do with linguistic content. Features such as hesitation markers, breathing, backchannels, disfluencies, laughs etc. are not highly linguistically constrained and thus leave plenty of room for interspeaker variation.

References

ENFSI (2015). ENFSI guideline for evaluative reporting in forensic science. Publication of the European Network of Forensic Science Institutes. http://enfsi.eu/wp-content/uploads/2016/09/m1_guideline.pdf


The BKA Standard Operation Procedure of Forensic Speaker Comparison and Examples of Case Work

Isolde Wagner
Bundeskriminalamt, Forensic Science Institute, Germany

[email protected]

The BKA Standard Operation Procedure of Forensic Speaker Comparison is an accredited method of inspection according to the standard DIN EN ISO/IEC 17020. As illustrated in Figure 1, it combines auditory phonetic-linguistic analysis and acoustic procedures of digital audio processing. Auditory phonetic-linguistic analysis is used for detailed descriptions of speech features. Acoustic procedures are used to quantify auditory perceptions and to detect inaudible features. Within the framework of acoustic procedures, validated forensic automatic and semi-automatic speaker recognition systems are applied as additional methods for further objectivity. As quality assurance and control are required for accredited methods, results and opinions are, as a rule, examined by a second expert. The method is validated in inter-laboratory proficiency tests and collaborative exercises.

Before the method is applied, all the audio material has to be tested as to whether it satisfies the criteria of the procedure. The criteria concern quantity and quality parameters as well as match or mismatch conditions of the speech material in various respects, such as digital audio format, acoustic environment of the recordings, spoken language or situational factors affecting speech behavior. The result of this testing indicates whether, and to what extent, the audio material is appropriate for the method and what kind of influence on the results of the speaker comparison analysis can be expected. The analyses are divided into three parameter groups: (1) speech and language, (2) voice and (3) manner of speaking. ‘Speech and language’ covers phonetic, lexical and grammatical features that can be regionally, socially or individually distinctive. ‘Voice’ comprises vibration characteristics of the vocal folds as well as characteristics of the vocal tract. ‘Manner of speaking’ contains supra-segmental features such as articulation rate, speech fluency and speech disorders, speech rhythm, intonation and respiration. The evaluation of all findings results in a statement on a verbal probability scale of identity or non-identity of speakers.

Figure 1. Auditory phonetic-linguistic perception and description of speech features as well as acoustic measurements and calculations of the speech signal are applied and combined.
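As a rough illustration of the three parameter groups, case findings might be organised in a structure like the following. This is an assumed sketch for exposition, not BKA software; the class and feature names echo examples from the text.

```python
# Assumed illustration (not BKA software) of organising findings under the
# three parameter groups named above.
from dataclasses import dataclass, field

@dataclass
class SpeakerComparisonFindings:
    speech_and_language: dict = field(default_factory=dict)  # phonetic, lexical, grammatical features
    voice: dict = field(default_factory=dict)                # vocal-fold and vocal-tract characteristics
    manner_of_speaking: dict = field(default_factory=dict)   # rate, fluency, rhythm, intonation, respiration

findings = SpeakerComparisonFindings()
findings.speech_and_language["lexical"] = "regionally distinctive greeting phrase"
findings.voice["vocal_folds"] = "tense, high-pitched phonation in formal situations"
findings.manner_of_speaking["articulation_rate"] = "noticeably higher with father-in-law"
print(findings.manner_of_speaking)
```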

Five examples illustrate the application of the method under the various scenarios and conditions of forensic casework. The first example shows the treatment of the intra-speaker variability of speech features in different communication situations, when a suspected speaker talks with a much tenser and higher-pitched voice and a higher articulation rate to his father-in-law than to his wife, while the questioned speaker talks to a friend in yet another situation. The second example shows collaboration with an interpreter in the analysis of foreign-language recordings. During joint auditory analysis with the interpreter, different pronunciation patterns of a semantically empty phrase were found in the questioned and the suspect’s speech, which gave a strong indication of the probability of non-identity of the two speakers. In the third example two suspected brothers could be distinguished by differences in articulation precision, speech fluency and the type of code switching between two languages. In the fourth example questioned speaker recordings of only a few seconds were sufficient to reveal a variety of exceptional speech features and thus to indicate identity with the suspected speaker with a probability close to certainty. The last example demonstrates the integration of automatic speaker recognition into the evaluation of a speaker comparison with longer recordings.

References

Gfroerer, S. (2014). Sprechererkennung und Tonträgerauswertung. Münchner AnwaltsHandbuch Strafverteidigung (2nd edition), published by Gunter Widmaier, München: C.H. Beck, 2682–2707.

Jessen, M. (2012). Phonetische und linguistische Prinzipien des forensischen Stimmenvergleichs. München: Lincom Europa.

Künzel, H. J. (1987). Sprechererkennung. Grundzüge forensischer Sprachverarbeitung. Heidelberg: Kriminalistik Verlag.


Forensic Transcription: Where to from here to create a better system for handling of indistinct covert recordings?

Helen Fraser
Forensic Phonetics
[email protected]

Lawfully obtained covert recordings feature as evidence in criminal trials around the world. Where the audio is of poor quality, many jurisdictions allow police transcripts to assist the court in making out what is said in the recording and in attributing utterances to speakers. Unfortunately, practice for the reception of covert recordings and their transcripts has been developed largely via legal principles, with little input from forensic phonetics or linguistics. Consequently, several misunderstandings have been incorporated.

It is true that contextual knowledge can potentially enable police to make out parts of a covertly recorded conversation that may be obscure to others. However, context is a double-edged sword, as apt to induce inaccurate as accurate perception, with no diminution in hearers’ confidence (Bruce 1958, Lange et al. 2011). Further, since seeing the transcript primes subsequent listeners’ perception, it is surprisingly easy for inaccuracies to go undetected (Miller 2016, Fraser 2014).

These factors (and others) mean that police interpretations of covert recordings, while clearly essential to investigations, are not suitable for evidentiary use (cf. French and Harrison 2006). The threat to justice has been well canvassed (e.g. Fishman 2006, Bucholtz 2009, Fraser 2016).

The present paper takes a further step, raising issues that need to be considered in developing a better process for handling indistinct covert recordings used as evidence in criminal trials. The focus is on the Australian jury system but remarks might have relevance in other systems.


Most important is recognition of apparently common-sense solutions that are not as effective as expected, e.g. providing recordings with no transcript (cf. Fraser and Stevenson 2014), having police transcripts checked by an expert (cf. Wald 1995), or cautioning the jury about the effect of priming (cf. Bonifaz 2014).

Next is consideration of what is likely to work better, on the basis of existing research findings, and development of a research program to provide additional necessary information. A proposal influenced by successful developments in the area of translating and interpreting, and using a modified version of Linear Sequential Unmasking (Dror et al. 2015), is put forward for discussion.

References

Bonifaz, S. (2014). The Effect of Priming Awareness when Listening to Disputed Utterances. MSc Dissertation, Department of Language and Linguistic Science, University of York.

Bruce, D. J. (1958). The effect of listeners' anticipations on the intelligibility of heard speech. Language and Speech, 1, 79–97.

Bucholtz, M. (2009). Captured on tape: professional hearing and competing entextualizations in the criminal justice system. Text and Talk - an Interdisciplinary Journal of Language, Discourse & Communication Studies, 29(5), 503–523.

Dror, I. et al. (2015). Context management toolbox: A linear sequential unmasking (LSU) approach for minimizing cognitive bias in forensic decision making. Journal of Forensic Sciences, 60(4), 1111–1112.

Fishman, C. S. (2006). Recordings, transcripts, and translations as evidence. Washington Law Review, 81, 473–523.

Fraser, H. (2014). Transcription of indistinct forensic recordings: Problems and solutions from the perspective of phonetic science. Language and Law/Linguagem E Direito, 1(2), 5–21.

Fraser, H. (2016). Reforms needed to ensure fairness in the use of covert speech recordings in criminal trials. National Law Reform Conference, Canberra.

French, P., & Harrison, P. (2006). Investigative and evidential applications of forensic speech science. In A. Heaton-Armstrong (Ed.), Witness testimony: psychological, investigative and evidential perspectives (pp. 247–262).


Lange, N. D., Thomas, R. P., Dana, J., & Dawes, R. M. (2011). Contextual biases in the interpretation of auditory evidence. Law and Human Behavior, 35(3), 178–187.

Miller, A. E. (2016). Jury suggestibility: The misinformation effect and why courts should care about inaccuracies in transcripts that accompany recorded evidence. Law and Psychology Review, 40, 363–382.

Wald, B. (1995). The problem of scholarly predisposition: G. Bailey, N. Maynor, & P. Cukor-Avila, eds., The emergence of Black English: Text and commentary. Language in Society, 24(2), 245–257.


Which questions, propositions and ‘relevant populations’ should a speaker comparison expert assess?

Richard Rhodes, Peter French, Philip Harrison, Christin Kirchhübel, and Jessica Wormald

J P French Associates & Dept. of Language & Linguistic Science, University of York, UK

{richard.rhodes|firstname.surname}@jpfrench.com

Formulating appropriate propositions and, by doing so, defining the relevant population is central to the effectiveness of any forensic comparison. If this is not done conscientiously, according to the circumstances of each individual case, and communicated transparently, even the most valid and accurate examinations will not provide the justice system with a useful result. However, owing to the nature of forensic work and the wider justice system, it is not always clear what the right questions are. Assumptions made by forensic scientists may significantly affect the strength and nature of the evidence, and will potentially decide where the responsibility lies to consider facets of the evidence. This paper will discuss some of these issues in the context of providing forensic evaluations in the UK Criminal Justice Systems (which are adversarial legal systems). It will explain the concepts involved in light of recent debate concerning propositions for voice comparison, discuss the role of expert evidence (and how it functions in reality) and illustrate the issues using examples from real cases.

In order to carry out a comparison, the analyst must define a set, or sets, of propositions against which to test the evidence (see AFSP, 2009). Typically, determining the prosecution hypothesis (Hp) is fairly simple: ‘the suspect was the criminal speaker’. Successfully defining a ‘defence’ proposition (Hd) can be far more problematic. Yet, adopting an appropriate alternative to the prosecution hypothesis is extremely important as it defines the ‘relevant population’ against which the analyst must assess the typicality of features in the criminal recording; this in turn has a significant effect on the strength and validity of the evidence.
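For readers outside the field, the likelihood-ratio framework standard in evaluative forensic reporting (assumed background here, not spelled out in the abstract) weighs the evidence E under the two competing propositions:

```latex
\mathrm{LR} = \frac{p(E \mid H_p)}{p(E \mid H_d)}
```

The denominator is estimated with respect to the relevant population implied by Hd, which is why the choice of the alternative proposition directly changes the reported strength of evidence.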


In a recent exchange of Science & Justice articles, Hicks et al. (2015) suggest that the Hd in voice comparison cases should be formulated without reference to demographic information deduced from the criminal recording (unless this information is put forward by, for example, an instructing officer and agreed by all parties). Their argument is that findings, such as the criminal speaker’s accent, should not be included in propositions - which should be set before any analysis is carried out - especially where they rely on expert knowledge, as this creates a disparity with the court’s prior expectations about the comparison. Instead, the Hd should be developed based on agreed information about the case. In reply, Morrison et al. (2016) illustrate how this approach could make forensic voice comparison impossible or ineffective, and argue that propositions can be set before the evidential analysis based on a pre-analytical screening exercise. Hicks et al.’s (2017) response outlines how to address these concerns while keeping evidential observations and propositions separate by assessing two components in two stages: a) offender accent and b) linguistic/acoustic features and their typicality, given the accent.

We agree with some of the points raised by Morrison et al.: that the determination of the Hd from the properties of a voice sample is properly the expert’s responsibility - since the police officer (and others) may/do not have the relevant competencies to do so - and that assessments of strength of evidence are better with a better-matched background population. However, we also agree with Hicks et al. that linguistic observations about an offender’s demographic profile help to narrow down the list of potential perpetrators and thus are evidential; we acknowledge that this is an important aspect of speaker comparison casework - as evidence or in propositions - and that it receives far less attention than methods for assessing typicality of features.

We also acknowledge that specific conditions often arise in cases which require the expert to consider non-standard, and in some cases multiple sets of, propositions. In the UK at least, experts must take account of the following practical issues:

• the audience for a forensic report is not solely the trier-of-fact; reports have many wider impacts (e.g., on investigations, legal strategy, charging/appeal decisions);


• experts are often not in court to explain the process of assessing evidence according to propositions because forensic evidence in UK trials is often agreed/negotiated;

• in most cases, the ‘defence’ hypothesis is not formalised or agreed by the defence, but rather is an ‘alternative’ hypothesis formulated by the expert, and:

• it can be in the defence’s interest not to agree an alternative hypothesis;

• this hypothesis can be changed during the investigative/trial process;

• hypotheses can arise which are impossible to assess;

• discussing different Hds could even be seen as (improperly) giving legal advice!

• in cases concerning unusual hybrid varieties or combinations of features, an accurate Hd can vastly narrow down the relevant population or effectively identify a speaker.

This paper will present examples from cases where information apparent from the criminal recording and other case information (i.e., other than that deduced from the voice) is central to developing appropriate propositions. These include cases where the selection of the propositions had a significant impact on the conclusion(s), where propositions were adjusted throughout the process, and where the propositions led to potential perpetrator groups of only one or two speakers. In many of these cases, the hierarchy of propositions put forward by Cook et al. (1998; see Table 1 below) offers a framework for developing useful sets of propositions.


Table 1. Hierarchy of propositions (Cook et al., 1998); modified for FVC in a robbery case.

Level III – Offence
  Hp: Mr X committed the robbery.
  Hd: Mr X did not commit the robbery.

Level II – Activity
  Hp: Mr X is the man who demanded goods/money in the shop.
  Hd: The offender was another man (with a similar demographic profile) who had access/opportunity to rob the shop.

Level I – Source
  Hp: The offender speech on the CCTV came from Mr X.
  Hd: The offender speech on the CCTV came from another person with a similar demographic profile.

In our view, regardless of the position you take on including evidence in propositions, an important responsibility of the expert (and one of the most challenging) is to explain the assumptions made and the propositions selected in a clear and understandable way, so that the end-users can properly understand the process and the context of the conclusions they are being asked to consider, in order to avoid aspects of voice evidence being ‘double-’ or ‘half-counted’.

References

[UK] Association of Forensic Science Providers. (2009). Standards for the formulation of evaluative forensic science expert opinion. Science & Justice, 49, 161–164.

Cook, R., Evett, I. W., Jackson, G., Jones, P. J., & Lambert, J. A. (1998). A hierarchy of propositions: deciding which level to address in casework. Science & Justice, 38(4), 231-239.


Hicks, T., Biedermann, A., de Koeijer, J. A., Taroni, F., Champod, C., & Evett, I. W. (2015). The importance of distinguishing information from evidence/observations when formulating propositions. Science & Justice, 55(6), 520-525.

Hicks, T., et al. [as above] (2017). Reply to Morrison et al. (2016) [below]. Science & Justice.

Morrison, G. S., Enzinger, E., & Zhang, C. (2016). Refining the relevant population in forensic voice comparison – A response to Hicks et alii (2015) [above]. Science & Justice, 56(6), 492–497.


Blind Test Procedure to Avoid Bias in Perceptual Analysis for Forensic Speaker Comparison Casework

Maria Sundqvist, Therese Leinonen, Jonas Lindh, and Joel Åkesson

Voxalys AB, Gothenburg, Sweden
{maria|therese|jonas|joel}@voxalys.se

Research on bias has been carried out in forensic casework (Kassin, Dror, & Kukucka, 2013) and in several different disciplines (Dror & Hampikian, 2011; Dror & Rosenthal, 2008; Nakhaeizadeh, Dror, & Morgan, 2014). When it comes to forensic phonetic casework, most studies of bias have been conducted with regard to earwitness identification (Brigham, 1980; Laubstein, 1997; Lindsay et al., 1991; Malpass & Devine, 1981) and forensic transcription or disputed utterance analysis (Fraser & Australia, 2014; Fraser, 2003; Fraser & Stevenson, 2014; Morrison, Lindh, & Curran, 2014). In recent years some focus has also been given to the problem of bias in pairwise comparison of voices in forensic speaker comparison casework (Cambier-Langeveld, 2007; Rhodes, 2014). In this work a new standard procedure for conducting blind testing as a part of forensic speaker comparison casework will be presented.

The procedure has been applied in accordance with the accreditation prerequisites of ISO 17025, and it enriches the comparison work conducted at our laboratory as subcontractors of the Swedish National Forensic Centre (NFC). The method is presented with some reflections and results from a few cases to illustrate the usefulness of the procedure.

Alongside the initial casework procedure of describing, anonymising and editing audio material, and if the material allows for it (passes screening), a blind listening test is set up by an administrator. This allows speakers to be compared relative to other similar recordings in the perceptual assessment. Such a test can only be performed if there is access to so-called ‘foils’: recordings of similar acoustic quality available in our databases. In this kind of listening test the material is administered by one person and the test carried out by at least one other analyst. Comparisons are then made between all the speakers, with a brief description and similarity assessment for each comparison (the Swedish 9-point ordinal scale is applied) accompanied by subjectively judged likelihood ratios (LRs). Not only does this type of blind test support an unbiased approach towards perceptual assessments, but it also provides direct relational perceptual feedback for the analyst(s) with respect to subjective judgments of typicality. It is crucial that the analyst provided with the blind test has no prior background information on the case. Before analysts are provided with blind tests they undergo one practical and one theoretical training and evaluation exercise, besides fulfilling the prerequisite of a university degree in a relevant field. The theoretical training covers Bayesian reasoning in forensic analyses, relevant hypotheses, the Swedish ordinal scale and likelihood ratios, typicality, and bias awareness. The practical training contains three steps, of which the first two address perceptually adjusting to the recording conditions in typical cases using typical recordings. The last step contains a larger perceptual evaluation. Results are then evaluated and compared with the results of other analysts and of automatic systems.
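A minimal sketch of the administrator's side of such a blind test follows, under assumed file names and structure (this is illustrative, not Voxalys's actual software): the questioned and reference recordings are mixed with foils, relabelled, and only the anonymised labels are passed to the analyst.

```python
# Illustrative blind-test setup: mix the case recordings with foils, assign
# anonymised labels, and generate all pairwise comparison trials. The
# administrator keeps the key; the analyst sees labels only.
import itertools
import random

def build_blind_test(questioned: str, reference: str, foils: list[str]) -> dict:
    recordings = foils + [questioned, reference]
    random.shuffle(recordings)  # so label order reveals nothing about the case
    key = {f"sample_{i:02d}": path for i, path in enumerate(recordings, start=1)}
    trials = list(itertools.combinations(sorted(key), 2))  # compare all speakers pairwise
    return {"trials": trials, "key": key}

test = build_blind_test("questioned.wav", "reference.wav",
                        ["foil_a.wav", "foil_b.wav", "foil_c.wav"])
for pair in test["trials"]:
    print(pair)  # the analyst scores each pair on the 9-point scale plus a subjective LR
```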

In the presentation a case example will be shown in which a sole one-to-one comparison could have yielded a different result and influenced the final outcome, had it not been accompanied by the blind test procedure.

References
Brigham, J. C. (1980). Perspectives on the impact of lineup composition, race, and witness confidence on identification accuracy. Law and Human Behavior, 4, 315–321.

Cambier-Langeveld, T. (2007). Current methods in forensic speaker identification: Results of a collaborative exercise. International Journal of Speech, Language & the Law, 14(2).

Dror, I. E., & Hampikian, G. (2011). Subjectivity and bias in forensic DNA mixture interpretation. Science & Justice: Journal of the Forensic Science Society, 51(4), 204–208.

Dror, I., & Rosenthal, R. (2008). Meta-analytically quantifying the reliability and biasability of forensic experts. Journal of Forensic Sciences, 53(4), 900–903.


Fraser, H. (2003). Issues in transcription: factors affecting the reliability of transcripts as evidence in legal cases. International Journal of Speech, Language and the Law - Forensic Linguistics, 10(2), 203–226.

Fraser, H. (2014). Transcription of indistinct forensic recordings: Problems and solutions from the perspective of phonetic science. Language and Law/Linguagem e Direito. Retrieved from http://ler.letras.up.pt/uploads/ficheiros/13353.pdf

Fraser, H., & Stevenson, B. (2014). The Power and Persistence of Contextual Priming: More Risks in Using Police Transcripts to Aid Jurors’ Perception of Poor Quality Covert Recordings. The International Journal of Evidence & Proof, 18(3), 205–229.

Kassin, S. M., Dror, I. E., & Kukucka, J. (2013). The forensic confirmation bias: Problems, perspectives, and proposed solutions. Journal of Applied Research in Memory and Cognition, 2(1), 42–52.

Laubstein, A. S. (1997). Problems of voice line-ups. Forensic Linguistics, 4, 262–279.

Lindsay, R. C. L., Lea, J. A., Nosworthy, G. J., Fulford, J. A., Hector, J., LeVan, V., & Seabrook, C. (1991). Biased lineups: Sequential presentation reduces the problem. The Journal of Applied Psychology, 76, 796–802.

Malpass, R. S., & Devine, P. G. (1981). Eyewitness identification: lineup instructions and the absence of the offender. The Journal of Applied Psychology, 66, 482–489.

Morrison, G. S., Lindh, J., & Curran, J. M. (2014). Likelihood ratio calculation for a disputed-utterance analysis with limited available data. Speech Communication, 58, 81–90.

Nakhaeizadeh, S., Dror, I. E., & Morgan, R. M. (2014). Cognitive bias in forensic anthropology: visual assessment of skeletal remains is susceptible to confirmation bias. Science & Justice: Journal of the Forensic Science Society, 54(3), 208–214.

Rhodes, R. (2014). Cognitive bias in forensic speech science. Proceedings of IAFPA 2014, Zürich, Switzerland. Retrieved from http://www.pholab.uzh.ch/static/IAFPA/abstracts/RHODESrichard.pdf


Speaker specificity of filled pauses compared with vowels and consonants in Dutch

Vincent J. van Heuven,1,2,4,5,6 and Paula Cortés1,3

1Leiden University Centre for Linguistics, Leiden, The Netherlands
2Dept. Hungarian and Applied Linguistics, University of Pannonia, Veszprém, Hungary
3Ministry of Security and Justice, Netherlands Forensic Institute, The Netherlands
4Dept. European Studies, Groningen University, Groningen, The Netherlands
5Leiden Institute for Brain and Cognition, Leiden, The Netherlands
6Fryske Akademy, Leeuwarden, The Netherlands
[email protected]; [email protected]

In the 1990s a considerable research effort was made in the Netherlands to establish the degree to which speakers of Dutch could be successfully identified on the basis of either segmental (Van den Heuvel 1996, 20 speakers) or prosodic (Kraaijeveld 1997, 50 speakers) speech characteristics. Using acoustic properties of specific vowels and consonants, however, introduces a great deal of uncontrolled variation caused by, e.g., the position of the segment in the syllable, whether the target segment is stressed or not, and coarticulation with neighboring segments. Speaker identification based on local characteristics of pitch contours was less successful than identification based on vowels or consonants.

It is our hypothesis that hesitation markers, also called filled pauses (e.g. eh, ehm), are less sensitive to coarticulation and are not influenced by differences in stress. They should therefore be more stable and robust indexes of speaker identity. To test this idea we collected the first 20 filled pauses produced by all male speakers with low socio-economic status (SES) in the Spoken Dutch Corpus (CGN; Oostdijk 2000, 2002), excluding speakers from Belgium and speakers with fewer than 20 filled pauses. Filled pauses were automatically extracted from the recordings on the basis of their schwa-like formant structure and long duration, and checked by ear. We extracted 20 MFCCs (at 11 equidistant time points) as well as the formant frequencies F1–F5, duration, and onset and offset pitch (F0) of each filled pause.

Linear Discriminant Analysis showed that the MFCC data (averaged over the 11 time points per token) afforded correct speaker identification at 85% with 45 speakers (75% after cross-validation using the split-half method), which was better than similar LDAs run on formant data, duration or F0. Moreover, the performance was clearly better than the literature data on speaker identification from individual vowels, consonants, and local properties of pitch contours. For instance, the best speaker recognition reported in Van den Heuvel (1996), for the vowel /a/ with ten speakers and ten tokens per vowel per speaker, was at 80% correct, of which only 65% remained after cross-validation.
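
The classification step can be illustrated with a minimal, self-contained sketch: an LDA trained on per-token feature vectors and scored with a split-half division, loosely mirroring the cross-validation described above. The feature values here are synthetic stand-ins for the averaged MFCCs, since the CGN data themselves cannot be reproduced.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 45 speakers x 20 tokens, each token a vector of
# 20 MFCCs averaged over the 11 time points.
rng = np.random.default_rng(0)
n_speakers, tokens = 45, 20
X = np.vstack([rng.normal(loc=rng.normal(size=20), size=(tokens, 20))
               for _ in range(n_speakers)])
y = np.repeat(np.arange(n_speakers), tokens)

# Split-half cross-validation: train on one half, score on the other.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
print(f"correct speaker identification: {lda.score(X_te, y_te):.1%}")
```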

This suggests that hesitation markers are a promising feature for speaker identification in situations where the amount of speech material is too limited to use long-term average methods.

References
Heuvel, Henk van den (1996). Speaker variability in acoustic properties of Dutch phoneme realisations. Doctoral dissertation, Catholic University Nijmegen.

Kraaijeveld, Johannes (1997). Idiosyncrasy in prosody. Doctoral dissertation, Catholic University Nijmegen.

Oostdijk, Nelleke (2000). The Spoken Dutch Corpus. Overview and first evaluation. In M. Gavrilidou, G. Carayannis, S. Markantonatou, S. Piperidis & G. Stainhaouer (Eds.), LREC-2000 (Second International Conference on Language Resources and Evaluation) Proceedings, 887-894.

Oostdijk, Nelleke (2002). The design of the Spoken Dutch Corpus. In: P. Peters, P. Collins & A. Smith (Eds.), New Frontiers of Corpus Research. Amsterdam: Rodopi, 105-112.


Delimiting the West Yorkshire population: Examining the regional-specificity of hesitation markers

Erica Gold, Sula Ross, and Kate Earnshaw
Department of Linguistics and Modern Languages, University of Huddersfield, UK
e.gold | s.m.ross | [email protected]

Introduction

Defining the reference population for forensic speaker comparisons can be a controversial and difficult task, insofar as a poorly matched reference population can misrepresent the strength of evidence (Hughes, 2014). Given the limited availability of large, forensically-relevant speech databases, it has previously been difficult to establish the extent to which speaker populations may need to be delimited. This study investigates how hesitation markers are realised across three metropolitan boroughs within West Yorkshire: Bradford, Kirklees and Wakefield. Using data from the West Yorkshire Regional English Database (WYRED) (Gold et al., 2016), hesitation markers are acoustically analysed and population statistics for each borough are presented. The motivation behind this study is to explore the generalisability of this parameter and consider whether a single population can be created for West Yorkshire, using hesitation data from all three boroughs, or whether separate reference populations are required.

Hesitation markers have previously been identified as one of the most discriminant speech parameters (Gold and French, 2011) and have been shown to have greater discriminatory power than lexical vowels (Foulkes et al., 2004; King et al., 2013; Hughes et al., 2016). Furthermore, they are very frequent in spontaneous speech for most speakers, they are relatively easy to measure, and they are not thought to be consciously controlled.


Methods

This study presents preliminary findings for a subset of participants from WYRED, which is expected to be released in Spring 2019. Once complete, WYRED will contain recordings of 180 male speakers who grew up and went to school in either Bradford, Kirklees or Wakefield. All speakers are aged between 18 and 30, have English as their first and only language, and were raised in an English-only speaking household. Table 1 presents the tasks that each participant was recorded undertaking and provides details of task type, speaking style, interlocutor, recording quality, and approximate length for each task.

Table 1. Recordings included in the database for all speakers.

| Task and Number | Speaking Style | Interlocutor | Recording Quality | Length of Recording |
|---|---|---|---|---|
| 1 Police Interview | Spontaneous | Research Assistant 1 | Studio | ~20 mins |
| 2 Conversation with Accomplice | Spontaneous | Research Assistant 2 | Studio | ~15 mins |
| 3 Paired Conversation | Spontaneous | Male participant from same area | Studio | ~20 mins |
| 4 Voicemail Message | Spontaneous | N/A | Studio | ~2 mins |

Hesitation markers were manually segmented in Praat using the data from all four tasks listed above. An acoustic analysis was subsequently conducted by collecting average midpoint measurements of the first three formants in the vocalic portion of both “uh” /V/ and “um” /V+N/. The two categories of hesitation markers were treated separately as studies have found that they often differ in distribution and duration (Kirchhübel, 2013).
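
As an illustration of this kind of measurement, the sketch below pulls midpoint formant values from an annotated interval using Praat via the parselmouth Python library; the file name, interval times and formant ceiling are hypothetical, and the abstract's own workflow (manual segmentation in Praat) is not reproduced here.

```python
import parselmouth  # Python interface to Praat

def formant_midpoints(wav_path, start, end, max_formant=5000.0):
    """F1-F3 (Hz) at the midpoint of an annotated vocalic interval,
    e.g. the vocalic portion of an "uh" or "um" token."""
    sound = parselmouth.Sound(wav_path)
    formant = sound.to_formant_burg(maximum_formant=max_formant)
    mid = 0.5 * (start + end)
    return [formant.get_value_at_time(n, mid) for n in (1, 2, 3)]

# Hypothetical token annotated between 1.20 s and 1.45 s:
# print(formant_midpoints("speaker01_task1.wav", 1.20, 1.45))
```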

To establish the degree to which the results of this acoustic analysis can be generalised across the three boroughs, between-speaker variability is compared with within-speaker variability. Less variability between boroughs than within them (and within speakers) would suggest that a single population for West Yorkshire speech is sufficient for hesitation markers. It is also anticipated that the results of this investigation will expand the phonetic literature regarding a region of England which has received relatively little linguistic commentary in recent years.

References
Foulkes, P., Carrol, G. & Hughes, S. (2004). Sociolinguistics and acoustic variability in filled pauses. In 13th Annual Conference of the International Association for Forensic Phonetics and Acoustics (IAFPA), Helsinki, Finland.

Gold, E., Earnshaw, K. & Ross, S. (2016). An Introduction to the West Yorkshire Regional English Database (WYRED). In Transactions of the Yorkshire Dialect Society, edited by Kate Burland and Clive Upton.

Gold, E. & French, P. (2011). International practices in forensic speaker comparison. International Journal of Speech, Language & the Law, 18(2).

Hughes, V. (2014). The definition of the relevant population and the collection of data for likelihood ratio-based forensic voice comparison. Unpublished PhD thesis, University of York.

King, J., Foulkes, P., French, P. & Hughes, V. (2013). Hesitation markers as a parameter for forensic speaker comparison. In 22nd Annual Conference of the International Association for Forensic Phonetics and Acoustics (IAFPA), Tampa, Florida, USA.

Kirchhübel, C. (2013). The acoustic and temporal characteristics of deceptive speech. PhD thesis, University of York.

Hughes, V., Wood, S., & Foulkes, P. (2016). Strength of forensic voice comparison evidence from the acoustics of filled pauses. International Journal of Speech, Language & the Law, 23(1).


Case study: Earwitness reliability through a lens of psycholinguistics and acoustics

Sandra Ferrari Disner, and Andrés Benítez
Department of Linguistics, University of Southern California, Los Angeles, USA
{sdisner|a.benitez}@usc.edu

A wrongful-discharge case was argued before an arbitrator in California. During a labor dispute at an oil refinery, someone had used a recorded line connected to the public-address system to issue a menacing taunt to the management staff. Almost all of them identified that speaker as the refinery worker who was leading the strike effort. (Though one said that no one could have identified the brief, noisy message.) Relying on those earwitness identifications, and their high degree of confidence, the company terminated the strike leader.

The taunt comprised two utterances, 11 and 4 words in length respectively, separated by an 11-minute gap. Not all earwitnesses heard both. Three previous exemplars of the strike leader's voice showed no evidence of vocal pathology or of any unusual regional accent. His F0 coincided almost exactly with the mean for American males.

The company’s legal staff cited recognition studies of highly familiar speakers (Rose & Duncan, 1995; Hollien et al., 1982), and an extensive study of speakers at different levels of familiarity (Yarmey et al., 2001) which reported high levels of accuracy at the high-familiar level. They argued that the identification must have been correct because almost every witness had identified the same speaker, and had done so with the highest level of confidence.

We identified several problems with the applicability of these studies to this real-life scenario. First, the cited authors define highly familiar speakers as immediate family, close friends, or colleagues in a tight-knit group (e.g. a phonetics laboratory). Whether this label could be attached to the management and human-resources staff who identified the speaker is debatable. Even assuming a high level of listener familiarity with the voice, the high accuracy rates in the Yarmey study, at least, were achieved with much longer sample durations. And the degree of confidence with which the refinery identifications were made has been shown to have little if any predictive value (Kreiman & Sidtis, 2011).

One major drawback to the company’s reliance on laboratory studies was the signal-to-noise ratio of the refinery recordings, which averaged -6 dB. Arguably, another drawback was the impromptu nature of the utterances. The refinery provided the antithesis of a controlled experiment. Not only were earwitnesses not expecting to perform a voice identification task amid the chatter of the loudspeaker, but the offensive words were lodged deep in each utterance. The threatening nature of the utterance would only become evident as the first words were ebbing from memory.

Psycholinguistic considerations, such as listener expectations (Ladefoged, 1978) and co-witness conformity (Gabbert et al., 2006), may also be brought to bear. The fact that the strike leader was the one singled out among hundreds of refinery workers might not be coincidental. And conditions for co-witness conformity were met when, immediately upon hearing the offending words, the plant foreman took to the PA system to opine “Hey, that sounds like [the strike leader]!” Opinions soon coalesced around that of the foreman. The arbitrator ultimately ruled for reinstatement of the strike leader in his job.

References
Gabbert, F., Memon, A., and Wright, D. B. (2006). Memory conformity: Disentangling the steps toward influence during a discussion. Psychonomic Bulletin & Review, 13(3), 480–485.

Hollien, H., Majewski, W., and Doherty, E. T. (1982). Perceptual Identification of Voices under Normal, Stress, and Disguise Speaking Conditions. Journal of Phonetics, 10, 139–148.

Kreiman, J. and Sidtis, D. (2011). Foundations of Voice Studies. Malden, MA: Wiley Blackwell.

Ladefoged, P. (1978). Expectation Affects Identification by Listening. Language and Speech, 21, 373-374.

Rose, P. and Duncan, S. (1995). Naïve Auditory Identification and Discrimination of Similar Voices by Familiar Listeners. Forensic Linguistics, 2: 1-17.

Yarmey, A. D., Yarmey, A. L., Yarmey, M. J., and Parliament, L. (2001). Commonsense Beliefs and the Identification of Familiar Voices. Cognitive Psychology, 15, 283-299.


Between-speaker intensity variability is maintained in different frequency bands of amplitude demodulated signal

Lei He, and Volker Dellwo
Institute of Computational Linguistics, University of Zurich, Switzerland
[email protected]; [email protected]

Individual differences in speech rhythm are salient, both in terms of duration-based rhythm measures (Dellwo et al. 2015; Leemann et al. 2014) and intensity-based rhythm measures (He and Dellwo 2016a, 2014; He et al. 2015). However, one notorious aspect of intensity measurements is that they are easily affected by different forms of signal distortions, such as head-turning during recording, dynamic range compression, and clipping. In this study, we introduce a condition of severe signal distortion by demodulating the amplitude envelope, and test whether between-speaker intensity variability as measured by intensity-based rhythm measures is maintained in different frequency bands of such distorted speech signals.

We used the same corpus as in He and Dellwo (2016a), who showed that between-speaker variability was significant in intensity-based rhythm measures. This corpus (the TEVOID corpus, see Dellwo et al. 2015 and Leemann et al. 2014) contains 16 Zürich German speakers (8m, 8f), each of whom read the same set of 256 sentences. We measured varcoM and varcoP for each sentence in the corpus. VarcoM was defined as the coefficient of variation of average syllable intensity levels across a sentence; varcoP was defined as the coefficient of variation of syllable peak intensity levels across a sentence. We then distorted all the speech signals of the corpus using Hilbert transform-based amplitude demodulation (this step was based on a Praat-based algorithm developed by He and Dellwo 2016b), resulting in a flat amplitude envelope and intensity contour (see Figure 1). Between-speaker variability was thus eliminated in the broadband intensity curves. Next, we band-pass filtered all the demodulated signals into four bands: 0–500 Hz, 500–1'000 Hz, 1'000–2'000 Hz, and 2'000–4'000 Hz, and calculated varcoM and varcoP for each band. In order to reduce between-sentence effects, z-score normalisation by sentence was used for all measures in all conditions.
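
A minimal sketch of the signal manipulation and the two measures, using scipy rather than the Praat-based algorithm the authors actually used; the filter order and the assumption that per-syllable intensity values have already been extracted are mine.

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def demodulate(x):
    """Flatten the amplitude envelope: divide the signal by its
    Hilbert envelope, keeping only the temporal fine structure."""
    envelope = np.abs(hilbert(x))
    return x / np.maximum(envelope, 1e-12)

def bandpass(x, lo, hi, fs):
    """Band-pass filter a signal into one of the four analysis bands."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def varco(values):
    """Coefficient of variation (%) of per-syllable intensity values:
    mean intensities give varcoM, peak intensities give varcoP."""
    v = np.asarray(values, dtype=float)
    return 100.0 * v.std(ddof=1) / v.mean()

# Illustrative per-syllable intensity values (dB) for one sentence:
print(varco([62.1, 58.4, 65.0, 60.2, 57.8]))
```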

We observed that between-speaker differences were salient in all frequency bands of both the high-quality and the amplitude demodulated speech (see Figures 2 and 3 for illustrations of the distributions of varcoM and varcoP in the 500–1'000 Hz band in both conditions). In future research, we will further examine how speakers differ from each other in different frequency bands of both high-quality and distorted speech signals, and establish statistical models based on intensity-based rhythm measures.

References
Dellwo V, Leemann A and Kolly M-J (2015) "Rhythmic variability between speakers: articulatory, prosodic, and linguistic factors" Journal of the Acoustical Society of America 137, pp. 1513–1528.

He L and Dellwo V (2014) "Speaker idiosyncratic variability of intensity across syllables" In Proceedings of INTERSPEECH 2014, pp. 233–237, Singapore.

He L and Dellwo V (2016a) "The role of syllable intensity in between-speaker rhythmic variability" International Journal of Speech, Language and the Law 23, pp. 243–273.

He L and Dellwo V (2016b) "A Praat-based algorithm to extract the amplitude envelope and temporal fine structure using the Hilbert transform" In Proceedings of INTERSPEECH 2016, pp. 530–534, San Francisco, USA.

He L, Glavitsch U and Dellwo V (2015) “Comparisons of speaker recognition strengths using suprasegmental duration and intensity variability: an artificial neural networks approach” In Proceedings of the 18th International Congress of Phonetic Sciences (ICPhS), pp. 0395.1–5, Glasgow, UK.

Leemann A, Kolly M-J and Dellwo V (2014) Speech-individuality in suprasegmental temporal features: implications for forensic voice comparison. Forensic Science International 238, pp. 59–67.


Figure 1 The waveform of an amplitude demodulated signal (upper plot) and its associated spectrogram (lower plot). The green line superimposed over the spectrogram is the intensity contour of the signal (mean intensity = 90.98 dBSPL).

Figure 2 Boxplot showing the distribution of varcoM (z-score normalised) in the 500–1'000 Hz band in both high-quality and amplitude demodulation conditions.


Figure 3 Boxplot showing the distribution of varcoP (z-score normalised) in the 500–1'000 Hz band in both high-quality and amplitude demodulation conditions.


The malleability of speech production: An examination of sophisticated voice disguise

Radek Skarnitzl, and Alžběta Růžičková
Institute of Phonetics, Faculty of Arts, Charles University, Czech Republic
[email protected], [email protected]

The variability of speech production continues to present one of the key challenges to forensic phoneticians relying on the auditory-acoustic method of comparing voices, as well as to automatic methods of speaker recognition. Although a speaker's voice may be regarded as a reflection of the anatomy and physiology of his or her vocal tract, these only impose optimal values and limits on individual parameters; importantly, these limits are extremely generous, and the speech production mechanism is typically described as extremely plastic (e.g., Nolan, 1983: 27). As speakers, we exploit this plasticity in our everyday lives when communicating the various components of communicative intent and indexing the various facets of our identity; short- and long-term segmental and suprasegmental information (cf. recordings made by Francis Nolan and analyzed in Nolan, 1983) are combined in countless degrees of freedom (Nolan, 2012; Skarnitzl, 2016).

If we want to consider settings relevant for speaker identification, then the plasticity and malleability of speech production are perhaps best illustrated by voice disguise. Although it appears that, in a majority of actual cases, perpetrators use only one or two ways of modifying their voice (Masthoff, 1996; Svobodová & Voříšek, 2014), the aim of this presentation is to examine speakers whose voice disguise strategies were sophisticated and, most importantly, resulted in speech which sounded natural and might pass for the given speaker's normal speech production.

The analyses are based on the Database of Common Czech (Skarnitzl & Vaňková, 2017), which features 100 male speakers recorded in a number of speaking tasks. One of the tasks consisted of reading a short phonetically rich text in an ordinary voice, while in another task the speakers were asked to read another short text in a disguised voice. They were given sufficient time to devise a strategy to disguise their voice. The two texts differed but contained some identical phrases which may be used for detailed comparison.

A general mapping of disguise strategies was conducted by Růžičková and Skarnitzl (2017): 3 of the 100 speakers did not perform any modification at all when disguising their voice, and another 31 changed only one characteristic. At the other end of the spectrum, we identified 15 speakers who were, at least on first listening, rather difficult to recognize when comparing their natural and disguised voice production, and whose speech sounded natural. These speakers were contacted again and asked to record a longer text using the same kind of disguise; this time the same text, with neutral content, was used for both conditions. The presentation will describe their phonatory and articulatory disguise (and in one case imitation) strategies in greater detail, including selected acoustic analyses and audio comparisons of the speakers' natural and disguised voices.

References
Masthoff, H. (1996). A report on a voice disguise experiment. Forensic Linguistics, 3, 160–167.
Nolan, F. (1983). The Phonetic Bases of Speaker Recognition. Cambridge: Cambridge University Press.
Nolan, F. (2012). Degrees of freedom in speech production: An argument for native speakers in LADO. The International Journal of Speech, Language and the Law, 19, 263–289.
Růžičková, A. and R. Skarnitzl (2017, in press). Voice disguise strategies in Czech male speakers. AUC Philologica, Phonetica Pragensia.
Skarnitzl, R. (2016). What are our voices capable of? Phonetic perspective on the variability of speech production. Slovo a smysl, 26, 95–113. [in Czech]
Skarnitzl, R. and J. Vaňková (2017, in press). Fundamental frequency statistics for male speakers of Common Czech. AUC Philologica, Phonetica Pragensia.
Svobodová, M. and L. Voříšek (2014). Speaker identification from the perspective of authentic criminological practice in the Czech Republic. In: R. Skarnitzl (Ed.), Phonetic Speaker Identification, 136–144. Praha: Faculty of Arts, Charles University. [in Czech]


Bilingual speakers' long-term fundamental frequency distributions as cross-linguistic speaker discriminants

Kieran Dorreen, and Vica Papp
University of Canterbury, New Zealand
[email protected]; [email protected]

The fundamental frequency (F0), a robust and widely available phonetic parameter, has been shown to be highly variable both within and between speakers. The use of F0 as a speaker discriminant in speaker comparison, however, suffers from a wide range of methodological shortcomings: inconsistent pitch tracking algorithms, analyst bias in tracker settings and outlier management, analysts implicitly disregarding the creak phonation range and the bimodal (and sometimes multimodal) nature of F0 distributions, and minimalist statistical approaches to F0. As a result, research on the long-term F0 distribution for speaker comparison purposes has stalled somewhat.

This study explores the renewed potential of long-term F0 distributions by 1) using REAPER, a modern, epoch-based pitch tracker, to extract the glottal cycles, 2) employing novel ways of characterizing the distributions, and 3) testing them on bilingual corpora.

Our approach applied a highly accurate epoch-based open-source pitch tracker, REAPER (Robust Epoch And Pitch EstimatoR, Talkin, 2015), to two corpora of bilingual speakers from New Zealand, the QuakeBox Corpus (Walsh et al., 2013) and the Maori New Zealand English Corpus (MAONZE, Maclagan et al., 2004). The 17 selected speakers in each of the two corpora were interviewed in two languages (2 corpora x 17 speakers x 2 languages). In MAONZE the two languages were Māori and New Zealand English; in QuakeBox they were English and another language that the participant spoke, including French, Mandarin, Punjabi, Russian, etc.

REAPER is an implementation of the EpochTracker.h class in the Low Level Virtual Machine (LLVM) compiler infrastructure. After REAPER locates the epochs or individual glottal closure instants (GCI), local F0 is calculated as the inverse of the time between successive GCIs, with values rounded to the nearest Hz.

To explicitly account for the bimodality of the distributions, each F0 distribution was bisected at its statistical anti-mode, that is, the location of the least frequent pitch value between the two phonation (and statistical) modes. In the resulting separate creak vs. modal distributions, parameters such as mean F0, modal F0, skew, and kurtosis were calculated, following Rose (2002), for both languages of each speaker. Within-speaker (i.e. between-language) and between-speaker comparisons were then calculated for the 34 speakers.
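
The two computational steps just described, F0 from successive glottal closure instants and bisection at the anti-mode, can be sketched as follows; the kernel-density estimate and the 120 Hz initial split between the creak and modal regions are my illustrative assumptions, not details given by the authors.

```python
import numpy as np
from scipy.stats import gaussian_kde

def f0_from_gci(gci_times):
    """Local F0 as the inverse of the interval between successive
    glottal closure instants (in seconds), rounded to the nearest Hz."""
    periods = np.diff(np.sort(gci_times))
    return np.rint(1.0 / periods)

def antimode(f0_values, split_hz=120.0):
    """Estimate the anti-mode: the least frequent F0 value between the
    creak and modal peaks of a bimodal long-term F0 distribution."""
    grid = np.arange(40.0, 400.0)
    density = gaussian_kde(f0_values)(grid)
    lo, hi = grid < split_hz, grid >= split_hz  # assumed creak/modal split
    creak_peak = grid[lo][np.argmax(density[lo])]
    modal_peak = grid[hi][np.argmax(density[hi])]
    between = (grid > creak_peak) & (grid < modal_peak)
    return grid[between][np.argmin(density[between])]

# Toy bimodal F0 sample: a creak mode near 60 Hz, a modal mode near 190 Hz.
rng = np.random.default_rng(4)
toy = np.concatenate([rng.normal(60, 8, 300), rng.normal(190, 25, 700)])
print(antimode(toy))
```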

REAPER's output shows greater accuracy in pitch tracking for both phonation types (cf. Fig 1), which revealed more speaker-characteristic features in the long-term distributions than other, less accurate pitch trackers. This is especially true for creak, which, as a result, now shows greater potential as a speaker discriminant, having both small within-speaker variation and relatively large between-speaker variation across languages (e.g. Fig 2). The accurate bimodal distributions extracted through REAPER also revealed a strong unforeseen speaker discriminant, the anti-mode itself, which shows smaller within-speaker variation and larger between-speaker variation than all the other F0 parameters studied (cf. Figure 3).

Future work will evaluate the stability of the successful measures as a function of package size (Gold, 2014), comparison with the Fujisaki intonational model (Leemann et al., 2014), and down-sampling to 16 kHz during data pre-processing to avoid the quadratic increase in computational time.


Figure 1 F0 probability distribution of Cantonese speaker, based on Praat’s pitch tracker (black) vs. REAPER (blue).

Figure 2 Creak mean of speakers in the MAONZE Māori-English bilingual corpus.


Figure 3 Antimode frequency (Hz) of speakers in the bilingual section of the QuakeBox Corpus.

References
Gold, E. (2014). Calculating likelihood ratios for forensic speaker comparisons using phonetic and linguistic parameters (unpublished PhD thesis). University of York, York, UK.
Leemann, A., Mixdorff, H., O'Reilly, M., Kolly, M.-J. and Dellwo, V. (2014). Speaker-individuality in Fujisaki model f0 features: implications for forensic voice comparison. International Journal of Speech, Language and the Law, 21(2): 343-370.
Maclagan, M., Harlow, R., King, J., Keegan, P., and Watson, C. (2004). Acoustic analysis of Māori: historical data. Proceedings of the Australian Linguistics Society Conference, Sydney, June.
Rose, P. (2002). Forensic speaker identification. CRC Press.
Talkin, D. (2015). REAPER: Robust Epoch And Pitch EstimatoR. Retrieved from https://github.com/google/REAPER
Walsh, L., Hay, J., Bent, D., Grant, L., King, J., Millar, P., Papp, V. and Watson, K. (2013). The UC QuakeBox Project: Creation of a community-focused research archive. New Zealand English Journal 27: 20-32.


Analysis of i-vector-based false-accept trials in a dialect-labelled telephone corpus

Nadja Tschäpe, Michael Jessen, and Stefan GfroererDepartment of Speech and Audio (KT34), BKA, Germany

{nadja.tschaepe|michael.jessen|stefan.gfroerer}@bka.bund.de

In their Odyssey paper What are we missing with i-vectors?, González-Rodríguez et al. (2014) used a collection of nontarget trials (i.e. comparisons of different-speaker recordings) that were classified as same-speaker recordings by a modern (i-vector-based) automatic speaker recognition system. Based on such a collection of false-accept trials, phoneticians examined the pairs of recordings for any differences that might indicate the nonidentity of the speakers and that must have been "overlooked" by the automatic system. Since most current automatic systems are based on cepstral coefficients as feature vectors, there are many non-cepstral types of information, typically analysed by phoneticians/linguists, that remain uncaptured by the system. González-Rodríguez et al. (2014) arrived at a compilation of many phonetic features that differed in some of the trials. Most of the features were from the domain of phonatory voice quality; most others were f0-based or temporal prosodic characteristics. Dialect differences were mentioned only rarely.

This paper attempts to expand González-Rodríguez et al. (2014) by concentrating on the aspect of regional variation as a potential source of differences that can reveal the nonidentity status of falsely accepted trials.

The experiment is based on the analysis of one-minute recordings of 110 male local police officers answering emergency calls. Each speaker sample was compared with every other, and the direction of the comparison (assignment to questioned-speaker or suspect status) mattered, hence a total of 11,990 nontarget trials. The recordings are from the German-dialect corpus described in Köster et al. (2012), which uses a three-tiered representation of German dialects: three broad-level distinctions (lower, middle, upper German), six medium-level distinctions, and fourteen fine-level distinctions. Adopting that classification, four degrees of dialectal difference were derived, ranging from 0 (no difference) to 3 (speakers straddling a broad-level boundary).

The automatic system used was Nuance Forensics. It requires a reference population of at least 30 speakers for the calculation of log likelihood ratios (LLRs); a reference population was therefore selected from 30 additional speakers from the same dialect corpus. Nuance Forensics offers as a default assumption that LLRs of 1.2 or higher constitute evidence in favor of speaker identity. This assumption was accepted for the purpose of this experiment.

The results show that the false-accept rate was approx. 1%: a total of 121 nontarget comparisons resulted in an LLR of 1.2 or higher. Figure 1 shows how the false-accept trials are distributed with respect to the four degrees of dialect difference.
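
The tabulation behind a figure like this is straightforward; the sketch below counts false accepts per dialect-difference degree, with randomly generated LLRs and degree labels standing in for the actual BKA data.

```python
import numpy as np

# Stand-in data: one LLR and one dialect-difference degree (0-3) per
# nontarget trial; the real study had 11,990 such trials.
rng = np.random.default_rng(1)
llrs = rng.normal(loc=-2.0, scale=1.5, size=11990)
degrees = rng.integers(0, 4, size=11990)

THRESHOLD = 1.2  # default decision threshold quoted above
false_accept = llrs >= THRESHOLD
print(f"false-accept rate: {false_accept.mean():.1%}")
for d in range(4):
    share = (false_accept & (degrees == d)).sum() / false_accept.sum()
    print(f"degree {d}: {share:.1%} of all false accepts")
```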

Figure 1 Percentage of false accept trials next to percentage of all trials in four degrees of dialect difference.


As can be seen from the figure, a large number of false identifications can be prevented by assessing regional information: 72% of the false accepts involve speaker pairs with dialectal differences and can be expected to be classified as non-identical by an expert using dialectal analysis. The remaining 28% of the false-accept trials result from speakers with identical or very similar regionally accented speech. For these, dialect analysis might not give a lead to nonidentity, and other phonetic/linguistic features are necessary.

References
González-Rodríguez, J., Gil, J., Pérez, R., and Franco-Pedroso, J. (2014). What are we missing with i-vectors? A perceptual analysis of i-vector-based falsely accepted trials. Proceedings of Odyssey 2014: The Speaker and Language Recognition Workshop, Joensuu, Finland, 33-40.
Köster, O., Kehrein, R., Masthoff, K., and Boubaker, Y. H. (2012). The tell-tale accent: identification of regionally marked speech in German telephone conversations by forensic phoneticians. The International Journal of Speech, Language and the Law, 19, 51-71.


Speaker-dependency of /s/ in spontaneous telephone conversation

Willemijn Heeren
Department of Languages, Literature and Communication, Utrecht University, Utrecht, The Netherlands
[email protected]

In the search for speaker-dependent speech sounds as features in forensic speaker comparisons, vowels have received most attention (e.g., McDougall, 2004; Gold, 2014; Hughes, 2014; Zuo and Mok, 2015). They reflect vocal fold activity and vocal tract resonances (cf. Fant, 1960), and tend to be well-measurable, even in noisy channels. In forensic casework, consonants are regularly used (Gold and French, 2011), but scientifically, their speaker-dependency is understudied (but see Van den Heuvel, 1996; Rose et al., 2003; Fecher and Watt, 2011; Kavanagh, 2012). As earlier work on read speech suggests that /s/ may be useful for forensic speaker comparisons, this investigation targeted the fricative's speaker-dependency in spontaneous telephone conversation. The goals were to assess speaker classification performance using /s/ features in a speech style comparable to that found in casework, and to perform a comparative analysis of methods for taking spectra.

Method

Recordings of spontaneous telephone conversation were taken from a phonemically annotated subset of the Spoken Dutch Corpus (Oostdijk, 2000). A homogeneous group of adult male speakers was selected, ranging in age from 18 to 50, with Standard Dutch as their first, work and home language. A constant, high-frequency context was chosen: word-final /s/ preceded by a vowel. The target phonemes were manually annotated, also coding aspects of their linguistic position for later analyses. Including only speakers with at least 10 sufficient-quality samples left 17 speakers.

Each fricative's duration, spectral centre of gravity (CoG) and spectral standard deviation were measured using PRAAT (Boersma, 2001). Spectral measures were taken in Hertz in five ways, but always over the 350-3500 Hz range given the telephone band. Spectra were taken over the segment's full duration, and over the mid-75% and mid-50% of the duration, to examine the effect of co-articulation on speaker-dependency. Also, time-averaged spectra were computed over the full duration (see Shadle, 2012). Finally, independent rather than overlapping analysis windows were used to obtain measurements for dynamic spectral representations (e.g., Munson, 2001): cubic and quadratic fits were determined using MATLAB's polyfit() function. If necessary, measures were transformed to meet the normality requirement.
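
The dynamic representation amounts to fitting a low-order polynomial to window-wise CoG values; below is a minimal Python equivalent of the MATLAB step, with the five CoG values purely illustrative.

```python
import numpy as np

def fit_coefficients(cog_track, degree=2):
    """Quadratic (degree=2) or cubic (degree=3) fit coefficients for a
    CoG trajectory measured in independent analysis windows; the
    Python counterpart of MATLAB's polyfit()."""
    t = np.linspace(0.0, 1.0, num=len(cog_track))  # normalised time
    return np.polyfit(t, cog_track, deg=degree)

# Five illustrative window-wise CoG values (Hz) across one /s/ token:
print(fit_coefficients([2100.0, 2450.0, 2600.0, 2500.0, 2200.0]))
```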

Analysis and results

To study speaker-dependency one-way ANOVAs were run with an acoustic measurement as dependent variable and Speaker as random factor. Linear Discriminant Analysis (LDA) was used to build speaker classifiers, with different (combinations of) measurements as predictors.

The CoG features derived from the different non-dynamic spectra were strongly correlated (r > .80, p < .001; see Figure 1), and all showed highly significant effects of Speaker (F(16,153) > 6.6, p < .001). LDAs using single predictors, whether fit coefficients or non-dynamic features, showed a comparable maximum performance of ~11% correct speaker classification (cross-validated; chance level = 5.9%). Quadratic fit coefficients outperformed those of cubic fits. Combining predictors that did not correlate highly yielded cross-validated results of ~20% correct speaker classification (cf. Kavanagh, 2012; Van den Heuvel, 1996).

Figure 1 Boxplot of three CoG measurements (full duration, mid-50%, time-averaged spectra) in kHz, clustered by speaker.


The fricative /s/ thus contains speaker-dependent information in spontaneous conversational telephone speech. The results suggest that the exact method of computing spectra only minimally influences classification performance. Also, dynamic fit coefficients may be combined with overall spectral measurements to enhance classification performance, presumably because the former reflect articulatory change whereas the latter reflect vocal tract resonances.

References
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International 5(9/10), 341-345.
Fant, G. (1960). Acoustic theory of speech production. The Hague: Mouton.
Fecher, N., and Watt, D. (2011). Speaking under cover: the effect of face-concealing garments on spectral properties of fricatives. Proceedings of the 17th ICPhS, Hong Kong, 663-666.

Gold, E. A. (2014). Calculating likelihood ratios for forensic speaker comparisons using phonetic and linguistic parameters. PhD dissertation, University of York, UK.

Gold, E., and French, P. (2011). International practices in forensic speaker comparison. International Journal of Speech, Language and the Law 18(2), 293-307.

Hughes, V. S. (2014). The definition of the relevant population and the collection of data for likelihood ratio-based forensic voice comparison. PhD dissertation, University of York, UK.

Kavanagh, C. M. (2012). New consonantal acoustic parameters for forensic speaker comparison. PhD dissertation, University of York, UK.

McDougall, K. (2004). Speaker-specific formant dynamics: An experiment on Australian English /aI/. International Journal of Speech, Language and the Law 11(1), 103-130.

Oostdijk, N. H. J. (2000). Het Corpus Gesproken Nederlands. Nederlandse Taalkunde 5, 280-284.

Rose, P., Osanai, T., and Kinoshita, Y. (2003). Strength of forensic speaker identification evidence: multispeaker formant- and cepstrum-based segmental discrimination with a Bayesian likelihood ratio as threshold, Forensic Linguistics 10(2), 179-202.

Shadle, C. (2012). The acoustics and aerodynamics of fricatives. The Oxford Handbook of Laboratory Phonology, eds.: A. Cohn, C. Fougeron, M. K. Hoffman, Oxford University Press, pp. 511-526.


Van den Heuvel, H. (1996). Speaker variability in acoustic properties of Dutch phoneme realisations. PhD dissertation, Katholieke Universiteit Nijmegen, The Netherlands.

Zuo, D., and Mok, P. P. K. (2015). Formant dynamics of bilingual identical twins. Journal of Phonetics 52, 1-12.


What your voice says about you: Automatic Speaker Profiling using i-vectors

Finnian Kelly, Oscar Forth, Alankar Atreya, Samuel Kent, and Anil Alexander

Research and Development, Oxford Wave Research Ltd., Oxford, U.K.
{finnian|oscar|alankar|sam|anil}@oxfordwaveresearch.com

In forensic and investigative speech analysis tasks dealing with large volumes of recordings, triage by human experts may not be feasible. The ability to automatically extract information such as a speaker’s gender1, age and spoken language would support the rapid assessment of audio recordings in such cases. Additionally, this information could be used within an automatic speaker recognition framework, to inform the selection of a reference population for example. In this paper, we explore the automatic estimation of speaker gender, age and spoken language from telephone quality speech using an i-vector framework (Dehak et al. 2011a).

In the i-vector approach, a speech segment is converted into a compact, fixed-length representation, in which most of the important variability is retained. For speaker recognition, the variability of interest is speaker identity. However, other information carried by the speech signal is also encoded in the i-vector (Dehak et al. 2011a, Bahari et al. 2014, Ranjan et al. 2015).

We use the NIST Speaker Recognition Evaluation (SRE) data 2004-2008 for our experiments. From this large pool of speech, we extracted subsets of multilingual conversational telephone speech, balanced across gender, and containing as broad an age range as possible. The VOCALISE speaker recognition software (Alexander et al. 2016) was used to extract i-vectors from all selected recordings.

1 In this paper, we use the term ‘gender’ to refer to the biological sex of a speaker.


Gender Recognition and Age Estimation

A pool of 9000 NIST SRE recordings, balanced across gender and distributed across an age range of 18-89, was divided into train and test partitions in the ratio 2:1. There were 1000 unique speakers across both partitions, with no speaker overlap. After extracting i-vectors for the train set, we trained support vector machine (SVM) models for gender classification and age regression using the known gender and age labels. The SVM models were then applied to generate gender and age labels for the test set. For gender recognition, an equal error rate (EER) of 2.4% and an accuracy rate of 97.7% were obtained. For age estimation, a Mean Absolute Error (MAE) of 7.50 years (males: 8.09, females: 7.15) was obtained.
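
In scikit-learn terms, the two supervised steps look roughly like this; the i-vectors and labels below are random stand-ins, since the NIST SRE data cannot be reproduced here, and the kernel choices are assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.metrics import mean_absolute_error

# Random stand-ins for 400-dimensional i-vectors with known labels.
rng = np.random.default_rng(2)
X_train, X_test = rng.normal(size=(600, 400)), rng.normal(size=(300, 400))
gender_train, gender_test = rng.integers(0, 2, 600), rng.integers(0, 2, 300)
age_train, age_test = rng.uniform(18, 89, 600), rng.uniform(18, 89, 300)

# SVM classifier for gender, SVM regressor for age.
gender_model = SVC(kernel="linear").fit(X_train, gender_train)
age_model = SVR(kernel="rbf").fit(X_train, age_train)

print(f"gender accuracy: {gender_model.score(X_test, gender_test):.1%}")
print(f"age MAE (years): "
      f"{mean_absolute_error(age_test, age_model.predict(X_test)):.2f}")
```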

Language Recognition

We considered a set of ten widely-spoken languages for our tests, namely Arabic, Bengali, Chinese (Mandarin), Chinese (Cantonese), English, Hindi, Japanese, Russian, Spanish and Thai. A pool of 2000 NIST SRE recordings, balanced across gender and each of the languages of interest, was divided into train and test partitions. There were 500 unique speakers across both partitions, with no speaker overlap. We applied linear discriminant analysis (LDA) to the train and test i-vectors based on the 10 language classes to reduce their dimension and enhance separability. To accommodate gender and regional variations within the languages, we applied k-means clustering to the train i-vectors. Using cosine-similarity-based scoring of the train and test i-vectors, we obtained an average language recognition accuracy of 85.05% and a mean EER of 8.22%. Figure 1 contains a visualisation of language i-vectors and a confusion matrix of per-language recognition accuracy rates.
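
A compressed sketch of the LDA-plus-cosine-scoring pipeline, scoring here against per-language mean vectors rather than the k-means sub-clusters the authors used, and again with random stand-in i-vectors:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics.pairwise import cosine_similarity

# Random stand-ins: train/test i-vectors labelled with 10 languages.
rng = np.random.default_rng(3)
X_train, y_train = rng.normal(size=(500, 400)), rng.integers(0, 10, 500)
X_test, y_test = rng.normal(size=(200, 400)), rng.integers(0, 10, 200)

# LDA projection to (n_classes - 1) = 9 dimensions for separability.
lda = LinearDiscriminantAnalysis(n_components=9).fit(X_train, y_train)
Z_train, Z_test = lda.transform(X_train), lda.transform(X_test)

# Cosine scoring against per-language mean vectors (k-means
# sub-clusters per language, as in the abstract, would refine this).
means = np.vstack([Z_train[y_train == k].mean(axis=0) for k in range(10)])
predicted = cosine_similarity(Z_test, means).argmax(axis=1)
print(f"language recognition accuracy: {(predicted == y_test).mean():.1%}")
```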

In this work, we have demonstrated how key speaker meta-data such as gender, language and age may be estimated automatically from telephone speech. Initial results indicate that this approach can be successfully extended to unconstrained public sources such as YouTube recordings.


Figure 1 Left: An unsupervised t-SNE (van der Maaten and Hinton 2008) projection of the 400-dimensional i-vectors (pre-LDA) into 3 dimensions. Each point is an i-vector and each colour indicates a different language. Right: A confusion matrix of language recognition accuracy (the colour-bar indicates accuracy in percent).


References
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and Ouellet, P. (2011a). Front-End Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798.

Dehak, N., Torres-Carrasquillo, P. A., Reynolds, D., and Dehak, R. (2011b). Language recognition via i-vectors and dimensionality reduction. Interspeech 2011, Florence, Italy, pp. 857-860.

Bahari, M. H., McLaren, M., Van hamme, H., and van Leeuwen, D. (2014). Speaker age estimation using i-vectors. Engineering Applications of Artificial Intelligence, vol. 34, pp. 99-108.

Ranjan, S., Liu, G, and Hansen, J. H. L. (2015). An i-Vector PLDA based gender identification approach for severely distorted and multilingual DARPA RATS data. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2015, Scottsdale, AZ, pp. 331-337.

Alexander, A., Forth, O., Atreya, A. A., and Kelly, F. (2016). VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features. Speaker Odyssey 2016, Bilbao, Spain.

van der Maaten, L. J. P. and Hinton, G. E. (2008). Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research, vol. 9, pp. 2579-2605.


Forensic voice comparison: Older sister or younger sister?

Cuiling Zhang,1,2 Geoffrey Stewart Morrison,3,4 and Ewald Enzinger5

1School of Criminal Investigation, Southwest University of Political Science & Law, Chongqing, China
2Chongqing Institutes of Higher Education Key Forensic Science Laboratory, Chongqing, China
3Forensic Speech Science Laboratory, Centre for Forensic Linguistics, Aston University, England, United Kingdom
4Department of Linguistics, University of Alberta, Edmonton, Alberta, Canada
5Eduworks, Corvallis, Oregon, United States of America
[email protected]
{geoff-morrison | ewald-enzinger}@forensic-evaluation.net

Introduction

The mobile phone conversation of the plaintiff in this case was recorded. The conversation included either the respondent or the respondent’s younger sister. Five new telephone conversations with each sister were recorded using the plaintiff’s mobile telephone. This provided known-speaker recordings under the same conditions as the unknown-speaker recording used for analysis. The known-speaker recordings were used to train and test a forensic voice comparison system. The system was then used to evaluate the strength of evidence associated with the unknown-speaker recording: What is the probability of obtaining the acoustic properties of the voice on the unknown-speaker recording if it were produced by the older sister versus if it were produced by the younger sister?

Acoustic and statistical analysis

MFCCs were extracted every 10 ms from the speech of the speaker of interest in each recording. The 1st through 4th coefficients were used for statistical analysis. Fig. 1 shows the smoothed spectra corresponding to these measurements. The data were transformed using canonical linear discriminant functions (CLDFs). This procedure finds new dimensions which maximize the ratio of between- to within-category (between- to within-speaker) variance. Only the first CLDF dimension was used for subsequent analysis. Even though there was no mismatch in recording conditions, there may still have been some between-session variability, and this step served as a mismatch compensation technique. Also, with only five data points per category, we needed to fit a parsimonious model at the next stage of statistical analysis; by using only one dimension, the number of parameters whose values had to be estimated was reduced. Fig. 2 shows the resulting CLDF values and Gaussian distributions fitted with a pooled variance.
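
From those fitted distributions, a likelihood ratio follows directly; a minimal sketch, with illustrative CLDF scores rather than the case data:

```python
import numpy as np
from scipy.stats import norm

def gaussian_lr(x, older_scores, younger_scores):
    """Likelihood ratio for a questioned CLDF score x: density under
    the older sister's Gaussian over density under the younger
    sister's, using a pooled within-category variance."""
    o = np.asarray(older_scores, dtype=float)
    y = np.asarray(younger_scores, dtype=float)
    pooled_var = (((len(o) - 1) * o.var(ddof=1) +
                   (len(y) - 1) * y.var(ddof=1)) / (len(o) + len(y) - 2))
    sd = np.sqrt(pooled_var)
    return (norm.pdf(x, loc=o.mean(), scale=sd) /
            norm.pdf(x, loc=y.mean(), scale=sd))

# Illustrative CLDF scores, five per sister, and one questioned score:
print(gaussian_lr(0.4, [0.9, 1.1, 0.8, 1.3, 1.0],
                  [-1.0, -0.7, -1.2, -0.9, -1.1]))
```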

With so little data, there was a concern about having poor estimates of parameter values, which could lead to vast overestimation of the strength of evidence. For the actual case, one solution was chosen a priori, but in this presentation we present the use of different statistical models which include shrinkage (these include additional analyses compared to those in an earlier publication [1] based on this case). As a point of comparison, we fitted a linear discriminant analysis model (LDA), which includes no shrinkage. We also fitted a Bayesian model with uninformative Jeffreys reference priors, we limited the maximum and minimum values from the LDA model using empirical lower and upper bounds (ELUB) [2], and we fitted a novel regularized logistic regression model (LogReg). The regularization consisted of a uniform distribution with a weight equivalent to 5 data points.

Results

A leave-one-out cross-validation procedure was applied to the known-speaker recordings from the two sisters. Table 1 shows preliminary likelihood ratio / Bayes factor results. The LDA procedure produced ridiculously large and small likelihood ratio values, which are not justifiable given the small amount of training data. The Bayesian analysis produced much more moderate Bayes factor values, but these are still questionable given the amount of training data. The ELUB procedure gave very conservative values. The regularized logistic regression procedure also gave conservative values, but these values could be above or below the ELUB values.


Table 1. Likelihood ratio values / Bayes factor values, for each known-speaker recording and the questioned-speaker recording. Y: a younger sister recording. O: an older sister recording. Q: the questioned speaker recording.

References
Zhang, C., Morrison, G. S., Enzinger, E. (2016). Use of relevant data, quantitative measurements, and statistical models to calculate a likelihood ratio for a Chinese forensic voice comparison case involving two sisters. Forensic Science International, 267, 115–124. http://dx.doi.org/10.1016/j.forsciint.2016.08.017

Vergeer, P., van Es, A., de Jongh, A., Alberink, I., Stoel, R.D. (2016). Numerical likelihood ratios outputted by LR systems are often based on extrapolation: When to stop extrapolating? Science & Justice, 56, 482–491. http://dx.doi.org/10.1016/j.scijus.2016.06.003


Not a Lone Voice: Automatically Identifying Speakers in Multi-Speaker Recordings

Anil Alexander, Oscar Forth, Alankar Atreya, Samuel Kent, and Finnian Kelly

Research and Development, Oxford Wave Research Ltd., Oxford, U.K.
{anil|oscar|alankar|sam|finnian}@oxfordwaveresearch.com

Law enforcement audio recordings such as interviews, telephone intercepts and surveillance recordings often contain speech from more than one speaker. Identifying speakers of interest within these multi-speaker recordings first involves editing to extract the speech of a single speaker. This editing process, in which extraneous noises and other speakers are removed, can either be performed manually or assisted using speaker diarisation software. However, if a large number of such files need to be analysed in a short period of time, it may not be practical to involve a human in the loop. In this paper, we attempt to address the challenging task of efficiently and accurately spotting certain target speakers from large volumes of multi-speaker recordings automatically.

We have tried to address this problem using a simple but effective approach, in which short overlapping segments of the multi-speaker recording are extracted and modeled within an i-vector framework. The i-vector approach converts a recording into a fixed length, low-dimensional representation of the speaker’s voice. The i-vectors for each overlapping segment (e.g. 10s segments, with 5s overlap) are compared with the i-vector for the target speaker file. The match scores obtained across all overlapping segments are first smoothed to reduce the effect of outliers, and then an average of the three maximum scoring segments provides a match score for the file.
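
A small sketch of the segmentation and scoring logic follows; the segment length, hop, smoothing window and top-k values come from the description above, while the scores fed in would come from an i-vector comparison that is not reproduced here.

```python
import numpy as np

def segment_bounds(duration, seg_len=10.0, hop=5.0):
    """Start/end times (s) of overlapping segments, e.g. 10 s segments
    with a 5 s hop as described above."""
    starts = np.arange(0.0, max(duration - seg_len, 0.0) + 1e-9, hop)
    return [(t, t + seg_len) for t in starts]

def file_match_score(segment_scores, top_k=3, smooth_win=3):
    """Per-file match score: moving-average smoothing to reduce the
    effect of outliers, then the mean of the top_k smoothed scores."""
    s = np.asarray(segment_scores, dtype=float)
    smoothed = np.convolve(s, np.ones(smooth_win) / smooth_win, mode="same")
    return float(np.sort(smoothed)[-top_k:].mean())

print(segment_bounds(25.0))                      # four overlapping windows
print(file_match_score([0.1, 0.2, 1.8, 2.1, 0.3, 0.2]))
```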

We tested our approach on controlled laboratory data as well as real telephone intercept data, using a multi-speaker modified version of the VOCALISE speaker recognition software (Alexander et al., 2016). For our experiments with laboratory data, we used interview and intercept recordings in same- and cross-channel conditions from the DyViS database (Nolan et al., 2009). For 'single target' cross-channel comparisons, we used 51 files containing two speakers from the intercept task and compared them with 59 single-speaker files from DyViS Task 3 (report and report recall). For each multi-speaker recording, the majority (94.1%) of corresponding target speakers were identified at rank one or two of the match score list (Figure 1). The equal error rate (EER) of this comparison was 3.90%. For uncontrolled real telephone intercept data, we worked with a subset of the FRITS database (van der Vloed et al., 2014); all tests were conducted by and at the Netherlands Forensic Institute (NFI). This subset consisted of 11 multi-speaker conversations (mostly two, and in some cases more, speakers) and a set of 32 target speakers. For each multi-speaker recording, the majority (76.1%) of corresponding target speakers were identified at rank one or two of the match score list (Figure 1). Conversely, for each target, a matching multi-speaker file containing that speaker was identified at rank one or two 80% of the time.

We observe that the total duration of speech and the relative speaker mix for each target in a multi-speaker file are important for accurate recognition. Despite these challenges, this approach shows promise for automatically processing large volumes of real-world multi-speaker files.


Figure 1 The proportion of correct targets identified at a certain rank for single-target DyVIS database and two-target FRITS database.

References
A. Alexander, O. Forth, A. A. Atreya, and F. Kelly (2016). "VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features", Odyssey 2016 Speaker and Language Recognition Workshop, Bilbao, Spain.
D. L. van der Vloed, J. S. Bouten, and D. A. van Leeuwen (2014). NFI-FRITS: A forensic speaker recognition database and some first experiments, Proceedings of Odyssey Speaker and Language Recognition Workshop 2014, Joensuu, Finland, pp. 6-13.
N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (2011). Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech & Language Processing, vol. 19, no. 4, pp. 788-798.
F. Nolan, K. McDougall, G. de Jong & T. Hudson (2009). The DyViS database: style-controlled recordings of 100 homogeneous speakers for forensic phonetic research. International Journal of Speech, Language and the Law 16: 31-57.


The complementarity of automatic, semi-automatic, and phonetic measures of vocal tract output in forensic voice comparison

Vincent Hughes,1 Philip Harrison,1,2 Paul Foulkes,1 Peter French,1,2 Colleen Kavanagh,1 and Eugenia San Segundo1

1Department of Language and Linguistic Science, University of York, UK
2J P French Associates, York, UK
{vincent.hughes|philip.harrison|paul.foulkes|peter.french|colleen.kavanagh|eugenia.sansegundo}@york.ac.uk

In forensic voice comparison, automatic, semi-automatic and phonetic methods are available for evaluating voice evidence. Across the world, the phonetic approach is used predominantly in casework. This is due, in part, to the 'black box' perception of automatic systems and the lack of direct links between the features extracted and the underlying physiology. However, there is an increasing move towards integrating the best elements of each approach (e.g. Gonzalez-Rodriguez et al., 2014). Fundamental to the development of hybrid FVC systems, however, is an understanding of the extent to which different methods capture complementary speaker-specific information.

In this study, we examine the potential improvement in the performance of a Mel-frequency cepstral coefficient-based (MFCC) automatic system with the inclusion of semi-automatic features (linear and Mel-weighted long-term formant distributions; LTFDs and (M)LTFDs), and the role of auditory-based analysis of voice quality (VQ) in resolving errors. Recordings for 94 speakers from the DyViS corpus (Nolan et al., 2009) were analysed. Each sample was segmented into consonants and vowels using StkCV (Andre-Obrecht, 1988). The vowel-only portions of the samples were then divided into 20 ms frames from which MFCC (12 MFCCs/12 Δs/12 ΔΔs), LTFD (F1~F4 frequencies/bandwidths/Δs), and (M)LTFD (Mel-weighted F1~F4 frequencies/bandwidths/Δs) feature vectors were extracted. VQ analysis was performed using a modified version of the vocal profile analysis (VPA) scheme (Laver, 1980; San Segundo et al., submitted). The 94 speakers were divided into development (31 speakers), test (31 speakers) and reference (32 speakers) sets. GMM-UBM likelihood ratios (LRs) were computed using the MFCCs, LTFDs and (M)LTFDs. The MFCC data were modelled with 1024 Gaussians, while 32 Gaussians were used for the formant data. Logistic-regression calibration and fusion was conducted using scores from the development data. Validity was evaluated using equal error rate (EER) and the log LR cost function (Cllr; Brümmer and du Preez, 2006).
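For reference, Cllr can be computed from log likelihood ratios as in the short Python sketch below; the formula follows Brümmer and du Preez (2006), and the toy trial scores are invented.

import numpy as np

def cllr(llr_same, llr_diff):
    # llr_same / llr_diff: natural-log LRs for same- and different-speaker
    # trials. A system that always reports LR = 1 scores Cllr = 1 bit;
    # lower values indicate better-calibrated, more informative output.
    c_same = np.mean(np.log2(1 + np.exp(-np.asarray(llr_same))))
    c_diff = np.mean(np.log2(1 + np.exp(np.asarray(llr_diff))))
    return 0.5 * (c_same + c_diff)

print(cllr([2.3, 4.1, 1.7], [-3.2, -1.9, -4.4, 0.5]))  # toy trial scores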

The best performing MFCC system used MFCCs, Δs, and ΔΔs as input (EER=3.23%, Cllr=0.146). All of the LTFD and (M)LTFD systems performed considerably worse, with the (M)LTFD systems producing the poorest performance. For the LTFDs and (M)LTFDs, the addition of bandwidths and Δs did not improve performance. The fusion of LTFDs and (M)LTFDs with the MFCCs had essentially no effect on system performance, and in some cases validity got worse. Despite this, the best performing system overall used MFCCs+Δs+ΔΔs and LTFDs as input.

The errors – one false rejection and 13 false acceptances – produced by this system were evaluated in terms of VQ. A weak correlation was found between the typicality of a speaker’s supralaryngeal VQ profile and the strength of evidence, with unremarkable speakers (i.e. those who were not distinctive in the group as a whole) more likely to produce weak or contrary-to-fact evidence. These results suggest that LTFDs, (M)LTFDs and supralaryngeal VQ profiles capture some of the same speaker-specific information as MFCCs. However, the error pairs were still easy to separate based on auditory analysis, indicating that laryngeal VQ may provide independent complementary information which may improve the performance of (semi-)automatic systems.

References

Andre-Obrecht, R. (1988). A new statistical approach for automatic speech segmentation. IEEE Transactions on Acoustics, Speech and Signal Processing, 36, 29-40.

Brümmer, N. and du Preez, J. (2006). Application independent evaluation of speaker detection. Computer Speech and Language, 20(2-3), 230-275.

Gonzalez-Rodriguez, J., Gil, J., Perez, R. and Franco-Pedroso, J. (2014). What are we missing with i-vectors? A perceptual analysis of i-vector-based falsely accepted trials. In Proceedings of Odyssey 2014: The Speaker and Language Recognition Workshop, Joensuu, Finland, 33-40.

Laver, J. (1980). The Phonetic Description of Voice Quality. Cambridge: Cambridge University Press.

Nolan, F., McDougall, K., de Jong, G. and Hudson, T. (2009). The DyViS database: style-controlled recordings of 100 homogeneous speakers for forensic phonetic research. International Journal of Speech, Language and the Law, 16, 31-57.

San Segundo, E., Foulkes, P., French, J. P., Harrison, P., Hughes, V. and Kavanagh, C. (submitted). The use of the Vocal Profile Analysis for speaker characterisation: a methodological proposal.


Comparison Between Perceptual and Automatic Systems on Finnish Phone Speech Data (FinEval1) – a pilot test using score simulations

Jonas Lindh,1 Andreas Nautsch,2 Therese Leinonen,1 and Joel Åkesson1

1Voxalys AB, Gothenburg, Sweden
{jonas|therese|joel}@voxalys.se
2da/sec – Biometrics and Internet Security Research Group, Hochschule Darmstadt, Germany
[email protected]

As a part of performing forensic speaker comparison casework for the NBI (National Bureau of Investigation1), evaluations of perceptual analyses and automatic systems were conducted on Finnish speech recordings. This is part of our laboratory's standard procedures, both for perceptual comparisons by an analyst and for automatic systems in casework. In this work we present some of the evaluation results and make reasonable score distribution assumptions to perform calibration and thereby compare the different systems. The data set contains 80 recordings from 48 speakers, with a mean duration of 75.3 s (standard deviation 25.5 s) post-VAD. 32 speakers read a text in one recording and speak spontaneously in a second recording, describing a few pictures in front of them. The remaining 16 speakers have only one recording of either read or spontaneous speech. Suspects in Finland are presented with the same pictures and texts and are recorded through the same system using a mobile phone. Hence, the dataset can be utilized for 32 genuine (same speaker), 256 closed-set impostor (different speaker), and 512 open-set impostor comparisons. Due to time constraints of the perceptual analysis, a subset of 9 genuine scores and 244 impostor scores is considered, i.e. a genuine database prior of 9/253 (~0.04). Work on conducting a full perceptual test is on-going. The perceptual evaluation includes both judgment on an ordinal scale (see Nordgaard, Ansell, Drotz, & Jaeger, 2011) and judged likelihood ratios. For these studies the ordinal scale results were used and converted to scores following Lindh & Morrison (2011).

1 https://www.poliisi.fi/en/national_bureau_of_investigation

As the number of genuine scores will inevitably be small in perceptual evaluations, we propose utilizing Monte Carlo simulations of genuine and impostor score distributions in limited-data scenarios, provided the impostor score distribution is Gaussian. This follows from the likelihood ratio idempotence implication: "given either of the distributions, the other distribution is completely determined" (van Leeuwen and Brümmer, 2013). To conduct quantifiable and visualizable system comparisons, each score distribution undergoes Monte Carlo simulation, emitting 1000 scores per distribution, i.e. mimicking a genuine database prior of 0.5. For visualization purposes, all score distributions are aligned by calibration. Even at first glance, interesting observations about potential performance could be made both before and after the simulations.
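A minimal sketch of such a simulation, assuming Gaussian score distributions and using invented observed scores in the 9-vs-244 proportions mentioned above:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed scores: few genuine trials, many impostor trials
genuine_obs = rng.normal(3.0, 1.0, 9)
impostor_obs = rng.normal(-2.0, 1.2, 244)

def simulate(scores, n=1000):
    # Fit a Gaussian to the observed scores and draw n simulated scores.
    return rng.normal(np.mean(scores), np.std(scores, ddof=1), n)

sim_genuine = simulate(genuine_obs)
sim_impostor = simulate(impostor_obs)
# The two equally sized simulated sets mimic a genuine prior of 0.5 and
# can then be aligned by calibration for visual system comparison.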

The automatic systems comprise two different commercial i-vector-based systems and one system based on long-term formant frequencies (LTF). At the presentation, results for both the true and the simulated score distributions will be presented. Implications are drawn for future research as to whether simulations can or should be used, and how they might help in evaluation processes when comparing different systems.

References

van Leeuwen, D. and Brümmer, N. (2013). The distribution of calibrated likelihood-ratios in speaker recognition. In Proceedings of INTERSPEECH 2013, 1619-1623.

Lindh, J. and Morrison, G. S. (2011). Humans versus machine: forensic voice comparison on a small database of Swedish voice recordings. In Proceedings of ICPhS 2011 (Vol. 17, p. 4).

Nordgaard, A., Ansell, R., Drotz, W. and Jaeger, L. (2011). Scale of conclusions for the value of evidence. Law, Probability and Risk.


Strengths and Weaknesses of using Feature Selection in Automatic Accent Recognition

Georgina Brown, and Dominic Watt
Department of Language and Linguistic Science, University of York, York, UK
{gab514|dominic.watt}@york.ac.uk

Feature selection is a way of estimating which features are most valuable to a task, and only using those features to conduct that task. It is used across disciplines, including gene selection and text categorization (Guyon and Elisseeff, 2003). Integrating feature selection can benefit a system in two main ways: it can reduce the computational cost of a process, and it can improve performance by removing 'noisy' features. It is usually applied to tasks involving large volumes of high-dimensional data. It is possible, however, that feature selection techniques could also benefit forensic speech science. We might not always have up-to-date knowledge of the specific spoken varieties involved in casework, so a way of automatically identifying which features are most useful in a given task would be valuable.

This paper builds on work in Brown (2015), Brown (2016) and Wu et al. (2010). We take an automatic accent recognition system, the Y-ACCDIST-SVM system, and integrate two feature selection methods that were implemented in Wu et al.'s automatic accent recognition experiments. Y-ACCDIST-SVM has been shown to perform well on different accent corpora, and it has been suggested that it could be a potential tool for forensic applications. Y-ACCDIST takes a segmental approach to automatic accent recognition, using representations of phoneme units to model individual speakers' accents. These models are then used to assign unknown speakers an accent category. The features we refer to in this study are the pairs of phoneme units that form the model. This paper observes whether incorporating feature selection can improve Y-ACCDIST's performance further, on two different corpora: the AISEB (Accent and Identity on the Scottish/English Border) corpus (Watt et al., 2014) and the Northern Englishes corpus (Haddican et al., 2013). Within this cross-corpus comparison, we also compare two feature selection methods: Analysis of Variance (ANOVA) and Support Vector Machine Recursive Feature Elimination (SVM-RFE).
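Both selection methods are available in standard machine-learning toolkits. The following Python sketch shows, under hypothetical data dimensions, how ANOVA-based selection and SVM-RFE can be applied with scikit-learn; it illustrates the general techniques rather than the Y-ACCDIST implementation itself.

import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.svm import SVC

# Hypothetical data: one row per speaker, one column per phoneme-pair
# feature; y holds each speaker's accent-group label (4 groups).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 200))
y = rng.integers(0, 4, size=120)

# ANOVA: keep the 25 features with the largest F-statistic across groups.
anova = SelectKBest(f_classif, k=25).fit(X, y)

# SVM-RFE: repeatedly drop the features with the smallest linear-SVM weights.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=25, step=5).fit(X, y)

print(np.flatnonzero(anova.get_support()))  # indices of retained features
print(np.flatnonzero(rfe.support_))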

In Figure 1 below, we can see how the two feature selection methods affect accent recognition performance on the AISEB corpus, classifying speakers into one of four accent groups. Using each feature selection method, we restrict Y-ACCDIST-SVM to only the features ranked as most valuable. In increments of five features, we generate a recognition rate from the system for each feature selection method. The horizontal red line marks the baseline performance where all available features (all vowels and consonants) are included in the analysis.

Figure 1 The effect of each feature selection method on Y-ACCDIST-SVM performance when classifying speakers of the AISEB corpus.

In the case of the AISEB varieties, the two feature selection methods appear to affect performance differently. ANOVA, overall, reaches the highest performance, whereas SVM-RFE more consistently brings performance above the baseline level.

We will compare the performances of these methods on the two corpora, and look more closely at the different feature rankings the two methods produce, to gather an idea of which phonemes are most useful to different tasks.

References

Brown, G. (2015). Automatic recognition of geographically-proximate accents using content-controlled and content-mismatched speech data. Proceedings of the 18th International Congress of Phonetic Sciences. Paper number 458.

Brown, G. (2016). Exploring Forensic Accent Recognition using the Y-ACCDIST System. In Proceedings of the 16th Australasian International Conference on Speech Science and Technology. Sydney, Australia. 305-308.

Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.

Haddican, W., Foulkes, P., Hughes, V. and Richards, H. (2013). Interaction of social and linguistic constraints of two vowel changes in Northern England. Language Variation and Change, 25, 59-74.

Watt, D., Llamas, C. and Johnson, D. E. (2014). Sociolinguistic variation on the Scottish-English Border. In R. Lawson (Ed.), Sociolinguistics in Scotland. London: Palgrave Macmillan. 79-102.

Wu, T., Duchateau, J., Martens, J.-P. and Compernolle, D. V. (2010). Feature subset selection for improved native accent identification. Speech Communication, 2, 83-98.


Speaker-specific dynamic features of diphthongs in Standard Chinese

Wang Li,1,2 Kang Jintao,1,2 Li Jingyang,1,2 Wang Xiaodi1,2

1Institute of Forensic Science, Ministry of Public Security, PRC
2Key Laboratory of Intelligent Speech Technology, Ministry of Public Security, PRC
{wangli|kangjintao|lijingyang|wangxiaodi}@cifs.gov.cn

This paper is eligible for the 'Best Student Paper Award'

Static features of monophthongs show differences between speakers and thus provide important and reliable information in the field of speaker identification. However, many criminal cases involving voice comparison lack sufficient speech content for extracting and comparing static features. Considering that diphthongs make up a large proportion of the vowels of Standard Chinese, it is necessary to find more speaker-specific features in them, to make the most of existing voice recordings. Early studies (Eriksson, Cepeda, Rodman, McAllister et al. 2004; Eriksson, Cepeda, Rodman, Sullivan et al. 2004; Greisbach, Esser and Weinstock 1995; Ingram, Prandolini and Ong 1996; McDougall 2004; Rodman et al. 2002; Nolan and Grigoras 2005) have shown that the dynamic features of diphthongs can be used in voice comparison. The present study examines eight diphthongs produced by twenty male and twenty female Standard Chinese speakers. Formant trajectories and F1, F2, F3 and F4 frequencies of the individual phonemes in diphthongs are measured and compared with those of monophthongs. The results show that the formant trajectories of F1 and F2 in diphthongs have roughly the same shape across speakers, while those of F3 and F4 differ between speakers and are of greater value in distinguishing them. The results of discriminant analyses based on predictors from four formants range from 93% for [ei] to 100% for [wo], providing a solid foundation for the use of dynamic features of Standard Chinese diphthongs in forensic speaker comparison.


Introduction

Early studies (Eriksson, Cepeda, Rodman, McAllister et al. 2004; Eriksson, Cepeda, Rodman, Sullivan et al. 2004; Greisbach, Esser and Weinstock 1995; Ingram, Prandolini and Ong 1996; McDougall 2004; Rodman et al. 2002; Nolan and Grigoras 2005) have proved that the dynamic features of diphthongs can be used in voice comparison. This paper discusses the speaker-specific dynamic features of diphthongs in Standard Chinese.

Method

2.1 Subjects and Materials

20 male and 20 female proficient Standard Chinese speakers each read, five times, eight sentences containing the diphthongs [ai], [ei], [au], [ou], [ja], [je], [wa], and [wo].

2.2 Recording and Measurements

Recording settings: the distance between the microphone and the speaker's mouth was about 10 cm; sampling rate 11.025 kHz; bit depth 16 bits; mono channel.

Broadband spectrogram analysis: bandwidth 300 Hz, dynamic range 35 dB, attenuation 10 dB, pre-emphasis factor 0.65, Hamming window.

LPC analysis: frame length 20 ms, 512-point FFT, analysis order 14.
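As an illustration of the general technique (not the authors' exact toolchain), formant candidates can be obtained from such an LPC analysis by solving for the roots of the LPC polynomial, as in the Python sketch below; a real analysis would additionally filter the candidates by bandwidth.

import numpy as np
import librosa

# One hypothetical 20 ms frame at 11.025 kHz (white noise stands in
# for a real vowel frame here).
sr = 11025
frame = np.random.randn(int(0.020 * sr))

# 14th-order LPC over a Hamming-windowed frame, as in the settings above.
a = librosa.lpc(frame * np.hamming(len(frame)), order=14)

# Formant candidates are the LPC polynomial roots in the upper half-plane,
# converted from angle to frequency.
roots = np.roots(a)
roots = roots[np.imag(roots) > 0]
freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
print(freqs[:4])  # rough F1-F4 candidates in Hz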

Results

3.1 F1, F2, F3 and F4 frequencies of these eight diphthongs

We have studied 8 diphthongs and thus generated 16 line charts for the distribution of frequencies. The following is the line chart of [ai] for 20 males.


Figure 1 Line chart of [ai] for 20 males

3.2 Discriminant analysis on the combined trajectories of F1, F2, F3 and F4

We set 5 measurement points for each formant, so each diphthong token yields 20 measurement points across the 4 formants. Each diphthong was pronounced 5 times, so across all 40 speakers we obtain 4000 measurement points. The following is the table of discriminant analysis results of [ai] for 20 male speakers.

Table 1. Discriminant analyses results of [ai] for 20 male speakers
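As a rough illustration of this kind of speaker classification, the Python sketch below runs a cross-validated linear discriminant analysis over hypothetical data with the same layout (20 speakers, 5 repetitions, 5 points x 4 formants per token); it is not the authors' own analysis.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Hypothetical layout mirroring the text: 20 speakers x 5 repetitions of
# one diphthong, each token described by 5 points x 4 formants = 20 values.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = np.repeat(np.arange(20), 5)  # speaker label for each token

rates = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
print(f"mean classification rate: {rates.mean():.1%}")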


Discussion and Conclusion

The results of discriminant analyses based on predictors from four formants range from 93% for [ei] to 100% for [wo], providing a solid foundation for the use of dynamic features of Standard Chinese diphthongs in forensic speaker comparison. The following bar charts show the discriminant analysis results based on predictors of [wo]; the upper part is for male speakers and the lower part for female speakers.

References

Eriksson, E.J., Cepeda, L.F., Rodman, R.D., Sullivan, K.P.H., McAllister, D.F., Bitzer, D. and Arroway, P. (2004). Robustness of spectral moments: a study using voice imitations. In S. Cassidy, F. Cox, R. Mannell and S. Palethorpe (eds) Proceedings of the Tenth Australian International Conference on Speech Science and Technology, Sydney, 2004, 259-264. Canberra, Australia: Australasian Speech Science and Technology Association.

Eriksson, E.J., Cepeda, L.F., Rodman, R.D., McAllister, D.F., Bitzer, D. and Arroway, P. (2004). Cross-language speaker identification using spectral moments. In P. Branderud, O. Engstran and H. Traunmüller (eds) Proceedings of FONETIK 2004: The Seventeenth Swedish Phonetics Conference, 76-79. Stockholm: Department of Linguistics, Stockholm University.

Greisbach, R., Esser, O. and Weinstock, C. (1995). Speaker identification by formant contours. In A. Braun and J.-P. Köster (eds) Studies in Forensic Phonetics, 49-55. Trier, Germany: Wissenschaftlicher Verlag.

Ingram, J.C.L., Prandolini, R. and Ong, S. (1996). Formant trajectories as indices of speaker identification. Forensic Linguistics (The International Journal of Speech, Language and the Law), 3(1), 129-145.

Li, K. and Li, J. (2005). A Qualitative Study of Cantonese Speaker Identification. Forensic Science and Technology, 0(6), 6-8.

Lin, T. and Wang, L. (2003). A Course in Phonetics. Peking University Press.

Lu, W. (2000). Statistical Analysis of SPSS for Windows. Publishing House of Electronics Industry.

McDougall, K. (2006). Dynamic features of speech and the characterisation of speakers. International Journal of Speech, Language and the Law, 13(1), 89-126.

McDougall, K. (2004). Speaker-specific formant dynamics: an experiment on Australian English. The International Journal of Speech, Language and the Law, 11(1), 103-130.

Nolan, F. and Grigoras, C. (2005). A case for formant analysis in forensic speaker identification. The International Journal of Speech, Language and the Law, 12(2), 144-173.

Rodman, R., McAllister, D., Bitzer, D., Cepeda, L. and Abbitt, P. (2002). Forensic speaker identification based on spectral moments. The International Journal of Speech, Language and the Law, 9(1), 22-43.

Wu, Z. and Lin, M. (1987). Essentials of Experimental Phonetics. Higher Education Press.


Comparing Chinese Identical Twins' Speech Using Frequent Speech Acoustic Characteristics

Jun-jie Yang
Department of Criminal Science and Technology, Shanxi Police College, Taiyuan, Shanxi, China
[email protected]

There is a growing consensus that distinguishing identical twins' voices is a challenge in the forensic realm (e.g. Nolan and Oh 1996; Yang et al. 2005; Loakes 2006; Ariyaeeinia et al. 2008; Künzel 2010; Weirich 2011; Leemann et al. 2014; San Segundo 2015). This paper describes an experiment using frequently occurring acoustic characteristics, including formant characteristics and prosodic characteristics, to discriminate speakers within identical twin pairs.

Experiment

Contemporary speech samples of 30 monozygotic twin pairs (13 male pairs and 17 female pairs, age range 11-67) were recorded in quiet rooms. The same text, including the main syllables of Chinese, was read five times. The sampling rate was 16 kHz.

VS-99 Voice Station (Version 4.0), speech analysis software developed by Yangchen Electronic Technology Company, was used to carry out the acoustic measurements of 13 different syllables. The relative formant intensity, relative duration, relative intensity and relative tone were calculated.

To begin with, each characteristic was analyzed using the 'Explore' function of SPSS at the 95% confidence level (Lu Wendai, 2000). Results show that each investigated characteristic broadly conforms to a normal distribution.

Then, under the normality assumption, the distribution range of each feature is determined at a given confidence level (Dengbo, 1984). This distribution range represents the individual variability of the feature. If the compared value falls within the variation range of a feature, there is no significant difference between the two features. If there are more than a certain number of marked differences between two syllables, the syllables may come from different speakers; and if there are more than a certain number of syllables with significant differences, the compared speech samples are labelled as coming from different speakers.
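The core range check described above can be sketched in Python as follows; the function name and the values are hypothetical illustrations of the procedure, not the software actually used in the study.

import numpy as np
from scipy import stats

def within_range(reference_values, compared_value, confidence=0.95):
    # Determine the reference speaker's distribution range under a normal
    # assumption and test whether the compared value falls inside it.
    m = np.mean(reference_values)
    s = np.std(reference_values, ddof=1)
    z = stats.norm.ppf(0.5 + confidence / 2)  # 1.96 for 95%
    return m - z * s <= compared_value <= m + z * s

# Hypothetical relative formant intensities from five readings of a syllable
ref = [0.82, 0.79, 0.84, 0.81, 0.80]
print(within_range(ref, 0.90))  # False -> counts as a marked difference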

Results

Following the principle of minimising the sum of the two kinds of errors (false rejections and false acceptances), statistical analysis of the measurements was carried out on the confidence level and the threshold for each identical twin pair. The results can be seen in Table 1 and Table 2.

Discussion and conclusions

The results reveal that there is speaker-specific behavior even within identical twin pairs, and also support the assumption that intra-speaker variation is smaller than inter-speaker variation.

Results show that, among the examined characteristics, the distinguishing power is ordered formant frequency characteristics > formant intensity > prosodic characteristics; the formant frequency characteristics are the most robust, and the prosodic features only have negative values. This result differs from Gold and French's (2011) study, especially from its statement that vowel formant analysis is rarely insightful.


Note:’Y’ indicates that there is a significant difference in the characteristics within the twins’ speech and 'N' is the opposite

References

Lu, Wendai. (2000). SPSS for Windows Statistics and Analysis, 223. Beijing: Publishing House of Electronics Industry.

Dengbo. (1984). The application of statistical methods in analysis and testing. Beijing: Chemical Industry Press.

Nolan, F. and Oh, T. (1996). Identical twins, different voices. International Journal of Speech Language and the Law, 3(1), 39-49.

Leemann, A. et al. (2014). Speaker-individuality in suprasegmental temporal features: Implications for forensic voice comparison. https://www.researchgate.net/publication/261184884_.

Gold, E. and French, P. (2011). An international investigation of forensic speaker comparison practices. In Proceedings of the 17th International Congress of Phonetic Sciences, Hong Kong, China, 1254-1257.

Künzel, H. J. (2010). Automatic speaker recognition of identical twins. International Journal of Speech, Language and the Law, 17(2), 251-277.

Ariyaeeinia, A., Morrison, C., Malegaonkar, A. and Black, S. (2008). A test of the effectiveness of speaker verification for differentiating between identical twins. Science & Justice, 48(4), 182-186.

San Segundo, E. (2015). Forensic speaker comparison of Spanish twins and non-twin siblings: a phonetic-acoustic analysis of formant trajectories in vocalic sequences, glottal source parameters and cepstral characteristics. International Journal of Speech, Language and the Law, 22(2), 249-253.


Speaker Identification Using Laughter in a Close Social Network

Elliott Land,1 and Erica Gold2

1Department of Linguistics and Modern Languages, University of Huddersfield, Huddersfield, UK
[email protected]
2Department of Linguistics and Modern Languages, University of Huddersfield, Huddersfield, UK
[email protected]

This paper is eligible for the 'Best Student Paper Award'

This paper considers the speaker-specificity of laughter. Forensically-relevant research on laughter is extremely limited in the literature. However, experts have reported analysing laughter in forensic speaker comparison casework (Gold and French 2011). The research presented here serves as a preliminary investigation into the speaker-specificity of laughter by conducting a laughter recognition test in a close social network.

A group of female undergraduate university students aged 20-21 was deemed to form a close social network on the basis that they had spent several hours of academic time and an unknown amount of social time together for at least two years prior to the study. Two foils of similar age and education level to the network members were also recruited. Laughter was elicited from these 7 network members and 2 foils whilst they individually viewed humorous video clips, a method of elicitation commonly used in laughter studies (see Mowrer et al. 1987; Petridis et al. 2013). The elicited laughter was classified at the episodic level (Cosentino et al. 2016) as being either voiced, unvoiced, or mixed (Bachorowski et al. 2001). Samples of approximately 4 seconds (not including 1.5-second pauses between individual episodes) were created to best represent episodic laughter produced by the participants. These samples were then presented in an open speaker recognition task.


Overall, the results suggest that the network members performed poorly, particularly in contrast to the performance of Barron & Foulkes' (2000) network members in a speaker recognition task that used speech samples. In the present study, each network member identified only one speaker correctly. Furthermore, only two of these six listeners identified themselves correctly. The largest number of correct identifications of any speaker was three, while another three of the network members were never correctly identified. This suggests that some speakers' laughter may be easier to identify than others'. Previous studies that have investigated laughter using a voice line-up method have reported higher correct identification rates, ranging from 47-62% (Philippon et al. 2013; Yarmey 2004). The differences between the results of the present study and previous studies may be explained by qualitative and quantitative differences in the laughter samples themselves. Whereas this study used 4-second samples containing varying amounts of voiced, unvoiced, and mixed laughter, Philippon et al. (2013) used 13-second samples of voiced laughter. This suggests that longer voiced samples may facilitate higher identification rates. The most frequently identified speaker in the present study had the largest amount of voiced laughter across the samples. This could indicate that voicing does indeed aid the identification of speakers, which may be expected given the difficulty of identifying speakers from whispered speech (Bartle & Dellwo 2015).

The findings presented in this paper are relevant to the larger forensic phonetic context, specifically identification, attribution, and speaker comparison. The results of this study suggest that using laughter, specifically voiceless laughter, in naïve speaker identification may generally be a more difficult identification task than using speech alone. Further research is still needed on the speaker-specificity of voiced laughter. However, laughter may have the potential to be a useful speaker discriminant in forensic phonetic casework.

References

Bachorowski, J., Smoski, M. J., & Owren, M. J. (2001). The acoustic features of human laughter. The Journal of the Acoustical Society of America, 110(3.1), 1581-1597.

Barron, A., & Foulkes, P. (2000). Telephone speaker recognition amongst members of a close social network. The International Journal of Speech, Language and the Law, 7(2), 180-198.

Bartle, A., & Dellwo, V. (2015). Auditory speaker discrimination by forensic phoneticians and naive listeners in voiced and whispered speech. The International Journal of Speech, Language and the Law, 22(2), 229-248.

Cosentino, S., Sessa, S., & Takanishi, A. (2016). Quantitative Laughter Detection, Measurement, and Classification. A Critical Survey. IEEE Reviews in Biomedical Engineering, 9(1), 148-162.

Gold, E., & French, P. (2011). International practices in forensic speaker comparison. The International Journal of Speech, Language and the Law, 18(2), 293-307. doi:10.1558/ijsll.v18i2.293

Mowrer, D. E., LaPointe, L. L., & Case, J. (1987). Analysis of five acoustic correlates of laughter. Journal of Nonverbal Behavior, 11(3), 191-199.

Petridis, S., Martinez, B., & Pantic, M. (2013). The MAHNOB laughter database. Image and Vision Computing, 31(2), 186-202.

Philippon, A. C., Randall, L. M., & Cherryman, J. (2013). The Impact of Laughter in Earwitness Identification Performance. Psychiatry, Psychology and Law, 20(6), 887-898.

Yarmey, A. D. (2004). Common-sense beliefs, recognition and the identification of familiar and unfamiliar speakers from verbal and non-linguistic vocalizations. International Journal of Speech, Language and the Law, 11(2), 267-277.


Rhythm and speaker-specific variability in shouted speech

Kostis Dimos, Volker Dellwo, and Lei He
Phonetics Laboratory, University of Zurich, Zurich, Switzerland
{kostis.dimos|lei.he|volker.dellwo}@uzh.ch

This paper is eligible for the 'Best Student Paper Award'

Recent studies have provided evidence that several aspects of the rhythmic and temporal characteristics of speech contribute to speaker identity (Leemann et al. 2014; Dellwo et al. 2012a; Dellwo et al. 2012b; Dellwo et al. 2012c). Rhythmic measurements include vocalic and consonantal interval variability metrics (%V, nPVI-V, VarcoC, VarcoV; Ramus et al. 1999; Grabe & Low 2002; Dellwo 2006; Dellwo et al. 2012a) as well as peak-to-peak and voicing variability metrics. Research on high-vocal-effort speech has revealed considerable differences between normal and loud speech (Traunmüller & Eriksson 2000); however, there is very little research on the effect of increased loudness on speech rhythm.
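To make the interval metrics concrete, the following Python sketch computes %V and the normalised PVI from hypothetical interval durations; in practice the durations would come from annotated vocalic and consonantal intervals.

import numpy as np

def percent_v(vocalic, consonantal):
    # %V: share of total interval duration that is vocalic
    return 100 * np.sum(vocalic) / (np.sum(vocalic) + np.sum(consonantal))

def npvi(intervals):
    # Rate-normalised pairwise variability index (Grabe & Low, 2002):
    # mean of |d_k - d_k+1| over the pair mean, scaled by 100.
    d = np.asarray(intervals, dtype=float)
    return 100 * np.mean(2 * np.abs(np.diff(d)) / (d[1:] + d[:-1]))

# Hypothetical vocalic/consonantal interval durations (s) for one sentence
voc = [0.08, 0.12, 0.10, 0.15]
cons = [0.06, 0.09, 0.07, 0.05]
print(f"%V = {percent_v(voc, cons):.1f}, nPVI-V = {npvi(voc):.1f}")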

Our study provides an overview of the rhythmic characteristics of normal and shouted speech. The speakers were recorded reading a set of semantically neutral, equal-length and structurally identical sentences. A sound-level-calibrated Matlab script was used to elicit loud speech by indicating when loudness exceeded a predefined threshold of 80 dB (C-weighted).

Our analysis includes vocalic, consonantal and peak-to-peak interval variability measurements both between and within speakers. Our results indicate an increase in the vocalic part of speech in shouting. The between-speaker variability in %V remains significant both in normal [F(16,323) = 3.26, p < .001] and shouted speech [F(16,321) = 6.84, p < .001]. The duration of the consonantal intervals is lower in shouting, as the duration of the vocalic intervals increases. Despite the increased duration of the vocalic intervals, the normalized PVI both for vocalic intervals (p = 0.8) and for peak-to-peak intervals (p = 0.1) does not differ between the two conditions (Figure 1). Significant between-speaker variability was found in the PVI of peak-to-peak intervals only in the shouting condition [F(16,321) = 1.8, p = .027], not in normal speech (p = 0.28). Additionally, the duration of the vocalic peak-to-peak intervals does not vary significantly between normal and shouted speech [F(1,676) = 0.274, p = 0.6].

Our initial results indicate that, although the duration of vocalic intervals is higher in shouting, this does not affect the duration and variability of the peak-to-peak intervals. Additionally, the nPVI seems unaffected by the rate variation in the consonantal and vocalic intervals across the conditions. Interestingly, the increased duration of the vocalic intervals does not result in a higher nPVI in shouting. Finally, %V is strongly affected by the increased loudness, while the between-speaker variability remains significant in both conditions.

Acknowledgments: This study is supported by the 2015 IAFPA research grants.

Figure 1 Left: Rate, log normalized mean and nPVI measurements in normal and shouted speech. Right: Percentage of vocalic speech by speaker and by condition


References

Dellwo, V. (2006). Rhythm and speech rate: a variation coefficient for deltaC. In P. Karnowski & I. Szigeti (Eds.), Language and Language Processing, 231-241. Frankfurt: Peter Lang.

Dellwo, V., Leemann, A. and Kolly, M.-J. (2012a). Speaker idiosyncratic rhythm features in the speech signal. In Interspeech, Portland, USA.

Dellwo, V., Kolly, M.-J. & Leemann, A. (2012b). Speaker identification based on speech temporal information: A forensic phonetic study of speech rhythm in the Zurich variety of Swiss German. Abstract presented at IAFPA 2012, Santander, Spain.

Dellwo, V., Schmid, S., Leemann, A., Kolly, M.-J. & Müller, M. (2012c). Speaker identification based on speech rhythm: the case of bilinguals. Abstract presented at PoRT 2012, Glasgow, UK.

Grabe, E. and Low, E. L. (2002). Durational variability in speech and the rhythm class hypothesis. In N. Warner & C. Gussenhoven (Eds.), Papers in Laboratory Phonology 7, 515-543. Berlin and New York: Mouton de Gruyter.

Leemann, A., Kolly, M.-J., and Dellwo, V. (2014). Speech-individuality in suprasegmental temporal features: implications for forensic voice comparison. Forensic Science International, 238, 59-67.

Ramus, F., Nespor, M. and Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition, 73, 265-292.

Traunmüller, H., and Eriksson, A. (2000). Acoustic effects of variation in vocal effort by men, women, and children. Journal of the Acoustical Society of America, 107, 3438.


Cepstral Dynamics in MFCCs using Conventional Deltas for Emotion and Speaker Recognition

Thayabaran Kathiresan, and Volker Dellwo
Phonetics Laboratory, Department of Computational Linguistics, University of Zurich, Switzerland
{thayabaran.kathiresan|volker.dellwo}@uzh.ch

This paper is eligible for the 'Best Student Paper Award'

In forensic speaker comparison, the acoustic evidence data is typically not in the same speech register as the comparison data (e.g. the evidence is more often emotionally aroused). This can create problems for automatic speaker recognition. Here we tested whether we can enhance recognition through cepstral dynamics in MFCCs when within-speaker variability caused by emotional arousal is present. Of all the features developed for automatic speaker, speech and emotion recognition, Mel-frequency cepstral coefficients (MFCCs) have been the most successful and best known. The vocal tract characteristics in speech are segmented in the frequency domain using a perceptually motivated series of Mel filter banks, and pitch excitation is separated in the log domain. The resultant series of cepstral band elements is de-correlated using the discrete cosine transform (DCT) to yield MFCCs. The MFCCs are segmental features. To include the dynamics between successive time frames, the neighbouring frames' MFCCs are differenced and appended as Delta and Delta-Delta features.



Figure 1 MFCCs temporal deltas (left side) and MFCCs cepstral deltas (right side) derived from the MFCCs of a speech signal.

The two temporal dynamic features, Delta and Delta-Delta, help to nullify the framing effects of a signal in the time domain (Davis, 1980). However, the series of Mel filter banks is applied to a segmented speech signal when computing MFCCs, which disturbs the spectral dynamics. This research addresses the significance of the cepstral dynamics of MFCCs in speaker and emotion recognition by adding Delta and Delta-Delta features derived in the frequency domain.
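The distinction can be illustrated with a toy computation: conventional deltas difference the MFCC matrix along the time axis, whereas the cepstral deltas proposed here difference along the coefficient axis. The Python sketch below uses simple first-order differences on random data; practical delta computation typically uses a regression window (Davis, 1980), so this is only schematic.

import numpy as np

def deltas(X, axis):
    # First-order differences of an MFCC matrix (n_frames x n_coeffs).
    # axis=0: conventional temporal deltas across successive frames;
    # axis=1: cepstral deltas across neighbouring coefficients.
    return np.diff(X, axis=axis, prepend=np.take(X, [0], axis=axis))

mfcc = np.random.randn(200, 13)          # hypothetical 13-dim MFCC stream
d_c = deltas(mfcc, axis=1)               # cepstral Delta
dd_c = deltas(d_c, axis=1)               # cepstral Delta-Delta
features = np.hstack([mfcc, d_c, dd_c])  # 39-dim vector per frame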

To study the importance of cepstral deltas, the Berlin database of emotional speech (Burkhardt, 2005) was used for emotion recognition (Task 1), and infant- and adult-directed speech (IDS, ADS) of 10 Swiss-German mothers was used for speaker identification (Task 2). The emotion database includes 7 emotions, namely anger (A), boredom (B), disgust (D), fear/anxiety (F), happiness (H), sadness (S) and neutral (N), produced by 5 men and 5 women. The recordings were used to train a Gaussian mixture model (GMM) with 7 mixtures and diagonal covariance. Table 1 lists the features used in modelling the emotion classes and the speakers for Tasks 1 and 2 respectively.


Table 1. List of features used in training GMM and their dimensions

Features                 Dimension
MFCCs + 2Δ (temporal)    39
MFCCs + 2Δ (cepstral)    39

Table 2. Task 1: Emotion recognition rate (%) obtained for the different emotions with respect to the features.

Features                 Anger  Boredom  Disgust  Fear/Anxiety  Happiness  Sadness  Neutral
MFCCs + 2Δ (temporal)    92.3   3.8      93.3     4.5           13         100      4
MFCCs + 2Δ (cepstral)    69.2   42.3     80       63.6          47.8       100      40

As Table 2 shows, the emotion recognition rate of the MFCCs with temporal deltas is very low for boredom, fear/anxiety, happiness and neutral in comparison with the rates for cepstral deltas. However, the recognition rate of the cepstral deltas is lower for anger. Overall, the cepstral deltas are better than the temporal deltas at capturing the emotion-specific characteristics of the speakers.

Table 3. Task 2: Speaker recognition rate (%) obtained for IDS and ADS (☀ = training data; ☼ = test data).

Features                 IDS☀/IDS☼  IDS☀/ADS☼  ADS☀/IDS☼  ADS☀/ADS☼
MFCCs + 2Δ (temporal)    88.52      66.9       54.91      96.66
MFCCs + 2Δ (cepstral)    81.15      42.62      36.44      83.05

In Table 3, for both matched and cross-register conditions, the speaker recognition rates of the MFCCs with temporal deltas are higher than those with cepstral deltas. Hence, the temporal deltas represent speaker-specific characteristics better than cepstral deltas, whereas cepstral deltas represent emotion-specific characteristics within speakers (Table 2). These results support the view that cepstral dynamics are not a speaker-specific cue to be relied on in speaker identification or verification.


As the dataset is rather small (10 female speakers in Task 2), the results should be replicated to allow generalization. For future study, we plan to perform speaker identification on larger datasets and on cross-register or emotional speech, using cepstral deltas with other classifiers such as support vector machines (SVMs) (Sreenivasa, 2013) and hidden Markov models (HMMs) (Sato, 2007).

References

Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., and Weiss, B. (2005). A database of German emotional speech. Interspeech, vol. 5, 1517-1520.

Davis, S. and Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357-366.

Sato, N. and Obuchi, Y. (2007). Emotion recognition using mel-frequency cepstral coefficients. Information and Media Technologies, 2(3), 835-848.

Sreenivasa, R. K. and Koolagudi, S. G. (2013). Robust emotion recognition using spectral and prosodic features. Springer Science & Business Media.


Speaker recognition from the island of Brač

Zdravka Biočina, and Gordana Varošanec-Škarić
Department of Phonetics, University of Zagreb, Croatia
[email protected], [email protected]

This paper is eligible for the 'Best Student Paper Award'

Local speeches of the island of Brač can be divided into four language groups depending on the settlement location: a) central Čakavian settlements, b) west Čakavian settlements, c) east Čakavian settlements with Štokavian influence and d) the east Štokavian settlement of Sumartin (Sujoldžić et al., 1988; Šimunović, 2009) (Figure 1).

Previous sociophonetic research on the recognition of Croatian varieties (Kalogjera, 1985; Mildner, 1997; Varošanec-Škarić and Kišiček, 2009; Kišiček, 2012) has included urban varieties and the assessment of listeners' attitudes towards urban and rural speakers. Kalogjera (1985) explored the attitudes of city speakers of Korčula toward the speeches of villages on the island of Korčula. He stated that "the average inhabitant of the city of Korčula is capable of detecting fine distinctions in the speech of villagers such that he can identify the village of each speaker with a minimal error; villagers have the same ability." (Kalogjera, 1985:97).

Two studies of the recognition of Croatian urban varieties showed that the listeners had a problem assessing the exact city of origin, but were more successful in connecting the speaker with the region. Mildner (1997) stated that when recognizing speakers, the listeners are more influenced by word stress, while Kišiček (2012) concluded that the listeners are more influenced by the segmental level.

To test these conclusions on rural inhabitants, 30 speakers from five settlements on the island of Brač were recorded, with six speakers from each of Supetar, Pučišća, Bol, Sumartin and Pražnica. The criteria for selecting the speakers were: 1) that they were born in the settlement included in the study, 2) that they had been living there for the past 10 years, and 3) that they are at least second-generation inhabitants of the settlement. Speakers were divided according to age group: younger (N=2; 18-39 years), middle-aged (N=2; 40-59 years) and older (N=2; over 60 years), and by gender (NF=3, NM=3). Based on five minutes of spontaneous speech from each speaker, 23-second samples were created and played randomly for recognition. All toponyms and other indicators that could reveal the origin of the speaker were cut out of the samples. Apart from the speakers from Brač, two speakers from Split (Čakavian-Štokavian speech) were also included in the recognition, to examine whether the listeners could distinguish the speech of Split from the speeches of Brač. The listeners (NF=6, NM=8) were native speakers from Brač who were born there and still live there. They were divided into three age groups to test the common assumption that older listeners are better at recognizing native speakers. The listeners had to decide whether the speaker was from Brač or Split; if they chose Brač, they had to write down the exact settlement.

Hypotheses:

1. The listeners' recognition will be influenced by the segmental level; therefore they will be better at recognizing speeches whose pronunciation of vowels and consonants contains specific features, e.g. diphthongs in the speech of Pučišća.

2. Since diphthongs are less frequent in the speech of Bol than in the speech of Pučišća, listeners will classify those speakers from Bol whose speech contains diphthongs as speakers from Pučišća.

3. The speech of Pražnica will have a higher rate of recognition due to its specific tonal intonation.

A middle-aged woman was the most successful in recognition (82%), while the second best was the youngest listener (18 years). One of the older listeners had the worst result. In total, the best achievements were those of the middle-aged (66%) and the younger groups (65%). Unexpectedly, the older group (50%) was the worst.

Overall, it can be concluded that native listeners recognize only those settlements whose speeches have very marked segmental features in dialectal pronunciation. It is considerably harder for them to recognize speakers by word stress alone, with the exception of the speech of Pražnica, which they recognize well due to its specific tonal intonation. The listeners mistook the speech of Bol for that of Pučišća 16 times. Because of the Štokavian influence on the Čakavian speech of Supetar, listeners did poorly in differentiating the speech of Supetar from that of Sumartin, as well as in differentiating both of them from the speech of Split. The speech of Sumartin had the lowest rate of recognition, since the settlers who founded Sumartin preserved many Štokavian features (Hraste, 1940). Listeners mistook the speech of Sumartin for the speeches of Supetar and Split a total of 50 times. Split was recognized in only 54% of cases. With these results in mind, the recognition experiment is planned to be carried out with a native expert.

Figure 1. Dialect distribution of the island of Brač (adapted from Šimunović 2009: 14)

References

Hraste, M. (1940). Čakavski dijalekat ostrva Brača. Srpski dijalektološki zbornik, X, 1-66.

Kalogjera, D. (1985). Attitudes toward Serbo-Croatian language varieties. International Journal of the Sociology of Language, 52, 93-109.

Kišiček, G. (2012). Forenzično profiliranje i prepoznavanje govornika prema gradskim varijetetima hrvatskoga jezika. Unpublished Doctoral Dissertation: Zagreb.

Mildner, V. (1997). Prepoznavanje hrvatskih govora. Zbornik savjetovanja Hrvatskog društva za primijenjenu lingvistiku, 209-221.

Šimunović, P. (2009). Rječnik bračkih čakavskih govora. Zagreb: Golden marketing - Tehnička knjiga.

Sujoldžić, A., Šimunović, P., Finka, B. and Rudan, P. (1988). Sličnosti i razlike u govorima otoka Brača kao odraz migracijskih kretnja. Rasprave zavoda za jezik, 14, Zagreb, 163-184.

Varošanec-Škarić, G. and Kišiček, G. (2009). Izvanjske indeksikalne osobine govornika varaždinskoga i osječkoga govora. Suvremena lingvistika, 1(67), 109-125.


Posters

Abstracts are presented in the running order of the programme.



Can we hear nicotine craving?

Sandra Schwab,1,3 Michael S. Amato,2 Volker Dellwo,3 and Marianela Fernández Trinidad4

1Université de Genève
[email protected]
2Schroeder Institute for Tobacco Studies at Truth Initiative
[email protected]
3Universität Zürich
[email protected]
4Consejo Superior de Investigaciones Científicas de España
[email protected]

The long-term effects of smoking on voice quality, specifically lower fundamental frequency and greater frequency perturbation (jitter), have been previously documented (Gilbert & Weismer, 1974; Gonzalez & Carpi, 2004; Pinto, Crespo & Mourão, 2014; Sorensen & Horii, 1982). However, the short-term effects of nicotine craving on voice quality have not been investigated. The present research aims to determine whether craving has a perceptible effect on speech and to study the influence of craving on the acoustic properties of speech.

For the present pilot experiment, based on the methodology presented in Schwab, Amato, Dellwo and Fernández Trinidad (2016), speech samples were collected from four Spanish medium-heavy smokers who read the sentences from Nazzi, Bertoncini & Mehler (1998) at three time points: 1) after smoking a cigarette, 2) after a one-hour craving period, and 3) after smoking a cigarette following the craving period. The craving period included survey questions about smoking behaviors (Heatherton, Kozlowski, Frecker & Fagerström, 1991), reading catalogues of books discussing forthcoming publications in Humanities, and presentation of smoking images (Wray, Godleski & Tiffany, 2011). Momentary craving was assessed before each recording (using the 10-item QSU; Cox, Tiffany & Christensen, 2001).

A subset of two sentences was selected for the perception study. Trials presented two versions of each sentence, produced by the same speaker under the smoking and craving conditions, to 22 Spanish-speaking listeners. They judged which sample sounded more stressed and which sample was more attractive. We predicted that listeners could discriminate speech produced under craving and non-craving conditions.

The results showed that the samples produced under the craving condition were not judged as more stressed or more attractive than the samples produced under the smoking condition. Samples judged more attractive also tended to be judged less stressed. Thus, contrary to our expectations, we did not observe a perceptible effect of craving. However, these findings might be explained by the short craving period and/or by the specificity of the speech samples we used in the experiment.

Taking these limitations into account, we designed a new production experiment with a 3-hour craving period involving various speech production tasks (reading of sentences and texts, retelling stories and sustaining /a/). Ten medium-heavy smokers performed the tasks at three time points: 1) after a two-hour craving period; 2) after a three-hour craving period; 3) after smoking the cigarette that followed the three-hour craving period. Participants were told not to smoke for at least two hours before the beginning of the experiment (i.e., two-hour craving). As in the pilot study, the craving period during the experimental session included survey questions about smoking behaviors, reading catalogues of forthcoming publications in Humanities, and presentation of smoking images. The recordings have been completed, but the acoustic analyses still need to be carried out (e.g., rhythmic measures, F0, shimmer, jitter, speech rate). The perception experiment will be designed with a wider variety of speech material (sentences, read texts and retold stories), on different dimensions (e.g., stress, anxiety, attractiveness, boredom) and with different experimental tasks (e.g., identification, ABX discrimination).

References

Cox, L. S., Tiffany, S. T. & Christensen, A. (2001). Evaluation of the brief questionnaire of smoking urges (QSU-brief) in laboratory and clinical settings. Nicotine and Tobacco Research, 3, 7-16.

Gilbert, H. R. & Weismer, G. G. (1974). The effects of smoking on the speaking fundamental frequency of adult women. Journal of Psycholinguistic Research, 3, 225-231.

Gonzalez, J. & Carpi, A. (2004). Early effects of smoking on the voice: a multidimensional study. Medical Science Monitor, 10, CR649-56.

Heatherton, T. F., Kozlowski, L. T., Frecker, R. C. & Fagerström, K.-O. (1991). The Fagerström Test for Nicotine Dependence: a revision of the Fagerström Tolerance Questionnaire. British Journal of Addiction, 86, 1119-1127.

Nazzi, T., Bertoncini, J., & Mehler, J. (1998). Language discrimination by newborns: Toward an understanding of the role of rhythm. Journal of Experimental Psychology: Human Perception and Performance, 24, 756-766.

Pinto, A. G. L., Crespo, A. N. & Mourão, L. F. (2014). Influence of smoking isolated and associated to multifactorial aspects in vocal acoustic parameters. Brazilian Journal of Otorhinolaryngology, 80, 60-67.

Schwab, S., Amato, M., Dellwo, V. & Fernández Trinidad, M. (2016). The effects of nicotine craving on voice quality. IAFPA 2016, York.

Sorensen, D. & Horii, Y. (1982). Cigarette smoking and voice fundamental frequency. Journal of Communication Disorders, 15, 135-144.

Wray, S. J. M., Godleski, S. A. & Tiffany, S. T. (2011). Cue-Reactivity in the Natural Environment of Cigarette Smokers: The Impact of Photographic and In Vivo Smoking. Psychology of Addictive Behaviors, 25, 733-737.


Perceptual auditory speech features of drug-intoxicated female speakers (preliminary results)

Potapova Rodmonga,1 Agibalova Tatiana,2 Bobrov Nikolay,1 and Zabello Natalia1

1 Moscow State Linguistic University, Institute of Applied and Mathematical Linguistics, Moscow, Russia

2 Moscow Research and Practical Centre for Narcology of the Department of Public Health, Moscow, Russia

[email protected]

This paper presents preliminary results of the first stage of an investigation whose aim was the auditory assessment of the degree of disfluency across a set of parameters in drug-intoxicated female speakers (opioid addicts). This forms part of the broader problem of identifying a speaker and a speaker's emotional/psychophysiological state (Hollien, 1990, 2001; French, 1994; Klasmeyer, Sendlmeyer, 1997; Neuhauser, 2013; Potapova, Potapov, 2016).

In this stage of the research we examined phonograms (N=130) containing speech of drug-intoxicated native Russian female speakers (N=40) recorded at the Moscow Research and Practical Centre for Narcology of the Department of Public Health, Moscow, in autumn 2016. The aim of the auditory analysis was to obtain information on the various disfluencies manifesting themselves in the speech of drug-intoxicated female speakers. The term "disfluency" in the context of spoken language denotes any of the various breaks or irregularities that occur within the flow of otherwise fluent speech (Schiel, Heinrich 2015: 19-33).

At the first stage of the research, a perceptual auditory experiment was carried out on a subset of the collected phonogram corpus containing 23 phonograms (7 hours total duration) representing the speech of 11 drug-intoxicated speakers. The full corpus also contains recordings of the same speakers in a non-intoxicated state (during the after-care stage, as they were about to leave hospital), which will be analyzed later. A group of listeners (N=12, graduate students of Moscow State Linguistic University specializing in applied and mathematical linguistics, both male and female) was recruited to take part in this experiment. The listeners were asked to listen to the recordings (as many times as necessary) and to write down the verbal content of the recordings, marking all phonetic disfluencies, irregularities and peculiarities (both segmental and suprasegmental) they perceived, namely:

• filled (hm, ah, etc.) and unfilled pauses
• word/segment repetition
• false starts
• word interruptions
• vowel lengthening
• consonant lengthening
• syllable lengthening
• segment or syllable elision
• word or word group ellipsis
• segment, syllable or word insertion
• syllable metathesis
• substitution of unstressed (or weakly stressed) syllables for stressed syllables
• equal stressing of all syllables
• total absence of word stress
• unnatural syllable accentuation
• incorrect utterance segmentation into phrases and syntagmas
• discrepancy between the communicative function of the utterance and its intonation (e.g., a complete narrative phrase with the intonation of a question, or an imperative with the intonation of a narrative)
• variations in perceived speed of the speech (e.g. speech rate acceleration with regard to the whole utterance or some specific parts of it, as opposed to a stable speech rate)
• variations of perceived loudness of speech across various parts of the spoken text


The listeners had received specific training which enabled them to accurately register all the listed parameters.

The preliminary analysis showed that the following phenomena are particularly frequent in the speech of drug-intoxicated speakers (here we give absolute frequencies of the most salient among the abovementioned phenomena with regard to the subcorpus analyzed): unfilled pauses (n=420), filled pauses (n=72); consonant elision (n=121); replacement of consonant clusters with single consonants (n=67); false starts or syllable repetition (n=95); syllable metathesis (n=30) (see Fig. 1).

Figure 1 Distribution of the perceptual auditory parameters appearing most frequently in the analyzed subcorpus.

At the present moment the perceptual auditory analysis is being continued. Further research is planned to obtain quantitative data on the differences between intoxicated and non-intoxicated speakers regarding various speech disfluencies. Acoustic analysis of the collected speech material is underway, which is intended to clarify the nature of the changes noted.

Acknowledgments. The research is supported by the International Association for Forensic Phonetics and Acoustics.

References

French, P. (1994). An overview of forensic phonetics with particular reference to speaker identification. Forensic Linguistics: The International Journal of Speech, Language and the Law, 1(2), 169-182.

Hollien, H. (1990). The Acoustics of Crime: the New Science of Forensic Phonetics. New York: Plenum.

Hollien, H. (2001). Forensic Voice Identification. New York: Academic Press.

Klasmeyer, G., Sendlmeyer, W. F. (1997). The Classification of Different Phonation Types in Emotional and Neutral Speech. Forensic Linguistics, 4(1), 104-124.

Neuhauser, S. (2013). Phonetische und linguistische Prinzipien des forensischen Stimmenvergleichs (Author: Michael Jessen). The International Journal of Speech, Language and the Law, 20(2), 325-330.

Potapova, R., Potapov, V. (2016). On Individual Polyinformativity of Speech and Voice regarding Speakers Auditive Attribution (Forensic Phonetic Aspect). In: Ronzhin, A., Potapova, R., Nemeth, G. (Eds) Speech and Computer. Proceedings of the 18th International Conference SPECOM 2016, Budapest, Hungary, August 23-27, 2016. Springer Lecture Notes in Artificial Intelligence 9811, Cham, Heidelberg et al., 507-514.

Schiel, F., Heinrich, C. (2015). Disfluencies in the speech of intoxicated speakers. The International Journal of Speech, Language and the Law, 22(1), 19-34.


The effect of aging on between-speaker rhythmic variability

Elisa Pellegrino, Lei He, and Volker Dellwo
Institute of Computational Linguistics, Zurich University, Switzerland

{elisa.pellegrino|lei.he|volker.dellwo}@uzh.ch

As well as varying in their spectral characteristics, speakers' voices can differ considerably in their rhythmic characteristics, due to speaker-specific anatomy and idiolect, which result in idiosyncratic movements of the articulators (Dellwo, Kolly and Leemann, 2015). While there is evidence that acoustic measures of speech rhythm based on the durational characteristics of consonantal and vocalic intervals (henceforth CV intervals), voiced and unvoiced intervals, as well as amplitude envelope characteristics consistently separate:

- speakers of the same language variety (Dellwo, Kolly and Leemann, 2015; He and Dellwo, 2016; Leemann, Kolly and Dellwo, 2014);

- speakers of varieties of a language (L1, L2 or regional) (White and Mattys, 2007; White et al., 2009; Fuchs, 2016);

- healthy from dysarthric speakers (Liss et al. 2009; Liss et al., 2010, Pettorino, Busà and Pellegrino, 2016),

far less research has been done on between-speaker rhythmic variability due to advancing age.

Preliminary studies on Italian have shown that the proportion over which speech is vocalic (%V) and the interval between two consecutive vowel onset points vary significantly between younger and older adults (Pettorino and Pellegrino, 2014). In the present study we further explored between-speaker variability across age by analyzing the rhythmic characteristics of 26 Zurich German speakers, aged between 18 and 81 years. All speakers read 60 sentences in Zurich German. Since segment duration and amplitude are among the acoustic features that vary most with aging, we examined the role of these characteristics in rhythmic variability. The whole corpus was annotated on three different tiers: segments, CV intervals and syllables. From the three tiers, different sets of rhythm measures were calculated (Dellwo, 2006; Grabe and Low, 2002; He and Dellwo, 2016; Ramus et al., 1999); a computational sketch of the duration measures follows the list.

- From the segment tier, we derived speech rate in segments/sec.
- From the CV interval tier, we derived the following duration measures:
  - %V: the proportion over which speech is vocalic;
  - ΔC and ΔV: the standard deviation of the durations of consonantal and vocalic intervals;
  - VarcoC and VarcoV: the variation coefficient of the durational variability of consonantal and vocalic intervals;
  - r-PVI-C: the average durational difference between two consecutive consonantal intervals;
  - n-PVI-V: the rate-normalized durational variability of two consecutive vocalic intervals.
- From the syllable tier, we derived the following intensity measures:
  - VarcoM: the variation coefficient of mean syllable intensity;
  - VarcoP: the variation coefficient of syllable peak intensity.
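As a minimal sketch (not the authors' own scripts), the duration-based measures above can be computed from annotated interval durations roughly as follows; the interval values are invented for illustration:

    import numpy as np

    def duration_rhythm_measures(c, v):
        """c, v: consonantal and vocalic interval durations (s) in utterance order."""
        c, v = np.asarray(c, float), np.asarray(v, float)
        pct_v = 100 * v.sum() / (c.sum() + v.sum())      # %V
        delta_c, delta_v = c.std(ddof=1), v.std(ddof=1)  # deltaC, deltaV
        varco_c = 100 * delta_c / c.mean()               # VarcoC
        varco_v = 100 * delta_v / v.mean()               # VarcoV
        r_pvi_c = np.abs(np.diff(c)).mean()              # r-PVI-C
        n_pvi_v = 100 * np.mean(np.abs(np.diff(v)) /
                                ((v[1:] + v[:-1]) / 2))  # n-PVI-V
        return (pct_v, delta_c, delta_v, varco_c, varco_v, r_pvi_c, n_pvi_v)

    print(duration_rhythm_measures([0.08, 0.12, 0.05, 0.09],
                                   [0.10, 0.06, 0.14, 0.07]))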

Preliminary results based on duration measures have shown that segment rate, %V, as well as ΔC, ΔV and r-PVI-C are significantly higher in older than in younger adults, while VarcoC, VarcoV and n-PVI-V are more robust against age-related variations. The age-related rhythmic variations might relate to the generalized slowing in motor function in both the peripheral and central nervous system, as well as to the degenerative changes in the laryngeal and supra-laryngeal systems that inevitably alter the anatomy and hence the movements of the articulators, responsible for the rhythmic organization of speech.

In addition to clarifying the effect of aging on the production of speech rhythm, this research is expected ultimately to contribute to the forensic speech sciences, in terms of the impact of aging on forensic voice comparison and automatic speaker recognition systems. Also, in forensic casework, the time delay between the recording of a perpetrator and that of a suspect can sometimes be in the region of a few years. This means that if we want to apply time-domain measures of speaker individuality to forensic speaker comparison, it will be essential to understand more about the effects of age on such characteristics.


References
Dellwo, V. (2006). Rhythm and Speech Rate: A Variation Coefficient for deltaC. In P. Karnowski & I. Szigeti (Eds.), Language and language-processing (pp. 231-241). Frankfurt/Main: Peter Lang.
Dellwo, V., Leemann, A. and Kolly, M.-J. (2015). Rhythmic variability between speakers: Articulatory, prosodic, and linguistic factors. Journal of the Acoustical Society of America, 137(3), 1513-1528.
Fuchs, R. (2016). Speech Rhythm in Varieties of English: Evidence from Educated Indian English and British English. Singapore: Springer.
Grabe, E. and E. L. Low. (2002). Durational variability in speech and the rhythm class hypothesis. In C. Gussenhoven & N. Warner (Eds.), Laboratory Phonology, Vol. 7, 515-545. Berlin: Mouton de Gruyter.
He, L. and Dellwo, V. (2016). The role of syllable intensity in between-speaker rhythmic variability. International Journal of Speech, Language and the Law, 23(2), 243–273.
Kelly, F., A. Drygajlo and N. Harte. (2013). Speaker verification in score-ageing-quality classification space. Computer Speech & Language, 27(5), 1068-1084.
Kelly, F. and N. Harte. (2015). Forensic comparison of ageing voices from automatic and auditory perspectives. International Journal of Speech, Language and the Law, 22(2), 167-202.
Leemann, A., Kolly, M.-J. and Dellwo, V. (2014). Speaker-individuality in suprasegmental temporal features: Implications for forensic voice comparison. Forensic Science International, 238, 59-67.
Liss, J. M., S. LeGendre and A. J. Lotto. (2010). Discriminating dysarthria type from envelope modulation spectra. Journal of Speech, Language, and Hearing Research, 53(5), 1246-1255.
Liss, J. M., L. White, S. L. Mattys, K. Lansford, A. J. Lotto, S. M. Spitzer and J. N. Caviness. (2009). Quantifying speech rhythm abnormalities in the dysarthrias. Journal of Speech, Language, and Hearing Research, 52(5), 1334–1352.
Pettorino, M. and Pellegrino, E. (2014). Age and rhythmic variation: A study on Italian. Proceedings of Interspeech 2014, Singapore, 1234-1237.
Ramus, F., M. Nespor and J. Mehler. (1999). Correlates of linguistic rhythm in the speech signal. Cognition, 73, 265–292.
White, L., E. Payne and S. L. Mattys. (2009). Rhythmic and prosodic contrast in Venetan and Sicilian Italian. In M. Vigário, S. Frota and M. J. Freitas (Eds.), Phonetics and Phonology: Interactions and Interrelations, 137–158. Amsterdam: John Benjamins.
White, L. and S. L. Mattys. (2007). Calibrating rhythm: First language and second language studies. Journal of Phonetics, 35(4), 501–522.


Creating Linguistic Feature Set Templates for Perceptual Forensic Speaker Comparison in

Finnish and Swedish

Therese Leinonen, Jonas Lindh, and Joel Åkesson
Voxalys AB, Gothenburg, Sweden

{therese|jonas|joel}@voxalys.se

Auditory, perceptual linguistic analysis is an important part of forensic speaker comparison (Rose 2002). Linguistic analysis presupposes that the analyst has particular expert knowledge of the target language and of intra- and inter-speaker variation within it, knowledge which depends on the available linguistic research and on experience. This paper outlines the linguistic variables available for forensic speaker comparison in Finnish and Swedish. These languages differ profoundly from each other in their sociolinguistic situation, Standard language ideologies and the types of variationist linguistic research carried out.

Swedish has a long record of written language, with the oldest evidence being runic inscriptions (9th century onwards). The written language was standardized during the 18th century (Teleman, 2005). The spoken Standard language developed in the capital and was influenced by the court and the speech of higher social classes in Stockholm.

In contrast, the Finnish written language is very young. The orthography was developed in the 16th century, and in the 19th century the written/Standard language was renewed by incorporating features from different Finnish dialects (Nuolijärvi & Vaattovaara, 2011). The Finnish Standard language, hence, did not originally have any native speakers, but was for all Finns a foreign variety learnt at school.

Today, the relationship between spoken and written language is quite different in the two languages. Over the past century Swedish spoken and written language have become increasingly similar (Josephson, 2004; Svensson, 2005); e.g. sounds in some morphological endings formerly deleted in spoken varieties have been reintroduced through the orthography. The opposite holds for Finnish, where the gap between written language and everyday spoken language is increasing (Nuolijärvi & Vaattovaara, 2011).

The different constellations of Standard language and written language have consequences for style shifting within the languages. The combination of a young and hence phonemically transparent written language and a large gap between Finnish written and spoken language causes speakers to alter their pronunciation notably depending on context. For speakers of Swedish, the extent to which style shifting occurs depends much more on the regional, dialectal and social background of the speaker.

There is a long tradition of variationist linguistics in both countries, but the types of studies that have been carried out differ. Phonetic research has been very prominent for Swedish, and there is a bulk of research comparing the phonetics of different varieties (Bruce, 2010). The focus of Finnish variationist linguistics has up to the present day been on phonological, morphological, syntactic and lexical variation, whereas Finnish phoneticians have mainly studied the Standard language and have taken regional variation into account to a much lesser extent (Nieminen et al., 2014).

In forensic speaker comparison casework there is a need for standardising the analyses performed. A fixed template feature set is therefore needed when performing perceptual analyses. We will present feature sets for Swedish and Finnish that have been developed as templates, and would also like to open a discussion on how to develop and use feature set templates for forensic casework in different languages.

References
Bruce, G. (2010). Vår fonetiska geografi. Om svenskans accenter, melodi och uttal. Lund: Studentlitteratur.
Josephson, O. (2004). Ju: ifrågasatta självklarheter om svenskan, engelskan och alla andra språk i Sverige. Stockholm: Norstedts ordbok.
Nieminen, T., T. Kurki, H. Kallio & H. Behravan (2014). Uusi puhesuomen variaatiota tarkasteleva hanke: Katse kohti prosodisia ilmiöitä. Sananjalka, 56, 186-195.
Nuolijärvi, P. & J. Vaattovaara (2011). De-standardisation in progress in Finnish society? In Kristiansen & Coupland (Eds.), Standard languages and language standards in a changing Europe (pp. 67–74). Oslo: Novus.
Rose, P. (2002). Forensic speaker identification. London: Taylor & Francis.
Svensson, J. (2005). Trends in the linguistic development since 1945 I: Swedish. In Bandle (Ed.), The Nordic languages. An international handbook of the history of the North Germanic languages 2 (pp. 1804-1815). Berlin: De Gruyter.
Teleman, U. (2005). The role of language cultivators and grammarians for the Nordic linguistic development in the 16th, 17th and 18th centuries. In Bandle (Ed.), The Nordic languages. An international handbook of the history of the North Germanic languages 2 (pp. 1379-1395). Berlin: De Gruyter.


Fluency Profiling for Forensic Speaker Comparison: A Comparison of Syllable- and

Time-Based Approaches

Kirsty McDougall,1 and Martin Duckworth2
1University of Hertfordshire and University of Cambridge, UK

Disfluency features such as filled and silent pauses, repetitions, prolongations and self-interruptions are of interest in forensic phonetics due to their potential for individual variation. Filled and silent pauses are likely to vary between speakers since they may play a part in the planning of speech. Other breaks in fluency such as repetition and prolongation might also function as part of the speech planning process and therefore be difficult to consciously control, leading to variation between individuals. Most speakers and listeners are generally unaware of disfluency features because they do not usually contribute to the message; these features are therefore unlikely to be manipulated by speakers for disguise. A further reason for investigating the forensic potential of disfluency features is that these features are largely realised through the temporal as opposed to spectral domain, and thus are less affected by signal compression algorithms.

Findings from our programme of research investigating the speaker-specificity of disfluency features have been presented in a series of contributions at IAFPA conferences (e.g. McDougall, Duckworth and Hudson 2015a; see also McDougall, Duckworth and Hudson 2015b). The measurement approach thus far has involved a syllable-based technique capturing the number of occurrences of each disfluency type per 100 syllables. We developed this approach as an extension of methods used in speech and language therapy for analysing disfluencies in the speech of people who stutter. While the results this method has yielded appear very promising in terms of speaker discrimination, implementing the method to collect the metrics is extremely labour-intensive and time-consuming, hindering its application in casework. An approach using a time-based anchor instead of requiring the counting of syllables would be more efficient, provided it offered the same or improved levels of speaker-specificity in the metrics it produced. One study using a time-based approach is that of Braun and Rosin (2015), which analysed the hesitation markers of ten German speakers. In that study the occurrence of hesitation markers (filled pauses and prolongations) per minute was measured and various patterns of individual variation noted.
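The contrast between the two normalisations can be made concrete with a small sketch (the counts below are invented and this is not the authors' procedure, only an illustration of the two rates):

    # Invented example counts for one speech sample.
    n_disfluencies = 23   # total disfluency events annotated in the sample
    n_syllables = 410     # syllable count (labour-intensive to obtain)
    duration_min = 2.4    # sample duration in minutes (cheap to obtain)

    per_100_syllables = 100 * n_disfluencies / n_syllables  # syllable-based rate
    per_minute = n_disfluencies / duration_min              # time-based rate
    print(f"{per_100_syllables:.1f} per 100 syllables; {per_minute:.1f} per minute")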

The present study tests the degree of speaker discrimination offered by both syllable-based and time-based analyses of speakers' disfluency behaviour, covering a larger range of disfluency features than the Braun and Rosin study. Syllable- and time-based measures are compared for filled and silent pauses, repetitions, prolongations and self-interruptions in simulated police interview speech data from the DyViS database, which contains recordings of 100 male SSBE speakers aged 18-25 years (Nolan et al. 2009). Preliminary results for the first five DyViS speakers are given in Figure 1 below, which shows a comparison of total rates of disfluency for the two methods.

Figure 1 Rates of disfluency for the first five DyViS speakers for syllable-based and time-based measures in interview speech.


The differences between speakers are broadly similar for each method, but not identical. The study will compare the levels of speaker discrimination afforded by the two methods to determine whether the time-based method, which is considerably faster to execute, is at least as effective as the syllable-based method. Practical implications for forensic speaker comparison will be discussed.

References
Braun, A. and A. Rosin (2015) 'On the speaker-specificity of hesitation markers.' In: The Scottish Consortium for ICPhS 2015 (ed.) Proceedings of the 18th International Congress of Phonetic Sciences, 10-14 August 2015, Glasgow. Paper number 0731.1-5. <https://www.internationalphoneticassociation.org/icphs-proceedings/ICPhS2015/Papers/ICPHS0731.pdf>
McDougall, K., M. Duckworth and T. Hudson (2015a) 'Between-speaker variation in disfluency behaviour across accents: a comparison of Standard Southern British English and York English.' Paper presented at the International Association for Forensic Phonetics and Acoustics Annual Conference, Leiden, 7-10 June 2015.
McDougall, K., M. Duckworth and T. Hudson (2015b) 'Individual and group variation in disfluency features: a cross-accent investigation.' In: The Scottish Consortium for ICPhS 2015 (ed.) Proceedings of the 18th International Congress of Phonetic Sciences, 10-14 August 2015, Glasgow. Paper number 0308.1-5. <https://www.internationalphoneticassociation.org/icphs-proceedings/ICPhS2015/Papers/ICPHS0308.pdf>
Nolan, F., K. McDougall, G. de Jong and T. Hudson (2009) The DyViS database: style-controlled recordings of 100 homogeneous speakers for forensic phonetic research. International Journal of Speech, Language and the Law, 16.1, 31-57.


Prosody can help distinguish identical twins: implications for forensic speaker comparison

Eugenia San Segundo1, Lei He2, and Volker Dellwo2
1Department of Language and Linguistic Science, University of York, UK
2Institute of Computational Linguistics, University of Zurich, Switzerland
{lei.he|volker.dellwo}@uzh.ch

Various researchers have shown an interest in the voice similarity of identical twins. However, results across studies are hardly comparable, since the number of speakers, gender, speaking style and, most importantly, the forensic comparison methods tend to differ. It is therefore difficult to assess the relative importance of different systems or the value of one set of acoustic features over others for identification purposes. Exceptionally, we find studies, such as Künzel (2010) and San Segundo and Künzel (2015), which use the same ASR system, speaking task, sample duration and forensic output (EER). This facilitates the comparison of results: in this case, system performance with German and Spanish twins. However, most studies are characterised by their heterogeneity in research design. A similar situation occurs with non-twin investigations. For instance, the DyViS corpus has been used extensively in forensic research, but the specific corpus task may differ across studies, different numbers of speakers may be selected for analysis, or the samples may differ in quality (high quality vs. telephone-filtered). This makes cross-study comparisons difficult.

The Twin Corpus (San Segundo, 2014) presents the advantage that most investigations have focused on Task 5: semi-directed spontaneous conversation between researcher and twins. Although the methodological approaches differ (Table 1), the same twin pairs were always considered, in high quality recordings. This makes new approaches to twins more easily comparable with results from previous methodologies, even though the expression of conclusions is different. This remains the main comparison challenge.


The new approach to twin research that we propose here is based on temporal parameters. We will analyze a range of rhythmic metrics related to the variability and proportion of duration of consonantal and vocalic segments (Dellwo et al., 2015), as well as several syllabic measures, such as intensity differences between consecutive syllables, which have previously been shown to play an important role in between-speaker rhythmic differences (He and Dellwo, 2016). The potential of these prosodic measures is promising, as they cover suprasegmental aspects of speaker idiosyncrasy that are largely independent of traditional acoustic features such as formant frequencies. Anatomically, identical twins are so similar (e.g. in the physiognomy of their resonance cavities and vocal folds) that previously tested systems based on spectral and glottal characteristics sometimes failed to distinguish them. If prosodic features could prove useful in telling twins apart when other systems fail, their inclusion in hybrid forensic comparison systems would be highly recommended. Further investigation into how to combine the output of these systems with the LRs or scores yielded by ASR systems would then be necessary too.

Table 1. Investigations using Task 5 of the Twin Corpus. Results per twin pair: 1San Segundo and Künzel (2015); 2San Segundo and Gómez-Vilda (2014); 3San Segundo and Mompeán (2017). Gray-shaded cells mark the twins misidentified by each system.


References
Dellwo, V., Leemann, A. and Kolly, M.-J. (2015). Rhythmic variability between speakers: Articulatory, prosodic, and linguistic factors. Journal of the Acoustical Society of America, 137(3), 1513-1528.
He, L. and Dellwo, V. (2016). The role of syllable intensity in between-speaker rhythmic variability. International Journal of Speech, Language and the Law, 23(2), 243-273.
Künzel, H. J. (2010). Automatic speaker recognition of identical twins. International Journal of Speech, Language and the Law, 17(2), 251-277.
San Segundo, E. (2014). Forensic speaker comparison of Spanish twins and non-twin siblings: a phonetic-acoustic analysis of formant trajectories in vocalic sequences, glottal source parameters and cepstral characteristics. Doctoral dissertation, Menéndez Pelayo International University & Spanish National Research Council.
San Segundo, E. and Gómez-Vilda, P. (2014). Evaluating the forensic importance of glottal source features through the voice analysis of twins and non-twin siblings. Language and Law/Linguagem e Direito, 1, 22-41.
San Segundo, E. and Künzel, H. J. (2015). Automatic speaker recognition of Spanish siblings: (monozygotic and dizygotic) twins and non-twin brothers. Loquens, 2(2), e021.
San Segundo, E. and Mompeán, J. A. (2017). A simplified Vocal Profile Analysis protocol for the assessment of voice quality and speaker similarity. Journal of Voice. DOI: http://dx.doi.org/10.1016/j.jvoice.2017.01.005


Language identification from a foreign accent in German.

Gea de Jong-Lendle,1 Roland Kehrein,1 Frederike Urke,1 Janina Mołczanow,2 Anna Lena Georg,1 Belinda Fingerling,1 Sarah Franchini,1 Olaf Köster,3 and Christiane Ulbrich4
1Institut für Germanistische Sprachwissenschaft, Philipps-Universität Marburg, Germany
{gea.dejong|kehrein|urkef}@staff.uni-marburg.de
{Georga|Fingerli|Franchin}@students.uni-marburg.de
2Institute of Applied Linguistics, University of Warsaw, Poland
3Sprache & Audio, Bundeskriminalamt, Wiesbaden, Germany
4Allgemeine Sprachwissenschaft, Universität Konstanz, Germany

Introduction

A police force in central Germany contacted us concerning a case of kidnapping: when victim M, a 50-year-old mentally disabled person living in a residential community centre, did not return from his usual walk, staff decided to raise the alarm. Soon after, the family was contacted by the kidnapper(s), demanding a ransom for the victim to be released and providing exact details of how the money transfer should take place. Surprisingly, without the ransom being handed over, the victim was released the next morning; the kidnapper confirmed that M was in a good state and had been left chained to a tree in the woods around 115 km south of his residential home.

No one seemed to have witnessed the event. Due to his limited mental abilities, M's information could not be used. However, high-quality recordings existed of three phone calls made by the kidnapper. Based on their similarities, it was assumed that all three calls had been made by the same speaker. In total, the calls yielded over 5 minutes of edited speech from the distant caller. A high proportion of the speech seemed spontaneous; a smaller proportion may have been read. The caller speaks German fairly fluently, but clearly with a foreign accent. His native language was identified as coming from a region somewhere in former Yugoslavia or the Balkan region.

As an extensive search and investigation of the area, including the people living and working there, did not produce the necessary clues, the police requested a voice profile based on the forensic analysis of the three phone calls, to be followed by voice comparisons. They also welcomed any suggestions regarding the caller's age, the time he had spent in Germany, the area where he had learnt his German, etc. The case led to a number of interesting observations, but also to highly complicated research questions.

Research questions

1. Can someone's native language (L1) be derived reliably from his/her foreign accent in another language (L2)?

2. We found [ç], the German Ich-sound, to be replaced most often by the [ʃ] sound. This substitution could be the result of borrowing from the L1 inventory, or of having acquired the sound from an ambient dialect, leading to a more general question about the specific patterns that can be employed to replace L2 sounds absent from the L1 phonemic inventory.

3. The caller's L1 does not appear to have the same lax-tense distinction; e.g. he fails to produce the distinction for the vowels /o/ and /ɔ/, but maintains a distinction for the /y, ʏ/ and /ø, œ/ pairs, as shown in Figure 1. Which strategies are used?

Figure 1 Vowel formants comparing the rounded vowels of Standard German (Sendlmeier & Seebode 2006) with the kidnapper's.


4. The production of certain German phonemes, like /l/ or /ʏ/, seems variable, as shown in Fig. 2. Is there a pattern based on the phonetic context? Does the /ʏ/ vowel perhaps not belong to his native inventory? Could it be caused by the caller's confusion concerning the orthography? Or copied from other non-native colleagues?

Figure 2 F1-F2 formants showing the variety of /l/-consonants produced in different phonetic contexts by the kidnapper, ranging from a clear [l] to the dark variant [ɫ]. The spectrograms show the clear [l] with a high F2, highlighted in the German word “Blinker” (left), and the dark [ɫ] with a low F2 in the word “Kontrolle” (right).


In the investigation, an attempt was made to answer the most relevant (or urgent) of these questions. These analyses are presented, and the questions that remain unanswered are discussed.

References
Sendlmeier, W. F. and Seebode, J. (2006) Formantkarten des deutschen Vokalsystems. Bericht. Berlin: TU Berlin, Institut für Sprache und Kommunikation.


Performance of human listeners vs. the Y-ACCDIST automatic accent classifier

in an accent authentication task

Dominic Watt, Megan Jenkins, and Georgina Brown
Department of Language and Linguistic Science, University of York, UK

The Y-ACCDIST automatic accent classifier system (Brown, 2016a,b; Brown and Wormald, 2017, in press) has performed accurately in trials using corpora of recordings of non-standard varieties of British English, notably the Accent and Identity on the Scottish/English Border (AISEB) corpus (Watt et al., 2014), the Panjabi-English in Bradford and Leicester (PEBL) database (Wormald, 2016), and the Northern Englishes Corpus (Haddican et al., 2013). Like other recently-developed classifiers (e.g. Ferragne and Pellegrino, 2010; Hanani et al., 2013), Y-ACCDIST is based upon Huckvale's ACCDIST system (Huckvale, 2007a,b). Y-ACCDIST differs from Huckvale's classifier, however, in that Y-ACCDIST computes averaged MFCC vectors for each of the phonemes of English. ACCDIST, by contrast, treats each individual segment as effectively unique, meaning that Y-ACCDIST is less rigidly text-dependent than ACCDIST because it can work with samples which are mismatched for content. Using this novel approach, Y-ACCDIST has regularly achieved 80–90% correct classification rates. Given that the aforementioned test corpora comprise recordings of varieties which are in some cases phonologically quite similar, Y-ACCDIST's greatly above-chance performance speaks for the effectiveness of its phoneme-based approach.

The speech in the recordings on which Y-ACCDIST has hitherto been tested was produced by genuine native speakers of their respective accents. In view of how frequently we encounter voice disguise through mimicry of some other accent, however, we sought to investigate whether Y-ACCDIST would correctly classify samples in which the talker's accent is deliberately feigned. Crucially, would it perform better in this respect than a panel of human listeners? It can be difficult for even expert phoneticians to detect this form of disguise if the feigned accent is done competently (e.g. Schlichting and Sullivan, 1997; Eriksson, 2010; Neuhauser, 2012).

In this paper we focus on Y-ACCDIST's capacity to distinguish between recordings of authentic and imitated Scottish (Edinburgh) English (Jenkins 2016), comparing its performance against that of 100 native English-speaking human listeners. The imitated samples, of which twelve 20-second extracts were heard by each participant, were produced by a group of nine non-Scottish English speakers composed of laypeople, actors and phoneticians. The results show that both human and machine perceivers distinguish between genuine and imitated samples at above-chance levels. Overall, the human listeners correctly classified all but one of the nine imitators as non-authentic. Scottish listeners performed more accurately than non-Scottish ones, making 78% and 63% of their judgements correctly, respectively. The variant of Y-ACCDIST used for the present study estimates samples' similarity in terms of their degree of correlation (Pearson's r). It rejected all but a handful of the non-authentic samples, in the sense that only a small percentage of imitations surpassed a threshold (r = 0.81) set by calibrating the system using a sample of genuine Edinburgh English.
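In abstract terms, the thresholded-correlation decision just described can be sketched as follows (this is only a loose illustration, not the actual Y-ACCDIST computation, which is based on per-phoneme distance tables; the array shapes and data are invented):

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(1)
    # Stand-in accent representations: per-phoneme averaged MFCC vectors
    # (40 phoneme categories x 12 coefficients; the shape is an assumption).
    reference = rng.normal(size=(40, 12))                         # genuine-accent model
    candidate = reference + rng.normal(scale=0.3, size=(40, 12))  # test sample

    r, _ = pearsonr(reference.ravel(), candidate.ravel())  # Pearson's r
    THRESHOLD = 0.81  # the calibration threshold reported above
    print("accepted as authentic" if r > THRESHOLD else "rejected", f"(r = {r:.2f})")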

Importantly from the forensic standpoint, the false positive rate – i.e., the rate at which fakers were classified as genuine – was substantially lower for Y-ACCDIST (5%) than it was for the human listeners (28%).

We conclude by considering the implications of the study in terms of the viability of the Y-ACCDIST architecture for accent classification in forensic casework, particularly under circumstances which give reason to doubt the authenticity of the talker's accent.

References
Brown, G. (2016a). Automatic accent recognition systems and the effects of data on performance. Proceedings of Odyssey: The Speaker and Language Recognition Workshop, Bilbao, Spain. Online resource: <http://www.odyssey2016.org/papers/pdfs_stamped/29.pdf>.
Brown, G. (2016b). Exploring forensic accent recognition using the Y-ACCDIST system. Proceedings of the 16th Australasian Speech Science and Technology Conference, Sydney, Australia, 305–308.
Brown, G. and J. Wormald. (2017, in press). Automatic sociophonetics: Exploring corpora with a forensic accent recognition system. Journal of the Acoustical Society of America, 141.
Eriksson, A. (2010). The disguised voice: Imitating accents or speech styles and impersonating individuals. In C. Llamas & D. Watt (Eds.), Language and Identities, 86–98. Edinburgh: Edinburgh University Press.
Ferragne, E. and F. Pellegrino. (2010). Vowel systems and accent similarity in the British Isles: Exploiting multidimensional acoustic distances in phonetics. Journal of Phonetics, 38, 526–539.
Haddican, W., P. Foulkes, V. Hughes and H. Richards. (2013). Interaction of social and linguistic constraints on two vowel changes in northern England. Language Variation and Change, 25, 371–403.
Hanani, A., M. Russell and M. Carey. (2013). Human and computer recognition of regional accents and ethnic groups from British English speech. Computer Speech and Language, 27, 59–74.
Huckvale, M. (2007a). ACCDIST: An accent similarity metric for accent recognition and diagnosis. In C. Müller (Ed.), Lecture Notes in Computer Science: Speaker Classification II, 258–275. Berlin: Springer Verlag.
Huckvale, M. (2007b). Hierarchical clustering of speakers into accents with the ACCDIST metric. Proceedings of ICPhS 16, Saarbrücken, Germany, 1821–1824.
Jenkins, M. (2016). Identifying an imitated accent: Humans vs. computers. Unpublished MSc dissertation, University of York, UK.
Neuhauser, S. (2012). Phonetische und linguistische Aspekte der Akzentimitation im forensischen Kontext: Produktion und Perzeption. Tübingen: Narr.
Schlichting, F. and K. P. Sullivan. (1997). The imitated voice – a problem for voice line-ups? Forensic Linguistics, 4, 148–165.
Watt, D., C. Llamas and D. E. Johnson. (2014). Sociolinguistic variation on the Scottish-English border. In R. Lawson (Ed.), Sociolinguistics in Scotland, 79–102. London: Palgrave Macmillan.
Wormald, J. (2016). Regional Variation in Panjabi-English. PhD thesis, University of York, UK. Online resource: <http://etheses.whiterose.ac.uk/13188/>.


Comparison of vowel space of male speakers of Croatian, Serbian and Slovenian language

Gordana Varošanec-Škarić, Iva Bašić, and Gabrijela Kišiček
Department of Phonetics, University of Zagreb, Croatia

The vowel system of Croatian RP is commonly described with five cardinal vowels in terms of the IPA vowel space (Varošanec-Škarić, 2010). The vowel system of Serbian is usually perceived as more open than that of Croatian. Further, the Slovenian corner vowels /i, u, a/ are described either as less “ideal” than the Croatian corner vowels (Tivadar, 2003) or, according to Toporišič (2000), as “ideal”. Although Slovenian linguists disagree on the number of standard vowels, from eight (Toporišič, 1975) to nine (Jurgec, 2005), we used the eight-vowel categorisation for this study. Naturally, the Slovenian vowel system is different from Croatian and Serbian. However, contrastive analyses commonly compare vowels of different languages: for example, Srebot-Rejec (1988) compares Slovenian and English vowels, Tivadar (2003) Slovenian and Croatian, and Sudimac (2016) Serbian and British English.

Jovičić (1999) presents a detailed description of the articulation of the five Serbian vowels and an F1–F2 vowel space chart for both male and female speech based on spectral analysis. However, the author does not give precise formant values. Although there is previous research on formant values for Croatian (e.g. Škarić 1991, Bakran and Stamenković 1990, Varošanec-Škarić 2010) and for Slovenian (e.g. Lehiste 1961, Toporišič 1975, 2000; Ozbič 1998, Tivadar 2003, Jurgec 2005), differences in methodology (different numbers of speakers, different methods and numbers of measured formants) mean that a more comprehensive study is needed before vowel pronunciation can be compared. Therefore, the aim of this study was to analyse similar corpora and to determine the similarities and differences in vowel pronunciation across three similar languages. Pragmatically, the goal was to compare average formant values (F1, F2, F3) for Croatian RP, based on the previous research of Varošanec-Škarić and Bašić (2015), which served as the reference for the other comparisons (e.g. with Croatian dialects or similar Slavic languages).

Method

Formant values F1, F2 and F3 (mean, min, max, S.D., in Hz) were measured for all three languages in speech samples of native speakers. Speakers (N = 42, median age 22) read the same (Croatian and Serbian) or very similar (Slovenian) declarative sentences with disyllabic “target” words at the end of each sentence. Twenty words for each vowel were then analyzed in Praat (Boersma and Weenink, 2015; Ver. 6.0.14). Since all three vowel systems are without diphthongs, formant values were measured at one stable point in the stressed vowel of the target word. To achieve more reliable results, and because of possible formant overlap, visual inspection was conducted and, where needed, hand corrections were made.
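A measurement step of this kind can be scripted, e.g. with parselmouth, a Python interface to Praat (a sketch only: the authors worked in Praat itself, and the file name and the use of the vowel midpoint as the stable point are assumptions here):

    import parselmouth  # pip install praat-parselmouth

    snd = parselmouth.Sound("target_vowel.wav")  # hypothetical extracted token
    formant = snd.to_formant_burg(max_number_of_formants=5)
    t = 0.5 * snd.duration                       # one stable point; midpoint assumed
    f1, f2, f3 = (formant.get_value_at_time(n, t) for n in (1, 2, 3))
    print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz, F3 = {f3:.0f} Hz")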

Bearing in mind that Slovenian has eight vowels, care was taken to choose words in the same phonological environment as in Croatian and Serbian. Three expert listeners (native speakers) verified the pronunciation of RP Croatian, Slovenian and Serbian, and 14 young male speakers per language were chosen for the analysis. Speakers were recorded in the capitals of the three countries with the same recording equipment between 2015 and the beginning of 2017. Results were statistically tested using ANOVA (single- and two-factor analyses); a minimal sketch of the single-factor test follows.
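A single-factor comparison of this type reduces to a one-way ANOVA over per-token formant values; a minimal sketch with invented numbers (not the study's data):

    import numpy as np
    from scipy.stats import f_oneway

    # Invented F1 values (Hz) for vowel [a] from three groups of speakers;
    # the real study used twenty tokens per vowel per language.
    croatian  = np.array([690.0, 712.0, 705.0, 698.0])
    serbian   = np.array([640.0, 628.0, 635.0, 622.0])
    slovenian = np.array([660.0, 655.0, 671.0, 648.0])

    stat, p = f_oneway(croatian, serbian, slovenian)  # single-factor ANOVA
    print(f"F = {stat:.2f}, p = {p:.4g}")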

Results and discussion

Comparison of Croatian and Serbian. The results showed that F1 values are statistically significantly higher for the vowel [a] for Croatian speakers (704.86 vs 632.72 Hz; p < 0.0001), which means that it is more open in Croatian. Furthermore, F1 was lower for [i] (295.12 vs 322.06 Hz) and [u] (344.1 vs 391.02 Hz; p < 0.0001), which means that these vowels are more close in Croatian than in Serbian. Differences in F1 for the vowel [e] were not statistically significant, although the value was somewhat lower in Serbian. F2 values are significantly higher in Croatian for the vowels [i] (2177.19 vs 2063.41 Hz) and [e] (1811.21 vs 1725.51 Hz) (p < 0.0001), which means they are more front in Croatian.


Significant differences in the values of F1 and F2 for the vowel [u] (p < 0.0001) and F3 (p = 0.01) reveal a more back pronunciation in Croatian. It is interesting to note that the average values of F3 for all vowels are lower in Croatian, and significantly lower for [e] and [u] (p = 0.01), which indicates that the vocal tract is somewhat longer in the pronunciation of Croatian vowels. The results partly confirm the hypothesis that some vowels are more close in Croatian: significantly for [i] and [u], and non-significantly for [e]. However, it is surprising that the vowel [a] is significantly more open in Croatian than in Serbian. It is also confirmed that the vowels [i] and [e] are significantly more front in Croatian.

Regarding the comparison of Croatian and Slovenian, it can be concluded that the Slovenian vowel [a] is pronounced more close and more front than the Croatian vowel [a], which, based on the acoustic analysis, is the most central and most open Croatian vowel. Furthermore, the results showed that Croatian [i] is more front, while the pronunciation of Slovenian [i] is more lax. Croatian [u] is more rounded (lower F2) than Slovenian [u], which is pronounced more centrally. Even more significant is the difference between Slovenian close [e] and Croatian [e] for all three formants (F1, F2, F3; p < 0.0001): Slovenian close [e] is more close and more front than Croatian [e]. Interestingly, Slovenian open [ɛ] differs from Croatian [e] only in F3 values (p = 0.001), which means that they differ in the length of the vocal tract (longer in the Croatian pronunciation); Croatian [e] is slightly more close and more front. Furthermore, Slovenian open [ɔ] has a significantly lower F2 than Croatian [o] (p < 0.0001), and the F3 of Slovenian [ɔ] is significantly higher (p = 0.001). It can be concluded that Slovenian [ɔ] is more back and less rounded than Croatian [o]. Slovenian close [o] is pronounced more close than Croatian [o], as its F1 is significantly lower (480 vs 511 Hz; p < 0.0001). The differences between the vowel spaces of the three languages (RP varieties) are shown in Fig. 1.

Considering that we analyzed the same age group of male speakers, in which such differences in vowel pronunciation were not expected, the results can be useful in the context of analyzing the differences between Croatian, Serbian and Slovenian.


Figure 1 Vowel spaces of Croatian, Serbian and Slovenian language

References
Bakran, J. and Stamenković (1990). Formanti prirodnih i sintetiziranih vokala hrvatskoga standardnoga govora. Govor, VII, 2, 119-137.
Boersma, P. and Weenink, D. (2015). Praat: doing phonetics by computer (Version 6.0.14).
Jovičić, S. T. (1999). Govorna komunikacija. Beograd: Nauka.
Jurgec, P. (2005). Formant frequencies of standard Slovene vowels. Govor, XXII, 2, 127-144.
Lehiste, I. (1961). The phonemes of Slovene. International Journal of Slavic Linguistics and Poetics, IV, 48-66.
Ozbič, M. (1998). Akustična spektralna FFT-analiza samoglasniškega sistema slovenskega jezika: formanti slovenskih samoglasnikov. In T. Erjavec and J. Gros (eds.) Jezikovne tehnologije za slovenski jezik: Zbornik konference, pp 55-59.
Srebot-Rejec (1988). Kakovost slovenskih in angleških samoglasnikov (kontrastivna analiza obeh sestavov po njihovi kakovosti s stališča akustične, artikulacijske in avditivne fonetike). Jezik in slovstvo, XXXIV, 3, 57–64.
Sudimac, N. (2016). Kontrastivna analiza visokih/zatvorenih vokala u produkciji izvornih govornika britanskog engleskog i srpskog. Filolog, VII, 14. Banja Luka: Filološki fakultet, 36-56.
Škarić, I. (1991). Fonetika hrvatskoga književnog jezika. In R. Katičić (ur.): S. Babić, D. Brozović, M. Moguš, S. Pavešić, I. Škarić i S. Težak, Povijesni pregled, glasovi i oblici hrvatskoga književnog jezika, 61-378. Zagreb: HAZU i Nakladni zavod Globus.
Tivadar, H. (2003). Kontrastivna analiza slovenskih i hrvatskih vokala (mogući izgovorni problemi sa slovenskog aspekta). Govor, XX, 1-2, 449-467.
Tivadar, H. (2004). Fonetično-fonološke lastnosti samoglasnikov v sodobnem knjižnem jeziku. Slavistična revija, LII, 1, 31-48.
Toporišič, J. (1975). Formanti slovenskega knjižnega jezika. Slavistična revija, XXIII, 2, 153-196.
Toporišič, J. (2000). Slovenska slovnica. Maribor: Založba Obzorja.
Varošanec-Škarić, G. (2010). Fonetska njega glasa i izgovora. Zagreb: FF press.
Varošanec-Škarić, G. and Bašić, I. (2015). Acoustic characteristics of Croatian cardinal vowel formants (F1, F2 and F3). In M. Sovilj and M. Subotić (eds.) International Conference on Fundamental and Applied Aspects of Speech and Language, Beograd: Life Activities Advancement Center and the Institute for Experimental Phonetics, pp 41-49.


A study on language differences in the score distributions of automatic speaker

recognition systems

Michael Jessen
Department of Speech and Audio (KT34), BKA, Germany

The effect of language on the results of forensic automatic speaker recognition systems is an important topic for forensic voice comparison. The language effect has been approached from different perspectives. Some studies have investigated it by examining bilingual speakers (e.g. Künzel 2013), others by examining the impact of language-matching or language-mismatching background populations (e.g. van der Vloed et al. 2017). Another question is to what extent the results of automatic systems differ if the same systems are applied to different languages. When this question is addressed, the most common approach is to provide measures and representations of overall performance such as EER and DET plots (van Leeuwen et al. 2006). Although overall performance is important, at least equally important is insight into the actual distributions of scores (raw scores, LLRs etc.) that result when different languages are studied. The size of this kind of language effect has practical consequences: the smaller it is, the better the prospects are for applying automatic systems to untested languages. Another interesting question is in what ways different automatic systems differ with respect to the language effect.

Data sets in four languages were tested with three automatic systems. The four languages are Albanian, German, Russian and Turkish. The three systems are Nuance Forensics, iVocalise (Oxford Wave Research) and Voice Inspector (Phonexia). The recordings are based on telephone speech from male adult speakers derived from authentic forensic casework. Net durations range from slightly below 20 to about 60 seconds. There are two recordings per speaker from different telephone conversations. Due to limited availability from casework, the number of speakers is larger in the German set (23) than in the other sets (Albanian 10, Russian 11, Turkish 12 speakers). The systems differ as to whether or not they require a reference population (for score normalisation, LLR calculation or both) and in terms of their outputs (scores, calibrated LLRs or both). The reference population data were taken from a German set of 30 speakers from case-relevant telephone conversations.

A Tippett plot of the results from one of the three systems is shown in Figure 1. It shows that, with the system illustrated, the tested languages differ more in their score distributions when looking at the same-speaker (target) scores (here LLRs) than when examining the different-speaker (nontarget) scores. This is partly a matter of system design; the other systems show the opposite pattern (more diversity in the nontarget scores) or are more symmetrical. Further patterns that are similar and different between systems will be presented.

Figure 1 Tippett plots using the software BioMetrics (Oxford Wave Research) of the results from the four tested languages based on one of three forensic automatic speaker recognition systems. That system requires a reference population and results (x-axis) are expressed in terms of LLR (Log Likelihood Ratios). The vertical black line indicates the value LLR=0.
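For readers unfamiliar with the format, a Tippett plot traces, for each LLR value on the x-axis, the cumulative proportion of same-speaker and different-speaker comparisons whose LLR exceeds that value. A minimal sketch with invented scores (not the BioMetrics implementation):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    target = rng.normal(2.0, 1.0, 200)       # same-speaker LLRs (invented)
    nontarget = rng.normal(-2.0, 1.0, 2000)  # different-speaker LLRs (invented)

    x = np.linspace(-6.0, 6.0, 500)
    prop_tar = [(target > t).mean() for t in x]     # proportion of target LLRs > t
    prop_non = [(nontarget > t).mean() for t in x]  # proportion of nontarget LLRs > t

    plt.plot(x, prop_tar, label="same-speaker")
    plt.plot(x, prop_non, label="different-speaker")
    plt.axvline(0, color="k")  # LLR = 0
    plt.xlabel("Log likelihood ratio")
    plt.ylabel("Cumulative proportion")
    plt.legend()
    plt.show()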


Generally, as shown in Figure 1 and the further plots to be presented, the magnitude of the language differences is visible but limited. Bearing in mind the limited size of this study, this result has the practical consequence that it is to some extent possible to apply automatic speaker recognition to languages for which no tests of the kind shown here are available, provided other aspects of the recordings (technical; speech-stylistic) are kept as constant as they have been here, and provided that potential language differences of approximately the observed magnitude are taken into account in the interpretation of the results.

References
Künzel, H. J. (2013). Automatic speaker recognition with cross-language speech material. The International Journal of Speech, Language and the Law, 20, 21–44.
Van der Vloed, D., M. Jessen and S. Gfroerer. (2017). Experiments with two forensic automatic speaker comparison systems using reference populations that (mis)match the test language. Proceedings of the Audio Engineering Society 2017 Conference on Forensic Audio, Arlington.
Van Leeuwen, D. A., A. F. Martin, M. A. Przybocki and J. S. Bouten. (2006). NIST and NFI-TNO evaluations of automatic speaker recognition. Computer Speech and Language, 20, 128–158.


Automatic detection of the Lombard effect

Finnian Kelly,1,2 and John H. L. Hansen1

1Center for robust speech systems (CRSS), The University of Texas at Dallas, U.S.A.

2Oxford Wave Research Ltd., Oxford, U.K.
{finnian.kelly | john.hansen}@utdallas.edu

Speech recorded in forensic contexts may contain large variations in vocal effort due to physical, emotional or environmental stress. Variations in vocal effort affect speech parameters that are central to many types of speech and speaker analysis (Kelly & Hansen, 2016). It is therefore important to assess speech for the presence of non-neutral vocal effort prior to any analysis.

A commonly occurring source of vocal effort variation is the Lombard effect, which refers to the tendency of a speaker to increase their vocal effort in a noisy environment in order to remain intelligible. Speech produced under the Lombard effect differs from neutral speech in its pitch, intensity, rate and spectral tilt (Hansen & Varadarajan, 2009).

In this paper, we present an initial effort toward automatically detecting speech produced under the Lombard effect. Our approach is to train a discriminative classifier with examples of neutral and Lombard speech, and use it to detect Lombard effects in unseen speech samples. Specifically, a support vector machine (SVM) is used for classification, with MFCC-based i-vectors (Dehak et al., 2011) used as features. Experiments are presented on three corpora:

Corpus 1:

The UT-Scope-Lombard corpus (Ikeno, Varadarajan & Hansen, 2007) contains speech from 30 speakers (24 female, 6 male) of U.S. English. Studio-quality neutral and Lombard speech recordings were obtained from each speaker in both read and spontaneous contexts. The Lombard effect was elicited by playback of noise over headphones; thus, the speech recordings are clean. Three noise types (crowd, pink and car) and multiple noise levels (65 – 90 dB-SPL) were presented to each speaker.

Corpus 2:

The Pool-2010 corpus (Jessen, Koster & Gfroerer, 2005) is similar in design to UT-Scope-Lombard. It contains speech from 106 male speakers of German. Studio-quality neutral and Lombard speech recordings were obtained from each speaker in both read and spontaneous contexts. Versions of the studio recordings transmitted over a cellular (GSM) channel are also provided. The Lombard effect was elicited by playback of 80 dB-SPL white noise over headphones. Thus, the speech recordings are clean.

Corpus 3:

The Speakers in the Wild (SITW) corpus (McLaren et al., 2016) contains 118 speakers (48 female, 70 male) of English with both neutral and Lombard speech recordings. There is no control over the speech content, channel, recording device or environmental noise in any of the recordings. All Lombard effects were elicited naturally.

For each of these corpora, a detection experiment was carried out as follows: speakers were split into train/test groups in the ratio 4:1. An SVM was trained and tested using i-vectors from the relevant speaker group, and an Equal Error Rate (EER) detection metric was obtained. This process was repeated 50 times, assigning different random train/test speaker groups at every iteration. The mean detection performance is summarised in Table 1.
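A skeletal version of one cross-validation iteration, with random stand-ins for the i-vectors (i-vector extraction itself is not shown, and the placeholder data make the resulting EER meaningless; this is only a sketch of the procedure):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import roc_curve

    rng = np.random.default_rng(0)
    # Placeholder "i-vectors" and neutral/Lombard labels (1 = Lombard).
    X_train, y_train = rng.normal(size=(80, 400)), rng.integers(0, 2, 80)
    X_test, y_test = rng.normal(size=(20, 400)), rng.integers(0, 2, 20)

    clf = SVC(kernel="linear").fit(X_train, y_train)  # discriminative classifier
    scores = clf.decision_function(X_test)

    fpr, tpr, _ = roc_curve(y_test, scores)
    fnr = 1 - tpr
    eer = fpr[np.nanargmin(np.abs(fpr - fnr))]  # operating point where FPR == FNR
    print(f"EER = {eer:.3f}")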

This initial study indicates that automatic detection of the Lombard effect in known conditions (i.e., where train and test speech samples are from the same corpus) is feasible. Ongoing work is addressing the detection of the Lombard effect in unknown conditions (i.e., where train and test speech samples are from different corpora). The i-vector framework will be expanded to incorporate features that are directly related to vocal effort variation, such as pitch and formant measures. In addition to detecting the presence of the Lombard effect, assessing the degree to which the speech deviates from neutral is also being considered.


Table 1. Mean EER and 95% Confidence Intervals (CI) of Lombard speech detection across 50 cross-validation iterations. The number of train/test speakers and the minimum number of train/test files at each iteration are also indicated.

References
Kelly, F. and Hansen, J. H. L. (2016). Evaluation and calibration of Lombard effects in speaker verification. IEEE Spoken Language Technology Workshop (SLT), San Diego, CA, pp. 205-209.
Hansen, J. H. L. and Varadarajan, V. (2009). Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 2, pp. 366-378.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P. and Ouellet, P. (2011). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798.
Ikeno, A., Varadarajan, V., Patil, S. and Hansen, J. H. L. (2007). UT-Scope: Speech under Lombard effect and cognitive stress. IEEE Aerospace Conference, Big Sky, MT, pp. 1-7.
Jessen, M., Koster, O. and Gfroerer, S. (2005). Influence of vocal effort on average and variability of fundamental frequency. International Journal of Speech, Language and the Law, vol. 12, no. 2, pp. 174–213.
McLaren, M., Ferrer, L., Castan, D. and Lawson, A. (2016). The Speakers in the Wild (SITW) speaker recognition database. Interspeech 2016, San Francisco, CA, pp. 812-822.


WikiDialects: a resource for assessing typicality in forensic voice comparison

Vincent Hughes,1 and Jessica Wormald1,2

1Department of Language and Linguistic Science, University of York, UK.

2J P French Associates, York, UK

In forensic voice comparison (FVC), the strength of the evidence is dependent not only on the similarity between the suspect and offender with regard to the features present in the voices, but also the typicality of these features within the wider, relevant population (Aitken and Taroni 2004). Despite calls across forensic science (and from the UK Government Forensic Regulator) for more robust and replicable estimations of typicality, experts still commonly rely on their experience, eminence and intuition when evaluating strength of evidence in FVC casework (for issues with this see Ross et al. 2016). One crucial reason for this is the relative lack of baseline descriptions of regional and, in particular, social varieties of languages. Comprehensive descriptions of varieties and dialectological work are no longer fashionable in sociolinguistics. Rather, papers tend to focus on a single linguistic feature for the purposes of exploring an aspect of variation and/or change. Moreover, these papers are often small in scale and dispersed across a range of journals, books, and blogs. Of the purely descriptive works that are available many are out of date (e.g. Wells 1982) and therefore may provide misleading accounts of varieties, which in turn may affect the conclusions that the expert arrives at in contemporary FVC cases. Finally, despite the large literature in sociolinguistics, there remains insufficient coverage of the diversity of language varieties even within individual countries (e.g. British English).

We are currently working on a funding application to address these issues. Our proposed solution is to create a wiki for descriptions of language varieties. A wiki is an online encyclopaedia (i.e. a knowledge pool) which is used, developed and updated by a community. In this case, the wiki would be a central resource for summaries of, and signposts to, academic works and anecdotal accounts of patterns of language variation within different regional and social groups. The wiki would be a ‘living thing’ in that it would continually be updated as new research is conducted. We envisage that there would be a relatively large community of potential contributors and users. Contributions would primarily come from academics working in sociolinguistics, phonetics, and dialectology. Given the original motivation for this resource, it would be of central value to experts and academics working in forensic speech science. But beyond this, baseline descriptions of language varieties would also be of use to those working in speech and language therapy and language education (e.g. TESOL), and of general interest to dialect societies and the wider public.

We have arranged our grant proposal into three sections:

(1) Creating the wiki. This will involve deciding on the appropriate format of information and ways of indexing according to both regional/social factors and individual linguistic-phonetic features.

(2) Populating the wiki. As an initial step, we intend to collate research from sociolinguistics and dialectology to add to the wiki. This will focus on British English, but the resource will support users adding descriptions of varieties from other countries and languages.

(3) Outreach. We intend to encourage contributions to the wiki through workshops and guest lectures to academics in sociolinguistics, phonetics, and forensics. We will also publicise the resource to a wider audience, including the general public, in order to maximise its utility.

References
Aitken, C. G. G. and Taroni, F. (2004) Statistics and the Evaluation of Evidence for Forensic Scientists (2nd edition). Chichester: Wiley.
Ross, S., French, J. P. and Foulkes, P. (2016) UK practitioners' estimates of the distribution of speech variants. Paper presented at the International Association of Forensic Phonetics and Acoustics (IAFPA) conference, University of York, UK, 24th–26th July 2016.


Between-speaker rhythmic variability in Persian

Homa Asadi,1 Lei He,2 Elisa Pellegrino,2 and Volker Dellwo2
1Department of Linguistics, Alzahra University, Tehran, Iran

2Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland

{lei.he|elisa.pellegrino|volker.dellwo}@uzh.ch

Acoustic measures of speech rhythm based on the durational characteristics of consonantal and vocalic intervals, as well as on syllable intensity characteristics, capture between-speaker variability. The evidence is largely drawn from studies on stress-timed languages (Dellwo et al., 2015; Leemann et al., 2014; He and Dellwo, 2016). However, it is unknown whether speakers of syllable-timed languages also vary in their rhythmic characteristics to the same extent as speakers of stress-timed languages do. Complex consonant clusters can be released differently between speakers, and vowel reduction can be carried out more or less strongly depending on the speaker. When such features are missing, it is possible that the degree of between-speaker variability goes down. We thus hypothesize that there may be less between-speaker variability in rhythm in syllable-timed as opposed to stress-timed languages.

To explore between-rhythmic variability in a syllable-timed language, 10 native Persian speakers (Tehrani variety, 5 male, 5 female, age range = 22 - 33) were instructed to read The North Wind and the Sun in Persian at five different tempi (normal, slow, slower, fast and fastest possible). The speech corpus was segmented in two tiers: CV intervals and syllables. From the CV interval tier, we automatically calculated the following duration-based measures (Leemann et al., 2014; Dellwo et al., 2015):

- %V: proportion of speech over which the signal is vocalic;
- ∆C(ln) and ∆V(ln): standard deviation of the natural-log normalized durations of consonantal and vocalic intervals;
- n-PVI-V: rate-normalized averaged durational differences between consecutive vocalic intervals;
- n-PVI-C: rate-normalized averaged durational differences between consecutive consonantal intervals.

From the syllable tier, we calculated the articulation rate (number of syllables per second) and the following intensity measures (He and Dellwo 2016):

- stdevM: the standard deviation of average syllable intensity levels;
- stdevP: the standard deviation of syllable peak intensity levels;
- varcoM: the variation coefficient of average syllable intensity levels (normalized stdevM);
- varcoP: the variation coefficient of syllable peak intensity levels (normalized stdevP).
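As an illustration of how these measures can be computed, the following minimal Python sketch derives %V, ∆(ln), nPVI and varco values from hand-segmented interval durations and syllable intensities. All input values are hypothetical; the authors' actual extraction pipeline (Praat-based segmentation and scripts) is not specified in the abstract.

```python
import numpy as np

def percent_v(vowel_durs, cons_durs):
    """%V: percentage of total interval duration that is vocalic."""
    return 100 * np.sum(vowel_durs) / (np.sum(vowel_durs) + np.sum(cons_durs))

def delta_ln(durs):
    """Delta(ln): standard deviation of natural-log normalized durations."""
    return np.std(np.log(durs), ddof=1)

def npvi(durs):
    """nPVI: mean absolute difference between consecutive intervals,
    normalized by the mean duration of each pair, times 100."""
    d = np.asarray(durs, dtype=float)
    return 100 * np.mean(np.abs(np.diff(d)) / ((d[:-1] + d[1:]) / 2))

def varco(values):
    """Variation coefficient: standard deviation normalized by the mean."""
    v = np.asarray(values, dtype=float)
    return 100 * np.std(v, ddof=1) / np.mean(v)

# Hypothetical interval durations (s) and syllable peak intensities (dB):
vowels = [0.08, 0.12, 0.06, 0.10]
consonants = [0.09, 0.05, 0.11, 0.07]
peak_db = [72.0, 68.5, 74.0, 70.2]
print(percent_v(vowels, consonants))  # %V
print(delta_ln(vowels))               # deltaVln
print(npvi(vowels))                   # nPVI-V
print(varco(peak_db))                 # varcoP
```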

The results of this study indicate that there are significant differences between speakers in all the above-mentioned speech rhythm measures (Table 1), even though within-speaker variability is strong, as shown for example by nPVI-V and varcoP (Figure 1 and Figure 2).

Table 1. Results from mixed-effects models for the rhythm measures.

Measure              χ²(df)        p
%V                   78.48 (9)     < 0.0001
deltaCln             53.515 (9)    < 0.0001
deltaVln             49.418 (9)    < 0.0001
nPVI-V               60.274 (9)    < 0.0001
Articulation rate    154.13 (9)    < 0.0001
stdevM               121.55 (9)    < 0.0001
stdevP               141.08 (9)    < 0.0001
varcoM               88.59 (9)     < 0.0001
varcoP               122.9 (9)     < 0.0001


Figure 1 Boxplot showing the distribution of nPVI-V as a function of speaker and tempo

Figure 2 Boxplot showing the distribution of VarcoP as a function of speaker and tempo


References
Dellwo, V., A. Leemann and M.-J. Kolly. (2015). Rhythmic variability between speakers: Articulatory, prosodic, and linguistic factors. The Journal of the Acoustical Society of America, 137, 1513-1528.
Leemann, A., M.-J. Kolly and V. Dellwo. (2014). Speech-individuality in suprasegmental temporal features: implications for forensic voice comparison. Forensic Science International, 238, 59-67.
He, L. and V. Dellwo (2016). The role of syllable intensity in between-speaker rhythmic variability. International Journal of Speech, Language and the Law, 23, 243-273.


Speaker-similarity perception of Spanish twins and non-twins by native speakers of

Spanish, German and English

Eugenia San Segundo, Almut Braun, Vincent Hughes, and Paul Foulkes

Department of Language and Linguistic Science, University of York, UK

{eugenia.sansegundo|almut.braun|vincent.hughes|paul.foulkes}@york.ac.uk

Most previous studies on speaker identification suggest that native listeners have an advantage over non-natives (Köster and Schiller 1997; Perrachione et al. 2009). Other investigations, however, suggest that it is possible to identify voices successfully when stimuli are random phonemes with no semantic meaning and not belonging to any language (Bricker and Pruzansky 1966).

Therefore, listeners seem to pay attention to cues in a voice which do not require knowledge of the speaker’s language, for instance suprasegmental aspects. Ho (2007) found no native language effect when comparing British English and Chinese listeners in a speaker identification task where F0 was modified. This pointed to F0 as a language-independent factor for voice identification. Other suprasegmental features, such as voice quality (VQ), remain largely unexplored in this area. San Segundo, Foulkes and Hughes (2016) found little difference between Spanish and English listeners when rating speaker similarity of twin and non-twin pairs. The results suggested that similar listening strategies operate in naïve listeners (i.e. holistic approach to voice quality) in order to judge speaker similarity, regardless of the listener’s L1.

As a follow-up to San Segundo, Foulkes and Hughes (2016), in this investigation we have widened the scope of the original study to also include 20 German listeners (in addition to 20 English and 20 Spanish listeners), and we have also added the variables linguistic-phonetic expertise and musical training (Table 1).


Table 1. Participants of the perceptual experiment.

                         Spanish   German   English
Degree in Linguistics    8/20      10/20    15/20
Musical training         6/20      18/20    18/20

The stimuli and the experimental design were the same as in San Segundo, Foulkes and Hughes (2016). Five pairs of male MZ twins were selected from the Twin Corpus (San Segundo 2014), all of them native speakers of Standard Peninsular Spanish. Three criteria were established in order to select only the most similar-sounding twin pairs from the corpus: (i) similar age (mean: 21, sd: 3.7); (ii) similar mean F0 (mean: 113 Hz, sd: 13 Hz); and (iii) similar Euclidean distances (EDs) between each speaker and his twin.

EDs took the form of Similarity Matching Coefficients (SMCs) and were based on the perceptual assessment of their VQ using the Vocal Profile Analysis (VPA) scheme.
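As a rough illustration of a similarity matching coefficient over VPA-style profiles, the sketch below computes the proportion of settings on which two speakers agree. The binary coding and the settings themselves are hypothetical simplifications: in the actual VPA scheme settings carry scalar degrees, and the abstract does not detail how the coefficients were derived.

```python
import numpy as np

def smc(profile_a, profile_b):
    """Similarity matching coefficient: proportion of VPA-style settings
    on which two speakers receive the same value."""
    a, b = np.asarray(profile_a), np.asarray(profile_b)
    return float(np.mean(a == b))

# Hypothetical VPA-style settings coded 0/1 (setting absent/present):
twin_a = [1, 0, 1, 1, 0, 0, 1]
twin_b = [1, 0, 1, 0, 0, 0, 1]
print(smc(twin_a, twin_b))  # 0.857... -> highly similar voice quality profiles
```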

A Multiple Forced Choice experiment was set up in Praat with 90 different-speaker pairings, i.e. each speaker compared with everyone else. Stimuli were presented in random order and listeners had to indicate the degree of similarity of each stimulus pair on a scale from 1 (very similar) to 5 (very different). Listeners were not told that the stimuli included twin pairs. Ordinal mixed effects modelling was used to fit models to the similarity ratings with the following predictors: listener language, SMCs between the speakers, reaction time, whether the speakers were twins or not, and whether the listener had a degree in Linguistics or not.

The variable music was removed from the predictors because of its strong association with the variable language (6 Spanish listeners with musical training vs. 18 for German and 18 for English).


Figure 1 Perceptual ratings (from 1 to 5) as a function of speaker type (twin or non-twin pairs). Results for (a) Spanish, (b) German and (c) English listeners.

Both language-independent and language-dependent results were found. On the one hand, regardless of the listener language, twins were judged as being much more similar than non-twin pairs (Figure 1). On the other hand, the interaction between language and reaction time shows different effects depending on the language. Namely, reaction time had no effect on the ratings given by English listeners. Conversely, both Spanish and German listeners were more likely to respond with 5 (very different) when their reaction time was short; the longer these two listener groups took to respond, the more likely they were to respond with 1 (very similar).

Overall, listeners with a linguistics degree made a bigger distinction between twin and non-twin pairs. For linguists, there was no overlap of their rating distributions for twin and non-twin pairs (median rating for twins was ‘1’ in the case of linguists vs. ‘2’ in non-linguists). Linguists might have used a more analytical strategy to make their similarity judgments while the listeners without a degree in linguistics might have used a more holistic strategy.

References
Bricker, P. D., and S. Pruzansky. (1966). Effects of stimulus content and duration on talker identification. Journal of the Acoustical Society of America, 40, 1441-1449.
Ho, C.-T. (2007). Is pitch a language-independent factor in forensic speaker identification? MA diss., University of York.
Köster, O. and N. O. Schiller. (1997). Different influences of the native language of a listener on speaker recognition. Forensic Linguistics, 4, 8-28.
Perrachione, T. K., Pierrehumbert, J. B. and P. C. M. Wong. (2009). Differential neural contributions to native- and foreign-language talker identification. Journal of Experimental Psychology: Human Perception and Performance, 35, 1950-1960.
San Segundo, E. (2014). Forensic speaker comparison of Spanish twins and non-twin siblings: A phonetic-acoustic analysis of formant trajectories in vocalic sequences, glottal source parameters and cepstral characteristics. PhD diss., Alicante: Biblioteca Virtual Miguel de Cervantes, 2017. http://www.cervantesvirtual.com/nd/ark:/59851/bmcm9293
San Segundo, E., Foulkes, P. and V. Hughes. (2016). Holistic perception of voice quality matters more than L1 when judging speaker similarity in short stimuli. Proc. 16th Australasian Conference on Speech Science and Technology, University of Western Sydney, Australia, pp. 309-312.


Speaker-specific temporal organizations of intensity contours

Lei He, and Volker Dellwo
Institute of Computational Linguistics, University of Zurich, Switzerland
[email protected]; [email protected]

Vocal fold vibrations, vocal tract resonances, and articulatory movements are essential to the production of speech. All of these processes contain speaker-specific information. Our present work(1) aims to study how individual differences between speakers are manifested in the temporal organization of the intensity contour, in terms of intensity dynamics (defined as the speed of intensity increase from a trough point to a consecutive peak point ‹positive dynamics›, or the speed of intensity decrease from a peak point to a consecutive trough point ‹negative dynamics›). Speaker-idiosyncratic characteristics in both vocal fold vibrations and vocal tract resonances have been extensively studied in forensic phonetics and (semi-)automatic speaker recognition (Eriksson 2012; Kinnunen and Li 2010); however, only a small number of studies have examined the temporal characteristics of speech that result from the movements of the articulators (Eriksson 2012; Leemann et al. 2014; Dellwo et al. 2015; He and Dellwo 2016). This study rests on the assumption that the intensity contour may be closely related to the articulatory movements responsible for the changes in mouth aperture size over an utterance. Such a view is supported by Chandrasekaran et al. (2009), who showed that the amplitude envelope co-varied with the mouth aperture size (see Figure 1). This suggests that intensity dynamics are associated with articulatory movements which directly influence the speed by which the mouth aperture size increases and decreases. In addition, Birkholz et al. (2011) found that the forces and motor programs acting on the articulators in opening and closing gestures differed in their time constants. According to Ghez and Krakauer's (2000) view of the motor program, the extent of a movement is planned before the initiation of that movement. Speakers are therefore likely to plan opening and closing gestures differently. For this reason, we looked at speaker-idiosyncratic effects in positive and negative dynamics separately.

Method
The TEVOID corpus (Leemann et al. 2014; Dellwo et al. 2015) was used. It contains 16 native speakers of Zurich German (8f, 8m; mean age = 27); each speaker produced 256 read sentences. The negative dynamics (V[–]) were calculated as the speed of intensity decrease from a syllable intensity peak to its neighboring intensity trough (illustrated as the steepness of line XY in Figure 2). The positive dynamics (V[+]) were calculated as the speed of intensity increase from the intensity trough to its adjacent intensity peak (illustrated as the steepness of line YZ in Figure 2). An utterance is composed of a series of both positive and negative dynamics. To capture the distribution of both types of dynamics, the mean, standard deviation and Pairwise Variability Index were calculated for both V[+] and V[–]: mean_V[+], stdev_V[+], PVI_V[+], mean_V[–], stdev_V[–] and PVI_V[–] (collectively referred to as the dynamics measures). These entered multinomial logistic models as numeric predictor variables, with speaker modeled as the response variable. The contribution of each dynamics measure in explaining between-speaker variability was calculated.
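A minimal sketch of the dynamics computation, assuming an alternating peak/trough track has already been extracted from the syllable-level intensity contour (the extraction step is not detailed in the abstract); the times, intensity values, and the raw, non-normalized PVI variant are illustrative assumptions:

```python
import numpy as np

def intensity_dynamics(times, intensities):
    """Compute positive and negative intensity dynamics from an alternating
    sequence of intensity peaks and troughs (times in s, intensities in dB).
    Returns slopes in dB/s: positive (trough->peak) and negative (peak->trough)."""
    slopes = np.diff(intensities) / np.diff(times)
    pos = slopes[slopes > 0]   # rising flanks: V[+]
    neg = slopes[slopes < 0]   # falling flanks: V[-]
    return pos, neg

def pvi(values):
    """Raw Pairwise Variability Index: mean absolute difference
    between consecutive values."""
    v = np.asarray(values, dtype=float)
    return np.mean(np.abs(np.diff(v)))

# Hypothetical peak/trough track of one utterance (peak, trough, peak, ...):
t = np.array([0.05, 0.12, 0.20, 0.29, 0.36, 0.47])
i = np.array([72.0, 58.0, 70.0, 55.0, 68.0, 54.0])
v_pos, v_neg = intensity_dynamics(t, i)
# The six dynamics measures for this utterance:
features = [np.mean(v_pos), np.std(v_pos), pvi(v_pos),
            np.mean(v_neg), np.std(v_neg), pvi(v_neg)]
print(features)
```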

Results
The measures of positive dynamics (mean_V[+], stdev_V[+] and PVI_V[+]) explained ≈30% of between-speaker variability, while the measures of negative dynamics (mean_V[–], stdev_V[–] and PVI_V[–]) explained ≈70% of between-speaker variability (see Figure 3). This suggests that the closing gestures of articulatory movements encode more speaker-specific information.
(1) Note: The initial idea, with some pilot data, was presented in He et al. (2015). A refined experiment was presented in He and Dellwo (2017). This work shows our most recent results.

References
Birkholz, P., Kröger, B. J. & Neuschaefer-Rube, C. (2011). Model-based reproduction of articulatory trajectories for consonant-vowel sequences. IEEE Transactions on Audio, Speech, and Language Processing, 19, 1422-1433.
Chandrasekaran, C., Trubanova, A., Stillittano, S., Caplier, A. & Ghazanfar, A. A. (2009). The natural statistics of audiovisual speech. PLoS Computational Biology, 5, e1000436.
Dellwo, V., Leemann, A. & Kolly, M.-J. (2015). Rhythmic variability between speakers: articulatory, prosodic, and linguistic factors. Journal of the Acoustical Society of America, 137, 1513-1528.
Eriksson, A. (2012). Aural/acoustic vs. automatic methods in forensic phonetic case work. In A. Neustein and H. A. Patil (eds.), Forensic Speaker Recognition: Law Enforcement and Counter-Terrorism. New York: Springer, pp. 41-69.
Ghez, C. & Krakauer, J. (2000). The organization of movement. In E. R. Kandel, J. H. Schwartz, and T. M. Jessell (eds.), Principles of Neural Science, 4th edition. New York: McGraw-Hill, pp. 654-673.
He, L., Glavitsch, U. & Dellwo, V. (2015). Inter-speaker variability in intensity dynamics. Presentation at the 24th Annual Conference of IAFPA, Leiden, the Netherlands.
He, L. & Dellwo, V. (2016). The role of syllable intensity in between-speaker rhythmic variability. International Journal of Speech, Language and the Law, 23, 243-273.
He, L. & Dellwo, V. (2017). Speaker-specific variability in intensity dynamics. Presentation at Speech Production and Perception: Learning and Memory, Chorin, Germany.
Kinnunen, T. & Li, H. (2010). An overview of text-independent speaker recognition: From features to supervectors. Speech Communication, 52, 12-40.
Leemann, A., Kolly, M.-J. & Dellwo, V. (2014). Speech-individuality in suprasegmental temporal features: implications for forensic voice comparison. Forensic Science International, 238, 59-67.

Figure 1 An illustration of the covariation of both amplitude envelope and mouth aperture size (Chandrasekaran et al. 2009). This figure is distributed under the CC-BY license.


Figure 2 Illustration of the calculations of both positive and negative intensity dynamics.

Figure 3 Pie charts showing the amount of between-speaker variability accounted for by measures of both positive and negative dynamics.


Emotional speech databases in Slavic languages – an overview

Milana Milošević1, and Željko Nedeljković1

1School of Electrical Engineering, University of Belgrade, Belgrade, Serbia

{milana.milosevic|zeljko.nedeljkovic.bb}@gmail.com

Stress, emotions and anxiety influence speech properties. Such speech becomes a challenge for speech processing tasks such as speaker and speech recognition. Schuller et al. (2011) emphasize that obtaining more realistic data is a very important issue for researchers. Such studies commonly use the well-known databases of emotional speech (El Ayadi et al., 2011; Koolagudi and Rao, 2012; Ververidis et al., 2003). However, these databases are usually in English, German or Romance languages, and very few studies consider Slavic languages. This is partially due to a lack of information about, and the poor availability of, the relevant databases. Here we give an overview of the available databases of emotional speech in Slavic languages: Serbian, Polish, Croatian, Russian and Czech.

Serbian Emotional Speech Database - GEES
Language: Serbian
Emotions: anger, happiness, fear, sadness and neutral. Emotions are acted.
Speakers: 6 speakers - 3 male and 3 female
Utterances: 32 isolated words, 30 short (3 to 8 words) semantically neutral sentences, 30 long (6 to 12 words) semantically neutral sentences and one passage of 79 words
Validation: 93.33% to 96.06% depending on emotion
Files: wav, sample frequency 22.05 kHz
Recording conditions: anechoic studio at the Faculty of Dramatic Arts, Belgrade University
(Jovičić et al., 2004)


Database of Polish Emotional Speech - DPES
Language: Polish
Emotions: anger, happiness, fear, sadness, neutral and boredom. Emotions are acted.
Speakers: 8 speakers - 4 male and 4 female
Utterances: 5 short (5 to 6 words) sentences
Validation: 60% to 84% depending on speaker
Files: wav, sample frequency 44.1 kHz
Recording conditions: aula of the Polish National Film, Television and Theater School in Lodz, Poland
(Cichosz, 2008)

Polish Emotional Speech Database - PESD
Language: Polish
Emotions: anger, happiness, fear, sadness, neutral, surprise and disgust. Emotions are acted.
Speakers: 13 speakers - 7 male and 6 female
Utterances: 10 short (5 to 6 words) sentences
Validation: 40.52% to 73.41% depending on emotion
Files: wav, sample frequency 44.1 kHz
Recording conditions: laboratory recording studio
(Staroniewicz, 2014)

Croatian emotional speech corpus - CrES
Language: Croatian
Emotions: anger, happiness, fear, sadness and neutral. Emotions are acted and spontaneous.
Speakers: 341 speakers
Utterances: long and short sentences, each speaker saying a different sentence
Validation: evaluation per sentence is given in an Excel file accompanying the database
Files: wav, sample frequency 11.025 kHz
Recording conditions: spontaneous
(Dropuljic et al., 2013, 2011)


Russian Language Affective Speech Database - RUSLANA
Language: Russian
Emotions: anger, happiness, fear, sadness and neutral. Emotions are acted.
Speakers: 61 speakers - 12 male and 49 female
Utterances: long and short sentences, phonetically representative, including all the phonemes and the most common consonant clusters of Russian
Validation: evaluation was done, but no results were presented
Files: PCM, sample frequency 32 kHz
Recording conditions: soundproof recording studio of the Department of Phonetics, St. Petersburg State University, St. Petersburg, Russia
(Makarova et al., 2007; Makarova and Petrushin, 2003, 2002)

Czech Database of Speech Emotions - CDSE
Language: Czech
Emotions: anger, happiness, sadness and neutral.
Speakers: unknown
Utterances: long and short sentences, spontaneous, no background noise
Validation: evaluation per sentence is given in an Excel file accompanying the database
Files: wav, sample frequency 16 kHz
Recording conditions:
(Uhrin et al., 2014)

References
Cichosz, J. (2008). Database of Polish Emotional Speech. Lodz, Poland.
Dropuljic, B., Chmura, M. T., Kolak, A., Petrinovic, D. (2011). Emotional speech corpus of Croatian language. In: 2011 7th International Symposium on Image and Signal Processing and Analysis (ISPA), Dubrovnik, Croatia, pp. 95-100.
Dropuljic, B., Popovic, S., Petrinovic, D., Cosic, K. (2013). Estimation of emotional states enhanced by a priori knowledge. In: 2013 IEEE 4th International Conference on Cognitive Infocommunications (CogInfoCom), Budapest, Hungary, pp. 481-486. doi:10.1109/CogInfoCom.2013.6719295
El Ayadi, M., Kamel, M. S., Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44, 572-587. doi:10.1016/j.patcog.2010.09.020
Jovičić, S. T., Kašić, Z., Đorđević, M., Rajković, M. (2004). Serbian emotional speech database: design, processing and evaluation. In: 9th International Conference on Speech and Computer SPECOM 2004, St. Petersburg, Russia, pp. 77-81.
Koolagudi, S. G., Rao, K. S. (2012). Emotion recognition from speech: a review. International Journal of Speech Technology, 15, 99-117. doi:10.1007/s10772-011-9125-1
Makarova, V., Petrushin, V. A. (2003). Phonetics of Emotion in Russian Speech. In: Proc. of the XVth International Conference of Phonetic Sciences, Barcelona, Spain, pp. 2-5.
Makarova, V., Petrushin, V. A., Company, T. N. (2007). Sonorant Segment Quality in Russian Emotional


Cross-Language Accent Analysis for Determination of Origin

Kristina Tomić
Faculty of Philosophy, University of Niš, Niš, Serbia
[email protected]

Language analysis for determination of origin (LADO) has become a hot topic with the recent rise in asylum seekers (Eades, 2005). However, determining someone's background with greater precision is important not only for asylum cases, but also for forensic speaker identification and speaker profiling. Even though phoneticians nowadays often encounter cases involving samples in a foreign language (Künzel, 2013), there is still a general scarcity of cross-language forensic research.

The current study aims to find out to what extent the phonetic characteristics of a dialect are retained across languages, using Serbian and English as an example. The goal of the research is to discover whether it is possible to use accent features of native speakers of Serbian (pitch tone, height and segment duration) to precisely determine their dialect when the recorded samples are in English as a foreign language.

Standard Serbian belongs to the class of hybrid prosodic systems, those that manipulate both stress and tone (Lehiste & Ivić, 1986). The standard Serbian pitch accent system has two pitch accents, falling and rising, each defined by a characteristic pitch shape, as well as by stress, whose correlate is an increase in duration (Zec and Zsiga, 2009; Sredojević, 2017). However, there are significant dialectal differences in the realisation of these accents, as well as in their presence or absence (Ivić, 1956, 2009). This research is concerned with two dialectal regions of Serbia: that of Šumadija and Vojvodina ("šumadijsko-vojvođanski" dialect) and that of Prizren-Timok ("prizrensko-timočki" dialect).

Ten participants from each region were recorded in two speaking tasks, in Serbian and English respectively. During the audio analysis, the spontaneous speech was marked for tokens in such a way that the four accent types (long-rising, short-rising, long-falling and short-falling) are equally represented. The acoustic analysis includes the measurement of the following features (a computational sketch follows the list):

In the stressed vowel:
• Duration of the vowel in milliseconds
• Place of the pitch peak in percentage
• Pitch ratio between the starting and ending point (tone)
• Pitch ratio between the starting point and the peak
• Pitch ratio between the peak and the ending point

In the first post-stressed vowel:
• Duration of the vowel in milliseconds
• Place of the pitch peak in percentage
• Pitch ratio between the starting and ending point
• Pitch ratio between the starting point and the peak
• Pitch ratio between the peak and the ending point

Other:
• Ratio of the pitch peaks in the stressed and the first post-stressed vowel
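A minimal sketch of how such features could be extracted from the F0 track of a single vowel; the track values, the sampling, and the direction of the ratios (start over end, etc.) are assumptions, since the abstract does not define them:

```python
import numpy as np

def accent_features(times, f0):
    """Pitch-accent features for one vowel, given time (s) and F0 (Hz) tracks:
    duration, relative peak position, and three pitch ratios."""
    times, f0 = np.asarray(times, dtype=float), np.asarray(f0, dtype=float)
    peak = int(np.argmax(f0))
    return {
        "duration_ms": 1000 * (times[-1] - times[0]),
        "peak_position_pct": 100 * (times[peak] - times[0]) / (times[-1] - times[0]),
        "start_end_ratio": f0[0] / f0[-1],     # "tone"
        "start_peak_ratio": f0[0] / f0[peak],
        "peak_end_ratio": f0[peak] / f0[-1],
    }

# Hypothetical F0 track of a stressed vowel:
t = np.linspace(0.0, 0.12, 7)
hz = [180, 195, 210, 220, 215, 205, 190]
print(accent_features(t, hz))
```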

The author aims to show that the acoustic correlates of accent in Serbian differ between the Šumadija-Vojvodina and Prizren-Timok dialects. In addition, the author intends to test whether the acoustic values of the pitch accent features in the English samples are indicative of the speaker's native dialect.

References
Eades, D. (2005). Applied Linguistics and Language Analysis in Determination of Origin. Applied Linguistics, 26(4), 503-526.
Ivić, P. (1956). Dijalektologija srpskohrvatskog jezika. Uvod u štokavsko narečje. Novi Sad: Matica Srpska.
Ivić, P. (2009). Srpski dijalekti i njihova klasifikacija. Sremski Karlovci - Novi Sad: Izdavačka knjižarnica Zorana Stojanovića.
Künzel, H. (2013). Automatic speaker recognition with crosslanguage speech material. International Journal of Speech, Language and the Law, 20(1), 21-44.
Lehiste, I. and P. Ivić. (1986). Word and Sentence Prosody in Serbocroatian. Cambridge, MA: MIT Press.
Sredojević, D. (2017). Fonetsko-fonološki opis akcenata u standardnom srpskom jeziku: od specifičnog ka opštem. Novi Sad: Sajnos.
Zec, D. and E. Zsiga. (2009). Interaction of Tone and Stress in Standard Serbian. In W. Browne, A. Cooper, A. Fisher, E. Kesici, N. Predolac, & D. Zec (eds.), The Cornell Meeting 2008. Ann Arbor: Michigan Slavic Publications.


The effect of fundamental frequency f0, syllable rate and pitch range on listeners'

perception of fear in a female speaker’s voice

Sandro Bizozzero1, Nele Netzschwitz1, and Adrian Leemann2

1Departement für Germanistische Linguistik, Universität Freiburg/Université de Fribourg, Switzerland
{sandro.bizozzero|nele.netzschwitz}@unifr.ch
2Department of Theoretical and Applied Linguistics, University of Cambridge, United Kingdom
[email protected]

It has been widely reported that conditions of distress affect various vocal variables – such as fundamental frequency f0, formants, speech rate and intensity – in speech production. Little is known, however, about the influence of those variables on listeners' perception of fear – fear being a condition of distress.

In the present study, 7 different stimuli (in 18 realizations) were presented to 48 subjects. In these 7 stimuli, the fundamental frequency f0, the syllable rate (duration) and the pitch range had been modified respectively. The 7 stimuli (a control stimulus and two modified stimuli for each variable) were all based on the same five-second recording of a fictitious emergency call in Swiss German reporting a break-in. Subjects were asked to rate the degree of fear they perceived in the female speaker's voice on a 6-point scale with labelled endpoints.
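One plausible way to produce such manipulations is Praat's overlap-add resynthesis, sketched below via the parselmouth Python interface. The filename and the specific factor values are illustrative assumptions; the abstract does not state which tool or settings were actually used.

```python
import parselmouth
from parselmouth.praat import call

# Load the base recording (hypothetical filename).
snd = parselmouth.Sound("emergency_call.wav")

# "Change gender" args: min pitch, max pitch, formant shift ratio,
# new pitch median (0 = unchanged), pitch range factor, duration factor.

# Raise the pitch median to 300 Hz; keep formants, range and tempo:
raised_f0 = call(snd, "Change gender", 75, 600, 1.0, 300, 1.0, 1.0)

# Halve the pitch range around the original median:
narrow_range = call(snd, "Change gender", 75, 600, 1.0, 0, 0.5, 1.0)

# Increase syllable rate by shortening duration to 80%:
faster = call(snd, "Change gender", 75, 600, 1.0, 0, 1.0, 0.8)

faster.save("stimulus_faster.wav", "WAV")
```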

Results revealed that (1) an increased fundamental frequency f0 leads to a higher perceived degree of fear. Results also showed (2) a clear tendency that an increased syllable rate, and (3) a slight tendency that a reduced pitch range, also lead to a higher perceived degree of fear.

We further report results indicating (4) that an increased syllable rate has a bigger effect on female listeners and (5) that a decreased pitch range has a bigger effect on listeners aged 20-30 compared to listeners aged 15-19.


With regard to forensic phonetics, the findings of the present study could be applied in speaker profiling (e.g. in masked robberies) as well as in the assessment of emergency calls, and hence in training programs for emergency call operators.

A future study should examine if the present findings also apply to the judged level of fear in a male speaker’s voice.

Furthermore, we point out the importance of exact terminology for future research in the field of emotional speech.

Fundamental frequency f0 Arithmetic mean with standard deviation (green) and median (blue).

Syllable rate Arithmetic mean with standard deviation (green) and median (blue).


Pitch range Arithmetic mean with standard deviation (green) and median (blue).

Effect of syllable rate by gender Level of fear judged by male (blue) and female listeners (red).


Effect of pitch range by age Level of fear judged by age group 15-19 (red) and 20-30 (green).

References
Banse, R. and K. R. Scherer. (1996). Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70(3), 614-636.
Belin, P., S. Fillion-Bilodeau and F. Gosselin. (2008). The Montreal Affective Voices: A validated set of nonverbal affect bursts for research on auditory affective processing. Behavior Research Methods, 40(2), 531-539.
Devillers, L., I. Vasilescu and L. Vidrascu. (2004). F0 and pause features analysis for anger and fear detection in real-life spoken dialogs. In B. Bel and I. Marlien (eds.), Speech Prosody 2004, International Conference, 205-208.
Hicks Jr., J. W. (1979). An Acoustical/Temporal Analysis of Emotional Stress in Speech. In H. Hollien and P. Hollien (eds.), Current Issues in the Phonetic Sciences: Proceedings of the IPS-77 Congress, Miami Beach, Florida, 17-19th December 1977 (Vol. 9), 279. John Benjamins Publishing.
Huttunen, K., H. Keränen, E. Väyrynen, R. Pääkkönen and T. Leino. (2011). Effect of cognitive load on speech prosody in aviation: Evidence from military simulator flights. Applied Ergonomics, 42(2), 348-357.
Kirchhübel, C., D. M. Howard and A. W. Stedmon. (2011). Acoustic correlates of speech when under stress: Research, methods and future directions. International Journal of Speech, Language and the Law, 18(1), 75-98.
Leinonen, L., T. Hiltunen, I. Linnankoski and M.-L. Laakso. (1997). Expression of emotional-motivational connotations with a one-word utterance. The Journal of the Acoustical Society of America, 102(3), 1853-1863.
Meinerz, C. (2010). Effekte von Stress auf Stimme und Sprechen: Eine phonetische Untersuchung auf der Grundlage ausgewählter akustischer und sprechdynamischer Parameter unter Berücksichtigung verschiedener Stressklassen. BoD-Books on Demand.
Roberts, L. S. (2012). A forensic phonetic study of the vocal responses of individuals in distress. Doctoral dissertation, University of York.
Scherer, K. R. (1981). Vocal indicators of stress. In J. K. Darby (ed.), Speech Evaluation in Psychiatry, 171-187.
Siegman, A. W. (1993). Paraverbal correlates of stress: implications for stress identification and management. In L. Goldberger and S. Breznitz (eds.), Handbook of Stress: Theoretical and Clinical Aspects, 2nd ed., 274-299. New York: Free Press.
Sigmund, M. (2006). Introducing the database ExamStress for speech under stress. In J. R. Sveinsson (ed.), Proceedings of the 7th Nordic Signal Processing Symposium (NORSIG 2006), 290-293. IEEE.
Sigmund, M. (2007). Spectral Analysis of speech under stress. IJCSNS International Journal of Computer Science and Network Security, 7(4), 170-172.
Sobin, C. and M. Alpert. (1999). Emotion in speech: The acoustic attributes of fear, anger, sadness, and joy. Journal of Psycholinguistic Research, 28(4), 347-365.
Spackman, M. P., B. L. Brown and S. Otto. (2009). Do emotions have distinct vocal profiles? A study of idiographic patterns of expression. Cognition and Emotion, 23(8), 1565-1588.
Williams, C. E. and K. N. Stevens. (1972). Emotions and speech: Some acoustical correlates. The Journal of the Acoustical Society of America, 52(4B), 1238-1250.


Detecting remorse in the voice: A preliminary investigation into the perception of remorse

using a voice line-up methodology

Francesca Hippey,1 and Erica Gold1

1Department of Linguistics and Modern Languages, University of Huddersfield, UK

[email protected] | [email protected]

This paper is eligible for the 'Best Student Paper Award'

This paper provides a preliminary investigation into listeners’ ability to detect remorse in the voice. Previous research on remorse has been largely conducted in law on its perception, evaluation and significance in the courtroom (Bandes, 2015; Eisenberg et al., 1998; MacLin et al., 2009). This forensic phonetic approach was motivated by research into emotional speech (Kirchhübel, 2013; Watt, Kelly & Llamas, 2013; Roberts, 2012) and by common phrases in the media such as “the defendant showed no remorse”. This research considers the extent to which remorse may be perceived auditorily.

Methodology

A voice line-up method (Nolan, 2003; de Jong-Lendle et al., 2015) was used to play 50 statements (10 scenarios with 1 ‘true’ remorseful and 4 ‘fake’ remorseful utterances per scenario) to participants. When recording the stimuli, the speaker was instructed which utterances to say remorsefully, so the element of ‘remorse’ was acted. The speaker providing the stimuli was female, aged 21 with a Yorkshire accent, with no previous acting training. Participants were tasked with identifying which read utterance, if any, sounded remorseful according to the following definition: “Deep regret or guilt for doing something morally wrong; the fact or state of feeling sorrow for committing a sin; repentance, compunction” (Oxford English Dictionary, 2017).


Table 1. An example of a scenario provided when producing/judging remorse.

Scenario: Drunk scenario
You went out for some drinks with friends and ended up drinking too much. In your state, you get upset with your friends for trying to help you. You argue with them and shout nasty things at them.

Table 2. An example of stimuli for the drunk scenario (the 'remorseful' utterance is italicised).

Utterance 1: I'm sorry, I don't mean any of the things that I said.
Utterance 2: You know I love you guys. I don't really believe anything I said.
Utterance 3: I know there is no excuse for the things I said.
Utterance 4: I drank too much, but I know that's no excuse.
Utterance 5: Anyway, I'm really sorry for it all. I promise I won't do it again.

Pilot Study

A pilot study was conducted to test the validity and representativeness of the stimuli. The 'true' remorseful stimulus for each scenario was removed so that participants (native English speakers, aged 20-27) could judge which of the remaining 4 stimuli for each scenario they considered to be remorseful, if any. Generally, responses indicated that the 'fake' utterances were not distracting or influencing listeners' judgments, meaning there was no word biasing. However, participants noted perceiving a lack of remorse in statements that sounded 'read'. Given the breadth of the current project, it was not deemed necessary to rerecord any stimulus; instead, this observation would be considered in relation to the final results.


Main test

The main test was conducted with 24 participants (native English speakers, aged 18-62) using the same methodology, but with the addition of the 'true' remorseful utterances. Every 'true' remorseful utterance was identified, with most participants identifying it in each scenario. However, a few participants had difficulty identifying the intended remorseful utterance across all statements. Even so, some of those incorrect responses concerned stimuli identified as potentially distracting in the pilot study. This warranted caution in interpreting them as evidence of the participants' inability to recognise remorse. Taking the pilot study results into consideration, the results suggest that participants are able to detect acted remorse in short utterances.

Implications

This preliminary research has implications for the field of forensic phonetics as it has shown that listeners can identify acted remorse (to some extent) in the voice, supporting claims that it can be perceived auditorily. Listeners appear to perceive read speech as less remorseful. This would suggest that defendants may be encouraged to avoid reading an apology from a script if they would like to be perceived by the audience (or trier-of-fact) as remorseful. This research indicates that there may be scope for further research considering the perception of ‘non-acted’ remorse, as well as an acoustic investigation into the phonetic properties that may be perceived as remorse.

References
Bandes, S. A. (2015). Remorse and Criminal Justice. Emotion Review, 8(1), 14-19.
Eisenberg, T., Gervey, S. P., & Wells, M. T. (1998). But was he sorry? The role of remorse in capital sentencing. Cornell Law Review, 83(6), 1599-1637.
de Jong-Lendle, G., Nolan, F., McDougall, K., & Hudson, T. (2015). Voice Lineups: A Practical Guide. In 18th International Congress of Phonetic Sciences, Glasgow, Scotland.
Kirchhübel, C. (2013). The acoustic and temporal characteristics of deceptive speech. PhD thesis, University of York.
MacLin, M. K., Downs, C., MacLin, O. H., & Caspers, H. M. (2009). The Effect of Defendant Facial Expression on Mock Juror Decision-Making: The Power of Remorse. North American Journal of Psychology, 11(2), 323-332.
Nolan, F. (2003). A recent voice parade. International Journal of Speech Language and the Law, 10(3), 277-291.
Roberts, L. S. (2012). A forensic phonetic study of the vocal responses of individuals in distress. PhD thesis, University of York.
Watt, D., Kelly, S., & Llamas, C. (2013). Inference of threat from neutrally-worded utterances in familiar and unfamiliar languages. York Papers in Linguistics (YPL2), 2(13), 98-120.


Construction of a voice profile: An acoustic study of /l/

Sarah Franchini
Institut für Germanistische Sprachwissenschaft, Philipps-Universität Marburg, Germany
[email protected]

Introduction

In June 2015, the mentally-disabled son of a famous entrepreneurial family was kidnapped (for further information see de Jong-Lendle et al., 2017). In the phone calls made by the kidnapper(s), the caller speaks German quite fluently, but clearly with a non-native accent, exhibiting striking variation in his pronunciation of /l/.

Clear [l] and dark [ɫ]

The clear [l] is produced with a single apicoalveolar constriction, whereas the dark [ɫ] implements a dorsovelar constriction causing a lowering of F2 and a raising of F1 (Recasens et al., 1995). In a spectrogram, this gesture is reflected in F1 and F2 approaching each other, resulting in the typical velarized sound quality, with the potential for gradual variation (Recasens, 2004). German is reported to have a non-velarized [l], with F2 values ranging from 1310 to 1730 Hz depending on vowel context (Recasens, 2012).

The kidnapper's native language was identified as originating from an area within the Balkan region. He shows a large variety of /l/ productions, ranging from a clear to a dark variant. In order to find out whether these are produced in complementary distribution or as free variants, a detailed spectral analysis of the kidnapper's speech was carried out (de Jong-Lendle et al., 2017).

Materials

As the kidnapper is clearly an L2 speaker of German and suspected to come from the Balkan region, his productions are compared to those of a German L1 speaker and of another L2 speaker whose native language is Serbian. The latter's sample was selected from a database produced by the police containing over 350 speakers from this region. The read text contains twelve words with the lateral approximant /l/, for which F1 and F2 were measured using Praat (Boersma and Weenink, 2013).
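A minimal sketch of one such measurement, assuming the /l/ tokens have already been segmented; the filename, interval times and formant ceiling below are illustrative, not the study's actual settings:

```python
import parselmouth

# Measure F1 and F2 at the midpoint of a hand-segmented /l/ token.
snd = parselmouth.Sound("speaker01_reading.wav")
formant = snd.to_formant_burg(maximum_formant=5000)  # typical male-speaker ceiling

l_start, l_end = 1.234, 1.297          # /l/ interval, e.g. from a TextGrid
mid = (l_start + l_end) / 2
f1 = formant.get_value_at_time(1, mid)
f2 = formant.get_value_at_time(2, mid)
print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz, margin = {f2 - f1:.0f} Hz")
# A small F2-F1 margin (approaching formants) points to a dark [ɫ];
# a clear [l] keeps F2 high (roughly 1300-1800 Hz for the German L1 speaker).
```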

Results

Figure 1 shows F1 against F2 values for all three speakers. For the German L1 speaker it was confirmed that he typically produced a clear [l], with an F2 ranging from 1300-1800 Hz. Both the kidnapper and the Serbian speaker show a variety of /l/ productions; both, however, exhibited a preference for the velarized variant, with F2 values mainly varying from 1000-1300 Hz except in the [ɪ] context. Here, F2 values reach as high as 1640-1720 Hz for the kidnapper and 1500-1520 Hz for the Serbian speaker. The distribution of the formant values indicates a major difference in the articulation of /l/ between the L1 and L2 speakers of German, not only in F2 but also in F1.

Figure 1 F1 and F2 measurements comparing the variety of /l/-consonants in different phonetic contexts of the analyzed speakers.


Figure 2 Comparison between the first and second formant for the L2-speakers (red/blue) compared to the L1-speaker of German (green).

The German speaker shows the largest margin between the first and second formant (mean = 1175 Hz), as shown in Figure 2. On average, the kidnapper (mean = 869 Hz) and the Serbian speaker (mean = 788 Hz) both show a smaller distance, reflecting a tendency to velarize laterals. To determine whether segmental context (vowel context, syllable position) has an impact on the realization of the dark consonant, further speakers of both languages will be examined.

References
Boersma, P., & Weenink, D. (2013). Praat: doing phonetics by computer [Computer program]. Version 5.3.51, retrieved 2 June 2013 from http://www.praat.org.
de Jong-Lendle, G., Kehrein, R., Urke, F., Mołczanow, J., Georg, A. L., Fingerling, B., Franchini, S., Köster, O., Ulbrich, C. (2017, accepted). Language identification from a foreign accent in German. Poster to be presented at the International Association for Forensic Phonetics and Acoustics Annual Conference, Split, Croatia, 9-12 July.
Recasens, D., Fontdevila, J., & Pallarès, M. D. (1995). Velarization degree and coarticulatory resistance for /l/ in Catalan and German. Journal of Phonetics, 23(1-2), 37-52.
Recasens, D. (2004). Darkness in [l] as a scalar phonetic property: implications for phonology and articulatory control. Clinical Linguistics & Phonetics, 18(6-8), 593-603.
Recasens, D. (2012). A cross-language acoustic study of initial and final allophones of /l/. Speech Communication, 54(3), 368-383.


Indexical information as a function of linguistic condition: does prosodic

prominence affect speaker-specificity?

Linda Albers1, and Willemijn Heeren1,2

1Department of Languages, Literature and Communication
2Utrecht Institute of Linguistics OTS
Utrecht University, The Netherlands

[email protected]; [email protected]

Speech forms an interwoven stream of linguistic and indexical information (Abercrombie, 1967). In speech perception, this interdependency has repeatedly been shown to affect listeners’ performance (e.g., Goggin et al., 1991; Van Berkum et al., 2008; Winters et al., 2008; Perrachione et al., 2015). In speech production, however, linguistic and indexical information have generally been investigated independently (but see Moos, 2010; Kavanagh, 2012; Dellwo et al., 2015). Hence, it is not well understood if, and if so, why linguistic and indexical factors interact in the acoustic speech signal. The presence of this interaction would imply that speaker-dependent information, in the same speech sound, is not always the same.

Linguistic-phonetic research, for instance, shows that linguistic position (e.g., being stressed or not) systematically impacts acoustic forms: stressed syllables evoke more canonical pronunciations (Fry, 1955; Campbell & Beckman, 1997). Such systematic variation has been largely ignored in speaker-specificity studies, but may impact results: we predict that it alters the balance between within- and between-speaker variation, i.e. speaker-specificity. This study therefore explored how the presence versus absence of prosodic prominence on the syllable from which a vowel (/i/) was sampled influenced the vowel's speaker-dependency. We hypothesized that in prominent syllables, with more careful pronunciation, within-speaker variability is reduced. This may enhance speaker-specificity.


Method

From the Spoken Dutch Corpus (Oostdijk, 2000), spontaneous conversations were selected with manually-generated and checked phonetic and prosodic annotations. Using these annotations, 17 speakers were selected (7 females, 10 males) for whom at least 8 realizations per prominence condition (prominent/not prominent) were available for the vowel /i/, and which were also well-measurable along all acoustic dimensions of interest. These vowel intervals were manually segmented using Praat (Boersma, 2001), and subsequently different acoustic features were analyzed per vowel: F0, F1 through F3, spectral slope and balance, duration, and intensity (after scaling recordings to 65 dB). These parameters were selected as they may be sensitive to differences in prominence in Dutch (e.g., Sluijter & Van Heuven, 1996).

Results

Repeated measures ANOVAs with the within-subjects factors Prominence and Repetition showed an effect of prominence on the acoustic variables duration, F0, F2, mean intensity and spectral balance. Taking the by-speaker variance as an indication of articulatory precision, paired-samples t-tests were run on these acoustic variables to compare the two conditions: no significant reductions in variance for the prominent realizations were found, all |t(16)| < 1.6. Linear Discriminant Analysis models (Klecka, 1980) were built for the prominent and non-prominent realizations separately. Using the non-prominent /i/ realizations, 40.4% of cases were classified correctly (cross-validated, chance level at 5.9%), whereas it was 46.7% using the prominent realizations. When the model was allowed to select the most successful predictors in a stepwise manner, the non-prominent condition yielded 35.5% correct speaker classifications, whereas the prominent condition yielded 46.7%.
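The classification setup can be sketched as follows with scikit-learn; the original software and cross-validation scheme are not specified in the abstract, and random data stands in for the real acoustic measurements:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Hypothetical feature matrix: one row per /i/ token, columns = acoustic
# features (F0, F1-F3, spectral slope/balance, duration, intensity).
rng = np.random.default_rng(1)
n_speakers, tokens_per_speaker, n_features = 17, 8, 8
X = rng.normal(size=(n_speakers * tokens_per_speaker, n_features))
y = np.repeat(np.arange(n_speakers), tokens_per_speaker)

lda = LinearDiscriminantAnalysis()
# Stratified 8-fold cross-validation: each fold holds out one token
# per speaker; chance level is 1/17 ≈ 5.9%.
acc = cross_val_score(lda, X, y, cv=8).mean()
print(f"cross-validated accuracy: {acc:.1%}")
```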

These results provide some evidence in support of the hypothesis that speaker-dependency is enhanced in prominent syllables, but the effect is not consistent across measures. An extension of this exploratory study is planned to further investigate the potential interaction of linguistic and indexical information.


References
Abercrombie, D. (1967). Elements of general phonetics. Edinburgh: Edinburgh University Press.
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9/10), 341-345.
Campbell, N., and Beckman, M. (1997). Stress, prominence, and spectral tilt. In A. Botinis, G. Kouroupetroglou, & G. Carayiannis (eds.), Proceedings of the ESCA Workshop on Intonation: Theory, Models and Applications, Athens, pp. 67-70.
Dellwo, V., Leemann, A., Kolly, M.-J. (2015). Rhythmic variability between speakers: Articulatory, prosodic, and linguistic factors. Journal of the Acoustical Society of America, 137(3), 1513-1528.
Fry, D. B. (1955). Duration and intensity as physical correlates of linguistic stress. Journal of the Acoustical Society of America, 27, 765-768.
Goggin, J. P., Thompson, C. P., Strube, G., and Simental, L. (1991). The role of language familiarity in voice identification. Memory & Cognition, 19, 448-458.
Kavanagh, C. M. (2012). New consonantal acoustic parameters for forensic speaker comparison. PhD dissertation, University of York, UK.
Klecka, W. R. (1980). Discriminant analysis. Sage University Paper Series on Quantitative Applications in the Social Sciences, No. 07-019. Beverly Hills, CA: Sage, 72 pp.
Moos, A. (2010). Long-term formant distributions as a measure of speaker characteristics in read and spontaneous speech. The Phonetician, 101, 7-24.
Oostdijk, N. H. J. (2000). Het Corpus Gesproken Nederlands. Nederlandse Taalkunde, 5, 280-284.
Perrachione, T., Dougherty, S. C., McLaughlin, D. E., and Lember, R. A. (2015). The effects of speech perception and speech comprehension on talker identification. Proceedings of the 18th International Congress of Phonetic Sciences, August 10-14, 2015, Glasgow, paper 0908, 4 pp.
Sluijter, A. M. C., and Van Heuven, V. J. (1996). Spectral balance as an acoustic correlate of linguistic stress. Journal of the Acoustical Society of America, 100, 2471-2485.
Van Berkum, J. J. A., Van den Brink, D., Tesink, C. M. J. Y., Kos, M. and Hagoort, P. (2008). The neural integration of speaker and message. Journal of Cognitive Neuroscience, 20(4), 580-591.
Winters, S. J., Levi, S. V., and Pisoni, D. B. (2008). Identification and discrimination of bilingual talkers across languages. Journal of the Acoustical Society of America, 123, 4524-4538.


Refining the Vocal Profile Analysis (VPA) scheme for forensic purposes

Katharin Klug
Department of Language and Linguistic Science, University of York, UK
[email protected]

This paper is eligible for the 'Best Student Paper Award'

VPA is a method for describing voice quality and was developed by John Laver and colleagues at the beginning of the 1980s (Laver et al., 1981). It is based on the assumption that a speaker’s voice quality (voice ‘timbre’ or voice ‘colouring’) can be described as a combination of a number of specified phonatory, supralaryngeal and muscular tension settings like creakiness, lip spreading and tense vocal tract. Based on these ideas, Janet Mackenzie Beck and colleagues designed a standard protocol which was originally aimed at assisting in assessments of voice pathologies (Mackenzie Beck, 2007). Subsequently, it has been applied to the analysis of socio-phonetic data as well as to aid forensic speaker comparison; particularly in the latter, voice quality assessments play an integral part (Gold and French, 2011). Forensic practitioners at J P French Associates have modified the original protocol to make it more applicable to speech analysis in the forensic setting.

Although the VPA is an extremely useful tool in the work of forensic practitioners, initial research and consultation with practitioners have highlighted two main areas for improvement:

a) There are aspects of the protocol which could be developed and refined in order to better capture more detailed differences in people's voice qualities. Creakiness, for example, has been identified as a setting which would allow for further subcategorization in order to reflect the different types of creakiness that are perceivable;

b) There is a general lack of VPA training which means that there is a lower level of agreement across analysts as compared to auditory analysis of vowel and consonant sounds.


As well as presenting the research carried out so far, the poster will describe its future direction in order to address both a) and b) above.

References
Gold, E. and P. French. (2011). International practices in forensic speaker comparison. International Journal of Speech Language and the Law, 18(2), 293-307.
Laver, J., S. Wirz, J. Mackenzie and S. Hiller. (1981). A perceptual protocol for the analysis of vocal profiles. Edinburgh University Department of Linguistics Work in Progress, 14, 139-155.
Mackenzie Beck, J. (2007). Vocal Profile Analysis Scheme: A User's Manual. Queen Margaret University College (QMUC), Speech Science Research Centre, Edinburgh.


The effect of dialect on age estimation

Anna Lena Georg
Institut für Germanistische Sprachwissenschaft, Philipps-Universität Marburg, Germany
[email protected]

Introduction
Estimating a speaker's age is one of the more common tasks in speaker profiling (Braun, 1996). Because chronological age (CA) and biological age (BA) often differ from each other, age estimation is not always a straightforward task. The aging of the body is affected by genetics and external factors: smoking, disease or an unbalanced diet all have an effect on perceived age (PA) (Braun, 1996; Kreul & Hecker, 1971). An additional factor to consider from a linguistic perspective could be dialect: research on dialectal abilities has shown that older people (> 65 years) have better dialectal skills than younger people (18-20 years). Could this mean that speakers who exhibit a strong dialect are necessarily judged to be older? The aim of this study is to explore the effect of dialect on age estimation. A comparison is made between recordings in which a speaker exhibits a strong dialect and recordings in which the same speaker speaks as close to standard German as possible.

Method

Dialect and standard German recordings were selected from the Marburg Regionalsprache.de database (Schmidt et al., 2008) for 15 male speakers for whom both versions differed significantly (by more than 1.6 points). All speakers were born and raised in Swabia, a region in southwestern Germany. For each generation (young: 18-24 years, middle-aged: 46-53 years, older: > 66 years), five speakers were selected. Subsequently, 60 short stimuli (approximately four seconds each) were created, consisting of two sentences per speaker produced in the standard and in the dialectal version. As both samples were produced by the same speaker and recorded in the same interview session, voice quality remained stable, and differences in age judgments should therefore rely only on differences in dialect. For each stimulus, adult German listeners (N=107, 18-25 years) were asked to judge the age of the 15 speakers.

Preliminary results
The average deviation from the CA was 10.8 years, ranging from the largest averaged deviation of 18.8 years for speaker KFALT3 to the smallest of 3.4 years for speaker WNJUNG1 (Figure 1). The largest individual deviation across all 107 judgements was 16.2 years and the smallest 7.6 years. The average deviation for all dialectal samples was 10.4 years and for all standard samples 11.2 years. Young speakers were judged more accurately (7.3 years) than middle-aged (11.0 years) and older speakers (13.9 years). In 60% of all comparisons, the speaker was judged to be older in the dialect sample than in the standard sample. The average extent of the overestimation was 2.0 years (Figure 2), largest for the middle-aged (2.9 years) and smallest for the young speakers (0.2 years).
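A minimal sketch of how such deviation and overestimation figures can be computed, assuming absolute deviations from CA and a signed dialect-minus-standard shift (the abstract does not state the exact definitions, and all values below are hypothetical):

```python
import numpy as np

# Hypothetical perceived-age (PA) judgments for one speaker, per listener,
# in the standard and the dialect condition, against chronological age (CA).
ca = 48
pa_standard = np.array([40, 45, 52, 38, 50])
pa_dialect  = np.array([44, 49, 55, 41, 53])

dev_standard = np.mean(np.abs(pa_standard - ca))    # average deviation from CA
dev_dialect  = np.mean(np.abs(pa_dialect - ca))
overestimation = np.mean(pa_dialect - pa_standard)  # dialect-induced shift
print(dev_standard, dev_dialect, overestimation)
```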

Figure 1 Deviations from PA and CA over all judgements (N=107) for 15 Swabian speakers. Results are shown separately for young (blue), middle-aged (orange) and older speakers (dark-green).


Figure 2 CA (black), PA_standard (green) and PA_dialect (red) of the 15 Swabian speakers. Results are shown separately for young (blue), middle-aged (orange) and older speakers (dark-green).

References
Braun, A. (1996). Age estimation by different listener groups. Forensic Linguistics, 3, 65-73.
Kehrein, R. (2012). Regionalsprachliche Spektren im Raum - Zur linguistischen Struktur der Vertikale. Stuttgart: Steiner (Zeitschrift für Dialektologie und Linguistik. Beihefte 152).
Kreul, E. J. & Hecker, M. H. (1971). Description of the speech of patients with cancer of the vocal folds. Part II: Judgements of age and voice quality. The Journal of the Acoustical Society of America, 49, 4(2), 1283-1287.
Schmidt, J. E., Herrgen, J. & Kehrein, R. (2008ff.). Regionalsprache.de (REDE). Forschungsplattform zu den modernen Regionalsprachen des Deutschen. Bearbeitet von D. Bock, B. Ganswindt, H. Girnth, R. Kehrein, A. Lameli, S. Messner, C. Purschke, A. Wolańska. Marburg: Forschungszentrum Deutscher Sprachatlas. www.regionalsprache.de


Speaker Identification Enhancement by Inclusion of Perceptual Context: an

Application of the Head Turning Modulation Model

Benjamin Cohen-Lhyver1, Sylvain Argentieri1, and Bruno Gas1

1Institut des Systèmes Intelligents et de Robotique, Université Pierre et Marie Curie

[email protected]

This paper is eligible for the 'Best Student Paper Award'

"Everyday, experience tells us that it is often possible to identify a familiar speaker solely by his/her voice. Such observations reveal that speakers carry individual features in their voices", from Dellwo (2014).These observations might be true, b u t it is also true that (i) multimodal information, such as audiovisual data, and (ii) the context in which this data is perceived, play a major role in speaker identification, and, also, in every perceptual process. This notion of context can be,for instance, the environment in which the speaker is: if someone is at home, the probability of hearing/seeing a person that lives far away is extremely low. Thus, if one hears a voice in this environment, the template-matching process of comparing the "footprint" (by means of spectral, temporal, or mixed fetaures) of the voice perceived and the internal database of known voices should exclude the learned models of every person with a low-probability to actually emit this sound. By doing that, the recognition step gets much simpler.The Head Turning Modulation (HTM) model presented here has been developped in a robotic context within the Two!Ears project, which aimed at developing binaural models of listening together with the audiovisual exploration of the unknown environments and the emergence of an attentional behavior. The primary goal of the HTM model was to trigger head movements on a robot, driven by (i) the Congruence of an audiovisual event, and (ii) the need of additional information to

196

build an internal representation of the unknown environment being explored. The notion of Congruence can be compared to the saliency of an audio event but with a broader temporal range and the embedding of a contextual component. We turned this Congruence-based perception of an event tangible through head movements toward the location of the audiovisual object of interest, thus making emerge an attentional behavior.The HTM model is based on two components: the Dynamic Weighting module (DWmod,Cohen-Lhyver, 2015), which aims at computing a Congruence feature of the incoming audiovisual events, and the Multimodal Fusion & Inference module (MFImod, Cohen-Lhyver2016), which aims at learning the high-level link between audio and vision modalities. Together, these modules are able to take audiovisual data as inputs and to learn online and with no prior knowledge about the environment (by means of the probabilities of audiovisual events to occur) (i) if a new event is Congruent to this environment, and (ii) how to infer apotentially missing modality given the other (if the object is behind an obstacle for instance). Applied to speaker identification, the HTM model should be able to learn by itself a set of labeled audiovisual speakers, placed in different environments, possibly noisy, so as to be able to quicken and enhance the recognition of a speaker by using (i) context and (ii) the learned multimodal representation of speakers.
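The context-based pruning idea described above can be illustrated with the following minimal sketch; this is not the HTM implementation itself, but a toy combination of match scores with a context prior, where all values and the pruning floor are assumptions:

```python
import numpy as np

def contextual_speaker_id(scores, context_prior, floor=0.01):
    """Combine voice template-matching scores with a context prior:
    speakers with a low probability of occurring in the current
    environment are pruned before the best candidate is chosen."""
    scores = np.asarray(scores, dtype=float)
    prior = np.asarray(context_prior, dtype=float)
    candidates = prior >= floor                 # exclude unlikely speakers
    likelihood = np.exp(scores - scores.max())  # softmax-style rescaling
    posterior = np.full_like(scores, -np.inf)
    posterior[candidates] = likelihood[candidates] * prior[candidates]
    return int(np.argmax(posterior))

# Hypothetical match scores for 4 known speakers, heard "at home":
match_scores = [2.1, 2.3, 0.5, 1.9]        # speaker 1 matches best acoustically
home_prior   = [0.45, 0.005, 0.30, 0.245]  # but speaker 1 rarely visits
print(contextual_speaker_id(match_scores, home_prior))  # -> 0
```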

ReferencesLeeman A., M-J. Kolly and Dellwo V. (2014). Speaker-individuality in

suprasegmental temporal features: implications for forensic voice comparison. Forensics Science International., 238, 59–67.

Cohen-Lhyver, B., Argentieri, S. and Gas, B. (2015). Modulating the Auditory Turn-to Reflex on the Basis of Multimodal Feedback Loops: the Dynamic Weighting Model. IEEE International Conference on Robotics and Biomimetics (ROBIO).

Cohen-Lhyver, B., Argentieri, S. and Gas, B. (2016). Multimodal Fusion and Inference Using Binaural Audition and Vision. 22nd International Congress of Acoustics.


Constructing a voice profile: Reconstruction of the L1 vowel set for an L2 speaker

Belinda Fingerling
Institut für Germanistische Sprachwissenschaft, Philipps-Universität Marburg
[email protected]

Introduction

In June 2015, the son of a German millionaire's family was kidnapped (for further information see Jong-Lendle et al., 2017). Based on previous investigations, it is assumed that the offender grew up either in the Balkan region or in a region of former Yugoslavia before he immigrated to Germany about 15 years ago. Since these are only preliminary assumptions, it is necessary to obtain further information about the kidnapper's origin. In this study, a reconstruction of his vowel system was attempted on the basis of his pronunciation of German.

Method

Based on parts of the extortion calls, the master's thesis deals with the offender's L1 phoneme system. First, the offender's vowels were analysed with regard to F1 and F2; the vowels /i ɪ e ɛ o ɔ y ʏ/ were judged to be of particular interest. The formant values were subsequently compared to reference values for native German speakers (Sendlmeier and Seebode, 2006). However, as these reference values are based on read word lists and on stressed vowels, it was decided that spontaneous speech would provide a more useful comparison. Four male German speakers, all with a comparable F0 and able to speak standard German, were therefore asked to describe a route on a map similar to the one described in the extortion calls. Beforehand, the speakers were shown a list of words and asked to incorporate these into their description.
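As a rough illustration of the measurement step, the sketch below extracts F1 and F2 at hand-labelled vowel midpoints. The abstract does not state which software was used; this sketch assumes Praat accessed through the parselmouth Python library, and the file name, midpoint times, and formant ceiling are invented placeholders.

```python
import parselmouth  # Python interface to Praat (assumed toolchain)

# Hypothetical vowel annotations: (label, midpoint time in seconds).
VOWEL_MIDPOINTS = [("i", 0.42), ("ɛ", 1.10), ("o", 2.35)]

sound = parselmouth.Sound("extortion_call_excerpt.wav")  # placeholder file name
formant = sound.to_formant_burg(maximum_formant=5000.0)  # ceiling suited to a male voice

for label, t in VOWEL_MIDPOINTS:
    f1 = formant.get_value_at_time(1, t)  # first formant at the vowel midpoint
    f2 = formant.get_value_at_time(2, t)  # second formant at the vowel midpoint
    print(f"/{label}/  F1 = {f1:.0f} Hz  F2 = {f2:.0f} Hz")
```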

Results

In the first comparison, large differences in F2 were found for the vowels /i e ɛ o y/ (see Fig. 1). The vowel /o/ also shows a large difference in F1. Differences were smaller, or the values partially overlapped, for the vowels /ɪ ɔ ʏ/. However, when comparing the kidnapper's vowels with other spontaneous speech, there is less discrepancy than initially assumed (see Fig. 2): large differences appear only for the vowels /i ɛ o/. The preliminary results suggest that the offender can produce especially the lax vowels close to standard German, whereas he seems to have difficulty pronouncing the tense vowels. Based on these results and on his phoneme system in general, it may be possible to draw conclusions regarding the offender's native language.
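A comparison of this kind can be reduced to per-vowel formant differences, as in the sketch below. All Hz values are invented placeholders, not the study's measurements or Sendlmeier and Seebode's published values, and the 150 Hz threshold for flagging a "large" difference is likewise an arbitrary assumption.

```python
# Per-vowel comparison of the offender's mean formants against reference
# means; (F1, F2) pairs in Hz are invented placeholder values.
offender = {"i": (310, 2050), "ɛ": (560, 1600), "o": (420, 1100)}
reference = {"i": (290, 2300), "ɛ": (520, 1850), "o": (380, 850)}

THRESHOLD_HZ = 150  # arbitrary cut-off for flagging a "large" difference

for vowel in offender:
    d_f1 = offender[vowel][0] - reference[vowel][0]
    d_f2 = offender[vowel][1] - reference[vowel][1]
    flag = "LARGE" if max(abs(d_f1), abs(d_f2)) > THRESHOLD_HZ else "small"
    print(f"/{vowel}/  ΔF1 = {d_f1:+d} Hz  ΔF2 = {d_f2:+d} Hz  -> {flag}")
```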

After analysing the offender's accent, it is now necessary to focus on his German dialect. Based on the investigations, it is quite possible that he has lived in the area near Frankfurt am Main in Hessen, where he also acquired the local accent. Dialectal characteristics might contribute valuable hints to the overall profile of the offender.

Figure 1. The offender's average F1 and F2 values for the vowels /i ɪ e ɛ o ɔ y ʏ/ in comparison to those from Sendlmeier and Seebode's (2006) read word list

199

Figure 2. The offender's average F1 and F2 values for the vowels /i ɪ e ɛ o ɔ y ʏ/ in comparison to the averaged formants for spontaneous speech, based on four male German speakers performing an adapted map-reading task

References

Jong-Lendle, Gea de; Kehrein, Roland; Urke, Frederike; Molczanow, Janina; Georg, Anna Lena; Fingerling, Belinda et al. (2017): Language identification from a foreign accent in German. Marburg: Philipps-Universität Marburg, Institut für Germanistische Sprachwissenschaft.

Sendlmeier, Walter F.; Seebode, Julia (2006): Formantkarten des deutschen Vokalsystems. Berlin: Technische Universität Berlin, Institut für Sprache und Kommunikation.


