AUTOMATICALLY IDENTIFYING PERCEPTUALLY SIMILAR VOICES FOR VOICE PARADES

Finnian Kelly¹,², Anil Alexander¹, Oscar Forth¹, Samuel Kent¹, Jonas Lindh³ and Joel Åkesson³

¹ Oxford Wave Research Ltd, Oxford, U.K.
² The University of Texas at Dallas, U.S.A.
³ Voxalys AB, Gothenburg, Sweden

{finnian|anil|oscar|sam}@oxfordwaveresearch.com, {jonas|joel}@voxalys.se

VOICE PARADES

A voice parade is a set of voices, or foils, judged to lie within a suitable range of similarity to a suspect's voice.

• Selecting foils requires manual screening of candidate voices by a phonetician
• This is a time-consuming, costly and subjective process
• Automating the selection of foils, under the supervision of the forensic expert, could:
  • Allow a much larger pool of candidate voices to be considered
  • Reduce subjectivity
  • Increase speed while reducing costs

• U.K. Home Office, "Advice on the use of voice identification parades", circular 57/2003, December 2003.
• K. McDougall, "Assessing perceived voice similarity using Multidimensional Scaling for the construction of voice parades", IJSLL, vol. 20, no. 2, pp. 163–172, 2013.

VOICE CASTING

Voice casting is the task of identifying the voice in a candidate database most similar to a target voice.

• Voice casting is typically carried out for cross-language dubbing of film and games
• Voices are manually compared by casting experts
• Automating voice casting carries similar benefits to automating voice parades

• N. Obin and A. Roebel, "Similarity Search of Acted Voices for Automatic Voice Casting", IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 9, pp. 1642–1651, Sept. 2016.

PERCEIVED VOICE SIMILARITY

• Speaker traits: sex, age
• Acoustic characteristics: articulation, timbre, prosody, vocal effort
• Voice quality: breathy, hoarse, creaky

• F. Nolan, P. French, K. McDougall, L. Stevens and T. Hudson, "The role of voice quality 'settings' in perceived voice similarity", IAFPA 2011 conference, Vienna, Austria, 2011.

FEATURES FOR SPEAKER RECOGNITION

Mel Frequency Cepstral Coefficients (MFCCs)
• MFCCs are the standard in automatic speaker recognition
• They effectively capture short-term characteristics of the vocal tract

• J. H. L. Hansen and T. Hasan, "Speaker Recognition by Machines and Humans: A tutorial review", IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, Nov. 2015.
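To make the MFCC pipeline concrete, here is a minimal NumPy/SciPy sketch of the standard steps (framing, power spectrum, mel filterbank, log, DCT). This is a generic textbook implementation for illustration, not the feature extractor used in the study; all parameter values are common defaults, not values from the slides.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC extraction: frame -> |FFT|^2 -> mel filterbank -> log -> DCT."""
    # Frame the signal with a Hamming window
    frames = [signal[s:s + n_fft] * np.hamming(n_fft)
              for s in range(0, len(signal) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2  # power spectrum

    # Triangular mel-spaced filterbank
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge

    log_energy = np.log(spec @ fbank.T + 1e-10)
    # Keep the first n_ceps cepstral coefficients
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Each row of the returned matrix is one short-term frame, which is why MFCCs are said to capture short-term characteristics of the vocal tract.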

FEATURES FOR SPEAKER RECOGNITION

Phonetic features, e.g. long-term formants (LTFs)
• Less discriminative than MFCCs for automatic speaker recognition
• However, they capture acoustic characteristics of the voice important for perceived similarity…

[LTF illustration from the Catalina manual]

AUTO-PHONETIC FEATURES: iVOCALISE

Automatic extraction of phonetic features with iVOCALISE:
• F1–F4 + ∆
• F0 + ∆
• F0 (semitones) + ∆

These capture pitch and formant ranges, along with temporal information (intonation patterns).

• A. Alexander, O. Forth, A. A. Atreya and F. Kelly, "VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic and user-provided features", Odyssey 2016, Bilbao, Spain.
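As a rough illustration of the ingredients behind such features, the sketch below estimates F0 for a voiced frame from the autocorrelation peak, converts F0 to semitones, and computes first-order ∆ (delta) features. These are simplified stand-ins, not iVOCALISE's actual algorithms; the function names and the 100 Hz semitone reference are illustrative assumptions.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=60.0, fmax=400.0):
    """Rough F0 estimate for a single voiced frame via the autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)     # search plausible pitch lags
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

def hz_to_semitones(f0, ref=100.0):
    """F0 expressed in semitones relative to a (hypothetical) reference frequency."""
    return 12.0 * np.log2(np.asarray(f0) / ref)

def deltas(track):
    """First-order differences (∆) of a feature track, capturing temporal patterns."""
    return np.diff(track, prepend=track[0])
```

Applied frame by frame, `estimate_f0` yields a pitch track whose range and ∆ values reflect the intonation patterns mentioned above.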

VOICE CORPUS

• 175 public figures (actors, musicians, etc.)
• ~2 recordings each, ~30 sec average length
• Sourced from online archives (primarily YouTube)
• Male–female speaker ratio is approximately 2:1
• All speech is in English, with wide variation in accent
• Recordings are exclusively from lapel microphones
• Recording environment is unconstrained

SPEAKER RECOGNITION EXPERIMENT

[Figure: candidate voices ordered by similarity rank against a target voice, 1, 2, …, N]

• A. Alexander, O. Forth, A. A. Atreya and F. Kelly, "VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic and user-provided features", Odyssey 2016, Bilbao, Spain.

SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1. Similar: the two highest-ranked speakers
2. Different: two randomly-ranked speakers (constrained to be same-gender and outside the top ten)
3. Same speaker: one different recording of the target speaker
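The three-way cohort selection above can be sketched as follows. This is an illustrative reconstruction under stated assumptions (candidate scores held in a dict, gender labels available as metadata), not the authors' code; all names are hypothetical.

```python
import random

def select_cohorts(scores, genders, target_gender, seed=0):
    """Sketch of cohort selection for one target voice.
    scores: candidate id -> similarity score to the target (higher = more similar)
    genders: candidate id -> 'M' or 'F'
    """
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get, reverse=True)
    similar = ranked[:2]                        # the two highest-ranked speakers
    # Random picks constrained to same gender and outside the top ten
    pool = [c for c in ranked[10:] if genders[c] == target_gender]
    different = rng.sample(pool, 2)
    return similar, different
```

The "same speaker" condition needs no selection step: it simply pairs a second recording of the target speaker.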

CREATING LISTENER COMPARISONS

Each recording of a target speaker was divided into 7-second chunks. For each target speaker, target chunks were paired with chunks from the Similar, Different and Same Speaker cohorts, giving two comparisons per condition (e.g. Similar comparisons 1 & 2 for target speaker 1).

THE LISTENER TEST

6 target speakers (3 male, 3 female), each contributing Similar, Different and Same Speaker comparisons.

Listeners were asked to:
1. Judge the similarity of the two voices on a scale of 1–9
2. Ignore the speakers' accents, any non-speech noises, and the spoken content

The test was administered over the web; there was no supervision of the test or control over the listening environment.

THE LISTENERS

• 43 listeners: 25 female, 18 male
• 20 spoke English as a first language; 23 did not
• Age range, hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

[Figure: P(similarity rating) vs. similarity rating, for each comparison type]

• Same Speaker comparisons: Median = 8
• Similar comparisons: Median = 5
• Different comparisons: Median = 3

RESPONSES TO FEMALE VOICE COMPARISONS

[Figure: P(similarity rating) vs. similarity rating, for each comparison type]

• Same Speaker comparisons: Median = 8
• Similar comparisons: Median = 3
• Different comparisons: Median = 4

Cross-accent comparisons: Example 1, Example 2 [audio examples]

SCORE VS SIMILARITY RATING

[Figure: speaker recognition score vs. mean similarity rating, responses to Similar and Different male comparisons]

Auto-Phonetic: Pearson correlation = 0.72

PERCEIVED VOICE SIMILARITY & SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and perceived similarity?

• A correlation between perceived similarity and speaker recognition scores has been observed with MFCC features
• However, greater correlation has been observed by combining MFCCs and voice quality labels
• Phonetic features?

• N. Obin and A. Roebel, "Similarity Search of Acted Voices for Automatic Voice Casting", IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 9, pp. 1642–1651, Sept. 2016.

SCORE VS SIMILARITY RATING

[Figure: score vs. mean similarity rating, responses to Similar and Different male comparisons]

• MFCC: Pearson correlation = 0.72
• Auto-Phonetic + MFCC: Pearson correlation = 0.76
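The reported numbers are Pearson correlations between system scores and mean listener ratings. The sketch below computes that correlation and shows one simple way two systems' scores might be fused (the mean of z-normalised scores); the slides do not specify the fusion actually used, so `fuse` is an assumption for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between system scores and mean listener ratings."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def fuse(a, b):
    """One simple score fusion: average of z-normalised scores
    (illustrative; not necessarily the combination used in the study)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    za = (a - a.mean()) / a.std()
    zb = (b - b.mean()) / b.std()
    return (za + zb) / 2.0
```

Z-normalising before averaging puts the two systems' scores on a comparable scale, so neither dominates the fused score.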

CONCLUSIONS

• Promising results for male comparisons
  • Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
• Variable results across female comparisons
  • Data was a contributing factor: smaller female candidate pool and cross-accent comparisons
• Room to improve:
  • Larger subjective evaluation required
  • Combine Auto-Phonetic and MFCC features
  • Scope to expand the Auto-Phonetic feature set

AUTOMATIC VOICE CASTING: RECENT RESULTS FROM SDI MEDIA

SDI Media is a major provider of dubbing services worldwide. SDI's Italian branch has been using iVOCALISE with Auto-Phonetic and MFCC features for automatic voice casting…

[Audio examples: English → Italian 1, English → Italian 2]

CONCLUSIONS

• The definition of voice similarity is application-dependent: voice parades vs. voice casting
• Allow for an application-dependent search space
  • Use metadata such as gender, age and accent to constrain the set of candidate voices
• Allow for an application-dependent 'degree of similarity'
  • Given well-calibrated output scores from the automatic system, a score range of interest can be defined
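The last point can be sketched as a trivial filter: assuming well-calibrated scores, the application chooses a range [lo, hi] and keeps only candidates inside it (e.g. "similar but not too similar" foils for a parade). The helper name and dict layout are illustrative assumptions.

```python
def in_score_range(scores, lo, hi):
    """Keep only candidates whose (assumed well-calibrated) similarity score
    falls in the application-dependent range of interest [lo, hi]."""
    return {c: s for c, s in scores.items() if lo <= s <= hi}
```

A voice parade and a voice-casting search would simply supply different [lo, hi] ranges to the same selection step.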

Page 2: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

VOICE PARADES

bull UK Home Office ldquoAdvice on the use of voice identification paradesrdquo circular

572003 December 2003

bull K McDougall ldquoAssessing perceived voice similarity using Multidimensional Scaling

for the construction of voice paradesrdquo IJSLL vol 20 no 2 pp 163-172 2013

A voice parade is a set of voices or foils judged to lie within a suitable range of

similarity to a suspectrsquos voice

bull Selecting foils requires manual screening of candidate voices by a phonetician

bull This is a time-consuming costly and subjective process

bull Automating the selection of foils under supervision of the forensic expert could

bull Allow a much larger pool of candidate voices to be considered

bull Reduce subjectivity

bull Increase speed while reducing costs

VOICE CASTING

VOICE CASTING

Voice Casting is the task of identifying the voice in a candidate database

most similar to a target voice

bull Voice casting is typically cross-language dubbing of film and games

bull Voices are manually compared by casting experts

bull Automating voice casting carries similar benefits to automating voice

parades

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

VOICE CASTING

Voice Casting is the task of identifying the voice in a candidate database

most similar to a target voice

bull Voice casting is typically cross-language dubbing of film and games

bull Voices are manually compared by casting experts

bull Automating voice casting carries similar benefits to automating voice

parades

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

PERCEIVED VOICE SIMILARITY

PERCEIVED VOICE SIMILARITY

Speaker traits sex age

Acoustic Characteristics articulation

timbre prosody vocal effort

Voice Quality breathy hoarse creaky

bull F Nolan P French K McDougall L Stevens and T Hudson ldquoThe role of

voice quality lsquosettingsrsquo in perceived voice similarityrdquo IAFPA 2011 conference

Vienna Austria 2011

FEATURES FOR SPEAKER RECOGNITION

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

Mel Frequency Cepstral Coefficients

bull MFCCs are the standard in

automatic speaker recognition

bull They effectively capture short-term

characteristics of the vocal tract

FEATURES FOR SPEAKER RECOGNITION

Phonetic features eg long-term

formants

bull Less discriminative than MFCCs for

automatic speaker recognition

bull However they capture acoustic

characteristics of the voice

important for perceived similarityhellip

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

LTF illustration from Catalina Manual

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull F0 + ∆

bull F0 (semitones) + ∆

Capture pitch and format ranges along

with temporal information (intonation

patterns)

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

VOICE CORPUS

bull 175 public figures (actors musicians etc)

bull ~2 recordings each ~30 sec average length

bull Sourced from online archives (primarily YouTube)

bull Male-Female speaker ratio is approximately 21

bull All speech is in English with wide variation in accent

bull Recordings are exclusively from lapel microphones

bull Recording environment is unconstrained

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

1

2

N

similarity rank

hellip

SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1 Similar two highest-ranked speakers

2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)

3 Same speaker one different recording of the target speaker

1

2

N

hellip

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Similar Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Different Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Same Speaker Comparisons

1 amp 2 for Target Speaker 1

+ + + + +

THE LISTENER TEST

x 12 x 6

6 target speakers

3 male 3 female

x 12

Similar

comparisons

Different

comparisons

Same Speaker

comparisons

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Similar

comparisons

Different

comparisons

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 5

Different

comparisons

Median = 3

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO FEMALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 3

Different

comparisons

Median = 4

Similarity rating

P(S

imila

rity

rating

)

Same Speaker

comparisons

Median = 8

Similarity rating

P(S

imila

rity

rating

)

Cross-accent comparisons

Example 1

Example 2

RESPONSES TO FEMALE VOICE COMPARISONS

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic

Corr (Pearson) = 072

PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and

perceived similarity

bull A correlation between perceived similarity and speaker

recognition scores has been observed with MFCC features

bull However greater correlation has been observed by combining

MFCCs and Voice Quality labels

bull Phonetic Features

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score Auto-Phonetic

Corr (Pearson) = 072

MFCC

Corr (Pearson) = 072

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic + MFCC

Corr (Pearson) = 076

CONCLUSIONS

bull Promising results for male comparisons

bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

bull Variable results across female comparisons

bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons

bull Room to improve

bull Larger subjective evaluation required

bull Combine Auto-Phonetic and MFCC features

bull Scope to expand Auto-Phonetic feature set

AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA

SDI media are a major provider of dubbing services worldwide

SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip

English Italian 1

English Italian 2

CONCLUSIONS

bull The definition of voice similarity is application-dependent voice parades vs voice casting

bull Allow for an application-dependent search space

bull Use meta-data such as gender age accent to constrain the set of candidate voices

bull Allow for an application-dependent lsquodegree of similarityrsquo

bullGiven well-calibrated output scores from the automatic system can define a score range of interest

Page 3: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

VOICE CASTING

VOICE CASTING

Voice Casting is the task of identifying the voice in a candidate database

most similar to a target voice

bull Voice casting is typically cross-language dubbing of film and games

bull Voices are manually compared by casting experts

bull Automating voice casting carries similar benefits to automating voice

parades

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

VOICE CASTING

Voice Casting is the task of identifying the voice in a candidate database

most similar to a target voice

bull Voice casting is typically cross-language dubbing of film and games

bull Voices are manually compared by casting experts

bull Automating voice casting carries similar benefits to automating voice

parades

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

PERCEIVED VOICE SIMILARITY

PERCEIVED VOICE SIMILARITY

Speaker traits sex age

Acoustic Characteristics articulation

timbre prosody vocal effort

Voice Quality breathy hoarse creaky

bull F Nolan P French K McDougall L Stevens and T Hudson ldquoThe role of

voice quality lsquosettingsrsquo in perceived voice similarityrdquo IAFPA 2011 conference

Vienna Austria 2011

FEATURES FOR SPEAKER RECOGNITION

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

Mel Frequency Cepstral Coefficients

bull MFCCs are the standard in

automatic speaker recognition

bull They effectively capture short-term

characteristics of the vocal tract

FEATURES FOR SPEAKER RECOGNITION

Phonetic features eg long-term

formants

bull Less discriminative than MFCCs for

automatic speaker recognition

bull However they capture acoustic

characteristics of the voice

important for perceived similarityhellip

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

LTF illustration from Catalina Manual

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull F0 + ∆

bull F0 (semitones) + ∆

Capture pitch and format ranges along

with temporal information (intonation

patterns)

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

VOICE CORPUS

bull 175 public figures (actors musicians etc)

bull ~2 recordings each ~30 sec average length

bull Sourced from online archives (primarily YouTube)

bull Male-Female speaker ratio is approximately 21

bull All speech is in English with wide variation in accent

bull Recordings are exclusively from lapel microphones

bull Recording environment is unconstrained

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

1

2

N

similarity rank

hellip

SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1 Similar two highest-ranked speakers

2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)

3 Same speaker one different recording of the target speaker

1

2

N

hellip

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Similar Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Different Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Same Speaker Comparisons

1 amp 2 for Target Speaker 1

+ + + + +

THE LISTENER TEST

x 12 x 6

6 target speakers

3 male 3 female

x 12

Similar

comparisons

Different

comparisons

Same Speaker

comparisons

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Similar

comparisons

Different

comparisons

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 5

Different

comparisons

Median = 3

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO FEMALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 3

Different

comparisons

Median = 4

Similarity rating

P(S

imila

rity

rating

)

Same Speaker

comparisons

Median = 8

Similarity rating

P(S

imila

rity

rating

)

Cross-accent comparisons

Example 1

Example 2

RESPONSES TO FEMALE VOICE COMPARISONS

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic

Corr (Pearson) = 072

PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and

perceived similarity

bull A correlation between perceived similarity and speaker

recognition scores has been observed with MFCC features

bull However greater correlation has been observed by combining

MFCCs and Voice Quality labels

bull Phonetic Features

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score Auto-Phonetic

Corr (Pearson) = 072

MFCC

Corr (Pearson) = 072

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic + MFCC

Corr (Pearson) = 076

CONCLUSIONS

bull Promising results for male comparisons

bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

bull Variable results across female comparisons

bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons

bull Room to improve

bull Larger subjective evaluation required

bull Combine Auto-Phonetic and MFCC features

bull Scope to expand Auto-Phonetic feature set

AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA

SDI media are a major provider of dubbing services worldwide

SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip

English Italian 1

English Italian 2

CONCLUSIONS

bull The definition of voice similarity is application-dependent voice parades vs voice casting

bull Allow for an application-dependent search space

bull Use meta-data such as gender age accent to constrain the set of candidate voices

bull Allow for an application-dependent lsquodegree of similarityrsquo

bullGiven well-calibrated output scores from the automatic system can define a score range of interest

Page 4: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

VOICE CASTING

Voice Casting is the task of identifying the voice in a candidate database

most similar to a target voice

bull Voice casting is typically cross-language dubbing of film and games

bull Voices are manually compared by casting experts

bull Automating voice casting carries similar benefits to automating voice

parades

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

VOICE CASTING

Voice Casting is the task of identifying the voice in a candidate database

most similar to a target voice

bull Voice casting is typically cross-language dubbing of film and games

bull Voices are manually compared by casting experts

bull Automating voice casting carries similar benefits to automating voice

parades

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

PERCEIVED VOICE SIMILARITY

PERCEIVED VOICE SIMILARITY

Speaker traits sex age

Acoustic Characteristics articulation

timbre prosody vocal effort

Voice Quality breathy hoarse creaky

bull F Nolan P French K McDougall L Stevens and T Hudson ldquoThe role of

voice quality lsquosettingsrsquo in perceived voice similarityrdquo IAFPA 2011 conference

Vienna Austria 2011

FEATURES FOR SPEAKER RECOGNITION

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

Mel Frequency Cepstral Coefficients

bull MFCCs are the standard in

automatic speaker recognition

bull They effectively capture short-term

characteristics of the vocal tract

FEATURES FOR SPEAKER RECOGNITION

Phonetic features eg long-term

formants

bull Less discriminative than MFCCs for

automatic speaker recognition

bull However they capture acoustic

characteristics of the voice

important for perceived similarityhellip

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

LTF illustration from Catalina Manual

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull F0 + ∆

bull F0 (semitones) + ∆

Capture pitch and format ranges along

with temporal information (intonation

patterns)

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

VOICE CORPUS

bull 175 public figures (actors musicians etc)

bull ~2 recordings each ~30 sec average length

bull Sourced from online archives (primarily YouTube)

bull Male-Female speaker ratio is approximately 21

bull All speech is in English with wide variation in accent

bull Recordings are exclusively from lapel microphones

bull Recording environment is unconstrained

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

1

2

N

similarity rank

hellip

SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1. Similar: two highest-ranked speakers

2. Different: two randomly-ranked speakers (constrained to be same-gender and outside the top ten)

3. Same speaker: one different recording of the target speaker

[Figure: cohort positions within the similarity ranking, 1 to N]
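The cohort selection above can be sketched as a small function over the system's similarity ranking. This is an illustrative sketch, assuming the candidate list is already restricted to same-gender speakers and sorted by descending similarity to the target:

```python
import random

def select_cohorts(ranked_speakers, seed=0):
    """Pick Similar and Different cohorts from a similarity-ranked list.

    ranked_speakers: candidate speaker IDs (target excluded, same gender
    assumed) sorted by descending automatic similarity score.
    Returns the two highest-ranked speakers, plus two randomly chosen
    speakers from outside the top ten.
    """
    similar = ranked_speakers[:2]        # two highest-ranked
    pool = ranked_speakers[10:]          # outside the top ten
    rng = random.Random(seed)            # seeded for reproducibility
    different = rng.sample(pool, 2)      # two randomly-ranked
    return similar, different

# Hypothetical ranking: spk01 is most similar to the target
ranked = [f"spk{i:02d}" for i in range(1, 21)]
similar, different = select_cohorts(ranked)
print(similar)  # ['spk01', 'spk02']
```

The Same Speaker cohort needs no selection logic: it is simply a different recording of the target.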

CREATING LISTENER COMPARISONS

Each recording of Target Speaker 1 was divided into 7 sec chunks. Target chunks were paired with chunks from the Similar, Different, and Same Speaker cohort recordings, giving Comparisons 1 & 2 of each type for Target Speaker 1.

[Figure: construction of Similar, Different, and Same Speaker comparisons from 7 sec chunks]
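The chunking step can be sketched as slicing a recording's sample sequence into consecutive fixed-length pieces. A simplified sketch under the assumption that chunks are taken back-to-back (the actual stimuli may have been selected with more care, e.g. to avoid silences):

```python
def split_into_chunks(samples, sample_rate, chunk_sec=7):
    """Split an audio sample sequence into consecutive fixed-length chunks.

    Trailing audio shorter than chunk_sec is dropped, so every chunk
    presented to a listener has the same duration.
    """
    chunk_len = int(chunk_sec * sample_rate)
    n_full = len(samples) // chunk_len
    return [samples[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]

# 30 s of (silent) audio at 16 kHz -> four full 7 s chunks, 2 s remainder dropped
audio = [0.0] * (30 * 16000)
chunks = split_into_chunks(audio, 16000)
print(len(chunks), len(chunks[0]))  # 4 112000
```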

THE LISTENER TEST

[Figure: test composition: 6 target speakers (3 male, 3 female), with Similar, Different, and Same Speaker comparisons for each]

THE LISTENER TEST

1. Judge the similarity of the two voices on a scale of 1-9.

2. Ignore the speakers' accents, any non-speech noises, and the spoken content.

The test was administered over the web; there was no supervision of the test or control over the listening environment.

THE LISTENERS

• 43 listeners: 25 female, 18 male

• 20 spoke English as a first language; 23 did not

• Age range, hearing status, and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

[Figure: P(similarity rating) vs. similarity rating for each comparison type]

• Same Speaker comparisons: median = 8

• Similar comparisons: median = 5

• Different comparisons: median = 3

RESPONSES TO FEMALE VOICE COMPARISONS

[Figure: P(similarity rating) vs. similarity rating for each comparison type]

• Same Speaker comparisons: median = 8

• Similar comparisons: median = 3

• Different comparisons: median = 4

[Figure: cross-accent comparisons highlighted, Examples 1 & 2]

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons.

[Figure: system score vs. mean similarity rating]

Auto-Phonetic: Corr (Pearson) = 0.72

PERCEIVED VOICE SIMILARITY & SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and perceived similarity?

• A correlation between perceived similarity and speaker recognition scores has been observed with MFCC features.

• However, greater correlation has been observed by combining MFCCs and Voice Quality labels.

• Phonetic features?

• N. Obin and A. Roebel, "Similarity Search of Acted Voices for Automatic Voice Casting," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1642-1651, Sept. 2016.

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons.

[Figure: system score vs. mean similarity rating]

Auto-Phonetic: Corr (Pearson) = 0.72

MFCC: Corr (Pearson) = 0.72

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons.

[Figure: system score vs. mean similarity rating]

Auto-Phonetic + MFCC: Corr (Pearson) = 0.76
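The reported correlations compare each comparison's mean listener rating against the system score, and the combined result uses score-level fusion of the two feature streams. A minimal sketch with made-up numbers (not the paper's data), fusing by a simple average:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative scores for six comparisons (invented for this sketch)
ap_scores   = [0.9, 0.4, 0.7, 0.2, 0.8, 0.3]   # Auto-Phonetic system scores
mfcc_scores = [0.8, 0.5, 0.6, 0.1, 0.9, 0.4]   # MFCC system scores
ratings     = [7.0, 4.0, 6.0, 2.0, 8.0, 3.0]   # mean listener similarity ratings

fused = [(a + m) / 2 for a, m in zip(ap_scores, mfcc_scores)]  # score-level fusion
print(round(pearson(fused, ratings), 3))
```

Equal-weight averaging is only one choice; weighted fusion or calibration-aware combination could equally be used.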

CONCLUSIONS

• Promising results for male comparisons: Auto-Phonetic features capture speaker characteristics relevant to perceived similarity.

• Variable results across female comparisons: data was a contributing factor (smaller female candidate pool and cross-accent comparisons).

• Room to improve:

  • Larger subjective evaluation required

  • Combine Auto-Phonetic and MFCC features

  • Scope to expand the Auto-Phonetic feature set

AUTOMATIC VOICE CASTING: RECENT RESULTS FROM SDI MEDIA

SDI Media is a major provider of dubbing services worldwide. SDI's Italian branch has been using iVOCALISE with Auto-Phonetic and MFCC features for automatic voice casting…

[Audio examples: English / Italian 1, English / Italian 2]

CONCLUSIONS

• The definition of voice similarity is application-dependent: voice parades vs. voice casting.

• Allow for an application-dependent search space:

  • Use meta-data such as gender, age, and accent to constrain the set of candidate voices.

• Allow for an application-dependent 'degree of similarity':

  • Given well-calibrated output scores from the automatic system, we can define a score range of interest.
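The final point can be sketched as a simple filter: given calibrated similarity scores against the target, keep only candidates whose score falls in an application-chosen band. The thresholds and speaker IDs below are illustrative, not from the paper:

```python
def candidates_in_range(scores, low, high):
    """Return candidate IDs whose calibrated similarity score lies in [low, high].

    For a voice parade, a band below 'same speaker' but above 'clearly
    different' selects plausible foils; voice casting might target a
    higher band, closer to the top of the score range.
    """
    return [spk for spk, s in scores.items() if low <= s <= high]

calibrated = {"spkA": 0.95, "spkB": 0.70, "spkC": 0.55, "spkD": 0.20}
foils = candidates_in_range(calibrated, 0.50, 0.80)
print(sorted(foils))  # ['spkB', 'spkC']
```

The band itself is the application-dependent knob: widening or shifting it trades off similarity against the size of the candidate set.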

Page 5: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

VOICE CASTING

Voice Casting is the task of identifying the voice in a candidate database

most similar to a target voice

bull Voice casting is typically cross-language dubbing of film and games

bull Voices are manually compared by casting experts

bull Automating voice casting carries similar benefits to automating voice

parades

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

PERCEIVED VOICE SIMILARITY

PERCEIVED VOICE SIMILARITY

Speaker traits sex age

Acoustic Characteristics articulation

timbre prosody vocal effort

Voice Quality breathy hoarse creaky

bull F Nolan P French K McDougall L Stevens and T Hudson ldquoThe role of

voice quality lsquosettingsrsquo in perceived voice similarityrdquo IAFPA 2011 conference

Vienna Austria 2011

FEATURES FOR SPEAKER RECOGNITION

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

Mel Frequency Cepstral Coefficients

bull MFCCs are the standard in

automatic speaker recognition

bull They effectively capture short-term

characteristics of the vocal tract

FEATURES FOR SPEAKER RECOGNITION

Phonetic features eg long-term

formants

bull Less discriminative than MFCCs for

automatic speaker recognition

bull However they capture acoustic

characteristics of the voice

important for perceived similarityhellip

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

LTF illustration from Catalina Manual

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull F0 + ∆

bull F0 (semitones) + ∆

Capture pitch and format ranges along

with temporal information (intonation

patterns)

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

VOICE CORPUS

bull 175 public figures (actors musicians etc)

bull ~2 recordings each ~30 sec average length

bull Sourced from online archives (primarily YouTube)

bull Male-Female speaker ratio is approximately 21

bull All speech is in English with wide variation in accent

bull Recordings are exclusively from lapel microphones

bull Recording environment is unconstrained

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

1

2

N

similarity rank

hellip

SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1 Similar two highest-ranked speakers

2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)

3 Same speaker one different recording of the target speaker

1

2

N

hellip

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Similar Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Different Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Same Speaker Comparisons

1 amp 2 for Target Speaker 1

+ + + + +

THE LISTENER TEST

x 12 x 6

6 target speakers

3 male 3 female

x 12

Similar

comparisons

Different

comparisons

Same Speaker

comparisons

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Similar

comparisons

Different

comparisons

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 5

Different

comparisons

Median = 3

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO FEMALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 3

Different

comparisons

Median = 4

Similarity rating

P(S

imila

rity

rating

)

Same Speaker

comparisons

Median = 8

Similarity rating

P(S

imila

rity

rating

)

Cross-accent comparisons

Example 1

Example 2

RESPONSES TO FEMALE VOICE COMPARISONS

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic

Corr (Pearson) = 072

PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and

perceived similarity

bull A correlation between perceived similarity and speaker

recognition scores has been observed with MFCC features

bull However greater correlation has been observed by combining

MFCCs and Voice Quality labels

bull Phonetic Features

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score Auto-Phonetic

Corr (Pearson) = 072

MFCC

Corr (Pearson) = 072

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic + MFCC

Corr (Pearson) = 076

CONCLUSIONS

bull Promising results for male comparisons

bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

bull Variable results across female comparisons

bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons

bull Room to improve

bull Larger subjective evaluation required

bull Combine Auto-Phonetic and MFCC features

bull Scope to expand Auto-Phonetic feature set

AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA

SDI media are a major provider of dubbing services worldwide

SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip

English Italian 1

English Italian 2

CONCLUSIONS

bull The definition of voice similarity is application-dependent voice parades vs voice casting

bull Allow for an application-dependent search space

bull Use meta-data such as gender age accent to constrain the set of candidate voices

bull Allow for an application-dependent lsquodegree of similarityrsquo

bullGiven well-calibrated output scores from the automatic system can define a score range of interest

Page 6: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

PERCEIVED VOICE SIMILARITY

PERCEIVED VOICE SIMILARITY

Speaker traits sex age

Acoustic Characteristics articulation

timbre prosody vocal effort

Voice Quality breathy hoarse creaky

bull F Nolan P French K McDougall L Stevens and T Hudson ldquoThe role of

voice quality lsquosettingsrsquo in perceived voice similarityrdquo IAFPA 2011 conference

Vienna Austria 2011

FEATURES FOR SPEAKER RECOGNITION

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

Mel Frequency Cepstral Coefficients

bull MFCCs are the standard in

automatic speaker recognition

bull They effectively capture short-term

characteristics of the vocal tract

FEATURES FOR SPEAKER RECOGNITION

Phonetic features eg long-term

formants

bull Less discriminative than MFCCs for

automatic speaker recognition

bull However they capture acoustic

characteristics of the voice

important for perceived similarityhellip

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

LTF illustration from Catalina Manual

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull F0 + ∆

bull F0 (semitones) + ∆

Capture pitch and format ranges along

with temporal information (intonation

patterns)

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

VOICE CORPUS

bull 175 public figures (actors musicians etc)

bull ~2 recordings each ~30 sec average length

bull Sourced from online archives (primarily YouTube)

bull Male-Female speaker ratio is approximately 21

bull All speech is in English with wide variation in accent

bull Recordings are exclusively from lapel microphones

bull Recording environment is unconstrained

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

1

2

N

similarity rank

hellip

SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1 Similar two highest-ranked speakers

2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)

3 Same speaker one different recording of the target speaker

1

2

N

hellip

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Similar Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Different Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Same Speaker Comparisons

1 amp 2 for Target Speaker 1

+ + + + +

THE LISTENER TEST

x 12 x 6

6 target speakers

3 male 3 female

x 12

Similar

comparisons

Different

comparisons

Same Speaker

comparisons

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Similar

comparisons

Different

comparisons

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 5

Different

comparisons

Median = 3

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO FEMALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 3

Different

comparisons

Median = 4

Similarity rating

P(S

imila

rity

rating

)

Same Speaker

comparisons

Median = 8

Similarity rating

P(S

imila

rity

rating

)

Cross-accent comparisons

Example 1

Example 2

RESPONSES TO FEMALE VOICE COMPARISONS

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic

Corr (Pearson) = 072

PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and

perceived similarity

bull A correlation between perceived similarity and speaker

recognition scores has been observed with MFCC features

bull However greater correlation has been observed by combining

MFCCs and Voice Quality labels

bull Phonetic Features

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score Auto-Phonetic

Corr (Pearson) = 072

MFCC

Corr (Pearson) = 072

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic + MFCC

Corr (Pearson) = 076

CONCLUSIONS

bull Promising results for male comparisons

bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

bull Variable results across female comparisons

bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons

bull Room to improve

bull Larger subjective evaluation required

bull Combine Auto-Phonetic and MFCC features

bull Scope to expand Auto-Phonetic feature set

AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA

SDI media are a major provider of dubbing services worldwide

SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip

English Italian 1

English Italian 2

CONCLUSIONS

bull The definition of voice similarity is application-dependent voice parades vs voice casting

bull Allow for an application-dependent search space

bull Use meta-data such as gender age accent to constrain the set of candidate voices

bull Allow for an application-dependent lsquodegree of similarityrsquo

bullGiven well-calibrated output scores from the automatic system can define a score range of interest

Page 7: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

PERCEIVED VOICE SIMILARITY

Speaker traits sex age

Acoustic Characteristics articulation

timbre prosody vocal effort

Voice Quality breathy hoarse creaky

bull F Nolan P French K McDougall L Stevens and T Hudson ldquoThe role of

voice quality lsquosettingsrsquo in perceived voice similarityrdquo IAFPA 2011 conference

Vienna Austria 2011

FEATURES FOR SPEAKER RECOGNITION

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

Mel Frequency Cepstral Coefficients

bull MFCCs are the standard in

automatic speaker recognition

bull They effectively capture short-term

characteristics of the vocal tract

FEATURES FOR SPEAKER RECOGNITION

Phonetic features eg long-term

formants

bull Less discriminative than MFCCs for

automatic speaker recognition

bull However they capture acoustic

characteristics of the voice

important for perceived similarityhellip

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

LTF illustration from Catalina Manual

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull F0 + ∆

bull F0 (semitones) + ∆

Capture pitch and format ranges along

with temporal information (intonation

patterns)

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

VOICE CORPUS

bull 175 public figures (actors musicians etc)

bull ~2 recordings each ~30 sec average length

bull Sourced from online archives (primarily YouTube)

bull Male-Female speaker ratio is approximately 21

bull All speech is in English with wide variation in accent

bull Recordings are exclusively from lapel microphones

bull Recording environment is unconstrained

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

1

2

N

similarity rank

hellip

SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1 Similar two highest-ranked speakers

2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)

3 Same speaker one different recording of the target speaker

1

2

N

hellip

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Similar Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Different Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Same Speaker Comparisons

1 amp 2 for Target Speaker 1

+ + + + +

THE LISTENER TEST

x 12 x 6

6 target speakers

3 male 3 female

x 12

Similar

comparisons

Different

comparisons

Same Speaker

comparisons

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Similar

comparisons

Different

comparisons

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 5

Different

comparisons

Median = 3

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO FEMALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 3

Different

comparisons

Median = 4

Similarity rating

P(S

imila

rity

rating

)

Same Speaker

comparisons

Median = 8

Similarity rating

P(S

imila

rity

rating

)

Cross-accent comparisons

Example 1

Example 2

RESPONSES TO FEMALE VOICE COMPARISONS

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic

Corr (Pearson) = 072

PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and

perceived similarity

bull A correlation between perceived similarity and speaker

recognition scores has been observed with MFCC features

bull However greater correlation has been observed by combining

MFCCs and Voice Quality labels

bull Phonetic Features

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score Auto-Phonetic

Corr (Pearson) = 072

MFCC

Corr (Pearson) = 072

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic + MFCC

Corr (Pearson) = 076

CONCLUSIONS

bull Promising results for male comparisons

bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

bull Variable results across female comparisons

bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons

bull Room to improve

bull Larger subjective evaluation required

bull Combine Auto-Phonetic and MFCC features

bull Scope to expand Auto-Phonetic feature set

AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA

SDI media are a major provider of dubbing services worldwide

SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip

English Italian 1

English Italian 2

CONCLUSIONS

bull The definition of voice similarity is application-dependent voice parades vs voice casting

bull Allow for an application-dependent search space

bull Use meta-data such as gender age accent to constrain the set of candidate voices

bull Allow for an application-dependent lsquodegree of similarityrsquo

bullGiven well-calibrated output scores from the automatic system can define a score range of interest

Page 8: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

FEATURES FOR SPEAKER RECOGNITION

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

Mel Frequency Cepstral Coefficients

bull MFCCs are the standard in

automatic speaker recognition

bull They effectively capture short-term

characteristics of the vocal tract

FEATURES FOR SPEAKER RECOGNITION

Phonetic features eg long-term

formants

bull Less discriminative than MFCCs for

automatic speaker recognition

bull However they capture acoustic

characteristics of the voice

important for perceived similarityhellip

bull J H L Hansen and T Hasan Speaker Recognition by Machines and

Humans A tutorial review in IEEE Signal Processing Magazine vol 32 no

6 pp 74-99 Nov 2015

LTF illustration from Catalina Manual

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

AUTO-PHONETIC FEATURES IVOCALISE

Automatic extraction of phonetic

features with iVOCALISE

bull F1-F4 + ∆

bull F0 + ∆

bull F0 (semitones) + ∆

Capture pitch and format ranges along

with temporal information (intonation

patterns)

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

VOICE CORPUS

bull 175 public figures (actors musicians etc)

bull ~2 recordings each ~30 sec average length

bull Sourced from online archives (primarily YouTube)

bull Male-Female speaker ratio is approximately 21

bull All speech is in English with wide variation in accent

bull Recordings are exclusively from lapel microphones

bull Recording environment is unconstrained

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

1

2

N

similarity rank

hellip

SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1 Similar two highest-ranked speakers

2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)

3 Same speaker one different recording of the target speaker

1

2

N

hellip

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Similar Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Different Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Same Speaker Comparisons

1 amp 2 for Target Speaker 1

+ + + + +

THE LISTENER TEST

FEATURES FOR SPEAKER RECOGNITION

Phonetic features, e.g. long-term formants:

• Less discriminative than MFCCs for automatic speaker recognition
• However, they capture acoustic characteristics of the voice important for perceived similarity…
• J. H. L. Hansen and T. Hasan, "Speaker Recognition by Machines and Humans: A tutorial review", IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, Nov. 2015.

[LTF illustration from Catalina Manual]

AUTO-PHONETIC FEATURES: iVOCALISE

Automatic extraction of phonetic features with iVOCALISE:

• F1–F4 + ∆
• F0 + ∆
• F0 (semitones) + ∆

These capture pitch and formant ranges along with temporal information (intonation patterns).

• A. Alexander, O. Forth, A. A. Atreya and F. Kelly, "VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic and user-provided features", Odyssey 2016, Bilbao, Spain.

VOICE CORPUS

• 175 public figures (actors, musicians, etc.)
• ~2 recordings each, ~30 sec average length
• Sourced from online archives (primarily YouTube)
• Male–female speaker ratio is approximately 2:1
• All speech is in English, with wide variation in accent
• Recordings are exclusively from lapel microphones
• Recording environment is unconstrained

SPEAKER RECOGNITION EXPERIMENT

Each target voice is compared against the corpus with the automatic system, producing a similarity rank (1, 2, …, N) over the candidate speakers.

• A. Alexander, O. Forth, A. A. Atreya and F. Kelly, "VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic and user-provided features", Odyssey 2016, Bilbao, Spain.

SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1. Similar: the two highest-ranked speakers
2. Different: two randomly-selected speakers (constrained to be same-gender and ranked outside the top ten)
3. Same speaker: one different recording of the target speaker
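The three selection rules above can be sketched in a few lines. Everything here (speaker names, scores, file names) is illustrative, not the study's actual data:

```python
import random

def select_cohort(scores, target_recordings, seed=0):
    """Apply the three cohort-selection rules for one target speaker.

    scores: dict of same-gender candidate speaker -> automatic similarity
            score against the target (the target itself excluded).
    target_recordings: recordings of the target speaker.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # rank 1 = most similar

    similar = ranked[:2]                    # 1. two highest-ranked speakers
    rng = random.Random(seed)
    different = rng.sample(ranked[10:], 2)  # 2. two random speakers outside the top ten
    same_speaker = target_recordings[1]     # 3. a different recording of the target

    return similar, different, same_speaker

# Illustrative pool of 15 same-gender candidates with decreasing scores:
scores = {f"spk{i:02d}": 1.0 - 0.05 * i for i in range(15)}
similar, different, same_speaker = select_cohort(
    scores, ["target_rec_a.wav", "target_rec_b.wav"])
print(similar)       # the two top-ranked candidates
print(different)     # two candidates ranked outside the top ten
print(same_speaker)  # the target's other recording
```

Constraining the 'Different' picks to the same gender and to ranks outside the top ten keeps them plausibly comparable while clearly less similar than the top-ranked cohort.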

CREATING LISTENER COMPARISONS

Each recording of a target speaker is split into 7 sec chunks. Chunks of the target are then paired with chunks of the selected cohort to form, for each target speaker:

• Similar Comparisons 1 & 2 (target vs. the two highest-ranked speakers)
• Different Comparisons 1 & 2 (target vs. the two randomly-selected speakers)
• Same Speaker Comparisons 1 & 2 (target vs. a different recording of the target)
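Splitting a recording into 7 sec chunks is a simple slicing job. A minimal sketch over a raw sample array, assuming a known sample rate:

```python
def split_into_chunks(samples, sample_rate, chunk_sec=7):
    """Split an audio sample array into consecutive chunks of chunk_sec seconds.

    A trailing remainder shorter than chunk_sec is dropped, so every
    chunk played to listeners has the same duration.
    """
    n = chunk_sec * sample_rate
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

# A 30-second recording at 16 kHz yields four full 7-second chunks:
audio = [0.0] * (30 * 16000)
chunks = split_into_chunks(audio, 16000)
print(len(chunks), len(chunks[0]))  # 4 112000
```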

THE LISTENER TEST

[Test design: 6 target speakers (3 male, 3 female) × 12 comparisons, spanning Similar, Different and Same Speaker comparisons]

THE LISTENER TEST

1. Judge the similarity of the two voices on a scale of 1–9.
2. Ignore the speakers' accents, any non-speech noises, and the spoken content.

The test was administered over the web; there was no supervision of the test or control over the listening environment.

THE LISTENERS

• 43 listeners: 25 female, 18 male
• 20 spoke English as a first language; 23 did not
• Age range, hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

[Histograms of P(similarity rating) vs. similarity rating]

• Same Speaker comparisons: Median = 8
• Similar comparisons: Median = 5
• Different comparisons: Median = 3

RESPONSES TO FEMALE VOICE COMPARISONS

[Histograms of P(similarity rating) vs. similarity rating]

• Same Speaker comparisons: Median = 8
• Similar comparisons: Median = 3
• Different comparisons: Median = 4

Cross-accent comparisons occurred among the female pairs (audio Examples 1 & 2).

SCORE VS. SIMILARITY RATING

Responses to 'Similar' and 'Different' male comparisons.

[Scatter plot: automatic score vs. mean similarity rating]
Auto-Phonetic: Corr. (Pearson) = 0.72

PERCEIVED VOICE SIMILARITY & SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and perceived similarity?

• A correlation between perceived similarity and speaker recognition scores has been observed with MFCC features
• However, greater correlation has been observed by combining MFCCs and Voice Quality labels
• Phonetic features?
• N. Obin and A. Roebel, "Similarity Search of Acted Voices for Automatic Voice Casting", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1642–1651, Sept. 2016.

SCORE VS. SIMILARITY RATING

Responses to 'Similar' and 'Different' male comparisons.

[Scatter plots: automatic score vs. mean similarity rating]
Auto-Phonetic: Corr. (Pearson) = 0.72
MFCC: Corr. (Pearson) = 0.72

SCORE VS. SIMILARITY RATING

Responses to 'Similar' and 'Different' male comparisons.

[Scatter plot: automatic score vs. mean similarity rating]
Auto-Phonetic + MFCC: Corr. (Pearson) = 0.76

CONCLUSIONS

• Promising results for male comparisons
  • Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
• Variable results across female comparisons
  • Data was a contributing factor: smaller female candidate pool and cross-accent comparisons
• Room to improve:
  • Larger subjective evaluation required
  • Combine Auto-Phonetic and MFCC features
  • Scope to expand the Auto-Phonetic feature set

AUTOMATIC VOICE CASTING: RECENT RESULTS FROM SDI MEDIA

SDI Media is a major provider of dubbing services worldwide.

SDI's Italian branch has been using iVOCALISE with Auto-Phonetic and MFCC features for automatic voice casting…

[Audio examples: English → Italian 1; English → Italian 2]

CONCLUSIONS

• The definition of voice similarity is application-dependent: voice parades vs. voice casting
• Allow for an application-dependent search space
  • Use meta-data such as gender, age and accent to constrain the set of candidate voices
• Allow for an application-dependent 'degree of similarity'
  • Given well-calibrated output scores from the automatic system, we can define a score range of interest
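The two application-dependent controls above, a metadata-constrained search space and a calibrated score range, can be sketched as one filter. The field names, age tolerance and score band below are all illustrative assumptions:

```python
def find_candidates(pool, target_meta, score_range, age_tolerance=10):
    """Filter a candidate pool by metadata, then keep a band of calibrated scores.

    pool: list of dicts with 'gender', 'age', 'accent' and 'score' keys, where
          'score' is the calibrated similarity score against the target voice.
    score_range: (low, high) band of interest, e.g. similar-but-distinct
                 foils for a voice parade.
    """
    low, high = score_range
    return [
        c for c in pool
        if c["gender"] == target_meta["gender"]
        and abs(c["age"] - target_meta["age"]) <= age_tolerance
        and c["accent"] == target_meta["accent"]
        and low <= c["score"] <= high
    ]

target = {"gender": "M", "age": 35, "accent": "RP"}
pool = [
    {"gender": "M", "age": 33, "accent": "RP", "score": 1.4},  # kept
    {"gender": "M", "age": 60, "accent": "RP", "score": 1.6},  # age too far
    {"gender": "F", "age": 34, "accent": "RP", "score": 1.5},  # wrong gender
    {"gender": "M", "age": 38, "accent": "RP", "score": 3.9},  # too similar
]
foils = find_candidates(pool, target, score_range=(1.0, 2.5))
print(len(foils))  # 1
```

For a voice parade the band would exclude the very top scores (too close to the suspect's own voice); for voice casting one might instead keep only the top of the range.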

Page 14: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

SPEAKER RECOGNITION EXPERIMENT

bull A Alexander O Forth A A Atreya and F Kelly ldquoVOCALISE A

forensic automatic speaker recognition system supporting spectral

phonetic and user-provided featuresrdquo Odyssey 2016 Bilbao Spain

SPEAKER RECOGNITION EXPERIMENT

1

2

N

similarity rank

hellip

SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1 Similar two highest-ranked speakers

2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)

3 Same speaker one different recording of the target speaker

1

2

N

hellip

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Similar Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Different Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Same Speaker Comparisons

1 amp 2 for Target Speaker 1

+ + + + +

THE LISTENER TEST

x 12 x 6

6 target speakers

3 male 3 female

x 12

Similar

comparisons

Different

comparisons

Same Speaker

comparisons

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

• 43 listeners: 25 female, 18 male

• 20 spoke English as a first language; 23 did not

• Age range, hearing status, and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

[Histograms: P(similarity rating) vs. similarity rating for each condition]

• Same Speaker comparisons: median = 8

• Similar comparisons: median = 5

• Different comparisons: median = 3

RESPONSES TO FEMALE VOICE COMPARISONS

[Histograms: P(similarity rating) vs. similarity rating for each condition]

• Same Speaker comparisons: median = 8

• Similar comparisons: median = 3

• Different comparisons: median = 4

RESPONSES TO FEMALE VOICE COMPARISONS

[Histogram: Same Speaker comparisons (median = 8), with two cross-accent comparisons highlighted (Examples 1 & 2)]

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons.

[Scatter plot: automatic-system score vs. mean similarity rating]

• Auto-Phonetic: Pearson correlation = 0.72
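The reported Pearson correlation between system scores and mean listener ratings can be reproduced with a small stdlib-only helper (illustrative; the authors' analysis tooling is not specified):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```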

PERCEIVED VOICE SIMILARITY & SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and perceived similarity?

• A correlation between perceived similarity and speaker recognition scores has been observed with MFCC features

• However, greater correlation has been observed by combining MFCCs and Voice Quality labels

• Phonetic features?

• N. Obin and A. Roebel, "Similarity Search of Acted Voices for Automatic Voice Casting", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1642-1651, Sept. 2016.

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons.

[Scatter plots: score vs. mean similarity rating]

• Auto-Phonetic: Pearson correlation = 0.72

• MFCC: Pearson correlation = 0.72

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons.

[Scatter plot: score vs. mean similarity rating]

• Auto-Phonetic + MFCC: Pearson correlation = 0.76
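One simple way to combine Auto-Phonetic and MFCC scores is weighted score-level fusion after z-normalisation. The slides do not state how the combination was done, so this is only an assumed scheme:

```python
def fuse(scores_a, scores_b, w=0.5):
    """Weighted sum of two systems' scores after per-system z-normalisation."""
    def znorm(s):
        m = sum(s) / len(s)
        sd = (sum((x - m) ** 2 for x in s) / len(s)) ** 0.5
        return [(x - m) / sd for x in s]
    za, zb = znorm(scores_a), znorm(scores_b)
    return [w * a + (1 - w) * b for a, b in zip(za, zb)]
```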

CONCLUSIONS

• Promising results for male comparisons: Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

• Variable results across female comparisons: data was a contributing factor (smaller female candidate pool and cross-accent comparisons)

• Room to improve:

• Larger subjective evaluation required

• Combine Auto-Phonetic and MFCC features

• Scope to expand the Auto-Phonetic feature set

AUTOMATIC VOICE CASTING: RECENT RESULTS FROM SDI MEDIA

SDI Media is a major provider of dubbing services worldwide.

SDI's Italian branch has been using iVOCALISE with Auto-Phonetic and MFCC features for automatic voice casting…

[Audio examples: English → Italian 1, English → Italian 2]

CONCLUSIONS

• The definition of voice similarity is application-dependent: voice parades vs. voice casting

• Allow for an application-dependent search space: use meta-data such as gender, age and accent to constrain the set of candidate voices

• Allow for an application-dependent 'degree of similarity': given well-calibrated output scores from the automatic system, a score range of interest can be defined
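Given calibrated scores, restricting candidates to an application-dependent 'score range of interest' could look like this (the thresholds are hypothetical):

```python
def candidates_in_range(scores, lo, hi):
    """Keep candidates whose calibrated similarity score falls in [lo, hi]."""
    return [c for c, s in scores.items() if lo <= s <= hi]
```

For a voice parade one might keep foils that are similar but not too similar to the suspect; for voice casting the upper bound could simply be removed.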

Page 15: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

SPEAKER RECOGNITION EXPERIMENT

1

2

N

similarity rank

hellip

SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1 Similar two highest-ranked speakers

2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)

3 Same speaker one different recording of the target speaker

1

2

N

hellip

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Similar Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Different Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Same Speaker Comparisons

1 amp 2 for Target Speaker 1

+ + + + +

THE LISTENER TEST

x 12 x 6

6 target speakers

3 male 3 female

x 12

Similar

comparisons

Different

comparisons

Same Speaker

comparisons

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Similar

comparisons

Different

comparisons

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 5

Different

comparisons

Median = 3

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO FEMALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 3

Different

comparisons

Median = 4

Similarity rating

P(S

imila

rity

rating

)

Same Speaker

comparisons

Median = 8

Similarity rating

P(S

imila

rity

rating

)

Cross-accent comparisons

Example 1

Example 2

RESPONSES TO FEMALE VOICE COMPARISONS

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic

Corr (Pearson) = 072

PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and

perceived similarity

bull A correlation between perceived similarity and speaker

recognition scores has been observed with MFCC features

bull However greater correlation has been observed by combining

MFCCs and Voice Quality labels

bull Phonetic Features

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score Auto-Phonetic

Corr (Pearson) = 072

MFCC

Corr (Pearson) = 072

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic + MFCC

Corr (Pearson) = 076

CONCLUSIONS

bull Promising results for male comparisons

bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

bull Variable results across female comparisons

bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons

bull Room to improve

bull Larger subjective evaluation required

bull Combine Auto-Phonetic and MFCC features

bull Scope to expand Auto-Phonetic feature set

AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA

SDI media are a major provider of dubbing services worldwide

SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip

English Italian 1

English Italian 2

CONCLUSIONS

bull The definition of voice similarity is application-dependent voice parades vs voice casting

bull Allow for an application-dependent search space

bull Use meta-data such as gender age accent to constrain the set of candidate voices

bull Allow for an application-dependent lsquodegree of similarityrsquo

bullGiven well-calibrated output scores from the automatic system can define a score range of interest

Page 16: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

SELECTING COHORTS FOR SUBJECTIVE COMPARISON

1 Similar two highest-ranked speakers

2 Different two randomly-ranked speakers (constrained to be same-gender and outside top ten)

3 Same speaker one different recording of the target speaker

1

2

N

hellip

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Similar Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Different Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Same Speaker Comparisons

1 amp 2 for Target Speaker 1

+ + + + +

THE LISTENER TEST

x 12 x 6

6 target speakers

3 male 3 female

x 12

Similar

comparisons

Different

comparisons

Same Speaker

comparisons

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Similar

comparisons

Different

comparisons

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 5

Different

comparisons

Median = 3

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO FEMALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 3

Different

comparisons

Median = 4

Similarity rating

P(S

imila

rity

rating

)

Same Speaker

comparisons

Median = 8

Similarity rating

P(S

imila

rity

rating

)

Cross-accent comparisons

Example 1

Example 2

RESPONSES TO FEMALE VOICE COMPARISONS

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic

Corr (Pearson) = 072

PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and

perceived similarity

bull A correlation between perceived similarity and speaker

recognition scores has been observed with MFCC features

bull However greater correlation has been observed by combining

MFCCs and Voice Quality labels

bull Phonetic Features

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score Auto-Phonetic

Corr (Pearson) = 072

MFCC

Corr (Pearson) = 072

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic + MFCC

Corr (Pearson) = 076

CONCLUSIONS

bull Promising results for male comparisons

bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

bull Variable results across female comparisons

bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons

bull Room to improve

bull Larger subjective evaluation required

bull Combine Auto-Phonetic and MFCC features

bull Scope to expand Auto-Phonetic feature set

AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA

SDI media are a major provider of dubbing services worldwide

SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip

English Italian 1

English Italian 2

CONCLUSIONS

bull The definition of voice similarity is application-dependent voice parades vs voice casting

bull Allow for an application-dependent search space

bull Use meta-data such as gender age accent to constrain the set of candidate voices

bull Allow for an application-dependent lsquodegree of similarityrsquo

bullGiven well-calibrated output scores from the automatic system can define a score range of interest

Page 17: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Similar Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Different Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Same Speaker Comparisons

1 amp 2 for Target Speaker 1

+ + + + +

THE LISTENER TEST

x 12 x 6

6 target speakers

3 male 3 female

x 12

Similar

comparisons

Different

comparisons

Same Speaker

comparisons

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Similar

comparisons

Different

comparisons

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 5

Different

comparisons

Median = 3

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO FEMALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 3

Different

comparisons

Median = 4

Similarity rating

P(S

imila

rity

rating

)

Same Speaker

comparisons

Median = 8

Similarity rating

P(S

imila

rity

rating

)

Cross-accent comparisons

Example 1

Example 2

RESPONSES TO FEMALE VOICE COMPARISONS

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic

Corr (Pearson) = 072

PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and

perceived similarity

bull A correlation between perceived similarity and speaker

recognition scores has been observed with MFCC features

bull However greater correlation has been observed by combining

MFCCs and Voice Quality labels

bull Phonetic Features

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score Auto-Phonetic

Corr (Pearson) = 072

MFCC

Corr (Pearson) = 072

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic + MFCC

Corr (Pearson) = 076

CONCLUSIONS

bull Promising results for male comparisons

bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

bull Variable results across female comparisons

bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons

bull Room to improve

bull Larger subjective evaluation required

bull Combine Auto-Phonetic and MFCC features

bull Scope to expand Auto-Phonetic feature set

AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA

SDI media are a major provider of dubbing services worldwide

SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip

English Italian 1

English Italian 2

CONCLUSIONS

bull The definition of voice similarity is application-dependent voice parades vs voice casting

bull Allow for an application-dependent search space

bull Use meta-data such as gender age accent to constrain the set of candidate voices

bull Allow for an application-dependent lsquodegree of similarityrsquo

bullGiven well-calibrated output scores from the automatic system can define a score range of interest

Page 18: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Different Comparisons 1 amp 2

for Target Speaker 1

+ + + + +

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Same Speaker Comparisons

1 amp 2 for Target Speaker 1

+ + + + +

THE LISTENER TEST

x 12 x 6

6 target speakers

3 male 3 female

x 12

Similar

comparisons

Different

comparisons

Same Speaker

comparisons

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Similar

comparisons

Different

comparisons

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 5

Different

comparisons

Median = 3

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO FEMALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 3

Different

comparisons

Median = 4

Similarity rating

P(S

imila

rity

rating

)

Same Speaker

comparisons

Median = 8

Similarity rating

P(S

imila

rity

rating

)

Cross-accent comparisons

Example 1

Example 2

RESPONSES TO FEMALE VOICE COMPARISONS

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic

Corr (Pearson) = 072

PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and

perceived similarity

bull A correlation between perceived similarity and speaker

recognition scores has been observed with MFCC features

bull However greater correlation has been observed by combining

MFCCs and Voice Quality labels

bull Phonetic Features

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score Auto-Phonetic

Corr (Pearson) = 072

MFCC

Corr (Pearson) = 072

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic + MFCC

Corr (Pearson) = 076

CONCLUSIONS

bull Promising results for male comparisons

bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

bull Variable results across female comparisons

bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons

bull Room to improve

bull Larger subjective evaluation required

bull Combine Auto-Phonetic and MFCC features

bull Scope to expand Auto-Phonetic feature set

AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA

SDI media are a major provider of dubbing services worldwide

SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip

English Italian 1

English Italian 2

CONCLUSIONS

bull The definition of voice similarity is application-dependent voice parades vs voice casting

bull Allow for an application-dependent search space

bull Use meta-data such as gender age accent to constrain the set of candidate voices

bull Allow for an application-dependent lsquodegree of similarityrsquo

bullGiven well-calibrated output scores from the automatic system can define a score range of interest

Page 19: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

CREATING LISTENER COMPARISONS

Similar Different Same Speaker

Recording of Target speaker 1 7 sec chunks

Same Speaker Comparisons

1 amp 2 for Target Speaker 1

+ + + + +

THE LISTENER TEST

x 12 x 6

6 target speakers

3 male 3 female

x 12

Similar

comparisons

Different

comparisons

Same Speaker

comparisons

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Similar

comparisons

Different

comparisons

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 5

Different

comparisons

Median = 3

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO FEMALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 3

Different

comparisons

Median = 4

Similarity rating

P(S

imila

rity

rating

)

Same Speaker

comparisons

Median = 8

Similarity rating

P(S

imila

rity

rating

)

Cross-accent comparisons

Example 1

Example 2

RESPONSES TO FEMALE VOICE COMPARISONS

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic

Corr (Pearson) = 072

PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and

perceived similarity

bull A correlation between perceived similarity and speaker

recognition scores has been observed with MFCC features

bull However greater correlation has been observed by combining

MFCCs and Voice Quality labels

bull Phonetic Features

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score Auto-Phonetic

Corr (Pearson) = 072

MFCC

Corr (Pearson) = 072

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic + MFCC

Corr (Pearson) = 076

CONCLUSIONS

bull Promising results for male comparisons

bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

bull Variable results across female comparisons

bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons

bull Room to improve

bull Larger subjective evaluation required

bull Combine Auto-Phonetic and MFCC features

bull Scope to expand Auto-Phonetic feature set

AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA

SDI media are a major provider of dubbing services worldwide

SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip

English Italian 1

English Italian 2

CONCLUSIONS

bull The definition of voice similarity is application-dependent voice parades vs voice casting

bull Allow for an application-dependent search space

bull Use meta-data such as gender age accent to constrain the set of candidate voices

bull Allow for an application-dependent lsquodegree of similarityrsquo

bullGiven well-calibrated output scores from the automatic system can define a score range of interest

Page 20: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

THE LISTENER TEST

x 12 x 6

6 target speakers

3 male 3 female

x 12

Similar

comparisons

Different

comparisons

Same Speaker

comparisons

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Similar

comparisons

Different

comparisons

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 5

Different

comparisons

Median = 3

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO FEMALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 3

Different

comparisons

Median = 4

Similarity rating

P(S

imila

rity

rating

)

Same Speaker

comparisons

Median = 8

Similarity rating

P(S

imila

rity

rating

)

Cross-accent comparisons

Example 1

Example 2

RESPONSES TO FEMALE VOICE COMPARISONS

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic

Corr (Pearson) = 072

PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and

perceived similarity

bull A correlation between perceived similarity and speaker

recognition scores has been observed with MFCC features

bull However greater correlation has been observed by combining

MFCCs and Voice Quality labels

bull Phonetic Features

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score Auto-Phonetic

Corr (Pearson) = 072

MFCC

Corr (Pearson) = 072

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic + MFCC

Corr (Pearson) = 076

CONCLUSIONS

bull Promising results for male comparisons

bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

bull Variable results across female comparisons

bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons

bull Room to improve

bull Larger subjective evaluation required

bull Combine Auto-Phonetic and MFCC features

bull Scope to expand Auto-Phonetic feature set

AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA

SDI media are a major provider of dubbing services worldwide

SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip

English Italian 1

English Italian 2

CONCLUSIONS

bull The definition of voice similarity is application-dependent voice parades vs voice casting

bull Allow for an application-dependent search space

bull Use meta-data such as gender age accent to constrain the set of candidate voices

bull Allow for an application-dependent lsquodegree of similarityrsquo

bullGiven well-calibrated output scores from the automatic system can define a score range of interest

Page 21: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

THE LISTENER TEST

1 Judge the similarity of the two voices on scale of 1-9

2 Ignore the speaker accents any non-speech noises or any of the spoken content

The test was administered over the web there was no supervision of the test or control over the listening environment

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Similar

comparisons

Different

comparisons

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO MALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 5

Different

comparisons

Median = 3

Similarity rating

P(S

imila

rity

rating

)

RESPONSES TO FEMALE VOICE COMPARISONS

Same Speaker

comparisons

Median = 8

Similar

comparisons

Median = 3

Different

comparisons

Median = 4

Similarity rating

P(S

imila

rity

rating

)

Same Speaker

comparisons

Median = 8

Similarity rating

P(S

imila

rity

rating

)

Cross-accent comparisons

Example 1

Example 2

RESPONSES TO FEMALE VOICE COMPARISONS

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic

Corr (Pearson) = 072

PERCEIVED VOICE SIMILARITY amp SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and

perceived similarity

bull A correlation between perceived similarity and speaker

recognition scores has been observed with MFCC features

bull However greater correlation has been observed by combining

MFCCs and Voice Quality labels

bull Phonetic Features

bull N Obin and A Roebel Similarity Search of Acted Voices for Automatic

Voice Casting in IEEEACM Transactions on Audio Speech and Language

Processing vol 24 no 9 pp 1642-1651 Sept 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score Auto-Phonetic

Corr (Pearson) = 072

MFCC

Corr (Pearson) = 072

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

mean similarity rating

score

Auto-Phonetic + MFCC

Corr (Pearson) = 076

CONCLUSIONS

bull Promising results for male comparisons

bull Auto-Phonetic features capture speaker characteristics relevant to perceived similarity

bull Variable results across female comparisons

bull Data was a contributing factor smaller female candidate pool and cross-accent comparisons

bull Room to improve

bull Larger subjective evaluation required

bull Combine Auto-Phonetic and MFCC features

bull Scope to expand Auto-Phonetic feature set

AUTOMATIC VOICE CASTING RECENT RESULTS FROM SDI MEDIA

SDI media are a major provider of dubbing services worldwide

SDIrsquos Italian branch have been using iVOCALISE with AP and MFCC features for automatic voice castinghellip

English Italian 1

English Italian 2

CONCLUSIONS

bull The definition of voice similarity is application-dependent voice parades vs voice casting

bull Allow for an application-dependent search space

bull Use meta-data such as gender age accent to constrain the set of candidate voices

bull Allow for an application-dependent lsquodegree of similarityrsquo

bullGiven well-calibrated output scores from the automatic system can define a score range of interest

Page 22: AUTOMATICALLY IDENTIFYING - Oxford Wave Research...VOICE PARADES •U.K. Home Office, “Advice on the use of voice identification parades”, circular: 57/2003, December 2003. K.

THE LISTENERS

bull 43 listeners 25 female 18 male

bull 20 spoke English as a first language 23 did not

bull Age range hearing status and playback method were noted

RESPONSES TO MALE VOICE COMPARISONS

[Figure: distributions of listener responses, P(Similarity rating) vs. similarity rating, for three comparison types]

• Same Speaker comparisons: Median = 8
• Similar comparisons: Median = 5
• Different comparisons: Median = 3

RESPONSES TO FEMALE VOICE COMPARISONS

[Figure: distributions of listener responses, P(Similarity rating) vs. similarity rating, for three comparison types]

• Same Speaker comparisons: Median = 8
• Similar comparisons: Median = 3
• Different comparisons: Median = 4

RESPONSES TO FEMALE VOICE COMPARISONS

[Figure: distribution of listener responses, P(Similarity rating) vs. similarity rating, Same Speaker comparisons, Median = 8]

Cross-accent comparisons: Example 1, Example 2

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

[Scatter plot: automatic score vs. mean similarity rating]

Auto-Phonetic: Corr. (Pearson) = 0.72

PERCEIVED VOICE SIMILARITY & SPEAKER RECOGNITION

Is there a link between the scores from an automatic system and perceived similarity?

• A correlation between perceived similarity and speaker recognition scores has been observed with MFCC features

• However, greater correlation has been observed by combining MFCCs and Voice Quality labels

• Phonetic features?

• N. Obin and A. Roebel, "Similarity Search of Acted Voices for Automatic Voice Casting," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1642–1651, Sept. 2016

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

[Scatter plots: automatic score vs. mean similarity rating]

Auto-Phonetic: Corr. (Pearson) = 0.72
MFCC: Corr. (Pearson) = 0.72
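The Pearson correlations on these slides relate each comparison's automatic score to the mean listener rating across the 43 listeners. As a minimal sketch of that calculation (the score/rating pairs below are made up for illustration, not the study's data):

```python
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative automatic-system scores and mean listener similarity ratings
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
ratings = [2.0, 4.5, 3.5, 7.0, 8.5]
print(round(pearson(scores, ratings), 3))
```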

SCORE VS SIMILARITY RATING

Responses to similar and different male comparisons

[Scatter plot: automatic score vs. mean similarity rating]

Auto-Phonetic + MFCC: Corr. (Pearson) = 0.76
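The slides do not state how the Auto-Phonetic and MFCC scores were combined to reach the 0.76 correlation; one common and plausible approach is equal-weight fusion of z-normalised per-system scores, sketched here (function names and the example scores are illustrative assumptions):

```python
from statistics import mean, stdev

def znorm(scores):
    # Zero-mean, unit-variance normalisation of one system's scores
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]

def fuse(ap_scores, mfcc_scores):
    # Equal-weight average of normalised Auto-Phonetic and MFCC scores
    # for the same ordered set of voice comparisons
    return [(a + b) / 2 for a, b in zip(znorm(ap_scores), znorm(mfcc_scores))]

# Made-up per-comparison scores from the two systems
fused = fuse([0.2, 0.5, 0.9], [1.0, 3.0, 4.0])
```

Normalising first matters because the two systems produce scores on different scales; without it, the system with the larger score range would dominate the sum.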

CONCLUSIONS

• Promising results for male comparisons
  • Auto-Phonetic features capture speaker characteristics relevant to perceived similarity
• Variable results across female comparisons
  • Data was a contributing factor: smaller female candidate pool and cross-accent comparisons
• Room to improve:
  • Larger subjective evaluation required
  • Combine Auto-Phonetic and MFCC features
  • Scope to expand the Auto-Phonetic feature set

AUTOMATIC VOICE CASTING: RECENT RESULTS FROM SDI MEDIA

SDI Media is a major provider of dubbing services worldwide.

SDI's Italian branch has been using iVOCALISE with Auto-Phonetic (AP) and MFCC features for automatic voice casting…

[Audio examples: English → Italian 1, English → Italian 2]

CONCLUSIONS

• The definition of voice similarity is application-dependent: voice parades vs. voice casting

• Allow for an application-dependent search space
  • Use metadata such as gender, age and accent to constrain the set of candidate voices

• Allow for an application-dependent 'degree of similarity'
  • Given well-calibrated output scores from the automatic system, a score range of interest can be defined
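The two points above combine naturally into a single foil-selection step: first constrain the candidate pool with metadata, then keep only voices whose calibrated score against the suspect falls in an application-chosen band. A sketch under those assumptions (all field names, the age tolerance, and the score band are hypothetical, not part of the presented system):

```python
def shortlist_foils(candidates, suspect, score_fn, lo, hi):
    """Return candidates matching the suspect's metadata whose calibrated
    similarity score against the suspect lies in the band [lo, hi]."""
    # Application-dependent search space: constrain by metadata first
    pool = [c for c in candidates
            if c["gender"] == suspect["gender"]
            and c["accent"] == suspect["accent"]
            and abs(c["age"] - suspect["age"]) <= 10]
    # Application-dependent degree of similarity: keep scores in the band
    return [c for c in pool if lo <= score_fn(c, suspect) <= hi]
```

For a voice parade the band would sit high enough that foils sound plausibly similar but below the region where the calibrated system reports same-speaker scores; voice casting could choose a different band for the same candidate pool.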
