SPEAKER RECOGNITION
Scott Settembre
ss424@cse.buffalo.edu
CSE 734 : Cyber Physical Spaces
Overview
• Speaker Identification
• Speaker Validation
• Two types of Recognition methods
  – Text dependent vs. Text independent
• Speaker Recognition steps
• Conclusion / References
March 16, 2009
Speaker Identification
• Determines the speaker from a set of registered speakers
  – This is called “closed” set identification
  – Result is the best-matching speaker
• What if the speaker is not in the database?
  – This is called “open” set identification
  – Result can be a speaker or a no-match result
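The difference between closed-set and open-set identification can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: the function name `identify`, the `open_set_threshold` parameter, and the similarity values are all assumptions made for the example.

```python
from typing import Dict, Optional


def identify(scores: Dict[str, float],
             open_set_threshold: Optional[float] = None) -> Optional[str]:
    """Return the best-matching registered speaker, or None for "no match".

    scores maps each registered speaker to a similarity (higher = closer).
    With open_set_threshold=None this is closed-set identification: the best
    match is always returned. With a threshold it is open-set identification.
    """
    if not scores:
        raise ValueError("no registered speakers")
    best = max(scores, key=scores.get)
    if open_set_threshold is not None and scores[best] <= open_set_threshold:
        return None  # nobody in the database is close enough
    return best


# Made-up similarity values for three registered speakers.
scores = {"alice": 0.42, "bob": 0.55, "carol": 0.31}
print(identify(scores))                          # closed set: always names someone
print(identify(scores, open_set_threshold=0.7))  # open set: may return None
```

With these numbers the closed-set call names "bob" even though the match is weak, while the open-set call reports no match.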
Speaker Identification Diagram
[Diagram: Actual Speaker → Input → Normalization → Feature Extraction → Calculate similarity to each speaker template or model → Select best match → Identification of Speaker. The Speaker Database, filled during Enrollment, supplies the templates or models.]
Speaker Validation
• Also called “Verification” or “Authentication”
• Determines if the voice matches a particular registered speaker
  – Result is the probability of a match or a similarity measure
• Similarity must exceed a particular threshold
  – Higher threshold produces more false negatives
  – Lower threshold produces more false positives
  – Voice variability and security issues make this a difficult threshold value to determine (more later)
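The threshold trade-off can be made concrete by counting errors at a few thresholds. This is a sketch only: the helper name `error_counts` and both score lists are invented for illustration, with "genuine" scores coming from the true speaker and "impostor" scores from someone else.

```python
from typing import List, Tuple


def error_counts(genuine: List[float], impostor: List[float],
                 threshold: float) -> Tuple[int, int]:
    """Return (false negatives, false positives) when a score must exceed threshold."""
    false_negatives = sum(1 for s in genuine if s <= threshold)   # true speaker rejected
    false_positives = sum(1 for s in impostor if s > threshold)   # impostor accepted
    return false_negatives, false_positives


# Illustrative similarity scores, not measured data.
genuine = [0.91, 0.84, 0.78, 0.66, 0.59]
impostor = [0.72, 0.61, 0.48, 0.35, 0.20]

for threshold in (0.5, 0.7, 0.9):
    fn, fp = error_counts(genuine, impostor, threshold)
    print(f"threshold={threshold:.1f}  false negatives={fn}  false positives={fp}")
```

On this toy data the false negatives rise from 0 to 2 to 4 as the threshold increases, while the false positives fall from 2 to 1 to 0.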
Speaker Validation Diagram
[Diagram: Actual Speaker → Input → Normalization → Feature Extraction → Calculate similarity to given template or model → Does similarity exceed threshold? → Verification (Accept/Reject). The claimed Speaker ID selects the speaker template or model from the Speaker Database, which is filled during Enrollment.]
Recognition Methods
• Text Dependent
  – Requires user to speak text spoken at enrollment
  – Usually a name, password, or phrase
  – Text Prompting is used to combat deception
    • The system requires the user to repeat back a random phrase or list of numbers
• Video example from “CSAIL” - Spoken Language Systems group at MIT.
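Text prompting with a random list of numbers might look like the sketch below. The function name `make_prompt` and the six-digit length are assumptions; the point is only that each session draws a fresh prompt, so a recording of an earlier session is unlikely to contain the right words.

```python
import secrets


def make_prompt(num_digits: int = 6) -> str:
    """Return a fresh random digit sequence for the user to read aloud."""
    return " ".join(str(secrets.randbelow(10)) for _ in range(num_digits))


prompt = make_prompt()
print(f"Please say: {prompt}")
```

A real system would then check both that the spoken digits match the prompt and that the voice matches the claimed speaker.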
Recognition Methods, cont.
• Text Independent
  – Non-invasive, does not require user to actively answer prompts
  – Longer enrollment phase required, more training data needed
  – Focuses on a subset of audio/phonetic features
• Video example from Nathan Harrington at IBM developerWorks.
Speaker Recognition Steps
1. Input Speech
2. Normalize captured speech
3. Feature extraction
4. Similarity matching
5. Decision/Threshold
Step 1. Input Speech
• Various fidelity from inputs
  – Telephone, computer microphone, noise cancelling headset, dedicated capture microphone, room microphones
• Noise
  – Background noise, room echoes
• Variability in voice
  – Speaking manner (rate and volume), sickness, aging, emotions, morning vs. evening voice
Step 2. Normalize Captured Speech
• Intersession variability and variability over time cause speech features to fluctuate
• Use of “filter bank” is common
• Normalization helps remove these variations, but at a price
  – Parameter-Domain normalization
  – Distance/Similarity-Domain normalization
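A filter bank summarizes a short frame of audio as the energy in a handful of frequency bands. The sketch below is deliberately simplified and is an assumption, not the deck's method: it uses equal-width rectangular bands and a naive DFT, whereas practical front ends typically use overlapping, perceptually spaced filters and an FFT.

```python
import cmath
import math
from typing import List


def power_spectrum(frame: List[float]) -> List[float]:
    """Naive DFT power spectrum for bins 0..N/2 (fine for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def filter_bank_energies(frame: List[float], num_bands: int = 4) -> List[float]:
    """Log energy in equal-width frequency bands (a simplified filter bank)."""
    spectrum = power_spectrum(frame)
    width = len(spectrum) // num_bands
    if width == 0:
        raise ValueError("frame too short for the requested number of bands")
    energies = []
    for b in range(num_bands):
        start = b * width
        end = len(spectrum) if b == num_bands - 1 else start + width
        # Small constant avoids log(0) for silent bands.
        energies.append(math.log(sum(spectrum[start:end]) + 1e-10))
    return energies


# A 64-sample frame holding a pure tone at DFT bin 5 (illustrative input).
frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
energies = filter_bank_energies(frame)
print(energies.index(max(energies)))  # index of the band with the most energy
```

Because the tone sits at a low frequency, the lowest band (index 0) carries almost all of the energy.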
Step 2.a. Normalization Techniques
• Parameter-Domain normalization
  – Spectral equalization (i.e. signal processing)
    • Dampens large variations in features by averaging over time, useful for long utterances
    • Removes some speaker specific features
• Distance/Similarity-Domain normalization
  – Various techniques that use probabilities of known speakers that have already been enrolled
    • Useful if you are doing validation
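Both kinds of normalization can be sketched briefly. The first function subtracts each feature's average over the whole utterance, one common way to realize "averaging over time"; the second scores the claimed speaker relative to other enrolled speakers. The function names, the choice of a plain mean over a cohort, and all numbers are assumptions for illustration.

```python
from statistics import fmean
from typing import List


def subtract_long_term_mean(frames: List[List[float]]) -> List[List[float]]:
    """Parameter-domain: remove each coefficient's average over the utterance."""
    dims = len(frames[0])
    means = [fmean(frame[d] for frame in frames) for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]


def cohort_normalized_score(claimed_score: float,
                            cohort_scores: List[float]) -> float:
    """Similarity-domain: claimed speaker's score relative to other enrolled speakers."""
    return claimed_score - fmean(cohort_scores)


# Three frames of two features each (made-up values).
frames = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]
print(subtract_long_term_mean(frames))  # each column now averages to zero

# Log-likelihood-style scores (made-up): claimed speaker vs. three other speakers.
print(cohort_normalized_score(-41.0, [-55.0, -60.0, -50.0]))
```

The mean subtraction removes a constant offset such as a channel effect, but it also removes the speaker's own long-term average, which is the price the slide mentions.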
Step 3. Feature Extraction
• The input utterance is converted to a set of feature vectors
• Time alignment may need to be done
• Calculate similarity between each captured vector and the registered speaker template or model
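One widely used way to do the time alignment is dynamic time warping (DTW), which stretches or compresses one sequence of feature vectors so it lines up with another. The deck does not name a specific alignment method, so treat this as one possible sketch; the function names and the one-dimensional toy features are assumptions.

```python
import math
from typing import List, Sequence


def frame_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a: List[Sequence[float]],
                 seq_b: List[Sequence[float]]) -> float:
    """Total distance along the cheapest monotonic alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i frames of a and j frames of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances, b repeats
                                    cost[i][j - 1],      # b advances, a repeats
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]


template = [[0.0], [1.0], [2.0], [1.0]]
slower = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [1.0]]
print(dtw_distance(template, slower))  # 0.0: same shape, spoken more slowly
```

A lower DTW distance means a closer match, so it would be converted or negated before being used as a similarity score.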
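[Figure: the utterance “Hello” split into overlapping segments (h, he, e, el, l, lo, o) and compared segment by segment with the registered versions. Example: “h” vs. “h” gives .90 similarity, “he” vs. “he” gives .60 similarity, .75 overall.]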
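The .75 overall figure is consistent with a plain average of the two segment scores, (.90 + .60) / 2. That reading is an assumption, since the slide does not state how the overall score is formed; a minimal sketch under it:

```python
from statistics import fmean
from typing import List


def overall_similarity(segment_scores: List[float]) -> float:
    """Combine per-segment similarity scores by an unweighted average."""
    return fmean(segment_scores)


# The two segment scores shown on the slide: "h" -> .90 and "he" -> .60.
print(round(overall_similarity([0.90, 0.60]), 2))  # 0.75
```

Other combinations, such as weighting longer segments more heavily, are equally plausible in a real system.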
Side note : Analyzing speech “ah”
Waveform (Raw acoustic data)
Spectrograph (Frequency vs. Amplitude)
Formant (Continuous peak that crosses frequencies)
Image attributed to Dr. Douglas Roland from lecture notes describing speech recognition.
Step 4. Similarity Matching
• Other pattern classification techniques can be used on the normalized input
• Each speaker gets his/her own HMM, neural network, VQ codebook, etc.
• Another approach is to target specific phonemes or features
  – Example showing the targeting of vowel sounds, in particular the syllable “ah”
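As one concrete case of "each speaker gets his/her own model", a VQ codebook stores a few representative feature vectors per speaker, and an utterance is scored by how far its vectors fall from the nearest codeword. The sketch below assumes the codebooks already exist (training them, for example by clustering enrollment data, is omitted); the names and numbers are invented for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def distortion(features: List[Vector], codebook: List[Vector]) -> float:
    """Average distance from each feature vector to its nearest codeword."""
    return sum(min(math.dist(f, c) for c in codebook) for f in features) / len(features)


def best_speaker(features: List[Vector],
                 codebooks: Dict[str, List[Vector]]) -> str:
    """Pick the speaker whose codebook gives the lowest distortion."""
    return min(codebooks, key=lambda name: distortion(features, codebooks[name]))


# Hypothetical two-codeword codebooks for two enrolled speakers.
codebooks = {
    "alice": [(1.0, 1.0), (2.0, 2.0)],
    "bob": [(8.0, 8.0), (9.0, 7.0)],
}
utterance = [(1.1, 0.9), (2.2, 1.8), (1.9, 2.1)]
print(best_speaker(utterance, codebooks))
```

Here the utterance lies close to the first codebook, so "alice" is selected; lower distortion plays the role of higher similarity.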
Example of Vowel Comparisons
Charts attributed to Pasich, C. Speaker Identification MATLAB files, Connexions Web site. http://cnx.org/content/m14201/1.3/, Feb 16, 2007.
Step 5. Decision/Threshold
• For speaker identification, simply take the registered speaker template with the highest similarity score
• For speaker verification, there needs to be a minimum acceptable similarity score
Conclusion : Why care?
• Speaker recognition will become ubiquitous
  – Cell phone applications – banking, security, logins
  – Forensic analysis (voiceprints)
  – Home automation (know thy user)
  – Google “speaker” search? (You know it’s going to happen!)
References
• Video links
  – MIT, CSAIL. http://www.youtube.com/watch?v=0ec1Gtnlq1k
  – IBM, developerWorks. http://www.youtube.com/watch?v=JJ_YzBaqzAo
• Cole, Ronald A., Editor (1996). Survey of the State of the Art in Human Language Technology. http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html
• Iyer, Manjunath Ramachandra (2007). “Differentially Fed Artificial Neural Networks for Speech Signal Prediction.” In Hector Perez-Meana, Editor. Advances in audio and speech signal processing : technologies and applications (pp. 309-323). Hershey, PA : Idea Group Pub., c2007.
• Lung, Shung-Yung (2007). “Speaker Recognition.” In Hector Perez-Meana, Editor. Advances in audio and speech signal processing : technologies and applications (pp. 371-407). Hershey, PA : Idea Group Pub., c2007.