27th, February 2004 Presentation for the speech recognition system
An overview of the SPHINX Speech Recognition System
Jie Zhou, Zheng Gong
Lingli Wang, Tiantian Ding
M.Sc in CMHE
Spoken Language Processing Module
Presentation of the speech recognition system
27th February 2004
Abstract
SPHINX is a system that demonstrates the feasibility of accurate, large-vocabulary, speaker-independent, continuous speech recognition.
SPHINX is based on discrete hidden Markov models (HMMs) with LPC-derived parameters.
It was designed to:
• provide speaker independence
• deal with coarticulation in continuous speech
• adequately represent a large vocabulary
SPHINX attained word accuracies of 71, 94, and 96 percent on a 997-word task.
Introduction
SPHINX is a system that tries to overcome three constraints:
1) Speaker dependence
2) Isolated words
3) Small vocabulary
Introduction
Speaker independence
• The system has to be trained on less appropriate training data
• Much more data can be acquired, which may compensate for the less appropriate training material
Difficulties of continuous speech recognition
• Word boundaries are difficult to locate
• Coarticulatory effects are much stronger in continuous speech
• Content words are often emphasized, while function words are poorly articulated
Large vocabulary
• 1000 words or more
Introduction
To improve speaker independence
• Present additional knowledge through the use of multiple vector-quantized codebooks
• Enhance the recognizer with carefully designed models and word duration modeling
To deal with coarticulation in continuous speech
• Function-word-dependent phone models
• Generalized triphone models
SPHINX achieved speaker-independent word recognition accuracies of 71, 94, and 96 percent on the 997-word DARPA resource management task with grammars of perplexity 997, 60, and 20, respectively.
The baseline SPHINX system
This system uses standard HMM techniques
Speech processing
• Sample rate: 16 kHz
• Frame span: 20 ms, with consecutive frames overlapping by 10 ms
• Each frame is multiplied by a Hamming window
• The LPC coefficients are computed
• 12 LPC-derived cepstral coefficients are obtained
• The 12 LPC cepstral coefficients are vector quantized into one of 256 prototype vectors
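The front-end steps above can be sketched in a few lines. This is only an illustrative sketch, not the SPHINX implementation: the LPC and cepstrum computation is left out, the function names are hypothetical, and the squared Euclidean distance used for vector quantization is an assumption (the slides do not name the distance measure).

```python
import math

SAMPLE_RATE = 16000   # 16 kHz, as stated on the slide
FRAME_LEN = 320       # 20 ms at 16 kHz
FRAME_SHIFT = 160     # 10 ms hop, so consecutive frames overlap by 10 ms


def hamming(n):
    """Return an n-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def windowed_frames(samples):
    """Split samples into overlapping frames and apply a Hamming window to each."""
    window = hamming(FRAME_LEN)
    frames = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_SHIFT):
        frame = samples[start:start + FRAME_LEN]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


def quantize(vector, codebook):
    """Return the index of the prototype nearest to vector.

    Squared Euclidean distance is an assumption made for this sketch.
    """
    def dist(proto):
        return sum((v - p) ** 2 for v, p in zip(vector, proto))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))
```

In the system described here the codebook would hold 256 prototypes of 12 cepstral coefficients each, so every 10 ms of speech is reduced to a single symbol between 0 and 255.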
Task and Database
The resource management task
• SPHINX was evaluated on the DARPA resource management task
• Three different grammars are used with SPHINX
  • Null grammar (perplexity 997)
  • Word-pair grammar (perplexity 60)
  • Bigram grammar (perplexity 20)
The TIRM database
• 80 “training” speakers
• 40 “development test” speakers
• 40 “evaluation” speakers
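Perplexity can be read as the average number of words the grammar allows at each point. The short sketch below (the function name is hypothetical) shows why the null grammar has perplexity 997: with no grammar, any of the 997 vocabulary words may follow, each treated as equally likely.

```python
import math


def perplexity(next_word_probs):
    """Return 2 ** entropy of a probability distribution over the next word."""
    entropy = -sum(p * math.log2(p) for p in next_word_probs if p > 0)
    return 2 ** entropy


# Null grammar: every one of the 997 words is equally likely to come next.
null_grammar = [1.0 / 997] * 997
print(round(perplexity(null_grammar)))
```

The lower perplexities of the word-pair and bigram grammars mean that far fewer words compete at each position, which makes the recognition task easier.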
Task and Database
Phonetic hidden Markov models
• HMMs are parametric models particularly suitable for describing speech events
• Each HMM represents a phone
• A total of 46 phones are used for English
• Each model consists of:
  {s}: a set of states
  {a_ij}: a set of transitions, where a_ij is the probability of transition from state i to state j
  {b_ij(k)}: the output probability matrix, where b_ij(k) is the probability of emitting symbol k on the transition from state i to state j
• The topology of the phonetic HMMs is shown in the figure
Phonetic HMM topology
[Figure: phonetic HMM topology]
Task and Database
Training
• A set of 46 phone models was used to initialize the parameters
• The forward-backward algorithm was run on the resource management training sentences
• For each sentence, a sentence model was created from word models, which were in turn concatenated from phone models
• The trained transition probabilities are used directly in recognition
• The output probabilities are smoothed with a uniform distribution
• The SPHINX recognition search is a standard time-synchronous Viterbi beam search
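A time-synchronous Viterbi beam search can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the SPHINX search: outputs are attached to states rather than to transitions, there is no grammar or word network, and the names and the `beam` default are hypothetical. At each frame all surviving hypotheses are extended together, and any hypothesis scoring too far below the best one is dropped.

```python
import math


def prune(hyps, beam):
    """Keep only hypotheses whose log score is within `beam` of the best one."""
    if not hyps:
        return hyps
    best = max(score for score, _ in hyps.values())
    return {s: h for s, h in hyps.items() if h[0] >= best - beam}


def viterbi_beam(obs, states, log_init, log_trans, log_emit, beam=10.0):
    """Return (best log score, best state path) for the observation sequence.

    log_init[s], log_trans[s][t] and log_emit[s][o] hold log probabilities;
    missing entries are treated as probability zero.
    """
    active = {}
    for s in states:
        score = log_init.get(s, -math.inf) + log_emit[s].get(obs[0], -math.inf)
        if score > -math.inf:
            active[s] = (score, [s])
    active = prune(active, beam)
    for o in obs[1:]:
        nxt = {}
        for s, (score, path) in active.items():
            for t, lt in log_trans.get(s, {}).items():
                cand = score + lt + log_emit[t].get(o, -math.inf)
                # Viterbi step: keep only the best way of reaching state t.
                if cand > -math.inf and (t not in nxt or cand > nxt[t][0]):
                    nxt[t] = (cand, path + [t])
        active = prune(nxt, beam)
    if not active:
        return (-math.inf, [])
    best_state = max(active, key=lambda s: active[s][0])
    return active[best_state]
```

A narrower beam makes the search faster but risks pruning the path that would have turned out best, so the beam width is a speed versus accuracy trade-off.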
Task and Database
The results with the baseline SPHINX system, using 15 new speakers with 10 sentences each for evaluation, are shown in Table I.
The baseline system is inadequate for any realistic large-vocabulary application without incorporating knowledge and contextual modeling.
Adding knowledge to SPHINX
• Fixed-width speech parameters
• Lexical/phonological improvements
• Word duration modeling
• Results
Fixed-Width Speech Parameters
• Bilinear transform on the cepstrum coefficients
• Differenced cepstrum coefficients
• Power and differenced power
• Integrating fixed-width parameters in multiple codebooks
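Differenced coefficients describe how the cepstrum changes over time. The sketch below is illustrative only: the `span` of 2 frames on each side, the edge clamping, and the function names are assumptions, and the second function assumes the codebooks are treated as independent so that their output probabilities multiply.

```python
def differenced(frames, span=2):
    """For each frame t, return frames[t + span] - frames[t - span].

    Indices are clamped at the first and last frame (an assumption of this sketch).
    """
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        ahead = frames[min(t + span, last)]
        behind = frames[max(t - span, 0)]
        out.append([a - b for a, b in zip(ahead, behind)])
    return out


def joint_log_output_prob(log_probs_per_codebook):
    """Combine per-codebook log output probabilities, assuming independence."""
    return sum(log_probs_per_codebook)
```

With separate codebooks, each frame yields one symbol per codebook instead of a single symbol, and each codebook only has to cover one kind of feature.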
Lexical/Phonological Improvements
This set of improvements involved the modification of the set of phones and the pronunciation dictionary. These changes lead to more accurate assumptions about how words are articulated, without changing our assumption that each word has a single pronunciation.
The first step we took was to replace the baseform pronunciation with the most likely pronunciation.
In order to improve the appropriateness of the word pronunciation dictionary, a small set of rules was created to
• modify closure-stop pairs into optional compound phones when appropriate
• modify /t/’s and /d/’s into /dx/ when appropriate
• reduce nasal /t/’s when appropriate
• perform other mappings such as /t s/ to /ts/.
Finally, there is the issue of what HMM topology is optimal for phones in general, and what topology is optimal for each phone.
Word Duration Modeling
HMMs model the duration of events with transition probabilities, which lead to a geometric distribution for the duration of state residence.
We incorporated word duration into SPHINX as a part of the Viterbi search. The duration of a word is modelled by a univariate Gaussian distribution, with the mean and variance estimated from a supervised Viterbi segmentation of the training set.
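The two duration models mentioned above can be compared with a small sketch. The function names are hypothetical, and the way the duration score is combined with the acoustic score (a simple weighted sum in the log domain, with a `weight` parameter) is an assumption; the slides only say that duration is incorporated into the Viterbi search.

```python
import math


def geometric_duration_prob(self_loop_prob, d):
    """P(staying exactly d frames in a state) = a ** (d - 1) * (1 - a)."""
    return self_loop_prob ** (d - 1) * (1.0 - self_loop_prob)


def estimate_gaussian(durations):
    """Mean and variance of word durations (in frames) from a segmentation."""
    mean = sum(durations) / len(durations)
    var = sum((d - mean) ** 2 for d in durations) / len(durations)
    return mean, var


def gaussian_log_density(d, mean, var):
    """Log of the univariate Gaussian density at duration d."""
    return -0.5 * (math.log(2 * math.pi * var) + (d - mean) ** 2 / var)


def word_end_score(acoustic_log_score, d, mean, var, weight=1.0):
    """Hypothetical combination: add a weighted duration score when a word ends."""
    return acoustic_log_score + weight * gaussian_log_density(d, mean, var)
```

The geometric distribution always favours the shortest duration, whereas the Gaussian peaks at the typical length of the word, so implausibly short or long word hypotheses are penalized.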
Results
We have presented various strategies for adding knowledge to SPHINX.
Consistent with earlier results, we found that bilinear transformed coefficients improved the recognition rates. An even greater improvement came from the use of differenced coefficients, power, and differenced power in three separate codebooks.
Next, we enhanced the dictionary and the phone set, a step that led to an appreciable improvement.
Finally, the addition of durational information significantly improved SPHINX’s accuracy when no grammar was used, but was not helpful with a grammar.
Context Modeling in SPHINX
• Previously proposed units of speech
• Function-word-dependent phones
• Generalized triphones
• Smoothing detailed models
Previously Proposed Units of Speech
• Because of the lack of sharing across words, word models are not practical for large-vocabulary speech recognition
• In order to improve trainability, some subword unit has to be used
• Word-dependent phones: a compromise between word modeling and phone modeling
• Context-dependent phones (triphone models): instead of modeling phone-in-word, they model phone-in-context
Function-Word-Dependent Phones
• Function words are particularly problematic in continuous speech recognition since they are typically unstressed
• The phones in function words are distorted
• Function-word-dependent phones are the same as word-dependent phones, except they are only used for function words
Generalized Triphones
• Triphone models are sparsely trained and consume substantial memory
• Combining similar triphones improves trainability and reduces memory storage
• Generalized triphones are created by merging contexts with an agglomerative clustering procedure
• To determine the similarity between two models, we use the following distance metric:
[Equation: distance metric between two models]
Generalized Triphones
• In measuring the distance between the two models, we only consider the output probabilities and ignore the transition probabilities, which are of secondary importance
• This context generalization algorithm provides an ideal means for finding the equilibrium between trainability and sensitivity
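The clustering idea can be sketched as follows. The metric used here is an assumption made for illustration: a log likelihood ratio computed from codeword counts, which measures how much likelihood is lost when two models are replaced by one merged model. It uses only output statistics, in line with the point above, but it should not be read as the exact formula from the slide. Model names and counts are made up.

```python
import math


def log_likelihood(counts):
    """Log likelihood of codeword counts under their own relative frequencies."""
    total = sum(counts)
    return sum(n * math.log(n / total) for n in counts if n > 0)


def merge_distance(counts_a, counts_b):
    """Log likelihood lost by merging models a and b (0 if they are identical)."""
    merged = [x + y for x, y in zip(counts_a, counts_b)]
    return log_likelihood(counts_a) + log_likelihood(counts_b) - log_likelihood(merged)


def cluster(models, target):
    """Greedily merge the closest pair until only `target` models remain.

    models maps a (hypothetical) triphone name to its list of codeword counts.
    """
    models = {name: list(c) for name, c in models.items()}
    while len(models) > max(target, 1):
        names = sorted(models)
        a, b = min(
            ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
            key=lambda pair: merge_distance(models[pair[0]], models[pair[1]]),
        )
        # Pool the counts of the two closest models into one generalized model.
        models[a + "+" + b] = [x + y for x, y in zip(models.pop(a), models.pop(b))]
    return models
```

Stopping the merging early keeps many specific models, while merging further gives fewer, better-trained ones, which is the trade-off between sensitivity and trainability described above.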
Smoothing Detailed Models
• Detailed models are accurate, but are less robust since many output probabilities will be zero, which can be disastrous to recognition
• The remedy is to combine these detailed models with other, more robust ones
• An ideal solution for weighting different estimates of the same event is deleted interpolated estimation
• Procedure to combine the detailed models and robust models
• Using the uniform distribution to smooth the distribution
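The combination step can be sketched as a weighted sum of a detailed distribution, a more robust one, and a uniform one. The weight estimation below is a simplified illustration: it runs EM on a single held-out block of counts, whereas deleted interpolation rotates the held-out block through the training data. All names and numbers are hypothetical.

```python
def interpolate(distributions, weights):
    """Weighted sum of several estimates of the same output distribution."""
    size = len(distributions[0])
    return [sum(w * d[k] for w, d in zip(weights, distributions)) for k in range(size)]


def estimate_weights(distributions, heldout_counts, iterations=20):
    """Re-estimate interpolation weights by EM on held-out codeword counts."""
    m = len(distributions)
    weights = [1.0 / m] * m
    for _ in range(iterations):
        expected = [0.0] * m
        for k, n in enumerate(heldout_counts):
            mix = sum(w * d[k] for w, d in zip(weights, distributions))
            if n == 0 or mix == 0.0:
                continue
            # Credit each estimate with its share of explaining symbol k.
            for j in range(m):
                expected[j] += n * weights[j] * distributions[j][k] / mix
        norm = sum(expected)
        if norm == 0.0:
            break
        weights = [e / norm for e in expected]
    return weights


detailed = [0.5, 0.5, 0.0, 0.0]    # sharp, but with zero entries
robust = [0.4, 0.3, 0.2, 0.1]      # e.g. a context-independent model
uniform = [0.25, 0.25, 0.25, 0.25]
weights = estimate_weights([detailed, robust, uniform], [3, 3, 2, 2])
smoothed = interpolate([detailed, robust, uniform], weights)
```

Because the uniform distribution is part of the mixture, the smoothed distribution has no zero entries, so a single unseen symbol can no longer rule out an otherwise good hypothesis.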
Entire training procedure
The summary of the entire training procedure is illustrated in Figure 2.
Summary of Results
The six versions correspond to the following descriptions with incremental improvements:
• the baseline system, which uses only LPC cepstral parameters in one codebook;
• the addition of differenced LPC cepstral coefficients, power, and differenced power in one codebook;
• all four feature sets were used in three separate codebooks;
• tuning of phone models and the pronunciation dictionary, and the use of word duration modelling;
• function-word-dependent phone modelling;
• generalized triphone modelling.
Results of five versions of SPHINX
Conclusion
• Given a fixed amount of training data, model specificity and model trainability pose two incompatible goals
• More specificity usually reduces trainability, and increased trainability usually results in overgenerality
• Our work lies in finding an equilibrium between specificity and trainability
Conclusion
• To improve trainability, we used one of the largest speaker-independent speech databases
• To facilitate sharing between models, we used deleted interpolation to combine robust models with detailed ones
• We improved trainability through sharing by combining poorly trained models with well-trained models
• To improve specificity, we used multiple codebooks of various LPC-derived features and integrated external knowledge sources into the system
• We improved the phone set to include multiple representations of some phones, and introduced the use of function-word-dependent phone modeling and generalized triphone modeling
Reference
An Overview of the SPHINX Speech Recognition System, Kai-Fu Lee, Member, IEEE, Hsiao-Wuen Hon, and Raj Reddy, Fellow, IEEE, 1989
The SPHINX Speech Recognition System, Kai-Fu Lee, Hsiao-Wuen Hon, Mei-Yuh Hwang, Sanjoy Mahajan, Raj Reddy, 1989
Thank you very much!