Jinyu Li, Microsoft
} Review the deep learning trends for automatic speech recognition (ASR) in industry
◦ Deep Neural Network (DNN)
◦ Long Short-Term Memory (LSTM)
◦ Connectionist Temporal Classification (CTC)
} Describe selected key technologies that make deep learning models more effective in a production environment
[Diagram: ASR pipeline. Input speech s(n) ("Hey Cortana") → Feature Analysis (Spectral Analysis) → features X_n → Pattern Classification (Decoding, Search), which uses the Acoustic Model (HMM), Word Lexicon, and Language Model → recognized words W → Confidence Scoring (e.g., 0.9, 0.8)]
} Word sequence: Hey Cortana
} Phone sequence: hh ey k ao r t ae n ax
} Triphone sequence: sil-hh+ey hh-ey+k ey-k+ao k-ao+r ao-r+ae ae-n+ax n-ax+sil
} Every triphone is then modeled by a three-state HMM: sil-hh+ey[1], sil-hh+ey[2], sil-hh+ey[3], hh-ey+k[1], ......, n-ax+sil[3]. The key problem is how to evaluate the state likelihood given the speech signal.
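The expansion above can be sketched in a few lines. This is an illustrative helper, not the production front-end; the phone sequence is taken from the slide, and the function names are mine.

```python
# Hypothetical sketch: expand a phone sequence into cross-word triphones,
# then into 3-state HMM state names, as described on the slide.

def to_triphones(phones):
    """Wrap the phone sequence in silence and build left-center+right triphones."""
    padded = ["sil"] + phones + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

def to_states(triphones, n_states=3):
    """Model every triphone with a three-state HMM."""
    return [f"{tri}[{s}]" for tri in triphones for s in range(1, n_states + 1)]

phones = "hh ey k ao r t ae n ax".split()
tris = to_triphones(phones)
states = to_states(tris)
print(tris[0], tris[-1])   # sil-hh+ey ... n-ax+sil
print(len(states))         # 9 triphones x 3 states = 27
```

Each state name such as `sil-hh+ey[2]` is a classification target whose likelihood the acoustic model must evaluate per frame.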
[Diagram: HMM state sequence sil-hh+ey[1] sil-hh+ey[2] sil-hh+ey[3] hh-ey+k[1] ... n-ax+sil[3]]
} ZH-CN is improved by 32% within one year!
[Bar chart: ZH-CN relative improvement (CERR, y-axis 0–35%) across systems: GMM, MFCC CE DNN, LFB CE DNN, LFB SE DNN]
CE: Cross-Entropy training; SE: SEquence training
DNNs process speech frames independently
$h_t = \sigma(W_{xh} x_t + b)$
RNNs consider temporal relations across speech frames:
$h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b)$
Vulnerable to vanishing and exploding gradients
Memory cells store history information
Various gates control the information flow inside the LSTM
Advantageous for learning long- and short-term temporal dependencies
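A minimal numpy sketch of one LSTM step may make the gating concrete. This is the standard LSTM formulation; dimensions are chosen arbitrarily for illustration and are not the slides' actual model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the input/forget/output/candidate params."""
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates control info flow
    c_t = f * c_prev + i * np.tanh(g)              # memory cell stores history
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):   # a 5-frame "utterance"
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape)  # (3,)
```

The additive cell update `c_t = f * c_prev + i * tanh(g)` is what mitigates the vanishing-gradient problem of plain RNNs.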
[Bar chart: relative WER reduction of LSTM over DNN (WER axis 0–20) on the SMD2015, VS2015, MobileC, Mobile, and Win10C test sets]
The HMM/GMM or HMM/DNN pipeline is highly complex
◦ Multiple training stages: CI phone, CD senones, ...
◦ Various resources: lexicon, decision-tree questions, ...
◦ Many hyper-parameters: number of senones, number of Gaussians, ...
[Diagram: staged training, CI Phone → CD Senone; GMM → hybrid DNN/LSTM]
LM building also requires large amounts of data and a complicated process
Writing an efficient decoder requires experts with years of experience
End-to-End Model
"Hey Cortana"
} ASR is a sequence-to-sequence learning problem.
} A simpler paradigm with a single model (and training stage) is desired.
} CTC is a sequence-to-sequence learning method that maps speech waveforms directly to characters, phonemes, or even words
} CTC paths differ from the label sequence in that:
◦ Repetitions of non-blank labels are allowed
◦ A blank label ∅ is added, meaning no (actual) label is emitted
Label sequence z: A B C
Example paths (expand: label sequence → paths; collapse: paths → label sequence):
A A ∅ ∅ B C ∅
∅ A A B ∅ C C
∅ ∅ ∅ A B C ∅
[Diagram: LSTM layers over observation frames X at t-1, t, t+1, with a softmax over labels plus the blank ∅]
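The CTC collapsing rule sketched above (merge repeated labels, then drop blanks) is easy to write out. A hedged sketch, with label names taken from the slide's example:

```python
# CTC collapse: merge consecutive repeats of non-blank labels, remove blanks.
BLANK = "∅"

def collapse(path):
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out

# All three expanded paths from the slide collapse to the same label sequence.
paths = [
    ["A", "A", BLANK, BLANK, "B", "C", BLANK],
    [BLANK, "A", "A", "B", BLANK, "C", "C"],
    [BLANK, BLANK, BLANK, "A", "B", "C", BLANK],
]
for p in paths:
    print(collapse(p))  # ['A', 'B', 'C'] each time
```

Note that a blank between two identical labels (e.g. `A ∅ A`) keeps them distinct after collapsing, which is why the blank is needed for repeated labels in the target sequence.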
} Directly from speech to text: no language model, no decoder, no lexicon, ...
} Reduce runtime cost without accuracy loss
} Adapt to speakers with low footprints
} Reduce accuracy gap between large and small deep networks
} Enable languages with limited training data
[Xue13]
} The runtime cost of DNN is much larger than that of GMM, which has been fully optimized in product deployment. We need to reduce the runtime cost of DNN in order to ship it.
} How to reduce the runtime cost of DNN? SVD!!!
} We propose a new DNN structure that exploits the low-rank property of the DNN model to compress it
} Also enables speaker personalization & AM modularization
$A_{m \times n} = U_{m \times n} \Sigma_{n \times n} V_{n \times n}^{T} =
\begin{pmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{m1} & \cdots & u_{mn} \end{pmatrix}
\begin{pmatrix} \epsilon_{11} & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & \epsilon_{kk} & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & \epsilon_{nn} \end{pmatrix}
\begin{pmatrix} v_{11} & \cdots & v_{1n} \\ \vdots & \ddots & \vdots \\ v_{n1} & \cdots & v_{nn} \end{pmatrix}$
} Number of parameters: mn → mk + nk
} Runtime cost: O(mn) → O(mk + nk)
} E.g., m = 2048, n = 2048, k = 192: ~80% runtime cost reduction
} Singular Value Decomposition
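The compression step can be sketched with numpy's SVD: keep the top-k singular values and replace one m×n layer with two smaller layers. A random matrix stands in for a real trained weight matrix; m, n, k follow the slide's example.

```python
import numpy as np

m, n, k = 2048, 2048, 192
rng = np.random.default_rng(0)
A = rng.normal(size=(m, n))            # stand-in for a trained DNN weight matrix

# Truncated SVD: A ~= W1 @ W2, with W1 = U_k * s_k and W2 = V_k^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
W1 = U[:, :k] * s[:k]                  # m x k layer
W2 = Vt[:k, :]                         # k x n layer

full_params = m * n
svd_params = m * k + k * n
print(1 - svd_params / full_params)    # 0.8125 -> roughly 80% fewer parameters/ops
```

In practice (per the slides' approach) the factored model is retrained afterwards to recover the accuracy lost by truncation.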
[Diagram: for both the DNN and LSTM models, the output computed for frame x_t is copied to the skipped neighboring frames x_{t-1}, x_{t+1}]
Split training utterances through frame skipping
[Diagram: x1 x2 x3 x4 x5 x6 split into x1 x3 x5 and x2 x4 x6]
When skipping 1 frame, odd and even frames are picked as separate utterances
Frame labels are selected accordingly
[Xue 14]
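The frame-skipping split above can be sketched as follows; the helper name and the string stand-ins for frames/labels are mine.

```python
# With skip=1, odd and even frames become two separate training utterances,
# and frame labels are selected accordingly.

def split_by_skipping(frames, labels, skip=1):
    step = skip + 1
    return [(frames[i::step], labels[i::step]) for i in range(step)]

frames = ["x1", "x2", "x3", "x4", "x5", "x6"]
labels = ["l1", "l2", "l3", "l4", "l5", "l6"]
for f, l in split_by_skipping(frames, labels):
    print(f, l)
# ['x1', 'x3', 'x5'] ['l1', 'l3', 'l5']
# ['x2', 'x4', 'x6'] ['l2', 'l4', 'l6']
```

No training frames are discarded; the utterance is merely re-partitioned, which halves per-utterance sequence length while keeping all supervision.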
} Speaker personalization with a deep model creates a storage-size issue: it is not practical to store an entire deep model for each individual speaker during deployment.
} We propose a low-footprint DNN personalization method based on the SVD structure.
[Chart: adapting with 100 utterances — relative WER reduction and number of parameters (M) for the full-size DNN, SVD DNN, standard adaptation, and SVD adaptation]
} SVD matrices are used to reduce the number of DNN parameters and CPU cost.
} Quantization for SSE evaluation is used for single instruction multiple data processing.
} Frame skipping is used to remove the evaluation of some frames.
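The quantization bullet can be illustrated with a minimal int8 scheme: map float weights to 8-bit integers with a per-matrix scale so that SIMD instructions can process many weights at once. This is a hedged sketch; the production SSE kernels are more involved than this.

```python
import numpy as np

def quantize(W):
    """Symmetric per-matrix int8 quantization with a single float scale."""
    scale = np.abs(W).max() / 127.0
    q = np.round(W / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)).astype(np.float32)
q, scale = quantize(W)
err = np.abs(W - dequantize(q, scale)).max()
print(q.dtype, bool(err <= 0.5 * scale))   # int8, error bounded by half a step
```

Storing int8 instead of float32 also cuts the weight memory by 4x, on top of the SIMD throughput gain.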
} The industry has a strong interest in running DNN systems on devices due to increasingly popular mobile scenarios.
} Even with the technologies mentioned above, the large computational cost is still very challenging given the limited processing power of devices.
} A common way to fit a CD-DNN-HMM on devices is to reduce the DNN model size by
◦ reducing the number of nodes in hidden layers
◦ reducing the number of targets in the output layer
} Better accuracy is obtained if we use the output of the large-size DNN for acoustic likelihood evaluation
} The output of the small-size DNN deviates from that of the large-size DNN, resulting in worse recognition accuracy
} The problem is solved if the small-size DNN can generate output similar to that of the large-size DNN
◦ Use the standard DNN training method to train a large-size teacher DNN on transcribed data
◦ Minimize the KL divergence between the output distributions of the student DNN and the teacher DNN using a large amount of un-transcribed data
} 2 million parameters for the small-size DNN, compared to 30 million parameters for the teacher DNN.
} The footprint is further reduced to 0.5 million parameters when combined with SVD.
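The teacher-student objective can be written out directly. A hedged sketch: the student minimizes the per-frame KL divergence from the teacher's senone posteriors; shapes and random logits here are purely illustrative.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_divergence(teacher_logits, student_logits):
    """Mean per-frame KL(p_teacher || p_student) over senone posteriors."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    p = np.exp(log_p)
    return (p * (log_p - log_q)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
teacher = rng.normal(size=(10, 6))               # 10 frames, 6 "senones"
student = teacher + 0.1 * rng.normal(size=(10, 6))

print(kl_divergence(teacher, teacher))           # 0.0 -- identical distributions
print(kl_divergence(teacher, student) > 0)       # True -- any mismatch is penalized
```

Because the target is the teacher's distribution rather than a hard transcript label, the loss needs no transcription, which is why un-transcribed data can be used for this step.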
[Chart: accuracy of the teacher DNN (standard sequence training), the small-size DNN (standard sequence training), and the student DNN (output-distribution learning)]
[Huang 13]
} Develop a new language in new scenario with small amount of training data.
} Leverage resource-rich languages to develop high-quality ASR for resource-limited languages.
[Diagram: multilingual DNN. Input layer: a window of acoustic feature frames → many hidden layers (shared feature transformation) → output layer: new-language senones, trained and tested on new-language samples]
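The transfer scheme in the diagram can be sketched as follows: keep the hidden layers learned from resource-rich languages and attach a fresh output layer for the new language's senones. All layer sizes and the ReLU choice are illustrative assumptions, not details from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared hidden layers, assumed already trained on resource-rich languages.
hidden = [rng.normal(size=(512, 512)) for _ in range(5)]

# New language: reuse the hidden layers, attach a freshly initialized output
# layer sized for the new language's senone set (then fine-tune in practice).
new_senones = 3000
new_output = 0.01 * rng.normal(size=(new_senones, 512))

def forward(x, hidden_layers, output_layer):
    h = x
    for W in hidden_layers:
        h = np.maximum(W @ h, 0.0)     # ReLU hidden layers (illustrative)
    return output_layer @ h            # senone scores for the new language

x = rng.normal(size=512)               # one window of acoustic features
print(forward(x, hidden, new_output).shape)  # (3000,)
```

Only the output layer (and optionally a light fine-tune of the shared stack) must be learned from the limited new-language data, which is what makes the 3-hour and 9-hour conditions in the next chart viable.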
[Bar chart: relative error reduction (0–25%) on the new language with 3, 9, 36, and 139 hours of training data]