Deep Learning for Speech Recognition in Cortana at AI NEXT Conference


Transcript

Jinyu Li, Microsoft

} Review the deep learning trends for automatic speech recognition (ASR) in industry
◦ Deep Neural Network (DNN)
◦ Long Short-Term Memory (LSTM)
◦ Connectionist Temporal Classification (CTC)

} Describe selected key technologies that make deep learning models more effective in a production environment

[Figure: a typical ASR pipeline. Input speech s(n) ("Hey Cortana") passes through feature analysis (spectral analysis) to produce feature vectors Xn; pattern classification (decoding, search) combines the acoustic model (HMM), word lexicon, and language model to output the word sequence W, followed by confidence scoring (e.g., 0.9, 0.8).]


} Word sequence: Hey Cortana
} Phone sequence: hh ey k ao r t ae n ax
} Triphone sequence: sil-hh+ey hh-ey+k ey-k+ao k-ao+r ao-r+ae ae-n+ax n-ax+sil
} Every triphone is then modeled by a three-state HMM: sil-hh+ey[1], sil-hh+ey[2], sil-hh+ey[3], hh-ey+k[1], ......, n-ax+sil[3]. The key problem is how to evaluate the state likelihood given the speech signal.

[Figure: the triphone HMM state sequence sil-hh+ey[1] sil-hh+ey[2] sil-hh+ey[3] hh-ey+k[1] ... n-ax+sil[3], aligned against successive speech frames.]
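To make the expansion concrete, here is a minimal Python sketch of the word → phone → triphone → HMM-state expansion. The tiny LEXICON and the helper words_to_states are hypothetical stand-ins, not the production front-end, which uses a full word lexicon and decision-tree-clustered senones.

```python
# A minimal sketch of expanding a word sequence into phones, cross-word
# triphones, and 3-state HMM state names. LEXICON is a toy stand-in.
LEXICON = {"hey": ["hh", "ey"], "cortana": ["k", "ao", "r", "t", "ae", "n", "ax"]}

def words_to_states(words):
    # Word sequence -> phone sequence, padded with silence on both sides.
    phones = ["sil"] + [p for w in words for p in LEXICON[w.lower()]] + ["sil"]
    # Phone sequence -> triphones: left-center+right context labels.
    triphones = [f"{phones[i-1]}-{phones[i]}+{phones[i+1]}"
                 for i in range(1, len(phones) - 1)]
    # Each triphone is modeled by a three-state HMM.
    return [f"{tri}[{s}]" for tri in triphones for s in (1, 2, 3)]

print(words_to_states(["Hey", "Cortana"])[:4])
# ['sil-hh+ey[1]', 'sil-hh+ey[2]', 'sil-hh+ey[3]', 'hh-ey+k[1]']
```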

} ZH-CN (Mandarin Chinese) character error rate was reduced by 32% relative within one year!

[Figure: ZH-CN relative improvement in character error rate (CERR, %, 0-35) across acoustic models (x-axis: GMM MFCC, CE DNN, LFB CE DNN, LFB SE DNN).]

CE: Cross Entropy training; SE: SEquence training

DNNs process speech frames independently

h_t = σ(W_hx x_t + b)
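As a rough illustration (plain NumPy, made-up layer sizes), a DNN layer transforms each frame x_t with no dependence on neighbouring frames:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical sizes: 40-dim filterbank frame, 2048 hidden units.
rng = np.random.default_rng(0)
W_hx = rng.standard_normal((2048, 40)) * 0.01
b = np.zeros(2048)

def dnn_layer(x_t):
    # h_t = sigma(W_hx x_t + b): each frame is transformed on its own,
    # with no access to neighbouring frames.
    return sigmoid(W_hx @ x_t + b)

h_t = dnn_layer(rng.standard_normal(40))   # one 40-dim frame -> 2048-dim h_t
```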

RNNs consider temporal relations across speech frames.


h_t = σ(W_hx x_t + W_hh h_{t-1} + b)

Vulnerable to vanishing and exploding gradients
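A corresponding NumPy sketch of the recurrence (again with made-up sizes); the loop carries the hidden state h across frames, which is where long products of W_hh can make gradients vanish or explode:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical sizes: 40-dim frames, 512 hidden units.
rng = np.random.default_rng(0)
W_hx = rng.standard_normal((512, 40)) * 0.01
W_hh = rng.standard_normal((512, 512)) * 0.01
b = np.zeros(512)

def rnn_forward(frames):
    # h_t = sigma(W_hx x_t + W_hh h_{t-1} + b): the hidden state carries
    # history across frames; backpropagating through many steps multiplies
    # by W_hh repeatedly, which is why gradients can vanish or explode.
    h, outputs = np.zeros(512), []
    for x_t in frames:
        h = sigmoid(W_hx @ x_t + W_hh @ h + b)
        outputs.append(h)
    return outputs

hs = rnn_forward(rng.standard_normal((100, 40)))   # 100 frames of an utterance
```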

Memory cells store the history information

Various gates control the information flow inside LSTM

Advantageous in learning long- and short-term temporal dependencies

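For reference, a minimal NumPy sketch of one LSTM step with the usual input, forget, and output gates and the memory cell; this is the standard textbook cell, not necessarily the exact production variant (which may add peepholes or a projection layer):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_lstm_params(n_in, n_hidden, seed=0):
    # One weight matrix and bias per gate, acting on [x_t, h_{t-1}].
    rng = np.random.default_rng(seed)
    p = {}
    for g in "ifog":
        p["W_" + g] = rng.standard_normal((n_hidden, n_in + n_hidden)) * 0.01
        p["b_" + g] = np.zeros(n_hidden)
    return p

def lstm_cell(x_t, h_prev, c_prev, p):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(p["W_i"] @ z + p["b_i"])   # input gate: what enters the cell
    f = sigmoid(p["W_f"] @ z + p["b_f"])   # forget gate: what history to keep
    o = sigmoid(p["W_o"] @ z + p["b_o"])   # output gate: what to expose
    g = np.tanh(p["W_g"] @ z + p["b_g"])   # candidate cell update
    c_t = f * c_prev + i * g               # memory cell stores the history
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```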

[Figure: WER (%, 0-20) and relative WER reduction of LSTM over DNN on the SMD2015, VS2015, MobileC, Mobile, and Win10C test sets.]

The HMM/GMM or HMM/DNN pipeline is highly complex:
◦ Multiple training stages: CI phone, CD senones, …
◦ Various resources: lexicon, decision tree questions, …
◦ Many hyper-parameters: number of senones, number of Gaussians, …

[Figure: the conventional training pipeline: CI phone models → CD senones, GMM → hybrid DNN/LSTM.]


Beyond the acoustic model, the rest of the pipeline is complex as well:
◦ LM building also requires a large amount of data and a complicated process
◦ Writing an efficient decoder needs experts with years of experience


[Figure: input speech → End-to-End Model → "Hey Cortana".]

} ASR is a sequence-to-sequence learning problem.
} A simpler paradigm with a single model (and training stage) is desired.

} CTC is a sequence-to-sequence learning method used to map speech waveforms directly to characters, phonemes, or even words.

} CTC paths differ from label sequences in that they:
◦ Allow repetitions of non-blank labels
◦ Add the blank ∅ as an additional label, meaning no (actual) label is emitted

} Example: the label sequence A B C expands to paths such as
A A ∅ ∅ B C ∅
∅ A A B ∅ C C
∅ ∅ ∅ A B C ∅
and collapsing any of these paths recovers A B C.

[Figure: a stack of LSTM layers runs over the observation frames X (…, t-1, t, t+1, …); a softmax layer scores the output labels (words) plus the blank ∅, and CTC paths are collapsed to the label sequence z = A B C.]
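A minimal sketch of the collapse operation used above (merge repeated labels, then drop blanks); the blank is represented by None here:

```python
def collapse(path, blank=None):
    """Map a CTC path to its label sequence: merge repeats, then drop blanks."""
    out, prev = [], object()              # sentinel that matches no label
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# All three example paths collapse to the same label sequence A B C.
for path in (["A", "A", None, None, "B", "C", None],
             [None, "A", "A", "B", None, "C", "C"],
             [None, None, None, "A", "B", "C", None]):
    assert collapse(path) == ["A", "B", "C"]
```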

} Directly from speech to text: no language model, no decoder, no lexicon, …

} Reduce runtime cost without accuracy loss

} Adapt to speakers with low footprints

} Reduce accuracy gap between large and small deep networks

} Enable languages with limited training data

[Xue 13]

} The runtime cost of DNN is much larger than that of GMM, which has been fully optimized in product deployment. We need to reduce the runtime cost of DNN in order to ship it.


} We propose a new DNN structure that takes advantage of the low-rank property of the DNN model to compress it

} How to reduce the runtime cost of the DNN? SVD!!!

} The same SVD structure also enables speaker personalization & AM modularization.

A_{m×n} = U_{m×n} Σ_{n×n} V_{n×n}^T,  where Σ_{n×n} = diag(ε_11, ε_22, …, ε_nn) holds the singular values in decreasing order.

Many singular values are close to zero, so keeping only the top k gives the low-rank approximation A_{m×n} ≈ U_{m×k} Σ_{k×k} V_{k×n}^T, i.e. the original layer is replaced by two thin matrices with a k-dimensional linear bottleneck.

} Number of parameters: mn -> mk + nk.
} Runtime cost: O(mn) -> O(mk + nk).
} E.g., m = 2048, n = 2048, k = 192: ~80% runtime cost reduction.

} Singular Value Decomposition
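A quick NumPy sketch of the SVD restructuring and the parameter arithmetic from the slide; the rank k and the way the two thin matrices replace the original layer follow the description above, while the fine-tuning that would normally follow the restructuring is omitted:

```python
import numpy as np

m, n, k = 2048, 2048, 192
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))                 # a trained m x n weight matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Keep only the top-k singular values: A ~= (U_k * s_k) @ Vt_k, i.e. the
# original layer becomes an m x k layer on top of a k x n layer with a
# linear bottleneck in between.
W1 = U[:, :k] * s[:k]                           # m x k
W2 = Vt[:k, :]                                  # k x n
A_lowrank = W1 @ W2

print("parameters:", m * n, "->", m * k + k * n)          # 4194304 -> 786432
print("runtime cost reduction: ~%.0f%%" % (100 * (1 - (m*k + k*n) / (m*n))))
```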

[Figure: frame skipping at runtime for the DNN model and the LSTM model: the acoustic model is evaluated only on a subset of frames (x_{t-1}, x_t, x_{t+1}, …), and the score for a skipped frame is copied from the most recent evaluated frame.]

Split training utterances through frame skipping (sketched below):

x_1 x_2 x_3 x_4 x_5 x_6  →  x_1 x_3 x_5  and  x_2 x_4 x_6

When skipping 1 frame, odd and even frames are picked as separate utterances.

Frame labels are selected accordingly.

[Xue 14]
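A small sketch of the training-side split (assuming a skip factor of 1, i.e. every other frame); the helper split_by_skipping is illustrative, not the production data pipeline:

```python
def split_by_skipping(frames, labels, skip=1):
    """Split one training utterance into skip+1 shorter utterances.
    With skip=1, odd and even frames become separate utterances, and the
    per-frame senone labels are selected the same way."""
    step = skip + 1
    return [(frames[start::step], labels[start::step]) for start in range(step)]

frames = ["x1", "x2", "x3", "x4", "x5", "x6"]
labels = ["s1", "s2", "s3", "s4", "s5", "s6"]
print(split_by_skipping(frames, labels))
# [(['x1', 'x3', 'x5'], ['s1', 's3', 's5']), (['x2', 'x4', 'x6'], ['s2', 's4', 's6'])]
```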

} Speaker personalization with a deep model creates a storage-size issue: it is not practical to store an entire DNN model for each individual speaker during deployment.

} We propose a low-footprint DNN personalization method based on the SVD structure.

Adapting with 100 utterances:

                        Relative WER reduction (%)    Number of parameters (M)
Full-size DNN           0                             30
SVD DNN                 0.36                          7.4
Standard adaptation     18.64                         7.4
SVD adaptation          20.86                         0.26
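One way to realize low-footprint SVD-based personalization, sketched under the assumption that a small k×k matrix (initialized to identity) is inserted into each SVD bottleneck and only those matrices are adapted and stored per speaker; the exact production recipe may differ:

```python
import numpy as np

def add_speaker_matrices(svd_layers):
    """For each SVD-factored layer (W1: m x k, W2: k x n), create a small
    k x k matrix S initialized to identity, so W1 @ S @ W2 == W1 @ W2 before
    adaptation. Only these S matrices are updated and stored per speaker."""
    return [np.eye(W2.shape[0]) for (_, W2) in svd_layers]

def speaker_forward(x, svd_layers, speaker_S):
    # Forward pass with the per-speaker matrices spliced into the bottlenecks
    # (nonlinearities omitted to keep the sketch short).
    for (W1, W2), S in zip(svd_layers, speaker_S):
        x = W1 @ (S @ (W2 @ x))
    return x

# Example with two factored 2048 x 2048 layers and k = 192.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((2048, 192)), rng.standard_normal((192, 2048)))
          for _ in range(2)]
S_per_speaker = add_speaker_matrices(layers)
y = speaker_forward(rng.standard_normal(2048), layers, S_per_speaker)

# Illustrative footprint: with k = 192 and roughly 7 such layers, a speaker
# stores about 7 * 192 * 192 ~= 0.26M parameters instead of the full model.
```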

} SVD matrices are used to reduce the number of DNN parameters and CPU cost.

} Quantization for SSE evaluation is used for single-instruction-multiple-data processing (see the sketch after this list).

} Frame skipping is used to remove the evaluation of some frames.
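As an illustration of the quantization point above, a toy symmetric int8 quantization of a weight matrix; the actual SSE kernels and their exact quantization scheme are not shown here:

```python
import numpy as np

def quantize_int8(W):
    """Toy symmetric linear quantization of a weight matrix to int8, so that
    matrix-vector products can run as packed integer SIMD (e.g., SSE) ops."""
    scale = np.abs(W).max() / 127.0
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

def approx_matvec(W_q, scale, x):
    # A real kernel would do the int8 multiply-accumulates in SIMD registers;
    # here we only show that rescaling recovers an approximation of W @ x.
    return scale * (W_q.astype(np.int32) @ x)
```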

} The industry has strong interest in running DNN systems on devices due to increasingly popular mobile scenarios.

} Even with the technologies mentioned above, the large computational cost is still very challenging due to the limited processing power of devices.

} A common way to fit CD-DNN-HMM on devices is to reduce the DNN model size by
◦ reducing the number of nodes in hidden layers
◦ reducing the number of targets in the output layer

} Better accuracy is obtained if we use the output of the large-size DNN for acoustic likelihood evaluation

} The output of the small-size DNN deviates from that of the large-size DNN, resulting in worse recognition accuracy

} The problem is solved if the small-size DNN can generate output similar to that of the large-size DNN


◦ Use the standard DNN training method to train a large-size teacher DNN using transcribed data

◦ Minimize the KL divergence between the output distributions of the student DNN and the teacher DNN using a large amount of un-transcribed data
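A sketch of the output-distribution (teacher-student) objective: the student minimizes the KL divergence to the teacher's senone posteriors, so un-transcribed frames can be used. The softmax/KL helpers below are illustrative NumPy code, not the production trainer:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def output_distribution_loss(student_logits, teacher_logits, eps=1e-12):
    """KL(teacher || student) averaged over frames: the student is pushed to
    reproduce the teacher's senone posteriors, so these frames need no
    transcription (no hard labels)."""
    p_t = softmax(teacher_logits)          # teacher posteriors = soft targets
    p_s = softmax(student_logits)          # student posteriors
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)
    return kl.mean()
```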

} 2 million parameters for the small-size DNN, compared to 30 million parameters for the teacher DNN.

} The footprint is further reduced to 0.5 million parameters when combined with SVD.

[Figure: accuracy of the teacher DNN trained with standard sequence training, the small-size DNN trained with standard sequence training, and the student DNN trained with output distribution learning.]

[Huang 13]

} Develop a new language in a new scenario with a small amount of training data.

} Leverage the resource-rich languages to develop high-quality ASR for resource-limited languages.

[Figure: multilingual/transfer DNN. The input layer is a window of acoustic feature frames from the new language (training or testing samples); many hidden layers form a shared feature transformation; the output layer contains the new-language senones.]
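A sketch of the transfer recipe implied by the figure: reuse the hidden layers trained on resource-rich languages as a shared feature transformation and attach a freshly initialized senone output layer for the new language. Function and parameter names here are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def attach_new_output_layer(shared_layers, num_new_senones, seed=0):
    """shared_layers: list of (W, b) hidden layers trained on resource-rich
    languages, kept as the shared feature transformation. A fresh softmax
    layer over the new language's senones is added and trained on the
    limited new-language data."""
    rng = np.random.default_rng(seed)
    top_dim = shared_layers[-1][0].shape[0]
    W_out = rng.standard_normal((num_new_senones, top_dim)) * 0.01
    return W_out, np.zeros(num_new_senones)

def forward(x, shared_layers, W_out, b_out):
    for W, b in shared_layers:             # shared hidden layers
        x = sigmoid(W @ x + b)
    return W_out @ x + b_out               # new-language senone scores
```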

[Figure: relative error reduction (%, 0-25) versus the amount of new-language training data: 3 hrs, 9 hrs, 36 hrs, 139 hrs.]