HYDRA: A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR
Jungsuk Kim, Jike Chong, Ian Lane
Electrical and Computer Engineering, Carnegie Mellon University Silicon Valley
March 21, 2013 @ GTC 2013
Carnegie Mellon University
Overview
• Introduction
• Speech Recognition with Weighted Finite State Transducers (WFSTs)
• GPU-Accelerated Speech Recognition
• On-The-Fly Hypothesis Rescoring
• Evaluation Results
• Demonstration Video
Introduction
• Voice interfaces are a core technology for user interaction
  • Mobile devices, Smart TVs, in-vehicle systems, …
• For a captivating user experience, a Voice UI must be:
  • Robust
    • Acoustic robustness → large acoustic models
    • Linguistic robustness → large-vocabulary recognition
  • Responsive
    • Low latency → faster-than-real-time search
  • Adaptive
    • User and task adaptation
A new speech recognition architecture is required.
Introduction
Speech recognition contains many highly parallel tasks
+ GPU processors are optimized for parallel computing
= HYDRA, an ASR engine designed specifically for GPUs
Speech Recognition with Weighted Finite State Transducers (WFSTs)
[Figure: WFST search graph. Phoneme arcs r-eh-k-ax-g-n-ay-z and s-p-iy-ch spell out "Recognize Speech"; alternative paths r-eh-k ("Wreck"), n-ay-s ("Nice"), and b-iy-ch ("Beach") compete in the same graph.]
The search space in speech recognition is represented as a WFST graph that encodes acoustic, phonetic, and linguistic constraints.
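Such a search graph can be sketched as a tiny weighted transducer. The class and labels below are a hypothetical illustration, not HYDRA's data structures:

```python
from collections import defaultdict

# Minimal sketch of a WFST search graph like the one on this slide.
# Arcs carry an input label (phoneme), an output label (word, emitted at
# word boundaries), and a weight (a negative log probability).
class WFST:
    def __init__(self):
        # state -> list of (in_label, out_label, weight, next_state)
        self.arcs = defaultdict(list)

    def add_arc(self, src, in_label, out_label, weight, dst):
        self.arcs[src].append((in_label, out_label, weight, dst))

g = WFST()
# The shared prefix "r eh k" can end as the word "Wreck" or continue
# toward "Recognize" -- both hypotheses live in the same graph:
g.add_arc(0, "r", None, 0.0, 1)
g.add_arc(1, "eh", None, 0.0, 2)
g.add_arc(2, "k", "Wreck", 2.3, 3)   # word boundary: emit "Wreck"
g.add_arc(2, "k", None, 0.0, 4)      # ...or keep matching "Recognize"
g.add_arc(4, "ax", None, 0.0, 5)
```

The linguistic constraints enter the graph as arc weights; composing and optimizing the acoustic, phonetic, and language-model transducers into one graph is what makes single-pass search practical.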
Speech Recognition with Weighted Finite State Transducers (WFSTs)
Search is performed in 3 phases, repeated for every input frame:
• Phase 0: Prepare Active Set — gather active states (speech recognition hypotheses) from the previous frame.
• Phase 1: Compute Observation Probability — compute the likelihood of the phonetic models (Gaussian Mixture Models) for the current frame.
• Phase 2: WFST Search — frame-synchronous Viterbi search is performed on the WFST network.
[Animation: the three phases repeat frame by frame; as phonemes r, eh, k, … are consumed, hypotheses advance through the graph and unlikely paths (e.g. "Wreck") are pruned.]
The backtrack from the Viterbi search yields the speech recognition hypothesis.
Speech Recognition with Weighted Finite State Transducers (WFSTs)
[Diagram: CPU decoding pipeline. Initialize data structures → per-frame loop under iteration control {Phase 0: Prepare Active Set → Phase 1: Compute Observation Probabilities → Phase 2: WFST Search → Save Backtrack Log} → Backtrack → Output Results, with data and control flowing between the stages.]
Decoding Process
• Phase 0: Prepare Active Set
  • Gather active speech recognition hypotheses from the previous frame.
• Phase 1: Compute Observation Probability
  • Compute the likelihood of the phonetic models (Gaussian Mixture Models) for the current input features.
• Phase 2: WFST Search
  • Frame-synchronous Viterbi search is performed on the WFST network.
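The three phases above can be sketched as a toy frame-synchronous Viterbi loop. This is a simplified illustration under assumed data structures, not HYDRA's implementation; `log_obs` stands in for the GMM scores of Phase 1:

```python
import math

def decode(wfst_arcs, frames, log_obs, start_state=0):
    """Toy frame-synchronous Viterbi search over a WFST.
    wfst_arcs: state -> list of (phone, weight, next_state)
    log_obs(phone, frame): observation log-likelihood (Phase 1)
    Returns the best accumulated cost over the final active states."""
    active = {start_state: 0.0}                    # Phase 0: active set
    for frame in frames:
        next_active = {}
        for state, cost in active.items():         # Phase 2: relax outgoing arcs
            for phone, weight, nxt in wfst_arcs.get(state, []):
                new_cost = cost + weight - log_obs(phone, frame)
                if new_cost < next_active.get(nxt, math.inf):
                    next_active[nxt] = new_cost
        active = next_active                        # becomes Phase 0 of next frame
    return min(active.values()) if active else math.inf
```

A real decoder would additionally prune the active set with a beam and save a backtrack log per frame, as the diagram above shows.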
Concurrency Opportunities
[Diagram: hypotheses 1–6 mapped onto GPU thread blocks 1 and 2 for data-parallel evaluation.]
• Phase 0: Prepare Active Set
  • Communication-intensive phase
• Phase 1: Compute Observation Probability
  • 1,000~2,000 GMM clusters are activated per step.
  • Computation-intensive phase
• Phase 2: WFST Search
  • 10,000s of partial hypotheses tracked per step.
  • Handles the irregular graph structure with data-parallel operations.
  • Conflict-free reduction in graph traversal resolves write conflicts.
  • Communication-intensive phase
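The write conflict arises because many arcs, evaluated by different threads, can land on the same destination state in the same step. A sketch of the reduction semantics (on the GPU this would be an atomic min per destination; the function name is illustrative):

```python
from collections import defaultdict

def relax_all(proposals):
    """proposals: (dest_state, candidate_cost) pairs produced in parallel by
    many arcs. Reducing to the per-destination minimum gives a result that is
    independent of thread ordering, which is what resolves the write conflict."""
    best = defaultdict(lambda: float("inf"))
    for dest, cost in proposals:   # GPU analogue: one thread per arc + atomicMin
        if cost < best[dest]:
            best[dest] = cost
    return dict(best)
```

Because min is associative and commutative, the reduction can be performed in any order across thread blocks without changing the result.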
GPU-Accelerated Speech Recognition
[Diagram: Paul R. Dixon [ICASSP 2009] — Phase 1 (Compute Observation Probabilities) runs on the GPU; Phase 0, Phase 2, backtrack, and iteration control remain on the CPU.]
Approach 1:
• Conduct only the compute-intensive phase on the GPU.
• Incurs high data-copying overheads between the CPU and the GPU.
• Less scalable, as transfers become a sequential bottleneck in the algorithm.
Faster than CPU, but less scalable.
GPU-Accelerated Speech Recognition
[Diagram: Jike Chong [INTERSPEECH 2009] — Phase 0, Phase 1, and Phase 2 all run on the GPU; the CPU only initializes data structures, collects backtrack info, and outputs results.]
Approach 2:
• Both phases are on the GPU.
• Scalable algorithm.
• Only suitable for small models (vocabulary / linguistic context) due to limited GPU memory (1~6 GB).
Fast and scalable, but limited.
GPU-Accelerated Speech Recognition
[Diagram: Proposed — Phase 0, Phase 1, and Phase 2 run on the GPU; the CPU performs On-The-Fly Rescoring (LM lookup) and collects backtrack info alongside the GPU search.]
WFST network size (MB) by vocabulary size (* unable to decode):
            5k      64k      1M
  unigram    2       12      92
  bigram   114    1,880     N/A*
  trigram  676   4,644*     N/A*
Proposed Approach:
• Both phases are on the GPU, using a unigram WFST.
• Hypotheses are rescored "On-The-Fly" using a larger language model on the CPU.
Fast, scalable, and supports large models.
On-The-Fly Hypothesis Rescoring
[Figure: WFST graph for "Recognize Speech". At the boundary of "Recognize", the graph weight is updated with P(Recognize|<s>); at the boundary of "Speech", with P(Speech|<s>,Recognize).]
At each word boundary, the graph weight is updated with linguistic context from the CPU.
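The word-boundary update amounts to swapping the unigram score carried by the search graph for the full n-gram score. A minimal illustration; the dict-based language models and the function name are assumptions, not HYDRA's API:

```python
import math

def rescore_at_word_boundary(path_cost, word, history, unigram_p, full_lm_p):
    """On-the-fly rescoring: remove the unigram LM score already baked into
    the path cost and add the full LM score P(word | history), looked up on
    the CPU while the GPU continues the search."""
    old = -math.log(unigram_p[word])               # score used during GPU search
    new = -math.log(full_lm_p[(history, word)])    # n-gram score from the CPU LM
    return path_cost - old + new

# e.g. "Speech" is unlikely a priori but likely after "<s> Recognize":
uni = {"Speech": 0.01}
tri = {(("<s>", "Recognize"), "Speech"): 0.5}
```

Because only word-boundary arcs are rescored, the large language model never has to reside in GPU memory.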
Evaluation Results (5k vocab.)
• 20X speed-up compared to standard WFST decoding on a CPU at 93.80% word accuracy
• A maximum accuracy of 95.4% is achieved
[Chart: Word Accuracy (%) vs. Real-Time Factor, 85–95% over RTF 0.04–0.40+. PROPOSED (3-GRAM, N=3) reaches 93.8% accuracy at RTF 0.078; STANDARD (2-GRAM) needs RTF 1.582 for the same accuracy — a 20X speed-up.]
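The 20X figure is the ratio of the two real-time factors; for reference, RTF is simply processing time divided by audio duration. A small worked check of the numbers on this slide:

```python
def rtf(processing_seconds, audio_seconds):
    """Real-time factor: < 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

# Decoding 10 s of audio in 0.78 s gives the proposed system's RTF of 0.078:
assert abs(rtf(0.78, 10.0) - 0.078) < 1e-12

# Speed-up over the standard decoder is the ratio of the two RTFs:
speedup = 1.582 / 0.078   # about 20.3, reported as 20X
```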
Evaluation Results (1M vocab.)
• 2.74X faster than real-time at a WER of 9.35%
• 10X faster than the CPU decoder
[Chart: Real-Time Factor by language model. GPU: TRIGRAM (9.61% WER), 4-GRAM (9.35% WER, RTF 0.36), 5-GRAM (9.38% WER); CPU: 4-GRAM (9.59% WER, RTF 4.01). The real-time line sits at RTF 1.0.]
Q&A
Thank you for your attention.