HYDRA: A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR
Jungsuk Kim, Jike Chong, Ian Lane
Electrical and Computer Engineering, Carnegie Mellon University Silicon Valley
March 21, 2013 @ GTC 2013
Carnegie Mellon University
Overview
• Introduction
• Speech Recognition with Weighted Finite State Transducers (WFSTs)
• GPU-Accelerated Speech Recognition
• On-The-Fly Hypothesis Rescoring
• Evaluation Results
• Demonstration Video
Introduction
• Voice interfaces are a core technology for user interaction
  • Mobile devices, Smart TVs, in-vehicle systems, …
• For a captivating user experience, a Voice UI must be:
  • Robust
    • Acoustic robustness → large acoustic models
    • Linguistic robustness → large-vocabulary recognition
  • Responsive
    • Low latency → faster-than-real-time search
  • Adaptive
    • User and task adaptation
A new speech recognition architecture is required.
Introduction
Speech recognition contains many highly parallel tasks
+ GPU processors are optimized for parallel computing
= HYDRA, an ASR engine designed specifically for GPUs
Speech Recognition with Weighted Finite State Transducers (WFSTs)
[Figure: WFST search graph. Phoneme arcs r-eh-k-ax-g-n-ay-z and s-p-iy-ch spell out "Recognize Speech"; alternative paths r-eh-k ("Wreck"), n-ay-s ("Nice"), and b-iy-ch ("Beach") compete in the same graph.]
The search space in speech recognition is represented as a WFST graph that encodes acoustic, phonetic, and linguistic constraints.
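Such a search graph can be sketched as a tiny weighted transducer. The class and labels below are a hypothetical illustration, not HYDRA's data structures:

```python
from collections import defaultdict

# Minimal sketch of a WFST search graph like the one on this slide.
# Arcs carry an input label (phoneme), an output label (word, emitted at
# word boundaries), and a weight (a negative log probability).
class WFST:
    def __init__(self):
        # state -> list of (in_label, out_label, weight, next_state)
        self.arcs = defaultdict(list)

    def add_arc(self, src, in_label, out_label, weight, dst):
        self.arcs[src].append((in_label, out_label, weight, dst))

g = WFST()
# The shared prefix "r eh k" can end as the word "Wreck" or continue
# toward "Recognize" -- both hypotheses live in the same graph:
g.add_arc(0, "r", None, 0.0, 1)
g.add_arc(1, "eh", None, 0.0, 2)
g.add_arc(2, "k", "Wreck", 2.3, 3)   # word boundary: emit "Wreck"
g.add_arc(2, "k", None, 0.0, 4)      # ...or keep matching "Recognize"
g.add_arc(4, "ax", None, 0.0, 5)
```

The linguistic constraints enter the graph as arc weights; composing and optimizing the acoustic, phonetic, and language-model transducers into one graph is what makes single-pass search practical.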
Speech Recognition with Weighted Finite State Transducers (WFSTs)
Search is performed in 3 phases, repeated for every input frame:
• Phase 0: Prepare Active Set — gather active states (speech recognition hypotheses) from the previous frame.
• Phase 1: Compute Observation Probability — compute the likelihood of the phonetic models (Gaussian Mixture Models) for the current frame.
• Phase 2: WFST Search — frame-synchronous Viterbi search is performed on the WFST network.
[Animation: the three phases repeat frame by frame; as phonemes r, eh, k, … are consumed, hypotheses advance through the graph and unlikely paths (e.g. "Wreck") are pruned.]
The backtrack from the Viterbi search yields the speech recognition hypothesis.
Speech Recognition with Weighted Finite State Transducers (WFSTs)
[Diagram: CPU decoding pipeline. Initialize data structures → per-frame loop under iteration control {Phase 0: Prepare Active Set → Phase 1: Compute Observation Probabilities → Phase 2: WFST Search → Save Backtrack Log} → Backtrack → Output Results, with data and control flowing between the stages.]
Decoding Process
• Phase 0: Prepare Active Set
  • Gather active speech recognition hypotheses from the previous frame.
• Phase 1: Compute Observation Probability
  • Compute the likelihood of the phonetic models (Gaussian Mixture Models) for the current input features.
• Phase 2: WFST Search
  • Frame-synchronous Viterbi search is performed on the WFST network.
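The three phases above can be sketched as a toy frame-synchronous Viterbi loop. This is a simplified illustration under assumed data structures, not HYDRA's implementation; `log_obs` stands in for the GMM scores of Phase 1:

```python
import math

def decode(wfst_arcs, frames, log_obs, start_state=0):
    """Toy frame-synchronous Viterbi search over a WFST.
    wfst_arcs: state -> list of (phone, weight, next_state)
    log_obs(phone, frame): observation log-likelihood (Phase 1)
    Returns the best accumulated cost over the final active states."""
    active = {start_state: 0.0}                    # Phase 0: active set
    for frame in frames:
        next_active = {}
        for state, cost in active.items():         # Phase 2: relax outgoing arcs
            for phone, weight, nxt in wfst_arcs.get(state, []):
                new_cost = cost + weight - log_obs(phone, frame)
                if new_cost < next_active.get(nxt, math.inf):
                    next_active[nxt] = new_cost
        active = next_active                        # becomes Phase 0 of next frame
    return min(active.values()) if active else math.inf
```

A real decoder would additionally prune the active set with a beam and save a backtrack log per frame, as the diagram above shows.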
Concurrency Opportunities
[Diagram: hypotheses 1–6 mapped onto GPU thread blocks 1 and 2 for data-parallel evaluation.]
• Phase 0: Prepare Active Set
  • Communication-intensive phase
• Phase 1: Compute Observation Probability
  • 1,000~2,000 GMM clusters are activated per step.
  • Computation-intensive phase
• Phase 2: WFST Search
  • 10,000s of partial hypotheses tracked per step.
  • Handles the irregular graph structure with data-parallel operations.
  • Conflict-free reduction in graph traversal resolves write conflicts.
  • Communication-intensive phase
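The write conflict arises because many arcs, evaluated by different threads, can land on the same destination state in the same step. A sketch of the reduction semantics (on the GPU this would be an atomic min per destination; the function name is illustrative):

```python
from collections import defaultdict

def relax_all(proposals):
    """proposals: (dest_state, candidate_cost) pairs produced in parallel by
    many arcs. Reducing to the per-destination minimum gives a result that is
    independent of thread ordering, which is what resolves the write conflict."""
    best = defaultdict(lambda: float("inf"))
    for dest, cost in proposals:   # GPU analogue: one thread per arc + atomicMin
        if cost < best[dest]:
            best[dest] = cost
    return dict(best)
```

Because min is associative and commutative, the reduction can be performed in any order across thread blocks without changing the result.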
GPU-Accelerated Speech Recognition
[Diagram: Paul R. Dixon [ICASSP 2009] — Phase 1 (Compute Observation Probabilities) runs on the GPU; Phase 0, Phase 2, backtrack, and iteration control remain on the CPU.]
Approach 1:
• Conduct only the compute-intensive phase on the GPU.
• Incurs high data-copying overheads between the CPU and the GPU.
• Less scalable, as transfers become a sequential bottleneck in the algorithm.
Faster than CPU, but less scalable.
GPU-Accelerated Speech Recognition
[Diagram: Jike Chong [INTERSPEECH 2009] — Phase 0, Phase 1, and Phase 2 all run on the GPU; the CPU only initializes data structures, collects backtrack info, and outputs results.]
Approach 2:
• Both phases are on the GPU.
• Scalable algorithm.
• Only suitable for small models (vocabulary / linguistic context) due to limited GPU memory (1~6 GB).
Fast and scalable, but limited.
GPU-Accelerated Speech Recognition
[Diagram: Proposed — Phase 0, Phase 1, and Phase 2 run on the GPU; the CPU performs On-The-Fly Rescoring (LM lookup) and collects backtrack info alongside the GPU search.]
WFST network size (MB) by vocabulary size (* unable to decode):
            5k      64k      1M
  unigram    2       12      92
  bigram   114    1,880     N/A*
  trigram  676   4,644*     N/A*
Proposed Approach:
• Both phases are on the GPU, using a unigram WFST.
• Hypotheses are rescored "On-The-Fly" using a larger language model on the CPU.
Fast, scalable, and supports large models.
On-The-Fly Hypothesis Rescoring
[Figure: WFST graph for "Recognize Speech". At the boundary of "Recognize", the graph weight is updated with P(Recognize|<s>); at the boundary of "Speech", with P(Speech|<s>,Recognize).]
At each word boundary, the graph weight is updated with linguistic context from the CPU.
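The word-boundary update amounts to swapping the unigram score carried by the search graph for the full n-gram score. A minimal illustration; the dict-based language models and the function name are assumptions, not HYDRA's API:

```python
import math

def rescore_at_word_boundary(path_cost, word, history, unigram_p, full_lm_p):
    """On-the-fly rescoring: remove the unigram LM score already baked into
    the path cost and add the full LM score P(word | history), looked up on
    the CPU while the GPU continues the search."""
    old = -math.log(unigram_p[word])               # score used during GPU search
    new = -math.log(full_lm_p[(history, word)])    # n-gram score from the CPU LM
    return path_cost - old + new

# e.g. "Speech" is unlikely a priori but likely after "<s> Recognize":
uni = {"Speech": 0.01}
tri = {(("<s>", "Recognize"), "Speech"): 0.5}
```

Because only word-boundary arcs are rescored, the large language model never has to reside in GPU memory.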
Evaluation Results (5k vocab.)
• 20X speed-up compared to standard WFST decoding on a CPU at 93.80% word accuracy
• A maximum accuracy of 95.4% is achieved
[Chart: Word Accuracy (%) vs. Real-Time Factor, 85–95% over RTF 0.04–0.40+. PROPOSED (3-GRAM, N=3) reaches 93.8% accuracy at RTF 0.078; STANDARD (2-GRAM) needs RTF 1.582 for the same accuracy — a 20X speed-up.]
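The 20X figure is the ratio of the two real-time factors; for reference, RTF is simply processing time divided by audio duration. A small worked check of the numbers on this slide:

```python
def rtf(processing_seconds, audio_seconds):
    """Real-time factor: < 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

# Decoding 10 s of audio in 0.78 s gives the proposed system's RTF of 0.078:
assert abs(rtf(0.78, 10.0) - 0.078) < 1e-12

# Speed-up over the standard decoder is the ratio of the two RTFs:
speedup = 1.582 / 0.078   # about 20.3, reported as 20X
```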
Evaluation Results (1M vocab.)
• 2.74X faster than real-time at a WER of 9.35%
• 10X faster than the CPU decoder
[Chart: Real-Time Factor by language model. GPU: TRIGRAM (9.61% WER), 4-GRAM (9.35% WER, RTF 0.36), 5-GRAM (9.38% WER); CPU: 4-GRAM (9.59% WER, RTF 4.01). The real-time line sits at RTF 1.0.]
Q&A
Thank you for your attention.