GTC 2019, San Jose CA
Optimizing Runtime Performance
of Neural Net Architectures
for High Scalability
in Speech Recognition Servers
John Kominek, CTO, Voci Technologies
March 21, 2019
S9535
© 2019 GTC 2019, San Jose CA
High Density Speech Recognition on Nvidia GPUs
▪ Training of neural nets receives a lot of attention
– Consumes the entire resources of the GPU
▪ Less attention is given to evaluation but is critical to
large commercial deployments
– Each stream uses a fraction of GPU resources
– But you want to get maximal use of the card
An Untold Story of Neural Nets in ASR
▪ Experience Learned
– Insight into what's easy and what's hard
– The multi-threading trick that works surprisingly well
– The mystery (and pain) of negative scaling
– Kepler, Maxwell, Pascal, Volta – how do they stack up?
▪ Intermediate level talk
– Light on math, plenty of insider jargon
Company Highlights
Deliver the world's best speech-to-text platform for analytics
[Timeline graphic spanning pre-2011 through 2019; milestones as shown on the slide:]
• CMU government-funded projects lead to the founding of Voci
• Series A funding
• Emotion, sentiment, and gender labeling
• Deep learning
• Integration
• 2016: Language ID introduced; expanded language models; over 10M minutes transcribed
• 2019: 50 employees; >10M bookings; >8 billion est. minutes transcribed; new website launched; focused business model
• First-generation speech engine V-Blaze developed: speed, accuracy, scalability
• Real-time transcription
• AI powers ASR
• V-Spark introduced; speaker separation; partner enablement
• V-Cloud introduced; speaker ID; 40 new logos
• V-Blaze 5.0 released; custom language models; over 100M minutes transcribed
• Series B funding; 5 billion minutes transcribed; biometrics introduced
• 200 man-years of development
Neural Net Revolution = Company Crossroads
▪ Up to 2013
– FPGA-implemented large fully continuous GMM models with integrated statistical language model evaluation and search. Fastest ASR engine in the world at the time.
▪ 2013
– G. Hinton et al. established the superiority of deep neural networks, leading to a rare seismic shift in the field of speech recognition.
Voci's Technology Shift from FPGAs to GPUs
▪ 2013-2014 – Technology bakeoff
– DNN evaluation implemented on a Xilinx Virtex-5 was pitted against a CUDA implementation on a Tesla K20
– The Nvidia platform won convincingly
– Matrix multiplication primitives are tailor-made for deep feedforward network evaluation
– Migrated to the open-source Kaldi toolkit for model training
Voci V-Blaze Runs on Extensive Array of GPUs
▪ Servers: Tesla K20, K40, K80, M10, M60, P100, V100
▪ Embedded: Jetson Tegra TK1, TX1, TX2, CX2
▪ Laptops: GeForce GTX 960M, GTX 1050, GTX 1050 Ti
▪ In the cloud on AWS
▪ Red Hat/CentOS, Debian/Ubuntu
Pictures of Server Rooms are Boring, so...
Voci powering advanced automotive conversational systems
If it's a Neural Net, Throw it on the GPU
▪ Voci V-Blaze runs
– DNN
– LSTM, BLSTM
– CNN
– TDNN
– RNNLM
– Combinations: e.g. DNN + CNN + BLSTM
A Story of Joy, Struggle, and Triumph
▪ Easy to accelerate: Feedforward DNN
▪ Hard to accelerate: Bidirectional LSTM
Evaluating Feedforward DNNs
▪ Single-threaded evaluation is a straightforward sequence of matrix multiplications and non-linear range-compression functions
▪ Invoke the appropriate cuDNN functions … and voilà, marketing gold
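That evaluation loop amounts to repeated matrix multiplies interleaved with an element-wise squashing function. A minimal NumPy sketch of the same math (CPU only, with hypothetical layer sizes; the production path calls cuDNN on the GPU instead):

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Feedforward DNN evaluation: alternate matrix multiplications with a
    non-linear range-compression function (sigmoid here), softmax at the end."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid squashes to (0, 1)
    logits = h @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)      # posteriors over output states

# Hypothetical dimensions: 440-dim spliced input, six hidden layers of
# 1024 units, 2000 output states.
rng = np.random.default_rng(0)
dims = [440] + [1024] * 6 + [2000]
weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
frames = rng.standard_normal((8, 440))            # batch of 8 acoustic frames
post = dnn_forward(frames, weights, biases)       # shape (8, 2000)
```

Because the work is almost entirely dense GEMMs, mapping it onto GPU matrix-multiplication primitives is direct.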
Single Threaded Performance
Multi-Core Performance is What Matters
▪ There are plenty of CUDA cores left over
▪ There are untapped Xeon cores
▪ How well does neural net inference scale as the compute
load is increased and more cores (GPU or CPU) are
invoked?
Increasing Load on an M10, DNN Evaluation
Increasing Load on an M60, DNN Evaluation
Increasing Load on a P100, DNN Evaluation
Increasing Load on a V100, DNN Evaluation
Meaning of Compute Load
▪ A compute load of 1 is one process pumping audio to
the GPU as fast as results can be returned
▪ Load = number of such processes in parallel
▪ Independent processes, not threads
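That load generator can be sketched as below; `pump_audio` is a hypothetical stand-in for one client process that submits audio and blocks on results (no GPU in this sketch):

```python
import multiprocessing as mp
import time

def pump_audio(worker_id, seconds_of_audio, results):
    """Stand-in for one client process: submit 'audio', block until results
    come back, repeat as fast as the engine returns them."""
    start = time.perf_counter()
    done = 0
    while done < seconds_of_audio:
        time.sleep(0.001)  # placeholder for: send chunk, wait for transcript
        done += 1
    results.put((worker_id, time.perf_counter() - start))

def run_load(n_processes, seconds_each=50):
    """A compute load of N = N independent processes running in parallel."""
    results = mp.Queue()
    procs = [mp.Process(target=pump_audio, args=(i, seconds_each, results))
             for i in range(n_processes)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [results.get() for _ in procs]

if __name__ == "__main__":
    timings = run_load(4)  # compute load of 4
```

The point of using processes rather than threads is that each client is fully independent, with no shared CUDA context managed by the application.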
Voci Engineers are Scofflaws!
▪ The Nvidia programming guidelines recommend multi-threading: separate processes do not truly run in parallel; to run in parallel, program with threads.
▪ We're like, "yeah, whatever."
Translating Compute Load to Speed
▪ Depends on the size of the neural net
• Small = 1024×6 ≈ 12 million connections
• Medium = 2048×6 ≈ 34 million
• Large = 4096×6 ≈ 110 million
▪ Speed reported as x times faster than real time
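The quoted connection counts can be roughly reproduced by assuming typical Kaldi-style input and output dimensions (the 440-dim spliced input and 8000 output states below are illustrative guesses, not Voci's actual values); the real-time factor is simple division:

```python
def dnn_connections(width, depth, d_in=440, d_out=8000):
    """Approximate weight (connection) count for a width x depth feedforward
    net; d_in and d_out are assumed, illustrative dimensions."""
    return d_in * width + (depth - 1) * width * width + width * d_out

def times_real_time(audio_seconds, wall_seconds):
    """Speed expressed as 'x times faster than real time'."""
    return audio_seconds / wall_seconds

small  = dnn_connections(1024, 6)  # ~14M, same order as the quoted ~12M
medium = dnn_connections(2048, 6)  # ~38M, same order as the quoted ~34M
large  = dnn_connections(4096, 6)  # ~118M, same order as the quoted ~110M
speed  = times_real_time(3600, 36)  # one hour of audio in 36 s = 100x real time
```

Note the large model is dominated by the hidden-to-hidden terms, which grow quadratically in layer width.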
DNN Evaluation Speed vs Model Size (P100)
DNN Evaluation by Tesla Generation (K80)
DNN Evaluation by Tesla Generation (M10)
DNN Evaluation by Tesla Generation (M60)
DNN Evaluation by Tesla Generation (P100)
DNN Evaluation by Tesla Generation (V100)
Comparison to Pure CPU Performance Curve
V100 Provides Best Peak Power Efficiency
So much for Easy, Now for Hard
https://github.com/dophist/kaldi-lstm
Speed of Open Source Kaldi Implementation
Visual Profiler Reveals the Problem
DNN
BLSTM – kernel synchronization dominates
Highly Suspicious Power/Utilization Pattern
The Shock of Negative Scaling
Instead of saturating, speed decreases!
M10/M60 Scale According to GPU Count, then Drop
What to do?
▪ Separate processes were interfering with each other
▪ Three avenues forward
– Switch to older cards that present multiple, less powerful GPUs (the M10)
– Re-engineer the infrastructure code to be a multi-threaded, single-process server
– See how far optimizing the code will take you
4 Custom Optimizations
▪ Kernel merging (15%)
▪ Matrix transpose into row major form (10%)*
▪ Reverse direction compute stream pairs (24%)
▪ Application-specific data parallelism (26%)
▪ Together: increase single process speed by 2x
– * J. Appleyard, Optimizing Recurrent Neural Networks in cuDNN 5, GTC 2016
Application Specific Data Parallelism
▪ Serialism inherent in recurrent loops can be approximated
[Diagram: timeline showing multiple fwd/bwd compute stream pairs running concurrently]
cudaStreamCreate(&stream_fwd)
cudaStreamCreate(&stream_bwd)
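One way to picture the approximation: split the utterance into fixed-size chunks that can be evaluated concurrently, warming each chunk up on a few frames of left context whose outputs are discarded. A toy scalar recurrence (not the actual BLSTM kernels) shows the error stays small when the recurrence forgets its history quickly:

```python
import numpy as np

def recur_exact(x, w=0.5):
    """Strictly sequential recurrence h_t = tanh(w*h_{t-1} + x_t)."""
    h = np.empty(len(x))
    prev = 0.0
    for t, xt in enumerate(x):
        prev = np.tanh(w * prev + xt)
        h[t] = prev
    return h

def recur_chunked(x, chunk=50, context=10, w=0.5):
    """Approximate the recurrence by evaluating fixed-size chunks
    independently (each loop iteration could run in parallel), warming each
    chunk up on `context` frames of left history whose outputs are dropped."""
    out = np.empty(len(x))
    for start in range(0, len(x), chunk):   # iterations are independent
        lo = max(0, start - context)
        seg = recur_exact(x[lo:start + chunk], w)
        out[start:start + chunk] = seg[start - lo:]
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal(500) * 0.1
exact = recur_exact(x)
approx = recur_chunked(x)
err = np.max(np.abs(exact - approx))  # small: warm-up frames wash out history
```

Because this recurrence is contracting, the influence of the (wrong) initial state decays geometrically over the warm-up context, which is what makes the chunk-parallel approximation workable.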
Before and After – from 37.5x to 375x
Negative Scaling Eliminated
Power/Utilization after Optimization is Sane Again
V100 Still Leads on Power Efficiency
Best Price/Performance Provided by M10
Unsurprising Findings
▪ What's easy and what's hard
– DNNs are easy, BLSTMs are hard
▪ Kepler, Maxwell, Pascal, Volta comparison
– V100 is fastest
– V100 has best power efficiency
– M10 has best price/performance
Unexpected Findings
▪ The multi-threading trick that works surprisingly well to achieve high performance scaling
– Don't multi-thread (even though you should)
▪ Negative scaling can happen – and can be overcome
– It's still kind of a mystery, though
– For advanced details, join our company
www.vocitec.com
The only true enterprise speech-to-text platform that solves real business challenges
john.kominek@vocitec.com, mike.coney@vocitec.com (CEO)
www.vocitec.com