GTC 2019, San Jose CA
Optimizing Runtime Performance
of Neural Net Architectures
for High Scalability
in Speech Recognition Servers
John Kominek, CTO, Voci Technologies
March 21, 2019
S9535
© 2019 GTC 2019, San Jose CA
High Density Speech Recognition on Nvidia GPUs
▪ Training of neural nets receives a lot of attention
– Consumes the entire resources of the GPU
▪ Less attention is given to evaluation but is critical to
large commercial deployments
– Each stream uses a fraction of GPU resources
– But you want to get maximal use of the card
An Untold Story of Neural Nets in ASR
▪ Experience Learned
– Insight into what's easy and what's hard
– The multi-threading trick that works surprisingly well
– The mystery (and pain) of negative scaling
– Kepler, Maxwell, Pascal, Volta – how do they stack up?
▪ Intermediate level talk
– Light on math, plenty of insider jargon
Company Highlights
Deliver the world's best speech-to-text platform for analytics
[Timeline graphic spanning pre-2011 through 2019; milestones as shown on the slide:]
• CMU government-funded projects lead to the founding of Voci
• Series A funding
• Emotion, sentiment, and gender labeling
• Deep learning
• Integration
• 2016: Language ID introduced; expanded language models; over 10M minutes transcribed
• 2019: 50 employees; >10M bookings; >8 billion est. minutes transcribed; new website launched; focused business model
• First-generation speech engine V-Blaze developed: speed, accuracy, scalability
• Real-time transcription
• AI powers ASR
• V-Spark introduced; speaker separation; partner enablement
• V-Cloud introduced; speaker ID; 40 new logos
• V-Blaze 5.0 released; custom language models; over 100M minutes transcribed
• Series B funding; 5 billion minutes transcribed; biometrics introduced
• 200 man-years of development
Neural Net Revolution = Company Crossroads
▪ Up to 2013
– FPGA-implemented large fully continuous GMM models with integrated statistical language model evaluation and search. Fastest ASR engine in the world at the time.
▪ 2013
– G. Hinton et al. established the superiority of deep neural networks, leading to a rare seismic shift in the field of speech recognition.
Voci's Technology Shift from FPGAs to GPUs
▪ 2013-2014 – Technology bakeoff
– DNN evaluation implemented on a Xilinx Virtex-5 was pitted against a CUDA implementation on a Tesla K20
– The Nvidia platform won convincingly
– Matrix multiplication primitives are tailor-made for deep feedforward network evaluation
– Migrated to the open-source Kaldi toolkit for model training
Voci V-Blaze Runs on Extensive Array of GPUs
▪ Servers: Tesla K20, K40, K80, M10, M60, P100, V100
▪ Embedded: Jetson Tegra TK1, TX1, TX2, CX2
▪ Laptops: GeForce GTX 960M, GTX 1050, GTX 1050 Ti
▪ In the cloud on AWS
▪ Red Hat/CentOS, Debian/Ubuntu
Pictures of Server Rooms are Boring, so...
Voci powering advanced automotive conversational systems
If it's a Neural Net, Throw it on the GPU
▪ Voci V-Blaze runs
– DNN
– LSTM, BLSTM
– CNN
– TDNN
– RNNLM
– Combinations: e.g. DNN + CNN + BLSTM
A Story of Joy, Struggle, and Triumph
▪ Easy to accelerate: Feedforward DNN
▪ Hard to accelerate: Bidirectional LSTM
Evaluating Feedforward DNNs
▪ Single-threaded evaluation is a straightforward sequence of matrix multiplications and non-linear range-compression functions
▪ Invoke the appropriate cuDNN functions … and voilà, marketing gold
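That evaluation loop amounts to repeated matrix multiplies interleaved with an element-wise squashing function. A minimal NumPy sketch of the same math (CPU only, with hypothetical layer sizes; the production path calls cuDNN on the GPU instead):

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Feedforward DNN evaluation: alternate matrix multiplications with a
    non-linear range-compression function (sigmoid here), softmax at the end."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid squashes to (0, 1)
    logits = h @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)      # posteriors over output states

# Hypothetical dimensions: 440-dim spliced input, six hidden layers of
# 1024 units, 2000 output states.
rng = np.random.default_rng(0)
dims = [440] + [1024] * 6 + [2000]
weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
frames = rng.standard_normal((8, 440))            # batch of 8 acoustic frames
post = dnn_forward(frames, weights, biases)       # shape (8, 2000)
```

Because the work is almost entirely dense GEMMs, mapping it onto GPU matrix-multiplication primitives is direct.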
Single Threaded Performance
Multi-Core Performance is What Matters
▪ There are plenty of CUDA cores left over
▪ There are untapped Xeon cores
▪ How well does neural net inference scale as the compute
load is increased and more cores (GPU or CPU) are
invoked?
Increasing Load on an M10, DNN Evaluation
Increasing Load on an M60, DNN Evaluation
Increasing Load on a P100, DNN Evaluation
Increasing Load on a V100, DNN Evaluation
Meaning of Compute Load
▪ A compute load of 1 is one process pumping audio to
the GPU as fast as results can be returned
▪ Load = number of such processes in parallel
▪ Independent processes, not threads
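That load generator can be sketched as below; `pump_audio` is a hypothetical stand-in for one client process that submits audio and blocks on results (no GPU in this sketch):

```python
import multiprocessing as mp
import time

def pump_audio(worker_id, seconds_of_audio, results):
    """Stand-in for one client process: submit 'audio', block until results
    come back, repeat as fast as the engine returns them."""
    start = time.perf_counter()
    done = 0
    while done < seconds_of_audio:
        time.sleep(0.001)  # placeholder for: send chunk, wait for transcript
        done += 1
    results.put((worker_id, time.perf_counter() - start))

def run_load(n_processes, seconds_each=50):
    """A compute load of N = N independent processes running in parallel."""
    results = mp.Queue()
    procs = [mp.Process(target=pump_audio, args=(i, seconds_each, results))
             for i in range(n_processes)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [results.get() for _ in procs]

if __name__ == "__main__":
    timings = run_load(4)  # compute load of 4
```

The point of using processes rather than threads is that each client is fully independent, with no shared CUDA context managed by the application.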
Voci Engineers are Scofflaws!
▪ The Nvidia programming guidelines recommend multi-threading: separate processes do not truly run in parallel; to run in parallel, program with threads.
▪ We're like, "yeah, whatever."
Translating Compute Load to Speed
▪ Depends on the size of the neural net
• Small = 1024×6 ≈ 12 million connections
• Medium = 2048×6 ≈ 34 million
• Large = 4096×6 ≈ 110 million
▪ Speed reported as x times faster than real time
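The quoted connection counts can be roughly reproduced by assuming typical Kaldi-style input and output dimensions (the 440-dim spliced input and 8000 output states below are illustrative guesses, not Voci's actual values); the real-time factor is simple division:

```python
def dnn_connections(width, depth, d_in=440, d_out=8000):
    """Approximate weight (connection) count for a width x depth feedforward
    net; d_in and d_out are assumed, illustrative dimensions."""
    return d_in * width + (depth - 1) * width * width + width * d_out

def times_real_time(audio_seconds, wall_seconds):
    """Speed expressed as 'x times faster than real time'."""
    return audio_seconds / wall_seconds

small  = dnn_connections(1024, 6)  # ~14M, same order as the quoted ~12M
medium = dnn_connections(2048, 6)  # ~38M, same order as the quoted ~34M
large  = dnn_connections(4096, 6)  # ~118M, same order as the quoted ~110M
speed  = times_real_time(3600, 36)  # one hour of audio in 36 s = 100x real time
```

Note the large model is dominated by the hidden-to-hidden terms, which grow quadratically in layer width.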
DNN Evaluation Speed vs Model Size (P100)
DNN Evaluation by Tesla Generation (K80)
DNN Evaluation by Tesla Generation (M10)
DNN Evaluation by Tesla Generation (M60)
DNN Evaluation by Tesla Generation (P100)
DNN Evaluation by Tesla Generation (V100)
Comparison to Pure CPU Performance Curve
V100 Provides Best Peak Power Efficiency
So much for Easy, Now for Hard
https://github.com/dophist/kaldi-lstm
Speed of Open Source Kaldi Implementation
Visual Profiler Reveals the Problem
DNN
BLSTM – kernel synchronization dominates
Highly Suspicious Power/Utilization Pattern
The Shock of Negative Scaling
Instead of saturating, speed decreases!
M10/M60 Scale According to GPU Count, then Drop
What to do?
▪ Separate processes were interfering with each other
▪ Three avenues forward
– Switch to older cards that present multiple, less powerful GPUs (the M10)
– Re-engineer the infrastructure code to be a multi-threaded, single-process server
– See how far optimizing the code will take you
4 Custom Optimizations
▪ Kernel merging (15%)
▪ Matrix transpose into row major form (10%)*
▪ Reverse direction compute stream pairs (24%)
▪ Application-specific data parallelism (26%)
▪ Together: increase single process speed by 2x
– * J. Appleyard, Optimizing Recurrent Neural Networks in cuDNN 5, GTC 2016
Application Specific Data Parallelism
▪ Serialism inherent in recurrent loops can be approximated
[Diagram: timeline showing multiple fwd/bwd compute stream pairs running concurrently]
cudaStreamCreate(&stream_fwd)
cudaStreamCreate(&stream_bwd)
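One way to picture the approximation: split the utterance into fixed-size chunks that can be evaluated concurrently, warming each chunk up on a few frames of left context whose outputs are discarded. A toy scalar recurrence (not the actual BLSTM kernels) shows the error stays small when the recurrence forgets its history quickly:

```python
import numpy as np

def recur_exact(x, w=0.5):
    """Strictly sequential recurrence h_t = tanh(w*h_{t-1} + x_t)."""
    h = np.empty(len(x))
    prev = 0.0
    for t, xt in enumerate(x):
        prev = np.tanh(w * prev + xt)
        h[t] = prev
    return h

def recur_chunked(x, chunk=50, context=10, w=0.5):
    """Approximate the recurrence by evaluating fixed-size chunks
    independently (each loop iteration could run in parallel), warming each
    chunk up on `context` frames of left history whose outputs are dropped."""
    out = np.empty(len(x))
    for start in range(0, len(x), chunk):   # iterations are independent
        lo = max(0, start - context)
        seg = recur_exact(x[lo:start + chunk], w)
        out[start:start + chunk] = seg[start - lo:]
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal(500) * 0.1
exact = recur_exact(x)
approx = recur_chunked(x)
err = np.max(np.abs(exact - approx))  # small: warm-up frames wash out history
```

Because this recurrence is contracting, the influence of the (wrong) initial state decays geometrically over the warm-up context, which is what makes the chunk-parallel approximation workable.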
Before and After – from 37.5x to 375x
Negative Scaling Eliminated
Power/Utilization after Optimization is Sane Again
V100 Still Leads on Power Efficiency
Best Price/Performance Provided by M10
Unsurprising Findings
▪ What's easy and what's hard
– DNNs are easy, BLSTMs are hard
▪ Kepler, Maxwell, Pascal, Volta comparison
– V100 is fastest
– V100 has best power efficiency
– M10 has best price/performance
Unexpected Findings
▪ The multi-threading trick that works surprisingly well to achieve high performance scaling
– Don't multi-thread (even though you should)
▪ Negative scaling can happen – and can be overcome
– It's still kind of a mystery, though
– For advanced details, join our company
www.vocitec.com
The only true enterprise speech-to-text platform that solves real business challenges
john.kominek@vocitec.com, mike.coney@vocitec.com (CEO)
www.vocitec.com