GTC - DC – November - 2019
GPU ACCELERATED SPEECH-TO-TEXT
AGENDA
1) Brief introduction to speech processing
2) Previous Results
3) Performance Updates Since GTC
4) Kaldi Container
5) Production Deployment with Intelligent Voice
Speech Recognition: the process of taking a raw audio signal and transcribing to text
Use of Automatic Speech Recognition has exploded in the last ten years:
Personal assistants, Medical transcription, Call center analytics, Video search, etc
INTRODUCTION TO ASR
Translating Speech into Text
[Figure: word lattice for the utterance "NVIDIA is cool", with weighted arcs such as nvidia:nvidia/1.0, ai:ai/1.24, and speech:speech/1.63]
KALDI
Kaldi is a speech processing framework out of Johns Hopkins University
Uses a combination of DL and ML algorithms for speech processing
Started in 2009 with the intent to reduce the time and cost needed to build ASR systems
http://kaldi-asr.org/
Considered state-of-the-art
Speech Processing Framework
SPEECH RECOGNITION
• Kaldi fuses known state-of-the-art techniques from speech recognition with deep learning
• Hybrid DL/ML approach continues to perform better than deep learning alone
• "Classical" ML Components:
• Mel-Frequency Cepstral Coefficients (MFCC) features – represent audio as a spectrum of a spectrum
• I-vectors – use factor analysis and Gaussian Mixture Models to learn a speaker embedding, which helps the acoustic model adapt to variability in speakers
• Predict phone states (HMM) – unlike "end-to-end" DL models, Kaldi acoustic models predict context-dependent phone substates as Hidden Markov Model (HMM) states
• The result is a system that, to date, is more robust than DL-only approaches and typically requires less data to train
State of the Art
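The "spectrum of a spectrum" idea behind cepstral features can be sketched in a few lines. This is an illustrative simplification only: it assumes a naive DFT followed by a log and a DCT-II, and it leaves out the mel filterbank, windowing, and pre-emphasis that real MFCC front ends (including Kaldi's) apply. The function names are hypothetical.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude spectrum of one audio frame via a naive O(n^2) DFT."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def dct_ii(values, num_coeffs):
    """First num_coeffs coefficients of an unnormalized DCT-II."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(num_coeffs)
    ]

def cepstral_coefficients(frame, num_coeffs=13):
    """Spectrum -> log -> second transform: a 'spectrum of a spectrum'."""
    log_spectrum = [math.log(m + 1e-10) for m in dft_magnitudes(frame)]
    return dct_ii(log_spectrum, num_coeffs)

# Toy 25 ms frame at 16 kHz: 400 samples of a 440 Hz tone.
frame = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(400)]
coeffs = cepstral_coefficients(frame)
print(len(coeffs))
```

The low-order coefficients summarize the overall shape of the log spectrum, which is why a short vector per frame can stand in for the raw audio.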
KALDI SPEECH PROCESSING PIPELINE
[Figure: Raw Audio -> Feature Extraction -> Acoustic Model -> Language Model -> Output ("NVIDIA is cool")]
Kaldi Components: MFCC & Ivectors, NNET3, Decoder, Lattice
PREVIOUS RESULTS
PREVIOUS WORK
NVIDIA Presentations/Publications:
GTC On Demand: https://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php
Spring 2018: S81034
Fall 2018: DC8189
Spring 2019: S9672
https://arxiv.org/abs/1910.10032
Devblogs:
https://devblogs.nvidia.com/nvidia-accelerates-speech-text-transcription-3500x-kaldi/
https://devblogs.nvidia.com/gpu-accelerated-speech-to-text-with-kaldi-a-tutorial-on-getting-started/
GTC 2019 ACCELERATED COMPONENTS
[Figure: pipeline diagram (Raw Audio -> Feature Extraction -> Acoustic Model -> Language Model -> Output) with a "GPU Accelerated" legend marking the accelerated stages]
GTC-2019 PERFORMANCE
1 GPU, LibriSpeech, 19.03 container
Hardware         Perf (RTFx)   WER    Perf (speedup)
LibriSpeech Model, Libri Clean Data
2x Intel Xeon    381           5.5    1.0x
AGX Xavier       500           5.5    1.3x
Tesla T4         1635          5.5    4.3x
Tesla V100       3524          5.5    9.2x
LibriSpeech Model, Libri Other Data
2x Intel Xeon    377           14.0   1.0x
AGX Xavier       450           14.0   1.2x
Tesla T4         1439          14.0   3.8x
Tesla V100       2854          14.0   7.6x
2x Xeon*: 2x Intel Xeon Platinum 8168, Xavier: AGX Devkit, T4*: PCI-E, V100*: SXM
Determinized lattice output, beam=10, lattice-beam=7
Uses all available HW threads
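The speedup column can be reproduced from the RTFx column. The sketch below assumes that speedup is simply each platform's RTFx divided by the 2x Intel Xeon baseline, rounded to one decimal; the helper name is hypothetical and the numbers are copied from the clean-data rows.

```python
# RTFx values from the LibriSpeech model, Libri clean data rows.
rtfx_clean = {
    "2x Intel Xeon": 381,
    "AGX Xavier": 500,
    "Tesla T4": 1635,
    "Tesla V100": 3524,
}

def speedups(rtfx, baseline="2x Intel Xeon"):
    """Speedup of each platform relative to the CPU baseline (assumed definition)."""
    base = rtfx[baseline]
    return {hardware: round(value / base, 1) for hardware, value in rtfx.items()}

for hardware, speedup in speedups(rtfx_clean).items():
    print(f"{hardware}: {speedup}x")
```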
GTC-2019 Scale-up performance
GPUs    T4 Performance (RTFx)   V100 Performance (RTFx)
1       1635                    3524
2       3371                    7082
4       6368                    10011
8       7906                    9399
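One way to read the scale-up numbers is as parallel efficiency. The sketch below assumes the usual definition, efficiency = RTFx on N GPUs / (N x RTFx on 1 GPU), and uses the scale-up values reported above; the function name is hypothetical.

```python
gpu_counts = [1, 2, 4, 8]
scale_up_rtfx = {
    "T4": [1635, 3371, 6368, 7906],
    "V100": [3524, 7082, 10011, 9399],
}

def parallel_efficiency(values, counts):
    """Measured throughput divided by ideal linear scaling from one GPU."""
    single_gpu = values[0]
    return [value / (n * single_gpu) for value, n in zip(values, counts)]

for gpu, values in scale_up_rtfx.items():
    efficiency = parallel_efficiency(values, gpu_counts)
    print(gpu, [f"{e:.0%}" for e in efficiency])
```

Under that definition, efficiency at 8 GPUs falls to roughly 60% on T4 and roughly a third on V100.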
MULTI-GPU PERFORMANCE LIMITERS
Scalability Limited Due to CPU Overhead
Feature Extraction and Determinization become bottlenecks
CPU has a hard time keeping up with GPU performance
RECENT PERFORMANCE UPDATES
RECENT IMPROVEMENTS
Multi-threading improvements
  Moved more tasks to worker threads, which allows control threads to submit work faster and keep the GPU busy
Reduce memory usage
  Increased batch size = more performance
General Optimization
Container Improvements
  Automatic segmenting and dataset preparation
  Added ASpIRE Model
Since GTC 2019
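The worker-thread idea can be illustrated generically. The sketch below is not Kaldi's C++ implementation: it assumes a Python thread pool, and `prepare` is a hypothetical stand-in for CPU-side work. The point is that the control thread only submits tasks and returns to its loop immediately, instead of doing the work itself.

```python
from concurrent.futures import ThreadPoolExecutor

def prepare(utterance_id):
    """Hypothetical stand-in for CPU-side work on one utterance."""
    return f"features-{utterance_id}"

def control_loop(utterance_ids, num_workers=4):
    """Control thread: submit work to the pool, then collect results in order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # submit() returns a future right away, so this loop is not blocked
        # by the work itself and can keep feeding the pipeline.
        futures = [pool.submit(prepare, uid) for uid in utterance_ids]
        return [future.result() for future in futures]

print(control_loop(range(3)))
```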
LATEST SINGLE GPU PERFORMANCE
1 GPU, LibriSpeech, 19.11 container
Hardware         19.11 Perf (RTFx)   WER    GTC 2019 (19.03) Speedup   GTC-DC 2019 (19.11) Speedup
LibriSpeech Model, Libri Clean Data
2x Intel Xeon    381                 5.5    1.0x                       1.0x
Tesla T4         1849                5.5    4.3x                       4.9x
Tesla V100       5154                5.5    9.2x                       13.5x
LibriSpeech Model, Libri Other Data
2x Intel Xeon    377                 14.0   1.0x                       1.0x
Tesla T4         1679                14.0   3.8x                       4.5x
Tesla V100       3925                14.0   7.6x                       10.4x
2x Xeon*: 2x Intel Xeon Platinum 8168, V100*: SXM
Determinized lattice output, beam=10, lattice-beam=7
Uses all available HW threads
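The release-over-release gain on the same GPU can be read off by dividing the 19.11 RTFx by the 19.03 RTFx reported earlier in the deck. This is a small sketch under that assumption; the dictionary layout is illustrative only.

```python
# RTFx per (GPU, dataset), copied from the 19.03 and 19.11 result tables.
rtfx_19_03 = {("T4", "clean"): 1635, ("T4", "other"): 1439,
              ("V100", "clean"): 3524, ("V100", "other"): 2854}
rtfx_19_11 = {("T4", "clean"): 1849, ("T4", "other"): 1679,
              ("V100", "clean"): 5154, ("V100", "other"): 3925}

# Ratio of new to old throughput for each configuration.
gain = {key: rtfx_19_11[key] / rtfx_19_03[key] for key in rtfx_19_03}

for (gpu, dataset), ratio in gain.items():
    print(f"{gpu} / {dataset}: {ratio:.2f}x faster in 19.11")
```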
SPRING 2019 ACCELERATED COMPONENTS
[Figure: pipeline diagram (Raw Audio -> Feature Extraction -> Acoustic Model -> Language Model -> Output) with a "GPU Accelerated" legend marking the accelerated stages]
Large amount of CPU work. When scaling to multi-GPU, CPU threads cannot keep up.
GPU ACCELERATED FEATURE EXTRACTION
Reduce CPU overhead
[Figure: pipeline diagram (Raw Audio -> Feature Extraction -> Acoustic Model -> Language Model -> Output)]
Batch=1 implementation moved work to the GPU, significantly reducing CPU load.
FEATURE EXTRACTION
Pipeline
[Figure: feature extraction pipeline from Raw Audio to Input Feature and Ivector Feature. Blocks: Base Feature (MFCC, FBANK), Pitch, OnlineCmvn, Ivector Extraction. The legend also includes "Currently not supported."]
Green = Implemented in CUDA. Individual models may not use all components.
Batch=1 implementation only
IVECTOR EXTRACTION
Pipeline
[Figure: ivector extraction pipeline from Base Feature to Ivector Feature. Blocks: LDA Transform, Online CMVN, Splice, LDA Transform, Posteriors, Ivector Stats, Compute Ivector, Splice]
Green = Implemented in CUDA. Individual models may not use all components.
Batch=1 implementation only
GPU FEATURE EXTRACTION
End-to-End Scalability & Efficiency (DGX-1V)
GPU_THREADS=2, MAX_BATCH_SIZE=300, BATCH_DRAIN_SIZE=40, DATASETS=test_clean, COPY_THREADS=0
[Figure: Parallel Scalability. Real Time Factor (0 to 25000) vs. number of V100-SXM-16GB GPUs (1, 2, 4, 8), CPU feature extraction vs. GPU feature extraction]
[Figure: Parallel Efficiency. Parallel efficiency (0% to 100%) vs. number of V100-SXM-16GB GPUs (1, 2, 4, 8), CPU feature extraction vs. GPU feature extraction]
FULL NODE PERFORMANCE
Good Scalability across a range of hardware platforms
[Figure: Kaldi - Multi-GPU Scalability. RTFx (0 to 60000) for 1, 2, 4, 8, 16, and 20 GPUs on DGX-1V (V100 SXM 16GB), DGX-2 (V100 SXM 32GB), and SYS-6049GP-TRT (T4)]
[Figure: Kaldi - Multi-GPU Efficiency. Parallel efficiency (0% to 120%) for the same GPU counts and systems]
Callout: 14 hours of audio in one second
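The "14 hours in one second" callout can be related back to RTFx with simple unit conversion. The sketch assumes RTFx means seconds of audio processed per second of wall-clock time; the helper names are hypothetical.

```python
SECONDS_PER_HOUR = 3600.0

def rtfx_for(audio_hours, wall_seconds=1.0):
    """RTFx needed to process audio_hours of audio in wall_seconds."""
    return audio_hours * SECONDS_PER_HOUR / wall_seconds

def audio_hours_per_second(rtfx):
    """Hours of audio processed per wall-clock second at a given RTFx."""
    return rtfx / SECONDS_PER_HOUR

print(rtfx_for(14))                   # 50400.0
print(audio_hours_per_second(50400))  # 14.0
```

So the callout corresponds to a full-node throughput of roughly 50,000 RTFx, which sits within the range of the scalability chart's axis.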
FUTURE WORK
Batched Feature Extraction
  More performance
Online Speech Pipeline
  Streaming audio
  Lower latency, higher throughput, less memory
  Can emulate offline with same benefits
More models
Look for these features at GTC - 2020
THE NGC CONTAINER REGISTRY
Discover over 40 GPU-Accelerated Containers: spanning deep learning, machine learning, HPC applications, HPC visualization, and more
Innovate in Minutes, Not Weeks: pre-configured, ready-to-run
Run Anywhere: the top cloud providers, NVIDIA DGX Systems, PCs and workstations with select NVIDIA GPUs, and NGC-Ready systems
Simple Access to GPU-Accelerated Software
NGC CONTAINER
Get an NGC account: https://ngc.nvidia.com/signup
Free & Easy
# log in to NGC, pull the container, and run it
%> docker login nvcr.io
%> docker pull nvcr.io/nvidia/kaldi:19.10-py3
%> nvidia-docker run --rm -it nvcr.io/nvidia/kaldi:19.10-py3
#prepare models and data
%> cd /workspace/nvidia-examples/librispeech
%> ./prepare_data.sh
#run benchmark
%> ./run_benchmark.sh
BENCHMARK OUTPUT
NGC Container
Process 0:
~Group 0 completed Aggregate Total Time: 15.3179 Audio: 19452.5 RealTimeX: 1269.92
~Group 1 completed Aggregate Total Time: 20.8032 Audio: 38905 RealTimeX: 1870.15
~Group 2 completed Aggregate Total Time: 26.5266 Audio: 58357.4 RealTimeX: 2199.96
~Group 3 completed Aggregate Total Time: 31.8119 Audio: 77809.9 RealTimeX: 2445.94
~Group 4 completed Aggregate Total Time: 37.179 Audio: 97262.4 RealTimeX: 2616.06
~Group 5 completed Aggregate Total Time: 42.5534 Audio: 116715 RealTimeX: 2742.79
~Group 6 completed Aggregate Total Time: 48.0023 Audio: 136167 RealTimeX: 2836.68
~Group 7 completed Aggregate Total Time: 49.4219 Audio: 155620 RealTimeX: 3148.8
~Group 8 completed Aggregate Total Time: 54.2707 Audio: 175072 RealTimeX: 3225.91
~Group 9 completed Aggregate Total Time: 57.2566 Audio: 194525 RealTimeX: 3397.42
Overall: Aggregate Total Time: 57.2567 Total Audio: 194525 RealTimeX: 3397.42
%WER 5.54 [ 29134 / 525760, 3900 ins, 2321 del, 22913 sub ]
%SER 51.50 [ 13494 / 26200 ]
Scored 26200 sentences, 0 not present in hyp.
Decoding completed successfully.
Total RTF: 3397.42 Average RTF: 3397.4200 Average WER: 5.5400
All WER and PERF tests passed.
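The summary lines in this output are internally consistent, which can be checked with a short script. The sketch assumes that "Audio" and "Total Time" are both in seconds (so RealTimeX = audio / time) and that %WER is (insertions + deletions + substitutions) / reference words; the regular expressions are written against the three lines quoted here, not against any documented log format.

```python
import re

LOG = """\
Overall: Aggregate Total Time: 57.2567 Total Audio: 194525 RealTimeX: 3397.42
%WER 5.54 [ 29134 / 525760, 3900 ins, 2321 del, 22913 sub ]
%SER 51.50 [ 13494 / 26200 ]
"""

overall = re.search(
    r"Total Time: ([\d.]+) Total Audio: ([\d.]+) RealTimeX: ([\d.]+)", LOG)
seconds, audio, reported_rtfx = map(float, overall.groups())

wer = re.search(
    r"%WER ([\d.]+) \[ (\d+) / (\d+), (\d+) ins, (\d+) del, (\d+) sub \]", LOG)
reported_wer = float(wer.group(1))
errors, words, ins, dele, sub = map(int, wer.groups()[1:])

# Recompute both metrics from the raw counts.
rtfx = audio / seconds
wer_pct = 100.0 * (ins + dele + sub) / words

print(f"RTFx recomputed: {rtfx:.2f} (reported {reported_rtfx})")
print(f"WER recomputed: {wer_pct:.2f}% (reported {reported_wer}%)")
```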
BENCHMARK FEATURES
Transcribes a corpus of audio using multiple threads and an NVIDIA GPU
Create corpus from a directory of wav files or use a provided corpus
Scores transcriptions when gold text is present
Comes with two English models (LibriSpeech & ASpIRE)
Highly tunable through various parameters
CPU_THREADS, GPU_THREADS, NUM_PROCESSES, SEGMENT_SIZE, ITERATIONS, etc
https://devblogs.nvidia.com/gpu-accelerated-speech-to-text-with-kaldi-a-tutorial-on-getting-started/
NVIDIA TECHNICAL CONTRIBUTORS
*Justin Luitjens
Senior Developer Technology Engineer
*Ryan Leary
Senior Applied Research Scientist
Hugo Braun
Senior AI Developer Technology Engineer
*Levi Barnes
Developer Technology Engineer
*David Taubenheim
Senior Solutions Architect
*Attending GTC-DC 2019, Come ask questions and tell us how we can help solve your mission!
*Adam Thompson
Senior Solutions Architect
Nigel Cannings, CEO
INTELLIGENT VOICE LTD
@intelligentvox
www.intelligentvoice.com
Some of the world’s best speech solutions are driven by IV
+130 more..
Intelligent Voice Limited is a global leader in the development of proactive compliance and eDiscovery technology solutions for voice, video and other media. Its clients include government agencies, banks, securities firms, Call-Centers, litigation support providers, international consultancy, advisory businesses and insurers, all involved in the management of risk and meeting of multi-jurisdictional regulation.
A Brief History of GPU Accelerated Voice
May 2 2007: Nigel reads about “A Supercomputer on your Desk”
May 3 2007: Nigel’s wife: “We haven’t got a spare 60k”
Aug 1 2013: The UK Government gives Nigel a Grant to GPU Accelerate ASR
Early 2014: 27 CUDA programmers tell Nigel it is Impossible
Jun 11 2014: One man says “Alright, I’ll give that a try”
Mar 17 2015: Nigel Releases GPU Powered ASR at GTC 2015 Running at 31 x Realtime on a K80!
Nov 6 2019: Nigel and NVIDIA Show the same process running 1000X Realtime on a V100
Intelligent Voice – Model Performance
Real world speed: x250 on a T4, x1000 on a V100 32GB
CPU vs GPU accuracy virtually identical
[Figure: bar chart of RTFx (0 to 1800) per model, 1x V100 16GB SXM vs. 2x Intel E5-2698, Beam=15, Lattice Beam=2.5. Data labels give the GPU speedup for each of 23 models, ranging from 9.3x to 18.0x.]
Nobody just wants a transcript..
It is in theory possible to extrapolate the whole of creation—every Galaxy, every sun, every planet, their orbits, their composition, and their economic and social history from, say, one small piece of fairy cake.
Douglas Adams
The Voice Suite
High Speed ASR – Lightning Fast Speech-to-text
Live Call Monitoring – Catch anomalies in real-time
IVNOTE + Smart Transcript – Search what’s said
Model Building – Accelerates ‘learning’ and accuracy
API-based Integration – Let our features enhance yours
Onsite or in-cloud – Choose where your data lives
Biometric Search – Voice ID
Hyperphonic Search – Sounds & phrases searched, instantly
INDEX – Intelligent Voice indexes key words and phrases from your telephone calls
SPEECH TO TEXT – This allows you to search for telephone calls as if they were text.
ANALYSE – Add-on modules give you the power to analyze calls and track behavior.
STORE – You have full control of your data - Securely encrypted on your cloud platform of choice.
Emotional Intelligence – Behavioural analysis of human speech
PCI Redaction – Automatically remove Payment Card information from audio recordings.
Edge Processing – On device speech recognition, with FULL vocabulary
Live conference Transcription – Instant and accurate.
Encrypted Search: PATENTED – Search sound, keeping the words hidden
IN DEVELOPMENT
FEATURES: CURRENTLY AVAILABLE
Automated Language Detection
What can we do with it - use cases
Communication Surveillance (Financial Institutions)
  Compliance monitoring
  Surveillance – Voice, Chat, Email, Web conference
Live and post Call monitoring (Call Centre, Law Enforcement)
  Key word and Phrase spotting and alerting
  Biometric authorisation
  Quality Assurance reporting
  PCI data identification and redaction
E-Discovery (Legal Service Providers, Forensic analysts, Regulators)
  Search & Review of large audio and text data sets
  Biometric Search, persons of Interest
Fraud Detection (Insurance Claims)
Credibility Analysis (Earnings Calls)
  Behavioural analysis of Voice communication
Intelligent Voice Differentiators
GPU Acceleration Model Training
Language Detection
Lattice/Alternate Search
Integrations
Optimised Pipeline
Audio Pre-Filtering
Custom VAD
Low Confidence Region Boosting
Number Optimisation
Dynamic Lattice Boost
Available Languages
English – UK
English – US
English – SA
English – AUS
English – Global
Spanish – MEX
Spanish – EU
Catalan
German – DE
German – Swiss
Portuguese – BR
Portuguese – EU
Dutch
Norwegian
Danish
Japanese
French
Russian
Korean
Mandarin
Tagalog
Cantonese
Italian
Canadian French
Coming Soon:
Arabic
SmartTranscript™
Trace Alert Terms
Lattice Matching
Redact text and audio direct in your review platform using simple word highlighting
Karaoke
Automated Topics
Speaker Separated Transcripts
See Word Alternatives
“Vox in a Box”
Pre-Configured Speech Server
Transcription in 20+ Languages and Dialects
Fully Trainable
REST-based API
Highly Optimised for Speed
EDGE or Data Centre
10-50,000 hours per day
GPU powered
"In an infinite universe, the one thing sentient life cannot afford to have is a sense of proportion."
Douglas Adams