GPU ACCELERATED SPEECH-TO-TEXT - NVIDIA...Personal assistants, Medical transcription, Call center...

GTC - DC – November - 2019

GPU ACCELERATED SPEECH-TO-TEXT


AGENDA

1) Brief introduction to speech processing

2) Previous Results

3) Performance Updates Since GTC

4) Kaldi Container

5) Production Deployment with Intelligent Voice


Speech Recognition: the process of taking a raw audio signal and transcribing to text

Use of Automatic Speech Recognition has exploded in the last ten years:

Personal assistants, Medical transcription, Call center analytics, Video search, etc

INTRODUCTION TO ASR

Translating Speech into Text

[Figure: the utterance "NVIDIA is cool" shown with a weighted word graph; arc labels include nvidia:nvidia/1.0, ai:ai/1.24, speech:speech/1.63 and 0/0.98]

KALDI

Kaldi is a speech processing framework out of Johns Hopkins University

Uses a combination of DL and ML algorithms for speech processing

Started in 2009 with the intent to reduce the time and cost needed to build ASR systems

http://kaldi-asr.org/

Considered state-of-the-art

Speech Processing Framework


SPEECH RECOGNITION

• Kaldi fuses known state-of-the-art techniques from speech recognition with deep learning

• Hybrid DL/ML approach continues to perform better than deep learning alone

• "Classical" ML Components:

• Mel-Frequency Cepstral Coefficients (MFCC) features: represent audio as a spectrum of a spectrum

• I-vectors: use factor analysis and Gaussian Mixture Models to learn a speaker embedding, which helps the acoustic model adapt to variability in speakers

• Predict phone states (HMM): unlike "end-to-end" DL models, Kaldi acoustic models predict context-dependent phone substates as Hidden Markov Model (HMM) states

• The result is a system that, to date, is more robust than DL-only approaches and typically requires less data to train

State of the Art
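To make "spectrum of a spectrum" concrete, here is a minimal single-frame MFCC sketch in plain Python. It is not Kaldi's implementation: the Hamming window, the naive DFT, the mel-scale formula and the defaults (23 filters, 13 coefficients, 16 kHz) are illustrative assumptions, and all function names are hypothetical.

```python
import cmath
import math


def hz_to_mel(hz):
    # A common mel-scale approximation (assumed, not taken from the slides).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """First 'spectrum': naive DFT power of a Hamming-windowed frame."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    power = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        power.append(abs(acc) ** 2)
    return power


def mel_energies(power, sample_rate, num_filters):
    """Pool the power spectrum with triangular filters spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    hz_per_bin = (sample_rate / 2.0) / (len(power) - 1)
    energies = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        total = 0.0
        for k, p in enumerate(power):
            f = k * hz_per_bin
            if lo < f <= mid:
                total += p * (f - lo) / (mid - lo)   # rising slope
            elif mid < f < hi:
                total += p * (hi - f) / (hi - mid)   # falling slope
        energies.append(total)
    return energies


def mfcc_frame(frame, sample_rate=16000, num_filters=23, num_ceps=13):
    """Second 'spectrum': DCT-II of the log mel energies of one frame."""
    energies = mel_energies(power_spectrum(frame), sample_rate, num_filters)
    log_mel = [math.log(e + 1e-10) for e in energies]
    return [sum(v * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, v in enumerate(log_mel))
            for c in range(num_ceps)]


if __name__ == "__main__":
    # One 25 ms frame (400 samples at 16 kHz) of a 440 Hz tone.
    tone = [math.sin(2.0 * math.pi * 440.0 * i / 16000.0) for i in range(400)]
    print(mfcc_frame(tone))
```

A production front end would use an FFT and process many overlapping frames at once; the naive DFT here only keeps the sketch dependency-free.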


KALDI SPEECH PROCESSING PIPELINE

[Figure: pipeline Raw Audio -> Feature Extraction -> Acoustic Model -> Language Model -> Output ("NVIDIA is cool")]

Kaldi Components: MFCC & Ivectors, NNET3, Decoder, Lattice

PREVIOUS RESULTS


PREVIOUS WORK

NVIDIA Presentations/Publications:

GTC On Demand: https://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php

Spring 2018: S81034

Fall 2018: DC8189

Spring 2019: S9672

https://arxiv.org/abs/1910.10032

Devblogs:

https://devblogs.nvidia.com/nvidia-accelerates-speech-text-transcription-3500x-kaldi/

https://devblogs.nvidia.com/gpu-accelerated-speech-to-text-with-kaldi-a-tutorial-on-getting-started/


GTC 2019 ACCELERATED COMPONENTS

[Figure: pipeline Raw Audio -> Feature Extraction -> Acoustic Model -> Language Model -> Output, with the GPU-accelerated components highlighted (legend: GPU Accelerated)]

GTC-2019 PERFORMANCE

1 GPU, LibriSpeech, 19.03 container

Hardware          Perf (RTFx)   WER     Perf (speedup vs 2x Xeon)

LibriSpeech Model, Libri Clean Data
2x Intel Xeon     381           5.5     1.0x
AGX Xavier        500           5.5     1.3x
Tesla T4          1635          5.5     4.3x
Tesla V100        3524          5.5     9.2x

LibriSpeech Model, Libri Other Data
2x Intel Xeon     377           14.0    1.0x
AGX Xavier        450           14.0    1.2x
Tesla T4          1439          14.0    3.8x
Tesla V100        2854          14.0    7.6x

2x Xeon*: 2x Intel Xeon Platinum 8168, Xavier: AGX Devkit, T4*: PCI-E, V100*: SXM

Determinized lattice output, beam=10, lattice-beam=7, uses all available HW threads
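The speedup column in the table above can be reproduced from the RTFx column alone. RTFx is seconds of audio transcribed per second of wall-clock time, so the speedup is each platform's RTFx divided by the CPU baseline. The short sketch below uses the Libri Clean numbers from the table; the dictionary and function names are illustrative only.

```python
# RTFx values for the LibriSpeech model on Libri Clean data (from the table).
RTFX_CLEAN = {
    "2x Intel Xeon": 381,
    "AGX Xavier": 500,
    "Tesla T4": 1635,
    "Tesla V100": 3524,
}


def speedups(rtfx_by_hw, baseline="2x Intel Xeon"):
    """Speedup of each platform relative to the baseline, to one decimal."""
    base = rtfx_by_hw[baseline]
    return {hw: round(rtfx / base, 1) for hw, rtfx in rtfx_by_hw.items()}


if __name__ == "__main__":
    for hw, s in speedups(RTFX_CLEAN).items():
        print(f"{hw:15s} {s:.1f}x")
```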

GTC-2019 Scale-up performance

[Figure: bar chart of speedup (axis 0x to 30x) for 1, 2, 4 and 8 GPUs, with T4 and V100 series]

GPUs    T4 Performance (RTFx)    V100 Performance (RTFx)
1       1635                     3524
2       3371                     7082
4       6368                     10011
8       7906                     9399
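One way to read the scale-up figures above is as parallel efficiency: the measured RTFx on n GPUs divided by n times the single-GPU RTFx. The sketch below assumes the eight RTFx values on the slide are the T4 series followed by the V100 series for 1, 2, 4 and 8 GPUs (the single-GPU values match the earlier table); the names are illustrative.

```python
# Assumed reading of the slide: RTFx by GPU count for each series.
T4_RTFX = {1: 1635, 2: 3371, 4: 6368, 8: 7906}
V100_RTFX = {1: 3524, 2: 7082, 4: 10011, 8: 9399}


def parallel_efficiency(rtfx_by_gpus):
    """Efficiency = RTFx(n) / (n * RTFx(1)); 1.0 means perfect scaling."""
    base = rtfx_by_gpus[1]
    return {n: rtfx / (n * base) for n, rtfx in rtfx_by_gpus.items()}


if __name__ == "__main__":
    for name, series in (("T4", T4_RTFX), ("V100", V100_RTFX)):
        for n, eff in parallel_efficiency(series).items():
            print(f"{name:5s} {n} GPU(s): {eff:.0%}")
```

Under that reading, efficiency at 8 GPUs drops to roughly 60% on T4 and 33% on V100, which is the CPU-bound behavior the next slide describes.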


MULTI-GPU PERFORMANCE LIMITERS

Scalability Limited Due to CPU Overhead

Feature Extraction and Determinization become bottlenecks

CPU has a hard time keeping up with GPU performance


RECENT PERFORMANCE UPDATES

RECENT IMPROVEMENTS

Multi-threading improvements

Moved more tasks to worker threads which allows control threads to submit work faster and keep GPU busy

Reduce memory usage

Increased batch size = more performance

General Optimization

Container Improvements

Automatic segmenting and dataset preparation

Added ASpIRE Model

Since GTC 2019


LATEST SINGLE GPU PERFORMANCE

1 GPU, LibriSpeech, 19.11 Container

Hardware          19.11 Perf (RTFx)   WER     GTC 2019 (19.03) Speedup   GTC-DC 2019 (19.11) Speedup

LibriSpeech Model, Libri Clean Data
2x Intel Xeon     381                 5.5     1.0x                       1.0x
Tesla T4          1849                5.5     4.3x                       4.9x
Tesla V100        5154                5.5     9.2x                       13.5x

LibriSpeech Model, Libri Other Data
2x Intel Xeon     377                 14.0    1.0x                       1.0x
Tesla T4          1679                14.0    3.8x                       4.5x
Tesla V100        3925                14.0    7.6x                       10.4x

2x Xeon*: 2x Intel Xeon Platinum 8168, V100*: SXM

Determinized lattice output, beam=10, lattice-beam=7, uses all available HW threads
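The release-to-release gain is not shown as its own column, but it follows from the RTFx values reported for the 19.03 and 19.11 containers. The sketch below divides the 19.11 RTFx by the 19.03 RTFx for the same GPU and dataset; the dictionary layout and names are illustrative.

```python
# (19.03 RTFx, 19.11 RTFx) per GPU and dataset, taken from the two tables.
RTFX_BY_RELEASE = {
    ("Tesla T4", "clean"): (1635, 1849),
    ("Tesla V100", "clean"): (3524, 5154),
    ("Tesla T4", "other"): (1439, 1679),
    ("Tesla V100", "other"): (2854, 3925),
}


def release_gain(old_rtfx, new_rtfx):
    """How many times faster the newer container is on the same hardware."""
    return new_rtfx / old_rtfx


if __name__ == "__main__":
    for (gpu, data), (old, new) in RTFX_BY_RELEASE.items():
        print(f"{gpu:11s} {data:6s} {release_gain(old, new):.2f}x")
```

On these numbers the software changes alone account for roughly a 1.1x to 1.2x gain on T4 and a 1.4x to 1.5x gain on V100.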


SPRING 2019 ACCELERATED COMPONENTS

[Figure: pipeline Raw Audio -> Feature Extraction -> Acoustic Model -> Language Model -> Output, with the GPU-accelerated components highlighted (legend: GPU Accelerated)]

Large amount of CPU work. When scaling to multi-GPU, CPU threads cannot keep up.

GPU ACCELERATED FEATURE EXTRACTION

Reduce CPU overhead

[Figure: pipeline Raw Audio -> Feature Extraction -> Acoustic Model -> Language Model -> Output, with the GPU-accelerated components highlighted]

Batch=1 implementation moved work to the GPU, significantly reducing CPU load.

FEATURE EXTRACTION

Pipeline

[Figure: feature extraction pipeline from Raw Audio to Input Feature and Ivector Feature; boxes: Base Feature, MFCC, FBANK, Pitch, OnlineCmvn, Ivector Extraction; one box is marked "Currently not supported."]

Green = Implemented in CUDA. Individual models may not use all components.

Batch=1 implementation only

IVECTOR EXTRACTION

Pipeline

[Figure: ivector extraction pipeline from Base Feature to Ivector Feature; boxes: LDA Transform, Online CMVN, Splice, LDA Transform, Posteriors, Ivector Stats, Compute Ivector, Splice]

Green = Implemented in CUDA. Individual models may not use all components.

Batch=1 implementation only

GPU FEATURE EXTRACTION

End-to-End Scalability & Efficiency (DGX-1V)

GPU_THREADS=2, MAX_BATCH_SIZE=300, BATCH_DRAIN_SIZE=40, DATASETS=test_clean, COPY_THREADS=0

[Figure: Parallel Scalability. Real Time Factor (axis 0 to 25000) versus Number of V100-SXM-16GB (1, 2, 4, 8); series: CPU feature extraction, GPU feature extraction]

[Figure: Parallel Efficiency. Parallel Efficiency (axis 0% to 100%) versus Number of V100-SXM-16GB (1, 2, 4, 8); series: CPU feature extraction, GPU feature extraction]

FULL NODE PERFORMANCE

Good Scalability across a range of hardware platforms

[Figure: Kaldi - Multi-GPU Scalability. RTFx (axis 0 to 60000) for DGX-1V (V100 SXM 16GB), DGX-2 (V100 SXM 32 GB) and SYS-6049GP-TRT (T4); series: 1, 2, 4, 8, 16 and 20 GPUs]

[Figure: Kaldi - Multi-GPU Efficiency. Parallel Efficiency (axis 0% to 120%) for DGX-1V (V100 SXM 16GB), DGX-2 (V100 SXM 32 GB) and SYS-6049GP-TRT (T4); series: 1, 2, 4, 8, 16 and 20 GPUs]

Callout: 14 hours in one second
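As a sanity check on the "14 hours in one second" callout: because RTFx is seconds of audio processed per second of wall-clock time, 14 hours of audio per second corresponds to an RTFx of 14 * 3600 = 50,400, which sits inside the chart's 0 to 60000 axis. The helper names below are illustrative.

```python
SECONDS_PER_HOUR = 3600


def rtfx_for(audio_hours, wall_seconds):
    """RTFx needed to transcribe `audio_hours` of audio in `wall_seconds`."""
    return audio_hours * SECONDS_PER_HOUR / wall_seconds


def audio_hours_per_second(rtfx):
    """Hours of audio transcribed in one wall-clock second at a given RTFx."""
    return rtfx / SECONDS_PER_HOUR


if __name__ == "__main__":
    print(rtfx_for(14, 1))                # RTFx implied by the callout
    print(audio_hours_per_second(3524))   # single V100 figure from GTC 2019
```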


FUTURE WORK

Batched Feature Extraction

More performance

Online Speech Pipeline

Streaming audio

Lower latency, higher throughput, less memory

Can emulate offline with same benefits

More models

Look for these features at GTC - 2020


THE NGC CONTAINER REGISTRY

Simple Access to GPU-Accelerated Software

Discover over 40 GPU-Accelerated Containers: Spanning deep learning, machine learning, HPC applications, HPC visualization, and more

Innovate in Minutes, Not Weeks: Pre-configured, ready-to-run

Run Anywhere: The top cloud providers, NVIDIA DGX Systems, PCs and workstations with select NVIDIA GPUs, and NGC-Ready systems

NGC CONTAINER

Get an NGC account: https://ngc.nvidia.com/signup

Free & Easy

# log in to NGC, pull the container, and run it

%> docker login nvcr.io

%> docker pull nvcr.io/nvidia/kaldi:19.10-py3

%> nvidia-docker run --rm -it nvcr.io/nvidia/kaldi:19.10-py3

#prepare models and data

%> cd /workspace/nvidia-examples/librispeech

%> ./prepare_data.sh

#run benchmark

%> ./run_benchmark.sh


BENCHMARK OUTPUT

NGC Container

Process 0:

~Group 0 completed Aggregate Total Time: 15.3179 Audio: 19452.5 RealTimeX: 1269.92

~Group 1 completed Aggregate Total Time: 20.8032 Audio: 38905 RealTimeX: 1870.15









Overall: Aggregate Total Time: 57.2567 Total Audio: 194525 RealTimeX: 3397.42

%WER 5.54 [ 29134 / 525760, 3900 ins, 2321 del, 22913 sub ]

%SER 51.50 [ 13494 / 26200 ]

Scored 26200 sentences, 0 not present in hyp.

Decoding completed successfully.

Total RTF: 3397.42 Average RTF: 3397.4200 Average WER: 5.5400

All WER and PERF tests passed.
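The summary numbers in this log can be cross-checked by hand. The sketch below recomputes RealTimeX as audio duration divided by total time (assuming both are in seconds) and %WER as insertions plus deletions plus substitutions over the reference word count, using the values printed above. The function names are illustrative and are not part of the container's scripts.

```python
def real_time_x(audio_seconds, total_seconds):
    """Seconds of audio transcribed per second of processing time."""
    return audio_seconds / total_seconds


def error_percent(errors, total):
    """Percentage used for both %WER (words) and %SER (sentences)."""
    return 100.0 * errors / total


if __name__ == "__main__":
    # Overall line: Aggregate Total Time 57.2567, Total Audio 194525
    print(f"RealTimeX: {real_time_x(194525, 57.2567):.2f}")
    # %WER line: 3900 ins, 2321 del, 22913 sub out of 525760 words
    print(f"%WER: {error_percent(3900 + 2321 + 22913, 525760):.2f}")
    # %SER line: 13494 of 26200 sentences contain at least one error
    print(f"%SER: {error_percent(13494, 26200):.2f}")
```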


BENCHMARK FEATURES

Transcribes a corpus of audio using multiple threads and an NVIDIA GPU

Create corpus from a directory of wav files or use a provided corpus

Scores transcriptions when gold text is present

Comes with two English models (LibriSpeech & ASpIRE)

Highly tunable through various parameters

CPU_THREADS, GPU_THREADS, NUM_PROCESSES, SEGMENT_SIZE, ITERATIONS, etc
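Scoring against gold text is typically done with a word-level edit distance. The sketch below is a generic minimum-edit alignment that counts substitutions, deletions and insertions; it is an illustration of the idea, not the scoring tool shipped in the container, and the function names are hypothetical.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (total_edits, subs, dels, ins) for ref[:i] vs hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                best = prev[j - 1]                    # words match, no edit
            else:
                e, s, d, n = prev[j - 1]
                best = (e + 1, s + 1, d, n)           # substitution
            e, s, d, n = prev[j]
            best = min(best, (e + 1, s, d + 1, n))    # reference word deleted
            e, s, d, n = cur[j - 1]
            best = min(best, (e + 1, s, d, n + 1))    # extra word inserted
            cur.append(best)
        prev = cur
    _, subs, dels, ins = prev[-1]
    return subs, dels, ins


def word_error_rate(reference, hypothesis):
    """WER = (subs + dels + ins) / number of reference words."""
    subs, dels, ins = align_counts(reference, hypothesis)
    return (subs + dels + ins) / len(reference.split())


if __name__ == "__main__":
    print(align_counts("nvidia is cool", "nvidia is very cool"))
    print(word_error_rate("nvidia is cool", "nvidia is very cool"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.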




NVIDIA TECHNICAL CONTRIBUTORS

*Justin Luitjens

Senior Developer Technology Engineer

*Ryan Leary

Senior Applied Research Scientist

Hugo Braun

Senior AI Developer Technology Engineer

*Levi Barnes

Developer Technology Engineer

*David Taubenheim

Senior Solutions Architect

*Attending GTC-DC 2019, Come ask questions and tell us how we can help solve your mission!

*Adam Thompson

Senior Solutions Architect

Nigel Cannings, CEO

INTELLIGENT VOICE LTD

@intelligentvox

www.intelligentvoice.com

Some of the world’s best speech solutions are driven by IV

+130 more..

Intelligent Voice Limited is a global leader in the development of proactive compliance and eDiscovery technology solutions for voice, video and other media. Its clients include government agencies, banks, securities firms, Call-Centers, litigation support providers, international consultancy, advisory businesses and insurers, all involved in the management of risk and meeting of multi-jurisdictional regulation.

A Brief History of GPU Accelerated Voice

May 2 2007: Nigel reads about “A Supercomputer on your Desk”

May 3 2007: Nigel’s wife: “We haven’t got a spare 60k”

Aug 1 2013: The UK Government gives Nigel a Grant to GPU Accelerate ASR

Early 2014: 27 CUDA programmers tell Nigel it is Impossible

Jun 11 2014: One man says “Alright, I’ll give that a try”

Mar 17 2015: Nigel Releases GPU Powered ASR at GTC 2015 Running at 31 x Realtime on a K80!

Nov 6 2019: Nigel and NVIDIA Show the same process running 1000X Realtime on a V100

Intelligent Voice – Model Performance

Real world Speed:

- x250 on a T4

- x1000 on a V100 32GB

CPU vs GPU accuracy virtually identical

[Figure: bar chart of RTFx (axis 0 to 1800) comparing 1x V100 16GB SXM with 2x Intel E5-2698; Beam=15, Lattice Beam=2.5. Data labels = speedup: 13.6x, 14.5x, 13.7x, 15.6x, 15.2x, 15.4x, 12.1x, 17.1x, 12.8x, 15.8x, 11.1x, 12.2x, 18.0x, 14.4x, 13.5x, 15.4x, 12.7x, 14.4x, 14.3x, 13.3x, 15.0x, 9.3x, 14.1x]

Nobody just wants a transcript..

It is in theory possible to extrapolate the whole of creation—every Galaxy, every sun, every planet, their orbits, their composition, and their economic and social history from, say, one small piece of fairy cake.


Douglas Adams

The Voice Suite

Legend: FEATURES: CURRENTLY AVAILABLE, IN DEVELOPMENT

High Speed ASR: Lightning Fast Speech-to-text

Live Call Monitoring: Catch anomalies in real-time

IVNOTE + Smart Transcript: Search what’s said

Model Building: Accelerates ‘learning’ and accuracy

API-based Integration: Let our features enhance yours

Onsite or in-cloud: Choose where your data lives

Biometric Search: Voice ID

Hyperphonic Search: Sounds & phrases searched, instantly

INDEX: Intelligent Voice indexes key words and phrases from your telephone calls

SPEECH TO TEXT: This allows you to search for telephone calls as if they were text.

ANALYSE: Add-on modules give you the power to analyze calls and track behavior.

STORE: You have full control of your data - Securely encrypted on your cloud platform of choice.

Emotional Intelligence: Behavioural analysis of human speech

PCI Redaction: Automatically remove Payment Card information from audio recordings.

Edge Processing: On device speech recognition, with FULL vocabulary

Live conference Transcription: Instant and accurate.

Encrypted Search (PATENTED): Search sound, keeping the words hidden

Automated Language Detection

What can we do with it - use cases

Communication Surveillance (Financial Institutions): Compliance monitoring; Surveillance: Voice, Chat, Email, Web conference

Live and post Call monitoring (Call Centre, Law Enforcement): Key word and Phrase spotting and alerting; Biometric authorisation; Quality Assurance reporting; PCI data identification and redaction

E-Discovery (Legal Service Providers, Forensic analysts, Regulators): Search & Review of large audio and text data sets; Biometric Search, persons of Interest

Fraud Detection (Insurance Claims)

Credibility Analysis (Earnings Calls): Behavioural analysis of Voice communication

Intelligent Voice Differentiators

GPU Acceleration

Model Training

Language Detection

Lattice/Alternate Search

Integrations

Optimised Pipeline

Audio Pre-Filtering

Custom VAD

Low Confidence Region Boosting

Number Optimisation

Dynamic Lattice Boost

Available Languages

English – UK

English – US

English – SA

English – AUS

English – Global

Spanish – MEX

Spanish – EU

Catalan

German – DE

German – Swiss

Portuguese – BR

Portuguese – EU

Dutch

Norwegian

Danish

Japanese

French

Russian

Korean

Mandarin

Tagalog

Cantonese

Italian

Canadian French

Coming Soon:

Arabic

SmartTranscript™

Trace Alert Terms

Lattice Matching

Redact text and audio direct in your review platform using simple word highlighting

Karaoke

Automated Topics

Speaker Separated Transcripts

See Word Alternatives

“Vox in a Box”

Pre-Configured Speech Server

Transcription in 20+ Languages and Dialects

Fully Trainable

REST-based API

Highly Optimised for Speed

EDGE or Data Centre

10-50,000 hours per day

GPU powered

"In an infinite universe, the one thing sentient life cannot afford to have is a sense of proportion."

Douglas Adams
