
April 4-7, 2016 | Silicon Valley

Minmin Sun, NVIDIA

minmins@nvidia.com

April 5th

HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU


AGENDA

Brief Introduction of CTC

Alpha/Beta Matrix Computation

Gradient Matrix Computation

Overall Performance


BRIEF INTRODUCTION OF CTC


BRIEF INTRODUCTION OF CTC Overview

CTC is a loss function used to train the RNN

Inputs: (1) 𝑝, the softmax output; (2) the label sequence

Output: 𝑔, the gradient w.r.t. the output layer

CTC includes: (1) Alpha computation, (2) Beta computation, (3) Gradient computation
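For concreteness, below is a minimal sketch of what such a loss interface looks like for a single utterance; the function name and memory layout are assumptions for illustration, not the actual API used in this work.

```cuda
// Illustrative CTC interface for one utterance (name and layout are assumed, not the talk's code).
// p:     T x A softmax probabilities, one row per time-step (row-major)
// label: original label sequence of length L
// grad:  T x A buffer receiving the gradient w.r.t. the softmax output
// Returns the CTC loss (negative log-likelihood of the label sequence).
float ctc_loss(const float* p, int T, int A,
               const int* label, int L,
               float* grad);
```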

[Figure: RNN training pipeline unrolled over frames t-1, t, t+1. At each frame t, the hidden layer h[t] feeds the output layer y[t], which is passed through a softmax to produce 𝑝[t]. CTC takes 𝑝[t-1], 𝑝[t], 𝑝[t+1], … together with the label sequence ('C', 'A', 'T') and produces the gradients 𝑔[t-1], 𝑔[t], 𝑔[t+1], … for the output layer.]


BRIEF INTRODUCTION OF CTC Alpha/Beta Matrix Computation

Matrix dimensions: 𝑇 rows × 𝑆 columns

𝑆 = 2𝐿 + 1 is the length of the augmented label sequence 𝑙

𝐿 is the number of characters in the original label sequence (e.g. for the label 'CAT', 𝐿 = 3, 𝑙 = blank, C, blank, A, blank, T, blank, so 𝑆 = 7)

𝑇 is the number of time-steps in the utterance


$$
\alpha_t(s) =
\begin{cases}
\big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\big)\, p_t(l_s), & \text{if } l_s = \text{blank} \text{ or } l_s = l_{s-2}\\[4pt]
\big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\big)\, p_t(l_s), & \text{otherwise}
\end{cases}
$$
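As a reading aid, here is a minimal sequential sketch of this recurrence, written against an assumed row-major 𝑇×𝑆 alpha matrix and 𝑇×𝐴 probability matrix; initialization of the first row and numerical scaling are omitted.

```cuda
// Sequential reference for the alpha recurrence (illustrative sketch, not the optimized kernel).
// alpha: T x S, p: T x A, l: augmented label of length S = 2L + 1, blank = index of the blank symbol.
void alpha_reference(float* alpha, const float* p, const int* l,
                     int T, int S, int A, int blank) {
    for (int t = 1; t < T; ++t) {
        for (int s = 0; s < S; ++s) {
            float sum = alpha[(t - 1) * S + s];
            if (s >= 1) sum += alpha[(t - 1) * S + (s - 1)];
            // The third term is skipped for blanks and for repeated characters.
            if (s >= 2 && l[s] != blank && l[s] != l[s - 2])
                sum += alpha[(t - 1) * S + (s - 2)];
            alpha[t * S + s] = sum * p[t * A + l[s]];
        }
    }
}
```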


BRIEF INTRODUCTION OF CTC Alpha/Beta Matrix Computation


[Figure: dependency pattern of the recurrence. 𝛼(t, s) in row t is computed from 𝛼(t-1, s-2), 𝛼(t-1, s-1), and 𝛼(t-1, s) in row t-1. The columns of the 𝛼 matrix correspond to the augmented label sequence 𝑙 = blank, c, blank, a, blank, t, blank.]


BRIEF INTRODUCTION OF CTC Gradient Matrix Computation

Matrix dimensions: 𝑇 rows × 𝐴 columns

𝐴 is the alphabet size, e.g. 28 for English

Key-value reduction using the character 𝑙(𝑠) as the key

$$
g_t(a) = p_t(a) - \frac{1}{p_t(a)\cdot \mathit{nll}} \sum_{s:\ l_s = a} \alpha_t(s)\,\beta_t(s)
$$


BRIEF INTRODUCTION OF CTC Gradient Matrix Computation

[Figure: for row t, the products 𝜶(t, s)·𝜷(t, s) over the augmented label sequence 𝒍 = blank, C, blank, A, blank, T, blank are reduced by character into row t of the gradient matrix 𝒈, whose columns are blank, A, B, C, …, Z, space.]


ALPHA/BETA MATRIX COMPUTATION


ALPHA/BETA MATRIX COMPUTATION GPU Implementation

Each CUDA block owns one sequence, i.e. the number of blocks equals the minibatch size

Each thread owns one column of the Alpha/Beta matrix

Threads iterate over the matrix rows, with a synchronization after each iteration

[Figure: within a block, thread s computes 𝛼(t, s) from 𝛼(t-1, s-2), 𝛼(t-1, s-1), and 𝛼(t-1, s), produced in the previous iteration by threads s-2, s-1, and s.]


ALPHA/BETA MATRIX COMPUTATION Data Reuse

𝒍(𝒔) and 𝒍(𝒔 − 𝟐) are used in every iteration

They are invariant across all iterations

So load them into the register file once and reuse them in every iteration of the thread

$$
\alpha_t(s) =
\begin{cases}
\big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\big)\, p_t(\boldsymbol{l_s}), & \text{if } \boldsymbol{l_s} = \text{blank} \text{ or } \boldsymbol{l_s} = \boldsymbol{l_{s-2}}\\[4pt]
\big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\big)\, p_t(\boldsymbol{l_s}), & \text{otherwise}
\end{cases}
$$


ALPHA/BETA MATRIX COMPUTATION Data Reuse

𝜶(𝒕 − 𝟏, 𝒔) is the output of the previous iteration of the same thread

So it can be passed forward through the register file

$$
\alpha_t(s) =
\begin{cases}
\big(\boldsymbol{\alpha_{t-1}(s)} + \alpha_{t-1}(s-1)\big)\, p_t(l_s), & \text{if } l_s = \text{blank} \text{ or } l_s = l_{s-2}\\[4pt]
\big(\boldsymbol{\alpha_{t-1}(s)} + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\big)\, p_t(l_s), & \text{otherwise}
\end{cases}
$$


ALPHA/BETA MATRIX COMPUTATION Data Reuse

𝜶(𝒕 − 𝟏, 𝒔 − 𝟏) and 𝜶(𝒕 − 𝟏, 𝒔 − 𝟐) are outputs of the previous iteration of the other threads in the same block

So they are exchanged through shared memory (see the kernel sketch after the equation below)

$$
\alpha_t(s) =
\begin{cases}
\big(\alpha_{t-1}(s) + \boldsymbol{\alpha_{t-1}(s-1)}\big)\, p_t(l_s), & \text{if } l_s = \text{blank} \text{ or } l_s = l_{s-2}\\[4pt]
\big(\alpha_{t-1}(s) + \boldsymbol{\alpha_{t-1}(s-1)} + \boldsymbol{\alpha_{t-1}(s-2)}\big)\, p_t(l_s), & \text{otherwise}
\end{cases}
$$
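Putting the three data-reuse points together, a simplified kernel could look like the sketch below. It is shown for a single sequence (with one block per sequence, the pointers would be offset by blockIdx.x), assumes the block is launched with exactly S threads and that row 0 of alpha is already initialized, and omits numerical scaling and the beta pass; the names are illustrative, not the talk's actual code.

```cuda
// Illustrative sketch: one thread per column of alpha, iterating over rows.
// alpha: T x S, p: T x A, l: augmented label (length S), row-major, single sequence.
// Launch with blockDim.x == S and S * sizeof(float) bytes of dynamic shared memory.
__global__ void alpha_kernel(float* alpha, const float* p, const int* l,
                             int T, int S, int A, int blank) {
    extern __shared__ float prev_row[];          // alpha(t-1, *), shared within the block
    int s = threadIdx.x;

    // l(s) and l(s-2) never change, so keep them in registers for all iterations.
    int ls  = l[s];
    int ls2 = (s >= 2) ? l[s - 2] : -1;
    bool three_terms = (s >= 2) && (ls != blank) && (ls != ls2);

    // alpha(t-1, s) also stays in a register; it is this thread's own previous output.
    float a_prev = alpha[0 * S + s];             // assumes row 0 was initialized already
    prev_row[s] = a_prev;
    __syncthreads();

    for (int t = 1; t < T; ++t) {
        float sum = a_prev;                                  // alpha(t-1, s)   from a register
        if (s >= 1) sum += prev_row[s - 1];                  // alpha(t-1, s-1) from shared memory
        if (three_terms) sum += prev_row[s - 2];             // alpha(t-1, s-2) from shared memory
        float a = sum * p[t * A + ls];
        alpha[t * S + s] = a;
        __syncthreads();                                     // all reads of prev_row are done
        prev_row[s] = a;                                     // publish alpha(t, s)
        a_prev = a;
        __syncthreads();                                     // all writes are visible before the next row
    }
}
```

A beta kernel follows the same pattern, iterating over the rows in reverse.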


ALPHA/BETA MATRIX COMPUTATION

Performance on Titan X – Small Alphabet Size (T=150, L=40, A=28)

minibatch N | warp-ctc | optimized | speedup
N=1         | 0.41ms   | 0.22ms    | 1.89x
N=16        | 0.42ms   | 0.23ms    | 1.84x
N=32        | 0.42ms   | 0.23ms    | 1.82x
N=64        | 0.43ms   | 0.26ms    | 1.70x
N=128       | 0.47ms   | 0.30ms    | 1.56x

warp-ctc: https://github.com/baidu-research/warp-ctc


ALPHA/BETA MATRIX COMPUTATION

Performance on Titan X – Large Alphabet Size (T=150, L=20, A=5000)

minibatch N | warp-ctc | optimized | speedup
N=1         | 0.41ms   | 0.25ms    | 1.65x
N=16        | 0.47ms   | 0.28ms    | 1.66x
N=32        | 0.47ms   | 0.28ms    | 1.65x
N=64        | 0.48ms   | 0.29ms    | 1.65x
N=128       | 0.50ms   | 0.30ms    | 1.68x

warp-ctc: https://github.com/baidu-research/warp-ctc


GRADIENT MATRIX COMPUTATION


GRADIENT MATRIX COMPUTATION GPU Implementation

Each block owns one row of the Alpha and Beta matrices, i.e. the number of blocks equals minibatch size × 𝑇

Within each block, a key-value reduction is performed through atomic operations on shared memory (see the sketch below)

[Figure: block 𝑡 reduces the products 𝜶·𝜷 of its row by character into an accumulator row 𝒈 (blank, A, B, C, …, Z, space) held in the shared memory of block 𝑡.]
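A minimal sketch of this per-row key-value reduction is shown below; the handling of the 1/(p·nll) factor follows the formula on the earlier slide, while the launch configuration (one block per row of one sequence, A floats of dynamic shared memory) and all names are assumptions for illustration.

```cuda
// Illustrative sketch: one block per row t reduces alpha(t,s)*beta(t,s) by character
// into a shared-memory accumulator of size A, then writes the gradient row.
__global__ void grad_kernel(float* g, const float* p,
                            const float* alpha, const float* beta,
                            const int* l, int S, int A, float nll) {
    extern __shared__ float acc[];               // one accumulator per character in the alphabet
    int t = blockIdx.x;                          // this block owns row t

    // Zero the per-character accumulators.
    for (int a = threadIdx.x; a < A; a += blockDim.x) acc[a] = 0.0f;
    __syncthreads();

    // Key-value reduction: key = character l(s), value = alpha(t,s) * beta(t,s).
    for (int s = threadIdx.x; s < S; s += blockDim.x)
        atomicAdd(&acc[l[s]], alpha[t * S + s] * beta[t * S + s]);
    __syncthreads();

    // g(t,a) = p(t,a) - acc[a] / (p(t,a) * nll)
    for (int a = threadIdx.x; a < A; a += blockDim.x) {
        float pa = p[t * A + a];
        g[t * A + a] = pa - acc[a] / (pa * nll);
    }
}
```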


GRADIENT MATRIX COMPUTATION Compute for Blanks Separately

Blanks contribute most of the address conflicts

We know their exact positions in the augmented label sequence (every even index of 𝒍)

Computing the blanks separately turns their reduction into a normal parallel reduction problem (see the sketch below)

[Figure: block 𝑡 splits its row of 𝜶·𝜷 according to the augmented label 𝒍 = blank, C, blank, A, blank, T, blank: the blank positions are reduced with a standard shared-memory reduction, while the remaining characters are reduced with atomics into a second shared-memory buffer.]
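A minimal sketch of the blank-only part of the idea is shown below, assuming the blanks sit at the even indices of the augmented label as in the figure; the buffer size and the power-of-two block size are illustrative assumptions.

```cuda
// Inside the per-row block: reduce alpha(t,s)*beta(t,s) over the blank positions
// (s = 0, 2, 4, ..., 2L) with an ordinary tree reduction instead of atomics.
__device__ float reduce_blanks(const float* alpha_row, const float* beta_row,
                               int S, float* scratch /* >= blockDim.x floats */) {
    float v = 0.0f;
    for (int s = 2 * threadIdx.x; s < S; s += 2 * blockDim.x)
        v += alpha_row[s] * beta_row[s];          // blanks occupy the even indices
    scratch[threadIdx.x] = v;
    __syncthreads();

    // Standard shared-memory tree reduction (blockDim.x assumed to be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();
    }
    return scratch[0];                            // total blank contribution for this row
}
```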


GRADIENT MATRIX COMPUTATION Allocate Redundant Shared Memory

Allocating redundant copies of the shared-memory accumulator reduces address conflicts for the atomic operations (see the sketch below)

The partial results in the redundant shared-memory elements are then accumulated for each character in parallel

Not applicable for languages with a large alphabet size, like Chinese
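A minimal sketch of the redundant-accumulator idea follows; the number of copies R and the per-warp indexing are illustrative choices, not necessarily those used in the talk.

```cuda
// Inside the per-row block: R redundant copies of the A-entry accumulator in shared memory.
// Threads in different warps atomically add into different copies, cutting conflicts on
// hot keys roughly by a factor of R; a second pass folds the copies back together.
#define R 4                                       // illustrative number of redundant copies

__device__ void grad_reduce_redundant(const float* alpha_row, const float* beta_row,
                                      const int* l, int S, int A,
                                      float* acc /* R * A floats of shared memory */) {
    for (int i = threadIdx.x; i < R * A; i += blockDim.x) acc[i] = 0.0f;
    __syncthreads();

    int copy = (threadIdx.x / 32) % R;            // spread warps across the R copies
    for (int s = threadIdx.x; s < S; s += blockDim.x)
        atomicAdd(&acc[copy * A + l[s]], alpha_row[s] * beta_row[s]);
    __syncthreads();

    // Fold the R copies into copy 0, one character per thread (in parallel over a).
    for (int a = threadIdx.x; a < A; a += blockDim.x)
        for (int r = 1; r < R; ++r)
            acc[a] += acc[r * A + a];
    __syncthreads();
}
```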


GRADIENT MATRIX COMPUTATION Reuse the memory of Matrix 𝑝 for Gradient Matrix 𝑔

For a large alphabet such as Chinese, the reduction yields 0 for more than 99% of the characters

So more than 99% of the elements of matrix 𝑔 are identical to those of matrix 𝑝, and nearly half of the time is spent "copying" them from matrix 𝑝 to matrix 𝑔

Matrix 𝑝 is no longer needed after the gradient computation

By reusing the memory of matrix 𝑝 for the gradient matrix 𝑔, we only need to update less than 1% of the matrix elements (see the sketch after the formula below)

Not necessary for languages with a small alphabet size, like English

$$
g_t(a) = p_t(a) - \frac{1}{p_t(a)\cdot \mathit{nll}} \sum_{s:\ l_s = a} \alpha_t(s)\,\beta_t(s)
$$
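Below is a minimal sketch of the in-place update. It assumes the gradient buffer aliases the memory of 𝑝, one block per row t launched with S threads, and uses a simple first-occurrence check (my own construction, not necessarily the talk's) so that repeated characters and the blanks are corrected exactly once.

```cuda
// Illustrative sketch: in-place gradient for large alphabets. On entry, pg holds the
// softmax output p; on exit it holds the gradient g. Only columns whose character
// occurs in the augmented label l are modified; every other column already satisfies g = p.
// One block per row t, blockDim.x == S; S = 2L + 1 is small, so the O(S^2) check is cheap.
__global__ void grad_inplace_kernel(float* pg, const float* alpha, const float* beta,
                                    const int* l, int S, int A, float nll) {
    extern __shared__ float ab[];                 // alpha(t,s) * beta(t,s) for this row
    int t = blockIdx.x;
    int s = threadIdx.x;
    ab[s] = alpha[t * S + s] * beta[t * S + s];
    __syncthreads();

    // Only the first occurrence of each character applies the update, so that
    // repeated characters (and the L+1 blanks) are corrected exactly once.
    bool first = true;
    for (int u = 0; u < s; ++u)
        if (l[u] == l[s]) { first = false; break; }
    if (!first) return;

    float sum = 0.0f;
    for (int u = s; u < S; ++u)
        if (l[u] == l[s]) sum += ab[u];           // key-value reduction for this character

    float pa = pg[t * A + l[s]];
    pg[t * A + l[s]] = pa - sum / (pa * nll);     // g(t,a) = p(t,a) - sum / (p(t,a) * nll)
}
```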


GRADIENT MATRIX COMPUTATION

Performance on Titan X – Small Alphabet Size (T=150, L=40, A=28)

minibatch N | warp-ctc | optimized | speedup
N=1         | 2.16ms   | 0.02ms    | 134.89x
N=16        | 2.19ms   | 0.06ms    | 37.26x
N=32        | 2.20ms   | 0.11ms    | 19.32x
N=64        | 2.23ms   | 0.21ms    | 10.49x
N=128       | 2.24ms   | 0.41ms    | 5.52x

For warp-ctc, this is the run time of the kernel compute_betas_grad_kernel minus the run time of compute_alpha_kernel.


GRADIENT MATRIX COMPUTATION

Performance on Titan X – Large Alphabet Size (T=150, L=20, A=5000)

minibatch N | warp-ctc | optimized | speedup
N=1         | 5.52ms   | 0.04ms    | 128.26x
N=16        | 6.36ms   | 0.21ms    | 30.28x
N=32        | 6.49ms   | 0.47ms    | 13.73x
N=64        | 6.75ms   | 0.78ms    | 8.67x
N=128       | 7.20ms   | 1.56ms    | 4.63x

For warp-ctc, this is the run time of the kernel compute_betas_grad_kernel minus the run time of compute_alpha_kernel.


OVERALL PERFORMANCE


OVERALL PERFORMANCE

CTC (Alpha + Beta + Gradient) on Titan X – Small Alphabet Size (T=150, L=40, A=28)

minibatch N | warp-ctc | optimized | speedup
N=1         | 2.98ms   | 0.45ms    | 6.57x
N=16        | 3.03ms   | 0.51ms    | 5.92x
N=32        | 3.05ms   | 0.58ms    | 5.25x
N=64        | 3.10ms   | 0.72ms    | 4.27x
N=128       | 3.18ms   | 1.01ms    | 3.14x


OVERALL PERFORMANCE

CTC (Alpha + Beta + Gradient) on Titan X – Large Alphabet Size (T=150, L=20, A=5000)

minibatch N | warp-ctc | optimized | speedup
N=1         | 6.34ms   | 0.54ms    | 11.67x
N=16        | 7.30ms   | 0.77ms    | 9.43x
N=32        | 7.43ms   | 1.04ms    | 7.14x
N=64        | 7.71ms   | 1.36ms    | 5.67x
N=128       | 8.20ms   | 2.15ms    | 3.81x


OVERALL PERFORMANCE

Softmax + CTC on Titan X – Small Alphabet Size (T=150, L=40, A=28)

minibatch N | warp-ctc | optimized | speedup
N=1         | 3.12ms   | 0.59ms    | 5.28x
N=16        | 3.16ms   | 0.65ms    | 4.89x
N=32        | 3.20ms   | 0.88ms    | 3.65x
N=64        | 3.30ms   | 1.08ms    | 3.07x
N=128       | 3.49ms   | 1.37ms    | 2.56x


OVERALL PERFORMANCE

Softmax + CTC on Titan X – Large Alphabet Size (T=150, L=20, A=5000)

minibatch N | warp-ctc | optimized | speedup
N=1         | 6.61ms   | 0.79ms    | 8.34x
N=16        | 9.13ms   | 2.69ms    | 3.40x
N=32        | 11.01ms  | 4.92ms    | 2.24x
N=64        | 14.83ms  | 8.67ms    | 1.71x
N=128       | 22.36ms  | 16.49ms   | 1.36x

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join