OPTIMIZED GPU KERNELS FOR DEEP LEARNING
Amir Khosrowshahi
GTC 17 Mar 2015
Outline
• About nervana
• Optimizing deep learning at the assembler level
• Limited precision for deep learning
• neon benchmarks
About nervana
• A platform for machine intelligence
• Enables deep learning at scale
• Optimized from algorithms to silicon
Verticals
Medical · Finance · Pharma · Oil & Gas · Agriculture
• Deep learning is supplanting traditional approaches everywhere
• Small improvements have a large impact
• Customers require a clear roadmap that scales to growing needs
nervana platform for deep learning
[Diagram: Data → nervana framework (explore, train, deploy) → Solutions, hosted on the nervana cloud and running on GPUs, CPUs, and the nervana engine.]
maxas: a Maxwell Assembler
Scott Gray. See the GitHub repo for docs and examples.
• Full control of:
  • register allocation
  • instruction ordering
  • control codes (barriers, stall counts)
• Built-in scheduler (optional)
• Meta-programming
ptxas struggles with Instruction Level Parallelism
[Histogram: distribution of the number of instructions between an LDS and its dependent FFMA operands (count vs. FFMA line# minus LDS line#, roughly 1 to 151), comparing ptxas output against cublas. Short distances are bad; long distances are good.]
courtesy Scott Gray
Easy register allocation through maxas
Register banking for outer products: c = a bᵀ
[Diagram: a column vector a times a row vector bᵀ filling the accumulator tile c.]
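Spelled out, this is the outer-product formulation the diagram depicts: a GEMM is a running sum of rank-1 updates, so each step reads one column of A and one column of B but updates every element of the register-resident c tile, which is what makes careful banking of the a, b, and c registers pay off.

$$C = A B^{\top} = \sum_{k} a_k\, b_k^{\top}, \qquad c_{ij} \mathrel{+}= a_{ik}\, b_{jk}$$

where $a_k$ and $b_k$ are the $k$-th columns of $A$ and $B$.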
Example GEMM code in maxas
[Code listing, annotated with: loads from shared memory (LDS), fused fp32 multiply-adds (FFMA), control codes, dual-issued instructions, setting a dependency barrier, and barrier synchronization. A CUDA-level sketch of the same inner loop follows.]
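The slide's listing is hand-written SASS; as a rough CUDA-level illustration of the operations those annotations name (tile sizes, names, and indexing here are illustrative, not the actual maxas kernel):

// Illustrative shared-memory GEMM inner loop (not the actual maxas kernel).
__global__ void gemm_inner_loop_sketch(float* C_out) {
    __shared__ float As[8][64];   // tile of A, staged by earlier code
    __shared__ float Bs[8][64];   // tile of B

    float a[4], b[4];             // register fragments
    float c[4][4] = {};           // register-resident accumulator tile

    for (int k = 0; k < 8; ++k) {
        // "Load from shared" (LDS). In SASS each load sets a write barrier;
        // dependent FFMAs wait on it rather than on a worst-case stall.
        for (int i = 0; i < 4; ++i) a[i] = As[k][4 * threadIdx.x + i];
        for (int j = 0; j < 4; ++j) b[j] = Bs[k][4 * threadIdx.y + j];

        // "Fused fp32 multiply add" (FFMA): one outer-product step.
        // maxas orders these by hand so they dual-issue with the loads.
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                c[i][j] += a[i] * b[j];
    }

    // "Set barrier" / "Barrier sync" before the next tiles are staged.
    __syncthreads();

    // Write one element out so the compiler keeps the computation.
    C_out[threadIdx.y * blockDim.x + threadIdx.x] = c[0][0];
}

In the SASS version, the control-code fields prefixed to each instruction carry the stall counts, dual-issue flags, and dependency-barrier sets and waits that this CUDA sketch leaves to ptxas.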
Convolution kernels for deep learning
Input (C × H × W) * Filters (K filters of C × R × S) = Output (K × P × Q)
• C: number of input channels
• H × W: input spatial dims
• R × S: filter spatial dims
• K: number of filters
• P × Q: output spatial dims
• N: mini-batch dim (not shown)
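To make the notation concrete, here is a minimal direct-convolution fprop sketch (a hypothetical kernel, not nervana's; it assumes a C × H × W × N layout, unit stride, and no padding, so P = H − R + 1 and Q = W − S + 1):

// One thread per output element:
// Output[k][p][q][n] = sum over c,r,s of Input[c][p+r][q+s][n] * Filter[c][r][s][k]
__global__ void conv_fprop(const float* I,  // input,   C x H x W x N
                           const float* F,  // filters, C x R x S x K
                           float* O,        // output,  K x P x Q x N
                           int C, int H, int W, int N,
                           int R, int S, int K, int P, int Q) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;  // mini-batch index
    int q = blockIdx.y;                             // output column
    int p = blockIdx.z % P;                         // output row
    int k = blockIdx.z / P;                         // output feature map
    if (n >= N) return;

    float sum = 0.0f;
    for (int c = 0; c < C; ++c)                     // input channels
        for (int r = 0; r < R; ++r)                 // filter rows
            for (int s = 0; s < S; ++s)             // filter columns
                sum += I[((c * H + p + r) * W + q + s) * N + n]
                     * F[((c * R + r) * S + s) * K + k];
    O[((k * P + p) * Q + q) * N + n] = sum;
}
// Launch (illustrative): conv_fprop<<<dim3((N + 127) / 128, Q, K * P), 128>>>(...)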
Access patterns for matrix lowering
• Convolution kernels: fprop, bprop, update
• bprop: in each iteration there is a multiply between an N×K matrix and a K×(C·R·S) matrix; the results of the matrix multiplications are accumulated to obtain δ0 from δ1 (a deconvolution).
• update: in each iteration there is a multiply between a K×N matrix and an N×(C·R·S) matrix built from the output of the previous layer; the delta matrix is sliced and the slice transposed before the multiplication, and the results are accumulated to obtain the weight updates.
(Worked example dims: N = 3, C = 3, H = W = 3; K = 2, R = S = 2; P = Q = 2.)
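The lowering itself is only an index mapping from (c, r, s) × (p, q, n) coordinates into the input tensor. A minimal sketch of the fprop case, under the same layout and valid-convolution assumptions as before (kernel name and shapes are illustrative):

// Hypothetical im2col-style lowering for fprop: build L of shape
// (C*R*S) x (P*Q*N) so that Output = FilterMatrix (K x C*R*S) * L.
__global__ void lower_fprop(const float* I,  // input, C x H x W x N
                            float* L,        // lowered matrix
                            int C, int H, int W, int N,
                            int R, int S, int P, int Q) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // one (p, q, n) triple
    int row = blockIdx.y;                             // one (c, r, s) triple
    if (col >= P * Q * N) return;

    int n = col % N, q = (col / N) % Q, p = col / (N * Q);
    int s = row % S, r = (row / S) % R, c = row / (S * R);

    // Valid convolution: output position (p, q) with filter tap (r, s)
    // reads input coordinate (p + r, q + s).
    L[row * (P * Q * N) + col] =
        I[((c * H + p + r) * W + q + s) * N + n];
}

bprop and update traverse the same index mapping with the operands rearranged, as described above; in a tuned implementation the lowering is typically fused into the GEMM's load phase rather than materialized as L.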
Deep learning with low precision works
• Vanhoucke, Senior & Mao (Google), "Improving the speed of neural networks on CPUs": fixed-point SSSE3/SSE4 kernels give a 3× improvement over an optimized floating-point baseline at no cost in accuracy.
• Courbariaux, David & Bengio, "Low precision arithmetic for deep learning" (under review at ICLR 2015, arXiv:1412.7024): Maxout networks trained on MNIST, CIFAR10, and SVHN reach near state-of-the-art results with around 10 bits for computing activations and gradients and 12 bits for storing updated parameters.
• Gupta, Agrawal, Gopalakrishnan & Narayanan (IBM), "Deep Learning with Limited Numerical Precision" (arXiv:1502.02551): deep networks can be trained with 16-bit fixed-point representations when using stochastic rounding, with little to no degradation in classification accuracy.
neon: nervana python deep learning library
• User-friendly, extensible, abstracts parallelism
• Support for many deep learning models
• Interface to nervana cloud
• Supports multiple backends: { nervana engine, GPU cluster, CPU cluster (e.g. Cray XC30), Xeon Phi cluster (soon) }
• Multiple limited-precision options
• Optimized for Maxwell at the assembler level
neon: easy model configuration
• Dataset
• Weight initialization
• Learning rule
• Model layers and cost
neon experiments in fp16/32
• Use 16-bit floating point (fp16) as the memory format
• Multiply-and-adds use fp32
• Kernel support for: GEMM; conv {f,b}prop and update; max pooling; dropout / maxout; stochastic rounding; statistics collection
• Python element-wise operations are auto-compiled into kernels
• fp16 accumulations are done carefully to minimize errors (see the sketch below)
• Working with collaborators (Baidu, Bengio lab) to improve
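A minimal sketch of the storage/compute split and of stochastic rounding (illustrative helper names, not neon's kernels): operands live in memory as fp16, are widened to fp32 for the multiply-adds, and are narrowed back only at the end, optionally with stochastic rounding so that rounding errors average out over many updates.

#include <cuda_fp16.h>
#include <curand_kernel.h>

// Round an fp32 value to fp16 stochastically: pick the neighboring fp16
// value above or below with probability proportional to proximity.
__device__ __half stochastic_round(float x, curandState* st) {
    __half lo = __float2half_rd(x);          // round toward -inf
    __half hi = __float2half_ru(x);          // round toward +inf
    float flo = __half2float(lo), fhi = __half2float(hi);
    if (fhi == flo) return lo;               // x was exactly representable
    float frac = (x - flo) / (fhi - flo);    // position between neighbors
    return (curand_uniform(st) < frac) ? hi : lo;
}

// fp16-in / fp16-out dot product with fp32 accumulation.
__device__ __half dot_fp16_fp32(const __half* a, const __half* b,
                                int n, curandState* st) {
    float acc = 0.0f;                        // accumulate in fp32
    for (int i = 0; i < n; ++i)
        acc += __half2float(a[i]) * __half2float(b[i]);  // fp32 FMA
    return stochastic_round(acc, st);        // narrow only at the end
}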
fp16/32 accuracy
• No accuracy loss going from fp32 to fp16
[Histogram: error (%) distribution over 25 runs, spanning roughly 29-35%, for fp32, fp16, and fp16 with stochastic rounding.]
Speed benchmarks¹: fp16 vs fp32
[Bar chart: time per layer (ms, 0-600) for 5 convolutional layers, forward and backward pass, comparing neon fp16, neon, cuda-convnet2, cudanet, Torch7, and cuDNN. Lower times are better. Benchmarks on a GTX980.]
* The 2nd and 3rd layers do not fit on a 4GB card
¹ Soumith Chintala, github.com/soumith/convnet-benchmarks
Benchmarks¹ show 2× performance
Raw numbers (averaged over 10 runs):

Network      fprop                        bprop                        total
Alexnet      43.650 ms,  4188.573 gflops  94.315 ms,  3877.055 gflops  137.965 ms, 3975.615 gflops
Overfeat     172.005 ms, 4169.400 gflops  355.809 ms, 4031.144 gflops  527.815 ms, 4076.199 gflops
VGG (N=64)   234.050 ms, 4161.347 gflops  529.052 ms, 3681.920 gflops  763.102 ms, 3828.965 gflops

Maximum practical peak is 4700 gflops, so these kernels sustain roughly 81-87% of peak.
More than double the speed² with half the memory storage / bandwidth.
[Plot: speed (TFLOPS, 0-5) vs. time (0-30 s) for Alexnet fp16 and Alexnet cuda-convnet2.]
¹ Using the conventions of Soumith Chintala, github.com/soumith/convnet-benchmarks
² Relative to Titan Black (Kepler architecture)
Summary
• neon: user-friendly python library
• maxas: powerful tool for optimizing deep learning
• Fast performance, full utilization of the GPU
• Limited precision allows for larger models
• A toolbox for exploring numerical representations
GTC 2015
• Contact us at [email protected]
• We are hiring: cloud engineers, machine learning engineers, GPU experts, software engineers
• Sign up to try neon, our deep learning library
• We can help solve your problem