TAIPEI | SEP. 21-22, 2016
Robert Sheen
HPE APJeC Principal Solution Architect
Sep 21, 2016
A PLATFORM FOR ACCELERATING MACHINE LEARNING APPLICATIONS
2
WHAT CONFUSION! ARTIFICIAL INTELLIGENCE … MACHINE LEARNING … NEURAL NETWORKS … DEEP LEARNING
3
A QUICK INTRODUCTION TO (DEEP) NEURAL NETWORKS
The (artificial) neuron. Artificial Neural Networks (ANNs) are inspired by biological systems such as our brain.
[Figure: a single artificial neuron with inputs x0 … x4 (plus a constant input of 1), weights w, bias b0 (the threshold), and output y1]
$z_j^l = \sum_k w_{jk}^l x_k - b_j^l$, and the activation $a_j^l = f(z_j^l)$.
NNs are made up of neurons, which are a mathematical approximation to biological neurons.
Common activation functions:
• Logistic (sigma): $a(z) = \dfrac{1}{1 + e^{-z}}$
• Hyperbolic tangent: $a(z) = \tanh z$
• ReLU: $a(z) = \max(0, z)$
• Softplus: $a(z) = \ln(1 + e^{z})$
where $z = \sum_j w_j x_j - b$.
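These activation functions can be written down directly; a minimal Scala sketch (function names are illustrative):

def sigmoid(z: Double): Double  = 1.0 / (1.0 + math.exp(-z))   // logistic
def tanhAct(z: Double): Double  = math.tanh(z)                 // hyperbolic tangent
def relu(z: Double): Double     = math.max(0.0, z)             // rectified linear unit
def softplus(z: Double): Double = math.log(1.0 + math.exp(z))  // smooth approximation of ReLU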
In a typical neuron the inputs ($x_k$) are multiplied by weights ($w_{jk}^l$) and summed. A non-linear activation function $f$ is then applied to this summed, thresholded value $z_j^l$, and the resulting activation $a_j^l = f(z_j^l)$ is the output of the neuron.
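Put together, a single neuron's forward computation is only a few lines; a minimal Scala sketch (the input, weight and bias values are illustrative):

val x = Array(0.5, -1.0, 2.0, 0.0, 1.5)   // inputs x0..x4
val w = Array(0.1, 0.4, -0.3, 0.8, 0.2)   // weights
val b = 0.5                                // bias (threshold)
// z = sum_j w_j * x_j - b, then a = f(z), here with f = tanh
val z = w.zip(x).map { case (wj, xj) => wj * xj }.sum - b
val a = math.tanh(z)                       // the neuron's output (its activation)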
4
A QUICK INTRODUCTION TO (DEEP) NEURAL NETWORKS
To solve useful problems we have to connect multiple neurons together. The output from a neuron in one layer becomes the input to neurons in the next layer.
Notice that the arrows go in one direction only. We will only be discussing “feed-forward” networks. There are others.
What is deep learning? It is essentially artificial neural networks consisting of many (>1) layers and a large number of neurons (units). This is very computationally intensive and uses mathematical techniques typical of high performance computing (matrix-matrix multiplies, vector operations, FFTs, convolutions) and requires HPC hardware.
Training deep networks requires high performance computing hardware and techniques.
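Connecting neurons into layers means every output neuron repeats the same weighted sum over the previous layer's activations, so a fully connected layer is a matrix-vector product followed by a pointwise non-linearity. A minimal Scala sketch (the shapes and the tanh activation are illustrative):

// One feed-forward layer: aOut(j) = f( sum_i weights(j)(i) * aIn(i) - bias(j) )
def layer(weights: Array[Array[Double]], bias: Array[Double], aIn: Array[Double]): Array[Double] =
  weights.zip(bias).map { case (row, bj) =>
    val z = row.zip(aIn).map { case (wji, ai) => wji * ai }.sum - bj
    math.tanh(z)                           // pointwise non-linearity
  }

// Stacking layers: the output of one layer becomes the input to the next.
// val hidden = layer(w1, b1, input)
// val output = layer(w2, b2, hidden)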
5
A QUICK INTRODUCTION TO (DEEP) NEURAL NETWORKS
What do neural networks do?
They classify
- E.g., given an image, is it a bird? Is it a cat? Is it Stephen Fleischman?
- Given an audio signal, what are the words? What do they mean?
- This requires a training data set with inputs and their classes.
- This is supervised learning, and it is what we will focus on.
They cluster
- Find groups of similar things.
- Does not require classified training sets.
- This is unsupervised learning.
- It is often used together with supervised learning.
MNIST handwriting recognition data set for digits. Classify each image as 0 .. 9.
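For a classifier such as MNIST, the final layer typically produces one score per class (here the digits 0..9); a softmax turns the scores into probabilities and the predicted class is the largest one. A minimal Scala sketch with made-up scores:

val scores = Array(0.1, 2.3, -0.5, 0.0, 1.1, 0.2, -1.0, 0.4, 0.9, 0.05) // one score per digit 0..9
val shifted = scores.map(_ - scores.max)          // shift for numerical stability
val exps = shifted.map(math.exp)
val probs = exps.map(_ / exps.sum)                // softmax probabilities
val predictedDigit = probs.indexOf(probs.max)     // here: digit 1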
6
The most important networks that have solved the ImageNet challenge over the years are benchmarked.
Some of them are:
AlexNet (the original!)
VGG_A
Overfeat
Inception V1 (and now Inception V3!) (from Google)
The ImageNet dataset is a database of around 1.2 million annotated images.
The challenge is to train the neural network using a subset of the database and then attempt to classify all the images in the dataset.
The industry-standard metric is the number of images per second that we can train on.
Training time is the forward plus back-propagation time of the network.
Every year various teams compete to classify the ImageNet dataset in the "ImageNet Large Scale Visual Recognition Challenge" (ILSVRC). The network with the greatest accuracy wins.
Testing Performance: The ImageNet dataset and benchmark
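Images per second translates directly into wall-clock training time; a rough back-of-the-envelope calculation in Scala, assuming a hypothetical sustained rate of 1,000 images/s and 90 epochs over the ~1.2 million training images:

val imagesPerSecond = 1000.0     // hypothetical sustained training throughput
val datasetSize     = 1200000.0  // ~1.2 million annotated ImageNet images
val epochs          = 90.0       // a common training schedule (assumption)
val trainingHours   = datasetSize * epochs / imagesPerSecond / 3600.0
// ≈ 30 hours of forward + back propagation at this rate; halve the throughput and it doubles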
7
The most important networks that solve the ImageNet challenge over the years are benchmarked.
The classification accuracy has been improving year on year, so much so that it is now better than human performance!
Testing Performance: The ImageNet dataset and benchmark
[Chart: ImageNet classification error by year and network; lower is better]
8
How does a Neural Network work?
Computers have to be explicitly programmed:
- Analyze the problem to be solved.
- Write the code in a programming language.
- Deductive reasoning; instructions and a program counter.
Neural networks learn from examples:
- No requirement for an explicit description of the problem.
- The neural computer adapts itself during a training period, based on examples of similar problems.
- Able to generalize and to handle incomplete data.
- Inductive reasoning.
- Works well with "natural" data (like speech, images, etc.).
A quick introduction to (Deep) Neural Networks
9
Why is Deep Learning High Performance Computing?
DNNs are compute intensive, and training for a typical DNN application runs for weeks even on modern hardware.
The work maps to BLAS functions like SGEMM, finding max/min, matrix inversions, FFTs, etc.
It is easily mapped to accelerators, so these applications become natural targets for HPC platforms.
Analysis shows that about 80% of the time is spent in convolutions, which are basically SGEMM computations (see the sketch after this list).
Recent developments in learning models have enhanced parallelism, with both data and model parallelism.
Recent advances in NVIDIA libraries support multiple GPUs (1-8) in a single node.
Known to scale well with scale-out configurations too.
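Most of that convolution time is spent in dense single-precision matrix-matrix multiplication (SGEMM). A naive Scala reference version shows the computation that GPU BLAS libraries accelerate (a minimal, untuned sketch):

// C = A * B with A of size m x k and B of size k x n
def sgemm(a: Array[Array[Float]], b: Array[Array[Float]]): Array[Array[Float]] = {
  val m = a.length; val k = a(0).length; val n = b(0).length
  val c = Array.ofDim[Float](m, n)
  for (i <- 0 until m; j <- 0 until n) {
    var s = 0.0f
    for (p <- 0 until k) s += a(i)(p) * b(p)(j)
    c(i)(j) = s
  }
  c
}
// In a convolution layer, image patches are unrolled ("im2col") into one matrix and the
// filter bank into the other, so a single large SGEMM computes all outputs at once.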
A quick introduction to (Deep) Neural Networks
10
Challenges in training deep neural networks
– Slow convergence with millions of weights / parameters.
– Activations saturate or explode.
– Depends on the activation function, but the result is that the weights feeding that neuron stop training.
– Vanishing gradient problem (see the worked example below).
– A result of how we optimize the weights.
– Overfitting (or overtraining).
– With so many parameters you can easily fit the training data but then be completely unable to generalize.
– Achieving scalability in training is crucial, but doing so on more than one GPU is difficult.
For each of these challenges there are methods to ameliorate them; which to use depends on the problem and on the choices you make in the activation function, the cost function, the number of layers, the number of neurons, the types of layers, etc.
These are the hyper-parameters of the neural network model, and choosing them is currently 1) an art as much as a science, 2) an active area of research, and 3) a major factor in sizing the hardware for deep learning.
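A worked illustration of the vanishing gradient problem (assuming logistic activations and weights of order one): since $\sigma'(z) = \sigma(z)(1-\sigma(z)) \le \frac{1}{4}$, back-propagating through $L$ layers multiplies one such factor per layer,
$\frac{\partial C}{\partial w^{(1)}} \propto w^{(L)}\sigma'(z^{(L)}) \cdot w^{(L-1)}\sigma'(z^{(L-1)}) \cdots w^{(2)}\sigma'(z^{(2)}) \lesssim \left(\tfrac{1}{4}\right)^{L-1}.$
After ten layers this is already of order $10^{-6}$, so the early layers barely train; this is one reason ReLU activations and careful weight initialization are popular.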
A quick introduction to (Deep) Neural Networks
11
Getting training to scale
– Model parallelism
– Split the model (neural network) across GPUs and servers.
– Parallelizes well on a single GPU.
– Up to 8 GPUs currently, with some claims of better efficiency (Baidu).
– Multiple servers are a problem.
– Data parallelism (see the gradient-averaging sketch after this list)
– Gather/scatter (SXM2): split the training set across processing units and gather the updates. Requires peer-to-peer communication.
– Parameter servers (master-slave): traditional manager/worker parallelism. Use the CPU to gather and dispatch the data; the entire model must be stored on the GPU, but no peer-to-peer communication is needed. Not used much anymore.
– Hyper-parameters: figuring out the number of layers, number of neurons and training momentum can be done in parallel.
– Consensus: multiple neural networks can train on the same data with different models and then vote or otherwise combine their weights.
– Potentially more suitable for clusters of servers.
– Inference: run it in parallel if you replicate the model.
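Data parallelism boils down to gradient averaging: each GPU computes gradients on its own slice of the mini-batch and the results are combined (an all-reduce) before every copy of the weights is updated identically. A minimal host-side Scala sketch of the combining step (the per-device gradients are made up; the real exchange would use peer-to-peer GPU communication or a parameter server):

// Hypothetical gradients for the same three weights, one array per GPU
val perDeviceGrads = Seq(
  Array(0.10, -0.20, 0.05),
  Array(0.12, -0.18, 0.07),
  Array(0.08, -0.22, 0.03),
  Array(0.11, -0.19, 0.06)
)
// "All-reduce": element-wise sum across devices, then average
val summed  = perDeviceGrads.reduce((g1, g2) => g1.zip(g2).map { case (a, b) => a + b })
val avgGrad = summed.map(_ / perDeviceGrads.length)
// Every device then applies the same update: w := w - learningRate * avgGrad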
A quick introduction to (Deep) Neural Networks
12
• Domain-specific embedded language with associated optimizing compiler and runtime
• Array programming language embedded in a state machine execution model
• Targets advanced analytics workloads on massively parallel distributed systems
• Design Goals
– Optimal deployment on parallel hardware
– Fast design iterations
– Enforce scalability
– Broad COTS hardware support
– Compatible with shared infrastructure
– High productivity for analysts and algorithm engineers
What is CogX?
CogX
13
Compute graph (background-subtraction example)
A ColorMovie sensor produces movie(t); background(t) is a state field.
nextBackground(t) = background(t) * 0.999f + movie(t) * 0.001f, which becomes background(t+1).
suspicious(t) = reduceSum(abs(movie(t) - background(t))).
Opportunities for optimization
14
Compute graph (the same model, with device-kernel boundaries shown)
Opportunities for optimization
Initially: 6 separate device kernels.
15
Compute graph (as above)
Opportunities for optimization
After a "single-output" kernel fuser pass: 2 device kernels remain.
16
Compute graph (as above)
Opportunities for optimization
After a "multi-output" kernel fuser pass: only a single device kernel remains.
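In CogX the whole graph above is just a few field expressions, written in the same style as the snippet on the next slide. The field sizes are illustrative and a ScalarField stands in for the ColorMovie sensor, so treat this as a sketch of the model rather than the exact application code:

// Illustrative stand-ins for the movie input and the background state field
val movie      = ScalarField(480, 640)
val background = ScalarField(480, 640)
// Slowly adapting background estimate: background(t+1) = 0.999 * background(t) + 0.001 * movie(t)
val nextBackground = background * 0.999f + movie * 0.001f
// Pixels that deviate from the background are summed into a single "suspicious" score
val suspicious = reduceSum(abs(movie - background))
// In the running model the feedback operator <== would write nextBackground back into background each tick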
17
Pipeline: a user CogX model (Scala) goes through parsing and OpenCL code generation to produce a kernel circuit (kernels and field buffers); optimization passes, including kernel fusion, then yield an optimized kernel circuit with merged kernels.
In the snippet below, an OpenCL multiply kernel (A * B -> C) and an OpenCL add kernel (C + D -> E) are fused into a single OpenCL multiply/add kernel that reads A, B and D and writes E directly.
CogX code snippet:
val A = ScalarField(10,10)
val B = ScalarField(10,10)
val C = A * B              // becomes the OpenCL multiply kernel
val D = ScalarField(10,10)
val E = C + D              // becomes the OpenCL add kernel, fused with the multiply above
CogX compiler:
translating CogX to OpenCL with kernel fusion
18
• Basic operators: +, -, *, /, %
• Logical operators: >, >=, <, <=, ===, !===
• Pointwise functions: cos, cosh, acos, sin, sinh, asin, tan, tanh, atan2, sq, sqrt, log, signum, pow, reciprocal, exp, abs, floor
• Comparison functions: max, min
• Shape manipulation: flip, shift, shiftCyclic, transpose, subfield, expand, select, stack, matrixRow, reshape, subfields, trim, vectorElement, vectorElements, transposeMatrices, transposeVectors, replicate, slice
• FFT/DCT: fft, fftInverse, fftRI, fftInverseRI, fftRows, fftInverseRows, fftColumns, fftInverseColumns, dct, dctInverse, dctTransposed, dctInverseTransposed
• Complex numbers: phase, magnitude, conjugate, realPart, imaginaryPart
• Convolution-like: crossCorrelate, crossCorrelateSeparable, convolve, convolveSeparable, projectFrame, backProjectFrame, crossCorrelateFilterAdjoint, convolveFilterAdjoint
• Gradient/divergence: backwardDivergence, backwardGradient, centralGradient, forwardGradient
• Linear algebra: dot, crossDot, reverseCrossDot
• Type coercion: toScalarField, toVectorField, toMatrixField, toComplexField, toComplexVectorField, toColorField, toGenericComplexField
• Type construction: complex, polarComplex, vectorField, complexVectorField, matrixField, colorField
• Reductions: reduceSum, blockReduceSum, reduceMin, blockReduceMin, reduceMax, blockReduceMax, fieldReduceMax, fieldReduceMin, fieldReduceSum, fieldReduceMedian
• Normalizations: normalizeL1, normalizeL2
• Resampling: supersample, downsample, upsample
• Special operators: winnerTakeAll, random, solve, transform, warp, <==
• Debugging: probe
CogX core functions and operators
19
• Computer Vision
  • Annotation tools
  • Color space transformations
  • Polynomial dense optic flow
  • Segmentation
• Solvers
  • Boundary-gated nonlinear diffusion
  • FISTA solver (with sub-variants)
  • Golden section solver
  • Incremental k-means implementation
  • LSQR solver (with sub-variants)
  • Poisson solver (with sub-variants)
• Filtering
  • Contourlets
  • 4 frequency-domain filters
  • Mathematical morphology operators
  • 27 space-domain filters (from a simple box filter up to local polynomial expansion and steerable Gabor filters)
  • Steerable pyramid filter
  • Wavelets
  • Variants of whitening transforms
  • Contrast normalization
  • Domain transfer filter
  • Gaussian pyramid
  • Monogenic phase congruency
• Dynamical Systems
  • Kalman filter
  • Linear system modeling support
  • CPU matrix pseudo-inverse
• Statistics
  • Normal and uniform distributions
  • Histograms
  • Moment calculations
  • Pseudo-random number generator sensors
CogX toolkit functions
20
[Stack diagram: a user Application sits on the CogX libraries/toolkits (neural network toolkit, I/O toolkit, sandbox toolkit, cluster package) and the CogX core (CogX debugger, CogX compiler and standard library, Scala and C++ CogX runtimes, HDF5 loader), which in turn rely on external libraries (JOCL, OpenCL, HDF5, Apache Mesos).]
Applications are written by users
– Introductory and training examples for single-GPU and distributed computation
– Performance benchmarks covering the core and neural network package
– Several larger-scale demo applications integrating multiple CogX functions
HPE Cognitive Computing Toolkit
http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=S6772&searchItems=&sessionTopic=&sessionEvent=&sessionYear=&sessionFormat=&submit=&select=
21
SOME MACHINE LEARNING APPLICATIONS
22
Deep Learning Use Cases
The better-known, well-publicized implementations:
Games
Chat bots (Cortana, Suri, Jarvis, etc.)
Intelligent Assistants (Siri, Alexa, etc.)
Self-driving cars
….But what about "enterprise-class" use cases?
23
Finance: AI-assisted trading, beyond current algorithmic trading; the rise of "AI Hedge Funds".
Medicine: healthcare institutions use AI-assisted diagnosis and recommendations to reduce human error.
E-Commerce: agents and chatbots provide product recommendations and "interact" with potential shoppers.
Security: beyond facial recognition, understand the "context" of danger and flag security threats.
AI in the Enterprise
Deep Learning and Neural Networks for the mainstream?
24
Social networking: Yann LeCun was hired by Facebook, Geoff Hinton by Google and Andrew Ng by Baidu. Sentiment analysis; facial recognition; understanding text; image recognition.
Geospatial: high spatial resolution remote-sensing (HSR-RS) image scene classification (BoVWs).
Oil and Gas: channel sands identification and other seismic analysis.
AI in the Enterprise
Deep Learning and Neural Networks for the mainstream?
25
Self-driving cars
Deep neural networks are being used to understand the scene in self-driving cars!
The 4 Stage IoT Solutions Architecture:
The "Things": primarily analog data sources (devices, machines, people, tools, cars, animals, clothes, toys, environment, buildings, etc.).
Stage 1: Sensors/Actuators (wired, wireless), at the edge.
Stage 2: Internet Gateways, Data Acquisition Systems (data aggregation, A/D, measurement, control).
Stage 3: Edge IT (analytics, pre-processing).
Stage 4: Data Center / Cloud (analytics, management, archive), plus visualization.
Data flows from the Things toward the data center; control flows back toward the Things. Analytics, management and control software stacks run at each stage.
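The four stages form a simple pipeline: data is sensed, aggregated and digitized, pre-processed at the edge, and only the interesting part is shipped to the data center. A toy Scala sketch of that data flow (the stage names follow the diagram; the processing itself is purely illustrative):

// Stage 1: sensors produce raw samples
val samples = Seq(20.1, 20.3, 35.7, 20.2, 20.4, 19.9, 20.0, 36.1, 20.2, 20.3)
// Stage 2: gateway / data acquisition system aggregates (here: means over windows of 5)
val aggregated = samples.grouped(5).map(w => w.sum / w.size).toSeq
// Stage 3: edge IT pre-processes and filters, forwarding only anomalous windows
val anomalies = aggregated.filter(_ > 22.0)
// Stage 4: data center / cloud does the heavy analytics, management and archiving
anomalies.foreach(v => println(s"archive and analyze window with mean $v"))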
27
Enable workplace productivity. Empower a data-driven organization. Transform to a hybrid infrastructure. Protect your digital enterprise.
Use cases: Automated Intelligence delivered by HPE Apollo 6500 and Deep Learning software; video, image, text, audio and time-series pattern recognition solutions; large, highly complex, unstructured simulation and modeling; real-time and near real-time analytics.
Customer benefits: faster model training time, better fusion of data*.
* Benchmarking results provided at or shortly after announcement.
HPE Apollo 6500 is an ideal HPC and Deep Learning platform providing unprecedented performance with 8 GPUs, high-bandwidth fabric and a configurable GPU topology to match deep learning workloads:
– Up to 8 high-powered GPUs per tray (node), 2P Intel E5-2600 v4 support
– Choice of high-speed, low-latency fabrics with 2x IO expansion
– Workload optimized using flexible configuration capabilities
Deliver automated intelligence in real time: unprecedented performance and scale with the HPE Apollo 6500 high-density GPU solution.
HPE Apollo platforms and solutions are optimized for HPC, IoT and Big Data.
Platforms:
– Apollo 8000: Supercomputing
– Apollo 6000: Rack-scale HPC
– Apollo 4000: Server solutions purpose built for Big Data
– Apollo 2000: Enterprise bridge to scale-out compute
– Moonshot*: Optimized for next-gen workloads
Workloads:
– HPC workloads: oil and gas, life sciences, financial services, manufacturing CAD/CAE, academia
– Big Data workloads: object storage, data analytics
– Next-gen workloads: video encoding, mobile workplace, IoT
Solutions / ISVs: Mellanox, NVIDIA, Seagate, Scality, Cleversafe, Ceph, Hortonworks, Hadoop, Cloudera, Schlumberger, Paradigm, Halliburton, Gaussian, BIOVIA, Redline, Synopsys, ANSYS, custom apps.
28
HPE Software (e.g., Vertica, HPE Haven), HPE Enterprise Services
29
HP APOLLO 6000 POWER SHELF
Pooled power efficiency:
• External pooled power shelf
• Fits up to 6 power supplies
• 2400W or 2650W power supplies
• Up to 15.9 kW non-redundant
• Single- or 3-phase AC input
• Up to twelve 12V DC cables
Dimensions (front and back views shown): 1.5U / 2.55 in (H) x 17.64 in (W) x 30.88 in (D), i.e. 1.5U (H) x 44.81 cm (W) x 78.44 cm (D).
30
HPE Apollo 6500 solution innovation: system design innovation to maximize GPU capacity and performance with lower TCO.
New technologies and products:
– HPE Apollo 6500: dense GPU server optimized for Deep Learning and HPC workloads
– Density optimization
– High-performance fabrics
– Cluster management enhancements (massive scaling, open APIs, tight integration, multiple user interfaces)
– Deep Learning and HPC software platform enablement (HPE CCTK, Caffe, CUDA, Google TensorFlow, HPE IDOL)
Unique solution differentiators:
– GPU density
– Configurable GPU topologies
– More network bandwidth
– Power and cooling optimization
– Manageability
– Better productivity
31
Option 1: top choice for enterprise virtualization
HPE Apollo 2000/XL190r 1 node + NVIDIA Tesla M60 x1
Apollo r2200 12LFF or r2600 24SFF
XL190r Gen9 specification: E5-2640v4 x2 / 16GB x2 / 1TB x1 / 800W / 3yr Fndn Care 24x7 service / NVIDIA Tesla M60 dual GPU x1
Option 2: top choice for high-performance computing
HPE Apollo 2000/XL190r 1 node + NVIDIA Tesla K80 x1
Apollo r2200 12LFF or r2600 24SFF
XL190r Gen9 specification: E5-2640v4 x2 / 16GB x2 / 1TB x1 / 800W / 3yr Fndn Care 24x7 service / NVIDIA Tesla K80 dual GPU x1
Limited-time, limited-quantity promotional bundles
The strongest combination: the densest HPE servers plus NVIDIA GPUs give you the most powerful combination.
A single 2U chassis can scale to 2 HPE Apollo system servers and 4 NVIDIA high-performance compute accelerator cards.
Apollo 2000 + NVIDIA GPU promotion
From NT$360,000 (excluding tax) for either configuration.
* Promotion ends 2016/12/31. If you are interested in the products, please call (02)2652-4040 (this number is for Taiwan only).
TAIPEI | SEP. 21-22, 2016
THANK YOU