A Platform for Accelerating Machine Learning Applications
Page 1: A Platform for Accelerating Machine Learning Applications

TAIPEI | SEP. 21-22, 2016

Robert Sheen

HPE APJeC Principal Solution Architect

Sep 21, 2016

A PLATFORM FOR ACCELERATING MACHINE LEARNING APPLICATIONS

Page 2: A Platform for Accelerating Machine Learning Applications


WHAT CONFUSION! ARTIFICIAL INTELLIGENCE … MACHINE LEARNING … NEURAL NETWORKS … DEEP LEARNING

Page 3: A Platform for Accelerating Machine Learning Applications


A QUICK INTRODUCTION TO (DEEP) NEURAL NETWORKS

The (artificial) neuron. Artificial Neural Networks (ANNs) are inspired by biological systems such as the brain.

[Figure: a single neuron with inputs $x_0 \dots x_4$ plus a constant input 1, weights $w_{0,1}^1 \dots w_{0,5}^1$, bias $b_0^1$ (= threshold), activation $f(z)$, and output $y_1$.]

NNs are made up of neurons, which are a mathematical approximation to biological neurons:

$z_j^l = \sum_k w_{jk}^l x_k + b_j^l, \qquad a_j^l = f(z_j^l)$

Common activation functions, where $z = \sum_j w_j x_j - b$:

– Logistic (sigma): $a(z) = \dfrac{1}{1 + e^{-z}}$
– Hyperbolic tangent: $a(z) = \tanh z$
– ReLU: $a(z) = \max(0, z)$
– Softplus: $a(z) = \ln(1 + e^z)$


In a typical neuron the inputs $x_k$ are multiplied by weights $w_{jk}^l$ and then summed, together with a bias $b_j^l$, to give $z_j^l$. A non-linear activation function $f$ is applied to this summed, "thresholded" value, and the activation $a_j^l = f(z_j^l)$ is the output of the neuron.
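To make the equations concrete, here is a minimal sketch of a single neuron in plain Scala (the language of the CogX snippets later in this deck). The input and weight values are hypothetical, chosen only for illustration.

object NeuronDemo {
  // Logistic (sigma) activation: a(z) = 1 / (1 + e^-z)
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // One neuron: weighted sum of the inputs plus a bias, then the nonlinearity.
  def neuron(x: Array[Double], w: Array[Double], b: Double): Double = {
    require(x.length == w.length, "one weight per input")
    val z = x.zip(w).map { case (xi, wi) => xi * wi }.sum + b
    sigmoid(z)
  }

  def main(args: Array[String]): Unit = {
    val x = Array(1.0, 0.5, -0.5, 2.0, 0.0)  // inputs x0 .. x4, as in the figure
    val w = Array(0.1, -0.2, 0.4, 0.3, -0.1) // hypothetical weights
    println(neuron(x, w, b = 0.5))           // the neuron's activation (output y1)
  }
}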

Page 4: A Platform for Accelerating Machine Learning Applications


A QUICK INTRODUCTION TO (DEEP) NEURAL NETWORKS

To solve useful problems we have to connect multiple neurons together. The output from a neuron in one layer becomes the input to neurons in the next layer.

Notice that the arrows go in one direction only. We will only be discussing “feed-forward” networks. There are others.

What is deep learning? Essentially, artificial neural networks consisting of many (>1) layers and a large number of neurons (units). Training them is very computationally intensive and uses mathematical techniques typical of high performance computing: matrix-matrix multiplies, vector operations, FFTs, convolutions.

Training deep networks requires high performance computing hardware and techniques.
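A minimal sketch of the layer-to-layer wiring described above, in plain Scala: each fully connected layer is a matrix-vector multiply plus bias followed by the activation, and the output of one layer is the input to the next. All weight values here are hypothetical.

object FeedForwardDemo {
  def relu(z: Double): Double = math.max(0.0, z)

  // One layer: a = f(W x + b), with W of shape (outputs x inputs).
  def layer(w: Array[Array[Double]], b: Array[Double], x: Array[Double]): Array[Double] =
    w.zip(b).map { case (row, bias) =>
      relu(row.zip(x).map { case (wi, xi) => wi * xi }.sum + bias)
    }

  def main(args: Array[String]): Unit = {
    val x  = Array(0.2, 0.8, -0.1)
    val w1 = Array(Array(0.5, -0.3, 0.8), Array(0.1, 0.9, -0.4)) // 2x3 hidden layer
    val b1 = Array(0.0, 0.1)
    val w2 = Array(Array(0.7, -0.6))                             // 1x2 output layer
    val b2 = Array(0.05)
    val hidden = layer(w1, b1, x)    // output of one layer...
    val out    = layer(w2, b2, hidden) // ...becomes the input to the next
    println(out.mkString(", "))
  }
}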

Page 5: A Platform for Accelerating Machine Learning Applications


A QUICK INTRODUCTION TO (DEEP) NEURAL NETWORKS

What do neural networks do?

They classify.
– E.g., given an image: is it a bird? Is it a cat? Is it Stephen Fleischman?
– Given an audio signal: what are the words? What do they mean?
– This requires a training data set with inputs and their classes.
– This is supervised learning, and it is what we will focus on.

They cluster.
– Find groups of similar things.
– Does not require classified training sets.
– This is unsupervised learning.
– It is often used together with supervised learning.

MNIST handwriting recognition data set for digits. Classify each image as 0 .. 9.
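For the MNIST task above, the final classification step can be sketched in a few lines: softmax turns the network's ten raw outputs into class probabilities, and argmax picks the digit. The logit values below are made up for illustration.

object ClassifyDemo {
  // Softmax: exponentiate and normalize, subtracting the max for numerical stability.
  def softmax(scores: Array[Double]): Array[Double] = {
    val m = scores.max
    val exps = scores.map(s => math.exp(s - m))
    val total = exps.sum
    exps.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical raw network outputs for one image, one score per digit 0..9.
    val logits = Array(0.1, 2.3, 0.0, -1.2, 0.4, 0.2, 3.1, 0.0, 0.5, -0.3)
    val probs  = softmax(logits)
    val digit  = probs.indexOf(probs.max)   // predicted class 0..9
    println(s"predicted digit: $digit (p = ${probs(digit)})")
  }
}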

Page 6: A Platform for Accelerating Machine Learning Applications


Testing Performance: the ImageNet dataset and benchmark

The ImageNet dataset is a database of around 1.2 million annotated images. Every year various teams compete to classify it in the "ImageNet Large Scale Visual Recognition Challenge" (ILSVRC); the network with the greatest accuracy wins. The challenge is to train the neural network using a subset of the database and then attempt to classify all the images in the dataset.

The most important networks that have solved the ImageNet challenge over the years are benchmarked. Some of them are:

– AlexNet (the original!)
– VGG_A
– Overfeat
– Inception V1 (and now Inception V3!) (from Google)

The industry-standard metric is the number of images per second that we can train, where training time is the forward plus back propagation time of the network.
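A hedged sketch of how that images-per-second metric can be measured: time a number of training passes (forward + back propagation) over a batch and divide. The trainStep body here is a placeholder stand-in, not a real framework call.

object ThroughputDemo {
  // Placeholder for one forward + back propagation pass over a batch.
  def trainStep(batch: Array[Array[Float]]): Unit = Thread.sleep(10)

  def imagesPerSecond(batch: Array[Array[Float]], iters: Int): Double = {
    val t0 = System.nanoTime()
    (1 to iters).foreach(_ => trainStep(batch))
    val seconds = (System.nanoTime() - t0) / 1e9
    batch.length * iters / seconds
  }

  def main(args: Array[String]): Unit = {
    // A batch of 64 ImageNet-sized (224x224x3) images, zero-filled for the sketch.
    val batch = Array.fill(64)(Array.fill(224 * 224 * 3)(0.0f))
    println(f"${imagesPerSecond(batch, 10)}%.1f images/sec")
  }
}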

Page 7: A Platform for Accelerating Machine Learning Applications


Testing Performance: the ImageNet dataset and benchmark

The most important networks that solve the ImageNet challenge over the years are benchmarked. The classification accuracy has been improving year on year, so much so that it is now better than human performance!

[Chart: ImageNet classification error by year; lower is better.]

Page 8: A Platform for Accelerating Machine Learning Applications


A quick introduction to (Deep) Neural Networks: how does a neural network work?

Computers have to be explicitly programmed:
– Analyze the problem to be solved.
– Write the code in a programming language.
– Deductive reasoning.
– Instructions and a program counter (PC).

Neural networks learn from examples:
– No explicit description of the problem is required.
– The neural computer adapts itself during a training period, based on examples of similar problems.
– Able to generalize and to handle incomplete data.
– Inductive reasoning.
– Works well with "natural" data (like speech, images, etc.).

Page 9: A Platform for Accelerating Machine Learning Applications


A quick introduction to (Deep) Neural Networks: why is deep learning high performance computing?

– DNNs are compute intensive: training for a typical DNN application runs for weeks, even on modern hardware.
– The work maps to BLAS functions like SGEMM, finding max/min, matrix inversions, FFTs, etc.
– It is easily mapped to accelerators, so these applications become natural targets for HPC platforms.
– Analysis shows that about 80% of the time is spent in convolutions, which are basically SGEMM computations (see the sketch below).
– Recent developments in learning models have enhanced parallelism, with both data and model parallelism.
– Recent advances in NVIDIA libraries support multiple GPUs (1-8) in a single node.
– Known to scale well with scale-out configurations too.
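The claim that convolutions are basically SGEMM can be illustrated with the im2col trick: unrolling each input patch into a column turns convolution into a single matrix multiply (filters as rows times patches as columns). A minimal sketch, assuming a 1-D single-channel input, stride 1, and no padding:

object Im2ColDemo {
  // Unroll each length-k window of the input into one column of a matrix.
  def im2col(input: Array[Float], k: Int): Array[Array[Float]] = {
    val cols = input.length - k + 1
    Array.tabulate(k, cols)((row, col) => input(col + row))
  }

  // Naive SGEMM-style multiply: (m x k) * (k x n) -> (m x n).
  def matmul(a: Array[Array[Float]], b: Array[Array[Float]]): Array[Array[Float]] =
    a.map(row => b.transpose.map(col => row.zip(col).map { case (x, y) => x * y }.sum))

  def main(args: Array[String]): Unit = {
    val input   = Array(1f, 2f, 3f, 4f, 5f)
    val filters = Array(Array(1f, 0f, -1f))     // one filter of width 3
    val out = matmul(filters, im2col(input, 3)) // equals the convolution outputs
    println(out.head.mkString(", "))            // -2.0, -2.0, -2.0
  }
}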

Page 10: A Platform for Accelerating Machine Learning Applications


A quick introduction to (Deep) Neural Networks: challenges in training deep neural networks

– Slow convergence with millions of weights / parameters.
– Activations saturate or explode. The details depend on the activation function, but the result is that the weights going into that neuron stop training.
– The vanishing gradient problem, a result of how we optimize the weights (illustrated below).
– Overfitting (or overtraining): with so many parameters you can easily train to fit the training data but then be completely unable to generalize.
– Achieving scalability in training is crucial, but doing so on more than one GPU is hard.

For each of these challenges there are methods to ameliorate them. Which to use depends on the problem and on the choices you make in the activation function, the cost function, the number of layers, the number of neurons, the types of layers, etc.

These are the hyper-parameters of the neural network model, and choosing them is currently 1) an art as much as a science, 2) an active area of research, and 3) a major factor in sizing the hardware for deep learning.
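The vanishing gradient problem can be seen numerically: the sigmoid's derivative never exceeds 0.25, so the chain-rule product through many sigmoid layers shrinks geometrically, and a saturated neuron (large |z|) contributes almost nothing at all. A small illustrative sketch:

object VanishingGradientDemo {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))
  // Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z)), at most 0.25.
  def sigmoidPrime(z: Double): Double = sigmoid(z) * (1.0 - sigmoid(z))

  def main(args: Array[String]): Unit = {
    val perLayer  = sigmoidPrime(0.0)        // 0.25, the best case (z = 0)
    val through10 = math.pow(perLayer, 10)   // ~9.5e-7 after 10 layers
    println(s"per-layer factor: $perLayer, after 10 layers: $through10")
    // A saturated neuron is far worse and effectively stops training:
    println(s"saturated (z = 10): ${sigmoidPrime(10.0)}")
  }
}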

Page 11: A Platform for Accelerating Machine Learning Applications


A quick introduction to (Deep) Neural Networks: getting training to scale

– Model parallelism
  – Split the model (neural network) across GPUs and servers.
  – Parallelizes well on a single GPU, and on up to 8 GPUs currently, with some claims of better efficiency (Baidu).
  – Scaling across multiple servers is a problem.
– Data parallelism (see the sketch below)
  – Gather/scatter (SXM2): split the training set across processing units and gather the updates. Requires peer-to-peer communication.
  – Parameter servers (master/worker): traditional manager/worker parallelism. Use the CPU to gather and dispatch the data; it is not being used for much anyway. Each worker needs to store the entire model on the GPU, but no peer-to-peer communication is required.
– Hyper-parameters: figuring out the number of layers, the number of neurons, and the training momentum can be done in parallel.
– Consensus: multiple neural networks can train on the same data with different models and then vote or otherwise combine their weights. Potentially more suitable for clusters of servers.
– Inference: runs in parallel if you replicate the model.
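A minimal sketch of the synchronous data-parallel scheme described above, with a toy quadratic loss standing in for real back propagation: each worker computes a gradient on its own shard, the gradients are gathered and averaged, and every replica applies the same update so the models stay in sync.

object DataParallelDemo {
  val rng = new scala.util.Random(42)

  // Stand-in for backprop on one worker's data shard: noisy gradient of (w - 0.5)^2.
  def localGradient(weights: Array[Double]): Array[Double] =
    weights.map(w => 2.0 * (w - 0.5) + rng.nextGaussian() * 0.1)

  def main(args: Array[String]): Unit = {
    var weights = Array.fill(4)(0.0)
    val workers = 8
    val lr = 0.1
    for (_ <- 1 to 200) {
      // Scatter: each worker computes gradients on its own shard
      // (sequential here; across GPUs or nodes in a real system).
      val grads = (1 to workers).map(_ => localGradient(weights))
      // Gather: sum and average the workers' gradients.
      val avg = grads.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
                     .map(_ / workers)
      // Every replica applies the identical update.
      weights = weights.zip(avg).map { case (w, g) => w - lr * g }
    }
    println(weights.mkString(", ")) // each weight converges near 0.5
  }
}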

Page 12: A Platform for Accelerating Machine Learning Applications


What is CogX?

• Domain-specific embedded language with an associated optimizing compiler and runtime
• Array programming language embedded in a state-machine execution model
• Targets advanced analytics workloads on massively parallel distributed systems
• Design goals:
  – Optimal deployment on parallel hardware
  – Fast design iterations
  – Enforce scalability
  – Broad COTS hardware support
  – Compatible with shared infrastructure
  – High productivity for analysts and algorithm engineers

Page 13: A Platform for Accelerating Machine Learning Applications


Compute graph: opportunities for optimization

[Figure: a compute graph for background maintenance over a ColorMovie input. The inputs movie_t and background_t produce nextBackground_t = background_t * 0.999f + movie_t * 0.001f, which becomes background_{t+1}, and suspicious_t = reduceSum(abs(movie_t - background_t)).]
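A hedged sketch of how this graph might be written in CogX, in the style of the code snippet shown on Page 17; the ColorMovie constructor argument and field sizes are assumptions for illustration, not the deck's actual API.

val movie = ColorMovie("parking_lot.mp4")             // movie_t (hypothetical source)
val background = ScalarField(480, 640)                // background_t state field
val nextBackground = background * 0.999f + movie * 0.001f
val suspicious = reduceSum(abs(movie - background))   // suspicious_t
background <== nextBackground                         // becomes background_{t+1}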

Page 14: A Platform for Accelerating Machine Learning Applications


Compute graph: opportunities for optimization

[Figure: the same compute graph with each operation (* 0.999f, * 0.001f, +, -, abs, reduceSum) marked as its own device kernel.]

Initially: 6 separate device kernels.

Page 15: A Platform for Accelerating Machine Learning Applications


Compute graph: opportunities for optimization

[Figure: the same compute graph with fused groups of operations marked as device kernels.]

After a "single-output" kernel fuser pass: 2 device kernels remain.

Page 16: A Platform for Accelerating Machine Learning Applications


Compute graph: opportunities for optimization

[Figure: the same compute graph fused into a single box.]

After a "multi-output" kernel fuser pass: only a single device kernel remains.

Page 17: A Platform for Accelerating Machine Learning Applications


CogX compiler: translating CogX to OpenCL with kernel fusion

[Figure: the compilation pipeline. A user CogX model (Scala) passes through parsing and OpenCL code generation into a kernel circuit (kernels, field buffers), then through optimizations, including kernel fusion, into an optimized kernel circuit (merged kernels). In the example, an OpenCL multiply kernel (A, B -> C) and an OpenCL add kernel (C, D -> E) are fused into a single OpenCL multiply/add kernel (A, B, D -> E).]

CogX code snippet:

val A = ScalarField(10,10)
val B = ScalarField(10,10)
val C = A * B
val D = ScalarField(10,10)
val E = C + D
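The fusion idea itself can be sketched in a few lines of plain Scala (a toy model, not CogX's actual implementation): because C is consumed only by the add, the multiply and add can be evaluated in one pass per element, with no temporary buffer ever materialized for C.

object FusionDemo {
  sealed trait Expr
  case class Input(data: Array[Float]) extends Expr
  case class Mul(a: Expr, b: Expr) extends Expr
  case class Add(a: Expr, b: Expr) extends Expr

  // "Fused" evaluation: one recursive pass per element, no intermediate buffers.
  def eval(e: Expr, i: Int): Float = e match {
    case Input(d)  => d(i)
    case Mul(a, b) => eval(a, i) * eval(b, i)
    case Add(a, b) => eval(a, i) + eval(b, i)
  }

  def main(args: Array[String]): Unit = {
    val a = Input(Array(1f, 2f))
    val b = Input(Array(3f, 4f))
    val d = Input(Array(10f, 20f))
    val e = Add(Mul(a, b), d)                           // E = A * B + D, as on the slide
    println((0 until 2).map(eval(e, _)).mkString(", ")) // 13.0, 28.0
  }
}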

Page 18: A Platform for Accelerating Machine Learning Applications


CogX core functions and operators

• Basic operators: +, -, *, /, %
• Logical operators: >, >=, <, <=, ===, !===
• Pointwise functions: cos, cosh, acos, sin, sinh, asin, tan, tanh, atan2, sq, sqrt, log, signum, pow, reciprocal, exp, abs, floor
• Comparison functions: max, min
• Shape manipulation: flip, shift, shiftCyclic, transpose, subfield, expand, select, stack, matrixRow, reshape, subfields, trim, vectorElement, vectorElements, transposeMatrices, transposeVectors, replicate, slice
• FFT/DCT: fft, fftInverse, fftRI, fftInverseRI, fftRows, fftInverseRows, fftColumns, fftInverseColumns, dct, dctInverse, dctTransposed, dctInverseTransposed
• Complex numbers: phase, magnitude, conjugate, realPart, imaginaryPart
• Convolution-like: crossCorrelate, crossCorrelateSeparable, convolve, convolveSeparable, projectFrame, backProjectFrame, crossCorrelateFilterAdjoint, convolveFilterAdjoint
• Gradient/divergence: backwardDivergence, backwardGradient, centralGradient, forwardGradient
• Linear algebra: dot, crossDot, reverseCrossDot
• Type coercion: toScalarField, toVectorField, toMatrixField, toComplexField, toComplexVectorField, toColorField, toGenericComplexField
• Type construction: complex, polarComplex, vectorField, complexVectorField, matrixField, colorField
• Reductions: reduceSum, blockReduceSum, reduceMin, blockReduceMin, reduceMax, blockReduceMax, fieldReduceMax, fieldReduceMin, fieldReduceSum, fieldReduceMedian
• Normalizations: normalizeL1, normalizeL2
• Resampling: supersample, downsample, upsample
• Special operators: winnerTakeAll, random, solve, transform, warp, <==
• Debugging: probe
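A hedged usage sketch combining a few operators from the table, in the style of the Page 17 snippet; the exact argument lists are assumptions for illustration.

val image  = ScalarField(256, 256)
val kernel = ScalarField(5, 5)
val filtered = convolve(image, kernel)   // convolution-like
val energy   = reduceSum(sq(filtered))   // pointwise function + reduction
val unit     = normalizeL2(filtered)     // normalization
probe(unit)                              // debugging: inspect a field at runtime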

Page 19: A Platform for Accelerating Machine Learning Applications


CogX toolkit functions

• Computer Vision: annotation tools, color space transformations, polynomial dense optic flow, segmentation
• Solvers: boundary-gated nonlinear diffusion, FISTA solver (with sub-variants), golden section solver, incremental k-means implementation, LSQR solver (with sub-variants), Poisson solver (with sub-variants)
• Filtering: contourlets, 4 frequency-domain filters, mathematical morphology operators, 27 space-domain filters (from a simple box filter up to local polynomial expansion and steerable Gabor filters), steerable pyramid filter, wavelets, variants of whitening transforms, contrast normalization, domain transfer filter, Gaussian pyramid, monogenic phase congruency
• Dynamical Systems: Kalman filter, linear system modeling support, CPU matrix pseudo-inverse
• Statistics: normal and uniform distributions, histograms, moment calculations, pseudo-random number generator sensors

Page 20: A Platform for Accelerating Machine Learning Applications


HPE Cognitive Computing Toolkit

[Figure: the CogX software stack. Applications sit on top of the CogX debugger and the CogX libraries/toolkits (neural network toolkit, sandbox toolkit, I/O toolkit). The CogX core provides the CogX compiler and standard library, the Scala and C++ CogX runtimes, an HDF5 loader, JOCL, and a cluster package. External libraries: HDF5, OpenCL, Apache Mesos.]

Applications are written by users:
– Introductory and training examples for single-GPU and distributed computation
– Performance benchmarks covering the core and neural network package
– Several larger-scale demo applications integrating multiple CogX functions

http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=S6772&searchItems=&sessionTopic=&sessionEvent=&sessionYear=&sessionFormat=&submit=&select=

Page 21: A Platform for Accelerating Machine Learning Applications


SOME MACHINE LEARNING APPLICATIONS


Page 22: A Platform for Accelerating Machine Learning Applications


Deep Learning Use Cases

The better-known, well-publicized implementations:
– Games
– Chat bots (Cortana, Suri, Jarvis, etc.)
– Intelligent assistants (Siri, Alexa, etc.)
– Self-driving cars

… But what about "enterprise-class" use cases?

Page 23: A Platform for Accelerating Machine Learning Applications


AI in the Enterprise: Deep Learning and Neural Networks for the mainstream?

– Finance: AI-assisted trading, beyond current algorithmic trading; the rise of "AI hedge funds".
– Medicine: healthcare institutions use AI-assisted diagnosis and recommendations to reduce human error.
– E-Commerce: agents and chatbots provide product recommendations and "interact" with potential shoppers.
– Security: beyond facial recognition, understand the "context" of danger and flag security threats.

Page 24: A Platform for Accelerating Machine Learning Applications


AI in the Enterprise: Deep Learning and Neural Networks for the mainstream?

– Social networking: sentiment analysis, facial recognition, understanding text, image recognition. (Yann LeCun was hired by Facebook, Geoff Hinton by Google, and Andrew Ng by Baidu.)
– Geospatial: scene classification of high spatial resolution remote-sensing (HSR-RS) images (BoVWs).
– Oil and Gas: channel sands identification; other seismic analysis.

Page 25: A Platform for Accelerating Machine Learning Applications


Self-driving cars

Deep neural networks are being used to understand the scene in self-driving cars!

Page 26: A Platform for Accelerating Machine Learning Applications

The 4 Stage IoT Solutions Architecture

The "Things" are primarily analog data sources: devices, machines, people, tools, cars, animals, clothes, toys, the environment, buildings, etc.

– Stage 1 (the edge): sensors/actuators (wired, wireless)
– Stage 2: Internet gateways and data acquisition systems (data aggregation, A/D, measurement, control)
– Stage 3: edge IT (analytics, pre-processing)
– Stage 4: data center / cloud (analytics, management, archive) and visualization

Data flows from stage 1 toward stage 4; control flows back from stage 4 toward stage 1. Stages 2 through 4 each carry a software stack for analytics, management, and control.

Page 27: A Platform for Accelerating Machine Learning Applications


Deliver automated intelligence in real time: unprecedented performance and scale with the HPE Apollo 6500 high-density GPU solution

Transformation areas: enable workplace productivity; empower a data-driven organization; transform to a hybrid infrastructure; protect your digital enterprise.

Use cases: automated intelligence delivered by HPE Apollo 6500 and deep learning software; video, image, text, audio, and time-series pattern recognition solutions; large, highly complex unstructured simulation and modeling; real-time and near-real-time analytics.

Customer benefits: faster model training time, better fusion of data.*

* Benchmarking results provided at or shortly after announcement.

The HPE Apollo 6500 is an ideal HPC and deep learning platform providing unprecedented performance, with 8 GPUs, a high-bandwidth fabric, and a configurable GPU topology to match deep learning workloads:
– Up to 8 high-powered GPUs per tray (node), 2P Intel E5-2600 v4 support
– Choice of high-speed, low-latency fabrics with 2x I/O expansion
– Workload optimized using flexible configuration capabilities

Page 28: A Platform for Accelerating Machine Learning Applications

HPE Apollo platforms and solutions are optimized for HPC, IoT, and Big Data

– Apollo 8000: supercomputing
– Apollo 6000: rack-scale HPC
– Apollo 4000: server solutions purpose-built for Big Data
– Apollo 2000: enterprise bridge to scale-out compute
– Moonshot: optimized for next-gen workloads (video encoding, mobile workplace, IoT)

HPC workloads: oil and gas, life sciences, financial services, manufacturing CAD/CAE, academia. Big Data workloads: object storage, data analytics.

Platform partners: Mellanox, NVIDIA, Seagate. Solutions / ISVs: Scality, Cleversafe, Ceph, Hortonworks, Hadoop, Cloudera, Schlumberger, Paradigm, Halliburton, Gaussian, BIOVIA, Redline, Synopsys, ANSYS, custom apps. Plus HPE Software (e.g., Vertica, HPE Haven) and HPE Enterprise Services.

Page 29: A Platform for Accelerating Machine Learning Applications


HP Apollo 6000 Power Shelf: pooled power efficiency

• External pooled power shelf
• Fits up to 6 power supplies
• 2400W or 2650W power supplies
• Up to 15.9kW non-redundant
• Single- or 3-phase AC input
• Up to twelve 12V DC cables

Dimensions: 1.5U (2.55 in) H x 17.64 in (44.81 cm) W x 30.88 in (78.44 cm) D.

[Figure: front and back views of the power shelf.]

Page 30: A Platform for Accelerating Machine Learning Applications


HPE Apollo 6500 solution innovation

The HPE Apollo 6500 is a dense GPU server optimized for deep learning and HPC workloads.

New technologies and products:
– Density optimization
– High performance fabrics
– GPU density
– Configurable GPU topologies
– More network bandwidth
– Power and cooling optimization
– Manageability
– Better productivity

Unique solution differentiators:
– Cluster management enhancements (massive scaling, open APIs, tight integration, multiple user interfaces)
– Deep learning and HPC software platform enablement (HPE CCTK, Caffe, CUDA, Google TensorFlow, HPE IDOL)
– System design innovation to maximize GPU capacity and performance with lower TCO

Page 31: A Platform for Accelerating Machine Learning Applications


Apollo 2000 + NVIDIA GPU promotion

Option 1: first choice for enterprise virtualization
– HPE Apollo 2000/XL190r, 1 node + NVIDIA Tesla M60 x1
– Apollo r2200 12LFF or r2600 24SFF
– XL190r Gen9 specs: E5-2640v4 x2 / 16GB x2 / 1TB x1 / 800W / 3yr Foundation Care 24x7 service
– NVIDIA Tesla M60 dual GPU x1
– From NT$360,000 (before tax)

Option 2: first choice for high performance computing
– HPE Apollo 2000/XL190r, 1 node + NVIDIA Tesla K80 x1
– Apollo r2200 12LFF or r2600 24SFF
– XL190r Gen9 specs: E5-2640v4 x2 / 16GB x2 / 1TB x1 / 800W / 3yr Foundation Care 24x7 service
– NVIDIA Tesla K80 dual GPU x1
– From NT$360,000 (before tax)

Limited-time, limited-quantity promotional bundles. The strongest combination: the densest HPE servers plus NVIDIA GPUs. A single 2U chassis holds up to 2 HPE Apollo system servers and 4 NVIDIA high-performance compute accelerator cards.

* Promotion ends 2016/12/31. If you are interested in the products, please call (02) 2652-4040 (Taiwan only).

Page 32: A Platform for Accelerating Machine Learning Applications

TAIPEI | SEP. 21-22, 2016

THANK YOU

