Accelerating Machine Learning and Deep Learning with Intel® High Performance Libraries
ISTEP Tokyo 2016
Jon Kim
Agenda
Intel’s Efforts to Accelerate Machine Learning (ML) and Deep Learning
Intel® Math Kernel Library (Intel® MKL) and DNN extensions for Intel® MKL and Caffe optimization
Intel® Data Analytics Acceleration Library (Intel® DAAL)
Customer Use Case
Summary and Q&A
Deep Learning Frameworks: Caffe
Intel® MKL & Intel® DAAL
How Intel Accelerates Machine Learning & Deep Learning
Silicon and libraries:
1. Intel® Xeon® and Intel® Xeon Phi™ processors, storage, and network powering the data center
2. ML primitives accelerated through Intel® Math Kernel Library (Intel® MKL) and Intel® Data Analytics Acceleration Library (Intel® DAAL)
3. Enhancing Spark* MLlib, Caffe*, Theano*, Python*, providing faster time to insight + MKL-DNN
4. Extending Machine Learning (ML) through Trusted Analytics Platform (TAP) and scale programs
5. Engineering resources to support all ML segments
[Diagram labels: Open Network Platform; Open Storage Platform; optimized with Intel MKL-DNN primitives for Deep Learning (new, open source); Trusted Analytics Platform (open source, ISV, SI, & academic developer outreach); Intel Solution Architects and Software Engineers; solutions for speech, image classification, prediction, recommendation; Intel® Math Kernel Library and Intel® Data Analytics Acceleration Library; +FPGA]
Agenda
Intel’s Efforts to Accelerate Machine Learning (ML) and Deep Learning
Intel® Math Kernel Library (Intel® MKL) and DNN extensions for Intel® MKL and Caffe optimization
Intel® Data Analytics Acceleration Library (Intel® DAAL)
Customer Use Case
Summary and Q&A
Deep Learning Frameworks: Caffe
Intel® MKL / MKL-DNN Extensions
Speeds math processing for machine learning, scientific, engineering, financial & design applications
Includes functions for dense & sparse linear algebra (BLAS, LAPACK, PARDISO), FFTs, vector math, summary statistics & more
Standard APIs for easy switching from other math libraries
Highly optimized, threaded & vectorized to maximize processor performance
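The dense linear algebra routines listed above follow the standard BLAS/LAPACK APIs; for example, the BLAS `dgemm` routine computes C := alpha·A·B + beta·C. As a rough illustration of those semantics, here is a pure-Python reference sketch (not MKL's optimized, threaded implementation — just the math a switched-in MKL would compute faster):

```python
def dgemm_ref(alpha, A, B, beta, C):
    """Reference semantics of BLAS dgemm: C := alpha*A*B + beta*C.
    A is m x k, B is k x n, C is m x n (lists of lists)."""
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

# 2x2 example: C = 1.0 * A*B + 0.0 * C
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
print(dgemm_ref(1.0, A, B, 0.0, C))  # [[19.0, 22.0], [43.0, 50.0]]
```

Because the API is standard, code written against another BLAS can typically switch to MKL at link time with no source changes.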
Intel® Math Kernel Library application domains: Energy; Financial Analytics; Engineering Design; Digital Content Creation; Science & Research; Signal Processing
Optimized Mathematical Building Blocks Intel® Math Kernel Library
Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• PARDISO* SMP & Cluster
• Iterative sparse solvers
Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
Vector RNGs
• Congruential
• Wichmann-Hill
• Mersenne Twister
• Sobol
• Niederreiter
• Non-deterministic
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
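Python's standard `random` module happens to use the same Mersenne Twister generator named in the RNG list above, and the `statistics` module covers several of the same summary statistics. A small pure-Python sketch of what the vector RNG and summary-statistics domains compute (stand-ins for illustration, not the MKL VSL API):

```python
import random
import statistics

rng = random.Random(2016)  # Mersenne Twister engine, seeded for reproducibility
sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Summary statistics analogous to the MKL categories listed above
print("min/max:", min(sample), max(sample))
print("mean:", statistics.fmean(sample))
print("variance:", statistics.pvariance(sample))
```

MKL's versions vectorize and thread these computations over large arrays; the statistical definitions are the same.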
Intel® Distribution for Python* with Intel® MKL
NumPy / SciPy: scientific computing libraries for Python.
Up to 100x performance improvement through Intel® MKL-accelerated Python computation packages, NumPy and SciPy.
See the knowledge base article “Build NumPy/SciPy with Intel® MKL”.
Deep Learning Frameworks: Caffe
Intel® High Performance Library
Intel® MKL-DNN*
Deep Learning Applications
Pedestrian detection
Self-driving cars – Google*/Apple*/Tesla*
Image content description – Google*/Bing*/Baidu*
Natural Language Processing – Siri*/Cortana*/Google Now*/Baidu*
Google/Bing/Baidu image searches
* All trademarks and registered trademarks are the property of their respective owners.
[Images courtesy Deep learning, LeCun, Bengio, Hinton, doi:10.1038/nature14539]
Intel Confidential
Multi-scale feature learning (hierarchical learning)
Low-level features are extracted from pixels; higher-level features are extracted from lower-level ones.
Intermediate network layers normalize data.
[Honglak Lee, NIPS 2010 Workshop Deep Learning and Unsupervised Feature Learning]
[Yann LeCun, The Unreasonable Effectiveness of Deep Learning, GPUTECHCONF 2014 webinar]
Topology: AlexNet
[Krizhevsky et al., NIPS 2012]
Popular DNN topology for image recognition; winner of the ImageNet Challenge 2012.
[Diagram: network hotspots; multi-scale feature learning]
Components of Intel MKL 2017
Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse solvers: iterative, PARDISO*, cluster sparse solver
Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
• Vector RNGs
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More…
• Splines
• Interpolation
• Trust region
• Fast Poisson solver
Deep Neural Networks
• Convolution
• Pooling
• Normalization
• ReLU
• Softmax
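The DNN primitives named above map to simple mathematical definitions. As a plain-Python sketch of their reference semantics (1-D toy versions for illustration — this assumes nothing about MKL's actual data layouts or API):

```python
import math

def relu(xs):
    """Rectified linear unit: max(0, x) elementwise."""
    return [max(0.0, x) for x in xs]

def softmax(xs):
    """Numerically stable softmax: exp(x - max) normalized to sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def maxpool1d(xs, size):
    """Non-overlapping 1-D max pooling with the given window size."""
    return [max(xs[i:i + size]) for i in range(0, len(xs), size)]

print(relu([-1.0, 2.0]))           # [0.0, 2.0]
print(softmax([0.0, 0.0]))         # [0.5, 0.5]
print(maxpool1d([1, 3, 2, 5], 2))  # [3, 5]
```

MKL 2017 provides cache- and vector-optimized versions of these operations over batched multi-dimensional tensors; the definitions above are only the scalar math they implement.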
MKL-DNN functionality roadmap (in the original slide: green – done; bold – high confidence; italic – low confidence)

Feature                  | MKL-DNN v0.1 Tech Preview        | MKL-DNN v0.5 Tech Preview
CPU optimizations        | HSW                              | HSW
Accelerator support      | –                                | FPGA-ORCA
Primitive types          | forward inference                | forward inference, backward
Data types               | SP                               | SP
Compute                  | fully connected, Conv 3D direct  | fully connected, Conv 3D direct, Conv 3D Winograd
Merged primitives        | Conv 3D direct + ReLU            | Conv 3D direct + ReLU
Pooling                  | maxpool                          | maxpool, avgpool, minpool
Activation               | ReLU                             | ReLU
Normalization            | LRN                              | LRN, batched
Auxiliary functionality  | –                                | split, concat, sum
Topologies               | AlexNet, VGG                     | AlexNet, VGG, GoogleNet, ResidualNet, Cifar
Intel Machine Learning Software Stack
Intel® Math Kernel Library: SW building block to extract maximum Intel HW performance and provide a common interface to all Intel accelerators (Xeon, Xeon Phi, FPGA, MLECVE; host & offload).
Intel® MKL-DNN*: open source IA-optimized DNN APIs, combined with Intel® MKL and build tools designed for scalable, high-velocity integration with ML/DL frameworks. Includes:
• Open source implementations of new DNN functionality included in MKL 2017
• New algorithms ahead of MKL releases
• IA optimizations contributed by the community
Frameworks: popular ML/DL frameworks with fast-evolving algorithms for training and classifying images and speech.
Intel libraries as a path to bring optimized ML/DL frameworks to Intel hardware.
+ Intel MKL
Intel Caffe: https://github.com/intelcaffe/caffe
• Aimed at improving Caffe performance on CPU, in particular Xeon servers
Configuration:
• Install Intel MKL 2017, set `BLAS := mkl` in Makefile.config (cmake: add -DBLAS=mkl), and build. Support for the DNN primitives will be detected and compiled into Caffe.
• Set `engine: MKL2017` in the layer's parameter definition.
• An example of a network accelerated by using DNN primitives is in models/mkl2017_alexnet.
>12x speedup: Intel Caffe + Intel MKL 2017 over the main Caffe branch on an Intel® Xeon® Processor E5-2699 v4 for AlexNet training.
Caffe integration code:

template <typename Dtype>
void NeuraliaReLULayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
                                          const vector<Blob<Dtype>*>& top) {
  Dtype negative_slope = this->layer_param_.relu_param().negative_slope();
  vector<unsigned> sizes;
  for (auto d : bottom[0]->shape())
    sizes.push_back(d);
  bottom_data_ = memory::create({engine::cpu, memory::format::yxfb_f32, sizes});
  top_data_    = memory::create({engine::cpu, memory::format::yxfb_f32, sizes});
  bottom_diff_ = memory::create({engine::cpu, memory::format::yxfb_f32, sizes});
  top_diff_    = memory::create({engine::cpu, memory::format::yxfb_f32, sizes});
  reluFwd_ = relu::create({engine::reference, top_data_, bottom_data_, negative_slope});
  reluBwd_ = relu_backward::create({engine::reference, {bottom_diff_},
                                    {top_diff_, bottom_data_}, negative_slope});
}

template <typename Dtype>
void NeuraliaReLULayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
                                           const vector<Blob<Dtype>*>& top) {
  void* bottom_data = (void*)bottom[0]->prv_data();
  void* top_data = NULL;
  if (bottom_data) {
    top_data = top[0]->mutable_prv_data();
  } else {
    DLOG(INFO) << "Using cpu_data in NeuraliaReLULayer.";
    bottom_data = (void*)bottom[0]->cpu_data();
    top_data = top[0]->mutable_cpu_data();
  }
  execute({bottom_data_(bottom_data), top_data_(top_data), reluFwd_});
}

Integration of a ReLU layer into Caffe using the API in 8 lines of code.
Caffe + Intel MKL
Open source version (intelcaffe@github): https://github.com/intelcaffe/caffe/
• Delivered with source code
• Based on the latest BVLC Caffe
• Supports all network topologies
• Supports multi-node training
• No specific hardware requirement
https://software.intel.com/en-us/articles/single-node-caffe-scoring-and-training-on-intel-xeon-e5-series-processors
[Bar chart: Caffe/AlexNet performance, images/second, on a 2-socket Intel® Xeon® E5-2699 v3 processor-based system; out-of-the-box vs. Intel Optimized, for Training and Scoring (reported values: 273, 731, 2575).]
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance .
Agenda
Intel’s Efforts to Accelerate Machine Learning (ML) and Deep Learning
Intel® Math Kernel Library (Intel® MKL) and DNN extensions for Intel® MKL and Caffe optimization
Intel® Data Analytics Acceleration Library (Intel® DAAL)
Customer Use Case
Summary and Q&A
Deep Learning Frameworks: Caffe
Intel® DAAL
Intel® Data Analytics Acceleration Library (Intel® DAAL)
New high performance library for big data analytics
Library of Intel® Architecture (IA) optimized analytics building blocks
• Key ingredient for proprietary and open source data analytics platforms and applications
• Fundamental building blocks for all data analysis stages
• Delivers forward-scaling performance and parallelism on IA
• Built upon Intel® Math Kernel Library and Intel® Integrated Performance Primitives
Stack: big data platform (Spark*) → Intel® Data Analytics Acceleration Library (Intel® DAAL) → hardware layer: IA (Intel Xeon processor)
[Bar chart: PCA time in seconds (smaller is better): Intel DAAL PCA – 105; Spark native PCA – 480. Saves up to 4.57x.]
Intel DAAL dramatically speeds up PCA, reducing data dimensionality from 4096 to 128 on a big data platform.
The test data is based on Intel® Xeon® E5620 @ 2.4 GHz, 2 sockets, 64 GB RAM, 2 nodes.
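The PCA that DAAL accelerates here boils down to finding leading eigenvectors of the data's covariance matrix. A minimal pure-Python sketch of that idea via power iteration on a 2-D toy covariance (illustrative only — DAAL's distributed PCA over 4096-dimensional data is far more involved, and this is not the DAAL API):

```python
def power_iteration(cov, iters=100):
    """Return the dominant eigenvector (principal direction) of a
    symmetric matrix via power iteration."""
    n = len(cov)
    v = [1.0] * n
    for _ in range(iters):
        # w = cov @ v
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Toy covariance: variance 4 along x, 1 along y -> principal axis ~ [1, 0]
cov = [[4.0, 0.0], [0.0, 1.0]]
print(power_iteration(cov))
```

Projecting the data onto the top k such directions is what reduces 4096 dimensions to 128.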
Intel® Data Analytics Acceleration Library(Intel® DAAL)
An industry leading Intel® Architecture based data analytics acceleration library of fundamental algorithms covering all data analysis stages.
Data analysis stages: Pre-processing → Transformation → Analysis → Modeling → Validation → Decision Making
Algorithms span the stages: (de-)compression; PCA; statistical moments; variance-covariance matrix; matrix decompositions; Apriori; K-Means clustering; EM for GMM; linear regression; decision trees; Naïve Bayes; multi-class SVM; boosting
Application domains: Scientific/Engineering; Web/Social; Business
Where Does Intel DAAL Fit?
Intel® Data Analytics Acceleration Library
Utilities
• Data sources
• Data compression
• Serialization
• Model import/export
Analysis
• PCA
• Low order moments
• Matrix factorization
• Outlier detection
• Distances
• Association rules
• …
Machine learning
• Regression: linear regression, ridge regression
• Classification: SVM, Naïve Bayes, boosting algorithms
• Recommendation: ALS
• Clustering: K-Means, EM for GMM
• …
Programming languages: C++, Java, Python
Processing modes: batch processing, distributed processing, online processing
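The "online processing" mode listed above consumes data incrementally and updates a partial result without revisiting earlier observations. A minimal sketch of that idea using Welford's streaming mean/variance update (a generic illustration of the processing mode, not the DAAL API):

```python
class OnlineMoments:
    """Streaming mean/variance via Welford's algorithm: each observation
    updates the partial result; earlier data never needs to be reread."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

om = OnlineMoments()
for block in ([1.0, 2.0], [3.0, 4.0]):  # data arrives in blocks
    for x in block:
        om.update(x)
print(om.mean, om.variance())
```

Distributed processing follows the same pattern, except partial results are computed per node and then merged.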
Agenda
Intel’s Efforts to Accelerate Machine Learning (ML) and Deep Learning
Intel® Math Kernel Library (Intel® MKL) and DNN extensions for Intel® MKL and Caffe optimization
Intel® Data Analytics Acceleration Library (Intel® DAAL)
Customer Use Case
Summary and Q&A
Case Study: Ad Recommendation ML Platform
Background:
• An ML application used at an advertising company for advertisement recommendation
• The advertisement system is the company's key component and impacts ~40% of the group's revenue
Challenge:
• Memory bound, with heavy random memory access; little performance benefit from Intel® Advanced Vector Extensions 2/FMA
• A compatibility issue with ICC; another hotspot used OpenBLAS (daxpy) with very poor core scaling
Our Solution:
• Replaced OpenBLAS with Intel MKL, yielding 35% better performance; the customer is deploying the optimized solution
[Bar chart: performance comparison, higher is better: OpenBLAS – 1.0 (baseline); Intel MKL – 1.35. Up to 35% performance increase. The test data is based on an Intel® Xeon® E5-2699 v3 processor.]
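The daxpy hotspot mentioned above is the BLAS level-1 routine y := a·x + y. Its semantics in a pure-Python reference sketch — both OpenBLAS and MKL expose it through the same standard BLAS interface, which is what made the library swap drop-in:

```python
def daxpy_ref(a, x, y):
    """Reference semantics of BLAS daxpy: y := a*x + y, elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(daxpy_ref(2.0, [1.0, 2.0], [3.0, 4.0]))  # [5.0, 8.0]
```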
This company uses ML widely today: ad recommendation, intelligent customer service in the payment application, image classification, voice recognition, etc.
Case Study: Cloud Service Provider* Illegal Video Detection
The cloud provider offers an illegal-video detection service to third-party video cloud customers to help them detect illegal videos.
Originally, the provider adopted open source BVLC Caffe + OpenBLAS as the CNN framework, but saw poor performance.
By using Intel Optimized Caffe plus Intel® Math Kernel Library, it achieved up to 30x performance improvement for training in the production environment.
* The test data is based on Intel® Xeon® E5-2680 v3 processor
Cloud Illegal Video Detection Process Flow: video → extract key frames → pictures → input to Caffe-based CNN → picture scoring → classification (pornographic / sexual / normal)
[Bar chart: optimized performance of cloud* illegal video detection, higher is better: BVLC Caffe + OpenBLAS (baseline) vs. Intel Optimized Caffe* + MKL – up to 30x.]
Agenda
Intel’s Efforts to Accelerate Machine Learning (ML) and Deep Learning
Intel® Math Kernel Library (Intel® MKL) and DNN extensions for Intel® MKL and Caffe optimization
Intel® Data Analytics Acceleration Library (Intel® DAAL)
Customer Use Case
Summary and Q&A
Summary
• Intel MKL/DAAL boosts application performance with minimal effort
• Robust and growing feature set for Machine Learning
• Easy to use and maintain
• Supports current and future processors
• Intel Engineering resources
• Deep learning applications can benefit from Intel MKL and DAAL functionality
• DNN primitives introduced in Intel MKL and DAAL 2017
• Intel Caffe improves performance on CPU
Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Intel, 3D XPoint, Xeon, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804