Accelerating Machine Learning and Deep Learning with Intel® High Performance Libraries
ISTEP Tokyo 2016
Jon Kim
Agenda
Intel’s Efforts to Accelerate Machine Learning (ML) and Deep Learning
Intel® Math Kernel Library (Intel® MKL) and DNN extensions for Intel® MKL and Caffe optimization
Intel® Data Analytics Acceleration Library (Intel® DAAL)
Customer Use Case
Summary and Q&A
Deep Learning Frameworks: Caffe
Intel® MKL & Intel® DAAL
How Intel Accelerates Machine Learning & Deep Learning
Silicon and libraries:
1. Intel® Xeon® and Intel® Xeon Phi™ processors, storage, and network powering the data center
2. ML primitives accelerated through Intel® Math Kernel Library (Intel® MKL) and Intel® Data Analytics Acceleration Library (Intel® DAAL)
3. Enhancing Spark* MLlib, Caffe*, Theano*, Python*, providing faster time to insight + MKL-DNN
4. Extending Machine Learning (ML) through Trusted Analytics Platform (TAP) and scale programs
5. Engineering resources to support all ML segments
[Diagram labels: Open Network Platform; Open Storage Platform; optimized with Intel MKL-DNN primitives for Deep Learning (new, open source); Trusted Analytics Platform (open source, ISV, SI, & academic developer outreach); Intel Solution Architects and Software Engineers; solutions for speech, image classification, prediction, recommendation; Intel® Math Kernel Library and Intel® Data Analytics Acceleration Library; +FPGA]
Agenda
Intel’s Efforts to Accelerate Machine Learning (ML) and Deep Learning
Intel® Math Kernel Library (Intel® MKL) and DNN extensions for Intel® MKL and Caffe optimization
Intel® Data Analytics Acceleration Library (Intel® DAAL)
Customer Use Case
Summary and Q&A
Deep Learning Frameworks: Caffe
Intel® MKL / MKL-DNN Extensions
Speeds math processing for machine learning, scientific, engineering, financial & design applications
Includes functions for dense & sparse linear algebra (BLAS, LAPACK, PARDISO), FFTs, vector math, summary statistics & more
Standard APIs for easy switching from other math libraries
Highly optimized, threaded & vectorized to maximize processor performance
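The dense linear algebra routines listed above follow the standard BLAS/LAPACK APIs; for example, the BLAS `dgemm` routine computes C := alpha·A·B + beta·C. As a rough illustration of those semantics, here is a pure-Python reference sketch (not MKL's optimized, threaded implementation — just the math a switched-in MKL would compute faster):

```python
def dgemm_ref(alpha, A, B, beta, C):
    """Reference semantics of BLAS dgemm: C := alpha*A*B + beta*C.
    A is m x k, B is k x n, C is m x n (lists of lists)."""
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

# 2x2 example: C = 1.0 * A*B + 0.0 * C
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
print(dgemm_ref(1.0, A, B, 0.0, C))  # [[19.0, 22.0], [43.0, 50.0]]
```

Because the API is standard, code written against another BLAS can typically switch to MKL at link time with no source changes.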
Intel® Math Kernel Library application domains: Energy; Financial Analytics; Engineering Design; Digital Content Creation; Science & Research; Signal Processing
Optimized Mathematical Building Blocks Intel® Math Kernel Library
Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• PARDISO* SMP & Cluster
• Iterative sparse solvers
Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
Vector RNGs
• Congruential
• Wichmann-Hill
• Mersenne Twister
• Sobol
• Niederreiter
• Non-deterministic
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
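Python's standard `random` module happens to use the same Mersenne Twister generator named in the RNG list above, and the `statistics` module covers several of the same summary statistics. A small pure-Python sketch of what the vector RNG and summary-statistics domains compute (stand-ins for illustration, not the MKL VSL API):

```python
import random
import statistics

rng = random.Random(2016)  # Mersenne Twister engine, seeded for reproducibility
sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Summary statistics analogous to the MKL categories listed above
print("min/max:", min(sample), max(sample))
print("mean:", statistics.fmean(sample))
print("variance:", statistics.pvariance(sample))
```

MKL's versions vectorize and thread these computations over large arrays; the statistical definitions are the same.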
Intel® Distribution for Python* with Intel® MKL
NumPy / SciPy: scientific computing libraries for Python.
Up to 100x performance improvement through Intel® MKL-accelerated Python computation packages, NumPy and SciPy.
See the knowledge base article “Build NumPy/SciPy with Intel® MKL”.
Deep Learning Frameworks: Caffe
Intel® High Performance Library
Intel® MKL-DNN*
Deep Learning Applications
Pedestrian detection
Self-driving cars – Google*/Apple*/Tesla*
Image content description – Google*/Bing*/Baidu*
Natural Language Processing – Siri*/Cortana*/Google Now*/Baidu*
Google/Bing/Baidu image searches
* All trademarks and registered trademarks are the property of their respective owners.
[Images courtesy Deep learning, LeCun, Bengio, Hinton, doi:10.1038/nature14539]
Intel Confidential
Multi-scale feature learning (hierarchical learning)
Low-level features are extracted from pixels; higher-level features are extracted from lower-level ones.
Intermediate network layers normalize data.
[Honglak Lee, NIPS 2010 Workshop Deep Learning and Unsupervised Feature Learning]
[Yann LeCun, The Unreasonable Effectiveness of Deep Learning, GPUTECHCONF 2014 webinar]
Topology: AlexNet
[Krizhevsky et al., NIPS 2012]
Popular DNN topology for image recognition; winner of the ImageNet Challenge 2012.
[Diagram: network hotspots; multi-scale feature learning]
Components of Intel MKL 2017
Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse solvers: iterative, PARDISO*, cluster sparse solver
Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
• Vector RNGs
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More…
• Splines
• Interpolation
• Trust region
• Fast Poisson solver
Deep Neural Networks
• Convolution
• Pooling
• Normalization
• ReLU
• Softmax
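The DNN primitives named above map to simple mathematical definitions. As a plain-Python sketch of their reference semantics (1-D toy versions for illustration — this assumes nothing about MKL's actual data layouts or API):

```python
import math

def relu(xs):
    """Rectified linear unit: max(0, x) elementwise."""
    return [max(0.0, x) for x in xs]

def softmax(xs):
    """Numerically stable softmax: exp(x - max) normalized to sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def maxpool1d(xs, size):
    """Non-overlapping 1-D max pooling with the given window size."""
    return [max(xs[i:i + size]) for i in range(0, len(xs), size)]

print(relu([-1.0, 2.0]))           # [0.0, 2.0]
print(softmax([0.0, 0.0]))         # [0.5, 0.5]
print(maxpool1d([1, 3, 2, 5], 2))  # [3, 5]
```

MKL 2017 provides cache- and vector-optimized versions of these operations over batched multi-dimensional tensors; the definitions above are only the scalar math they implement.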
MKL-DNN functionality roadmap (in the original slide: green – done; bold – high confidence; italic – low confidence)

Feature                  | MKL-DNN v0.1 Tech Preview        | MKL-DNN v0.5 Tech Preview
CPU optimizations        | HSW                              | HSW
Accelerator support      | –                                | FPGA-ORCA
Primitive types          | forward inference                | forward inference, backward
Data types               | SP                               | SP
Compute                  | fully connected, Conv 3D direct  | fully connected, Conv 3D direct, Conv 3D Winograd
Merged primitives        | Conv 3D direct + ReLU            | Conv 3D direct + ReLU
Pooling                  | maxpool                          | maxpool, avgpool, minpool
Activation               | ReLU                             | ReLU
Normalization            | LRN                              | LRN, batched
Auxiliary functionality  | –                                | split, concat, sum
Topologies               | AlexNet, VGG                     | AlexNet, VGG, GoogleNet, ResidualNet, Cifar
Intel Machine Learning Software Stack
Intel® Math Kernel Library: SW building block to extract maximum Intel HW performance and provide a common interface to all Intel accelerators (Xeon, Xeon Phi, FPGA, MLECVE; host & offload).
Intel® MKL-DNN*: open source IA-optimized DNN APIs, combined with Intel® MKL and build tools designed for scalable, high-velocity integration with ML/DL frameworks. Includes:
• Open source implementations of new DNN functionality included in MKL 2017
• New algorithms ahead of MKL releases
• IA optimizations contributed by the community
Frameworks: popular ML/DL frameworks with fast-evolving algorithms for training and classifying images and speech.
Intel libraries as a path to bring optimized ML/DL frameworks to Intel hardware.
+ Intel MKL
Intel Caffe: https://github.com/intelcaffe/caffe
• Aimed at improving Caffe performance on CPU, in particular Xeon servers
Configuration:
• Install Intel MKL 2017, set `BLAS := mkl` in Makefile.config (cmake: add -DBLAS=mkl), and build. Support for the DNN primitives will be detected and compiled into Caffe.
• Set `engine: MKL2017` in the layer's parameter definition.
• An example of a network accelerated by using DNN primitives is in models/mkl2017_alexnet.
>12x speedup: Intel Caffe + Intel MKL 2017 over the main Caffe branch on an Intel® Xeon® Processor E5-2699 v4 for AlexNet training.
Caffe integration code:

template <typename Dtype>
void NeuraliaReLULayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
                                          const vector<Blob<Dtype>*>& top) {
  Dtype negative_slope = this->layer_param_.relu_param().negative_slope();
  vector<unsigned> sizes;
  for (auto d : bottom[0]->shape())
    sizes.push_back(d);
  bottom_data_ = memory::create({engine::cpu, memory::format::yxfb_f32, sizes});
  top_data_    = memory::create({engine::cpu, memory::format::yxfb_f32, sizes});
  bottom_diff_ = memory::create({engine::cpu, memory::format::yxfb_f32, sizes});
  top_diff_    = memory::create({engine::cpu, memory::format::yxfb_f32, sizes});
  reluFwd_ = relu::create({engine::reference, top_data_, bottom_data_, negative_slope});
  reluBwd_ = relu_backward::create({engine::reference, {bottom_diff_},
                                    {top_diff_, bottom_data_}, negative_slope});
}

template <typename Dtype>
void NeuraliaReLULayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
                                           const vector<Blob<Dtype>*>& top) {
  void* bottom_data = (void*)bottom[0]->prv_data();
  void* top_data = NULL;
  if (bottom_data) {
    top_data = top[0]->mutable_prv_data();
  } else {
    DLOG(INFO) << "Using cpu_data in NeuraliaReLULayer.";
    bottom_data = (void*)bottom[0]->cpu_data();
    top_data = top[0]->mutable_cpu_data();
  }
  execute({bottom_data_(bottom_data), top_data_(top_data), reluFwd_});
}

Integration of a ReLU layer into Caffe using the API in 8 lines of code.
Caffe + Intel MKL
Open source version (intelcaffe@github): https://github.com/intelcaffe/caffe/
• Delivered with source code
• Based on the latest BVLC Caffe
• Supports all network topologies
• Supports multi-node training
• No specific hardware requirement
https://software.intel.com/en-us/articles/single-node-caffe-scoring-and-training-on-intel-xeon-e5-series-processors
[Bar chart: Caffe/AlexNet performance, images/second, on a 2-socket Intel® Xeon® E5-2699 v3 processor-based system; out-of-the-box vs. Intel Optimized, for Training and Scoring (reported values: 273, 731, 2575).]
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance .
Agenda
Intel’s Efforts to Accelerate Machine Learning (ML) and Deep Learning
Intel® Math Kernel Library (Intel® MKL) and DNN extensions for Intel® MKL and Caffe optimization
Intel® Data Analytics Acceleration Library (Intel® DAAL)
Customer Use Case
Summary and Q&A
Deep Learning Frameworks: Caffe
Intel® DAAL
Intel® Data Analytics Acceleration Library (Intel® DAAL)
New high performance library for big data analytics
Library of Intel® Architecture (IA) optimized analytics building blocks
• Key ingredient for proprietary and open source data analytics platforms and applications
• Fundamental building blocks for all data analysis stages
• Delivers forward-scaling performance and parallelism on IA
• Built upon Intel® Math Kernel Library and Intel® Integrated Performance Primitives
Stack: big data platform (Spark*) → Intel® Data Analytics Acceleration Library (Intel® DAAL) → hardware layer: IA (Intel Xeon processor)
[Bar chart: PCA time in seconds (smaller is better): Intel DAAL PCA – 105; Spark native PCA – 480. Saves up to 4.57x.]
Intel DAAL dramatically speeds up PCA, reducing data dimensionality from 4096 to 128 on a big data platform.
The test data is based on Intel® Xeon® E5620 @ 2.4 GHz, 2 sockets, 64 GB RAM, 2 nodes.
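The PCA that DAAL accelerates here boils down to finding leading eigenvectors of the data's covariance matrix. A minimal pure-Python sketch of that idea via power iteration on a 2-D toy covariance (illustrative only — DAAL's distributed PCA over 4096-dimensional data is far more involved, and this is not the DAAL API):

```python
def power_iteration(cov, iters=100):
    """Return the dominant eigenvector (principal direction) of a
    symmetric matrix via power iteration."""
    n = len(cov)
    v = [1.0] * n
    for _ in range(iters):
        # w = cov @ v
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Toy covariance: variance 4 along x, 1 along y -> principal axis ~ [1, 0]
cov = [[4.0, 0.0], [0.0, 1.0]]
print(power_iteration(cov))
```

Projecting the data onto the top k such directions is what reduces 4096 dimensions to 128.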
Intel® Data Analytics Acceleration Library(Intel® DAAL)
An industry leading Intel® Architecture based data analytics acceleration library of fundamental algorithms covering all data analysis stages.
Data analysis stages: Pre-processing → Transformation → Analysis → Modeling → Validation → Decision Making
Algorithms span the stages: (de-)compression; PCA; statistical moments; variance-covariance matrix; matrix decompositions; Apriori; K-Means clustering; EM for GMM; linear regression; decision trees; Naïve Bayes; multi-class SVM; boosting
Application domains: Scientific/Engineering; Web/Social; Business
Where Does Intel DAAL Fit?
Intel® Data Analytics Acceleration Library
Utilities
• Data sources
• Data compression
• Serialization
• Model import/export
Analysis
• PCA
• Low order moments
• Matrix factorization
• Outlier detection
• Distances
• Association rules
• …
Machine learning
• Regression: linear regression, ridge regression
• Classification: SVM, Naïve Bayes, boosting algorithms
• Recommendation: ALS
• Clustering: K-Means, EM for GMM
• …
Programming languages: C++, Java, Python
Processing modes: batch processing, distributed processing, online processing
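The "online processing" mode listed above consumes data incrementally and updates a partial result without revisiting earlier observations. A minimal sketch of that idea using Welford's streaming mean/variance update (a generic illustration of the processing mode, not the DAAL API):

```python
class OnlineMoments:
    """Streaming mean/variance via Welford's algorithm: each observation
    updates the partial result; earlier data never needs to be reread."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

om = OnlineMoments()
for block in ([1.0, 2.0], [3.0, 4.0]):  # data arrives in blocks
    for x in block:
        om.update(x)
print(om.mean, om.variance())
```

Distributed processing follows the same pattern, except partial results are computed per node and then merged.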
Agenda
Intel’s Efforts to Accelerate Machine Learning (ML) and Deep Learning
Intel® Math Kernel Library (Intel® MKL) and DNN extensions for Intel® MKL and Caffe optimization
Intel® Data Analytics Acceleration Library (Intel® DAAL)
Customer Use Case
Summary and Q&A
Case Study: Ad Recommendation ML Platform
Background:
• An ML application used at an advertising company for advertisement recommendation
• The advertisement system is the company's key component and impacts ~40% of the group's revenue
Challenge:
• Memory bound, with heavy random memory access; little performance benefit from Intel® Advanced Vector Extensions 2/FMA
• A compatibility issue with ICC; another hotspot used OpenBLAS (daxpy) with very poor core scaling
Our Solution:
• Replaced OpenBLAS with Intel MKL, yielding 35% better performance; the customer is deploying the optimized solution
[Bar chart: performance comparison, higher is better: OpenBLAS – 1.0 (baseline); Intel MKL – 1.35. Up to 35% performance increase. The test data is based on an Intel® Xeon® E5-2699 v3 processor.]
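The daxpy hotspot mentioned above is the BLAS level-1 routine y := a·x + y. Its semantics in a pure-Python reference sketch — both OpenBLAS and MKL expose it through the same standard BLAS interface, which is what made the library swap drop-in:

```python
def daxpy_ref(a, x, y):
    """Reference semantics of BLAS daxpy: y := a*x + y, elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(daxpy_ref(2.0, [1.0, 2.0], [3.0, 4.0]))  # [5.0, 8.0]
```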
This company uses ML widely today: ad recommendation, intelligent customer service in the payment application, image classification, voice recognition, etc.
Case Study: Cloud Service Provider* Illegal Video Detection
The cloud provider offers an illegal-video detection service to third-party video cloud customers to help them detect illegal videos.
Originally, the provider adopted open source BVLC Caffe + OpenBLAS as the CNN framework, but saw poor performance.
By using Intel Optimized Caffe plus Intel® Math Kernel Library, it achieved up to 30x performance improvement for training in the production environment.
* The test data is based on Intel® Xeon® E5-2680 v3 processor
Cloud Illegal Video Detection Process Flow: video → extract key frames → pictures → input to Caffe-based CNN → picture scoring → classification (pornographic / sexual / normal)
[Bar chart: optimized performance of cloud* illegal video detection, higher is better: BVLC Caffe + OpenBLAS (baseline) vs. Intel Optimized Caffe* + MKL – up to 30x.]
Agenda
Intel’s Efforts to Accelerate Machine Learning (ML) and Deep Learning
Intel® Math Kernel Library (Intel® MKL) and DNN extensions for Intel® MKL and Caffe optimization
Intel® Data Analytics Acceleration Library (Intel® DAAL)
Customer Use Case
Summary and Q&A
Summary
• Intel MKL/DAAL boosts application performance with minimal effort
• Robust and growing feature set for Machine Learning
• Easy to use and maintain
• Supports current and future processors
• Intel Engineering resources
• Deep learning applications can benefit from Intel MKL and DAAL functionality
• DNN primitives introduced in Intel MKL and DAAL 2017
• Intel Caffe improves performance on CPU
Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Intel, 3D XPoint, Xeon, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804