
The convergence of HPC and Big Data Intel® Data Analytics ... - Coding high performan… · stages...

Date post: 31-May-2020
The convergence of HPC and Big Data: Intel® Data Analytics Acceleration Library (DAAL). Roger Philp. Intel HPC Software Workshop Series 2016: HPC Code Modernization for Intel® Xeon and Xeon Phi™. February 17th 2016, Barcelona.
Transcript
Page 1

The convergence of HPC and Big Data

Intel® Data Analytics Acceleration Library (DAAL)

Roger Philp

Intel HPC Software Workshop Series 2016

HPC Code Modernization for Intel® Xeon and Xeon Phi™

February 17th 2016, Barcelona

Page 2

Setting the stage for analytics

Page 3

Data analytics in the age of Big Data

Big Data is converging with High Performance Computing

• Extreme Data Volumes

• Intense Compute Workloads

Gap between current programming and hardware evolution

• More cores/threads, wider vectors, more memory, more storage, faster interconnect

• Many big data applications leave performance on the table because they are not optimized for the underlying hardware

[Figure labels: More Cores, More Threads, More Memory, More Storage, Wider SIMD Vectors, Faster Interconnect]

Page 4

Fraud detection

Detected using Benford’s Law

[Figure: two panels labeled A and B]

Page 5

Benford’s law

Also called the first digit law

States that in listings, tables of statistics, etc., the digit 1 tends to occur as the leading digit with probability ∼30%, much greater than the 11.1% expected if all nine leading digits were equally likely (i.e., one digit out of 9)

Source: MathWorld
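The percentages quoted for this slide (30.1%, 17.6%, 12.5%, ...) follow from the closed form of Benford's law, P(d) = log10(1 + 1/d) for a leading digit d from 1 to 9. The following is a minimal Python sketch, not taken from the talk: the function names are illustrative, and the powers-of-two sample is just a convenient sequence known to follow the law closely. It compares the expected distribution with the observed leading digits of a data set, which is the basic idea behind using the law to flag suspicious figures.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def benford_probability(d: int) -> float:
    """Expected probability that the leading digit equals d (1..9)."""
    return math.log10(1 + 1 / d)


def leading_digit(x: float) -> int:
    """Leading non-zero decimal digit of a non-zero number."""
    # Scientific notation always starts with the leading digit.
    return int(f"{abs(float(x)):.15e}"[0])


def first_digit_frequencies(values: Iterable[float]) -> Dict[int, float]:
    """Observed share of each leading digit among the non-zero values."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


expected = {d: benford_probability(d) for d in range(1, 10)}

# Illustrative sample: the first 200 powers of two.
sample = [2 ** n for n in range(1, 201)]
observed = first_digit_frequencies(sample)

for d in range(1, 10):
    print(f"{d}: expected {expected[d]:.1%}, observed {observed[d]:.1%}")
```

A large gap between the observed and expected columns for real-world amounts (invoices, expense claims) is a reason to look closer, not proof of fraud.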

[Bar chart: probability P(D) of first digit D. 1: 30.1%, 2: 17.6%, 3: 12.5%, 4: 9.7%, 5: 7.9%, 6: 6.7%, 7: 5.8%, 8: 5.1%, 9: 4.6%]

IDC estimates that High Performance Data Analytics has saved PayPal more than $700 million so far (fraud detection)

Page 6

Big data analytics: problem statement

Spark* MLLib

Breeze

Netlib-Java

JVM

JNI

Netlib BLAS

• Run on state-of-the-art hardware

• Built with a patchwork of math libs

• Not exploiting HW performance features

• Limited performance

• Many layers of dependencies

• Low ROI on HW investment

Page 7

Big data analytics: desired solution

• Run on state-of-the-art hardware

• Single library to cover all stages of data analytics

• Fully optimized for underlying HW

• Optimized performance

• Simpler development/deployment

• High ROI on HW investment

Page 8

Intel® DAAL for HPC and analytics

Page 9

Introducing Intel® DAAL

Data Analytics Acceleration Library

An industry-leading, end-to-end, IA-based data analytics acceleration library of fundamental algorithms covering all data analysis stages

What Intel DAAL brings to the game:

• Optimized algorithms based upon Intel® Math Kernel Library (Intel® MKL)

• Data serialization primitives to help run in a distributed system

• Speed and some building blocks for preprocessing data

• Supports IA-32 and Intel® 64 architectures

• C++ and Java APIs

• Static and dynamic linking

• A standalone library, also bundled in Intel® Parallel Studio XE 2016

• Windows, Linux, OS X

• Microsoft Visual Studio* (Windows*) & Eclipse/CDT* (Linux*)

Page 10

Create faster code… faster

Intel® Parallel Studio XE

• Design, build, verify and tune

• C++, C, Fortran and Java*

Highlights from what’s new for the “2016” edition

• Intel® Data Analytics Acceleration Library

• Vectorization Advisor: Custom Analysis and Advice

• MPI Performance Snapshot: Scalable profiling

• Support for the latest Standards, Operating Systems and Processors

Page 11

Intel® Parallel Studio XE 2016 configurations

Full Licensing (including Intel® Premier Support): Composer Edition, Professional Edition, Cluster Edition

Free Licensing: Student/Educator, Open Source Contributor, Academic Researcher, Community (Everyone!)

Component

Intel® C/C++ Compiler (including Intel® Cilk™ Plus)

Intel® Fortran Compiler

OpenMP 4.0

Intel® Threading Building Blocks (C++ only)

Intel® IPP Library (C/C++ only)

Intel® Math Kernel Library

Intel® Data Analytics Acceleration Library

Intel® MPI Library

Rogue Wave IMSL Library (Fortran only)

Bundled and Add-on

Add-on Add-on

Intel® Advisor XE

Intel® Inspector XE

Intel® VTune™ Amplifier XE

Intel® ITAC + MPI Performance Snapshot

Page 12

What Intel® DAAL “is” and “isn’t”?

It is…

A performance library with C++ and Java APIs optimized for Intel architectures

A collection of common building blocks for constructing high-end solutions in all stages of a data analytics project

Abstracted from communication layers and data sources, to be easily integrated into different analytics platforms

Boosting performance of critical algorithms hence reducing time-to-value of your big data projects

It is not…

A programming environment or a cluster computing framework (like MATLAB, R, or Hadoop)

A black-box solution to tackle domain specific analytics needs

A toolkit or plug-in tied to a particular big data platform

Promoting fancy algorithms as the silver bullet for all your big data needs

Page 13

How Intel accelerates analytics & machine learning

Intel® Analytics Toolkit for Apache Hadoop* software providing faster time to insight

Enhancing MLLib, Spark*, GraphX

Extending Machine Learning (ML) through Intel® Data Analytics Acceleration Library (Intel® DAAL) and “Trusted Analytics” platform

ML primitives accelerated through Intel® Math Kernel Library (Intel® MKL)

Intel® Xeon®, Intel® Xeon Phi™ processors powering the Data Center

[Figure labels: Hardware Platforms, Intel Analytics Toolkit, Intel DAAL, Intel MKL]

Page 14

Intel® DAAL in the analytics workflow

Data domains: Scientific/Engineering, Web/Social, Business

• Pre-processing: (De-)Compression, Filtering, Normalization

• Transformation: Aggregation, Dimension reduction

• Analysis: Summary statistics, Clustering

• Modeling: Machine learning, Parameter estimation, Simulation

• Validation: Hypothesis testing, Model errors

• Decision Making: Forecasting, Decision trees, etc.

Page 15

Intel® DAAL in the analytics pipeline

[Figure: analytics pipeline. Stage labels: Ingest & Clean, Engineer Features, Build Graph, Learn, Query & Visualize. Step labels: Data, Parse, Prepare Graph data, Basic Analysis, Run Graph/ML Algorithm, Insightful result]

Intel DAAL callouts in the figure:

• Modeling and Decision Making

• Summary Statistics

• PCA

• Compression and outlier detection

Page 16

Key features of Intel® DAAL

Optimized for Intel® Architecture

C++ and Java* API in the first release (more in future)

Flexible interface to build connectors to data sources, HDFS*, Spark* RDD, SQL, etc.

Batch, streaming, and distributed processing modes

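Distributed Processing: data blocks D_1, D_2, …, D_k yield partial results R_1, R_2, …, R_k, which are combined as R = F(R_1, …, R_k)

Streaming Processing: data blocks D_1, D_2, D_3, … arrive one at a time; the state and result are updated as S_{i+1} = T(S_i, D_i), R_{i+1} = F(S_{i+1})

Batch Processing: data blocks D_1, …, D_{k-1}, D_k are appended into one data set and processed together as R = F(D_1, …, D_k)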
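The three formulas on this slide can be made concrete with a very small statistic. The sketch below is an illustration in plain Python, not Intel DAAL code: it computes a mean in each mode, using a partial state S = (count, sum), and the names `update`, `merge` and `finalize` are hypothetical stand-ins for T and F.

```python
from functools import reduce
from typing import Iterator, List, Tuple

State = Tuple[int, float]  # partial state S: (number of observations, running sum)
Block = List[float]        # one data block D_i


def update(state: State, block: Block) -> State:
    """T(S_i, D_i): fold one data block into the partial state."""
    n, total = state
    return n + len(block), total + sum(block)


def merge(a: State, b: State) -> State:
    """Combine two partial results computed on different nodes."""
    return a[0] + b[0], a[1] + b[1]


def finalize(state: State) -> float:
    """F(S): turn a partial state into the result (here, the mean)."""
    n, total = state
    return total / n


def batch_mean(blocks: List[Block]) -> float:
    """Batch: R = F(D_1, ..., D_k), all blocks appended and processed at once."""
    data = [x for block in blocks for x in block]
    return sum(data) / len(data)


def streaming_means(blocks: List[Block]) -> Iterator[float]:
    """Streaming: yield R_{i+1} = F(S_{i+1}) after each block arrives."""
    state: State = (0, 0.0)
    for block in blocks:
        state = update(state, block)
        yield finalize(state)


def distributed_mean(blocks: List[Block]) -> float:
    """Distributed: one partial result per node, then R = F(R_1, ..., R_k)."""
    partials = [update((0, 0.0), block) for block in blocks]
    return finalize(reduce(merge, partials))


blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(batch_mean(blocks))
print(list(streaming_means(blocks)))
print(distributed_mean(blocks))
```

All three modes agree on the final value. The streaming and distributed variants never need the whole data set in one place, only a small state per block.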

Page 17

Data transformation and analysis (Intel® DAAL)

Basic statistics for datasets

• Statistical moments

• Quantiles

Correlation and dependence

• Cosine distance

• Correlation distance

• Variance-covariance matrix

Matrix factorization

• SVD

• QR

• Cholesky

Dimensionality reduction

• PCA

Association rule mining (Apriori)

Outlier detection

• Univariate

• Multivariate

Algorithms support streaming and distributed processing in the current release
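The slide lists cosine distance and correlation distance without formulas. As a hedged illustration, the sketch below uses the common textbook definitions: cosine distance is 1 minus the cosine of the angle between two vectors, and correlation distance is 1 minus their Pearson correlation, which equals the cosine distance of the mean-centered vectors. This is plain Python with illustrative function names, not the Intel DAAL API, and the library's exact conventions are an assumption here.

```python
import math
from typing import Sequence


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """1 - (x . y) / (|x| |y|); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def correlation_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """1 - Pearson correlation, computed as cosine distance of centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_distance(centered_x, centered_y)


x = [1.0, 2.0, 3.0]
print(cosine_distance(x, [2.0, 4.0, 6.0]))       # close to 0: same direction
print(correlation_distance(x, [3.0, 2.0, 1.0]))  # close to 2: perfectly anti-correlated
```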

Page 18

Machine learning (Intel® DAAL)

Supervised learning

• Regression: Linear Regression

• Classification: Weak learner, Boosting (Ada, Brown, Logit), Naïve Bayes, SVM

Unsupervised learning

• K-Means Clustering

• EM for GMM

Collaborative filtering

• ALS

Algorithms support streaming and distributed processing in the current release

To be available in future releases

Page 19

Intel® DAAL workflow

Optimizes the entire workflow

• From data acquisition from SQL* and NoSQL data sources

• To data transformations

• To data analysis, training and prediction

Diagram boxes:

• Connects to physical data (e.g. files, ODBC, etc.)

• Streams data into memory

• Transforms raw data into a numeric representation of a supported layout (Numeric Table)

• Performs filtering (outlier detection)

• Computes basic counters

• Streams in-memory data to the algorithm by blocks to improve data locality

• Converts a variety of numeric formats into a smaller set of numeric formats for effective vectorization

• Transforms the Numeric Table layout into the layout that is most efficient for a given algorithm

• Compression Engine

• Serialization and Compression Engine

• Performs data processing

Page 20

Intel® DAAL major component families

Data processing: Optimized analytics building blocks for all data analysis stages, from data acquisition to data mining and machine learning.

Data modeling: Data structures for model representation, and operations to derive model-based predictions and conclusions.

Data management: Interfaces for data representation and access. Connectors to a variety of data sources and data formats, such as HDFS, SQL, CSV, ARFF, and user-defined data source/format.

• Data sources

• Numeric tables

• Outlier detection

• Compression / Decompression

• Serialization / Deserialization

Page 21

Data management in Intel® DAAL

Covers raw data acquisition, filtering, normalization, and conversion

• Data sources

  – Define interfaces for access/management of data in raw format and out-of-memory data

  – Stream and transform raw out-of-memory data into numeric in-memory data

• Numeric tables

  – Fundamental component of in-memory numeric data processing

  – Supports heterogeneous and homogeneous numeric tables for dense and sparse data

Usage model might also require compression/decompression, etc.
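The split between a data source and a numeric table can be sketched in a few lines. The example below is an illustration in plain Python rather than the Intel DAAL API: `stream_numeric_blocks` is a hypothetical helper that reads raw CSV text lazily and yields small homogeneous, dense tables (lists of rows of floats), and `io.StringIO` stands in for a file or database that does not fit in memory.

```python
import csv
import io
from typing import Iterable, Iterator, List

NumericTable = List[List[float]]  # homogeneous, dense, row-major


def stream_numeric_blocks(raw: Iterable[str], block_rows: int) -> Iterator[NumericTable]:
    """Read CSV text lazily and yield numeric tables of at most block_rows rows."""
    block: NumericTable = []
    for record in csv.reader(raw):
        if not record:
            continue  # skip blank lines
        block.append([float(field) for field in record])
        if len(block) == block_rows:
            yield block
            block = []
    if block:
        yield block  # last, possibly shorter, block


# Stand-in for an out-of-memory source such as a large file.
raw_data = io.StringIO("1.0,2.0\n3.0,4.0\n5.0,6.0\n7.0,8.0\n9.0,10.0\n")

column_sums = [0.0, 0.0]
n_blocks = 0
for table in stream_numeric_blocks(raw_data, block_rows=2):
    n_blocks += 1
    for row in table:
        for j, value in enumerate(row):
            column_sums[j] += value

print(n_blocks, column_sums)
```

Only one block is held in memory at a time, so the consumer can process data sets larger than RAM as long as its own state (here, two column sums) stays small.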


Page 22

Data processing in Intel® DAAL

Algorithms for data analysis (data mining) and machine learning

• Matrix decompositions, clustering algorithms, PCA, etc.

The following processing modes are supported

• Batch: all algorithms, entire dataset available in memory of a single process

• Online: processing of data sets in blocks (streaming), might be asynchronous

• Distributed: data set is split in blocks across computation nodes

  – Multi-device computing and data transfer scenarios (e.g., MPI, Hadoop/Spark, etc.)


Page 23

Data modeling in Intel® DAAL

Structures and operations for data modeling in two stages

• Training: estimates model parameters based on a training data set

• Prediction: uses the trained model to predict the outcome based on new data

Available methods

• Regression: predict the values of dependent variables (responses) by observing independent variables

• Classification: identify to which sub-population (class) a given observation belongs
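The two-stage pattern can be shown with the simplest regression case. The sketch below is plain Python, not Intel DAAL code: `LinearModel`, `train` and `predict` are hypothetical names, and ordinary least squares on a single independent variable is chosen only because it keeps the example short. It covers the regression method only, not classification.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LinearModel:
    """Model representation: parameters estimated during training."""
    slope: float
    intercept: float


def train(x: Sequence[float], y: Sequence[float]) -> LinearModel:
    """Training stage: estimate slope and intercept by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return LinearModel(slope=slope, intercept=mean_y - slope * mean_x)


def predict(model: LinearModel, x_new: Sequence[float]) -> List[float]:
    """Prediction stage: apply the trained model to new observations."""
    return [model.slope * a + model.intercept for a in x_new]


x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1

model = train(x_train, y_train)
print(model)
print(predict(model, [4.0, 5.0]))
```

Keeping the model as a separate object means it can be trained once, stored or sent elsewhere, and reused for many prediction calls.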


Page 24

More information about Intel® DAAL

https://software.intel.com/en-us/intel-daal

