
Studies of HPCC Systems from Machine Learning Perspectives

Date post: 22-Jan-2018

Studies of HPCC Systems® from Machine Learning Perspectives

Ying Xie, Pooja Chenna, Ken Hoganson Department of Computer Science

Kennesaw State University

Outline

• Leverage and Enhance Deep Learning Capability of HPCC Systems®

• Comparative Studies of HPCC Systems® and Hadoop Systems

Leverage and Enhance Deep Learning Capability of HPCC Systems®

• Deep learning has emerged as a recent breakthrough in the area of Machine Learning and has revitalized research activities in Artificial Intelligence (AI).

• The development of deep learning techniques has been primarily driven by big data analytics.

• Since HPCC Systems is a well-established big data platform, it is very important to leverage and enhance the deep learning capability of the HPCC Systems platform.

Images are copied from neuralnetworksanddeeplearning.com

• From Neural Network to Deep Learning

A typical multilayer perceptron

A deep neural architecture

Stacked Autoencoder

Images are copied from ufldl.stanford.edu

• Stacked Autoencoder

This code was implemented by Maryam Mousaarab Najafabadi from F

Deep Belief Network

Restricted Boltzmann Machine

Images are copied from http://deeplearning4j.org/

Deep Belief Network

Visualize High Dimensional Data

• Iris Data

Use DBN to conduct dimension reduction

• Breast Cancer Wisconsin (Original) Data Set

Use DBN to conduct dimension reduction

• Glass Data

Use stacked auto-encoder to conduct dimension reduction

• Based on the visualization, we may even gain some insight into which classification algorithm may work well

Classification Algorithm    #Miss-classified Instances
Logistic                    72
MLP                         66
Simple Logistic             70
7NN                         76
5NN                         70
3NN                         64

Mapping Data to Higher Dimensional Spaces

• Blood Transfusion Data

• Classification performance on higher dimensional space

Classifier    Original Space (4 Dim)        Higher Dim (10 Dim)
              (#Miss-classified Instances)  (#Miss-classified Instances)
Logistic      171                           157
MLP           160                           166

Mapped to higher dimensional space by Stacked Auto-Encoder

Mapped to higher dimensional space by DBN

• Wine Data

• Classification performance on higher dimensional space

Classifier    Original Space (13 Dim)       Higher Dim (15 Dim)
              (#Miss-classified Instances)  (#Miss-classified Instances)
Logistic      10                            6
MLP           4                             4

Classifier    Original Space (13 Dim)       Higher Dim (15 Dim)
              (#Miss-classified Instances)  (#Miss-classified Instances)
Logistic      10                            7
MLP           4                             2

Mapped to higher dimensional space by Stacked Auto-Encoder

• For a given data set, we can explore which combination of deep learning mapping technique, dimensional space, and supervised learning model may yield the best classification result

• For instance – Breast Cancer Data

Classifier    3 Dim. Space    6 Dim. Space    9 Dim. Space    12 Dim. Space
              (#Miss-classified Instances in each space)
Logistic      22              21              24              25
MLP           21              22              30              21

Classifier    3 Dim. Space    6 Dim. Space    9 Dim. Space    12 Dim. Space
              (#Miss-classified Instances in each space)
Logistic      22              23              24              25
MLP           20              21              30              22

Mapped to different dimensional spaces by Stacked Auto-Encoder

Mapped to different dimensional spaces by DBN

• Our next step: implement a meta supervised learning algorithm on HPCC
– The algorithm will automatically map the given data to different dimensional spaces by using both the stacked auto-encoder and the DBN
– Classification models will then be trained on all dimensional spaces in a distributed manner
– Cross-validation will be used to select the best-performing model as the final output.
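The proposed meta-learner can be sketched in a few lines. This is a hedged NumPy illustration, not the HPCC implementation: random projections stand in for the stacked auto-encoder / DBN mappings, a simple k-NN classifier stands in for the trained models, and the data are synthetic.

```python
import numpy as np

# Sketch of the meta supervised learning loop described above: map the data
# into several candidate dimensional spaces, train a classifier in each
# space, and keep the model whose cross-validated error count is lowest.

def knn_errors(X, y, k=3, folds=5):
    """Count miss-classified instances of a k-NN classifier under
    simple interleaved cross-validation."""
    idx = np.arange(len(y))
    errors = 0
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        for i in test:
            d = np.linalg.norm(X[train] - X[i], axis=1)
            votes = y[train][np.argsort(d)[:k]]
            errors += np.bincount(votes).argmax() != y[i]
    return int(errors)

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 6))                  # synthetic 6-dim data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

best = None
for dim in (3, 6, 9, 12):                     # candidate target spaces
    P = rng.normal(size=(6, dim))             # stand-in mapping to `dim` dims
    e = knn_errors(X @ P, y)
    if best is None or e < best[1]:
        best = (dim, e)
print("best space:", best[0], "miss-classified:", best[1])
```

On HPCC, the per-space model training would run in parallel across the cluster rather than in this sequential loop.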

Implementation of Deep Belief Network on HPCC

Our Implementations of Deep Learning on HPCC

• Restricted Boltzmann Machine (RBM) with Contrastive Divergence learning algorithm

• Deep Belief Network by stacking RBMs
• Supervised Deep Belief Network

Machine Learning Routines
• Utility Module
• Matrix Library
• Dense Matrix Library
• PBblas

Restricted Boltzmann Machine - RBM

[Figure: RBM with visible units v, hidden units h, and reconstructions v′, h′]

w(t+1) = w(t) + α(v hᵀ – v′ h′ᵀ)
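The update rule above can be made concrete with a minimal NumPy sketch of Contrastive Divergence (CD-1) for a binary RBM. This is an illustration only, not the ECL implementation on HPCC; the layer sizes and data are made up.

```python
import numpy as np

# One CD-1 step: sample h from v, reconstruct v', compute h', then apply
# w(t+1) = w(t) + alpha * (v h^T - v' h'^T), averaged over the batch.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, v, rng, alpha=0.1):
    """W: (visible x hidden) weights; v: batch of visible samples (n x d)."""
    h_prob = sigmoid(v @ W)                         # h  ~ P(h | v)
    h = (rng.random(h_prob.shape) < h_prob) * 1.0   # binary hidden sample
    v_prob = sigmoid(h @ W.T)                       # v' ~ P(v | h)
    v_prime = (rng.random(v_prob.shape) < v_prob) * 1.0
    h_prime = sigmoid(v_prime @ W)                  # h' = P(h | v')
    n = v.shape[0]
    return W + alpha * (v.T @ h_prob - v_prime.T @ h_prime) / n

rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (6, 3))                     # 6 visible, 3 hidden units
v = (rng.random((20, 6)) < 0.5) * 1.0               # toy binary batch
for _ in range(100):
    W = cd1_step(W, v, rng)
print(W.shape)                                      # (6, 3)
```

Bias terms are omitted here to keep the correspondence with the weight-update equation direct; a full RBM would also update visible and hidden biases.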

Stacking Boltzmann Machines - Deep Belief Network

Final Output

Input Parameters

• Iris Data Sample
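Greedy layer-wise stacking of RBMs into a DBN can be sketched as follows. Again a hedged NumPy stand-in for the HPCC/ECL implementation: the CD-1 trainer is simplified (no biases), and the data and layer sizes are illustrative only.

```python
import numpy as np

# Train an RBM on the input, feed its hidden activations to the next RBM,
# and so on; the top layer's activations are the DBN's final output.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(V, n_hidden, alpha=0.1, epochs=50, seed=0):
    """Simplified CD-1 training of one RBM; returns the weight matrix."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.01, (V.shape[1], n_hidden))
    for _ in range(epochs):
        h = sigmoid(V @ W)
        v_prime = sigmoid(((rng.random(h.shape) < h) * 1.0) @ W.T)
        h_prime = sigmoid(v_prime @ W)
        W += alpha * (V.T @ h - v_prime.T @ h_prime) / len(V)
    return W

def train_dbn(X, layer_sizes):
    """Stack RBMs greedily; e.g. layer_sizes=[3, 2] on 4-dim input."""
    weights, V = [], X
    for n_hidden in layer_sizes:
        W = train_rbm(V, n_hidden)
        weights.append(W)
        V = sigmoid(V @ W)       # propagate up to train the next layer
    return weights, V            # V: top-layer codes (the final output)

X = np.random.default_rng(1).random((30, 4))   # stand-in for an Iris sample
weights, codes = train_dbn(X, [3, 2])
print(codes.shape)                             # (30, 2)
```

The 2-dimensional top-layer codes are what the earlier visualization slides scatter-plot.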

Supervised Boltzmann Machine – Deep Belief Network

y – actual output
h – hidden samples
v – visible samples

Supervised Deep Belief Network

• Our ultimate goal is to implement a full stack of deep learning techniques on HPCC and conduct a wide range of experiments to show how powerful the deep learning engine on HPCC can be.

Comparative Studies of HPCC and Hadoop

• HPCC and Hadoop clusters on CSCloud
– HPCC cluster:
  • 5 thor nodes
  • 5 roxie nodes
  • 2 middle-ware nodes
  • 1 landing zone node for uploading files
– Hadoop cluster:
  • 1 job-tracker / name-node
  • 1 support system (Web UI, Hadoop ecosystem, etc.)
  • 4 worker nodes: task-tracker / data-node

• Algorithms for comparison
– Text Processing Algorithms:
  • Word Count
  • Inverted Index
– Machine Learning Algorithms:
  • Supervised Learning Algorithm - Random Forests
  • Unsupervised Learning Algorithm - KMeans
– Graph Algorithm:
  • Page Rank

Text Processing Algorithms

Data: Authorized version of the Bible downloaded from http://av1611.com/
HPCC Implementation of Inverted Index: http://www.dabhand.org/ECL/construct_a_simple_bible_search.htm (implemented by David Bayliss, Chief Data Scientist and VP of LexisNexis Risk Solutions)
Hadoop Implementation of Inverted Index: Victor Guana and Joshua Davidson. On Comparing Inverted Index Parallel Implementations Using Map/Reduce. Technical Report. May 09, 2012.
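Neither of the cited implementations is reproduced here, but what both compute can be sketched in a few lines: an inverted index maps each word to the list of documents (here, verses) that contain it.

```python
from collections import defaultdict

# Minimal inverted index: word -> sorted list of document ids.
def inverted_index(docs):
    """docs: dict of doc_id -> text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,;:")].add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

# Toy corpus standing in for the Bible text
verses = {
    "Gen1:1": "In the beginning God created the heaven and the earth.",
    "Jn1:1": "In the beginning was the Word.",
}
idx = inverted_index(verses)
print(idx["beginning"])   # ['Gen1:1', 'Jn1:1']
```

The HPCC and Hadoop versions distribute exactly this map-then-group computation across the cluster.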

Algorithm        HPCC             Hadoop
Word Count       1.003 seconds    23.466 seconds
Inverted Index   34.205 seconds   27.047 seconds

Machine Learning Algorithms

Data: KDD Network Intrusion Dataset
Reference: C. Blake and C. J. Merz. UCI Repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. [http://www.ics.uci.edu/~mlearn/MLRepository.html]
Description:
– Total number of instances: 4,000,000
– Used instances: 20,394, randomly picked
– Number of attributes: 42

Hadoop Library – Apache Mahout:
– Version: CDH-5.4.2-1 (Cloudera)

HPCC Machine Learning Library

Efficiency:

Algorithm        HPCC                   Hadoop
Random Forests   1 minute 50 seconds    18 seconds
KMeans           36.675 seconds         1 minute 45 seconds
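For reference, the KMeans algorithm being benchmarked is Lloyd's iteration. The sketch below is a NumPy stand-in on synthetic data, not the Mahout or HPCC-ML implementation.

```python
import numpy as np

# Lloyd's algorithm: alternate between assigning points to their nearest
# center and recomputing each center as its cluster mean.
def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers as cluster means
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),    # two synthetic blobs
               rng.normal(3, 0.3, (50, 2))])
labels, centers = kmeans(X, 2)
print(centers.shape)   # (2, 2)
```

The distributed versions parallelize the assignment step, which dominates the cost on large data.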

Graph Algorithm

Data Set: Randomly generated graph with 25 nodes, maximum degree 5.
HPCC Implementation of PageRank – our team's implementation
Hadoop Implementation of PageRank – http://blog.xebia.com/2011/09/wiki-pagerank-with-hadoop/

Algorithm    HPCC             Hadoop
Page Rank    29.817 seconds   36 minutes
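The PageRank computation timed above is the standard damped power iteration; a minimal NumPy sketch on a 3-node toy graph (not the team's HPCC code or the Hadoop version cited):

```python
import numpy as np

# Power iteration: r = (1 - d)/n + d * M^T r, where M is the row-stochastic
# transition matrix of the link graph and d is the damping factor.
def pagerank(adj, damping=0.85, iters=50):
    """adj: n x n adjacency matrix (adj[i, j] = 1 iff i links to j)."""
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    out[out == 0] = 1                     # dangling nodes: avoid div by zero
    M = adj / out                         # row-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (M.T @ r)
    return r

adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)
r = pagerank(adj)
print(r.argmax())    # node 2 ranks highest: it has two in-links
```

Each iteration is one pass over the edges, which is what both cluster implementations distribute.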

THANK YOU

