A Cloud Framework for Knowledge Discovery Workflows on Azure

Fabrizio Marozzo (1), Domenico Talia (1,2) and Paolo Trunfio (1)

(1) DEIS, University of Calabria, Italy
(2) ICAR-CNR, Italy

HPC 2012, High Performance Computing, Grids and Clouds
June 28, 2012, Cetraro, Italy

Complex Problems

Big and complex problems must be solved by Cloud, HPC systems and large-scale distributed computing systems.

Data sources are larger and larger and ubiquitous (Web, sensor networks, mobile devices, telescopes, social media, bio labs, large scientific instruments).

…and Big Data

Large data sources in many fields cannot be read by humans,

so

the huge amount of data available today requires smart data analysis techniques to help people deal with it,

and

scalable algorithms, techniques, and systems.

Distributed Data Intensive Apps

The use of computers (and digital data repositories) has changed the way we make discoveries in science and engineering.

It has improved the speed, methods, processes, and quality of scientific discovery.

The same change is occurring in business domains.

Goals

KDD and data mining techniques are used in many application areas to extract useful knowledge from large datasets.

KDD applications range from

Single-task applications

Parameter-sweeping applications

Complex (workflow-based, structured, concurrent) applications.

Cloud Computing can be exploited to provide end-users with the computing and storage applications and scalable execution mechanisms needed to efficiently run all these classes of applications.

Goal: Developing a Data Mining Cloud framework for supporting the scalable execution of data mining applications on Clouds.


Data Mining Cloud App: Overview

The Data Mining Cloud App was our first prototype supporting the execution of data mining applications on the Cloud.

Built on top of Windows Azure (PaaS).

The framework supports both single-task and parameter sweeping parallel data mining applications

Single-task applications: A single data mining task, such as classification, clustering, or association rules discovery is performed on a given data source.

Parameter sweeping applications: A dataset is analyzed in parallel by multiple instances of the same data mining algorithm with different parameters.

The number of tasks to be executed increases with the number of swept parameters and the range of their values, so a parameter sweeping application can require a large amount of computing resources.
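As a minimal sketch of how the task count grows, the following Python snippet enumerates one task per combination of swept parameter values. The function name `sweep_tasks`, the descriptor layout, and the parameter names are illustrative assumptions, not the framework's actual API; the values are taken from the clustering example used later in the talk.

```python
from itertools import product

def sweep_tasks(algorithm, dataset, param_grid):
    """Return one task descriptor per combination of parameter values.

    Hypothetical helper: the descriptor layout is an assumption made
    for illustration only.
    """
    names = list(param_grid)
    tasks = []
    for values in product(*(param_grid[name] for name in names)):
        tasks.append({
            "algorithm": algorithm,
            "dataset": dataset,
            "params": dict(zip(names, values)),
        })
    return tasks

# Number of clusters from 2 to 9 (8 values) and two seeds.
grid = {"num_clusters": range(2, 10), "seed": [1211, 1311]}
tasks = sweep_tasks("SimpleKMeans", "US_Census_20MB.arff", grid)
print(len(tasks))  # 8 values x 2 seeds = 16 tasks
```

Adding a third swept parameter with m values would multiply the task count by m, which is why the resource demand rises quickly.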


Outline

Windows Azure Components

Data Mining Cloud App
    Architecture
    User interface
    Execution mechanisms

Performance results
    Classification application
    Clustering application

From the DM Cloud App to the DM Cloud Framework
    Supporting KDD workflows
    Execution mechanisms

Conclusions


Windows Azure Components

Compute is the computational environment to execute Cloud applications:

Web role: Web-based applications.

Worker role: batch applications.

VM role: virtual machine images.

Storage provides scalable storage elements:

Blobs: storing binary and text data.

Tables: non-relational databases.

Queues: communication between components.

Fabric controller links the physical machines of a single data center:

Compute and Storage services are built on top of this component.

Data Mining Cloud App: Architecture

[Figure: architecture diagram. The user's browser connects to the Website, hosted on Web Role instances in the Compute service of the Windows Azure Platform; the Workers run on Worker Role instances. The Storage service holds the Task Queue (Queues), the Task Status Table (Tables), and the input datasets and data mining models (Blobs). Compute and Storage sit on top of the Fabric.]

Data Mining Cloud App: User Interface

The Web interface allows users to:

submit a data mining application

monitor its execution

access the results of each task

Data Mining Cloud App: Exec. mechanisms (1/3)

[Figure: the architecture diagram with task markers (T), an example task descriptor, and a list of task timestamps ranging from 7/4/2011 7:32:22 PM to 7/4/2011 7:54:01 PM.]

Example task descriptor:

TaskID: 1634454118824362358-001

Algorithm: SimpleKMeans

Input dataset: US_Census_20MB.arff

Number of Clusters: 2

Seed: 1211

Data Mining Cloud App: Exec. mechanisms (2/3)

[Figure: the architecture diagram with numbered steps 1 to 6 linking the user's browser, the Website (Web Role instance), the Task Queue, the Workers (Worker Role instances), the Task Status Table, and the Blobs holding input datasets and data mining models.]

The task status is dynamically updated whenever the status of a task changes.
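The interplay between the Task Queue, the Workers and the Task Status Table can be sketched locally with Python's standard library. This is only an in-memory stand-in under stated assumptions: `queue.Queue` replaces the Azure queue, a dictionary replaces the status table, and the helper names and the "failed" status are hypothetical additions not shown on the slides.

```python
import queue
import threading

task_queue = queue.Queue()      # stand-in for the Azure Task Queue
status_table = {}               # stand-in for the Task Status Table
table_lock = threading.Lock()

def set_status(task_id, status):
    """Record a status change so a monitoring page could display it."""
    with table_lock:
        status_table[task_id] = status

def submit(task_id, payload):
    set_status(task_id, "submitted")
    task_queue.put((task_id, payload))

def worker(run_task):
    """Pull tasks until a None sentinel arrives, updating the table each step."""
    while True:
        item = task_queue.get()
        if item is None:
            task_queue.task_done()
            break
        task_id, payload = item
        set_status(task_id, "running")
        try:
            run_task(payload)
            set_status(task_id, "done")
        except Exception:
            set_status(task_id, "failed")   # assumption: not a status from the slides
        finally:
            task_queue.task_done()

def run_all(payloads, n_workers, run_task):
    for i, payload in enumerate(payloads):
        submit(f"task-{i:03d}", payload)
    threads = [threading.Thread(target=worker, args=(run_task,))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for _ in threads:
        task_queue.put(None)    # one sentinel per worker
    for t in threads:
        t.join()
    return dict(status_table)

result = run_all([{"num_clusters": k} for k in range(2, 6)],
                 n_workers=2,
                 run_task=lambda p: p["num_clusters"] ** 2)
print(result)
```

Because every transition goes through `set_status`, any reader of the table sees the current state of each task without contacting the workers directly.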


Data Mining Cloud App: Exec. mechanisms (3/3)

To reduce the impact of data transfer on the overall execution time, it is important that input data are physically close to the virtual servers on which the workers run.

To this end, we use Azure's Affinity Group feature.


Evaluation: Performance metrics and test settings

Goal: Evaluating the performance of Data Mining Cloud App through the execution of a set of parameter sweeping applications on a pool of virtual servers hosted by Microsoft Cloud datacenters.

Scalability that can be achieved through the parallel execution on a pool of virtual servers.

Test settings:

Publicly available datasets from the UCI archive

1 Web role instance

From 1 to 16 Worker role instances.

Virtual server: single-core 1GHz CPU, 0.75 GB of RAM, 20 GB of disk space

Performance metrics:

Turnaround time (includes file transfer overhead)

Speedup
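For reference, a short sketch of how the two metrics relate, assuming the usual definition of speedup as the turnaround time on one worker divided by the turnaround time on n workers. The timings below are invented placeholders, not measurements from the talk.

```python
def speedup(turnaround_by_workers):
    """Map worker count -> T(1) / T(n), given worker count -> turnaround time."""
    t1 = turnaround_by_workers[1]
    return {n: t1 / t for n, t in sorted(turnaround_by_workers.items())}

def efficiency(turnaround_by_workers):
    """Speedup divided by the number of workers (1.0 means linear scaling)."""
    return {n: s / n for n, s in speedup(turnaround_by_workers).items()}

# Hypothetical turnaround times in seconds (NOT results from the slides).
hypothetical_times = {1: 1600.0, 2: 800.0, 4: 400.0, 8: 250.0, 16: 160.0}

s = speedup(hypothetical_times)
e = efficiency(hypothetical_times)
for n in s:
    print(n, s[n], e[n])
```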


Evaluation: Clustering application

Dataset: USCensus90 (size: 20, 40, 80 MB)

Algorithm: K-Means (from the Weka project)

Parameters:

Number of clusters: from 2 to 9 by 1 (8 values)

Seed: 1211, 1311 (2 values)

16 tasks

The speedup achieved is not linear, because the tasks are very heterogeneous in terms of execution times.


Evaluation: Classification application

Dataset: Covertype (size: 9, 18, 36 MB)

Algorithm: J48 (from the Weka project)

Parameters:

Confidence value: from 0.05 to 0.50 by 0.03 (16 values)

16 tasks

In this case the speedup is almost linear, because the tasks are homogeneous in terms of execution times.
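The effect of task heterogeneity on speedup can be illustrated with a small simulation. It assumes that each free worker simply takes the next task from a shared queue (greedy list scheduling) and ignores data transfer; the task durations are invented for illustration and are not the measured execution times.

```python
import heapq

def makespan(durations, n_workers):
    """Finish time when each free worker takes the next task from a shared queue."""
    free_at = [0.0] * n_workers          # time at which each worker becomes free
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)   # earliest available worker
        heapq.heappush(free_at, start + d)
    return max(free_at)

def simulated_speedup(durations, n_workers):
    return makespan(durations, 1) / makespan(durations, n_workers)

# Hypothetical durations for 16 tasks (NOT values from the experiments).
homogeneous = [10.0] * 16                            # all tasks take the same time
heterogeneous = [float(i) for i in range(1, 17)]     # 1, 2, ..., 16 time units

print(simulated_speedup(homogeneous, 16))    # 16.0: linear speedup
print(simulated_speedup(heterogeneous, 16))  # 8.5: bounded by the longest task
```

With 16 workers and 16 tasks, the turnaround time can never drop below the duration of the slowest task, so equal-length tasks scale linearly while uneven ones do not.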


Evaluation: Remarks

The experimental results show:

The scalability that can be achieved by the parallel execution of parameter sweeping applications on a pool of virtual servers.

The suitability of the Cloud to support more complex KDD applications, e.g., workflow-based knowledge discovery.


From DM Cloud App to DM Cloud Framework

The Data Mining Cloud Framework supports workflow-based KDD applications, expressed as graphs that link together data sources, data mining algorithms, and visualization tools.


Data Mining Cloud Framework: Workflow composition

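[Figure: workflow graph. The CoverType dataset feeds a Splitter (task T1) that produces Sample1, Sample2, Sample3 and Sample4. Three J48 instances (tasks T2, T3, T4) are trained on Sample1, Sample2 and Sample3, producing Model1, Model2 and Model3. A Voter (task T5) combines the three models and is tested on Sample4, producing the final Model.]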
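A workflow like this one can be captured as a mapping from each task to the tasks it depends on. The sketch below uses the dependency lists T1 to T5 shown on the workflow execution slides and Python's standard `graphlib` module (Python 3.9+) to derive which tasks may run concurrently; the dictionary layout is an assumption for illustration, not the framework's internal format.

```python
from graphlib import TopologicalSorter

# task -> tasks it must wait for (dependency lists from the Task Table)
workflow = {
    "T1": [],                          # Splitter
    "T2": ["T1"],                      # J48 on Sample1
    "T3": ["T1"],                      # J48 on Sample2
    "T4": ["T1"],                      # J48 on Sample3
    "T5": ["T1", "T2", "T3", "T4"],    # Voter
}

sorter = TopologicalSorter(workflow)
sorter.prepare()                       # also raises CycleError on cyclic graphs

waves = []                             # groups of tasks that can run in parallel
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

print(waves)  # [['T1'], ['T2', 'T3', 'T4'], ['T5']]
```

The middle wave shows where parallelism is available: the three J48 training tasks are independent of one another once the Splitter has finished.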


Data Mining Cloud Framework: Workflow execution

[Figure: Azure components involved in workflow execution: the Task Queue, a Worker, the Task Table, and the Blobs holding input datasets and data mining models.]

Task Table

TaskID   Task status   Dependency list
T1       submitted     ()
T2       submitted     (T1)
T3       submitted     (T1)
T4       submitted     (T1)
T5       submitted     (T1, T2, T3, T4)

Data Mining Cloud Framework: Workflow execution

[Figure: Azure components: Task Queue, Worker, Task Table, and Blobs (input datasets, data mining models).]

Task Table

TaskID   Task status   Dependency list
T1       submitted     ()  No dependencies
T2       submitted     (T1)
T3       submitted     (T1)
T4       submitted     (T1)
T5       submitted     (T1, T2, T3, T4)

[Animation: T1 has no dependencies, so its status changes to ready, then running, then done; T2, T3 and T4 then become ready.]

Data Mining Cloud Framework: Workflow execution

[Figure: Azure components: Task Queue, Worker, Task Table, and Blobs (input datasets, data mining models).]

Task Table

TaskID   Task status   Dependency list
T1       done          ()
T2       ready         (T1)
T3       ready         (T1)
T4       ready         (T1)
T5       submitted     (T1, T2, T3, T4)

[Animation overlay: four task statuses change to done.]
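The status transitions shown on these slides (submitted, ready, running, done) can be sketched as a small loop over an in-memory task table. This is a simplified, single-process illustration under stated assumptions: the table layout and the function names `promote_ready` and `run_workflow` are hypothetical, and ready tasks are executed one after another rather than by separate Workers.

```python
def promote_ready(table):
    """Mark submitted tasks as ready once all their dependencies are done."""
    for row in table.values():
        if row["status"] == "submitted" and all(
            table[dep]["status"] == "done" for dep in row["deps"]
        ):
            row["status"] = "ready"

def run_workflow(table, execute):
    """Run tasks until no task can become ready; return the execution order."""
    order = []
    while True:
        promote_ready(table)
        ready = [tid for tid, row in table.items() if row["status"] == "ready"]
        if not ready:
            break
        for task_id in ready:   # sequential here; separate workers could run these in parallel
            table[task_id]["status"] = "running"
            execute(task_id)
            table[task_id]["status"] = "done"
            order.append(task_id)
    return order

task_table = {
    "T1": {"status": "submitted", "deps": []},
    "T2": {"status": "submitted", "deps": ["T1"]},
    "T3": {"status": "submitted", "deps": ["T1"]},
    "T4": {"status": "submitted", "deps": ["T1"]},
    "T5": {"status": "submitted", "deps": ["T1", "T2", "T3", "T4"]},
}

order = run_workflow(task_table, execute=lambda task_id: None)
print(order)  # ['T1', 'T2', 'T3', 'T4', 'T5']
```

T5 stays in the submitted state until T2, T3 and T4 are all done, which mirrors the last row of the Task Table.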


Conclusions

Data mining and knowledge discovery tools are needed to find what is interesting and valuable in very large data sources.

Cloud computing systems can effectively be used as scalable infrastructures for service-oriented data mining applications.

We need new distributed infrastructures and smart scalable analysis techniques to solve more challenging problems.


Communicating (data) with each other, exchanging information, is nature; taking into account received information is knowledge.

J.W. von Goethe

Data Mining Cloud Framework: Workflow execution

[Figure: Azure components: Task Queue, Worker, Task Table, and Blobs (input datasets, data mining models).]

Task Table

TaskID   Task status   Dependency list
T1       done          ()
T2       ready         (T1)
T3       ready         (T1)
T4       ready         (T1)
T5       submitted     (T1, T2, T3, T4)

[Animation: T2, T3 and T4 change to running (sequential execution, or parallel on different Workers) and then to done; T5 then becomes ready.]

Data Mining Cloud Framework: Workflow execution

[Figure: Azure components: Task Queue, Worker, Task Table, and Blobs (input datasets, data mining models).]

Task Table

TaskID   Task status   Dependency list
T1       done          ()
T2       ready         (T1)
T3       ready         (T1)
T4       ready         (T1)
T5       submitted     (T1, T2, T3, T4)

[Animation: T2, T3 and T4 are done; T5 changes to ready, then running, then done.]

