A Cloud Framework for Knowledge Discovery Workflows on Azure
Fabrizio Marozzo1, Domenico Talia1,2 and Paolo Trunfio1
1 DEIS, University of Calabria, Italy
2 ICAR-CNR, Italy
HPC 2012 - High Performance Computing, Grids and Clouds - June 28, 2012 - Cetraro, Italy
Complex Problems…
Big and complex problems must be solved by Cloud, HPC systems, and large-scale distributed computing systems.
Data sources are larger and larger and ubiquitous (Web, sensor networks, mobile devices, telescopes, social media, bio labs, large scientific instruments).
June 28, 2012 HPC 2012
…and Big Data
Large data sources in many fields cannot be read by humans, so the huge amount of data available today requires smart data analysis techniques to help people deal with it, and scalable algorithms, techniques, and systems.
The use of computers (and digital data repositories) has changed the way we make discoveries in science and engineering.
It has improved the speed, methods, and quality of scientific discovery processes.
The same change is occurring in business domains.
Distributed Data Intensive Apps
KDD and data mining techniques are used in many application areas to extract useful knowledge from large datasets.
KDD applications range from
Single-task applications
Parameter-sweeping applications
Complex (Workflow-based, structured, concurrent) applications.
Cloud Computing can be exploited to provide end-users with computing and storage applications and scalable execution mechanisms needed to efficiently run all these classes of applications.
Goal: Developing a Data Mining Cloud framework for supporting the scalable execution of data mining applications on Clouds.
Data Mining Cloud App: Overview
The Data Mining Cloud App was our first prototype supporting the execution of data mining applications on the Cloud.
Built on top of Windows Azure (PaaS).
The framework supports both single-task and parameter-sweeping parallel data mining applications:
Single-task applications: A single data mining task, such as classification, clustering, or association rules discovery, is performed on a given data source.
Parameter-sweeping applications: A dataset is analyzed in parallel by multiple instances of the same data mining algorithm with different parameters.
The number of tasks to be executed increases with the number of swept parameters and the range of their values.
A parameter-sweeping application can therefore require a large amount of computing resources.
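As a rough illustration of how the task count grows, the sketch below enumerates a parameter sweep as the Cartesian product of the parameter value lists. The function name `sweep_tasks` and the dictionary layout are hypothetical and are not taken from the Data Mining Cloud App; the parameter values are those of the clustering experiment reported later in these slides.

```python
from itertools import product


def sweep_tasks(algorithm, dataset, param_grid):
    """Return one task description per combination of parameter values."""
    names = list(param_grid)
    tasks = []
    for values in product(*(param_grid[name] for name in names)):
        tasks.append({
            "algorithm": algorithm,
            "dataset": dataset,
            "params": dict(zip(names, values)),
        })
    return tasks


# Hypothetical parameter names; values follow the clustering experiment.
tasks = sweep_tasks(
    "SimpleKMeans",
    "US_Census_20MB.arff",
    {"num_clusters": range(2, 10), "seed": [1211, 1311]},
)
print(len(tasks))  # 8 cluster counts x 2 seeds = 16 tasks
```

Adding one more swept parameter with k values multiplies the number of tasks by k, which is why the resource demand grows quickly.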
Outline
Windows Azure Components
Data Mining Cloud App
Architecture
User interface
Execution mechanisms
Performance results
Classification application
Clustering application
From the DM Cloud App to the DM Cloud Framework
Supporting KDD workflows
Execution mechanisms
Conclusions
Windows Azure Components
Compute is the computational environment to execute Cloud applications:
Web role: Web-based applications.
Worker role: batch applications.
VM role: virtual machine images.
Storage provides scalable storage elements:
Blobs: storing binary and text data.
Tables: non-relational databases.
Queues: communication between components.
Fabric controller links the physical machines of a single data center:
Compute and Storage services are built on top of this component.
Data Mining Cloud App: Architecture
[Figure: Windows Azure Platform diagram. Compute: the Website runs on Web Role instances and the Workers on Worker Role instances. Storage: Task Queue (Queues), Task Status Table (Tables), input datasets and data mining models (Blobs). Compute and Storage sit on the Fabric. The user accesses the Website through a browser.]
Data Mining Cloud App: User Interface
The Web interface allows users to:
submit a data mining application
monitor its execution
access the results of each task
Data Mining Cloud App: Exec. mechanisms (1/3)
[Figure: Windows Azure Platform diagram with the user's browser, the Website (Web Role instances), the Workers (Worker Role instances), the Task Queue, the Task Status Table, and the Blobs holding input datasets and data mining models. Tasks are marked T in the diagram; the task status view lists per-task timestamps between 7/4/2011 7:32:22 PM and 7/4/2011 7:54:01 PM.]
Example task:
TaskID: 1634454118824362358-001
Algorithm: SimpleKMeans
Input dataset: US_Census_20MB.arff
Number of Clusters: 2
Seed: 1211
Data Mining Cloud App: Exec. mechanisms (2/3)
[Figure: Windows Azure Platform diagram (user's browser, Website on a Web Role instance, Workers on Worker Role instances, Task Queue, Task Status Table, Blobs with input datasets and data mining models), annotated with execution steps 1 to 6.]
The task status is dynamically updated whenever the status of a task changes.
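The queue-and-status-table interaction can be sketched as follows. This is only an in-memory illustration: `queue.Queue` and a plain dictionary stand in for the Azure Task Queue and Task Status Table, no Azure API is called, `run_task` is a hypothetical placeholder for the data mining step, and the "failed" status is an assumption not shown in the slides.

```python
import queue


def run_task(task):
    """Hypothetical placeholder for running the data mining algorithm."""
    return f"model for {task['id']}"


def worker_loop(task_queue, status_table, results):
    """Process tasks until the queue is empty, recording every status change."""
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            break
        status_table[task["id"]] = "running"
        try:
            results[task["id"]] = run_task(task)
            status_table[task["id"]] = "done"
        except Exception:
            # Assumed status; the slides do not describe failure handling.
            status_table[task["id"]] = "failed"


task_queue = queue.Queue()   # stand-in for the Task Queue
status_table = {}            # stand-in for the Task Status Table
results = {}                 # stand-in for the models stored as Blobs

for i in range(1, 4):
    task = {"id": f"task-{i:03d}", "algorithm": "SimpleKMeans"}
    status_table[task["id"]] = "submitted"
    task_queue.put(task)

worker_loop(task_queue, status_table, results)
print(status_table)
```

In a deployment with several workers, each worker would run such a loop independently, and the Web interface would read the status table to show progress.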
Data Mining Cloud App: Exec. mechanisms (3/3)
To reduce the impact of data transfer on the overall execution time, it is important that input data are physically close to the virtual servers on which the workers run.
To this end, we use Azure's Affinity Group feature.
Evaluation: Performance metrics and test settings
Goal: Evaluating the performance of the Data Mining Cloud App through the execution of a set of parameter-sweeping applications on a pool of virtual servers hosted by Microsoft Cloud datacenters.
In particular, the scalability that can be achieved through parallel execution on a pool of virtual servers.
Test settings:
Publicly available datasets from the UCI archive
1 Web role instance
From 1 to 16 Worker role instances.
Virtual server: single-core 1 GHz CPU, 0.75 GB of RAM, 20 GB of disk space
Performance metrics:
Turnaround time (includes file transfer overhead)
Speedup
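The slides do not spell out the speedup formula; the sketch below assumes the usual definition, the turnaround time with one worker divided by the turnaround time with n workers. The function name and the timing values are made up for illustration and are not measurements from the experiments.

```python
def speedup(turnaround_by_workers):
    """Map each worker count to T(1) / T(n), given turnaround times in seconds."""
    base = turnaround_by_workers[1]
    return {n: base / t for n, t in sorted(turnaround_by_workers.items())}


# Made-up turnaround times (seconds), keyed by number of workers.
times = {1: 1600.0, 2: 820.0, 4: 430.0, 8: 240.0, 16: 160.0}

for n, s in speedup(times).items():
    print(f"{n:2d} workers: speedup {s:.2f}")
```

A speedup equal to the number of workers would be linear; because the turnaround time includes file transfer overhead, measured values are expected to fall below that.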
Evaluation: Clustering application
Dataset: USCensus90 (size: 20, 40, 80 MB)
Algorithm: K-Means (from the Weka project)
Parameters:
Number of clusters: from 2 to 9 by 1 (8 values)
Seed: 1211, 1311
16 tasks
The speedup achieved is not linear, because the tasks are very heterogeneous in terms of execution times.
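Why heterogeneous task durations limit the speedup can be seen with a small simulation. This is a sketch under simplifying assumptions: the durations are made up, file transfer overhead is ignored, and each task in queue order is given to the worker that becomes free first. With 16 tasks on 16 workers the turnaround time equals the longest task, so the speedup is the total work divided by the longest task.

```python
import heapq


def turnaround(durations, n_workers):
    """Simulate workers pulling tasks in queue order; return the finish time."""
    free_at = [0.0] * n_workers          # time at which each worker is free
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)   # earliest available worker
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Two made-up sets of 16 task durations with the same total work (720).
heterogeneous = [10, 20, 30, 40, 50, 60, 70, 80] * 2
homogeneous = [45] * 16

for name, durations in [("heterogeneous", heterogeneous),
                        ("homogeneous", homogeneous)]:
    s = turnaround(durations, 1) / turnaround(durations, 16)
    print(f"{name}: speedup on 16 workers = {s}")
```

In this toy setting the homogeneous set reaches a speedup of 16, while the heterogeneous set stops at 9 because every other worker waits for the longest task.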
Evaluation: Classification application
Dataset: Covertype (size: 9, 18, 36 MB)
Algorithm: J48 (from the Weka project)
Parameters:
Confidence value: from 0.05 to 0.50 by 0.03 (16 values)
16 tasks
In this case the speedup is almost linear, because the tasks are homogeneous in terms of execution times.
Evaluation: Remarks
The experimental results show:
The scalability that can be achieved by the parallel execution of parameter-sweeping applications on a pool of virtual servers.
The suitability of the Cloud to support possibly complex KDD applications, e.g., workflow-based knowledge discovery.
From DM Cloud App to DM Cloud Framework
The Data Mining Cloud Framework supports workflow-based KDD applications, expressed as graphs that link together data sources, data mining algorithms, and visualization tools.
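Data Mining Cloud Framework: Workflow composition
[Figure: example workflow with tasks T1 to T5. A Splitter divides the CoverType dataset into Sample1, Sample2, Sample3 and Sample4. Three J48 instances use samples as training sets (train) and produce Model1, Model2 and Model3. A Voter takes the three models and Sample4 as test set (test) and produces the final Model.]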
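One way to hold such a workflow in memory is a mapping from each task to the tasks it depends on. The sketch below is an assumption about representation, not the framework's actual data model; the task roles in the comments are a reading of the figure labels, and the dependency lists follow the task table used in the workflow execution slides. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# task -> tasks it depends on
workflow = {
    "T1": [],                        # Splitter: CoverType -> Sample1..Sample4
    "T2": ["T1"],                    # J48 training, produces a model
    "T3": ["T1"],                    # J48 training, produces a model
    "T4": ["T1"],                    # J48 training, produces a model
    "T5": ["T1", "T2", "T3", "T4"],  # Voter: models + test sample -> Model
}

# A valid execution order places every task after its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

`static_order()` raises `graphlib.CycleError` if the dependency lists contain a cycle, which gives a cheap sanity check before any task is submitted.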
Data Mining Cloud Framework: Workflow execution
Azure components: Task Queue, Task Table, Blobs (input datasets, data mining models), Worker.
Task Table:
TaskID   Task status   Dependency list
T1       submitted     ()
T2       submitted     (T1)
T3       submitted     (T1)
T4       submitted     (T1)
T5       submitted     (T1, T2, T3, T4)
Data Mining Cloud Framework: Workflow execution
Azure components: Task Queue, Task Table, Blobs (input datasets, data mining models), Worker.
Task Table:
TaskID   Task status   Dependency list
T1       submitted     ()  No dependencies
T2       submitted     (T1)
T3       submitted     (T1)
T4       submitted     (T1)
T5       submitted     (T1, T2, T3, T4)
Status changes shown on the slide: T1 goes ready, running, done; T2, T3 and T4 then become ready.
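The status changes on these slides suggest a simple rule: a submitted task becomes ready once every task in its dependency list is done. The sketch below simulates that rule in memory. It is an illustration under that assumption, not the framework's implementation; the function name and the log format are hypothetical, and ready tasks are run one after another although they could run in parallel on different workers.

```python
def run_workflow(deps):
    """Simulate status changes for a workflow given as task -> dependencies."""
    status = {task: "submitted" for task in deps}
    log = []  # (task, new status), in the order the changes happen

    def set_status(task, new_status):
        status[task] = new_status
        log.append((task, new_status))

    while any(s != "done" for s in status.values()):
        # A submitted task becomes ready once all its dependencies are done.
        for task, needed in deps.items():
            if status[task] == "submitted" and all(
                status[d] == "done" for d in needed
            ):
                set_status(task, "ready")
        ready = [task for task, s in status.items() if s == "ready"]
        if not ready:
            raise ValueError("no task can become ready: check the dependencies")
        # Ready tasks never depend on each other, so any order is valid.
        for task in ready:
            set_status(task, "running")
            set_status(task, "done")
    return log


workflow = {
    "T1": [],
    "T2": ["T1"],
    "T3": ["T1"],
    "T4": ["T1"],
    "T5": ["T1", "T2", "T3", "T4"],
}

log = run_workflow(workflow)
for task, new_status in log:
    print(task, new_status)
```

The printed sequence starts with T1 going through ready, running and done, continues with T2, T3 and T4, and ends with T5, matching the order of the task table snapshots.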
Data Mining Cloud Framework: Workflow execution
Azure components: Task Queue, Task Table, Blobs (input datasets, data mining models), Worker.
Task Table:
TaskID   Task status   Dependency list
T1       done          ()
T2       ready         (T1)
T3       ready         (T1)
T4       ready         (T1)
T5       submitted     (T1, T2, T3, T4)
Status labels shown on the slide: done, done, done, done.
Conclusions
Data mining and knowledge discovery tools are needed to find what is interesting and valuable in very large data sources.
Cloud computing systems can effectively be used as scalable infrastructures for service-oriented data mining applications.
We need new distributed infrastructures and smart scalable analysis techniques to solve more challenging problems.
Communicating (data) with each other, exchanging information, is nature;
taking into account received information is knowledge.
J.W. von Goethe
Data Mining Cloud Framework: Workflow execution
Azure components: Task Queue, Task Table, Blobs (input datasets, data mining models), Worker.
Task Table:
TaskID   Task status   Dependency list
T1       done          ()
T2       ready         (T1)
T3       ready         (T1)
T4       ready         (T1)
T5       submitted     (T1, T2, T3, T4)
Status changes shown on the slide: T2, T3 and T4 go running (sequential execution, or parallel on different Workers), then done; T5 becomes ready.
Data Mining Cloud Framework: Workflow execution
Azure components: Task Queue, Task Table, Blobs (input datasets, data mining models), Worker.
Task Table:
TaskID   Task status   Dependency list
T1       done          ()
T2       ready         (T1)
T3       ready         (T1)
T4       ready         (T1)
T5       submitted     (T1, T2, T3, T4)
Status changes shown on the slide: T2, T3 and T4 are done; T5 goes ready, running, done.