
1

Towards an Open Service Framework for Cloud-based Knowledge Discovery

Domenico Talia

ICAR-CNR & University of Calabria, Italy


Cloud Futures 2010 – April 8-9, 2010 - Redmond, WA

2

Goal

• Discuss a strategy based on the use of services for the design of distributed knowledge discovery tasks and applications on Clouds, Grids, and large distributed systems.

• Outline how service-oriented knowledge discovery tasks can be developed as a collection of Grid/Web/Cloud services.

• Investigate how they can be used to develop distributed data analysis applications exploiting the SOA model in a distributed computing scenario.

3

Complex Big Problems

• Bigger and more complex problems must be solved by large scale distributed computing.

• DATA SOURCES are larger and larger and distributed.

• The main problem is not storing DATA; it is analysing, mining, and processing DATA in order to understand it.

4

Data Availability or Data Deluge?

• Today the information stored in digital data archives is enormous and its size is still growing very rapidly.

The world created 750 exabytes (750 billion gigabytes) of digital information in 2006. In 2010, it will create more than 1 zettabyte.

(source: IDC)

5

Data Availability or Data Deluge?

• Whereas until some decades ago the main problem was the shortage of information, the challenge now seems to be
  • the very large volume of information to deal with and
  • the associated complexity of processing it and extracting significant and useful parts or summaries.

6

Distributed Data Intensive Apps

• The use of computers has changed the way we make discoveries and is improving both the speed and the quality of scientific discovery processes.

• In this scenario HPC, Cloud and Grid systems provide effective computational support for distributed data intensive applications and for knowledge discovery from large and distributed data sets.

• Grid systems, parallel computers, and cloud computing systems are key technologies for e-Science. They can be used in integrated frameworks through service interfaces.

7

Distributed Data Mining on Clouds

• Knowledge discovery (KDD) and data mining (DM) are:
  • Compute- and data-intensive processes/tasks
  • Often based on distribution of data, algorithms, and users.

• Large scale systems like Clouds and Grids integrate both distributed computing and parallel computing, thus they are key infrastructures for high-performance distributed knowledge discovery. (Data Analytics Clouds)

• They also offer
  • security, resource information, data access and management, communication, scheduling, fault detection, …

8

Distributed Data Analysis Patterns

• Data parallelism? Task parallelism?
• Managing data dependencies
• Data management: input, intermediate, output
• Dynamic task graphs/workflows (data dependencies)
• Dynamic data access involving large amounts of data
• Parallel data mining and/or distributed data mining
• Programming distributed mining operations/tasks/patterns


9

Programming Levels

[Figure: programming levels arranged by grain size and number of processes]

MPI, OpenMP, threads, MapReduce, RMI, HPF, …

Components, Patterns, Distributed Objects, …

Web Services, Grid Services, Workflows, Mashup, …

10

Services for distributed data mining

• By exploiting the SOA model, it is possible to define basic services that support distributed data mining tasks/applications in large scale distributed systems for science and industry (from a private Cloud to Interclouds).

• Those services can address all the aspects that must be considered in data mining and in knowledge discovery processes:
  • data selection and transport services,
  • data analysis services,
  • knowledge model representation services, and
  • knowledge visualization services.

11

Collection of Services for Distributed Data Mining

• It is possible to define services corresponding to:

Single KDD Steps

All steps that compose a KDD process, such as preprocessing, filtering, and visualization, are expressed as services.

Single Data Mining Tasks

This level includes tasks such as classification, clustering, and association rules discovery.

Distributed Data Mining Patterns

This level implements, as services, patterns such as collective learning, parallel classification, and meta-learning models.

Data Mining Applications or KDD Processes

This level includes the previous tasks and patterns composed in a multi-step workflow.
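The layering above can be illustrated with a minimal sketch, assuming each service is just a Python function registered under a name. The registry, the decorator, and the three toy services (`filter_missing`, `normalize`, `threshold_classify`) are hypothetical stand-ins, not part of any system named in this talk; the point is only that an application-level KDD process is a composition of lower-level services.

```python
from typing import Callable, Dict, List

# A service takes a list of records and returns a list of records.
Service = Callable[[list], list]

# Hypothetical registry standing in for services published on a Cloud.
REGISTRY: Dict[str, Service] = {}


def service(name: str):
    """Register a function under a service name."""
    def register(fn: Service) -> Service:
        REGISTRY[name] = fn
        return fn
    return register


@service("filter_missing")  # single KDD step (filtering)
def filter_missing(records: list) -> list:
    return [r for r in records if r is not None]


@service("normalize")  # single KDD step (preprocessing)
def normalize(records: list) -> list:
    top = max(records)
    return [r / top for r in records]


@service("threshold_classify")  # single data mining task (toy classifier)
def threshold_classify(records: list) -> list:
    return ["high" if r >= 0.5 else "low" for r in records]


def compose(step_names: List[str]) -> Service:
    """Application level: a KDD process as a multi-step workflow of services."""
    def workflow(records: list) -> list:
        for name in step_names:
            records = REGISTRY[name](records)
        return records
    return workflow


kdd_process = compose(["filter_missing", "normalize", "threshold_classify"])
labels = kdd_process([2, None, 8, 4])
```

In a real deployment each registry entry would be a remote service invocation rather than a local call; the composition logic would stay the same.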

12

Data mining services

Open Service Framework for Cloud-based Data Mining

• This collection of data mining services can constitute an Open Service Framework for Grid-based Data Mining,

• allowing developers to program distributed KDD processes as a composition of single and/or aggregated services available over a Cloud.

• Those services should exploit other basic Cloud services for data transfer, replica management, data integration and querying.

13

Data mining Cloud services

• By exploiting the features of Cloud services it is possible to develop data mining services accessible anytime and anywhere (remotely and from small devices).

• This approach may result in
  • Service-based distributed data mining applications.
  • Data mining services for communities/virtual organizations.
  • Distributed data analysis services on demand.
  • A sort of knowledge discovery eco-system formed of a large number of decentralized data analysis services.

14

Data Mining Services: Are they programming abstractions?

• Apparently not, in a traditional approach.

• Yes, if we consider the user and application requirements in handling data and in understanding what is useful in it.
  • Basic services as simple operations;
  • Service programming languages for composing them;
  • Complex services and their complex composition;
  • Towards distributed programming patterns for services.

15

Services for Distributed Data Mining

• Service-based systems we developed

• Weka4WS

• Knowledge Grid

• Mobile Data Mining Services

• Mining@home

16

Weka4WS KnowledgeFlow

Programming data mining workflows and running them in parallel.

17

Service-oriented Knowledge Grid

[Figure: a pool of services (S1 to S9) shown across three stages: Service Selection, Service Workflow Composition, and Application Execution on the Cloud.]

18

Knowledge Grid: Application Design

[Figure: workflow diagram running from start to end, with file transfer, data mining, and voting nodes. Two data mining nodes are annotated with the following properties.]

schema: “/../DMService.wsdl”
operation: “clusterize”
argument: “EM”
argument: “-I 100 -N -1 -S 90” ...

schema: “/../DMService.wsdl”
operation: “classify”
argument: “J48”
argument: “-C 0.25 -M 2” ...
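As a rough illustration of how such node properties could be held in a program, here is a minimal sketch. The `MiningNode` class and its field names are hypothetical; in particular, reading the two `argument` entries as an algorithm name followed by an option string is an assumption, as is the rule that every option flag takes exactly one value. The schema path is kept elided exactly as on the slide.

```python
import shlex
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MiningNode:
    """Hypothetical in-memory form of one data mining node in the workflow."""
    schema: str      # WSDL location (elided on the slide, kept as-is)
    operation: str   # service operation to invoke
    algorithm: str   # first "argument" entry (assumed to be the algorithm)
    options: str     # second "argument" entry (assumed to be its options)

    def option_map(self) -> Dict[str, str]:
        """Pair each flag with the token after it.

        Assumes every flag is followed by exactly one value.
        """
        tokens = shlex.split(self.options)
        return dict(zip(tokens[0::2], tokens[1::2]))


clustering = MiningNode("/../DMService.wsdl", "clusterize", "EM", "-I 100 -N -1 -S 90")
classification = MiningNode("/../DMService.wsdl", "classify", "J48", "-C 0.25 -M 2")
```

Keeping the option string unparsed in the node and deriving the map on demand means the descriptor can still be passed to the service verbatim.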

19

An example of distributed classification

• The dataset is split into smaller chunks.
• Each chunk is processed in parallel.
• The best model from each processing run is chosen.
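The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Knowledge Grid implementation: the "model" is a toy one-feature threshold classifier, local threads stand in for remote mining services, and "best" is taken to mean the single model with the highest accuracy on a validation set. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Split rows into n_chunks contiguous chunks of nearly equal size."""
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def train_stump(chunk):
    """Fit a toy classifier on (x, label) pairs: predict 1 when x >= threshold."""
    best_threshold, best_correct = None, -1
    for threshold, _ in chunk:
        correct = sum((x >= threshold) == bool(label) for x, label in chunk)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(threshold, rows):
    """Fraction of rows the threshold classifier labels correctly."""
    return sum((x >= threshold) == bool(label) for x, label in rows) / len(rows)


def distributed_classification(train_rows, validation_rows, n_chunks=4):
    # Step 1: split the dataset (empty chunks are dropped).
    chunks = [c for c in split_into_chunks(train_rows, n_chunks) if c]
    # Step 2: train one model per chunk in parallel.
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        models = list(pool.map(train_stump, chunks))
    # Step 3: keep the model that scores best on the validation set.
    return max(models, key=lambda t: accuracy(t, validation_rows))


rows = [(x, int(x >= 5)) for x in range(10)]
best = distributed_classification(rows, rows, n_chunks=2)
```

A voting scheme over all chunk models, as suggested by the voting node in the workflow diagram, would replace only the final selection step.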

20

Service Oriented Mobile Data Mining

• The main research goal here is to let users access data mining services from mobile devices.

• The system includes three components:
  • Data providers
  • Mining servers
  • Mobile clients

[Figure: system architecture with data providers, mining servers (each with a data store), and mobile clients.]

21

Service Oriented Mobile Data Mining

• A user can choose the mining algorithm and select which part of a result (data mining model) to visualize.

22

Mining@home

• The Public Resource Computing paradigm (PRC) is currently used to execute large scientific applications with the help of private computers (Seti@home, Climate@home, Einstein@home).

• The PRC model can be exploited to program P2P data mining tasks involving hundreds or thousands of nodes.

• Highly decentralized data analysis tasks can be programmed as large collections of tasks or services.

23

Impact of Service Overhead

[Figure: execution times (ms, logarithmic axis from 1.0E+02 to 1.0E+07) against dataset size (0.5 to 5.0 MB in 0.5 MB steps), with one series per step: resource creation, notification subscription, task submission, dataset download, data mining, results notification, resource destruction, and total. The dataset download and data mining series are called out. Values labelled on the chart, one per dataset size: 1.30E+05, 2.46E+05, 3.61E+05, 4.86E+05, 6.16E+05, 7.25E+05, 8.49E+05, 9.64E+05, 1.06E+06, 1.18E+06.]

In a Grid scenario the data mining step represents from 85% to 88% of the total execution time, the dataset download takes about 11%, while the other steps range from 0.5% to 4%.
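A breakdown like the one above is each step's time divided by the total. The sketch below shows that arithmetic; the timings are hypothetical round numbers chosen only for illustration and are not read from the chart, and "service overhead" is assumed here to mean every step other than dataset download and data mining.

```python
def time_shares(step_times_ms):
    """Return each step's fraction of the total execution time."""
    total = sum(step_times_ms.values())
    return {step: t / total for step, t in step_times_ms.items()}


# Hypothetical per-step timings in milliseconds (illustrative only).
example_times_ms = {
    "resource creation": 10.0,
    "notification subscription": 5.0,
    "task submission": 5.0,
    "dataset download": 110.0,
    "data mining": 860.0,
    "results notification": 5.0,
    "resource destruction": 5.0,
}

shares = time_shares(example_times_ms)

# Assumed definition: overhead is everything except download and mining.
overhead_share = sum(
    share
    for step, share in shares.items()
    if step not in ("dataset download", "data mining")
)
```

With these made-up timings the mining step accounts for 86% of the total, the download for 11%, and the remaining service steps for 3% combined.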

24

Weka4WS: Application Speedup

25

Summary

• New HPC infrastructures allow us to attack new problems, BUT require us to solve more challenging problems.
• New programming models and environments are required.
• Data is becoming a BIG player; programming data analysis applications and services is a must.
• New ways to efficiently compose different models and paradigms are needed.
• Relationships between different programming levels (from libraries to services) must be addressed.

• In a long-term vision, pervasive collections of data analysis services and applications must be accessed and used as public utilities.

• We must be ready to manage this scenario.

26

Thanks