Nephele Framework Architecture


Nephele: Efficient Parallel Data Processing in the Cloud

Daniel Warneke and Odej Kao

daniel.warneke@tu-berlin.de

Complex and Distributed IT-Systems, Technische Universität Berlin

Compute Clouds and Data Processing (1/2)

● Cloud Computing provides IT infrastructure on demand
■ Infrastructure-as-a-Service (IaaS)
■ Different types of virtual machines
■ No long-term obligations, pay as you go

● Data processing as major cloud application
■ Flexible and easy deployment
■ No need for own compute center
■ Amazon integrated Hadoop as core service

23.11.2009 Nephele: Efficient Data Processing in the Cloud 2

Compute Clouds and Data Processing (2/2)

● Current situation:
■ Allocate set of virtual machines
■ Deploy processing framework
■ Submit processing job
■ Destroy virtual machines

● Imitation of static clusters
■ Cloud's features remain unused
■ Poor resource utilization → higher processing cost!

Outline

● Data Processing and Compute Clouds
● Opportunities
● Nephele
■ Architecture
■ Job Description
■ Scheduling
● Evaluation
● Conclusion

Opportunities

● Embrace dynamics and heterogeneity offered by clouds!

● Enables new ways of job scheduling

● VMs can be allocated/deallocated according to job progress, or to respond to peaks in workload

● Requirements:
■ Processing framework must be aware of the cloud
■ Job description/schedule must be able to express when to allocate/deallocate virtual machines

Nephele

● Data processing framework for compute clouds
■ Runs on clouds following IaaS abstraction
■ Allocates/deallocates VMs on behalf of user according to job progress

● Job description based on directed acyclic graphs (DAGs)
■ Vertices represent job's individual tasks
■ Edges denote communication channels

Nephele Architecture

● Classic master-worker pattern
● Workers are allocated on demand

[Figure: architecture diagram. Labels: Client; Public Network (Internet); Compute Cloud; Master; Worker (three instances); Private / Virtualized Network; Workload over time]

Job Description

● Job Graphs focus on simplicity
■ No explicit modeling of parallelization
■ No explicit assignment to VMs
■ No explicit assignment of channel

● Users can provide annotations to influence construction of schedule

[Figure: example Job Graph with three vertices]
Input 1: Task: LineReaderTask.program, Input: s3://user:key@storage/input
Task 1: Task: MyTask.program
Output 1: Task: LineWriterTask.program, Output: s3://user:key@storage/outp
Further label in the figure: Output 1, ID: 2, Type: m1.large
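The Job Graph idea above can be sketched in a few lines of Python. This is only an illustration, not Nephele's API: the class names `JobVertex` and `JobGraph`, the `annotations` field, and the edge direction Input 1 → Task 1 → Output 1 are assumptions made for this sketch. It shows that the graph carries tasks and channels only, without parallelism, VM assignment, or channel details.

```python
from dataclasses import dataclass, field


@dataclass
class JobVertex:
    """One task in the job graph (hypothetical class, not Nephele's API)."""
    name: str
    program: str
    annotations: dict = field(default_factory=dict)  # optional scheduling hints


@dataclass
class JobGraph:
    vertices: dict = field(default_factory=dict)  # name -> JobVertex
    edges: list = field(default_factory=list)     # (source name, target name)

    def add_vertex(self, vertex):
        self.vertices[vertex.name] = vertex

    def connect(self, source, target):
        # An edge is a communication channel; nothing else is specified here.
        self.edges.append((source, target))

    def topological_order(self):
        """Return vertex names in dependency order; raise if there is a cycle."""
        indegree = {name: 0 for name in self.vertices}
        for _, target in self.edges:
            indegree[target] += 1
        ready = [name for name, degree in indegree.items() if degree == 0]
        order = []
        while ready:
            current = ready.pop(0)
            order.append(current)
            for source, target in self.edges:
                if source == current:
                    indegree[target] -= 1
                    if indegree[target] == 0:
                        ready.append(target)
        if len(order) != len(self.vertices):
            raise ValueError("job graph must be acyclic")
        return order


graph = JobGraph()
graph.add_vertex(JobVertex("Input 1", "LineReaderTask.program"))
graph.add_vertex(JobVertex("Task 1", "MyTask.program"))
graph.add_vertex(JobVertex("Output 1", "LineWriterTask.program"))
graph.connect("Input 1", "Task 1")   # assumed edge direction
graph.connect("Task 1", "Output 1")  # assumed edge direction
print(graph.topological_order())  # ['Input 1', 'Task 1', 'Output 1']
```

The cycle check is there because the job description is required to be a DAG; a scheduler could reject a cyclic graph at submission time.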

Execution Graph

● Primary scheduling data structure

● Explicit parallelization
● Tasks can be declared "parallelizable"
● Tasks specify "wiring" of subtasks

● Explicit assignment to virtual machines
● Specified by ID and type
● Type refers to hardware profile

[Figure: example Execution Graph. Labels: Stage 0; Stage 1; Input 1 (1); Task 1 (2); Output 1 (1); ID: 1, Type: m1.small; ID: 2, Type: m1.large]
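As a rough sketch of what "explicit parallelization" and "explicit assignment to virtual machines" could look like as data, the Python below expands tasks into subtasks and tags each with a VM ID, a VM type, and a stage. The names `ExecutionVertex`, `expand`, and `wire_all_to_all` are hypothetical, all-to-all is just one possible wiring, and the placement of each task on a particular VM and stage is an assumption loosely based on the figure labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionVertex:
    task: str      # name of the job-graph task
    index: int     # subtask index within that task
    vm_id: int     # which VM instance runs this subtask
    vm_type: str   # hardware profile, e.g. "m1.small"
    stage: int


def expand(task, parallelism, vm_id, vm_type, stage):
    """Create one execution vertex per parallel subtask of a task."""
    return [ExecutionVertex(task, i, vm_id, vm_type, stage)
            for i in range(parallelism)]


def wire_all_to_all(producers, consumers):
    """One possible wiring: every producer subtask feeds every consumer subtask."""
    return [(p, c) for p in producers for c in consumers]


# Assumed placement: Input 1 and Task 1 on VM 1 in stage 0, Output 1 on VM 2 in stage 1.
inputs = expand("Input 1", 1, vm_id=1, vm_type="m1.small", stage=0)
tasks = expand("Task 1", 2, vm_id=1, vm_type="m1.small", stage=0)
outputs = expand("Output 1", 1, vm_id=2, vm_type="m1.large", stage=1)

edges = wire_all_to_all(inputs, tasks) + wire_all_to_all(tasks, outputs)
print(len(edges))  # 4
```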

Dealing with On-Demand VM Allocation

● Issues with on-demand allocation:
■ When to allocate virtual machines?
■ When to deallocate virtual machines?
■ No guarantee of resource availability!

● Stages ensure three properties:
■ VMs of upcoming stage are available
■ All workers are set up and ready
■ Data of previous stages is stored in persistent manner

[Figure: example Execution Graph. Labels: Stage 0; Stage 1; Input 1 (1); Task 1 (2); Output 1 (1); ID: 1, Type: m1.small; ID: 2, Type: m1.large]
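A minimal sketch of stage-wise allocation, assuming a simplified model: each stage lists the VMs it needs, all of them are allocated before the stage runs, and VMs that no later stage needs are released afterwards. `FakeCloud` and `run_stages` are hypothetical names invented here; a real IaaS API can refuse or delay requests, and persisting data between stages is not modeled.

```python
class FakeCloud:
    """Stand-in for an IaaS API (hypothetical; always succeeds immediately)."""

    def __init__(self):
        self.running = set()
        self.log = []

    def allocate(self, vm_id, vm_type):
        self.running.add(vm_id)
        self.log.append(("allocate", vm_id, vm_type))

    def deallocate(self, vm_id):
        self.running.discard(vm_id)
        self.log.append(("deallocate", vm_id))


def run_stages(stages, cloud, run_stage):
    """stages: list of dicts {vm_id: vm_type}, one dict per stage, in order."""
    for number, needed in enumerate(stages):
        # Make sure every VM of the upcoming stage is available before it starts.
        for vm_id, vm_type in needed.items():
            if vm_id not in cloud.running:
                cloud.allocate(vm_id, vm_type)
        run_stage(number)
        # Release every running VM that no later stage needs.
        later = set()
        for stage in stages[number + 1:]:
            later.update(stage)
        for vm_id in sorted(cloud.running):
            if vm_id not in later:
                cloud.deallocate(vm_id)


cloud = FakeCloud()
run_stages(
    [{1: "m1.small"}, {2: "m1.large"}],
    cloud,
    run_stage=lambda number: cloud.log.append(("run", number)),
)
for entry in cloud.log:
    print(entry)
```

In this run the m1.small VM is released as soon as stage 0 finishes, so only the m1.large VM is paid for during stage 1.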

Channel Types

● Network channels (pipeline)
■ Vertices must be in same stage

● In-memory channels (pipeline)
■ Vertices must run on same VM
■ Vertices must be in same stage

● File channels
■ Vertices must run on same VM
■ Vertices must be in different stages

[Figure: Execution Graph excerpt. Labels: Stage 0; Input 1 (1); Task 1 (2); Output 1 (1); ID: 1, Type: m1.small]
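The placement constraints listed on this slide can be written as a small check. This is a sketch only: `Placement` and `channel_allowed` are hypothetical names, and the rules encode exactly the three bullet groups above, nothing more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    vm_id: int
    stage: int


def channel_allowed(channel, producer, consumer):
    """Return True if the channel type fits the two vertices' placements."""
    same_vm = producer.vm_id == consumer.vm_id
    same_stage = producer.stage == consumer.stage
    if channel == "network":
        return same_stage
    if channel == "in-memory":
        return same_vm and same_stage
    if channel == "file":
        return same_vm and not same_stage
    raise ValueError(f"unknown channel type: {channel}")


a = Placement(vm_id=1, stage=0)
b = Placement(vm_id=2, stage=0)
print(channel_allowed("network", a, b))    # True: same stage
print(channel_allowed("in-memory", a, b))  # False: different VMs
```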

Evaluation Job

● Given 100 GB of integer numbers, 100 bytes each
■ Find the smallest 20% of these numbers
■ Calculate the average of these 20%

● Implemented job for both Nephele and Hadoop
■ Popular open-source data processing framework
■ Highly "advertised" as major cloud application

● Hardware platform:
■ Private cloud based on Eucalyptus (EC2 WS API)
■ Commodity servers with Intel Xeon 2.66 GHz (8 cores)
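To make the evaluation job concrete: assuming decimal gigabytes, 100 GB at 100 bytes per number is about 10^9 numbers, so the smallest 20% are about 2 × 10^8 of them. The single-machine Python sketch below performs the same computation on a tiny list. It is not the Nephele or Hadoop implementation; the function name `average_of_smallest_fraction` is made up for this example.

```python
import heapq


def average_of_smallest_fraction(numbers, fraction=0.2):
    """Average of the smallest `fraction` of the given numbers."""
    count = int(len(numbers) * fraction)
    if count == 0:
        raise ValueError("not enough numbers for the requested fraction")
    smallest = heapq.nsmallest(count, numbers)  # the `count` smallest values
    return sum(smallest) / count


# Ten numbers: the smallest 20% are 1 and 2.
print(average_of_smallest_fraction([5, 1, 4, 2, 3, 10, 9, 8, 7, 6]))  # 1.5
```

At the scale of the actual job the data does not fit on one machine, which is why the slides distribute sorting and merging over many subtasks.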

Results Hadoop

● 1. Task: Terasort
■ Map (a-c), Reduce (b-d)

● 2. Task: Aggregation 1
■ Map (d-f), Reduce (e-g)

● 3. Task: Aggregation 2
■ Map, Reduce (g-h)

Used cloud resources:
6 x c1.xlarge (8 CPU cores, 18 GB RAM)

Nephele Execution Graph

[Figure: Nephele Execution Graph for the evaluation job]
Stages: Stage 0; Stage 1
Vertices: BigIntegerReader (126); BigIntegerSorter (126); BigIntegerMerger (6); BigIntegerMerger (2); BigIntegerMerger (1); BigIntegerMerger (1); BigIntegerAggregater (1); BigIntegerWriter (1)
VMs: ID: 1 c1.xlarge to ID: 6 c1.xlarge (marked "... 21 ..."); ID: 7 m1.small
Legend: In-memory channel; Network channel; File channel

Results Nephele

● (a) Experiment starts
● (b) Sorting starts
● (c) Merging starts
● (d) Deallocation of six c1.xlarge VMs
● (e) Experiment ends

[Figure annotation: Transfer penalty for changing VMs]

Used cloud resources:
6 x c1.xlarge (8 CPU cores, 18 GB RAM)
1 x m1.small (1 CPU core, 1 GB RAM)

Conclusion

● Nephele facilitates new ways of data processing in clouds

● Variety of research issues to address in the future:
■ Dynamic compression to mitigate transfer penalty
■ Feedback-based construction of Execution Graphs
■ More sophisticated ways to improve fault tolerance

● Nephele will become open source soon!
■ Feel free to contact me: daniel.warneke@tu-berlin.de
