Post on 27-Oct-2014
Nephele: Efficient Parallel Data Processing in the Cloud
Daniel Warneke and Odej Kao
daniel.warneke@tu-berlin.de
Complex and Distributed IT-Systems
Technische Universität Berlin
Compute Clouds and Data Processing (1/2)
● Cloud computing provides IT infrastructure on demand
  ■ Infrastructure-as-a-Service (IaaS)
  ■ Different types of virtual machines
  ■ No long-term obligations, pay as you go
● Data processing as a major cloud application
  ■ Flexible and easy deployment
  ■ No need for own compute center
  ■ Amazon integrated Hadoop as core service
23.11.2009 Nephele: Efficient Data Processing in the Cloud 2
Compute Clouds and Data Processing (2/2)
● Current situation:
  ■ Allocate set of virtual machines
  ■ Deploy processing framework
  ■ Submit processing job
  ■ Destroy virtual machines
● Imitation of static clusters
  ■ Cloud's features remain unused
  ■ Poor resource utilization → higher processing cost!
Outline
● Data Processing and Compute Clouds
● Opportunities
● Nephele
  ■ Architecture
  ■ Job Description
  ■ Scheduling
● Evaluation
● Conclusion
Opportunities
● Embrace dynamics and heterogeneity offered by clouds!
● Enables new ways of job scheduling
● VMs can be allocated/deallocated according to job progress, or to respond to peaks in workload
● Requirements:
  ■ Processing framework must be aware of the cloud
  ■ Job description/schedule must be able to express when to allocate/deallocate virtual machines
Nephele
● Data processing framework for compute clouds
  ■ Runs on clouds following IaaS abstraction
  ■ Allocates/deallocates VMs on behalf of user according to job progress
● Job description based on directed acyclic graphs (DAGs)
  ■ Vertices represent job's individual tasks
  ■ Edges denote communication channels
Nephele Architecture
● Classic master-worker pattern
● Workers are allocated on demand
[Figure: Nephele architecture. Labels: Client; Public Network (Internet); Compute Cloud; Cloud Controller; Persistent Storage; Master; Worker (three shown); Private / Virtualized Network; Workload over time]
Job Description
● Job Graphs focus on simplicity
  ■ No explicit modeling of parallelization
  ■ No explicit assignment to VMs
  ■ No explicit assignment of channel types
● Users can provide annotations to influence construction of schedule
[Figure: example Job Graph with three vertices]
  Input 1: Task: LineReaderTask.program, Input: s3://user:key@storage/input
  Task 1: Task: MyTask.program
  Output 1: Task: LineWriterTask.program, Output: s3://user:key@storage/outp
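The Job Graph idea above (vertices are tasks, edges are channels, nothing is said about parallelism, VMs, or channel types) can be sketched in a few lines of Python. This is only an illustration: the class names `JobVertex` and `JobGraph` and all method names are hypothetical and are not Nephele's API. The task names and locations are taken from the figure labels, including the output location, which is truncated on the slide and kept as-is.

```python
from dataclasses import dataclass, field


@dataclass
class JobVertex:
    """One task of the job (hypothetical structure, not Nephele's API)."""
    name: str
    program: str
    params: dict = field(default_factory=dict)       # e.g. input/output location
    annotations: dict = field(default_factory=dict)  # optional scheduling hints


@dataclass
class JobGraph:
    vertices: dict = field(default_factory=dict)  # name -> JobVertex
    edges: list = field(default_factory=list)     # (producer, consumer) names

    def add_vertex(self, vertex):
        self.vertices[vertex.name] = vertex

    def connect(self, producer, consumer):
        self.edges.append((producer, consumer))

    def topological_order(self):
        """Kahn's algorithm; raises ValueError if the graph contains a cycle."""
        indegree = {name: 0 for name in self.vertices}
        for _, consumer in self.edges:
            indegree[consumer] += 1
        ready = [name for name, degree in indegree.items() if degree == 0]
        order = []
        while ready:
            current = ready.pop(0)
            order.append(current)
            for producer, consumer in self.edges:
                if producer == current:
                    indegree[consumer] -= 1
                    if indegree[consumer] == 0:
                        ready.append(consumer)
        if len(order) != len(self.vertices):
            raise ValueError("job graph must be acyclic")
        return order


def build_example_job():
    """Rebuilds the three-vertex example from the slide's figure labels."""
    graph = JobGraph()
    graph.add_vertex(JobVertex("Input 1", "LineReaderTask.program",
                               params={"input": "s3://user:key@storage/input"}))
    graph.add_vertex(JobVertex("Task 1", "MyTask.program"))
    # The output location is truncated on the slide and is left truncated here.
    graph.add_vertex(JobVertex("Output 1", "LineWriterTask.program",
                               params={"output": "s3://user:key@storage/outp"}))
    graph.connect("Input 1", "Task 1")
    graph.connect("Task 1", "Output 1")
    return graph
```

The `annotations` dictionary stands in for the user-provided hints mentioned on the slide; how Nephele actually represents them is not shown in the source.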
Execution Graph
● Primary scheduling data structure
● Explicit parallelization
  ■ Tasks can be declared "parallelizable"
  ■ Tasks specify "wiring" of subtasks
● Explicit assignment to virtual machines
  ■ Specified by ID and type
  ■ Type refers to hardware profile
[Figure: example Execution Graph. Labels: Stage 0; Stage 1; Input 1 (1); Task 1 (2); Output 1 (1); ID: 1, Type: m1.small; ID: 2, Type: m1.large]
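A minimal sketch of how a job-level description might be expanded into an Execution Graph with explicit subtasks and VM assignments. All names here (`Subtask`, `expand`, `wire_all_to_all`, `EXAMPLE_SPECS`) are hypothetical. The placement of vertices on VMs and stages is read off the figure labels as far as they are recoverable, and the all-to-all wiring is an assumption, since the slide only says that tasks specify the wiring themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    task: str      # name of the job-level task
    index: int     # position among the task's parallel subtasks
    vm_id: int     # VM the subtask is assigned to
    vm_type: str   # hardware profile of that VM
    stage: int


# Placement assumed from the figure labels: Input 1 (1) and Task 1 (2) on
# VM 1 (m1.small) in Stage 0, Output 1 (1) on VM 2 (m1.large) in Stage 1.
EXAMPLE_SPECS = [
    {"name": "Input 1", "parallelism": 1, "vm_id": 1, "vm_type": "m1.small", "stage": 0},
    {"name": "Task 1", "parallelism": 2, "vm_id": 1, "vm_type": "m1.small", "stage": 0},
    {"name": "Output 1", "parallelism": 1, "vm_id": 2, "vm_type": "m1.large", "stage": 1},
]


def expand(specs):
    """Create one Subtask per degree of parallelism for every task spec."""
    return [
        Subtask(spec["name"], i, spec["vm_id"], spec["vm_type"], spec["stage"])
        for spec in specs
        for i in range(spec["parallelism"])
    ]


def by_task(subtasks, name):
    """Select the subtasks that belong to one job-level task."""
    return [s for s in subtasks if s.task == name]


def wire_all_to_all(producers, consumers):
    """One possible wiring: connect every producer to every consumer."""
    return [(p, c) for p in producers for c in consumers]
```

For example, `wire_all_to_all(by_task(subtasks, "Input 1"), by_task(subtasks, "Task 1"))` yields two edges for the example above, one per subtask of Task 1.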
Dealing with On-Demand VM Allocation
● Issues with on-demand allocation:
  ■ When to allocate virtual machines?
  ■ When to deallocate virtual machines?
  ■ No guarantee of resource availability!
● Stages ensure three properties:
  ■ VMs of upcoming stage are available
  ■ All workers are set up and ready
  ■ Data of previous stages is stored in a persistent manner
[Figure: example Execution Graph, repeated. Labels: Stage 0; Stage 1; Input 1 (1); Task 1 (2); Output 1 (1); ID: 1, Type: m1.small; ID: 2, Type: m1.large]
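The stage idea can be illustrated with a small scheduling loop: before a stage starts, the VMs it needs are allocated (and the job fails early if they are not available), and VMs that the stage no longer needs are released. This is a sketch under stated assumptions, not Nephele's scheduler. `FakeCloud` is a hypothetical stand-in for a cloud controller, and the loop assumes that the data of earlier stages has already been stored persistently when their VMs are released.

```python
from collections import Counter


class FakeCloud:
    """Hypothetical stand-in for an IaaS controller; records every request."""

    def __init__(self, capacity):
        self.capacity = Counter(capacity)  # vm_type -> VMs still available
        self.log = []

    def allocate(self, vm_type, count):
        if self.capacity[vm_type] < count:
            return False  # no guarantee of resource availability
        self.capacity[vm_type] -= count
        self.log.append(("allocate", vm_type, count))
        return True

    def deallocate(self, vm_type, count):
        self.capacity[vm_type] += count
        self.log.append(("deallocate", vm_type, count))


def run_in_stages(stages, cloud, run_stage):
    """Run stages in order; each stage is a {vm_type: count} demand."""
    held = Counter()
    try:
        for number, demand in enumerate(stages):
            demand = Counter(demand)
            # Make sure the VMs of the upcoming stage are available first.
            for vm_type, count in (demand - held).items():
                if not cloud.allocate(vm_type, count):
                    raise RuntimeError(
                        f"stage {number}: {count} x {vm_type} unavailable")
                held[vm_type] += count
            # Release VMs that this stage does not need (assumes the data of
            # earlier stages has already been written to persistent storage).
            for vm_type, count in (held - demand).items():
                cloud.deallocate(vm_type, count)
                held[vm_type] -= count
            run_stage(number, dict(demand))
    finally:
        # Return everything still held, also when a stage fails.
        for vm_type, count in held.items():
            if count > 0:
                cloud.deallocate(vm_type, count)
```

With a two-stage demand such as `[{"c1.xlarge": 6}, {"m1.small": 1}]`, the six large VMs are returned as soon as the second stage has its own VM, rather than being held until the job ends.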
Channel Types
● Network channels (pipeline)
  ■ Vertices must be in same stage
● In-memory channels (pipeline)
  ■ Vertices must run on same VM
  ■ Vertices must be in same stage
● File channels
  ■ Vertices must run on same VM
  ■ Vertices must be in different stages
[Figure: example Execution Graph, repeated. Labels: Stage 0; Input 1 (1); Task 1 (2); Output 1 (1); ID: 1, Type: m1.small]
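The placement constraints of the three channel types can be written down as a small rule check. The sketch below encodes exactly the rules listed on the slide; the names `Placement` and `channel_violations` and the channel-type strings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    vm_id: int
    stage: int


def channel_violations(channel_type, producer, consumer):
    """Return the list of slide rules that the given placement breaks."""
    same_vm = producer.vm_id == consumer.vm_id
    same_stage = producer.stage == consumer.stage
    rules = {
        "network": [
            (same_stage, "vertices must be in same stage"),
        ],
        "in-memory": [
            (same_vm, "vertices must run on same VM"),
            (same_stage, "vertices must be in same stage"),
        ],
        "file": [
            (same_vm, "vertices must run on same VM"),
            (not same_stage, "vertices must be in different stages"),
        ],
    }
    if channel_type not in rules:
        raise ValueError(f"unknown channel type: {channel_type}")
    return [message for satisfied, message in rules[channel_type] if not satisfied]
```

An empty list means the channel type is admissible for that pair of vertices; a scheduler could use such a check to reject or repair an Execution Graph before any VM is allocated.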
Evaluation Job
● Given 100 GB of integer numbers, 100 bytes each
  ■ Find the smallest 20% of these numbers
  ■ Calculate the average of these 20%
● Implemented job for both Nephele and Hadoop
  ■ Hadoop is a popular open-source data processing framework
  ■ Highly "advertised" as major cloud application
● Hardware platform:
  ■ Private cloud based on Eucalyptus (EC2 WS API)
  ■ Commodity servers with Intel Xeon 2.66 GHz (8 cores)
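To make the evaluation job concrete, here is a single-machine sketch of the computation on a small list, together with the record-count arithmetic. It is not the distributed implementation used in the experiment; the function names are hypothetical, and the record count assumes that 1 GB means 10^9 bytes, which the slide does not specify.

```python
import heapq


def average_of_smallest_fraction(numbers, fraction=0.2):
    """Average of the smallest `fraction` of the given numbers."""
    values = list(numbers)
    k = int(len(values) * fraction)
    if k == 0:
        raise ValueError("not enough numbers for the requested fraction")
    smallest = heapq.nsmallest(k, values)  # the k smallest values, ascending
    return sum(smallest) / k


def record_count(total_bytes, record_bytes):
    """Number of fixed-size records that fit into the input."""
    return total_bytes // record_bytes


# Assumption: 1 GB = 10**9 bytes, so 100 GB of 100-byte numbers
# corresponds to 10**9 records.
RECORDS = record_count(100 * 10**9, 100)
```

At the scale of the experiment the selection step has to be distributed, which is why the Hadoop variant starts with a sort and the Nephele variant uses parallel sorters and mergers.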
Results Hadoop
● 1. Task: Terasort
  ■ Map (a-c), Reduce (b-d)
● 2. Task: Aggregation 1
  ■ Map (d-f), Reduce (e-g)
● 3. Task: Aggregation 2
  ■ Map, Reduce (g-h)
Used cloud resources:
6 x c1.xlarge (8 CPU cores, 18 GB RAM)
Nephele Execution Graph
[Figure: Nephele Execution Graph for the evaluation job]
  Stages: Stage 0; Stage 1
  VMs: ID: 1 to ID: 6, type c1.xlarge (label "21" per VM); ID: 7, type m1.small
  Vertices (number of subtasks): BigIntegerReader (126); BigIntegerSorter (126); BigIntegerMerger (6); BigIntegerMerger (2); BigIntegerMerger (1); BigIntegerMerger (1); BigIntegerAggregater (1); BigIntegerWriter (1)
  Legend: In-memory channel; Network channel; File channel
Results Nephele
● (a) Experiment starts
● (b) Sorting starts
● (c) Merging starts
● (d) Deallocation of six c1.xlarge VMs
● (e) Experiment ends
[Plot annotation: Transfer penalty for changing VMs]
Used cloud resources:
6 x c1.xlarge (8 CPU cores, 18 GB RAM)
1 x m1.small (1 CPU core, 1 GB RAM)
Conclusion
● Nephele facilitates new ways of data processing in clouds
● Variety of research issues to address in the future:
  ■ Dynamic compression to mitigate transfer penalty
  ■ Feedback-based construction of Execution Graphs
  ■ More sophisticated ways to improve fault tolerance
● Nephele will become open source soon!
  ■ Feel free to contact me: daniel.warneke@tu-berlin.de