COST MINIMIZATION FOR BIG DATA PROCESSING IN GEO- …cssongguo/papers/bigdata14-ppt.pdf · 2014....

Post on 04-Sep-2020

COST MINIMIZATION FOR BIG DATA PROCESSING IN GEO-DISTRIBUTED DATA CENTERS

Song Guo

The University of Aizu

Homepage: http://www.u-aizu.ac.jp/~sguo

Email: sguo@u-aizu.ac.jp

System Model

• Topology:

– geo-distributed data centers (DCs) connected with switches

• Cost:

– Inter-DC cost “CR” vs Intra-DC cost “CL”

– Server cost when a server is turned on

What Are the Problems?

• Where to put the data and computation? (Data & task placement)

– Same server : “0”

– Same DC : “CL” (0 < CL < CR)

– Different DCs : “CR”

• How to utilize physical resources of servers?

– Server ON/OFF (data center resizing, DCR)

– To balance storage and computation resources

• How to route the data transmission?

– What is the transmission rate?

– What is the transmission path? (Data flow routing)
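The three placement cost tiers listed above can be sketched as a small Python function. The `Server` type, the `unit_cost` name, and the numeric values chosen for CL and CR are illustrative assumptions; the slides only state that 0 < CL < CR.

```python
from dataclasses import dataclass

# Illustrative values only (assumption); the slides require just 0 < CL < CR.
CL = 1.0  # intra-DC cost per unit of traffic
CR = 4.0  # inter-DC cost per unit of traffic


@dataclass(frozen=True)
class Server:
    dc: str    # data center identifier (hypothetical naming)
    name: str  # server identifier within the data center


def unit_cost(data_at: Server, task_at: Server) -> float:
    """Cost of moving one unit of data from its storage server to the task."""
    if data_at == task_at:
        return 0.0  # same server: no network transfer
    if data_at.dc == task_at.dc:
        return CL   # same data center
    return CR       # different data centers
```

For example, `unit_cost(Server("dc1", "s1"), Server("dc2", "s1"))` returns CR because the two servers sit in different data centers even though their local names match.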

General Problem Formulation

• What is our objective?

– To minimize the total cost: both server cost and network cost

• What are the constraints?

– Data and task placement

– Hadoop Distributed File System

– Data flow transmission

– QoS satisfaction

– 2D Markov Chain

Data and Task Placement

• Multiple copies of the data and at least one computation unit for each task must be placed on servers

• Each required resource (storage, computation, etc.) must not exceed the server capacity

• The total task rate across all servers shall equal the original user task rate

• If a storage or computation unit is located on a server, that server must be turned on
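The four placement rules above can be checked mechanically. The following is a minimal sketch for a single data chunk and its task; the function name, the dictionary-based inputs, and the violation messages are assumptions made for illustration and do not come from the slides.

```python
def check_placement(replicas, task_rates, storage_cap, compute_cap,
                    powered_on, min_copies, user_rate, tol=1e-9):
    """Return a list of violated rules; an empty list means feasible.

    replicas:    {server: number of chunk copies stored there}
    task_rates:  {server: task rate handled there}
    storage_cap, compute_cap: {server: capacity}
    powered_on:  set of servers that are switched on
    """
    violations = []
    # Rule 1: enough data copies and at least one computation unit.
    if sum(replicas.values()) < min_copies:
        violations.append("too few data copies")
    if not any(rate > 0 for rate in task_rates.values()):
        violations.append("no computation unit placed")
    # Rule 2: per-server capacity limits.
    for server, copies in replicas.items():
        if copies > storage_cap[server]:
            violations.append(f"storage capacity exceeded on {server}")
    for server, rate in task_rates.items():
        if rate > compute_cap[server]:
            violations.append(f"computation capacity exceeded on {server}")
    # Rule 3: the assigned task rates add up to the user task rate.
    if abs(sum(task_rates.values()) - user_rate) > tol:
        violations.append("task rates do not sum to the user task rate")
    # Rule 4: every server hosting a unit must be turned on.
    used = {s for s, n in replicas.items() if n > 0}
    used |= {s for s, r in task_rates.items() if r > 0}
    for server in sorted(used - set(powered_on)):
        violations.append(f"{server} hosts a unit but is turned off")
    return violations
```

Returning the list of violations rather than a single boolean makes it easier to see which rule a candidate placement breaks.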

Hadoop Distributed File System

• P-copy storage policy

• HDFS data distribution example (P=3)

[Figure: HDFS data distribution example across Rack 1 to Rack 5]
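To make the P-copy idea concrete, here is a minimal sketch that picks P distinct servers for one chunk by visiting racks in round-robin order. This spreading rule is an assumption chosen for illustration; it is not the policy shown on the slide and not HDFS's actual default placement. The function name and the rack dictionary layout are likewise hypothetical.

```python
from itertools import cycle


def place_copies(racks, p):
    """Pick p distinct servers, visiting racks round-robin.

    racks: {rack_name: [server, ...]}
    Returns a list of (rack_name, server) pairs, one per copy.
    """
    total = sum(len(servers) for servers in racks.values())
    if p > total:
        raise ValueError("not enough servers for p distinct copies")
    # Work on copies so the caller's lists are left untouched.
    pools = {rack: list(servers) for rack, servers in racks.items()}
    chosen = []
    for rack in cycle(list(pools)):
        if len(chosen) == p:
            break
        if pools[rack]:
            chosen.append((rack, pools[rack].pop(0)))
    return chosen
```

With three racks and P=3, each copy lands in a different rack; a fourth copy would wrap around to the first rack that still has a free server.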

Data Flow Transmission

[Figure: storage and computation nodes in Rack 1 to Rack N of DC 1 and DC 2; intra-DC links are labeled CL and inter-DC links CR]

• Only servers that hold the data can be flow source nodes

• The total outgoing flow from the source nodes shall not exceed the user request rate λ

• The destination receives all of its data from other servers only when it does not hold a copy of the data
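The three flow rules above can be expressed as a simple check for one chunk and one destination server. This sketch makes several assumptions that the slides do not state: flows go directly from source to destination (routing through switches is ignored), "receives all data" is read as an inflow equal to λ, and the function and argument names are hypothetical.

```python
def flow_violations(flows, holders, destination, request_rate, tol=1e-9):
    """Return a list of violated flow rules; an empty list means feasible.

    flows:        {source_server: rate sent to the destination}
    holders:      set of servers that store a copy of the chunk
    destination:  server that runs the computation
    request_rate: user request rate (lambda)
    """
    violations = []
    # Rule 1: only servers holding the data may act as sources.
    for source, rate in flows.items():
        if rate > tol and source not in holders:
            violations.append(f"{source} sends data but holds no copy")
    # Rule 2: total outgoing flow is bounded by the request rate.
    total = sum(flows.values())
    if total > request_rate + tol:
        violations.append("total outgoing flow exceeds the request rate")
    # Rule 3: remote data is needed only if the destination has no copy.
    if destination in holders:
        if total > tol:
            violations.append("destination already holds a copy but receives data")
    elif abs(total - request_rate) > tol:
        violations.append("destination does not receive the full request rate")
    return violations
```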

QoS Satisfaction

• Fluid flow model

– Pipelined transmission

– Computation starts as soon as the first chunk arrives

Bottleneck

2D Markov Chain

[Figure: User, Storage, and Computation inside the cloud services; arrows are labeled Data and Results, with Rate λ, Rate γ, and Rate μ]

• Step 1: User requests arrive with rate λ

• Step 2: Data is transmitted to the computation unit with rate γ

• Step 3: Computation is executed with rate μ

This process can be modeled by a 2D Markov chain.

2D Markov Chain

• 2D Markov chain

– User request rate λ

– Computation rate μ

– Data transmission rate γ

• Computation can happen when and only when data arrives

– The total system delay T is affected by λ, μ and γ

– Computation rate μ depends on how much computation resource is allocated to each task

– Data transmission rate γ depends on the data flow path

– T shall not exceed the QoS requirement
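To see how λ, γ and μ jointly drive the delay T, the sketch below uses a deliberately simplified stand-in: a tandem of two M/M/1 queues (transfer, then computation) with Poisson arrivals and exponential service. This is an assumption made for illustration; it is not the 2D Markov chain of the slides, and its result is not the closed form derived there. The function names and the delay bound argument are hypothetical.

```python
def mean_delay(lam, gamma, mu):
    """Mean time a task spends in a tandem of two M/M/1 queues.

    lam:   request arrival rate (lambda)
    gamma: data transmission rate of the first stage
    mu:    computation rate of the second stage
    """
    if lam <= 0:
        raise ValueError("arrival rate must be positive")
    if lam >= gamma or lam >= mu:
        return float("inf")  # a stage is overloaded, so the queue grows without bound
    # Mean number of tasks in each stage of the tandem.
    mean_tasks = lam / (gamma - lam) + lam / (mu - lam)
    # Little's law: mean delay = mean number of tasks / arrival rate.
    return mean_tasks / lam


def meets_qos(lam, gamma, mu, max_delay):
    """True when the mean delay stays within the (hypothetical) QoS bound."""
    return mean_delay(lam, gamma, mu) <= max_delay
```

Even in this simplified model, raising either γ (a better flow path) or μ (more computation resource per task) lowers T, which is the trade-off the QoS constraint captures.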

QoS Satisfaction

• By solving the ODEs, we can derive the state probability πjk(p, q) as:

• When B goes to infinity, the mean number of tasks for chunk k on server j, Tjk, is

• Finally,

Notations

Formulation

Data & Request Placement

Data Flow Transmission

QoS Satisfaction

Performance Evaluation

• Our proposal outperforms the traditional mechanism under all settings

• Our proposal reduces the overall cost by approximately 20% compared with the traditional “locate computation with data” mechanism

Contributions

• We propose a two-dimensional Markov chain and derive the expected task completion time in closed form. We explore the big data placement problem to answer the following questions:

– a) how to place the data chunks in the servers,

– b) how to distribute tasks onto servers without violating the resource constraints, and

– c) how to resize data centers to minimize the operation cost.

Previous work focuses only on the “locate data with computation” policy; we show that jointly considering data and computation location gives better performance in cost minimization.
