Load Balancing in Parallel and Distributed Database

Md. Shamsur Rahim 14-98181-3 Student, MScCS, AIUB

AZM Ehtesham Chowdhury 15-98451-1 Student, MScCS, AIUB

Saiful Akhter 15-98502-1 Student, MScCS, AIUB

Load Balancing:

Load balancing means distributing transactions and queries among different nodes.

The goal is to maximize the throughput.

Parallel Execution Problems

1. Initialization

2. Interference

3. Skew

Parallel Execution Problems : Initialization

Initialization is necessary before execution.

This sequential step includes:

Process/thread creation and initialization

Communication initialization, etc.

The duration is proportional to the degree of parallelism

The degree of parallelism should be fixed according to query complexity.

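Formula for finding the response time of an operator:

$$\text{ResponseTime} = a \cdot n + \frac{c \cdot N}{n}$$

where N = number of tuples, c = average processing time per tuple, n = number of processors, and a = average initialization time per processor.

The equation can be further derived to obtain the optimal number of processors to allocate ($n_0$) and the maximal achievable speedup ($S_0$):

$$n_0 = \sqrt{\frac{c \cdot N}{a}}$$

$$S_0 = \frac{n_0}{2}$$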
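
As a worked illustration of the response-time formula, the Python sketch below evaluates a*n + c*N/n and the processor count that minimizes it. The numeric values of a, c and N are hypothetical and chosen only so the arithmetic is easy to follow; a is read here as the initialization cost per processor, and the speedup is measured against the sequential processing time c*N.

```python
import math

def response_time(n, a, c, N):
    """Response time of an operator on n processors: a*n + c*N/n."""
    return a * n + (c * N) / n

def optimal_processors(a, c, N):
    """Processor count that minimizes response_time: sqrt(c*N/a)."""
    return math.sqrt(c * N / a)

# Hypothetical values, for illustration only.
a = 0.01        # initialization cost per processor (seconds)
c = 0.0001      # average processing time per tuple (seconds)
N = 1_000_000   # number of tuples

n_opt = optimal_processors(a, c, N)
best = response_time(n_opt, a, c, N)
# Speedup relative to the sequential processing time c*N.
speedup = (c * N) / best

print(f"optimal n = {n_opt:.1f}, response time = {best:.2f}s, speedup = {speedup:.1f}")
```

With these values, allocating fewer or more processors than the optimum gives a longer response time, which is why the degree of parallelism should follow the query complexity.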

Parallel Execution Problems : Interferences

Parallel execution can be slowed down by interference.

Interference occurs when several processors simultaneously access the same resource, hardware or software.

Hardware resource

Solution: duplicate the shared resource.

Software resource

Solution: partition the shared resource into several independent resources.

Parallel Execution Problems : Skew

The problem that appears with intra-operator parallelism (variation in partition size) is known as data skew.

Classification of Skew:

Attribute Value Skew : inherent in the dataset

e.g., there are more citizens in Paris than in Waterloo

Tuple Placement Skew: introduced when the data are initially partitioned

e.g., with range partitioning

Selectivity Skew: introduced when the selectivity of select predicates varies between nodes

Redistribution Skew: occurs in the redistribution step between two operators

Join Product Skew: occurs because the join selectivity may vary between nodes
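
To make attribute value skew concrete, the sketch below partitions a small relation of citizens by city and compares the partition sizes. The relation, the fixed city-to-node mapping (a deterministic stand-in for a hash function) and the max/average "skew factor" are all assumptions made for illustration, not definitions from the slides.

```python
from collections import Counter

# Hypothetical relation: one tuple per citizen, identified by its city.
citizens = ["Paris"] * 900 + ["Waterloo"] * 50 + ["Lyon"] * 50

def partition_sizes(keys, num_nodes, node_of):
    """Count how many tuples each node receives when partitioning on the key."""
    sizes = Counter(node_of(k) % num_nodes for k in keys)
    return [sizes.get(i, 0) for i in range(num_nodes)]

# Deterministic stand-in for a hash function on the city attribute.
node_of = {"Paris": 0, "Waterloo": 1, "Lyon": 2}.get

sizes = partition_sizes(citizens, 3, node_of)
# Illustrative measure: largest partition relative to the average partition.
skew_factor = max(sizes) / (sum(sizes) / len(sizes))

print(sizes, round(skew_factor, 2))
```

The node holding the Paris tuples ends up with far more work than the others, so it determines the response time of the whole operator.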

Inter-Query Parallelism

Form of parallelism where many different Queries or Transactions are

executed in parallel with one another on many processors.

Advantages:

Increases Transaction Throughput.

Scales up the Transaction processing system

Easy to implement in Shared Memory Parallel System.

Example: Oracle 8 & Oracle Rdb.

Intra-Query Parallelism

Form of parallelism where a single query is executed in parallel on many processors.

Two types:

Intra-operation parallelism

Inter-operation parallelism

Advantages:

Speeds up a single complex, long-running query.

Best suited for complex scientific calculations (queries).

Example: Informix, Teradata.

Intra-operation parallelism

The process of speeding up a query through parallelizing the execution of

individual operations.

The operations which can be parallelized are Sort, Join, Projection, Selection

and so on.

Inter-operation parallelism

The process of speeding up a query through parallelizing various operations

which are part of the query.

Example:

A query that joins four tables is executed on two processors.

Each processor joins two relations locally, and the two intermediate results (result1 and result2) are then joined to produce the final result.
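
The four-table example can be sketched in Python as follows. The relations r1 to r4, the join attribute "id" and the helper hash_join are hypothetical, and the two worker threads only stand in for two processors (CPython threads do not give real CPU parallelism), so this shows the shape of the plan rather than a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor

def hash_join(left, right, key):
    """Join two lists of dicts on a common key using an in-memory hash table."""
    table = {}
    for row in left:
        table.setdefault(row[key], []).append(row)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]

# Four hypothetical relations sharing the join attribute "id".
r1 = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
r2 = [{"id": 1, "b": 10}, {"id": 2, "b": 20}]
r3 = [{"id": 1, "c": True}, {"id": 2, "c": False}]
r4 = [{"id": 2, "d": 3.5}]

# Two independent joins run side by side, one per worker.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(hash_join, r1, r2, "id")
    f2 = pool.submit(hash_join, r3, r4, "id")
    result1, result2 = f1.result(), f2.result()

# The intermediate results are joined to produce the final result.
final = hash_join(result1, result2, "id")
print(final)
```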

Intra-Operator Load Balancing

Depends on

The degree of parallelism.

Allocation of processors for the operator.

The home of the operator (the set of processors where it is executed) must be

carefully decided.

The skew problem makes it hard for a parallel query optimizer to make this decision statically, because that would require a very accurate and detailed cost model.

Two solutions can be incorporated in a hybrid query optimizer:

Adaptive techniques

Specialized techniques

Adaptive Technique

The main idea is to statically decide on an initial allocation of the

processors to the operator (using a cost model).

Adapt to skew at run time using load reallocation.

Load reallocation detects the oversized partitions and partitions them again onto several processors.
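
A minimal sketch of load reallocation, under stated assumptions: the function name reallocate, the threshold of 1.5 times the average partition size and the "give the chunk to the least-loaded processor" rule are all hypothetical choices. The sketch only balances partition sizes; a real join would also have to route the matching tuples of the other relation.

```python
def reallocate(partitions, threshold=1.5):
    """Split partitions larger than threshold * average size.

    partitions maps a processor id to its list of tuples. An oversized
    partition keeps roughly an average share; the rest is cut into chunks
    that go to whichever processor currently holds the fewest tuples.
    """
    avg = sum(len(p) for p in partitions.values()) / len(partitions)
    limit = threshold * avg
    keep = max(1, int(avg))
    result = {pid: list(tuples) for pid, tuples in partitions.items()}
    for pid, tuples in partitions.items():
        if len(tuples) <= limit:
            continue  # not oversized, leave it alone
        result[pid] = list(tuples[:keep])
        overflow = tuples[keep:]
        for start in range(0, len(overflow), keep):
            chunk = overflow[start:start + keep]
            target = min(result, key=lambda p: len(result[p]))
            result[target].extend(chunk)
    return result

# Hypothetical skewed allocation over three processors.
initial = {0: list(range(90)), 1: list(range(90, 95)), 2: list(range(95, 100))}
balanced = reallocate(initial)
print({pid: len(tuples) for pid, tuples in balanced.items()})
```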

Adaptive Technique (Continued)

Advantages:

Allows more dynamic adjustment of the degree of parallelism.

Useful for improving intra-operator load balancing in all kinds of parallel architectures.

By reducing processor interference, it yields excellent load balancing for intra-operator parallelism.

Adaptive Technique (Continued)

Uses specific control operators in the execution plan.

They detect whether the static estimates for intermediate result sizes differ from the run-time values.

If the difference between the estimate and the real value is sufficiently high, the control operator performs relation redistribution in order to prevent join product skew and redistribution skew.
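
The decision made by a control operator can be sketched as a simple comparison. The function name should_redistribute and the 50% relative tolerance are assumptions for illustration; the slides do not say how "sufficiently high" is quantified.

```python
def should_redistribute(estimated_size, actual_size, tolerance=0.5):
    """Return True when the run-time size is too far from the static estimate.

    The relative error is compared with a tolerance (hypothetical default 0.5).
    """
    if estimated_size <= 0:
        # No usable estimate: redistribute as soon as tuples actually arrive.
        return actual_size > 0
    relative_error = abs(actual_size - estimated_size) / estimated_size
    return relative_error > tolerance

# Hypothetical intermediate result sizes (in tuples).
print(should_redistribute(1000, 1100))  # small deviation
print(should_redistribute(1000, 5000))  # large deviation
```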

Specialized techniques

Rely on two main techniques:

Range partitioning

Sampling

Range partitioning is used instead of hash partitioning to avoid redistribution skew of the building relation.

Processors can then get partitions of equal numbers of tuples, corresponding to different ranges of join attribute values.

Specialized techniques (Continued)

The parallel hash join can deal with skew as follows:

1. Sample the building relation to determine the partitioning ranges.

2. Redistribute the building relation to the processors using the ranges. Each processor builds a hash table containing the incoming tuples.

3. Redistribute the probing relation to the processors using the same ranges. For each tuple received, each processor probes the hash table to perform the join.
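
The three steps above can be sketched in a single process, with one dictionary entry per simulated processor. The function names, the sample size, the fixed random seed and the way split points are taken from the sorted sample are assumptions made for illustration. Note that a single heavily repeated join value still lands on one processor in this simple version.

```python
import random
from bisect import bisect_left
from collections import defaultdict

def range_boundaries(sample, num_procs):
    """Pick num_procs - 1 split points that cut the sorted sample into equal parts."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_procs] for i in range(1, num_procs)]

def redistribute(relation, key, boundaries):
    """Send each tuple to the processor whose range contains its join value."""
    parts = defaultdict(list)
    for row in relation:
        parts[bisect_left(boundaries, row[key])].append(row)
    return parts

def range_partitioned_hash_join(build, probe, key, num_procs, sample_size=100, seed=0):
    rng = random.Random(seed)
    # Step 1: sample the building relation to determine the ranges.
    sampled_rows = rng.sample(build, min(sample_size, len(build)))
    boundaries = range_boundaries([row[key] for row in sampled_rows], num_procs)
    # Step 2: redistribute the building relation; each processor builds a hash table.
    hash_tables = {}
    for proc, rows in redistribute(build, key, boundaries).items():
        table = defaultdict(list)
        for row in rows:
            table[row[key]].append(row)
        hash_tables[proc] = table
    # Step 3: redistribute the probing relation with the same ranges and probe.
    result = []
    for proc, rows in redistribute(probe, key, boundaries).items():
        table = hash_tables.get(proc, {})
        for row in rows:
            for match in table.get(row[key], []):
                result.append({**match, **row})
    return result

# Hypothetical relations joined on attribute "k".
build = [{"k": i % 10, "b": i} for i in range(50)]
probe = [{"k": i, "p": i * 2} for i in range(5)]
joined = range_partitioned_hash_join(build, probe, "k", num_procs=4)
print(len(joined))
```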

Inter-Operator Load Balancing

For each operator, it is important to choose how many and which processors to assign for its execution.

This choice must take into account pipeline parallelism, which requires inter-operator communication.

Harder to achieve in shared-nothing for the following reasons:

The choice of the degree of parallelism is subject to errors, because both processors and operators are discrete entities.

Inter-Operator Load Balancing (Continued)

Processors associated with the latest operators in a pipeline chain may remain idle for a significant time.

Shared memory allows the parallel execution of independent pipeline chains, known as tasks.

The degree of intra-operator parallelism of the tasks is adjusted dynamically in order to reach maximum resource utilization.

Activations

An activation represents a sequential unit of work

Can be executed by any thread

Self-contained

Can only be executed within the same SM (shared-memory) node

Activation Queues

Used to move data activations along pipeline chains

Also called table queues

Threads have unrestricted access to all queues on their own SM-node

A small number of queues results in interference

To reduce interference, each thread is given its own queue

Threads

A simple strategy for good load balancing is to allocate more threads than there are processors

Allocating one thread per processor per query reduces the overhead of interference

Each thread consumes as many activations as possible to limit thread interference
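
The interplay of activations, activation queues and threads can be sketched as below. This is a simplified model under stated assumptions: an activation is represented as a (function, argument) pair so that it is self-contained, NUM_THREADS stands in for "one thread per processor", every thread has its own queue and only falls back to the other queues of the node when its own is empty, and all activations are enqueued before the threads start (a real executor would keep producing new ones).

```python
import queue
import threading

NUM_THREADS = 4  # stand-in for one thread per processor

# One activation queue per thread, to limit interference on a single shared queue.
queues = [queue.Queue() for _ in range(NUM_THREADS)]
results = []
results_lock = threading.Lock()

def worker(my_id):
    """Consume activations: own queue first, then the other queues of the node."""
    order = [my_id] + [i for i in range(NUM_THREADS) if i != my_id]
    while True:
        for qid in order:
            try:
                func, arg = queues[qid].get_nowait()
            except queue.Empty:
                continue  # nothing here, try the next queue
            value = func(arg)
            with results_lock:
                results.append(value)
            break  # go back to the thread's own queue first
        else:
            return  # every queue was empty: no work left

# Deliberately uneven load: most activations land in queue 0.
for i in range(20):
    queues[0 if i < 17 else i % NUM_THREADS].put((lambda x: x * x, i))

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

Because any thread may execute any activation of its node, the threads that run out of work help drain the overloaded queue instead of staying idle.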

THANK YOU

Reference:

M. Tamer Özsu and Patrick Valduriez, Principles of Distributed Database Systems, Third Edition.

Date post:	15-Jul-2015
Category:	Technology
Upload:	md-shamsur-rahim