Date post: | 15-Jul-2015 |
Category: |
Technology |
Upload: | md-shamsur-rahim |
View: | 91 times |
Download: | 1 times |
Load Balancing in
Distributed DatabaseMd. Shamsur Rahim 14-98181-3 Student, MScCS, AIUB
AZM Ehtesham Chowdhury 15-98451-1 Student, MScCS, AIUB
Saiful Akhter 15-98502-1 Student, MScCS, AIUB
Load Balancing:
Means distributing transaction and queries among different nodes.
The goal is to maximize the throughput.
Parallel Execution Problems
1. Initialization
2. Interference
3. Skew
Parallel Execution Problems : Initialization
Initialization is necessary before execution.
This sequential steps includes
Process/ Thread Creation and initialization
Communication Initialization etc.
The duration is proportional to the degree of parallelism
The degree of parallelism should be fixed according to query complexity.
Formula for finding response time for an Operator:
𝑅𝑒𝑠𝑝𝑜𝑛𝑠𝑒𝑇𝑖𝑚𝑒 = 𝑎 ∗ 𝑛 +𝑐∗𝑁
𝑛
The equation can be further derived to obtain:
𝑁 = 𝑡𝑜𝑢𝑝𝑙𝑒𝑠, 𝑐 = 𝑎𝑣𝑔 𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑖𝑛𝑔 𝑡𝑖𝑚𝑒n = No. Of Processors
optimal number of processors to allocate (n) maximal achievable speedup (S)
𝑛 = √𝑐 ∗ 𝑁
𝑎𝑆 =
𝑛
2
Parallel Execution Problems : Interferences
Parallel execution can be slowed down by interference.
Interference occurs when several processors simultaneously access the same
resource,
Hardware
Solution: Duplicate Shared resource
Software.
Solution: Partition the shared resource into several independent resources
Parallel Execution Problems : Skew
Problem appears with intra- operator parallelism (variation in partition size) is known as data
skew.
Classification of Skew:
Attribute Value Skew : inherent in the dataset
e.g., there are more citizens in Paris than in Waterloo
Tuple Placement Skew: introduced when the data are initially partitioned
e.g., with range partitioning
Selectivity Skew
introduced when there is variation in the selectivity of select predicates on each node
Redistribution Skew
occurs in the redistribution step between two operators.
Join Product Skew
occurs because the join selectivity may vary between nodes
Inter-Query Parallelism
Form of parallelism where many different Queries or Transactions are
executed in parallel with one another on many processors.
Advantages:
Increases Transaction Throughput.
Scales up the Transaction processing system
Easy to implement in Shared Memory Parallel System.
Example: Oracle 8 & Oracle Rdb.
Intra-Query Parallelism
Form of parallelism where Single Query is executed in parallel on many
processors.
2 Types.
Intra-operation parallelism
Inter-operation parallelism
Advantages:
speed up a single complex long running queries.
Best suited for complex scientific calculations (queries).
Example: Informix, Terradata.
Intra-operation parallelism
The process of speeding up a query through parallelizing the execution of
individual operations.
The operations which can be parallelized are Sort, Join, Projection, Selection
and so on.
Inter-operation parallelism
The process of speeding up a query through parallelizing various operations
which are part of the query.
Example Step:
A query which involves join of 4 tables executed in two processors
Each processor shall join two relations locally and the result1 and result2 can be
joined further to produce the final result.
Intra-Operator Load Balancing
Depends on
The degree of parallelism.
Allocation of processors for the operator.
The home of the operator (the set of processors where it is executed) must be
carefully decided.
The skew problem makes it hard for a parallel query optimizer to make this
decision statically.
Require a very accurate and detailed cost model.
Two Solutions incorporated in a hybrid query optimizer.
Adaptive
Specialized
Adaptive Technique
The main idea is to statically decide on an initial allocation of the
processors to the operator (using a cost model).
Adapt to skew using load reallocation.
Load reallocation is to detect the oversized partitions.
Partition them again onto several processors.
Adaptive Technique(Continued)
Advantage:
More dynamic adjustment of the degree of parallelism.
useful to improve intra-operator load balancing in all kinds of parallel
architectures.
By reducing processor interference
Excellent load balancing for intra-operator parallelism
Adaptive Technique(Continued)
specific control operators.
Detect whether the static estimates for intermediate result sizes differ from
the run-time values.
Relation redistribution in order to prevent join product skew and
redistribution skew.
Depends on difference between the estimate and the real value is sufficiently
high.
Specialized techniques
Two main techniques.
Range partitioning
Sampling
Avoid redistribution skew of the building relation.
Processors can get partitions of equal numbers of tuples, corresponding to
different ranges of join attribute values.
Specialized techniques(Continued)
To deal with skew as follows:
Sample the building relation to determine the partitioning
ranges.
Redistribute the building relation to the processors using the
ranges. Each processor builds a hash table containing the
incoming tuples.
Redistribute the probing relation using the same ranges to
the processors. For each tuple received, each processor
probes the hash table to perform the join.
Inter-Operator Load Balancing
Important to Choose for each operator
How many and which processors to assign for its execution.
Taking into account pipeline parallelism, which requires inter-operator
communication.
Harder to achieve in shared-nothing for this Reasons:
Choice of the degree of parallelism cause to errors
Reason: Both processors and operators are discrete entities.
Inter-Operator Load
Balancing(Continued)
Processors associated with the latest operators in a pipeline
chain may remain idle a significant time.
Shared-memory allows the parallel execution of independent
pipeline chains
It is known as Tasks.
Dynamically adjusting the degree of intra-operator parallelism
of the tasks in order to reach maximum resource utilization.
Activations
Represents a sequential unit of work
Can be executed by any thread
Self-contained
Can only be executed in the same SM(shared memory)-node
Activation Queues
Moving data activation along pipeline chains
Also called table queues
Threads have unrestricted access to the same SM-node queues
Small number of queue results interference
A thread a queue
Thread
Simple strategy for good load balancing if number of threads are higher than
the processors
One thread per processor per query reduce the overhead of interference
Thread will consume activation as much as possible to limit thread
interference
THANK YOU
Reference:
M. Tamer Özsu • Patrick Valduriez, Principles of Distributed Database Systems,
Third Edition