Parallel Computing
Parallel Algorithm Design
Task/Channel Model
• Parallel computation = set of tasks
• Task
• Program
• Local memory
• Collection of I/O ports
• Tasks interact by sending messages
through channels
Task/Channel Model
(Figure: tasks connected by directed channels)
Foster’s Design Methodology
(Figure: Problem → Partitioning → Communication → Agglomeration → Mapping)
1. Partitioning
2. Communication
3. Agglomeration
4. Mapping
1. Partitioning
• Dividing computation and data into pieces
• Domain decomposition
• Divide data into pieces
• e.g., an array into sub-arrays (reduction, sketched below); a loop into
sub-loops (matrix multiplication); a search space into
sub-spaces (chess)
• Functional decomposition
• Divide computation into pieces
• e.g., pipelines (floating-point multiplication),
workflows (payroll processing)
• Determine how to associate data with
computations
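To make the array example concrete, here is a minimal sketch in plain C (the helper name block_range and the sizes are illustrative, not taken from the slides) of how an n-element array is divided into roughly equal contiguous blocks, one per primitive task:

#include <stdio.h>

/* Block decomposition: task id (0..p-1) owns elements [low, high) of an
 * n-element array; the formulas spread any remainder as evenly as possible. */
static void block_range(int id, int p, int n, int *low, int *high)
{
    *low  = (id * n) / p;          /* first element owned by this task */
    *high = ((id + 1) * n) / p;    /* one past the last element        */
}

int main(void)
{
    int n = 10, p = 4;             /* 10 elements split among 4 tasks */
    for (int id = 0; id < p; id++) {
        int low, high;
        block_range(id, p, n, &low, &high);
        printf("task %d owns elements [%d, %d) -> %d elements\n",
               id, low, high, high - low);
    }
    return 0;
}

With n = 10 and p = 4 this yields blocks of 2 or 3 elements, which already satisfies the "roughly equal size" attribute discussed on the next slide.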
Partitioning
• The individual pieces are called primitive
tasks.
• Desirable attributes for a partition
• Many more primitive tasks than processors on
target computer.
• Tasks of roughly equal size (in computation
and data).
• Number of tasks increases with problem size.
Example of domain decomposition
Example of Functional Decomposition
2. Communication
• Determine values passed among tasks
• Local communication
• Task needs values from a small number of
other tasks
• Create channels illustrating data flow
• Global communication
• Significant number of tasks contribute data to
perform a computation
• Don’t create channels for them early in design
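A minimal MPI sketch of the two patterns (the ring topology and the values are chosen only for illustration): local communication is a point-to-point exchange with one neighbour, i.e. a channel, while a global computation is usually better expressed with a collective operation than with many hand-made channels:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Local communication: each task sends a value to its right neighbour
     * and receives one from its left neighbour (one channel per neighbour). */
    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    int mine = rank, from_left;
    MPI_Sendrecv(&mine, 1, MPI_INT, right, 0,
                 &from_left, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Global communication: every task contributes to a single result;
     * a collective replaces the many explicit channels. */
    int sum;
    MPI_Reduce(&mine, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("got %d from the left neighbour, global sum = %d\n", from_left, sum);

    MPI_Finalize();
    return 0;
}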
Desirable attributes for communication
• Balanced
• Communication operations balanced among
tasks
• Small degree:
• Each task communicates with only small group
of neighbors
• Concurrency
• Tasks can perform communications
concurrently
• Tasks can perform computations concurrently
3. Agglomeration
• Agglomeration is the process of grouping
tasks into larger tasks to improve
performance.
• Here, minimizing communication is
typically a design goal.
• Grouping tasks that communicate with each
other eliminates that communication; this is
called increasing locality
• Grouping tasks can also allow us to combine
multiple communications into one (sketched below)
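A small illustration of the last point (MPI, run with at least two processes; the two values are placeholders): two values that would otherwise travel through separate channels are packed into a single message, so the per-message latency is paid once instead of twice:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two logically separate values share one message after agglomeration. */
    double buf[2];
    if (rank == 0) {
        buf[0] = 3.14;   /* e.g., a boundary value        */
        buf[1] = 2.72;   /* e.g., a neighbouring quantity */
        MPI_Send(buf, 2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %.2f and %.2f in a single message\n", buf[0], buf[1]);
    }

    MPI_Finalize();
    return 0;
}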
Desirable attributes of agglomeration
• Increased locality of the parallel algorithm
• Agglomerated tasks have similar computational
and communication costs
• Number of tasks increases with problem size
• Number of tasks is as small as possible, yet at
least as great as the number of processors on
target computer
4. Mapping
• Mapping is the process of assigning
agglomerated tasks to the processors
• Here, we’re thinking of a distributed-memory
machine
• If we choose the number of agglomerated tasks
to equal the number of processors then the
mapping is already done. Each processor gets
one agglomerated task
Mapping Goals
• Processor utilization: would like
processors to have roughly equal
computational and communication costs
• Minimize interprocessor communication
• This can be posed as a graph partitioning
problem:
• Each partition should have roughly the same
number of nodes
• The partition should cut a minimal amount of
edges
Partitioning a graph
(Figure: two alternative partitions of the same task graph between processors P0 and P1)
Equalizing processor utilization and minimizing interprocessor
communication are often competing forces
Mapping heuristics
• Static number of tasks
• Structured communication
• Constant computation time per task
− Agglomerate tasks to minimize communication
− Create one task per processor
• Variable computation time per task
− Cyclically map tasks to processors
• Unstructured communication
− Use a static load balancing algorithm
• Dynamic number of tasks
• Use a run-time task-scheduling algorithm
− e.g., a master/slave strategy
• Use a dynamic load balancing algorithm
− e.g., share load among neighboring processors; remapping
periodically
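A plain C sketch (the sizes are illustrative) contrasting the two static mappings named above: a block mapping keeps consecutive tasks on the same processor, which helps when neighbouring tasks communicate, while a cyclic mapping deals the tasks out round-robin, which balances better when computation time varies from task to task:

#include <stdio.h>

int main(void)
{
    int t = 10, p = 3;   /* 10 tasks mapped onto 3 processors */

    for (int task = 0; task < t; task++) {
        int block  = (task * p) / t;   /* block mapping: contiguous groups       */
        int cyclic = task % p;         /* cyclic mapping: round-robin assignment */
        printf("task %d -> block: processor %d, cyclic: processor %d\n",
               task, block, cyclic);
    }
    return 0;
}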
Example
• 1. Boundary value problems
(Figure: a thin rod surrounded by insulation, with both ends immersed in ice water)
Boundary Value Problem
Heat conduction physics:
∂u/∂t = a²·∂²u/∂x², with a² = k/c
Discretization: u_{i,j} = temperature at position i and time step j
∂u/∂t ≈ (u_{i,j+1} − u_{i,j}) / Δt
∂²u/∂x² ≈ (u_{i+1,j} − 2·u_{i,j} + u_{i−1,j}) / (Δx)²
Update rule: u_{i,j+1} = r·u_{i−1,j} + (1 − 2r)·u_{i,j} + r·u_{i+1,j}, where r = a²·Δt / (Δx)²
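A minimal sequential sketch of this update rule in C (the grid size, the value of r and the initial temperatures are illustrative choices, not from the slides; r ≤ 0.5 keeps the explicit scheme stable):

#include <stdio.h>
#include <string.h>

#define N 10               /* grid points, including the two boundary points */

int main(void)
{
    double u[N]     = {0.0};   /* temperature at time step j     */
    double u_new[N] = {0.0};   /* temperature at time step j + 1 */
    double r = 0.25;           /* r = a^2 * dt / (dx * dx)       */
    int m = 100;               /* number of time steps           */

    /* Initial condition: rod uniformly at 100 degrees except at the ends,
     * which the ice baths hold at 0. */
    for (int i = 1; i < N - 1; i++) u[i] = 100.0;

    for (int j = 0; j < m; j++) {
        for (int i = 1; i < N - 1; i++)
            u_new[i] = r * u[i - 1] + (1.0 - 2.0 * r) * u[i] + r * u[i + 1];
        memcpy(u, u_new, sizeof u);    /* boundaries stay at 0 */
    }

    for (int i = 0; i < N; i++) printf("%6.2f ", u[i]);
    printf("\n");
    return 0;
}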
Boundary Value Problem
• Partition
• One data item per grid point
• Associate one primitive task with each grid
point
• Two-dimensional domain decomposition
• Communication
• Identify communication pattern between
primitive tasks
• Each interior primitive task has three incoming
and three outgoing channels
Boundary Value Problem
• Agglomeration and mapping
(Figure: primitive tasks are agglomerated into one task per contiguous block of grid points; each block is then mapped to a processor)
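A sketch of the agglomerated and mapped computation in MPI (all sizes and names are illustrative): each process owns a contiguous block of grid points plus two ghost cells, and each iteration exchanges one value with each neighbour, which is where the 2λ term in the analysis that follows comes from:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define LOCAL_N 4          /* grid points per process (illustrative) */
#define STEPS   100        /* number of time steps                   */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[LOCAL_N+1] are ghost cells holding the neighbours' edge values. */
    double u[LOCAL_N + 2], u_new[LOCAL_N + 2];
    double r = 0.25;
    for (int i = 0; i <= LOCAL_N + 1; i++) u[i] = 100.0;
    if (rank == 0)        u[0] = 0.0;              /* ice bath at the left end  */
    if (rank == size - 1) u[LOCAL_N + 1] = 0.0;    /* ice bath at the right end */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < STEPS; step++) {
        /* Exchange boundary values with both neighbours: 2 messages per iteration. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank == 0)        u[0] = 0.0;              /* keep the boundary conditions */
        if (rank == size - 1) u[LOCAL_N + 1] = 0.0;

        for (int i = 1; i <= LOCAL_N; i++)
            u_new[i] = r * u[i - 1] + (1.0 - 2.0 * r) * u[i] + r * u[i + 1];
        memcpy(&u[1], &u_new[1], LOCAL_N * sizeof(double));
    }

    printf("process %d: first local value %.2f\n", rank, u[1]);
    MPI_Finalize();
    return 0;
}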
Model Analysis
• Sequential execution
• χ – time to update one element
• n – number of elements
• m – number of iterations
• Sequential execution time: m·n·χ
• Parallel execution
• p – number of processors
• message time for q data items = λ + q/β ≈ λ when q is small (λ – latency, β – bandwidth)
• Parallel execution time: m (χ·⌈n/p⌉ + 2λ)
Example – Parallel reduction
• Given an associative operator ⊕
• Compute a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an−1
• Examples
• Add
• Multiply
• And, Or
• Maximum, Minimum
• Data decomposition: one primitive task per value (one of the a’s)
Parallel reduction
Further steps to
reach a binomial tree
Parallel reduction
(Figure: 16 starting values, one per task: 4 2 0 7 / -3 5 -6 -3 / 8 1 2 3 / -4 4 6 -1)
Parallel reduction
(Figure: after the first combining step, 8 partial results: 1 7 -6 4 / 4 5 8 2)
Parallel reduction
(Figure: after the second step, 4 partial results: 8 -2 / 9 10)
Parallel reduction
(Figure: after the third step, 2 partial results: 17 8)
Parallel reduction
(Figure: after the final step, one task holds the result: 25)
The communication pattern forms a binomial tree
Agglomeration
(Figure: after agglomeration, each task first computes the sum of its own block of values; the partial sums are then combined)
Analysis
• Parallel running time
• χ – time to perform the binary operation
• λ – time to communicate a value via a channel
• n values and p tasks
• Time for each task to perform its local calculations: (⌈n/p⌉ − 1)·χ
• Communication steps: ⌈log p⌉
• After each receiving communication there is one operation
• Total time: (⌈n/p⌉ − 1)·χ + ⌈log p⌉·(λ + χ)
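A sketch of this reduction in MPI (the data values are illustrative, and it assumes the number of processes is a power of two): each task first sums its own block of values, then the partial sums are combined in ⌈log p⌉ steps, halving the number of active tasks each time. In practice a single MPI_Reduce call provides the same result:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Local phase: (n/p - 1) additions over this task's own values;
     * here each task owns 4 illustrative values derived from its rank. */
    double local = 0.0;
    for (int i = 0; i < 4; i++)
        local += rank * 4 + i;

    /* Binomial-tree phase: ceil(log2 p) communication steps. */
    for (int half = size / 2; half >= 1; half /= 2) {
        if (rank < half) {                     /* lower half receives and adds   */
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, rank + half, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local += other;
        } else if (rank < 2 * half) {          /* upper half sends and drops out */
            MPI_Send(&local, 1, MPI_DOUBLE, rank - half, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)
        printf("total = %.0f\n", local);
    MPI_Finalize();
    return 0;
}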
Example: the N-body problem
(Figure: bodies B1, B2 and B3, each with a mass m, a position (x,y) and a velocity v; the forces f1 and f2 that the other bodies exert on one of them)
The N-body problem
The N-body problem partitioning
• Domain partitioning
• Assume one task per particle
• Each task holds its particle’s position, velocity
vector and mass
• Iteration
• Get the positions and masses of all other particles
• Compute the particle’s new position and velocity
Gather and All-Gather operations
(Figures: a gather collects one item from every task onto a single task; an all-gather delivers every task’s item to all tasks)
To avoid conflicts, the all-gather is performed in ⌈log p⌉ steps, doubling the amount of data held by each task at every step.
Time to communicate a message of n items: λ + n/β
With p tasks there are ⌈log p⌉ iterations, and the number of items exchanged doubles at each iteration:
Σ (i = 1 .. ⌈log p⌉) of ( λ + 2^(i−1)·n / (β·p) ) = λ·⌈log p⌉ + (p − 1)·n / (β·p)
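A sketch of the iteration in MPI using MPI_Allgather (the one-dimensional positions and the toy attraction force are placeholders for the real physics): every task keeps n/p bodies and, before updating them, gathers the current positions of all n bodies:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_N 2          /* bodies per task (illustrative) */
#define STEPS   10
#define DT      0.01

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = LOCAL_N * size;                    /* total number of bodies  */
    double pos[LOCAL_N], vel[LOCAL_N];
    double *all = malloc(n * sizeof(double));  /* positions of all bodies */

    for (int i = 0; i < LOCAL_N; i++) {        /* illustrative initial state */
        pos[i] = rank * LOCAL_N + i;
        vel[i] = 0.0;
    }

    for (int step = 0; step < STEPS; step++) {
        /* All-gather: every task obtains the positions of all n bodies. */
        MPI_Allgather(pos, LOCAL_N, MPI_DOUBLE,
                      all, LOCAL_N, MPI_DOUBLE, MPI_COMM_WORLD);

        /* Update only this task's own bodies; the "force" below is a toy
         * attraction toward every other body, standing in for the physics. */
        for (int i = 0; i < LOCAL_N; i++) {
            double acc = 0.0;
            for (int j = 0; j < n; j++)
                acc += all[j] - pos[i];
            vel[i] += acc * DT;
            pos[i] += vel[i] * DT;
        }
    }

    if (rank == 0)
        printf("body 0 final position: %.3f\n", pos[0]);
    free(all);
    MPI_Finalize();
    return 0;
}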
Analysis
• N-body problem parallel version
• n bodies and p tasks
• m iterations over time
Total time excluding I/O: m ( λ·⌈log p⌉ + (p − 1)·n / (β·p) + χ·n/p ), where χ is the time to compute one body’s new position and velocity
Considering I/O
Reading or writing n items of data through an I/O channel takes λio + n/βio
(λio – I/O latency, βio – I/O bandwidth)
In the N-body problem the initial values must be transmitted to the other tasks
Scatter operation
Improving
1. The first task transmits n/2 items to another task
2. The 2 tasks transmit n/4 items to 2 other tasks
3. The 4 tasks transmit n/8 items to 4 other tasks
4. And so on …
Σ (i = 1 .. ⌈log p⌉) of ( λ + n / (2^i·β) ) = λ·⌈log p⌉ + (p − 1)·n / (β·p)
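The tree-structured distribution described above is the kind of algorithm an MPI library may use internally for its scatter collective; a minimal usage sketch of MPI_Scatter (buffer contents and sizes are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define PER_TASK 4         /* items handed to each task (illustrative) */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *full = NULL;
    if (rank == 0) {       /* only the root holds the items read from I/O */
        full = malloc(PER_TASK * size * sizeof(double));
        for (int i = 0; i < PER_TASK * size; i++)
            full[i] = (double)i;
    }

    /* Each task receives its own block of PER_TASK items. */
    double mine[PER_TASK];
    MPI_Scatter(full, PER_TASK, MPI_DOUBLE,
                mine, PER_TASK, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("task %d got items starting at %.0f\n", rank, mine[0]);

    if (rank == 0) free(full);
    MPI_Finalize();
    return 0;
}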
Analysis considering I/O
• Total time after m iterations
• Initial reading + scattering
• Computing m iterations
• Final gathering + writing
Total ≈ 2·(λio + n/βio) + 2·(λ·⌈log p⌉ + (p − 1)·n / (β·p)) + m·(λ·⌈log p⌉ + (p − 1)·n / (β·p) + χ·n/p)