Post on 12-Jan-2016
Chapter 11: Broadcasting with Selective Reduction (BSR)
Serpil Tokdemir, GSU, Department of Computer Science
What is Broadcasting with Selective Reduction?
BSR is an extension of the PRAM that requires asymptotically no more resources than the PRAM for its implementation. It consists of:
N processors
M shared-memory locations
a MAU (memory access unit)
Forms of memory access: ER, EW, CR, CW
The BSR Model of Parallel Computation
[Figure: processors P1, P2, ..., PN connected through a memory access unit (MAU) to the shared memory locations.]
Broadcasting with Selective Reduction
During execution of an algorithm, several processors may read from or write to the same memory location; all processors may gain access to all memory locations at the same time. For the purpose of writing, at each memory location a subset of the incoming broadcast data is selected and reduced to one value, according to an appropriate selection rule and reduction operator; this value is finally stored in the memory location.
BSR accommodates all forms of memory access allowed by the PRAM, plus broadcasting with selective reduction.
BSR Continued
The width of the resulting MAU: O(M); the depth of the resulting MAU: O(logM); the size of the resulting MAU: O(MlogM).
How long does a step take in BSR? Memory access should require a(N, M) = O(logM); we assume here that a(N, M) = O(1). Similarly, a computational operation takes constant time: c(N, M) = O(1).
THE BSR MODEL
BSR adds one form of concurrent access to shared memory: the BROADCAST instruction, which allows all processors to write to all shared memory locations simultaneously. It has three phases.
A broadcasting phase: each processor Pi broadcasts a datum di and a tag gi, 1 <= i <= N, destined to all memory locations.
A selection phase: each memory location Uj uses a limit lj, 1 <= j <= M, and a selection rule s to test the condition gi s lj, where s is selected from the set
{<, <=, =, >=, >, /=}
The BSR Model (Continued)
A reduction phase: all data di selected by Uj during the selection phase are combined into one datum that is finally stored in Uj, using a reduction operator chosen from:
SUM, PRODUCT, AND, OR, EXCLUSIVE-OR, MAXIMUM, MINIMUM
All three phases are performed simultaneously for all processors Pi and all memory locations Uj.
The three phases of the BROADCAST instruction
[Figure: each processor Pi sends its pair (gi, di) to every memory location; each Uj tests the tags gi against its limit lj; the accepted data di are reduced to a single value stored in Uj.]
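To make the three phases concrete, here is a minimal sequential Python sketch of one BROADCAST instruction. This is not part of the original slides: the function name `broadcast` and its signature are my own, and a real BSR machine would perform all of this in one parallel step.

```python
import operator

def broadcast(tags, data, limits, select, reduce_op, memory):
    """Simulate one BSR BROADCAST instruction sequentially.

    Processor i contributes the pair (tags[i], data[i]); memory
    location j has limit limits[j]. Data whose tag satisfies
    select(tag, limit) are combined with reduce_op and stored.
    """
    for j, l in enumerate(limits):
        # Selection phase: keep the data whose tag passes the test.
        accepted = [d for g, d in zip(tags, data) if select(g, l)]
        if accepted:  # an empty selection leaves U_j unchanged
            # Reduction phase: fold the accepted data into one value.
            r = accepted[0]
            for d in accepted[1:]:
                r = reduce_op(r, d)
            memory[j] = r
    return memory

# Example: at each U_j, sum the data whose tag is <= the limit l_j.
mem = broadcast([1, 2, 3], [10, 20, 30], [1, 2, 3],
                operator.le, operator.add, [0, 0, 0])
# mem is now [10, 30, 60]
```

The selection rule and reduction operator are passed as ordinary functions, mirroring how the model lets each BROADCAST choose its own pair.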
The BSR Model
If a datum or a tag is not in a processor's local register, it can be obtained from shared memory by an ER or a CR instruction.
The limits, selection rule, and reduction operator are assumed to be known by the memory locations. If not, they can be stored in memory by an EW or CW instruction.
Notation for the BROADCAST Instruction
The BROADCAST instruction of BSR is written as follows:
Uj <- R_{1 <= i <= N : gi s lj} di,  for 1 <= j <= M,
where R is the reduction operator and s the selection rule.
THE BSR MODEL
If no data are accepted by a given memory location, its value is not affected by the BROADCAST instruction. If only one datum is accepted, Uj is assigned the value of that datum.
Comparing BSR to the PRAM
In BSR, the BROADCAST instruction requires O(1) time. On a PRAM with the same numbers of processors and memory locations it requires O(M) time, since BROADCAST is equivalent to M CW instructions.
BSR is therefore at least as powerful as the PRAM, and the BROADCAST instruction makes BSR strictly more powerful than the PRAM.
THE BSR MODEL
Given a sequence X = x1, x2, ..., xn of numbers in nondecreasing order, and a sequence L = l1, l2, ..., ln of distinct numbers in increasing order, it is required to compute, for 1 <= i <= n, the sum si of all those elements of X not equal to li.
On the PRAM this takes O(n) time, which is obviously optimal: the sum S of all the elements of X is first computed; Y = X is merged with L, sorted by increasing order; Y is scanned; and si, 1 <= i <= n, is computed by subtracting from S all the elements of X equal to li. n processors can each compute one of the si in O(1) time.
THE BSR MODEL
BSR solves the problem using one BROADCAST instruction: processor Pi, 1 <= i <= n, broadcasts (xi, xi) as its tag and datum pair. Memory location Uj selects those xi not equal to lj; the xi selected by Uj are added up to obtain sj, 1 <= j <= n.
This requires O(1) time and does not depend on X and L being sorted.
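The one-BROADCAST solution above can be sketched sequentially; this is a hedged illustration, not the slides' notation, and the helper name `sums_of_unequal` is mine.

```python
def sums_of_unequal(X, L):
    """One BROADCAST: P_i sends (tag x_i, datum x_i); U_j uses limit
    l_j with selection "not equal" and reduction SUM, giving s_j."""
    return [sum(x for x in X if x != l) for l in L]

X = [2, 2, 5, 7]
L = [2, 5, 7, 9]
s = sums_of_unequal(X, L)  # s[j] = sum of elements of X not equal to L[j]
# s is [12, 11, 9, 16]
```

Note that, as the slides point out, nothing here relies on X or L being sorted.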
BSR ALGORITHMS
Prefix Sums: given n numbers x1, x2, ..., xn, compute the prefix sums sj = x1 + x2 + ... + xj, 1 <= j <= n.
BSR PREFIX SUMS uses n processors and n memory locations. Pi broadcasts its index i as tag and xi as datum. Memory location Uj uses its index j as limit, the relation <= for selection, and SUM as the reduction operator. When the instruction terminates, Uj holds sj, 1 <= j <= n.
BSR Algorithms – Prefix Sums
Algorithm BSR PREFIX SUMS
for j = 1 to n do in parallel
  for i = 1 to n do in parallel
    sj <- SUM_{i <= j} xi
  end for
end for.
The algorithm consists of one BROADCAST instruction: p(n) = n, t(n) = O(1), and c(n) = p(n) x t(n) = O(n), which is optimal.
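The prefix-sums broadcast can be sketched as follows; this sequential Python is my own illustration under the slides' tag/limit conventions (indices are 1-based in the slides, 0-based in the code).

```python
def bsr_prefix_sums(X):
    """Simulate BSR PREFIX SUMS: P_i broadcasts (tag i, datum x_i);
    U_j uses limit j, selection <=, reduction SUM, so U_j holds s_j."""
    n = len(X)
    return [sum(X[i] for i in range(n) if i + 1 <= j)
            for j in range(1, n + 1)]

result = bsr_prefix_sums([1, 2, 3])
# result is [1, 3, 6], matching the slides' example
```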
BSR Algorithms – Prefix Sums Example: X = {1, 2, 3} gives the prefix sums {1, 3, 6}
BSR Algorithms – Sorting
Given a sequence X = x1, x2, ..., xn, rearrange the elements of X into a sequence S = s1, s2, ..., sn in nondecreasing order.
Requires n processors and n memory locations; consists of two steps:
1. The rank rj of each element xj is computed: xj serves as the limit, < as the selection relation, and SUM as the reduction operator. Uj holds rj, for 1 <= j <= n.
2. xj is placed in position 1 + rj of the sorted sequence S. If xj, xk, and xm are equal, they receive equal ranks.
BSR Algorithms - Sorting
Second step continued: if rj = rk = rm for equal elements xj, xk, xm, then xj goes to position 1 + rj, xk to position 2 + rj, and xm to position 3 + rj. The next element, with the next higher rank, is placed in position 4 + rj of S.
Pi broadcasts the pair (ri, xi); Uj uses its index j as the limit, <= for selection, and MAX as the reduction. When this step terminates, Uj holds sj, that is, the jth element of the sorted sequence.
BSR Algorithms - Sorting
Algorithm BSR SORT
Step 1: for j = 1 to n do in parallel
  rj <- 0
  for i = 1 to n do in parallel
    rj <- SUM_{xi < xj} 1
  end for
  rj <- rj + 1
end for
Step 2: for j = 1 to n do in parallel
  for i = 1 to n do in parallel
    sj <- MAX_{ri <= j} xi
  end for
end for
BSR Algorithms - Sorting
Example: X = {8, 5, 2, 5}. In Step 1 the processors broadcast the pairs (8,1), (5,1), (2,1), (5,1) to all memory locations; the limits are 8, 5, 2, and 5.
Since 5 < 8, 2 < 8, and 5 < 8, r1 = 3. Only 2 < 5, so r2 = 1; r3 = 0; only 2 < 5, so r4 = 1.
BSR Algorithms - Sorting
Example continued: in Step 2 of the algorithm the processors broadcast the pairs (4,8), (2,5), (1,2), (2,5); the limits at the memory locations are 1, 2, 3, and 4.
This gives the sorted sequence {2, 5, 5, 8}.
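Both broadcast steps of BSR SORT can be sketched sequentially. This is my own illustration (the function `bsr_sort` is not in the slides); ranks here already include the + 1 of Step 1, so they are the positions used as tags in Step 2.

```python
def bsr_sort(X):
    """Simulate BSR SORT.

    Step 1: r_j = 1 + (number of x_i strictly less than x_j), via a
    BROADCAST with limit x_j, selection <, reduction SUM of 1's.
    Step 2: s_j = MAX over all x_i whose rank r_i is <= j.
    """
    n = len(X)
    r = [1 + sum(1 for x in X if x < X[j]) for j in range(n)]
    return [max(X[i] for i in range(n) if r[i] <= j + 1)
            for j in range(n)]

sorted_seq = bsr_sort([8, 5, 2, 5])
# sorted_seq is [2, 5, 5, 8], as in the slides' example
```

Equal elements share a rank, and the MAX reduction of Step 2 fills the consecutive positions they occupy, exactly as the tie discussion above describes.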
BSR Algorithms - Sorting
Analysis: BSR SORT uses p(n) = n processors and runs in t(n) = O(1) time, so c(n) = O(n). This is a uniform analysis: the time a(N, M) required for memory access was taken to be O(1).
Discriminating analysis: a(N, M) is taken to be equal to O(logM), for both BSR and PRAM.
BSR: N = M = O(n). Each step is executed once and contains a constant number of computations and memory accesses, so:
BSR Algorithms - Sorting
t(n) = 1 x (a(N, M) + c(N, M)) = O(logn)
c(n) = p(n) x t(n) = n x O(logn) = O(nlogn), which is OPTIMAL.
PRAM SORT: N = M = O(n); it executes O(logn) computational and memory access steps, therefore:
t(n) = O(logn) x (a(N, M) + c(N, M)) = O(log^2 n)
c(n) = p(n) x t(n) = n x O(log^2 n) = O(nlog^2 n). Cost is NOT optimal.
BSR Algorithms – Computing Maximal Points
Given S = q1, q2, ..., qn, n points in the plane, where qi = (xi, yi) for 1 <= i <= n. A point qj = (xj, yj) dominates qi if xj > xi and yj > yi. A point of S is said to be maximal with respect to S if and only if it is not dominated by any other point of S.
The algorithm uses n processors and n memory locations and consists of three steps:
1. An auxiliary sequence m1, m2, ..., mn is created; mi, associated with point qi, is initially set equal to yi, 1 <= i <= n.
2. The largest y-coordinate among the points to the right of qj is found, and mj is assigned the value of that coordinate: Pi broadcasts (xi, yi), with xi as tag and yi as datum.
BSR Algorithms – Computing Maximal Points
Uj uses xj as its limit, the relation > for selection, and MAX for reduction, to compute mj. If xa > xj, 1 <= a <= n, Uj accepts the y-coordinate of point qa; the maximum of the accepted y-coordinates is assigned to mj.
3. A decision is made as to whether qi is a maximal point. Suppose mi was assigned the y-coordinate of some point qk:
If yk > yi, then qk dominates qi, and mi <- 0.
Else (yk <= yi), neither qk nor any other point dominates qi, and mi <- 1.
BSR Algorithms – Computing Maximal Points
Algorithm BSR MAXIMAL POINTS
Step 1: for i = 1 to n do in parallel
  mi <- yi
end for
Step 2: for j = 1 to n do in parallel
  for i = 1 to n do in parallel
    mj <- MAX_{xi > xj} yi
  end for
end for
Step 3: for i = 1 to n do in parallel
  if mi > yi
    then mi <- 0
    else mi <- 1
  end if
end for.
BSR Algorithms – Computing Maximal Points
Analysis: each step uses n processors and runs in O(1) time, so p(n) = n, t(n) = O(1), and c(n) = O(n). By taking the memory access time to be O(logn), the cost becomes O(nlogn); on the other hand, the cost for the PRAM is O(nlog^2 n), which is not optimal.
Example: q1, q2, q3 are three points in the plane.
[Figure: the three points q1, q2, q3 plotted in the (x, y) plane.]
BSR Algorithms – Computing Maximal Points
After step 1 of the algorithm, m1=y1, m2=y2, m3=y3
After step 2, m1=y3, m2=y3, m3=y3
Since m1 < y1, m2 > y2, and m3 = y3, both q1 and q3 are maximal
BSR Algorithms – Maximum Sum Subsequence
Given X = x1, x2, ..., xn, find indices u and v, u <= v, such that the subsequence xu, xu+1, ..., xv has the largest possible sum xu + xu+1 + ... + xv among all subsequences of X.
Algorithm BSR MAXIMUM SUM SUBSEQUENCE
Step 1: for j = 1 to n do in parallel
  for i = 1 to n do in parallel
    sj <- SUM_{i <= j} xi
  end for
end for
BSR Algorithms – Maximum Sum Subsequence
Step 2:
(2.1) for j = 1 to n do in parallel
  for i = 1 to n do in parallel
    mj <- MAX_{i >= j} si
  end for
end for
(2.2) for j = 1 to n do in parallel
  for i = 1 to n do in parallel
    aj <- i such that si = mj
  end for
end for
BSR Algorithms – Maximum Sum Subsequences
Step 3: for i = 1 to n do in parallel
  bi <- mi - si + xi
end for
Step 4:
(4.1) for i = 1 to n do in parallel
  (i) L <- MAX bi
  (ii) if bi = L then u <- i end if (ARBITRARY)
end for
(4.2) v <- au
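The four steps can be sketched sequentially as below. This is my own rendering (function name and 0-based internals are assumptions); it returns the 1-based (L, u, v) of the slides, and like the ARBITRARY CW it breaks ties by taking the first qualifying index.

```python
def bsr_max_sum_subsequence(X):
    """Simulate BSR MAXIMUM SUM SUBSEQUENCE (1-based results)."""
    n = len(X)
    s = [sum(X[:j + 1]) for j in range(n)]             # Step 1: prefix sums
    m = [max(s[j:]) for j in range(n)]                 # Step 2.1: m_j = max_{i>=j} s_i
    a = [j + s[j:].index(m[j]) + 1 for j in range(n)]  # Step 2.2: 1-based index of that max
    b = [m[i] - s[i] + X[i] for i in range(n)]         # Step 3: best sum starting at i
    L = max(b)                                         # Step 4.1(i): MAX
    u = b.index(L) + 1                                 # Step 4.1(ii): first i with b_i = L
    v = a[u - 1]                                       # Step 4.2: v = a_u
    return L, u, v

result = bsr_max_sum_subsequence([-1, 1, 2, -2])
# result is (3, 2, 3): the subsequence x2, x3 = 1, 2 sums to 3
```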
BSR Algorithms – Maximum Sum Subsequences
Steps of the algorithm: the prefix sums are computed using BSR PREFIX SUMS. For each j, the maximum prefix sum to the right of sj is found, giving its value mj and its index aj: Pi broadcasts (i, si) as tag and datum; Uj uses j as limit, >= for selection, and MAX for reduction.
To compute aj, Pi broadcasts (si, i) as its tag and datum pair; Uj uses mj as limit and = for selection.
For each i, the sum bi of the maximum sum subsequence beginning at xi is computed, using an EW instruction.
BSR Algorithms – Maximum Sum Subsequences
Steps of the algorithm, continued: the sum L and starting index u of the overall maximum sum subsequence are found; this requires a MAX CW instruction and an ARBITRARY CW instruction.
Analysis: each step of the algorithm runs in O(1) time and uses n processors. Thus p(n) = n, t(n) = O(1), and c(n) = O(n), which is optimal.
BSR Algorithms – Maximum Sum Subsequences
Example: X={-1, 1, 2, -2}
After Step 1, the prefix sums sj are: -1, 0, 2, 0
The second broadcast instruction yields mj: 2, 2, 2, 0
BSR Algorithms – Maximum Sum Subsequences
Example continued:
The third broadcast instruction computes aj: 3, 3, 3, 4
Step 3 computes each bi: 2, 3, 2, -2
Finally: L = 3, u = 2, v = a2 = 3