Page 1: Bulk Synchronous Processing (BSP) Model

Bulk Synchronous Processing (BSP) Model

Course: CSC 8350

Instructor: Dr. Sushil Prasad

Presented by: Chris Moultrie

Page 2: Bulk Synchronous Processing (BSP) Model

Outline

The Model Computation on BSP Model Automatic Memory Management Matrix Multiplication Computational Analysis BSP vs. PRAM BSPRAM

Page 3: Bulk Synchronous Processing (BSP) Model

The Model

This model was proposed by Leslie Valiant in 1990.

It combines three attributes:

Components, which perform processing and/or memory functions;

A router, which delivers messages among components;

A periodicity parameter L, which facilitates synchronization at regular intervals of L time units.

Page 4: Bulk Synchronous Processing (BSP) Model

Computation on BSP Model

A computation consists of a sequence of supersteps. A superstep consists of:

A computation phase, in which each processor uses only locally held values;

A global message transmission from each processor to any subset of the others;

A barrier synchronization.

At the end of a superstep, the transmitted messages become available as local data for the next superstep. (A minimal sketch of this discipline follows below.)
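As a concrete illustration of this structure, here is a minimal sketch in Python that simulates the three parts of a superstep with threads and a shared barrier; the inbox scheme and all names are illustrative, not part of the model.

```python
import threading

P = 4                                    # number of simulated processors
barrier = threading.Barrier(P)           # plays the role of the barrier synchronization
inbox = [[] for _ in range(P)]           # messages become local data after the barrier

def processor(pid):
    local = pid + 1                      # a locally held value

    # Superstep 1:
    value = local * local                # (1) computation on local values only
    inbox[(pid + 1) % P].append(value)   # (2) message to a chosen destination
    barrier.wait()                       # (3) barrier synchronization

    # Superstep 2: the transmitted messages are now local data.
    print(f"processor {pid} received {inbox[pid]}")

threads = [threading.Thread(target=processor, args=(i,)) for i in range(P)]
for t in threads: t.start()
for t in threads: t.join()
```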

Page 5: Bulk Synchronous Processing (BSP) Model

Continued..

The components can be seen as processors.

The router can be seen as an interconnection network.

The periodicity parameter can be seen as a barrier.

[Figure: virtual processors performing local computation, then global communication, then barrier synchronization]

Page 6: Bulk Synchronous Processing (BSP) Model

Components (Processors)

Programmers need not manage memory, assign communication, or perform low-level synchronization.

This is achieved by programs written with sufficient parallel slackness.

When programs written for v virtual processors are run on p real processors with v >> p (e.g. v = p log p), there is parallel slackness.

Parallel slackness makes work distribution more balanced (than in cases such as v=p or v < p).
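A quick simulation, not from the slides, shows why the slack helps: distributing v virtual processors at random over p real ones gives a much more even load when v = p log p than when v = p.

```python
import math, random

def max_to_mean(v, p, trials=200):
    """Average ratio of the busiest processor's load to the mean load v/p."""
    total = 0.0
    for _ in range(trials):
        load = [0] * p
        for _ in range(v):
            load[random.randrange(p)] += 1
        total += max(load) / (v / p)
    return total / trials

p = 64
print("v = p:       max/mean ~", round(max_to_mean(p, p), 2))
print("v = p log p: max/mean ~", round(max_to_mean(int(p * math.log2(p)), p), 2))
```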

Page 7: Bulk Synchronous Processing (BSP) Model

Barrier Synchronization

After each period of L time units (periodicity parameter), a global check is made to determine whether each processor has completed its task.

If all processors have completed the superstep, the machine proceeds to the next superstep.

Otherwise, the next period of L units is allocated to the unfinished superstep.

Synchronization can be switched off for a subset of processors. However, they can still send messages over the network.

Page 8: Bulk Synchronous Processing (BSP) Model

Continued..

What is the optimal value for L? The lower bound is set by the hardware; the upper bound is set by the software, which in turn defines the granularity of the system.

Optimal processor utilization can be achieved only when each processor has an independent task of approximately L steps.

Page 9: Bulk Synchronous Processing (BSP) Model

The Network (Router)

The network delivers messages point to point.

It assumes no combining, duplicating or broadcasting facilities.

It realizes arbitrary h-relations; that is, each processor sends at most h messages and receives at most h messages.

Page 10: Bulk Synchronous Processing (BSP) Model

Continued..

If ĝ is the network throughput in continuous operation and s is the latency, or startup cost, then the cost of realizing an h-relation is ĝh + s.

If ĝh > s, then we can let g = 2ĝ, and the cost of an h-relation becomes gh (an overestimate by a factor of at most 2).

h-relations can therefore be realized in gh time for all h larger than some h0.

If L > gh0, then every h-relation with h < h0 costs as much as an h0-relation.
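This charging rule can be written out as a small cost function; a sketch, with invented parameter values:

```python
def h_relation_cost(h, g_hat, s):
    """Actual delivery time: throughput term plus startup latency."""
    return g_hat * h + s

def bsp_charge(h, g_hat, s):
    """BSP charges g*h with g = 2*g_hat once g_hat*h dominates the startup s."""
    g = 2 * g_hat
    h0 = s / g_hat              # below h0 the startup cost dominates
    return g * max(h, h0)       # small h-relations are charged as h0-relations

print(bsp_charge(h=100, g_hat=1.0, s=20))   # 200.0: large h, cost ~ g*h
print(bsp_charge(h=5,   g_hat=1.0, s=20))   # 40.0: charged as an h0-relation
```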

Page 11: Bulk Synchronous Processing (BSP) Model

Continued..

The value of g is dictated by the network design. It is kept low by increasing the bandwidth of network connections and providing better switching.

As p increases, the required communication can grow as p², so maintaining a fixed or low g requires network costs to grow similarly.

Page 12: Bulk Synchronous Processing (BSP) Model

Automatic Memory Management

Consider random distribution of memory with equally frequent access. If p accesses are made to p components, with high probability some component will receive about log p / log log p of them, which will need Ω(log p / log log p) time units.

If p log p accesses are made, with high probability no component will receive more than 3 log p of them, and the time requirement will be O(log p).

In general, if p·f(p) accesses are made, where f(p) grows faster than log p, the worst-case access rate exceeds the average rate by even smaller factors.

Hashing can be used to make the mapping from symbolic addresses to physical addresses efficiently computable.
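This balls-into-bins behaviour is easy to observe empirically; a sketch, with uniform random choices standing in for a hash function:

```python
import math, random
from collections import Counter

p = 1024

# p accesses hashed to p components: max load ~ log p / log log p
hits = Counter(random.randrange(p) for _ in range(p))
print("busiest of p accesses:", max(hits.values()),
      "~ log p / log log p =", round(math.log(p) / math.log(math.log(p)), 1))

# p log p accesses: with high probability no component gets more than ~3 log p
hits = Counter(random.randrange(p) for _ in range(int(p * math.log(p))))
print("busiest of p log p accesses:", max(hits.values()),
      "~ 3 log p =", round(3 * math.log(p)))
```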

Page 13: Bulk Synchronous Processing (BSP) Model

Matrix Multiplication

Let n = 16 and p = 4. Each processor has to perform 2n³/p additions and multiplications, and receives 2n²/√p messages.

[Figure: an n × n matrix partitioned into blocks of size n/√p × n/√p]

Every processor holds 2n²/p elements, each of which is sent at most √p times.

This may be achieved by data replication at the source when g = O(n/√p) and L = O(n³/p), provided h is suitably small. (A numeric check of these counts follows below.)
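For the slide's sizes n = 16, p = 4, the counts work out as follows (the script itself is not from the original):

```python
import math

n, p = 16, 4
flops    = 2 * n ** 3 / p             # additions and multiplications per processor
messages = 2 * n ** 2 / math.sqrt(p)  # elements received per processor
held     = 2 * n ** 2 / p             # elements held, each sent at most sqrt(p) times
print(flops, messages, held)          # 2048.0 256.0 128.0
```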

Page 14: Bulk Synchronous Processing (BSP) Model

Matrix Multiplication on Hypercube

Let us assume that in g units of time a packet can traverse one edge of the hypercube.

That is, a packet takes O(g log p) time to go to an arbitrary destination.

In the previous example, the computational (local) bounds stay intact when implemented on a hypercube.

The communication cost now becomes O(n log p / √p). Therefore L = O(n³/p) suffices, if the network can realize the h-relations at the g log p cost given above.

Page 15: Bulk Synchronous Processing (BSP) Model

Computational Analysis

The execution time of one superstep S_i of a BSP program consisting of S supersteps is given by w_i + g·h_i + L, where w_i is the largest amount of work done by any processor and h_i is the largest number of messages sent or received by any processor during superstep S_i.

The execution time of the entire program is W + gH + LS, where W = Σ w_i and H = Σ h_i, for i = 0 to S − 1.
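These formulas translate directly into a small cost calculator; a sketch with invented sample numbers:

```python
def bsp_cost(supersteps, g, L):
    """supersteps: list of (w_i, h_i) pairs, one pair per superstep."""
    W = sum(w for w, _ in supersteps)   # W = sum of w_i
    H = sum(h for _, h in supersteps)   # H = sum of h_i
    S = len(supersteps)
    return W + g * H + L * S

# Three supersteps with (work, h-relation size) per superstep; numbers invented.
print(bsp_cost([(1000, 20), (400, 50), (1200, 10)], g=4, L=100))   # 3220
```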

Page 16: Bulk Synchronous Processing (BSP) Model

BSP vs. PRAM

BSP can be regarded as a generalization of the PRAM model.

If the BSP architecture has a small value of g (g = 1), then it can be regarded as a PRAM. Hashing can be used to achieve efficient memory management automatically.

The value of L determines the degree of parallel slackness required to achieve optimal efficiency. The case L = g = 1 corresponds to an idealized PRAM, where no slackness is required.

Page 17: Bulk Synchronous Processing (BSP) Model

BSPRAM

A variant of BSP, intended to support shared-memory-style programming.

There are two levels of memory: the local memory of individual processors, and a shared global memory.

The network is implemented as a random-access shared memory unit.

As in BSP, the computation proceeds in supersteps. A superstep consists of an input phase, a local computation phase, and an output phase.

Page 18: Bulk Synchronous Processing (BSP) Model

Continued..

In the input phase a processor can read data from the main memory; in the output phase it can write data to the main memory.

The processors are synchronized between supersteps.

The computations within a superstep are asynchronous.

There are two types of BSPRAM: EREW BSPRAM, in which every cell of memory can be read from and written to only once in every superstep, and CRCW BSPRAM, which has no such restriction on memory access. (A toy illustration of the distinction follows below.)
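The EREW restriction is mechanical enough to state in code. Below is a toy main-memory wrapper, entirely illustrative, that rejects a second read or write of the same cell within a superstep; a CRCW BSPRAM simply drops these checks:

```python
class ERewMemory:
    """Toy EREW BSPRAM main memory: at most one read and one write per cell per superstep."""
    def __init__(self, size):
        self.cells = [0] * size
        self.reads, self.writes = set(), set()

    def read(self, addr):
        if addr in self.reads:
            raise RuntimeError(f"EREW violation: cell {addr} read twice in one superstep")
        self.reads.add(addr)
        return self.cells[addr]

    def write(self, addr, value):
        if addr in self.writes:
            raise RuntimeError(f"EREW violation: cell {addr} written twice in one superstep")
        self.writes.add(addr)
        self.cells[addr] = value

    def sync(self):
        """Barrier between supersteps: the once-per-superstep budget resets."""
        self.reads.clear()
        self.writes.clear()

mem = ERewMemory(8)
mem.write(0, 42); mem.sync(); mem.write(0, 43)   # fine: different supersteps
```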

Page 19: Bulk Synchronous Processing (BSP) Model

Computational Analysis

We will assume for the sake of convenience that if a value x is being written to a memory cell containing value y, the result may be determined by any function f(x,y) computable in O(1) time.

Similarly, if values x_1, x_2, …, x_m are being written to a main memory cell containing the value y, the result may be determined by any prescribed function f(x_1 ⊕ … ⊕ x_m, y), where ⊕ is a commutative and associative operator and both f and ⊕ are computable in O(1) time.
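A tiny sketch of this convention, with ⊕ taken to be addition and an arbitrarily chosen f (both choices are illustrative):

```python
from functools import reduce
import operator

def resolve_concurrent_writes(y, incoming, op=operator.add, f=lambda a, y: a + y):
    """Combine the values x1 ... xm with the commutative, associative operator
    op, then apply f to the combined value and the old cell contents y."""
    a = reduce(op, incoming)          # x1 (+) x2 (+) ... (+) xm
    return f(a, y)

print(resolve_concurrent_writes(10, [1, 2, 3]))   # f(1 + 2 + 3, 10) = 16
```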

Page 20: Bulk Synchronous Processing (BSP) Model

Continued..

The computation cost is similar to BSP and is given by w + hg + l, where w is the number of local operations performed by each processor, h is the sum of the numbers of data units read from and written to the main memory, and g and l are fixed parameters of the computer.

We write BSPRAM(p, g, l) to denote a BSPRAM with the given values of p, g, and l.

An asynchronous EREW PRAM that charges a unit cost for each global read/write operation, d units for communication startup, and B units for synchronization is equivalent to an EREW BSPRAM(p, 1, d + B).

Page 21: Bulk Synchronous Processing (BSP) Model

Simulation

For efficient BSPRAM simulation on BSP some extra “parallelism” is necessary.

A BSPRAM algorithm has slackness σ if the communication cost of each of its supersteps is at least σ.

Theorem. An optimal randomized simulation on BSP(p, g, l) can be achieved for:

(i) any EREW BSPRAM(p, g, l) algorithm with slackness σ ≥ log p;

(ii) any CRCW BSPRAM(p, g, l) algorithm with slackness σ ≥ p^ε, for some ε > 0.

Page 22: Bulk Synchronous Processing (BSP) Model

Continued..

A BSPRAM algorithm is said to be communication-oblivious if the sequence of communication and synchronization operations executed by any processor is the same for every input size, with no such restriction placed on the local computation.

A BSPRAM algorithm is said to have granularity γ if all memory cells used by the algorithm can be partitioned into granules of size at least γ.

For such an algorithm the granularity bounds the slackness from below: σ ≥ γ.

Page 23: Bulk Synchronous Processing (BSP) Model

Matrix multiplication on BSPRAM

We need to multiply two matrices X and Y and output the result matrix Z.

Z_ik = Σ_j X_ij · Y_jk, for j = 1, …, n, where 1 ≤ i, k ≤ n.

Initialization: Z_ik ← 0 for i, k = 1, …, n.

Computation: V_ijk ← X_ij · Y_jk; Z_ik ← Z_ik + V_ijk, for all i, j, k with 1 ≤ i, j, k ≤ n.

The computations for different triples (i, j, k) are independent and can therefore be performed in parallel. (A sequential transcription follows below.)
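A direct sequential transcription of this scheme in Python (the parallel assignment of triples to processors is elided; everything here is illustrative):

```python
import itertools, random

n = 4
X = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]
Y = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]

# Initialization: Z_ik <- 0
Z = [[0] * n for _ in range(n)]

# Computation: V_ijk <- X_ij * Y_jk ; Z_ik <- Z_ik + V_ijk.
# Each triple (i, j, k) is independent apart from the additive update to Z_ik,
# which is why the triples can be assigned to processors freely.
for i, j, k in itertools.product(range(n), repeat=3):
    Z[i][k] += X[i][j] * Y[j][k]

print(Z)
```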

Page 24: Bulk Synchronous Processing (BSP) Model

Continued..

The array V = (V_ijk) is represented as a cube of volume n³ in integer three-dimensional space.

The matrices are represented as projections of the cube.

Computation of a point V_ijk requires the input of its X and Y projections X_ij and Y_jk, and the output of its Z projection Z_ik.

Page 25: Bulk Synchronous Processing (BSP) Model

Continued..

To obtain a communication-efficient BSP algorithm, the array V is divided into p regular cubic blocks of side n/p^(1/3).

Each matrix is correspondingly partitioned into p^(2/3) square blocks of size n/p^(1/3) × n/p^(1/3).

Each processor can compute one block product sequentially.

Cost analysis: W = O(n³/p), H = O(n²/p^(2/3)), S = O(1).

The algorithm is oblivious, with σ = γ = n²/p^(2/3). (A numeric check follows below.)
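A quick numeric check of these bounds for made-up sizes n = 64, p = 8:

```python
n, p = 64, 8
side = n // round(p ** (1 / 3))     # cubic blocks of side n / p^(1/3)
W = n ** 3 // p                     # local work per processor
H = n ** 2 // round(p ** (2 / 3))   # main-memory traffic per processor
print(side, W, H)                   # 32 32768 1024
```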

Page 26: Bulk Synchronous Processing (BSP) Model

References

Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.

Alexandre Tiskin. The bulk-synchronous parallel random access machine. Theoretical Computer Science, 196(1–2):109–130, 1998.

