COMP60621 Concurrent Programming for Numerical Applications

Lecture 1

The Nature of Parallelism: The Data-Parallel Algorithmic Core

Len Freeman, Graham Riley
Centre for Novel Computing
School of Computer Science
University of Manchester

Nov 2010


Overview

– Generic Properties of Applications

– Task-Parallelism vs. Data-Parallelism

– Four Data-Parallel 'Kernel' Algorithms

• Elementwise Vector Addition

• Vector Sum Reduction

• Matrix-Vector Multiplication

• Matrix-Matrix Multiplication

– Summary


Generic Properties of Applications

– From extensive studies of HPC simulations, we conclude that there are many potential applications for HPC, drawn from diverse disciplines and with quite different intrinsic characteristics.

– On the other hand, the applications do have some characteristics in common. For example, simulations often require the use of discrete approximations to continuous domains.

– Each application needs an underpinning mathematical model and an algorithmic procedure which 'animates' the model in a fashion suitable for digital computation.

– Is it possible to classify the nature of the parallelism that occurs in applications?


Task-Parallelism vs. Data-Parallelism

– Perhaps the most practically interesting thing to emerge from our examples so far is the following distinction in styles of parallelism:

• Task-parallelism – in which different functions are performed simultaneously, possibly using (part of) the same data; the different functions may take very different times to execute.

• Data-parallelism – in which the same function is performed simultaneously, but on different (sets of) data; often, but not always, the function executes in the same time, even though the data values vary. There is further substructure in data-parallelism: experience points to three generic forms, which are conveniently introduced using the examples that follow.
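To make the distinction concrete, here is a minimal task-parallel sketch only, assuming Fortran with OpenMP purely for illustration (the lecture treats parallelism abstractly, independently of any programming system); the names S, M and A are illustrative, not part of the lecture material. Two different functions, a sum and a maximum, are applied concurrently to the same vector.

! Task-parallelism sketch (assumption: OpenMP). Two different functions
! run concurrently on (part of) the same data.
!$OMP PARALLEL SECTIONS
!$OMP SECTION
   S = SUM(A)        ! one task: sum the elements of A
!$OMP SECTION
   M = MAXVAL(A)     ! another task: find the largest element of A
!$OMP END PARALLEL SECTIONS

The data-parallel style, in which the same operation is applied simultaneously to different elements, is illustrated by the kernel examples on the following slides.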


Data Parallelism

– Data-parallel algorithms are epitomised by four very simple examples:

• Element-wise vector addition;

• Vector sum reduction;

• Matrix-vector multiplication;

• Matrix-matrix multiplication.

– On their own, these are simple tasks, which we would normally expect to find embedded as subtasks of some more complex computation. Nevertheless, taken together, they are complex enough to illustrate most of the major issues in parallel computing (task-parallelism is readily included by chaining two or more of these examples together). They certainly deserve to be treated as a core part of our algorithmic presentation.


Introduction to Kernel Data-Parallel Algorithms

For each example, we shall investigate:

• The work that needs to be done.

• The ways in which the necessary work might be done in parallel.

• Any inherent constraints associated with the resulting parallelism.

• How performance might be affected as a result of any choices made.

Remember that we are dealing with abstract parallelism (finding opportunities), so our discussion of concepts such as work and performance will necessarily be somewhat imprecise.


Element-wise Vector Addition

– At Algorithm Level, a vector is best thought of as an abstract data type representing a one-dimensional array of elements, all of the same data type. For simplicity, we will use arrays of integer values (this can be generalised with little effort).

– The whole vector is normally identified by a user-defined name, while the individual elements of the vector are identified by use of a supplementary integer value, known as the index. The precise semantics of an index value can vary, but a convenient way of viewing it is as an offset, indicating how far away the element is from the first element (or base) of the vector. (In our examples, and using Fortran convention, an index of 1 corresponds to the first element.)


Element-wise Vector Addition

For our purposes, it is convenient to look at vectors in a diagrammatic form, as follows:

vector name: A; integer elements:

A1  A2  A3  A4  A5  A6  A7  A8  A9  A10


Element-wise Vector Addition

The task of adding together the elements of two vectors can be drawn as follows:

A:                     A1     A2     A3     ...  A10
B:                     B1     B2     B3     ...  B10
A + B (element-wise):  A1+B1  A2+B2  A3+B3  ...  A10+B10

A and B are input vectors. The result is an output vector.


Element-wise Vector Addition

– A simple, sequential algorithm for (element-wise) addition is to form the output vector one element at a time, by running through the elements of the two input vectors, in index order, computing the sum of the pair of input elements at each index point (a sketch follows below).

– The work that has to be done comes in two forms:

• Accessing the elements of the vectors (two input vectors and one output vector); and

• Computing the sum of each pair of elements.

– How might this work be done in parallel?

– What range of options are there?

– How do these affect performance?
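As a sketch of the sequential algorithm described above, written in the same Fortran style as the later examples (the name C for the output vector is illustrative), followed by one possible parallel version assuming OpenMP purely for illustration – each index point is an independent parcel of work:

! Sequential element-wise addition: one output element per iteration.
DO I = 1, N
   C(I) = A(I) + B(I)
END DO

! Illustrative parallel sketch (assumption: OpenMP): the iterations are
! independent, so the loop can be split across workers in any way.
!$OMP PARALLEL DO
DO I = 1, N
   C(I) = A(I) + B(I)
END DO
!$OMP END PARALLEL DO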


Element-wise Vector Addition

– This has been a particularly easy case to study. The work is spread naturally over all the elements of the vectors, each parcel of work is independent of every other parcel of work, and the amount of work in each parcel is the same.

– Unfortunately, this kind of parallel work seldom appears on its own, but it is so convenient for parallel systems that it has become known as embarrassingly parallel. Luckily, parallelism in this form frequently does appear as a subtask in algorithms with much more complex structure.

– Related examples of this kind of parallel work are scalar multiplication of a vector (or matrix) and general matrix addition (a matrix generalises the vector and is used to model phenomena in two or more dimensions).


Vector Sum

– Next, we look at the reduction of a vector into a scalar by summing its elements – this is a simplified case of the more general vector inner (dot) product. For simplicity, we continue to assume integer-valued elements.

– This reduction is implicit in the matrix-vector multiplication example since it is required to compute each inner product.

– The following diagram shows what needs to be done:

A1  A2  A3  A4  A5  A6  A7  A8  A9  A10  -->  sum(Ai)


Vector Sum

– The standard sequential algorithm for this task is to set the output scalar value to zero, and then add the values of the successive elements of the input vector into this 'running total', one at a time.

SUM = 0
DO I = 1, N
   SUM = SUM + A(I)
END DO

• What scope is there for doing any of this work in parallel?

• What range of options are there?

• How do these affect performance?


Vector Sum

– This example illustrates how parallelism can be found even in tasks whose output is clearly scalar (at least at the level of integers). Because the output is non-parallel, the amount of work that can be done in parallel decreases during the computation.

– The standard way of describing this kind of parallel work is divide-and-conquer. In its purest form, this leads to exponentially decreasing parallelism (a sketch follows below).

– Although it is perhaps the simplest of our examples, the presence of a data write conflict leads to the most difficult problems in implementation, as we shall see later.
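As a sketch of the divide-and-conquer idea (an illustrative assumption only: the vector A is overwritten, and NACT and HALF are hypothetical working variables), note that at each pass the additions are mutually independent, but their number halves, so the available parallelism shrinks as the computation proceeds.

! Pairwise (tree) reduction sketch: each pass folds the top half of the
! active elements onto the bottom half; the additions within a pass are
! independent and could be done in parallel.
NACT = N                       ! number of elements still active
DO WHILE (NACT > 1)
   HALF = (NACT + 1) / 2
   DO I = 1, NACT / 2          ! independent additions in this pass
      A(I) = A(I) + A(I + HALF)
   END DO
   NACT = HALF
END DO
SUM = A(1)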


Matrix-Vector Multiplication

– Now suppose we wish to multiply a vector by a two-dimensional matrix. This is not so straightforward, because the pattern of work is a little more complex.

– For the moment, suppose that the matrix A is dense (i.e. almost all of its elements are non-zero).

– For simplicity, assume the elements of both the matrix and the vector are integers, although generalisation is readily achieved.

b = Ax,

where b and x are n-vectors, and A is an n × n matrix.


Matrix-Vector Multiplication

The following diagram shows what needs to be done:

• How might this work be done in parallel?

• What range of options are there?

• How do these affect performance?

A * x  -->  b


Matrix-Vector Multiplication

Two loop orderings for this problem:

– Row-based algorithm:

DO I = 1, N
   B(I) = 0
   DO J = 1, N
      B(I) = B(I) + A(I,J)*X(J)
   END DO
END DO

– Column-based algorithm:

DO I = 1, N
   B(I) = 0
END DO
DO J = 1, N
   DO I = 1, N
      B(I) = B(I) + A(I,J)*X(J)
   END DO
END DO


Matrix-Vector Multiplication

– In the dense case, with the row-based algorithm, the outer I-loop can be parallelised; the work is split similarly to that in vector addition, i.e. by grouping together the elements of the output vector. The independent operation to be performed for each element of the output vector is to compute the inner product of the appropriate row of the matrix with the input vector. Since all rows of the matrix have the same number of elements, this gives the same amount of work for each element of the output vector (a parallel sketch follows below).

– However, the work at each point is not entirely independent of the work at other points, since the whole input vector is required for the computation of each inner product (and therefore each component of the output vector) – shared reads. This is an important matter at the Program Level, as we shall see later.
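A minimal sketch of this row-based parallelisation, assuming OpenMP-style directives purely for illustration (the abstract point is simply that the I iterations are independent):

! Row-parallel sketch (assumption: OpenMP). Each worker computes whole
! elements of B; every worker reads the whole of X (shared reads).
!$OMP PARALLEL DO PRIVATE(J)
DO I = 1, N
   B(I) = 0
   DO J = 1, N
      B(I) = B(I) + A(I,J)*X(J)
   END DO
END DO
!$OMP END PARALLEL DO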


Matrix-Vector Multiplication

– In the dense case, with the column-based algorithm, the outer J-loop can be parallelised; the work is split similarly to that in the vector sum operation, i.e. by grouping together columns of the array A. This results in a reduction operation for each element of the result vector b – a vector-result reduction operation (a sketch follows below).

– Now there are dependencies amongst the tasks (the output data), but there are no shared reads – no dependencies amongst the input data.
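A minimal sketch of the column-based parallelisation, again assuming OpenMP purely for illustration: because every column's contribution updates every element of B, the shared writes must be made safe – here each individual update is made atomic; replicating B into per-worker partial vectors that are summed at the end would be an alternative.

DO I = 1, N
   B(I) = 0
END DO
! Column-parallel sketch (assumption: OpenMP). Concurrent columns write to
! the same elements of B, so each update is protected (made atomic).
!$OMP PARALLEL DO PRIVATE(I)
DO J = 1, N
   DO I = 1, N
!$OMP ATOMIC
      B(I) = B(I) + A(I,J)*X(J)
   END DO
END DO
!$OMP END PARALLEL DO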


Matrix-Vector Multiplication

– Now let's consider what happens if the input matrix is sparse and structured (i.e. a well-defined, substantial number of its elements have a value of zero). We’ll restrict consideration to the row-based algorithm.

– For example, what happens if the matrix is (upper) triangular?

– A 'smart' sequential algorithm will avoid doing unnecessary work (multiplies by zero) in this case. What are the implications for parallel work?

A (upper triangular) * x  -->  b
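A sketch of such a 'smart' sequential loop for an upper-triangular A (an illustrative assumption: the zero pattern is exploited simply by starting the inner loop at J = I):

! Upper-triangular matrix-vector product: row I only needs columns J >= I,
! so the 'earlier' elements of B involve longer inner products than the later ones.
DO I = 1, N
   B(I) = 0
   DO J = I, N
      B(I) = B(I) + A(I,J)*X(J)
   END DO
END DO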


Matrix-Vector Multiplication

Two effects emerge.

• Firstly, the amount of work per output vector element becomes different, but predictable. More work is needed to compute the 'earlier' elements. This can lead to an 'unbalanced' workload in the parallel realisation.

• Secondly, the dependence between the computations of the output vector elements changes, since different parts of the input vector are required for the different-length inner product calculations (the number of shared reads varies).

Overall, this example shows how data read conflicts can affect achievable performance if unwise implementation options are chosen.


Matrix-Matrix Multiplication

– Triply-nested loop

DO I = 1, N
   DO J = 1, N
      C(I,J) = 0
      DO K = 1, N
         C(I,J) = C(I,J) + A(I,K)*B(K,J)
      END DO
   END DO
END DO

C = AB,

where A, B and C are N × N matrices.


Matrix-Matrix Multiplication

– One opportunity for parallelism is based on the observation that the computations of disjoint blocks of the result matrix are independent, although they will depend on (some of) the same data – lots of parallelism, but also lots of shared reads.

– Could partition the result matrix into either

• (block) columns;

• (block) rows;

• blocks.
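As an illustrative sketch of the (block-)row option, assuming OpenMP purely for the sake of the example: the rows of C are computed independently, but every worker reads the whole of B – lots of parallelism, lots of shared reads, no shared writes.

! Row-partitioned matrix-matrix multiplication sketch (assumption: OpenMP).
!$OMP PARALLEL DO PRIVATE(J,K)
DO I = 1, N
   DO J = 1, N
      C(I,J) = 0
      DO K = 1, N
         C(I,J) = C(I,J) + A(I,K)*B(K,J)
      END DO
   END DO
END DO
!$OMP END PARALLEL DO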


Algorithmic Core: Summary

– Parallel algorithms as a whole (i.e. including task-parallelism) boil down to one or more of the following three categories:

• Complete independence across the data elements (no sharing); embarrassingly parallel.

• Shared reads on abstract data elements; implement either by replicating the shared data (then we have independence and it becomes easy!); or by arranging for non-contending memory access (not always easy to achieve).

• Shared writes to data elements; in some special cases, we may be able to replicate the shared data (to an extent, but never completely); in the general case, the data must be protected (e.g. using locks) so that access to it is mutually exclusive.
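As an illustrative sketch of the shared-write category, reusing the vector sum from earlier and assuming OpenMP purely for the example: the first version protects the single shared total so that updates are mutually exclusive; the second replicates it, giving each worker a private partial sum that is combined at the end.

! Protected shared write: every update to the shared SUM is made atomic.
SUM = 0
!$OMP PARALLEL DO
DO I = 1, N
!$OMP ATOMIC
   SUM = SUM + A(I)
END DO
!$OMP END PARALLEL DO

! Replicated shared data: each worker keeps a private partial sum
! (the REDUCTION clause), and the partial sums are added at the end.
SUM = 0
!$OMP PARALLEL DO REDUCTION(+:SUM)
DO I = 1, N
   SUM = SUM + A(I)
END DO
!$OMP END PARALLEL DO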


Recap

– At Specification Level, a mathematical model of the application is developed; at Algorithm Level, this specification is converted into an appropriate algorithm. Abstract parallelism emerges at both Levels, in both task-parallel and data-parallel forms.

– An algorithm is an abstract procedure for solving (an approximation to) the problem at hand; it is based on a discrete data domain that represents (an approximation to) the data domain of the specification. In HPC simulations, where the data domain of the specification is often continuous, it is necessary to develop a 'point-wise' discretisation for the algorithm to work on. Normally, parallelism is then exploited across the elements of the discretised data domain.

– The resulting abstract data-parallelism appears in three forms: independent, shared reads and shared writes (in increasing order of difficulty to implement).

