Page 1: CDA 4150 Lecture 4

CDA 4150 Lecture 4

Vector Processing

CRAY-like machines

Page 2: CDA 4150 Lecture 4

[Figure: three tasks a, b, and c run one after another in sequential execution, and side by side in parallel execution.]

Amdahl’s Law

- $P$ = number of processors
- $S$ = speedup
- $T_S$ = time spent in sequential processing
- $T_P$ = time spent in parallel processing

Page 3: CDA 4150 Lecture 4

Amdahl’s Law (cont.)

Let $s = \dfrac{T_S}{T_S + T_P}$ be the sequential fraction of the work and $p = \dfrac{T_P}{T_S + T_P}$ the parallel fraction, so that $s + p = 1$. With $P$ processors the parallel part is sped up by a factor of $P$:

$$S = \frac{T_S + T_P}{T_S + \dfrac{T_P}{P}} = \frac{1}{s + \dfrac{p}{P}} = \frac{1}{(1 - p) + \dfrac{p}{P}}$$

and in the limit of infinitely many processors

$$\lim_{P \to \infty} S = \frac{1}{1 - p} = \frac{1}{s}$$
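The limit above is easy to check numerically. Below is a minimal C sketch (not from the slides; the parallel fraction p = 0.95 is an arbitrary illustrative value) that evaluates S = 1 / ((1 - p) + p / P) for a few processor counts.

#include <stdio.h>

/* Amdahl's Law: speedup obtained with parallel fraction p on P processors. */
static double amdahl_speedup(double p, double P)
{
    return 1.0 / ((1.0 - p) + p / P);
}

int main(void)
{
    const double p = 0.95;                      /* illustrative parallel fraction */
    const int procs[] = { 1, 4, 16, 64, 1024 };
    const int count = sizeof procs / sizeof procs[0];

    for (int i = 0; i < count; i++)
        printf("P = %4d  ->  S = %6.2f\n", procs[i], amdahl_speedup(p, procs[i]));

    /* As P grows, S approaches 1 / (1 - p) = 20 for p = 0.95. */
    return 0;
}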

Page 4: CDA 4150 Lecture 4

Amdahl’s Law (revisited)

Treating the parallel fraction $p$ as a function of the problem size $n$, where $\lim_{n \to \infty} p(n) = 1$, the speedup on $P$ processors becomes

$$\lim_{n \to \infty} \mathrm{Speedup}_P = \lim_{n \to \infty} \frac{1}{(1 - p(n)) + \dfrac{p(n)}{P}} = \frac{1}{(1 - 1) + \dfrac{1}{P}} = P$$

so as the problem grows, the achievable speedup approaches the number of processors.

Page 5: CDA 4150 Lecture 4

An extension of Amdahl’s Law in terms of a matrix multiplication equation (AX = Y).

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}$$

$$\begin{aligned} y_1 &= a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + a_{14}x_4 \\ y_2 &= a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4 \\ y_3 &= a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4 \\ y_4 &= a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4 \end{aligned}$$

Page 6: CDA 4150 Lecture 4

Compute each vector element in parallel by partitioning:

[Figure: rows A1, A2, A3, A4 of the matrix are each assigned to their own CPU; every CPU multiplies its row by X, producing one element of Y (y1, y2, y3, y4) in parallel.]
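As a concrete illustration of the partitioning in the figure, here is a minimal C sketch (the names and the 4 x 4 size are just for illustration) that computes y = Ax one row at a time; because the outer loop carries no dependences, each row could be handed to a separate CPU exactly as the figure shows.

#include <stdio.h>

#define N 4

/* y = A * x; each row i is an independent partition that one CPU could own. */
static void matvec(const double A[N][N], const double x[N], double y[N])
{
    for (int i = 0; i < N; i++) {       /* one partition (row) per CPU */
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }
}

int main(void)
{
    const double A[N][N] = { {  1,  2,  3,  4 },
                             {  5,  6,  7,  8 },
                             {  9, 10, 11, 12 },
                             { 13, 14, 15, 16 } };
    const double x[N] = { 1.0, 1.0, 1.0, 1.0 };
    double y[N];

    matvec(A, x, y);
    for (int i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}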

Page 7: CDA 4150 Lecture 4

R. M. Russell, “The CRAY-1 Computer System”, CACM, vol. 21, pp. 63-72, 1978.

Introduces the CRAY-1 as a vector processing architecture.

Page 8: CDA 4150 Lecture 4

CRAY-1 Functional Units

[Figure: vector registers V0, V1, V2, ..., Vn (64 elements each) feed pipelined functional units (FP ADD/SUB, FP MULT, LOGICAL) alongside the scalar registers; all are backed by main memory with 16-way interleaving.]

Page 9: CDA 4150 Lecture 4

F0 is a floating point (scalar) operand. NOTE: each vector register (Vn) holds 64 floating point numbers.

Instruction                 Operands       Function
ADDV                        V1, V2, V3     V1 ← V2 + V3
ADDSV (add scalar vector)   V1, F0, V2     V1 ← V2 + F0
MULTV                       V1, V2, V3     V1 ← V2 * V3
LV (load vector)            V1, R1         Load V1 from memory starting at address [R1]
SV (store vector)           R1, V1         Store V1 into memory starting at location [R1]
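To make the table concrete, the following C sketch spells out the element-by-element semantics of these instructions for 64-element vector registers (the 64-element length comes from the CRAY-1 slides; the array-based register model and the double type are assumptions for illustration).

#define VLEN 64                     /* each CRAY-1 vector register holds 64 elements */

typedef double vreg[VLEN];          /* a vector register modelled as an array */

/* ADDV  V1, V2, V3 :  V1 <- V2 + V3, element-wise */
void addv(vreg v1, const vreg v2, const vreg v3)
{
    for (int i = 0; i < VLEN; i++) v1[i] = v2[i] + v3[i];
}

/* ADDSV V1, F0, V2 :  V1 <- V2 + F0, the scalar added to every element */
void addsv(vreg v1, double f0, const vreg v2)
{
    for (int i = 0; i < VLEN; i++) v1[i] = v2[i] + f0;
}

/* MULTV V1, V2, V3 :  V1 <- V2 * V3, element-wise */
void multv(vreg v1, const vreg v2, const vreg v3)
{
    for (int i = 0; i < VLEN; i++) v1[i] = v2[i] * v3[i];
}

/* LV V1, R1 :  load 64 elements from memory starting at address [R1] */
void lv(vreg v1, const double *r1)
{
    for (int i = 0; i < VLEN; i++) v1[i] = r1[i];
}

/* SV R1, V1 :  store 64 elements to memory starting at address [R1] */
void sv(double *r1, const vreg v1)
{
    for (int i = 0; i < VLEN; i++) r1[i] = v1[i];
}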

Page 10: CDA 4150 Lecture 4

A pipelined machine can initiate several instructions within one clock tick, which are then executed in parallel.

Timing: related concepts are
- Convoys
- Chimes

Page 11: CDA 4150 Lecture 4

Convoy: the set of vector instructions that could potentially begin execution together in one clock period.

Example:

LV      V1, RX        Load vector X
MULTSV  V2, F0, V1    Vector scalar multiplication
LV      V3, RY        Load vector Y
ADDV    V4, V2, V3    Add
SV      RY, V4        Store results

Page 12: CDA 4150 Lecture 4

Convoy. Note: MULTSV V2, F0, V1 || LV V3, RY is an example of a convoy, where two independent instructions are initiated within the same chime.

LV      V1, RX        Load vector X
MULTSV  V2, F0, V1    Vector scalar multiplication
LV      V3, RY        Load vector Y
ADDV    V4, V2, V3    Add
SV      RY, V4        Store results

Page 13: CDA 4150 Lecture 4

Chime

Not a specific amount of time, but rather a timing concept representing the number of clock periods required to complete a vector operation.

A CRAY-1 chime is 64 clock periods.

Note: a CRAY-1 clock cycle takes 12.5 ns, so 5 chimes would take 5 * 64 * 12.5 ns = 4000 ns.

Page 14: CDA 4150 Lecture 4

Chime – Example #1

How many chimes will the following vector sequence take?

LV      V1, RX        Load vector X
MULTSV  V2, F0, V1    Vector scalar multiplication
LV      V3, RY        Load vector Y
ADDV    V4, V2, V3    Add
SV      RY, V4        Store result

Page 15: CDA 4150 Lecture 4

Chime – Example #1 ANSWER: 4 chimes

1st chime: LV V1, RX
2nd chime: MULTSV V2, F0, V1 || LV V3, RY
3rd chime: ADDV V4, V2, V3
4th chime: SV RY, V4

Note: MULTSV V2, F0, V1 || LV V3, RY is an example of a convoy, where two independent instructions are initiated within the same chime.

Page 16: CDA 4150 Lecture 4

Chime – Example #2

CRAY-1:

FOR I ← 1 to 64
    A[I] = 3.0 * A[I] + (2.0 + B[I]) * C[I]

To execute this:

1st chime: V0 ← A
2nd chime: V1 ← B        V3 ← 2.0 + V1        V4 ← 3.0 * V0
3rd chime: V5 ← C        V6 ← V3 * V5         V7 ← V4 + V6
4th chime: A ← V7

Operations that use array values can be initiated immediately after those values have been loaded into the vector registers.

Page 17: CDA 4150 Lecture 4

Chaining: dynamically building a larger pipeline by increasing the number of stages.

[Figure: V0 and V1 feed the first functional-unit pipeline; its result (V3), together with V4, feeds the second (the + and * units), producing V5. The 3-stage and 7-stage pipelines are chained into one longer pipeline.]

Page 18: CDA 4150 Lecture 4

Chaining – Example #1

FOR J ← 1 to 64
    C[J] ← A[J] + B[J]
    D[J] ← F[J] * E[J]
END

[Figure: two separate pipelines, each streaming elements 1..64: V0 (A) and V1 (B) feed the + unit, producing V3 which is stored to C; V2 (E) and V4 (F) feed the * unit, producing V5 which is stored to D.]

No chaining here: these operations are independent!

Page 19: CDA 4150 Lecture 4

Chaining – Example #2

FOR J ← 1 to 64
    C[J] ← A[J] * B[J]
    D[J] ← C[J] * E[J]
END

[Figure: V0 (A) and V1 (B) feed the first functional unit, producing C; that result chains directly with E into the second functional unit, producing D, so elements 1..64 stream through one longer pipeline.]

Page 20: CDA 4150 Lecture 4

Latency

[Figure: V0 (A) and V1 (B) are sent to the + functional unit; the result goes to V2 and is stored to C.]

Moving V0 and V1 to the functional unit takes 1 time unit, the ADD itself takes 6 units, and moving the result to V2 takes 1 unit: it takes 1 + 6 + 1 = 8 units to get the result.

Page 21: CDA 4150 Lecture 4

More Chaining and Storing Matrices

Thanks to Dusty Price

Page 22: CDA 4150 Lecture 4

Sequential Approach…

[Figure: v2 ← v0 + v1 using the add unit (time 1 + 6 + 1 = 8), then v4 ← v2 * v3 using the multiply unit (time 1 + 7 + 1 = 9).]

64 elements in sequence: Ts = 64 * (8 + 9) = 1088

Page 23: CDA 4150 Lecture 4

Using Pipeline Approach…

[Figure: the add pipeline (v0 + v1 → v3) and the multiply pipeline (v3 * v4 → v5) run one after the other.]

Using pipelining, it takes 8 units of time to fill the add pipeline and produce the first result; each unit of time after that produces another result:

Tp+ = 8 + 63

The multiplication pipeline takes 9 units of time to fill, and produces another result after each additional unit of time:

Tp* = 9 + 63

The combination of the two: Tp = Tp+ + Tp* = 8 + 63 + 9 + 63 = 143

Page 24: CDA 4150 Lecture 4

Pipeline plus Chaining…

[Figure: the add pipeline (1 + 6 + 1 = 8) and the multiply pipeline (1 + 7 + 1 = 9) are chained into a single pipeline over v0..v4.]

Using the chaining technique, we now have one pipeline. This new pipeline takes 17 units of time to fill, and produces another result after each unit of time:

Tc = 17 + 63 = 80

Page 25: CDA 4150 Lecture 4

Review of time differences in the three approaches…

Sequential: Ts = 17 * 64 = 1088

Pipelining: Tp = 8 + 63 + 9 + 63 = 143

Chaining: Tc = 17 + 63 = 80
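A minimal C sketch of the same arithmetic, using the 8-unit add latency, 9-unit multiply latency, and 64 elements from the preceding slides:

#include <stdio.h>

int main(void)
{
    const int n     = 64;   /* vector length                                    */
    const int t_add = 8;    /* 1 + 6 + 1: operands in, 6-stage add, result out  */
    const int t_mul = 9;    /* 1 + 7 + 1: operands in, 7-stage mult, result out */

    int t_seq   = n * (t_add + t_mul);               /* every element done serially          */
    int t_pipe  = (t_add + n - 1) + (t_mul + n - 1); /* fill each pipe, then one result/tick */
    int t_chain = (t_add + t_mul) + (n - 1);         /* one chained 17-stage pipeline        */

    printf("Sequential: %d\n", t_seq);    /* 1088 */
    printf("Pipelining: %d\n", t_pipe);   /*  143 */
    printf("Chaining:   %d\n", t_chain);  /*   80 */
    return 0;
}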

Page 26: CDA 4150 Lecture 4

Storing Matrices for Parallel Access (Memory Interleaving)

4 memory modules, one row of the 4 x 4 matrix per module:

M1: A11  A12  A13  A14
M2: A21  A22  A23  A24
M3: A31  A32  A33  A34
M4: A41  A42  A43  A44

One column of the matrix can be accessed in parallel, since its four elements lie in four different modules.

Page 27: CDA 4150 Lecture 4

Storing the Matrix by Column…

4 memory modules, one column of the matrix per module:

M1: A11  A21  A31  A41
M2: A12  A22  A32  A42
M3: A13  A23  A33  A43
M4: A14  A24  A34  A44

One row can be accessed in parallel with this storage technique.

Page 28: CDA 4150 Lecture 4

Sometimes we need to access both rows and columns fast…

4 memory modules, matrix stored skewed:

M1: A11  A24  A33  A42
M2: A12  A21  A34  A43
M3: A13  A22  A31  A44
M4: A14  A23  A32  A41

By using a skewed matrix representation, we can now access each row and each column in parallel: the four elements of any row, and of any column, fall in four different modules.
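The skewed layout can be captured by a simple address-mapping function. The sketch below uses module(i, j) = (i + j) mod M with M = 4 modules and 0-based indices, which reproduces the table above; the exact mapping used on the slide is an assumption. Running it also shows why the main diagonal still collides, motivating the fifth module on the next slide.

#include <stdio.h>

#define N 4     /* 4 x 4 matrix     */
#define M 4     /* 4 memory modules */

/* Skewed storage: element (i, j) lives in module (i + j) mod M, so every row
 * and every column of the matrix touches each module exactly once. */
static int module_of(int i, int j)
{
    return (i + j) % M;
}

int main(void)
{
    printf("Row 0:    ");
    for (int j = 0; j < N; j++) printf("M%d ", module_of(0, j) + 1);

    printf("\nColumn 0: ");
    for (int i = 0; i < N; i++) printf("M%d ", module_of(i, 0) + 1);

    printf("\nDiagonal: ");
    for (int i = 0; i < N; i++) printf("M%d ", module_of(i, i) + 1);  /* M1 M3 M1 M3: conflict */

    printf("\n");
    return 0;
}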

Page 29: CDA 4150 Lecture 4

Sometimes we need access to the main diagonal as well as rows and columns…

[Figure: the 4 x 4 matrix stored skewed across 5 memory modules M1..M5, so that the elements of each row, each column, and the main diagonal all fall in different modules; some module locations are left unused.]

At the cost of adding another memory module and wasted space, we can now access the matrix in parallel by row, column, and main diagonal.

Page 30: CDA 4150 Lecture 4

Program Transformation

Before (data dependency on X):

FOR I ← 1 TO n do
    X ← A[I] + B[I]
    ...
    Y[I] ← 2 * X
    ...
    X ← C[I] / D[I]
    ...
    P ← X + 2
ENDFOR

After (renaming the second X to XX removes the data dependency):

FOR I ← 1 TO n do
    X ← A[I] + B[I]
    ...
    Y[I] ← 2 * X
    ...
    XX ← C[I] / D[I]
    ...
    P ← XX + 2
ENDFOR
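The same renaming can be written out in C. The function below is only a sketch (the array names and the returned P are illustrative); giving the second temporary its own name xx removes the dependence on x, so the two halves of the loop body no longer have to execute in order.

/* Renaming X to XX for the second computation removes the data dependency,
 * letting the divide/add pair overlap with the add/multiply pair. */
double transformed(int n, const double *a, const double *b,
                   const double *c, const double *d, double *y)
{
    double p = 0.0;
    for (int i = 0; i < n; i++) {
        double x = a[i] + b[i];
        /* ... */
        y[i] = 2.0 * x;
        /* ... */
        double xx = c[i] / d[i];   /* renamed: no longer reuses x */
        /* ... */
        p = xx + 2.0;
    }
    return p;
}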

Page 31: CDA 4150 Lecture 4

Scalar Expansion

Before (data dependency through the scalar X):

FOR I ← 1 TO n do
    X ← A[I] + B[I]
    ...
    Y[I] ← 2 * X
    ...
ENDFOR

After (expanding the scalar X into an array X[I] removes the data dependency):

FOR I ← 1 TO n do
    X[I] ← A[I] + B[I]
    ...
    Y[I] ← 2 * X[I]
    ...
ENDFOR

Page 32: CDA 4150 Lecture 4

Loop Unrolling

FOR I ← 1 TO n do
    X[I] ← A[I] * B[I]
ENDFOR

becomes

X[1] ← A[1] * B[1]
X[2] ← A[2] * B[2]
...
X[n] ← A[n] * B[n]
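Fully unrolling as above only works when n is small and fixed. A common variant is partial unrolling; the C sketch below (unrolling by an illustrative factor of 4) exposes four independent multiplies per iteration while keeping the code size bounded.

/* Partial loop unrolling by 4: same result as
 *     for (i = 0; i < n; i++) x[i] = a[i] * b[i];
 * but with four independent multiplies visible in each iteration. */
void unroll4(double *x, const double *a, const double *b, int n)
{
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        x[i]     = a[i]     * b[i];
        x[i + 1] = a[i + 1] * b[i + 1];
        x[i + 2] = a[i + 2] * b[i + 2];
        x[i + 3] = a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)              /* clean-up loop for the remaining elements */
        x[i] = a[i] * b[i];
}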

Page 33: CDA 4150 Lecture 4

Loop Fusion or Jamming

Two separate loops:

FOR I ← 1 TO n do
    X[I] ← Y[I] * Z[I]
ENDFOR
FOR I ← 1 TO n do
    M[I] ← P[I] + X[I]
ENDFOR

can be fused into one:

a) FOR I ← 1 TO n do
       X[I] ← Y[I] * Z[I]
       M[I] ← P[I] + X[I]
   ENDFOR

b) FOR I ← 1 TO n do
       M[I] ← P[I] + Y[I] * Z[I]
   ENDFOR

