Date post: 05-Jan-2016
Page 1: Eldorado John Feo Cray Inc. 2 Outline  Why multithreaded architectures  The Cray Eldorado  Programming environment  Program examples.

Eldorado

John Feo

Cray Inc

Outline

Why multithreaded architectures

The Cray Eldorado

Programming environment

Program examples

Overview

Eldorado is a peak in the North Cascades.

Internal Cray project name for the next MTA system.

Make the MTA-2 cost-effective and supportable.

Retain proven performance and programmability advantages of the MTA-2.

Eldorado

Hardware based on Red Storm.

Programming model and software based on MTA-2.

~1970s generic computer

[Diagram: CPU and Memory, with instructions flowing between them]

Memory keeps pace with CPU

Everything is in balance…

… but balance is short-lived

Processors have gotten much faster (~1000x)

Memories have gotten a little faster (~10x) and much larger (~1,000,000x)

System diameters have gotten much larger (~1000x)

Flops are free, but bandwidth is very expensive

Systems are no longer balanced; processors are starved for data

20th Century parallel computer

Avoid the straws and there are no flaws; otherwise, good luck to you

[Diagram: three CPUs, each with its own cache, connected to a shared memory]

Constraints on parallel programs

Place data near computation

Avoid modifying shared data

Access data in order and reuse

Avoid indirection and linked data-structures

Partition program into independent, balanced computations

Avoid adaptive and dynamic computations

Avoid synchronization and minimize inter-process communications

A tiny solution space

Multithreading

Hide latencies via parallelism

Maintain multiple active threads per processor, so that gaps introduced by long latency operations in one thread are filled by instructions in other threads

Requires special hardware support

Multiple threads

Single-cycle context switch

Multiple outstanding memory requests per thread

Fine-grain synchronization
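
A toy model can make the latency-hiding argument concrete. The sketch below assumes that each thread issues one instruction and then waits a fixed number of cycles for memory, and that the processor can issue from any ready thread on every cycle. The function name and the numbers in the usage note are illustrative assumptions, not MTA or Eldorado parameters.

```c
#include <assert.h>

/* Toy model (an assumption, not a hardware specification): each thread
 * issues one instruction, then waits `latency` cycles for memory.
 * With `threads` such threads interleaved, the fraction of cycles in
 * which the processor issues an instruction is threads / (1 + latency),
 * capped at 1.0 once there are enough threads to fill every gap. */
double issue_utilization(int threads, int latency)
{
    assert(threads >= 0 && latency >= 0);
    double u = (double)threads / (double)(1 + latency);
    return u > 1.0 ? 1.0 : u;
}
```

Under this model, with a hypothetical 99-cycle latency one thread keeps the processor busy 1% of the time, 50 threads keep it half busy, and 100 or more threads saturate it.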

The generic transaction

Allocate memory

Write/read memory

Deallocate memory

Execution time line – 1 thread

Execution time line – 2 threads

Execution time line – 3 threads

Transactions per second on MTA

Threads   P1           P2           P4
1          1,490.55
10        14,438.27
20        26,209.85
30        31,300.51
40        32,629.65
50        32,896.10    65,890.24    123,261.82

Eldorado Processor (logical view)

[Figure, layers from top to bottom:
  Programs running in parallel (programs 1 to 4; serial code; subproblems A and B, each with loop iterations i = 1 … n)
  Concurrent threads of computation
  Hardware streams (128), some of them unused
  Instruction ready pool
  Pipeline of executing instructions]

Eldorado System (logical view)

[Figure, layers from top to bottom:
  Programs running in parallel (programs 1 to 4; serial code; subproblems A and B, each with loop iterations i = 1 … n)
  Concurrent threads of computation
  Multithreaded across multiple processors]

What is not important

Placing data near computation

Modifying shared data

Accessing data in order

Using indirection or linked data-structures

Partitioning program into independent, balanced computations

Using adaptive or dynamic computations

Minimizing synchronization operations

What is important

Eldorado system goals

Make the Cray MTA-2 cost-effective and supportable
  Mainstream Cray manufacturing
  Mainstream Cray service and support

Eldorado is a Red Storm with MTA-2 processors

Eldorado leverages Red Storm technology
  Red Storm cabinet, cooling, power distribution
  Red Storm circuit boards and system interconnect
  Red Storm RAS system and I/O system (almost)
  Red Storm manufacturing, service, and support teams

Eldorado retains key MTA technology
  Instruction set
  Operating system
  Programming model and environment

Red Storm

Red Storm consists of over 10,000 AMD Opteron™ processors connected by Seastar, an innovative high-speed, high-bandwidth 3D mesh interconnect designed by Cray

Cray is responsible for the design, development, and delivery of the Red Storm system, which supports the Department of Energy's nuclear stockpile stewardship program with advanced 3D modeling and simulation

Red Storm uses a distributed memory programming model (MPI)

Red Storm Compute Board

[Board photo labels: 4 DIMM slots; four Cray Seastar™ chips; L0 RAS computer; redundant VRMs]

Eldorado Compute Board

[Board photo labels: 4 DIMM slots; four Cray Seastar2™ chips; L0 RAS computer; redundant VRMs]

Eldorado system architecture

Compute partition
  MTX (BSD)

Service partition (service and I/O)
  Linux OS
  Specialized Linux nodes
    Login PEs
    IO Server PEs
    Network Server PEs
    FS Metadata Server PEs
    Database Server PEs

[Diagram labels for external connections: PCI-X, 10 GigE, Network; PCI-X, Fiber Channel, RAID Controllers]

Eldorado CPU

Speeds and Feeds

CPU ASIC

140M memory ops

500M memory ops

1.5 GFlops

500M memory ops

100M memory ops

90M to 30M memory ops (1 to 4K processors)

16 GB DDR DRAM

Sustained memory rates are for random single word accesses over entire address space.

Eldorado memory

Shared memory
  Some memory can be reserved as local memory at boot time
  Only compiler and runtime system have access to local memory

Memory module cache
  Decreases latency and increases bandwidth
  No coherency issues

8 word data segments randomly distributed across the memory system
  Eliminates stride sensitivity and hotspots
  Makes programming for data locality impossible
  Segment moves to cache, but only word moves to processor

Full/empty bits on all data words

[Figure: memory word layout: data value in bits 63 to 0, plus tag bits: forward, trap 1, trap 2, full-empty]
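
The full/empty bit lets a single word act as a one-slot producer/consumer channel. The sketch below is a single-threaded emulation of that idea in portable C, offered only as an illustration: the type `fe_word` and the functions `try_write_ef` and `try_read_fe` are hypothetical names, not the MTA intrinsics, and a `false` return stands in for the point where the hardware would make the thread wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Emulated memory word: a 64-bit value plus a full/empty tag bit. */
typedef struct {
    uint64_t value;
    bool     full;
} fe_word;

/* Write-when-empty: succeeds only if the word is empty, then marks it full.
 * Returning false stands in for the hardware making the thread wait. */
bool try_write_ef(fe_word *w, uint64_t v)
{
    assert(w);
    if (w->full) return false;
    w->value = v;
    w->full = true;
    return true;
}

/* Read-when-full: succeeds only if the word is full, then marks it empty. */
bool try_read_fe(fe_word *w, uint64_t *out)
{
    assert(w && out);
    if (!w->full) return false;
    *out = w->value;
    w->full = false;
    return true;
}
```

A producer that calls `try_write_ef` twice in a row sees the second call fail until a consumer has drained the word with `try_read_fe`, which is the synchronization pattern the tag bit provides per word.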

MTA-2 / Eldorado Comparisons

                              MTA-2                    Eldorado
CPU clock speed               220 MHz                  500 MHz
Max system size               256 P                    8192 P
Max memory capacity           1 TB (4 GB/P)            128 TB (16 GB/P)
TLB reach                     128 GB                   128 TB
Network topology              Modified Cayley graph    3D torus
Network bisection bandwidth   3.5 * P GB/s             15.36 * P^(2/3) GB/s
Network injection rate        220 MW/s per processor   Variable (next slide)
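
As a worked instance of the two bisection-bandwidth formulas in the comparison, take P = 256 for the MTA-2 (its maximum size) and, purely as an illustrative assumption, an 8 x 8 x 8 Eldorado torus with P = 512:

```latex
\begin{aligned}
B_{\text{MTA-2}}(P) &= 3.5\,P \ \text{GB/s}, &
B_{\text{MTA-2}}(256) &= 3.5 \times 256 = 896 \ \text{GB/s} \\
B_{\text{Eldorado}}(P) &= 15.36\,P^{2/3} \ \text{GB/s}, &
B_{\text{Eldorado}}(512) &= 15.36 \times 64 = 983.04 \ \text{GB/s}
\end{aligned}
```

Per processor this is 3.5 GB/s on the MTA-2 against 983.04 / 512 = 1.92 GB/s on the 512-processor torus, and the Eldorado figure keeps falling as P grows because the bisection of a 3D torus scales with its cross-section, P^(2/3), rather than with P.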

Eldorado Scaling

Example topology                       6x12x8   11x12x8   11x12x16   22x12x16   14x24x24
Processors                             576      1056      2112       4224       8064
Memory capacity (TB)                   9        16.5      33         66         126
Sustainable remote memory reference
  rate, per processor (MW/s)           60       60        45         33         30
Sustainable remote memory reference
  rate, aggregate (GW/s)               34.6     63.4      95.0       139.4      241.9
Relative size                          1.0      1.8       3.7        7.3        14.0
Relative performance                   1.0      1.8       2.8        4.0        7.0
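
The derived rows of the scaling table follow from the topology and the per-processor rate. The helpers below are a small sketch of that arithmetic; the function names are hypothetical, and the 16 GB per processor figure is taken from the MTA-2 / Eldorado comparison.

```c
#include <assert.h>

/* Number of processors in an x-by-y-by-z torus. */
int torus_processors(int x, int y, int z)
{
    assert(x > 0 && y > 0 && z > 0);
    return x * y * z;
}

/* Memory capacity in TB at 16 GB per processor (1 TB = 1024 GB). */
double memory_tb(int processors)
{
    return processors * 16.0 / 1024.0;
}

/* Aggregate remote reference rate in GW/s, given the per-processor
 * rate in MW/s. */
double aggregate_gw(int processors, double per_proc_mw)
{
    return processors * per_proc_mw / 1000.0;
}
```

For the 6x12x8 system this gives 576 processors, 9 TB, and 34.56 GW/s. Dividing the aggregate rate of a larger system by that value reproduces the "relative performance" row, for example 241.92 / 34.56 = 7.0 for 14x24x24, even though that system is 14 times larger.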

Software

Operating Systems
  Linux on service and I/O nodes
  MTX on compute nodes
  Syscall offload for I/O

Run-Time System
  Job launch
  Node allocator
  Logarithmic loader
  Batch system: PBS Pro

Programming Environment
  Cray MTA compiler: Fortran, C, C++
  Debugger: mdb
  Performance tools: canal, traceview

High Performance File Systems
  Lustre

System Mgmt and Admin
  Accounting
  Red Storm Management System
  RSMS Graphical User Interface

The compute nodes run MTX, a multithreaded Unix operating system

The service nodes run Linux

I/O service calls are offloaded to the service nodes

The programming environment runs on the service nodes

CANAL

Compiler ANALysis

Static tool

Shows how the code is compiled and why

Traceview

Dashboard

What is Eldorado’s sweet spot?

Any cache-unfriendly parallel application is likely to run faster on Eldorado than on a cache-based system

Any application whose performance depends upon ...
  Random access tables (GUPS, hash tables)
  Linked data structures (binary trees, relational graphs)
  Highly unstructured, sparse methods
  Sorting

Some candidate application areas:
  Adaptive meshes
  Graph problems (intelligence, protein folding, bioinformatics)
  Optimization problems (branch-and-bound, linear programming)
  Computational geometry (graphics, scene recognition and tracking)

Sparse Matrix – Vector Multiply

$C_{n \times 1} = A_{n \times m} \, B_{m \times 1}$

Store A in packed row form
  A[nz], where nz is the number of non-zeros
  cols[nz] stores the column index of the non-zeros
  rows[n+1] stores the start index of each row in A, with a final entry equal to nz (the loop reads rows[i+1])

#pragma mta use 100 streams
#pragma mta assert no dependence
for (i = 0; i < n; i++) {
  int j;
  double sum = 0.0;
  for (j = rows[i]; j < rows[i+1]; j++)
    sum += A[j] * B[cols[j]];
  C[i] = sum;
}
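
For readers without an MTA compiler, the same loop can be exercised as an ordinary C function. This is a sketch under stated assumptions: the function name `spmv` and its parameter list are hypothetical, the MTA pragmas are omitted because a standard compiler does not interpret them, and `rows` is assumed to hold n + 1 entries so that `rows[i + 1]` is valid for the last row.

```c
#include <assert.h>

/* C = A * B for a sparse n-row matrix A stored in packed-row form.
 * A[k] is the k-th non-zero, cols[k] its column index, and
 * rows[i] .. rows[i + 1] - 1 is the range of non-zeros in row i,
 * so rows must have n + 1 entries. */
void spmv(int n, const int *rows, const int *cols,
          const double *A, const double *B, double *C)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = rows[i]; j < rows[i + 1]; j++)
            sum += A[j] * B[cols[j]];   /* indirect access into B */
        C[i] = sum;
    }
}
```

The iterations of the outer loop write to distinct elements of C and only read A, B, rows, and cols, which is the independence that the "assert no dependence" pragma declares to the compiler.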

Canal report

         | #pragma mta use 100 streams
         | #pragma mta assert no dependence
         | for (i = 0; i < n; i++) {
         |   int j;
  3 P    |   double sum = 0.0;
  4 P-   |   for (j = rows[i]; j < rows[i+1]; j++)
         |     sum += A[j] * B[cols[j]];
  3 P    |   C[i] = sum;
         | }

Parallel region 2 in SpMVM
  Multiple processor implementation
  Requesting at least 100 streams

Loop 3 in SpMVM at line 33 in region 2
  In parallel phase 1
  Dynamically scheduled

Loop 4 in SpMVM at line 34 in loop 3
  Loop summary: 3 memory operations, 2 floating point operations
  3 instructions, needs 30 streams for full utilization, pipelined

Performance

N = M = 1,000,000

Non-zeros: 0 to 1000 per row, uniform distribution; Nz = 499,902,410

P    T (s)   Sp
1    7.11    1.0
2    3.59    1.98
4    1.83    3.88
8    0.94    7.56

Time = (3 cycles * 499902410 iterations) / 220000000 cycles/sec = 6.82 sec

96% utilization

IBM Power4 1.7 GHz: 26.1 s
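
The 6.82 s estimate and the 96% utilization figure can be reproduced with a few lines of arithmetic. The helper names below are hypothetical; the inputs are the values reported on the slide and in the Canal report (3 instructions per inner-loop iteration, a 220 MHz clock, and a measured single-processor time of 7.11 s).

```c
#include <assert.h>

/* Lower bound on run time in seconds when the inner loop issues
 * `instr_per_iter` instructions per iteration, one per clock cycle. */
double ideal_seconds(double iterations, double instr_per_iter, double clock_hz)
{
    assert(clock_hz > 0.0);
    return iterations * instr_per_iter / clock_hz;
}

/* Fraction of issue slots doing useful work: ideal time over measured time. */
double utilization(double ideal_s, double measured_s)
{
    assert(measured_s > 0.0);
    return ideal_s / measured_s;
}
```

With 499,902,410 iterations, `ideal_seconds` returns about 6.82 s, and dividing by the measured 7.11 s gives a utilization of about 0.96.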

Integer sort

Any parallel sort works well on Eldorado

Bucket sort is best if universe is large and elements don’t cluster in just a few buckets

Bucket Sort

Count the number of elements in each bucket

Calculate the start of each bucket in dst

Copy elements from src to dst, placing each element in the correct segment

Sort the elements in each segment

Step 1

for (i = 0; i < n; i++) count[src[i] >> shift] += 1;

Compiler automatically parallelizes the loop using an int_fetch_add instruction to increment count

[Figure: elements of src mapped to buckets, with a count per bucket]
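
The counting loop is safe to run in parallel only if each increment is atomic, which is what a fetch-and-add provides. As a portable stand-in for the MTA instruction, the sketch below uses C11 `atomic_fetch_add`; the function name `count_buckets` is hypothetical, and the loop is written serially, so it illustrates the update rather than the parallel schedule.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Step 1 of the bucket sort: count the elements that fall in each bucket.
 * Each increment is an atomic fetch-and-add, so the loop body stays correct
 * even if iterations are executed concurrently by several threads. */
void count_buckets(const unsigned *src, size_t n, unsigned shift,
                   atomic_int *count)
{
    assert(src && count);
    for (size_t i = 0; i < n; i++)
        atomic_fetch_add(&count[src[i] >> shift], 1);
}
```

The caller is assumed to supply a zero-initialized `count` array with one entry per bucket.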

Step 2

start[0] = 0;

for (i = 1; i < n_buckets; i++) start[i] = start[i-1] + count[i-1];

Compiler automatically parallelizes most first-order recurrences

[Figure: src, dst, and the start array; example start values: 0 3 9 14 15 22 24 26 32]

Step 3

#pragma mta assert parallel
#pragma mta assert noalias *src, *dst
for (i = 0; i < n; i++) {
  bucket = src[i] >> shift;
  index = start[bucket]++;
  dst[index] = src[i];
}

The compiler cannot automatically parallelize the loop because the order in which elements are copied to dst is not deterministic

The noalias pragma lets the compiler generate optimal code (3 instructions)

The compiler uses an int_fetch_add instruction to increment start

Step 4

Sort each segment

Since there are many more buckets than streams and we assume the number of elements per bucket is small, any sort algorithm will do; we used a simple insertion sort

Most of the time is spent skipping through buckets of size 0 or 1
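
Putting the four steps together, the following is a serial C sketch of the bucket sort. It is not the code used for the measurements: the function name, the signature, the extra `start[n_buckets]` sentinel, and the reuse of `count` as a fill cursor are all assumptions made to keep the example self-contained, and the parallel pragmas are left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Serial sketch of the four-step bucket sort for unsigned keys.
 * The bucket of a key is key >> shift, and n_buckets (at least 1) must
 * exceed the largest bucket index. Returns 0 on success, -1 if scratch
 * memory cannot be allocated. */
int bucket_sort(const unsigned *src, unsigned *dst, size_t n,
                unsigned shift, size_t n_buckets)
{
    size_t *count = calloc(n_buckets, sizeof *count);
    size_t *start = malloc((n_buckets + 1) * sizeof *start);
    if (!count || !start) { free(count); free(start); return -1; }

    /* Step 1: count the elements in each bucket. */
    for (size_t i = 0; i < n; i++) {
        assert((src[i] >> shift) < n_buckets);
        count[src[i] >> shift] += 1;
    }

    /* Step 2: exclusive prefix sum gives the start of each bucket in dst. */
    start[0] = 0;
    for (size_t b = 1; b <= n_buckets; b++)
        start[b] = start[b - 1] + count[b - 1];

    /* Step 3: copy each element into its bucket's segment.
     * count is reused as the per-bucket fill cursor. */
    for (size_t b = 0; b < n_buckets; b++)
        count[b] = start[b];
    for (size_t i = 0; i < n; i++)
        dst[count[src[i] >> shift]++] = src[i];

    /* Step 4: insertion sort inside each segment. */
    for (size_t b = 0; b < n_buckets; b++) {
        for (size_t i = start[b] + 1; i < start[b + 1]; i++) {
            unsigned key = dst[i];
            size_t j = i;
            while (j > start[b] && dst[j - 1] > key) {
                dst[j] = dst[j - 1];
                j--;
            }
            dst[j] = key;
        }
    }

    free(count);
    free(start);
    return 0;
}
```

In the parallel version described on the slides, the increments in steps 1 and 3 are the operations that become fetch-and-add instructions, and step 4 is parallel across buckets because the segments are disjoint.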

Performance

N = 100,000,000

P    T (s)   Sp
1    9.14    1.0
2    4.56    2.0
4    2.41    3.79
8    1.26    7.25

9.14 sec * 220000000 cycles/sec / 100000000 elements = 20.1 cycles/element

Prefix operations on lists

P(i) = P(i - 1) + V(i)

List ranking: determine the rank of each node in the list

A common procedure that occurs in many graph algorithms

[Figure: a 16-node linked list, with a table giving the rank R of each node N]

Helman and JaJa algorithm

Mark nwalk nodes including the first node

This step divides the list into nwalk sublists

Traverse the sublists computing each node’s rank in the sublist

From the lengths of the sublists, compute the start of each sublist

Re-traverse each sublist, incrementing each node’s local rank by the sublist’s start
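
The four steps can be sketched in serial C as follows. This is an illustration under stated assumptions, not the implementation measured in the talk: the list is assumed to be stored as a `next` index array with -1 at the tail, the marked nodes are passed in by the caller rather than chosen at random, and step 3 is a simple serial walk over the sublists. All names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* next[i] is the successor of node i, or -1 at the tail; head is the first
 * node. walk_heads lists nwalk distinct nodes of the list, and
 * walk_heads[0] must be head. On return rank[i] is the 1-based position
 * of node i. Returns 0 on success, -1 if scratch memory cannot be allocated. */
int list_rank(const int *next, int n, int head,
              const int *walk_heads, int nwalk, int *rank)
{
    assert(n > 0 && nwalk > 0 && walk_heads[0] == head);
    int *owner  = malloc((size_t)n * sizeof *owner);      /* sublist id of a marked node, else -1 */
    int *len    = malloc((size_t)nwalk * sizeof *len);    /* length of each sublist */
    int *succ   = malloc((size_t)nwalk * sizeof *succ);   /* sublist that follows, or -1 */
    int *offset = malloc((size_t)nwalk * sizeof *offset); /* rank of the node before each sublist */
    if (!owner || !len || !succ || !offset) {
        free(owner); free(len); free(succ); free(offset);
        return -1;
    }

    /* Step 1: mark the walk heads, which splits the list into sublists. */
    for (int i = 0; i < n; i++) owner[i] = -1;
    for (int w = 0; w < nwalk; w++) owner[walk_heads[w]] = w;

    /* Step 2: walk each sublist, assigning local ranks 1, 2, ...
     * until the tail or the next marked node. Walks are independent. */
    for (int w = 0; w < nwalk; w++) {
        int r = 0, node = walk_heads[w];
        do {
            rank[node] = ++r;
            node = next[node];
        } while (node != -1 && owner[node] == -1);
        len[w] = r;
        succ[w] = (node == -1) ? -1 : owner[node];
    }

    /* Step 3: visit sublists in list order to compute each one's start. */
    int total = 0;
    for (int w = 0; w != -1; w = succ[w]) {
        offset[w] = total;
        total += len[w];
    }

    /* Step 4: re-traverse each sublist, adding its start to the local ranks. */
    for (int w = 0; w < nwalk; w++) {
        int node = walk_heads[w];
        for (int k = 0; k < len[w]; k++) {
            rank[node] += offset[w];
            node = next[node];
        }
    }

    free(owner); free(len); free(succ); free(offset);
    return 0;
}
```

Steps 2 and 4 are where the parallelism lies: each of the nwalk walks touches a disjoint set of nodes, so the walks can run as concurrent threads, and only the short step 3 involves a dependence between sublists.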

Steps 1 and 2

[Figure: the 16-node list divided into sublists at the marked nodes, with the local rank R of each node within its sublist]

Step 3

[Figure: four successive snapshots of the W and L arrays as the sublist lengths are accumulated into the start of each sublist]

Step 4

[Figure: the 16-node list with each sublist's start added to its nodes' local ranks (R + 0, R + 2, R + 6, R + 11, R + 12), giving the final ranks 1 to 16]

Performance Comparison

[Figure: four plots of list ranking execution time (seconds) against problem size (M, 0 to 100), each with curves for 1, 2, 4, and 8 processors:
  Performance of List Ranking on Cray MTA (For Ordered List), time axis 0 to 4 seconds
  Performance of List Ranking on Cray MTA (For Random List), time axis 0 to 4 seconds
  Performance of List Ranking on Sun E4500 (For Ordered List), time axis 0 to 40 seconds
  Performance of List Ranking on Sun E4500 (For Random List), time axis 0 to 140 seconds]

Two more kernels

Linked List Search

                                N = 1000   N = 10000
SunFire 880 MHz (1P)            9.30       107.0
Intel Xeon 2.8 GHz (32b) (1P)   7.15       40.0
Cray MTA-2 (1P)                 0.49       1.98
Cray MTA-2 (10P)                0.05       0.20

Random Access (GUPS)

                                Giga updates per second
IBM Power4 1.7 GHz (256P)       0.0055
Cray MTA-2 (1P)                 0.41
Cray MTA-2 (5P)                 0.204
Cray MTA-2 (10P)                0.405

Eldorado Timeline

[Timeline, 2004 to 2006, in order: ASIC tape-out; prototype systems; early systems (moderate size, 500-1000P)]

Eldorado Summary

Eldorado is a low-risk development project because it reuses much of the Red Storm infrastructure, which is part of Cray’s standard product path
  Rainier module
  Cascade LWP

Eldorado retains the MTA-2’s easy-to-program model, with parallelism managed by the compiler and run-time
  Ideal for applications that do not run well on SMP clusters
  High productivity system for algorithms research and development

Applications scale to very large system sizes

15-18 months to system availability

