PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Page 1: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

PARALLEL PROGRAMMING

MANY-CORE COMPUTING:

HARDWARE (2/5)

Rob van Nieuwpoort

[email protected]

Page 3: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Schedule 3

1. Introduction, performance metrics & analysis

2. Many-core hardware, low-level optimizations

3. CUDA class 1: basics

4. CUDA class 2: advanced

5. Case study: LOFAR telescope with many-cores

Page 4: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Hierarchical systems 4

Grid

Cluster

Node

Multiple GPUs per node

Multiple chips per GPU

Streaming multiprocessors

Hardware threads

...

(these lower levels are the focus of this course)

Page 5: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Multi-core CPUs 5

Page 6: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

General Purpose Processors 6

Architecture

Few fat cores

Vectorization

Streaming SIMD Extensions (SSE)

Advanced Vector Extensions (AVX)

Homogeneous

Stand-alone

Memory

Shared, multi-layered

Per-core cache and shared cache

Programming

Multi-threading

OS Scheduler

Coarse-grained parallelism

Page 7: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Intel 7

Page 8: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

AMD Magny-Cours 8

Page 9: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

AMD Magny-Cours

Two 6-core dies in a single package

Up to four of these packages in a single compute node

48 cores in total

Non-uniform memory access

Per-core cache

Per-chip cache

Local memory

Remote memory (HyperTransport)

9
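
On such a NUMA machine data placement matters: a core reaches its local memory faster than memory attached to another package. Below is a minimal sketch, not from the slides, that uses the Linux libnuma API to allocate a buffer on the NUMA node of the calling core; the buffer size is an arbitrary illustrative choice.

#define _GNU_SOURCE
#include <numa.h>    /* libnuma; link with -lnuma */
#include <sched.h>   /* sched_getcpu */
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    /* Find the NUMA node of the core this thread is running on. */
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);

    /* Allocate 64 MB directly on that node, so accesses stay local
       instead of crossing a HyperTransport link. */
    size_t bytes = 64UL * 1024 * 1024;
    float *data  = numa_alloc_onnode(bytes, node);
    if (data == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    printf("core %d allocated %zu bytes on NUMA node %d\n", cpu, bytes, node);

    numa_free(data, bytes);
    return 0;
}

Pinning the worker threads to the cores of that node (for example with numactl or sched_setaffinity) keeps their accesses local as well.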

Page 10: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

AMD Magny-Cours 10

Page 11: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

AMD Magny-Cours 11

Page 12: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

AWARI on the Magny-Cours 12

DAS-2

51 hours

72 machines / 144 cores

72 GB RAM in total

1.4 TB disk in total

Magny-Cours

45 hours

1 machine, 48 cores

128 GB RAM in 1 machine

4.5 TB disk in 1 machine

Less than 12 hours with new algorithm (needs more RAM)

Page 13: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Multi-core CPU programming

Threads

Pthreads, Java threads, …

OpenMP

MPI

OpenCL

Vectorization

Streaming SIMD Extensions (SSE)

Advanced Vector Extensions (AVX)

13
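
As a concrete example of the thread-based options listed above, here is a minimal OpenMP sketch (not part of the slides); the same vector-add loop returns later in its SSE form.

#include <omp.h>

/* Every iteration is independent, so the OpenMP runtime can simply
   split the loop across the available cores. Compile with -fopenmp. */
void vectorAdd(int size, float *a, float *b, float *c) {
    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}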

Page 14: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Vectorizing with SSE

Assembly instructions

16 registers

C or C++: intrinsics

Name instruction, but not registers

Work on variables, not registers

Declare vector variables

14

Page 15: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Vectorizing with SSE examples

float data[1024];

// init: data[0] = 0.0, data[1] = 1.0, data[2] = 2.0, etc.

init(data);

// Set all elements in my vector to zero.

__m128 myVector0 = _mm_setzero_ps();

// Load the first 4 elts of the array into my vector.

__m128 myVector1 = _mm_load_ps(data);

// Load the second 4 elts of the array into my vector.

__m128 myVector2 = _mm_load_ps(data+4);

Resulting vector contents (elements 0 to 3):

myVector0 = { 0.0, 0.0, 0.0, 0.0 }

myVector1 = { 0.0, 1.0, 2.0, 3.0 }

myVector2 = { 4.0, 5.0, 6.0, 7.0 }

15

Page 16: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Vectorizing with SSE examples

// Add vectors 1 and 2; instruction performs 4 FLOPs.

__m128 myVector3 = _mm_add_ps(myVector1, myVector2);

// Multiply vectors 1 and 2; instruction performs 4 FLOPs.

__m128 myVector4 = _mm_mul_ps(myVector1, myVector2);

// _mm_shuffle_ps builds the low two result elements from vec1 and the high two from vec2;
// _MM_SHUFFLE(2, 3, 0, 1) selects vec1[1], vec1[0], vec2[3], vec2[2].

__m128 myVector5 = _mm_shuffle_ps(myVector1, myVector2,

_MM_SHUFFLE(2, 3, 0, 1));

Resulting vector contents (elements 0 to 3), with myVector1 = { 0.0, 1.0, 2.0, 3.0 } and myVector2 = { 4.0, 5.0, 6.0, 7.0 }:

myVector3 = { 4.0, 6.0, 8.0, 10.0 } (element-wise sum)

myVector4 = { 0.0, 5.0, 12.0, 21.0 } (element-wise product)

myVector5 = { 1.0, 0.0, 7.0, 6.0 } (shuffle result)

16

Page 17: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Vector add

void vectorAdd(int size, float* a, float* b, float* c) {

for(int i=0; i<size; i++) {

c[i] = a[i] + b[i];

}

}

17

Page 18: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Vector add with SSE: unroll loop

void vectorAdd(int size, float* a, float* b, float* c) {

for(int i=0; i<size; i += 4) { // 4 elements per iteration; assumes size is a multiple of 4

c[i+0] = a[i+0] + b[i+0];

c[i+1] = a[i+1] + b[i+1];

c[i+2] = a[i+2] + b[i+2];

c[i+3] = a[i+3] + b[i+3];

}

}

18

Page 19: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Vector add with SSE: vectorize loop

void vectorAdd(int size, float* a, float* b, float* c) {

for(int i=0; i<size; i += 4) { // assumes size is a multiple of 4 and 16-byte-aligned arrays

__m128 vecA = _mm_load_ps(a + i); // load 4 elts from a

__m128 vecB = _mm_load_ps(b + i); // load 4 elts from b

__m128 vecC = _mm_add_ps(vecA, vecB); // add four elts

_mm_store_ps(c + i, vecC); // store four elts

}

}

19
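
The same loop can also use the wider AVX registers mentioned earlier. A sketch under two assumptions that are not stated on the slides: size is a multiple of 8 and the arrays are 32-byte aligned (otherwise use the unaligned _mm256_loadu_ps / _mm256_storeu_ps variants).

#include <immintrin.h>

void vectorAddAVX(int size, float *a, float *b, float *c) {
    for (int i = 0; i < size; i += 8) {
        __m256 vecA = _mm256_load_ps(a + i);     // load 8 elts from a
        __m256 vecB = _mm256_load_ps(b + i);     // load 8 elts from b
        __m256 vecC = _mm256_add_ps(vecA, vecB); // 8 additions in one instruction
        _mm256_store_ps(c + i, vecC);            // store 8 elts
    }
}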

Page 20: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

The Cell Broadband Engine 20

Page 21: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Cell/B.E. 21

Page 22: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Cell/B.E. 22

Architecture

Heterogeneous

1 PowerPC (PPE)

8 vector-processors (SPEs)

Programming

User-controlled scheduling

6 levels of parallelism, all under user control

Fine- and coarse-grain parallelism

Page 23: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Cell/B.E. memory 23

“Normal” main memory

PPE: normal read / write

SPEs: Asynchronous manual transfers: DMA

Per-core fast memory: the Local Store (LS)

Application-managed cache

256 KB

128 x 128 bit vector registers

Page 24: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Roadrunner (IBM) 24

Los Alamos National Laboratory

#1 on the TOP500, June 2008 – November 2009

Now #10

122,400 cores, 1.4 petaflops

First petaflops system

PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz

Page 25: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

The Cell’s vector instructions

Differences with SSE

SPEs execute only vector instructions

More advanced shuffling

Not 16, but 128 registers!

Fused Multiply Add support

25
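
On the SPEs these vector operations are written with SPU intrinsics instead of SSE intrinsics. A minimal sketch, assuming the spu_intrinsics.h names (vec_float4, spu_add, spu_madd) and compilation with spu-gcc:

#include <spu_intrinsics.h>

/* Element-wise sum of four floats: the SPU counterpart of _mm_add_ps. */
vec_float4 add4(vec_float4 a, vec_float4 b) {
    return spu_add(a, b);
}

/* d = a * b + c on four floats at once, using the SPE's fused multiply-add. */
vec_float4 madd4(vec_float4 a, vec_float4 b, vec_float4 c) {
    return spu_madd(a, b, c);
}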

Page 26: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

FMA instruction

Multiply-Add (MAD): D = A * B + C, where the intermediate product A * B is rounded (digits truncated) before the addition.

Fused Multiply-Add (FMA): D = A * B + C, where the full-precision product is retained and the result is rounded only once, after the addition, so no precision is lost in the intermediate step.

26
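
The single rounding step is what makes FMA numerically different. A small sketch, not from the slides, using the standard C99 fmaf function; the constants are chosen so that the separately rounded product loses its lowest bit. Compile with contraction disabled (for example -ffp-contract=off) so that the plain a * b + c is not itself turned into an FMA by the compiler.

#include <math.h>
#include <stdio.h>

int main(void) {
    float a = 1.0f + 0x1.0p-12f;     /* 1 + 2^-12                         */
    float b = 1.0f + 0x1.0p-12f;     /* a * b = 1 + 2^-11 + 2^-24 exactly */
    float c = -(1.0f + 0x1.0p-11f);  /* cancels the rounded product       */

    float mad = a * b + c;      /* product rounded to float, then added   */
    float fma = fmaf(a, b, c);  /* exact product kept, rounded only once  */

    /* mad loses the 2^-24 term of the product; fma keeps it. */
    printf("mad = %g\nfma = %g\n", mad, fma);
    return 0;
}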

Page 27: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Cell Programming models

IBM Cell SDK

C + MPI

OpenCL

Many models from academia...

27

Page 28: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Cell SDK

Threads, but only on the PPE

Distributed memory

Local stores = application-managed cache!

DMA transfers

Signaling and mailboxes

Vectorization

28

Page 29: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Direct Memory Access (DMA)

Start an asynchronous DMA:

mfc_get(local store address, main memory address, #bytes, tag);

Wait for the DMA to finish:

mfc_write_tag_mask(1 << tag); // the tag mask is a bitmask of tag groups

mfc_read_tag_status_all();

DMA lists

Overlap communication with useful work

Double buffering

29

Page 30: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Vector sum

float vectorSum(int size, float* vector) {

float result = 0.0;

for(int i=0; i<size; i++) {

result += vector[i];

}

return result;

}

30

Page 31: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Parallelization strategy

Partition problem into 8 pieces

(Assuming a chunk fits in the Local Store)

PPE starts 8 SPE threads

Each SPE processes 1 piece

Has to load data from PPE with DMA

PPE adds the 8 sub-results

31

Page 32: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Vector sum SPE code (1)

float vectorSum(int size, float* PPEVector) {

float result = 0.0;

int chunkSize = size / NR_SPES; // Partition the data.

float localBuffer[chunkSize]; // Allocate a buffer in

// my private local store.

int tag = 42;

// Points to my chunk in PPE memory.

float* myRemoteChunk = PPEVector + chunkSize * MY_SPE_NUMBER;

32

Page 33: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Vector sum SPE code (2)

// Copy the input data from the PPE.

mfc_get(localBuffer, myRemoteChunk, chunkSize * sizeof(float), tag); // DMA size is in bytes

mfc_write_tag_mask(1 << tag); // wait on this tag group

mfc_read_tag_status_all();

// The real work.

for(int i=0; i<chunkSize; i++) {

result += localBuffer[i];

}

return result;

}

33

Page 34: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Can we optimize this strategy? 34

Page 35: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Can we optimize this strategy? 35

Vectorization

Overlap communication and computation

Double buffering

Strategy:

Split in more chunks than SPEs

Let each SPE download the next chunk while processing the

current chunk

Page 36: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

DMA double buffering example (1)

float vectorSum(float* PPEVector, int size, int nrChunks) {

float result = 0.0;

int chunkSize = size / nrChunks;

int chunksPerSPE = nrChunks / NR_SPES;

int firstChunk = MY_SPE_NUMBER * chunksPerSPE;

int lastChunk = firstChunk + chunksPerSPE;

// Allocate two buffers in my private local store.

float localBuffer[2][chunkSize];

int currentBuffer = 0;

// Start asynchronous DMA of first chunk.

float* myRemoteChunk = PPEVector + firstChunk * chunkSize;

mfc_get(localBuffer[currentBuffer], myRemoteChunk, chunkSize * sizeof(float),

currentBuffer); // DMA size in bytes; the tag is the buffer index

36

Page 37: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

DMA double buffering example (2)

for (int chunk = firstChunk; chunk < lastChunk; chunk++) {

// Prefetch next chunk asynchronously.

if(chunk != lastChunk - 1) {

float* nextRemoteChunk = PPEVector + (chunk+1) * chunkSize;

mfc_get(localBuffer[!currentBuffer], nextRemoteChunk,

chunkSize * sizeof(float), !currentBuffer);

}

// Wait for the current buffer's DMA to finish.

mfc_write_tag_mask(1 << currentBuffer); mfc_read_tag_status_all();

// The real work.

for(int i=0; i<chunkSize; i++)

result += localBuffer[currentBuffer][i];

currentBuffer = !currentBuffer;

}

return result;

}

37

Page 38: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Double and triple buffering

Read-only data

Double buffering

Read-write data

Triple buffering!

Work buffer

Prefetch buffer, asynchronous download

Finished buffer, asynchronous upload

General technique

On-chip networks

GPUs (PCI-e)

MPI (cluster)

38
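
For read-write data the double-buffering loop above extends to three rotating buffers. A sketch of the SPE-side loop only, written in the slides' simplified four-argument mfc_get/mfc_put style and reusing the earlier chunk variables; remoteChunk(i) and process() are hypothetical helpers standing for "PPEVector + i * chunkSize" and for the real computation:

// Buffer roles rotate each iteration: prefetch -> work -> write-back.
// The DMA tag of each transfer is simply the buffer index (0, 1 or 2).
int work = 0, prefetch = 1, writeback = 2;

mfc_get(localBuffer[work], remoteChunk(firstChunk), chunkSize * sizeof(float), work);

for (int chunk = firstChunk; chunk < lastChunk; chunk++) {
    if (chunk != lastChunk - 1) {
        // Make sure the prefetch buffer is no longer being uploaded,
        // then start downloading the next chunk into it.
        mfc_write_tag_mask(1 << prefetch);
        mfc_read_tag_status_all();
        mfc_get(localBuffer[prefetch], remoteChunk(chunk + 1),
                chunkSize * sizeof(float), prefetch);
    }

    // Wait for the current work buffer's download to finish.
    mfc_write_tag_mask(1 << work);
    mfc_read_tag_status_all();

    process(localBuffer[work], chunkSize);   // the real work, in place

    // Upload the processed chunk back to main memory asynchronously.
    mfc_put(localBuffer[work], remoteChunk(chunk), chunkSize * sizeof(float), work);

    // Rotate the roles.
    int finished = work;
    work = prefetch;
    prefetch = writeback;
    writeback = finished;
}

// Wait for the remaining uploads before the SPE thread exits.
mfc_write_tag_mask((1 << work) | (1 << prefetch) | (1 << writeback));
mfc_read_tag_status_all();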

Page 39: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Intel’s many-core platforms 39

Page 40: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Intel Single-chip Cloud Computer 40

Architecture

Tile-based many-core (48 cores)

A tile is a dual-core

Stand-alone

Memory

Per-core and per-tile

Shared off-chip

Programming

Multi-processing with message passing

User-controlled mapping/scheduling

Gain performance from:

Coarse-grain parallelism

Multi-application workloads (cluster-like)
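
The SCC's natural programming style is therefore the same as on a cluster: one process per core, no shared application memory, explicit messages. Intel's native library for this is RCCE; as a sketch of the pattern in standard MPI (which the course uses elsewhere):

#include <mpi.h>
#include <stdio.h>

/* Each process computes a partial result and rank 0 combines them;
   nothing is shared between the processes except explicit messages. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    float partial = (float)rank;   /* stand-in for real local work */
    float total   = 0.0f;
    MPI_Reduce(&partial, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("combined result over %d cores: %f\n", nprocs, total);
    }

    MPI_Finalize();
    return 0;
}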

Page 41: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Intel Single-chip Cloud Computer 41

Page 42: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Intel SCC Tile

2 cores

16 KB L1 cache per core

256 KB L2 cache per core

8 KB message-passing buffer

On-chip network router

42

Page 43: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Intel's Larrabee

GPU based on x86 architecture

Hardware multithreading

Wide SIMD

Achieved 1 TFLOP/s sustained application performance (demonstrated at SC09)

Canceled as a GPU product in December 2009, re-targeted at the HPC market

43

Page 44: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

Intel's Many Integrated Core (MIC)

May 2010: Larrabee + 80-core research chip + SCC → MIC

x86 vector cores

Knights Ferry: 32 cores, 128 threads, 1.2 GHz, 8 MB shared cache

Knights Corner: 22 nm, 50+ cores

44

Page 45: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

GPU hardware introduction 45

Page 46: PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)

CPU vs GPU 46

Movie

The Mythbusters

Jamie Hyneman & Adam Savage

Discovery Channel

Appearance at NVIDIA’s NVISION 2008

