
Hardware and Software for Parallel Computing

Florent Lebeau, CAPS entreprise

UMPC - January 2014

Agenda

Day 1

The Need for Parallel Computing

Introduction to Parallel Computing

o Different Levels of Parallelism

History of Supercomputers

o Hardware Accelerators

Multiprocessing

o Fork/join

o MPI

Multithreading

o Pthread

o TBB

o Cilk

o OpenMP

Day 2

Hardware Accelerators

o GPU

• CUDA

• OpenCL

o Xeon Phi

• Intel Offload

o Directive Standards

• OpenACC

• OpenMP 4.0

CapEx / OpEx with GPU

Porting Methodology

www.caps-entreprise.com 2


The Need for Parallel Computing

The Demand (1)

Compute for research
o Simulate complex physical phenomena
o Validate a mathematical model

Compute for industry
o Quality control by image processing
o DNA sequence alignment
o Oil & gas prospection
o Meteorology

Compute for the masses
o Playing a 1080p video in real time
o Compressing / uncompressing streams
o Image processing

www.caps-entreprise.com 4


The Demand (2)

Computing needs
o Data
o An operation

The computing cost (time) may be
o Qcomp = Qdata * Qop
o And sometimes even worse

To reduce the computation time, one can
o Reduce the amount of data
o Reduce the amount of operations
o Increase the computation speed

www.caps-entreprise.com 5

The Demand (3)

The amount of computation keeps increasing
o Game and screen resolutions keep improving
o Longer weather predictions
o More accurate weather predictions
• Which increases the amount of data

The amount of data grows each year

But the time available to exploit these data stays the same
o 24h for another day of weather prediction
o 1/50s for a video stream image

www.caps-entreprise.com 6


The Demand (4)

So we need to increase the number of computations per second

At a lower cost
o To purchase
o To develop
o To maintain

Technologically sustainable
o Easy adaptation to the next architecture

According to the company strategy
o Green computing, industrial partnerships with providers…

www.caps-entreprise.com 7

The Solution (1)

By the end of the 20th century, most applications were written for CPUs (mainly x86)
o The regular frequency increase of microprocessors ensured performance gains without code modification
o The effort rested on hardware vendors rather than on developers

Today the frequency increase has stalled
o Because of thermal dissipation and power leakage
o Because of component distances and die surface

Computing faster sequentially is less and less feasible
o But we can compute more things simultaneously (in parallel)

www.caps-entreprise.com 8


The Solution (2)

Advanced code optimizations to get the best out of cutting-edge CPU features
o Vectorization

Code parallelization
o Multi-threading
o Parallel computing requires parallel codes

Porting onto specialized architectures
o FPGA, cluster, GPU…
o Not only the developer's choice, because it may imply long-term hardware investments

www.caps-entreprise.com 9

Introduction to Parallel Computing


Flynn's Taxonomy

Classification of computer architectures

o Established by Michael J. Flynn in 1966

4 categories based on the data and instruction flow

o SISD

o SIMD

o MISD

o MIMD

• With shared memory (CPU cores)

• With distributed memory (clusters)

www.caps-entreprise.com 11

Flynn's Taxonomy: SISD

SISD: Single Instruction, Single Data

www.caps-entreprise.com 12

[Diagram: one control unit issuing one instruction stream to one processor operating on one memory]


Flynn's Taxonomy: SIMD

SIMD: Single Instruction, Multiple Data

www.caps-entreprise.com 13

[Diagram: one control unit driving processors 0 to 2, which operate on different data in memory]

Flynn's Taxonomy: MISD

MISD: Multiple Instruction, Single Data

www.caps-entreprise.com 14

[Diagram: control units 0 to 2 each driving a processor, all operating on the same data in one memory]


Flynn's Taxonomy: MIMD

MIMD: Multiple Instruction, Multiple Data

www.caps-entreprise.com 15

[Diagram: control units 0 to 2 each driving a processor with its own memory (memories 0 to 2)]

Distributed Memory Architectures

Processors only see their own memory

They communicate explicitly by message-passing if needed

A processor cannot access the memory of another

So the data distribution must be chosen to avoid communications

www.caps-entreprise.com 16

[Diagram: four processors connected by a network]


Shared Memory Architectures (1)

A unique address space is provided by the hardware

If caches are present, their consistency is maintained by the hardware

www.caps-entreprise.com 17

[Diagram: four processors connected by a network to a single shared address space]

Shared Memory Architectures (2)

Global memory space, accessible by all processors

Processors may have local memory (e.g. caches) to hold copies of some global memory

Consistency of these copies is usually maintained by hardware
o Referred to as Cache-Coherent

User-friendly

Programmer is in charge of correct synchronization between processes/threads

Suffers from a lack of scalability

www.caps-entreprise.com 18


UMA: Uniform Memory Access

Most commonly represented today by SMP machines

Identical processors

Equal access and access times to memory

Sometimes called CC-UMA - Cache Coherent UMA

www.caps-entreprise.com 19

NUMA : Non-Uniform Memory Access

Often made by physically linking two or more SMPs

One SMP can directly access memory of another SMP

Memory access across link is slower

www.caps-entreprise.com 20


Amdahl's Law

Amdahl's law states that the sequential part of the execution limits the potential benefit from parallelization
o The execution time TP on P cores is given by:

TP = seq * T1 + (1 - seq) * T1 / P

• where seq (in [0,1]) is the fraction of the execution that is inherently sequential

Consequences of this law
o Potential performance is dominated by the sequential parts of the application

www.caps-entreprise.com 21
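As a quick illustration (numbers chosen here, not on the original slide): with seq = 0.1 and P = 8, TP = 0.1 * T1 + 0.9 * T1 / 8 ≈ 0.21 * T1, i.e. a speedup of about 4.7; even with infinitely many cores the speedup cannot exceed 1 / seq = 10.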

What is a Hotspot?

A small part of code

o Most of the execution time spent in it

o Mostly loops that concentrate computation

Identified using performance profilers

Also known as kernels or compute intensive kernels

o But sometimes a hotspot can be implemented as several kernels

www.caps-entreprise.com 22


Speedup

Speedup

o Ratio between the execution time on 1 process and on P processes

Efficiency
o Ratio between the speedup and the number of cores used for the parallel version
o A parallel application is scalable when its efficiency stays close to 1 as the number of cores increases

www.caps-entreprise.com 23

SP = T1 / TP

EP = T1 / (P * TP)
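For instance (illustrative numbers, not from the slide): if T1 = 100 s and T8 = 20 s on 8 cores, then S8 = 100 / 20 = 5 and E8 = 5 / 8 = 0.625.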

Amdahl’s law

www.caps-entreprise.com 24


Speedup

Speedup

o Ratio between the original time and the optimized time

www.caps-entreprise.com 25

Sp = Tseq / Tp

Gustafson’s Law

States that increasing the amount of data increases the parallelism potential of the application
o The more computations you have, the more computations you may overlap

A parallel architecture needs to be well exploited to get good performance
o The more parallel work you send to it, the better the results

www.caps-entreprise.com 26
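A common formulation of Gustafson's law (added here for reference, it is not on the original slide) is the scaled speedup S(P) = P - seq * (P - 1), where seq is the sequential fraction of the scaled workload: the larger the problem, the closer the speedup gets to P.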


Scalability

Scalability gives us an idea of how the system behaves when the number of processors is increased

Applications can often exploit large parallel machines by scaling the problems to larger instances

To improve scalability:

o Increase the parallelism of the application

www.caps-entreprise.com 27

Load Balancing

It is the capacity of the application to adapt the amount of work between the Processing Elements

It can be statically or dynamically adapted

www.caps-entreprise.com 28


Granularity

Granularity is the amount of computation relative to communication

Larger parallel tasks usually provide better speedups
o Reduce startup overhead
o Reduce communication and synchronization overhead

Smaller granularity can be exploited on strongly coupled architectures, such as multicores
o Can require deep rewriting of the application

www.caps-entreprise.com 29

Different Levels of Parallelism


Different Levels of Parallelism

ALU
o Vectorization (MMX / 3DNow!, SSE)

Instruction Pipeline
o Instruction Level Parallelism (ILP)

Multi-core
o Mass market desktop workstations

Multi-socket
o Bi-CPU desktop workstations

Multi-node
o Cluster

Worldwide distributed computing
o SETI@home

www.caps-entreprise.com 31

Parallelism in CPUs

Scalar / superscalar / SIMD
o 80486 / Intel Pentium / Intel Pentium MMX

Mono/multicore
o Dual-core (2005 with the AMD Opteron)
o Quad-core, etc.
o "Duplication of the processors"

Mono/multisocket
o Intel Xeon bi-socket

www.caps-entreprise.com 32


Scalar Processors

One data element is computed at a time
o The architecture is designed to perform one instruction on one data element per clock cycle
o Unlike vector (or superscalar) architectures

One data element = a value or scalar variable
o i.e. a value or a container determined by a type

o As opposed to a composite data:

• A vector

• A character string in certain programming languages

Ex : Intel 80486

www.caps-entreprise.com 33

Superscalar Processors

Able to perform several instructions simultaneously
o Each in a different pipeline

Ex : Intel P5 (1993)

www.caps-entreprise.com 34


Vector Processors

Their architecture is based on pipelines
o A vector instruction executes the same operation on all the data of a vector

Ex: Cray, NEC, IBM, DEC, TI, Apple AltiVec G4 and G5…

www.caps-entreprise.com 35

SIMD Processors

Single Instruction Multiple Data

Ex : MMX, SSE, ARM Neon, SPARC VIS, MIPS…

www.caps-entreprise.com 36


Today

Most architectures are based on superscalar processors

o And SIMD

Mono-socket for mass market

o Dual-socket or more in clusters or professional workstations

www.caps-entreprise.com 37

History of Supercomputers


Top500.org

Lists the world's 500 most powerful supercomputers

www.caps-entreprise.com 39

Architecture Types

www.caps-entreprise.com 40


Architecture Types

Single Processor
o But a big one

Cluster

(UP)^n with SAS:
o SMP if n is small
o MPP if n is large

((UP)^n with SAS)^m:
o If n << m and without SAS: cluster
o Constellation otherwise

www.caps-entreprise.com 41

Architecture Types

www.caps-entreprise.com 42


Processor Types

www.caps-entreprise.com 43

Processor Types

www.caps-entreprise.com 44


Number of Processors

www.caps-entreprise.com 45

Interconnect Type

www.caps-entreprise.com 46


Installation Type

www.caps-entreprise.com 47

Clusters


An Example: Nova Cluster

CAPS entreprise, 2009

www.caps-entreprise.com 49

Nova Architecture

Nova is composed of:

o 1 login node (Nova0)

o 3 storage nodes over Lustre (Nova1 to Nova3)

o 20 compute nodes (Nova4 to Nova23)

Each compute node is made up of:

o A dual-socket Intel Nehalem (bi-processor) machine

• Each Nehalem processor is quad-core (4 CPU cores)

o 2 Nvidia Tesla C1060 GPUs

o 24 GB of memory

www.caps-entreprise.com 50


Nova’s Compute Nodes Architecture

www.caps-entreprise.com 51

[Diagram: two Nehalem sockets linked by Intel QPI, each with 12 GB of memory, attached through the Intel S5520 chipset and a PCIe 2.0 16x link to the two Tesla C1060 GPUs]

Nova Architecture

www.caps-entreprise.com 52

[Diagram: Internet access to the login node Nova0 (nova0.caps-entreprise.com), which connects to the file system (Nova1-3) and the compute nodes (Nova4-23)]


Pros

Less expensive
o Than one multiprocessor server of similar computing power
o In particular SMP

More flexible
o The size is adapted to the needs and budget, unlike monolithic configurations

www.caps-entreprise.com 53

Exploiting clusters

As a distributed system

o That is what they actually are

o Resources can be shared among users, applications …

o More complicated to program

As a virtual SMP

o Kerrighed

o The OS hides the underlying architecture

o Easier to program but less control

• A cluster is NUMA-type. Data distribution is important

www.caps-entreprise.com 54


Exploiting Clusters in Distributed Mode

In distributed mode, the cluster is generally provided with a task scheduler
o Makes it possible to add more servers
o Makes it possible to handle breakdowns
o Slurm, SGE, PBS, …

www.caps-entreprise.com 55

$ srun -n 4 ./myProgram.exe
myProgram is running on node 13
myProgram is running on node 14
myProgram is running on node 15
myProgram is running on node 16
$

Connection to the Login Node

Before offloading your computations onto Nova's compute nodes, you need to log in to Nova0
o "Secondary" will automatically connect you to Nova0

www.caps-entreprise.com 56

$ ssh mylogin@nova0.caps-entreprise.com

mylogin@Nova0 $ #you can now work!


Module and Slurm

Nova is provided with module and srun commands

Useful Module commands

Useful Slurm commands

www.caps-entreprise.com 57

mylogin@Nova0 $ module avail

mylogin@Nova0 $ module load

mylogin@Nova0 $ module unload

mylogin@Nova0 $ module list

mylogin@Nova0 $ module switch

mylogin@Nova0 $ srun

mylogin@Nova0 $ sinfo

mylogin@Nova0 $ salloc

mylogin@Nova0 $ sacct

mylogin@Nova0 $ sreport

Running your Applications

Launch a Bash command on multiple nodes

Launch an application on multiple nodes

Launch n copies of a binary on N nodes

Launch on a specific partition

www.caps-entreprise.com 58

mylogin@Nova0 $ srun -N4 ls

mylogin@Nova0 $ srun -N3 ./myApp

mylogin@Nova0 $ srun -N1 -n3 ./myApp

mylogin@Nova0 $ srun -p hugePart -N16 ./myLargeJob


Clusters and Cloud Computing

Some companies provide an access to their server farm

o Amazon AWS/EC2

o Google App Engine

o IBM Blue Cloud

o Rackspace Mosso

o Argia Faascape

www.caps-entreprise.com 59

Amazon EC2

www.caps-entreprise.com 60


Hardware Accelerators

Clearspeed: Description

Clearspeed designs SIMD processors for HPC and embedded systems

Designs the e710, e720 and CATS-700 accelerator units, based on the CSX700 processor

CSX700 released in 2008

www.caps-entreprise.com 62


Clearspeed: Architecture Overview

CSX700:
o Made of 2 SIMD array processors
o 96 cores on each array
o 256 KB on-chip scratchpad memory
o 2 x 64-bit DDR2 DRAM interfaces with ECC support
o PCIe 16x host interface
o 96 GFLOPS SP and DP
o 9 W power dissipation

www.caps-entreprise.com 63

Clearspeed: Software Tools

o ANSI C-based Compiler: Cn

o Eclipse IDE

o GDB-based debugger

o Visual profiler

o Libraries: FFT, BLAS, LAPACK …

o MS Windows and Linux tools

www.caps-entreprise.com 64


Clearspeed: Applications

Finance:

o "Fastest solution for credit risk analysis"

Space engineering

HPC

www.caps-entreprise.com 65

CELL: Description

Alliance of Sony, Toshiba and IBM

Dates back to the mid-2000s

Based on the IBM POWER architecture core

Designed to bridge the gap between CPU and GPU

www.caps-entreprise.com 66


CELL: Architecture Overview

1 main processor (PPE: Power Processing Element)

8 coprocessors (SPE: Synergistic Processing Element)

256 GFLOPS SP, 26 GFLOPS DP

60 to 80 W power consumption

www.caps-entreprise.com 67

CELL: Software Tools

The IBM online resource center is no longer available, but the latest SDK (v3.1) can still be found on the web.

SDK v3.1 contains:

o Eclipse IDE

o GNU toolchain tools (gcc, gdb)

o Performance tools

o Libraries (FFT, LAPACK, BLAS, Monte Carlo)

Linux only

www.caps-entreprise.com 68


CELL: Applications

Multimedia:

o Console video games

o Home cinema

HPC

o IBM Roadrunner

www.caps-entreprise.com 69

Tilera: Description

Fabless semiconductor company focusing on scalable multicore embedded processors

TILE family processors

TILE-based platforms

www.caps-entreprise.com 70


Tilera: Architecture Overview

TILEPro64:
o 8 x 8 grid of identical RISC processor cores
o 5.6 MB on-chip cache memory
o 19 - 23 W power consumption
o 4 DDR2 memory controllers with optional ECC
o 443 Giga OPS

www.caps-entreprise.com 71

Tilera: Software Tools

Tilera Multicore Development Environment (MDE):
o ANSI C / C++ compiler
o Eclipse IDE
o Tools for gdb, gprof and oprofile
o Graphical multicore debugger and profiler

Linux only

www.caps-entreprise.com 72


Tilera: Applications

Cloud computing
o 3X performance-per-watt when running Memcache, according to Facebook

Networking

Multimedia

www.caps-entreprise.com 73

Kalray: Description

Spin-off from the CEA

Fabless and software company delivering manycore processors

Developed the MPPA (Multi-Purpose Processor Array)

www.caps-entreprise.com 74


Kalray: Architecture Overview

256 VLIW processors organized in 16 clusters of 16 cores for the basic edition

High-end products up to 64 clusters, i.e. 1024 cores

DDR3 memory controllers

5 W power dissipation

2 Tera OPS for the 1024-core version

www.caps-entreprise.com 75

Kalray: Software Tools

AccessCore SDK:
o C-based programming model:
• Core algorithms in C
• Primitives for task and data parallelism
o GNU gcc and gdb are used for compilation and debug
o Eclipse 3.x IDE
o Libraries
o Linux only

www.caps-entreprise.com 76


Kalray: Applications

Signal Processing

o SIMILAN project

Video processing, industrial Imaging, transportation

o CHAPI project

www.caps-entreprise.com 77

Calxeda: Description

Provides ARM-based SoCs

EnergyCore family processors

EnergyCard boards

www.caps-entreprise.com 78


Calxeda: Architecture Overview

4 ARM A9 cores

4 MB L2 cache and DDR3 controller with ECC

220 MIPS / ARM core

5 W per node

www.caps-entreprise.com 79

Calxeda: Software Tools

Use ARM-dedicated tools:

o GNU gcc ARM

o JTAG debugger

www.caps-entreprise.com 80


Calxeda: Applications

Server market

o Web server clusters
o Content distribution networks
o Cloud storage

www.caps-entreprise.com 81

Supercomputers

Vector Architecture

o CRAY, NEC SX-9

Cluster

o Contemporary Supercomputers

www.caps-entreprise.com 82


CRAY Supercomputers

Seymour Cray

o 1957: co-founder of CDC (Control Data Corporation). Takes part in the design of the first supercomputer: the CDC 6600.
o 1972: co-founder of Cray Research, Inc. Creates the Cray-1 in 1976 and the Cray-2 in 1985.
o 1989: founder of Cray Computer Corporation. The Cray-3 found no customers and the company went bankrupt in 1995.

www.caps-entreprise.com 83

Multiprocessing


What are Processes?

An instance of a computer program that is being executed
o If a program is launched twice, two instances of this program will be spawned

According to the OS
o A process can be switched with another
o A process is interruptible

Switches can be performed:
o when a task performs input/output operations
o when a task indicates that it can be switched
o or on hardware interrupts

www.caps-entreprise.com 85

Processes

With processes, you can switch between them
o And if you do it quickly enough, it seems like they execute concurrently

Processes enable multi-tasking
o OS, drivers and programs run "concurrently" on the microprocessor
o They actually share the processor core
• But with dual-core CPUs, they are distributed over two cores

Processes make exploiting multicore possible
o And take advantage of it by executing even faster

Processes have separate address spaces
o And interact only through system-provided inter-process communication mechanisms

www.caps-entreprise.com 86


Processes

Imply explicit communications
o Through inter-process communication mechanisms

www.caps-entreprise.com 87

Multiprocessing with Multiple Cores

Makes it possible to execute different programs concurrently
o One on each core: the overall time needed is reduced
o You have to launch the programs asynchronously

Makes it possible to execute several instances of the same program
o On different data -> SPMD
o You need either another command line or specific tools like MPI

www.caps-entreprise.com 88

$ riri.exe & fifi.exe & loulou.exe &

$ srun -N 9 donald.exe data1-9.txt


Tools for Multiprocessing

By hand

o Fork / join

o …

With the help of dedicated tools

o MPI

o PVM

o …

www.caps-entreprise.com 89

Fork/Join

www.caps-entreprise.com 90


Fork / join

The fork() system call spawns a new child process which is an identical copy of the parent
o Except that it has a new system process ID

The process is copied in memory from the parent and a new process structure is assigned by the kernel
o All data and properties are inherited from the parent, but separated

www.caps-entreprise.com 91

Fork / Join Example

www.caps-entreprise.com 92

#include <iostream>
#include <string>
#include <cstdlib>
#include <sys/types.h>
#include <unistd.h>

using namespace std;

int main()
{
  string sIdentifier;
  pid_t pID = fork();

  if (pID == 0) // Code only executed by child process
  {
    sIdentifier = "Child Process: ";
  }
  else if (pID < 0) // failed to fork
  {
    cerr << "Failed to fork" << endl;
    exit(1);
  }
  else // Code only executed by parent process
  {
    sIdentifier = "Parent Process:";
  }

  // Code executed by both parent and child
  // (the parent prints the child's PID, the child prints 0)
  cout << sIdentifier << pID << endl;
}

$ g++ -o ForkTest ForkTest.cpp
$ ./ForkTest
Parent Process:1245
Child Process: 0


Multiprocessing on Distributed Systems

The Fork / Join model is on-node programming

But since processes embed all their execution context
o They are quite autonomous
o And may be distributed over separate computers (nodes)

MPI
o Designed to implement multiprocessing on distributed systems
o Is based on the fork / join model
o Is portable (systems of different kinds can communicate)

www.caps-entreprise.com 93

MPI


MPI: Message Passing Interface

A high-level message-passing API

Designed for high performance, scalability and portability

A standard which comes in two versions

o 1995: v1.2 (MPI-1)

o 1997: v2.1 (MPI-2)

An API which comes in several implementations

o Some vendor-specific extensions...

o ...can break the portability of the application

www.caps-entreprise.com 95

MPI distributions

Free implementations

o MPICH (MPI-1)

o MPICH2 (MPI-2)

o OpenMPI

o LAM/MPI

Vendor implementations

o HP MPI

o Intel MPI

www.caps-entreprise.com 96


Programming Paradigms

Sequential Programming Paradigm

Message-Passing Programming Paradigm

www.caps-entreprise.com 97

[Diagram: the sequential paradigm runs one program on one data set; the message-passing paradigm runs sub-programs on sub-data sets that communicate over a network]

MPI Programming

Each node in an MPI program runs a sub-program
o Written in a sequential language (C, Fortran, ...)
o Typically the same on each node (SPMD)

The variables of each sub-program:
o Have the same name
o Have different locations and different values (distributed memory)
o All variables are private to the sub-program

Sub-programs communicate via send & receive routines

www.caps-entreprise.com 98


Data & Work Distribution

Each process is given a unique rank from 0 to N-1

System of N independent processes

Data and work distribution decisions based on rank

www.caps-entreprise.com 99

[Diagram: sub-programs with ranks 0 to N-1, each holding its own data, connected by a network]

Messages

Messages are blocks of data exchanged by sub-programs

For both send and receive steps, the necessary information is:

o Ranks of the source/destination processes

o Data location

o Data type

o Data size

www.caps-entreprise.com 100



Include In The Program File

In C

o #include <mpi.h>

In C++

o #include <mpi++.h>

In Fortran

o #include "mpif.h" ! F77 or F90

o USE MPI

• Before IMPLICIT NONE

www.caps-entreprise.com 101

MPI C/C++ Functions

Header

o #include <mpi.h>

o #include <mpi++.h>

Function format

o mpierror = MPI_?????( … );

o MPI_?????( … );

MPI_* prefix reserved for MPI macros & routines

www.caps-entreprise.com 102


MPI Fortran Subroutines

Header

o include 'mpif.h'

Function format

o var = MPI_?????( … )

o CALL MPI_?????( … , mpierror)

MPI_* prefix reserved for MPI macros & routines

www.caps-entreprise.com 103

Initialization & Termination

Initializing MPI
o int MPI_Init(int *argc, char ***argv) (C)
o Subroutine MPI_Init(mpierror) (Fortran)

Must be the first MPI routine called

It initializes the MPI execution environment (communicator, …)

Exiting MPI
o int MPI_Finalize() (C)
o Subroutine MPI_Finalize(mpierror) (Fortran)

Must be called by all processes before exiting

Terminates the MPI execution environment

www.caps-entreprise.com 104


Handles

A handle is a value used to identify an MPI object

For the programmers, handles are:

o Predefined constants (exist only after call to MPI_Init())

• Ex: MPI_COMM_WORLD

o Values returned by MPI routines: defined as special MPI typedefs.

Handles refer to internal data structures

www.caps-entreprise.com 105

Communicator MPI_COMM_WORLD

Communicator: a set of interconnected MPI processes

All processes of one MPI program are combined in the communicator MPI_COMM_WORLD
o Predefined in mpi.h and mpif.h

Each process in a communicator has its own rank
o Starting from 0
o Ending at size-1

www.caps-entreprise.com 106


Size & Rank

The size:
o How many processes are in a communicator
• int MPI_Comm_size(MPI_Comm comm, int *size) (C)
• Subroutine MPI_Comm_size(MPI_COMM_WORLD, mpisize, mpierror) (Fortran)

The rank:
o Uniquely identifies each process in a communicator
o Is the basis for work / data distribution
• int MPI_Comm_rank(MPI_Comm comm, int *rank) (C)
• Subroutine MPI_Comm_rank(MPI_COMM_WORLD, mpirank, mpierror) (Fortran)

www.caps-entreprise.com 107

Basic Example in Fortran

program firstmpi

! The most basic MPI Program

include 'mpif.h'

integer :: mpierror, mpisize, mpirank

call MPI_Init(mpierror)

call MPI_Comm_size(MPI_COMM_WORLD, mpisize, mpierror)

call MPI_Comm_rank(MPI_COMM_WORLD, mpirank, mpierror)

! Do work here

call MPI_Finalize(mpierror)

end program firstmpi

www.caps-entreprise.com 108


Basic Example in C

#include <mpi.h>

int main(int argc, char *argv[])

{

/* The most basic MPI Program */

int mpierror, mpisize, mpirank;

mpierror=MPI_Init(&argc, &argv);

mpierror=MPI_Comm_size(MPI_COMM_WORLD, &mpisize);

mpierror=MPI_Comm_rank(MPI_COMM_WORLD, &mpirank);

/* Do work here */

mpierror=MPI_Finalize();

}

www.caps-entreprise.com 109

Compilation step

Compilation with C code

o Unless you have a specific implementation (e.g. bullmpi)

Compilation with Fortran code

www.caps-entreprise.com 110

$ mpicc -o mpiProg mpiProg.c

$ gcc mpiProg.c -o mpiProg -lmpi

$ icc mpiProg.c -o mpiProg -lmpi

$ mpif90 -o mpiProg mpiProg.f90

$ gfortran mpiProg.f90 -o mpiProg -lmpi

$ ifort mpiProg.f90 -o mpiProg -lmpi


Execution Step

If you execute on several nodes
o Specify the number of processes (-n 2 or -np 2)
o Specify the names of the machines in a file (ex: machines)
o Use the option -machinefile

If you use a cluster
o Use the job scheduling system
o Specify the number of nodes
• In a script with PBS or SGE
• On the command line with SLURM
• …

www.caps-entreprise.com 111

$ mpirun -np 2 -machinefile machines ./a.out

$ mpiexec -n 2 -machinefile machines ./a.out

Using MPI

The file of your application :

Introduce MPI in the application

Compile the application

Run the application

www.caps-entreprise.com 112

$ myProg.c

$ mpicc myProg.c

$ mpirun -np 2 -machinefile machines ./a.out


Point-to-point communication

Communication between 2 processes

Source process sends message to destination process

Communication takes place within a communicator

o e.g.: MPI_COMM_WORLD

Processes are identified by their rank in the communicator

www.caps-entreprise.com 113

[Diagram: six processes with ranks 0 to 5 inside a communicator; a message is sent from one process to another]

Messages

Blocks of data are exchanged by sub-programs

Contain one or more elements sharing a same datatype

MPI datatypes:

o Basic datatypes

o Derived datatypes (built from basic & after derived datatypes)

Datatype handles are used to describe the data in memory

www.caps-entreprise.com 114


Sending a message

C:
o int MPI_Send(void *buff, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

Fortran:
o Subroutine MPI_Send(buff, count, datatype, dest, tag, comm, ierror)

buff is the starting point of the message with count elements of type datatype.
dest is the rank of the destination in the communicator comm.
tag is an integer added to the message, allowing the identification of the message.

www.caps-entreprise.com 115

Receiving a message

C:
o int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Fortran:
o Subroutine MPI_Recv(buf, count, datatype, source, tag, comm, status, ierror)

buf, count and datatype describe the receive buffer.
Receives a message sent by the process with rank source in comm and with the same tag.
Additional info is returned in status.

www.caps-entreprise.com 116


MPI Basic C datatypes

MPI Datatype C Datatype

MPI_CHAR signed char

MPI_SHORT signed short int

MPI_INT signed int

MPI_LONG signed long int

MPI_UNSIGNED_CHAR unsigned char

MPI_UNSIGNED_SHORT unsigned short int

MPI_UNSIGNED unsigned int

MPI_UNSIGNED_LONG unsigned long int

MPI_FLOAT float

MPI_DOUBLE double

www.caps-entreprise.com 117

MPI Basic Fortran datatypes

www.caps-entreprise.com 118

MPI Datatypes Fortran Datatypes

MPI_INTEGER integer

MPI_REAL real

MPI_DOUBLE_PRECISION double precision

MPI_COMPLEX complex

MPI_LOGICAL logical

MPI_CHARACTER character

MPI_BYTE

MPI_PACKED


Requirements

For a communication to be successful:

o Sender must specify a valid destination rank

o Receiver must specify a valid source rank

o Same communicator on both sides

o Matching tags (chosen by the user)

o Matching datatypes

o A large enough buffer on the receiver’s side

www.caps-entreprise.com 119

Wildcarding

The receiver can wildcard.

To receive from any source

o source = MPI_ANY_SOURCE

To receive from any tag

o tag = MPI_ANY_TAG

Source & tag are returned within the receiver's status

parameter.

www.caps-entreprise.com 120
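To make this concrete, here is a minimal C sketch (an illustration added to this transcript, not part of the original deck): rank 0 accepts messages from any source and any tag, then reads the actual sender and tag back from the status object.

#include <mpi.h>
#include <stdio.h>

/* Rank 0 receives one message from each other rank, in whatever order they arrive */
int main(int argc, char **argv)
{
    int rank, size, val, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (i = 1; i < size; i++) {
            MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            /* The actual source and tag are available in the status object */
            printf("Got %d from rank %d (tag %d)\n",
                   val, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        val = rank * 100;
        MPI_Send(&val, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}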


A Simple Example (1)

#include <stdio.h>

#include <mpi.h>

#define MASTER 0 //We assume 2 MPI processes

#define SLAVE 1

int main(int argc, char **argv)

{

int val=0; //Value to be exchanged

int mpi_rank, mpi_size;

MPI_Status status;

MPI_Init(&argc,&argv);

MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);

MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

...

www.caps-entreprise.com 121

A Simple Example (2)

...

if (mpi_size!=2) return -1;

if ( mpi_rank == MASTER ) /* The master sends a value */

{

val = 1492;

printf("I'm process %d and Value is %d\n", mpi_rank, val);

MPI_Send(&val, 1, MPI_INT, SLAVE, 777, MPI_COMM_WORLD);

}

else /* The slave prints the value */

{

printf("I'm process %d and Value is %d\n", mpi_rank, val);

MPI_Recv(&val, 1, MPI_INT, MASTER, 777, MPI_COMM_WORLD, &status);

printf("I'm process %d and Value is %d\n", mpi_rank, val);

}

MPI_Finalize();

}

www.caps-entreprise.com 122


The Sendrecv

C:
o int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)

Fortran:
o Subroutine MPI_Sendrecv(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, err)

sendbuf is the starting point of the sent message with sendcount elements of type sendtype.
recvbuf is the starting point of the received message with recvcount elements of type recvtype.
dest and source are the ranks of the destination and the sender in the communicator comm.
sendtag and recvtag are integers added to the messages, allowing their identification.

www.caps-entreprise.com 123
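As an illustration (added here, not from the original slides), a minimal ring exchange in C where every rank sends to its right neighbour and receives from its left one in a single MPI_Sendrecv call, avoiding the deadlock a naive blocking Send/Recv ordering could cause:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sendval, recvval;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendval = rank;                        /* value forwarded around the ring */
    int right = (rank + 1) % size;         /* destination */
    int left  = (rank - 1 + size) % size;  /* source */

    /* Send to the right neighbour and receive from the left one in one call */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, &status);

    printf("Rank %d received %d from rank %d\n", rank, recvval, left);

    MPI_Finalize();
    return 0;
}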

Asynchronous Sending

C:
o int MPI_Isend(void *buff, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)

Fortran:
o Subroutine MPI_Isend(buff, count, datatype, dest, tag, comm, request, ierror)

buff is the starting point of the message with count elements of type datatype.
dest is the rank of the destination in the communicator comm.
tag is an integer added to the message, allowing the identification of the message.
request can be used later to query the status of the communication or wait for its completion.

www.caps-entreprise.com 124


Asynchronous Receiving

C:
o int MPI_Irecv(void *buff, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)

Fortran:
o Subroutine MPI_Irecv(buff, count, datatype, source, tag, comm, request, ierror)

buff, count and datatype describe the receive buffer.
Receives a message sent by the process with rank source in comm and with the same tag.
request can be used later to query the status of the communication or wait for its completion (the status is then returned by MPI_Wait).

www.caps-entreprise.com 125

Synchronization

C:
o int MPI_Wait(MPI_Request *request, MPI_Status *status)

Fortran:
o Subroutine MPI_Wait(request, status, ierror)

Waits until the operation identified by request is complete

Additional info is returned in status.

One is allowed to call MPI_Wait with a null request (MPI_REQUEST_NULL). In this case the operation returns immediately with an empty status.

www.caps-entreprise.com 126


Example

CALL MPI_INIT(ierr)

CALL MPI_COMM_RANK(comm, rank, ierr)

IF(rank.EQ.0) THEN

CALL MPI_ISEND(a(1), 10, MPI_REAL, 1, tag, comm, request, ierr)

**** do some computation to mask latency ****

CALL MPI_WAIT(request, status, ierr)

ELSE

CALL MPI_IRECV(a(1), 15, MPI_REAL, 0, tag, comm, request, ierr)

**** do some computation to mask latency ****

CALL MPI_WAIT(request, status, ierr)

END IF

www.caps-entreprise.com 127

Broadcast (1)

MPI_Bcast broadcasts a message from the process with rank "root" to all other processes of the group

Before MPI_Bcast: only the root (Process 1) holds the value 10

After MPI_Bcast: Processes 1 to 4 all hold the value 10

www.caps-entreprise.com 128


Broadcast (2)

C
o int MPI_Bcast(void *buff, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

Fortran:
o Subroutine MPI_BCAST(buff, count, datatype, root, comm, ierror)

buff, count and datatype describe the send/receive buffer in comm

root is the rank of the sender

www.caps-entreprise.com 129
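A minimal usage sketch in C (illustrative, not from the original deck): the root initializes a value and MPI_Bcast makes it visible on every rank, matching the before/after picture above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    int value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        value = 42;          /* only the root holds the value initially */

    /* After the call, every rank holds the root's value */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d now has value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}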

Scatter (1)

MPI_Scatter sends data from one task to all other tasks in a group

Before MPI_Scatter: Process 1 holds 10, 11, 12, 13

After MPI_Scatter: Process 1 holds 10, Process 2 holds 11, Process 3 holds 12, Process 4 holds 13

www.caps-entreprise.com 130


Scatter (2)

C
o int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm)

Fortran
o Subroutine MPI_SCATTER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm, ierror)

sendbuf, sendcount and sendtype describe the root sender's buffer in comm

recvbuf, recvcount and recvtype describe the receivers in comm

www.caps-entreprise.com 131

Gather (1)

MPI_Gather gathers together values from a group of processes

Before MPI_Gather: Process 1 holds 10, Process 2 holds 11, Process 3 holds 12, Process 4 holds 13

After MPI_Gather: the root (Process 1) holds 10, 11, 12, 13

www.caps-entreprise.com 132


Gather (2)

C
o int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

Fortran
o Subroutine MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm, ierror)

sendbuf, sendcount and sendtype describe the senders in comm

recvbuf, recvcount and recvtype describe the root receiver's buffer in comm

www.caps-entreprise.com 133
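The following C sketch (added for illustration, not part of the original slides) combines the two collectives: the root scatters one element to each rank, every rank works on its piece, and the root gathers the results back.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, piece, i;
    int *sendbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* the root prepares one value per process */
        sendbuf = malloc(size * sizeof(int));
        for (i = 0; i < size; i++)
            sendbuf[i] = 10 + i;
    }

    /* Each process receives one element of the root's array */
    MPI_Scatter(sendbuf, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);
    piece = piece * 2;                     /* local work on the received piece */

    /* The root collects the modified pieces back into its array */
    MPI_Gather(&piece, 1, MPI_INT, sendbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < size; i++)
            printf("sendbuf[%d] = %d\n", i, sendbuf[i]);
        free(sendbuf);
    }

    MPI_Finalize();
    return 0;
}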

Reduce (1)

MPI_Reduce reduces values from all processes to a single value. The example uses the operation MPI_SUM.

Before MPI_Reduce: Process 1 holds 10, Process 2 holds 11, Process 3 holds 12, Process 4 holds 13

After MPI_Reduce: the root (Process 1) holds 46 (= 10 + 11 + 12 + 13)

www.caps-entreprise.com 134


Reduce (2)

C
o int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

Fortran
o Subroutine MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm, err)

sendbuf, count and datatype describe the senders' buffers in comm

recvbuf and root describe the receiver in comm

op specifies the reduction operator

www.caps-entreprise.com 135
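A short illustrative C sketch (not from the original deck) reproducing the 10 + 11 + 12 + 13 example above with MPI_SUM:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, local, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 10;   /* e.g. ranks 0..3 contribute 10, 11, 12, 13 */

    /* Sum all contributions; only the root (rank 0) receives the result */
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of all contributions = %d\n", sum);

    MPI_Finalize();
    return 0;
}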

Reduction Operators

www.caps-entreprise.com 136

Name Meaning

MPI_MAX maximum

MPI_MIN minimum

MPI_SUM sum

MPI_PROD product

MPI_LAND logical and

MPI_BAND bit-wise and

MPI_LOR logical or

MPI_BOR bit-wise or

MPI_LXOR logical xor

MPI_BXOR bit-wise xor

MPI_MAXLOC max value and location

MPI_MINLOC min value and location


Barrier

C

o int MPI_Barrier ( MPI_Comm comm )

Fortran

o Subroutine MPI_BARRIER(comm, ierror)

comm is the communicator

Blocks the caller until all group members have called it

www.caps-entreprise.com 137

To go further

There exist different versions of the classical send
o Buffered send with user-provided buffering (MPI_Bsend)
o Blocking ready send (MPI_Rsend)
o Blocking synchronous send (MPI_Ssend)
o …

There exist completely collective communications
o MPI_Allgather
o MPI_Alltoall
o MPI_Allreduce
o …

You can mix OpenMP with MPI

www.caps-entreprise.com 138
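To illustrate the last point, here is a minimal hybrid sketch (added to this transcript, not from the original slides); compile it, for example, with an MPI wrapper plus the OpenMP flag (e.g. mpicc -fopenmp). Each MPI process spawns an OpenMP team, giving coarse-grain parallelism across nodes and fine-grain parallelism across the cores of each node.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One OpenMP team per MPI process; no MPI calls are made inside the threads */
    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}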


Multithreading

www.caps-entreprise.com 139

What is a Thread?

An independent stream of instructions
o That can be scheduled to run as such by the operating system
o To the software developer, the concept of a "procedure" that runs independently from its main program may best describe a thread

Consider a main program (a.out) that contains a number of procedures being able to be scheduled to run simultaneously and/or independently by the operating system
o That would describe a "multi-threaded" program

www.caps-entreprise.com 140


What is a Thread?

So, in summary, in the UNIX environment a thread:
o Exists within a process and uses the process resources
o Has its own independent flow of control as long as its parent process exists and the OS supports it
o Duplicates only the essential resources it needs to be independently schedulable
o May share the process resources with other threads that act equally independently (and dependently)
o Dies if the parent process dies - or something similar
o Is "lightweight" because most of the overhead has already been accomplished through the creation of its process

Because threads within the same process share resources:
o Changes made by one thread to shared system resources (such as closing a file) will be seen by all other threads
o Two pointers having the same value point to the same data
o Reading and writing to the same memory locations is possible, and therefore requires explicit synchronization by the programmer

www.caps-entreprise.com 141

Pthread


What are Pthreads?

Historically, hardware vendors have implemented their own proprietary versions of threads
o These implementations differed substantially from each other, making it difficult for programmers to develop portable threaded applications

In order to take full advantage of the capabilities provided by threads, a standardized programming interface was required
o For UNIX systems, this interface has been specified by the IEEE POSIX 1003.1c standard (1995)
o Implementations adhering to this standard are referred to as POSIX threads, or Pthreads
o Most hardware vendors now offer Pthreads in addition to their proprietary APIs

www.caps-entreprise.com 143

Why Pthreads?

When compared to the cost of creating and managing a process, a thread can be created with much less operating system overhead

www.caps-entreprise.com 144


On-node Communications

MPI libraries usually implement on-node task communication via shared memory, which involves at least one memory copy operation (process to process)
o For Pthreads there is no intermediate memory copy required because threads share the same address space within a single process
o There is no data transfer, per se. It becomes more of a cache-to-CPU or memory-to-CPU bandwidth (worst case) situation

www.caps-entreprise.com 145

Threads = Shared Memory Model

All threads have access to the same global, shared memory
o Threads also have their own private data
o Programmers are responsible for synchronizing (protecting) access to globally shared data

www.caps-entreprise.com 146


Thread-safeness

Refers to an application's ability to execute multiple threads simultaneously without damaging shared data or creating race conditions
o The implication for users of external library routines is that if you aren't 100% certain a routine is thread-safe, then you take your chances with the problems that could arise

www.caps-entreprise.com 147

Pthread API

For C/C++

From Intel, PathScale, PGI, GNU, IBM

Initially, your main() program comprises a single, default thread
o All other threads must be explicitly created by the programmer

www.caps-entreprise.com 148

pthread_create()
pthread_exit()
pthread_cancel()
pthread_attr_init()
pthread_attr_destroy()


Threads Life

Threads can spawn other threads

Threads can wait for another (actively or passively), die…

www.caps-entreprise.com 149

Pthread Example

www.caps-entreprise.com 150

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5

void *PrintHello(void *threadid)
{
  long tid;
  tid = (long)threadid;
  printf("Hello World! It's me, thread #%ld!\n", tid);
  pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
  pthread_t threads[NUM_THREADS];
  int rc;
  long t;
  for (t = 0; t < NUM_THREADS; t++) {
    printf("In main: creating thread %ld\n", t);
    rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
    if (rc) {
      printf("ERROR; return code is %d\n", rc);
      exit(-1);
    }
  }
  /* Last thing that main() should do */
  pthread_exit(NULL);
}


Pthread Management Basis

Routines

Joining is one way to accomplish synchronization between threads
o The pthread_join() subroutine blocks the calling thread until the specified threadid thread terminates (see the sketch below)


pthread_join()
pthread_detach()
pthread_attr_setdetachstate()
pthread_attr_getdetachstate()
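A minimal sketch of the joining pattern (a PrintHello worker in the same spirit as the previous example; names are illustrative): main() creates the workers, then blocks in pthread_join() until each one has terminated.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5

/* Worker: prints its id and terminates */
void *PrintHello(void *threadid)
{
    printf("Hello World! It's me, thread #%ld!\n", (long)threadid);
    return NULL;               /* returning from the start routine ends the thread */
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    long t;
    for (t = 0; t < NUM_THREADS; t++)
        if (pthread_create(&threads[t], NULL, PrintHello, (void *)t) != 0)
            exit(-1);
    /* pthread_join() blocks the calling thread until the given thread terminates */
    for (t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    printf("In main: all %d threads have terminated\n", NUM_THREADS);
    return 0;
}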

Intel Threading Building Blocks: TBB


Intel TBB: C++ vs imperative languages
Object-oriented language with:
o Complex inheritance
o Classes contain data and code (functions)
Template functions: behavior changes with data types

STL widely used.


Intel TBB : Express tasks not threads

Intel TBB has its own scheduler

o Automatically invoked at the beginning of a program

o Can be accessed using task_scheduler_init class

Work-stealing scheduler; it can coexist with the Cilk Plus scheduler.


Intel TBB: overview

Link with the TBB library (-ltbb or -ltbb_debug)


void SerialApplyFoo( float a[], size_t n )

{

for( size_t i=0; i!=n; ++i ) Foo(a[i]);

}

#include <tbb/tbb.h>

using namespace tbb;

void SerialApplyFoo( float a[], size_t n )

{

for( size_t i=0; i!=n; ++i ) Foo(a[i]);

}

Intel TBB: parallel For (1)

Parallel for


#include <tbb/tbb.h>

using namespace tbb;

class ApplyFoo {

float *const my_a;

public:

void operator()( const blocked_range<size_t>& r ) const {

float *a = my_a;

for( size_t i=r.begin(); i!=r.end(); ++i ) Foo(a[i]);

}

ApplyFoo( float a[] ) : my_a(a) {}

};


Intel TBB: parallel For (2)

Declare a class containing an operator ()


#include <tbb/tbb.h>

using namespace tbb;

class ApplyFoo {

float *const my_a;

public:

void operator()( const blocked_range<size_t>& r ) const {

float *a = my_a;

for( size_t i=r.begin(); i!=r.end(); ++i ) Foo(a[i]);

}

ApplyFoo( float a[] ) : my_a(a) {}

};

Intel TBB: parallel For (3)

This operator takes a blocked_range<T> argument


#include <tbb/tbb.h>

using namespace tbb;

class ApplyFoo {

float *const my_a;

public:

void operator()( const blocked_range<size_t>& r ) const {

float *a = my_a;

for( size_t i=r.begin(); i!=r.end(); ++i ) Foo(a[i]);

}

ApplyFoo( float a[] ) : my_a(a) {}

};


Intel TBB: parallel For (4)

Some other iteration spaces exist (e.g. blocked_range2d)


#include <tbb/tbb.h>

using namespace tbb;

class ApplyFoo {

float *const my_a;

public:

void operator()( const blocked_range<size_t>& r ) const {

float *a = my_a;

for( size_t i=r.begin(); i!=r.end(); ++i ) Foo(a[i]);

}

ApplyFoo( float a[] ) : my_a(a) {}

};

Intel TBB: parallel For (5)

A copy constructor must be declared


#include <tbb/tbb.h>

using namespace tbb;

class ApplyFoo {

float *const my_a;

public:

void operator()( const blocked_range<size_t>& r ) const {

float *a = my_a;

for( size_t i=r.begin(); i!=r.end(); ++i ) Foo(a[i]);

}

ApplyFoo( float a[] ) : my_a(a) {}

};


Intel TBB: parallel For (6)

The invocation is then as follows:
blocked_range takes 3 arguments: begin, end, grainsize
o begin, end: iteration start and end
o grainsize: size of a chunk to be executed by a thread.


void ParallelApplyFoo( float a[], size_t n )

{

parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a));

}

Intel TBB: Tips on grainsize

Grainsize should be no more than (total number of iterations) / (number of threads)
A chunk should represent at least about 10,000 operations to amortize the overhead of the work-stealing scheduler
Small loops with good load balancing won't suffer much from scheduler overhead


Intel TBB: more on partitioner

You can choose among 3 partitioners as a 3rd argument:
o simple_partitioner
o auto_partitioner (default)
o affinity_partitioner


void ParallelApplyFoo( float a[], size_t n )

{

parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a), simple_partitioner());

}

Intel TBB: Other constructs

parallel_reduce (see the sketch below)
parallel_while
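A minimal parallel_reduce sketch in the same style as ApplyFoo (SumFoo and ParallelSumFoo are illustrative names): the body accumulates a partial sum over its sub-range, the splitting constructor creates an empty body for stolen work, and join() combines the partial results.

#include <tbb/tbb.h>
using namespace tbb;

struct SumFoo {
    const float *my_a;
    float my_sum;
    SumFoo( const float a[] ) : my_a(a), my_sum(0) {}
    SumFoo( SumFoo& other, split ) : my_a(other.my_a), my_sum(0) {}    // splitting constructor
    void operator()( const blocked_range<size_t>& r ) {
        for( size_t i=r.begin(); i!=r.end(); ++i ) my_sum += my_a[i];  // partial sum over this chunk
    }
    void join( const SumFoo& rhs ) { my_sum += rhs.my_sum; }           // combine two partial sums
};

float ParallelSumFoo( const float a[], size_t n )
{
    SumFoo sf(a);
    parallel_reduce( blocked_range<size_t>(0,n), sf );
    return sf.my_sum;
}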


Intel Cilk Plus

What is Intel Cilk Plus ?

Intel Cilk Plus is an extension to C and C++ bringing:

o Multi-core support.

o Vector Processing support.

It comes with:

o 3 keywords to manage thread-programming

o Vectorization intrinsics and directives

o Array Notation.


Where can I get Intel Cilk Plus ?

Available with Intel Composer XE Compilers.

Cilk Plus open projects: http://cilkplus.org

o Cilk Plus in GCC

Cilk Plus extension in LLVM.

Specifications are open : http://cilkplus.org


Data Parallelism with Intel Cilk Plus

3 keywords: _Cilk_spawn, _Cilk_sync, _Cilk_for

To simplify, the <cilk/cilk.h> header defines macros:

o cilk_spawn

o cilk_sync

o cilk_for

Cilk Plus keywords do not create threads themselves


Cilk_spawn, Cilk_sync (1)

cilk_spawn: the following statement can be executed by another thread
o It does not create threads; a new task is queued
cilk_sync: all statements spawned with cilk_spawn have to be completed before execution continues
o Implicit cilk_sync at the end of a function


Cilk_spawn, cilk_sync (2)

Recursive Fibonacci function


int fib (int n)

{

if (n < 2)

return n;

int x = cilk_spawn fib(n-1);

int y = fib (n-2);

cilk_sync;

return x+y;

}


Cilk_for

cilk_for is similar to #pragma omp for

o Replace a for statement.

o Iterations shared among threads by the runtime and the compiler.


cilk_for (int i = 0; i < 8; i++)

{

do_work(i);

}

Keeping serial execution

Possible to use the pre-processor to hide Cilk Plus keywords


#define cilk_spawn

#define cilk_sync

int fib (int n)

{

if (n < 2)

return n;

int x = cilk_spawn fib(n-1);

int y = fib (n-2);

cilk_sync;

return x+y;

}


Who creates the threads ?

At the beginning of a program, the Cilk Plus runtime asks the OS to create as many threads as available cores.
Each thread manages a queue containing tasks to execute.


Array notation (1)

Array Notation: Tell the compiler to use SIMD instructions.

Exists in Fortran 90
In Cilk Plus: A[<lower_bound>:<length>:<stride>]


// for(i = 0; i < N; i++) A[i] = c* B[i];

A[:] = c* B[:];

// for(i = 0; i < N; i++) A[i+N/2] = c* B[i];

A[N/2:N/2] = c* B[0:N/2];


Array notation (2)

Fortran90: arrays can overlap

o It generates temporary arrays

Cilk Plus array elements must not overlap:
o Better performance / undefined behavior if arrays overlap


A[0:10] = c* A[1:10]; // undefined behavior (sections overlap)
A[0:10] = c* A[0:10]; // ok, identical sections
A[0:10:2] = c* A[1:10:2]; // ok, elements don't overlap

Array Notation: Dynamic Arrays (3)

For Dynamically Allocated arrays:

o start and length have to be specified


int f (float* A, float* B)

{

A[:] = c* B[:]; // COMPILATION ERROR !

A[0:N] = c*B[0:N]; // OK

}


Array Notation: Multi-dimensional arrays (4)
Multi-dimensional arrays: same constraints as 1D.


// for (i = 0; i < N; i ++)

// for (j = 0; j < N; j++)

// B[i][j] = 0;

B[:][:] = 0;

Array Notation: conditionals (5)

Conditionals


// for (i = 0; i < N; i ++)

// if (a[i] == 0)

// result[i] += 1;

if (a[:] == 0)

result[:] += 1;


Array Notation: builtin functions (6)

Built-in functions

o __sec_reduce_add (A[:]) // sum = A[0] + A[1] + … + A[N-1]

o __sec_reduce_mul (A[:]) // product

o __sec_reduce_max(A[:]) // max among elements of A

o __sec_reduce_min(A[:]) // min

o __sec_reduce_max_index(A[:]) // return index of max element

o __sec_reduce_min_index(A[:]) // return index of min element

o __sec_reduce_all_zero (A[:]) // true if all zero

o __sec_reduce_any_zero (A[:]) // true if a zero exist

o __sec_reduce_all_nonzero (A[:]) // true if all elements are non-zero
o __sec_reduce_any_nonzero (A[:]) // true if at least one element is non-zero

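For example, array notation and a reduction built-in can be combined; a short sketch (x, y and N are hypothetical names):

// dot product: element-wise multiply two sections, then reduce the result
float dot = __sec_reduce_add (x[0:N] * y[0:N]);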

Array Notation: elemental functions (6)

Function call can be declared with __declspec(vector) or

o 2 functions generated by compiler: 1 SIMD one scalar.


__declspec (vector) void calc (float* B)

{

B[0] = B[0] + rand ();

}

calc (B[:]); // SIMD version is called

calc (B[5]); // sequential version is called


Vectorization directive

#pragma simd in front of loops can be used instead of array sections

Clauses exist: linear, reduction, private etc…


#pragma simd

for (int i = 0; i < N; i++)

{

B[i] = B[i] + i;

}

Cilk Plus with MIC

Can be used alone, or combined with other programming models: offload directives, OpenMP, etc.
Straightforward vectorization with Cilk Plus on MIC.
_Cilk_shared: share memory between host and MIC.
o Only applies to statically allocated memory.


OpenMP

The master thread
The master thread executes all the sequential regions and creates some slave threads
Thread ID = 0


What’s OpenMP

A standard API for shared memory parallel applications

In C/C++ or Fortran

Consists of compiler directives, runtime routines and environment variables
Works on the fork/join model
An OpenMP program is portable
Requires less programming effort than Pthreads


History

OpenMP 1.0 for Fortran (1997) & C/C++ (1998)

OpenMP 2.0 for Fortran (2000) & C/C++ (2002)

o Major revision

2005 : OpenMP 2.5

o Extensive rewrite and clarification

2008 : OpenMP 3.0

o Task

o Better support for nested parallelism

2013: OpenMP 4.0

o Accelerator directives


The parallel directives

A parallel region is a block of code that will be executed by multiple threads

How to declare a parallel region

To put a synchronization barrier use


#pragma omp parallel

{

}

!$OMP PARALLEL

!$OMP END PARALLEL

#pragma omp barrier
!$omp barrier

Example

Code :

Execution :


#pragma omp parallel

{

printf("hello world !\n");

}

$ ./a.out

Hello world !

Hello world !

Hello world !

Hello world !


The parallel loops

There is a useful directive to parallelize loops
All the threads will independently execute iterations of the loop
In C
In Fortran
We can fuse the parallel directive with the for/do directive


#pragma omp for

for (i=0; i<N; i++){

}

!$OMP DO

DO i=0,N

ENDDO

!$OMP END DO

Example C/C++


#include <omp.h>

#define N 10000

int main(){

int i, tab[N];

init(tab);

#pragma omp parallel for

for (i=0; i<N; i++)

{

tab[i] = foo(i, tab[i]);

}

}


Example Fortran


PROGRAM main

USE omp_lib

IMPLICIT NONE

INTEGER, PARAMETER :: N = 10000
INTEGER :: i
INTEGER, DIMENSION(N) :: tab

CALL init(tab);

!$omp parallel do

DO i=1, N

tab(i) = foo(i, tab(i))

ENDDO

!$omp end parallel do

END PROGRAM main

Get some useful information
To get the ID of the current thread
To get the current number of threads in the region
To set the number of threads for the next parallel region


int my_id = omp_get_thread_num()

int nbThreads = omp_get_num_threads()

omp_set_num_threads(int n)


Example

Code :

Execution :


#pragma omp parallel

{

int id = omp_get_thread_num();

int size = omp_get_num_threads();

printf("hello world ! From %i of %i\n", id, size);

}

$ ./a.out

Hello world ! From 2 of 4

Hello world ! From 0 of 4

Hello world ! From 3 of 4

Hello world ! From 1 of 4

The data sharing clauses

Data can be shared by threads or private to each thread
o private: declare a list of variables private to each thread
o shared: declare a list of variables shared by all threads (default behaviour)
o default: declare the default behavior for each variable
o default(none) will return a compilation error for each variable whose data sharing is not explicitly specified

These clauses can be added to the directives o Parallel

o For/Do

o Single


#pragma omp directive private( variable [, variable])

#pragma omp directive shared( variable [, variable])

#pragma omp directive default(none | shared)   (default(private) exists only in Fortran)


Example using shared & private

#include <omp.h>

#define N 1000

main ()

{

int i;

float a[N], b[N], c[N];

/* Some initializations */

for (i=0; i < N; i++)

a[i] = b[i] = i * 1.0;

#pragma omp parallel for shared(a,b,c) private(i)

for (i=0; i < N; i++)

c[i] = a[i] + b[i];

}


The compilers and OpenMP

On Linux

o gcc inputfile -fopenmp
o icc inputfile -openmp
o gfortran inputfile -fopenmp
o ifort inputfile -openmp
On Windows
o icc inputfile /Qopenmp
o ifort inputfile /Qopenmp


The single directive (1)

Only one thread executes this region of code

Second case: any of the threads can execute this part, but only one of them does
[Figure: timeline — one thread executes the code that must run just once while the other threads wait]

The single directive (2)

A region delimited by these directives will be executed by only one of all the threads (master or slaves)

Only one thread executes this region

Synchronization at the end


#pragma omp parallel

{

#pragma omp single

{

}

}

//end of parallel region

!$OMP PARALLEL

! Parallel region

!$OMP SINGLE

!only one thread in

!$OMP END SINGLE

!$OMP END PARALLEL


The master directive (1)

Only the master thread executes a region of code


The master directive (2)

A region delimited by these directives will be executed by the

master

Only the master thread executes this region


#pragma omp parallel

{

#pragma omp master

{

}

}

//end of parallel region

!$OMP PARALLEL

! Parallel region

!$OMP MASTER

!only the master in

!$OMP END MASTER

!$OMP END PARALLEL


The Critical Region (1)

[Figure: timeline of a critical region — threads reach the start of the critical region, execute the critical part one at a time while the others wait, then continue after the end of the critical region]

The Critical Region (2)

All threads execute the code, but only one at a time

A lightweight special form exists, but it applies only to the following statement


#pragma omp critical

{

}

!$OMP CRITICAL

!$OMP END CRITICAL

#pragma omp atomic

!$OMP ATOMIC


Example : critical & atomic

Is the result in X the same at the end?


PROGRAM CRITICAL

INTEGER X

X = 0

!$OMP PARALLEL SHARED(X)

!$OMP CRITICAL

X = X + 1

X = X * 2

!$OMP END CRITICAL

!$OMP END PARALLEL

END PROGRAM CRITICAL

PROGRAM ATOMIC

INTEGER X

X = 0

!$OMP PARALLEL SHARED(X)

!$OMP ATOMIC

X = X + 1

!$OMP ATOMIC

X = X * 2

!$OMP END PARALLEL

END PROGRAM ATOMIC

The performance clauses

To make sure parallelism is worthwhile, you can use the if clause, so that at runtime you can adapt the behavior of your application
At the beginning of a parallel region, you can also set the number of threads


#pragma omp parallel if(cond)

{

}

#pragma omp parallel num_threads(n)

{

}


The nowait clause

There are some implicit synchronization barriers at the end of the blocks marked by these directives o Do/For

o Single

To avoid the synchronization, put this clause

o The threads will directly continue after the end of the region, but beware of data dependencies


#pragma omp for nowait

for(i=0; i<N; i++){

}

Environment variables

Number of threads

Limit of thread in the system

o Default value is 1024

The size of the stack for each threads

o Default value is 4 MB on 32-bit systems and 8 MB on 64-bit systems


$ export OMP_NUM_THREADS=[0-9*]

$ export OMP_THREAD_LIMIT=n

$ export OMP_STACKSIZE=size


Information
To set the number of threads for the next parallel regions
Get the maximum number of processors
Get time references
o Get the current time in seconds
o omp_get_wtick() gives the precision of the timer


omp_set_num_threads(n)

int nb_procs = omp_get_num_procs()

double t = omp_get_wtime()
double tick = omp_get_wtick()

Reduction (1)

Reduction applies to the following directives:

o for

o parallel

o sections

Declare a reduction:

o In C/C++ : reduction (operator: list)

o In fortran : reduction (operation|intrinsic : list)

The REDUCTION clause performs a reduction on the

variables that appear in the given list


Reduction (2)

Restrictions on the variables in the list:

o must be named scalar variables

o must be declared SHARED in the enclosing context

Principle:

o For each thread a private copy of each list variable is created

o At the end of the reduction, the private copies are combined into one scalar

o The final result is written to the global shared variable.


Example


#include <omp.h>

#define N 10000

int main(){

int i, tab[N];

int sum = 0;

init(tab);

#pragma omp parallel for reduction(+:sum)

for (i=0; i<N; i++)

{

sum += foo(i, tab[i]);

}

}


Reduction Operator : C/C++


Operator   Operation
+          Sum
-          Subtraction
*          Product
&          Bitwise AND
|          Bitwise OR
^          Bitwise exclusive OR
&&         Logical AND
||         Logical OR
min        Minimum (OpenMP 3.1)
max        Maximum (OpenMP 3.1)

Reduction Operator : Fortran


Intrinsic/Operation Operation

+ Sum

- Subtraction

* Product

.EQV. Equality

.NEQV. Non-equality

.AND. Logical AND

.OR. Logical OR

MIN Minimum

MAX Maximum

IAND Bitwise AND

IOR Bitwise OR

IEOR Bitwise Exclusive OR


Sections

The sections directive specifies that the code in the enclosed section blocks is to be divided among the threads in the team

Each section is executed once

May be declared into a parallel region

Declare parallel sections:
o omp sections [clause[[,] clause]…]
      omp section
         code block
      omp section
         code block
   omp end sections [nowait]


Sections : example

#pragma omp parallel

{
    int tid = omp_get_thread_num();   /* thread id used in the printf calls below */

#pragma omp sections nowait

{

#pragma omp section

{

printf("Thread %d doing section 1\n",tid);

workToDoInSection1();

}

#pragma omp section

{

printf("Thread %d doing section 2\n",tid);

workToDoInSection2();

}

}

}


Task

Tasks are the most important change in OpenMP 3.0

Each thread encountering a task construct creates the task and adds it to a task pool

The task may be executed by the encountering thread, or deferred for execution by any other thread in the team

Declare a task: o omp task [clause [,clause] ]

Task synchronization: o omp taskwait


What can be done with task ?

Allows parallelizing irregular problems o Unbounded loops

o Recursive algorithms

o Producer/Consumers

o …

Tasks are work units which may be deferred

Tasks executed by threads in the team

A task is created each time a thread encounters a task construct

Tasks can be nested


Task : example

#pragma omp parallel

{

#pragma omp single nowait

{

int iter = 25;

while( iter > 0)

{

#pragma omp task

myfunction(iter);

iter = iter -1;

}

}

#pragma omp taskwait

}


Hardware Accelerators


Why Use Accelerators?

For application containing hotspots

o Parts of code where the majority of the execution time is spent

If hotspots are highly parallel:

o The application can be accelerated

o Otherwise, the algorithm may have to be changed

Accelerators can also be viewed as low-power compute

nodes

o Make the application power efficient


Hardware Accelerator

In HPC, two kinds of accelerators are extremely popular

o GPUs, such as the K20 in Titan (2nd fastest supercomputer)

• Nvidia Tesla

• AMD Firepro

o Intel Xeon Phi, for example in Tianhe 2 (fastest supercomputer)

• Based on x86 technology


GPUs


Today’s Hybrid/Heterogeneous Compute Node

Streaming engines (e.g. GPU) o Application specific

architectures (“narrow band”)

o Vector/SIMD

o Can be extremely fast

General purpose cores o Share a main memory

o Core ISA provides fast SIMD instructions

o Large cache memories


Stream computing

Stream programming is well suited to GPU

o But memory hierarchy is exposed

A similar computation is performed on a collection of data (stream)

o There is no data dependence between the computation on different stream elements


What is GPGPU ?

General Purpose computation using GPU in applications other than 3D graphics

o GPU accelerates critical path of application

Data parallel algorithms leverage GPU attributes

o Large data arrays, streaming throughput

o Fine-grain SIMD parallelism

o Low-latency floating point (FP) computation

Applications – see //GPGPU.org

o Game effects (FX) physics, image processing

o Physical modeling, computational engineering, matrix algebra, convolution, correlation, wave propagation


Previous GPGPU Constraints

Dealing with graphics API

o Working with the corner cases of the graphics API

Addressing modes

o Limited texture size/dimension

Shader capabilities

o Limited outputs

Instruction sets

o Lack of Integer & bit ops

Communication limited

o Between pixels

o Scatter a[i] = p

[Figure: legacy programmable fragment pipeline — per-thread input/output/temp registers, per-shader constants, per-context textures, and framebuffer memory]

CUDA


“Compute Unified Device Architecture”

General purpose programming model o User kicks off batches of threads on the GPU

o GPU = dedicated super-threaded, massively data parallel co-processor

Targeted software stack o Compute oriented drivers, language, and tools

Driver for loading computation programs into GPU o Standalone Driver - Optimized for computation

o Interface designed for compute - graphics free API

o Data sharing with OpenGL buffer objects

o Guaranteed maximum download & readback speeds

o Explicit GPU memory management


An Example of Physical Reality Behind CUDA

[Figure: a typical PC architecture — CPU and host memory attached to the northbridge (DDR), peripherals on the southbridge (ATA, SATA, PCI, USB, ...), and the GPU with its own device memory attached through PCI-e]


Parallel Computing on a GPU with Nvidia

NVIDIA GPU Computing Architecture o Via a separate HW interface

o In laptops, desktops, workstations, servers

The next GK110 GPUs (K20) will deliver up to 4.5 TFLOPS (SP) on compiled parallel C applications o 1 TFLOPS DP

Programmable in C with CUDA tools o Programming model scales transparently

o Multithreaded SPMD model uses application data parallelism and thread parallelism

Examples: Tesla K20, GeForce GTX 680, Tesla C2075

Extended C

Declspecs

o global, device, shared, local, constant

Keywords

o threadIdx, blockIdx

Intrinsics

o __syncthreads

Runtime API

o Memory, symbol, execution management

Function launch


__device__ float filter[N];

__global__ void convolve (float *image) {

__shared__ float region[M];

...

region[threadIdx.x] = image[i];

__syncthreads()

...

image[j] = result;

}

// Allocate GPU memory

float *myimage;  cudaMalloc((void**)&myimage, bytes);

// 100 blocks, 10 threads per block

convolve<<<100, 10>>> (myimage);


Compilation Path

[Figure: compilation path — a C/C++ CUDA application goes through nvcc, which produces CPU code plus virtual PTX code; a PTX-to-target compiler then generates physical code for the actual GPU (e.g. GT200, GF100)]

Hardware Overview

Device contains
o Multiprocessors
o Host access interface
o Memory
o 4 generation groups:
  • 1.0, 1.1 (8800, 9800)
  • 1.2, 1.3 (GTX220, C1060)
  • 2.0 (C2050)
  • 3.0 (K10), 3.5 (K20)
Multiprocessors contain
o ALUs
o Registers
o Shared Memory
o Access to Local Memory
o Access to Global Memory

[Figure: device organization — the device holds global, constant and texture memory plus N multiprocessors; each multiprocessor contains ALUs with registers, access to local memory and a shared memory, and is driven by the host]


Nvidia Architecture Overview

[Figure: block diagrams of the GT200, GF100 and GK110 (SMX) streaming multiprocessors]

Memory Sizes across GPU Generations

Specifications               | 1.0-1.1    | 1.2-1.3    | 2.0         | 3.x
Multiprocessors              | Up to 16   | Up to 30   | Up to 16    | Up to 14
ALU (SP) / Multipro.         | 8          | 8          | 32          | 192
32-bit Registers / Multipro. | 8 k        | 16 k       | 32 k        | 64 k
Shared Mem / Multipro.       | 16 kB      | 16 kB      | 16 / 48 kB  | 16 / 48 kB
Constant Memory              | 64 kB      | 64 kB      | 64 kB       | 64 kB
Local Memory                 |            |            |             |
Global Memory                | Up to 4 GB | Up to 4 GB | Up to 12 GB | Up to 12 GB
Cache on Global Mem          | No / -     | No / -     | Yes / L1-L2 | Yes / L1-L2
Size of L2 Cache             | -          | -          | 768 kB      | Up to 1536 kB
Size of L1 Cache / multipro. | -          | -          | 16 / 48 kB  | 16 / 48 kB


Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks
o All threads share the data memory space
A thread block is a batch of threads that can cooperate with each other by:
o Synchronizing their execution
  • For hazard-free shared memory accesses
o Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
o Atomic operations added in HW 1.1

[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D arrangement of blocks, e.g. Block (0,0) … Block (2,1), and each block is a 2D arrangement of threads, e.g. Thread (0,0) … Thread (4,2)]

Block and Thread IDs

Threads and blocks have IDs
o So each thread can decide what data to work on
o Block ID: 1D, 2D or 3D
o Thread ID: 1D, 2D or 3D
Simplifies memory addressing when processing multidimensional data
o Image processing
o Solving PDEs on volumes
o …



Block and Thread keywords

Block keywords
o threadIdx.{x,y,z} defines the thread index inside the block
o blockDim.{x,y,z} defines the block dimensions
Grid keywords
o blockIdx.{x,y,z} defines the block index inside the grid
o gridDim.{x,y,z} defines the grid dimensions
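A minimal sketch (kernel and array names are illustrative) showing how these built-in variables combine into a global index over a 1D data set:

__global__ void scale(float *data, int n, float alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard the last, partially filled block
        data[i] = alpha * data[i];
}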

[Figure: a grid of gridDim.x × gridDim.y blocks, each block being a blockDim.x × blockDim.y × blockDim.z arrangement of threads]

Memory Space Overview

Each thread can:
o R/W per-thread registers
o R/W per-thread local memory
o R/W per-block shared memory
o R/W per-grid global memory
o Read only per-grid constant memory
o Read only per-grid texture memory
The host can:
o R/W global memory
o R/W constant memory
o R/W texture memory

[Figure: CUDA memory spaces — each thread has registers and local memory, each block has a shared memory, and the grid shares global, constant and texture memory, which the host can also access]


Memory Allocation

• cudaMalloc()
o Allocates an object in the device Global Memory
o Requires two parameters
  • Address of a pointer to the allocated object
  • Size of the allocated object
• cudaFree()
o Frees an object from device Global Memory
  • Pointer to the freed object
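A minimal sketch (d_a and N are hypothetical names): allocate N floats in device global memory, then release them.

float *d_a = NULL;
size_t bytes = N * sizeof(float);
cudaMalloc((void**)&d_a, bytes);   // 1st parameter: address of the pointer, 2nd: size in bytes
/* ... use d_a in kernels ... */
cudaFree(d_a);                     // free the object from device global memory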


CUDA Host-Device Data Transfer

• cudaMemcpy()
o Memory data transfer
o Requires 4 parameters
  • Pointer to destination
  • Pointer to source
  • Number of bytes copied
  • Type of transfer
    – Host to Host
    – Host to Device
    – Device to Host
    – Device to Device
• Asynchronous variant supported on 1.1+ HW
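A minimal sketch (h_a and d_a are hypothetical host/device buffers, and scale is the kernel sketched earlier): copy the input to the device, launch the kernel, copy the result back.

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // destination, source, byte count, direction
scale<<<(N + 255) / 256, 256>>>(d_a, N, 2.0f);          // one thread per element
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);    // blocking copy: the result is ready afterwards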



CUDA Function Declarations

__global__ defines a kernel function

o Must return void


                               | Executed on the: | Only callable from the:
__device__ float DeviceFunc()  | device           | device
__global__ void KernelFunc()   | device           | host
__host__ float HostFunc()      | host             | host

CUDA Functions Declaration

Address of __device__ functions cannot be taken

For functions executed on the device:

o No recursion (HW < 2.0)

o Recursion possible for __device__ function (HW >= 2.0)

o No static variable declarations inside the function

o No variable number of arguments


Calling a Kernel Function

Thread Creation

A kernel function must be called with an execution configuration:

o Any call to a kernel function is asynchronous, explicit synchronization needed for blocking


__global__ void KernelFunc(...);

dim3 DimGrid(100, 50); // 5000 thread blocks

dim3 DimBlock(8, 8, 4); // 256 threads per block

KernelFunc<<< DimGrid, DimBlock >>>(...);

cudaThreadSynchronize();

Compilation

Any source file containing CUDA language extensions must be compiled with nvcc

nvcc is a compiler driver o Works by invoking all the necessary tools and compilers like

cudacc, g++, cl, ...

nvcc can output: o Either C code

• That must then be compiled with the rest of the application using another tool

o Or object code directly


Linking

Any executable with CUDA code requires one dynamic library: o The CUDA runtime library (cudart)

With gcc, you may need to link with the standard C++ library o libstdc++


Debugging : CudaGDB

On Linux or Mac OS X

Compile your application with nvcc and the -g and -G options

Execute the debugger with : cuda-gdb

Possible to : o Get device information, gridDim and blockDim

o Break on the host and in the kernel

o Switch between CUDA Threads and host thread

Can be integrated to : o Emacs GUI

o DDD

Another available debugger o Allinea DDT


GPU Debugging Pitfalls

But not all illegal program behavior can be caught

Conditions to Debug application on the local machine o Linux

• If single GPU, no Graphical Server running on the system

• 2 GPUs on the machine, 1 running the Graphical Server and the second running the CUDA program

o Windows

• Only possible if there are two GPUs

• 1 for the visualization

• 1 to debug the CUDA application

On a remote machine, no problem


Profiler



Parallel Nsight
Available on Windows and Linux
o Integrated into Microsoft Visual Studio
o Integrated into the Eclipse IDE
Debugging CUDA applications
o Using the Microsoft Visual Studio windows: Memory, Locals, Watches and Breakpoints
Analyzing the performance of your GPGPU applications
o CUDA
o OpenCL
o DirectCompute


Warps

Each block of threads is split into warps

Each warp contains the same number of threads: 32

Each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread #0

Each warp is executed by a multiprocessor in a SIMD fashion


Warps (2)

Divergent branches within a warp cause serialization

o If all threads in a warp take the same branch, no extra cost

o If threads take two different branches, the entire warp pays the cost of both branches of code

o If threads take n different branches, entire warp pays cost of n branches of code


Coalescing 1.0/1.1

A coordinated read by a half-warp (16 threads)

A contiguous region of global memory o 64 bytes: each thread reads a word (int, float, ...)

o 128 bytes: each thread reads a double-word (int2, float2, ...)

o 256 bytes: each thread reads a quad-word (int4, float4,...)

Additional restrictions on G8x/G9x architecture: o Starting address for a region must be a multiple of region size

o The kth thread in a half-warp must access the kth element in a block being read

Exception: not all threads must be participating o Predicated access, divergence within a half-warp
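A minimal sketch (hypothetical copy kernels) contrasting an access pattern that coalesces with one that does not:

__global__ void copy_coalesced(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];               // consecutive threads touch consecutive words: coalesced
}

__global__ void copy_strided(float *out, const float *in, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];      // stride > 1 scatters the half-warp's reads: not coalesced
}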


[Figure: Coalesced Access 1.0/1.1 — reading floats]
[Figure: Uncoalesced Access 1.0/1.1 — reading floats]


Coalesced Access 2.x

Cache on global memory may hide coalescing issues

2 levels of cache
o 16-48 KB of L1 per SM
o 768 KB of L2, shared by all SMs

Memory Latency

o Global: 400-800 cycles

o L2 Cache: 100-200 cycles

o L1 Cache: about 4 cycles (without bank conflict)


Shared Memory

Hundreds of times faster than global memory

o About same latency as registers

o 32 banks can be accessed simultaneously with 2.x compute capability

o Successive 32 bits words are assigned to successive banks

Threads of a same block can cooperate via shared memory

o Up to 48 KBytes with 2.x compute capability by multiprocessor

Can be used to avoid non-coalesced accesses


Shared Memory:

Performance Issues

The fast case:

o If all threads of a half-warp (or warp with cc 2.x) access different banks, there is no bank conflict

o If all threads of a half-warp (or warp with cc 2.x) read the identical address, there is no bank conflict (broadcast)

The slow case:

o Bank conflict: multiple threads in the same half-warp (or warp with cc 2.x) access different words in the same bank

o Must serialize the accesses

o Cost = max # of simultaneous accesses to a single bank


Shared Memory Access

[Figure: access pattern with no bank conflicts — each thread of the half-warp accesses a different bank]


Shared Memory Access (2)

[Figure: each thread reads the identical address in the same bank — no conflict (broadcast); threads accessing different words in the same bank — conflict!]

Optimizing Threads per Block

Choose threads per block as a multiple of the warp size
o Avoid wasting computation on under-populated warps
More threads per block == better memory latency hiding
o Kernel invocations can fail if too many registers are used

Heuristics o Minimum Required by the HW: 64 threads per block

• Only if multiple concurrent blocks

o 192 or 256 threads a better choice

• Usually still enough regs to compile and invoke successfully

o This all depends on your computation, so experiment!


Grid/Block Size Heuristics

# of blocks > # of multiprocessors o So all multiprocessors have at least one block to execute

# of blocks / # of multiprocessors > 2 o Multiple blocks can run concurrently in a multiprocessor

o Blocks that aren't waiting at a __syncthreads() keep the hardware busy
o Subject to resource availability - registers, shared memory
# of blocks > 100 to scale to future devices
o Blocks are executed in a pipeline fashion
o 1000 blocks per grid will scale across multiple generations


Asynchronicity & Overlapping

Default CUDA API

o Kernel launches are asynchronous with CPU

o Memcopies block CPU thread (H2D=HostToDevice,

D2H=DeviceToHost)

o CUDA calls are sequential on GPU, serialized by the driver

But CUDA also offers asynchronicity and overlapping

o Asynchronous memcopies (D2H, H2D) with CPU

o Ability to concurrently execute a kernel and a memcopy


Page-locked Memory, Principles (1)

Operating systems handle memory with a mechanism called paged virtual memory:
o Divides the virtual address space of an application into memory pages (default on x86 is 4 KiBytes)
o Allows applications to use more memory than the physical RAM available on the system, by swapping pages to a disk
o The physical address of a page can change; this is transparent to the application, as the virtual address does not change
Pages can be locked by the OS into physical memory to prevent swapping and to guarantee a permanent physical address


Page-locked Memory, Principles (2)

A PCI-express device can only directly access physical addresses, never an application's virtual address space
o So only page-locked memory can be directly exploited by the hardware
CUDA allows the application to request page-locked memory from the CUDA kernel driver
Both the application and the device can directly use such memory
o No need for time-consuming intermediate copies between the application's virtual address space and the device's on-board memory


CUDA page-locked memory

All CUDA versions allow the application to request page-locked memory, often called pinned memory
o No other application, not even the OS, can use the locked pages.
Do not use too much page-locked memory!
All CUDA memory copy functions take advantage of pinned memory
Pinned memory is a prerequisite for asynchronous memory copies


Different way to use Page-Locked Memory

Allocation directly in Page-Locked Memory


//Allocate the data in physical RAM

cudaMallocHost((void**) &hostPtr, size);

cudaFreeHost(hostPtr);

//Do not forget it or the data will stay alive in your Main memory


Asynchronicity (1)

Synchronous execution example:

o The application waits for the GPU to complete the requested task.


Asynchronicity (2)

The asynchronous version:

o Control is returned to the application before the device has completed the requested task.


Asynchronicity (3)

Advantages

o Enables full exploitation of the hardware available on the machine (CPU + GPU together)
o Kernel launches are already asynchronous, no need to modify your code

Drawback

o Needs explicit synchronization for data coherency

o Transfers require extra work to setup asynchronicity

But speed benefit already makes the extra work useful


Overlapping

Concurrent execution of GPU kernel and transfers from/to GPU o Makes use of asynchronicity

o Particularly handy when data make frequent, expensive round-trips between CPU and GPU

Typical cases o Several independent problems

o Several instances of a problem

o A single problem split into a set of sub-problems
Requires the use of streams in your CUDA code


Basics of CUDA Streams (1)

You said “stream”?

o A sequence of operations that execute in order on GPU

Streams have the following properties:

Streams use asynchronicity

o Concurrent execution between CPU and GPU

Streams enable overlapping

o Concurrent execution of a kernel on the GPU and transfers from or to the GPU


Basics of CUDA Streams (2)

How it works:

Operations from different streams can be interleaved

A kernel and a memcopy from different streams can be overlapped


Code Example

// data allocation

float * hostPtr;

cudaMallocHost((void**) &hostPtr, 2 * size);

// streams declaration

cudaStream_t stream[2];

for(int i = 0; i < 2; ++i)

cudaStreamCreate(&stream[i]);

// streamed copy from host to device

for(int i = 0; i < 2; ++i)

cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,

cudaMemcpyHostToDevice, stream[i]);

// streamed execution of the kernel

for(int i = 0; i < 2; ++i)

myKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);

// streamed copy from device to host

for(int i = 0; i < 2; ++i)

cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,

cudaMemcpyDeviceToHost, stream[i]);

// threads synchronization

cudaThreadSynchronize();


Using multiple CUDA Accelerators with MPI

#CUDA accelerators > #cores
o Multiple MPI processes per core (beware of CPU overload)
#CUDA accelerators == #cores
o The ideal case: generally one MPI process per core and GPU

o CPU may be idle while GPU is working

#CUDA accelerators < #cores o Share the GPUs?

o Lock the GPUs?

o Load Balancing CPU & GPU?


Resident Data

Think differently: instead of transferring data from the CPU to the GPU before each kernel and back after it, use the resident data mechanism.
[Figure: timeline comparison — with per-kernel round trips, data is moved CPU->GPU and GPU->CPU around each of kernels A and B; with resident data, it is transferred to the GPU once, both kernels compute there, and the result comes back once]

Reducing Transfers

Use GPU-resident data as much as possible o Send once, use many times, read once

o Can tremendously boost performance

o Transfers can easily be the dominant factor in GPU usage

• Then follow Amdahl’s Law by optimizing transfers rather than kernels

Examples o Multiple steps of computations in a loop

o Multiple steps of computations in sequence

Do everything requiring the resident data on the GPU if possible o Unless the computations do not fit GPU at all


Partial transfers

Think differently: instead of transferring whole arrays around each kernel, use the partial transfer mechanism.
[Figure: timeline — only the sub-arrays involved in host-side I/O or host-side computation steps are moved between CPU and GPU around kernels A and B; the rest of the data stays resident on the GPU]

Minimizing Quantities

Again, maximize resident data, this time by keeping sub-arrays on the GPU
o Send once, use and update many times, read once
o If some data must absolutely come from outside the GPU
o If some data must absolutely go outside the GPU
  • Network or disk I/O
  • Computation steps that cannot be implemented on the GPU
Warning: each data transfer has an initial overhead


Reducing Transfers

The GPU computes faster than it performs transfers
o Sometimes it is better to re-compute data than to retrieve it from a remote memory

Don’t try to factorize data to save memory, think performance o Memory saving is often a performance killer

• Allocate more memory to re-align data onto the GPU’s global memory

• Allocate more memory to avoid bank conflicts in shared memory

• Re-compute data to avoid transfers…

Avoid computing borders on the GPU o Border cases are often performance killers due to:

• Incomplete warps

• Branch divergences

• Incomplete coalesced segments


CuComplex Header

Complex numbers: cuComplex header

o Single or double precision (double requires HW >= 1.3)

o Include cuComplex.h

www.caps-entreprise.com 282
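A minimal sketch of what cuComplex.h provides (single precision; the helper cmac below is hypothetical):

#include <cuComplex.h>

// complex multiply-accumulate: acc + a * b
__device__ cuFloatComplex cmac(cuFloatComplex acc, cuFloatComplex a, cuFloatComplex b)
{
    return cuCaddf(acc, cuCmulf(a, b));
}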


CuBLAS Library

Basic Linear Algebra Subprograms

o Include cublas.h

o Link with libcublas.so (linux) or cublas.dll (windows)

o Up to BLAS3 (same arguments)

Available functions

o Dot-product : cublasXdot()

o Matrix multiplication : cublasXgemm()

o …

User guide : http://developer.nvidia.com/cuda-toolkit-40

www.caps-entreprise.com 283
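A hedged sketch using the legacy cuBLAS API referenced above (error checking omitted):

#include <cublas.h>

float dot_on_gpu(const float *x, const float *y, int n)
{
    float *d_x, *d_y, result;
    cublasInit();                                     // initialize cuBLAS
    cublasAlloc(n, sizeof(float), (void**)&d_x);      // device allocations
    cublasAlloc(n, sizeof(float), (void**)&d_y);
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1);  // host -> device copies
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);
    result = cublasSdot(n, d_x, 1, d_y, 1);           // dot product on the GPU
    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
    return result;
}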

CuFFT Library

Fast Fourier Transform

o Include cufft.h

o Link with libcufft.so (linux) or cufft.dll (windows)

o 1D, 2D or 3D

Datatype

o Real or Complex type

o Single or double precision (double requires HW >= 1.3)

User guide : http://developer.nvidia.com/cuda-toolkit-40

www.caps-entreprise.com 284
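A minimal sketch of a 1D single-precision complex-to-complex transform (d_data is assumed to be a device buffer of n cufftComplex elements):

#include <cufft.h>

void fft1d_inplace(cufftComplex *d_data, int n)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);                // one 1D transform of size n
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // in-place forward FFT
    cufftDestroy(plan);
}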


Thrust

Templated Performance Primitives Library for CUDA

Similar to the C++ STL

Available functionality

o Containers

o Iterators

o Sort

o Scan

o Reduction

o …

www.caps-entreprise.com 285
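A minimal Thrust sketch (sort then sum on the GPU; h is a host array of n ints):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int sum_sorted(const int *h, int n)
{
    thrust::device_vector<int> d(h, h + n);    // copy host data to the device
    thrust::sort(d.begin(), d.end());          // GPU sort
    return thrust::reduce(d.begin(), d.end()); // GPU reduction (sum)
}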

NPP Library

NVIDIA Performance Primitives library

GPU-accelerated image, video, and signal processing

functions

5x to 10x faster performance than CPU

Available functions

o Filter functions

o JPEG functions

o Geometry transforms

o Statistics functions

o …

www.caps-entreprise.com 286


OpenCL

Before OpenCL

GPGPU o Vertex / pixel shaders

o Heavily constrained and not adapted

CTM / Brook o Then Brook+

o Then CAL/IL

CUDA o Widely adopted

None of these technologies is hardware-agnostic o Portability is not possible

www.caps-entreprise.com 288


What is Hybrid Computing with OpenCL?

OpenCL is o Open, royalty-free, standard

o For cross-platform, parallel programming of modern processors

o An Apple initiative

o Approved by Intel, Nvidia, AMD, etc.

o Specified by the Khronos group (same as OpenGL)

It intends to unify the access to heterogeneous hardware accelerators o CPUs (Intel i7, …)

o GPUs (Nvidia GTX & Tesla, AMD/ATI 58xx, …)

What’s the difference with CUDA or CAL/IL? o Portability over Nvidia, ATI, S3… platforms + CPUs

www.caps-entreprise.com 289

OpenCL Devices

NVIDIA

o All CUDA cards

AMD GPUs

o Radeon & Radeon HD

o FirePro, FireStream

o Mobility…

Intel & AMD CPUs

o X86 w/ >= SSE 3.x

Cell/B.E.

DSP

ARM

www.caps-entreprise.com 290


Inputs/Outputs with OpenCL programming

OpenCL architecture

www.caps-entreprise.com 291

(software stack) Application → OpenCL kernels / OpenCL framework (OpenCL C language, OpenCL API) → OpenCL runtime → Driver → GPU hardware

OpenCL and C for CUDA

www.caps-entreprise.com 292

(diagram) C for CUDA is the entry point for developers who prefer high-level C; OpenCL (C for OpenCL) is the entry point for developers who prefer a low-level API. Both share the same backend compiler and optimization technology and are lowered to PTX for the GPU.


Compilation & Execution

Really simple

Include the OpenCL header o #include <CL/cl.h>

Link with the OpenCL library

To execute

www.caps-entreprise.com 293

$ g++ -o myprogram myprogram.cc -L/PATH/TO/OPENCLLIB -lOpenCL

$ ./myprogram

OpenCL APIs

C language API o Binding C++ (official)

o Binding Java

o Binding Python

o …

In the remainder we will only see the C API o And lab sessions focus on the C API

The C++ API is available on the Khronos Website o http://www.khronos.org/registry/cl/

Extensions exist to o OpenGL

o Direct3D

www.caps-entreprise.com 294


Platform Model

Model consists of one or more interconnected devices

Computations occur within the Processing Elements of each device

www.caps-entreprise.com 295

Platform Version

3 different kinds of versions for an OpenCL device

The platform version

o Version of the OpenCL runtime linked with the application

The device version

o Version of the hardware

The language version

o Highest revision of the OpenCL standard that this device supports

www.caps-entreprise.com 296


Execution Model

Kernels are submitted by the host application to devices through command queues

Kernel instances, called Work-Item (WI), are identified by their point in the NDRange index space o This enables to parallelize the execution of the kernels

But still 2 programming models are supported o Data-parallel

o Task parallel

So even if we have a single execution model, we need two different programming approaches according to the paradigm we are considering

www.caps-entreprise.com 297

NDRange

NDRange is a N-Dimensional index space

o N is 1, 2 or 3

o NDRange is defined by an integer array of length N specifying the

extent of the index space on each dimension

www.caps-entreprise.com 298
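A hedged host-side sketch of an NDRange launch (queue and kernel are assumed to exist): 1024x1024 work-items organized in 16x16 work-groups.

size_t global[2] = {1024, 1024};
size_t local[2]  = {16, 16};
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local,
                                    0, NULL, NULL);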


Work-Groups & Work-Items

Work-Items are organized into Work-Groups (WG)

Each Work-group has a unique global ID in the NDRange

Each Work-item has

o A unique global ID in the NDRange

o A unique local ID in its work-group

www.caps-entreprise.com 299
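A minimal kernel sketch showing the IDs available to each work-item in a 1D NDRange:

__kernel void scale(__global float *a, float alpha)
{
    size_t gid = get_global_id(0);   // unique global ID in the NDRange
    // get_group_id(0) gives the work-group ID,
    // get_local_id(0) the local ID inside the work-group
    a[gid] = alpha * a[gid];
}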

Parallelism Grains

CPU cores can handle only a few tasks

o But more complex

• Hard control flows

• Memory cache

o They can be either CPU threads or processes

• CPU threads: OpenMP, Pthread

• CPU Processes: MPI, fork()…

GPU threads are extremely lightweight

o Very little creation overhead

o Simple and regular computations

o GPU needs 1000s of threads (w.i.) for full efficiency

www.caps-entreprise.com 300


Memory Model

Four distinct memory regions o Global Memory

o Local Memory

o Constant Memory

o Private Memory

Global and Constant memories are common to all WI o May be cached depending on the hardware capabilities

Local memory is shared by all WI of a WG

Private memory is private to each WI

www.caps-entreprise.com 301
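A minimal sketch (not from the slides) of the four memory regions as seen from a kernel:

__kernel void regions(__global float *g,     // global memory
                      __constant float *c,   // constant memory
                      __local float *l)      // local memory, shared by the work-group
{
    float p = c[0];                          // private memory (registers)
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    l[lid] = g[gid] * p;                     // stage data in local memory
    barrier(CLK_LOCAL_MEM_FENCE);            // synchronize the work-group
    g[gid] = l[lid];                         // write back to global memory
}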

Memory Architecture

www.caps-entreprise.com 302


Memory Transfer (1)

2 types of transfers o Blocking (“synchronous”)

o Non-Blocking (“asynchronous”)

In the clEnqueueReadBuffer / clEnqueueWriteBuffer functions, set the “blocking” argument to: o CL_TRUE, to make a blocking transfer

o CL_FALSE, make a non-blocking transfer

For a non-blocking transfer o Need to link an event to the transfer command

o The event will be used for producer-consumer relationship, and/or explicit waiting

www.caps-entreprise.com 303

Memory Transfer (2)

Synchronizing ensures that data have been transferred to/from the device at this point

Example

Or can be used for the dependency flow in out-of-order queues o Use clEnqueueWaitForEvents() to synchronize in-order queues

www.caps-entreprise.com 304

cl_int clWaitForEvents (cl_uint num_events, const cl_event *evt_list)

cl_event evt;

err = clEnqueueReadBuffer( cmd_queue, buf_on_device, CL_FALSE, 0,

size, buf_on_host, 0, NULL, &evt );

clFlush(cmd_queue);

… //some work that doesn’t change the content of buf_on_host

clWaitForEvents(1, &evt);

… //work that may change buf_on_host (after transfer)


Queue Policy

A command queue is linked to a specific device

By default the command queue is in-order

o But you can use this option to make it out-of-order

CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE

www.caps-entreprise.com 305
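A hedged sketch of creating an out-of-order queue (context and device are assumed to exist):

cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device,
                                              CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
                                              &err);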

Out-of-order Queue

With out-of-order queues

o No guarantee about the execution order

o So we need events dependence to ensure producer/consumer

relationship

When there is a dependence between commands

o Link cl_event objects to these commands

o Create list of events for this dependence

o Enqueue the command with its list of dependences

o The command will be launched only when all listed events have

terminated

Larger granularity : barrier

o Force waiting for all commands before the barrier to complete

www.caps-entreprise.com 306
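A hedged sketch of a producer/consumer dependence in an out-of-order queue (queue, kernelA, kernelB and gsize are hypothetical):

cl_event evA, evB;
// kernel A produces data consumed by kernel B
clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &gsize, NULL, 0, NULL, &evA);
// kernel B waits for evA even though the queue is out-of-order
clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &gsize, NULL, 1, &evA, &evB);
clWaitForEvents(1, &evB);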


Synchronizing Queues and Commands

May I synchronize an event according to others in…

• OO: Out-of-Order queue
• IO: In-Order queue
• ✓: available

                              Same OO queue   Another OO queue   Same IO queue   Another IO queue
clWaitForEvents()                   ✓                ✓                ✓                 ✓
clEnqueueWaitForEvents()            ✓                ✓              Useless             ✓
clEnqueueBarrier()                  ✓                                 ✓

www.caps-entreprise.com 307

Intel® Xeon Phi™


Phi Specifications

Discrete accelerators

Connected to host by PCIe

Passively or actively cooled

Embeds 50+ 64-bit x86 CPU cores

Has its own DRAM

Runs its own Linux OS

www.caps-entreprise.com 309


Phi Products

                          3120P / 3120A      5110P / 5120D        7110P / 7120X
Max. # of Cores                57                  60                   61
Frequency (GHz)               1.1                1.053                1.238
Cache (MB)                    28.5                 30                  30.5
Memory Capacity (GB)            6                   8                   16
Memory Bandwidth (GB/s)       240               320 / 352              352
Peak DP (TFLOPS)             1.003               1.011                1.208
TDP (W)                       300               225 / 245              300
Cooling                 Passive / Active    Passive / Dense       Passive / None
                                            Form Factor
Applications            Compute-bound       Memory bandwidth-     Supercomputing
                        workloads           bound workloads       centers
                        (MonteCarlo,        (STREAM, …)
                        Black-Scholes, …)

www.caps-entreprise.com 310


Architectures Comparison

311

(diagram) CPU: general-purpose architecture (few ALUs, large control logic and cache). MIC: power-efficient multiprocessor. GPU: massively data parallel (many ALUs sharing DRAM).

www.caps-entreprise.com

Coprocessor Card Design

Up to 16 channels of GDDR5 memory

Up to 8 GB GDDR5 (350 GB/s peak)

PCIe Gen2 compliant

Flash memory for coprocessor startup

System Management Controller (SMC) handles monitoring and control chores

www.caps-entreprise.com 312


Microarchitecture of the Entire Coprocessor

Can be viewed as: o a symmetric multiprocessor

(SMP)

o with a shared uniform memory access (UMA) system

Up to 61 cores interconnected by a ring interconnect (ODI) o Transactions are managed

transparently by the hardware

512 kB (L2) cache per core o Cache coherency across the

entire multiprocessor thanks to the tag directory (TD) mechanism

o Data remains consistent without software intervention

www.caps-entreprise.com 313

Individual Coprocessor Core Architecture

22nm using 3D Trigate transistors

Highly parallel and power-efficient design o Based on Intel architecture for programmability

64-bit execution

In-order code execution model with multithreading

4 hardware threads per core

Clocked at ~1GHz

32 kB of L1 instruction cache

32 kB of L1 data cache

512 kB of private (local) L2 cache

Instructions are fetched from memory, decoded, dispatched and executed either by o A scalar unit using traditional x86 and x87

instructions

o A vector processing unit (VPU) using the Intel Initial Many Core Instructions (IMCI) utilizing a 512-bit wide vector length

• No support for MMX instructions, Intel AVX or SSE extensions

www.caps-entreprise.com 314

(core block diagram: instruction fetch → instruction decode → scalar unit (scalar registers) and vector unit (vector registers); 32 kB L1 Icache + 32 kB L1 Dcache; 512 kB local L2 cache subset; On-Die Interconnect)


Instruction and Multithread Processing

Derived from Pentium design

Two instructions per clock cycle o One on the U-pipe

o One on the V-pipe

V-pipe cannot execute all instruction types o Vector instructions are mainly

executed only on the U-pipe

Instruction decoder is a two-cycle fully pipelined unit o Single-threaded code will therefore use at most
50% of the issue slots

o At least 2 threads should be run per core

Instruction latency: 4 cycles

www.caps-entreprise.com 315

Xeon Phi Multithreading Capabilities

Xeon Phi has 4 hardware threads

o Xeon Phi can handle 4 live threads at the same time on each core

o Unlike Hyper-threading enabled CPUs, hardware threads cannot be

turned off

Intended to hide latencies

o Memory accesses and computations

o Inherent to in-order architecture

Maximum performance may be reached before that number

o Saturation may happen with two or three threads only

316 www.caps-entreprise.com


Cache Memory Considerations

Cores can supply L2 cache data to each other on-die: data may be replicated

If no data or code is shared between the cores o L2 size is 30.5 MB (61 cores)

If every core shares the same code and data o L2 size is 512 kB

L2 usable size depends on how code and data are shared among the cores

L1 cache latency: 1 cycle

L2 cache latency: 15-30 cycles

GDDR5 latency: 500-1000 cycles

317 www.caps-entreprise.com

VPU Architecture

8 mask registers for SIMD lane predicated execution

Extended Mathematical Unit (EMU) for SP exponent, logarithm, reciprocal and reciprocal square root operations

IEEE 754 2008 compliant

318

(register file) 32 vector registers (V0–V31) of 512 bits each; 8 vector mask registers (K0–K7) of 16 bits each; throughput of 16 SP or 8 DP elements per cycle

www.caps-entreprise.com


Vectorization Brings Performance

Maximizing vectorization is as important as using enough

threads

Single precision

o 1.1 GHz * 61 cores * 16 SP lanes * 2 ops/cycle (FMA) = 2.147 TFLOPS

Double precision

o 1.1 GHz * 61 cores * 8 DP lanes * 2 ops/cycle (FMA) = 1.074 TFLOPS

319

www.caps-entreprise.com

Coprocessor Software Overview

Software tools are similar to those available on the host

Development Tools and Runtimes o Intel Parallel Studio XE 2013

• Intel Composer: Intel C/C++ and Intel Fortran compilers, parallel debugger, performance libraries

• Intel Inspector: identifies memory errors and threading errors

• Intel Advisor: helps programmer parallelize their code

• Intel Amplifier: performance profiler

Intel Manycore Platform Software Stack (Intel MPSS) o Specific to the coprocessor

• Middlewares

• Device drivers

• Coprocessor management utilities

• Linux OS

• GNU tools (gcc, gdb, …)

320 www.caps-entreprise.com


Coprocessor Linux OS

Intel Xeon Phi runs an autonomous linux OS

o Linux kernel version 2.6.34 or greater

o Minimal, with small memory footprint

• Includes Linux Standard Base (LSB) Core libraries and Busybox minimal

shell environment

o Controlled by the host via the PCIe bus

The host boots the card and provides the linux boot image

Be careful, memory on the Xeon Phi is volatile

o Data is lost each time the card is rebooted

321 www.caps-entreprise.com

MPSS Boot

Host coprocessor driver mic.ko:

o Provides PCIe access

o Loads linux kernel into the accelerator’s memory

o Starts booting

mpssd daemon

o Controls booting based on configuration files

o mpssd is a linux service

micctrl application

o Configures the linux OS on the coprocessor

322 www.caps-entreprise.com


Coprocessor Startup Configuration

Enable root access

o Add SSH keys in /root/.ssh

Generate default configuration files

o default.conf and micN.conf in /etc/sysconfig/mic

Start MPSS service

323

user@host $ sudo micctrl --initdefaults

user@host $ sudo service mpss start

www.caps-entreprise.com

Coprocessor Administration

Use micctrl utility

o Check coprocessor(s) status

o Check coprocessor(s) configuration

o Re-boot coprocessor(s)

o Shut down coprocessor(s)

o Add user

324

user@host $ micctrl --config

user@host $ micctrl -s

user@host $ micctrl -R [mic coprocessorlist]

user@host $ micctrl -S [mic coprocessorlist]

user@host $ micctrl --useradd=<name> \

--sshkeys=<keydir> [mic coprocessorlist]

www.caps-entreprise.com


Available Execution Models

Native execution model o Application is compiled for and executed on Xeon Phi

o Or application is both compiled for host and Xeon Phi

• May run on both architectures

• May introduce communications between both versions

• Requires architecture-agnostic infrastructure

o Original application can be reused

Processor-centric execution model o Application runs on the host

o And offloads selected parts of code onto Xeon Phi

o Communications are driven by the host

o Application has to be modified

325 www.caps-entreprise.com

Available Execution Models

(diagram) Intel MIC execution models: Native (MPI, OpenMP) and Accelerator (Intel Offload, Intel MKL, OpenCL, OpenACC)

326 www.caps-entreprise.com


Intel Offload

OpenMP Specifications

Currently: version 3.1

o Released in 2011

o Does NOT support accelerators

Version 4.0

o Released in July 2013

o Supports accelerators (target directives)

Intel expected OpenMP 4.0 to reuse its offload directives
and intends to support it now that the specification is finalized

Intel MIC Programming 328


Intel Offload Directive Model

Syntax o In C:

o In FORTRAN:

Implements following behaviors: o Coprocessor memory allocation

o Data transfer from host to coprocessor

o Execution on coprocessor

o Data transfer from coprocessor to host

o Coprocessor memory deallocation

Intel MIC Programming 329

#pragma offload <clauses>

<statement>

!DIR$ offload <clauses>

<statement>

Offload Computations to the Coprocessor

Intel MIC Programming 330

! Fortran OpenMP

!dir$ omp offload target(mic)

!$omp parallel do

Do i=1, count

A(i)=B(i)*c+d

End do

!$omp end parallel do

// C/C++ OpenMP

#pragma offload target(mic)

#pragma omp parallel for

for(i=0; i<count; i++)

{

a[i]=b[i]*c+d;

}

Next statement can execute on coprocessor “mic” if available, else processor

Next OpenMP parallel construct can execute on coprocessor “mic” if available, else processor


Function and Variables on Coprocessor

Compile functions for, or allocate variables on, both the host

and the coprocessor

In C

In FORTRAN

Intel MIC Programming 331

__attribute__((target(mic))) <var/function>

__declspec(target(mic)) <var/function>

!DIR$ attributes offload:<mic>::<var/function>

#pragma offload_attribute(push,target(mic))

code

#pragma offload_attribute(pop)

Copy Clauses

Clause             Syntax                         Semantics
Inputs             in(var-list : modifiers)       Copy from the host to the accelerator
Outputs            out(var-list : modifiers)      Copy from the accelerator to the host
Inputs / Outputs   inout(var-list : modifiers)    Copy to the accelerator before offloading
                                                  computations and back afterwards
Non-copied data    nocopy(var-list : modifiers)   Data is local to the accelerator target

Intel MIC Programming 332
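A hedged sketch combining these clauses (a, b, tmp and n are hypothetical): a is only read, b is only written back, and tmp is scratch data that never leaves the coprocessor.

#pragma offload target(mic) in(a[0:n]) out(b[0:n]) nocopy(tmp[0:n])
{
    for (int i = 0; i < n; i++) {
        tmp[i] = 2.0f * a[i];
        b[i]   = tmp[i] + 1.0f;
    }
}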


Modifier Options

Modifiers

o Specify pointer length, length(N): transfer N elements of the pointer's type (N*sizeof(element) bytes). It cannot be used to offload part of an array

o Control target data allocation, alloc(array section reference): limits memory allocation to the shape of the array section specified

o Control pointer memory allocation, alloc_if(condition): allocate memory for the pointer if condition is true

o Control freeing of pointer memory, free_if(condition): free memory used by the pointer if condition is true

o Move data from one variable to another, into(array section reference): transfers data from a variable on the host to another variable on the coprocessor, and vice versa

o Control target data alignment, align(expression): specify the minimum memory alignment on the accelerator

Intel MIC Programming 333

Allocating Partial Arrays in C

In the example above:

o 7000 integers are allocated on the host

o 6000 integers are allocated on the coprocessor

o The first element on the coprocessor has index 10

o The last element on the coprocessor has index 6009

o 500 elements, in the range ptr[100]-ptr[599], are copied on the

coprocessor

Intel MIC Programming 334

int *ptr = (int*) malloc(7000*sizeof(int));

#pragma offload in(ptr[100:500] : alloc(ptr[10:6000]))

{ … }


Allocating Partial Arrays in FORTRAN

In the example above:

o 30 integers are allocated on the host

o 18 integers are allocated on the coprocessor

o The first element on the coprocessor has index 3

o The last element on the coprocessor has index 20

o 8 elements, in the range T(7)-T(14), are copied on the coprocessor

Intel MIC Programming 335

INTEGER :: T(30)

!DIR$ OFFLOAD IN(T(7:14) : ALLOC(T(3:20)))

Moving Data Example

The example above performs a copy of the first 1500

elements of t1 to the last 1500 elements of t2

Intel MIC Programming 336

int t1[2000], t2[5000];

#pragma offload in(t1[0:1500]:into(t2[3500:1500]))


Memory Management

By default, directives allocate fresh memory on the Xeon Phi

memory for each variable when entering the construct

By default, memory is deallocated when exiting the construct

Memory allocation is expensive: modifiers can change the

default behavior to reuse memory space

o If data has been allocated by a previous offload construct

o If data has been allocated by an attribute directive

o If data should be reused by an other offload construct

Intel MIC Programming 337

Persistent Storage Example

Intel MIC Programming 338

int nb = 1000;

int *t = (int*) malloc(nb*sizeof(int));

void bar()

{

#pragma offload in(t[0:nb] : alloc_if(1) free_if(0))

foo(t, nb);

}

void foo(int * ptr, int size)

{

#pragma offload in(ptr[0:size]:alloc_if(0))

}

Allocation of t of size nb on the coprocessor

Reuse of t already on the coprocessor and free t at the end of offload section


Static and Dynamic Memory Example

Intel MIC Programming 339

__declspec(target(mic)) int array_host_mic[5000];

int array_host[5000];

void bar()

{

foo(&array_host[0], 5000);

foo(&array_host_mic[0], 5000);

}

void foo(int *t, int nb)

{

#pragma offload in(t[0:nb]:alloc_if(0))

}

Dynamic allocation of t on the coprocessor

Reuse of static allocation of array_host_mic on the coprocessor

Allocation on the processor and the coprocessor

Asynchronous Behavior

By default, Intel Offload directives cause the host thread to wait for the completion of the Xeon Phi instruction before going on to the next statement

Asynchronous behavior can be used specifying a signal clause to the offload directive

An offload_wait directive should be used to ensure completion

Intel MIC Programming 340

(timeline diagrams: with the default synchronous behavior the CPU waits for the MIC before executing the next statement; with a signal clause the CPU continues its own work while the MIC computes, and synchronizes later with offload_wait)


Asynchronous Computations Example (C)

341

char sig;
int count = 1000;
__attribute__((target(mic))) void mic_compute();

do {
    #pragma offload target(mic) signal(&sig)
    {
        mic_compute();
    }
    host_activity();
    #pragma offload_wait target(mic) wait(&sig)
    count = count - 1;
} while(count > 0);

www.caps-entreprise.com

Asynchronous Computations Example (FORTRAN)

342

integer sig

integer count

count = 1000

!dir$ attributes offload:mic::mic_compute

do while (count .gt. 0)

!dir$ offload target(mic:0) signal(sig)

call mic_compute()

call host_activity()

!dir$ offload_wait target(mic:0) wait(sig)

count = count - 1

end do

www.caps-entreprise.com


Asynchronous Transfers

Use offload_transfer directives instead, for example in C

343

char sig1, sig2, sig3;
float *p = (float*)malloc(N*sizeof(float));

#pragma offload_transfer target(mic:0) in(p:length(N)) signal(&sig1)
host_activity();

#pragma offload target(mic:0) wait(&sig1) signal(&sig2)
{
    foo(N, p);
}

#pragma offload_transfer target(mic:0) wait(&sig2) out(p:length(N)) signal(&sig3)
host_activity();

#pragma offload_wait target(mic:0) wait(&sig3)

www.caps-entreprise.com

Compile and Execute Intel Offload Applications

Source the Intel Compiler environment

user@host $ source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64

Compile with -openmp

o To ignore Intel Offload directives, use the -no-offload flag

o To display the offload optimizer report, use -opt-report-phase=offload

user@host $ icc -openmp myProgram.c -o myApp.mic

Execute from the host

user@host $ ./myApp.mic

To retrieve basic profiling information

user@host $ export H_TRACE=1

To retrieve timing information

user@host $ export H_TIME=1

344

www.caps-entreprise.com


Directive Standard

Directive-based Programming (1)

Three ways of programming GPGPU applications:

www.caps-entreprise.com 346

Libraries: ready-to-use acceleration

Directives: quickly accelerate existing applications

Programming languages: maximum performance


Advantages of Directive-based Programming

Simple and fast development of accelerated applications

Non-intrusive

Helps to keep a unique version of code o To preserve code assets

o To reduce maintenance cost

o To be portable on several accelerators

Incremental approach

Enables "portable" performance

www.caps-entreprise.com 347

OpenACC


Various Many-core Paths

www.caps-entreprise.com 349

AMD Discrete GPU
• Large number of small cores
• Data parallelism is key
• PCIe to CPU connection

AMD APU
• Integrated CPU+GPU cores
• Target power-efficient devices at this stage
• Shared memory system with partitions

INTEL Many Integrated Cores
• 50+ x86 cores
• Support conventional programming
• Vectorization is key
• Run as an accelerator or standalone

NVIDIA GPU
• Large number of small cores
• Data parallelism is key
• Support nested and dynamic parallelism
• PCIe to host CPU or low power ARM CPU (CARMA)

OpenACC Initiative

www.caps-entreprise.com 350

Launched by CAPS, Cray, Nvidia and PGI in November 2011 o Allinea, Georgia Tech, U. Houston, ORNL, Rogue

Wave, Sandia NL, Swiss National Computing Center, TUD joined in 2012

Open Standard

A directive-based approach for programming heterogeneous many-core hardware for C/C++ and FORTRAN applications

Specification version 2.0 (June 2013)

http://www.openacc-standard.com


OpenACC Compilers (1)

CAPS Compilers:

Source-to-source

compilers

Support Intel Xeon Phi,

NVIDIA GPUs, AMD

GPUs and APUs

PGI Accelerator

Extension of x86 PGI

compiler

Support Intel Xeon Phi,

NVIDIA GPUs, AMD

GPUs and APUs

Intel MIC Programming 351

Cray Compiler:

Provided with Cray systems only

CAPS Compilers (2)

Take the original application as input and generate another

application source code as output

o Automatically turn the OpenACC source code into an accelerator-

specific source code (CUDA, OpenCL)

Compile the entire hybrid application

Just prefix the original compilation line with capsmc to

produce a hybrid application

352

$ capsmc gcc myprogram.c

$ capsmc gfortran myprogram.f90

Intel MIC Programming


Compilation Paths

CAPS Compilers drives all compilation passes

Host application compilation o Calls traditional CPU

compilers o CAPS Runtime is linked

to the host part of the application

Device code production o According to the

specified target o A dynamic library is

built

www.caps-entreprise.com 353

(compilation flow) The C, C++ and Fortran frontends feed an extraction module that splits the application into host code and codelets. The host code goes through the instrumentation module and the traditional CPU compiler (gcc, ifort, …) and is linked with the CAPS Runtime into the executable (mybin.exe). The codelets go through CUDA or OpenCL code generation and the CUDA/OpenCL compilers, producing the HWA code as a dynamic library.

CAPS Compilers Options

Usage:

$ capsmc [CAPSMC_FLAGS] <host_compiler> [HOST_COMPILER_FLAGS] <source_files>

To specify the accelerator-specific code to generate

$ capsmc --openacc-target CUDA gcc myprogram.c #(default)

$ capsmc --openacc-target OPENCL gcc myprogram.c #(for Xeon Phi)

To display the compilation process

$ capsmc --openacc-target OPENCL -d -c gcc myprogram.c

354

Intel MIC Programming


Programming Model

Express data and computations to be executed on an accelerator

o Using marked code regions

Main OpenACC constructs

o Parallel and kernel regions

o Parallel loops

o Data regions

o Runtime API

355

Data/stream/vector

parallelism to be

exploited by HWA e.g. CUDA / OpenCL

CPU and HWA linked with a

PCIx bus

Intel MIC Programming

Execution Model

Among a bulk of computations executed by the CPU, some regions can be offloaded to hardware accelerators o Parallel regions o Kernels regions

Host is responsible for: o Allocating memory space on accelerator o Initiating data transfers o Launching computations o Waiting for completion o Deallocating memory space

Accelerators execute parallel regions: o Use work-sharing directives o Specify level of parallelization

356 Intel MIC Programming


OpenACC Execution Model

Host-controlled execution

Based on three parallelism levels

o Gangs – coarse grain

o Workers – fine grain

o Vectors – finest grain

357

(diagram: a device runs several gangs; each gang contains workers; each worker executes vector lanes)

Intel MIC Programming

Gangs, Workers, Vectors

In CAPS Compilers, gangs, workers and vectors correspond

to the following in an OpenCL grid

Beware: this implementation is compiler-dependent

358

numGroups(1) = 1

numGroups(0) = number of gangs

localSize(1) = number of workers

localSize(0) = number of vectors

Intel MIC Programming


Directive Syntax

C

#pragma acc directive-name [clause [, clause] …]

{

code to offload

}

Fortran

!$acc directive-name [clause [, clause] …]

code to offload

!$acc end directive-name

359

Intel MIC Programming

Parallel Construct

Starts parallel execution on the accelerator

Creates gangs and workers

The number of gangs and workers remains constant for the parallel region

One worker in each gang begins executing the code in the region

www.caps-entreprise.com 360

#pragma acc parallel […]

{

for(i=0; i < n; i++) {

for(j=0; j < n; j++) {

}

}

}

Code executed on the hardware

accelerator


Gangs, Workers, Vectors in Parallel Constructs

In parallel constructs, the number of gangs, workers and vectors is the same for the entire section

The clauses: o num_gangs o num_workers o vector_length

Enable to specify the number of gangs, workers and vectors in the corresponding parallel section

www.caps-entreprise.com 361

#pragma acc parallel, num_gangs(128) \

num_workers(256)

{

for(i=0; i < n; i++) {

for(j=0; j < m; j++) {

}

}

}


Loop Constructs

A loop directive applies to the loop that immediately follows the directive

The parallelism to use is described by one of the following clauses:

o Gang for coarse-grain parallelism

o Worker for middle-grain parallelism

o Vector for fine-grain parallelism

www.caps-entreprise.com 362


Loop Directive Example

With gang, worker or vector clauses, the iterations of the following loop are executed in parallel

Gang, worker or vector clauses enable to distribute the iterations between the available gangs, workers or vectors

www.caps-entreprise.com 363

#pragma acc parallel, num_gangs(128) \

num_workers(192) \

vector_length(32)

{

#pragma acc loop gang

for(i=0; i < n; i++) {

#pragma acc loop worker

for(j=0; j < m; j++) {

#pragma acc loop vector

for(k=0; k < l; k++) {

}

}

}

}


Kernels Construct

Defines a region of code to be compiled into a sequence of accelerator kernels o Typically, each loop nest will be a distinct kernel

The number of gangs and workers can be different for each kernel

www.caps-entreprise.com 364

#pragma acc kernels […]

{

for(i=0; i < n; i++) {

}

for(j=0; j < n; j++) {

}

}

!$acc kernels […]

DO i=1,n

END DO

DO j=1,n

END DO

!$acc end kernels

1st Kernel

2nd Kernel


Gang, Worker, Vector in Kernels Constructs

The parallelism description is the same as in parallel sections

However, these clauses accept an argument to specify the number of gangs, workers or vectors to use

Every loop can have a different number of gangs, workers or vectors in the same kernels region

www.caps-entreprise.com 365

#pragma acc kernels

{

#pragma acc loop gang(128)

for(i=0; i < n; i++) {

}

#pragma acc loop gang(64)

for(j=0; j < m; j++) {

}

}


Data Independency

In kernels sections, the clause independent on loop directive specifies that iterations of the loop are data-independent

The user does not have to think about gangs, workers or vector parameters

It allows the compiler to generate code to execute the iterations in parallel with no synchronization

www.caps-entreprise.com 366

A[0] = 0;

#pragma acc loop independent

for(i=1; i<n; i++)

{

A[i] = A[i]-1;

}


What is the problem using discrete accelerators?

PCIe transfers have huge latencies

In kernels and parallel regions, data are implicitly managed

o Data are automatically transferred to and from the device

o Implies possible useless communications

Avoiding transfers leads to a better performance

OpenACC offers a solution to control transfers

www.caps-entreprise.com 367

Device Memory Reuse

In this example: o A and B are allocated

and transferred for the first kernels region

o A and C are allocated and transferred for the second kernels region

How to reuse A between the two kernels regions? o And save transfer and

allocation time

www.caps-entreprise.com 368

float A[n];

#pragma acc kernels

{

for(i=0; i < n; i++) {

A[i] = B[n – i];

}

}

init(C)

#pragma acc kernels

{

for(i=0; i < n; i++) {

C[i] += A[i] * alpha;

}

}


Memory Allocations

Avoid data reallocation using the create clause o It declares variables, arrays or subarrays to be allocated in the device

memory

o No data specified in this clause will be copied between host and device

The scope of such a clause corresponds to a data region

Kernels and Parallel regions implicitly define data regions

The present clause declares data that are already present on the device

www.caps-entreprise.com 369

Create and Present Clause Example

www.caps-entreprise.com 370

float A[n];

#pragma acc data create(A)

{

#pragma acc kernels present(A)

{

for(i=0; i < n; i++) {

A[i] = B[n – i];

}

}

init(C)

#pragma acc kernels present(A)

{

for(i=0; i < n; i++) {

C[i] += A[i] * alpha;

}

}

}

Allocation of A of size n on the device

Deallocation of A on the device

Reuse of A already allocated on the device

Reuse of A already allocated on the device


Data Storage: Mirroring

How is the data stored in a data region?

A data construct defines a section of code where data are mirrored between host and device

Mirroring duplicates a CPU memory block into the HWA memory o Users ensure consistency of copies via directives

www.caps-entreprise.com 371

(diagram: the master copy lives in host memory; a mirror copy lives in HWA memory; the CAPS runtime keeps a descriptor linking the two)

Arrays and Subarrays

In C and C++, arrays are specified with start and length

o For example, with an array of size n

In FORTRAN, arrays are specified with a list of range specifications

o For example, with an array a of size (n,m)

In any language, any array or subarray must be a contiguous block of memory

www.caps-entreprise.com 372

#pragma acc data create(a[0:n])

!$acc data create(a(0:n,0:m))


Transfers: Copyin Clause

Declares data that need only to be copied from the host to the device when entering the data section

o Performs input transfers only

It defines scalars, arrays and subarrays to be allocated on the device memory for the duration of the data region

www.caps-entreprise.com 373

#pragma acc data create(A[:n])

{

#pragma acc kernels present(A[:n]) \

copyin(B[:n])

{

for(i=0; i < n; i++) {

A[i] = B[n – i];

}

}

#pragma acc kernels present(A[:n])

{

for(i=0; i < n; i++) {

C[i] = A[i] * alpha;

}

}

}

Transfers: Copyout Clause

Declares data that need only to be copied from the device to the host when exiting data section

o Performs output transfers only

It defines scalars, arrays and subarrays to be allocated on the device memory for the duration of the data region

www.caps-entreprise.com 374

#pragma acc data create(A[:n])

{

#pragma acc kernels present(A[:n]) \

copyin(B[:n])

{

for(i=0; i < n; i++) {

A[i] = B[n – i];

}

}

#pragma acc kernels present(A[:n]) \

copyout(C[:n])

{

for(i=0; i < n; i++) {

C[i] = A[i] * alpha;

}

}

}


Transfers: Copy Clause

If we change the example, how to express that input and output transfers of C are required?

Use copy clause to: o Declare data that need to be

copied from the host to the device when entering the data section

o Assign values on the device that need to be copied back to the host when exiting the data section

o Allocate scalars, arrays and subarrays on the device memory for the duration of the data region

www.caps-entreprise.com 375

#pragma acc data create(A[:n])

{

#pragma acc kernels present(A[:n]) \

copyin(B[:n])

{

for(i=0; i < n; i++) {

A[i] = B[n – i];

}

}

init(C)

#pragma acc kernels present(A[:n]) \

copy(C[:n])

{

for(i=0; i < n; i++) {

C[i] += A[i] * alpha;

}

}

}

Present_or_create Clause

Combines two behaviors

Declares data that may be present

o If data is already present, use value in the device memory

o If not, allocate data on device when entering region and deallocate when exiting

May be shortened to pcreate

www.caps-entreprise.com 376


Present_or_copyin/copyout/copy Clauses

If data is already present, use value in the device memory

If not: o present_or_copyin/present_or_copyout/present_or_copy allocate

memory on device at region entry

o present_or_copyin/present_or_copy transfer data from the host at region entry

o present_or_copyout/copy transfer data from the device to the host at region exit

o present_or_copyin/present_or_copyout/present_or_copy deallocate memory at region exit

May be shortened to pcopyin, pcopyout and pcopy

www.caps-entreprise.com 377

Present_or_* Clauses Example

www.caps-entreprise.com 378

program main

!$acc data create(A(1:n))

call f1( n, A, B )

!$acc end data

call f1( n, A, C )

contains

subroutine f1( n, A, B )

!$acc kernels pcopyout(A(1:n)) &

!$acc& copyin(B(1:n))

do i=1,n

A(i) = B(n – i)

end do

!$acc end kernels

end subroutine f1

end program main

Allocation of A of size n on the device

Reuse of A already allocated on the device Allocation of B of size n on the device for the duration of the subroutine and input transfer of B

Deallocation of A on the device

Allocation of A and B of size n on the device for the duration of the subroutine Input transfer of B and output transfer of A

Present_or_* clauses are generally safer


Default Behavior

CAPS Compilers is able to detect the variables required on the device for the kernels and parallel constructs.

According to the specification, depending on the type of the variables, they follow the following policies

o Arrays: present_or_copy behavior

o Scalar

• if not live in or live out variable: private behavior

• copy behavior otherwise

www.caps-entreprise.com 379

OpenACC 2.0: New Features (1)

Atomic operations: o Different kinds of atomic

sections can be executed in parallel/kernels constructs

• Read, write, update, capture

Routine directives o Function invocation from

kernels/parallel constructs

o Functions can be executed from host or device

SC'13

#pragma acc atomic read
v = x;
#pragma acc atomic write
x = 42;
#pragma acc atomic update
x++;
#pragma acc atomic capture
v = x++;

#pragma acc routine (myfunc)
void myfunc( … ) { … }

int main() {
  #pragma acc kernels
  {
    myfunc( … );
  }
  …
  myfunc( … );
}

380


OpenACC 2.0: New Features (2)

Enter/Exit data directives o Enable scope

free data management

Device type clause o Enable

architecture specific optimizations

SC'13

void init(int* array, int size){
  array = malloc(sizeof(int)*size);
  #pragma acc enter data create(array[0:size])
}

int main(){
  int *array;
  …
  init(array, size);
  …
  #pragma acc exit data delete(array)
  …
}

381

#pragma acc kernels device_type(nvidia) num_gangs(64) \
                    device_type(xeonphi) num_gangs(128)
{
  #pragma acc loop gang
  for(int i=0; i<size; i++){
    …
  }
}

OpenMP 4.0


Intel Offload / OpenMP 4.0 / OpenACC 1.0

Offloading computations
o Intel Offload: #pragma offload / !dir$ offload
o OpenMP 4.0: #pragma omp target / !$omp target
o OpenACC 1.0: #pragma acc kernels, #pragma acc parallel / !$acc kernels, !$acc parallel

Work sharing
o Intel Offload: #pragma omp parallel / !$omp parallel
o OpenMP 4.0: #pragma omp parallel / !$omp parallel, #pragma omp teams / !$omp teams, #pragma omp distribute / !$omp distribute
o OpenACC 1.0: #pragma acc loop / !$acc loop, #pragma acc loop gang/worker/vector / !$acc loop gang/worker/vector

Data clauses
o Intel Offload: in, out, inout, alloc_if, free_if
o OpenMP 4.0: map to, map from, map tofrom, map alloc
o OpenACC 1.0: copyin, copyout, copy, create, present, pcopyin, pcopyout, pcopy

Data regions
o Intel Offload: -
o OpenMP 4.0: #pragma omp target data / !$omp target data
o OpenACC 1.0: #pragma acc data / !$acc data

Transfer directives
o Intel Offload: #pragma offload_transfer / !dir$ offload_transfer
o OpenMP 4.0: #pragma omp target update / !$omp target update
o OpenACC 1.0: #pragma acc update / !$acc update

Intel MIC Programming 383

CAPEX / OPEX

with GPU


Goals – Why Using GPUs

Performance

Energy saving

Cheaper machine

Preparing code to manycore

www.caps-entreprise.com 385

Is the Machine Cheaper?

You may want

o To run faster than you competitor

o To run faster than the streamed data come

o To run faster in order to use less energy to compute

o To run differently to save energy

OPEX or CAPEX?

o Capital Expenditures

• Machine cost and software migration, surface cost

o Operational Expenditures

• Energy consumption, hardware and software maintenance

www.caps-entreprise.com 386


CAPEX-OPEX Analysis for a Heterogeneous

System

Capital Expenses (CapEx) o System acquisition cost

o Software migration cost

o Software acquisition cost

o Teaching cost

o Real estate cost

Operational Expenses (OpEx) o Energy cost (system + cooling)

o Maintenance cost

For a given amount of compute work, the CapEx-OpEx analysis indicates the “real” value of a given system o For instance, if I add GPUs, do I save money?

o And how many should I add?

o Then should I use slower CPU?

www.caps-entreprise.com 387

Application Speedup and CapEx-OpEx

Adding GPUs/accelerators to the system o Increases system cost

o Increases base energy consumption (one GPU = x10 watt idle)

Exploiting the GPUs/accelerators o Decreases execution time, so potentially the energy consumption for a

given amount of work

o Reduces the number of nodes of the architecture • Threshold effect on the number of routers etc.

o Requires to migrate the code

Multiple views of the value of application speedup o Shorten time-to-market

• Threshold effect

o More work performed during the lifetime of the system

www.caps-entreprise.com 388


CapEx Hardware Parameters

Choice of the hardware configuration can be: o Fast CPU + Fast GPU (expensive node)

o Slow CPU + Fast GPU

o Fast CPU + Slow GPU

o Slow CPU + Slow GPU

o Fast CPU

o Slow CPU

Nodes performance impact on the number of nodes o More nodes means more network with non negligible cost and energy

consumption

o Less nodes may limit scalability issues if any

Application workload analysis is the only way to decide o Optimizing software can significantly increase performance and so reduce

needed hardware

o Code migration to GPU is on the critical path

www.caps-entreprise.com 389

Small systems: - a few nodes (1-8) - cost x10k€ Large systems - many nodes (x100) - cost x1M€

CapEx: Code Migration Cost

Migration cost o Learning cost

o Software environment cost

o Porting cost

Migration cost is mostly hardware size independent o Not an issue for dedicated large systems

o Different if the machine aims at serving a large community

Main migration benefit is to highlight manycore parallelism o Not specific to one kind of device

o Implementation is specific

Constructor specific implementation solution o Amortize period similar to the one of the hardware (3 years)

Agnostic parallelism expression o Using portable solution for multiple hardware generations (amortized on 10 years)

o Of course not that simple! Still requires some level of tuning

May be very useful for non scalable message passing code

www.caps-entreprise.com 390

Mastering the cost of migration has a significant impact on the total cost for small systems Typical effort: - Manpower: a few Man-Months - Cost: x 10k€


Two Applications Examples

Application 1

• Field: Monte Carlo simulation for thermal radiation

• MPI code

• Migration cost: 1 man month

Application 2

• Field: astrophysics, hydrodynamic

• MPI code

• Requires 3 GPUs per node for having enough memory space

• Migration cost: 2 man month

www.caps-entreprise.com 391

Power Consumption Application 1

www.caps-entreprise.com 392

(chart: CPU energy and GPU energy per run, relative to the baseline energy consumption)

Power usage effectiveness (PUE) = Total facility power / IT equipment power. Current: 1.9, best practice: 1.3. Src: http://www.google.com/corporate/datacenter/efficiency-measurements.html


Power Consumption Application 2

www.caps-entreprise.com 393

Application 1 (migration cost = 1 man.month)
Configuration                 Execution time (s)   System costs   Maintenance costs   Energy costs   CAPEX+OPEX
4 nodes                            6862                1.87€           0.19€             0.37€          2.43€
4 nodes + 4 GPUs                   1744                0.71€           0.07€             0.12€          0.90€
4 nodes + 8 GPUs                   1000                0.51€           0.05€             0.08€          0.64€
4 nodes + 12 GPUs                   731                0.45€           0.04€             0.08€          0.57€

Application 2 (migration cost = 2 man.month)
Configuration                 Execution time (s)   System costs   Maintenance costs   Energy costs   CAPEX+OPEX
4 nodes                             713                0.19€           0.02€             0.025€         0.239€
4 nodes + 12 GPUs                   485                0.30€           0.03€             0.034€         0.358€
4 nodes (slow ck) + 12 GPUs     500 (estim.)           0.24€           0.02€             0.034€         0.302€

CAPEX-OPEX Overview

Comparison on an equivalent workload

o CAPEX = System costs + Migration costs

o OPEX = Energy cost + Computer maintenance cost (10% Computer costs)

www.caps-entreprise.com 394


Cost per Run

Application 1 Application 2

www.caps-entreprise.com 395

(bar charts: cost per run broken down into system costs, maintenance costs (10% of computer costs), energy costs (power + cooling) and migration costs (4 nodes); Application 1 is shown for no GPU, 1 GPU/node, 2 GPU/node and 3 GPU/node; Application 2 for no GPU, 3 GPU/node and 3 GPU/node (slow ck))

Porting Methodology


Methodology to port applications

www.caps-entreprise.com 397

Migration process

Code analysis and definition step o This step performs a diagnosis of the application, specifies the main

porting operations, and makes a gross estimation of the potential speedup as well as porting cost

• This step is concluded by a Go / NoGo Analysis

First port of the application o This step implements a first GPGPU version of the code

o With this version, a GPGPU execution profile can be obtained and bottlenecks can be identified

Fine tuning of the code and setup for production o This is the last step of the migration process; it aims at getting a

well-optimized production code
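
As a hedged illustration of the "gross estimation of the potential speedup" mentioned in the first step above, the following minimal C sketch applies Amdahl's law; the parallel fraction and the expected kernel speedup are assumptions to be replaced by figures from your own profiling runs.

#include <stdio.h>

/* Amdahl's law: if a fraction p of the runtime can be accelerated by a
 * factor s, the overall speedup is 1 / ((1 - p) + p / s).               */
static double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void)
{
    /* Hypothetical profiling figures: 90% of the runtime is spent in
     * hotspots that the GPU port is expected to accelerate 20x.       */
    double p = 0.90, s = 20.0;
    printf("Estimated overall speedup: %.2fx\n", amdahl_speedup(p, s));
    return 0;
}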

www.caps-entreprise.com 398


Phase 1 (details)

www.caps-entreprise.com 399

Step 1: code analysis

Hotspot identification
o Profile the application
o Identify the hotspots to convert to GPU kernels
o Code restructuring may be needed to get parallel hotspots that are computation-intensive enough

CPU analysis
o Check the efficiency of the application on the CPU

Parallelism discovery
o Ensure the kernels can be executed in parallel (see the sketch below)
o If not, reconsider the algorithms
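
A minimal sketch of what parallelism discovery means in practice: the first loop below has independent iterations and is a good kernel candidate, while the second carries a dependence between iterations and would require rethinking the algorithm. Function and variable names are illustrative, not taken from the applications discussed earlier.

/* Independent iterations: each b[i] depends only on a[i] and b[i],
 * so the loop maps directly onto a GPU kernel.                      */
void saxpy(int n, float alpha, const float *a, float *b)
{
    for (int i = 0; i < n; ++i)
        b[i] = alpha * a[i] + b[i];
}

/* Loop-carried dependence: x[i] needs x[i-1], so the iterations cannot
 * run in parallel as written; the algorithm must be reconsidered
 * (for instance, replaced by a parallel prefix-sum).                    */
void running_sum(int n, float *x)
{
    for (int i = 1; i < n; ++i)
        x[i] += x[i - 1];
}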

www.caps-entreprise.com 400


Step 2: Getting a Hybrid GPGPU Program

Convert the hotspots to GPU kernels
o Using HMPP codelets (a hedged OpenACC sketch is shown below)

Helps to identify GPGPU issues

Also helps to validate the parallel implementation
o By comparing CPU and GPU results
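
As an alternative to HMPP codelet syntax, here is a minimal sketch of offloading a hotspot with OpenACC directives; the routine and array names are illustrative.

/* Hotspot offloaded to the GPU: the directive asks the compiler to build
 * a kernel, copy 'in' to the device and copy 'out' back to the host.     */
void scale_and_shift(int n, const float *in, float *out, float scale, float shift)
{
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i)
        out[i] = scale * in[i] + shift;
}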

www.caps-entreprise.com 401

Phase 2 (details)

www.caps-entreprise.com 402


Step 3: Optimizing the hybrid program

Optimize the GPGPU kernels with code transformations

o Loop unrolling & jamming

o Increase the grid dimension

o Distribute loops

o Fuse loops

o …

Proceed incrementally
o One transformation at a time

It is easier to be rigorous than to hunt for bugs afterwards
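
As an illustration of the loop-fusion transformation listed above, this hedged sketch merges two loops over the same range so that a single kernel performs both updates and the data is traversed only once; names are illustrative.

/* Before fusion: two passes over the data, i.e. two kernel launches. */
void before_fusion(int n, float *a, float *b, const float *c)
{
    for (int i = 0; i < n; ++i) a[i] = 2.0f * c[i];
    for (int i = 0; i < n; ++i) b[i] = a[i] + c[i];
}

/* After fusion: one pass, one kernel launch, better data reuse. */
void after_fusion(int n, float *a, float *b, const float *c)
{
    for (int i = 0; i < n; ++i) {
        a[i] = 2.0f * c[i];
        b[i] = a[i] + c[i];
    }
}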

www.caps-entreprise.com 403

How to Avoid Over-Spending Manpower?

Set up version control management

Incrementally port the code

Check-point restart techniques

Do not delay integration

Stick to the plan

Check everything is available

Report every hard issue

Reconsider basic assumptions

Do not minimize the workload

www.caps-entreprise.com 404


An Example of Checklist

Do you know what the target machine is?

Do you have access to all the necessary codes and libraries?

Do you know how to run the codes?

Are there input data sets available?

Do you have an execution profile representative for performance measurements?

Do you have access to the target architecture?

Do you know how to check the results?

Are you allowed to change floating point rounding?

Do you have somebody to ask questions about the application?

Do you need an application domain consultant?

Have you stated the performance gain to achieve with the end-user?

Are debugging tools installed on the target machines?

Did you check the drivers/libraries/OS versions on the target machines?

Has the application code already been run on the target machine?

www.caps-entreprise.com 405

Checking the Results

The validation of output data is essential in a porting process, and it is often neglected

It ensures that the transformations do not affect the application's global behavior
o Considering the application as a black box that must deliver the same output data for a given input data set is not always sufficient
• Being able to read and understand the application is sometimes critical
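
A minimal sketch of one way to compare CPU and GPU output arrays, using a mixed absolute/relative tolerance rather than bitwise equality, since floating-point rounding usually differs between the two versions; the tolerance value is an assumption to adjust per application.

#include <math.h>

/* Returns the number of elements whose difference exceeds the tolerance.
 * 'tol' (e.g. 1e-5 in single precision) is application-dependent.         */
int compare_results(int n, const float *cpu, const float *gpu, float tol)
{
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float ref  = fabsf(cpu[i]);
        float diff = fabsf(cpu[i] - gpu[i]);
        if (diff > tol * (ref > 1.0f ? ref : 1.0f))   /* mixed abs/rel test */
            ++mismatches;
    }
    return mismatches;
}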

www.caps-entreprise.com 406


Hardware

What is the targeted hardware?

o AMD, Nvidia, Xeon Phi?

Do you intend to target multiple devices?

o For example, multiple GPUs

Is the target hardware fixed forever?

o Is it supposed to change in the next years?

www.caps-entreprise.com 407

Software

May I change the compiler?

Is the software target OpenCL or CUDA?

o Is the software target hardware agnostic?

Shall I use OpenACC?

o Can the native CPU version of the application be altered?

www.caps-entreprise.com 408


Keywords: accelerator programming models, directive-based programming, parallel computing, OpenHMPP, OpenACC, GPGPU, many-core programming, parallelization, HPC, OpenCL, code speedup, NVIDIA CUDA, High Performance Computing, CAPS Compilers, CAPS Workbench, portability, performance

Visit the CAPS website: www.caps-entreprise.com

