Chap. 4 Part 1

CIS*3090 Fall 2016


Part 2 of textbook: Parallel Abstractions

How can we think about conducting computations in parallel before getting down to coding?

“Abstraction”: higher-level concepts than program code, with details missing

Covered by ch 4 “First Steps toward par. prog.” and ch 5 “Scalable Algorithmic Techniques”

Authors’ “parallel pseudocode” for specifying par. algos “without biasing toward a programming language”


Two basic ways to organize parallel computations

“How are we going to put all these processors to work on this problem?” “What can we find for them to do?”

Based on analyzing either the data or the process (aka task, in the sense of steps to be taken)


Essence of data parallel (DP)

Apply same operations at once to many different data items

Ultimate example: SIMD instructions

Master/workers typical DP pattern

provided the workers are all doing same operations (on portions of the data set)

DP scales by increasing no. of workers (each processing less data)

DP wins by applying parallelism to instances of data that can be worked on simultaneously
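For a concrete feel, here is a minimal data-parallel sketch in C with OpenMP (illustrative only, not from the text): the same doubling operation is applied to every element, and the worker threads split the array among themselves.

    #include <stdio.h>

    int main(void) {
        double data[1000];
        for (int i = 0; i < 1000; i++) data[i] = i;

        /* data parallel: the same operation applied to many data items,
           each worker thread handling a portion of the array */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            data[i] = data[i] * 2.0;

        printf("last element: %f\n", data[999]);
        return 0;
    }

Compile with -fopenmp; adding workers (threads) gives each one less data, which is exactly the DP scaling axis above.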


Essence of task parallel (TP)

Tasks (processes, threads) specialized to different stages of calculation through which all data instances pass

Pipeline typical TP pattern

TP scales by increasing no. of stages (each performing fewer operations)

TP wins by increasing throughput, applying parallelism to the subtasks
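A minimal task-parallel sketch in C (illustrative only; the stages and data are made up): two pthreads form a pipeline, with a POSIX pipe serving as the queue between stages.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static int fds[2];                 /* pipe: stage 1 -> stage 2 */

    static void *stage1(void *arg) {   /* stage 1: produce values */
        (void)arg;
        for (int i = 0; i < 10; i++) {
            int v = i * i;
            write(fds[1], &v, sizeof v);
        }
        close(fds[1]);                 /* signal end of stream */
        return NULL;
    }

    static void *stage2(void *arg) {   /* stage 2: consume values */
        (void)arg;
        int v;
        while (read(fds[0], &v, sizeof v) == (ssize_t)sizeof v)
            printf("%d\n", v);
        return NULL;
    }

    int main(void) {
        pipe(fds);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, stage1, NULL);
        pthread_create(&t2, NULL, stage2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Every item passes through both stages; adding more (specialized) stages raises throughput, the TP scaling axis above.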


Example: Red Cross blood drive in Peter Clark Hall

Problem: taking blood donations from a large no. of people without being too time-consuming

“Data set” = the donors

“Processors” = RC personnel & volunteers


[Diagram of donor flow: 1a, 1b → Queue 1 → 1c, 1d → Queue 2 → 2a-e → 3]


Operations per donor

1) Screening:

a) Identify & check file

b) Test sample for iron & blood type

c) Take temperature & blood pressure

d) Answer health questions

2) Donation:

a) Lie down

b) Sanitize arm

c) Stick needle

d) Collect blood (obeying qty & time limits)

e) Remove needle

3) Recovery:

a) Rest and snack


Where’s the TP?

Where’s the DP?

[Diagram of donor flow: 1a, 1b → Queue 1 → 1c, 1d → Queue 2 → 2a-e → 3]


Pseudocode for expressing parallel algorithms

Authors’ invention “Peril-L”

Represents additional parallel constructs on top of conventional pseudocode

Conceptually targets CTA, so can distinguish local vs. non-local memory refs.

We’ll later see how easy it is to translate into certain parallel programming languages

Peril-L keywords & features


forall: parallel fork/join

Looks like a loop!

Consider as spawning N threads in lieu of one thread doing N iterations

“index” variable has separate value in each thread

for i=1..3 vs. forall (i in (1..3))

how many i variables?

big difference between iterations & threads!

some iteration could influence later one(s)

iterations can be fixed or variable in number; how about threads?

What details about the threads’ execution are we (intentionally) leaving unspecified at this level of abstraction?

how to spawn/fork & join (pthreads on cores, cluster processes, or what?)

no. of processors (P)

distribution of T threads to P processors

T/P threads per processor, executing concurrently (but not truly parallel)

choosing P threads from pool of T, executing in parallel, repeat till all T executed


[Diagram: the sequential loop has one variable i taking the values 1, 2, 3 in turn; forall creates three threads, each with its own i]
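To make the contrast concrete, here is one possible C/OpenMP analogue (Peril-L is pseudocode; this mapping is illustrative, not the authors’):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* sequential: a single variable i takes the values 1, 2, 3 in order */
        for (int i = 1; i <= 3; i++)
            printf("iteration i=%d\n", i);

        /* parallel: three threads, each with its own private copy of i,
           running in no guaranteed order */
        #pragma omp parallel for num_threads(3)
        for (int i = 1; i <= 3; i++)
            printf("thread %d has i=%d\n", omp_get_thread_num(), i);

        return 0;
    }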


Details left unspecified

How to spawn/fork & join?

No. of processors (P)

Distribution of T threads when P<T (aka “oversubscribed”)

T/P threads per processor

Concurrent, not all truly parallel

Choosing P threads from pool of T

Repeat till all T executed


Inter-thread synchronization

exclusive: denotes a critical section

implicit mutex

barrier: where all threads “check in”, then all continue

For this to work, all threads have to be “active” even if P<T

Suspend a thread that’s reached the barrier and run another one; continue till all arrive, then wake all
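For illustration (the shared counter is made up), OpenMP’s critical and barrier constructs roughly mirror Peril-L’s exclusive and barrier:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int count = 0;               /* shared (global) variable */

        #pragma omp parallel num_threads(4)
        {
            #pragma omp critical     /* ~ exclusive { ... }: implicit mutex */
            count++;

            #pragma omp barrier      /* all threads check in here... */

            #pragma omp single       /* ...then all continue */
            printf("all threads past the barrier, count=%d\n", count);
        }
        return 0;
    }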


Local vs. global variables

Local if declared inside forall block

Per-thread copies, not visible to other threads or outside block

Global (underlined) if declared outside

Underlining flags the lambda latency of non-local access!

All arrays start at index 0
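An illustrative mapping (not the authors’) of Peril-L’s local/global distinction onto OpenMP’s private and shared variables:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int total = 0;                        /* ~ global (underlined): shared */

        #pragma omp parallel num_threads(4) shared(total)
        {
            int mine = omp_get_thread_num();  /* ~ local: per-thread copy,
                                                 invisible to other threads */
            #pragma omp atomic                /* writes to the global need sync */
            total += mine;
        }
        printf("total=%d\n", total);          /* 0+1+2+3 = 6 */
        return 0;
    }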


Global memory conventions

Concurrent reads of the same variable are OK; writes are serialized (last wins)

But concurrent writes non-deterministic!

If you don’t like that, insert explicit sync (exclusive)

Models the worst case that happens with real HW

Forces you to pay attention to that and deal with it explicitly at program level
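A tiny demonstration (variable made up) of why concurrent writes are non-deterministic: both threads write x, and which write lands last varies from run to run.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int x = 0;
        #pragma omp parallel num_threads(2)
        x = omp_get_thread_num() + 1;   /* concurrent writes: x ends up 1 or 2 */
        printf("x=%d\n", x);            /* "last wins": result varies by run */
        return 0;
    }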


Accessing global memory

2 methods, you choose, you pay:

Just reference a global variable in the pseudocode

Pays lambda penalty on each access!

Be careful to use “exclusive” to ensure consistency!

“Localize” some/all global data via explicit call to localize() pseudo-function

Pays lambda penalty one time
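For a rough feel (numbers invented for illustration): with λ = 100 cycles, a thread that reads each of its 1,000 assigned items 10 times pays about 10 × 1,000 × 100 = 1,000,000 cycles if every read goes to the global copy, versus roughly 1,000 × 100 = 100,000 cycles for a one-time localize() plus 10 × 1,000 = 10,000 cheap local reads.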


Localization convention (p93 code sample)

int allData[n];                              // global data structure for n items
forall (threadID in (0..P-1))                // spawn P threads
{
    int size = n/P;                          // size of the local allocation
    int locData[size] = localize(allData[]);
    ...
}

In Peril-L, this represents the programmer’s choice to pay the lambda penalty once (= λ·size per thread) for global access

After that, locData[i] is fast access
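One plausible shared-memory translation of the fragment above (illustrative only; OpenMP has no localize(), so this sketch simply has each thread copy its slice once):

    #include <stdio.h>
    #include <string.h>
    #include <omp.h>

    #define N 1024
    #define P 4

    int allData[N];                          /* global data structure */

    int main(void) {
        for (int i = 0; i < N; i++) allData[i] = i;

        #pragma omp parallel num_threads(P)  /* ~ forall (threadID in (0..P-1)) */
        {
            int threadID = omp_get_thread_num();
            int size = N / P;                /* size of the local allocation */
            int locData[N / P];              /* ~ localize(allData[]) */
            memcpy(locData, &allData[threadID * size], size * sizeof(int));

            long sum = 0;                    /* all later accesses are local */
            for (int i = 0; i < size; i++) sum += locData[i];
            printf("thread %d: local sum = %ld\n", threadID, sum);
        }
        return 0;
    }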

What does it mean? How does it work?


Inside localize() pseudo-func.

Is a “local copy” actually made?

Conceptually “no”

locData is like an alias for that thread’s portion of allData

What about the mismatch between locData’s size and allData’s?

localize() automagically maps the local array to the thread’s portion of the global array
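Concretely, for a block distribution the mapping is plain index arithmetic: locData[i] stands for allData[threadID * size + i], with threadID and size as in the p93 fragment (the uneven case is handled by mySize(), introduced later).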


Can I call localize()?

This is pseudocode, not a real library call!

Represents whatever mechanism your platform uses to access global data in λ time

SMP: main mem → L1 cache auto transfer

non-SMP: message from another node

Reading from localized data is fast

SMP: from L1 cache

non-SMP: from local node’s memory


Writing to localized data

Because it’s an alias, corresponding global data also changes (in principle)

SMP: cache coherency HW auto-updates main memory (and other L1 caches)

non-SMP: requires sending message

But localized write is fast

SMP: changes only L1 cache (initially)

non-SMP: changes node’s local memory


Who pays lambda for writing?

Convention is that reader of global data will be charged for the sync cost

SMP: reflects lazy update of main mem. with relaxed consistency model and MESI protocol

non-SMP: only one message per reader needs to be sent


Careful writing localized data!

Updates affect corresponding global data

If no intention of inter-thread communication, no problem

Writes by multiple threads not interfering with each other’s data

Otherwise, opens up possible corruption, data races between reading/writing threads

Must use some sync mechanism (shown later)


“Owner Computes” style (p94)

Promoted by localization convention

Lets each thread take ownership of a portion of the data set

Avoids requiring exclusive (locked) access to the entire global data structure by partitioning the data among threads


localize() is smart!

Forces programmer to explicitly recognize and plan how to manage biggest problem of parallel computing:

Memory bandwidth bottleneck

Can manage at algo. level with Peril-L convention

Another magical pseudo-function:

mySize(global data set, my index)

When data doesn’t divide evenly into P chunks (see the sketch below)
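A plausible sketch of the arithmetic mySize() must perform; the distribution rule here (the first n mod P threads each take one extra item) is an assumption, since the slides only name the function:

    #include <stdio.h>

    /* block size for one thread when n items go to P threads unevenly */
    int mySize(int n, int P, int threadID) {
        return n / P + (threadID < n % P ? 1 : 0);
    }

    int main(void) {
        int n = 10, P = 4;
        for (int t = 0; t < P; t++)
            printf("thread %d gets %d items\n", t, mySize(n, P, t));
        /* prints sizes 3, 3, 2, 2 */
        return 0;
    }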


Global memory & CTA

As observed before, CTA doesn’t have “global mem” (GM) per se

Conceptually, it’s dispersed to one or more processors, in their respective local mems.

To get at it, you have to make a non-local mem ref. via the relevant processor

Looks like a “shell game”

first, you have GM (pseudocode); then, you don’t (CTA); then, you (may) have it again (multicores/SMP)!


Big picture: 3 layers

Top layer = Peril-L pseudocode

Programmer’s view is having both global & local memory available

Middle layer = CTA model, like a VM

Doesn’t have global mem, but can simulate it

Level where we conduct algo. performance estimation: O() complexity and lambda cost

Bottom layer = physical computer

global mem may (multicore/SMP) or may not (cluster) be available; in the latter case it can be simulated by messaging


Benefits of layered approach

Allows complete disconnection of a parallel algo. from a particular HW platform, while still capturing the key property of such platforms: non-local memory latency

Makes any pseudocoded algo. portable among wide variety of platforms


Summary

Building on a generalized model of parallel processors,

we started to define a pseudocode targeted at describing parallel algos

in a HW-agnostic way

that still recognizes the HW issues which affect parallel performance!
