Commercial Application Support (UMass Amherst, Lecture L5)

3

Commercial Application Support

- Not in textbook!

- Commercial enterprise-oriented applications are commonly run on SMP-type machines to achieve high throughput

- The high end is considered to be more massively parallel machines, such as clusters of SMPs

- The workload is mostly concurrent interactive jobs, serial batch jobs, and parallel jobs

- Workload and resource managers: CICS, LoadLeveler, etc.

- The most common approach to leveraging SMPs is to use parallel relational database management systems. These systems marshal all the compute resources inside a parallel machine on behalf of the programmer; no special version of SQL is needed to create these applications

- Often referred to as parallel middleware

4

Parallel Middleware

- Using transaction monitors or client/server coding, each node in a high-performance MPP participates much as it would in a LAN-based system

- Key difference: a high-speed interconnect rather than a LAN

- New skills are required to run enterprise applications:

- Database administrators need to partition data effectively

- OLTP (online transaction processing) architects need to exploit data locality

5

Data Warehouse, OLTP

- The most commonly cited workloads on MPPs: decision support (mostly queries), and

- OLTP, which is update-intensive (and runs well on mainframes)

- The emergence of parallel database management software enabled this (IBM, Oracle, Sybase, Informix, Teradata, etc.)

- Luckily, many of the functions needed fall into the embarrassingly parallel category; that is, shifting large amounts of data is inherently parallel

- Return on investment shows viability

6

Application Configuration Model

[Figure: clients connect through a TP monitor to application server / database server nodes, shown in two configurations: a single SMP and an MPP with multiple application server / database server nodes]

7

Where is the parallelism exploited?

Forms of parallelism by resource:

- CPUs: instruction pipelining, vector processing, ILP

- Memory: interleaving, multiple paths, multi-bank distributed memory

- OS: multitasking, multithreading

- Applications: client/server, multi-tier application partitions

- I/O: disk arrays, tape arrays, file striping

- Communications: packet switching, ATM, multiplexing, pipelining

- Databases: file partitions, inter-query parallelism, intra-query parallelism

8

Empirical Dimensioning of SMPs

- Effective UNIX CPUs = X * K^(N-1)

- where X = 0.89

- K = 0.97

- N = number of processors installed

- X is the fixed overhead required for multiprocessing

- The K factor estimates software scaling efficiency

- Formula is for the TPC-C benchmark workload only!

- DB2 for MVS and CICS for MVS follow this formula fairly closely up to 10 CPUs
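A minimal sketch of how this formula dimensions an SMP, assuming (an interpretation, not stated on the slide) that the per-CPU term X * K^(N-1) is multiplied by N to estimate total effective capacity:

/* Empirical SMP dimensioning sketch. The per-CPU value follows the
 * slide's formula; the "total" column assumes effective capacity is
 * N times that value, which is an interpretation, not from the slide. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double X = 0.89;   /* fixed multiprocessing overhead factor */
    const double K = 0.97;   /* software scaling efficiency factor    */

    for (int n = 1; n <= 12; n++) {
        double per_cpu = X * pow(K, n - 1);   /* formula as stated above      */
        double total   = n * per_cpu;         /* assumed total effective CPUs */
        printf("%2d CPUs installed -> %.2f effective per CPU, ~%.2f total\n",
               n, per_cpu, total);
    }
    return 0;
}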

9

Average number of CPUs/SMP

- 1995: 2 CPUs

- 1996: 6 CPUs

- 1997: 7 CPUs

- 1998: 8 CPUs

- 1999: 12 CPUs

- 2000: 20 CPUs

- (Source: Gartner Group)

10

Scaling of SMPs

- Typically depends on whether the interconnect is a bus (microprocessor-based SMPs) or a crossbar (mainframes)

- Microprocessor-based SMPs typically run fewer than three concurrent applications

- Small mainframes can run 20 batch jobs with 100 interactive users and 3-4 applications

11

Return on Investment (Source: Gartner Group)

- Regional bank: database size 20 GB, investment $4M, ROI $280M over 2 years

- Telecom: database size 150 GB, investment $9M, ROI $84M over 4 years

- Retailer: database size 500 GB, investment $15M, ROI $100M over 5 years

- Banking: database size 120 GB, investment $15M, ROI $500M over 2 years

- International airlines: database size 60 GB, investment $7.5M, ROI $40M over 2 years

- Financial services: database size 60-600 GB, investment $10M, ROI N/A over 2 years

- Consumer: database size 100-500 GB, investment $12M, ROI $706M over 4 years

12

Performance Full-Table Scan IBM DB2/6000-PE

[Bar chart: full-table scan time in minutes (scale 0-60) for 180 million rows on 6 nodes, 360 million rows on 6 nodes, and 360 million rows on 12 nodes]

13

Types of Hardware for Enterprise Systems

[Quadrant chart: database size (small/medium, < 50 GB, versus large, > 50-100 GB) against query complexity (low versus high); the four system classes placed in the quadrants are PC & RISC workstations, traditional mainframes, SMP (< 30 nodes) parallel database servers, and MPP (100s of nodes) database servers]

14

Types of applications and their requirements

                              SMP                                 MPP
Application type              OLTP, medium, mission-critical      Very large DSS, complex analytical
Application growth rate       Slow, steady, < 20% per year        Rapid, unpredictable, > 50% per year
Raw data                      10-100 GB                           > 200 GB
OLTP: number of users         200-500                             > 2000
OLTP: number of transactions  500-1000                            > 2000

15

What you need to understand to choose the right configuration

1. Size of the database and of the largest tables

2. Number of concurrent users (usually 50-200)

3. Complexity of queries, from I/O-only to extreme CPU burn rates

4. Speed of the system bus, microprocessor, and I/O subsystem

5. Amount of data actually touched by each query (hundreds of rows to millions of rows)

6. Definition of acceptable end-user time (minutes or hours)

7. Quality of the RDBMS, its algorithms, and its partitioning capability

8. Database size and user population growth expectations (usually 2-4x over 2-4 years!)

16

Programming Models


17

Programming Models

- Represent different ways to program parallel machines. Originally, parallel machines were built for a particular programming model. Today, however, one can run almost any programming model on almost any machine.

- Programming Models

- Message-passing

- Data-parallel

- Shared-memory

- Commercial machines used in enterprise systems are usually configured with tools such as transaction monitors and load balancers for throughput, rather than being explicitly programmed by the user

18

Where do programming models fit into the big picture?

[Layer diagram: at the application level, the message-passing, shared address space, and data-parallel programming models are supported in software by libraries and compilers; below them sits the communication architecture, whose hardware/software boundary on the machine provides a few fast primitives that are optimized to support the full model]

19

Message Passing

- Abstraction: dependencies between computations are communicated through messages. Synchronization can be achieved by explicitly programming for it, for example with blocking messages (the source node does not continue until the receiver has received the message), request-reply pairs, etc.

- Typically implemented with special libraries and runtime support

- Examples: PVM, MPI, Active Messages

- Some of these are more suitable for user-level programming (PVM, MPI)

- Others, Active Messages for example, are optimized for performance and are more suitable as foundations for other layers, which can be shared memory or message passing.

20

MPI

- Widely used standard

- Works on both networks of workstations (NOWs) and supercomputers

- Much more than simple message passing

- Has support for:

- Process groups, task parallelism

- Collective communications

- Communication contexts

- Application topologies

- More than 12 combinations of sends and receives

- Lots of datatype definitions

21

MPI examples: first the Hello World!

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int myrank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("\nmyrank: %d", myrank);

    /* if (myrank == 2) printf("\nI am process Nr 2"); */

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (myrank == 0)
        printf("\nSize: [%d]", size);

    for (i = 0; i < size; i++)
        if (myrank == i)
            printf("\nHELLO WORLD FROM PROCESS [%d]", myrank);

    MPI_Finalize();
    return 0;
}
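Slide 20 lists collective communications among MPI's features; as a minimal sketch (an illustration, not from the original slides), the same program structure can use MPI_Reduce to sum a value across all processes onto rank 0:

#include <mpi.h>
#include <stdio.h>

/* Collective-communication sketch: every process contributes its rank,
 * and MPI_Reduce delivers the sum to rank 0. */
int main(int argc, char **argv)
{
    int myrank, size, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* All processes call the collective; only the root (rank 0) gets the result. */
    MPI_Reduce(&myrank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myrank == 0)
        printf("Sum of ranks across %d processes: %d\n", size, sum);

    MPI_Finalize();
    return 0;
}

Both programs are typically built with an MPI compiler wrapper such as mpicc and launched with mpirun or mpiexec.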

22

Programming for Performance


23

Introduction

Rich space of techniques and issues
• Trade off and interact with one another

Issues can be addressed/helped by software or hardware

• Algorithmic or programming techniques

• Architectural techniques

Focus here on performance issues and software techniques

• Why should architects care?
  – understanding the workloads for their machines
  – hardware/software tradeoffs: where should/shouldn't architecture help

• Point out some architectural implications

• Architectural techniques covered in rest of class

24

Programming as Successive Refinement

Not all issues dealt with up front

Partitioning often independent of architecture, and done first

• View machine as a collection of communicating processors
  – balancing the workload
  – reducing the amount of inherent communication
  – reducing extra work

• Tug-o-war even among these three issues

Then interactions with architecture

• View machine as extended memory hierarchy
  – extra communication due to architectural interactions
  – cost of communication depends on how it is structured

• May inspire changes in partitioning

Discussion of issues is one at a time, but identifies tradeoffs

• Use examples, and measurements on SGI Origin2000


26

Partitioning for Performance

Balancing the workload and reducing wait time at synch points

Reducing inherent communication

Reducing extra work

Even these algorithmic issues trade off:

• Minimize comm. => run on 1 processor => extreme load imbalance

• Maximize load balance => random assignment of tiny tasks => no control over communication

• Good partition may imply extra work to compute or manage it

Goal is to compromise

• Fortunately, often not difficult in practice

29

Identifying Concurrency (contd.)

Function parallelism:
• entire large tasks (procedures) that can be done in parallel

• on same or different data

• e.g. different independent grid computations in Ocean

• pipelining, as in video encoding/decoding, or polygon rendering

• degree usually modest and does not grow with input size

• difficult to load balance

• often used to reduce synch between data parallel phases

Most scalable programs data parallel (per this loose definition)

• function parallelism reduces synch between data parallel phases


30

Deciding How to Manage Concurrency

Static versus Dynamic techniques

Static:

• Algorithmic assignment based on input; won't change
• Low runtime overhead
• Computation must be predictable
• Preferable when applicable (except in multiprogrammed/heterogeneous environment)

Dynamic:
• Adapt at runtime to balance load
• Can increase communication and reduce locality
• Can increase task management overheads

31

Dynamic Assignment

Profile-based (semi-static):
• Profile work distribution at runtime, and repartition dynamically

• Applicable in many computations, e.g. Barnes-Hut, some graphics

Dynamic Tasking:

• Deal with unpredictability in program or environment (e.g. Raytrace)
  – computation, communication, and memory system interactions
  – multiprogramming and heterogeneity
  – used by runtime systems and OS too

• Pool of tasks; take and add tasks until done

• E.g. “self-scheduling” of loop iterations (shared loop counter)
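A minimal sketch of such self-scheduling (an illustration, not the course's code): workers grab loop iterations from a shared atomic counter until the iteration space is exhausted.

/* Self-scheduling of loop iterations via a shared counter (sketch).
 * Each worker atomically fetches the next iteration index; the task
 * pool is simply the remaining iterations. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N_ITER    1000
#define N_WORKERS 4

static atomic_int next_iter = 0;     /* the shared loop counter */
static double result[N_ITER];

static void do_iteration(int i) {    /* stand-in for the real loop body */
    result[i] = i * 0.5;
}

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int i = atomic_fetch_add(&next_iter, 1);   /* claim next iteration */
        if (i >= N_ITER)
            break;                                 /* nothing left to do */
        do_iteration(i);
    }
    return NULL;
}

int main(void) {
    pthread_t t[N_WORKERS];
    for (int w = 0; w < N_WORKERS; w++)
        pthread_create(&t[w], NULL, worker, NULL);
    for (int w = 0; w < N_WORKERS; w++)
        pthread_join(t[w], NULL);
    printf("done: result[%d] = %.1f\n", N_ITER - 1, result[N_ITER - 1]);
    return 0;
}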


32

Dynamic Tasking with Task Queues

Centralized versus distributed queues

Task stealing with distributed queues
• Can compromise comm and locality, and increase synchronization
• Whom to steal from, how many tasks to steal, ...
• Termination detection
• Maximum imbalance related to size of task

[Figure: (a) a centralized task queue that all processes insert tasks into and remove tasks from; (b) distributed task queues, one per process (Q0-Q3), where each process P0-P3 inserts into and removes from its own queue and others may steal]
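A minimal sketch of the distributed variant (b) with stealing (an illustration under assumed task granularity, not the course's code); real termination detection, mentioned above, is omitted:

/* Distributed task queues with stealing (sketch). Each worker drains
 * its own mutex-protected queue, then tries to steal from the others. */
#include <pthread.h>
#include <stdio.h>

#define N_WORKERS   4
#define TASKS_PER_Q 16

typedef struct {
    int tasks[TASKS_PER_Q];
    int head;                      /* index of next task to remove */
    pthread_mutex_t lock;
} queue_t;

static queue_t q[N_WORKERS];

static int take(queue_t *qu) {     /* returns -1 if the queue is empty */
    pthread_mutex_lock(&qu->lock);
    int t = (qu->head < TASKS_PER_Q) ? qu->tasks[qu->head++] : -1;
    pthread_mutex_unlock(&qu->lock);
    return t;
}

static void *worker(void *arg) {
    int me = (int)(long)arg, t;
    long sum = 0;
    while ((t = take(&q[me])) != -1)           /* drain own queue first  */
        sum += t;
    for (int v = 0; v < N_WORKERS; v++)        /* then steal from others */
        while (v != me && (t = take(&q[v])) != -1)
            sum += t;                          /* stand-in for task work */
    printf("worker %d done, local checksum %ld\n", me, sum);
    return NULL;
}

int main(void) {
    pthread_t tid[N_WORKERS];
    for (int w = 0; w < N_WORKERS; w++) {
        pthread_mutex_init(&q[w].lock, NULL);
        q[w].head = 0;
        for (int i = 0; i < TASKS_PER_Q; i++)
            q[w].tasks[i] = w * TASKS_PER_Q + i;
    }
    for (long w = 0; w < N_WORKERS; w++)
        pthread_create(&tid[w], NULL, worker, (void *)w);
    for (int w = 0; w < N_WORKERS; w++)
        pthread_join(tid[w], NULL);
    return 0;
}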

33

Impact of Dynamic Assignment

On SGI Origin 2000 (cache-coherent shared memory):

[Figure: speedup versus number of processors (1-31) on the Origin and Challenge: (a) dynamic versus static assignment; (b) semistatic versus static assignment]

34

Determining Task Granularity

Task granularity: amount of work associated with a task

General rule:

• Coarse-grained => often less load balance

• Fine-grained => more overhead; often more comm., contention

Comm., contention actually affected by assignment, not size
• Overhead by size itself too, particularly with task queues

35

Reducing Serialization

Careful about assignment and orchestration (including scheduling)

Event synchronization
• Reduce use of conservative synchronization
  – e.g. point-to-point instead of barriers, or granularity of pt-to-pt
• But fine-grained synch more difficult to program, more synch ops.

Mutual exclusion
• Separate locks for separate data
  – e.g. locking records in a database: lock per process, record, or field
  – lock per task in task queue, not per queue
  – finer grain => less contention/serialization, more space, less reuse

• Smaller, less frequent critical sections
  – don't do reading/testing in critical section, only modification
  – e.g. searching for task to dequeue in task queue, building tree

• Stagger critical sections in time
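A minimal sketch of the "separate locks for separate data" point (an illustration, assuming a simple in-memory record array rather than a real database): one lock per record instead of one global lock, with only the modification inside the critical section.

/* Finer-grained mutual exclusion (sketch): a lock per record, so updates
 * to different records do not serialize on a single table lock. */
#include <pthread.h>

#define N_RECORDS 1024

typedef struct {
    long balance;
    pthread_mutex_t lock;      /* per-record lock (finer grain than a table lock) */
} record_t;

static record_t table[N_RECORDS];

void table_init(void) {
    for (int i = 0; i < N_RECORDS; i++) {
        table[i].balance = 0;
        pthread_mutex_init(&table[i].lock, NULL);
    }
}

/* Keep the critical section small: do any searching or validation before
 * taking the lock; only the modification happens while it is held. */
void record_add(int id, long amount) {
    pthread_mutex_lock(&table[id].lock);
    table[id].balance += amount;
    pthread_mutex_unlock(&table[id].lock);
}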


38

Domain Decomposition

Works well for scientific, engineering, graphics, ... applications

Exploits local-biased nature of physical problems

• Information requirements often short-range

• Or long-range but fall off with distance

Simple example: nearest-neighbor grid computation

Perimeter to Area comm-to-comp ratio (area to volume in 3-d)

• Depends on n, p: decreases with n, increases with p

[Figure: an n x n grid partitioned into a 4 x 4 arrangement of blocks, one block of size n/sqrt(p) on a side per processor P0-P15]

39

Domain Decomposition (contd)

Comm-to-comp ratio: 4*sqrt(p)/n for block, 2*p/n for strip

• Retain block from here on

Application dependent: strip may be better in other cases
• E.g. particle flow in tunnel

Best domain decomposition depends on information requirements

Nearest-neighbor example: block versus strip decomposition:

[Figure: the n x n grid partitioned among 16 processors into square blocks, each n/sqrt(p) elements on a side]
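For reference, a short derivation of these two ratios for the nearest-neighbor grid example (standard reasoning consistent with the figures above, not text from the slides):

% Comm-to-comp ratios for an n x n nearest-neighbor grid on p processors.
% Block: each partition is n/sqrt(p) on a side, so it communicates its
% 4 edges and computes over its area. Strip: each partition is n wide and
% n/p tall, so it communicates 2 full rows of n elements.
\[
\text{block: } \frac{4\,(n/\sqrt{p})}{n^2/p} = \frac{4\sqrt{p}}{n},
\qquad
\text{strip: } \frac{2n}{n^2/p} = \frac{2p}{n}.
\]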


40

Finding a Domain Decomposition

Static, by inspection
• Must be predictable: grid example above, and Ocean

Static, but not by inspection

• Input-dependent, require analyzing input structure

• E.g. sparse matrix computations, data mining (assigning itemsets)

Semi-static (periodic repartitioning)

• Characteristics change but slowly; e.g. Barnes-Hut

Static or semi-static, with dynamic task stealing
• Initial decomposition, but highly unpredictable; e.g. ray tracing

41

Other Techniques

Scatter Decomposition, e.g. initial partition in Raytrace

Preserve locality in task stealing
• Steal large tasks for locality, steal from same queues, ...

[Figure: domain decomposition (each of processors 1-4 owns one large contiguous region) versus scatter decomposition (the domain is divided into many small squares, each split among processors 1-4), as used for the initial partition in Raytrace]

42

Implications of Comm-to-Comp Ratio

Architects examine application needs to see where to spend money

If denominator is execution time, ratio gives average BW needs

If operation count, gives extremes in impact of latency and bandwidth
• Latency: assume no latency hiding
• Bandwidth: assume all latency hidden
• Reality is somewhere in between

Actual impact of comm. depends on structure and cost as well

• Need to keep communication balanced across processors as well

Speedup < Sequential Work / Max(Work + Synch Wait Time + Comm Cost)

43

Reducing Extra Work

Common sources of extra work:
• Computing a good partition

– e.g. partitioning in Barnes-Hut or sparse matrix

• Using redundant computation to avoid communication

• Task, data and process management overhead– applications, languages, runtime systems, OS

• Imposing structure on communication– coalescing messages, allowing effective naming

Architectural Implications:
• Reduce need by making communication and orchestration efficient

Speedup < Sequential Work / Max(Work + Synch Wait Time + Comm Cost + Extra Work)


55

Exploiting Temporal Locality

• Structure algorithm so working sets map well to hierarchy
  – often techniques to reduce inherent communication do well here
  – schedule tasks for data reuse once assigned

• Multiple data structures in same phase
  – e.g. database records: local versus remote

• Solver example: blocking

• More useful when O(n^(k+1)) computation on O(n^k) data
  – many linear algebra computations (factorization, matrix multiply)

[Figure: (a) unblocked access pattern in a sweep; (b) blocked access pattern with B = 4]
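A minimal sketch of blocking (an illustration of the idea, using matrix multiply rather than the course's solver): the O(n^3) computation on O(n^2) data is reorganized so each B x B block combination is finished while it is still resident in cache. C is assumed to be zero-initialized by the caller.

/* Blocked matrix multiply (sketch): exploits temporal locality by
 * reusing B x B blocks of A, Bm, and C while they stay in cache. */
#include <stddef.h>

#define N 512
#define B 32          /* block size, chosen to fit the cache (assumed) */

void matmul_blocked(const double A[N][N], const double Bm[N][N], double C[N][N])
{
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t kk = 0; kk < N; kk += B)
                /* work on one block combination at a time */
                for (size_t i = ii; i < ii + B; i++)
                    for (size_t j = jj; j < jj + B; j++) {
                        double sum = C[i][j];
                        for (size_t k = kk; k < kk + B; k++)
                            sum += A[i][k] * Bm[k][j];
                        C[i][j] = sum;
                    }
}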

56

Exploiting Spatial Locality

Besides capacity, granularities are important:

• Granularity of allocation

• Granularity of communication or data transfer

• Granularity of coherence

Major spatial-related causes of artifactual communication:

• Conflict misses
• Data distribution/layout (allocation granularity)
• Fragmentation (communication granularity)
• False sharing of data (coherence granularity)

All depend on how spatial access patterns interact with data structures

• Fix problems by modifying data structures, or layout/alignment

Examine later in context of architectures
• one simple example here: data distribution in SAS solver


57

Spatial Locality Example

• Repeated sweeps over 2-d grid, each time adding 1 to elements

• Natural 2-d versus higher-dimensional array representation

[Figure: contiguity in memory layout for (a) a two-dimensional array, where a page straddles partition boundaries (difficult to distribute memory well) and a cache block straddles a partition boundary, versus (b) a four-dimensional array, where a page does not straddle a partition boundary and a cache block stays within a partition]
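A minimal sketch of the four-dimensional representation (an illustration with assumed sizes, not the course's code): indexing the grid as grid[pi][pj][i][j] makes each process's block one contiguous chunk of memory, so pages and cache blocks do not straddle partition boundaries.

/* Four-dimensional array layout (sketch): the block owned by process
 * (pi, pj) is contiguous in memory. PDIM and BLOCK are assumed sizes. */
#include <stdlib.h>

#define PDIM  4        /* sqrt(p): processes per grid dimension */
#define BLOCK 256      /* n / sqrt(p): elements per block edge  */

/* grid[pi][pj][i][j] = element (i, j) of the block owned by process (pi, pj) */
typedef double grid4d[PDIM][PDIM][BLOCK][BLOCK];

grid4d *alloc_grid(void)
{
    return malloc(sizeof(grid4d));   /* one contiguous allocation */
}

/* Access element (gi, gj) of the logical n x n grid. */
double grid_get(const grid4d *g, int gi, int gj)
{
    return (*g)[gi / BLOCK][gj / BLOCK][gi % BLOCK][gj % BLOCK];
}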

58

Tradeoffs with Inherent Communication

Partitioning grid solver: blocks versus rows
• Blocks still have a spatial locality problem on remote data

• Rowwise can perform better despite worse inherent c-to-c ratio

• Result depends on n and p

[Figure: a block partition has good spatial locality on nonlocal accesses at its row-oriented boundary but poor spatial locality on nonlocal accesses at its column-oriented boundary]

59

Example Performance Impact

Equation solver on SGI Origin2000

[Figure: two speedup-versus-processors plots (1-31 processors) comparing the 4D, 2D, and Rows data structure variants, with and without round-robin (rr) page placement]

60

Architectural Implications of Locality

Communication abstraction that makes exploiting it easy

For cache-coherent SAS, e.g.:

• Size and organization of levels of memory hierarchy
  – cost-effectiveness: caches are expensive

  – caveats: flexibility for different and time-shared workloads

• Replication in main memory useful? If so, how to manage?
  – hardware, OS/runtime, program?

• Granularities of allocation, communication, coherence (?)
  – small granularities => high overheads, but easier to program

Machine granularity (resource division among processors, memory...)


61

Structuring Communication

Given amount of comm (inherent or artifactual), goal is to reduce cost

Cost of communication as seen by process:

C = f * (o + l + n_c/(m*B) + t_c - overlap)

  – f = frequency of messages
  – o = overhead per message (at both ends)
  – l = network delay per message
  – n_c = total data sent
  – m = number of messages
  – B = bandwidth along path (determined by network, NI, assist)
  – t_c = cost induced by contention per message
  – overlap = amount of latency hidden by overlap with comp. or comm.

• Portion in parentheses is cost of a message (as seen by processor)
• That portion, ignoring overlap, is latency of a message

• Goal: reduce terms in latency and increase overlap
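A minimal sketch that simply evaluates this cost model (the parameter values below are purely illustrative assumptions, not measurements from the course):

/* Evaluate the communication cost model C = f*(o + l + nc/(m*B) + tc - overlap). */
#include <stdio.h>

static double comm_cost(double f, double o, double l, double nc,
                        double m, double B, double tc, double overlap)
{
    return f * (o + l + nc / (m * B) + tc - overlap);
}

int main(void)
{
    /* Illustrative values: 1000 messages, 5 us overhead and 2 us network
       delay each, 1 MB total data, 100 MB/s bandwidth, 1 us contention,
       3 us of latency hidden per message. */
    double c = comm_cost(1000, 5e-6, 2e-6, 1e6, 1000, 100e6, 1e-6, 3e-6);
    printf("total communication cost: %.4f s\n", c);
    return 0;
}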

62

Reducing Overhead

Can reduce no. of messages m or overhead per message o

o is usually determined by hardware or system software

• Program should try to reduce m by coalescing messages

• More control when communication is explicit

Coalescing data into larger messages:

• Easy for regular, coarse-grained communication

• Can be difficult for irregular, naturally fine-grained communication
  – may require changes to algorithm and extra work

• coalescing data and determining what and to whom to send

  – will discuss more in implications for programming models later


63

Reducing Network Delay

Network delay component = f * h * t_h
  – h = number of hops traversed in network
  – t_h = link + switch latency per hop

Reducing f: communicate less, or make messages larger

Reducing h:

• Map communication patterns to network topology
  – e.g. nearest-neighbor on mesh and ring; all-to-all

• How important is this?
  – used to be major focus of parallel algorithms
  – depends on no. of processors, how t_h compares with other components
  – less important on modern machines
    • overheads, processor count, multiprogramming

64

Reducing Contention

All resources have nonzero occupancy
• Memory, communication controller, network link, etc.
• Can only handle so many transactions per unit time

Effects of contention:
• Increased end-to-end cost for messages
• Reduced available bandwidth for individual messages
• Causes imbalances across processors

Particularly insidious performance problem
• Easy to ignore when programming
• Slow down messages that don't even need that resource
  – by causing other dependent resources to also congest
• Effect can be devastating: Don't flood a resource!

65

Types of Contention

Network contention and end-point contention (hot-spots)

Location and Module Hot-spots
• Location: e.g. accumulating into global variable, barrier
  – solution: tree-structured communication

• Module: all-to-all personalized comm. in matrix transpose
  – solution: stagger access by different processors to same node temporally

• In general, reduce burstiness; may conflict with making messages larger

[Figure: flat communication into a single node (contention) versus tree-structured communication (little contention)]

66

Overlapping Communication

Cannot afford to stall for high latencies
• even on uniprocessors!

Overlap with computation or communication to hide latency

Requires extra concurrency (slackness), higher bandwidth

Techniques:

• Prefetching

• Block data transfer

• Proceeding past communication

• Multithreading


67

Summary of Tradeoffs

Different goals often have conflicting demands
• Load Balance
  – fine-grain tasks
  – random or dynamic assignment

• Communication
  – usually coarse grain tasks
  – decompose to obtain locality: not random/dynamic

• Extra Work
  – coarse grain tasks
  – simple assignment

• Communication Cost:
  – big transfers: amortize overhead and latency
  – small transfers: reduce contention

70

Summary

Speedup_prob(p) = (Busy(1) + Data(1)) / (Busy_useful(p) + Data_local(p) + Synch(p) + Data_remote(p) + Busy_overhead(p))

• Goal is to reduce denominator components

• Both programmer and system have role to play

• Architecture cannot do much about load imbalance or too much communication

• But it can:
  – reduce incentive for creating ill-behaved programs (efficient naming, communication and synchronization)
  – reduce artifactual communication
  – provide efficient naming for flexible assignment
  – allow effective overlapping of communication


76

Case Study 2: Barnes-Hut

Locality Goal:
• Particles close together in space should be on same processor

Difficulties: Nonuniform, dynamically changing

[Figure: (a) the spatial domain; (b) quadtree representation]

77

Application Structure

• Main data structures: array of bodies, of cells, and of pointers to them
  – Each body/cell has several fields: mass, position, pointers to others
  – pointers are assigned to processes

[Figure: each time-step runs through the phases: build tree, compute moments of cells, traverse tree to compute forces, update properties]

78

Partitioning

Decomposition: bodies in most phases, cells in computing moments

Challenges for assignment:

• Nonuniform body distribution => work and comm. nonuniform
  – Cannot assign by inspection

• Distribution changes dynamically across time-steps
  – Cannot assign statically

• Information needs fall off with distance from body
  – Partitions should be spatially contiguous for locality

• Different phases have different work distributions across bodies
  – No single assignment ideal for all
  – Focus on force calculation phase

• Communication needs naturally fine-grained and irregular

79

Load Balancing

• Equal particles ≠ equal work.

– Solution: Assign costs to particles based on the work they do

• Work unknown and changes with time-steps

– Insight: System evolves slowly

– Solution: Count work per particle, and use as cost for next time-step.

Powerful technique for evolving physical systems
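A minimal sketch of this counting idea (an illustration, not the course's code): the work counted per particle in the previous time-step is used as its cost, and particles are split into contiguous chunks of roughly equal total cost for the next step. Costzones (a later slide) applies the same idea while walking the tree so the chunks are also spatially contiguous.

/* Semi-static, cost-based assignment (sketch): split particles into
 * contiguous chunks of roughly equal total cost, using the work counted
 * for each particle in the previous time-step. */
#define N_PARTICLES 1000
#define N_PROCS     8

void assign_by_cost(const long cost[N_PARTICLES], int owner[N_PARTICLES])
{
    long total = 0;
    for (int i = 0; i < N_PARTICLES; i++)
        total += cost[i];

    long per_proc = (total + N_PROCS - 1) / N_PROCS;  /* target cost per process */
    long acc = 0;
    int p = 0;

    for (int i = 0; i < N_PARTICLES; i++) {
        owner[i] = p;                                 /* particle i goes to process p */
        acc += cost[i];
        if (acc >= per_proc && p < N_PROCS - 1) {     /* chunk full: move to next process */
            p++;
            acc = 0;
        }
    }
}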


80

A Partitioning Approach: ORB

Orthogonal Recursive Bisection:

• Recursively bisect space into subspaces with equal work
  – Work is associated with bodies, as before

• Continue until one partition per processor

• High overhead for large no. of processors

81

Another Approach: Costzones

Insight: Tree already contains an encoding of spatial locality.

• Costzones is low-overhead and very easy to program

[Figure: (a) ORB partitioning and (b) costzones partitioning of the same particle distribution among processors P1-P8]

82

Performance Comparison

• Speedups on simulated multiprocessor (16K particles)

• Extra work in ORB partitioning is key difference

[Figure: speedup comparison (16K particles) for ideal, costzones on the simulator, ORB on the simulator, and costzones on DASH, KSR-1, and Challenge; axis scale 16-128]

83

Orchestration and Mapping

Spatial locality: Very different than in Ocean, like other aspects

• Data distribution is much more difficult than in Ocean
  – Redistribution across time-steps
  – Logical granularity (body/cell) much smaller than page
  – Partitions contiguous in physical space does not imply contiguous in array
  – But, good temporal locality, and most misses logically non-local anyway

• Long cache blocks help within body/cell record, not entire partition

Temporal locality and working sets:
• Important working set scales as 1/θ² log n
• Slow growth rate, and fits in second-level caches, unlike Ocean

Synchronization:
• Barriers between phases
• No synch within force calculation: data written different from data read
• Locks in tree-building, pt. to pt. event synch in center of mass phase

Mapping: ORB maps well to hypercube, costzones to linear array


84

Execution Time Breakdown

• Problem with static case is communication/locality, not load balance!

[Figure: per-process execution time (s), broken into Data, Synch, and Busy components, across 32 processes: (a) static assignment of bodies; (b) semistatic costzone assignment]

• 512K bodies on 32-processor Origin2000
  – Static, quite randomized in space, assignment of bodies versus costzones

85

Raytrace

Rays shot through pixels in image are called primary rays

• Reflect and refract when they hit objects

• Recursive process generates ray tree per primary ray

Hierarchical spatial data structure keeps track of primitives in scene
• Nodes are space cells, leaves have linked list of primitives

Tradeoffs between execution time and image quality


86

Partitioning

Scene-oriented approach

• Partition scene cells, process rays while they are in an assigned cell

Ray-oriented approach
• Partition primary rays (pixels), access scene data as needed
• Simpler; used here

Need dynamic assignment; use contiguous blocks to exploit spatial coherence among neighboring rays, plus tiles for task stealing

[Figure: the image plane divided into blocks (the unit of assignment), each made up of tiles (the unit of decomposition and stealing)]

Could use 2-D interleaved (scatter) assignment of tiles instead

87

Orchestration and Mapping

Spatial locality
• Proper data distribution for ray-oriented approach very difficult

• Dynamically changing, unpredictable access, fine-grained access

• Better spatial locality on image data than on scene data
  – Strip partition would do better, but less spatial coherence in scene access

Temporal locality
• Working sets much larger and more diffuse than Barnes-Hut

• But still a lot of reuse in modern second-level caches
  – SAS program does not replicate in main memory

Synchronization:

• One barrier at end, locks on task queues

Mapping: natural to 2-d mesh for image, but likely not important

88

Execution Time Breakdown

• Task stealing clearly very important for load balance

[Figure: two plots of per-process execution time (s), broken into Data, Synch, and Busy components, across 32 processes]

89

Implications for Programming Models

Shared address space and explicit message passing
• SAS may provide coherent replication or may not

• Focus primarily on former case

Assume distributed memory in all cases

Recall any model can be supported on any architecture

• Assume both are supported efficiently

• Assume communication in SAS is only through loads and stores

• Assume communication in SAS is at cache block granularity


90

Issues to Consider

Functional issues:

• Naming

• Replication and coherence

• Synchronization

Organizational issues:

• Granularity at which communication is performed

Performance issues

• Endpoint overhead of communication
  – (latency and bandwidth depend on network so considered similar)

• Ease of performance modeling

Cost Issues
• Hardware cost and design complexity

91

Naming

SAS: similar to uniprocessor; system does it all

MP: each process can only directly name the data in its address space

• Need to specify from where to obtain or where to transfer nonlocal data

• Easy for regular applications (e.g. Ocean)

• Difficult for applications with irregular, time-varying data needs
  – Barnes-Hut: where are the parts of the tree that I need? (changes with time)
  – Raytrace: where are the parts of the scene that I need? (unpredictable)

• Solution methods exist
  – Barnes-Hut: extra phase determines needs and transfers data before computation phase
  – Raytrace: scene-oriented rather than ray-oriented approach
  – both: emulate application-specific shared address space using hashing


92

Replication

Who manages it (i.e. who makes local copies of data)?

• SAS: system, MP: program

Where in local memory hierarchy is replication first done?

• SAS: cache (or memory too), MP: main memory

At what granularity is data allocated in replication store?
• SAS: cache block, MP: program-determined

How are replicated data kept coherent?

• SAS: system, MP: program

How is replacement of replicated data managed?

• SAS: dynamically at fine spatial and temporal grain (every access)

• MP: at phase boundaries, or emulate cache in main memory in software

Of course, SAS affords many more options too (discussed later)

93

Amount of Replication Needed

Mostly local data accessed => little replication

Cache-coherent SAS:
• Cache holds active working set
  – replaces at fine temporal and spatial grain (so little fragmentation too)

• Small enough working sets => need little or no replication in memory

Message Passing or SAS without hardware caching:

• Replicate all data needed in a phase in main memory
  – replication overhead can be very large (Barnes-Hut, Raytrace)
  – limits scalability of problem size with no. of processors

• Emulate cache in software to achieve fine-temporal-grain replacement
  – expensive to manage in software (hardware is better at this)
  – may have to be conservative in size of cache used
  – fine-grained messages generated by misses are expensive (in message passing)
  – programming cost for cache and coalescing messages

94

Communication Overhead and Granularity

Overhead directly related to hardware support provided

• Lower in SAS (order of magnitude or more)

Major tasks:

• Address translation and protection
  – SAS uses MMU
  – MP requires software protection, usually involving OS in some way

• Buffer management
  – fixed-size small messages in SAS easy to do in hardware
  – flexible-sized messages in MP usually need software involvement

• Type checking and matching
  – MP does it in software: lots of possible message types due to flexibility

• A lot of research in reducing these costs in MP, but still much larger

Naming, replication and overhead favor SAS
• Many irregular MP applications now emulate SAS/cache in software

95

Block Data Transfer

Fine-grained communication not most efficient for long messages
• Latency and overhead as well as traffic (headers for each cache line)

SAS: can use block data transfer

• Explicit in system we assume, but can be automated at page or object level in general (more later)

• Especially important to amortize overhead when it is high
  – latency can be hidden by other techniques too

Message passing:

• Overheads are larger, so block transfer more important

• But very natural to use since messages are explicit and flexible
  – Inherent in model

96

Synchronization

SAS: Separate from communication (data transfer)
• Programmer must orchestrate separately

Message passing

• Mutual exclusion by fiat

• Event synchronization already in send-receive match in synchronous
  – need separate orchestration (using probes or flags) in asynchronous

97

Hardware Cost and Design Complexity

Higher in SAS, and especially cache-coherent SAS

But both are more complex issues

• Cost
  – must be compared with cost of replication in memory
  – depends on market factors, sales volume and other nontechnical issues

• Complexity
  – must be compared with complexity of writing high-performance programs
  – Reduced by increasing experience

98

Performance Model

Three components:
• Modeling cost of primitive system events of different types

• Modeling occurrence of these events in workload

• Integrating the two in a model to predict performance

Second and third are most challenging

Second is the case where cache-coherent SAS is more difficult

• replication and communication implicit, so events of interest implicit
  – similar to problems introduced by caching in uniprocessors

• MP has good guideline: messages are expensive, send infrequently

• Difficult for irregular applications in either case (but more so in SAS)

Block transfer, synchronization, cost/complexity, and performance modeling are advantageous for MP

99

Summary for Programming Models

Given tradeoffs, architect must address:
• Hardware support for SAS (transparent naming) worthwhile?

• Hardware support for replication and coherence worthwhile?

• Should explicit communication support also be provided in SAS?

Current trend:

• Tightly-coupled multiprocessors: support for cache-coherent SAS in hw

• Other major platform is clusters of workstations or multiprocessors
  – currently don't support SAS in hardware, mostly use message passing

100

Summary

Crucial to understand characteristics of parallel programs
• Implications for a host of architectural issues at all levels

Architectural convergence has led to:

• Greater portability of programming models and software
  – Many performance issues similar across programming models too

• Clearer articulation of performance issues
  – Used to use PRAM model for algorithm design
  – Now models that incorporate communication cost (BSP, LogP, ...)
  – Emphasis in modeling shifted to end-points, where cost is greatest
  – But need techniques to model application behavior, not just machines

Performance issues trade off with one another; iterative refinement

Ready to understand using workloads to evaluate systems issues

101

Backup slides

