System and Architectural Models

Architectural Models

▪ Three parallel programming models
  - That differ in what communication abstractions they present to the programmer
  - Programming models are important because they (1) influence how programmers think when writing programs and (2) influence the design of parallel hardware platforms designed to execute them

▪ Corresponding machine architectures
  - Abstraction presented by the hardware to low-level software

▪ We'll focus on differences in communication/synchronization
System layers: interface, implementation, interface, ...

[Stack diagram, from top to bottom:]
- Parallel applications
- Abstractions for describing concurrent, parallel, or independent computation; abstractions for describing communication (the "programming model": provides a way of thinking about the structure of programs)
- Language or library primitives/mechanisms
- Compiler and/or parallel runtime
- OS system call API
- Operating system
- Hardware architecture (HW/SW boundary)
- Micro-architecture (hardware implementation)
Example: expressing parallelism with pthreads

[Stack diagram, from top to bottom:]
- Parallel application
- Abstraction for concurrent computation: a thread (the thread programming model)
- Interface: pthread_create()
- Implementation: the pthread library
- System call API; OS support: kernel thread management
- Hardware: x86-64 modern multi-core CPU
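For concreteness, here is a minimal sketch of this interface in action (my own example, not from the slides), using the pthread_create() call named above:

#include <pthread.h>
#include <stdio.h>

// Function executed by the spawned thread.
void* hello(void* arg) {
    printf("hello from spawned thread, arg = %d\n", *(int*)arg);
    return NULL;
}

int main() {
    pthread_t t;
    int arg = 42;
    pthread_create(&t, NULL, hello, &arg);  // the interface from the diagram
    pthread_join(t, NULL);                  // wait for the thread to finish
    return 0;
}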
Example: expressing parallelism with ISPC

[Stack diagram, from top to bottom:]
- Parallel applications
- Abstractions for describing parallel computation: (1) for specifying simultaneous execution (true parallelism) and (2) for specifying independent work (potentially parallel) (the ISPC programming model)
- Interface: the ISPC language (call ISPC function, foreach construct)
- Implementation: the ISPC compiler
- System call API; OS support
- Hardware: x86-64 (including AVX vector instructions), a single core of the CPU

Note: This diagram is specific to the ISPC gang abstraction. ISPC also has the "task" language primitive for multi-core execution. I don't describe it here, but it would be interesting to think about how that diagram would look.
Parallel Programming Models
• Programming model is made up of the languages and libraries that create an abstract view of the machine
• Control
  • How is parallelism created?
  • What orderings exist between operations?
• Data
  • What data is private vs. shared?
  • How is logically shared data accessed or communicated?
• Synchronization
  • What operations can be used to coordinate parallelism?
  • What are the atomic (indivisible) operations?
• Cost
  • How do we account for the cost of each of the above?
Three programming models (abstractions)

1. Shared address space
2. Message passing
3. Data parallel
Shared address space model

What is memory?

▪ On the first day of class, we described a program as a sequence of instructions.

▪ Some of those instructions read and write from memory.

▪ But what is memory?
  - To be precise, what I'm really asking is: what is the logical abstraction of memory presented to a program?
A program's memory address space

▪ A computer's memory is organized as an array of bytes

▪ Each byte is identified by its "address" in memory (its position in this array)

"The byte stored at address 0x8 has the value 32."

"The byte stored at address 0x10 (16) has the value 128."

[Illustration: a table of addresses 0x0 through 0x1F and the byte values stored at each (e.g., 0x0: 16, 0x8: 32, 0x10: 128)]

In the illustration, the program's memory address space is 32 bytes in size (so valid addresses range from 0x0 to 0x1F)
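To make the byte-array abstraction concrete, here is a tiny C++ sketch of mine (not from the slides) that treats a 32-byte buffer as a miniature address space, with the same values as the illustration:

#include <cstdio>
#include <cstdint>

int main() {
    uint8_t mem[32] = {0};   // a 32-byte "address space": valid addresses 0x0-0x1F
    mem[0x8]  = 32;          // "the byte stored at address 0x8 has the value 32"
    mem[0x10] = 128;         // "the byte stored at address 0x10 (16) has the value 128"
    for (int addr = 0; addr < 32; addr++)
        printf("0x%X: %d\n", addr, mem[addr]);
    return 0;
}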
Shared address space model (abstraction)

Thread 1:
  int x = 0;
  spawn_thread(foo, &x);

  // write to address holding
  // contents of variable x
  x = 1;

Thread 2:
  void foo(int* x) {
    // read from addr storing
    // contents of variable x
    while (*x == 0) {}
    print *x;
  }

▪ Threads communicate by reading/writing to shared variables

(Pseudocode provided in a fake C-like language for brevity.)

[Diagram: both threads share an address space containing x; thread 1 stores to x, thread 2 loads from x. Communication operations shown in red.]
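A runnable C++11 rendering of this pseudocode (a sketch of mine, not part of the slides; note that a plain int here would be a data race under the C++ memory model, so the shared variable is declared std::atomic):

#include <atomic>
#include <cstdio>
#include <thread>

// Thread 2: spin until x becomes nonzero, then print it.
void foo(std::atomic<int>* x) {
    while (x->load() == 0) {}   // read from address storing contents of x
    printf("%d\n", x->load());
}

int main() {
    std::atomic<int> x(0);      // shared variable in the shared address space
    std::thread t2(foo, &x);    // spawn_thread(foo, &x)
    x.store(1);                 // write to address holding contents of x
    t2.join();
    return 0;
}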
Shared address space model

Thread 1:
  int x = 0;
  Lock my_lock;
  spawn_thread(foo, &x, &my_lock);

  my_lock.lock();
  x++;
  my_lock.unlock();

Thread 2:
  void foo(int* x, Lock* my_lock) {
    my_lock->lock();
    (*x)++;
    my_lock->unlock();
    print *x;
  }

(Pseudocode provided in a fake C-like language for brevity.)

Synchronization primitives are also shared variables: e.g., locks
Simple Example
• Consider applying a function f to the elements of an array A and then computing its sum:

  ∑_{i=0}^{n-1} f(A[i])

• Questions:
  • Where does A live? All in single memory? Partitioned?
  • What work will be done by each processor?
  • They need to coordinate to get a single result, how?

[Diagram: A = array of all data; fA = f(A) applied elementwise; s = sum(fA)]
Programming Model 1: Shared Memory
• Program is a collection of threads of control.
  • Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
• Also a set of shared variables, e.g., static variables, shared common blocks, or global heap.
• Threads communicate implicitly by writing and reading shared variables.
• Threads coordinate by synchronizing on shared variables

[Diagram: processors P0, P1, ..., Pn, each with private memory (i: 2, i: 5, i: 8), attached to a shared memory holding s; threads run code such as "s = ..." and "y = ..s ..."]
Simple Example
• Recall the problem: compute ∑_{i=0}^{n-1} f(A[i])
• Shared memory strategy:
  • small number p << n = size(A) of processors
  • attached to single memory
• Parallel Decomposition:
  • Each evaluation and each partial sum is a task.
  • Assign n/p numbers to each of p procs
  • Each computes independent "private" results and partial sum.
  • Collect the p partial sums and compute a global sum.
• Two Classes of Data:
  • Logically Shared
    • The original n numbers, the global sum.
  • Logically Private
    • The individual function evaluations.
    • What about the individual partial sums?
Shared Memory "Code" for Computing a Sum

static int s = 0;

Thread 1:
  for i = 0, n/2-1
    s = s + f(A[i])

Thread 2:
  for i = n/2, n-1
    s = s + f(A[i])

(One way to create the two threads:)
  fork(sum, a[0:n/2-1]);
  sum(a[n/2, n-1]);

• What is the problem with this program?
• A race condition or data race occurs when:
  - Two processors (or two threads) access the same variable, and at least one does a write.
  - The accesses are concurrent (not synchronized) so they could happen simultaneously
Shared Memory "Code" for Computing a Sum

static int s = 0;

Thread 1:
  ...
  compute f(A[i]) and put in reg0
  reg1 = s
  reg1 = reg1 + reg0
  s = reg1
  ...

Thread 2:
  ...
  compute f(A[i]) and put in reg0
  reg1 = s
  reg1 = reg1 + reg0
  s = reg1
  ...

• Assume A = [3,5], f(x) = x², and s = 0 initially
• For this program to work, s should be 3² + 5² = 34 at the end
  • but it may be 34, 9, or 25
• The atomic operations are reads and writes
  • Never see ½ of one number, but the += operation is not atomic
• All computations happen in (private) registers

[Diagram: interleavings in which each thread reads s = 0 before the other thread writes, so the final value of s is 9 or 25]
Improved Code for Computing a Sum

static int s = 0;
static lock lk;

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i])
  lock(lk);
  s = s + local_s1
  unlock(lk);

Thread 2:
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i])
  lock(lk);
  s = s + local_s2
  unlock(lk);

• Since addition is associative, it's OK to rearrange order
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - But without the locks there would still be a race condition on the update of shared s
  - The race condition is fixed by the locks shown above (only one thread can hold a lock at a time; others wait for it)
• Why not do the lock inside the loop?
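The pseudocode above maps directly onto real threading primitives. A minimal runnable C++11 sketch of mine (not the course's code), with std::thread and std::mutex standing in for the fake lock type:

#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

static int s = 0;        // logically shared
static std::mutex lk;    // protects updates to s

static int f(int x) { return x * x; }

// Each thread accumulates a private partial sum, then updates
// shared s exactly once, while holding the lock.
void partial_sum(const std::vector<int>& A, int lo, int hi) {
    int local_s = 0;                        // logically private
    for (int i = lo; i < hi; i++)
        local_s += f(A[i]);
    std::lock_guard<std::mutex> guard(lk);  // lock(lk); ... unlock(lk);
    s += local_s;
}

int main() {
    std::vector<int> A = {3, 5};
    int n = (int)A.size();
    std::thread t1(partial_sum, std::cref(A), 0, n / 2);
    std::thread t2(partial_sum, std::cref(A), n / 2, n);
    t1.join();
    t2.join();
    printf("s = %d\n", s);  // 3*3 + 5*5 = 34
    return 0;
}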
Mechanisms for preserving atomicity

▪ Lock/unlock mutex around a critical section
  LOCK(mylock);
  // critical section
  UNLOCK(mylock);

▪ Some languages have first-class support for atomicity of code blocks
  atomic {
    // critical section
  }

▪ Intrinsics for hardware-supported atomic read-modify-write operations
  atomicAdd(x, 10);
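For the last mechanism, C++ exposes hardware read-modify-write operations through std::atomic (a minimal sketch of mine; atomicAdd itself is the CUDA spelling of the same idea):

#include <atomic>
#include <cstdio>

int main() {
    std::atomic<int> x(0);
    x.fetch_add(10);            // atomicAdd(x, 10): an indivisible x += 10
    printf("%d\n", x.load());   // prints 10
    return 0;
}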
Review: shared address space model

▪ Threads communicate by:
  - Reading/writing to shared variables in a shared address space
    - Inter-thread communication is implicit in memory loads/stores
    - Thread 1 stores to X
    - Later, thread 2 reads X (and observes update of value by thread 1)
  - Manipulating synchronization primitives
    - e.g., ensuring mutual exclusion via use of locks

▪ This is a natural extension of sequential programming
  - In fact, all our discussions in class have assumed a shared address space so far!
HW implementation of a shared address space

Key idea: any processor can directly reference the contents of any memory location

[Diagram: "dance-hall" organization: processors, each with a local cache, connected through an interconnect to memory and I/O]

[Diagram: interconnect examples: a shared bus, a multi-stage network, and a crossbar, each connecting the processors to one or more memories]

* Caches (not shown) are another implementation of a shared address space (more on this in a later lecture)
Non-uniform memory access (NUMA)

[Diagram: example of a modern dual-socket configuration: cores 1-4 on one chip and cores 5-8 on another, each chip with an on-chip network, its own memory controller, and its own attached memory; the two chips are linked by AMD Hyper-Transport / Intel QuickPath (QPI)]

The latency* of accessing a memory location may be different from different processing cores in the system

Example: latency to access address x is higher from cores 5-8 than from cores 1-4

* Bandwidth from any one location may also be different to different CPU cores
Summary: shared address space model

▪ Communication abstraction
  - Threads read/write variables in a shared address space
  - Threads manipulate synchronization primitives: locks, atomic ops, etc.
  - Logical extension of uniprocessor programming*

▪ Requires hardware support to implement efficiently
  - Any processor can load and store from any address (its shared address space!)
  - Can be costly to scale to large numbers of processors (one of the reasons why high-core-count processors are expensive)

* But NUMA implementation requires reasoning about locality for performance
Message passing model of communication

Message passing model (abstraction)

▪ Threads operate within their own private address spaces

▪ Threads communicate by sending/receiving messages
  - send: specifies recipient, buffer to be transmitted, and optional message identifier ("tag")
  - receive: specifies sender, buffer to store data, and optional message identifier
  - Sending messages is the only way to exchange data between threads 1 and 2
  - Why?

[Diagram: thread 1's address space contains variable X; thread 2's address space contains variable Y; the only communication between the two address spaces is via messages (communication operations shown in red)]

send(X, 2, my_msg_id)
  semantics: send contents of local variable X as a message to thread 2, tagged with the id "my_msg_id"

recv(Y, 1, my_msg_id)
  semantics: receive message with id "my_msg_id" from thread 1 and store contents in local variable Y
Message Passing
• Program consists of a collection of named processes.
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO shared data.
  • Logically shared data is partitioned over local processes.
• Processes communicate by explicit send/receive pairs
  • Coordination is implicit in every communication event.
  • MPI (Message Passing Interface) is the most commonly used SW

[Diagram: processes P0, P1, ..., Pn, each with its own private memory (e.g., s: 12, i: 2 on P0; s: 14, i: 3 on P1; s: 11, i: 1 on Pn), connected by a network; communication via explicit "send P1,s" and "receive Pn,s" pairs]
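Since the slide names MPI as the most commonly used message-passing software, here is a minimal two-process send/receive sketch of mine using the standard MPI C API (run with, e.g., mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int s;
    if (rank == 0) {
        s = 12;
        // explicit send: buffer, count, type, destination rank, tag, communicator
        MPI_Send(&s, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // matching receive from rank 0: the only way to obtain P0's data
        MPI_Recv(&s, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received s = %d\n", s);
    }
    MPI_Finalize();
    return 0;
}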
Message passing (implementation)

▪ Hardware need not implement system-wide loads and stores to execute message passing programs (it need only communicate messages between nodes)
  - Can connect commodity systems together to form a large parallel machine (message passing is a programming model for clusters and supercomputers)

[Images: IBM Blue Gene/P supercomputer; cluster of workstations (Infiniband network). Image credit: IBM]
Programming model vs. implementation of communication

▪ Common to implement message passing abstractions on machines that implement a shared address space in hardware
  - "Sending message" = copying data into message library buffers
  - "Receiving message" = copying data from message library buffers

▪ Can implement shared address space abstraction on machines that do not support it in HW (via less efficient SW implementations)
  - OS marks all pages with shared variables as invalid
  - OS page-fault handler issues appropriate network requests

▪ Keep clear in your mind: what is the programming model (the abstractions used to specify the program)? And what is the HW implementation?
Programming Model 2a: Global Address Space
• Program consists of a collection of named threads.
  • Usually fixed at program startup time
  • Local and shared data, as in shared memory model
  • But, shared data is partitioned over local processes
  • Cost model says remote data is expensive
• Examples: UPC, Titanium, Co-Array Fortran
• Global Address Space programming is an intermediate point between message passing and shared memory

[Diagram: threads P0, P1, ..., Pn, each with private memory (i: 1, i: 5, i: 8), plus a logically shared array s partitioned across them (s[0]: 26 on P0, s[1]: 32 on P1, s[n]: 27 on Pn); threads run code such as "s[myThread] = ..." and "y = ..s[i] ..."]
The data-parallel model

Programming models provide a way to think about the organization of parallel programs

▪ Shared address space: very little structure to communication
  - All threads can read and write to all shared variables
  - Challenge: due to implementation details, not all reads and writes have the same cost (and that cost is often not apparent when reading the source code!)

▪ Message passing: structured communication in the form of messages
  - All communication occurs in the form of messages (communication is explicit in the source code: the sends and receives)

▪ Data parallel: rigid structure to computation
  - Perform the same function on elements of large collections
Data-parallel model

▪ Organize computation as operations on sequences of elements
  - e.g., perform same function on all elements of a sequence

▪ Historically: same operation on each element of a vector
  - Matched the capabilities of SIMD supercomputers of the 80's
  - Connection Machine (CM-1, CM-2): thousands of processors, one instruction decode unit
  - Early Cray supercomputers were vector processors
  - add(A, B, n) ← this was one instruction on vectors A, B of length n (see the SIMD sketch below)
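On a modern x86 CPU, the closest analogue of that one-instruction vector add is a SIMD instruction such as AVX's vaddps. A sketch of mine using compiler intrinsics (assumes AVX support, a separate destination vector, and that n is a multiple of 8):

#include <immintrin.h>

// add(A, B, n): C[i] = A[i] + B[i], eight floats per vector instruction.
void add(const float* A, const float* B, float* C, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 a = _mm256_loadu_ps(A + i);             // load 8 floats of A
        __m256 b = _mm256_loadu_ps(B + i);             // load 8 floats of B
        _mm256_storeu_ps(C + i, _mm256_add_ps(a, b));  // one vector add
    }
}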
Key data type: sequences

▪ Ordered collection of elements

▪ For example, in a C++-like language: Sequence<T>

▪ e.g., Scala lists: List[T]

▪ In a functional language (like Haskell): seq T

▪ Can only access elements of a sequence through specific operations
Map

▪ Higher-order function (a function that takes a function as an argument)

▪ Applies a side-effect-free unary function f :: a -> b to all elements of the input sequence, to produce an output sequence of the same length

▪ In a functional language (e.g., Haskell):
  map :: (a -> b) -> seq a -> seq b

▪ In C++: transform
  template<class InputIt, class OutputIt, class UnaryOperation>
  OutputIt transform(InputIt first1, InputIt last1,
                     OutputIt d_first, UnaryOperation unary_op);

[Diagram: f applied independently to each element of a sequence]
Parallelizing map

▪ Since f :: a -> b is a side-effect-free function, applying f to all elements of the sequence can be done in any order without changing the output of the program

▪ The implementation of map has the flexibility to reorder/parallelize processing of the elements of the sequence however it sees fit
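C++17 exposes exactly this flexibility: std::transform accepts an execution policy that permits the library to process elements in any order, including in parallel (a sketch of mine, assuming a standard library with parallel algorithms support):

#include <algorithm>
#include <cstdio>
#include <execution>
#include <vector>

int main() {
    std::vector<float> in(1024, -3.0f), out(1024);
    // map: apply a side-effect-free unary function to every element.
    // std::execution::par tells the library it may parallelize freely.
    std::transform(std::execution::par, in.begin(), in.end(), out.begin(),
                   [](float x) { return x < 0 ? -x : x; });
    printf("out[0] = %f\n", out[0]);
    return 0;
}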
Optimizing data movement in map

const int N = 1024;
Sequence<float> input(N);
Sequence<float> tmp(N);
Sequence<float> output(N);

map(foo, input, tmp);
map(bar, tmp, output);

parallel_for (int i = 0; i < N; i++) {
  output[i] = bar(foo(input[i]));
}

[Diagram: input → foo → tmp → bar → output]

▪ Consider code that performs two back-to-back maps (like the first version above)

▪ An optimizing compiler or runtime can reorganize the code (as in the parallel_for version above) to eliminate memory loads and stores ("map fusion")

▪ Additional optimizations: highly optimized implementations of map can also perform optimizations like prefetching the next element of the input sequence (to hide memory latency)

▪ Why are these complex optimizations possible?
Data parallelism in ISPC

// ISPC code:
export void absolute_value(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
  foreach (i = 0 ... N) {
    if (x[i] < 0)
      y[i] = -x[i];
    else
      y[i] = x[i];
  }
}

// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x here

absolute_value(N, x, y);

foreach construct: think of the loop body as a function.

Given this program, it is reasonable to think of the program as using foreach to "map the loop body onto each element" of the arrays x and y.

But if we want to be more precise: a sequence is not a first-class ISPC concept. It is implicitly defined by how the program has implemented the array indexing logic in the foreach loop.

(There is no operation in ISPC with the semantics: "map this code over all elements of this sequence")
Data parallelism in ISPC

// ISPC code:
export void absolute_repeat(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
  foreach (i = 0 ... N) {
    if (x[i] < 0)
      y[2*i] = -x[i];
    else
      y[2*i] = x[i];
    y[2*i+1] = y[2*i];
  }
}

// main C++ code:
const int N = 1024;
float* x = new float[N/2];
float* y = new float[N];
// initialize N/2 elements of x here

absolute_repeat(N/2, x, y);

This is also a valid ISPC program!

It takes the absolute value of the elements of x, then repeats each value twice in the output array y.

Think of the loop body as a function: the input/output sequences being mapped over are implicitly defined by the array indexing logic.

(Less obvious how to think of this code as mapping the loop body onto existing sequences.)
Data parallelism in ISPC

// ISPC code:
export void shift_negative(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
  foreach (i = 0 ... N) {
    if (i >= 1 && x[i] < 0)
      y[i-1] = x[i];
    else
      y[i] = x[i];
  }
}

// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];
// initialize N elements of x

shift_negative(N, x, y);

The output of this program is undefined!

It is possible for multiple iterations of the loop body to write to the same memory location.

The data-parallel model (foreach) provides no specification of the order in which iterations occur, and it provides no primitives for fine-grained mutual exclusion/synchronization. It is not intended to help programmers write programs with that structure.

Think of the loop body as a function: the input/output sequences being mapped over are implicitly defined by the array indexing logic.
Gather/scatter: two key data-parallel communication primitives

Map absolute_value onto a stream produced by gather:

const int N = 1024;
Sequence<float> input(N);
Sequence<int> indices;
Sequence<float> tmp_input(N);
Sequence<float> output(N);

stream_gather(input, indices, tmp_input);
absolute_value(tmp_input, output);

ISPC equivalent:

export void absolute_value(
    uniform int N,
    uniform float* input,
    uniform float* output,
    uniform int* indices)
{
  foreach (i = 0 ... N) {
    float tmp = input[indices[i]];
    if (tmp < 0)
      output[i] = -tmp;
    else
      output[i] = tmp;
  }
}

Map absolute_value onto a stream, scatter the results:

const int N = 1024;
Sequence<float> input(N);
Sequence<int> indices;
Sequence<float> tmp_output(N);
Sequence<float> output(N);

absolute_value(input, tmp_output);
stream_scatter(tmp_output, indices, output);

ISPC equivalent:

export void absolute_value(
    uniform int N,
    uniform float* input,
    uniform float* output,
    uniform int* indices)
{
  foreach (i = 0 ... N) {
    if (input[i] < 0)
      output[indices[i]] = -input[i];
    else
      output[indices[i]] = input[i];
  }
}
Gather instruction

gather(R1, R0, mem_base);

"Gather from buffer mem_base into R1 according to indices specified by R0."

[Diagram: index vector R0 = {3, 12, 4, 9, 9, 15, 13, 0}; an array in memory with base address mem_base holding elements 0-15; the result vector R1 receives the gathered values]

Gather was supported with AVX2 in 2013. But AVX2 does not support SIMD scatter (it must be implemented as a scalar loop). A scatter instruction exists in AVX512.

Hardware-supported gather/scatter does exist on GPUs (though it is still an expensive operation compared to a load/store of a contiguous vector).
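For reference, AVX2 exposes that gather instruction through intrinsics such as _mm256_i32gather_ps (a sketch of mine; the compile-time scale argument of 4 means the indices are in units of 4-byte floats):

#include <immintrin.h>

// result[i] = mem_base[indices[i]] for 8 lanes, as in gather(R1, R0, mem_base)
__m256 gather8(const float* mem_base, const int* indices) {
    __m256i idx = _mm256_loadu_si256((const __m256i*)indices);  // R0
    return _mm256_i32gather_ps(mem_base, idx, 4);               // R1
}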
Summary: data-parallel model

▪ Data-parallelism is about imposing rigid program structure to facilitate simple programming and advanced optimizations

▪ Basic idea: map a function onto a large collection of data
  - Functional: side-effect-free execution
  - No communication among distinct function invocations (allows invocations to be scheduled in any order, including in parallel)

▪ In practice that's how many simple programs work

▪ But... many modern performance-oriented data-parallel languages do not enforce this structure in the language
  - ISPC, OpenCL, CUDA, etc.
  - They choose the flexibility/familiarity of imperative C-style syntax over the safety of a more functional form
Summary

▪ Programming models provide a way to think about the organization of parallel programs.

▪ They provide abstractions that permit multiple valid implementations.

▪ I want you to always be thinking about abstraction vs. implementation for the remainder of this course.