System and Architectural Models

Architectural Models

▪ Three parallel programming models
  - That differ in what communication abstractions they present to the programmer
  - Programming models are important because they (1) influence how programmers think when writing programs and (2) influence the design of parallel hardware platforms designed to execute them

▪ Corresponding machine architectures
  - Abstraction presented by the hardware to low-level software

▪ We'll focus on differences in communication/synchronization
System layers: interface, implementation, interface, ...

[Stack diagram, from top to bottom:]
- Parallel applications
- Abstractions for describing concurrent, parallel, or independent computation; abstractions for describing communication (the "programming model": provides a way of thinking about the structure of programs)
- Language or library primitives/mechanisms
- Compiler and/or parallel runtime
- OS system call API
- Operating system
- Hardware architecture (HW/SW boundary)
- Micro-architecture (hardware implementation)
Example: expressing parallelism with pthreads

[Stack diagram, from top to bottom:]
- Parallel application
- Abstraction for concurrent computation: a thread (the thread programming model)
- Interface: pthread_create()
- Implementation: the pthread library
- System call API; OS support: kernel thread management
- Hardware: x86-64 modern multi-core CPU
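For concreteness, here is a minimal sketch of this interface in action (my own example, not from the slides), using the pthread_create() call named above:

#include <pthread.h>
#include <stdio.h>

// Function executed by the spawned thread.
void* hello(void* arg) {
    printf("hello from spawned thread, arg = %d\n", *(int*)arg);
    return NULL;
}

int main() {
    pthread_t t;
    int arg = 42;
    pthread_create(&t, NULL, hello, &arg);  // the interface from the diagram
    pthread_join(t, NULL);                  // wait for the thread to finish
    return 0;
}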
Example: expressing parallelism with ISPC

[Stack diagram, from top to bottom:]
- Parallel applications
- Abstractions for describing parallel computation: (1) for specifying simultaneous execution (true parallelism) and (2) for specifying independent work (potentially parallel) (the ISPC programming model)
- Interface: the ISPC language (call ISPC function, foreach construct)
- Implementation: the ISPC compiler
- System call API; OS support
- Hardware: x86-64 (including AVX vector instructions), a single core of the CPU

Note: This diagram is specific to the ISPC gang abstraction. ISPC also has the "task" language primitive for multi-core execution. I don't describe it here, but it would be interesting to think about how that diagram would look.
Parallel Programming Models
• Programming model is made up of the languages and libraries that create an abstract view of the machine
• Control
  • How is parallelism created?
  • What orderings exist between operations?
• Data
  • What data is private vs. shared?
  • How is logically shared data accessed or communicated?
• Synchronization
  • What operations can be used to coordinate parallelism?
  • What are the atomic (indivisible) operations?
• Cost
  • How do we account for the cost of each of the above?
Three programming models (abstractions)

1. Shared address space
2. Message passing
3. Data parallel
Shared address space model

What is memory?

▪ On the first day of class, we described a program as a sequence of instructions.

▪ Some of those instructions read and write from memory.

▪ But what is memory?
  - To be precise, what I'm really asking is: what is the logical abstraction of memory presented to a program?
A program's memory address space

▪ A computer's memory is organized as an array of bytes

▪ Each byte is identified by its "address" in memory (its position in this array)

"The byte stored at address 0x8 has the value 32."

"The byte stored at address 0x10 (16) has the value 128."

[Illustration: a table of addresses 0x0 through 0x1F and the byte values stored at each (e.g., 0x0: 16, 0x8: 32, 0x10: 128)]

In the illustration, the program's memory address space is 32 bytes in size (so valid addresses range from 0x0 to 0x1F)
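To make the byte-array abstraction concrete, here is a tiny C++ sketch of mine (not from the slides) that treats a 32-byte buffer as a miniature address space, with the same values as the illustration:

#include <cstdio>
#include <cstdint>

int main() {
    uint8_t mem[32] = {0};   // a 32-byte "address space": valid addresses 0x0-0x1F
    mem[0x8]  = 32;          // "the byte stored at address 0x8 has the value 32"
    mem[0x10] = 128;         // "the byte stored at address 0x10 (16) has the value 128"
    for (int addr = 0; addr < 32; addr++)
        printf("0x%X: %d\n", addr, mem[addr]);
    return 0;
}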
Shared address space model (abstraction)

Thread 1:
  int x = 0;
  spawn_thread(foo, &x);

  // write to address holding
  // contents of variable x
  x = 1;

Thread 2:
  void foo(int* x) {
    // read from addr storing
    // contents of variable x
    while (*x == 0) {}
    print *x;
  }

▪ Threads communicate by reading/writing to shared variables

(Pseudocode provided in a fake C-like language for brevity.)

[Diagram: both threads share an address space containing x; thread 1 stores to x, thread 2 loads from x. Communication operations shown in red.]
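A runnable C++11 rendering of this pseudocode (a sketch of mine, not part of the slides; note that a plain int here would be a data race under the C++ memory model, so the shared variable is declared std::atomic):

#include <atomic>
#include <cstdio>
#include <thread>

// Thread 2: spin until x becomes nonzero, then print it.
void foo(std::atomic<int>* x) {
    while (x->load() == 0) {}   // read from address storing contents of x
    printf("%d\n", x->load());
}

int main() {
    std::atomic<int> x(0);      // shared variable in the shared address space
    std::thread t2(foo, &x);    // spawn_thread(foo, &x)
    x.store(1);                 // write to address holding contents of x
    t2.join();
    return 0;
}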
Shared address space model

Thread 1:
  int x = 0;
  Lock my_lock;
  spawn_thread(foo, &x, &my_lock);

  my_lock.lock();
  x++;
  my_lock.unlock();

Thread 2:
  void foo(int* x, Lock* my_lock) {
    my_lock->lock();
    (*x)++;
    my_lock->unlock();
    print *x;
  }

(Pseudocode provided in a fake C-like language for brevity.)

Synchronization primitives are also shared variables: e.g., locks
Simple Example
• Consider applying a function f to the elements of an array A and then computing its sum:

  ∑_{i=0}^{n-1} f(A[i])

• Questions:
  • Where does A live? All in single memory? Partitioned?
  • What work will be done by each processor?
  • They need to coordinate to get a single result, how?

[Diagram: A = array of all data; fA = f(A) applied elementwise; s = sum(fA)]
Programming Model 1: Shared Memory
• Program is a collection of threads of control.
  • Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
• Also a set of shared variables, e.g., static variables, shared common blocks, or global heap.
• Threads communicate implicitly by writing and reading shared variables.
• Threads coordinate by synchronizing on shared variables

[Diagram: processors P0, P1, ..., Pn, each with private memory (i: 2, i: 5, i: 8), attached to a shared memory holding s; threads run code such as "s = ..." and "y = ..s ..."]
Simple Example
• Recall the problem: compute ∑_{i=0}^{n-1} f(A[i])
• Shared memory strategy:
  • small number p << n = size(A) of processors
  • attached to single memory
• Parallel Decomposition:
  • Each evaluation and each partial sum is a task.
  • Assign n/p numbers to each of p procs
  • Each computes independent "private" results and partial sum.
  • Collect the p partial sums and compute a global sum.
• Two Classes of Data:
  • Logically Shared
    • The original n numbers, the global sum.
  • Logically Private
    • The individual function evaluations.
    • What about the individual partial sums?
Shared Memory "Code" for Computing a Sum

static int s = 0;

Thread 1:
  for i = 0, n/2-1
    s = s + f(A[i])

Thread 2:
  for i = n/2, n-1
    s = s + f(A[i])

(One way to create the two threads:)
  fork(sum, a[0:n/2-1]);
  sum(a[n/2, n-1]);

• What is the problem with this program?
• A race condition or data race occurs when:
  - Two processors (or two threads) access the same variable, and at least one does a write.
  - The accesses are concurrent (not synchronized) so they could happen simultaneously
Shared Memory "Code" for Computing a Sum

static int s = 0;

Thread 1:
  ...
  compute f(A[i]) and put in reg0
  reg1 = s
  reg1 = reg1 + reg0
  s = reg1
  ...

Thread 2:
  ...
  compute f(A[i]) and put in reg0
  reg1 = s
  reg1 = reg1 + reg0
  s = reg1
  ...

• Assume A = [3,5], f(x) = x², and s = 0 initially
• For this program to work, s should be 3² + 5² = 34 at the end
  • but it may be 34, 9, or 25
• The atomic operations are reads and writes
  • Never see ½ of one number, but the += operation is not atomic
• All computations happen in (private) registers

[Diagram: interleavings in which each thread reads s = 0 before the other thread writes, so the final value of s is 9 or 25]
Improved Code for Computing a Sum

static int s = 0;
static lock lk;

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i])
  lock(lk);
  s = s + local_s1
  unlock(lk);

Thread 2:
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i])
  lock(lk);
  s = s + local_s2
  unlock(lk);

• Since addition is associative, it's OK to rearrange order
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - But without the locks there would still be a race condition on the update of shared s
  - The race condition is fixed by the locks shown above (only one thread can hold a lock at a time; others wait for it)
• Why not do the lock inside the loop?
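The pseudocode above maps directly onto real threading primitives. A minimal runnable C++11 sketch of mine (not the course's code), with std::thread and std::mutex standing in for the fake lock type:

#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

static int s = 0;        // logically shared
static std::mutex lk;    // protects updates to s

static int f(int x) { return x * x; }

// Each thread accumulates a private partial sum, then updates
// shared s exactly once, while holding the lock.
void partial_sum(const std::vector<int>& A, int lo, int hi) {
    int local_s = 0;                        // logically private
    for (int i = lo; i < hi; i++)
        local_s += f(A[i]);
    std::lock_guard<std::mutex> guard(lk);  // lock(lk); ... unlock(lk);
    s += local_s;
}

int main() {
    std::vector<int> A = {3, 5};
    int n = (int)A.size();
    std::thread t1(partial_sum, std::cref(A), 0, n / 2);
    std::thread t2(partial_sum, std::cref(A), n / 2, n);
    t1.join();
    t2.join();
    printf("s = %d\n", s);  // 3*3 + 5*5 = 34
    return 0;
}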
Mechanisms for preserving atomicity

▪ Lock/unlock mutex around a critical section
  LOCK(mylock);
  // critical section
  UNLOCK(mylock);

▪ Some languages have first-class support for atomicity of code blocks
  atomic {
    // critical section
  }

▪ Intrinsics for hardware-supported atomic read-modify-write operations
  atomicAdd(x, 10);
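For the last mechanism, C++ exposes hardware read-modify-write operations through std::atomic (a minimal sketch of mine; atomicAdd itself is the CUDA spelling of the same idea):

#include <atomic>
#include <cstdio>

int main() {
    std::atomic<int> x(0);
    x.fetch_add(10);            // atomicAdd(x, 10): an indivisible x += 10
    printf("%d\n", x.load());   // prints 10
    return 0;
}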
Review: shared address space model

▪ Threads communicate by:
  - Reading/writing to shared variables in a shared address space
    - Inter-thread communication is implicit in memory loads/stores
    - Thread 1 stores to X
    - Later, thread 2 reads X (and observes update of value by thread 1)
  - Manipulating synchronization primitives
    - e.g., ensuring mutual exclusion via use of locks

▪ This is a natural extension of sequential programming
  - In fact, all our discussions in class have assumed a shared address space so far!
HW implementation of a shared address space

Key idea: any processor can directly reference the contents of any memory location

[Diagram: "dance-hall" organization: processors, each with a local cache, connected through an interconnect to memory and I/O]

[Diagram: interconnect examples: a shared bus, a multi-stage network, and a crossbar, each connecting the processors to one or more memories]

* Caches (not shown) are another implementation of a shared address space (more on this in a later lecture)
Non-uniform memory access (NUMA)

[Diagram: example of a modern dual-socket configuration: cores 1-4 on one chip and cores 5-8 on another, each chip with an on-chip network, its own memory controller, and its own attached memory; the two chips are linked by AMD Hyper-Transport / Intel QuickPath (QPI)]

The latency* of accessing a memory location may be different from different processing cores in the system

Example: latency to access address x is higher from cores 5-8 than from cores 1-4

* Bandwidth from any one location may also be different to different CPU cores
Summary: shared address space model

▪ Communication abstraction
  - Threads read/write variables in a shared address space
  - Threads manipulate synchronization primitives: locks, atomic ops, etc.
  - Logical extension of uniprocessor programming*

▪ Requires hardware support to implement efficiently
  - Any processor can load and store from any address (its shared address space!)
  - Can be costly to scale to large numbers of processors (one of the reasons why high-core-count processors are expensive)

* But NUMA implementation requires reasoning about locality for performance
Message passing model of communication

Message passing model (abstraction)

▪ Threads operate within their own private address spaces

▪ Threads communicate by sending/receiving messages
  - send: specifies recipient, buffer to be transmitted, and optional message identifier ("tag")
  - receive: specifies sender, buffer to store data, and optional message identifier
  - Sending messages is the only way to exchange data between threads 1 and 2
  - Why?

[Diagram: thread 1's address space contains variable X; thread 2's address space contains variable Y; the only communication between the two address spaces is via messages (communication operations shown in red)]

send(X, 2, my_msg_id)
  semantics: send contents of local variable X as a message to thread 2, tagged with the id "my_msg_id"

recv(Y, 1, my_msg_id)
  semantics: receive message with id "my_msg_id" from thread 1 and store contents in local variable Y
Message Passing
• Program consists of a collection of named processes.
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO shared data.
  • Logically shared data is partitioned over local processes.
• Processes communicate by explicit send/receive pairs
  • Coordination is implicit in every communication event.
  • MPI (Message Passing Interface) is the most commonly used SW

[Diagram: processes P0, P1, ..., Pn, each with its own private memory (e.g., s: 12, i: 2 on P0; s: 14, i: 3 on P1; s: 11, i: 1 on Pn), connected by a network; communication via explicit "send P1,s" and "receive Pn,s" pairs]
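Since the slide names MPI as the most commonly used message-passing software, here is a minimal two-process send/receive sketch of mine using the standard MPI C API (run with, e.g., mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int s;
    if (rank == 0) {
        s = 12;
        // explicit send: buffer, count, type, destination rank, tag, communicator
        MPI_Send(&s, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // matching receive from rank 0: the only way to obtain P0's data
        MPI_Recv(&s, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received s = %d\n", s);
    }
    MPI_Finalize();
    return 0;
}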
Message passing (implementation)

▪ Hardware need not implement system-wide loads and stores to execute message passing programs (it need only communicate messages between nodes)
  - Can connect commodity systems together to form a large parallel machine (message passing is a programming model for clusters and supercomputers)

[Images: IBM Blue Gene/P supercomputer; cluster of workstations (Infiniband network). Image credit: IBM]
Programming model vs. implementation of communication

▪ Common to implement message passing abstractions on machines that implement a shared address space in hardware
  - "Sending message" = copying data into message library buffers
  - "Receiving message" = copying data from message library buffers

▪ Can implement shared address space abstraction on machines that do not support it in HW (via less efficient SW implementations)
  - OS marks all pages with shared variables as invalid
  - OS page-fault handler issues appropriate network requests

▪ Keep clear in your mind: what is the programming model (the abstractions used to specify the program)? And what is the HW implementation?
Programming Model 2a: Global Address Space
• Program consists of a collection of named threads.
  • Usually fixed at program startup time
  • Local and shared data, as in shared memory model
  • But, shared data is partitioned over local processes
  • Cost model says remote data is expensive
• Examples: UPC, Titanium, Co-Array Fortran
• Global Address Space programming is an intermediate point between message passing and shared memory

[Diagram: threads P0, P1, ..., Pn, each with private memory (i: 1, i: 5, i: 8), plus a logically shared array s partitioned across them (s[0]: 26 on P0, s[1]: 32 on P1, s[n]: 27 on Pn); threads run code such as "s[myThread] = ..." and "y = ..s[i] ..."]
The data-parallel model

Programming models provide a way to think about the organization of parallel programs

▪ Shared address space: very little structure to communication
  - All threads can read and write to all shared variables
  - Challenge: due to implementation details, not all reads and writes have the same cost (and that cost is often not apparent when reading the source code!)

▪ Message passing: structured communication in the form of messages
  - All communication occurs in the form of messages (communication is explicit in the source code: the sends and receives)

▪ Data parallel: rigid structure to computation
  - Perform the same function on elements of large collections
Data-parallel model

▪ Organize computation as operations on sequences of elements
  - e.g., perform same function on all elements of a sequence

▪ Historically: same operation on each element of a vector
  - Matched the capabilities of SIMD supercomputers of the 80's
  - Connection Machine (CM-1, CM-2): thousands of processors, one instruction decode unit
  - Early Cray supercomputers were vector processors
  - add(A, B, n) ← this was one instruction on vectors A, B of length n (see the SIMD sketch below)
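On a modern x86 CPU, the closest analogue of that one-instruction vector add is a SIMD instruction such as AVX's vaddps. A sketch of mine using compiler intrinsics (assumes AVX support, a separate destination vector, and that n is a multiple of 8):

#include <immintrin.h>

// add(A, B, n): C[i] = A[i] + B[i], eight floats per vector instruction.
void add(const float* A, const float* B, float* C, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 a = _mm256_loadu_ps(A + i);             // load 8 floats of A
        __m256 b = _mm256_loadu_ps(B + i);             // load 8 floats of B
        _mm256_storeu_ps(C + i, _mm256_add_ps(a, b));  // one vector add
    }
}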
Key data type: sequences

▪ Ordered collection of elements

▪ For example, in a C++-like language: Sequence<T>

▪ e.g., Scala lists: List[T]

▪ In a functional language (like Haskell): seq T

▪ Can only access elements of a sequence through specific operations
Map

▪ Higher-order function (a function that takes a function as an argument)

▪ Applies a side-effect-free unary function f :: a -> b to all elements of the input sequence, to produce an output sequence of the same length

▪ In a functional language (e.g., Haskell):
  map :: (a -> b) -> seq a -> seq b

▪ In C++: transform
  template<class InputIt, class OutputIt, class UnaryOperation>
  OutputIt transform(InputIt first1, InputIt last1,
                     OutputIt d_first, UnaryOperation unary_op);

[Diagram: f applied independently to each element of a sequence]
Parallelizing map

▪ Since f :: a -> b is a side-effect-free function, applying f to all elements of the sequence can be done in any order without changing the output of the program

▪ The implementation of map has the flexibility to reorder/parallelize processing of the elements of the sequence however it sees fit
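C++17 exposes exactly this flexibility: std::transform accepts an execution policy that permits the library to process elements in any order, including in parallel (a sketch of mine, assuming a standard library with parallel algorithms support):

#include <algorithm>
#include <cstdio>
#include <execution>
#include <vector>

int main() {
    std::vector<float> in(1024, -3.0f), out(1024);
    // map: apply a side-effect-free unary function to every element.
    // std::execution::par tells the library it may parallelize freely.
    std::transform(std::execution::par, in.begin(), in.end(), out.begin(),
                   [](float x) { return x < 0 ? -x : x; });
    printf("out[0] = %f\n", out[0]);
    return 0;
}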
Optimizing data movement in map

const int N = 1024;
Sequence<float> input(N);
Sequence<float> tmp(N);
Sequence<float> output(N);

map(foo, input, tmp);
map(bar, tmp, output);

parallel_for (int i = 0; i < N; i++) {
  output[i] = bar(foo(input[i]));
}

[Diagram: input → foo → tmp → bar → output]

▪ Consider code that performs two back-to-back maps (like the first version above)

▪ An optimizing compiler or runtime can reorganize the code (as in the parallel_for version above) to eliminate memory loads and stores ("map fusion")

▪ Additional optimizations: highly optimized implementations of map can also perform optimizations like prefetching the next element of the input sequence (to hide memory latency)

▪ Why are these complex optimizations possible?
Data parallelism in ISPC

// ISPC code:
export void absolute_value(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
  foreach (i = 0 ... N) {
    if (x[i] < 0)
      y[i] = -x[i];
    else
      y[i] = x[i];
  }
}

// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x here

absolute_value(N, x, y);

foreach construct: think of the loop body as a function.

Given this program, it is reasonable to think of the program as using foreach to "map the loop body onto each element" of the arrays x and y.

But if we want to be more precise: a sequence is not a first-class ISPC concept. It is implicitly defined by how the program has implemented the array indexing logic in the foreach loop.

(There is no operation in ISPC with the semantics: "map this code over all elements of this sequence")
Data parallelism in ISPC

// ISPC code:
export void absolute_repeat(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
  foreach (i = 0 ... N) {
    if (x[i] < 0)
      y[2*i] = -x[i];
    else
      y[2*i] = x[i];
    y[2*i+1] = y[2*i];
  }
}

// main C++ code:
const int N = 1024;
float* x = new float[N/2];
float* y = new float[N];
// initialize N/2 elements of x here

absolute_repeat(N/2, x, y);

This is also a valid ISPC program!

It takes the absolute value of the elements of x, then repeats each value twice in the output array y.

Think of the loop body as a function: the input/output sequences being mapped over are implicitly defined by the array indexing logic.

(Less obvious how to think of this code as mapping the loop body onto existing sequences.)
Data parallelism in ISPC

// ISPC code:
export void shift_negative(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
  foreach (i = 0 ... N) {
    if (i >= 1 && x[i] < 0)
      y[i-1] = x[i];
    else
      y[i] = x[i];
  }
}

// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];
// initialize N elements of x

shift_negative(N, x, y);

The output of this program is undefined!

It is possible for multiple iterations of the loop body to write to the same memory location.

The data-parallel model (foreach) provides no specification of the order in which iterations occur, and it provides no primitives for fine-grained mutual exclusion/synchronization. It is not intended to help programmers write programs with that structure.

Think of the loop body as a function: the input/output sequences being mapped over are implicitly defined by the array indexing logic.
Gather/scatter: two key data-parallel communication primitives

Map absolute_value onto a stream produced by gather:

const int N = 1024;
Sequence<float> input(N);
Sequence<int> indices;
Sequence<float> tmp_input(N);
Sequence<float> output(N);

stream_gather(input, indices, tmp_input);
absolute_value(tmp_input, output);

ISPC equivalent:

export void absolute_value(
    uniform int N,
    uniform float* input,
    uniform float* output,
    uniform int* indices)
{
  foreach (i = 0 ... N) {
    float tmp = input[indices[i]];
    if (tmp < 0)
      output[i] = -tmp;
    else
      output[i] = tmp;
  }
}

Map absolute_value onto a stream, scatter the results:

const int N = 1024;
Sequence<float> input(N);
Sequence<int> indices;
Sequence<float> tmp_output(N);
Sequence<float> output(N);

absolute_value(input, tmp_output);
stream_scatter(tmp_output, indices, output);

ISPC equivalent:

export void absolute_value(
    uniform int N,
    uniform float* input,
    uniform float* output,
    uniform int* indices)
{
  foreach (i = 0 ... N) {
    if (input[i] < 0)
      output[indices[i]] = -input[i];
    else
      output[indices[i]] = input[i];
  }
}
Gather instruction

gather(R1, R0, mem_base);

"Gather from buffer mem_base into R1 according to indices specified by R0."

[Diagram: index vector R0 = {3, 12, 4, 9, 9, 15, 13, 0}; an array in memory with base address mem_base holding elements 0-15; the result vector R1 receives the gathered values]

Gather was supported with AVX2 in 2013. But AVX2 does not support SIMD scatter (it must be implemented as a scalar loop). A scatter instruction exists in AVX512.

Hardware-supported gather/scatter does exist on GPUs (though it is still an expensive operation compared to a load/store of a contiguous vector).
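For reference, AVX2 exposes that gather instruction through intrinsics such as _mm256_i32gather_ps (a sketch of mine; the compile-time scale argument of 4 means the indices are in units of 4-byte floats):

#include <immintrin.h>

// result[i] = mem_base[indices[i]] for 8 lanes, as in gather(R1, R0, mem_base)
__m256 gather8(const float* mem_base, const int* indices) {
    __m256i idx = _mm256_loadu_si256((const __m256i*)indices);  // R0
    return _mm256_i32gather_ps(mem_base, idx, 4);               // R1
}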
Summary: data-parallel model

▪ Data-parallelism is about imposing rigid program structure to facilitate simple programming and advanced optimizations

▪ Basic idea: map a function onto a large collection of data
  - Functional: side-effect-free execution
  - No communication among distinct function invocations (allows invocations to be scheduled in any order, including in parallel)

▪ In practice that's how many simple programs work

▪ But... many modern performance-oriented data-parallel languages do not enforce this structure in the language
  - ISPC, OpenCL, CUDA, etc.
  - They choose the flexibility/familiarity of imperative C-style syntax over the safety of a more functional form
Summary

▪ Programming models provide a way to think about the organization of parallel programs.

▪ They provide abstractions that permit multiple valid implementations.

▪ I want you to always be thinking about abstraction vs. implementation for the remainder of this course.