CS 61C: Great Ideas in Computer Architecture
Lecture 19: Thread-Level Parallel Processing
Krste Asanović & Randy H. Katz
http://inst.eecs.berkeley.edu/~cs61c/fa17
11/2/17, Fall 2017, Lecture #19

Agenda
• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

Improving Performance
1. Increase clock rate f_s
   − Reached practical maximum for today’s technology
   − < 5 GHz for general-purpose computers
2. Lower CPI (cycles per instruction)
   − SIMD, “instruction level parallelism”
3. Perform multiple tasks simultaneously (today’s lecture)
   − Multiple CPUs, each executing different program
   − Tasks may be related
     § E.g. each CPU performs part of a big matrix multiplication
   − or unrelated
     § E.g. distribute different web http requests over different computers
     § E.g. run pptx (view lecture slides) and browser (youtube) simultaneously
4. Do all of the above:
   − High f_s, SIMD, multiple parallel tasks

New-School Machine Structures (It’s a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search “Katz”
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions: all gates @ one time
• Programming Languages
Harness parallelism & achieve high performance
[Figure: software/hardware hierarchy from warehouse-scale computer and smartphone, to computer (cores, memory/cache, input/output), to core (instruction unit(s), functional unit(s) computing A0+B0 … A3+B3, cache memory), down to logic gates. Projects 3 and 5!]

Parallel Computer Architectures
• Several separate computers, some means for communication (e.g., Ethernet)
• Massive array of computers, fast communication between processors
• Multi-core CPU: more than one datapath in a single chip; cores share L3 cache, memory, peripherals. Example: Hive machines
• GPU: “graphics processing unit”

Example: CPU with Two Cores
[Figure: two processor “cores” (Core 1 = Processor 0, Core 2 = Processor 1), each with its own control and datapath (PC, registers, ALU), both issuing memory accesses to one shared memory (bytes), which connects to input and output through the I/O-memory interfaces]

Multiprocessor Execution Model
• Each processor (core) executes its own instructions
• Separate resources (not shared)
  − Datapath (PC, registers, ALU)
  − Highest level caches (e.g., 1st and 2nd)
• Shared resources
  − Memory (DRAM)
  − Often 3rd level cache
    § Often on same silicon chip
    § But not a requirement
• Nomenclature
  − “Multiprocessor Microprocessor”
  − Multicore processor
    § E.g., four core CPU (central processing unit)
    § Executes four different instruction streams simultaneously

Transition to Multicore
[Figure: sequential app performance]

Pixel 2 vs. iPhone 8
[Figures: processor spec comparison. Recoverable labels: ALUs, nm, MHz, GFlops; “2.35 GHz + 1.9 GHz, 64 Bit Octa-Core”]

Multiprocessor Execution Model
• Shared memory
  − Each “core” has access to the entire memory in the processor
  − Special hardware keeps caches consistent (next lecture!)
  − Advantages:
    § Simplifies communication in program via shared variables
  − Drawbacks:
    § Does not scale well:
      o “Slow” memory shared by many “customers” (cores)
      o May become bottleneck (Amdahl’s Law)
• Two ways to use a multiprocessor:
  − Job-level parallelism
    § Processors work on unrelated problems
    § No communication between programs
  − Partition work of single task between several cores
    § E.g., each performs part of large matrix multiplication

Parallel Processing
• It’s difficult!
• It’s inevitable
  − Only path to increase performance
  − Only path to lower energy consumption (improve battery life)
• In mobile systems (e.g., smartphones, tablets)
  − Multiple cores
  − Dedicated processors, e.g.,
    § Motion processor, image processor, neural processor in iPhone 8 + X
    § GPU (graphics processing unit)
• Warehouse-scale computers (next week!)
  − Multiple “nodes”
    § “Boxes” with several CPUs, disks per box
  − MIMD (multi-core) and SIMD (e.g. AVX) in each node

Potential Parallel Performance (assuming software can use it)

Year   Cores   SIMD bits/Core   Cores × SIMD bits   Total, e.g. FLOPs/Cycle
2003     2          128                256                     4
2005     4          128                512                     8
2007     6          128                768                    12
2009     8          128               1024                    16
2011    10          256               2560                    40
2013    12          256               3072                    48
2015    14          512               7168                   112
2017    16          512               8192                   128
2019    18         1024              18432                   288
2021    20         1024              20480                   320

Over 12 years (2009 to 2021 in the table):
• MIMD: +2 cores every 2 years, 2.5X
• SIMD: 2X every 4 years, 8X
• MIMD & SIMD combined: 20X
20X in 12 years: 20^(1/12) = 1.28X → 28% per year, or 2X every 3 years!
IF (!) we can use it
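The growth-rate arithmetic on this slide can be written out step by step. This is only a restatement of the slide's numbers; the symbol r for the annual growth factor is introduced here for illustration.

```latex
\[
\frac{320\ \text{FLOPs/cycle (2021)}}{16\ \text{FLOPs/cycle (2009)}} = 20
\qquad\Rightarrow\qquad
r^{12} = 20
\]
\[
r = 20^{1/12} = e^{\ln 20 / 12} \approx e^{0.2496} \approx 1.28
\quad\text{(about 28\% per year)}
\]
\[
r^{3} \approx 1.28^{3} \approx 2.1
\quad\text{(roughly a doubling every 3 years)}
\]
```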

Programs Running on my Computer
$ ps -x
PID TTY     TIME CMD
220 ??   0:04.34 /usr/libexec/UserEventAgent (Aqua)
222 ??   0:10.60 /usr/sbin/distnoted agent
224 ??   0:09.11 /usr/sbin/cfprefsd agent
229 ??   0:04.71 /usr/sbin/usernoted
230 ??   0:02.35 /usr/libexec/nsurlsessiond
232 ??   0:28.68 /System/Library/PrivateFrameworks/CalendarAgent.framework/Executables/CalendarAgent
234 ??   0:04.36 /System/Library/PrivateFrameworks/GameCenterFoundation.framework/Versions/A/gamed
235 ??   0:01.90 /System/Library/CoreServices/cloudphotosd.app/Contents/MacOS/cloudphotosd
236 ??   0:49.72 /usr/libexec/secinitd
239 ??   0:01.66 /System/Library/PrivateFrameworks/TCC.framework/Resources/tccd
240 ??   0:12.68 /System/Library/Frameworks/Accounts.framework/Versions/A/Support/accountsd
241 ??   0:09.56 /usr/libexec/SafariCloudHistoryPushAgent
242 ??   0:00.27 /System/Library/PrivateFrameworks/CallHistory.framework/Support/CallHistorySyncHelper
243 ??   0:00.74 /System/Library/CoreServices/mapspushd
244 ??   0:00.79 /usr/libexec/fmfd
246 ??   0:00.09 /System/Library/PrivateFrameworks/AskPermission.framework/Versions/A/Resources/askpermissiond
248 ??   0:01.03 /System/Library/PrivateFrameworks/CloudDocsDaemon.framework/Versions/A/Support/bird
249 ??   0:02.50 /System/Library/PrivateFrameworks/IDS.framework/identityservicesd.app/Contents/MacOS/identityservicesd
250 ??   0:04.81 /usr/libexec/secd
254 ??   0:24.01 /System/Library/PrivateFrameworks/CloudKitDaemon.framework/Support/cloudd
258 ??   0:04.73 /System/Library/PrivateFrameworks/TelephonyUtilities.framework/callservicesd
267 ??   0:02.15 /System/Library/CoreServices/AirPlayUIAgent.app/Contents/MacOS/AirPlayUIAgent --launchd
271 ??   0:03.91 /usr/libexec/nsurlstoraged
274 ??   0:00.90 /System/Library/PrivateFrameworks/CommerceKit.framework/Versions/A/Resources/storeaccountd
282 ??   0:00.09 /usr/sbin/pboard
283 ??   0:00.90 /System/Library/PrivateFrameworks/InternetAccounts.framework/Versions/A/XPCServices/com.apple.internetaccounts.xpc/Contents/MacOS/com.apple.internetaccounts
285 ??   0:04.72 /System/Library/Frameworks/ApplicationServices.framework/Frameworks/ATS.framework/Support/fontd
291 ??   0:00.25 /System/Library/Frameworks/Security.framework/Versions/A/Resources/CloudKeychainProxy.bundle/Contents/MacOS/CloudKeychainProxy
292 ??   0:09.54 /System/Library/CoreServices/CoreServicesUIAgent.app/Contents/MacOS/CoreServicesUIAgent
293 ??   0:00.29 /System/Library/PrivateFrameworks/CloudPhotoServices.framework/Versions/A/Frameworks/CloudPhotoServicesConfiguration.framework/Versions/A/XPCServices/com.apple.CloudPhotosConfiguration.xpc/Contents/MacOS/com.apple.CloudPhotosConfiguration
297 ??   0:00.84 /System/Library/PrivateFrameworks/CloudServices.framework/Resources/com.apple.sbd
302 ??   0:26.11 /System/Library/CoreServices/Dock.app/Contents/MacOS/Dock
303 ??   0:09.55 /System/Library/CoreServices/SystemUIServer.app/Contents/MacOS/SystemUIServer
… 156 total at this moment
How does my laptop do this?
Imagine doing 156 assignments all at the same time!

Threads
• Sequential flow of instructions that performs some task
  − Up to now we just called this a “program”
• Each thread has:
  − Dedicated PC (program counter)
  − Separate registers
  − Accesses the shared memory
• Each physical core provides one (or more)
  − Hardware threads that actively execute instructions
  − Each executes one “hardware thread”
• Operating system multiplexes multiple
  − Software threads onto the available hardware threads
  − All threads except those mapped to hardware threads are waiting

Operating System Threads
Give illusion of many “simultaneously” active threads
1. Multiplex software threads onto hardware threads:
   a) Switch out blocked threads (e.g., cache miss, user input, network access)
   b) Timer (e.g., switch active thread every 1 ms)
2. Remove a software thread from a hardware thread by
   a) Interrupting its execution
   b) Saving its registers and PC to memory
3. Start executing a different software thread by
   a) Loading its previously saved registers into a hardware thread’s registers
   b) Jumping to its saved PC

Example: Four Cores
Thread pool: list of threads competing for processor
OS maps threads to cores and schedules logical (software) threads
[Figure: Core 1, Core 2, Core 3, Core 4]
Each “core” actively runs one instruction stream at a time

Multithreading
• Typical scenario:
  − Active thread encounters cache miss
  − Active thread waits ~1000 cycles for data from DRAM
  − → switch out and run different thread until data available
• Problem
  − Must save current thread state and load new thread state
    § PC, all registers (could be many, e.g. AVX)
  − → must perform switch in ≪ 1000 cycles
• Can hardware help?
  − Moore’s Law: transistors are plenty

Hardware Assisted Software Multithreading
• Two copies of PC and registers inside processor hardware
• Looks identical to two processors to software (hardware thread 0, hardware thread 1)
• Hyperthreading:
  • Both threads can be active simultaneously
[Figure: processor (1 core, 2 threads) with one control unit and one datapath (ALU) holding PC 0 / Registers 0 and PC 1 / Registers 1, connected to memory (bytes) and to input and output through the I/O-memory interfaces]

Multithreading
• Logical threads
  − ≈ 1% more hardware, ≈ 10% (?) better performance
    § Separate registers
    § Share datapath, ALU(s), caches
• Multicore
  − => Duplicate processors
  − ≈ 50% more hardware, ≈ 2X better performance?
• Modern machines do both
  − Multiple cores with multiple threads per core

Randy’s Laptop
$ sysctl -a | grep hw
hw.physicalcpu: 2
hw.logicalcpu: 4
hw.l1icachesize: 32,768
hw.l1dcachesize: 32,768
hw.l2cachesize: 262,144
hw.l3cachesize: 4,194,304
• 2 cores
• 4 threads total

Example: 6 Cores, 24 Logical Threads
Thread pool: list of threads competing for processor
OS maps threads to cores and schedules logical (software) threads
[Figure: Core 1 through Core 6, each holding Thread 1 through Thread 4]
4 logical threads per core (hardware) thread
Break!

Languages Supporting Parallel Programming

ActorScript     Concurrent Pascal     JoCaml       Orc
Ada             Concurrent ML         Join         Oz
Afnix           Concurrent Haskell    Java         Pict
Alef            Curry                 Joule        Reia
Alice           CUDA                  Joyce        SALSA
APL             E                     LabVIEW      Scala
Axum            Eiffel                Limbo        SISAL
Chapel          Erlang                Linda        SR
Cilk            Fortran 90            MultiLisp    Stackless Python
Clean           Go                    Modula-3     SuperPascal
Clojure         Io                    Occam        VHDL
Concurrent C    Janus                 occam-π      XC

Which one to pick?

Why So Many Parallel Programming Languages?
• Why “intrinsics”?
  − TO Intel: fix your #()&$! compiler!
• It’s happening ... but
  − SIMD features are continually added to compilers (Intel, gcc)
  − Intense area of research
  − Research progress:
    § 20+ years to translate C into good (fast!) assembly
    § How long to translate C into good (fast!) parallel code?
      o General problem is very hard to solve
      o Present state: specialized solutions for specific cases
      o Your opportunity to become famous!

Parallel Programming Languages
• Number of choices is indication of
  − No universal solution
    § Needs are very problem specific
  − E.g.,
    § Scientific computing / machine learning (matrix multiply)
    § Web server: handle many unrelated requests simultaneously
    § Input/output: it’s all happening simultaneously!
• Specialized languages for different tasks
  − Some are easier to use (for some problems)
  − None is particularly “easy” to use
• 61C
  − Parallel language examples for high-performance computing
  − OpenMP

Parallel Loops
• Serial execution:
for (int i=0; i<100; i++) {
    …
}
• Parallel execution (each loop runs on a different core):
for (int i=0; i<25; i++) {
    …
}
for (int i=25; i<50; i++) {
    …
}
for (int i=50; i<75; i++) {
    …
}
for (int i=75; i<100; i++) {
    …
}

Parallel for in OpenMP
#include <omp.h>

#pragma omp parallel for
for (int i=0; i<100; i++) {
    …
}

OpenMP Example
$ gcc-5 -fopenmp for.c; ./a.out
thread 0, i = 0
thread 1, i = 3
thread 2, i = 6
thread 3, i = 8
thread 0, i = 1
thread 1, i = 4
thread 2, i = 7
thread 3, i = 9
thread 0, i = 2
thread 1, i = 5

OpenMP
• C extension: no new language to learn
• Multi-threaded, shared-memory parallelism
  − Compiler directives, #pragma
  − Runtime library routines, #include <omp.h>
• #pragma
  − Ignored by compilers unaware of OpenMP
  − Same source for multiple architectures
    § E.g., same program for 1 & 16 cores
• Only works with shared memory

OpenMP Programming Model
• Fork-join model:
• OpenMP programs begin as single process (master thread)
  − Sequential execution
• When parallel region is encountered
  − Master thread “forks” into team of parallel threads
  − Executed simultaneously
  − At end of parallel region, parallel threads “join”, leaving only master thread
• Process repeats for each parallel region
  − Amdahl’s Law?

What Kind of Threads?
• OpenMP threads are operating system (software) threads
• OS will multiplex requested OpenMP threads onto available hardware threads
• Hopefully each gets a real hardware thread to run on, so no OS-level time-multiplexing
• But other tasks on machine compete for hardware threads!
• Be “careful” (?) when timing results for Project 3!
  − 5 AM?
  − Job queue?

Example 2: Computing π
http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

Sequential π
pi = 3.142425985001
• Resembles π, but not very accurate
• Let’s increase num_steps and parallelize

Parallelize (1) …
• Problem: each thread needs access to the shared variable sum
• Code runs sequentially …

Parallelize (2) …
[Figure: sum[0], sum[1]]
1. Compute sum[0] and sum[1] in parallel
2. Compute sum = sum[0] + sum[1] sequentially

Parallel π

Trial Run
i = 1, id = 1
i = 0, id = 0
i = 2, id = 2
i = 3, id = 3
i = 5, id = 1
i = 4, id = 0
i = 6, id = 2
i = 7, id = 3
i = 9, id = 1
i = 8, id = 0
pi = 3.142425985001

Scale up: num_steps = 10^6
pi = 3.141592653590
You verify how many digits are correct …

Can We Parallelize Computing sum?
Summation inside parallel section
• Insignificant speedup in this example, but …
• pi = 3.138450662641
• Wrong! And value changes between runs?!
• What’s going on?
Always looking for ways to beat Amdahl’s Law …

Peer Instruction
What are the possible values of *(x1) after executing this code by two concurrent threads?

# *(x1) = 100
lw   x2, 0(x1)
addi x2, x2, 1
sw   x2, 0(x1)

Answer    *(x1)
RED       100 or 101
GREEN     101
ORANGE    101 or 102
YELLOW    100 or 101 or 102

What’s Going On?
• Operation is really pi = pi + sum[id]
• What if >1 threads read the current (same) value of pi, compute the sum, and store the result back to pi?
• Each processor reads same intermediate value of pi!
• Result depends on who gets there when
• A “race” → result is not deterministic

Administrivia
• Homework 4 (Caches, Floating Point) due tomorrow at 11:59pm
• Project 2-2 due Monday
  − Project office hours that Monday will be well staffed!
  − Test your CPU thoroughly!
    § Write programs with Venus and load them into your circuit
• Project 3 will be released Monday night
  − A two-week performance project
  − Can earn extra credit from the performance contest (Project 5)
• Midterm scores will be released before Tuesday on Gradescope
Break!

Synchronization
• Problem:
  − Limit access to shared resource to 1 actor at a time
  − E.g. only 1 person permitted to edit a file at a time
    § otherwise changes by several people get all mixed up
• Solution:
  • Take turns:
    • Only one person gets the microphone & talks at a time
    • Also good practice for classrooms, btw …

Locks
• Computers use locks to control access to shared resources
  − Serves purpose of microphone in example
  − Also referred to as “semaphore”
• Usually implemented with a variable
  − int lock;
    § 0 for unlocked
    § 1 for locked

Synchronization with Locks
// wait for lock released
while (lock != 0) ;
// lock == 0 now (unlocked)

// set lock
lock = 1;

// access shared resource ...
// e.g. pi
// sequential execution! (Amdahl ...)

// release lock
lock = 0;

Lock Synchronization
Thread 1
while (lock != 0) ;
lock = 1;
// critical section
lock = 0;

Thread 2
while (lock != 0) ;
lock = 1;
// critical section
lock = 0;

• Thread 2 finds lock not set, before thread 1 sets it
• Both threads believe they got and set the lock!
Try as you like, this problem has no solution, not even at the assembly level.
Unless we introduce new instructions, that is!

Hardware Synchronization
• Solution:
  − Atomic read/write
  − Read & write in single instruction
    § No other access permitted between read and write
  − Note:
    § Must use shared memory (multiprocessing)
• Common implementations:
  − Atomic swap of register ↔ memory
  − Pair of instructions for “linked” read and write
    § write fails if memory location has been “tampered” with after linked read
• RISC-V has variations of both, but for simplicity we will focus on the former

RISC-V Atomic Memory Operations (AMOs)
• AMOs atomically perform an operation on an operand in memory and set the destination register to the original memory value
• R-Type instruction format: Add, And, Or, Swap, Xor, Max, Max Unsigned, Min, Min Unsigned
  − Load from address in rs1 to “t”
  − rd = “t”, i.e., the value in memory
  − Store at address in rs1 the calculation “t” <operation> rs2
  − aq and rl ensure in-order execution

amoadd.w rd, rs2, (rs1):
    t = M[x[rs1]]; x[rd] = t; M[x[rs1]] = t + x[rs2]

RISC-V Critical Section
• Assume that the lock is in memory location stored in register a0
• The lock is “set” if it is 1; it is “free” if it is 0 (its initial value)

      li t0, 1                     # Get 1 to set lock
Try:  amoswap.w.aq t1, t0, (a0)    # t1 gets old lock value
                                   # while we set it to 1
      bnez t1, Try                 # if it was already 1, another
                                   # thread has the lock,
                                   # so we need to try again
      … critical section goes here …
      amoswap.w.rl x0, x0, (a0)    # store 0 in lock to release

Lock Synchronization
Broken synchronization:
while (lock != 0) ;
lock = 1;
// critical section
lock = 0;

Fix (lock is at location (a0)):
        li t0, 1
Try:    amoswap.w.aq t1, t0, (a0)
        bnez t1, Try
Locked:
        # critical section
Unlock:
        amoswap.w.rl x0, x0, (a0)

OpenMP Locks

Synchronization in OpenMP
• Locks are typically used in libraries of higher-level parallel programming constructs
• E.g. OpenMP offers #pragmas for common cases:
  − critical
  − atomic
  − barrier
  − ordered
• OpenMP offers many more features
  − See online documentation
  − Or tutorial at
    § http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

OpenMP Critical Section

The Trouble with Locks …
• … is deadlocks
• Consider 2 cooks sharing a kitchen
  − Each cooks a meal that requires salt and pepper (locks)
  − Cook 1 grabs salt
  − Cook 2 grabs pepper
  − Cook 1 notices s/he needs pepper
    § it’s not there, so s/he waits
  − Cook 2 realizes s/he needs salt
    § it’s not there, so s/he waits
• A not so common cause of cook starvation
  − But deadlocks are possible in parallel programs
  − Very difficult to debug
    § malloc/free is easy …

And, in Conclusion, …
• Sequential software execution speed is limited
• Parallel processing is the only path to higher performance
  − SIMD: instruction level parallelism
    § Implemented in all high performance CPUs today (x86, ARM, …)
    § Partially supported by compilers
  − MIMD: thread level parallelism
    § Multicore processors
    § Supported by Operating Systems (OS)
    § Requires programmer intervention to exploit at single program level
      o E.g. OpenMP
  − SIMD & MIMD for maximum performance
• Synchronization
  − Requires hardware support: specialized assembly instructions
  − Typically use higher-level support
  − Beware of deadlocks