CS 61C: Great Ideas in Computer Architecture

Lecture 19: Thread-Level Parallel Processing

Krste Asanović & Randy H. Katz

http://inst.eecs.berkeley.edu/~cs61c/fa17

11/2/17 Fall 2017 - Lecture #19

Agenda

• MIMD - multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

Improving Performance

1. Increase clock rate fs
   − Reached practical maximum for today’s technology
   − < 5 GHz for general purpose computers
2. Lower CPI (cycles per instruction)
   − SIMD, “instruction level parallelism”
3. Perform multiple tasks simultaneously
   − Multiple CPUs, each executing different program
   − Tasks may be related
     § E.g. each CPU performs part of a big matrix multiplication
   − or unrelated
     § E.g. distribute different web http requests over different computers
     § E.g. run pptx (view lecture slides) and browser (youtube) simultaneously
4. Do all of the above:
   − High fs, SIMD, multiple parallel tasks

Today’s lecture

New-School Machine Structures (It’s a bit more complicated!)

• Parallel Requests
  Assigned to computer, e.g., Search “Katz”
• Parallel Threads
  Assigned to core, e.g., Lookup, Ads
• Parallel Instructions
  >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data
  >1 data item @ one time, e.g., Add of 4 pairs of words
• Hardware descriptions
  All gates @ one time
• Programming Languages

Harness Parallelism & Achieve High Performance

[Figure: software/hardware stack from Warehouse Scale Computer and Smart Phone down through Computer (Cores, Memory (Cache), Input/Output), Core (Instruction Unit(s), Functional Unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3), Cache Memory, and Logic Gates; annotated “Projects 3 and 5!”]

Parallel Computer Architectures

Several separate computers, some means for communication (e.g., Ethernet)

Massive array of computers, fast communication between processors

Multi-core CPU: >1 datapath in single chip; share L3 cache, memory, peripherals. Example: Hive machines

GPU “graphics processing unit”

Example: CPU with Two Cores

[Figure: Processor 0 (“Core” 1) and Processor 1 (“Core” 2), each with Control and a Datapath (PC, Registers, (ALU)), both issuing memory accesses to a shared Memory (Bytes); Input and Output attach through the I/O-Memory Interfaces]

Multiprocessor Execution Model

• Each processor (core) executes its own instructions
• Separate resources (not shared)
  − Datapath (PC, registers, ALU)
  − Highest level caches (e.g., 1st and 2nd)
• Shared resources
  − Memory (DRAM)
  − Often 3rd level cache
    § Often on same silicon chip
    § But not a requirement
• Nomenclature
  − “Multiprocessor Microprocessor”
  − Multicore processor
    § E.g., four core CPU (central processing unit)
    § Executes four different instruction streams simultaneously

Transition to Multicore

[Figure: Sequential App Performance]

Pixel 2 vs. iPhone 8

[Figures: four comparison slides; surviving labels: ALUs, nm, MHz, GFlops; “2.35 GHz + 1.9 GHz, 64-bit octa-core”]


Multiprocessor Execution Model

• Shared memory
  − Each “core” has access to the entire memory in the processor
  − Special hardware keeps caches consistent (next lecture!)
  − Advantages:
    § Simplifies communication in program via shared variables
  − Drawbacks:
    § Does not scale well:
      o “Slow” memory shared by many “customers” (cores)
      o May become bottleneck (Amdahl’s Law)
• Two ways to use a multiprocessor:
  − Job-level parallelism
    § Processors work on unrelated problems
    § No communication between programs
  − Partition work of single task between several cores
    § E.g., each performs part of large matrix multiplication

Parallel Processing

• It’s difficult!
• It’s inevitable
  − Only path to increase performance
  − Only path to lower energy consumption (improve battery life)
• In mobile systems (e.g., smart phones, tablets)
  − Multiple cores
  − Dedicated processors, e.g.,
    § Motion processor, image processor, neural processor in iPhone 8 + X
    § GPU (graphics processing unit)
• Warehouse-scale computers (next week!)
  − Multiple “nodes”
    § “Boxes” with several CPUs, disks per box
  − MIMD (multi-core) and SIMD (e.g. AVX) in each node

Potential Parallel Performance (assuming software can use it)

Year   Cores   SIMD bits/Core   Core * SIMD bits   Total, e.g. FLOPs/Cycle
2003     2          128                256                    4
2005     4          128                512                    8
2007     6          128                768                   12
2009     8          128               1024                   16
2011    10          256               2560                   40
2013    12          256               3072                   48
2015    14          512               7168                  112
2017    16          512               8192                  128
2019    18         1024              18432                  288
2021    20         1024              20480                  320

Growth over 12 years:
• MIMD (cores): 2.5X, at +2 cores every 2 years
• SIMD (bits per core): 8X, at 2X every 4 years
• MIMD & SIMD combined: 20X

20x in 12 years: 20^(1/12) = 1.28x → 28% per year, or 2x every 3 years!

IF (!) we can use it
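The 28% figure can be checked by assuming a constant yearly growth factor over the 12-year window (2009 to 2021 in the table, where the total goes from 16 to 320). The symbols g (growth factor per year) and T (doubling time) are introduced here for illustration. The Total column also appears to be Core × SIMD bits divided by 64, i.e. one 64-bit operation per SIMD lane per cycle; that reading is an assumption.

```latex
\begin{align*}
\text{Total}_{2009} &= \frac{8 \times 128}{64} = 16, \qquad
\text{Total}_{2021} = \frac{20 \times 1024}{64} = 320 \\
g^{12} &= \frac{320}{16} = 20 \\
g &= 20^{1/12} = e^{\ln 20 / 12} \approx e^{0.2496} \approx 1.28
  \quad (\text{about } 28\% \text{ per year}) \\
g^{T} &= 2 \;\Rightarrow\; T = \frac{\ln 2}{\ln g} \approx \frac{0.693}{0.2496} \approx 2.8 \text{ years}
\end{align*}
```

A doubling time of about 2.8 years is what the slide rounds to “2x every 3 years”.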


Programs Running on my Computer

$ ps -x

PID TTY  TIME     CMD
220 ??   0:04.34  /usr/libexec/UserEventAgent (Aqua)
222 ??   0:10.60  /usr/sbin/distnoted agent
224 ??   0:09.11  /usr/sbin/cfprefsd agent
229 ??   0:04.71  /usr/sbin/usernoted
230 ??   0:02.35  /usr/libexec/nsurlsessiond
232 ??   0:28.68  /System/Library/PrivateFrameworks/CalendarAgent.framework/Executables/CalendarAgent
234 ??   0:04.36  /System/Library/PrivateFrameworks/GameCenterFoundation.framework/Versions/A/gamed
235 ??   0:01.90  /System/Library/CoreServices/cloudphotosd.app/Contents/MacOS/cloudphotosd
236 ??   0:49.72  /usr/libexec/secinitd
239 ??   0:01.66  /System/Library/PrivateFrameworks/TCC.framework/Resources/tccd
240 ??   0:12.68  /System/Library/Frameworks/Accounts.framework/Versions/A/Support/accountsd
241 ??   0:09.56  /usr/libexec/SafariCloudHistoryPushAgent
242 ??   0:00.27  /System/Library/PrivateFrameworks/CallHistory.framework/Support/CallHistorySyncHelper
243 ??   0:00.74  /System/Library/CoreServices/mapspushd
244 ??   0:00.79  /usr/libexec/fmfd
246 ??   0:00.09  /System/Library/PrivateFrameworks/AskPermission.framework/Versions/A/Resources/askpermissiond
248 ??   0:01.03  /System/Library/PrivateFrameworks/CloudDocsDaemon.framework/Versions/A/Support/bird
249 ??   0:02.50  /System/Library/PrivateFrameworks/IDS.framework/identityservicesd.app/Contents/MacOS/identityservicesd
250 ??   0:04.81  /usr/libexec/secd
254 ??   0:24.01  /System/Library/PrivateFrameworks/CloudKitDaemon.framework/Support/cloudd
258 ??   0:04.73  /System/Library/PrivateFrameworks/TelephonyUtilities.framework/callservicesd
267 ??   0:02.15  /System/Library/CoreServices/AirPlayUIAgent.app/Contents/MacOS/AirPlayUIAgent --launchd
271 ??   0:03.91  /usr/libexec/nsurlstoraged
274 ??   0:00.90  /System/Library/PrivateFrameworks/CommerceKit.framework/Versions/A/Resources/storeaccountd
282 ??   0:00.09  /usr/sbin/pboard
283 ??   0:00.90  /System/Library/PrivateFrameworks/InternetAccounts.framework/Versions/A/XPCServices/com.apple.internetaccounts.xpc/Contents/MacOS/com.apple.internetaccounts
285 ??   0:04.72  /System/Library/Frameworks/ApplicationServices.framework/Frameworks/ATS.framework/Support/fontd
291 ??   0:00.25  /System/Library/Frameworks/Security.framework/Versions/A/Resources/CloudKeychainProxy.bundle/Contents/MacOS/CloudKeychainProxy
292 ??   0:09.54  /System/Library/CoreServices/CoreServicesUIAgent.app/Contents/MacOS/CoreServicesUIAgent
293 ??   0:00.29  /System/Library/PrivateFrameworks/CloudPhotoServices.framework/Versions/A/Frameworks/CloudPhotoServicesConfiguration.framework/Versions/A/XPCServices/com.apple.CloudPhotosConfiguration.xpc/Contents/MacOS/com.apple.CloudPhotosConfiguration
297 ??   0:00.84  /System/Library/PrivateFrameworks/CloudServices.framework/Resources/com.apple.sbd
302 ??   0:26.11  /System/Library/CoreServices/Dock.app/Contents/MacOS/Dock
303 ??   0:09.55  /System/Library/CoreServices/SystemUIServer.app/Contents/MacOS/SystemUIServer
…

156 total at this moment. How does my laptop do this?

Imagine doing 156 assignments all at the same time!

Threads

• Sequential flow of instructions that performs some task
  − Up to now we just called this a “program”
• Each thread has:
  − Dedicated PC (program counter)
  − Separate registers
  − Accesses the shared memory
• Each physical core provides one (or more)
  − Hardware threads that actively execute instructions
  − Each executes one “hardware thread”
• Operating system multiplexes multiple
  − Software threads onto the available hardware threads
  − All threads except those mapped to hardware threads are waiting


Operating System Threads

Give illusion of many “simultaneously” active threads

1. Multiplex software threads onto hardware threads:
   a) Switch out blocked threads (e.g., cache miss, user input, network access)
   b) Timer (e.g., switch active thread every 1 ms)
2. Remove a software thread from a hardware thread by
   a) Interrupting its execution
   b) Saving its registers and PC to memory
3. Start executing a different software thread by
   a) Loading its previously saved registers into a hardware thread’s registers
   b) Jumping to its saved PC

Example: Four Cores

Thread pool: list of threads competing for processor

OS maps threads to cores and schedules logical (software) threads

[Figure: Core 1, Core 2, Core 3, Core 4]

Each “core” actively runs one instruction stream at a time

Multithreading

• Typical scenario:
  − Active thread encounters cache miss
  − Active thread waits ~1000 cycles for data from DRAM
  − → switch out and run different thread until data available
• Problem
  − Must save current thread state and load new thread state
    § PC, all registers (could be many, e.g. AVX)
  − → must perform switch in ≪ 1000 cycles
• Can hardware help?
  − Moore’s Law: transistors are plenty

Hardware Assisted Software Multithreading

• Two copies of PC and Registers inside processor hardware
• Looks identical to two processors to software (hardware thread 0, hardware thread 1)
• Hyperthreading:
  • Both threads can be active simultaneously

[Figure: Processor (1 Core, 2 Threads) with Control and a Datapath holding PC 0, Registers 0, PC 1, Registers 1 and a shared (ALU); Memory (Bytes), Input and Output attach through the I/O-Memory Interfaces]

Multithreading

• Logical threads
  − ≈ 1% more hardware, ≈ 10% (?) better performance
    § Separate registers
    § Share datapath, ALU(s), caches
• Multicore
  − => Duplicate processors
  − ≈ 50% more hardware, ≈ 2X better performance?
• Modern machines do both
  − Multiple cores with multiple threads per core

Randy’s Laptop

$ sysctl -a | grep hw

hw.physicalcpu: 2
hw.logicalcpu: 4
hw.l1icachesize: 32,768
hw.l1dcachesize: 32,768
hw.l2cachesize: 262,144
hw.l3cachesize: 4,194,304

• 2 Cores
• 4 Threads total


Example: 6 Cores, 24 Logical Threads

Thread pool: list of threads competing for processor

OS maps threads to cores and schedules logical (software) threads

[Figure: Core 1 through Core 6, each holding Thread 1, Thread 2, Thread 3, Thread 4]

4 logical threads per core (hardware) thread

Break!


Languages Supporting Parallel Programming

ActorScript    Concurrent Pascal    JoCaml       Orc
Ada            Concurrent ML        Join         Oz
Afnix          Concurrent Haskell   Java         Pict
Alef           Curry                Joule        Reia
Alice          CUDA                 Joyce        SALSA
APL            E                    LabVIEW      Scala
Axum           Eiffel               Limbo        SISAL
Chapel         Erlang               Linda        SR
Cilk           Fortran 90           MultiLisp    Stackless Python
Clean          Go                   Modula-3     SuperPascal
Clojure        Io                   Occam        VHDL
Concurrent C   Janus                occam-π      XC

Which one to pick?

Why So Many Parallel Programming Languages?

• Why “intrinsics”?
  − TO Intel: fix your #()&$! compiler!
• It’s happening ... but
  − SIMD features are continually added to compilers (Intel, gcc)
  − Intense area of research
  − Research progress:
    § 20+ years to translate C into good (fast!) assembly
    § How long to translate C into good (fast!) parallel code?
      o General problem is very hard to solve
      o Present state: specialized solutions for specific cases
      o Your opportunity to become famous!

Parallel Programming Languages

• Number of choices is indication of
  − No universal solution
    § Needs are very problem specific
  − E.g.,
    § Scientific computing/machine learning (matrix multiply)
    § Web server: handle many unrelated requests simultaneously
    § Input/output: it’s all happening simultaneously!
• Specialized languages for different tasks
  − Some are easier to use (for some problems)
  − None is particularly “easy” to use
• 61C
  − Parallel language examples for high-performance computing
  − OpenMP


Parallel Loops

• Serial execution:

for (int i=0; i<100; i++) {
    …
}

• Parallel execution:

for (int i=0; i<25; i++) {
    …
}

for (int i=25; i<50; i++) {
    …
}

for (int i=50; i<75; i++) {
    …
}

for (int i=75; i<100; i++) {
    …
}

Parallel for in OpenMP

#include <omp.h>

#pragma omp parallel for
for (int i=0; i<100; i++) {
    …
}

OpenMP Example

$ gcc-5 -fopenmp for.c; ./a.out
thread 0, i = 0
thread 1, i = 3
thread 2, i = 6
thread 3, i = 8
thread 0, i = 1
thread 1, i = 4
thread 2, i = 7
thread 3, i = 9
thread 0, i = 2
thread 1, i = 5
01 02 03 14 15 16 27 28 39 40

OpenMP

• C extension: no new language to learn
• Multi-threaded, shared-memory parallelism
  − Compiler directives, #pragma
  − Runtime library routines, #include <omp.h>
• #pragma
  − Ignored by compilers unaware of OpenMP
  − Same source for multiple architectures
    § E.g., same program for 1 & 16 cores
• Only works with shared memory

OpenMP Programming Model

• Fork-join model:
• OpenMP programs begin as single process (master thread)
  − Sequential execution
• When parallel region is encountered
  − Master thread “forks” into team of parallel threads
  − Executed simultaneously
  − At end of parallel region, parallel threads “join”, leaving only master thread
• Process repeats for each parallel region
  − Amdahl’s Law?

What Kind of Threads?

• OpenMP threads are operating system (software) threads
• OS will multiplex requested OpenMP threads onto available hardware threads
• Hopefully each gets a real hardware thread to run on, so no OS-level time-multiplexing
• But other tasks on machine compete for hardware threads!
• Be “careful” (?) when timing results for Project 3!
  − 5 AM?
  − Job queue?


Example 2: Computing π

http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

Sequential π

pi = 3.142425985001

• Resembles π, but not very accurate
• Let’s increase num_steps and parallelize

Parallelize (1) …

• Problem: each thread needs access to the shared variable sum
• Code runs sequentially …

Parallelize (2) …

1. Compute sum[0] and sum[1] in parallel
2. Compute sum = sum[0] + sum[1] sequentially

Parallel π

Trial Run

i = 1, id = 1
i = 0, id = 0
i = 2, id = 2
i = 3, id = 3
i = 5, id = 1
i = 4, id = 0
i = 6, id = 2
i = 7, id = 3
i = 9, id = 1
i = 8, id = 0
pi = 3.142425985001


Scale up: num_steps = 10^6

pi = 3.141592653590

You verify how many digits are correct …

Can We Parallelize Computing sum?

Summation inside parallel section
• Insignificant speedup in this example, but …
• pi = 3.138450662641
• Wrong! And value changes between runs?!
• What’s going on?

Always looking for ways to beat Amdahl’s Law …

Peer Instruction

What are the possible values of *(x1) after executing this code by two concurrent threads?

# *(x1) = 100
lw   x2, 0(x1)
addi x2, x2, 1
sw   x2, 0(x1)

Answer   *(x1)
RED      100 or 101
GREEN    101
ORANGE   101 or 102
YELLOW   100 or 101 or 102

What’s Going On?

• Operation is really pi = pi + sum[id]
• What if >1 threads read the current (same) value of pi, compute the sum, and store the result back to pi?
• Each processor reads same intermediate value of pi!
• Result depends on who gets there when
• A “race” → result is not deterministic

Administrivia

• Homework 4 (Caches, Floating Point) due tomorrow at 11:59 pm
• Project 2-2 due Monday
  − Project office hours that Monday will be well staffed!
  − Test your CPU thoroughly!
    § Write programs with Venus and load them into your circuit
• Project 3 will be released Monday night
  − A two-week performance project
  − Can earn extra credit from the performance contest (Project 5)
• Midterm scores will be released before Tuesday on Gradescope

Break!


Synchronization

• Problem:
  − Limit access to shared resource to 1 actor at a time
  − E.g. only 1 person permitted to edit a file at a time
    § otherwise changes by several people get all mixed up
• Solution:
  • Take turns:
    • Only one person gets the microphone & talks at a time
    • Also good practice for classrooms, btw …

Locks

• Computers use locks to control access to shared resources
  − Serves purpose of microphone in example
  − Also referred to as “semaphore”
• Usually implemented with a variable
  − int lock;
    § 0 for unlocked
    § 1 for locked

Synchronization with Locks

// wait for lock released
while (lock != 0) ;
// lock == 0 now (unlocked)

// set lock
lock = 1;

// access shared resource ...
// e.g. pi
// sequential execution! (Amdahl ...)

// release lock
lock = 0;

Lock Synchronization

Thread 1

while (lock != 0) ;
lock = 1;
// critical section
lock = 0;

Thread 2

while (lock != 0) ;
lock = 1;
// critical section
lock = 0;

• Thread 2 finds lock not set, before thread 1 sets it
• Both threads believe they got and set the lock!

Try as you like, this problem has no solution, not even at the assembly level.

Unless we introduce new instructions, that is!

Hardware Synchronization

• Solution:
  − Atomic read/write
  − Read & write in single instruction
    § No other access permitted between read and write
  − Note:
    § Must use shared memory (multiprocessing)
• Common implementations:
  − Atomic swap of register ↔ memory
  − Pair of instructions for “linked” read and write
    § write fails if memory location has been “tampered” with after linked read
• RISC-V has variations of both, but for simplicity we will focus on the former


RISC-V Atomic Memory Operations (AMOs)

• AMOs atomically perform an operation on an operand in memory and set the destination register to the original memory value
• R-type instruction format: Add, And, Or, Swap, Xor, Max, Max Unsigned, Min, Min Unsigned

• Load from address in rs1 to “t”
• rd = “t”, i.e., the value in memory
• Store at address in rs1 the calculation “t” <operation> rs2
• aq and rl ensure in-order execution

amoadd.w rd,rs2,(rs1):
    t = M[x[rs1]]; x[rd] = t; M[x[rs1]] = t + x[rs2]

RISC-V Critical Section

• Assume that the lock is in memory location stored in register a0
• The lock is “set” if it is 1; it is “free” if it is 0 (its initial value)

      li t0, 1                    # Get 1 to set lock
Try:  amoswap.w.aq t1, t0, (a0)   # t1 gets old lock value
                                  # while we set it to 1
      bnez t1, Try                # if it was already 1, another
                                  # thread has the lock,
                                  # so we need to try again
      … critical section goes here …
      amoswap.w.rl x0, x0, (a0)   # store 0 in lock to release

Lock Synchronization

Broken Synchronization

while (lock != 0) ;
lock = 1;
// critical section
lock = 0;

Fix (lock is at location (a0))

        li t0, 1
Try:    amoswap.w.aq t1, t0, (a0)
        bnez t1, Try
Locked:
        # critical section
Unlock:
        amoswap.w.rl x0, x0, (a0)


OpenMP Locks

Synchronization in OpenMP

• Locks are typically used in libraries of higher-level parallel programming constructs
• E.g. OpenMP offers #pragmas for common cases:
  − critical
  − atomic
  − barrier
  − ordered
• OpenMP offers many more features
  − See online documentation
  − Or tutorial at
    § http://openmp.org/mp-documents/omp-hands-on-SC08.pdf


OpenMP Critical Section

The Trouble with Locks …

• … is deadlocks
• Consider 2 cooks sharing a kitchen
  − Each cooks a meal that requires salt and pepper (locks)
  − Cook 1 grabs salt
  − Cook 2 grabs pepper
  − Cook 1 notices s/he needs pepper
    § it’s not there, so s/he waits
  − Cook 2 realizes s/he needs salt
    § it’s not there, so s/he waits
• A not so common cause of cook starvation
  − But deadlocks are possible in parallel programs
  − Very difficult to debug
    § malloc/free is easy …


And, in Conclusion, …

• Sequential software execution speed is limited
• Parallel processing is the only path to higher performance
  − SIMD: instruction level parallelism
    § Implemented in all high performance CPUs today (x86, ARM, …)
    § Partially supported by compilers
  − MIMD: thread level parallelism
    § Multicore processors
    § Supported by Operating Systems (OS)
    § Requires programmer intervention to exploit at single program level
      o E.g. OpenMP
  − SIMD & MIMD for maximum performance
• Synchronization
  − Requires hardware support: specialized assembly instructions
  − Typically use higher-level support
  − Beware of deadlocks

