"Ifyoubuildit,theywillcome."
And so we built them. Multiprocessor workstations, massively parallel supercomputers, a cluster in every department ... and they haven't come. Programmers haven't come to program these wonderful machines. Oh, a few programmers in love with the challenge have shown that most types of problems can be force-fit onto parallel computers, but general programmers, especially professional programmers who "have lives", ignore parallel computers.

And they do so at their own peril. Parallel computers are going mainstream. Multithreaded microprocessors, multicore CPUs, multiprocessor PCs, clusters, parallel game consoles ... parallel computers are taking over the world of computing. The computer industry is ready to flood the market with hardware that will only run at full speed with parallel programs. But who will write these programs?

This is an old problem. Even in the early 1980s, when the "killer micros" started their assault on traditional vector supercomputers, we worried endlessly about how to attract normal programmers. We tried everything we could think of: high-level hardware abstractions, implicitly parallel programming languages, parallel language extensions, and portable message-passing libraries. But after many years of hard work, the fact of the matter is that "they" didn't come. The overwhelming majority of programmers will not invest the effort to write parallel software.

A common view is that you can't teach old programmers new tricks, so the problem will not be solved until the old programmers fade away and a new generation takes over.

But we don't buy into that defeatist attitude. Programmers have shown a remarkable ability to adopt new software technologies over the years. Look at how many old Fortran programmers are now writing elegant Java programs with sophisticated object-oriented designs. The problem isn't with old programmers. The problem is with old parallel computing experts and the way they've tried to create a pool of capable parallel programmers.

And that's where this book comes in. We want to capture the essence of how expert parallel programmers think about parallel algorithms and communicate that essential understanding in a way professional programmers can readily master. The technology we've adopted to accomplish this task is a pattern language. We made this choice not because we started the project as devotees of design patterns looking for a new field to conquer, but because patterns have been shown to work in ways that would be applicable in parallel programming. For example, patterns have been very effective in the field of object-oriented design. They have provided a common language experts can use to talk about the elements of design and have been extremely effective at helping programmers master object-oriented design.

This book contains our pattern language for parallel programming. The book opens with a couple of chapters to introduce the key concepts in parallel computing. These chapters focus on the parallel computing concepts and jargon used in the pattern language as opposed to being an exhaustive introduction to the field.

The pattern language itself is presented in four parts corresponding to the four phases of creating a parallel program:
* Finding Concurrency. The programmer works in the problem domain to identify the available concurrency and expose it for use in the algorithm design.

* Algorithm Structure. The programmer works with high-level structures for organizing a parallel algorithm.

* Supporting Structures. We shift from algorithms to source code and consider how the parallel program will be organized and the techniques used to manage shared data.

* Implementation Mechanisms. The final step is to look at specific software constructs for implementing a parallel program.
The patterns making up these four design spaces are tightly linked. You start at the top (Finding Concurrency), work through the patterns, and by the time you get to the bottom (Implementation Mechanisms), you will have a detailed design for your parallel program.

If the goal is a parallel program, however, you need more than just a parallel algorithm. You also need a programming environment and a notation for expressing the concurrency within the program's source code. Programmers used to be confronted by a large and confusing array of parallel programming environments. Fortunately, over the years the parallel programming community has converged around three programming environments:
* OpenMP. A simple language extension to C, C++, or Fortran for writing parallel programs for shared-memory computers.

* MPI. A message-passing library used on clusters and other distributed-memory computers.

* Java. An object-oriented programming language with language features supporting parallel programming on shared-memory computers and standard class libraries supporting distributed computing.
Many readers will already be familiar with one or more of these programming notations, but for readers completely new to parallel computing, we've included a discussion of these programming environments in the appendixes.

In closing, we have been working for many years on this pattern language. Presenting it as a book so people can start using it is an exciting development for us. But we don't see this as the end of this effort. We expect that others will have their own ideas about new and better patterns for parallel programming. We've assuredly missed some important features that really belong in this pattern language. We embrace change and look forward to engaging with the larger parallel computing community to iterate on this language. Over time, we'll update and improve the pattern language until it truly represents the consensus view of the parallel programming community. Then our real work will begin: using the pattern language to guide the creation of better parallel programming environments and helping people to use these technologies to write parallel software. We won't rest until the day sequential software is rare.
ACKNOWLEDGMENTS
We started working together on this pattern language in 1998. It's been a long and twisted road, starting with a vague idea about a new way to think about parallel algorithms and finishing with this book. We couldn't have done this without a great deal of help.

Mani Chandy, who thought we would make a good team, introduced Tim to Beverly and Berna. The National Science Foundation, Intel Corp., and Trinity University have supported this research at various times over the years. Help with the patterns themselves came from the people at the Pattern Languages of Programs (PLoP) workshops held in Illinois each summer. The format of these workshops and the resulting review process was challenging and sometimes difficult, but without them we would have never finished this pattern language. We would also like to thank the reviewers who carefully read early manuscripts and pointed out countless errors and ways to improve the book.

Finally, we thank our families. Writing a book is hard on the authors, but that is to be expected. What we didn't fully appreciate was how hard it would be on our families. We are grateful to Beverly's family (Daniel and Steve), Tim's family (Noah, August, and Martha), and Berna's family (Billie) for the sacrifices they've made to support this project.
Tim Mattson, Olympia, Washington, April 2004
Beverly Sanders, Gainesville, Florida, April 2004
Berna Massingill, San Antonio, Texas, April 2004
Chapter 1. A Pattern Language for Parallel Programming
  Section 1.1. INTRODUCTION
  Section 1.2. PARALLEL PROGRAMMING
  Section 1.3. DESIGN PATTERNS AND PATTERN LANGUAGES
  Section 1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
Chapter 2. Background and Jargon of Parallel Computing
  Section 2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
  Section 2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
  Section 2.3. PARALLEL PROGRAMMING ENVIRONMENTS
  Section 2.4. THE JARGON OF PARALLEL COMPUTING
  Section 2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
  Section 2.6. COMMUNICATION
  Section 2.7. SUMMARY
Chapter 3. The Finding Concurrency Design Space
  Section 3.1. ABOUT THE DESIGN SPACE
  Section 3.2. THE TASK DECOMPOSITION PATTERN
  Section 3.3. THE DATA DECOMPOSITION PATTERN
  Section 3.4. THE GROUP TASKS PATTERN
  Section 3.5. THE ORDER TASKS PATTERN
  Section 3.6. THE DATA SHARING PATTERN
  Section 3.7. THE DESIGN EVALUATION PATTERN
  Section 3.8. SUMMARY
Chapter 4. The Algorithm Structure Design Space
  Section 4.1. INTRODUCTION
  Section 4.2. CHOOSING AN ALGORITHM STRUCTURE PATTERN
  Section 4.3. EXAMPLES
  Section 4.4. THE TASK PARALLELISM PATTERN
  Section 4.5. THE DIVIDE AND CONQUER PATTERN
  Section 4.6. THE GEOMETRIC DECOMPOSITION PATTERN
  Section 4.7. THE RECURSIVE DATA PATTERN
  Section 4.8. THE PIPELINE PATTERN
  Section 4.9. THE EVENT-BASED COORDINATION PATTERN
Chapter 5. The Supporting Structures Design Space
  Section 5.1. INTRODUCTION
  Section 5.2. FORCES
  Section 5.3. CHOOSING THE PATTERNS
  Section 5.4. THE SPMD PATTERN
  Section 5.5. THE MASTER/WORKER PATTERN
  Section 5.6. THE LOOP PARALLELISM PATTERN
  Section 5.7. THE FORK/JOIN PATTERN
  Section 5.8. THE SHARED DATA PATTERN
  Section 5.9. THE SHARED QUEUE PATTERN
  Section 5.10. THE DISTRIBUTED ARRAY PATTERN
  Section 5.11. OTHER SUPPORTING STRUCTURES
Chapter 6. The Implementation Mechanisms Design Space
  Section 6.1. OVERVIEW
  Section 6.2. UE MANAGEMENT
  Section 6.3. SYNCHRONIZATION
  Section 6.4. COMMUNICATION
  Endnotes
Appendix A: A Brief Introduction to OpenMP
  Section A.1. CORE CONCEPTS
  Section A.2. STRUCTURED BLOCKS AND DIRECTIVE FORMATS
  Section A.3. WORKSHARING
  Section A.4. DATA ENVIRONMENT CLAUSES
  Section A.5. THE OpenMP RUNTIME LIBRARY
  Section A.6. SYNCHRONIZATION
  Section A.7. THE SCHEDULE CLAUSE
  Section A.8. THE REST OF THE LANGUAGE
Appendix B: A Brief Introduction to MPI
  Section B.1. CONCEPTS
  Section B.2. GETTING STARTED
  Section B.3. BASIC POINT-TO-POINT MESSAGE PASSING
  Section B.4. COLLECTIVE OPERATIONS
  Section B.5. ADVANCED POINT-TO-POINT MESSAGE PASSING
  Section B.6. MPI AND FORTRAN
  Section B.7. CONCLUSION
Appendix C: A Brief Introduction to Concurrent Programming in Java
  Section C.1. CREATING THREADS
  Section C.2. ATOMICITY, MEMORY SYNCHRONIZATION, AND THE volatile KEYWORD
  Section C.3. SYNCHRONIZED BLOCKS
  Section C.4. WAIT AND NOTIFY
  Section C.5. LOCKS
  Section C.6. OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
  Section C.7. INTERRUPTS
Glossary
Bibliography
About the Authors
Index
APatternLanguageforParallelProgramming>INTRODUCTION
Chapter 1. A Pattern Language for Parallel Programming
1.1 INTRODUCTION
1.2 PARALLEL PROGRAMMING
1.3 DESIGN PATTERNS AND PATTERN LANGUAGES
1.4 A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
1.1. INTRODUCTION

Computers are used to model physical systems in many fields of science, medicine, and engineering. Modelers, whether trying to predict the weather or render a scene in the next blockbuster movie, can usually use whatever computing power is available to make ever more detailed simulations. Vast amounts of data, whether customer shopping patterns, telemetry data from space, or DNA sequences, require analysis. To deliver the required power, computer designers combine multiple processing elements into a single larger system. These so-called parallel computers run multiple tasks simultaneously and solve bigger problems in less time.

Traditionally, parallel computers were rare and available for only the most critical problems. Since the mid-1990s, however, the availability of parallel computers has changed dramatically. With multithreading support built into the latest microprocessors and the emergence of multiple processor cores on a single silicon die, parallel computers are becoming ubiquitous. Now, almost every university computer science department has at least one parallel computer. Virtually all oil companies, automobile manufacturers, drug development companies, and special effects studios use parallel computing.
For example, in computer animation, rendering is the step where information from the animation files, such as lighting, textures, and shading, is applied to 3D models to generate the 2D image that makes up a frame of the film. Parallel computing is essential to generate the needed number of frames (24 per second) for a feature-length film. Toy Story, the first completely computer-generated feature-length film, released by Pixar in 1995, was processed on a "render farm" consisting of 100 dual-processor machines [PS00]. By 1999, for Toy Story 2, Pixar was using a 1,400-processor system, with the improvement in processing power fully reflected in the improved details in textures, clothing, and atmospheric effects. Monsters, Inc. (2001) used a system of 250 enterprise servers each containing 14 processors, for a total of 3,500 processors. It is interesting that the amount of time required to generate a frame has remained relatively constant; as computing power (both the number of processors and the speed of each processor) has increased, it has been exploited to improve the quality of the animation.
The biological sciences have taken dramatic leaps forward with the availability of DNA sequence information from a variety of organisms, including humans. One approach to sequencing, championed and used with success by Celera Corp., is called the whole genome shotgun algorithm. The idea is to break the genome into small segments, experimentally determine the DNA sequences of the segments, and then use a computer to construct the entire sequence from the segments by finding overlapping areas. The computing facilities used by Celera to sequence the human genome included 150 four-way servers plus a server with 16 processors and 64 GB of memory. The calculation involved 500 million trillion base-to-base comparisons [Ein00].
The SETI@home project [SET, ACK+02] provides a fascinating example of the power of parallel computing. The project seeks evidence of extraterrestrial intelligence by scanning the sky with the world's largest radio telescope, the Arecibo Telescope in Puerto Rico. The collected data is then analyzed for candidate signals that might indicate an intelligent source. The computational task is beyond even the largest supercomputer, and certainly beyond the capabilities of the facilities available to the SETI@home project. The problem is solved with public resource computing, which turns PCs around the world into a huge parallel computer connected by the Internet. Data is broken up into work units and distributed over the Internet to client computers whose owners donate spare computing time to support the project. Each client periodically connects with the SETI@home server, downloads the data to analyze, and then sends the results back to the server. The client program is typically implemented as a screen saver so that it will devote CPU cycles to the SETI problem only when the computer is otherwise idle. A work unit currently requires an average of between seven and eight hours of CPU time on a client. More than 205,000,000 work units have been processed since the start of the project. More recently, similar technology to that demonstrated by SETI@home has been used for a variety of public resource computing projects as well as internal projects within large companies utilizing their idle PCs to solve problems ranging from drug screening to chip design validation.
Although computing in less time is beneficial, and may enable problems to be solved that couldn't be otherwise, it comes at a cost. Writing software to run on parallel computers can be difficult. Only a small minority of programmers have experience with parallel programming. If all these computers designed to exploit parallelism are going to achieve their potential, more programmers need to learn how to write parallel programs.

This book addresses this need by showing competent programmers of sequential machines how to design programs that can run on parallel computers. Although many excellent books show how to use particular parallel programming environments, this book is unique in that it focuses on how to think about and design parallel algorithms. To accomplish this goal, we will be using the concept of a pattern language. This highly structured representation of expert design experience has been heavily used in the object-oriented design community.

The book opens with two introductory chapters. The first gives an overview of the parallel computing landscape and background needed to understand and use the pattern language. This is followed by a more detailed chapter in which we lay out the basic concepts and jargon used by parallel programmers. The book then moves into the pattern language itself.
1.2. PARALLEL PROGRAMMING

The key to parallel computing is exploitable concurrency. Concurrency exists in a computational problem when the problem can be decomposed into subproblems that can safely execute at the same time. To be of any use, however, it must be possible to structure the code to expose and later exploit the concurrency and permit the subproblems to actually run concurrently; that is, the concurrency must be exploitable.

Most large computational problems contain exploitable concurrency. A programmer works with exploitable concurrency by creating a parallel algorithm and implementing the algorithm using a parallel programming environment. When the resulting parallel program is run on a system with multiple processors, the amount of time we have to wait for the results of the computation is reduced. In addition, multiple processors may allow larger problems to be solved than could be done on a single-processor system.

As a simple example, suppose part of a computation involves computing the summation of a large set of values. If multiple processors are available, instead of adding the values together sequentially, the set can be partitioned and the summations of the subsets computed simultaneously, each on a different processor. The partial sums are then combined to get the final answer. Thus, using multiple processors to compute in parallel may allow us to obtain a solution sooner. Also, if each processor has its own memory, partitioning the data between the processors may allow larger problems to be handled than could be handled on a single processor.
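As a concrete illustration, here is a minimal C sketch of the partitioned summation using OpenMP (introduced in Appendix A); the array contents and size are illustrative, not taken from the text. The reduction clause has each thread accumulate a private partial sum and then combines the partial sums into the final result.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double values[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)
            values[i] = 0.001 * i;            /* illustrative data */

        /* Each thread sums a subset of the values; OpenMP combines
           the per-thread partial sums into the final result. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += values[i];

        printf("sum = %f\n", sum);
        return 0;
    }

Compiled with OpenMP support enabled (for example, with a flag such as -fopenmp), the loop iterations are divided among the available processors; without it, the pragma is ignored and the code runs sequentially.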
This simple example shows the essence of parallel computing. The goal is to use multiple processors to solve problems in less time and/or to solve bigger problems than would be possible on a single processor. The programmer's task is to identify the concurrency in the problem, structure the algorithm so that this concurrency can be exploited, and then implement the solution using a suitable programming environment. The final step is to solve the problem by executing the code on a parallel system.
Parallel programming presents unique challenges. Often, the concurrent tasks making up the problem include dependencies that must be identified and correctly managed. The order in which the tasks execute may change the answers of the computations in nondeterministic ways. For example, in the parallel summation described earlier, a partial sum cannot be combined with others until its own computation has completed. The algorithm imposes a partial order on the tasks (that is, they must complete before the sums can be combined). More subtly, the numerical value of the summations may change slightly depending on the order of the operations within the sums because floating-point arithmetic is nonassociative. A good parallel programmer must take care to ensure that nondeterministic issues such as these do not affect the quality of the final answer. Creating safe parallel programs can take considerable effort from the programmer.
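The nonassociativity of floating-point arithmetic is easy to demonstrate directly; the constants below are illustrative, chosen so that single-precision rounding makes the two groupings give different results.

    #include <stdio.h>

    int main(void) {
        float big = 1.0e8f, small = 3.0f;

        /* The grouping determines whether the small terms accumulate
           before being absorbed by the large value. */
        float left  = (big + small) + small;   /* small terms rounded away */
        float right = big + (small + small);   /* small terms survive */

        printf("%.1f vs %.1f\n", left, right); /* the results differ */
        return 0;
    }

In a parallel summation, the grouping is determined by how the set is partitioned and by the order in which partial sums are combined, so different runs can produce slightly different totals.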
Even when a parallel program is "correct", it may fail to deliver the anticipated performance improvement from exploiting concurrency. Care must be taken to ensure that the overhead incurred by managing the concurrency does not overwhelm the program runtime. Also, partitioning the work among the processors in a balanced way is often not as easy as the summation example suggests. The effectiveness of a parallel algorithm depends on how well it maps onto the underlying parallel computer, so a parallel algorithm could be very effective on one parallel architecture and a disaster on another.

We will revisit these issues and provide a more quantitative view of parallel computation in the next chapter.
1.3. DESIGN PATTERNS AND PATTERN LANGUAGES

A design pattern describes a good solution to a recurring problem in a particular context. The pattern follows a prescribed format that includes the pattern name, a description of the context, the forces (goals and constraints), and the solution. The idea is to record the experience of experts in a way that can be used by others facing a similar problem. In addition to the solution itself, the name of the pattern is important and can form the basis for a domain-specific vocabulary that can significantly enhance communication between designers in the same area.

Design patterns were first proposed by Christopher Alexander. The domain was city planning and architecture [AIS77]. Design patterns were originally introduced to the software engineering community by Beck and Cunningham [BC87] and became prominent in the area of object-oriented programming with the publication of the book by Gamma, Helm, Johnson, and Vlissides [GHJV95], affectionately known as the GoF (Gang of Four) book. This book gives a large collection of design patterns for object-oriented programming. To give one example, the Visitor pattern describes a way to structure classes so that the code implementing a heterogeneous data structure can be kept separate from the code to traverse it. Thus, what happens in a traversal depends on both the type of each node and the class that implements the traversal. This allows multiple functionality for data structure traversals, and significant flexibility as new functionality can be added without having to change the data structure class. The patterns in the GoF book have entered the lexicon of object-oriented programming; references to its patterns are found in the academic literature, trade publications, and system documentation. These patterns have by now become part of the expected knowledge of any competent software engineer.

An educational nonprofit organization called the Hillside Group [Hil] was formed in 1993 to promote the use of patterns and pattern languages and, more generally, to improve human communication about computers "by encouraging people to codify common programming and design practice". To develop new patterns and help pattern writers hone their skills, the Hillside Group sponsors an annual Pattern Languages of Programs (PLoP) workshop and several spinoffs in other parts of the world, such as ChiliPLoP (in the western United States), KoalaPLoP (Australia), EuroPLoP (Europe), and MensorePLoP (Japan). The proceedings of these workshops [Pat] provide a rich source of patterns covering a vast range of application domains in software development and have been used as a basis for several books [CS95, VCK96, MRB97, HFR99].

In his original work on patterns, Alexander provided not only a catalog of patterns, but also a pattern language that introduced a new approach to design. In a pattern language, the patterns are organized into a structure that leads the user through the collection of patterns in such a way that complex systems can be designed using the patterns. At each decision point, the designer selects an appropriate pattern. Each pattern leads to other patterns, resulting in a final design in terms of a web of patterns. Thus, a pattern language embodies a design methodology and provides domain-specific advice to the application designer. (In spite of the overlapping terminology, a pattern language is not a programming language.)
1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING

This book describes a pattern language for parallel programming that provides several benefits. The immediate benefits are a way to disseminate the experience of experts by providing a catalog of good solutions to important problems, an expanded vocabulary, and a methodology for the design of parallel programs. We hope to lower the barrier to parallel programming by providing guidance through the entire process of developing a parallel program. The programmer brings to the process a good understanding of the actual problem to be solved and then works through the pattern language, eventually obtaining a detailed parallel design or possibly working code. In the longer term, we hope that this pattern language can provide a basis for both a disciplined approach to the qualitative evaluation of different programming models and the development of parallel programming tools.

The pattern language is organized into four design spaces: Finding Concurrency, Algorithm Structure, Supporting Structures, and Implementation Mechanisms. These form a linear hierarchy, with Finding Concurrency at the top and Implementation Mechanisms at the bottom, as shown in Fig. 1.1.
Figure 1.1. Overview of the pattern language
The Finding Concurrency design space is concerned with structuring the problem to expose exploitable concurrency. The designer working at this level focuses on high-level algorithmic issues and reasons about the problem to expose potential concurrency. The Algorithm Structure design space is concerned with structuring the algorithm to take advantage of potential concurrency. That is, the designer working at this level reasons about how to use the concurrency exposed in working with the Finding Concurrency patterns. The Algorithm Structure patterns describe overall strategies for exploiting concurrency. The Supporting Structures design space represents an intermediate stage between the Algorithm Structure and Implementation Mechanisms design spaces. Two important groups of patterns in this space are those that represent program-structuring approaches and those that represent commonly used shared data structures. The Implementation Mechanisms design space is concerned with how the patterns of the higher-level spaces are mapped into particular programming environments. We use it to provide descriptions of common mechanisms for process/thread management (for example, creating or destroying processes/threads) and process/thread interaction (for example, semaphores, barriers, or message passing). The items in this design space are not presented as patterns because in many cases they map directly onto elements within particular parallel programming environments. They are included in the pattern language anyway, however, to provide a complete path from problem description to code.
Chapter 2. Background and Jargon of Parallel Computing
2.1 CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
2.2 PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
2.3 PARALLEL PROGRAMMING ENVIRONMENTS
2.4 THE JARGON OF PARALLEL COMPUTING
2.5 A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
2.6 COMMUNICATION
2.7 SUMMARY
In this chapter, we give an overview of the parallel programming landscape, and define any specialized parallel computing terminology that we will use in the patterns. Because many terms in computing are overloaded, taking different meanings in different contexts, we suggest that even readers familiar with parallel programming at least skim this chapter.

2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS

Concurrency was first exploited in computing to better utilize or share resources within a computer. Modern operating systems support context switching to allow multiple tasks to appear to execute concurrently, thereby allowing useful work to occur while the processor is stalled on one task. This application of concurrency, for example, allows the processor to stay busy by swapping in a new task to execute while another task is waiting for I/O. By quickly swapping tasks in and out, giving each task a "slice" of the processor time, the operating system can allow multiple users to use the system as if each were using it alone (but with degraded performance).
Most modern operating systems can use multiple processors to increase the throughput of the system. The UNIX shell uses concurrency along with a communication abstraction known as pipes to provide a powerful form of modularity: Commands are written to accept a stream of bytes as input (the consumer) and produce a stream of bytes as output (the producer). Multiple commands can be chained together with a pipe connecting the output of one command to the input of the next, allowing complex commands to be built from simple building blocks. Each command is executed in its own process, with all processes executing concurrently. Because the producer blocks if buffer space in the pipe is not available, and the consumer blocks if data is not available, the job of managing the stream of results moving between commands is greatly simplified. More recently, with operating systems with windows that invite users to do more than one thing at a time, and the Internet, which often introduces I/O delays perceptible to the user, almost every program that contains a GUI incorporates concurrency.

Although the fundamental concepts for safely handling concurrency are the same in parallel programs and operating systems, there are some important differences. For an operating system, the problem is not finding concurrency; the concurrency is inherent in the way the operating system functions in managing a collection of concurrently executing processes (representing users, applications, and background activities such as print spooling) and providing synchronization mechanisms so resources can be safely shared. However, an operating system must support concurrency in a robust and secure way: Processes should not be able to interfere with each other (intentionally or not), and the entire system should not crash if something goes wrong with one process. In a parallel program, finding and exploiting concurrency can be a challenge, while isolating processes from each other is not the critical concern it is with an operating system. Performance goals are different as well. In an operating system, performance goals are normally related to throughput or response time, and it may be acceptable to sacrifice some efficiency to maintain robustness and fairness in resource allocation. In a parallel program, the goal is to minimize the running time of a single program.
2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION

There are dozens of different parallel architectures, among them networks of workstations, clusters of off-the-shelf PCs, massively parallel supercomputers, tightly coupled symmetric multiprocessors, and multiprocessor workstations. In this section, we give an overview of these systems, focusing on the characteristics relevant to the programmer.

2.2.1. Flynn's Taxonomy

By far the most common way to characterize these architectures is Flynn's taxonomy [Fly72]. He categorizes all computers according to the number of instruction streams and data streams they have, where a stream is a sequence of instructions or data on which a computer operates. In Flynn's taxonomy, there are four possibilities: SISD, SIMD, MISD, and MIMD.

Single Instruction, Single Data (SISD). In a SISD system, one stream of instructions processes a single stream of data, as shown in Fig. 2.1. This is the common von Neumann model used in virtually all single-processor computers.
Figure 2.1. The Single Instruction, Single Data (SISD) architecture
Single Instruction, Multiple Data (SIMD). In a SIMD system, a single instruction stream is concurrently broadcast to multiple processors, each with its own data stream (as shown in Fig. 2.2). The original systems from Thinking Machines and MasPar can be classified as SIMD. The CPP DAP Gamma II and Quadrics Apemille are more recent examples; these are typically deployed in specialized applications, such as digital signal processing, that are suited to fine-grained parallelism and require little interprocess communication. Vector processors, which operate on vector data in a pipelined fashion, can also be categorized as SIMD. Exploiting this parallelism is usually done by the compiler.
Figure 2.2. The Single Instruction, Multiple Data (SIMD) architecture
Multiple Instruction, Single Data (MISD). No well-known systems fit this designation. It is mentioned for the sake of completeness.

Multiple Instruction, Multiple Data (MIMD). In a MIMD system, each processing element has its own stream of instructions operating on its own data. This architecture, shown in Fig. 2.3, is the most general of the architectures in that each of the other cases can be mapped onto the MIMD architecture. The vast majority of modern parallel systems fit into this category.
Figure 2.3. The Multiple Instruction, Multiple Data (MIMD) architecture
2.2.2. A Further Breakdown of MIMD

The MIMD category of Flynn's taxonomy is too broad to be useful on its own; this category is typically decomposed according to memory organization.

Shared memory. In a shared-memory system, all processes share a single address space and communicate with each other by writing and reading shared variables.

One class of shared-memory systems is called SMPs (symmetric multiprocessors). As shown in Fig. 2.4, all processors share a connection to a common memory and access all memory locations at equal speeds. SMP systems are arguably the easiest parallel systems to program because programmers do not need to distribute data structures among processors. Because increasing the number of processors increases contention for the memory, the processor/memory bandwidth is typically a limiting factor. Thus, SMP systems do not scale well and are limited to small numbers of processors.
Figure 2.4. The Symmetric Multiprocessor (SMP) architecture
The other main class of shared-memory systems is called NUMA (nonuniform memory access). As shown in Fig. 2.5, the memory is shared; that is, it is uniformly addressable from all processors, but some blocks of memory may be physically more closely associated with some processors than others. This reduces the memory bandwidth bottleneck and allows systems with more processors; however, as a result, the access time from a processor to a memory location can be significantly different depending on how "close" the memory location is to the processor. To mitigate the effects of nonuniform access, each processor has a cache, along with a protocol to keep cache entries coherent. Hence, another name for these architectures is cache-coherent nonuniform memory access systems (ccNUMA). Logically, programming a ccNUMA system is the same as programming an SMP, but to obtain the best performance, the programmer will need to be more careful about locality issues and cache effects.
Figure 2.5. An example of the nonuniform memory access (NUMA) architecture
Distributed memory. In a distributed-memory system, each process has its own address space and communicates with other processes by message passing (sending and receiving messages). A schematic representation of a distributed-memory computer is shown in Fig. 2.6.
Figure 2.6. The distributed-memory architecture
Depending on the topology and technology used for the processor interconnection, communication speed can range from almost as fast as shared memory (in tightly integrated supercomputers) to orders of magnitude slower (for example, in a cluster of PCs interconnected with an Ethernet network). The programmer must explicitly program all the communication between processors and be concerned with the distribution of data.

Distributed-memory computers are traditionally divided into two classes: MPP (massively parallel processors) and clusters. In an MPP, the processors and the network infrastructure are tightly coupled and specialized for use in a parallel computer. These systems are extremely scalable, in some cases supporting the use of many thousands of processors in a single system [MSW96, IBM02].

Clusters are distributed-memory systems composed of off-the-shelf computers connected by an off-the-shelf network. When the computers are PCs running the Linux operating system, these clusters are called Beowulf clusters. As off-the-shelf networking technology improves, systems of this type are becoming more common and much more powerful. Clusters provide an inexpensive way for an organization to obtain parallel computing capabilities [Beo]. Preconfigured clusters are now available from many vendors. One frugal group even reported constructing a useful parallel system by using a cluster to harness the combined power of obsolete PCs that otherwise would have been discarded [HHS01].
Hybrid systems. These systems are clusters of nodes with separate address spaces in which each node contains several processors that share memory.

According to van der Steen and Dongarra's "Overview of Recent Supercomputers" [vdSD03], which contains a brief description of the supercomputers currently or soon to be commercially available, hybrid systems formed from clusters of SMPs connected by a fast network are currently the dominant trend in high-performance computing. For example, in late 2003, four of the five fastest computers in the world were hybrid systems [Top].

Grids. Grids are systems that use distributed, heterogeneous resources connected by LANs and/or WANs [FK03]. Often the interconnection network is the Internet. Grids were originally envisioned as a way to link multiple supercomputers to enable larger problems to be solved, and thus could be viewed as a special type of distributed-memory or hybrid MIMD machine. More recently, the idea of grid computing has evolved into a general way to share heterogeneous resources, such as computation servers, storage, application servers, information services, or even scientific instruments. Grids differ from clusters in that the various resources in the grid need not have a common point of administration. In most cases, the resources on a grid are owned by different organizations that maintain control over the policies governing use of the resources. This affects the way these systems are used, the middleware created to manage them, and most importantly for this discussion, the overhead incurred when communicating between resources within the grid.
2.2.3. Summary

We have classified these systems according to the characteristics of the hardware. These characteristics typically influence the native programming model used to express concurrency on a system; however, this is not always the case. It is possible for a programming environment for a shared-memory machine to provide the programmer with the abstraction of distributed memory and message passing. Virtual distributed shared memory systems contain middleware to provide the opposite: the abstraction of shared memory on a distributed-memory machine.
2.3. PARALLEL PROGRAMMING ENVIRONMENTS

Parallel programming environments provide the basic tools, language features, and application programming interfaces (APIs) needed to construct a parallel program. A programming environment implies a particular abstraction of the computer system called a programming model. Traditional sequential computers use the well-known von Neumann model. Because all sequential computers use this model, software designers can design software to a single abstraction and reasonably expect it to map onto most, if not all, sequential computers.

Unfortunately, there are many possible models for parallel computing, reflecting the different ways processors can be interconnected to construct a parallel system. The most common models are based on one of the widely deployed parallel architectures: shared memory, distributed memory with message passing, or a hybrid combination of the two.

Programming models too closely aligned to a particular parallel system lead to programs that are not portable between parallel computers. Because the effective lifespan of software is longer than that of hardware, many organizations have more than one type of parallel computer, and most programmers insist on programming environments that allow them to write portable parallel programs. Also, explicitly managing large numbers of resources in a parallel computer is difficult, suggesting that higher-level abstractions of the parallel computer might be useful. The result is that as of the mid-1990s, there was a veritable glut of parallel programming environments. A partial list of these is shown in Table 2.1. This created a great deal of confusion for application developers and hindered the adoption of parallel computing for mainstream applications.
Table 2.1. Some Parallel Programming Environments from the Mid-1990s
"C*inC CUMULVS JavaRMI PRIO Quake
ABCPL DAGGER javaPG P3L Quark
ACE DAPPLE JAVAR P4Linda QuickThreads
ACT++ DataParallelC JavaSpaces Pablo Sage++
ADDAP DC++ JIDL PADE SAM
Adl DCE++ Joyce PADRE SCANDAL
Adsmith DDD Karma Panda SCHEDULE
AFAPI DICE Khoros Papers SciTL
ALWAN DIPC KOAN/FortranS Para++ SDDA
AM DistributedSmalltalk
LAM Paradigm SHMEM
AMDC DOLIB Legion Parafrase2 SIMPLE
Amoeba DOME Lilac Paralation Sina
AppLeS DOSMOS Linda Parallaxis SISAL
ARTS DRL LiPS ParallelHaskell
SMI
AthapascanOb DSMThreads Locust ParallelC++ SONiC
Aurora Ease Lparx ParC SplitC
Automap ECO Lucid ParLib++ SR
bb_threads Eilean Maisie ParLin Sthreads
Blaze Emerald Manifold Parlog Strand
BlockComm EPL Mentat Parmacs SUIF
BSP Excalibur MetaChaos Parti SuperPascal
C* Express Midway pC Synergy
C** Falcon Millipede pC++ TCGMSG
C4 Filaments Mirage PCN Telegraphos
CarlOS FLASH Modula2* PCP: TheFORCE
Cashmere FM ModulaP PCU Threads.h++
CC++ Fork MOSIX PEACE TRAPPER
Charlotte FortranM MpC PENNY TreadMarks
Charm FX MPC++ PET UC
Charm++ GA MPI PETSc uC++
Chu GAMMA Multipol PH UNITY
Cid Glenda Munin Phosphorus V
Cilk GLU NanoThreads POET Vic*
CMFortran GUARD NESL Polaris VisifoldVNUS
Code HAsL NetClasses++ POOLT VPE
ConcurrentML HORUS Nexus POOMA Win32threads
Converse HPC Nimrod POSYBL WinPar
COOL HPC++ NOW PRESTO WWWinda
CORRELATE HPF ObjectiveLinda Prospero XENOOPS
CparPar IMPACT Occam Proteus XPC
CPS ISETLLinda Omega PSDM Zounds
CRL ISIS OOF90 PSI ZPL
CSP JADA Orca PVM
Cthreads JADE P++ QPC++
Fortunately, by the late 1990s, the parallel programming community converged predominantly on two environments for parallel programming: OpenMP [OMP] for shared memory and MPI [Mesb] for message passing.
OpenMP is a set of language extensions implemented as compiler directives. Implementations are currently available for Fortran, C, and C++. OpenMP is frequently used to incrementally add parallelism to sequential code. By adding a compiler directive around a loop, for example, the compiler can be instructed to generate code to execute the iterations of the loop in parallel. The compiler takes care of most of the details of thread creation and management. OpenMP programs tend to work very well on SMPs, but because its underlying programming model does not include a notion of nonuniform memory access times, it is less ideal for ccNUMA and distributed-memory machines.
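The incremental style described above looks like the following minimal C sketch; the loop body and array names are illustrative.

    #include <stdio.h>

    #define N 1000

    int main(void) {
        double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* The only change from the sequential version is this directive;
           the compiler generates the code that creates and manages the
           threads and divides the iterations among them. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[%d] = %f\n", N - 1, c[N - 1]);
        return 0;
    }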
MPI is a set of library routines that provide for process management, message passing, and some collective communication operations (these are operations that involve all the processes involved in a program, such as barrier, broadcast, and reduction). MPI programs can be difficult to write because the programmer is responsible for data distribution and explicit interprocess communication using messages. Because the programming model assumes distributed memory, MPI is a good choice for MPPs and other distributed-memory machines.
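A minimal MPI sketch, assuming the standard C bindings, shows the library-call style and one collective operation (a reduction); the per-process value is illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int value = rank * rank;   /* illustrative local result */
        int total = 0;

        /* A collective operation: combine each process's value into
           a single sum delivered to rank 0. */
        MPI_Reduce(&value, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d processes = %d\n", size, total);

        MPI_Finalize();
        return 0;
    }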
Neither OpenMP nor MPI is an ideal fit for hybrid architectures that combine multiprocessor nodes, each with multiple processors and a shared memory, into a larger system with separate address spaces for each node: The OpenMP model does not recognize nonuniform memory access times, so its data allocation can lead to poor performance on machines that are not SMPs, while MPI does not include constructs to manage data structures residing in a shared memory. One solution is a hybrid model in which OpenMP is used on each shared-memory node and MPI is used between the nodes. This works well, but it requires the programmer to work with two different programming models within a single program. Another option is to use MPI on both the shared-memory and distributed-memory portions of the algorithm and give up the advantages of a shared-memory programming model, even when the hardware directly supports it.
New high-level programming environments that simplify portable parallel programming and more accurately reflect the underlying parallel architectures are topics of current research [Cen]. Another approach, more popular in the commercial sector, is to extend MPI and OpenMP. In the mid-1990s, the MPI Forum defined an extended MPI called MPI 2.0, although implementations are not widely available at the time this was written. It is a large, complex extension to MPI that includes dynamic process creation, parallel I/O, and many other features. Of particular interest to programmers of modern hybrid architectures is the inclusion of one-sided communication. One-sided communication mimics some of the features of a shared-memory system by letting one process write into or read from the memory regions of other processes. The term "one-sided" refers to the fact that the read or write is launched by the initiating process without the explicit involvement of the other participating process. A more sophisticated abstraction of one-sided communication is available as part of the Global Arrays [NHL96, NHK+02, Gloa] package. Global Arrays works together with MPI to help a programmer manage distributed array data. After the programmer defines the array and how it is laid out in memory, the program executes "puts" or "gets" into the array without needing to explicitly manage which MPI process "owns" the particular section of the array. In essence, the global array provides an abstraction of a globally shared array. This only works for arrays, but these are such common data structures in parallel computing that this package, although limited, can be very useful.
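As a sketch of MPI 2.0 one-sided communication (assuming an MPI implementation that provides the one-sided routines, and run with at least two processes), rank 0 writes directly into a window of memory exposed by rank 1, with no matching receive on rank 1's side:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        int buf = 0;                 /* window memory exposed to other ranks */
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Win_create(&buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);       /* opens an access epoch */
        if (rank == 0) {
            int msg = 42;
            /* Rank 0 writes into rank 1's window; rank 1 takes no
               explicit part in the transfer. */
            MPI_Put(&msg, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);       /* completes the transfer */

        if (rank == 1) printf("received %d\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }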
Just as MPI has been extended to mimic some of the benefits of a shared-memory environment, OpenMP has been extended to run in distributed-memory environments. The annual WOMPAT (Workshop on OpenMP Applications and Tools) workshops contain many papers discussing various approaches and experiences with OpenMP in clusters and ccNUMA environments.
MPI is implemented as a library of routines to be called from programs written in a sequential programming language, whereas OpenMP is a set of extensions to sequential programming languages. They represent two of the possible categories of parallel programming environments (libraries and language extensions), and these two particular environments account for the overwhelming majority of parallel computing being done today. There is, however, one more category of parallel programming environments, namely languages with built-in features to support parallel programming. Java is such a language. Rather than being designed to support high-performance computing, Java is an object-oriented, general-purpose programming environment with features for explicitly specifying concurrent processing with shared memory. In addition, the standard I/O and network packages provide classes that make it easy for Java to perform interprocess communication between machines, thus making it possible to write programs based on both the shared-memory and the distributed-memory models. The newer java.nio packages support I/O in a way that is less convenient for the programmer, but gives significantly better performance, and Java 2 1.5 includes new support for concurrent programming, most significantly in the java.util.concurrent.* packages. Additional packages that support different approaches to parallel computing are widely available.

Although there have been other general-purpose languages, both prior to Java and more recent (for example, C#), that contained constructs for specifying concurrency, Java is the first to become widely used. As a result, it may be the first exposure for many programmers to concurrent and parallel programming. Although Java provides software engineering benefits, currently the performance of parallel Java programs cannot compete with OpenMP or MPI programs for typical scientific computing applications. The Java design has also been criticized for several deficiencies that matter in this domain (for example, a floating-point model that emphasizes portability and more reproducible results over exploiting the available floating-point hardware to the fullest, inefficient handling of arrays, and lack of a lightweight mechanism to handle complex numbers). The performance difference between Java and other alternatives can be expected to decrease, especially for symbolic or other nonnumeric problems, as compiler technology for Java improves and as new packages and language extensions become available. The Titanium project [Tita] is an example of a Java dialect designed for high-performance computing in a ccNUMA environment.

For the purposes of this book, we have chosen OpenMP, MPI, and Java as the three environments we will use in our examples: OpenMP and MPI for their popularity, and Java because it is likely to be many programmers' first exposure to concurrent programming. A brief overview of each can be found in the appendixes.
2.4. THE JARGON OF PARALLEL COMPUTING

In this section, we define some terms that are frequently used throughout the pattern language. Additional definitions can be found in the glossary.

Task. The first step in designing a parallel program is to break the problem up into tasks. A task is a sequence of instructions that operate together as a group. This group corresponds to some logical part of an algorithm or program. For example, consider the multiplication of two order-N matrices. Depending on how we construct the algorithm, the tasks could be (1) the multiplication of subblocks of the matrices, (2) inner products between rows and columns of the matrices, or (3) individual iterations of the loops involved in the matrix multiplication. These are all legitimate ways to define tasks for matrix multiplication; that is, the task definition follows from the way the algorithm designer thinks about the problem.
Unit of execution (UE). To be executed, a task needs to be mapped to a UE such as a process or thread. A process is a collection of resources that enables the execution of program instructions. These resources can include virtual memory, I/O descriptors, a runtime stack, signal handlers, user and group IDs, and access control tokens. A more high-level view is that a process is a "heavyweight" unit of execution with its own address space. A thread is the fundamental UE in modern operating systems. A thread is associated with a process and shares the process's environment. This makes threads lightweight (that is, a context switch between threads takes only a small amount of time). A more high-level view is that a thread is a "lightweight" UE that shares an address space with other threads.

We will use unit of execution or UE as a generic term for one of a collection of possibly concurrently executing entities, usually either processes or threads. This is convenient in the early stages of program design when the distinctions between processes and threads are less important.

Processing element (PE). We use the term processing element (PE) as a generic term for a hardware element that executes a stream of instructions. The unit of hardware considered to be a PE depends on the context. For example, some programming environments view each workstation in a cluster of SMP workstations as executing a single instruction stream; in this situation, the PE would be the workstation. A different programming environment running on the same hardware, however, might view each processor of each workstation as executing an individual instruction stream; in this case, the PE is the individual processor, and each workstation contains several PEs.
Load balance and load balancing. To execute a parallel program, the tasks must be mapped to UEs, and the UEs to PEs. How the mappings are done can have a significant impact on the overall performance of a parallel algorithm. It is crucial to avoid the situation in which a subset of the PEs is doing most of the work while others are idle. Load balance refers to how well the work is distributed among PEs. Load balancing is the process of allocating work to PEs, either statically or dynamically, so that the work is distributed as evenly as possible.
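In OpenMP, for example, the choice between static and dynamic allocation can be expressed with a schedule clause. The sketch below assumes iterations with varying amounts of work; the POSIX usleep call stands in for real computation.

    #include <stdio.h>
    #include <unistd.h>

    #define N 64

    /* Work per iteration varies, so dividing the iterations evenly
       among threads in advance would leave some threads idle. */
    static void do_task(int i) { usleep(1000 * (i % 8)); }

    int main(void) {
        /* schedule(dynamic) hands out iterations as threads become
           free, balancing the load at runtime. */
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < N; i++)
            do_task(i);

        printf("all %d tasks done\n", N);
        return 0;
    }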
Synchronization. In a parallel program, due to the nondeterminism of task scheduling and other factors, events in the computation might not always occur in the same order. For example, in one run, a task might read variable x before another task reads variable y; in the next run with the same input, the events might occur in the opposite order. In many cases, the order in which two events occur does not matter. In other situations, the order does matter, and to ensure that the program is correct, the programmer must introduce synchronization to enforce the necessary ordering constraints. The primitives provided for this purpose in our selected environments are discussed in the Implementation Mechanisms design space (Section 6.3).

Synchronous versus asynchronous. We use these two terms to qualitatively refer to how tightly coupled in time two events are. If two events must happen at the same time, they are synchronous; otherwise they are asynchronous. For example, message passing (that is, communication between UEs by sending and receiving messages) is synchronous if a message sent must be received before the sender can continue. Message passing is asynchronous if the sender can continue its computation regardless of what happens at the receiver, or if the receiver can continue computations while waiting for a receive to complete.
Race conditions. A race condition is a kind of error peculiar to parallel programs. It occurs when the outcome of a program changes as the relative scheduling of UEs varies. Because the operating system and not the programmer controls the scheduling of the UEs, race conditions result in programs that potentially give different answers even when run on the same system with the same data. Race conditions are particularly difficult errors to debug because by their nature they cannot be reliably reproduced. Testing helps, but is not as effective as with sequential programs: A program may run correctly the first thousand times and then fail catastrophically on the thousand-and-first execution, and then run again correctly when the programmer attempts to reproduce the error as the first step in debugging.

Race conditions result from errors in synchronization. If multiple UEs read and write shared variables, the programmer must protect access to these shared variables so the reads and writes occur in a valid order regardless of how the tasks are interleaved. When many variables are shared or when they are accessed through multiple levels of indirection, verifying by inspection that no race conditions exist can be very difficult. Tools are available that help detect and fix race conditions, such as Thread Checker from Intel Corporation, and the problem remains an area of active and important research [NM92].
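A classic shared-counter race, sketched here in C with OpenMP, shows both the error and one possible fix; the iteration count is illustrative.

    #include <stdio.h>

    int main(void) {
        int counter = 0;

        /* RACE: the read and the write of counter can interleave
           between threads, so some updates are silently lost. */
        #pragma omp parallel for
        for (int i = 0; i < 100000; i++)
            counter++;                       /* unprotected shared update */
        printf("racy result:   %d\n", counter);   /* often < 100000 */

        counter = 0;
        #pragma omp parallel for
        for (int i = 0; i < 100000; i++) {
            #pragma omp atomic               /* makes each update indivisible */
            counter++;
        }
        printf("atomic result: %d\n", counter);   /* always 100000 */
        return 0;
    }

The racy version may well produce the correct answer on many runs, which is exactly why such errors are hard to find by testing.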
Deadlocks. Deadlocks are another type of error peculiar to parallel programs. A deadlock occurs when there is a cycle of tasks in which each task is blocked waiting for another to proceed. Because all are waiting for another task to do something, they will all be blocked forever. As a simple example, consider two tasks in a message-passing environment. Task A attempts to receive a message from task B, after which A will reply by sending a message of its own to task B. Meanwhile, task B attempts to receive a message from task A, after which B will send a message to A. Because each task is waiting for the other to send it a message first, both tasks will be blocked forever. Fortunately, deadlocks are not difficult to discover, as the tasks will stop at the point of the deadlock.
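The two-task example translates directly into MPI; this sketch (run with exactly two processes) hangs at the blocking receives, since each rank waits for a message the other has not yet sent.

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, other, buf = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;                    /* the one peer process */

        /* Both ranks block here forever: a cycle of tasks, each
           waiting for the other to proceed. */
        MPI_Recv(&buf, 1, MPI_INT, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&rank, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Reversing the send/receive order on one of the two ranks is one way to break the cycle.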
2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION

The two main reasons for implementing a parallel program are to obtain better performance and to solve larger problems. Performance can be both modeled and measured, so in this section we will take another look at parallel computation by giving some simple analytical models that illustrate some of the factors that influence the performance of a parallel program.
Consider a computation consisting of three parts: a setup section, a computation section, and a finalization section. The total running time of this program on one PE is then given as the sum of the times for the three parts.

Equation 2.1

    T_{total}(1) = T_{setup} + T_{compute}(1) + T_{finalization}

What happens when we run this computation on a parallel computer with multiple PEs? Suppose that the setup and finalization sections cannot be carried out concurrently with any other activities, but that the computation section could be divided into tasks that would run independently on as many PEs as are available, with the same total number of computation steps as in the original computation. The time for the full computation on P PEs can therefore be given by

Equation 2.2

    T_{total}(P) = T_{setup} + \frac{T_{compute}(1)}{P} + T_{finalization}

Of course, Eq. 2.2 describes a very idealized situation. However, the idea that computations have a serial part (for which additional PEs are useless) and a parallelizable part (for which more PEs decrease the running time) is realistic. Thus, this simple model captures an important relationship.
An important measure of how much additional PEs help is the relative speedup S, which describes how much faster a problem runs in a way that normalizes away the actual running time.

Equation 2.3

    S(P) = \frac{T_{total}(1)}{T_{total}(P)}
A related measure is the efficiency E, which is the speedup normalized by the number of PEs.

Equation 2.4

    E(P) = \frac{S(P)}{P}

Equation 2.5

    E(P) = \frac{T_{total}(1)}{P \, T_{total}(P)}
Ideally, we would want the speedup to be equal to P, the number of PEs. This is sometimes called perfect linear speedup. Unfortunately, this is an ideal that can rarely be achieved because times for setup and finalization are not improved by adding more PEs, limiting the speedup. The terms that cannot be run concurrently are called the serial terms. Their running times represent some fraction of the total, called the serial fraction, denoted γ.

Equation 2.6

    \gamma = \frac{T_{setup} + T_{finalization}}{T_{total}(1)}
The fraction of time spent in the parallelizable part of the program is then (1 - γ). We can thus rewrite the expression for total computation time with P PEs as

Equation 2.7

    T_{total}(P) = \gamma \, T_{total}(1) + \frac{(1 - \gamma) \, T_{total}(1)}{P}
Now, rewriting S in terms of the new expression for T_{total}(P), we obtain the famous Amdahl's law:

Equation 2.8

    S(P) = \frac{T_{total}(1)}{\gamma \, T_{total}(1) + \frac{(1 - \gamma) \, T_{total}(1)}{P}}

Equation 2.9

    S(P) = \frac{1}{\gamma + \frac{1 - \gamma}{P}}
Thus, in an ideal parallel algorithm with no overhead in the parallel part, the speedup should follow Eq. 2.9. What happens to the speedup if we take our ideal parallel algorithm and use a very large number of processors? Taking the limit as P goes to infinity in our expression for S yields

Equation 2.10

    \lim_{P \to \infty} S(P) = \frac{1}{\gamma}
Eq. 2.10 thus gives an upper bound on the speedup obtainable in an algorithm whose serial part represents γ of the total computation. These concepts are vital to the parallel algorithm designer. In designing a parallel algorithm, it is important to understand the value of the serial fraction so that realistic expectations can be set for performance. It may not make sense to implement a complex, arbitrarily scalable parallel algorithm if 10% or more of the algorithm is serial, and 10% is fairly common.
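To make the bound concrete, take the serial fraction of 10 percent mentioned above and evaluate Eq. 2.9:

    \gamma = 0.1: \quad S(10) = \frac{1}{0.1 + 0.9/10} \approx 5.3, \qquad S(100) = \frac{1}{0.1 + 0.9/100} \approx 9.2, \qquad \lim_{P \to \infty} S(P) = 10

Even with a hundred processors the speedup barely exceeds nine, and no number of processors can push it past ten.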
Of course, Amdahl's law is based on assumptions that may or may not be true in practice. In real life, a number of factors may make the actual running time longer than this formula implies. For example, creating additional parallel tasks may increase overhead and the chances of contention for shared resources. On the other hand, if the original serial computation is limited by resources other than the availability of CPU cycles, the actual performance could be much better than Amdahl's law would predict. For example, a large parallel machine may allow bigger problems to be held in memory, thus reducing virtual memory paging, or multiple processors each with its own cache may allow much more of the problem to remain in the cache. Amdahl's law also rests on the assumption that for any given input, the parallel and serial implementations perform exactly the same number of computational steps. If the serial algorithm being used in the formula is not the best possible algorithm for the problem, then a clever parallel algorithm that structures the computation differently can reduce the total number of computational steps.

It has also been observed [Gus88] that the exercise underlying Amdahl's law, namely running exactly the same problem with varying numbers of processors, is artificial in some circumstances. If, say, the parallel application were a weather simulation, then when new processors were added, one would most likely increase the problem size by adding more details to the model while keeping the total execution time constant. If this is the case, then Amdahl's law, or fixed-size speedup, gives a pessimistic view of the benefits of additional processors.
To see this, we can reformulate the equation to give the speedup in terms of performance on a P-processor system. Earlier in Eq. 2.2, we obtained the execution time for P processors, T_{total}(P), from the execution time of the serial terms and the execution time of the parallelizable part when executed on one processor. Here, we do the opposite and obtain T_{total}(1) from the serial and parallel terms when executed on P processors.

Equation 2.11

    T_{total}(1) = T_{setup} + P \, T_{compute}(P) + T_{finalization}
Now, we define the so-called scaled serial fraction, denoted γ_scaled, as

Equation 2.12

    \gamma_{scaled} = \frac{T_{setup} + T_{finalization}}{T_{total}(P)}

and then

Equation 2.13

    T_{total}(1) = \left( \gamma_{scaled} + P \, (1 - \gamma_{scaled}) \right) T_{total}(P)
Rewriting the equation for speedup (Eq. 2.3) and simplifying, we obtain the scaled (or fixed-time) speedup.[1]

[1] This equation, sometimes known as Gustafson's law, was attributed in [Gus88] to E. Barsis.

Equation 2.14

    S(P) = \gamma_{scaled} + P \, (1 - \gamma_{scaled})
This gives exactly the same speedup as Amdahl's law, but allows a different question to be asked when the number of processors is increased. Since γ_scaled depends on P, the result of taking the limit isn't immediately obvious, but would give the same result as the limit in Amdahl's law. However, suppose we take the limit in P while holding T_compute and thus γ_scaled constant. The interpretation is that we are increasing the size of the problem so that the total running time remains constant when more processors are added. (This contains the implicit assumption that the execution time of the serial terms does not change as the problem size grows.) In this case, the speedup is linear in P. Thus, while adding more processors to solve a fixed problem may hit the speedup limits of Amdahl's law with a relatively small number of processors, if the problem grows as more processors are added, Amdahl's law will be pessimistic. These two models of speedup, along with a fixed-memory version of speedup, are discussed in [SN90].
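For contrast with the fixed-size example above, suppose (as an illustrative assumption) that the serial terms account for 5 percent of the running time on the parallel machine itself. Then Eq. 2.14 gives

    \gamma_{scaled} = 0.05, \ P = 64: \quad S(64) = 0.05 + 64 \times 0.95 = 60.85

Because the problem grows with the machine, the scaled speedup stays close to linear in P rather than saturating at 1/γ.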
2.6. COMMUNICATION

2.6.1. Latency and Bandwidth

A simple but useful model characterizes the total time for message transfer as the sum of a fixed cost plus a variable cost that depends on the length of the message.

Equation 2.15

    T_{message\text{-}transfer}(N) = \alpha + \frac{N}{\beta}

The fixed cost α is called latency and is essentially the time it takes to send an empty message over the communication medium, from the time the send routine is called to the time the data is received by the recipient. Latency (given in some appropriate time unit) includes overhead due to software and network hardware plus the time it takes for the message to traverse the communication medium. The bandwidth β (given in some measure of bytes per time unit) is a measure of the capacity of the communication medium. N is the length of the message.
The latency and bandwidth can vary significantly between systems depending on both the hardware used and the quality of the software implementing the communication protocols. Because these values can be measured with fairly simple benchmarks [DD97], it is sometimes worthwhile to measure values for α and β, as these can help guide optimizations to improve communication performance. For example, in a system in which α is relatively large, it might be worthwhile to try to restructure a program that sends many small messages to aggregate the communication into a few large messages instead. Data for several recent systems has been presented in [BBC+03].
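To see why aggregation helps, plug illustrative (assumed, not measured) values into Eq. 2.15, say α = 50 μs and β = 100 MB/s, and compare sending 10 KB as ten messages versus one:

    10 \times \left( \alpha + \frac{1\,\text{KB}}{\beta} \right) = 10 \times (50 + 10)\,\mu s = 600\,\mu s
    \qquad
    \alpha + \frac{10\,\text{KB}}{\beta} = (50 + 100)\,\mu s = 150\,\mu s

The ten separate messages pay the latency ten times; the single aggregated message pays it once.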
2.6.2. Overlapping Communication and Computation and Latency Hiding

If we look more closely at the computation time within a single task on a single processor, it can roughly be decomposed into computation time, communication time, and idle time. The communication time is the time spent sending and receiving messages (and thus only applies to distributed-memory machines), whereas the idle time is time that no work is being done because the task is waiting for an event, such as the release of a resource held by another task.

A common situation in which a task may be idle is when it is waiting for a message to be transmitted through the system. This can occur when sending a message (as the UE waits for a reply before proceeding) or when receiving a message. Sometimes it is possible to eliminate this wait by restructuring the task to send the message and/or post the receive (that is, indicate that it wants to receive a message) and then continue the computation. This allows the programmer to overlap communication and computation. We show an example of this technique in Fig. 2.7. This style of message passing is more complicated for the programmer, because the programmer must take care to wait for the receive to complete after any work that can be overlapped with communication is completed.
Figure 2.7. Communication without (left) and with (right) support for overlapping communication and computation. Although UE 0 in the computation on the right still has some idle time waiting for the reply from UE 1, the idle time is reduced and the computation requires less total time because of UE 1's earlier start.
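In MPI, the restructuring described above amounts to posting a nonblocking receive, doing useful work while the message is in flight, and waiting only when the incoming data is actually needed. A sketch follows, assuming exactly two processes exchanging illustrative data.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1024

    int main(int argc, char **argv) {
        double local[N], incoming[N];
        int rank, other;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;                       /* assumes two processes */

        for (int i = 0; i < N; i++) local[i] = rank + i;

        /* Post the receive early, then send; the transfer can proceed
           while we compute below. */
        MPI_Irecv(incoming, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req);
        MPI_Send(local, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);

        double partial = 0.0;
        for (int i = 0; i < N; i++)             /* work that does not need */
            partial += local[i] * local[i];     /* the incoming data       */

        MPI_Wait(&req, MPI_STATUS_IGNORE);      /* now the data is needed */

        double total = partial;
        for (int i = 0; i < N; i++) total += incoming[i];
        printf("rank %d: total = %f\n", rank, total);

        MPI_Finalize();
        return 0;
    }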
Another technique used on many parallel computers is to assign multiple UEs to each PE, so that when one UE is waiting for communication, it will be possible to context switch to another UE and keep the processor busy. This is an example of latency hiding. It is increasingly being used on modern high-performance computing systems, the most famous example being the MTA system from Cray Research [ACC+90].
2.7. SUMMARY

This chapter has given a brief overview of some of the concepts and vocabulary used in parallel computing. Additional terms are defined in the glossary. We also discussed the major programming environments in use for parallel computing: OpenMP, MPI, and Java. Throughout the book, we will use these three programming environments for our examples. More details about OpenMP, MPI, and Java, and how to use them to write parallel programs, are provided in the appendixes.
Chapter 3. The Finding Concurrency Design Space
3.1 ABOUT THE DESIGN SPACE
3.2 THE TASK DECOMPOSITION PATTERN
3.3 THE DATA DECOMPOSITION PATTERN
3.4 THE GROUP TASKS PATTERN
3.5 THE ORDER TASKS PATTERN
3.6 THE DATA SHARING PATTERN
3.7 THE DESIGN EVALUATION PATTERN
3.8 SUMMARY
3.1. ABOUT THE DESIGN SPACE

The software designer works in a number of domains. The design process starts in the problem domain with design elements directly relevant to the problem being solved (for example, fluid flows, decision trees, atoms, etc.). The ultimate aim of the design is software, so at some point, the design elements change into ones relevant to a program (for example, data structures and software modules). We call this the program domain. Although it is often tempting to move into the program domain as soon as possible, a designer who moves out of the problem domain too soon may miss valuable design options.
This is particularly relevant in parallel programming. Parallel programs attempt to solve bigger problems in less time by simultaneously solving different parts of the problem on different processing elements. This can only work, however, if the problem contains exploitable concurrency, that is, multiple activities or tasks that can execute at the same time. After a problem has been mapped onto the program domain, however, it can be difficult to see opportunities to exploit concurrency.
Hence, programmers should start their design of a parallel solution by analyzing the problem within the problem domain to expose exploitable concurrency. We call the design space in which this analysis is carried out the Finding Concurrency design space. The patterns in this design space will help identify and analyze the exploitable concurrency in a problem. After this is done, one or more patterns from the Algorithm Structure space can be chosen to help design the appropriate algorithm structure to exploit the identified concurrency.
An overview of this design space and its place in the pattern language is shown in Fig. 3.1.
Figure 3.1. Overview of the Finding Concurrency design space and its place in the pattern language
Experienced designers working in a familiar domain may see the exploitable concurrency immediately and could move directly to the patterns in the Algorithm Structure design space.
3.1.1. Overview

Before starting to work with the patterns in this design space, the algorithm designer must first consider the problem to be solved and make sure the effort to create a parallel program will be justified: Is the problem large enough, and the results significant enough, to justify expending effort to solve it faster? If so, the next step is to make sure the key features and data elements within the problem are well understood. Finally, the designer needs to understand which parts of the problem are most computationally intensive, because the effort to parallelize the problem should be focused on those parts.
After this analysis is complete, the patterns in the Finding Concurrency design space can be used to start designing a parallel algorithm. The patterns in this design space can be organized into three groups.

Decomposition Patterns. The two decomposition patterns, Task Decomposition and Data Decomposition, are used to decompose the problem into pieces that can execute concurrently.

Dependency Analysis Patterns. This group contains three patterns that help group the tasks and analyze the dependencies among them: Group Tasks, Order Tasks, and Data Sharing. Nominally, the patterns are applied in this order. In practice, however, it is often necessary to work back and forth between them, or possibly even revisit the decomposition patterns.
Design Evaluation Pattern. The final pattern in this space guides the algorithm designer through an analysis of what has been done so far before moving on to the patterns in the Algorithm Structure design space. This pattern is important because it often happens that the best design is not found on the first attempt, and the earlier design flaws are identified, the easier they are to correct. In general, working through the patterns in this space is an iterative process.
3.1.2. Using the Decomposition Patterns

The first step in designing a parallel algorithm is to decompose the problem into elements that can execute concurrently. We can think of this decomposition as occurring in two dimensions.

The task-decomposition dimension views the problem as a stream of instructions that can be broken into sequences called tasks that can execute simultaneously. For the computation to be efficient, the operations that make up the task should be largely independent of the operations taking place inside other tasks.

The data-decomposition dimension focuses on the data required by the tasks and how it can be decomposed into distinct chunks. The computation associated with the data chunks will only be efficient if the data chunks can be operated upon relatively independently.

Viewing the problem decomposition in terms of two distinct dimensions is somewhat artificial. A task decomposition implies a data decomposition and vice versa; hence, the two decompositions are really different facets of the same fundamental decomposition. We divide them into separate dimensions, however, because a problem decomposition usually proceeds most naturally by emphasizing one dimension of the decomposition over the other. By making them distinct, we make this design emphasis explicit and easier for the designer to understand.
3.1.3. Background for Examples

In this section, we give background information on some of the examples that are used in several patterns. It can be skipped for the time being and revisited later when reading a pattern that refers to one of the examples.
Medical imaging
PET (Positron Emission Tomography) scans provide an important diagnostic tool by allowing physicians to observe how a radioactive substance propagates through a patient's body. Unfortunately, the images formed from the distribution of emitted radiation are of low resolution, due in part to the scattering of the radiation as it passes through the body. It is also difficult to reason from the absolute radiation intensities, because different pathways through the body attenuate the radiation differently.

To solve this problem, models of how radiation propagates through the body are used to correct the images. A common approach is to build a Monte Carlo model, as described by Ljungberg and King [LK98]. Randomly selected points within the body are assumed to emit radiation (usually a gamma ray), and the trajectory of each ray is followed. As a particle (ray) passes through the body, it is attenuated by the different organs it traverses, continuing until the particle leaves the body and hits a camera model, thereby defining a full trajectory. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

This problem can be parallelized in two ways. Because each trajectory is independent, it is possible to parallelize the application by associating each trajectory with a task. This approach is discussed in the Examples section of the Task Decomposition pattern. Another approach would be to partition the body into sections and assign different sections to different processing elements. This approach is discussed in the Examples section of the Data Decomposition pattern.
Linear algebra
Linear algebra is an important tool in applied mathematics: It provides the machinery required to analyze solutions of large systems of linear equations. The classic linear algebra problem asks, for matrix A and vector b, what values for x will solve the equation

Equation 3.1

    A x = b
The matrix A in Eq. 3.1 takes on a central role in linear algebra. Many problems are expressed in terms of transformations of this matrix. These transformations are applied by means of a matrix multiplication

Equation 3.2

    C = T A
If T, A, and C are square matrices of order N, matrix multiplication is defined such that each element of the resulting matrix C is

Equation 3.3

    C_{i,j} = Σ_{k=1}^{N} T_{i,k} A_{k,j}

where the subscripts denote particular elements of the matrices. In other words, the element of the product matrix C in row i and column j is the dot product of the ith row of T and the jth column of A. Hence, computing each of the N² elements of C requires N multiplications and N−1 additions, making the overall complexity of matrix multiplication O(N³).
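In code, Eq. 3.3 is the familiar triply nested loop. The following serial C sketch, assuming row-major storage, is included only to fix the notation, not as an efficient implementation.

/* Direct transcription of Eq. 3.3: C[i][j] is the dot product of
 * row i of T and column j of A. Matrices are square of order n,
 * stored row-major in one-dimensional arrays. */
void matmul(int n, const double *T, const double *A, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)   /* N multiplies, N-1 adds */
                sum += T[i*n + k] * A[k*n + j];
            C[i*n + j] = sum;             /* N^2 elements: O(N^3) total */
        }
}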
There are many ways to parallelize a matrix multiplication operation. It can be parallelized using either a task-based decomposition (as discussed in the Examples section of the Task Decomposition pattern) or a data-based decomposition (as discussed in the Examples section of the Data Decomposition pattern).
Molecular dynamics
Molecular dynamics is used to simulate the motions of a large molecular system. For example, molecular dynamics simulations show how a large protein moves around and how differently shaped drugs might interact with the protein. Not surprisingly, molecular dynamics is extremely important in the pharmaceutical industry. It is also a useful test problem for computer scientists working on parallel computing: It is straightforward to understand, relevant to science at large, and difficult to parallelize effectively. As a result, it has been the subject of much research [Mat94, PH95, Pli95].

The basic idea is to treat a molecule as a large collection of balls connected by springs. The balls represent the atoms in the molecule, while the springs represent the chemical bonds between the atoms. The molecular dynamics simulation itself is an explicit time-stepping process. At each time step, the force on each atom is computed, and then standard classical mechanics techniques are used to compute how the force moves the atoms. This process is carried out repeatedly to step through time and compute a trajectory for the molecular system.

The forces due to the chemical bonds (the "springs") are relatively simple to compute. These correspond to the vibrations and rotations of the chemical bonds themselves. These are short-range forces that can be computed with knowledge of the handful of atoms that share chemical bonds. The major difficulty arises because the atoms have partial electrical charges. Hence, while atoms only interact with a small neighborhood of atoms through their chemical bonds, the electrical charges cause every atom to apply a force on every other atom.

This is the famous N-body problem. On the order of N² terms must be computed to find these nonbonded forces. Because N is large (tens or hundreds of thousands) and the number of time steps in a simulation is huge (tens of thousands), the time required to compute these nonbonded forces dominates the computation. Several ways have been proposed to reduce the effort required to solve the N-body problem. We are only going to discuss the simplest one: the cutoff method.

The idea is simple. Even though each atom exerts a force on every other atom, this force decreases with the square of the distance between the atoms. Hence, it should be possible to pick a distance beyond which the force contribution is so small that it can be ignored. By ignoring the atoms that exceed this cutoff, the problem is reduced to one that scales as O(N × n), where n is the number of atoms within the cutoff volume, usually hundreds. The computation is still huge, and it dominates the overall runtime for the simulation, but at least the problem is tractable.

There are a host of details, but the basic simulation can be summarized as in Fig. 3.2.

The primary data structures hold the atomic positions (atoms), the velocities of each atom (velocities), the forces exerted on each atom (forces), and lists of atoms within the cutoff distance of each atom (neighbors). The program itself is a time-stepping loop, in which each iteration computes the short-range force terms, updates the neighbor lists, and then finds the nonbonded forces. After the force on each atom has been computed, a simple ordinary differential equation is solved to update the positions and velocities. Physical properties based on atomic motions are then updated, and we go to the next time step.

There are many ways to parallelize the molecular dynamics problem. We consider the most common approach, starting with the task decomposition (discussed in the Task Decomposition pattern) and following with the associated data decomposition (discussed in the Data Decomposition pattern). This example shows how the two decompositions fit together to guide the design of the parallel algorithm.
Figure 3.2. Pseudocode for the molecular dynamics example
Int const N // number of atoms
Array of Real :: atoms (3,N)      //3D coordinates
Array of Real :: velocities (3,N) //velocity vectors
Array of Real :: forces (3,N)     //force in each dimension
Array of List :: neighbors(N)     //atoms in cutoff volume

loop over time steps
   vibrational_forces (N, atoms, forces)
   rotational_forces (N, atoms, forces)
   neighbor_list (N, atoms, neighbors)
   non_bonded_forces (N, atoms, neighbors, forces)
   update_atom_positions_and_velocities (N, atoms, velocities, forces)
   physical_properties ( ... Lots of stuff ... )
end loop
3.2. THE TASK DECOMPOSITION PATTERN
Problem

How can a problem be decomposed into tasks that can execute concurrently?
Context

Every parallel algorithm design starts from the same point, namely a good understanding of the problem being solved. The programmer must understand which parts of the problem are most computationally intensive, what the key data structures are, and how the data is used as the problem's solution unfolds.
The next step is to define the tasks that make up the problem and the data decomposition implied by the tasks. Fundamentally, every parallel algorithm involves a collection of tasks that can execute concurrently. The challenge is to find these tasks and craft an algorithm that lets them run concurrently.

In some cases, the problem will naturally break down into a collection of independent (or nearly independent) tasks, and it is easiest to start with a task-based decomposition. In other cases, the tasks are difficult to isolate and the decomposition of the data (as discussed in the Data Decomposition pattern) is a better starting point. It is not always clear which approach is best, and often the algorithm designer needs to consider both.

Regardless of whether the starting point is a task-based or a data-based decomposition, however, a parallel algorithm ultimately needs tasks that will execute concurrently, so these tasks must be identified.
Forces

The main forces influencing the design at this point are flexibility, efficiency, and simplicity.

Flexibility. Flexibility in the design will allow it to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.

Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization). For a task decomposition, this means we need enough tasks to keep all the PEs busy, with enough work per task to compensate for overhead incurred to manage dependencies. However, the drive for efficiency can lead to complex decompositions that lack flexibility.

Simplicity. The task decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.

Solution

The key to an effective task decomposition is to ensure that the tasks are sufficiently independent so that managing dependencies takes only a small fraction of the program's overall execution time. It is also important to ensure that the execution of the tasks can be evenly distributed among the ensemble of PEs (the load-balancing problem).

In an ideal world, the compiler would find the tasks for the programmer. Unfortunately, this almost never happens. Instead, it must usually be done by hand based on knowledge of the problem and the code required to solve it. In some cases, it might be necessary to completely recast the problem into a form that exposes relatively independent tasks.
In a task-based decomposition, we look at the problem as a collection of distinct tasks, paying particular attention to

The actions that are carried out to solve the problem. (Are there enough of them to keep the processing elements on the target machines busy?)

Whether these actions are distinct and relatively independent.

As a first pass, we try to identify as many tasks as possible; it is much easier to start with too many tasks and merge them later on than to start with too few tasks and later try to split them.

Tasks can be found in many different places.

In some cases, each task corresponds to a distinct call to a function. Defining a task for each function call leads to what is sometimes called a functional decomposition.
Another place to find tasks is in distinct iterations of the loops within an algorithm. If the iterations are independent and there are enough of them, then it might work well to base a task decomposition on mapping each iteration onto a task. This style of task-based decomposition leads to what are sometimes called loop-splitting algorithms (a minimal sketch appears after this list).
Tasks also play a key role in data-driven decompositions. In this case, a large data structure is decomposed and multiple units of execution concurrently update different chunks of the data structure. In this case, the tasks are those updates on individual chunks.
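The loop-splitting idea can be made concrete with OpenMP. In this minimal sketch, process_element() is a hypothetical stand-in for whatever independent work one iteration performs; the pragma turns the iterations into tasks that the runtime maps onto threads.

#include <omp.h>

void process_element(int i);   /* hypothetical per-iteration work */

void loop_split(int n)
{
    /* Valid only if the iterations are independent of one another. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        process_element(i);
}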
Also keep in mind the forces given in the Forces section:

Flexibility. The design needs to be flexible in the number of tasks generated. Usually this is done by parameterizing the number and size of tasks on some appropriate dimension. This will let the design be adapted to a wide range of parallel computers with different numbers of processors.

Efficiency. There are two major efficiency issues to consider in the task decomposition. First, each task must include enough work to compensate for the overhead incurred by creating the tasks and managing their dependencies. Second, the number of tasks should be large enough so that all the units of execution are busy with useful work throughout the computation.

Simplicity. Tasks should be defined in a way that makes debugging and maintenance simple. When possible, tasks should be defined so they reuse code from existing sequential programs that solve related problems.

After the tasks have been identified, the next step is to look at the data decomposition implied by the tasks. The Data Decomposition pattern may help with this analysis.
Examples
Medical imaging
Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

It is natural to associate a task with each trajectory. These tasks are particularly simple to manage concurrently because they are completely independent. Furthermore, there are large numbers of trajectories, so there will be many tasks, making this decomposition suitable for a large range of computer systems, from a shared-memory system with a small number of processing elements to a large cluster with hundreds of processing elements.
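A minimal OpenMP sketch of this one-task-per-trajectory decomposition follows. BodyModel and follow_trajectory() are hypothetical placeholders for the application's own data structures and physics; the essential properties are that the iterations are independent and the body model is shared read-only.

#include <omp.h>

typedef struct BodyModel BodyModel;   /* hypothetical read-only model */
void follow_trajectory(const BodyModel *body, long seed);

void simulate(const BodyModel *body, long num_trajectories)
{
    /* One task per trajectory; dynamic scheduling balances the load
     * because trajectory lengths vary. Each task gets its own seed so
     * the random streams are independent. */
    #pragma omp parallel for schedule(dynamic)
    for (long t = 0; t < num_trajectories; t++)
        follow_trajectory(body, t);
}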
With the basic tasks defined, we now consider the corresponding data decomposition; that is, we define the data associated with each task. Each task needs to hold the information defining the trajectory. But that is not all: The tasks need access to the model of the body as well. Although it might not be apparent from our description of the problem, the body model can be extremely large. Because it is a read-only model, this is no problem if there is an effective shared-memory system; each task can read data as needed. If the target platform is based on a distributed-memory architecture, however, the body model will need to be replicated on each PE. This can be very time-consuming and can waste a great deal of memory. For systems with small memories per PE and/or with slow networks between PEs, a decomposition of the problem based on the body model might be more effective.
This is a common situation in parallel programming: Many problems can be decomposed primarily in terms of data or primarily in terms of tasks. If a task-based decomposition avoids the need to break up and distribute complex data structures, it will be a much simpler program to write and debug. On the other hand, if memory and/or network bandwidth is a limiting factor, a decomposition that focuses on the data might be more effective. It is not so much a matter of one approach being "better" than another as a matter of balancing the needs of the machine with the needs of the programmer. We discuss this in more detail in the Data Decomposition pattern.
Matrix multiplication
Consider the multiplication of two matrices (C = AB), as described in Sec. 3.1.3. We can produce a task-based decomposition of this problem by considering the calculation of each element of the product matrix as a separate task. Each task needs access to one row of A and one column of B. This decomposition has the advantage that all the tasks are independent, and because all the data that is shared among tasks (A and B) is read-only, it will be straightforward to implement in a shared-memory environment.

The performance of this algorithm, however, would be poor. Consider the case where the three matrices are square and of order N. For each element of C, N elements from A and N elements from B would be required, resulting in 2N memory references for N multiply/add operations. Memory access time is slow compared to floating-point arithmetic, so the bandwidth of the memory subsystem would limit the performance.
A better approach would be to design an algorithm that maximizes reuse of data loaded into a processor's caches. We can arrive at this algorithm in two different ways. First, we could group together the element-wise tasks we defined earlier so the tasks that use similar elements of the A and B matrices run on the same UE (see the Group Tasks pattern). Alternatively, we could start with the data decomposition and design the algorithm from the beginning around the way the matrices fit into the caches. We discuss this example further in the Examples section of the Data Decomposition pattern.
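One plausible outcome of grouping the element-wise tasks is a blocked multiplication, sketched below with OpenMP for C = AB: the tasks become BS×BS blocks of C rather than single elements, so each block of A and B loaded into cache is reused many times. The block size BS is an assumed tuning parameter; for brevity the sketch assumes n is divisible by BS and that C is zero-initialized.

#include <omp.h>

#define BS 64   /* assumed block size; tune for the target cache */

void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    /* Each (ii,jj) block of C is updated by exactly one thread,
     * so the accumulation below is race-free. */
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++) {
                        double sum = C[i*n + j];
                        for (int k = kk; k < kk + BS; k++)
                            sum += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = sum;
                    }
}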
Molecular dynamics
Consider the molecular dynamics problem described in Sec. 3.1.3. Pseudocode for this example is shown again in Fig. 3.3.

Before performing the task decomposition, we need to better understand some details of the problem. First, the neighbor_list() computation is time-consuming. The gist of the computation is a loop over each atom, inside of which every other atom is checked to determine whether it falls within the indicated cutoff volume. Fortunately, the time steps are very small, and the atoms don't move very much in any given time step. Hence, this time-consuming computation is only carried out every 10 to 100 steps.
Figure 3.3. Pseudocode for the molecular dynamics example
Int const N // number of atoms
Array of Real :: atoms (3,N)      //3D coordinates
Array of Real :: velocities (3,N) //velocity vectors
Array of Real :: forces (3,N)     //force in each dimension
Array of List :: neighbors(N)     //atoms in cutoff volume

loop over time steps
   vibrational_forces (N, atoms, forces)
   rotational_forces (N, atoms, forces)
   neighbor_list (N, atoms, neighbors)
   non_bonded_forces (N, atoms, neighbors, forces)
   update_atom_positions_and_velocities (N, atoms, velocities, forces)
   physical_properties ( ... Lots of stuff ... )
end loop
Second, the physical_properties() function computes energies, correlation coefficients, and a host of interesting physical properties. These computations, however, are simple and do not significantly affect the program's overall runtime, so we will ignore them in this discussion.
Because the bulk of the computation time will be in non_bonded_forces(), we must pick a problem decomposition that makes that computation run efficiently in parallel. The problem is made easier by the fact that each of the functions inside the time loop has a similar structure: In the sequential version, each function includes a loop over atoms to compute contributions to the force vector. Thus, a natural task definition is the update required by each atom, which corresponds to a loop iteration in the sequential version. After performing the task decomposition, therefore, we obtain the following tasks (a sketch of the nonbonded-force case follows the list).
Tasks that find the vibrational forces on an atom

Tasks that find the rotational forces on an atom

Tasks that find the nonbonded forces on an atom

Tasks that update the position and velocity of an atom

A task to update the neighbor list for all the atoms (which we will leave sequential)
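As an illustration of the per-atom task structure, here is a minimal OpenMP sketch of the nonbonded-force loop. The data layout and add_pair_force() are hypothetical; the key property is that iteration i writes only forces[i], so the iterations are independent (exploiting Newton's third law to halve the work would reintroduce a dependency between tasks).

#include <omp.h>

/* Hypothetical helper: add the force atom j exerts on atom i into fi. */
void add_pair_force(const double xi[3], const double xj[3], double fi[3]);

void non_bonded_forces(int n, const double (*atoms)[3],
                       int *const *neighbors, const int *num_neighbors,
                       double (*forces)[3])
{
    /* One task per atom; dynamic scheduling helps balance the load,
     * since neighbor counts vary from atom to atom. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++)
        for (int m = 0; m < num_neighbors[i]; m++) {
            int j = neighbors[i][m];
            add_pair_force(atoms[i], atoms[j], forces[i]);
        }
}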
With our collection of tasks in hand, we can consider the accompanying data decomposition. The key data structures are the neighbor list, the atomic coordinates, the atomic velocities, and the force vector. Every iteration that updates the force vector needs the coordinates of a neighborhood of atoms. The computation of nonbonded forces, however, potentially needs the coordinates of all the atoms, because the molecule being simulated might fold back on itself in unpredictable ways. We will use this information to carry out the data decomposition (in the Data Decomposition pattern) and the data-sharing analysis (in the Data Sharing pattern).
Known uses
Task-based decompositions are extremely common in parallel computing. For example, the distance geometry code DGEOM [Mat96] uses a task-based decomposition, as does the parallel WESDYN molecular dynamics program [MR95].
3.3. THE DATA DECOMPOSITION PATTERN
Problem

How can a problem's data be decomposed into units that can be operated on relatively independently?

Context

The parallel algorithm designer must have a detailed understanding of the problem being solved. In addition, the designer should identify the most computationally intensive parts of the problem, the key data structures required to solve the problem, and how data is used as the problem's solution unfolds.

After the basic problem is understood, the parallel algorithm designer should consider the tasks that make up the problem and the data decomposition implied by the tasks. Both the task and data decompositions need to be addressed to create a parallel algorithm. The question is not which decomposition to do. The question is which one to start with. A data-based decomposition is a good starting point if the following is true.

The most computationally intensive part of the problem is organized around the manipulation of a large data structure.

Similar operations are being applied to different parts of the data structure, in such a way that the different parts can be operated on relatively independently.

For example, many linear algebra problems update large matrices, applying a similar set of operations to each element of the matrix. In these cases, it is straightforward to drive the parallel algorithm design by looking at how the matrix can be broken up into blocks that are updated concurrently. The task definitions then follow from how the blocks are defined and mapped onto the processing elements of the parallel computer.
Forces

The main forces influencing the design at this point are flexibility, efficiency, and simplicity.

Flexibility. Flexibility will allow the design to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.

Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization).

Simplicity. The decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.
Solution

In shared-memory programming environments such as OpenMP, the data decomposition will frequently be implied by the task decomposition. In most cases, however, the decomposition will need to be done by hand, because the memory is physically distributed, because data dependencies are too complex without explicitly decomposing the data, or to achieve acceptable efficiency on a NUMA computer.
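To illustrate the first point, here is a minimal OpenMP sketch in which the data decomposition falls out of the task decomposition: with a static schedule, each thread receives a contiguous block of iterations and thus, implicitly, a contiguous chunk of the array. transform() is a hypothetical per-element operation.

#include <omp.h>

double transform(double x);   /* hypothetical per-element operation */

void update(int n, double *a)
{
    /* The tasks are the iterations; the static schedule implicitly
     * decomposes a[] into one contiguous chunk per thread. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = transform(a[i]);
}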
If a task-based decomposition has already been done, the data decomposition is driven by the needs of each task. If well-defined and distinct data can be associated with each task, the decomposition should be simple.
When starting with a data decomposition, however, we need to look not at the tasks, but at the central data structures defining the problem, and consider whether they can be broken down into chunks that can be operated on concurrently. A few common examples include the following.
Array-based computations. Concurrency can be defined in terms of updates of different segments of the array. If the array is multidimensional, it can be decomposed in a variety of ways (rows, columns, or blocks of varying shapes).

Recursive data structures. We can think of, for example, decomposing the parallel update of a large tree data structure by decomposing the data structure into subtrees that can be updated concurrently.

Regardless of the nature of the underlying data structure, if the data decomposition is the primary factor driving the solution to the problem, it serves as the organizing principle of the parallel algorithm.

When considering how to decompose the problem's data structures, keep in mind the competing forces.
Flexibility. The size and number of data chunks should be flexible to support the widest range of parallel systems. One approach is to define chunks whose size and number are controlled by a small number of parameters. These parameters define granularity knobs that can be varied to modify the size of the data chunks to match the needs of the underlying hardware. (Note, however, that many designs are not infinitely adaptable with respect to granularity.)

The easiest place to see the impact of granularity on the data decomposition is in the overhead required to manage dependencies between chunks. The time required to manage dependencies must be small compared to the overall runtime. In a good data decomposition, the dependencies scale at a lower dimension than the computational effort associated with each chunk. For example, in many finite difference programs, the cells at the boundaries between chunks, that is, the surfaces of the chunks, must be shared. The size of the set of dependent cells scales as the surface area, while the effort required in the computation scales as the volume of the chunk. This means that the computational effort can be scaled (based on the chunk's volume) to offset overheads associated with data dependencies (based on the surface area of the chunk).
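To make the scaling argument concrete, consider a cubic chunk of side n cells (the cubic shape is an illustrative assumption):

    surface / volume = 6n² / n³ = 6/n

so enlarging the chunks drives the ratio of dependency overhead to useful computation toward zero.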
Efficiency. It is important that the data chunks be large enough that the amount of work to update the chunk offsets the overhead of managing dependencies. A more subtle issue to consider is how the chunks map onto UEs. An effective parallel algorithm must balance the load between UEs. If this isn't done well, some PEs might have a disproportionate amount of work, and the overall scalability will suffer. This may require clever ways to break up the problem. For example, if the problem clears the columns in a matrix from left to right, a column mapping of the matrix will cause problems as the UEs with the leftmost columns will finish their work before the others. A row-based block decomposition, or even a block-cyclic decomposition (in which blocks of rows are assigned cyclically to PEs), would do a much better job of keeping all the processors fully occupied. These issues are discussed in more detail in the Distributed Array pattern.
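The cyclic idea reduces to simple index arithmetic. This sketch, with assumed block size B and UE count P, computes which UE owns a given row under a block-cyclic mapping:

/* Block-cyclic owner computation: rows are dealt out in blocks of B
 * to P UEs in round-robin fashion, so no UE runs out of work early
 * when the computation sweeps through the matrix in index order. */
int owner_of_row(int i, int B, int P)
{
    return (i / B) % P;
}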
Simplicity. Overly complex data decompositions can be very difficult to debug. A data decomposition will usually require a mapping of a global index space onto a task-local index space. Making this mapping abstract allows it to be easily isolated and tested.
After the data has been decomposed, if it has not already been done, the next step is to look at the task decomposition implied by the data. The Task Decomposition pattern may help with this analysis.
Examples
Medical imaging
Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

In a data-based decomposition of this problem, the body model is the large central data structure around which the computation can be organized. The model is broken into segments, and one or more segments are associated with each processing element. The body segments are only read, not written, during the trajectory computations, so there are no data dependencies created by the decomposition of the body model.

After the data has been decomposed, we need to look at the tasks associated with each data segment. In this case, each trajectory passing through the data segment defines a task. The trajectories are initiated and propagated within a segment. When a segment boundary is encountered, the trajectory must be passed between segments. It is this transfer that defines the dependencies between data chunks.

On the other hand, in a task-based approach to this problem (as discussed in the Task Decomposition pattern), the trajectories for each particle drive the algorithm design. Each PE potentially needs to access the full body model to service its set of trajectories. In a shared-memory environment, this is easy because the body model is a read-only data set. In a distributed-memory environment, however, this would require substantial startup overhead as the body model is broadcast across the system.

This is a common situation in parallel programming: Different points of view lead to different algorithms with potentially very different performance characteristics. The task-based algorithm is simple, but it only works if each processing element has access to a large memory and if the overhead incurred loading the data into memory is insignificant compared to the program's runtime. An algorithm driven by a data decomposition, on the other hand, makes efficient use of memory and (in distributed-memory environments) less use of network bandwidth, but it incurs more communication overhead during the concurrent part of the computation and is significantly more complex. Choosing which is the appropriate approach can be difficult and is discussed further in the Design Evaluation pattern.
Matrix multiplication
Consider the standard multiplication of two matrices