"Ifyoubuildit,theywillcome."
And so we built them. Multiprocessor workstations, massively parallel supercomputers, a cluster in every department ... and they haven't come. Programmers haven't come to program these wonderful machines. Oh, a few programmers in love with the challenge have shown that most types of problems can be force-fit onto parallel computers, but general programmers, especially professional programmers who "have lives", ignore parallel computers.

And they do so at their own peril. Parallel computers are going mainstream. Multithreaded microprocessors, multicore CPUs, multiprocessor PCs, clusters, parallel game consoles ... parallel computers are taking over the world of computing. The computer industry is ready to flood the market with hardware that will only run at full speed with parallel programs. But who will write these programs?

This is an old problem. Even in the early 1980s, when the "killer micros" started their assault on traditional vector supercomputers, we worried endlessly about how to attract normal programmers. We tried everything we could think of: high-level hardware abstractions, implicitly parallel programming languages, parallel language extensions, and portable message-passing libraries. But after many years of hard work, the fact of the matter is that "they" didn't come. The overwhelming majority of programmers will not invest the effort to write parallel software.

A common view is that you can't teach old programmers new tricks, so the problem will not be solved until the old programmers fade away and a new generation takes over.

But we don't buy into that defeatist attitude. Programmers have shown a remarkable ability to adopt new software technologies over the years. Look at how many old Fortran programmers are now writing elegant Java programs with sophisticated object-oriented designs. The problem isn't with old programmers. The problem is with old parallel computing experts and the way they've tried to create a pool of capable parallel programmers.

And that's where this book comes in. We want to capture the essence of how expert parallel programmers think about parallel algorithms and communicate that essential understanding in a way professional programmers can readily master. The technology we've adopted to accomplish this task is a pattern language. We made this choice not because we started the project as devotees of design patterns looking for a new field to conquer, but because patterns have been shown to work in ways that would be applicable in parallel programming. For example, patterns have been very effective in the field of object-oriented design. They have provided a common language experts can use to talk about the elements of design and have been extremely effective at helping programmers master object-oriented design.

This book contains our pattern language for parallel programming. The book opens with a couple of chapters to introduce the key concepts in parallel computing. These chapters focus on the parallel computing concepts and jargon used in the pattern language as opposed to being an exhaustive introduction to the field.

The pattern language itself is presented in four parts corresponding to the four phases of creating a parallel program:
* Finding Concurrency. The programmer works in the problem domain to identify the available concurrency and expose it for use in the algorithm design.

* Algorithm Structure. The programmer works with high-level structures for organizing a parallel algorithm.

* Supporting Structures. We shift from algorithms to source code and consider how the parallel program will be organized and the techniques used to manage shared data.

* Implementation Mechanisms. The final step is to look at specific software constructs for implementing a parallel program.
The patterns making up these four design spaces are tightly linked. You start at the top (Finding Concurrency), work through the patterns, and by the time you get to the bottom (Implementation Mechanisms), you will have a detailed design for your parallel program.

If the goal is a parallel program, however, you need more than just a parallel algorithm. You also need a programming environment and a notation for expressing the concurrency within the program's source code. Programmers used to be confronted by a large and confusing array of parallel programming environments. Fortunately, over the years the parallel programming community has converged around three programming environments:
* OpenMP. A simple language extension to C, C++, or Fortran for writing parallel programs for shared-memory computers.

* MPI. A message-passing library used on clusters and other distributed-memory computers.

* Java. An object-oriented programming language with language features supporting parallel programming on shared-memory computers and standard class libraries supporting distributed computing.
Many readers will already be familiar with one or more of these programming notations, but for readers completely new to parallel computing, we've included a discussion of these programming environments in the appendixes.

In closing, we have been working for many years on this pattern language. Presenting it as a book so people can start using it is an exciting development for us. But we don't see this as the end of this effort. We expect that others will have their own ideas about new and better patterns for parallel programming. We've assuredly missed some important features that really belong in this pattern language. We embrace change and look forward to engaging with the larger parallel computing community to iterate on this language. Over time, we'll update and improve the pattern language until it truly represents the consensus view of the parallel programming community. Then our real work will begin: using the pattern language to guide the creation of better parallel programming environments and helping people to use these technologies to write parallel software. We won't rest until the day sequential software is rare.
ACKNOWLEDGMENTS
We started working together on this pattern language in 1998. It's been a long and twisted road, starting with a vague idea about a new way to think about parallel algorithms and finishing with this book. We couldn't have done this without a great deal of help.

Mani Chandy, who thought we would make a good team, introduced Tim to Beverly and Berna. The National Science Foundation, Intel Corp., and Trinity University have supported this research at various times over the years. Help with the patterns themselves came from the people at the Pattern Languages of Programs (PLoP) workshops held in Illinois each summer. The format of these workshops and the resulting review process was challenging and sometimes difficult, but without them we would have never finished this pattern language. We would also like to thank the reviewers who carefully read early manuscripts and pointed out countless errors and ways to improve the book.

Finally, we thank our families. Writing a book is hard on the authors, but that is to be expected. What we didn't fully appreciate was how hard it would be on our families. We are grateful to Beverly's family (Daniel and Steve), Tim's family (Noah, August, and Martha), and Berna's family (Billie) for the sacrifices they've made to support this project.
Tim Mattson, Olympia, Washington, April 2004
Beverly Sanders, Gainesville, Florida, April 2004
Berna Massingill, San Antonio, Texas, April 2004
Chapter 1. A Pattern Language for Parallel Programming
  Section 1.1. INTRODUCTION
  Section 1.2. PARALLEL PROGRAMMING
  Section 1.3. DESIGN PATTERNS AND PATTERN LANGUAGES
  Section 1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
Chapter 2. Background and Jargon of Parallel Computing
  Section 2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
  Section 2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
  Section 2.3. PARALLEL PROGRAMMING ENVIRONMENTS
  Section 2.4. THE JARGON OF PARALLEL COMPUTING
  Section 2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
  Section 2.6. COMMUNICATION
  Section 2.7. SUMMARY
Chapter 3. The Finding Concurrency Design Space
  Section 3.1. ABOUT THE DESIGN SPACE
  Section 3.2. THE TASK DECOMPOSITION PATTERN
  Section 3.3. THE DATA DECOMPOSITION PATTERN
  Section 3.4. THE GROUP TASKS PATTERN
  Section 3.5. THE ORDER TASKS PATTERN
  Section 3.6. THE DATA SHARING PATTERN
  Section 3.7. THE DESIGN EVALUATION PATTERN
  Section 3.8. SUMMARY
Chapter 4. The Algorithm Structure Design Space
  Section 4.1. INTRODUCTION
  Section 4.2. CHOOSING AN ALGORITHM STRUCTURE PATTERN
  Section 4.3. EXAMPLES
  Section 4.4. THE TASK PARALLELISM PATTERN
  Section 4.5. THE DIVIDE AND CONQUER PATTERN
  Section 4.6. THE GEOMETRIC DECOMPOSITION PATTERN
  Section 4.7. THE RECURSIVE DATA PATTERN
  Section 4.8. THE PIPELINE PATTERN
  Section 4.9. THE EVENT-BASED COORDINATION PATTERN
Chapter 5. The Supporting Structures Design Space
  Section 5.1. INTRODUCTION
  Section 5.2. FORCES
  Section 5.3. CHOOSING THE PATTERNS
  Section 5.4. THE SPMD PATTERN
  Section 5.5. THE MASTER/WORKER PATTERN
  Section 5.6. THE LOOP PARALLELISM PATTERN
  Section 5.7. THE FORK/JOIN PATTERN
  Section 5.8. THE SHARED DATA PATTERN
  Section 5.9. THE SHARED QUEUE PATTERN
  Section 5.10. THE DISTRIBUTED ARRAY PATTERN
  Section 5.11. OTHER SUPPORTING STRUCTURES
Chapter 6. The Implementation Mechanisms Design Space
  Section 6.1. OVERVIEW
  Section 6.2. UE MANAGEMENT
  Section 6.3. SYNCHRONIZATION
  Section 6.4. COMMUNICATION
  Endnotes
Appendix A: A Brief Introduction to OpenMP
  Section A.1. CORE CONCEPTS
  Section A.2. STRUCTURED BLOCKS AND DIRECTIVE FORMATS
  Section A.3. WORKSHARING
  Section A.4. DATA ENVIRONMENT CLAUSES
  Section A.5. THE OpenMP RUNTIME LIBRARY
  Section A.6. SYNCHRONIZATION
  Section A.7. THE SCHEDULE CLAUSE
  Section A.8. THE REST OF THE LANGUAGE
Appendix B: A Brief Introduction to MPI
  Section B.1. CONCEPTS
  Section B.2. GETTING STARTED
  Section B.3. BASIC POINT-TO-POINT MESSAGE PASSING
  Section B.4. COLLECTIVE OPERATIONS
  Section B.5. ADVANCED POINT-TO-POINT MESSAGE PASSING
  Section B.6. MPI AND FORTRAN
  Section B.7. CONCLUSION
Appendix C: A Brief Introduction to Concurrent Programming in Java
  Section C.1. CREATING THREADS
  Section C.2. ATOMICITY, MEMORY SYNCHRONIZATION, AND THE volatile KEYWORD
  Section C.3. SYNCHRONIZED BLOCKS
  Section C.4. WAIT AND NOTIFY
  Section C.5. LOCKS
  Section C.6. OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
  Section C.7. INTERRUPTS
Glossary
Bibliography
About the Authors
Index
APatternLanguageforParallelProgramming>INTRODUCTION
Chapter 1. A Pattern Language for Parallel Programming
1.1 INTRODUCTION
1.2 PARALLEL PROGRAMMING
1.3 DESIGN PATTERNS AND PATTERN LANGUAGES
1.4 A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
1.1. INTRODUCTION

Computers are used to model physical systems in many fields of science, medicine, and engineering. Modelers, whether trying to predict the weather or render a scene in the next blockbuster movie, can usually use whatever computing power is available to make ever more detailed simulations. Vast amounts of data, whether customer shopping patterns, telemetry data from space, or DNA sequences, require analysis. To deliver the required power, computer designers combine multiple processing elements into a single larger system. These so-called parallel computers run multiple tasks simultaneously and solve bigger problems in less time.

Traditionally, parallel computers were rare and available for only the most critical problems. Since the mid-1990s, however, the availability of parallel computers has changed dramatically. With multithreading support built into the latest microprocessors and the emergence of multiple processor cores on a single silicon die, parallel computers are becoming ubiquitous. Now, almost every university computer science department has at least one parallel computer. Virtually all oil companies, automobile manufacturers, drug development companies, and special effects studios use parallel computing.
For example, in computer animation, rendering is the step where information from the animation files, such as lighting, textures, and shading, is applied to 3D models to generate the 2D image that makes up a frame of the film. Parallel computing is essential to generate the needed number of frames (24 per second) for a feature-length film. Toy Story, the first completely computer-generated feature-length film, released by Pixar in 1995, was processed on a "render farm" consisting of 100 dual-processor machines [PS00]. By 1999, for Toy Story 2, Pixar was using a 1,400-processor system, with the improvement in processing power fully reflected in the improved details in textures, clothing, and atmospheric effects. Monsters, Inc. (2001) used a system of 250 enterprise servers each containing 14 processors, for a total of 3,500 processors. It is interesting that the amount of time required to generate a frame has remained relatively constant; as computing power (both the number of processors and the speed of each processor) has increased, it has been exploited to improve the quality of the animation.
The biological sciences have taken dramatic leaps forward with the availability of DNA sequence information from a variety of organisms, including humans. One approach to sequencing, championed and used with success by Celera Corp., is called the whole genome shotgun algorithm. The idea is to break the genome into small segments, experimentally determine the DNA sequences of the segments, and then use a computer to construct the entire sequence from the segments by finding overlapping areas. The computing facilities used by Celera to sequence the human genome included 150 four-way servers plus a server with 16 processors and 64 GB of memory. The calculation involved 500 million trillion base-to-base comparisons [Ein00].
The SETI@home project [SET, ACK+02] provides a fascinating example of the power of parallel computing. The project seeks evidence of extraterrestrial intelligence by scanning the sky with the world's largest radio telescope, the Arecibo Telescope in Puerto Rico. The collected data is then analyzed for candidate signals that might indicate an intelligent source. The computational task is beyond even the largest supercomputer, and certainly beyond the capabilities of the facilities available to the SETI@home project. The problem is solved with public resource computing, which turns PCs around the world into a huge parallel computer connected by the Internet. Data is broken up into work units and distributed over the Internet to client computers whose owners donate spare computing time to support the project. Each client periodically connects with the SETI@home server, downloads the data to analyze, and then sends the results back to the server. The client program is typically implemented as a screen saver so that it will devote CPU cycles to the SETI problem only when the computer is otherwise idle. A work unit currently requires an average of between seven and eight hours of CPU time on a client. More than 205,000,000 work units have been processed since the start of the project. More recently, similar technology to that demonstrated by SETI@home has been used for a variety of public resource computing projects as well as internal projects within large companies utilizing their idle PCs to solve problems ranging from drug screening to chip design validation.
Although computing in less time is beneficial, and may enable problems to be solved that couldn't be otherwise, it comes at a cost. Writing software to run on parallel computers can be difficult. Only a small minority of programmers have experience with parallel programming. If all these computers designed to exploit parallelism are going to achieve their potential, more programmers need to learn how to write parallel programs.

This book addresses this need by showing competent programmers of sequential machines how to design programs that can run on parallel computers. Although many excellent books show how to use particular parallel programming environments, this book is unique in that it focuses on how to think about and design parallel algorithms. To accomplish this goal, we will be using the concept of a pattern language. This highly structured representation of expert design experience has been heavily used in the object-oriented design community.

The book opens with two introductory chapters. The first gives an overview of the parallel computing landscape and background needed to understand and use the pattern language. This is followed by a more detailed chapter in which we lay out the basic concepts and jargon used by parallel programmers. The book then moves into the pattern language itself.
1.2. PARALLEL PROGRAMMING

The key to parallel computing is exploitable concurrency. Concurrency exists in a computational problem when the problem can be decomposed into subproblems that can safely execute at the same time. To be of any use, however, it must be possible to structure the code to expose and later exploit the concurrency and permit the subproblems to actually run concurrently; that is, the concurrency must be exploitable.

Most large computational problems contain exploitable concurrency. A programmer works with exploitable concurrency by creating a parallel algorithm and implementing the algorithm using a parallel programming environment. When the resulting parallel program is run on a system with multiple processors, the amount of time we have to wait for the results of the computation is reduced. In addition, multiple processors may allow larger problems to be solved than could be done on a single-processor system.

As a simple example, suppose part of a computation involves computing the summation of a large set of values. If multiple processors are available, instead of adding the values together sequentially, the set can be partitioned and the summations of the subsets computed simultaneously, each on a different processor. The partial sums are then combined to get the final answer. Thus, using multiple processors to compute in parallel may allow us to obtain a solution sooner. Also, if each processor has its own memory, partitioning the data between the processors may allow larger problems to be handled than could be handled on a single processor.
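As a concrete illustration, here is a minimal C sketch of the partitioned summation using OpenMP (introduced in Appendix A); the array contents and size are illustrative, not taken from the text. The reduction clause has each thread accumulate a private partial sum and then combines the partial sums into the final result.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double values[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)
            values[i] = 0.001 * i;            /* illustrative data */

        /* Each thread sums a subset of the values; OpenMP combines
           the per-thread partial sums into the final result. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += values[i];

        printf("sum = %f\n", sum);
        return 0;
    }

Compiled with OpenMP support enabled (for example, with a flag such as -fopenmp), the loop iterations are divided among the available processors; without it, the pragma is ignored and the code runs sequentially.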
This simple example shows the essence of parallel computing. The goal is to use multiple processors to solve problems in less time and/or to solve bigger problems than would be possible on a single processor. The programmer's task is to identify the concurrency in the problem, structure the algorithm so that this concurrency can be exploited, and then implement the solution using a suitable programming environment. The final step is to solve the problem by executing the code on a parallel system.
Parallel programming presents unique challenges. Often, the concurrent tasks making up the problem include dependencies that must be identified and correctly managed. The order in which the tasks execute may change the answers of the computations in nondeterministic ways. For example, in the parallel summation described earlier, a partial sum cannot be combined with others until its own computation has completed. The algorithm imposes a partial order on the tasks (that is, they must complete before the sums can be combined). More subtly, the numerical value of the summations may change slightly depending on the order of the operations within the sums because floating-point arithmetic is nonassociative. A good parallel programmer must take care to ensure that nondeterministic issues such as these do not affect the quality of the final answer. Creating safe parallel programs can take considerable effort from the programmer.
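The nonassociativity of floating-point arithmetic is easy to demonstrate directly; the constants below are illustrative, chosen so that single-precision rounding makes the two groupings give different results.

    #include <stdio.h>

    int main(void) {
        float big = 1.0e8f, small = 3.0f;

        /* The grouping determines whether the small terms accumulate
           before being absorbed by the large value. */
        float left  = (big + small) + small;   /* small terms rounded away */
        float right = big + (small + small);   /* small terms survive */

        printf("%.1f vs %.1f\n", left, right); /* the results differ */
        return 0;
    }

In a parallel summation, the grouping is determined by how the set is partitioned and by the order in which partial sums are combined, so different runs can produce slightly different totals.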
Even when a parallel program is "correct", it may fail to deliver the anticipated performance improvement from exploiting concurrency. Care must be taken to ensure that the overhead incurred by managing the concurrency does not overwhelm the program runtime. Also, partitioning the work among the processors in a balanced way is often not as easy as the summation example suggests. The effectiveness of a parallel algorithm depends on how well it maps onto the underlying parallel computer, so a parallel algorithm could be very effective on one parallel architecture and a disaster on another.

We will revisit these issues and provide a more quantitative view of parallel computation in the next chapter.
1.3. DESIGN PATTERNS AND PATTERN LANGUAGES

A design pattern describes a good solution to a recurring problem in a particular context. The pattern follows a prescribed format that includes the pattern name, a description of the context, the forces (goals and constraints), and the solution. The idea is to record the experience of experts in a way that can be used by others facing a similar problem. In addition to the solution itself, the name of the pattern is important and can form the basis for a domain-specific vocabulary that can significantly enhance communication between designers in the same area.

Design patterns were first proposed by Christopher Alexander. The domain was city planning and architecture [AIS77]. Design patterns were originally introduced to the software engineering community by Beck and Cunningham [BC87] and became prominent in the area of object-oriented programming with the publication of the book by Gamma, Helm, Johnson, and Vlissides [GHJV95], affectionately known as the GoF (Gang of Four) book. This book gives a large collection of design patterns for object-oriented programming. To give one example, the Visitor pattern describes a way to structure classes so that the code implementing a heterogeneous data structure can be kept separate from the code to traverse it. Thus, what happens in a traversal depends on both the type of each node and the class that implements the traversal. This allows multiple functionality for data structure traversals, and significant flexibility as new functionality can be added without having to change the data structure class. The patterns in the GoF book have entered the lexicon of object-oriented programming; references to its patterns are found in the academic literature, trade publications, and system documentation. These patterns have by now become part of the expected knowledge of any competent software engineer.

An educational nonprofit organization called the Hillside Group [Hil] was formed in 1993 to promote the use of patterns and pattern languages and, more generally, to improve human communication about computers "by encouraging people to codify common programming and design practice". To develop new patterns and help pattern writers hone their skills, the Hillside Group sponsors an annual Pattern Languages of Programs (PLoP) workshop and several spinoffs in other parts of the world, such as ChiliPLoP (in the western United States), KoalaPLoP (Australia), EuroPLoP (Europe), and MensorePLoP (Japan). The proceedings of these workshops [Pat] provide a rich source of patterns covering a vast range of application domains in software development and have been used as a basis for several books [CS95, VCK96, MRB97, HFR99].

In his original work on patterns, Alexander provided not only a catalog of patterns, but also a pattern language that introduced a new approach to design. In a pattern language, the patterns are organized into a structure that leads the user through the collection of patterns in such a way that complex systems can be designed using the patterns. At each decision point, the designer selects an appropriate pattern. Each pattern leads to other patterns, resulting in a final design in terms of a web of patterns. Thus, a pattern language embodies a design methodology and provides domain-specific advice to the application designer. (In spite of the overlapping terminology, a pattern language is not a programming language.)
1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING

This book describes a pattern language for parallel programming that provides several benefits. The immediate benefits are a way to disseminate the experience of experts by providing a catalog of good solutions to important problems, an expanded vocabulary, and a methodology for the design of parallel programs. We hope to lower the barrier to parallel programming by providing guidance through the entire process of developing a parallel program. The programmer brings to the process a good understanding of the actual problem to be solved and then works through the pattern language, eventually obtaining a detailed parallel design or possibly working code. In the longer term, we hope that this pattern language can provide a basis for both a disciplined approach to the qualitative evaluation of different programming models and the development of parallel programming tools.

The pattern language is organized into four design spaces: Finding Concurrency, Algorithm Structure, Supporting Structures, and Implementation Mechanisms. These form a linear hierarchy, with Finding Concurrency at the top and Implementation Mechanisms at the bottom, as shown in Fig. 1.1.
Figure 1.1. Overview of the pattern language
The Finding Concurrency design space is concerned with structuring the problem to expose exploitable concurrency. The designer working at this level focuses on high-level algorithmic issues and reasons about the problem to expose potential concurrency. The Algorithm Structure design space is concerned with structuring the algorithm to take advantage of potential concurrency. That is, the designer working at this level reasons about how to use the concurrency exposed in working with the Finding Concurrency patterns. The Algorithm Structure patterns describe overall strategies for exploiting concurrency. The Supporting Structures design space represents an intermediate stage between the Algorithm Structure and Implementation Mechanisms design spaces. Two important groups of patterns in this space are those that represent program-structuring approaches and those that represent commonly used shared data structures. The Implementation Mechanisms design space is concerned with how the patterns of the higher-level spaces are mapped into particular programming environments. We use it to provide descriptions of common mechanisms for process/thread management (for example, creating or destroying processes/threads) and process/thread interaction (for example, semaphores, barriers, or message passing). The items in this design space are not presented as patterns because in many cases they map directly onto elements within particular parallel programming environments. They are included in the pattern language anyway, however, to provide a complete path from problem description to code.
Chapter 2. Background and Jargon of Parallel Computing
2.1 CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
2.2 PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
2.3 PARALLEL PROGRAMMING ENVIRONMENTS
2.4 THE JARGON OF PARALLEL COMPUTING
2.5 A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
2.6 COMMUNICATION
2.7 SUMMARY
In this chapter, we give an overview of the parallel programming landscape, and define any specialized parallel computing terminology that we will use in the patterns. Because many terms in computing are overloaded, taking different meanings in different contexts, we suggest that even readers familiar with parallel programming at least skim this chapter.

2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS

Concurrency was first exploited in computing to better utilize or share resources within a computer. Modern operating systems support context switching to allow multiple tasks to appear to execute concurrently, thereby allowing useful work to occur while the processor is stalled on one task. This application of concurrency, for example, allows the processor to stay busy by swapping in a new task to execute while another task is waiting for I/O. By quickly swapping tasks in and out, giving each task a "slice" of the processor time, the operating system can allow multiple users to use the system as if each were using it alone (but with degraded performance).
Most modern operating systems can use multiple processors to increase the throughput of the system. The UNIX shell uses concurrency along with a communication abstraction known as pipes to provide a powerful form of modularity: Commands are written to accept a stream of bytes as input (the consumer) and produce a stream of bytes as output (the producer). Multiple commands can be chained together with a pipe connecting the output of one command to the input of the next, allowing complex commands to be built from simple building blocks. Each command is executed in its own process, with all processes executing concurrently. Because the producer blocks if buffer space in the pipe is not available, and the consumer blocks if data is not available, the job of managing the stream of results moving between commands is greatly simplified. More recently, with operating systems with windows that invite users to do more than one thing at a time, and the Internet, which often introduces I/O delays perceptible to the user, almost every program that contains a GUI incorporates concurrency.

Although the fundamental concepts for safely handling concurrency are the same in parallel programs and operating systems, there are some important differences. For an operating system, the problem is not finding concurrency; the concurrency is inherent in the way the operating system functions in managing a collection of concurrently executing processes (representing users, applications, and background activities such as print spooling) and providing synchronization mechanisms so resources can be safely shared. However, an operating system must support concurrency in a robust and secure way: Processes should not be able to interfere with each other (intentionally or not), and the entire system should not crash if something goes wrong with one process. In a parallel program, finding and exploiting concurrency can be a challenge, while isolating processes from each other is not the critical concern it is with an operating system. Performance goals are different as well. In an operating system, performance goals are normally related to throughput or response time, and it may be acceptable to sacrifice some efficiency to maintain robustness and fairness in resource allocation. In a parallel program, the goal is to minimize the running time of a single program.
2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION

There are dozens of different parallel architectures, among them networks of workstations, clusters of off-the-shelf PCs, massively parallel supercomputers, tightly coupled symmetric multiprocessors, and multiprocessor workstations. In this section, we give an overview of these systems, focusing on the characteristics relevant to the programmer.

2.2.1. Flynn's Taxonomy

By far the most common way to characterize these architectures is Flynn's taxonomy [Fly72]. He categorizes all computers according to the number of instruction streams and data streams they have, where a stream is a sequence of instructions or data on which a computer operates. In Flynn's taxonomy, there are four possibilities: SISD, SIMD, MISD, and MIMD.

Single Instruction, Single Data (SISD). In a SISD system, one stream of instructions processes a single stream of data, as shown in Fig. 2.1. This is the common von Neumann model used in virtually all single-processor computers.
Figure 2.1. The Single Instruction, Single Data (SISD) architecture
Single Instruction, Multiple Data (SIMD). In a SIMD system, a single instruction stream is concurrently broadcast to multiple processors, each with its own data stream (as shown in Fig. 2.2). The original systems from Thinking Machines and MasPar can be classified as SIMD. The CPP DAP Gamma II and Quadrics Apemille are more recent examples; these are typically deployed in specialized applications, such as digital signal processing, that are suited to fine-grained parallelism and require little interprocess communication. Vector processors, which operate on vector data in a pipelined fashion, can also be categorized as SIMD. Exploiting this parallelism is usually done by the compiler.
Figure 2.2. The Single Instruction, Multiple Data (SIMD) architecture
Multiple Instruction, Single Data (MISD). No well-known systems fit this designation. It is mentioned for the sake of completeness.

Multiple Instruction, Multiple Data (MIMD). In a MIMD system, each processing element has its own stream of instructions operating on its own data. This architecture, shown in Fig. 2.3, is the most general of the architectures in that each of the other cases can be mapped onto the MIMD architecture. The vast majority of modern parallel systems fit into this category.
Figure 2.3. The Multiple Instruction, Multiple Data (MIMD) architecture
2.2.2. A Further Breakdown of MIMD

The MIMD category of Flynn's taxonomy is too broad to be useful on its own; this category is typically decomposed according to memory organization.

Shared memory. In a shared-memory system, all processes share a single address space and communicate with each other by writing and reading shared variables.

One class of shared-memory systems is called SMPs (symmetric multiprocessors). As shown in Fig. 2.4, all processors share a connection to a common memory and access all memory locations at equal speeds. SMP systems are arguably the easiest parallel systems to program because programmers do not need to distribute data structures among processors. Because increasing the number of processors increases contention for the memory, the processor/memory bandwidth is typically a limiting factor. Thus, SMP systems do not scale well and are limited to small numbers of processors.
Figure 2.4. The Symmetric Multiprocessor (SMP) architecture
The other main class of shared-memory systems is called NUMA (nonuniform memory access). As shown in Fig. 2.5, the memory is shared; that is, it is uniformly addressable from all processors, but some blocks of memory may be physically more closely associated with some processors than others. This reduces the memory bandwidth bottleneck and allows systems with more processors; however, as a result, the access time from a processor to a memory location can be significantly different depending on how "close" the memory location is to the processor. To mitigate the effects of nonuniform access, each processor has a cache, along with a protocol to keep cache entries coherent. Hence, another name for these architectures is cache-coherent nonuniform memory access systems (ccNUMA). Logically, programming a ccNUMA system is the same as programming an SMP, but to obtain the best performance, the programmer will need to be more careful about locality issues and cache effects.
Figure 2.5. An example of the nonuniform memory access (NUMA) architecture
Distributed memory. In a distributed-memory system, each process has its own address space and communicates with other processes by message passing (sending and receiving messages). A schematic representation of a distributed-memory computer is shown in Fig. 2.6.
Figure 2.6. The distributed-memory architecture
Depending on the topology and technology used for the processor interconnection, communication speed can range from almost as fast as shared memory (in tightly integrated supercomputers) to orders of magnitude slower (for example, in a cluster of PCs interconnected with an Ethernet network). The programmer must explicitly program all the communication between processors and be concerned with the distribution of data.

Distributed-memory computers are traditionally divided into two classes: MPP (massively parallel processors) and clusters. In an MPP, the processors and the network infrastructure are tightly coupled and specialized for use in a parallel computer. These systems are extremely scalable, in some cases supporting the use of many thousands of processors in a single system [MSW96, IBM02].

Clusters are distributed-memory systems composed of off-the-shelf computers connected by an off-the-shelf network. When the computers are PCs running the Linux operating system, these clusters are called Beowulf clusters. As off-the-shelf networking technology improves, systems of this type are becoming more common and much more powerful. Clusters provide an inexpensive way for an organization to obtain parallel computing capabilities [Beo]. Preconfigured clusters are now available from many vendors. One frugal group even reported constructing a useful parallel system by using a cluster to harness the combined power of obsolete PCs that otherwise would have been discarded [HHS01].
Hybrid systems. These systems are clusters of nodes with separate address spaces in which each node contains several processors that share memory.

According to van der Steen and Dongarra's "Overview of Recent Supercomputers" [vdSD03], which contains a brief description of the supercomputers currently or soon to be commercially available, hybrid systems formed from clusters of SMPs connected by a fast network are currently the dominant trend in high-performance computing. For example, in late 2003, four of the five fastest computers in the world were hybrid systems [Top].

Grids. Grids are systems that use distributed, heterogeneous resources connected by LANs and/or WANs [FK03]. Often the interconnection network is the Internet. Grids were originally envisioned as a way to link multiple supercomputers to enable larger problems to be solved, and thus could be viewed as a special type of distributed-memory or hybrid MIMD machine. More recently, the idea of grid computing has evolved into a general way to share heterogeneous resources, such as computation servers, storage, application servers, information services, or even scientific instruments. Grids differ from clusters in that the various resources in the grid need not have a common point of administration. In most cases, the resources on a grid are owned by different organizations that maintain control over the policies governing use of the resources. This affects the way these systems are used, the middleware created to manage them, and most importantly for this discussion, the overhead incurred when communicating between resources within the grid.
2.2.3. Summary

We have classified these systems according to the characteristics of the hardware. These characteristics typically influence the native programming model used to express concurrency on a system; however, this is not always the case. It is possible for a programming environment for a shared-memory machine to provide the programmer with the abstraction of distributed memory and message passing. Virtual distributed shared memory systems contain middleware to provide the opposite: the abstraction of shared memory on a distributed-memory machine.
2.3. PARALLEL PROGRAMMING ENVIRONMENTS

Parallel programming environments provide the basic tools, language features, and application programming interfaces (APIs) needed to construct a parallel program. A programming environment implies a particular abstraction of the computer system called a programming model. Traditional sequential computers use the well-known von Neumann model. Because all sequential computers use this model, software designers can design software to a single abstraction and reasonably expect it to map onto most, if not all, sequential computers.

Unfortunately, there are many possible models for parallel computing, reflecting the different ways processors can be interconnected to construct a parallel system. The most common models are based on one of the widely deployed parallel architectures: shared memory, distributed memory with message passing, or a hybrid combination of the two.

Programming models too closely aligned to a particular parallel system lead to programs that are not portable between parallel computers. Because the effective lifespan of software is longer than that of hardware, many organizations have more than one type of parallel computer, and most programmers insist on programming environments that allow them to write portable parallel programs. Also, explicitly managing large numbers of resources in a parallel computer is difficult, suggesting that higher-level abstractions of the parallel computer might be useful. The result is that as of the mid-1990s, there was a veritable glut of parallel programming environments. A partial list of these is shown in Table 2.1. This created a great deal of confusion for application developers and hindered the adoption of parallel computing for mainstream applications.
Table 2.1. Some Parallel Programming Environments from the Mid-1990s
"C*inC CUMULVS JavaRMI PRIO Quake
ABCPL DAGGER javaPG P3L Quark
ACE DAPPLE JAVAR P4Linda QuickThreads
ACT++ DataParallelC JavaSpaces Pablo Sage++
ADDAP DC++ JIDL PADE SAM
Adl DCE++ Joyce PADRE SCANDAL
Adsmith DDD Karma Panda SCHEDULE
AFAPI DICE Khoros Papers SciTL
ALWAN DIPC KOAN/FortranS Para++ SDDA
AM DistributedSmalltalk
LAM Paradigm SHMEM
AMDC DOLIB Legion Parafrase2 SIMPLE
Amoeba DOME Lilac Paralation Sina
AppLeS DOSMOS Linda Parallaxis SISAL
ARTS DRL LiPS ParallelHaskell
SMI
AthapascanOb DSMThreads Locust ParallelC++ SONiC
Aurora Ease Lparx ParC SplitC
Automap ECO Lucid ParLib++ SR
bb_threads Eilean Maisie ParLin Sthreads
Blaze Emerald Manifold Parlog Strand
BlockComm EPL Mentat Parmacs SUIF
BSP Excalibur MetaChaos Parti SuperPascal
C* Express Midway pC Synergy
C** Falcon Millipede pC++ TCGMSG
C4 Filaments Mirage PCN Telegraphos
CarlOS FLASH Modula2* PCP: TheFORCE
Cashmere FM ModulaP PCU Threads.h++
CC++ Fork MOSIX PEACE TRAPPER
Charlotte FortranM MpC PENNY TreadMarks
Charm FX MPC++ PET UC
Charm++ GA MPI PETSc uC++
Chu GAMMA Multipol PH UNITY
Cid Glenda Munin Phosphorus V
Cilk GLU NanoThreads POET Vic*
CMFortran GUARD NESL Polaris VisifoldVNUS
Code HAsL NetClasses++ POOLT VPE
ConcurrentML HORUS Nexus POOMA Win32threads
Converse HPC Nimrod POSYBL WinPar
COOL HPC++ NOW PRESTO WWWinda
CORRELATE HPF ObjectiveLinda Prospero XENOOPS
CparPar IMPACT Occam Proteus XPC
CPS ISETLLinda Omega PSDM Zounds
CRL ISIS OOF90 PSI ZPL
CSP JADA Orca PVM
Cthreads JADE P++ QPC++
Fortunately, by the late 1990s, the parallel programming community converged predominantly on two environments for parallel programming: OpenMP [OMP] for shared memory and MPI [Mesb] for message passing.
OpenMP is a set of language extensions implemented as compiler directives. Implementations are currently available for Fortran, C, and C++. OpenMP is frequently used to incrementally add parallelism to sequential code. By adding a compiler directive around a loop, for example, the compiler can be instructed to generate code to execute the iterations of the loop in parallel. The compiler takes care of most of the details of thread creation and management. OpenMP programs tend to work very well on SMPs, but because its underlying programming model does not include a notion of nonuniform memory access times, it is less ideal for ccNUMA and distributed-memory machines.
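The incremental style described above looks like the following minimal C sketch; the loop body and array names are illustrative.

    #include <stdio.h>

    #define N 1000

    int main(void) {
        double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* The only change from the sequential version is this directive;
           the compiler generates the code that creates and manages the
           threads and divides the iterations among them. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[%d] = %f\n", N - 1, c[N - 1]);
        return 0;
    }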
MPI is a set of library routines that provide for process management, message passing, and some collective communication operations (these are operations that involve all the processes involved in a program, such as barrier, broadcast, and reduction). MPI programs can be difficult to write because the programmer is responsible for data distribution and explicit interprocess communication using messages. Because the programming model assumes distributed memory, MPI is a good choice for MPPs and other distributed-memory machines.
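A minimal MPI sketch, assuming the standard C bindings, shows the library-call style and one collective operation (a reduction); the per-process value is illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int value = rank * rank;   /* illustrative local result */
        int total = 0;

        /* A collective operation: combine each process's value into
           a single sum delivered to rank 0. */
        MPI_Reduce(&value, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d processes = %d\n", size, total);

        MPI_Finalize();
        return 0;
    }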
Neither OpenMP nor MPI is an ideal fit for hybrid architectures that combine multiprocessor nodes, each with multiple processors and a shared memory, into a larger system with separate address spaces for each node: The OpenMP model does not recognize nonuniform memory access times, so its data allocation can lead to poor performance on machines that are not SMPs, while MPI does not include constructs to manage data structures residing in a shared memory. One solution is a hybrid model in which OpenMP is used on each shared-memory node and MPI is used between the nodes. This works well, but it requires the programmer to work with two different programming models within a single program. Another option is to use MPI on both the shared-memory and distributed-memory portions of the algorithm and give up the advantages of a shared-memory programming model, even when the hardware directly supports it.
New high-level programming environments that simplify portable parallel programming and more accurately reflect the underlying parallel architectures are topics of current research [Cen]. Another approach, more popular in the commercial sector, is to extend MPI and OpenMP. In the mid-1990s, the MPI Forum defined an extended MPI called MPI 2.0, although implementations are not widely available at the time this was written. It is a large, complex extension to MPI that includes dynamic process creation, parallel I/O, and many other features. Of particular interest to programmers of modern hybrid architectures is the inclusion of one-sided communication. One-sided communication mimics some of the features of a shared-memory system by letting one process write into or read from the memory regions of other processes. The term "one-sided" refers to the fact that the read or write is launched by the initiating process without the explicit involvement of the other participating process. A more sophisticated abstraction of one-sided communication is available as part of the Global Arrays [NHL96, NHK+02, Gloa] package. Global Arrays works together with MPI to help a programmer manage distributed array data. After the programmer defines the array and how it is laid out in memory, the program executes "puts" or "gets" into the array without needing to explicitly manage which MPI process "owns" the particular section of the array. In essence, the global array provides an abstraction of a globally shared array. This only works for arrays, but these are such common data structures in parallel computing that this package, although limited, can be very useful.
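As a sketch of MPI 2.0 one-sided communication (assuming an MPI implementation that provides the one-sided routines, and run with at least two processes), rank 0 writes directly into a window of memory exposed by rank 1, with no matching receive on rank 1's side:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        int buf = 0;                 /* window memory exposed to other ranks */
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Win_create(&buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);       /* opens an access epoch */
        if (rank == 0) {
            int msg = 42;
            /* Rank 0 writes into rank 1's window; rank 1 takes no
               explicit part in the transfer. */
            MPI_Put(&msg, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);       /* completes the transfer */

        if (rank == 1) printf("received %d\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }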
Just as MPI has been extended to mimic some of the benefits of a shared-memory environment, OpenMP has been extended to run in distributed-memory environments. The annual WOMPAT (Workshop on OpenMP Applications and Tools) workshops contain many papers discussing various approaches and experiences with OpenMP in clusters and ccNUMA environments.
MPI is implemented as a library of routines to be called from programs written in a sequential programming language, whereas OpenMP is a set of extensions to sequential programming languages. They represent two of the possible categories of parallel programming environments (libraries and language extensions), and these two particular environments account for the overwhelming majority of parallel computing being done today. There is, however, one more category of parallel programming environments, namely languages with built-in features to support parallel programming. Java is such a language. Rather than being designed to support high-performance computing, Java is an object-oriented, general-purpose programming environment with features for explicitly specifying concurrent processing with shared memory. In addition, the standard I/O and network packages provide classes that make it easy for Java to perform interprocess communication between machines, thus making it possible to write programs based on both the shared-memory and the distributed-memory models. The newer java.nio packages support I/O in a way that is less convenient for the programmer, but gives significantly better performance, and Java 2 1.5 includes new support for concurrent programming, most significantly in the java.util.concurrent.* packages. Additional packages that support different approaches to parallel computing are widely available.

Although there have been other general-purpose languages, both prior to Java and more recent (for example, C#), that contained constructs for specifying concurrency, Java is the first to become widely used. As a result, it may be the first exposure for many programmers to concurrent and parallel programming. Although Java provides software engineering benefits, currently the performance of parallel Java programs cannot compete with OpenMP or MPI programs for typical scientific computing applications. The Java design has also been criticized for several deficiencies that matter in this domain (for example, a floating-point model that emphasizes portability and more reproducible results over exploiting the available floating-point hardware to the fullest, inefficient handling of arrays, and lack of a lightweight mechanism to handle complex numbers). The performance difference between Java and other alternatives can be expected to decrease, especially for symbolic or other nonnumeric problems, as compiler technology for Java improves and as new packages and language extensions become available. The Titanium project [Tita] is an example of a Java dialect designed for high-performance computing in a ccNUMA environment.

For the purposes of this book, we have chosen OpenMP, MPI, and Java as the three environments we will use in our examples: OpenMP and MPI for their popularity, and Java because it is likely to be many programmers' first exposure to concurrent programming. A brief overview of each can be found in the appendixes.
2.4. THE JARGON OF PARALLEL COMPUTING

In this section, we define some terms that are frequently used throughout the pattern language. Additional definitions can be found in the glossary.

Task. The first step in designing a parallel program is to break the problem up into tasks. A task is a sequence of instructions that operate together as a group. This group corresponds to some logical part of an algorithm or program. For example, consider the multiplication of two order-N matrices. Depending on how we construct the algorithm, the tasks could be (1) the multiplication of subblocks of the matrices, (2) inner products between rows and columns of the matrices, or (3) individual iterations of the loops involved in the matrix multiplication. These are all legitimate ways to define tasks for matrix multiplication; that is, the task definition follows from the way the algorithm designer thinks about the problem.
Unit of execution (UE). To be executed, a task needs to be mapped to a UE such as a process or thread. A process is a collection of resources that enables the execution of program instructions. These resources can include virtual memory, I/O descriptors, a runtime stack, signal handlers, user and group IDs, and access control tokens. A more high-level view is that a process is a "heavyweight" unit of execution with its own address space. A thread is the fundamental UE in modern operating systems. A thread is associated with a process and shares the process's environment. This makes threads lightweight (that is, a context switch between threads takes only a small amount of time). A more high-level view is that a thread is a "lightweight" UE that shares an address space with other threads.

We will use unit of execution or UE as a generic term for one of a collection of possibly concurrently executing entities, usually either processes or threads. This is convenient in the early stages of program design when the distinctions between processes and threads are less important.

Processing element (PE). We use the term processing element (PE) as a generic term for a hardware element that executes a stream of instructions. The unit of hardware considered to be a PE depends on the context. For example, some programming environments view each workstation in a cluster of SMP workstations as executing a single instruction stream; in this situation, the PE would be the workstation. A different programming environment running on the same hardware, however, might view each processor of each workstation as executing an individual instruction stream; in this case, the PE is the individual processor, and each workstation contains several PEs.
Load balance and load balancing. To execute a parallel program, the tasks must be mapped to UEs, and the UEs to PEs. How the mappings are done can have a significant impact on the overall performance of a parallel algorithm. It is crucial to avoid the situation in which a subset of the PEs is doing most of the work while others are idle. Load balance refers to how well the work is distributed among PEs. Load balancing is the process of allocating work to PEs, either statically or dynamically, so that the work is distributed as evenly as possible.
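In OpenMP, for example, the choice between static and dynamic allocation can be expressed with a schedule clause. The sketch below assumes iterations with varying amounts of work; the POSIX usleep call stands in for real computation.

    #include <stdio.h>
    #include <unistd.h>

    #define N 64

    /* Work per iteration varies, so dividing the iterations evenly
       among threads in advance would leave some threads idle. */
    static void do_task(int i) { usleep(1000 * (i % 8)); }

    int main(void) {
        /* schedule(dynamic) hands out iterations as threads become
           free, balancing the load at runtime. */
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < N; i++)
            do_task(i);

        printf("all %d tasks done\n", N);
        return 0;
    }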
Synchronization. In a parallel program, due to the nondeterminism of task scheduling and other factors, events in the computation might not always occur in the same order. For example, in one run, a task might read variable x before another task reads variable y; in the next run with the same input, the events might occur in the opposite order. In many cases, the order in which two events occur does not matter. In other situations, the order does matter, and to ensure that the program is correct, the programmer must introduce synchronization to enforce the necessary ordering constraints. The primitives provided for this purpose in our selected environments are discussed in the Implementation Mechanisms design space (Section 6.3).

Synchronous versus asynchronous. We use these two terms to qualitatively refer to how tightly coupled in time two events are. If two events must happen at the same time, they are synchronous; otherwise they are asynchronous. For example, message passing (that is, communication between UEs by sending and receiving messages) is synchronous if a message sent must be received before the sender can continue. Message passing is asynchronous if the sender can continue its computation regardless of what happens at the receiver, or if the receiver can continue computations while waiting for a receive to complete.
Race conditions. A race condition is a kind of error peculiar to parallel programs. It occurs when the outcome of a program changes as the relative scheduling of UEs varies. Because the operating system and not the programmer controls the scheduling of the UEs, race conditions result in programs that potentially give different answers even when run on the same system with the same data. Race conditions are particularly difficult errors to debug because by their nature they cannot be reliably reproduced. Testing helps, but is not as effective as with sequential programs: A program may run correctly the first thousand times and then fail catastrophically on the thousand-and-first execution, and then run again correctly when the programmer attempts to reproduce the error as the first step in debugging.

Race conditions result from errors in synchronization. If multiple UEs read and write shared variables, the programmer must protect access to these shared variables so the reads and writes occur in a valid order regardless of how the tasks are interleaved. When many variables are shared or when they are accessed through multiple levels of indirection, verifying by inspection that no race conditions exist can be very difficult. Tools are available that help detect and fix race conditions, such as Thread Checker from Intel Corporation, and the problem remains an area of active and important research [NM92].
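A classic shared-counter race, sketched here in C with OpenMP, shows both the error and one possible fix; the iteration count is illustrative.

    #include <stdio.h>

    int main(void) {
        int counter = 0;

        /* RACE: the read and the write of counter can interleave
           between threads, so some updates are silently lost. */
        #pragma omp parallel for
        for (int i = 0; i < 100000; i++)
            counter++;                       /* unprotected shared update */
        printf("racy result:   %d\n", counter);   /* often < 100000 */

        counter = 0;
        #pragma omp parallel for
        for (int i = 0; i < 100000; i++) {
            #pragma omp atomic               /* makes each update indivisible */
            counter++;
        }
        printf("atomic result: %d\n", counter);   /* always 100000 */
        return 0;
    }

The racy version may well produce the correct answer on many runs, which is exactly why such errors are hard to find by testing.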
Deadlocks. Deadlocks are another type of error peculiar to parallel programs. A deadlock occurs when there is a cycle of tasks in which each task is blocked waiting for another to proceed. Because all are waiting for another task to do something, they will all be blocked forever. As a simple example, consider two tasks in a message-passing environment. Task A attempts to receive a message from task B, after which A will reply by sending a message of its own to task B. Meanwhile, task B attempts to receive a message from task A, after which B will send a message to A. Because each task is waiting for the other to send it a message first, both tasks will be blocked forever. Fortunately, deadlocks are not difficult to discover, as the tasks will stop at the point of the deadlock.
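The two-task example translates directly into MPI; this sketch (run with exactly two processes) hangs at the blocking receives, since each rank waits for a message the other has not yet sent.

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, other, buf = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;                    /* the one peer process */

        /* Both ranks block here forever: a cycle of tasks, each
           waiting for the other to proceed. */
        MPI_Recv(&buf, 1, MPI_INT, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&rank, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Reversing the send/receive order on one of the two ranks is one way to break the cycle.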
2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION

The two main reasons for implementing a parallel program are to obtain better performance and to solve larger problems. Performance can be both modeled and measured, so in this section we will take another look at parallel computation by giving some simple analytical models that illustrate some of the factors that influence the performance of a parallel program.
Consider a computation consisting of three parts: a setup section, a computation section, and a finalization section. The total running time of this program on one PE is then given as the sum of the times for the three parts.

Equation 2.1

    T_{total}(1) = T_{setup} + T_{compute}(1) + T_{finalization}

What happens when we run this computation on a parallel computer with multiple PEs? Suppose that the setup and finalization sections cannot be carried out concurrently with any other activities, but that the computation section could be divided into tasks that would run independently on as many PEs as are available, with the same total number of computation steps as in the original computation. The time for the full computation on P PEs can therefore be given by

Equation 2.2

    T_{total}(P) = T_{setup} + \frac{T_{compute}(1)}{P} + T_{finalization}

Of course, Eq. 2.2 describes a very idealized situation. However, the idea that computations have a serial part (for which additional PEs are useless) and a parallelizable part (for which more PEs decrease the running time) is realistic. Thus, this simple model captures an important relationship.
An important measure of how much additional PEs help is the relative speedup S, which describes how much faster a problem runs in a way that normalizes away the actual running time.

Equation 2.3

    S(P) = \frac{T_{total}(1)}{T_{total}(P)}
A related measure is the efficiency E, which is the speedup normalized by the number of PEs.

Equation 2.4

    E(P) = \frac{S(P)}{P}

Equation 2.5

    E(P) = \frac{T_{total}(1)}{P \, T_{total}(P)}
Ideally, we would want the speedup to be equal to P, the number of PEs. This is sometimes called perfect linear speedup. Unfortunately, this is an ideal that can rarely be achieved because times for setup and finalization are not improved by adding more PEs, limiting the speedup. The terms that cannot be run concurrently are called the serial terms. Their running times represent some fraction of the total, called the serial fraction, denoted γ.

Equation 2.6

    \gamma = \frac{T_{setup} + T_{finalization}}{T_{total}(1)}
The fraction of time spent in the parallelizable part of the program is then (1 - γ). We can thus rewrite the expression for total computation time with P PEs as

Equation 2.7

    T_{total}(P) = \gamma \, T_{total}(1) + \frac{(1 - \gamma) \, T_{total}(1)}{P}
Now, rewriting S in terms of the new expression for T_{total}(P), we obtain the famous Amdahl's law:

Equation 2.8

    S(P) = \frac{T_{total}(1)}{\gamma \, T_{total}(1) + \frac{(1 - \gamma) \, T_{total}(1)}{P}}

Equation 2.9

    S(P) = \frac{1}{\gamma + \frac{1 - \gamma}{P}}
Thus, in an ideal parallel algorithm with no overhead in the parallel part, the speedup should follow Eq. 2.9. What happens to the speedup if we take our ideal parallel algorithm and use a very large number of processors? Taking the limit as P goes to infinity in our expression for S yields

Equation 2.10

    \lim_{P \to \infty} S(P) = \frac{1}{\gamma}
Eq. 2.10 thus gives an upper bound on the speedup obtainable in an algorithm whose serial part represents γ of the total computation. These concepts are vital to the parallel algorithm designer. In designing a parallel algorithm, it is important to understand the value of the serial fraction so that realistic expectations can be set for performance. It may not make sense to implement a complex, arbitrarily scalable parallel algorithm if 10% or more of the algorithm is serial, and 10% is fairly common.
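To make the bound concrete, take the serial fraction of 10 percent mentioned above and evaluate Eq. 2.9:

    \gamma = 0.1: \quad S(10) = \frac{1}{0.1 + 0.9/10} \approx 5.3, \qquad S(100) = \frac{1}{0.1 + 0.9/100} \approx 9.2, \qquad \lim_{P \to \infty} S(P) = 10

Even with a hundred processors the speedup barely exceeds nine, and no number of processors can push it past ten.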
Of course, Amdahl's law is based on assumptions that may or may not be true in practice. In real life, a number of factors may make the actual running time longer than this formula implies. For example, creating additional parallel tasks may increase overhead and the chances of contention for shared resources. On the other hand, if the original serial computation is limited by resources other than the availability of CPU cycles, the actual performance could be much better than Amdahl's law would predict. For example, a large parallel machine may allow bigger problems to be held in memory, thus reducing virtual memory paging, or multiple processors each with its own cache may allow much more of the problem to remain in the cache. Amdahl's law also rests on the assumption that for any given input, the parallel and serial implementations perform exactly the same number of computational steps. If the serial algorithm being used in the formula is not the best possible algorithm for the problem, then a clever parallel algorithm that structures the computation differently can reduce the total number of computational steps.

It has also been observed [Gus88] that the exercise underlying Amdahl's law, namely running exactly the same problem with varying numbers of processors, is artificial in some circumstances. If, say, the parallel application were a weather simulation, then when new processors were added, one would most likely increase the problem size by adding more details to the model while keeping the total execution time constant. If this is the case, then Amdahl's law, or fixed-size speedup, gives a pessimistic view of the benefits of additional processors.
To see this, we can reformulate the equation to give the speedup in terms of performance on a P-processor system. Earlier in Eq. 2.2, we obtained the execution time for P processors, T_{total}(P), from the execution time of the serial terms and the execution time of the parallelizable part when executed on one processor. Here, we do the opposite and obtain T_{total}(1) from the serial and parallel terms when executed on P processors.

Equation 2.11

    T_{total}(1) = T_{setup} + P \, T_{compute}(P) + T_{finalization}
Now, we define the so-called scaled serial fraction, denoted γ_scaled, as

Equation 2.12

    \gamma_{scaled} = \frac{T_{setup} + T_{finalization}}{T_{total}(P)}

and then

Equation 2.13

    T_{total}(1) = \left( \gamma_{scaled} + P \, (1 - \gamma_{scaled}) \right) T_{total}(P)
Rewriting the equation for speedup (Eq. 2.3) and simplifying, we obtain the scaled (or fixed-time) speedup.[1]

[1] This equation, sometimes known as Gustafson's law, was attributed in [Gus88] to E. Barsis.

Equation 2.14

    S(P) = \gamma_{scaled} + P \, (1 - \gamma_{scaled})
This gives exactly the same speedup as Amdahl's law, but allows a different question to be asked when the number of processors is increased. Since γ_scaled depends on P, the result of taking the limit isn't immediately obvious, but would give the same result as the limit in Amdahl's law. However, suppose we take the limit in P while holding T_compute and thus γ_scaled constant. The interpretation is that we are increasing the size of the problem so that the total running time remains constant when more processors are added. (This contains the implicit assumption that the execution time of the serial terms does not change as the problem size grows.) In this case, the speedup is linear in P. Thus, while adding more processors to solve a fixed problem may hit the speedup limits of Amdahl's law with a relatively small number of processors, if the problem grows as more processors are added, Amdahl's law will be pessimistic. These two models of speedup, along with a fixed-memory version of speedup, are discussed in [SN90].
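For contrast with the fixed-size example above, suppose (as an illustrative assumption) that the serial terms account for 5 percent of the running time on the parallel machine itself. Then Eq. 2.14 gives

    \gamma_{scaled} = 0.05, \ P = 64: \quad S(64) = 0.05 + 64 \times 0.95 = 60.85

Because the problem grows with the machine, the scaled speedup stays close to linear in P rather than saturating at 1/γ.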
2.6. COMMUNICATION

2.6.1. Latency and Bandwidth

A simple but useful model characterizes the total time for message transfer as the sum of a fixed cost plus a variable cost that depends on the length of the message.

Equation 2.15

    T_{message\text{-}transfer}(N) = \alpha + \frac{N}{\beta}

The fixed cost α is called latency and is essentially the time it takes to send an empty message over the communication medium, from the time the send routine is called to the time the data is received by the recipient. Latency (given in some appropriate time unit) includes overhead due to software and network hardware plus the time it takes for the message to traverse the communication medium. The bandwidth β (given in some measure of bytes per time unit) is a measure of the capacity of the communication medium. N is the length of the message.
The latency and bandwidth can vary significantly between systems depending on both the hardware used and the quality of the software implementing the communication protocols. Because these values can be measured with fairly simple benchmarks [DD97], it is sometimes worthwhile to measure values for α and β, as these can help guide optimizations to improve communication performance. For example, in a system in which α is relatively large, it might be worthwhile to try to restructure a program that sends many small messages to aggregate the communication into a few large messages instead. Data for several recent systems has been presented in [BBC+03].
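To see why aggregation helps, plug illustrative (assumed, not measured) values into Eq. 2.15, say α = 50 μs and β = 100 MB/s, and compare sending 10 KB as ten messages versus one:

    10 \times \left( \alpha + \frac{1\,\text{KB}}{\beta} \right) = 10 \times (50 + 10)\,\mu s = 600\,\mu s
    \qquad
    \alpha + \frac{10\,\text{KB}}{\beta} = (50 + 100)\,\mu s = 150\,\mu s

The ten separate messages pay the latency ten times; the single aggregated message pays it once.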
2.6.2. Overlapping Communication and Computation and Latency Hiding

If we look more closely at the computation time within a single task on a single processor, it can roughly be decomposed into computation time, communication time, and idle time. The communication time is the time spent sending and receiving messages (and thus only applies to distributed-memory machines), whereas the idle time is time that no work is being done because the task is waiting for an event, such as the release of a resource held by another task.

A common situation in which a task may be idle is when it is waiting for a message to be transmitted through the system. This can occur when sending a message (as the UE waits for a reply before proceeding) or when receiving a message. Sometimes it is possible to eliminate this wait by restructuring the task to send the message and/or post the receive (that is, indicate that it wants to receive a message) and then continue the computation. This allows the programmer to overlap communication and computation. We show an example of this technique in Fig. 2.7. This style of message passing is more complicated for the programmer, because the programmer must take care to wait for the receive to complete after any work that can be overlapped with communication is completed.
Figure 2.7. Communication without (left) and with (right) support for overlapping communication and computation. Although UE 0 in the computation on the right still has some idle time waiting for the reply from UE 1, the idle time is reduced and the computation requires less total time because of UE 1's earlier start.
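In MPI, the restructuring described above amounts to posting a nonblocking receive, doing useful work while the message is in flight, and waiting only when the incoming data is actually needed. A sketch follows, assuming exactly two processes exchanging illustrative data.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1024

    int main(int argc, char **argv) {
        double local[N], incoming[N];
        int rank, other;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;                       /* assumes two processes */

        for (int i = 0; i < N; i++) local[i] = rank + i;

        /* Post the receive early, then send; the transfer can proceed
           while we compute below. */
        MPI_Irecv(incoming, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req);
        MPI_Send(local, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);

        double partial = 0.0;
        for (int i = 0; i < N; i++)             /* work that does not need */
            partial += local[i] * local[i];     /* the incoming data       */

        MPI_Wait(&req, MPI_STATUS_IGNORE);      /* now the data is needed */

        double total = partial;
        for (int i = 0; i < N; i++) total += incoming[i];
        printf("rank %d: total = %f\n", rank, total);

        MPI_Finalize();
        return 0;
    }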
Another technique used on many parallel computers is to assign multiple UEs to each PE, so that when one UE is waiting for communication, it will be possible to context switch to another UE and keep the processor busy. This is an example of latency hiding. It is increasingly being used on modern high-performance computing systems, the most famous example being the MTA system from Cray Research [ACC+90].
2.7. SUMMARY

This chapter has given a brief overview of some of the concepts and vocabulary used in parallel computing. Additional terms are defined in the glossary. We also discussed the major programming environments in use for parallel computing: OpenMP, MPI, and Java. Throughout the book, we will use these three programming environments for our examples. More details about OpenMP, MPI, and Java, and how to use them to write parallel programs, are provided in the appendixes.
Chapter 3. The Finding Concurrency Design Space
3.1 ABOUT THE DESIGN SPACE
3.2 THE TASK DECOMPOSITION PATTERN
3.3 THE DATA DECOMPOSITION PATTERN
3.4 THE GROUP TASKS PATTERN
3.5 THE ORDER TASKS PATTERN
3.6 THE DATA SHARING PATTERN
3.7 THE DESIGN EVALUATION PATTERN
3.8 SUMMARY
3.1. ABOUT THE DESIGN SPACE

The software designer works in a number of domains. The design process starts in the problem domain with design elements directly relevant to the problem being solved (for example, fluid flows, decision trees, atoms, etc.). The ultimate aim of the design is software, so at some point, the design elements change into ones relevant to a program (for example, data structures and software modules). We call this the program domain. Although it is often tempting to move into the program domain as soon as possible, a designer who moves out of the problem domain too soon may miss valuable design options.
This is particularly relevant in parallel programming. Parallel programs attempt to solve bigger problems in less time by simultaneously solving different parts of the problem on different processing elements. This can only work, however, if the problem contains exploitable concurrency, that is, multiple activities or tasks that can execute at the same time. After a problem has been mapped onto the program domain, however, it can be difficult to see opportunities to exploit concurrency.
Hence, programmers should start their design of a parallel solution by analyzing the problem within the problem domain to expose exploitable concurrency. We call the design space in which this analysis is carried out the Finding Concurrency design space. The patterns in this design space will help identify and analyze the exploitable concurrency in a problem. After this is done, one or more patterns from the Algorithm Structure space can be chosen to help design the appropriate algorithm structure to exploit the identified concurrency.
An overview of this design space and its place in the pattern language is shown in Fig. 3.1.
Figure 3.1. Overview of the Finding Concurrency design space and its place in the pattern language
Experienced designers working in a familiar domain may see the exploitable concurrency immediately and could move directly to the patterns in the Algorithm Structure design space.
3.1.1. Overview

Before starting to work with the patterns in this design space, the algorithm designer must first consider the problem to be solved and make sure the effort to create a parallel program will be justified: Is the problem large enough, and the results significant enough, to justify expending effort to solve it faster? If so, the next step is to make sure the key features and data elements within the problem are well understood. Finally, the designer needs to understand which parts of the problem are most computationally intensive, because the effort to parallelize the problem should be focused on those parts.
After this analysis is complete, the patterns in the Finding Concurrency design space can be used to start designing a parallel algorithm. The patterns in this design space can be organized into three groups.

Decomposition Patterns. The two decomposition patterns, Task Decomposition and Data Decomposition, are used to decompose the problem into pieces that can execute concurrently.

Dependency Analysis Patterns. This group contains three patterns that help group the tasks and analyze the dependencies among them: Group Tasks, Order Tasks, and Data Sharing. Nominally, the patterns are applied in this order. In practice, however, it is often necessary to work back and forth between them, or possibly even revisit the decomposition patterns.
Design Evaluation Pattern. The final pattern in this space guides the algorithm designer through an analysis of what has been done so far before moving on to the patterns in the Algorithm Structure design space. This pattern is important because it often happens that the best design is not found on the first attempt, and the earlier design flaws are identified, the easier they are to correct. In general, working through the patterns in this space is an iterative process.
3.1.2. Using the Decomposition Patterns

The first step in designing a parallel algorithm is to decompose the problem into elements that can execute concurrently. We can think of this decomposition as occurring in two dimensions.

The task-decomposition dimension views the problem as a stream of instructions that can be broken into sequences called tasks that can execute simultaneously. For the computation to be efficient, the operations that make up the task should be largely independent of the operations taking place inside other tasks.

The data-decomposition dimension focuses on the data required by the tasks and how it can be decomposed into distinct chunks. The computation associated with the data chunks will only be efficient if the data chunks can be operated upon relatively independently.

Viewing the problem decomposition in terms of two distinct dimensions is somewhat artificial. A task decomposition implies a data decomposition and vice versa; hence, the two decompositions are really different facets of the same fundamental decomposition. We divide them into separate dimensions, however, because a problem decomposition usually proceeds most naturally by emphasizing one dimension of the decomposition over the other. By making them distinct, we make this design emphasis explicit and easier for the designer to understand.
3.1.3. Background for Examples

In this section, we give background information on some of the examples that are used in several patterns. It can be skipped for the time being and revisited later when reading a pattern that refers to one of the examples.
Medical imaging
PET (Positron Emission Tomography) scans provide an important diagnostic tool by allowing physicians to observe how a radioactive substance propagates through a patient's body. Unfortunately, the images formed from the distribution of emitted radiation are of low resolution, due in part to the scattering of the radiation as it passes through the body. It is also difficult to reason from the absolute radiation intensities, because different pathways through the body attenuate the radiation differently.

To solve this problem, models of how radiation propagates through the body are used to correct the images. A common approach is to build a Monte Carlo model, as described by Ljungberg and King [LK98]. Randomly selected points within the body are assumed to emit radiation (usually a gamma ray), and the trajectory of each ray is followed. As a particle (ray) passes through the body, it is attenuated by the different organs it traverses, continuing until the particle leaves the body and hits a camera model, thereby defining a full trajectory. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

This problem can be parallelized in two ways. Because each trajectory is independent, it is possible to parallelize the application by associating each trajectory with a task. This approach is discussed in the Examples section of the Task Decomposition pattern. Another approach would be to partition the body into sections and assign different sections to different processing elements. This approach is discussed in the Examples section of the Data Decomposition pattern.
Linear algebra
Linear algebra is an important tool in applied mathematics: It provides the machinery required to analyze solutions of large systems of linear equations. The classic linear algebra problem asks, for matrix A and vector b, what values for x will solve the equation

Equation 3.1

    A x = b
The matrix A in Eq. 3.1 takes on a central role in linear algebra. Many problems are expressed in terms of transformations of this matrix. These transformations are applied by means of a matrix multiplication

Equation 3.2

    C = T A
If T, A, and C are square matrices of order N, matrix multiplication is defined such that each element of the resulting matrix C is

Equation 3.3

    C_{i,j} = Σ_{k=1}^{N} T_{i,k} A_{k,j}

where the subscripts denote particular elements of the matrices. In other words, the element of the product matrix C in row i and column j is the dot product of the ith row of T and the jth column of A. Hence, computing each of the N² elements of C requires N multiplications and N−1 additions, making the overall complexity of matrix multiplication O(N³).
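In code, Eq. 3.3 is the familiar triply nested loop. The following serial C sketch, assuming row-major storage, is included only to fix the notation, not as an efficient implementation.

/* Direct transcription of Eq. 3.3: C[i][j] is the dot product of
 * row i of T and column j of A. Matrices are square of order n,
 * stored row-major in one-dimensional arrays. */
void matmul(int n, const double *T, const double *A, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)   /* N multiplies, N-1 adds */
                sum += T[i*n + k] * A[k*n + j];
            C[i*n + j] = sum;             /* N^2 elements: O(N^3) total */
        }
}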
There are many ways to parallelize a matrix multiplication operation. It can be parallelized using either a task-based decomposition (as discussed in the Examples section of the Task Decomposition pattern) or a data-based decomposition (as discussed in the Examples section of the Data Decomposition pattern).
Molecular dynamics
Molecular dynamics is used to simulate the motions of a large molecular system. For example, molecular dynamics simulations show how a large protein moves around and how differently shaped drugs might interact with the protein. Not surprisingly, molecular dynamics is extremely important in the pharmaceutical industry. It is also a useful test problem for computer scientists working on parallel computing: It is straightforward to understand, relevant to science at large, and difficult to parallelize effectively. As a result, it has been the subject of much research [Mat94, PH95, Pli95].

The basic idea is to treat a molecule as a large collection of balls connected by springs. The balls represent the atoms in the molecule, while the springs represent the chemical bonds between the atoms. The molecular dynamics simulation itself is an explicit time-stepping process. At each time step, the force on each atom is computed, and then standard classical mechanics techniques are used to compute how the force moves the atoms. This process is carried out repeatedly to step through time and compute a trajectory for the molecular system.

The forces due to the chemical bonds (the "springs") are relatively simple to compute. These correspond to the vibrations and rotations of the chemical bonds themselves. These are short-range forces that can be computed with knowledge of the handful of atoms that share chemical bonds. The major difficulty arises because the atoms have partial electrical charges. Hence, while atoms only interact with a small neighborhood of atoms through their chemical bonds, the electrical charges cause every atom to apply a force on every other atom.

This is the famous N-body problem. On the order of N² terms must be computed to find these nonbonded forces. Because N is large (tens or hundreds of thousands) and the number of time steps in a simulation is huge (tens of thousands), the time required to compute these nonbonded forces dominates the computation. Several ways have been proposed to reduce the effort required to solve the N-body problem. We are only going to discuss the simplest one: the cutoff method.

The idea is simple. Even though each atom exerts a force on every other atom, this force decreases with the square of the distance between the atoms. Hence, it should be possible to pick a distance beyond which the force contribution is so small that it can be ignored. By ignoring the atoms that exceed this cutoff, the problem is reduced to one that scales as O(N × n), where n is the number of atoms within the cutoff volume, usually hundreds. The computation is still huge, and it dominates the overall runtime for the simulation, but at least the problem is tractable.

There are a host of details, but the basic simulation can be summarized as in Fig. 3.2.

The primary data structures hold the atomic positions (atoms), the velocities of each atom (velocities), the forces exerted on each atom (forces), and lists of atoms within the cutoff distance of each atom (neighbors). The program itself is a time-stepping loop, in which each iteration computes the short-range force terms, updates the neighbor lists, and then finds the nonbonded forces. After the force on each atom has been computed, a simple ordinary differential equation is solved to update the positions and velocities. Physical properties based on atomic motions are then updated, and we go to the next time step.

There are many ways to parallelize the molecular dynamics problem. We consider the most common approach, starting with the task decomposition (discussed in the Task Decomposition pattern) and following with the associated data decomposition (discussed in the Data Decomposition pattern). This example shows how the two decompositions fit together to guide the design of the parallel algorithm.
Figure 3.2. Pseudocode for the molecular dynamics example
Int const N // number of atoms
Array of Real :: atoms (3,N)      //3D coordinates
Array of Real :: velocities (3,N) //velocity vectors
Array of Real :: forces (3,N)     //force in each dimension
Array of List :: neighbors(N)     //atoms in cutoff volume

loop over time steps
   vibrational_forces (N, atoms, forces)
   rotational_forces (N, atoms, forces)
   neighbor_list (N, atoms, neighbors)
   non_bonded_forces (N, atoms, neighbors, forces)
   update_atom_positions_and_velocities (N, atoms, velocities, forces)
   physical_properties ( ... Lots of stuff ... )
end loop
3.2. THE TASK DECOMPOSITION PATTERN
Problem

How can a problem be decomposed into tasks that can execute concurrently?
Context

Every parallel algorithm design starts from the same point, namely a good understanding of the problem being solved. The programmer must understand which parts of the problem are most computationally intensive, what the key data structures are, and how the data is used as the problem's solution unfolds.
The next step is to define the tasks that make up the problem and the data decomposition implied by the tasks. Fundamentally, every parallel algorithm involves a collection of tasks that can execute concurrently. The challenge is to find these tasks and craft an algorithm that lets them run concurrently.

In some cases, the problem will naturally break down into a collection of independent (or nearly independent) tasks, and it is easiest to start with a task-based decomposition. In other cases, the tasks are difficult to isolate and the decomposition of the data (as discussed in the Data Decomposition pattern) is a better starting point. It is not always clear which approach is best, and often the algorithm designer needs to consider both.

Regardless of whether the starting point is a task-based or a data-based decomposition, however, a parallel algorithm ultimately needs tasks that will execute concurrently, so these tasks must be identified.
Forces

The main forces influencing the design at this point are flexibility, efficiency, and simplicity.

Flexibility. Flexibility in the design will allow it to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.

Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization). For a task decomposition, this means we need enough tasks to keep all the PEs busy, with enough work per task to compensate for overhead incurred to manage dependencies. However, the drive for efficiency can lead to complex decompositions that lack flexibility.

Simplicity. The task decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.

Solution

The key to an effective task decomposition is to ensure that the tasks are sufficiently independent so that managing dependencies takes only a small fraction of the program's overall execution time. It is also important to ensure that the execution of the tasks can be evenly distributed among the ensemble of PEs (the load-balancing problem).

In an ideal world, the compiler would find the tasks for the programmer. Unfortunately, this almost never happens. Instead, it must usually be done by hand based on knowledge of the problem and the code required to solve it. In some cases, it might be necessary to completely recast the problem into a form that exposes relatively independent tasks.
In a task-based decomposition, we look at the problem as a collection of distinct tasks, paying particular attention to

The actions that are carried out to solve the problem. (Are there enough of them to keep the processing elements on the target machines busy?)

Whether these actions are distinct and relatively independent.

As a first pass, we try to identify as many tasks as possible; it is much easier to start with too many tasks and merge them later on than to start with too few tasks and later try to split them.

Tasks can be found in many different places.

In some cases, each task corresponds to a distinct call to a function. Defining a task for each function call leads to what is sometimes called a functional decomposition.
Another place to find tasks is in distinct iterations of the loops within an algorithm. If the iterations are independent and there are enough of them, then it might work well to base a task decomposition on mapping each iteration onto a task. This style of task-based decomposition leads to what are sometimes called loop-splitting algorithms (a minimal sketch appears after this list).
Tasks also play a key role in data-driven decompositions. In this case, a large data structure is decomposed and multiple units of execution concurrently update different chunks of the data structure. In this case, the tasks are those updates on individual chunks.
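The loop-splitting idea can be made concrete with OpenMP. In this minimal sketch, process_element() is a hypothetical stand-in for whatever independent work one iteration performs; the pragma turns the iterations into tasks that the runtime maps onto threads.

#include <omp.h>

void process_element(int i);   /* hypothetical per-iteration work */

void loop_split(int n)
{
    /* Valid only if the iterations are independent of one another. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        process_element(i);
}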
Also keep in mind the forces given in the Forces section:

Flexibility. The design needs to be flexible in the number of tasks generated. Usually this is done by parameterizing the number and size of tasks on some appropriate dimension. This will let the design be adapted to a wide range of parallel computers with different numbers of processors.

Efficiency. There are two major efficiency issues to consider in the task decomposition. First, each task must include enough work to compensate for the overhead incurred by creating the tasks and managing their dependencies. Second, the number of tasks should be large enough so that all the units of execution are busy with useful work throughout the computation.

Simplicity. Tasks should be defined in a way that makes debugging and maintenance simple. When possible, tasks should be defined so they reuse code from existing sequential programs that solve related problems.

After the tasks have been identified, the next step is to look at the data decomposition implied by the tasks. The Data Decomposition pattern may help with this analysis.
Examples
Medical imaging
Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

It is natural to associate a task with each trajectory. These tasks are particularly simple to manage concurrently because they are completely independent. Furthermore, there are large numbers of trajectories, so there will be many tasks, making this decomposition suitable for a large range of computer systems, from a shared-memory system with a small number of processing elements to a large cluster with hundreds of processing elements.
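A minimal OpenMP sketch of this one-task-per-trajectory decomposition follows. BodyModel and follow_trajectory() are hypothetical placeholders for the application's own data structures and physics; the essential properties are that the iterations are independent and the body model is shared read-only.

#include <omp.h>

typedef struct BodyModel BodyModel;   /* hypothetical read-only model */
void follow_trajectory(const BodyModel *body, long seed);

void simulate(const BodyModel *body, long num_trajectories)
{
    /* One task per trajectory; dynamic scheduling balances the load
     * because trajectory lengths vary. Each task gets its own seed so
     * the random streams are independent. */
    #pragma omp parallel for schedule(dynamic)
    for (long t = 0; t < num_trajectories; t++)
        follow_trajectory(body, t);
}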
With the basic tasks defined, we now consider the corresponding data decomposition; that is, we define the data associated with each task. Each task needs to hold the information defining the trajectory. But that is not all: The tasks need access to the model of the body as well. Although it might not be apparent from our description of the problem, the body model can be extremely large. Because it is a read-only model, this is no problem if there is an effective shared-memory system; each task can read data as needed. If the target platform is based on a distributed-memory architecture, however, the body model will need to be replicated on each PE. This can be very time-consuming and can waste a great deal of memory. For systems with small memories per PE and/or with slow networks between PEs, a decomposition of the problem based on the body model might be more effective.
This is a common situation in parallel programming: Many problems can be decomposed primarily in terms of data or primarily in terms of tasks. If a task-based decomposition avoids the need to break up and distribute complex data structures, it will be a much simpler program to write and debug. On the other hand, if memory and/or network bandwidth is a limiting factor, a decomposition that focuses on the data might be more effective. It is not so much a matter of one approach being "better" than another as a matter of balancing the needs of the machine with the needs of the programmer. We discuss this in more detail in the Data Decomposition pattern.
Matrix multiplication
Consider the multiplication of two matrices (C = AB), as described in Sec. 3.1.3. We can produce a task-based decomposition of this problem by considering the calculation of each element of the product matrix as a separate task. Each task needs access to one row of A and one column of B. This decomposition has the advantage that all the tasks are independent, and because all the data that is shared among tasks (A and B) is read-only, it will be straightforward to implement in a shared-memory environment.

The performance of this algorithm, however, would be poor. Consider the case where the three matrices are square and of order N. For each element of C, N elements from A and N elements from B would be required, resulting in 2N memory references for N multiply/add operations. Memory access time is slow compared to floating-point arithmetic, so the bandwidth of the memory subsystem would limit the performance.
A better approach would be to design an algorithm that maximizes reuse of data loaded into a processor's caches. We can arrive at this algorithm in two different ways. First, we could group together the element-wise tasks we defined earlier so the tasks that use similar elements of the A and B matrices run on the same UE (see the Group Tasks pattern). Alternatively, we could start with the data decomposition and design the algorithm from the beginning around the way the matrices fit into the caches. We discuss this example further in the Examples section of the Data Decomposition pattern.
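One plausible outcome of grouping the element-wise tasks is a blocked multiplication, sketched below with OpenMP for C = AB: the tasks become BS×BS blocks of C rather than single elements, so each block of A and B loaded into cache is reused many times. The block size BS is an assumed tuning parameter; for brevity the sketch assumes n is divisible by BS and that C is zero-initialized.

#include <omp.h>

#define BS 64   /* assumed block size; tune for the target cache */

void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    /* Each (ii,jj) block of C is updated by exactly one thread,
     * so the accumulation below is race-free. */
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++) {
                        double sum = C[i*n + j];
                        for (int k = kk; k < kk + BS; k++)
                            sum += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = sum;
                    }
}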
Molecular dynamics
Consider the molecular dynamics problem described in Sec. 3.1.3. Pseudocode for this example is shown again in Fig. 3.3.

Before performing the task decomposition, we need to better understand some details of the problem. First, the neighbor_list() computation is time-consuming. The gist of the computation is a loop over each atom, inside of which every other atom is checked to determine whether it falls within the indicated cutoff volume. Fortunately, the time steps are very small, and the atoms don't move very much in any given time step. Hence, this time-consuming computation is only carried out every 10 to 100 steps.
Figure 3.3. Pseudocode for the molecular dynamics example
Int const N // number of atoms
Array of Real :: atoms (3,N)      //3D coordinates
Array of Real :: velocities (3,N) //velocity vectors
Array of Real :: forces (3,N)     //force in each dimension
Array of List :: neighbors(N)     //atoms in cutoff volume

loop over time steps
   vibrational_forces (N, atoms, forces)
   rotational_forces (N, atoms, forces)
   neighbor_list (N, atoms, neighbors)
   non_bonded_forces (N, atoms, neighbors, forces)
   update_atom_positions_and_velocities (N, atoms, velocities, forces)
   physical_properties ( ... Lots of stuff ... )
end loop
Second, the physical_properties() function computes energies, correlation coefficients, and a host of interesting physical properties. These computations, however, are simple and do not significantly affect the program's overall runtime, so we will ignore them in this discussion.
Because the bulk of the computation time will be in non_bonded_forces(), we must pick a problem decomposition that makes that computation run efficiently in parallel. The problem is made easier by the fact that each of the functions inside the time loop has a similar structure: In the sequential version, each function includes a loop over atoms to compute contributions to the force vector. Thus, a natural task definition is the update required by each atom, which corresponds to a loop iteration in the sequential version. After performing the task decomposition, therefore, we obtain the following tasks (a sketch of the nonbonded-force case follows the list).
Tasks that find the vibrational forces on an atom

Tasks that find the rotational forces on an atom

Tasks that find the nonbonded forces on an atom

Tasks that update the position and velocity of an atom

A task to update the neighbor list for all the atoms (which we will leave sequential)
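As an illustration of the per-atom task structure, here is a minimal OpenMP sketch of the nonbonded-force loop. The data layout and add_pair_force() are hypothetical; the key property is that iteration i writes only forces[i], so the iterations are independent (exploiting Newton's third law to halve the work would reintroduce a dependency between tasks).

#include <omp.h>

/* Hypothetical helper: add the force atom j exerts on atom i into fi. */
void add_pair_force(const double xi[3], const double xj[3], double fi[3]);

void non_bonded_forces(int n, const double (*atoms)[3],
                       int *const *neighbors, const int *num_neighbors,
                       double (*forces)[3])
{
    /* One task per atom; dynamic scheduling helps balance the load,
     * since neighbor counts vary from atom to atom. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++)
        for (int m = 0; m < num_neighbors[i]; m++) {
            int j = neighbors[i][m];
            add_pair_force(atoms[i], atoms[j], forces[i]);
        }
}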
With our collection of tasks in hand, we can consider the accompanying data decomposition. The key data structures are the neighbor list, the atomic coordinates, the atomic velocities, and the force vector. Every iteration that updates the force vector needs the coordinates of a neighborhood of atoms. The computation of nonbonded forces, however, potentially needs the coordinates of all the atoms, because the molecule being simulated might fold back on itself in unpredictable ways. We will use this information to carry out the data decomposition (in the Data Decomposition pattern) and the data-sharing analysis (in the Data Sharing pattern).
Known uses
Task-based decompositions are extremely common in parallel computing. For example, the distance geometry code DGEOM [Mat96] uses a task-based decomposition, as does the parallel WESDYN molecular dynamics program [MR95].
3.3. THE DATA DECOMPOSITION PATTERN
Problem

How can a problem's data be decomposed into units that can be operated on relatively independently?

Context

The parallel algorithm designer must have a detailed understanding of the problem being solved. In addition, the designer should identify the most computationally intensive parts of the problem, the key data structures required to solve the problem, and how data is used as the problem's solution unfolds.

After the basic problem is understood, the parallel algorithm designer should consider the tasks that make up the problem and the data decomposition implied by the tasks. Both the task and data decompositions need to be addressed to create a parallel algorithm. The question is not which decomposition to do. The question is which one to start with. A data-based decomposition is a good starting point if the following is true.

The most computationally intensive part of the problem is organized around the manipulation of a large data structure.

Similar operations are being applied to different parts of the data structure, in such a way that the different parts can be operated on relatively independently.

For example, many linear algebra problems update large matrices, applying a similar set of operations to each element of the matrix. In these cases, it is straightforward to drive the parallel algorithm design by looking at how the matrix can be broken up into blocks that are updated concurrently. The task definitions then follow from how the blocks are defined and mapped onto the processing elements of the parallel computer.
Forces

The main forces influencing the design at this point are flexibility, efficiency, and simplicity.

Flexibility. Flexibility will allow the design to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.

Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization).

Simplicity. The decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.
Solution

In shared-memory programming environments such as OpenMP, the data decomposition will frequently be implied by the task decomposition. In most cases, however, the decomposition will need to be done by hand, because the memory is physically distributed, because data dependencies are too complex without explicitly decomposing the data, or to achieve acceptable efficiency on a NUMA computer.
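To illustrate the first point, here is a minimal OpenMP sketch in which the data decomposition falls out of the task decomposition: with a static schedule, each thread receives a contiguous block of iterations and thus, implicitly, a contiguous chunk of the array. transform() is a hypothetical per-element operation.

#include <omp.h>

double transform(double x);   /* hypothetical per-element operation */

void update(int n, double *a)
{
    /* The tasks are the iterations; the static schedule implicitly
     * decomposes a[] into one contiguous chunk per thread. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = transform(a[i]);
}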
If a task-based decomposition has already been done, the data decomposition is driven by the needs of each task. If well-defined and distinct data can be associated with each task, the decomposition should be simple.
When starting with a data decomposition, however, we need to look not at the tasks, but at the central data structures defining the problem, and consider whether they can be broken down into chunks that can be operated on concurrently. A few common examples include the following.
Array-based computations. Concurrency can be defined in terms of updates of different segments of the array. If the array is multidimensional, it can be decomposed in a variety of ways (rows, columns, or blocks of varying shapes).

Recursive data structures. We can think of, for example, decomposing the parallel update of a large tree data structure by decomposing the data structure into subtrees that can be updated concurrently.

Regardless of the nature of the underlying data structure, if the data decomposition is the primary factor driving the solution to the problem, it serves as the organizing principle of the parallel algorithm.

When considering how to decompose the problem's data structures, keep in mind the competing forces.
Flexibility. The size and number of data chunks should be flexible to support the widest range of parallel systems. One approach is to define chunks whose size and number are controlled by a small number of parameters. These parameters define granularity knobs that can be varied to modify the size of the data chunks to match the needs of the underlying hardware. (Note, however, that many designs are not infinitely adaptable with respect to granularity.)

The easiest place to see the impact of granularity on the data decomposition is in the overhead required to manage dependencies between chunks. The time required to manage dependencies must be small compared to the overall runtime. In a good data decomposition, the dependencies scale at a lower dimension than the computational effort associated with each chunk. For example, in many finite difference programs, the cells at the boundaries between chunks, that is, the surfaces of the chunks, must be shared. The size of the set of dependent cells scales as the surface area, while the effort required in the computation scales as the volume of the chunk. This means that the computational effort can be scaled (based on the chunk's volume) to offset overheads associated with data dependencies (based on the surface area of the chunk).
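To make the scaling argument concrete, consider a cubic chunk of side n cells (the cubic shape is an illustrative assumption):

    surface / volume = 6n² / n³ = 6/n

so enlarging the chunks drives the ratio of dependency overhead to useful computation toward zero.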
Efficiency. It is important that the data chunks be large enough that the amount of work to update the chunk offsets the overhead of managing dependencies. A more subtle issue to consider is how the chunks map onto UEs. An effective parallel algorithm must balance the load between UEs. If this isn't done well, some PEs might have a disproportionate amount of work, and the overall scalability will suffer. This may require clever ways to break up the problem. For example, if the problem clears the columns in a matrix from left to right, a column mapping of the matrix will cause problems as the UEs with the leftmost columns will finish their work before the others. A row-based block decomposition, or even a block-cyclic decomposition (in which blocks of rows are assigned cyclically to PEs), would do a much better job of keeping all the processors fully occupied. These issues are discussed in more detail in the Distributed Array pattern.
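The cyclic idea reduces to simple index arithmetic. This sketch, with assumed block size B and UE count P, computes which UE owns a given row under a block-cyclic mapping:

/* Block-cyclic owner computation: rows are dealt out in blocks of B
 * to P UEs in round-robin fashion, so no UE runs out of work early
 * when the computation sweeps through the matrix in index order. */
int owner_of_row(int i, int B, int P)
{
    return (i / B) % P;
}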
Simplicity. Overly complex data decompositions can be very difficult to debug. A data decomposition will usually require a mapping of a global index space onto a task-local index space. Making this mapping abstract allows it to be easily isolated and tested.
After the data has been decomposed, if it has not already been done, the next step is to look at the task decomposition implied by the data. The Task Decomposition pattern may help with this analysis.
Examples
Medical imaging
Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.

In a data-based decomposition of this problem, the body model is the large central data structure around which the computation can be organized. The model is broken into segments, and one or more segments are associated with each processing element. The body segments are only read, not written, during the trajectory computations, so there are no data dependencies created by the decomposition of the body model.

After the data has been decomposed, we need to look at the tasks associated with each data segment. In this case, each trajectory passing through the data segment defines a task. The trajectories are initiated and propagated within a segment. When a segment boundary is encountered, the trajectory must be passed between segments. It is this transfer that defines the dependencies between data chunks.

On the other hand, in a task-based approach to this problem (as discussed in the Task Decomposition pattern), the trajectories for each particle drive the algorithm design. Each PE potentially needs to access the full body model to service its set of trajectories. In a shared-memory environment, this is easy because the body model is a read-only data set. In a distributed-memory environment, however, this would require substantial startup overhead as the body model is broadcast across the system.

This is a common situation in parallel programming: Different points of view lead to different algorithms with potentially very different performance characteristics. The task-based algorithm is simple, but it only works if each processing element has access to a large memory and if the overhead incurred loading the data into memory is insignificant compared to the program's runtime. An algorithm driven by a data decomposition, on the other hand, makes efficient use of memory and (in distributed-memory environments) less use of network bandwidth, but it incurs more communication overhead during the concurrent part of the computation and is significantly more complex. Choosing which is the appropriate approach can be difficult and is discussed further in the Design Evaluation pattern.
Matrix multiplication
Consider the standard multiplication of two matrices