ARCHER Service 2016 Annual Report
Document Information and Version History

Version: 1.0
Status: Final
Author(s): Alan Simpson, Anne Whiting, Stephen Booth, Andy Turner, Felipe Popovics, Steve Jordan, Harvey Richardson, Mike Brown, Lorna Smith
Reviewer(s): Alan Simpson, Lorna Smith, Steve Jordan

Version  Date        Comments, Changes, Status             Authors, contributors, reviewers
0.1      2017-01-05  Inputting initial information         Anne Whiting
0.2      2017-01-06  Added CSE report                      Anne Whiting
0.3      2017-01-11  Added graphs                          Jo Beech-Brandt
0.4      2017-01-13  Added updated CSE and 2 Cray reports  Anne Whiting
0.5      2017-01-13  Review                                Andy Turner
0.6      2017-01-13  Review                                Alan Simpson
1.0      2017-01-13  Version for EPSRC                     Alan Simpson
Table of Contents

Document Information and Version History
1. Introduction
2. Executive Summary
3. Service Utilisation
    3.1 Overall Utilisation
    3.2 Utilisation by Funding Body
    3.3 Additional Usage Graph
4. User Support and Liaison (USL)
    4.1 Helpdesk Metrics
    4.2 USL Service Highlights
5. Operations and Systems Group (OSG)
    5.1 Service Failures
    5.2 OSG Service Activities
6. Computational Science and Engineering (CSE)
7. Cray Service Group
    7.1 Summary of Performance and Service Enhancements
    7.2 Reliability and Performance
    7.3 Service Failures
8. Cray Centre of Excellence (CoE)
1. Introduction

This annual report covers the period from 1 Jan 2016 to 31 Dec 2016. The report has contributions from all of the teams responsible for the operation of ARCHER:

• Service Provider (SP), containing both the User Support and Liaison (USL) Team and the Operations and Systems Group (OSG);
• Computational Science and Engineering Team (CSE);
• Cray, including contributions from the Cray Service Group and the Cray Centre of Excellence.

The next section of this report contains an Executive Summary for the year. Section 3 provides a summary of the service utilisation. Section 4 provides a summary of the year for the USL team, detailing the Helpdesk Metrics and outlining some of the highlights for the year. The OSG report in Section 5 describes their four main areas of responsibility: maintaining day-to-day operational support; planning service enhancements in a near to medium timeframe; planning major service enhancements; and supporting and developing associated services that underpin the main external operational service. In Section 6 the CSE team describe a number of highlights of the work in 2016. These include the work from the centralised team on parallel I/O performance; the training provided to support the KNL system; the Wee Archie Raspberry Pi supercomputer at the Big Bang Fair; the ARCHER Champions initiative; and Women in HPC. In Sections 7 and 8, the Cray Service team and the Cray Centre of Excellence, respectively, give a summary of their year’s activities.

This report and the additional SAFE reports are available to view online at http://www.archer.ac.uk/about-us/reports/annual/2016.php
2. Executive Summary

The sections from the various teams describe highlights of their activities. This section gives a brief summary of highlights from the first year of the overall ARCHER service. More details are provided in the appropriate section of the document.

• Work was carried out jointly between SP, CSE, and Cray to deliver the experimental Cray 12-node XC40 KNL system in October 2016. The CSE service created and delivered training courses to support user adoption of the new technology, and SAFE functionality has been introduced to manage and support KNL usage. In the first quarter of use, 188 user accounts were created, 3589 jobs were submitted using 3540 kAUs, and the KNL utilisation was 47% for this period.
• Utilisation of the service has remained very high, with a mean utilisation of 95% for 2016. Whilst this is positive, reflecting the popularity and usage of the service, it has presented challenges to the user community, in particular around job queuing times. The SP Service performed a detailed analysis of queue times that led to adjustments in the job priority formula in the ARCHER scheduling system. Analysis following these changes showed that they made queue times more equitable across different job sizes. There was both a dramatic reduction in the number of jobs that queued for very long times and a balancing of queue times across different job sizes.
• ARCHER has been instrumental in setting up and driving forward the Women in HPC initiative, and this year has seen many highlights. The most noticeable was perhaps the recognition and involvement of Women in HPC at SC16. On 14 November 2016, WHPC was again recognised in the annual HPCwire Readers’ and Editors’ Choice Awards, receiving three prestigious awards. Getting recognition in this way highlights the impact that WHPC and the diversity activities put forward under the ARCHER Outreach programme are having.
• A programme of work was delivered by the CSE team to investigate parallel I/O performance on ARCHER and to formulate concrete advice to users and developers on how to measure, understand and optimise the I/O performance of their applications. The results of this work have been documented in the ARCHER Best Practice Guide (http://www.archer.ac.uk/documentation/best-practice-guide/io.php), incorporated into training material, and will be used to produce a white paper and a webinar in early 2017.
• In total, the Service dealt with 7426 queries during 2016, meeting all query targets. Resolving user queries promptly allows users to maximise the research impact of the service. This level of support is only possible due to close and effective collaboration between all service partners.
• Responses received to the ARCHER Service annual user survey for 2015 were very positive, with a mean satisfaction score for the service of 4.3 out of 5. The highest rated aspect of the ARCHER service continues to be the helpdesk, with a mean score of over 4.5 out of 5.
3. Service Utilisation

3.1 Overall Utilisation

Utilisation over the year was 94%, up from 87% in 2015.

3.2 Utilisation by Funding Body

The utilisation by funding body relative to their allocation can be seen below.

This bar chart shows the usage of ARCHER by the two Research Councils presented as a percentage of the total Research Council allocation on ARCHER.
3.3 Additional Usage Graph

The following heatmap provides a view of the distribution of job sizes on ARCHER in 2016.

The heatmap shows that most of the kAUs are spent on jobs between 192 cores and 12,288 cores (8 to 512 nodes). The number of kAUs used is closely related to money and shows how the investment in the system is utilised.
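The core and node counts quoted for the heatmap are related by a fixed number of cores per node. The short Python sketch below makes that conversion explicit. The value of 24 cores per node is an assumption inferred from the figures above (192 cores for 8 nodes, 12,288 cores for 512 nodes), and the function name is illustrative only.

```python
# Assumption: 24 cores per node, inferred from the report's own figures
# (192 cores / 8 nodes = 12,288 cores / 512 nodes = 24).
CORES_PER_NODE = 24


def cores_to_nodes(cores: int, cores_per_node: int = CORES_PER_NODE) -> int:
    """Return the number of whole nodes needed to provide `cores` cores."""
    if cores <= 0 or cores_per_node <= 0:
        raise ValueError("cores and cores_per_node must be positive")
    # Ceiling division: a partly used node still counts as a whole node.
    return -(-cores // cores_per_node)


if __name__ == "__main__":
    for cores in (192, 12288):
        print(f"{cores} cores -> {cores_to_nodes(cores)} nodes")
```

Rounding up reflects the assumption that a job occupying part of a node is counted against the whole node.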
4. User Support and Liaison (USL)

4.1 Helpdesk Metrics

Query Closure

It was a busy year on the helpdesk but all Service Level Agreements were met. A total of 7426 queries were answered by the Service Provider, and over 99.4% were resolved within 2 days. In addition to this, the Service Provider passed on 222 in-depth queries to CSE and Cray.

                     16Q1   16Q2   16Q3   16Q4   TOTAL
Self-Service Admin   1722   1172    775   1693    5288
Admin                 654    616    408    497    1869
Technical             118     91     67     83     269
Total Queries        2494   1879   1250   2273    7426
Other Queries

In addition to the Admin and Technical Queries detailed above, the Helpdesk also dealt with Phone queries, Change Requests, internal requests and User Registration.

                             16Q1     16Q2     16Q3     16Q4     TOTAL
Phone Calls Received         82 (25)  81 (21)  56 (17)  80 (16)  299 (79)
Change Requests               2       10        4        7        23
User Registration Requests  338      264      264      218      1084

The numbers shown in brackets for the phone calls received are the calls resulting in new or updated queries. It is worth noting that the volume of telephone calls was low throughout the year. Of the 299 calls received in total, only 79 (26%) were actual ARCHER user calls that resulted in queries. All phone calls were answered within 2 minutes, as required.
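The phone call totals and the 26% figure can be reproduced from the quarterly values in the table. The following Python sketch is illustrative only: the variable and function names are not part of any ARCHER tooling, and the data are copied from the table above.

```python
# Quarterly phone-call figures copied from the table above:
# (calls received, calls resulting in new or updated queries)
phone_calls = {
    "16Q1": (82, 25),
    "16Q2": (81, 21),
    "16Q3": (56, 17),
    "16Q4": (80, 16),
}


def phone_call_summary(calls):
    """Return (total received, total resulting in queries, percentage share)."""
    received = sum(r for r, _ in calls.values())
    with_query = sum(q for _, q in calls.values())
    share = 100.0 * with_query / received
    return received, with_query, share


received, with_query, share = phone_call_summary(phone_calls)
print(f"{with_query} of {received} calls ({share:.0f}%) resulted in queries")
# prints "79 of 299 calls (26%) resulted in queries"
```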
4.2 USL Service Highlights

Work on Scheduler Prioritisation Formula

The mean system utilisation for ARCHER in 2016 was 95%. Whilst this reflects the extensive use made of the service, and supports the case for future investment in HPC, it also presents challenges to the user community, primarily around the queuing times for smaller jobs. The SP Service performed a detailed analysis of queue times that led, after consultation with users and the Research Councils, to adjustments in the job priority formula in the ARCHER scheduling system. Analysis following these changes showed that they made queue times more equitable across different job sizes. There was both a dramatic reduction in the number of jobs that queued for very long times and a balancing of queue times across different job sizes. The number of users registering concerns about queuing times has greatly reduced since this change was made.

KNL Support

User support has been extended to cover users of the KNL system, including additional SAFE functionality, user documentation and assistance videos.
SAFE Improvements

Extensive work has been carried out to improve both the usability and functionality of SAFE this year. Changes include:

• A new version of SAFE was rolled out, providing an improved and more user-friendly interface. Updated training material was produced, which includes documentation and training videos. The new version has received positive feedback from the user community and has been rolled out across the DIRAC and EPCC SAFE instances to provide a consistent user experience.
• SAFE functionality has been introduced to allow users to register publications, making it easier for them to record publications in Researchfish. This functionality will also provide a useful input to the service benefits realisation data collected.
• New functionality has been introduced to SAFE to facilitate the easy sign-up of users to applications such as VASP and CASTEP. It provides the option of automating access to particular licensed software for particular user groups, e.g. all new NCAS users could be given automatic access to the Unified Model when their user account is set up.
• New functionality has been introduced to support sign-up and management of KNL resources.
Improved reporting – scheduling coefficient

Functionality was added to the live status page on the website with the scheduling coefficient matrix and usage matrix for various periods to allow users to plan their use of ARCHER more effectively. Historic data for these two plots has also been made available. These updates have been well received by the user community.

ISO 9001:2015 certification

Preparatory work was carried out across the year to improve and consolidate processes and documentation, primarily to introduce a consistent framework to improve the service provided to our users but also to achieve ISO 9001:2015 certification. Benefits are already being seen in process improvement, and the first stage external audit was successfully carried out in December 2016. The full external audit is taking place in February 2017.
5. Operations and Systems Group (OSG)

5.1 Service Failures

There were no SEV1 Service Failures in the period as defined in the metric.

5.2 OSG Service Activities

Principal activities undertaken (in addition to day-to-day operational cover) included:

(1) Operating system and applications software support:
    a. Planning for CLE 5.2 UP04 upgrade on the XC30;
    b. Installing regular compiler and programming development upgrades;
    c. Supporting OS enhancements to external login nodes.

(2) Login nodes issue:
    a. Substantial investigation and testing associated with a serious issue regarding multiple recurrent failures of login nodes.

(3) KNL installation:
    a. Close cooperation with Cray and the CSE service on installation of the KNL test and development platform.

(4) ISO 9001:
    a. Assessment and analysis of appropriate operational processes undertaken for ISO 9001 accreditation.

(5) System administration:
    a. Further development and expansion of automated ticket handling;
    b. Refinement of locally-developed systems administration tools;
    c. Increase in the short queue hours from 0900–1700 Monday to Friday to 0800–2000 Monday to Friday;
    d. Modification of the maintenance schedule with the approval of the Research Councils, reducing the number of full maintenance days to one a month to minimise user impact.

(6) Supporting Cray hardware operations:
    a. Providing additional on-site support for Cray personnel during major hardware upgrade operations (such as the optical cable re-work).

(7) Security:
    a. Implementing enhancements to security monitoring;
    b. Installing Cray-supplied security field notices;
    c. Providing additional hardening of security measures – specific details are not available for obvious reasons;
    d. Successful external security audit undertaken with no outstanding issues as a result.

(8) Outreach:
    a. Attendance at two UKCSF meetings: Daresbury (March) and ECMWF (September);
    b. Establishment of a regular 3-way meeting with EPSRC and Cray to discuss operational issues.
6. Computational Science and Engineering (CSE)

Parallel I/O Performance Studies

At the start of 2016, in collaboration with user groups and the other Service partners, the CSE service identified a number of priority areas in which to invest technical effort from the centralised CSE team. One of the key areas identified was gaining a better practical understanding of parallel I/O performance on ARCHER and how it compares to other systems. In general, parallel I/O performance is often poorly understood in the HPC community in terms of:

• what “good” performance actually is on a particular file system;
• what a particular application does;
• what benchmarks illustrate.

We designed a programme of work for the CSE team for 2016 to address these problems by providing useful data on parallel I/O performance on ARCHER, and concrete advice to users and developers on how to measure, understand and optimise the I/O performance of their applications. This work has led to a number of positive impacts for the ARCHER community (and the wider HPC community):

• Understanding of what information can be gained from different synthetic parallel I/O benchmarks and how well they model real HPC applications.
• Quantifying the maximum performance available from the ARCHER (and other) parallel file systems in production. This gives users reference values against which to compare current performance, so that they can assess whether their applications are performing well or badly.
• Understanding the performance and scaling characteristics of different common parallel I/O patterns. This allows users to assess the best approach to take for I/O in their application and allows ARCHER service partners to configure ARCHER file systems to best meet user requirements.
• Producing statistical data on the variability of parallel I/O performance on ARCHER to aid users in assessing whether their performance variations are within normal bounds and to aid service partners in best configuring ARCHER.

The data, analysis, and conclusions from the studies so far are being disseminated to ARCHER users and the wider HPC community through several mechanisms to ensure the work has the largest possible impact:

• Production of a new I/O chapter in the ARCHER Best Practice Guide containing practical advice and reference I/O performance data: http://www.archer.ac.uk/documentation/best-practice-guide/io.php
• Production of an ARCHER white paper comparing parallel I/O performance across different systems in the UK. This is currently being reviewed and will be published in Q1 2017.
• Incorporation into ARCHER Training materials.
• Parallel I/O webinar in Q1 2017.
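As a simple illustration of measuring I/O performance and its variability, the following Python sketch times repeated writes of a file and summarises the observed bandwidth. It is a minimal single-process example under stated assumptions, not the benchmark used by the CSE team and not a parallel I/O test: the function names, file name, file size and repeat count are all hypothetical, and the results depend entirely on the file system on which it is run.

```python
import os
import statistics
import tempfile
import time


def time_write(path: str, n_bytes: int, block: int = 1 << 20) -> float:
    """Write n_bytes to path in blocks and return the bandwidth in MB/s."""
    buf = b"\0" * block
    start = time.perf_counter()
    with open(path, "wb") as f:
        remaining = n_bytes
        while remaining > 0:
            chunk = buf[: min(block, remaining)]
            f.write(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # include the cost of pushing data to storage
    elapsed = max(time.perf_counter() - start, 1e-9)
    return n_bytes / elapsed / 1e6


def summarise(samples):
    """Return simple statistics describing run-to-run variability."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


def run(n_repeats: int = 5, n_bytes: int = 8 << 20):
    """Repeat the write test in a temporary directory and return bandwidths."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "io_test.bin")
        return [time_write(path, n_bytes) for _ in range(n_repeats)]


if __name__ == "__main__":
    for key, value in summarise(run()).items():
        print(f"{key}: {value:.1f} MB/s")
```

Comparing the spread between the minimum and maximum with the median gives a first indication of whether a single slow run is within normal variation.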
ARCHER KNL Training

The 12-node Knights Landing (KNL) manycore system was launched towards the end of October 2016, and a major challenge was to provide sufficient training to new users so that they could readily make use of it. The KNL processor itself is very new, but its integration into the standard Cray environment is even more recent. We worked closely with the EPCC Intel Parallel Computing Centre and the Cray Centre of Excellence, both of whom had early access to KNL development systems, so that we could provide a wide range of training to users as soon as the system was open for user service. The training delivered included:

• A virtual tutorial (i.e. interactive online webinar) on 14 September titled “The Intel Knights Landing Processor” to introduce users to basic KNL concepts and to advertise the upcoming eCSE call that included KNL development projects.
• A virtual tutorial on 21 September explaining how to apply to the upcoming eCSE call, with a focus on applications to use the KNL system.
• A virtual tutorial on 12 October titled “Using KNL on ARCHER” explaining how to use the KNL system in practice, including information on local details such as compilers and queue structures on ARCHER.
• A 1-day hands-on course “Using Knights Landing Manycore Processors on ARCHER” that allowed users to run real programs on the ARCHER KNL nodes.

We have also launched a new driving test for the ARCHER KNL system, which is backed up by training material at http://www.archer.ac.uk/documentation/knl-guide/knl-training-resources.php. This contains videos of the two technical KNL virtual tutorials described above, a new video describing how to request KNL access via SAFE, and links to all the material (slides and practical exercises) from the hands-on course. We will be continuing the KNL training programme into 2017:

• The 3-day hands-on “Cray Optimization Workshop: ARCHER and Knights Landing”, being run in collaboration with Cray at their Bristol offices in late January, will integrate use of the ARCHER KNL system into existing advanced material on Cray hardware, environment, compilers and tools.
• The 2-day course “Programming the Manycore Knights Landing Processor”, to be run in London in Q1 2017, will extend the existing 1-day hands-on course to provide a complete introduction to developing and optimising parallel codes for the ARCHER KNL system.

Despite the challenging timescales, we have developed and delivered a wide range of traditional and online training on the ARCHER KNL system, allowing users to make the best use of the unique opportunity provided by access to this system.
Wee Archie, Wee Archlet and the Big Bang Fair

Wee Archie is a suitcase-sized supercomputer, designed to let school children try their hand at computing and learn about the benefits of supercomputing. The Raspberry Pi 2 system has been created to be representative of the system design in massively parallel architectures. Each Raspberry Pi has an LED display that lights up when in use, providing a visual display that helps demonstrate how multiple processors work in parallel to solve complex tasks.

Wee Archie was developed in 2015, and a key highlight of 2016 has been using it as a tool to educate the next generation of HPC users. Wee Archie was the centrepiece of our booth at the Big Bang Fair. The Big Bang Fair is the UK’s largest STEM event, with 70,000 people attending over 4 days. Around 6000 people took part at our booth, and Wee Archie was a considerable draw. Young people could use Wee Archie and see how it works. This was the largest event we have participated in, by some margin, and was months in the planning. The event ensured we could showcase ARCHER to the next generation of scientists from across the UK.

Wee Archie has proved so popular that we have built a second one to cope with demand. 2017 will see Wee Archie returning to the Big Bang Fair, as well as various other events. Coupled with this, we will be introducing Wee Archlet. This is the younger sibling of Wee Archie, an even smaller Raspberry Pi cluster. Wee Archlet is designed to be cheap and easy to build while still demonstrating the key concepts of parallel computing. On-line instructions will be available for download by schools and community groups wanting to build and configure a system themselves.
ARCHER Champions

2016 saw the creation and development of the ARCHER Champions (http://www.archer.ac.uk/community/champions/), a peer support network between staff members whose role involves advising users on access to local, regional and national HPC resources. The aim is to help to promote a coherent access structure to HPC resources across the UK, with coordination between tiers. There is also a focus on supporting and promoting activities designed to provide career development to research software engineers seeking a career in HPC. Activities are also designed to broaden the UK HPC user base to new disciplines and communities.

There have been two successful ARCHER Champions Workshops this year, one in Edinburgh in March, and one in Oxford in September. Both events were very well attended and developed the ARCHER Champions network. 2017 will see the next workshop in Leeds, associated with the HPC-SIG meeting.
Women in HPC Recognition

ARCHER has been instrumental in setting up and driving forward the Women in HPC initiative, and this year has seen many highlights. The most noticeable is perhaps the recognition and involvement of Women in HPC at SC16. On 14 November 2016, WHPC was recognised once again in the annual HPCwire Readers’ and Editors’ Choice Awards, receiving the following honours:

• Readers’ Choice: Workforce Diversity Leadership Award;
• Editors’ Choice: Workforce Diversity Leadership Award;
• Readers’ Choice: Outstanding Leadership in HPC, for Toni Collis.

The annual HPCwire Readers’ and Editors’ Choice Awards are determined through a nomination and voting process with the global HPCwire community, as well as selections from the HPCwire editors. The awards are an annual feature of the publication and constitute prestigious recognition from the HPC community. This is the second year that WHPC has been honoured to receive the Readers’ Choice Workforce Diversity Leadership Award. Receiving recognition from HPCwire’s readers highlights the impact that WHPC and the diversity activities put forward under the ARCHER Outreach programme are having.
7. Cray Service Group

7.1 Summary of Performance and Service Enhancements

2016 was an excellent year for the ARCHER service, with no unscheduled system outages caused by technology failures. The excellent level of reliability, in conjunction with high utilisation of system resources, has provided users with an efficient and stable national service for computational science.

A small Cray XC40 KNL Testbed system was installed in October 2016. This system was provided to enable early access to Intel Xeon Phi Knights Landing technology for the ARCHER user community within a familiar programming environment framework.

7.2 Reliability and Performance

Where aspects of technology provision have performed below the high standards expected, Cray has worked to resolve issues with a minimum of disruption to users. Occasionally, complex problems of an intermittent nature can be difficult to identify and resolve. In such cases Cray’s regional and global support teams, Engineering group and service partners work together to investigate and implement solutions. The most significant technology areas of the ARCHER service where issues were encountered in 2016 were:

• Memory fragmentation on job-launch service nodes. A workaround to prevent this issue impacting users was implemented in October 2016.
• Intermittent periods of instability affecting external login and pre/post-processing nodes. There have been two separate causes of instability on external login and pre/post-processing nodes in 2016:
    o A GPFS client related issue, which was resolved in August 2016.
    o A Lustre related issue, first seen in November 2016, which will require a patch to resolve.

7.3 Service Failures

There were no unscheduled incidents classified as full service failures in 2016.
8. Cray Centre of Excellence (CoE)

At the start of the year the Cray Centres of Excellence moved into a new organisation, the Cray EMEA Research Lab (CERL), under the leadership of Adrian Tate, which made additional expertise available for CoE activities. Further information about the CERL is available on the Cray blog post and announcement. We also started the year with various discussions to decide on the right focus areas for the CoE. In particular, there was a wish to focus on projects that would be beneficial beyond individual applications and to impact communities rather than specific research groups. Karthee Sivalingam joined the CoE during the year, bringing expertise in elementary particle physics and in Numerical Weather Prediction (NWP) and Climate applications.

Longer-Term Projects

I/O performance

The CoE is leading a broad investigation into I/O performance and optimisation that touches several areas. Several communities, in particular the UK NWP community, have displayed an interest in I/O optimisation toolsets and abstraction layers. We are currently investigating how existing I/O technologies developed at Cray can be leveraged by these communities. As part of this broad investigation, we have also been experimenting with the ADIOS implementation, both to provide a wide set of storage APIs for parallel I/O and as a platform for future development. Additionally, we have developed a tool that provides the user with a way to optimise data movement to provide many small files to an application and avoid possible contention in Lustre. This tool has been shown to be beneficial for OpenFOAM applications at scale. The tool was announced to users but there has been limited interest, so we will attempt to contact users individually.

Auto-tuning

The aim of this project is to determine the usefulness of a simple-to-use auto-tuning tool. This builds on previous work undertaken by Cray under the EU CRESTA Exascale research project. There are two main aspects to this project: gaining more experience with applications to determine the usefulness of the current mock-up implementation, and then some effort to make improvements. We updated the documentation and created a summary presentation. Both EPCC and the CoE have started on the first part of the project, with EPCC looking at using the tool to optimise the configuration of VASP. The CoE has started a discussion with NCAS and hopes to apply the auto-tuning to the UM.

ONETEP

We were previously spending a small effort (via other UK Application Staff) supporting a Poisson-Boltzmann Equation solver for ONETEP (in collaboration with the University of Southampton). This effort is now continuing as part of a project funded under eCSE-07, which started in June. The most recent work has involved design, implementation and testing work for transferring the higher order correction from ONETEP to the DL_MG multigrid solver.

Filesystem and I/O Issues

From time to time some users have reported concerns over I/O performance. Occasionally this proves to be due to a system problem, but more often the filesystem is simply busy. Improving and understanding I/O performance is a focus area for the CoE and we have been working on I/O performance benchmarking (as has EPCC), in particular the project mentioned above, which we hope will deliver some new options for users to help optimise I/O. We are developing an experimental framework that will address many of the concerns of the ARCHER community, while being accessible to users in as transparent a way as possible. The framework also looks forward to future capabilities, for example storage-class memory devices and SSDs.
Training and Workshops

The CoE assisted with various workshops during the year. Examples were the Porting and Optimization workshop, run around the time of CUG 2016, and the ARCHER Advanced OpenMP course run at Cray’s EMEA HQ in Bristol in August. CoE staff presented an ARCHER webinar outlining new features in recent updates of the Programming Environment on ARCHER. The CoE was able to engage with ARCHER users at various events including the Computing Insight UK meeting in Manchester, the ARCHER Champions meeting in Oxford, the ECMWF HPC workshop, and events held by the EPSRC Centre for Doctoral Training in Pervasive Parallelism. Cray sponsored the EuroMPI event, which this year was held in Edinburgh. We have started planning for an Optimization Workshop specifically covering the newly installed ARCHER KNL system, to be held at Cray’s EMEA HQ from 31 January to 2 February 2017.

The KNL Addition to ARCHER

The CoE was heavily involved with the introduction of the new Intel Knights Landing (KNL) system to ARCHER. While the system was being installed, the CoE provided guidance on the KNL hardware modes and helped EPCC and the ACF team decide on an appropriate configuration for start of service. The CoE also ran some sanity checks and helped resolve teething issues as the system went live. The KNL talk given by the CoE at the ARCHER Champions meeting in September was augmented and given to EPCC along with other information to assist with preparation of the user webinars and documentation. The CoE worked with EPCC on the ARCHER-KNL web documentation, and the CoE provided a tool and associated advice to assist users with proper binding of hybrid applications (also relevant to ARCHER).

Case Studies and ARCHER Promotion

The following Cray Application Brief was published:

Removing Bottlenecks to Large-Scale Genetic and Genomic Data Analysis with DISSECT and the Cray® XC™ Supercomputer (pdf), which showcases work of the Roslin Institute using the DISSECT genomic and epidemiologic analysis tool on ARCHER.

ARCHER Queries and Software

The CoE helps resolve a range of issues that come in from users via the helpdesk, some of which require significant effort and need interaction with Cray R&D experts. Of particular note was an issue relating to a research project which was utilising “Director’s Time” on ARCHER. The project is trying to integrate software that uses DMAPP and uGNI, and this has raised some subtle problems with memory registration. We were able to explain how to set up the software correctly to enable coexistence of both APIs.

Late in the year there was a weekend when users complained of slow filesystem performance, and we had to work with the systems team and one user in order to obtain a detailed system profile of an application that was significantly contributing to the load on the filesystem. As a result we noticed that NEMO can cause stress on the filesystem by opening the same files many times per process. This is still under investigation.
eCSE panel meetings

The CoE completed technical assessments and final reviews for the three eCSE calls during the year and staff attended the project planning meetings. Advice was also provided in advance on technical concerns over projects prior to panel meetings.

2017

For 2017 the Head of CERL, Adrian Tate, is keen to ensure that all the Cray CoEs continue to provide excellent service and results to customers, in particular making use of the unique skills of individual staff in the team as well as wider Cray resources. The overall aim is to assist ambitious users in the effective use of all the ARCHER resources for world-class science and research. As part of this continual review we are likely to seek further guidance from EPSRC and from key members of the ARCHER community.