ARCHER Annual 2016 v1... · This annual report covers the period from 1 Jan 2016 to 31 Dec 2016....

Date post: 31-Jul-2020
ARCHER Service 2016 Annual Report


Document Information and Version History

Version: 1.0
Status: Final
Author(s): Alan Simpson, Anne Whiting, Stephen Booth, Andy Turner, Felipe Popovics, Steve Jordan, Harvey Richardson, Mike Brown, Lorna Smith
Reviewer(s): Alan Simpson, Lorna Smith, Steve Jordan

Version  Date        Comments, Changes, Status             Authors, contributors, reviewers
0.1      2017-01-05  Inputting initial information         Anne Whiting
0.2      2017-01-06  Added CSE report                      Anne Whiting
0.3      2017-01-11  Added graphs                          Jo Beech-Brandt
0.4      2017-01-13  Added updated CSE and 2 Cray reports  Anne Whiting
0.5      2017-01-13  Review                                Andy Turner
0.6      2017-01-13  Review                                Alan Simpson
1.0      2017-01-13  Version for EPSRC                     Alan Simpson


Table of Contents

Document Information and Version History
1. Introduction
2. Executive Summary
3. Service Utilisation
   3.1 Overall Utilisation
   3.2 Utilisation by Funding Body
   3.3 Additional Usage Graph
4. User Support and Liaison (USL)
   4.1 Helpdesk Metrics
   4.2 USL Service Highlights
5. Operations and Systems Group (OSG)
   5.1 Service Failures
   5.2 OSG Service Activities
6. Computational Science and Engineering (CSE)
7. Cray Service Group
   7.1 Summary of Performance and Service Enhancements
   7.2 Reliability and Performance
   7.3 Service Failures
8. Cray Centre of Excellence (CoE)


1. Introduction

This annual report covers the period from 1 Jan 2016 to 31 Dec 2016. The report has contributions from all of the teams responsible for the operation of ARCHER:

• Service Provider (SP), containing both the User Support and Liaison (USL) Team and the Operations and Systems Group (OSG);
• Computational Science and Engineering Team (CSE);
• Cray, including contributions from the Cray Service Group and the Cray Centre of Excellence.

The next section of this report contains an Executive Summary for the year. Section 3 provides a summary of the service utilisation. Section 4 provides a summary of the year for the USL team, detailing the Helpdesk Metrics and outlining some of the highlights for the year. The OSG report in Section 5 describes their four main areas of responsibility: maintaining day-to-day operational support; planning service enhancements in a near to medium timeframe; planning major service enhancements; and supporting and developing associated services that underpin the main external operational service. In Section 6 the CSE team describe a number of highlights of the work in 2016. These include the work from the centralised team on parallel I/O performance; the training provided to support the KNL system; the Wee Archie Raspberry Pi Supercomputer at the Big Bang Fair; the ARCHER Champions initiative; and Women in HPC. In Sections 7 and 8, the Cray Service team and Cray Centre of Excellence give a summary of their year’s activities, respectively.

This report and the additional SAFE reports are available to view online at http://www.archer.ac.uk/about-us/reports/annual/2016.php


2. Executive Summary

The sections from the various teams describe highlights of their activities. This section gives a brief summary of highlights from the year for the overall ARCHER service. More details are provided in the appropriate section of the document.

• Work was carried out jointly between SP, CSE, and Cray to deliver the experimental Cray 12-node XC40 KNL system in October 2016. The CSE service created and delivered training courses to support user adoption of the new technology, and SAFE functionality has been introduced to manage and support KNL usage. In the first quarter of use, 188 user accounts were created and 3589 jobs were submitted using 3540 kAUs; KNL utilisation was 47% for this period.

• Utilisation of the service has remained very high, with a mean percentage utilisation of 95% for 2016. Whilst this is positive, reflecting the popularity and usage of the service, it has presented challenges to the user community, in particular around job queuing times. The SP Service performed a detailed analysis of queue times that led to adjustments in the job priority formula in the ARCHER scheduling system. Analysis following these changes showed that they made queue times more equitable across different job sizes. There was both a dramatic reduction in the number of jobs that queued for very long times and a balancing of queue times across different job sizes.

• ARCHER has been instrumental in setting up and driving forwards the Women in HPC initiative, and this year has seen many highlights. The most noticeable was perhaps the recognition and involvement of Women in HPC at SC16. On 14 November 2016, WHPC was again recognised in the annual HPCwire Readers’ and Editors’ Choice Awards, receiving three prestigious awards. Getting recognition in this way highlights the impact that WHPC and the diversity activities put forward under the ARCHER Outreach programme are having.

• A programme of work was delivered by the CSE team to investigate parallel I/O performance on ARCHER and to formulate concrete advice to users and developers on how to measure, understand and optimise the I/O performance of their applications. The results of this work have been documented in the ARCHER Best Practice Guide (http://www.archer.ac.uk/documentation/best-practice-guide/io.php), incorporated into training material and will be used to produce a white paper and a webinar in early 2017.

• In total, the Service dealt with 7426 queries during 2016, meeting all query targets. Resolving user queries promptly allows users to maximise the research impact of the service. This level of support is only possible due to close and effective collaboration between all service partners.

• Responses received to the ARCHER Service annual user survey for 2015 were very positive, with a mean satisfaction score for the service of 4.3 out of 5. The highest rated aspect of the ARCHER service continues to be the helpdesk, with a mean score of over 4.5 out of 5.


3. Service Utilisation

3.1 Overall Utilisation

Utilisation over the year was 94%, up from 87% in 2015.

3.2 Utilisation by Funding Body

The utilisation by funding body relative to their allocation can be seen below.

This bar chart shows the usage of ARCHER by the two Research Councils presented as a percentage of the total Research Council allocation on ARCHER.


3.3 Additional Usage Graph

The following heatmap provides a view of the distribution of job sizes on ARCHER in 2016.

The heatmap shows that most of the kAUs are spent on jobs between 192 cores and 12,288 cores (8 to 512 nodes). The number of kAUs used is closely related to money and shows how the investment in the system is utilised.
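The node counts quoted for the heatmap follow from the core counts if each node supplies 24 cores, which is the ratio implied by the report's own figures (192 cores = 8 nodes). The following is a minimal sketch of that conversion; the function name is illustrative, and rounding up to whole nodes is an assumption rather than something the report states.

```python
# Cores per node implied by the report's own figures (192 cores = 8 nodes).
CORES_PER_NODE = 192 // 8


def cores_to_nodes(cores: int) -> int:
    """Return the number of whole nodes needed to supply `cores` cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    # Assumption: a partly used node still counts as a whole node,
    # so this is a ceiling division.
    return -(-cores // CORES_PER_NODE)


if __name__ == "__main__":
    for cores in (192, 12_288):
        print(f"{cores} cores -> {cores_to_nodes(cores)} nodes")
```

Both ends of the range quoted in the text (192 and 12,288 cores) are exact multiples of 24, so the rounding assumption does not affect them.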


4. User Support and Liaison (USL)

4.1 Helpdesk Metrics

Query Closure

It was a busy year on the helpdesk but all Service level agreements were met. A total of 7426 queries were answered by the Service Provider, and over 99.4% were resolved within 2 days. In addition to this, the Service Provider passed on 222 in-depth queries to CSE and Cray.

                      16Q1   16Q2   16Q3   16Q4   TOTAL
Self-Service Admin    1722   1172    775   1693    5288
Admin                  654    616    408    497    1869
Technical              118     91     67     83     269
Total Queries         2494   1879   1250   2273    7426

Other Queries

In addition to the Admin and Technical Queries detailed above, the Helpdesk also dealt with Phone queries, Change Requests, internal requests and User Registration.

                             16Q1      16Q2      16Q3      16Q4      TOTAL
Phone Calls Received         82 (25)   81 (21)   56 (17)   80 (16)   299 (79)
Change Requests              2         10        4         7         23
User Registration Requests   338       264       264       218       1084

The numbers shown in brackets for the phone calls received are the calls resulting in new or updated queries. It is worth noting that the volume of telephone calls was low throughout the year. Of the 299 calls received in total, only 79 (26%) were actual ARCHER user calls that resulted in queries. All phone calls were answered within 2 minutes, as required.
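The 26% figure for phone calls can be reproduced from the quarterly values in the Other Queries table. The short sketch below sums the four quarters and computes the share; the variable names are illustrative.

```python
# Quarterly figures copied from the "Other Queries" table (16Q1 to 16Q4).
phone_calls = [82, 81, 56, 80]
calls_creating_queries = [25, 21, 17, 16]  # the bracketed values

total_calls = sum(phone_calls)
total_query_calls = sum(calls_creating_queries)

# Share of calls that resulted in new or updated queries, as a percentage.
share = 100 * total_query_calls / total_calls

print(f"{total_query_calls} of {total_calls} calls ({share:.0f}%) resulted in queries")
```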

4.2 USL Service Highlights

Work on Scheduler Prioritisation Formula

The mean system utilisation for ARCHER in 2016 was 95%. Whilst this reflects the extensive use made of the service, and supports the case for future investment in HPC, it also presents challenges to the user community, primarily around the queuing times for smaller jobs. The SP Service performed a detailed analysis of queue times that led, after consultation with users and the Research Councils, to adjustments in the job priority formula in the ARCHER scheduling system. Analysis following these changes showed that they made queue times more equitable across different job sizes. There was both a dramatic reduction in the number of jobs that queued for very long times and a balancing of queue times across different job sizes. The number of users registering concerns about queuing times has greatly reduced since this change was made.
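The report does not give the priority formula or the method used for the queue-time analysis. As a rough illustration of how equity across job sizes might be examined, the sketch below groups jobs into size classes and compares the median queue time of each class. The job records, the class boundaries and all names are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical job records: (nodes requested, hours spent queuing).
# These values are invented for illustration and are not ARCHER data.
JOBS = [(4, 1.0), (8, 3.0), (16, 2.0), (64, 5.0), (128, 4.0), (512, 6.0), (1024, 9.0)]

# Hypothetical size classes as (label, largest node count in the class).
SIZE_CLASSES = [("small", 16), ("medium", 256), ("large", float("inf"))]


def size_class(nodes):
    """Return the label of the first size class that can hold `nodes`."""
    for label, upper in SIZE_CLASSES:
        if nodes <= upper:
            return label
    raise ValueError("no size class matches")


def median_queue_time_by_class(jobs):
    """Group queue times by size class and return the median of each group."""
    grouped = defaultdict(list)
    for nodes, queue_hours in jobs:
        grouped[size_class(nodes)].append(queue_hours)
    return {label: median(times) for label, times in grouped.items()}


if __name__ == "__main__":
    for label, hours in median_queue_time_by_class(JOBS).items():
        print(f"{label}: median queue time {hours:.1f} h")
```

Medians are used here because a few jobs with very long queue times would dominate a mean; comparing the per-class values before and after a scheduler change is one simple way to see whether queue times have become more balanced.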

KNL Support

User support has been extended to cover users of the KNL system, including additional SAFE functionality, user documentation and assistance videos.


SAFE Improvements

Extensive work has been carried out to improve both the usability and functionality of SAFE this year. Changes include:

• A new version of SAFE was rolled out providing an improved and more user-friendly interface. Updated training material was produced which includes documentation and training videos. The new version has received positive feedback from the user community and has been rolled out across the DIRAC and EPCC SAFE instances to provide a consistent user experience.

• SAFE functionality has been introduced to allow users to register publications, making it easier for them to record publications in Researchfish. This functionality will also provide a useful input to the service benefits realisation data collected.

• New functionality has been introduced to the SAFE to facilitate the easy sign-up of users to applications such as VASP and CASTEP. It provides the option of automating access to particular licensed software for particular user groups, e.g. all new NCAS users could be given automatic access to the Unified Model when their user account is set up.

• New functionality to support sign-up and management of KNL resources.

Improved reporting – scheduling coefficient

Functionality was added to the live status page on the website with the scheduling coefficient matrix and usage matrix for various periods to allow users to plan their use of ARCHER more effectively. Historic data for these two plots has also been made available. These updates have been well received by the user community.

ISO 9001:2015 certification

Preparatory work was carried out across the year to improve and consolidate processes and documentation, primarily to introduce a consistent framework to improve the service provided to our users but also to achieve ISO 9001:2015 certification. Benefits are already being seen in process improvement and the first stage external audit was successfully carried out in December 2016. The full external audit is taking place in February 2017.


5. Operations and Systems Group (OSG)

5.1 Service Failures

There were no SEV1 Service Failures in the period as defined in the metric.

5.2 OSG Service Activities

Principal activities undertaken (in addition to day-to-day operational cover) included:

(1) Operating system and applications software support:
   a. Planning for CLE 5.2 UP04 upgrade on the XC30;
   b. Installing regular compiler and programming development upgrades;
   c. Supporting OS enhancements to external login nodes.

(2) Login nodes issue:
   a. Substantial investigation and testing associated with serious issue regarding multiple recurrent failures of login nodes

(3) KNL installation
   a. Close cooperation with Cray and the CSE service on installation of KNL test and development platform

(4) ISO 9001:
   a. Assessment and analysis of appropriate operational processes undertaken for ISO 9001 accreditation

(5) System administration:
   a. Further development and expansion of automated ticket handling;
   b. Refinement of locally-developed systems administration tools;
   c. Increase in the short queue hours from 0900–1700 Monday to Friday to be 0800–2000 Monday to Friday;
   d. Modification of maintenance schedule with the approval of the Research Councils, reducing the number of full maintenance days to one a month to minimise user impact;

(6) Supporting Cray hardware operations:
   a. Providing additional on-site support for Cray personnel during major hardware upgrade operations (such as the optical cable re-work).

(7) Security:
   a. Implementing enhancements to security monitoring;
   b. Installing Cray-supplied security field notices;
   c. Providing additional hardening of security measures – specific details are not available for obvious reasons.
   d. Successful external security audit undertaken with no outstanding issues as a result

(8) Outreach:
   a. Attendance at two UKCSF meetings: Daresbury (March) and ECMWF (September)
   b. Establishment of regular 3-way meeting with EPSRC and Cray to discuss operational issues
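Item (5)c above extends the short queue window to 0800–2000, Monday to Friday. As a small illustration of what that window covers, the sketch below checks whether a given local time falls inside it. The function name is illustrative, and treating 0800 as included and 2000 as excluded is an assumption, since the report does not define the boundaries.

```python
from datetime import datetime, time

# Short queue hours stated in the report: 0800 to 2000, Monday to Friday.
SHORT_QUEUE_OPEN = time(8, 0)
SHORT_QUEUE_CLOSE = time(20, 0)


def in_short_queue_hours(moment: datetime) -> bool:
    """Return True if `moment` falls inside the extended short queue window."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Friday is 4
    # Assumption: the window includes 08:00 and excludes 20:00.
    return is_weekday and SHORT_QUEUE_OPEN <= moment.time() < SHORT_QUEUE_CLOSE


if __name__ == "__main__":
    # 08:30 on a Monday lies in the new window but was outside the old 0900 start.
    print(in_short_queue_hours(datetime(2016, 12, 5, 8, 30)))
```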


6. Computational Science and Engineering (CSE)

Parallel I/O Performance Studies

At the start of 2016, in collaboration with user groups and the other Service partners, the CSE service identified a number of priority areas in which to invest technical effort from the centralised CSE team. One of the key areas identified was gaining a better practical understanding of parallel I/O performance on ARCHER and how it compares to other systems. In general, the HPC community often has a poor understanding of parallel I/O performance in terms of:

• what “good” performance actually is on a particular file system;
• what a particular application does;
• what benchmarks illustrate.

We designed a programme of work for the CSE team for 2016 to address these problems by providing useful data on parallel I/O performance on ARCHER, and concrete advice to users and developers on how to measure, understand and optimise the I/O performance of their applications. This work has led to a number of positive impacts for the ARCHER community (and the wider HPC community):

• Understanding of what information can be gained from different synthetic parallel I/O benchmarks and how well they model real HPC applications.

• Quantifying the maximum performance available from the ARCHER (and other) parallel file systems in production. This gives users useful reference values against which to compare current performance and assess whether their applications are performing well or badly.

• Understanding the performance and scaling characteristics of different common parallel I/O patterns. This allows users to assess what the best approach to take is for I/O in their application and allows ARCHER service partners to configure ARCHER file systems to best meet user requirements.

• Producing statistical data on the variability of parallel I/O performance on ARCHER to aid users in assessing if their performance variations are within normal bounds or not and to aid service partners in best configuring ARCHER.
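The last point above concerns judging whether a measured I/O rate lies within normal bounds. The report does not say how such bounds are defined, so the following is only a sketch of one common convention: flag a measurement that lies more than a chosen number of standard deviations from the mean of reference runs. The bandwidth figures, the threshold of two standard deviations and the function name are all hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence

# Hypothetical write bandwidths in GiB/s from repeated runs of one benchmark.
# These numbers are invented for illustration and are not ARCHER measurements.
REFERENCE_RUNS = [10.0, 12.0, 11.0, 9.0, 13.0]


def within_normal_bounds(sample: float, reference: Sequence[float], k: float = 2.0) -> bool:
    """Return True if `sample` lies within k standard deviations of the mean."""
    centre = mean(reference)
    spread = stdev(reference)  # sample standard deviation
    return abs(sample - centre) <= k * spread


if __name__ == "__main__":
    for sample in (12.5, 5.0):
        print(sample, within_normal_bounds(sample, REFERENCE_RUNS))
```

I/O rates on a shared file system are often skewed rather than normally distributed, so a percentile-based bound may be a better fit in practice; the mean and standard deviation are used here only to keep the sketch short.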

The data, analysis, and conclusions from the studies so far are being disseminated to ARCHER users and the wider HPC community through several mechanisms to ensure the work has the largest possible impact:

• Production of a new I/O chapter in the ARCHER Best Practice Guide containing practical advice and reference I/O performance data: http://www.archer.ac.uk/documentation/best-practice-guide/io.php

• Production of an ARCHER white paper comparing parallel I/O performance across different systems in the UK. This is currently being reviewed and will be published in Q1 2017.

• Incorporation into ARCHER Training materials.

• Parallel I/O webinar in Q1 2017.

ARCHER KNL Training

The 12-node Knights Landing (KNL) manycore system was launched towards the end of October 2016, and a major challenge was to provide sufficient training to new users so that they could readily make use of it. The KNL processor itself is very new, but its integration into the standard Cray environment is even more recent. We worked closely with the EPCC Intel Parallel Computing Centre and the Cray Centre of Excellence, both of whom had early access to KNL development systems, so that we could provide a wide range of training to users as soon as the system was open for user service. The training delivered included:

• A virtual tutorial (i.e. interactive online webinar) on 14 September titled “The Intel Knights Landing Processor” to introduce users to basic KNL concepts and to advertise the upcoming eCSE call that included KNL development projects.

• A virtual tutorial on 21 September explaining how to apply to the upcoming eCSE call, with a focus on applications to use the KNL system.

• A virtual tutorial on 12 October titled “Using KNL on ARCHER” explaining how to use the KNL system in practice, including information on the local details such as compilers and queue structures on ARCHER.

• A 1-day hands-on course “Using Knights Landing Manycore Processors on ARCHER” that allowed users to run real programs on the ARCHER KNL nodes.

We have also launched a new driving test for the ARCHER KNL system which is backed up by training material at http://www.archer.ac.uk/documentation/knl-guide/knl-training-resources.php. This contains videos of the two technical KNL virtual tutorials described above, a new video describing how to request KNL access via SAFE, and links to all the material (slides and practical exercises) from the hands-on course. We will be continuing the KNL training programme into 2017:

• The 3-day hands-on “Cray Optimization Workshop: ARCHER and Knights Landing”, being run in collaboration with Cray at their Bristol offices in late January, will integrate use of the ARCHER KNL system into existing advanced material on Cray hardware, environment, compilers and tools.

• The 2-day course “Programming the Manycore Knights Landing Processor”, to be run in London in Q1 2017, will extend the existing 1-day hands-on course to provide a complete introduction to developing and optimising parallel codes for the ARCHER KNL system.

Despite the challenging timescales, we have developed and delivered a wide range of traditional and online training on the ARCHER KNL system, allowing users to make the best use of the unique opportunity provided by access to this system.

Wee Archie, Wee Archlet and the Big Bang Fair

Wee Archie is a suitcase-sized supercomputer, designed to let school children try their hand at computing and learn about the benefits of supercomputing. The Raspberry Pi 2 system has been created to be representative of the system design in massively parallel architectures. Each Raspberry Pi has an LED display that lights up when in use, providing a visual display that helps demonstrate how multiple processors work in parallel to solve complex tasks.

Wee Archie was developed in 2015, and a key highlight of 2016 has been using it as a tool to educate the next generation of HPC users. Wee Archie was the centrepiece of our booth at the Big Bang Fair. The Big Bang Fair is the UK’s largest STEM event, with 70,000 people attending over 4 days. Our booth had around 6000 people take part and Wee Archie was a considerable draw. Young people could use Wee Archie and see how it works. This was the largest event we have participated in, by some margin, and was months in the planning. The event ensured we could showcase ARCHER to the next generation of scientists from across the UK.

Wee Archie has proved so popular that we have built a second one to cope with demand. 2017 will see Wee Archie returning to the Big Bang Fair, as well as various other events. Coupled with this, we will be introducing Wee Archlet. This is the younger sibling of Wee Archie, an even smaller Raspberry Pi cluster. Wee Archlet is designed to be cheap and easy to build while still demonstrating the key concepts of parallel computing. On-line instructions will be available for download by schools and community groups wanting to build and configure a system themselves.


ARCHER Champions

2016 saw the creation and development of the ARCHER Champions (http://www.archer.ac.uk/community/champions/), a peer support network between staff members whose role involves advising users on access to local, regional and national HPC resources. The aim is to help to promote a coherent access structure to HPC resources across the UK, with coordination between tiers. There is also a focus on supporting and promoting activities designed to provide career development to research software engineers seeking a career in HPC. Activities are also designed to broaden the UK HPC user base to new disciplines and communities. There have been two successful ARCHER Champions Workshops this year, one in Edinburgh in March, and one in Oxford in September. Both events were very well attended and developed the ARCHER Champions network. 2017 will see the next workshop in Leeds, associated with the HPC-SIG meeting.

Women in HPC Recognition

ARCHER has been instrumental in setting up and driving forwards the Women in HPC initiative and this year has seen many highlights. The most noticeable is perhaps the recognition and involvement of Women in HPC at SC16. On 14 November 2016, WHPC was recognised once again in the annual HPCwire Readers’ and Editors’ Choice Awards, receiving the following honours: Readers’ Choice: Workforce Diversity Leadership Award; Editors’ Choice: Workforce Diversity Leadership Award; and Readers’ Choice: Outstanding Leadership in HPC, for Toni Collis. The annual HPCwire Readers’ and Editors’ Choice Awards are determined through a nomination and voting process with the global HPCwire community, as well as selections from the HPCwire editors. The awards are an annual feature of the publication and constitute prestigious recognition from the HPC community. This is the second year that WHPC has been honoured to receive the Readers’ Choice Workforce Diversity Leadership Award. Receiving recognition from HPCwire’s readers highlights the impact that WHPC and the diversity activities put forward under the ARCHER Outreach programme are having.


7. Cray Service Group

7.1 Summary of Performance and Service Enhancements

2016 was an excellent year for the ARCHER service with no unscheduled system outages caused by technology failures. The excellent level of reliability in conjunction with high utilisation of system resources has provided users with an efficient and stable national service for computational science. A small Cray XC40 KNL Testbed system was installed in October 2016. This system was provided to enable early access to Intel Xeon Phi Knights Landing technology for the ARCHER user community within a familiar programming environment framework.

7.2 Reliability and Performance

Where aspects of technology provision have performed below the high standards expected, Cray has worked to resolve issues with a minimum of disruption to users. Occasionally, complex problems of an intermittent nature can be difficult to identify and resolve. In such cases Cray’s regional and global support teams, Engineering group and service partners work together to investigate and implement solutions. The most significant technology areas of the ARCHER service where issues were encountered in 2016 were:

• Memory fragmentation on job-launch service nodes. A workaround to prevent this issue impacting users was implemented in October 2016.

• Intermittent periods of instability affecting external login and pre/post-processing nodes. There have been two separate causes of instability on external login and pre/post-processing nodes in 2016:
   o A GPFS client related issue which was resolved in August 2016.
   o A Lustre related issue first seen in November 2016 which will require a patch to resolve.

7.3 Service Failures

There were no unscheduled incidents classified as full service failures in 2016.


8. Cray Centre of Excellence (CoE)

At the start of the year the Cray Centres of Excellence moved into a new organisation, the Cray EMEA Research Lab (CERL) under the leadership of Adrian Tate, which made additional expertise available for CoE activities. Further information about the CERL is available in the Cray blog post and announcement. We also started the year with various discussions to decide on the right focus areas for the CoE; in particular there was a wish to focus on projects that would be beneficial beyond individual applications and to impact communities rather than specific research groups. Karthee Sivalingam joined the CoE during the year, bringing expertise in elementary particle physics and Numerical Weather Prediction (NWP) and Climate applications.

Longer-Term Projects

I/O performance

The CoE is leading a broad investigation into I/O performance and optimisation that touches several areas. Several communities, in particular the UK NWP community, have displayed an interest in I/O optimisation toolsets and abstraction layers. We are currently investigating how existing I/O technologies developed at Cray can be leveraged by these communities. As part of this broad investigation, we have also been experimenting with the ADIOS implementation as a platform both to provide a wide set of storage APIs for parallel I/O and for future development. Additionally, we have developed a tool that provides the user with a way to optimize data movement to provide many small files to an application and avoid possible contention in Lustre. This tool has been shown to be beneficial for OpenFOAM applications at scale. The tool was announced to users but there has been limited interest so we will attempt to contact users individually.

Auto-tuning

The aim of this project is to determine the usefulness of a simple-to-use auto-tuning tool. This builds on previous work undertaken by Cray under the EU CRESTA Exascale research project. There are two main aspects to this project: gaining more experience with applications to determine the usefulness of the current mock-up implementation, and then some effort to make improvements. We updated the documentation and created a summary presentation. Both EPCC and the CoE have started on the first part of the project, with EPCC looking at using the tool to optimize configuration of VASP. The CoE has started a discussion with NCAS and hopes to apply the auto-tuning to the UM.
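The report does not describe how the auto-tuning tool works internally. To illustrate the general idea of auto-tuning only, the sketch below evaluates a cost for every combination of candidate parameter values and keeps the cheapest. This exhaustive search is a stand-in technique, not the CRESTA tool; the parameter names and the synthetic cost function are hypothetical.

```python
from itertools import product


def autotune(cost, parameter_space):
    """Evaluate `cost` for every parameter combination and return the cheapest.

    `cost` is any callable returning a number to minimise, for example a
    measured run time. `parameter_space` maps parameter names to candidates.
    """
    names = list(parameter_space)
    best_config, best_cost = None, float("inf")
    for values in product(*(parameter_space[name] for name in names)):
        config = dict(zip(names, values))
        measured = cost(**config)
        if measured < best_cost:
            best_config, best_cost = config, measured
    return best_config, best_cost


def synthetic_cost(threads, block_size):
    """Stand-in for timing a real run: cheapest at 4 threads, block size 64."""
    return abs(threads - 4) + abs(block_size - 64) / 32


if __name__ == "__main__":
    space = {"threads": [1, 2, 4, 8], "block_size": [32, 64, 128]}
    print(autotune(synthetic_cost, space))
```

In a real setting the cost callable would launch and time the application, and the number of combinations grows quickly with each added parameter, which is one reason practical tuners tend to use smarter search strategies than a full sweep.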

ONETEP

We were previously spending a small effort (via other UK Application Staff) supporting a Poisson-Boltzmann Equation solver for ONETEP (in collaboration with the University of Southampton). This effort is now continuing as part of a project funded under eCSE-07 which started in June. The most recent work has involved design, implementation and testing work for transferring the higher order correction from ONETEP to the DL_MG multigrid solver.

Filesystem and I/O Issues

From time to time some users have reported concerns over I/O performance. Occasionally this proves to be due to a system problem, but more often the filesystem is simply busy. Improving and understanding I/O performance is a focus area for the CoE and we have been working on I/O performance benchmarking (as has EPCC), in particular through the project mentioned above, which we hope will deliver some new options for users to help optimize I/O. We are developing an experimental framework that will address many of the concerns of the ARCHER community, while being accessible to users in as transparent a way as possible and also looking forward to future capabilities, for example storage-class memory devices and SSDs.


Training and Workshops

The CoE assisted with various workshops during the year. Examples were the Porting and Optimization workshop, run around the time of CUG 2016, and the ARCHER Advanced OpenMP course run at Cray’s EMEA HQ in Bristol in August. CoE staff presented an ARCHER webinar outlining new features in recent updates of the Programming Environment on ARCHER. The CoE was able to engage with ARCHER users at various events including the Computing Insight UK meeting in Manchester, the ARCHER Champions meeting in Oxford, the ECMWF HPC workshop, and events held by the EPSRC Centre for Doctoral Training in Pervasive Parallelism. Cray sponsored the EuroMPI event, which this year was held in Edinburgh. We have started planning for an Optimization Workshop specifically covering the newly installed ARCHER KNL system, to be held at Cray’s EMEA HQ from 31 January to 2 February 2017.

The KNL Addition to ARCHER

The CoE was heavily involved with the introduction of the new Intel Knights Landing (KNL) system to ARCHER. While the system was being installed the CoE provided guidance on the KNL hardware modes and helped EPCC and the ACF team decide on an appropriate configuration for start of service. The CoE also ran some sanity checks and helped resolve teething issues as the system went live. The KNL talk given by the CoE at the ARCHER Champions meeting in September was augmented and given to EPCC along with other information to assist with preparation of the user webinars and documentation. The CoE worked with EPCC on the ARCHER-KNL web documentation and the CoE provided a tool and associated advice to assist users with proper binding of hybrid applications (also relevant to ARCHER).

Case Studies and ARCHER Promotion

The following Cray Application Brief was published:

Removing Bottlenecks to Large-Scale Genetic and Genomic Data Analysis with DISSECT and the Cray® XC™ Supercomputer (pdf), which showcases work of the Roslin Institute using the DISSECT genomic and epidemiologic analysis tool on ARCHER.

ARCHER Queries and Software

The CoE helps resolve a range of issues that come in from users via the helpdesk, some of which require significant effort and need interaction with Cray R&D experts. Of particular note was an issue relating to a research project which was utilising “Director’s Time” on ARCHER. The project is trying to integrate software that uses DMAPP and uGNI and this has raised some subtle problems with memory registration. We were able to explain how to set up the software correctly to enable coexistence of both APIs. Late in the year there was a weekend when users complained of slow filesystem performance and we had to work with the systems team and one user in order to obtain a detailed system profile of an application that was significantly contributing to the load on the filesystem. As a result we noticed that NEMO can cause stress on the filesystem by opening the same files many times per process. This is still under investigation.


eCSE panel meetings

The CoE completed technical assessments and final reviews for the three eCSE calls during the year and staff attended the project planning meetings. Advice was also provided in advance on technical concerns over projects prior to panel meetings.

2017

For 2017 the Head of CERL, Adrian Tate, is keen to ensure that all the Cray CoEs continue to provide excellent service and results to customers, in particular making use of the unique skills of individual staff in the team as well as wider Cray resources. The overall aim is to assist ambitious users in the effective use of all the ARCHER resources for world-class science and research. As part of this continual review we are likely to seek further guidance from EPSRC and from key members of the ARCHER community.

