Going to Light Speed with DataWarp An Administrators Perspective Tina Declerck and Dave Paul CUG 2016 – May 10, 2016

Going to Light Speed with DataWarp!
An Administrators Perspective

Tina Declerck and Dave Paul
CUG 2016 – May 10, 2016

Hardware Description

•  Hardware – 144 nodes
   –  2 SSDs per node
      •  4 devices - NVMe
      •  Intel P3608
   –  Ability to increase endurance (DWPD)
      •  Decreases available space
      •  NERSC configured with 10 DWPD – default 3 DWPD

DataWarp configuration

•  Uses DVS to project to compute nodes
   –  Each DW node is a DVS server
   –  Limits access to GPFS in CLE 5.2

•  DW scheduler daemon
   –  Runs on sdb

•  RESTful API – gunicorn
   –  Runs on mom/login node
   –  Uses nginx as the HTTP server

User access

•  Assigned 2 ways
   –  Per job
   –  Persistent

•  #DW directives in job scripts
   –  Private mode
   –  Striped
   –  Type: currently only scratch supported
   –  How much space needed
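
As an illustration of the options listed above, this Python sketch renders a per-job #DW directive line for a batch script. The `jobdw` keyword and the `type`, `access_mode`, and `capacity` parameters follow Cray's documented directive syntax; the helper name `jobdw_directive` and the sample capacity are hypothetical and not taken from the slides.

```python
def jobdw_directive(capacity: str,
                    access_mode: str = "striped",
                    dw_type: str = "scratch") -> str:
    """Render a per-job #DW directive line (hypothetical helper)."""
    if access_mode not in ("striped", "private"):
        raise ValueError(f"unsupported access mode: {access_mode}")
    if dw_type != "scratch":
        # The slides state that only scratch is currently supported.
        raise ValueError("only type=scratch is supported")
    return f"#DW jobdw type={dw_type} access_mode={access_mode} capacity={capacity}"


# Sample capacity is an arbitrary illustration value.
print(jobdw_directive("1TiB"))
```

The printed line would go near the top of the job script, next to the scheduler's own directives.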

Pools & Granularity

•  Pools define a set of DataWarp nodes with a specific configuration

•  DataWarp supports multiple pools
   –  Native SLURM does NOT

•  Granularity is configured at the node and pool levels
   –  Pool granularity defines the smallest unit that can be allocated per node
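
A minimal Python sketch of what granularity means for a space request: the request is rounded up to a whole number of granules. The granularity value of 218016 MiB is the one reported by "scontrol show burst" later in these slides; the function name is hypothetical, and the assumption that the scheduler rounds up in exactly this way is ours rather than something the slides state.

```python
def round_to_granularity(request_mib: int, granularity_mib: int) -> int:
    """Round a request up to a whole number of pool granules (sizes in MiB)."""
    if request_mib <= 0 or granularity_mib <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division avoids floating-point rounding.
    granules = -(-request_mib // granularity_mib)
    return granules * granularity_mib


GRANULARITY_MIB = 218016       # Granularity=218016M in the scontrol sample
ONE_TIB_IN_MIB = 1024 * 1024   # a 1 TiB request, chosen for illustration

print(round_to_granularity(ONE_TIB_IN_MIB, GRANULARITY_MIB))
```

Under these assumptions a 1 TiB request occupies five granules, or 1090080 MiB, which is the same size that the scontrol sample shows for the buffer u1_bb1.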

Sessions, Instances, and Fragments

•  Session
   –  Equates to a job ID

•  Instance
   –  DataWarp space allocated to a job or persistent over many jobs

•  Fragment
   –  Portions of the instance on each node allocated to it

But wait, there’s more…

•  Configuration
   –  Defines how a DW instance is used

•  Namespace
   –  A configuration can have 0 or more namespaces
   –  Basically a directory or folder in a scratch configuration

We’re not done yet…

•  Registration
   –  Binds a session with a configuration
   –  Maintains information for stage-in/stage-out

•  Activation
   –  An available instance configuration on a set of nodes

Putting it all together

[Figure: a POOL with three allocations labeled "96 node job, DW striped, Type=scratch", "8 node job, Type=private", and "Type=scratch, persistent, striped"]

General problem solving - dwstat

sess state token    creator owner created             expiration nodes
2520 CA--- myBBname CLI     33333 2016-02-19T13:45:33 never          0
     CA--- u1_bb1   CLI     11111 2016-03-02T15:01:01 never          0
6185 CA--- 2128492  SLURM   55555 2016-05-09T07:13:58 never         96

inst state sess bytes     nodes created             expiration intact label    public confs
2234 CA--- 2520 212.91GiB     1 2016-02-19T13:45:33 never      true   myBBname true       1
     CA---      1.04TiB       5 2016-03-02T15:01:02 never      true   u1_bb1   true       1
5534 CA--- 6185 1.87TiB       9 2016-05-09T07:13:58 never      true   I6185-0  false      1

conf state inst type    access_type activs
2505 CA--- 2234 scratch stripe           0
     CA---      scratch stripe           0
5811 CA--- 5534 scratch stripe           1

reg  state sess conf wait
5877 CA--- 6114 5764 true
5890 CA--- 6131 5773 true
5943 CA--- 6185 5811 true

activ state sess conf nodes mount
5732  CA--- 6185 5811    96 /var/opt/cray/dws/mounts/batch/2128492/ss

frag   state inst capacity  gran node
61382  CA--  2234 212.91GiB 4MiB nid00457
       CA--       212.91GiB 4MiB nid02249
73697  CA--       212.91GiB 4MiB nid00205
73698  CA--       212.91GiB 4MiB nid01801
73699  CA--       212.91GiB 4MiB nid00014
73700  CA--       212.91GiB 4MiB nid01169
165142 CA--  5487 425.81GiB 4MiB nid01418

ns    state conf frag   span
49200 CA--  2505 61382     1
52607 CA--                 5
59484 CA--  5764 165142  129

•  States
   –  Goal: C = create or D = destroy
   –  Setup: A = actualized or - = non-actualized
   –  Condition: F = fuse blown or - = fuse intact
   –  Status: T = transitioning or - = stable or blocked
   –  Spectrum: M = mixed or - = not delayed
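
The state legend can be applied mechanically. The Python sketch below decodes a dwstat state string such as CA--- one character at a time. It assumes the flags appear in the order the legend lists them (goal, setup, condition, status, spectrum) and that shorter strings such as CA-- simply omit the trailing flags; the function and table names are hypothetical.

```python
# One (name, meanings) pair per flag position, in the assumed column order.
FLAG_COLUMNS = [
    ("goal", {"C": "create", "D": "destroy"}),
    ("setup", {"A": "actualized", "-": "non-actualized"}),
    ("condition", {"F": "fuse blown", "-": "fuse intact"}),
    ("status", {"T": "transitioning", "-": "stable or blocked"}),
    ("spectrum", {"M": "mixed", "-": "not delayed"}),
]


def decode_state(state: str) -> dict:
    """Map each character of a dwstat state string to its meaning."""
    decoded = {}
    # zip stops at the shorter input, so 4-character states decode cleanly.
    for (name, meanings), flag in zip(FLAG_COLUMNS, state):
        decoded[name] = meanings.get(flag, f"unknown ({flag})")
    return decoded


for name, meaning in decode_state("CA---").items():
    print(f"{name}: {meaning}")
```

A healthy object reads as create, actualized, fuse intact, stable; a D in the first position or an F, T, or M further along is what the troubleshooting slides look for.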

scontrol show burst

Name=cray DefaultPool=wlm_pool Granularity=218016M TotalSpace=872936064M UsedSpace=234803232M
  StageInTimeout=86400 StageOutTimeout=86400 Flags=EnablePersistent,TeardownFailure
  GetSysState=/opt/cray/dw_wlm/default/bin/dw_wlm_cli
  Allocated Buffers:
    Name=u1_bb1 CreateTime=2016-03-02T15:01:01 Size=1090080M State=allocated UserID=user1(11111)
    Name=u2_space CreateTime=2016-05-09T11:00:43 Size=1090080M State=allocated UserID=user2(22222)
    Name=myBBname CreateTime=2016-02-19T13:45:33 Size=218016M State=allocated UserID=user3(33333)
    Name=u4_Test2 CreateTime=2016-05-05T18:31:36 Size=654048M State=allocated UserID=user4(44444)
    Name=u4_Test CreateTime=2016-05-05T16:01:02 Size=654048M State=allocated UserID=user4(44444)
    Name=u5_30TB CreateTime=2016-05-05T14:31:08 Size=31612320M State=allocated UserID=user5(55555)
  Per User Buffer Use:
    UserID=user1(11111) Used=1090080M
    UserID=user2(22222) Used=1090080M
    UserID=user3(33333) Used=218016M
    UserID=user4(44444) Used=1962144M
    UserID=user5(55555) Used=31612320M
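
Because this output is a series of Key=Value tokens, it is easy to post-process. The Python sketch below parses three of the buffer lines shown above and reports how many pool granules each buffer occupies. The function names are hypothetical, and the M suffix is treated as MiB, an assumption that agrees with dwstat reporting 212.91GiB for the 218016M buffer.

```python
GRANULARITY_MIB = 218016  # Granularity=218016M from the sample output

# Three buffer lines copied from the sample output.
SAMPLE = """\
Name=u1_bb1 CreateTime=2016-03-02T15:01:01 Size=1090080M State=allocated UserID=user1(11111)
Name=myBBname CreateTime=2016-02-19T13:45:33 Size=218016M State=allocated UserID=user3(33333)
Name=u5_30TB CreateTime=2016-05-05T14:31:08 Size=31612320M State=allocated UserID=user5(55555)
"""


def parse_fields(line: str) -> dict:
    """Split one 'Key=Value Key=Value ...' line into a dict."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


def granules_per_buffer(text: str, granularity_mib: int) -> dict:
    """Return {buffer name: number of granules} for every buffer line."""
    result = {}
    for line in text.splitlines():
        fields = parse_fields(line)
        if "Name" in fields and "Size" in fields:
            size_mib = int(fields["Size"].rstrip("M"))
            result[fields["Name"]] = size_mib // granularity_mib
    return result


granules = granules_per_buffer(SAMPLE, GRANULARITY_MIB)
print(granules)
```

Every buffer size in the sample is an exact multiple of the granularity, which is a quick sanity check when reading this output by hand.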

Job hung with processes in ‘D’ state

•  Node stuck completing (most likely admindown if using ALPS)
   –  With SLURM, log in to the node to see what the problem is
   –  Process hung in ‘D’ state on a DW instance
   –  Get the job information and look at:
      •  ‘dwstat sessions’ to find the session id
      •  ‘dwstat instance’ to find the instance id
      •  ‘dwstat fragments’ | grep <instance id>
   –  Find the MDS node
      •  Drain the node and reboot to clear the issue

DW server crash

•  dwstat shows a ‘D’estroy indicator that doesn’t clear

•  “scontrol show burst” (SLURM) where “allocation” size=0 or state=teardown.

•  Once the DW server is rebooted, most recovery issues are handled by the DWS software without need for further intervention.

Problem w/ size=0

•  Silent problem

•  Registration stuck in ‘D’ state and either T or M

•  dwcli rm activation – wait to see if that clears the issue

•  dwcli update registration --id <num> --no-wait
   –  Can cause data loss if all data isn’t staged out

Log Files

•  SMW – log per DW node + log for sdb
   –  /var/opt/cray/log/p0-current/dws
   –  /var/opt/cray/log/p0-current/console & message
      •  Grep dw and xfs to see information

•  On mom/login nodes
   –  /var/log/nginx

Important Notes

•  To create or destroy a persistent instance, a compute node must be allocated

•  Existing issues
   –  Symbolic links don’t work
   –  If there is an empty directory in the stage-in directory, the stage-in will fail

•  If max writes/day is reached, the node will be set to read-only (ro)

•  Check status of an SSD with xtcheckssd

This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research of the U.S. Department of Energy under contract No. DE-AC02-05CH11231.

National Energy Research Scientific Computing Center
