GPU-enabled Studies of Molecular Systems on Keeneland
- On pursuing high resource utilization and coordinated simulations' progression
Michela Taufer, Sandeep Patel, Samuel Schlachter, and Stephen Herbein
University of Delaware
Jeremy Logan, ORNL
Taxonomy of simulations
• Simulations applying fully atomistically resolved molecular models and force fields; GPUs enable longer time and space scales
• Variable job speeds (ns/day):
  As a trajectory evolves
  Across trajectories with different, e.g., concentrations
• Fully or partially coordinated simulation progression:
  Fully coordinated needed for, e.g., replica-exchange molecular dynamics (REMD)
  Partially coordinated for, e.g., the SDS and nanotube systems
Constraints on high-end computer systems
• Resource constraints on high-end clusters:
  Limited wall-time per job (e.g., 24 hours)
  Mandatory use of resource managers
  No direct submission and monitoring of GPU jobs
• A logical GPU job does not map to a physical GPU job; workflow managers are still in their infancy
• System and application failures on GPUs go undetected; resource managers have no notion of job terminations on GPUs
Moving beyond virtualization
• Some clusters do include virtualization, e.g., Shadowfax
• There we can schedule isolated CPU/GPU pairs, which allows us to associate failures with a specific GPU
• But virtualization imposes overheads:
  Power
  Performance
  Noise or jitter
  Portability and maintainability
… and may not be available
Our goal: pursuing BOTH high accelerator utilization and (fully or partially) coordinated simulation progression on GPUs, in effective and cross-platform ways
Our approach
• Two software modules that plug into existing resource managers and workflow managers; no virtualization, so as to embrace diverse clusters and programming languages
• A companion module:
  Runs on the head node of the cluster
  Accepts jobs from the workflow manager
  Instantiates "children" wrapper modules
  Dynamically splits jobs and distributes job segments to the wrapper modules
• A wrapper module:
  Launches on a compute node as a resource manager job
  Receives and runs job segments from the companion module
  Reports the status of job segments to the companion module
Modules in action
[Diagram: the user node runs the workflow manager; the front-end node runs the resource manager and the companion module; the back-end nodes run the wrapper-module (WM) jobs; jobs wait in the job queue.]
1. Workflow Manager: generates a set of 24-hour jobs.
2. Workflow Manager: sends the set of 24-hour jobs to the companion module. Companion Module: receives the 24-hour jobs and generates a Wrapper Module (WM) instance per back-end node.
3. Companion Module: submits each WM instance as a job to the resource manager.
4. Resource Manager: launches the WM instance as a job on a back-end node.
5. Wrapper Module: asks the companion module for job segments, as many as there are available GPUs.
6. Companion Module: fragments the jobs into 6-hour subjobs and sends a bundle of 3 subjobs to the WM job.
7. Wrapper Module: instantiates the subjobs on the GPUs and monitors system and application failures as well as time constraints.
8. Wrapper Module: if a subjob terminates prematurely because of, e.g., a system or application failure, it requests a new subjob.
9. Companion Module: adjusts the length of the new subjob based on heuristics, e.g., so as to still complete the initial 6-hour period, and sends the subjob to the wrapper module for execution.
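The message flow above can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation: the class and method names (CompanionModule, WrapperModule, request_subjobs, report_failure) and the toy workload of three 24-hour jobs are assumptions.

```python
from collections import deque

class CompanionModule:
    """Head-node side: holds the 24-hour jobs, fragments them into 6-hour
    subjobs, and serves bundles to wrapper modules on request."""

    def __init__(self, n_jobs=3, job_hours=24, subjob_hours=6):
        self.queue = deque(
            (job_id, seg, subjob_hours)
            for job_id in range(n_jobs)
            for seg in range(job_hours // subjob_hours)
        )

    def request_subjobs(self, n_gpus):
        """Hand out at most one subjob per available GPU (steps 5-6)."""
        bundle = []
        while self.queue and len(bundle) < n_gpus:
            bundle.append(self.queue.popleft())
        return bundle

    def report_failure(self, subjob, remaining_hours):
        """A subjob died early: re-enqueue a shortened replacement so the
        original 6-hour window can still complete (steps 8-9)."""
        job_id, seg, _ = subjob
        self.queue.appendleft((job_id, seg, remaining_hours))


class WrapperModule:
    """Compute-node side: launched by the resource manager; pulls as many
    subjobs as the node has free GPUs."""

    def __init__(self, companion, n_gpus=3):
        self.companion = companion
        self.n_gpus = n_gpus

    def fetch(self):
        return self.companion.request_subjobs(self.n_gpus)


companion = CompanionModule()
wrapper = WrapperModule(companion, n_gpus=3)
bundle = wrapper.fetch()  # a bundle of 3 six-hour subjobs
companion.report_failure(bundle[0], remaining_hours=2)
```

In the real system these calls cross the network between head node and compute node; here they are plain method calls to keep the control flow visible.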
MD Simulations
• MD simulations:
  Case study 1: study of sodium dodecyl sulfate (SDS) molecules in aqueous solutions and electrolyte solutions
  Case study 2: study of nanotubes in aqueous solutions and electrolyte solutions
• GPU code FEN ZI (Yun Dong de FEN ZI = Moving MOLECULES):
  MD simulations in NVT and NVE ensembles, and energy minimization in explicit solvent
  Constraints on interatomic distances, e.g., SHAKE and RATTLE, atomic restraints, and freezing fast degrees of motion
  Electrostatic interactions, i.e., Ewald summation, performed on the GPU
• Metric of interest:
  Utilization of the GPUs, i.e., the fraction of time accountable for the simulation's progression
The Keeneland system
• GPUs: 3 NVIDIA M2090 GPUs per node
• Software: TORQUE resource manager; Globus allows for the use of the Pegasus workflow manager; shared Lustre file system
• Constraints:
  24-hour wall-time limit
  1 job per node (cannot have multiple jobs on one node)
  GPUs can be set to shared/exclusive mode, but there is no complete isolation (e.g., the user who gets access first can claim all the GPUs)
  Vendor-specific, with a specific version of the NVIDIA driver (> 260)
Modeling max utilization
• With our approach, using segments in the 24-hour period:

  utilization = \frac{\sum_{\mathrm{GPUs}}\sum_{\mathrm{days}}\left[t_{max} - \sum_{i=1}^{n-1}\left(\left(t_{arrival}(i) - t_{lastchk}(i)\right) + t_{restart}\right) - \left(t_{max} - t_{arrival}(n)\right)\right]}{\sum_{\mathrm{GPUs}}\sum_{\mathrm{days}} t_{max}}

• Without our approach:

  utilization = \frac{\sum_{\mathrm{GPUs}}\sum_{\mathrm{days}}\left[t_{max} - \left(t_{arrival}(1) - t_{lastchk}(1)\right) - \left(t_{max} - t_{arrival}(1)\right)\right]}{\sum_{\mathrm{GPUs}}\sum_{\mathrm{days}} t_{max}}

where n is the number of segments in the 24-hour period,

  t_{arrival}(i) = \begin{cases} t_{lastchk}(i) & \text{when } t_{arrival}(i) > t_{max} \\ t_{arrival}(i) & \text{otherwise} \end{cases}

and t_{lastchk}(n) = f(\text{molecular system}).
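As a sanity check, the segmented ("with our approach") model above can be coded directly. The function below is a sketch with hypothetical input conventions: one list of (t_arrival, t_lastchk, t_restart) tuples per GPU-day, values in hours, already clipped per the piecewise definition of t_arrival.

```python
def utilization(gpu_days, t_max=24.0):
    """Utilization across GPU-days under the segmented model.

    Each element of gpu_days is one GPU's 24-hour period: a list of
    (t_arrival, t_lastchk, t_restart) tuples, one per segment i = 1..n.
    """
    useful = total = 0.0
    for segments in gpu_days:
        lost = 0.0
        # Segments 1..n-1: the work since the last checkpoint is lost,
        # plus the restart cost of the follow-up segment.
        for t_arr, t_chk, t_rst in segments[:-1]:
            lost += (t_arr - t_chk) + t_rst
        # Segment n: the tail of the period after its arrival is idle.
        lost += t_max - segments[-1][0]
        useful += t_max - lost
        total += t_max
    return useful / total

# The without-approach formula has a single segment per period and
# additionally loses the work between t_lastchk(1) and t_arrival(1).
```

For example, a period with two segments arriving at hours 10 and 22, checkpointed at hours 9 and 21, with a half-hour restart, loses (10-9)+0.5+(24-22) = 3.5 hours, giving 20.5/24 ≈ 85% utilization.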
Case study 1: Sodium Dodecyl Sulfate (SDS)
Initial structures: surfactant molecules randomly distributed, at molar concentrations of 0.10, 0.25, 0.50, and 1.00
[Figure: one snapshot per concentration]
Case study 1: variable simulation times
[Plot: performance (ns/day), from 0 to 1.8, vs. simulation time (ns), from 0 to 12, for the four concentrations 0.1, 0.25, 0.5, and 1.0]
Case study 1: testbeds
• Taxonomy of our simulations: 4 concentrations and 3 200-ns trajectories per concentration, at 298 K
• Test 1: jobs with the same concentration assigned to the same node
• Test 2: jobs with different concentrations assigned to the same node
Case study 1: modeling max utilization
• With our approach, using segments in the 24-hour period:

  utilization = \frac{\sum_{\mathrm{GPUs}}\sum_{\mathrm{days}}\left[t_{max} - \sum_{i=1}^{n-1} t_{restart} - \left(t_{max} - t_{arrival}(n)\right)\right]}{\sum_{\mathrm{GPUs}}\sum_{\mathrm{days}} t_{max}}

• Without our approach:

  utilization = \frac{\sum_{\mathrm{GPUs}}\sum_{\mathrm{days}}\left[t_{max} - \left(t_{max} - t_{arrival}(1)\right)\right]}{\sum_{\mathrm{GPUs}}\sum_{\mathrm{days}} t_{max}}

where t_{max} = 24 hours, n is the number of segments, and

  t_{arrival}(i) = \begin{cases} t_{lastchk}(i) & \text{when } t_{arrival}(i) > t_{max} \\ t_{arrival}(i) & \text{otherwise} \end{cases}
Case study 1: modeling the arrival time t_arrival(i)
We model it in two ways:
• Scientists: run a short simulation, compute ns/day, and set the job's speed to a constant rate that fits into the 24-hour period
• Our approach: split the 24-hour job into segments and adjust each segment's length with a heuristic that accounts for the change in ns/day
Case study 1: our heuristic
[Plot: observed performance vs. projected performance; our heuristic tracks the observed ns/day rather than the initial projection.]
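The slides do not spell the heuristic out, so the following is only a plausible minimal form: the companion module sizes each new segment from the most recently observed rate instead of the initial projection. The function name and the 6-hour target window are assumptions for illustration.

```python
def next_segment_ns(observed_ns_per_day, target_hours=6.0):
    """Size the next subjob (in ns) so that, at the rate just observed,
    it finishes within the target wall-clock window."""
    return (observed_ns_per_day / 24.0) * target_hours

# As the SDS trajectory evolves and ns/day changes, each new segment is
# re-sized from the latest observation instead of the initial projection.
rates = [1.6, 1.2, 0.9]                     # observed ns/day over time
plan = [next_segment_ns(r) for r in rates]  # shrinking segment lengths
```

A constant-rate plan (the "scientists" approach above) would instead size every segment from the first observation and overrun once the rate drops.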
Case study 1: results
• Ran 12 10-day trajectories with 4 concentrations and 3 different seeds on Keeneland, three trajectories per node

GPU utilization:

  t_chkpnt   With our approach     W/o our approach
  (hours)    Test 1    Test 2      Test 2    Test 1
  0.5        99.54%    98.82%      98.78%    97.08%
  1          99.18%    98.44%      98.26%    96.53%
  3          97.83%    96.98%      96.21%    94.49%
  6          95.83%    94.85%      93.50%    91.72%
Case study 1: snapshots of ongoing simulations
Initial structures (time 0 ns for all four concentrations): surfactant molecules randomly distributed
Ongoing simulations: 0.10 M at 22 ns, 0.25 M at 20 ns, 0.50 M at 20 ns, and 1.00 M at 15 ns
Case study 2: Carbon Nanotubes
• Study nanotubes in aqueous solutions and electrolyte solutions:
  Different temperatures
  Different separations
• Scientific metrics:
  Potential of mean force
  Effect of electrolytes, i.e., sodium chloride and iodide
  Ion spatial distributions
[Figure: nanotube system with labeled distances of 24 Å and 13.6 Å]
Case study 2: testbeds
• Taxonomy of the simulations:
  10 temperatures ranging from 280 K to 360 K, along with 20 tube separations
  200 ns per trajectory, at 5.8 ns/day ± 3% on 64 nodes
• Test 1: hardware errors, i.e., ECC errors and system failures
• Test 2: hardware and application errors
Modeling max utilization
• With our approach:

  utilization = \frac{\sum_{\mathrm{GPUs}}\sum_{\mathrm{days}}\left[t_{max} - \sum_{i=1}^{n-1}\left(\left(t_{arrival}(i) - t_{lastchk}(i)\right) + t_{restart}\right) - \left(t_{max} - t_{arrival}(n)\right)\right]}{\sum_{\mathrm{GPUs}}\sum_{\mathrm{days}} t_{max}}

• Without our approach:

  utilization = \frac{\sum_{\mathrm{GPUs}}\sum_{\mathrm{days}}\left[t_{max} - \left(t_{arrival}(1) - t_{lastchk}(1)\right) - \left(t_{max} - t_{arrival}(1)\right)\right]}{\sum_{\mathrm{GPUs}}\sum_{\mathrm{days}} t_{max}}

where

  t_{arrival}(i)\big|_{i<n} = \mathrm{Weibull}(scale, shape), \qquad t_{arrival}(n) = 0.03 \times t_{max}, \qquad t_{max} = 24~\text{hours}
Case study 2: modeling system failures
• Weibull distribution fitted to the observed system failures: scale = 203.8, shape = 0.525
[Plot: failure occurrences and the fitted probability density function (pdf) over hours]
• P(system failure) = 0.057
• P(two more jobs fail because of the system, given that one already failed) = 0.333
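The fitted failure model can be sampled directly; the sketch below uses NumPy (the function names are illustrative). Note that NumPy's Weibull generator is the one-parameter form, so the scale is applied by multiplication.

```python
import math

import numpy as np

# Parameters fitted to the observed system failures (hours), per the slide.
SCALE, SHAPE = 203.8, 0.525

def sample_failure_times(n, seed=0):
    """Draw n failure inter-arrival times (hours) from the fitted Weibull."""
    rng = np.random.default_rng(seed)
    return SCALE * rng.weibull(SHAPE, size=n)

def p_failure_within(t_hours, scale=SCALE, shape=SHAPE):
    """Weibull CDF: probability that a failure occurs within t_hours."""
    return 1.0 - math.exp(-((t_hours / scale) ** shape))
```

Feeding such samples in as the t_arrival(i) values of the utilization model reproduces the "with failures" scenarios of the two tests.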
Case study 2: modeling application failures
• Weibull distribution fitted to the observed application failures: scale = 56.56, shape = 0.3361
[Plot: failure occurrences and the fitted probability density function (pdf) over hours]
• P(application failure) = 0.038
Case study 2: results
• Ran 200 ns for each nanotube system, equivalent to ~35 days on 64 nodes of Keeneland, each with 3 GPUs

GPU utilization:

  t_chkpnt   With our approach           W/o our approach
  (hours)    sysfail   sysfail+appfail   sysfail+appfail   sysfail
  0.5        99.69%    99.54%            94.07%            90.32%
  1          99.64%    99.47%            94.02%            90.24%
  3          99.47%    99.23%            93.79%            89.98%
  6          99.28%    98.98%            93.61%            89.73%
Case study 2: scientific results
[Figures: three slides of results for the nanotube systems]
Conclusions
• GPUs are still second-class citizens on high-end clusters: virtualization is too costly, and lightweight, user-level OSs are a work in progress
• Rather than rewriting existing workflow and resource managers, we propose to complement them with:
  A Companion Module complementing the workflow manager
  A Wrapper Module supporting the resource manager
• We model the maximum utilization for:
  SDS systems with dynamically variable runtimes
  Carbon nanotube systems with hardware and application failures
• Utilization increases significantly in both cases
• The science is a work in progress; stay tuned for our next publications
Acknowledgments
Related work: Taufer et al., CiSE 2012; Ganesan et al., JCC 2011; Bauer et al., JCC 2011; Davis et al., BICoB 2009
Patel's group and Taufer's group
Sponsors:
Contact: [email protected], [email protected]