Frank Giraldo
Department of Applied Mathematics
Naval Postgraduate School
Monterey CA 93943
fxgirald@nps.edu
http://frankgiraldo.wix.com/mysite
Lessons Learned on the Development of a Flexible Software Infrastructure for ESMs for Modern Computing Architectures
Acknowledgements
Funded by ONR, AFOSR, NSF, DOE
Parts of this talk include discussions with:
Jeremy Kozdon and Lucas Wilcox (NPS)
Michal Kopera (Boise State)
Simone Marras (NJIT)
Daniel Abdi (TempoQuest)
Emil Constantinescu (Argonne National Lab)
Andreas Mueller (ECMWF)
Talk Summary
Caveats
Lessons Learned via Development of GNuME
Summary of Lessons Learned
Where to next
Caveats
This talk represents opinions based on my own experience building unstructured/adaptive fluid dynamics solvers (3 big codes so far).
I am an applied mathematician and not a software engineer, although I have learned some best practices.
I am programming-language agnostic: I use what I need, although I have mainly used Fortran.
Lessons learned will be applied to a new collaboration with
Caltech, MIT, and JPL on Data-driven ESMs.
What is GNuME
GNuME (Galerkin Numerical Modeling Environment) is a framework for developing flow solvers. It contains a suite of high-order spatial discretization methods (CG/DG), time-integrators, and adaptive mesh refinement, with an emphasis on targeting current and future HPC architectures. GNuME currently houses three components:
1. NUMA[1] - nonhydrostatic (deep-planet) atmospheric model (global/limited-area)
2. NUMO - nonhydrostatic ocean model (limited-area)
3. Shallow water - coastal and global solver with wetting and drying
GNuME is written in modern Fortran and C: Fortran is used primarily, while C is used for interfacing with C libraries and for many-core compute kernels. GNuME emerged from the desire to unify all the codes in my group under one umbrella.
[1] NUMA is used in the U.S. Navy's NEPTUNE NWP system
GNuME Components
Solvers (blue):
- SWE: shallow water equation solver
- NUMO: incompressible Navier-Stokes equation solver
- NUMA: global or local compressible Navier-Stokes equation solver
Core (red):
- CG/DG element-based Galerkin methods
- Time-integrators
- Elliptic solvers/preconditioners
Communicators (brown):
- MPI
- p4est[1] adaptive mesh manager
- OCCA[2]
[1] Burstedde et al., SIAM J. Sci. Comput. 5 (2015)
[2] Medina et al., arXiv:1403.0968 (2014)

GNuME Components: Some Lessons
Separating the core (red) components allowed us to unify all 10 codes for maintainability (e.g., various PDEs, 2D, 3D, CG, DG, etc.).
The communicators (brown) are radically different: MPI is quite general (array-based), whereas many-core is quite specific (e.g., Structure of Arrays -> Array of Structures).
The solvers (blue) are independent, and users can add new components safely. Ideally, they should have their own subdirectories and distinct builds.
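The Structure-of-Arrays vs. Array-of-Structures distinction mentioned above can be sketched in a few lines. This is a toy Python illustration, not GNuME code, and the field names are hypothetical:

```python
# Toy illustration (hypothetical field names): the same 4 grid points stored
# two ways. MPI-style array code works with either layout, but many-core
# kernels prefer one contiguous array per field (Structure of Arrays),
# so a layout conversion is needed at the interface.

# Array of Structures: one record per grid point
aos = [{"rho": 1.0, "u": 0.1}, {"rho": 1.1, "u": 0.2},
       {"rho": 1.2, "u": 0.3}, {"rho": 1.3, "u": 0.4}]

def aos_to_soa(points):
    """Convert AoS -> SoA: one contiguous list per field."""
    return {key: [p[key] for p in points] for key in points[0]}

def soa_to_aos(fields):
    """Convert SoA -> AoS: one record per grid point."""
    n = len(next(iter(fields.values())))
    return [{key: vals[i] for key, vals in fields.items()} for i in range(n)]

soa = aos_to_soa(aos)          # {"rho": [1.0, 1.1, 1.2, 1.3], "u": [...]}
assert soa_to_aos(soa) == aos  # round trip preserves the data
```

The round-trip conversion is the price of keeping the general MPI path array-based while feeding layout-specific kernels on many-core hardware.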
GNuME Components: CG/DG Numerics
Domain decomposition into non-overlapping elements:
$$\Omega = \bigcup_{e=1}^{N_e} \Omega_e$$
where each element $\Omega_e$ has boundary $\Gamma_e$ with outward unit normal $\mathbf{n}$, and is mapped to a reference element.
Approximate the local solution as:
$$q_N^{(e)}(\mathbf{x},t) = \sum_{j=1}^{M} \psi_j(\mathbf{x})\, q_j^{(e)}(t)$$
Basis functions $\psi_j(\mathbf{x})$: Lagrange polynomials on the Legendre-Gauss-Lobatto points of the reference element $[-1,1]$.
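The local expansion above can be made concrete with a minimal Python illustration (not GNuME code): Lagrange basis functions on the order-3 Legendre-Gauss-Lobatto points, checked against the cardinal property $\psi_j(x_i) = \delta_{ij}$ and the partition of unity $\sum_j \psi_j(x) = 1$.

```python
import math

# LGL points for polynomial order 3 on the reference element [-1, 1]
# (known closed form: +/-1 and +/-1/sqrt(5)); higher orders need a root solve.
xlgl = [-1.0, -1.0 / math.sqrt(5.0), 1.0 / math.sqrt(5.0), 1.0]

def lagrange(j, x, nodes):
    """Evaluate the j-th Lagrange basis polynomial psi_j at x."""
    val = 1.0
    for i, xi in enumerate(nodes):
        if i != j:
            val *= (x - xi) / (nodes[j] - xi)
    return val

# Cardinal property: psi_j(x_i) = delta_ij, so the expansion coefficients
# q_j are simply nodal values of the solution
for i, xi in enumerate(xlgl):
    for j in range(len(xlgl)):
        expected = 1.0 if i == j else 0.0
        assert abs(lagrange(j, xi, xlgl) - expected) < 1e-12

# Partition of unity: sum_j psi_j(x) = 1 anywhere in the element,
# so the expansion reproduces constants exactly
x = 0.3
assert abs(sum(lagrange(j, x, xlgl) for j in range(len(xlgl))) - 1.0) < 1e-12
```

The cardinal property is what makes CG/DG element-based Galerkin methods nodal: storing $q_j^{(e)}(t)$ is the same as storing the solution at the element's LGL points.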
GNuME Numerics: Future Capabilities
Bottom line: choose methodologies that extend the shelf-life of the model (e.g., offer new capabilities).
NUMA with dynamically adaptive mesh refinement: the uniform simulation (left) is 2-4x slower than the AMR simulation (right) on 200 processors with 1 million DOF.
[Figure legends: WRF, HIGRAD, NUMA-LF, NUMA-ARK2]
GNuME Numerics: NUMO Lock Exchange

| Study | Dim | Gr | Sc | Fr (no-slip) | Fr (free-slip) |
|---|---|---|---|---|---|
| NUMO | 2D | 1.25x10^6 | 6.74 | 0.420 | 0.482 |
| Hiester et al. (2011) | 2D | 1.25x10^6 | - | 0.417 | 0.482 |
| Fringer et al. (2006) | 2D | 1.25x10^6 | - | 0.396 | 0.428 |
| Simpson & Britter (1979) | EXP | 4.8x10^6 | ~7 | 0.432 | - |
| NUMO | 2D | 1.25x10^6 | 0.71 | 0.407 | 0.475 |
| Hartel et al. (2000) | 2D | 1.25x10^6 | 0.71 | 0.406 | 0.477 |
| Cantero et al. (2007) | 3D | 1.5x10^6 | 0.71 | 0.407 | - |

Dimensionless numbers:
$$\mathrm{Fr} = \frac{u}{\sqrt{\frac{\Delta\rho}{\rho_0}\, g H}} = \frac{\text{inertia}}{\text{gravity}}, \qquad \mathrm{Gr} = \frac{g\,\frac{\Delta\rho}{\rho_0}\, H^3}{\nu^2} = \frac{\text{buoyancy}}{\text{viscosity}}, \qquad \mathrm{Sc} = \frac{\nu}{\kappa} = \frac{\text{viscosity}}{\text{diffusivity}}$$

[Figure: front distance from initial position (m) vs. time for the no-slip and free-slip cases]
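The dimensionless numbers above can be sketched numerically. This is illustrative Python with made-up sample values, not the actual lock-exchange parameters:

```python
import math

def froude(u, drho_over_rho0, g, H):
    """Fr = u / sqrt((drho/rho0) * g * H): inertia / gravity."""
    return u / math.sqrt(drho_over_rho0 * g * H)

def grashof(drho_over_rho0, g, H, nu):
    """Gr = g * (drho/rho0) * H**3 / nu**2: buoyancy / viscosity."""
    return g * drho_over_rho0 * H**3 / nu**2

def schmidt(nu, kappa):
    """Sc = nu / kappa: viscosity / diffusivity."""
    return nu / kappa

# Made-up sample values, for illustration only
g, H = 9.81, 0.1          # gravity (m/s^2), tank depth (m)
nu, kappa = 1.0e-6, 1.4e-6  # viscosity, diffusivity (m^2/s)
drho_over_rho0 = 0.01     # relative density difference
u = 0.05                  # front speed (m/s)

print(froude(u, drho_over_rho0, g, H))    # ~0.50: comparable to the table
print(grashof(drho_over_rho0, g, H, nu))  # ~1e8
print(schmidt(nu, kappa))                 # ~0.71
```

Matching these three numbers is what makes the NUMO runs directly comparable to the simulations and experiments cited in the table.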
GNuME: Global Shallow Water
Geometric flexibility (unstructured/non-orthogonal grids, adaptivity/consistent nesting) allows for focused regions, not just in grid spacing but also in order of accuracy.
[Marras et al., QJRMS 2015]
GNuME: HPC Landscape
Current ESMs run adequately on CPU-only computers (x86), e.g., NUMA on Mira at 3 km global resolution (dynamics only).
Mira (ALCF) is an IBM BG/Q with 786,432 cores, each with 4 hardware threads, and a peak performance of 10 petaflops.
[Mueller et al., IJHPCA 2017]
GNuME: HPC Landscape
Current HPC landscape: it has shifted to hybrid architectures [top500.org, June 2017]:

| Rank | Name | Theoretical Peak (PFlops) | # of Cores | Hardware | Country |
|---|---|---|---|---|---|
| 1 | Sunway TaihuLight | 125 | 10,649,600 | 260-core manycore | China |
| 2 | Tianhe-2 | 54 | 3,120,000 | 57-core Intel Xeon Phi (KNC) | China |
| 3 | Piz Daint | 25 | 361,760 | Intel Xeon Phi + Nvidia P100 | Switzerland |
| 4 | Titan | 27 | 560,640 | AMD Opteron + Nvidia K20 | U.S. |
| 5 | Sequoia | 27 | 1,572,864 | IBM BG/Q | U.S. |
GNuME: HPC Landscape
Near-future HPC landscape: it remains hybrid:

| Rank | Name | Theoretical Peak (PFlops) | # of Cores | Hardware | Country |
|---|---|---|---|---|---|
| 1 | Aurora | 1000+ | ? | Not KNH | U.S. |
| 2 | Summit | 207 | 230,000 | IBM Power 9 + Nvidia Volta GPUs | U.S. |
| 3 | Sunway TaihuLight | 125 | 10,649,600 | 260-core manycore | China |
| 4 | Tianhe-2 | 54 | 3,120,000 | 57-core Intel Xeon Phi (KNC) | China |
| 5 | Piz Daint | 25 | 361,760 | Intel Xeon Phi + Nvidia P100 | Switzerland |
| 6 | Titan | 27 | 560,640 | AMD Opteron + Nvidia K20 | U.S. |
| 7 | Sequoia | 27 | 1,572,864 | IBM BG/Q | U.S. |
GNuME: HPC Landscape
Many APIs exist for accessing manycore hardware: OpenACC, OpenMP, CUDA, OpenCL, Kokkos, OCCA.
As a domain scientist, it is perhaps easier to use DSLs (e.g., GridTools).
Our strategy has been to control as much as possible and deal directly with arrays (array of structures).
GNuME: HPC Landscape
How to harness different hardware: in the hardware-agnostic approach, the idea is to write the compute kernels once and offload them to different back-ends (e.g., OpenMP, CUDA) for varying architectures (the figure shows the idea). This solution offers portability but no guarantee of performance; the performance comes from tuning at the kernel-language level.
*[Courtesy of Tim Warburton]
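The write-once/offload idea can be sketched abstractly. This is a toy Python sketch, not OCCA's actual API, and all names are hypothetical: the kernel body is defined once, and a back-end tag only changes how it is launched.

```python
# Hypothetical sketch of the hardware-agnostic idea (NOT the OCCA API):
# the kernel body is written once; "back-ends" only change how the
# iteration space is executed (serial loops here stand in for OpenMP/CUDA).

def axpy_kernel(i, a, x, y, out):
    """Kernel body, written once: out[i] = a*x[i] + y[i]."""
    out[i] = a * x[i] + y[i]

def run_serial(n, kernel, *args):
    for i in range(n):
        kernel(i, *args)

def run_reversed(n, kernel, *args):
    # Stand-in for a second back-end: same kernel, different launch order
    for i in reversed(range(n)):
        kernel(i, *args)

BACKENDS = {"serial": run_serial, "reversed": run_reversed}

def launch(backend, n, kernel, *args):
    """Offload the same kernel to the selected back-end."""
    BACKENDS[backend](n, kernel, *args)

x, y = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
out1, out2 = [0.0] * 3, [0.0] * 3
launch("serial", 3, axpy_kernel, 2.0, x, y, out1)
launch("reversed", 3, axpy_kernel, 2.0, x, y, out2)
assert out1 == out2 == [12.0, 24.0, 36.0]  # same kernel, same answer
```

Portability comes for free in this pattern; performance does not, which is why tuning still happens at the kernel-language level.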
GNuME: HPC Landscape
How to access different hardware: example of the NUMA hardware-agnostic approach (with OCCA[1]) on GPUs (Titan, OLCF) and Intel Xeon Phis (KNL). The same code base achieves 90% (weak) scaling efficiency.
On the GPU we use a node-per-thread approach, whereas on many-core we use an element-per-thread approach (via myParallelFor).
Results with NUMA from [Abdi et al., IJHPCA 2017]
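The two threading strategies can be sketched abstractly (hypothetical Python, not NUMA's myParallelFor): the same nodal update is distributed either over individual nodes (fine-grained, GPU style) or over whole elements (coarser-grained, many-core style).

```python
# Hypothetical sketch (not NUMA's myParallelFor): one pointwise update
# distributed two ways. Each element holds `nodes_per_elem` nodal values.

nodes_per_elem = 4

def update_node(q, gid):
    q[gid] = 2.0 * q[gid]  # some pointwise nodal update

def node_per_thread(q):
    # GPU style: one (virtual) thread per global node
    for gid in range(len(q)):          # each iteration ~ one GPU thread
        update_node(q, gid)

def element_per_thread(q):
    # Many-core style: one (virtual) thread per element, looping over
    # that element's nodes internally
    nelem = len(q) // nodes_per_elem
    for e in range(nelem):             # each iteration ~ one CPU thread
        for j in range(nodes_per_elem):
            update_node(q, e * nodes_per_elem + j)

q1 = [float(i) for i in range(8)]
q2 = list(q1)
node_per_thread(q1)
element_per_thread(q2)
assert q1 == q2  # both strategies compute the same result
```

The choice trades parallel granularity against per-thread work: GPUs want many light threads, while many-core chips prefer fewer threads with cache-friendly element-local loops.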
[Roofline plot for an Nvidia K20X GPU: attainable GFLOPS/s vs. arithmetic intensity (GFLOPS/GB), with a 334 GB/s bandwidth ceiling and a 1707 GFLOPS/s compute peak; points shown for the volume, ARK update, gradient, and diffusion kernels, plus extract_q_gmres_schur, create_lhs_gmres_schur_set2c, and create_rhs_gmres_schur]
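The roofline ceiling in the plot is one line of arithmetic: attainable performance is the minimum of the compute peak and bandwidth times arithmetic intensity. A minimal sketch, using the K20X values read off the plot:

```python
PEAK_GFLOPS = 1707.0  # K20X compute peak from the plot (GFLOPS/s)
PEAK_BW = 334.0       # K20X bandwidth ceiling from the plot (GB/s)

def roofline(ai, peak=PEAK_GFLOPS, bw=PEAK_BW):
    """Attainable GFLOPS/s for arithmetic intensity ai (GFLOPS/GB)."""
    return min(peak, bw * ai)

# Below the ridge point a kernel is bandwidth-bound; above it, compute-bound
ridge = PEAK_GFLOPS / PEAK_BW      # ~5.1 GFLOPS/GB
assert roofline(1.0) == 334.0      # bandwidth-bound kernel
assert roofline(10.0) == 1707.0    # compute-bound kernel
```

Plotting each kernel's measured intensity against this ceiling shows whether further tuning should target memory traffic or floating-point throughput.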
[1] Medina et al., arXiv:1403.0968 (2014)
Summary
Our driving principle has been simplicity (readability) over performance.
We have also chosen approaches that extend the shelf-life of the model (e.g., numerics, hardware-agnosticism).
A balance must be struck between rewriting code (a good thing) and reusing code as much as possible.
The choice of programming language is important, but a balance must be struck between maintainability (developer) and readability (user). Ideally we would like one language for both prototyping and deployment (e.g., Firedrake, Julia).
For open source, we need a simple enough language with a large user base in the ESM community.
DSLs can insulate domain scientists from much of the software-engineering complexity and speed up progress (my group has not tried this approach yet).
Where to Next
In collaboration with Caltech (Andrew Stuart's talk), MIT (Raffaele Ferrari's talk), and JPL, we will develop an infrastructure for running both a global atmosphere and a multitude of hi-res LES domains (with data) for teaching the global model.
To do so, we will redesign the code in a hardware-agnostic approach (e.g., OCCA), with compute kernels written in a variety of languages (using existing functions), managed by a Python/Julia workflow.
The code will be open source from the beginning (e.g., via GitHub).
The numerical methods will stay the same (e.g., element-based Galerkin methods, time-integrators, non-conforming grids).
We will create a software infrastructure that allows for specific components (e.g., ESM vs. CFD) with specific builds, to make it easier for collaborators to add components, streamline the workflow, and increase (work) performance.