Incorporating Performance Tracking and Regression Testing Into GenIDLEST Using POINT Tools

Rui Liu, Rick Kufrin
Advanced Application Support Group
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

Amit Amritkar, Danesh Tafti
High Performance Computational Fluid Thermal Science and Engineering Group
Virginia Tech

GenIDLEST

• Generalized Incompressible Direct and Large Eddy Simulation of Turbulence
• General-purpose turbulent fluid flow and heat transfer solver
  – In development over a period of 15 years
• Besides turbulent flow, has capabilities for two-phase modeling and fluid-structure interaction
• Library of 325 subroutines and ~145,000 lines of code (Fortran 90)
• Mixed-mode (hybrid MPI/OpenMP) parallelism

Projects Using GenIDLEST

• The Adaptive Code Kitchen & Flexible Tools for Dynamic Application Composition (NSF)
• Syngas Particulate Deposition and Erosion at the Leading Edge of a Turbine Blade with Film Cooling (DOE-NETL)
• Fluid-Structure Interaction in Arterial Blood Flow (ICTAS-VT)
• Evaluation of Engine Scale Combustor Liner Heat Transfer for Can and Annular Combustor (Solar Turbines)
• Investigation of Air-side Fouling in Exhaust Gas Recirculators (Modine-US Army)
• Extreme OpenMP: A Productive Programming Model for High End Computing (NSF)
• Unsteady Aerodynamics of Flapping Wings at Re=10,000-100,000 for Micro-Air Vehicles (Army Research Office)
• Sand Ingestion and Deposition in Internal and External Cooling Configurations in the Turbine Flow Path (Pratt & Whitney)
• Advanced Fuel Gasification: A Coupled Experimental Computational Program (ICTAS-VT)

Initial Motivation

• August 2009: the Tafti group contacted NCSA's Advanced Application Support group regarding a severe slowdown in performance on the Altix system ("Cobalt") since 5/2008
• Code execution times (seconds) reported for a ribbed duct CFD problem:

                        NCSA                                    NASA
  Run Date    9-Apr-08          20-Aug-09                       21-Aug-09
  Processors  MPI     OMP       MPI     OMP-1    OMP-2          MPI     OMP
  8           61.2    63.59     77.07   370.4    673.8          70.3    153.8
  16          28.67   33.42     37.82   383.7    696            33.7    140.9
  32          13.45   20.01     18.15   404      721.9          -       -

What Had Changed?

• NCSA had upgraded the Altix system software from ProPack 5 to ProPack 6 in August '08
  – Substantial software changes: kernel, compiler, glibc, MKL
• The Intel compiler versions used had possibly changed
  – However, the compiler used at NCSA was not noted by VaTech, so it was unknown
  – The NASA compiler versions were identified only as 9.1 and 10.1
• New hardware had arrived at NCSA
  – Cobalt was now a "hybrid", with older (Madison) IA-64 processors on co-compute[12] and newer (Montvale, multicore) processors on co-compute3

Puzzle: Great Variability Between Runs at a Single Site (NCSA) in the Same Week

• What?
  – These are the two groups of runs labeled "OMP-1" and "OMP-2" in the comparison table on the prior slide
  – Execution times at the same processor counts: 370.4 vs. 673.8 (8), 383.7 vs. 696 (16), 404 vs. 721.9 (32)
• Requested and received PBS job IDs for these runs
  – Examining the job histories revealed that the slower times had run on the Madison (older) system, while the faster times were obtained on the Montvale (multicore) system
  – The magnitude of the difference was surprising, though...
  – ...but either way, the code was still in the intensive care unit

First Step: Thread Placement

• Initial hunch: threads were not being placed optimally
  – With respect to each other
  – With respect to "their data"
• System tools used to launch/examine at runtime: "dplace", "omplace", "ps", "top", "dlook" (an SGI tool for examining data placement across nodes); a small in-application spot check is sketched below
• Verdict: all threads were running on the same core!
• Why?
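To complement the system tools, a minimal in-application placement check can be useful. The sketch below is our own illustration (not part of the original study): each OpenMP thread reports the core it is currently executing on via glibc's sched_getcpu(), bound through ISO_C_BINDING. If every thread prints the same core number, placement is broken. Compile with OpenMP enabled.

      program placement_check
      use iso_c_binding
      use omp_lib
      implicit none
      interface
         function sched_getcpu() bind(c, name='sched_getcpu')
         import :: c_int
         integer(c_int) :: sched_getcpu
         end function
      end interface
c$omp parallel
      print *, 'thread', omp_get_thread_num(),
     &         ' runs on core', sched_getcpu()
c$omp end parallel
      end program placement_check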

From the man page for SGI's "omplace":

  "omplace does not pin OpenMP threads properly when used with the Intel OpenMP library built on Feb 15 2008. The build date of the OpenMP library will be printed at run time if KMP_VERSION is set to 1. Note that this version of the OpenMP library was shipped with Intel compiler package versions 10.1.015 and 10.1.017. This library is incompatible with dplace and omplace because it introduces CPU affinity functionality without the ability to disable it."

• ... so the tools (compiler and placement) were fighting each other
• Solution? Don't use that compiler version!
  – Moved to 11.1.038, which was not the default at NCSA (and still isn't)
  – Runtime improved (8 cores) from 673.8 secs to 152.8 secs (~4.5x speedup)

Next Step

• While the compiler change resulted in a substantial improvement over the initial version, the code was still ~2.5x slower than the 2008 timings
• An obvious (but often overlooked) question: had there been any changes to the code in the interim?
  – Answer: many changes, but a belief that none would impact performance...
• Time to include performance tools...

PerfSuite

• Entry-level, easy-to-use performance toolset developed at NCSA [1]
• Currently funded by the NSF SDCI (Software Development for Cyberinfrastructure) program as part of the "POINT" project
  – Collaborators: NCSA, Oregon, Tennessee, and PSC
• Two primary modes of operation for measurement:
  – "Counting mode", which counts overall occurrences of one or more hardware performance events (e.g., retired instructions, cache misses, branching, ...)
  – "Profiling mode", which produces a statistical profile of an application triggered by hardware event overflows (a generalization of "gprof")

[1] Kufrin, R. PerfSuite: An Accessible, Open Source Performance Analysis Environment for Linux. 6th International Conference on Linux Clusters: The HPC Revolution 2005. Chapel Hill, NC, April 2005.

Profiling With PerfSuite

• When all is working properly, the PerfSuite tool "psrun" can be used with unmodified applications to obtain profiles, which is attractive to typical users
  – An XML "configuration file" is used to specify what to measure and how (a sketch follows below)
• Unfortunately, psrun was not functional after the Altix upgrade to ProPack 6
  – Currently investigating SGI-supplied/developed patches to PerfSuite that address the ProPack-related problems (SGI releases "SGI/PerfSuite" through their online SupportFolio)
• Alternate approach: use the PerfSuite API rather than the tools, as a workaround for the lack of psrun
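For illustration, a minimal profiling configuration sketch (ours, not from the original slides; the element names follow the PerfSuite documentation and should be verified against the installed version) selecting total cycles with the sampling period used in the report shown later:

  <?xml version="1.0" encoding="UTF-8" ?>
  <ps_hwpc_profile class="PAPI">
    <ps_hwpc_event type="preset" name="PAPI_TOT_CYC" threshold="14500000" />
  </ps_hwpc_profile>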

PerfSuite Performance API

• Call "init" once; call "start", "read", and "suspend" as many times as you like
• Call "stop" (supplying a file name prefix of your choice) to write out the performance data results to an XML document
• Optionally, call "shutdown"
• Additional routines ps_hwpc_numevents() and ps_hwpc_eventnames() allow querying the current configuration

Fortran:

  call psf_hwpc_init(ierr)
  call psf_hwpc_start(ierr)
  call psf_hwpc_read(integer*8 values, ierr)
  call psf_hwpc_suspend(ierr)
  call psf_hwpc_stop(prefix, ierr)
  call psf_hwpc_shutdown(ierr)

C / C++:

  ps_hwpc_init(void)
  ps_hwpc_start(void)
  ps_hwpc_read(long long *values)
  ps_hwpc_suspend(void)
  ps_hwpc_stop(char *prefix)
  ps_hwpc_shutdown(void)

Fortran API Example

      include 'fperfsuite.h'

      call PSF_hwpc_init(ierr)

c     Each thread starts its own measurement
c$omp parallel private(ierr)
      call PSF_hwpc_start(ierr)
c$omp end parallel

c     Work to be measured: a simple matrix multiply
      do j = 1, n
         do i = 1, m
            do k = 1, l
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
         end do
      end do

c     Each thread stops and writes its results (file prefix "perf")
c$omp parallel private(ierr)
      call PSF_hwpc_stop('perf', ierr)
c$omp end parallel

      call PSF_hwpc_shutdown(ierr)

The "ierr" argument to PerfSuite routines should be tested for error conditions (omitted here for brevity).

In the multithreaded case (e.g., OpenMP), each thread must call the start and stop routines individually, as the parallel regions above show. A counting-mode sketch using "read" follows.
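The following counting-mode sketch is our own illustration, built only from the signatures listed above (the work loop is arbitrary; in practice, size the values array using ps_hwpc_numevents()). It reads the current event counts mid-run without ending measurement:

      program counting_sketch
      implicit none
      include 'fperfsuite.h'
      integer ierr, i
      real*8 s
      integer*8 values(16)
      s = 0.0d0
c     Initialize and begin counting the configured events
      call psf_hwpc_init(ierr)
      call psf_hwpc_start(ierr)
c     Some work to measure
      do i = 1, 10000000
         s = s + dble(i)*1.000001d0
      enddo
c     Snapshot the counters; measurement continues afterwards
      call psf_hwpc_read(values, ierr)
      print *, 's =', s, '  first event count:', values(1)
c     Write results to perf.*.xml and clean up
      call psf_hwpc_stop('perf', ierr)
      call psf_hwpc_shutdown(ierr)
      end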

Sample Profile Excerpt

PerfSuite Hardware Performance Summary Report

Version            : 1.0
Created            : Fri Sep 04 04:21:30 PM CDT 2009
Generator          : psprocess 0.4
XML Source         : genidlest.0.168652.co-compute1.xml

Profile Information
==========================================================================
Class              : PAPI
Version            : 3.6.2
Event              : PAPI_TOT_CYC (Total cycles)
Period             : 14500000
Samples            : 17724
Domain             : all
Run Time           : 162.59 (seconds)
Min Self %         : (all)

Function:File:Line Summary
--------------------------------------------------------------------------
Samples   Self %   Total %   Function:File:Line
   5190   29.28%    29.28%   exchange_var:exchange_var.f:119
   1220    6.88%    36.17%   sgs_test_filter:sgs_test_filter.f:54
    723    4.08%    40.24%   sum_domain:sum_domain.f:55
    649    3.66%    43.91%   mpi_sendbuf:mpi_sendbuf.f:137

PerfSuite Profile Displayed with TAU's "ParaProf" Visualization Tool

• "Stacked" and "unstacked" views allow a quick overview of a parallel run; here they showed wide variability between threads in various subroutines within the code
• This led to suspicion of data layout among threads (locality)
• Profiling based on total cycles and processor stalls ("bubbles") isolated the offending routine(s)

Addressing Locality/First-Touch Policy

The PerfSuite profiling results pointed out suspect routines. VaTech examined them and found that code changes introduced since 2008 used F90 array-syntax initialization for local arrays; this was changed to a parallel version (the general idiom is sketched below):

Original version:

      buf2ds = 0.0; buf2dr = 0.0

Modified for first-touch policy:

c$omp parallel do private(m)
      do m = 1, m_blk(myproc)
         buf2ds(:,:,:,m) = 0.0
         buf2dr(:,:,:,m) = 0.0
      enddo
c$omp end parallel do

After these changes were implemented, the code performance improved by a further factor of nearly 2x and began to approach the target.
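Why this helps: on a NUMA system such as the Altix, a memory page is typically placed on the node of the thread that first touches it. The general idiom, in a minimal sketch of our own (array names and sizes are illustrative), is to initialize data with the same parallel loop structure that later computes on it, so each thread's pages land in its local memory:

      program first_touch_sketch
      implicit none
      integer, parameter :: n = 100000
      real*8 a(n), b(n)
      integer i
c     First touch: each thread initializes the elements it will later use
c$omp parallel do private(i)
      do i = 1, n
         a(i) = 0.0d0
         b(i) = dble(i)
      enddo
c$omp end parallel do
c     Compute phase with the same distribution: accesses stay node-local
c$omp parallel do private(i)
      do i = 1, n
         a(i) = a(i) + 2.0d0*b(i)
      enddo
c$omp end parallel do
      print *, 'a(n) =', a(n)
      end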

Runtime Improvements Due To Node Optimizations (8 procs)

Aside: Miscellaneous Items Uncovered During the Optimization Cycle

• As noted, NCSA's Cobalt is a hybrid system (a mix of Madison/Montvale Itanium 2)
  – These CPUs also differ in the number of available PMUs (performance monitoring units): Madison has 4, Montvale has 12
  – The PAPI 3.x library decides at build time (for PAPI) how much space to allocate for registers; as a result, separate library builds would be necessary to work on both machines
  – After reporting and discussing this with the PAPI team, work was done by Haihang You and Dan Terpstra to address the deficiency. The mods to PAPI were not released in version 3.x, but were made generally available in the recently released PAPI 4.x (PAPI-C). This benefits the community as a whole!

Miscellaneous Items Uncovered (cont'd)

• VaTech noted an unusual discrepancy between internal time (collected via MPI_Wtime()) and PerfSuite-reported wall-clock time. We realized that the ordering of the two mattered, especially in multi-threaded (OpenMP) runs
• Reason? Output in "stop" is serialized among threads (to minimize filesystem contention), so when the PerfSuite stop falls inside the MPI_Wtime bracket (Ordering 1 below), the serialized output is counted in the MPI timing
• Comparisons of MPI/PerfSuite times are important to validate results and provide a sanity check

Ordering 1 (PerfSuite measurement inside the MPI_Wtime bracket):

  start = MPI_Wtime()
  start PerfSuite measurement
  compute
  stop PerfSuite measurement
  end = MPI_Wtime()
  MPI_time = end - start

Ordering 2 (MPI_Wtime bracket inside the PerfSuite measurement):

  start PerfSuite measurement
  start = MPI_Wtime()
  compute
  end = MPI_Wtime()
  stop PerfSuite measurement
  MPI_time = end - start
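As a concrete illustration (our own sketch, not from the slides), Ordering 1 in Fortran, assumed to run as a fragment inside an already-initialized MPI program, with the compute phase elided:

      include 'fperfsuite.h'
      include 'mpif.h'
      double precision t0, t1
      integer ierr
c     Bracket the PerfSuite measurement with MPI_Wtime (Ordering 1)
      t0 = MPI_Wtime()
      call psf_hwpc_init(ierr)
      call psf_hwpc_start(ierr)
c     ... compute ...
      call psf_hwpc_stop('perf', ierr)
      call psf_hwpc_shutdown(ierr)
      t1 = MPI_Wtime()
c     This bracket includes the serialized output performed in "stop";
c     compare t1 - t0 against PerfSuite's reported wall-clock time
      print *, 'MPI_Wtime bracket:', t1 - t0, ' seconds'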

Miscellaneous Items Uncovered (cont'd)

• For runs at high processor counts, the nature of the measurement can impact the system. These issues were known since the NCSA Altix was installed:
  – PAPI "multiplexing" is achieved through regular interrupts
  – Profiling through statistical sampling also generates regular interrupts
  – A single system image with large numbers of processors must deal with interrupts being generated concurrently... and can become overwhelmed
• We have adjusted the default interrupt frequency on Cobalt when using PerfSuite to help address this
• Linux kernel developers have modified the relevant code for scalability
  – The changes are implemented in the upcoming Altix UV system and its associated software

A Final Hurdle with Higher Processor-Count Jobs

• With the initial optimizations/changes implemented using smaller (8-32 processor) jobs, moving to the largest run (256 procs) gave:

  MPI: mmap failed (memmap_base) for 8068448256 pages (132193456226304 bytes) on 256 ranks
    Attempted byte sizes of regions:
      static+heap 0x76c97d4000
      symheap     0x0
      stack       0x17132c000
      mpibuf      0x0
  MPI: daemon terminated: co-compute2 - job aborting

• The initial suspicion was that the memory reserved for PerfSuite profiling (sample) buffers may have had excessive requirements
  – Internal memory usage tracking showed ~70MB/thread, so this was not the source of the problem
• Setting MPI_MEMMAP_OFF to disable SGI MPT's memory-mapping optimizations allowed the jobs to complete

First "Legitimate" Timings: OpenMP

              co-compute1          co-compute2
  Processors  Time     Speedup     Time     Speedup
  1           7224     1           4988     1
  2           3559     2.03        3057     1.63
  4           1878     3.85        1686     2.96
  8           1080     6.69        1012     4.93
  16          816      8.85        316      15.78
  32          141      51.23       167      29.87
  64          73       98.96       74       67.41
  128         50       144.48      50       99.76
  256         37       195.24      39       127.9

• Initial interpretation at VaTech: superlinear scaling occurring between 32-128 processors on co-compute1, and at 64 processors on co-compute2
• Why the substantial differences between two "identical" machines at low core counts?

Speedup (Main Timestep Loop)

Machine Comparison

• Although co-compute1 and co-compute2 are in many ways identical, there is an important difference:

                     Co-compute1            Co-compute2
  Model              SGI Altix 3700         SGI Altix 3700
  Processor type     Itanium 2 (Madison)    Itanium 2 (Madison)
  Clock speed        1.6 GHz                1.6 GHz
  Local memory/node  4 GB                   12 GB

• These runs use ~50GB of memory, more than can be serviced from a single node at low core counts (a rough node-count check follows)
• The cost of remote memory access for the lower core count runs resulted in sublinear scaling. This is more pronounced on co-compute1, since each of its nodes has only 1/3 the memory of a co-compute2 node
• Additional runs using the "dlook" utility revealed exactly how many pages were allocated across how many nodes
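As a rough check (our arithmetic, using the ~50 GB figure and the per-node memory sizes above), the minimum number of nodes whose combined local memory can hold the working set differs by roughly 3x between the machines:

  ceil(50 GB / 4 GB per node)  = 13 nodes on co-compute1
  ceil(50 GB / 12 GB per node) =  5 nodes on co-compute2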

Timings (Main Timestep Loop)

• Chart annotation: at this point, 8 nodes (co-compute2) supply 96 GB of local memory, enough to hold the ~50 GB problem

Development/Optimization Observations

• Many runs were made to arrive at the current "optimized" version
  – By geographically distinct groups
  – With multiple compiler versions
  – With in-progress code changes
  – With various compiler flags
  – With multiple performance tools
• It is very easy to get buried in the volume of data
• Consistency in recording results is critical
  – But you cannot control how others handle this
• Hence the need for performance regression testing and tracking

The TAU Portal

• Web-based access to performance data
• Supports collaborative performance study
  – Secure performance data sharing
  – Does not require a TAU installation
  – Launch TAU performance tools via Java WebStart: ParaProf, PerfExplorer
• http://tau.nic.uoregon.edu/

TAU Portal Entry Page

• Access is free of charge
• Create your account at this page
• Passwords are required to access/upload data
  – Do not reuse an existing password; security is light

TAU Portal Workspaces

• The basic unit of organization for performance experiments
• Can be shared between users
• Each experiment is initially shown as "metadata"
• ParaProf can be launched directly from a workspace

Example Basic ParaProf Display

• Bar-chart displays of profiles are a commonly used display technique (we showed one earlier with PerfSuite data)
• All experiment trials previously uploaded to the portal are accessible and can be viewed and compared directly from the portal

Using the TAU Portal from a Batch Job

• It is extremely easy to incorporate uploading of data to the portal from a batch job through command-line utilities
• For TAU-generated profiles:

  paraprof --pack myprof.ppk profile.*

• For PerfSuite-generated data:

  paraprof --pack myprof.ppk *.xml

• This results in a "packed" data file; to upload it:

  tau_portal.py up -u name -p pw -w wkspace \
      -e exp packed_data_file

For More Information

• GenIDLEST
  http://www.hpcfd.me.vt.edu/codes.shtml
• PerfSuite
  http://perfsuite.ncsa.uiuc.edu/
  http://perfsuite.sourceforge.net/
• POINT
  http://nic.uoregon.edu/point/

