Scalable Methods for Monitoring and Detecting Behavioral Equivalence Classes in Scientific Codes*

Todd [email protected]

Rob [email protected]

Daniel A. Reed [email protected]

Renaissance Computing Institute
University of North Carolina at Chapel Hill

Abstract

Emerging petascale systems will have many hundreds of thousands of processors, but traditional task-level tracing tools already fail to scale to much smaller systems because the I/O backbones of these systems cannot handle the peak load offered by their cores. Complete event traces of all processes are thus infeasible. To retain the benefits of detailed performance measurement while reducing the volume of collected data, we developed AMPL, a general-purpose toolkit that reduces data volume using stratified sampling.

We adopt a scalable sampling strategy, since the sample size required to measure a system varies sub-linearly with process count. By grouping, or stratifying, processes that behave similarly, we can further reduce data overhead while also providing insight into an application's behavior.

In this paper, we describe the AMPL toolkit and we report our experiences using it on large-scale scientific applications. We show that AMPL can successfully reduce the overhead of tracing scientific applications by an order of magnitude or more, and we show that our tool scales sub-linearly, so the improvement will be more dramatic on petascale machines. Finally, we illustrate the use of AMPL to monitor applications by performance-equivalent strata, and we show that this technique can allow for further reductions in trace data volume and traced execution time.

1 Introduction

Processor counts in modern supercomputers are rising rapidly. Of the systems on the current Top500 list [13], over 400 are labeled as distributed-memory "clusters", up from just over 250 in 2005. The mean processor count of systems in the Top 100 has risen exponentially in the past decade. In 1997, the fastest system had just over 1,000 processors, while the current performance leader, IBM's Blue Gene/L [5], has over 200,000 cores. Only one system in the current top 100 has fewer than 1,000 processors.

*Part of this work was performed under the auspices of the SciDAC Performance Engineering Research Institute, grant number DE-FC02-06ER25764.

Effectively monitoring highly concurrent systems is a daunting challenge. An application event trace can generate hundreds of megabytes of data for each minute of execution time, and this data needs to be stored and analyzed offline. However, the largest supercomputers are using diskless nodes. For example, Blue Gene/L at Lawrence Livermore National Laboratory has 106,496 diskless nodes for computation, but only 1,664 I/O nodes. Each I/O node is connected by gigabit ethernet to a network of 224 I/O data servers. Peak throughput of this system is around 25 GB/s [16]. A full trace from 212,992 processors could easily saturate this pathway, perturbing measurements and making the recorded trace useless.

Even if a large trace could be collected and stored efficiently, traces from petascale systems would contain far more data than could be analyzed manually. Fortunately, Amdahl's law constrains scalable applications to exhibit extremely regular behavior. A scalable performance monitoring system could exploit such regularity to remove redundancies in collected data so that its outputs would not depend on total system size. An analyst using such a system could collect just enough performance data to assess application performance, and no more.

Using simulation and ex post facto experiments, Mendes et al. [12] showed that statistical sampling is a promising approach to the data reduction problem. It can be used to accurately estimate the global properties of a population of processes without collecting data from all of them. Sampling is particularly well-suited to large systems, since the sample size needed to measure a set of processes scales sub-linearly with the size of the set. For data with fixed variance, the sample size is constant in the limit, so sampling very large populations of processes is proportionally much less costly than measuring small ones.


In this paper, we extend Mendes' work with infrastructure for on-line, sampled event tracing of arbitrary performance metrics gathered using on-node instrumentation. Summary data is collected dynamically and used to tune the sample size as a run progresses. We also explore the application of techniques for subdividing, or stratifying, a population into independently sampled behavioral equivalence classes. Stratification can provide insight into the workings of an application, as it gives the analyst a rough classification of the behavior of running processes. If the behavior within each stratum is homogeneous, the overall cost of monitoring is reduced. We have implemented these techniques in the Adaptive Monitoring and Profiling Library (AMPL), which can be linked with instrumented applications written in C, C++, or FORTRAN.

We review the statistical methods used in this paper in §2. We describe the architecture and implementation of AMPL in §3. An experimental validation of AMPL is given in §4 using sPPM [1], Chombo [3], and ADCIRC [11], three well-known scientific codes. Finally, §5 discusses related work, and §6 details conclusions drawn from our results as well as plans for future work.

2 Statistical Sampling Theory

Statistical sampling has long been used in surveys and opinion polls to estimate general characteristics of populations by observing the responses of only a small subset, or sample, of the total population. Below, we review the basic principles of sampling theory, and we present their application to performance monitoring on large-scale computing systems. We also discuss stratified sampling and its role in reducing measurement overhead in scientific applications.

2.1 Estimating Mean Values

Given a set of population elements $Y$, sampling theory estimates the mean using only a small sample of the total population. For sample elements $y_1, y_2, \ldots, y_n$, the sample mean $\bar{y}$ is an estimator of the population mean $\bar{Y}$. We would like to ensure that the value of $\bar{y}$ is within a certain error bound $d$ of $\bar{Y}$ with some confidence. If we denote the risk of not falling within the error bound as $\alpha$, then the confidence is $1 - \alpha$, yielding

$$\Pr(|\bar{Y} - \bar{y}| > d) \le \alpha. \qquad (1)$$

Stated differently, $z_\alpha$ standard deviations of the estimator should fall within the error bound:

$$z_\alpha \sqrt{\mathrm{Var}(\bar{y})} \le d, \qquad (2)$$

where $z_\alpha$ is the normal confidence interval computed from the confidence bound $1 - \alpha$. Given the variance of an estimator for the population mean, this inequality can be solved to obtain a minimum sample size, $n$, that will satisfy the constraints $z_\alpha$ and $d$. For a simple random sample, we have

$$n \ge N \left[ 1 + N \left( \frac{d}{z_\alpha S} \right)^2 \right]^{-1} \qquad (3)$$

where $S$ is the standard deviation of the population, and $N$ is the total population size. The estimation of mean values is described in [19, 12], so we omit further elementary derivations. However, two aspects of (3) warrant emphasis. First, (3) implies that the minimum cost of monitoring a population depends on its variance. Given the same confidence and error bounds, a population with high variance requires more sampled elements than a population with low variance. Intuitively, highly regular SPMD codes with limited data-dependent behavior will benefit more from sampling than will more irregular, dynamic codes.

Second, as $N$ increases, $n$ approaches $(z_\alpha S / d)^2$, and the relative sampling cost $n/N$ becomes smaller. For a fixed sample variance, the relative cost of monitoring declines as system size increases. As mentioned, sample size is constant in the limit, so sampling can be extremely beneficial for monitoring very large systems.
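To make (3) concrete, the sketch below computes the minimum simple-random-sample size. It is our own Python illustration, not part of AMPL; it uses SciPy's normal quantile for $z_\alpha$:

    import math
    from scipy.stats import norm

    def min_sample_size(N, S, confidence, d):
        """Minimum simple-random-sample size n from equation (3).
        N: population size; S: population standard deviation;
        confidence: e.g. 0.90; d: error bound on the mean."""
        alpha = 1.0 - confidence
        z_alpha = norm.ppf(1.0 - alpha / 2.0)  # two-sided normal deviate
        n = N / (1.0 + N * (d / (z_alpha * S)) ** 2)
        return int(math.ceil(n))

As $N \to \infty$, this value approaches the constant limit $(z_\alpha S / d)^2$ discussed above.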

2.2 Sampling Performance Metrics

Formula (3) suggests that one can substantially reduce the number of processes monitored in a large parallel system, but we must modify it slightly for sampled traces. Formula (3) assumes that the granularity of sampling is in line with the events to be estimated. However, our population consists of $M$ processes, each executing application code with embedded instrumentation. Each time control passes to an instrumentation point, some metric is measured for a performance event $Y_i$. Thus, the population is divided hierarchically into primary units (processes) and secondary units (events). Each process "contains" some possibly changing number of events, and when we sample a process, we receive all of its data. We must account for this when designing our sampling strategy.

A simple random sample of primary units in a partitioned population is formally called cluster sampling, where the primary units are "clusters" of secondary units. Here, we give a brief overview of this technique as it applies to parallel applications. A more extensive treatment of the mathematics involved can be found in [19].

We are given a parallel application running on $M$ processes, and we want to sample it repeatedly over some time interval. The $i$th process has $N_i$ events per interval, such that

$$\sum_{i=1}^{M} N_i = N. \qquad (4)$$


Events on each process are $Y_{ij}$, where $i = 1, 2, \ldots, M$ and $j = 1, 2, \ldots, N_i$. The population mean $\bar{Y}$ is simply the mean over the values of all events:

$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{N_i} Y_{ij}. \qquad (5)$$

We wish to estimate $\bar{Y}$ using a random sample of $m$ processes. The counts of events collected from the sampled processes are referred to as $n_i$. $\bar{Y}$ can be estimated from the sample values with the cluster sample mean:

$$\bar{y}_c = \frac{\sum_{i=1}^{m} y_{iT}}{\sum_{i=1}^{m} n_i}, \qquad (6)$$

where $y_{iT}$ is the total of all sample values collected from the $i$th process. The cluster mean $\bar{y}_c$ is then simply the sum of all sample values divided by the number of events sampled.

Given that $\bar{y}_c$ is an effective estimator for $\bar{Y}$, one must choose a suitable sample size to ensure statistical confidence in the estimator. To compute this, we need the variance, given by:

$$\mathrm{Var}(\bar{y}_c) = \frac{M - m}{M m \bar{N}^2} s_r^2, \qquad s_r^2 = \frac{\sum_{i=1}^{m} (y_{iT} - \bar{y}_c n_i)^2}{m - 1} \qquad (7)$$

where $\bar{N}$ is the average number of events for each process in the primary population, and $s_r^2$ is an estimator for the secondary population variance $S^2$. We can use $\mathrm{Var}(\bar{y}_c)$ in (2) and obtain an equation for sample size as follows:

$$m = \frac{M s_r^2}{M \bar{N}^2 V + s_r^2}, \qquad V = \left( \frac{d}{z_\alpha} \right)^2 \qquad (8)$$

The only remaining unknown is $N$, the total number of events. For this, we can use a straightforward estimator, $N \approx M n / m$, where $n = \sum_{i=1}^{m} n_i$ is the number of events observed in the sample (equivalently, $\bar{N} \approx n/m$). We can now use equation (8) for adaptive sampling. Given an estimate for the variance of the event population, we can calculate approximately the size, $m$, of our next sample.
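The following sketch shows how equations (6)-(8) combine into one adaptive step. It is our own Python illustration, not AMPL's C++ interface: given the per-process totals and event counts from the current sample, it returns the size of the next sample:

    import math
    from scipy.stats import norm

    def next_sample_size(M, totals, counts, confidence, d):
        """Next cluster-sample size m from equations (6)-(8).
        M: total processes; totals: y_iT for each sampled process;
        counts: n_i for each sampled process."""
        m = len(totals)
        n = sum(counts)                      # total events in the sample
        y_c = sum(totals) / n                # cluster sample mean, eq. (6)
        N_bar = n / m                        # estimated mean events per process
        s_r2 = sum((t - y_c * c) ** 2
                   for t, c in zip(totals, counts)) / (m - 1)  # eq. (7)
        z_alpha = norm.ppf(1.0 - (1.0 - confidence) / 2.0)
        V = (d / z_alpha) ** 2
        m_next = M * s_r2 / (M * N_bar ** 2 * V + s_r2)        # eq. (8)
        return max(1, int(math.ceil(m_next)))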

2.3 Stratified Sampling

Parallel applications often have behavioral equivalence classes among their processes, and this is reflected in performance data about the application. For example, if process zero of an application reads input data, manages checkpoints, and writes results, the performance profile of process zero will differ from that of the other processes. Similar situations arise from spatial and functional decompositions or master-worker paradigms.

One can exploit this property to reduce real-time monitoring overhead beyond what is possible with application-wide sampling. This is a commonly used technique in the design of political polls and sociological studies, where it may be very costly to survey every member of a population [19]. The communication cost of monitoring is the direct analog of this for large parallel applications.

Equation (3) shows that the minimum sample size is strongly correlated with the variance of the sampled data. Intuitively, if a process population has a high variance, and thus a large minimum sample size for given confidence and error constraints, one can reduce the sampling requirement by partitioning the population into lower-variance groups.

Consider the case where there are $k$ behavioral equivalence classes, or strata, in a population of $N$ processes, with sizes $N_1, N_2, \ldots, N_k$; means $\bar{Y}_1, \bar{Y}_2, \ldots, \bar{Y}_k$; and variances $S_1^2, S_2^2, \ldots, S_k^2$. Assume further that in the $i$th stratum, one uses a sample size $n_i$, calculated with (8). $\bar{Y}$ can be estimated as $\bar{y}_{st} = \sum_{i=1}^{k} w_i \bar{y}_i$, using the strata sample means $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_k$.

The weights $w_i = N_i / N$ are simply the ratios of stratum sizes to total population size, and $\bar{y}_{st}$ is the stratified sample mean. This is more efficient than $\bar{y}$ when:

$$\sum_{i=1}^{k} N_i (\bar{Y}_i - \bar{Y})^2 > \frac{1}{N} \sum_{i=1}^{k} (N - N_i) S_i^2. \qquad (9)$$

In other words, when the variance between strata is significantly higher than the variance within strata, stratified sampling can reduce the number of processes that must be sampled to estimate the stratified sample means. For performance analysis, stratification gives insight into the structure of processes in a running application. The stratified sample means provide us with measures of the behavioral properties of separate groups of processes, and an engineer can use this information to assess the performance of his code.
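As a worked illustration of inequality (9), the sketch below (our own Python; in practice the stratum means and variances would be estimated from summary data) computes the stratified mean and tests whether stratification pays off:

    def stratified_mean(means, sizes):
        """y_st = sum_i w_i * y_i with weights w_i = N_i / N."""
        N = sum(sizes)
        return sum((Ni / N) * yi for yi, Ni in zip(means, sizes))

    def stratification_helps(means, variances, sizes):
        """Inequality (9): between-strata variation vs. within-strata variance."""
        N = sum(sizes)
        Y = stratified_mean(means, sizes)
        between = sum(Ni * (Yi - Y) ** 2 for Yi, Ni in zip(means, sizes))
        within = sum((N - Ni) * S2 for S2, Ni in zip(variances, sizes)) / N
        return between > within

For the process-zero example above, a singleton stratum containing only process zero contributes no within-stratum variance, so stratification tends to help when its mean differs markedly from the rest of the population.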

3 The AMPL Library

The use of sampling to estimate scalar properties of populations of processes has been studied before [12]. We have built the Adaptive Monitoring and Profiling Library (AMPL), which uses the analysis described in §2 as a heuristic to sample arbitrary event traces at runtime.

AMPL collects and aggregates summary statistics from each process in a running parallel application. Using the variance of the sampled measurements, it calculates a minimum sample size as described in §2. AMPL dynamically monitors variance, and it periodically updates the sample size to fit the monitored data. This sampling can be performed globally, across all running processes, or the user can specify groups of processes to be sampled independently.

3.1 AMPL Architecture

The AMPL runtime is divided functionally into two components: a central client and per-process monitoring agents. Agents selectively enable and disable an external trace library. The monitored execution is divided into a sequence of update intervals. Within each update interval is a sequence of data collection windows. The agents enable or disable collection for an entire window. They also accumulate summary data across the entire update interval, and they send the data to the client at the end of the interval. The client then calculates a new sample size based on the variance of the monitored data, randomly selects a new sample set, and sends an update to monitored nodes. A monitoring agent receives this update and adopts the new sampling policy for the duration of the interval. This process repeats until the monitored application's execution completes. Figure 1 shows the phases of this cycle in detail.

Figure 1. AMPL Runtime Sampling. The client process is at center, sampled processes are in white, and unsampled processes are dark. Arrows show communication; sample intervals are denoted by $w_i$.

Interaction between the client and agents enables AMPL to adapt to changing variance in measured performance data. The user can configure which points in the code are used to determine AMPL's windows, the number of windows between updates from the client, and confidence and error bounds for the adaptive monitoring. As discussed in §2.3, these confidence and error bounds also affect the volume of collected data, giving AMPL an adaptive control to either increase accuracy or decrease trace volume and I/O overhead. Thus, traces using AMPL can be tuned to match the bandwidth restrictions of its host system.
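Schematically, one update interval of the client's cycle looks like the following. This is a simplified Python sketch under our own naming; the real client is part of AMPL's C++ runtime and communicates via its MPI-based layer:

    import random

    def client_update_interval(summaries, M, confidence, d, next_sample_size):
        """Process one update interval: recompute the sample size from the
        reported summaries and draw a new random sample set.
        summaries: (total, event_count) pairs from currently sampled agents;
        next_sample_size: an estimator such as the eq. (6)-(8) sketch in section 2.2.
        Returns the set of ranks to monitor for the next interval."""
        totals = [t for t, _ in summaries]
        counts = [c for _, c in summaries]
        m = next_sample_size(M, totals, counts, confidence, d)
        return set(random.sample(range(M), m))  # new random sample set of size m

How the chosen set is communicated to the agents is the role of the update mechanisms described in §3.2.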

An AMPL user can also elect to monitor subgroups of an application's processes separately. Per-group monitoring is similar to the global monitoring described here.

3.2 Modular Communication

AMPL is organized into layers. Initially, we implemented a communication layer in MPI, for close integration with the scientific codes AMPL was designed to monitor. AMPL is not tied to MPI, and we have implemented the communication layer modularly to allow for integration with other libraries and protocols. Client-to-agent sampling updates and agent-to-client data transport can be specified independently. Figure 2 shows the communication layer in the context of AMPL's high-level architectural design.

Figure 2. AMPL Software Architecture

It is up to the user of AMPL to set the policy for the implementation of the random sampling of monitored processes. When the client requests that a population's sample set be updated, it specifies only the number, $m$, of processes in the population of $M$ that should be monitored, not their specific ranks. The update mechanism sends each agent a probability in $[0, 1]$ that determines with what probability the agent enables data collection in its process.

We provide two standard update mechanisms. The subset update mechanism selects a fixed sample set of processes that will report at each window until the next update. The processes in this subset are instructed to collect data with probability 1; all other processes receive 0. This gives consistency between windows, but it may accumulate sample bias if the number of windows per update interval is set too large. The global update policy uniformly sends $m/M$ to each agent. Thus, in each window the expected number of agents that will collect data is $m$. This makes for more random sampling at the cost of consistency. It also requires that all agents receive the update.
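Both mechanisms amount to different assignments of per-agent collection probabilities. A minimal sketch (our own Python illustration; the function names are hypothetical, not AMPL's API):

    import random

    def subset_update(M, m):
        """Subset policy: m fixed ranks collect with probability 1, others 0."""
        sample = set(random.sample(range(M), m))
        return [1.0 if rank in sample else 0.0 for rank in range(M)]

    def global_update(M, m):
        """Global policy: every agent collects with probability m/M, so the
        expected number of reporting agents in each window is m."""
        return [m / M for _ in range(M)]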

Figure 3. Update mechanisms. Outer circles are monitored processes, labeled by their probability of recording trace data; the client is shown at center. (a) Global: every process receives probability m/M (here .25). (b) Subset: m processes receive probability 1, the rest 0.

The desirability of each of our update policies depends on two factors: (a) the efficiency of the primitives available for global communication, and (b) the need for multiple samples over several time windows from the same subset of the processes. To produce a simple statistical characterization of system or application behavior, global update has the advantage that its samples are truly random. However, if one desires performance data from the same nodes for a long period (e.g., to compute a performance profile for each sampled node), the subset update mechanism is needed. Figure 3 illustrates these policies.

3.3 Tool Integration

AMPL is written in C++, and it is designed to accept data from existing data collection tools. It provides C and C++ bindings for its external interface, and it can label performance events either by simple integer identifiers or by callpaths. Multiple performance metrics can be monitored simultaneously, so that data gathered from hardware performance counter APIs like PAPI [2] can be recorded along with timing information.

AMPL contains no measurement or tracing tools of its own. We integrated AMPL with the University of Oregon's Tuning and Analysis Utilities (TAU) [20], a widely used source-instrumentation toolkit for many languages, including C, C++, and FORTRAN. TAU uses PAPI and various timer libraries as data sources. We modified TAU's profiler to pass summary performance data to AMPL for online monitoring. The integration of AMPL with TAU required only a few hundred lines of code and slight modifications so that TAU could dynamically enable and disable tracing under AMPL's direction. Other tracing and profiling tools could be integrated with a similar level of effort.

AMPL is intended to be used on very large systems such as IBM's Blue Gene/L [5], Cray's XT3 [21], and Linux clusters [9, 6]. As such, we designed its routines to be called from within source-level instrumentation, as compute nodes on architectures like BlueGene do not currently support multiple processes, threads, or any other OS-level concurrency. All analyses and communication of data are driven by calls to AMPL's data-collection hooks.

    WindowsPerUpdate = 4
    UpdateMechanism  = Subset
    EpochMarker      = "TIMESTEP"

    Metrics {
        "WALL_CLOCK"   Report
        "PAPI_FP_INS"  Guide
    }

    Group {
        Name       = "Adaptive"
        Members    = 0-127
        Confidence = .90
        Error      = .03
    }

    Group {
        Name        = "Static"
        SampleSize  = 30
        Members     = 128-255
        PinnedNodes = 128-137
    }

Figure 4. AMPL Configuration File

3.4 Usage

To monitor an application, an analyst first compiles the application using the AMPL-enabled TAU. This automatically links the resulting executable with our library. AMPL runtime configuration and sampling parameters can be adjusted using a configuration file; see Figure 4.

This configuration file uses the TIMESTEP procedure to delineate sample windows. During the execution of TIMESTEP, summary data is collected from monitored processes. The system adaptively updates the sample size every 4 windows, based on the variance of data collected in the intervening windows. Subset sampling is used to send updates.

The user has specified two groups, each to be sampled independently. The first group, labeled Adaptive, consists of the first 128 processes. This group's sample size will be recalculated dynamically to yield a confidence of 90% and error of 3%, based on the variance of floating-point instruction counts. Wall-clock times of instrumented routines will be reported but not guaranteed to fall within the confidence or error bounds.


The explicit SampleSize directive causes the second group to be monitored statically. AMPL will monitor exactly 30 processes from the second 128 processes in the job. The PinnedNodes directive tells AMPL that nodes 128 through 137 should always be included in the sample set, with the remaining 20 randomly chosen from the group's members. Fine-grained control over adaptation policies for particular call sites is also provided, and this can be specified in a separate file.

4 Experimental Results

To assess the performance of the AMPL library and its efficacy in reducing monitoring overhead and data volume, we conducted a series of experiments using three well-known scientific applications. Here, we describe our tests. Our environment is covered in §4.1-§4.2. We measure the cost of exhaustive tracing in §4.3, and in §4.4, we verify the accuracy of AMPL's measurement using a small-scale test. In §4.5-§4.7, we measure AMPL's overhead at larger scales. We provide results varying sampling parameters and system size. Finally, we use clustering techniques to find strata in applications, and we show how stratified sampling can be used to further reduce monitoring overhead.

4.1 Setup

Our experiments were conducted on two systems. The first is an IBM Blue Gene/L system with 2048 dual-core, 700 MHz PowerPC compute nodes. Each node has 1 GB RAM (512 MB per core). The interconnect consists of a 3-D torus network and two tree-structured networks. On this particular system there is one I/O node per 32 compute nodes. I/O nodes are connected via gigabit ethernet to a switch, and the switch is connected via 8 links to an 8-node file server cluster using IBM's General Parallel File System (GPFS). All our experiments were done in a file system fronted by two servers. We used IBM's xlC compilers and IBM's MPI implementation.

Our second system is a Linux cluster with 64 dual-processor, dual-core Intel Woodcrest nodes. There are a total of 256 cores, each running at 2.6 GHz. Each node has 4 GB RAM, and Infiniband 4X is the primary interconnect. The system uses NFS for the shared file system, with an Infiniband switch connected to the NFS server by four channel-bonded gigabit links. We used the Intel compilers and OpenMPI. OpenMPI was configured to use Infiniband for communication between nodes and shared memory within a node.

4.2 Applications

We used the following three scientific applications to test our library.

sPPM. ASCI sPPM [1] is a gas dynamics benchmark designed to mimic the behavior of classified codes run at Department of Energy national laboratories. sPPM is part of the ASCI Purple suite of applications and is written in Fortran 77. The sPPM algorithm solves a 3-D gas dynamics problem on a uniform Cartesian mesh. The problem is statically divided (i.e., each node is allocated its own portion of the mesh), and this allocation does not change during execution. Thus, computational load on sPPM processes is typically well balanced because each processor is allocated exactly the same amount of work.

ADCIRC. The Advanced Circulation Model (ADCIRC) is a finite-element hydrodynamic model for coastal regions [11]. It is currently used in the design of levees and for predicting storm-surge inundation caused by hurricanes. It is written in Fortran 77. ADCIRC requires its input mesh to be pre-partitioned using the METIS [7] library. Static partitioning with METIS can result in load imbalances at runtime, and, as such, behavior across ADCIRC processes can be more variable than that of sPPM.

Chombo. Chombo [3] is a library for block-structured adaptive mesh refinement (AMR). It is used to solve a broad range of partial differential equations, particularly for problems involving many spatial scales or highly localized behavior. Chombo provides C++ classes and data structures for building adaptively refined grids. The Chombo package includes a Godunov solver application [4] for modeling magnetohydrodynamics in explosions. Our tests were conducted using this application and the explosion input set provided with it.

4.3 Exhaustive Monitoring

We ran several tests using sPPM on BlueGene/L to measure the costs of exhaustive tracing. First, we ran sPPM uninstrumented and unmodified for process counts from 32 to 2048. Next, to assess worst-case tracing overhead, we instrumented all functions in sPPM with TAU and ran the same set of tests with tracing enabled. In trace mode, TAU records timestamps for function entries and exits, as well as runtime information about MPI messages. Because performance engineers do not typically instrument every function in a code, we ran the same set of tests with only the SPPM and RUNHYD subroutines instrumented.

Figure 5. Data volume and timing for sPPM on Blue Gene/L using varied instrumentation. (a) Timings: total time for a double timestep vs. number of processes (32-2048) for uninstrumented sPPM, runs with only SPPM instrumented, and full instrumentation. (b) Data volume in bytes vs. number of processes, for runs with only SPPM instrumented and with full instrumentation.

Figure 5(a) shows timings for each of our traced runs. It is clear from the figure that trace monitoring overhead scales linearly with the number of processes after 128 processes.

Figure 5(b) shows the data volume for the traced runs. As expected, data volume increases linearly with the number of monitored processes. For runs with only SPPM instrumented, approximately 11 megabytes of data were produced per process, per double-timestep. For exhaustive instrumentation, each process generated 92 megabytes of data. For 2048 processes, this amounts to 183 gigabytes of data for just two timesteps of the application. Extrapolating linearly, a full two-step trace on a system the size of BlueGene/L at LLNL would consume 6 terabytes.

4.4 Sample Accuracy

AMPL uses the techniques described in §2 as a heuristic for the guided sampling of vector-valued event traces. Since we showed in §4.3 that it is impossible to collect an exhaustive trace from all nodes in a cluster without severe perturbation, we ran the verification experiments at small scale.

As before, we used TAU to instrument the SPPM and RUNHYD subroutines of sPPM. We measured the elapsed time of SPPM, and we used the return from RUNHYD to delineate windows. RUNHYD contains the control logic for each double-timestep that sPPM executes, so this is roughly equivalent to sampling AMPL windows every two timesteps.

We ran sPPM on 32 processes of our Woodcrest cluster with AMPL tracing enabled and with confidence and error bounds set to 90% and 8%, respectively. To avoid the extreme perturbation that occurs when the I/O system is saturated, we ran with only one active CPU per node and we recorded trace data to the local disk on each node. Instead of disabling tracing on unsampled nodes, we recorded full trace data from all 32 processes, and we marked the sample set for each window of the run. This way, we know which subset of the exhaustive data would have been collected by AMPL, and we can compare the measured trace to a full trace of the application. Our exhaustive traces were 20 total timesteps long and required a total of 29 gigabytes of disk space for all 32 processes.

Measuring trace similarity is not straightforward, so we used a generalization of the confidence measure to evaluate our sampling. We modeled each collected trace as a polyline, as per Lu and Reed [10], with each point on the line representing the value being measured; in this case, the time taken by one invocation of the SPPM subroutine.

Let $p_i(t)$ be the event trace collected from each process in the system. We define the mean trace $\bar{p}(t)$ for $M$ processes to be:

$$\bar{p}(t) = \frac{1}{M} \left( p_1(t) + p_2(t) + \cdots + p_M(t) \right)$$

We define the trace confidence, $c_{trace}$, for a given run to be the percentage of time that the mean trace of sampled processes, $\bar{p}_s(t)$, is within an error bound, $d$, of the mean trace over all processes, $\bar{p}_{exh}(t)$:

$$c_{trace} = \frac{1}{T} \int_0^T X(t)\, dt$$

$$X(t) = \begin{cases} 1 & \text{if } err(t) \le d \\ 0 & \text{if } err(t) > d \end{cases}, \qquad err(t) = \left| \frac{\bar{p}_s(t) - \bar{p}_{exh}(t)}{\bar{p}_{exh}(t)} \right|$$

where $T$ is the total time taken by the run.
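In discrete time (one value per trace point), $c_{trace}$ can be computed directly. The sketch below is our own NumPy illustration of the measure just defined, not AMPL code:

    import numpy as np

    def trace_confidence(p_sampled, p_exhaustive, d):
        """Fraction of time steps at which the sampled mean trace lies
        within relative error d of the exhaustive mean trace.
        Both arguments have shape (num_processes, num_steps)."""
        mean_s = p_sampled.mean(axis=0)       # mean trace of the sampled subset
        mean_exh = p_exhaustive.mean(axis=0)  # mean trace over all processes
        err = np.abs((mean_s - mean_exh) / mean_exh)
        return float((err <= d).mean())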

We calculated $c_{trace}$ for the full set of 32 monitored processes and for the samples that AMPL recommended. Figure 6 shows the first two seconds of the trace: $\bar{p}_{exh}(t)$ is shown in black, with $\bar{p}_s(t)$ superimposed in gray. The shaded region shows the error bound around $\bar{p}_{exh}(t)$; actual error is shown at bottom. For the first two seconds of the trace, the sampled portion is entirely within the error bound.

Figure 6. Mean trace (black) and sample mean trace (gray) for two seconds of a run of sPPM on a 32-node Woodcrest system.

We measured the error for all 20 timesteps of our sPPM run, and we calculated $c_{trace}$ to be 95.499% for our error bound of 8%. This is actually better than the confidence bound we specified for AMPL. We can attribute this high confidence to the fact that AMPL intentionally oversamples when it predicts very small samples (10 or fewer processes), and to sPPM's general lack of inter-node variability.

4.5 Data Volume and Runtime Overhead

We measured AMPL overhead as a function of sampling parameters. As in §4.4, we compiled all of our test applications to trace with TAU, and we enabled AMPL for all runs. In these experiments with a fixed number of processors, we varied confidence and error constraints from 90% confidence and 8% error at the lowest to exhaustive monitoring (100% confidence and 0% error). Only those processes in the sample set wrote trace data. Processes disabled by AMPL did not write trace data to disk until selected for tracing again.

For both sPPM and ADCIRC, we ran with 2048 processes on our BlueGene/L system. We ran Chombo with 128 processes on our smaller Woodcrest cluster.

Instrumentation

The routines instrumented varied from code to code, but we attempted to choose routines that would yield useful metrics for performance tuning. For sPPM, we instrumented the main timestep loop and the SPPM routine, and we measured elapsed time for each. Sample updates were set for every 2 time steps, and we ran a total of 20 time steps.

For ADCIRC, we added instrumentation only to the TIMESTEP routine and MPI calls. ADCIRC sends MPI messages frequently, and its time step is much shorter than sPPM's, so we set AMPL's window to 800 ADCIRC timesteps, and we used the mean time taken for calls to MPI_Waitsome() during each window to guide our sampling. MPI_Waitsome() is a good measure of load balance, as a large value indicates that a process is idle and waiting on others.

For Chombo, we instrumented the coarse timestep loop in the Godunov solver. This timestep loop is fixed-length, though the timestep routine subcycles smaller time steps when they are necessary to minimize error. Thus, the number of floating-point instructions per timestep can vary. We used PAPI [2] to measure the number of floating-point instructions per coarse timestep, and we set AMPL to guide the sample size using this metric.

Discussion

Figures 7(a) and 7(b) show the measured time and data overhead, respectively. Total data volume scales linearly with the total processes in the system, and in the presence of an I/O bottleneck, total time scales with the data volume. The experiments illustrate that AMPL is able to reduce both.

The elapsed time of the SPPM routine varies little between processes in the running application. Hence, the overhead of monitoring SPPM with AMPL is orders of magnitude smaller than the overhead of monitoring exhaustively. For 90% confidence and 8% error, monitoring a full 2048-node run of sPPM adds only 5% to the total time of an uninstrumented run. For both 99% confidence with 3% error and 95% confidence with 5% error, overheads were 8%. In fact, for each of these runs, all windows but the first have sample sizes of only 10 out of 2048 processes. Moreover, AMPL's initial estimate for sample size is conservative. It chooses the worst-case sample size for the requested confidence and error, which can exceed the capacity of the I/O bottleneck on BlueGene/L. If these runs were extended past 20 time windows, all of the overhead in Figure 7 would be amortized over the run.

For large runs, AMPL can reduce the amount of collected data by more than an order of magnitude. With a 90% confidence interval and 8% error tolerance, AMPL collects only 1.9 GB of performance data for 2048 processes, while sampling all processes would require over 21 GB of space. Even with a 99% confidence interval and a 3% error bound, AMPL never collects more than half as much data from sPPM as would exhaustive tracing techniques.

As sampling constraints are varied, the time overhead results for ADCIRC change similarly to the sPPM results. Using AMPL, total time for 90% confidence and 8% error is over an order of magnitude less than that of exhaustive monitoring. Data reduction is even greater: using AMPL with 90% confidence and 8% error, data volume for ADCIRC is 28 times smaller than an exhaustive trace.

Figure 7. AMPL trace overhead for three applications (Chombo, 128 procs, Woodcrest; sPPM, 2048 procs, BlueGene/L; ADCIRC, 2048 procs, BlueGene/L), varying confidence and error bounds. (a) Overhead as a percent of total execution time. (b) Output trace data volume in bytes.

Data overhead for ADCIRC is higher than for sPPM because ADCIRC is more sensitive to instrumentation. An uninstrumented run with 2048 processes takes only 155 seconds, but the shortest instrumented ADCIRC run took 355 seconds. With sPPM, we did not see this degree of perturbation. In both cases, we instrumented only key routines, but ADCIRC makes more frequent MPI calls and is more sensitive to TAU's MPI wrapper library. Since sPPM makes less frequent MPI calls, its running time is less perturbed by instrumentation.

For Chombo, the Woodcrest cluster's file system was able to handle the smaller load of 128-processor traces well for all our tests, and we do not see the degree of I/O serialization that was present in our BlueGene/L runs. There was no significant variation in the runtimes of the Chombo runs with AMPL, and even running with exhaustive tracing took about the same amount of time. However, the overhead of instrumentation in Chombo was high. In general, instrumented runs took approximately 15 times longer due to the large number of MPI calls Chombo makes. This overhead could be reduced if we removed instrumentation from the more frequently invoked MPI calls.

Data overhead for the 128-processor Chombo runs scales similarly to trace volume for ADCIRC and sPPM. Compared to an exhaustively traced run with 128 processes, we were able to reduce trace data volume by a factor of 15. As with both ADCIRC and sPPM, the sample size and data volume climb gradually as we tighten the confidence and error bounds.

4.6 Projected Overhead at Scale

§4.5 shows that AMPL's overhead can be tuned with user-defined sampling parameters, and it illustrates that AMPL can effectively improve trace overhead on relatively small systems. However, we would like to know how AMPL will perform on even larger machines.

We configured ADCIRC with TAU and AMPL as in §4.5, but in these experiments we fixed the sampling parameters and varied only the process count. Figure 8(a) shows the sample size and the amount of data collected by AMPL for each run, expressed as a fraction of the exhaustive case. Although the sample size increases as the system grows, the relative data volume decreases. For 2048 nodes, total volume is less than 10 percent of the exhaustive case. As mentioned in §2, sample size is constant in the limit, so we can expect this curve to level off after 2048 processes, and we can expect very large systems to require that increasingly smaller subsets of processes be sampled.

Figure 8(b) shows sample size and data volume for Chombo tracing as system size is increased. As with ADCIRC, we see that the fraction of data collected decreases for larger systems, as the minimum required sample size scales more slowly than the size of the system. For larger systems, we can thus expect to see comparatively smaller overhead and trace volume, just as we did for ADCIRC and sPPM.

4.7 Stratification

We now turn to an examination of AMPL's performance with stratification. As discussed in §2.3, if we know which groups of processes will behave similarly, we can sample each group independently and, in theory, reduce data volume. We use simple clustering algorithms to find groups in our performance data, and we observe further reductions in data volume achieved by using these groups to stratify samples taken with AMPL runs.

Clustering algorithms find sets of elements in their input data with minimal dissimilarity. Here we used a well-known algorithm, k-medoids [8], to analyze summary data from ADCIRC and to subdivide the population of processes into groups.

Figure 8. Scaling runs of Chombo and ADCIRC on two different machines, plotting data volume (% of exhaustive), sample size (% of total), and average sample size against total process count. (a) Data volume for ADCIRC on BlueGene/L (32-2048 processes). (b) Data volume for Chombo on our Woodcrest with Infiniband (8-128 processes). Note that in both instances, data volume and sample size decrease proportionally, but sample size increases absolutely.

Figure 9. Time and data volume vs. number of clusters for a stratified ADCIRC trace, measuring MPI_Waitsome(). (a) Data overhead and average total sample size. (b) Total execution time (seconds).

K-medoids requires that the user have a dissimilarity metric to compare the elements being clustered, and that the user specify k, the number of clusters to find. In our experiment, the elements are ADCIRC processes. Computing dissimilarity is slightly less straightforward. We configured ADCIRC as in §4.6 and recorded summary data for all 1024 processes in the run. After each window, each process logged the mean time of its calls to MPI_Waitsome() to a local log file. The logged data thus consisted of time-varying vectors of local means. For simplicity, we used the Euclidean distance between these vectors as our similarity measure.
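The clustering step itself can be sketched as follows. This is our own naive Python k-medoids over the per-window mean vectors; the paper does not specify the implementation actually used:

    import numpy as np

    def kmedoids(X, k, iters=100, seed=0):
        """Naive k-medoids. X has shape (num_processes, num_windows): each row
        is a process's vector of per-window mean MPI_Waitsome() times.
        Returns cluster labels and medoid indices."""
        rng = np.random.default_rng(seed)
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # Euclidean
        medoids = rng.choice(len(X), size=k, replace=False)
        for _ in range(iters):
            labels = np.argmin(dist[:, medoids], axis=1)  # assign to nearest medoid
            new_medoids = medoids.copy()
            for c in range(k):
                members = np.where(labels == c)[0]
                if len(members):
                    # the new medoid minimizes total distance to its members
                    costs = dist[np.ix_(members, members)].sum(axis=1)
                    new_medoids[c] = members[np.argmin(costs)]
            if np.array_equal(new_medoids, medoids):
                break
            medoids = new_medoids
        return labels, medoids

The resulting labels map directly onto per-group Members lists in an AMPL configuration file like the one in Figure 4.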

We ran k-medoids on our 1024-process ADCIRC data for cluster counts from 1 to 10, and we used the output to construct a stratified AMPL configuration file. We then re-ran ADCIRC with stratified sampling for each of these configuration files. The single-cluster case is identical to the non-stratified runs above. Figures 9(a) and 9(b) show data overhead for different cluster counts and execution time, respectively.

Figure 9(a) shows that a further 25% reduction in data volume was achieved on 1024-node runs of ADCIRC by stratifying the population into two clusters. On the BlueGene/L, we see a 37% reduction in execution time. Surprisingly, for k = 3 and k = 4, dividing the population actually causes a significant increase in both data volume and execution time, while 5-10 clusters perform similarly to the 2-cluster case. Since tracing itself can perturb a running application, it may be that we are clustering on perturbation noise for the 3- and 4-cluster cases. This is a likely culprit for the increases in overhead we see here, as our samples are chosen randomly and the balance of perturbation across nodes in the system is nondeterministic. Because we used off-line clustering, we do not accurately capture this kind of dynamic behavior. This could be improved by using an online clustering algorithm and adjusting the strata dynamically.

For the remainder of the tests, our results are consistent with our expectations. With 5 clusters, the AMPL-enabled run of ADCIRC behaves similarly to the 2-cluster case. Data and time overhead of subsequent tests with more stratification gradually increase, but they do not come close to exceeding the overhead of the 2-cluster test. There is, however, a slow rise in overhead from 5 clusters on, and this can be explained by one of the weaknesses of k-medoids clustering. K-medoids requires the user to specify k in advance, but the user may have no idea how many equivalence classes actually exist in the data. Thus, the user can either guess, or he can run k-medoids many times before finding an optimal clustering. If k exceeds the actual number of groups that exist in the data, the algorithm can begin to cluster on noise.

5 Related Work

Mendes et al. [12] used statistical sampling to monitor scalar properties of large systems. We have taken these methods and applied them to monitoring trace data at the application level, while Mendes measured higher-level system properties at the cluster and grid level.

Noeth et al. [15] have developed a scalable, lossless MPI trace framework capable of reducing the data volumes of large MPI traces for very regular applications by orders of magnitude. This framework collects compressed data at runtime and aggregates it at exit time via a global trace-reduction operation. The approach does not support the collection of arbitrary numerical performance data from remote processes; only an MPI event trace is preserved. In contrast, AMPL can measure a trace of arbitrary numerical values over the course of a run, and it is better suited to collecting application-level performance data and timings. We believe that the two approaches are complementary, and that Noeth's scalable MPI traces could be annotated intelligently with the sort of data AMPL collects, based on information learned by AMPL at runtime. We are considering this for future work.

Roth et al. have developed MRNet [17] as part of the ParaDyn project [18]. MRNet uses tree-based overlay networks to aggregate performance data in very large clusters, and it has been used at Lawrence Livermore National Laboratory with Blue Gene/L. MRNet allows a random sample to be taken from monitored nodes, but it does not guide the sampling or handle reductions of trace data directly. AMPL could benefit by using MRNet to restructure communication operations as reduction trees separate from the monitored application, in place of the embedded PMPI calls it currently uses. Such a modification could potentially reduce the perturbation problems we saw in our experiments with stratification.

Nikolayev et al. [14] have used statistical clustering to reduce data overhead in large applications by sampling only representatives from clusters detected at runtime. The approach is entirely online: clusters are generated on-the-fly based on data observed at runtime. Nikolayev's method assumes that one representative from each cluster will be enough to approximate its behavior, and it does not offer the statistical guarantees that AMPL provides. AMPL could, however, benefit from the sort of online clustering done in this work.

6 Conclusions and Future Work

We have shown that the techniques used in AMPL can reduce the data overhead and the execution time of instrumented scientific applications by over an order of magnitude on small systems. Since the overhead of our monitoring methods scales sub-linearly with the number of concurrent processes in a system, AMPL, or similar sampling frameworks, will be essential for monitoring petascale machines when they arrive. Further, we have shown that, in addition to estimating global, aggregate quantities across a large cluster, populations of processes can be stratified and monitored as such with our tool. This will allow for further reductions in data volume and execution time.

The ideas presented here are implemented in a library that is easily integrated with existing tools. We were able to integrate AMPL easily with the TAU performance toolkit, and we believe that our techniques are widely applicable to many other domains of monitoring.

Building on this work, we are currently developing more sophisticated performance models and low-overhead, online analysis of distributed applications. We plan to apply the techniques we have developed to other monitoring tools, and to use them to facilitate online, real-time analysis of scalable applications at the petascale and beyond.

References

[1] ASCI. The ASCI Purple sPPM benchmark code [online]. 2002. Available from: http://www.llnl.gov/asci/platforms/purple/rfp/benchmarks/limited/sppm.

[2] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. J. Mucci. A portable programming interface for performance evaluation on modern processors. The International Journal of High Performance Computing Applications, 14(3):189–204, Fall 2000.

[3] P. Colella, D. T. Graves, D. Modiano, D. B. Serafini, and B. v. Straalen. Chombo software package for AMR applications. Technical Report (Lawrence Berkeley National Laboratory), 2000. Available from: http://seesar.lbl.gov/anag/chombo.

[4] R. Crockett, P. Colella, R. Fisher, R. I. Klein, and C. McKee. An unsplit, cell-centered Godunov method for ideal MHD. Journal of Computational Physics, 203:422–448, 2005.

[5] A. Gara, M. A. Blumrich, D. Chen, G. L.-T. Chiu, P. Coteus, M. E. Giampapa, R. A. Haring, P. Heidelberger, D. Hoenicke, G. V. Kopcsay, T. A. Liebsch, M. Ohmacht, B. D. Steinmacher-Burow, T. Takken, and P. Vranas. Overview of the Blue Gene/L system architecture. IBM Journal of Research and Development, 49(2/3), 2005.

[6] IBM. MareNostrum: A new concept in Linux supercomputing [online]. February 15, 2005. Available from: http://www-128.ibm.com/developerworks/library/pa-nl3-marenostrum.html.

[7] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1999.

[8] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. Wiley-Interscience, 2nd edition, 2005.

[9] Lawrence Livermore National Laboratory. Livermore computing resources [online]. 2007. Available from: http://www.llnl.gov/computing/hpc/resources/OCF resources.html.

[10] C.-d. Lu and D. A. Reed. Compact application signatures for parallel and distributed scientific codes. November 2002.

[11] R. Luettich, J. Westerink, and N. Scheffner. ADCIRC: an advanced three-dimensional circulation model for shelves, coasts, and estuaries. Report 1: theory and methodology of ADCIRC-2DDI and ADCIRC-3DL, 1992.

[12] C. L. Mendes and D. A. Reed. Monitoring large systems via statistical sampling. International Journal of High Performance Computing Applications, 18(2):267–277, 2004.

[13] H. Meuer, E. Strohmaier, J. Dongarra, and S. Horst. Top500 supercomputer sites [online]. Available from: http://www.top500.org.

[14] O. Y. Nikolayev, P. C. Roth, and D. A. Reed. Real-time statistical clustering for event trace reduction. The International Journal of Supercomputer Applications and High Performance Computing, 11(2):144–159, 1997.

[15] M. Noeth, F. Mueller, M. Schulz, and B. R. de Supinski. Scalable compression and replay of communication traces in massively parallel environments. In International Parallel and Distributed Processing Symposium (IPDPS), March 26–30, 2007.

[16] R. Ross, J. Moreira, K. Cupps, and W. Pfeiffer. Parallel I/O on the IBM Blue Gene/L system. Blue Gene/L Consortium Quarterly Newsletter, First Quarter, 2006. Available from: http://www-fp.mcs.anl.gov/bgconsortium/file%20system%20newsletter2.pdf.

[17] P. C. Roth, D. C. Arnold, and B. P. Miller. MRNet: A software-based multicast/reduction network for scalable tools. In Supercomputing 2003 (SC03), 2003.

[18] P. C. Roth and B. P. Miller. On-line automated performance diagnosis on thousands of processors. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'06), New York City, 2006.

[19] R. L. Schaeffer, W. Mendenhall, and R. L. Ott. Elementary Survey Sampling. Wadsworth Publishing Co., Belmont, CA, 6th edition, 2006.

[20] S. Shende and A. Malony. The TAU parallel performance system. International Journal of High Performance Computing Applications, 20(2):287–331, 2006.

[21] J. S. Vetter, S. R. Alam, T. H. Dunigan, Jr., M. R. Fahey, P. C. Roth, and P. H. Worley. Early evaluation of the Cray XT3. In Proc. 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2006.

