Copyright © 2010 by Edmund H. Conrow
Space Program Schedule Change Probability Distributions
Dr. Edmund H. Conrow, CMC, CPCM, CRM, PMP*
Management and Technology Associates, Redondo Beach, California 90278
Many complex space and non-space programs are driven by performance requirements, with cost and/or
schedule being less dominant, if not dependent, variables. The net result, when actuals are compared
against initial projections, is cost and schedule change that grows (slips) during the course of a
program. Several different samples of space and non-space
data were examined to estimate descriptive statistics (e.g., mean, median, skewness, kurtosis)
and determine what types of statistical distributions might represent schedule change data.
The analyses performed also tested assertions of other researchers who postulated that space
and non-space program schedule change can be represented by a normal distribution or an
extreme value (Gumbel) distribution. A data sample of 28 NASA programs was examined to
test the hypothesis claimed by other researchers that schedule change can be represented by
a normal distribution. Both the skewness and kurtosis of the data and results from
Anderson-Darling and Kolmogorov-Smirnov statistical tests show that this data is not
normally distributed. A large sample of space and non-space programs (365 programs) was
obtained and evaluated against 23 different types of continuous univariate distributions.
Anderson-Darling and Kolmogorov-Smirnov test results show that all 23 distribution types
(including normal and extreme value distributions) were rejected at the 0.05 level. The
sample was subsequently reduced artificially by including only every sixth value from both random and
sorted orderings of the data, after values associated with on-time deliveries had been removed.
Anderson-Darling and Kolmogorov-Smirnov test results show that the extreme value distribution could
not be rejected at the 0.05 level. This potential contradiction resulted from the much smaller sample
size versus the full sample (with or without on-time delivery values removed), which diminished the
statistical power of the Anderson-Darling and Kolmogorov-Smirnov tests and thus prevented rejection
of the extreme value distribution at the 0.05 level. Consequently, sample sizes of roughly 200 or more data points
may be necessary to provide a meaningful distribution fit of schedule change and potentially
other data. For smaller, and particularly much smaller, sample sizes it is recommended that
the data not be evaluated against candidate distribution types, but simply be converted into
an ascending cumulative distribution function (CDF) and this function used as appropriate
in subsequent analyses (e.g., Monte Carlo simulations).
Introduction
A variety of probability distributions have been assumed to represent variations in cost and schedule during the
acquisition phase of space (and aerospace) programs (e.g., extreme value, normal, and triangular distributions), but few
rigorous statistical analyses have been conducted to evaluate the types of distributions that should be eliminated as
well as those that may be possible. In this paper I examine schedule variations for a wide variety of government and
commercial programs, primarily space programs, and provide an evaluation of which types of common probability
distributions can be rejected, and which types cannot be rejected. (The analyses were performed at the total program
level since schedule change information is rarely available at lower work breakdown structure (WBS) levels.)
Dubos, Saleh, and Braun assumed that schedule change associated with NASA spacecraft development was
normally distributed [1], specifically:
“The data in our sample are not rich enough to allow us to infer a probability distribution function
for final total schedule duration (FTD) or relative schedule slippage (RSS). However, for the
purpose of introducing the concept of a technology readiness level (TRL) schedule-risk curve, let
us assume that for a given TRL, RSS (or FTD) is normally distributed. That is, we formulate the
hypothesis that the RSS (or FTD) has a normal probability density function,”…
* Principal, P. O. Box 1125, Redondo Beach, CA 90278, www.risk-services.com, Associate Fellow and Life Member.
AIAA SPACE 2010 Conference & Exposition, 30 August - 2 September 2010, Anaheim, California. AIAA 2010-8834.
Copyright © 2010 by Edmund H. Conrow. Published by the American Institute of Aeronautics and Astronautics, Inc., with permission.
While the probability density function (PDF) of the final total schedule duration (FTD) is somewhat
unimportant, the PDF for relative schedule slippage (RSS) between the actual versus planned schedule duration is of
key interest for the scope of the analysis reported in this paper. Note that Dubos, Saleh, and Braun provided no
evidence to support their assertion that RSS is normally distributed, yet evidence exists in the literature, from
both theoretical considerations and statistical testing, that aerospace program cost change and schedule change
are likely not normally distributed; for example, they have a right-hand skew (see [2], [3], [5], [6]).
Dubos, Saleh, and Braun then state [1]:
“If more RSS empirical data are available and warrant a different probability distribution function,
schedule-risk curves can still be developed … with the new probability distribution function and
its parameters;…”
The above statement assumes that a probability distribution function (assumed to be a PDF) may be defined that
represents RSS (or schedule change) that is not normal, but the authors provided no evidence to support this
assertion. Both the Dubos, Saleh, and Braun normality assertion and the assertion that a different but definable PDF
may exist will be explored in this paper.
Coleman, Summerville, and Dameron assumed that an extreme value distribution is “what we expect
theoretically” to model schedule change for DoD acquisition programs [7], [8]. However, an inadequate
justification was provided to support this assertion, specifically that “the case is actually made by the statistical test” [8].
The assertions, methodology, and results of Coleman, Summerville, and Dameron ([7], [8]) are also explored in this
paper.
Finally, results are provided from an extensive statistical evaluation of a sample of space and non-space program
schedule change data. Hence, three sets of analyses were performed and are reported in this paper. First, the same
NASA schedule change data used by Dubos, Saleh, and Braun [1] was evaluated to see whether/not a normal
distribution is an acceptable statistical distribution fit to that data. [Note: even if a normal distribution is not
rejected by the statistical tests used in the distribution fitting, this does not mean that it should be accepted as an
appropriate distribution type (let alone “the appropriate” distribution type) to represent the data.] Second, a large
sample of schedule change data derived from 365 space and non-space programs was evaluated to determine which
distribution types both were rejected and not rejected at the 0.05 level using two appropriate statistical tests. Third,
the assertions, methodology, and results of Coleman, Summerville, and Dameron, who evaluated a 59 program
sample, including those results associated with the extreme value distribution, are examined in light of the much
larger schedule change sample developed by Conrow (365 programs).
Statistical Analysis of Schedule Change Data
From a theoretical perspective Conrow investigated the trade space between program cost, performance, and
schedule and developed an analytical framework that incorporates these variables, acquisition dynamics during the
course of a program, variable constraints, and likely outcomes. One finding was that performance is typically the
dominant variable for Department of Defense and NASA development programs [2] [3]. In effect, both the
government and contractor attempt to meet performance requirements, and in the process cost and/or schedule are
typically adjusted (often meaning growth) on a case-by-case basis in order to meet these requirements [2] [3]. This
framework was validated in part by statistical results from analyzing roughly 50 major first generation DoD
development programs (none of which had been re-baselined in the final development phase). The estimated
skewness for cost change, performance change, and schedule change were 1.24, 0.38, and 1.24, respectively [2] [3].
The estimated kurtosis using the same data for cost change, performance change, and schedule change were 4.36,
4.62, and 4.70, respectively. Since a normal distribution has a skewness of 0.00 and a kurtosis of 3.00, the
skewness results for cost change and schedule change show that both variables have a right-hand skew, and
together with the kurtosis results, neither variable is likely normally distributed. (For any DoD or NASA program
nearing its major delivery milestone, is the item more likely to be delivered ahead of schedule, on schedule, or
behind schedule? Deliveries are more frequently behind schedule than on time or ahead of schedule, which leads
to a right-hand skew.) Furthermore, when the Conrow cost change and schedule change data
were evaluated against 23 candidate distributions, the normal distribution was rejected at the 0.05 level with both the
Anderson-Darling and Kolmogorov-Smirnov tests. The above theoretical and statistical analyses, first published in
part in 1995 [3] and 1996 [4], clearly show that a normal distribution is not an appropriate choice to represent cost
change or schedule change data.
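The skewness and kurtosis values quoted above are the third and fourth standardized moments. A minimal sketch in pure Python follows; the paper does not state which estimator variant its software used, so the simple population-moment form below is an assumption, and the two small samples are invented for illustration.

```python
def skewness(xs):
    # Third standardized moment (population form; the paper's exact
    # estimator is not stated, so this simple form is an assumption).
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return (sum((x - m) ** 3 for x in xs) / n) / s2 ** 1.5

def kurtosis(xs):
    # Fourth standardized moment; a normal distribution gives 3.0.
    # Subtracting 3.0 yields the "excess kurtosis" some packages report.
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return (sum((x - m) ** 4 for x in xs) / n) / s2 ** 2

# A symmetric sample has zero skewness; a right-skewed one is positive.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.1, 1.2, 1.5, 3.0]
print(skewness(symmetric))         # 0.0
print(skewness(right_skewed) > 0)  # True
```

Bias-corrected sample estimators give slightly different numbers for small n, which is one reason reported skewness and kurtosis figures can vary between software packages.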
Conrow evaluated the same 28 program data set that Dubos, Saleh, and Braun used in their analysis. (The data
was provided by Dr. Dale Thomas of NASA Marshall Space Flight Center.) Conrow first converted the RSS data to
a simple schedule change ratio defined by the final total schedule duration (FTD) divided by the initial schedule
duration estimate (IDE). Conrow determined that the sample had an average and median schedule change of 1.52
and 1.28, respectively. [Note: values equal to 1.0 indicate an on-time delivery, values less than 1.0 indicate an
accelerated delivery, and values greater than 1.0 indicate a late delivery (commonly called schedule slippage).]
Hence, the average and median schedule change correspond to a 52 percent and a 28 percent slippage, respectively.
In addition, the skewness and kurtosis were 1.35 and 4.49, respectively, which is considerably different from what
would exist for a normal distribution (0.00 and 3.00, respectively). The skewness and kurtosis results should have
given Dubos, Saleh, and Braun concern that the NASA 28 program sample was not normally distributed, even
though these results cannot be related to a statistical significance level. (Note: while it would have been almost
trivial to generate, Dubos, Saleh, and Braun did not provide descriptive statistics for the NASA 28 program sample
in their paper [1], which again would have provided evidence that the NASA 28 program data were not normally
distributed.) Descriptive statistics for the NASA 28 program sample estimated by Conrow are given in Table 1 for
both schedule change (ratio) and RSS. The results are effectively identical for schedule change (ratio) and RSS,
allowing for the difference in representation of the two variable types (and minor rounding errors).
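The two representations are related algebraically by RSS = 100 × (ratio − 1). A minimal sketch, using hypothetical planned and actual durations (the 40 and 50 month figures are invented for illustration):

```python
def schedule_ratio(ftd_months, ide_months):
    # Final total schedule duration (FTD) divided by the initial
    # schedule duration estimate (IDE).
    return ftd_months / ide_months

def rss_percent(ftd_months, ide_months):
    # Relative schedule slippage as a percentage of the initial estimate:
    # RSS = 100 * (FTD - IDE) / IDE = 100 * (ratio - 1).
    return 100.0 * (ftd_months - ide_months) / ide_months

# A hypothetical program planned for 40 months that took 50 months:
print(schedule_ratio(50, 40))  # 1.25 (a 25 percent slip)
print(rss_percent(50, 40))     # 25.0
```

This identity is why the two columns of Table 1 track each other: a mean ratio of 1.52 corresponds to a mean RSS of about 52.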
Table 1. Descriptive Statistics for Schedule Slippage and RSS of 28 NASA Programs

Item                 Ratio    RSS
Count                28       28
Mean                 1.52     52.2
Standard Deviation   0.54     53.7
Skewness             1.35     1.35
Kurtosis             4.49     4.49
Minimum              1.00     0.0
1st Quartile         1.11     11.1
Median               1.28     27.5
3rd Quartile         1.81     81.2
Maximum              3.14     213.7
Range                2.14     213.7

The first and third quartiles are equivalent to the 25th and 75th percentiles, respectively. Skewness, a measure of
the symmetry of the data, and kurtosis, corresponding to the “peakedness” of the data, represent normalized forms
of the third and fourth standardized moments of a distribution, respectively. (See
http://en.wikipedia.org/wiki/Skewness and http://en.wikipedia.org/wiki/Kurtosis for additional information on
skewness and kurtosis, respectively.) [The reader should be aware that some statistical packages (e.g., Microsoft
Excel) estimate excess kurtosis rather than kurtosis. Excess kurtosis is simply kurtosis minus 3.0.]
The same 28 program sample was then evaluated to determine what types of common distributions might be
rejected and what types might not be rejected. The results of this analysis using schedule change (ratio) data are
given in Table 2. Note that the normal distribution was rejected at the 0.05 level (and in fact at the 0.01 level) by
both the Anderson-Darling and Kolmogorov-Smirnov tests, which are both considered more powerful than the
Chi-Square test [9]. [In terms of common statistical tests, statistical power increases in the following order, ceteris
paribus (cet. par.), i.e., all other factors held constant: Chi-Square, Kolmogorov-Smirnov, and Anderson-Darling [9].
The Chi-Square test was not used in the schedule change analysis because of its limited statistical power and
common implementation difficulties with relatively small sample sizes. Statistical power also increases with
increasing sample size and test significance level, and with decreasing sample standard deviation.] Note, however,
that not rejecting a type of probability distribution at the 0.05 or another level is different from accepting that the
distribution satisfactorily represents the data; this is a common mistake [9]. Accepting a given probability
distribution may require solid “real world” and/or theoretical justification for the type of distribution under
evaluation beyond (just) the statistical test results.
In addition, the same results occurred using the RSS data: both the Anderson-Darling and Kolmogorov-Smirnov
tests rejected the normal distribution at the 0.05 level (and in fact at the 0.01 level). [From Table 2, of the 23
candidate distributions evaluated against the NASA 28 program data set, 18 and 12 distributions were rejected at
the 0.05 level by the Anderson-Darling and Kolmogorov-Smirnov tests, respectively.] Given the small sample size, it is
understandable that multiple distribution types were not rejected in the evaluation. [This is in part because the
power of the statistical tests used is limited for small sample sizes. This assertion was verified by distribution
fitting data generated by a commercial normal probability distribution random number generator. For a sample of
28 points, 16 of the 21 distribution types tested were not rejected at the 0.05 level by both the Anderson-Darling
and Kolmogorov-Smirnov tests. This number decreased to 13, 11, 9, 9, and 8 distribution types not being rejected
out of the 21 distributions tested for sample sizes of 64, 120, 250, 385, and 500 data points, respectively. In
addition, the normal distribution Anderson-Darling test statistic was not substantially smaller than those for the
other 20 distribution types until the sample reached 500 values, where it was 43 times smaller than the next
smallest test statistic (gamma distribution).] As mentioned above, not rejecting a type of probability distribution is
different from accepting that distribution as being representative of the data [9].
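The limited power at small sample sizes can be illustrated with a self-contained sketch (pure Python; the 28-point sample below is deterministic and illustrative, not program data). Two different fitted candidate shapes, normal and logistic, both survive a Kolmogorov-Smirnov check against the asymptotic 0.05 critical value 1.358/sqrt(n); using that plain critical value with fitted parameters is itself a simplification (Lilliefors-type tables would be stricter).

```python
import math

def ks_d(xs, cdf):
    # Kolmogorov-Smirnov D statistic for a fully specified CDF.
    xs = sorted(xs)
    n = len(xs)
    return max(max(i / n - cdf(x), cdf(x) - (i - 1) / n)
               for i, x in enumerate(xs, start=1))

def fitted_normal_cdf(xs):
    # Normal CDF with moment-fitted mean and standard deviation.
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return lambda x: 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))

def fitted_logistic_cdf(xs):
    # Logistic CDF with moment-fitted parameters (scale from the
    # standard deviation: s = sd * sqrt(3) / pi).
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    s = sd * math.sqrt(3.0) / math.pi
    return lambda x: 1.0 / (1.0 + math.exp(-(x - m) / s))

# 28 deterministic points (logistic quantiles), standing in for a sample.
sample = [math.log(u / (1.0 - u)) for u in (i / 29.0 for i in range(1, 29))]
crit = 1.358 / math.sqrt(len(sample))  # asymptotic 0.05 critical value
print(ks_d(sample, fitted_normal_cdf(sample)) < crit)    # True
print(ks_d(sample, fitted_logistic_cdf(sample)) < crit)  # True
```

With only 28 points, two quite different shapes both pass, mirroring the 16-of-21 result quoted above.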
Table 2. Evaluation of Candidate Probability Distributions for Schedule Slippage of 28 NASA Programs

Model               Anderson-Darling   Reject at 0.05   Kolmogorov-Smirnov   Reject at 0.05
                    Test Statistic     Prob. Level?     Test Statistic       Prob. Level?
Beta                  0.47             N                 0.14                N
Cauchy                3.07             N                 0.23                N
Erlang                1.18             Y                 0.21                N
Error                19.35             Y                 0.73                Y
Exponential           6.39             Y                 0.48                Y
Extreme Value A       2.23             Y                 0.25                Y
Extreme Value B       1.19             Y                 0.20                Y
Gamma                 1.27             Y                 0.22                N
Inverse Gaussian      1.12             N                 0.21                N
Inverted Weibull      0.77             Y                 0.15                N
Laplace               2.52             Y                 0.25                N
Logistic              1.43             Y                 0.19                Y
Log-Laplace           1.90             Y                 0.19                N
Log-Logistic          1.04             Y                 0.17                Y
Lognormal             1.12             Y                 0.20                Y
Normal                1.60             Y                 0.23                Y
Pearson Type V        0.99             Y                 0.19                N
Pearson Type VI       1.10             N                 0.20                N
Random Walk           1.15             N                 0.21                N
Rayleigh              2.45             Y                 0.32                Y
Uniform              44.56             Y                 0.43                Y
Wald                 15.99             Y                 0.59                Y
Weibull               1.48             Y                 0.21                Y

† Data set 1 initially included 348 space programs. Six acquisition outliers were identified by the database
developer and Conrow and removed, leaving 342 programs. Conrow identified internal computational errors
affecting the plan and/or actual dates for 41 programs (two separate estimates had an inconsistency greater than 2
months), leaving 301 DS 1 programs. This computational error was unknown to the database developer, but the
developer agreed that Conrow’s approach was correct for filtering the data. Data set 2 initially included 244
programs. Eighty-one programs were removed, leaving 163 programs, because they did not have the key delivery
milestone needed for the analysis. Forty-one programs were then removed, leaving 122 programs, because the key
delivery milestone was an estimate (not an actual) value. Fifty-eight programs were then removed, leaving 64 DS 2
programs, because one or more other necessary actual or plan schedule milestones were missing. These data issues
were unknown to the database developer, but the developer agreed that Conrow’s approach was correct for filtering
the data. (The DS 1 and DS 2 issues had existed for more than 10 years before being discovered.)

A sample of 365 aerospace development programs, composed of 301 spacecraft [Data Set (DS) 1] and 64
non-spacecraft [DS 2] programs, each contributing a single schedule change (ratio) value, was collected by Conrow
and evaluated to determine what candidate distribution types could be rejected and not rejected.† (It was valid to
combine the two separate data sets because the P-value from the Mann-Whitney W test was greater than 0.05.
Hence, there was not a statistically significant difference between the medians at the 95.0% confidence level.)
Descriptive statistics for this 365 program data set are given in Table 3 (along with descriptive statistics for all other
schedule change data sets evaluated in this paper). (Note: descriptive statistics for the 28 program NASA data set
are repeated in Table 3 for consistency—the results are identical to those given in the ratio column of Table 1.)
From Table 3, the average and median schedule change for the 365 program sample was 1.27 and 1.16,
respectively. The skewness and kurtosis were 1.37 and 5.04, respectively, which is considerably different from
what would exist for a normal distribution (0.00 and 3.00, respectively). The data were then evaluated to determine
what types of common distributions might be rejected and what types might not be rejected. The results of this
analysis are given in Table 4. Note that the normal distribution, the extreme value Type B distribution, and 21 other
distributions were rejected at the 0.05 level by both the Anderson-Darling and Kolmogorov-Smirnov tests. As
clearly shown in Table 4, schedule change data for space and non-space programs cannot be accurately represented
by any of the 23 common distribution types evaluated.
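When no common distribution type survives testing, the practical alternative, recommended in this paper for subsequent Monte Carlo use, is to carry the data's own ascending empirical CDF directly into the simulation. A minimal sketch with invented ratios:

```python
import bisect
import random

def empirical_cdf(values):
    # Build an ascending empirical CDF from observed schedule change
    # ratios: sorted values paired with cumulative probabilities i/n.
    xs = sorted(values)
    n = len(xs)
    probs = [(i + 1) / n for i in range(n)]
    return xs, probs

def sample_from_cdf(xs, probs, u):
    # Inverse-transform draw: the smallest observed value whose
    # cumulative probability is >= u (0 < u <= 1).
    i = bisect.bisect_left(probs, u)
    return xs[min(i, len(xs) - 1)]

# Hypothetical ratios; each Monte Carlo iteration draws u ~ Uniform(0, 1)
# and maps it through the empirical CDF.
ratios = [1.05, 1.10, 1.10, 1.20, 1.35, 1.60]
xs, probs = empirical_cdf(ratios)
rng = random.Random(1)
draws = [sample_from_cdf(xs, probs, rng.random()) for _ in range(1000)]
print(min(draws) >= 1.05 and max(draws) <= 1.60)  # True
```

This approach reproduces the observed values with their observed frequencies; interpolating between sorted points is a common refinement when a continuous output is preferred.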
Finally, since the P-value from the Mann-Whitney W test (difference in the distribution medians) was less than
0.05, it was not valid to combine the 28 program NASA data set with the previously mentioned 365 program data
set (DS 1 and DS 2) into a single data set.

Table 3. Descriptive Statistics of Key Evaluated Schedule Change Samples

                     NASA    DS 1 and   DS 1 and DS 2,   DS 1 and DS 2,    DS 1 and DS 2,    DS 1 and DS 2, No 1.0,
Item                         DS 2       No 1.0           No 1.0, 1/6       No 1.0, 1/6       1/6 Values, Sorted,
                                                         Values, Random    Values, Sorted    Different Endpoint
Count                28      365        278              47                47                47
Mean                 1.52    1.27       1.35             1.33              1.35              1.35
Standard Deviation   0.54    0.33       0.34             0.32              0.37              0.38
Skewness             1.35    1.37       1.07             0.92              0.95              1.18
Kurtosis             4.49    5.04       4.48             3.11              4.78              5.83
Minimum              1.00    0.50       0.50             0.92              0.50              0.50
25th Percentile      1.11    1.00       1.12             1.07              1.11              1.11
Median               1.28    1.16       1.25             1.24              1.25              1.25
75th Percentile      1.81    1.44       1.54             1.52              1.52              1.52
Maximum              3.14    2.70       2.70             2.18              2.52              2.70
Range                2.14    2.20       2.20             1.26              2.02              2.20

Table 4. Evaluation of Candidate Probability Distributions for Schedule Change of 365 Space and Non-Space Programs

Model               Anderson-Darling   Reject at 0.05   Kolmogorov-Smirnov   Reject at 0.05
                    Test Statistic     Prob. Level?     Test Statistic       Prob. Level?
Beta                204.07             Y                 0.62                Y
Cauchy               23.61             Y                 0.22                Y
Erlang               11.83             Y                 0.14                Y
Error               279.53             Y                 0.74                Y
Exponential         100.56             Y                 0.50                Y
Extreme Value A      28.70             Y                 0.24                Y
Extreme Value B       8.36             Y                 0.12                Y
Gamma                12.03             Y                 0.13                Y
Inverse Gaussian     10.36             Y                 0.13                Y
Inverted Weibull      8.28             Y                 0.15                Y
Laplace              15.94             Y                 0.19                Y
Logistic             12.87             Y                 0.16                Y
Log-Laplace          11.89             Y                 0.16                Y
Log-Logistic          9.31             Y                 0.14                Y
Lognormal            10.28             Y                 0.13                Y
Normal               16.27             Y                 0.15                Y
Pearson Type V        8.87             Y                 0.13                Y
Pearson Type VI      10.14             Y                 0.13                Y
Random Walk          10.47             Y                 0.13                Y
Rayleigh             51.71             Y                 0.38                Y
Uniform              88.50             Y                 0.35                Y
Wald                128.80             Y                 0.50                Y
Weibull              18.00             Y                 0.19                Y
As previously mentioned, it was valid to combine the two separate space and non-space data sets (DS 1 and DS
2) into a single data set because the P-value from the Mann-Whitney W test was greater than 0.05. Hence, there was
not a statistically significant difference between the medians at the 95.0% confidence level. Nevertheless,
distribution fitting results are presented here for the 301 program DS 1 sample. The results of the distribution fitting
analysis of space program schedule change (DS 1) are given in Table 5. As in the full sample case (365 programs,
DS 1 and DS 2 combined), the normal distribution, extreme value Type B distribution, and 21 other distributions
were rejected at the 0.05 level by both the Anderson-Darling and Kolmogorov-Smirnov tests. As clearly shown in
Table 5, schedule change data for space programs cannot be accurately represented by any of the 23 common
distribution types evaluated.
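The combining decisions above rest on the Mann-Whitney test of whether two samples share a common median. A hedged sketch follows (pure Python, large-sample normal approximation with midranks for ties; the commercial package behind the paper's P-values likely applies additional tie and continuity corrections). The two small samples are invented for illustration.

```python
import math

def mann_whitney_p(xs, ys):
    # Two-sided Mann-Whitney test via the large-sample normal
    # approximation; ties get midranks, and no continuity or tie
    # correction is applied (the paper's software details are unknown).
    n1, n2 = len(xs), len(ys)
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of 1-based ranks i+1..j
        rank_sum_x += midrank * sum(1 for k in range(i, j)
                                    if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Interleaved samples (plausibly one population): large p, OK to combine.
a = [1.05, 1.12, 1.18, 1.24, 1.31, 1.39, 1.47, 1.55, 1.66, 1.80]
b = [1.07, 1.13, 1.19, 1.26, 1.33, 1.41, 1.49, 1.58, 1.70, 1.85]
print(mann_whitney_p(a, b) > 0.05)                     # True
print(mann_whitney_p(a, [x + 1.0 for x in b]) < 0.05)  # True
```

A P-value above 0.05 (no detectable median difference) is the paper's criterion for pooling two samples; a P-value below 0.05, as with the NASA sample versus the 365 program sample, forbids pooling.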
Coleman and Summerville assumed that an extreme value distribution is “what we expect theoretically” to
model schedule change for DoD acquisition programs [7]. However, Coleman, Summerville, and Dameron
provided an inadequate justification to support this assertion, specifically that “the case is actually made by the
statistical test” [8]. Coleman, Summerville, and Dameron also stated that “given that finishing a program is akin to
waiting for the last event to finish, it is appealing that the Gumbel, or extreme value distribution, is the best fit to
the data” [8].
Table 5. Evaluation of Candidate Probability Distributions for Schedule Change of 301 Space Programs

Model               Anderson-Darling   Reject at 0.05   Kolmogorov-Smirnov   Reject at 0.05
                    Test Statistic     Prob. Level?     Test Statistic       Prob. Level?
Beta                  7.82             Y                 0.15                Y
Cauchy               19.50             Y                 0.22                Y
Erlang                8.81             Y                 0.14                Y
Error               229.28             Y                 0.73                Y
Exponential          81.58             Y                 0.49                Y
Extreme Value A      20.93             Y                 0.23                Y
Extreme Value B       6.44             Y                 0.13                Y
Gamma                 8.86             Y                 0.13                Y
Inverse Gaussian      7.66             Y                 0.13                Y
Inverted Weibull      6.81             Y                 0.16                Y
Laplace              13.05             Y                 0.19                Y
Logistic              9.94             Y                 0.16                Y
Log-Laplace           9.80             Y                 0.16                Y
Log-Logistic          7.35             Y                 0.14                Y
Lognormal             7.65             Y                 0.13                Y
Normal               11.88             Y                 0.15                Y
Pearson Type V        6.67             Y                 0.13                Y
Pearson Type VI       7.00             Y                 0.13                Y
Random Walk           7.74             Y                 0.13                Y
Rayleigh             40.82             Y                 0.38                Y
Uniform              69.12             Y                 0.34                Y
Wald                109.50             Y                 0.50                Y
Weibull              12.90             Y                 0.18                Y

From Table 4, both the extreme value Type A and Type B distributions were rejected at the 0.05 level by both
the Anderson-Darling and Kolmogorov-Smirnov tests. Coleman, Summerville, and Dameron removed schedule
change values of 1.00 (indicating on-time delivery vs. plan) because “the number of 1.0’s in the data base (schedules
finishing “on time”) creates problems in the fit statistics” [7]. Of the 59 total programs, 12 (about 20 percent) had a
schedule ratio of 1.0. [Coleman and Summerville noted that “we believe the disproportionate amount of 1.0’s is
politically motivated and not a natural occurrence” [7]. This statement is erroneous because award fee, user needs,
and other criteria often make on-time delivery a priority (hence a ratio of 1.0).] They eliminated the 1.0 values and
then re-fit the data to an extreme value (Type B) distribution. (The 1.0 values were later modeled using a discrete
distribution 12/59 of the time.) Coleman and Summerville, and Coleman, Summerville, and Dameron estimated the
extreme value location (µ) and scale (β) coefficients to be 1.16 and 0.32, respectively, as given in the last entry of
Table 6 [7] [8]. They also found that the extreme value (Type B) distribution could not be rejected at the 0.05 level
with the Kolmogorov-Smirnov test [7] [8], but they did not report results (or potentially evaluate the data) using the
more statistically powerful Anderson-Darling test.
When the 365 program data set assembled by Conrow (roughly six times larger than the Coleman and
Summerville, and Coleman, Summerville, and Dameron data set) was similarly evaluated, a different outcome was
reached, as shown in Table 6. The resulting extreme value Type B µ and β coefficients were estimated to be 1.12
and 0.24, respectively. However, the extreme value Type B distribution was rejected at the 0.05 level by both the
Anderson-Darling and Kolmogorov-Smirnov tests. (The rejection held at all significance levels from 0.01 to 0.25
for the Anderson-Darling test and from 0.01 to 0.10 for the Kolmogorov-Smirnov test, the ranges over which exact
critical values exist.)
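An extreme value (Gumbel) fit and Kolmogorov-Smirnov check of the kind discussed above can be sketched as follows. The fit uses the method of moments (an illustrative choice; the software behind the paper's estimates may use maximum likelihood), the ten ratios are invented, and the plain asymptotic 0.05 critical value is a simplification when parameters are estimated from the same data.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def fit_gumbel_moments(xs):
    # Method-of-moments Gumbel fit: mean = mu + gamma*beta,
    # variance = (pi*beta)**2 / 6.
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta

def gumbel_cdf(x, mu, beta):
    return math.exp(-math.exp(-(x - mu) / beta))

def ks_statistic(xs, cdf):
    # Kolmogorov-Smirnov D for a fully specified CDF.
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical schedule change ratios (invented for illustration).
sample = [1.02, 1.05, 1.08, 1.10, 1.12, 1.15, 1.20, 1.25, 1.40, 1.80]
mu, beta = fit_gumbel_moments(sample)
d = ks_statistic(sample, lambda x: gumbel_cdf(x, mu, beta))
# Asymptotic 0.05 critical value ~1.358/sqrt(n); strictly valid only when
# the parameters are NOT estimated from the same data.
print(d < 1.358 / math.sqrt(len(sample)))  # True: not rejected
```

With only ten points the fit sails through, which previews the small-sample power problem examined next.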
A subset of the Conrow data was created by removing programs with an on-time delivery (1.0 value). Of the
365 program sample, 87 programs had a schedule ratio of 1.0, and were subsequently were removed (or about 24
percent), leaving 278 programs to evaluate by distribution fitting. The resulting extreme value Type B µ and β
coefficients were estimated to be 1.20 and 0.27, respectively as given in Table 6. The remaining 278 values (with
Table 6. Summary of Extreme Value Distribution Schedule Change Statistical Test Results
schedule change not equal to 1.0) were then evaluated to determine what types of common distributions might be
rejected and what types might not be rejected. The results of this analysis are given in Table 7. Note that the normal
distribution, extreme value Type B distribution (and 19 other distributions) were rejected at the 0.05 level by both
the Anderson-Darling and Kolmogorov-Smirnov tests. (For the extreme value Type B distribution, the rejection was
from significance levels of 0.01 to 0.25, and 0.01 to 0.10 with the Anderson-Darling and Kolmogorov-Smirnov
tests, respectively, where exact critical values exist.) Of the distributions evaluated, only the Pearson Type VI was
not rejected by the Anderson-Darling test at the 0.05 level, and only the Pearson Type V was not rejected by the
Kolmogorov-Smirnov test at the 0.05 level. Finally, note that by removing the 1.0 values (278 remaining programs)
the extreme value Type B µ and β coefficients both increased vs. those associated with the original 365 program
values.
Analyses were then performed to reduce the Conrow 278 program sample size to the same size of Coleman and
Summerville, and Coleman, Summerville, and Dameron. In the first case every sixth value from the randomly
ordered Conrow sample (excluding 1.0 values) was determined. This provided 47 values, the same number used by
Coleman and Summerville, and Coleman, Summerville, and Dameron. As given in Table 6, the resulting extreme
value Type B µ and β coefficients were estimated to be 1.18 and 0.23, respectively, and the distribution type was not
rejected at the 0.05 level by both the Anderson-Darling and Kolmogorov-Smirnov tests. For the second case, a
different 47 program data set was created by taking every sixth value of the sorted Conrow 278 values with 1.0
values removed. [Here the sorted 278 values represented an ascending cumulative distribution function (CDF).] As
given in Table 6, the resulting extreme value Type B µ and β coefficients were estimated to be 1.18 and 0.32,
respectively, and the distribution type was not rejected at the 0.05 level by both the Anderson-Darling and
Kolmogorov-Smirnov tests. For the third case, a different 47 program data set was created by using the
methodology of the second case together with taking the maximum value of the 278 programs substituted for the
Anderson-Darling Kolmogorov-Smirnov
Sample Reject at 0.05 Reject at 0.05 Location Scale
Case Size Prob. Level? Prob. Level? Parameter Parameter
Space and Non-Space Programs 365 Y Y 1.12 0.24
Space and Non-Space Programs, No 1.0 Values 278 Y Y 1.20 0.27
Space and Non-Space Programs, No 1.0 Values, Random
1/6 Program Sample, Correct Endpoint 47 N N 1.18 0.23
Space and Non-Space Programs, No 1.0 Values, Sorted 1/6 Program Sample, Correct Endpoint 47 N N 1.18 0.32
Space and Non-Space Programs, No 1.0 Values, Sorted
1/6 Program Sample, Maximum Endpoint 47 N N 1.18 0.32Space Only Programs 301 Y Y 1.13 0.25Space Only Programs, No 1.0 Values 227 Y Y 1.21 0.29Non-Space Only Programs 64 Y Y 1.10 0.17
Non-Space Only Programs, No 1.0 Values 51 Y N 1.14 0.19
Non-Space Only Programs (Coleman and Summerville, Coleman; Summerville, and Dameron) 59 Unknown Y Unknown Unknown
Non-Space Only Programs, No 1.0 Values (Coleman and Summerville; Coleman, Summerville, and Dameron) 47 Unknown N 1.16 0.32
Copyright © 2010 by Edmund H. Conrow
8
Table 7. Evaluation of Candidate Probability Distributions for Schedule Change
of 278 Programs
47th
value of the second case. (Hence, the first 46 values in the first and second cases were identical—only the final,
47th
value was different.) As given in Table 6, the extreme value Type B µ and β coefficients were estimated to be
1.18 and 0.32, respectively, and the distribution type was not rejected at the 0.05 level by both the Anderson-Darling
and Kolmogorov-Smirnov tests.
Four other cases were then developed and evaluated. (The descriptive statistics for these four cases are not
reported in Table 3 because these results are considered secondary. The extreme value coefficients, Anderson-
Darling test results, and Kolmogorov-Smirnov test results for each case are given in Table 6.) Case four
corresponds to data from DS 1, 301 space programs. The extreme value Type B µ and β coefficients were estimated
to be 1.18 and 0.32, respectively, and the distribution type, plus the 22 other distribution types evaluated were
rejected at the 0.05 level by both the Anderson-Darling and Kolmogorov-Smirnov tests (as shown in Table 5). Case
five corresponds to data from DS 1 with the 74 1.0 values removed (24.6% of the total), thus leaving 227 programs
in the sample. The extreme value Type B µ and β coefficients were estimated to be 1.21 and 0.29, respectively, and
the extreme value Type B distribution type was rejected at the 0.05 level by both the Anderson-Darling and
Kolmogorov-Smirnov tests. (Only the beta, inverse Gaussian, Pearson Type VI, and random walk distributions
were not rejected by both the Anderson-Darling and Kolmogorov-Smirnov tests. In addition, the Pearson Type V
distribution was not rejected at the 0.05 level by the Kolmogorov-Smirnov test.) Case six corresponds to data from
DS 2, 64 non space programs. The extreme value Type B µ and β coefficients were estimated to be 1.10 and 0.17,
respectively, and the distribution type was rejected at the 0.05 level by both the Anderson-Darling and Kolmogorov-
Smirnov tests. (Only the beta distribution was not rejected by both the Anderson-Darling and Kolmogorov-Smirnov
tests. The Erlang, gamma, inverse Gaussian, log Laplace, Pearson Type V, Pearson Type VI, and random walk
distributions were not rejected by the Kolmogorov-Smirnov test but were rejected by the Anderson-Darling test.)
Case seven corresponds to data from DS 2 with the 13 1.0 values removed (20.3% of the total), thus leaving 51
programs in the sample. The extreme value Type B µ and β coefficients were estimated to be 1.14 and 0.19,
respectively, and the distribution type was rejected at the 0.05 level by the Anderson-Darling but not by the
Kolmogorov-Smirnov test. (Only the beta, Cauchy, inverse Gaussian, inverted Weibull, Pearson Type VI, and
random walk distributions were not rejected by both the Anderson-Darling and Kolmogorov-Smirnov tests. The
Erlang, extreme value Type B, gamma, Laplace, log-Laplace, log-logistic, and Pearson Type V distributions were
not rejected by the Kolmogorov-Smirnov test but were rejected by the Anderson-Darling test.) Finally, note that by
removing the 1.0 values (13 programs), the extreme value Type B µ and β coefficients both increased versus those
associated with the original 64 program values.

                      Anderson-Darling                Kolmogorov-Smirnov
Model                 Test Statistic  Reject at 0.05  Test Statistic  Reject at 0.05
                                      Prob. Level?                    Prob. Level?
Beta                  102.57          Y               0.44            Y
Cauchy                  9.85          Y               0.13            Y
Erlang                  3.30          Y               0.10            Y
Error                 213.87          Y               0.72            Y
Exponential            75.97          Y               0.46            Y
Extreme Value A        16.51          Y               0.18            Y
Extreme Value B         1.44          Y               0.07            Y
Gamma                   3.30          Y               0.10            Y
Inverse Gaussian        2.26          Y               0.09            Y
Inverted Weibull        5.45          Y               0.10            Y
Laplace                 8.38          Y               0.11            Y
Logistic                4.49          Y               0.09            Y
Log-Laplace             5.14          Y               0.10            Y
Log-Logistic            1.99          Y               0.07            Y
Lognormal               2.23          Y               0.09            Y
Normal                  6.33          Y               0.13            Y
Pearson Type V          1.54          Y               0.07            N
Pearson Type VI         2.11          N               0.09            Y
Random Walk             2.31          Y               0.09            Y
Rayleigh               38.40          Y               0.33            Y
Uniform                51.41          Y               0.32            Y
Wald                  137.01          Y               0.50            Y
Weibull                 8.35          Y               0.12            Y
The above extreme value Type B analysis points to the danger of performing distribution fitting on small to
moderate sized samples and accepting the resulting statistics. This is a different situation than if the
results had been rejected at the 0.05 level with the smaller sample size (47 programs), because statistical
power increases with sample size (cet. par.). Statistical power limitations of small samples explain some key results
given in Table 6. Even with the 1.0 values eliminated, when the sample was reduced from 278 programs to
47 programs the extreme value Type B distribution went from being rejected by both the Anderson-Darling and
Kolmogorov-Smirnov tests at the 0.05 level (278 programs) to not being rejected by these same tests (47 programs).
The primary difference in these three data subset cases was that the sample size was reduced by a factor of 6, or put
another way, 5/6 of the data was eliminated. No new information was added in any of the three subset cases;
information was in fact eliminated, and the structure of the three cases did not constrain the results. The latter point
is clear from examining the descriptive statistics given in Table 3 coupled with the extreme value Type B location
and scale parameters given in Table 6—variations between the full sample without 1.0 values (278 programs) and
the three samples reduced to 1/6 of the values (47 programs) were generally minor.
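The down-sampling effect described above can be illustrated with a small simulation. The sketch below is hypothetical: it uses synthetic right-skewed data (not the DS 1/DS 2 samples) and assumes SciPy is available. Note also that running a Kolmogorov-Smirnov test against parameters fitted from the same sample inflates p-values somewhat, which only strengthens the small-sample caution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed "schedule change" ratios (stand-in data, not the paper's samples)
full = 1.0 + rng.lognormal(mean=-1.5, sigma=0.8, size=282)

def ks_pvalue_gumbel(sample):
    # Fit an extreme value (Gumbel) distribution, then run a Kolmogorov-Smirnov test.
    # Caveat: testing against parameters fitted from the same data is approximate.
    loc, scale = stats.gumbel_r.fit(sample)
    return stats.kstest(sample, "gumbel_r", args=(loc, scale)).pvalue

p_full = ks_pvalue_gumbel(full)        # large sample: more power to detect misfit
subset = np.sort(full)[::6]            # keep every sixth sorted value (47 points)
p_subset = ks_pvalue_gumbel(subset)    # small sample: rejection becomes much less likely
```

With 5/6 of the data removed, the test statistic is computed from far less information, so the same underlying mismatch between the data and the candidate distribution is much harder to detect.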
Discussion
The first Dubos, Saleh, and Braun claim given in the Introduction to this paper, that RSS is normally distributed,
cannot be accepted given the results presented in Table 2 from distribution fitting the same 28 program sample. The
results show that a normal distribution was rejected at the 0.05 level by both the Anderson-Darling and
Kolmogorov-Smirnov tests. (In addition, the skewness value given in Table 1 (1.35) is inconsistent with a normal
distribution, whose skewness value is 0.0.) Using a much larger sample of space and non-space data (365 programs,
Table 4), and space data only (301 programs, Table 5), all of the 23 distribution types evaluated, including the
normal distribution, were rejected at the 0.05 level by the Anderson-Darling and Kolmogorov-Smirnov tests. The
second Dubos, Saleh, and Braun claim given in the Introduction is that if more RSS data are obtained and warrant a
different PDF, then that new PDF can be used to develop schedule-risk curves. Results given in Table 3 show that
space and non-space schedule change data, and space program only schedule change data, have a right-hand skew
(skewness coefficient > 0). More importantly, results given in Tables 4 and 5 show that no pre-defined PDF (of the
23 types tested) accurately represents either space, or space and non-space, schedule change data when the sample
size is sufficiently large to preclude Type II errors at a given significance level. [A Type II error is failure to reject a
given distribution type when that distribution is incorrect (a false negative). Thus when the statistical power of a
test increases, the chances of it making a Type II error decrease. As previously mentioned, statistical power
increases with sample size (cet. par.). In addition, in terms of common statistical tests, statistical power increases in
the following order (cet. par.): chi-square, Kolmogorov-Smirnov, and Anderson-Darling [9].]
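The relationship between sample size, statistical power, and Type II error rate can be sketched with a quick Monte Carlo experiment. This is an illustrative example with synthetic data (assuming SciPy is available), not a reproduction of this paper's analysis: it estimates how often a Kolmogorov-Smirnov test fails to reject a (false) normal hypothesis for right-skewed lognormal data at two sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def type2_error_rate(n, trials=200):
    """Fraction of trials in which a KS test fails to reject normality
    for data that is actually lognormal (a Type II error)."""
    misses = 0
    for _ in range(trials):
        x = rng.lognormal(mean=0.0, sigma=0.5, size=n)
        # Standardize and test against N(0, 1); parameters are estimated from
        # the data, so this is approximate (Lilliefors correction omitted).
        z = (x - x.mean()) / x.std()
        if stats.kstest(z, "norm").pvalue > 0.05:
            misses += 1
    return misses / trials

small_n_rate = type2_error_rate(30)    # low power: frequent Type II errors
large_n_rate = type2_error_rate(300)   # high power: Type II errors become rare
```

The skewed alternative goes undetected in a large share of small-sample trials, while the same departure from normality is almost always detected once the sample is an order of magnitude larger.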
Of the rationale provided by Coleman, Summerville, and Dameron to support the use of an extreme value Type
B distribution, only the statement: “given that finishing a program is akin to waiting for the last event to finish, it is
appealing that the Gumbel, or extreme value distribution, is the best fit to the data [8]” has one acceptable theoretical
facet. As mentioned above, one of Conrow’s findings was that performance is typically the dominant variable for
DoD and NASA development programs [2] [3]. In effect, both the government and contractor attempt to
meet performance requirements, and in the process cost and/or schedule are typically adjusted (often meaning
growth) on a case-by-case basis as all three variables are traded in order to meet performance requirements within
cost and schedule constraints [2] [3]. Hence, the statement of Coleman, Summerville, and Dameron that “finishing a
program is akin to waiting for the last event to finish [8]” implies cost, performance, and schedule trades continuing
throughout the development program, which more often than not contributes to cost growth and/or schedule
slippage. [The nature of these trades is complex, and Conrow determined that the coefficient of determination (R²)
between cost change, performance change, and schedule change in a roughly 50 program sample of DoD non-space
programs is very small (e.g., 0.07 or less) [4].] The other arguments provided by Coleman and Summerville [7], and
Coleman, Summerville, and Dameron [8], do not have sufficient merit to support using an extreme value Type B
distribution, or even eliminating the on-time delivery programs (schedule change of 1.0). Results from the analyses
performed and given in Table 6 show that the acceptance of the extreme value Type B distribution is largely related
to sample size. In effect, the non-rejection at the 0.05 level by the Kolmogorov-Smirnov test that resulted from the
analyses of Coleman and Summerville [7], and Coleman, Summerville, and Dameron [8], was almost certainly the
result of a Type II error due to weak statistical power associated with small sample size. (This is in part because
statistical tests are not very sensitive to minor differences between the data and the candidate distribution type,
particularly for small sample sizes [9].) The assertion that non-rejection is related to a Type II error
is strongly supported by contrasting results from the Conrow full sample case (278 programs) with 1.0 values
removed (Table 6, second entry), against the three reduced sample size cases (47 programs) that were created from
the same Conrow full sample case. In the former case the extreme value Type B distribution was rejected at the 0.05
level (and a broad range around this level) by both the Anderson-Darling and Kolmogorov-Smirnov tests, while in
the latter three cases developed and evaluated, the assumption of an extreme value Type B distribution could not be
rejected by either statistical test at the 0.05 level.
What sample size is sufficient for distribution fitting given the small sample size problems that existed in the
analyses of Dubos, Saleh, and Braun ([1], 28 values), Coleman and Summerville ([7], 47 and 59 values), and
Coleman, Summerville, and Dameron ([8], 47 and 59 values)? Clearly 28, 47, or 59 representative data points are
insufficient except to rule out certain distribution types. For example, if the Anderson-Darling and Kolmogorov-
Smirnov tests reject a candidate distribution at the 0.05 level for 50 data points and the sample is representative of
the population of corresponding schedule change values, then larger sample sizes will also be rejected because
statistical power increases with sample size (cet. par.).
While no single value or even range of values is sufficient in all cases, approximately 200 data points may be a
sufficient lower bound to permit fitting with univariate, unimodal distributions when other clear evidence that
supports a particular distribution type does not exist. (As given in Table 6, the extreme value Type B hypothesis
was rejected by both the Anderson-Darling and Kolmogorov-Smirnov test for a sample of 227 program schedule
change values that had the 1.0 values removed.) However, when samples drawn from a normal distribution by a high
quality commercial Monte Carlo simulation were fit, many of the candidate distribution types were not rejected, and
the normal distribution did not yield a much smaller Anderson-Darling test statistic than the other distribution types,
until much larger sample sizes were used (e.g., 500 values).
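This behavior is easy to reproduce in outline. The sketch below (synthetic data, assuming SciPy is available) fits a few alternative distribution families to a moderate-sized normal sample and counts how many survive a Kolmogorov-Smirnov screen; with only a couple hundred points, several near-symmetric alternatives typically cannot be rejected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(loc=100.0, scale=10.0, size=200)   # truly normal sample

# A few candidate families (names as in scipy.stats); fit each, then KS-test.
candidates = ["norm", "logistic", "gamma", "lognorm", "gumbel_r"]
not_rejected = []
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)                  # maximum-likelihood fit
    p = stats.kstest(data, name, args=params).pvalue
    if p > 0.05:                             # fitted-parameter caveat applies
        not_rejected.append(name)
```

At this sample size, more than the normal family generally passes the screen, consistent with the observation that much larger samples are needed before the correct family clearly separates from the alternatives.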
For very large sample sizes (e.g., several thousand values), statistical tests will often reject the null hypothesis
that the sample represents a particular distribution type. This is because even very small differences between the
data and the hypothesized distribution will be detected. These same differences may be present but unobserved with
small sample sizes; the rejection at large sample sizes occurs in part because statistical tests do not differentiate
between an exact and a “nearly correct” fit for a given distribution type [9].
Finally, the analyst should recognize that there is no guarantee that schedule change or any other data will fit any
particular distribution type. Given this potential problem, what approach should the analyst use? One approach
sometimes used to estimate the type of probability distribution when limited data exist is to convert the data
into a histogram. This approach is not recommended: with relatively small sample sizes, a finite number of
equal-width bins can lead to errors from aggregation (dissimilar values lumped into one bin) and/or drop-out
(empty bins). In effect, a histogram with a finite number of bins only approximates the true PDF associated with
the data. Another, even more inappropriate
approach, is simply guessing at the type and specific characteristics of a probability distribution when no data exists
or using a default probability distribution, then not mentioning the resulting limitations in the subsequent analysis
ground rules and assumptions.
Given the above illustrations of faulty or weak methodologies, is there a better approach the analyst should
consider? When data is available, and particularly when it was derived at the same Work Breakdown Structure level
as it will be used for risk analysis purposes, the recommended approach is to convert the data into an ascending CDF
and directly use this (empirical) CDF in the risk analysis tool (e.g., Monte Carlo simulation)‡. This approach is
more accurate than: 1) guessing the probability distribution, 2) using the default model distribution without noting
this or other limitations, 3) distribution fitting data when sample sizes are small to moderate, or 4) using histograms
to represent the data.
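The recommended empirical-CDF approach can be sketched in a few lines. The following is a minimal illustration with made-up schedule change ratios (not data from this paper), assuming NumPy is available; a Monte Carlo tool that accepts an ascending CDF samples from it in essentially this way (inverse-transform sampling).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical schedule change ratios (actual/initial duration); illustrative only
observed = np.array([1.00, 1.00, 1.05, 1.12, 1.18, 1.25, 1.31, 1.40, 1.62, 2.10])

# Ascending empirical CDF: sorted values paired with cumulative probabilities
values = np.sort(observed)
cum_prob = np.arange(1, values.size + 1) / values.size

def sample_empirical_cdf(n):
    # Inverse-transform sampling: draw uniforms and map them through the CDF.
    # Linear interpolation between observed points; no extrapolation beyond the data.
    u = rng.uniform(0.0, 1.0, size=n)
    return np.interp(u, cum_prob, values)

draws = sample_empirical_cdf(10_000)   # feed these into the risk analysis model
```

Because the draws are interpolated between observed values and never extrapolated past them, the simulated distribution is only as representative as the underlying sample, which is the condition stated in the text.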
‡ Some high quality commercial Monte Carlo simulation packages, such as @RISK 5.5 or higher, Industrial or
Professional versions, from Palisade Corporation, allow the analyst to develop or import an ascending CDF and
apply this distribution to specified model element(s). Note: mention of @RISK does not constitute an endorsement;
the software is cited only because it has the capability to develop or import an ascending CDF and use it directly in
simulations.

Conclusions

Schedule change is typically derived from comparing actual versus initial schedule durations over the same
acquisition events (whether as a ratio or percent change). Acquisition dynamics and trade preferences among cost,
performance, and schedule contribute to both cost change and schedule change growing (slipping) during the course
of defense and space programs. In addition, these factors also contribute to the distribution of cost change and
schedule change values having a right-hand skew (rather than a symmetrical or left-hand skew). There is little
specific information in the literature about the type(s) of distributions that may be associated with schedule change.
Schedule change is clearly not normally distributed for space programs in particular or aerospace programs in
general. This assertion is supported by numerous facts, including: 1) acquisition dynamics and trade preferences
among cost, performance, and schedule that lead to larger potential cost growth and/or schedule slippage than
reductions in these variables; 2) skewness and kurtosis estimates from various samples of space and aerospace
programs ranging in size from 28 NASA programs to 365 combined space (301) and non-space (64) programs that
clearly do not represent a normal distribution (0.0 skewness and 3.0 kurtosis); and 3) results from distribution fitting
various samples of space and aerospace programs; in all cases the normal distribution assumption was rejected for
schedule change by the Anderson-Darling and Kolmogorov-Smirnov statistical tests at the 0.05 level.
If schedule change is not normally distributed, then should we assume that it follows some other type of
continuous, univariate distribution? Based upon the statistical results presented in this paper, the answer to this
question is a resounding no. Results from evaluating a 365 program sample of combined space (301) and non-space
(64) programs showed that all 23 of the candidate distribution types evaluated were rejected by the Anderson-Darling
and Kolmogorov-Smirnov statistical tests at the 0.05 level.
Of some interest were the results for the extreme value (Gumbel) distribution. The attributes of this
distribution might somewhat mimic the preference for performance over cost and/or schedule, and thus provide a
suitable distribution for the right-hand skew associated with increased schedule change (and possibly cost change,
although not evaluated here). Prior researchers using a single sample of 59 programs found that the extreme value
distribution was rejected at the 0.05 level by the Kolmogorov-Smirnov statistical test. However, when the on-time
deliveries were removed (12 programs), the resulting sample (47 programs) was not rejected at the 0.05 level by the
Kolmogorov-Smirnov statistical test. The same evaluation was then performed by Conrow on a much larger sample
(365 programs) and a subset with the on-time deliveries (87 programs) removed, leaving 278 programs. In the former
case all 23 candidate distributions tested were rejected by both the Anderson-Darling and Kolmogorov-Smirnov
statistical tests at the 0.05 level. [The same results occurred when only the 301 space programs (which included on-
time deliveries) were evaluated.] In the latter case only one of the 23 candidate distributions was not rejected by
each test at the 0.05 level: the Pearson Type VI by the Anderson-Darling test and the Pearson Type V by the
Kolmogorov-Smirnov test.
The combined data set with the 1.0 values removed was then down-sampled three different ways to reach 47 values,
the same sample size as used by prior researchers. In each of the three cases, the extreme value distribution was not
rejected by the Anderson-Darling and Kolmogorov-Smirnov statistical tests at the 0.05 level. The difference in
results between the 278 program sample and the 47 program samples was solely due to sample size: no new
information was added, only data were eliminated. The different results are caused by the reduced statistical power
of both the Anderson-Darling and Kolmogorov-Smirnov statistical tests with the much smaller down-sampled data
(5/6 of the data were removed). Hence, attempting to perform distribution fitting on small sample sizes (e.g., 50
values or less) may lead to erroneous results if the distribution type is not rejected; such results can be caused by
Type II errors resulting from the diminished statistical power associated with small sample size.
Sample sizes of approximately 200 data points may be a sufficient lower bound to permit fitting with univariate,
unimodal distributions when other clear evidence that supports a particular distribution type does not exist.
However, when samples drawn from a normal distribution by a high quality commercial Monte Carlo simulation
were fit, many of the candidate distribution types were not rejected, and the normal distribution did not yield a much
smaller Anderson-Darling test statistic than the other distribution types, until much larger sample sizes were used
(e.g., 500 values). When much
smaller samples exist, the recommended approach when data modeling is needed is to sort the data into an ascending
CDF, then use the CDF in subsequent analyses (e.g., a Monte Carlo simulation). This approach does not introduce
any errors of interpolation or extrapolation in and of itself and should be accurate so long as the underlying data
sample represents the data population.
References
[1] Dubos, G., Saleh, J., and Braun, R., “Technology Readiness Level, Schedule Risk, and Slippage in Spacecraft
Design,” AIAA Journal of Spacecraft and Rockets, Vol. 45, No. 4, July–August 2008, pp. 840-841.
[2] Conrow, E., Effective Risk Management: Some Keys to Success, Second Edition, American Institute of
Aeronautics and Astronautics, 2003, pp. 2-13, 427-430, 431-433.
[3] Conrow, E., “Some Long-Term Issues and Impediments Affecting Military Systems Acquisition Reform,”
Acquisition Review Quarterly, Vol. 2, No. 3, Summer 1995, pp. 199–212.
[4] Conrow, E., “Some Inherent Limitations Of Quantitative Cost Risk Assessment Methodologies,” 29th Annual
DoD Cost Analysis Symposium, 21 February 1996.
[5] Candreva, P., “Rethinking Acquisition Reform: Cost Growth Solutions May Aggravate More Important
Problems,” 5th Annual Acquisition Research Symposium of the Naval Postgraduate School: Acquisition Research:
Creating Synergy for Informed Change, 14-15 May 2008.
[6] Arena, M., Leonard, R., Murray, S., Younossi, O., “Historical Cost Growth of Completed Weapon System
Programs,” RAND, TR-343, 2006, pp. xii, 22, 27.
[7] Coleman, R. and Summerville, J., “A Survey of Cost Risk Methods for Project Management,” PMI Risk SIG
Project Risk Symposium, 16 May 2004.
[8] Coleman, R., Summerville, J., and Dameron, M., “The Relationship Between Cost Growth and Schedule
Growth,” Acquisition Review Quarterly, Vol. 10, No. 2, Spring 2003, pp. 117-122.
[9] Law, A., Simulation Modeling and Analysis, Fourth Edition, McGraw Hill, 2007, pp. 340-352.