Copyright © 2010 by Edmund H. Conrow
Space Program Schedule Change Probability Distributions
Dr. Edmund H. Conrow, CMC, CPCM, CRM, PMP*
Management and Technology Associates, Redondo Beach, California 90278
Many complex space and non-space programs are driven by performance requirements, with cost and/or
schedule being less dominant, if not dependent, variables. The net result, when actuals are compared
against initial projections, is cost and schedule change that grows (slips) during the course of a
program. Several different samples of space and non-space
data were examined to estimate descriptive statistics (e.g., mean, median, skewness, kurtosis)
and determine what types of statistical distributions might represent schedule change data.
The analyses performed also tested assertions of other researchers who postulated that space
and non-space program schedule change can be represented by a normal distribution or an
extreme value (Gumbel) distribution. A data sample of 28 NASA programs was examined to
test the hypothesis claimed by other researchers that schedule change can be represented by
a normal distribution. Both the skewness and kurtosis of the data and results from
Anderson-Darling and Kolmogorov-Smirnov statistical tests show that this data is not
normally distributed. A large sample of space and non-space programs (365 programs) was
obtained and evaluated against 23 different types of continuous univariate distributions.
Anderson-Darling and Kolmogorov-Smirnov test results show that all 23 distribution types
(including normal and extreme value distributions) were rejected at the 0.05 level. The
sample was subsequently reduced artificially by including only every sixth value from both random and
sorted orderings of the data, after values associated with on-time deliveries had been removed.
Anderson-Darling and Kolmogorov-Smirnov test results show that the extreme value distribution could
not be rejected at the 0.05 level. This potential contradiction resulted from the much smaller sample
size versus the full sample (with or without on-time delivery values removed), which diminished the
statistical power of the Anderson-Darling and Kolmogorov-Smirnov tests and thus prevented rejection
of the extreme value distribution at the 0.05 level. Consequently, sample sizes of roughly 200 or more data points
may be necessary to provide a meaningful distribution fit of schedule change and potentially
other data. For smaller, and particularly much smaller, sample sizes it is recommended that
the data not be evaluated against candidate distribution types, but simply be converted into
an ascending cumulative distribution function (CDF) and this function used as appropriate
in subsequent analyses (e.g., Monte Carlo simulations).
Introduction
A variety of probability distributions have been assumed to represent variations in cost and schedule during the
acquisition phase of space (and aerospace) programs (e.g., extreme value, normal, and triangular distributions), but few
rigorous statistical analyses have been conducted to evaluate the types of distributions that should be eliminated as
well as those that may be possible. In this paper I examine schedule variations for a wide variety of government and
commercial programs, primarily space programs, and provide an evaluation of which types of common probability
distributions can be rejected, and which types cannot be rejected. (The analyses were performed at the total program
level since schedule change information is rarely available at lower work breakdown structure (WBS) levels.)
Dubos, Saleh, and Braun assumed that schedule change associated with NASA spacecraft development was
normally distributed [1], specifically:
“The data in our sample are not rich enough to allow us to infer a probability distribution function
for final total schedule duration (FTD) or relative schedule slippage (RSS). However, for the
purpose of introducing the concept of a technology readiness level (TRL) schedule-risk curve, let
us assume that for a given TRL, RSS (or FTD) is normally distributed. That is, we formulate the
hypothesis that the RSS (or FTD) has a normal probability density function,”…
* Principal, P. O. Box 1125, Redondo Beach, CA 90278, www.risk-services.com, Associate Fellow and Life Member.
AIAA SPACE 2010 Conference & Exposition, 30 August - 2 September 2010, Anaheim, California. AIAA 2010-8834.
Copyright © 2010 by Edmund H. Conrow. Published by the American Institute of Aeronautics and Astronautics, Inc., with permission.
While the probability density function (PDF) of the final total schedule duration (FTD) is somewhat
unimportant, the PDF for relative schedule slippage (RSS) between the actual versus planned schedule duration is of
key interest for the scope of the analysis reported in this paper. Note that Dubos, Saleh, and Braun provided no
evidence to support their assertion that RSS is normally distributed, yet evidence exists in the literature, from
both theoretical considerations and statistical testing, that aerospace program cost change and schedule change
are likely not normally distributed; for example, they have a right-hand skew (see [2], [3], [5], [6]).
Dubos, Saleh, and Braun then state [1]:
“If more RSS empirical data are available and warrant a different probability distribution function,
schedule-risk curves can still be developed … with the new probability distribution function and
its parameters;…”
The above statement assumes that a probability distribution function (assumed to be a PDF) may be defined that
represents RSS (or schedule change) that is not normal, but the authors provided no evidence to support this
assertion. Both the Dubos, Saleh, and Braun normality assertion and the assertion that a different but definable PDF
may exist will be explored in this paper.
Coleman, Summerville, and Dameron assumed that an extreme value distribution is “what we expect
theoretically” to model schedule change for DoD acquisition programs [7], [8]. However, an inadequate
justification was provided to support this assertion, specifically that “the case is actually made by the statistical test” [8].
The assertions, methodology, and results of Coleman, Summerville, and Dameron ([7], [8]) are also explored in this
paper.
Finally, results are provided from an extensive statistical evaluation of a sample of space and non-space program
schedule change data. Hence, three sets of analyses were performed and are reported in this paper. First, the same
NASA schedule change data used by Dubos, Saleh, and Braun [1] was evaluated to see whether/not a normal
distribution is an acceptable statistical distribution fit to that data. [Note: even if a normal distribution is not
rejected by the statistical tests used in the distribution fitting, this does not mean that it should be accepted as an
appropriate distribution type (let alone “the appropriate” distribution type) to represent the data.] Second, a large
sample of schedule change data derived from 365 space and non-space programs was evaluated to determine which
distribution types both were rejected and not rejected at the 0.05 level using two appropriate statistical tests. Third,
the assertions, methodology, and results of Coleman, Summerville, and Dameron, who evaluated a 59 program
sample, including those results associated with the extreme value distribution, are examined in light of the much
larger schedule change sample developed by Conrow (365 programs).
Statistical Analysis of Schedule Change Data
From a theoretical perspective Conrow investigated the trade space between program cost, performance, and
schedule and developed an analytical framework that incorporates these variables, acquisition dynamics during the
course of a program, variable constraints, and likely outcomes. One finding was that performance is typically the
dominant variable for Department of Defense and NASA development programs [2] [3]. In effect, both the
government and contractor attempt to meet performance requirements, and in the process cost and/or schedule are
typically adjusted (often meaning growth) on a case-by-case basis in order to meet these requirements [2] [3]. This
framework was validated in part by statistical results from analyzing roughly 50 major first generation DoD
development programs (none of which had been re-baselined in the final development phase). The estimated
skewness for cost change, performance change, and schedule change were 1.24, 0.38, and 1.24, respectively [2] [3].
The estimated kurtosis using the same data for cost change, performance change, and schedule change were 4.36,
4.62, and 4.70, respectively. Since a normal distribution has a skewness of 0.00 and a kurtosis of 3.00, the
skewness results for cost change and schedule change show that both variables have a right-hand skew, and
together with the kurtosis results, neither variable is likely normally distributed. (For any DoD or NASA program
nearing its major delivery milestone, is the item more likely to be delivered ahead of schedule, on schedule, or
behind schedule? Deliveries are more frequently behind schedule than on time or ahead of schedule, which leads
to a right-hand skew.) Furthermore, when the Conrow cost change and schedule change data
were evaluated against 23 candidate distributions, the normal distribution was rejected at the 0.05 level with both the
Anderson-Darling and Kolmogorov-Smirnov tests. The above theoretical and statistical analyses, first published in
part in 1995 [3] and 1996 [4], clearly show that a normal distribution is not an appropriate choice to represent cost
change or schedule change data.
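The skewness and kurtosis values quoted above are the third and fourth standardized moments. A minimal sketch in pure Python follows; the paper does not state which estimator variant its software used, so the simple population-moment form below is an assumption, and the two small samples are invented for illustration.

```python
def skewness(xs):
    # Third standardized moment (population form; the paper's exact
    # estimator is not stated, so this simple form is an assumption).
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return (sum((x - m) ** 3 for x in xs) / n) / s2 ** 1.5

def kurtosis(xs):
    # Fourth standardized moment; a normal distribution gives 3.0.
    # Subtracting 3.0 yields the "excess kurtosis" some packages report.
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return (sum((x - m) ** 4 for x in xs) / n) / s2 ** 2

# A symmetric sample has zero skewness; a right-skewed one is positive.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.1, 1.2, 1.5, 3.0]
print(skewness(symmetric))         # 0.0
print(skewness(right_skewed) > 0)  # True
```

Bias-corrected sample estimators give slightly different numbers for small n, which is one reason reported skewness and kurtosis figures can vary between software packages.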
Conrow evaluated the same 28 program data set that Dubos, Saleh, and Braun used in their analysis. (The data
was provided by Dr. Dale Thomas of NASA Marshall Space Flight Center.) Conrow first converted the RSS data to
a simple schedule change ratio defined by the final total schedule duration (FTD) divided by the initial schedule
duration estimate (IDE). Conrow determined that the sample had an average and median schedule change of 1.52
and 1.28, respectively. [Note: values equal to 1.0 indicate an on-time delivery, values less than 1.0 indicate an
accelerated delivery, and values greater than 1.0 indicate a late delivery (commonly called schedule slippage).]
Hence, the average and median schedule change correspond to a 52 percent and a 28 percent slippage, respectively.
In addition, the skewness and kurtosis were 1.35 and 4.49, respectively, which is considerably different from what
would exist for a normal distribution (0.00 and 3.00, respectively). The skewness and kurtosis results should have
given Dubos, Saleh, and Braun concern that the NASA 28 program sample was not normally distributed, even
though these results cannot be related to a statistical significance level. (Note: while it would have been almost
trivial to generate, Dubos, Saleh, and Braun did not provide descriptive statistics for the NASA 28 program sample
in their paper [1], which again would have provided evidence that the NASA 28 program data were not normally
distributed.) Descriptive statistics for the NASA 28 program sample estimated by Conrow are given in Table 1 for
both schedule change (ratio) and RSS. The results are effectively identical for schedule change (ratio) and RSS,
allowing for the difference in representation of the two variable types (and minor rounding errors).
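The two representations are related algebraically by RSS = 100 × (ratio − 1). A minimal sketch, using hypothetical planned and actual durations (the 40 and 50 month figures are invented for illustration):

```python
def schedule_ratio(ftd_months, ide_months):
    # Final total schedule duration (FTD) divided by the initial
    # schedule duration estimate (IDE).
    return ftd_months / ide_months

def rss_percent(ftd_months, ide_months):
    # Relative schedule slippage as a percentage of the initial estimate:
    # RSS = 100 * (FTD - IDE) / IDE = 100 * (ratio - 1).
    return 100.0 * (ftd_months - ide_months) / ide_months

# A hypothetical program planned for 40 months that took 50 months:
print(schedule_ratio(50, 40))  # 1.25 (a 25 percent slip)
print(rss_percent(50, 40))     # 25.0
```

This identity is why the two columns of Table 1 track each other: a mean ratio of 1.52 corresponds to a mean RSS of about 52.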
Table 1. Descriptive Statistics for Schedule Slippage and RSS of 28 NASA Programs

Item                 Ratio    RSS
Count                28       28
Mean                 1.52     52.2
Standard Deviation   0.54     53.7
Skewness             1.35     1.35
Kurtosis             4.49     4.49
Minimum              1.00     0.0
1st Quartile         1.11     11.1
Median               1.28     27.5
3rd Quartile         1.81     81.2
Maximum              3.14     213.7
Range                2.14     213.7

The first and third quartiles are equivalent to the 25th and 75th percentiles, respectively. Skewness, a measure of
the symmetry of the data, and kurtosis, corresponding to the “peakedness” of the data, represent normalized forms
of the third and fourth standardized moments of a distribution, respectively. (See
http://en.wikipedia.org/wiki/Skewness and http://en.wikipedia.org/wiki/Kurtosis for additional information on
skewness and kurtosis, respectively.) [The reader should be aware that some statistical packages (e.g., Microsoft
Excel) estimate excess kurtosis rather than kurtosis. Excess kurtosis is simply kurtosis minus 3.0.]
The same 28 program sample was then evaluated to determine what types of common distributions might be
rejected and what types might not be rejected. The results of this analysis using schedule change (ratio) data are
given in Table 2. Note that the normal distribution was rejected at the 0.05 level (and in fact at the 0.01 level) by
both the Anderson-Darling and Kolmogorov-Smirnov tests, which are both considered more powerful than the
Chi-Square test [9]. [In terms of common statistical tests, statistical power increases in the following order, ceteris
paribus (cet. par.), i.e., all other factors held constant: Chi-Square, Kolmogorov-Smirnov, and Anderson-Darling [9].
The Chi-Square test was not used in the schedule change analysis because of its limited statistical power and
common implementation difficulties with relatively small sample sizes. Statistical power also increases with
increasing sample size and test significance level, and with decreasing sample standard deviation.] Note, however,
that not rejecting a type of probability distribution at the 0.05 or another level is different from accepting that the
distribution satisfactorily represents the data; this is a common mistake [9]. Accepting a given probability
distribution may require solid “real world” and/or theoretical justification for the type of distribution under
evaluation beyond (just) the statistical test results.
In addition, the same results occurred using the RSS data: both the Anderson-Darling and Kolmogorov-Smirnov
tests rejected the normal distribution at the 0.05 level (and in fact at the 0.01 level). [From Table 2, of the 23
candidate distributions evaluated against the NASA 28 program data set, 18 and 12 distributions were rejected at
the 0.05 level by the Anderson-Darling and Kolmogorov-Smirnov tests, respectively.] Given the small sample size, it is
understandable that multiple distribution types were not rejected in the evaluation. [This is in part because the
power of the statistical tests used is limited for small sample sizes. This assertion was verified by distribution
fitting data generated by a commercial normal probability distribution random number generator. For a sample of
28 points, 16 of the 21 distribution types tested were not rejected at the 0.05 level by both the Anderson-Darling
and Kolmogorov-Smirnov tests. This number decreased to 13, 11, 9, 9, and 8 distribution types not being rejected
out of the 21 distributions tested for sample sizes of 64, 120, 250, 385, and 500 data points, respectively. In
addition, the normal distribution Anderson-Darling test statistic was not substantially smaller than those for the
other 20 distribution types until the sample reached 500 values, where it was 43 times smaller than the next
smallest test statistic (gamma distribution).] As mentioned above, not rejecting a type of probability distribution is
different from accepting that distribution as being representative of the data [9].
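The limited power at small sample sizes can be illustrated with a self-contained sketch (pure Python; the 28-point sample below is deterministic and illustrative, not program data). Two different fitted candidate shapes, normal and logistic, both survive a Kolmogorov-Smirnov check against the asymptotic 0.05 critical value 1.358/sqrt(n); using that plain critical value with fitted parameters is itself a simplification (Lilliefors-type tables would be stricter).

```python
import math

def ks_d(xs, cdf):
    # Kolmogorov-Smirnov D statistic for a fully specified CDF.
    xs = sorted(xs)
    n = len(xs)
    return max(max(i / n - cdf(x), cdf(x) - (i - 1) / n)
               for i, x in enumerate(xs, start=1))

def fitted_normal_cdf(xs):
    # Normal CDF with moment-fitted mean and standard deviation.
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return lambda x: 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))

def fitted_logistic_cdf(xs):
    # Logistic CDF with moment-fitted parameters (scale from the
    # standard deviation: s = sd * sqrt(3) / pi).
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    s = sd * math.sqrt(3.0) / math.pi
    return lambda x: 1.0 / (1.0 + math.exp(-(x - m) / s))

# 28 deterministic points (logistic quantiles), standing in for a sample.
sample = [math.log(u / (1.0 - u)) for u in (i / 29.0 for i in range(1, 29))]
crit = 1.358 / math.sqrt(len(sample))  # asymptotic 0.05 critical value
print(ks_d(sample, fitted_normal_cdf(sample)) < crit)    # True
print(ks_d(sample, fitted_logistic_cdf(sample)) < crit)  # True
```

With only 28 points, two quite different shapes both pass, mirroring the 16-of-21 result quoted above.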
Table 2. Evaluation of Candidate Probability Distributions for Schedule Slippage of 28 NASA Programs

Model               Anderson-Darling   Reject at 0.05   Kolmogorov-Smirnov   Reject at 0.05
                    Test Statistic     Prob. Level?     Test Statistic       Prob. Level?
Beta                  0.47             N                 0.14                N
Cauchy                3.07             N                 0.23                N
Erlang                1.18             Y                 0.21                N
Error                19.35             Y                 0.73                Y
Exponential           6.39             Y                 0.48                Y
Extreme Value A       2.23             Y                 0.25                Y
Extreme Value B       1.19             Y                 0.20                Y
Gamma                 1.27             Y                 0.22                N
Inverse Gaussian      1.12             N                 0.21                N
Inverted Weibull      0.77             Y                 0.15                N
Laplace               2.52             Y                 0.25                N
Logistic              1.43             Y                 0.19                Y
Log-Laplace           1.90             Y                 0.19                N
Log-Logistic          1.04             Y                 0.17                Y
Lognormal             1.12             Y                 0.20                Y
Normal                1.60             Y                 0.23                Y
Pearson Type V        0.99             Y                 0.19                N
Pearson Type VI       1.10             N                 0.20                N
Random Walk           1.15             N                 0.21                N
Rayleigh              2.45             Y                 0.32                Y
Uniform              44.56             Y                 0.43                Y
Wald                 15.99             Y                 0.59                Y
Weibull               1.48             Y                 0.21                Y

† Data set 1 initially included 348 space programs. Six acquisition outliers were identified by the database
developer and Conrow and removed, leaving 342 programs. Conrow identified internal computational errors
affecting the plan and/or actual dates for 41 programs (two separate estimates had an inconsistency greater than 2
months), leaving 301 DS 1 programs. This computational error was unknown to the database developer, but the
developer agreed that Conrow’s approach was correct for filtering the data. Data set 2 initially included 244
programs. Eighty-one programs were removed, leaving 163 programs, because they did not have the key delivery
milestone needed for the analysis. Forty-one programs were then removed, leaving 122 programs, because the key
delivery milestone was an estimate (not an actual) value. Fifty-eight programs were then removed, leaving 64 DS 2
programs, because one or more other necessary actual or plan schedule milestones were missing. These data issues
were unknown to the database developer, but the developer agreed that Conrow’s approach was correct for filtering
the data. (The DS 1 and DS 2 issues had existed for more than 10 years before being discovered.)

A sample of 365 aerospace development programs, composed of 301 spacecraft [Data Set (DS) 1] and 64
non-spacecraft [DS 2] programs, each contributing a single schedule change (ratio) value, was collected by Conrow
and evaluated to determine what candidate distribution types could be rejected and not rejected.† (It was valid to
combine the two separate data sets because the P-value from the Mann-Whitney W test was greater than 0.05.
Hence, there was not a statistically significant difference between the medians at the 95.0% confidence level.)
Descriptive statistics for this 365 program data set are given in Table 3 (along with descriptive statistics for all other
schedule change data sets evaluated in this paper). (Note: descriptive statistics for the 28 program NASA data set
are repeated in Table 3 for consistency—the results are identical to those given in the ratio column of Table 1.)
From Table 3, the average and median schedule change for the 365 program sample was 1.27 and 1.16,
respectively. The skewness and kurtosis were 1.37 and 5.04, respectively, which is considerably different from
what would exist for a normal distribution (0.00 and 3.00, respectively). The data were then evaluated to determine
what types of common distributions might be rejected and what types might not be rejected. The results of this
analysis are given in Table 4. Note that the normal distribution, the extreme value Type B distribution, and 21 other
distributions were rejected at the 0.05 level by both the Anderson-Darling and Kolmogorov-Smirnov tests. As
clearly shown in Table 4, schedule change data for space and non-space programs cannot be accurately represented
by any of the 23 common distribution types evaluated.
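When no common distribution type survives testing, the practical alternative, recommended in this paper for subsequent Monte Carlo use, is to carry the data's own ascending empirical CDF directly into the simulation. A minimal sketch with invented ratios:

```python
import bisect
import random

def empirical_cdf(values):
    # Build an ascending empirical CDF from observed schedule change
    # ratios: sorted values paired with cumulative probabilities i/n.
    xs = sorted(values)
    n = len(xs)
    probs = [(i + 1) / n for i in range(n)]
    return xs, probs

def sample_from_cdf(xs, probs, u):
    # Inverse-transform draw: the smallest observed value whose
    # cumulative probability is >= u (0 < u <= 1).
    i = bisect.bisect_left(probs, u)
    return xs[min(i, len(xs) - 1)]

# Hypothetical ratios; each Monte Carlo iteration draws u ~ Uniform(0, 1)
# and maps it through the empirical CDF.
ratios = [1.05, 1.10, 1.10, 1.20, 1.35, 1.60]
xs, probs = empirical_cdf(ratios)
rng = random.Random(1)
draws = [sample_from_cdf(xs, probs, rng.random()) for _ in range(1000)]
print(min(draws) >= 1.05 and max(draws) <= 1.60)  # True
```

This approach reproduces the observed values with their observed frequencies; interpolating between sorted points is a common refinement when a continuous output is preferred.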
Finally, since the P-value from the Mann-Whitney W test (difference in the distribution medians) was less than
0.05, it was not valid to combine the 28 program NASA data set with the previously mentioned 365 program data
set (DS 1 and DS 2) into a single data set.

Table 3. Descriptive Statistics of Key Evaluated Schedule Change Samples

                     NASA    DS 1 and   DS 1 and DS 2,   DS 1 and DS 2,    DS 1 and DS 2,    DS 1 and DS 2, No 1.0,
Item                         DS 2       No 1.0           No 1.0, 1/6       No 1.0, 1/6       1/6 Values, Sorted,
                                                         Values, Random    Values, Sorted    Different Endpoint
Count                28      365        278              47                47                47
Mean                 1.52    1.27       1.35             1.33              1.35              1.35
Standard Deviation   0.54    0.33       0.34             0.32              0.37              0.38
Skewness             1.35    1.37       1.07             0.92              0.95              1.18
Kurtosis             4.49    5.04       4.48             3.11              4.78              5.83
Minimum              1.00    0.50       0.50             0.92              0.50              0.50
25th Percentile      1.11    1.00       1.12             1.07              1.11              1.11
Median               1.28    1.16       1.25             1.24              1.25              1.25
75th Percentile      1.81    1.44       1.54             1.52              1.52              1.52
Maximum              3.14    2.70       2.70             2.18              2.52              2.70
Range                2.14    2.20       2.20             1.26              2.02              2.20

Table 4. Evaluation of Candidate Probability Distributions for Schedule Change of 365 Space and Non-Space Programs

Model               Anderson-Darling   Reject at 0.05   Kolmogorov-Smirnov   Reject at 0.05
                    Test Statistic     Prob. Level?     Test Statistic       Prob. Level?
Beta                204.07             Y                 0.62                Y
Cauchy               23.61             Y                 0.22                Y
Erlang               11.83             Y                 0.14                Y
Error               279.53             Y                 0.74                Y
Exponential         100.56             Y                 0.50                Y
Extreme Value A      28.70             Y                 0.24                Y
Extreme Value B       8.36             Y                 0.12                Y
Gamma                12.03             Y                 0.13                Y
Inverse Gaussian     10.36             Y                 0.13                Y
Inverted Weibull      8.28             Y                 0.15                Y
Laplace              15.94             Y                 0.19                Y
Logistic             12.87             Y                 0.16                Y
Log-Laplace          11.89             Y                 0.16                Y
Log-Logistic          9.31             Y                 0.14                Y
Lognormal            10.28             Y                 0.13                Y
Normal               16.27             Y                 0.15                Y
Pearson Type V        8.87             Y                 0.13                Y
Pearson Type VI      10.14             Y                 0.13                Y
Random Walk          10.47             Y                 0.13                Y
Rayleigh             51.71             Y                 0.38                Y
Uniform              88.50             Y                 0.35                Y
Wald                128.80             Y                 0.50                Y
Weibull              18.00             Y                 0.19                Y
As previously mentioned, it was valid to combine the two separate space and non-space data sets (DS 1 and DS
2) into a single data set because the P-value from the Mann-Whitney W test was greater than 0.05. Hence, there was
not a statistically significant difference between the medians at the 95.0% confidence level. Nevertheless,
distribution fitting results are presented here for the 301 program DS 1 sample. The results of the distribution fitting
analysis of space program schedule change (DS 1) are given in Table 5. As in the full sample case (365 programs,
DS 1 and DS 2 combined), the normal distribution, extreme value Type B distribution, and 21 other distributions
were rejected at the 0.05 level by both the Anderson-Darling and Kolmogorov-Smirnov tests. As clearly shown in
Table 5, schedule change data for space programs cannot be accurately represented by any of the 23 common
distribution types evaluated.
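The combining decisions above rest on the Mann-Whitney test of whether two samples share a common median. A hedged sketch follows (pure Python, large-sample normal approximation with midranks for ties; the commercial package behind the paper's P-values likely applies additional tie and continuity corrections). The two small samples are invented for illustration.

```python
import math

def mann_whitney_p(xs, ys):
    # Two-sided Mann-Whitney test via the large-sample normal
    # approximation; ties get midranks, and no continuity or tie
    # correction is applied (the paper's software details are unknown).
    n1, n2 = len(xs), len(ys)
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of 1-based ranks i+1..j
        rank_sum_x += midrank * sum(1 for k in range(i, j)
                                    if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Interleaved samples (plausibly one population): large p, OK to combine.
a = [1.05, 1.12, 1.18, 1.24, 1.31, 1.39, 1.47, 1.55, 1.66, 1.80]
b = [1.07, 1.13, 1.19, 1.26, 1.33, 1.41, 1.49, 1.58, 1.70, 1.85]
print(mann_whitney_p(a, b) > 0.05)                     # True
print(mann_whitney_p(a, [x + 1.0 for x in b]) < 0.05)  # True
```

A P-value above 0.05 (no detectable median difference) is the paper's criterion for pooling two samples; a P-value below 0.05, as with the NASA sample versus the 365 program sample, forbids pooling.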
Coleman and Summerville assumed that an extreme value distribution is “what we expect theoretically” to
model schedule change for DoD acquisition programs [7]. However, Coleman, Summerville, and Dameron
provided an inadequate justification to support this assertion, specifically that “the case is actually made by the
statistical test” [8]. Coleman, Summerville, and Dameron also stated that “given that finishing a program is akin to
waiting for the last event to finish, it is appealing that the Gumbel, or extreme value distribution, is the best fit to
the data” [8].
Table 5. Evaluation of Candidate Probability Distributions for Schedule Change of 301 Space Programs

Model               Anderson-Darling   Reject at 0.05   Kolmogorov-Smirnov   Reject at 0.05
                    Test Statistic     Prob. Level?     Test Statistic       Prob. Level?
Beta                  7.82             Y                 0.15                Y
Cauchy               19.50             Y                 0.22                Y
Erlang                8.81             Y                 0.14                Y
Error               229.28             Y                 0.73                Y
Exponential          81.58             Y                 0.49                Y
Extreme Value A      20.93             Y                 0.23                Y
Extreme Value B       6.44             Y                 0.13                Y
Gamma                 8.86             Y                 0.13                Y
Inverse Gaussian      7.66             Y                 0.13                Y
Inverted Weibull      6.81             Y                 0.16                Y
Laplace              13.05             Y                 0.19                Y
Logistic              9.94             Y                 0.16                Y
Log-Laplace           9.80             Y                 0.16                Y
Log-Logistic          7.35             Y                 0.14                Y
Lognormal             7.65             Y                 0.13                Y
Normal               11.88             Y                 0.15                Y
Pearson Type V        6.67             Y                 0.13                Y
Pearson Type VI       7.00             Y                 0.13                Y
Random Walk           7.74             Y                 0.13                Y
Rayleigh             40.82             Y                 0.38                Y
Uniform              69.12             Y                 0.34                Y
Wald                109.50             Y                 0.50                Y
Weibull              12.90             Y                 0.18                Y

From Table 4, both the extreme value Type A and Type B distributions were rejected at the 0.05 level by both
the Anderson-Darling and Kolmogorov-Smirnov tests. Coleman, Summerville, and Dameron removed schedule
change values of 1.00 (indicating on-time delivery vs. plan) because “the number of 1.0’s in the data base (schedules
finishing “on time”) creates problems in the fit statistics” [7]. Of the 59 total programs, 12 (about 20 percent) had a
schedule ratio of 1.0. [Coleman and Summerville noted that “we believe the disproportionate amount of 1.0’s is
politically motivated and not a natural occurrence” [7]. This statement is erroneous because award fee, user needs,
and other criteria often make on-time delivery a priority (hence a ratio of 1.0).] They eliminated the 1.0 values and
then re-fit the data to an extreme value (Type B) distribution. (The 1.0 values were later modeled using a discrete
distribution 12/59 of the time.) Coleman and Summerville, and Coleman, Summerville, and Dameron estimated the
extreme value location (µ) and scale (β) coefficients to be 1.16 and 0.32, respectively, as given in the last entry of
Table 6 [7] [8]. They also found that the extreme value (Type B) distribution could not be rejected at the 0.05 level
with the Kolmogorov-Smirnov test [7] [8], but they did not report results (or potentially evaluate the data) using the
more statistically powerful Anderson-Darling test.
When the 365 program data set assembled by Conrow (roughly six times larger than the Coleman and
Summerville, and Coleman, Summerville, and Dameron data set) was similarly evaluated, a different outcome was
reached, as shown in Table 6. The resulting extreme value Type B µ and β coefficients were estimated to be 1.12
and 0.24, respectively. However, the extreme value Type B distribution was rejected at the 0.05 level by both the
Anderson-Darling and Kolmogorov-Smirnov tests. (The rejection held at all significance levels from 0.01 to 0.25
for the Anderson-Darling test and from 0.01 to 0.10 for the Kolmogorov-Smirnov test, the ranges over which exact
critical values exist.)
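An extreme value (Gumbel) fit and Kolmogorov-Smirnov check of the kind discussed above can be sketched as follows. The fit uses the method of moments (an illustrative choice; the software behind the paper's estimates may use maximum likelihood), the ten ratios are invented, and the plain asymptotic 0.05 critical value is a simplification when parameters are estimated from the same data.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def fit_gumbel_moments(xs):
    # Method-of-moments Gumbel fit: mean = mu + gamma*beta,
    # variance = (pi*beta)**2 / 6.
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta

def gumbel_cdf(x, mu, beta):
    return math.exp(-math.exp(-(x - mu) / beta))

def ks_statistic(xs, cdf):
    # Kolmogorov-Smirnov D for a fully specified CDF.
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical schedule change ratios (invented for illustration).
sample = [1.02, 1.05, 1.08, 1.10, 1.12, 1.15, 1.20, 1.25, 1.40, 1.80]
mu, beta = fit_gumbel_moments(sample)
d = ks_statistic(sample, lambda x: gumbel_cdf(x, mu, beta))
# Asymptotic 0.05 critical value ~1.358/sqrt(n); strictly valid only when
# the parameters are NOT estimated from the same data.
print(d < 1.358 / math.sqrt(len(sample)))  # True: not rejected
```

With only ten points the fit sails through, which previews the small-sample power problem examined next.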
A subset of the Conrow data was created by removing programs with an on-time delivery (1.0 value). Of the
365 program sample, 87 programs had a schedule ratio of 1.0, and were subsequently were removed (or about 24
percent), leaving 278 programs to evaluate by distribution fitting. The resulting extreme value Type B µ and β
coefficients were estimated to be 1.20 and 0.27, respectively as given in Table 6. The remaining 278 values (with
Table 6. Summary of Extreme Value Distribution Schedule Change Statistical Test Results
schedule change not equal to 1.0) were then evaluated to determine what types of common distributions might be
rejected and what types might not be rejected. The results of this analysis are given in Table 7. Note that the normal
distribution, extreme value Type B distribution (and 19 other distributions) were rejected at the 0.05 level by both
the Anderson-Darling and Kolmogorov-Smirnov tests. (For the extreme value Type B distribution, the rejection was
from significance levels of 0.01 to 0.25, and 0.01 to 0.10 with the Anderson-Darling and Kolmogorov-Smirnov
tests, respectively, where exact critical values exist.) Of the distributions evaluated, only the Pearson Type VI was
not rejected by the Anderson-Darling test at the 0.05 level, and only the Pearson Type V was not rejected by the
Kolmogorov-Smirnov test at the 0.05 level. Finally, note that by removing the 1.0 values (278 remaining programs)
the extreme value Type B µ and β coefficients both increased vs. those associated with the original 365 program
values.
Analyses were then performed to reduce the Conrow 278 program sample size to the same size of Coleman and
Summerville, and Coleman, Summerville, and Dameron. In the first case every sixth value from the randomly
ordered Conrow sample (excluding 1.0 values) was determined. This provided 47 values, the same number used by
Coleman and Summerville, and Coleman, Summerville, and Dameron. As given in Table 6, the resulting extreme
value Type B µ and β coefficients were estimated to be 1.18 and 0.23, respectively, and the distribution type was not
rejected at the 0.05 level by both the Anderson-Darling and Kolmogorov-Smirnov tests. For the second case, a
different 47 program data set was created by taking every sixth value of the sorted Conrow 278 values with 1.0
values removed. [Here the sorted 278 values represented an ascending cumulative distribution function (CDF).] As
given in Table 6, the resulting extreme value Type B µ and β coefficients were estimated to be 1.18 and 0.32,
respectively, and the distribution type was not rejected at the 0.05 level by both the Anderson-Darling and
Kolmogorov-Smirnov tests. For the third case, a different 47 program data set was created by using the
methodology of the second case together with taking the maximum value of the 278 programs substituted for the
Anderson-Darling Kolmogorov-Smirnov
Sample Reject at 0.05 Reject at 0.05 Location Scale
Case Size Prob. Level? Prob. Level? Parameter Parameter
Space and Non-Space Programs 365 Y Y 1.12 0.24
Space and Non-Space Programs, No 1.0 Values 278 Y Y 1.20 0.27
Space and Non-Space Programs, No 1.0 Values, Random
1/6 Program Sample, Correct Endpoint 47 N N 1.18 0.23
Space and Non-Space Programs, No 1.0 Values, Sorted 1/6 Program Sample, Correct Endpoint 47 N N 1.18 0.32
Space and Non-Space Programs, No 1.0 Values, Sorted
1/6 Program Sample, Maximum Endpoint 47 N N 1.18 0.32Space Only Programs 301 Y Y 1.13 0.25Space Only Programs, No 1.0 Values 227 Y Y 1.21 0.29Non-Space Only Programs 64 Y Y 1.10 0.17
Non-Space Only Programs, No 1.0 Values 51 Y N 1.14 0.19
Non-Space Only Programs (Coleman and Summerville, Coleman; Summerville, and Dameron) 59 Unknown Y Unknown Unknown
Non-Space Only Programs, No 1.0 Values (Coleman and Summerville; Coleman, Summerville, and Dameron) 47 Unknown N 1.16 0.32
Copyright © 2010 by Edmund H. Conrow
8
Table 7. Evaluation of Candidate Probability Distributions for Schedule Change
of 278 Programs
47th
value of the second case. (Hence, the first 46 values in the first and second cases were identical—only the final,
47th
value was different.) As given in Table 6, the extreme value Type B µ and β coefficients were estimated to be
1.18 and 0.32, respectively, and the distribution type was not rejected at the 0.05 level by both the Anderson-Darling
and Kolmogorov-Smirnov tests.
Four other cases were then developed and evaluated. (The descriptive statistics for these four cases are not
reported in Table 3 because these results are considered secondary. The extreme value coefficients, Anderson-
Darling test results, and Kolmogorov-Smirnov test results for each case are given in Table 6.) Case four
corresponds to data from DS 1, 301 space programs. The extreme value Type B µ and β coefficients were estimated
to be 1.18 and 0.32, respectively, and the distribution type, plus the 22 other distribution types evaluated were
rejected at the 0.05 level by both the Anderson-Darling and Kolmogorov-Smirnov tests (as shown in Table 5). Case
five corresponds to data from DS 1 with the 74 1.0 values removed (24.6% of the total), thus leaving 227 programs
in the sample. The extreme value Type B µ and β coefficients were estimated to be 1.21 and 0.29, respectively, and
the extreme value Type B distribution type was rejected at the 0.05 level by both the Anderson-Darling and
Kolmogorov-Smirnov tests. (Only the beta, inverse Gaussian, Pearson Type VI, and random walk distributions
were not rejected by both the Anderson-Darling and Kolmogorov-Smirnov tests. In addition, the Pearson Type V
distribution was not rejected at the 0.05 level by the Kolmogorov-Smirnov test.) Case six corresponds to data from
DS 2, 64 non space programs. The extreme value Type B µ and β coefficients were estimated to be 1.10 and 0.17,
respectively, and the distribution type was rejected at the 0.05 level by both the Anderson-Darling and Kolmogorov-
Smirnov tests. (Only the beta distribution was not rejected by both the Anderson-Darling and Kolmogorov-Smirnov
tests. The Erlang, gamma, inverse Gaussian, log Laplace, Pearson Type V, Pearson Type VI, and random walk
distributions were not rejected by the Kolmogorov-Smirnov test but were rejected by the Anderson-Darling test.)
Case seven corresponds to data from DS 2 with the 13 1.0 values removed (20.3% of the total), thus leaving 51
programs in the sample. The extreme value Type B µ and β coefficients were estimated to be 1.14 and 0.19,
respectively, and the distribution type was rejected at the 0.05 level by the Anderson-Darling but not by the
Kolmogorov-Smirnov test. (Only the beta, Cauchy, inverse Gaussian, inverted Weibull, Pearson Type VI, and
random walk distributions were not rejected by both the Anderson-Darling and Kolmogorov-Smirnov tests. The
Erlang, extreme value Type B, gamma, Laplace, log-Laplace, log-logistic, and Pearson Type V distributions were
not rejected by the Kolmogorov-Smirnov test but were rejected by the Anderson-Darling test.) Finally, note that by
removing the 1.0 values (13 programs), the extreme value Type B µ and β coefficients both increased versus those
associated with the original 64 program values.

                      Anderson-Darling                Kolmogorov-Smirnov
Model                 Test Statistic  Reject at 0.05  Test Statistic  Reject at 0.05
                                      Prob. Level?                    Prob. Level?
Beta                  102.57          Y               0.44            Y
Cauchy                  9.85          Y               0.13            Y
Erlang                  3.30          Y               0.10            Y
Error                 213.87          Y               0.72            Y
Exponential            75.97          Y               0.46            Y
Extreme Value A        16.51          Y               0.18            Y
Extreme Value B         1.44          Y               0.07            Y
Gamma                   3.30          Y               0.10            Y
Inverse Gaussian        2.26          Y               0.09            Y
Inverted Weibull        5.45          Y               0.10            Y
Laplace                 8.38          Y               0.11            Y
Logistic                4.49          Y               0.09            Y
Log-Laplace             5.14          Y               0.10            Y
Log-Logistic            1.99          Y               0.07            Y
Lognormal               2.23          Y               0.09            Y
Normal                  6.33          Y               0.13            Y
Pearson Type V          1.54          Y               0.07            N
Pearson Type VI         2.11          N               0.09            Y
Random Walk             2.31          Y               0.09            Y
Rayleigh               38.40          Y               0.33            Y
Uniform                51.41          Y               0.32            Y
Wald                  137.01          Y               0.50            Y
Weibull                 8.35          Y               0.12            Y
The above extreme value Type B analysis points to the danger of performing distribution fitting on small to
moderate sized samples and accepting the resulting statistics. This is a different situation than if the
results had been rejected at the 0.05 level with the smaller sample size (47 programs), because statistical
power increases with sample size (cet. par.). Statistical power limitations of small samples explain some key results
given in Table 6. Even with the 1.0 values eliminated, when the sample was reduced from 278 programs to
47 programs the extreme value Type B distribution went from being rejected by both the Anderson-Darling and
Kolmogorov-Smirnov tests at the 0.05 level (278 programs) to not being rejected by these same tests (47 programs).
The primary difference in these three data subset cases was that the sample size was reduced by a factor of 6, or put
another way, 5/6 of the data was eliminated. No new information was added in any of the three subset cases;
information was in fact eliminated, and the structure of the three cases did not constrain the results. The latter point
is clear from examining the descriptive statistics given in Table 3 coupled with the extreme value Type B location
and scale parameters given in Table 6—variations between the full sample without 1.0 values (278 programs) and
the three samples reduced to 1/6 of the values (47 programs) were generally minor.
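The down-sampling effect described above can be illustrated with a small simulation. The sketch below is hypothetical: it uses synthetic right-skewed data (not the DS 1/DS 2 samples) and assumes SciPy is available. Note also that running a Kolmogorov-Smirnov test against parameters fitted from the same sample inflates p-values somewhat, which only strengthens the small-sample caution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed "schedule change" ratios (stand-in data, not the paper's samples)
full = 1.0 + rng.lognormal(mean=-1.5, sigma=0.8, size=282)

def ks_pvalue_gumbel(sample):
    # Fit an extreme value (Gumbel) distribution, then run a Kolmogorov-Smirnov test.
    # Caveat: testing against parameters fitted from the same data is approximate.
    loc, scale = stats.gumbel_r.fit(sample)
    return stats.kstest(sample, "gumbel_r", args=(loc, scale)).pvalue

p_full = ks_pvalue_gumbel(full)        # large sample: more power to detect misfit
subset = np.sort(full)[::6]            # keep every sixth sorted value (47 points)
p_subset = ks_pvalue_gumbel(subset)    # small sample: rejection becomes much less likely
```

With 5/6 of the data removed, the test statistic is computed from far less information, so the same underlying mismatch between the data and the candidate distribution is much harder to detect.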
Discussion
The first Dubos, Saleh, and Braun claim given in the Introduction to this paper, that RSS is normally distributed,
cannot be accepted given the results presented in Table 2 from distribution fitting the same 28 program sample. The
results show that a normal distribution was rejected at the 0.05 level by both the Anderson-Darling and
Kolmogorov-Smirnov tests. (In addition, the skewness value given in Table 1 (1.35) is inconsistent with a normal
distribution, whose skewness value is 0.0.) Using a much larger sample of space and non-space data (365 programs,
Table 4), and space data only (301 programs, Table 5), all of the 23 distribution types evaluated, including the
normal distribution, were rejected at the 0.05 level by the Anderson-Darling and Kolmogorov-Smirnov tests. The
second Dubos, Saleh, and Braun claim given in the Introduction is that if more RSS data are obtained and warrant a
different PDF, then that new PDF can be used to develop schedule-risk curves. Results given in Table 3 show that
space and non-space schedule change data, and space program only schedule change data, have a right-hand skew
(skewness coefficient > 0). More importantly, results given in Tables 4 and 5 show that no pre-defined PDF (of the
23 types tested) accurately represents either space, or space and non-space, schedule change data when the sample
size is sufficiently large to preclude Type II errors at a given significance level. [A Type II error is failure to reject a
given distribution type when that distribution is incorrect (a false negative). Thus when the statistical power of a
test increases, the chances of it making a Type II error decrease. As previously mentioned, statistical power
increases with sample size (cet. par.). In addition, in terms of common statistical tests, statistical power increases in
the following order (cet. par.): chi-square, Kolmogorov-Smirnov, and Anderson-Darling [9].]
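The relationship between sample size, statistical power, and Type II error rate can be sketched with a quick Monte Carlo experiment. This is an illustrative example with synthetic data (assuming SciPy is available), not a reproduction of this paper's analysis: it estimates how often a Kolmogorov-Smirnov test fails to reject a (false) normal hypothesis for right-skewed lognormal data at two sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def type2_error_rate(n, trials=200):
    """Fraction of trials in which a KS test fails to reject normality
    for data that is actually lognormal (a Type II error)."""
    misses = 0
    for _ in range(trials):
        x = rng.lognormal(mean=0.0, sigma=0.5, size=n)
        # Standardize and test against N(0, 1); parameters are estimated from
        # the data, so this is approximate (Lilliefors correction omitted).
        z = (x - x.mean()) / x.std()
        if stats.kstest(z, "norm").pvalue > 0.05:
            misses += 1
    return misses / trials

small_n_rate = type2_error_rate(30)    # low power: frequent Type II errors
large_n_rate = type2_error_rate(300)   # high power: Type II errors become rare
```

The skewed alternative goes undetected in a large share of small-sample trials, while the same departure from normality is almost always detected once the sample is an order of magnitude larger.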
Of the rationale provided by Coleman, Summerville, and Dameron to support the use of an extreme value Type
B distribution, only the statement: “given that finishing a program is akin to waiting for the last event to finish, it is
appealing that the Gumbel, or extreme value distribution, is the best fit to the data [8]” has one acceptable theoretical
facet. As mentioned above, one of Conrow’s findings was that performance is typically the dominant variable for
DoD and NASA development programs [2] [3]. In effect, both the government and contractor attempt to
meet performance requirements, and in the process cost and/or schedule are typically adjusted (often meaning
growth) on a case-by-case basis as all three variables are traded in order to meet performance requirements within
cost and schedule constraints [2] [3]. Hence, the statement of Coleman, Summerville, and Dameron that “finishing a
program is akin to waiting for the last event to finish [8]” implies cost, performance, and schedule trades continuing
throughout the development program, which more often than not contributes to cost growth and/or schedule
slippage. [The nature of these trades is complex, and Conrow determined that the coefficient of determination (R²)
between cost change, performance change, and schedule change in a roughly 50 program sample of DoD non-space
programs is very small (e.g., 0.07 or less) [4].] The other arguments provided by Coleman and Summerville [7], and
Coleman, Summerville, and Dameron [8], do not have sufficient merit to support using an extreme value Type B
distribution, or even eliminating the on-time delivery programs (schedule change of 1.0). Results from the analyses
performed and given in Table 6 show that the acceptance of the extreme value Type B distribution is largely related
to sample size. In effect, the non-rejection at the 0.05 level by the Kolmogorov-Smirnov test that resulted from the
analyses of Coleman and Summerville [7], and Coleman, Summerville, and Dameron [8], was almost certainly the
result of a Type II error due to weak statistical power associated with small sample size. (This is in part because
statistical tests are not very sensitive to minor differences between the data and the candidate distribution type,
particularly for small sample sizes [9].) The assertion that non-rejection is related to a Type II error
is strongly supported by contrasting results from the Conrow full sample case (278 programs) with 1.0 values
removed (Table 6, second entry), against the three reduced sample size cases (47 programs) that were created from
the same Conrow full sample case. In the former case the extreme value Type B distribution was rejected at the 0.05
level (and a broad range around this level) by both the Anderson-Darling and Kolmogorov-Smirnov tests, while in
the latter three cases developed and evaluated, the assumption of an extreme value Type B distribution could not be
rejected by either statistical test at the 0.05 level.
What sample size is sufficient for distribution fitting given the small sample size problems that existed in the
analyses of Dubos, Saleh, and Braun ([1], 28 values), Coleman and Summerville ([7], 47 and 59 values), and
Coleman, Summerville, and Dameron ([8], 47 and 59 values)? Clearly 28, 47, or 59 representative data points are
insufficient except to rule out certain distribution types. For example, if the Anderson-Darling and Kolmogorov-
Smirnov tests reject a candidate distribution at the 0.05 level for 50 data points and the sample is representative of
the population of corresponding schedule change values, then larger sample sizes will also be rejected because
statistical power increases with sample size (cet. par.).
While no single value or even range of values is sufficient in all cases, approximately 200 data points may be a
sufficient lower bound to permit fitting with univariate, unimodal distributions when other clear evidence that
supports a particular distribution type does not exist. (As given in Table 6, the extreme value Type B hypothesis
was rejected by both the Anderson-Darling and Kolmogorov-Smirnov test for a sample of 227 program schedule
change values that had the 1.0 values removed.) However, when samples drawn from a normal distribution by a high
quality commercial Monte Carlo simulation were fit, many of the candidate distribution types were not rejected, and
the normal distribution did not yield a much smaller Anderson-Darling test statistic than the other distribution types,
until much larger sample sizes were used (e.g., 500 values).
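This behavior is easy to reproduce in outline. The sketch below (synthetic data, assuming SciPy is available) fits a few alternative distribution families to a moderate-sized normal sample and counts how many survive a Kolmogorov-Smirnov screen; with only a couple hundred points, several near-symmetric alternatives typically cannot be rejected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(loc=100.0, scale=10.0, size=200)   # truly normal sample

# A few candidate families (names as in scipy.stats); fit each, then KS-test.
candidates = ["norm", "logistic", "gamma", "lognorm", "gumbel_r"]
not_rejected = []
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)                  # maximum-likelihood fit
    p = stats.kstest(data, name, args=params).pvalue
    if p > 0.05:                             # fitted-parameter caveat applies
        not_rejected.append(name)
```

At this sample size, more than the normal family generally passes the screen, consistent with the observation that much larger samples are needed before the correct family clearly separates from the alternatives.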
For very large sample sizes (e.g., several thousand values), statistical tests will often reject the null hypothesis
that the sample represents a particular distribution type. This is because even very small differences between the
data and the hypothesized distribution will be detected. These same differences may be present but unobserved with
small sample sizes; the rejection at large sample sizes occurs in part because statistical tests do not differentiate
between an exact and a “nearly correct” fit for a given distribution type [9].
Finally, the analyst should recognize that there is no guarantee that schedule change or any other data will fit any
particular distribution type. Given this potential problem, what approach should the analyst use? One approach
sometimes used to estimate the type of probability distribution when limited data exist is to convert the data
into a histogram. This approach is not recommended: with relatively small sample sizes, a finite number of
equal-width bins can lead to errors from aggregation (dissimilar values lumped into one bin) and/or drop-out
(empty bins). In effect, a histogram with a finite number of bins only approximates the true PDF associated with
the data. Another, even more inappropriate
approach, is simply guessing at the type and specific characteristics of a probability distribution when no data exists
or using a default probability distribution, then not mentioning the resulting limitations in the subsequent analysis
ground rules and assumptions.
Given the above illustrations of faulty or weak methodologies, is there a better approach the analyst should
consider? When data is available, and particularly when it was derived at the same Work Breakdown Structure level
as it will be used for risk analysis purposes, the recommended approach is to convert the data into an ascending CDF
and directly use this (empirical) CDF in the risk analysis tool (e.g., Monte Carlo simulation)‡. This approach is
more accurate than: 1) guessing the probability distribution, 2) using the default model distribution without noting
this or other limitations, 3) distribution fitting data when sample sizes are small to moderate, or 4) using histograms
to represent the data.
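The recommended empirical-CDF approach can be sketched in a few lines. The following is a minimal illustration with made-up schedule change ratios (not data from this paper), assuming NumPy is available; a Monte Carlo tool that accepts an ascending CDF samples from it in essentially this way (inverse-transform sampling).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical schedule change ratios (actual/initial duration); illustrative only
observed = np.array([1.00, 1.00, 1.05, 1.12, 1.18, 1.25, 1.31, 1.40, 1.62, 2.10])

# Ascending empirical CDF: sorted values paired with cumulative probabilities
values = np.sort(observed)
cum_prob = np.arange(1, values.size + 1) / values.size

def sample_empirical_cdf(n):
    # Inverse-transform sampling: draw uniforms and map them through the CDF.
    # Linear interpolation between observed points; no extrapolation beyond the data.
    u = rng.uniform(0.0, 1.0, size=n)
    return np.interp(u, cum_prob, values)

draws = sample_empirical_cdf(10_000)   # feed these into the risk analysis model
```

Because the draws are interpolated between observed values and never extrapolated past them, the simulated distribution is only as representative as the underlying sample, which is the condition stated in the text.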
‡ Some high quality commercial Monte Carlo simulation packages, such as @RISK 5.5 or higher, Industrial or
Professional versions, from Palisade Corporation, allow the analyst to develop or import an ascending CDF and
apply this distribution to specified model element(s). Note: mention of @RISK does not constitute an endorsement;
the software is cited only because it has the capability to develop or import an ascending CDF and use it directly in
simulations.

Conclusions

Schedule change is typically derived from comparing actual versus initial schedule durations over the same
acquisition events (whether as a ratio or percent change). Acquisition dynamics and trade preferences among cost,
performance, and schedule contribute to both cost change and schedule change growing (slipping) during the course
of defense and space programs. In addition, these factors also contribute to the distribution of cost change and
schedule change values having a right-hand skew (rather than a symmetrical or left-hand skew). There is little
specific information in the literature about the type(s) of distributions that may be associated with schedule change.
Schedule change is clearly not normally distributed for space programs in particular or aerospace programs in
general. This assertion is supported by numerous facts, including: 1) acquisition dynamics and trade preferences
among cost, performance, and schedule that lead to larger potential cost growth and/or schedule slippage than
reductions in these variables; 2) skewness and kurtosis estimates from various samples of space and aerospace
programs ranging in size from 28 NASA programs to 365 combined space (301) and non-space (64) programs that
clearly do not represent a normal distribution (0.0 skewness and 3.0 kurtosis); and 3) results from distribution fitting
various samples of space and aerospace programs; in all cases the normal distribution assumption was rejected for
schedule change by the Anderson-Darling and Kolmogorov-Smirnov statistical tests at the 0.05 level.
If schedule change is not normally distributed, then should we assume that it follows some other type of
continuous, univariate distribution? Based upon the statistical results presented in this paper, the answer to this
question is a resounding no. Results from evaluating a 365 program sample of combined space (301) and non-space
(64) programs showed that all 23 of the candidate distribution types evaluated were rejected by the Anderson-Darling
and Kolmogorov-Smirnov statistical tests at the 0.05 level.
Of some interest were the results for the extreme value (Gumbel) distribution. The attributes of this
distribution might somewhat mimic the preference for performance over cost and/or schedule, and thus provide a
suitable distribution for the right-hand skew associated with increased schedule change (and possibly cost change,
although not evaluated here). Prior researchers using a single sample of 59 programs found that the extreme value
distribution was rejected at the 0.05 level by the Kolmogorov-Smirnov statistical test. However, when the on-time
deliveries were removed (12 programs), the resulting sample (47 programs) was not rejected at the 0.05 level by the
Kolmogorov-Smirnov statistical test. The same evaluation was then performed by Conrow on a much larger sample
(365 programs) and a subset with the on-time deliveries (87 programs) removed, leaving 278 programs. In the former
case all 23 candidate distributions tested were rejected by both the Anderson-Darling and Kolmogorov-Smirnov
statistical tests at the 0.05 level. [The same results occurred when only the 301 space programs (which included on-
time deliveries) were evaluated.] In the latter case only one of the 23 candidate distributions was not rejected by
each test at the 0.05 level: the Pearson Type VI by the Anderson-Darling test and the Pearson Type V by the
Kolmogorov-Smirnov test.
The combined data set with the 1.0 values removed was then down-sampled three different ways to reach 47 values,
the same sample size as used by prior researchers. In each of the three cases, the extreme value distribution was not
rejected by the Anderson-Darling and Kolmogorov-Smirnov statistical tests at the 0.05 level. The difference in
results between the 278 program sample and the 47 program samples was solely due to sample size: no new
information was added, only data were eliminated. The different results are caused by the reduced statistical power
of both the Anderson-Darling and Kolmogorov-Smirnov statistical tests with the much smaller down-sampled data
(5/6 of the data were removed). Hence, attempting to perform distribution fitting on small sample sizes (e.g., 50
values or less) may lead to erroneous results if the distribution type is not rejected; such results can be caused by
Type II errors resulting from the diminished statistical power associated with small sample size.
Sample sizes of approximately 200 data points may be a sufficient lower bound to permit fitting with univariate,
unimodal distributions when other clear evidence that supports a particular distribution type does not exist.
However, when samples drawn from a normal distribution by a high quality commercial Monte Carlo simulation
were fit, many of the candidate distribution types were not rejected, and the normal distribution did not yield a much
smaller Anderson-Darling test statistic than the other distribution types, until much larger sample sizes were used
(e.g., 500 values). When much
smaller samples exist, the recommended approach when data modeling is needed is to sort the data into an ascending
CDF, then use the CDF in subsequent analyses (e.g., a Monte Carlo simulation). This approach does not introduce
any errors of interpolation or extrapolation in and of itself and should be accurate so long as the underlying data
sample represents the data population.
References
[1] Dubos, G., Saleh, J., and Braun, R., “Technology Readiness Level, Schedule Risk, and Slippage in Spacecraft
Design,” AIAA Journal of Spacecraft and Rockets, Vol. 45, No. 4, July–August 2008, pp. 840-841.
[2] Conrow, E., Effective Risk Management: Some Keys to Success, Second Edition, American Institute of
Aeronautics and Astronautics, 2003, pp. 2-13, 427-430, 431-433.
[3] Conrow, E., “Some Long-Term Issues and Impediments Affecting Military Systems Acquisition Reform,”
Acquisition Review Quarterly, Vol. 2, No. 3, Summer 1995, pp. 199–212.
[4] Conrow, E., “Some Inherent Limitations Of Quantitative Cost Risk Assessment Methodologies,” 29th Annual
DoD Cost Analysis Symposium, 21 February 1996.
[5] Candreva, P., “Rethinking Acquisition Reform: Cost Growth Solutions May Aggravate More Important
Problems,” 5th Annual Acquisition Research Symposium of the Naval Postgraduate School: Acquisition Research:
Creating Synergy for Informed Change, 14-15 May 2008.
[6] Arena, M., Leonard, R., Murray, S., Younossi, O., “Historical Cost Growth of Completed Weapon System
Programs,” RAND, TR-343, 2006, pp. xii, 22, 27.
[7] Coleman, R. and Summerville, J., “A Survey of Cost Risk Methods for Project Management,” PMI Risk SIG
Project Risk Symposium, 16 May 2004.
[8] Coleman, R., Summerville, J., and Dameron, M., “The Relationship Between Cost Growth and Schedule
Growth,” Acquisition Review Quarterly, Vol. 10, No. 2, Spring 2003, pp. 117-122.
[9] Law, A., Simulation Modeling and Analysis, Fourth Edition, McGraw Hill, 2007, pp. 340-352.