Post on 29-Aug-2018
CS626 Data Analysis and Simulation
Today: Stochastic Input Modeling
Reference: Law/Kelton, Simulation Modeling and Analysis, Ch. 6.
NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/
Instructor: Peter Kemper, R 104A, phone 221-3462, email: kemper@cs.wm.edu
Office hours: Monday, Wednesday 2-4 pm
What is input modeling?
Input modeling: deriving a representation of the uncertainty or randomness in a stochastic simulation.
Common representations:
• Measurement data
• Distributions derived from measurement data <-- focus of “Input modeling”
  – usually requires that the samples are i.i.d. and that the corresponding random variables in the simulation model are i.i.d. (i.i.d. = independent and identically distributed)
  – theoretical distributions
  – empirical distribution
• Time-dependent stochastic process
• Other stochastic processes
Examples include time to failure for a machining process; demand per unit time for inventory of a product; number of defective items in a shipment of goods; times between arrivals of calls to a call center.
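The i.i.d. requirement can be screened numerically before any fitting. The following is a minimal sketch, assuming the lag-1 sample autocorrelation as the screening statistic (one possible check among several); the synthetic data and the helper name `lag1_autocorrelation` are illustrative only.

```python
import random

def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation; values near 0 are consistent with independence."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var

rng = random.Random(1)                               # fixed seed, illustration only
iid = [rng.expovariate(1.0) for _ in range(2000)]    # synthetic i.i.d. sample
trend = [x + 0.01 * i for i, x in enumerate(iid)]    # drifting mean: not identically distributed

print(round(lag1_autocorrelation(iid), 3))
print(round(lag1_autocorrelation(trend), 3))
```

A value close to 0 does not prove independence; a value far from 0 is a warning that a fitted distribution alone will not describe the input.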
Overview of fitting with data
• Check if key assumptions hold (i.i.d.).
• Select one or more candidate distributions based on physical characteristics of the process and graphical examination of the data.
• Fit the distribution to the data (determine values for its unknown parameters).
• Check the fit to the data via statistical tests and via graphical analysis.
• If the distribution does not fit, select another candidate and repeat the process, or use an empirical distribution.
from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission
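The "fit the distribution" step can be sketched as follows, assuming an exponential candidate and maximum-likelihood estimation (for the exponential, the estimated mean is simply the sample mean). The data are synthetic and the function names are hypothetical.

```python
import math
import random
import statistics

def fit_exponential(data):
    """Maximum-likelihood estimate of the exponential mean is the sample mean."""
    return statistics.fmean(data)

def exponential_cdf(x, mean):
    return 1.0 - math.exp(-x / mean) if x > 0 else 0.0

rng = random.Random(42)                                  # fixed seed, illustration only
data = [rng.expovariate(1 / 5.0) for _ in range(500)]    # synthetic "measurements", true mean 5

mean_hat = fit_exponential(data)
print(f"fitted mean = {mean_hat:.2f}")
# Quick sanity check of the fit: the fitted cdf at the sample median should be near 0.5.
print(f"F(median) = {exponential_cdf(statistics.median(data), mean_hat):.2f}")
```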
Check the fit to the data
Graphical analysis
• Plot the fitted distribution and the data in a way that makes differences recognizable.
• Beyond obvious cases, there is a grey area of subjective acceptance/rejection.
• Challenges:
  – How much difference is significant enough to trash a fitted distribution?
  – Which graphical representation is easy to judge?
• Options:
  – Histogram-based plots
  – Probability plots: P-P plot, Q-Q plot
Statistical tests
• Define a measure X for the difference between the fitted distribution and the data.
• X is an RV, so if we can argue what distribution X has, we get a statistical test to see whether, in a concrete case, a value of X is significant.
• Goodness-of-fit tests: Chi-square test (χ2), Kolmogorov-Smirnov test (K-S), Anderson-Darling test (AD)
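As one instance of such a measure X, the K-S statistic is the largest vertical gap between the empirical cdf and the fitted cdf. A minimal sketch, assuming a sample without ties and an exponential model chosen only for illustration:

```python
import math

def ks_statistic(data, cdf):
    """Largest vertical gap between the empirical cdf of `data` and the model `cdf`."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for j, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical cdf jumps from (j-1)/n to j/n at x; check both sides of the jump.
        d = max(d, j / n - f, f - (j - 1) / n)
    return d

def exp_cdf(x, mean=1.0):
    return 1.0 - math.exp(-x / mean) if x > 0 else 0.0

sample = [0.1, 0.4, 0.7, 1.2, 2.5]            # made-up observations
print(round(ks_statistic(sample, exp_cdf), 4))
```

Whether a given value is significant depends on the distribution of the statistic, which changes when parameters are estimated from the same data.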
Sample test characteristic for Chi-Square test (all parameters known)
One-sided test
Right side:
- critical region
- region of rejection
Left side:
- region of acceptance, where we fail to reject the hypothesis
P-value of x: 1-F(x)
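The test statistic and its P-value 1-F(x) can be computed by hand for small cases. This is a sketch under stated assumptions: all parameters known (so the degrees of freedom are the number of cells minus 1), and an even number of degrees of freedom, for which the chi-square tail probability has a closed form. The cell counts are made up.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum over cells of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_p_value(x, dof):
    """P-value 1 - F(x) for the chi-square distribution; this closed form holds for even dof only."""
    if dof <= 0 or dof % 2:
        raise ValueError("this sketch handles positive even degrees of freedom only")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(dof // 2))

observed = [18, 25, 22, 15, 20]                # made-up cell counts, n = 100
expected = [20.0] * 5                          # equiprobable cells under the fitted model
x = chi_square_statistic(observed, expected)
print(x, chi_square_p_value(x, dof=len(observed) - 1))
```

A small P-value places x in the critical region on the right; a large one means we fail to reject.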
Graphic Analysis vs Goodness-of-fit tests
Graphic analysis includes:
• Histogram with fitted distribution
• Probability plots: P-P plot, Q-Q plot
Goodness-of-fit tests
• represent lack of fit by a summary statistic, while plots show where the lack of fit occurs and whether it is important.
• may accept the fit, but the plots may suggest the opposite, especially when the number of observations is small.
Graphic Analysis
A data set of 30 observations is believed to be from a normal distribution. The following are the p-values from the chi-square test and the K-S test:
Chi-square test: between 0.1 and 0.2
K-S test: > 0.15
What is your conclusion?
Density Histogram
compares sample histogram (mind the bin sizes) with fitted distribution
Frequency Histogram
compares histogram from data with histogram according to fitted distribution
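The frequency comparison can be sketched as observed cell counts against the counts the fitted distribution predicts, n * (F(right) - F(left)). The sample, the cell edges and the fitted exponential model below are assumptions for illustration.

```python
import math

def observed_counts(data, edges):
    """Number of observations in each cell [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def expected_counts(n, edges, cdf):
    """Counts the fitted model predicts per cell: n * (F(right) - F(left))."""
    return [n * (cdf(edges[i + 1]) - cdf(edges[i])) for i in range(len(edges) - 1)]

def exp_cdf(x, mean=2.0):                      # hypothetical fitted model: exponential, mean 2
    return 1.0 - math.exp(-x / mean) if x > 0 else 0.0

data = [0.2, 0.5, 0.9, 1.1, 1.8, 2.4, 3.0, 3.7, 5.2, 7.5]   # made-up sample
edges = [0.0, 1.0, 2.0, 4.0, 8.0]
print(observed_counts(data, edges))
print([round(e, 2) for e in expected_counts(len(data), edges, exp_cdf)])
```

Changing `edges` changes both lists, which is why the bin sizes deserve attention.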
Differences in distributions are easier to see along a straight line:
Graphical comparisons
Frequency Comparisons
Features:
• Graphical comparison of a histogram of the data with the density function of the fitted distribution.
• Sensitive to how we group the data.
Probability Plots
Features:
• Graphical comparison of an estimate of the true distribution function of the data with the distribution function of the fit.
• Q-Q (P-P) plot amplifies differences between the tails (middle) of the model and sample distribution functions.
• Use every graphical tool in the software to examine the fit.
• If histogram-based tool, then play with the widths of the cells.
• Q-Q plot is very highly recommended!
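P-P plots and Q-Q plots
Q-Q plot: quantiles of the fitted distribution vs quantiles of the sample, for q1,...,qn
P-P plot: probabilities under the fitted distribution vs probabilities under the sample distribution, for p1,...,pn
This intuitive definition needs an adjustment to handle ties (multiple samples of the same value).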
Q-Q plot
Recall that one way to generate data from cdf F is via
  $Y = F^{-1}(R)$
The Q-Q plot displays the sorted data
  $Y_1 \le Y_2 \le \dots \le Y_n$
vs.
  $F^{-1}\left(\frac{j - 1/2}{n}\right), \quad j = 1, 2, \dots, n$
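The Q-Q points defined above can be computed directly: sort the data and pair the j-th smallest value with the model quantile at (j - 1/2)/n. A minimal sketch, assuming an exponential model with mean 1 as the fitted distribution; the sample is made up.

```python
import math

def qq_points(data, inverse_cdf):
    """Pair the j-th smallest observation with the model quantile F^{-1}((j - 1/2) / n)."""
    ys = sorted(data)
    n = len(ys)
    return [(inverse_cdf((j - 0.5) / n), y) for j, y in enumerate(ys, start=1)]

def exp_inverse_cdf(p, mean=1.0):
    """Inverse cdf of the exponential distribution."""
    return -mean * math.log(1.0 - p)

sample = [0.3, 2.2, 0.9, 0.1, 1.4]             # made-up observations
for model_q, data_q in qq_points(sample, exp_inverse_cdf):
    print(f"{model_q:.3f}  {data_q:.3f}")
```

Plotting the second column against the first gives the Q-Q plot.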
Q-Q Plot Intuition
• We have a sample Y1, Y2, ..., Yn and we fit a distribution F that we hope is good.
• If we now generate a random sample of size n from F, it should look about like Y1, Y2, ..., Yn.
• The Q-Q plot generates a perfect random sample for comparison.
Features of the Q-Q plot
It does not depend on how the data are grouped.
It is much better than a density-histogram when the number of data points is small.
Deviations from a straight line show where the distribution does not match.
A straight line implies that the family of distributions is correct. A 45° line implies that the parameters fit as well.
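The difference between "straight line" and "45° line" can be checked with a least-squares line through the Q-Q points. The sketch below assumes idealized data (exact quantiles of an exponential with mean 2) so that the result is deterministic; the helper names are hypothetical.

```python
import math

def fit_line(points):
    """Least-squares slope and intercept through (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def exp_quantile(p, mean):
    return -mean * math.log(1.0 - p)

n = 50
probs = [(j - 0.5) / n for j in range(1, n + 1)]
# Idealized "data": exact quantiles of an exponential with mean 2 (hypothetical true process).
data = [exp_quantile(p, 2.0) for p in probs]
# Fitted model with the right family but the wrong parameter (mean 1).
wrong = [(exp_quantile(p, 1.0), y) for p, y in zip(probs, data)]
# Fitted model with the right family and the right parameter (mean 2).
right = [(exp_quantile(p, 2.0), y) for p, y in zip(probs, data)]

print(fit_line(wrong))   # straight line with slope about 2: family fits, parameter does not
print(fit_line(right))   # slope about 1, intercept about 0: the 45-degree line
```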
Features of the Q-Q Plot
A straight line implies the family of distributions is correct; a 45-degree line implies correct parameters.
[Figure: two Q-Q plots, fitted quantile vs input quantile, both axes 0 to 120. Fitted distributions: LogLogistic(-113.32, 156.71, 16.107) and Exponential(44.468) Shift=-0.58. Annotations: "Pretty good fit, but misses a bit on the right tail." and "Poor fit, misses badly in both tails."]
Examples of Q-Q plot
[Figure: two Q-Q plots. First panel (axes 10 to 30): "The fitting is pretty good." Second panel (axes 5 to 50): "The distribution family seems OK but the parameters are not."]
Example of Q-Q plot
A data set of 30 observations is believed to be from a normal distribution. The following are the p-values from the chi-square test and the K-S test:
Chi-square test: between 0.1 and 0.2
K-S test: > 0.15
[Figure: Q-Q plot, both axes 10 to 35. Annotation: "Poor fit, miss badly in the left tail."]
P-P plot vs Q-Q plot: Sensitive to different kinds of deviations
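A small numeric sketch of this difference, assuming an idealized exponential(1) sample in which only the largest observation is distorted: the Q-Q plot compares values and shows a large gap in the tail, while the P-P plot compares probabilities and barely moves.

```python
import math

def exp_cdf(x):
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def exp_quantile(p):
    return -math.log(1.0 - p)

n = 50
probs = [(j - 0.5) / n for j in range(1, n + 1)]
data = [exp_quantile(p) for p in probs]        # idealized sample from the exponential(1) model
data[-1] *= 3.0                                # distort only the largest observation (right tail)

# Q-Q compares values (quantiles); P-P compares probabilities.
qq_gap = max(abs(y - exp_quantile(p)) for y, p in zip(data, probs))
pp_gap = max(abs(exp_cdf(y) - p) for y, p in zip(data, probs))
print(f"largest Q-Q gap: {qq_gap:.2f}")
print(f"largest P-P gap: {pp_gap:.3f}")
```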
Should we just use the best fit?
Software tools
• exercise a set of distributions
• optimize parameter settings for data and distribution
• evaluate statistical tests
• suggest a “best fit”
Some concerns about the fully automated solution:
• Tests represent lack of fit by a single summary statistic, while plots show where the lack of fit occurs and whether it is important.
• Be sure to try different numbers of histogram cells; it affects the p-value of the χ2 test, and your perception of the fit.
• Be cautious with ranking fits by Chi-Sq, K-S and A-D statistics and always check the Q-Q plot.
• If there is a strong physical basis for a particular distribution choice, then use it even if it is not the best fit.
Don’t be afraid to use your brain in addition to software!
Overview of fitting with data
• Select one or more candidate distributions, based on physical characteristics of the process and graphical examination of the data.
• Fit the distribution to the data (determine values for its unknown parameters).
• Check the fit to the data via tests and graphical analysis.
• If the distribution does not fit, then select another candidate and repeat the process.
What if no distribution provides a good fit?
Use the data itself when…
• No standard distribution fits well.
• We have no justification for a standard distribution.
• There is too little data to distinguish between standard distributions.
Reuse the data via empirical distribution. An example:
Empirical distribution --- An example
Objective: fit an input model to the data 2.1, 5.7, 3.4, 8.1 via the empirical distribution function.
[Figure: probability mass function over the input data (x-axis 0 to 10, marked at 2.1, 3.4, 5.7, 8.1; y-axis 0 to 3/4). Each data point is equally likely to be re-sampled.]
Empirical distribution
Each data point is equally likely to be resampled.
If you are concerned that only the values you saw can appear again, then fill in gaps by linearly interpolating between the sorted data points:
[Figure: Interpolated Empirical cdf; x-axis X from 0 to 10, y-axis cumulative probability from 0 to 1, with the cdf taking the values 0, 0.33, 0.67 and 1 at the sorted data points.]
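Plain resampling from the data is a one-liner. A minimal sketch using the four example observations; the seed and the number of draws are arbitrary.

```python
import random

data = [2.1, 5.7, 3.4, 8.1]
rng = random.Random(7)                         # fixed seed, illustration only
# Each observation has probability 1/4 on every draw; no unseen value can appear.
resampled = [rng.choice(data) for _ in range(10)]
print(resampled)
```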
Empirical distribution
Formal definition (Law/Kelton, p 326)
Let $X_1 \le X_2 \le \dots \le X_n$ be the sorted sequence of observations.
$$F(x) = \begin{cases} 0 & \text{if } x < X_1 \\ \dfrac{i-1}{n-1} + \dfrac{x - X_i}{(n-1)(X_{i+1} - X_i)} & \text{if } X_i \le x \le X_{i+1},\ i = 1, 2, \dots, n-1 \\ 1 & \text{if } X_n \le x \end{cases}$$
OK, but it cannot yield values less than X1 or more than Xn.
Also, the mean of F(x) does not match the sample mean.
If the data is grouped, a different approach is necessary. Law/Kelton describes such an extension with interpolation.
The real challenge is skewed distributions (mostly right-skewed), which likely have too few samples from the tail due to small tail probabilities.
Consider appending an artificial tail with the help of an exponential distribution.
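The interpolated empirical cdf and its inverse (for generating values) can be sketched as below, assuming distinct sorted observations and no artificial tail. The function names are hypothetical; the data are the four observations from the earlier example.

```python
import random

def interpolated_ecdf(sorted_xs, x):
    """Piecewise-linear empirical cdf through (X_i, (i-1)/(n-1)) for sorted observations."""
    n = len(sorted_xs)
    if x < sorted_xs[0]:
        return 0.0
    if x >= sorted_xs[-1]:
        return 1.0
    for i in range(1, n):                      # i is the 1-based index of the left endpoint
        lo, hi = sorted_xs[i - 1], sorted_xs[i]
        if lo <= x < hi:
            return (i - 1) / (n - 1) + (x - lo) / ((n - 1) * (hi - lo))
    raise ValueError("x not bracketed; are the observations sorted?")

def sample_interpolated(sorted_xs, u):
    """Invert the piecewise-linear cdf at probability u in [0, 1)."""
    n = len(sorted_xs)
    pos = u * (n - 1)                          # position along the n - 1 segments
    i = min(int(pos), n - 2)                   # 0-based index of the left endpoint
    return sorted_xs[i] + (pos - i) * (sorted_xs[i + 1] - sorted_xs[i])

xs = sorted([2.1, 5.7, 3.4, 8.1])
print(round(interpolated_ecdf(xs, 3.4), 2), round(interpolated_ecdf(xs, 5.7), 2))   # 0.33 0.67
rng = random.Random(3)                         # fixed seed, illustration only
print([round(sample_interpolated(xs, rng.random()), 2) for _ in range(5)])
```

All generated values stay between the smallest and largest observation, which is the limitation noted above.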
What if we have no data at all?
We have to use anything we can find...
• Engineering data, standards and ratings can provide central values.
• Expert opinion.
• Physical or conventional limitations can provide bounds.
• Physical basis of the process can suggest appropriate distribution families.
We model the expert opinion using either the breakpoints method or the mean and variability method.
Breakpoints method
Useful for modeling quantities with a large number of possible outcomes, such as quarterly sales volume.
Example: Sales of XYZ-123 will be no less than 1000 units, no more than 5000 units, and are most likely to be 3500 units.
[Figure: Triangular Distribution; density function vs sales (x-axis 500 to 5500). Marked percentiles: X <= 1707 at 5%, X <= 4452 at 95%.]
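The three numbers in the example (minimum 1000, most likely 3500, maximum 5000) fully specify a triangular distribution. A minimal sketch of its inverse cdf, with a hypothetical helper name; Python's `random.triangular(low, high, mode)` is used for sampling.

```python
import math
import random

def triangular_quantile(p, low, mode, high):
    """Inverse cdf of the triangular distribution on [low, high] with peak at mode."""
    split = (mode - low) / (high - low)        # cdf value at the mode
    if p <= split:
        return low + math.sqrt(p * (high - low) * (mode - low))
    return high - math.sqrt((1.0 - p) * (high - low) * (high - mode))

low, mode, high = 1000, 3500, 5000
print(round(triangular_quantile(0.05, low, mode, high)))   # 1707
print(round(triangular_quantile(0.95, low, mode, high)))   # 4452

rng = random.Random(11)                        # fixed seed, illustration only
print([round(rng.triangular(low, high, mode)) for _ in range(5)])
```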
Breakpoints method
Example: Sales of XYZ-123 will be between 1000 and 5000 with
• 25% chance of being at most 2000,
• 75% chance of being at most 3500,
• 99% chance of being at most 4500.
– Use only as many breakpoints as you can confidently get.
– Try to get breakpoints near the extremes if possible.
– Might be easier for experts to give the chance of exceeding a value.
[Figure: Cumulative Distribution; density function (x-axis 500 to 5500, y-axis 0 to 0.00035). Marked percentiles: X <= 1200 at 5.0%, X <= 4333 at 95.0%.]
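One way to turn the breakpoints into an input model is a piecewise-linear cdf through them, inverted for sampling. This is a sketch assuming linear interpolation between breakpoints (uniform density within each segment); the names `BREAKPOINTS` and `breakpoint_quantile` are hypothetical.

```python
import random

# Breakpoints (value, cumulative probability) from the example; the end points
# 1000 and 5000 get probabilities 0 and 1.
BREAKPOINTS = [(1000, 0.0), (2000, 0.25), (3500, 0.75), (4500, 0.99), (5000, 1.0)]

def breakpoint_quantile(p, breakpoints=BREAKPOINTS):
    """Invert the piecewise-linear cdf through the breakpoints at probability p."""
    for (x0, p0), (x1, p1) in zip(breakpoints, breakpoints[1:]):
        if p0 <= p <= p1:
            return x0 + (p - p0) / (p1 - p0) * (x1 - x0)
    raise ValueError("p must lie in [0, 1]")

print(round(breakpoint_quantile(0.05)))        # 1200
print(round(breakpoint_quantile(0.95)))        # 4333

rng = random.Random(5)                         # fixed seed, illustration only
print([round(breakpoint_quantile(rng.random())) for _ in range(5)])
```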
Mean and variability method
Example: The sales of XYZ-123 were 10,000. This year we expect a 15% increase, with a typical swing of 5% above or below that value. However, we won’t sell fewer than 7000 units, or more than 16,000, under any conditions.
[Figure: Normal Distribution; density function vs sales (x-axis 7000 to 16000, y-axis 0 to 0.0008). Marked percentiles: X <= 10554 at 5.0%, X <= 12446 at 95.0%.]
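The example can be turned into a normal input model as sketched below. Two assumptions are made here: the "typical swing of 5%" is read as one standard deviation (5% of the expected value), and the hard bounds are enforced by rejecting out-of-range draws.

```python
import random
from statistics import NormalDist

last_year = 10_000
mean = last_year * 1.15                        # expected 15% increase -> 11,500
sd = 0.05 * mean                               # assumption: "typical swing of 5%" read as one standard deviation
low, high = 7_000, 16_000                      # hard bounds from the expert

model = NormalDist(mean, sd)
print(round(model.inv_cdf(0.05)), round(model.inv_cdf(0.95)))   # 10554 12446

def sample_sales(rng):
    """Draw from the normal model, rejecting values outside the expert's bounds."""
    while True:
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            return x

rng = random.Random(2)                         # fixed seed, illustration only
print([round(sample_sales(rng)) for _ in range(5)])
```

With these numbers the bounds lie many standard deviations from the mean, so rejection almost never triggers.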
Checking the input model
Sensitivity analysis (varying the parameters of the input model) is especially important when the model is not based on data. While looking for marked changes in the output results, pay special attention to the standard deviation, bounds, or limits. Concentrate sensitivity analysis on those inputs to which the outputs are most sensitive.
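A sensitivity check can be as simple as re-running the model with perturbed input parameters. The sketch below uses the closed-form mean waiting time of an M/M/1 queue as a stand-in for a simulation model; this stand-in and the input values are assumptions for illustration.

```python
def mm1_mean_wait(arrival_rate, mean_service_time):
    """Mean time in queue for an M/M/1 queue (requires utilization below 1)."""
    rho = arrival_rate * mean_service_time
    if rho >= 1.0:
        raise ValueError("unstable queue: utilization must be below 1")
    return rho * mean_service_time / (1.0 - rho)

arrival_rate = 0.9                             # hypothetical input: customers per minute
base = 1.0                                     # hypothetical input: mean service time in minutes
for change in (-0.05, 0.0, 0.05):
    s = base * (1.0 + change)
    print(f"mean service time {s:.2f} -> mean wait {mm1_mean_wait(arrival_rate, s):.2f}")
```

Here a 5% change in one input roughly doubles the output, so this input deserves the most careful modeling.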
What if the process is dependent?
Usually we assume that all generated random observations across a simulation are independent.
Sometimes this is not true:
• A difficult part requires long processing in adjacent operations of a production system. This is positive correlation.
Ignoring such relations can invalidate the model.
Does Dependence Really Matter? YES, IT DOES!
[Figure: two single-server systems compared. Independent Arrivals feed Server A; Dependent Arrivals (correlation 0.9) feed Server B. Bar charts show Average Waiting Time (axis 0 to 10) and Average Number Waiting (axis 0 to 15) for the inventory in front of Server A and Server B.]
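The effect can be reproduced in a few lines. This is a sketch, not the experiment behind the slide: it assumes a single-server FIFO queue evaluated with the Lindley recursion, exponential service times with mean 0.8, and exponential interarrival times with mean 1 whose dependence is induced by transforming a normal AR(1) series (so the correlation 0.9 applies to the underlying normal series).

```python
import math
import random
from statistics import NormalDist

def average_wait(interarrivals, services):
    """Average waiting time in a single-server FIFO queue via the Lindley recursion."""
    wait, total = 0.0, 0.0
    for a, s in zip(interarrivals, services):
        total += wait
        wait = max(0.0, wait + s - a)          # wait of the next customer
    return total / len(services)

def exponential_interarrivals(n, mean, correlation, rng):
    """Exponential gaps whose underlying normal AR(1) series has the given lag-1 correlation."""
    std_normal = NormalDist()
    z = rng.gauss(0.0, 1.0)
    gaps = []
    for _ in range(n):
        u = std_normal.cdf(-z)                 # uniform(0, 1) that inherits the dependence of z
        gaps.append(-mean * math.log(u))       # inverse transform to an exponential gap
        z = correlation * z + math.sqrt(1.0 - correlation ** 2) * rng.gauss(0.0, 1.0)
    return gaps

n = 50_000
rng = random.Random(2010)                      # fixed seed, illustration only
services = [rng.expovariate(1 / 0.8) for _ in range(n)]      # mean service time 0.8
independent = exponential_interarrivals(n, 1.0, 0.0, rng)
dependent = exponential_interarrivals(n, 1.0, 0.9, rng)
print(f"independent arrivals: {average_wait(independent, services):.2f}")
print(f"dependent arrivals:   {average_wait(dependent, services):.2f}")
```

Both arrival streams have the same marginal distribution; only the dependence differs.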
Conclusion
• Use input models to represent uncertainty in simulation.
• The particular input model chosen matters!
• Selection of an input model is not an exact science: there is no right answer, but the issues to consider are
  – theoretical vs. empirical data
  – physical basis of the distribution
  – assessment of the goodness of a fit
  – independence of samples
• Assess the sensitivity of simulation output results to the input models chosen.
• Use expert opinion whenever you can.
• Do not automatically trust a completely automated derivation of an input model.