Analyzing Input and Output Simulation Data
MIO 310 Optimering och Simulering (Operations Research, Basic Course), 2012

The main reference for this material is Chapter 9 in Business Process Modeling, Simulation and Design by M. Laguna and J. Marklund, Prentice Hall, 2005.
Overview

• Analysis of input data
  – Identification of field data distributions
  – Goodness-of-fit tests
  – Random number generation
• Analysis of simulation output data
  – Non-terminating vs. terminating processes
  – Confidence intervals
  – Hypothesis testing for comparing designs
Why Input and Output Data Analysis?

• Analysis of input data
  – Necessary for building a valid model
  – Three aspects: identification of (time) distributions, random number generation, and generation of random variates
• Analysis of output data
  – Necessary for drawing correct conclusions
  – The reported performance measures are typically random variables!

[Diagram: random Input Data → Simulation Model → random Output Data; random number generation is integrated into Extend]
Example from IKEA

• Goal: develop a general method to determine the most appropriate statistical distribution for describing average customer demand during the lead time from DC to store

Project description
• Optimization of safety stock
  – High service level and low costs
  – Important to know customer demand
• Today the normal distribution is used

The outcomes of the project
• A method to find the most appropriate distribution to describe average customer demand during the lead time from DC to store
• The normal distribution should not be used
  – The gamma distribution seems to be a better fit
Capturing Randomness in Input Data

1. Collect raw field data and use it as input for the simulation
   + No question about relevance
   + Very valuable for model validation
   – Expensive or impossible to retrieve a large enough data set
   – Not available for new processes
   – Not available for multiple scenarios ⇒ no sensitivity analysis

2. Generate artificial data that captures the characteristics of the real data
   1. Collect a sufficient sample of field data
   2. Characterize the data statistically (distribution type and parameters)
   3. Generate random artificial data mimicking the real data
   + High flexibility – easy to handle new scenarios
   + Cheap
   – Requires proper statistical analysis to ensure model validity
Procedure for Modeling Input Data

1. Gather data from the real system
2. Identify an appropriate distribution family
   • Plot histograms of the data
   • Compare the histogram graphically ("eye-balling") with the shapes of well-known distribution functions
     – Are the tails of the distribution limited or unlimited?
     – How should negative outcomes be handled?
3. Estimate distribution parameters and pick an "exact" distribution
4. Perform a goodness-of-fit test (can the hypothesis that the picked distribution is correct be rejected?)
   • Informal test – "eye-balling"
   • Formal tests, for example the χ2-test or the Kolmogorov-Smirnov test
   • If the distribution hypothesis is rejected, return to step 2
   • If no known distribution can be accepted, use an empirical distribution
Example – Modeling Interarrival Times (I)

1. Gather data from the real system

   Interarrival time t | Frequency
   0 ≤ t < 3           | 23
   3 ≤ t < 6           | 10
   6 ≤ t < 9           | 5
   9 ≤ t < 12          | 1
   12 ≤ t < 15         | 1
   15 ≤ t < 18         | 2
   18 ≤ t < 21         | 0
   21 ≤ t < 24         | 1
   24 ≤ t < 27         | 1
   Etc.
Example – Modeling Interarrival Times (II)

2. Identify an appropriate distribution type/family by plotting a histogram
   1) Divide the data material into appropriate intervals, usually of equal size
   2) Determine the event frequency for each interval (or bin)
   3) Plot the frequency (y-axis) for each interval (x-axis)

[Histogram of the interarrival-time frequencies over the bins 0-3, 3-6, …, 24-27]

The exponential distribution seems to be a good first guess!
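The binning in steps 1)-3) can be sketched in a few lines of Python (illustrative only – the course uses Extend; the raw interarrival times below are hypothetical):

```python
from collections import Counter

# Build the frequency table behind the histogram: assign each observation
# t to the bin [k*width, (k+1)*width). Bin width 3 matches the slide's table.

def bin_frequencies(times, width=3):
    """Return {(lo, hi): frequency} for the occupied bins, in order."""
    counts = Counter(int(t // width) for t in times)
    return {(k * width, (k + 1) * width): counts[k] for k in sorted(counts)}

times = [0.4, 1.2, 2.7, 2.9, 3.3, 5.8, 7.1, 10.5, 16.2, 22.4]  # hypothetical
for (lo, hi), freq in bin_frequencies(times).items():
    print(f"{lo:2d} <= t < {hi:2d}: {freq}")
```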
Example – Modeling Interarrival Times (III)

3. Estimate the parameters defining the chosen distribution
   – In the current example Exp(λ) has been chosen, so we need to estimate the parameter λ
   – ti = the ith interarrival time in the collected sample of n observations

   λ̂ = 1/t̄ = n/(t1 + t2 + … + tn) ≈ 0.084
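The estimate above is just the reciprocal of the sample mean. A minimal sketch in Python (the sample values are hypothetical, chosen so the estimate lands near the slide's 0.084):

```python
# Maximum-likelihood estimate of the Exp(lambda) rate parameter:
# lambda_hat = n / sum(t_i) = 1 / (sample mean).

def estimate_exp_rate(times):
    """Estimate lambda from a sample of interarrival times."""
    if not times:
        raise ValueError("need at least one observation")
    return len(times) / sum(times)

interarrival_times = [2.0, 1.5, 0.5, 4.0, 12.0, 25.0, 8.0, 3.0, 11.0, 52.0]
lam = estimate_exp_rate(interarrival_times)
print(f"lambda_hat = {lam:.4f}, mean interarrival time = {1/lam:.2f}")
```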
Example – Modeling Interarrival Times (III)

4. Perform a goodness-of-fit test
   – The purpose is to test the hypothesis that the data material is adequately described by the "exact" distribution chosen in steps 1-3.
   – Two of the most well-known standardized tests are
     • The χ2-test
       – Should not be applied if the sample size n < 20
     • The Kolmogorov-Smirnov test
       – A relatively simple but imprecise test
       – Often used for small sample sizes
   – The χ2-test will be applied in the current example
Performing a χ2-Test (I)

In principle: a statistical test comparing the relative frequencies for the intervals/bins in a histogram with the theoretical probabilities of the chosen distribution.

• Assumptions
  – The distribution involves k parameters estimated from the sample
  – The sample contains n observations (sample size = n)
  – F0(x) denotes the chosen/hypothesized CDF

Data: x1, x2, …, xn (n observations from the real system)
Model: X1, X2, …, Xn (random variables, independent and identically distributed with CDF F(x))

Null hypothesis H0: F(x) = F0(x)
Alternative hypothesis HA: F(x) ≠ F0(x)
Performing a χ2-Test (II)

1. Take the entire data range and divide it into r non-overlapping intervals or bins, min = a0 < a1 < … < ar = max
   • pi = the probability that an observation X belongs to bin i under the null hypothesis: pi = F0(ai) - F0(ai-1)
   • To improve the accuracy of the test, choose the bins (intervals) so that the probabilities pi (i = 1, 2, …, r) are equal for all bins

[Figure: the density f0(x) over the data range divided into bins 1, 2, …, r by the points a0, a1, …, ar; the area under f0(x) between a1 and a2 equals p2 = F0(a2) - F0(a1)]
Performing a χ2-Test (III)

2. Define r random variables Oi, i = 1, 2, …, r
   – Oi = the number of observations in bin i (= the interval (ai-1, ai])
   – If H0 is true, the expected value of Oi is n·pi
     • Oi is binomially distributed with parameters n and pi

3. Define the test variable T

   T = Σ_{i=1}^{r} (Oi - n·pi)² / (n·pi)

   – If H0 is true, T follows a χ2(r-k-1) distribution, where k = the number of estimated parameters in the theoretical distribution being tested
   – Tα = the critical value of T corresponding to a significance level α, obtained from a χ2(r-k-1) distribution table
   – Tobs = the value of T computed from the data material. If Tobs > Tα, H0 can be rejected at the significance level α
Validity of the χ2-Test

• Depends on the sample size n and on the bin selection (the size of the intervals)
• Rules of thumb
  – The χ2-test is acceptable for ordinary significance levels (α = 1%, 5%) if the expected number of observations in each interval is greater than 5 (n·pi > 5 for all i)
  – In the case of continuous data and a bin selection such that pi is equal for all bins:
    n ≤ 20: do not use the χ2-test
    20 < n ≤ 50: 5-10 bins recommended
    50 < n ≤ 100: 10-20 bins recommended
    n > 100: n^0.5 – 0.2n bins recommended
Example – Modeling Interarrival Times (IV)

• Hypothesis – the interarrival time Y is Exp(0.084) distributed
  H0: Y ∈ Exp(0.084)
  HA: Y ∉ Exp(0.084)
• Bin sizes are chosen so that the probability pi is equal for all r bins and n·pi > 5 for all i
  – Equal pi ⇒ pi = 1/r
  – n·pi > 5 ⇒ n/r > 5 ⇒ r < n/5
  – n = 50 ⇒ r < 50/5 = 10 ⇒ choose for example r = 8 ⇒ pi = 1/8
• Determining the interval limits ai, i = 0, 1, …, 8
  Under H0: F0(ai) = 1 - e^(-0.084·ai) = i·pi ⇒ ai = -ln(1 - i·pi)/0.084

  i = 1: a1 = -ln(1 - 1/8)/0.084 = 1.590
  i = 2: a2 = -ln(1 - 2/8)/0.084 = 3.425
  …
  i = 8: a8 = -ln(1 - 8/8)/0.084 = ∞
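Inverting the exponential CDF for every i gives all the equal-probability bin edges at once. A minimal sketch with the slide's numbers (rate 0.084, r = 8):

```python
import math

# Equal-probability bin edges for an Exp(rate) distribution:
# solve F0(a_i) = 1 - exp(-rate * a_i) = i/r  =>  a_i = -ln(1 - i/r) / rate.

def exp_equal_prob_edges(rate, r):
    """Return the r+1 bin edges a_0, ..., a_r; the last edge is infinite."""
    edges = [-math.log(1 - i / r) / rate for i in range(r)]
    edges.append(math.inf)  # a_r = infinity: F0 never reaches 1 at finite t
    return edges

edges = exp_equal_prob_edges(0.084, 8)
print([round(a, 3) for a in edges[:3]])  # → [0.0, 1.59, 3.425]
```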
Example – Modeling Interarrival Times (V)

• Computing the test statistic Tobs (oi = the actual number of observations in bin i)

  Tobs = Σ_{i=1}^{8} (oi - 50/8)² / (50/8) = 39.6

• Determining the critical value Tα
  – If H0 is true, T ∈ χ2(8-1-1) = χ2(6)
  – If α = 0.05: P(T ≤ T0.05) = 1 - α = 0.95 ⇒ /χ2 table/ ⇒ T0.05 = 12.60
• Rejecting the hypothesis
  – Tobs = 39.6 > 12.6 = T0.05 ⇒ H0 is rejected on the 5% level
Distribution Choice in Absence of Sample Data

• Common situation, especially when designing new processes
  – Try to draw on expert knowledge from people involved in similar tasks
• When estimates of interval lengths are available
  – Ex. the service time ranges between 5 and 20 minutes
  ⇒ plausible to use a Uniform distribution with min = 5 and max = 20
• When estimates of the interval and the most likely value exist
  – Ex. min = 5, max = 20, most likely = 12
  ⇒ plausible to use a Triangular distribution with those parameter values
• When estimates of min = a, most likely = c, max = b, and the average value x̄ are available
  ⇒ use a β-distribution with parameters

  α = (x̄ - a)(2c - a - b) / [(c - x̄)(b - a)]
  β = α(b - x̄) / (x̄ - a)
Random Number Generators

• Needed to create artificial input data for the simulation model
• Generating truly random numbers is difficult
  – Computers use pseudo-random number generators based on mathematical algorithms – not truly random, but good enough
• A popular algorithm is the "linear congruential method"
  1. Define a random seed x0 from which the sequence is started
  2. The next "random" number in the sequence is obtained from the previous one through the relation

     x_{n+1} = (a·xn + c) mod m

     where a, c, and m are integers > 0
Example – The Linear Congruential Method

• Assume that m = 8, a = 5, c = 7, and x0 = 4

  x_{n+1} = (5·xn + 7) mod 8

  n | xn | 5xn+7 | (5xn+7)/8 | xn+1
  0 | 4  | 27    | 3 + 3/8   | 3
  1 | 3  | 22    | 2 + 6/8   | 6
  2 | 6  | 37    | 4 + 5/8   | 5
  3 | 5  | 32    | 4 + 0/8   | 0
  4 | 0  | 7     | 0 + 7/8   | 7
  5 | 7  | 42    | 5 + 2/8   | 2
  6 | 2  | 17    | 2 + 1/8   | 1
  7 | 1  | 12    | 1 + 4/8   | 4

• Larger m ⇒ longer sequence before it starts repeating itself
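The table above can be reproduced with a few lines of Python; with these small parameters the generator cycles through all eight residues before repeating:

```python
# Linear congruential generator: x_{n+1} = (a * x_n + c) mod m,
# with the slide's parameters m = 8, a = 5, c = 7, x0 = 4.

def lcg(m, a, c, seed):
    """Yield an endless stream of pseudo-random integers in [0, m)."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

gen = lcg(m=8, a=5, c=7, seed=4)
sequence = [next(gen) for _ in range(8)]
print(sequence)  # → [3, 6, 5, 0, 7, 2, 1, 4]
```

Note that x8 = 4 = x0, so the sequence repeats with period m = 8, illustrating why a large m is needed in practice.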
Generating Random Variates

• Assume random numbers r from a Uniform(0, 1) distribution are available
  ⇒ random numbers from any distribution can be obtained by applying the "inverse transformation technique"

The inverse transformation technique
1. Generate a U[0, 1] distributed random number r
2. Let T be a random variable with a CDF FT(t) from which we would like to obtain a sequence of random numbers
   – Note: 0 ≤ FT(t) ≤ 1 for all values of t

   Let FT(t) = r and solve for t: t = FT^(-1)(r)

   t is a random number from the distribution of T, i.e., a realization of T

• See Example – The Exponential distribution
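For the exponential distribution the inverse CDF has a closed form, so the technique is a one-liner. A sketch reusing the λ ≈ 0.084 estimate from the interarrival-time example:

```python
import math
import random

# Inverse transformation for the exponential distribution:
# F_T(t) = 1 - exp(-rate * t) = r  =>  t = -ln(1 - r) / rate.

def exp_variate(rate, r):
    """Map a U(0,1) number r to an Exp(rate) variate via the inverse CDF."""
    return -math.log(1.0 - r) / rate

random.seed(42)  # fixed seed so the run is reproducible
samples = [exp_variate(0.084, random.random()) for _ in range(5)]
print([round(t, 2) for t in samples])
```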
Analysis of Simulation Output Data

• The output data collected from a simulation model are realizations of stochastic variables
  – They result from random input data and random processing times
• Statistical analysis is required to
  1. Estimate performance characteristics
     – Mean, variance, confidence intervals etc. for output variables
  2. Compare performance characteristics for different designs
• The validity of the statistical analysis and the design conclusions are contingent on a careful sampling approach
  – Sample sizes – run length and number of runs
  – Inclusion or exclusion of "warm-up" periods?
  – One long simulation run or several shorter ones?
Terminating vs. Non-Terminating Processes

[Flow chart: a process is studied through simulation; the simulation is either terminating (event-controlled or time-controlled termination) or non-terminating, leading to transient state analysis or steady state analysis, respectively]
Non-Terminating Processes

• Do not end naturally within a particular time horizon
  – Ex. inventory systems
• Usually reach steady state after an initial transient period
  – Assumes that the input data are stationary
• To study the steady-state behavior it is vital to determine the duration of the transient period
  – Examine line plots of the output variables
• To reduce the duration of the transient (= "warm-up") period
  – Initialize the process with appropriate average values
Illustration – Transient and Steady State

[Line plot of cycle times and average cycle time versus simulation time (0-50), showing an initial transient state followed by steady state]
Terminating Processes

• End after a predetermined time span
  – Typically the system starts from an empty state and ends in an empty state
  – Ex. a grocery store, a construction project, …
• Terminating processes may or may not reach steady state
  – Usually the transient period is of great interest for these processes
• Output data are usually obtained from multiple independent simulation runs
  – The length of a run is determined by the natural termination of the process
  – Each run needs a different stream of random numbers
  – The initial state of each run is typically the same
Confidence Intervals and Point Estimates

• Statistical estimation of measures from a data material is typically done in two ways
  – Point estimates (single values)
  – Confidence intervals (intervals)
• The confidence level, 1 - α
  – The significance level α indicates the probability that the interval does not contain the true value (Type I error)
  – Chosen by the analyst/manager
• Determinants of the confidence interval width
  – The chosen confidence level: lower α ⇒ wider confidence interval
  – The sample size and the standard deviation (σ): larger sample and smaller standard deviation ⇒ narrower interval
Important Point Estimates

• In simulation the most commonly used statistics are the mean and the standard deviation (σ), estimated from a sample of n observations

  Point estimate of the mean: x̄ = (x1 + x2 + … + xn) / n

  Point estimate of σ: s = √( Σ_{i=1}^{n} (xi - x̄)² / (n - 1) )
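Both point estimates can be sketched directly from their definitions (the data values below are hypothetical simulation outputs):

```python
import math

# Sample mean x_bar and sample standard deviation s (n-1 divisor),
# exactly as defined on the slide.

def mean_and_std(xs):
    """Return (x_bar, s) for a sample of at least two observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    return x_bar, s

data = [12.1, 9.8, 11.4, 10.6, 12.9, 8.7]  # hypothetical output values
x_bar, s = mean_and_std(data)
print(f"x_bar = {x_bar:.3f}, s = {s:.3f}")
```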
Confidence Interval for Population Means (I)

Characteristics of the point estimate for the population mean
– Xi = random variable representing the value of the ith observation in a sample of size n (i = 1, 2, …, n)
– Assume that all observations Xi are independent random variables
– The population mean: E[Xi] = μ
– The population standard deviation: (Var[Xi])^0.5 = σ
– Point estimate of the population mean:

  X̄ = (X1 + X2 + … + Xn) / n

– Mean and standard deviation of the point estimate for the population mean:

  E[X̄] = (E[X1] + E[X2] + … + E[Xn]) / n = nμ/n = μ
  Var[X̄] = (Var[X1] + Var[X2] + … + Var[Xn]) / n² = nσ²/n² = σ²/n
  ⇒ σ_x̄ = σ/√n
Confidence Interval for Population Means (II)

Distribution of the point estimate for population means
– For any distribution of Xi (i = 1, 2, …, n), when n is large (n ≥ 30), due to the Central Limit Theorem X̄ ∈ N(μ, σ_x̄) approximately
– If all Xi (i = 1, 2, …, n) are normally distributed, this holds exactly for any n

• A standard transformation: Z = (X̄ - μ)/σ_x̄ ∈ N(0, 1)

• Defining a symmetric two-sided confidence interval
  – P(-Z_{α/2} ≤ Z ≤ Z_{α/2}) = 1 - α
  – α is known ⇒ Z_{α/2} can be found from a N(0, 1) probability table

  Confidence interval for the population mean:

  x̄ - Z_{α/2}·σ_x̄ ≤ μ ≤ x̄ + Z_{α/2}·σ_x̄
Confidence Interval for Population Means (III)

• In case the population standard deviation σ is known (σ_x̄ = σ/√n):

  x̄ - Z_{α/2}·σ/√n ≤ μ ≤ x̄ + Z_{α/2}·σ/√n

• In case σ is unknown we need to estimate it
  – Use the point estimate s
  ⇒ the test variable is no longer normally distributed; it follows a Student's t distribution with n-1 degrees of freedom

  x̄ - t_{(n-1),α/2}·s/√n ≤ μ ≤ x̄ + t_{(n-1),α/2}·s/√n

In practice, when n is large (≥ 30), the t-distribution is often approximated with the normal distribution!
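Using the large-n normal approximation mentioned above, the interval can be computed with the standard library alone. A sketch with hypothetical output data:

```python
from statistics import NormalDist, mean, stdev

# Normal-approximation confidence interval for a population mean:
# x_bar ± Z_{alpha/2} * s / sqrt(n), valid for large n.

def mean_confidence_interval(xs, alpha=0.05):
    n = len(xs)
    x_bar, s = mean(xs), stdev(xs)           # point estimates
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2}, about 1.96 at 5%
    half_width = z * s / n ** 0.5
    return x_bar - half_width, x_bar + half_width

data = [10.2, 9.7, 11.1, 10.5, 9.9, 10.8, 10.1, 10.4] * 4  # 32 observations
lo, hi = mean_confidence_interval(data)
print(f"95% CI: ({lo:.3f}, {hi:.3f})")
```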
Determining an Appropriate Sample Size

• A common problem in simulation
  – How many runs, and how long should they be?
• Depends on the variability of the sought output variables
• If a symmetric confidence interval of width 2d is desired for a mean performance measure, x̄ - d ≤ μ ≤ x̄ + d
  – If x̄ is normally distributed:

    d = Z_{α/2}·σ/√n ⇒ n = (Z_{α/2}·σ/d)²

  – If σ is unknown and estimated with s:

    n = (Z_{α/2}·s/d)²
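Solving the formula for n and rounding up gives a simple sample-size calculator (the s and d values below are hypothetical, e.g. s from a pilot run):

```python
import math
from statistics import NormalDist

# Required sample size for a confidence interval of half-width d:
# n = (Z_{alpha/2} * s / d)^2, rounded up to the next integer.

def required_sample_size(s, d, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * s / d) ** 2)

n = required_sample_size(s=4.0, d=1.0)  # hypothetical pilot estimate s = 4
print(n)  # → 62
```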
Hypothesis Testing (I)

1. Testing if a population mean (μ) is equal to, larger than, or smaller than a given value a
   – Suppose that in a sample of n observations the point estimate of μ is x̄

   Hypothesis           | Reject H0 if …                                | Type of test
   H0: μ = a, HA: μ ≠ a | (x̄ - a)/(s/√n) > Z_{α/2} or < -Z_{α/2}        | Symmetric two-tail test
   H0: μ ≥ a, HA: μ < a | (x̄ - a)/(s/√n) < -Z_{α}                       | One-tail test
   H0: μ ≤ a, HA: μ > a | (x̄ - a)/(s/√n) > Z_{α}                        | One-tail test
Hypothesis Testing (II)

2. Testing if two sample means are significantly different
   – Useful when comparing process designs
• A two-tail test when σ1 = σ2 = s
  – H0: μ1 - μ2 = a /typically a = 0/
    HA: μ1 - μ2 ≠ a
  – The test statistic

    Z = (x̄1 - x̄2 - (μ1 - μ2)) / (s·√(1/n1 + 1/n2))

    belongs to a Student's t distribution with n1 + n2 - 2 degrees of freedom
  – Reject H0 on the significance level α if it is not true that

    -t_{(n1+n2-2),(1-α/2)} ≤ Z ≤ t_{(n1+n2-2),(1-α/2)}
Hypothesis Testing (III)

• If the sample sizes are large (n1 + n2 - 2 > 30), Z is approximately N(0, 1) distributed
  ⇒ reject H0 if it is not true that -Z_{α/2} ≤ Z ≤ Z_{α/2}

• In practice, when comparing designs, non-overlapping 3σ intervals are often used as a criterion
  – H0: μ1 - μ2 > 0
    HA: μ1 - μ2 ≤ 0
  – Reject H0 if

    x̄1 - x̄2 + 3·σ_(x̄1-x̄2) ≤ 0, where σ_(x̄1-x̄2) = √(s1²/n1 + s2²/n2)
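The two-sample comparison from Hypothesis Testing (II) can be sketched as follows. The slides assume a common standard deviation s; the pooled estimate used here is one standard way to obtain it, and the cycle-time samples for the two designs are hypothetical:

```python
import math
from statistics import mean, stdev

# Two-sample test statistic for comparing design means:
# Z = (x1_bar - x2_bar - a) / (s * sqrt(1/n1 + 1/n2)), with a = 0.

def two_sample_statistic(xs1, xs2):
    n1, n2 = len(xs1), len(xs2)
    # Pooled standard deviation (assumes equal population std devs)
    s = math.sqrt(((n1 - 1) * stdev(xs1) ** 2 + (n2 - 1) * stdev(xs2) ** 2)
                  / (n1 + n2 - 2))
    return (mean(xs1) - mean(xs2)) / (s * math.sqrt(1 / n1 + 1 / n2))

design_a = [12.3, 11.8, 12.9, 12.1, 11.5, 12.6]  # hypothetical cycle times
design_b = [10.9, 11.2, 10.4, 11.0, 10.7, 11.3]
z = two_sample_statistic(design_a, design_b)
print(f"Z = {z:.2f}")  # compare with the t(10) critical value 2.23 at alpha = 5%
```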