A Method for Extending the Size of Latin Hypercube Sample
Cedric J. SALLABERRY*, Jon C. HELTON+
Nuclear & Risk Technologies Center, Sandia National Laboratories, New Mexico
PO Box 5800, Albuquerque, NM 87185-0776 USA
(*: [email protected]) (+: [email protected])

ABSTRACT

Latin Hypercube Sampling (LHS) is widely used as a sampling-based method for probabilistic calculations. This method has some clear advantages over classical random sampling (RS) that derive from its efficient stratification properties. However, one of its limitations is that it is not possible to extend the size of an initial sample by simply adding new simulations, as this will lead to a loss of the efficient stratification associated with LHS. We describe a new method to extend the size of an LHS to n (>= 2) times its original size while preserving both the LHS structure and any induced correlations between the input parameters. This method involves introducing a refined grid for the original sample and then filling in the empty rows and columns with new data in a way that conserves both the LHS structure and any induced correlations. An estimate of the bounds of the resulting correlation between two variables is derived for n = 2. This result shows that the final correlation is close to the average of the correlations from the original sample and the new sample used in the infilling of the empty rows and columns indicated above.

Keywords: Latin Hypercube Sample, Correlation Control, Extension of Sample Size.
1. Introduction
Developed in the 1970s, Latin Hypercube Sampling (LHS) is now a widely used sampling method in probabilistic analysis ([1], [2]). This method has, indeed, many advantages compared to the classical Random Sampling (RS) method (cf. [2] for a comparison of the two methods). However, LHS suffers from some drawbacks. One of them is that it is not possible to extend the size of the sample by incrementally adding new sample elements to the original sample, because this will destroy the stratification. When the sample size is large enough (e.g., several thousand sample elements), this will not greatly affect the results. However, when the simulations are expensive in terms of calculation time, one may be limited to a few (e.g., 50) simulations. Moreover, generating another LHS (of the same size, for example) and adding this sample to the previous one will lead to a different global result. The next section (Sect. 2) presents a methodology for creating a larger LHS from an existing LHS that maintains the LHS stratification and allows the calculation of an unbiased estimate of the mean and the Cumulative Distribution Function (CDF). A simple example is also presented. In Sect. 3, we derive the resulting correlation coefficient between two sampled variables, which provides a way to bound this coefficient with respect to the correlation coefficient of the initial sample. Then, in Sect. 4, a Monte Carlo method is used to determine many resulting correlation coefficients and check the validity of the estimated bounds. The last section (Sect. 5)
presents some concluding remarks on the usefulness of this extension and on the possible improvement of the method.
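The stratification loss described above is easy to see numerically. The following sketch (not from the paper; a minimal illustration in Python) builds two independent LHSs of size k on the unit interval and checks how many of the 2k refined strata the concatenated sample occupies:

```python
import random

def lhs_unit(k, rng):
    """One variable of a size-k LHS on (0, 1): one point per stratum (s/k, (s+1)/k)."""
    strata = list(range(k))
    rng.shuffle(strata)
    return [(s + rng.random()) / k for s in strata]

rng = random.Random(42)
k = 10
u1 = lhs_unit(k, rng)      # first LHS of size k
u2 = lhs_unit(k, rng)      # second, independent LHS of size k
combined = u1 + u2

# A true size-2k LHS would occupy each of the 2k strata exactly once;
# naive concatenation generally leaves some strata empty and others doubled.
occupied_2k = {int(u * 2 * k) for u in combined}
print(len(occupied_2k))    # typically less than 2k
```

Each of `u1` and `u2` is individually a valid LHS of size k, but their union is generally not a valid LHS of size 2k.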
2. Methodology
The methodology is rather simple and is based on the following observation. When an LHS of size k is generated, k equiprobable strata $S_1, S_2, \ldots, S_k$ are defined for each variable. In each stratum, a single random value is sampled (Figure 1).
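In Python, this per-stratum sampling can be sketched as follows (an illustrative snippet, not the authors' code; the standard-library `statistics.NormalDist` provides the inverse CDF):

```python
import random
from statistics import NormalDist

def lhs_column(k, rng, dist=NormalDist()):
    """One LHS variable: one value drawn (uniformly in probability) inside
    each of the k equiprobable strata, in shuffled order."""
    strata = list(range(k))
    rng.shuffle(strata)
    return [dist.inv_cdf((s + rng.random()) / k) for s in strata]

rng = random.Random(0)
x = lhs_column(10, rng)
# Recover each value's stratum from its CDF: every stratum appears exactly once.
occupied = sorted(int(NormalDist().cdf(v) * 10) for v in x)
print(occupied)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```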
Figure 1. Position of the values in strata $S_i$ and $S_{i+1}$ in an LHS of size k
If we apply an $nk$ stratified pattern to the sample, i.e., if we split each stratum into n smaller equiprobable strata, then the original sampled value will fall in only one of the new strata, and n-1 strata will contain no sampled value (Figure 2).
Figure 2. Position of the values in an LHS of size nk
If we fill the available strata with sampled values, we can create a larger LHS while keeping the previously sampled points. It is possible to work directly with the values when filling the empty strata, as is done in [3], but this approach destroys any specified correlation between the input variables. Working not with the values themselves but with their ranks allows one to keep the correlation between the variables by using the Iman and Conover rank correlation method ([4]). The different steps of the method are presented for doubling the size of the sample; the method can easily be extended to multiply the original size by any $n \ge 2$.
2.1. Creation of a Latin Hypercube Sample
The first step consists of creating an initial sample with the classical LHS method, whose size k is, for some reason, considered too small. Let $X_1 \sim N(0,1)$ and $X_2 \sim N(0,1)$ be two random variables with the associated rank correlation matrix:

$$ Corr = \begin{pmatrix} 1 & .7 \\ .7 & 1 \end{pmatrix} \quad \text{(Eq. 1)} $$
A sample of size k = 10 has been created, giving a sample matrix $S_1$ with corresponding rank matrix $RS_1$ (one column of values, and of ranks, per variable). The resulting rank correlation matrix is equal to

$$ Corr_{10\times10} = \begin{pmatrix} 1 & .782 \\ .782 & 1 \end{pmatrix} $$
So, on a 10x10 strata space, the values are positioned as shown in Figure 3.
Since it is a classical LHS, there is one value per row and one per column.
Figure 3. Position of the sample points in the strata space (ranks of X1 and X2 on a 10x10 grid)
2.2. Creation of one rank LHS sample
For the next step, one rank LHS of size k is created. The term "rank" is used to underline the fact that only the ranks of the values are determined, not the values themselves. The ranks are, however, sufficient to respect the correlation matrix between the variables, i.e., to apply correlation control. For instance, one can obtain the following rank matrix, $RS_2$, that respects the correlation matrix defined in step 1.
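Generating such a rank-only sample is straightforward: each variable is simply a random permutation of 1, ..., k. The sketch below is illustrative; the correlation-control step is deliberately omitted:

```python
import random

def rank_lhs(k, nvar, rng):
    """Rank-only LHS: one random permutation of 1..k per variable.
    (In the paper's method these permutations would additionally be
    rearranged, e.g. by Iman-Conover restricted pairing, to respect
    a target rank correlation matrix; omitted in this sketch.)"""
    return [[r + 1 for r in rng.sample(range(k), k)] for _ in range(nvar)]

rng = random.Random(1)
rs2 = rank_lhs(10, 2, rng)
print(sorted(rs2[0]))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```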
The rank matrix $RS_2$ contains one permutation of 1, ..., 10 per variable, and its rank correlation matrix is

$$ Corr_{10\times10} = \begin{pmatrix} 1 & .326 \\ .326 & 1 \end{pmatrix} $$
Because of the small sample size, the resulting correlation matrix is not very close to the target in this case. If we add these positions to the 10x10 strata space presented in Figure 3, we obtain the representation in Figure 4.
We now have two values per row and two per column: the exact values of sample 1 and the rank values of sample 2.
Figure 4. Representation of the two samples in the strata space
2.3. Splitting of each variable stratum into 2 equiprobable strata
The third step of the methodology consists of applying a 2k stratification pattern to the initial sample (i.e., splitting each stratum into 2 equiprobable strata). The initial sample (for which we have the exact values) will occupy, for each variable, k of these new 2k strata. In other words, within each old stratum, one new stratum is occupied by the initial sample and one is still available; the rank sample has not yet been assigned to the refined strata (Figure 2). For example, from the 10x10 strata space we create a 20x20 strata space (i.e., we split each row and each column into two equiprobable strata). It is important to note that, since we have the values for sample 1, $S_1$, it is possible to determine in which of the refined strata each point lies.
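Determining the refined stratum of an existing point only requires its CDF value. A hypothetical helper (the names are ours, not the paper's) for the n = 2 case:

```python
def refined_rank(p, old_rank, k):
    """Map a point with CDF value p and old rank old_rank (1-based, out of k)
    to its rank on the refined 2k grid: 2*old_rank - 1 if p falls in the
    lower half of its old stratum, 2*old_rank otherwise."""
    midpoint = (2 * old_rank - 1) / (2 * k)   # boundary between the two halves
    return 2 * old_rank - 1 if p < midpoint else 2 * old_rank

# A value with CDF 0.23 has rank 3 on a k = 10 grid (stratum (0.2, 0.3])
# and lies in the lower half of that stratum.
print(refined_rank(0.23, 3, 10))  # 5
print(refined_rank(0.27, 3, 10))  # 6
```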
Figure 5. Split of the strata space: the original 10x10 grid is refined to a 20x20 grid, showing the exact values of sample 1 and the rank values of sample 2
As we can see in Figure 5, there are now 20 rows and 20 columns, and exactly half of the rows and half of the columns are occupied by a point from sample 1 (strata numbers in black bold font in the figure). The rank matrix of sample 1 in this strata space, $RS_1[20\times20]$, records the refined rank of each point.
2.4. Determination of the available strata and association to the rank sample
In step 4, the available strata are determined. The rank sample will move from a sample of size k in a k-strata pattern to a sample of size k in a 2k-strata pattern.
For each square stratum of sample 2, 3 of the 4 possible refined positions are eliminated, since they correspond either to a column or to a row already used by sample 1. As shown in Figure 6, only one refined square stratum is then available for each point of sample 2. By taking note of the unused rank coordinates (non-bold font in Figure 6), we obtain the rank matrix $RS_2[20\times20]$ for sample 2 in the 20x20 strata space.
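Because sample 1 occupies exactly one refined stratum inside each old stratum, the free strata for sample 2 are simply the complement. A small sketch (hypothetical ranks; one variable shown):

```python
def free_strata(used_ranks, k):
    """Refined strata (1..2k) left free for sample 2, given the refined
    ranks used by sample 1 (one per old stratum)."""
    used = set(used_ranks)
    return [s for s in range(1, 2 * k + 1) if s not in used]

# Hypothetical refined ranks of sample 1, one from each pair {2r-1, 2r}:
s1 = [2, 3, 6, 7, 10, 11, 14, 15, 18, 20]
free = free_strata(s1, 10)
print(free)       # [1, 4, 5, 8, 9, 12, 13, 16, 17, 19]
print(len(free))  # 10: exactly one free refined stratum per old stratum
```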
Figure 6. Free strata for sample 2 in a 20x20 strata space, showing the exact values of sample 1, the rank values of sample 2, and the free strata on the refined grid
2.5. Determination of the values for the new sample
In step 5, we determine the values of the variables of the rank sample by randomly picking a value in each stratum previously determined. The rank matrix for sample 2 is known, and it respects:
- the LHS structure, once the two samples are combined;
- the correlation matrix.
The only remaining task is to choose randomly the values of sample 2 within the strata previously determined, which yields the sample matrix $S_2$.
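Drawing a value inside a chosen refined stratum is just inverse-CDF sampling restricted to that stratum. An illustrative helper (our naming, using the standard-library normal distribution):

```python
import random
from statistics import NormalDist

def value_in_stratum(stratum, n_strata, rng, dist=NormalDist()):
    """One value from the given stratum (1-based) of n_strata equiprobable
    strata of dist, drawn uniformly in probability."""
    u = (stratum - 1 + rng.random()) / n_strata
    return dist.inv_cdf(u)

rng = random.Random(7)
v = value_in_stratum(3, 20, rng)   # stratum covering CDF range (0.10, 0.15]
p = NormalDist().cdf(v)
print(0.10 <= p <= 0.15)  # True
```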
2.6. Grouping the results
In step 6, the samples are combined to create a sample of size 2k that respects the Latin Hypercube pattern. As shown in the next section, the resulting correlation matrix will be close to the mean of the two correlation matrices. For example, by merging the two samples, we obtain one LHS sample of size 20. The resulting correlation matrix is, in this case, equal to:

$$ Corr_{x_1,x_2}[20\times20] = \begin{pmatrix} 1 & .628 \\ .628 & 1 \end{pmatrix} $$
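The whole rank bookkeeping for n = 2 can be exercised end to end. The sketch below is a simplified illustration: the old sample's refined stratum is chosen at random rather than read off from its actual values, and correlation control is omitted. It verifies that the combined 2k ranks occupy every refined stratum exactly once:

```python
import random

def extend_lhs_ranks(old_ranks, rng):
    """Double one variable of a rank LHS of size k. Each old rank r becomes
    2r-1 or 2r (here chosen at random; in the paper's method it is fixed by
    the actual sampled value), and the new sample takes the refined stratum
    left free within each old stratum."""
    old_refined, new_refined = [], []
    for r in old_ranks:
        pick = rng.choice([2 * r - 1, 2 * r])
        old_refined.append(pick)
        new_refined.append(2 * r - 1 if pick == 2 * r else 2 * r)
    rng.shuffle(new_refined)   # re-pair the free strata across new sample elements
    return old_refined, new_refined

rng = random.Random(3)
k = 10
old = [r + 1 for r in rng.sample(range(k), k)]   # a rank LHS column of size k
ro, rn = extend_lhs_ranks(old, rng)
combined = sorted(ro + rn)
print(combined == list(range(1, 2 * k + 1)))  # True: one point per refined stratum
```

Note that the final shuffle would be replaced by a correlation-controlled pairing when a target rank correlation matrix must be respected.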
3. Boundaries of the resulting correlations
3.1. A simple case: correlations are not changed
Let X and Y be two variables. We are interested in the correlation between X and Y, so instead of working with the variables themselves, we work with their ranks. Let $RS_1$ and $RS_2$ be the ranks of the variable values for two samples of size k: $(x_i, y_i),\ i = 1, \ldots, k$ for $RS_1$ and $(\tilde{x}_i, \tilde{y}_i),\ i = 1, \ldots, k$ for $RS_2$ (cf. Eq. 2).
$$
RS_1 = \begin{pmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_k & y_k \end{pmatrix},
\qquad
RS_2 = \begin{pmatrix} \tilde{x}_1 & \tilde{y}_1 \\ \vdots & \vdots \\ \tilde{x}_k & \tilde{y}_k \end{pmatrix}
\quad \text{(Eq. 2)}
$$
Each element of RS1 and RS2 is one of the k first integers. This means that for RS1 and RS2, the mean of each column is the same, as well as the variance and the standard deviation. Thus, we have:
$$ m = \frac{k+1}{2}, \qquad s^2 = \frac{(k-1)(k+1)}{12} $$
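The rank mean $m = (k+1)/2$ and population variance $s^2 = (k-1)(k+1)/12$ are quick to check numerically (illustrative):

```python
k = 10
ranks = range(1, k + 1)
m = sum(ranks) / k                          # mean of the ranks 1..k
s2 = sum((r - m) ** 2 for r in ranks) / k   # population variance of the ranks
print(m, s2)  # 5.5 8.25, i.e. (k+1)/2 and (k-1)(k+1)/12
```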
The correlation coefficients (between X and Y) from the samples $RS_1$ and $RS_2$ are:

$$ corr_1 = \frac{\mathrm{cov}(x,y)}{\sigma_x\,\sigma_y}\,; \qquad corr_2 = \frac{\mathrm{cov}(\tilde{x},\tilde{y})}{\sigma_{\tilde{x}}\,\sigma_{\tilde{y}}} $$
If we combine the two samples to obtain a larger sample of size 2k, in the form

$$
RS = \begin{pmatrix} RS_1 \\ RS_2 \end{pmatrix}
= \begin{pmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_k & y_k \\ \tilde{x}_1 & \tilde{y}_1 \\ \vdots & \vdots \\ \tilde{x}_k & \tilde{y}_k \end{pmatrix}
$$

then the resulting mean is equal to

$$ \frac{1}{2}\left(\frac{k+1}{2} + \frac{k+1}{2}\right) = \frac{k+1}{2} $$
The resulting correlation coefficient is equal to:

$$
\mathrm{Corr} = \frac{\frac{1}{2k}\sum_{i=1}^{2k}(x_i - m)(y_i - m)}
{\left[\frac{1}{2k}\sum_{i=1}^{2k}(x_i - m)^2\right]^{1/2}\left[\frac{1}{2k}\sum_{i=1}^{2k}(y_i - m)^2\right]^{1/2}}
$$

Splitting each sum into its contributions from $RS_1$ (indices $1, \ldots, k$) and $RS_2$ (indices $k+1, \ldots, 2k$):

$$
\mathrm{Corr} = \frac{\frac{1}{2}\,\mathrm{cov}(x,y) + \frac{1}{2}\,\mathrm{cov}(\tilde{x},\tilde{y})}
{\left[\frac{1}{2}\mathrm{var}(x) + \frac{1}{2}\mathrm{var}(\tilde{x})\right]^{1/2}\left[\frac{1}{2}\mathrm{var}(y) + \frac{1}{2}\mathrm{var}(\tilde{y})\right]^{1/2}}
$$

Since $\mathrm{var}(x) = \mathrm{var}(\tilde{x}) = \mathrm{var}(y) = \mathrm{var}(\tilde{y}) = s^2$, the denominator reduces to $s^2 = \sigma_x\sigma_y = \sigma_{\tilde{x}}\sigma_{\tilde{y}}$, and

$$
\mathrm{Corr} = \frac{1}{2}\,\frac{\mathrm{cov}(x,y)}{\sigma_x\sigma_y} + \frac{1}{2}\,\frac{\mathrm{cov}(\tilde{x},\tilde{y})}{\sigma_{\tilde{x}}\sigma_{\tilde{y}}} = \frac{corr_1 + corr_2}{2} \quad \text{(Eq. 3)}
$$
We see that combining the two samples gives, for the correlation between the variables of the new sample, the mean of the two previous correlations. This result is easily extended to multiple variables.
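This averaging is exact, not approximate, and is easy to confirm numerically (an illustrative check; `pearson` is our helper name):

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length lists (applied here to ranks)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(11)
k = 50
perm = lambda: [i + 1 for i in rng.sample(range(k), k)]
x1, y1, x2, y2 = perm(), perm(), perm(), perm()   # two rank samples of size k

c1, c2 = pearson(x1, y1), pearson(x2, y2)
c12 = pearson(x1 + x2, y1 + y2)                   # concatenated sample of size 2k
print(abs(c12 - (c1 + c2) / 2) < 1e-12)  # True: exactly the average
```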
3.2. Generalization of the method when the rank correlations change
In our method, we not only combine the two samples, but we also make a slight modification to the rank values. When the original sample size is doubled, this change is equivalent to multiplying the previous rank value by two, and then subtracting 1 if the new, smaller stratum lies to the left of the old stratum. For instance, an old rank of 2 could become a rank of 4 (= 2*2) or 3 (= 2*2 - 1).
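Since both variables are rescaled by the same factor, halving all the shifted ranks leaves the correlation coefficient unchanged, as the following illustrative check confirms:

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(5)
k = 10
x = [i + 1 for i in rng.sample(range(k), k)]
y = [i + 1 for i in rng.sample(range(k), k)]
dx = [rng.randint(0, 1) for _ in range(k)]   # 1 = shifted to the left refined stratum
dy = [rng.randint(0, 1) for _ in range(k)]

doubled = pearson([2 * r - d for r, d in zip(x, dx)],
                  [2 * r - d for r, d in zip(y, dy)])
halved = pearson([r - d / 2 for r, d in zip(x, dx)],
                 [r - d / 2 for r, d in zip(y, dy)])
print(abs(doubled - halved) < 1e-12)  # True: correlation is scale-invariant
```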
For calculation purposes, instead of multiplying by two and subtracting 1 half of the time, we will keep the original value and subtract 0.5 half of the time. This is equivalent to dividing all the ranks by two and thus has no effect on the resulting correlation. Let A be one part of the new correlation, i.e.,
$$
A = \frac{\sum_{i=1}^{k}(x_i^* - m^*)(y_i^* - m^*)}
{\left[\sum_{i=1}^{k}(x_i^* - m^*)^2\right]^{1/2}\left[\sum_{i=1}^{k}(y_i^* - m^*)^2\right]^{1/2}}
$$

with

$$ m^* = m - \frac{1}{4} = \frac{k+1}{2} - \frac{1}{4} = \frac{2k+1}{4} $$

and

$$
x_i^* = \begin{cases} x_i & \text{if the point falls in the upper refined stratum} \\ x_i - \frac{1}{2} & \text{if the point falls in the lower refined stratum} \end{cases}
$$

and similarly for $y_i^*$, the $-\frac{1}{2}$ shift occurring for half of the sample elements, so that the mean of the $x_i^*$ (and of the $y_i^*$) is $m^*$. Here $[x_1, x_2, \ldots, x_k]$ is a permutation of $[1, 2, \ldots, k]$ and $[y_1, y_2, \ldots, y_k]$ is a permutation of $[1, 2, \ldots, k]$. We will compare the value of A to the value of another relation, called B, such that
$$
B = \frac{\sum_{i=1}^{k}(x_i - m)(y_i - m)}
{\left[\sum_{i=1}^{k}(x_i - m)^2\right]^{1/2}\left[\sum_{i=1}^{k}(y_i - m)^2\right]^{1/2}}
$$
We want to bound the difference between A and B. This leads to a maximization problem, which can be written as: find M such that

$$
M = \max \left| A - B \right|
= \max \left|
\frac{\sum_{i=1}^{k}(x_i^* - m^*)(y_i^* - m^*)}
{\left[\sum_{i=1}^{k}(x_i^* - m^*)^2\right]^{1/2}\left[\sum_{i=1}^{k}(y_i^* - m^*)^2\right]^{1/2}}
-
\frac{\sum_{i=1}^{k}(x_i - m)(y_i - m)}
{\left[\sum_{i=1}^{k}(x_i - m)^2\right]^{1/2}\left[\sum_{i=1}^{k}(y_i - m)^2\right]^{1/2}}
\right| \quad \text{(Eq. 4)}
$$
The denominator of B in Eq. 4 can be simplified as follows:

$$
\sum_{i=1}^{k}(x_i - m)^2 = \sum_{i=1}^{k} x_i^2 - 2m\sum_{i=1}^{k} x_i + k m^2
= \frac{k(k+1)(2k+1)}{6} - \frac{k(k+1)^2}{2} + \frac{k(k+1)^2}{4}
= \frac{k(k-1)(k+1)}{12} = \frac{k(k^2-1)}{12}
$$

In the same way, we have $\sum_{i=1}^{k}(y_i - m)^2 = \frac{k(k^2-1)}{12}$. This leads to the following equality:

$$
\left[\sum_{i=1}^{k}(x_i - m)^2\right]^{1/2}\left[\sum_{i=1}^{k}(y_i - m)^2\right]^{1/2} = \frac{k(k^2-1)}{12}
$$
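The identity $\sum_{i=1}^{k}(x_i - m)^2 = k(k^2-1)/12$, with $m = (k+1)/2$, can be checked directly for a small k (illustrative):

```python
k = 10
m = (k + 1) / 2
total = sum((i - m) ** 2 for i in range(1, k + 1))
print(total)                           # 82.5
print(total == k * (k * k - 1) / 12)   # True
```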
It is also possible to simplify the denominator of A in Eq. 4. Writing $x_i^* - m^* = (x_i - m) - \frac{1}{2}\left(\delta_i - \frac{1}{2}\right)$, where $\delta_i = 1$ if $x_i^* = x_i - \frac{1}{2}$ and $\delta_i = 0$ otherwise, we obtain

$$
\sum_{i=1}^{k}(x_i^* - m^*)^2 = \sum_{i=1}^{k}(x_i - m)^2 - \sum_{i=1}^{k}(x_i - m)\left(\delta_i - \tfrac{1}{2}\right) + \frac{k}{16}
$$

The cross term vanishes, the $-\frac{1}{2}$ shifts being distributed independently of the ranks. Hence,

$$
\sum_{i=1}^{k}(x_i^* - m^*)^2 = \frac{k(k^2-1)}{12} + \frac{k}{16} = \frac{k(4k^2-1)}{48}
$$

and, in the same way, $\sum_{i=1}^{k}(y_i^* - m^*)^2 = \frac{k(4k^2-1)}{48}$.
Therefore, Eq. 4 is equivalent to:

$$
M = \max\left|
\frac{48\sum_{i=1}^{k}(x_i^* - m^*)(y_i^* - m^*)}{k(4k^2-1)}
-
\frac{12\sum_{i=1}^{k}(x_i - m)(y_i - m)}{k(k^2-1)}
\right| \quad \text{(Eq. 5)}
$$

which can also be written as:

$$
M = \frac{12}{k(k^2-1)(4k^2-1)} \max\left| 4(k^2-1)\sum_{i=1}^{k}(x_i^* - m^*)(y_i^* - m^*) - (4k^2-1)\sum_{i=1}^{k}(x_i - m)(y_i - m) \right| \quad \text{(Eq. 6)}
$$
Let us call $\beta$ the sum $4(k^2-1)\sum_{i=1}^{k}(x_i^* - m^*)(y_i^* - m^*)$ and $\alpha$ the sum $(4k^2-1)\sum_{i=1}^{k}(x_i - m)(y_i - m)$. $\alpha$ and $\beta$ can respectively be written as:

$$
\alpha = (4k^2-1)\left(\sum_{i=1}^{k} x_i y_i - m\sum_{i=1}^{k}(x_i + y_i) + k m^2\right)
= (4k^2-1)\left(\sum_{i=1}^{k} x_i y_i - \frac{k(k+1)^2}{4}\right)
$$

$$
\beta = 4(k^2-1)\left(\sum_{i=1}^{k} x_i^* y_i^* - m^*\sum_{i=1}^{k}(x_i^* + y_i^*) + k (m^*)^2\right)
= 4(k^2-1)\left(\sum_{i=1}^{k} x_i^* y_i^* - \frac{k(2k+1)^2}{16}\right)
$$
This leads to the following equality:

$$
\beta - \alpha = 4(k^2-1)\sum_{i=1}^{k} x_i^* y_i^* - (4k^2-1)\sum_{i=1}^{k} x_i y_i
- \frac{k(k^2-1)(2k+1)^2}{4} + \frac{k(4k^2-1)(k+1)^2}{4} \quad \text{(Eq. 7)}
$$

The factorizations and equality

$$
4k^2 - 1 = (2k-1)(2k+1), \qquad k^2 - 1 = (k-1)(k+1), \qquad \sum_{i=1}^{k} x_i = \sum_{i=1}^{k} y_i = \frac{k(k+1)}{2}
$$

lead to the rewriting of Eq. 7 as

$$
\beta - \alpha = 4(k^2-1)\sum_{i=1}^{k} x_i^* y_i^* - (4k^2-1)\sum_{i=1}^{k} x_i y_i + \frac{k^2(k+1)(2k+1)}{2} \quad \text{(Eq. 8)}
$$
Without any assumptions on the distribution of $x_i$, $y_i$, $x_i^*$ and $y_i^*$, it is not possible to simplify Eq. 8 any further. The next step is to study separately the two cases $\beta - \alpha > 0$ and $\beta - \alpha \le 0$.
First case: $\beta - \alpha > 0$. To maximize Eq. 8 (which is equivalent to finding M, since the values are positive), we have to maximize the term

$$
4(k^2-1)\sum_{i=1}^{k} x_i^* y_i^* - (4k^2-1)\sum_{i=1}^{k} x_i y_i
$$

Since $x_i^* \le x_i$ and $y_i^* \le y_i$ with all values positive, we have $\sum x_i^* y_i^* \le \sum x_i y_i$, and a simple way to proceed is to maximize $\sum x_i^* y_i^*$ (obtained by taking $x_i^* = x_i$ and $y_i^* = y_i$) and to minimize $\sum x_i y_i$ (obtained for anti-ranked permutations, for which $\sum_{i=1}^{k} x_i y_i = \sum_{i=1}^{k} i(k+1-i) = \frac{k(k+1)(k+2)}{6}$). It is evident that this situation is impossible, since the $-\frac{1}{2}$ shifts must occur for half of the sample elements, but we can obtain an upper limit threshold that can never be reached:

$$
\beta - \alpha \le 4(k^2-1)\sum_{i=1}^{k} x_i y_i - (4k^2-1)\sum_{i=1}^{k} x_i y_i + \frac{k^2(k+1)(2k+1)}{2}
= -3\sum_{i=1}^{k} x_i y_i + \frac{k^2(k+1)(2k+1)}{2}
$$
Eq. 6 thus gives the bound:

$$
M \le \frac{12}{k(k^2-1)(4k^2-1)}\left( \frac{k^2(k+1)(2k+1)}{2} - \frac{k(k+1)(k+2)}{2} \right)
= \frac{12\,k(k+1)(k^2-1)}{k(k^2-1)(4k^2-1)}
= \frac{12(k+1)}{4k^2-1}
$$

For k = 10, this gives $M \le 132/399 \approx 0.33$; the bound decreases roughly as $3/k$ for large sample sizes.