Date post: | 18-Dec-2015 |
Category: |
Documents |
Upload: | luis-macedo |
View: | 6 times |
Download: | 0 times |
A Method for Extending the Size of Latin Hypercube Sample
Cedric J. SALLABERRY*, Jon C. HELTON+ Nuclear & Risk Technologies Center
Sandia National Laboratories, New Mexico PO Box 5800
Albuquerque, NM 87185-0776 USA
(*: [email protected]) (+: [email protected]) ABSTRACT Latin Hypercube Sampling (LHS) is widely used as sampling based method for probabilistic calculations. This method has some clear advantages over classical random sampling (RS) that derive from its efficient stratification properties. However, one of its limitations is that it is not possible to extend the size of an initial sample by simply adding new simulations, as this will lead to a loss of the efficient stratification associated with LHS. We describe a new method to extend the size of an LHS to n (>=2) times its original size while preserving both the LHS structure and any induced correlations between the input parameters. This method involves introducing a refined grid for the original sample and then filling in empty rows and columns with new data in a way that conserves both the LHS structure and any induced correlations. An estimate of the bounds of the resulting correlation between two variables is derived for n=2. This result shows that the final correlation is close to the average of the correlations from the original sample and the new sample used in the infilling of the empty rows and columns indicated above. Keywords: Latin Hypercube Sample, Correlation Control, Extension of Sample Size.
1. Introduction Developed in the 1970s, the Latin Hypercube Sampling (LHS) is now a widely used sampling method in probabilistic analysis ([1], [2]). This method has, indeed, many advantages, compared to the classical Random Sampling (RS) method (cf. [2] for a comparison of both methods). However, LHS suffers from some drawbacks. One of them is that it is not possible to extend the size of the sample by incrementally adding new sample elements to the original sample because this will destroy the stratification. When the sample size is large enough (e.g., several thousand sample elements), this will not greatly effect the results. However, when the simulations are expensive in terms of calculation time, one may be limited to a few (e.g., 50 ) simulations. Moreover, making another LHS (of same size, for example) and adding this sample to the previous one will lead to a different global result. The next section (Sect. 2) presents a methodology for creating a larger LHS from an existing LHS that maintains the LHS stratification and allows the calculation of an unbiased estimate of the mean and the Cumulative Distribution Function (CDF). A simple example is also presented. In Sect. 3, we derive the resulting correlation coefficient between two samples variables, which provides a way to bound this coefficient with respect to the correlation coefficient of the initial sample. Then, in Sect. 4, a Monte Carlo method is used to determine many resulting correlation coefficients and check the validity of the estimated boundaries. The last section (Sect. 5)
presents some concluding remarks on the usefulness of this extension and on the possible improvement of the method.
2. Methodology The methodology is rather simple and is based on the following observation. When a LHS method of size k is generated, k equiprobable strata are defined for each variable (i.e., kSSS ,,, 21 L ). In each stratum, one samples a single random value (Figure 1 ).
Si Si+1 Figure 1. Position of the values in strata Si and Si+1 in a LHS of size k
If we apply a n*k stratified pattern on the sample, i.e. if we split each stratum into n smaller (equiprobable strata), then the original sampled value will be in only one of the new strata, and n-1 strata will have no sampled value (Figure 2).
Si Si+1 Figure 2. Position of the values in a LHS of size nk
If we fill the available strata with samples values, we can create a larger LHS while by keeping the previously sampled points. It is possible to work directly with the values for filling the empty strata, as it is done in [3], but this approach destroys any specified correlation between the input variables. By working not with the values themselves but with the rank of the values allows one to keep the correlation between the variables by using the Iman and Conover rank correlation method ([4]). The different steps of the method are presented for increasing the size of a sample by twice its original size. This method can be easily extended for multiplying the original size by ,n 2n
2.1. Creation of a Latin Hypercube Sample The first step consists of creating an initial sample with the classical LHS method, which for some reason, is considered to have a sample size k too small. Let )1,0(~1 NX and )1,0(~2 NX be two random variables with the associated rank correlation matrix:
=17.7.1
Corr (Eq. 1)
A sample of size k=10 has been created, and the resulting sample matrix is
=
338.2785.2661.146.450.554.214.267.659.1736.1657.196.1208.1031.1985.741.
053.365.455.136.
1S With corresponding rank matrix
=
110864354
1013992286775
1RS
The resulting rank correlation matrix is equal to
=1782.782.11
1010xCorr
So, on a 10x10 strata space, the values are positioned as follow:
X1
X2
1 2 3 4 5 6 7 8 9 1012
34
5
67
89
10Since its a classical LHS,there is one value per row and one per column
Rank
Rank
Figure 3. Position of the sample points in the strata space
2.2. Creation of one rank LHS sample
For the next step, one rank LHS of size k is created. The term rank is used to underline the fact that only the ranks of the values are determined, and not the values themselves. However, the rank values are sufficient to respect the correlation matrix between the variables. The second step of the methodology consists of determining a rank-LHS matrix of size N, for which only the rank of the values are calculated, but not the values themselves. Working with the rank is enough to use correlation control. For instance, one can obtain the following rank-matrix that respects the correlation matrix defined in step1.
=
3683
107244872611995510
2RS
= 1326.326.12
1010Corr
Because of the small sample size, the resulting correlation matrix is not very good in this case. Then, if we add these positions into the 10x10 strata space presented in Figure 3, we obtain the following representation.
X1
X2
1 2 3 4 5 6 7 8 9 101234
5
67
89
10
We now have two values per row and two per column
Exact values(sample 1)
Rank values(sample 2)
Rank
Rank
Figure 4. Representation of the two samples in the strata space
2.3. Splitting of each variable stratum into 2 equiprobable strata The third step of the methodology consists of applying a 2k stratification pattern to the initial sample (i.e. splitting each stratum into 2 equiprobable strata). The initial sample (for which we have the exact value) will occupy, for each variable, k of these new 2k strata. In other words, for each old stratum, we have 1 stratum occupied by the initial sample, 1 stratum which is available and 1 position not affected for the 1 rank sample (Figure 2). For example, from a 10x10 strata space, we created a 20x20 strata space (i.e. we split each row and each column into two equiprobable strata). Its important to note that since we have the values for the sample1, S1, it is possible to determine in which strata each point lies.
X1
X2
Exact values(sample 1)
Rank values(sample 2)
Original grid(10x10)New grid(20x20)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
12
34
56
78
910
1112
1314
1617
181920
Rank
Rank Figure 5. Split of the strata space
As we can see in Figure 5, there are now 20 rows and 20 columns and exactly half of the rows and half of the columns are occupied by a point from sample1 (strata number in black bold font). Thus, the rank matrix for the sample1 will be in this strata space:
=
12015127698201618
184416
11131410
]2020[1RS
2.4. Determination of the available strata and association to the rank sample In step 4, the available strata are determined. The rank sample will move from a sample of size k in a k-strata pattern to a sample of size k in a 2k pattern.
For example, for each square stratum of sample 2, 3 of the 4 possible values are eliminated since they correspond either to a column, or to a row used by sample 1. Then, as shown in Figure 6, only one square stratum is available for sample 2. By taking note of the unused rank coordinates (non bold font in Figure 6), we can find the values for the rank sample matrix 2 in the 20x20 strata space:
=
511165101437815
133122217
1791019
]2020[2RS
X1
X2
1
2
3
4
5
6
7
8
9
10
Exact values(sample 1)
Rank values(sample 2)
Free strata
Original grid(10x10)New grid(20x20)
11
12
13
14
15
16
17
18
19
20
12
34
56
78
910
1112
1314
1516
1718
1920
Rank
Rank Figure 6: free strata for sample 2 in a 20x20 strata space.
2.5. Determination of the values for the new sample In step 5, we determine the values of the variables of the rank sample by randomly picking a value in the strata previously determined. For example, the rank matrix for sample 2 is known and it respects:
The LHS feature if we combine the two samples
The correlation matrix The only thing we have to do is then to choose randomly the values of the sample2 in the strata previously determined. For example, it could be:
=
539.099.811.811.081.487.251.1450.
328.524.3125.143.12098.465.1
398.1941.024.1187.08.455.1
2S
2.6. Grouping the results In step 6, the samples are combined to create a sample of size 2k, which respects the Latin Hypercube pattern. As shown in the next section, the resulting correlation matrix will be close to the mean of the two correlation matrices. For example, by merging the two samples, we obtain one LHS sample. The resulting correlation matrix is, in this case, equal to:
=1628.628.1]2020[
2,1 xxCorr
3. Boundaries of the resulting correlations
3.1. A simple case: correlations are not changed Let X and Y be two variables. Here, we are interested in the correlation between X and Y . So, instead of working with the variables, we will work with the rank of the variables. Let 1RS and 2RS be the ranks of variable values for two samples of size k, respectively kiii yx ,...,1),( = for 1RS and kiii yx ,...,1)~,~( = for 2RS (cf. Eq. 2).
=
kk yx
yxRS MM
11
1 ,
=
kk yx
yxRS
~~
~~11
2 MM (Eq. 2)
Each element of RS1 and RS2 is one of the k first integers. This means that for RS1 and RS2, the mean of each column is the same, as well as the variance and the standard deviation. Thus, we have:
21+= km
12)1)(1( += kks
The correlation coefficients (between X and Y) from the samples 1RS and 2RS are:
yx
yxcorr ),cov(
1 = ; yx
yxcorr~~
2)~,~cov(
=
If we combine the two samples, for obtaining a larger sample of size 2k, in the form:
=
=++
+
kk
kk
kk
kk
kk
yx
yxyx
yx
yx
yxyx
yx
RS
22
11
11
11
11
~~
~~
MM
MM
MM
MM
The resulting mean is equal to 2
12
121
21
21 +=
++
+ kkk
The resulting correlation matrix is equal to:
From 2RS
From 1RS
( ) ( )
( ) ( ) ( ) ( )
( ) ( ) ( ) ( )
( ) ( )
( ) ( ) ( ) ( )
( ) ( ) ( ) ( )corrcorr
yxyx
yxyx
yyxxyx
yyxxyx
yyxxyxyx
mymyk
mxmxk
mymxk
mymxk
mymyk
mxmxk
mymxk
mymxk
myk
mxk
mymxkCorr
k
ii
k
ii
k
ii
k
ii
k
iii
k
iii
k
kii
k
ii
k
kii
k
ii
k
kiii
k
iii
k
ii
k
ii
k
iii
21
21
)~var(2.)~var(2)~,~cov(
)var(2.)var(2),cov(
)~var()var(.)~var()var()~,~cov(
)~var()var(.)~var()var(),cov(
)~var()var(.)~var()var()~,~cov(),cov(
~1.~1
)~)(~(1))((1
1.1
))((1))((1
21.
21
))((21
2/12/12/12/1
2/12/12/12/1
2/12/1
2/1
1
2
1
22/1
1
2
1
2
11
2/12
1
2
1
22/12
1
2
1
2
2
11
2/12
1
22/12
1
2
2
1
+=
+=
+++++=
+++=
+
+
+=
+
+
+=
=
====
==
+==+==
+==
==
=+
(Eq. 3)
We see that combining two samples will give for the correlation between the variables of the new sample, a mean of the two previous correlations. This method is easily extended to multiple variables.
3.2. Generalization of the method when the rank correlations change In our method, we not only combine the two samples, but we also make a slight modification to the rank values. When the original sample size is doubled, this change is equivalent to multiply by two the previous rank value, and to subtract 1 if the new smaller stratum is on the left of the old stratum. For instance, an old rank of 2, could become a rank of 4 (=2*2) or 3 (=2*2-1).
For calculation purposes, instead of multiplying by two and subtracting 1 half of the time, we will keep the original value and subtract 0.5 half of the time. This is equivalent to dividing all the ranks by two, and thus, has no effect on the resulting correlation. Let A be one part of the new correlation, i.e.,
( )( )( ) ( ) ( ) ( ) 2/1
1
2*
1
2*2/1
1
2*
1
2*
1
**
+
+
=
====
=k
iic
k
ii
k
iic
k
ii
k
iii
mymymxmx
mymxA
with
41
21
21* =
+= mkm
=21
i
i
i x
xx and
==
=iii
iii
ic
xxifx
xxifxx
21
21
=21
i
i
i y
yy and
==
=iii
iii
ic
yyify
yyifyy
21
21
[ ]kxxx ,,, 21 L is a permutation of [ ]k,,2,1 L [ ]kyyy ,,, 21 L is a permutation of [ ]k,,2,1 L We will try to compare the value of A to the value of another relation, called B, such that
( )( )( ) ( ) 2/1
1
22/1
1
2
1
21
=
==
=k
ii
k
ii
k
iii
mymx
mymxB
We want to bound the difference between A and B. This leads to a maximization problem, which can be written as:
Find M, such that:
( )( )( ) ( )
( )( )( ) ( ) ( ) ( ) 2/1
1
2*
1
2*2/1
1
2*
1
2*
1
**
2/1
1
22/1
1
2
1
21
max
max
+
+
=
=
====
=
==
=
k
iic
k
ii
k
iic
k
ii
k
iii
k
ii
k
ii
k
iii
mymymxmx
mymx
mymx
mymx
ABM
(Eq. 4)
The denominator of B of Eq. 4 can be simplified as follows:
( )( )
( )( )( ) ( )( )
121
61
21
63324
21
21)1(
312
21
41
21
212
6)12)(1(
12
2
21 1
2
1
2
1
2
=
+=
++=
+++++=
++++++=
+= = ===
kk
kkk
kkkk
kkkkk
kkkkkkkk
mxmxmxk
i
k
ii
k
ii
k
ii
In the same way, we have ( ) ( )=
=k
ii
kkmy1
22
121
This leads to the following equality:
( ) ( ) ( )12
122/1
1
22/1
1
2 =
==
kkmymxk
ii
k
ii
It is also possible to simplify the denominator of A of Eq. 4 by using the following equality:
= =
=
+
+k
i
k
iiici mxmxmx
1
2
1
222
21
21
41
41
where [ ]kxxx ,,, 21 L is a permutation of [ ]k2,,2,1 L and 212 += km
Furthermore,
( ) ( )( )( )
2414
1214
2
12142
41
41
21
21
2
2
22
1
22
1
2
=
=
==
==
kk
kk
kkmxmxk
ii
k
ii
Hence, the denominator of A is: ( )
=
=
+
+k
iici
kkmymy1
222
2414
41
41
Therefore, Eq. 4 is equivalent to:
( )( ) ( )( )( )
2414
6)1(
max 21
**
21
===
kk
mymx
kk
mymxM
k
iii
k
iii
(Eq. 5)
Which can also be written as:
( )( ) ( ) ( )( ) ( ) ( )( )
= ==
k
iii
k
iii mymxkmymxk
kkkM
1
**2
1
222
141
411
6max (Eq. 6)
Let us call the sum ( ) ( )( )=
k
iii mymxk
1
2
41 and the sum ( ) ( )( )
=
k
iii mymxk
1
**2 1
and can respectively be written as: ( ) ( )
( )( ) ( ) ( ) ( ) kkkyxkkyxk
myxmyxk
myxmyxk
k
ii
k
ii
k
iii
k
i
k
iii
k
iii
k
iiiii
41.4
12
1.41
41
1)(41
)(41
22
11
2
1
2
1
2
11
2
1
22
++
++=
++=
++=
===
===
=
( ) ( ) ( )( ) ( ) ( )( ) ( ) ( )( ) kkkyxkkyxk
myxmyxk
myxmyxk
k
ii
k
ii
k
iii
k
i
k
iii
k
iii
k
iiiii
42
11
22
1.11
141)(4
11
41)(4
11
2
2
11
2
1
2
1
2
11
2
1
22
++
++=
++=
++=
===
===
=
Which leads to the following equality:
( ) ( ) ( ) ( )( ) ( )( )( ) ( )( ) kkkkkk
yxk
kyxkk
yxkyxk
k
ii
k
ii
k
ii
k
ii
k
iii
k
iii
42
11
41
41
2
21
12
14
1
141
2
22
2
11
2
11
2
1
2
1
2
+++
+++
++
=
====
==
(Eq. 7)
The following factorizations and equality ( ) ( )( )2121412 += kkk ( ) ( )( )1112 += kkk
= =
=+=k
i
k
iii y
kkx1 12
1
leads to the rewriting of Eq. 7 as
( ) ( ) ( ) ( ) ( )( )( )( )( )2121121
121
2114
1
2
1 11
22
1
2
++
+
++= = ===
kkkk
yxkkyxkyxkk
i
k
ii
k
iiii
k
iii
(Eq. 8)
Without any assumptions on the distribution of ix , iy , ix and iy , its not possible to simplify Eq. 8 any further. The next step will be to study separately the two cases where 0> and 0
First case: 0> To maximize Eq. 8 (which is equivalent to finding M since the values are positive), we have to maximize the term
( ) ( ) ( )===
+
+n
iii
n
ii
n
ii yxkyxkk
1
2
11
2 1121
21
Since ( ) 0121
21 2 >
+ kk and ( ) 012 >k , for k>1 a simple way is to maximize + ==
n
ii
n
ii yx
11
and to
minimize ( )=
n
iii yx
1
.
This maximization is obtained by taking ii xx = and ii yy = ; The maximization is obtained by taking
21 = ii xx and 21 = ii yy . It is evident that this situation is impossible, but we can obtain an upper limit
threshold that can never be reached. ( ) ( ) ( ) ( ) ( ) ( ) ( )
( )( )( ) ( )( )( )21211211211214
112
12114
1
22
1
222
1
2
+++++
++
Eq. 6 is thus equivalent to:
( )( ) ( )( )( )
1466
411.
46
41)1)(1()1)(1)(1(
46
114411
6
2
2
2
222