A Method for Extending the Size of Latin Hypercube Sample

    Cedric J. SALLABERRY*, Jon C. HELTON+ Nuclear & Risk Technologies Center

    Sandia National Laboratories, New Mexico PO Box 5800

    Albuquerque, NM 87185-0776 USA

(*: [email protected]) (+: [email protected])

ABSTRACT

Latin Hypercube Sampling (LHS) is widely used as a sampling-based method for probabilistic calculations. This method has some clear advantages over classical random sampling (RS) that derive from its efficient stratification properties. However, one of its limitations is that it is not possible to extend the size of an initial sample by simply adding new simulations, as this leads to a loss of the efficient stratification associated with LHS. We describe a new method to extend the size of an LHS to n (>= 2) times its original size while preserving both the LHS structure and any induced correlations between the input parameters. This method involves introducing a refined grid for the original sample and then filling in the empty rows and columns with new data in a way that conserves both the LHS structure and any induced correlations. An estimate of the bounds of the resulting correlation between two variables is derived for n = 2. This result shows that the final correlation is close to the average of the correlations from the original sample and the new sample used in the infilling of the empty rows and columns indicated above.

Keywords: Latin Hypercube Sample, Correlation Control, Extension of Sample Size.

1. Introduction

Developed in the 1970s, Latin Hypercube Sampling (LHS) is now a widely used sampling method in probabilistic analysis ([1], [2]). This method has many advantages compared to the classical Random Sampling (RS) method (cf. [2] for a comparison of the two methods). However, LHS suffers from some drawbacks. One of them is that it is not possible to extend the size of the sample by incrementally adding new sample elements to the original sample, because this destroys the stratification. When the sample size is large enough (e.g., several thousand sample elements), this does not greatly affect the results. However, when the simulations are expensive in terms of calculation time, one may be limited to a few (e.g., 50) simulations. Moreover, making another LHS (of the same size, for example) and adding this sample to the previous one will lead to a different global result. The next section (Sect. 2) presents a methodology for creating a larger LHS from an existing LHS that maintains the LHS stratification and allows the calculation of an unbiased estimate of the mean and the Cumulative Distribution Function (CDF). A simple example is also presented. In Sect. 3, we derive the resulting correlation coefficient between two sample variables, which provides a way to bound this coefficient with respect to the correlation coefficient of the initial sample. Then, in Sect. 4, a Monte Carlo method is used to determine many resulting correlation coefficients and check the validity of the estimated bounds. The last section (Sect. 5)

presents some concluding remarks on the usefulness of this extension and on possible improvements of the method.

2. Methodology

The methodology is rather simple and is based on the following observation. When an LHS of size k is generated, k equiprobable strata are defined for each variable (i.e., \(S_1, S_2, \ldots, S_k\)). In each stratum, a single random value is sampled (Figure 1).

Figure 1. Position of the values in strata \(S_i\) and \(S_{i+1}\) in an LHS of size k
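The construction just described is easy to sketch in a few lines of Python with NumPy. This is only an illustrative implementation of plain LHS on the unit hypercube, with our own function name, not the code used in the paper.

```python
import numpy as np

def lhs(k, n_vars, seed=None):
    """Plain Latin Hypercube Sample of size k on [0, 1]^n_vars:
    each variable receives exactly one point in each of its k equiprobable strata."""
    rng = np.random.default_rng(seed)
    sample = np.empty((k, n_vars))
    for j in range(n_vars):
        # one uniform draw inside each stratum [i/k, (i+1)/k), then shuffle the strata
        sample[:, j] = (rng.permutation(k) + rng.random(k)) / k
    return sample

u = lhs(10, 2, seed=42)
print(np.floor(u * 10).astype(int))  # stratum indices: each of 0..9 appears once per column
```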

If we apply an n*k stratification pattern to the sample, i.e., if we split each stratum into n smaller (equiprobable) strata, then the original sampled value will lie in only one of the new strata, and n-1 strata will have no sampled value (Figure 2).

Figure 2. Position of the values in an LHS of size nk

If we fill the available strata with sampled values, we can create a larger LHS while keeping the previously sampled points. It is possible to work directly with the values when filling the empty strata, as is done in [3], but this approach destroys any specified correlation between the input variables. Working not with the values themselves but with their ranks allows one to keep the correlation between the variables by using the Iman and Conover rank correlation method [4]. The different steps of the method are presented for doubling the original sample size; the method is easily extended to multiplying the original size by any n >= 2.

2.1. Creation of a Latin Hypercube Sample

The first step consists of creating an initial sample with the classical LHS method; this sample, of size k, is for some reason considered too small. Let \(X_1 \sim N(0,1)\) and \(X_2 \sim N(0,1)\) be two random variables with the associated rank correlation matrix:

\[ Corr = \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \end{pmatrix} \qquad \text{(Eq. 1)} \]

A sample of size k = 10 has been created. The resulting sample matrix \(S_1\) is the 10×2 matrix of sampled values of \(X_1\) and \(X_2\); its corresponding rank matrix \(RS_1\) contains, in each column, a permutation of the integers 1 to 10.

The resulting rank correlation matrix is equal to
\[ Corr^{[10\times10]} = \begin{pmatrix} 1 & 0.782 \\ 0.782 & 1 \end{pmatrix} \]

So, on a 10×10 strata space, the values are positioned as follows:

Figure 3. Position of the sample points in the strata space (rank of X1 versus rank of X2 on the 10×10 grid; since this is a classical LHS, there is one value per row and one per column)
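To make the example of Sect. 2.1 concrete, the sketch below (our own helper, with hypothetical names) builds a k = 10 LHS for two N(0, 1) variables and extracts the analogues of S1, RS1 and the rank correlation. Note that without the restricted-pairing step of [4], which is sketched after Figure 4, the rank correlation will generally not be close to the 0.7 target of Eq. 1.

```python
import numpy as np
from scipy import stats

def initial_sample(k=10, seed=0):
    """Analogue of S1/RS1: LHS uniforms mapped through the N(0,1) inverse CDF,
    plus the per-column ranks and the resulting Spearman rank correlation."""
    rng = np.random.default_rng(seed)
    u = np.column_stack([(rng.permutation(k) + rng.random(k)) / k for _ in range(2)])
    s1 = stats.norm.ppf(u)                                   # sampled values of X1, X2
    rs1 = np.column_stack([stats.rankdata(s1[:, j]).astype(int) for j in range(2)])
    return s1, rs1, stats.spearmanr(s1[:, 0], s1[:, 1])[0]

s1, rs1, rho = initial_sample()
print(rs1.T, round(rho, 3))
```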

    2.2. Creation of one rank LHS sample

For the next step, one rank LHS of size k is created. The term "rank" is used to underline the fact that only the ranks of the values are determined, not the values themselves; the ranks are, however, sufficient to respect the correlation matrix between the variables. In other words, the second step of the methodology consists of determining a rank-LHS matrix of size k for which only the ranks of the values are calculated. Working with the ranks is enough to apply correlation control. For instance, one can obtain the following rank matrix that respects the correlation matrix defined in step 1:

\(RS_2\): a 10×2 rank matrix (each column a permutation of 1 to 10), with resulting rank correlation matrix
\[ Corr_2^{[10\times10]} = \begin{pmatrix} 1 & 0.326 \\ 0.326 & 1 \end{pmatrix} \]

Because of the small sample size, the resulting correlation matrix is not very good in this case. If we then add these positions to the 10×10 strata space presented in Figure 3, we obtain the following representation.

Figure 4. Representation of the two samples in the strata space (exact values of sample 1 and rank values of sample 2 on the 10×10 grid; there are now two values per row and two per column)
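Step 2 relies on the rank correlation control of Iman and Conover [4]. The following is a compact, two-variable sketch of that restricted-pairing idea, under our own naming; it returns a k×2 rank matrix whose rank correlation only approximates the target, just as the small-sample example in the text shows.

```python
import numpy as np
from scipy import stats

def rank_lhs(k, target_rho, seed=0):
    """Iman-Conover-style restricted pairing (two variables): build a (k, 2)
    rank matrix whose rank correlation approximates target_rho."""
    rng = np.random.default_rng(seed)
    target = np.array([[1.0, target_rho], [target_rho, 1.0]])
    # van der Waerden scores, shuffled independently for each column
    scores = stats.norm.ppf(np.arange(1, k + 1) / (k + 1))
    m = np.column_stack([rng.permutation(scores) for _ in range(2)])
    q = np.linalg.cholesky(np.corrcoef(m, rowvar=False))   # factor of the actual correlation
    p = np.linalg.cholesky(target)                         # factor of the desired correlation
    t = m @ np.linalg.inv(q).T @ p.T                       # scores with correlation ~ target
    return np.column_stack([stats.rankdata(t[:, j]).astype(int) for j in range(2)])

rs2 = rank_lhs(10, 0.7)
print(np.corrcoef(rs2, rowvar=False)[0, 1])   # rank correlation, only roughly 0.7 for k = 10
```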

2.3. Splitting of each variable stratum into 2 equiprobable strata

The third step of the methodology consists of applying a 2k stratification pattern to the initial sample (i.e., splitting each stratum into 2 equiprobable strata). The initial sample (for which we have the exact values) occupies, for each variable, k of these new 2k strata. In other words, within each old stratum one of the two new strata is occupied by the initial sample, one is available, and the position of the rank sample has not yet been assigned (Figure 2). For example, from the 10×10 strata space we create a 20×20 strata space (i.e., we split each row and each column into two equiprobable strata). It is important to note that, since we have the values of sample 1, S1, it is possible to determine in which of the new strata each point lies.

Figure 5. Split of the strata space (exact values of sample 1 and rank values of sample 2, shown on both the original 10×10 grid and the new 20×20 grid)

As we can see in Figure 5, there are now 20 rows and 20 columns, and exactly half of the rows and half of the columns are occupied by a point from sample 1 (strata numbers in black bold font). Thus, the rank matrix of sample 1 expressed in this strata space is:

\(RS_1^{[20\times20]}\): the 10×2 matrix of sample-1 ranks on the refined 20×20 grid (the bold strata numbers of Figure 5).
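This placement of the original points on the refined grid can be written compactly when the marginal CDF is known (standard normal in the example): the refined rank of a value with old rank r is always 2r-1 or 2r, depending on which half of its old stratum the value falls in. A minimal sketch, assuming N(0, 1) margins and our own function name:

```python
import numpy as np
from scipy import stats

def refine_ranks(values, n=2):
    """Map the original sample values to ranks on the n*k refined grid
    (n = 2 doubles the size).  Assumes standard normal margins, as in the example."""
    k = values.shape[0]
    u = stats.norm.cdf(values)                 # CDF position of each value in (0, 1)
    return np.ceil(n * k * u).astype(int)      # fine stratum index in 1..n*k

# rs1_fine = refine_ranks(s1)   # analogue of RS1 on the 20x20 grid (s1 as in the earlier sketch)
```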

2.4. Determination of the available strata and association to the rank sample

In step 4, the available strata are determined. The rank sample moves from a sample of size k on a k-strata pattern to a sample of size k on a 2k-strata pattern.

For example, for each square stratum containing a point of sample 2, three of the four possible fine strata are eliminated, since they correspond either to a column or to a row already used by sample 1. Then, as shown in Figure 6, only one square stratum remains available for sample 2. By taking note of the unused rank coordinates (non-bold font in Figure 6), we can build the rank matrix of sample 2 in the 20×20 strata space:

\(RS_2^{[20\times20]}\): the 10×2 matrix of sample-2 ranks on the refined 20×20 grid (the non-bold, free strata numbers of Figure 6).

Figure 6. Free strata for sample 2 in the 20×20 strata space (exact values of sample 1, rank values of sample 2 and free strata, shown on both the original 10×10 grid and the new 20×20 grid)
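Because sample 1 occupies exactly one of the two fine strata inside every coarse stratum of every variable, the free stratum of each sample-2 point can be found without any search: it is simply the other half of its coarse stratum. A small sketch of this step, with hypothetical helper names:

```python
import numpy as np

def fill_free_strata(rs1_fine, rs2_coarse):
    """For each variable, place each sample-2 point in the fine stratum of its
    coarse stratum that sample 1 left empty (rs1_fine: (k, d) ranks on the 2k grid,
    rs2_coarse: (k, d) ranks on the k grid)."""
    k, d = rs2_coarse.shape
    rs2_fine = np.empty_like(rs2_coarse)
    for j in range(d):
        taken = np.zeros(k + 1, dtype=int)
        coarse = (rs1_fine[:, j] + 1) // 2          # coarse stratum of each sample-1 point
        taken[coarse] = rs1_fine[:, j]              # fine stratum used there by sample 1
        r = rs2_coarse[:, j]
        rs2_fine[:, j] = (2 * r - 1) + 2 * r - taken[r]   # the complementary fine stratum
    return rs2_fine

# rs2_fine = fill_free_strata(rs1_fine, rs2)   # analogue of RS2 on the 20x20 grid
```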

2.5. Determination of the values for the new sample

In step 5, we determine the values of the variables of the rank sample by randomly picking a value inside each of the strata determined previously. The rank matrix for sample 2 is known, and it respects:

  • the LHS structure, if we combine the two samples;
  • the correlation matrix.

The only thing left to do is to choose the values of sample 2 at random within the strata determined previously. For example, it could be:

\(S_2\): the 10×2 matrix of values for sample 2, each value drawn at random inside its assigned stratum of the 20×20 grid.
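A sketch of step 5 for the example's N(0, 1) margins (function name ours): each value is drawn uniformly inside its assigned fine stratum of the 2k grid and then mapped back through the inverse CDF.

```python
import numpy as np
from scipy import stats

def values_from_fine_ranks(fine_ranks, n_strata, seed=0):
    """Draw one value uniformly inside each assigned fine stratum of a grid with
    n_strata equiprobable strata per variable, then map to N(0,1) values."""
    rng = np.random.default_rng(seed)
    u = (fine_ranks - 1 + rng.random(fine_ranks.shape)) / n_strata
    return stats.norm.ppf(u)

# s2 = values_from_fine_ranks(rs2_fine, 20)   # analogue of S2 on the 20x20 grid
```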

2.6. Grouping the results

In step 6, the samples are combined to create a sample of size 2k that respects the Latin Hypercube pattern. As shown in the next section, the resulting correlation matrix will be close to the mean of the two correlation matrices. For example, by merging the two samples, we obtain one LHS of size 20; the resulting correlation matrix is, in this case, equal to:

\[ Corr_{x_1,x_2}^{[20\times20]} = \begin{pmatrix} 1 & 0.628 \\ 0.628 & 1 \end{pmatrix} \]
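The two claims of this step can be checked directly on the merged sample: one point per fine stratum for each variable, and a rank correlation close to the average of the two individual rank correlations. A stand-alone sketch, assuming N(0, 1) margins and our own naming:

```python
import numpy as np
from scipy import stats

def check_merged(s1, s2):
    """Verify the 2k-point LHS property of the merged sample (one value per fine
    stratum of each N(0,1) margin) and return its Spearman rank correlation."""
    s = np.vstack([s1, s2])
    n, d = s.shape
    for j in range(d):
        strata = np.ceil(n * stats.norm.cdf(s[:, j])).astype(int)  # fine stratum of each value
        assert sorted(strata) == list(range(1, n + 1)), "a fine stratum is used twice"
    return stats.spearmanr(s[:, 0], s[:, 1])[0]

# print(check_merged(s1, s2))   # close to the average of the two rank correlations
```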

    3. Boundaries of the resulting correlations

3.1. A simple case: correlations are not changed

Let X and Y be two variables. Here we are interested in the correlation between X and Y, so instead of working with the variables themselves we work with their ranks. Let \(RS_1\) and \(RS_2\) be the rank matrices of the variable values for two samples of size k, with rows \((x_i, y_i)\), \(i = 1, \ldots, k\), for \(RS_1\) and \((\tilde{x}_i, \tilde{y}_i)\), \(i = 1, \ldots, k\), for \(RS_2\) (cf. Eq. 2).

\[ RS_1 = \begin{pmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_k & y_k \end{pmatrix}, \qquad RS_2 = \begin{pmatrix} \tilde{x}_1 & \tilde{y}_1 \\ \vdots & \vdots \\ \tilde{x}_k & \tilde{y}_k \end{pmatrix} \qquad \text{(Eq. 2)} \]

    Each element of RS1 and RS2 is one of the k first integers. This means that for RS1 and RS2, the mean of each column is the same, as well as the variance and the standard deviation. Thus, we have:

\[ m = \frac{k+1}{2}\,, \qquad s = \sqrt{\frac{(k-1)(k+1)}{12}} \]
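A two-line numerical check of these rank statistics, for k = 10:

```python
import numpy as np

k, ranks = 10, np.arange(1, 11)
print(ranks.mean() == (k + 1) / 2)                               # True: the mean is 5.5
print(np.isclose(ranks.std(), np.sqrt((k - 1) * (k + 1) / 12)))  # True: the std is ~2.872
```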

The correlation coefficients (between X and Y) from the samples \(RS_1\) and \(RS_2\) are:
\[ corr_1 = \frac{\operatorname{cov}(x, y)}{\sigma_x\,\sigma_y}\,, \qquad corr_2 = \frac{\operatorname{cov}(\tilde{x}, \tilde{y})}{\sigma_{\tilde{x}}\,\sigma_{\tilde{y}}} \]

If we combine the two samples to obtain a larger sample of size 2k, it has the form
\[ RS = \begin{pmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_k & y_k \\ \tilde{x}_1 & \tilde{y}_1 \\ \vdots & \vdots \\ \tilde{x}_k & \tilde{y}_k \end{pmatrix} \]

The resulting mean is equal to
\[ \frac{1}{2k}\left( \frac{k(k+1)}{2} + \frac{k(k+1)}{2} \right) = \frac{k+1}{2}\,. \]

The resulting correlation (between X and Y) of the combined sample is equal to:
\[
\begin{aligned}
Corr &= \frac{\dfrac{1}{2k}\left(\displaystyle\sum_{i=1}^{k}(x_i - m)(y_i - m) + \sum_{i=1}^{k}(\tilde{x}_i - m)(\tilde{y}_i - m)\right)}
{\left(\dfrac{1}{2k}\displaystyle\sum_{i=1}^{k}(x_i - m)^2 + \dfrac{1}{2k}\sum_{i=1}^{k}(\tilde{x}_i - m)^2\right)^{1/2}\left(\dfrac{1}{2k}\displaystyle\sum_{i=1}^{k}(y_i - m)^2 + \dfrac{1}{2k}\sum_{i=1}^{k}(\tilde{y}_i - m)^2\right)^{1/2}} \\
&= \frac{\operatorname{cov}(x, y) + \operatorname{cov}(\tilde{x}, \tilde{y})}
{\left(\operatorname{var}(x) + \operatorname{var}(\tilde{x})\right)^{1/2}\left(\operatorname{var}(y) + \operatorname{var}(\tilde{y})\right)^{1/2}} \\
&= \frac{\operatorname{cov}(x, y)}{2\,\operatorname{var}(x)^{1/2}\operatorname{var}(y)^{1/2}}
 + \frac{\operatorname{cov}(\tilde{x}, \tilde{y})}{2\,\operatorname{var}(\tilde{x})^{1/2}\operatorname{var}(\tilde{y})^{1/2}}
 = \frac{corr_1}{2} + \frac{corr_2}{2}
\qquad \text{(Eq. 3)}
\end{aligned}
\]
where the first term of the last expression comes from \(RS_1\) and the second from \(RS_2\), and where the last line uses \(\operatorname{var}(x) = \operatorname{var}(\tilde{x})\) and \(\operatorname{var}(y) = \operatorname{var}(\tilde{y})\).

We see that combining the two samples gives, for the correlation between the variables of the new sample, the mean of the two previous correlations. The method is easily extended to multiple variables.
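Eq. 3 is easy to verify numerically; with the ranks left unchanged, the equality is exact up to floating-point error. A minimal check:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 10
rs1 = np.column_stack([rng.permutation(k) + 1 for _ in range(2)])   # ranks of sample 1
rs2 = np.column_stack([rng.permutation(k) + 1 for _ in range(2)])   # ranks of sample 2
corr = lambda m: np.corrcoef(m, rowvar=False)[0, 1]
print(corr(np.vstack([rs1, rs2])), (corr(rs1) + corr(rs2)) / 2)     # identical values (Eq. 3)
```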

3.2. Generalization of the method when the rank correlations change

In our method, we not only combine the two samples, but we also slightly modify the rank values. When the original sample size is doubled, this change is equivalent to multiplying the previous rank value by two and subtracting 1 if the new, smaller stratum lies in the left half of the old stratum. For instance, an old rank of 2 could become a rank of 4 (= 2*2) or 3 (= 2*2 - 1).

For calculation purposes, instead of multiplying by two and subtracting 1 half of the time, we keep the original value and subtract 0.5 half of the time. This is equivalent to dividing all the new ranks by two and thus has no effect on the resulting correlation. Let A be one part of the new correlation (the contribution of the first sample), i.e.,

\[
A = \frac{\displaystyle\sum_{i=1}^{k}(x_i^{*} - m^{*})(y_i^{*} - m^{*})}
{\left(\displaystyle\sum_{i=1}^{k}\left[(x_i^{*} - m^{*})^2 + (x_i^{c} - m^{*})^2\right]\right)^{1/2}\left(\displaystyle\sum_{i=1}^{k}\left[(y_i^{*} - m^{*})^2 + (y_i^{c} - m^{*})^2\right]\right)^{1/2}}
\]

    41

    21

    21* =

    += mkm

\[
x_i^{*} \in \left\{\, x_i,\; x_i - \tfrac{1}{2} \,\right\}
\qquad\text{and}\qquad
x_i^{c} =
\begin{cases}
x_i - \tfrac{1}{2} & \text{if } x_i^{*} = x_i\,,\\
x_i & \text{if } x_i^{*} = x_i - \tfrac{1}{2}\,,
\end{cases}
\]
and similarly
\[
y_i^{*} \in \left\{\, y_i,\; y_i - \tfrac{1}{2} \,\right\}
\qquad\text{and}\qquad
y_i^{c} =
\begin{cases}
y_i - \tfrac{1}{2} & \text{if } y_i^{*} = y_i\,,\\
y_i & \text{if } y_i^{*} = y_i - \tfrac{1}{2}\,,
\end{cases}
\]

where \([x_1, x_2, \ldots, x_k]\) is a permutation of \([1, 2, \ldots, k]\) and \([y_1, y_2, \ldots, y_k]\) is a permutation of \([1, 2, \ldots, k]\). We will compare the value of A to that of another quantity, called B, such that

\[
B = \frac{1}{2}\,\frac{\displaystyle\sum_{i=1}^{k}(x_i - m)(y_i - m)}
{\left(\displaystyle\sum_{i=1}^{k}(x_i - m)^2\right)^{1/2}\left(\displaystyle\sum_{i=1}^{k}(y_i - m)^2\right)^{1/2}}
\]

    We want to bound the difference between A and B. This leads to a maximization problem, which can be written as:

    Find M, such that:

  • ( )( )( ) ( )

    ( )( )( ) ( ) ( ) ( ) 2/1

    1

    2*

    1

    2*2/1

    1

    2*

    1

    2*

    1

    **

    2/1

    1

    22/1

    1

    2

    1

    21

    max

    max

    +

    +

    =

    =

    ====

    =

    ==

    =

    k

    iic

    k

    ii

    k

    iic

    k

    ii

    k

    iii

    k

    ii

    k

    ii

    k

    iii

    mymymxmx

    mymx

    mymx

    mymx

    ABM

    (Eq. 4)
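Before the denominators are simplified analytically, the size of |B - A| can be explored by brute force, in the spirit of the Monte Carlo check announced for Sect. 4. The sketch below uses the definitions of A and B given above, with the half-stratum shifts drawn at random; the code and its names are ours.

```python
import numpy as np

def max_shift_seen(k=10, trials=20000, seed=0):
    """Monte Carlo sketch of Eq. 4: random rank pairings and random half-stratum
    shifts; returns the largest |B - A| observed."""
    rng = np.random.default_rng(seed)
    m, m_star = (k + 1) / 2, (k + 1) / 2 - 0.25
    best = 0.0
    for _ in range(trials):
        x, y = rng.permutation(k) + 1.0, rng.permutation(k) + 1.0
        xs = x - rng.integers(0, 2, k) / 2                 # x*: rank kept, or shifted by -1/2
        ys = y - rng.integers(0, 2, k) / 2
        xc, yc = 2 * x - 0.5 - xs, 2 * y - 0.5 - ys        # complementary half-strata
        b = 0.5 * np.sum((x - m) * (y - m)) / np.sqrt(
            np.sum((x - m) ** 2) * np.sum((y - m) ** 2))
        den_a = np.sqrt((np.sum((xs - m_star) ** 2) + np.sum((xc - m_star) ** 2)) *
                        (np.sum((ys - m_star) ** 2) + np.sum((yc - m_star) ** 2)))
        a = np.sum((xs - m_star) * (ys - m_star)) / den_a
        best = max(best, abs(b - a))
    return best

print(max_shift_seen())   # small for k = 10; the analytical bound is derived below
```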

The denominator of B in Eq. 4 can be simplified as follows:
\[
\sum_{i=1}^{k}(x_i - m)^2 = \sum_{i=1}^{k} x_i^2 - k m^2
= \frac{k(k+1)(2k+1)}{6} - \frac{k(k+1)^2}{4}
= \frac{k(k-1)(k+1)}{12} = \frac{k(k^2 - 1)}{12}\,.
\]

In the same way, we have
\[ \sum_{i=1}^{k}(y_i - m)^2 = \frac{k(k^2 - 1)}{12}\,. \]

This leads to the following equality:
\[ \left(\sum_{i=1}^{k}(x_i - m)^2\right)^{1/2}\left(\sum_{i=1}^{k}(y_i - m)^2\right)^{1/2} = \frac{k(k^2 - 1)}{12}\,. \]

It is also possible to simplify the denominator of A in Eq. 4 by using the following equality:
\[ \sum_{i=1}^{k}\left[(x_i^{*} - m^{*})^2 + (x_i^{c} - m^{*})^2\right] = \frac{1}{4}\sum_{i=1}^{2k}\left(\bar{x}_i - \bar{m}\right)^2, \]
where \([\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{2k}]\) is a permutation of \([1, 2, \ldots, 2k]\) and \(\bar{m} = (2k+1)/2\) (the values \(x_i^{*}\) and \(x_i^{c}\) taken together are exactly the half-integers \(\tfrac{1}{2}, 1, \ldots, k\)).

Furthermore,
\[ \frac{1}{4}\sum_{i=1}^{2k}\left(\bar{x}_i - \bar{m}\right)^2 = \frac{1}{4}\cdot\frac{2k\left(4k^2 - 1\right)}{12} = \frac{k(4k^2 - 1)}{24}\,. \]

Hence, the denominator of A is:
\[ \left(\sum_{i=1}^{k}\left[(x_i^{*} - m^{*})^2 + (x_i^{c} - m^{*})^2\right]\right)^{1/2}\left(\sum_{i=1}^{k}\left[(y_i^{*} - m^{*})^2 + (y_i^{c} - m^{*})^2\right]\right)^{1/2} = \frac{k(4k^2 - 1)}{24}\,. \]

Therefore, Eq. 4 is equivalent to:
\[ M = \max\left| \frac{6\displaystyle\sum_{i=1}^{k}(x_i - m)(y_i - m)}{k(k^2 - 1)} - \frac{24\displaystyle\sum_{i=1}^{k}(x_i^{*} - m^{*})(y_i^{*} - m^{*})}{k(4k^2 - 1)} \right| \qquad \text{(Eq. 5)} \]

which can also be written as:
\[ M = \max\left| \frac{6}{k(k^2 - 1)(4k^2 - 1)}\left[ (4k^2 - 1)\sum_{i=1}^{k}(x_i - m)(y_i - m) - 4(k^2 - 1)\sum_{i=1}^{k}(x_i^{*} - m^{*})(y_i^{*} - m^{*}) \right] \right| \qquad \text{(Eq. 6)} \]

Let us call \(\alpha\) the sum \((4k^2 - 1)\sum_{i=1}^{k}(x_i - m)(y_i - m)\) and \(\beta\) the sum \(4(k^2 - 1)\sum_{i=1}^{k}(x_i^{*} - m^{*})(y_i^{*} - m^{*})\), so that Eq. 6 reads \(M = \max\left|\,6(\alpha - \beta)/\bigl(k(k^2 - 1)(4k^2 - 1)\bigr)\right|\).

These two sums can respectively be written as:
\[ \alpha = (4k^2 - 1)\left( \sum_{i=1}^{k} x_i y_i - m\sum_{i=1}^{k} x_i - m\sum_{i=1}^{k} y_i + k m^2 \right) = (4k^2 - 1)\left( \sum_{i=1}^{k} x_i y_i - \frac{k(k+1)^2}{4} \right) \]
\[ \beta = 4(k^2 - 1)\left( \sum_{i=1}^{k} x_i^{*} y_i^{*} - m^{*}\sum_{i=1}^{k} x_i^{*} - m^{*}\sum_{i=1}^{k} y_i^{*} + k\,(m^{*})^2 \right) \]

This leads to the following equality:
\[ \alpha - \beta = (4k^2 - 1)\sum_{i=1}^{k} x_i y_i - 4(k^2 - 1)\sum_{i=1}^{k} x_i^{*} y_i^{*} + 4(k^2 - 1)\,m^{*}\!\left(\sum_{i=1}^{k} x_i^{*} + \sum_{i=1}^{k} y_i^{*}\right) - 4k(k^2 - 1)(m^{*})^2 - \frac{(4k^2 - 1)\,k(k+1)^2}{4} \qquad \text{(Eq. 7)} \]

The factorizations \(4k^2 - 1 = (2k - 1)(2k + 1)\) and \(k^2 - 1 = (k - 1)(k + 1)\), together with the equality \(\sum_{i=1}^{k} x_i = \sum_{i=1}^{k} y_i = \frac{k(k+1)}{2}\), lead to the rewriting of Eq. 7 as

\[ \alpha - \beta = (2k - 1)(2k + 1)\sum_{i=1}^{k} x_i y_i - 4(k - 1)(k + 1)\sum_{i=1}^{k} x_i^{*} y_i^{*} + (k - 1)(k + 1)(2k + 1)\left(\sum_{i=1}^{k} x_i^{*} + \sum_{i=1}^{k} y_i^{*}\right) - \frac{k(k + 1)(2k + 1)(2k^2 - 1)}{2} \qquad \text{(Eq. 8)} \]

Without any assumptions on the distribution of \(x_i\), \(y_i\), \(x_i^{*}\) and \(y_i^{*}\), it is not possible to simplify Eq. 8 any further. The next step is to study separately the two cases \(\alpha - \beta > 0\) and \(\alpha - \beta < 0\).

First case: \(\alpha - \beta > 0\). To maximize Eq. 8 (which is equivalent to finding M, since the values are positive), we have to maximize the term
\[ (k^2 - 1)(2k + 1)\left(\sum_{i=1}^{k} x_i^{*} + \sum_{i=1}^{k} y_i^{*}\right) - 4(k^2 - 1)\sum_{i=1}^{k} x_i^{*} y_i^{*}\,. \]
Since \((k^2 - 1)(2k + 1) > 0\) and \(4(k^2 - 1) > 0\) for \(k > 1\), a simple way is to maximize \(\sum_{i=1}^{k} x_i^{*} + \sum_{i=1}^{k} y_i^{*}\) and to minimize \(\sum_{i=1}^{k} x_i^{*} y_i^{*}\). The maximization is obtained by taking \(x_i^{*} = x_i\) and \(y_i^{*} = y_i\); the minimization is obtained by taking \(x_i^{*} = x_i - \tfrac{1}{2}\) and \(y_i^{*} = y_i - \tfrac{1}{2}\). It is evident that this situation is impossible, but it provides an upper limit threshold that can never be reached.

Substituting these extremal (unattainable) values into Eq. 8 and then into Eq. 6 thus gives an explicit upper threshold for M.
