Estimation of Similarity Indices via Two-Sample Jackknife Procedure


Chia-Jui Chuang

Applied Mathematics, National Chung Hsing University, Taichung, Taiwan 402, R.O.C.

Abstract

Similarity indices are often used to assess the biodiversity of two communities. Because some species are not observed in the samples, the common naive estimators may be unsatisfactory. Based on quadrat sampling, a series of two-sample jackknife estimators for the Jaccard and Sørensen indices is developed. Increasing the jackknife order reduces the bias of the estimators, but it may also inflate their variance. To balance this bias-variance trade-off, we consider a sequential testing procedure to select the jackknife order. A simulation study based on two real forest plots is used to evaluate the performance of the proposed method.

Key Words: Jaccard Index, Sørensen Index, Two-Sample Jackknife, Quadrat Sampling

1. Introduction

A similarity index provides a quantitative measure for comparing two populations. For example, DNA fingerprinting applies a similarity index to DNA profiles [1] and is commonly used in parentage testing and criminal investigation. In computer science and data mining research, similarity indices are used to find similar pairs among objects [2]. The similarity index has also been widely used in ecology [3,4], which is the focus of this paper.

The advantages of classic similarity indices, such as the Jaccard index, the Sørensen index, the Bray-Curtis index, and the Morisita-Horn index, have been discussed [5,6]. Among these similarity indices, the Jaccard index [7] and the Sørensen index [8] are commonly applied in ecology. These indices are often used to measure species diversity when determining the optimum size of nature reserves [9,10]. Boyce and Ellison [11] indicated that these indices perform consistently well. Therefore, this study focuses on the Jaccard and Sørensen indices. The Jaccard index is defined as the number of shared species divided by the total number of distinct species in the two communities. The Sørensen index is the ratio of the number of shared species to the average number of species in the two communities.
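As a small illustration of these two definitions, the following sketch computes both indices from species counts. The function names and the example counts (two communities of 500 species each, 120 of them shared) are illustrative assumptions, not part of the original text.

```python
def jaccard(shared, total_1, total_2):
    """Shared species divided by the number of distinct species in both communities."""
    return shared / (total_1 + total_2 - shared)


def sorensen(shared, total_1, total_2):
    """Shared species divided by the average species count of the two communities."""
    return 2 * shared / (total_1 + total_2)


# Hypothetical example: 500 species in each community, 120 of them shared.
print(round(jaccard(120, 500, 500), 4))   # 0.1364
print(round(sorensen(120, 500, 500), 4))  # 0.24
```

Both functions work on true species counts; replacing the counts with observed numbers of species gives the naive estimators.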

The definitions of the Jaccard and Sørensen indices are based on the numbers of species in two populations. However, we usually obtain this information from samples collected from the populations. Hence, in practice, similarity indices are usually estimated by replacing the numbers of species with the numbers of species observed in the samples. These estimators are referred to as naive estimators. In general, the naive estimators are biased because some species are not observed in the sample; thus, there is still room to improve them.

The jackknife procedure can reduce the bias of estimators in various applications [12,13]. Moreover, Heltshe and Forrester [14] applied the jackknife procedure to a single community, and Burnham and Overton [15] extended the jackknife procedure to reduce the bias further. In this paper, a new two-sample jackknife procedure for the Jaccard and Sørensen indices is developed. We start with the naive estimators and derive a series of jackknife estimators for these indices. When the order of the jackknife estimators increases, the resulting bias decreases; however, a higher variance may occur. Therefore, a sequential test for selecting a proper order of the estimators is also proposed.

Journal of Applied Science and Engineering, Vol. 15, No. 3, pp. 301–310 (2012)

*Corresponding author. E-mail: [email protected]


Prior studies have discussed the estimation of similarity indices. Heltshe [16] suggested a jackknifed simple matching coefficient (SMC) to estimate the Jaccard index. Kaitala et al. [17] used the hypergeometric distribution with a simple rank approach to estimate the similarity index. More recently, Yue and Clayton [18] estimated the Jaccard index by nonparametric maximum likelihood estimators. Severiano et al. [19] used the jackknife and bootstrap methods to find confidence intervals for similarity indices. In this paper, we find that the jackknifed SMC estimator may be inappropriate in certain situations. The estimator is evaluated in Section 4.2.

In Section 2, we develop the two-sample jackknife algorithm and the sequential testing procedure for the Jaccard and Sørensen indices for abundance data. Section 3 considers the two-sample jackknife procedure for quadrat data. In Section 4, the performance of the jackknife estimators for the Jaccard and Sørensen indices is examined through simulations based on artificial datasets and on two tropical forests in Panama. Section 5 offers concluding remarks and discussion.

2. Methodology and Procedures

2.1 Jackknife Estimates for the Jaccard Index

Let $M_1$ and $M_2$ be the numbers of species present in communities I and II, respectively. These species can be decomposed as follows: $M_{12}$ species are shared by both communities, and $M_1 - M_{12}$ and $M_2 - M_{12}$ species are unique to communities I and II, respectively. Let $X = (X_1, \ldots, X_{M_1})$ and $Y = (Y_1, \ldots, Y_{M_2})$ be the sampling frequencies with sample sizes $n_1$ and $n_2$ from communities I and II, respectively. We assume that $X$ follows a multinomial distribution with total size $n_1$ and cell probabilities $(p_1, \ldots, p_{M_1})$, and that $Y$ follows a multinomial distribution with total size $n_2$ and cell probabilities $(q_1, \ldots, q_{M_2})$.

The Jaccard index is defined as $\theta_J = M_{12}/M$, where $M = M_1 + M_2 - M_{12}$ denotes the total number of species in the two communities. Let $D_1 = \sum_{i=1}^{M_1} I(X_i > 0)$ be the number of observed species in sample I, where $I(\cdot)$ denotes the usual indicator function, and let $D_2$ be the same for sample II. Let $D_{12} = \sum_{i=1}^{M} I(X_i > 0, Y_i > 0)$ be the number of species observed in both samples. We denote by $f_{jk} = \sum_{i=1}^{M} I(X_i = j, Y_i = k)$ the number of species represented by exactly $j$ individuals in sample I and $k$ individuals in sample II. The total number of observed species in both samples is $D = D_1 + D_2 - D_{12}$. Because $M_{12}$ and $M$ are unknown, the naive method uses $D_{12}/D$ to estimate the Jaccard index. The number of observed species is generally less than the actual number; consequently, the naive estimator is biased.
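The quantities defined above can be computed directly from two aligned count vectors. The sketch below is illustrative only: it assumes that position `i` in both lists refers to the same species (with a count of 0 when the species is unseen in that sample), and the function name and toy data are hypothetical.

```python
from collections import Counter


def observed_summaries(x, y):
    """Return D1, D2, D12, D and the frequency table f[(j, k)].

    x[i] and y[i] are the counts of species i in samples I and II
    (assumed aligned by species; 0 means unseen in that sample).
    """
    d1 = sum(1 for xi in x if xi > 0)
    d2 = sum(1 for yi in y if yi > 0)
    d12 = sum(1 for xi, yi in zip(x, y) if xi > 0 and yi > 0)
    d = d1 + d2 - d12
    # f[(j, k)]: number of observed species with j individuals in I and k in II.
    f = Counter((xi, yi) for xi, yi in zip(x, y) if (xi, yi) != (0, 0))
    return d1, d2, d12, d, f


# Hypothetical toy data: six species, the fifth unseen in both samples.
x = [3, 1, 0, 2, 0, 1]
y = [1, 0, 4, 2, 0, 0]
d1, d2, d12, d, f = observed_summaries(x, y)
naive_jaccard = d12 / d  # D12 / D = 2 / 5
```

A `Counter` keeps only the cells that actually occur, which is convenient because most $(j, k)$ cells are empty in practice.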

The two-sample jackknife procedure is applied to reduce the bias of the naive estimator [20]. Let the naive estimator be the initial estimator, denoted by $\hat{\theta}_{J_0}$, for $\theta_J$ based on all $n_1 + n_2$ observations. Define $\hat{\theta}_{J_0}^{(-\ell,\cdot)}$ as the estimator obtained by evaluating $\hat{\theta}_{J_0}$ when the $\ell$th observation in $X$ has been deleted, where $\ell = 1, \ldots, n_1$. The $\ell$th pseudo-value is defined as $n_1 \hat{\theta}_{J_0} - (n_1 - 1)\hat{\theta}_{J_0}^{(-\ell,\cdot)}$. Then, the jackknife estimator is defined as the average of these pseudo-values and is derived as:

where $f_{1+} = \sum_{i=1}^{M} I(X_i = 1, Y_i > 0)$ denotes the number of observed shared species represented by exactly one individual in sample I. For $Y$, the jackknife estimator obtained by deleting one individual in $Y$ at a time is

where $f_{+1} = \sum_{i=1}^{M} I(X_i > 0, Y_i = 1)$. The weighted average of $\hat{\theta}_{J_0,X}$ and $\hat{\theta}_{J_0,Y}$ is the resulting first-order jackknife estimator:

Following the two-sample jackknife [21], we continue by removing one individual in $Y$ at a time from the estimator $\hat{\theta}_{J_0,X}$. Let $\hat{\theta}_{J_0,X}^{(\cdot,-m)}$ be the estimator obtained by evaluating $\hat{\theta}_{J_0,X}$ when the $m$th individual is removed from $Y$. The $m$th pseudo-value is $n_2 \hat{\theta}_{J_0,X} - (n_2 - 1)\hat{\theta}_{J_0,X}^{(\cdot,-m)}$, and the second-order jackknife estimator is derived as:

where $\hat{\theta}_{J_0}^{(-\ell,-m)}$ denotes the estimator obtained by evaluating $\hat{\theta}_{J_0}$ when the $\ell$th observation in $X$ and the $m$th observation in $Y$ have been deleted.
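The delete-one construction can also be checked numerically by brute force. The sketch below is an illustration under stated assumptions, not the closed-form estimator of the paper: it deletes one individual at a time from a sample, averages the pseudo-values, and combines the two one-sample results with weights $n_1$ and $n_2$, as in the weighted average used in Step 1 of the algorithm. All function names and the toy data are hypothetical, and the samples are assumed large enough that no deletion removes every observed species.

```python
def naive_jaccard(x, y):
    """Naive Jaccard estimate D12 / D from aligned count vectors."""
    d12 = sum(1 for a, b in zip(x, y) if a > 0 and b > 0)
    d = sum(1 for a, b in zip(x, y) if a > 0 or b > 0)
    return d12 / d


def delete_one_jackknife(x, y, estimator):
    """Average of the pseudo-values obtained by deleting one individual of x at a time."""
    n = sum(x)
    full = estimator(x, y)
    loo_sum = 0.0
    for i, xi in enumerate(x):
        if xi == 0:
            continue
        reduced = list(x)
        reduced[i] -= 1
        # All xi individuals of species i give the same leave-one-out value.
        loo_sum += xi * estimator(reduced, y)
    # mean of n*full - (n - 1)*leave_one_out over all n individuals
    return n * full - (n - 1) * loo_sum / n


def first_order_jackknife(x, y, estimator=naive_jaccard):
    """Weighted average of the jackknife over X and the jackknife over Y."""
    n1, n2 = sum(x), sum(y)
    jx = delete_one_jackknife(x, y, estimator)
    # Swap roles so that individuals are deleted from y instead of x.
    jy = delete_one_jackknife(y, x, lambda b, a: estimator(a, b))
    return (n1 * jx + n2 * jy) / (n1 + n2)


# Hypothetical toy data, aligned by species.
x = [3, 1, 0, 2, 0, 1]
y = [1, 0, 4, 1, 0, 0]
print(naive_jaccard(x, y), first_order_jackknife(x, y))
```

Grouping the deletions by species avoids looping over every individual, since removing any individual of the same species changes the counts in the same way.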

We may continue the jackknife procedure to further reduce the bias. However, the number of terms in the jackknife estimator of order $k$ is $3^k$ for $k = 1, 2, \ldots$, and the form of the estimator becomes complicated when $k$ is large. Therefore, we suggest ignoring terms of order $1/D^2$ and $1/D^3$ in developing the jackknife estimator. Consequently, the approximated estimators of $\hat{\theta}_{J_1}$ and $\hat{\theta}_{J_2}$ are

The approximated estimators $\tilde{\theta}_{J_1}$ and $\tilde{\theta}_{J_2}$ are close to $\hat{\theta}_{J_1}$ and $\hat{\theta}_{J_2}$ when $D$ is large enough. In Section 4.1, we consider several community scenarios to confirm the closeness of the approximated and original estimators.

Note that $\tilde{\theta}_{J_1}$ and $\tilde{\theta}_{J_2}$ can be represented as $\tilde{\theta}_{J_i} = \sum_{j \ge 0} \sum_{k \ge 0} c_{jk}^{(i)} f_{jk} / D$, where $c_{jk}^{(i)}$ is the coefficient corresponding to $f_{jk}$ and $i = 1, 2$. According to the two-sample jackknife procedure above, and ignoring the $1/D^{r}$ terms for $r \ge 2$, our jackknife algorithm is summarized as:

Step 0. Set $v = 0$ initially.

Step 1. Let $i = 2v + 1$. Define

$$\hat{\theta}_{J_{2v},X} = \frac{n_1}{v+1}\,\hat{\theta}_{J_{2v}} - \frac{n_1 - v - 1}{(v+1)\,n_1} \sum_{\ell=1}^{n_1} \hat{\theta}_{J_{2v}}^{(-\ell,\cdot)}$$

and

$$\hat{\theta}_{J_{2v},Y} = \frac{n_2}{v+1}\,\hat{\theta}_{J_{2v}} - \frac{n_2 - v - 1}{(v+1)\,n_2} \sum_{m=1}^{n_2} \hat{\theta}_{J_{2v}}^{(\cdot,-m)}.$$

The $i$th order jackknife estimator is $\hat{\theta}_{J_i} = (n_1 \hat{\theta}_{J_{2v},X} + n_2 \hat{\theta}_{J_{2v},Y}) / (n_1 + n_2)$. An approximated estimator for the $i$th order is $\tilde{\theta}_{J_i} = \sum_{j \ge 0} \sum_{k \ge 0} c_{jk}^{(i)} f_{jk} / D$.

Step 2. The $(i + 1)$th order jackknife estimator is

An approximated estimator for the $(i + 1)$th order is

Step 3. Increase the integer $v$ to $v + 1$ and return to Step 1.

The explicit formulae for $\tilde{\theta}_{J_1}$ to $\tilde{\theta}_{J_6}$ are provided in Appendix A.

The estimated variances of $\tilde{\theta}_{J_i}$ for $i = 1, \ldots, 6$ are derived by the delta method. For a given $D$, the estimated variance of $\tilde{\theta}_{J_i}$ is

(1)

Note that $f = (f_{11}, \ldots, f_{jk}, \ldots)$, conditional on $D$, approximately follows a multinomial distribution with size $D$ and cell probabilities $\hat{\pi}_{jk} = f_{jk}/D$ for all $j, k > 0$. Hence, the estimated covariance in Eq. (1) is

Furthermore, by replacing the covariance in Eq. (1) with the estimated covariance above, Eq. (1) can be simplified as
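To illustrate the kind of calculation described above, the following sketch evaluates a linear estimator $\sum c_{jk} f_{jk}/D$ and a plug-in variance under the stated multinomial approximation. It relies on the standard multinomial identity $\mathrm{Var}(\sum c_{jk} f_{jk}) = D[\sum c_{jk}^2 \pi_{jk} - (\sum c_{jk} \pi_{jk})^2]$ with $\pi_{jk}$ replaced by $f_{jk}/D$; this is an assumption about the form of the calculation and may differ from the exact expression in Eq. (1). The function name and the coefficient table are hypothetical; the actual coefficients $c_{jk}^{(i)}$ are those of Appendix A.

```python
def linear_estimate_and_variance(coef, f):
    """Estimate sum(c*f)/D and its plug-in multinomial variance, given D = sum(f).

    coef and f are dicts keyed by the cell (j, k); missing coefficients count as 0.
    """
    d = sum(f.values())
    s1 = sum(coef.get(cell, 0.0) * count for cell, count in f.items())
    s2 = sum(coef.get(cell, 0.0) ** 2 * count for cell, count in f.items())
    estimate = s1 / d
    # Var = (1/D^2) * [sum(c^2 f) - (sum(c f))^2 / D]
    variance = (s2 - s1 ** 2 / d) / d ** 2
    return estimate, variance


# Hypothetical check with the naive estimator: c_jk = 1 when j > 0 and k > 0.
f = {(3, 1): 1, (1, 0): 2, (0, 4): 1, (2, 2): 1}
coef = {cell: 1.0 for cell in f if cell[0] > 0 and cell[1] > 0}
estimate, variance = linear_estimate_and_variance(coef, f)
print(estimate, variance)
```

For the naive coefficients the result reduces to the familiar binomial form $\hat{p}(1 - \hat{p})/D$, which is a convenient sanity check.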

2.2 Order Selection

Although a higher-order jackknife estimator might reduce the bias further, it is usually accompanied by a higher variance. The bias-variance trade-off is therefore crucial in selecting the jackknife order. We follow Burnham and Overton [15] to develop a sequential test for selecting an ideal order of the jackknife estimator. The test is based on the hypothesis

(2)

for $i = 1, \ldots, 6$. The difference between two adjacent estimators can also be expressed as:

Using the delta method described in Section 2.1, the conditional variance of the difference can be estimated by

We assume that, under the null hypothesis $H_{0i}$, the test statistic asymptotically follows the standard normal distribution (Burnham and Overton [15]):

Given a significance level $\alpha$, the test of hypothesis (2) begins at $i = 1$ to determine whether $\tilde{\theta}_{J_1}$ and $\hat{\theta}_{J_0}$ are significantly different. If the p-value of $T_i$ is less than $\alpha$, we continue by testing $H_{0,i+1}$, until the p-value of $T_{i+1}$ is greater than $\alpha$. If the procedure stops at $i = i^*$, the estimator $\tilde{\theta}_{J_{i^*-1}}$ is taken as our proposed estimator for the Jaccard index, denoted by $\tilde{\theta}_J = \tilde{\theta}_{J_{i^*-1}}$. A variance estimator for $\tilde{\theta}_J$ is suggested in [15]. However, this estimated variance is usually biased downward because it does not account for the variability of the order selection. Instead, we suggest the non-parametric bootstrap method [22] to obtain the variance of $\tilde{\theta}_{J_{i^*-1}}$.
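The stopping rule described above can be sketched as follows. This is a minimal illustration under assumptions: the test statistic is taken to be the difference of adjacent estimates divided by its estimated standard error, the p-value is two-sided, and the highest available order is returned if every test rejects. None of these details is spelled out in the text here, and the function name and numbers are hypothetical.

```python
import math


def select_order(estimates, diff_variances, alpha=0.10):
    """Pick the jackknife order by sequential testing.

    estimates[i] is the order-i estimate (i = 0 is the naive estimator);
    diff_variances[i - 1] is the estimated variance of estimates[i] - estimates[i - 1].
    Returns (selected order, selected estimate).
    """
    for i in range(1, len(estimates)):
        t = (estimates[i] - estimates[i - 1]) / math.sqrt(diff_variances[i - 1])
        p_value = math.erfc(abs(t) / math.sqrt(2))  # two-sided standard normal p-value
        if p_value > alpha:
            # Not significantly different: stop at i and keep order i - 1.
            return i - 1, estimates[i - 1]
    # Assumption: if every test rejects, fall back to the highest order supplied.
    return len(estimates) - 1, estimates[-1]


# Hypothetical estimates of orders 0 to 3 and variances of adjacent differences.
estimates = [0.10, 0.13, 0.145, 0.15]
diff_variances = [1e-4, 1e-4, 1e-4]
print(select_order(estimates, diff_variances, alpha=0.10))
```

With these numbers the first difference is significant and the second is not, so the procedure stops at $i^* = 2$ and keeps the first-order estimate.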

2.3 The Sørensen Index

In addition to the Jaccard index, the Sørensen index is also commonly used. The Sørensen similarity index is defined as $\theta_S = 2M_{12}/(M_1 + M_2)$. Let the naive estimator of the Sørensen index, $\hat{\theta}_{S_0} = 2D_{12}/(D_1 + D_2)$, be the initial estimator for the two-sample jackknife procedure. Following the steps in Section 2.1, the order 1 and 2 jackknife estimates, which ignore terms of order $1/(D_1 + D_2)^2$ and $1/(D_1 + D_2)^3$, are

(3)

In Appendix A, we also provide the explicit formulae for orders 1 to 6.

Rewrite Eq. (3) as $\tilde{\theta}_{S_i} = \hat{\theta}_{S_0} + \sum_{j \ge 0} \sum_{k \ge 0} d_{jk}^{(i)} f_{jk} / (D_1 + D_2 - 1)$, where $d_{jk}^{(i)}$ is the coefficient corresponding to $f_{jk}$. Given $D$, the estimated variance of $\tilde{\theta}_{S_i}$ can be approximated by the delta method mentioned in Section 2.1. The difference between two adjacent estimators is $\tilde{\theta}_{S_i} - \tilde{\theta}_{S_{i-1}} = \sum_{j \ge 0} \sum_{k \ge 0} (d_{jk}^{(i)} - d_{jk}^{(i-1)}) f_{jk} / (D_1 + D_2 - 1)$. Following the sequential hypothesis testing in Section 2.2, we can determine an ideal order and take the corresponding estimator as the final Sørensen estimator.

3. Similarity Indices for Incidence Data

It is common to collect data by a quadrat sampling design in a field survey, especially in a forest community. Instead of counting the exact abundance of each species in a quadrat, only the presence (1) or absence (0) of each species is recorded at each sampled quadrat. For example, if the $i$th species is observed in the $j$th quadrat, we set $z_{ij} = 1$; otherwise, $z_{ij} = 0$. Let $t_1$ be the number of sampled quadrats in community I, and let $X_i^*$ be the number of quadrats in which the $i$th species was present (i.e., $X_i^* = \sum_{j=1}^{t_1} z_{ij}$). Let $X^* = (X_1^*, \ldots, X_{M_1}^*)$ be the collected data. Let $n_1^* = \sum_{i=1}^{M_1} X_i^*$; then $X^*$ follows a multinomial distribution with cell probabilities $(E(X_1^*)/n_1^*, \ldots, E(X_{M_1}^*)/n_1^*)$ when $n_1^*$ is given. Similarly, let $t_2$ be the number of sampled quadrats and $Y^* = (Y_1^*, \ldots, Y_{M_2}^*)$ be the sample data of community II. Let $n_2^* = \sum_{i=1}^{M_2} Y_i^*$; then $Y^*$ also follows a multinomial distribution with size $n_2^*$ and cell probabilities $(E(Y_1^*)/n_2^*, \ldots, E(Y_{M_2}^*)/n_2^*)$.

Let $D$ be the total number of observed species in the two samples, and let $f_{jk}$ be the number of shared species detected in exactly $j$ quadrats in sample I and $k$ quadrats in sample II. Following the same jackknife algorithm as in Section 2.1, we find that the jackknife estimators derived under the quadrat sampling design are the same as those shown in Appendix A. Hence, the sequential testing criterion in Section 2.2 is also recommended for selecting the jackknife order for the Jaccard and Sørensen indices under the quadrat sampling design.
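As a small illustration of the incidence counts used in this section, the sketch below turns a presence/absence matrix into the per-species counts $X_i^*$ and their total $n_1^*$. The function name and the toy matrix are hypothetical.

```python
def incidence_counts(z):
    """z[i][j] = 1 if species i was present in quadrat j, else 0.

    Returns the number of quadrats in which each species was present.
    """
    return [sum(row) for row in z]


# Hypothetical data: three species recorded in four quadrats.
z = [
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
x_star = incidence_counts(z)  # per-species incidence counts
n1_star = sum(x_star)         # total number of incidences
print(x_star, n1_star)
```

The resulting count vectors play the same role for quadrat data as the abundance vectors $X$ and $Y$ do in Section 2.1.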

4. Simulation Study

4.1 Artificial Populations

A simulation study is conducted to examine the difference between the proposed jackknife estimators and the original estimators under different scenarios, and to investigate the performance of the selected jackknife estimators ($\tilde{\theta}_J$ and $\tilde{\theta}_S$). To test the performance of the two-sample jackknife procedure under various data structures, we consider three communities, each consisting of 500 species, with the discovery probabilities as follows:

Community 1: The $P_i$'s are independently generated from a uniform distribution on the range from 0 to 0.5.

Community 2: The values of the $P_i$'s are set as $P_i = 0.01$, $i = 1, \ldots, 100$; $P_i = 0.02$, $i = 101, \ldots, 200$; $P_i = 0.15$, $i = 201, \ldots, 300$; $P_i = 0.56$, $i = 301, \ldots, 400$; $P_i = 0.98$, $i = 401, \ldots, 500$.

Community 3: $P_i = 4/(511 - i)$, $i = 1, \ldots, 500$.

The coefficients of variation (CV) for these three communities are 0.58, 1.09, and 1.45, respectively. We consider six possible combinations of the three communities, namely, 1 vs. 1, 1 vs. 2, …, 3 vs. 3. The numbers of sampled quadrats are set as $t_1 = t_2 = 50$. For each combination, we assume that the first 100, 250, or 400 species are the shared species, and we generate 1000 datasets for each scenario.
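The three communities can be generated and their CVs checked with a short sketch. It assumes that the CV is the population standard deviation of the $P_i$'s divided by their mean, and that each species is detected independently in each quadrat with probability $P_i$; both are assumptions made for illustration, and all function names are hypothetical.

```python
import random
import statistics


def community_probabilities(kind, rng):
    """Discovery probabilities P_1, ..., P_500 for community 1, 2 or 3."""
    if kind == 1:
        return [rng.uniform(0.0, 0.5) for _ in range(500)]
    if kind == 2:
        levels = [0.01, 0.02, 0.15, 0.56, 0.98]
        return [p for p in levels for _ in range(100)]  # 100 species per level
    if kind == 3:
        return [4 / (511 - i) for i in range(1, 501)]
    raise ValueError("kind must be 1, 2 or 3")


def coefficient_of_variation(p):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(p) / statistics.fmean(p)


def sample_incidence(p, quadrats, rng):
    """Number of quadrats (out of `quadrats`) in which each species is detected."""
    return [sum(rng.random() < pi for _ in range(quadrats)) for pi in p]


rng = random.Random(2012)
for kind in (1, 2, 3):
    p = community_probabilities(kind, rng)
    print(kind, round(coefficient_of_variation(p), 2))
counts = sample_incidence(community_probabilities(3, rng), 50, rng)
```

Communities 2 and 3 are deterministic, so their CVs can be compared directly with the reported 1.09 and 1.45; the CV of community 1 varies slightly around 0.58 from one random draw to the next.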

The averaged differences between the proposed jackknife estimators ($\tilde{\theta}_{J_1}$, $\tilde{\theta}_{J_2}$, $\tilde{\theta}_{S_1}$, $\tilde{\theta}_{S_2}$) and the original jackknife estimators ($\hat{\theta}_{J_1}$, $\hat{\theta}_{J_2}$, $\hat{\theta}_{S_1}$, $\hat{\theta}_{S_2}$) are summarized in Table 1. All the proposed jackknife estimators are higher than the original estimators, but the excess is quite small in all of the cases considered. The most noticeable difference occurs in case 3 vs. 3, which has a higher CV than the other five cases. Since the difference increases as the number of shared species increases, we use a small number of shared species to assess the performance of the selected jackknife estimators.

In the simulation, numbers of shared species from 120 to 400 were considered, and the number of sampled quadrats ranged from 10 to 100. Because of the word limit, we report only the results for 120 shared species with 50 sampled quadrats, since the conclusions are similar. Figure 1 presents the means of the naive estimators, the first- to fourth-order jackknife estimators, and the selected estimators for the Jaccard and Sørensen indices. In Figure 1, the x-axis corresponds to the sequence of six combinations. The horizontal dotted lines in the figure indicate the true values, 0.1364 and 0.24, for the Jaccard and Sørensen indices, respectively. Furthermore, the root mean square error (RMSE) of the selected jackknife estimator varies little for significance levels from 5% to 10%. Therefore, we report only the means of the selected estimators at the 10% significance level in Figure 1.


In Figure 1, the naive estimators substantially underestimate the true values. The first three orders of jackknife estimators also underestimate; however, the bias decreases as the order increases. The fourth-order jackknife estimators ($\tilde{\theta}_{J_4}$ and $\tilde{\theta}_{S_4}$) overestimate; however, their absolute biases are still smaller than those of the naive estimators. The selected jackknife estimator lies between the third- and fourth-order jackknife estimators in all cases. Although the selected jackknife estimators have the smallest bias, they are not the ideal choice in terms of RMSE, as shown in Figure 2, because the selection procedure introduces extra variation. The third-order jackknife estimators ($\tilde{\theta}_{J_3}$ and $\tilde{\theta}_{S_3}$) have the smallest RMSE in almost all the cases considered. Hence, the third-order jackknife estimators are also recommended as a simplified choice for ecological studies.

4.2 Real Populations

There are two protected forest plots in Panama: the Sherman plot and the Cocoli plot. General survey data for these two forests can be found on the website of the Center for Tropical Forest Science. Various studies have been conducted on their distinguishing characteristics [23]. The Sherman forest is located in the San Lorenzo National Park, in the tropical moist forest along the Caribbean coast. It is L-shaped and covers a surface area of 5.96 ha (a 400 m × 100 m rectangle and a 140 m × 140 m square). A census of species was taken three times, in January 1996, December 1997, and February 1999. The Cocoli plot is located on the Pacific Ocean side of the Panama Canal. It covers a surface area of 4 ha and is also L-shaped (a 300 m × 100 m rectangle and a 100 m × 100 m square). The census of species was taken in November of 1994, 1997, and 1998. The distance between these two forests is 58.8 km, and 1997 census data are available for both. There are 50 shared species in these forests. The dataset selected for this study includes tree species with a diameter at breast height greater than 10 mm. The basic characteristics of the two forests are summarized in Table 2.

Figure 1. The averaged values of the estimators for the Jaccard and Sørensen indices. The plotted symbols denote $\hat{\theta}_{J_0}$, $\tilde{\theta}_{J_1}$, $\tilde{\theta}_{J_2}$, $\tilde{\theta}_{J_3}$, $\tilde{\theta}_{J_4}$, and $\tilde{\theta}_J$; the same symbols are used for the Sørensen index.

Figure 2. The RMSE of the estimators for the Jaccard and Sørensen indices. The plotted symbols denote $\hat{\theta}_{J_0}$, $\tilde{\theta}_{J_1}$, $\tilde{\theta}_{J_2}$, $\tilde{\theta}_{J_3}$, $\tilde{\theta}_{J_4}$, and $\tilde{\theta}_J$; the same symbols are used for the Sørensen index.

As the surface area covered by the two forests is very small, a 5 m × 5 m quadrat size with five sampling proportions (2%, 4%, 6%, 10%, and 20%) is examined. For each sampling proportion, 1000 random repetitions are carried out, and 100 bootstrap replicates are used to calculate the standard error of the selected jackknife estimator in each repetition. Because of the word limit, we report only the sampling proportions 2%, 4%, and 10%. The first six orders of jackknife estimators for the Jaccard and Sørensen indices and the selected estimators at the 5% and 10% significance levels are assessed in this study. However, we find that most testing procedures stop before the fourth jackknife estimator. Hence, the first four estimators and the selected estimator at the 10% significance level are reported in Tables 3 and 4, where $\sigma$ denotes the mean of standard error and $\hat{\sigma}$ denotes the mean of estimated standard error.
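The bootstrap standard error mentioned above can be sketched as follows. The sketch assumes that quadrats are resampled with replacement within each plot and that the estimator takes two aligned incidence-count vectors; both points are assumptions, since the resampling unit is not specified here. The function names and toy matrices are hypothetical, and the naive Jaccard estimator stands in for the selected jackknife estimator.

```python
import random
import statistics


def naive_jaccard(x, y):
    """Naive Jaccard estimate D12 / D from aligned count vectors."""
    d12 = sum(1 for a, b in zip(x, y) if a > 0 and b > 0)
    d = sum(1 for a, b in zip(x, y) if a > 0 or b > 0)
    return d12 / d


def bootstrap_se(z1, z2, estimator, replicates=100, seed=0):
    """Bootstrap standard error of `estimator` by resampling quadrats.

    z1 and z2 are presence/absence matrices (species x quadrats) whose rows
    are aligned by species; a species absent from a plot has an all-zero row.
    """
    rng = random.Random(seed)
    t1, t2 = len(z1[0]), len(z2[0])
    values = []
    for _ in range(replicates):
        cols1 = [rng.randrange(t1) for _ in range(t1)]  # quadrats drawn with replacement
        cols2 = [rng.randrange(t2) for _ in range(t2)]
        x = [sum(row[j] for j in cols1) for row in z1]
        y = [sum(row[j] for j in cols2) for row in z2]
        values.append(estimator(x, y))
    return statistics.stdev(values)


# Hypothetical data: three species, three quadrats per plot.
z1 = [[1, 1, 1], [1, 0, 0], [0, 0, 0]]
z2 = [[1, 1, 1], [0, 0, 0], [0, 1, 0]]
print(bootstrap_se(z1, z2, naive_jaccard, replicates=100, seed=1))
```

In the toy data the first species is present in every quadrat of both plots, so every resample contains at least one shared species and the estimator is always defined.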

Table 2. Several characteristics of the Sherman and Cocoli forests

| | Sherman | Cocoli |
|---|---|---|
| Location | 9°21' N, 79°57' W | 8°58' N, 79°35' W |
| Size of plot | 5.96 ha | 4 ha |
| No. of species | 224 | 170 |
| No. of individuals | 21799 | 8288 |
| No. of quadrats (5 m × 5 m) | 2384 | 1600 |
| No. of shared species | 50 (both plots) | |
| Jaccard index | 0.1453 (both plots) | |
| Sørensen index | 0.2538 (both plots) | |

In general, sampling without replacement is more suitable than sampling with replacement for sedentary species such as plants. As the maximum sampling proportion is 20% in our setting, data sampled with or without replacement are similar. As observed in Tables 3 and 4, the number of shared species is low for small sampling proportions (less than 10%). Therefore, the means of the naive estimators substantially underestimate the Jaccard and Sørensen indices. The proposed jackknife estimators and the selected estimators are effective in reducing bias. The jackknife estimators perform better as the sampling proportion becomes larger and perform best at the 10% sampling proportion. The estimated variance of the selected estimators is slightly underestimated, which lowers the coverage rate of the 95% confidence intervals based on $\tilde{\theta}_J$ and $\tilde{\theta}_S$.

The averages of the selected estimators always lie between those of the third-order and fourth-order jackknife estimators. For $\tilde{\theta}_J$, the mean is close to that of the third-order estimator, and the RMSE is close to that of the fourth-order estimator. In terms of RMSE, $\tilde{\theta}_{J_3}$ has the smallest value in almost all cases, so $\tilde{\theta}_{J_3}$ is also recommended for the Jaccard index because it is simple to apply. Although $\tilde{\theta}_{S_2}$ has the smallest RMSE in Table 4, its bias tends to be higher than that of $\tilde{\theta}_{S_3}$. Therefore, for simplicity, we recommend $\tilde{\theta}_{S_3}$ for the Sørensen index.

5. Discussion

This study presented a new procedure based on the two-sample jackknife to estimate the Jaccard and Sørensen indices for abundance data and quadrat data. A sequential testing criterion for selecting a proper order among the jackknife estimators is also suggested. Heltshe [16] proposed the two-sample jackknife for estimating the Jaccard index by using the SMC estimator; however, our findings reveal that $\hat{\theta}_{SMC}$ is unsuitable for this dataset.

Heltshe and Forrester [14] pointed out that jackknife estimators are sensitive to the sample size. The two-sample jackknife has a similar problem. We find that the jackknife estimators always underestimate when the sample size is small. As the sample size increases, the performance of the jackknife estimators improves. In addition, the relative abundance distribution (Mouillot and Lepretre [24]) also affects the performance of the jackknife estimators. For the six combinations listed in Section 4.1, we consider shared species that are abundant in one community but rare in the other. In most cases, the selected jackknife estimator still performs very well, except in case 3 vs. 3, which has a higher CV than the other cases. Although the selected jackknife estimator overestimates in case 3 vs. 3, it has smaller bias and RMSE than the naive estimator.

The sequential hypothesis test for identifying the most suitable order of the estimator is based on the work of Burnham and Overton [15], but the selection procedure introduces extra variation. Hence, the bootstrap method is appropriate for estimating the variance of the selected estimator. However, the bootstrap is time-consuming. Therefore, further investigation of variance estimation for the selected estimator is required.

A number of similarity indices have not been covered in this paper, including the Kulczynski index, the Morisita-Horn index, and the Bray-Curtis index. Comparisons among similarity indices could provide insights into which index is the most appropriate and which is inappropriate. In addition, future studies should examine how to extend the jackknife procedure to similarity indices for multiple communities.

Appendix A. Jackknife Estimators $\tilde{\theta}_{J_i}$ and $\tilde{\theta}_{S_i}$ for $i = 1, \ldots, 6$

Define a coefficient matrix, which depends on the number of sampled individuals $n$, as

Furthermore, the frequencies $f_{jk}$, for $j, k = 1, 2, 3$, are reorganized into the frequency matrix

The formulae of the jackknife estimators are summarized as follows:


References

[1] Lynch, M., “The Similarity Index and DNA Fingerprinting,” Molecular Biology and Evolution, Vol. 7, pp. 478–484 (1990).

[2] Tan, P.-N., Steinbach, M. and Kumar, V., Introduction to Data Mining, Addison-Wesley (2005).

[3] Hubalek, Z., “Coefficients of Association and Similarity, Based on Binary (Presence-Absence) Data: An Evaluation,” Biological Reviews, Vol. 57, pp. 669–689 (1982).

[4] Chao, A., Chazdon, R. L., Colwell, R. K. and Shen, T.-J., “Abundance-Based Similarity Indices and Their Estimation When There are Unseen Species in Samples,” Biometrics, Vol. 62, pp. 361–371 (2006).

[5] Magurran, A. E., Ecological Diversity and Its Measurement, Princeton University Press (1988).

[6] Magurran, A. E., Measuring Biological Diversity, Wiley-Blackwell (2004).

[7] Jaccard, P., “Lois De Distribution Florale Dans La Zone Alpine,” Bulletin Societe Vaudoise Sciences Naturelles, Vol. 38, pp. 67–130 (1902).

[8] Sørensen, T., “A Method of Establishing Groups of Equal Amplitude in Plant Sociology Based on Similarity of Species and Its Application to Analyses of the Vegetation on Danish Commons,” Biologiske Skrifter / Kongelige Danske Videnskabernes Selskab, Vol. 5, pp. 1–34 (1948).

[9] Higgs, A. J. and Usher, M. B., “Should Nature Reserves Be Large or Small?” Nature, Vol. 285, pp. 568–569 (1980).

[10] Legendre, P. and Legendre, L., Numerical Ecology, 2nd ed., Elsevier Science (1998).

[11] Boyce, R. L. and Ellison, P. C., “Choosing the Best Similarity Index When Performing Fuzzy Set Ordination on Binary Data,” Journal of Vegetation Science, Vol. 12, pp. 711–720 (2001).

[12] Quenouille, M. H., “Notes on Bias in Estimation,” Biometrika, Vol. 43, pp. 353–360 (1956).

[13] Schucany, W. R., Gray, H. L. and Owen, D. B., “On Bias Reduction in Estimation,” Journal of the American Statistical Association, Vol. 66, pp. 524–533 (1971).

[14] Heltshe, J. F. and Forrester, N. E., “Estimating Species Richness Using the Jackknife Procedure,” Biometrics, Vol. 39, pp. 1–11 (1983).

[15] Burnham, K. P. and Overton, W. S., “Estimation of the Size of a Closed Population When Capture Probabilities Vary Among Animals,” Biometrika, Vol. 65, pp. 625–633 (1978).

[16] Heltshe, J. F., “Jackknife Estimate of the Matching Coefficient of Similarity,” Biometrics, Vol. 44, pp. 447–460 (1988).

[17] Kaitala, S., Maximov, V. N. and Niemi, A., “A Simple Approach to Estimate Similarity in Ecosystem Analysis,” Plant Ecology, Vol. 92, pp. 101–112 (1991).

[18] Yue, J.-C. and Clayton, M. K., “A Similarity Measure Based on Species Proportions,” Communications in Statistics - Theory and Methods, Vol. 34, pp. 2123–2131 (2005).

[19] Severiano, A., Carrico, J. A., Robinson, D. A., Ramirez, M. and Pinto, F. R., “Evaluation of Jackknife and Bootstrap for Defining Confidence Intervals for Pairwise Agreement Measures,” PLoS ONE, Vol. 6, pp. 1–11 (2011).

[20] Arvesen, J. N., “Jackknifing U-Statistics,” The Annals of Mathematical Statistics, Vol. 40, pp. 2076–2100 (1969).

[21] Schechtman, E. and Wang, S., “Jackknifing Two-Sample Statistics,” Journal of Statistical Planning and Inference, Vol. 119, pp. 329–340 (2004).

[22] Chao, A., Hwang, W.-H., Chen, Y.-C. and Kuo, C.-Y., “Estimating the Number of Shared Species in Two Communities,” Statistica Sinica, Vol. 10, pp. 227–246 (2000).

[23] Condit, R., Watts, K., Bohlman, S. A., Perez, R., Hubbell, S. P. and Foster, R. B., “Quantifying the Deciduousness of Tropical Forest Canopies under Varying Climates,” Journal of Vegetation Science, Vol. 11, pp. 649–658 (2000).

[24] Mouillot, D. and Lepretre, A., “Introduction of Relative Abundance Distribution (RAD) Indices, Estimated from the Rank-Frequency Diagrams (RFD), to Assess Changes in Community Diversity,” Environmental Monitoring and Assessment, Vol. 63, pp. 279–295 (2000).

Manuscript Received: Aug. 8, 2011

Accepted: Nov. 14, 2011
