STATISTICS
HYPOTHESES TEST (III)
Nonparametric Goodness-of-fit (GOF) tests
Professor Ke-Sheng Cheng
Department of Bioenvironmental Systems Engineering
National Taiwan University
Description of nonparametric problems
• Until now, in the estimation and hypotheses testing problems, we have assumed that the available observations come from distributions for which the exact form is known, even though the values of some parameters are unknown. In other words, we have assumed that the observations come from a certain parametric family of distributions, and a statistical inference must be made about the values of the parameters defining that family.
04/21/23Dept of Bioenvironmental Systems Engineering
National Taiwan University2
• In many situations, we do not assume that the available observations come from a particular family of distributions. Instead, we want to study inferences that can be made about the distribution from which the observations come, without making special assumptions about the form of that distribution.
• For example, we might simply assume that observations form a random sample from a continuous distribution, without specifying the form of this distribution any further; and we then investigate the possibility that this distribution is a normal distribution.
• Problems in which the possible distributions of the observations are not restricted to a specific parametric family are called nonparametric problems, and the statistical methods that are applicable in such problems are called nonparametric methods.
Goodness-of-fit test
• A very common statistical problem in hydrological frequency analysis or water resources planning is whether the available observations (a random sample) come from a particular type of distribution. For example, before we can estimate the magnitude of the 24-hour rainfall depth with a 100-year return period, we must identify, through statistical tests, the type of probability distribution for the rainfall data (the annual maximum series).
• Let's consider statistical problems based on data such that each observation can be classified as belonging to one of a finite number of possible categories. Suppose a large population consists of data of k different categories, and let $p_i$ denote the probability that an observation will belong to category i (i = 1, 2, …, k). Of course, $p_i \ge 0$ for i = 1, 2, …, k and $\sum_{i=1}^{k} p_i = 1$.
• Therefore, it seems reasonable to base a test on the values of the differences $n_i - e_i$ (observed minus expected counts) for i = 1, 2, …, k and reject $H_0$ when the magnitudes of these differences are relatively large.
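The idea of comparing observed and expected counts can be sketched in R. This is a minimal illustration, not taken from the slides: the counts `n_obs` and the null probabilities `p0` are hypothetical, and the expected counts are assumed to be $e_i = n p_i$ under $H_0$. One standard way to aggregate the differences into a single number is Pearson's chi-square statistic, computed in the last step.

```r
# Hypothetical data: 60 observations classified into k = 3 categories.
n_obs <- c(25, 20, 15)      # observed counts n_i (made up for illustration)
p0    <- c(0.5, 0.3, 0.2)   # category probabilities under H0 (assumed)

n     <- sum(n_obs)         # total sample size
e_exp <- n * p0             # expected counts e_i = n * p_i under H0
diffs <- n_obs - e_exp      # differences n_i - e_i

print(diffs)                # the differences always sum to zero

# Pearson's chi-square statistic: squared differences scaled by e_i.
Q <- sum(diffs^2 / e_exp)
print(Q)
```

Because the raw differences cancel when summed, the statistic works with their squares; large values of `Q` speak against $H_0$.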
Chi-square GOF test
[Figure: number of categories versus sample size]
Kolmogorov-Smirnov GOF test
• The chi-square test compares the empirical histogram against the theoretical histogram.
• In contrast, the K-S test compares the empirical cumulative distribution function (ECDF) against the theoretical CDF.
• In order to measure the difference between $F_n(X)$ and $F(X)$, ECDF statistics based on the vertical distances between $F_n(X)$ and $F(X)$ have been proposed.
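A minimal sketch of one such statistic, the largest vertical distance between the ECDF and the hypothesized CDF, is given below. The sample, the seed, and the normal parameters (mean 30, standard deviation 10) are assumptions chosen only for illustration. The result is cross-checked against the statistic reported by R's `ks.test`.

```r
set.seed(1)                            # arbitrary seed, for reproducibility only
x  <- rnorm(40, mean = 30, sd = 10)    # hypothetical sample
n  <- length(x)
xs <- sort(x)                          # order statistics

# Hypothesized CDF F evaluated at the ordered observations.
F0 <- pnorm(xs, mean = 30, sd = 10)

# The ECDF is a step function: it equals i/n at the i-th order statistic
# and (i - 1)/n just below it, so both sides of each jump are checked.
i       <- seq_len(n)
D_plus  <- max(i / n - F0)             # largest distance with ECDF above F
D_minus <- max(F0 - (i - 1) / n)       # largest distance with ECDF below F
D_n     <- max(D_plus, D_minus)        # largest vertical distance overall

# Cross-check against the two-sided statistic from ks.test.
D_ref <- unname(ks.test(x, "pnorm", 30, 10)$statistic)
print(c(manual = D_n, ks.test = D_ref))
```

Checking both sides of every jump matters because the supremum of the distance can occur just before a data point rather than at it.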
Values of $D_{n,\alpha}$ for the Kolmogorov-Smirnov test
Goodness-of-fit tests using R
• χ² test for GOF
– chisq.test
– chisq.test does not account for any parameters estimated in determining the expected values.
– The test statistic therefore has k - 1 degrees of freedom.
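The following sketch shows a `chisq.test` call for a goodness-of-fit problem, using hypothetical counts and null probabilities. It also illustrates one common way to handle the degrees-of-freedom issue by hand: if m parameters of the hypothesized distribution had been estimated from the data, the p-value is often recomputed with k - 1 - m degrees of freedom. The value m = 1 here is purely an assumption for illustration.

```r
# Hypothetical observed counts in k = 3 categories and null probabilities.
n_obs <- c(25, 20, 15)
p0    <- c(0.5, 0.3, 0.2)

# Goodness-of-fit test with fully specified probabilities: df = k - 1.
res <- chisq.test(x = n_obs, p = p0)
print(res$statistic)    # Pearson chi-square statistic
print(res$parameter)    # degrees of freedom reported by chisq.test (k - 1)
print(res$p.value)

# Manual adjustment, assuming m parameters were estimated from the data.
m      <- 1                                  # hypothetical number of estimated parameters
df_adj <- length(n_obs) - 1 - m              # k - 1 - m
p_adj  <- pchisq(unname(res$statistic), df = df_adj, lower.tail = FALSE)
print(p_adj)
```

With fewer degrees of freedom the same statistic yields a smaller p-value, so ignoring estimated parameters makes the test less likely to reject.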
• Kolmogorov-Smirnov GOF test
– ks.test (one-sample test)
ks.test(x, y, parameters, alternative = "…")
where x is the data vector to be tested, y is a character string naming the cumulative distribution function of the hypothesized distribution (e.g. "pnorm"), parameters are the values of the distribution parameters corresponding to y, and alternative is a character string ("less", "greater", or "two.sided") selecting a one-tailed or two-tailed test.
• Examples
ks.test(x, "pnorm", 30, 10, alternative = "two.sided")
ks.test(x, "pexp", 0.2, alternative = "greater")
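To make the example calls runnable, the sketch below supplies simulated data. The sample sizes, seed, and variable names `x_norm` and `x_exp` are assumptions introduced for illustration; the distribution parameters match those in the calls above.

```r
set.seed(123)                               # arbitrary seed, for reproducibility only

# Hypothetical samples drawn from the hypothesized distributions.
x_norm <- rnorm(50, mean = 30, sd = 10)
x_exp  <- rexp(50, rate = 0.2)

# Two-sided test against a normal distribution with mean 30 and sd 10.
res1 <- ks.test(x_norm, "pnorm", 30, 10, alternative = "two.sided")

# One-sided test against an exponential distribution with rate 0.2.
# In R, "greater" means the alternative is that the CDF of the data
# lies above the hypothesized CDF.
res2 <- ks.test(x_exp, "pexp", 0.2, alternative = "greater")

print(res1$statistic); print(res1$p.value)
print(res2$statistic); print(res2$p.value)
```

The parameters passed to ks.test are treated as fully specified in advance. If they are instead estimated from the same sample, the reported p-value is no longer valid as it stands.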