arXiv:2111.02252v1 [stat.ME] 3 Nov 2021

A GOODNESS-OF-FIT TEST BASED ON A RECURSIVE PRODUCT OF SPACINGS

Philipp Eller ∗1,2 and Lolian Shtembari † 3

1 Technical University Munich, Garching, Germany
2 Exzellenzcluster ORIGINS, Garching, Germany

3Max Planck Institute for Physics, Munich, Germany

November 4, 2021

ABSTRACT

We introduce a new statistical test based on the observed spacings of ordered data. The statistic is sensitive to non-uniformity in random samples and to short-lived features in event time series. Under some conditions, this new test can outperform existing ones, such as the well-known Kolmogorov-Smirnov or Anderson-Darling tests, in particular when the number of samples is small and differences occur over a small quantile of the null hypothesis distribution. A detailed description of the test statistic is provided, including an illustration and examples, together with a parameterization of its distribution based on simulation.

Keywords: test statistic, hypothesis test, goodness of fit, spacing, interarrival time, waiting time, gap

1 Introduction

Assessing the goodness-of-fit of a distribution given a number of random samples is an often-encountered problem in data analysis. Such statistical hypothesis tests find applications in many fields, ranging from the natural and social sciences over engineering to quality control. Several non-parametric tests exist, some of which have become standard tools, including the Kolmogorov-Smirnov (KS) test [1, 2] or the Anderson-Darling (AD) test [3]. Reference [4] provides a comprehensive overview of existing tests, and a comparison of their performance for the case of detecting non-uniformity for a set of alternative distributions.

In this work, we are instead interested in the case where the bulk of samples is actually distributed according to the null hypothesis, and only a few additional samples following a different distribution are introduced, representing a narrow excess over a known background. We present the new test statistic "recursive product of spacings", or RPS for short, which is based on the spacings between ordered samples and is introduced in Sec. 2. A performance comparison to several other test statistics is provided. In Sec. 3.1 we provide a parametrization of its distribution based on simulation, followed by a discussion of the quality of the approximation. The rest of the article focuses on illustrations, comparisons and examples.

1.1 Goodness-of-fit Tests

Suppose that we have obtained n samples x_i, and want to quantitatively test the hypothesis of those samples being random variates of a known distribution f(x), i.e. independent and identically distributed (i.i.d.) according to f(x). Here, we consider only continuous distributions f(x) with cumulative distribution function F(x), and hence can transform samples onto the unit interval [0, 1] via y_i = F(x_i). This reduces the task at hand to testing whether the transformed samples y_i are distributed according to the standard uniform distribution U(0, 1).

∗ [email protected]
† [email protected]
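This probability integral transform can be sketched in a few lines; the unit-exponential null distribution below is purely our illustrative choice, not one used in the paper:

```python
import math
import random

def to_unit_interval(samples, cdf):
    """Probability integral transform: y_i = F(x_i) maps H0 samples to U(0, 1)."""
    return [cdf(x) for x in samples]

# Illustrative null hypothesis (our assumption): a unit exponential, F(x) = 1 - exp(-x).
def exp_cdf(x):
    return 1.0 - math.exp(-x)

random.seed(1)
xs = [random.expovariate(1.0) for _ in range(100)]
ys = to_unit_interval(xs, exp_cdf)  # under H0 these are i.i.d. U(0, 1)
```

Any continuous null distribution works the same way; only `exp_cdf` would change.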



First, let us briefly introduce other, existing test statistics to which we will compare the RPS statistic. We consider in particular two groups of statistics: those based on the empirical distribution (EDF statistics), and those based on the spacings between ordered samples (spacings statistics). A comprehensive overview of existing test statistics can be found in Ref. [4].

1.1.1 EDF Statistics

This class of test statistics compares the empirical distribution function (EDF) F_n(x) to the cumulative distribution function (CDF) F(x) (here F(x) = x). In particular, the following are widely used and considered here:

• Kolmogorov-Smirnov (KS) [1, 2]: D_n = sup_x |F_n(x) − F(x)|

• Cramér-von Mises (CvM) [5, 6]: T = n ∫_{−∞}^{∞} (F_n(x) − F(x))^2 dF(x)

• Anderson-Darling (AD) [3]: A^2 = n ∫_{−∞}^{∞} (F_n(x) − F(x))^2 / [F(x)(1 − F(x))] dF(x)

Similar in spirit is a class of statistics defined on the ordered set. Given the n samples {x_1, x_2, ..., x_n}, we define the ordered set of samples as {x_(1), x_(2), ..., x_(n)}, where x_(i) < x_(i+1) ∀i. The expected value of ordered sample i is i/(n+1), and we define the deviation from the expected value as δ_i = x_(i) − i/(n+1) for each sample i. Based on this we can write out the following two statistics:

• Pyke's Modified KS (C) [7, 8]: C_n = max(max(δ_i), −min(δ_i))

• Brunk's Modified KS (K) [9]: K_n = max(δ_i) − min(δ_i)

1.1.2 Spacings Statistics

Based on the ordered set, we can further define the n+1 spacings s_i = x_(i) − x_(i−1), with x_(0) = 0 and x_(n+1) = 1. Several test statistics built from these spacings are considered in the literature, including:

• Moran (M) [10]: M = −∑_{i=1}^{n+1} log s_i

• Greenwood (G) [11]: G = ∑_{i=1}^{n+1} s_i^2

In the context of a fixed rate Poisson process, these spacings can also be interpreted as interarrival times or waitingtimes. In some other areas, spacings are also referred to as gaps.
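Both statistics follow directly from the augmented ordered set; a minimal sketch:

```python
import math

def spacings(samples):
    """The n+1 spacings of the ordered samples, with x_(0) = 0 and x_(n+1) = 1."""
    x = [0.0] + sorted(samples) + [1.0]
    return [x[i] - x[i - 1] for i in range(1, len(x))]

def moran(samples):
    """Moran statistic: negative sum of log spacings."""
    return -sum(math.log(s) for s in spacings(samples))

def greenwood(samples):
    """Greenwood statistic: sum of squared spacings."""
    return sum(s * s for s in spacings(samples))
```

For equidistant samples such as [0.25, 0.5, 0.75], all four spacings equal 0.25, giving the minimal Greenwood value for n = 3.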

So-called higher order spacings can be defined by summing up neighbouring spacings. Here we consider the overlapping m-th order spacings s^{(m)}_i = x_(i+m) − x_(i). With those, we can define generalisations of Moran and Greenwood, respectively, as discussed by Cressie:

• Logarithms of higher order spacings (L_m) [12]: L^{(m)}_n = −∑_{i=0}^{n−m+1} log s^{(m)}_i

• Squares of higher order spacings (S_m) [13]: S^{(m)}_n = ∑_{i=0}^{n−m+1} (s^{(m)}_i)^2

For our comparisons presented later, we choose m = 2 and m = 3, respectively, to limit ourselves to a finite list of tests.
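As a sketch, the higher order spacings and the two generalised statistics can be computed over the same augmented ordered set (x_(0) = 0, x_(n+1) = 1); the function names are ours:

```python
import math

def higher_order_spacings(samples, m):
    """Overlapping m-th order spacings s^(m)_i = x_(i+m) - x_(i), i = 0..n-m+1."""
    x = [0.0] + sorted(samples) + [1.0]  # augmented ordered set
    n = len(samples)
    return [x[i + m] - x[i] for i in range(n - m + 2)]

def cressie_log(samples, m):
    """L^(m)_n: negative sum of logs of the m-th order spacings."""
    return -sum(math.log(s) for s in higher_order_spacings(samples, m))

def cressie_square(samples, m):
    """S^(m)_n: sum of squares of the m-th order spacings."""
    return sum(s * s for s in higher_order_spacings(samples, m))
```

With m = 1 this reduces to the ordinary n+1 spacings, recovering Moran and Greenwood.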

Other statistics based on spacings exist and are being actively developed and used, such as, for example, tests based on the k smallest or largest spacings [14].

2 Recursive Product of Spacings (RPS)

In this work, our goal is to construct a new test statistic that is sensitive to narrow features or clusters in an otherwise uniform distribution of samples. The tell-tale sign we are looking for is a localized group of uncommonly small spacings of the ordered data. For this purpose, we propose a new class of test statistics that include higher order spacings in a recursive way.

The recursive product of spacings (RPS) can be thought of as an extension of the Moran statistic, and is defined as

RPS(n) = M_{n+1} + M_n + ... + M_1, (1)


where the term M_{n+1} is the simple sum of negative log spacings, equivalent to the Moran statistic:

M_{n+1} = −∑_{i=1}^{n+1} log(s_i). (2)

All following terms are computed in the same way,

M_j = −∑_{i=1}^{j} log(s^j_i), (3)

but with modified spacings s^j_i, defined as

s^j_i = (s^{j+1}_i + s^{j+1}_{i+1}) / ∑_i s^j_i, (4)

of which there are j, and which depend on the spacings s^{j+1} used to compute the previous term M_{j+1} (hence the recursiveness). We can see that the term M_n is identical to L^{(2)}_n up to a normalization factor 1/∑_i s_i. It is important to include such a normalization for the spacings of each level, as this ensures that the case of equidistant spacings—the most regular and uniform case—yields the smallest possible RPS value. This minimum value of RPS(n), given by the configuration of equidistant samples, can be expressed easily, as each spacing s^j is equal to 1/j, and thus:

min(RPS(n)) = −∑_{j=1}^{n+1} ∑_{i=1}^{j} log(1/j) = ∑_{j=1}^{n+1} j · log(j). (5)

At the other extreme, very small spacings yield a large contribution to the sum in Eq. 3, thus max(RPS(n)) = ∞ for any given number of samples n. These extrema show that RPS measures the irregularity of the sample positions. The RPS statistic increases the more samples aggregate into local clusters.

The RPS quantity calculated so far has infinite support. Since approximating the tails of distributions is often easier when dealing with a bounded quantity, we transform the RPS in order to bound its support to the range [0, 1], via

RPS∗(n) = min(RPS(n)) / RPS(n), (6)

into a new quantity RPS∗ that we will consider when using our test statistic.

The following pseudo code illustrates how the computation of the RPS value can be implemented:

Algorithm 1 Calculates the recursive product of spacings rps from ordered samples x_(i)

x = [0, x_(1), x_(2), ..., x_(n), 1]
rps = 0
s = x[first+1 : last] − x[first : last−1]        ▷ initial spacings
while len(s) > 1 do
    rps = rps − sum(log(s))
    s = s[first : last−1] + s[first+1 : last]    ▷ spacings for next iteration
    s = s / sum(s)                               ▷ normalize
normalized_rps = min_rps(n) / rps

This algorithm has a computational complexity of O(n^2), and can become inefficient for very large sample sizes n. In this work we limit ourselves to n ≤ 1000.
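The pseudo code translates directly into Python. The following self-contained sketch (ours, not the authors' reference packages) implements the recursion and the bounded form of Eq. 6, and reproduces the statistic value quoted for the example array in Sec. 3.3:

```python
import math

def rps_statistic(samples):
    """Recursive product of spacings, following Algorithm 1.

    Returns the bounded form RPS* = min(RPS(n)) / RPS(n) of Eq. 6.
    """
    n = len(samples)
    x = [0.0] + sorted(samples) + [1.0]          # augment with x_(0) = 0, x_(n+1) = 1
    s = [x[i + 1] - x[i] for i in range(n + 1)]  # the n+1 initial spacings
    rps = 0.0
    while len(s) > 1:
        rps -= sum(math.log(si) for si in s)     # add the current level's M_j
        s = [s[i] + s[i + 1] for i in range(len(s) - 1)]  # overlapping pair sums
        total = sum(s)
        s = [si / total for si in s]             # normalize: each level's spacings sum to 1
    min_rps = sum(j * math.log(j) for j in range(2, n + 2))  # Eq. 5 (j = 1 term is 0)
    return min_rps / rps

rps_star = rps_statistic([0.1, 0.4, 0.76])  # ≈ 0.9547, the Sec. 3.3 example value
```

The loop stops once a single (normalized) spacing remains, since M_1 = −log(1) = 0 contributes nothing.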

In an analogous way, we can also define an extension of Greenwood, G(n), which instead of logarithms of spacings sums over the squares of spacings. This means that we substitute Eq. 3 with

M_j = ∑_{i=1}^{j} (s^j_i)^2,

while keeping the definition of s^j_i from Eq. 4. We call this recursive form the "RSS" test statistic in the following comparison.


2.1 Illustration

To better illustrate how our test statistic works, and to highlight differences from other tests, we use an example set of samples drawn from a uniform (null hypothesis H0) and a non-uniform distribution, respectively, shown in Fig. 1.


Figure 1: Example of 15 standard uniformly distributed samples (left) and 10 standard uniformly + 5 normally (µ = 0.5, σ = 0.1) distributed samples (right). The sample positions on the [0, 1] interval are annotated via the arrows and text.

The Moran test is based on the spacings between samples, and the smallest and largest spacings in this specific example are present in the uniform case. This leads to a more extreme test statistic value t, and hence a p-value p = P(T ≥ t | H0) of 0.117 for the uniform case, while it evaluates to p = 0.335 in the non-uniform case. The feature of samples clustering locally together, as in the non-uniform case, is—by construction—completely missed by Moran's test.

The KS test can detect such clustering via the CDF; however, in our chosen example it is challenged by the fact that samples trend towards the left in the uniform case, while they are more balanced in the non-uniform case. This leads to p-values of 0.048 for uniform and 0.356 for non-uniform, respectively.

The RPS test, however, which also takes into account spacings between spacings, finds a p-value of 0.532 for the uniform case, and a much lower p-value of 0.057 for the non-uniform samples. The behaviour of RPS is further illustrated in Fig. 2, which shows the individual contributions of the spacings at all recursion levels that build up the test statistic value. The Moran statistic corresponds to the sum over the first row (M16), while all subsequent levels are added for RPS. The uniform samples exhibit the most extreme values in the first level, but then even out rapidly. Contrary to that, the non-uniform samples with their clustering give rise to larger values at later levels, explaining the lower observed p-value.


Figure 2: Illustration of the test statistic contributions from all recursion levels (M16 down to M2) for the uniformly distributed samples (left) and the non-uniform samples (right). The sum over the first level only (M16) is equivalent to the Moran statistic.

2.2 Comparison

We now compare the performance of the RPS test statistic to all others referenced in the introduction. We are interested in detecting small changes in an otherwise uniform distribution, and therefore construct the following generic benchmark scenario: for one simulation of a specific test case H_K(n, s, w), we generate (1 − s) · n random variates³ from a

³ Numbers of samples are rounded to the closest integer.


standard uniform distribution U(0, 1), where s is the signal fraction. In addition, we include s · n samples distributed according to Δ + U(0, w) with the offset Δ = U(0, 1 − w), i.e. a narrower uniform distribution of width w over a random interval within (0, 1).

To test a wide range of parameter settings, we choose n = 10, 100, 1000, w = 0.01, 0.03, 0.1, 0.3 and s = 1/√n, resulting in twelve specific test cases H_K. To assess the performance of the different statistics T, we generate 10,000 MC simulations for each test case H_K and also for the null hypothesis H0—i.e. n samples from a standard uniform distribution. For our comparison we fix the type I error α and use null hypothesis trials to compute critical values c_T for each test statistic T:

α = P(T < c_T | H0). (7)

Based on those, we then compute the type II errors β_T for the different test cases H_K:

β_T(H_K, α) = P(T ≥ c_T(α) | H_K). (8)
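Under the convention of Eqs. 7 and 8 (small values of T are extreme, as for RPS∗), this MC procedure can be sketched as follows; the toy stand-in distributions below are ours, not the paper's benchmark:

```python
import random

def critical_value(null_trials, alpha):
    """c_T such that a fraction alpha of null-hypothesis trials satisfies T < c_T (Eq. 7)."""
    t = sorted(null_trials)
    return t[int(alpha * len(t))]

def type_ii_error(alt_trials, c_t):
    """Type II error: fraction of alternative trials NOT rejected, i.e. T >= c_T (Eq. 8)."""
    return sum(t >= c_t for t in alt_trials) / len(alt_trials)

rng = random.Random(0)
# Toy stand-ins: take the null statistic as U(0, 1); under the alternative it
# piles up near 0 (small values being extreme, as for RPS*).
null_trials = [rng.random() for _ in range(10_000)]
alt_trials = [rng.random() ** 3 for _ in range(10_000)]
c = critical_value(null_trials, 0.05)
beta = type_ii_error(alt_trials, c)
```

For the real benchmark, `null_trials` and `alt_trials` would hold the statistic evaluated on the H0 and H_K simulations, respectively.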

Figure 3 summarizes the results of this comparison for α = 0.05 in the form of a grouped bar chart. We can see that in our benchmark, the RPS statistic performs comparatively well across the board. For some individual test cases H_K, other tests do perform slightly better, for instance L2 in the case of n = 10, or several EDF statistics for w = 0.3.


Figure 3: Type II error β for a fixed type I error α = 0.05 for the various test statistics (L1 (Moran), L2, L3, S1 (Greenwood), S2, S3, KS, AD, CvM, C, K, RPS, RSS) under the different test cases. Errors are estimated from 10,000 MC trials.

As a summary of the performance of the test statistics T across all test cases K, we compute a total inefficiency number i_T, similar to Ref. [4]. This is defined as the difference in type II error to the best performer per test case, i_T(H_K), summed over all test cases:

i_T(α) = ∑_K i_T(H_K, α) = ∑_K [ β_T(H_K, α) − min_T β_T(H_K, α) ]. (9)

This inefficiency shows how much larger our total error β would be for the choice of a single test T, as opposed to selecting the best test for every case. Table 1 indicates that RPS indeed performs best. The statistic L2, which outperforms RPS in some cases, is much more inefficient when considering all test cases. The analogous extension of Greenwood—the recursive sum of squares RSS—performs better than any of the sums of squared spacings S_m, but is much less sensitive than RPS. Therefore, in the following we will focus only on RPS, but analogous procedures could be followed for RSS.

3 Cumulative Distribution of RPS

In order to use RPS as a statistical test yielding p-values, we need its cumulative distribution F. In the case of n = 1, which has only two spacings—the simplest non-trivial case we can encounter—we can easily derive the distribution of RPS∗(1), which is:

F_{RPS∗}(x; n = 1) = 1 − √(1 − 4^{(x−1)/x}). (10)


Test Statistic    Total Inefficiency i_T(α = 0.05)
L1 (Moran)        2.9086
L2                1.6281
L3                1.4739
S1 (Greenwood)    3.8974
S2                3.7279
S3                3.6526
KS                3.2216
AD                3.4064
CvM               3.5021
C                 3.1441
K                 2.7634
RPS               0.4093
RSS               2.0276

Table 1: Total inefficiency (as defined in Eq. 9) for the various test statistics over the set of test cases.

For n ≥ 2, however, it is not simple to derive this distribution. Therefore, we resort to numerically approximating the distribution of RPS∗, as discussed in the following section.

3.1 Approximate Distribution

We have built an approximation for the cumulative distribution F_{RPS∗}(x; n) precise enough to compute meaningful p-values down to relatively extreme values of 10^{-7}, and for large sample sizes n of up to 1000. Figure 4 shows some examples of RPS∗ distributions for a few values of n.


Figure 4: Example CDFs of the RPS∗ distribution for a few different values of n (n = 1, 2, 5, 13, 31, 74, 177, 421, 1000). N.B.: the x-axis is displayed on an inverted logarithmic scale.

We base our approximation on simulation, drawing events with uniform distribution in the range [0, 1] for a given n, and collecting N = 10^8 samples of RPS∗(n). Such a simulation could be directly used to calculate p-value estimates by counting the fraction of trials below or above an observed RPS∗ value x for a fixed n. However, we want to provide a continuous and smooth function valid for any n ≤ 1000. For this, we use the simulated data to infer the values x of our test statistic corresponding to a discrete list of specific quantiles p ∈ [10^{-7}, 1 − 10^{-7}]. Taking the i-th element in the sorted simulation set gives an estimate for the value of x(p = i/N). With bootstrapping [15] we can improve these estimates by collecting different realisations of x, obtained by resampling the original dataset with replacement. The result is a distribution of values of x for each p, from which we then extract the mean and the standard deviation, indicative of the error (see Fig. 5). Repeating this for all p, we obtain a list of points x(p) with mean and error for a choice of n.
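The bootstrap step can be sketched as follows; this is a simplified stand-in for the paper's procedure, using a plain empirical-quantile estimator and far fewer samples than the N = 10^8 actually used:

```python
import math
import random

def bootstrap_quantile(samples, p, n_boot=200, seed=0):
    """Mean and spread of the empirical p-quantile, from resampling with replacement."""
    rng = random.Random(seed)
    n = len(samples)
    estimates = []
    for _ in range(n_boot):
        resampled = sorted(rng.choices(samples, k=n))
        estimates.append(resampled[min(int(p * n), n - 1)])
    mean = sum(estimates) / n_boot
    std = math.sqrt(sum((e - mean) ** 2 for e in estimates) / n_boot)
    return mean, std

# Illustrative toy data: an evenly spaced stand-in for sorted RPS* trials.
trials = [i / 1000 for i in range(1001)]
x_mean, x_err = bootstrap_quantile(trials, 0.5)
```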

It would be inefficient to produce such a simulation for every n, and hence we repeat the above procedure for only 180 different choices of n between 2 and 1000, following approximately a logarithmic spacing. We use a 1-d cubic B-spline interpolation per quantile p along the n-axis, based on the bootstrapping mean from the simulation. The spline is


allowed to deviate from the means within the uncertainties estimated from the bootstrapping standard deviations, and can thereby provide a more accurate approximation for any n by smoothing out stochastic noise. Points from the analytic solution for n = 1 (Eq. 10) are added to the list as anchor points at the boundary. Figure 5 shows a fitted spline representation of x(n|p) for different values of p. This procedure is evaluated independently for all p-values.

Based on the resulting list of corresponding p and x values, which can now be obtained for any n, we generate another spline interpolation as the approximation of the desired cumulative distribution F(x; n) for a given n. As the cumulative distribution function F is strictly monotonic in x, we use the Fritsch & Butland [16] monotonic spline interpolation on the points [x(p; n), p] to produce the final CDFs, shown in Figure 4 for a few values of n.

Figure 5: Example of spline-fitted x-values across n for a few extreme p-values. The colored bands show the 1, 2 and 3 sigma bands estimated via bootstrapping; the black dashed lines show the approximations by the spline fits.

3.2 Validation of Approximate Distribution

For the validation of our approximate CDF of the RPS test statistic, we produced additional reference simulations for a few choices of the sample size n = 25, 75, ..., 975, 1000.⁴ These new sets contain 10^9 realisations each, i.e. one order of magnitude higher statistics than what the approximation is based on.

For a given n and a p-value p_test that we would like to validate, we first compute the corresponding RPS∗ value x_test given our parameterization. For this value we then use the reference data and bootstrapping to generate a distribution of reference p-values p_ref. The deviation of the value p_test from two extrema of the bootstrap distribution, the 1% quantile and the 99% quantile, is then what we define as the error of our approximation. Since in most applications the relative error on the p-value is of interest, we divide the error by p_test.

The result of repeating the above procedure for a range of test values p_test for the example case of n = 75 is shown in Fig. 6a. The errors increase towards smaller p-values and exhibit an approximately linear behaviour in the log-log plot. Fig. 6c shows the same errors, but divided by the square root of the p-value instead, which is approximately constant. We see that the estimated upper bound of the relative error for a p-value of 10^{-3} is below 1%, while for a p-value of 10^{-5} it increases to < 10% and ultimately to < 100% for p-values of 10^{-7}. Such a "large" relative error for small p-values may sound alarming at first, but estimating a p-value of 10^{-7} and knowing it could actually be closer to 2 · 10^{-7} would hardly change the statistical interpretation of a result.

Figure 6b shows the larger of the two errors as a function of n for a few choices of p-values spanning a large range. This illustrates that the error introduced by our approximation behaves similarly for any tested n, whether included in the spline fitting or not, underlining the robustness of our approximation.

The RPS∗ samples produced for the training and validation datasets were obtained using the Mersenne-Twister pseudo-random number generator (RNG) [17]. Given the large number of samples used to construct and validate our approximate distributions, we test the consistency of our error estimates by re-sampling a few of the validation datasets using the Linux kernel's native entropy RNG [18]. Figure 6d shows the difference in the root relative squared error curves between the two RNGs. The observed differences are at a level of 10^{-5} or less, which is more than an order

⁴ Samples of size n = 25, 125, and 1000 were also generated for the approximation, whereas the rest of the points were not.


(a) Estimated relative error of the fitted p-value with respect to p-values obtained via bootstrapping. The vertical axis reports the scale of the relative error in percent for two extremes, the 1% and the 99% quantile of the bootstrapping distribution.

(b) Maximum of the relative errors (as shown in panel 6a) for a few different choices of p-values across all validation data sets n.

(c) Estimated root of the relative squared error of the fitted p-value with respect to p-values obtained via bootstrapping, with respect to two extremes, the 1% and the 99% quantile of the bootstrapping distribution.

(d) Comparison of the root of the relative squared errors (as shown in Fig. 6c) estimated using random samples obtained with a Mersenne-Twister RNG or using Linux's entropy, shown as their difference relative to two extremes, the 1% and the 99% quantile of the bootstrapping distribution.

Figure 6: Validation of the approximate RPS distribution using a Mersenne-Twister RNG for a fixed n = 75.

of magnitude smaller than the root relative squared error of our approximation; we therefore conclude that no significant effect of the choice of RNG can be observed.

3.3 Implementation

The RPS test is made available as open-source packages for Python⁵ and Julia⁶, respectively, with the p-value parametrizations initially available up to 1000 samples.

Below we give a minimal example evaluating the RPS test for an array x in both language implementations, with x being:

x = [0.1, 0.4, 0.76]

⁵ https://pypi.org/project/spacings/
⁶ https://juliapackages.com/p/spacingstatistics

8

Page 9: arXiv:2111.02252v1 [stat.ME] 3 Nov 2021

A PREPRINT - NOVEMBER 4, 2021

The Python library can be used as follows:

>>> from spacings import rps
>>> rps(x, "uniform")
RPStestResult(statistic=0.9547378863245608, pvalue=0.8865399970192409)

and the Julia equivalent gives identical results:

>>> using SpacingsTests
>>> rps(x, Uniform())
(statistic = 0.9547378863245608, pvalue = 0.8865399970192409)

4 Extended Performance Comparison

This section presents a more in-depth performance comparison of several tests on the same scenario as used in Sec. 2.2 to detect small distortions in an otherwise uniform distribution of samples. The main difference is that here we require that p-values can be computed. This is the case for KS, AD, CvM and Moran, for which exact or approximate distributions exist, and for RPS, where we use the parametrization discussed earlier.

In this comparison, we vary all three parameters of H_K(n, s, w), i.e. the number of samples n, as well as the fraction s and width w of the injected signal events. A sensitive test should be able to detect the presence of the added, narrower signal samples by reporting a low p-value.

Figure 7 shows the performance of several tests as a function of the above three parameters. As a metric, we show the median p-value obtained from repeated trials, and we interpret a lower reported median p-value as a more powerful test. This number can be interpreted as the median significance at which we expect to be able to reject the null hypothesis. For all tested scenarios, the RPS test performs either on par with or significantly better than the Moran test. The EDF-based tests (KS, AD and CvM) start to dominate in terms of performance only for relatively wide signals of around 25% total width or more. When analysing the goodness-of-fit given a large number of samples, i.e. of order several hundred, the differences between RPS and the EDF-style tests become smaller. Overall, the outcome of this performance study suggests that when signals are expected with widths spanning less than a 25% quantile of the null hypothesis distribution, and if the number of samples is n < 1000, the RPS test compares very favourably against all others considered.

We also investigated other metrics to judge the tests' performance, such as the area under the receiver operating characteristic (ROC) curve between signal and null hypothesis trials. The overall picture does not change substantially.

5 Example

In this section, we illustrate how the RPS test could be used in a physics scenario. We consider a detector that collects a number of events in an observable x, where x could for example be the energy of an event, the detection time, or a reconstructed quantity like an invariant mass. We expect some or all of the observed events to follow a known background distribution f_B(x), but there may be an additional contribution of events from an unknown signal distribution f_S(x)—such as a rare, exotic particle decay with unknown mass. Hence we want to quantify the goodness-of-fit of the background-only model to our data. A resulting low p-value could indicate the presence of events distributed according to an additional, unknown signal distribution.

In the example here, we use an exponential distribution f_B(x) = e^{-x} for the background model (null hypothesis). In order to illustrate how the presence of an actual signal (alternative hypothesis) would affect the outcome, we also inject additional events following a normal distribution centred at x = 1 with width σ = 0.05. The number of events is Poisson fluctuated for both background and signal, with expected values of 〈n_b〉 = 100 and 〈n_s〉 varied as specified. In Fig. 8, an example distribution of observed events is shown, together with the assumed background distribution and the distribution with injected signal (here 〈n_s〉 = 5).

The example case chosen is similar, for instance, to a search for an exotic particle with unknown mass—a problem sometimes referred to as "bump hunting". In this case, x would represent an invariant mass.

N.B., we do not assume that we know the rate of the underlying processes, meaning that the number of observed counts is not included in our analysis other than for the calculation of the test statistic. This means that we test for the "shape" of the distribution, not its normalization.
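A sketch of how such pseudo-data could be generated and mapped onto [0, 1] through the background-only CDF; the Poisson sampler is our own illustrative helper and the seed is arbitrary:

```python
import math
import random

def poisson(rng, mean):
    """Poisson sampler via Knuth's multiplication method (our illustrative helper)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(7)                                   # arbitrary seed
n_b = poisson(rng, 100)                                  # fluctuated background count
n_s = poisson(rng, 5)                                    # fluctuated signal count
background = [rng.expovariate(1.0) for _ in range(n_b)]  # f_B(x) = exp(-x)
signal = [rng.gauss(1.0, 0.05) for _ in range(n_s)]      # narrow bump at x = 1
# Map everything through the background-only CDF F_B(x) = 1 - exp(-x):
# under H0 the result is U(0, 1); an injected signal appears as a cluster
# near F_B(1) ≈ 0.63, exactly the kind of feature RPS targets.
y = sorted(1.0 - math.exp(-v) for v in background + signal)
```

The array y would then be handed to a goodness-of-fit test, e.g. rps(y, "uniform") from the spacings package of Sec. 3.3.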



Figure 7: Comparison of the performance (median p-value of repeated trials, individual panels' xy-axes) as a function of the number of total samples (outer y-axis), the width of the signal samples (outer x-axis), and the fraction of signal samples (individual panels' x-axis). The number of signal samples is rounded to the closest integer, hence the "step"-like features visible mostly in the first few rows.


Figure 8: Example physics problem, with observed events distributed in x. We test the goodness-of-fit of the background-only model (blue) to the samples. Here the samples have been generated according to a different distribution with an injected signal (orange).

5.1 Analysis

The p-value distributions under the assumption of H0 (i.e. only background is present) for repeated trials with 〈nb〉 = 100, and various injected 〈ns〉 = [0, 3, 6, 9, 12, 15], are shown in Fig. 9. All distributions with no signal (〈ns〉 = 0) show a flat p-value distribution as expected, since in that case all events are drawn from the background distribution pB. For trials with injected signal, the distributions trend towards smaller p-values, indicating the worsened goodness-of-fit of the background-only model. In the example, all tests exhibit this behaviour, while the RPS test offers the largest rejection probability of the null hypothesis.
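A repeated-trials study of this kind can be sketched as below. Since the RPS statistic itself is not available in standard libraries, the sketch uses the KS test via `scipy.stats.kstest` purely for illustration; the trial structure (Poisson-fluctuated counts, optional signal injection, shape-only comparison against the Exp(1) CDF) follows the text, while the function names and seed are our own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed

def trial_pvalue(n_b_exp=100, n_s_exp=0):
    """One pseudo-experiment: draw events and test the
    background-only Exp(1) shape hypothesis with a KS test."""
    events = rng.exponential(size=rng.poisson(n_b_exp))
    if n_s_exp > 0:
        events = np.concatenate(
            [events, rng.normal(1.0, 0.05, rng.poisson(n_s_exp))])
    # shape-only comparison against the assumed background CDF
    return stats.kstest(events, stats.expon.cdf).pvalue

# under H0 (no signal) the p-values should be uniform on [0, 1]
pvals = [trial_pvalue() for _ in range(200)]
```

Repeating the same loop with `n_s_exp > 0` reproduces the qualitative shift towards small p-values seen in Fig. 9.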

Figure 9: p-value distributions for background-only samples (〈ns〉 = 0) and background plus randomised signal injections, compared to the background model, for several choices of test statistic (RPS, Moran, KS, CvM).

We quantify the sensitivity of the analysis to reject the background-only model at different significance levels under the assumption of the presence of a signal. To this end, we check the median p-value of repeated trials, and at what value of 〈ns〉 it crosses specific critical values (see left panel of Fig. 10). In our chosen example, for a signal of strength 〈ns〉 = 10 we expect to reject the background-only model using RPS at the 2σ significance level7, whereas for the other tests a signal of at least 〈ns〉 = 20 is needed to achieve the same. Such a large signal of 〈ns〉 = 20 would allow the background-only model to be rejected at > 4σ significance with the RPS test.

7A significance level in terms of a number k of standard deviations σ can be translated to a p-value as one minus the integral over a unit normal distribution from −k to +k.
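The conversion in the footnote can be written compactly with the error function, since the integral of a unit normal from −k to +k equals erf(k/√2). A minimal sketch (the function name is ours):

```python
import math

def sigma_to_pvalue(k: float) -> float:
    """p-value corresponding to a significance of k standard deviations:
    one minus the integral of a unit normal from -k to +k."""
    return 1.0 - math.erf(k / math.sqrt(2.0))

# e.g. 2 sigma corresponds to p ~ 0.0455
```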


Figure 10: The expected significance level at which the background-only model can be excluded under the assumption of a signal, as a function of 〈ns〉, for the different tests (RPS, Moran, KS, CvM).

6 Conclusions

The RPS test statistic is a sensitive measure to detect deviations of samples from a continuous distribution with known CDF. The analytic distribution of the RPS statistic is not available for n > 1, but a high-accuracy parameterization valid up to sample sizes of n = 1000 is provided in order to use RPS as a goodness-of-fit test. In the presented test scenarios, the RPS test significantly outperforms other tests under certain circumstances, in particular when the observed sample is small (n < 1000) and the introduced deviations are narrow, i.e. concentrated over a small quantile. In the example physics analysis presented, we show that the sensitivity of an experiment could be boosted by up to a factor of two by choosing the RPS test over others.

Acknowledgements

We would like to thank Allen Caldwell, Oliver Schulz and Johannes Buchner for helpful discussions and comments. This research was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC-2094 – 390783311.

References

[1] A. Kolmogorov, "Sulla determinazione empirica di una legge di distribuzione," G. Ist. Ital. Attuari, vol. 4, 1933.
[2] N. Smirnov, "Table for estimating the goodness of fit of empirical distributions," Annals of Mathematical Statistics, vol. 19, pp. 279–281, 1948.
[3] T. W. Anderson and D. A. Darling, "A test of goodness of fit," Journal of the American Statistical Association, vol. 49, no. 268, pp. 765–769, 1954.
[4] Y. Marhuenda, D. Morales, and M. C. Pardo, "A comparison of uniformity tests," Statistics, vol. 39, no. 4, pp. 315–327, 2005.
[5] H. Cramér, "On the composition of elementary errors," Scandinavian Actuarial Journal, vol. 1928, no. 1, pp. 13–74, 1928.
[6] R. von Mises, Wahrscheinlichkeit, Statistik und Wahrheit. Springer-Verlag, Berlin Heidelberg, 1928.
[7] R. Pyke, "The supremum and infimum of the Poisson process," The Annals of Mathematical Statistics, vol. 30, no. 2, pp. 568–576, 1959.
[8] J. Durbin, "Tests for serial correlation in regression analysis based on the periodogram of least-squares residuals," Biometrika, vol. 56, no. 1, pp. 1–15, 1969.
[9] H. D. Brunk, "On the range of the difference between hypothetical distribution function and Pyke's modified empirical distribution function," The Annals of Mathematical Statistics, vol. 33, no. 2, pp. 525–532, 1962.
[10] R. C. H. Cheng and M. A. Stephens, "A goodness-of-fit test using Moran's statistic with estimated parameters," Biometrika, vol. 76, no. 2, pp. 385–392, 1989.
[11] M. Greenwood, "The statistical study of infectious diseases," Journal of the Royal Statistical Society, vol. 109, no. 2, pp. 85–110, 1946.
[12] N. Cressie, "On the logarithms of high-order spacings," Biometrika, vol. 63, no. 2, pp. 343–355, 1976.
[13] N. Cressie, "An optimal statistic based on higher order gaps," Biometrika, vol. 66, no. 3, pp. 619–627, 1979.
[14] L. Shtembari and A. Caldwell, "On the sum of ordered spacings," 2020.
[15] B. Efron, "Bootstrap methods: another look at the jackknife," The Annals of Statistics, vol. 7, no. 1, pp. 1–26, 1979.
[16] F. N. Fritsch and J. Butland, "A method for constructing local monotone piecewise cubic interpolants," SIAM Journal on Scientific and Statistical Computing, vol. 5, no. 2, pp. 300–304, 1984.
[17] M. Matsumoto and T. Nishimura, "Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator," ACM Trans. Model. Comput. Simul., vol. 8, pp. 3–30, Jan. 1998.
[18] S. Müller, S. Mayer, C. H. auf der Heide, and A. Hohenegger, Documentation and Analysis of the Linux Random Number Generator. German Federal Office for Information Security, 2021.
