Statistical Characterization of Data - Oregon State...

What are the model-independent characteristics of a set of data? When a set of data tends to cluster about a particular value, the data are said to have a central tendency. There are certain important measures of central tendency. They include:

The first moment, the mean, defined as:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

The second moment, the variance, which is a measure of the spread about the central value, is defined as:

$$\sigma^2 = \frac{\sum_{j=1}^{N} (x_j - \bar{x})^2}{N - 1}$$

The square root of the variance is known as the standard deviation, $\sigma$. There are higher moments, which can be illustrated in the following diagram:

Skewness is a measure of the asymmetry of the distribution. It is defined as:

$$\text{skewness} = \frac{1}{N} \sum_{j=1}^{N} \left[ \frac{x_j - \bar{x}}{\sigma} \right]^3$$

To be significant, the magnitude of the skewness should be greater than $(15/N)^{1/2}$.
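As a minimal sketch, the mean, variance, standard deviation, and skewness defined above can be computed directly from their definitions. The function name `moments` and the sample values are illustrative assumptions, not part of the original notes.

```python
import math

def moments(data):
    """Return (mean, variance, std, skewness) from the definitions above."""
    n = len(data)
    mean = sum(data) / n
    # The variance uses N - 1 in the denominator.
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    std = math.sqrt(variance)
    # The skewness averages the cubed standardized deviations over N.
    skewness = sum(((x - mean) / std) ** 3 for x in data) / n
    return mean, variance, std, skewness

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
mean, variance, std, skewness = moments(sample)

# Significance criterion quoted above: |skewness| > (15/N)^(1/2).
threshold = math.sqrt(15 / len(sample))
is_significant = abs(skewness) > threshold
print(mean, variance, std, skewness, is_significant)
```

With only eight points the threshold is large, so a modest asymmetry like the one in this sample would not count as significant.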

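Kurtosis

Kurtosis is a measure of the flatness of the distribution relative to a Gaussian, for which the kurtosis as defined here is zero. Formally, kurtosis is defined as:

$$\text{kurtosis} = \left\{ \frac{1}{N} \sum_{j=1}^{N} \left[ \frac{x_j - \bar{x}}{\sigma} \right]^4 \right\} - 3$$

To be significant, the magnitude of the kurtosis should be greater than $(96/N)^{1/2}$.

Median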

The median is the value of x for which larger and smaller values of x are equally probable. Formally, with the values sorted in ascending order, the median for even N is $0.5\,(x_{N/2} + x_{N/2+1})$. For odd N, the median is $x_{(N+1)/2}$.

t-Test

The t-test is designed to answer the question: Do two distributions have the same mean? If we define $S_D$ as the standard error of the difference of two means, then

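$$S_D = \sqrt{ \frac{\sum_{i \in A} (x_i - \bar{x}_A)^2 + \sum_{i \in B} (x_i - \bar{x}_B)^2}{N_A + N_B - 2} \left( \frac{1}{N_A} + \frac{1}{N_B} \right) }$$

$$t = \frac{\bar{x}_A - \bar{x}_B}{S_D}$$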

The value of t is compared with expectations based upon assumed statistics (where one assumes the distributions have equal variances).

F-test

This statistic is designed to answer the question: Do two distributions have different variances? F is defined as the ratio (variance of A)/(variance of B). The value of F is compared with “expectations”.
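The two statistics above can be sketched in a few lines. The function name `t_and_f` and the two samples are hypothetical; the sketch only computes t and F and does not look up the "expectations" they are compared with, which would need tables or a statistics library.

```python
import math

def t_and_f(sample_a, sample_b):
    """Return (t, F) for two samples, following the definitions above."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    # Sums of squared deviations about each sample's own mean.
    ss_a = sum((x - mean_a) ** 2 for x in sample_a)
    ss_b = sum((x - mean_b) ** 2 for x in sample_b)
    # Standard error of the difference of the means (pooled variance).
    s_d = math.sqrt((ss_a + ss_b) / (n_a + n_b - 2) * (1 / n_a + 1 / n_b))
    t = (mean_a - mean_b) / s_d
    # F is the ratio of the two sample variances.
    f = (ss_a / (n_a - 1)) / (ss_b / (n_b - 1))
    return t, f

sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical data
sample_b = [2.0, 4.0, 6.0, 8.0, 10.0]   # hypothetical data
t, f = t_and_f(sample_a, sample_b)
print(t, f)
```

Note that these two samples have quite different variances (F = 0.25), which is exactly the situation in which the equal-variance assumption behind this form of t deserves a second look.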

$\chi^2$ Test (Chi-squared test)

The $\chi^2$ test is designed to answer the question of whether two distributions are different. Formally, $\chi^2$ is defined as

$$\chi^2 = \sum_i \frac{(N_i - n_i)^2}{n_i}$$

where $N_i$ is the number of events in the ith bin in distribution 1 while $n_i$ is the number of events in the ith bin in distribution 2. If we define the number of degrees of freedom, $\nu$, as the number of data points minus the number of parameters determined from the points, we can also define the reduced $\chi^2$ as $\chi^2/\nu$. Roughly, the reduced chi-squared should be about 1. The next page shows the expected values of F and $\chi^2$.
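A short sketch of the chi-squared sum and its reduced form follows. The function names, the bin counts, and the `n_params` argument are illustrative assumptions; the sketch also assumes every bin of the second distribution is non-empty, since an empty bin would divide by zero.

```python
def chi_squared(counts_1, counts_2):
    """Sum of (N_i - n_i)^2 / n_i over bins; counts_2 must have no empty bins."""
    return sum((big_n - n) ** 2 / n for big_n, n in zip(counts_1, counts_2))

def reduced_chi_squared(counts_1, counts_2, n_params=0):
    """Chi-squared divided by the degrees of freedom (points minus fitted parameters)."""
    dof = len(counts_1) - n_params
    return chi_squared(counts_1, counts_2) / dof

counts_1 = [8, 12, 10]    # hypothetical bin counts, distribution 1
counts_2 = [10, 10, 10]   # hypothetical bin counts, distribution 2
chi2 = chi_squared(counts_1, counts_2)
chi2_reduced = reduced_chi_squared(counts_1, counts_2)
print(chi2, chi2_reduced)
```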

Distributions

Binomial Distribution

Describes the probability of observing x successes out of n tries when the probability for success in each try is p:

$$P_B(x; n, p) = \frac{n!}{x!\,(n - x)!}\, p^x (1 - p)^{n - x}$$

$$\mu = np, \qquad \sigma^2 = np(1 - p)$$
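As a quick check of these relations, the sketch below evaluates the binomial probabilities for one hypothetical choice of n and p and recovers the mean and variance by direct summation. The function name `binomial_pmf` is an assumption for illustration; `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of x successes in n tries with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3  # hypothetical values
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(probs)
mean = sum(x * prob for x, prob in enumerate(probs))
variance = sum((x - mean) ** 2 * prob for x, prob in enumerate(probs))
print(total, mean, variance)  # compare with 1, n*p, and n*p*(1 - p)
```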

Poisson Distribution

A Poisson distribution is the limiting case of the binomial distribution for $\mu \ll n$ because $p \ll 1$; it is appropriate for describing small samples from a large population.

$$P_P(x, \mu) = \frac{\mu^x}{x!}\, e^{-\mu}$$

$$\sigma^2 = \mu$$
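The limiting relationship can be illustrated numerically. In this sketch the values of the mean, n, and the summation cutoff are arbitrary assumptions chosen so that p is small; it compares the Poisson probabilities with binomial ones having the same mean and checks that the Poisson variance equals its mean.

```python
import math

def binomial_pmf(x, n, p):
    """Binomial probability of x successes in n tries."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, mu):
    """Poisson probability of observing x events when the mean is mu."""
    return mu ** x / math.factorial(x) * math.exp(-mu)

mu = 2.0        # hypothetical mean
n = 10_000      # many tries ...
p = mu / n      # ... each with a small success probability

# Largest difference between the two distributions over x = 0..10.
max_diff = max(abs(poisson_pmf(x, mu) - binomial_pmf(x, n, p)) for x in range(11))

# Mean and variance by direct summation (terms beyond x = 59 are negligible here).
mean = sum(x * poisson_pmf(x, mu) for x in range(60))
variance = sum((x - mean) ** 2 * poisson_pmf(x, mu) for x in range(60))
print(max_diff, mean, variance)
```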

Gaussian (Normal) Distribution

A Gaussian distribution is a limiting case of the binomial distribution for large n and finite p; it is appropriate for smooth symmetric distributions.

$$P_G(x, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right]$$

The half-width, $\Gamma$ (the full width at half maximum), is $2.354\sigma$ while the probable error is $0.6745\sigma$.
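The two numerical factors quoted above can be checked with a short sketch. The function name `gaussian_pdf` and the parameter values are assumptions for illustration; the probable error is interpreted here as the half-width of the interval about the mean that contains half of the probability.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 0.0, 1.5  # hypothetical parameters

# Full width at half maximum: 2 * sqrt(2 ln 2) * sigma, about 2.354 sigma.
fwhm = 2 * math.sqrt(2 * math.log(2)) * sigma
ratio_at_half_width = gaussian_pdf(mu + fwhm / 2, mu, sigma) / gaussian_pdf(mu, mu, sigma)

# Probability contained within mu +/- 0.6745 sigma (should be close to one half).
prob_within_pe = math.erf(0.6745 / math.sqrt(2))
print(fwhm / sigma, ratio_at_half_width, prob_within_pe)
```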

Lorentzian Distribution

The distribution relation is

$$P_L(x, \mu, \Gamma) = \frac{1}{\pi} \frac{\Gamma/2}{(x - \mu)^2 + (\Gamma/2)^2}$$
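A brief sketch of this relation, with a hypothetical function name and parameter values, shows that the width parameter plays the same role as for the Gaussian: the density falls to half its peak value at a distance of half the width from the center.

```python
import math

def lorentzian_pdf(x, mu, gamma):
    """Lorentzian density centered at mu with full width at half maximum gamma."""
    half = gamma / 2
    return (1 / math.pi) * half / ((x - mu) ** 2 + half ** 2)

mu, gamma = 1.0, 0.8  # hypothetical parameters
peak = lorentzian_pdf(mu, mu, gamma)
ratio_at_half_width = lorentzian_pdf(mu + gamma / 2, mu, gamma) / peak
print(peak, ratio_at_half_width)
```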

Propagation of Errors

Suppose we wish to determine x where

$$x = f(u, v, \ldots)$$

$$x_i - \bar{x} \simeq (u_i - \bar{u}) \left( \frac{\partial x}{\partial u} \right) + (v_i - \bar{v}) \left( \frac{\partial x}{\partial v} \right) + \cdots$$

We can express the variance $\sigma_x^2$ in terms of the variances of u, v, etc. as

$$\sigma_x^2 = \sigma_u^2 \left( \frac{\partial x}{\partial u} \right)^2 + \sigma_v^2 \left( \frac{\partial x}{\partial v} \right)^2 + \cdots$$

where we have neglected any correlation between u and v, etc. Using these ideas we can make a handy little table of relations that allow one to calculate the standard deviation associated with quantities after performing various arithmetic operations. The table is as follows:

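Function               Standard Deviation
$x = a + b$            $\sigma_x = (\sigma_a^2 + \sigma_b^2)^{1/2}$
$x = ab$               $\sigma_x = x\,((\sigma_a/a)^2 + (\sigma_b/b)^2)^{1/2}$
$x = a/b$              $\sigma_x = x\,((\sigma_a/a)^2 + (\sigma_b/b)^2)^{1/2}$
$x = a u^{\pm b}$      $\sigma_x = \pm b\,\sigma_u\,x/u$
$x = a e^{\pm bu}$     $\sigma_x = \pm b\,\sigma_u\,x$
$x = a^{\pm bu}$       $\sigma_x = \pm (b \ln a)\,\sigma_u\,x$
$x = a \ln(\pm bu)$    $\sigma_x = a\,\sigma_u/u$

In the first three rows a and b are measured quantities with standard deviations $\sigma_a$ and $\sigma_b$; in the remaining rows a and b are constants and u is the measured quantity. The sign follows that of the derivative; the standard deviation itself is the magnitude.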
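The general variance formula can be applied numerically and compared with entries of the table. The sketch below is one possible approach under stated assumptions: the helper `propagate`, its `rel_step` parameter, and all numerical values are hypothetical, the partial derivatives are estimated by central differences, and the inputs are taken to be uncorrelated.

```python
import math

def propagate(f, values, sigmas, rel_step=1e-6):
    """First-order standard deviation of f(*values) for uncorrelated inputs."""
    variance = 0.0
    for k, (value, sigma) in enumerate(zip(values, sigmas)):
        step = rel_step * (abs(value) if value != 0 else 1.0)
        upper = list(values)
        lower = list(values)
        upper[k] = value + step
        lower[k] = value - step
        # Central-difference estimate of the partial derivative.
        derivative = (f(*upper) - f(*lower)) / (2 * step)
        variance += (sigma * derivative) ** 2
    return math.sqrt(variance)

# Product x = a*b, compared with the table entry for x = ab.
a, sigma_a = 10.0, 0.2   # hypothetical measurement
b, sigma_b = 4.0, 0.1    # hypothetical measurement
numeric_product = propagate(lambda p, q: p * q, [a, b], [sigma_a, sigma_b])
table_product = a * b * math.sqrt((sigma_a / a) ** 2 + (sigma_b / b) ** 2)

# Power law x = c*u**k with constants c and k, compared with k*sigma_u*x/u.
c, k = 3.0, 2.0
u, sigma_u = 2.0, 0.1    # hypothetical measurement
numeric_power = propagate(lambda w: c * w ** k, [u], [sigma_u])
table_power = k * sigma_u * (c * u ** k) / u
print(numeric_product, table_product, numeric_power, table_power)
```

The numerical route is handy when the function is too awkward to differentiate by hand, while the table entries are quicker for the common cases.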

Weighted Means

All this stuff is nice but a tad unrealistic. Suppose you have a group of numbers that you are averaging. If one of them is very uncertain, you might not want that number to count the same as the rest in computing the average. The same comment applies to all the statistical measures you might compute from the data. Thus we need to understand the use of weights in computing various statistics. Consider a set of points $x_i$ with uncertainties $\sigma_i$. The weighted average of the group is given as

$$\bar{x} = \frac{\sum_i w_i x_i}{\sum_i w_i}$$

where the weighting factors are taken as

$$w_i = \frac{1}{\sigma_i^2}$$

The variance of the weighted mean is then given as

$$\sigma_\mu^2 = \frac{1}{\sum_i \left( 1/\sigma_i^2 \right)}$$
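A minimal sketch of the weighted mean and its uncertainty follows. The function name `weighted_mean` and the two measurements are hypothetical.

```python
import math

def weighted_mean(values, sigmas):
    """Return the inverse-variance weighted mean and its standard deviation."""
    weights = [1 / s ** 2 for s in sigmas]
    mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    # Variance of the weighted mean is 1 / sum(1 / sigma_i^2).
    sigma_mean = math.sqrt(1 / sum(weights))
    return mean, sigma_mean

values = [10.0, 12.0]   # hypothetical measurements
sigmas = [1.0, 2.0]     # the second measurement is twice as uncertain
mean, sigma_mean = weighted_mean(values, sigmas)
print(mean, sigma_mean)
```

The result sits much closer to the more precise measurement than the plain average of 11 would, and its uncertainty is smaller than that of either measurement alone.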

Smoothing

Generally, smoothing is not something you wish to do with data because of the possibility of distorting or altering it. However, smoothing is a useful tool for allowing visual recognition of trends in data. It is important that smoothing be carried out in an impartial and well-recognized manner. One very well-recognized method of smoothing data is to apply a Savitzky-Golay filter to the data. This procedure involves, in effect, fitting a low-order polynomial to the data within a moving window to extract a smooth tendency from it. The formal definition of the Savitzky-Golay filters is given in the following table.

Thus, to apply a 5-point smooth to a data set (a common choice), we say:

Y(I)=(-3*Y(I-2)+12*Y(I-1)+17*Y(I)+12*Y(I+1)-3*Y(I+2))/35

We apply this procedure stepwise for each point in the array. The effect of this smoothing upon data is shown in the next two figures, where smoothing the data reveals some peak structures not obvious in the original data.
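The 5-point formula above translates directly into a short routine. This is a sketch under two assumptions that the notes do not specify: the first and last two points, which lack a full window, are left unchanged, and the smoothed values are written to a new list so that each one is computed from the original data rather than from already-smoothed neighbors. The name `smooth5` is hypothetical.

```python
def smooth5(y):
    """Apply the 5-point smooth; the two points at each end are left unchanged."""
    out = list(y)
    for i in range(2, len(y) - 2):
        # Weights (-3, 12, 17, 12, -3) / 35 from the formula above.
        out[i] = (-3 * y[i - 2] + 12 * y[i - 1] + 17 * y[i]
                  + 12 * y[i + 1] - 3 * y[i + 2]) / 35
    return out

spike = [0, 0, 0, 35, 0, 0, 0]            # hypothetical data: a single spike
parabola = [i * i for i in range(7)]      # hypothetical data: an exact quadratic
smoothed_spike = smooth5(spike)
smoothed_parabola = smooth5(parabola)
print(smoothed_spike)
print(smoothed_parabola)
```

The spike is spread over its neighbors and lowered, while the parabola passes through unchanged, which reflects the fact that these weights correspond to a local quadratic fit.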

