Normal distribution

This article is about the univariate normal distribution. For normally distributed vectors, see Multivariate normal distribution.

In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.[1][2]

The normal distribution is remarkably useful because of the central limit theorem. In its most general form, under mild conditions, it states that averages of random variables independently drawn from independent distributions are normally distributed. Physical quantities that are expected to be the sum of many independent processes (such as measurement errors) often have distributions that are nearly normal.[3] Moreover, many results and methods (such as propagation of uncertainty and least squares parameter fitting) can be derived analytically in explicit form when the relevant variables are normally distributed.

The normal distribution is sometimes informally called the bell curve. However, many other distributions are bell-shaped (such as the Cauchy, Student's t, and logistic distributions). The terms Gaussian function and Gaussian bell curve are also ambiguous because they sometimes refer to multiples of the normal distribution that cannot be directly interpreted in terms of probabilities.

The probability density of the normal distribution is:

f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

Here, μ is the mean or expectation of the distribution (and also its median and mode). The parameter σ is its standard deviation; its variance is then σ². A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.

If μ = 0 and σ = 1, the distribution is called the standard normal distribution or the unit normal distribution, denoted by N(0, 1), and a random variable with that distribution is a standard normal deviate.

The normal distribution is the only absolutely continuous distribution whose cumulants beyond the first two (i.e., other than the mean and variance) are zero. It is also the continuous distribution with the maximum entropy for a specified mean and variance.[4][5]

The normal distribution is a subclass of the elliptical distributions. The normal distribution is symmetric about its mean, and is non-zero over the entire real line. As such it may not be a suitable model for variables that are inherently positive or strongly skewed, such as the weight of a person or the price of a share. Such variables may be better described by other distributions, such as the log-normal distribution or the Pareto distribution.

The value of the normal distribution is practically zero when the value x lies more than a few standard deviations away from the mean. Therefore, it may not be an appropriate model when one expects a significant fraction of outliers (values that lie many standard deviations away from the mean), and least squares and other statistical inference methods that are optimal for normally distributed variables often become highly unreliable when applied to such data. In those cases, a more heavy-tailed distribution should be assumed and the appropriate robust statistical inference methods applied.

The Gaussian distribution belongs to the family of stable distributions, which are the attractors of sums of independent, identically distributed distributions whether or not the mean or variance is finite. Except for the Gaussian, which is a limiting case, all stable distributions have heavy tails and infinite variance.

1 Definition

    1.1 Standard normal distribution

The simplest case of a normal distribution is known as the standard normal distribution. This is a special case where μ = 0 and σ = 1, and it is described by this probability density function:

\phi(x) = \frac{e^{-\frac{1}{2}x^2}}{\sqrt{2\pi}}

The factor 1/\sqrt{2\pi} in this expression ensures that the total area under the curve ϕ(x) is equal to one.[6] The 1/2 in the exponent ensures that the distribution has unit variance (and therefore also unit standard deviation). This function is symmetric around x = 0, where it attains its maximum value 1/\sqrt{2\pi}, and has inflection points at +1 and −1.

Authors may differ also on which normal distribution should be called the "standard" one. Gauss himself defined the standard normal as having variance σ² = 1/2, that is

\phi(x) = \frac{e^{-x^2}}{\sqrt{\pi}}

Stigler[7] goes even further, defining the standard normal with variance σ² = 1/(2π):

\phi(x) = e^{-\pi x^2}

    1.2 General normal distribution

Any normal distribution is a version of the standard normal distribution whose domain has been stretched by a factor σ (the standard deviation) and then translated by μ (the mean value):

f(x \mid \mu, \sigma) = \frac{1}{\sigma}\, \phi\!\left(\frac{x-\mu}{\sigma}\right)

The probability density must be scaled by 1/σ so that the integral is still 1.

If Z is a standard normal deviate, then X = σZ + μ will have a normal distribution with expected value μ and standard deviation σ. Conversely, if X is a general normal deviate, then Z = (X − μ)/σ will have a standard normal distribution.

Every normal distribution is the exponential of a quadratic function:

f(x) = e^{ax^2 + bx + c}

where a is negative and c is b²/(4a) + ln(−a/π)/2. In this form, the mean value μ is −b/(2a), and the variance σ² is −1/(2a). For the standard normal distribution, a is −1/2, b is zero, and c is −ln(2π)/2.
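The expression for c follows from the requirement that f integrate to one; a one-line check using the standard Gaussian integral for a < 0:

\int_{-\infty}^{\infty} e^{ax^2+bx+c}\,dx = e^{c}\sqrt{\frac{\pi}{-a}}\; e^{-\frac{b^2}{4a}} = 1 \quad\Longrightarrow\quad c = \frac{b^2}{4a} + \frac{1}{2}\ln\!\left(\frac{-a}{\pi}\right).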

    1.3 Notation

The standard Gaussian distribution (with zero mean and unit variance) is often denoted with the Greek letter ϕ (phi).[8] The alternative form of the Greek phi letter, φ, is also used quite often.

The normal distribution is also often denoted by N(μ, σ²).[9] Thus when a random variable X is distributed normally with mean μ and variance σ², we write

X \sim N(\mu, \sigma^2).

1.4 Alternative parameterizations

Some authors advocate using the precision τ as the parameter defining the width of the distribution, instead of the deviation σ or the variance σ². The precision is normally defined as the reciprocal of the variance, 1/σ².[10] The formula for the distribution then becomes

f(x) = \sqrt{\frac{\tau}{2\pi}}\; e^{-\frac{\tau(x-\mu)^2}{2}}

This choice is claimed to have advantages in numerical computations when σ is very close to zero, and to simplify formulas in some contexts, such as in the Bayesian inference of variables with multivariate normal distribution. Also the reciprocal of the standard deviation τ′ = 1/σ might be defined as the precision, and the expression of the normal distribution becomes

f(x) = \frac{\tau'}{\sqrt{2\pi}}\; e^{-\frac{(\tau')^2(x-\mu)^2}{2}}

According to Stigler, this formulation is advantageous because of a much simpler and easier-to-remember formula, the fact that the pdf has unit height at zero, and simple approximate formulas for the quantiles of the distribution.

    2 Properties

2.1 Symmetries and derivatives

The normal distribution f(x), with any mean μ and any positive deviation σ, has the following properties:

It is symmetric around the point x = μ, which is at the same time the mode, the median and the mean of the distribution.[11]

It is unimodal: its first derivative is positive for x < μ, negative for x > μ, and zero only at x = μ.

Its density has two inflection points (where the second derivative of f is zero and changes sign), located one standard deviation away from the mean, namely at x = μ − σ and x = μ + σ.[11]

Its density is log-concave.[11]

Its density is infinitely differentiable, indeed supersmooth of order 2.[12]

Its second derivative ∂²f/∂x² equals twice its derivative with respect to the variance σ²; equivalently, f satisfies the diffusion equation ∂f/∂σ² = ½ ∂²f/∂x².

Furthermore, the density ϕ(x) of the standard normal distribution (with μ = 0 and σ = 1) also has the following properties:


Its first derivative ϕ′(x) is −xϕ(x).

Its second derivative ϕ′′(x) is (x² − 1)ϕ(x).

More generally, its n-th derivative ϕ⁽ⁿ⁾(x) is (−1)ⁿHₙ(x)ϕ(x), where Hₙ is the Hermite polynomial of order n.[13]

It satisfies the differential equation

\sigma^2 f'(x) + f(x)(x - \mu) = 0, \qquad f(0) = \frac{e^{-\mu^2/(2\sigma^2)}}{\sigma\sqrt{2\pi}}

or

f'(x) + \tau f(x)(x - \mu) = 0, \qquad f(0) = \frac{\sqrt{\tau}\, e^{-\tau\mu^2/2}}{\sqrt{2\pi}}.
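The first form can be checked directly by differentiating the density:

f'(x) = -\frac{x-\mu}{\sigma^2}\, f(x) \quad\Longrightarrow\quad \sigma^2 f'(x) + f(x)(x-\mu) = 0.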

2.2 Moments

See also: List of integrals of Gaussian functions

The plain and absolute moments of a variable X are the expected values of Xᵖ and |X|ᵖ, respectively. If the expected value μ of X is zero, these parameters are called central moments. Usually we are interested only in moments with integer order p.

If X has a normal distribution, these moments exist and are finite for any p whose real part is greater than −1. For any non-negative integer p, the plain central moments are

E[X^p] = \begin{cases} 0 & \text{if } p \text{ is odd,} \\ \sigma^p\,(p-1)!! & \text{if } p \text{ is even.} \end{cases}

Here n!! denotes the double factorial, that is, the product of every number from n to 1 that has the same parity as n.

The central absolute moments coincide with plain moments for all even orders, but are nonzero for odd orders. For any non-negative integer p,

E[|X|^p] = \sigma^p\,(p-1)!! \cdot \begin{cases} \sqrt{2/\pi} & \text{if } p \text{ is odd} \\ 1 & \text{if } p \text{ is even} \end{cases} \;=\; \sigma^p \cdot \frac{2^{p/2}\,\Gamma\!\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}

The last formula is valid also for any non-integer p > −1. When the mean μ ≠ 0, the plain and absolute moments can be expressed in terms of confluent hypergeometric functions ₁F₁ and U.

E[X^p] = \sigma^p \cdot (-i\sqrt{2})^p\; U\!\left(-\frac{p}{2},\, \frac{1}{2},\, -\frac{\mu^2}{2\sigma^2}\right)

E[|X|^p] = \sigma^p \cdot 2^{p/2}\, \frac{\Gamma\!\left(\frac{1+p}{2}\right)}{\sqrt{\pi}}\; {}_1F_1\!\left(-\frac{p}{2},\, \frac{1}{2},\, -\frac{\mu^2}{2\sigma^2}\right)

These expressions remain valid even if p is not an integer. See also generalized Hermite polynomials.
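As a concrete check of the central-moment formulas above for a centred variable (μ = 0):

E[X^2] = \sigma^2, \qquad E[X^4] = 3\sigma^4, \qquad E[|X|] = \sigma\sqrt{2/\pi}, \qquad E[|X|^3] = 2\sigma^3\sqrt{2/\pi}.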

2.3 Fourier transform and characteristic function

The Fourier transform of a normal distribution f with mean μ and deviation σ is[14]

\hat{\phi}(t) = \int_{-\infty}^{\infty} f(x)\, e^{-itx}\, dx = e^{-i\mu t}\, e^{-\frac{1}{2}(\sigma t)^2}

where i is the imaginary unit. If the mean μ is zero, the first factor is 1, and the Fourier transform is also a normal distribution on the frequency domain, with mean 0 and standard deviation 1/σ. In particular, the standard normal distribution (with μ = 0 and σ = 1) is an eigenfunction of the Fourier transform.

In probability theory, the Fourier transform of the probability distribution of a real-valued random variable X is called the characteristic function of that variable, and can be defined as the expected value of e^{itX}, as a function of the real variable t (the frequency parameter of the Fourier transform). This definition can be analytically extended to a complex-valued parameter t.[15]

2.4 Moment and cumulant generating functions

The moment generating function of a real random variable X is the expected value of e^{tX}, as a function of the real parameter t. For a normal distribution with mean μ and deviation σ, the moment generating function exists and is equal to

M(t) = \hat{\phi}(it) = e^{\mu t}\, e^{\frac{1}{2}\sigma^2 t^2}

The cumulant generating function is the logarithm of the moment generating function, namely

g(t) = \ln M(t) = \mu t + \tfrac{1}{2}\sigma^2 t^2

Since this is a quadratic polynomial in t, only the first two cumulants are nonzero, namely the mean μ and the variance σ².
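Differentiating the cumulant generating function at t = 0 recovers these two cumulants explicitly:

g'(0) = \mu, \qquad g''(0) = \sigma^2, \qquad g^{(k)}(0) = 0 \;\text{ for } k \ge 3.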


3 Cumulative distribution function

The cumulative distribution function (CDF) of the standard normal distribution, usually denoted with the capital Greek letter Φ (phi), is the integral

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt

In statistics one often uses the related error function, or erf(x), defined as the probability of a random variable with normal distribution of mean 0 and variance 1/2 falling in the range [−x, x]; that is

\operatorname{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^{x} e^{-t^2}\, dt

These integrals cannot be expressed in terms of elementary functions, and are often said to be special functions. However, many numerical approximations are known; see below.

The two functions are closely related, namely

\Phi(x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]

For a generic normal distribution f with mean μ and deviation σ, the cumulative distribution function is

F(x) = \Phi\!\left(\frac{x-\mu}{\sigma}\right) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]

The complement of the standard normal CDF, Q(x) = 1 − Φ(x), is often called the Q-function, especially in engineering texts.[16][17] It gives the probability that the value of a standard normal random variable X will exceed x. Other definitions of the Q-function, all of which are simple transformations of Φ, are also used occasionally.[18]

The graph of the standard normal CDF Φ has 2-fold rotational symmetry around the point (0, 1/2); that is, Φ(−x) = 1 − Φ(x). Its antiderivative (indefinite integral) is

\int \Phi(x)\, dx = x\,\Phi(x) + \phi(x).

The cumulative distribution function (CDF) of the standard normal distribution can be expanded by integration by parts into a series:

\Phi(x) = \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \left[ x + \frac{x^3}{3} + \frac{x^5}{3\cdot 5} + \cdots + \frac{x^{2n+1}}{(2n+1)!!} + \cdots \right]

where !! denotes the double factorial. The following Pascal function calculates the CDF from the sum of the first 100 terms of this series:

function CDF(x: extended): extended;
var
  value, sum: extended;
  i: integer;
begin
  { accumulate the series x + x^3/3 + x^5/(3*5) + ... using the first 100 terms }
  sum := x;
  value := x;
  for i := 1 to 100 do
  begin
    value := value * x * x / (2 * i + 1);
    sum := sum + value;
  end;
  { Phi(x) = 0.5 + exp(-x^2/2) * sum / sqrt(2*pi) }
  result := 0.5 + (sum / sqrt(2 * pi)) * exp(-(x * x) / 2);
end;

3.1 Standard deviation and tolerance intervals

Main article: Tolerance interval

[Figure: For the normal distribution, the values less than one standard deviation away from the mean account for 68.27% of the set; two standard deviations from the mean account for 95.45%; and three standard deviations account for 99.73%.]

About 68% of values drawn from a normal distribution are within one standard deviation away from the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations. This fact is known as the 68-95-99.7 (empirical) rule, or the 3-sigma rule.

More precisely, the probability that a normal deviate lies in the range between μ − nσ and μ + nσ is given by

F(\mu + n\sigma) - F(\mu - n\sigma) = \Phi(n) - \Phi(-n) = \operatorname{erf}\!\left(\frac{n}{\sqrt{2}}\right)

The values for n = 1, 2, ..., 6, tabulated to 12 decimal places in the source,[19] correspond approximately to 68.27%, 95.45%, 99.73%, 99.9937%, 99.99994%, and 99.9999998%.
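A small Free Pascal sketch (not part of the original article) that reproduces the first three of these probabilities via 2Φ(n) − 1, re-declaring the series-based CDF function given above so the program compiles on its own; the program name is hypothetical:

program EmpiricalRule;

{ series-based standard normal CDF, as in the example above }
function CDF(x: extended): extended;
var
  value, sum: extended;
  i: integer;
begin
  sum := x;
  value := x;
  for i := 1 to 100 do
  begin
    value := value * x * x / (2 * i + 1);
    sum := sum + value;
  end;
  CDF := 0.5 + (sum / sqrt(2 * pi)) * exp(-(x * x) / 2);
end;

var
  n: integer;
begin
  { P(mu - n*sigma < X < mu + n*sigma) = 2*Phi(n) - 1 }
  for n := 1 to 3 do
    Writeln(n, ' sigma: ', (2 * CDF(n) - 1):0:4);   { about 0.6827, 0.9545, 0.9973 }
end.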

    3.2 Quantile function

The quantile function of a distribution is the inverse of the cumulative distribution function. The quantile function of the standard normal distribution is called the probit function, and can be expressed in terms of the inverse error function:

\Phi^{-1}(p) = \sqrt{2}\, \operatorname{erf}^{-1}(2p - 1), \quad p \in (0, 1).

For a normal random variable with mean μ and variance σ², the quantile function is

F^{-1}(p) = \mu + \sigma\,\Phi^{-1}(p) = \mu + \sigma\sqrt{2}\, \operatorname{erf}^{-1}(2p - 1), \quad p \in (0, 1).

The quantile Φ⁻¹(p) of the standard normal distribution is commonly denoted as z_p. These values are used in hypothesis testing, construction of confidence intervals and Q-Q plots. A normal random variable X will exceed μ + σz_p with probability 1 − p, and will lie outside the interval μ ± σz_p with probability 2(1 − p). In particular, the quantile z_{0.975} is 1.96; therefore a normal random variable will lie outside the interval μ ± 1.96σ in only 5% of cases.

The source tabulates the multiple n of σ such that X will lie in the range μ ± nσ with a specified probability p. These values are useful to determine tolerance intervals for sample averages and other statistical estimators with normal (or asymptotically normal) distributions:[20]
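Since Φ is strictly increasing, the probit can also be obtained numerically by inverting any CDF routine. A minimal Free Pascal sketch using bisection on the series-based CDF from above; the program and the helper name Probit are assumptions of this sketch, not part of the original article:

program ProbitDemo;

function CDF(x: extended): extended;
var
  value, sum: extended;
  i: integer;
begin
  sum := x;
  value := x;
  for i := 1 to 100 do
  begin
    value := value * x * x / (2 * i + 1);
    sum := sum + value;
  end;
  CDF := 0.5 + (sum / sqrt(2 * pi)) * exp(-(x * x) / 2);
end;

{ invert Phi by bisection on [-8, 8]; adequate for p well inside (0, 1) }
function Probit(p: extended): extended;
var
  lo, hi, mid: extended;
  i: integer;
begin
  lo := -8.0;
  hi := 8.0;
  for i := 1 to 80 do
  begin
    mid := (lo + hi) / 2;
    if CDF(mid) < p then
      lo := mid
    else
      hi := mid;
  end;
  Probit := (lo + hi) / 2;
end;

begin
  Writeln(Probit(0.975):0:4);  { approximately 1.9600 }
end.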

4 Zero-variance limit

In the limit when σ tends to zero, the probability density f(x) eventually tends to zero at any x ≠ μ, but grows without limit if x = μ, while its integral remains equal to 1. Therefore, the normal distribution cannot be defined as an ordinary function when σ = 0.

However, one can define the normal distribution with zero variance as a generalized function; specifically, as Dirac's delta function translated by the mean μ, that is f(x) = δ(x − μ). Its CDF is then the Heaviside step function translated by the mean μ, namely

F(x) = \begin{cases} 0 & \text{if } x < \mu \\ 1 & \text{if } x \ge \mu \end{cases}

    5 Central limit theorem

[Figure: As the number of discrete events increases, the function begins to resemble a normal distribution. Comparison of probability density functions p(k) for the sum of n fair 6-sided dice, showing their convergence to a normal distribution with increasing n, in accordance with the central limit theorem. In the bottom-right graph, smoothed profiles of the previous graphs are rescaled, superimposed and compared with a normal distribution (black curve).]

    Main article: Central limit theorem

The central limit theorem states that under certain (fairly common) conditions, the sum of many random variables will have an approximately normal distribution. More specifically, where X₁, ..., Xₙ are independent and identically distributed random variables with the same arbitrary distribution, zero mean, and variance σ², and Z is their mean scaled by √n,

Z = \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)

Then, as n increases, the probability distribution of Z will tend to the normal distribution with zero mean and variance σ².

The theorem can be extended to variables Xᵢ that are not independent and/or not identically distributed if certain constraints are placed on the degree of dependence and the moments of the distributions.

Many test statistics, scores, and estimators encountered in practice contain sums of certain random variables in them, and even more estimators can be represented as sums of random variables through the use of influence functions. The central limit theorem implies that those statistical parameters will have asymptotically normal distributions.


The central limit theorem also implies that certain distributions can be approximated by the normal distribution, for example:

The binomial distribution B(n, p) is approximately normal with mean np and variance np(1 − p) for large n and for p not too close to zero or one (a worked example follows this list).

The Poisson distribution with parameter λ is approximately normal with mean λ and variance λ, for large values of λ.[21]

The chi-squared distribution χ²(k) is approximately normal with mean k and variance 2k, for large k.

The Student's t-distribution t(ν) is approximately normal with mean 0 and variance 1 when ν is large.

Whether these approximations are sufficiently accurate depends on the purpose for which they are needed, and the rate of convergence to the normal distribution. It is typically the case that such approximations are less accurate in the tails of the distribution.

A general upper bound for the approximation error in the central limit theorem is given by the Berry-Esseen theorem; improvements of the approximation are given by the Edgeworth expansions.
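As a worked illustration of the binomial case (using the common half-unit continuity correction, which is a standard refinement not spelled out in the text above): for X ~ B(100, 0.5) the approximating normal has mean np = 50 and standard deviation √(np(1 − p)) = 5, so

P(X \le 55) \approx \Phi\!\left(\frac{55.5 - 50}{5}\right) = \Phi(1.1) \approx 0.86.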

6 Maximum entropy

Of all probability distributions over the reals with a specified mean μ and variance σ², the normal distribution N(μ, σ²) is the one with maximum entropy.[22] If X is a continuous random variable with probability density f(x), then the entropy of X is defined as[23][24][25]

H(X) = -\int_{-\infty}^{\infty} f(x)\, \log f(x)\, dx

where f(x) log f(x) is understood to be zero whenever f(x) = 0. This functional can be maximized, subject to the constraints that both a mean and a variance are specified, by using variational calculus. A Lagrangian function with two Lagrangian multipliers is defined:

L = \int_{-\infty}^{\infty} f(x)\ln(f(x))\,dx - \lambda_0\left(1 - \int_{-\infty}^{\infty} f(x)\,dx\right) - \lambda\left(\sigma^2 - \int_{-\infty}^{\infty} f(x)(x-\mu)^2\,dx\right)

where f(x) is, for now, regarded as some function with mean μ and standard deviation σ. When the entropy of f(x) is at a maximum and the constraints are satisfied, then a small variation δf(x) about f(x) will produce a variation δL about L which is equal to zero:

0 = \delta L = \int_{-\infty}^{\infty} \delta f(x)\left[\ln(f(x)) + 1 + \lambda_0 + \lambda(x-\mu)^2\right] dx

Since this must hold for any small δf(x), the term in brackets must be zero, and solving for f(x) yields:

f(x) = e^{-\lambda_0 - 1 - \lambda(x-\mu)^2}

Using the constraint equations to solve for λ₀ and λ yields the normal distribution:

f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

7 Operations on normal deviates

The family of normal distributions is closed under linear transformations: if X is normally distributed with mean μ and standard deviation σ, then the variable Y = aX + b, for any real numbers a and b, is also normally distributed, with mean aμ + b and standard deviation |a|σ.

Also if X₁ and X₂ are two independent normal random variables, with means μ₁, μ₂ and standard deviations σ₁, σ₂, then their sum X₁ + X₂ will also be normally distributed,[proof] with mean μ₁ + μ₂ and variance σ₁² + σ₂².

In particular, if X and Y are independent normal deviates with zero mean and variance σ², then X + Y and X − Y are also independent and normally distributed, with zero mean and variance 2σ². This is a special case of the polarization identity.[26]

Also, if X₁, X₂ are two independent normal deviates with mean μ and deviation σ, and a, b are arbitrary real numbers, then the variable

X_3 = \frac{aX_1 + bX_2 - (a+b)\mu}{\sqrt{a^2 + b^2}} + \mu

is also normally distributed with mean μ and deviation σ. It follows that the normal distribution is stable (with exponent α = 2).

More generally, any linear combination of independent normal deviates is a normal deviate.
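A quick check of this last statement, using linearity of expectation and the independence of X₁ and X₂:

E[X_3] = \frac{a\mu + b\mu - (a+b)\mu}{\sqrt{a^2+b^2}} + \mu = \mu, \qquad \operatorname{Var}(X_3) = \frac{a^2\sigma^2 + b^2\sigma^2}{a^2 + b^2} = \sigma^2.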

7.1 Infinite divisibility and Cramér's theorem

For any positive integer n, any normal distribution with mean μ and variance σ² is the distribution of the sum of n independent normal deviates, each with mean μ/n and variance σ²/n. This property is called infinite divisibility.[27]

Conversely, if X₁ and X₂ are independent random variables and their sum X₁ + X₂ has a normal distribution, then both X₁ and X₂ must be normal deviates.[28]

This result is known as Cramér's decomposition theorem, and is equivalent to saying that the convolution of two distributions is normal if and only if both are normal. Cramér's theorem implies that a linear combination of independent non-Gaussian variables will never have an exactly normal distribution, although it may approach it arbitrarily closely.[29]

7.2 Bernstein's theorem

Bernstein's theorem states that if X and Y are independent and X + Y and X − Y are also independent, then both X and Y must necessarily have normal distributions.[30][31]

More generally, if X₁, ..., Xₙ are independent random variables, then two distinct linear combinations ∑aₖXₖ and ∑bₖXₖ will be independent if and only if all Xₖ's are normal and ∑aₖbₖσₖ² = 0, where σₖ² denotes the variance of Xₖ.[30]

8 Other properties

1. If the characteristic function φ_X of some random variable X is of the form φ_X(t) = e^{Q(t)}, where Q(t) is a polynomial, then the Marcinkiewicz theorem (named after Józef Marcinkiewicz) asserts that Q can be at most a quadratic polynomial, and therefore X is a normal random variable.[29] The consequence of this result is that the normal distribution is the only distribution with a finite number (two) of non-zero cumulants.

2. If X and Y are jointly normal and uncorrelated, then they are independent. The requirement that X and Y should be jointly normal is essential; without it the property does not hold.[32][33][proof] For non-normal random variables uncorrelatedness does not imply independence.

3. The Kullback-Leibler divergence of one normal distribution X₁ ~ N(μ₁, σ₁²) from another X₂ ~ N(μ₂, σ₂²) is given by:[34]

D_{KL}(X_1 \,\|\, X_2) = \frac{(\mu_1 - \mu_2)^2}{2\sigma_2^2} + \frac{1}{2}\left(\frac{\sigma_1^2}{\sigma_2^2} - 1 - \ln\frac{\sigma_1^2}{\sigma_2^2}\right)

The Hellinger distance between the same distributions is equal to

H^2(X_1, X_2) = 1 - \sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}}\; e^{-\frac{1}{4}\frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}}

4. The Fisher information matrix for a normal distribution is diagonal and takes the form

\mathcal{I} = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}

5. Normal distributions belong to an exponential family with natural parameters θ₁ = μ/σ² and θ₂ = −1/(2σ²), and natural statistics x and x². The dual, expectation parameters for the normal distribution are η₁ = μ and η₂ = μ² + σ².

6. The conjugate prior of the mean of a normal distribution is another normal distribution.[35] Specifically, if x₁, ..., xₙ are iid N(μ, σ²) and the prior is μ ~ N(μ₀, σ₀²), then the posterior distribution for the estimator of μ will be

\mu \mid x_1, \ldots, x_n \;\sim\; N\!\left( \frac{\frac{\sigma^2}{n}\mu_0 + \sigma_0^2 \bar{x}}{\frac{\sigma^2}{n} + \sigma_0^2},\; \left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1} \right)

7. The family of normal distributions forms a manifold with constant curvature −1. The same family is flat with respect to the (±1)-connections ∇^{(e)} and ∇^{(m)}.[36]

    9 Related distributions

9.1 Operations on a single random variable

If X is distributed normally with mean μ and variance σ², then

The exponential of X is distributed log-normally: e^X ~ ln N(μ, σ²).

The absolute value of X has a folded normal distribution: |X| ~ N_f(μ, σ²). If μ = 0 this is known as the half-normal distribution.

The square of X/σ has the noncentral chi-squared distribution with one degree of freedom: X²/σ² ~ χ₁²(μ²/σ²). If μ = 0, the distribution is called simply chi-squared.

The distribution of the variable X restricted to an interval [a, b] is called the truncated normal distribution.

(X − μ)⁻² has a Lévy distribution with location 0 and scale σ⁻².

9.2 Combination of two independent random variables

If X₁ and X₂ are two independent standard normal random variables with mean 0 and variance 1, then

Their sum and difference is distributed normally with mean zero and variance two: X₁ ± X₂ ~ N(0, 2).

Their product Z = X₁X₂ follows the product-normal distribution[37] with density function f_Z(z) = π⁻¹K₀(|z|), where K₀ is the modified Bessel function of the second kind. This distribution is symmetric around zero, unbounded at z = 0, and has the characteristic function φ_Z(t) = (1 + t²)⁻¹ᐟ².

Their ratio follows the standard Cauchy distribution: X₁/X₂ ~ Cauchy(0, 1).

Their Euclidean norm √(X₁² + X₂²) has the Rayleigh distribution.

9.3 Combination of two or more independent random variables

If X₁, X₂, ..., Xₙ are independent standard normal random variables, then the sum of their squares has the chi-squared distribution with n degrees of freedom:

X_1^2 + \cdots + X_n^2 \sim \chi_n^2.

If X₁, X₂, ..., Xₙ are independent normally distributed random variables with means μ and variances σ², then their sample mean is independent from the sample standard deviation,[38] which can be demonstrated using Basu's theorem or Cochran's theorem.[39] The ratio of these two quantities will have the Student's t-distribution with n − 1 degrees of freedom:

t = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{\frac{1}{n}(X_1 + \cdots + X_n) - \mu}{\sqrt{\frac{1}{n(n-1)}\left[(X_1 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2\right]}} \;\sim\; t_{n-1}.

If X₁, ..., Xₙ, Y₁, ..., Yₘ are independent standard normal random variables, then the ratio of their normalized sums of squares will have the F-distribution with (n, m) degrees of freedom:[40]

F = \frac{\left(X_1^2 + X_2^2 + \cdots + X_n^2\right)/n}{\left(Y_1^2 + Y_2^2 + \cdots + Y_m^2\right)/m} \;\sim\; F_{n,m}.

9.4 Operations on the density function

The split normal distribution is most directly defined in terms of joining scaled sections of the density functions of different normal distributions and rescaling the density to integrate to one. The truncated normal distribution results from rescaling a section of a single density function.

9.5 Extensions

The notion of normal distribution, being one of the most important distributions in probability theory, has been extended far beyond the standard framework of the univariate (that is, one-dimensional) case (Case 1). All these extensions are also called normal or Gaussian laws, so a certain ambiguity in names exists.

The multivariate normal distribution describes the Gaussian law in the k-dimensional Euclidean space. A vector X ∈ Rᵏ is multivariate-normally distributed if any linear combination of its components ∑ⱼ₌₁ᵏ aⱼXⱼ has a (univariate) normal distribution. The variance of X is a k×k symmetric positive-definite matrix V. The multivariate normal distribution is a special case of the elliptical distributions. As such, its iso-density loci in the k = 2 case are ellipses and in the case of arbitrary k are ellipsoids.

Rectified Gaussian distribution: a rectified version of the normal distribution with all the negative elements reset to 0.

Complex normal distribution deals with the complex normal vectors. A complex vector X ∈ Cᵏ is said to be normal if both its real and imaginary components jointly possess a 2k-dimensional multivariate normal distribution. The variance-covariance structure of X is described by two matrices: the variance matrix Γ, and the relation matrix C.

Matrix normal distribution describes the case of normally distributed matrices.

Gaussian processes are the normally distributed stochastic processes. These can be viewed as elements of some infinite-dimensional Hilbert space H, and thus are the analogues of multivariate normal vectors for the case k = ∞. A random element h ∈ H is said to be normal if for any constant a ∈ H the scalar product (a, h) has a (univariate) normal distribution. The variance structure of such a Gaussian random element can be described in terms of the linear covariance operator K: H → H. Several Gaussian processes became popular enough to have their own names: Brownian motion, Brownian bridge, Ornstein-Uhlenbeck process.

Gaussian q-distribution is an abstract mathematical construction that represents a "q-analogue" of the normal distribution.

The q-Gaussian is an analogue of the Gaussian distribution, in the sense that it maximises the Tsallis entropy, and is one type of Tsallis distribution. Note that this distribution is different from the Gaussian q-distribution above.

One of the main practical uses of the Gaussian law is to model the empirical distributions of many different random variables encountered in practice. In such a case a possible extension would be a richer family of distributions, having more than two parameters and therefore being able to fit the empirical distribution more accurately. Examples of such extensions are:

Pearson distribution: a four-parameter family of probability distributions that extend the normal law to include different skewness and kurtosis values.

10 Normality tests

Main article: Normality tests

Normality tests assess the likelihood that the given data set {x₁, ..., xₙ} comes from a normal distribution. Typically the null hypothesis H₀ is that the observations are distributed normally with unspecified mean μ and variance σ², versus the alternative Hₐ that the distribution is arbitrary. Many tests (over 40) have been devised for this problem; the more prominent of them are outlined below:

Visual tests are more intuitively appealing but subjective at the same time, as they rely on informal human judgement to accept or reject the null hypothesis.

Q-Q plot: a plot of the sorted values from the data set against the expected values of the corresponding quantiles from the standard normal distribution. That is, it's a plot of points of the form (Φ⁻¹(pₖ), x₍ₖ₎), where the plotting points pₖ are equal to pₖ = (k − α)/(n + 1 − 2α) and α is an adjustment constant, which can be anything between 0 and 1. If the null hypothesis is true, the plotted points should approximately lie on a straight line.

P-P plot: similar to the Q-Q plot, but used much less frequently. This method consists of plotting the points (Φ(z₍ₖ₎), pₖ), where z₍ₖ₎ = (x₍ₖ₎ − μ̂)/σ̂. For normally distributed data this plot should lie on a 45° line between (0, 0) and (1, 1).

Shapiro-Wilk test: employs the fact that the line in the Q-Q plot has the slope of σ. The test compares the least squares estimate of that slope with the value of the sample variance, and rejects the null hypothesis if these two quantities differ significantly.

Normal probability plot (rankit plot)

Moment tests:

D'Agostino's K-squared test

Jarque-Bera test

Empirical distribution function tests:

Lilliefors test (an adaptation of the Kolmogorov-Smirnov test)

Anderson-Darling test

11 Estimation of parameters

See also: Standard error of the mean, Standard deviation (estimation), Variance (estimation), and Maximum likelihood (continuous distribution, continuous parameter space)

It is often the case that we don't know the parameters of the normal distribution, but instead want to estimate them. That is, having a sample (x₁, ..., xₙ) from a normal N(μ, σ²) population, we would like to learn the approximate values of the parameters μ and σ². The standard approach to this problem is the maximum likelihood method, which requires maximization of the log-likelihood function:

\ln L(\mu, \sigma^2) = \sum_{i=1}^{n} \ln f(x_i; \mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.

Taking derivatives with respect to μ and σ² and solving the resulting system of first order conditions yields the maximum likelihood estimates:

\hat{\mu} = \bar{x} \equiv \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.

The estimator μ̂ is called the sample mean, since it is the arithmetic mean of all observations. The statistic x̄ is complete and sufficient for μ, and therefore by the Lehmann-Scheffé theorem, μ̂ is the uniformly minimum variance unbiased (UMVU) estimator.[41] In finite samples it is distributed normally:

\hat{\mu} \sim N(\mu, \sigma^2/n).

The variance of this estimator is equal to the μμ-element of the inverse Fisher information matrix I⁻¹. This implies that the estimator is finite-sample efficient. Of practical importance is the fact that the standard error of μ̂ is proportional to 1/√n; that is, if one wishes to decrease the standard error by a factor of 10, one must increase the number of points in the sample by a factor of 100. This fact is widely used in determining sample sizes for opinion polls and the number of trials in Monte Carlo simulations.
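A minimal Free Pascal sketch of these estimators on a small hypothetical sample (program name and data values are assumptions of this sketch); the Bessel-corrected variant s², discussed further below, is included for comparison:

program Estimators;

const
  n = 5;
  data: array[1..n] of extended = (2.1, 1.9, 2.4, 2.0, 1.6);  { hypothetical sample }

var
  i: integer;
  mean, mle_var, s2, d: extended;

begin
  { sample mean (maximum likelihood estimate of mu) }
  mean := 0;
  for i := 1 to n do
    mean := mean + data[i];
  mean := mean / n;

  { ML variance (divide by n) and Bessel-corrected variance (divide by n - 1) }
  mle_var := 0;
  for i := 1 to n do
  begin
    d := data[i] - mean;
    mle_var := mle_var + d * d;
  end;
  s2 := mle_var / (n - 1);
  mle_var := mle_var / n;

  Writeln('mean    = ', mean:0:4);
  Writeln('sigma^2 = ', mle_var:0:4);
  Writeln('s^2     = ', s2:0:4);
end.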


From the standpoint of the asymptotic theory, μ̂ is consistent, that is, it converges in probability to μ as n → ∞. The estimator is also asymptotically normal, which is a simple corollary of the fact that it is normal in finite samples:

\sqrt{n}(\hat{\mu} - \mu) \;\xrightarrow{d}\; N(0, \sigma^2).

The estimator σ̂² is called the sample variance, since it is the variance of the sample (x₁, ..., xₙ). In practice, another estimator is often used instead of σ̂². This other estimator is denoted s², and is also called the sample variance, which represents a certain ambiguity in terminology; its square root s is called the sample standard deviation. The estimator s² differs from σ̂² by having (n − 1) instead of n in the denominator (the so-called Bessel's correction):

s^2 = \frac{n}{n-1}\,\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.

The difference between s² and σ̂² becomes negligibly small for large n's. In finite samples however, the motivation behind the use of s² is that it is an unbiased estimator of the underlying parameter σ², whereas σ̂² is biased. Also, by the Lehmann-Scheffé theorem the estimator s² is uniformly minimum variance unbiased (UMVU),[41] which makes it the best estimator among all unbiased ones. However it can be shown that the biased estimator σ̂² is better than s² in terms of the mean squared error (MSE) criterion. In finite samples both s² and σ̂² have scaled chi-squared distribution with (n − 1) degrees of freedom:

s^2 \sim \frac{\sigma^2}{n-1}\,\chi^2_{n-1}, \qquad \hat{\sigma}^2 \sim \frac{\sigma^2}{n}\,\chi^2_{n-1}.

The first of these expressions shows that the variance of s² is equal to 2σ⁴/(n − 1), which is slightly greater than the σσ-element of the inverse Fisher information matrix I⁻¹. Thus, s² is not an efficient estimator for σ², and moreover, since s² is UMVU, we can conclude that the finite-sample efficient estimator for σ² does not exist.

Applying the asymptotic theory, both estimators s² and σ̂² are consistent, that is, they converge in probability to σ² as the sample size n → ∞. The two estimators are also both asymptotically normal:

\sqrt{n}(\hat{\sigma}^2 - \sigma^2) \simeq \sqrt{n}(s^2 - \sigma^2) \;\xrightarrow{d}\; N(0, 2\sigma^4).

In particular, both estimators are asymptotically efficient for σ².

By Cochran's theorem, for normal distributions the sample mean μ̂ and the sample variance s² are independent, which means there can be no gain in considering their joint distribution. There is also a converse theorem: if in a sample the sample mean and sample variance are independent, then the sample must have come from the normal distribution. The independence between μ̂ and s can be employed to construct the so-called t-statistic:

t = \frac{\hat{\mu} - \mu}{s/\sqrt{n}} = \frac{\bar{x} - \mu}{\sqrt{\frac{1}{n(n-1)}\sum (x_i - \bar{x})^2}} \;\sim\; t_{n-1}

This quantity t has the Student's t-distribution with (n − 1) degrees of freedom, and it is an ancillary statistic (independent of the value of the parameters). Inverting the distribution of this t-statistic will allow us to construct the confidence interval for μ;[42] similarly, inverting the χ² distribution of the statistic s² will give us the confidence interval for σ²:[43]

\mu \in \left[\hat{\mu} + t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}},\;\; \hat{\mu} + t_{n-1,1-\alpha/2}\,\frac{s}{\sqrt{n}}\right] \approx \left[\hat{\mu} - |z_{\alpha/2}|\,\frac{s}{\sqrt{n}},\;\; \hat{\mu} + |z_{\alpha/2}|\,\frac{s}{\sqrt{n}}\right]

\sigma^2 \in \left[\frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}},\;\; \frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}}\right] \approx \left[s^2 - |z_{\alpha/2}|\,\frac{\sqrt{2}}{\sqrt{n}}\,s^2,\;\; s^2 + |z_{\alpha/2}|\,\frac{\sqrt{2}}{\sqrt{n}}\,s^2\right]

where t_{k,p} and χ²_{k,p} are the pth quantiles of the t- and χ²-distributions respectively. These confidence intervals are of the confidence level 1 − α, meaning that the true values μ and σ² fall outside of these intervals with probability (or significance level) α. In practice people usually take α = 5%, resulting in the 95% confidence intervals. The approximate formulas in the display above were derived from the asymptotic distributions of μ̂ and s². The approximate formulas become valid for large values of n, and are more convenient for manual calculation since the standard normal quantiles z_{α/2} do not depend on n. In particular, the most popular value of α = 5% results in |z_{0.025}| = 1.96.
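For example, with hypothetical values n = 100, x̄ = 5.0 and s = 2.0, the approximate 95% confidence interval for μ is

5.0 \pm 1.96 \cdot \frac{2.0}{\sqrt{100}} = 5.0 \pm 0.392 = [4.61,\; 5.39].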

12 Bayesian analysis of the normal distribution

Bayesian analysis of normally distributed data is complicated by the many different possibilities that may be considered:

Either the mean, or the variance, or neither, may be considered a fixed quantity.

When the variance is unknown, analysis may be done directly in terms of the variance, or in terms of the precision, the reciprocal of the variance. The reason for expressing the formulas in terms of precision is that the analysis of most cases is simplified.

Both univariate and multivariate cases need to be considered.

Either conjugate or improper prior distributions may be placed on the unknown variables.

An additional set of cases occurs in Bayesian linear regression, where in the basic model the data is assumed to be normally distributed, and normal priors are placed on the regression coefficients. The resulting analysis is similar to the basic cases of independent identically distributed data, but more complex.

The formulas for the non-linear-regression cases are summarized in the conjugate prior article.

12.1 Sum of two quadratics

12.1.1 Scalar form

The following auxiliary formula is useful for simplifying the posterior update equations, which otherwise become fairly tedious.

a(x-y)^2 + b(x-z)^2 = (a+b)\left(x - \frac{ay+bz}{a+b}\right)^2 + \frac{ab}{a+b}(y-z)^2

This equation rewrites the sum of two quadratics in x by expanding the squares, grouping the terms in x, and completing the square. Note the following about the constant factors attached to some of the terms:

1. The factor (ay + bz)/(a + b) has the form of a weighted average of y and z.

2. \frac{ab}{a+b} = \frac{1}{\frac{1}{a} + \frac{1}{b}} = (a^{-1} + b^{-1})^{-1}. This shows that this factor can be thought of as resulting from a situation where the reciprocals of quantities a and b add directly, so to combine a and b themselves, it's necessary to reciprocate, add, and reciprocate the result again to get back into the original units. This is exactly the sort of operation performed by the harmonic mean, so it is not surprising that ab/(a + b) is one-half the harmonic mean of a and b.
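A quick numerical check of the scalar identity, taking a = 1, b = 2, y = 0, z = 3:

x^2 + 2(x-3)^2 = 3x^2 - 12x + 18, \qquad 3\left(x - \frac{0 + 6}{3}\right)^2 + \frac{2}{3}(0-3)^2 = 3(x-2)^2 + 6 = 3x^2 - 12x + 18.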

    12.1.2 Vector form

A similar formula can be written for the sum of two vector quadratics: If x, y, z are vectors of length k, and A and B are symmetric, invertible matrices of size k × k, then

(y-x)'A(y-x) + (x-z)'B(x-z) = (x-c)'(A+B)(x-c) + (y-z)'(A^{-1}+B^{-1})^{-1}(y-z)

where

c = (A+B)^{-1}(Ay + Bz)

Note that the form x′Ax is called a quadratic form and is a scalar:

x'Ax = \sum_{i,j} a_{ij} x_i x_j

In other words, it sums up all possible combinations of products of pairs of elements from x, with a separate coefficient for each. In addition, since x_i x_j = x_j x_i, only the sum a_{ij} + a_{ji} matters for any off-diagonal elements of A, and there is no loss of generality in assuming that A is symmetric. Furthermore, if A is symmetric, then the form x'Ay = y'Ax.

12.2 Sum of differences from the mean

    Another useful formula is as follows:

\sum_{i=1}^{n}(x_i - \mu)^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\bar{x} - \mu)^2

where \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.

    12.3 With known variance

For a set of i.i.d. normally distributed data points X of size n where each individual point x follows x ~ N(μ, σ²) with known variance σ², the conjugate prior distribution is also normally distributed.

This can be shown more easily by rewriting the variance as the precision, i.e. using τ = 1/σ². Then if x ~ N(μ, 1/τ) and μ ~ N(μ₀, 1/τ₀), we proceed as follows.

First, the likelihood function is (using the formula above for the sum of differences from the mean):

p(X \mid \mu, \tau) = \prod_{i=1}^{n} \sqrt{\frac{\tau}{2\pi}} \exp\!\left(-\frac{\tau}{2}(x_i-\mu)^2\right) = \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\!\left(-\frac{\tau}{2}\sum_{i=1}^{n}(x_i-\mu)^2\right) = \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\!\left[-\frac{\tau}{2}\left(\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2\right)\right].

    Then, we proceed as follows:


p(\mu \mid X) \propto p(X \mid \mu)\, p(\mu)

= \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\!\left[-\frac{\tau}{2}\left(\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2\right)\right] \sqrt{\frac{\tau_0}{2\pi}} \exp\!\left(-\frac{\tau_0}{2}(\mu-\mu_0)^2\right)

\propto \exp\!\left(-\frac{1}{2}\left[\tau\left(\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2\right) + \tau_0(\mu-\mu_0)^2\right]\right)

\propto \exp\!\left(-\frac{1}{2}\left[n\tau(\bar{x}-\mu)^2 + \tau_0(\mu-\mu_0)^2\right]\right)

= \exp\!\left(-\frac{1}{2}(n\tau + \tau_0)\left(\mu - \frac{n\tau\bar{x} + \tau_0\mu_0}{n\tau + \tau_0}\right)^2 - \frac{n\tau\tau_0}{2(n\tau+\tau_0)}(\bar{x} - \mu_0)^2\right)

\propto \exp\!\left(-\frac{1}{2}(n\tau + \tau_0)\left(\mu - \frac{n\tau\bar{x} + \tau_0\mu_0}{n\tau + \tau_0}\right)^2\right)

In the above derivation, we used the formula above for the sum of two quadratics and eliminated all constant factors not involving μ. The result is the kernel of a normal distribution, with mean (nτx̄ + τ₀μ₀)/(nτ + τ₀) and precision nτ + τ₀, i.e.

p(\mu \mid X) \sim N\!\left(\frac{n\tau\bar{x} + \tau_0\mu_0}{n\tau + \tau_0},\; \frac{1}{n\tau + \tau_0}\right)

This can be written as a set of Bayesian update equations for the posterior parameters in terms of the prior parameters:

\tau_0' = \tau_0 + n\tau

\mu_0' = \frac{n\tau\bar{x} + \tau_0\mu_0}{n\tau + \tau_0}

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

That is, to combine n data points with total precision of nτ (or equivalently, total variance of σ²/n) and mean of values x̄, derive a new total precision simply by adding the total precision of the data to the prior total precision, and form a new mean through a precision-weighted average, i.e. a weighted average of the data mean and the prior mean, each weighted by the associated total precision. This makes logical sense if the precision is thought of as indicating the certainty of the observations: In the distribution of the posterior mean, each of the input components is weighted by its certainty, and the certainty of this distribution is the sum of the individual certainties. (For the intuition of this, compare the expression "the whole is (or is not) greater than the sum of its parts". In addition, consider that the knowledge of the posterior comes from a combination of the knowledge of the prior and the likelihood, so it makes sense that we are more certain of it than of either of its components.)

The above formula reveals why it is more convenient to do Bayesian analysis of conjugate priors for the normal distribution in terms of the precision. The posterior precision is simply the sum of the prior and likelihood precisions, and the posterior mean is computed through a precision-weighted average, as described above. The same formulas can be written in terms of variance by reciprocating all the precisions, yielding the uglier formulas

\sigma_0'^2 = \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}}

\mu_0' = \frac{\frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}}

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
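A minimal Free Pascal sketch of these update equations in their precision form; the program name, prior parameters and data values are hypothetical assumptions of this sketch:

program PosteriorUpdate;

const
  n = 4;
  data: array[1..n] of extended = (9.8, 10.2, 10.1, 9.9);  { hypothetical observations }
  sigma  = 0.5;   { known standard deviation of each observation }
  mu0    = 8.0;   { prior mean }
  sigma0 = 2.0;   { prior standard deviation }

var
  i: integer;
  xbar, tau, tau0, tauPost, muPost: extended;

begin
  { sample mean of the data }
  xbar := 0;
  for i := 1 to n do
    xbar := xbar + data[i];
  xbar := xbar / n;

  { precisions of the data and of the prior }
  tau := 1 / (sigma * sigma);
  tau0 := 1 / (sigma0 * sigma0);

  { posterior precision and precision-weighted posterior mean }
  tauPost := tau0 + n * tau;
  muPost := (n * tau * xbar + tau0 * mu0) / (n * tau + tau0);

  Writeln('posterior mean      = ', muPost:0:4);
  Writeln('posterior precision = ', tauPost:0:4);
end.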

    12.4 With known mean

For a set of i.i.d. normally distributed data points X of size n where each individual point x follows x ~ N(μ, σ²) with known mean μ, the conjugate prior of the variance has an inverse gamma distribution or a scaled inverse chi-squared distribution. The two are equivalent except for having different parameterizations. Although the inverse gamma is more commonly used, we use the scaled inverse chi-squared for the sake of convenience. The prior for σ² is as follows:

p(\sigma^2 \mid \nu_0, \sigma_0^2) = \frac{(\sigma_0^2\,\nu_0/2)^{\nu_0/2}}{\Gamma(\nu_0/2)} \cdot \frac{\exp\!\left[\frac{-\nu_0\sigma_0^2}{2\sigma^2}\right]}{(\sigma^2)^{1+\nu_0/2}} \;\propto\; \frac{\exp\!\left[\frac{-\nu_0\sigma_0^2}{2\sigma^2}\right]}{(\sigma^2)^{1+\nu_0/2}}

The likelihood function from above, written in terms of the variance, is:

p(X \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right] = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left[-\frac{S}{2\sigma^2}\right]

where

S = \sum_{i=1}^{n}(x_i - \mu)^2.

    Then:


p(\sigma^2 \mid X) \propto p(X \mid \sigma^2)\, p(\sigma^2)

= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left[-\frac{S}{2\sigma^2}\right] \cdot \frac{(\sigma_0^2\,\nu_0/2)^{\nu_0/2}}{\Gamma(\nu_0/2)} \cdot \frac{\exp\!\left[\frac{-\nu_0\sigma_0^2}{2\sigma^2}\right]}{(\sigma^2)^{1+\nu_0/2}}

\propto \left(\frac{1}{\sigma^2}\right)^{n/2} \cdot \frac{1}{(\sigma^2)^{1+\nu_0/2}} \cdot \exp\!\left[-\frac{S}{2\sigma^2} - \frac{\nu_0\sigma_0^2}{2\sigma^2}\right]

= \frac{1}{(\sigma^2)^{1+\frac{\nu_0+n}{2}}} \exp\!\left[-\frac{\nu_0\sigma_0^2 + S}{2\sigma^2}\right]

The above is also a scaled inverse chi-squared distribution where

\nu_0' = \nu_0 + n

\nu_0'\,\sigma_0'^2 = \nu_0\sigma_0^2 + \sum_{i=1}^{n}(x_i - \mu)^2

or equivalently

\nu_0' = \nu_0 + n

\sigma_0'^2 = \frac{\nu_0\sigma_0^2 + \sum_{i=1}^{n}(x_i - \mu)^2}{\nu_0 + n}

Reparameterizing in terms of an inverse gamma distribution, the result is:

\alpha' = \alpha + \frac{n}{2}

\beta' = \beta + \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{2}

12.5 With unknown mean and unknown variance

For a set of i.i.d. normally distributed data points X of size n where each individual point x follows x ~ N(μ, σ²) with unknown mean μ and unknown variance σ², a combined (multivariate) conjugate prior is placed over the mean and variance, consisting of a normal-inverse-gamma distribution. Logically, this originates as follows:

1. From the analysis of the case with unknown mean but known variance, we see that the update equations involve sufficient statistics computed from the data, consisting of the mean of the data points and the total variance of the data points, computed in turn from the known variance divided by the number of data points.

2. From the analysis of the case with unknown variance but known mean, we see that the update equations involve sufficient statistics over the data, consisting of the number of data points and the sum of squared deviations.

3. Keep in mind that the posterior update values serve as the prior distribution when further data is handled. Thus, we should logically think of our priors in terms of the sufficient statistics just described, with the same semantics kept in mind as much as possible.

4. To handle the case where both mean and variance are unknown, we could place independent priors over the mean and variance, with fixed estimates of the average mean, total variance, number of data points used to compute the variance prior, and sum of squared deviations. Note however that in reality, the total variance of the mean depends on the unknown variance, and the sum of squared deviations that goes into the variance prior (appears to) depend on the unknown mean. In practice, the latter dependence is relatively unimportant: shifting the actual mean shifts the generated points by an equal amount, and on average the squared deviations will remain the same. This is not the case, however, with the total variance of the mean: as the unknown variance increases, the total variance of the mean will increase proportionately, and we would like to capture this dependence.

5. This suggests that we create a conditional prior of the mean on the unknown variance, with a hyperparameter specifying the mean of the pseudo-observations associated with the prior, and another parameter specifying the number of pseudo-observations. This number serves as a scaling parameter on the variance, making it possible to control the overall variance of the mean relative to the actual variance parameter. The prior for the variance also has two hyperparameters, one specifying the sum of squared deviations of the pseudo-observations associated with the prior, and another specifying once again the number of pseudo-observations. Note that each of the priors has a hyperparameter specifying the number of pseudo-observations, and in each case this controls the relative variance of that prior. These are given as two separate hyperparameters so that the variance (aka the confidence) of the two priors can be controlled separately.

6. This leads immediately to the normal-inverse-gamma distribution, which is the product of the two distributions just defined, with conjugate priors used (an inverse gamma distribution over the variance, and a normal distribution over the mean, conditional on the variance) and with the same four parameters just defined.

The priors are normally defined as follows:

p(\mu \mid \sigma^2;\ \mu_0, n_0) \sim N(\mu_0, \sigma^2/n_0)

p(\sigma^2;\ \nu_0, \sigma_0^2) \sim I\chi^2(\nu_0, \sigma_0^2) = IG(\nu_0/2,\ \nu_0\sigma_0^2/2)


    The update equations can be derived, and look as follows:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

\mu_0' = \frac{n_0\mu_0 + n\bar{x}}{n_0 + n}

n_0' = n_0 + n

\nu_0' = \nu_0 + n

\nu_0'\,\sigma_0'^2 = \nu_0\sigma_0^2 + \sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{n_0 n}{n_0 + n}(\mu_0 - \bar{x})^2

The respective numbers of pseudo-observations add the number of actual observations to them. The new mean hyperparameter is once again a weighted average, this time weighted by the relative numbers of observations. Finally, the update for ν₀′σ₀′² is similar to the case with known mean, but in this case the sum of squared deviations is taken with respect to the observed data mean rather than the true mean, and as a result a new interaction term needs to be added to take care of the additional error source stemming from the deviation between prior and data mean.[Proof]

    The prior distributions are

p(\mu \mid \sigma^2;\ \mu_0, n_0) \sim N(\mu_0, \sigma^2/n_0) = \frac{1}{\sqrt{2\pi\frac{\sigma^2}{n_0}}} \exp\!\left(-\frac{n_0}{2\sigma^2}(\mu - \mu_0)^2\right) \propto (\sigma^2)^{-1/2} \exp\!\left(-\frac{n_0}{2\sigma^2}(\mu - \mu_0)^2\right)

p(\sigma^2;\ \nu_0, \sigma_0^2) \sim I\chi^2(\nu_0, \sigma_0^2) = IG(\nu_0/2,\ \nu_0\sigma_0^2/2) = \frac{(\sigma_0^2\,\nu_0/2)^{\nu_0/2}}{\Gamma(\nu_0/2)} \cdot \frac{\exp\!\left[\frac{-\nu_0\sigma_0^2}{2\sigma^2}\right]}{(\sigma^2)^{1+\nu_0/2}} \propto (\sigma^2)^{-(1+\nu_0/2)} \exp\!\left[-\frac{\nu_0\sigma_0^2}{2\sigma^2}\right]

    Therefore, the joint prior is

p(\mu, \sigma^2;\ \mu_0, n_0, \nu_0, \sigma_0^2) = p(\mu \mid \sigma^2;\ \mu_0, n_0)\; p(\sigma^2;\ \nu_0, \sigma_0^2) \propto (\sigma^2)^{-(\nu_0+3)/2} \exp\!\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + n_0(\mu - \mu_0)^2\right)\right]

The likelihood function from the section above with known variance is:

p(X \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right]

Writing it in terms of variance rather than precision, we get:

p(X \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left[-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\right)\right] \propto (\sigma^2)^{-n/2} \exp\!\left[-\frac{1}{2\sigma^2}\left(S + n(\bar{x} - \mu)^2\right)\right]

where S = \sum_{i=1}^{n}(x_i - \bar{x})^2.

Therefore, the posterior is (dropping the hyperparameters as conditioning factors):

p(\mu, \sigma^2 \mid X) \propto p(\mu, \sigma^2)\; p(X \mid \mu, \sigma^2)

\propto (\sigma^2)^{-(\nu_0+3)/2} \exp\!\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + n_0(\mu-\mu_0)^2\right)\right] \cdot (\sigma^2)^{-n/2} \exp\!\left[-\frac{1}{2\sigma^2}\left(S + n(\bar{x}-\mu)^2\right)\right]

= (\sigma^2)^{-(\nu_0+n+3)/2} \exp\!\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + S + n_0(\mu-\mu_0)^2 + n(\bar{x}-\mu)^2\right)\right]

= (\sigma^2)^{-(\nu_0+n+3)/2} \exp\!\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + S + \frac{n_0 n}{n_0+n}(\mu_0-\bar{x})^2 + (n_0+n)\left(\mu - \frac{n_0\mu_0 + n\bar{x}}{n_0+n}\right)^2\right)\right]

\propto (\sigma^2)^{-1/2} \exp\!\left[-\frac{n_0+n}{2\sigma^2}\left(\mu - \frac{n_0\mu_0 + n\bar{x}}{n_0+n}\right)^2\right] \cdot (\sigma^2)^{-(\nu_0/2 + n/2 + 1)} \exp\!\left[-\frac{1}{2\sigma^2}\left(\nu_0\sigma_0^2 + S + \frac{n_0 n}{n_0+n}(\mu_0-\bar{x})^2\right)\right]

= N_{\mu\mid\sigma^2}\!\left(\frac{n_0\mu_0 + n\bar{x}}{n_0+n},\; \frac{\sigma^2}{n_0+n}\right) \cdot IG_{\sigma^2}\!\left(\frac{1}{2}(\nu_0+n),\; \frac{1}{2}\left(\nu_0\sigma_0^2 + S + \frac{n_0 n}{n_0+n}(\mu_0-\bar{x})^2\right)\right).

In other words, the posterior distribution has the form of a product of a normal distribution over p(μ | σ²) times an inverse gamma distribution over p(σ²), with parameters that are the same as the update equations above.

13 Occurrence

The occurrence of normal distribution in practical problems can be loosely classified into four categories:

1. Exactly normal distributions;

2. Approximately normal laws, for example when such approximation is justified by the central limit theorem; and

3. Distributions modeled as normal: the normal distribution being the distribution with maximum entropy for a given mean and variance.

4. Regression problems: the normal distribution being found after systematic effects have been modeled sufficiently well.

13.1 Exact normality

Certain quantities in physics are distributed normally, as was first demonstrated by James Clerk Maxwell. Examples of such quantities are:


[Figure: The ground state of a quantum harmonic oscillator has the Gaussian distribution.]

Velocities of the molecules in the ideal gas. More generally, velocities of the particles in any system in thermodynamic equilibrium will have normal distribution, due to the maximum entropy principle.

Probability density function of a ground state in a quantum harmonic oscillator.

The position of a particle that experiences diffusion. If initially the particle is located at a specific point (that is, its probability distribution is the Dirac delta function), then after time t its location is described by a normal distribution with variance t, which satisfies the diffusion equation ∂f(x,t)/∂t = (1/2)·∂²f(x,t)/∂x². If the initial location is given by a certain density function g(x), then the density at time t is the convolution of g and the normal PDF.

    13.2 Approximate normality

Approximately normal distributions occur in many situations, as explained by the central limit theorem. When the outcome is produced by many small effects acting additively and independently, its distribution will be close to normal. The normal approximation will not be valid if the effects act multiplicatively (instead of additively), or if there is a single external influence that has a considerably larger magnitude than the rest of the effects.

In counting problems, where the central limit theorem includes a discrete-to-continuum approximation and where infinitely divisible and decomposable distributions are involved, such as

Binomial random variables, associated with binary response variables;

Poisson random variables, associated with rare events;

Thermal light has a Bose–Einstein distribution on very short time scales, and a normal distribution on longer timescales due to the central limit theorem.

    13.3 Assumed normality

Histogram of sepal widths for Iris versicolor from Fisher's Iris flower data set, with superimposed best-fitting normal distribution.

I can only recognize the occurrence of the normal curve, the Laplacian curve of errors, as a very abnormal phenomenon. It is roughly approximated to in certain distributions; for this reason, and on account of its beautiful simplicity, we may, perhaps, use it as a first approximation, particularly in theoretical investigations.
Pearson (1901)

There are statistical methods to empirically test that assumption; see the Normality tests section above.

In biology, the logarithms of various variables tend to have a normal distribution; that is, the variables tend to have a log-normal distribution (after separation into male/female subpopulations), with examples including:

Measures of size of living tissue (length, height, skin area, weight);[44]

The length of inert appendages (hair, claws, nails, teeth) of biological specimens, in the direction of growth; presumably the thickness of tree bark also falls under this category;

Certain physiological measurements, such as blood pressure of adult humans.

In finance, in particular the Black–Scholes model, changes in the logarithm of exchange rates, price indices, and stock market indices are assumed normal (these variables behave like compound interest, not


like simple interest, and so are multiplicative). Some mathematicians such as Benoît Mandelbrot have argued that log-Lévy distributions, which possess heavy tails, would be a more appropriate model, in particular for the analysis of stock market crashes.

Measurement errors in physical experiments are often modeled by a normal distribution. This use of a normal distribution does not imply that one is assuming the measurement errors are normally distributed; rather, using the normal distribution produces the most conservative predictions possible given only knowledge about the mean and variance of the errors.[45]

Fitted cumulative normal distribution to October rainfalls; see distribution fitting.

In standardized testing, results can be made to have a normal distribution by either selecting the number and difficulty of questions (as in the IQ test) or transforming the raw test scores into output scores by fitting them to the normal distribution. For example, the SAT's traditional range of 200–800 is based on a normal distribution with a mean of 500 and a standard deviation of 100.

Many scores are derived from the normal distribution, including percentile ranks (percentiles or quantiles), normal curve equivalents, stanines, z-scores, and T-scores. Additionally, some behavioral statistical procedures assume that scores are normally distributed; for example, t-tests and ANOVAs. Bell curve grading assigns relative grades based on a normal distribution of scores.

In hydrology the distribution of long duration river discharge or rainfall, e.g. monthly and yearly totals, is often thought to be practically normal according to the central limit theorem.[46] The blue picture illustrates an example of fitting the normal distribution to ranked October rainfalls showing the 90% confidence belt based on the binomial distribution. The rainfall data are represented by plotting positions as part of the cumulative frequency analysis.

    13.4 Produced normality

In regression analysis, lack of normality in residuals simply indicates that the model postulated is inadequate in accounting for the tendency in the data and needs to be augmented; in other words, normality in residuals can always be achieved given a properly constructed model.

14 Generating values from normal distribution

The bean machine, a device invented by Francis Galton, can be called the first generator of normal random variables. This machine consists of a vertical board with interleaved rows of pins. Small balls are dropped from the top and then bounce randomly left or right as they hit the pins. The balls are collected into bins at the bottom and settle down into a pattern resembling the Gaussian curve.

In computer simulations, especially in applications of the Monte Carlo method, it is often desirable to generate values that are normally distributed. The algorithms listed below all generate standard normal deviates, since an N(μ, σ²) variate can be generated as X = μ + σZ, where Z is standard normal. All these algorithms rely on the availability of a random number generator U capable of producing uniform random variates.

The most straightforward method is based on the probability integral transform property: if U is distributed uniformly on (0,1), then Φ⁻¹(U) will have the standard normal distribution. The drawback of this method is that it relies on calculation of the probit function Φ⁻¹, which cannot be done analytically. Some approximate methods are described in Hart (1968) and in the erf article. Wichura[47] gives a fast algorithm for computing this function to 16 decimal places, which is used by R to compute random variates of the normal distribution.
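For example, Python's standard library exposes an accurate probit through statistics.NormalDist, so the probability integral transform can be sketched as follows; the helper name is illustrative, and the rescaling line also shows the X = μ + σZ step mentioned above.

    import random
    from statistics import NormalDist

    def normal_via_inverse_cdf(mu=0.0, sigma=1.0, rng=random.Random(0)):
        # Draw one N(mu, sigma^2) variate: apply the probit function to a
        # Uniform(0,1) deviate, then rescale the resulting standard normal Z.
        u = rng.random()                      # U ~ Uniform(0,1)
        z = NormalDist().inv_cdf(u)           # Z = Phi^{-1}(U) is standard normal
        return mu + sigma * z                 # X = mu + sigma * Z

    sample = [normal_via_inverse_cdf(mu=5.0, sigma=2.0) for _ in range(10_000)]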


An easy-to-program approximate approach that relies on the central limit theorem is as follows: generate 12 uniform U(0,1) deviates, add them all up, and subtract 6; the resulting random variable will have approximately standard normal distribution. In truth, the distribution will be Irwin–Hall, which is a 12-section eleventh-order polynomial approximation to the normal distribution. This random deviate will have a limited range of (−6, 6).[48]
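A minimal sketch of this twelve-uniforms trick (the function name is illustrative):

    import random

    def approx_standard_normal(rng=random.Random(0)):
        # The sum of 12 Uniform(0,1) deviates has mean 6 and variance 12 * (1/12) = 1,
        # so subtracting 6 gives an approximate (Irwin-Hall) standard normal deviate.
        return sum(rng.random() for _ in range(12)) - 6.0

    draws = [approx_standard_normal() for _ in range(10_000)]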

The Box–Muller method uses two independent random numbers U and V distributed uniformly on (0,1). Then the two random variables X and Y

X = \sqrt{-2 \ln U} \, \cos(2\pi V), \qquad Y = \sqrt{-2 \ln U} \, \sin(2\pi V)

will both have the standard normal distribution, and will be independent. This formulation arises because for a bivariate normal random vector (X, Y) the squared norm X² + Y² will have the chi-squared distribution with two degrees of freedom, which is an easily generated exponential random variable corresponding to the quantity −2 ln(U) in these equations; and the angle is distributed uniformly around the circle, chosen by the random variable V.
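A minimal sketch of the Box–Muller transform as described above:

    import math
    import random

    def box_muller(rng=random.Random(0)):
        # Return a pair of independent standard normal deviates built from
        # two independent Uniform(0,1) deviates U and V.
        u = 1.0 - rng.random()                # in (0, 1], so log(u) is finite
        v = rng.random()
        r = math.sqrt(-2.0 * math.log(u))     # radius: -2 ln U is chi-squared with 2 degrees of freedom
        theta = 2.0 * math.pi * v             # angle: uniform around the circle
        return r * math.cos(theta), r * math.sin(theta)

    x, y = box_muller()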

The Marsaglia polar method is a modification of the Box–Muller algorithm which does not require computation of the functions sin() and cos(). In this method U and V are drawn from the uniform (−1,1) distribution, and then S = U² + V² is computed. If S is greater than or equal to one, the method starts over; otherwise, the two quantities

X = U \sqrt{\frac{-2 \ln S}{S}}, \qquad Y = V \sqrt{\frac{-2 \ln S}{S}}

are returned. Again, X and Y will be independent and standard normally distributed.
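A minimal sketch of the polar method as described above:

    import math
    import random

    def marsaglia_polar(rng=random.Random(0)):
        # Return a pair of independent standard normal deviates without
        # evaluating sin or cos, by rejection sampling on the unit disk.
        while True:
            u = rng.uniform(-1.0, 1.0)
            v = rng.uniform(-1.0, 1.0)
            s = u * u + v * v
            if 0.0 < s < 1.0:                 # restart if outside the unit disk (or exactly at the origin)
                factor = math.sqrt(-2.0 * math.log(s) / s)
                return u * factor, v * factor

    x, y = marsaglia_polar()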

The Ratio method[49] is a rejection method. The algorithm proceeds as follows:

Generate two independent uniform deviates U and V;

Compute X = \sqrt{8/e}\,(V - 0.5)/U;

Optional: if X^2 \le 5 - 4e^{1/4} U then accept X and terminate the algorithm;

Optional: if X^2 \ge 4e^{-1.35}/U + 1.4 then reject X and start over from step 1;

If X^2 \le -4 \ln U then accept X; otherwise start the algorithm over.
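A direct transcription of these steps into Python (a sketch only; the two optional squeeze tests merely avoid most logarithm evaluations):

    import math
    import random

    def ratio_method(rng=random.Random(0)):
        # Ratio-of-uniforms rejection sampler for a standard normal deviate,
        # following the steps listed above.
        while True:
            u = rng.random()
            v = rng.random()
            if u == 0.0:
                continue                                   # avoid division by zero
            x = math.sqrt(8.0 / math.e) * (v - 0.5) / u
            x2 = x * x
            if x2 <= 5.0 - 4.0 * math.exp(0.25) * u:       # optional quick accept
                return x
            if x2 >= 4.0 * math.exp(-1.35) / u + 1.4:      # optional quick reject
                continue
            if x2 <= -4.0 * math.log(u):                   # exact acceptance test
                return x

    z = ratio_method()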

The ziggurat algorithm[50] is faster than the Box–Muller transform and still exact. In about 97% of all cases it uses only two random numbers, one random integer and one random uniform, one multiplication and an if-test. Only in 3% of the cases, where the combination of those two falls outside the core of the ziggurat (a kind of rejection sampling using logarithms), do exponentials and more uniform random numbers have to be employed.

There is also some investigation[51] into the connection between the fast Hadamard transform and the normal distribution, since the transform employs just addition and subtraction and by the central limit theorem random numbers from almost any distribution will be transformed into the normal distribution. In this regard a series of Hadamard transforms can be combined with random permutations to turn arbitrary data sets into normally distributed data.

15 Numerical approximations for the normal CDF

The standard normal CDF is widely used in scientific and statistical computing. The values Φ(x) may be approximated very accurately by a variety of methods, such as numerical integration, Taylor series, asymptotic series and continued fractions. Different approximations are used depending on the desired level of accuracy.

A very simple and practical approximation is given by Bell[52] with a maximum absolute error of 0.003:

\Phi(x) \approx \frac{1}{2}\left[1 + \operatorname{sign}(x)\left(1 - e^{-\frac{2}{\pi} x^2}\right)^{1/2}\right]

    The inverse is also easily obtained.
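A minimal sketch of this approximation, together with one way of inverting it (obtained by solving the formula above for x):

    import math

    def phi_approx(x):
        # Approximate standard normal CDF; maximum absolute error about 0.003.
        sign = 1.0 if x >= 0.0 else -1.0
        return 0.5 * (1.0 + sign * math.sqrt(1.0 - math.exp(-2.0 * x * x / math.pi)))

    def phi_approx_inv(p):
        # Invert the approximation above for p in (0, 1).
        sign = 1.0 if p >= 0.5 else -1.0
        return sign * math.sqrt(-(math.pi / 2.0) * math.log(1.0 - (2.0 * p - 1.0) ** 2))

    print(phi_approx(1.0))         # about 0.843 (exact value: 0.8413)
    print(phi_approx_inv(0.975))   # about 1.91 (exact value: 1.960)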

Zelen & Severo (1964) give the approximation for Φ(x) for x > 0 with the absolute error |ε(x)|

