Wiki Book on Statistics

Pages collected from Wikipedia on statistics, from the basics to more advanced topics.
Contents

Articles

Bernoulli distribution
Beta distribution
Beta function
Beta-binomial distribution
Binomial coefficient
Binomial distribution
Cauchy distribution
Cauchy–Schwarz inequality
Characteristic function (probability theory)
Chernoff bound
Chi-squared distribution
Computational complexity of mathematical operations
Conjugate prior
Continuous mapping theorem
Convergence of random variables
Convergent series
Copula (probability theory)
Coupon collector's problem
Degrees of freedom (statistics)
Determinant
Dirichlet distribution
Effect size
Erlang distribution
Expectation–maximization algorithm
Exponential distribution
F-distribution
F-test
Fisher information
Fisher's exact test
Gamma distribution
Gamma function
Geometric distribution
Hypergeometric distribution
Hölder's inequality
Inverse Gaussian distribution
Inverse-gamma distribution
Iteratively reweighted least squares
Kendall tau rank correlation coefficient
Kolmogorov–Smirnov test
Kronecker's lemma
Kullback–Leibler divergence
Laplace distribution
Laplace's equation
Laplace's method
Likelihood-ratio test
List of integrals of exponential functions
List of integrals of Gaussian functions
List of integrals of hyperbolic functions
List of integrals of logarithmic functions
Lists of integrals
Local regression
Log-normal distribution
Logrank test
Lévy distribution
Mann–Whitney U
Matrix calculus
Maximum likelihood
McNemar's test
Multicollinearity
Multivariate normal distribution
n-sphere
Negative binomial distribution
Noncentral chi-squared distribution
Noncentral F-distribution
Noncentral t-distribution
Norm (mathematics)
Normal distribution
Order statistic
Ordinary differential equation
Partial differential equation
Pearson's chi-squared test
Perron–Frobenius theorem
Poisson distribution
Poisson process
Proportional hazards models
Random permutation statistics
Rank (linear algebra)
Resampling (statistics)
Schur complement
Sign test
Singular value decomposition
Stein's method
Stirling's approximation
Student's t-distribution
Summation by parts
Taylor series
Uniform distribution (continuous)
Uniform distribution (discrete)
Weibull distribution
Wilcoxon signed-rank test
Wishart distribution

References

Article Sources and Contributors
Image Sources, Licenses and Contributors

Article Licenses

License


    Bernoulli distribution

Bernoulli

Parameters: 0 ≤ p ≤ 1 (success probability), q = 1 − p
Support: k ∈ {0, 1}
PMF: q = 1 − p for k = 0; p for k = 1
CDF: 0 for k < 0; 1 − p for 0 ≤ k < 1; 1 for k ≥ 1
Mean: p
Median: 0 if p < 1/2; [0, 1] if p = 1/2; 1 if p > 1/2
Mode: 0 if p < 1/2; 0 and 1 if p = 1/2; 1 if p > 1/2
Variance: p(1 − p) = pq
Skewness: (1 − 2p)/√(pq)
Ex. kurtosis: (1 − 6pq)/(pq)
Entropy: −q ln q − p ln p
MGF: q + p e^t
CF: q + p e^(it)
PGF: q + pz

In probability theory and statistics, the Bernoulli distribution, named after the Swiss scientist Jacob Bernoulli, is a discrete probability distribution which takes value 1 with success probability p and value 0 with failure probability q = 1 − p. So if X is a random variable with this distribution, we have:

Pr(X = 1) = p,  Pr(X = 0) = 1 − p = q.

A classical example of a Bernoulli experiment is a single toss of a coin. The coin might come up heads with probability p and tails with probability 1 − p. The experiment is called fair if p = 0.5, indicating the origin of the terminology in betting (the bet is fair if both possible outcomes have the same probability). The probability mass function f of this distribution is

f(k; p) = p if k = 1, and f(k; p) = 1 − p if k = 0.

This can also be expressed as

f(k; p) = p^k (1 − p)^(1−k), for k ∈ {0, 1}.

The expected value of a Bernoulli random variable X is E[X] = p, and its variance is var[X] = p(1 − p).
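These basic facts are easy to check numerically. The following is a minimal sketch assuming Python with scipy available (an assumption of this edit, not part of the original text):

```python
# Sketch: checking the Bernoulli PMF, mean, and variance numerically.
from scipy.stats import bernoulli

p = 0.3
rv = bernoulli(p)

print(rv.pmf(1))   # p                  -> 0.3
print(rv.pmf(0))   # 1 - p              -> 0.7
print(rv.mean())   # E[X] = p           -> 0.3
print(rv.var())    # var[X] = p(1 - p)  -> 0.21
```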

The above can also be obtained by viewing the Bernoulli distribution as a special case of the binomial distribution with n = 1.[1]

The kurtosis goes to infinity for high and low values of p, but for p = 1/2 the Bernoulli distribution has a lower excess kurtosis than any other probability distribution, namely −2. The Bernoulli distribution is a member of the exponential family. The maximum likelihood estimator of p based on a random sample is the sample mean.

Related distributions

If X_1, ..., X_n are independent, identically distributed (i.i.d.) random variables, all Bernoulli distributed with success probability p, then Y = X_1 + ... + X_n ~ Binomial(n, p) (binomial distribution). The Bernoulli distribution is simply Binomial(1, p).

The categorical distribution is the generalization of the Bernoulli distribution for variables with any constant number of discrete values.

The beta distribution is the conjugate prior of the Bernoulli distribution.

The geometric distribution is the number of Bernoulli trials needed to get one success.

Notes

[1] McCullagh and Nelder (1989), Section 4.2.2.

References

McCullagh, Peter; Nelder, John (1989). Generalized Linear Models, Second Edition. Boca Raton: Chapman and Hall/CRC. ISBN 0-412-31760-5.

Johnson, N.L.; Kotz, S.; Kemp, A. (1993). Univariate Discrete Distributions (2nd Edition). Wiley. ISBN 0-471-54897-9.

External links

Hazewinkel, Michiel, ed. (2001), "Binomial distribution" (http://www.encyclopediaofmath.org/index.php?title=p/b016420), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4.

Weisstein, Eric W., "Bernoulli Distribution" (http://mathworld.wolfram.com/BernoulliDistribution.html) from MathWorld.


    Beta distribution

Beta

[Plots of the probability density function and the cumulative distribution function omitted.]

Parameters: α > 0 shape (real), β > 0 shape (real)
Support: x ∈ [0, 1]
PDF: f(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β)
CDF: I_x(α, β) (the regularized incomplete beta function)
Mean: α/(α + β); E[ln X] = ψ(α) − ψ(α + β) (see digamma function)
Median: no general closed form, see text; approximately (α − 1/3)/(α + β − 2/3) for α, β > 1
Mode: (α − 1)/(α + β − 2) for α, β > 1
Variance: αβ / ((α + β)² (α + β + 1)); var[ln X] = ψ₁(α) − ψ₁(α + β) (see polygamma function)
Skewness: 2(β − α)√(α + β + 1) / ((α + β + 2)√(αβ))
Ex. kurtosis: 6((α − β)²(α + β + 1) − αβ(α + β + 2)) / (αβ(α + β + 2)(α + β + 3))
Entropy: see text
MGF: 1 + Σ_{k=1}^∞ ( ∏_{r=0}^{k−1} (α + r)/(α + β + r) ) t^k/k!
CF: ₁F₁(α; α + β; it) (see confluent hypergeometric function)

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1], parametrized by two positive shape parameters, denoted by α and β. The beta distribution has been applied to model the behavior of random variables limited to intervals of finite length. It has been used in population genetics for a statistical description of the allele frequencies in the components of a sub-divided population. It has also been used extensively in PERT, the critical path method (CPM) and other project management / control systems to describe the statistical distributions of the time to completion and the cost of a task. It has also been applied in acoustic analysis to assess damage to gears, as the kurtosis of the beta distribution has been reported as a good indicator of the condition of gears[1]. It has also been used to model sunshine data for application to solar renewable energy utilization[2]. It has also been used for parametrizing variability of soil properties at the regional level for crop yield estimation, modeling crop response over the area of the association[3]. It has also been used to determine well-log shale parameters, to describe the proportions of the mineralogical components existing in a certain stratigraphic interval[4]. It is used extensively in Bayesian inference, since beta distributions provide a family of conjugate prior distributions for binomial and geometric distributions. For example, the beta distribution can be used in Bayesian analysis to describe initial knowledge concerning the probability of success, such as the probability that a space vehicle will successfully complete a specified mission. The beta distribution is a suitable model for the random behavior of percentages. It can be suited to the statistical modelling of proportions in applications where values of proportions equal to 0 or 1 do not occur. One theoretical case where the beta distribution arises is as the distribution of the ratio formed by one random variable having a gamma distribution divided by the sum of it and another independent random variable also having a gamma distribution with the same scale parameter (but possibly different shape parameter).

The usual formulation of the beta distribution is also known as the beta distribution of the first kind, whereas beta distribution of the second kind is an alternative name for the beta prime distribution.

    Characterization

Probability density function

The probability density function of the beta distribution, for 0 ≤ x ≤ 1, and shape parameters α > 0 and β > 0, is:

f(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β) = ( Γ(α + β) / (Γ(α) Γ(β)) ) x^(α−1) (1 − x)^(β−1)

where Γ is the gamma function. The beta function, B, appears as a normalization constant to ensure that the total probability integrates to unity. This definition includes both ends x = 0 and x = 1, which is consistent with definitions for other continuous distributions supported on a bounded interval which are special cases of the beta distribution, for example the arcsine distribution, and consistent with several authors, such as N. L. Johnson and S. Kotz[5][6][7][8]. Note, however, that several other authors, including W. Feller[9][10][11], choose to exclude the ends x = 0 and x = 1 (such that the two ends are not actually part of the density function) and consider instead 0 < x < 1.
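To make the normalization concrete, here is a minimal sketch (assuming Python with numpy/scipy, which is not part of the original text) evaluating the density both through the gamma-function form above and through scipy's own implementation:

```python
# Sketch: the beta PDF written out via gamma functions, compared with scipy.
import numpy as np
from scipy.special import gamma
from scipy.stats import beta

def beta_pdf(x, a, b):
    """f(x; a, b) = Gamma(a+b)/(Gamma(a)Gamma(b)) * x^(a-1) * (1-x)^(b-1)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

x = np.linspace(0.01, 0.99, 5)
print(beta_pdf(x, 2.0, 5.0))
print(beta.pdf(x, 2.0, 5.0))   # agrees with the formula above
```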

Several authors, including N. L. Johnson and S. Kotz[5], use the nomenclature p instead of α and q instead of β for the shape parameters of the beta distribution, reminiscent of the nomenclature traditionally used for the parameters of the Bernoulli distribution, because the beta distribution approaches the Bernoulli distribution in the limit as both shape parameters α and β approach the value of zero.

In the following, that a random variable X is beta-distributed with parameters α and β will be denoted by:

X ~ Beta(α, β)

Cumulative distribution function

The cumulative distribution function is

F(x; α, β) = B(x; α, β) / B(α, β) = I_x(α, β)

where B(x; α, β) is the incomplete beta function and I_x(α, β) is the regularized incomplete beta function.

    Properties

Mode

The mode of a beta distributed random variable X with both parameters α and β greater than one is:[5]

mode = (α − 1)/(α + β − 2)

When both parameters are less than one (α < 1 and β < 1), this is the anti-mode: the lowest point of the probability density curve[7]. Letting α = β in the above expression one obtains 1/2, showing that for α = β the mode (in the case α = β > 1), or the anti-mode (in the case α = β < 1), is at the center of the distribution: it is symmetric in those cases. See the "Shapes" section in this article for a full list of mode cases, for arbitrary values of α and β. For several of these cases, the maximum value of the density function occurs at one or both ends. In some cases the (maximum) value of the density function occurring at the end is finite, for example in the case of α = 2, β = 1 (or α = 1, β = 2), the right-triangle distribution, while in several other cases there is a singularity at the end, and hence the value of the density function approaches infinity at the end, for example in the case α = β = 1/2, the arcsine distribution. The choice whether to include, or not to include, the ends x = 0 and x = 1 as part of the density function, whether a singularity can be considered to be a mode, and whether cases with two maxima are to be considered bimodal, is responsible for some authors considering these maximum values at the ends of the density distribution to be modes[12] or not[10].

[Plot: mode of the beta distribution for 1 ≤ α ≤ 5 and 1 ≤ β ≤ 5.]

Median

The median of the beta distribution is the unique real number x for which the regularized incomplete beta function satisfies I_x(α, β) = 1/2. There is no general closed-form expression for the median of the beta distribution for arbitrary values of α and β. Closed-form expressions for particular values of the parameters α and β follow:

For symmetric cases α = β, median = 1/2.
For α = 1 and β > 0, median = 1 − 2^(−1/β) (this case is the mirror-image of the power function [0,1] distribution).
For α > 0 and β = 1, median = 2^(−1/α) (this case is the power function [0,1] distribution[10]).
For α = 3 and β = 2, median = 0.6142..., the real [0,1] solution to the quartic equation 1 − 8x³ + 6x⁴ = 0.
For α = 2 and β = 3, median = 0.3857... = 1 − 0.6142...

[Plot: median of the beta distribution for 0 ≤ α ≤ 5 and 0 ≤ β ≤ 5.]

A reasonable approximation of the value of the median of the beta distribution, for both α and β greater than or equal to one, is given by the formula[13]

median ≈ (α − 1/3)/(α + β − 2/3), for α, β ≥ 1.

For α, β ≥ 1, the relative error (the absolute error divided by the median) in this approximation is less than 4%, and for both α ≥ 2 and β ≥ 2 it is less than 1%. The absolute error divided by the difference between the mean and the mode is similarly small.
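The approximation is easy to compare against the exact median (obtained by inverting the regularized incomplete beta function); a short sketch assuming Python with scipy:

```python
# Sketch: median approximation (a - 1/3)/(a + b - 2/3) vs the exact median.
from scipy.stats import beta

for a, b in [(1.5, 4.0), (2.0, 2.0), (3.0, 7.5)]:
    exact = beta.ppf(0.5, a, b)             # solves I_x(a, b) = 1/2
    approx = (a - 1.0/3) / (a + b - 2.0/3)
    print(a, b, exact, approx, abs(exact - approx) / exact)
```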

Mean

The expected value (mean) μ of a beta distribution random variable X with parameters α and β is:[5]

μ = E[X] = α/(α + β)

Letting α = β in the above expression one obtains μ = 1/2, showing that for α = β the mean is at the center of the distribution: it is symmetric. Also, the following limits can be obtained from the above expression:

lim_{β→0} μ = 1, lim_{α→∞} μ = 1, lim_{α→0} μ = 0, lim_{β→∞} μ = 0

Therefore, for β → 0, or for α → ∞, the mean is located at the right end, x = 1. For these limit ratios, the beta distribution becomes a one-point degenerate distribution with a Dirac delta function spike at the right end, x = 1, with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the right end, x = 1.

Similarly, for α → 0, or for β → ∞, the mean is located at the left end, x = 0. The beta distribution becomes a one-point degenerate distribution with a Dirac delta function spike at the left end, x = 0, with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the left end, x = 0.

[Plot: mean of the beta distribution for 0 ≤ α ≤ 5 and 0 ≤ β ≤ 5.]

Variance

The variance (the second moment centered around the mean) of a beta distribution random variable X with parameters α and β is:[5]

var(X) = E[(X − μ)²] = αβ / ((α + β)² (α + β + 1))

Letting α = β in the above expression one obtains

var(X) = 1/(4(2β + 1))

showing that for α = β the variance decreases monotonically as α = β increases. Setting α = β = 0 in this expression, one finds the maximum variance var(X) = 1/4,[5] which only occurs approaching the limit, at α = β = 0.

The beta distribution may also be parametrized in terms of its mean μ (0 < μ < 1) and sample size ν = α + β (ν > 0), via α = μν and β = (1 − μ)ν (see the section below titled "Mean and sample size"). Using this parametrization, one can express the variance in terms of the mean and the sample size as follows:

var(X) = μ(1 − μ)/(1 + ν)

Since ν = α + β > 0, it must follow that var(X) < μ(1 − μ). For a symmetric distribution, the mean is at the middle of the distribution, μ = 1/2, and therefore:

var(X) = 1/(4(1 + ν)), if μ = 1/2

Also, the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:

lim_{ν→0} var(X) = μ(1 − μ), lim_{ν→∞} var(X) = 0

    Skewness

[Plot: skewness of the beta distribution as a function of variance and mean.]

The skewness (the third moment centered around the mean, normalized by the 3/2 power of the variance) of the beta distribution is[5]

γ₁ = 2(β − α)√(α + β + 1) / ((α + β + 2)√(αβ))

Letting α = β in the above expression one obtains γ₁ = 0, showing once again that for α = β the distribution is symmetric and hence the skewness is zero. Positive skew (right-tailed) for α < β, negative skew (left-tailed) for α > β.

Using the parametrization in terms of mean μ and sample size ν = α + β (α = μν, β = (1 − μ)ν), one can express the skewness in terms of the mean and the sample size as follows:

γ₁ = 2(1 − 2μ)√(1 + ν) / ((2 + ν)√(μ(1 − μ)))

The skewness can also be expressed just in terms of the variance var and the mean μ as follows:

γ₁ = 2(1 − 2μ)√var / (μ(1 − μ) + var)

The accompanying plot of skewness as a function of variance and mean shows that maximum variance (1/4) is coupled with zero skewness and the symmetry condition (μ = 1/2), and that maximum skewness (positive or negative infinity) occurs when the mean is located at one end or the other, so that the "mass" of the probability distribution is concentrated at the ends (minimum variance).

For the symmetric case (α = β), skewness = 0 over the whole range.

For the unsymmetric cases (α ≠ β), the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:

lim_{α→0} γ₁ = +∞, lim_{β→0} γ₁ = −∞
lim_{α→∞} γ₁ = −2/√β, lim_{β→∞} γ₁ = 2/√α

    Kurtosis

[Plot: excess kurtosis of the beta distribution as a function of variance and mean.]

The beta distribution has been applied in acoustic analysis to assess damage to gears, as the kurtosis of the beta distribution has been reported to be a good indicator of the condition of a gear[1]. Kurtosis has also been used to distinguish the seismic signal generated by a person's footsteps from other signals. As persons or other targets moving on the ground generate continuous signals in the form of seismic waves, one can separate different targets based on the seismic waves they generate. Kurtosis is sensitive to impulsive signals, so it is much more sensitive to the signal generated by human footsteps than to other signals generated by vehicles, winds, noise, etc.[14] Unfortunately, the notation for kurtosis has not been standardized. Kenney and Keeping[15] use the symbol γ₂ for the excess kurtosis, but Abramowitz and Stegun[16] use different terminology. To prevent confusion[17] between kurtosis (the fourth moment centered around the mean, normalized by the square of the variance) and excess kurtosis, when using symbols, they will be spelled out as follows[11][10]:

excess kurtosis = kurtosis − 3 = 6((α − β)²(α + β + 1) − αβ(α + β + 2)) / (αβ(α + β + 2)(α + β + 3))

Letting α = β in the above expression one obtains

excess kurtosis = −6/(3 + 2α), for α = β.

Therefore, for symmetric beta distributions, the excess kurtosis is negative, increasing from a minimum value of −2 at the limit as {α = β} → 0, and approaching a maximum value of zero as {α = β} → ∞. The value of −2 is the minimum value of excess kurtosis that any distribution (not just beta distributions, but any distribution of any possible kind) can ever achieve. This minimum value is reached when all the probability density is entirely concentrated at each end x = 0 and x = 1, with nothing in between: a 2-point Bernoulli distribution with equal probability 1/2 at each end (a coin toss: see the section below "Kurtosis bounded by the square of the skewness" for further discussion). The description of kurtosis as a measure of the "peakedness" (or "heavy tails") of the probability distribution is strictly applicable to unimodal distributions (for example the normal distribution). However, for more general distributions, like the beta distribution, a more general description of kurtosis is that it is a measure of the proportion of the mass density near the mean. The higher the proportion of mass density near the mean, the higher the kurtosis, while the higher the mass density away from the mean, the lower the kurtosis. For α ≠ β, skewed beta distributions, the excess kurtosis can reach unlimited positive values (particularly for α → 0 for finite β, or for β → 0 for finite α) because all the mass density is concentrated at the mean when the mean coincides with one of the ends. Minimum kurtosis takes place when the mass density is concentrated equally at each end (and therefore the mean is at the center), and there is no probability mass density in between the ends.

Using the parametrization in terms of mean μ and sample size ν = α + β (α = μν, β = (1 − μ)ν), one can express the excess kurtosis in terms of the mean and the sample size as follows:

excess kurtosis = (6/(3 + ν)) ( (1 − 2μ)²(1 + ν)/(μ(1 − μ)(2 + ν)) − 1 )

The excess kurtosis can also be expressed in terms of just the following two parameters, the variance var and the sample size ν, as follows:

excess kurtosis = (6/(3 + ν)) ( (1 − 4 var (1 + ν))/(var (2 + ν)) − 1 )

and, in terms of the variance var and the mean μ, as follows:

excess kurtosis = 6 var ( (1 − 2μ)² − μ(1 − μ) − var ) / ( (μ(1 − μ) + var)(μ(1 − μ) + 2 var) )

The plot of excess kurtosis as a function of the variance and the mean shows that the minimum value of the excess kurtosis (−2, which is the minimum possible value for excess kurtosis for any distribution) is intimately coupled with the maximum value of variance (1/4) and the symmetry condition: the mean occurring at the midpoint (μ = 1/2). This occurs for the symmetric case of α = β → 0, with zero skewness. At the limit, this is the 2-point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end, x = 0 and x = 1, and zero probability everywhere else. (A coin toss: one face of the coin being x = 0 and the other face being x = 1.) Variance is maximum because the distribution is bimodal with nothing in between the two modes (spikes) at each end. Excess kurtosis is minimum: the probability density "mass" is zero at the mean and it is concentrated at the two peaks at each end. Excess kurtosis reaches the minimum possible value (for any distribution) when the probability density function has two spikes at each end: it is bi-"peaky" with nothing in between them.

On the other hand, the plot shows that for extremely skewed cases, where the mean is located near one or the other end (μ = 0 or μ = 1), the variance is close to zero, and the excess kurtosis rapidly approaches infinity when the mean of the distribution approaches either end.

Alternatively, the excess kurtosis can also be expressed in terms of just the following two parameters, the square of the skewness and the sample size ν, as follows:

excess kurtosis = (6/(3 + ν)) ( ((2 + ν)/4) skewness² − 1 )

From this last expression, one can obtain the same limits published practically a century ago by Karl Pearson in his paper[18] for the beta distribution (see the section below titled "Kurtosis bounded by the square of the skewness"). Setting α + β = ν = 0 in the above expression, one obtains Pearson's lower boundary (values for the skewness and excess kurtosis below this boundary, excess kurtosis + 2 − skewness² = 0, cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region"). The limit of α + β = ν → ∞ determines Pearson's upper boundary. Therefore:

lim_{ν→0} excess kurtosis = skewness² − 2
lim_{ν→∞} excess kurtosis = (3/2) skewness²

Values of ν = α + β such that ν ranges from zero to infinity, 0 < ν < ∞, span the whole region of the beta distribution in the plane of excess kurtosis versus squared skewness.

For the symmetric case (α = β), the following limits apply:

lim_{α=β→0} excess kurtosis = −2, lim_{α=β→∞} excess kurtosis = 0

For the unsymmetric cases (α ≠ β), the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:

lim_{α→0} excess kurtosis = lim_{β→0} excess kurtosis = +∞
lim_{α→∞} excess kurtosis = 6/β, lim_{β→∞} excess kurtosis = 6/α


    Characteristic function

[Plot: Re(characteristic function), symmetric case, α = β ranging from 25 to 0.]

The characteristic function is the Fourier transform of the probability density function. The characteristic function of the beta distribution is Kummer's confluent hypergeometric function (of the first kind)[5][16][19]:

φ_X(t) = E[e^(itX)] = ₁F₁(α; α + β; it)

[Plots: Re(characteristic function) for the symmetric case α = β (ranging from 0 to 25 and from 25 to 0) and for the skewed case β = α + 1/2 (α ranging from 0 to 25 and from 25 to 0).]

where

(x)_k = x(x + 1)(x + 2) ⋯ (x + k − 1)

is the rising factorial, also called the "Pochhammer symbol". The value of the characteristic function for t = 0 is one:

φ_X(0) = ₁F₁(α; α + β; 0) = 1.

Also, the real and imaginary parts of the characteristic function enjoy the following symmetries with respect to the origin of the variable t: the real part Re[₁F₁(α; α + β; it)] is symmetric (even in t), and the imaginary part Im[₁F₁(α; α + β; it)] is skew-symmetric (odd in t).

The symmetric case α = β simplifies the characteristic function of the beta distribution to a Bessel function, since in the special case α = β the confluent hypergeometric function (of the first kind) reduces to a Bessel function (the modified Bessel function of the first kind I_(α−1/2)) using Kummer's second transformation as follows:

₁F₁(α; 2α; it) = e^(it/2) ₀F₁(; α + 1/2; −t²/16)

In the accompanying plots, the real part (Re) of the characteristic function of the beta distribution is displayed for symmetric (α = β) and skewed (α ≠ β) cases.

Moment generating function

It also follows[5][10] that the moment generating function is

M_X(t) = E[e^(tX)] = ₁F₁(α; α + β; t) = 1 + Σ_{k=1}^∞ ( ∏_{r=0}^{k−1} (α + r)/(α + β + r) ) t^k/k!

Higher moments

Using the moment generating function, the kth raw moment is given by[5] the factor

∏_{r=0}^{k−1} (α + r)/(α + β + r)

multiplying the (exponential series) term t^k/k! in the series of the moment generating function:

E[X^k] = (α)_k / (α + β)_k = ∏_{r=0}^{k−1} (α + r)/(α + β + r)

where (x)_k is a Pochhammer symbol representing the rising factorial. It can also be written in a recursive form as

E[X^k] = ( (α + k − 1)/(α + β + k − 1) ) E[X^(k−1)].
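The recursion is easy to exercise; the following sketch (Python with scipy assumed, not part of the original text) compares it with scipy's moment routine:

```python
# Sketch: raw moments E[X^k] of Beta(a, b) via the rising-factorial product.
from scipy.stats import beta

def raw_moment(k, a, b):
    m = 1.0                        # E[X^0] = 1
    for r in range(k):             # E[X^k] = prod_{r=0}^{k-1} (a+r)/(a+b+r)
        m *= (a + r) / (a + b + r)
    return m

a, b = 2.0, 3.0
for k in range(1, 5):
    print(k, raw_moment(k, a, b), beta.moment(k, a, b))
```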

Moments of transformed random variables

One can also show the following expectations for transformed random variables[5]:

E[1 − X] = β/(α + β)
E[X(1 − X)] = αβ/((α + β)(α + β + 1))

The following transformation by inversion of the random variable, X/(1 − X), gives the expected value of the inverted beta distribution or beta prime distribution (also known as the beta distribution of the second kind or Pearson's Type VI)[5]:

E[X/(1 − X)] = α/(β − 1), for β > 1
E[(1 − X)/X] = β/(α − 1), for α > 1

Expected values for logarithmic transformations (which may be useful for maximum likelihood estimates, for example) are:

E[ln X] = ψ(α) − ψ(α + β)
E[ln(1 − X)] = ψ(β) − ψ(α + β)

where ψ is the digamma function.

Higher order logarithmic moments can be expressed in terms of higher order poly-gamma functions. For example,

E[(ln X)²] = (ψ(α) − ψ(α + β))² + ψ₁(α) − ψ₁(α + β)

therefore the variance of the logarithmic variable is:

var[ln X] = ψ₁(α) − ψ₁(α + β)

Also,

var[ln(1 − X)] = ψ₁(β) − ψ₁(α + β)

and therefore the covariance of ln X and ln(1 − X) is:

cov[ln X, ln(1 − X)] = −ψ₁(α + β)

where ψ₁ is the trigamma function. These identities can be derived by using the representation of a beta distribution as a proportion of two gamma distributions and differentiating through the integral.

Quantities of information (entropy)

Given a beta distributed random variable, X ~ Beta(α, β), the differential entropy of X is[20] (measured in nats) the expected value of the negative of the logarithm of the probability density function:

h(X) = E[−ln f(x; α, β)] = ln B(α, β) − (α − 1)ψ(α) − (β − 1)ψ(β) + (α + β − 2)ψ(α + β)

where f(x; α, β) is the probability density function of the beta distribution:

f(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β)

The digamma function ψ appears in the formula for the differential entropy as a consequence of Euler's integral formula for the harmonic numbers, which follows from the integral:

∫₀¹ (1 − x^(α−1))/(1 − x) dx = ψ(α) + γ

where γ is the Euler–Mascheroni constant.

The differential entropy of the beta distribution is negative for all values of α and β greater than zero, except at α = β = 1 (for which values the beta distribution is the same as the uniform distribution), where the differential entropy reaches its maximum value of zero. It is to be expected that the maximum entropy should take place when the beta distribution becomes equal to the uniform distribution, since uncertainty is maximal when all possible events are equiprobable.

For α or β approaching zero, the differential entropy approaches its minimum value of negative infinity. For (either or both) α or β approaching zero, there is a maximum amount of order: all the probability density is concentrated at the ends, and there is zero probability density at points located between the ends. Similarly, for (either or both) α or β approaching infinity, the differential entropy approaches its minimum value of negative infinity, and a maximum amount of order. If either α or β approaches infinity (and the other is finite) all the probability density is concentrated at an end, and the probability density is zero everywhere else. If both shape parameters are equal (the symmetric case), α = β, and they approach infinity simultaneously, the probability density becomes a spike (Dirac delta function) concentrated at the middle x = 1/2, and hence there is 100% probability at the middle x = 1/2 and zero probability everywhere else.

The (continuous case) differential entropy was introduced by Shannon in his original paper (where he named it the "entropy of a continuous distribution"), as the concluding part[21] of the same paper where he defined the discrete entropy. It is known since then that the differential entropy may differ from the infinitesimal limit of the discrete entropy by an infinite offset, therefore the differential entropy can be negative (as it is for the beta distribution). What really matters is the relative value of entropy.

Given two beta distributed random variables, X ~ Beta(α, β) and Y ~ Beta(α′, β′), the cross entropy is (measured in nats)

H(X, Y) = ln B(α′, β′) − (α′ − 1)ψ(α) − (β′ − 1)ψ(β) + (α′ + β′ − 2)ψ(α + β)

It follows that the relative entropy, or Kullback–Leibler divergence, between these two beta distributions is (measured in nats)

D_KL(X ∥ Y) = ln( B(α′, β′)/B(α, β) ) + (α − α′)ψ(α) + (β − β′)ψ(β) + (α′ − α + β′ − β)ψ(α + β)

The relative entropy, or Kullback–Leibler divergence, is always non-negative.
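A sketch of both formulas in Python (scipy assumed, not part of the original text):

```python
# Sketch: differential entropy and KL divergence of beta distributions,
# using the log-beta/digamma expressions stated above.
from scipy.special import betaln, psi
from scipy.stats import beta

def beta_entropy(a, b):
    return (betaln(a, b) - (a - 1) * psi(a) - (b - 1) * psi(b)
            + (a + b - 2) * psi(a + b))

def beta_kl(a, b, a2, b2):
    """D_KL( Beta(a, b) || Beta(a2, b2) ), in nats."""
    return (betaln(a2, b2) - betaln(a, b)
            + (a - a2) * psi(a) + (b - b2) * psi(b)
            + (a2 - a + b2 - b) * psi(a + b))

print(beta_entropy(1, 1))                         # 0: the uniform case
print(beta_entropy(2, 3), beta(2, 3).entropy())   # agrees with scipy
print(beta_kl(2, 3, 2, 3))                        # 0 for identical distributions
print(beta_kl(2, 3, 3, 2))                        # > 0 otherwise
```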

    Relationships between statistical measures

    Mean, mode and median relationship

If 1 < α < β then mode ≤ median ≤ mean.[13] Expressing the mode (only for α > 1 and β > 1) and the mean in terms of α and β:

mode = (α − 1)/(α + β − 2), mean = α/(α + β)

For α, β ≥ 1, the absolute distance between the mean and the median is less than 5% of the distance between the maximum and minimum values of x. On the other hand, the absolute distance between the mean and the mode can reach 50% of the distance between the maximum and minimum values of x, for the (pathological) case of α → 1 and β → 1 (for which values the beta distribution approaches the uniform distribution and the differential entropy approaches its maximum value, and hence maximum "disorder").

For example, for α = 1.0001 and β = 1.00000001:

mode = 0.9999; PDF(mode) = 1.00010
mean = 0.500025; PDF(mean) = 1.00003
median = 0.500035; PDF(median) = 1.00003
mean − mode = −0.499875
mean − median = −9.65538 × 10⁻⁶

(where PDF stands for the value of the probability density function)

    Kurtosis bounded by the square of the skewness

As remarked by Feller[9], in the Pearson system the beta probability density appears as type I (any difference between the beta distribution and Pearson's type I distribution is only superficial and makes no difference for the following discussion regarding the relationship between kurtosis and skewness). Karl Pearson showed, in Plate 1 of his paper[18] published in 1916, a graph with the kurtosis as the vertical axis (ordinate) and the square of the skewness as the horizontal axis (abscissa), in which a number of distributions were displayed[22]. The region occupied by the beta distribution is bounded by the following two lines in the (skewness², kurtosis) plane, or the (skewness², excess kurtosis) plane:

skewness² + 1 ≤ kurtosis ≤ (3/2) skewness² + 3

or, equivalently,

skewness² − 2 ≤ excess kurtosis ≤ (3/2) skewness²

(At a time when there were no powerful digital computers,) Karl Pearson accurately computed further boundaries[18][8], for example, separating the "U-shaped" from the "J-shaped" distributions. The lower boundary line (excess kurtosis + 2 − skewness² = 0) is produced by "U-shaped" beta distributions with values of the shape parameters α and β close to zero. The upper boundary line (excess kurtosis − (3/2) skewness² = 0) is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. An example of a beta distribution near the upper boundary is given by α = 0.1, β = 1000, for which the ratio (excess kurtosis)/(skewness²) = 1.49835 approaches the upper limit of 1.5 from below. An example of a beta distribution near the lower boundary is given by α = 0.0001, β = 0.1, for which values the expression (excess kurtosis + 2)/(skewness²) = 1.01621 approaches the lower limit of 1 from above. In the infinitesimal limit for both α and β approaching zero symmetrically, the excess kurtosis reaches its minimum value of −2. This minimum value occurs at the point at which the lower boundary line intersects the vertical axis (ordinate). (Note, however, that in Pearson's original chart, the ordinate is kurtosis, instead of excess kurtosis, and that it increases downwards rather than upwards.)

Values for the skewness and excess kurtosis below the lower boundary (excess kurtosis + 2 − skewness² = 0) cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region." The boundary for this "impossible region" is determined by (symmetric or skewed) bimodal "U"-shaped distributions for which the parameters α and β approach zero, and hence all the probability density is concentrated at the ends, x = 0 and x = 1, with practically nothing in between them. Since for α, β → 0 the probability density is concentrated at the two ends x = 0 and x = 1, this "impossible boundary" is determined by a 2-point distribution: the probability can only take 2 values (Bernoulli distribution), one value with probability p and the other with probability q = 1 − p. For cases approaching this limit boundary with symmetry (α = β), skewness ≈ 0, excess kurtosis ≈ −2 (this is the lowest excess kurtosis possible for any distribution), and the probabilities are p ≈ q ≈ 1/2. For cases approaching this limit boundary with skewness, excess kurtosis ≈ −2 + skewness², and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities β/(α + β) at the left end x = 0 and α/(α + β) at the right end x = 1.

Symmetry

All statements are conditional on α > 0 and β > 0.

Probability density function reflection symmetry: f(x; α, β) = f(1 − x; β, α)

Cumulative distribution function reflection symmetry plus unitary translation: F(x; α, β) = 1 − F(1 − x; β, α)

Mode reflection symmetry plus unitary translation: mode(Beta(α, β)) = 1 − mode(Beta(β, α))

Median reflection symmetry plus unitary translation: median(Beta(α, β)) = 1 − median(Beta(β, α))

Mean reflection symmetry plus unitary translation: μ(Beta(α, β)) = 1 − μ(Beta(β, α))

Variance symmetry: var(Beta(α, β)) = var(Beta(β, α))

Skewness skew-symmetry: skewness(Beta(α, β)) = −skewness(Beta(β, α))

Excess kurtosis symmetry: excess kurtosis(Beta(α, β)) = excess kurtosis(Beta(β, α))

Characteristic function symmetry of real part (with respect to the origin of the variable t): Re[₁F₁(α; α + β; it)] = Re[₁F₁(α; α + β; −it)]

Characteristic function skew-symmetry of imaginary part (with respect to the origin of the variable t): Im[₁F₁(α; α + β; it)] = −Im[₁F₁(α; α + β; −it)]

Differential entropy symmetry: h(Beta(α, β)) = h(Beta(β, α))

Shapes

The beta density function can take a wide variety of different shapes depending on the values of the two parameters α and β:

Symmetric (α = β):

The density function is symmetric about 1/2 (blue & teal plots).

α = β < 1 is U-shaped (blue plot), with the anti-mode at x = 1/2.[5]

α = β → 0 is a 2-point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end, x = 0 and x = 1, and zero probability everywhere else. A coin toss: one face of the coin being x = 0 and the other face being x = 1. The excess kurtosis approaches −2, a lower value than which is impossible for any distribution to reach. The differential entropy approaches a minimum value of −∞.

α = β = 1/2 is the arcsine distribution.

α = β = 1 is the uniform [0,1] distribution. The (negative anywhere else) differential entropy reaches its maximum value of zero.

α = β > 1 is symmetric unimodal, with the mode at x = 1/2.[5]

α = β = 3/2 is a semi-elliptic [0,1] distribution, see: Wigner semicircle distribution.

α = β = 2 is the parabolic [0,1] distribution.

α = β > 2 is bell-shaped, with inflection points located to either side of the mode.

α = β → ∞ is a 1-point degenerate distribution with a Dirac delta function spike at the midpoint x = 1/2 with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the single point x = 1/2. The differential entropy approaches a minimum value of −∞.

Skewed (α ≠ β):

The density function is skewed. An interchange of parameter values yields the mirror image (the reverse) of the initial curve.

α < 1, β < 1 is skewed U-shaped. Positive skew for α < β, negative skew for α > β.

α > 1, β > 1 is skewed unimodal (magenta & cyan plots). Positive skew for α < β, negative skew for α > β.

α < 1, β ≥ 1 is reverse J-shaped with a right tail: positively skewed, strictly decreasing, strictly convex (the maximum variance in this range occurs for α = (√5 − 1)/2, β = 1, that is, α equal to the golden ratio conjugate).

α = 1, β > 1 is positively skewed, strictly decreasing (red plot), a reversed (mirror-image) power function [0,1] distribution:

α = 1, 1 < β < 2 is strictly concave.

α = 1, β = 2 is a straight line with slope −2, the right-triangular distribution with right angle at the left end, at x = 0.

α = 1, β > 2 is reverse J-shaped with a right tail, strictly convex.

α ≥ 1, β < 1 is J-shaped with a left tail: negatively skewed, strictly increasing, strictly convex (the maximum variance in this range occurs for α = 1, β = (√5 − 1)/2, that is, β equal to the golden ratio conjugate).

α > 1, β = 1 is negatively skewed, strictly increasing (green plot), the power function [0,1] distribution[10]:

1 < α < 2, β = 1 is strictly concave.

α = 2, β = 1 is a straight line with slope +2, the right-triangular distribution with right angle at the right end, at x = 1.

α > 2, β = 1 is J-shaped with a left tail, strictly convex.

    Parameter estimation

Method of moments

Using the method of moments, with the first two moments (sample mean and sample variance), let:

x̄ = (1/N) Σ_{i=1}^{N} X_i

be the sample mean and

v̄ = (1/N) Σ_{i=1}^{N} (X_i − x̄)²

be the sample variance. The method-of-moments estimates of the parameters are

α̂ = x̄ ( x̄(1 − x̄)/v̄ − 1 ), conditional on v̄ < x̄(1 − x̄)
β̂ = (1 − x̄) ( x̄(1 − x̄)/v̄ − 1 ), conditional on v̄ < x̄(1 − x̄)

When the distribution is required over a known interval other than [0, 1], say [a, c], then replace x̄ with (x̄ − a)/(c − a) and v̄ with v̄/(c − a)² in the above equations (see the "Alternative parametrizations, four parameters" section below).[23]
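A direct transcription of these estimators into Python (numpy assumed, not part of the original text):

```python
# Sketch: method-of-moments estimates for Beta(a, b) on [0, 1].
import numpy as np

def beta_mom(samples):
    xbar = np.mean(samples)
    vbar = np.var(samples)                  # biased sample variance, as above
    if vbar >= xbar * (1 - xbar):
        raise ValueError("moment condition vbar < xbar(1 - xbar) violated")
    common = xbar * (1 - xbar) / vbar - 1
    return xbar * common, (1 - xbar) * common   # (alpha_hat, beta_hat)

rng = np.random.default_rng(1)
print(beta_mom(rng.beta(2.0, 5.0, size=100_000)))   # close to (2, 5)
```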

Maximum likelihood

As is also the case for maximum likelihood estimates for the gamma distribution, the maximum likelihood estimates for the beta distribution do not have a closed form solution for arbitrary values of the shape parameters. If X_1, ..., X_N are independent random variables each having a beta distribution, the following system of coupled maximum likelihood estimate equations (for the average log-likelihoods) needs to be inverted to obtain the (unknown) shape parameter estimates α̂, β̂ in terms of the (known) averages of the logarithms of the samples[5]:

ψ(α̂) − ψ(α̂ + β̂) = (1/N) Σ_{i=1}^{N} ln X_i
ψ(β̂) − ψ(α̂ + β̂) = (1/N) Σ_{i=1}^{N} ln(1 − X_i)

These coupled equations containing digamma functions of the shape parameter estimates must be solved by numerical methods, as done, for example, by Beckman et al.[24]

Gnanadesikan et al. give numerical solutions for a few cases[25]. N. L. Johnson and S. Kotz[5] suggest that for "not too small" shape parameter estimates α̂, β̂, the logarithmic approximation to the digamma function ψ(α̂) ≈ ln(α̂ − 1/2) may be used to obtain initial values for an iterative solution, since the equations resulting from this approximation can be solved exactly:

α̂ ≈ 1/2 + Ĝ_X / (2(1 − Ĝ_X − Ĝ_(1−X)))
β̂ ≈ 1/2 + Ĝ_(1−X) / (2(1 − Ĝ_X − Ĝ_(1−X)))

where Ĝ_X = exp((1/N) Σ ln X_i) and Ĝ_(1−X) = exp((1/N) Σ ln(1 − X_i)) are the sample geometric means.

More readily, and perhaps more accurately, the estimates provided by the method of moments can instead be used as initial values for an iterative solution of the maximum likelihood coupled equations in terms of the digamma functions.
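One way to realize this numerically is to hand the two digamma equations to a generic root finder, seeded by the method-of-moments estimates. A sketch assuming Python with numpy/scipy (not the authors' procedure, just one workable setup):

```python
# Sketch: maximum likelihood for Beta(a, b) by solving the coupled
# digamma equations  psi(a) - psi(a+b) = mean(ln x),
#                    psi(b) - psi(a+b) = mean(ln(1-x)).
import numpy as np
from scipy.special import psi
from scipy.optimize import fsolve

rng = np.random.default_rng(2)
x = rng.beta(2.0, 5.0, size=50_000)
g1, g2 = np.mean(np.log(x)), np.mean(np.log1p(-x))

def equations(params):
    a, b = params
    return [psi(a) - psi(a + b) - g1, psi(b) - psi(a + b) - g2]

# Method-of-moments starting point, as suggested above.
m, v = np.mean(x), np.var(x)
c = m * (1 - m) / v - 1
a_hat, b_hat = fsolve(equations, [m * c, (1 - m) * c])
print(a_hat, b_hat)   # close to (2, 5)
```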

When the distribution is required over a known interval other than [0, 1], say [a, c], then replace ln X_i in the first equation with ln((X_i − a)/(c − a)), and replace ln(1 − X_i) in the second equation with ln((c − X_i)/(c − a)) (see the "Alternative parametrizations, four parameters" section below).

If one of the shape parameters is known, the problem is considerably simplified. The following transformation can be used to solve for the unknown shape parameter (for skewed cases such that α̂ ≠ β̂; otherwise, if symmetric, both equal parameters are known when one is known):

ψ(α̂) − ψ(β̂) = (1/N) Σ_{i=1}^{N} ln( X_i/(1 − X_i) )

If, for example, β̂ is known, the unknown parameter α̂ is provided, exactly, by the inverse[26] of the digamma function:

α̂ = ψ⁻¹( ψ(β̂) + (1/N) Σ_{i=1}^{N} ln( X_i/(1 − X_i) ) )

Note that ln(X_i/(1 − X_i)) is the logarithm of the transformation by inversion of the random variable, X/(1 − X), that transforms a beta distribution into the inverted beta distribution or beta prime distribution (also known as the beta distribution of the second kind or Pearson's Type VI).

In particular, if one of the shape parameters has a value of unity, for example β = 1 (the power function distribution with bounded support [0,1]), using the recurrence relation ψ(x + 1) = ψ(x) + 1/x, the maximum likelihood estimator for the unknown parameter α is[5], exactly:

α̂ = −N / Σ_{i=1}^{N} ln X_i
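The inverse digamma function is not provided by scipy, but a few Newton steps recover it. A sketch assuming Python with scipy; the helper inv_digamma below is written for this illustration and is not a library API:

```python
# Sketch: solving for one unknown shape parameter when the other is known,
# via a hand-rolled inverse digamma (hypothetical helper, not a scipy API).
import numpy as np
from scipy.special import psi, polygamma

def inv_digamma(y, iters=20):
    # Common initialization: exp(y) + 1/2 for large y, -1/(y + gamma) otherwise.
    x = np.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + 0.5772156649)
    for _ in range(iters):                  # Newton iteration on psi(x) = y
        x -= (psi(x) - y) / polygamma(1, x)
    return x

rng = np.random.default_rng(3)
x = rng.beta(2.0, 5.0, size=50_000)
b_known = 5.0
a_hat = inv_digamma(psi(b_known) + np.mean(np.log(x / (1 - x))))
print(a_hat)   # close to 2
```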

Generating beta-distributed random variates

If X and Y are independent, with X ~ Gamma(α, θ) and Y ~ Gamma(β, θ), then X/(X + Y) ~ Beta(α, β). So one algorithm for generating beta variates is to generate X/(X + Y), where X is a gamma variate with parameters (α, 1) and Y is an independent gamma variate with parameters (β, 1).[27]

Also, the kth order statistic of n uniformly distributed variates is Beta(k, n + 1 − k) distributed, so an alternative, if α and β are small integers, is to generate α + β − 1 uniform variates and choose the αth smallest.[28]


    Related distributions

    Transformations

If X ~ Beta(α, β) then 1 − X ~ Beta(β, α) (mirror-image symmetry).

If X ~ Beta(α, β) then X/(1 − X) ~ β′(α, β): the beta prime distribution, also called the "beta distribution of the second kind".

If X ~ Beta(n/2, m/2) then mX/(n(1 − X)) ~ F(n, m) (assuming n > 0 and m > 0): the Fisher–Snedecor F distribution.

If X ~ Beta(α, β) then min + X(max − min) ~ PERT(min, max, m, λ), where PERT denotes a distribution used in PERT analysis and m = most likely value[29]. Traditionally[30], λ = 4 in PERT analysis.

If X ~ Beta(1, β) then X has a Kumaraswamy distribution with parameters (1, β).

If X ~ Beta(α, 1) then X has a Kumaraswamy distribution with parameters (α, 1).

If X ~ Beta(α, 1) then −ln X ~ Exponential(α).

    Special and limiting cases

Beta(1, 1) is the standard uniform distribution.

If X ~ Beta(3/2, 3/2) and r > 0, then 2r(X − 1/2) follows a Wigner semicircle distribution of radius r.

Beta(1/2, 1/2) is the Jeffreys prior for a proportion and is equivalent to the arcsine distribution.

If X ~ Beta(1, β), then βX approaches the exponential distribution with rate 1 as β → ∞.

If X ~ Beta(k, β), then βX approaches the gamma distribution with shape k and scale 1 as β → ∞.

Derived from other distributions

The kth order statistic of a sample of size n from the uniform distribution is a beta random variable, U_(k) ~ Beta(k, n + 1 − k).[28]

If X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independent, then X/(X + Y) ~ Beta(α, β).

If X ~ χ²(2α) and Y ~ χ²(2β) are independent, then X/(X + Y) ~ Beta(α, β).

If X ~ U(0, 1), then X^(1/α) ~ Beta(α, 1): the power function distribution.

    Combination with other distributions

If X ~ Beta(α, β) and Y ~ F(2β, 2α), then Pr(X ≤ α/(α + βx)) = Pr(Y ≥ x) for all x > 0.

    Compounding with other distributions

If p ~ Beta(α, β) and X ~ Binomial(k, p), then X follows a beta-binomial distribution.

If p ~ Beta(α, β) and X ~ NegativeBinomial(r, p), then X follows a beta negative binomial distribution.

Generalisations

The Dirichlet distribution is a multivariate generalization of the beta distribution: univariate marginals of the Dirichlet distribution have a beta distribution.

The beta distribution is equivalent to the values that make the Pearson type I distribution a proper probability distribution.

The noncentral beta distribution is a further generalization.


    Applications

Order statistics

The beta distribution has an important application in the theory of order statistics. A basic result is that the distribution of the kth smallest of a sample of size n from a continuous uniform distribution has a beta distribution.[28] This result is summarized as:

U_(k) ~ Beta(k, n + 1 − k)

From this, and application of the theory related to the probability integral transform, the distribution of any individual order statistic from any continuous distribution can be derived.[28]

Rule of succession

A classic application of the beta distribution is the rule of succession, introduced in the 18th century by Pierre-Simon Laplace in the course of treating the sunrise problem. It states that, given s successes in n conditionally independent Bernoulli trials with probability p, p should be estimated as (s + 1)/(n + 2). This estimate may be regarded as the expected value of the posterior distribution over p, namely Beta(s + 1, n − s + 1), which is given by Bayes' rule if one assumes a uniform prior over p (i.e., Beta(1, 1)) and then observes that p generated s successes in n trials.

Bayesian inference

Beta distributions are used extensively in Bayesian inference, since beta distributions provide a family of conjugate prior distributions for binomial (including Bernoulli) and geometric distributions. The Beta(0, 0) distribution is an improper prior and sometimes used to represent ignorance of parameter values.

The domain of the beta distribution can be viewed as a probability, and in fact the beta distribution is often used to describe the distribution of an unknown probability value, typically as the prior distribution over a probability parameter, such as the probability of success in a binomial distribution or Bernoulli distribution. In fact, the beta distribution is the conjugate prior of the binomial distribution and Bernoulli distribution. The beta distribution is the special case of the Dirichlet distribution with only two parameters, and the beta is conjugate to the binomial and Bernoulli distributions in exactly the same way as the Dirichlet distribution is conjugate to the multinomial distribution and categorical distribution.

In Bayesian inference, the beta distribution can be derived as the posterior probability of the parameter p of a binomial distribution after observing α − 1 successes (with probability p of success) and β − 1 failures (with probability 1 − p of failure). Another way to express this is that placing a prior distribution of Beta(α, β) on the parameter p of a binomial distribution is equivalent to adding α pseudo-observations of "success" and β pseudo-observations of "failure" to the actual number of successes and failures observed, then estimating the parameter p by the proportion of successes over both real and pseudo-observations. If α and β are greater than 0, this has the effect of smoothing out the distribution of the parameters by ensuring that some positive probability mass is assigned to all parameters even when no actual observations corresponding to those parameters are observed. Values of α and β less than 1 favor sparsity, i.e. distributions where the parameter p is close to either 0 or 1. In effect, α and β, when operating together, function as a concentration parameter; see that article for more details.
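The pseudo-count reading of the conjugate update is one line of arithmetic; a minimal sketch (plain Python, with illustrative numbers):

```python
# Sketch: conjugate Beta-Binomial update. Prior Beta(a, b); observing
# s successes and f failures gives posterior Beta(a + s, b + f).
a, b = 2.0, 2.0          # prior pseudo-counts (illustrative values)
s, f = 7, 3              # observed successes and failures

a_post, b_post = a + s, b + f
posterior_mean = a_post / (a_post + b_post)   # (a+s)/(a+b+s+f)
print(a_post, b_post, posterior_mean)         # 9.0 5.0 0.642857...

# With a = b = 1 (uniform prior) the posterior mean is (s+1)/(n+2):
# the rule of succession described above.
```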

Subjective logic

In standard logic, propositions are considered to be either true or false. In contradistinction, subjective logic assumes that humans cannot determine with absolute certainty whether a proposition about the real world is absolutely true or false. In subjective logic the a posteriori probability estimates of binary events can be represented by beta distributions.[31]

Wavelet analysis

A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to zero. It can typically be visualized as a "brief oscillation" that promptly decays. Wavelets can be used to extract information from many different kinds of data, including, but certainly not limited to, audio signals and images. Thus, wavelets are purposefully crafted to have specific properties that make them useful for signal processing. Wavelets are localized in both time and frequency, whereas the standard Fourier transform is only localized in frequency. Therefore, standard Fourier transforms are only applicable to stationary processes, while wavelets are applicable to non-stationary processes. Continuous wavelets can be constructed based on the beta distribution. Beta wavelets[32] can be viewed as a soft variety of Haar wavelets whose shape is fine-tuned by two shape parameters α and β.

Project management: task cost and schedule modeling

The beta distribution can be used to model events which are constrained to take place within an interval defined by a minimum and maximum value. For this reason, the beta distribution, along with the triangular distribution, is used extensively in PERT, the critical path method (CPM), Joint Cost Schedule Modeling (JCSM) and other project management / control systems to describe the time to completion and the cost of a task. In project management, shorthand computations are widely used to estimate the mean and standard deviation of the beta distribution[30]:

μ(X) = (a + 4b + c)/6
σ(X) = (c − a)/6

where a is the minimum, c is the maximum, and b is the most likely value (the mode, for α > 1 and β > 1).
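A transcription of the shorthand into Python (values illustrative, not from the original text):

```python
# Sketch: PERT three-point estimates for a task duration,
# mean = (a + 4b + c)/6 and sigma = (c - a)/6 as given above.
def pert_estimates(a, b, c):
    """a = minimum, b = most likely value (mode), c = maximum."""
    mean = (a + 4 * b + c) / 6.0
    sigma = (c - a) / 6.0
    return mean, sigma

print(pert_estimates(a=2.0, b=5.0, c=14.0))   # (6.0, 2.0), e.g. in days
```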

The above estimate for the mean, μ(X) = (a + 4b + c)/6, is known as the PERT three-point estimation and it is exact for either of the following values of β (for arbitrary α within these ranges):

β = α > 1 (symmetric case), with standard deviation σ(X) = (c − a)/(2√(2α + 1)), skewness = 0, and excess kurtosis = −6/(3 + 2α)

or

β = 6 − α for 5 > α > 1 (skewed case), with standard deviation σ(X) = (c − a)√(α(6 − α))/(6√7), skewness = (3 − α)√7/(2√(α(6 − α))), and excess kurtosis = (7(α − 3)² − 2α(6 − α))/(3α(6 − α))

The above estimate for the standard deviation, σ(X) = (c − a)/6, is exact for either of the following values of α and β:

α = β = 4 (symmetric), with skewness = 0 and excess kurtosis = −6/11

or

α = 3 − √2 and β = 3 + √2 (right-tailed, positive skew), with skewness = 1/√2 and excess kurtosis = 0

or

α = 3 + √2 and β = 3 − √2 (left-tailed, negative skew), with skewness = −1/√2 and excess kurtosis = 0

Otherwise, these can be poor approximations for beta distributions with other values of α and β; for some combinations of shape parameters they have been shown to exhibit average errors of 40% in the mean and 549% in the variance[33][34][35].

    Alternative parametrizations

    Two parameters

    Mean and sample size

The beta distribution may also be reparameterized in terms of its mean μ (0 < μ < 1) and sample size ν = α + β (ν > 0). This is useful in Bayesian parameter estimation if one wants to place an unbiased (uniform) prior over the mean. For example, one may administer a test to a number of individuals. If it is assumed that each person's score (0 ≤ θ ≤ 1) is drawn from a population-level beta distribution, then an important statistic is the mean of this population-level distribution. The mean and sample size parameters are related to the shape parameters α and β via[36]

α = μν, β = (1 − μ)ν

Under this parametrization, one can place a uniform prior over the mean, and a vague prior (such as an exponential or gamma distribution) over the positive reals for the sample size.
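The two-way conversion is a pair of one-liners; a sketch in plain Python:

```python
# Sketch: converting between (alpha, beta) and the (mean, sample size)
# parametrization mu = a/(a+b), nu = a+b.
def to_mean_samplesize(a, b):
    return a / (a + b), a + b

def from_mean_samplesize(mu, nu):
    return mu * nu, (1 - mu) * nu           # alpha = mu*nu, beta = (1-mu)*nu

mu, nu = to_mean_samplesize(2.0, 5.0)
print(mu, nu)                               # 0.2857..., 7.0
print(from_mean_samplesize(mu, nu))         # back to (2.0, 5.0)
```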

    Mean (allele frequency) and (Wright's) genetic distance between two populations

The Balding–Nichols model is a two-parameter parametrization of the beta distribution used in population genetics. It is a statistical description of the allele frequencies in the components of a sub-divided population. See the articles on the Balding–Nichols model, F-statistics, fixation index and coefficient of relationship for further information.

    Mean and variance

Solving the system of (coupled) equations given in the above sections as the equations for the mean and the variance of the beta distribution in terms of the original parameters α and β, one can express the α and β parameters in terms of the mean μ and the variance var:

ν = α + β = μ(1 − μ)/var − 1, where ν > 0 (therefore var < μ(1 − μ))
α = μν = μ( μ(1 − μ)/var − 1 ), if var < μ(1 − μ)
β = (1 − μ)ν = (1 − μ)( μ(1 − μ)/var − 1 ), if var < μ(1 − μ)

This parametrization of the beta distribution may lead to a more intuitive understanding than the one based on the original parameters α and β, for example by expressing the mode, skewness, excess kurtosis and differential entropy in terms of the mean and the variance, obtained by substituting the above expressions for α and β into the formulas given in the sections above.
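The same inversion in code (plain Python), guarding the var < μ(1 − μ) condition:

```python
# Sketch: recovering (alpha, beta) from the mean and variance,
# valid only when var < mu*(1 - mu).
def beta_from_mean_var(mu, var):
    if not 0 < mu < 1 or var <= 0 or var >= mu * (1 - mu):
        raise ValueError("need 0 < mu < 1 and 0 < var < mu*(1 - mu)")
    nu = mu * (1 - mu) / var - 1            # nu = alpha + beta
    return mu * nu, (1 - mu) * nu

print(beta_from_mean_var(2.0 / 7.0, 10.0 / 392.0))   # (2.0, 5.0)
```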

Four parameters

A beta distribution with the two shape parameters α and β is supported on the range [0, 1]. It is possible to alter the location and scale of the distribution by introducing two further parameters representing the minimum, a, and maximum, c, values of the distribution[5], by a linear transformation substituting the non-dimensional variable x in terms of the new variable y (with support [a, c]) and the parameters a and c:

x = (y − a)/(c − a), so that y = a + (c − a)x

The probability density function of the four-parameter beta distribution is then given by

f(y; α, β, a, c) = (y − a)^(α−1) (c − y)^(β−1) / ( B(α, β) (c − a)^(α+β−1) )

The mean, mode and variance of the four-parameter beta distribution are:

mean = a + (c − a) α/(α + β)
mode = a + (c − a)(α − 1)/(α + β − 2), for α, β > 1
variance = (c − a)² αβ / ( (α + β)² (α + β + 1) )

Since the skewness and excess kurtosis are non-dimensional quantities (as moments normalized by the standard deviation), they are independent of the parameters a and c, and therefore equal to the expressions given above in terms of X (with support [0, 1]).


    History

[Figure: Karl Pearson analyzed the beta distribution as the solution "Type I" of Pearson distributions.]

The first systematic, modern discussion of the beta distribution is probably due to Karl Pearson FRS[37] (27 March 1857 – 27 April 1936[38]), an influential English mathematician who has been credited with establishing the discipline of mathematical statistics[39]. In Pearson's papers[18] the beta distribution is couched as a solution of a differential equation: Pearson's Type I distribution. The beta distribution is essentially identical to Pearson's Type I distribution for the parameter values at which Pearson's differential equation solution becomes a proper statistical distribution (with area under the probability distribution equal to 1). In fact, in several English books and journal articles in the few decades prior to World War II, it was common to refer to the beta distribution as Pearson's Type I distribution. According to David and Edwards's comprehensive treatise on the history of statistics[40], the first modern treatment of the beta distribution[41], using the beta designation that has become standard, is due to Corrado Gini (May 23, 1884 – March 13, 1965), an Italian statistician, demographer and sociologist who developed the Gini coefficient.

References

[1] Oguamanam, D.C.D.; Martin, H.R.; Huissoon, J.P. (1995). "On the application of the beta distribution to gear damage analysis". Applied Acoustics 45 (3): 247–261. doi:10.1016/0003-682X(95)00001-P.
[2] Sulaiman, M. Yusof; W.M. Hlaing Oo; Mahdi Abd Wahab; Azmi Zakaria (December 1999). "Application of beta distribution model to Malaysian sunshine data". Renewable Energy 18 (4): 573–579. doi:10.1016/S0960-1481(99)00002-6.
[3] Haskett, Jonathan D.; Yakov A. Pachepsky; Basil Acock (1995). "Use of the beta distribution for parameterizing variability of soil properties at the regional level for crop yield estimation". Agricultural Systems 48 (1): 73–86. doi:10.1016/0308-521X(95)93646-U.
[4] Gullco, Robert S.; Malcolm Anderson (December 2009). "Use of the Beta Distribution To Determine Well-Log Shale Parameters". SPE Reservoir Evaluation & Engineering 12 (6): 929–942. doi:10.2118/106746-PA.
[5] Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1995). "Chapter 21: Beta Distributions". Continuous Univariate Distributions Vol. 2 (2nd ed.). Wiley. ISBN 978-0-471-58494-0.
[6] Keeping, E. S. (2010). Introduction to Statistical Inference. Dover Publications. 462 pages. ISBN 978-0486685021.
[7] Wadsworth, George P. and Joseph Bryan (1960). Introduction to Probability and Random Variables. McGraw-Hill. p. 101.
[8] Hahn, Gerald J. and S. Shapiro (1994). Statistical Models in Engineering (Wiley Classics Library). Wiley-Interscience. p. 376. ISBN 978-0471040651.
[9] Feller, William (1971). An Introduction to Probability Theory and Its Applications, Vol. 2. Wiley. p. 669. ISBN 978-0471257097.
[10] Gupta, Arjun K. (editor) (2004). Handbook of Beta Distribution and Its Applications. CRC Press. p. 42. ISBN 978-0824753962.
[11] Panik, Michael J. (2005). Advanced Statistics from an Elementary Point of View. Academic Press. p. 273. ISBN 978-0120884940.
[12] Rose, Colin, and Murray D. Smith (2002). Mathematical Statistics with MATHEMATICA. Springer. 496 pages. ISBN 978-0387952345.
[13] Kerman, J. (2011). "A closed-form approximation for the median of the beta distribution". arXiv:1111.0433v1.
[14] Liang, Zhiqiang; Jianming Wei; Junyu Zhao; Haitao Liu; Baoqing Li; Jie Shen; Chunlei Zheng (27 August 2008). "The Statistical Meaning of Kurtosis and Its New Application to Identification of Persons Based on Seismic Signals". Sensors 8: 5106–5119. doi:10.3390/s8085106.
[15] Kenney, J. F., and E. S. Keeping (1951). Mathematics of Statistics Part Two, 2nd edition. D. Van Nostrand Company Inc. p. 429.
[16] Abramowitz, Milton and Irene A. Stegun (1965). Handbook of Mathematical Functions With Formulas, Graphs, and Mathematical Tables. Dover. 1046 pages. ISBN 978-0486612720.
[17] Weisstein, Eric W. "Kurtosis" (http://mathworld.wolfram.com/Kurtosis.html). MathWorld, a Wolfram Web Resource. Retrieved 13 August 2012.
[18] Pearson, Karl (1916). "Mathematical contributions to the theory of evolution, XIX: Second supplement to a memoir on skew variation". Philosophical Transactions of the Royal Society of London, Series A 216 (538–548): 429–457. Bibcode 1916RSPTA.216..429P. doi:10.1098/rsta.1916.0009. JSTOR 91092.
[19] Gradshteyn, I. S., and I. M. Ryzhik (2000). Table of Integrals, Series, and Products, 6th edition. Academic Press. p. 1163. ISBN 978-0122947575.
[20] A. C. G. Verdugo Lazo and P. N. Rathie (1978). "On the entropy of continuous probability distributions". IEEE Trans. Inf. Theory IT-24: 120–122.
[21] Shannon, Claude E. (1948). "A Mathematical Theory of Communication". Bell System Technical Journal 27 (4): 623–656. PDF (http://www.alcatel-lucent.com/bstj/vol27-1948/articles/bstj27-4-623.pdf)
[22] Pearson, Egon S. (July 1969). "Some historical reflections traced through the development of the use of frequency curves" (http://www.smu.edu/Dedman/Academics/Departments/Statistics/Research/TechnicalReports). THEMIS Statistical Analysis Research Program, Technical Report 38, Office of Naval Research, Contract N000014-68-A-0515 (Project NR 042-260): 23.
[23] Engineering Statistics Handbook (http://www.itl.nist.gov/div898/handbook/eda/section3/eda366h.htm)
[24] Beckman, R. J.; G. L. Tietjen (1978). "Maximum likelihood estimation for the beta distribution". Journal of Statistical Computation and Simulation 7 (3-4): 253–258. doi:10.1080/00949657808810232.
[25] Gnanadesikan, R., Pinkham and Hughes (1967). "Maximum likelihood estimation of the parameters of the beta distribution from smallest order statistics". Technometrics 9: 607–620.
[26] Fackler, Paul. "Inverse Digamma Function (Matlab)" (http://hips.seas.harvard.edu/content/inverse-digamma-function-matlab). Harvard University School of Engineering and Applied Sciences. Retrieved 2012-08-18.
[27] van der Waerden, B. L. Mathematical Statistics. Springer. ISBN 978-3-540-04507-6.
[28] David, H. A., Nagaraja, H. N. (2003). Order Statistics (3rd Edition). Wiley, New Jersey. p. 458. ISBN 0-471-38926-9.
[29] Herrerías-Velasco, José Manuel; Herrerías-Pleguezuelo, Rafael; van Dorp, Johan René (2011). "Revisiting the PERT mean and variance". European Journal of Operational Research 210: 448–451.
[30] Malcolm, D. G.; Roseboom, C. E.; Clark, C. E.; Fazar, W. (1959). "Application of a technique for research and development program evaluation". Operations Research 7: 646–649.
[31] A. Jøsang (June 2001). "A Logic for Uncertain Probabilities". International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 9 (3): 279–311. PDF (http://www.unik.no/people/josang/papers/Jos2001-IJUFKS.pdf)
[32] H.M. de Oliveira and G.A.A. Araújo (2005). "Compactly Supported One-cyclic Wavelets Derived from Beta Distributions". Journal of Communication and Information Systems 20 (3): 27–33.
[33] Keefer, Donald L. and Verdini, William A. (1993). "Better Estimation of PERT Activity Time Parameters". Management Science 39 (9): 1086–1091.
[34] Keefer, Donald L. and Bodily, Samuel E. (1983). "Three-point Approximations for Continuous Random Variables". Management Science 29 (5): 595–609.
[35] DRMI Newsletter, Issue 12, April 8, 2005 (http://www.nps.edu/drmi/docs/1apr05-newsletter.pdf)
[36] Kruschke, J. (2011). Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic Press / Elsevier. p. 83. ISBN 978-0123814852.
[37] Yule, G. U.; Filon, L. N. G. (1936). "Karl Pearson. 1857-1936". Obituary Notices of Fellows of the Royal Society 2 (5): 72. doi:10.1098/rsbm.1936.0007. JSTOR 769130.
[38] "Library and Archive catalogue" (http://www2.royalsociety.org/DServe/dserve.exe?dsqIni=Dserve.ini&dsqApp=Archive&dsqCmd=Show.tcl&dsqDb=Persons&dsqPos=0&dsqSearch=((text)='Pearson: Karl (1857-1936)')). Sackler Digital Archive, Royal Society. Retrieved 2011-07-01.
[39] "Karl Pearson sesquicentenary conference" (http://www.economics.soton.ac.uk/staff/aldrich/KP150.htm). Royal Statistical Society. 2007-03-03. Retrieved 2008-07-25.
[40] David, H. A. and A.W.F. Edwards (2001). Annotated Readings in the History of Statistics. Springer, 1st edition. p. 252. ISBN 978-0387988443.
[41] Gini, Corrado (1911). Studi Economico-Giuridici della Università de Cagliari, Anno III (reproduced in Metron 15, 133–171, 1949): 5–41.

  • Beta distribution 31

External links

Weisstein, Eric W., "Beta Distribution" (http://mathworld.wolfram.com/BetaDistribution.html) from MathWorld.
"Beta Distribution" (http://demonstrations.wolfram.com/BetaDistribution/) by Fiona Maclachlan, the Wolfram Demonstrations Project, 2007.
Beta Distribution Overview and Example (http://www.xycoon.com/beta.htm), xycoon.com
Beta Distribution (http://www.brighton-webs.co.uk/distributions/beta.htm), brighton-webs.co.uk

Beta function

In mathematics, the beta function, also called the Euler integral of the first kind, is a special function defined by

$$\mathrm{B}(x,y) = \int_0^1 t^{x-1}(1-t)^{y-1}\,dt$$

for $\operatorname{Re}(x) > 0$, $\operatorname{Re}(y) > 0$. The beta function was studied by Euler and Legendre and was given its name by Jacques Binet; its symbol Β is a Greek capital beta rather than the similar Latin capital B.

Properties

The beta function is symmetric, meaning that

$$\mathrm{B}(x,y) = \mathrm{B}(y,x).$$[1]

When x and y are positive integers, it follows trivially from the definition of the gamma function that:

$$\mathrm{B}(x,y) = \frac{(x-1)!\,(y-1)!}{(x+y-1)!}.$$

It has many other forms, including:

$$\mathrm{B}(x,y) = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x+y)},$$[1]

$$\mathrm{B}(x,y) = 2\int_0^{\pi/2} (\sin\theta)^{2x-1}(\cos\theta)^{2y-1}\,d\theta, \qquad \operatorname{Re}(x)>0,\ \operatorname{Re}(y)>0,$$[2]

$$\mathrm{B}(x,y) = \int_0^\infty \frac{t^{x-1}}{(1+t)^{x+y}}\,dt, \qquad \operatorname{Re}(x)>0,\ \operatorname{Re}(y)>0,$$[2]

$$\mathrm{B}(x,y)\cdot t_+^{\,x+y-1} = t_+^{\,x-1} * t_+^{\,y-1},$$

where $t_+^{\,x}$ is a truncated power function and the star denotes convolution. The second identity shows in particular $\Gamma(1/2) = \sqrt{\pi}$. Some of these identities, e.g. the trigonometric formula, can be applied to deriving the volume of an n-ball in Cartesian coordinates.

Euler's integral for the beta function may be converted into an integral over the Pochhammer contour C as

$$\left(1 - e^{2\pi i x}\right)\left(1 - e^{2\pi i y}\right)\mathrm{B}(x,y) = \int_C t^{x-1}(1-t)^{y-1}\,dt.$$


This Pochhammer contour integral converges for all values of x and y and so gives the analytic continuation of the beta function.

Just as the gamma function for integers describes factorials, the beta function can define a binomial coefficient after adjusting indices:

$$\binom{n}{k} = \frac{1}{(n+1)\,\mathrm{B}(n-k+1,\,k+1)}.$$

Moreover, for integer n, this can be integrated to give a closed form, an interpolation function for continuous values of k:

$$\binom{n}{k} = (-1)^n\, n! \cdot \frac{\sin(\pi k)}{\pi \prod_{i=0}^{n}(k-i)}.$$

The beta function was the first known scattering amplitude in string theory, first conjectured by Gabriele Veneziano. It also occurs in the theory of the preferential attachment process, a type of stochastic urn process.

Relationship between gamma function and beta function

To derive the integral representation of the beta function, write the product of two factorials as

$$\Gamma(x)\,\Gamma(y) = \int_0^\infty e^{-u} u^{x-1}\,du \int_0^\infty e^{-v} v^{y-1}\,dv.$$

Changing variables by putting u = zt, v = z(1 − t) shows that this is

$$\int_0^\infty \int_0^1 e^{-z} (zt)^{x-1} \bigl(z(1-t)\bigr)^{y-1} z\,dt\,dz = \Gamma(x+y)\,\mathrm{B}(x,y).$$

Hence

$$\mathrm{B}(x,y) = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x+y)}.$$

The stated identity may be seen as a particular case of the identity for the integral of a convolution. Taking

$$f(u) = e^{-u} u^{x-1} 1_{\mathbb{R}_+} \quad\text{and}\quad g(u) = e^{-u} u^{y-1} 1_{\mathbb{R}_+},$$

one has:

$$\Gamma(x)\,\Gamma(y) = \int_{\mathbb{R}} f(u)\,du \cdot \int_{\mathbb{R}} g(u)\,du = \int_{\mathbb{R}} (f*g)(u)\,du = \mathrm{B}(x,y)\,\Gamma(x+y).$$

Derivatives

We have

$$\frac{\partial}{\partial x}\mathrm{B}(x,y) = \mathrm{B}(x,y)\bigl(\psi(x) - \psi(x+y)\bigr),$$

where $\psi(x)$ is the digamma function.

Integrals

The Nörlund–Rice integral is a contour integral involving the beta function.


Approximation

Stirling's approximation gives the asymptotic formula

$$\mathrm{B}(x,y) \sim \sqrt{2\pi}\,\frac{x^{x-\frac12}\, y^{y-\frac12}}{(x+y)^{x+y-\frac12}}$$

for large x and large y. If on the other hand x is large and y is fixed, then

$$\mathrm{B}(x,y) \sim \Gamma(y)\, x^{-y}.$$

Incomplete beta function

The incomplete beta function, a generalization of the beta function, is defined as

$$\mathrm{B}(x;\,a,b) = \int_0^x t^{a-1}(1-t)^{b-1}\,dt.$$

For x = 1, the incomplete beta function coincides with the complete beta function. The relationship between the two functions is like that between the gamma function and its generalization the incomplete gamma function.

The regularized incomplete beta function (or regularized beta function for short) is defined in terms of the incomplete beta function and the complete beta function:

$$I_x(a,b) = \frac{\mathrm{B}(x;\,a,b)}{\mathrm{B}(a,b)}.$$

Working out the integral (one can use integration by parts) for integer values of a and b, one finds:

$$I_x(a,b) = \sum_{j=a}^{a+b-1} \binom{a+b-1}{j} x^j (1-x)^{a+b-1-j}.$$

The regularized incomplete beta function is the cumulative distribution function of the beta distribution, and is related to the cumulative distribution function of a random variable X from a binomial distribution, where the "probability of success" is p and the sample size is n:

$$\Pr(X \le k) = I_{1-p}(n-k,\,k+1).$$

Properties

$$I_0(a,b) = 0, \qquad I_1(a,b) = 1, \qquad I_x(a,b) = 1 - I_{1-x}(b,a).$$

Calculation

Even if unavailable directly, the complete and incomplete beta function values can be calculated using functions commonly included in spreadsheet or computer algebra systems. With Excel as an example, using the GammaLn and (cumulative) BetaDist functions, we have:

Complete Beta Value = Exp(GammaLn(a) + GammaLn(b) - GammaLn(a + b))

and

Incomplete Beta Value = BetaDist(x, a, b) * Exp(GammaLn(a) + GammaLn(b) - GammaLn(a + b)).

These result from rearranging the formulae for the beta distribution and for the incomplete and complete beta functions; the complete beta function can be computed from logarithms of the gamma function as above.


Similarly, in MATLAB and GNU Octave, betainc (incomplete beta function) computes the regularized incomplete beta function (which is, in fact, the cumulative distribution function of the beta distribution), and so, to get the actual incomplete beta function, one must multiply the result of betainc by the result returned by the corresponding beta function.
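The same recipe can be written in other environments. Below is a minimal Python sketch, assuming SciPy is available (scipy.special.betainc is the regularized incomplete beta function, as in MATLAB); the function names complete_beta and incomplete_beta are illustrative, not standard API.

import math
from scipy.special import betainc  # regularized incomplete beta I_x(a, b)

def complete_beta(a, b):
    # B(a, b) = exp(ln Gamma(a) + ln Gamma(b) - ln Gamma(a + b));
    # working with lgamma avoids overflow for large arguments
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def incomplete_beta(x, a, b):
    # B(x; a, b) = I_x(a, b) * B(a, b), undoing the regularization
    return betainc(a, b, x) * complete_beta(a, b)

print(complete_beta(2.0, 3.0))         # exactly 1/12 = 0.0833...
print(incomplete_beta(1.0, 2.0, 3.0))  # equals B(2, 3) when x = 1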

References

[1] Davis (1972) 6.2.2, p. 258.
[2] Davis (1972) 6.2.1, p. 258.

Askey, R. A.; Roy, R. (2010), "Beta function" (http://dlmf.nist.gov/5.12), in Olver, Frank W. J.; Lozier, Daniel M.; Boisvert, Ronald F. et al., NIST Handbook of Mathematical Functions, Cambridge University Press, ISBN 978-0521192255, MR 2723248

Zelen, M.; Severo, N. C. (1972), "26. Probability functions", in Abramowitz, Milton; Stegun, Irene A., Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, New York: Dover Publications, pp. 925-995, ISBN 978-0-486-61272-0

Davis, Philip J. (1972), "6. Gamma function and related functions" (http://www.math.sfu.ca/~cbm/aands/page_258.htm), in Abramowitz, Milton; Stegun, Irene A., Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, New York: Dover Publications, ISBN 978-0-486-61272-0

Paris, R. B. (2010), "Incomplete beta functions" (http://dlmf.nist.gov/8.17), in Olver, Frank W. J.; Lozier, Daniel M.; Boisvert, Ronald F. et al., NIST Handbook of Mathematical Functions, Cambridge University Press, ISBN 978-0521192255, MR 2723248

Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007), "Section 6.1 Gamma Function, Beta Function, Factorials" (http://apps.nrbook.com/empanel/index.html?pg=256), Numerical Recipes: The Art of Scientific Computing (3rd ed.), New York: Cambridge University Press, ISBN 978-0-521-88068-8

External links

Evaluation of beta function using Laplace transform (http://planetmath.org/?op=getobj&from=objects&id=6206), PlanetMath.org.

Arbitrarily accurate values can be obtained from:

The Wolfram Functions Site (http://functions.wolfram.com): Evaluate Beta Regularized Incomplete beta (http://functions.wolfram.com/webMathematica/FunctionEvaluation.jsp?name=BetaRegularized)

danielsoper.com: Incomplete Beta Function Calculator (http://www.danielsoper.com/statcalc/calc36.aspx), Regularized Incomplete Beta Function Calculator (http://www.danielsoper.com/statcalc/calc37.aspx)


    Beta-binomial distribution

[Plots: probability mass function and cumulative distribution function of the beta-binomial distribution.]

Parameters: n ∈ ℕ₀ (number of trials), α > 0 (real), β > 0 (real)
Support: k ∈ { 0, …, n }
PMF: $\binom{n}{k}\frac{\mathrm{B}(k+\alpha,\,n-k+\beta)}{\mathrm{B}(\alpha,\beta)}$
CDF: expressible via the generalized hypergeometric function ${}_3F_2(\mathbf{a};\mathbf{b};k) = {}_3F_2(1,\,\alpha+k+1,\,-n+k+1;\;k+2,\,-\beta-n+k+2;\;1)$
Mean: $\frac{n\alpha}{\alpha+\beta}$
Variance: $\frac{n\alpha\beta(\alpha+\beta+n)}{(\alpha+\beta)^2(\alpha+\beta+1)}$
Skewness: $\frac{(\alpha+\beta+2n)(\beta-\alpha)}{\alpha+\beta+2}\sqrt{\frac{1+\alpha+\beta}{n\alpha\beta(n+\alpha+\beta)}}$
Ex. kurtosis: see text
MGF: ${}_2F_1(-n,\alpha;\,\alpha+\beta;\,1-e^{t})$
CF: ${}_2F_1(-n,\alpha;\,\alpha+\beta;\,1-e^{it})$

In probability theory and statistics, the beta-binomial distribution is a family of discrete probability distributions on a finite support of non-negative integers, arising when the probability of success in each of a fixed or known number of Bernoulli trials is either unknown or random. It is frequently used in Bayesian statistics, empirical Bayes methods and classical statistics as an overdispersed binomial distribution.


It reduces to the Bernoulli distribution as a special case when n = 1. For α = β = 1, it is the discrete uniform distribution from 0 to n. It also approximates the binomial distribution arbitrarily well for large α and β. The beta-binomial is a one-dimensional version of the Dirichlet-multinomial distribution, as the binomial and beta distributions are special cases of the multinomial and Dirichlet distributions, respectively.

    Motivation and derivation

Beta-binomial distribution as a compound distribution

The beta distribution is a conjugate distribution of the binomial distribution. This fact leads to an analytically tractable compound distribution where one can think of the parameter p in the binomial distribution as being randomly drawn from a beta distribution. Namely, if

$$L(k \mid p) = \operatorname{Bin}(n, p) = \binom{n}{k} p^k (1-p)^{n-k}$$

is the binomial distribution where p is a random variable with a beta distribution

$$\pi(p \mid \alpha, \beta) = \mathrm{Beta}(\alpha, \beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{\mathrm{B}(\alpha, \beta)},$$

then the compound distribution is given by

$$f(k \mid n, \alpha, \beta) = \int_0^1 L(k \mid p)\,\pi(p \mid \alpha, \beta)\,dp = \binom{n}{k} \frac{\mathrm{B}(k+\alpha,\, n-k+\beta)}{\mathrm{B}(\alpha, \beta)}.$$

Using the properties of the beta function, this can alternatively be written

$$f(k \mid n, \alpha, \beta) = \frac{\Gamma(n+1)}{\Gamma(k+1)\,\Gamma(n-k+1)} \cdot \frac{\Gamma(k+\alpha)\,\Gamma(n-k+\beta)}{\Gamma(n+\alpha+\beta)} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}.$$

It is within this context that the beta-binomial distribution appears often in Bayesian statistics: the beta-binomial is the predictive distribution of a binomial random variable with a beta distribution prior on the success probability.
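A minimal Python sketch of this pmf, assuming SciPy; the computation is done in log space for numerical stability, and the function name beta_binomial_pmf is illustrative.

import math
from scipy.special import betaln, gammaln

def log_comb(n, k):
    # log of the binomial coefficient C(n, k)
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def beta_binomial_pmf(k, n, alpha, beta):
    # P(X = k) = C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)
    log_p = (log_comb(n, k)
             + betaln(k + alpha, n - k + beta)
             - betaln(alpha, beta))
    return math.exp(log_p)

# Sanity check: for alpha = beta = 1 the pmf is uniform on {0, ..., n}
print([round(beta_binomial_pmf(k, 3, 1.0, 1.0), 4) for k in range(4)])
# -> [0.25, 0.25, 0.25, 0.25]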

Beta-binomial as an urn model

The beta-binomial distribution can also be motivated via an urn model for positive integer values of α and β. Specifically, imagine an urn containing α red balls and β black balls, from which random draws are made. If a red ball is observed, then two red balls are returned to the urn. Likewise, if a black ball is drawn, it is replaced and another black ball is added to the urn. If this is repeated n times, then the probability of observing k red balls follows a beta-binomial distribution with parameters n, α and β.

Note that if the random draws are with simple replacement (no balls over and above the observed ball are added to the urn), then the distribution follows a binomial distribution, and if the random draws are made without replacement, the distribution follows a hypergeometric distribution.
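A short simulation sketch of this urn scheme in Python (standard library only; assumes integer α and β as in the description above). The empirical distribution of k over many replications approximates the beta-binomial pmf with parameters n, α, β.

import random
from collections import Counter

def polya_urn_draws(n, alpha, beta, rng=random):
    red, black = alpha, beta
    k = 0  # number of red balls observed
    for _ in range(n):
        if rng.random() < red / (red + black):
            k += 1
            red += 1    # the observed red ball goes back, plus one extra red
        else:
            black += 1  # the observed black ball goes back, plus one extra black
    return k

random.seed(0)
counts = Counter(polya_urn_draws(5, 2, 3) for _ in range(10000))
print(sorted(counts.items()))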


Moments and properties

The first two raw moments are

$$\mu_1 = \frac{n\alpha}{\alpha+\beta}, \qquad \mu_2 = \frac{n\alpha\bigl(n(1+\alpha)+\beta\bigr)}{(\alpha+\beta)(1+\alpha+\beta)};$$

higher raw moments and the kurtosis follow in the same way from the compound representation. Letting

$$\pi = \frac{\alpha}{\alpha+\beta},$$

we note, suggestively, that the mean can be written as

$$\mu = \frac{n\alpha}{\alpha+\beta} = n\pi$$

and the variance as

$$\sigma^2 = \frac{n\alpha\beta(\alpha+\beta+n)}{(\alpha+\beta)^2(\alpha+\beta+1)} = n\pi(1-\pi)\bigl(1+(n-1)\rho\bigr),$$

where $\rho = \tfrac{1}{\alpha+\beta+1}$ is the pairwise correlation between the n Bernoulli draws and is called the over-dispersion parameter.

    Point estimates

Method of moments

The method of moments estimates can be gained by noting the first and second moments of the beta-binomial, namely

$$\mu_1 = \frac{n\alpha}{\alpha+\beta}, \qquad \mu_2 = \frac{n\alpha\bigl(n(1+\alpha)+\beta\bigr)}{(\alpha+\beta)(1+\alpha+\beta)},$$

setting these raw moments equal to the sample moments

$$m_1 = \mu_1, \qquad m_2 = \mu_2,$$

and solving for α and β, which gives

$$\hat\alpha = \frac{n m_1 - m_2}{n\left(\frac{m_2}{m_1} - m_1 - 1\right) + m_1}, \qquad \hat\beta = \frac{(n - m_1)\left(n - \frac{m_2}{m_1}\right)}{n\left(\frac{m_2}{m_1} - m_1 - 1\right) + m_1}.$$

Note that these estimates can be nonsensically negative, which is evidence that the data are either equidispersed or underdispersed relative to the binomial distribution. In this case, the binomial distribution and the hypergeometric distribution are alternative candidates, respectively.
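A direct Python transcription of these estimators (the function name is illustrative); m1 and m2 are the first two sample raw moments of k over observations sharing a common n.

def beta_binomial_mom(n, m1, m2):
    # Method-of-moments estimates for (alpha, beta); may come out negative
    # when the data are not overdispersed relative to the binomial.
    denom = n * (m2 / m1 - m1 - 1) + m1
    alpha_hat = (n * m1 - m2) / denom
    beta_hat = (n - m1) * (n - m2 / m1) / denom
    return alpha_hat, beta_hat

# Example with arbitrary sample moments:
print(beta_binomial_mom(n=10, m1=4.0, m2=20.0))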


Maximum likelihood estimation

While closed-form maximum likelihood estimates are impractical, given that the pdf consists of common functions (gamma and/or beta functions), they can be easily found via direct numerical optimization. Maximum likelihood estimates from empirical data can be computed using general methods for fitting multinomial Pólya distributions, methods for which are described in (Minka 2003). The R package VGAM, through the function vglm, facilitates the fitting of GLM-type models with responses distributed according to the beta-binomial distribution via maximum likelihood. Note also that there is no requirement that n be fixed throughout the observations.
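A minimal sketch of such a direct numerical fit in Python, assuming SciPy; data is supplied as paired arrays of successes k_i and trial counts n_i, so n need not be fixed across observations. Function names and the choice of Nelder-Mead are illustrative, not prescribed by the sources above.

import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, gammaln

def neg_log_likelihood(params, ks, ns):
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf  # keep the optimizer inside the valid parameter region
    log_comb = gammaln(ns + 1) - gammaln(ks + 1) - gammaln(ns - ks + 1)
    ll = log_comb + betaln(ks + a, ns - ks + b) - betaln(a, b)
    return -np.sum(ll)

def fit_beta_binomial(ks, ns, start=(1.0, 1.0)):
    ks, ns = np.asarray(ks, float), np.asarray(ns, float)
    result = minimize(neg_log_likelihood, start, args=(ks, ns),
                      method="Nelder-Mead")
    return result.x  # (alpha_hat, beta_hat)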

Example

The following data gives the number of male children among the first 12 children of family size 13 in 6115 families taken from hospital records in 19th-century Saxony (Sokal and Rohlf, p. 59, from Lindsey). The 13th child is ignored to assuage the effect of families non-randomly stopping when a desired gender is reached.

    Males 0 1 2 3 4 5 6 7 8 9 10 11 12

    Families 3 24 104 286 670 1033 1343 1112 829 478 181 45 7

We note the first two sample moments are

$$m_1 = \frac{\sum_i k_i f_i}{N} \approx 6.23, \qquad m_2 = \frac{\sum_i k_i^2 f_i}{N} \approx 42.31,$$

where $f_i$ is the number of families with $k_i$ males and N = 6115, and therefore the method of moments estimates are

$$\hat\alpha \approx 34.13, \qquad \hat\beta \approx 31.61.$$

The maximum likelihood estimates can be found numerically,

$$\hat\alpha_{\mathrm{mle}} \approx 34.10, \qquad \hat\beta_{\mathrm{mle}} \approx 31.57,$$

and the maximized log-likelihood is approximately −12492.9, from which we find the AIC ≈ 24989.7.

The AIC for the competing binomial model is AIC = 25070.34, and thus we see that the beta-binomial model provides a superior fit to the data, i.e. there is evidence for overdispersion. Trivers and Willard posit a theoretical justification for heterogeneity in gender-proneness among families (i.e. overdispersion).

The superior fit is evident especially among the tails:


    Males 0 1 2 3 4 5 6 7 8 9 10 11 12

    Observed Families 3 24 104 286 670 1033 1343 1112 829 478 181 45 7

    Predicted (Beta-Binomial) 2.3 22.6 104.8 310.9 655.7 1036.2 1257.9 1182.1 853.6 461.9 177.9 43.8 5.2

    Predicted (Binomial p = 0.519215) 0.9 12.1 71.8 258.5 628.1 1085.2 1367.3 1265.6 854.2 410.0 132.8 26.1 2.3
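The tables above can be approximately reproduced from the raw frequencies. Below is a Python sketch (SciPy assumed) that recomputes the sample moments, the method-of-moments estimates, and the fitted counts; note the tabulated beta-binomial predictions correspond to the maximum likelihood fit, so the moment-based numbers here may differ slightly.

import numpy as np
from scipy.special import betaln, gammaln
from scipy.stats import binom

males = np.arange(13)
families = np.array([3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7])
N, n = families.sum(), 12

m1 = (males * families).sum() / N      # first sample raw moment, ~6.23
m2 = (males**2 * families).sum() / N   # second sample raw moment, ~42.31
denom = n * (m2 / m1 - m1 - 1) + m1
a = (n * m1 - m2) / denom                       # method-of-moments alpha
b = (n - m1) * (n - m2 / m1) / denom            # method-of-moments beta

log_pmf = (gammaln(n + 1) - gammaln(males + 1) - gammaln(n - males + 1)
           + betaln(males + a, n - males + b) - betaln(a, b))
print(np.round(N * np.exp(log_pmf), 1))               # beta-binomial fit
print(np.round(N * binom.pmf(males, n, m1 / n), 1))   # binomial fit, p = m1/n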

Further Bayesian considerations

It is convenient to reparameterize the distributions so that the expected mean of the prior is a single parameter. Let

$$\pi(\theta \mid \mu, M) = \mathrm{Beta}\bigl(M\mu,\, M(1-\mu)\bigr),$$

where

$$\mu = \frac{\alpha}{\alpha+\beta}, \qquad M = \alpha+\beta,$$

so that

$$\operatorname{E}(\theta \mid \mu, M) = \mu, \qquad \operatorname{Var}(\theta \mid \mu, M) = \frac{\mu(1-\mu)}{M+1}.$$

The posterior distribution ρ(θ | k) is also a beta distribution:

$$\rho(\theta \mid k) \propto \ell(k \mid \theta)\,\pi(\theta \mid \mu, M) = \mathrm{Beta}\bigl(k + M\mu,\; n - k + M(1-\mu)\bigr).$$

And

$$\operatorname{E}(\theta \mid k) = \frac{k + M\mu}{n + M},$$

while the marginal distribution m(k | μ, M) is given by

$$m(k \mid \mu, M) = \int_0^1 \ell(k \mid \theta)\,\pi(\theta \mid \mu, M)\,d\theta.$$

Because the marginal is a complex, non-linear function of gamma and digamma functions, it is quite difficult to obtain a marginal maximum likelihood estimate (MMLE) for the mean and variance. Instead, we use the method of iterated expectations to find the expected value of the marginal moments.

Let us write our model as a two-stage compound sampling model. Let $k_i$ be the number of successes out of $n_i$ trials for event i:

$$k_i \sim \operatorname{Bin}(n_i, \theta_i), \qquad \theta_i \sim \mathrm{Beta}(\mu, M),\ \text{i.i.d.}$$


We can find iterated moment estimates for the mean and variance using the moments for the distributions in the two-stage model:

$$\operatorname{E}\!\left(\frac{k}{n}\right) = \operatorname{E}\!\left[\operatorname{E}\!\left(\frac{k}{n}\,\middle|\,\theta\right)\right] = \operatorname{E}(\theta) = \mu,$$

$$\operatorname{Var}\!\left(\frac{k}{n}\right) = \operatorname{E}\!\left[\operatorname{Var}\!\left(\frac{k}{n}\,\middle|\,\theta\right)\right] + \operatorname{Var}\!\left[\operatorname{E}\!\left(\frac{k}{n}\,\middle|\,\theta\right)\right] = \frac{\mu(1-\mu)}{n}\left(1 + \frac{n-1}{M+1}\right).$$

(Here we have used the law of total expectation and the law of total variance.)

We want point estimates for μ and M. The estimated mean is calculated from the sample as the weighted average

$$\hat\mu = \frac{\sum_i k_i}{\sum_i n_i}.$$

The estimate of the hyperparameter M is obtained by equating the observed variance of the proportions $k_i/n_i$ around $\hat\mu$ to its expectation under the two-stage model and solving for M.

Since we now have parameter point estimates $\hat\mu$ and $\hat M$ for the underlying distribution, we would like to find a point estimate for the probability of success for event i. This is the weighted average of the event estimate $k_i/n_i$ and $\hat\mu$. Given our point estimates for the prior, we may now plug in these values to find a point estimate for the posterior:

$$\operatorname{E}(\theta \mid k_i) = \frac{k_i + \hat M \hat\mu}{n_i + \hat M}.$$

Shrinkage factors

We may write the posterior estimate as a weighted average:

$$\operatorname{E}(\theta \mid k_i) = \hat B\,\hat\mu + (1 - \hat B)\,\frac{k_i}{n_i},$$

where $\hat B = \frac{\hat M}{\hat M + n_i}$ is called the shrinkage factor.
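A minimal sketch of this shrinkage estimate in Python (the function name is illustrative); it assumes moment estimates mu_hat and M_hat have already been obtained as above.

def shrinkage_estimate(k_i, n_i, mu_hat, M_hat):
    # Posterior point estimate (k_i + M*mu) / (n_i + M), written as the
    # weighted average B*mu_hat + (1 - B)*(k_i / n_i)
    B = M_hat / (M_hat + n_i)  # shrinkage factor
    return B * mu_hat + (1 - B) * (k_i / n_i)

# Small n_i: B near 1, estimate pulled toward the prior mean mu_hat.
# Large n_i: B near 0, estimate close to the observed proportion k_i / n_i.
print(shrinkage_estimate(k_i=3, n_i=5, mu_hat=0.5, M_hat=10.0))  # 0.5333...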


Related distributions

$\mathrm{BetaBin}(n, 1, 1) \sim U(0, n)$, where $U(0, n)$ is the discrete uniform distribution.

References

Minka, Thomas P. (2003). Estimating a Dirichlet distribution (http://research.microsoft.com/~minka/papers/dirichlet/). Microsoft Technical Report.

External links

Using the beta-binomial distribution to assess performance of a biometric identification device (http://it.stlawu.edu/~msch/biometrics/papers.htm)

Fastfit (http://research.microsoft.com/~minka/software/fastfit/) contains Matlab code for fitting beta-binomial distributions (in the form of two-dimensional Pólya distributions) to data.

    Binomial coefficient

[Figure: The binomial coefficients can be arranged to form Pascal's triangle.]

In mathematics, binomial coefficients are a family of positive integers that occur as coefficients in the binomial theorem. They are indexed by two nonnegative integers; the binomial coefficient indexed by n and k is usually written $\binom{n}{k}$. It is the coefficient of the $x^k$ term in the polynomial expansion of the binomial power $(1+x)^n$. Under suitable circumstances the value of the coefficient is given by the expression $\frac{n!}{k!\,(n-k)!}$. Arranging binomial coefficients into rows for successive values of n, in which k ranges from 0 to n, gives a triangular array called Pascal's triangle.

This family of numbers also arises in many areas other than algebra, notably in combinatorics. For any set containing n elements, the number of distinct k-element subsets of it that can be formed (the k-combinations of its elements) is given by the binomial coefficient $\binom{n}{k}$. Therefore $\binom{n}{k}$ is often read as "n choose k". The properties of binomial coefficients have led to extending the meaning of the symbol $\binom{n}{k}$ beyond the basic case where n and k are nonnegative integers with k ≤ n; such expressions are then still called binomial coefficients.

The notation $\binom{n}{k}$ was introduced by Andreas von Ettingshausen in 1826,[1] although the numbers were already known centuries before that (see Pascal's triangle). The earliest known detailed discussion of binomial coefficients is in a tenth-century commentary, due to Halayudha, on an ancient Hindu classic, Pingala's chandaḥśāstra. In about 1150, the Hindu mathematician Bhaskaracharya gave a very clear exposition of binomial coefficients in his book Lilavati.[2]

Alternative notations include $C(n,k)$, ${}_nC_k$, ${}^nC_k$, $C^n_k$, $C_{n,k}$,[3] in all of which the C stands for combinations or choices.


Definition and interpretations

For natural numbers (taken to include 0) n and k, the binomial coefficient $\binom{n}{k}$ can be defined as the coefficient of the monomial $X^k$ in the expansion of $(1+X)^n$. The same coefficient also occurs (if k ≤ n) in the binomial formula

$$(x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}$$

(valid for any elements x, y of a commutative ring), which explains the name "binomial coefficient".

Another occurrence of this number is in combinatorics, where it gives the number of ways, disregarding order, that k objects can be chosen from among n objects; more formally, the number of k-element subsets (or k-combinations) of an n-element set. This number can be seen as equal to the one of the first definition, independently of any of the formulas below to compute it: if in each of the n factors of the power $(1+X)^n$ one temporarily labels the term X with an index i (running from 1 to n), then each subset of k indices gives after expansion a contribution $X^k$, and the coefficient of that monomial in the result will be the number of such subsets. This shows in particular that $\binom{n}{k}$ is a natural number for any natural numbers n and k. There are many other combinatorial interpretations of binomial coefficients (counting problems for which the answer is given by a binomial coefficient expression); for instance the number of words formed of n bits (digits 0 or 1) whose sum is k is given by $\binom{n}{k}$, while the number of ways to write $k = a_1 + a_2 + \cdots + a_n$ where every $a_i$ is a nonnegative integer is given by $\binom{n+k-1}{n-1}$. Most of these interpretations are easily seen to be equivalent to counting k-combinations.

Computing the value of binomial coefficients

Several methods exist to compute the value of $\binom{n}{k}$ without actually expanding a binomial power or counting k-combinations.

Recursive formula

One has a recursive formula for binomial coefficients

$$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k} \qquad \text{for } 1 \le k \le n-1,$$

with initial values

$$\binom{n}{0} = \binom{n}{n} = 1 \qquad \text{for } n \ge 0.$$

The formula follows either from tracing the contributions to $X^k$ in $(1+X)^{n-1}(1+X)$, or by counting k-combinations of {1, 2, ..., n} that contain n and that do not contain n separately. It follows easily that $\binom{n}{k} = 0$ when k > n, and $\binom{n}{n} = 1$ for all n, so the recursion can stop when reaching such cases. This recursive formula then allows the construction of Pascal's triangle.
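A direct Python transcription of the recursion (function name illustrative); memoization keeps the recursion from recomputing shared subproblems, which otherwise makes it exponential.

from functools import lru_cache

@lru_cache(maxsize=None)
def binom_rec(n, k):
    if k < 0 or k > n:
        return 0          # C(n, k) = 0 outside 0 <= k <= n
    if k == 0 or k == n:
        return 1          # initial values
    return binom_rec(n - 1, k - 1) + binom_rec(n - 1, k)

print(binom_rec(5, 2))  # 10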


Multiplicative formula

A more efficient method to compute individual binomial coefficients is given by the formula

$$\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n(n-1)(n-2)\cdots(n-k+1)}{k(k-1)(k-2)\cdots 1} = \prod_{i=1}^{k} \frac{n-k+i}{i},$$

where the numerator of the first fraction, $n^{\underline{k}}$, is expressed as a falling factorial power. This formula is easiest to understand for the combinatorial interpretation of binomial coefficients. The numerator gives the number of ways to select a sequence of k distinct objects, retaining the order of selection, from a set of n objects. The denominator counts the number of distinct sequences that define the same k-combination when order is disregarded.
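A minimal Python sketch of the product form. Dividing at each step keeps every intermediate value an exact integer: after i iterations the running product equals $\binom{n-k+i}{i}$, which is itself an integer.

def binom_mult(n, k):
    if k < 0 or k > n:
        return 0
    k = min(k, n - k)  # use the symmetry C(n, k) = C(n, n-k) to shorten the loop
    result = 1
    for i in range(1, k + 1):
        result = result * (n - k + i) // i  # exact at every step
    return result

print(binom_mult(10, 3))  # 120
print(binom_mult(52, 5))  # 2598960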

Factorial formula

Finally there is a formula using factorials that is easy to remember:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!} \qquad \text{for } 0 \le k \le n,$$

where n! denotes the factorial of n. This formula follows from the multiplicative formula above by multiplying numerator and denominator by (n − k)!; as a consequence it involves many factors common to numerator and denominator. It is less practical for explicit computation unless common factors are first canceled (in particular since factorial values grow very rapidly). The formula does exhibit a symmetry that is less evident from the multiplicative formula (though it is from the definitions):

$$\binom{n}{k} = \binom{n}{n-k}. \qquad (1)$$

Generalization and connection to the binomial series

The multiplicative formula allows the definition of binomial coefficients to be extended[4] by replacing n by an arbitrary number α (negative, real, complex) or even an element of any commutative ring in which all positive integers are invertible:

$$\binom{\alpha}{k} = \frac{\alpha^{\underline{k}}}{k!} = \frac{\alpha(\alpha-1)(\alpha-2)\cdots(\alpha-k+1)}{k(k-1)(k-2)\cdots 1}.$$

With this definition one has a generalization of the binomial formula (with one of the variables set to 1), which justifies still calling these binomial coefficients:

$$(1+X)^{\alpha} = \sum_{k=0}^{\infty} \binom{\alpha}{k} X^k. \qquad (2)$$

This formula is valid for all complex numbers α and X with |X| < 1. It can also be interpreted as an identity of formal power series in X. If α is a nonnegative integer n, then all terms with k > n are zero, and the infinite series becomes a finite sum, thereby recovering the binomial formula. However, for other values of α, including negative integers and rational numbers, the series is really infinite.
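A small Python sketch of the generalized coefficient and a truncated binomial series per equation (2); |x| < 1 is assumed for convergence, and the function names are illustrative.

def gen_binom(alpha, k):
    # alpha * (alpha - 1) * ... * (alpha - k + 1) / k!
    result = 1.0
    for i in range(k):
        result *= (alpha - i) / (i + 1)
    return result

def binomial_series(alpha, x, terms=50):
    # partial sum of (1 + x)^alpha = sum_k C(alpha, k) x^k
    return sum(gen_binom(alpha, k) * x**k for k in range(terms))

print(binomial_series(0.5, 0.2))   # close to (1.2)**0.5, about 1.0954
print(binomial_series(-1.0, 0.3))  # close to 1/1.3, about 0.7692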


    Pascal's triangle

    1000th row of Pascal'striangle, arranged vertically,

    with grey-scalerepresentations of decimaldigits of the coefficients,

    right-aligned. The leftboundary of the image

    corresponds roughly to thegraph of the logarithm of the

    binomial coefficients, andillustrates that they form a

    log-concave sequence.

Pascal's rule is the important recurrence relation

$$\binom{n}{k} + \binom{n}{k+1} = \binom{n+1}{k+1}, \qquad (3)$$

which can be used to prove by mathematical induction that $\binom{n}{k}$ is a natural number for all n and k (equivalent to the statement that k! divides the product of k consecutive integers), a fact that is not immediately obvious from formula (1).

Pascal's rule also gives rise to Pascal's triangle:


    0: 1

    1: 1 1

    2: 1 2 1

    3: 1 3 3 1

    4: 1 4 6 4 1

    5: 1 5 10 10 5 1

    6: 1 6 15 20 15 6 1

    7: 1 7 21 35 35 21 7 1

    8: 1 8 28 56 70 56 28 8 1

Row number n contains the numbers $\binom{n}{k}$ for k = 0, …, n. It is constructed by starting with ones at the outside and then always adding two adjacent numbers and writing the sum directly underneath. This method allows the quick calculation of binomial coefficients without the need for fractions or multiplications. For instance, by looking at row number 5 of the triangle, one can quickly read off that

$$(x+y)^5 = x^5 + 5x^4y + 10x^3y^2 + 10x^2y^3 + 5xy^4 + y^5.$$

The differences between elements on other diagonals are the elements in the previous diagonal, as a consequence of the recurrence relation (3) above.
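A short Python sketch of exactly this construction: each row of the triangle is built from the previous one by adding adjacent pairs, with ones on the outside.

def pascal_rows(n_rows):
    row = [1]
    for _ in range(n_rows):
        yield row
        # ones on the outside, sums of adjacent numbers in between
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]

for r in pascal_rows(6):
    print(r)
# [1] / [1, 1] / [1, 2, 1] / [1, 3, 3, 1] / [1, 4, 6, 4, 1] / [1, 5, 10, 10, 5, 1]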

Combinatorics and statistics

Binomial coefficients are of importance in combinatorics, because they provide ready formulas for certain frequent counting problems:

There are $\binom{n}{k}$ ways to choose k elements from a set of n elements. See Combination.
There are $\binom{n+k-1}{k}$ ways to choose k elements from a set of n if repetitions are allowed. See Multiset.
There are $\binom{n+k}{k}$ strings containing k ones and n zeros.
There are $\binom{n+1}{k}$ strings consisting of k ones and n zeros such that no two ones are adjacent.[5]
The Catalan numbers are $\frac{1}{n+1}\binom{2n}{n}$.
The binomial distribution in statistics is $\binom{n}{k} p^k (1-p)^{n-k}$.
The formula for a Bézier curve.

Binomial coefficients as polynomials

For any nonnegative integer k, the expression $\binom{t}{k}$ can be simplified and defined as a polynomial divided by k!:

$$\binom{t}{k} = \frac{t^{\underline{k}}}{k!} = \frac{t(t-1)(t-2)\cdots(t-k+1)}{k!}.$$

This presents a polynomial in t with rational coefficients. As such, it can be evaluated at any real or complex number t to define binomial coefficients with such first arguments. These "generalized binomial coefficients" appear in Newton's generalized binomial theorem.

For each k, the polynomial $\binom{t}{k}$ can be characterized as the unique degree k polynomial p(t) satisfying p(0) = p(1) = ... = p(k − 1) = 0 and p(k) = 1.

Its coefficients are expressible in terms of Stirling numbers of the first kind, by definition of the latter:

$$\binom{t}{k} = \sum_{i=0}^{k} s(k,i)\,\frac{t^i}{k!}.$$


The derivative of $\binom{t}{k}$ can be calculated by logarithmic differentiation:

$$\frac{d}{dt}\binom{t}{k} = \binom{t}{k} \sum_{i=0}^{k-1} \frac{1}{t-i}.$$

Binomial coefficients as a basis for the space of polynomials

Over any field containing Q, each polynomial p(t) of degree at most d is uniquely expressible as a linear combination

$$p(t) = \sum_{k=0}^{d} a_k \binom{t}{k}.$$

The coefficient $a_k$ is the kth difference of the sequence p(0), p(1), …, p(k). Explicitly,[6]

$$a_k = \sum_{i=0}^{k} (-1)^{k-i} \binom{k}{i} p(i). \qquad (3.5)$$
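A minimal Python sketch of equation (3.5): recovering the coefficients $a_k$ in the binomial-coefficient basis from the values p(0), p(1), ..., p(d) by finite differences (function name illustrative).

from math import comb

def binomial_basis_coeffs(values):
    # values[i] = p(i); returns a_k = sum_i (-1)^(k-i) C(k, i) p(i)
    d = len(values) - 1
    return [sum((-1) ** (k - i) * comb(k, i) * values[i] for i in range(k + 1))
            for k in range(d + 1)]

# p(t) = t^2 has values 0, 1, 4 at t = 0, 1, 2, and indeed
# t^2 = 0*C(t,0) + 1*C(t,1) + 2*C(t,2):
print(binomial_basis_coeffs([0, 1, 4]))  # [0, 1, 2]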

    Integer-valued polynomials

Each polynomial $\binom{t}{k}$ is integer-valued: it takes integer values at integer inputs. (One way to prove this is by induction on k, using Pascal's identity.) Therefore any integer linear combination of binomial coefficient polynomials is integer-valued too. Conversely, (3.5) shows that any integer-valued polynomial is an integer linear combination of these binomial coefficient polynomials. More generally, for any subring R of a characteristic 0 field K, a polynomial in K[t] takes values in R at all integers if and only if it is an R-linear combination of binomial coefficient polynomials.

Example

The integer-valued polynomial 3t(3t + 1)/2 can be rewritten as

$$\frac{3t(3t+1)}{2} = 9\binom{t}{2} + 6\binom{t}{1}.$$

Identities involving binomial coefficients

The factorial formula facilitates relating nearby binomial coefficients. For instance, if k is a positive integer and n is arbitrary, then

$$\binom{n}{k} = \frac{n}{k}\binom{n-1}{k-1} \qquad (4)$$

and, with a little more work,

$$\binom{n-1}{k} - \binom{n-1}{k-1} = \frac{n-2k}{n}\binom{n}{k}.$$

Moreover, the following may be useful:

$$\binom{n}{k}\binom{k}{m} = \binom{n}{m}\binom{n-m}{k-m}.$$


Series involving binomial coefficients

The formula

$$\sum_{k=0}^{n} \binom{n}{k} = 2^n \qquad (5)$$

is obtained from (2) using x = 1. This is equivalent to saying that the elements in one row of Pascal's triangle always add up to two raised to an integer power. A combinatorial interpretation of this fact involving double counting is given by counting subsets of size 0, size 1, size 2, and so on up to size n of a set S of n elements. Since we count the number of subsets of size i for 0 ≤ i ≤ n, this sum must be equal to the number of subsets of S, which is known to be 2^n. That is, equation (5) is the statement that the power set of a finite set with n elements has size 2^n. More explicitly, consider a bit string with n digits. This bit string can be used to represent 2^n numbers. Now consider all of the bit strings with no ones in them. There is just one, or rather n choose 0. Next consider the number of bit strings with just a single one in them. There are n, or rather n choose 1. Continuing this way we can see that the equation above holds.

The formulas

$$\sum_{k=0}^{n} k \binom{n}{k} = n\,2^{n-1} \qquad (6a)$$

and

$$\sum_{k=0}^{n} k^2 \binom{n}{k} = (n + n^2)\,2^{n-2} \qquad (6b)$$

follow from (2) after differentiating with respect to x (twice in the latter) and then substituting x = 1.

The Chu–Vandermonde identity, which holds for any complex values m and n and any non-negative integer k, is

$$\sum_{j=0}^{k} \binom{m}{j} \binom{n-m}{k-j} = \binom{n}{k} \qquad (7a)$$

and can be found by examination of the coefficient of $x^k$ in the expansion of $(1+x)^m (1+x)^{n-m} = (1+x)^n$ using equation (2). When m = 1, equation (7a) reduces to equation (3).

A similar looking formula, which applies for any integers j, k, and n satisfying 0 ≤ j ≤ k ≤ n, is

$$\sum_{m=j}^{n-k+j} \binom{m}{j} \binom{n-m}{k-j} = \binom{n+1}{k+1} \qquad (7b)$$

and can be found by examination of the coefficient of $x^{n-k}$ in the expansion of $\frac{1}{(1-x)^{j+1}} \cdot \frac{1}{(1-x)^{k-j+1}} = \frac{1}{(1-x)^{k+2}}$ using

$$\frac{1}{(1-x)^{j+1}} = \sum_{m \ge j} \binom{m}{j} x^{m-j}.$$

When j = k, equation (7b) gives

$$\binom{n+1}{k+1} = \sum_{m=k}^{n} \binom{m}{k}.$$

From expansion (7a) using n = 2m, k = m, and (1), one finds

$$\sum_{j=0}^{m} \binom{m}{j}^2 = \binom{2m}{m}. \qquad (8)$$

Let F(n) denote the n-th Fibonacci number. We obtain a formula about the diagonals of Pascal's triangle:

$$\sum_{k=0}^{\lfloor n/2 \rfloor} \binom{n-k}{k} = F(n+1). \qquad (9)$$

This can be proved by induction using (3) or by Zeckendorf's representation (just note that the lhs gives the number of subsets of {F(2), ..., F(n)} without consecutive members, which also form all the numbers below F(n+1)). A combinatorial proof is given below.

Also using (3) and induction, one can show that

$$\sum_{j=0}^{k} \binom{n+j}{j} = \binom{n+k+1}{k}. \qquad (10)$$

Although there is no closed formula for the partial sum

$$\sum_{j=0}^{k} \binom{n}{j}$$

(unless one resorts to hypergeometric functions), one can again use (3) and induction to show that for k = 0, ..., n − 1,

$$\sum_{j=0}^{k} (-1)^j \binom{n}{j} = (-1)^k \binom{n-1}{k}, \qquad (11)$$

as well as

$$\sum_{j=0}^{n} (-1)^j \binom{n}{j} = 0 \qquad (12)$$

[except in the trivial case where n = 0, where the result is 1 instead], which is itself a special case of the result from the theory of finite differences that for any polynomial P(x) of degree less than n,[7]

$$\sum_{j=0}^{n} (-1)^j \binom{n}{j} P(j) = 0. \qquad (13a)$$

Differentiating (2) k times and setting x = −1 yields this for $P(x) = x(x-1)\cdots(x-k+1)$ when 0 ≤ k < n, and the general case follows by taking linear combinations of these.

When P(x) is of degree less than or equal to n,

$$\sum_{j=0}^{n} (-1)^j \binom{n}{j} P(j) = (-1)^n n!\, a_n \qquad (13b)$$

where $a_n$ is the coefficient of degree n in P(x).

More generally, (13b) extends to

$$\sum_{j=0}^{n} (-1)^j \binom{n}{j} P(m + dj) = (-1)^n d^n n!\, a_n \qquad (13c)$$

where m and d are complex numbers. This follows immediately by applying (13b) to the polynomial Q(x) := P(m + dx) instead of P(x), and observing that Q(x) still has degree less than or equal to n, and that its coefficient of degree n is $d^n a_n$.

The infinite series


$$\sum_{j=k}^{\infty} \frac{1}{\binom{j}{k}} = \frac{k}{k-1} \qquad (14)$$

is convergent for k ≥ 2. This formula is used in the analysis of the German tank problem. It is equivalent to the formula for the finite sum

$$\sum_{j=m}^{M} \frac{1}{\binom{j}{k}} = \frac{k}{k-1}\left(\frac{1}{\binom{m-1}{k-1}} - \frac{1}{\binom{M}{k-1}}\right),$$

which is proved for M > m by induction on M.

Using (8) one can derive

$$\sum_{j=0}^{n} j \binom{n}{j}^2 = n \binom{2n-1}{n-1} \qquad (15)$$

and

$$\sum_{j=0}^{n} j^2 \binom{n}{j}^2 = n^2 \binom{2n-2}{n-1}. \qquad (16)$$

Series multisection gives the following identity for the sum of binomial coefficients taken with a step s and offset t, as a closed-form sum of s terms:

$$\binom{n}{t} + \binom{n}{t+s} + \binom{n}{t+2s} + \cdots = \frac{1}{s} \sum_{j=0}^{s-1} \left(2\cos\frac{\pi j}{s}\right)^{n} \cos\frac{\pi(n-2t)j}{s}.$$

Identities with combinatorial proofs

Many identities involving binomial coefficients can be proved by combinatorial means. For example, the following identity for nonnegative integers n ≥ q (which reduces to (6a) when q = 1):

$$\sum_{k=q}^{n} \binom{n}{k} \binom{k}{q} = 2^{n-q} \binom{n}{q} \qquad (16b)$$

can be given a double counting proof as follows. The left side counts the number of ways of selecting a subset of [n] = {1, 2, ..., n} with at least q elements, and marking q elements among those selected. The right side counts the same parameter, because there are $\binom{n}{q}$ ways of choosing a set of q marks and they occur in all subsets that additionally contain some subset of the remaining elements, of which there are $2^{n-q}$.

In the Pascal's rule

$$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k},$$

both sides count the number of k-element subsets of [n], with the right hand side first grouping them into those that contain element n and those that do not.

The identity (8) also has a combinatorial proof. The identity reads

$$\sum_{k=0}^{n} \binom{n}{k}^2 = \binom{2n}{n}.$$

Suppose you have $2n$ empty squares arranged in a row and you want to mark (select) n of them. There are $\binom{2n}{n}$ ways to do this. On the other hand, you may select your n squares by selecting k squares from among the first n and $n-k$ squares from the remaining n squares; any k from 0 to n will work. This gives

$$\sum_{k=0}^{n} \binom{n}{k} \binom{n}{n-k} = \binom{2n}{n}.$$


Now apply (1) to get the result.

The identity (9),

$$\sum_{k=0}^{\lfloor n/2 \rfloor} \binom{n-k}{k} = F(n+1),$$

has the following combinatorial proof. The number $\binom{n-k}{k}$ denotes the number of paths in a two-dimensional lattice from $(0,0)$ to $(n-k,\,k)$ using steps $(1,0)$ and $(1,1)$. This is easy to see: there are $n-k$ steps in total and one may choose which k of them are the $(1,1)$ steps. Now, replace each $(1,1)$ step by a $(2,0)$ step; note that there are exactly k of them. Then one arrives at the point $(n,0)$ using steps $(1,0)$ and $(2,0)$. Doing this for all k between 0 and $\lfloor n/2 \rfloor$ gives all paths from $(0,0)$ to $(n,0)$ using steps $(1,0)$ and $(2,0)$. Clearly, there are exactly $F(n+1)$ such paths.

Sum of coefficients row

The number of k-combinations for all k, $\sum_{0 \le k \le n} \binom{n}{k} = 2^n$, is the sum of the nth row (counting from 0) of the binomial coefficients. These combinations are enumerated by the 1 digits of the set of base 2 numbers counting from 0 to $2^n - 1$, where each digit position is an item from the set of n.


Recommended