
Chapter 2: Basics from Probability Theory and Statistics

2.1 Probability Theory
Events, Probabilities, Random Variables, Distributions, Moments

Generating Functions, Deviation Bounds, Limit Theorems

Basics from Information Theory

2.2 Statistical Inference: Sampling and Estimation
Moment Estimation, Confidence Intervals

Parameter Estimation, Maximum Likelihood, EM Iteration

2.3 Statistical Inference: Hypothesis Testing and Regression
Statistical Tests, p-Values, Chi-Square Test

Linear and Logistic Regression

mostly following L. Wasserman, with additions from other sources


2.2 Statistical Inference: Sampling and Estimation

A statistical model is a set of distributions (or regression functions), e.g., all unimodal, smooth distributions. A parametric model is a set that is completely described by a finite number of parameters (e.g., the family of Normal distributions).

Statistical inference: given a sample X1, ..., Xn, how do we infer the distribution or its parameters within a given model?

For multivariate models with one specific "outcome" (response) variable Y, this is called prediction or regression; for a discrete outcome variable it is also called classification. r(x) = E[Y | X=x] is called the regression function.


Statistical Estimators

A point estimator for a parameter θ of a probability distribution is a random variable X derived from a random sample X1, ..., Xn. Examples:

Sample mean: $\bar{X} := \frac{1}{n}\sum_{i=1}^{n} X_i$

Sample variance: $S^2 := \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$

An estimator T for parameter θ is unbiased if E[T] = θ; otherwise the estimator has bias E[T] − θ.

An estimator T on a sample of size n is consistent if

$\lim_{n \to \infty} P[\,|T - \theta| \le \epsilon\,] = 1 \quad\text{for each } \epsilon > 0$

Sample mean and sample variance are unbiased, consistent estimators with minimal variance.
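As a quick illustration (not part of the original slides), the following Python sketch computes both estimators for a simulated sample; numpy's `ddof=1` option gives the 1/(n−1) normalization used above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # sample X1, ..., Xn

sample_mean = x.mean()                  # (1/n) * sum(X_i)
sample_var  = x.var(ddof=1)             # (1/(n-1)) * sum((X_i - mean)^2), unbiased

print(sample_mean, sample_var)          # close to the true values 5.0 and 4.0
```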


Estimator Error

Let $\hat{\theta}_n = T(X_1, ..., X_n)$ be an estimator for parameter θ over a sample X1, ..., Xn.

The distribution of $\hat{\theta}_n$ is called the sampling distribution.

The standard error for $\hat{\theta}_n$ is: $se(\hat{\theta}_n) = \sqrt{Var[\hat{\theta}_n]}$

The mean squared error (MSE) for $\hat{\theta}_n$ is:

$MSE(\hat{\theta}_n) = E[(\hat{\theta}_n - \theta)^2] = bias^2(\hat{\theta}_n) + Var[\hat{\theta}_n]$

If bias → 0 and se → 0, then the estimator is consistent.

The estimator $\hat{\theta}_n$ is asymptotically Normal if $(\hat{\theta}_n - \theta)/se(\hat{\theta}_n)$ converges in distribution to the standard Normal N(0,1).
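To make the bias/variance decomposition of the MSE concrete, here is a small simulation sketch (illustrative only, with an assumed true Normal model) that estimates bias, variance, and MSE of the biased variance estimator (1/n)·Σ(Xi−X̄)² over many repeated samples.

```python
import numpy as np

rng = np.random.default_rng(1)
true_var, n, runs = 4.0, 20, 20000

# biased ML variance estimator (1/n normalization) on each simulated sample
estimates = np.array([rng.normal(0.0, 2.0, n).var(ddof=0) for _ in range(runs)])

bias = estimates.mean() - true_var
var  = estimates.var()
mse  = ((estimates - true_var) ** 2).mean()
print(bias, var, mse, bias**2 + var)    # MSE is approximately bias^2 + variance
```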


Nonparametric Estimation

The empirical distribution function $\hat{F}_n$ is the cdf that puts probability mass 1/n at each data point Xi:

$\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)$

A statistical functional T(F) is any function of F, e.g., mean, variance, skewness, median, quantiles, correlation.

The plug-in estimator of θ = T(F) is: $\hat{\theta}_n = T(\hat{F}_n)$

Instead of the full empirical distribution, often compact data synopses may be used, such as histograms where X1, ..., Xn are grouped into m cells (buckets) c1, ..., cm with bucket boundaries lb(ci) and ub(ci) such that lb(c1) = −∞, ub(cm) = +∞, ub(ci) = lb(ci+1) for 1 ≤ i < m, and

freq(ci) = $\frac{1}{n}\sum_{j=1}^{n} I(lb(c_i) \le X_j < ub(c_i))$

Histograms provide a (discontinuous) density estimator.
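A minimal sketch of both ideas (not from the slides): an empirical cdf as a plug-in estimator, and a histogram synopsis with relative bucket frequencies.

```python
import numpy as np

def ecdf(sample, x):
    """Empirical cdf F_n(x) = (1/n) * #{i : X_i <= x}."""
    sample = np.sort(sample)
    return np.searchsorted(sample, x, side="right") / len(sample)

rng = np.random.default_rng(2)
data = rng.exponential(scale=1.0, size=500)

print(ecdf(data, 1.0))                      # plug-in estimate of P[X <= 1]

# histogram synopsis: buckets with boundaries lb/ub and relative frequencies
counts, edges = np.histogram(data, bins=10)
rel_freqs = counts / len(data)              # freq(c_i) as defined above
print(list(zip(edges[:-1], edges[1:], rel_freqs)))
```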


Parametric Inference: Method of Moments

Compute sample moments: $\hat{\alpha}_j := \frac{1}{n}\sum_{i=1}^{n} X_i^{\,j}$ for the j-th moment $\alpha_j$.

Estimate parameter θ by the method-of-moments estimator $\hat{\theta}_n$ such that

$\alpha_1(F(\hat{\theta}_n)) = \hat{\alpha}_1$ and $\alpha_2(F(\hat{\theta}_n)) = \hat{\alpha}_2$ and $\alpha_3(F(\hat{\theta}_n)) = \hat{\alpha}_3$ and ... (for some number of moments)

Method-of-moments estimators are usually consistent and asymptotically Normal, but may be biased.
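As an illustrative sketch (assumed example, not from the slides): matching the first two moments of a Normal model to the sample moments gives the method-of-moments estimates for μ and σ².

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=3.0, scale=1.5, size=2000)

# sample moments: alpha_j = (1/n) * sum(X_i^j)
a1 = np.mean(x)            # first sample moment
a2 = np.mean(x ** 2)       # second sample moment

# solve alpha_1(F(theta)) = a1 and alpha_2(F(theta)) = a2 for theta = (mu, sigma^2)
mu_hat = a1
var_hat = a2 - a1 ** 2     # since E[X^2] = mu^2 + sigma^2
print(mu_hat, var_hat)     # roughly 3.0 and 2.25
```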


Parametric Inference: Maximum Likelihood Estimators (MLE)

Estimate the parameter θ of a postulated distribution f(θ,x) such that the probability that the data of the sample are generated by this distribution is maximized.

Maximum likelihood estimation: maximize L(x1,...,xn, θ) = P[x1, ..., xn originate from f(θ,x)]

(often written as L(θ | x1,...,xn) = f(x1,...,xn | θ))

or maximize log L; if analytically intractable, use numerical iteration methods.


MLE Properties

Maximum Likelihood Estimators are consistent, asymptotically Normal, and asymptotically optimal in the following sense:

Consider two estimators U and T which are asymptotically Normal. Let $\sigma_u^2$ and $\sigma_t^2$ denote the variances of the two Normal distributions to which U and T converge in distribution. The asymptotic relative efficiency of U to T is ARE(U,T) = $\sigma_t^2 / \sigma_u^2$.

Theorem: For an MLE $\hat{\theta}_n$ and any other estimator $\tilde{\theta}_n$, the following inequality holds:

$ARE(\tilde{\theta}_n, \hat{\theta}_n) \le 1$


Simple Example for Maximum Likelihood Estimator

given:
• coin with Bernoulli distribution with unknown parameter p for head, 1−p for tail
• sample (data): k times head in n coin tosses
needed: maximum likelihood estimate of p

Let L(k, n, p) = P[sample is generated from distribution with parameter p] $= \binom{n}{k} p^k (1-p)^{n-k}$

Maximize the log-likelihood function log L(k, n, p):

$\log L = \log \binom{n}{k} + k \log p + (n-k) \log(1-p)$

$\frac{\partial \log L}{\partial p} = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\;\Rightarrow\;\; \hat{p} = \frac{k}{n}$
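A small sketch (not from the slides) that checks the closed-form result p̂ = k/n against a brute-force grid search over the log-likelihood, illustrating the "numerical iteration" route mentioned above; the numbers 63 heads in 100 tosses are purely illustrative.

```python
import numpy as np

n, k = 100, 63                      # 63 heads in 100 tosses (illustrative)

p_hat = k / n                       # closed-form MLE derived above

# numerical check: evaluate log L on a grid and take the argmax
p_grid = np.linspace(0.001, 0.999, 9999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)   # binomial coefficient is constant in p
print(p_hat, p_grid[np.argmax(log_lik)])                       # both ~0.63
```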


Advanced Example for Maximum Likelihood Estimator

given:
• Poisson distribution with parameter (expectation) λ
• sample (data): numbers x1, ..., xn ∈ ℕ0
needed: maximum likelihood estimate of λ

Let r be the largest among these numbers, and let f0, ..., fr be the absolute frequencies of the numbers 0, ..., r.

$L(x_1,...,x_n,\lambda) = \prod_{i=0}^{r}\left(e^{-\lambda}\,\frac{\lambda^{i}}{i!}\right)^{f_i}$

$\frac{\partial \ln L}{\partial \lambda} = -n + \frac{1}{\lambda}\sum_{i=0}^{r} i\, f_i = 0 \;\;\Rightarrow\;\; \hat{\lambda} = \frac{1}{n}\sum_{i=0}^{r} i\, f_i$
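A minimal sketch (illustrative only): computing λ̂ from the absolute frequencies and confirming that it equals the sample mean.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.poisson(lam=3.2, size=1000)          # sample x1, ..., xn from Poisson(lambda)

r = x.max()
f = np.bincount(x, minlength=r + 1)          # absolute frequencies f_0, ..., f_r

lam_hat = (np.arange(r + 1) * f).sum() / len(x)   # (1/n) * sum(i * f_i)
print(lam_hat, x.mean())                          # identical: the MLE is the sample mean
```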


Sophisticated Example for Maximum Likelihood Estimator

given:
• discrete uniform distribution over [1, θ], θ ∈ ℕ, with density f(x) = 1/θ
• sample (data): numbers x1, ..., xn ∈ ℕ0

The MLE for θ is max{x1, ..., xn} (see Wasserman p. 124).


MLE for Parameters of Normal Distributions

$L(x_1,...,x_n,\mu,\sigma^2) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0$

$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2 = 0$

$\Rightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2$
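A short sketch (not from the slides) of the two closed-form estimates; note the 1/n normalization of σ̂², in contrast to the unbiased sample variance with 1/(n−1).

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=3.0, size=5000)

mu_hat = x.mean()                        # (1/n) * sum(x_i)
sigma2_hat = ((x - mu_hat) ** 2).mean()  # (1/n) * sum((x_i - mu_hat)^2), i.e. 1/n not 1/(n-1)
print(mu_hat, sigma2_hat)                # roughly 10.0 and 9.0
```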


Bayesian Viewpoint of Parameter Estimation

• assume a prior distribution f(θ) of parameter θ
• choose a statistical model (generative model) f(x | θ) that reflects our beliefs about RV X
• given RVs X1, ..., Xn for observed data, the posterior distribution is f(θ | x1, ..., xn)

For X1=x1, ..., Xn=xn the likelihood is $L(x_1,...,x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$, which implies

$f(\theta \mid x_1,...,x_n) = \frac{\prod_{i=1}^{n} f(x_i \mid \theta)\; f(\theta)}{\int \prod_{i=1}^{n} f(x_i \mid \theta')\; f(\theta')\; d\theta'} \;\sim\; L(x_1,...,x_n \mid \theta)\; f(\theta)$

(posterior is proportional to likelihood times prior)

MAP estimator (maximum a posteriori): compute the θ that maximizes f(θ | x1, …, xn) given a prior for θ
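The slides do not fix a particular prior; as an illustration only, assume a Beta(a, b) prior for the head probability p of the coin example. The posterior is then Beta(a+k, b+n−k), and its mode gives the MAP estimate in closed form (a sketch under these assumptions, not the slides' method).

```python
# Illustrative MAP estimate for a Bernoulli parameter p with an assumed Beta(a, b) prior.
# posterior ~ likelihood * prior ~ p^(k+a-1) * (1-p)^(n-k+b-1), i.e. Beta(a+k, b+n-k)
def map_bernoulli(k, n, a=2.0, b=2.0):
    # mode of Beta(a+k, b+n-k); defined for a+k > 1 and b+n-k > 1
    return (a + k - 1) / (a + b + n - 2)

print(map_bernoulli(k=63, n=100))   # pulled slightly toward 0.5 compared to the MLE 0.63
```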


Analytically Intractable MLE for Parameters of a Multivariate Normal Mixture

Consider samples from a mixture of multivariate Normal distributions with the density (e.g., height and weight of males and females):

$f(x, \pi_1,...,\pi_k, \mu_1,...,\mu_k, \Sigma_1,...,\Sigma_k) = \sum_{j=1}^{k} \pi_j\, \frac{1}{\sqrt{(2\pi)^m\, |\Sigma_j|}}\; e^{-\frac{1}{2}(x-\mu_j)^T \Sigma_j^{-1} (x-\mu_j)} = \sum_{j=1}^{k} \pi_j\, n(x, \mu_j, \Sigma_j)$

with expectation values μj and invertible, positive definite, symmetric m×m covariance matrices Σj.

Maximize the log-likelihood function:

$\log L(x_1,...,x_n, \theta) := \log \prod_{i=1}^{n} P[x_i \mid \theta] = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j\, n(x_i, \mu_j, \Sigma_j)$


Expectation-Maximization Method (EM)

Key idea:

When L(θ, X1, ..., Xn) (where the Xi and θ are possibly multivariate) is analytically intractable, then

• introduce latent (hidden, invisible, missing) random variable(s) Z such that

• the joint distribution J(X1, ..., Xn, Z, θ) of the "complete" data is tractable (often with Z actually being Z1, ..., Zn)

• derive the incomplete-data likelihood L(θ, X1, ..., Xn) by integrating (marginalizing) over J:

$\hat{\theta} = \arg\max_{\theta} \sum_{z} J[\theta, X_1,...,X_n, Z \mid Z=z]\; P[Z=z]$


EM Procedure

Initialization: choose a start estimate θ(0).

Iterate (t = 0, 1, …) until convergence:

E step (expectation): estimate the posterior probability of Z, P[Z | X1, …, Xn, θ(t)], assuming θ were known and equal to the previous estimate θ(t), and compute E_{Z | X1, …, Xn, θ(t)}[log J(X1, …, Xn, Z | θ)] by integrating over values for Z.

M step (maximization, MLE step): estimate θ(t+1) by maximizing E_{Z | X1, …, Xn, θ(t)}[log J(X1, …, Xn, Z | θ)].

Convergence is guaranteed (because the E step computes a lower bound of the true L function, and the M step yields monotonically non-decreasing likelihood), but may result in a local maximum of the log-likelihood function.


EM Example for Multivariate Normal Mixture

Expectation step (E step):

$h_{ij} := P[Z_{ij}=1 \mid x_i, \theta^{(t)}] = \frac{\pi_j^{(t)}\, n(x_i, \mu_j^{(t)}, \Sigma_j^{(t)})}{\sum_{l=1}^{k} \pi_l^{(t)}\, n(x_i, \mu_l^{(t)}, \Sigma_l^{(t)})}$

where Zij = 1 if the i-th data point was generated by the j-th component, 0 otherwise.

Maximization step (M step), yielding θ(t+1):

$\mu_j := \frac{\sum_{i=1}^{n} h_{ij}\, x_i}{\sum_{i=1}^{n} h_{ij}} \qquad \Sigma_j := \frac{\sum_{i=1}^{n} h_{ij}\,(x_i-\mu_j)(x_i-\mu_j)^T}{\sum_{i=1}^{n} h_{ij}} \qquad \pi_j := \frac{\sum_{i=1}^{n} h_{ij}}{n}$
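A compact, illustrative implementation of these E and M steps (a sketch under the assumptions above, not production code; initialization choices such as random means and the small diagonal regularization are assumptions for numerical stability).

```python
import numpy as np

def em_gmm(x, k, iters=50, seed=0):
    """EM for a mixture of k multivariate Normals; x has shape (n, m)."""
    rng = np.random.default_rng(seed)
    n, m = x.shape
    pi = np.full(k, 1.0 / k)                       # mixture weights
    mu = x[rng.choice(n, k, replace=False)]        # initial means: random data points
    cov = np.array([np.cov(x.T) + 1e-6 * np.eye(m) for _ in range(k)])

    for _ in range(iters):
        # E step: responsibilities h_ij = P[Z_ij = 1 | x_i, theta^(t)]
        h = np.empty((n, k))
        for j in range(k):
            diff = x - mu[j]
            inv = np.linalg.inv(cov[j])
            norm = 1.0 / np.sqrt(((2 * np.pi) ** m) * np.linalg.det(cov[j]))
            h[:, j] = pi[j] * norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))
        h /= h.sum(axis=1, keepdims=True)

        # M step: re-estimate mu_j, Sigma_j, pi_j from the responsibilities
        nk = h.sum(axis=0)
        mu = (h.T @ x) / nk[:, None]
        for j in range(k):
            diff = x - mu[j]
            cov[j] = (h[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(m)
        pi = nk / n
    return pi, mu, cov
```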


Confidence Intervals

An estimator T for an interval for parameter θ such that

$P[T - a \le \theta \le T + a] \ge 1 - \alpha$

[T−a, T+a] is the confidence interval and 1−α is the confidence level.

For the distribution of a random variable X, a value $x_\gamma$ (0 < γ < 1) with $P[X \le x_\gamma] \ge \gamma$ and $P[X \ge x_\gamma] \ge 1 - \gamma$ is called a γ-quantile; the 0.5-quantile is called the median. For the Normal distribution N(0,1) the γ-quantile is denoted $z_\gamma$.


Confidence Intervals for Expectations (1)

Let x1, ..., xn be a sample from a distribution with unknown expectation μ and known variance σ². For sufficiently large n the sample mean $\bar{X}$ is N(μ, σ²/n) distributed and $\frac{(\bar{X}-\mu)\sqrt{n}}{\sigma}$ is N(0,1) distributed:

$P\left[-z \le \frac{(\bar{X}-\mu)\sqrt{n}}{\sigma} \le z\right] = \Phi(z) - \Phi(-z) = \Phi(z) - (1 - \Phi(z)) = 2\Phi(z) - 1$

$P\left[\bar{X} - \frac{z\,\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{z\,\sigma}{\sqrt{n}}\right] = 2\Phi(z) - 1$

For a required confidence interval $[\bar{X}-a,\ \bar{X}+a]$ or confidence level 1−α set either

$z := z_{1-\alpha/2}$ (the (1−α/2)-quantile of N(0,1)) and then $a := \frac{z\,\sigma}{\sqrt{n}}$,

or $z := \frac{a\sqrt{n}}{\sigma}$ and then look up Φ(z) to obtain the confidence level 2Φ(z) − 1. This yields

$P\left[\bar{X} - \frac{\sigma\, z_{1-\alpha/2}}{\sqrt{n}} \le \mu \le \bar{X} + \frac{\sigma\, z_{1-\alpha/2}}{\sqrt{n}}\right] = 1-\alpha$
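A small sketch (illustrative, not from the slides) of the known-variance case, using scipy's standard-Normal quantile function.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
sigma = 2.0                                  # known standard deviation
x = rng.normal(loc=7.0, scale=sigma, size=200)

alpha = 0.05
z = norm.ppf(1 - alpha / 2)                  # (1 - alpha/2)-quantile of N(0,1), ~1.96
a = z * sigma / np.sqrt(len(x))
print(x.mean() - a, x.mean() + a)            # 95% confidence interval for mu
```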


Confidence Intervals for Expectations (2)

Let x1, ..., xn be a sample from a distribution with unknown expectation μ and unknown variance σ², and sample variance S². For sufficiently large n the random variable

$T := \frac{(\bar{X}-\mu)\sqrt{n}}{S}$

has a t distribution (Student distribution) with n−1 degrees of freedom:

$f_{T,n}(t) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\;\Gamma\!\left(\frac{n}{2}\right)} \left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}}$

with the Gamma function

$\Gamma(x) = \int_{0}^{\infty} t^{x-1}\, e^{-t}\, dt \quad\text{for } x > 0$

(with the properties Γ(1) = 1 and Γ(x+1) = x·Γ(x)). This gives the confidence interval

$P\left[\bar{X} - \frac{S\, t_{n-1,\,1-\alpha/2}}{\sqrt{n}} \le \mu \le \bar{X} + \frac{S\, t_{n-1,\,1-\alpha/2}}{\sqrt{n}}\right] = 1-\alpha$
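The same sketch for the unknown-variance case (illustrative only), replacing the Normal quantile with the t-quantile with n−1 degrees of freedom.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)
x = rng.normal(loc=7.0, scale=2.0, size=25)  # small sample, variance unknown

alpha = 0.05
s = x.std(ddof=1)                            # sample standard deviation S
q = t.ppf(1 - alpha / 2, df=len(x) - 1)      # t_{n-1, 1-alpha/2}
a = q * s / np.sqrt(len(x))
print(x.mean() - a, x.mean() + a)            # 95% t-based confidence interval for mu
```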


Normal Distribution Table


Student's t Distribution Table


2.3 Statistical Inference: Hypothesis Testing and Regression

Hypothesis testing:
• aims to falsify some hypothesis by lack of statistical evidence
• design of a test RV and its (approximate / limit) distribution

Regression:
• aims to estimate the joint distribution of input and output RVs based on some model, usually by minimizing the quadratic error


Statistical Hypothesis Testing

A hypothesis test determines a probability 1−α (test level α, significance level) that a sample X1, ..., Xn from some unknown probability distribution has a certain property.

Examples:
1) The sample originates from a normal distribution.
2) Under the assumption of a normal distribution, the sample originates from a N(μ, σ²) distribution.
3) Two random variables are independent.
4) Two random variables are identically distributed.
5) The parameter λ of a Poisson distribution from which the sample stems has value 5.
6) The parameter p of a Bernoulli distribution from which the sample stems has value 0.5.

General form: null hypothesis H0 vs. alternative hypothesis H1;
needs a test variable X (derived from X1, ..., Xn, H0, H1) and a test region R, with X ∈ R for rejecting H0 and X ∉ R for retaining H0.

              Retain H0        Reject H0
H0 true       correct          type I error
H1 true       type II error    correct


Hypotheses and p-Values

A hypothesis of the form θ = θ0 is called a simple hypothesis.
A hypothesis of the form θ > θ0 or θ < θ0 is called a composite hypothesis.
A test of the form H0: θ = θ0 vs. H1: θ ≠ θ0 is called a two-sided test.
A test of the form H0: θ ≤ θ0 vs. H1: θ > θ0, or H0: θ ≥ θ0 vs. H1: θ < θ0, is called a one-sided test.

Suppose that for every level α ∈ (0,1) there is a test with rejection region Rα. Then the p-value is the smallest level at which we can reject H0:

$p\text{-value} = \inf\{\alpha \mid T(X_1,...,X_n) \in R_\alpha\}$

A small p-value means strong evidence against H0.


Hypothesis Testing Example

Null hypothesis for n coin tosses: the coin is fair, or has head probability p = p0; alternative hypothesis: p ≠ p0.

Test variable: X, the number of heads, is approximately N(p0 n, p0(1−p0) n) distributed (by the Central Limit Theorem), thus

$Z := \frac{X - p_0\, n}{\sqrt{p_0 (1-p_0)\, n}}$

is approximately N(0,1) distributed.

Rejection of the null hypothesis at test level α (e.g. 0.05) if $Z \le z_{\alpha/2}$ or $Z \ge z_{1-\alpha/2}$, i.e., $|Z| \ge z_{1-\alpha/2}$.
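A small sketch (with made-up numbers, not from the slides): 540 heads in 1000 tosses against H0: p = 0.5 at level α = 0.05.

```python
import numpy as np
from scipy.stats import norm

n, heads, p0, alpha = 1000, 540, 0.5, 0.05

z = (heads - p0 * n) / np.sqrt(n * p0 * (1 - p0))   # approximately N(0,1) under H0
z_crit = norm.ppf(1 - alpha / 2)

print(z, z_crit, abs(z) >= z_crit)   # |Z| ~ 2.53 > 1.96: reject "the coin is fair"
```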


Wald Test

For testing H0: θ = θ0 vs. H1: θ ≠ θ0, use the test variable

$W := \frac{\hat{\theta} - \theta_0}{se(\hat{\theta})}$

with sample estimate $\hat{\theta}$ and standard error $se(\hat{\theta}) = \sqrt{Var[\hat{\theta}]}$.

W converges in distribution to N(0,1); reject H0 at level α when $|W| > z_{1-\alpha/2}$.


Chi-Square Distribution

Let X1, ..., Xn be independent, N(0,1) distributed random variables. Then the random variable

$\chi_n^2 := X_1^2 + \dots + X_n^2$

is chi-square distributed with n degrees of freedom:

$f_{\chi_n^2}(x) = \frac{x^{\,n/2-1}\; e^{-x/2}}{\Gamma(n/2)\; 2^{n/2}} \quad\text{for } x > 0, \;\; 0 \text{ otherwise}$

Let n be a natural number, let X be N(0,1) distributed and Y be χ² distributed with n degrees of freedom. Then the random variable

$T_n := \frac{X\sqrt{n}}{\sqrt{Y}}$

is t distributed with n degrees of freedom.


Chi-Square Goodness-of-Fit Test

Given: n sample values X1, ..., Xn of random variable X, with frequencies H1, ..., Hk for k value classes v1, ..., vk (e.g., value intervals) of random variable X.

Null hypothesis: the values Xi are f distributed (e.g., uniformly distributed), where f has expectation μ and variance σ².

Approach: with the expected frequencies E(vi) := n · P[X is in class vi according to f], the test statistic

$Z_k := \sum_{i=1}^{k} \frac{(H_i - E(v_i))^2}{E(v_i)}$

is approximately χ² distributed with k−1 degrees of freedom.

Rejection of the null hypothesis at test level α (e.g. 0.05) if $Z_k > \chi^2_{k-1,\,1-\alpha}$.
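A small sketch (illustrative only): testing whether simulated die rolls are uniform over k = 6 classes.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
x = rng.integers(0, 6, size=600)                 # H0: die rolls are uniform over 6 classes

observed = np.bincount(x, minlength=6)           # H_1, ..., H_k (class counts)
expected = np.full(6, len(x) / 6)                # E(v_i) = n * P[X in class v_i under f]

z_k = ((observed - expected) ** 2 / expected).sum()
crit = chi2.ppf(0.95, df=6 - 1)                  # chi^2_{k-1, 1-alpha}
print(z_k, crit, z_k > crit)                     # reject H0 only if z_k exceeds the critical value
```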


Chi-Square Independence Test

Given: n samples of two random variables X, Y or, equivalently, a two-dimensional random variable, with (absolute) frequencies H11, ..., Hrc for r·c value classes, where X has r and Y has c distinct classes. (This is called a contingency table.)

Null hypothesis: X and Y are independent; then the expected frequencies of the value classes would be

$E_{ij} = \frac{R_i\, C_j}{n}$ with $R_i := \sum_{j=1}^{c} H_{ij}$ and $C_j := \sum_{i=1}^{r} H_{ij}$

Approach:

$Z := \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(H_{ij} - E_{ij})^2}{E_{ij}}$

is approximately χ² distributed with (r−1)(c−1) degrees of freedom.

Rejection of the null hypothesis at test level α (e.g. 0.05) if $Z > \chi^2_{(r-1)(c-1),\,1-\alpha}$.
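A minimal sketch (made-up contingency table, not from the slides) computing the expected counts under independence and the test statistic.

```python
import numpy as np
from scipy.stats import chi2

# contingency table H_ij with r = 2 classes of X (rows) and c = 3 classes of Y (columns)
H = np.array([[30., 40., 20.],
              [20., 50., 40.]])
n = H.sum()

R = H.sum(axis=1, keepdims=True)     # row sums R_i
C = H.sum(axis=0, keepdims=True)     # column sums C_j
E = R @ C / n                        # expected counts E_ij = R_i * C_j / n under independence

z = ((H - E) ** 2 / E).sum()
crit = chi2.ppf(0.95, df=(H.shape[0] - 1) * (H.shape[1] - 1))
print(z, crit, z > crit)
```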


Chi-Square Distribution Table



Linear Regression (often used for parameter fitting of models)

Estimate r(x) = E[Y | X1=x1 ... Xm=xm] using a linear model

$Y = r(x) + \epsilon = \beta_0 + \sum_{i=1}^{m} \beta_i x_i + \epsilon$ with error ε with E[ε] = 0.

Given n sample points (x1(i), ..., xm(i), y(i)), i = 1..n, the least-squares estimator (LSE) minimizes the quadratic error:

$E(\beta_0,...,\beta_m) := \sum_{i=1}^{n}\left(\sum_{k=0}^{m} \beta_k\, x_k^{(i)} - y^{(i)}\right)^2$ (with x0(i) = 1)

Solve the linear equation system:

$\frac{\partial E}{\partial \beta_k} = 0$ for k = 0, ..., m

equivalent to the MLE

$\beta = (X^T X)^{-1} X^T Y$

with Y = (y(1) ... y(n))^T and

$X = \begin{pmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \dots & x_m^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \dots & x_m^{(2)} \\ \vdots & & & & \vdots \\ 1 & x_1^{(n)} & x_2^{(n)} & \dots & x_m^{(n)} \end{pmatrix}$
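A minimal sketch (synthetic data, not from the slides): building the design matrix with a leading column of ones and solving the normal equations (X^T X) β = X^T Y.

```python
import numpy as np

rng = np.random.default_rng(9)
n, m = 200, 2
X_raw = rng.normal(size=(n, m))
y = 1.0 + 2.0 * X_raw[:, 0] - 0.5 * X_raw[:, 1] + rng.normal(scale=0.1, size=n)

X = np.hstack([np.ones((n, 1)), X_raw])          # design matrix with x_0^(i) = 1
beta = np.linalg.solve(X.T @ X, X.T @ y)         # least-squares solution (X^T X)^{-1} X^T Y
print(beta)                                       # roughly [1.0, 2.0, -0.5]
```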


Logistic Regression

Estimate r(x) = E[Y | X=x] using a logistic model

$Y = r(x) + \epsilon = \frac{e^{\beta_0 + \sum_{i=1}^{m}\beta_i x_i}}{1 + e^{\beta_0 + \sum_{i=1}^{m}\beta_i x_i}} + \epsilon$ with error ε with E[ε] = 0

The MLE for the βi values is computed with numerical methods.
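The slides do not prescribe a particular numerical method; as an illustration, plain gradient ascent on the Bernoulli log-likelihood is one simple option (a sketch, with made-up data and an assumed learning rate).

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=5000):
    """Maximize the Bernoulli log-likelihood by gradient ascent (simple numerical method)."""
    X = np.hstack([np.ones((len(X), 1)), X])      # prepend intercept column for beta_0
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # r(x) = e^s / (1 + e^s) with s = beta^T x
        beta += lr * X.T @ (y - p) / len(y)       # gradient of the average log-likelihood
    return beta

rng = np.random.default_rng(10)
X = rng.normal(size=(500, 1))
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * X[:, 0])))).astype(float)
print(fit_logistic(X, y))                          # roughly [0.5, 2.0]
```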


Additional Literature for Chapter 2

• Manning / Schütze: Chapters 2 and 6
• Duda / Hart / Stork: Appendix A
• R. Nelson: Probability, Stochastic Processes, and Queueing Theory, Springer, 1995
• M. Mitzenmacher, E. Upfal: Probability and Computing, Cambridge University Press, 2005
• M. Greiner, G. Tinhofer: Stochastik für Studienanfänger der Informatik, Carl Hanser Verlag, 1996
• G. Hübner: Stochastik, Vieweg, 1996
• Sean Borman: The Expectation Maximization Algorithm: A Short Tutorial, http://www.seanborman.com/publications/EM_algorithm.pdf
• Jason Rennie: A Short Tutorial on Using Expectation-Maximization with Mixture Models, http://people.csail.mit.edu/jrennie/writing/mixtureEM.pdf

