Probability and Statistics
Cookbook
Copyright © Matthias Vallentin, [email protected]
18th May, 2014
This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you sent me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.
Contents
1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
1 Distribution Overview
1.1 Discrete Distributions
For each distribution we list the notation¹, CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

Uniform Unif{a, ..., b}
  F_X(x) = 0 for x < a; (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b; 1 for x > b
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2    V[X] = ((b − a + 1)² − 1)/12
  M_X(s) = (e^{as} − e^{(b+1)s})/(s(b − a))

Bernoulli Bern(p)
  f_X(x) = p^x (1 − p)^{1−x},  x ∈ {0, 1}
  E[X] = p    V[X] = p(1 − p)    M_X(s) = 1 − p + pe^s

Binomial Bin(n, p)
  F_X(x) = I_{1−p}(n − x, x + 1)
  f_X(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np    V[X] = np(1 − p)    M_X(s) = (1 − p + pe^s)^n

Multinomial Mult(n, p)
  f_X(x) = n!/(x_1! ··· x_k!) p_1^{x_1} ··· p_k^{x_k},  Σ_{i=1}^k x_i = n
  E[X_i] = np_i    V[X_i] = np_i(1 − p_i)    M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric Hyp(N, m, n)
  F_X(x) ≈ Φ((x − np)/√(np(1 − p)))
  f_X(x) = C(m, x) C(N − m, n − x) / C(N, n)
  E[X] = nm/N    V[X] = nm(N − n)(N − m)/(N²(N − 1))

Negative Binomial NBin(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p    V[X] = r(1 − p)/p²    M_X(s) = (p/(1 − (1 − p)e^s))^r

Geometric Geo(p)
  F_X(x) = 1 − (1 − p)^x,  x ∈ N⁺
  f_X(x) = p(1 − p)^{x−1},  x ∈ N⁺
  E[X] = 1/p    V[X] = (1 − p)/p²    M_X(s) = pe^s/(1 − (1 − p)e^s)

Poisson Po(λ)
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i/i!
  f_X(x) = λ^x e^{−λ}/x!
  E[X] = λ    V[X] = λ    M_X(s) = e^{λ(e^s − 1)}
[Figure: PMF plots of the discrete Uniform; Binomial (n=40, p=0.3; n=30, p=0.6; n=25, p=0.9); Geometric (p=0.2, 0.5, 0.8); and Poisson (λ=1, 4, 10) distributions.]
¹ We use the notation Γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and I_x to refer to the Beta functions (see §22.2).
1.2 Continuous Distributions
For each distribution we list the notation, CDF F_X(x), PDF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

Uniform Unif(a, b)
  F_X(x) = 0 for x < a; (x − a)/(b − a) for a < x < b; 1 for x > b
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2    V[X] = (b − a)²/12    M_X(s) = (e^{sb} − e^{sa})/(s(b − a))

Normal N(μ, σ²)
  F_X(x) = Φ(x) = ∫_{−∞}^x φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp{−(x − μ)²/(2σ²)}
  E[X] = μ    V[X] = σ²    M_X(s) = exp{μs + σ²s²/2}

Log-Normal ln N(μ, σ²)
  F_X(x) = 1/2 + (1/2) erf[(ln x − μ)/√(2σ²)]
  f_X(x) = (1/(x√(2πσ²))) exp{−(ln x − μ)²/(2σ²)}
  E[X] = e^{μ + σ²/2}    V[X] = (e^{σ²} − 1) e^{2μ + σ²}

Multivariate Normal MVN(μ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp{−(1/2)(x − μ)ᵀ Σ^{−1} (x − μ)}
  E[X] = μ    V[X] = Σ    M_X(s) = exp{μᵀs + (1/2) sᵀΣs}

Student's t  Student(ν)
  F_X(x) = I_x(ν/2, ν/2)
  f_X(x) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0    V[X] = ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2

Chi-square χ²_k
  F_X(x) = γ(k/2, x/2)/Γ(k/2)
  f_X(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2 − 1} e^{−x/2}
  E[X] = k    V[X] = 2k    M_X(s) = (1 − 2s)^{−k/2} for s < 1/2

F  F(d₁, d₂)
  F_X(x) = I_{d₁x/(d₁x + d₂)}(d₁/2, d₂/2)
  f_X(x) = √(((d₁x)^{d₁} d₂^{d₂})/(d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2))
  E[X] = d₂/(d₂ − 2)    V[X] = 2d₂²(d₁ + d₂ − 2)/(d₁(d₂ − 2)²(d₂ − 4))

Exponential Exp(β)
  F_X(x) = 1 − e^{−x/β}
  f_X(x) = (1/β) e^{−x/β}
  E[X] = β    V[X] = β²    M_X(s) = 1/(1 − βs) for s < 1/β

Gamma Gamma(α, β)
  F_X(x) = γ(α, x/β)/Γ(α)
  f_X(x) = (1/(Γ(α) β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ    V[X] = αβ²    M_X(s) = (1/(1 − βs))^α for s < 1/β

Inverse Gamma InvGamma(α, β)
  F_X(x) = Γ(α, β/x)/Γ(α)
  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) for α > 1    V[X] = β²/((α − 1)²(α − 2)) for α > 2
  M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet Dir(α)
  f_X(x) = (Γ(Σ_{i=1}^k α_i)/∏_{i=1}^k Γ(α_i)) ∏_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i/Σ_{i=1}^k α_i    V[X_i] = E[X_i](1 − E[X_i])/(Σ_{i=1}^k α_i + 1)

Beta Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = (Γ(α + β)/(Γ(α) Γ(β))) x^{α−1} (1 − x)^{β−1}
  E[X] = α/(α + β)    V[X] = αβ/((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λ Γ(1 + 1/k)    V[X] = λ² Γ(1 + 2/k) − E[X]²
  M_X(s) = Σ_{n=0}^∞ (s^n λ^n/n!) Γ(1 + n/k)

Pareto Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α for x ≥ x_m
  f_X(x) = α x_m^α/x^{α+1} for x ≥ x_m
  E[X] = αx_m/(α − 1) for α > 1    V[X] = x_m² α/((α − 1)²(α − 2)) for α > 2
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s) for s < 0
[Figure: PDF plots of the continuous Uniform; Normal (μ=0, σ²=0.2, 1, 5; μ=−2, σ²=0.5); Log-Normal (various μ, σ²); Student's t (ν=1, 2, 5, ∞); χ² (k=1,...,5); F (various d₁, d₂); Exponential (β=2, 1, 0.4); Gamma (various α, β); Inverse Gamma (various α, β); Beta (various α, β); Weibull (λ=1; k=0.5, 1, 1.5, 5); and Pareto (x_m=1; α=1, 2, 4) distributions.]
2 Probability Theory
Definitions
- Sample space Ω
- Outcome (point or element) ω ∈ Ω
- Event A ⊆ Ω
- σ-algebra 𝒜:
  1. ∅ ∈ 𝒜
  2. A₁, A₂, ... ∈ 𝒜 ⟹ ⋃_{i=1}^∞ A_i ∈ 𝒜
  3. A ∈ 𝒜 ⟹ ¬A ∈ 𝒜
- Probability distribution P:
  1. P[A] ≥ 0 for every A
  2. P[Ω] = 1
  3. P[⋃_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i] for disjoint A_i
- Probability space (Ω, 𝒜, P)

Properties

- P[∅] = 0
- B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
- P[¬A] = 1 − P[A]
- P[B] = P[A ∩ B] + P[¬A ∩ B]
- P[Ω] = 1,  P[∅] = 0
- ¬(⋃_n A_n) = ⋂_n ¬A_n,  ¬(⋂_n A_n) = ⋃_n ¬A_n  (DeMorgan)
- P[⋃_n A_n] = 1 − P[⋂_n ¬A_n]
- P[A ∪ B] = P[A] + P[B] − P[A ∩ B] ≤ P[A] + P[B]
- P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
- P[A ∩ ¬B] = P[A] − P[A ∩ B]

Continuity of Probabilities

- A₁ ⊂ A₂ ⊂ ... ⟹ lim_{n→∞} P[A_n] = P[A] where A = ⋃_{i=1}^∞ A_i
- A₁ ⊃ A₂ ⊃ ... ⟹ lim_{n→∞} P[A_n] = P[A] where A = ⋂_{i=1}^∞ A_i

Independence

A ⊥ B ⟺ P[A ∩ B] = P[A] P[B]

Conditional Probability

P[A | B] = P[A ∩ B]/P[B]  if P[B] > 0

Law of Total Probability

P[B] = Σ_{i=1}^n P[B | A_i] P[A_i]  where Ω = ⋃_{i=1}^n A_i

Bayes' Theorem

P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j]  where Ω = ⋃_{i=1}^n A_i

Inclusion-Exclusion Principle

|⋃_{i=1}^n A_i| = Σ_{r=1}^n (−1)^{r−1} Σ_{i_1 < ··· < i_r} |A_{i_1} ∩ ··· ∩ A_{i_r}|
3 Random Variables

3.1 Transformations

Transformation function: Z = φ(X)

Discrete

f_Z(z) = P[φ(X) = z] = P[{x : φ(x) = z}] = P[X ∈ φ^{−1}(z)] = Σ_{x ∈ φ^{−1}(z)} f(x)

Continuous

F_Z(z) = P[φ(X) ≤ z] = ∫_{A_z} f(x) dx  with A_z = {x : φ(x) ≤ z}

Special case if φ strictly monotone

f_Z(z) = f_X(φ^{−1}(z)) |dφ^{−1}(z)/dz| = f_X(x) |dx/dz| = f_X(x) / |J|

The Rule of the Lazy Statistician

E[Z] = ∫ φ(x) dF_X(x)
E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution

- Z := X + Y:  f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx;  if X, Y ≥ 0: f_Z(z) = ∫_0^z f_{X,Y}(x, z − x) dx
- Z := |X − Y|:  f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
- Z := X/Y:  f_Z(z) = ∫_{−∞}^∞ |y| f_{X,Y}(yz, y) dy;  if X ⊥ Y: f_Z(z) = ∫ |y| f_X(yz) f_Y(y) dy
4 Expectation
Definition and properties

- E[X] = μ_X = ∫ x dF_X(x) = Σ_x x f_X(x) if X is discrete;  ∫ x f_X(x) dx if X is continuous
- P[X = c] = 1 ⟹ E[X] = c
- E[cX] = c E[X]
- E[X + Y] = E[X] + E[Y]
- E[XY] = ∫∫_{X,Y} xy f_{X,Y}(x, y) dx dy
- E[φ(Y)] ≠ φ(E[Y]) in general (cf. Jensen inequality)
- P[X ≥ Y] = 1 ⟹ E[X] ≥ E[Y]
- P[X = Y] = 1 ⟹ E[X] = E[Y]
- E[X] = Σ_{x=1}^∞ P[X ≥ x] for X taking values in the non-negative integers

Sample mean

X̄_n = (1/n) Σ_{i=1}^n X_i

Conditional expectation

- E[Y | X = x] = ∫ y f(y | x) dy
- E[X] = E[E[X | Y]]
- E[φ(X, Y) | X = x] = ∫ φ(x, y) f_{Y|X}(y | x) dy
- E[φ(Y, Z) | X = x] = ∫∫ φ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
- E[Y + Z | X] = E[Y | X] + E[Z | X]
- E[φ(X) Y | X] = φ(X) E[Y | X]
- E[Y | X] = c ⟹ Cov[X, Y] = 0
5 Variance

Definition and properties

- V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
- V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
- V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] if the X_i are independent

Standard deviation

sd[X] = √V[X] = σ_X

Covariance

- Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
- Cov[X, a] = 0
- Cov[X, X] = V[X]
- Cov[X, Y] = Cov[Y, X]
- Cov[aX, bY] = ab Cov[X, Y]
- Cov[X + a, Y + b] = Cov[X, Y]
- Cov[Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j] = Σ_{i=1}^n Σ_{j=1}^m Cov[X_i, Y_j]

Correlation

ρ[X, Y] = Cov[X, Y]/√(V[X] V[Y])

Independence

X ⊥ Y ⟹ ρ[X, Y] = 0 ⟺ Cov[X, Y] = 0 ⟺ E[XY] = E[X] E[Y]

Sample variance

S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²

Conditional variance

- V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
- V[Y] = E[V[Y | X]] + V[E[Y | X]]
6 Inequalities
Cauchy-Schwarz:  E[XY]² ≤ E[X²] E[Y²]

Markov:  P[φ(X) ≥ t] ≤ E[φ(X)]/t

Chebyshev:  P[|X − E[X]| ≥ t] ≤ V[X]/t²

Chernoff:  P[X ≥ (1 + δ)μ] ≤ (e^δ/(1 + δ)^{1+δ})^μ for δ > −1

Jensen:  E[φ(X)] ≥ φ(E[X]) for φ convex
7 Distribution Relationships
Binomial

- X_i ~ Bern(p) ⟹ Σ_{i=1}^n X_i ~ Bin(n, p)
- X ~ Bin(n, p), Y ~ Bin(m, p), X ⊥ Y ⟹ X + Y ~ Bin(n + m, p)
- Bin(n, p) ≈ Po(np) (n large, p small)
- Bin(n, p) ≈ N(np, np(1 − p)) (n large, p far from 0 and 1)

Negative Binomial

- X ~ NBin(1, p) = Geo(p)
- X ~ NBin(r, p) = Σ_{i=1}^r Geo(p)
- X_i ~ NBin(r_i, p) ⟹ Σ X_i ~ NBin(Σ r_i, p)
- X ~ NBin(r, p), Y ~ Bin(s + r, p) ⟹ P[X ≤ s] = P[Y ≥ r]

Poisson

- X_i ~ Po(λ_i), X_i ⊥ X_j ⟹ Σ_{i=1}^n X_i ~ Po(Σ_{i=1}^n λ_i)
- X_i ~ Po(λ_i), X_i ⊥ X_j ⟹ X_i | Σ_{j=1}^n X_j ~ Bin(Σ_{j=1}^n X_j, λ_i/Σ_{j=1}^n λ_j)

Exponential

- X_i ~ Exp(β), X_i ⊥ X_j ⟹ Σ_{i=1}^n X_i ~ Gamma(n, β)
- Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal

- X ~ N(μ, σ²) ⟹ (X − μ)/σ ~ N(0, 1)
- X ~ N(μ, σ²), Z = aX + b ⟹ Z ~ N(aμ + b, a²σ²)
- X ~ N(μ₁, σ₁²), Y ~ N(μ₂, σ₂²), X ⊥ Y ⟹ X + Y ~ N(μ₁ + μ₂, σ₁² + σ₂²)
- X_i ~ N(μ_i, σ_i²) ⟹ Σ_i X_i ~ N(Σ_i μ_i, Σ_i σ_i²)
- P[a < X ≤ b] = Φ((b − μ)/σ) − Φ((a − μ)/σ)
- Φ(−x) = 1 − Φ(x),  φ'(x) = −xφ(x),  φ''(x) = (x² − 1)φ(x)
- Upper quantile of N(0, 1): z_α = Φ^{−1}(1 − α)

Gamma

- X ~ Gamma(α, β) ⟺ X/β ~ Gamma(α, 1)
- Gamma(α, β) ≈ Σ_{i=1}^α Exp(β) for integer α
- X_i ~ Gamma(α_i, β), X_i ⊥ X_j ⟹ Σ_i X_i ~ Gamma(Σ_i α_i, β)
- Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx

Beta

- (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
- E[X^k] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[X^{k−1}]
- Beta(1, 1) ~ Unif(0, 1)
8 Probability and Moment Generating Functions
- G_X(t) = E[t^X],  |t| < 1
- M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)^i/i!] = Σ_{i=0}^∞ (E[X^i]/i!) t^i
- P[X = 0] = G_X(0)
- P[X = 1] = G'_X(0)
- P[X = i] = G_X^{(i)}(0)/i!
- E[X] = G'_X(1⁻)
- E[X^k] = M_X^{(k)}(0)
- E[X!/(X − k)!] = G_X^{(k)}(1⁻)
- V[X] = G''_X(1⁻) + G'_X(1⁻) − (G'_X(1⁻))²
- G_X(t) = G_Y(t) ⟹ X and Y are equal in distribution
9 Multivariate Distributions
9.1 Standard Bivariate Normal
Let X, Z ~ N(0, 1) with X ⊥ Z, and let Y = ρX + √(1 − ρ²) Z.

Joint density

f(x, y) = (1/(2π√(1 − ρ²))) exp{−(x² + y² − 2ρxy)/(2(1 − ρ²))}

Conditionals

(Y | X = x) ~ N(ρx, 1 − ρ²)  and  (X | Y = y) ~ N(ρy, 1 − ρ²)

Independence

X ⊥ Y ⟺ ρ = 0

9.2 Bivariate Normal

Let X ~ N(μ_x, σ_x²) and Y ~ N(μ_y, σ_y²).

f(x, y) = (1/(2πσ_xσ_y√(1 − ρ²))) exp{−z/(2(1 − ρ²))}

z = ((x − μ_x)/σ_x)² + ((y − μ_y)/σ_y)² − 2ρ((x − μ_x)/σ_x)((y − μ_y)/σ_y)

Conditional mean and variance

E[X | Y] = E[X] + ρ(σ_X/σ_Y)(Y − E[Y])
V[X | Y] = σ_X √(1 − ρ²)

9.3 Multivariate Normal

Covariance matrix Σ (precision matrix Σ^{−1}):

Σ = [ V[X₁] ··· Cov[X₁, X_k] ; ⋮ ⋱ ⋮ ; Cov[X_k, X₁] ··· V[X_k] ]

If X ~ N(μ, Σ),

f_X(x) = (2π)^{−n/2} |Σ|^{−1/2} exp{−(1/2)(x − μ)ᵀ Σ^{−1}(x − μ)}

Properties

- Z ~ N(0, 1), X = μ + Σ^{1/2} Z ⟹ X ~ N(μ, Σ)
- X ~ N(μ, Σ) ⟹ Σ^{−1/2}(X − μ) ~ N(0, 1)
- X ~ N(μ, Σ) ⟹ AX ~ N(Aμ, AΣAᵀ)
- X ~ N(μ, Σ), a a vector of length k ⟹ aᵀX ~ N(aᵀμ, aᵀΣa)
10 Convergence
Let {X₁, X₂, ...} be a sequence of rvs and let X be another rv. Let F_n denote the cdf of X_n and let F denote the cdf of X.

Types of convergence

1. In distribution (weakly, in law): X_n →D X:  lim_{n→∞} F_n(t) = F(t) at all t where F is continuous
2. In probability: X_n →P X:  (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
3. Almost surely (strongly): X_n →as X:  P[lim_{n→∞} X_n = X] = P[ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)] = 1
4. In quadratic mean (L²): X_n →qm X:  lim_{n→∞} E[(X_n − X)²] = 0

Relationships

- X_n →qm X ⟹ X_n →P X ⟹ X_n →D X
- X_n →as X ⟹ X_n →P X
- X_n →D X and (∃c ∈ R) P[X = c] = 1 ⟹ X_n →P X
- X_n →P X and Y_n →P Y ⟹ X_n + Y_n →P X + Y
- X_n →qm X and Y_n →qm Y ⟹ X_n + Y_n →qm X + Y
- X_n →P X and Y_n →P Y ⟹ X_nY_n →P XY
- X_n →P X ⟹ φ(X_n) →P φ(X)
- X_n →D X ⟹ φ(X_n) →D φ(X)
- X_n →qm b ⟺ lim_{n→∞} E[X_n] = b and lim_{n→∞} V[X_n] = 0
- X₁, ..., X_n iid, E[X] = μ, V[X] < ∞ ⟹ X̄_n →qm μ
11.2 Normal-Based Confidence Interval
Suppose θ̂_n ≈ N(θ, ŝe²). Let z_{α/2} = Φ^{−1}(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α where Z ~ N(0, 1). Then

C_n = θ̂_n ± z_{α/2} ŝe

11.3 Empirical distribution

Empirical Distribution Function (ECDF)

F̂_n(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x),  where I(X_i ≤ x) = 1 if X_i ≤ x and 0 if X_i > x

Properties (for any fixed x)

- E[F̂_n(x)] = F(x)
- V[F̂_n(x)] = F(x)(1 − F(x))/n
- mse = F(x)(1 − F(x))/n → 0
- F̂_n(x) →P F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X₁, ..., X_n ~ F)

P[sup_x |F(x) − F̂_n(x)| > ε] ≤ 2e^{−2nε²}

Nonparametric 1 − α confidence band for F

L(x) = max{F̂_n(x) − ε_n, 0},  U(x) = min{F̂_n(x) + ε_n, 1},  ε_n = √((1/(2n)) log(2/α))

P[L(x) ≤ F(x) ≤ U(x) for all x] ≥ 1 − α
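The ECDF and its DKW band are simple to compute directly; below is a minimal sketch in Python (NumPy only). The sample data and the level alpha = 0.05 are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def ecdf_band(x, alpha=0.05):
    """Empirical CDF evaluated at the sorted sample points,
    with the DKW 1-alpha nonparametric confidence band."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    F_hat = np.arange(1, n + 1) / n               # F_n(x_(i)) = i/n
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))    # DKW epsilon_n
    lower = np.clip(F_hat - eps, 0, 1)            # L(x) = max{F_n - eps, 0}
    upper = np.clip(F_hat + eps, 0, 1)            # U(x) = min{F_n + eps, 1}
    return x, F_hat, lower, upper

# Example: band for a standard normal sample of size 100
x, F_hat, lo, hi = ecdf_band(np.random.default_rng(0).normal(size=100))
```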
11.4 Statistical Functionals
- Statistical functional: T(F)
- Plug-in estimator of θ = T(F): θ̂_n = T(F̂_n)
- Linear functional: T(F) = ∫ φ(x) dF_X(x)
- Plug-in estimator for a linear functional: T(F̂_n) = ∫ φ(x) dF̂_n(x) = (1/n) Σ_{i=1}^n φ(X_i)
- Often: T(F̂_n) ≈ N(T(F), ŝe²) ⟹ T(F̂_n) ± z_{α/2} ŝe
- pth quantile: F^{−1}(p) = inf{x : F(x) ≥ p}
- μ̂ = X̄_n
- σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²
- κ̂ = (1/n) Σ_{i=1}^n (X_i − μ̂)³ / σ̂³
- ρ̂ = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / √(Σ_{i=1}^n (X_i − X̄_n)² Σ_{i=1}^n (Y_i − Ȳ_n)²)
12 Parametric Inference
Let F = {f(x; θ) : θ ∈ Θ} be a parametric model with parameter space Θ ⊆ R^k and parameter θ = (θ₁, ..., θ_k).

12.1 Method of Moments

jth moment

α_j(θ) = E[X^j] = ∫ x^j dF_X(x)

jth sample moment

α̂_j = (1/n) Σ_{i=1}^n X_i^j

Method of moments estimator (MoM): θ̂_n solves

α₁(θ̂_n) = α̂₁,  α₂(θ̂_n) = α̂₂,  ...,  α_k(θ̂_n) = α̂_k

Properties of the MoM estimator

- θ̂_n exists with probability tending to 1
- Consistency: θ̂_n →P θ
- Asymptotic normality: √n (θ̂ − θ) →D N(0, Σ), where Σ = g E[Y Yᵀ] gᵀ, Y = (X, X², ..., X^k)ᵀ, g = (g₁, ..., g_k), and g_j = ∂α_j^{−1}(θ)/∂θ
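As a concrete illustration, here is a minimal NumPy sketch of the MoM for the Gamma(α, β) model, equating E[X] = αβ and V[X] = αβ² with their sample counterparts; the simulated data at the end are an assumed example.

```python
import numpy as np

def gamma_mom(x):
    """Method-of-moments estimates (alpha_hat, beta_hat) for Gamma(alpha, beta),
    parameterized so that E[X] = alpha*beta and V[X] = alpha*beta^2."""
    x = np.asarray(x, dtype=float)
    m1 = x.mean()                 # first sample moment
    m2 = (x ** 2).mean()          # second sample moment
    var = m2 - m1 ** 2            # plug-in variance
    alpha_hat = m1 ** 2 / var
    beta_hat = var / m1
    return alpha_hat, beta_hat

# Example on simulated data with shape 3 and scale 2
rng = np.random.default_rng(1)
print(gamma_mom(rng.gamma(shape=3.0, scale=2.0, size=10_000)))
```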
12.2 Maximum Likelihood
Likelihood: L_n : Θ → [0, ∞)

L_n(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood

ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum likelihood estimator (MLE)

L_n(θ̂_n) = sup_θ L_n(θ)

Score function

s(X; θ) = ∂ log f(X; θ)/∂θ

Fisher information

I(θ) = V_θ[s(X; θ)],  I_n(θ) = n I(θ)

Fisher information (exponential family)

I(θ) = −E_θ[∂s(X; θ)/∂θ]

Observed Fisher information

I_n^obs(θ) = −(∂²/∂θ²) Σ_{i=1}^n log f(X_i; θ)

Properties of the MLE

- Consistency: θ̂_n →P θ
- Equivariance: θ̂_n is the MLE ⟹ φ(θ̂_n) is the MLE of φ(θ)
- Asymptotic normality:
  1. ŝe ≈ √(1/I_n(θ)):  (θ̂_n − θ)/ŝe →D N(0, 1)
  2. ŝe ≈ √(1/I_n(θ̂_n)):  (θ̂_n − θ)/ŝe →D N(0, 1)
- Asymptotic optimality (efficiency), i.e., smallest variance for large samples. If θ̃_n is any other estimator, the asymptotic relative efficiency is are(θ̃_n, θ̂_n) = V[θ̂_n]/V[θ̃_n] ≤ 1
- Approximately the Bayes estimator
12.2.1 Delta Method
If τ = φ(θ), where φ is differentiable and φ'(θ) ≠ 0:

(τ̂_n − τ)/ŝe(τ̂) →D N(0, 1)

where τ̂ = φ(θ̂_n) is the MLE of τ and

ŝe(τ̂) = |φ'(θ̂)| ŝe(θ̂_n)
12.3 Multiparameter Models
Let θ = (θ₁, ..., θ_k) and let θ̂ = (θ̂₁, ..., θ̂_k) be the MLE.

H_jj = ∂²ℓ_n/∂θ_j²,  H_jk = ∂²ℓ_n/∂θ_j∂θ_k

Fisher information matrix

I_n(θ) = −[ E_θ[H₁₁] ··· E_θ[H₁k] ; ⋮ ⋱ ⋮ ; E_θ[H_k1] ··· E_θ[H_kk] ]

Under appropriate regularity conditions

(θ̂ − θ) ≈ N(0, J_n)

with J_n(θ) = I_n^{−1}. Further, if θ̂_j is the jth component of θ̂, then

(θ̂_j − θ_j)/ŝe_j →D N(0, 1)

where ŝe²_j = J_n(j, j) and Cov[θ̂_j, θ̂_k] = J_n(j, k).

12.3.1 Multiparameter delta method

Let τ = φ(θ₁, ..., θ_k) and let the gradient of φ be

∇φ = (∂φ/∂θ₁, ..., ∂φ/∂θ_k)ᵀ

Suppose ∇φ evaluated at θ = θ̂ is nonzero and let τ̂ = φ(θ̂). Then

(τ̂ − τ)/ŝe(τ̂) →D N(0, 1)

where

ŝe(τ̂) = √((∇̂φ)ᵀ Ĵ_n (∇̂φ))

with Ĵ_n = J_n(θ̂) and ∇̂φ = ∇φ evaluated at θ = θ̂.

12.4 Parametric Bootstrap

Sample from f(x; θ̂_n) instead of from F̂_n, where θ̂_n could be the MLE or the method of moments estimator.
13 Hypothesis Testing
H₀ : θ ∈ Θ₀ versus H₁ : θ ∈ Θ₁

Definitions

- Null hypothesis H₀
- Alternative hypothesis H₁
- Simple hypothesis: θ = θ₀
- Composite hypothesis: θ > θ₀ or θ < θ₀
- Two-sided test: H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀
- One-sided test: H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀
- Critical value c
- Test statistic T
- Rejection region R = {x : T(x) > c}
- Power function β(θ) = P[X ∈ R]
- Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ ∈ Θ₁} β(θ)
- Test size: α = P[Type I error] = sup_{θ ∈ Θ₀} β(θ)

           | Retain H₀          | Reject H₀
  H₀ true  | correct            | Type I error (α)
  H₁ true  | Type II error (β)  | correct (power)

p-value

- p-value = sup_{θ ∈ Θ₀} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
- p-value = sup_{θ ∈ Θ₀} P_θ[T(X*) ≥ T(X)] ≈ 1 − F_θ(T(X)) since T(X*) ~ F_θ; = inf{α : T(X) ∈ R_α}

  p-value        evidence
  < 0.01         very strong evidence against H₀
  0.01 – 0.05    strong evidence against H₀
  0.05 – 0.1     weak evidence against H₀
  > 0.1          little or no evidence against H₀

Wald test

- Two-sided test
- Reject H₀ when |W| > z_{α/2} where W = (θ̂ − θ₀)/ŝe
- P[|W| > z_{α/2}] → α
- p-value = P_{θ₀}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)
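A minimal Python sketch of the two-sided Wald test; the Bernoulli example at the end (60 successes in 100 trials, testing p = 0.5) is an illustrative assumption.

```python
import numpy as np
from math import erf, sqrt

def wald_test(theta_hat, se, theta0=0.0):
    """Two-sided Wald test of H0: theta = theta0.
    Returns the statistic W and the p-value 2*Phi(-|W|)."""
    W = (theta_hat - theta0) / se
    Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF
    return W, 2 * Phi(-abs(W))

# Example: test H0: p = 0.5 from 60 successes in 100 Bernoulli trials
p_hat = 0.6
se_hat = np.sqrt(p_hat * (1 - p_hat) / 100)
print(wald_test(p_hat, se_hat, theta0=0.5))
```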
Likelihood ratio test (LRT)

T(X) = sup_{θ ∈ Θ} L_n(θ) / sup_{θ ∈ Θ₀} L_n(θ) = L_n(θ̂_n)/L_n(θ̂_{n,0})

λ(X) = 2 log T(X) →D χ²_{r−q},  where Σ_{i=1}^k Z_i² ~ χ²_k and Z₁, ..., Z_k iid N(0, 1)

p-value = P_{θ₀}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT

- MLE: p̂_n = (X₁/n, ..., X_k/n)
- T(X) = L_n(p̂_n)/L_n(p₀) = ∏_{j=1}^k (p̂_j/p_{0j})^{X_j}
- λ(X) = 2 Σ_{j=1}^k X_j log(p̂_j/p_{0j}) →D χ²_{k−1}
- The approximate size-α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}

Pearson chi-square test

- T = Σ_{j=1}^k (X_j − E[X_j])²/E[X_j], where E[X_j] = np_{0j} under H₀
- T →D χ²_{k−1}
- p-value = P[χ²_{k−1} > T(x)]
- Faster convergence to χ²_{k−1} than the LRT, hence preferable for small n

Independence testing

- I rows, J columns, X a multinomial sample of size n = I · J
- MLEs unconstrained: p̂_{ij} = X_{ij}/n
- MLEs under H₀: p̂_{0ij} = p̂_{i·} p̂_{·j} = (X_{i·}/n)(X_{·j}/n)
- LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J X_{ij} log(n X_{ij}/(X_{i·} X_{·j}))
- Pearson chi-square: T = Σ_{i=1}^I Σ_{j=1}^J (X_{ij} − E[X_{ij}])²/E[X_{ij}]
- LRT and Pearson →D χ²_ν, where ν = (I − 1)(J − 1)
14 Bayesian Inference
Bayes' Theorem

f(θ | x) = f(x | θ) f(θ)/f(x) = f(x | θ) f(θ)/∫ f(x | θ) f(θ) dθ ∝ L_n(θ) f(θ)

Definitions

- X^n = (X₁, ..., X_n),  x^n = (x₁, ..., x_n)
- Prior density f(θ)
- Likelihood f(x^n | θ): joint density of the data. In particular, X^n iid ⟹ f(x^n | θ) = ∏_{i=1}^n f(x_i | θ) = L_n(θ)
- Posterior density f(θ | x^n)
- Normalizing constant c_n = f(x^n) = ∫ f(x | θ) f(θ) dθ
- Kernel: part of a density that depends on θ
- Posterior mean: θ̄_n = ∫ θ f(θ | x^n) dθ = ∫ θ L_n(θ) f(θ) dθ / ∫ L_n(θ) f(θ) dθ

14.1 Credible Intervals

Posterior interval

P[θ ∈ (a, b) | x^n] = ∫_a^b f(θ | x^n) dθ = 1 − α

Equal-tail credible interval

∫_{−∞}^a f(θ | x^n) dθ = ∫_b^∞ f(θ | x^n) dθ = α/2

Highest posterior density (HPD) region R_n

1. P[θ ∈ R_n] = 1 − α
2. R_n = {θ : f(θ | x^n) > k} for some k

R_n unimodal ⟹ R_n is an interval.

14.2 Function of parameters

Let τ = φ(θ) and A = {θ : φ(θ) ≤ τ}.

Posterior CDF for τ

H(τ | x^n) = P[φ(θ) ≤ τ | x^n] = ∫_A f(θ | x^n) dθ

Posterior density

h(τ | x^n) = H'(τ | x^n)

Bayesian delta method

τ | X^n ≈ N(φ(θ̂), ŝe |φ'(θ̂)|)
14.3 Priors
Choice

- Subjective Bayesianism. Objective Bayesianism. Robust Bayesianism.

Types

- Flat: f(θ) ∝ constant
- Proper: ∫ f(θ) dθ = 1
- Improper: ∫ f(θ) dθ = ∞
- Jeffreys' prior (transformation-invariant): f(θ) ∝ √I(θ);  f(θ) ∝ √det(I(θ))
- Conjugate: f(θ) and f(θ | x^n) belong to the same parametric family

14.3.1 Conjugate Priors

Discrete likelihood

  Likelihood     | Conjugate prior | Posterior hyperparameters
  Bern(p)        | Beta(α, β)      | α + Σ_{i=1}^n x_i,  β + n − Σ_{i=1}^n x_i
  Bin(p)         | Beta(α, β)      | α + Σ_{i=1}^n x_i,  β + Σ_{i=1}^n N_i − Σ_{i=1}^n x_i
  NBin(p)        | Beta(α, β)      | α + rn,  β + Σ_{i=1}^n x_i
  Po(λ)          | Gamma(α, β)     | α + Σ_{i=1}^n x_i,  β + n
  Multinomial(p) | Dir(α)          | α + Σ_{i=1}^n x^{(i)}
  Geo(p)         | Beta(α, β)      | α + n,  β + Σ_{i=1}^n x_i
Continuous likelihood (subscript c denotes a constant)

  Likelihood      | Conjugate prior                        | Posterior hyperparameters
  Unif(0, θ)      | Pareto(x_m, k)                         | max{x_(n), x_m},  k + n
  Exp(λ)          | Gamma(α, β)                            | α + n,  β + Σ_{i=1}^n x_i
  N(μ, σ_c²)      | N(μ₀, σ₀²)                             | (μ₀/σ₀² + Σ_{i=1}^n x_i/σ_c²)/(1/σ₀² + n/σ_c²),  (1/σ₀² + n/σ_c²)^{−1}
  N(μ_c, σ²)      | Scaled Inverse Chi-square(ν, σ₀²)      | ν + n,  (νσ₀² + Σ_{i=1}^n (x_i − μ)²)/(ν + n)
  N(μ, σ²)        | Normal-scaled Inverse Gamma(λ, ν, α, β)| (νλ + nx̄)/(ν + n),  ν + n,  α + n/2,  β + (1/2) Σ_{i=1}^n (x_i − x̄)² + nν(x̄ − λ)²/(2(n + ν))
  MVN(μ, Σ_c)     | MVN(μ₀, Σ₀)                            | (Σ₀^{−1} + nΣ_c^{−1})^{−1}(Σ₀^{−1}μ₀ + nΣ_c^{−1}x̄),  (Σ₀^{−1} + nΣ_c^{−1})^{−1}
  MVN(μ_c, Σ)     | Inverse-Wishart(ν, Ψ)                  | n + ν,  Ψ + Σ_{i=1}^n (x_i − μ_c)(x_i − μ_c)ᵀ
  Pareto(x_{m,c}, k) | Gamma(α, β)                          | α + n,  β + Σ_{i=1}^n log(x_i/x_{m,c})
  Pareto(x_m, k_c)   | Pareto(x₀, k₀)                       | x₀,  k₀ − kn  (where k₀ > kn)
  Gamma(α_c, β)      | Gamma(α₀, β₀)                        | α₀ + nα_c,  β₀ + Σ_{i=1}^n x_i
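As a worked instance of the first row (Bern(p) likelihood with a Beta(α, β) prior), here is a minimal Python sketch of the conjugate update; the data and the flat Beta(1, 1) prior are assumed for illustration.

```python
import numpy as np

def beta_bernoulli_update(x, alpha=1.0, beta=1.0):
    """Conjugate update for a Bern(p) likelihood with a Beta(alpha, beta) prior:
    the posterior is Beta(alpha + sum(x), beta + n - sum(x))."""
    x = np.asarray(x)
    n, s = x.size, x.sum()
    alpha_post, beta_post = alpha + s, beta + n - s
    post_mean = alpha_post / (alpha_post + beta_post)   # posterior mean of p
    return alpha_post, beta_post, post_mean

# Example: flat Beta(1, 1) prior, 7 successes in 10 trials
print(beta_bernoulli_update([1, 1, 1, 1, 1, 1, 1, 0, 0, 0]))
```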
14.4 Bayesian Testing
If H₀ : θ ∈ Θ₀:

Prior probability

P[H₀] = ∫_{Θ₀} f(θ) dθ

Posterior probability

P[H₀ | x^n] = ∫_{Θ₀} f(θ | x^n) dθ

Let H₀, ..., H_{K−1} be K hypotheses. Suppose θ ~ f(θ | H_k); then

P[H_k | x^n] = f(x^n | H_k) P[H_k] / Σ_{k=1}^K f(x^n | H_k) P[H_k]

Marginal likelihood

f(x^n | H_i) = ∫_Θ f(x^n | θ, H_i) f(θ | H_i) dθ

Posterior odds (of H_i relative to H_j)

P[H_i | x^n]/P[H_j | x^n] = (f(x^n | H_i)/f(x^n | H_j)) × (P[H_i]/P[H_j]),

i.e., the Bayes factor BF_ij times the prior odds.

Bayes factor scale

  log₁₀ BF₁₀ | BF₁₀     | evidence
  0 – 0.5    | 1 – 1.5  | Weak
  0.5 – 1    | 1.5 – 10 | Moderate
  1 – 2      | 10 – 100 | Strong
  > 2        | > 100    | Decisive

p* = (p/(1 − p)) BF₁₀ / (1 + (p/(1 − p)) BF₁₀),  where p = P[H₁] and p* = P[H₁ | x^n]
15 Exponential Family
Scalar parameter

f_X(x | θ) = h(x) exp{η(θ) T(x) − A(θ)} = h(x) g(θ) exp{η(θ) T(x)}

Vector parameter

f_X(x | θ) = h(x) exp{Σ_{i=1}^s η_i(θ) T_i(x) − A(θ)} = h(x) exp{η(θ) · T(x) − A(θ)} = h(x) g(θ) exp{η(θ) · T(x)}

Natural form

f_X(x | η) = h(x) exp{η · T(x) − A(η)} = h(x) g(η) exp{η · T(x)} = h(x) g(η) exp{ηᵀ T(x)}

16 Sampling Methods
16.1 The Bootstrap
Let T_n = g(X₁, ..., X_n) be a statistic.

1. Estimate V_F[T_n] with V_{F̂_n}[T_n].
2. Approximate V_{F̂_n}[T_n] using simulation:
   (a) Repeat the following B times to get T*_{n,1}, ..., T*_{n,B}, an iid sample from the sampling distribution implied by F̂_n:
       i. Sample uniformly X*₁, ..., X*_n ~ F̂_n.
       ii. Compute T*_n = g(X*₁, ..., X*_n).
   (b) Then

   v_boot = V̂_{F̂_n} = (1/B) Σ_{b=1}^B (T*_{n,b} − (1/B) Σ_{r=1}^B T*_{n,r})²
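A minimal sketch of this resampling scheme in Python (NumPy only); the statistic (the sample median), B, and the simulated data are illustrative assumptions.

```python
import numpy as np

def bootstrap_var(x, stat, B=2000, seed=0):
    """Nonparametric bootstrap estimate v_boot of V_F[T_n]: resample n points
    with replacement from the data B times and take the variance of T*."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = x.size
    t_star = np.array([stat(x[rng.integers(0, n, n)]) for _ in range(B)])
    return t_star.var()        # (1/B) * sum_b (T*_b - mean(T*))^2

# Example: bootstrap variance of the sample median
rng = np.random.default_rng(1)
data = rng.normal(size=200)
print(bootstrap_var(data, np.median))
```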
16.1.1 Bootstrap Confidence Intervals
Normal-based interval

T_n ± z_{α/2} ŝe_boot

Pivotal interval

1. Location parameter θ = T(F)
2. Pivot R_n = θ̂_n − θ
3. Let H(r) = P[R_n ≤ r] be the cdf of R_n
4. Let R*_{n,b} = θ̂*_{n,b} − θ̂_n. Approximate H using the bootstrap:

   Ĥ(r) = (1/B) Σ_{b=1}^B I(R*_{n,b} ≤ r)

5. θ*_β = β sample quantile of (θ̂*_{n,1}, ..., θ̂*_{n,B})
6. r*_β = β sample quantile of (R*_{n,1}, ..., R*_{n,B}), i.e., r*_β = θ*_β − θ̂_n
7. Approximate 1 − α confidence interval C_n = (â, b̂) where

   â = θ̂_n − Ĥ^{−1}(1 − α/2) = θ̂_n − r*_{1−α/2} = 2θ̂_n − θ*_{1−α/2}
   b̂ = θ̂_n − Ĥ^{−1}(α/2) = θ̂_n − r*_{α/2} = 2θ̂_n − θ*_{α/2}

Percentile interval

C_n = (θ*_{α/2}, θ*_{1−α/2})
16.2 Rejection Sampling
Setup

- We can easily sample from g(θ)
- We want to sample from h(θ), but it is difficult
- We know h(θ) up to a proportionality constant: h(θ) = k(θ)/∫ k(θ) dθ
- Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ)

Algorithm

1. Draw θ_cand ~ g(θ)
2. Generate u ~ Unif(0, 1)
3. Accept θ_cand if u ≤ k(θ_cand)/(M g(θ_cand))
4. Repeat until B values of θ_cand have been accepted

Example

- We can easily sample from the prior g(θ) = f(θ)
- Target is the posterior: h(θ) ∝ k(θ) = f(x^n | θ) f(θ)
- Envelope condition: f(x^n | θ) ≤ f(x^n | θ̂_n) = L_n(θ̂_n) ≡ M
- Algorithm: 1. Draw θ_cand ~ f(θ); 2. Generate u ~ Unif(0, 1); 3. Accept θ_cand if u ≤ L_n(θ_cand)/L_n(θ̂_n)
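A minimal generic sketch of the algorithm in Python; the target kernel k(θ) = θ(1 − θ)⁴ (a Beta(2, 5) kernel), the Unif(0, 1) proposal, and the envelope constant M = 0.082 are assumed for illustration.

```python
import numpy as np

def rejection_sample(k, g_sample, g_pdf, M, B, seed=0):
    """Rejection sampler: draws from the density proportional to k(theta)
    using proposal g, assuming the envelope k(theta) <= M * g(theta)."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < B:
        cand = g_sample(rng)                       # 1. draw candidate from g
        u = rng.uniform()                          # 2. uniform on (0, 1)
        if u <= k(cand) / (M * g_pdf(cand)):       # 3. accept with prob k/(M g)
            out.append(cand)
    return np.array(out)

# Example: target proportional to k(t) = t*(1-t)^4, proposal Unif(0,1),
# envelope constant M = 0.082 >= max_t k(t).
draws = rejection_sample(lambda t: t * (1 - t) ** 4,
                         lambda rng: rng.uniform(),
                         lambda t: 1.0, M=0.082, B=1000)
```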
16.3 Importance Sampling
Sample from an importance function g rather than the target density h.

Algorithm to obtain an approximation to E[q(θ) | x^n]:

1. Sample from the prior: θ₁, ..., θ_B iid ~ f(θ)
2. w_i = L_n(θ_i)/Σ_{i=1}^B L_n(θ_i),  i = 1, ..., B
3. E[q(θ) | x^n] ≈ Σ_{i=1}^B q(θ_i) w_i

17 Decision Theory
Definitions

- Unknown quantity affecting our decision: θ ∈ Θ
- Decision rule δ: synonymous with an estimator θ̂
- Action a ∈ A: possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
- Loss function L: consequences of taking action a when the true state is θ, or the discrepancy between θ and θ̂; L : Θ × A → [−k, ∞).

Loss functions

- Squared error loss: L(θ, a) = (θ − a)²
- Linear loss: L(θ, a) = K₁(θ − a) if a − θ < 0;  K₂(a − θ) if a − θ ≥ 0
- Absolute error loss: L(θ, a) = |θ − a| (linear loss with K₁ = K₂)
- Lp loss: L(θ, a) = |θ − a|^p
- Zero-one loss: L(θ, a) = 0 if a = θ;  1 if a ≠ θ
17.1 Risk

Posterior risk

r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]

(Frequentist) risk

R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]

Bayes risk

r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]

r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]

17.2 Admissibility

- θ̂' dominates θ̂ if ∀θ: R(θ, θ̂') ≤ R(θ, θ̂) and ∃θ: R(θ, θ̂') < R(θ, θ̂)
- θ̂ is inadmissible if there is at least one other estimator θ̂' that dominates it. Otherwise it is called admissible.
17.3 Bayes Rule

Bayes rule (or Bayes estimator)

- r(f, θ̂) = inf_{θ̃} r(f, θ̃)
- θ̂(x) = inf_{θ̃} r(θ̃ | x) for all x ⟹ r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx

Theorems

- Squared error loss: posterior mean
- Absolute error loss: posterior median
- Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum risk

R̄(θ̂) = sup_θ R(θ, θ̂),  R̄(a) = sup_θ R(θ, a)

Minimax rule

sup_θ R(θ, θ̂) = inf_{θ̃} R̄(θ̃) = inf_{θ̃} sup_θ R(θ, θ̃)

θ̂ = Bayes rule and ∃c: R(θ, θ̂) = c ⟹ θ̂ is minimax

Least favorable prior

θ̂^f = Bayes rule and R(θ, θ̂^f) ≤ r(f, θ̂^f) for all θ ⟹ θ̂^f is minimax
18 Linear Regression
Definitions

- Response variable Y
- Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression

Model

Y_i = β₀ + β₁X_i + ε_i,  E[ε_i | X_i] = 0,  V[ε_i | X_i] = σ²

Fitted line

r̂(x) = β̂₀ + β̂₁x

Predicted (fitted) values

Ŷ_i = r̂(X_i)

Residuals

ε̂_i = Y_i − Ŷ_i = Y_i − (β̂₀ + β̂₁X_i)

Residual sums of squares (rss)

rss(β̂₀, β̂₁) = Σ_{i=1}^n ε̂_i²

Least squares estimates

β̂ = (β̂₀, β̂₁)ᵀ minimizes rss over (β₀, β₁):

β̂₀ = Ȳ_n − β̂₁X̄_n
β̂₁ = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n)/Σ_{i=1}^n (X_i − X̄_n)² = (Σ_{i=1}^n X_iY_i − nX̄Ȳ)/(Σ_{i=1}^n X_i² − nX̄²)

E[β̂ | X^n] = (β₀, β₁)ᵀ
V[β̂ | X^n] = (σ²/(n s²_X)) [ n^{−1} Σ_{i=1}^n X_i²  −X̄_n ; −X̄_n  1 ]

ŝe(β̂₀) = (σ̂/(s_X √n)) √(Σ_{i=1}^n X_i²/n)
ŝe(β̂₁) = σ̂/(s_X √n)

where s²_X = n^{−1} Σ_{i=1}^n (X_i − X̄_n)² and σ̂² = (1/(n − 2)) Σ_{i=1}^n ε̂_i² (unbiased estimate).

Further properties:

- Consistency: β̂₀ →P β₀ and β̂₁ →P β₁
- Asymptotic normality: (β̂₀ − β₀)/ŝe(β̂₀) →D N(0, 1) and (β̂₁ − β₁)/ŝe(β̂₁) →D N(0, 1)
- Approximate 1 − α confidence intervals for β₀ and β₁: β̂₀ ± z_{α/2} ŝe(β̂₀) and β̂₁ ± z_{α/2} ŝe(β̂₁)
- Wald test for H₀ : β₁ = 0 vs. H₁ : β₁ ≠ 0: reject H₀ if |W| > z_{α/2} where W = β̂₁/ŝe(β̂₁).

R²

R² = Σ_{i=1}^n (Ŷ_i − Ȳ)²/Σ_{i=1}^n (Y_i − Ȳ)² = 1 − Σ_{i=1}^n ε̂_i²/Σ_{i=1}^n (Y_i − Ȳ)² = 1 − rss/tss
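A minimal NumPy sketch of these least squares formulas (estimates, σ̂², and standard errors); the synthetic data at the end are an illustrative assumption.

```python
import numpy as np

def simple_ols(x, y):
    """Least squares fit of Y = b0 + b1*X + eps, with the unbiased
    variance estimate sigma^2 = rss/(n-2) and the standard errors above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    sigma2 = np.sum(resid ** 2) / (n - 2)           # unbiased estimate of sigma^2
    se_b1 = np.sqrt(sigma2 / sxx)                   # = sigma_hat / (s_X * sqrt(n))
    se_b0 = np.sqrt(sigma2 * np.sum(x ** 2) / (n * sxx))
    return b0, b1, se_b0, se_b1

# Example on synthetic data with beta0 = 1, beta1 = 2
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
print(simple_ols(x, 1 + 2 * x + rng.normal(size=100)))
```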
Likelihood

L = ∏_{i=1}^n f(X_i, Y_i) = ∏_{i=1}^n f_X(X_i) × ∏_{i=1}^n f_{Y|X}(Y_i | X_i) = L₁ × L₂

L₁ = ∏_{i=1}^n f_X(X_i)
L₂ = ∏_{i=1}^n f_{Y|X}(Y_i | X_i) ∝ σ^{−n} exp{−(1/(2σ²)) Σ_i (Y_i − (β₀ + β₁X_i))²}

Under the assumption of normality, the least squares parameter estimators are also the MLEs, but the least squares variance estimator is not the MLE:

σ̂²_mle = (1/n) Σ_{i=1}^n ε̂_i²
18.2 Prediction
Observe X = x* of the covariate and want to predict its outcome Y*.

Ŷ* = β̂₀ + β̂₁x*
V[Ŷ*] = V[β̂₀] + x*² V[β̂₁] + 2x* Cov[β̂₀, β̂₁]

Prediction interval

ξ̂²_n = σ̂² (Σ_{i=1}^n (X_i − x*)²/(n Σ_i (X_i − X̄)²) + 1)

Ŷ* ± z_{α/2} ξ̂_n
18.3 Multiple Regression
Y = Xβ + ε

where

X = [ X₁₁ ··· X₁k ; ⋮ ⋱ ⋮ ; X_n1 ··· X_nk ],  β = (β₁, ..., β_k)ᵀ,  ε = (ε₁, ..., ε_n)ᵀ

Likelihood

L(β, σ²) = (2πσ²)^{−n/2} exp{−rss/(2σ²)}

rss = (y − Xβ)ᵀ(y − Xβ) = ‖Y − Xβ‖² = Σ_{i=1}^N (Y_i − x_iᵀβ)²

If the (k × k) matrix XᵀX is invertible,

β̂ = (XᵀX)^{−1}XᵀY
V[β̂ | X^n] = σ²(XᵀX)^{−1}
β̂ ≈ N(β, σ²(XᵀX)^{−1})

Estimate regression function

r̂(x) = Σ_{j=1}^k β̂_j x_j

Unbiased estimate for σ²

σ̂² = (1/(n − k)) Σ_{i=1}^n ε̂_i²,  ε̂ = Xβ̂ − Y

MLE

σ̂²_mle = ((n − k)/n) σ̂²

1 − α confidence interval

β̂_j ± z_{α/2} ŝe(β̂_j)
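A minimal sketch of the normal-equation estimator and its standard errors in Python; the design matrix with an intercept column and the simulated data are assumptions for illustration.

```python
import numpy as np

def multiple_ols(X, y):
    """OLS via the normal equations: beta = (X'X)^{-1} X'y, with the unbiased
    variance estimate sigma^2 = rss/(n-k) and V[beta] = sigma^2 (X'X)^{-1}."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = X @ beta - y
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(sigma2 * np.diag(XtX_inv))      # standard errors of beta_j
    return beta, se

# Example: intercept column plus one covariate, true beta = (1, 2)
rng = np.random.default_rng(4)
x = rng.normal(size=(200, 1))
X = np.hstack([np.ones((200, 1)), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=200)
print(multiple_ols(X, y))
```

In practice a QR decomposition or `np.linalg.lstsq` is numerically preferable to forming (XᵀX)^{−1} explicitly; the explicit inverse is used here only to mirror the formulas above.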
18.4 Model Selection
Consider predicting a new observation Y* for covariates X* and let S ⊆ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues

- Underfitting: too few covariates yields high bias
- Overfitting: too many covariates yields high variance

Procedure

1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis testing

H₀ : β_j = 0 vs. H₁ : β_j ≠ 0 for all j ∈ J

Mean squared prediction error (mspe)

mspe = E[(Ŷ(S) − Y*)²]

Prediction risk

R(S) = Σ_{i=1}^n mspe_i = Σ_{i=1}^n E[(Ŷ_i(S) − Y*_i)²]

Training error

R̂_tr(S) = Σ_{i=1}^n (Ŷ_i(S) − Y_i)²

R²

R²(S) = 1 − rss(S)/tss = 1 − R̂_tr(S)/tss = 1 − Σ_{i=1}^n (Ŷ_i(S) − Y_i)²/Σ_{i=1}^n (Y_i − Ȳ)²

The training error is a downward-biased estimate of the prediction risk:

E[R̂_tr(S)] < R(S)
bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^n Cov[Ŷ_i, Y_i]

Adjusted R²

R̄²(S) = 1 − ((n − 1)/(n − k)) rss/tss

Mallows' Cp statistic

R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty

Akaike Information Criterion (AIC)

AIC(S) = ℓ_n(β̂_S, σ̂²_S) − k

Bayesian Information Criterion (BIC)

BIC(S) = ℓ_n(β̂_S, σ̂²_S) − (k/2) log n

Validation and training

R̂_V(S) = Σ_{i=1}^m (Ŷ*_i(S) − Y*_i)²,  m = |{validation data}|, often n/4 or n/2

Leave-one-out cross-validation

R̂_CV(S) = Σ_{i=1}^n (Y_i − Ŷ_{(i)})² = Σ_{i=1}^n ((Y_i − Ŷ_i(S))/(1 − U_{ii}(S)))²

U(S) = X_S(X_SᵀX_S)^{−1}X_Sᵀ  (the "hat matrix")
19 Non-parametric Function Estimation
19.1 Density Estimation
Estimate f(x), where P[X ∈ A] = ∫_A f(x) dx.

Integrated square error (ise)

L(f, f̂_n) = ∫ (f(x) − f̂_n(x))² dx = J(h) + ∫ f²(x) dx

Frequentist risk

R(f, f̂_n) = E[L(f, f̂_n)] = ∫ b²(x) dx + ∫ v(x) dx

b(x) = E[f̂_n(x)] − f(x)
v(x) = V[f̂_n(x)]

19.1.1 Histograms

Definitions

- Number of bins m
- Binwidth h = 1/m
- Bin B_j has ν_j observations
- Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du

Histogram estimator

f̂_n(x) = Σ_{j=1}^m (p̂_j/h) I(x ∈ B_j)

E[f̂_n(x)] = p_j/h
V[f̂_n(x)] = p_j(1 − p_j)/(nh²)

R(f̂_n, f) ≈ (h²/12) ∫ (f'(u))² du + 1/(nh)

h* = (1/n^{1/3}) (6/∫ (f'(u))² du)^{1/3}

R*(f̂_n, f) ≈ C/n^{2/3},  C = (3/4)^{2/3} (∫ (f'(u))² du)^{1/3}

Cross-validation estimate of E[J(h)]

Ĵ_CV(h) = ∫ f̂²_n(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^m p̂²_j
19.1.2 Kernel Density Estimator (KDE)
Kernel K

- K(x) ≥ 0
- ∫ K(x) dx = 1
- ∫ x K(x) dx = 0
- ∫ x² K(x) dx ≡ σ²_K > 0

KDE

f̂_n(x) = (1/n) Σ_{i=1}^n (1/h) K((x − X_i)/h)

R(f, f̂_n) ≈ (1/4)(hσ_K)⁴ ∫ (f''(x))² dx + (1/(nh)) ∫ K²(x) dx

h* = c₁^{−2/5} c₂^{1/5} c₃^{−1/5} n^{−1/5},  c₁ = σ²_K, c₂ = ∫ K²(x) dx, c₃ = ∫ (f''(x))² dx

R*(f, f̂_n) = c₄/n^{4/5},  c₄ = (5/4)(σ²_K)^{2/5}(∫ K²(x) dx)^{4/5}(∫ (f'')² dx)^{1/5} = (5/4) C(K) (∫ (f'')² dx)^{1/5}
Epanechnikov kernel

K(x) = (3/(4√5))(1 − x²/5) for |x| < √5,  0 otherwise

20 Stochastic Processes

20.1 Markov Chains

Transition probabilities p_ij, with p_ij ≥ 0 and Σ_j p_ij = 1 for every i.
Chapman-Kolmogorov

p_ij(m + n) = Σ_k p_ik(m) p_kj(n)

P_{m+n} = P_m P_n
P_n = P × ··· × P = P^n

Marginal probability

μ_n = (μ_n(1), ..., μ_n(N)),  where μ_n(i) = P[X_n = i]
μ₀: initial distribution
μ_n = μ₀ P^n
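A minimal numeric illustration of Chapman-Kolmogorov and the marginal μ_n = μ₀Pⁿ; the 2-state transition matrix and initial distribution are assumed examples.

```python
import numpy as np

# Assumed 2-state transition matrix P and initial distribution mu_0.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
mu0 = np.array([1.0, 0.0])

P4 = np.linalg.matrix_power(P, 4)          # P_4 = P^4
# Chapman-Kolmogorov: P_{2+2} = P_2 P_2
assert np.allclose(P4, np.linalg.matrix_power(P, 2) @ np.linalg.matrix_power(P, 2))

mu4 = mu0 @ P4                             # marginal distribution after 4 steps
print(P4, mu4)
```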
20.2 Poisson Processes
Poisson process

- {X_t : t ∈ [0, ∞)} = number of events up to and including time t; X₀ = 0
- Independent increments: ∀t₀ < ··· < t_n: X_{t₁} − X_{t₀} ⊥ ··· ⊥ X_{t_n} − X_{t_{n−1}}
- Intensity function λ(t):
  - P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
  - P[X_{t+h} − X_t = 2] = o(h)
- X_{s+t} − X_s ~ Po(m(s + t) − m(s)), where m(t) = ∫_0^t λ(s) ds

Homogeneous Poisson process

λ(t) ≡ λ ⟹ X_t ~ Po(λt),  λ > 0

Waiting times

W_t := time at which X_t occurs;  W_t ~ Gamma(t, 1/λ)

Interarrival times

S_t = W_{t+1} − W_t;  S_t ~ Exp(1/λ)
21 Time Series
Mean function

μ_{xt} = E[x_t] = ∫ x f_t(x) dx

Autocovariance function

γ_x(s, t) = E[(x_s − μ_s)(x_t − μ_t)] = E[x_s x_t] − μ_s μ_t
γ_x(t, t) = E[(x_t − μ_t)²] = V[x_t]

Autocorrelation function (ACF)

ρ(s, t) = Cov[x_s, x_t]/√(V[x_s] V[x_t]) = γ(s, t)/√(γ(s, s) γ(t, t))

Cross-covariance function (CCV)

γ_{xy}(s, t) = E[(x_s − μ_{xs})(y_t − μ_{yt})]

Cross-correlation function (CCF)

ρ_{xy}(s, t) = γ_{xy}(s, t)/√(γ_x(s, s) γ_y(t, t))

Backshift operator: B^k(x_t) = x_{t−k}

Difference operator: ∇^d = (1 − B)^d

White noise

- w_t ~ wn(0, σ²_w)
- Gaussian: w_t iid ~ N(0, σ²_w)
- E[w_t] = 0 for all t ∈ T
- V[w_t] = σ² for all t ∈ T
- γ_w(s, t) = 0 for s ≠ t, s, t ∈ T

Random walk

- Drift δ: x_t = δt + Σ_{j=1}^t w_j
- E[x_t] = δt

Symmetric moving average

m_t = Σ_{j=−k}^k a_j x_{t−j},  where a_j = a_{−j} ≥ 0 and Σ_{j=−k}^k a_j = 1

21.1 Stationary Time Series

Strictly stationary

P[x_{t₁} ≤ c₁, ..., x_{t_k} ≤ c_k] = P[x_{t₁+h} ≤ c₁, ..., x_{t_k+h} ≤ c_k] for all k ∈ N and all t_k, c_k, h ∈ Z

Weakly stationary

E[x²_t] < ∞ for all t, E[x_t] is constant in t, and γ_x(s, t) depends on s and t only through |s − t|.
21.2 Estimation of Correlation
Sample mean

x̄ = (1/n) Σ_{t=1}^n x_t

Sample variance

V[x̄] = (1/n) Σ_{h=−n}^n (1 − |h|/n) γ_x(h)

Sample autocovariance function

γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)

Sample autocorrelation function

ρ̂(h) = γ̂(h)/γ̂(0)

Sample cross-covariance function

γ̂_{xy}(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(y_t − ȳ)

Sample cross-correlation function

ρ̂_{xy}(h) = γ̂_{xy}(h)/√(γ̂_x(0) γ̂_y(0))

Properties

- σ_{ρ̂_x(h)} = 1/√n if x_t is white noise
- σ_{ρ̂_{xy}(h)} = 1/√n if x_t or y_t is white noise
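A minimal NumPy sketch of the sample autocovariance and autocorrelation as defined above; the white-noise series at the end is an assumed example and illustrates the 1/√n property.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocovariance gamma(h) = (1/n) sum_{t=1}^{n-h}(x_{t+h}-xbar)(x_t-xbar)
    and sample autocorrelation rho(h) = gamma(h)/gamma(0) for h = 0..max_lag."""
    x = np.asarray(x, float)
    n, xbar = x.size, x.mean()
    gamma = np.array([np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / n
                      for h in range(max_lag + 1)])
    return gamma / gamma[0]

# Example: for white noise, rho(h), h >= 1, has standard deviation about 1/sqrt(n)
rng = np.random.default_rng(7)
print(sample_acf(rng.normal(size=1000), max_lag=5))
```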
21.3 Non-Stationary Time Series
Classical decomposition model

x_t = μ_t + s_t + w_t

- μ_t = trend
- s_t = seasonal component
- w_t = random noise term

21.3.1 Detrending

Least squares

1. Choose a trend model, e.g., μ_t = β₀ + β₁t + β₂t²
2. Minimize rss to obtain the trend estimate μ̂_t = β̂₀ + β̂₁t + β̂₂t²
3. Residuals ≙ noise w_t

Moving average

- The low-pass filter v_t is a symmetric moving average m_t with a_j = 1/(2k + 1):

  v_t = (1/(2k + 1)) Σ_{i=−k}^k x_{t−i}

- If (1/(2k + 1)) Σ_{j=−k}^k w_{t−j} ≈ 0, a linear trend function μ_t = β₀ + β₁t passes without distortion

Differencing

μ_t = β₀ + β₁t ⟹ ∇x_t = β₁ + ∇w_t
21.4 ARIMA models
Autoregressive polynomial

φ(z) = 1 − φ₁z − ··· − φ_p z^p,  z ∈ C, φ_p ≠ 0

Autoregressive operator

φ(B) = 1 − φ₁B − ··· − φ_p B^p

Autoregressive model of order p, AR(p)

x_t = φ₁x_{t−1} + ··· + φ_p x_{t−p} + w_t  ⟺  φ(B)x_t = w_t

AR(1)

x_t = φ^k(x_{t−k}) + Σ_{j=0}^{k−1} φ^j(w_{t−j}) → Σ_{j=0}^∞ φ^j w_{t−j} as k → ∞ if |φ| < 1

Moving average polynomial

θ(z) = 1 + θ₁z + ··· + θ_q z^q,  z ∈ C, θ_q ≠ 0

Moving average operator

θ(B) = 1 + θ₁B + ··· + θ_q B^q

MA(q) (moving average model of order q)

x_t = w_t + θ₁w_{t−1} + ··· + θ_q w_{t−q}  ⟺  x_t = θ(B)w_t

E[x_t] = Σ_{j=0}^q θ_j E[w_{t−j}] = 0

γ(h) = Cov[x_{t+h}, x_t] = σ²_w Σ_{j=0}^{q−h} θ_j θ_{j+h} for 0 ≤ h ≤ q;  0 for h > q

MA(1): x_t = w_t + θw_{t−1}

γ(h) = (1 + θ²)σ²_w for h = 0;  θσ²_w for h = 1;  0 for h > 1
ρ(h) = θ/(1 + θ²) for h = 1;  0 for h > 1

ARMA(p, q)

x_t = φ₁x_{t−1} + ··· + φ_p x_{t−p} + w_t + θ₁w_{t−1} + ··· + θ_q w_{t−q}  ⟺  φ(B)x_t = θ(B)w_t

Partial autocorrelation function (PACF)

- x_i^{h−1}: regression of x_i on {x_{h−1}, x_{h−2}, ..., x₁}
- φ_hh = corr(x_h − x_h^{h−1}, x₀ − x₀^{h−1}) for h ≥ 2
- E.g., φ₁₁ = corr(x₁, x₀) = ρ(1)

ARIMA(p, d, q)

∇^d x_t = (1 − B)^d x_t is ARMA(p, q):  φ(B)(1 − B)^d x_t = θ(B)w_t

Exponentially Weighted Moving Average (EWMA)

x_t = x_{t−1} + w_t − λw_{t−1}
x_t = Σ_{j=1}^∞ (1 − λ)λ^{j−1} x_{t−j} + w_t  when |λ| < 1
x̃_{n+1} = (1 − λ)x_n + λx̃_n

Seasonal ARIMA

Denoted by ARIMA(p, d, q) × (P, D, Q)_s:  Φ_P(B^s)φ(B)∇^D_s ∇^d x_t = δ + Θ_Q(B^s)θ(B)w_t
21.4.1 Causality and Invertibility
ARMA(p, q) is causal (future-independent) ⟺ there exist constants {ψ_j} with Σ_{j=0}^∞ |ψ_j| < ∞ such that x_t = Σ_{j=0}^∞ ψ_j w_{t−j}.

21.5 Spectral Analysis

Periodic process

x_t = A cos(2πω₀t + φ) = U₁ cos(2πω₀t) + U₂ sin(2πω₀t)

- Amplitude A, phase φ
- U₁ = A cos φ and U₂ = A sin φ are often normally distributed rvs

Periodic mixture

x_t = Σ_{k=1}^q (U_{k1} cos(2πω_k t) + U_{k2} sin(2πω_k t))

- U_{k1}, U_{k2}, for k = 1, ..., q, are independent zero-mean rvs with variances σ²_k
- γ(h) = Σ_{k=1}^q σ²_k cos(2πω_k h)
- γ(0) = E[x²_t] = Σ_{k=1}^q σ²_k

Spectral representation of a periodic process

γ(h) = σ² cos(2πω₀h) = (σ²/2) e^{−2πiω₀h} + (σ²/2) e^{2πiω₀h} = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω)

Spectral distribution function

F(ω) = 0 for ω < −ω₀;  σ²/2 for −ω₀ ≤ ω < ω₀;  σ² for ω ≥ ω₀
F(−∞) = F(−1/2) = 0,  F(∞) = F(1/2) = γ(0)

Spectral density

f(ω) = Σ_{h=−∞}^∞ γ(h) e^{−2πiωh},  −1/2 ≤ ω ≤ 1/2

- Needs Σ_{h=−∞}^∞ |γ(h)| < ∞

22 Math

22.1 Gamma Function

- Γ(n) = (n − 1)! for n ∈ N
- Γ(1/2) = √π
22.2 Beta Function
- Ordinary: B(x, y) = B(y, x) = ∫_0^1 t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y)
- Incomplete: B(x; a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt
- Regularized incomplete: I_x(a, b) = B(x; a, b)/B(a, b); for a, b ∈ N: I_x(a, b) = Σ_{j=a}^{a+b−1} ((a + b − 1)!/(j!(a + b − 1 − j)!)) x^j (1 − x)^{a+b−1−j}
- I₀(a, b) = 0,  I₁(a, b) = 1,  I_x(a, b) = 1 − I_{1−x}(b, a)
22.3 Series
Finite

- Σ_{k=1}^n k = n(n + 1)/2
- Σ_{k=1}^n (2k − 1) = n²
- Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6
- Σ_{k=1}^n k³ = (n(n + 1)/2)²
- Σ_{k=0}^n c^k = (c^{n+1} − 1)/(c − 1),  c ≠ 1

Binomial

- Σ_{k=0}^n C(n, k) = 2^n
- Σ_{k=0}^n C(r + k, k) = C(r + n + 1, n)
- Σ_{k=0}^n C(k, m) = C(n + 1, m + 1)
- Vandermonde's identity: Σ_{k=0}^r C(m, k) C(n, r − k) = C(m + n, r)
- Binomial theorem: Σ_{k=0}^n C(n, k) a^{n−k} b^k = (a + b)^n

Infinite

- Σ_{k=0}^∞ p^k = 1/(1 − p),  Σ_{k=1}^∞ p^k = p/(1 − p),  |p| < 1
- Σ_{k=0}^∞ k p^{k−1} = (d/dp) Σ_{k=0}^∞ p^k = (d/dp)(1/(1 − p)) = 1/(1 − p)²,  |p| < 1
- Σ_{k=0}^∞ C(r + k − 1, k) x^k = (1 − x)^{−r},  r ∈ N⁺
- Σ_{k=0}^∞ C(α, k) p^k = (1 + p)^α,  |p| < 1, α ∈ C
22.4 Combinatorics
Sampling

k out of n      | w/o replacement                              | w/ replacement
ordered         | n^(k) = ∏_{i=0}^{k−1}(n − i) = n!/(n − k)!   | n^k
unordered       | C(n, k) = n^(k)/k! = n!/(k!(n − k)!)         | C(n − 1 + r, r) = C(n − 1 + r, n − 1)

(n^(k) denotes the falling factorial.)

Stirling numbers, 2nd kind

{n k} = k {n−1 k} + {n−1 k−1},  1 ≤ k ≤ n

{n 0} = 1 if n = 0;  0 else
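The recurrence above translates directly into code; a minimal Python sketch (the example arguments are assumptions for illustration):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling numbers of the second kind via the recurrence
    {n k} = k*{n-1 k} + {n-1 k-1}, with {0 0} = 1 and {n 0} = 0 for n > 0."""
    if n == 0:
        return 1 if k == 0 else 0
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

# Example: ways to partition 5 labeled items into 3 nonempty blocks
print(stirling2(5, 3))   # 25
```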
Partitions

P_{n+k,k} = Σ_{i=1}^n P_{n,i};  k > n ⟹ P_{n,k} = 0;  n ≥ 1 ⟹ P_{n,0} = 0;  P_{0,0} = 1

Balls and urns: f : B → U, with D = distinguishable and ¬D = indistinguishable; |B| = n, |U| = m.

                | f arbitrary           | f injective          | f surjective   | f bijective
B: D,  U: D     | m^n                   | m^(n) if m ≥ n, 0 else | m! {n m}       | n! if m = n, 0 else
B: ¬D, U: D     | C(m + n − 1, n)       | C(m, n)              | C(n − 1, m − 1)| 1 if m = n, 0 else
B: D,  U: ¬D    | Σ_{k=1}^m {n k}       | 1 if m ≥ n, 0 else   | {n m}          | 1 if m = n, 0 else
B: ¬D, U: ¬D    | Σ_{k=1}^m P_{n,k}     | 1 if m ≥ n, 0 else   | P_{n,m}        | 1 if m = n, 0 else
References
[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen, Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen, Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
[Figure: Univariate distribution relationships, courtesy of Leemis and McQueston [2].]