Review of Probability Theory - KTI

Page 1: Review of Probability Theory - KTI

Knowledge Discovery and Data Mining 1 (VO) (707.003)

Review of Probability Theory

Denis Helic

KTI, TU Graz

Oct 9, 2014


Page 2: Review of Probability Theory - KTI

Big picture: KDDM

[Diagram: the Knowledge Discovery Process model rests on mathematical tools (Probability Theory, Linear Algebra, Information Theory, Statistical Inference) and infrastructure (Hardware & Programming)]


Page 3: Review of Probability Theory - KTI

Outline

1 Introduction

2 Conditional Probability and Independence

3 Random Variables

4 Discrete Random Variables

5 Continuous Random Variables


Page 4: Review of Probability Theory - KTI

Introduction

Random experiments

In random experiments we cannot predict the outcome in advance

We can observe some “regularity” if we repeat the experiment a large number of times

E.g. when tossing a coin we cannot predict the result of a single toss

If we toss many times we get an average of 50% of “heads” (fair coin)

Probability theory is a mathematical theory which describes such phenomena


Page 5: Review of Probability Theory - KTI

Introduction

The state space

It is the set of all possible outcomes of the experiment

We denote the state space by Ω

Coin toss: Ω = {t, h}

Two successive coin tosses: Ω = {tt, th, ht, hh}

Dice roll: Ω = {1, 2, 3, 4, 5, 6}

The lifetime of a light bulb: Ω = R+


Page 6: Review of Probability Theory - KTI

Introduction

The events

An event is a property that either holds or does not hold after the experiment is done

Mathematically, an event is a subset of Ω

We denote the events by capital letters: A,B,C , . . .

E.g. getting at least one heads in two successive coin tosses

A = {th, ht, hh}


Page 7: Review of Probability Theory - KTI

Introduction

The events

Some basic properties of events

If A and B are two events, then:

The contrary event of A is the complement set Ac

The event “A or B” is the union A ∪ B

The event “A and B” is the intersection A ∩ B


Page 8: Review of Probability Theory - KTI

Introduction

The events

Some basic properties of events

If A and B are two events, then:

The sure event is Ω

The impossible event is the empty set ∅

An elementary (atomic) event is a subset of Ω containing a single element, e.g. {ω}


Page 9: Review of Probability Theory - KTI

Introduction

The events

We denote by A the family of all events

Very often A = 2^Ω, the set of all subsets of Ω

The family A should be closed under the operations from above

If A, B ∈ A, then we must have: Ac ∈ A, A ∩ B ∈ A, A ∪ B ∈ A

Also: Ω ∈ A and ∅ ∈ A


Page 10: Review of Probability Theory - KTI

Introduction

The probability

With each event we associate a number P(A) called the probability of A

P(A) is between 0 and 1

“Frequency” interpretation

P(A) is a limit of the “frequency” with which A is realized

P(A) = lim n→∞ f(A)/n, where f(A) is the number of times A is realized in n repetitions


Page 11: Review of Probability Theory - KTI

Introduction

The probability

Basic properties of the probabilities

(i) 0 ≤ P(A) ≤ 1

(ii) P(Ω) = 1

(iii) P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅


Page 12: Review of Probability Theory - KTI

Introduction

The probability

The model is a triple (Ω,A,P)

Ω is the state space

A is the collection of all events

P(A) is the probability of the event A, for A ∈ A

P is a mapping from A into [0, 1] which satisfies at least properties (ii) and (iii) (Kolmogorov axioms)


Page 13: Review of Probability Theory - KTI

Introduction

Venn diagrams

[Venn diagram: the event A as a region inside the state space Ω]

Page 14: Review of Probability Theory - KTI

Introduction

Probability measure

The probability P(A) of the event A is the area of the set in the diagram

The area of Ω is 1

E.g. if the event A is a disc of radius r = 0.2

P(A) = πr² ≈ 0.1257


Page 15: Review of Probability Theory - KTI

Introduction

Properties of probability measures

Properties

If P is a probability measure on (Ω,A), then:

(i) P(∅) = 0

(ii) For every finite sequence A1, . . . , Am of pairwise disjoint elements of A (Ai ∩ Aj = ∅ whenever i ≠ j) we have:

P(∪_{n=1}^{m} An) = Σ_{n=1}^{m} P(An)

Property (ii) is called additivity


Page 16: Review of Probability Theory - KTI

Introduction

Additivity

[Venn diagram: two disjoint events A and B inside Ω]

Page 17: Review of Probability Theory - KTI

Introduction

Probability measure

The probability P(A ∪ B) of the event A ∪ B is the sum of the areas of the disjoint sets A and B in the diagram

P(A) = 0.1257

P(B) = 0.1257

P(A ∪ B) = 0.2514


Page 18: Review of Probability Theory - KTI

Introduction

Properties of probability measures

Properties

If P is a probability measure on (Ω,A), then:

(i) For A,B ∈ A, A ⊂ B =⇒ P(A) ≤ P(B)

(ii) For A,B ∈ A, P(A ∪ B) = P(A) + P(B)− P(A ∩ B)

(iii) For A ∈ A, P(A) = 1− P(Ac)


Page 19: Review of Probability Theory - KTI

Introduction

Subsets

[Venn diagram: A as a subset of B inside Ω]

Page 20: Review of Probability Theory - KTI

Introduction

Union

[Venn diagram: overlapping events A and B with intersection A ∩ B inside Ω]

Page 21: Review of Probability Theory - KTI

Introduction

Complement

[Venn diagram: the event A and its complement Ac inside Ω]

Page 22: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional probability

We always have the triple (Ω,A,P)

Typically we suppress (Ω,A) and talk only about P

Nevertheless, they are always present!

Conditional probability and independence are crucial for the application of probability theory in data mining!


Page 23: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional probability

Definition

If P(B) > 0 then we define the conditional probability of A given B:

P(A|B) = P(A ∩ B) / P(B)


Page 24: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional probability

[Figure: Venn diagram of overlapping events A and B with intersection A ∩ B inside Ω]

Page 25: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional probability

P(B) = 0.16

P(A) = 0.12

P(A ∩ B) = 0.04

P(A|B) = P(A∩B)/P(B) = 0.25


Page 26: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional probability

One intuitive explanation is that B occurred first and then we ask what is the probability that now A occurs as well

Time dimension

Another intuitive explanation is that our knowledge about the world increased

We have more information and know that B already occurred

Technically, B restricts the state space (makes it smaller)


Page 27: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional probability

[Figure: conditioning on B shrinks the state space to Ω′ = B; within it, A occupies the region A ∩ B]

Page 28: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional probability: example

We throw two dice

Event A = snake eyes (two ones)

Event B = doubles

Ω = {(1, 1), (1, 2), . . . , (6, 5), (6, 6)}

A = {(1, 1)}, B = {(1, 1), (2, 2), . . . , (6, 6)}


Page 29: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional probability: example

P(A) = 1/36

P(B) = Σ_{i=1}^{6} 1/36 = 1/6, by finite additivity and because the elementary events are pairwise disjoint

A ∩ B = A

P(A|B) = P(A∩B)/P(B) = (1/36)/(1/6) = 1/6

A = {(1, 1)}, B = {(1, 1), (2, 2), . . . , (6, 6)}
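As a sanity check, here is a minimal Python sketch (ours, not from the slides) that recovers P(A|B) = 1/6 by enumerating all 36 outcomes:

from fractions import Fraction

# all 36 equally likely outcomes of two dice
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
A = {(1, 1)}                       # snake eyes
B = {(i, i) for i in range(1, 7)}  # doubles

def prob(event):
    return Fraction(len(event), len(omega))  # each outcome has probability 1/36

print(prob(A & B) / prob(B))  # P(A|B) = P(A ∩ B)/P(B) = 1/6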


Page 30: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional probability: example

We have two boxes: red and blue

Each box contains apples and oranges

We first pick a box at random

Then we pick a fruit from that box, again at random

We are interested in the conditional probabilities of picking a specific fruit if a specific box was selected


Page 31: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional probability: example

Figure: From the book “Pattern Recognition and Machine Learning” by Bishop


Page 32: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional probability: example

Let A = apple, O = orange, B = blue box, R = red box:

P(A|B) = 3/4

P(O|B) = 1/4

P(A|R) = 1/4

P(O|R) = 3/4


Page 33: Review of Probability Theory - KTI

Conditional Probability and Independence

Independence

Definition

Events A and B are independent if:

P(A ∩ B) = P(A)P(B)


Page 34: Review of Probability Theory - KTI

Conditional Probability and Independence

Independence

Events A and B are not related to each other

We flip a coin once: event A

The second flip is the event B

The outcome of the second flip is not dependent on the outcome of the first flip

Intuitively, A and B are independent


Page 35: Review of Probability Theory - KTI

Conditional Probability and Independence

Independence

[Figure: Venn diagram of independent events A and B with intersection A ∩ B inside Ω]

Page 36: Review of Probability Theory - KTI

Conditional Probability and Independence

Independence

P(A) = 0.5

P(B) = 0.5

P(A ∩ B) = 0.25

P(A)P(B) = 0.25


Page 37: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional independence

Definition

Suppose P(C) > 0. Events A and B are conditionally independent given C if:

P(A ∩ B|C ) = P(A|C )P(B|C )


Page 38: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional independence

[Figure: Venn diagram of events A, B, and C inside Ω]

Page 39: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional independence

P(A) = 1/8, P(B) = 1/8, P(C) = 1/4

P(A|C) = 1/2, P(B|C) = 1/2

P(A ∩ B|C) = 1/4, P(A|C)P(B|C) = 1/4

P(A ∩ B) = 1/16, P(A)P(B) = 1/64


Page 40: Review of Probability Theory - KTI

Conditional Probability and Independence

Conditional independence

Remark

(i) Independence does not imply conditional independence

(ii) Conditional independence does not imply independence

Remark

Suppose P(B) > 0. Events A and B are independent if and only if P(A|B) = P(A).


Page 41: Review of Probability Theory - KTI

Conditional Probability and Independence

Independence: Example

Pick a card at random from a deck of 52 cards

A = the card is a heart, B = the card is a Queen

P(ω) = 1/52 for each card ω

By additivity, P(A) = 13/52, P(B) = 4/52

P(A ∩ B) = 1/52 (Queen of hearts), P(A)P(B) = (13/52)(4/52) = 1/52

A and B are independent


Page 42: Review of Probability Theory - KTI

Conditional Probability and Independence

Bayes rule

Remark

Suppose P(A) > 0 and P(B) > 0. Then,

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

Theorem

Suppose P(A) > 0 and P(B) > 0. Then,

P(B|A) = P(A|B)P(B) / P(A)


Page 43: Review of Probability Theory - KTI

Conditional Probability and Independence

Bayes rule

One of the most important concepts in statistical inference

Bayesian statistics

You start with a probabilistic model with parameters B

You observe data A and you are interested in the probability of parameters given the data


Page 44: Review of Probability Theory - KTI

Conditional Probability and Independence

Chain & partition rule

Theorem

If A1, A2, . . . , An are events and P(A1 ∩ · · · ∩ An−1) > 0, then

P(A1 ∩ · · · ∩ An) = P(A1)P(A2|A1)P(A3|A1 ∩ A2) . . . P(An|A1 ∩ · · · ∩ An−1)

Definition

A partition of Ω is a finite or countable collection (Bn) such that Bn ∈ A and:

(i) P(Bn) > 0, ∀n

(ii) Bi ∩ Bj = ∅, ∀i ≠ j (pairwise disjoint)

(iii) ∪i Bi = Ω


Page 45: Review of Probability Theory - KTI

Conditional Probability and Independence

Partition rule

[Venn diagram: Ω partitioned into pairwise disjoint sets Bn]

Page 46: Review of Probability Theory - KTI

Conditional Probability and Independence

Partition rule

Theorem

Let Bn, n ≥ 1 be a finite or countable partition of Ω. Then if A ∈ A:

P(A) = Σn P(A|Bn)P(Bn)

P(A) = Σn P(A ∩ Bn)


Page 47: Review of Probability Theory - KTI

Conditional Probability and Independence

Partition rule

[Venn diagram: the event A cut into pieces A ∩ Bn by the partition of Ω]

Page 48: Review of Probability Theory - KTI

Conditional Probability and Independence

Bayes rule revisited

Theorem

Let Bn, n ≥ 1 be a finite or countable partition of Ω and suppose P(A) > 0. Then

P(Bn|A) = P(A|Bn)P(Bn) / Σm P(A|Bm)P(Bm)


Page 49: Review of Probability Theory - KTI

Conditional Probability and Independence

Bayes rule: example

Medical tests

Donated blood is screened for AIDS. Suppose that if the blood is HIV positive the test will be positive in 99% of cases. The test also has a 5% false positive rate. In this age group one in ten thousand people are HIV positive. Suppose that a person is screened as positive. What is the probability that this person has AIDS?

Page 50: Review of Probability Theory - KTI

Conditional Probability and Independence

Bayes rule: example

P(A) = 0.0001

P(Ac) = 0.9999

P(P|A) = 0.99

P(N|A) = 0.01

P(P|Ac) = 0.05

P(N|Ac) = 0.95

P(A|P) = ?

Page 57: Review of Probability Theory - KTI

Conditional Probability and Independence

Bayes rule: example

P(A|P) = P(P|A)P(A) / P(P)

= P(P|A)P(A) / (P(P|A)P(A) + P(P|Ac)P(Ac))

= 0.00198
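A minimal Python sketch of this computation (variable names are ours, not from the slides):

p_a = 0.0001           # P(A): prevalence
p_pos_given_a = 0.99   # P(P|A): sensitivity
p_pos_given_ac = 0.05  # P(P|Ac): false positive rate

# partition rule: P(P) = P(P|A)P(A) + P(P|Ac)P(Ac)
p_pos = p_pos_given_a * p_a + p_pos_given_ac * (1 - p_a)

# Bayes rule
print(p_pos_given_a * p_a / p_pos)  # ≈ 0.00198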


Page 58: Review of Probability Theory - KTI

Conditional Probability and Independence

Bayes rule: example

The disease is so rare that the false positives outnumber the people who have the disease

E.g. what can we expect in a population of 1 million

100 will have the disease and 99 will be correctly diagnosed

999,900 will not have the disease but 49,995(!) will be falsely diagnosed

If your test is positive the likelihood that you have the disease is 99/(99 + 49995) ≈ 0.00198


Page 59: Review of Probability Theory - KTI

Conditional Probability and Independence

Bayes rule: example

Figure: From the book “Pattern Recognition and Machine Learning” by Bishop


Page 60: Review of Probability Theory - KTI

Conditional Probability and Independence

Bayes rule: example

P(R) = 2/5

P(B) = 3/5

P(A|B) = 3/4

P(O|B) = 1/4

P(A|R) = 1/4

P(O|R) = 3/4


Page 61: Review of Probability Theory - KTI

Conditional Probability and Independence

Bayes rule: example

Fruits

We select an orange. What is the probability that the box was red? P(R|O) = ?

P(R|O) = P(O|R)P(R) / P(O)

= P(O|R)P(R) / (P(O|R)P(R) + P(O|B)P(B))

= 2/3
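The same posterior as a small Python sketch (our variable names):

from fractions import Fraction

p_r, p_b = Fraction(2, 5), Fraction(3, 5)                  # box priors P(R), P(B)
p_o_given_r, p_o_given_b = Fraction(3, 4), Fraction(1, 4)  # P(O|R), P(O|B)

p_o = p_o_given_r * p_r + p_o_given_b * p_b  # partition rule
print(p_o_given_r * p_r / p_o)               # P(R|O) = 2/3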


Page 62: Review of Probability Theory - KTI

Random Variables

Random variables

We use random variables to refer to “random quantities”

E.g. we flip a coin 5 times and are interested in the number of heads

E.g. the lifetime of a light bulb

E.g. the number of occurrences of a word in a text document


Page 63: Review of Probability Theory - KTI

Random Variables

Random variables

Definition

Given a probability measure space (Ω,A,P) a random variable is a function X : Ω → R such that {ω ∈ Ω : X(ω) ≤ x} ∈ A, ∀x ∈ R

Remark

The condition is a technicality ensuring that the set is measurable. We just need to know that a random variable is a function that maps outcomes onto numbers.


Page 64: Review of Probability Theory - KTI

Random Variables

Random variables: example

We transmit 10 data packets over a communication channel

Events are of the form (S, S, S, F, F, S, S, S, S, S)

State space Ω contains all 2^10 possible sequences

What is the probability that we will observe n successful transmissions?

We associate a r.v. with the number of successful transmissions

The r.v. takes on the values 0, 1, . . . , 10


Page 65: Review of Probability Theory - KTI

Discrete Random Variables

Discrete random variables

Definition

A r.v. X is discrete if X (Ω) is countable (finite or countably infinite).

Remark

E.g. X(Ω) = {x1, x2, . . .}

Ω countable =⇒ X(Ω) is countable and X is discrete

These r.v. are called discrete random variables


Page 66: Review of Probability Theory - KTI

Discrete Random Variables

Discrete random variables

A discrete r.v. is characterized by its probability mass function (PMF)

pX(x) = P(X = x)

pX(x) = Σ_{ω : X(ω) = x} P(ω)

We will shorten the notation and write p(x)


Page 67: Review of Probability Theory - KTI

Discrete Random Variables

Discrete random variables: example

Die rolls

Let X be the sum of two die rolls.

Ω = {(1, 1), (1, 2), . . . , (6, 5), (6, 6)}

X(Ω) = {2, 3, . . . , 12}

E.g. X((1, 2)) = 3, X((2, 1)) = 3, X((3, 5)) = 8, . . .

E.g. p(3) = ?

p(3) = Σ_{ω : X(ω) = 3} P(ω) = P((1, 2)) + P((2, 1)) = 2/36


Page 68: Review of Probability Theory - KTI

Discrete Random Variables

Discrete random variables: example

E.g. p(4) = ?

p(4) = Σ_{ω : X(ω) = 4} P(ω) = P((1, 3)) + P((3, 1)) + P((2, 2)) = 3/36

p(x) = (x − 1)/36, 2 ≤ x ≤ 7
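A quick sketch (ours, not from the slides) that recovers this PMF by enumerating Ω:

from collections import Counter
from fractions import Fraction

# count how many of the 36 outcomes map to each sum x
counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}
print(pmf[3], pmf[4])  # 1/18 (= 2/36), 1/12 (= 3/36)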


Page 69: Review of Probability Theory - KTI

Discrete Random Variables

Joint probability mass function

We can introduce multiple r.v. on the same probability measure space (Ω,A,P)

Let X and Y be r.v. on that space, then the probability that X and Y take on values x and y is given by:

P({ω ∈ Ω : X(ω) = x, Y(ω) = y})

Shortly, we write:

P(X = x ,Y = y)


Page 70: Review of Probability Theory - KTI

Discrete Random Variables

Joint probability mass function

We define the joint PMF as:

pXY (x , y) = P(X = x ,Y = y)

Shortly, we write:

p(x , y)


Page 71: Review of Probability Theory - KTI

Discrete Random Variables

Joint PMF: example

Text classification

Suppose we have a collection of documents that are either about China or Japan (document topics). We model a word occurrence as an event ω in a probability space. Let Ω = all word occurrences. Let X be a r.v. that maps those occurrences to an enumeration of words and let Y be a r.v. that maps an occurrence to an enumeration of topics (either China or Japan). What is the joint PMF p(x, y)?

Document                    Class

Chinese Beijing Chinese     China

Chinese Chinese Shanghai    China

Chinese Macao               China

Tokyo Japan Chinese         Japan


Page 72: Review of Probability Theory - KTI

Discrete Random Variables

Joint PMF: example

Word Class

Chinese China

Beijing China

Chinese China

Chinese China

Chinese China

Shanghai China

Chinese China

Macao China

Tokyo Japan

Japan Japan

Chinese Japan


Page 73: Review of Probability Theory - KTI

Discrete Random Variables

Joint PMF: example

Y \ X    Chinese  Beijing  Shanghai  Macao  Tokyo  Japan

China       5        1         1       1      0      0

Japan       1        0         0       0      1      1


Page 74: Review of Probability Theory - KTI

Discrete Random Variables

Joint PMF: example

Y \ X    Chinese  Beijing  Shanghai  Macao  Tokyo  Japan

China     5/11     1/11     1/11     1/11     0      0

Japan     1/11      0        0         0    1/11   1/11


Page 75: Review of Probability Theory - KTI

Discrete Random Variables

Marginal PMF

p(x) and p(y) are called marginal probability mass functions

Remark

p(x) = Σy p(x, y)

p(y) = Σx p(x, y)


Page 76: Review of Probability Theory - KTI

Discrete Random Variables

Marginal PMF: example

Y \ X    Chinese  Beijing  Shanghai  Macao  Tokyo  Japan   p(y)

China     5/11     1/11     1/11     1/11     0      0     8/11

Japan     1/11      0        0         0    1/11   1/11    3/11

p(x)      6/11     1/11     1/11     1/11   1/11   1/11
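The marginals in this table can be checked with a short sketch (our variable names):

from fractions import Fraction

words = ["Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan"]
joint = {
    "China": [Fraction(n, 11) for n in (5, 1, 1, 1, 0, 0)],
    "Japan": [Fraction(n, 11) for n in (1, 0, 0, 0, 1, 1)],
}
p_y = {y: sum(row) for y, row in joint.items()}                     # p(y): 8/11, 3/11
p_x = [sum(joint[y][i] for y in joint) for i in range(len(words))]  # p(x) per word
print(p_y, p_x[0])  # p(Chinese) = 6/11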


Page 77: Review of Probability Theory - KTI

Discrete Random Variables

Conditional PMF

Definition

A conditional probability mass function is defined as:

p(x|y) = p(x, y) / p(y)

Again, we can easily establish the connection to the underlying probability space and events


Page 78: Review of Probability Theory - KTI

Discrete Random Variables

Conditional PMF: example

x             Chinese  Beijing  Shanghai  Macao  Tokyo  Japan

p(x|China)      5/8      1/8      1/8      1/8     0      0

p(x|Japan)      1/3       0        0        0     1/3    1/3
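A self-contained sketch of p(x|China) from the joint table (our variable names):

from fractions import Fraction

words = ["Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan"]
joint_china = [Fraction(n, 11) for n in (5, 1, 1, 1, 0, 0)]  # p(x, China)
p_china = sum(joint_china)                                   # p(China) = 8/11
print([v / p_china for v in joint_china])  # 5/8, 1/8, 1/8, 1/8, 0, 0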


Page 79: Review of Probability Theory - KTI

Discrete Random Variables

Independence

Definition

Two r.v. X and Y are independent if:

p(x , y) = p(x)p(y), ∀x , y ∈ R

All rules are equivalent to the rules for events, we just work with PMFs instead.


Page 80: Review of Probability Theory - KTI

Discrete Random Variables

Joint PMF

In general we can have many r.v. defined on the same probability measure space Ω

X1, . . . ,Xn

We define the joint PMF as:

p(x1, . . . , xn) = P(X1 = x1, . . . ,Xn = xn)


Page 81: Review of Probability Theory - KTI

Discrete Random Variables

Common discrete random variables

Certain random variables commonly appear in nature and applications

Bernoulli random variable

Binomial random variable

Geometric random variable

Poisson random variable

Power-law random variable


Page 82: Review of Probability Theory - KTI

Discrete Random Variables

Common discrete random variables

IPython Notebook examples

http://kti.tugraz.at/staff/denis/courses/kddm1/pmf.ipynb

Command Line

ipython notebook --pylab=inline pmf.ipynb


Page 83: Review of Probability Theory - KTI

Discrete Random Variables

Bernoulli random variable

PMF

p(x) = 1 − p if x = 0, and p if x = 1

Bernoulli r.v. with parameter p

Models situations with two outcomes

E.g. we start a task on a cluster node. Does the node fail (X = 0) or successfully finish the task (X = 1)?
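A minimal sketch with scipy.stats (assumed available, as in the course notebooks):

from scipy.stats import bernoulli

p = 0.3
print(bernoulli.pmf([0, 1], p))   # [0.7 0.3]
print(bernoulli.rvs(p, size=10))  # ten simulated Bernoulli trials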


Page 84: Review of Probability Theory - KTI

Discrete Random Variables

Bernoulli random variable

[Figure: probability mass function of a Bernoulli random variable for p = 0.3 and p = 0.6]

Page 85: Review of Probability Theory - KTI

Discrete Random Variables

Binomial random variable

Suppose X1, . . . , Xn are independent and identically distributed Bernoulli r.v.

The Binomial r.v. with parameters (p, n) is

Y = X1 + · · ·+ Xn

Models the number of successes in n Bernoulli trials


Page 86: Review of Probability Theory - KTI

Discrete Random Variables

Binomial random variable

Cluster nodes

We start tasks on n cluster nodes. How many nodes successfully finish their task?

Probability of a single cluster configuration with k successes:

p(ω) = (1 − p)^(n−k) p^k


Page 87: Review of Probability Theory - KTI

Discrete Random Variables

Binomial random variable

How many configurations with k successes exist?

p(k) = N(k) (1 − p)^(n−k) p^k

N(k) = (n choose k) = n! / ((n − k)! k!)


Page 88: Review of Probability Theory - KTI

Discrete Random Variables

Binomial random variable

PMF

p(k) =

(n

k

)(1− p)n−kpk

E.g. how many heads we get in n coin flips

E.g. how many packets we transmit over n communication channels
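A short sketch with scipy.stats (assumed available):

from scipy.stats import binom

n, p = 20, 0.6
print(binom.pmf(12, n, p))                  # P(exactly 12 successes in 20 trials)
print(binom.pmf(range(n + 1), n, p).sum())  # the PMF sums to 1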


Page 89: Review of Probability Theory - KTI

Discrete Random Variables

Binomial random variable

[Figure: probability mass function of a Binomial random variable for p = 0.1 and p = 0.6]

Page 90: Review of Probability Theory - KTI

Discrete Random Variables

Power-law (Zipf) random variable

The power-law distribution is a very commonly occurring distribution

Word occurrences in natural language

Friendships in a social network

Links on the web

PageRank, etc.


Page 91: Review of Probability Theory - KTI

Discrete Random Variables

Power-law (Zipf) random variable

PMF

p(k) =k−α

ζ(α)

k ∈ N, k ≥ 1, α > 1

ζ(α) is the Riemann zeta function

ζ(α) =∞∑k=1

k−α
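A short sketch with scipy.stats, which implements this PMF (scipy calls the exponent a):

from scipy.stats import zipf

alpha = 2.0
print(zipf.pmf([1, 2, 3], alpha))  # k**(-alpha) / zeta(alpha) for k = 1, 2, 3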


Page 92: Review of Probability Theory - KTI

Discrete Random Variables

Power-law (Zipf) random variable

[Figure: probability mass function of a Zipf random variable for α = 2.0 and α = 3.0]

Page 93: Review of Probability Theory - KTI

Discrete Random Variables

Power-law (Zipf) random variable

[Figure: probability mass function of a Zipf random variable for α = 2.0 and α = 3.0, on a log scale]

Page 94: Review of Probability Theory - KTI

Discrete Random Variables

Expectation

Definition

The expectation of a discrete r.v. X with PMF p is

E[X] = Σ_{x ∈ X(Ω)} x p(x)

when this sum is “well-defined”, otherwise the expectation does not exist.

Remark

(i) “Well-defined”: it could be infinite, but it should not alternate between −∞ and ∞

(ii) Expectation is the average value of a r.v.


Page 95: Review of Probability Theory - KTI

Discrete Random Variables

Expectation: example

Gambling game

We play repeatedly a gambling game. Each time we play we either win10¿ or lose 10¿. What are our average winnings?

Let wk be our winning for game k . Then the average winning in ngames is:

W =w1 + · · ·+ wn

n


Page 96: Review of Probability Theory - KTI

Discrete Random Variables

Expectation: example

Let nW be the number of wins and nL the number of losses. Then,

W = (10nW − 10nL) / n = 10(nW/n) − 10(nL/n)

If we approximate P(win) ≈ nW/n and P(loss) ≈ nL/n, then

W = 10P(win) − 10P(loss) = Σ_{x ∈ X(Ω)} x p(x)
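A simulation sketch (ours), assuming a fair game (P(win) = P(loss) = 1/2), showing the sample average approaching E[X] = 10·(1/2) − 10·(1/2) = 0:

import random

random.seed(0)
wins = [random.choice([10, -10]) for _ in range(100_000)]
print(sum(wins) / len(wins))  # close to 0, the expectation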


Page 97: Review of Probability Theory - KTI

Discrete Random Variables

Linearity of expectation

Theorem

Suppose X and Y are discrete r.v. such that E[X] < ∞ and E[Y] < ∞. Then,

E[aX] = aE[X], ∀a ∈ R

E[X + Y] = E[X] + E[Y]


Page 98: Review of Probability Theory - KTI

Discrete Random Variables

Variance

Definition

The variance σ²(X), var(X) of a discrete r.v. X is the expectation of the r.v. (X − E[X])²

var(X) = E[(X − E[X])²]

Remark

Variance indicates how close X typically is to E[X]

var(X) = E[X²] − (E[X])²


Page 99: Review of Probability Theory - KTI

Discrete Random Variables

Covariance

Definition

The covariance cov(X, Y) of two discrete r.v. X and Y is the expectation of the r.v. (X − E[X])(Y − E[Y])

cov(X, Y) = E[(X − E[X])(Y − E[Y])]

cov(X, Y) = E[XY] − E[X]E[Y]


Page 100: Review of Probability Theory - KTI

Discrete Random Variables

Covariance

Remark

Covariance measures how much two r.v. change together, i.e. do X and Y tend to be small together, or is X large when Y is small (or vice versa), or do they change independently of each other

If greater values of X correspond with greater values of Y, and the same holds for small values, then cov(X, Y) > 0

In the opposite case cov(X ,Y ) < 0


Page 101: Review of Probability Theory - KTI

Discrete Random Variables

Covariance

If X and Y are independent then cov(X ,Y ) = 0

This follows because E [XY ] = E [X ]E [Y ] in the case of independence

Is the converse true?

If cov(X, Y) = 0, are X and Y independent?


Page 102: Review of Probability Theory - KTI

Discrete Random Variables

Covariance

Covariance and independence

Suppose X takes on values −2, −1, 1, 2 with equal probability. Suppose Y = X².

cov(X, Y) = E[XY] − E[X]E[Y] = E[X³] = 0

Clearly X and Y are not independent

They are uncorrelated, but not independent
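The exact computation as a sketch (ours):

from fractions import Fraction

xs = [-2, -1, 1, 2]
p = Fraction(1, 4)                # equal probabilities
e_x = sum(p * x for x in xs)      # E[X] = 0
e_y = sum(p * x**2 for x in xs)   # E[Y] = E[X^2]
e_xy = sum(p * x**3 for x in xs)  # E[XY] = E[X^3] = 0
print(e_xy - e_x * e_y)           # cov(X, Y) = 0, yet Y = X^2 depends on X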


Page 103: Review of Probability Theory - KTI

Discrete Random Variables

Covariance

Remark

(i) Independence of X and Y =⇒ cov(X, Y) = 0

(ii) cov(X, Y) = 0 does not imply independence of X and Y


Page 104: Review of Probability Theory - KTI

Continuous Random Variables

Continuous random variables

Definition

A r.v. X is continuous (general) if X (Ω) is uncountable.

A general description is given by P(X ∈ (−∞, x ])

We define the cumulative distribution function (CDF) of X as:

FX (x) = P(X ∈ (−∞, x ])

We will write F(x) and P(X ≤ x) for short


Page 105: Review of Probability Theory - KTI

Continuous Random Variables

Cumulative distribution function (CDF)

Definition

A cumulative distribution function (CDF) is a function F : R→ R suchthat

(i) F is non-decreasing (x ≤ y =⇒ F(x) ≤ F(y))

(ii) F is right-continuous (lim_{x↓a} F(x) = F(a))

(iii) lim_{x→∞} F(x) = 1

(iv) lim_{x→−∞} F(x) = 0


Page 106: Review of Probability Theory - KTI

Continuous Random Variables

CDF: non-decreasing

[Figure: a non-decreasing CDF]

Page 107: Review of Probability Theory - KTI

Continuous Random Variables

CDF: right-continuous

[Figure: a right-continuous CDF]

Page 108: Review of Probability Theory - KTI

Continuous Random Variables

CDF: infinity limits

[Figure: a CDF approaching 0 as x → −∞ and 1 as x → ∞]

Page 109: Review of Probability Theory - KTI

Continuous Random Variables

Probability density function (PDF)

Suppose that the CDF is continuous and differentiable

Definition

A probability density function (PDF) of a r.v. X is defined as:

f(x) = dF(x)/dx

Definition

A joint PDF of two r.v. X and Y is defined as:

f(x, y) = ∂²F(x, y)/∂x∂y


Page 110: Review of Probability Theory - KTI

Continuous Random Variables

Probability density function (PDF)

Definition

Suppose X and Y are two r.v. defined on the same probability measure space. The conditional PDF of X given Y is defined as:

f(x|y) = f(x, y) / f(y)


Page 111: Review of Probability Theory - KTI

Continuous Random Variables

Expectation

Definition

Expectation E[X] of a r.v. X with a PDF f(x) is defined as:

E[X] = ∫_{−∞}^{∞} x f(x) dx

In a similar way we define variance and covariance for a joint PDF


Page 112: Review of Probability Theory - KTI

Continuous Random Variables

Common continuous random variables

Certain random variables commonly appear in nature and applications

Exponential random variable

Normal (Gaussian) random variable

Power-law random variable


Page 113: Review of Probability Theory - KTI

Continuous Random Variables

Common continuous random variables

IPython Notebook examples

http://kti.tugraz.at/staff/denis/courses/kddm1/pdf.ipynb

Command Line

ipython notebook --pylab=inline pdf.ipynb


Page 114: Review of Probability Theory - KTI

Continuous Random Variables

Normal (Gaussian) random variable

The normal distribution is a very commonly occurring distribution

Continuous approximation to the binomial for large n and p not too close to either 0 or 1

Continuous approximation to the Poisson distribution with large λ

Measurement errors

Student grades

Measures of sizes of living organisms


Page 115: Review of Probability Theory - KTI

Continuous Random Variables

Normal random variable

PDF

f(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))

µ is the mean (expectation) and σ² is the variance of a normally distributed r.v.

CDF

F(x) = Φ((x − µ)/σ), Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^(−x′²/2) dx′
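A small sketch with scipy.stats (assumed available, as in the course notebooks):

from scipy.stats import norm

mu, sigma = 0.0, 1.0
print(norm.pdf(0.0, loc=mu, scale=sigma))   # 1/sqrt(2π) ≈ 0.3989
print(norm.cdf(1.96, loc=mu, scale=sigma))  # Φ(1.96) ≈ 0.975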


Page 116: Review of Probability Theory - KTI

Continuous Random Variables

Normal random variable

[Figure: PDF of a Normal random variable for µ = 0.0, σ = 1.0 and µ = −2.0, σ = 2.0]
