
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 1

L2: Review of probability and statistics

• Probability
  – Definition of probability
  – Axioms and properties
  – Conditional probability
  – Bayes theorem
• Random variables
  – Definition of a random variable
  – Cumulative distribution function
  – Probability density function
  – Statistical characterization of random variables
• Random vectors
  – Mean vector
  – Covariance matrix
• The Gaussian random variable


Review of probability theory

• Definitions (informal)
  – Probabilities are numbers assigned to events that indicate "how likely" it is that the event will occur when a random experiment is performed
  – A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment
  – The sample space S of a random experiment is the set of all possible outcomes
• Axioms of probability
  – Axiom I: P[Ai] ≥ 0
  – Axiom II: P[S] = 1
  – Axiom III: Ai ∩ Aj = ∅ ⇒ P[Ai ∪ Aj] = P[Ai] + P[Aj]

[Figure: a probability law maps the events A1–A4 in the sample space to their probabilities]


• Warm-up exercise
  – I show you three colored cards
    • One BLUE on both sides
    • One RED on both sides
    • One BLUE on one side, RED on the other
  – I shuffle the three cards, then pick one and show you one side only. The side visible to you is RED
    • Obviously, the card has to be either A or C, right?
  – I am willing to bet $1 that the other side of the card has the same color, and need someone in class to bet another $1 that it is the other color
    • On average we will end up even, right?
    • Let's try it!

[Figure: the three cards A, B, C]
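The intuition that the bet is even is wrong: two of the three red faces belong to the all-red card, so the hidden side matches the visible one with probability 2/3, not 1/2. A quick Monte Carlo check (a sketch, not part of the slides) makes this concrete:

```python
import random

def simulate_card_bet(n_trials=100_000, seed=0):
    """Draw a card, show a random side; among trials where the visible
    side is RED, estimate how often the hidden side is also RED."""
    rng = random.Random(seed)
    cards = [("B", "B"), ("R", "R"), ("B", "R")]  # cards A, B, C
    same = total = 0
    while total < n_trials:
        card = rng.choice(cards)
        side = rng.randrange(2)          # side shown to the class
        if card[side] != "R":
            continue                     # we only bet when RED is visible
        total += 1
        if card[1 - side] == "R":
            same += 1
    return same / total

print(simulate_card_bet())  # ≈ 2/3: the bet favors the instructor
```

The key is that conditioning on "a red face is showing" weights the all-red card twice as heavily as the mixed card.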


• More properties of probability
  – P[Aᶜ] = 1 − P[A]
  – P[A] ≤ 1
  – P[∅] = 0
  – Given A1 … AN with Ai ∩ Aj = ∅ for all i ≠ j, then P[⋃_{k=1}^N Ak] = Σ_{k=1}^N P[Ak]
  – P[A1 ∪ A2] = P[A1] + P[A2] − P[A1 ∩ A2]
  – P[⋃_{k=1}^N Ak] = Σ_{k=1}^N P[Ak] − Σ_{j<k} P[Aj ∩ Ak] + ⋯ + (−1)^{N+1} P[A1 ∩ A2 ∩ ⋯ ∩ AN]
  – A1 ⊂ A2 ⇒ P[A1] ≤ P[A2]


• Conditional probability
  – If A and B are two events, the probability of event A when we already know that event B has occurred is

    P[A|B] = P[A ∩ B] / P[B]   if P[B] > 0

  • This conditional probability P[A|B] is read:
    – the "conditional probability of A conditioned on B", or simply
    – the "probability of A given B"
  – Interpretation
    • The new evidence "B has occurred" has the following effects
    • The original sample space S (the square) becomes B (the rightmost circle)
    • The event A becomes A ∩ B
    • P[B] simply re-normalizes the probability of events that occur jointly with B

[Figure: once B has occurred, the sample space S shrinks to B and the event A shrinks to A ∩ B]


• Theorem of total probability
  – Let B1, B2 … BN be a partition of S (mutually exclusive events that add up to S)
  – Any event A can be represented as

    A = A ∩ S = A ∩ (B1 ∪ B2 … ∪ BN) = (A ∩ B1) ∪ (A ∩ B2) … ∪ (A ∩ BN)

  – Since B1, B2 … BN are mutually exclusive,

    P[A] = P[A ∩ B1] + P[A ∩ B2] + ⋯ + P[A ∩ BN]

  – and, therefore,

    P[A] = P[A|B1]P[B1] + ⋯ + P[A|BN]P[BN] = Σ_{k=1}^N P[A|Bk]P[Bk]

[Figure: an event A overlapping a partition B1 … BN of the sample space]


• Bayes theorem
  – Assume B1, B2 … BN is a partition of S
  – Suppose that event A occurs
  – What is the probability of event Bj?
  – Using the definition of conditional probability and the theorem of total probability we obtain

    P[Bj|A] = P[A ∩ Bj] / P[A] = P[A|Bj]P[Bj] / Σ_{k=1}^N P[A|Bk]P[Bk]

  – This is known as Bayes theorem or Bayes rule, and is (one of) the most useful relations in probability and statistics


• Bayes theorem and statistical pattern recognition
  – When used for pattern classification, Bayes theorem is generally expressed as

    P[ωj|x] = p(x|ωj)P[ωj] / Σ_{k=1}^N p(x|ωk)P[ωk] = p(x|ωj)P[ωj] / p(x)

  • where ωj is the j-th class (e.g., phoneme) and x is the feature/observation vector (e.g., vector of MFCCs)
  – A typical decision rule is to choose the class ωj with the highest P[ωj|x]
    • Intuitively, we choose the class that is more "likely" given observation x
  – Each term in Bayes theorem has a special name
    • P[ωj]    prior probability (of class ωj)
    • P[ωj|x]  posterior probability (of class ωj given the observation x)
    • p(x|ωj)  likelihood (probability of observation x given class ωj)
    • p(x)     normalization constant (does not affect the decision)


• Example
  – Consider a clinical problem where we need to decide if a patient has a particular medical condition on the basis of an imperfect test
    • Someone with the condition may go undetected (false negative)
    • Someone free of the condition may yield a positive result (false positive)
  – Nomenclature
    • The true-negative rate P(NEG|¬COND) of a test is called its SPECIFICITY
    • The true-positive rate P(POS|COND) of a test is called its SENSITIVITY
  – Problem
    • Assume a population of 10,000 with a 1% prevalence of the condition
    • Assume that we design a test with 98% specificity and 90% sensitivity
    • Assume you take the test, and the result comes out POSITIVE
    • What is the probability that you have the condition?
  – Solution
    • Fill in the joint frequency table below, or
    • Apply Bayes rule


                    TEST IS POSITIVE          TEST IS NEGATIVE          ROW TOTAL
HAS CONDITION       True-positive             False-negative            100
                    P(POS|COND)               P(NEG|COND)
                    100×0.90 = 90             100×(1−0.90) = 10
FREE OF CONDITION   False-positive            True-negative             9,900
                    P(POS|¬COND)              P(NEG|¬COND)
                    9,900×(1−0.98) = 198      9,900×0.98 = 9,702
COLUMN TOTAL        288                       9,712                     10,000



– Applying Bayes rule

  P[cond|+] = P[+|cond]P[cond] / P[+]
            = P[+|cond]P[cond] / (P[+|cond]P[cond] + P[+|¬cond]P[¬cond])
            = (0.90 × 0.01) / (0.90 × 0.01 + (1 − 0.98) × 0.99)
            = 0.3125
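The same computation as a small Python sketch (the function name posterior_positive is ours, not from the slides); the denominator is the total-probability expansion of P[+]:

```python
def posterior_positive(sensitivity, specificity, prevalence):
    """P[cond|+] via Bayes rule: sensitivity = P[+|cond],
    1 - specificity = P[+|not cond] (the false-positive rate)."""
    numerator = sensitivity * prevalence                     # P[+|cond] P[cond]
    evidence = numerator + (1.0 - specificity) * (1.0 - prevalence)  # P[+]
    return numerator / evidence

print(posterior_positive(0.90, 0.98, 0.01))  # 0.3125
```

Despite the 90% sensitivity, a positive result implies only a 31% chance of having the condition, because the 1% prevalence makes false positives dominate.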


• Random variables
  – When we perform a random experiment we are usually interested in some measurement or numerical attribute of the outcome
    • e.g., weights in a population of subjects, execution times when benchmarking CPUs, shape parameters when performing ATR
  – These examples lead to the concept of a random variable
    • A random variable X is a function that assigns a real number X(ξ) to each outcome ξ in the sample space of a random experiment
    • X(ξ) maps from all possible outcomes in sample space onto the real line
  – The function that assigns values to each outcome is fixed and deterministic, as in the rule "count the number of heads in three coin tosses"
    • Randomness in X is due to the underlying randomness of the outcome ξ of the experiment
  – Random variables can be
    • Discrete, e.g., the resulting number after rolling a die
    • Continuous, e.g., the weight of a sampled individual

[Figure: x = X(ξ) maps each outcome ξ in the sample space S to a point x on the real line, with range Sx]


• Cumulative distribution function (cdf)
  – The cumulative distribution function F_X(x) of a random variable X is defined as the probability of the event {X ≤ x}

    F_X(x) = P[X ≤ x]   for −∞ < x < ∞

  – Intuitively, F_X(b) is the long-term proportion of times when X(ξ) ≤ b
  – Properties of the cdf
    • 0 ≤ F_X(x) ≤ 1
    • lim_{x→∞} F_X(x) = 1
    • lim_{x→−∞} F_X(x) = 0
    • F_X(a) ≤ F_X(b) if a ≤ b
    • F_X(b) = lim_{h→0} F_X(b + h) = F_X(b⁺), i.e., the cdf is continuous from the right

[Figures: staircase cdf for rolling a die, with steps of 1/6 at x = 1 … 6, and a smooth cdf for a person's weight over roughly 100–500 lb]
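The staircase cdf for a fair die can be written down directly; a minimal sketch (our own, not in the slides) that also checks the limit and monotonicity properties:

```python
from fractions import Fraction

def die_cdf(x):
    """cdf of a fair six-sided die: F_X(x) = P[X <= x]."""
    if x < 1:
        return Fraction(0)               # no mass below the smallest face
    return Fraction(min(int(x), 6), 6)   # staircase with steps of 1/6

assert die_cdf(0.5) == 0                 # limit at -infinity
assert die_cdf(3) == Fraction(1, 2)
assert die_cdf(3.9) == Fraction(1, 2)    # flat between the jumps
assert die_cdf(10) == 1                  # limit at +infinity
```

Note that the function jumps at each face value but stays constant in between, which is exactly the right-continuous staircase in the figure.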


• Probability density function (pdf)
  – The probability density function f_X(x) of a continuous random variable X, if it exists, is defined as the derivative of F_X(x)

    f_X(x) = dF_X(x)/dx

  – For discrete random variables, the equivalent to the pdf is the probability mass function

    f_X(x) = ΔF_X(x)/Δx

  – Properties
    • f_X(x) ≥ 0
    • P[a < x < b] = ∫_a^b f_X(x) dx
    • F_X(x) = ∫_{−∞}^x f_X(u) du
    • 1 = ∫_{−∞}^∞ f_X(x) dx
    • f_X(x|A) = (d/dx) F_X(x|A), where F_X(x|A) = P[{X < x} ∩ A] / P[A] if P[A] > 0

[Figures: smooth pdf for a person's weight over roughly 100–500 lb, and pmf for rolling a (fair) die, with mass 1/6 on each of x = 1 … 6]


• What is the probability of somebody weighing 200 lb?
  • According to the pdf, this is about 0.62
  • This number seems reasonable, right?
• Now, what is the probability of somebody weighing 124.876 lb?
  • According to the pdf, this is about 0.43
  • But, intuitively, we know that the probability should be zero (or very, very small)
• How do we explain this paradox?
  • The pdf DOES NOT define a probability, but a probability DENSITY!
  • To obtain the actual probability we must integrate the pdf over an interval
  • So we should have asked: what is the probability of somebody weighing 124.876 lb plus or minus 2 lb?
• The probability mass function is a 'true' probability (the reason why we call it a 'mass' as opposed to a 'density')
  • The pmf indicates that the probability of any number when rolling a fair die is the same for all numbers, equal to 1/6, a perfectly legitimate answer
  • The pmf DOES NOT need to be integrated to obtain the probability (it cannot be integrated in the first place)

[Figures: pmf for rolling a (fair) die and pdf for a person's weight, repeated from the previous slide]
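To make the density-vs-probability point concrete, here is a sketch (our own, not from the slides) that assumes the weight pdf is Gaussian with hypothetical parameters μ = 160 lb, σ = 30 lb, and integrates it over the ±2 lb interval around 124.876 lb:

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian density: a density value, NOT a probability."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def prob_interval(a, b, mu, sigma, n=10_000):
    """P[a < X < b] by integrating the pdf with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    total += sum(normal_pdf(a + i * h, mu, sigma) for i in range(1, n))
    return total * h

mu, sigma = 160.0, 30.0   # hypothetical weight distribution in lb
density = normal_pdf(124.876, mu, sigma)
prob = prob_interval(124.876 - 2, 124.876 + 2, mu, sigma)
print(density, prob)  # probability ≈ density x interval width, and both are small
```

Shrinking the interval toward zero width drives the probability to zero even though the density value stays fixed, which resolves the paradox.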


• Statistical characterization of random variables
  – The cdf or the pdf are SUFFICIENT to fully characterize a r.v.
  – However, a r.v. can be PARTIALLY characterized with other measures
  – Expectation (center of mass of a density)

    E[X] = μ = ∫_{−∞}^∞ x f_X(x) dx

  – Variance (spread about the mean)

    var[X] = σ² = E[(X − E[X])²] = ∫_{−∞}^∞ (x − μ)² f_X(x) dx

  – Standard deviation

    std[X] = σ = var[X]^{1/2}

  – N-th moment

    E[X^N] = ∫_{−∞}^∞ x^N f_X(x) dx
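These moments can also be estimated from samples; a sketch (our own example, not from the slides) using an Exp(1) distribution, whose true mean and variance are both 1 and whose third moment is 3! = 6:

```python
import random
import statistics

rng = random.Random(0)
samples = [rng.expovariate(1.0) for _ in range(200_000)]

mean = statistics.fmean(samples)                          # estimate of E[X]
var = statistics.fmean((x - mean) ** 2 for x in samples)  # estimate of var[X]
third = statistics.fmean(x ** 3 for x in samples)         # estimate of E[X^3]
print(mean, var, third)  # all three approach 1, 1, 6 as the sample grows
```

The sums over samples play the role of the integrals against f_X(x) above.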


• Random vectors
  – An extension of the concept of a random variable
    • A random vector X is a function that assigns a vector of real numbers to each outcome ξ in sample space S
    • We generally denote a random vector by a column vector
  – The notions of cdf and pdf are replaced by 'joint cdf' and 'joint pdf'
    • Given random vector X = [x1, x2 … xN]ᵀ we define the joint cdf as

      F_X(x) = P[{X1 ≤ x1} ∩ {X2 ≤ x2} … ∩ {XN ≤ xN}]

    • and the joint pdf as

      f_X(x) = ∂ᴺ F_X(x) / (∂x1 ∂x2 … ∂xN)

  – The term marginal pdf is used to represent the pdf of a subset of all the random vector dimensions
    • A marginal pdf is obtained by integrating out the variables that are of no interest
    • e.g., for a 2D random vector X = [x1, x2]ᵀ, the marginal pdf of x1 is

      f_{X1}(x1) = ∫_{x2=−∞}^{x2=+∞} f_{X1X2}(x1, x2) dx2
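A numerical sketch of marginalization (our own example): for two independent standard Gaussians, integrating the joint pdf over x2 must recover the 1-D standard Gaussian in x1:

```python
import math

def joint_pdf(x1, x2):
    """Hypothetical joint density: two independent standard Gaussians."""
    return math.exp(-(x1 ** 2 + x2 ** 2) / 2) / (2 * math.pi)

def marginal_x1(x1, lo=-8.0, hi=8.0, n=4000):
    """f_X1(x1) = integral of f_X1X2(x1, x2) dx2 (trapezoidal rule)."""
    h = (hi - lo) / n
    total = 0.5 * (joint_pdf(x1, lo) + joint_pdf(x1, hi))
    total += sum(joint_pdf(x1, lo + i * h) for i in range(1, n))
    return total * h

# the marginal of this joint density is the 1-D standard Gaussian
print(marginal_x1(0.0))  # ≈ 1/sqrt(2*pi) ≈ 0.3989
```

The finite integration limits ±8 stand in for ±∞, which is safe here because the Gaussian tails are negligible beyond a few σ.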


• Statistical characterization of random vectors
  – A random vector is also fully characterized by its joint cdf or joint pdf
  – Alternatively, we can (partially) describe a random vector with measures similar to those defined for scalar random variables
  – Mean vector

    E[X] = μ = [E[X1], E[X2] … E[XN]]ᵀ = [μ1, μ2, … μN]ᵀ

  – Covariance matrix

    cov[X] = Σ = E[(X − μ)(X − μ)ᵀ]

          = | E[(x1−μ1)²]         …  E[(x1−μ1)(xN−μN)] |
            |       ⋮             ⋱          ⋮          |
            | E[(x1−μ1)(xN−μN)]   …  E[(xN−μN)²]       |

          = | σ1²  …  c1N |
            |  ⋮   ⋱   ⋮  |
            | c1N  …  σN² |


– The covariance matrix indicates the tendency of each pair of features (dimensions in a random vector) to vary together, i.e., to co-vary
  • The covariance has several important properties
    – If xi and xk tend to increase together, then cik > 0
    – If xi tends to decrease when xk increases, then cik < 0
    – If xi and xk are uncorrelated, then cik = 0
    – |cik| ≤ σiσk, where σi is the standard deviation of xi
    – cii = σi² = var(xi)
  • The covariance terms can be expressed as cii = σi² and cik = ρikσiσk
    – where ρik is called the correlation coefficient

[Figure: scatter plots of (Xi, Xk) for ρik = −1, −½, 0, +½, +1, i.e., Cik = −σiσk, −½σiσk, 0, +½σiσk, +σiσk]


A numerical example

• Given the following samples from a 3D distribution
  – Compute the covariance matrix
  – Generate scatter plots for every pair of variables
  – Can you observe any relationships between the covariance and the scatter plots?
  – You may work your solution in the template below

  Variables (or features)

  Examples   x1   x2   x3
     1        2    2    4
     2        3    4    6
     3        5    4    2
     4        6    6    4

[Worksheet template: one row per example plus an Average row, with columns x1, x2, x3, (x1−μ1), (x2−μ2), (x3−μ3), (x1−μ1)², (x2−μ2)², (x3−μ3)², (x1−μ1)(x2−μ2), (x1−μ1)(x3−μ3), (x2−μ2)(x3−μ3)]
[Figure: empty axes for the scatter plots of (x1,x2), (x1,x3), (x2,x3)]
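For reference, the covariance matrix of the four samples above can be computed directly (a sketch using the same divide-by-N averaging as the worksheet template):

```python
import numpy as np

# the four 3-D samples from the exercise
X = np.array([[2, 2, 4],
              [3, 4, 6],
              [5, 4, 2],
              [6, 6, 4]], dtype=float)

mu = X.mean(axis=0)           # mean vector: [4, 4, 4]
D = X - mu                    # deviations from the mean
Sigma = D.T @ D / len(X)      # covariance matrix, averaging over N = 4
print(mu)
print(Sigma)  # [[2.5, 2, -1], [2, 2, 0], [-1, 0, 2]]
```

The signs match what the scatter plots would show: x1 and x2 co-vary positively (c12 = 2), x1 and x3 negatively (c13 = −1), and x2 and x3 are uncorrelated (c23 = 0).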


• The Normal or Gaussian distribution
  – The multivariate Normal distribution N(μ, Σ) is defined as

    f_X(x) = 1 / ((2π)^{n/2} |Σ|^{1/2}) · exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))

  – For a single dimension, this expression reduces to

    f_X(x) = 1 / (√(2π) σ) · exp(−(x − μ)² / (2σ²))
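The multivariate density transcribes directly into code (a sketch; the name mvn_pdf is ours):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Normal density N(mu, Sigma) evaluated at x."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)       # (x-mu)^T Sigma^-1 (x-mu)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# sanity check: for n=1, mu=0, sigma=1 the density at 0 is 1/sqrt(2*pi)
print(mvn_pdf([0.0], [0.0], np.eye(1)))  # ≈ 0.3989
```

Solving the linear system instead of forming Σ⁻¹ explicitly is the standard numerically safer choice.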


– Gaussian distributions are very popular since
  • Parameters (μ, Σ) uniquely characterize the normal distribution
  • If all variables xi are uncorrelated (E[xixk] = E[xi]E[xk]), then
    – the variables are also independent (P[xixk] = P[xi]P[xk]), and
    – Σ is diagonal, with the individual variances on the main diagonal
  • Central Limit Theorem (next slide)
  • The marginal and conditional densities are also Gaussian
  • Any linear transformation of N jointly Gaussian r.v.'s results in N r.v.'s that are also Gaussian
    – For X = [X1 X2 … XN]ᵀ jointly Gaussian and A_{N×N} invertible, Y = AX is also jointly Gaussian, with

      f_Y(y) = f_X(A⁻¹y) / |A|


• Central Limit Theorem
  – Given any distribution with mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ²/N as the sample size N increases
    • No matter what the shape of the original distribution is, the sampling distribution of the mean approaches a normal distribution
    • N is the sample size used to compute the mean, not the overall number of samples in the data
  – Example: 500 experiments are performed using a uniform distribution
    • N = 1
      – One sample is drawn from the distribution and its mean is recorded (500 times)
      – The histogram resembles a uniform distribution, as one would expect
    • N = 4
      – Four samples are drawn and the mean of the four samples is recorded (500 times)
      – The histogram starts to look more Gaussian
    • As N grows, the shape of the histograms resembles a Normal distribution more closely

