Probability theory and statistical analysis: a...

Modeling Uncertainty in the Earth Sciences

Jef Caers

Stanford University

Probability theory and statistical analysis: a review

Concepts assumed known

Histograms, mean, median, spread, quantiles

Probability, conditional probability, Gaussian distribution, Random variable, probability density, cumulative distribution

Expectation, variance

Scatterplot, correlation coefficient

Concepts in statistics

Probability, Statistics: a review

Graphical analysis

But what if we need to compare two data sets and have only very few sample values?

Quantile-quantile plot

A p-quantile, with p ∈ [0,1] (equivalently, the 100 × p percentile), is the value such that a proportion p of the data (100 × p percent) does not exceed it.

Dataset 1: 34 21 8 7 10 15
Dataset 2: 16 22 5 9 11 37

Rank-order:

Percentile   1/6   2/6   3/6   4/6   5/6   6/6
Dataset 1      7     8    10    15    21    34
Dataset 2      5     9    11    16    22    37

Plotting

[Figure: quantile-quantile plot of the rank-ordered values, Dataset 1 on the horizontal axis (0 to 40) against Dataset 2 on the vertical axis (0 to 40).]
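The rank-ordering step behind the quantile-quantile plot can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `qq_pairs` is hypothetical, and the sketch assumes both samples have the same size, so the k-th smallest values can be paired directly.

```python
# Q-Q pairs for two equally sized samples (values taken from the example above).
dataset1 = [34, 21, 8, 7, 10, 15]
dataset2 = [16, 22, 5, 9, 11, 37]


def qq_pairs(a, b):
    """Pair the k-th smallest value of a with the k-th smallest value of b.

    Hypothetical helper: assumes equal sample sizes, so no interpolation
    between quantiles is needed.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes samples of equal size")
    return list(zip(sorted(a), sorted(b)))


pairs = qq_pairs(dataset1, dataset2)
n = len(pairs)
for k, (q1, q2) in enumerate(pairs, start=1):
    # Each pair is one point of the Q-Q plot, at percentile k/n.
    print(f"p = {k}/{n}: dataset 1 -> {q1}, dataset 2 -> {q2}")
```

Points that fall close to the 45-degree line indicate that the two data sets have similar distributions.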

Concepts in probability theory

Probability

“There is a 60% probability/chance of finding iron ore in this region”

Interpretation 1: The geologist feels that, over the long run, 60% of the similar regions that he/she studies will actually yield iron ore.

Interpretation 2: The geologist assesses, based on his/her expertise and prior knowledge, that it is more likely than not that the region will contain iron ore. Here 60/100 is a quantitative measure of the geologist’s assessment of the hypothesis that the region will contain iron ore, where 0/100 means there is certainly no iron ore and 100/100 means there is certainly iron ore.

Conditional probability

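$P(E \text{ occurs} \mid F \text{ occurs}) = P(E \mid F)$

$P(E \mid F) = \dfrac{\text{surface of intersection of circles } E \text{ and } F}{\text{surface of circle } F} = \dfrac{P(E \text{ and } F)}{P(F)}$

[Figure: Venn diagram showing the events E and F as overlapping circles inside the set S.]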

Bayes’ rule

$P(E \mid F) = \dfrac{P(F \mid E)\,P(E)}{P(F)} \;\propto\; P(F \mid E)\,P(E)$

Posterior: $P(E \mid F)$
Likelihood: $P(F \mid E)$
Prior: $P(E)$

“Learning from data” (and hoping to reduce uncertainty)

How much we learn depends on:
P(E): how much we knew before we had the data
P(F|E): the (uncertain) relationship between the data and the unknown

Example

1/10 of all diamond deposits being considered for appraisal are profitable

Garnet is a mineral that tends to co-occur with diamond. Historically, the probability of garnet exceeding 5 ppm is 4/5 for profitable deposits and only 2/5 for non-profitable deposits

Your analysis of garnet data for the current deposit reveals that the garnet content equals 6.5 ppm

What is the probability that the deposit is profitable?

Result

$P(E_1 \mid F_1) = \dfrac{P(F_1 \mid E_1)\,P(E_1)}{P(F_1)} = \dfrac{4/5 \times 1/10}{4/5 \times 1/10 + 2/5 \times 9/10} = \dfrac{2}{11}$

Rule of total probability (removing $E$ by summing over all possibilities):

$P(F_1) = P(F_1 \mid E_1)\,P(E_1) + P(F_1 \mid E_2)\,P(E_2)$

Note: $P(E_1) + P(E_2) = 1$

E1 = “the deposit is profitable”, E2 = “the deposit is not profitable”, F1 = “the garnet content exceeds 5 ppm” (true here, since the measured content is 6.5 ppm)
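The arithmetic of this example can be checked with exact fractions. The sketch below is only an illustration of Bayes' rule combined with the rule of total probability; the function and variable names are hypothetical, and the three input probabilities are the ones stated in the example.

```python
from fractions import Fraction

# Inputs stated in the example.
prior_profitable = Fraction(1, 10)             # P(E1)
p_garnet_given_profitable = Fraction(4, 5)     # P(F1 | E1)
p_garnet_given_unprofitable = Fraction(2, 5)   # P(F1 | not E1)


def posterior(prior, like_if_true, like_if_false):
    """Bayes' rule for a two-outcome unknown (hypothetical helper).

    The denominator is the rule of total probability:
    P(F) = P(F | E) P(E) + P(F | not E) P(not E).
    """
    evidence = like_if_true * prior + like_if_false * (1 - prior)
    return like_if_true * prior / evidence


post = posterior(prior_profitable,
                 p_garnet_given_profitable,
                 p_garnet_given_unprofitable)
print(post)  # 2/11
```

The garnet observation raises the probability of a profitable deposit from 1/10 to 2/11, a modest gain because high garnet content is also fairly common in non-profitable deposits.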

Random variable

A random variable Z is a variable whose outcome is unknown but whose frequency of outcomes is quantified by a probability distribution model

Discrete RVs

Probability mass function (pmf)

Cumulative distribution function (cdf)

Continuous RVs

Probability density function (pdf)

Cumulative distribution function (cdf)

Probability mass function

Notation pX(a) = P(X=a)

For continuous variables, pX(a) = P(X=a) is always zero!

Probability density function

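$P(a \le X \le b) = \int_a^b f_X(x)\,dx$

[Figure: pdf $f_X(x)$ against $x$, with the points $a$, $x_1$, $x_2$ and $b$ marked on the horizontal axis; the shaded area represents a probability.]

$\int_{-\infty}^{+\infty} f_X(x)\,dx = 1$ (some outcome will occur)

$f_X(x) \ge 0$ (probabilities cannot be negative)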

Likelihood

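[Figure: pdf $f_X(x)$ against $x$, with the points $a$, $x_1$, $x_2$ and $b$ marked on the horizontal axis; the shaded area represents a probability.]

$f_X(x)$ has the meaning of a likelihood, not of a probability.

$\dfrac{f_X(x_1)}{f_X(x_2)}$: this ratio indicates how much more (or less) likely $x_1$ is to occur than $x_2$.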
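A short numerical illustration of why a density value is a likelihood rather than a probability. The distributions and the points `x1`, `x2` below are arbitrary choices made for this sketch, not values from the slides; `statistics.NormalDist` is part of the Python standard library (3.8 and later).

```python
from statistics import NormalDist

# Assumed example: a standard Gaussian and two arbitrary outcomes.
dist = NormalDist(mu=0.0, sigma=1.0)
x1, x2 = 0.0, 2.0

# The ratio of densities says how much more likely x1 is than x2.
ratio = dist.pdf(x1) / dist.pdf(x2)
print(f"f(x1) / f(x2) = {ratio:.3f}")

# A density can exceed 1, so it cannot be a probability by itself.
narrow = NormalDist(mu=0.0, sigma=0.1)
peak = narrow.pdf(0.0)
print(f"peak density for sigma = 0.1: {peak:.3f}")

# Probabilities come from areas under the pdf, here P(-0.1 <= X <= 0.1).
area = narrow.cdf(0.1) - narrow.cdf(-0.1)
print(f"P(-0.1 <= X <= 0.1) = {area:.3f}")
```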

Cumulative distribution function

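$F_X(x) = P(X \le x)$

$F_X(x)$ is always between 0 and 1 and never decreases.

[Figure: pdf $f_X(x)$ and the corresponding cdf $F_X(x)$; the area A under the pdf to the left of $x$ equals the value of the cdf at $x$, and the cdf levels off at 1.]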

Examples

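Uniform

[Figure: uniform pdf $f_X(x)$, constant between $a$ and $b$, and its cdf $F_X(x)$, rising from 0 at $a$ to 1 at $b$.]

Poisson

$p_X(i) = P(X = i) = e^{-\lambda}\,\dfrac{\lambda^i}{i!}$

where $\lambda$ is the average number of points per unit area.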
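The Poisson probability mass function, exp(-λ) λ^i / i!, is easy to evaluate directly. In this sketch the function name `poisson_pmf` is hypothetical and the rate of 2 points per unit area is an assumed value chosen only for illustration.

```python
import math


def poisson_pmf(i, lam):
    """P(X = i) for a Poisson variable with rate lam (hypothetical helper)."""
    return math.exp(-lam) * lam**i / math.factorial(i)


lam = 2.0  # assumed average number of points per unit area
probs = [poisson_pmf(i, lam) for i in range(6)]
for i, p in enumerate(probs):
    print(f"P(X = {i}) = {p:.4f}")
```

Summing the pmf over all counts gives 1, the discrete counterpart of the pdf integrating to 1.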

Examples

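[Figure: Gaussian pdfs $f_X$ for $(\mu = 1, \sigma = 1)$, $(\mu = 0, \sigma = 1)$ and $(\mu = 0, \sigma = 2)$.]

$f_X(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\dfrac{1}{2}\left(\dfrac{x - \mu}{\sigma}\right)^2\right)$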

Empirical distribution function

n=6 data: 10.1 / 15.4 / 8.6 / 9.5 / 20.6 / 3.2

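[Figure: empirical cdf $\hat{F}_X(x)$ against $x$, a step function that rises by 1/6 at each of the sorted data values 3.2, 8.6, 9.5, 10.1, 15.4 and 20.6, reaching 1 at the largest.]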
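A minimal sketch of the empirical distribution function for the six data values above. The function name `ecdf` is hypothetical; it simply counts the proportion of data that do not exceed x.

```python
data = [10.1, 15.4, 8.6, 9.5, 20.6, 3.2]


def ecdf(sample, x):
    """Proportion of sample values that do not exceed x (hypothetical helper)."""
    return sum(1 for value in sample if value <= x) / len(sample)


# The function jumps by 1/n at each data value.
for x in sorted(data):
    print(f"F_hat({x}) = {ecdf(data, x):.3f}")
```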

Modeling from data

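[Figure: empirical cdf $\hat{F}$ with steps of 1/7 at the sorted data values 3.2, 8.6, 9.5, 10.1, 15.4 and 20.6; the horizontal axis extends to 100.]

Steps of 1/(n+1) instead of 1/n, to allow extrapolation

Linear inter/extrapolation

[Figure: the cdf $\hat{F}$ obtained by linear interpolation between the data values and linear extrapolation beyond them; the horizontal axis extends to 100.]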

Monte Carlo simulation

Aim

Mimicking the process of actual sampling

Needed: Pseudo random number generator

A computer program that deterministically generates a sequence of uniformly distributed random numbers in [0,1], initialized with a “seed” (a large odd integer such as 56781)

Example: 0.10135, 0.58382, 0.98182, 0.0534 etc…

Other terminology: drawing, sampling

Mechanism

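Use any random number generator that renders a value p in [0,1]

Read off the corresponding quantile $x_p = F^{-1}(p)$ from the cdf $F$ (any type of cdf)

$x_p$: value randomly drawn from $F$

[Figure: cdf $F$ against $x$, rising from 0 to 1; a random value $p$ on the vertical axis is mapped through the cdf to $x_p$ on the horizontal axis.]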
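The mechanism can be sketched as follows: draw a uniform number p, then invert the cdf to obtain the value with cumulative probability p. The sketch uses a piecewise-linear cdf built from six example data values with probabilities k/(n+1). The bounds 0 and 100 used for extrapolation, the function name `quantile` and the choice of Python's `random` module as the generator are all assumptions made for illustration.

```python
import random

data = sorted([10.1, 15.4, 8.6, 9.5, 20.6, 3.2])
n = len(data)

# Assumed lower and upper bounds for linear extrapolation beyond the data.
lower, upper = 0.0, 100.0

# Knots of the piecewise-linear cdf: F(lower) = 0, F(x_k) = k/(n+1), F(upper) = 1.
xs = [lower] + data + [upper]
ps = [k / (n + 1) for k in range(n + 2)]


def quantile(p):
    """Invert the piecewise-linear cdf: return x_p with F(x_p) = p."""
    for i in range(1, len(ps)):
        if p <= ps[i]:
            t = (p - ps[i - 1]) / (ps[i] - ps[i - 1])
            return xs[i - 1] + t * (xs[i] - xs[i - 1])
    return xs[-1]


rng = random.Random(56781)  # seeded, so the sequence is reproducible
samples = [quantile(rng.random()) for _ in range(5)]
print(samples)
```

Because the generator is seeded, rerunning the script reproduces the same samples, which is what makes Monte Carlo experiments repeatable.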

Some first “models of uncertainty”

$P(A)$

$P(A \mid B = b)$

$f(X \mid Y = y)$

Samples: $x_1, x_2, x_3, \ldots, x_n$

Samples drawn by Monte Carlo simulation are a valid model of uncertainty

Data Transformation

Aim

To transform the empirical distribution of a dataset into another empirical distribution

Why?

Certain methods require that the empirical distribution of the data has a certain shape, such as a standard normal shape

To lessen the influence of extreme values for skewed distributions

Mechanism

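[Figure: cdf of the data (x's) next to the standard Gaussian cdf (y's); each data value x is mapped to the Gaussian value y that has the same cumulative probability.]

Cumulative probability    1/6     2/6    3/6    4/6    5/6
Data value x                3       6      8      9     20
Gaussian value y        -0.96   -0.43      0   0.43   0.96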
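This transformation (often called a normal score transform) can be sketched with the standard library. The function name `normal_scores` is hypothetical, and the sketch assumes distinct data values and cumulative probabilities k/(n+1), which reproduces the 1/6 to 5/6 levels used for the five data values above.

```python
from statistics import NormalDist

data = [3, 6, 8, 9, 20]


def normal_scores(values):
    """Map each value to the standard Gaussian quantile of its rank.

    Hypothetical helper: the k-th smallest of n distinct values receives
    cumulative probability k/(n+1).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    ranked = sorted(values)
    m = len(ranked) + 1
    return [(x, std_normal.inv_cdf(k / m)) for k, x in enumerate(ranked, start=1)]


scores = normal_scores(data)
for x, y in scores:
    print(f"x = {x:>2}  ->  y = {y:.2f}")
```

The extreme value 20 ends up no further from the center than the value 3, which is how the transform lessens the influence of extremes in skewed data.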

Correlation or association

[Figure: scatterplot of diamond value (US $, logarithmic axis from 0.01 to 10000) against diamond size (ct, logarithmic axis from 0.001 to 10).]

(linear) correlation coefficient

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \dfrac{(x_i - \bar{x})(y_i - \bar{y})}{s_x\, s_y}$

$s_x^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

If |r| is larger: stronger correlation
If r > 0: positive correlation
If r < 0: negative correlation
If |r| = 1: perfect linear correlation
If r = 0: no linear correlation
The range of r is restricted to [-1,1]
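A direct sketch of the linear correlation coefficient, using the sample standard deviation with an n-1 denominator. The function names and the two small data sets are hypothetical, chosen to show a perfect linear relation and a purely nonlinear one.

```python
import math


def sample_std(values):
    """Sample standard deviation with an n-1 denominator."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def correlation(x, y):
    """Linear correlation coefficient r of paired samples x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / ((n - 1) * sample_std(x) * sample_std(y))


# Hypothetical data: an exact straight line gives r = 1.
x_line = [1, 2, 3, 4, 5]
y_line = [2, 4, 6, 8, 10]
r_line = correlation(x_line, y_line)

# Hypothetical data: y = x^2 on a symmetric range gives r = 0,
# even though y depends completely on x.
x_sym = [-2, -1, 0, 1, 2]
y_sq = [v * v for v in x_sym]
r_sq = correlation(x_sym, y_sq)

print(r_line, r_sq)
```

The second case shows that r = 0 only rules out a linear relation, not a dependence between the variables.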

Examples

[Figure: scatterplot of Y (0 to 9) against X (0 to 25), r = 0.58]

[Figure: scatterplot of Y (0 to 9) against X (0 to 20), r = 0.91]

[Figure: scatterplot of Y (0 to 4.5) against X (-3 to 3), r = 0.15]

Date post:	12-Jul-2020