Modeling Uncertainty in the Earth Sciences
Jef Caers
Stanford University
Probability theory and statistical analysis: a review
Concepts assumed known
Histograms, mean, median, spread, quantiles
Probability, conditional probability, Gaussian distribution, random variable, probability density, cumulative distribution
Expectation, variance
Scatterplot, correlation coefficient
Concepts in statistics
Probability, Statistics: a review
Graphical analysis
But what if we need to compare two data sets and have only very few sample values?
Quantile-quantile plot
A p-quantile with p ∈ [0,1] (a percentile) is defined as the value such that a proportion p (that is, 100 × p percent) of the data does not exceed it
Dataset 1:   34   21    8    7   10   15
Dataset 2:   16   22    5    9   11   37

Rank-ordered
Dataset 1:    7    8   10   15   21   34
Dataset 2:    5    9   11   16   22   37
Percentile: 1/6  2/6  3/6  4/6  5/6  6/6
Plotting
[Figure: Q-Q plot of the rank-ordered values; x-axis: Dataset 1, y-axis: Dataset 2, both axes from 0 to 40]
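The rank-ordering step behind a Q-Q plot can be sketched in a few lines of Python. This is a minimal illustration using the two datasets from the slide; the function name `qq_pairs` is a hypothetical helper, and the sketch assumes both datasets have the same number of samples.

```python
# Data from the slide
dataset1 = [34, 21, 8, 7, 10, 15]
dataset2 = [16, 22, 5, 9, 11, 37]

def qq_pairs(a, b):
    """Pair the rank-ordered values of two equally sized datasets.

    Hypothetical helper: each pair holds the values of a and b
    that share the same percentile k/n.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(a), sorted(b)))

pairs = qq_pairs(dataset1, dataset2)
n = len(pairs)
for k, (q1, q2) in enumerate(pairs, start=1):
    print(f"percentile {k}/{n}: dataset 1 = {q1}, dataset 2 = {q2}")
```

Plotting the pairs against each other gives the Q-Q plot; points close to the 45-degree line suggest that the two datasets have similar distributions.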
Concepts in probability theory
Probability
“There is a 60% probability/chance of finding iron ore in this region”
Interpretation 1: The geologist feels that, over the long run, 60% of similar regions that he/she studies will actually yield iron ore.
Interpretation 2: The geologist assesses, based on his/her expertise and prior knowledge, that it is more likely than not that the region will contain iron ore. In fact, 60/100 is a quantitative measure of the geologist’s assessment of the hypothesis that the region will contain iron ore, where 0/100 means there is certainly no iron ore and 100/100 means there is certainly iron ore.
Conditional probability
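P(E|F) = P(event E occurs | event F occurs)
$$P(E \mid F) = \frac{P(E \text{ and } F)}{P(F)} = \frac{\text{surface of intersection of circles } E \text{ and } F}{\text{surface of circle } F}$$
[Figure: Venn diagram with two overlapping circles E and F inside the region S]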
Bayes’ rule
$$\underbrace{P(E \mid F)}_{\text{Posterior}} = \frac{P(F \mid E)\,P(E)}{P(F)} \propto \underbrace{P(F \mid E)}_{\text{Likelihood}}\;\underbrace{P(E)}_{\text{Prior}}$$
“Learning from data” (and hoping to reduce uncertainty)
How much we learn depends on:
P(E): what we knew before we had the data
P(F|E): the (uncertain) relationship between the data and the unknown
Example
1/10 of all diamond deposits being considered for appraisal are economical
Garnet is a mineral that tends to co-occur with diamond. Historically, the probability of garnet exceeding 5 ppm is 4/5 for profitable deposits and only 2/5 for non-profitable deposits
Your analysis of garnet data for the current deposit reveals that the garnet content equals 6.5 ppm
What is the probability that the deposit is profitable?
Result
E1 = “the deposit is profitable”, E2 = “the deposit is not profitable”
F1 = “the garnet content exceeds 5 ppm” (the measured 6.5 ppm satisfies this)
$$P(E_1 \mid F_1) = \frac{P(F_1 \mid E_1)\,P(E_1)}{P(F_1)} = \frac{4/5 \times 1/10}{4/5 \times 1/10 + 2/5 \times 9/10} = \frac{2}{11}$$
Rule of total probability (removing E by summing over all possibilities):
$$P(F_1) = P(F_1 \mid E_1)\,P(E_1) + P(F_1 \mid E_2)\,P(E_2)$$
Note: $P(E_1) + P(E_2) = 1$
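The arithmetic of the diamond example can be checked with exact fractions. The sketch below is only an illustration: the variable names are hypothetical, and the event "garnet exceeds 5 ppm" is taken as the observed data because that is the threshold for which the historical probabilities are given.

```python
from fractions import Fraction

# Numbers taken from the example
prior_profitable = Fraction(1, 10)            # P(E1)
prior_not_profitable = 1 - prior_profitable   # P(E2) = 9/10
lik_given_profitable = Fraction(4, 5)         # P(F1 | E1)
lik_given_not_profitable = Fraction(2, 5)     # P(F1 | E2)

# Rule of total probability: P(F1) = sum over both possibilities for E
evidence = (lik_given_profitable * prior_profitable
            + lik_given_not_profitable * prior_not_profitable)

# Bayes' rule: P(E1 | F1) = P(F1 | E1) P(E1) / P(F1)
posterior = lik_given_profitable * prior_profitable / evidence

print(posterior)  # 2/11
```

The garnet observation raises the probability of a profitable deposit from 1/10 to 2/11, so the data reduce the uncertainty only modestly.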
Random variable
A random variable Z is a variable whose outcome is unknown but whose frequency of outcomes is quantified by a probability distribution model
Discrete RVs
Probability mass function (pmf)
Cumulative distribution function (cdf)
Continuous RVs
Probability density function (pdf)
Cumulative distribution function (cdf)
Probability mass function
Notation pX(a) = P(X=a)
For continuous variables, pX(a) = P(X=a) is always zero!
Probability density function
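$$P(a \le X \le b) = \int_a^b f_X(x)\,dx$$
[Figure: pdf f_X(x) against x, with the points x1 and x2 marked; the shaded area between a and b represents a probability]
$$\int_{-\infty}^{+\infty} f_X(x)\,dx = 1 \quad \text{(some outcome will occur)}$$
$$f_X(x) \ge 0 \quad \text{(probabilities cannot be negative)}$$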
Likelihood
[Figure: pdf f_X(x) against x, with the points x1 and x2 marked; the shaded area between a and b represents a probability]
$f_X(x_1)$ has the meaning of a likelihood, not of a probability
$$\frac{f_X(x_1)}{f_X(x_2)}$$
This ratio indicates how much more or less likely x1 is to occur than x2
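A likelihood ratio can be evaluated directly from a density. The sketch below assumes, purely for illustration, a standard Gaussian for X and the hypothetical outcomes x1 = 0 and x2 = 2; it uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Illustrative assumption: X follows a standard Gaussian distribution
X = NormalDist(mu=0.0, sigma=1.0)

# Hypothetical outcomes to compare
x1, x2 = 0.0, 2.0

# Density values are likelihoods, not probabilities
likelihood_x1 = X.pdf(x1)
likelihood_x2 = X.pdf(x2)

# Ratio > 1 means x1 is more likely to occur than x2
ratio = likelihood_x1 / likelihood_x2
print(f"x1 is about {ratio:.2f} times more likely than x2")
```

For this choice the ratio equals exp(2), since the Gaussian densities differ only through the exponent.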
Cumulative distribution function
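$$F_X(x) = P(X \le x)$$
F_X(x) is always between 0 and 1 and never decreases
[Figure: pdf f_X(x) and cdf F_X(x) against x; the area A under the pdf to the left of x equals the value of the cdf at x, and the cdf rises to 1]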
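Because the cdf accumulates the area under the pdf, an interval probability can be obtained either as F_X(b) - F_X(a) or by integrating f_X from a to b. The following sketch compares the two numerically; the standard Gaussian and the interval [-1, 1] are illustrative assumptions, and the midpoint rule is just one simple way to approximate the integral.

```python
from statistics import NormalDist

# Illustrative assumption: X follows a standard Gaussian distribution
X = NormalDist(mu=0.0, sigma=1.0)
a, b = -1.0, 1.0

# Route 1: difference of cdf values, P(a <= X <= b) = F(b) - F(a)
prob_from_cdf = X.cdf(b) - X.cdf(a)

# Route 2: area under the pdf between a and b (midpoint rule)
steps = 10_000
width = (b - a) / steps
prob_from_pdf = sum(X.pdf(a + (k + 0.5) * width) for k in range(steps)) * width

print(f"from cdf: {prob_from_cdf:.6f}, from pdf: {prob_from_pdf:.6f}")
```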
Examples
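Uniform
[Figure: pdf f_X(x) and cdf F_X(x) of a uniform distribution on the interval [a, b]; the cdf rises from 0 at a to 1 at b]
Poisson
$$p_X(i) = P(X = i) = e^{-\lambda}\,\frac{\lambda^i}{i!}$$
λ is the (average) number of points per unit area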
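The Poisson probability mass function is easy to evaluate directly. In this minimal sketch the function name `poisson_pmf` and the intensity of 2 points per unit area are hypothetical choices made only for illustration.

```python
import math

def poisson_pmf(i, lam):
    """P(X = i) for a Poisson variable with intensity lam."""
    return math.exp(-lam) * lam**i / math.factorial(i)

lam = 2.0  # hypothetical intensity: 2 points per unit area on average

# Probabilities of counting 0, 1, 2, ... points in a unit area
probs = [poisson_pmf(i, lam) for i in range(50)]

for i in range(5):
    print(f"P(X = {i}) = {probs[i]:.4f}")
```

Summing the probabilities over all counts gives 1 (up to the negligible tail beyond the 50 terms computed here), as required of any pmf.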
Examples
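Gaussian
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$$
[Figure: Gaussian pdfs f_X for (μ=1, σ=1), (μ=0, σ=1) and (μ=0, σ=2)]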
Empirical distribution function
n=6 data: 10.1 / 15.4 / 8.6 / 9.5 / 20.6 / 3.2
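[Figure: empirical cdf F̂_X(x), a step function that rises by 1/6 at each sorted data value 3.2, 8.6, 9.5, 10.1, 15.4, 20.6 until it reaches 1]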
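The empirical distribution function is the proportion of data values that do not exceed x. A minimal sketch with the six data values from the slide follows; the function name `ecdf` is a hypothetical helper.

```python
# Data from the slide (n = 6)
data = [10.1, 15.4, 8.6, 9.5, 20.6, 3.2]

def ecdf(sample, x):
    """Proportion of sample values that do not exceed x."""
    return sum(1 for v in sample if v <= x) / len(sample)

# The function jumps by 1/6 at each data value
for x in sorted(data):
    print(f"F_hat({x}) = {ecdf(data, x):.3f}")
```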
Modeling from data
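[Figure: F̂ at the sorted data values 3.2, 8.6, 9.5, 10.1, 15.4, 20.6, rising in steps of 1/7; the x-axis extends to 100]
Steps of 1/(n+1) instead of 1/n, to allow extrapolation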
Linear inter/extrapolation
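[Figure: the same F̂ with the points joined by straight lines, interpolated linearly between the data values and extrapolated beyond them; the x-axis extends to 100]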
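One way to turn the data into a continuous cdf model is to assign cumulative probability k/(n+1) to the k-th sorted value and to join the points by straight lines. The sketch below is an illustration under stated assumptions: the bounds `LOWER = 0` and `UPPER = 100` (where the model reaches 0 and 1) are hypothetical choices for the extrapolation, and `cdf_model` is a hypothetical name.

```python
from bisect import bisect_right

# Data from the slide, sorted
data = sorted([10.1, 15.4, 8.6, 9.5, 20.6, 3.2])
n = len(data)

# Assumed extrapolation bounds (hypothetical): F = 0 at LOWER, F = 1 at UPPER
LOWER, UPPER = 0.0, 100.0

# Knots of the piecewise-linear model: the k-th sorted value gets k/(n+1)
xs = [LOWER] + data + [UPPER]
ps = [0.0] + [k / (n + 1) for k in range(1, n + 1)] + [1.0]

def cdf_model(x):
    """Piecewise-linear cdf through the points (xs, ps)."""
    if x <= xs[0]:
        return 0.0
    if x >= xs[-1]:
        return 1.0
    j = bisect_right(xs, x)                      # xs[j-1] <= x < xs[j]
    t = (x - xs[j - 1]) / (xs[j] - xs[j - 1])    # position within the segment
    return ps[j - 1] + t * (ps[j] - ps[j - 1])

print(cdf_model(3.2), cdf_model(9.05), cdf_model(50.0))
```

Using 1/(n+1) rather than 1/n leaves probability mass below the smallest and above the largest observation, which is what makes the extrapolation possible.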
Monte Carlo simulation
Aim
Mimicking the process of actual sampling
Needed: Pseudo random number generator
A computer program that (deterministically) creates a sequence of uniform random numbers in [0,1], initiated with a “seed” (a large odd integer such as 56781)
Example: 0.10135, 0.58382, 0.98182, 0.0534 etc…
Other terminology: drawing, sampling
Mechanism
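[Figure: a cdf F (any type of cdf) against x, rising to 1; a value p on the vertical axis is mapped through F to the value x_p on the horizontal axis]
Use any random number generator that renders a value p in [0,1]
x_p, the value with F(x_p) = p: value randomly drawn from F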
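The mechanism can be sketched for any cdf whose inverse is available. As an illustrative assumption, the example uses an exponential cdf F(x) = 1 - exp(-rate x), which does not appear on the slides but has a simple closed-form inverse; the function name `draw_exponential` and the rate of 2 are hypothetical, and the seed is the one quoted above.

```python
import math
import random

def draw_exponential(rate, rng):
    """Draw one value from F(x) = 1 - exp(-rate * x) by inverting the cdf."""
    p = rng.random()                      # uniform random number in [0, 1)
    return -math.log(1.0 - p) / rate      # x_p such that F(x_p) = p

# Seeded generator, so the sequence of draws is reproducible
rng = random.Random(56781)

samples = [draw_exponential(2.0, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

# The theoretical mean of this distribution is 1 / rate = 0.5
print(f"sample mean = {sample_mean:.3f}")
```

The same two steps (draw p, then look up x_p) apply unchanged to an empirical or interpolated cdf; only the inversion step differs.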
Some first “models of uncertainty”
$P(A)$
$P(A \mid B = b)$
$f_X(x \mid Y = y)$
Samples: $x_1, x_2, x_3, \ldots, x_n$
Samples drawn by Monte Carlo simulation are a valid model of uncertainty
Data Transformation
Aim
To transform the empirical distribution of a dataset into another empirical distribution
Why?
Certain methods require the empirical distribution of the data to have a certain shape, such as a standard normal shape
To lessen the influence of extreme values for skewed distributions
Mechanism
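[Figure: cdf of the data (x) next to the cdf of a standard Gaussian (y); a data value x_s is mapped to the Gaussian value y_s that has the same cumulative probability]

Cumulative probability   x (data)   y (Gaussian)
1/6                       3         -0.96
2/6                       6         -0.43
3/6                       8          0
4/6                       9          0.43
5/6                      20          0.96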
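The transformation to a standard Gaussian can be sketched with the standard library. The example uses the five data values from the slide and assigns cumulative probability rank/(n+1) to each value, which reproduces the levels 1/6 to 5/6; the function name `normal_scores` is hypothetical, and ties between data values are not handled in this sketch.

```python
from statistics import NormalDist

# Data from the slide
data = [3, 6, 8, 9, 20]

def normal_scores(values):
    """Map each value to the standard Gaussian quantile of its rank/(n+1)."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Indices of the values in increasing order (ties are not treated specially)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = std_normal.inv_cdf(rank / (n + 1))
    return scores

scores = normal_scores(data)
for x, y in zip(data, scores):
    print(f"x = {x:>2}  ->  y = {y:+.2f}")
```

The extreme value 20 is mapped to a score below 1, close to the value shown on the slide, which illustrates how the transformation lessens the influence of extreme values.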
Correlation or association
[Figure: scatterplot of diamond value (US $) against diamond size (ct), both on logarithmic axes]
(linear) correlation coefficient
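$$r = \frac{1}{n-1}\sum_{i=1}^{n}\frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x\,s_y}$$
$$s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \quad \text{(and likewise for } s_y\text{)}$$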
If |r| is larger: stronger correlation
If r > 0: positive correlation
If r < 0: negative correlation
If |r| = 1: perfect linear correlation
If r = 0: no linear correlation
The range of r is restricted to [-1,1]
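The linear correlation coefficient is the sum of the products of the deviations from the means, divided by (n-1) and by both standard deviations. A minimal sketch follows; the function name `correlation` and the small test datasets are hypothetical and chosen only to show the limiting cases.

```python
import math

def correlation(x, y):
    """Linear correlation coefficient r of two equally long samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations, with n - 1 in the denominator
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical datasets illustrating the limiting cases
r_increasing = correlation([1, 2, 3, 4], [2, 4, 6, 8])         # perfect positive
r_decreasing = correlation([1, 2, 3], [3, 2, 1])               # perfect negative
r_parabola = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])   # y = x**2

print(r_increasing, r_decreasing, r_parabola)
```

The last case shows why r = 0 means only "no linear correlation": y is fully determined by x, yet the linear measure does not detect the relationship.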
Examples
[Figure: three scatterplots of Y against X, with r = 0.58, r = 0.91 and r = 0.15]