MChem Computing and Chemistry [B14SC3]
“Data Analysis for Beginners…” or
• How to avoid disasters when writing up your research work…
or
• On the Meaning of Life, the Universe and Everything!
Lecture #42 Dr Roderick Ferguson – Summer 2008
The “data analysis” assignment…
This is in 3 parts
Some useful information, and a copy of the
MS Word document can be found at:-
http://www.eps.hw.ac.uk/~cherrf/B14SC3
(this link is also available from my chemistry staff home page)
The “data analysis” assignment…
Now, if you are really stuck and can’t see
how to get started with the 2nd part of this…
then I will be available over the next 2 days
i.e. Wed 4th and Thurs 5th June to offer some
help - My office is now DB 2.49
But, you should have enough knowledge
to be able to attempt this yourselves!
Data Analysis for Beginners…
1. Introduction to Data Analysis.
2. Linear Least Squares.
3. Nonlinear Least Squares.
4. Theoretical Models (and Maths).
5. Errors (and what to do with them!).
Introduction to Data Analysis
Why do you need to do it?
• Data Analysis is an essential skill for a
professional scientist today.
• Many modern instruments can generate large quantities of numerical data which require some sort of analysis and/or theoretical interpretation.
Introduction to Data Analysis
How can you do it?
• Most people now have access to very powerful Desktop Computers…
• There are many software tools that can be used to analyse numerical data…
• Today we will focus on what can be done with something that is readily available – i.e. Microsoft Excel!
Introduction to Data Analysis
An historical aside
• This was not always true…
• Modern research workers have no idea what performing data analysis was like before the (micro)computer revolution that started in the 1980s…
Introduction to Data Analysis
An historical aside (cont)
• To do data analysis, you had to have access to a large “mainframe” computer…
• You also had to learn at least one computer programming language…
• And you also had to type in both your numerical data and analysis program
onto punched paper cards!
Introduction to Data Analysis
• There are also many pitfalls and traps that the new research worker can very easily fall into!
• Thus some background knowledge on both the how and the why aspects is required…
• Also, it is never a good idea to use something without first understanding how it works!
Introduction to Data Analysis
At first, your reactions will probably be fear and confusion when you try to do Data Analysis…
Introduction to Data Analysis
• However,
Don’t Panic!
• because, it’s easier than you think…
Data Analysis for Beginners…
1. Introduction to Data Analysis.
2. Linear Least Squares.
3. Nonlinear Least Squares.
4. Theoretical Models (and Maths).
5. Errors (and what to do with them!).
Linear Least Squares
Whilst performing data analysis, you will
encounter the following terms…
• “Best Fit”
• “Goodness of Fit”
• “Residuals”
• “Sum of Squares”
What do they mean?
Linear Least Squares – an example
We have two columns of data and want
to see if there is a LINEAR relationship
between them …
point #    X      Y
1          1.1    1.4
2          2.0    1.8
3          2.9    2.2
4          4.2    2.8
5          5.0    2.9
Step 0 – Draw a Graph!
[Figure: scatter plot of Y (axis from 1.2 to 3.0) against X (axis from 1 to 5) for the five data points.]
Linear Least Squares – an example
What would your idea of a good straight line fit to
this data be?
[Figure: scatter plot of Y against X for the five data points.]
Linear Least Squares – an example
Can we give a more precise mathematical
description for the idea of “Best Fit” ?
Yes – we can!
Need some definitions. We’ll look at Residuals
and the Sum of Squares (SS)
or more precisely,
the “Sum of Squares of the Residuals”.
Linear Least Squares – an example
Back to our graph again …
[Figure: the scatter plot of Y against X again.]
… but now with the residuals added!
Linear Least Squares – theory (1)
Consider the i’th data point with values (Xi,Yi) and
suppose that the data can be described by the
familiar straight line relationship
F = m X + C, where m is the slope and
C is the intercept.
• Now, for each experimental Yi we calculate a theoretical Yi (which we’ll call Fi) by using the above equation, i.e. Fi = m Xi + C, for every data point i.
Linear Least Squares – theory (2)
• The difference between each experimental and calculated value of Y is called the Residual, i.e. we have Ri = Yi – Fi for every data point i.
• Note that Ri will sometimes be positive and sometimes negative, so simply adding up the residuals would let them cancel out…
• How can we get an overall measure of how close the theoretical line is to our data?
• Clearly, it must have something to do with ALL of the residuals …
Linear Least Squares – theory (3)
We define the Sum of Squares of the Residuals
(or just SS) as :-
$$SS = \sum_{i=1}^{N} R_i^2 = \sum_{i=1}^{N} (Y_i - F_i)^2$$
• This gives us a single quantity that measures how good a fit the straight line is to the data.
• Note also that SS will depend on both m and C
Linear Least Squares – theory (3)
i.e. SS = SS(m,C), which means that SS is a function
of two independent variables {m and C}
so that:
$$SS(m,C) = \sum_{i=1}^{N} (Y_i - m X_i - C)^2$$
From our original problem we have now got a new
problem, i.e. we have created a Sum of Squares function,
and we need to find the values of m and C which
minimise this function!
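A minimal Python sketch of the SS(m,C) function for the five example data points. Python is used here only for illustration (the lecture itself works in Excel), and the two trial lines are hypothetical choices rather than fitted values.

```python
# Example data from the lecture (points #1 to #5).
X = [1.1, 2.0, 2.9, 4.2, 5.0]
Y = [1.4, 1.8, 2.2, 2.8, 2.9]


def sum_of_squares(m, c):
    """Return SS(m, C), the sum of squared residuals R_i = Y_i - (m*X_i + C)."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(X, Y))


# Two hypothetical trial lines: the one with the smaller SS is the better fit.
ss_poor = sum_of_squares(0.5, 0.5)
ss_better = sum_of_squares(0.4, 1.0)
print(f"SS(m=0.5, C=0.5) = {ss_poor:.4f}")
print(f"SS(m=0.4, C=1.0) = {ss_better:.4f}")
```

Trying a few (m, C) pairs by hand like this shows why a systematic way of finding the minimum is needed.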
Linear Least Squares – theory (4)
In other words,
• How do we find minimum values (or minima) of functions ?
• We need the help of Calculus – ie the part of Maths that deals with the rate of change of a function …
Best Fit => find Minimum of the SS function!
Linear Least Squares – Simple Calculus (1)
Recall a function of one variable, F(x):
[Figure: plot of F(x) with three tangent lines: one with positive slope, one with negative slope, and one with slope = 0, which marks a minimum.]
Linear Least Squares – Simple Calculus (2)
Function of one variable:
• The slope of the tangent line is given by the rate of change of F with x, dF/dx or the derivative of F.
• Furthermore, at a minimum (or maximum) value of the function F(x), the slope of the tangent line is zero ie dF/dx = 0
We can also define functions of more than one
variable…
Linear Least Squares – Simple Calculus (3)
Function of more than one variable:
• If F = F(x,y,z), a function of three variables x, y and z, then we can define 3 partial derivatives, namely ∂F/∂x, ∂F/∂y and ∂F/∂z.
You may be familiar with this notation from
your Thermodynamics studies…
Note that partial derivatives are very useful
in ALL branches of the Physical Sciences!
Linear Least Squares – theory (5)
Recall our original problem of finding the minimum
of the function, SS(m,C):
• We need to find the values of m and C that make the two partial derivatives of SS vanish
• i.e. we need to solve the pair of equations:-
$$\frac{\partial SS}{\partial m} = 0, \qquad \frac{\partial SS}{\partial C} = 0$$
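Linear Least Squares – theory (6)
• This is very easy to do for the straight line case
• Our SS function, SS(m,C), is given by the following (after some expansion!)
$$SS(m,C) = m^2 \sum X_i^2 + 2mC \sum X_i + N C^2 - 2m \sum X_i Y_i - 2C \sum Y_i + \sum Y_i^2$$
i.e. SS is of the general form:
$$SS(m,C) = A m^2 + 2H m C + B C^2 + 2E m + 2F C + G$$
where A = ΣXi², H = ΣXi, B = N, E = −ΣXiYi, F = −ΣYi and G = ΣYi² are constants fixed by the data.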
Linear Least Squares – theory (7)
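For the straight line, the Sum of Squares function
is a quadratic surface in m and C (a bowl-shaped paraboloid) whose contours are conic sections, i.e. a contour map of this surface
will be a series of concentric ellipses.
[Figure: surface plot of SS against m and C.]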
Linear Least Squares – theory (8)
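When we perform the two differentiations, we get:
$$\frac{\partial SS}{\partial m} = 2m \sum X_i^2 + 2C \sum X_i - 2 \sum X_i Y_i = 0$$
$$\frac{\partial SS}{\partial C} = 2m \sum X_i + 2NC - 2 \sum Y_i = 0$$
or
$$\frac{\partial SS}{\partial m} = 2Am + 2HC + 2E = 0$$
$$\frac{\partial SS}{\partial C} = 2Hm + 2BC + 2F = 0$$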
Linear Least Squares – theory (9)
These two linear equations are very easy to solve
for both m and C…
(you can try doing this as an exercise… !)
=> any good pocket calculator can do linear least
squares fits to data!
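The exercise can be sketched in Python. Solving the two linear equations for m and C gives the closed-form expressions used below; the variable names are illustrative, and the data are the five example points from earlier in the lecture.

```python
# Example data from the lecture (points #1 to #5).
X = [1.1, 2.0, 2.9, 4.2, 5.0]
Y = [1.4, 1.8, 2.2, 2.8, 2.9]

n = len(X)
sum_x = sum(X)
sum_y = sum(Y)
sum_xx = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))

# Determinant of the 2x2 system obtained from dSS/dm = 0 and dSS/dC = 0.
det = n * sum_xx - sum_x ** 2

m = (n * sum_xy - sum_x * sum_y) / det       # best fit slope
c = (sum_y * sum_xx - sum_x * sum_xy) / det  # best fit intercept

print(f"slope m = {m:.4f}, intercept C = {c:.4f}")
```

For this data set the slope comes out close to 0.40 and the intercept close to 1.00, which agrees with what the eye suggests from the graph.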
Now, we’ll look at a few applications of
Linear Least Squares theory…
Linear Least Squares – applications (1)
Linear least squares analysis can be extended to
help with other problems.
Often, you will encounter polynomials, e.g.
$$P_n(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \dots + a_n x^n$$
and we can set up an SS function for this n’th
degree polynomial…
i.e. $SS = \sum_i (y_i - f_i)^2 = SS(a_0, a_1, a_2, a_3, \dots, a_n)$
Linear Least Squares – applications (2)
For a polynomial of degree ‘n’, we have to solve
the following system of ‘n+1’ linear equations:-
$$\frac{\partial SS}{\partial a_0} = 0, \quad \frac{\partial SS}{\partial a_1} = 0, \quad \dots, \quad \frac{\partial SS}{\partial a_n} = 0$$
These equations can be solved by standard matrix methods using ‘Linear Algebra’
The Microsoft EXCEL spreadsheet computer program can do linear least squares fits of this more general kind via the LINEST function.
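To show what "standard matrix methods" means in practice, here is a minimal pure Python sketch that builds the n+1 linear equations and solves them by Gaussian elimination. It illustrates the idea only and is not a description of how LINEST is implemented; the function name and the test data (an exact quadratic) are hypothetical.

```python
def polyfit_normal_equations(xs, ys, degree):
    """Return least squares coefficients [a0, a1, ..., an] of a polynomial."""
    size = degree + 1
    # Setting dSS/da_j = 0 gives: sum_k a_k * sum_i x_i^(j+k) = sum_i y_i * x_i^j
    A = [[sum(x ** (j + k) for x in xs) for k in range(size)] for j in range(size)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(size)]

    # Forward elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = A[row][col] / A[col][col]
            for k in range(col, size):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]

    # Back substitution.
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(A[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (b[row] - tail) / A[row][row]
    return coeffs


# Hypothetical test data generated from y = 1 + 2x + 3x^2 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in xs]
coeffs = polyfit_normal_equations(xs, ys, 2)
print(coeffs)
```

For high degree polynomials these equations become badly conditioned, which is one more reason to keep the degree as low as the data allow.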
Linear Least Squares – applications (3)
Two uses of polynomials…
1) Calibration curves
In Chemistry, polynomial functions are often used
to construct calibration curves for some analytical
technique such as Mass Spectrometry
or Atomic Absorption
Here the instrument response is known to be
describable by a polynomial function
(usually a 3rd or 4th degree polynomial).
Linear Least Squares – applications (4)
Two uses of polynomial functions…
2) Data Smoothing
Another application of polynomials and linear least
squares fitting is data smoothing and interpolation.
Example – X-ray scattering data from amorphous
polymer samples (Dr Arrighi).
These experiments can generate HUGE data files!
Linear Least Squares – applications (5)
The Intensity vs angle and temperature data are
conveniently stored as 2 dimensional Excel
spreadsheets.
Also, the I(Q,T) vs Q plots for a fixed temperature,
T, are often found to be very noisy.
Quadratic polynomials can be used to ‘smooth’ the
data so that important features stand out.
How does this work?
Linear Least Squares – applications (6)
First, here’s the original noisy data:-
[Figure: plot of the original noisy I(Q,T) against Q.]
Linear Least Squares – applications (7)
And now, here’s the smoothed data:-
[Figure: plot of the smoothed I(Q,T) against Q.]
Linear Least Squares – applications (8)
And here’s how smoothing works…
[Figure: diagram showing original data points, the smoothing polynomial drawn through them, and the interpolated data point.]
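One common way to smooth with quadratic polynomials is to fit a quadratic to each point and its four nearest neighbours, then replace the point by the fitted value. For equally spaced data this reduces to fixed weights (-3, 12, 17, 12, -3)/35, known as a 5 point Savitzky-Golay filter. This scheme is an assumption for illustration; the lecture does not say which smoothing scheme was actually used.

```python
# Weights of the 5 point quadratic (Savitzky-Golay) smoothing filter.
WEIGHTS = (-3, 12, 17, 12, -3)
NORM = 35.0


def smooth_quadratic_5pt(values):
    """Smooth equally spaced data; the two points at each end are left unchanged."""
    smoothed = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        smoothed[i] = sum(w * v for w, v in zip(WEIGHTS, window)) / NORM
    return smoothed


# A truly quadratic signal passes through unchanged...
quadratic = [float(i * i) for i in range(7)]
quadratic_smoothed = smooth_quadratic_5pt(quadratic)

# ...but a sharp one point feature is flattened to less than half its height.
spike = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
spike_smoothed = smooth_quadratic_5pt(spike)
print(spike_smoothed)
```

The second example is the risk in miniature: a genuine narrow feature is treated as noise and partly removed.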
Linear Least Squares – applications (9)
Data Smoothing is a potentially risky operation…
You must be very careful when you do this…
Why?
Because you could be throwing away some vital
information – especially if you use too much data
smoothing!
Data Analysis for Beginners…
1. Introduction to Data Analysis.
2. Linear Least Squares.
3. Nonlinear Least Squares.
4. Theoretical Models (and Maths).
5. Errors (and what to do with them!).
Nonlinear Least Squares
• Often we need to fit data to a more complicated nonlinear function…
• The sum of squares equations are then also nonlinear…
• and so we have to use other methods to solve the minimisation problem…
Nonlinear Least Squares
• Another feature of this type of problem is that you have to supply a reasonable starting guess for the parameters used in your theoretical model…
• You must have a feeling for the behaviour of your model function…
• Good idea to plot out both your data and model function on the SAME graph!
Nonlinear Least Squares
• By trying out several different sets of parameter values, you can get a rough idea of where a good starting guess is.
• Once a suitable starting set of parameters has been found, then there are several methods (algorithms) that can be used to locate the minimum in the SS function.
Nonlinear Least Squares Example
• For a good example of a Chemistry based nonlinear curve fitting exercise, one can look at First Order Chemical Kinetics.
• The concentration of a new molecule that is being produced in a chemical reaction that follows First Order Kinetics can be described as:-
$$c(t) = c_\infty [1 - \exp(-kt)]$$
Nonlinear Least Squares Example
here c(t) is the concentration at any time t,
and c∞ is the final steady state concentration
i.e. c∞ = c(∞)
To see where this comes from, let’s look at the rate of change of c with time, i.e. dc/dt
Nonlinear Least Squares Example
$$\frac{dc(t)}{dt} = k c_\infty \exp(-kt) = k [c_\infty - c(t)]$$
or equivalently
$$\frac{dc(t)}{dt} + k c(t) = k c_\infty$$
This is an example of a first order linear differential
equation.
Nonlinear Least Squares Example
Here is a plot of the c(t) function:-
Nonlinear Least Squares Example
• For some models, one can transform a
nonlinear function into a linear function
by using some maths…
• eg if the model was c(t) = c0 exp(-kt) then
taking logs would give a linear equation
• However, this trick does not work for our particular first order kinetics problem!
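A short worked comparison may make this clearer. Taking logs of the simple exponential model gives a straight line in t, whereas the same step applied to the first order kinetics model only gives a straight line if c∞ is already known, and c∞ is one of the parameters we are trying to fit.

```latex
% Model that can be linearised: plot ln c against t, slope = -k
c(t) = c_0 \exp(-kt)
\quad\Rightarrow\quad
\ln c(t) = \ln c_0 - kt

% First order kinetics model: the left-hand side contains the unknown c_\infty
c(t) = c_\infty [1 - \exp(-kt)]
\quad\Rightarrow\quad
\ln [c_\infty - c(t)] = \ln c_\infty - kt
```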
Nonlinear Least Squares Example
• We need to set this up as a nonlinear least
squares curve fitting problem.
• Get a rough idea of possible starting values for the parameters from the c(t) graph.
• Rate Constant, k, obtained from the initial slope at t = 0 (which equals k c∞)
• Steady State Concentration, c∞, from the long time data.
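A minimal sketch of these two estimates in Python. The data are hypothetical and noise free, generated with k = 0.5 and c∞ = 2.0 purely for illustration. The estimate of k uses the fact that dc/dt = k c∞ at t = 0.

```python
import math

# Hypothetical noise-free data generated with k = 0.5 and c_inf = 2.0.
times = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
concs = [2.0 * (1.0 - math.exp(-0.5 * t)) for t in times]

# Steady state concentration: take the last (long time) data point.
c_inf_guess = concs[-1]

# Initial slope from the first two points; at t = 0, dc/dt = k * c_inf.
initial_slope = (concs[1] - concs[0]) / (times[1] - times[0])
k_guess = initial_slope / c_inf_guess

print(f"starting guess: k = {k_guess:.3f}, c_inf = {c_inf_guess:.3f}")
```

Because the curve bends over straight away, a slope taken from two points underestimates k a little. That is fine: a starting guess only has to be in the right neighbourhood.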
Nonlinear Least Squares Example
Parameter estimation…
Nonlinear Least Squares Example
• A knowledge of the Chemistry or Physics
behind any curve fitting problem can be very useful in helping you to pick good starting parameter values!
• Also, once you have got a curve fit to your data you need to be able to:
• a) interpret the results and
• b) estimate errors in your parameters.
Questions to ask first…
Before starting to look at your particular curve fitting problem, you should ask yourself the following questions:
1 Do I have enough data points?
(Straight line plots need at least
5 data points)
2 Can my data be fitted to a linear model?
(eg straight line, polynomial or using
transformed datasets)
Questions to ask first…
3. Am I using too many parameters?
Remember Occam’s Razor – assume as few hypotheses in your theory as possible
ie Keep it Simple!
Danger of overparametrising a model by using a 5th degree polynomial when only a 2nd degree quadratic would do…
Questions to ask first…
and a very important final question…
4 Do I really understand how nonlinear least squares curve fitting works?
We’ll look at this now.
How does NLLSQ work?
Here, we try to describe our data by some
more complicated function
y = f(x, p1, p2, p3, … pm)
and we have to find best fit values for the ‘m’
parameters {p1, p2, p3, … pm}
The SS function now becomes
$SS = \sum_i (y_i - f_i)^2 = SS(p_1, p_2, p_3, \dots, p_m)$
How does NLLSQ work?
where fi = f(xi, p1, p2, p3, … pm)
is the value of f(x) at the i’th data point xi
• The system of equations that we get by setting the ‘m’ partial derivatives of SS equal to zero are now found to be nonlinear
• The SS function itself may now have more than one minimum value
How does NLLSQ work?
• We must try and find the lowest possible minimum value, i.e. the global minimum.
• This is a lot trickier to do than what we have been used to with the previous linear least squares fits.
But don’t panic! – as there are ways to deal
with this situation.
How does NLLSQ work?
• We use a search method to explore the
multidimensional (hyperspace) surface defined by the SS function.
• We need a starting value for our parameters and we may also need expressions for the partial derivatives (∂f/∂pj) of our model function f.
• These are used to find the fastest route to the minimum.
How does NLLSQ work?
Search methods can include:
• Steepest Descent
• Parabolic Surface Approximation or Quadratic Form.
• Simplex Algorithm
• Genetic Algorithms
How does NLLSQ work?
• In practice, most good programs will use
a combination of the first two methods
• e.g. the Levenberg–Marquardt Algorithm.
• The search method will continue looking for the minimum (iterating) until some suitable stopping criterion has been met.
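How does NLLSQ work?
• The EXCEL spreadsheet ‘Solver’ add in uses a gradient-based search (the Generalized Reduced Gradient, GRG, method), which follows the same downhill idea as the ‘steepest descent’ method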
• it evaluates the partial derivatives (∂f/∂pj) numerically, i.e. by approximation
• your model function, f, must be smooth and well behaved eg no spikey or sharp bits!
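A minimal sketch of a steepest descent search that estimates the partial derivatives numerically. It illustrates the idea only and is not a reproduction of what Solver does internally. The data are hypothetical and noise free (generated with k = 0.5, c∞ = 2.0), and the step-halving rule is one simple choice among many.

```python
import math

# Hypothetical noise-free kinetics data generated with k = 0.5, c_inf = 2.0.
T = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
C = [2.0 * (1.0 - math.exp(-0.5 * t)) for t in T]


def model(t, k, c_inf):
    return c_inf * (1.0 - math.exp(-k * t))


def ss(p):
    """Sum of squares for parameters p = [k, c_inf]."""
    k, c_inf = p
    return sum((c - model(t, k, c_inf)) ** 2 for t, c in zip(T, C))


def numerical_gradient(func, p, h=1e-6):
    """Central difference approximation to the partial derivatives of func."""
    grad = []
    for j in range(len(p)):
        up, down = list(p), list(p)
        up[j] += h
        down[j] -= h
        grad.append((func(up) - func(down)) / (2.0 * h))
    return grad


def steepest_descent(func, start, max_iter=500, tol=1e-12):
    p = list(start)
    for _ in range(max_iter):
        g = numerical_gradient(func, p)
        g2 = sum(gj * gj for gj in g)
        if g2 < tol:              # gradient is (almost) zero: stop
            break
        step, current = 1.0, func(p)
        while step > 1e-12:       # halve the step until SS drops enough
            trial = [pj - step * gj for pj, gj in zip(p, g)]
            if func(trial) <= current - 0.5 * step * g2:
                p = trial
                break
            step *= 0.5
        else:                     # no acceptable step found: give up
            break
    return p


fit = steepest_descent(ss, [0.4, 1.8])
print(f"k = {fit[0]:.4f}, c_inf = {fit[1]:.4f}, SS = {ss(fit):.2e}")
```

The stopping criterion here is a small gradient. A model function with kinks or jumps would break the numerical derivatives, which is why the function must be smooth.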
How does NLLSQ work?
Let’s now look at a two parameter model
that uses parameters p1 and p2,
so we can write f = f(x, p1, p2)
• this could be our earlier First Order Chemical Kinetics example
• where x = time t, p1 = k, and p2 = c∞.
How does NLLSQ work?
Here’s a possible contour map of the
Sum of Squares function SS(p1,p2)
[Figure: contour map of SS(p1,p2); axes p1 and p2.]
How does NLLSQ work?
Here’s the result of using a search method to
look for the minimum in the SS function:-
[Figure: contour map of SS(p1,p2) showing the result of the search; axes p1 and p2.]
How does NLLSQ work?
• If it is computationally difficult or
expensive to calculate function derivatives, then there are other search methods such as the SIMPLEX method.
• For a two parameter problem, this uses three points on the SS surface to define a Simplex shape which can be moved to hunt for a minimum…
The SIMPLEX method…
A 2 parameter simplex search
[Figure sequence: Steps #0, #1, #2, #3, #4 and #6 of a simplex moving over the SS surface; axes p1, p2 and SS.]
How does NLLSQ work?
• For a 3 parameter model, the simplex
method would use 4 points (a tetrahedral simplex)
• For a model with ‘m’ parameters, we would use a simplex constructed from ‘m+1’ points on the ‘m’ dimensional SS hypersurface.
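A simplified sketch of a simplex search in the style of the Nelder-Mead method. It is illustrative only: real implementations have more contraction cases and better stopping rules. It is tried here on the straight line SS function for the five example data points, so for two parameters the simplex is a triangle.

```python
def nelder_mead(func, start, step=0.5, max_iter=500, tol=1e-10):
    """Simplified simplex search; needs only function values, no derivatives."""
    n = len(start)
    # Initial simplex: the start point plus one point offset along each axis.
    simplex = [list(start)]
    for j in range(n):
        point = list(start)
        point[j] += step
        simplex.append(point)

    for _ in range(max_iter):
        simplex.sort(key=func)
        best, worst = simplex[0], simplex[-1]
        if abs(func(worst) - func(best)) < tol:
            break
        # Centroid of every point except the worst one.
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]

        def along(t):
            # Point on the line from the worst point through the centroid.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        if func(reflected) < func(best):
            expanded = along(2.0)
            simplex[-1] = expanded if func(expanded) < func(reflected) else reflected
        elif func(reflected) < func(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = along(-0.5)
            if func(contracted) < func(worst):
                simplex[-1] = contracted
            else:
                # Shrink every point halfway towards the best point.
                simplex = [best] + [
                    [b + 0.5 * (pj - b) for b, pj in zip(best, p)]
                    for p in simplex[1:]
                ]
    simplex.sort(key=func)
    return simplex[0]


# Straight line example data from the lecture.
X = [1.1, 2.0, 2.9, 4.2, 5.0]
Y = [1.4, 1.8, 2.2, 2.8, 2.9]


def sum_of_squares(p):
    m, c = p
    return sum((y - (m * x + c)) ** 2 for x, y in zip(X, Y))


m_fit, c_fit = nelder_mead(sum_of_squares, [0.0, 0.0])
print(f"m = {m_fit:.3f}, C = {c_fit:.3f}")
```

The search should end up close to the slope and intercept given by the linear least squares formulae, without ever calculating a derivative.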
How does NLLSQ work?
• There are some situations where you need to curve fit several sets of data at the same time…
• This can be achieved by generalising the idea of the Sum of Squares to include several data sets, e.g.
$$SS = \sum_{k=1}^{M} \sum_{i=1}^{N_k} (y_{i,k} - f_{i,k})^2$$
where M is the number of data sets and N_k is the number of points in data set k.
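A minimal sketch of such a combined Sum of Squares. The scenario is hypothetical: two kinetics runs that share one rate constant k but each have their own c∞. The data are noise free and generated purely for illustration.

```python
import math


def model(t, k, c_inf):
    return c_inf * (1.0 - math.exp(-k * t))


def global_ss(k, c_infs, datasets):
    """Add up SS over all data sets; k is shared, c_inf differs per data set."""
    total = 0.0
    for c_inf, (times, concs) in zip(c_infs, datasets):
        total += sum((c - model(t, k, c_inf)) ** 2 for t, c in zip(times, concs))
    return total


# Two hypothetical runs with the same k = 0.5 but different c_inf values.
times = [1.0, 2.0, 4.0]
run_a = (times, [model(t, 0.5, 2.0) for t in times])
run_b = (times, [model(t, 0.5, 3.0) for t in times])

ss_true = global_ss(0.5, [2.0, 3.0], [run_a, run_b])
ss_off = global_ss(0.6, [2.0, 3.0], [run_a, run_b])
print(ss_true, ss_off)
```

A search method then minimises this single number with respect to all the parameters at once.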
NLLSQ problem?
• There is one important problem that you
will experience at some stage in your career when doing curve fitting…
• It is one which is not obvious at first sight, and can cause unnecessary grief to the inexperienced…
…especially when writing up your PhD !!
NLLSQ problem?
• The technical description of this problem
is “Overparametrisation of a Model”
• What happens is that one of your model parameters depends implicitly on some of the other parameters.
• This is not easy to spot at first!
Overparametrisation…
An example
A 4th year project student was asked to fit
their data to the following function:-
$$f(x) = \frac{a x + b}{c x + d}$$
This seems to be a 4 parameter function.
Overparametrisation…
To their surprise, they found that more than
one set of parameters gave rise to an
identical curve fit!
What is going on here?
Some further analysis is required…
Overparametrisation…
Notice that for this function, we can divide
both the top and the bottom of the fraction by ‘a’
This yields the following expression for f:-
$$f(x) = \frac{x + (b/a)}{(c/a)\,x + (d/a)}$$
Overparametrisation…
and this can be written as:-
$$f(x) = \frac{x + p_1}{p_2 x + p_3}$$
where p1 = b/a, p2 = c/a, and p3 = d/a.
This analysis shows that we really have
a 3 parameter model and
not a 4 parameter one!
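A quick numerical check of this in Python. The parameter values are hypothetical: the second set is just the first divided by a = 2, and both give exactly the same curve.

```python
def f4(x, a, b, c, d):
    """The apparently 4 parameter function f(x) = (a*x + b) / (c*x + d)."""
    return (a * x + b) / (c * x + d)


params_1 = (2.0, 4.0, 6.0, 8.0)
params_2 = (1.0, 2.0, 3.0, 4.0)   # params_1 divided through by a = 2

xs = [0.0, 0.5, 1.0, 2.0, 5.0]
curve_1 = [f4(x, *params_1) for x in xs]
curve_2 = [f4(x, *params_2) for x in xs]

max_diff = max(abs(u - v) for u, v in zip(curve_1, curve_2))
print(f"largest difference between the two curves: {max_diff}")
```

Any common scale factor gives the same result, so a fitting program has no way to choose between these parameter sets.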
Overparametrisation…
In more general terms
• We have some model function
f = f(x, p1, p2, p3, p4), and, unknown to us,
the parameter p3 is also a function of p1, p2 and p4 i.e. we have p3 = g(p1, p2, p4)
• This means that we are really dealing with a 3 parameter model, h(x, p1, p2, p4), where h = f(g(…)) and g is unknown!
Overparametrisation…
From the perspective of the SS function,
we are not dealing with a true minimum
• instead we have a surface that is now more like a valley (the Grand Canyon effect).
• What is the answer to this problem?
Overparametrisation Fixed!
Fix one of the parameters to a sensible value
and then let the others vary!
1 The assumption is that you have enough
information about your model to make a sensible choice for this parameter (and its value)
2 This may require you to perform different
additional experiments!
Data Analysis for Beginners…
1. Introduction to Data Analysis.
2. Linear Least Squares.
3. Nonlinear Least Squares.
4. Theoretical Models (and Maths).
5. Errors (and what to do with them!).
Theoretical Models and Maths
Some examples from my past
(mainly polymer science).
• Polymer Physical Ageing data – uses the KWW function, f(t) = exp[ -(t/τ)^β ]
• Deconvolution of ‘FTIR’ peaks: Use of Gaussian functions to do peak area determination
Theoretical Models and Maths
[Figure: FTIR peak constructed from 3 Gaussian functions; horizontal axis running from 1800 down to 1640, vertical axis from 0 to 100.]
Might be used in Copolymer Composition Analysis…
Theoretical Models and Maths
• Copolymer Reactivity Ratios determination
• Tg vs “wt fraction” plots => Kwei equation
• Asymmetric peaks fitted with Exponentially
Modified Gaussian function e.g. data from
a Dynamic Mechanical Thermal Analyser
(DMTA) experiment.
Theoretical Models and Maths
• Neutron Scattering data fitted with Dawson’s Integral Function…
$$F(x) = e^{-x^2} \int_0^x e^{t^2}\,dt$$
Using the Excel ‘Solver’ add in
Spreadsheet layout (simple example) :-
p1 p2 p3 SS
3.5 -2 0.4 2.1460
X Y F Resid^2
0.0 2.92 3.50 0.3364
1.0 1.94 1.90 0.0016
2.0 1.55 1.10 0.2025
3.0 1.55 1.10 0.2025
4.0 2.09 1.90 0.0361
5.0 3.20 3.50 0.0900
6.0 4.77 5.90 1.2769
[Chart: “My Data”, the data points and the fitting function plotted against X (axis from 0 to 7).]
Using the Excel ‘Solver’ add in
Spreadsheet layout (simple example) :-
1. Data, Fitting Function and Residuals clearly laid out
2. Parameter values and Sum of Squares clearly laid out
3. Graph of your data (points) and fitted function (curve). This gives you immediate visual feedback!!!
Using the Excel ‘Solver’ add in
After invoking the Solver Dialog…
1. Set the Target Cell: [$D$4]
2. Equal to: ( ) Max  (*) Min  ( ) Value of [ ], i.e. select Min
3. By Changing Cells: [$A$4:$C$4]
4. Then hit the Solve button!
Using the Excel ‘Solver’ add in
Spreadsheet layout (after Solver) :-
p1 p2 p3 SS
2.921429 -1.21607 0.253928 0.0035
X Y F Resid^2
0.0 2.92 2.92 0.0000
1.0 1.94 1.96 0.0004
2.0 1.55 1.51 0.0020
3.0 1.55 1.56 0.0001
4.0 2.09 2.12 0.0009
5.0 3.20 3.19 0.0001
6.0 4.77 4.77 0.0000
[Chart: “My Data”, the data points and the fitted function plotted against X (axis from 0 to 7) after running Solver.]
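The two spreadsheets can be checked with a few lines of Python. The slides do not state the fitting function, so the quadratic F = p1 + p2 X + p3 X² used here is an inference: it reproduces every value in the F column of the first layout. The sketch recomputes the SS cell before and after Solver.

```python
# X and Y columns from the spreadsheet example.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [2.92, 1.94, 1.55, 1.55, 2.09, 3.20, 4.77]


def fitting_function(x, p1, p2, p3):
    # Assumed model (inferred from the F column): a quadratic in x.
    return p1 + p2 * x + p3 * x ** 2


def sum_of_squares(p1, p2, p3):
    """Equivalent of the SS cell: the sum of the Resid^2 column."""
    return sum((y - fitting_function(x, p1, p2, p3)) ** 2 for x, y in zip(X, Y))


ss_start = sum_of_squares(3.5, -2.0, 0.4)                   # starting guess
ss_solver = sum_of_squares(2.921429, -1.21607, 0.253928)    # Solver's values

print(f"SS before Solver = {ss_start:.4f}")
print(f"SS after Solver  = {ss_solver:.4f}")
```

Because this particular model is linear in its parameters, LINEST would give the same answer. Solver is only really needed when the model is nonlinear in the parameters.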
Good Practice…
Before any curve fit:-
• Draw a Graph of your data!
• Can you evaluate the model function in a single spreadsheet cell?
• If not then you may need to use special techniques
Good Practice…
Special spreadsheet techniques:-
1 Use extra columns as workspace
2 Or consider writing your own User Defined Function by employing Excel’s built in programming language (VBA)
This is better if you are dealing with a large
data set – and is less prone to errors when
setting up your spreadsheet!
Excel User Defined functions…
a) in the spreadsheet cell use:-
“=MyFunc(A4,$B$1,$B$2,$B$3)” to evaluate
a user function with 1 variable and 3 parameters
The $’s mean you are using absolute references –
these won’t change when copying cells!
b) open the VBA editor (with Alt+F11), create a new VBA Module (Insert > Module)
and then use the following template code:-
Excel User Defined functions…
e.g.
Function MyFunc(x As Double, p1 As Double, p2 As Double, p3 As Double) As Double
    ' …
    MyFunc = (x + p1) / (p2 * x + p3)
End Function
Good Practice…
After any curve fit:-
• ALWAYS look at the Residuals…
• Can give you a better idea of the quality of fit to your data!
• May indicate that a different model/theory is needed…
and also that your supervisor got it wrong!
Data Analysis for Beginners…
1. Introduction to Data Analysis.
2. Linear Least Squares.
3. Nonlinear Least Squares.
4. Theoretical Models (and Maths).
5. Errors (and what to do with them!).
Errors (and what to do with them!)
• This really needs a separate lecture!
• For Linear fits – use Excel’s LINEST function, as this will also report parameter errors
• For Nonlinear curve fits, this is more tricky and 3rd party add ins are required…
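For the straight line case, the parameter errors can be sketched with the standard textbook formulae for the standard errors of the slope and intercept. The code below applies them to the five example data points from the Linear Least Squares section. It is an illustration of the formulae, not of LINEST's internal code, and the variable names are hypothetical.

```python
import math

# Straight line example data from the lecture.
X = [1.1, 2.0, 2.9, 4.2, 5.0]
Y = [1.4, 1.8, 2.2, 2.8, 2.9]

n = len(X)
sum_x = sum(X)
sum_y = sum(Y)
sum_xx = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))
det = n * sum_xx - sum_x ** 2

# Best fit slope and intercept.
m = (n * sum_xy - sum_x * sum_y) / det
c = (sum_y * sum_xx - sum_x * sum_xy) / det

# Residual variance: SS at the minimum divided by (N - 2) degrees of freedom.
ss_min = sum((y - (m * x + c)) ** 2 for x, y in zip(X, Y))
s2 = ss_min / (n - 2)

se_m = math.sqrt(s2 * n / det)        # standard error of the slope
se_c = math.sqrt(s2 * sum_xx / det)   # standard error of the intercept

print(f"m = {m:.3f} +/- {se_m:.3f}")
print(f"C = {c:.3f} +/- {se_c:.3f}")
```

With only five points the errors are not small compared with the parameters, which is a good reason to quote them.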
Errors (and what to do with them!)
• Could use “Solver Aid” – which is a VBA Excel Macro which will estimate the errors in your fitted parameters
• This is available from the website of Robert de Levie – who also has a book…
“Advanced Excel for scientific data analysis”
(Oxford University Press)
Errors (and what to do with them!)
• How does “Solver Aid” work?
• Need to look at what happens to the SS function near the minimum when all parameters are frozen apart from one
• This gives a function of one variable in the parameter of choice, eg p2
• the shape of this function near the minimum determines the error in p2
Conclusions…
We have considered:-
• Introduction to Data Analysis.
• Linear Least Squares.
• Nonlinear Least Squares.
• Theoretical Models (and Maths).
• Errors (and what to do with them!).
Conclusions
When you have mastered these skills,
you will then have started the journey to
becoming a Professional Scientist…
Have fun, and remember Dr Ferguson’s
42nd Law:-
DON’T PANIC!!