MChem Computing and Chemistry [B14SC3]


“Data Analysis for Beginners…”

or

• How to avoid disasters when writing up your research work…

or

• On the Meaning of Life, the Universe and Everything!

Lecture #42 Dr Roderick Ferguson – Summer 2008

The “data analysis” assignment…

This is in 3 parts

Some useful information, and a copy of the

MS Word document can be found at:-

http://www.eps.hw.ac.uk/~cherrf/B14SC3

(this link is also available from my chemistry staff home page)


The “data analysis” assignment…

Now, if you are really stuck and can’t see

how to get started with the 2nd part of this…

then I will be available over the next 2 days

i.e. Wed 4th and Thurs 5th June to offer some

help - My office is now DB 2.49

But, you should have enough knowledge

to be able to attempt this yourselves!

Data Analysis for Beginners…

1. Introduction to Data Analysis.

2. Linear Least Squares.

3. Nonlinear Least Squares.

4. Theoretical Models (and Maths).

5. Errors (and what to do with them!).







Introduction to Data Analysis

Why do you need to do it?

• Data Analysis is an essential skill for a

professional scientist today.

• Many modern instruments can generate large quantities of numerical data which require some sort of analysis and/or theoretical interpretation.


How can you do it?

• Most people now have access to very powerful Desktop Computers…

• There are many software tools that can be used to analyse numerical data…

• Today we will focus on what can be done with something that is readily available, i.e. Microsoft Excel!

Introduction to Data Analysis

An historical aside

• This was not always true…

• Modern research workers have no idea of what performing data analysis was like before the (micro) computer revolution that started in the 1980s…

Introduction to Data Analysis

An historical aside (cont)

• To do data analysis, you had to have access to a large “mainframe” computer…

• You also had to learn at least one computer programming language…

• And you also had to type in both your numerical data and analysis program

onto punched paper cards!

Introduction to Data Analysis

• There are also many pitfalls and traps that the new research worker can very easily fall into!

• Thus some background knowledge on both the how and the why aspects is required…

• Also, it is never a good idea to use something without first understanding how it works!


At first, your reactions will probably be fear and confusion when you try to do Data Analysis…


• However,

Don’t Panic!

• because, it’s easier than you think…







Linear Least Squares

Whilst performing data analysis, you will

encounter the following terms…

• “Best Fit”

• “Goodness of Fit”

• “Residuals”

• “Sum of Squares”

What do they mean?

Linear Least Squares – an example

We have two columns of data and want

to see if there is a LINEAR relationship

between them …

point #    X      Y
1          1.1    1.4
2          2.0    1.8
3          2.9    2.2
4          4.2    2.8
5          5.0    2.9

[Figure: scatter plot of Y against X for the five data points]

Step 0 – Draw a Graph!


What would your idea of a good straight line fit to

this data be?

[Figure: the same scatter plot of Y against X]


Can we give a more precise mathematical

description for the idea of “Best Fit” ?

Yes – we can!

Need some definitions. We’ll look at Residuals

and the Sum of Squares (SS)

or more precisely,

the “Sum of Squares of the Residuals”.


Back to our graph again …

[Figure: scatter plot of Y against X with a straight line drawn through the points]

… but now with the residuals added!

Linear Least Squares – theory (1)

Consider the i’th data point with values (Xi,Yi) and

suppose that the data can be described by the

familiar straight line relationship

F = m X + C where m is the slope and

C is the intercept.

• Now, for each experimental Yi we calculate a theoretical Yi (which we’ll call Fi) by using the above equation, i.e. Fi = m Xi + C, for all of the i data points.


• The difference between each calculated and experimental value of Y is called the Residual ie we have Ri = Yi – Fi for all i data points.

• Note that sometimes Ri will be positive and also sometimes it will be negative…

• How can we get an overall measure of how close the theoretical line is to our data?

• Clearly, it must have something to do with ALL of the residuals …


We define the Sum of Squares of the Residuals (or just SS) as:-

$$SS = \sum_{i=1}^{N} R_i^2 = \sum_{i=1}^{N} (Y_i - F_i)^2$$

• This gives us a single quantity that measures how good a fit the straight line is to the data.

• Note also that SS will depend on both m and C


ie SS = SS(m,C), which means that SS is a function of two independent variables {m and C} so that:

$$SS(m,C) = \sum_{i=1}^{N} (Y_i - m X_i - C)^2$$

From our original problem we have now got a new problem, i.e. create a Sum of Squares function, and we need to find values of m and C which minimise this function!
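As a minimal sketch of the SS(m,C) function just defined, the following Python evaluates it for the five data points tabulated earlier. The function name `sum_of_squares` and the two trial lines are illustrative choices, not part of the lecture.

```python
# Example data: the five (X, Y) points from the table in this lecture.
X = [1.1, 2.0, 2.9, 4.2, 5.0]
Y = [1.4, 1.8, 2.2, 2.8, 2.9]


def sum_of_squares(m, c, xs=X, ys=Y):
    """Return SS(m, C), the sum of squared residuals R_i = Y_i - (m*X_i + C)."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Two trial lines; the slopes and intercepts are arbitrary guesses.
ss_flat = sum_of_squares(0.0, 2.2)    # a horizontal line near the middle of the Y values
ss_tilted = sum_of_squares(0.4, 1.0)  # a line that follows the upward trend

print("SS for flat line:  ", ss_flat)
print("SS for tilted line:", ss_tilted)
```

The tilted line gives a much smaller SS than the flat one, which is the numerical version of saying that it is the better fit.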


In other words,

• How do we find minimum values (or minima) of functions ?

• We need the help of Calculus – ie the part of Maths that deals with the rate of change of a function …

Best Fit => find Minimum of the SS function!

Linear Least Squares – Simple Calculus (1)

Recall a function of one variable:

[Figure: a curve F(x) with three tangent lines drawn on it: one with positive slope, one with negative slope, and one with zero slope, which marks a minimum]


Function of one variable:

• The slope of the tangent line is given by the rate of change of F with x, dF/dx or the derivative of F.

• Furthermore, at a minimum (or maximum) value of the function F(x), the slope of the tangent line is zero ie dF/dx = 0

We can also define functions of more than one

variable…


Function of more than one variable:

• If F = F(x,y,z), a function of three variables x, y and z – then we can define 3 partial derivatives namely, ∂F/ ∂x, ∂F/∂y and ∂F/∂z.

You may be familiar with this notation from

your Thermodynamics studies…

Note that partial derivatives are very useful

in ALL branches of the Physical Sciences!


Recall our original problem of finding the minimum of the function, SS(m,C):

• need to find the values of m and C that make the two partial derivatives of SS vanish

• ie we need to solve the pair of equations:-

$$\frac{\partial SS}{\partial m} = 0, \qquad \frac{\partial SS}{\partial C} = 0$$

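Linear Least Squares – theory (6)

• this is very easy to do for the straight line case

• Our SS function, SS(m,C), is given by the following (after some expansion!)

$$SS(m,C) = m^2 \sum X_i^2 + 2mC \sum X_i + N C^2 - 2m \sum X_i Y_i - 2C \sum Y_i + \sum Y_i^2$$

ie SS is of the general form:

$$SS(m,C) = A m^2 + 2H m C + B C^2 + 2E m + 2F C + G$$

where, comparing the two forms term by term, A = ΣXi², H = ΣXi, B = N, E = −ΣXiYi, F = −ΣYi and G = ΣYi².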


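For the straight line, the Sum of Squares function is a quadratic surface in m and C (an elliptic paraboloid), so its contours are conic sections, ie a contour map of this surface will be a series of concentric ellipses.

[Figure: surface plot of SS against m and C]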


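When we perform the two differentiations, we get:

$$\frac{\partial SS}{\partial m} = 2m \sum X_i^2 + 2C \sum X_i - 2 \sum X_i Y_i = 0$$

$$\frac{\partial SS}{\partial C} = 2m \sum X_i + 2NC - 2 \sum Y_i = 0$$

or

$$\frac{\partial SS}{\partial m} = 2Am + 2HC + 2E$$

$$\frac{\partial SS}{\partial C} = 2Hm + 2BC + 2F$$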


These two linear equations are very easy to solve

for both m and C…

(you can try doing this as an exercise… !)

=> any good pocket calculator can do linear least

squares fits to data!
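For readers who would rather check the exercise numerically, here is a short sketch that solves the two linear equations for m and C directly, using the five-point example data from earlier in the lecture. The function name `fit_line` is an illustrative choice.

```python
# Example data: the five (X, Y) points from the table in this lecture.
X = [1.1, 2.0, 2.9, 4.2, 5.0]
Y = [1.4, 1.8, 2.2, 2.8, 2.9]


def fit_line(xs, ys):
    """Solve the two linear least squares equations for slope m and intercept C."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # The equations to solve are:
    #   m*sxx + C*sx = sxy
    #   m*sx  + C*n  = sy
    det = n * sxx - sx * sx
    m = (n * sxy - sx * sy) / det
    c = (sxx * sy - sx * sxy) / det
    return m, c


m_best, c_best = fit_line(X, Y)
print("slope m =", round(m_best, 4), " intercept C =", round(c_best, 4))
# prints: slope m = 0.4015  intercept C = 0.9994
```

No iteration or starting guess is needed here, which is what makes the linear case so convenient.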

Now, we’ll look at a few applications of

Linear Least Squares theory…

Linear Least Squares – applications (1)

Linear least squares analysis can be extended to

help with other problems.

Often, you will encounter polynomials eg

$$P_n(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \dots + a_n x^n$$

and we can set up an SS function for this n’th degree polynomial…

ie

$$SS = \sum (y_i - f_i)^2 = SS(a_0, a_1, a_2, a_3, \dots, a_n)$$


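For a polynomial of degree ‘n’, we have to solve the following system of ‘n+1’ linear equations:-

$$\frac{\partial SS}{\partial a_0} = 0, \quad \frac{\partial SS}{\partial a_1} = 0, \quad \dots, \quad \frac{\partial SS}{\partial a_n} = 0$$

These equations can be solved by standard matrix methods using ‘Linear Algebra’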

The Microsoft EXCEL spreadsheet computer program can do linear least square fits of this more general kind via the LINEST function.
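The sketch below shows one way the ‘n+1’ linear equations can be built and solved in plain Python, using Gaussian elimination as the "standard matrix method". It is an illustration only and is not a description of how LINEST works internally. The function name is hypothetical, and the seven data points are the ones used in the Solver spreadsheet example later in this lecture.

```python
def polyfit_normal_equations(xs, ys, degree):
    """Least squares polynomial coefficients [a0, a1, ..., an]."""
    n = degree + 1
    # Setting each dSS/da_j = 0 gives:  sum_k a_k * sum_i x_i^(j+k) = sum_i y_i * x_i^j
    a = [[sum(x ** (j + k) for x in xs) for k in range(n)] for j in range(n)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(n)]

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]

    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.92, 1.94, 1.55, 1.55, 2.09, 3.20, 4.77]
a0, a1, a2 = polyfit_normal_equations(xs, ys, 2)
print(round(a0, 4), round(a1, 4), round(a2, 4))
```

For this quadratic fit the coefficients come out close to 2.921, -1.216 and 0.254. For high degree polynomials the equations become badly conditioned, so a library routine is usually a safer choice than hand-written elimination.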


Two uses of polynomials…

1) Calibration curves

In Chemistry, polynomial functions are often used

to construct calibration curves for some analytical

technique such as Mass Spectrometry

or Atomic Absorption

Here the instrument response is known to be

describable by a polynomial function

(usually a 3rd or 4th degree polynomial).


Two uses of polynomial functions…

2) Data Smoothing

Another application of polynomials and linear least

squares fitting is data smoothing and interpolation.

Example – X Ray scattering data from amorphous

polymer samples (Dr Arrighi).

These experiments can generate HUGE data files!


The Intensity vs angle and temperature data are

conveniently stored as 2 dimensional Excel

spreadsheets.

Also, the I(Q,T) vs Q plots for a fixed temperature,

T, are often found to be very noisy.

Quadratic polynomials can be used to ‘smooth’ the

data so that important features stand out.

How does this work?


First, here’s the original noisy data:-

[Figure: noisy I(Q,T) plotted against Q]


And now, here’s the smoothed data:-

[Figure: smoothed I(Q,T) plotted against Q]


And here’s how smoothing works…

[Figure, with legend: Original data point, Interpolated data point, Smoothing Polynomial]


Data Smoothing is a potentially risky operation…

You must be very careful when you do this…

Why?

Because you could be throwing away some vital

information – especially if you use too much data

smoothing!
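The lecture does not say which smoothing scheme was used, so the following is only a sketch of one common approach, under two assumptions: the points are equally spaced, and a quadratic is fitted to a sliding window of 5 points. In that case the least squares quadratic evaluated at the window centre reduces to a weighted average with weights (-3, 12, 17, 12, -3)/35.

```python
def smooth_quadratic_5pt(ys):
    """Smooth equally spaced data with a sliding 5-point quadratic fit.

    Each interior point is replaced by the value, at the window centre, of
    the least squares quadratic through the 5 points around it. The two
    points at each end have no full window and are returned unchanged.
    """
    weights = (-3.0, 12.0, 17.0, 12.0, -3.0)
    out = list(ys)
    for i in range(2, len(ys) - 2):
        window = ys[i - 2:i + 3]
        out[i] = sum(w * y for w, y in zip(weights, window)) / 35.0
    return out


# A single isolated spike is reduced but not removed.
spike = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
smoothed_spike = smooth_quadratic_5pt(spike)

# Data that already lie on a quadratic pass through unchanged.
parabola = [float(x * x) for x in range(8)]
smoothed_parabola = smooth_quadratic_5pt(parabola)
```

The spike example illustrates the warning above: a genuine sharp feature is flattened just as readily as noise, and a wider window would flatten it further.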







Nonlinear Least Squares

• Often we need to fit data to a more complicated nonlinear function…

• The sum of squares equations are then also nonlinear…

• and so we have to use other methods to solve the minimisation problem…


• Another feature of this type of problem is that you have to supply a reasonable starting guess for the parameters used in your theoretical model…

• You must have a feeling for the behaviour of your model function…

• Good idea to plot out both your data and model function on the SAME graph!


• By trying out several different sets of parameter values, you can get a rough idea of where a good starting guess is.

• Once a suitable starting set of parameters has been found, then there are several methods (algorithms) that can be used to locate the minimum in the SS function.

Nonlinear Least Squares Example

• For a good example of a Chemistry based nonlinear curve fitting exercise, one can look at First Order Chemical Kinetics.

• The concentration of a new molecule that is being produced in a chemical reaction that follows First Order Kinetics can be described as:-

$$c(t) = c_\infty [1 - \exp(-kt)]$$

here c(t) is the concentration at any time t, and c∞ is the final steady state concentration ie c∞ = c(∞)

To see where this comes from, let’s look at the rate of change of c with time ie dc/dt


$$\frac{dc(t)}{dt} = k c_\infty \exp(-kt) = k [c_\infty - c(t)]$$

or equivalently

$$\frac{dc(t)}{dt} + k c(t) = k c_\infty$$

This is an example of a first order linear differential equation.

Here is a plot of the c(t) function:-

Nonlinear Least Squares Example

• For some models, one can transform a nonlinear function into a linear function by using some maths…

• eg if the model was c(t) = c0 exp(-kt) then taking logs would give a linear equation

• However, this trick does not work for our particular first order kinetics problem!
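To spell out why the trick works in one case and not the other, here is a short worked derivation. For the simple exponential decay, taking natural logs gives a straight line in t with slope -k and intercept ln c0:

$$\ln c(t) = \ln c_0 - kt$$

For the first order kinetics model c(t) = c∞[1 − exp(−kt)], taking logs directly gives

$$\ln c(t) = \ln c_\infty + \ln[1 - \exp(-kt)]$$

which is not linear in t. Rearranging first does give a straight line,

$$\ln[c_\infty - c(t)] = \ln c_\infty - kt$$

but the quantity on the left can only be calculated if c∞ is already known, and c∞ is one of the parameters we are trying to find.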

Nonlinear Least Squares Example

• We need to set this up as a nonlinear least squares curve fitting problem.

• Get a rough idea of possible starting values for the parameters from the c(t) graph.

• Rate Constant, k, obtained from the initial slope at t = 0

• Steady State Concentration, c∞, from long time data.


Parameter estimation…
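A rough sketch of this parameter estimation step is given below. It relies on the fact that, for this model, the slope at t = 0 equals k·c∞, so k can be estimated as (initial slope)/c∞. The data are synthetic and noise-free, generated from assumed values k = 0.5 and c∞ = 2.0 purely for illustration, and the function name is hypothetical.

```python
import math

# Synthetic, noise-free "measurements" from assumed values k = 0.5, c_inf = 2.0.
times = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
conc = [2.0 * (1.0 - math.exp(-0.5 * t)) for t in times]


def starting_guess(ts, cs):
    """Rough starting values for k and c_inf read straight from the data."""
    c_inf0 = cs[-1]                              # plateau: the longest-time point
    slope0 = (cs[1] - cs[0]) / (ts[1] - ts[0])   # crude estimate of dc/dt at t = 0
    k0 = slope0 / c_inf0                         # because dc/dt at t = 0 is k * c_inf
    return k0, c_inf0


k0, c_inf0 = starting_guess(times, conc)
print("starting guess: k =", round(k0, 3), " c_inf =", round(c_inf0, 3))
```

The slope taken between the first two points underestimates the true initial slope because the curve is already bending over, so the guess for k comes out a little low. That is acceptable: a starting guess only needs to be in the right neighbourhood.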

Nonlinear Least Squares Example

• A knowledge of the Chemistry or Physics behind any curve fitting problem can be very useful in helping you to pick good starting parameter values!

• Also, once you have got a curve fit to your data you need to be able to:

• a) interpret the results and

• b) estimate errors in your parameters.

Questions to ask about first…

Before starting to look at your particular curve fitting problem, you should ask yourself the following questions:

1 Do I have enough data points?

(Straight line plots need at least

5 data points)

2 Can my data be fitted to a linear model?

(eg straight line, polynomial or using

transformed datasets)

Questions to ask about first…

3 Am I using too many parameters?

Remember Occam’s Razor – assume as few hypotheses in your theory as possible

ie Keep it Simple!

Danger of overparametrising a model by using a 5th degree polynomial when only a 2nd degree quadratic would do…

Questions to ask about first…

and a very important final question…

4 Do I really understand how nonlinear least squares curve fitting works?

We’ll look at this now.

How does NLLSQ work?

Here, we try to describe our data by some

more complicated function

y = f(x, p1, p2, p3, … pm)

and we have to find best fit values for the ‘m’

parameters {p1, p2, p3, … pm}

The SS function now becomes

$$SS = \sum (y_i - f_i)^2 = SS(p_1, p_2, p_3, \dots, p_m)$$

How does NLLSQ work?

where fi = f(xi, p1, p2, p3, … pm)

is the value of f(x) at the i’th data point xi

• The system of equations that we get by setting the ‘m’ partial derivatives of SS equal to zero are now found to be nonlinear

• The SS function itself may now have more than one minimum value

How does NLLSQ work?

• We must try and find the lowest possible minimum value ie the global minimum.

• This is a lot trickier to do than what we have been used to with the previous linear least squares fits.

But don’t panic! – as there are ways to deal

with this situation.

How does NLLSQ work?

• We use a search method to explore the multidimensional (hyperspace) surface defined by the SS function.

• We need a starting value for our parameters and we may also need expressions for the partial derivatives (∂f/∂pj) of our model function f.

• These are used to find the fastest route to the minimum.

How does NLLSQ work?

Search methods can include:

• Steepest Descent

• Parabolic Surface Approximation or Quadratic Form.

• Simplex Algorithm

• Genetic Algorithms

How does NLLSQ work?

• In practice, most good programs will use a combination of the first two methods

• eg the Levenberg-Marquardt Algorithm.

• The search method will continue looking for the minimum (iterating) until some suitable stopping criterion has been met.
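To make the idea of an iterative search with a stopping criterion concrete, here is a compact sketch of a Levenberg-Marquardt style fit for the two-parameter first order kinetics model c(t) = c∞[1 − exp(−kt)]. It is a teaching sketch under stated assumptions, not the algorithm of any particular program: the damping schedule (divide by 10 on success, multiply by 10 on failure), the tolerances and the function names are all illustrative choices, and the data are synthetic and noise-free, generated from k = 0.5 and c∞ = 2.0.

```python
import math


def model(t, k, c_inf):
    return c_inf * (1.0 - math.exp(-k * t))


def sum_sq(ts, cs, k, c_inf):
    return sum((c - model(t, k, c_inf)) ** 2 for t, c in zip(ts, cs))


def fit_lm(ts, cs, k, c_inf, max_iter=200, tol=1e-14):
    """Damped least squares search for the minimum of SS(k, c_inf)."""
    lam = 1e-3  # damping: small = quadratic-surface step, large = short downhill step
    ss = sum_sq(ts, cs, k, c_inf)
    for _ in range(max_iter):
        a11 = a12 = a22 = g1 = g2 = 0.0
        for t, c in zip(ts, cs):
            e = math.exp(-k * t)
            r = c - c_inf * (1.0 - e)   # residual
            d_k = c_inf * t * e         # partial derivative df/dk
            d_c = 1.0 - e               # partial derivative df/dc_inf
            a11 += d_k * d_k
            a12 += d_k * d_c
            a22 += d_c * d_c
            g1 += d_k * r
            g2 += d_c * r
        # Solve the damped 2x2 system for the parameter step (dk, dc).
        m11 = a11 * (1.0 + lam)
        m22 = a22 * (1.0 + lam)
        det = m11 * m22 - a12 * a12
        if det == 0.0:
            break
        dk = (m22 * g1 - a12 * g2) / det
        dc = (m11 * g2 - a12 * g1) / det
        new_ss = sum_sq(ts, cs, k + dk, c_inf + dc)
        if new_ss < ss:
            # Step accepted: move, and trust the quadratic approximation more.
            k, c_inf = k + dk, c_inf + dc
            converged = (ss - new_ss) < tol   # stopping criterion
            ss = new_ss
            lam *= 0.1
            if converged:
                break
        else:
            # Step rejected: stay put and take a more cautious step next time.
            lam *= 10.0
            if lam > 1e10:
                break
    return k, c_inf, ss


times = [0.5, 1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 12.0]
conc = [2.0 * (1.0 - math.exp(-0.5 * t)) for t in times]
k_fit, c_fit, ss_fit = fit_lm(times, conc, 0.3, 1.5)
print(round(k_fit, 4), round(c_fit, 4))
```

Because a step is only accepted when it lowers SS, the search can never move uphill. With real, noisy data the final SS will not be zero and the fitted values will differ slightly from the "true" ones.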

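How does NLLSQ work?

• The EXCEL spreadsheet ‘Solver’ add in uses a gradient based search for smooth nonlinear problems (the Generalized Reduced Gradient, or GRG, method), which is a relative of ‘steepest descent’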

• it evaluates the partial derivatives (∂f/∂pj) numerically, i.e. by approximation

• your model function, f, must be smooth and well behaved eg no spikey or sharp bits!

How does NLLSQ work?

Let’s now look at a two parameter model; this uses parameters p1 and p2 so we can write f = f(x, p1, p2)

• this could be our earlier First Order Chemical Kinetics example

• where x = time t, p1 = k, and p2 = c∞.

How does NLLSQ work?

Here’s a possible contour map of the Sum of Squares function SS(p1,p2)

[Figure: contour map of SS with p1 and p2 as the axes]

How does NLLSQ work?

Here’s the result of using a search method to look for the minima in the SS function:-

[Figure: the same contour map of SS(p1,p2) with the path of the search drawn on it]

How does NLLSQ work?

• If it is computationally difficult or expensive to calculate function derivatives, then there are other search methods such as the SIMPLEX method.

• For a two parameter problem, this uses three points on the SS surface to define a Simplex shape which can be moved to hunt for a minimum…

The SIMPLEX method…

A 2 parameter simplex search

[Figures: the SS surface plotted against p1 and p2, showing the simplex at Step #0 and at successive later steps of the search]

How does NLLSQ work?

• For a 3 parameter model, the simplex method would use 4 points (a tetrahedral simplex)

• For a model with ‘m’ parameters, we would use a simplex constructed from ‘m+1’ points on the ‘m’ dimensional SS hypersurface.
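The following is a simplified sketch of a simplex search of the kind described above (a basic Nelder-Mead style variant with reflection, expansion, contraction and shrink moves). The step sizes, tolerances and function names are illustrative assumptions. As a test problem it minimises SS(m,C) for the five-point straight line data from earlier in the lecture, so with m = 2 parameters the simplex is a triangle of 3 points. Note that no derivatives are needed.

```python
def nelder_mead(func, start, step=0.5, max_iter=500, tol=1e-12):
    """Minimise func(p) for a list p of m parameters with a basic simplex search."""
    m = len(start)
    # Initial simplex: the start point plus one point displaced along each axis.
    simplex = [list(start)]
    for j in range(m):
        p = list(start)
        p[j] += step
        simplex.append(p)
    values = [func(p) for p in simplex]

    for _ in range(max_iter):
        # Order the m + 1 vertices from best (lowest SS) to worst.
        order = sorted(range(m + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if values[-1] - values[0] < tol:
            break  # stopping criterion: all vertices give nearly the same SS
        centroid = [sum(p[j] for p in simplex[:-1]) / m for j in range(m)]
        worst = simplex[-1]

        def along(scale):
            # A point on the line through the worst vertex and the centroid.
            return [centroid[j] + scale * (centroid[j] - worst[j]) for j in range(m)]

        reflected = along(1.0)
        f_ref = func(reflected)
        if f_ref < values[0]:
            # Better than the best so far: try going further the same way.
            expanded = along(2.0)
            f_exp = func(expanded)
            if f_exp < f_ref:
                simplex[-1], values[-1] = expanded, f_exp
            else:
                simplex[-1], values[-1] = reflected, f_ref
        elif f_ref < values[-2]:
            simplex[-1], values[-1] = reflected, f_ref
        else:
            contracted = along(-0.5)  # halfway between worst vertex and centroid
            f_con = func(contracted)
            if f_con < values[-1]:
                simplex[-1], values[-1] = contracted, f_con
            else:
                # Shrink every vertex towards the best one.
                best = simplex[0]
                simplex = [best] + [
                    [best[j] + 0.5 * (p[j] - best[j]) for j in range(m)]
                    for p in simplex[1:]
                ]
                values = [values[0]] + [func(p) for p in simplex[1:]]

    best_index = min(range(m + 1), key=lambda i: values[i])
    return simplex[best_index], values[best_index]


X = [1.1, 2.0, 2.9, 4.2, 5.0]
Y = [1.4, 1.8, 2.2, 2.8, 2.9]


def ss_line(p):
    m_slope, c = p
    return sum((y - (m_slope * x + c)) ** 2 for x, y in zip(X, Y))


p_best, ss_best = nelder_mead(ss_line, [0.0, 0.0])
print([round(v, 3) for v in p_best], round(ss_best, 4))
```

The search should settle close to the slope and intercept given by the linear least squares equations (about 0.40 and 1.00), which is a useful sanity check before trusting the method on a genuinely nonlinear model.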

How does NLLSQ work?

• There are some situations where you need to curve fit several sets of data at the same time…

• This can be achieved by generalising the idea of the Sum of Squares to include several data sets e.g.

$$SS = \sum_{k=1}^{M} \sum_{i=1}^{N(k)} (y_{i,k} - f_{i,k})^2$$

NLLSQ problem?

• There is one important problem that you will experience at some stage in your career when doing curve fitting…

• It is one which is not obvious at first sight, and can cause unnecessary grief to the inexperienced…

…especially when writing up your PhD !!

NLLSQ problem?

• The technical description of this problem is “Overparametrisation of a Model”

• What happens is that one of your model parameters depends implicitly on some of the other parameters.

• This is not easy to spot at first!

Overparametrisation…

An example

A 4th year project student was asked to fit their data to the following function:-

$$f(x) = \frac{ax + b}{cx + d}$$

This seems to be a 4 parameter function.

Overparametrisation…

To their surprise, they found that more than

one set of parameters gave rise to an

identical curve fit!

What is going on here?

Some further analysis is required…

Overparametrisation…

Notice that for this function, we can divide both the top and the bottom of the fraction by ‘a’

This yields the following expression for f:-

$$f(x) = \frac{x + (b/a)}{(c/a)x + (d/a)}$$

Overparametrisation…

and this can be written as:-

$$f(x) = \frac{x + p_1}{p_2 x + p_3}$$

where p1 = b/a, p2 = c/a, and p3 = d/a.

This analysis shows that we really have a 3 parameter model and not a 4 parameter one!
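A quick numerical check makes the point: multiplying all four parameters by the same non-zero factor leaves the curve unchanged, so a fitting program has no way to prefer one set over the other. The parameter values and the scale factor below are arbitrary illustrations.

```python
def f4(x, a, b, c, d):
    """The apparently 4 parameter model f(x) = (a*x + b) / (c*x + d)."""
    return (a * x + b) / (c * x + d)


xs = [0.0, 0.5, 1.0, 2.0, 4.0]

set_1 = (1.0, 2.0, 3.0, 4.0)            # a, b, c, d (arbitrary values)
set_2 = tuple(2.5 * p for p in set_1)   # every parameter scaled by the same factor

curve_1 = [f4(x, *set_1) for x in xs]
curve_2 = [f4(x, *set_2) for x in xs]

# Largest difference between the two curves at the sampled x values.
max_gap = max(abs(u - v) for u, v in zip(curve_1, curve_2))
print("largest difference between the two curves:", max_gap)
```

The difference is zero to within floating point rounding, even though the two parameter sets look completely different.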

Overparametrisation…

In more general terms

• We have some model function

f = f(x, p1, p2, p3, p4), and, unknown to us,

the parameter p3 is also a function of p1, p2 and p4 i.e. we have p3 = g(p1, p2, p4)

• This means that we are really dealing with a 3 parameter model, h(x, p1, p2, p4), where h = f(g(…)) and g is unknown!

Overparametrisation…

From the perspective of the SS function,

we are not dealing with a true minimum

• instead we have a surface that is now more like a valley (the Grand Canyon effect).

• What is the answer to this problem?

Overparametrisation Fixed!

Fix one of the parameters to a sensible value

and then let the others vary!

1 The assumption is that you have enough

information about your model to make a sensible choice for this parameter (and its value)

2 This may require you to perform different

additional experiments!







Theoretical Models and Maths

Some examples from my past (mainly polymer science).

• Polymer Physical Ageing data – uses KWW function, f(t) = exp[ -(t/τ)^β ]

• Deconvolution of ‘FTIR’ peaks: Use of Gaussian functions to do peak area determination

Theoretical Models and Maths

[Figure: FTIR peak constructed from 3 Gaussian functions; horizontal axis runs from 1640 to 1800, vertical axis from 0 to 100]

Might be used in Copolymer Composition Analysis…


• Copolymer Reactivity Ratios determination

• Tg vs “wt fraction” plots => Kwei equation

• Asymmetric peaks fitted with Exponentially Modified Gaussian function e.g. data from a Dynamic Mechanical Thermal Analyser (DMTA) experiment.


• Neutron Scattering data fitted with Dawson’s Integral Function…

$$F(x) = e^{-x^2} \int_0^x e^{t^2}\,dt$$

Using the Excel ‘Solver’ add in

Spreadsheet layout (simple example) :-

p1     p2     p3     SS
3.5    -2     0.4    2.1460

X      Y      F      Resid^2
0.0    2.92   3.50   0.3364
1.0    1.94   1.90   0.0016
2.0    1.55   1.10   0.2025
3.0    1.55   1.10   0.2025
4.0    2.09   1.90   0.0361
5.0    3.20   3.50   0.0900
6.0    4.77   5.90   1.2769

[Chart “My Data”: data points and fitted function plotted against X]

Using the Excel ‘Solver’ add in

Spreadsheet layout (simple example) :-

1. Data, Fitting Function and Residuals clearly laid out

2. Parameter values and Sum of Squares clearly laid out

3. Graph of your data (points) and fitted function (curve). This gives you immediate visual feedback!!!

Using the Excel ‘Solver’ add in

After invoking the Solver Dialog…

1. Set the Target Cell [$D$4]

2. Equal to ( )Max (*)Min ( ) Value of [ ]

3. By Changing Cells:

[$A$4:$C$4]

4. Then hit the Solve button!

Using the Excel ‘Solver’ add in

Spreadsheet layout (after Solver) :-

p1          p2          p3          SS
2.921429    -1.21607    0.253928    0.0035

X      Y      F      Resid^2
0.0    2.92   2.92   0.0000
1.0    1.94   1.96   0.0004
2.0    1.55   1.51   0.0020
3.0    1.55   1.56   0.0001
4.0    2.09   2.12   0.0009
5.0    3.20   3.19   0.0001
6.0    4.77   4.77   0.0000

[Chart “My Data”: data points and fitted function plotted against X, after running Solver]
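The two spreadsheet layouts can be reproduced with a few lines of Python, which is a handy way to check a spreadsheet before trusting it. The slides do not state the fitting function explicitly; the F column is consistent with a quadratic, F = p1 + p2·x + p3·x², so that form is assumed here. The function names are illustrative.

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.92, 1.94, 1.55, 1.55, 2.09, 3.20, 4.77]


def fitted(x, p1, p2, p3):
    # Assumed model, inferred from the F column of the spreadsheet.
    return p1 + p2 * x + p3 * x * x


def sum_of_squares(params):
    """The quantity in the Solver target cell: the sum of the Resid^2 column."""
    return sum((y - fitted(x, *params)) ** 2 for x, y in zip(xs, ys))


ss_before = sum_of_squares((3.5, -2.0, 0.4))                  # starting guess
ss_after = sum_of_squares((2.921429, -1.21607, 0.253928))     # after Solver

print(round(ss_before, 4), round(ss_after, 4))
# prints: 2.146 0.0035
```

Both values agree with the SS cells shown in the two layouts, which supports the assumed form of the model.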

Good Practice…

Before any curve fit:-

• Draw a Graph of your data!

• Can you evaluate the model function in a single spreadsheet cell?

• If not then you may need to use special techniques

Good Practice…

Special spreadsheet techniques:-

1 Use extra columns as workspace

2 Or consider writing your own User Defined Function by employing Excel’s built in programming language (VBA)

This is better if you are dealing with a large

data set – and is less prone to errors when

setting up your spreadsheet!

Excel User Defined functions…

a) in the spreadsheet cell use:-

“=MyFunc(A4,$B$1,$B$2,$B$3)” to evaluate

a user function with 1 variable and 3 parameters

The $’s mean you are using absolute references –

these won’t change when copying cells!

b) open the VBA editor (with Alt+F11), insert a new Module (Insert > Module), and then use the following template code:-

Excel User Defined functions…

e.g.

Function MyFunc(x As Double, p1 As Double, p2 As Double, p3 As Double) As Double
    ' … (any intermediate calculations go here)
    MyFunc = (x + p1) / (p2 * x + p3)
End Function

Good Practice…

After any curve fit:-

• ALWAYS look at the Residuals…

• Can give you a better idea of the quality of fit to your data!

• May indicate that a different model/theory is needed…

and also that your supervisor got it wrong!







Errors (and what to do with them!)

• This really needs a separate lecture!

• For Linear fits – use Excel’s LINEST function, as this will also report parameter errors

• For Nonlinear curve fits, this is more tricky and 3rd party add ins are required…
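For the straight line case, the parameter errors can also be worked out by hand from the standard textbook formulas. The sketch below applies them to the five-point example data from earlier in the lecture; it is offered as a cross-check on a spreadsheet result, not as a description of LINEST's internals. The function name is an illustrative choice.

```python
import math

X = [1.1, 2.0, 2.9, 4.2, 5.0]
Y = [1.4, 1.8, 2.2, 2.8, 2.9]


def line_fit_with_errors(xs, ys):
    """Slope, intercept and their standard errors for a straight line fit."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    delta = n * sxx - sx * sx
    m = (n * sxy - sx * sy) / delta
    c = (sxx * sy - sx * sxy) / delta
    ss = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    s2 = ss / (n - 2)                    # residual variance: N points, 2 parameters
    err_m = math.sqrt(s2 * n / delta)    # standard error of the slope
    err_c = math.sqrt(s2 * sxx / delta)  # standard error of the intercept
    return m, c, err_m, err_c


m_fit, c_fit, err_m, err_c = line_fit_with_errors(X, Y)
print("m =", round(m_fit, 3), "+/-", round(err_m, 3))
print("C =", round(c_fit, 3), "+/-", round(err_c, 3))
```

For this data the errors come out at roughly 0.03 on the slope and 0.10 on the intercept. With only five points the divisor N − 2 = 3 is small, which is one reason why more data points give more trustworthy error estimates.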


• Could use “Solver Aid” – which is a VBA Excel Macro which will estimate the errors in your fitted parameters

• This is available from the website of Robert de Levie – who also has a book…

“Advanced Excel for scientific data analysis”

(Oxford University Press)


• How does “Solver Aid” work?

• Need to look at what happens to the SS function near the minimum when all parameters are frozen apart from one

• This gives a function of one variable in the parameter of choice, eg p2

• the shape of this function near the minimum determines the error in p2
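One common way of turning "the shape near the minimum" into a number is sketched below. This is an assumption made for illustration; the lecture does not state the formula that Solver Aid itself uses. Close to the minimum, the one-parameter slice through SS is roughly a parabola centred on the best fit value:

$$SS(p_2) \approx SS_{min} + \frac{1}{2} \frac{\partial^2 SS}{\partial p_2^2} (p_2 - \hat{p}_2)^2$$

With N data points and m fitted parameters, the error estimate is then

$$\sigma_{p_2} \approx \sqrt{\frac{2\, SS_{min}}{(N - m)\; \partial^2 SS / \partial p_2^2}}$$

A sharply curved (narrow) minimum gives a small error, and a flat (wide) minimum gives a large one. Because the other parameters are frozen, this estimate ignores correlations between parameters, so it tends to be too optimistic when two parameters can trade off against each other.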

Conclusions…

We have considered:-

• Introduction to Data Analysis.

• Linear Least Squares.

• Nonlinear Least Squares.

• Theoretical Models (and Maths).

• Errors (and what to do with them!).

Conclusions

When you have mastered these skills,

you will then have started the journey to

becoming a Professional Scientist…

Have fun, and remember Dr Ferguson’s

42nd Law:-

DON’T PANIC!!
