
    Environmetrics I, 1.3.2013 / VMT 1

    Environmetrics I

    Introduction

In general, environmetrics is a discipline that deals with mathematical and statistical methods to analyse and design environmental measurements. However, in this course, the emphasis is on methods related to environmental engineering.

When statistical methods are applied to other sciences, the term metrics is used, as in biometrics, chemometrics, econometrics, environmetrics, psychometrics etc. Some of these metric sciences already have a long history, e.g. biometrics or psychometrics, but many of them are relatively new and not so well known, e.g. chemometrics or environmetrics. Statistical methods are also of great importance in informatics, and especially in bioinformatics. The reason for this increasing importance is the nature of modern measurement technology, which typically produces multivariate measurement signals that have to be interpreted in a meaningful way. For example, spectra (IR, NIR, XRF, Raman), chromatograms, zeta-potential curves and DNA-microarray data all give signals that typically contain hundreds or thousands of numbers instead of a single response. Mathematically speaking, the response is a vector of numbers instead of a single number.

The same development has also occurred in the monitoring of industrial processes. Modern on-line measurement technology gives a huge number of measurements describing the state of a process at a given instant. Many of the statistical methods used for analysing multivariate measurement signals can also be applied in analysing multivariate process measurements.

Another field of applied statistics of increasing importance is related to improving and controlling the quality of industrial processes, e.g. wastewater purification. Statistical process control (SPC) and statistical design of experiments (DOE) have become essential tools in all branches of industry. Historically, the use of statistical methods in industry began within the production of industrial components during the second world war. Later the same ideas were introduced to the process industry as well, especially by such statisticians as Box and Hunter. The Japanese developed these ideas further, and statistical quality control was an essential part of the Japanese success story. Some of the Japanese ideas developed for component production, especially those of Taguchi, were transported to the process industry somewhat uncritically, neglecting the special nature of chemical and biological processes.

The most recent quality philosophy is the so-called Six Sigma quality policy, developed by Motorola and adopted for example in leading telecommunication companies like Nokia, and in many chemical companies as well, e.g. Dow Chemicals or Du Pont. Actually Six Sigma is a collection of statistical methods applied both in production and product design, in addition to a general philosophy where the statistical aspects of all measurement processes are taken into account. The aim of this course is to give the basic knowledge and skills for understanding and applying such quality policies, and for applying statistical methods in the process industry in general.

All statistical methods are computer intensive, i.e. applying these methods requires the use of statistical software. In this introductory course, we shall use the statistical capabilities of MS Excel, Matlab (or Octave) and R. Most of the tools found in Excel can also be found in OpenOffice (www.openoffice.org). Plenty of commercial statistical software is also available, such as MiniTab, SAS, BMDP, Systat, Statgraphics or S-plus. There is also a powerful free statistical software package called R (http://www.r-project.org/), which we shall get acquainted with during the course (see also http://users.metropolia.fi/~velimt/Environmetrics/I/tutorial4R.doc). General mathematical software, such as Matlab, Mathematica, MathCad or Maple, can also be used for statistical calculations and analyses.

The aim of this course is to provide the statistical background knowledge needed to understand and learn the methods stated above. In addition, we will learn the use of the most common statistical procedures. However, the emphasis is on understanding the basic ideas, getting familiar with the statistical terminology and, moreover, learning how to do some elementary statistical calculations in Excel or R.

The course is divided into the following topics:

! The nature of statistical variation
! Graphical tools for describing statistical properties of measurement data
! Computational (mathematical) tools for describing statistical properties of measurement data
! Basic concepts of probability, random variables and their distributions
! Measurement uncertainty
! Confidence and prediction intervals
! Principles of statistical testing and some most common tests
! Regression analysis and calibration
! Use of statistical software

Most of the examples in the text are of a general nature, but we shall also take up environmental applications in the lectures.

The nature of statistical variation in measurement data

    Some questions

! Why should an environmental engineer study statistics?
! Why is it important to take statistical variation into account in making conclusions from measurement data?
! How to discriminate between true (causal) dependencies and random (stochastic, statistical) differences?
! Why is it possible to estimate the amount and the nature of statistical variation?
! How can we make comparisons between different methods, equipment or models taking the randomness in the measurement data into account?
! How to control processes minimizing the effects caused by uncertainties in process data?

    Some facts

! Every measurement contains measurement errors.
! Measurement errors can be roughly divided into two categories: systematic and random.
! Random variation obeys some well-known laws.
! If random variation (statistical aspects) is not taken into account, seriously false conclusions can be made.
! Environmental laws and regulations are laced with statistical terms and concepts.

    Exercise 1

Try to figure out how the above questions and facts are related, i.e. which facts are important for, or maybe even give an answer to, a specific question.

    Concepts related to statistical variation

    Samples and population

A population in statistics means the collection of objects of interest. The objects can be e.g. individuals or measurement results. In most practical applications, the population is a theoretical (hypothetical) concept, typically with infinitely many objects. In such cases all analyses are based on finite samples. In many applications of environmental measurements, finding a good sampling scheme, one that will produce representative samples, is very important. However, sampling theory would be too vast a field to be covered in this introductory course, and our focus is on analysing sampled data assuming that it is representative enough. In some of the examples we shall work on special problems of sampling, e.g. testing the homogeneity of a material.

    Precision and accuracy

Consider measuring the concentration of a chemical in a given sample. It is natural to assume that the concentration is a constant, although unknown, value. However, if we conduct repeated measurements under similar conditions (with a sensitive enough analytical method), it is very unlikely that we get two similar results. The differences between the results of the replicate measurements reflect the random error of the measurement. The difference between the true value and the measured value is the measurement error. This error can be divided into two parts: systematic error (a bias) and random error. In most cases the true value is not known, and consequently the error is not known. In some cases the true value can be considered known, e.g. when making measurements of a standard reference material.


Although repeated measurements (when talking about repeated measurements, it is usually assumed that the measurements are conducted under constant conditions!) vary, we assume that they have a tendency to vary around a fixed value, the expected value. The expected value (or the expectation) is a hypothetical concept, which is almost never exactly known. However, its existence can be justified by the so-called law of large numbers in the theory of probability. In practice, this means that if we could make infinitely many repeated measurements, their average (their mean value) would be the expected value. The difference between the true value and the expected value is called the bias, and it is the systematic part of the error.

The variation around the expected value, i.e. the differences between measured values and the expected value, is the random (statistical) variation. The closeness of the measured values to the true value is called the accuracy of the measurement, and the closeness of the measured values to the expected value is called the precision of the measurement. It is important to note that an accurate measurement (a measurement having good accuracy) also has to be precise (a measurement having good precision), but a precise measurement need not be accurate!

Precision is often divided into repeatability and reproducibility.

Mathematically a measurement result is considered a random variable (we consider this concept more later). We also use the convention that random variables are denoted by capital letters of the Latin alphabet, and non-random (deterministic) variables or constants are denoted by the Greek alphabet or lower case letters of the Latin alphabet. Thus, if we denote the measurement result by Y, the error by E, the expected value by μ and the true value by y0, we have the following decomposition:

E = Y − y0 = (Y − μ) + (μ − y0)    (1)

The first difference on the right hand side of Eq. (1) corresponds to the random error reflecting the precision, and the second one to the systematic error (the bias). From this equation we can see that both the systematic and the random error are known only if we know the expected value and the true value. As already mentioned, the true value is seldom known, but the expected value can be estimated from repeated measurements. However, it is often assumed that the measurement device is well calibrated, and consequently the systematic error is zero or at least negligible.
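The decomposition above can be illustrated with a small simulation (a sketch in Python rather than the course's Excel/R; the true value, bias and error spread below are invented for illustration):

```python
import random
import statistics

random.seed(1)

y0 = 9.00      # hypothetical true value
bias = 0.05    # hypothetical systematic error
sigma = 0.02   # spread of the random error

# Repeated measurements vary around the expected value mu = y0 + bias,
# not around the true value y0.
measurements = [y0 + bias + random.gauss(0, sigma) for _ in range(10000)]

mean = statistics.mean(measurements)
print(round(mean, 3))       # close to 9.05, the expected value
print(round(mean - y0, 3))  # estimates the bias, ca. 0.05
```

The average of many repeated measurements recovers the expected value, and hence the bias, but only because the true value was set by us; in a real measurement the bias stays hidden unless a reference material is available.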

    Exercise 2

    1) Consider shooting at a target (or playing darts). Depict the following cases


    are crucial in statistical reasoning.

Usually the models can be divided into a deterministic and a stochastic (random) part, both based on certain assumptions related to the problem. We illustrate this with a simple example, calibration by a straight line: consider a spectroscopic measurement. We measure the absorbance (Yi) at several known concentrations (xi) and assume that the absorbance depends linearly on the concentration; this is the deterministic part of the model. We also assume that the measured values can be represented in the form Yi = yi + Ei, where the yi's denote the expected values, which are the true values in the absence of systematic errors, and the Ei's denote the random errors. It is usually assumed that the random errors are statistically independent, have expected value zero and obey the normal (Gaussian) distribution. We come to these concepts later. This is the stochastic part of the model. The deterministic part of the model of this example can be expressed by the equation of a straight line

yi = β0 + β1·xi    (2)

The task is to determine the unknown parameters in some optimal way. In general, determining unknown parameters, taking random errors into account, is called parameter estimation, and the special case of estimating the parameters of a known or hypothetical equation is called regression analysis; we shall discuss both more later on.

In practice, the stochastic part of the model is always related to a sample. A sample is a collection of measurements of a hypothetical population (in some other fields of science, the population need not be hypothetical). The question of sampling is quite relevant in getting reliable results. Although problems of sampling are not considered much in this course, their importance should be kept in mind.

    Variable types

Variables can be classified in many ways. The most important classification, from an engineering point of view, is qualitative vs. quantitative. A qualitative variable has values that have no quantitative interpretation, e.g. raw material types. Qualitative variables can also be called categorical or, in some connections, factors. It is said that a qualitative variable is measured on a nominal scale.

Quantitative variables can further be divided into discrete and continuous. Some variables are semi-quantitative in the sense that the values can be ordered, but they cannot be unequivocally quantized. Such variables are called ordinal, or it is said that they are measured on an ordinal scale.


Graphical tools

We shall now present some graphical tools for describing measurement errors that do not require such concepts as mean, standard deviation, normal distribution etc. Later on, after studying computational statistics, we shall present some more graphical tools.

    Histograms

A histogram is a special case of a bar chart. The measured values or, more often, equally spaced intervals of measured values, are put on the x-axis, and on each interval (or value) a bar is drawn whose area is proportional to the number of observations falling onto that interval. If the intervals have equal lengths, the height is proportional to the number of observations falling onto those intervals. The rules of thumb for creating good histograms are the following:

! Use equally spaced intervals whose end points are nice numbers
! The number of intervals should roughly be the square root of the number of observations (measurements)
! The union of the intervals should contain all observations

An example: suppose we have measured the pH of a solution repeatedly 30 times and got the following results:

9.11 9.03 9.00 9.18 8.91 9.00
8.98 9.05 8.94 9.03 9.01 9.05
8.85 9.13 9.21 8.89 8.99 8.93
9.00 8.95 8.97 9.06 8.88 8.93
9.01 9.03 8.86 9.13 9.12 9.03

    Let us first sort the table (column-wise):

8.85 8.93 8.98 9.01 9.03 9.12
8.86 8.93 8.99 9.01 9.05 9.13
8.88 8.94 9.00 9.03 9.05 9.13
8.89 8.95 9.00 9.03 9.06 9.18
8.91 8.97 9.00 9.03 9.11 9.21

Because √30 is ca. 5, we make 5 intervals whose union contains the interval [8.85, 9.21]. We get nice numbers if we divide the interval [8.80, 9.30] into 5 sub-intervals, whose limits are 8.8, 8.9, 9.0, 9.1, 9.2, 9.3. This yields the following histogram:

[Figure: histogram of the 30 pH measurements, with class intervals from 8.8 to 9.3 on the x-axis and frequencies on the y-axis]

The figure tells us that the pH value tends to be around 9, and the frequencies become smaller the further away the measurements are from 9. Actually the measurements seem to be quite normally distributed, and we shall come to this point later on.
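The class frequencies behind the histogram can be checked with a few lines of code (a sketch in Python; the course itself uses Excel and R for this):

```python
# The 30 repeated pH measurements from the example above
ph = [9.11, 9.03, 9.00, 9.18, 8.91, 9.00,
      8.98, 9.05, 8.94, 9.03, 9.01, 9.05,
      8.85, 9.13, 9.21, 8.89, 8.99, 8.93,
      9.00, 8.95, 8.97, 9.06, 8.88, 8.93,
      9.01, 9.03, 8.86, 9.13, 9.12, 9.03]

# Five equally spaced intervals with nice end points: [8.8, 8.9), ..., [9.2, 9.3)
edges = [8.8, 8.9, 9.0, 9.1, 9.2, 9.3]
counts = [sum(1 for x in ph if lo <= x < hi) for lo, hi in zip(edges, edges[1:])]
print(counts)  # → [4, 8, 12, 5, 1]
```

The counts reproduce the bar heights of the histogram: the frequencies peak at the interval containing 9 and fall off on both sides.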

    In Excel, the histogram is drawn in the following way:

1. Type the data into Excel
2. Type in the upper limits of the class intervals (these are called bins in Excel)
3. Click Tools and then Data Analysis
4. Select the data range into the Input Range box
5. Select the bins range into the Bin Range box
6. Mark the Chart Output option
7. Click OK

    Note that Excel uses the upper class interval limits on the x-axis and does notautomatically produce connected bars. These can be changed later!

[Figure: Excel histogram output, with the bins 8.90, 9.00, 9.10, 9.20, 9.30 and "More" on the x-axis and frequency on the y-axis]

It is useful to be able to make a histogram also by hand. In R a histogram is produced by the command hist(x), where x is a variable containing the data.

    Sample distributions


A histogram is an empirical representation of a so-called probability density function, which in turn is the derivative of some (cumulative) probability distribution function (we'll come to these concepts later). The empirical distribution function is a staircase function where the sorted (in ascending order) measurement values are on the x-axis, and the sequential number divided by the total number is on the y-axis. If you want to see an example, type in R

> x = rnorm(100)
> y = (1:100)/100
> plot(sort(x), y, type='b')

    Or in Matlab

>> x = randn(100,1);
>> y = (1:100)/100;
>> plot(sort(x), y, '-o')

    Scatter plots

Scatter plots are mainly used to look for correlations between variables. But they can also be used to check for trends or statistical independence. In this case, the measured values are plotted against the order of measurement (or time, in so-called time series plots). Let us suppose that the pH measurements from our histogram example are ordered row-wise. Then the check-for-independence scatter plot looks like

[Figure: the 30 pH values plotted against measurement order, 1 to 30]

The figure does not show any clear patterns or trends, which supports statistical independence of the measurements. Note that the points should be connected with lines only if the variable on the x-axis (here the order of measurement) is ordered! In Excel a similar figure is produced by the Chart Wizard XY-Scatter tool. Before using it, you have to rearrange the data into a single row or a single column. In R you simply type plot(x).

[Figure: the same scatter plot produced with Excel's XY-Scatter tool]

    Box and whisker plots (box plots)

This is a common way of summarizing data classified by a categorical variable. It is not easy to make a box and whisker plot in Excel, but it is very easy in R.

    Example

    This example is taken from Statistics for Environmental Engineers.

    The data can be found on the course web-page.

    The R-commands

> data = read.table('Ex3.6.data', header=TRUE)
> attach(data)
> plot(Location, TPH)

    give the following graph:

[Figure: box and whisker plot of TPH by Location]

The sample mean of a sample x1, x2, …, xn is defined by

x̄ = (x1 + x2 + … + xn)/n    (3)

Note that, if the measurements are considered to be random variables, then the mean value is also a random variable! The mean value of our previous pH sample is approximately 9.01 (in Excel AVERAGE(range) and in R mean(x); range means a range of cells, e.g. A1:B10).

The median is defined to be the central most number of the sample, or, if the size of the sample is an even number, the average of the two central most numbers. In our pH sample the two central most numbers are 9.00 and 9.01, and thus the median is 9.005 (in Excel MEDIAN(range) and in R median(x)).
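The mean and median of the pH sample can also be verified with the standard library of Python (a sketch; the course tools are Excel's AVERAGE/MEDIAN and R's mean/median):

```python
import statistics

# The 30 pH measurements from the histogram example
ph = [9.11, 9.03, 9.00, 9.18, 8.91, 9.00,
      8.98, 9.05, 8.94, 9.03, 9.01, 9.05,
      8.85, 9.13, 9.21, 8.89, 8.99, 8.93,
      9.00, 8.95, 8.97, 9.06, 8.88, 8.93,
      9.01, 9.03, 8.86, 9.13, 9.12, 9.03]

print(round(statistics.mean(ph), 3))    # ca. 9.009, i.e. approximately 9.01
print(round(statistics.median(ph), 3))  # 9.005, the average of 9.00 and 9.01
```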

There are several other averages (squared, geometric, harmonic etc.), but they will be introduced when needed (not all of them are needed in this course).

    The median can be quite different from the mean value:

    Exercise 4

Calculate the mean value and the median of the samples: 1, 2, 3, 4, 5 and 1, 2, 3, 4, 50.

For reasons obvious from the preceding exercise, the median is called a robust statistic.

    Exercise 5

Explain, in your own words, the concept of robustness in statistics. Try to figure out a case where the use of robust statistics would be important.

The sample standard deviation is defined by

s = √( Σ(xi − x̄)² / (n − 1) )    (4)

It is easy to show that the following is also true

s = √( (Σxi² − n·x̄²) / (n − 1) )    (4b)


For the pH data the standard deviation is ca. 0.091 (check it yourself!).

The standard deviation is the root mean square distance of the values from their mean value. More interpretations are given later on. In Excel the standard deviation is given by STDEV(range) and in R by sd(x).
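Forms (4) and (4b) give the same value; this can be checked for the pH data with a short script (a sketch in Python; in R, sd(x) does the same):

```python
import math

# The 30 pH measurements from the histogram example
ph = [9.11, 9.03, 9.00, 9.18, 8.91, 9.00,
      8.98, 9.05, 8.94, 9.03, 9.01, 9.05,
      8.85, 9.13, 9.21, 8.89, 8.99, 8.93,
      9.00, 8.95, 8.97, 9.06, 8.88, 8.93,
      9.01, 9.03, 8.86, 9.13, 9.12, 9.03]

n = len(ph)
mean = sum(ph) / n

# Eq. (4): root mean square deviation from the mean, with divisor n - 1
s1 = math.sqrt(sum((x - mean) ** 2 for x in ph) / (n - 1))

# Eq. (4b): the computational form using the sum of squares
s2 = math.sqrt((sum(x ** 2 for x in ph) - n * mean ** 2) / (n - 1))

print(round(s1, 3))  # ca. 0.091
print(abs(s1 - s2) < 1e-6)  # both forms agree (up to rounding errors)
```

Note that form (4b) is handy for hand calculation but can suffer from cancellation in floating-point arithmetic, which is why the two results agree only up to rounding errors.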

    Exercise 6

1) Calculate the standard deviation of the samples: 1, 2, 3, 4, 5 and 1, 2, 3, 4, 50.
2) Find a sample of real measurement data from a text book or from the internet, calculate the mean, median and standard deviation, and draw a histogram of that data. Try to give interpretations!


    Random variables and their distributions

A random variable is a variable whose exact value cannot be told in advance. Instead, we can assign probabilities to the possible values. Every measurement is a random variable, because the exact value cannot be told until the measurement is carried out. In addition, the next similar (repeated) measurement will (usually) not yield the same value. To be able to make meaningful calculations and conclusions we need the concept of the distribution of a given random variable. There are two different kinds of random variables: discrete and continuous.

    Discrete random variables

A discrete random variable can have only discrete values. Typical cases are tossing a coin (heads or tails), throwing a die (1, 2, 3, 4, 5 or 6), the number of defects in a product, the number of breaks in a process in a specified time interval etc. The distribution of a discrete random variable is given by its point density function (pdf). A pdf can be represented by a table or by a pdf-plot.

Let us consider throwing a die. We have obviously the following table connecting the values and their probabilities:

    value probability

    1 1/6

    2 1/6

    3 1/6

    4 1/6

    5 1/6

    6 1/6

    The corresponding pdf-plot is

[Figure: pdf-plot of a fair die; six points, each with probability 1/6]

    Exercise 7

1) Sketch the pdf of tossing a coin.
2) Sketch the pdf of the sum of the values of two dice. Hint: to find out the probabilities of the different possible sums of two dice, make a table of six rows and six columns for the possible sums.

We shall get familiar with the two most common discrete probability distributions: the binomial distribution and the Poisson distribution.

    An experiment is called a binomial experiment, if the following conditions arefulfilled:

1. The experiments are statistically independent, i.e. the result of one experiment has no effect whatsoever on the results of the other experiments.

2. The outcome of each experiment can have only two values, one of which is called a success (S) and the other a failure (F). Note that in some applications a success can mean, for example, a defect in a product. Thus, the term success simply means something that we are interested in.

3. The probability of a success is a constant value (p) in each experiment.

Now, if X is the number of successes (k) in n binomial experiments, the probability that X gets the value k, i.e. P(X = k), the pdf of X, is given by the following formula

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)    (5)


    Example 1

    A coin is tossed 5 times. a) What is the probability of getting exactly 3 heads?

    b) What is the probability of getting at most 3 heads?

Solution: a) Tossing a coin is clearly a binomial experiment and in this case n = 5, k = 3 and p = 0.5. Thus Eq. (5) gives us

P(X = 3) = C(5, 3) · 0.5^3 · (1 − 0.5)^2 = 10/32 = 0.3125

b) Here you have to know the fact that if A and B are disjoint events, i.e. events that cannot happen simultaneously, like getting 2 heads and getting 3 heads in the same series of 5 tosses, the probability of either of the results is calculated simply by adding the probabilities P(A) and P(B). (In mathematical notation this is expressed by P(A ∪ B) = P(A) + P(B).) The probability asked for is

P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 1/32 + 5/32 + 10/32 + 10/32 = 26/32 = 0.8125

The binomial coefficients can be calculated by the definition

C(n, k) = n! / (k! (n − k)!)

or they can be obtained from Pascal's triangle.

In Excel the function for binomial probabilities is BINOMDIST(k;n;p;false/true); the value false for the fourth argument gives the pdf-values and true corresponds to the cumulative density function. A cumulative density function (cdf) is a function F(x) which gives the probability P(X ≤ x). In R the binomial probabilities are given by dbinom(k,n,p) and the cumulative binomial probabilities by pbinom(k,n,p).

In Excel the solution for a) is given by =BINOMDIST(3;5;0,5;false) and for b) by =BINOMDIST(3;5;0,5;true), and in R the corresponding commands would be a) dbinom(3,5,0.5) and b) pbinom(3,5,0.5).
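The same numbers can be verified directly from Eq. (5) (a sketch in Python; math.comb gives the binomial coefficient C(n, k)):

```python
from math import comb

def binom_pdf(k, n, p):
    # Eq. (5): P(X = k) for a binomial experiment
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# a) exactly 3 heads in 5 tosses
print(binom_pdf(3, 5, 0.5))                          # → 0.3125
# b) at most 3 heads: sum over the disjoint events k = 0, 1, 2, 3
print(sum(binom_pdf(k, 5, 0.5) for k in range(4)))   # → 0.8125
```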

    Exercise 8

1) What is the probability of getting at least 3 heads when tossing a coin 5 times?
2) What is the probability of getting at least 3 sixes when throwing a die 10 times?
3) On average, 1% of certain products are out of specification. What is the probability that in a package of 20 products there are more than 2 products that are out of specification?


In 1) you have to use the fact that P(A) = 1 − P(not A). This basic property of probabilities will be used many times later on.

Poisson distribution

When n increases and p decreases in such a way that the product np approaches a constant value λ, the limiting distribution is called the Poisson distribution.

Thus binomial probabilities with large n and small p can be approximated with the Poisson distribution. However, the practical relevance of the Poisson distribution relies on the fact that it describes well the probabilities in cases where something is sparsely distributed in a continuous medium. A couple of examples will clarify what is meant by this:

    The number of microbes in a sample of a dilution is approximately Pois-son distributed.

    The number of breaks per a specified length in a thread coming from aspinning machine is approximately Poisson distributed.

    The number of flaws per specified area on a painted surface is approxi-mately Poisson distributed.

    The number of process breaks per a specified time period is approxi-mately Poisson distributed.

    The pdf for the Poisson distribution is given by

P(X = k) = e^(−λ) · λ^k / k!    (6)

In Excel POISSON(k; λ; false/true), and in R dpois(k, lambda) and ppois(k, lambda).
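How close the approximation is can be checked by comparing Eqs. (5) and (6) directly (a sketch in Python; the values n = 1000, p = 0.002, so that np = λ = 2, are invented for illustration):

```python
from math import comb, exp, factorial

def binom_pdf(k, n, p):
    # Eq. (5)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pdf(k, lam):
    # Eq. (6): P(X = k) = e^(-lam) * lam^k / k!
    return exp(-lam) * lam ** k / factorial(k)

n, p = 1000, 0.002   # large n, small p (illustrative values)
lam = n * p          # lam = 2

b = binom_pdf(3, n, p)
q = poisson_pdf(3, lam)
print(round(b, 4), round(q, 4))  # the two probabilities are very close
```

With large n and small p the two pdfs agree to about three decimal places, which is usually more than enough in practice.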

    Exercise 9

1) Calculate 3) of exercise 8 using the Poisson approximation and compare the results.
2) Consider a continuous process where the number of breaks per day is Poisson(2) distributed. During one week 25 breaks were reported. Do you consider this as strong evidence for a claim that the mean break frequency of the process has increased during that week? Base your explanation on probabilities.


    Continuous random variables

Ordinary physicochemical quantities (pressure, temperature, voltage, concentration etc.) can in principle have any real number values, at least within certain limits. In practice, this is of course limited by the accuracy of the measurement device. The infinite number of possible values (i.e. values whose probability is positive) causes theoretical problems in trying to figure out a reasonable pdf. This problem is overcome if we define probabilities only for intervals. Thus, instead of the point values of a discrete pdf, we have a curve where the area under the curve in a given interval is the probability of that interval. Such functions are called probability density functions (pdf). So, the abbreviation pdf can mean either a point density function or a probability density function.

Below is an example of a distribution of a continuous random variable that can have values in the interval [1, 2]. By counting approximately the rectangles under the curve, we get ca. 50 rectangles. The shaded area (the interval [1.3, 1.7]) contains ca. 28 rectangles. Thus the probability of the interval [1.3, 1.7] is approximately 28/50 = 0.56. Remembering that areas under curves can be calculated as integrals, we can get an accurate value

P(1.3 ≤ X ≤ 1.7) = ∫ from 1.3 to 1.7 of f(x) dx,

where f is the function corresponding to the curve.

[Figure: a parabola-shaped density curve on the interval [1, 2], drawn on a grid of rectangles, with the interval [1.3, 1.7] shaded]

    Exercise 10

a) Determine the defining expression for the function f based on the fact that it is a parabola. b) Evaluate the probability of the interval [1.3, 1.7] exactly.


Naturally, the cumulative probability density function (cdf) is the integral up to a given value x

F(x) = P(X ≤ x) = ∫ from −∞ to x of f(t) dt    (7)

Because, in the above example, the probabilities of all values under 1 are zero, the cdf would be F(x) = ∫ from 1 to x of f(t) dt.

    Uniform distribution

    A random variable is said to be uniformly distributed if all values in a giveninterval are equally likely. This leads to the following pdf:

f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 otherwise    (8)

    Exercise 11

The expected value EX (theoretical mean value) of a continuous random variable is defined by the integral

EX = ∫ from −∞ to ∞ of x·f(x) dx    (9)

and the (theoretical) variance of a continuous random variable is defined by the integral

Var X = ∫ from −∞ to ∞ of (x − EX)²·f(x) dx    (10)

    Calculate the expected value and the variance of a random variable that isuniformly distributed on the interval [a, b]. The square root of the theoreticalvariance is called the theoretical standard deviation. It is important that youunderstand the difference between theoretical statistics and sample statistics!We shall discuss this important difference more later on.


    Exponential distribution

In reliability theory, lifetimes of different objects are studied. The simplest lifetime distribution is the exponential distribution, which is well suited, for example, to studying the lifetimes of light bulbs. Its pdf is

f(x) = λ·e^(−λx) for x ≥ 0, and f(x) = 0 for x < 0    (11)

    Exercise 12

Instead of the lifetime of a car battery, we can consider its run-length (expressed in km), i.e. how many kilometres have been run with the battery when it finally stops functioning. Suppose that the run-length (X) is exponentially distributed.

a) What is the probability that the battery will function for at least 100000 km?
b) Suppose that the battery has functioned for 100000 km. What is the probability that it will last another 100000 km (i.e. a total of 200000 km)?

For b) you'll have to know how to calculate conditional probabilities. The general formula is the following: let A and B be any events. Then the conditional probability P(A|B) (read: the probability of A given B) is calculated by

P(A|B) = P(A ∩ B) / P(B)    (12)

where A ∩ B means A and B. The result of b) may be surprising!

Note that the above example shows a very strange property of the exponential distribution: the random variable forgets the past, and the probabilities for an old bulb whose lifetime is known can be calculated as if it were new. What is really amazing is that light bulbs behave very close to this.
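This forgetting property can be checked numerically from Eqs. (11) and (12) (a sketch in Python; the rate λ below is an invented illustrative value):

```python
from math import exp

lam = 1e-5  # hypothetical rate per km

def survival(x):
    # P(X > x) for an exponential lifetime, obtained by integrating Eq. (11)
    return exp(-lam * x)

# Conditional probability via Eq. (12): P(X > 200000 | X > 100000)
cond = survival(200000) / survival(100000)

print(cond)              # probability of lasting another 100000 km
print(survival(100000))  # probability of a new battery lasting 100000 km
# The two numbers agree: the distribution forgets the past.
```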

The lifetime distributions of most other objects are more complicated, and the most common of them is the Weibull distribution. You can easily find a lot of information about common lifetime distributions on the internet.


    Normal distribution

The most important continuous distribution is the normal, or Gaussian, distribution. Its importance is based on a mathematical theorem, the central limit theorem (CLT). The main idea behind the CLT is that the sum of many independent random variables is approximately normally distributed. Of course, certain conditions must be fulfilled, but we don't go into details here. The effect of the CLT can be seen for example by throwing several dice and looking at the distribution of the sum of the results. Already the distribution of the sum of five dice looks very normal.

If we consider measurement errors, we notice that the final total error usually is a sum of many small errors coming from all kinds of different sources. It is also rare that these sub-errors should depend on each other. For these reasons, most measurement errors are roughly normally distributed.
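The dice experiment mentioned above can be sketched in Python (numpy) as a quick simulation; the number of repetitions is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
# 100 000 throws of five dice; study the distribution of the sum
sums = rng.integers(1, 7, size=(100_000, 5)).sum(axis=1)

# One die has mean 3.5 and variance 35/12, so the CLT suggests the sum
# is approximately N(17.5, 5*35/12)
print(sums.mean(), sums.var())
```

Plotting a histogram of `sums` shows the bell shape clearly.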

    The probability density function of the normal distribution is

f(x) = (1/(σ√(2π))) · e^(−(x − μ)²/(2σ²))    (13)

If a random variable X is normally distributed, it is denoted by X ~ N(μ, σ²). It is easy to show (by integration) that for a normally distributed X

EX = μ and D²X = σ².

Unfortunately the normal pdf does not have an antiderivative that could be expressed in terms of ordinary functions. Therefore one has to rely on numerical integration or on tables of the normal distribution. Below is a plot of the N(5, 1) distribution:


[Figure: pdf of the N(5, 1) distribution]

Rules of thumb for normal probabilities

    In many cases it is accurate enough to use approximate normal probabilities

given by the following rules of thumb: If X ~ N(μ, σ²), then

1. P(μ − σ ≤ X ≤ μ + σ) ≈ 68 %
2. P(μ − 2σ ≤ X ≤ μ + 2σ) ≈ 95 %
3. P(μ − 3σ ≤ X ≤ μ + 3σ) ≈ 99,7 %

Note that practically all normally distributed measurements would be in the interval μ ± 3σ. These rules of thumb are used in many tools of statistical quality

    control and in validation of laboratory analytical methods.

    Example

Let X ~ N(10, 1²). Now μ − 2σ is 8 and μ + 2σ is 12, and thus ca. 95 % of the results would on average be in the interval [8, 12]. We could also conclude that ca. 0,15 % of the results would on average be greater than 13, because 0,3 % would be outside the interval [7, 13].

Such reasoning as above is much easier if you sketch by hand a plot of the situation in question, similar to the one below (which depicts the latter interval of the previous example).


[Figure: pdf of the N(10, 1²) distribution; the interval [7, 13] covers 99,7 % of the probability, with 0,15 % in each tail]

    Exercise 13

Suppose that the body length of a male population is N(180, 5²) distributed. Estimate the proportion of males whose length exceeds a) 185, b) 190 and c) 195.

If more accurate probabilities are needed you have to use standard normal tables or some statistical software. In Excel the pdf/cdf of the normal distribution is NORMDIST(x; μ; σ; false/true). The option false gives the pdf and the option true gives the cdf. In R the corresponding functions are dnorm(x, μ, σ) and pnorm(x, μ, σ). The inverse of the cdf in Excel is NORMINV(p; μ; σ) and in R qnorm(p, μ, σ), where p is a given probability. Note that the inverse cdf gives the upper limit x of the interval (−∞, x] such that P(X ≤ x) = p. Again, a sketch of the situation at hand helps in figuring out how to use these functions!

    Examples

Let X ~ N(10, 1²). Then P(X ≤ 12) is in Excel NORMDIST(12;10;1;true), P(X > 12) is in Excel 1-NORMDIST(12;10;1;true) and P(8 < X ≤ 12) is in Excel NORMDIST(12;10;1;true)-NORMDIST(8;10;1;true).

    Figure out yourself the corresponding R formulae.

If we should want to know the interval whose centre is 10 and whose probability is 80 %, we would conclude (by symmetry) that the probability of getting a value below the upper limit of this interval is 90 % and the probability of getting a value below the lower limit of this interval is 10 %. Thus the upper limit is given by


    NORMINV(0,9;10;1) and the lower limit by NORMINV(0,1;10;1).

One should pay attention to the fact that these functions are always related to upper limits of intervals starting from minus infinity.

    Make sketches of the situation in the examples above!
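The same probabilities can also be checked in Python with scipy (a sketch; the course itself uses Excel and R, and the comments below indicate the corresponding Excel/R calls):

```python
from scipy.stats import norm

X = norm(10, 1)  # N(10, 1^2); scipy takes (mean, standard deviation)

p_le_12 = X.cdf(12)            # cf. NORMDIST(12;10;1;true), pnorm(12, 10, 1)
p_gt_12 = 1 - X.cdf(12)        # P(X > 12)
p_8_12 = X.cdf(12) - X.cdf(8)  # P(8 < X <= 12)

upper = X.ppf(0.9)             # cf. NORMINV(0,9;10;1), qnorm(0.9, 10, 1)
lower = X.ppf(0.1)             # cf. NORMINV(0,1;10;1), qnorm(0.1, 10, 1)
print(p_le_12, p_8_12, lower, upper)
```

Like NORMINV and qnorm, the `ppf` method is the inverse cdf, i.e. it is always related to the upper limit of an interval starting from minus infinity.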

    Exercise 14

a) Plot pdfs of the normal distribution with different values for μ and σ. Find out the role of μ and σ in the shape and location of the curve!

b) A concentration measurement is normally N(0,35; 0,03²) distributed. What is, on average, the proportion of measurements falling in the interval [0,30; 0,40]?

c) Consider the measurement in b). What is the interval whose centre is 0,35 that contains 90 % of the measurements (on average)?

If computer programs are not available, one should use normal probability tables. Normal probability tables are available only for the so-called standard normal distribution N(0, 1). To be able to use them for any normal distribution, one has to know the following:

Z = (X − μ)/σ ~ N(0, 1), when X ~ N(μ, σ²)    (14)

    We need this formula in some other applications too.
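A quick numerical check of Eq. 14 (a Python sketch with scipy; the numbers μ = 5, σ = 2, x = 8 are arbitrary illustrative values):

```python
from scipy.stats import norm

mu, sigma, x = 5, 2, 8
lhs = norm(mu, sigma).cdf(x)            # P(X <= x) directly
rhs = norm(0, 1).cdf((x - mu) / sigma)  # via the standard normal distribution
print(lhs, rhs)
```

Both calculations give the same probability, which is exactly what makes standard normal tables usable for any normal distribution.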

    Normal probability plots (Q-Q-plots)

There are many graphical ways to check the normality of a given sample of measurement data. One way is to plot a histogram of the data and to look how much it resembles the Gaussian curve.

For small samples, however, a better way exists: the normal probability plot. The idea in normal probability plots is similar to that of using a logarithmic scale for linearizing exponential behaviour; in this case a sample distribution function is linearized. In Excel, producing normal probability plots is rather tedious. We shall do it in the computer labs and you will get an example as an Excel workbook. In R it is easy: one simply has to type qqnorm(x), where x is a variable containing the data. As an example, 30 normally N(10, 2²) distributed random numbers are generated and then a normal probability plot is drawn:


> x = rnorm(30,10,2)
> qqnorm(x)
> qqline(x)

    These commands will give the plot below

[Figure: Normal Q-Q plot of the sample; theoretical quantiles on the x-axis, sample quantiles on the y-axis]

The points are approximately on a straight line, as they should be.

    Exercise 15

The table below contains 100 measured daily purities of oxygen delivered by a certain supplier. The numbers are the two decimals after 99 %, i.e. the purity can be calculated as 99 + x/100, where x is a number from the table. The data are given in row-wise time order. a) Check by a suitable plot if there are any trends in the data. b) Make a normal probability plot of the data. Explain how the observed distribution deviates from the normal one. c) Make a histogram.

63 61 67 58 55 50 55 56 52 64
73 57 63 81 64 54 57 59 60 68
58 57 67 56 66 60 49 79 60 62
60 49 62 56 69 75 52 56 61 58
66 67 56 55 66 55 69 60 69 70
65 56 73 65 68 59 62 58 62 66
57 60 66 54 64 62 64 64 50 50
72 85 68 58 68 80 60 60 53 49


55 80 64 59 53 73 55 54 60 60
58 50 53 48 78 72 51 60 49 67

    Note that you can copy the data in the pdf-file into Excel or read it into R.

    Expectation and theoretical standard deviation

In exercise 10 we defined the expected value EX (theoretical mean value) of a continuous random variable by the integral

EX = ∫_{−∞}^{∞} x f(x) dx    (9)

    and the (theoretical) variance of a continuous random variable is defined by theintegral

D²X = ∫_{−∞}^{∞} (x − EX)² f(x) dx.    (10)

The theoretical standard deviation is defined as the square root of the variance. For discrete random variables, the integrals are replaced by sums and the density function is replaced by point densities. Remember that for a normal distribution EX = μ and D²X = σ².
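The defining integrals 9 and 10 can be evaluated numerically. A Python sketch with scipy, using the N(5, 1) pdf plotted earlier, confirms that EX = μ and D²X = σ²:

```python
import math
from scipy import integrate

# pdf of the N(5, 1) distribution
def f(x):
    return math.exp(-(x - 5) ** 2 / 2) / math.sqrt(2 * math.pi)

EX, _ = integrate.quad(lambda x: x * f(x), -math.inf, math.inf)       # Eq. 9
VarX, _ = integrate.quad(lambda x: (x - EX) ** 2 * f(x),
                         -math.inf, math.inf)                          # Eq. 10
print(EX, VarX)
```

The numerical integration returns EX ≈ 5 and D²X ≈ 1, as the theory says.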

    Exercise 16

Write down the definitions of EX and D²X for discrete random variables.

    Estimation and estimators

It is important to make a clear distinction between theoretical and sample quantities. Actually the corresponding sample quantities are so-called estimates or estimators for the theoretical ones. For example, the sample mean X̄ is an estimator for μ, if the data is normally distributed. The value of an estimator is called an estimate. Of course, a good estimator is such that it gives on average the correct value for the theoretical one. Another desirable property is that the variance of an estimator should be small. The first property is called unbiasedness and the second one the minimum variance property. There are different principles for constructing good estimators, the most common ones being the method of maximum likelihood and the method of least squares which, under certain assumptions, are closely related. We are not going into


    details of the theory of statistical estimation in this course. However, we shallintroduce later several estimators and it is important to understand what ismeant by an estimator or by an estimate.

    Propagation of errors (properties of expectation and variance)

A linear combination of the random variables X₁, X₂, ..., X_p is an expression of the form Σᵢ aᵢXᵢ, i.e.

Y = a₁X₁ + a₂X₂ + ... + a_pX_p.

The expectation and variance have the following properties with respect to linear combinations:

E(a₁X₁ + a₂X₂ + ... + a_pX_p) = a₁EX₁ + a₂EX₂ + ... + a_pEX_p    (15)

D²(a₁X₁ + a₂X₂ + ... + a_pX_p) = a₁²D²X₁ + a₂²D²X₂ + ... + a_p²D²X_p    (16)

Important! Formula 16 holds only if the random variables are independent of each other!

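Formulae 15 and 16 are easy to verify by simulation. A Python sketch (numpy), with arbitrary illustrative coefficients and distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X1 = rng.normal(2.0, 0.5, n)   # EX1 = 2,  D^2X1 = 0.25
X2 = rng.normal(-1.0, 0.3, n)  # EX2 = -1, D^2X2 = 0.09

Y = 2 * X1 + 3 * X2  # a1 = 2, a2 = 3

# Eq. 15 predicts EY   = 2*2 + 3*(-1)    = 1
# Eq. 16 predicts D^2Y = 4*0.25 + 9*0.09 = 1.81 (X1 and X2 independent)
print(Y.mean(), Y.var())
```

The simulated mean and variance agree with the predictions to within simulation noise; if X₁ and X₂ were correlated, only Eq. 15 would still hold.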

    Exercise 17

a) Write down (below!) a formula for the standard deviation of a linear combination, i.e. for D(a₁X₁ + ... + a_pX_p).

b) The sample mean (average) is a special case of linear combinations. Suppose that p independent random variables Xᵢ have the same expected value μ and the same variance σ². Write down the formulas for the expected value, variance and standard deviation of the mean of such random variables.

c) A difference of two random variables is a special case of linear combinations. Suppose that 2 independent random variables X₁ and X₂ have the same expected value μ and the same variance σ². Write down the formulas for the expected value, variance and standard deviation of the difference of such two random variables.

Any constant can be considered a special random variable whose expected value is the value of the constant itself and whose standard deviation is zero. According to this, we get e.g.

E(aX + b) = aEX + b    (17)

    and

D²(aX + b) = a²D²X.    (18)

If the random variables Xᵢ, i = 1, 2, ..., p are independent, then the expected value of their product is the product of the expected values, i.e. the formula

E(X₁X₂···X_p) = EX₁ · EX₂ ··· EX_p    (19)

holds.

    A special case

If the random variables Xᵢ, i = 1, 2, ..., p are independent and also normally distributed, their linear combinations are also normally distributed, with expected values and standard deviations given by the above formulae.


Propagation of errors (estimation of measurement uncertainty)

The formulae 15-19 are not of much use if we consider nonlinear functions of independent random variables, i.e. expressions like Y = f(X₁, X₂, ..., X_p). Fortunately, for ordinary measurement quantities, the standard errors are quite small with respect to the expected values. Therefore, the variation of Y around the expected values can be approximated by a first order Taylor series expansion:

f(x₁, ..., x_p) ≈ f(μ₁, ..., μ_p) + Σᵢ (∂f/∂xᵢ)(xᵢ − μᵢ)

Now, if we replace the variables by the corresponding random variables, we notice that this is a linear combination plus a constant and, consequently, we can use formulae 15-19:

EY ≈ f(μ₁, μ₂, ..., μ_p)    (20)

D²Y ≈ Σᵢ (∂f/∂xᵢ)² D²Xᵢ    (21)

Note that all derivatives in Eq. 21 are evaluated at the expected values! Note also that in practice, the expected values are replaced by mean values and the (theoretical) standard deviations by sample standard deviations.

It should also be noted that Eq. 21 is valid only if the random variables are independent. If they are correlated the formula is more complicated, and beyond the scope of this introductory course. However, it is easy to learn for anyone with good enough knowledge of matrix algebra.

    A special case

If the random variables Xᵢ, i = 1, 2, ..., p are independent and also normally distributed, and the standard deviations are small enough, the random variable Y is approximately normally distributed with expected value and standard deviation given by formulae 20 and 21.

    Example

Let U be a voltage measurement with expected value 9 V and standard deviation 0,2 V, and let R be a measured resistance with expected value 3 Ω and standard deviation 0,1 Ω. What is the expected value and standard deviation of the current I = U/R? Using Eq. 20 and 21, we get


EI ≈ 9/3 = 3 (A),

and

D²I ≈ (∂I/∂U)²·D²U + (∂I/∂R)²·D²R = (1/R)²·0,2² + (U/R²)²·0,1² = (1/3)²·0,04 + 1²·0,01

and thus

D²I ≈ 0,0144 (A²)

and DI (i.e. the standard deviation of I) is ca. 0,12 A.
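The linearised approximation can be checked by simulation. A Python sketch (numpy) drawing a large sample of (U, R) pairs and computing I directly:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000
U = rng.normal(9.0, 0.2, n)  # voltage (V)
R = rng.normal(3.0, 0.1, n)  # resistance (ohm)

I = U / R
# The linearised formulae 20-21 predict EI ~ 3 A and a standard
# deviation of about 0.12 A
print(I.mean(), I.std())
```

The simulated mean and standard deviation agree closely with the linearised values, because the relative standard deviations of U and R are small.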

    Exercise 18

a) The measured concentration of a sample is 12,5 ± 0,5 mmol/l and the volume of the sample is 0,200 ± 0,003 l. The ±-values are standard measurement uncertainties. Calculate the number of moles in the sample and estimate its standard measurement uncertainty.

b) Consider that you have made replicate measurements of the voltage and the current of a device. Which do you think is a better way to estimate the power (P = UI²): i) first calculate the means of U and I and calculate P using the means, or ii) calculate as many P's as you have replicates, and then calculate the mean of these P's?


    Confidence intervals

A confidence interval is an interval that contains the true value of an unknown parameter with a given (predefined) probability. It is very important to understand that the upper and lower limits of a confidence interval are random variables and thus vary from sample to sample! It is easy to construct a confidence interval for the unknown expected value (μ) of a normal distribution based on a sample of n replicate measurements with a known standard deviation (σ), i.e. Xᵢ ~ N(μ, σ²) for all i. According to the result of exercise 17b, the mean X̄ of the random variables is normally distributed with standard deviation σ/√n. Therefore

P(|X̄ − μ| ≤ x) = P(|Z| ≤ x√n/σ), where Z = (X̄ − μ)/(σ/√n) ~ N(0, 1)  (supply the missing details yourself!)

Now, because the right hand side is a probability of the standard normal distribution, for a predefined probability, say 1 − α, we know that

x√n/σ = z_{1−α/2}  (make a plot to see this!)

where z_{1−α/2} is the inverse value of the cumulative standard normal distribution at the probability 1 − α/2. Finally, solving for x, we get x = z_{1−α/2}·σ/√n and consequently the limits of the confidence interval are

x̄ ± z_{1−α/2}·σ/√n    (22)

The probability 1 − α is called the confidence level. Typical values of α are 0.05, 0.01 and 0.001, the corresponding confidence levels being 0.95, 0.99 and 0.999.

As an example, let us consider the previous example of estimating the standard deviation of a current. Let us also assume that the values 9 V and 3 Ω are mean values of 5 measurements. Now, what would be the 95 % confidence interval for the current I?


Using e.g. Excel we get that z_{0,975} is ca. 1.96 (the rules of thumb would give the value 2!). Therefore the confidence interval would be

3 ± 1.96 · 0,12/√5, i.e. ca. [2.9, 3.1]

Thus the true current should be with probability 95 % in the interval [2.9 A, 3.1 A].
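The same calculation as a Python sketch (scipy), using the standard deviation of ca. 0,12 A implied by the propagation-of-error formulae for I = U/R:

```python
import math
from scipy.stats import norm

mean_I, sigma_I, n = 3.0, 0.12, 5  # sigma_I from the propagation example
z = norm.ppf(1 - 0.05 / 2)         # z_{0.975}, ca. 1.96

half = z * sigma_I / math.sqrt(n)  # half-width of the interval, Eq. 22
print(round(mean_I - half, 1), round(mean_I + half, 1))
```

The half-width is ca. 0,105 A, giving the interval [2.9 A, 3.1 A].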

    Exercise 19

Suppose the students in exercise 18 had made 10 replicate measurements of concentration and volume, the mean values having been the same as the nominal values and the ±-values being known standard deviations. What would then be the 99 % confidence interval for the number of moles?

In practice, the standard deviation σ is seldom known. Instead, we have to use a value that has been estimated from replicate measurements, i.e. the sample standard deviation (S) discussed in the beginning of this course. This causes increased uncertainty in the confidence interval and has to be taken into account. Without going into the proof, we just state that in this case Eq. 22 has to be replaced by the following one

x̄ ± t_{1−α/2, n−1}·S/√n,    (23)

where t_{1−α/2, n−1} is obtained from the Student's t-distribution. The Student's t-distribution has a parameter that is called the degrees of freedom (df). In general, the degrees of freedom is the number of observations needed for an estimate (S in this case) minus the number of mathematical restrictions implied in the estimate (1 in this case: the sum of deviations from the mean in the formula for S has to be zero!). The cdf and inverse cdf of the Student's t-distribution in Excel are TDIST and TINV, but they function differently. In R the corresponding functions are pt(x, df) and qt(p, df).
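The t-based interval of Eq. 23 can be sketched in Python (scipy) with a small hypothetical sample; the five measurement values below are illustrative only:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 5 replicate measurements (illustrative only)
x = np.array([10.1, 9.8, 10.3, 10.0, 9.9])
n = x.size
s = x.std(ddof=1)                     # sample standard deviation S
t = stats.t.ppf(1 - 0.05 / 2, n - 1)  # t_{0.975, 4}, cf. qt(0.975, 4)

half = t * s / np.sqrt(n)             # half-width of the interval, Eq. 23
print(x.mean() - half, x.mean() + half)
```

Note that with only 4 degrees of freedom the t-value (ca. 2.78) is clearly larger than the normal value 1.96, reflecting the extra uncertainty from estimating σ by S.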

    Exercise 20

    a) Find out, using the Excel and R help facilities, how the R functions pt andqt differ from the corresponding Excel functions TDIST and TINV.

    b) Plot the pdf of the Students t-distribution in the same plot with severaldegrees of freedom.c) Find out the functional form of the pdf of the Students t-distribution from

  • 8/12/2019 Environmetrics I

    33/72

    Environmetrics I, 1.3.2013 / VMT 33

    literature or from the internet.

The confidence interval for the variance σ² is calculated by:

[ (n − 1)S²/χ²_{1−α/2, n−1} ,  (n − 1)S²/χ²_{α/2, n−1} ],    (24)

where χ²_{p, n−1} is obtained from the χ²-distribution with n − 1 degrees of freedom. The cdf and inverse cdf of the χ²-distribution in Excel are CHIDIST and CHIINV, but again they function differently. In R the corresponding functions are pchisq(x, df) and qchisq(p, df).
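Eq. 24 can likewise be sketched in Python (scipy); the sample below is the same hypothetical one used above for the mean, purely for illustration:

```python
import numpy as np
from scipy import stats

x = np.array([10.1, 9.8, 10.3, 10.0, 9.9])  # illustrative sample
n, alpha = x.size, 0.05
s2 = x.var(ddof=1)  # sample variance S^2

# Eq. 24: chi2.ppf is the inverse cdf, cf. qchisq(p, df) in R
lo = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, n - 1)
hi = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, n - 1)
print(lo, hi)
```

Note how asymmetric the interval is: the χ²-distribution is skewed, so the interval extends much further above S² than below it.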

    Exercise 21

a) Find out how the R functions pchisq and qchisq differ from the corresponding Excel functions CHIDIST and CHIINV.

b) Plot the pdf of the χ²-distribution in the same plot with several degrees of freedom.

c) Find out the functional form of the pdf of the χ²-distribution from literature or from the internet.

    Exercise 22

The article "Multi-functional Pneumatic Gripper Operating under Constant Input Actuation Air Pressure" by J. Przybyl (Journal of Engineering Technology, 1988) discusses the performance of a 6-digit pneumatic robotic gripper. One part of the article concerns the gripping pressure (measured by manometers) delivered to objects of different shapes for fixed input air pressures. The data given here are the measurements (in psi) reported for an actuation pressure of 40 psi for (respectively) a 1.7 in. × 1.5 in. × 3.5 in. rectangular bar and a circular bar of radius 1.0 in. and length 3.5 in.

a) Make 98 % confidence intervals for the expected gripping pressure for both objects.
b) Compare the confidence intervals and make proper conclusions.
c) Calculate the confidence intervals for the standard deviation of the gripping pressure for both objects.
d) Compare the confidence intervals and make proper conclusions.


Rectangular bar   Circular bar
76                84
82                87
85                94
88                80
82                92

    Prediction intervals

A prediction interval is an interval that contains a future result with a given probability. The idea is that a future observation X_new is predicted by the mean value, i.e. X_new = X̄ + E, where E is the error having the same distribution as the original sample. Thus the standard error of E is σ, which is estimated by S. Using the rules of propagation of error, the estimated standard error of X_new − X̄ is S·√(1 + 1/n), and thus the prediction interval is:

x̄ ± t_{1−α/2, n−1}·S·√(1 + 1/n).    (25)

If the standard deviation σ is known, S can be replaced by the known σ and the t-value by the corresponding z-value.
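A Python sketch of Eq. 25, again with the same hypothetical sample of five measurements used in the confidence interval sketches (illustrative only):

```python
import numpy as np
from scipy import stats

x = np.array([10.1, 9.8, 10.3, 10.0, 9.9])  # illustrative sample
n = x.size
s = x.std(ddof=1)
t = stats.t.ppf(1 - 0.05 / 2, n - 1)

half = t * s * np.sqrt(1 + 1 / n)  # Eq. 25
print(x.mean() - half, x.mean() + half)
```

Comparing the half-width with that of the confidence interval (factor √(1 + 1/n) vs. √(1/n)) shows why a prediction interval is always much wider: it must cover the variation of a single new observation, not just the uncertainty of the mean.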

    Exercise 23

    Calculate 95% prediction intervals for the gripping pressure of both objects ofthe previous exercise.

Statistical tolerance intervals

Statistical tolerance intervals are used in statistical process control (SPC). Their construction is beyond the scope of this course, but it is good to know what they are. A statistical tolerance interval is an interval that has the property that it covers at least p % of the population (usually results of a measurement) with a given probability γ. You can find more information about the topic (and other topics within this course as well) in the link below:


    http://www.itl.nist.gov/div898/handbook/prc/section2/prc263.htm.

    The homepage of this book is

    http://www.itl.nist.gov/div898/handbook/.


    Comparing effects

A very common problem in production or in industrial R&D is the question of deciding whether there is a real difference between two or more sets of results or not. To be more precise, whether there is a statistically significant difference or not. Namely, any real results will differ from each other, and the question is whether the differences can be considered systematic or whether they could just come from the inevitable random variation. Statistical tests give an objective tool for making rational decisions in such cases.

    Statistical tests

The basic idea behind the logic of statistical tests can be described by the following extreme case. Suppose we have a hypothesis that in certain measurements the results are between 9 and 11, and results outside those limits are absolutely impossible if there is nothing wrong with our instrument. Now we make a new measurement and get 12.5. What is the logical conclusion? The result contradicts the assumptions, and consequently the assumptions cannot be true.

In real cases we seldom get results that are absolutely impossible with regard to the assumptions but, of course, we can get extremely improbable results. In such cases it is more rational to suspect the assumptions than to accept that we were extremely lucky (in the sense of getting highly improbable results). Of course we have to set a limit for what is improbable enough. This limit is the level of significance (α), and the most common values for α are 0.05, 0.01 or 0.001. Now, we shall formalize what was said above, and we shall also get another view of the level of significance and how to choose it.

    Any statistical test consists of the following:

    The hypotheses

H₀ (the null hypothesis) and H₁ (the alternative hypothesis or the research hypothesis).

H₁ is something we want to prove and H₀ is something which is rejected if we accept H₁. There is an analogy to trials: H₁ is "guilty" and H₀ is "not guilty". In most cases H₀ states that there is no difference between the things that are compared. In our analogy, the presumption is that the suspect is not guilty.

    The statistical assumptions of the test

In order to be able to calculate probabilities we have to assume something about the measurements. The most common assumptions are that the measurement errors are approximately normally distributed with a constant variance and statistically independent.


    The two types of decision errors

In making conclusions by a statistical test we have two possibilities to make an error (and two possibilities to be correct):

Error of type I means rejecting H₀ when it actually is true, and error of type II means not rejecting H₀ when it actually is false.

The probability of a type I error is called a p-value, and a predefined upper limit for it is the level of significance α. The p-value can be interpreted as the "improbability" of the results, and consequently a p-value below α means that the result is improbable enough ("impossible enough") for rejecting H₀.

The probability of a type II error is denoted by β. Unfortunately, α and β are not independent of each other: β depends on α and in addition also on the true difference between the things to be compared (often denoted by Δ). The quantity 1 − β is called the power of a test, and it is the probability of rejecting H₀ when it actually is not true. Naturally, we would like to have as powerful a test as possible. Indeed, the most common tests, described below, are the most powerful ones under the common assumptions of normality and independence.

The test statistic

The test statistic is a quantity, calculated from a sample of measurement results, whose distribution is known if H₀ is true. For this reason, the null hypothesis must always include an equality; the test statistic probabilities are calculated assuming the equality, and thus the parameters of the distribution are known. If the null hypothesis also includes an inequality, i.e. it is of the type ≤ or ≥, the logic is that we need to test only the case that is closest to the alternative hypothesis, i.e. the case of equality.

In order to be able to calculate the power (or the probability of a type II error), one has to know the distribution of the test statistic if H₁ is true. However, if H₁ is true, we also have to know the true difference. Therefore, the power can only be calculated for assumed differences.

    Making the conclusion (decision)

After calculating the value of a test statistic, we can calculate the probability of getting such a value, or an even more extreme value. This probability is called the p-value of the test.

The logic is that if we rejected H₀ by the given value of the test statistic, we would reject it, of course, also by more improbable values as well. If the p-value is below α, the null hypothesis is rejected and otherwise not.

Note that all statistical tests have the structure described above. Therefore, if you learn how to conduct any single test, you actually know how to conduct all statistical tests. The only new things in a new test are: the null hypothesis, how to calculate the test statistic, and its distribution under H₀.


    The most common simple statistical tests

    The two-sample t-test

This test is suitable whenever the expected values of two samples (A and B) are compared with each other. To guarantee the independence assumption, the experiments should be made in random order!

H₀: μ_A = μ_B    (27)

H₁: μ_A ≠ μ_B (two-sided alternative) or    (28a)

H₁: μ_A < μ_B (one-sided alternative) or    (28b)

H₁: μ_A > μ_B (one-sided alternative)    (28c)

Important: The form of the alternative hypothesis, i.e. whether it is one- or two-sided, must be decided before any experiments are made. Also the level of significance should be decided beforehand. If a test that was decided to be two-sided would not reject H₀, but the corresponding one-sided test would, the logical way of acting is to make new experiments to confirm the suspicion of a difference. Also, it should be noted that in one-sided tests the actual null hypothesis is μ_A ≤ μ_B (or μ_A ≥ μ_B). However, the p-value is always calculated under the null hypothesis μ_A = μ_B. The logic is that if H₀ would be rejected at the equality with the alternative μ_A > μ_B, then naturally it would be rejected under any smaller value of μ_A − μ_B, because the discrepancy with the null hypothesis would be greater.

    The statistical assumptions of this test are the ordinary ones: the measurementerrors are approximately normally distributed with a constant variance andstatistically independent.

    The test statistic (t) is calculated by the following formula:

t = (x̄_A − x̄_B) / ( s_p·√(1/n_A + 1/n_B) ),    (29)

where s_p is the so-called pooled standard deviation and it is calculated by:

s_p = √( ((n_A − 1)s_A² + (n_B − 1)s_B²) / (n_A + n_B − 2) ).    (30)


A random variable T corresponding to the test statistic is Student's t-distributed with n_A + n_B − 2 degrees of freedom, denoted by T ~ t(n_A + n_B − 2). The p-value can be calculated by calculating the probability P(|T| > |t|) for the two-sided hypothesis (28a), or P(T < t) or P(T > t) for the one-sided hypotheses (28b and 28c).
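Formulae 29 and 30 can be sketched in Python (scipy) using the enzyme yield data of the example further below, and compared against scipy's built-in equal-variance t-test:

```python
import numpy as np
from scipy import stats

# Enzyme yield data of the supplier example (A and B)
A = np.array([8.51, 8.31, 8.35, 8.83, 8.63, 8.45, 8.24, 8.61, 8.14, 8.78])
B = np.array([8.43, 8.34, 9.11, 8.77, 8.44, 8.91, 8.51, 9.31, 8.78, 8.43])

nA, nB = A.size, B.size
sp = np.sqrt(((nA - 1) * A.var(ddof=1) + (nB - 1) * B.var(ddof=1))
             / (nA + nB - 2))                                # Eq. 30
t = (A.mean() - B.mean()) / (sp * np.sqrt(1 / nA + 1 / nB))  # Eq. 29
p_two_sided = 2 * stats.t.sf(abs(t), nA + nB - 2)

# The built-in equal-variance test gives the same numbers
res = stats.ttest_ind(A, B, equal_var=True)
print(t, p_two_sided)
```

The hand-calculated t and p-value match `res.statistic` and `res.pvalue` exactly.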

In Excel you can calculate p-values using the function TDIST(t; df; tails). Give tails the value 1 for a one-sided test and the value 2 for a two-sided test. Note that TDIST calculates probabilities for T or |T| being greater than t, unlike e.g. NORMDIST. In R you can use pt(t, df). The inverse function of TDIST is TINV(p; df) and the inverse function of pt is qt(p, df).

If we assume a true difference Δ, the test statistic obeys the non-central t-distribution with the non-centrality parameter

λ = Δ / ( σ·√(1/n_A + 1/n_B) ).    (31)

In practice, σ has to be replaced by its estimate s_p. If you have a tool to calculate probabilities of the non-central t-distribution, it is quite easy to calculate the power for a given α and Δ by calculating the probability

P(T′ < −t_{1−α/2}) + P(T′ > t_{1−α/2}),    (32)

where T′ obeys the non-central t-distribution with the non-centrality parameter given by Eq. 31. For a one-sided test, use only the appropriate one of the inequalities and t_{1−α} instead of t_{1−α/2}! Actually, in most cases you could calculate only the probability P(T′ > t_{1−α/2}), because the other probability is practically zero. Excel does not have a function for exact power calculations and approximate methods have to be used (described below). In R there is a nice function for carrying out power calculations of ordinary t-tests. The function is

power.t.test(n = N, delta = Δ, sd = σ, sig.level = α, power = 1 − β, type = type, alt = alternative),

where type is either "two.sample", "one.sample" or "paired" (the meaning of these terms will become apparent later) and alternative is either "two.sided" or "one.sided". The default values for sig.level, type and alt are 0.05, "two.sample" and "two.sided". The function is used in such a way that any one of the first 5 arguments can be omitted and the function will calculate the omitted one, for example power.t.test(power = .90, delta = 1, alt = "one.sided").


    Example

Suppose that you are producing something by an enzymatic reaction using an enzyme produced by supplier A. Another supplier B claims that their more expensive enzyme is more effective, giving better yields. Before changing the enzyme you decide to make experiments with both enzymes (it is important that you conduct the experiments in a random order with respect to the enzyme type!). Now, suppose that you have got the following results put in Excel.

We obviously have a two-sample test and we will use a one-sided alternative because of the other supplier's claim. We could of course calculate the mean values and standard deviations and then put these into the formulas above. However, it is much easier to use the tools provided in Excel. Just click Tools, then click Data Analysis... and you will get the following alternatives

    and choose the highlighted one. Now you get the following window


Variable ranges mean the cell ranges of the two samples. The hypothesized mean difference means the difference in expected values according to the null hypothesis (usually 0, which is also the default value). If you include titles in the ranges for the samples, you have to mark the box Labels. You can also give the desired level of significance in the box Alpha (0,05 is the default value). In the Output Range you can give a reference to a cell under which the results of the test will be written. After filling in, the window will look like

    The output will look like


We notice that the variances are quite different, and yet we chose a test assuming equal variances. Was that correct? It was, if the difference in the variances is not significant; we shall learn later how to check it. We also note that the p-value is not less than 0,05, and therefore the null hypothesis is not rejected and one should not believe the other supplier's claim at this level of significance. However, the p-value is quite near 0,05, and maybe we just did not conduct enough experiments, taking the large random variability into account. The difference in means is slightly over 2 (percentage units). Now we can check approximately how many experiments would be needed, assuming that the true difference is 2. For the calculation we need an estimate for the standard deviation. The best estimate in this case is given by the square root of the so-called pooled variance given in the Excel output, which is ca. 2,84. Let us also have α = 0,05 and power 1 − β = 0,95. Now, z_α (one-sided test!) is given by NORMINV(1-0,05;0;1) or NORMSINV(1-0,05) and it is ca. 1,645; z_β happens to be the same in this example. (In R, qnorm(1-0.05).) Substituting these into the equation n = 2(z_α + z_β)²/(δ/σ)² gives us the Excel formula =2*(1,645+1,645)^2/(2/2,84)^2, giving ca. 43,65, i.e. at least 44 experiments per supplier. The R function power.t.test would give 44,34, i.e. at least 45 experiments per supplier. This example shows how difficult it is to detect, with high reliability, true differences that are small with respect to the standard error. It is possible only by conducting a large number of experiments.


In R, the test could have been carried out easily using the t.test function. The previous example could have been solved by

> A = c(851, 831, 835, 883, 863, 845, 824, 861, 814, 878) / 100
> B = c(843, 834, 911, 877, 844, 891, 851, 931, 878, 843) / 100
> t.test(A, B)

which gives the following output:

Welch Two Sample t-test

data: A and B
t = -1.7192, df = 16.065, p-value = 0.1048
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.48671686  0.05071686
sample estimates:
mean of x mean of y
    8.485     8.703

Note the small differences from the Excel output. The reason is that R actually carries out the test without assuming the variances to be equal (variances are usually considered equal if their difference is not statistically significant). This so-called Welch t-test is available also in Excel; vice versa, if you add the argument var.equal=TRUE, you'll get the results of the classical t-test, i.e. the same results as in Excel. The results of these two variants of the test are very close to each other, unless the variances differ substantially.
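For comparison, both variants of the test can be reproduced in Python with scipy (a sketch added for illustration; equal_var=False corresponds to R's default Welch test, equal_var=True to the classical test):

```python
# Two-sample t-tests on the supplier data: Welch (unequal variances)
# and the classical pooled-variance version.
from scipy import stats

A = [x / 100 for x in (851, 831, 835, 883, 863, 845, 824, 861, 814, 878)]
B = [x / 100 for x in (843, 834, 911, 877, 844, 891, 851, 931, 878, 843)]

welch = stats.ttest_ind(A, B, equal_var=False)   # Welch t-test
pooled = stats.ttest_ind(A, B, equal_var=True)   # classical t-test
print(round(welch.statistic, 4), round(welch.pvalue, 4))   # -1.7192 0.1048
```

With equal group sizes the two t-statistics coincide; only the degrees of freedom, and hence the p-values, differ slightly.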

    The one-sample t-test

This test is suitable whenever the expected value of a single sample is compared to a nominal (or reference or standard) value (μ₀). To guarantee the independence assumption, the experiments should be made in random order!

H₀: μ = μ₀ (34)

H₁: μ ≠ μ₀ (two-sided alternative) or (35a)

H₁: μ > μ₀ (one-sided alternative) or (35b)
H₁: μ < μ₀ (one-sided alternative) (35c)

Again, it should be noted that in one-sided tests the null hypothesis should be changed into μ ≤ μ₀ in 35b, and into μ ≥ μ₀ in 35c. However, as already pointed out, the test statistic probabilities (p-values) are calculated assuming the equality (34).

    The statistical assumptions of this test are the same as in the two-sample test.

    The test statistic (t) is calculated by the following formula:


t = (x̄ − μ₀) / (s/√n), (36)

A random variable T corresponding to the test statistic is Student's t-distributed with n − 1 degrees of freedom, denoted by T ~ t(n − 1). The p-value is calculated exactly as in the two-sample test and the decision logic is similar as well. The changes in the power calculations are quite obvious (the pooled standard deviation s_p is replaced by s, and the degrees of freedom n_A + n_B − 2 are replaced by n − 1), and in the approximation Eq. 33 the multiplier 2 is simply omitted.

This is due to the fact that the standard deviation of a difference of two random variables is √2 times larger than the standard deviation of a single random variable and, in the present case, the other part of the difference is a constant. In R, you simply choose type = "one.sample".

    There is no tool for this test in Excel, so you either have to do the calculationsyourself or use R function t.test. With t.test you can make both one-sample andtwo-sample tests. The next example shows how to use it in the one-sampletest.

    Example

    A nutrient producer produces a nutrient whose nitrogen concentration ispromised to be 30%. A client analyses 5 samples and gets the results

    31,04 31,03 30,87 29,97 30,7 (%)

    Should the client make a reclamation that the nitrogen concentration is not30%?

Now the null hypothesis is that the nitrogen concentration is 30% (μ = 30) and the alternative hypothesis is that it is not 30% (μ ≠ 30). The calculations in Excel are shown below.


The p-value tells us that the null hypothesis has to be rejected at the significance level α = 0,05. On the other hand, the mean value tells us that the true nitrogen concentration is higher than 30% and, consequently, normally there should be no need for reclamation, assuming that a higher concentration is better.

    In R, the test would be performed as follows:

> x = c(31.04, 31.03, 30.87, 29.97, 30.7)
> t.test(x, mu = 30)

    One Sample t-test

data: x
t = 3.6469, df = 4, p-value = 0.02183
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
 30.17233 31.27167

    sample estimates:mean of x

    30.722

Note that R also calculates the confidence interval for the concentration, which likewise shows that, with 95% confidence, the true concentration is over 30%. Of course, we could have calculated the confidence interval in Excel as well. Actually, correctly interpreted, confidence intervals carry the same information as statistical tests.
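The same test and interval can be reproduced in Python (a scipy-based sketch added for illustration; the confidence interval is computed by hand from the t-quantile):

```python
# One-sample t-test of H0: mu = 30 on the nitrogen data, plus the 95%
# confidence interval computed from the t-distribution quantile.
import math
from scipy import stats

x = [31.04, 31.03, 30.87, 29.97, 30.7]
res = stats.ttest_1samp(x, popmean=30)
print(round(res.statistic, 4), round(res.pvalue, 5))   # 3.6469 0.02183

n = len(x)
mean = sum(x) / n
s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
half = stats.t.ppf(0.975, n - 1) * s / math.sqrt(n)    # half-width of the CI
print(round(mean - half, 5), round(mean + half, 5))    # 30.17233 31.27167
```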


    The two-sample F-test

This test is suitable whenever the variances of two samples (A and B) are compared with each other. A very typical case is one where two analytical methods are compared in order to find out whether there is a difference in the measurement uncertainties of the two methods. Remember that measurement uncertainties are measured by standard deviations and that variances are squares of standard deviations.

    The statistical assumptions are the same as in t-tests.

    The hypotheses are:

H₀: σ_A² = σ_B² (37)

H₁: σ_A² ≠ σ_B² (two-sided alternative) or (38a)

H₁: σ_A² > σ_B² (one-sided alternative) or (38b)

H₁: σ_A² < σ_B² (one-sided alternative) (38c)

    The test statistic (f) is calculated by the following formula:

f = s_A² / s_B², (39)

Note that the remark given after formulae 34-35c holds again, and also for all other statistical tests.

A random variable F corresponding to the test statistic is Fisher's F-distributed with n_A − 1 and n_B − 1 degrees of freedom, denoted by F ~ F(n_A − 1, n_B − 1). Because of the asymmetry of the F-distribution, it is not quite obvious how to calculate two-sided p-values in an F-test. The easiest, and the most commonly used, way is to calculate the probability P(F > f), if f is greater than one, and the probability P(F < f), if f is smaller than one, and then multiply the probability obtained by two.

In Excel you can calculate p-values using the function FDIST(f; n_A − 1; n_B − 1). Note that FDIST calculates probabilities for F being greater than f, unlike e.g. NORMDIST or the corresponding R functions. In R, the function is pf(f, n_A − 1, n_B − 1). The inverse function of FDIST is FINV(p; df1; df2) and the inverse function of pf is qf(p, df1, df2).
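The doubling rule is easy to script; a Python sketch (added for illustration) applying it to the two supplier samples from the t-test example above:

```python
# Two-sided two-sample F-test: double the tail probability on the side
# the variance ratio falls on.
from scipy import stats

A = [x / 100 for x in (851, 831, 835, 883, 863, 845, 824, 861, 814, 878)]
B = [x / 100 for x in (843, 834, 911, 877, 844, 891, 851, 931, 878, 843)]

def sample_var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / (len(v) - 1)

f = sample_var(A) / sample_var(B)
dfA, dfB = len(A) - 1, len(B) - 1
p = 2 * (stats.f.sf(f, dfA, dfB) if f > 1 else stats.f.cdf(f, dfA, dfB))
print(f < 1, p > 0.05)   # here f < 1 and the difference is not significant
```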


    The power calculations are based on the non-central F-distribution.

    Example

Two analytical methods are compared, the Kjeldahl method and a spectroscopic (IR) method. The laboratory wants to find out whether the two methods have equal repeatability standard deviations or not. This can be done by the two-sample F-test. There is a tool in Excel's Data Analysis tools, but in this case the calculations are so easy that the test is faster to perform by doing them oneself, as shown below. The null hypothesis is that the standard deviations are equal, which is converted into the equivalent hypothesis that the variances are equal. The alternative is that they are not equal, i.e. it is a two-sided alternative.

The p-value tells the laboratory that the standard deviations of these two methods are not significantly different.

Exercise 25

Naturally, the laboratory is also interested in whether there is a significant systematic difference in the expected values. Conduct a suitable test to test this hypothesis. Note that the null hypothesis must be that there is no systematic difference.


    The one-sample F-test

This test is suitable whenever the variance of a single sample is compared to a nominal (or reference or standard) value (σ₀²). The hypotheses are:

H₀: σ² = σ₀² (37)

H₁: σ² ≠ σ₀² (two-sided alternative) or (38a)

H₁: σ² > σ₀² (one-sided alternative) or (38b)

H₁: σ² < σ₀² (one-sided alternative) (38c)

Remember that the null hypothesis changes in one-sided tests.

The test statistic (χ²) is calculated by the following formula:

χ² = (n − 1) s² / σ₀², (39)

A random variable X² (this is a capital chi) corresponding to the test statistic is χ²-distributed with n − 1 degrees of freedom, denoted by X² ~ χ²(n − 1). Calculating the p-value in a two-sided test poses similar difficulties as in the two-sample F-test. The most common way is to calculate the probability P(X² > χ²), if s²/σ₀² is greater than one, and the probability P(X² < χ²), if s²/σ₀² is smaller than one, and then multiply the probability obtained by two.

In Excel you can calculate p-values using the function CHIDIST(χ²; n − 1). Note that CHIDIST calculates probabilities for X² being greater than χ², unlike e.g. NORMDIST. In R, the function is pchisq(χ², n − 1). The inverse function of CHIDIST is CHIINV(p; df) and the inverse function of pchisq is qchisq(p, df).
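Since the calculation is short, the whole test is easy to script. A hypothetical helper in Python (an illustration, not from the text; note that it caps the doubled tail probability at 1):

```python
# One-sample chi-square test for a variance: H0: sigma^2 = sigma0_2,
# given a sample variance s2 from n observations.
from scipy import stats

def chi2_var_test(s2, sigma0_2, n):
    chi2 = (n - 1) * s2 / sigma0_2                 # test statistic
    upper = stats.chi2.sf(chi2, n - 1)             # P(X^2 > chi2)
    lower = stats.chi2.cdf(chi2, n - 1)            # P(X^2 < chi2)
    return chi2, min(1.0, 2 * min(upper, lower))   # two-sided p-value

# a sample variance equal to the nominal one gives chi2 = n - 1
print(chi2_var_test(4.0, 4.0, 10))
```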

The power calculations are based on the non-central χ²-distribution.

    Example

The manufacturer of a diagnostic test claims that the relative repeatability standard deviation of the results is 5%. A client wants to test the claim by making 10 replicate analyses and then performing a χ²-test for the variance. There isn't a ready-made tool in Excel for this test, and you have to do the calculations yourself, as shown below. Note that in this example you first have to calculate the reference standard deviation, knowing that it should be 5% of the mean value.


The p-value is below 0,05 and we have to conclude that the true standard deviation is significantly higher than the claimed one.

    Exercise 26

Silicon wafers are produced batch-wise in lots of 1000 wafers. An improvement in the process is suggested, but it would require an investment of 100 000 €. Therefore the new, modified process is tested in pilot scale and compared to the old one. Twenty batches, ten with the old and ten with the new process, are made in random order. The numbers below are differences from the nominal thickness. Would you suggest that the investment is worth making? State all assumptions needed for the appropriate statistical tests, and base your suggestion on the results of the tests.

    Old: -8.6 4.8 11.4 -4.7 10.2 9.0 -10.8 12.7 3.5 -12.3New: -0.9 6.0 -2.1 6.7 -0.1 5.9 -0.9 2.4 -4.6 1.0

You should first check by a two-sample F-test whether the difference in standard deviations is significant or not. If it is not, you can use a two-sample t-test for testing the difference in the expected values.


    SPC and control charts

Statistical process control (SPC) was developed during the Second World War, mostly in the US, where it was largely forgotten in industrial processes after the war. However, the ideas of SPC spread to Japan, where the methodology was widely accepted by most industries, mainly due to the works of Deming, Ishikawa, Juran and Taguchi. The success story of Japanese cars woke up American industry, and the methodology was reinvented in the US. Today SPC is only a part of more general quality philosophies, e.g. Six Sigma or Taguchi's philosophy. Yet SPC has a fundamental role in creating good quality.

The simplest SPC tools are the so-called control charts. Their basic idea is that processes should be adjusted only if one can be almost sure that the observed changes in the process are not due to the normal, uncontrollable (random) variation in the process. There are three kinds of control charts: those which are aimed at detecting abrupt large changes, those which are designed to detect small but systematic changes, and those which are designed to detect changes in variability.

In order to make a control chart, process data are needed. These data cannot be just any data; they should represent the process under normal control, i.e. a period (or periods) when everything is going all right. This is called the construction data.

The Shewhart (x̄-) control chart

This control chart is primarily designed for detection of abrupt large changes, though with some additional decision rules it can be, and is, used for other purposes as well. In most cases the Shewhart control chart does not follow individual process measurements; rather, it uses means of consecutive measurements. It should be noticed that these means are not moving averages, but means of distinct groups of constant group size (n). When the construction data have been gathered, the group means (x̄₁, …, x̄_N) and the group standard deviations (s₁, …, s_N) are calculated (N is the number of groups). Then a pooled standard deviation is calculated as the root mean square of the group standard deviations:

s_p = √((s₁² + s₂² + … + s_N²) / N). (40)

After this the upper and lower control limits (UCL and LCL) are calculated as follows:


UCL = x̿ + 3 s_p/√n (41a)

LCL = x̿ − 3 s_p/√n (41b)

where x̿ stands for the mean of the group means (the grand mean) of the construction data. Sometimes a nominal value or a target value is used instead.

It should be noted that the old-fashioned way of calculating the control limits is based on the ranges of the samples instead of their standard deviations, and it is still widely used. The range of a sample is the maximum value minus the minimum value. The reason for using ranges is that they are simpler to calculate and, at the time control charts were introduced, they were the only method feasible for hand calculations.
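As a sketch, the limit calculation of Eqs. 40 and 41a/b can be written out in Python on hypothetical construction data (the numbers and the group size are made up; the usual three-sigma limits are assumed):

```python
# Pooled standard deviation (Eq. 40) and control limits (Eqs. 41a/b)
# from N groups of constant size n.
import math

groups = [
    [9.8, 10.1, 10.0, 9.9],
    [10.2, 9.9, 10.1, 10.0],
    [9.7, 10.0, 10.2, 9.9],
]
n = len(groups[0])

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

grand_mean = mean([mean(g) for g in groups])
s_p = math.sqrt(mean([sample_sd(g) ** 2 for g in groups]))  # Eq. 40
UCL = grand_mean + 3 * s_p / math.sqrt(n)                   # Eq. 41a
LCL = grand_mean - 3 * s_p / math.sqrt(n)                   # Eq. 41b
print(round(LCL, 3), round(grand_mean, 3), round(UCL, 3))
```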

Exercise 27 (in computer lab)

The data consist of yields of a process, and the goal is to have a steady yield. Your task is to construct a Shewhart control chart for the data.

Sometimes also the limits x̿ ± 2 s_p/√n are plotted on the chart. These are called the upper and lower warning limits, and some additional decision rules can be applied to them. The decision rules are often called runs rules (see an example at http://iew3.technion.ac.il/sqconline/runplot.html ).

    EWMA control charts

Exponentially weighted moving averages (EWMA) are widely used, especially in the process industry. They are suitable for the same purposes as Shewhart control charts but, in addition, they are also used for ordinary process control. The idea is simple: the original process values (x_t), taken at equally spaced time steps (t = 1, 2, …), are replaced by the values

z_t = λ·x_t + (1 − λ)·z_{t−1}. (42)

The value for λ is chosen in such a way that the mean square error between the z and x values is minimized in the construction data, which is assumed to be under control. The initial value (z₀) in EWMA plays an important role in computing all the subsequent EWMAs (z_t's). Setting z₀ to the first observation (x₁) is one method of initialization. Another way is to set it to the target of the process.

    Yet another possibility would be to use the average of the first four or five

    observations. You can find more details e.g. in the link

    http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc431.htm.
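The recursion of Eq. 42 is only a few lines of code; a Python sketch (the data, λ and the initialization are made up for illustration):

```python
# EWMA recursion z_t = lam*x_t + (1 - lam)*z_{t-1}, started from z0.
def ewma(x, lam, z0):
    z = [z0]
    for xt in x:
        z.append(lam * xt + (1 - lam) * z[-1])
    return z[1:]

data = [10.0, 10.4, 9.8, 10.1, 10.6]
print(ewma(data, lam=0.3, z0=data[0]))
```

With lam = 1 the chart reduces to the original values; small lam values smooth heavily and react slowly.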

CUSUM control charts

In some applications it is important to be able to detect small systematic changes in the process, which is not possible by using Shewhart control charts. For such problems, the CUSUM (cumulative sum) control chart is the appropriate tool.

    There is a good discussion about EWMA and CUSUM control charts in Box,Hunter & Hunter: Statistics for Experimenters.


    Regression analysis

    Simple regression and calibration

A very common problem in engineering and laboratory analyses is to describe experimental data by a mathematical model (an equation or a set of equations). In general, this is called regression analysis. There are two different situations that also typically lead to different techniques: regression related to empirical or to mechanistic models. A mechanistic model is a model whose equations can be derived from chemical, physical or biological theories; e.g. the concentration of a chemical A in a first-order reaction obeys the equation c_A = c_{A0}·e^(−kt). An empirical model is one in which the equations describing the dependency are not known, and the regression is based on simple functions that are flexible enough to fit the data; typically such functions are polynomials. In reality, most models are more or less in-between purely empirical and purely mechanistic models.

Regression models can be classified also on other bases. The one that has most practical importance is whether the model is linear with respect to the unknown parameters of the model. For example, the model c_A = c_{A0}·e^(−kt) is not linear with respect to the parameter k. But if we take logarithms on both sides we get ln c_A = ln c_{A0} − kt, and by renaming y = ln c_A and a = ln c_{A0} we get a linear model y = a − kt. Many models can be made linear by such re-parameterisation, but one should be aware of the fact that the re-parameterised model may give less reliable estimates for the unknown parameters than the original model.

The by far most common principle of estimating the unknowns is the method of least squares, developed already by Gauss. The idea is simple: suppose that each observation is modelled by

y_i = f(x_i; β) + ε_i, (43)

where ε_i represents the random error of the ith measurement, x_i represents the value of the independent variable (or a vector of values of the independent variables) of the ith measurement, and β is a vector containing the unknown parameters, e.g. β = [a b]. Now the least squares estimate of β, often denoted by β̂, is the value that minimizes the sum Σᵢ (y_i − f(x_i; β))². The differences e_i = y_i − f(x_i; β̂) are called the residuals, and they are the differences between measured and modelled, i.e. fitted, values. Naturally, small residuals mean a good fit.

    Another useful view is to consider regression as solving an overdeterminedsystem of equations.


f(x₁; β) = y₁
f(x₂; β) = y₂
⋮
f(x_n; β) = y_n (44)

If such a system is linear, the regression is called linear regression.

The difference between linear and non-linear regression models is that in linear models the minimum is obtained by a simple formula, whereas in non-linear models iterative methods have to be used. However, using present mathematical software, e.g. Excel's Solver, even iterative minimization is relatively easy.

If the system (44) is linear, the estimation problem is really simple, because solving linear systems is simple. You simply have to be able to build the coefficient matrix, and then solve the system using (computer) matrix algebra. However, for the statistical analysis, many more calculations are needed, and consequently in practice you will have to learn some dedicated regression tool, and to interpret its output as well. The aforementioned system-of-equations view is useful in applying many of the regression tools, because they require the user to first build the coefficient matrix, though usually without the constant column of ones corresponding to the intercept of the model.

    Example

Consider the Monod equation μ = μ_max·S/(K_S + S), and the following data obtained from a wastewater treatment plant:

S [mg/l]: 81  162  244  366  460
μ [1/h]:  0.0083  0.0151  0.0191  0.0216  0.023

If the data are substituted into the equation, we obtain the following system of nonlinear equations:

0.0083 = μ_max·81/(K_S + 81)
0.0151 = μ_max·162/(K_S + 162)
0.0191 = μ_max·244/(K_S + 244)
0.0216 = μ_max·366/(K_S + 366)
0.0230 = μ_max·460/(K_S + 460)


Solving this nonlinear system requires iterative tools, which are available in Excel, Matlab or R. However, we shall study how it can be converted into a linear system. Let us manipulate the equation μ = μ_max·S/(K_S + S), first by multiplying both sides with the denominator: μ·(K_S + S) = μ_max·S. Then let us rearrange it so that it corresponds to a standard form of a linear system. First we have to have known values on the rhs. For that, let us divide both sides by μ_max: (μ/μ_max)·(K_S + S) = S. Let us group the known (measured) values as coefficients of the unknown values: (K_S/μ_max)·μ + (1/μ_max)·μS = S. Now, let us reparameterise the equation by defining

    meterise the equation by defining

    = b1, = b2, S= y, = x1, and S = x2.

    This gives . Note that the unknowns are the bs, not the xs. A

    system of such equations would make up a linear system in standard form.However, well make a slight change to the standard form, because in regres-sion analysis, the lhs and the rhs of the equations are typically interchanged,

    and also unknowns are typically represented as the coefficients of the mode,i.e. . It is important to notice, that coefficient matrix of the

    corresponding linear system is made up of xs, not of bs. Notice also that thisequation doesnt have an intercept. You could easily build the the coefficientmatrix of the corresponding linear system in e.g. Octave or FreeMat by thefollowing commands:

>> S = [81 162 244 366 460]';
>> mu = [0.0083 0.0151 0.0191 0.0216 0.023]';
>> CM = [mu mu.*S]
CM =

    0.0083    0.6723
    0.0151    2.4462
    0.0191    4.6604
    0.0216    7.9056
    0.0230   10.5800

You could also solve the system equally easily, if you remember that the \ operator gives the least-squares solution of an overdetermined system:

>> rhs = S;
>> b = CM\rhs
b =

   5792.2
     30.543

Of course, after obtaining the b's, we have to solve for μ_max and K_S from the equations K_S/μ_max = b₁ and 1/μ_max = b₂:

>> mu_max = 1/b(2), K_S = b(1)/b(2)
mu_max = 0.032741
K_S = 189.64
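For readers working in Python, the same least-squares fit can be cross-checked with numpy (a sketch added for illustration, not part of the original example):

```python
# Least-squares solution of the linearized Monod model S = b1*mu + b2*mu*S.
import numpy as np

S = np.array([81, 162, 244, 366, 460], dtype=float)
mu = np.array([0.0083, 0.0151, 0.0191, 0.0216, 0.023])

CM = np.column_stack([mu, mu * S])          # coefficient matrix [x1 x2]
b, *_ = np.linalg.lstsq(CM, S, rcond=None)  # least-squares solution

mu_max = 1 / b[1]
K_S = b[0] / b[1]
print(round(mu_max, 6), round(K_S, 2))      # ca. 0.032741 and 189.64
```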

Whenever possible, you should visualize the goodness of the fit, e.g.

figure(1), plot(S, mu, 'o', x, y, 'LineWidth', 2)  % x and y: points of the fitted curve
xlabel('S (mg/L COD)')
ylabel('mu')

The source from which this example was taken¹ used a different linearization (the so-called Lineweaver-Burk plot) and obtained the figure below (compare the fits!).

¹ http://www.dnr.state.wi.us/org/water/wm/ww/biophos/4biol.htm

In Excel the solution would be equally easy: you would just type in the columns of the coefficient matrix and the rhs, and then use the regression tool. However, in Excel, if the model doesn't contain an intercept, you have to tick the box named Constant is Zero.

After obtaining the regression coefficients, it is important to assess the reliability of the model by statistical analyses. Many of the analysis methods are based on the residuals of the model. In general, the ith residual e_i is defined as the difference between the measured and the calculated (fitted) value, i.e. by the expression e_i = y_i − ŷ_i. Naturally, calculating the residuals in Octave or FreeMat (or Matlab) would be easy (note that we have residuals both for the linearized model and for the original nonlinear model):

>> e_lin = S - CM*b
e_lin =

   12.3908
   -0.1764
   -8.9736
   -0.5725
    3.6340

>> e_nlin = mu - mu_max*S./(K_S+S)
e_nlin =

   1.0e-003 *

  -1.4990
   0.0164
   0.6775
   0.0337
  -0.1831

Note also that standard deviations (uncertainties) for the parameters are

